Lydia: Knowledge Extraction from Curated Text

1
Lydia: Knowledge Extraction from Curated Text
  • Steven Skiena
  • Dept. of Computer Science
  • SUNY Stony Brook
  • http://www.cs.sunysb.edu/skiena

2
Opportunities in Text Analysis
  • The increasing volume of online information
    coupled with decreasing costs of communications
    and computation creates exciting new
    opportunities in text mining.
  • Our Lydia system can analyze all of the 1000
    online English-language newspapers daily on a
    single commodity computer!
  • Our ultimate goal is to build a relational
    encyclopedia of much of the world's knowledge
    through analysis of news media, reference texts,
    and primary sources.

3
Garbage In, Garbage Out
  • Knowledge extraction becomes easier when you
    start with reliable sources:
  • Online newspapers (both domestic and foreign)
  • Online reference sources, e.g. Wikipedia
  • Scientific abstracts, e.g. Medline/Pubmed
  • Financial reports, e.g. Edgar
  • Court decisions and other legal documents.
  • DMOZ-approved websites

4
System Architecture
  • Spidering: text is retrieved from a given site
    on a daily basis using semi-custom spidering
    agents.
  • Normalization: clean text is extracted with
    semi-custom parsers and formatted for our
    pipeline.
  • Text Markup: annotates parts of the source text
    for storage and analysis.
  • Back Office Operations: we aggregate entity
    frequency and relational data for a variety of
    statistical analyses.

5
Text Markup
  • We apply natural language processing (NLP)
    techniques to annotate interesting features of
    the document.
  • Full parsing techniques are far too slow to keep
    up with our volume of text, so we employ shallow
    parsing instead.
  • We can currently mark up approximately 2000
    newspapers per day.
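  • Analysis phases include sentence and paragraph
    identification, part-of-speech tagging, proper
    noun extraction, date and number extraction,
    actor classification, rewrite rules, alias
    expansion, and geography normalization.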

6
Input
Dr. Judith Rodin, the former president of the
University of Pennsylvania, will become president
of the Rockefeller Foundation next year, the
foundation announced yesterday in New York. She
will take over in March 2005, succeeding Gordon
Conway, the foundation's first non-American
president. Mr. Conway announced last year that he
would retire at 66 in December and return to
Britain, where his children and grandchildren
live.
7
Sentence and Paragraph Identification
<p> Dr. Judith Rodin, the former president of the
University of Pennsylvania, will become president
of the Rockefeller Foundation next year, the
foundation announced yesterday in New
York. </p> <p> She will take over in March 2005,
succeeding Gordon Conway, the foundation's first
non-American president. Mr. Conway announced last
year that he would retire at 66 in December and
return to Britain, where his children and
grandchildren live. </p>
8
Part Of Speech Tagging
<p> Dr./NNP Judith/NNP Rodin/NNP ,/, the/DT
former/JJ president/NN of/IN the/DT
University/NNP of/IN Pennsylvania/NNP ,/, will/MD
become/VB president/NN of/IN the/DT
Rockefeller/NNP Foundation/NN next/JJ year/NN ,/,
the/DT foundation/NN announced/VBD yesterday/RB
in/IN New/NNP York/NNP ./. </p> <p> She/PRP
will/MD take/VB over/IN in/IN March/NNP 2005/CD
,/, succeeding/VBG Gordon/NNP Conway/NNP ,/,
the/DT foundation/NN 's/POS first/JJ
non-American/JJ president/NN ./. Mr./NNP
Conway/NNP announced/VBD last/JJ year/NN that/IN
he/PRP would/MD retire/VB at/IN 66/CD in/IN
December/NNP and/CC return/NN to/TO Britain/NNP
,/, where/WRB his/PRP children/NNS and/CC
grandchildren/NNS live/VBP ./. </p>
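The tagged text above uses a word/TAG convention. A minimal Python sketch of reading that format back into (word, tag) pairs follows. The function name parse_tagged is hypothetical, and this is not presented as Lydia's own code.

```python
def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs, skipping markup tokens."""
    pairs = []
    for token in text.split():
        # Skip anything that is not a word/TAG token, such as <p> or </p>.
        if "/" not in token or token.startswith("<"):
            continue
        # Split on the last slash so that ",/," becomes (",", ",").
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "<p> Dr./NNP Judith/NNP Rodin/NNP ,/, the/DT former/JJ president/NN"
print(parse_tagged(sample))
```

Splitting on the last slash rather than the first keeps punctuation tokens and any word that itself contains a slash intact.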
9
Proper Noun Extraction
<p> <pn> Dr./NNP Judith/NNP Rodin/NNP </pn> ,/,
the/DT former/JJ president/NN of/IN the/DT <pn>
University/NNP </pn> of/IN <pn> Pennsylvania/NNP
</pn> ,/, will/MD become/VB president/NN of/IN
the/DT <pn> Rockefeller/NNP </pn> Foundation/NN
next/JJ year/NN ,/, the/DT foundation/NN
announced/VBD yesterday/RB in/IN <pn> New/NNP
York/NNP </pn> ./. </p> <p> She/PRP will/MD
take/VB over/IN in/IN March/NNP 2005/CD ,/,
succeeding/VBG <pn> Gordon/NNP Conway/NNP </pn>
,/, the/DT foundation/NN 's/POS first/JJ
non-American/JJ president/NN ./. <pn> Mr./NNP
Conway/NNP </pn> announced/VBD last/JJ year/NN
that/IN he/PRP would/MD retire/VB at/IN 66/CD
in/IN December/NNP and/CC return/NN to/TO <pn>
Britain/NNP </pn> ,/, where/WRB his/PRP
children/NNS and/CC grandchildren/NNS live/VBP
./. </p>
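One plausible way to obtain groupings like those above is to collect maximal runs of consecutive NNP tokens. The sketch below assumes that approach and also assumes month names are left out, since March and December are not marked as proper nouns on this slide. The names MONTHS and proper_noun_chunks are hypothetical.

```python
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def proper_noun_chunks(pairs):
    """Group consecutive NNP tokens into proper-noun phrases."""
    chunks, current = [], []
    for word, tag in pairs:
        if tag == "NNP" and word not in MONTHS:
            current.append(word)
        else:
            # Any other token closes the run collected so far.
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


pairs = [("Dr.", "NNP"), ("Judith", "NNP"), ("Rodin", "NNP"), (",", ","),
         ("the", "DT"), ("University", "NNP"), ("of", "IN"),
         ("Pennsylvania", "NNP"), ("March", "NNP"), ("2005", "CD")]
print(proper_noun_chunks(pairs))
```

As on the slide, "University" and "Pennsylvania" come out as separate phrases at this stage because "of" is tagged IN.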
10
Date and Number Extraction
<p> <pn> Dr./NNP Judith/NNP Rodin/NNP </pn> ,/,
the/DT former/JJ president/NN of/IN the/DT <pn>
University/NNP </pn> of/IN <pn> Pennsylvania/NNP
</pn> ,/, will/MD become/VB president/NN of/IN
the/DT <pn> Rockefeller/NNP </pn> Foundation/NN
next/JJ year/NN ,/, the/DT foundation/NN
announced/VBD yesterday/RB in/IN <pn> New/NNP
York/NNP </pn> ./. </p> <p> She/PRP will/MD
take/VB over/IN in/IN <embedded_date> March/NNP
2005/CD </embedded_date> ,/, succeeding/VBG <pn>
Gordon/NNP Conway/NNP </pn> ,/, the/DT
foundation/NN 's/POS <num type="ORDINAL">
first/JJ </num> non-American/JJ president/NN
./. <pn> Mr./NNP Conway/NNP </pn> announced/VBD
last/JJ year/NN that/IN he/PRP would/MD retire/VB
at/IN <num type="CARDINAL"> 66/CD </num> in/IN
<embedded_date> December/NNP </embedded_date>
and/CC return/NN to/TO <pn> Britain/NNP </pn> ,/,
where/WRB his/PRP children/NNS and/CC
grandchildren/NNS live/VBP ./. </p>
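A hedged sketch of how dates and numbers like those above might be recognized, using regular expressions on untagged text. The pattern, the word list, and the function names are assumptions made for illustration. A real system would need more date formats and would have to separate the month "May" from the modal verb.

```python
import re

MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
# A month name, optionally followed by a four-digit year.
DATE_RE = re.compile(rf"\b{MONTH}(?:\s+\d{{4}})?\b")
ORDINAL_WORDS = {"first", "second", "third", "fourth", "fifth"}


def find_dates(text):
    """Return month and month-year expressions in order of appearance."""
    return [m.group(0) for m in DATE_RE.finditer(text)]


def classify_number(token):
    """Label a token as ORDINAL or CARDINAL, or return None."""
    if token.lower() in ORDINAL_WORDS or re.fullmatch(r"\d+(?:st|nd|rd|th)", token):
        return "ORDINAL"
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "CARDINAL"
    return None


print(find_dates("She will take over in March 2005, succeeding Gordon Conway."))
print(classify_number("first"), classify_number("66"))
```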
11
Actor Classification
<p> <pn category="PERSON"> Dr./NNP Judith/NNP
Rodin/NNP </pn> ,/, the/DT former/JJ president/NN
of/IN the/DT <pn category="UNKNOWN">
University/NNP </pn> of/IN <pn category="STATE">
Pennsylvania/NNP </pn> ,/, will/MD
become/VB president/NN of/IN the/DT
<pn category="UNKNOWN"> Rockefeller/NNP </pn>
Foundation/NN next/JJ year/NN ,/, the/DT
foundation/NN announced/VBD yesterday/RB in/IN
<pn category="CITY"> New/NNP York/NNP </pn> ./.
</p> <p> She/PRP will/MD take/VB over/IN in/IN
<embedded_date> March/NNP 2005/CD
</embedded_date> ,/, succeeding/VBG
<pn category="PERSON"> Gordon/NNP Conway/NNP
</pn> ,/, the/DT foundation/NN 's/POS
<num type="ORDINAL"> first/JJ </num>
non-American/JJ president/NN ./.
<pn category="PERSON"> Mr./NNP Conway/NNP </pn>
announced/VBD last/JJ year/NN that/IN he/PRP
would/MD retire/VB at/IN <num type="CARDINAL">
66/CD </num> in/IN <embedded_date> December/NNP
</embedded_date> and/CC return/NN to/TO
<pn category="COUNTRY"> Britain/NNP </pn> ,/,
where/WRB his/PRP children/NNS and/CC
grandchildren/NNS live/VBP ./. </p>
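The slides do not say how categories are assigned. One simple possibility, sketched here purely as an assumption, is a gazetteer lookup with a fallback on titles and known first names. GAZETTEER, TITLES, FIRST_NAMES, and classify are hypothetical, and the tables hold only the entries needed for this example.

```python
# Hypothetical lookup tables; a production system would load far larger lists.
GAZETTEER = {"Pennsylvania": "STATE", "New York": "CITY", "Britain": "COUNTRY"}
TITLES = {"Dr.", "Mr.", "Mrs.", "Ms."}
FIRST_NAMES = {"Judith", "Gordon"}


def classify(name):
    """Return a coarse category for a proper-noun phrase."""
    if name in GAZETTEER:
        return GAZETTEER[name]
    tokens = name.split()
    # A leading title or known first name suggests a person.
    if tokens and (tokens[0] in TITLES or tokens[0] in FIRST_NAMES):
        return "PERSON"
    return "UNKNOWN"


for phrase in ["Dr. Judith Rodin", "University", "Pennsylvania", "New York"]:
    print(phrase, "->", classify(phrase))
```

Phrases that no table covers, such as "University" or "Rockefeller", fall through to UNKNOWN, which matches the labels shown on this slide.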
12
Rewrite Rules
<p> <appellation> Dr. </appellation>
<pn category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in
<pn category="CITY"> New York </pn> . </p> <p> She will take
over in <embedded_date> March 2005
</embedded_date> , succeeding
<pn category="PERSON"> Gordon Conway </pn> , the foundation 's
<num type="ORDINAL"> first </num> non-American
president . <appellation> Mr. </appellation>
<pn category="PERSON"> Conway </pn> announced last
year that he would retire at
<num type="CARDINAL"> 66 </num> in <embedded_date> December
</embedded_date> and return to
<pn category="COUNTRY"> Britain </pn> , where his children and
grandchildren live . </p>
13
Alias Expansion
<p> <appellation> Dr. </appellation>
<pn category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in
<pn category="CITY"> New York </pn> . </p> <p> She will take
over in <embedded_date> March 2005
</embedded_date> , succeeding
<pn category="PERSON"> Gordon Conway </pn> , the foundation 's
<num type="ORDINAL"> first </num> non-American
president . <appellation> Mr. </appellation>
<pn category="PERSON"> Gordon Conway </pn>
announced last year that he would retire at
<num type="CARDINAL"> 66 </num> in <embedded_date>
December </embedded_date> and return to
<pn category="COUNTRY"> Britain </pn> , where his
children and grandchildren live . </p>
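In the markup above, the later mention "Conway" has become "Gordon Conway". A minimal sketch of one way to do this replaces a short name with a longer earlier mention that ends in the same tokens. The function expand_aliases is hypothetical, and the rule is an assumption. It would need extra care when two people in one article share a surname.

```python
def expand_aliases(mentions):
    """Expand each mention to the longest earlier mention that ends with it."""
    seen, expanded = [], []
    for name in mentions:
        name_tokens = name.split()
        full = name
        for earlier in seen:
            earlier_tokens = earlier.split()
            # Accept an earlier mention only if it is longer than the
            # current best and its trailing tokens equal the short name.
            if (len(earlier_tokens) > len(full.split())
                    and earlier_tokens[-len(name_tokens):] == name_tokens):
                full = earlier
        expanded.append(full)
        seen.append(full)
    return expanded


print(expand_aliases(["Judith Rodin", "Gordon Conway", "Conway"]))
```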
14
Geography Normalization
<p> <appellation> Dr. </appellation>
<pn category="PERSON"> Judith Rodin </pn> , the former
president of the <pn category="UNIVERSITY">
University of Pennsylvania </pn> , will become
president of the <pn category="UNKNOWN">
Rockefeller Foundation </pn> next year , the
foundation announced yesterday in
<pn category="CITY, STATE, COUNTRY"> New York City, New York,
USA </pn> . </p> <p> She will take over in
<embedded_date> March 2005 </embedded_date> ,
succeeding <pn category="PERSON"> Gordon Conway
</pn> , the foundation 's <num type="ORDINAL">
first </num> non-American president
. <appellation> Mr. </appellation>
<pn category="PERSON"> Gordon Conway </pn> announced last
year that he would retire at
<num type="CARDINAL"> 66 </num> in <embedded_date> December
</embedded_date> and return to
<pn category="COUNTRY"> Britain </pn> , where his children and
grandchildren live . </p>
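Geography normalization rewrites "New York" above as a full city, state, country triple. The sketch below assumes a plain lookup table. GEO_TABLE and format_place are hypothetical names, and the table holds only the one entry taken from the slide. Resolving ambiguous names (New York the city versus the state) would need context that this sketch ignores.

```python
# Hypothetical table mapping a surface form to (city, state, country).
GEO_TABLE = {
    "New York": ("New York City", "New York", "USA"),
}


def format_place(name):
    """Return the normalized 'City, State, Country' form, or the name unchanged."""
    entry = GEO_TABLE.get(name)
    return ", ".join(entry) if entry else name


print(format_place("New York"))
print(format_place("Britain"))
```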
15
Back Office Operations
  • The most interesting analysis occurs after
    markup, using our MySQL database of all
    occurrences of interesting entities.
  • Each day's worth of analysis yields about 10
    million occurrences of about 1 million different
    entities, so efficiency matters.
  • Linkage of each occurrence to source and time
    facilitates a variety of interesting analyses.
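As a rough illustration of linking each occurrence to source and time, the sketch below stores occurrences in a table and counts an entity's mentions per day. It uses Python's built-in sqlite3 module as a stand-in for the MySQL database mentioned above. The schema, the index, and the sample rows are all assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE occurrence (
    entity TEXT NOT NULL,   -- normalized entity name
    source TEXT NOT NULL,   -- newspaper or site identifier
    day    TEXT NOT NULL    -- ISO date of the article
);
CREATE INDEX idx_entity_day ON occurrence (entity, day);
""")

# Made-up sample rows, for illustration only.
rows = [
    ("Gordon Conway", "paper-a", "2004-11-01"),
    ("Gordon Conway", "paper-b", "2004-11-01"),
    ("Judith Rodin", "paper-a", "2004-11-01"),
    ("Gordon Conway", "paper-a", "2004-11-02"),
]
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)", rows)


def daily_counts(connection, entity):
    """Return (day, count) rows for one entity, ordered by day."""
    cursor = connection.execute(
        "SELECT day, COUNT(*) FROM occurrence "
        "WHERE entity = ? GROUP BY day ORDER BY day",
        (entity,),
    )
    return cursor.fetchall()


print(daily_counts(conn, "Gordon Conway"))
```

At millions of rows per day, an index on the columns used for filtering and grouping is the kind of choice that the efficiency remark above points toward.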

16
Duplicate Article Elimination
Supreme Court Justice David Souter suffered minor
injuries when a group of young men assaulted him
as he jogged on a city street, a court
spokeswoman and Metropolitan Police said
Saturday.
Supreme Court Justice David Souter
suffered minor injuries when a group of young men
assaulted him as he jogged on a city street, a
court spokeswoman and Metropolitan Police
said.
Hashing techniques can efficiently
identify duplicate and near-duplicate articles
appearing in different news sources.
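The slide does not say which hashing technique is used. One common choice, sketched here as an assumption, hashes overlapping word windows ("shingles") and compares the resulting sets with Jaccard similarity. The names shingle_hashes and similarity and the window size k=5 are hypothetical.

```python
import hashlib
import re


def shingle_hashes(text, k=5):
    """Hash every window of k consecutive words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    windows = {" ".join(words[i:i + k])
               for i in range(max(1, len(words) - k + 1))}
    return {hashlib.sha256(w.encode("utf-8")).hexdigest() for w in windows}


def similarity(a, b, k=5):
    """Jaccard similarity of the two texts' shingle-hash sets."""
    ha, hb = shingle_hashes(a, k), shingle_hashes(b, k)
    return len(ha & hb) / len(ha | hb)


first = ("Supreme Court Justice David Souter suffered minor injuries when a "
         "group of young men assaulted him as he jogged on a city street, a "
         "court spokeswoman and Metropolitan Police said Saturday.")
second = first.replace(" Saturday.", ".")
print(round(similarity(first, second), 3))
```

The two versions of the Souter story differ only in the final word, so nearly all shingles match and the score is close to 1. Comparing every pair of articles this way would be slow at scale, so in practice the hashes would be indexed or sampled.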
17
Synonym Sets
  • JFK, John Kennedy, John F. Kennedy, and John
    Fitzgerald Kennedy all refer to the same person.
  • We need a mechanism to link multiple entities
    that have slightly different names but refer to
    the same thing.
  • We say that two actors belong in the same synonym
    set if:
      • Their names are morphologically compatible.
      • The sets of entities that they are related to
        are similar.
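The two conditions above can be sketched as follows. Both tests are assumptions, since the slide does not define them: here "morphologically compatible" means the surnames agree and one name's initials form a subsequence of the other's, and "similar" means the Jaccard similarity of the related-entity sets reaches a threshold. All function names, the 0.5 threshold, and the sample related-entity sets are hypothetical.

```python
def _tokens(name):
    return name.replace(".", "").split()


def initials(name):
    """Return the initials of a name; an acronym like 'JFK' is kept as is."""
    tokens = _tokens(name)
    if len(tokens) == 1 and tokens[0].isupper():
        return tokens[0]
    return "".join(t[0].upper() for t in tokens)


def surname(name):
    """Return the lowercased last token, or None for an acronym."""
    tokens = _tokens(name)
    if len(tokens) == 1 and tokens[0].isupper():
        return None
    return tokens[-1].lower()


def is_subsequence(short, long):
    it = iter(long)
    return all(ch in it for ch in short)


def morphologically_compatible(a, b):
    sa, sb = surname(a), surname(b)
    if sa and sb and sa != sb:
        return False
    short, long = sorted((initials(a), initials(b)), key=len)
    return is_subsequence(short, long)


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def same_synonym_set(a, b, related, threshold=0.5):
    """Apply both criteria: compatible names and similar related entities."""
    return (morphologically_compatible(a, b)
            and jaccard(related[a], related[b]) >= threshold)


# Made-up related-entity sets, for illustration only.
related = {
    "JFK": {"Dallas", "White House", "Jacqueline Kennedy"},
    "John F. Kennedy": {"Dallas", "White House", "Jacqueline Kennedy", "Cuba"},
    "John Kennedy": {"Toronto"},
}
print(same_synonym_set("JFK", "John F. Kennedy", related))
print(same_synonym_set("John Kennedy", "John F. Kennedy", related))
```

The second call shows why both criteria matter: the names are compatible, but the related entities do not overlap, so the two are kept apart.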

18
Juxtapositions
19
Applications
  • Knowledge-based (instead of document-based)
    search engines
  • Market research: geographic/temporal analysis
  • Legal document (deposition/filing) analysis
  • Financial modeling and analysis
  • Medical/Scientific applications
  • Law enforcement/Homeland Security
  • We look forward to industrial collaboration.

26
The Lydia Team
  • Levon Lloyd: systems architecture
  • Alex Kim: production
  • Dimitris Kechagias: spidering
  • Andrew Mehler: rule-based processing
  • Michael Papile: geographic normalization
  • Izzet Zorlu: user interface
  • Dwayne Mason: software tools
  • http://www.cs.sunysb.edu/lydia