Title: Information Retrieval from Complex Unstructured Data Sources
1Information Retrieval from Complex Unstructured
Data Sources
- David Eichmann
- Director, School of Library and Information
Science
- (and Dept. of Computer Science)
- david-eichmann_at_uiowa.edu
2The Basic Problem
- Data living in a database are structured,
controlled, retrievable
- Text is complicated, redundant, ambiguous
- The vast majority of information lives in text,
rather than databases
3Extraction from a PDF File
- GRAMMARS HAVE EXCEPTIONS
- Valter Crescenzi 1 and Giansalvatore Mecca 2
- 1 Dipartimento di Informatica e Automazione
Università di Roma Tre Via della Vasca Navale,
84 - 00146 Roma tel 39 06 5517 3219, fax 39
06 557 3030 crescenz_at_dia.uniroma3.it
- 2 D.I.F.A. Università della Basilicata via
della Tecnica, 3 85100 Potenza, Italy tel 39
0971 474 638, fax 39 0971 56537
mecca_at_dia.uniroma3.it
http://www.difa.unibas.it/users/gmecca
- Abstract Extending database-like techniques to
semi-structured and Web data sources is becoming
a prominent research field. These data sources
are essentially collections of textual documents.
4PDF Example, cont.
- References
- 11 S. Abiteboul, D. Quass, J. McHugh, J. Widom,
and J. Wiener. The Lorel query language for
semistructured data. Journal of Digital
Libraries, 1(1):68-88 (1997).
- 12 A. V. Aho, R. Sethi, and J. D. Ullman.
Compilers: Principles, Techniques and Tools.
Addison Wesley Publ. Co., Reading, Massachussetts
(1985).
- 13 G. O. Arocena and A. O. Mendelzon. WebOQL:
Restructuring documents, databases and Webs. In
Fourteenth IEEE International Conference on Data
Engineering (ICDE'98), Orlando, Florida (1998).
5Semi-Structured Data
- Research in the area of semi-structured data views
the preceding text as not truly structured (in
the database sense), but not completely
unstructured, either.
- There are repeating patterns
- Macro level: scholarly documents have titles,
author(s), introduction, ..., references
- Micro level: references have (e.g.) author(s),
titles, journal, volume, issue, page(s), date.
6Semi-Structured Data
- Retrieval in this domain has prerequisites
- Format recognition, translation (e.g., Word, PDF,
etc.)
- Heuristic assessment of document class
- For a given class of document, heuristic
retrieval of conjectured structure
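The format recognition step can be sketched as a check of a file's leading bytes against known signatures. This is a minimal illustration, not the system described in these slides: the function name `sniff_format`, the table `MAGIC_SIGNATURES`, and the returned labels are assumptions made for the example.

```python
# Leading-byte signatures for a few common document formats.
# The labels on the right are hypothetical names chosen for this sketch.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "postscript"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Word/Office container
    (b"PK\x03\x04", "zip"),                          # e.g., .docx packages
]


def sniff_format(head: bytes) -> str:
    """Return a format label based on the first bytes of a file."""
    for signature, label in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return label
    return "unknown"


if __name__ == "__main__":
    print(sniff_format(b"%PDF-1.4\n%..."))
    print(sniff_format(b"plain text with no signature"))
```

A recognized format would then select the translator used to turn the file into text before document-class heuristics are applied.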
7PDF Extraction Example (Journal Paper)
- Abiteboul, S. Quass, D., McHugh, J. Widom,
J. Wiener, J.
- The Lorel query language for semistructured data
- Journal of Digital Libraries.
- vol,issue: 1, 1
- pages: 68, 88
- location:
- 1997
8PDF Extraction Examples (Book)
- Aho, A. V. Sethi, R. Ullman, J. D.
- Compilers: Principles, Techniques and Tools
- Addison Wesley Publ. Co., Reading,
Massachussetts.
- vol,issue: ,
- pages: -1, -1
- location:
- 1985
9PDF Extraction Example (Conference Paper)
- Arocena, G. O. Mendelzon, A. O.
- WebOQL: Restructuring documents, databases and
Webs
- Fourteenth IEEE International Conference on Data
Engineering (ICDE'98), Orlando, Florida.
- vol,issue: ,
- pages: -1, -1
- location:
- 1998
10Citation Recognition Issues
- Correct classification of these citations
involves a number of heuristics (e.g.)
- If there is a volume number and issue number,
assume a journal paper
- If there is a location or a date involving a
specific day or range of days (e.g., Sept.
23-25), assume a conference paper
- Note that for these examples, we're missing the
dates and flubbed the location for the conference
paper
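The two heuristics above can be sketched as a small rule cascade. This is an illustration under stated assumptions: the `Citation` record, the `classify` function, the day-matching pattern, and the "other" fallback are all hypothetical and are not taken from the system described here.

```python
import re
from dataclasses import dataclass

# Matches a specific day or day range after a month name,
# e.g. "Sept. 23-25" or "June 4". A bare year such as "1998" does not match.
DAY_PATTERN = re.compile(r"\b[A-Z][a-z]{2,8}\.?\s+\d{1,2}(\s*-\s*\d{1,2})?\b")


@dataclass
class Citation:
    """Hypothetical record of fields extracted from one reference."""
    volume: str = ""
    issue: str = ""
    location: str = ""
    date: str = ""


def classify(c: Citation) -> str:
    # Rule 1: volume number and issue number together suggest a journal paper.
    if c.volume and c.issue:
        return "journal paper"
    # Rule 2: a location, or a date naming specific days, suggests a conference.
    if c.location or DAY_PATTERN.search(c.date):
        return "conference paper"
    # Fallback label is an assumption; the slides do not name one.
    return "other"


if __name__ == "__main__":
    print(classify(Citation(volume="1", issue="1", date="1997")))
    print(classify(Citation(date="Sept. 23-25, 1998")))
    print(classify(Citation(date="1985")))
```

Because the rules run in order, a record with both a volume/issue pair and a location is treated as a journal paper; a different ordering is an equally plausible design.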
11Even So. . .
- Scanning through a small set of PDFs retrieved
for a search on semistructured data yields
- Arocena, G. O.
- WebOQL: Restructuring documents, databases and
Webs
- Abiteboul, S.
- Querying documents in object databases
- Querying and updating the file
- The Lorel query language for semistructured data
- Mendelzon, A. O.
- WebOQL: Restructuring documents, databases and
Webs
- Querying the World Wide Web
- Querying the World Wide Web
- Formal models of Web queries
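A listing like the one above can be produced by grouping extracted citations by author. The sketch below assumes each citation has already been reduced to a list of author names and a title; `index_by_author` is a hypothetical name, and no de-duplication is attempted, so a title extracted from two PDFs appears twice.

```python
from collections import defaultdict


def index_by_author(citations):
    """Map each author to the titles of the citations naming that author."""
    index = defaultdict(list)
    for authors, title in citations:
        for author in authors:
            index[author].append(title)
    return dict(index)


if __name__ == "__main__":
    extracted = [
        (["Arocena, G. O.", "Mendelzon, A. O."],
         "WebOQL: Restructuring documents, databases and Webs"),
        (["Abiteboul, S."],
         "The Lorel query language for semistructured data"),
        (["Mendelzon, A. O."], "Querying the World Wide Web"),
    ]
    for author, titles in index_by_author(extracted).items():
        print(author, titles)
```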
12A Functioning Example
- This technology forms the core of CiteSeer, a
research project and search engine operated by
NEC's New Jersey research lab
- http://citeseer.nj.nec.com
- PostScript and PDF content is discovered with a
standard Web crawler
- Coverage is currently predominantly Computer
Science, primarily because that's what's out
there
- Performance is largely on a par with
Science Citations
13A PubMed Example
- Document 89316080 - Multiple and repetitive uses
of the extended hamstring V-Y myocutaneous flap.
- An extended hamstring V-Y myocutaneous
advancement flap is described that may be used to
cover unusually large defects in the ischial
region. Technical points that allow a large
amount of flap advancement are discussed. Because
of its large size, the flap can be raised and
used on repeated occasions to repair defects from
recurrent ischial pressure sores. Two patients
are presented in whom the same flap was used
repeatedly on multiple occasions, demonstrating
the potential for preservation of future options
in such patients when this flap is used.
14A PubMed Example, cont.
- Classification systems such as MeSH provide for
organization of such data, but populating the
data is human-intensive
15MeSH Terms for PubMed Ex.
- Case Report
- Decubitus Ulcer/SU
- C17.800.893.289
- Human
- Male
- Methods
- E05.581
- H01.770.370
- Middle Age
- M01.060.116.630
16MeSH Terms, cont.
- Reoperation
- E04.690
- Surgical Flaps/
- A10.850.710
- E07.862.710
- Thigh
- A01.378.592.867
17Extracting Classification Terms
- For this type of data, the additional challenge
is to recognize and extract, from the abstract or
full paper, text that could serve as automatically
generated classification terms
- Phrase recognition and matching against the
classification hierarchy (MeSH)
- From the MeSH terms themselves, or
- From noun phrases generated with a part-of-speech
tagger
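Matching phrases against the classification hierarchy can be sketched as a longest-sub-phrase lookup in a term table. The tree numbers below are copied from the MeSH listing earlier in the deck; the synonym table `ENTRY_TERMS`, the function `match_phrases`, and the matching strategy itself are assumptions made for illustration.

```python
from collections import Counter

# Tiny excerpt of the hierarchy: heading -> tree numbers (from the slides).
MESH = {
    "Surgical Flaps": ["A10.850.710", "E07.862.710"],
    "Decubitus Ulcer": ["C17.800.893.289"],
    "Thigh": ["A01.378.592.867"],
}

# Lower-cased surface forms mapped to headings. These synonyms are
# assumptions for the sketch, not an official entry-term list.
ENTRY_TERMS = {
    "surgical flaps": "Surgical Flaps",
    "flap": "Surgical Flaps",
    "decubitus ulcer": "Decubitus Ulcer",
    "pressure sores": "Decubitus Ulcer",
    "thigh": "Thigh",
}


def match_phrases(phrases):
    """Count, per heading, the phrases containing one of its surface forms."""
    counts = Counter()
    for phrase in phrases:
        words = phrase.lower().split()
        found = None
        # Try every contiguous sub-phrase, longest first.
        for size in range(len(words), 0, -1):
            for start in range(len(words) - size + 1):
                candidate = " ".join(words[start:start + size])
                if candidate in ENTRY_TERMS:
                    found = ENTRY_TERMS[candidate]
                    break
            if found:
                break
        if found:
            counts[found] += 1
    return counts


if __name__ == "__main__":
    hits = match_phrases(["hamstring V-Y myocutaneous flap",
                          "ischial pressure sores", "large size"])
    for heading, n in hits.items():
        print(heading, n, MESH[heading])
```

Phrases with no match, such as "large size", are simply left over as candidate free-text terms.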
18Noun Phrases from the Example
- Multiplerepetitive, uses
- hamstring, V-Y, myocutaneous, flap
- hamstring, V-Y, myocutaneous, advancement, flap
- large, defects
- ischial, region
- Technical, points
- large, amount
- flap, advancement
- large, size
- . . .
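Noun phrases of this kind can be approximated by grouping runs of adjectives and nouns in part-of-speech-tagged text. The sketch below assumes Penn Treebank style tags that were assigned by hand for the example; in practice a tagger would supply them. The function `noun_phrases` is hypothetical and is not the chunker used to produce the list above.

```python
def noun_phrases(tagged):
    """Group maximal runs of adjectives/nouns that end in a noun.

    `tagged` is a sequence of (word, tag) pairs with Penn Treebank style
    tags (JJ* for adjectives, NN* for nouns).
    """
    phrases, run = [], []
    # A sentinel token flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag.startswith("JJ") or tag.startswith("NN"):
            run.append((word, tag))
            continue
        # Trim trailing adjectives so every phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append([w for w, _ in run])
        run = []
    return phrases


if __name__ == "__main__":
    sentence = [("cover", "VB"), ("unusually", "RB"), ("large", "JJ"),
                ("defects", "NNS"), ("in", "IN"), ("the", "DT"),
                ("ischial", "JJ"), ("region", "NN")]
    print(noun_phrases(sentence))
```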
19Extraction Results
- MeSH Terms (occurrences in parentheses, direct
matches with human classifier in green)
- Surgical Flaps (6)
- A10.850.710
- E07.862.710
- Decubitus Ulcer (1)
- C17.800.893.289
- Patients (2)
- M01.643
- Forecasting (1)
- I01.320
20Extraction Results, cont.
- Other Phrases
- flap advancement (1)
- future options (1)
- hamstring V-Y myocutaneous advancement flap (1)
- hamstring V-Y myocutaneous flap (1)
- ischial pressure sores (1)
- ischial region (1)
- repetitive uses (1)
21Broadening the Scope
- So far, we've limited ourselves to extraction
from sources with a fair amount of predictable
structure, or at least a very specialized
vocabulary
- Expanding our extraction capabilities to more
general categories of information requires
additional tools. . .
22Named Entity Extraction
- Virtually all text and speech is rich with
references to entities
- Some categories of entities involve reasonably
unique naming of the members of the category
- Dave Eichmann
- School of Library and Information Science
- The University of Iowa
- Iowa City, Iowa (actually two entities)
23Named Entity Extraction work at the University of
Iowa
- We have five categories currently being
recognized
- Persons
- Organizations
- Locations
- Events (preliminary)
- MeSH (medical terminology)
- Plus generic noun phrases (e.g., health care)
24Named Entity Recognition
- All categories are driven through examination of
noun phrases recognized by a part-of-speech
tagger (with special handling of certain glue
words: and, of, the, etc.)
- Named entity vectors are maintained separately
from the regular word vector, weighted by their
length and the frequency of the constituent terms
25Person Recognition Resources
- Various Web lists of cultural names
- Anglo, Chinese, Arab, Hebrew, Hindi, Indian,
Japanese, Latino, Muslim, Russian
- World leaders
- This is enriched with a set of pattern
expressions for other instances
- President
- III
- III
26Organization Recognition Resources
- International political organizations (from CIA
Fact Book)
- Fortune 500 company list
- Global 500 company list
- This is enriched with a set of pattern
expressions for other instances
- Incorporated
- Sons
27Location Recognition Resources
- We mine the text of the CIA Fact Book for
variants of country names, administrative
divisions, capitals, harbors, etc.
- Various Web lists of
- World cities
- U.S. Cities
- Rivers
- Lakes
- This is enriched with a set of pattern
expressions for other instances
- Street
- Mount
- Mount
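The combination of name lists and pattern expressions described for persons, organizations, and locations can be sketched as a gazetteer lookup followed by regular-expression fallbacks. The cue words (President, III, Incorporated, Sons, Street, Mount) come from the slides; the gazetteer contents, the exact regular expressions, and the function `classify_phrase` are assumptions for illustration.

```python
import re

# Small stand-in gazetteers; real lists would come from the resources above.
GAZETTEER = {
    "person": {"Bill Clinton", "Benjamin Netanyahu"},
    "organization": {"NATO", "European Union"},
    "location": {"Israel", "Slovakia", "Iowa City"},
}

# Pattern expressions for instances not covered by the lists.
PATTERNS = [
    ("person", re.compile(r"^President\s+\S+|\s+III$")),
    ("organization", re.compile(r"\b(Incorporated|Sons)$")),
    ("location", re.compile(r"\bStreet$|^Mount\s+\S+")),
]


def classify_phrase(phrase):
    """Return an entity category for a noun phrase, or None if unrecognized."""
    # Exact list membership takes priority over patterns.
    for category, names in GAZETTEER.items():
        if phrase in names:
            return category
    for category, pattern in PATTERNS:
        if pattern.search(phrase):
            return category
    return None


if __name__ == "__main__":
    for p in ["NATO", "President Clinton", "Smith and Sons",
              "Mount Everest", "health care"]:
        print(p, "->", classify_phrase(p))
```

Phrases that match nothing, such as "health care", would remain generic noun phrases.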
28Example Document Sources
- Newswires
- Associated Press
- Wall Street Journal
- Financial Times of London
- Los Angeles Times
- Reuters
- Broadcast news transcripts
- CNN Headline News
- Voice of America
- Medical Literature
- PubMed
- The Web
29Newswire Entity Recognition Sample 1
- Persons
- Bill Clinton (3)
- Jonathan Pollard (8)
- Moshe Fogel (2)
- Benjamin Netanyahu (2)
- Esther (1)
- Israeli Embassy (1)
- Organizations
- Cabinet (1)
- Places
- Israel (16)
- United States (5)
- Washington (2)
30Newswire Entity Recognition Sample 2
- Persons
- Vladimir Meciar (8)
- Jozef Moravcik (2)
- God (1)
- Kalman Petocz (2)
- Organizations
- Slovak Democratic Coalition (2)
- United States and Germany (1)
- NATO (1)
- European Union (1)
- Hungarian Coalition Party (1)
- Places
- Slovakia (4)
- Europe (1)
31Some Performance Data
- The chart on the next slide shows a set of topics
(generated by information analysts) plotted by
- % of returns that were false alarms (X axis)
- % of good matches that were missed (Y axis)
- The retrieval decision was based on the level of
entity matching between the topic and a given
document
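The two plotted quantities can be computed as follows for a single topic. The overlap measure and the threshold used for the retrieval decision are assumptions, since the slides say only that the decision was based on the level of entity matching; `entity_overlap` and `evaluate` are hypothetical names.

```python
def entity_overlap(topic_entities, doc_entities):
    """Fraction of the topic's entities that also occur in the document."""
    topic = set(topic_entities)
    if not topic:
        return 0.0
    return len(topic & set(doc_entities)) / len(topic)


def evaluate(topic_entities, docs, relevant_ids, threshold=0.5):
    """Return (false_alarm_rate, miss_rate) for one topic.

    false_alarm_rate: share of returned documents that are not relevant.
    miss_rate: share of relevant documents that were not returned.
    """
    returned = {doc_id for doc_id, entities in docs.items()
                if entity_overlap(topic_entities, entities) >= threshold}
    relevant = set(relevant_ids)
    false_alarms = returned - relevant
    missed = relevant - returned
    false_alarm_rate = len(false_alarms) / len(returned) if returned else 0.0
    miss_rate = len(missed) / len(relevant) if relevant else 0.0
    return false_alarm_rate, miss_rate


if __name__ == "__main__":
    topic = ["Israel", "Jonathan Pollard"]
    docs = {
        "d1": ["Israel", "Jonathan Pollard", "Bill Clinton"],
        "d2": ["Israel"],
        "d3": ["Slovakia"],
        "d4": ["Jonathan Pollard", "Israel"],
    }
    print(evaluate(topic, docs, relevant_ids={"d1", "d2"}, threshold=0.75))
```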
32Some Performance Data
33Comments on Performance
- Note that the false alarm rate is very low
- The scale on the X-axis is 0.00 - 0.05
- The extremely broad spread on miss rates (from
missing everything to missing nothing) correlates
roughly with the nature of the topic
- Those involving actions of an individual or group
(e.g., World Trade Organization talks) are quite
good
- Those involving concepts (e.g., Federal monetary
policy) are quite poor
34Conclusions / Observations
- Given the rate of increase in the generation of
structured, semistructured and unstructured data,
some form of automated extraction, analysis and
retrieval technology can be quite valuable
- Current technology performs well for some, but
not all, categories of information request
35The Cutting Edge
- Moving beyond this involves even more complicated
techniques
- The current hot topic is question answering:
given a question, provide not a document, but the
actual answer
- What is Colin Powell famous for?
- How many cases of West Nile have been detected in
Iowa this year?
36The Cutting Edge
- QA systems use many of the preceding approaches,
adding significantly in the areas of natural
language parsing and question classification.
- Current state of the art
- How much folic acid should an expectant mother
get daily?
- 400 micrograms
- Answers are factoids, and anything in the corpus
(real or not) is scored as correct
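Question classification, in its simplest form, maps the wording of a question to the type of answer expected. The rule table below is a minimal sketch under stated assumptions: the answer-type labels, the rules, and the function `classify_question` are invented for illustration and do not describe any particular QA system.

```python
import re

# Ordered rules: first match wins. Labels are hypothetical answer types.
RULES = [
    (re.compile(r"^how (many|much)\b", re.IGNORECASE), "quantity"),
    (re.compile(r"^who\b", re.IGNORECASE), "person"),
    (re.compile(r"^where\b", re.IGNORECASE), "location"),
    (re.compile(r"^when\b", re.IGNORECASE), "date"),
    (re.compile(r"^what is .+ famous for\b", re.IGNORECASE), "description"),
]


def classify_question(question):
    """Return the expected answer type for a question, or "other"."""
    text = question.strip()
    for pattern, answer_type in RULES:
        if pattern.search(text):
            return answer_type
    return "other"


if __name__ == "__main__":
    for q in ["How much folic acid should an expectant mother get daily?",
              "What is Colin Powell famous for?"]:
        print(q, "->", classify_question(q))
```

The expected answer type then narrows which extracted entities or phrases are considered as candidate answers, for example numeric expressions for a "quantity" question.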