Title: Information Extraction
1Information Extraction
- Shih-Hung Wu
- Assistant Professor
- CSIE, Chaoyang University of Technology
2Outline
- Information Extraction
- Introduction
- Applications
- Table Reading
- Citation Extraction
- Chinese Named Entity Recognition
3Introduction
4Information Extraction
- extracts pieces of information that are salient
to the user's needs
5Message Understanding Conferences (MUC)
Evaluations
- provide prepared data and task definitions in addition to providing fully automated scoring software to measure machine and human performance.
- The databases now include named entities, multilingual named entities, attributes of those entities, facts about relationships between entities, and events in which the entities participated.
- The multilingual portion was known as the "Multilingual Entity Task (MET)"
6Examples
- The following fictional news story portrays the
levels of detail that systems can extract
Fletcher Maddox, former Dean of the UCSD Business
School, announced the formation of La Jolla
Genomatics together with his two sons. La Jolla
Genomatics will release its product Geninfo in
June 1999. Geninfo is a turnkey system to assist
biotechnology researchers in keeping up with the
voluminous literature in all aspects of their
field. Dr. Maddox will be the firm's CEO. His
son, Oliver, is the Chief Scientist and holds
patents on many of the algorithms used in
Geninfo. Oliver's brother, Ambrose, follows more
in his father's footsteps and will be the CFO of
L.J.G. headquartered in the Maddox family's
hometown of La Jolla, CA.
7Entities
Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
Locations: La Jolla, CA
Artifacts: Geninfo, Geninfo
Dates: June 1999
8Attributes
NAME Fletcher Maddox
DESCRIPTOR former Dean of the UCSD Business School; his father; the firm's CEO
CATEGORY PERSON
NAME La Jolla Genomatics
DESCRIPTOR
CATEGORY ORGANIZATION
NAME Geninfo
DESCRIPTOR its product
CATEGORY ARTIFACT
NAME La Jolla
DESCRIPTOR the Maddox family's hometown
CATEGORY LOCATION
9Facts
PERSON Employee_of ORGANIZATION
Fletcher Maddox Employee_of UCSD Business School
Fletcher Maddox Employee_of La Jolla Genomatics
Oliver Employee_of La Jolla Genomatics
Ambrose Employee_of La Jolla Genomatics
ARTIFACT Product_of ORGANIZATION
Geninfo Product_of La Jolla Genomatics
LOCATION Location_of ORGANIZATION
La Jolla Location_of La Jolla Genomatics
CA Location_of La Jolla Genomatics
10Events
COMPANY-FORMATION_EVENT
COMPANY La Jolla Genomatics
PRINCIPALS Fletcher Maddox, Oliver, Ambrose
DATE
CAPITAL
RELEASE-EVENT
COMPANY La Jolla Genomatics
PRODUCT Geninfo
DATE June 1999
COST
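The levels illustrated above (entities with attributes, facts, events) map naturally onto database-style records. A minimal Python sketch, assuming hypothetical class names `Entity`, `Fact`, and `ReleaseEvent` that are not part of the MUC definitions; the field values come from the example story:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    name: str
    category: str                      # PERSON, ORGANIZATION, ARTIFACT, LOCATION
    descriptors: List[str] = field(default_factory=list)


@dataclass
class Fact:
    subject: str
    relation: str                      # e.g. Employee_of, Product_of, Location_of
    obj: str


@dataclass
class ReleaseEvent:
    company: str
    product: str
    date: Optional[str] = None         # slots stay empty when the text gives no value
    cost: Optional[str] = None


geninfo = Entity("Geninfo", "ARTIFACT", ["its product"])
product_fact = Fact("Geninfo", "Product_of", "La Jolla Genomatics")
release = ReleaseEvent("La Jolla Genomatics", "Geninfo", date="June 1999")
```

The empty `cost` slot mirrors the empty COST line of the RELEASE-EVENT template: the story never states a price, so nothing is filled in.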
11Information Extraction
- current indicators of the state of the art
Items of Information: Percentile Reliability
Entities: 90
Attributes: 80
Facts: 70
Events: 60
12Technical definition of IE
- The process of creating database entries by
skimming a text and looking for occurrences of a
particular class of object or event and for
relationships among those objects and events
(Russell and Norvig, 2003)
13Basic IE tasks
- Extract addresses from Web pages
- target: street, city, state, and zip code
- Extract storms from weather reports
- target: temperature, wind speed, and precipitation
14IE Applications
- Competitive intelligence
- find instances of corporate mergers and joint ventures.
- Intelligence gathering
- terrorist activities.
- any damage to buildings or the infrastructure, as well as the time and location of the event.
- Health care delivery
- summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments.
15Technology
- Methods in the literature
- Regular expressions
- Cascaded finite-state transducers
- Our approaches
- Ontological domain knowledge
- Machine Learning
- Hybrid method
16Regular expression approach example
- From the text
- 17in SXGA Monitor for only $249.99
- Extract
- ∃m m ∈ ComputerMonitors ∧ Size(m, Inches(17)) ∧ Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024)
17Regular Expressions
- [0-9] matches any digit from 0 to 9
- [0-9]+ matches one or more digits
- [.][0-9][0-9] matches a period followed by two digits
- ([.][0-9][0-9])? matches a period followed by two digits, or nothing
- [$][0-9]+([.][0-9][0-9])? matches $249.99, $1.23, $100000, ...
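A minimal Python sketch of attribute extraction with regular expressions, applied to the monitor advertisement from the previous slide. The price pattern follows the pieces listed above and assumes prices carry a dollar sign; the `in` size suffix, the `RESOLUTIONS` lookup table, and the function name `extract_monitor` are assumptions made for this illustration.

```python
import re

# Price: dollar sign, one or more digits, optional period plus two digits.
PRICE = re.compile(r"[$]([0-9]+([.][0-9][0-9])?)")
# Size: digits immediately followed by "in" (assumed notation for inches).
SIZE = re.compile(r"([0-9]+)in\b")
# Assumed lookup from a display-standard name to its pixel resolution.
RESOLUTIONS = {"SXGA": (1280, 1024)}


def extract_monitor(text):
    """Return a dict of the attributes found in a one-line monitor advert."""
    record = {}
    size = SIZE.search(text)
    if size:
        record["size_inches"] = int(size.group(1))
    price = PRICE.search(text)
    if price:
        record["price"] = float(price.group(1))
    for name, pixels in RESOLUTIONS.items():
        if name in text:
            record["resolution"] = pixels
    return record


print(extract_monitor("17in SXGA Monitor for only $249.99"))
```

Each attribute has its own template, so a missing attribute simply leaves its slot out of the record.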
18Weakness
- What's the price?
- List price $99.00, special sale price $78.00, shipping $3.00.
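The weakness can be reproduced in a few lines. This sketch assumes the amounts are written with a dollar sign; the `SALE` template, which adds the preceding words as context, is one hypothetical remedy and not something the slide prescribes.

```python
import re

PRICE = re.compile(r"[$][0-9]+(?:[.][0-9][0-9])?")

text = "List price $99.00, special sale price $78.00, shipping $3.00."

# The bare price pattern finds every amount and cannot tell which is "the" price.
candidates = PRICE.findall(text)

# Adding context words to the template selects one specific amount.
SALE = re.compile(r"sale price ([$][0-9]+(?:[.][0-9][0-9])?)")
sale_price = SALE.search(text).group(1)

print(candidates, sale_price)
```

The pattern alone yields three equally valid candidates; choosing among them needs information that lies outside the matched string.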
19Cascaded finite-state transducers approach example
- From
- Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.
- Extract
- e ∈ JointVentures ∧ Product(e, golf clubs) ∧ Date(e, Friday) ∧ Entity(e, Bridgestone Sports Co) ∧ Entity(e, a local concern) ∧ Entity(e, a Japanese trading house)
20Cascaded finite-state transducers
- A typical relational extraction system consists of the following five stages
- Tokenization
- Complex word handling
- Basic group handling
- Complex phrase handling
- Structure merging
21- Tokenization
- Word segmentation
- ??????-gt??????, ??????
- Complex word handling
- Bridgestone Sports Co.
- CapitalizedWord+ (Company|Co|Inc|Ltd)
- Intel Chairman Andy Grove
- CapitalizedWord+ (Grove|Forest|Village)
- ???????
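Complex word handling can be sketched with two regular expressions over capitalized words. The suffix lists come from the slide; the Python patterns, the names `COMPANY` and `PLACE`, and the reading of the second rule as a place-name rule are assumptions for illustration.

```python
import re

# One or more capitalized words followed by a company suffix.
COMPANY = re.compile(r"(?:[A-Z][a-z]+\s)+(?:Company|Co|Inc|Ltd)\b\.?")
# One or more capitalized words followed by a place-like suffix (assumed meaning).
PLACE = re.compile(r"(?:[A-Z][a-z]+\s)+(?:Grove|Forest|Village)\b")

company = COMPANY.search("Bridgestone Sports Co. said Friday it has set up a joint venture")
false_hit = PLACE.search("Intel Chairman Andy Grove")

print(company.group(0))
print(false_hit.group(0))
```

The second match shows why such rules need care: the surname in "Intel Chairman Andy Grove" satisfies the suffix rule even though the phrase names a person.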
22- Basic group handling
- Noun group, verb group, Preposition, Conjunction
1 NG Bridgestone Sports Co.
2 VG said
3 NG Friday
4 NG it
5 VG had set up
6 NG a joint venture
7 PR in
8 NG Taiwan
9 PR with
10 NG a local concern
11 CJ and
12 NG a Japanese trading house
13 VG to produce
14 NG golf clubs
15 VG to be shipped
16 PR to
17 NG Japan
23- Complex phrase handling
- Company+ SetUp JointVenture (with Company+)?
- Structure merging
- If the next sentence says something about the same event, the structures from both sentences are merged into one.
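A minimal sketch of complex phrase handling over the basic groups listed on the previous slide. It hand-codes the joint-venture pattern rather than compiling a finite-state transducer, and it assumes the first noun group of the sentence is the company that set up the venture; the function name `extract_joint_venture` and the output keys are hypothetical.

```python
GROUPS = [
    ("NG", "Bridgestone Sports Co."), ("VG", "said"), ("NG", "Friday"),
    ("NG", "it"), ("VG", "had set up"), ("NG", "a joint venture"),
    ("PR", "in"), ("NG", "Taiwan"), ("PR", "with"), ("NG", "a local concern"),
    ("CJ", "and"), ("NG", "a Japanese trading house"), ("VG", "to produce"),
    ("NG", "golf clubs"), ("VG", "to be shipped"), ("PR", "to"), ("NG", "Japan"),
]


def extract_joint_venture(groups):
    """Match: Company SetUp JointVenture (with Company ...)? over tagged groups."""
    tags = [tag for tag, _ in groups]
    texts = [text for _, text in groups]
    for i in range(len(groups) - 1):
        set_up = tags[i] == "VG" and "set up" in texts[i]
        venture = tags[i + 1] == "NG" and "joint venture" in texts[i + 1]
        if not (set_up and venture):
            continue
        event = {"company": texts[0], "partners": [], "product": None}
        # Optional "with Company" part: skip ahead to the preposition "with",
        # stopping at the next verb group.
        j = i + 2
        while j < len(groups) and tags[j] != "VG" and texts[j] != "with":
            j += 1
        if j < len(groups) and tags[j] == "PR" and texts[j] == "with":
            j += 1
            while j < len(groups) and tags[j] in ("NG", "CJ"):
                if tags[j] == "NG":
                    event["partners"].append(texts[j])
                j += 1
        # Product: the noun group right after the verb group "to produce".
        for k in range(i + 2, len(groups) - 1):
            if tags[k] == "VG" and texts[k] == "to produce" and tags[k + 1] == "NG":
                event["product"] = texts[k + 1]
                break
        return event
    return None


event = extract_joint_venture(GROUPS)
print(event)
```

A later structure-merging step would combine this record with records from following sentences that describe the same venture.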
24A brief remark
- IE works well for a restricted domain
- Predetermine the Subjects and how they are
mentioned
25Applications
26- Table Reading
- Citation Extraction
- Chinese NER
27Semantic Search on Internet Tabular Information
Extraction for Answering Queries
28Table Reading
Gives an algorithm to interpret tables in which some cells span multiple rows or columns.
An example of an interpretation is (Attribute) => (Value): (Adult-Price-Single Room-Economic class) => 35,450
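A minimal sketch of how a table with spanning cells can be turned into attribute-path and value pairs. It is not the ontology-based method of the paper: it assumes the number of header rows and header columns is known in advance, and every cell value except 35,450 is hypothetical. The names `Cell`, `expand`, and `interpret` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    rowspan: int = 1
    colspan: int = 1


def expand(rows):
    """Copy each spanning cell into every grid position it covers."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in grid:          # skip positions filled by a rowspan above
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    grid[(r + dr, c + dc)] = cell.text
            c += cell.colspan
    return grid


def interpret(rows, n_header_rows, n_header_cols):
    """Map 'row headers-column headers' attribute paths to data-cell values."""
    grid = expand(rows)
    n_rows = 1 + max(r for r, _ in grid)
    n_cols = 1 + max(c for _, c in grid)
    result = {}
    for r in range(n_header_rows, n_rows):
        for c in range(n_header_cols, n_cols):
            path = [grid[(r, k)] for k in range(n_header_cols)]
            path += [grid[(k, c)] for k in range(n_header_rows)]
            result["-".join(p for p in path if p)] = grid[(r, c)]
    return result


rows = [
    [Cell("", colspan=3), Cell("Economic class"), Cell("Business class")],
    [Cell("Adult", rowspan=2), Cell("Price", rowspan=2), Cell("Single Room"),
     Cell("35,450"), Cell("40,000")],
    [Cell("Double Room"), Cell("32,000"), Cell("38,000")],
]
table = interpret(rows, n_header_rows=1, n_header_cols=3)
print(table["Adult-Price-Single Room-Economic class"])
```

Expanding spans first reduces the complex layout to a plain grid, after which every data cell inherits the headers of its row and column.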
29Table Reading
30Method
Ambiguous
Tagging
Relations of Cells
Layout Recognition
Layout Transformation
31Method
Tagging
Layout Identifying
Layout Trans.
32Airline Schedule Ontology
33Tagging
C Departure City
I Departure City
34Four Relations of Table Cells
- Relations of Concept - Instances
- Concept - Instance of the Concept
- Concept - Descent Concept
- Concept - Instance of Descent Concept
- Instance - Instance of the same Concept
35Layout Recognition
C-I Table
Layout Descriptions
Template Matching
Defined by Layout Syntax Grammar
Matched Layout Description
36Layout Transformation
Origin Layout Description
Destination Layout Description
37Experiments
- 23 tables from 23 web pages
- 13 two-dimensional tables, 10 complex tables
- A table counts as a success only if no cell is missed; any miss counts as a failure
38Conclusion and Future Work
- Layout transformation from complex tables to simple tables (1D, 2D).
- A general approach
- 1. Tagging
- 2. Semantic Layout Recognition
- 3. Layout Transformation
- Ambiguity is reduced by checking cell relations
39Reference
- Huei-Long Wang, Shih-Hung Wu, I. C. Wang, Cheng-Lung Sung, W. L. Hsu, W. K. Shih, Semantic Search on Internet Tabular Information Extraction for Answering Queries, Ninth International Conference on Information and Knowledge Management (CIKM-2000), McLean, VA, November 6-11, 2000, pp. 243-249. (EI)
- H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, Mining Tables from Large Scale HTML Texts, in Proc. 18th International Conference on Computational Linguistics, Saarbrücken, Germany, July 2000.
40A Knowledge-based Approach to Citation Extraction
41Introduction
- Integration of the bibliographical information of scholarly publications available on the Internet
- Accurate reference metadata extraction from heterogeneous reference sources.
- We propose a knowledge-based approach to reference metadata extraction
- INFOMAP ontological knowledge representation framework
- Automatically extract the reference metadata.
42Proposed Approach
43Reference Data Collection
Phase 1
- Journal Spider (journal agent)
- collect journal data from the Journal Citation Reports (JCR) indexed by the ISI and digital libraries on the Web.
- Citation data source
- ISI Web of Science
- DBLP
- CiteSeer
- PubMed
44Domain Knowledge
Phase 2
45INFOMAP
- INFOMAP as ontological knowledge representation framework
- extracts important citation concepts from a natural language text.
- Features of INFOMAP
- represent and match complicated template structures
- hierarchical matching
- regular expressions
- semantic template matching
- frame (non-linear relations) matching
- Using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference formats or styles.
46Reference Metadata Extraction
Phase 3
Journal reference style: Reference style example
Bioinformatics style (BIOI): Davenport, T., DeLong, D., Beers, M. (1998) Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
ACM style (ACM): 1. Davenport, T., DeLong, D. and Beers, M. 1998. Successful knowledge management projects. Sloan Management Review, 39 (2). 43-57.
IEEE style (IEEE): [1] T. Davenport, D. DeLong, and M. Beers, "Successful knowledge management projects," Sloan Management Review, vol. 39, no. 2, pp. 43-57, 1998.
APA style (APA): Davenport, T., DeLong, D., Beers, M. (1998). Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
JCB style (JCB): Davenport, T., DeLong, D., Beers, M. 1998. Successful knowledge management projects. Sloan Management Review 39(2), 43-57.
MISQ style (MISQ): Davenport, T., DeLong, D., and Beers, M. "Successful knowledge management projects," Sloan Management Review (39:2) 1998, pp 43-57.
Table 1. Examples of different journal reference
styles
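To make the field-extraction task concrete, here is a minimal regular-expression sketch that parses the APA and BIOI examples from Table 1. It is not INFOMAP and covers only these two closely related layouts; the pattern, the group names, and the function `parse_reference` are assumptions for illustration.

```python
import re

# Authors (Year)[.] Title. Journal, Volume(Issue), Pages.
APA_LIKE = re.compile(
    r"^(?P<author>.+?)\s\((?P<year>\d{4})\)\.?\s"
    r"(?P<title>.+?)\.\s"
    r"(?P<journal>.+?),\s(?P<volume>\d+)\((?P<issue>\d+)\),\s"
    r"(?P<pages>\d+-\d+)\.$"
)


def parse_reference(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    match = APA_LIKE.match(line)
    return match.groupdict() if match else None


apa = ("Davenport, T., DeLong, D., Beers, M. (1998). Successful knowledge "
       "management projects. Sloan Management Review, 39(2), 43-57.")
ieee = ('1 T. Davenport, D. DeLong, and M. Beers, "Successful knowledge '
        'management projects," Sloan Management Review, vol. 39, no. 2, '
        'pp. 43-57, 1998.')

fields = parse_reference(apa)
print(fields)
print(parse_reference(ieee))
```

The IEEE example falls through because its field order and delimiters differ, which illustrates why a single fixed pattern does not generalize across reference styles.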
47Knowledge-based Reference Metadata Extraction -
Online Service
Phase 4
48Citation Extraction From Text to BibTeX
- W. L. Hsu, "The coloring and maximum independent set problems on planar perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.
- W. L. Hsu, "On the general feasibility test of scheduling lot sizes for several products on one machine," Management Science 29, (1983), 93-105.
- W. L. Hsu, "The distance-domination numbers of trees," Operations Research Letters 1, (3), (1982), 96-100.
Figure 3. The system input of knowledge-based RME
@article{
  Author = {W. L. Hsu},
  Title = {The coloring and maximum independent set problems on planar perfect graphs},
  Journal = {J. Assoc. Comput. Machin.},
  Volume = {},
  Number = {},
  Pages = {535-563},
  Year = {1988}
}
@article{
  Author = {W. L. Hsu},
  Title = {On the general feasibility test of scheduling lot sizes for several products on one machine},
  Journal = {Management Science},
  Volume = {29},
  Number = {},
  Pages = {93-105},
  Year = {1983}
}
@article{
  Author = {W. L. Hsu},
  Title = {The distance-domination numbers of trees},
  Journal = {Operations Research Letters},
  Volume = {1},
  Number = {3},
  Pages = {96-100},
  Year = {1982}
}
Figure 5. The system output of BibTeX Format
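Once the fields are extracted, producing the BibTeX record is a formatting step. A minimal sketch, assuming a hypothetical helper `to_bibtex` and a made-up citation key `hsu1982`; the field names and values follow the third record above:

```python
FIELD_ORDER = ["Author", "Title", "Journal", "Volume", "Number", "Pages", "Year"]


def to_bibtex(fields, key):
    """Format extracted fields as an @article entry; missing fields stay empty."""
    body = ",\n".join(
        f"  {name} = {{{fields.get(name, '')}}}" for name in FIELD_ORDER
    )
    return f"@article{{{key},\n{body}\n}}"


record = {
    "Author": "W. L. Hsu",
    "Title": "The distance-domination numbers of trees",
    "Journal": "Operations Research Letters",
    "Volume": "1",
    "Number": "3",
    "Pages": "96-100",
    "Year": "1982",
}
entry = to_bibtex(record, key="hsu1982")
print(entry)
```

Fields the extractor could not find are written as empty braces, matching the empty Volume and Number slots in the first record.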
49System Input (Plain text)
System Output
Output BibTex
Figure 6. The online service of knowledge-based RME (http://bioinformatics.iis.sinica.edu.tw/CitationAgent/)
50Experimental Results and Discussion
- Experimental data
- We used EndNote to collect Bioinformatics citation data for 2004 from PubMed.
- A total of 907 bibliography records were collected from PubMed digital libraries on the Web.
- Reference testing data was generated for each of the six reference styles (BIOI, ACM, IEEE, APA, MISQ, and JCB).
- 500 records were randomly selected for testing from each of the six reference styles.
51Experimental results of citation extraction from
six reference styles
52Example Results
53Field Field Relation Structure Percentage
Author <Author><Year> 54.29
<Author><Title> 42.86
N/A 2.85
Year <Author><Year><Title> 48.57
<Journal><Year><Volume> 20.00
<Issue><Year><Pages> 14.29
<Author><Year><Journal> 5.71
<Pages><Year> 2.86
<Volume><Year><Pages> 2.86
N/A 5.71
Title <Year><Title><Journal> 48.57
<Author><Title><Journal> 42.86
N/A 8.57
Journal <Title><Journal><Volume> 71.43
<Title><Journal><Year> 20.00
<Year><Journal><Volume> 5.71
N/A 2.86
Volume <Journal><Volume><Pages> 40.00
<Journal><Volume><Issue> 31.43
<Year><Volume><Issue> 14.29
<Year><Volume><Pages> 5.71
<Journal><Volume><Volume> 2.86
<Journal><Volume><Year> 2.86
N/A 2.85
Issue <Volume><Issue><Pages> 34.29
<Volume><Issue><Year> 14.29
N/A 51.42
Pages <Volume><Pages> 42.86
<Issue><Pages> 34.29
<Year><Pages> 17.14
<Volume><Pages><Year> 2.86
N/A 2.85
The various structures of different styles (analysis of the structures of 30 reference styles)
54Comparison with related works
- Knowledge-based approach
- Our proposed knowledge-based method for scholarly publications can extract reference information from 907 records in various reference styles with a high degree of precision
- the overall average field accuracy is 97.87% for the six major styles listed in Table 1
- 98.20% for the MISQ style
- 87% for 30 other randomly selected styles
55Conclusions
- Citation extraction is a challenging problem
- The diverse nature of reference styles
- We have proposed a knowledge-based citation extraction method for scholarly publications.
- The experimental results indicate that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different reference styles with a high degree of precision.
- The overall average field accuracy of citation extraction is 97.87% for six major reference styles.
56Future Research
- Integrate the ontological and the machine learning approaches to boost the performance of citation information extraction
- Maximum-Entropy Method (MEM)
- Hidden Markov Model (HMM)
- Conditional Random Fields (CRF)
- Support Vector Machines (SVM)
57Reference
- Min-Yuh Day, Tzong-Han Tsai, Cheng-Lung Sung,
Cheng-Wei Lee, Shih-Hung Wu, Chorng-Shyong Ong,
and Wen-Lian Hsu, A Knowledge-based Approach to
Citation Extraction, to appear in Proceedings of
IEEE International Conference on Information
Reuse and Integration (IEEE IRI-2005), pp.50-55.
(EI)
58Chinese Named Entity Recognition Using a Hybrid
Approach of Machine Learning and Domain Knowledge
59Named Entity Recognition
- ??????,??????Named Entity
- ???????????????
- <??>???</??>???<??>??</??>??<???>????</???>
60Sequential Labeling
Token-based
??? ?? ? ?? ?? ????
Per Loc Org
Character-based
? ? ? ? ? ? ? ? ? ? ? ? ? ?
B-P I-P I-P B-L I-L B-O I-O I-O I-O
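Character-based sequential labeling turns entity spans into one tag per character. A minimal sketch, assuming spans are given as (start, end, type) with an exclusive end; the helper name `bio_tags` is hypothetical, and the outside tag is written `OUT` here so it does not clash with the O (organization) type:

```python
def bio_tags(length, spans):
    """Return one tag per character: B-x opens an entity, I-x continues it."""
    tags = ["OUT"] * length
    for start, end, kind in spans:
        tags[start] = f"B-{kind}"
        for i in range(start + 1, end):
            tags[i] = f"I-{kind}"
    return tags


# An 8-character sentence with a person name at 0..2 and a location at 4..5.
tags = bio_tags(8, [(0, 3, "P"), (4, 6, "L")])
print(tags)
```

Framing NER this way means the recognizer only has to pick one tag per character, which is what makes sequence models such as the maximum-entropy tagger applicable.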
61Machine Learning???????
- ??????????named entity
- ?????corpus
- ????corpus?????target named entity,
????corpus???????????. - ???????NER????
- ????????NER?????
62Hybrid NER method
- Domain knowledge
- ??, ????, ?????, ????
- Machine Learning
- SVM, Bigram/Trigram Model
- Hybrid
- Maximum-Entropy Framework
- Domain knowledge serves as features
63Static knowledge is insufficient
- New names
- ???
- ????SARS???
- ??????
- Ambiguity
- ????????
- ??????????
- Context dependence
- ???????
- ????????
64Pure machine learning might suffer
- Lack of context information
- ???Window Size
- ???????token????tag??
- ?????????NE????
- ??????????NER??
- ???????????
- ???????, ?NER???????
65Basic Concepts of Our ME-based Hybrid Approach
- ?????NE??????????Context Information
- Internal/External Features
- ??Training Data?????Feature???, ??????confidence
66Internal/External Features
- Internal
- Found within the name string itself
- e.g., ? ? ? ? ? ? ?
- External
- Context
- e.g., ? ? ? ?
??
????
67Tag Set (outcome)
- ???????Character??Token, ????Named Entity????, ??, ???
- Tag Set
- ?/B-P ?/I-P ?/I-P
- ?/B-L ?/I-L ?/I-L
- ?/B-O ?/I-O ?/I-O
68ME-based NER Framework-Feature Representation
- For example
- ???token????, ??????????????
- ??????????
Feature f is active!!
69ME-based NER Framework-Training
- Given a set of features and a training corpus
- The ME estimation process produces a model in which every feature f_i has a weight a_i.
- We can then compute p(o|h), the probability of outcome o given history h
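A common way to write the conditional maximum-entropy model is given below. It is an assumed standard form, consistent with the per-feature weights a_i described above, and not a reproduction of the slide's own equation:

```latex
p(o \mid h) = \frac{1}{Z(h)} \prod_{i} a_i^{f_i(h, o)},
\qquad
Z(h) = \sum_{o'} \prod_{i} a_i^{f_i(h, o')}
```

Each f_i(h, o) is a binary feature taking the value 0 or 1, so only the active features contribute their weight, and Z(h) normalizes the scores over all outcomes.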
70ME-based NER Framework-Decoding
- Tokenize the text and preprocess the testing sentence
- For each token, check which features are active and combine the a_i of the active features according to Equation 1
- A Viterbi search is run to find the highest-probability path
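The decoding steps can be sketched as follows. This is a toy illustration, not the system's implementation: the tag set, the surname list, the four features, and their weights are all hypothetical, and the model is assumed to multiply the weights of active features and normalize over the tags.

```python
import math

TAGS = ["B-P", "I-P", "O"]
SURNAMES = {"Wu"}  # hypothetical domain-knowledge list

# Each entry: (weight a_i, binary feature f_i(prev_tag, tokens, i, tag)).
FEATURES = [
    (3.0, lambda prev, toks, i, tag: tag == "B-P" and toks[i] in SURNAMES),
    (4.0, lambda prev, toks, i, tag: tag == "B-P" and i > 0 and toks[i - 1] == "Dr."),
    (2.0, lambda prev, toks, i, tag: tag == "O" and toks[i] not in SURNAMES),
    (0.01, lambda prev, toks, i, tag: tag == "I-P" and prev not in ("B-P", "I-P")),
]


def prob(prev, toks, i, tag):
    """p(tag | history): product of active weights, normalized over all tags."""
    def score(outcome):
        s = 1.0
        for weight, feature in FEATURES:
            if feature(prev, toks, i, outcome):
                s *= weight
        return s
    return score(tag) / sum(score(o) for o in TAGS)


def viterbi(toks):
    """Return the tag sequence with the highest total log-probability."""
    best = {"<s>": (0.0, [])}  # last tag -> (log-probability, path so far)
    for i in range(len(toks)):
        nxt = {}
        for tag in TAGS:
            candidates = [
                (logp + math.log(prob(prev, toks, i, tag)), path + [tag])
                for prev, (logp, path) in best.items()
            ]
            nxt[tag] = max(candidates, key=lambda c: c[0])
        best = nxt
    return max(best.values(), key=lambda c: c[0])[1]


path = viterbi(["Dr.", "Wu", "spoke"])
print(path)
```

The surname feature plays the role of domain knowledge and the "Dr." feature the role of context; both enter the model in the same way, as weighted binary features.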
71Hybrid NER Example
- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
- The NER problem is formulated as maximizing p(o|h) and finding the corresponding outcome o
Ps
Ls
Context (History)
Feature 1 ??
Os
W0 the current token
72Advantages of Hybrid NER
- ????????, ??????????.
- ??????????????????????????
- ???????????????, ?????????????
- ??????????????Performance
- ???????????
73Experiment-Data Set
United Daily News (December, 2002)
Number of named entities (PER, LOC, ORG) and size per domain
Domain PER LOC ORG Size (in characters)
Local News 84 139 97 11835
Social Affairs 310 287 354 37719
Investment 20 63 33 14397
Politics 419 209 233 17168
Headline News 267 70 243 19938
Business 142 186 187 25815
Total 1242 954 1147 126872
74Experiment Result
- Use domain knowledge only
NE P(%) R(%) F(%)
PER 72.98 97.93 83.63
LOC 67.96 74.67 71.16
ORG 95.77 64.07 76.78
Total 75.62 82.13 78.74
NE P(%) R(%) F(%)
PER 97.94 87.39 92.36
LOC 78.60 69.35 73.69
ORG 94.39 62.57 75.25
Total 90.56 73.70 81.26
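The F column is consistent with the balanced F-measure, the harmonic mean of precision and recall. A minimal sketch that rechecks two rows of the tables above; the function name `f_measure` is an assumption:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_domain_only = round(f_measure(72.98, 97.93), 2)   # PER row, first table
total_second = round(f_measure(90.56, 73.70), 2)      # Total row, second table
print(per_domain_only, total_second)
```

Because the harmonic mean is pulled toward the lower of the two values, a large gain in precision raises F only moderately when recall drops.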
75Performance Comparison
Sys. Person (P R F) Location (P R F) Organization (P R F) Overall (P R F)
NTU (98) 74 91 81.6 69 78 73.2 85 78 81.3 77 83 79.9
KRDL (98) 66.4 92 77.1 89 90.9 90 89.5 87.8 88.6 85.2 90.2 87.6
IASL (03) 92.1 83.3 87.5 88.1 81.8 84.9 93.3 88.7 90.9 90.4 85 87.7
Corpus: MET2 dataset; number of entities: 3646
76Conclusion and Future Work
- Conclusion
- Hybrid Approach???????????????????
- Hybrid Approach?Precision?????????Improvement,
- Hybrid ????Improvement??, ????????????
- Future Work
- ???????Named Entity???Features
- ????????????
- ??Named Entity???
- Multi Iteration NER
- Hierarchical Named Entity???
77References
- [Tsai 2003] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu (2003), Mencius: A Chinese Named Entity Recognizer Using Hybrid Model, in Proceedings of the Fifteenth Research on Computational Linguistics International Conference (ROCLING XV), pp. 193-209.
- [Tsai 2004] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu, "Mencius: A Chinese Named Entity Recognizer Based on a Maximum Entropy Framework," Computational Linguistics and Chinese Language Processing, Vol. 9, No. 1, pp. 65-82, 2004.
- [Shih 2004] Cheng-Wei Shih, Tzong-Han Tsai, Shih-Hung Wu, Chiu-Chen Hsieh, and Wen-Lian Hsu (2004), The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0, in Proceedings of the Sixteenth Conference on Computational Linguistics and Speech Processing (ROCLING XVI), pp. 305-313.