1
Information Extraction
  • Shih-Hung Wu
  • Assistant Professor
  • CSIE, Chaoyang University of Technology

2
Outline
  • Information Extraction
  • Introduction
  • Applications
  • Table Reading
  • Citation Extraction
  • Chinese Named Entity Recognition

3
Introduction
4
Information Extraction
  • extracts pieces of information that are salient
    to the user's needs

5
Message Understanding Conferences (MUC)
Evaluations
  • provide prepared data and task definitions in
    addition to providing fully automated scoring
    software to measure machine and human
    performance.
  • The databases now include named entities,
    multilingual named entities, attributes of those
    entities, facts about relationships between
    entities, and events in which the entities
    participated.
  • The multilingual portion was known as the
    "Multilingual Entity Task (MET)"

6
Examples
  • The following fictional news story portrays the
    levels of detail that systems can extract

Fletcher Maddox, former Dean of the UCSD Business
School, announced the formation of La Jolla
Genomatics together with his two sons. La Jolla
Genomatics will release its product Geninfo in
June 1999. Geninfo is a turnkey system to assist
biotechnology researchers in keeping up with the
voluminous literature in all aspects of their
field. Dr. Maddox will be the firm's CEO. His
son, Oliver, is the Chief Scientist and holds
patents on many of the algorithms used in
Geninfo. Oliver's brother, Ambrose, follows more
in his father's footsteps and will be the CFO of
L.J.G. headquartered in the Maddox family's
hometown of La Jolla, CA.
7
Entities
Persons         | Organizations        | Locations | Artifacts | Dates
Fletcher Maddox | UCSD Business School | La Jolla  | Geninfo   | June 1999
Dr. Maddox      | La Jolla Genomatics  | CA        | Geninfo   |
Oliver          | La Jolla Genomatics  |           |           |
Oliver          | L.J.G.               |           |           |
Ambrose         |                      |           |           |
Maddox          |                      |           |           |
8
Attributes
NAME        Fletcher Maddox
DESCRIPTOR  former Dean of the UCSD Business School; his father; the firm's CEO
CATEGORY    PERSON

NAME        La Jolla Genomatics
DESCRIPTOR  (none)
CATEGORY    ORGANIZATION

NAME        Geninfo
DESCRIPTOR  its product
CATEGORY    ARTIFACT

NAME        La Jolla
DESCRIPTOR  the Maddox family's hometown
CATEGORY    LOCATION
9
Facts
PERSON          | Employee_of | ORGANIZATION
Fletcher Maddox | Employee_of | UCSD Business School
Fletcher Maddox | Employee_of | La Jolla Genomatics
Oliver          | Employee_of | La Jolla Genomatics
Ambrose         | Employee_of | La Jolla Genomatics

ARTIFACT        | Product_of  | ORGANIZATION
Geninfo         | Product_of  | La Jolla Genomatics

LOCATION        | Location_of | ORGANIZATION
La Jolla        | Location_of | La Jolla Genomatics
CA              | Location_of | La Jolla Genomatics
10
Events

COMPANY-FORMATION_EVENT
COMPANY     La Jolla Genomatics
PRINCIPALS  Fletcher Maddox, Oliver, Ambrose
DATE        (none)
CAPITAL     (none)

RELEASE-EVENT
COMPANY     La Jolla Genomatics
PRODUCT     Geninfo
DATE        June 1999
COST        (none)
11
Information Extraction
  • current indicators of the state of the art

Items of Information | Reliability (%)
Entities             | 90
Attributes           | 80
Facts                | 70
Events               | 60
12
Technical definition of IE
  • The process of creating database entries by
    skimming a text and looking for occurrences of a
    particular class of object or event and for
    relationships among those objects and events
    (Russell and Norvig, 2003)

13
Basic IE tasks
  • Extract addresses from Web pages
  • Target fields: street, city, state, and zip code
  • Extract storms from weather reports
  • Target fields: temperature, wind speed, and
    precipitation

14
IE Applications
  • Competitive intelligence
  • find instances of corporate mergers and joint
    ventures.
  • Intelligence gathering
  • terrorist activities.
  • any damage to buildings or the infrastructure, as
    well as the time and location of the event.
  • Health care delivery
  • summarize medical patient records by extracting
    diagnoses, symptoms, physical findings, test
    results, and therapeutic treatments.

15
Technology
  • Methods in the literature
  • Regular expressions
  • Cascaded finite-state transducers
  • Our approaches
  • Ontological domain knowledge
  • Machine Learning
  • Hybrid method

16
Regular expression approach example
  • From the text
  • 17in SXGA Monitor for only $249.99
  • Extract
  • ∃m m ∈ ComputerMonitors ∧ Size(m, Inches(17)) ∧
    Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024)

17
Regular Expressions
  • [0-9] matches any digit from 0 to 9
  • [0-9]+ matches one or more digits
  • [.][0-9][0-9] matches a period followed by two digits
  • ([.][0-9][0-9])? matches a period followed by two digits, or nothing
  • [$][0-9]+([.][0-9][0-9])? matches $249.99, $1.23, $100000, ...
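The price pattern built up on this slide can be tried directly. The following is a minimal sketch, assuming prices are written with a leading dollar sign; the function name find_prices is illustrative and not part of the original slides.

```python
import re

# A dollar sign, one or more digits, then optionally
# a period followed by exactly two digits.
PRICE = re.compile(r"[$][0-9]+([.][0-9][0-9])?")


def find_prices(text):
    """Return every substring of `text` that matches PRICE."""
    return [m.group(0) for m in PRICE.finditer(text)]


print(find_prices("17in SXGA Monitor for only $249.99"))
```

The optional group is what lets the same pattern accept both $249.99 and $100000.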
18
Weakness
  • What's the price?
  • List price $99.00, special sale price $78.00,
    shipping $3.00.
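The weakness is easy to reproduce: a price pattern alone returns three equally valid candidates and cannot say which one is "the" price. The sketch below, assuming dollar-prefixed prices, pairs each match with the words that precede it; the helper label_prices is hypothetical and only illustrates that extra context is needed.

```python
import re

PRICE = re.compile(r"[$][0-9]+(?:[.][0-9][0-9])?")


def label_prices(text):
    """Pair each price with the text between it and the previous price."""
    results = []
    prev_end = 0
    for m in PRICE.finditer(text):
        # The words just before the match act as a crude label.
        label = text[prev_end:m.start()].strip(" ,.").lower()
        results.append((label, m.group(0)))
        prev_end = m.end()
    return results


text = "List price $99.00, special sale price $78.00, shipping $3.00."
for label, price in label_prices(text):
    print(label, "->", price)
```

Choosing among "list price", "special sale price", and "shipping" still requires knowledge the regular expression does not have.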

19
Cascaded finite-state transducers approach example
  • From
  • Bridgestone Sports Co. said Friday it has set up
    a joint venture in Taiwan with a local concern
    and a Japanese trading house to produce golf
    clubs to be shipped to Japan.
  • Extract
  • e ∈ JointVentures ∧ Product(e, "golf clubs") ∧
    Date(e, "Friday") ∧ Entity(e, "Bridgestone Sports
    Co") ∧ Entity(e, "a local concern") ∧ Entity(e,
    "a Japanese trading house")

20
Cascaded finite-state transducers
  • A typical relational extraction system consists
    of the following five stages
  • Tokenization
  • Complex word handling
  • Basic group handling
  • Complex phrase handling
  • Structure merging

21
  • Tokenization
  • Word segmentation
  • ?????? -> ??????, ??????
  • Complex word handling
  • Bridgestone Sports Co.
  • CapitalizedWord+ (Company|Co|Inc|Ltd)
  • Intel Chairman Andy Grove
  • CapitalizedWord+ (Grove|Forest|Village)
  • ???????
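The company-name rule on this slide (one or more capitalized words followed by Company, Co, Inc, or Ltd) can be approximated with an ordinary regular expression. This is a rough sketch, assuming ASCII text and a simple definition of "capitalized word"; find_companies is an illustrative name.

```python
import re

# One or more capitalized words, then a company suffix,
# optionally followed by a period (as in "Co.").
COMPANY = re.compile(r"(?:[A-Z][a-z]+ )+(?:Company|Co|Inc|Ltd)\b\.?")


def find_companies(text):
    """Return substrings that look like company names under the rule above."""
    return [m.group(0) for m in COMPANY.finditer(text)]


print(find_companies("Bridgestone Sports Co. said Friday it has set up a joint venture"))
print(find_companies("Intel Chairman Andy Grove"))
```

The second call finds nothing, because "Grove" is not a company suffix; a rule keyed on Grove, Forest, or Village would instead have to decide whether the phrase names a place or a person.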

22
  • Basic group handling
  • Noun group, verb group, Preposition, Conjunction

 1 NG Bridgestone Sports Co.
 2 VG said
 3 NG Friday
 4 NG it
 5 VG had set up
 6 NG a joint venture
 7 PR in
 8 NG Taiwan
 9 PR with
10 NG a local concern
11 CJ and
12 NG a Japanese trading house
13 VG to produce
14 NG golf clubs
15 VG to be shipped
16 PR to
17 NG Japan
23
  • Complex phrase handling
  • Company+ SetUp JointVenture (with Company+)?
  • Structure merging
  • Merge structures if the next sentence says
    something about the same event.

24
A brief remark
  • IE works well for a restricted domain
  • Predetermine the Subjects and how they are
    mentioned

25
Applications
26
  • Table Reading
  • Citation Extraction
  • Chinese NER

27
Semantic Search on Internet Tabular Information
Extraction for Answering Queries
  • CIKM 2000

28
Table Reading
Gives an algorithm to interpret tables of the type
shown below, where some cells span multiple
rows or columns.
An example of interpretation is (Attribute) -> (Value):
(Adult-Price-Single Room-Economic class) -> 35,450
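To make the attribute-path reading concrete, here is a small sketch that turns a table with spanning header cells into (attribute path, value) pairs. It is not the algorithm from the paper, which relies on tagging and layout recognition; it assumes the header rows and row labels are already known, and every value except 35,450 is made up for illustration.

```python
def fill_spans(cells):
    """Replace None (a cell covered by a span) with the nearest value to its left."""
    filled, last = [], None
    for cell in cells:
        if cell is not None:
            last = cell
        filled.append(last)
    return filled


def interpret(col_headers, row_headers, values):
    """Map each value cell to its attribute path (column labels, then row label)."""
    header_rows = [fill_spans(row) for row in col_headers]
    result = {}
    for i, row_label in enumerate(row_headers):
        for j, value in enumerate(values[i]):
            path = [header[j] for header in header_rows] + [row_label]
            result["-".join(path)] = value
    return result


# Hypothetical table: "Adult" and "Price" each span both columns.
col_headers = [["Adult", None], ["Price", None], ["Single Room", "Double Room"]]
row_headers = ["Economic class", "Business class"]
values = [["35,450", "32,500"], ["40,000", "38,000"]]

table = interpret(col_headers, row_headers, values)
print(table["Adult-Price-Single Room-Economic class"])
```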
29
Table Reading
30
Method
Ambiguous
Tagging
Relations of Cells
Layout Recognition
Layout Transformation
31
Method
Tagging
Layout Identifying
Layout Trans.
32
Airline Schedule Ontology
33
Tagging
C Departure City
I Departure City
34
Four Relations of Table Cells
  • Relations of Concept - Instances
  • Concept - Instance of the Concept
  • Concept - Descendant Concept
  • Concept - Instance of Descendant Concept
  • Instance - Instance of the same Concept

35
Layout Recognition
C-I Table
Layout Descriptions
Template Matching
Defined by Layout Syntax Grammar
Matched Layout Description
36
Layout Transformation
Origin Layout Description
Destination Layout Description
37
Experiments
  • 23 tables from 23 web pages
  • 13 2-dimension tables, 10 complex tables
  • A table counts as a success only if no cell is
    missed; any miss counts as a failure

38
Conclusion and Future Work
  • Layout Transformation from complex tables to
    simple tables (1D, 2D).
  • A general approach
  • 1. Tagging
  • 2. Semantic Layout Recognition
  • 3. Layout Transformation
  • Ambiguity is reduced by checking cell relations

39
Reference
  • Huei-Long Wang, Shih-Hung Wu, I. C. Wang,
    Cheng-Lung Sung, W. L. Hsu, W. K. Shih, Semantic
    Search on Internet Tabular Information Extraction
    for Answering Queries, Ninth International
    Conference on Information and Knowledge
    Management (CIKM-2000), McLean, VA, November
    6-11, 2000. pp. 243-249. (EI)
  • H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, Mining
    Tables from Large Scale HTML Texts, in Proc. 18th
    International Conference on Computational
    Linguistics, Saarbrücken, Germany, July 2000.

40
A Knowledge-based Approach to Citation Extraction
  • IRI-2005

41
Introduction
  • Integration of the bibliographical information of
    scholarly publications available on the Internet
  • Accurate reference metadata extraction from
    heterogeneous reference sources.
  • We propose a knowledge-based approach to
    reference metadata extraction
  • INFOMAP ontological knowledge representation
    framework
  • Automatically extract the reference metadata.

42
Proposed Approach
43
Reference Data Collection
Phase 1
  • Journal Spider (journal agent)
  • collect journal data from the Journal Citation
    Reports (JCR) indexed by the ISI and digital
    libraries on the Web.
  • Citation data source
  • ISI Web of Science
  • DBLP
  • CiteSeer
  • PubMed

44
Domain Knowledge
Phase 2
45
INFOMAP
  • INFOMAP as ontological knowledge representation
    framework
  • extracts important citation concepts from a
    natural language text.
  • Feature of INFOMAP
  • represent and match complicated template
    structures
  • hierarchical matching
  • regular expressions
  • semantic template matching
  • frame (non-linear relations) matching
  • Using INFOMAP, we can extract author, title,
    journal, volume, number (issue), year, and page
    information from different kinds of reference
    formats or styles.

46
Reference Metadata Extraction
Phase 3
Journal reference style | Reference style example
Bioinformatics style (BIOI) | Davenport, T., DeLong, D., Beers, M. (1998) Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
ACM style (ACM) | 1. Davenport, T., DeLong, D. and Beers, M. 1998. Successful knowledge management projects. Sloan Management Review, 39 (2). 43-57.
IEEE style (IEEE) | [1] T. Davenport, D. DeLong, and M. Beers, "Successful knowledge management projects," Sloan Management Review, vol. 39, no. 2, pp. 43-57, 1998.
APA style (APA) | Davenport, T., DeLong, D., Beers, M. (1998). Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
JCB style (JCB) | Davenport, T., DeLong, D., Beers, M. 1998. Successful knowledge management projects. Sloan Management Review 39(2), 43-57.
MISQ style (MISQ) | Davenport, T., DeLong, D., and Beers, M. "Successful knowledge management projects," Sloan Management Review (39:2) 1998, pp 43-57.
Table 1. Examples of different journal reference styles
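The style table shows why a single hand-written pattern does not carry across journals. The sketch below is a plain regular expression for the APA row only; it is not the INFOMAP template matching the slides describe, and the name parse_apa is illustrative.

```python
import re

# Authors (Year). Title. Journal, Volume(Issue), Pages.
APA = re.compile(
    r"^(?P<author>.+?)\s*\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>.+?)\.\s*"
    r"(?P<journal>.+?),\s*(?P<volume>\d+)\((?P<issue>\d+)\),\s*"
    r"(?P<pages>\d+-\d+)\.$"
)


def parse_apa(reference):
    """Return a dict of metadata fields, or None if the line is not APA-like."""
    m = APA.match(reference.strip())
    return m.groupdict() if m else None


apa = ("Davenport, T., DeLong, D., Beers, M. (1998). Successful knowledge "
       "management projects. Sloan Management Review, 39(2), 43-57.")
ieee = ('T. Davenport, D. DeLong, and M. Beers, "Successful knowledge '
        'management projects," Sloan Management Review, vol. 39, no. 2, '
        'pp. 43-57, 1998.')

print(parse_apa(apa))
print(parse_apa(ieee))
```

The same reference in IEEE style yields None, so each additional style would need its own pattern; this is the brittleness a knowledge-based or learned extractor is meant to avoid.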
47
Knowledge-based Reference Metadata Extraction -
Online Service
Phase 4
48
Citation Extraction From Text to BibTeX
  • W. L. Hsu, "The coloring and maximum independent
    set problems on planar perfect graphs," J. Assoc.
    Comput. Machin., (1988), 535-563.
  • W. L. Hsu, "On the general feasibility test of
    scheduling lot sizes for several products on one
    machine," Management Science 29, (1983), 93-105.
  • W. L. Hsu, "The distance-domination numbers of
    trees," Operations Research Letters 1, (3),
    (1982), 96-100.

Figure 3. The system input of knowledge-based RME

@article{
  Author = {W. L. Hsu},
  Title = {The coloring and maximum independent set problems on planar perfect graphs,"},
  Journal = {J. Assoc. Comput. Machin.},
  Volume = {},
  Number = {},
  Pages = {535-563},
  Year = {1988}
}
@article{
  Author = {W. L. Hsu},
  Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"},
  Journal = {Management Science},
  Volume = {29},
  Number = {},
  Pages = {93-105},
  Year = {1983}
}
@article{
  Author = {W. L. Hsu},
  Title = {The distance-domination numbers of trees,"},
  Journal = {Operations Research Letters},
  Volume = {1},
  Number = {3},
  Pages = {96-100},
  Year = {1982}
}

Figure 5. The system output in BibTeX format
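Once the fields have been extracted, producing a BibTeX entry is a formatting step. The following is a minimal sketch of that last step only; the function to_bibtex and the citation key are hypothetical and are not taken from the system shown in the figures.

```python
FIELD_ORDER = ["author", "title", "journal", "volume", "number", "pages", "year"]


def to_bibtex(fields, key="ref"):
    """Format extracted metadata as one BibTeX @article entry.

    Missing fields are written as empty braces.
    """
    lines = ["  %s = {%s}," % (name, fields.get(name, "")) for name in FIELD_ORDER]
    return "@article{" + key + ",\n" + "\n".join(lines) + "\n}"


entry = to_bibtex(
    {
        "author": "W. L. Hsu",
        "title": "The distance-domination numbers of trees",
        "journal": "Operations Research Letters",
        "volume": "1",
        "number": "3",
        "pages": "96-100",
        "year": "1982",
    },
    key="hsu1982",
)
print(entry)
```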
49
System Input (Plain text)
System Output
Output BibTex
Figure 6. The online service of knowledge-based
RME (http://bioinformatics.iis.sinica.edu.tw/CitationAgent/)
50
Experimental Results and Discussion
  • Experimental data
  • We used EndNote to collect Bioinformatics
    citation data for 2004 from PubMed.
  • A total of 907 bibliography records were
    collected from PubMed digital libraries on the
    Web.
  • Reference testing data was generated for each of
    the six reference styles (BIOI, ACM, IEEE, APA,
    MISQ, and JCB).
  • Randomly selected 500 records for testing from
    each of the six reference styles.

51
Experimental results of citation extraction from
six reference styles
52
Example Results
53
Field   | Field Relation Structure     | Percentage (%)
Author  | <Author><Year>               | 54.29
        | <Author><Title>              | 42.86
        | N/A                          | 2.85
Year    | <Author><Year><Title>        | 48.57
        | <Journal><Year><Volume>      | 20.00
        | <Issue><Year><Pages>         | 14.29
        | <Author><Year><Journal>      | 5.71
        | <Pages><Year>                | 2.86
        | <Volume><Year><Pages>        | 2.86
        | N/A                          | 5.71
Title   | <Year><Title><Journal>       | 48.57
        | <Author><Title><Journal>     | 42.86
        | N/A                          | 8.57
Journal | <Title><Journal><Volume>     | 71.43
        | <Title><Journal><Year>       | 20.00
        | <Year><Journal><Volume>      | 5.71
        | N/A                          | 2.86
Volume  | <Journal><Volume><Pages>     | 40.00
        | <Journal><Volume><Issue>     | 31.43
        | <Year><Volume><Issue>        | 14.29
        | <Year><Volume><Pages>        | 5.71
        | <Journal><Volume><Volume>    | 2.86
        | <Journal><Volume><Year>      | 2.86
        | N/A                          | 2.85
Issue   | <Volume><Issue><Pages>       | 34.29
        | <Volume><Issue><Year>        | 14.29
        | N/A                          | 51.42
Pages   | <Volume><Pages>              | 42.86
        | <Issue><Pages>               | 34.29
        | <Year><Pages>                | 17.14
        | <Volume><Pages><Year>        | 2.86
        | N/A                          | 2.85
The various structures of different styles
(analysis of the structures of 30 reference styles)
54
Comparison with related works
  • Knowledge-based approach
  • Our proposed knowledge-based method for scholarly
    publications can extract reference information
    from 907 records in various reference styles with
    a high degree of precision
  • the overall average field accuracy is 97.87% for
    the six major styles listed in Table 1
  • 98.20% for the MISQ style
  • 87% for 30 other randomly selected styles

55
Conclusions
  • Citation extraction is a challenging problem
  • The diverse nature of reference styles
  • We have proposed a knowledge-based citation
    extraction method for scholarly publications.
  • The experimental results indicate that, by using
    INFOMAP, we can extract author, title, journal,
    volume, number (issue), year, and page
    information from different reference styles with
    a high degree of precision.
  • The overall average field accuracy of citation
    extraction is 97.87% for the six major reference
    styles.

56
Future Research
  • Integrate the ontological and the machine
    learning approaches to boost the performance of
    citation information extraction
  • Maximum-Entropy Method (MEM)
  • Hidden Markov Model (HMM)
  • Conditional Random Fields (CRF)
  • Support Vector Machines (SVM)

57
Reference
  • Min-Yuh Day, Tzong-Han Tsai, Cheng-Lung Sung,
    Cheng-Wei Lee, Shih-Hung Wu, Chorng-Shyong Ong,
    and Wen-Lian Hsu, A Knowledge-based Approach to
    Citation Extraction, to appear in Proceedings of
    IEEE International Conference on Information
    Reuse and Integration (IEEE IRI-2005), pp.50-55.
    (EI)

58
Chinese Named Entity Recognition Using a Hybrid
Approach of Machine Learning and Domain Knowledge
  • ROCLING 2003, CLCLP 2004

59
Named Entity Recognition
  • ??????,??????Named Entity
  • ???????????????
  • <??>???</??>???<??>??</??>??<???>????</???>

60
Sequential Labeling
Token-based
??? ?? ? ?? ?? ????
Per Loc Org
Character-based
? ? ? ? ? ? ? ? ? ? ? ? ? ?
B-P I-P I-P B-L I-L B-O I-O I-O I-O
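Character-based labeling assigns one tag per character: B- for the first character of an entity and I- for the rest, with P, L, and O standing for person, location, and organization as on the slide. The sketch below shows the conversion from entity spans to tags. The example sentence is made up, since the original Chinese text did not survive, and the plain "O" tag for characters outside any entity is an assumption (it is distinct from B-O and I-O, which mark organizations).

```python
def bio_tags(text, spans):
    """Return one tag per character of `text`.

    `spans` is a list of (start, end, label) with `end` exclusive.
    """
    tags = ["O"] * len(text)  # "O" = outside any entity (assumed convention)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


# Hypothetical sentence: a three-character person name, a verb,
# and a two-character location name.
sentence = "王小明到台北"
tags = bio_tags(sentence, [(0, 3, "P"), (4, 6, "L")])
for ch, tag in zip(sentence, tags):
    print(ch + "/" + tag)
```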
61
Machine Learning???????
  1. ??????????named entity
  2. ?????corpus
  3. ????corpus?????target named entity,
    ????corpus???????????.
  4. ???????NER????
  5. ????????NER?????

62
Hybrid NER method
  • Domain knowledge
  • ??, ????, ?????, ????
  • Machine Learning
  • SVM, Bigram/Trigram Model
  • Hybrid
  • Maximum-Entropy Framework
  • Domain knowledge serves as features

63
Static knowledge is insufficient
  • New names
  • ???
  • ????SARS???
  • ??????
  • Ambiguity
  • ????????
  • ??????????
  • Context dependence
  • ???????
  • ????????

64
Pure machine learning might suffer
  • Lack of context information
  • ???Window Size
  • ???????token????tag??
  • ?????????NE????
  • ??????????NER??
  • ???????????
  • ???????, ?NER???????

65
Basic Concepts of Our ME-based Hybrid Approach
  • ?????NE??????????Context Information
  • Internal/External Features
  • ??Training Data?????Feature???, ??????confidence

66
Internal/External Features
  • Internal
  • Found within the name string itself
  • e.g., ? ? ? ? ? ? ?
  • External
  • Context
  • e.g., ? ? ? ?

??
????
67
Tag Set (outcome)
  • ???????Character??Token, ????Named Entity????,
    ??, ???
  • Tag Set
  • ?/B-P ?/I-P ?/I-P
  • ?/B-L ?/I-L ?/I-L
  • ?/B-O ?/I-O ?/I-O

68
ME-based NER Framework-Feature Representation
  • For example
  • ???token????, ??????????????
  • ??????????

Feature f is active!!
69
ME-based NER Framework-Training
  • Given a set of features and a training corpus
  • The ME estimation process produces a model in
    which every feature fi has a weight ai.
  • Then we are allowed to compute
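The equation that followed on this slide did not survive extraction. As an assumption, the standard maximum-entropy form consistent with the description here (each feature f_i has a weight a_i, and the weights of the active features are combined) is shown below, where h is the history (context), o is the outcome (tag), and Z(h) is a normalizing constant. This is presumably what the later slide calls Equation 1, but that is not confirmed by the transcript.

```latex
p(o \mid h) = \frac{1}{Z(h)} \prod_{i} a_i^{f_i(h, o)},
\qquad
Z(h) = \sum_{o'} \prod_{i} a_i^{f_i(h, o')}
```

With binary features, f_i(h, o) is 1 when feature i is active and 0 otherwise, so the product reduces to the product of the weights of the active features.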

70
ME-based NER Framework-Decoding
  • Tokenize the text and preprocess the testing
    sentence
  • For each token, check which features are active
    and combine the ai of active features according
    to Equation 1
  • A Viterbi search is run to find the highest
    probability path
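The decoding step can be sketched as a Viterbi search over per-token tag probabilities. The version below is a simplified illustration rather than the authors' implementation: it assumes the model has already produced a probability for every tag at every position (all strictly positive), and it uses the search only to rule out invalid sequences, namely an I-X tag that does not follow B-X or I-X.

```python
import math


def valid(prev, cur):
    """I-X may only follow B-X or I-X; any other transition is allowed."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def viterbi(probs):
    """Return the most probable valid tag path.

    probs[t][tag] is the model's probability of `tag` at position t.
    """
    tags = list(probs[0])
    # A sequence may not start with an I- tag.
    best = {
        t: (-math.inf if t.startswith("I-") else math.log(probs[0][t]))
        for t in tags
    }
    back = []
    for step in probs[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            # Best-scoring predecessor among those allowed before `cur`.
            score, prev = max((best[p], p) for p in tags if valid(p, cur))
            new_best[cur] = score + math.log(step[cur])
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final tag.
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


probs = [
    {"B-P": 0.3, "I-P": 0.1, "O": 0.6},
    {"B-P": 0.1, "I-P": 0.7, "O": 0.2},
]
print(viterbi(probs))
```

Picking the best tag independently at each position would give O followed by I-P, which is not a valid sequence; the search returns B-P, I-P instead.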

71
Hybrid NER Example
  • ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
  • The NER problem is formulated as maximizing
    p(o|h) and finding the corresponding outcome o

Ps
Ls
Context (History)
Feature 1 ??
Os
W0 the current token
72
Advantages of Hybrid NER
  • ????????, ??????????.
  • ??????????????????????????
  • ???????????????, ?????????????
  • ??????????????Performance
  • ???????????

73
Experiment-Data Set
United Daily News (December, 2002)
Domain         | Number of Named Entities | Size (in characters)
               | PER     LOC     ORG      |
Local News     | 84      139     97       | 11835
Social Affairs | 310     287     354      | 37719
Investment     | 20      63      33       | 14397
Politics       | 419     209     233      | 17168
Headline News  | 267     70      243      | 19938
Business       | 142     186     187      | 25815
Total          | 1242    954     1147     | 126872
74
Experiment Result
  • Use domain knowledge only

NE    | P(%)  | R(%)  | F(%)
PER   | 72.98 | 97.93 | 83.63
LOC   | 67.96 | 74.67 | 71.16
ORG   | 95.77 | 64.07 | 76.78
Total | 75.62 | 82.13 | 78.74
  • ME-based Hybrid

NE    | P(%)  | R(%)  | F(%)
PER   | 97.94 | 87.39 | 92.36
LOC   | 78.60 | 69.35 | 73.69
ORG   | 94.39 | 62.57 | 75.25
Total | 90.56 | 73.70 | 81.26
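The F column in both tables is consistent with the balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). A short check, with the function name f_measure chosen for illustration:

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values in percent, taken from the PER rows and the hybrid Total row.
print(round(f_measure(72.98, 97.93), 2))  # domain knowledge only, PER
print(round(f_measure(97.94, 87.39), 2))  # ME-based hybrid, PER
print(round(f_measure(90.56, 73.70), 2))  # ME-based hybrid, Total
```

Because the harmonic mean sits closer to the lower of the two values, the hybrid system's drop in recall (82.13 to 73.70 overall) limits its F gain even though precision rises from 75.62 to 90.56.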
75
Performance Comparison
Sys.      | Person           | Location         | Organization     | Overall
          | P    R    F      | P    R    F      | P    R    F      | P    R    F
NTU (98)  | 74   91   81.6   | 69   78   73.2   | 85   78   81.3   | 77   83   79.9
KRDL (98) | 66.4 92   77.1   | 89   90.9 90     | 89.5 87.8 88.6   | 85.2 90.2 87.6
IASL (03) | 92.1 83.3 87.5   | 88.1 81.8 84.9   | 93.3 88.7 90.9   | 90.4 85   87.7
Corpus: MET2 dataset. Number of entities: 3646
76
Conclusion and Future Work
  • Conclusion
  • Hybrid Approach???????????????????
  • Hybrid Approach?Precision?????????Improvement,
  • Hybrid ????Improvement??, ????????????
  • Future Work
  • ???????Named Entity???Features
  • ????????????
  • ??Named Entity???
  • Multi Iteration NER
  • Hierarchical Named Entity???

77
References
  • [Tsai 2003] Tzong-Han Tsai, Shih-Hung Wu, and
    Wen-Lian Hsu, "Mencius: A Chinese Named Entity
    Recognizer Using Hybrid Model," in Proceedings
    of the Fifteenth Research on Computational
    Linguistics International Conference
    (ROCLING XV), pp. 193-209, 2003.
  • [Tsai 2004] Tzong-Han Tsai, Shih-Hung Wu, and
    Wen-Lian Hsu, "Mencius: A Chinese Named Entity
    Recognizer Based on a Maximum Entropy Framework,"
    Computational Linguistics and Chinese Language
    Processing, Vol. 9, No. 1, pp. 65-82, 2004.
  • [Shih 2004] Cheng-Wei Shih, Tzong-Han Tsai,
    Shih-Hung Wu, Chiu-Chen Hsieh, and Wen-Lian Hsu,
    "The Construction of a Chinese Named Entity
    Tagged Corpus: CNEC1.0," in Proceedings of the
    Sixteenth Conference on Computational
    Linguistics and Speech Processing (ROCLING XVI),
    pp. 305-313, 2004.