Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation

1
Information Extraction Based on Extraction
Ontologies: Design, Deployment and Evaluation
  • Martin Labský, Vojtěch Svátek
  • Dept. of Knowledge Engineering, UEP
  • {labsky, svatek}@vse.cz
  • AI Seminar, November 13th, 2008

2
Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

3
Example apps of Web IE (1/5) online products
4
Example apps of Web IE (2/5) contact information
5
Example apps of Web IE (3/5) seminars, events
6
Example apps of Web IE (4/5) bike products
7
Example apps of Web IE (5/5)
  • Store the extracted results in a DB to enable
    structured search over documents
  • information retrieval
  • database-like querying
  • e.g. online product search engine,
  • e.g. building a contact DB
  • Support for web page quality assessment
  • involved in the EU project MedIEQ to support
    medical website accreditation agencies
  • Source documents
  • internet, intranet, emails
  • can be very diverse

8
Agenda
  • Example applications of Web IE
  • Difficulties in practical IE applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

9
Difficulties in practical applications (1/3)
  • Requirements
  • quickly prototype IE applications
  • not necessarily with the best accuracy initially
  • often needed for a proof-of-concept application
  • then more work can be done to boost accuracy
  • the extraction model changes
  • meaning of to-be-extracted items may shift,
  • new items are often added or removed

10
Difficulties in practical applications (2/3)
  • Purely manual rules
  • writing extraction rules manually does not scale
    when more complex extraction rules need to be
    encoded
  • not easy to combine with trained models when
    training data become available in later phases
  • Training data
  • trainable IE systems often require large amounts
    of training data; these are typically not
    available for the desired task
  • when training data is collected, it is not easy
    to adapt it to modified or additional criteria
  • Wrappers
  • cannot rely on wrapper-only systems when
    extracting from multiple websites
  • non-wrapper systems often do not utilize regular
    formatting cues

11
Difficulties in practical applications (3/3)
  • It seems promising to exploit, at the same time,
  • extraction knowledge from domain experts
  • training data
  • formatting regularities

12
Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

13
Extraction ontologies
  • An extraction ontology is a part of a domain
    ontology transformed to suit extraction needs
  • Contains classes composed of attributes
  • more like UML class diagrams, less like
    ontologies where e.g. relations are standalone
  • also contains axioms related to classes or
    attributes
  • Classes and attributes are augmented with
    extraction evidence
  • manually provided patterns for content and
    context
  • axioms
  • value or length ranges
  • links to trained models

[Diagram: example class Person with attribute cardinalities name (1), degree (0-5), email (0-2), phone (0-3), and a related Responsible element]
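
To make the structure concrete, here is a minimal, hypothetical sketch of how a class with attribute cardinalities and attached extraction evidence could be represented; the field names, patterns, and the axiom string are illustrative only and do not reproduce the actual schema of the Ex tool.

from dataclasses import dataclass, field

@dataclass
class Attribute:
    # One extractable attribute together with its extraction evidence.
    name: str
    cardinality: tuple                                     # (min, max) occurrences per instance
    content_patterns: list = field(default_factory=list)  # patterns matching the value itself
    context_patterns: list = field(default_factory=list)  # patterns on the surrounding text
    length_range: tuple = None                             # (min, max) value length in words
    classifier: str = None                                 # link to a trained model, if any

@dataclass
class ExtractionClass:
    # A class composed of attributes, plus class-level axioms.
    name: str
    attributes: list
    axioms: list = field(default_factory=list)             # boolean expressions over attributes

# Hypothetical "Person" class mirroring the diagram above.
person = ExtractionClass(
    name="Person",
    attributes=[
        Attribute("name", (1, 1), content_patterns=[r"[A-Z][a-z]+ [A-Z][a-z]+"]),
        Attribute("degree", (0, 5), content_patterns=[r"(Dr|PhD|MSc|Ing)\.?"]),
        Attribute("email", (0, 2), content_patterns=[r"\S+@\S+\.\S+"]),
        Attribute("phone", (0, 3), length_range=(1, 5)),
    ],
    axioms=["email_count + phone_count >= 1"],              # e.g. require at least one contact detail
)
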
14
Extraction evidence provided by domain expert (1)
  • Patterns
  • for attributes and classes
  • for their content and context
  • patterns may be defined at the following levels
  • word and character-level,
  • formatting tag level
  • level of labels (e.g. sentence breaks, POS tags)
  • Attribute value constraints
  • word length constraints, numeric value ranges
  • possible to attach units to numeric attributes
  • Axioms
  • may enforce relations among attributes
  • interpreted using JavaScript scripting language
  • Simple co-reference resolution rules

15
Extraction evidence provided by domain expert (2)
  • Axioms
  • class level
  • attribute level
  • Patterns
  • class content
  • attribute value
  • attribute context
  • class context
  • Value constraints
  • word length
  • numeric value

16
Extraction evidence based on trained models (1)
  • Links to trainable classifiers
  • may classify attributes only
  • binary or multi-class
  • Trained models may use as features
  • simple word level features (word itself, word
    type, possibly POS tags)
  • re-use all evidence provided by expert (patterns,
    axioms, constraints)
  • induced binary features based on word n-grams

[Figure: classifier definition and its usage within the extraction ontology]
17
Extraction evidence based on trained models (2)
  • Data representation for classifiers
  • word sequence (1 word = 1 sample)
  • phrase set (sliding window method)
  • Tested trainable classifiers
  • CRF (Conditional Random Fields)
    http://crfpp.sourceforge.net
  • algorithms from the Weka machine learning toolkit
  • SVM (Support Vector Machine)
  • JRip (rule induction)
  • http://www.cs.waikato.ac.nz/ml/weka
  • Hidden Markov Model extractor
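
As an illustration of the two data representations listed above, the sketch below contrasts one-word-per-sample tagging with sliding-window phrase generation; the window size and token handling are assumptions, not the system's actual settings.

def word_sequence_samples(tokens, labels):
    # Word-sequence representation: one token = one classification sample.
    return list(zip(tokens, labels))

def sliding_window_phrases(tokens, max_len=3):
    # Phrase-set representation: every span of up to max_len tokens is a candidate.
    phrases = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            phrases.append((start, end, " ".join(tokens[start:end])))
    return phrases

tokens = ["Dr.", "John", "Burns", ",", "CMU"]
print(sliding_window_phrases(tokens, max_len=2))
# -> spans such as (1, 3, 'John Burns') that a binary classifier can accept or reject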

18
Extraction evidence based on trained models (3)
  • Feature induction
  • candidate features are all word n-grams of given
    lengths occurring inside or near training
    attribute values
  • pruning parameters
  • point-wise mutual information thresholds
  • minimal absolute occurrence count
  • maximum number of features
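
A simplified sketch of this feature induction, assuming documents come with labeled attribute spans; the thresholds and the notion of "near" (within n tokens of a value) are illustrative stand-ins for the actual pruning parameters.

import math
from collections import Counter

def induce_ngram_features(documents, n=2, min_count=3, pmi_threshold=2.0, max_features=200):
    # documents: list of (tokens, attribute_spans); attribute_spans are (start, end) offsets.
    # Returns n-grams that co-occur strongly with labeled attribute values.
    ngram_counts = Counter()      # n-gram frequency anywhere in the corpus
    near_counts = Counter()       # n-gram frequency inside or near attribute values
    total = total_near = 0
    for tokens, spans in documents:
        near = set()
        for start, end in spans:
            near.update(range(max(0, start - n), min(len(tokens), end + n)))
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngram_counts[gram] += 1
            total += 1
            if any(j in near for j in range(i, i + n)):
                near_counts[gram] += 1
                total_near += 1
    if total_near == 0:
        return []
    scored = []
    for gram, c in near_counts.items():
        if c < min_count:                     # minimal absolute occurrence count
            continue
        # pointwise mutual information between "gram occurs" and "position is near a value"
        pmi = math.log((c / total_near) / (ngram_counts[gram] / total))
        if pmi >= pmi_threshold:
            scored.append((pmi, gram))
    scored.sort(reverse=True)
    return [gram for _, gram in scored[:max_features]]    # maximum number of features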

19
Probabilistic model to combine evidence
  • Each piece of evidence E is equipped with 2
    probability estimates with respect to predicted
    attribute A
  • evidence precision P(A|E) ... prediction
    confidence
  • evidence coverage P(E|A) ... necessity of the
    evidence (support)
  • Each attribute is assigned some low prior
    probability P(A)
  • Let E1, ..., En be the set of evidence applicable
    to A
  • Assume conditional independence among the Ei
  • Using Bayes' formula, we compute P(A | its
    evidence values); see the sketch below
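
The formula itself did not survive the slide export; the following naive-Bayes style reconstruction in LaTeX only illustrates what the stated assumptions (precision P(A|Ei), coverage P(Ei|A), prior P(A), conditional independence) imply, and may differ from the authors' exact formulation.

P(A \mid E_1, \dots, E_n) =
  \frac{P(A) \prod_{i=1}^{n} P(E_i \mid A)}
       {P(A) \prod_{i=1}^{n} P(E_i \mid A) + P(\neg A) \prod_{i=1}^{n} P(E_i \mid \neg A)}

% each P(E_i | A) is the supplied coverage; the negative-case term can be
% rewritten from the supplied precision and prior via
P(E_i \mid \neg A) = \frac{P(E_i)\,(1 - P(A \mid E_i))}{1 - P(A)},
\qquad
P(E_i) = \frac{P(E_i \mid A)\, P(A)}{P(A \mid E_i)}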

20
Extraction vs. domain ontologies
  • When existing domain ontologies are available
  • identify relevant parts
  • reuse classes, attributes, cardinalities, some
    axioms
  • Transformation rules
  • reused parts of domain ontology may require
    transformation to fit into extraction ontology
  • because extraction ontologies focus on the way
    content is presented rather than on its semantics
  • we identified typical transformation rules that
    can be used to transform parts of OWL-encoded
    ontologies

21
Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

22
The extraction process (1/5)
  • Tokenize, build HTML formatting tree, apply
    sentence splitter, POS tagger
  • Match patterns
  • Apply trained models
  • Create Attribute Candidates (ACs)
  • For each created AC, compute its probability P(AC)
    using the evidence combination model
  • prune ACs below a probability threshold
  • build document AC lattice, score ACs by log P(AC)

[Figure: fragment of an AC lattice built over the phrase "Washington, DC"]
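
A sketch of the pruning and scoring step, assuming each AC already carries the probability computed from its evidence; the threshold value and dictionary fields are hypothetical.

import math

def prune_and_score_acs(attribute_candidates, min_prob=0.05):
    # attribute_candidates: list of dicts with 'start', 'end' token offsets,
    # 'attribute' name and 'prob' (the P(AC) estimated from the combined evidence).
    # Drops low-probability ACs and attaches the log-probability used as lattice edge score.
    lattice_edges = []
    for ac in attribute_candidates:
        if ac["prob"] < min_prob:                 # prune ACs below threshold
            continue
        lattice_edges.append({**ac, "score": math.log(ac["prob"])})
    return sorted(lattice_edges, key=lambda ac: (ac["start"], ac["end"]))
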
23
The extraction process (2/5)
  • Evaluate coreference resolution rules for each
    pair of ACs
  • e.g. Dr. Burns ↔ John Burns
  • possible coreferring groups are remembered
    in the attribute values section
  • Compute the best scoring path BP through AC
    lattice
  • using dynamic programming
  • Run wrapper induction algorithm using all AC ∈ BP
  • wrapper induction algorithm described in next
    slides
  • if new local patterns are induced, apply them to
  • rescore existing ACs
  • create new ACs
  • update AC lattice, recompute BP
  • Terminate here if no instances are to be
    generated
  • output all AC ∈ BP (n-best paths supported)
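
A minimal dynamic-programming sketch of the best-path computation over non-overlapping ACs; it assumes each AC's score expresses how much it improves on the background over its span (in the real lattice, background tokens carry their own log-probability scores, so this is a simplification).

def best_path(acs, num_tokens):
    # acs: list of (start, end, score) spans; score > 0 means the AC beats the background.
    # Returns (best total score, chosen non-overlapping ACs) via dynamic programming.
    best = [(0.0, [])] + [None] * num_tokens       # best[i] covers the first i tokens
    ends_at = {}
    for ac in acs:
        ends_at.setdefault(ac[1], []).append(ac)
    for i in range(1, num_tokens + 1):
        best[i] = best[i - 1]                      # leave token i-1 to the background
        for start, end, score in ends_at.get(i, []):
            candidate = best[start][0] + score
            if candidate > best[i][0]:
                best[i] = (candidate, best[start][1] + [(start, end, score)])
    return best[num_tokens]

# e.g. best_path([(0, 2, 1.2), (1, 3, 2.5)], num_tokens=4) keeps only the higher-scoring
# of the two overlapping spans, (1, 3, 2.5)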

24
The extraction process (3/5)
  • Generate Instance Candidates (ICs) bottom-up
  • triangular trellis used to store partial ICs
  • when scoring new ICs, only consider axioms and
    patterns that can already be applied to the IC;
    validity is not required
  • pruning parameters: absolute and relative beam
    size at each trellis node, maximum number of ACs
    that can be skipped, minimum IC probability (a toy
    sketch of the beam pruning follows below)
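
A toy sketch of the beam pruning applied at one trellis node; the parameter values and the dictionary representation of partial ICs are assumptions.

def prune_trellis_node(ics, abs_beam=10, rel_beam=1e-3, min_prob=1e-6):
    # ics: partial instance candidates at one trellis node, each a dict with 'prob'.
    # Keeps at most abs_beam ICs, drops those far below the best one or below min_prob.
    ics = sorted(ics, key=lambda ic: ic["prob"], reverse=True)[:abs_beam]
    if not ics:
        return []
    best = ics[0]["prob"]
    return [ic for ic in ics if ic["prob"] >= max(rel_beam * best, min_prob)]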

25
The extraction process (4/5)
  • IC generation continued
  • When a new IC is created, its P(IC) is computed
    from 2 components
  • the first reflects member attributes, where |IC|
    is the member attribute count,
  • AC_skip is a non-member AC that is fully or
    partially inside the IC, and
  • P(AC_skip) is the probability of that AC being a
    false positive
  • the second reflects class evidence, where the set
    of evidence known for the class C is combined
    using the same probabilistic model as for ACs
  • The scores are combined using the Prospector
    pseudo-Bayesian method; a sketch follows below
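
The slide's own formulas were not preserved in the transcript; for reference, the textbook form of the Prospector pseudo-Bayesian odds combination for estimates P_1, ..., P_k sharing a common prior P_0 is sketched here, not necessarily the slide's exact instantiation.

O(p) = \frac{p}{1 - p},
\qquad
O_{\mathrm{comb}} = O(P_0) \prod_{j=1}^{k} \frac{O(P_j)}{O(P_0)},
\qquad
P_{\mathrm{comb}} = \frac{O_{\mathrm{comb}}}{1 + O_{\mathrm{comb}}}

In this slide's setting, k = 2: one estimate comes from the member attributes and skipped ACs, the other from the class-level evidence.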

26
The extraction process (5/5)
  • Insert valid ICs into AC lattice
  • Valid ICs were assembled during IC generation
    phase
  • Score of a valid IC reflects all extraction
    evidence of its class
  • All unpruned valid ICs are inserted into the AC
    lattice, scored by log P(IC)
  • The best path BP is calculated through the combined
    IC/AC lattice (n-best supported)
  • the search algorithm allows constraints to be
    defined over the extracted path(s)
  • e.g. min/max count of extracted instances
  • output all ACs and ICs on BP

27
Extraction evidence based on formatting
  • A simple wrapper induction algorithm
  • identify formatting regularities
  • turn them into local context patterns to boost
    contained ACs
  • Assemble distinct formatting subtrees rooted at
    block elements containing ACs from the best path
    BP currently determined by the system
  • For each subtree S and attribute Att, calculate
    the co-occurrence count C(S,Att) and the precision
    prec(Att|S)
  • If both C(S,Att) and prec(Att|S) reach defined
    thresholds, a new local context pattern is
    created, with its precision set to C(S,Att) and
    its recall close to 0 (in order not to harm
    potential singleton ACs)

[Figure: a formatting subtree (TD > A_href > B) learned from known names such as John Doe (jdoe@web.ca) and applied to unknown names such as Argentina Agosto (aa@web.br)]
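
A rough sketch of this induction step, under the assumption that formatting subtrees are serialized to strings and counted over the document; the thresholds and the exact definitions of C(S,Att) and prec(Att|S) approximate the slide's description rather than the system's implementation.

from collections import Counter

def induce_local_patterns(best_path_acs, subtree_occurrences, min_count=3, min_prec=0.8):
    # best_path_acs: list of (attribute, subtree) pairs, subtree being the serialized
    # formatting subtree enclosing the AC on the best path, e.g. "TD > A_href > B".
    # subtree_occurrences: Counter of how often each subtree occurs in the document.
    # Returns (subtree, attribute, precision) triples to become local context patterns.
    cooc = Counter(best_path_acs)                       # C(S, Att)
    patterns = []
    for (attribute, subtree), c in cooc.items():
        total = max(subtree_occurrences[subtree], c)    # guard against missing counts
        prec = c / total                                # approximates prec(Att | S)
        if c >= min_count and prec >= min_prec:
            patterns.append((subtree, attribute, prec))
    return patterns
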
28
Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

29
Experimental results: Seminar announcements
  • 485 English seminar announcement text documents
  • Manual: extraction ontology created based on
    seeing 40 randomly chosen documents, evaluated
    using the remaining 445
  • Manual+CRF: the same extraction ontology equipped
    with a CRF classifier used as further extraction
    evidence; 10-fold cross-validation using the test
    set above

30
Cost of the IE system: Seminar announcements
  • Creation of extraction ontology: 1-2 person-weeks
  • annotate 40 training documents (expect 1-2 days)
  • inspecting examples in 40 documents
  • writing patterns, axioms, iterating
  • Training inductive model in addition to ex.
    ontology
  • 2-3 person-weeks to annotate training data (445
    docs)
  • F-measure improvement of 2 to 6 points
  • ex. ontologies allow for fast flexible
    prototyping (annotation design changes quickly
    reflected)
  • then, for parts of the ex. ontology that need
    accuracy improvement, obtain more training data
    and reuse as features all manual extraction
    evidence already provided

31
Experimental results: Contact information
  • 109 English contact pages, 200 Spanish, 108 Czech
  • Named entity counts: 7000, 5000, and 11000,
    respectively; instances were not labeled
  • Only the domain expert's evidence and formatting
    pattern induction were used
  • Domain expert saw 30 randomly chosen documents,
    the rest was test data
  • Instance extraction done but not evaluated
  • Instance grouping
  • Vilain score: F = 60-70
  • Vilain recall: fraction of correct links recovered
  • Vilain precision: fraction of recovered links that
    are correct

32
Experimental results: Bicycle descriptions
  • Hidden Markov Model
  • Trigram, naive topology
  • 103 labeled web pages, 12346 named entities,
  • Instances not labeled; instance extraction done
    but not evaluated
  • Single HMM for all extracted types
  • 1 Background state
  • 1 Target, 1 Prefix and 1 Suffix state type for
    each extracted slot
  • 1 + 3N states in total

[Figure: HMM topology with a shared Background (B) state and Prefix (P), Target (T), Suffix (S) states for each slot]
33
Bicycle structured search interface
34
Future work
  • Attempt to improve a seed extraction ontology by
    bootstrapping using relevant pages retrieved from
    the Internet
  • Adapt the structure of extraction ontology
    according to data
  • e.g. add new attributes to represent product
    features

35
Conclusions
  • Tool and tutorial available at
  • http://eso.vse.cz/labsky/ex/
  • Presented an extraction ontology approach to
  • allow for fast prototyping of IE applications
  • accommodate extraction schema changes easily
  • utilize all available forms of extraction
    knowledge
  • the domain expert's knowledge
  • training data
  • formatting regularities found in web pages
  • Results
  • indicate that extraction ontologies can serve as
    a quick prototyping tool
  • accuracy of the prototyped ontology can be
    improved when training data become available

36
Acknowledgements
  • The research was partially supported by the EC
    under contract FP6-027026, Knowledge Space of
    Semantic Inference for Automatic Annotation and
    Retrieval of Multimedia Content (K-Space).
  • The medical website application is carried out in
    the context of the EC-funded (DG-SANCO) project
    MedIEQ.