Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation

1
Information Extraction Based on Extraction
Ontologies: Design, Deployment and Evaluation
  • Martin Labský, Vojtěch Svátek
  • Dept. of Knowledge Engineering, UEP
  • {labsky, svatek}@vse.cz
  • AI Seminar, November 13th, 2008

2
Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

3
Example apps of Web IE (1/5) online products
4
Example apps of Web IE (2/5) contact information
5
Example apps of Web IE (3/5) seminars, events
6
Example apps of Web IE (4/5) bike products
7
Example apps of Web IE (5/5)
  • Store the extracted results in a DB to enable
    structured search over documents
  • information retrieval
  • database-like querying
  • e.g. online product search engine,
  • e.g. building a contact DB
  • Support for web page quality assessment
  • involved in the EU project MedIEQ to support
    medical website accreditation agencies
  • Source documents
  • internet, intranet, emails
  • can be very diverse

8
Agenda
  • Example applications of Web IE
  • Difficulties in practical IE applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

9
Difficulties in practical applications (1/3)
  • Requirements
  • quickly prototype IE applications
  • not necessarily with the best accuracy initially
  • often needed for a proof-of-concept application
  • then more work can be done to boost accuracy
  • the extraction model changes
  • meaning of to-be-extracted items may shift,
  • new items are often added or removed

10
Difficulties in practical applications (2/3)
  • Purely manual rules
  • writing extraction rules manually does not scale
    when more complex extraction rules need to be
    encoded
  • not easy to combine with trained models when
    training data become available in later phases
  • Training data
  • trainable IE systems often require large amounts
    of training data; these are typically not
    available for the desired task
  • when training data is collected, it is not easy
    to adapt it to modified or additional criteria
  • Wrappers
  • cannot rely on wrapper-only systems when
    extracting from multiple websites
  • non-wrapper systems often do not utilize regular
    formatting cues

11
Difficulties in practical applications (3/3)
  • It seems promising to exploit, at the same time,
  • extraction knowledge from domain experts
  • training data
  • formatting regularities

12
Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

13
Extraction ontologies
  • An extraction ontology is a part of a domain
    ontology transformed to suit extraction needs
  • Contains classes composed of attributes
  • more like UML class diagrams, less like
    ontologies where e.g. relations are standalone
  • also contains axioms related to classes or
    attributes
  • Classes and attributes are augmented with
    extraction evidence
  • manually provided patterns for content and
    context
  • axioms
  • value or length ranges
  • links to trained models

[Diagram: example class Person with attribute cardinalities name (1), degree (0-5), email (0-2), phone (0-3), and a related Responsible element]
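
To make the structure concrete, here is a minimal, hypothetical sketch of how a class with attribute cardinalities and attached extraction evidence could be represented; the field names, patterns, and the axiom string are illustrative only and do not reproduce the actual schema of the Ex tool.

from dataclasses import dataclass, field

@dataclass
class Attribute:
    # One extractable attribute together with its extraction evidence.
    name: str
    cardinality: tuple                                     # (min, max) occurrences per instance
    content_patterns: list = field(default_factory=list)  # patterns matching the value itself
    context_patterns: list = field(default_factory=list)  # patterns on the surrounding text
    length_range: tuple = None                             # (min, max) value length in words
    classifier: str = None                                 # link to a trained model, if any

@dataclass
class ExtractionClass:
    # A class composed of attributes, plus class-level axioms.
    name: str
    attributes: list
    axioms: list = field(default_factory=list)             # boolean expressions over attributes

# Hypothetical "Person" class mirroring the diagram above.
person = ExtractionClass(
    name="Person",
    attributes=[
        Attribute("name", (1, 1), content_patterns=[r"[A-Z][a-z]+ [A-Z][a-z]+"]),
        Attribute("degree", (0, 5), content_patterns=[r"(Dr|PhD|MSc|Ing)\.?"]),
        Attribute("email", (0, 2), content_patterns=[r"\S+@\S+\.\S+"]),
        Attribute("phone", (0, 3), length_range=(1, 5)),
    ],
    axioms=["email_count + phone_count >= 1"],              # e.g. require at least one contact detail
)
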
14
Extraction evidence provided by domain expert (1)
  • Patterns
  • for attributes and classes
  • for their content and context
  • patterns may be defined at the following levels
  • word and character-level,
  • formatting tag level
  • level of labels (e.g. sentence breaks, POS tags)
  • Attribute value constraints
  • word length constraints, numeric value ranges
  • possible to attach units to numeric attributes
  • Axioms
  • may enforce relations among attributes
  • interpreted using JavaScript scripting language
  • Simple co-reference resolution rules

15
Extraction evidence provided by domain expert (2)
  • Axioms
  • class level
  • attribute level
  • Patterns
  • class content
  • attribute value
  • attribute context
  • class context
  • Value constraints
  • word length
  • numeric value

16
Extraction evidence based on trained models (1)
  • Links to trainable classifiers
  • may classify attributes only
  • binary or multi-class
  • Trained models may use as features
  • simple word level features (word itself, word
    type, possibly POS tags)
  • re-use all evidence provided by expert (patterns,
    axioms, constraints)
  • induced binary features based on word n-grams

[Figure: classifier definition and its usage within the extraction ontology]
17
Extraction evidence based on trained models (2)
  • Data representation for classifiers
  • word sequence (1 word = 1 sample)
  • phrase set (sliding window method)
  • Tested trainable classifiers
  • CRF (Conditional Random Fields)
    http://crfpp.sourceforge.net
  • algorithms from the Weka machine learning toolkit
  • SVM (Support Vector Machine)
  • JRip (rule induction)
  • http://www.cs.waikato.ac.nz/ml/weka
  • Hidden Markov Model extractor
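
As an illustration of the two data representations listed above, the sketch below contrasts one-word-per-sample tagging with sliding-window phrase generation; the window size and token handling are assumptions, not the system's actual settings.

def word_sequence_samples(tokens, labels):
    # Word-sequence representation: one token = one classification sample.
    return list(zip(tokens, labels))

def sliding_window_phrases(tokens, max_len=3):
    # Phrase-set representation: every span of up to max_len tokens is a candidate.
    phrases = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            phrases.append((start, end, " ".join(tokens[start:end])))
    return phrases

tokens = ["Dr.", "John", "Burns", ",", "CMU"]
print(sliding_window_phrases(tokens, max_len=2))
# -> spans such as (1, 3, 'John Burns') that a binary classifier can accept or reject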

18
Extraction evidence based on trained models (3)
  • Feature induction
  • candidate features are all word n-grams of given
    lengths occurring inside or near training
    attribute values
  • pruning parameters
  • point-wise mutual information thresholds
  • minimal absolute occurrence count
  • maximum number of features
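
A simplified sketch of this feature induction, assuming documents come with labeled attribute spans; the thresholds and the notion of "near" (within n tokens of a value) are illustrative stand-ins for the actual pruning parameters.

import math
from collections import Counter

def induce_ngram_features(documents, n=2, min_count=3, pmi_threshold=2.0, max_features=200):
    # documents: list of (tokens, attribute_spans); attribute_spans are (start, end) offsets.
    # Returns n-grams that co-occur strongly with labeled attribute values.
    ngram_counts = Counter()      # n-gram frequency anywhere in the corpus
    near_counts = Counter()       # n-gram frequency inside or near attribute values
    total = total_near = 0
    for tokens, spans in documents:
        near = set()
        for start, end in spans:
            near.update(range(max(0, start - n), min(len(tokens), end + n)))
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngram_counts[gram] += 1
            total += 1
            if any(j in near for j in range(i, i + n)):
                near_counts[gram] += 1
                total_near += 1
    if total_near == 0:
        return []
    scored = []
    for gram, c in near_counts.items():
        if c < min_count:                     # minimal absolute occurrence count
            continue
        # pointwise mutual information between "gram occurs" and "position is near a value"
        pmi = math.log((c / total_near) / (ngram_counts[gram] / total))
        if pmi >= pmi_threshold:
            scored.append((pmi, gram))
    scored.sort(reverse=True)
    return [gram for _, gram in scored[:max_features]]    # maximum number of features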

19
Probabilistic model to combine evidence
  • Each piece of evidence E is equipped with 2
    probability estimates with respect to predicted
    attribute A
  • evidence precision P(A|E) ... prediction
    confidence
  • evidence coverage P(E|A) ... necessity of the
    evidence (support)
  • Each attribute is assigned some low prior
    probability P(A)
  • Let E1, ..., En be the set of evidence applicable
    to A
  • Assume conditional independence among the Ei
  • Using Bayes' formula, we compute P(A | its
    evidence values); see the sketch below
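
The formula itself did not survive the slide export; the following naive-Bayes style reconstruction in LaTeX only illustrates what the stated assumptions (precision P(A|Ei), coverage P(Ei|A), prior P(A), conditional independence) imply, and may differ from the authors' exact formulation.

P(A \mid E_1, \dots, E_n) =
  \frac{P(A) \prod_{i=1}^{n} P(E_i \mid A)}
       {P(A) \prod_{i=1}^{n} P(E_i \mid A) + P(\neg A) \prod_{i=1}^{n} P(E_i \mid \neg A)}

% each P(E_i | A) is the supplied coverage; the negative-case term can be
% rewritten from the supplied precision and prior via
P(E_i \mid \neg A) = \frac{P(E_i)\,(1 - P(A \mid E_i))}{1 - P(A)},
\qquad
P(E_i) = \frac{P(E_i \mid A)\, P(A)}{P(A \mid E_i)}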

20
Extraction vs. domain ontologies
  • When existing domain ontologies are available
  • identify relevant parts
  • reuse classes, attributes, cardinalities, some
    axioms
  • Transformation rules
  • reused parts of domain ontology may require
    transformation to fit into extraction ontology
  • because extraction ontologies focus on the way
    content is presented rather than on its semantics
  • we identified typical transformation rules that
    can be used to transform parts of OWL-encoded
    ontologies

21
Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

22
The extraction process (1/5)
  • Tokenize, build HTML formatting tree, apply
    sentence splitter, POS tagger
  • Match patterns
  • Apply trained models
  • Create Attribute Candidates (ACs)
  • For each created AC, compute its probability P(AC)
    using the evidence combination model
  • prune ACs below a probability threshold
  • build document AC lattice, score ACs by log P(AC)

[Figure: fragment of an AC lattice built over the phrase "Washington, DC"]
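
A sketch of the pruning and scoring step, assuming each AC already carries the probability computed from its evidence; the threshold value and dictionary fields are hypothetical.

import math

def prune_and_score_acs(attribute_candidates, min_prob=0.05):
    # attribute_candidates: list of dicts with 'start', 'end' token offsets,
    # 'attribute' name and 'prob' (the P(AC) estimated from the combined evidence).
    # Drops low-probability ACs and attaches the log-probability used as lattice edge score.
    lattice_edges = []
    for ac in attribute_candidates:
        if ac["prob"] < min_prob:                 # prune ACs below threshold
            continue
        lattice_edges.append({**ac, "score": math.log(ac["prob"])})
    return sorted(lattice_edges, key=lambda ac: (ac["start"], ac["end"]))
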
23
The extraction process (2/5)
  • Evaluate coreference resolution rules for each
    pair of ACs
  • e.g. Dr. Burns ↔ John Burns
  • possible coreferring groups are remembered
    in the attribute values section
  • Compute the best scoring path BP through AC
    lattice
  • using dynamic programming
  • Run wrapper induction algorithm using all AC ∈ BP
  • wrapper induction algorithm described in next
    slides
  • if new local patterns are induced, apply them to
  • rescore existing ACs
  • create new ACs
  • update AC lattice, recompute BP
  • Terminate here if no instances are to be
    generated
  • output all AC ∈ BP (n-best paths supported)
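
A minimal dynamic-programming sketch of the best-path computation over non-overlapping ACs; it assumes each AC's score expresses how much it improves on the background over its span (in the real lattice, background tokens carry their own log-probability scores, so this is a simplification).

def best_path(acs, num_tokens):
    # acs: list of (start, end, score) spans; score > 0 means the AC beats the background.
    # Returns (best total score, chosen non-overlapping ACs) via dynamic programming.
    best = [(0.0, [])] + [None] * num_tokens       # best[i] covers the first i tokens
    ends_at = {}
    for ac in acs:
        ends_at.setdefault(ac[1], []).append(ac)
    for i in range(1, num_tokens + 1):
        best[i] = best[i - 1]                      # leave token i-1 to the background
        for start, end, score in ends_at.get(i, []):
            candidate = best[start][0] + score
            if candidate > best[i][0]:
                best[i] = (candidate, best[start][1] + [(start, end, score)])
    return best[num_tokens]

# e.g. best_path([(0, 2, 1.2), (1, 3, 2.5)], num_tokens=4) keeps only the higher-scoring
# of the two overlapping spans, (1, 3, 2.5)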

24
The extraction process (3/5)
  • Generate Instance Candidates (ICs) bottom-up
  • triangular trellis used to store partial ICs
  • when scoring new ICs, only consider axioms and
    patterns that can already be applied to the IC;
    validity is not required
  • pruning parameters: absolute and relative beam
    size at each trellis node, maximum number of ACs
    that can be skipped, minimum IC probability (a toy
    sketch of the beam pruning follows below)
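
A toy sketch of the beam pruning applied at one trellis node; the parameter values and the dictionary representation of partial ICs are assumptions.

def prune_trellis_node(ics, abs_beam=10, rel_beam=1e-3, min_prob=1e-6):
    # ics: partial instance candidates at one trellis node, each a dict with 'prob'.
    # Keeps at most abs_beam ICs, drops those far below the best one or below min_prob.
    ics = sorted(ics, key=lambda ic: ic["prob"], reverse=True)[:abs_beam]
    if not ics:
        return []
    best = ics[0]["prob"]
    return [ic for ic in ics if ic["prob"] >= max(rel_beam * best, min_prob)]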

25
The extraction process (4/5)
  • IC generation continued
  • When a new IC is created, its P(IC) is computed
    from 2 components
  • the first reflects member attributes, where |IC|
    is the member attribute count,
  • AC_skip is a non-member AC that is fully or
    partially inside the IC, and
  • P(AC_skip) is the probability of that AC being a
    false positive
  • the second reflects class evidence, where the set
    of evidence known for the class C is combined
    using the same probabilistic model as for ACs
  • The scores are combined using the Prospector
    pseudo-Bayesian method; a sketch follows below
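
The slide's own formulas were not preserved in the transcript; for reference, the textbook form of the Prospector pseudo-Bayesian odds combination for estimates P_1, ..., P_k sharing a common prior P_0 is sketched here, not necessarily the slide's exact instantiation.

O(p) = \frac{p}{1 - p},
\qquad
O_{\mathrm{comb}} = O(P_0) \prod_{j=1}^{k} \frac{O(P_j)}{O(P_0)},
\qquad
P_{\mathrm{comb}} = \frac{O_{\mathrm{comb}}}{1 + O_{\mathrm{comb}}}

In this slide's setting, k = 2: one estimate comes from the member attributes and skipped ACs, the other from the class-level evidence.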

26
The extraction process (5/5)
  • Insert valid ICs into AC lattice
  • Valid ICs were assembled during IC generation
    phase
  • Score of a valid IC reflects all extraction
    evidence of its class
  • All unpruned valid ICs are inserted into the AC
    lattice, scored by log P(IC)
  • The best path BP is calculated through the combined
    IC/AC lattice (n-best supported)
  • the search algorithm allows constraints to be
    defined over the extracted path(s)
  • e.g. min/max count of extracted instances
  • output all ACs and ICs on BP

27
Extraction evidence based on formatting
  • A simple wrapper induction algorithm
  • identify formatting regularities
  • turn them into local context patterns to boost
    contained ACs
  • Assemble distinct formatting subtrees rooted at
    block elements containing ACs from the best path
    BP currently determined by the system
  • For each subtree S and attribute Att, calculate
    the co-occurrence count C(S,Att) and the precision
    prec(Att|S)
  • If both C(S,Att) and prec(Att|S) reach defined
    thresholds, a new local context pattern is
    created, with its precision set to C(S,Att) and
    its recall close to 0 (in order not to harm
    potential singleton ACs)

[Figure: a formatting subtree (TD > A_href > B) learned from known names such as John Doe (jdoe@web.ca) and applied to unknown names such as Argentina Agosto (aa@web.br)]
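
A rough sketch of this induction step, under the assumption that formatting subtrees are serialized to strings and counted over the document; the thresholds and the exact definitions of C(S,Att) and prec(Att|S) approximate the slide's description rather than the system's implementation.

from collections import Counter

def induce_local_patterns(best_path_acs, subtree_occurrences, min_count=3, min_prec=0.8):
    # best_path_acs: list of (attribute, subtree) pairs, subtree being the serialized
    # formatting subtree enclosing the AC on the best path, e.g. "TD > A_href > B".
    # subtree_occurrences: Counter of how often each subtree occurs in the document.
    # Returns (subtree, attribute, precision) triples to become local context patterns.
    cooc = Counter(best_path_acs)                       # C(S, Att)
    patterns = []
    for (attribute, subtree), c in cooc.items():
        total = max(subtree_occurrences[subtree], c)    # guard against missing counts
        prec = c / total                                # approximates prec(Att | S)
        if c >= min_count and prec >= min_prec:
            patterns.append((subtree, attribute, prec))
    return patterns
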
28
Agenda
  • Example applications of Web IE
  • Difficulties in practical applications
  • Extraction Ontologies
  • Extraction process
  • Experimental results
  • Future work and Conclusion

29
Experimental results: Seminar announcements
  • 485 English seminar announcement text documents
  • Manual: extraction ontology created based on
    seeing 40 randomly chosen documents, evaluated
    using the remaining 445
  • Manual+CRF: the same extraction ontology equipped
    with a CRF classifier used as further extraction
    evidence; 10-fold cross-validation using the test
    set above

30
Cost of the IE system: Seminar announcements
  • Creation of extraction ontology: 1-2 person-weeks
  • annotate 40 training documents (expect 1-2 days)
  • inspecting examples in 40 documents
  • writing patterns, axioms, iterating
  • Training inductive model in addition to ex.
    ontology
  • 2-3 person-weeks to annotate training data (445
    docs)
  • F-measure improvement of 2 to 6 points
  • ex. ontologies allow for fast flexible
    prototyping (annotation design changes quickly
    reflected)
  • then, for parts of the ex. ontology that need
    accuracy improvement, obtain more training data
    and reuse as features all manual extraction
    evidence already provided

31
Experimental results: Contact information
  • 109 English contact pages, 200 Spanish, 108 Czech
  • Named entity counts: 7000, 5000, and 11000,
    respectively; instances were not labeled
  • Only the domain expert's evidence and formatting
    pattern induction were used
  • Domain expert saw 30 randomly chosen documents,
    the rest was test data
  • Instance extraction done but not evaluated
  • Instance grouping
  • Vilain score: F = 60-70
  • Vilain recall: fraction of correct links recovered
  • Vilain precision: fraction of recovered links that
    are correct

32
Experimental results: Bicycle descriptions
  • Hidden Markov Model
  • Trigram, naive topology
  • 103 labeled web pages, 12346 named entities,
  • Instances not labeled; instance extraction done
    but not evaluated
  • Single HMM for all extracted types
  • 1 Background state
  • 1 Target, 1 Prefix and 1 Suffix state type for
    each extracted slot
  • 1 + 3N states in total

[Figure: HMM topology with a shared Background (B) state and Prefix (P), Target (T), Suffix (S) states for each slot]
33
Bicycle structured search interface
34
Future work
  • Attempt to improve a seed extraction ontology by
    bootstrapping using relevant pages retrieved from
    the Internet
  • Adapt the structure of extraction ontology
    according to data
  • e.g. add new attributes to represent product
    features

35
Conclusions
  • Tool and tutorial available at
  • http://eso.vse.cz/labsky/ex/
  • Presented an extraction ontology approach to
  • allow for fast prototyping of IE applications
  • accommodate extraction schema changes easily
  • utilize all available forms of extraction
    knowledge
  • the domain expert's knowledge
  • training data
  • formatting regularities found in web pages
  • Results
  • indicate that extraction ontologies can serve as
    a quick prototyping tool
  • accuracy of the prototyped ontology can be
    improved when training data become available

36
Acknowledgements
  • The research was partially supported by the EC
    under contract FP6-027026, Knowledge Space of
    Semantic Inference for Automatic Annotation and
    Retrieval of Multimedia Content (K-Space).
  • The medical website application is carried out in
    the context of the EC-funded (DG-SANCO) project
    MedIEQ.