Title: PASCAL CHALLENGE ON EVALUATING MACHINE LEARNING FOR INFORMATION EXTRACTION
1. PASCAL Challenge on Evaluating Machine Learning for Information Extraction
Neil Ireson, Local Challenge Coordinator
Web Intelligence Group, Department of Computer Science, University of Sheffield, UK
2. Organisers
- Sheffield: Fabio Ciravegna
- UCD Dublin: Nicholas Kushmerick
- ITC-IRST: Alberto Lavelli
- University of Illinois: Mary-Elaine Califf
- Fair Isaac: Dayne Freitag
3. Outline
- Challenge Goals
- Data
- Tasks
- Participants
- Experimental Results
- Conclusions
4. Goal: Provide a testbed for comparative evaluation of ML-based IE
- Standardisation
  - Data
    - Partitioning
    - Same set of features
      - Corpus preprocessed using GATE
      - No features allowed other than the ones provided
  - Explicit tasks
  - Evaluation metrics
- For future use
  - Available for further tests with the same or new systems
  - Possible to publish new corpora or tasks
5. Data (Workshop CFPs)
- Training data: 400 workshop CFPs (1993-2000)
- Testing data: 200 workshop CFPs (2000-2005)
10. Annotation Slots
11. Preprocessing
- GATE
  - Tokenisation
  - Part-of-speech
  - Named entities
    - Date, Location, Person, Number, Money
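As a rough illustration of what the standardised preprocessing provides to every participant, the sketch below represents each token as a dictionary of the supplied feature types (string, part of speech, named-entity label) and builds a simple contextual feature list from it. The field names and the windowing function are hypothetical, for illustration only; they are not the challenge's actual feature schema.

```python
# Illustrative only: one preprocessed token with the kinds of features the
# challenge corpus provides (tokenisation, POS tags, named-entity labels).
# Field names are hypothetical, not the challenge's actual schema.
token = {"string": "Sheffield", "pos": "NNP", "ne": "Location"}

def to_feature_vector(tokens, i, window=2):
    """Build a simple feature list for token i from a +/- `window` context."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            t = tokens[j]
            feats.append(f"w[{offset}]={t['string'].lower()}")
            feats.append(f"pos[{offset}]={t['pos']}")
            if t.get("ne"):
                feats.append(f"ne[{offset}]={t['ne']}")
    return feats

print(to_feature_vector([token, {"string": "UK", "pos": "NNP", "ne": "Location"}], 0))
```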
12. Evaluation Tasks
- Task 1 - ML for IE: annotating implicit information
  - 4-fold cross-validation on the 400 training documents
  - Final test on 200 unseen test documents
- Task 2a - Learning curve
  - Effect of increasing amounts of training data on learning
- Task 2b - Active learning: learning to select documents
  - Given seed documents, select the documents to add to the training set
- Task 3a - Semi-supervised learning (given data)
  - Same as Task 1, but the 500 given unannotated documents may be used
- Task 3b - Semi-supervised learning (any data)
  - Same as Task 1, but all available unannotated documents may be used
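A minimal sketch of the Task 1 protocol listed above, assuming only the published corpus sizes (400 training documents, 200 test documents); integer IDs stand in for real documents, and no actual extraction system is included.

```python
# Sketch of the Task 1 protocol: 4-fold cross-validation over the 400
# training documents, then a final train-on-400 / test-on-200 run.
# Integer IDs stand in for documents; training and scoring are not shown.
from sklearn.model_selection import KFold

train_ids = list(range(400))        # 400 annotated workshop CFPs
test_ids = list(range(400, 600))    # 200 unseen workshop CFPs

for fold, (tr, va) in enumerate(KFold(n_splits=4, shuffle=True,
                                      random_state=0).split(train_ids)):
    # train a system on `tr`, score its extractions on `va`
    print(f"fold {fold}: {len(tr)} training docs, {len(va)} validation docs")

# Final run: train on all 400 documents, extract from the unseen test set,
# and score the output with the MUC scorer (slide 13).
print(f"final: {len(train_ids)} training docs, {len(test_ids)} test docs")
```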
13. Evaluation
- Precision / Recall / F1-measure
- MUC Scorer
- Automatic Evaluation Server
- Exact matching
- Extract every slot occurrence
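To make the exact-matching criterion above concrete, here is a hedged sketch of how precision, recall, and F1 could be computed over every extracted slot occurrence. It illustrates the metric definitions only, not the MUC scorer or the evaluation server, and the slot names in the example are invented.

```python
# Illustration of exact-match scoring over slot occurrences; not the MUC
# scorer. A filler is a (doc_id, slot, start, end) tuple and counts as
# correct only if it matches a gold annotation exactly.
from collections import Counter

def score(gold, predicted):
    gold_c, pred_c = Counter(gold), Counter(predicted)
    correct = sum(min(gold_c[k], pred_c[k]) for k in pred_c)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("doc1", "workshopdate", 120, 131), ("doc1", "workshoplocation", 200, 210)]
pred = [("doc1", "workshopdate", 120, 131), ("doc1", "workshoplocation", 198, 210)]
print(score(gold, pred))  # (0.5, 0.5, 0.5): the second filler's span is off
```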
14. Participants
15. Task 1
- Information extraction with all the available data
16. Task 1: Test Corpus
19. Task 1: 4-Fold Cross-validation
20. Task 1: 4-Fold and Test Corpus
21. Task 1: Slot F-measure
22. Best Slot F-measures: Task 1 Test Corpus
23. Task 2a
24. Task 2a Learning Curve: F-measure
25. Task 2a Learning Curve: Precision
26. Task 2a Learning Curve: Recall
27. Task 2b
28. Task 2b: Active Learning
- Amilcare
  - Maximum divergence from the expected number of tags
- Hachey
  - Maximum divergence between two classifiers built on different feature sets
- Yaoyong (Gram-Schmidt)
  - Maximum divergence between examples in the selected subset
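As a rough sketch of the committee-style idea behind the second strategy (two classifiers trained on different feature views, with annotation effort spent where they disagree), the functions below rank unlabelled documents by token-level disagreement. The taggers are passed in as hypothetical callables; this is not the Amilcare, Hachey, or Gram-Schmidt implementation.

```python
# Hedged sketch of committee-based selection: rank unlabelled documents by
# how much two taggers (trained on different feature sets) disagree, and
# send the most-contested documents for annotation. Not the actual systems.

def disagreement(doc_tokens, tag_a, tag_b):
    """Fraction of tokens on which the two taggers assign different labels."""
    labels_a, labels_b = tag_a(doc_tokens), tag_b(doc_tokens)
    diff = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return diff / max(len(doc_tokens), 1)

def select_for_annotation(unlabelled_docs, tag_a, tag_b, batch_size=10):
    """Return the batch of documents the two taggers disagree on most."""
    ranked = sorted(unlabelled_docs,
                    key=lambda doc: disagreement(doc, tag_a, tag_b),
                    reverse=True)
    return ranked[:batch_size]
```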
29. Task 2b Active Learning: Increased F-measure over Random Selection
30. Task 3
- Semi-supervised learning
- (No significant participation)
31. Conclusions (Task 1)
- The top three (four) systems use different algorithms
  - Rule induction, SVM, CRF, HMM
- The same algorithm (SVM) produced different results
- Brittle performance
  - Large variation in per-slot performance
- Post-processing
32. Conclusions (Tasks 2 and 3)
- Task 2a: Learning curve
  - System performance is largely as expected
- Task 2b: Active learning
  - Two approaches, Amilcare and Hachey, showed benefits
- Task 3: Semi-supervised learning
  - Insufficient participation to evaluate the use of unannotated enrichment data
33. Future Work
- Performance differences
  - Systems: what determines good/bad performance?
  - Slots: different systems were better/worse at identifying different slots
- Combine approaches
  - Active learning
  - Semi-supervised learning
  - Overcoming the need for annotated data
- Extensions
  - Data: use different data sets and other features, including (HTML) structured data
  - Tasks: relation extraction
34. Thank You
- http://tyne.shef.ac.uk/Pascal