Title: GATE, Human Language and Machine Learning
1- GATE, Human Language and Machine Learning
- http://gate.ac.uk/ http://nlp.shef.ac.uk/
- Hamish Cunningham, Valentin Tablan,
- Kalina Bontcheva, Diana Maynard
- 9th July 2003
- The Knowledge Economy and Human Language Technology
- GATE: a General Architecture for Text Engineering
- GATE, Information Extraction and Machine Learning
2- 1. The Knowledge Economy and Human Language
- Gartner, December 2002:
- taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
- through 2012 more than 95% of human-to-computer information input will involve textual language
- A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
- The challenge: to reconcile these two opposing tendencies
3- Information Extraction (1): from text to structured data
- Two trends in the early 1990s:
- NLU is too difficult! Restrict the task and increase the performance
- Quantitative measurement (MUC: Message Understanding Conference, ACE: Automatic Content Extraction, TREC: Text Retrieval Conference...) means good estimation of accuracy
- Types of extraction:
- Identify named entities (domain independent):
- Persons
- Dates
- Numbers
- Organizations
- Identify domain-specific events and terms, e.g., if we're processing football:
- Relations: which team a player plays for
- Events: goal, foul, etc.
4- Information Extraction (2)
- MUC-7 tasks:
- NE: Named Entity recognition and typing
- CO: co-reference resolution
- TE: Template Elements
- TR: Template Relations
- ST: Scenario Templates
- Example:
- "The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc."
- NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
- CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
- TE: the rocket is "shiny red" and Head's "brainchild".
- TR: Dr. Head works for We Build Rockets Inc.
- ST: a rocket launching event occurred with the various participants.
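The rocket example can be read as progressively richer structured records, one layer per MUC-7 task. The sketch below is only an illustration of that idea in plain Java: the class name, the map keys and the relation label "works_for" are hypothetical and do not come from MUC-7 or GATE.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for some of the structured output of the rocket example.
class RocketExample {
    // CO: each referring expression mapped to the entity it resolves to.
    static Map<String, String> coreference() {
        Map<String, String> co = new LinkedHashMap<>();
        co.put("It", "rocket");
        co.put("Dr. Big Head", "Dr. Head");
        return co;
    }

    // TE: descriptive attributes attached to one entity (here, the rocket).
    static Map<String, String> rocketTemplateElement() {
        Map<String, String> te = new LinkedHashMap<>();
        te.put("description", "shiny red");
        te.put("brainchild_of", "Dr. Head");
        return te;
    }

    // TR: a (subject, relation, object) triple; the relation label is illustrative.
    static String[] templateRelation() {
        return new String[] {"Dr. Head", "works_for", "We Build Rockets Inc."};
    }

    public static void main(String[] args) {
        System.out.println("CO: " + coreference());
        System.out.println("TE(rocket): " + rocketTemplateElement());
        System.out.println("TR: " + String.join(" ", templateRelation()));
    }
}
```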
5- IE and Knowledge: Closing the Language Loop
- Diagram labels: Human Language; Formal Knowledge (ontologies and instance bases); Semantic Web, Semantic Grid, Semantic Web Services; (M)NLG; OIE; (A)IE; Controlled Language; CLIE
- Key: MNLG = Multilingual Natural Language Generation; OIE = Ontology-aware Information Extraction; AIE = Adaptive IE; CLIE = Controlled Language IE
6- Populating Ontologies with IE
7- Protégé and Ontology Management
8- IE, the bad news: Domain specificity vs. task complexity
- Chart labels: specificity (very general, domain specific); complexity (simple entities, events and relations); acceptable accuracy
9- 2. GATE: Software Architecture for HLT
- Software lifecycle in collaborative research:
- Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
- Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
- Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
- Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
- Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).
10- GATE, a General Architecture for Text Engineering
- An architecture: A macro-level organisational picture for LE software systems.
- A framework: For programmers, GATE is an object-oriented class library that implements the architecture.
- A development environment: For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
- Some free components... ...and wrappers for other people's components
- Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
- Free software (LGPL). Download at http://gate.ac.uk/download/
11- Architectural principles
- Non-prescriptive, theory neutral (strength and weakness)
- Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)
- (Almost) everything is a component, and component sets are user-extendable
- Component-based development:
- An OO way of chunking software: Java Beans
- GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)
- The minimal component: 10 lines of Java, 10 lines of XML, 1 URL.
12- GATE Language Resources
- GATE LRs are documents, ontologies, corpora, lexicons, ...
- Documents / corpora:
- GATE documents loaded from local files or the web...
- Diverse document formats: text, HTML, XML, email, RTF, SGML.
- Processing Resources:
- Algorithmic components known as PRs: beans with execute methods.
- All PRs can handle Unicode data by default.
- Clear distinction between code and data (simple repurposing).
- 20-30 freebies with GATE
- e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
13- Visual Resources
14- Performance Evaluation
- At document level: annotation diff
- At corpus level: corpus benchmark tool, tracking a system's performance over time
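One common way to turn a document-level annotation diff into numbers is to compute precision, recall and F-measure between a key (hand-annotated) set and a response (system) set. The sketch below assumes strict matching on start offset, end offset and type; the class and method names are hypothetical, and this is not GATE's own annotation diff implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class AnnotationDiffSketch {
    // Encode an annotation as "start:end:type" so that set equality means an exact match.
    static String ann(int start, int end, String type) {
        return start + ":" + end + ":" + type;
    }

    // Returns {precision, recall, F1} for a response set scored against a key set.
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<>(key);
        correct.retainAll(response); // annotations present in both sets
        double p = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double r = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    // Made-up offsets: 3 key annotations, 4 response annotations, 2 in common.
    static double[] demo() {
        Set<String> key = new HashSet<>(Arrays.asList(
            ann(0, 8, "Person"), ann(20, 27, "Date"), ann(40, 56, "Organization")));
        Set<String> response = new HashSet<>(Arrays.asList(
            ann(0, 8, "Person"), ann(20, 27, "Date"),
            ann(60, 66, "Person"), ann(70, 75, "Date")));
        return score(key, response);
    }

    public static void main(String[] args) {
        double[] s = demo();
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", s[0], s[1], s[2]);
    }
}
```

Running the same scoring over every document in a corpus, and storing the results per software version, is one way to approximate the corpus-level tracking mentioned above.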
15- Regression Test: Corpus Benchmark Tool
16- Information Retrieval: Based on the Lucene IR engine
17- Editing Multilingual Data
- GATE Unicode Kit (GUK)
- Java provides no special support for text input (this may change)
- Support for defining additional Input Methods (IMs)
- Currently 30 IMs for 17 languages
- Pluggable in other applications
18- Processing Multilingual Data
- All the visualisation and editing tools for multilingual (ML) LRs use enhanced Java facilities
19- A bit of a nuisance (users)
- GATE team projects:
- Conceptual indexing: MUMIS, automatic semantic indices for sports video
- MUSE, cross-genre entity finder
- HSL, Health-and-safety IE
- ETCSL: collaboration with IOAS Oxford on Sumerian
- Old Bailey: collaboration with HRI on 17th century court reports
- Multiflora: plant taxonomy text analysis for biodiversity research e-science
- Advanced Knowledge Technologies: 12m UK five site collaborative project
- H-TechSight: knowledge portal for Chemical Engineers
- Framework 6: SEKT, PrestoSpace, KnowledgeWeb
- A representative fraction of GATE users:
- IBM TJ Watson, US
- the American National Corpus project, US
- the Perseus Digital Library project, Tufts University, US
- Longman Pearson publishing, UK
- Merck KGaA, Germany
- Canon Europe, UK
- Knight Ridder (the second biggest US news publisher)
- BBN (leading HLT research lab), US
- SMEs inc. Sirma AI Ltd., Bulgaria
- Imperial College, London, the University of Manchester, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
- UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, MUSE, Poesia...
20- 3. Machine Learning in GATE
- Uses classification.
- Attr1, Attr2, Attr3, ..., Attrn → Class
- Classifies annotations.
- (Documents can be classified as well using a simple trick.)
- Annotations of a particular type are selected as instances.
- Attributes refer to instance annotations.
- Attributes have a position relative to the instance annotation they refer to.
21- Attributes
- Attributes can be:
- Boolean: the presence (or absence) of an annotation of a particular type that at least partially overlaps the referred instance annotation.
- Nominal: the value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.
- Numeric: the numeric value (converted from String) of a particular feature of the referred instance annotation.
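The three attribute kinds, together with the relative position described earlier, can be sketched as a lookup over a sequence of instance annotations. This is a minimal illustration, not GATE code: the names Kind, Attr, Inst and value are hypothetical, each instance is reduced to a feature map plus the set of annotation types overlapping it, and a position that falls outside the sequence is assumed to yield a missing value (null).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AttributeSketch {
    enum Kind { BOOLEAN, NOMINAL, NUMERIC }

    // Hypothetical attribute definition: what to read, and where, relative to the instance.
    static class Attr {
        final Kind kind;
        final String name;  // annotation type for BOOLEAN, feature name otherwise
        final int position; // offset from the instance: -1 = previous, 0 = itself, 1 = next
        Attr(Kind kind, String name, int position) {
            this.kind = kind; this.name = name; this.position = position;
        }
    }

    // One instance annotation: its features plus the types of annotations overlapping it.
    static class Inst {
        final Map<String, String> features = new HashMap<>();
        final Set<String> overlappingTypes = new HashSet<>();
    }

    static Object value(List<Inst> instances, int index, Attr attr) {
        int i = index + attr.position;
        if (i < 0 || i >= instances.size()) {
            return null; // assumption: nothing at that position means a missing value
        }
        Inst target = instances.get(i);
        switch (attr.kind) {
            case BOOLEAN:
                return target.overlappingTypes.contains(attr.name);
            case NOMINAL:
                return target.features.get(attr.name);
            default: // NUMERIC: the feature is stored as a String and converted
                String raw = target.features.get(attr.name);
                return raw == null ? null : Double.valueOf(raw);
        }
    }

    static Inst inst(String category, String length, String... types) {
        Inst t = new Inst();
        t.features.put("category", category);
        t.features.put("length", length);
        t.overlappingTypes.addAll(Arrays.asList(types));
        return t;
    }

    // Three tokens standing for "Dr.", "Head", "works"; the feature values are illustrative.
    static List<Inst> demoTokens() {
        List<Inst> tokens = new ArrayList<>();
        tokens.add(inst("NNP", "3", "Lookup"));
        tokens.add(inst("NNP", "4"));
        tokens.add(inst("VBZ", "5"));
        return tokens;
    }

    public static void main(String[] args) {
        List<Inst> tokens = demoTokens();
        int head = 1; // the instance being described is the second token
        System.out.println(value(tokens, head, new Attr(Kind.BOOLEAN, "Lookup", -1)));
        System.out.println(value(tokens, head, new Attr(Kind.NOMINAL, "category", 1)));
        System.out.println(value(tokens, head, new Attr(Kind.NUMERIC, "length", 0)));
    }
}
```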
22- Implementation
- Machine Learning PR in GATE.
- Has two functioning modes:
- training
- application
- Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> ... </DATASET>
  <ENGINE> ... </ENGINE>
</ML-CONFIG>
23- <DATASET>
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES>
      <VALUE>NN</VALUE>
      <VALUE>NNP</VALUE>
      <VALUE>NNPS</VALUE>
      ...
    </VALUES>
    <CLASS/>
  </ATTRIBUTE>
  ...
</DATASET>
24- <ENGINE>
<ENGINE>
  <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
  <OPTIONS>
    <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
    <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
    <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
  </OPTIONS>
</ENGINE>
25- Attributes Position
- Instances type: Token
26- Machine Learning PR
- Can save a learnt model to an external file for later use.
- Saves the actual model and the collected dataset.
- Can export the collected dataset in .arff format.
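For readers unfamiliar with .arff, the format is a plain-text header of `@relation` and `@attribute` lines followed by comma-separated rows under `@data`, with `?` marking a missing value. The sketch below writes such a file by hand to show the layout. It is not GATE's exporter: the class name, the relation name and the attribute names are hypothetical, and values are assumed to contain no spaces, commas or quotes (which would need quoting).

```java
import java.util.ArrayList;
import java.util.List;

class ArffSketch {
    // nominalValues[i] lists the allowed values of attribute i; null marks a numeric attribute.
    static String toArff(String relation, String[] names,
                         String[][] nominalValues, List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (int i = 0; i < names.length; i++) {
            sb.append("@attribute ").append(names[i]).append(' ');
            if (nominalValues[i] == null) {
                sb.append("numeric");
            } else {
                sb.append('{').append(String.join(",", nominalValues[i])).append('}');
            }
            sb.append('\n');
        }
        sb.append("\n@data\n");
        for (String[] row : rows) {
            String[] cells = new String[row.length];
            for (int i = 0; i < row.length; i++) {
                cells[i] = (row[i] == null) ? "?" : row[i]; // missing values are written as ?
            }
            sb.append(String.join(",", cells)).append('\n');
        }
        return sb.toString();
    }

    // Two made-up instances: previous POS category, token length, and the class (current POS).
    static String demo() {
        String[] pos = {"NN", "NNP", "VBZ"};
        String[] names = {"prev_category", "length", "category"};
        String[][] nominal = {pos, null, pos};
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"NNP", "4", "NNP"});
        rows.add(new String[] {null, "3", "NNP"}); // first token has no predecessor
        return toArff("pos", names, nominal, rows);
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```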
27- Standard Use Scenario
- Application:
- Prepare data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
- Load the previously saved model.
- Run the ML PR in application mode.
- Save the learnt model.
- Training:
- Prepare training data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
- Run the ML PR in training mode.
- Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
- Update the configuration file accordingly.
- Run the ML PR again to collect the actual data.
- Save the learnt model.
28- An Example
- Learn POS category from POS context.
29- Using Other ML Libraries
- The MLEngine Interface
- Method Summary:
- void addTrainingInstance(List attributes): Adds a new training instance to the dataset.
- Object classifyInstance(List attributes): Classifies a new instance.
- void init(): This method will be called after an engine is created and has its dataset and options set.
- void setDatasetDefinition(DatasetDefintion definition): Sets the definition for the dataset used.
- void setOptions(org.jdom.Element options): Sets the options from an XML JDom element.
- void setOwnerPR(ProcessingResource pr): Registers the PR using the engine with the engine.
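To give a feel for what plugging in another learner involves, the sketch below implements a cut-down stand-in for the interface above. It is not the real MLEngine: the stand-in keeps only init, addTrainingInstance and classifyInstance, and omits the three setters so that it compiles without the GATE and JDOM libraries. The toy engine simply predicts the most frequent class, and it assumes the class value is the last element of each training instance, which the slides do not specify.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the MLEngine interface; the GATE- and JDOM-typed setters are omitted.
interface SketchEngine {
    void init();
    void addTrainingInstance(List<Object> attributes);
    Object classifyInstance(List<Object> attributes);
}

// Toy engine: predicts whichever class value it saw most often during training.
class MajorityClassEngine implements SketchEngine {
    private final Map<Object, Integer> counts = new HashMap<>();
    private Object majority;

    public void init() {
        counts.clear();
        majority = null;
    }

    public void addTrainingInstance(List<Object> attributes) {
        // Assumption: the class value is stored as the last element of the list.
        Object cls = attributes.get(attributes.size() - 1);
        int n = counts.merge(cls, 1, Integer::sum);
        if (majority == null || n > counts.get(majority)) {
            majority = cls;
        }
    }

    public Object classifyInstance(List<Object> attributes) {
        return majority; // ignores the attributes entirely; returns null if untrained
    }

    // Training mode followed by application mode, on made-up (previous POS, POS) pairs.
    static Object demo() {
        SketchEngine engine = new MajorityClassEngine();
        engine.init();
        engine.addTrainingInstance(Arrays.<Object>asList("DT", "NN"));
        engine.addTrainingInstance(Arrays.<Object>asList("JJ", "NN"));
        engine.addTrainingInstance(Arrays.<Object>asList("NNP", "VBZ"));
        return engine.classifyInstance(Arrays.<Object>asList("DT"));
    }

    public static void main(String[] args) {
        System.out.println("Predicted class: " + demo());
    }
}
```

A wrapper around a real library would keep the same shape: collect instances while training, build the model, then delegate classification to the library.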
30- Conclusion
- GATE is:
- Addressing the need for scalable, reusable, and portable HLT solutions
- Supporting large data, in multiple media, languages, formats, and locations
- Lowering the cost of creation of new language processing components
- Promoting quantitative evaluation metrics via tools and a level playing field
- Promoting experimental repeatability by developing and supporting free software
- Perhaps it may become:
- A vehicle for the spread of collaborative experiments in ML and HLT?