Title: Software Architecture for Language Engineering (SALE)
1 Software Architecture for Language Engineering (SALE): where next?
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
IBM TJ Watson, 1st August 2003
2 Structure of the Talk
- SALE and its context
  - Definitions
  - The Knowledge Economy and HLT
  - Software Lifecycle
- GATE, a General Architecture for Text Engineering
  - History
  - Summary of Features and Principles
    - Component-based development
    - Unicode support
    - Measurement
  - CREOLE: some components
  - Users and Projects
- Where Next (give up and go home)?
  - Future context
  - Desirables
- Conclusion
3 SALE definitions
- Computational Linguistics: the science of language that uses computation as an investigative tool.
- Natural Language Processing: the science of computation whose subject matter is data structures and algorithms for human language processing.
- Language Engineering: building systems whose cost and outputs are measurable and predictable.
- Software Architecture: macro-level organisational principles for families of systems. In this context the term is also used to cover infrastructure.
- SALE: software infrastructure, architecture and development tools for applied NLP and LE.
4 The Knowledge Economy and Human Language
- Gartner, December 2002:
  - taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
  - through 2012 more than 95% of human-to-computer information input will involve textual language
- A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
- The challenge: to reconcile these two opposing tendencies
5 IE and Knowledge: Closing the Language Loop
KEY: MNLG = Multilingual Natural Language Generation; OIE = Ontology-aware Information Extraction; AIE = Adaptive IE; CLIE = Controlled Language IE
[Diagram labels: Human Language; Controlled Language; Formal Knowledge (ontologies and instance bases); Semantic Web, Semantic Grid, Semantic Web Services; linked by OIE, (A)IE, CLIE and (M)NLG]
6 Software lifecycle in collaborative research
- Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
- Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
- Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
- Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
- Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).
7 Where did GATE come from?
- Early to mid-1990s (e.g. in TIPSTER):
  - Increasing trend towards multi-site collaborative projects
  - Role of engineering in scalable, reusable, and portable HLT
  - Support for large data, in multiple media, languages, formats, and locations
  - Lower cost of creation of language processing components
  - Promote quantitative evaluation metrics via tools and a level playing field
- GATE history:
  - 1996-2002: GATE version 1, proof of concept
  - March 2002: version 2, rewritten in Java, component based, LGPL, more users
  - Fall 2003: new development cycle
8 GATE is...
- An architecture: a macro-level organisational picture for LE software systems.
- A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
- A development environment: for language engineers, computational linguists et al., a graphical development environment.
- GATE comes with...
  - Some free components... ...and wrappers for other people's components
  - Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
- Free software (LGPL). Download at http://gate.ac.uk/download/
9 Architectural principles
- Non-prescriptive, theory neutral (strength and weakness)
- Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
- (Almost) everything is a component, and component sets are user-extendable
- (Almost) all operations are available both from API and GUI
10 Component-based development
- CREOLE: a Collection of REusable Objects for Language Engineering
  - Java Beans: an OO way of chunking software
  - GATE components: modified Java Beans with XML configuration
  - The minimal component: 10 lines of Java, 10 lines of XML, 1 URL
- Why bother?
  - Allows the system to load arbitrary language processing components
11 CREOLE lifecycle
- Bootstrap: stub Java class, Makefile, config
- Registration: URL / JAR / creole.xml
- Instantiation: class loading, parameterisation, bean object creation
  - load-time parameters, e.g. a document's charset
  - run-time parameters, e.g. a parser's lexicon
- Three types of beans (not a new religion!)
  - Language Resources, e.g. doc, corpus, lexicon
  - Processing Resources, e.g. tagger, stat modeller
  - Visual Resources, e.g. doc editor, syntax editor
12 Language Resources (LRs)
- GATE LRs are documents, ontologies, corpora, lexicons, ...
- LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
- Documents / corpora:
  - Diverse document formats: text, HTML, XML, email, RTF, SGML
  - Optional format-preserving markup: analyse / save
  - Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES
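The standoff idea can be illustrated with a minimal sketch: the text is left untouched and each annotation records only (start, end, type, features). The class and method names below (`StandoffSketch`, `Document`, `annotate`, `covered`, `ofType`) are hypothetical and much simpler than GATE's own API; they only show the shape of the model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified standoff model; these are not GATE's own classes. */
final class StandoffSketch {

    /** One annotation: character offsets into the text, a type, and a feature map. */
    static final class Annotation {
        final int start;
        final int end;
        final String type;
        final Map<String, String> features;

        Annotation(int start, int end, String type, Map<String, String> features) {
            if (start < 0 || end < start) {
                throw new IllegalArgumentException("invalid span " + start + ".." + end);
            }
            this.start = start;
            this.end = end;
            this.type = type;
            this.features = features;
        }
    }

    /** The text is never modified; annotations are stored beside it. */
    static final class Document {
        private final String text;
        private final List<Annotation> annotations = new ArrayList<>();

        Document(String text) {
            this.text = text;
        }

        void annotate(int start, int end, String type, Map<String, String> features) {
            if (end > text.length()) {
                throw new IllegalArgumentException("span runs past the end of the text");
            }
            annotations.add(new Annotation(start, end, type, features));
        }

        /** Recovers the stretch of text an annotation points at. */
        String covered(Annotation a) {
            return text.substring(a.start, a.end);
        }

        /** Returns all annotations of the given type, in insertion order. */
        List<Annotation> ofType(String type) {
            List<Annotation> result = new ArrayList<>();
            for (Annotation a : annotations) {
                if (a.type.equals(type)) {
                    result.add(a);
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Document doc = new Document("Hamish works in Sheffield.");
        doc.annotate(0, 6, "Token", Map.of("orthography", "upperInitial"));
        doc.annotate(16, 25, "Location", Map.of("kind", "city"));
        for (Annotation a : doc.ofType("Location")) {
            System.out.println(a.type + " -> " + doc.covered(a));
        }
    }
}
```

Because annotations only point into the text, several overlapping layers (tokens, entities, syntax) can coexist without touching the original markup.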
13 Processing Resources (PRs)
- Algorithmic components known as PRs: beans with execute methods.
- Controllers execute a set of PRs:
  - SerialController: sequential run of an arbitrary PR set
  - SerialAnalyserController: analyser PRs over a corpus
  - Conditional controllers: execution depends on features
  - Parallel controller?
- PRs + Controller = Applications
- Application parameterisation state can be saved and restored, and used for embedding / batching
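A rough sketch of the controller idea, under simplifying assumptions: a PR is anything with an execute method, a serial controller runs its PRs in order, and a conditional wrapper runs a PR only when a document feature has a given value. All names here (`ControllerSketch`, `Doc`, `onlyIf`, `demoApplication`) are hypothetical stand-ins, not GATE's interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-ins; GATE's real interfaces differ. */
final class ControllerSketch {

    /** A toy document: text plus a feature map that PRs can read and write. */
    static final class Doc {
        final String text;
        final Map<String, Object> features = new HashMap<>();

        Doc(String text) {
            this.text = text;
        }
    }

    /** A processing resource: a bean with an execute method. */
    interface ProcessingResource {
        void execute(Doc doc);
    }

    /** Runs its PRs one after another, in the order they were added. */
    static final class SerialController implements ProcessingResource {
        private final List<ProcessingResource> prs = new ArrayList<>();

        SerialController add(ProcessingResource pr) {
            prs.add(pr);
            return this;
        }

        @Override
        public void execute(Doc doc) {
            for (ProcessingResource pr : prs) {
                pr.execute(doc);
            }
        }

        /** Corpus-level run: the whole pipeline is applied to each document in turn. */
        void executeOver(List<Doc> corpus) {
            for (Doc doc : corpus) {
                execute(doc);
            }
        }
    }

    /** Conditional execution: run the PR only if a document feature has the expected value. */
    static ProcessingResource onlyIf(String feature, Object expected, ProcessingResource pr) {
        return doc -> {
            if (expected.equals(doc.features.get(feature))) {
                pr.execute(doc);
            }
        };
    }

    /** PRs plus a controller make an application. The three PRs are deliberately trivial. */
    static SerialController demoApplication() {
        ProcessingResource tokenCounter =
                doc -> doc.features.put("tokens", doc.text.trim().split("\\s+").length);
        ProcessingResource langGuesser =
                doc -> doc.features.put("lang", doc.text.contains(" the ") ? "en" : "unknown");
        ProcessingResource englishOnly =
                doc -> doc.features.put("englishTagged", Boolean.TRUE);
        return new SerialController()
                .add(tokenCounter)
                .add(langGuesser)
                .add(onlyIf("lang", "en", englishOnly));
    }

    public static void main(String[] args) {
        Doc doc = new Doc("GATE processes the text");
        demoApplication().execute(doc);
        System.out.println(doc.features);
    }
}
```

Since the controller is itself a PR in this sketch, pipelines can be nested; whether that matches a given GATE version is not something the slides state.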
14 Visual Resources (VRs)
15 VRs (2): Coreference
16 VRs (3): Syntax
17 Editing Multilingual Data
- GATE Unicode Kit (GUK)
  - Complements Java's facilities
  - Support for defining Input Methods (IMs)
  - Currently 30 IMs for 17 languages
  - Pluggable in other applications (e.g. JEdit)
18 Processing Multilingual Data
All processing, visualisation and editing tools use GUK
19 Performance Evaluation
- At document level: annotation diff
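A document-level diff can be sketched as a comparison between a set of key (gold) annotations and a set of response (system) annotations. The slide does not name the measures; precision, recall and F-measure are the usual choice in IE evaluation and are assumed here. The sketch counts only strict matches (same start, end and type), and all class names are hypothetical.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical strict-match annotation diff; not GATE's own tool. */
final class AnnotationDiffSketch {

    /** An annotation reduced to what the comparison needs. */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Span)) {
                return false;
            }
            Span s = (Span) o;
            return start == s.start && end == s.end && type.equals(s.type);
        }

        @Override
        public int hashCode() {
            return Objects.hash(start, end, type);
        }
    }

    /** Counts from the diff, plus the measures derived from them. */
    static final class Scores {
        final int correct;
        final int spurious;
        final int missing;

        Scores(int correct, int spurious, int missing) {
            this.correct = correct;
            this.spurious = spurious;
            this.missing = missing;
        }

        double precision() {
            int returned = correct + spurious;
            return returned == 0 ? 0.0 : (double) correct / returned;
        }

        double recall() {
            int expected = correct + missing;
            return expected == 0 ? 0.0 : (double) correct / expected;
        }

        double f1() {
            double p = precision();
            double r = recall();
            return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
        }
    }

    /** Correct = in both sets; spurious = response only; missing = key only. */
    static Scores diff(Set<Span> key, Set<Span> response) {
        Set<Span> correct = new HashSet<>(response);
        correct.retainAll(key);
        return new Scores(correct.size(),
                response.size() - correct.size(),
                key.size() - correct.size());
    }
}
```

A fuller diff would also give credit for partially overlapping spans; that is left out to keep the sketch short.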
20 Regression Test
At corpus level: corpus benchmark tool, tracking the system's performance over time
21 More CREOLE
- JAPE, FSTs over annotations
- ANNIE, A Nearly-New IE system
- DAML+OIL, Protégé, Ontology-Aware IE
- Information Retrieval, Lucene
- WordNet
- Machine Learning support
22 FSTs over annotations
- JAPE: a Java Annotation Patterns Engine
- Light, robust regular-expression-based processing
- Cascaded finite state transduction
- Low-overhead development of new components
- Simplifies multi-phase regex processing
- Example rule:
    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )
      {Lookup.kind == companyDesignator}
    ):match
    -->
    :match.NamedEntity = { kind = "company", rule = "Company1" }
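To make the rule's effect concrete, here is a hand-written sketch of what it matches: a Token whose orthography is upperInitial, immediately followed by a Lookup of kind companyDesignator, producing a NamedEntity that spans both. This is not how JAPE is implemented (there is no rule compilation, no priorities, no cascade), it treats the input as a flat sequence ordered by start offset, and the names `PatternSketch`, `Ann` and `companyRule` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical hand-coded equivalent of one pattern rule; not the JAPE engine. */
final class PatternSketch {

    /** Minimal standoff annotation. */
    static final class Ann {
        final int start;
        final int end;
        final String type;
        final Map<String, String> features;

        Ann(int start, int end, String type, Map<String, String> features) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.features = features;
        }
    }

    /** True if the annotation has the given type and feature value. */
    static boolean has(Ann a, String type, String feature, String value) {
        return a.type.equals(type) && value.equals(a.features.get(feature));
    }

    /** Left-hand side: Token(upperInitial) then Lookup(companyDesignator); right-hand side: NamedEntity. */
    static List<Ann> companyRule(List<Ann> input) {
        List<Ann> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparingInt((Ann a) -> a.start));
        List<Ann> created = new ArrayList<>();
        for (int i = 0; i + 1 < sorted.size(); i++) {
            Ann first = sorted.get(i);
            Ann second = sorted.get(i + 1);
            if (has(first, "Token", "orthography", "upperInitial")
                    && has(second, "Lookup", "kind", "companyDesignator")) {
                Map<String, String> features = new HashMap<>();
                features.put("kind", "company");
                features.put("rule", "Company1");
                // The new annotation covers the whole matched span.
                created.add(new Ann(first.start, second.end, "NamedEntity", features));
            }
        }
        return created;
    }
}
```

Writing the same logic declaratively is the point of the rule language: the pattern, binding and action fit in a few lines, with no loop or offset arithmetic.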
23 Info Extraction Components
The ANNIE system: a reusable and easily extendable set of components
24 Populating Ontologies with IE
25 Protégé and Ontology Management
26 Information Retrieval
Currently based on the Lucene IR engine
27 WordNet support
28 Machine Learning support
- Uses classification.
  - [Attr1, Attr2, Attr3, ... Attrn] -> Class
- Classifies annotations.
  - (Documents can be classified as well using a simple trick.)
- Annotations of a particular type are selected as instances.
- Attributes refer to instance annotations.
- Attributes have a position relative to the instance annotation they refer to.
29 Attributes
- Attributes can be:
  - Boolean: the presence or absence of an annotation of a particular type that (at least partially) overlaps the referred instance annotation.
  - Nominal: the value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.
  - Numeric: the numeric value (converted from String) of a particular feature of the referred instance annotation.
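The three attribute kinds, and the notion of a position relative to the instance, can be sketched as follows. This is an illustrative reading of the slide, not the ML PR's code: `AttributeSketch`, `Ann`, `relative`, `booleanAttr`, `nominalAttr` and `numericAttr` are hypothetical names, and rejecting an undeclared nominal value with an exception is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical attribute extraction over instance annotations; not GATE's ML PR. */
final class AttributeSketch {

    /** Minimal standoff annotation. */
    static final class Ann {
        final int start;
        final int end;
        final String type;
        final Map<String, String> features;

        Ann(int start, int end, String type, Map<String, String> features) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.features = features;
        }
    }

    /** Instance at `position` relative to instance i (0 = itself, -1 = previous), or null if off the ends. */
    static Ann relative(List<Ann> instances, int i, int position) {
        int j = i + position;
        return (j < 0 || j >= instances.size()) ? null : instances.get(j);
    }

    /** Boolean: does any annotation of the given type overlap the instance, even partially? */
    static boolean booleanAttr(Ann instance, List<Ann> all, String type) {
        for (Ann a : all) {
            if (a.type.equals(type) && a.start < instance.end && instance.start < a.end) {
                return true;
            }
        }
        return false;
    }

    /** Nominal: a feature value that must belong to the set declared up front. */
    static String nominalAttr(Ann instance, String feature, Set<String> allowed) {
        String value = instance.features.get(feature);
        if (!allowed.contains(value)) {
            throw new IllegalArgumentException("undeclared nominal value: " + value);
        }
        return value;
    }

    /** Numeric: a feature value stored as a String and converted to a number. */
    static double numericAttr(Ann instance, String feature) {
        return Double.parseDouble(instance.features.get(feature));
    }
}
```

With Token as the instance type, an attribute at position -1 describes the previous token, which is how context windows are usually built for tagging tasks.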
30 Implementation
- Machine Learning PR in GATE.
- Has two functioning modes:
  - training
  - application
- Uses an XML file for configuration:
    <?xml version="1.0" encoding="windows-1252"?>
    <ML-CONFIG>
      <DATASET> </DATASET>
      <ENGINE></ENGINE>
    </ML-CONFIG>
31 <DATASET>
    <DATASET>
      <INSTANCE-TYPE>Token</INSTANCE-TYPE>
      <ATTRIBUTE>
        <NAME>POS_category(0)</NAME>
        <TYPE>Token</TYPE>
        <FEATURE>category</FEATURE>
        <POSITION>0</POSITION>
        <VALUES>
          <VALUE>NN</VALUE>
          <VALUE>NNP</VALUE>
          ...
        </VALUES>
        <CLASS/>
      </ATTRIBUTE>
      ...
    </DATASET>
32 <ENGINE>
    <ENGINE>
      <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
      <OPTIONS>
        <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
        <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
        <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
      </OPTIONS>
    </ENGINE>
- Now: WEKA
- Soon: Torch? YASMET? TIMBL?
33 Attributes: Position
Instances type: Token
34 Standard Use Scenario
- Training
  - Prepare training annotations.
  - Run the ML PR in training mode.
  - Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
  - Update the configuration file accordingly.
  - Run the ML PR again to collect the actual data.
  - Save the learnt model.
- Application
  - Load the previously saved model.
  - Run the ML PR in application mode.
  - Save the learnt model.
35 Using Other ML Libraries
- The MLEngine Interface
  - void addTrainingInstance(List attributes): adds a new training instance to the dataset.
  - Object classifyInstance(List attributes): classifies a new instance.
  - void init(): this method will be called after an engine is created and has its dataset and options set.
  - void setDatasetDefinition(DatasetDefintion definition): sets the definition for the dataset used.
  - void setOptions(org.jdom.Element options): sets the options from an XML JDom element.
  - void setOwnerPR(ProcessingResource pr): registers the PR using the engine with the engine.
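As an illustration of plugging in another learner, the sketch below implements a toy engine against a cut-down, hypothetical interface (`SimpleEngine`) modelled on three of the methods above; the dataset, options and owner-PR setters are omitted so the example has no GATE or JDOM dependency. Two assumptions are made that the slides do not state: the class label is the last element of a training instance, and `classifyInstance` receives the attributes without the label. The learner itself is a memorising baseline: an exact match on the attribute values, falling back to the overall majority class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy engine; the interface is a cut-down stand-in for MLEngine. */
final class EngineSketch {

    /** Assumed contract: init() is called once, before any other method. */
    interface SimpleEngine {
        void init();
        void addTrainingInstance(List<Object> attributes);
        Object classifyInstance(List<Object> attributes);
    }

    /** Remembers label counts per attribute vector, plus overall label counts. */
    static final class MemoryEngine implements SimpleEngine {
        private Map<List<Object>, Map<Object, Integer>> seen;
        private Map<Object, Integer> totals;

        @Override
        public void init() {
            seen = new HashMap<>();
            totals = new HashMap<>();
        }

        /** Assumption: the last element of the list is the class label. */
        @Override
        public void addTrainingInstance(List<Object> attributes) {
            int last = attributes.size() - 1;
            List<Object> inputs = new ArrayList<>(attributes.subList(0, last));
            Object label = attributes.get(last);
            seen.computeIfAbsent(inputs, k -> new HashMap<>()).merge(label, 1, Integer::sum);
            totals.merge(label, 1, Integer::sum);
        }

        /** Exact match if the vector was seen in training, otherwise the majority class. */
        @Override
        public Object classifyInstance(List<Object> attributes) {
            return mostFrequent(seen.getOrDefault(attributes, totals));
        }

        /** Returns the label with the highest count, or null if there are none. */
        private static Object mostFrequent(Map<Object, Integer> counts) {
            Object best = null;
            int bestCount = -1;
            for (Map.Entry<Object, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;
        }
    }
}
```

A real wrapper would delegate to the external library in place of the two maps and would read its settings in the options setter.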
36 A bit of a nuisance (GATE users)
- Thousands of users at hundreds of sites. A representative sample:
  - the American National Corpus project
  - the Perseus Digital Library project, Tufts University, US
  - Longman Pearson publishing, UK
  - Merck KGaA, Germany
  - Canon Europe, UK
  - Knight Ridder, US
  - BBN (leading HLT research lab), US
  - SMEs inc. Sirma AI Ltd., Bulgaria
  - Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
  - UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...
- GATE team projects. Past:
  - Conceptual indexing: MUMIS, automatic semantic indices for sports video
  - MUSE, cross-genre entity finder
  - HSL, Health-and-safety IE
  - Old Bailey: collaboration with HRI on 17th century court reports
  - Multiflora: plant taxonomy text analysis for biodiversity research e-science
- Present:
  - Advanced Knowledge Technologies: 12m UK five site collaborative project
  - EMILLE: S. Asian languages corpus
  - ACE / TIDES: Arabic, Chinese NE
  - JHU summer w/s on semtagging
- Future:
  - Five new projects (below)
37 Where Next (1)?
- Can Universities cope with the long term?
- User survey
- Future context:
  - SEKT: Knowledge Management
  - KnowledgeWeb: OntoWeb II
  - PrestoSpace: audiovisual preservation (FSTs for users?)
  - hTechSight: knowledge portal for petrochemicals
  - ETCSL: Electronic Text Corpus of Sumerian Literature
  - DERI: Digital Enterprise Research Institute
  - PhDs: INK, PIE
38 Where Next (2)?
- Some desirables:
  - Corpus tools (ANNIC in progress)
  - Audiovisual documents
  - WS-based backend server, for ML, active learning etc.
  - Better dialogue support (cf. AMITIES, Galaxy)
  - Better MT support
  - PDF documents
  - JAPE: debugger, editor, 101 language extensions (e.g. quantified ops, deletion, ontology callouts)
  - Cleverer treatment of large documents in the GUI
  - PR reloading
39 Conclusion
- GATE is:
  - Addressing the need for scalable, reusable, and portable HLT solutions
  - Supporting large data, in multiple media, languages, formats, and locations
  - Lowering the cost of creation of new language processing components
  - Promoting quantitative evaluation metrics via tools and a level playing field
  - Promoting experimental repeatability by developing and supporting free software
- http://gate.ac.uk/