Software Architecture for Language Engineering (SALE)

Slides: 40

1
Software Architecture for Language Engineering (SALE): where next?
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
IBM TJ Watson, 1st August 2003
2
  • Structure of the Talk
  • SALE and its context
  • Definitions
  • The Knowledge Economy and HLT
  • Software Lifecycle
  • GATE, a General Architecture for Text Engineering
  • History
  • Summary of Features and Principles
  • Component-based development
  • Unicode support
  • Measurement
  • CREOLE: some components
  • Users and Projects
  • Where Next (give up and go home)?
  • Future context
  • Desirables
  • Conclusion

3
SALE definitions
  • Computational Linguistics: the science of language
    that uses computation as an investigative tool.
  • Natural Language Processing: the science of
    computation whose subject matter is data
    structures and algorithms for human language
    processing.
  • Language Engineering: building systems whose cost
    and outputs are measurable and predictable.
  • Software Architecture: macro-level organisational
    principles for families of systems. In this
    context the term is also used to cover
    infrastructure.
  • SALE: software infrastructure, architecture and
    development tools for applied NLP and LE.

4
  • The Knowledge Economy and Human Language
  • Gartner, December 2002:
  • taxonomic and hierarchical knowledge mapping and
    indexing will be prevalent in almost all
    information-rich applications
  • through 2012 more than 95% of human-to-computer
    information input will involve textual language
  • A contradiction: formal knowledge in
    semantics-based systems vs. ambiguous informal
    natural language
  • The challenge: to reconcile these two opposing
    tendencies

5
IE and Knowledge: Closing the Language Loop
[Diagram: Human Language, Controlled Language and
Formal Knowledge (ontologies and instance bases)
linked by (M)NLG, OIE, (A)IE and CLIE; labelled
Semantic Web, Semantic Grid, Semantic Web Services.]
KEY:
  • MNLG: Multilingual Natural Language Generation
  • OIE: Ontology-aware Information Extraction
  • AIE: Adaptive IE
  • CLIE: Controlled Language IE
6
                                               
                                                  
Software lifecycle in collaborative research
  • Project Proposal: We love each other. We can work
    so well together. We can hold workshops on
    Santorini together. We will solve all the problems
    of AI that our predecessors were too stupid to.
  • Analysis and Design: Stop work entirely, for a
    period of reflection and recuperation following
    the stress of attending the kick-off meeting in
    Luxembourg.
  • Implementation: Each developer partner tries to
    convince the others that program X that they just
    happen to have lying around on a dusty disk-drive
    meets the project objectives exactly and should
    form the centrepiece of the demonstrator.
  • Integration and Testing: The lead partner gets
    desperate and decides to hard-code the results for
    a small set of examples into the demonstrator, and
    have a fail-safe crash facility for unknown input
    ("well, you know, it's still a prototype...").
  • Evaluation: Everyone says how nice it is, how it
    solves all sorts of terribly hard problems, and
    how if we had another grant we could go on to
    transform information processing the World over
    (or at least the European business travel
    industry).
7
  • Where did GATE come from?
  • Early to mid-1990s (e.g. in TIPSTER):
  • Increasing trend towards multi-site collaborative
    projects
  • Role of engineering in scalable, reusable, and
    portable HLT
  • Support for large data, in multiple media,
    languages, formats, and locations
  • Lower cost of creation of language processing
    components
  • Promote quantitative evaluation metrics via tools
    and a level playing field
  • GATE history:
  • 1996-2002: GATE version 1, proof of concept
  • March 2002: version 2, rewritten in Java,
    component-based, LGPL, more users
  • Fall 2003: new development cycle
8
GATE is...
  • An architecture: a macro-level organisational
    picture for LE software systems.
  • A framework: for programmers, GATE is an
    object-oriented class library that implements the
    architecture.
  • A development environment: for language engineers,
    computational linguists et al., a graphical
    development environment.
  • GATE comes with...
  • Some free components... ...and wrappers for other
    people's components
  • Tools for: evaluation; visualise/edit;
    persistence; IR; IE; dialogue; ontologies; etc.
  • Free software (LGPL). Download at
    http://gate.ac.uk/download/

9
  • Architectural principles
  • Non-prescriptive, theory neutral (strength and
    weakness)
  • Re-use, interoperation, not reimplementation
    (e.g. diverse XML support, integration of
    Protégé, Jena, Weka...)
  • (Almost) everything is a component, and component
    sets are user-extendable
  • (Almost) all operations are available both from
    API and GUI

10
Component-based development
  • CREOLE: a Collection of REusable Objects for
    Language Engineering
  • Java Beans: an OO way of chunking software
  • GATE components: modified Java Beans with XML
    configuration
  • The minimal component: 10 lines of Java, 10
    lines of XML, 1 URL
  • Why bother?
  • Allows the system to load arbitrary language
    processing components

11
CREOLE lifecycle
  • Bootstrap: stub Java class, Makefile, config
  • Registration: URL / JAR / creole.xml
  • Instantiation: class loading, parameterisation,
    bean object creation
  • load-time parameters, e.g. a document's charset
  • run-time parameters, e.g. a parser's lexicon
  • Three types of beans (not a new religion!):
  • Language Resources, e.g. doc, corpus, lexicon
  • Processing Resources, e.g. tagger, stat modeller
  • Visual Resources, e.g. doc editor, syntax editor

12
Language Resources (LRs)
  • GATE LRs are documents, ontologies, corpora,
    lexicons, ...
  • LRs can be associated with DataStores (Oracle,
    PostgreSQL, XML, Java Serialisation)
  • Documents / corpora:
  • Diverse document formats: text, HTML, XML, email,
    RTF, SGML
  • Optional format-preserving markup: analyse / save
  • Standoff annotation model (start, end, type,
    features), derivative of TIPSTER, compatible with
    ATLAS and XCES
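
The standoff model in the last bullet can be pictured with a small sketch. This is plain Java written for illustration, not GATE's own classes: the names StandoffSketch, Annotation, covered and ofType are hypothetical, and offsets are assumed to be character offsets into a text that is never modified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of standoff annotation: the text itself is
// never changed; annotations point into it by character offsets.
class StandoffSketch {

    // One annotation = (start, end, type, features), as on the slide.
    static final class Annotation {
        final int start;   // offset of the first character (inclusive)
        final int end;     // offset just past the last character (exclusive)
        final String type;
        final Map<String, String> features = new HashMap<>();

        Annotation(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Returns the stretch of text that an annotation covers.
    static String covered(String text, Annotation a) {
        return text.substring(a.start, a.end);
    }

    // Selects the annotations of one type, preserving their order.
    static List<Annotation> ofType(List<Annotation> all, String type) {
        List<Annotation> out = new ArrayList<>();
        for (Annotation a : all) {
            if (a.type.equals(type)) {
                out.add(a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hamish works in Sheffield.";
        List<Annotation> annots = new ArrayList<>();
        Annotation person = new Annotation(0, 6, "Person");
        person.features.put("rule", "demo");
        annots.add(person);
        annots.add(new Annotation(16, 25, "Location"));
        for (Annotation a : annots) {
            System.out.println(a.type + ": " + covered(text, a));
        }
    }
}
```

Because the markup lives outside the text, several overlapping layers of annotation can coexist over the same document without interfering with one another.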

13
Processing Resources (PRs)
  • Algorithmic components known as PRs: beans with
    execute methods.
  • Controllers execute a set of PRs:
  • SerialController: sequential run of arbitrary PR
    set
  • SerialAnalyserController: analyser PRs over
    corpus
  • Conditional controllers: execution depends on
    features
  • Parallel controller?
  • PRs + Controller = Applications
  • Application parameterisation state can be saved
    and restored, and used for embedding / batching
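
The idea of a controller running a set of PRs in sequence can be sketched in a few lines. The types below (Doc, Resource, SerialController) are stand-ins invented for this illustration and are not GATE's classes; the only assumption carried over from the slide is that a PR is a bean with an execute method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a serial controller runs its resources in order
// over a shared document. None of these types are GATE's own.
class SerialControllerSketch {

    // Stand-in for a document: the text plus a log of processing results.
    static final class Doc {
        final String text;
        final List<String> log = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    // Stand-in for a Processing Resource: something with an execute method.
    interface Resource {
        void execute(Doc doc);
    }

    // Runs every registered resource, in the order it was added.
    static final class SerialController {
        private final List<Resource> resources = new ArrayList<>();

        SerialController add(Resource r) {
            resources.add(r);
            return this;
        }

        void execute(Doc doc) {
            for (Resource r : resources) {
                r.execute(doc);
            }
        }
    }

    // Builds a toy two-step application and runs it over one document.
    static List<String> runDemo(String text) {
        Doc doc = new Doc(text);
        new SerialController()
            .add(d -> d.log.add("tokens=" + d.text.trim().split("\\s+").length))
            .add(d -> d.log.add("chars=" + d.text.length()))
            .execute(doc);
        return doc.log;
    }

    public static void main(String[] args) {
        System.out.println(runDemo("GATE runs components in sequence"));
    }
}
```

A corpus-level controller would wrap the same loop in an outer loop over documents; a conditional one would test a document feature before calling each resource.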

14
Visual Resources (VRs)
15
VRs (2) Coreference
16
VRs (3) Syntax
17
Editing Multilingual Data
  • GATE Unicode Kit (GUK)
  • Complements Java's facilities
  • Support for defining Input Methods (IMs)
  • Currently 30 IMs for 17 languages
  • Pluggable in other applications (e.g. JEdit)

18
Processing Multilingual Data
All processing, visualisation and editing tools use GUK
19
Performance Evaluation
  • At document level: annotation diff
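
A document-level comparison of system output against a reference set is commonly summarised as precision, recall and F1. The sketch below is an assumption about what such a comparison computes, not a description of GATE's annotation diff tool: the class name AnnotationDiffSketch is hypothetical, annotations are encoded as "start:end:type" strings, and only exact matches count as correct.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a document-level annotation comparison.
// Annotations are encoded as "start:end:type" strings and only exact
// matches count; lenient (partial-overlap) matching is left out.
class AnnotationDiffSketch {

    // Returns {precision, recall, f1} for the response measured against the key.
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<>(response);
        correct.retainAll(key);
        double p = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double r = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] { p, r, f };
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<>(Arrays.asList("0:6:Person", "16:25:Location"));
        Set<String> response = new HashSet<>(Arrays.asList("0:6:Person", "7:12:Location"));
        double[] s = score(key, response);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", s[0], s[1], s[2]);
    }
}
```

Running the same scoring over every document in a corpus, and storing the figures per release, gives the kind of over-time tracking a corpus-level benchmark needs.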

20
Regression Test
At corpus level: corpus benchmark tool, tracking
system performance over time
21
More CREOLE
  1. JAPE, FSTs over annotations
  2. ANNIE, A Nearly-New IE system
  3. DAML+OIL, Protégé, Ontology-Aware IE
  4. Information Retrieval, Lucene
  5. WordNet
  6. Machine Learning support

22
FSTs over annotations
  • JAPE: a Java Annotation Patterns Engine
  • Light, robust regular-expression-based
    processing
  • Cascaded finite state transduction
  • Low-overhead development of new components
  • Simplifies multi-phase regex processing

    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )+
      {Lookup.kind == companyDesignator}
    ):match
    -->
    :match.NamedEntity = { kind = company, rule = "Company1" }
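
What the Company1 rule does can be approximated by a hand-written scan. This is not JAPE and not a finite state transducer; it is a sketch under stated assumptions: the pattern is read as one or more upperInitial tokens followed by a company designator, each token carries its own orthography and optional Lookup kind, and the names CompanyRuleSketch, Token and findCompanies are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical approximation of the Company1 rule as a linear scan.
// Assumed pattern: one or more upperInitial tokens, then a token whose
// Lookup kind is companyDesignator. This is not JAPE itself.
class CompanyRuleSketch {

    static final class Token {
        final String text;
        final String orthography; // e.g. "upperInitial" or "lowercase"
        final String lookupKind;  // e.g. "companyDesignator", or null

        Token(String text, String orthography, String lookupKind) {
            this.text = text;
            this.orthography = orthography;
            this.lookupKind = lookupKind;
        }
    }

    // Returns the text of every matched span, scanning left to right.
    static List<String> findCompanies(List<Token> tokens) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int j = i;
            // Consume the run of upperInitial tokens that are not designators.
            while (j < tokens.size()
                    && "upperInitial".equals(tokens.get(j).orthography)
                    && !"companyDesignator".equals(tokens.get(j).lookupKind)) {
                j++;
            }
            boolean hasRun = j > i;
            boolean closes = j < tokens.size()
                    && "companyDesignator".equals(tokens.get(j).lookupKind);
            if (hasRun && closes) {
                StringBuilder sb = new StringBuilder();
                for (int k = i; k <= j; k++) {
                    if (k > i) {
                        sb.append(' ');
                    }
                    sb.append(tokens.get(k).text);
                }
                found.add(sb.toString());
                i = j + 1;               // continue after the match
            } else {
                i = Math.max(j, i + 1);  // no match here; always move forward
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("shares", "lowercase", null),
            new Token("of", "lowercase", null),
            new Token("Acme", "upperInitial", null),
            new Token("Widgets", "upperInitial", null),
            new Token("Ltd", "upperInitial", "companyDesignator"),
            new Token("rose", "lowercase", null));
        System.out.println(findCompanies(tokens));
    }
}
```

The point of a pattern language is that the loop and index bookkeeping above disappear: the rule states only the pattern and the annotation to create.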

23
Info Extraction Components
The ANNIE system: a reusable and easily extendable
set of components
24
Populating Ontologies with IE
25
Protégé and Ontology Management
26
Information Retrieval
Currently based on the Lucene IR engine
27
WordNet support
28
Machine Learning support
  • Uses classification.
  • Attr1, Attr2, Attr3, ..., Attrn → Class
  • Classifies annotations.
  • (Documents can be classified as well using a
    simple trick.)
  • Annotations of a particular type are selected as
    instances.
  • Attributes refer to instance annotations.
  • Attributes have a position relative to the
    instance annotation they refer to.

29
Attributes
  • Attributes can be:
  • Boolean: the presence or absence of an
    annotation of a particular type that at least
    partially overlaps the referred instance
    annotation.
  • Nominal: the value of a particular feature of the
    referred instance annotation. The complete set of
    acceptable values must be specified a priori.
  • Numeric: the numeric value (converted from
    String) of a particular feature of the referred
    instance annotation.

30
Implementation
  • Machine Learning PR in GATE.
  • Has two functioning modes:
  • training
  • application
  • Uses an XML file for configuration:

    <?xml version="1.0" encoding="windows-1252"?>
    <ML-CONFIG>
      <DATASET> </DATASET>
      <ENGINE></ENGINE>
    </ML-CONFIG>

31
<DATASET>

    <DATASET>
      <INSTANCE-TYPE>Token</INSTANCE-TYPE>
      <ATTRIBUTE>
        <NAME>POS_category(0)</NAME>
        <TYPE>Token</TYPE>
        <FEATURE>category</FEATURE>
        <POSITION>0</POSITION>
        <VALUES>
          <VALUE>NN</VALUE>
          <VALUE>NNP</VALUE>
        </VALUES>
        <CLASS/>
      </ATTRIBUTE>
    </DATASET>

32
<ENGINE>

    <ENGINE>
      <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
      <OPTIONS>
        <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
        <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
        <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
      </OPTIONS>
    </ENGINE>

  • Now: WEKA
  • Soon: Torch? YASMET? TIMBL?

33
Attributes: Position
Instances type: Token
34
Standard Use Scenario
  • Training:
  • Prepare training annotations.
  • Run the ML PR in training mode.
  • Export the dataset as .arff and perform
    experiments using the WEKA interface in order to
    find the best attribute set / algorithm /
    algorithm options.
  • Update the configuration file accordingly.
  • Run the ML PR again to collect the actual data.
  • Save the learnt model.
  • Application:
  • Load the previously saved model.
  • Run the ML PR in application mode.
  • Save the learnt model.

35
Using Other ML Libraries
  • The MLEngine Interface:
  • void addTrainingInstance(List attributes): adds
    a new training instance to the dataset.
  • Object classifyInstance(List attributes):
    classifies a new instance.
  • void init(): called after an engine is created
    and has its dataset and options set.
  • void setDatasetDefinition(DatasetDefintion
    definition): sets the definition for the dataset
    used.
  • void setOptions(org.jdom.Element options): sets
    the options from an XML JDom element.
  • void setOwnerPR(ProcessingResource pr):
    registers the PR using the engine with the
    engine.
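
To give a feel for how small an alternative engine can be, here is a sketch of a baseline learner behind a cut-down version of the interface. SimpleEngine and MajorityEngine are hypothetical names and this is not gate.creole.ml.MLEngine: the three setter methods are left out because they depend on GATE and JDOM types, and the sketch assumes the class value is the last element of each training list, whereas a real engine would take that position from the dataset definition.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of plugging in another learner. SimpleEngine is a
// cut-down local interface, not GATE's MLEngine: the setters are omitted
// because they depend on GATE and JDOM types.
class MajorityEngineSketch {

    interface SimpleEngine {
        void init();
        void addTrainingInstance(List<Object> attributes);
        Object classifyInstance(List<Object> attributes);
    }

    // Baseline learner: always predicts the most frequent training class.
    // Assumption: the class value is the LAST element of each training list.
    static final class MajorityEngine implements SimpleEngine {
        private final Map<Object, Integer> counts = new HashMap<>();

        @Override
        public void init() {
            counts.clear();
        }

        @Override
        public void addTrainingInstance(List<Object> attributes) {
            Object label = attributes.get(attributes.size() - 1);
            counts.merge(label, 1, Integer::sum);
        }

        @Override
        public Object classifyInstance(List<Object> attributes) {
            Object best = null;
            int bestCount = 0;
            for (Map.Entry<Object, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best; // null if nothing has been learnt yet
        }
    }

    public static void main(String[] args) {
        MajorityEngine engine = new MajorityEngine();
        engine.init();
        engine.addTrainingInstance(Arrays.<Object>asList("NNP", "Person"));
        engine.addTrainingInstance(Arrays.<Object>asList("NNP", "Person"));
        engine.addTrainingInstance(Arrays.<Object>asList("NN", "Other"));
        System.out.println(engine.classifyInstance(Arrays.<Object>asList("NNP")));
    }
}
```

A majority-class baseline ignores its input attributes entirely, which makes it a useful floor when judging whether a real classifier is learning anything.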

36
A bit of a nuisance (GATE users)
  • Thousands of users at hundreds of sites. A
    representative sample:
  • the American National Corpus project
  • the Perseus Digital Library project, Tufts
    University, US
  • Longman Pearson publishing, UK
  • Merck KGaA, Germany
  • Canon Europe, UK
  • Knight Ridder, US
  • BBN (leading HLT research lab), US
  • SMEs inc. Sirma AI Ltd., Bulgaria
  • Imperial College, London, the University of
    Manchester, UMIST, the University of Karlsruhe,
    Vassar College, the University of Southern
    California and a large number of other UK, US and
    EU Universities
  • UK and EU projects inc. MyGrid, CLEF, dotkom,
    AMITIES, Cub Reporter, EMILLE, Poesia...
  • GATE team projects. Past:
  • Conceptual indexing: MUMIS, automatic semantic
    indices for sports video
  • MUSE, cross-genre entity finder
  • HSL, Health-and-safety IE
  • Old Bailey: collaboration with HRI on 17th
    century court reports
  • Multiflora: plant taxonomy text analysis for
    biodiversity research e-science
  • Present:
  • Advanced Knowledge Technologies: 12m UK five
    site collaborative project
  • EMILLE: S. Asian languages corpus
  • ACE / TIDES: Arabic, Chinese NE
  • JHU summer w/s on semtagging
  • Future:
  • Five new projects (below)

37
Where Next (1)?
  • Can Universities cope with the long term?
  • User survey
  • Future context:
  • SEKT: Knowledge Management
  • KnowledgeWeb: OntoWeb II
  • PrestoSpace: audiovisual preservation (FSTs for
    users?)
  • hTechSight: knowledge portal for petrochemicals
  • ETCSL: Electronic Text Corpus of Sumerian
    Language
  • DERI: Digital Enterprise Research Institute
  • PhDs: INK, PIE

38
Where Next (2)?
  • Some desirables
  • Corpus tools (ANNIC in progress)
  • Audiovisual documents
  • WS-based backend server, for ML, active learning
    etc.
  • Better dialogue support (cf. AMITIES, Galaxy)
  • Better MT support
  • PDF documents
  • JAPE: debugger, editor, 101 language extensions
    (e.g. quantified ops, deletion, ontology callouts)
  • Cleverer treatment of large documents in the GUI
  • PR reloading

39
  • Conclusion
  • GATE is:
  • Addressing the need for scalable, reusable, and
    portable HLT solutions
  • Supporting large data, in multiple media,
    languages, formats, and locations
  • Lowering the cost of creation of new language
    processing components
  • Promoting quantitative evaluation metrics via
    tools and a level playing field
  • Promoting experimental repeatability by
    developing and supporting free software
  • http://gate.ac.uk/