Title: Experiences with UIMA in NLP teaching and research
1Experiences with UIMA in NLP teaching and research
- Manuela Kunze,
- Dietmar Rösner
University of Magdeburg, Knowledge Based Systems
and Document Processing
2Overview
- What is UIMA?
- First Experiments
- NLP Teaching
- Conclusion
3UIMA: Unstructured Information Management Architecture
- a software architecture for developing and deploying unstructured information management (UIM) applications
- UIM application: a software system that
- analyses large volumes of unstructured information to
- discover,
- organize, and
- deliver relevant knowledge to the end user
- software architecture which specifies
- component interfaces, data representations, ...
- http://www.research.ibm.com/UIMA/
4UIMA: Unstructured Information Management Architecture
- Collection Reader: interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
- CAS Initializer: may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS.
- Analysis Engine: takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
- CAS Consumer: consumes the enriched CAS that was produced by the sequence of Analysis Engines before it, and produces an application-specific data structure, such as a search engine index or database.
- CAS: Common Analysis Structure; CPE: Collection Processing Engine
Source: Ferrucci et al., Unstructured Information Management Architecture (UIMA) SDK User's Guide and Reference
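The division of labour between reader, analysis engines and consumer can be sketched in plain Python. This is only an illustration of the data flow, not the UIMA Java API: the CAS is modelled as a dictionary, and the names `collection_reader`, `token_engine`, `index_consumer` and `run_pipeline` are hypothetical.

```python
import re

def collection_reader(documents):
    # Stand-in for a Collection Reader: wrap each document in a CAS-like dict.
    for doc_id, text in documents.items():
        yield {"id": doc_id, "text": text, "annotations": []}

def token_engine(cas):
    # Stand-in for an Analysis Engine: enrich the CAS with token annotations.
    for m in re.finditer(r"\w+", cas["text"]):
        cas["annotations"].append(("Token", m.start(), m.end()))
    return cas

def index_consumer(cases):
    # Stand-in for a CAS Consumer: build a tiny inverted index (token -> document ids).
    index = {}
    for cas in cases:
        for kind, begin, end in cas["annotations"]:
            if kind == "Token":
                token = cas["text"][begin:end].lower()
                index.setdefault(token, set()).add(cas["id"])
    return index

def run_pipeline(documents, engines, consumer):
    # Reader -> sequence of engines -> consumer.
    def process(cas):
        for engine in engines:
            cas = engine(cas)
        return cas
    return consumer(process(cas) for cas in collection_reader(documents))

docs = {"d1": "The Egyptian Museum is open daily", "d2": "Palace Museum"}
index = run_pipeline(docs, [token_engine], index_consumer)
```

Because every engine takes a CAS and returns a CAS, a list of engines can itself be wrapped as one engine, which is the idea behind an aggregate.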
5UIMA: Unstructured Information Management Architecture
- Analysis Engine (AE)
- a component that analyzes artifacts (e.g. documents) and infers information about them
- consists of two parts
- Java classes (typically packaged as one or more JAR files) and
- AE descriptors (one or more XML files), which contain
- the configuration settings for the Analysis Engine as well as
- a description of the AE's input and output requirements.
6UIMA: Unstructured Information Management Architecture
[Diagram: the Java and XML parts of an analysis engine. Labels: "define an annotator", "analysis engine linked to a type system", "uses type system", "create Annotation Interface"]
- XML: describe analysis engine
- annotator class
- input parameter
- output of annotations
- external resources
- interface
- resources
- type system: define annotation type
- name
- features (begin, end, ...)
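As a rough picture of what such an annotation type carries, the following Python sketch models an annotation with a type name and the `begin` and `end` features named above. It is not the Java class that UIMA would provide; the class `Annotation`, its `features` dictionary and the method `covered_text` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type_name: str   # name of the annotation type, e.g. "Museum"
    begin: int       # offset of the first covered character
    end: int         # offset just past the last covered character
    features: dict = field(default_factory=dict)  # further type-specific features

    def covered_text(self, document: str) -> str:
        # The annotation only stores offsets; the text stays in the document.
        return document[self.begin:self.end]

text = "Fraunces Tavern Museum 54 Pearl St."
museum = Annotation("Museum", 0, 22)
```

Storing offsets instead of copied strings lets several annotators mark the same document without changing it.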
7UIMA: Unstructured Information Management Architecture
- Aggregate Analysis Engine
- combines different analysis engines within one Analysis Engine
Source: Ferrucci et al., Unstructured Information Management Architecture (UIMA) SDK User's Guide and Reference
8Overview
- Introduction
- First Experiments
- NLP Teaching
- Conclusion
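9First Experiments: UIMA vs. GATE
- baseline
- 2 persons, 2 systems, 1 corpus and 1 extraction task
- skills/experiences of the persons
[Table: prior experience of Person 1 and Person 2 with UIMA, GATE and Eclipse/Java, rated with smiley symbols (three positive, three neutral); the assignment of the symbols to the cells is not recoverable]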
10Task of the Experiment
- process a corpus of websites
- to detect and extract information relevant for tourists
- opening times of museums, prices of hotels, ...
- corpus
- 30 tourism web sites of Egypt
- additional 20 web sites of Washington, New York, London
- output
- Prolog facts for a reasoner
- Questions
- Which museum is now open?
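How a question like this can be answered from extracted opening intervals is sketched below in Python rather than Prolog. The facts, the museum names and the function `open_museums` are hypothetical; the experiment itself passed Prolog facts to a reasoner.

```python
from datetime import datetime

# Hypothetical facts in the form (name, open_from, open_until).
facts = [
    ("Museum A", datetime(2005, 12, 1, 9, 0), datetime(2005, 12, 1, 17, 0)),
    ("Museum B", datetime(2005, 12, 1, 12, 0), datetime(2005, 12, 1, 17, 0)),
]

def open_museums(facts, now):
    # A museum is open if 'now' falls inside one of its opening intervals.
    return sorted({name for name, start, end in facts if start <= now < end})

morning = open_museums(facts, datetime(2005, 12, 1, 10, 0))
afternoon = open_museums(facts, datetime(2005, 12, 1, 13, 0))
```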
11Evaluation Topics/Points
- ease of getting acquainted with the system?
- quality of documentation: completeness, clarity, up-to-dateness, ...?
- tutorials, use cases, ...?
- processing and linguistic resources?
- lexica, gazetteer lists, tools
- tools for resource maintenance and extension?
- quality: self-explanatory, robust, comfortable
- speed of processing?
- single document vs. large corpora?
- limitations, suggestions for improvement?
- support for im-/export of a variety of document formats?
12Excerpts from the Corpus
- The Egyptian Museum is open the hours 9am-5pm daily
- The Military Museum is open the hours Summer 8am-5:30pm winter 8am-4:30pm
- Palace Museum is open the hours 8am-5:30pm (summer) 8am-4:30pm (winter)
- 10am-2pm, 6pm-9pm Sat-Wed 6pm-9pm Fri
13UIMA Application
- several annotators (like a pipeline)
- input text: ... Fraunces Tavern Museum 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm
- annotations shown in the diagram: time pattern, interval of times, restrictions, museum pattern, museum information
- techniques named in the diagram: regular expressions; a window covering two time intervals and a restriction; a window covering a museum and opening hours
- output: Prolog facts
museumopen('Fraunces Tavern Museum', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
museumopen('Fraunces Tavern Museum', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
museumopen('Fraunces Tavern Museum', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
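The step from a text snippet to a Prolog fact can be sketched with a few regular expressions. This is a minimal Python illustration and not the annotators used in the experiment: the patterns `MUSEUM`, `RESTRICTION` and `INTERVAL` and the functions `to_24h`, `extract_opening` and `prolog_fact` are hypothetical, and the timestamps are written in ISO 8601 form with colons.

```python
import re
from datetime import date

DAY = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
MUSEUM = re.compile(r"(?:[A-Z][a-z]+ )+Museum\b")   # museum pattern
RESTRICTION = re.compile(rf"\b{DAY}-{DAY}\b")        # e.g. Tuesday-Friday
INTERVAL = re.compile(                               # e.g. 12pm-5pm or 8am-5:30pm
    r"\b(\d{1,2})(?::(\d{2}))?(am|pm)-(\d{1,2})(?::(\d{2}))?(am|pm)\b")

def to_24h(hour, minute, meridiem):
    # Convert a 12-hour clock time to HH:MM:SS.
    hour = int(hour) % 12 + (12 if meridiem == "pm" else 0)
    return f"{hour:02d}:{int(minute or 0):02d}:00"

def extract_opening(text):
    # Combine the museum pattern with the first time interval in the text.
    museum, interval = MUSEUM.search(text), INTERVAL.search(text)
    if museum is None or interval is None:
        return None
    h1, m1, ap1, h2, m2, ap2 = interval.groups()
    restriction = RESTRICTION.search(text)
    return {
        "name": museum.group(0),
        "start": to_24h(h1, m1, ap1),
        "end": to_24h(h2, m2, ap2),
        "restriction": restriction.group(0) if restriction else None,
    }

def prolog_fact(info, day):
    # Render one museumopen/3 fact for a given calendar day.
    stamp = day.isoformat()
    return (f"museumopen('{info['name']}', "
            f"'{stamp}T{info['start']}', '{stamp}T{info['end']}').")

text = "... Fraunces Tavern Museum 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm"
info = extract_opening(text)
fact = prolog_fact(info, date(2005, 12, 1))
```

The sketch keeps the weekday restriction as a separate value and does not use it to filter dates; how the original application combined the two is not stated on the slide.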
14UIMA Results
- information annotated in the documents
- names of museums, hotels
- times, time intervals
- time restrictions
- prices, intervals of prices (hotel prices)
- keywords for museum category
- names of pharaohs (annotated with a correction of misspellings)
- information about hotels and museums is exported into Prolog facts and into a short textual summary
- templates filled with the detected information
- hotels: Price information about Cosmopolitan Hotel: 157
- museums
- Fraunces Tavern Museum
- Open from 12:00:00 to 17:00:00
- Restriction: Tuesday-Friday
15UIMA vs. GATE: Conclusion
- no final judgement on whether to use GATE or UIMA
- depends on
- your task
- task description
- expected results
- which processing resources are necessary
- your preferences for interface
- prefer the Eclipse environment (or other Java editors)
- prefer a comfortable GUI
16UIMA vs. GATE: Conclusion
- GATE
- tools available
- comfortable GUI
- UIMA
- plain framework
- simplified definition of (complex) result structures
- simplified pre- and postprocessing of annotations
- both are extensible
- e.g. for processing German documents
17'German' Extension of Processing Resources
- XDOC document suite
- tools for processing German documents
- tools implemented in Common Lisp
- for UIMA
- Java reimplementation of the tools
- several analysis engines
18XDOC in UIMA
- annotation of
- part-of-speech (Morphix, heuristics)
- semantic categories
- named entities (vehicles, cities, ...)
- a coarse approach to the classification of PPs
- using a maxent library
19UIMA Evaluation
- documentation?
- processing and linguistic resources?
- tools for resource maintenance and extension?
- speed of processing?
- single docs vs. large corpora?
- limitations, suggestions for improvement?
- im-/export of document formats?
- documentation: good
- illustrative examples (tutorial)
- completeness: some parts are described only very briefly
- experience with Eclipse and Java programming is advantageous
20UIMA Evaluation
- processing and linguistic resources?
- annotators only from tutorial
- sentence annotation
- word annotation
- date/time annotators
- examples for using regular expressions etc.
- external resources can be integrated
- lexical resources as external resources (text files)
- existing processing resources
- implementation of an interface is necessary
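A lexical resource kept in a plain text file can drive a simple lookup annotator. The Python sketch below assumes one entry per line; the functions `load_gazetteer` and `annotate_with_gazetteer` are hypothetical and do not reproduce UIMA's external resource mechanism.

```python
import re

def load_gazetteer(lines):
    # One entry per line; blank lines are ignored.
    # Any iterable of lines works, including an open text file.
    return {line.strip() for line in lines if line.strip()}

def annotate_with_gazetteer(text, entries, type_name):
    # Longest entries first, so that "New York" wins over "York".
    ordered = sorted(entries, key=len, reverse=True)
    if not ordered:
        return []
    pattern = "|".join(re.escape(entry) for entry in ordered)
    return [(type_name, m.start(), m.end())
            for m in re.finditer(rf"\b(?:{pattern})\b", text)]

cities = load_gazetteer(["London\n", "New York\n", "York\n", "\n"])
text = "Hotels in New York and London"
found = annotate_with_gazetteer(text, cities, "City")
```

Keeping the list outside the code means it can be extended with a text editor and without recompiling anything.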
21UIMA Evaluation
- tools for resource maintenance and extension?
- specific Eclipse component editors or
- simple text editors
22UIMA Evaluation
- speed of processing?
- faster than GATE?
- in CPE: detailed information about processing time for each module
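The kind of per-module timing report mentioned here can be imitated in a few lines of Python. This is a sketch of the idea only, not of how the CPE measures time; the function `run_timed` and the representation of engines as `(name, function)` pairs are assumptions.

```python
import time
from collections import defaultdict

def run_timed(engines, cases):
    # engines: list of (name, function) pairs, applied in order to every CAS.
    # Returns the processed CASes and the seconds spent in each engine.
    timings = defaultdict(float)
    results = []
    for cas in cases:
        for name, engine in engines:
            started = time.perf_counter()
            cas = engine(cas)
            timings[name] += time.perf_counter() - started
        results.append(cas)
    return results, dict(timings)

# Toy engines working on plain strings instead of real CASes.
engines = [("strip", str.strip), ("upper", str.upper)]
results, timings = run_timed(engines, [" open daily ", "closed "])
```

Summing the time per engine over all documents shows which module dominates the run.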
23UIMA Evaluation
- single docs vs. large corpora?
- Collection Reader
- document(s) from a directory
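Reading one document or a whole corpus from a directory looks the same to the rest of the pipeline. A minimal Python sketch of such a reader follows; the function `read_collection`, the `*.txt` filter and the dictionary layout are hypothetical and only mirror the role of a Collection Reader.

```python
from pathlib import Path

def read_collection(directory, pattern="*.txt", encoding="utf-8"):
    # Yield one CAS-like dict per matching file, in a stable (sorted) order.
    for path in sorted(Path(directory).glob(pattern)):
        yield {
            "uri": path.name,
            "text": path.read_text(encoding=encoding),
            "annotations": [],
        }
```

Because the reader is a generator, documents are loaded one at a time, so a large corpus does not have to fit into memory.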
24UIMA Evaluation
- limitations, suggestions for improvement?
- no limitations
- everything is possible, but the user has to implement or interface it
- wish
- more processing and linguistic resources within the distribution
25UIMA Evaluation
- im-/export of document formats?
- import: CAS Initializer
- export: CAS Consumer
- transform annotations into any other format
- export of
- document annotations
- only annotations
- required: Java application
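The two export variants can be read as "document text together with its annotations" and "annotations only"; that reading is an interpretation. Under this assumption, the Python sketch below shows both, with the hypothetical functions `export_inline` and `export_standoff` standing in for what a CAS Consumer would do in Java.

```python
import json
from xml.sax.saxutils import escape

def export_standoff(annotations):
    # Annotations only: type and offsets, without the document text.
    return json.dumps(
        [{"type": t, "begin": b, "end": e} for t, b, e in annotations])

def export_inline(text, annotations):
    # Document plus annotations as inline tags.
    # Assumes that the annotations do not overlap.
    out, pos = [], 0
    for type_name, begin, end in sorted(annotations, key=lambda a: a[1]):
        out.append(escape(text[pos:begin]))
        out.append(f"<{type_name}>{escape(text[begin:end])}</{type_name}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)

text = "Open 12pm-5pm daily"
annotations = [("TimeInterval", 5, 13)]
inline = export_inline(text, annotations)
standoff = export_standoff(annotations)
```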
26Overview
- Introduction
- First Experiments
- NLP Teaching
- Conclusion
27NLP Teaching
- course: Information Extraction
- aim of the course: to make our students acquainted with information extraction as a basic NLP technology
- UIMA, GATE
- students: computer science, data-knowledge engineering
- skills of the students: programming Java
28NLP Teaching
- different corpora
- news about FIFA world cup 2006 in Germany,
- description of drugs,
- announcements of new books,
- tasks for students
- to develop different analysis engines and combine them for annotation of
- URLs,
- email addresses,
- names of players,
- results of games,
- using regular expressions, external resources,
maximum entropy models
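The regular-expression part of such an exercise might look like the Python sketch below. The patterns are deliberately simple and hypothetical; they are not the students' solutions, and the `Result` pattern assumes scores written as `4:2`.

```python
import re

URL = re.compile(r"https?://[^\s<>\"']+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
RESULT = re.compile(r"\b\d{1,2}:\d{1,2}\b")   # assumed score format, e.g. 4:2

def annotate(text):
    # Run every pattern over the text and collect (type, begin, end) triples.
    annotations = []
    for type_name, pattern in (("URL", URL), ("Email", EMAIL), ("Result", RESULT)):
        annotations += [(type_name, m.start(), m.end())
                        for m in pattern.finditer(text)]
    return sorted(annotations, key=lambda a: a[1])

text = ("Team A beat Team B 4:2; details at "
        "http://www.example.org/match1 or via info@example.org.")
found = [(kind, text[b:e]) for kind, b, e in annotate(text)]
```

Player names are harder to capture with patterns alone, which is where external resources and statistical models come in.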
29NLP Teaching
30UIMA: A Student's View
- easy to handle
- Java programming (environment)
- problems of students
- to understand the dependencies between the several descriptors
- helpful for teaching (future work)
- a 'comparator' of different solutions of students
- which solution is the best, related to a 'master' solution
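One possible reading of such a comparator is to score each student's annotations against the master solution with precision, recall and F1 on exact matches. The slides do not say how the comparison should work, so the measure and the functions `compare` and `rank` in this Python sketch are assumptions.

```python
def compare(student, master):
    # Annotations are (type, begin, end) triples; only exact matches count.
    student, master = set(student), set(master)
    correct = len(student & master)
    precision = correct / len(student) if student else 0.0
    recall = correct / len(master) if master else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def rank(solutions, master):
    # Best solution first, judged by F1 against the master solution.
    return sorted(solutions,
                  key=lambda name: compare(solutions[name], master)["f1"],
                  reverse=True)

master = {("URL", 0, 10), ("Email", 20, 30), ("Player", 40, 50), ("Result", 60, 63)}
solutions = {
    "student_a": {("URL", 0, 10), ("Email", 20, 30)},
    "student_b": {("URL", 0, 10), ("Email", 20, 31)},
}
ranking = rank(solutions, master)
```

Exact matching is strict: an annotation that is off by one character counts as wrong, so a more lenient overlap measure may be fairer in a course setting.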
31Overview
- Introduction
- First Experiments
- NLP Teaching
- Conclusion
32Conclusion
- UIMA
- easy to learn and to handle
- supports the management of
- different annotations
- different processing resources
- integration of external resources (processing resources as well as lexical resources)
- splitting of 'processing steps'
- reader, initializer, analysis engine, consumer
- 'wish-list'
- a kind of JAPE transducer
- interface to GATE's processing resources is available
- 'comparator' for evaluation of solutions
34XDOC in UIMA
35Introduction
really?
36Introduction
- first experiments with UIMA
- processing tourism web sites, news about the FIFA world cup 2006 in Germany, ...
- integration of tools from the XDOC document suite
-
- using UIMA in a course on Information Extraction
37Introduction
- November 2005: Version 1.2.3 of UIMA is available
- "IBM's Unstructured Information Management
Architecture (UIMA) is an architecture and
software framework for creating, discovering,
composing and deploying a broad range of
multi-modal analysis capabilities and integrating
them with search technologies."