Title: What's the difference between Tony Blair and Mother Theresa? Human Language Technology for Preservation
1. What's the difference between Tony Blair and Mother Theresa?
(Human Language Technology for Preservation: return on investment)
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Dept. Computer Science, University of Sheffield
Alghero, March 2004
2. 20th Century Rot
- 20th Century audio-visual media is rapidly disappearing
- Preservation and restoration are high cost
- The costs must be justified by increased access
- Metadata: descriptive information about content
- Therefore the rest of the talk will cover:
  - rich metadata and semantic access
  - cross-lingual access
  - syndicated delivery
  - repurposable content
3. IT context: the Knowledge Economy and Human Language
- Gartner, December 2002:
  - taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
  - through 2012 more than 95% of human-to-computer information input will involve textual language
- A contradiction:
  - to deal with the information deluge we need formal knowledge in semantics-based systems
  - our archived history is in informal and ambiguous natural language
- The challenge: to reconcile these two phenomena
4. HLT: Closing the Loop
KEY
- MNLG: Multilingual Natural Language Generation
- OIE: Ontology-aware Information Extraction
- AIE: Adaptive IE
- CLIE: Controlled Language IE
[Diagram labels: Human Language; Controlled Language; OIE; (A)IE; CLIE; (M)NLG; Formal Knowledge (ontologies and instance bases); Semantic Web, Semantic Grid, Semantic Web Services.]
5. Information Extraction
- Information Extraction (IE) pulls facts and structured information from the content of large text collections.
- Contrast IE and Information Retrieval
- NLP history: from NLU to IE
- Progress driven by quantitative measures
  - MUC: Message Understanding Conferences
  - ACE: Automatic Content Extraction
6. IE Example
- "The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc."
- Named entities (NE): "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
- Co-reference resolution (CO): "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
- Template Elements (TE): the rocket is "shiny red" and Head's "brainchild".
- Template Relations (TR): Dr. Head works for We Build Rockets Inc.
- Scenario Templates (ST): a rocket launching event occurred with the various participants.
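To make the NE and CO levels concrete, here is a minimal Python sketch that runs hand-written regular expressions over the example text. The patterns, the function names and the name-matching rule are assumptions invented for this illustration; they are tuned to this one passage and do not represent how GATE or any MUC system is implemented.

```python
import re

TEXT = ("The shiny red rocket was fired on Tuesday. It is the brainchild of "
        "Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.")

# Toy patterns, one per entity type (hypothetical, tuned to this text only).
PATTERNS = {
    "Artifact": r"\brocket\b",
    "Date": r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b",
    "Person": r"Dr\.(?: [A-Z][a-z]+)+",
    "Organization": r"(?:[A-Z][a-z]+ )+Inc\.",
}

def find_entities(text):
    """Return (start, end, type, string) tuples sorted by position."""
    found = []
    for etype, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            found.append((m.start(), m.end(), etype, m.group()))
    return sorted(found)

def same_person(a, b):
    """Toy co-reference rule: one name's tokens are a subset of the other's."""
    ta, tb = set(a.split()), set(b.split())
    return ta <= tb or tb <= ta

def resolve_it(text, entities):
    """Toy pronoun rule: 'It' refers to the nearest preceding Artifact."""
    m = re.search(r"\bIt\b", text)
    if m is None:
        return None
    prior = [e for e in entities if e[2] == "Artifact" and e[1] <= m.start()]
    return prior[-1][3] if prior else None

entities = find_entities(TEXT)
for start, end, etype, string in entities:
    print(f"{etype:12} {start:3}-{end:3} {string}")
print("It ->", resolve_it(TEXT, entities))
print("Same person:", same_person("Dr. Head", "Dr. Big Head"))
```

Rules this simple break on almost any other text, which is one reason the field relies on the quantitative evaluations mentioned earlier.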
7. Performance levels
- (Extensive quantitative evaluation since early 90s; mainly on text, ASR; now also video OCR)
- Vary according to text type, domain, scenario, language
- NE: up to 97% (tested in English, Spanish, Japanese, Chinese, others)
- CO: 60-70% resolution
- TE: 80%
- TR: 75-80%
- ST: 60% (but human level may be only 80%)
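The slide does not name the metric behind these percentages. MUC-style evaluations typically report precision, recall and their harmonic mean (F-measure), so the sketch below assumes that kind of scoring. The function name and the gold and predicted annotations are invented for illustration only.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted annotations against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations: (start offset, end offset, entity type).
gold = {(0, 12, "Person"), (17, 28, "Person"), (40, 46, "Location")}
predicted = {(0, 12, "Person"), (17, 28, "Person"), (50, 55, "Date")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Exact-match scoring like this is strict: an annotation with the right type but a boundary off by one character counts as both a false positive and a miss.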
8. Ontology-based IE
"XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in ..."
[Diagram: Ontology & KB. Classes shown: Location, Company, City, Country. Relations shown: type, HQ, partOf, establOn. Value shown: 03/11/1978.]
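One way to picture what ontology-based IE produces for this sentence is a set of subject-predicate-object triples typed against a small class hierarchy. The sketch below is an assumption-laden reading of the diagram labels: that City and Country are subclasses of Location, and that HQ links XYZ to London, are guesses rather than facts stated in the slide.

```python
# Class hierarchy (assumed): child class -> parent class, None for roots.
SUBCLASS_OF = {
    "Location": None,
    "Company": None,
    "City": "Location",
    "Country": "Location",
}

# Instance-level facts an IE system might extract from the sentence.
# Predicate names follow the diagram labels; the wiring is an assumption.
facts = {
    ("XYZ", "type", "Company"),
    ("London", "type", "City"),
    ("Bulgaria", "type", "Country"),
    ("XYZ", "HQ", "London"),
    ("XYZ", "establOn", "03/11/1978"),
}

def is_a(entity, cls, facts):
    """True if entity has type cls directly or via the class hierarchy."""
    for subj, pred, obj in facts:
        if subj == entity and pred == "type":
            current = obj
            while current is not None:
                if current == cls:
                    return True
                current = SUBCLASS_OF.get(current)
    return False

subjects = {s for s, _, _ in facts}
locations = sorted(e for e in subjects if is_a(e, "Location", facts))
print(locations)
```

The point of the hierarchy is that a query for Location finds London and Bulgaria even though neither was annotated with that class directly.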
9. Classes, instances & metadata
"Gordon Brown met George Bush during his two day visit."
<metadata>
  <DOC-ID>http:// 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>Person</class>
    <inst>Person67890</inst>
  </Annotation>
</metadata>
[Diagrams: "Classes/instances before" and "Classes/instances after"; label shown: Bush.]
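The metadata above is stand-off annotation: the text is left untouched and each annotation points into it by character offsets. The following Python sketch builds a record of the same shape with the standard library's `xml.etree.ElementTree`. The function name and the document identifier `doc-1` are hypothetical, and the offsets are computed with Python's zero-based, end-exclusive indexing, so they need not coincide with the figures on the slide.

```python
import xml.etree.ElementTree as ET

def annotate(doc_id, text, mentions):
    """Build stand-off metadata; mentions are (string, class, instance id)."""
    root = ET.Element("metadata")
    ET.SubElement(root, "DOC-ID").text = doc_id
    for string, cls, inst in mentions:
        start = text.find(string)  # first occurrence only, for simplicity
        if start < 0:
            continue  # mention not present in the text
        ann = ET.SubElement(root, "Annotation")
        ET.SubElement(ann, "s_offset").text = str(start)
        ET.SubElement(ann, "e_offset").text = str(start + len(string))
        ET.SubElement(ann, "string").text = string
        ET.SubElement(ann, "class").text = cls
        ET.SubElement(ann, "inst").text = inst
    return root

text = "Gordon Brown met George Bush during his two day visit."
root = annotate("doc-1", text, [
    ("Gordon Brown", "Person", "Person12345"),
    ("George Bush", "Person", "Person67890"),
])

# Read the offsets back and check that they recover the annotated strings.
spans = []
for ann in root.iter("Annotation"):
    s, e = int(ann.findtext("s_offset")), int(ann.findtext("e_offset"))
    assert text[s:e] == ann.findtext("string")
    spans.append((s, e))

print(ET.tostring(root, encoding="unicode"))
```

Keeping the instance identifier separate from the class is what lets two mentions of the same person, in this or another document, resolve to one entry in the knowledge base.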
10. An example: the MUMIS project
- Multimedia Indexing and Searching Environment
- Composite index of a multimedia programme from multiple sources in different languages
- ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
- University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
- An important experimental result: multiple sources for same events can improve extraction quality
- PrestoSpace: applications in news and sports archiving
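The slides do not describe how MUMIS merged its sources. As a rough intuition for why several sources can help, the sketch below uses simple majority voting: an event is kept only if enough independent sources report it. The function name, the threshold and the sample events are all invented for illustration.

```python
from collections import Counter

def merge_by_vote(source_outputs, min_votes=2):
    """Keep events reported by at least min_votes distinct sources."""
    votes = Counter()
    for events in source_outputs:
        votes.update(set(events))  # each source votes once per event
    return {event for event, n in votes.items() if n >= min_votes}

# Hypothetical extraction results: (minute, event type, player).
dutch = [(23, "goal", "Player A"), (41, "foul", "Player B")]
english = [(23, "goal", "Player A"), (67, "goal", "Player C")]
german = [(23, "goal", "Player A"), (67, "goal", "Player C"), (80, "goal", "Player D")]

merged = merge_by_vote([dutch, english, german])
print(sorted(merged))
```

Events seen by only one source are dropped as likely extraction errors, at the cost of also losing genuine events that only one source happened to mention.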
11. Semantic Query
Not: "goal Beckham" (includes e.g. missed goals, or "this was not a goal")
Instead: goal events with scorer David Beckham
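The contrast can be sketched in a few lines of Python: a keyword query matches words in the transcript, while a semantic query matches fields of extracted event records. The `Event` class, both search functions and the sample records are hypothetical and invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    kind: str                # e.g. "goal" or "miss"
    scorer: Optional[str]    # filled in by IE for goal events
    transcript: str          # the commentary the event was extracted from

EVENTS = [
    Event("goal", "David Beckham", "Beckham scores from a free kick"),
    Event("miss", None, "Beckham misses; no goal"),
    Event("goal", "Michael Owen", "Owen's goal, after a pass from Beckham"),
]

def keyword_search(events, words):
    """Return events whose transcript contains every query word."""
    words = [w.lower() for w in words]
    return [e for e in events if all(w in e.transcript.lower() for w in words)]

def semantic_search(events, kind, scorer):
    """Return events of the given kind with the given scorer."""
    return [e for e in events if e.kind == kind and e.scorer == scorer]

by_keyword = keyword_search(EVENTS, ["goal", "Beckham"])
by_meaning = semantic_search(EVENTS, "goal", "David Beckham")
print(len(by_keyword), len(by_meaning))
```

In this toy data the keyword query returns the miss and another player's goal, and overlooks the real Beckham goal because its commentary never uses the word "goal"; the semantic query returns exactly that one event.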
12. The results: England win!
13. PSpace: good news and bad news
- The good news: PrestoSpace has some of the world leaders on AI and metadata
- The bad news: AI always fails
- How does the machine tell the difference between "Mother Theresa is a saint" and "Tony Blair is a saint"? (Or, who tells Google which statement is important?)
  - Other web users do, by linking (also cf. Amazon)
- Two solutions to the AI problem:
  - allow archivists and users to build their own (simple specific models can succeed, but the cost may be too high)
  - use recommender systems to make the user an archivist's assistant (researchers and students may barter for access)
- Any route to searchable content!
14. Syndication and Merging
- The web promotes diversity, but also fragmentation
- Original web: separate content and presentation ("this is a header", not "set in 20 point bold font")
- Now: many incompatible/inaccessible interfaces
- Archives need to:
  - pool their impact: syndication in networked communities
  - support repurposable content
- Therefore data must be presentation independent
- Candidate technologies: XML, RSS, RDF, OWL (semantic web)
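As a small illustration of presentation-independent data, the sketch below serialises an archive item as an RSS 2.0 feed using Python's standard `xml.etree.ElementTree`. The feed carries only what the item is (title, link, description), leaving fonts and layout to whoever consumes it. The function name, the `example.org` URLs and the item text are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

def build_feed(title, link, description, items):
    """Serialise a channel and its items as an RSS 2.0 document string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, value in (("title", title), ("link", link),
                       ("description", description)):
        ET.SubElement(channel, tag).text = value
    for item in items:
        node = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(node, tag).text = item[tag]
    return ET.tostring(rss, encoding="unicode")

feed = build_feed(
    "Example archive: new items",           # hypothetical channel
    "http://example.org/archive",
    "Recently catalogued programmes",
    [{
        "title": "Sports report",
        "link": "http://example.org/archive/items/1",
        "description": "Goal events indexed by scorer",
    }],
)
print(feed)
```

Because the same feed can be read by any aggregator, several archives publishing in this way can be merged into one listing without agreeing on a common web interface.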
15. GATE, a General Architecture for Text Engineering, is...
- An architecture: a macro-level organisational picture for LE software systems.
- A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
- A development environment: for language engineers, a graphical development environment.
- GATE comes with...
  - Free components, and wrappers for other people's
  - Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
- Free software (LGPL) at http://gate.ac.uk/download/
- Used by thousands of people at hundreds of sites
16. A bit of a nuisance (GATE users)
- Thousands of users at hundreds of sites. A representative sample:
  - the American National Corpus project
  - the Perseus Digital Library project, Tufts University, US
  - Longman Pearson publishing, UK
  - Merck KGaA, Germany
  - Canon Europe, UK
  - Knight Ridder, US
  - BBN (leading HLT research lab), US
  - SMEs inc. Sirma AI Ltd., Bulgaria
  - Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU universities
  - UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...
- GATE team projects. Past:
  - Conceptual indexing: MUMIS, automatic semantic indices for sports video
  - MUSE, cross-genre entity finder
  - HSL, Health-and-safety IE
  - Old Bailey: collaboration with HRI on 17th century court reports
  - Multiflora: plant taxonomy text analysis for biodiversity research e-science
- Present:
  - Advanced Knowledge Technologies: 12m UK five site collaborative project
  - EMILLE: S. Asian languages corpus
  - ACE / TIDES: Arabic, Chinese NE
  - JHU summer workshop on semtagging
- Future:
  - Five new projects inc. PrestoSpace
17. GATE infrastructure for semantic metadata extraction
- Combines learning and rule-based methods (new work on mixed-initiative learning)
- Allows combination of IE and IR
- Enables use of large-scale linguistic resources for IE, such as WordNet
- Supports ontologies as part of IE applications: Ontology-Based IE
- Supports languages from Hindi to Chinese, Italian to German
18. (Not the) MAD Semantics Architecture
[Diagram labels: Text Sources (EN, IT, ...); AV Signals; ASR, etc.; Signal md, Transcriptions; IE (several instances); Formal Text (repeated); Final Annotations; Multilingual Conceptual Q&A.]
19. Archiving is not a luxury
- C21st: all the C20th mistakes but bigger & better?
- If you don't know where you've been, how can you know where you're going?
- Archives: ammunition in the war on ignorance
- Ammunition is useless if you can't find it: new technology must make our history accessible to all, for all our futures
- More information:
  - http://gate.ac.uk/ http://www.prestospace.org/