What's the difference between Tony Blair and Mother Theresa? Human Language Technology for Preservation

1
What's the difference between Tony Blair and Mother Theresa?
(Human Language Technology for Preservation: return on investment)
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
Dept. Computer Science, University of Sheffield
Alghero, March 2004
2
20th Century Rot
  • 20th Century audio-visual media is rapidly
    disappearing
  • Preservation and restoration are high cost
  • The costs must be justified by increased access
  • Metadata: descriptive information about content
  • Therefore the rest of the talk will cover:
      • rich metadata and semantic access
      • cross-lingual access
      • syndicated delivery
      • repurposable content

3
IT context: the Knowledge Economy and Human Language
  • Gartner, December 2002:
      • taxonomic and hierarchical knowledge mapping and
        indexing will be prevalent in almost all
        information-rich applications
      • through 2012 more than 95% of human-to-computer
        information input will involve textual language
  • A contradiction:
      • to deal with the information deluge we need
        formal knowledge in semantics-based systems
      • our archived history is in informal and ambiguous
        natural language
  • The challenge: to reconcile these two phenomena

4
HLT: Closing the Loop
KEY:
  • MNLG: Multilingual Natural Language Generation
  • OIE: Ontology-aware Information Extraction
  • AIE: Adaptive IE
  • CLIE: Controlled Language IE
[Diagram labels: Human Language; Formal Knowledge (ontologies and instance bases); Semantic Web, Semantic Grid, Semantic Web Services; (M)NLG; OIE; (A)IE; Controlled Language; CLIE]
5
Information Extraction
  • Information Extraction (IE) pulls facts and
    structured information from the content of large
    text collections.
  • Contrast: IE and Information Retrieval
  • NLP history: from NLU to IE
  • Progress driven by quantitative measures:
      • MUC: Message Understanding Conferences
      • ACE: Automatic Content Extraction
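
MUC-style evaluations compare a system's output with a human-annotated key and report precision, recall and F-measure. The following is a minimal sketch of that arithmetic, assuming exact-match scoring over sets of (text, type) pairs. The function name `score` and the sample data are illustrative and are not taken from any MUC or ACE scorer.

```python
def score(predicted, gold):
    """Return (precision, recall, f1) for exact-match extraction results."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical answer key and system output for one document.
gold = {
    ("Sheffield", "Location"),
    ("Gartner", "Organization"),
    ("March 2004", "Date"),
}
predicted = {
    ("Sheffield", "Location"),
    ("March 2004", "Date"),
    ("Alghero", "Organization"),  # wrong type, so it counts as an error
}

precision, recall, f1 = score(predicted, gold)
```

Two of the three predictions match the key, and two of the three key entries are found, so precision, recall and F-measure all come out at two thirds.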

6
IE Example
  • The shiny red rocket was fired on Tuesday. It is
    the brainchild of Dr. Big Head. Dr. Head is a
    staff scientist at We Build Rockets Inc.
  • Named entities (NE): "rocket", "Tuesday", "Dr.
    Head" and "We Build Rockets"
  • Co-reference resolution (CO): "it" refers to the
    rocket; "Dr. Head" and "Dr. Big Head" are the
    same
  • Template Elements (TE): the rocket is "shiny red"
    and Head's "brainchild".
  • Template Relations (TR): Dr. Head works for We
    Build Rockets Inc.
  • Scenario Templates (ST): a rocket launching event
    occurred with the various participants.
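
The five levels above can be pictured as increasingly structured data. This is a minimal sketch of one possible representation; the `Entity` class, the type labels ("Artifact", "Person" and so on) and the slot names are assumptions made for illustration, not a MUC template format.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    kind: str                                        # NE: entity type (assumed labels)
    mentions: list = field(default_factory=list)     # CO: co-referring strings
    attributes: dict = field(default_factory=dict)   # TE: descriptive slots


rocket = Entity(
    "rocket", "Artifact",
    mentions=["The shiny red rocket", "It"],
    attributes={"description": "shiny red", "brainchild_of": "Dr. Big Head"},
)
head = Entity("Dr. Big Head", "Person", mentions=["Dr. Big Head", "Dr. Head"])
company = Entity("We Build Rockets Inc.", "Organization")
tuesday = Entity("Tuesday", "Date")

# NE: the entities found in the text.
entities = [rocket, head, company, tuesday]

# TR: relations between entities.
relations = [(head.name, "works_for", company.name)]

# ST: an event that ties the entities together.
scenario = {
    "event": "rocket_launch",
    "vehicle": rocket.name,
    "date": tuesday.name,
    "participants": [head.name, company.name],
}
```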

7
Performance levels
  • (Extensive quantitative evaluation since the early
    90s: mainly on text and ASR, now also video OCR)
  • Vary according to text type, domain, scenario,
    language
  • NE: up to 97% (tested in English, Spanish,
    Japanese, Chinese, others)
  • CO: 60-70% resolution
  • TE: 80%
  • TR: 75-80%
  • ST: 60% (but human level may be only 80%)

8
Ontology-based IE
XYZ was established on 03 November 1978 in
London. It opened a plant in Bulgaria in
[Diagram: Ontology KB. Classes: Location, Company, City, Country. Relations: type, HQ, partOf, establOn. Value: 03/11/1978]
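
One way to read this slide is that ontology-based IE turns the sentence into facts attached to ontology classes and relations. The sketch below stores such facts as (subject, predicate, object) triples. The class and relation names come from the diagram labels, but which nodes each edge connects is an assumption, since the original figure is not available; in particular the `HQ` link between XYZ and London is a guess.

```python
# Each fact is a (subject, predicate, object) triple.
triples = {
    ("XYZ", "type", "Company"),
    ("London", "type", "City"),
    ("Bulgaria", "type", "Country"),
    ("XYZ", "establOn", "03/11/1978"),
    ("XYZ", "HQ", "London"),  # assumed reading of the HQ edge in the diagram
}


def objects(subject, predicate):
    """Return all objects recorded for a subject/predicate pair."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def instances_of(cls):
    """Return all instances whose type is the given class."""
    return sorted(s for s, p, o in triples if p == "type" and o == cls)


established = objects("XYZ", "establOn")
companies = instances_of("Company")
```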
9
Classes, instances metadata
"Gordon Brown met George Bush during his two day visit."
<metadata>
  <DOC-ID>http:// 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>Person</class>
    <inst>Person67890</inst>
  </Annotation>
</metadata>
[Figure panels: classes and instances before; classes and instances after. Label: Bush]
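
Stand-off metadata of this kind can be read back and checked against the text it annotates. The sketch below parses a single annotation with Python's standard `xml.etree.ElementTree`. It assumes that offsets are zero-based character positions with an exclusive end, and the `DOC-ID` value is a placeholder because the original URL is not recoverable.

```python
import xml.etree.ElementTree as ET

text = "Gordon Brown met George Bush during his two day visit."

# One annotation in the slide's format; the DOC-ID is a placeholder URL.
metadata_xml = """
<metadata>
  <DOC-ID>http://example.org/1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
</metadata>
""".strip()

root = ET.fromstring(metadata_xml)
annotations = []
for node in root.findall("Annotation"):
    start = int(node.findtext("s_offset"))  # int() ignores surrounding spaces
    end = int(node.findtext("e_offset"))
    annotations.append({
        "covered_text": text[start:end],    # text the offsets point at
        "string": node.findtext("string"),
        "class": node.findtext("class"),
        "inst": node.findtext("inst"),
    })

# An annotation is consistent when its offsets recover its own string.
consistent = all(a["covered_text"] == a["string"] for a in annotations)
```

The class says what kind of thing was found, while the instance identifier distinguishes this particular person from every other `Person`.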
10
An example: the MUMIS project
  • Multimedia Indexing and Searching Environment
  • Composite index of a multimedia programme from
    multiple sources in different languages
  • ASR, video processing, Information Extraction
    (Dutch, English, German), merging, user interface
  • University of Twente/CTIT, University of
    Sheffield, University of Nijmegen, DFKI, MPI,
    ESTEAM AB, VDA
  • An important experimental result: multiple
    sources for the same events can improve extraction
    quality
  • PrestoSpace: applications in news and sports
    archiving
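
To illustrate why several sources for the same event can help, the sketch below merges slot values by simple majority vote. This is not the MUMIS merging method, which the slides do not describe; the function `merge_event`, the slot names and the sample data are all hypothetical.

```python
from collections import Counter


def merge_event(extractions):
    """Merge per-source slot values for one event by majority vote."""
    merged = {}
    slots = {slot for extraction in extractions for slot in extraction}
    for slot in sorted(slots):
        # Ignore sources that did not fill this slot.
        votes = Counter(
            e[slot] for e in extractions if e.get(slot) is not None
        )
        if votes:
            merged[slot] = votes.most_common(1)[0][0]
    return merged


# Hypothetical extractions of the same goal from three sources.
sources = [
    {"type": "goal", "player": "Player A", "minute": 17},  # English report
    {"type": "goal", "player": "Player A", "minute": 18},  # German report
    {"type": "goal", "player": None, "minute": 17},        # ASR transcript
]

merged = merge_event(sources)
```

Here the transcript misses the player and one report disagrees on the minute, but the merged record still recovers both values.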

11
Semantic Query
Not: "goal Beckham" (includes e.g. missed goals,
or "this was not a goal"). Instead: goal events
with scorer David Beckham
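
The contrast can be sketched with toy data. Keyword search matches any text that contains both words, while a semantic query filters structured event records by type and role. The transcripts, event records and function names below are hypothetical and only illustrate the idea.

```python
# Hypothetical commentary lines and the events extracted from them.
transcripts = [
    "Beckham scores a goal from the free kick",
    "Beckham shoots but it is not a goal",
    "A teammate scores a goal after a pass from Beckham",
]
events = [
    {"type": "goal", "scorer": "David Beckham"},
    {"type": "missed_shot", "player": "David Beckham"},
    {"type": "goal", "scorer": "Teammate", "assist": "David Beckham"},
]


def keyword_search(docs, *terms):
    """Return documents that contain every term, ignoring case."""
    return [d for d in docs if all(t.lower() in d.lower() for t in terms)]


def semantic_search(records, **constraints):
    """Return event records whose slots equal all given constraints."""
    return [
        r for r in records
        if all(r.get(slot) == value for slot, value in constraints.items())
    ]


keyword_hits = keyword_search(transcripts, "goal", "Beckham")
semantic_hits = semantic_search(events, type="goal", scorer="David Beckham")
```

The keyword query returns all three lines, including the miss and the assist, while the semantic query returns only the goal that Beckham scored.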
12
The results: England win!
13
PSpace: good news and bad news
  • The good news: PrestoSpace has some of the world
    leaders on AI and metadata
  • The bad news: AI always fails
  • How does the machine tell the difference between
    "Mother Theresa is a saint" and "Tony Blair is a
    saint"? (Or, who tells Google which statement is
    important?)
  • Other web users do, by linking (also cf. Amazon)
  • Two solutions to the AI problem:
      • allow archivists and users to build their own
        (simple specific models can succeed, but the cost
        may be too high)
      • use recommender systems to make the user an
        archivist's assistant (researchers and students
        may barter for access)
  • Any route to searchable content!

14
Syndication and Merging
  • The web promotes diversity, but also
    fragmentation
  • Original web: separate content and presentation
    ("this is a header", not "set in 20 point bold
    font")
  • Now: many incompatible/inaccessible interfaces
  • Archives need to:
      • pool their impact: syndication in networked
        communities
      • support repurposable content
  • Therefore data must be presentation independent
  • Candidate technologies: XML, RSS, RDF, OWL
    (semantic web)
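
Presentation-independent data means one record can feed several outputs. The sketch below renders the same record as an RSS `item` element and as plain text, using only the Python standard library. The record contents, the URL and the function names are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical archive record, stored without any presentation details.
record = {
    "title": "Cup final highlights",
    "link": "http://example.org/archive/item/1",
    "description": "Goal events with commentary in three languages.",
}


def to_rss_item(rec):
    """Serialise the record as an RSS <item> element for syndication."""
    item = ET.Element("item")
    for tag in ("title", "link", "description"):
        ET.SubElement(item, tag).text = rec[tag]
    return ET.tostring(item, encoding="unicode")


def to_plain_text(rec):
    """Render the same record for a text-only interface."""
    return f"{rec['title']}\n{rec['description']}\n{rec['link']}"


rss = to_rss_item(record)
plain = to_plain_text(record)
```

Because the record itself says nothing about fonts or layout, adding another output format only requires another small rendering function.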

15
GATE, a General Architecture for Text Engineering, is...
  • An architecture: a macro-level organisational
    picture for LE software systems.
  • A framework: for programmers, GATE is an
    object-oriented class library that implements the
    architecture.
  • A development environment: for language engineers,
    a graphical development environment.
  • GATE comes with...
      • Free components, and wrappers for other people's
      • Tools for: evaluation; visualise/edit;
        persistence; IR; IE; dialogue; ontologies; etc.
  • Free software (LGPL) at http://gate.ac.uk/download/
  • Used by thousands of people at hundreds of sites

16
A bit of a nuisance (GATE users)
  • Thousands of users at hundreds of sites. A
    representative sample:
      • the American National Corpus project
      • the Perseus Digital Library project, Tufts
        University, US
      • Longman Pearson publishing, UK
      • Merck KGaA, Germany
      • Canon Europe, UK
      • Knight Ridder, US
      • BBN (leading HLT research lab), US
      • SMEs inc. Sirma AI Ltd., Bulgaria
      • Imperial College, London, the University of
        Manchester, UMIST, the University of Karlsruhe,
        Vassar College, the University of Southern
        California and a large number of other UK, US and
        EU Universities
      • UK and EU projects inc. MyGrid, CLEF, dotkom,
        AMITIES, Cub Reporter, EMILLE, Poesia...
  • GATE team projects. Past:
      • Conceptual indexing: MUMIS, automatic semantic
        indices for sports video
      • MUSE, cross-genre entity finder
      • HSL, Health-and-safety IE
      • Old Bailey: collaboration with HRI on 17th
        century court reports
      • Multiflora: plant taxonomy text analysis for
        biodiversity research e-science
  • Present:
      • Advanced Knowledge Technologies: 12m UK five-site
        collaborative project
      • EMILLE: S. Asian languages corpus
      • ACE / TIDES: Arabic, Chinese NE
      • JHU summer w/s on semtagging
  • Future:
      • Five new projects inc. PrestoSpace

17
GATE infrastructure for semantic metadata extraction
  • Combines learning and rule-based methods (new
    work on mixed-initiative learning)
  • Allows combination of IE and IR
  • Enables use of large-scale linguistic resources
    for IE, such as WordNet
  • Supports ontologies as part of IE applications -
    Ontology-Based IE
  • Supports languages from Hindi to Chinese, Italian
    to German

18
(Not the) MAD Semantics Architecture
[Architecture diagram. Labels: Text Sources (EN, IT, ...); AV Signals; ASR, etc.; Signal md, Transcriptions; IE; Formal Text; Final Annotations; Multilingual Conceptual Q&A]
19
Archiving is not a luxury
  • C21st: all the C20th mistakes but bigger and
    better?
  • If you don't know where you've been, how can you
    know where you're going?
  • Archives: ammunition in the war on ignorance
  • Ammunition is useless if you can't find it: new
    technology must make our history accessible to
    all, for all our futures
  • More information:
  • http://gate.ac.uk/ http://www.prestospace.org/