Metadata Extraction: Human Language Technology and the Semantic Web

Transcript and Presenter's Notes


1
Metadata Extraction: Human Language Technology and the Semantic Web
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard
SEKT meeting, London, 21 January 2004
2
The Knowledge Economy and Human Language
  • Gartner, December 2002
  • taxonomic and hierarchical knowledge mapping and
    indexing will be prevalent in almost all
    information-rich applications
  • through 2012 more than 95% of human-to-computer
    information input will involve textual language
  • A contradiction: formal knowledge in
    semantics-based systems vs. ambiguous informal
    natural language
  • The challenge: to reconcile these two opposing
    tendencies

3
HLT and Knowledge: Closing the Language Loop
KEY:
  • MNLG: Multilingual Natural Language Generation
  • OIE: Ontology-aware Information Extraction
  • AIE: Adaptive IE
  • CLIE: Controlled Language IE
[Diagram labels: (M)NLG; Semantic Web, Semantic Grid, Semantic Web Services; Formal Knowledge (ontologies and instance bases); Human Language; OIE; (A)IE; Controlled Language; CLIE]
4
Structure of the Tutorial
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation

5
Information Extraction
  • Information Extraction (IE) pulls facts and
    structured information from the content of large
    text collections.
  • Contrast: IE and Information Retrieval
  • NLP history: from NLU to IE
  • Progress driven by quantitative measures
  • MUC: Message Understanding Conferences
  • ACE: Automatic Content Extraction

6
MUC-7 tasks
  • Held in 1997, around 15 participants inc. 2 UK.
    Broke IE down into component tasks:
  • NE: Named Entity recognition and typing
  • CO: co-reference resolution
  • TE: Template Elements
  • TR: Template Relations
  • ST: Scenario Templates

7
An Example
  • NE: entities are "rocket", "Tuesday", "Dr. Head"
    and "We Build Rockets"
  • CO: "it" refers to the rocket; "Dr. Head" and
    "Dr. Big Head" are the same
  • TE: the rocket is "shiny red" and Head's
    "brainchild".
  • TR: Dr. Head works for We Build Rockets Inc.
  • ST: a rocket launching event occurred with the
    various participants.
  • Source text: The shiny red rocket was fired on
    Tuesday. It is the brainchild of Dr. Big Head.
    Dr. Head is a staff scientist at We Build Rockets Inc.

8
Performance levels
  • Vary according to text type, domain, scenario,
    language
  • NE: up to 97% (tested in English, Spanish,
    Japanese, Chinese)
  • CO: 60-70% resolution
  • TE: 80%
  • TR: 75-80%
  • ST: 60% (but human level may be only 80%)

9
What are Named Entities?
  • NE involves identification of proper names in
    texts, and classification into a set of
    predefined categories of interest
  • Person names
  • Organizations (companies, government
    organisations, committees, etc)
  • Locations (cities, countries, rivers, etc)
  • Date and time expressions

10
What are Named Entities (2)
  • Other common types: measures (percent, money,
    weight etc.), email addresses, Web addresses,
    street addresses, etc.
  • Some domain-specific entities: names of drugs,
    medical conditions, names of ships, bibliographic
    references etc.
  • MUC-7 entity definition guidelines [Chinchor97]
  • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

11
What are NOT NEs (MUC-7)
  • Artefacts: Wall Street Journal
  • Common nouns referring to named entities: the
    company, the committee
  • Names of groups of people and things named after
    people: the Tories, the Nobel prize
  • Adjectives derived from names: Bulgarian,
    Chinese
  • Numbers which are not times, dates, percentages,
    and money amounts

12
Basic Problems in NE
  • Variation of NEs, e.g. John Smith, Mr Smith,
    John.
  • Ambiguity of NE types: John Smith (company vs.
    person)
  • May (person vs. month)
  • Washington (person vs. location)
  • 1945 (date vs. time)
  • Ambiguity with common words, e.g. "may"

13
More complex problems in NE
  • Issues of style, structure, domain, genre etc.
  • Punctuation, spelling, spacing, formatting, ...
    all have an impact
  • Dept. of Computing and Maths
  • Manchester Metropolitan University
  • Manchester
  • United Kingdom
  • Tell me more about Leonardo
  • Da Vinci

14
Structure of the Tutorial
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation

15
Corpora and System Development
  • Corpora are typically divided into a training and
    a testing portion
  • Rules/Learning algorithms are trained on the
    training part
  • Tuned on the testing portion in order to optimise:
  • Rule priorities, rule effectiveness, etc.
  • Parameters of the learning algorithm and the
    features used
  • Evaluation set: the best system configuration is
    run on this data and the system performance is
    obtained
  • No further tuning once evaluation set is used!

16
Some NE Annotated Corpora
  • MUC-6 and MUC-7 corpora - English
  • CoNLL shared task corpora:
    http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German;
    http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
  • TIDES surprise language exercise (NEs in Cebuano
    and Hindi)
  • ACE English - http://www.ldc.upenn.edu/Projects/ACE/

17
The MUC-7 corpus
  • 100 documents in SGML
  • News domain
  • Named Entities
  • 1880 Organizations (46%)
  • 1324 Locations (32%)
  • 887 Persons (22%)
  • Inter-annotator agreement very high (97%)
  • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf

18
The MUC-7 Corpus (2)
  • <ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>,
    <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD;
    Working in chilly temperatures
    <TIMEX TYPE="DATE">Wednesday</TIMEX>
    <TIMEX TYPE="TIME">night</TIMEX>,
    <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews
    readied the space shuttle Endeavour for launch on
    a Japanese satellite retrieval mission.
  • <p>
  • Endeavour, with an international crew of six, was
    set to blast off from the
    <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX>
    on <TIMEX TYPE="DATE">Thursday</TIMEX>
    at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>,
    the start of a 49-minute launching period. The
    <TIMEX TYPE="DATE">nine day</TIMEX> shuttle
    flight was to be the 12th launched in darkness.

19
ACE: Towards Semantic Tagging of Entities
  • MUC NE tags segments of text whenever that text
    represents the name of an entity
  • In ACE (Automatic Content Extraction), these
    names are viewed as mentions of the underlying
    entities. The main task is to detect (or infer)
    the mentions in the text of the entities
    themselves
  • Rolls together the NE and CO tasks
  • Domain- and genre-independent approaches
  • ACE corpus contains newswire, broadcast news (ASR
    output and cleaned), and newspaper reports (OCR
    output and cleaned)

20
ACE Entities
  • Dealing with:
  • Proper names, e.g., England, Mr. Smith, IBM
  • Pronouns, e.g., he, she, it
  • Nominal mentions: the company, the spokesman
  • Identify which mentions in the text refer to
    which entities, e.g.,
  • Tony Blair, Mr. Blair, he, the prime minister, he
  • Gordon Brown, he, Mr. Brown, the chancellor

21
ACE Example
    <entity ID="ft-airlines-27-jul-2001-2"
            GENERIC="FALSE"
            entity_type = "ORGANIZATION">
      <entity_mention ID="M003"
                      TYPE = "NAME"
                      string = "National Air Traffic Services">
      </entity_mention>
      <entity_mention ID="M004"
                      TYPE = "NAME"
                      string = "NATS">
      </entity_mention>
      <entity_mention ID="M005"
                      TYPE = "PRO"
                      string = "its">
      </entity_mention>
      <entity_mention ID="M006"
                      TYPE = "NAME"
                      string = "Nats">
      </entity_mention>

22
Annotation Tools: Alembic, GATE, ...
23
Performance Evaluation
  • Evaluation metric: mathematically defines how to
    measure the system's performance against a
    human-annotated gold standard
  • Scoring program: implements the metric and
    provides performance measures
  • For each document and over the entire corpus
  • For each type of NE

24
The Evaluation Metric
  • Precision = correct answers / answers produced
  • Recall = correct answers / total possible correct
    answers
  • Trade-off between precision and recall
  • F-Measure = (β² + 1)PR / (β²R + P) [van Rijsbergen
    75]
  • β reflects the weighting between precision and
    recall, typically β = 1
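The three measures above can be sketched in a few lines of Python. This is only an illustration: the function name and the example counts are made up, and the F-measure follows the arrangement of terms given on this slide (some references write the denominator with the roles of P and R swapped, which makes no difference when β = 1).

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Return (precision, recall, f_measure) from raw answer counts.

    correct:  answers the system got right
    produced: all answers the system produced
    possible: all answers in the gold standard
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    # F-measure as written on the slide: (b^2 + 1)PR / (b^2 R + P)
    denominator = beta ** 2 * recall + precision
    f_measure = (beta ** 2 + 1) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 100 answers produced, 80 of them correct, 160 in the gold standard.
p, r, f = precision_recall_f(correct=80, produced=100, possible=160)
```

With these counts precision is higher than recall, and the balanced F-measure falls between the two.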

25
The Evaluation Metric (2)
  • We may also want to take account of partially
    correct answers
  • Precision = (Correct + ½ Partially correct) /
    (Correct + Incorrect + Partial)
  • Recall = (Correct + ½ Partially correct) /
    (Correct + Missing + Partial)
  • Why: NE boundaries are often misplaced, so some
    results are partially correct
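A minimal sketch of the half-credit scoring described above, assuming the four counts have already been produced by comparing system output with the gold standard. The function name and the sample counts are illustrative only.

```python
def lenient_scores(correct, partial, incorrect, missing):
    """Precision and recall where each partially correct answer earns half credit."""
    credit = correct + 0.5 * partial
    precision_denominator = correct + incorrect + partial
    recall_denominator = correct + missing + partial
    precision = credit / precision_denominator if precision_denominator else 0.0
    recall = credit / recall_denominator if recall_denominator else 0.0
    return precision, recall


# Hypothetical counts: 6 exact matches, 2 boundary errors, 2 wrong answers, 4 misses.
precision, recall = lenient_scores(correct=6, partial=2, incorrect=2, missing=4)
```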

26
The GATE Evaluation Tool
27
Corpus-level Regression Testing
  • Need to track the system's performance over time
  • When a change is made to the system we want to
    know what the implications are over the entire corpus
  • Why: because an improvement in one case can lead
    to problems in others
  • GATE offers an automated tool to help with the NE
    development task over time

28
Regression Testing (2)
At corpus level: GATE's corpus benchmark tool,
tracking the system's performance over time
29
Challenge: Evaluating Richer NE Tagging
  • Need for new metrics when evaluating
    hierarchy/ontology-based NE tagging
  • Need to take into account distance in the
    hierarchy
  • Tagging a company as a charity is less wrong than
    tagging it as a person
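The slide does not define a concrete metric, so the following is only one possible sketch of the idea: give partial credit that shrinks with the path length between the gold class and the predicted class. The toy hierarchy and the 1 / (1 + distance) scoring rule are assumptions made for illustration.

```python
# Hypothetical class hierarchy: child -> parent (None marks the root).
PARENT = {
    "Company": "Organization",
    "Charity": "Organization",
    "Organization": "Entity",
    "Person": "Entity",
    "Entity": None,
}


def ancestors(cls):
    """Return the chain from cls up to the root, starting with cls itself."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def tree_distance(a, b):
    """Number of edges on the path between two classes via their lowest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("classes share no ancestor")


def partial_credit(gold, predicted):
    """Full credit for an exact match, less credit the further apart the classes are."""
    return 1.0 / (1.0 + tree_distance(gold, predicted))
```

Under this toy scheme, tagging a Company as a Charity (distance 2) scores higher than tagging it as a Person (distance 3), which matches the intuition on the slide.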

30
SW IE Evaluation tasks
  • Detection of entities and events, given a target
    ontology of the domain.
  • Disambiguation of the entities and events from
    the documents with respect to instances in the
    given ontology. For example, measuring whether
    the IE correctly disambiguated Cambridge in the
    text to the correct instance Cambridge, UK vs
    Cambridge, MA.
  • Decision when a new instance needs to be added to
    the ontology, because the text contains a new
    instance, that does not already exist in the
    ontology.

31
Structure of the Tutorial
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation

32
Two kinds of IE approaches
  • Knowledge Engineering
  • rule based
  • developed by experienced language engineers
  • make use of human intuition
  • requires only small amount of training data
  • development could be very time consuming
  • some changes may be hard to accommodate
  • Learning Systems
  • use statistics or other machine learning
  • developers do not need LE expertise
  • requires large amounts of annotated training data
  • some changes may require re-annotation of the
    entire training corpus
  • annotators are cheap (but you get what you pay
    for!)

33
NE Baseline: list lookup approach
  • System that recognises only entities stored in
    its lists (gazetteers).
  • Advantages: simple, fast, language independent,
    easy to retarget (just create lists)
  • Disadvantages: impossible to enumerate all
    names, collection and maintenance of lists,
    cannot deal with name variants, cannot resolve
    ambiguity
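A minimal sketch of such a list-lookup baseline, assuming a tiny hand-made gazetteer. The list contents and function name are illustrative; the plain substring search also ignores word boundaries, which is one of the weaknesses a real system would have to address.

```python
# Hypothetical gazetteer: NE type -> known names.
GAZETTEER = {
    "Location": ["London", "New York"],
    "Organization": ["Goldman Sachs", "IBM"],
}


def list_lookup(text):
    """Return sorted (start, end, type, string) tuples for every gazetteer entry found."""
    hits = []
    for ne_type, names in GAZETTEER.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                hits.append((start, start + len(name), ne_type, name))
                start = text.find(name, start + 1)
    return sorted(hits)


hits = list_lookup("IBM opened an office in London.")
```

Any name missing from the lists, or spelled differently, is simply not found, which illustrates the disadvantages listed above.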

34
Shallow parsing approach using internal structure
  • Internal evidence: names often have internal
    structure. These components can be either stored
    or guessed, e.g. location:
  • Cap. Word + {City, Forest, Center, River}
  • e.g. Sherwood Forest
  • Cap. Word + {Street, Boulevard, Avenue, Crescent,
    Road}
  • e.g. Portobello Street
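The "capitalised word plus keyword" idea can be approximated with a regular expression. This sketch uses the keywords from the slide; treating a capitalised word as `[A-Z][a-z]+` is a simplifying assumption.

```python
import re

# Keywords taken from the slide; the capitalised-word pattern is a simplification.
LOCATION_KEYWORDS = [
    "City", "Forest", "Center", "River",
    "Street", "Boulevard", "Avenue", "Crescent", "Road",
]
LOCATION_PATTERN = re.compile(
    r"\b[A-Z][a-z]+ (?:" + "|".join(LOCATION_KEYWORDS) + r")\b"
)


def find_locations(text):
    """Return every 'Capitalised-word Keyword' sequence found in text."""
    return LOCATION_PATTERN.findall(text)


found = find_locations("They walked from Portobello Street to Sherwood Forest.")
```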

35
Problems ...
  • Ambiguously capitalised words (first word in
    sentence): All American Bank vs. All State
    Police
  • Semantic ambiguity: "John F. Kennedy" airport
    (location); "Philip Morris" organisation
  • Structural ambiguity: Cable and Wireless vs.
    Microsoft and Dell; Center for Computational
    Linguistics vs. message from City Hospital for
    John Smith

36
Shallow parsing with context
  • Use of context-based patterns is helpful in
    ambiguous cases
  • "David Walton" and "Goldman Sachs" are
    indistinguishable
  • But with the phrase "David Walton of Goldman
    Sachs" and the Person entity "David Walton"
    recognised, we can use the pattern "Person of
    Organization" to identify "Goldman Sachs"
    correctly.
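A rough sketch of applying the "Person of Organization" pattern once the person has already been recognised. The function name and the capitalised-phrase regex are illustrative assumptions, not part of any particular system.

```python
import re

# One or more capitalised words, e.g. "Goldman Sachs" (a simplifying assumption).
CAP_NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def apply_person_of_org(text, known_persons):
    """Label the capitalised phrase after '<Person> of' as an Organization."""
    found = []
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_NAME + r")")
        for match in pattern.finditer(text):
            found.append((match.group(1), "Organization"))
    return found


result = apply_person_of_org(
    "David Walton of Goldman Sachs said rates would rise.",
    known_persons=["David Walton"],
)
```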

37
Examples of context patterns
  • PERSON earns MONEY
  • PERSON joined ORGANIZATION
  • PERSON left ORGANIZATION
  • PERSON joined ORGANIZATION as JOBTITLE
  • ORGANIZATION's JOBTITLE PERSON
  • ORGANIZATION JOBTITLE PERSON
  • the ORGANIZATION JOBTITLE
  • part of the ORGANIZATION
  • ORGANIZATION headquarters in LOCATION
  • price of ORGANIZATION
  • sale of ORGANIZATION
  • investors in ORGANIZATION
  • ORGANIZATION is worth MONEY
  • JOBTITLE PERSON
  • PERSON, JOBTITLE

38
Example Rule-based System - ANNIE
  • Created as part of GATE
  • GATE automatically deals with document formats,
    saving of results, evaluation, and visualisation
    of results for debugging
  • GATE has a finite-state pattern-action rule
    language, used by ANNIE
  • ANNIE modified for MUC guidelines: 89.5%
    f-measure on MUC-7 NE corpus

39
NE Components: the ANNIE system, a reusable and
easily extendable set of components
40
Gazetteer lists for rule-based NE
  • Needed to store the indicator strings for the
    internal structure and context rules
  • Internal location indicators, e.g., {river,
    mountain, forest} for natural locations; {street,
    road, crescent, place, square} for address
    locations
  • Internal organisation indicators, e.g., company
    designators {GmbH, Ltd, Inc, ...}
  • Produces Lookup results of the given kind

41
The Named Entity Grammars
  • Phases run sequentially and constitute a cascade
    of FSTs over the pre-processing results
  • Hand-coded rules applied to annotations to
    identify NEs
  • Annotations from format analysis, tokeniser,
    sentence splitter, POS tagger, and gazetteer
    modules
  • Use of contextual information
  • Finds person names, locations, organisations,
    dates, addresses.

42
NE Rule in JAPE
  • JAPE: a Java Annotation Patterns Engine
  • Light, robust regular-expression-based
    processing
  • Cascaded finite state transduction
  • Low-overhead development of new components
  • Simplifies multi-phase regex processing
    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )+ //from tokeniser
      {Lookup.kind == companyDesignator} //from gazetteer lists
    ):match
    -->
    :match.NamedEntity = { kind = company, rule = Company1 }

43
Named Entities in GATE
44
Using co-reference to classify ambiguous NEs
  • Orthographic co-reference module that matches
    proper names in a document
  • Improves NE results by assigning entity type to
    previously unclassified names, based on
    relations with classified NEs
  • May not reclassify already classified entities
  • Classification of unknown entities: very useful
    for surnames which match a full name, or
    abbreviations, e.g. "Bonfield" will match "Sir
    Peter Bonfield"; "International Business
    Machines Ltd." will match "IBM"
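A simplified sketch of this kind of orthographic matching, covering only the two cases mentioned on the slide (surname of a full name, and acronym of an organisation name). The function names and the list of company designators to skip are assumptions; this is not the actual GATE module.

```python
# Designators ignored when building an acronym (illustrative list).
DESIGNATORS = {"Ltd.", "Ltd", "Inc.", "Inc", "GmbH"}


def acronym(name):
    """Initial letters of the words in name, skipping company designators."""
    return "".join(word[0] for word in name.split() if word not in DESIGNATORS)


def propagate_types(classified, unclassified):
    """Assign a type to unclassified names that match an already classified name.

    classified:   dict mapping a full name to its NE type
    unclassified: list of names with no type yet
    """
    resolved = {}
    for name in unclassified:
        for full_name, ne_type in classified.items():
            is_last_word = full_name.split()[-1] == name
            is_acronym = acronym(full_name) == name
            if is_last_word or is_acronym:
                resolved[name] = ne_type
                break
    return resolved


classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
resolved = propagate_types(classified, ["Bonfield", "IBM", "Acme"])
```

Names that match nothing, such as "Acme" here, stay unclassified, and already classified names are never touched.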

45
Named Entity Coreference
46
Structure of the Tutorial
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation

47
Machine Learning Approaches
  • Approaches
  • Train ML models on manually annotated text
  • Mixed initiative learning
  • Used for producing training data
  • Used for producing working systems
  • ML Methods
  • Symbolic learning: rules/decision trees induction
  • Statistical models: HMMs, Bayesian methods,
    Maximum Entropy

48
ML Terminology
  • Instances (tokens, entities)
  • Occurrences of a phenomenon
  • Attributes (features)
  • Characteristics of the instances
  • Classes
  • Sets of similar instances

49
Methodology
  • The task can be broken into several subtasks
    (that can use different methods)
  • Boundary detection
  • Entity classification into NE types
  • Different models for different entity types
  • Several models can be used in competition.
  • Some algorithms perform better on little data
    while others are better when more training is
    available

50
Methodology (2)
  • Boundaries (and entity types) notations
  • S(-XXX), E(-XXX)
  • <S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/>
    heads for <S-LOC/>Baghdad<E-LOC/>.
  • IOB notation (Inside, Outside, Beginning_of)
  • U.N. I-ORG
  • official O
  • Ekeus I-PER
  • heads O
  • for O
  • Baghdad I-LOC
  • . O
  • Translations between the two conventions are
    straightforward
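One direction of that translation, from IOB tags to start/end boundary markers, can be sketched as follows. The function name is illustrative, and tokens are simply joined with single spaces, so punctuation ends up separated from the preceding word.

```python
def iob_to_boundaries(tagged):
    """Convert (token, tag) pairs in IOB notation to inline <S-X/> ... <E-X/> markers."""
    out = []
    open_type = None  # type of the entity currently being built, if any
    for token, tag in tagged:
        ne_type = None if tag == "O" else tag.split("-", 1)[1]
        starts_new = tag.startswith("B-") or ne_type != open_type
        if open_type is not None and starts_new:
            out[-1] += f"<E-{open_type}/>"  # close the entity that just ended
            open_type = None
        if ne_type is not None and open_type is None:
            out.append(f"<S-{ne_type}/>{token}")  # open a new entity
            open_type = ne_type
        else:
            out.append(token)
    if open_type is not None:
        out[-1] += f"<E-{open_type}/>"
    return " ".join(out)


tagged = [
    ("U.N.", "I-ORG"), ("official", "O"), ("Ekeus", "I-PER"),
    ("heads", "O"), ("for", "O"), ("Baghdad", "I-LOC"), (".", "O"),
]
marked = iob_to_boundaries(tagged)
```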

51
Features
  • Linguistic features
  • POS
  • Morphology
  • Syntax
  • Lexicon data
  • Semantic features
  • Ontological class
  • ETC
  • Document structure
  • Original markup
  • Paragraph/sentence structure
  • Surface features
  • Token length
  • Capitalisation
  • Token type (word, punctuation, symbol)
  • Feature selection: the most difficult part
  • Some automatic scoring methods can be used

52
Mixed Initiative Learning
  • Human computer interaction
  • Speeds up the creation of training data
  • Can be used for corpus/system creation
  • Example implementations
  • Alembic [Day et al97]
  • Amilcare [Ciravegna03]

53
Mixed Initiative Learning (2)
[Diagram labels: User annotates; System learns; P>t1; P>t2]
54
GATE Machine Learning support
  • Uses classification.
  • Attr1, Attr2, Attr3, ... Attrn → Class
  • Classifies annotations.
  • (Documents can be classified as well using a
    1-to-1 relation with annotations.)
  • Annotations of a particular type are selected as
    instances.
  • Attributes refer to features of the instance
    annotations or their context.
  • Generic implementation for attribute collection:
    can be linked to any ML engine.
  • ML engines currently integrated: WEKA and
    Ontotext's HMM.

55
Implementation
  • Machine Learning PR in GATE.
  • Has two functioning modes
  • training
  • application
  • Uses an XML file for configuration
    <?xml version="1.0" encoding="windows-1252"?>
    <ML-CONFIG>
      <DATASET> </DATASET>
      <ENGINE></ENGINE>
    </ML-CONFIG>

56
Attributes Collection
Instances type Token
57
Dataflow
GATE ML Library
NLP Pipeline: Tokeniser, Gazetteer, POS Tagger,
Lexicon Lookup, Semantic Tagger, etc.
Annotated documents
Plain text documents
Feature Collection
Results Converter
Engine Interface
Machine Learning Engine
58
Amilcare and Melita
  • Amilcare: rule-learning algorithm
  • Tagging rules: learn to insert tags in the text,
    given training examples
  • Correction rules: learn to move already inserted
    tags to their correct place in the text
  • Novel aspect: learns begin and end tags
    independently
  • Melita: supports adaptive IE
  • Applied in SemWeb context (see below)
  • Being extended as part of the EU-funded DOT.KOM
    project towards KM and SemWeb applications

[Ciravegna03] www.dcs.shef.ac.uk/fabio
59
Structure of the Tutorial
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation

60
Towards Semantic Tagging of Entities
  • The MUC NE task tags selected segments of text
    whenever that text represents the name of an
    entity.
  • Semantic tagging - view as mentions of the
    underlying instances from the ontology
  • Identify which mentions in the text refer to
    which instances in the ontology, e.g.,
  • Tony Blair, Mr. Blair, he, the prime minister, he
  • Gordon Brown, he, Mr. Brown, the chancellor

61
Tasks
  • Identify entity mentions in the text
  • Reference disambiguation
  • Add new instances if needed
  • Disambiguate wrt instances in the ontology
  • Identify instances of attributes and relations
  • take into account what is allowed by the
    ontology, using domain and range as constraints
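A small sketch of using domain and range as constraints when proposing relations. The toy ontology below is hypothetical; it only reuses class and property names that appear elsewhere in this tutorial (Company, City, Country, Location, HQ, partOf).

```python
# Hypothetical ontology fragment.
SUBCLASS_OF = {"City": "Location", "Country": "Location", "Company": "Organization"}
PROPERTIES = {
    # property: (domain, range)
    "HQ": ("Company", "City"),
    "partOf": ("City", "Country"),
}


def is_a(cls, expected):
    """True if cls equals expected or is one of its (transitive) subclasses."""
    while cls is not None:
        if cls == expected:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def relation_allowed(prop, subject_class, object_class):
    """Accept a candidate relation only if it respects the property's domain and range."""
    domain, range_ = PROPERTIES[prop]
    return is_a(subject_class, domain) and is_a(object_class, range_)
```

For example, `relation_allowed("HQ", "Company", "City")` holds, while swapping the arguments does not, so an extractor could discard the second candidate without further analysis.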

62
Example
XYZ was established on 03 November 1978 in
London. It opened a plant in Bulgaria in
[Diagram: Ontology and KB. Labels: Location, Company, City, Country; relations HQ, partOf, type, establOn; value 03/11/1978]
63
Classes, instances and metadata
Gordon Brown met George Bush during his two day
visit.
    <metadata>
      <DOC-ID>http:// 1.html</DOC-ID>
      <Annotation>
        <s_offset> 0 </s_offset>
        <e_offset> 12 </e_offset>
        <string>Gordon Brown</string>
        <class>Person</class>
        <inst>Person12345</inst>
      </Annotation>
      <Annotation>
        <s_offset> 18 </s_offset>
        <e_offset> 32 </e_offset>
        <string>George Bush</string>
        <class>Person</class>
        <inst>Person67890</inst>
      </Annotation>
    </metadata>
[Diagram labels: Classes + instances before; Classes + instances after; Bush]
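A minimal sketch of producing stand-off records like the one above, assuming the mentions and their instance identifiers have already been determined by an IE step. Offsets here are character positions computed from the string itself, so they need not equal the illustrative numbers on the slide; the function name is hypothetical.

```python
def annotate(text, mentions):
    """Build stand-off annotation records for the first occurrence of each mention.

    mentions: list of (string, class_name, instance_id) tuples
    """
    annotations = []
    for string, class_name, instance_id in mentions:
        start = text.find(string)
        if start == -1:
            continue  # mention does not occur in this text
        annotations.append({
            "s_offset": start,
            "e_offset": start + len(string),
            "string": string,
            "class": class_name,
            "inst": instance_id,
        })
    return annotations


text = "Gordon Brown met George Bush during his two day visit."
records = annotate(text, [
    ("Gordon Brown", "Person", "Person12345"),
    ("George Bush", "Person", "Person67890"),
])
```

Because the records only point into the text by offset and refer to instances by identifier, they can be stored and indexed separately from both the document and the ontology.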
64
Classes, instances and metadata (2)
Gordon Brown met Tony Blair to discuss the
university tuition fees.
    <metadata>
      <DOC-ID>http:// 2.html</DOC-ID>
      <Annotation>
        <s_offset> 0 </s_offset>
        <e_offset> 12 </e_offset>
        <string>Gordon Brown</string>
        <class>Person</class>
        <inst>Person12345</inst>
      </Annotation>
      <Annotation>
        <s_offset> 18 </s_offset>
        <e_offset> 30 </e_offset>
        <string>Tony Blair</string>
        <class>Person</class>
        <inst>Person26389</inst>
      </Annotation>
    </metadata>
[Diagram labels: Classes + instances before; Classes + instances after; G. Brown; G. Bush]
65
Why not put metadata in ontologies?
  • Can be encoded in RDF/OWL, etc. but does it need
    to be put as instances in the ontology?
  • Typically we do not need to reason with it
  • Reasoning happens in the ontology when the new
    instances of classes and properties are added,
    but the metadata statements are different from
    them, they only refer to them
  • A lot more metadata than instances
  • Millions of metadata statements, thousands of
    instances, hundreds of concepts
  • Different access required
  • By offset (give me all metadata of the first
    paragraph)
  • Efficient metadata-wide statistics based on
    strings: not an operation that people would do
    on other concepts
  • Mixing with keyword-based search using IR-style
    indexing

66
Metadata Creation with IE
  • Semantic tagging creates metadata
  • Stand-off or part of document
  • Semi-automatic
  • One view (given by the user, one ontology)
  • More reliable
  • Automatic metadata creation
  • Many views: change with ontology, re-train IE
    engine for each ontology
  • Always up to date, if ontology changes
  • Less reliable

67
Problems with traditional IE for metadata
creation
  • S-CREAM: Semi-automatic CREAtion of Metadata
    [Handschuh et al02]
  • Semantic tags from IE need to be mapped to
    instances of concepts, attributes or relations
  • Most ML-based IE systems do not deal well with
    relations, mainly entities
  • Amilcare does not handle anaphora resolution;
    GATE has such a component but it is not used here
  • Implemented a discourse model with logical rules
  • LASIE used a discourse model with a domain ontology:
    the problem is robustness and domain portability

68
Example
[Handschuh et al02] S-CREAM, EKAW02
69
S-CREAM Discourse Rules
  • Rules to attach instances only when the ontology
    allows that (e.g., prices)
  • Attach tag values to the nearest preceding
    compatible entity (e.g., prices and rooms)
  • Create a complex object between two concept
    instances if they are adjacent (e.g., rate
    number followed by currency)
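The second rule above, attaching a value to the nearest preceding compatible entity, can be sketched as follows. This is not S-CREAM's implementation: the compatibility table, tag names and sample data are invented for illustration.

```python
# Hypothetical compatibility table: which earlier tag types a tag may attach to.
COMPATIBLE = {"price": {"room"}}


def attach_to_preceding(tags):
    """Link each attachable tag to the closest earlier tag of a compatible type.

    tags: list of (position, tag_type, value) tuples sorted by position
    Returns a list of (value, attached_to_value) pairs.
    """
    links = []
    for index, (_, tag_type, value) in enumerate(tags):
        allowed = COMPATIBLE.get(tag_type)
        if not allowed:
            continue  # this tag type is never attached to anything
        for _, earlier_type, earlier_value in reversed(tags[:index]):
            if earlier_type in allowed:
                links.append((value, earlier_value))
                break
    return links


tags = [
    (0, "room", "single room"),
    (15, "price", "120"),
    (30, "room", "double room"),
    (45, "price", "160"),
]
links = attach_to_preceding(tags)
```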
  • Experienced users can write new rules

70
Challenges for IE for SemWeb
  • Portability: different and changing ontologies
  • Different text types: structured, free, etc.
  • Utilise ontology information where available
  • Train from small amount of annotated text
  • Output results wrt the given ontology
  • bridge the gap demonstrated in S-CREAM
  • Learn/Model at the right level
  • ontologies are hierarchical and data will get
    sparser the lower we go

DOT.KOM http://nlp.shef.ac.uk/dot.kom/
71
Structure of the Tutorial
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation

72
GATE: Infrastructure for metadata extraction for
the SemWeb
  • Combines learning and rule-based methods
  • Allows combination of IE and IR
  • Enables use of large-scale linguistic resources
    for IE, such as WordNet
  • Supports ontologies as part of IE applications -
    Ontology-Based IE (OBIE)

73
Ontology Management in GATE
74
Information Retrieval: currently based on the
Lucene IR engine; useful for combining semantic
and keyword-based search
75
WordNet support
76
Populating Ontologies with IE
77
Example OBIE Application
  • hTechSight project: using Ontology-Based IE for
    semantic tagging of job adverts, news and reports
    in the chemical engineering domain
  • Aim is to track technological change over time
    through terminological analysis
  • Fundamental to the application is a
    domain-specific ontology
  • Terminological gazetteer lists are linked to
    classes in the ontology
  • Rules classify the mentions in the text wrt the
    domain ontology
  • Annotations output into a database or as an
    ontology

80
Exported Database
81
Structure of the Tutorial
  • Information Extraction - definition
  • Evaluation corpora metrics
  • IE approaches some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation

82
Platforms for Large-Scale Metadata Creation
  • Allow use of corpus-wide statistics to improve
    metadata quality, e.g., disambiguation
  • Automated alias discovery
  • Generate SemWeb output (RDF, OWL)
  • Stand-off storage and indexing of metadata
  • Use large instance bases to disambiguate against
  • Ontology servers for reasoning and access
  • Architecture elements:
  • Crawler, ontology storage, document indexing, query,
    annotators
  • Apps: semantic browsers, authoring tools, etc.

83
SemTag
  • Lookup of all instances from the ontology (TAP):
    65K instances
  • Disambiguate the occurrences as:
  • One of those in the taxonomy
  • Not present in the taxonomy
  • Not very high ambiguity of instances with the
    same label in TAP: concentrate on the second
    problem
  • Use bag-of-words approach for disambiguation
  • 3 people evaluated 200 labels in context: agreed
    on only 68.5% - metonymy
  • Placing labels in the taxonomy is hard

Dill et al, SemTag and Seeker. WWW03
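To give a feel for bag-of-words disambiguation, here is a simplified sketch that compares the words around a label with a short description of each candidate instance and returns no match when nothing is similar enough. It is not SemTag's actual algorithm; the cosine measure, the threshold and the sample descriptions are all assumptions.

```python
import math
from collections import Counter


def bag(text):
    """Lower-cased bag of words for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def disambiguate(context, candidates, threshold=0.1):
    """Return the candidate whose description best matches the context.

    candidates: dict mapping instance id -> descriptive text
    Returns None when no candidate scores above the threshold, i.e. the
    occurrence is treated as not present in the taxonomy.
    """
    context_bag = bag(context)
    best_id, best_score = None, threshold
    for instance_id, description in candidates.items():
        score = cosine(context_bag, bag(description))
        if score > best_score:
            best_id, best_score = instance_id, score
    return best_id


candidates = {
    "Paris_France": "capital city of france on the seine",
    "Paris_Texas": "city in texas united states",
}
choice = disambiguate("fashion week opened in the capital of france", candidates)
```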
84
Seeker
  • High-performance distributed infrastructure
  • 128 dual-processor machines with separate ½
    terabyte of storage
  • Each node runs approx. 200 documents per sec.
  • Service-oriented architecture: Vinci (SOAP)

Dill et al, SemTag and Seeker. WWW03
85
OBIE in KIM
  • The ontology (KIMO) and 86K/200K instances KB
  • High ambiguity of instances with the same label:
    need for disambiguation step
  • Lookup phase marks mentions from the ontology
  • Combined with rule-based IE system to recognise
    new instances of concepts and relations
  • Special KB enrichment stage where some of these
    new instances are added to the KB
  • Disambiguation uses an Entity Ranking algorithm,
    i.e., priority ordering of entities with the same
    label based on corpus statistics (e.g., Paris)

Popov et al. KIM. ISWC03
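The slide describes entity ranking only as a priority ordering of same-label entities based on corpus statistics. A very small sketch of that idea follows; the knowledge base, the mention counts and the function name are invented, and KIM's real algorithm may differ.

```python
def rank_candidates(label, knowledge_base, mention_counts):
    """Return instance ids that share the label, most frequently mentioned first.

    knowledge_base: dict mapping instance id -> label
    mention_counts: dict mapping instance id -> number of mentions seen in a corpus
    """
    candidates = [inst for inst, inst_label in knowledge_base.items() if inst_label == label]
    # Sort by descending count; ties are broken alphabetically for determinism.
    return sorted(candidates, key=lambda inst: (-mention_counts.get(inst, 0), inst))


# Hypothetical data for the "Paris" example.
knowledge_base = {"Paris_France": "Paris", "Paris_Texas": "Paris", "London_UK": "London"}
mention_counts = {"Paris_France": 950, "Paris_Texas": 12}
ranking = rank_candidates("Paris", knowledge_base, mention_counts)
```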
86
OBIE in KIM (2)
Popov et al. KIM. ISWC03
87
Comparison between SemTag and KIM
  • SemTag only aims for accuracy (precision) of
    classification of the annotated entities
  • KIM also aims for coverage (recall): whether all
    possible mentions of entities were found
  • Trade-off: sometimes finding some is enough
  • SemTag does not attempt to discover and expand
    the KB with new instances (e.g., new company);
    this is why KIM uses IE, not simple KB lookup
  • i.e. OBIE is often needed for ontology
    population, not just metadata creation

88
Two Annotation Scenarios (1)
  • Getting the instances and the relations between
    them is enough, maybe not all mentions in the
    text are covered, but compensated by giving
    access to this info from the annotated text

89
Example
Gordon Brown met president Bush during his two
day visit. Afterwards George Bush said
The system
Bush
Score 100
90
Two Annotation Scenarios (2)
  • Exhaustive annotation is required, so all
    occurrences of all instances and relations are
    needed
  • Allows sentence and paragraph-level exploration,
    rather than document-level as in the previous
    scenario
  • Harder to achieve
  • Distinction between these scenarios needs to be
    made in the metadata annotation tools/KM tools
    using IE

91
Example
Gordon Brown met president Bush during his two
day visit. Afterwards George Bush said
    <metadata>
      <Annotation>
        <s_offset> 0 </s_offset>
        <e_offset> 12 </e_offset>
        <class>Person</class>
        <inst>Person12345</inst>
      </Annotation>
      <Annotation>
        <s_offset> 61 </s_offset>
        <e_offset> 72 </e_offset>
        <class>Person</class>
        <inst>Person1267</inst>
      </Annotation>
    </metadata>
    <metadata>
      <Annotation>
        <s_offset> 0 </s_offset>
        <e_offset> 12 </e_offset>
        <class>Person</class>
        <inst>Person12345</inst>
      </Annotation>
      <Annotation>
        <s_offset> 18 </s_offset>
        <e_offset> 32 </e_offset>
        <class>Person</class>
        <inst>Person1267</inst>
      </Annotation>
      <Annotation>
        <s_offset> 61 </s_offset>
        <e_offset> 72 </e_offset>
        <class>Person</class>
        <inst>Person1267</inst>
      </Annotation>
    </metadata>
Score 66
92
Semantic Reference Disambiguation
  • Possible approaches
  • Vector-space models: compare context similarity,
    runs over a corpus
  • SemTag
  • Bagga's cross-document coreference work
  • Communities of practice approach from KM
  • Identity criteria from the ontology based on
    properties, e.g., date_of_birth, name

93
Why disambiguation is hard: not all knowledge
is explicit in text
  • Paris fashion week underway as cancellations
    continue
  • By Jo Johnson and Holly Finn - Oct 07 2001
    18:48:17 (FT)
  • Even as Paris fashion week opened at the
    weekend, the cancellations and reschedulings were
    still trickling in over the fax machines: Loewe,
    the leather specialists owned by LVMH empire, is
    not showing, Cerruti, the Italian tailor, is
    downscaling to private viewings, Helmut Lang,
    master of the sharp suit, is cancelling his
    catwalk.
  • The Oscar de la Renta show, for example, which
    had been planned for September 11th in New York,
    and which might easily enough have moved over to
    Paris instead, is not on the schedule. When the
    Dominican Republic-born designer consulted
    America Vogue's influential editor, Anna Wintour,
    she reportedly told him it would be unpatriotic
    to decamp.

94
Structure of the Tutorial
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation

95
Natural Language Generation
  • NLG is the
    subfield of AI and CL that is concerned with the
    construction of computer systems that can produce
    understandable texts in English or other human
    languages from some underlying non-linguistic
    representation of information (Reiter and Dale 97)
  • NLG techniques are also applied to producing
    speech, e.g., in speech dialogue systems

96
Natural Language Generation
[Diagram: an Ontology/KB/Database, together with Lexicons and Grammars, is the input to generation; the output is Text.]
97
Requirements Analysis
  • Create a corpus of target texts and (if possible)
    their input representations
  • Analyse the information content:
  • Unchanging texts: thank you, hello, etc.
  • Directly available data: timetable of buses
  • Computable data: number of buses
  • Unavailable data: not in the system's KB/DB

98
NLG Tasks
  1. Content determination
  2. Discourse planning
  3. Sentence aggregation
  4. Lexicalisation
  5. Referring expression generation
  6. Linguistic realisation

99
Content determination
  • What information to include in the text:
    filtering and summarising input data into a
    formal knowledge representation
  • Application dependent
  • Example:
  • project: AKT
  • start_date: October-2000
  • end_date: October-2006
  • participants: A, E, OU, So, Sh
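
As a rough illustration of filtering and summarising, the Python sketch below keeps only the fields a target text needs and derives a duration from the two dates. The function names and the extra internal_code field are hypothetical; only the AKT values come from the example above.

```python
def determine_content(record, wanted):
    """Filter: keep only the wanted fields, as (entity, attribute, value) facts."""
    entity = record["project"]
    return [(entity, attr, record[attr]) for attr in wanted if attr in record]

def duration_years(start_date, end_date):
    """Summarise: turn two 'Month-YYYY' dates into a duration in whole years."""
    return int(end_date.split("-")[1]) - int(start_date.split("-")[1])

RECORD = {
    "project": "AKT",
    "start_date": "October-2000",
    "end_date": "October-2006",
    "participants": ["A", "E", "OU", "So", "Sh"],
    "internal_code": "X-17",  # hypothetical field the text does not need
}

facts = determine_content(RECORD, ["start_date", "end_date", "participants"])
facts.append(("AKT", "duration_years", duration_years(RECORD["start_date"], RECORD["end_date"])))
```

Which fields count as "wanted" is exactly the application-dependent part.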

100
Discourse Planning
  • Determine ordering and structure over the
    knowledge to be generated
  • Theories of discourse: how texts are structured
  • Influences text readability
  • Result: a tree structure imposing an ordering over the
    predicates and possibly providing discourse
    relations
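
One way to picture such a result is a small tree whose internal nodes are discourse relations and whose leaves are predicates. The Python sketch below is an assumption-laden illustration: the Node class, the tuple encoding of predicates, and the particular shape of the tree are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    relation: str                      # e.g. "SEQUENCE", "LIST", "ELABORATION"
    children: list = field(default_factory=list)

def linearise(node):
    """Depth-first, left-to-right walk: the order in which predicates get expressed."""
    if isinstance(node, tuple):        # a leaf predicate (entity, attribute, value)
        return [node]
    ordered = []
    for child in node.children:
        ordered.extend(linearise(child))
    return ordered

# Assumed tree shape, for illustration only.
TREE = Node("SEQUENCE", [
    ("project AKT", "duration", "6 yrs"),
    Node("LIST", [
        Node("ELABORATION", [
            ("project AKT", "participant", "Shef"),
            ("univ Shef", "Web-page", "URL"),
        ]),
        ("project AKT", "participant", "OU"),
    ]),
])
```

Calling linearise(TREE) yields the duration predicate first, then each participant with its elaboration directly after it.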

101
Example
[Discourse tree diagram. Relation nodes: SEQUENCE, LIST, ELABORATION, ELABORATION. Leaf predicates:]
  • project AKT: duration 6 yrs
  • project AKT: participant Shef
  • univ Shef: Web-page URL
  • project AKT: participant OU
102
Planning-Based Approaches
  • Use AI-style planners (e.g., Moore and Paris 93)
  • Discourse relations (e.g., ELABORATION) are
    encoded as planning operators
  • Preconditions specify when the relation can apply
  • Planning starts from a top-level goal, e.g.,
    define-project(X)
  • Computationally expensive and require a lot of
    knowledge: a problem for real-world systems
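
The flavour of operators with preconditions can be shown with a toy top-down goal expander in Python. It is a much-simplified stand-in for a real AI planner and is not the method of the cited work; the operator names other than define-project, the dictionary layout, and the knowledge base are assumptions made for the sketch.

```python
def plan(goal, kb, operators):
    """Expand a goal top-down into an ordered list of facts to express."""
    name, arg = goal
    for op in operators.get(name, []):
        if op["precondition"](kb, arg):          # the operator applies only if this holds
            steps = []
            for sub in op["subgoals"]:
                steps.extend(plan((sub, arg), kb, operators))
            return steps
    # No applicable operator: treat the goal as a primitive "inform" act if the KB has the value.
    return [(arg, name, kb[arg][name])] if name in kb.get(arg, {}) else []

# Hypothetical knowledge base and operators.
KB = {"AKT": {"duration": "6 yrs", "participants": ["Shef", "OU"]}}
OPERATORS = {
    "define-project": [{
        "precondition": lambda kb, x: x in kb,
        "subgoals": ["duration", "elaborate-participants"],
    }],
    "elaborate-participants": [{
        "precondition": lambda kb, x: "participants" in kb[x],
        "subgoals": ["participants"],
    }],
}
```

Even in this toy form, every relation needs a hand-written operator and precondition, which hints at the knowledge cost mentioned above.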

103
Schema-Based Approaches
  • Capture typical text structuring patterns in
    templates (derived from a corpus), e.g., McKeown 85
  • Typically implemented as RTNs (recursive
    transition networks)
  • Variety comes from the different knowledge
    available for each entity
  • Reusable ones are available: Exemplars
  • Example:
  • Describe-Project-Schema -> Sequence(duration,
    ProjParticipants-Schema)
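
A schema of this kind can be approximated by recursive expansion over a table of named schemas. In the Python sketch below the body of ProjParticipants-Schema, the knowledge base, and the expand function are assumptions for illustration; it is not the Exemplars API.

```python
SCHEMAS = {
    "Describe-Project-Schema": ["duration", "ProjParticipants-Schema"],
    "ProjParticipants-Schema": ["participant"],   # assumed body, not given on the slide
}

def expand(schema, entity, kb):
    """Recursively expand a schema into the facts available for this entity."""
    facts = []
    for slot in SCHEMAS[schema]:
        if slot in SCHEMAS:                       # nested schema: recurse, as in an RTN
            facts.extend(expand(slot, entity, kb))
        else:                                     # plain slot: emit whatever the KB holds
            facts.extend((entity, slot, value) for value in kb.get(entity, {}).get(slot, []))
    return facts

# Hypothetical knowledge base; "P2" has less knowledge, so its text comes out shorter.
KB = {
    "AKT": {"duration": ["6 yrs"], "participant": ["Shef", "OU"]},
    "P2": {"participant": ["OU"]},
}
```

The same schema yields three facts for AKT and one for P2, which is where the variety in the generated texts comes from.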

104
Sentence Aggregation
  • Determine which predicates should be grouped
    together in sentences
  • A less well understood process
  • Default: each predicate can be expressed as a
    sentence, so this step is optional
  • SPoT: a trainable planner
  • Example:
  • AKT is a 6-year project with 5 participants:
  • Sheffield (URL)
  • OU
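
The contrast between the default (one sentence per predicate) and a simple aggregation can be sketched in Python as follows. The grouping rule used here, merging facts that share subject and attribute, is just one easy heuristic chosen for illustration; it is not how SPoT works, and the wording is deliberately crude because later stages would polish it.

```python
def one_sentence_each(facts):
    """Default behaviour: every predicate becomes its own sentence."""
    return [f"{e} has {a} {v}." for e, a, v in facts]

def aggregate(facts):
    """Merge facts that share subject and attribute so the values are listed once."""
    grouped = {}
    for e, a, v in facts:
        grouped.setdefault((e, a), []).append(v)
    return [f"{e} has {a} {' and '.join(vs)}." for (e, a), vs in grouped.items()]

FACTS = [
    ("AKT", "participant", "Sheffield"),
    ("AKT", "participant", "OU"),
    ("AKT", "duration", "6 years"),
]
```

Here one_sentence_each(FACTS) gives three sentences, while aggregate(FACTS) gives two.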

105
Lexicalisation
  • Choosing words and phrases to express the
    concepts and relations in predicates
  • Trivial solution: a 1-1 mapping between
    concepts/relations and lexical entries
  • Variation is useful to avoid repetitiveness and
    also to convey pragmatic distinctions (e.g.,
    formality)
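
A minimal Python sketch of both options: a lexicon entry with a single word gives the trivial 1-1 mapping, while an entry with several words lets the generator rotate through them to avoid repetition. The lexicon contents and the Lexicaliser class are hypothetical.

```python
import itertools

# Hypothetical lexicon: concept/relation -> candidate words.
LEXICON = {
    "participant": ["participant", "partner", "member"],
    "duration": ["duration"],          # a single entry is the trivial 1-1 mapping
}

class Lexicaliser:
    """Cycles through the candidate words of each concept to vary the wording."""

    def __init__(self, lexicon):
        self._cycles = {concept: itertools.cycle(words) for concept, words in lexicon.items()}

    def word_for(self, concept):
        return next(self._cycles[concept])
```

Choosing by formality rather than by rotation would only need an extra key on each lexical entry.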

106
Referring Expression Generation
  • Choose pronouns/phrases to refer to the entities
    in the text
  • Example: "he" vs. "Mr Smith" vs. "John Smith, the
    president of XXX Corp."
  • Depends on what has been said previously
  • "He" is only appropriate if the person has already
    been introduced in the text
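
The dependence on discourse history can be made concrete with a small Python sketch. The three-way rule used here (full description on first mention, pronoun if the entity was the last one mentioned, short name otherwise) is a simplifying assumption, as are the field names and the second entity.

```python
def refer(entity, history):
    """Pick a referring expression given the ids mentioned so far (history is updated)."""
    if entity["id"] not in history:
        expr = f'{entity["full_name"]}, {entity["role"]}'   # first mention: full description
    elif history[-1] == entity["id"]:
        expr = entity["pronoun"]                            # just mentioned: pronoun is safe
    else:
        expr = entity["short_name"]                         # someone else intervened
    history.append(entity["id"])
    return expr

SMITH = {"id": "p1", "full_name": "John Smith", "role": "the president of XXX Corp.",
         "short_name": "Mr Smith", "pronoun": "he"}
JONES = {"id": "p2", "full_name": "Ann Jones", "role": "a hypothetical colleague",
         "short_name": "Ms Jones", "pronoun": "she"}
```

Real generators also check for ambiguity among entities of the same gender, which this sketch ignores.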

107
Linguistic Realisation
  • Use grammar to generate text which is
    grammatical, i.e., syntactically and
    morphologically correct
  • Domain-independent
  • Reusable components are available, e.g.,
    RealPro, FUF/SURGE
  • Example:
  • Morphology: participant -> participants
  • Syntactic agreement: AKT starts on
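
The two examples, plural morphology and subject-verb agreement, can be imitated with a few rules in Python. This toy covers regular English forms only and is not the RealPro or FUF/SURGE interface; all function names are invented for the sketch.

```python
def pluralise(noun, count):
    """Regular English plural: add -es after s, x, z, ch, sh, otherwise -s."""
    if count == 1:
        return noun
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    return noun + "s"

def agree(verb, subject_is_plural):
    """Third-person present agreement for regular verbs: singular subjects take -s."""
    return verb if subject_is_plural else verb + "s"

def realise(subject, verb, count, noun, subject_is_plural=False):
    """Assemble a simple subject-verb-object sentence."""
    return f"{subject} {agree(verb, subject_is_plural)} {count} {pluralise(noun, count)}."

print(realise("AKT", "involve", 5, "participant"))  # AKT involves 5 participants.
```

Irregular forms (e.g., "have", "child") would need a lexicon lookup, which is what the reusable realisers provide.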

108
A GATE-based generator
  • Input:
  • The MIAKT ontology
  • The RDF file for the given case
  • The MIAKT lexicon
  • Output:
  • GATE document with the generated text

109
Lexicalising Concepts and Instances
110
Example RDF Input
  • <rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_patient'>
  • <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Patient'/>
  • <NS2:has_age>68</NS2:has_age>
  • <NS2:involved_in_ta rdf:resource='c:\breast_cancer_ontology.daml#ta-soton-1069861276136'/>
  • </rdf:Description>
  • <rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_mammography'>
  • <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Mammography'/>
  • <NS2:carried_out_on rdf:resource='c:\breast_cancer_ontology.daml#01401_patient'/>
  • <NS2:has_date>22 9 1995</NS2:has_date>
  • <NS2:produce_result rdf:resource='c:\breast_cancer_ontology.daml#image_01401_right_cc'/>
  • </rdf:Description>
  • <rdf:Description rdf:about='c:\breast_cancer_ontology.daml#image_01401_right_cc'>
  • <NS2:image_file>cancer/case0140/C_0140_1.RIGHT_CC.LJPEG</NS2:image_file>
  • <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Right_CC_Image'/>
  • <NS2:has_lateral rdf:resource='c:\breast_cancer_ontology.daml#lateral_right'/>
  • <NS2:view_of_image rdf:resource='c:\breast_cancer_ontology.daml#craniocaudal_view'/>
  • <NS2:contains_entity rdf:resource='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'/>
  • </rdf:Description>
  • <rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'>
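
To show how a generator might read input of this kind, here is a hedged Python sketch that turns a simplified, well-formed fragment into (subject, predicate, object) triples using only the standard library. The NS2 namespace URI is a placeholder, the identifiers are shortened to fragment form, and this is not how the MIAKT generator itself loads its input.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NS2 = "http://example.org/breast_cancer_ontology#"   # placeholder URI, not from the slides

SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:NS2="{NS2}">
  <rdf:Description rdf:about="#01401_patient">
    <rdf:type rdf:resource="#Patient"/>
    <NS2:has_age>68</NS2:has_age>
  </rdf:Description>
  <rdf:Description rdf:about="#01401_mammography">
    <rdf:type rdf:resource="#Mammography"/>
    <NS2:carried_out_on rdf:resource="#01401_patient"/>
    <NS2:has_date>22 9 1995</NS2:has_date>
  </rdf:Description>
</rdf:RDF>"""

def triples(xml_text):
    """Return (subject, predicate, object) for every property of every rdf:Description."""
    root = ET.fromstring(xml_text)
    out = []
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            predicate = prop.tag.split("}", 1)[1]          # drop the namespace part
            obj = prop.get(f"{{{RDF}}}resource") or (prop.text or "").strip()
            out.append((subject, predicate, obj))
    return out
```

Each triple is then a candidate predicate for the content determination and lexicalisation steps described earlier; a production system would more likely use a dedicated RDF library.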

111
CASE0140.RDF
  • The 68 years old patient is involved in a
    triple assessment procedure. The triple
    assessment procedure contains a mammography exam.
    The mammography exam is carried out on the
    patient on 22 9 1995. The mammography exam
    produced a right CC image. The right CC image
    contains an abnormality and it has a right
    lateral side and a craniocaudal view. The
    abnormality has a mass, a microlobulated margin ,
    a round shape, and a probably malignant
    assessment.

112
Further Reading on IE for SemWeb
  • Requirements for Information Extraction for
    Knowledge Management.
    http://nlp.shef.ac.uk/dot.kom/publications.html
  • Information Extraction as a Semantic Web
    Technology: Requirements and Promises. Adaptive
    Text Extraction and Mining workshop, 2003.
  • A. Kiryakov, B. Popov, et al. Semantic
    Annotation, Indexing, and Retrieval. 2nd
    International Semantic Web Conference (ISWC2003).
    http://www.ontotext.com/publications/index.html#KiryakovEtAl2003
  • S. Handschuh, S. Staab, R. Volz. On Deep
    Annotation. WWW03.
    http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/p273_handschuh.pdf
  • S. Dill, N. Eiron, et al. SemTag and Seeker:
    Bootstrapping the semantic web via automated
    semantic annotation. WWW03.
    http://www.tomkinshome.com/papers/2Web/semtag.pdf
  • E. Motta, M. Vargas-Vera, et al. MnM: Ontology
    Driven Semi-Automatic and Automatic Support for
    Semantic Markup. Knowledge Engineering and
    Knowledge Management (Ontologies and the Semantic
    Web), (EKAW02).
    http://www.aktors.org/publications/selected-papers/06.pdf
  • K. Bontcheva, A. Kiryakov, H. Cunningham, B.
    Popov, M. Dimitrov. Semantic Web Enabled, Open
    Source Language Technology. Language Technology
    and the Semantic Web, Workshop on NLP and XML
    (NLPXML-2003).
    http://www.gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf
  • Handschuh, Staab, Ciravegna. S-CREAM -
    Semi-automatic CREAtion of Metadata (2002).
    http://citeseer.nj.nec.com/529793.html

113
Further Reading on traditional IE
  • [Day et al 97] D. Day, J. Aberdeen, L. Hirschman,
    R. Kozierok, P. Robinson, and M. Vilain.
    Mixed-Initiative Development of Language
    Processing Systems. In Proceedings of the Fifth
    Conference on Applied Natural Language Processing
    (ANLP97). 1997.
  • [Ciravegna 02] F. Ciravegna, A. Dingli, D.
    Petrelli, Y. Wilks. User-System Cooperation in
    Document Annotation based on Information
    Extraction. Knowledge Engineering and Knowledge
    Management (Ontologies and the Semantic Web),
    (EKAW02), 2002.
  • N. Kushmerick, B. Thomas. Adaptive information
    extraction: Core technologies for information
    agents (2002).
    http://citeseer.nj.nec.com/kushmerick02adaptive.html
  • H. Cunningham, D. Maynard, K. Bontcheva, V.
    Tablan. GATE: A Framework and Graphical
    Development Environment for Robust NLP Tools and
    Applications. 40th Anniversary Meeting of the
    Association for Computational Linguistics
    (ACL'02). 2002.
  • D. Maynard, K. Bontcheva and H. Cunningham.
    Towards a semantic extraction of named entities.
    Recent Advances in Natural Language Processing,
    Bulgaria, 2003.
  • Califf and Mooney. Relational Learning of Pattern
    Matching Rules for Information Extraction.
    http://citeseer.nj.nec.com/6804.html
  • Borthwick A. A Maximum Entropy Approach to Named
    Entity Recognition. PhD Dissertation. 1999.
  • Bikel D., Schwartz R., Weischedel R. An
    algorithm that learns what's in a name. Machine
    Learning 34, pp. 211-231, 1999.
  • Riloff, E. (1996) "Automatically Generating
    Extraction Patterns from Untagged Text".
    Proceedings of the Thirteenth National Conference
    on Artificial Intelligence (AAAI-96), 1996, pp.
    1044-1049.
    http://www.cs.utah.edu/%7Eriloff/psfiles/aaai96.pdf
  • Daelemans W. and Hoste V. Evaluation of Machine
    Learning Methods for Natural Language Processing
    Tasks. In LREC 2002: Third International
    Conference on Language Resources and Evaluation,
    pages 755-760.

114
Further Reading on traditional IE
  • Black W.J., Rinaldi F., Mowatt D. Facile:
    Description of the NE System Used For MUC-7.
    Proceedings of 7th Message Understanding
    Conference, Fairfax, VA, 19 April - 1 May, 1998.
  • Collins M., Singer Y. Unsupervised models for
    named entity classification. In Proceedings of the
    Joint SIGDAT Conference on Empirical Methods in
    Natural Language Processing and Very Large
    Corpora, 1999.
  • Collins M. Ranking Algorithms for Named-Entity
    Extraction: Boosting and the Voted Perceptron.
    Proceedings of the 40th Annual Meeting of the
    ACL, Philadelphia, pp. 489-496, July 2002.
  • Gotoh Y., Renals S. Information extraction from
    broadcast news. Philosophical Transactions of the
    Royal Society of London, Series A: Mathematical,
    Physical and Engineering Sciences, 2000.
  • Grishman R. The NYU System for MUC-6 or Where's
    the Syntax? Proceedings of the MUC-6 workshop,
    Washington. November 1995.
  • Krupka G. R., Hausman K. IsoQuest Inc.:
    Description of the NetOwl(TM) Extractor System as
    Used for MUC-7. Proceedings of 7th Message
    Understanding Conference, Fairfax, VA, 19 April -
    1 May, 1998.
  • McDonald D. Internal and External Evidence in the
    Identification and Semantic Categorization of
    Proper Names. In B. Boguraev and J. Pustejovsky,
    editors, Corpus Processing for Lexical
    Acquisition. Pages 21-39. MIT Press. Cambridge,
    MA. 1996.
  • Mikheev A., Grover C. and Moens M. Description of
    the LTG System Used for MUC-7. Proceedings of 7th
    Message Understanding Conference, Fairfax, VA, 19
    April - 1 May, 1998
  • Miller S., Crystal M., et al. BBN: Description of
    the SIFT System as Used for MUC-7. Proceedings of
    7th Message Understanding Conference, Fairfax,
    VA, 19 April - 1 May, 1998

115
Further Reading on multilingual IE
  • Palmer D., Day D.S. A Statistical Profile of the
    Named Entity Task. Proceedings of the Fifth
    Conference on Applied Natural Language
    Processing, Washington, D.C., March 31- April 3,
    1997.
  • Sekine S., Grishman R. and Shinou H. A decision
    tree method for finding and classifying names in
    Japanese texts. Proceedings of the Sixth Workshop
    on Very Large Corpora, Montreal, Canada, 1998
  • Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N.
    Chinese Named Entity Identification Using
    Class-based Language Model. In Proceedings of the
    19th International Conference on Computational
    Linguistics (COLING 2002), pp. 967-973, 2002.
  • Takeuchi K., Collier N. Use of Support Vector
    Machines in Extended Named Entity Recognition.
    The 6th Conference on Natural Language Learning.
    2002
  • D. Maynard, K. Bontcheva and H. Cunningham.
    Towards a semantic extraction of named entities.
    Recent Advances in Natural Language Processing,
    Bulgaria, 2003.
  • M. M. Wood and S. J. Lydon and V. Tablan and D.
    Maynard and H. Cunningham. Using parallel texts
    to improve recall in IE. Recent Advances in
    Natural Language Processing, Bulgaria, 2003.
  • D. Maynard, V. Tablan and H. Cunningham. NE
    recognition without training data on a language
    you don't speak. ACL Workshop on Multilingual and
    Mixed-language Named Entity Recognition:
    Combining Statistical and Symbolic Models,
    Sapporo, Japan, 2003.

116
Further Reading on multilingual IE
  • H. Saggion, H. Cunningham, K. Bontcheva, D.
    Maynard, O. Hamza, Y. Wilks. Multimedia Indexing
    through Multisource and Multilingual Information
    Extraction: the MUMIS project. Data and Knowledge
    Engineering, 2003.
  • D. Manov, A. Kiryakov, B. Popov, K.
    Bontcheva, D. Maynard, H. Cunningham.
    Experiments with geographic knowledge for
    information extraction. Workshop on Analysis of
    Geographic References, HLT/NAACL'03, Canada,
    2003.
  • H. Cunningham, D. Maynard, K. Bontcheva, V.
    Tablan. GATE: A Framework and Graphical
    Development Environment for Robust NLP Tools and
    Applications. Proceedings of the 40th Anniversary
    Meeting of the Association for Computational
    Linguistics (ACL'02). Philadelphia, July 2002.
  • H. Cunningham. GATE, a General Architecture for
    Text Engineering. Computers and the Humanities,
    volume 36, pp. 223-254, 2002.
  • D. Maynard, H. Cunningham, K. Bontcheva, M.
    Dimitrov. Adapting A Robust Multi-Genre NE System
    for Automatic Content Extraction. Proc. of the
    10th International Conference on Artificial
    Intelligence Methodology, Systems, Applications
    (AIMSA 2002), 2002.
  • K. Pastra, D. Maynard, H. Cunningham, O. Hamza,
    Y. Wilks. How feasible is the reuse of grammars
    for Named Entity Recognition? Language Resources
    and Evaluation Conference (LREC'2002), 2002.

117
THANK YOU! The slides: http://gate.ac.uk/sale/talks/sekt-tutorial.ppt