Text Mining for Biomedicine: Techniques

1
Text Mining for Biomedicine: Techniques & Tools

Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka
School of Computer Science
National Centre for Text Mining
www.nactem.ac.uk
Sophia.Ananiadou@manchester.ac.uk

2
Outline

Challenges / objectives of TM in biomedicine
Terminology processing
Term extraction, term variation, named entity recognition
Resources for TM in biomedicine
Document classification
Information Extraction approaches
Levels of Text Mining Processing
Biomedical text mining services and systems at NaCTeM
TerMine, AcroMine, Smart dictionary look up, Phenetica
Medie, InfoPubMed, KLEIO

3
Material

Further background on TM for Biology
Ananiadou, S. & McNaught, J. (eds) (2006) Text Mining for Biology and Biomedicine. Boston, MA: Artech House
Numerous papers on line from bibliography
See BLIMP http://blimp.cs.queensu.ca/
Biomedical Literature (and text) mining publications

4
Text Mining in biomedicine

Why biomedicine?
Consider just MEDLINE: 16,000,000 references, 40,000 added per month
Dynamic nature of the domain: new terms (genes, proteins, chemical compounds, drugs) constantly created
Impossible to manage such an information overload

5
From Text to Knowledge: tackling the data deluge through text mining
Unstructured Text (implicit knowledge)
Information Retrieval
Information extraction
Knowledge Discovery
Semantic metadata
Structured content (explicit knowledge)
Advanced Information Retrieval
6
Information deluge

Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of information
Linking text to databases and ontologies
Curators struggling to process scientific literature
Discovery of facts and events crucial for gaining insights in biosciences: need for text mining

8
The solution: The UK National Centre for Text Mining (www.nactem.ac.uk)

Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk
First publicly funded text mining centre in the world
Focus: biology, medicine, social sciences

9
We don't just press a button

TM involves:
Many components (converters, analysers, miners, visualisers, ...)
Many resources (grammars, ontologies, lexicons, terminologies, thesauri, CVs)
Many combinations of components and resources for different applications
Many different user requirements and scenarios, training needs
The best solutions are customised

10
People behind NaCTeM

Text Mining Team: 14 members
Close collaboration with University of Tokyo, Tsujii Lab http://www-tsujii.is.s.u-tokyo.ac.jp/

11
What NaCTeM is building

Resources: ontologies, lexicons, terminologies, thesauri, grammars, annotated corpora
BOOTStrep project http://www.nactem.ac.uk/bootstrep.php
Tools: tokenisers, taggers, chunkers, parsers, NE recognisers, semantic analysers
NaCTeM is also providing services
Our related bio-text mining projects:
REFINE http://dbkgroup.org/refine/ (Representing Evidence For Interacting Network Elements)
ONDEX (data integration, workflows, text mining)

12
Individual tools for user data

Splitters, taggers, chunkers, parsers, NER, term extractors
Modes of use:
Demonstrators for small-scale online use
Batch mode: upload data, get email with link to download site when job done
Web Services
Integration into Workflows (Taverna)
Some services are compositions of tools

13
Aims

Text mining: discover & extract unstructured knowledge hidden in text (Hearst, 1999)
Text mining helps construct hypotheses from associations derived from text:
protein-protein interactions
associations of genes & phenotypes
functional relationships among genes

14
Impact of text mining

Extraction of named entities (genes, proteins,
metabolites, etc)
Discovery of concepts allows semantic annotation
of documents
Improves information access by going beyond index
terms, enabling semantic querying
Construction of concept networks from text
Allows clustering, classification of documents
Visualisation of concept maps

15
Impact of TM

Extraction of relationships (events and facts) for knowledge discovery
Information extraction, more sophisticated annotation of texts (event annotation)
Beyond named entities: facts, events
Enables even more advanced semantic querying

16
Hypothesis generation from literature

Swanson experiments (1986) influenced conceptual biology
Rapid mining of candidate hypotheses from the literature:
migraine and magnesium deficiency (Swanson, 1988)
indomethacin and Alzheimer's disease (Swanson and Smalheiser 1994)
Curcuma longa and retinal diseases, Crohn's disease and disorders related to the spinal cord (Srinivasan and Libbus 2004)
thalidomide for treating a series of diseases such as acute pancreatitis and chronic hepatitis C (Weeber et al. 2003)

17
Text mining steps

Information Retrieval: yields all relevant texts
Gathers, selects, filters documents that may prove useful
Finds what is known
Information Extraction: extracts facts & events of interest to user
Finds relevant concepts, facts about concepts
Finds only what we are looking for
Data Mining: discovers unsuspected associations
Combines & links facts and events
Discovers new knowledge, finds new associations

18
From Text to Knowledge: NLP and Knowledge Extraction
Lexicons and ontologies
19
Challenge: the resource bottleneck

Lack of large-scale, richly annotated corpora
Support training of ML algorithms
Development of computational grammars
Evaluation of text mining components
Lack of knowledge resources: lexica, terminologies, ontologies

20
Annotation Information Extraction
Biomedical Knowledge
Biomedical Literature

Semantic annotation simulates the ideal performance of an IE system.
IE systems can be developed by referring to an annotated corpus.
The performance of IE systems can be evaluated by comparison with the annotated corpus.
(Kim & Tsujii, Text Mining Workshop, Manchester, 2006)

21
Text Annotation

Task-neutral Annotation
GENIA Corpus
U-Tokyo, NaCTeM
Development of generic tools
Defined by theories
Linguistics
Tokens
POS
Phrase Structure
Dependency Structure
Deep Syntax (PAS)
Biology
Named Entities of various semantic types
Events
Linguistics Biology
Co-references

Task-oriented Annotation
Application annotated text
User system development
Defined by specific tasks
Specific curation tasks in specific environments
Mapping of Protein names to database IDs in
specific text types
Specific event types such as Protein-Protein
Interaction
Disease-Gene Association of specific diseases

22
Annotation of GENIA corpus: Term, POS
Term (entity) annotation: 2000, 400 abstracts
23
Text semantic annotation

Annotation of events and involved named entities
Example: Regulation of Transcription events
BOOTStrep project http://www.nactem.ac.uk/bootstrep.php
Two different types of annotation levels:
linguistic annotation levels
biological annotation level, in charge of marking the biological knowledge contained in the text
Linking text with biological knowledge

24
Events and variables

Biological events can be centred on:
verbs, e.g. activate
nouns with verb-like meanings (nominalised verbs), e.g. transcription
Different parts of a sentence correspond to different types of variables in the event, e.g.
What caused the event: "The narL gene product activates the nitrate reductase operon"
What was affected by the event: "Analysis of mutants"
Where the event took place: "These fusions were formed on plasmid cloning vectors"

25
Verb Frame Example

The narL gene product activates the nitrate reductase operon

Theme characteristics: operon
Agent characteristics: protein
26
Role Name | Description | Phrase Type(s) | Clues
AGENT | Drives or instigates event | Entity or event | Typically subject of verb; follows "by" in passives
Example: The narL gene product activates the nitrate reductase operon
THEME | Affected by or results from event | Entity or event | Typically object of verb; subject in passives
Example: recA protein was induced by UV radiation
MANNER | Method or way in which event is carried out | Event (process), adverb, direction, in vitro, in vivo etc. | by, through, via, using
Example: cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR
27
Role Name | Description | Phrase Type(s) | Clues
INSTRUMENT | Used to carry out event | Entity | with, with the aid of, via, by, through, using
Example: EnvZ functions through OmpR to control porin gene expression in Escherichia coli K-12
LOCATION | Location of event | Entity | in, on, near, etc.
Example: Phosphorylation of OmpR by the osmosensor EnvZ modulates expression of the ompF and ompC genes in Escherichia coli
SOURCE | Start point of event | Entity | from
Example: A transducing lambda phage carrying glpD''lacZ, glpR, and malT was isolated from a strain harbouring a glpD''lacZ fusion
DESTINATION | End point of event | Entity | to, into
Example: Transcription of gntT is activated by binding of the cyclic AMP (cAMP)-cAMP receptor protein (CRP) complex to a CRP binding site
28
Example 1

activates
29
Linguistically Annotated Corpora

GENIA
Domain: MeSH terms Human, Blood Cells, and Transcription Factors
Annotation: POS, named entity, parse tree
Penn BioIE
Domain: the molecular genetics of oncology; the inhibition of enzymes of the CYP450 class
Annotation: POS, named entity, parse tree
Yapex
GENETAG: a corpus of 20K MEDLINE sentences for gene/protein NER

30
The GENIA annotation

Linguistic annotation
Reveals linguistic structures behind the text
Part-of-speech annotation
annotates for the syntactic category of each
word.
Syntactic Tree annotation
annotates for the syntactic structure of
sentences.
Semantic annotation
Reveals knowledge pieces delivered by the text.
Term annotation
annotates domain-specific terms
Event annotation
annotates events on biological entities.

Ontology-driven annotation
31
Annotation Tool

WordFreak http://wordfreak.sourceforge.net/
Java-based linguistic annotation tool developed
at University of Pennsylvania
Extensible to new tasks and domains
Customised visualisation and annotation
specification
Allows annotation process to be made as simple as
possible

Resources

33
What about existing resources?

Ontologies important for knowledge discovery
They form the link between terms in texts and
biological databases
Can be used to add meaning, semantic annotation
of texts

34
Link between text and ontologies
Adding new knowledge
KEGG
Ontological resources
text
UMLS
Supporting semantics
GO
GENIA
35
Bridging the Gap: Integrating data, text and knowledge

Databases
Semantic Interpretation of data
Adding new knowledge
Ontological resources
text
UMLS
Supporting semantics
GO
GENIA
KEGG
Semantic Interpretation of models in Systems
Biology
Mathematical Models
36
Resources for Bio-Text Mining

Lexical / terminological resources
SPECIALIST lexicon, Metathesaurus (UMLS)
Lists of terms / lexical entries (hierarchical
relations)
Ontological resources
Metathesaurus, Semantic Network, GO, SNOMED CT,
etc
Encode relations among entities
Bodenreider, O. (2006) Lexical, Terminological, and Ontological Resources for Biological Text Mining, Chapter 3, Text Mining for Biology and Biomedicine, pp. 43-66

37
SPECIALIST lexicon

UMLS SPECIALIST lexicon http://SPECIALIST.nlm.nih.gov
Each lexical entry contains morphological (e.g. cauterize, cauterizes, cauterized, cauterizing), syntactic (e.g. complementation patterns for verbs, nouns, adjectives) and orthographic information (e.g. esophagus / oesophagus)
General language lexicon with many biomedical terms (over 180,000 records)
Lexical programs include variation (spelling), base form, inflection, acronyms

38
Lexicon record

base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg

Kaposi's sarcoma
Kaposi's sarcomas
Kaposi's sarcomata
Kaposi sarcoma
Kaposi sarcomas
Kaposi sarcomata

The SPECIALIST Lexicon and Lexical Tools. Allen C. Browne, Guy Divita, and Chris Lu, PhD. 2002 NLM Associates Presentation, 12/03/2002, Bethesda, MD
39
Normalisation (lexical tools)

Hodgkin Disease
HODGKIN DISEASE
Hodgkin's Disease
Hodgkin's disease
Disease, Hodgkin ...

disease hodgkin

normalise
40
Steps of Norm

Remove genitive
Hodgkin's Diseases
Replace punctuation with spaces
Hodgkin Diseases
Remove stop words
Hodgkin Diseases
Lowercase
hodgkin diseases
Uninflect each word
hodgkin disease
Word order sort
disease hodgkin

Lexical tools of the UMLS http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html
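
The normalisation steps listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the NLM Norm program: the stop-word list and the `uninflect` helper are small hypothetical stand-ins for the lexicon-driven resources the real lexical tools use.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one used by the UMLS tools.
STOP_WORDS = {"of", "and", "with", "for", "the", "to", "in", "by", "on", "nos"}


def uninflect(word):
    """Crude plural stripping; a hypothetical stand-in for lexicon-based uninflection."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def norm(term):
    """Apply the slide's steps in order and return the normalised string."""
    term = re.sub(r"'s\b", "", term)                # remove genitive
    term = re.sub(r"[^\w\s]", " ", term)            # replace punctuation with spaces
    words = [w for w in term.split() if w.lower() not in STOP_WORDS]  # remove stop words
    words = [w.lower() for w in words]              # lowercase
    words = [uninflect(w) for w in words]           # uninflect each word
    return " ".join(sorted(words))                  # word order sort


variants = ["Hodgkin Disease", "HODGKIN DISEASE", "Hodgkin's Diseases", "Disease, Hodgkin"]
normalised = {norm(v) for v in variants}
```

All four surface forms collapse to the single key `disease hodgkin`, which is what makes the normalised string usable as a lookup key.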
41
The Gene Ontology (GO)

Controlled vocabulary for the annotation of gene products
http://www.geneontology.org/
19,468 terms, 95.3% with definitions
10,391 biological_process
1,681 cellular_component
7,396 molecular_function

42
Gene Ontology

GOA database (http://www.ebi.ac.uk/GOA/) assigns gene products to the Gene Ontology
GO terms follow certain conventions of creation,
have synonyms such as
ornithine cycle is an exact synonym of urea cycle
cell division is a broad synonym of cytokinesis
cytochrome bc1 complex is a related synonym of
ubiquinol-cytochrome-c reductase activity

43
GO terms, definitions and ontologies in OBO

id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
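
A minimal sketch of reading such a stanza, assuming the usual OBO flat-file layout of one `tag: value` pair per line with an optional `! comment` after identifiers. The function name `parse_obo_stanza` is hypothetical, and real OBO files have more syntax (escapes, trailing modifiers) than this handles.

```python
STANZA = '''
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
'''


def parse_obo_stanza(text):
    """Collect 'tag: value' lines of one stanza into a dict of lists (tags may repeat)."""
    record = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):      # skip blanks and headers such as [Term]
            continue
        tag, _, value = line.partition(":")       # split on the first colon only
        value = value.split(" ! ")[0].strip()     # drop a trailing '! comment'
        record.setdefault(tag.strip(), []).append(value)
    return record


record = parse_obo_stanza(STANZA)
```

Values are kept as lists because tags such as `is_a` and `synonym` can occur several times in one term.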

44
Metathesaurus

organised by concept
5M names, 1M concepts, 16M relations
built from 134 electronic versions of many different thesauri, classifications, code sets, and lists of controlled terms ("source vocabularies")
common representation

45
Are the existing knowledge resources sufficient
for TM?

No!
Why?
Limited lexical & terminological coverage of biological sub-domains
Resources focused on human specialists
GO, UMLS, UniProt: ontology concept names frequently confused with terms

46
Naming conventions

Update and curation of resources
FlyBase gene name coverage: 31% (abstracts) to 84% (full texts)
Naming conventions and representation in heterogeneous resources
Term formation guidelines from formal bodies, e.g. HUGO, IPI, not uniformly used
Problems with integration of resources
dystrophin used for 18 gene products
Dystrophin (muscular dystrophy, Duchenne and Becker types), included DXS143, DXS164, DXS206, HUGO

47
Term variation

Terminological variation and complexity of names
High correlation between degree of term variation
and dynamic nature of biomedicine
Variation occurs in controlled vocabularies and
texts but discrepancy between the two
Exact match methods fail to associate term
occurrences in texts with databases

What's in a name?
Terms, named entities in biology

49
What's in a name?

Breast cancer 1 (BRCA1)
p53
Ribosomal protein S27
Heat shock protein 110
Mitogen activated protein kinase 15
Mitogen activated protein kinase kinase kinase 5

From K. Cohen, NAACL 2007
50
Worst gene names

sema domain, seven thrombospondin repeats (type 1
and type 1-like), transmembrane domain (TM) and
short cytoplasmic domain, (semaphorin) 5A

K. Cohen NAACL 2007
52
Worst gene names

sema domain, seven thrombospondin repeats (type 1
and type 1-like), transmembrane domain (TM) and
short cytoplasmic domain, (semaphorin) 5A
SEMA5A

K. Cohen NAACL 2007
53
Worst gene names

sema domain, seven thrombospondin repeats (type 1
and type 1-like), transmembrane domain (TM) and
short cytoplasmic domain, (semaphorin) 5A
SEMA5A
Tyrosine kinase with immunoglobulin and epidermal
growth factor homology domains
tie

K. Cohen NAACL 2007
54
Term ambiguity

NF2 can denote:
Neurofibromatosis 2 (disease)
Neurofibromin 2 (protein)
Neurofibromatosis 2 gene (gene)

O. Bodenreider, MIE 2005 tutorial http://www.nactem.ac.uk/
55
Term ambiguity

Gene terms may also be common English words
BAD: human gene encoding BCL-2 family of proteins (bad news, bad prediction)
Gene names are often used to denote gene products (proteins)
suppressor of sable is used ambiguously to refer to either genes or proteins
Existing resources lack information that can support term disambiguation
Difficult to establish equivalences between term forms and concepts

56
Homologues

Cyclin-dependent kinase inhibitor was first introduced to represent a protein family, p27
But it is used interchangeably with p27 or p27kip1, as the name of the individual protein and not as the name of the protein family (Morgan 2003).
NFKB2 denotes the name of a family of 2 individual proteins with separate IDs in Swiss-Prot.
These proteins are homologues belonging to different species, Homo sapiens & chicken.

57
Terms

Term: linguistic realisation of specialised concepts, e.g. genes, proteins, diseases
Terminology: collection of terms structured (hierarchy) denoting relationships among concepts, part-whole, is-a, specific, generic, etc.
Terms link text and ontologies
Mapping is not trivial (main challenge)

58
Term variation and ambiguity
Term1 Term2 Term3 TEXT
Term variation
Term ambiguity
Concept1 concept2 concept3
ONTOLOGY
59
Term mining steps
Term recognition
Tp53
Term classification
Gene
Genome Database, IARC TP53 Mutation Database
Term mapping
60
Term recognition techniques

ATR extracts terms (variants) from a collection of documents
Distinguishes terms vs non-terms
In NER the steps of recognition and classification are merged; a classified terminological instance is a named entity
The tasks of ATR and NER share techniques but their ultimate goals are different
ATR: for resource building (lexica & ontologies)
NER: first step of IE, text mining

61
Overview papers

S. Ananiadou & G. Nenadic (2006) Automatic Terminology Management in Biomedicine, Text Mining for Biology and Biomedicine, pp. 67-97.
M. Krauthammer & G. Nenadic (2004) Term identification in the biomedical literature, JBI 37 (2004) 512-526
J.C. Park & J. Kim (2006) Named Entity Recognition, Text Mining for Biology and Biomedicine, pp. 121-142
Detailed bibliography in Bio-Text Mining:
BLIMP http://blimp.cs.queensu.ca/
http://www.ccs.neu.edu/home/futrelle/bionlp/
Book on BioText Mining:
S. Ananiadou & J. McNaught (eds) (2006) Text Mining for Biology and Biomedicine, Artech House.
Other Bio-Text Mining tutorials:
Kevin Cohen (NAACL 2007 tutorial), U. Colorado

62
Main ATR approaches
63
Dictionary NER (1)

Use terminological resources to locate term occurrences in text
NCBI http://www.ncbi.nlm.nih.gov/
EBI http://www.ebi.ac.uk/
Neologisms, variations, ambiguity problematic for simple dictionary look-up
Ambiguous words, e.g. an, for, can
Spelling variants, punctuation, word order variations
estrogen / oestrogen
NF kappa B / NF kB

64
Dictionary NER (2)

Hirschman (2002) used FlyBase for gene name recognition; results disappointing due to homonymy, spelling variations
Precision: 7% (abstracts), 2% (full papers)
Recall: 31% (abstracts), 84% (full papers)
Tuason (2004) reports term variation as main problem of mismatch
bmp-4 / bmp4
syt4 / syt iv
integrin alpha 4 / alpha4 integrin

65
Dictionary NER (3)

Tsuruoka & Tsujii (2003) suggest a probabilistic generator of spelling variants based on edit distance operations (delete, substitute, insert)
Terms within edit distance (ED) 1 considered spelling variants
Used a dictionary of protein terms
Support query expansion
Augment dictionaries with variation
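
To make the edit-distance idea concrete, here is a minimal sketch using plain Levenshtein distance (delete, substitute, insert, each at cost 1). It is not the probabilistic variant generator itself; the helper `find_variants` and the toy dictionary are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost delete, insert and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free on a match)
        prev = cur
    return prev[-1]


def find_variants(term, dictionary, max_ed=1):
    """Return dictionary entries within max_ed edits of term."""
    return [entry for entry in dictionary if edit_distance(term, entry) <= max_ed]


toy_dictionary = ["NF kB", "NF-kappa B", "p53"]   # hypothetical entries
matches = find_variants("NF-kB", toy_dictionary)
```

With a threshold of 1, `bmp-4` and `bmp4` are linked, while a pair such as `syt4` and `syt iv` is not, which suggests why purely character-level distance is usually combined with other normalisation.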

66
Rule NER (2)
67
Rule based (1)

Use orthographic, morpho-syntactic features of
terms
Rules that make use of internal term formation
patterns (tagging, morphological analysers) e.g.
affixes, combining forms
Do not take into account contextual features
Dictionaries of constituents e.g. affixes,
neoclassical forms included
Portability to different domains?

68
Rule based (2)

Ananiadou, S. (1994) recognised single-word terms based on morphological analysis of term formation patterns (internal term make-up)
based on analysis of neoclassical and hybrid elements
alphafetoprotein, immunoosmoelectrophoresis, radioimmunoassay
some elements are used for creating terms
term → word + term_suffix
term → term + word_suffix
neoclassical combining forms (electro-, adeno-), prefixes (auto-, hypo-), suffixes (-osis, -itis)
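
A rough sketch of how such internal term-formation clues might be checked, using only the affixes named on the slide. The function `looks_like_term` is hypothetical and far simpler than a real morphological analyser: it does plain string matching and will over-generate (e.g. on "automatic").

```python
# Affix lists taken from the slide; a real system would use much larger inventories.
COMBINING_FORMS = ("electro", "adeno")
PREFIXES = ("auto", "hypo")
SUFFIXES = ("osis", "itis")


def looks_like_term(word):
    """Flag a single word as a term candidate if it carries a known neoclassical element."""
    w = word.lower()
    return (w.endswith(SUFFIXES)
            or w.startswith(PREFIXES)
            or any(form in w for form in COMBINING_FORMS))


candidates = [w for w in ["hepatitis", "adenocarcinoma", "table", "hypoglycaemia", "protein"]
              if looks_like_term(w)]
```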

69
Rule-based (3)

Fukuda (1998) used lexical, orthographic features for protein name recognition, e.g. upper case character, numerals etc.
PROPER: core and feature elements
Core: meaning-bearing elements
Feature: function elements
Example: SAP kinase (core: SAP; feature: kinase)
Core elements extended to feature based on concatenation rules (based on POS tags)
70
Rule-based (4)

Gaizauskas (2000): CFG for protein name recognition (PASTA, EMPATHIE)
Based on morphological and lexical characteristics of terms
biochemical suffixes (-ase: enzyme name)
dictionary look-up (protein names, chemical compounds, etc)
deduction of term grammar rules from Protein Data Bank

protein -> protein_modifier, protein_head, numeral
71
Rule-based (5)

Inspired by PROPER, Yapex uses Swiss-Prot to add core term elements
http://www.sics.se/humle/projects/prothalt/yapex.cgi
Hou (2003) used Yapex with context information (collocations) appearing with protein names
Rule based approaches construct rules and patterns manually or automatically
Difficult to tune to different domains

72
Machine learning systems

Learn features from training data for term
recognition and classification
Most ML systems combine recognition and
classification
Challenges
Feature selection and optimisation
Availability of training data
detection of term boundaries

73
Overview of ML-based NER

Training phase
Testing phase

Detecting features
Learning model

Manually tagged texts
Learned Model
Tag annotator with model
Tagged texts
Raw texts
74
ML (1)

Nobata et al. (1999) used Decision Tree for NER
Decision tree: one of the methods to classify a case using training data
Node: specifies some condition with a subtree
Leaf: indicates a class
Features:
Part-of-speech information
Orthographic information
Term lists

75
Example of a decision tree
Each node has one condition
Is the current word in the Protein term list?
No
Yes
Does the previous word have figures?
What is the next word's POS?
No
Noun
Yes
Verb

Each leaf has one class
PROTEIN
Unknown
RNA
DNA

76
ML (2)

Collier (2000) used HMM, orthographic features for term recognition
HMM looks for most likely sequence of classes corresponding to a word sequence, e.g. interleukin-2 protein/DNA
To find similarities between known words (training set) and unknown words, use character features
Feature | Examples
DigitNumber | 2 (protein), 3 (DNA)
GreekLetter | alpha (protein)
TwoCaps | RelB (protein), TAR (RNA)
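
Character features of this kind are easy to compute per token. The sketch below uses the three feature names from the slide; their exact definitions here (and the `Other` fallback, the Greek-letter list, and the order of the checks) are assumptions for illustration, not the published feature set.

```python
# Assumption: a short illustrative list of spelled-out Greek letters.
GREEK_LETTERS = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "lambda"}


def char_feature(word):
    """Map a token to one coarse orthographic class."""
    if word.isdigit():
        return "DigitNumber"
    if word.lower() in GREEK_LETTERS:
        return "GreekLetter"
    if sum(1 for c in word if c.isupper()) >= 2:
        return "TwoCaps"
    return "Other"


features = [char_feature(w) for w in ["2", "alpha", "RelB", "TAR", "protein"]]
```

An unseen token such as `RelB` then shares a feature value with known protein names even though the word itself never occurred in the training data.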

77
ML (2)

Use of GENIA resources as training data
Results depend on training data
Morgan (2004) used FlyBase to automatically construct a training corpus
Pattern matching for gene name recognition produced a noisily annotated corpus
HMM was trained on that corpus for gene name recognition

78
Support Vector Machines (1)

Kazama trained multi-class SVMs on GENIA corpus
Corpus annotated with B-I-O tags
B tags denote words at beginning of term
I tags: inside term
O tags: outside term
B-protein tag: word at the beginning of a protein name
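
As an illustration of the B-I-O encoding, the following sketch converts term spans into per-token tags. The function `to_bio` and the span annotations on the example sentence are hypothetical; they are not taken from the GENIA corpus.

```python
def to_bio(tokens, spans):
    """spans: (start, end, label) over token indices, end exclusive."""
    tags = ["O"] * len(tokens)                 # everything is outside a term by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"             # first word of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"             # remaining words inside the term
    return tags


tokens = "The narL gene product activates the nitrate reductase operon".split()
spans = [(1, 4, "protein"), (6, 9, "DNA")]     # illustrative annotation only
tags = to_bio(tokens, spans)
```

Each token now carries exactly one class label, which is what lets a multi-class classifier treat recognition and classification as a single tagging problem.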

79
SVMs for NER (2)

Yamamoto used a combination of features for protein name recognition
Morphological, lexical, boundary, syntactic (head noun), domain specific (if term exists in biomedical database)
Lee uses different features for recognition and classification
orthographic, prefix, suffix
Contextual information

80
Hybrid approaches

Combine rules, statistics, resources

81
Hybrid (1)

ABGene: protein and gene name tagger
Combines ML, transformation rules, dictionaries with statistics
Protein tagger trained on MEDLINE abstracts by adapting Brill's tagger
Transformation rules for recognition of gene, protein names
Used GO, LocusLink list of genes, proteins for false negative tags

82
Hybrid (2)

ARBITER (Access and Retrieve Binding Terms) uses
UMLS Metathesaurus and GenBank to map NPs
(binding terms)
morphological features
lexical information (head noun)
EDGAR recognises gene, cell, drug names using
co-occurrences of cell, clone, expression

83
Hybrid (3)

C/NC value (Frantzi & Ananiadou, 1999)
C-value:
Linguistic filters
total frequency of occurrence of string in corpus
frequency of string as part of longer candidate terms (nested terms)
number of these longer candidate terms
length of string
Output: automatically ranked terms (TerMine)

84
C-value

C-value measure extracts multi-word, nested terms
adenoid cystic basal cell carcinoma
cystic basal cell carcinoma
ulcerated basal cell carcinoma
recurrent basal cell carcinoma
basal cell carcinoma
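
The statistical part of C-value combines the four quantities listed earlier: a candidate's frequency is discounted by the average frequency of the longer candidates that contain it, and the result is weighted by log2 of its length in words. The sketch below assumes candidates have already passed the linguistic filter; the frequencies are invented for illustration, and the NC-value context step is omitted.

```python
import math


def contains(seq, sub):
    """True if the word list sub occurs contiguously inside seq."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))


def c_value(candidates):
    """candidates: dict mapping a multi-word candidate term to its corpus frequency."""
    scores = {}
    for term, freq in candidates.items():
        words = term.split()
        # frequencies of the longer candidates in which this term is nested
        longer = [f for other, f in candidates.items()
                  if other != term and contains(other.split(), words)]
        discount = sum(longer) / len(longer) if longer else 0.0
        scores[term] = math.log2(len(words)) * (freq - discount)
    return scores


# Hypothetical frequencies, for illustration only.
freqs = {
    "adenoid cystic basal cell carcinoma": 5,
    "cystic basal cell carcinoma": 11,
    "ulcerated basal cell carcinoma": 7,
    "recurrent basal cell carcinoma": 5,
    "basal cell carcinoma": 984,
}
scores = c_value(freqs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Note that the log2 length weight gives single-word candidates a score of zero, which reflects the measure's focus on multi-word terms.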

85
Term variation

variation recognition as part of ATR (Nenadic,
Ananiadou)
recognise term forms and link them into
equivalence classes
important if ATR is based on statistics (e.g.
frequency of occurrence)
corpus-based measures are distributed across
different variants
conflation of various surface representations of
a given term should improve ATR

86
Simple variation

orthographic
hyphens, slashes (amino acid and amino-acid)
lower/upper cases (NF-KB and NF-kb)
spelling variations (tumour and tumor)
transliterations (oestrogen and estrogen)
morphological
inflectional phenomena (plural, possessives)
lexical
genuine synonyms (carcinoma and cancer)

87
Complex variation

Structural
Possessive usage of nouns using prepositions
(clones of human and human clones)
Prepositional variants (cell in blood, cell from
blood)
Term coordinations (adrenal glands and gonads)

88
Coordinated term variants

Structure is ambiguous
Head coordination or term conjunction?
Head or argument coordination?
(NA) CC (NA) N
cell differentiation and proliferation
chicken and mouse receptors

89
TerMine: a term management system
Demo
90
http://www.nactem.ac.uk/software/termine/
91
Marrying IR and terminology

IR engine plus TerMine
Discover associated terms ranked according to
relevance
Allow user to link term with IR for document
discovery
NB compound terms
NB technical terms, not classic index terms
NB terms familiar to user, found in documents

92

http://www.nactem.ac.uk/software/ctermine/
93
Biomedical IE/IR Systems

iHOP
http://www.ihop-net.org/UniPub/iHOP/
EBIMed
http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
GoPubMed
http://www.gopubmed.org/
PubFinder
http://www.glycosciences.de/tools/PubFinder
Textpresso
http://www.textpresso.org/

94
Acronyms

Very productive type of term variation
Acronym variation (synonymy)
NF kappa B / NF kB / nuclear factor kappa B
Acronym ambiguity (polysemy) even in controlled vocabularies
GR: glucocorticoid receptor / glutathione reductase

95
Acronym recognition

Schwartz, A. & Hearst, M. (2003) A simple algorithm for identifying abbreviation definitions in biomedical text, PSB 2003, 8, 451-462
Adar, E. (2004) SaRAD: a simple and robust abbreviation dictionary, Bioinformatics, 20(4), 527-533
Chang, J.T. & Schütze, H. (2006) Abbreviations in biomedical text, Text Mining for Biology and Biomedicine, pp. 99-119, Artech House
Tsuruoka, Y., Ananiadou, S. & Tsujii, J. (2005) A machine learning approach to automatic acronym generation, ISMB, BioLink SIG, 25-31
Okazaki, N. & Ananiadou, S. (2006) Acronym recognition based on term identification, Bioinformatics

96
The importance of acronym recognition

Acronyms are among the most productive types of term variation
64,242 new acronyms were introduced in 2004 [Chang and Schütze 06]
Acronyms are used more frequently than full terms
5,477 documents could be retrieved by using the acronym JNK, while only 3,773 documents could be retrieved by using its full term, c-jun N-terminal kinase [Wren et al. 05]
No rules or exact patterns for the creation of acronyms from their full form

97
Recognition

Extracting pairs of short and long forms: <acronym, long form>
Distinguishing acronyms from parenthetical expressions
Search for parentheses in text: single or more words, e.g. Ab (antibody)
Limit context around ( ): limit number of words according to number of letters in acronym

98
Recognition (heuristics)

Heuristics: match letters of acronym with letters of long form using rules, patterns
letters from beginning of words
combining forms
carboxifluorescein diacetate (CFDA)
Acronym normalisation to allow orthographic, structural and lexical variations
morphological information, positional info
Penalise words in long form that do not match acronym
Accidental matching
argininosuccinate synthetase (AS)
99
Letter matching

Alignment: find all matches between letters of acronyms and their long forms and calculate likelihood (Chang & Schütze)
Solves problem of acronyms containing letters not occurring in LF
Choose best alignment based on features, e.g. position of letter etc.
Finding optimal weight for each feature is a challenge

http://abbreviation.stanford.edu/
100
Acronym Recognition
Okazaki, N., Ananiadou, S. (2006) Building an
abbreviation dictionary using a term recognition
approach. Bioinformatics.
101
A simple algorithm: Schwartz and Hearst (2003)

Uses parenthetical expressions as a marker of a short form:
long-form (short-form)
All letters and digits in a short form must appear in the corresponding long form in the same order
We used hidden markov model (HMM) to
Early repolarization (ER) is an enigma.
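
The letter-matching rule can be sketched as a right-to-left scan: each character of the short form must be found, in order, in the text before the parenthesis, and the first character must start a word. This is a compact Python rendering of that idea, not the authors' released code; the function names are hypothetical, and the word window `min(|A| + 5, |A| * 2)` is the bound reported for this algorithm.

```python
import re


def find_best_long_form(short_form, candidate):
    """Return the shortest suffix of candidate matching short_form, or None."""
    l = len(candidate) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():                      # ignore hyphens etc. in the short form
            continue
        # move left until ch matches; the first character must also begin a word
        while (l >= 0 and candidate[l].lower() != ch) or \
              (s == 0 and l > 0 and candidate[l - 1].isalnum()):
            l -= 1
        if l < 0:
            return None
        l -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1    # extend back to the word boundary
    return candidate[start:]


def extract_pairs(sentence):
    """Find (short form, long form) pairs of the shape 'long form (SF)'."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = m.group(1).strip()
        if not short:
            continue
        words = sentence[:m.start()].split()
        window = min(len(short) + 5, len(short) * 2)   # limit the context by acronym length
        long_form = find_best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs.append((short, long_form))
    return pairs


pairs = extract_pairs("We used hidden markov model (HMM) to")
```

Because only letter order is checked, the scan also accepts "acquired syndrome" as a long form for AIDS, which is one of the weaknesses of pure letter matching.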

102
Problems of letter-matching approach

Highly dependent on the expressions in the target
text
o acquired immuno deficiency syndrome (AIDS)
x acquired syndrome (AIDS)
x a patient with human immunodeficiency syndrome
(AIDS)
? magnetic resonance imaging unit (MRI)
! beta 2 adrenergic receptor (ADRB2)
! gamma interferon (IFN-GAMMA)
(These examples are obtained from actual MEDLINE
abstracts)
Naive with respect to term variations

103
AcroMines approach

Extract a word or word sequence
Co-occurring frequently with an acronym (e.g.,
TTF-1)
1, factor 1, transcription factor 1, thyroid
transcription factor 1
Does not co-occur with other surrounding words
thyroid transcription factor 1
Not necessarily based on letter-matching
Note that this is a difficult case for the
letter-matching algorithm
Prune unlikely candidates
Nested candidates: transcription factor 1
Expansions: expression of thyroid transcription
factor 1
Insertions: thyroid specific transcription factor
1

104
Short-form mining

Enumerate all short forms in a target text
Using parentheses as a clue: (short-form)
Validation rules for identifying acronyms
[Schwartz and Hearst 03]
It consists of at most two words
Its length is between two and ten characters
It contains at least one alphabetic letter
The first character is alphanumeric

The contextual sentence of HMM and ASR:
The present system consists of a hidden Markov
model (HMM) based automatic speech recognizer
(ASR), with a keyword spotting system to capture
the machine sensitive words (registered in a
dictionary) from the running utterances.
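
The four validation rules can be sketched as a small Python filter. The function names are hypothetical, and counting the length in characters including any internal space is an assumption, since the slide does not say how two-word short forms are measured.

```python
import re


def is_valid_short_form(candidate):
    """Apply the four validation rules listed on the slide."""
    words = candidate.split()
    return (
        1 <= len(words) <= 2                       # at most two words
        and 2 <= len(candidate) <= 10              # two to ten characters
        and any(ch.isalpha() for ch in candidate)  # at least one letter
        and candidate[0].isalnum()                 # alphanumeric first character
    )


def enumerate_short_forms(sentence):
    """Collect parenthesised expressions that pass the validation rules."""
    found = [m.strip() for m in re.findall(r"\(([^()]+)\)", sentence)]
    return [candidate for candidate in found if is_valid_short_form(candidate)]
```

On the contextual sentence above, this keeps HMM and ASR and rejects "registered in a dictionary", which has more than two words.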
105
Enumerating long-form candidates for an acronym

Tokenize a contextual sentence by
non-alphanumeric characters (e.g., space, hyphen,
etc.)
Apply Porter's stemming algorithm [Porter 80]
Extract terms that match the following pattern
WORD.

Empty string or words of any length
We studied the expression of thyroid
transcription factor-1 (TTF-1).
1
factor 1
transcript factor 1
thyroid transcript factor 1
of thyroid transcript factor 1
expression of thyroid transcript factor 1
studi the expression of thyroid transcript factor 1
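
The enumeration step can be sketched in Python as follows. This is an illustration under assumptions: the function name `long_form_candidates` is hypothetical, tokens are lowercased, and Porter stemming is left out so that the block runs with the standard library only (so it yields "transcription" where the slide shows the stem "transcript").

```python
import re


def long_form_candidates(sentence, short_form):
    """List the word sequences that end right before '(short_form)',
    from the shortest to the longest."""
    marker = "(" + short_form + ")"
    position = sentence.find(marker)
    if position < 0:
        return []
    # Tokenise the left context on non-alphanumeric characters.
    tokens = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", sentence[:position]) if t]
    return [" ".join(tokens[i:]) for i in range(len(tokens) - 1, -1, -1)]
```

For the TTF-1 sentence this returns "1", "factor 1", "transcription factor 1" and so on, up to the whole left context of the parenthesis.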
106
Expansions for TTF-1
107
Top 20 acronyms in MEDLINE
108
Long-form candidates for acronym ADM
Candidate                          Length  Frequency  Score  Validity
adriamycin                         1       727        721.4  o
adrenomedullin                     1       247        241.7  o
abductor digiti minimi             3       78         74.9   o
doxorubicin                        1       56         54.6   x
effect of adriamycin               3       25         23.6   Expansion
adrenodemedullated                 1       19         17.7   o
acellular dermal matrix            3       17         15.9   o
peptide adrenomedullin             2       17         15.1   Expansion
effects of adrenomedullin          3       15         13.2   Expansion
resistance to adriamycin           3       15         13.2   Expansion
amyopathic dermatomyositis         2       14         12.8   o
brevis and abductor digiti minimi  5       11         9.8    Expansion
minimi                             1       83         5.8    Nested
digiti minimi                      2       80         3.9    Nested
automated digital microscopy       3       1          0.0    match
adrenomedullin concentration       2       1          0.0    Nested
109
Long-form extraction

Long-form candidates are sorted by score in
descending order
A long-form candidate is considered valid if
It has a score greater than 2.0
The words in the long form can be rearranged so
that all alphanumeric letters appear in the same
order as the short form
It is not nested in, or an expansion of, the
previously chosen long forms
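
The three validity rules can be sketched in Python as below. This is one possible reading, not AcroMine's implementation: the function names are hypothetical, "rearranged" is interpreted as trying every word order, and "nested" and "expansion" are approximated by word-level containment in either direction.

```python
from itertools import permutations


def is_subsequence(short, text):
    """True if the items of short occur in text in the same order."""
    remaining = iter(text)
    return all(ch in remaining for ch in short)


def letters_match(short_form, long_form):
    """Rule 2: some word order of the long form contains the short-form
    letters and digits in order."""
    short = [c for c in short_form.lower() if c.isalnum()]
    for order in permutations(long_form.lower().split()):
        text = [c for c in "".join(order) if c.isalnum()]
        if is_subsequence(short, text):
            return True
    return False


def contains_words(outer, inner):
    """True if inner occurs in outer as a whole-word sequence."""
    return f" {inner} " in f" {outer} "


def extract_long_forms(short_form, scored_candidates, threshold=2.0):
    """Walk candidates by descending score and keep those passing all rules."""
    chosen = []
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    for candidate, score in ranked:
        if score <= threshold:                         # rule 1
            continue
        if not letters_match(short_form, candidate):   # rule 2
            continue
        if any(contains_words(c, candidate) or contains_words(candidate, c)
               for c in chosen):                       # rule 3
            continue
        chosen.append(candidate)
    return chosen
```

Fed with the candidates and scores for ADM listed on slide 108, this sketch keeps the six rows marked "o" and drops doxorubicin (no letter match) together with the nested and expanded variants.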

110
http://www.nactem.ac.uk/software/acromine/
111
Acronym disambiguation

Local acronyms
Accompany their expanded forms in documents
Global acronyms
Appear in documents without the expanded forms
stated
Need to have their correct expanded forms
identified
Immunomodulatory effects of CT were investigated
in a rat model, and the effects of CT on rat
renal allograft (from Lewis rat to WKAH rat) were
also examined.
Immunomodulatory effects of cholera toxin (CT)
were investigated in a rat model, and the effects
of cholera toxin (CT) on rat renal allograft
(from Lewis rat to Wistar-King-Aptekman-Hokudai
(WKAH) rat) were also examined.

112
Acronym disambiguation
Sample text: Considerations in the identification
of functional RNA structural elements in genomic
alignments (Tomas Babak et al.)
http://www.biomedcentral.com/1471-2105/8/33
113

Term structuring

114
Term structuring

term clustering (linking semantically similar
terms) and term classification (assigning terms
to classes from a pre-defined classification
scheme)
Hypothesis: similar terms tend to appear in
similar contexts (patterns)
combining various sources of similarity
lexical
syntactic
contextual
Ontological (using external resources)

115
Term structuring

Based on term similarities
choice of features
domain specific -> ontology
linguistic -> text
ontology-based similarity
textual similarity
internal features
contextual features

116
Using ontologies

two terms should match if they are
identified as variants
siblings in the is-a hierarchy
in the is-a or part-whole relation
the distance between the corresponding nodes in
the ontology should be transformed into the
matching score
See I. Spasic presentation, MIE Tutorial
http://www.nactem.ac.uk/
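
Turning node distance into a matching score can be sketched in Python as follows. Everything specific here is an assumption for illustration: the ontology is a plain child-to-parent dictionary, distance is the shortest path over is-a links, and the transformation 1 / (1 + distance) is just one simple choice; the slide does not prescribe a formula.

```python
from collections import deque


def ontology_distance(parents, a, b):
    """Shortest path length between two nodes over undirected is-a links,
    or None if they are not connected."""
    neighbours = {}
    for child, parent in parents.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, distance = queue.popleft()
        if node == b:
            return distance
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


def matching_score(parents, a, b):
    """Map distance to a score in (0, 1]; unrelated terms score 0.0."""
    distance = ontology_distance(parents, a, b)
    return 0.0 if distance is None else 1.0 / (1.0 + distance)
```

With a toy hierarchy in which two terms share a parent, the siblings are two links apart and score 1/3, while identical terms score 1.0.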

117
Using text

A number of neologisms: these terms are not in the
ontologies
Use of text-based techniques to calculate
similarities
edit distance (ED): the minimal number (or cost)
of changes needed to transform one string into
the other
edit operations
insertion: ...a-c... -> ...abc...
deletion: ...abc... -> ...a-c...
replacement: ...abc... -> ...adc...
transposition: ...abc... -> ...acb...
use of dynamic programming
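
A dynamic-programming table for edit distance with the four operations can be sketched in Python. This is a minimal sketch assuming unit cost for every operation and transposition of adjacent characters only; the function name `edit_distance` is an assumption.

```python
def edit_distance(source, target):
    """Minimal number of insertions, deletions, replacements and adjacent
    transpositions needed to turn source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all characters of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert all characters of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # replacement or match
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]
```

Each of the four example pairs on the slide (ac/abc, abc/ac, abc/adc, abc/acb) is one operation apart under this definition.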

118
Term similarities

lexical similarity based on sharing term head
and/or modifier(s) --hyponymy
nuclear receptor
orphan nuclear receptor
Sharing heads
progesterone receptor oestrogen receptor
Specific types of associations
mainly general is_a and part_of
some domain-specific, e.g. binding: CREB binding
protein

119
Contextual similarities

Features from context
syntactic category
terminological status
position relative to the term
syntactic relation between a context element and
the term
semantic properties
semantic relation between a context element and
the term

120
Lexical syntactic patterns

a lexico-syntactic pattern
... Term (, Term)* , and other Term ...
the leading Terms: hyponyms of the head Term
... antiandrogens, hydroxyflutamide,
bicalutamide,
cyproterone acetate, RU58841, and other
compounds ...
candidate instances of the hyponymy relation
hyponym( antiandrogens, compound )
hyponym( hydroxyflutamide, compound )
hyponym( bicalutamide, compound )
hyponym( cyproterone acetate, compound )
hyponym( RU58841, compound )
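
The "and other" pattern can be sketched with a regular expression in Python. This is a simplified illustration: it works on raw text rather than on recognised terms, treats every comma-separated chunk as a term, takes a single word as the head, and singularises it by stripping a final "s". All of these are assumptions, as are the names `PATTERN` and `hyponym_candidates`.

```python
import re

# items: comma-separated list before "and other"; head: the word after it.
PATTERN = re.compile(r"(?P<items>[^.;:]+?),? and other (?P<head>\w+)")


def hyponym_candidates(text):
    """Return (hyponym, hypernym) candidate pairs found by the pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        head = match.group("head")
        if head.endswith("s"):
            head = head[:-1]  # naive singularisation (assumption)
        for item in match.group("items").split(","):
            item = item.strip()
            if item:
                pairs.append((item, head))
    return pairs
```

Applied to the antiandrogen example, this produces the five candidate pairs listed on the slide, each with "compound" as the hypernym.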

121
Contextual information

automatic pattern mining for most important
context patterns
find most important contexts in which a term
appears
receptor is bound to these DNA sequences
proteins bound to the DNA
estrogen receptor bound to DNA
steroid receptor coactivator-1 when bound to
DNA
progesterone receptor complexes bound to DNA
RXRs bound to respective DNA elements in vitro
glucocorticoid receptor to bind DNA
pattern: <TERM> V:bind <TERM:DNA>

122
Stumbling blocks

Lexical similarities affected by many neologisms
and ad hoc names
only 5% of most frequent terms in GENIA belonging
to same biomedical class have some lexical links
how much context to use? (sentence, phrase,
abstract, ...)
Attempts at using co-occurrence: many report up
to 40% of co-occurrence based relationships
biologically meaningless

123
Term similarities

SOLD: Syntactic, Ontology-driven & Lexical
Distance (Spasic, I. & Ananiadou, S. 2005,
Bioinformatics)
hybrid approach to comparing term contexts, which
relies on
linguistic information (acquired through tagging
and parsing)
domain-specific knowledge (obtained from the
ontology)
based on the approximate pattern matching
combines ontology-based similarity with
corpus-based similarity using both internal and
contextual features

124
Challenges of biomedical terminology

Linking term forms in text with existing resources
Term clustering, classification and linking to
databases, ontologies
Selection of most representative terms (concepts)
in documents (important for improved IR, database
curation, annotation tasks)
Efficient term management important for updating
terminological and ontological resources, text
mining applications e.g. IE, Q/A, summarisation,
linking heterogeneous resources, IR etc

125
Information Extraction in Biology

Results appear depressed compared to general
language
Dependent on earlier stages of processing
(tokenisers, taggers, results from NER, etc)
MUC data: 80% F-score for template relations, 60%
for events
Challenge for bio-text mining is to achieve
similar results
Evaluation: see Hirschman, L. (Text mining book),
BioCreATive 2004

126

Information Extraction

127
IE in Biology

Pattern-matching
Context-free grammar approaches
Full parsing approaches
Sublanguage driven IE
Ontology-driven IE

McNaught, J. & Black, W. (2006) Information
Extraction, Text Mining for Biology and
Biomedicine, Artech House, pp. 143-177
128
Pattern-matching IE

Usual limitations with non-inclusion of semantic
processing
Large amount of surface grammatical structures:
too many patterns (Zipf's law)
Cannot exploit syntactic generalisations (active,
passive voice)
Systems extract phrases or entire sentences with
matched patterns: restricted usefulness for
subsequent mining

129
Pattern-matching systems (1)

BioIE: uses patterns to extract sentences, protein
families, structures, functions...
Presents user with relevant information,
improvement on classic IR
BioRAT: uses deeper analysis, tagging, applies RE
over POS tags, stemming, gazetteer categories etc
Templates applied to extract matching phrases,
primitive filters (verbs are not proteins, etc)

130
Pattern matching systems (2)

RLIMS-P (Hu): extracts protein phosphorylation by
looking for enzymes, substrates and sites, assigned
to the agent, theme and site roles of
phosphorylation relations
POS tagger (trained on newswire), chunking,
semantic typing of chunks, identification of
relations using pattern-matching rules
Semantic typing of NPs using combination of clue
words, suffixes, acronyms etc
Semantically typed sentences matched with rules
Patterns target sentences containing
phosphorylate

131
Full parsing approaches

Link Grammar applied for protein-protein
interactions: general English grammar adapted to
bio-text
Link Grammar finds all possible linkages
according to its grammar
Number of analyses reduced by random sampling,
heuristics; processing constraints relaxed
10,000 results permitted per sentence
60% of protein interactions extracted
Problems: missing possessive markers,
determiners, coordination of compound noun
modifiers

132
Full parsing IE (2)

Not all parsing strategies suitable for bio-text
mining
Text type, abstracts, ungrammaticality related
to sublanguage characteristics?
Ambiguity and full parsing: fragmentary phrases
(titles, headings, text in table cells, etc)
CADERIGE project used Link Grammar but in shallow
parsing mode
Kim & Park (BioIE) use combinatory categorial
grammar, annotated with GO concepts, extract
general biological interactions
1,300 patterns applied to find instances of
patterns with keywords

133
Full parsing (3)

Keywords indicate basic biological interactions
Patterns find potential arguments of the
interaction keywords (verbs or nominalisations)
Validated arguments mapped into GO concepts
Difficult to generalise interaction keyword
patterns
BioIE's syntactic parsing performance improved
after adding subcategorisation frames on verbal
interaction keywords

134
Full parsing (4)

Daraselia (2004) uses full parsing and a
domain-specific filter to extract protein interactions
All syntactic analyses discovered using CFG and
variant of LFG
Each alternative parse mapped to its
corresponding semantic representation
Output: set of semantic trees, lexemes linked by
relations indicating thematic or attributive
roles
Apply custom-built, frame-based ontology to
filter representations of each sentence
Preference mechanism controls construction of
frame tree: high precision, low recall (21%)

135
Sublanguage-driven IE (1)

Language of a special community (e.g. biology)
Particular set of constraints with respect to general language (GL)
Constraints operate at all linguistic levels
Special vocabulary (terms)
Specialised term formation rules
Sublanguage syntactic patterns
Sublanguage semantics
These constraints give rise to the informational
structure of the domain (Z. Harris)
See JBI 35(4) Special Issue on Sublanguage

136
GENIES system

Employs sublanguage (SL) approach to extract
biomolecular interactions
Uses hybrid syntactic-semantic rules
Syntactic and semantic constraints referred to in
one rule
Able to cope with complex sentences
Frame-based representation
Embedded frames
Domain specific ontology covers both entities and
events

137
GENIES system

Default strategy: full parsing
Robust due to sublanguage constraints
Much ambiguity excluded
If full parse fails, partial parsing invoked
Maintains good level of recall
Precision 96%, Recall 63%

138
Ontology-driven IE

Until recently most rule-based IE systems have used
neither linguistic lexica nor ontologies
Reliance on gazetteers
Small number of semantic categories
Gazetteer approach not well suited to bio-IE
Ontology based vs ontology driven
Passive use of ontologies: map discovered entity
to concept
Active use: ontology guides and constrains
analysis, fewer rules
Examples: PASTA, GenIE (not SL)
GENIES (SL and ontology driven)

139
Summary: simple pattern matching

Over text strings
Many patterns required, no generalisation
possible
Over POS
Some generalisation but ignore sentence structure
POS tagging, chunking, semantic p-m, typing
Limited generalisation, some account taken of
structure, limited consideration of SL patterns

140
Summary: full parsing

Full parsing on its own, or parsing done in
combination with other techniques (chunking,
partial parsing, heuristics) to reduce ambiguity,
filter out implausible readings
GL theories not appropriate
Difficult to specialise for biotext
Many analyses per sentence
Missing information due to sublanguage meaning
