Title: Natural Language Processing in the biomedical domain
1Natural Language Processing in the biomedical domain
- SBI Course WS 2005/2006
- Thomas Karopka
- 19.01.2006
2Outline
- Motivation
- Introduction to Natural Language Processing
- Named Entity Recognition (NER)
- Information Extraction (IE)
- GATE-General Architecture for Text Engineering
- Some Tools, some applications....
- (Short introduction to GATE)
3Motivation
- MEDLINE currently contains over 16 million biomedical abstracts
- 50,000 new abstracts per month
- Huge amount of biomedical knowledge
- Problem: unstructured text is difficult to analyze automatically
- Reading 40,000 abstracts at 5 min each takes approx. 400 days (8 h a day)
- Solution: NLP and Information Extraction
4What is NLP?
- Definition 1
- Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems inherent in the processing and manipulation of natural language, but not, generally, natural language understanding.
- Definition 2
- A study of how to use computers to do things with human languages.
- Synonyms: Language Engineering, Human Language Technology
5Publications in MEDLINE
6Main fields of NLP
- Text to speech
- Speech recognition
- Natural language generation
- Machine translation
- Question answering
- Information retrieval
- Information extraction
- Named entity recognition
- Text classification
- Translation technology
- Text Summaries
7Why is NLP so hard?
- Ambiguity
- Context
- Acronyms
- Semantics
8Ambiguity
- "Time flies like an arrow; fruit flies like a banana." (Groucho Marx)
9Global vs. Local ambiguity
- Local ambiguity means that part of a sentence can have more than one interpretation, but not the whole sentence.
- Global ambiguity means that the whole sentence can have more than one interpretation.
10Global vs. Local ambiguity cont.
- Local ambiguity
- "The old train..." can continue as "...the young." or as "...left the station."
- Here syntax can tell us that TRAIN must be a verb in sentence 1 ("The old train the young.").
- Global ambiguity
- "I saw the Grand Canyon flying to New York"
- "I saw a Boeing 747 flying to New York"
- Here we know the meaning of the two sentences because we know what can and cannot fly.
11Types of Ambiguity
- Categorical ambiguity
- Noun: "Time is money"
- Verb: "Time me on the last lap"
- Adjective: "Time travel is not likely in my lifetime"
- Word sense ambiguity
- Electrical: "The battery was charged with jump leads"
- Legal: "Thief was charged by PC Smith"
- Responsibility: "The lecturer was charged with student recruitment"
12Types of Ambiguity cont.
- Structural ambiguity
- "You can have peas and beans or carrots with the set meal"
- Referential ambiguity
- What can THEY refer to in "After THEY finished the exam the students and lecturers left"?
- Lecturers only?
- Students only?
- Both?
13Problems in NLP
Polysemy - one word carrying different meanings in different contexts (Glück 1993, 474). Example: beam ('ray of light' and 'timber beam')
Synonymy - the semantic relation that holds between two words that can (in a given context) express the same meaning. Examples: ship - vessel, buy - purchase
Semantics - the meaning of a word, phrase, clause, or sentence, as opposed to its syntactic construction. Example: "Baby swallows fly"
14Basic NLP Tasks
- Tokenization
- Split text into units called tokens (words, .,-)
- Sentence Splitting
- Detect sentence boundaries
- Part of Speech (POS) Tagging
- Apply parts of speech (verb, noun, adjective..)
- Parsing
- Work out parse trees
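The first two tasks above can be sketched with regular expressions. This is a minimal illustration only: the function names and both patterns are assumptions made for this example, and real biomedical text (abbreviations such as "e.g.", names containing slashes or brackets) needs more careful rules.

```python
import re

# Illustrative patterns (assumptions, not a standard):
# a token is a word, optionally with hyphenated parts, or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
# a sentence ends after . ! or ? when whitespace and a capital letter follow.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Detect sentence boundaries with a simple punctuation heuristic."""
    return [s for s in SENTENCE_END_RE.split(text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


text = "IL-2 activates T cells. The response is dose-dependent."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Keeping hyphenated words whole is a design choice that suits names such as IL-2; a different tokenizer might split them.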
15Basic NLP Tasks cont.
- Verb Phrase chunking
- Find verbal phrases
- Noun Phrase chunking
- Find noun phrases
- Acronym resolution
- Find long forms for acronyms
- Coreference resolution
- New York, ... the Big Apple
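Acronym resolution can be sketched for the common pattern "long form (SHORT)". The code below is a simplified heuristic that matches the characters of the short form from right to left against the preceding words, similar in spirit to the Schwartz and Hearst approach; it is not necessarily the method used in this course. The function names and the window size are assumptions.

```python
import re


def find_long_form(short, candidate):
    """Match the short form's characters right to left inside the candidate text.

    The first character of the short form must start a word of the long form.
    Returns the matched long form, or None if no match is found.
    """
    s_idx = len(short) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        c = short[s_idx].lower()
        if not c.isalnum():  # skip hyphens and other symbols in the short form
            s_idx -= 1
            continue
        while l_idx >= 0 and (
            candidate[l_idx].lower() != c
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def find_acronyms(text):
    """Return {short form: long form} for every 'long form (SHORT)' pattern found."""
    pairs = {}
    for match in re.finditer(r"\(([^()\s]{2,10})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        max_words = min(len(short) + 5, len(short) * 2)  # assumed search window
        long_form = find_long_form(short, " ".join(words[-max_words:]))
        if long_form:
            pairs[short] = long_form
    return pairs


pairs = find_acronyms("The epidermal growth factor receptor (EGFR) binds EGF.")
```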
16Basic NLP Tasks cont.
- Named Entity Recognition
- Find named entities
- .....
17What is NER?
- NER
- Named Entity Recognition
- Including two tasks
- Identification of proper names in text
- Classification of proper names in text
- Newswire Domain
- Person, Location, Organization
- Biomedical Domain
- Protein, DNA, RNA, Body Part, Cell Type, Lipid,
etc.
18NER in biomedical domain
- BioNER aims to recognize the following names
- First Priority
- Protein name, DNA name, RNA name
- Second Priority
- cell type, other organic compound, cell line,
lipid, multi-cell, virus, cell component, body
part, tissue, amino acid monomer, polynucleotide,
mono-cell, inorganic, peptide, nucleotide, atom,
other artificial source, carbohydrate, organic
19Example of NER - Biomedical
Protein/gene
Cell type
20Problems in BioNER
- Unknown words
- Long compound words
- Variations of expressions
- Nested NEs
21Unknown Words
- Words containing hyphen, digit, letter, Greek letter, Roman numeral.
- Alpha B1
- Adenylyl cyclase 76E
- Latent membrane protein 1
- 4-mycarosyl isovaleryl-CoA transferase
- oligodeoxyribonucleotide
- 18-deoxyaldosterone
- Abbreviation and Acronym
- IL, TECd, IFN, TPA
22Long Compound words
- interleukin 1 (IL-1)-responsive kinase
- interleukin 1-responsive kinase
- epidermal growth factor receptor
- SH2 domain containing tyrosine kinase Syk
- SH2 domain (GENIA example)
23Various expressions of the same NE
- Spelling variation
- N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine
- Word permutation
- beta-1 integrin, integrin beta-1
- Ambiguous expressions
- epidermal growth factor receptor, EGF receptor, EGFR
- c-jun, c-Jun, c jun
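Spelling variation and word permutation can be reduced by mapping each name to a normalized key before dictionary lookup. The sketch below is one simple way to do this; the function names are made up for the example, and abbreviations such as EGFR versus EGF receptor are not covered by it.

```python
import re


def spelling_key(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def permutation_key(name):
    """Order-independent key: the sorted lowercase tokens of the name."""
    return tuple(sorted(name.lower().split()))


variants = ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]
keys = {spelling_key(v) for v in variants}  # all variants collapse to one key
same_words = permutation_key("beta-1 integrin") == permutation_key("integrin beta-1")
```

Aggressive normalization can also merge names that are in fact different, so the keys are best used to propose candidates rather than to decide identity.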
24Various expressions: the name explains its function
- the Ras guanine nucleotide exchange factor Sos
- the Ras guanine nucleotide releasing protein Sos
- the Ras exchanger Sos
- the GDP-GTP exchange factor Sos
- Sos(mSos), a GDP/GTP exchange protein for Ras
25Various expressions: the name includes preposition and/or conjunction (ambiguity of dependencies)
- p85 alpha subunit of PI 3-kinase
- SH2 and SH3 domains of Src
- NF-AT1, AP-1, and NF-kB sites
- E2F1 and -3
- Residues 432, 435, 437, 438, and 440
26Nested Named Entity
- An NE embedded in another NE.
- IL-2: protein
- IL-2 gene: gene
- CBP/p300 associated factor: protein
- CBP/p300 associated factor binding promoter: DNA
27Gene Naming Conventions
- "Biologists would rather share their toothbrush than share a gene name" (Michael Ashburner) [1]
- [1] Pearson H. Biology's name game. Nature. 2001;411:631-632.
28Protein/Gene name recognition
For comic relief, don't miss the worst gene names page: http://tinman.vetmed.helsinki.fi/eng/drosophila.html
My favourite ones:
- drop dead (FBgn0000494)
- lost in space (FBgn0016996)
- ken and barbie (FBgn0011236)
Source: FlyBase, http://flybase.bio.indiana.edu/
31State-of-the-art Systems on NER: Two evaluation contests
- BioCreative 2004 (March)
- Critical Assessment of Information Extraction Systems in Biology
- Task 1: Entity extraction
- Target: genes (or proteins, where there is ambiguity)
- 10,000 sentences from Medline as training data, and 5,000 sentences as testing data
- BioNLP 2004 (August)
- GENIA Corpus as training data and 404 abstracts as testing data
- Target: 5 classes (protein, DNA, RNA, cell line and cell type)
- Both use exact match scoring.
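Exact match scoring counts a predicted entity as correct only if its boundaries and its class both agree with the gold annotation. A minimal sketch follows, assuming entities are represented as (start, end, class) tuples; this representation and the sample spans are invented for illustration and are not the official evaluation scripts.

```python
def exact_match_scores(gold, predicted):
    """Return (recall, precision, F-score) under exact match.

    An entity counts as a true positive only if start, end and class all agree.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score


# Invented toy annotations: (start token, end token, class)
gold = [(0, 2, "protein"), (5, 7, "DNA"), (9, 10, "cell_type")]
pred = [(0, 2, "protein"), (5, 6, "DNA"), (9, 10, "cell_type"), (12, 13, "protein")]
recall, precision, f_score = exact_match_scores(gold, pred)
```

The second prediction overlaps the gold DNA entity but has a different right boundary, so it counts as both a false positive and a miss. This is what makes exact match a strict measure.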
32BioNLP 2004 Datasets
Set                   # of abstracts  # of sentences      # of tokens
Training Set          2,000           20,546 (10.27/abs)  472,006 (236.00/abs) (22.97/sen)
Test Set Total        404             4,260 (10.54/abs)   96,780 (239.55/abs) (22.72/sen)
Test Set 1978-1989    104             991 (9.53/abs)      22,320 (214.62/abs) (22.52/sen)
Test Set 1990-1999    106             1,115 (10.52/abs)   25,080 (236.60/abs) (22.49/sen)
Test Set 2000-2001    130             1,452 (11.17/abs)   33,380 (256.77/abs) (22.99/sen)
Test Set S/1998-2001  204             2,254 (11.05/abs)   51,628 (253.08/abs) (22.91/sen)
33R/P/F (recall / precision / F-score)
System  1978-1989 set       1990-1999 set       2000-2001 set       S/1998-2001 set     Total
Zho04   75.3 / 69.5 / 72.3  77.1 / 69.2 / 72.9  75.6 / 71.3 / 73.8  75.8 / 69.5 / 72.5  76.0 / 69.4 / 72.6
Fin04   66.9 / 70.4 / 68.6  73.8 / 69.4 / 71.5  72.6 / 69.3 / 70.9  71.8 / 67.5 / 69.6  71.6 / 68.6 / 70.1
Set04   63.6 / 71.4 / 67.3  72.2 / 68.7 / 70.4  71.3 / 69.6 / 70.5  71.3 / 68.8 / 70.1  70.3 / 69.3 / 69.8
Son04   60.3 / 66.2 / 63.1  71.2 / 65.6 / 68.2  69.5 / 65.8 / 67.6  68.3 / 64.0 / 66.1  67.8 / 64.8 / 66.3
Zha04   63.2 / 60.4 / 61.8  72.5 / 62.6 / 67.2  69.1 / 60.2 / 64.7  69.2 / 60.3 / 64.4  69.1 / 61.0 / 64.8
Rös04   59.2 / 60.3 / 59.8  70.3 / 61.8 / 65.8  68.4 / 61.5 / 64.8  68.3 / 60.4 / 64.1  67.4 / 61.0 / 64.0
Par04   62.8 / 55.9 / 59.2  70.3 / 61.4 / 65.6  65.1 / 60.4 / 62.7  65.9 / 59.7 / 62.7  66.5 / 59.8 / 63.0
Lee04   42.5 / 42.0 / 42.2  52.5 / 49.1 / 50.8  53.8 / 50.9 / 52.3  52.3 / 48.1 / 50.1  50.8 / 47.6 / 49.1
BL      47.1 / 33.9 / 39.4  56.8 / 45.5 / 50.5  51.7 / 46.3 / 48.8  52.6 / 46.0 / 49.1  52.6 / 43.6 / 47.7
34Current Methods
- Machine Learning
- HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field)
- Hybrid methods
- Dictionary Based
- Approximate String matching algorithm
- Naming Rules
- Dynamic Programming
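Dictionary-based recognition with approximate string matching is commonly built on edit distance, which is computed by dynamic programming. The sketch below uses the standard Levenshtein distance; the lookup function, its threshold and the two-entry dictionary are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lookup(term, dictionary, max_distance=1):
    """Return dictionary entries within max_distance edits of the term (case-insensitive)."""
    term = term.lower()
    return [entry for entry in dictionary
            if edit_distance(term, entry.lower()) <= max_distance]


dictionary = ["N-acetylcysteine", "integrin beta-1"]  # toy dictionary
hits = lookup("N-acetyl-cysteine", dictionary)
```

Comparing a term against every entry is slow for large dictionaries; practical systems usually narrow the candidates first, for example with an index over character n-grams.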
35Features for Machine Learning Methods
- Morphological Features
- Orthographical Features
- POS Features
- Genia POS tagger
- Semantic Trigger Features
- Head-noun Features
- NF-kappaB consensus site
- IL-2 gene
36Morphological Features
Prefix/Suffix  Example
cin            actinomycin
mide           Cycloheximide
zole           Sulphamethoxazole
lipid          phospholipids
rogen          estrogen
vitamin        dihydroxyvitamin
blast          erythroblast
cyte           thymocyte
phil           eosinophil
phosph         phosphorylation
methyl         methyltransferase
immuno         immunomodulator
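Morphological features of this kind can be computed with simple prefix and suffix tests. The sketch below uses only affixes from the table whose position is unambiguous (it leaves out "lipid" and "vitamin"); the feature naming scheme is an assumption.

```python
# Affixes taken from the table above; the split into prefixes and suffixes
# follows the example words.
SUFFIXES = ("cin", "mide", "zole", "rogen", "blast", "cyte", "phil")
PREFIXES = ("phosph", "methyl", "immuno")


def affix_features(token):
    """Return the list of prefix/suffix features that fire for a token."""
    word = token.lower()
    features = ["prefix=" + p for p in PREFIXES if word.startswith(p)]
    features += ["suffix=" + s for s in SUFFIXES if word.endswith(s)]
    return features


features = affix_features("Cycloheximide")
```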
37Orthographical Features
Orthographical Feature  Example
AllCaps                 EBNA, NFAT
AlphaDigit              p50, p65
AlphaDigitAlpha         IL23R, E1A
ATGCSequence            CCGCCC
CapLowAlpha             Src, Ras, Epo
CapMixAlpha             NFkappaB
CapsAndDigits           IL2, STAT4, SH2
DigitAlpha              2xNFkappaB
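The orthographical classes can be approximated with an ordered list of regular expressions, where the first matching pattern wins. The patterns and their order below are assumptions chosen so that the examples in the table fall into the listed classes; the slides do not give exact definitions.

```python
import re

# Ordered (feature, pattern) pairs; more specific classes come first.
# All patterns are illustrative assumptions.
ORTHOGRAPHIC_PATTERNS = [
    ("ATGCSequence", r"[ACGT]{4,}"),
    ("AllCaps", r"[A-Z]+"),
    ("AlphaDigitAlpha", r"[A-Za-z]+[0-9]+[A-Za-z]+"),
    ("CapsAndDigits", r"[A-Z]+[0-9]+"),
    ("AlphaDigit", r"[A-Za-z]+[0-9]+"),
    ("DigitAlpha", r"[0-9]+[A-Za-z]+"),
    ("CapLowAlpha", r"[A-Z][a-z]+"),
    ("CapMixAlpha", r"[A-Z][A-Za-z]+"),
]


def orthographic_feature(token):
    """Return the first orthographical class whose pattern matches the whole token."""
    for name, pattern in ORTHOGRAPHIC_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    return "Other"


labels = {t: orthographic_feature(t) for t in ["EBNA", "p50", "IL23R", "CCGCCC"]}
```

The order matters: CCGCCC would also match AllCaps, and IL2 would also match AlphaDigit, so the more specific class is tested first.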
38Head Nouns
Unigram: factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, Nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin
Bigram: NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain
39Excursus Head Noun, Noun phrase
A noun is usually embedded in a noun phrase (NP),
a syntactic unit of the sentence in which
information about the noun is gathered. The
noun is the head of the noun phrase, the central
constituent that determines the syntactic
character of the phrase.
40Excursus Head Noun, Noun phrase cont.
- A noun phrase normally consists of
- An optional determiner
- Zero or more adjective phrases
- A head noun
- Optional post-modifier (prepositional phrase or
clausal modifier)
Example: "The homeless old man in the park that I tried to help yesterday"
- human umbilical vein endothelial cells
- lipopolysaccharide-stimulated human saphenous vein endothelial cells
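For simple noun phrases like those above, the head noun can be approximated without a parser: take the last token before the first preposition, or the last token if there is none. This is a rough heuristic, not a syntactic analysis; the preposition list and the function name are assumptions.

```python
# Small, incomplete preposition list (assumption for this sketch).
PREPOSITIONS = {"of", "in", "for", "with", "on", "by", "to", "from", "at"}


def head_noun(noun_phrase):
    """Guess the head noun: the token before the first preposition, else the last token."""
    tokens = noun_phrase.split()
    for i, token in enumerate(tokens):
        if i > 0 and token.lower() in PREPOSITIONS:
            return tokens[i - 1]
    return tokens[-1] if tokens else None


head = head_noun("human umbilical vein endothelial cells")
```

The heuristic fails when a clausal modifier follows the head without a preposition, which is one reason real systems rely on a chunker or parser.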
41Zhou et al. approach
- HMM + SVM
- Post-processing
- Rule-based, used to resolve nested named entities.
- Top 1 in the NLPBA task, F = 72.5
42Manning et al. method
- Machine learning
- ME (Maximum Entropy) Markov model
- Local features
- External resources and larger context
- Post-processing
- To correct gene boundaries (mainly for BioCreative Task)
- Top 1 in BioCreative, F = 83.2
- Top 2 in NLPBA Task, F = 70.1
43What is Information Extraction (IE)?
IE systems analyse unstructured text, extract predefined named entities and store these entities in a structured form.
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
Source: Marti Hearst, What is text mining? http://www.sims.berkeley.edu/hearst/text-mining.html
44Targets of Information Extraction
- Protein-Protein interaction/binding/inhibition
- Protein-Small Molecules
- Gene-Gene regulation
- Gene-Gene Product interaction
- Gene-Drug relation
- Protein-Subcellular location
- Amino Acid-Protein relation
- Example relationships between gene and drugs
- The gene is the drug target
- The gene confers resistance to the drug
- The gene metabolizes the drug
45Information Extraction Tasks
Identify Target Named Entities
Identify Relations among Named Entities
Identify Relations among Events and Named Entities
Associate Results with existing database records
46IE-Systems
Rule-based systems: using rules for the extraction
Machine learning: Support Vector Machine (SVM), Maximum Entropy (ME), Memory Based Learning (MBL), Inductive Logic Programming (ILP), Artificial Neural Networks (ANNs)
Hybrid systems: combining the two approaches
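A rule-based system can be as small as one pattern: a known gene name, a relation verb, and another known gene name. The sketch below stores each hit in a structured record. The gene list, the verb list, the record fields and the example sentence are all invented for illustration, and the code is plain Python rather than a GATE/JAPE grammar.

```python
import re

GENES = {"Sos", "Ras", "Syk", "Src"}  # toy gazetteer (names taken from earlier slides)
RELATION_VERBS = {"activates": "activation",
                  "inhibits": "inhibition",
                  "binds": "binding"}


def extract_relations(sentence):
    """Rule: nearest gene left of a relation verb is the agent, nearest gene right is the target."""
    tokens = re.findall(r"[\w-]+", sentence)
    relations = []
    for i, token in enumerate(tokens):
        if token not in RELATION_VERBS:
            continue
        agent = next((t for t in reversed(tokens[:i]) if t in GENES), None)
        target = next((t for t in tokens[i + 1:] if t in GENES), None)
        if agent and target:
            relations.append({"agent": agent,
                              "relation": RELATION_VERBS[token],
                              "target": target})
    return relations


relations = extract_relations("Sos activates Ras at the membrane.")
```

A rule this simple ignores negation and passive voice ("Ras is activated by Sos"), which is where hand-written rule sets grow quickly.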
47GATE General Architecture for Text Engineering
GATE is an architecture, a framework and a
development environment for LE (Language
Engineering) (Cunningham, 2002)
48Extractor
Tokenizer
Sentence Splitter
POS Tagger
Gene Gazetteer
Gene-relation transducer
Acronym Resolution
NP-Chunking
GATE standard components
Newly developed modules
external modules
49JAPE
50Examples
Pattern
Example
51Gene-gene relation
52Evaluation
Estimate based on 100 manually checked abstracts
Recall (REC): ?
Precision (PRE): 83
A standard for evaluation is necessary: BioCreAtive? GENIA?
53Collection of Documents
More complicated due to partially filled
templates
54Recall vs. Precision
- High recall
- You get all the right answers, but garbage too.
- Good when incorrect results are not problematic.
- More common from automatic systems.
- High precision
- When all returned answers must be correct.
- Good when missing results are not problematic.
- More common from hand-built systems.
- In general in these things, one can trade one for the other
- But it's harder to score well on both
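The trade-off can be made concrete with a system that attaches a confidence score to each answer: a low threshold returns more answers (higher recall), a high threshold returns fewer but safer ones (higher precision). The scores and names below are invented toy data, and the function name is an assumption.

```python
def precision_recall_at(threshold, scored_predictions, gold):
    """Keep predictions scoring at least `threshold`; return (precision, recall)."""
    returned = {item for item, score in scored_predictions if score >= threshold}
    true_positives = len(returned & gold)
    # Precision is undefined when nothing is returned; reported as 0.0 here.
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {"IL-2", "Syk", "EGFR"}
scored = [("IL-2", 0.9), ("Syk", 0.7), ("EGFR", 0.4), ("protein", 0.5), ("cell", 0.2)]

low_threshold = precision_recall_at(0.2, scored, gold)   # everything is returned
high_threshold = precision_recall_at(0.8, scored, gold)  # only the most confident answer
```

With the low threshold all gold names are found but two wrong answers come along; with the high threshold the single answer is correct but two gold names are missed.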
55Penn Treebank Tagset
- CC Coordinating conjunction
- CD Cardinal number
- DT Determiner
- EX Existential there
- FW Foreign word
- IN Preposition or subordinating conjunction
- JJ Adjective
- JJR Adjective, comparative
- JJS Adjective, superlative
- LS List item marker
- MD Modal
- NN Noun, singular or mass
- NNS Noun, plural
- NP Proper noun, singular
- NPS Proper noun, plural
- PDT Predeterminer
- POS Possessive ending
- PP Personal pronoun
- PP$ Possessive pronoun
- RB Adverb
- RBR Adverb, comparative
- RBS Adverb, superlative
- RP Particle
- SYM Symbol
- TO to
- UH Interjection
- VB Verb, base form
- VBD Verb, past tense
- VBG Verb, gerund or present participle
- VBN Verb, past participle
- VBP Verb, non-3rd person singular present
56Tools for NLP in the biomedical domain