Natural Language Processing in the biomedical domain

1
Natural Language Processing in the biomedical
domain
  • SBI Course WS 2005/2006
  • Thomas Karopka
  • 19.01.2006

2
Outline
  • Motivation
  • Introduction to Natural Language Processing
  • Named Entity Recognition (NER)
  • Information Extraction (IE)
  • GATE-General Architecture for Text Engineering
  • Some Tools, some applications....
  • (Short introduction to GATE)

3
Motivation
  • MEDLINE currently contains over 16 million
    biomedical abstracts
  • 50,000 new abstracts per month
  • Huge amount of biomedical knowledge
  • Problem: unstructured text is difficult to
    analyze automatically

40,000 abstracts at 5 min each: approx. 400 days
(8 h a day)
Solution: NLP / Information Extraction
4
What is NLP?
  • Definition 1
  • Natural Language Processing (NLP) is a subfield
    of artificial intelligence and linguistics. It
    studies the problems inherent in the processing
    and manipulation of natural language, but not,
    generally, natural language understanding.
  • Definition 2
  • A study of how to use computers to do things with
    human languages.
  • Synonyms: Language Engineering, Human Language
    Technology

5
Publications in MEDLINE
6
Main fields of NLP
  • Text to speech
  • Speech recognition
  • Natural language generation
  • Machine translation
  • Question answering
  • Information retrieval
  • Information extraction
  • Named entity recognition
  • Text classification
  • Translation technology
  • Text summarization

7
Why is NLP so hard?
  • Ambiguity
  • Context
  • Acronyms
  • Semantics

8
Ambiguity
  • "Time flies like an arrow; fruit flies like a
    banana."
  • (Groucho Marx)

9
Global vs. Local ambiguity
  • Local ambiguity means that part of a sentence can
    have more than one interpretation, but not the
    whole sentence.
  • Global ambiguity means that the whole sentence
    can have more than one interpretation.

10
Global vs. Local ambiguity cont.
  • Local ambiguity
  • "The old train ... the young."
  • "The old train ... left the station."
  • Here syntax can tell us that TRAIN must be a
    verb in the first sentence.
  • Global ambiguity
  • "I saw the Grand Canyon flying to New York" vs.
    "I saw a Boeing 747 flying to New York"
  • Here we know the meaning of the two sentences
    because we know what can and cannot fly.

11
Types of Ambiguity
  • Categorical ambiguity
  • Noun: "Time is money"
  • Verb: "Time me on the last lap"
  • Adjective: "Time travel is not likely in my
    lifetime"
  • Word sense ambiguity
  • Electrical: "The battery was charged with jump
    leads"
  • Legal: "The thief was charged by PC Smith"
  • Responsibility: "The lecturer was charged with
    student recruitment"

12
Types of Ambiguity cont.
  • Structural ambiguity
  • "You can have peas and beans or carrots with the
    set meal"
  • Referential ambiguity
  • What can THEY refer to in "After THEY finished
    the exam, the students and lecturers left"?
  • Lecturers only?
  • Students only?
  • Both?

13
Problems in NLP
Polysemy: one word carrying different meanings in
different contexts (Glück 1993, 474), e.g. beam
('Lichtstrahl', a ray of light, vs. 'Balken', a
wooden beam).
Synonymy: the semantic relation that holds between
two words that can (in a given context) express
the same meaning, e.g. ship - vessel, buy -
purchase.
Semantics: the meaning of a word, phrase, clause,
or sentence, as opposed to its syntactic
construction, e.g. "Baby swallows fly" (young
birds fly, or an infant swallows an insect).
14
Basic NLP Tasks
  • Tokenization
  • Split text into units called tokens (words,
    punctuation marks, ...)
  • Sentence Splitting
  • Detect sentence boundaries
  • Part of Speech (POS) Tagging
  • Assign parts of speech (verb, noun, adjective, ...)
  • Parsing
  • Build parse trees
  • (a sketch of the first three tasks follows below)

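As a quick illustration of the first three tasks, here is a minimal sketch using the NLTK library (assuming NLTK is installed and its tokenizer and tagger models have been downloaded; any comparable toolkit would do):

```python
# Minimal sketch: sentence splitting, tokenization, POS tagging with NLTK.
# Assumes `pip install nltk` plus nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger") have been run.
import nltk

text = "IL-2 activates T cells. The receptor binds its ligand."

for sent in nltk.sent_tokenize(text):      # sentence splitting
    tokens = nltk.word_tokenize(sent)      # tokenization
    print(nltk.pos_tag(tokens))            # POS tagging
# e.g. [('IL-2', 'NNP'), ('activates', 'VBZ'), ('T', 'NNP'), ...]
```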
15
Basic NLP Tasks cont.
  • Verb Phrase chunking
  • Find verb phrases
  • Noun Phrase chunking
  • Find noun phrases
  • Acronym resolution
  • Find long forms for acronyms (sketch below)
  • Coreference resolution
  • New York, ... the Big Apple

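A hedged sketch of acronym resolution, loosely following the Schwartz and Hearst (2003) idea of matching a parenthesised short form against the initials of the preceding words; the function name and the one-word-per-letter rule are illustrative simplifications, not a full implementation:

```python
import re

def find_acronyms(text):
    """Toy acronym resolver: for each parenthesised short form, check
    whether the initials of the preceding words spell it out."""
    pairs = {}
    for m in re.finditer(r"\(([A-Z][A-Za-z0-9-]{1,9})\)", text):
        short = m.group(1)
        words = text[:m.start()].split()
        candidate = words[-len(short):]          # one word per letter
        initials = "".join(w[0].lower() for w in candidate)
        if initials == short.lower():
            pairs[short] = " ".join(candidate)
    return pairs

print(find_acronyms("The epidermal growth factor receptor (EGFR) binds EGF."))
# {'EGFR': 'epidermal growth factor receptor'}
```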
16
Basic NLP Tasks cont.
  • Named Entity Recognition
  • Find named entities
  • .....

17
What is NER?
  • NER
  • Named Entity Recognition
  • It comprises two tasks
  • Identification of proper names in text
  • Classification of proper names in text
  • Newswire Domain
  • Person, Location, Organization
  • Biomedical Domain
  • Protein, DNA, RNA, Body Part, Cell Type, Lipid,
    etc.

18
NER in biomedical domain
  • BioNER aims to recognize the following names
  • First Priority
  • Protein name, DNA name, RNA name
  • Second Priority
  • cell type, other organic compound, cell line,
    lipid, multi-cell, virus, cell component, body
    part, tissue, amino acid monomer, polynucleotide,
    mono-cell, inorganic, peptide, nucleotide, atom,
    other artificial source, carbohydrate, organic

19
Example of NER - Biomedical
(annotated example abstract, with protein/gene and
cell type names highlighted)
20
Problems in BioNER
  • Unknown words
  • Long compound words
  • Variations of expressions
  • Nested NEs

21
Unknown Words
  • Words containing hyphens, digits, Greek letters,
    or Roman numerals
  • Alpha B1
  • Adenylyl cyclase 76E
  • Latent membrane protein 1
  • 4-mycarosyl isovaleryl-CoA transferase
  • oligodeoxyribonucleotide
  • 18-deoxyaldosterone
  • Abbreviation and Acronym
  • IL, TECd, IFN, TPA

22
Long Compound words
  • interleukin 1 (IL-1)-responsive kinase
  • interleukin 1-responsive kinase
  • epidermal growth factor receptor
  • SH2 domain containing tyrosine kinase Syk
  • SH2 domain (GENIA example)

23
Various expressions of the same NE
  • Spelling variation
  • N-acetylcysteine, N-acetyl-cysteine,
    NAcetylCysteine
  • Word permutation
  • beta-1 integrin, integrin beta-1
  • Ambiguous expressions
  • epidermal growth factor receptor, EGF receptor,
    EGFR
  • c-jun, c-Jun, c jun
  • (a normalization sketch follows below)

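A minimal sketch of how such variants can be collapsed, assuming simple separator-stripping and word-order-insensitive keys suffice for the cases above; `normalize` and `permutation_key` are illustrative names, and real systems use richer approximate matching:

```python
import re

def normalize(name):
    """Collapse spelling variants by lower-casing and removing
    hyphens, spaces, underscores and slashes."""
    return re.sub(r"[\s\-_/]", "", name).lower()

def permutation_key(name):
    """Order-insensitive key for word permutations
    (e.g. 'beta-1 integrin' vs. 'integrin beta-1')."""
    return " ".join(sorted(name.lower().split()))

print({normalize(v) for v in
       ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]})
# {'nacetylcysteine'}  -- all three variants map to one form
print(permutation_key("beta-1 integrin") == permutation_key("integrin beta-1"))
# True
```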
24
Various expressions: the name explains its
function
  • the Ras guanine nucleotide exchange factor Sos
  • the Ras guanine nucleotide releasing protein Sos
  • the Ras exchanger Sos
  • the GDP-GTP exchange factor Sos
  • Sos(mSos), a GDP/GTP exchange protein for Ras

25
Various expressions: the name includes a
preposition and/or conjunction (ambiguity of
dependencies)
  • p85 alpha subunit of PI 3-kinase
  • SH2 and SH3 domains of Src
  • NF-AT1 , AP-1 , and NF-kB sites
  • E2F1 and -3
  • Residues 432, 435, 437, 438, and 440

26
Nested Named Entity
  • An NE embedded in another NE
  • IL-2 (protein)
  • IL-2 gene (gene)
  • CBP/p300 associated factor (protein)
  • CBP/p300 associated factor binding promoter (DNA)
  • (one handling policy is sketched below)

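One simple policy, sketched below under the assumption that candidate entities arrive as character spans, is to keep only the outermost of two nested annotations (other systems keep both levels):

```python
def outermost(entities):
    """Drop any (start, end, label) span nested inside a longer one."""
    keep = []
    for s, e, label in sorted(entities, key=lambda x: (x[0], -x[1])):
        if not any(ks <= s and e <= ke for ks, ke, _ in keep):
            keep.append((s, e, label))
    return keep

# "IL-2 gene": the protein "IL-2" (chars 0-4) is nested
# inside the DNA entity "IL-2 gene" (chars 0-9).
print(outermost([(0, 4, "protein"), (0, 9, "DNA")]))
# [(0, 9, 'DNA')]
```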
27
Gene Naming Conventions
  • "Biologists would rather share their toothbrush
    than share a gene name Michael Ashburner 1

1 Pearson H. Biology's name game. Nature.
2001411631632.
28
Protein/Gene name recognition
For comic relief, don't miss the "worst gene
names" page:
http://tinman.vetmed.helsinki.fi/eng/drosophila.html
My favourite ones: drop dead (FBgn0000494),
lost in space (FBgn0016996),
ken and barbie (FBgn0011236)
Source: FlyBase, http://flybase.bio.indiana.edu/
31
State-of-the-art systems on NER: two evaluation
contests
  • BioCreative 2004 (March)
  • Critical Assessment of Information Extraction
    Systems in Biology
  • Task 1: entity extraction
  • Target: genes (or proteins, where there is
    ambiguity)
  • 10,000 sentences from MEDLINE as training data
    and 5,000 sentences as testing data
  • BioNLP 2004 (August)
  • GENIA corpus as training data and 404 abstracts
    as testing data
  • Target: 5 classes, including protein, DNA, RNA,
    cell line and cell type
  • Both use exact-match scoring.

32
BioNLP 2004 Datasets
Data set              #abstracts  #sentences          #tokens
Training set          2,000       20,546 (10.27/abs)  472,006 (236.00/abs, 22.97/sen)
Test set (total)      404         4,260 (10.54/abs)   96,780 (239.55/abs, 22.72/sen)
Test set 1978-1989    104         991 (9.53/abs)      22,320 (214.62/abs, 22.52/sen)
Test set 1990-1999    106         1,115 (10.52/abs)   25,080 (236.60/abs, 22.49/sen)
Test set 2000-2001    130         1,452 (11.17/abs)   33,380 (256.77/abs, 22.99/sen)
Test set S/1998-2001  204         2,254 (11.05/abs)   51,628 (253.08/abs, 22.91/sen)
33
Results per system (R/P/F = Recall/Precision/F-score):
System   1978-1989 set   1990-1999 set   2000-2001 set   S/1998-2001 set   Total
Zho04 75.3 / 69.5 / 72.3 77.1 / 69.2 / 72.9 75.6 / 71.3 / 73.8 75.8 / 69.5 / 72.5 76.0 / 69.4 / 72.6
Fin04 66.9 / 70.4 / 68.6 73.8 / 69.4 / 71.5 72.6 / 69.3 / 70.9 71.8 / 67.5 / 69.6 71.6 / 68.6 / 70.1
Set04 63.6 / 71.4 / 67.3 72.2 / 68.7 / 70.4 71.3 / 69.6 / 70.5 71.3 / 68.8 / 70.1 70.3 / 69.3 / 69.8
Son04 60.3 / 66.2 / 63.1 71.2 / 65.6 / 68.2 69.5 / 65.8 / 67.6 68.3 / 64.0 / 66.1 67.8 / 64.8 / 66.3
Zha04 63.2 / 60.4 / 61.8 72.5 / 62.6 / 67.2 69.1 / 60.2 / 64.7 69.2 / 60.3 / 64.4 69.1 / 61.0 / 64.8
Rös04 59.2 / 60.3 / 59.8 70.3 / 61.8 / 65.8 68.4 / 61.5 / 64.8 68.3 / 60.4 / 64.1 67.4 / 61.0 / 64.0
Par04 62.8 / 55.9 / 59.2 70.3 / 61.4 / 65.6 65.1 / 60.4 / 62.7 65.9 / 59.7 / 62.7 66.5 / 59.8 / 63.0
Lee04 42.5 / 42.0 / 42.2 52.5 / 49.1 / 50.8 53.8 / 50.9 / 52.3 52.3 / 48.1 / 50.1 50.8 / 47.6 / 49.1
BL 47.1 / 33.9 / 39.4 56.8 / 45.5 / 50.5 51.7 / 46.3 / 48.8 52.6 / 46.0 / 49.1 52.6 / 43.6 / 47.7
34
Current Methods
  • Machine learning
  • HMM, SVM, ME (Maximum Entropy), CRF (Conditional
    Random Fields)
  • Hybrid methods
  • Dictionary-based
  • Approximate string-matching algorithms
  • Naming rules
  • Dynamic programming (see the edit-distance
    sketch below)

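As one concrete instance of approximate string matching by dynamic programming, here is a sketch of the Levenshtein edit distance (a standard algorithm; the source does not say which distance these systems actually use):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions turning string a into string b, computed row by
    row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("N-acetylcysteine", "N-acetyl-cysteine"))  # 1
```

A dictionary-based recognizer can then accept a candidate name whose distance to some dictionary entry falls below a small threshold.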
35
Features for Machine Learning Methods
  • Morphological Features
  • Orthographical Features
  • POS Features
  • GENIA POS tagger
  • Semantic Trigger Features
  • Head-noun Features
  • NF-kappaB consensus site
  • IL-2 gene

36
Morphological Features
Prefix/Suffix              Example
-cin, -mide, -zole         actinomycin, cycloheximide, sulphamethoxazole
-lipid, -rogen, -vitamin   phospholipids, estrogen, dihydroxyvitamin
-blast, -cyte, -phil       erythroblast, thymocyte, eosinophil
phosph-, methyl-, immuno-  phosphorylation, methyltransferase, immunomodulator
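A sketch of how such affixes become classifier features, assuming the common approach of extracting fixed-length character prefixes and suffixes (the function name is illustrative):

```python
def affix_features(token, n=3):
    """Character prefixes/suffixes of length 2..n as features,
    covering affixes like '-cin' or '-cyte' from the table above."""
    feats = {}
    for k in range(2, n + 1):
        feats[f"prefix_{k}"] = token[:k].lower()
        feats[f"suffix_{k}"] = token[-k:].lower()
    return feats

print(affix_features("actinomycin"))
# {'prefix_2': 'ac', 'suffix_2': 'in', 'prefix_3': 'act', 'suffix_3': 'cin'}
```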
37
Orthographical Features
Feature          Example
AllCaps          EBNA, NFAT
AlphaDigit       p50, p65
AlphaDigitAlpha  IL23R, E1A
ATGCSequence     CCGCCC
CapLowAlpha      Src, Ras, Epo
CapMixAlpha      NFkappaB
CapsAndDigits    IL2, STAT4, SH2
DigitAlpha       2xNFkappaB
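These word-shape classes can be implemented as regular expressions; the patterns below are illustrative guesses at their definitions (covering a subset of the table), not the exact rules of any published system:

```python
import re

SHAPES = [  # order matters: more specific classes first
    ("ATGCSequence",    re.compile(r"^[ACGT]+$")),
    ("AllCaps",         re.compile(r"^[A-Z]+$")),
    ("CapsAndDigits",   re.compile(r"^[A-Z]+\d+$")),
    ("AlphaDigitAlpha", re.compile(r"^[A-Za-z]+\d+[A-Za-z]+$")),
    ("AlphaDigit",      re.compile(r"^[A-Za-z]+\d+$")),
    ("DigitAlpha",      re.compile(r"^\d+[A-Za-z]+$")),
]

def shape(token):
    for name, pattern in SHAPES:
        if pattern.match(token):
            return name
    return "Other"

for t in ["EBNA", "p50", "IL23R", "STAT4", "2xNFkappaB", "CCGCCC"]:
    print(t, shape(t))
# EBNA AllCaps, p50 AlphaDigit, IL23R AlphaDigitAlpha,
# STAT4 CapsAndDigits, 2xNFkappaB DigitAlpha, CCGCCC ATGCSequence
```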
38
Head Nouns
Unigram: factor, protein, receptor, alpha, NF-kappaB, IL-2, cytokine, kinase, transcription, domain, complex, TNF-alpha, nuclear, p50, CD28, TNF, molecule, subunit, cell, STAT3, family, tumor, factor-alpha, expression, interleukin
Bigram: NF-kappa B, transcription factor, I kappa, nuclear factor, protein kinase, B alpha, kinase C, tumor necrosis, T cell, glucocorticoid receptor, binding protein, factor alpha, adhesion molecule, monoclonal antibody, gene product, binding domain
39
Excursus Head Noun, Noun phrase
A noun is usually embedded in a noun phrase (NP),
a syntactic unit of the sentence in which
information about the noun is gathered. The
noun is the head of the noun phrase, the central
constituent that determines the syntactic
character of the phrase.
40
Excursus Head Noun, Noun phrase cont.
  • A noun phrase normally consists of
  • An optional determiner
  • Zero or more adjective phrases
  • A head noun
  • An optional post-modifier (prepositional phrase
    or clausal modifier)

Example: "The homeless old man in the park that I
tried to help yesterday"
Biomedical examples: "human umbilical vein
endothelial cells", "lipopolysaccharide-stimulated
human saphenous vein endothelial cells"
(a chunking sketch follows below)
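A sketch of NP chunking plus a head-noun heuristic (taking the last noun of each chunk as its head), using NLTK's RegexpParser under the same installation assumptions as the earlier sketch; tagger output on biomedical text is noisy, so the chunks are illustrative only:

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # determiner, adjectives, nouns
chunker = nltk.RegexpParser(grammar)

tokens = nltk.pos_tag(nltk.word_tokenize(
    "Human umbilical vein endothelial cells express the receptor."))
tree = chunker.parse(tokens)
for np in tree.subtrees(filter=lambda t: t.label() == "NP"):
    words = [w for w, tag in np.leaves()]
    nouns = [w for w, tag in np.leaves() if tag.startswith("NN")]
    print("NP:", " ".join(words), "| head:", nouns[-1])
```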
41
Zhou et al. approach
  • HMM + SVM
  • Post-processing
  • Rule-based, used to resolve nested named entities
  • Top 1 in the NLPBA task, F = 72.5

42
Manning et al. method
  • Machine learning
  • ME (maximum entropy) Markov model
  • Local features
  • External resources and larger context
  • Post-processing
  • To correct gene boundaries (mainly for the
    BioCreative task)
  • Top 1 in BioCreative, F = 83.2
  • Top 2 in the NLPBA task, F = 70.1

43
What is Information Extraction (IE)?
IE systems analyse unstructured text, extract
predefined named entities and store these
entities in a structured form.
"Text mining is the discovery by computer of new,
previously unknown information, by automatically
extracting information from different written
resources. A key element is the linking together
of the extracted information to form new facts or
new hypotheses to be explored further by more
conventional means of experimentation."
Source: Marti Hearst, What is text mining?
http://www.sims.berkeley.edu/~hearst/text-mining.html
44
Targets of Information Extraction
  • Protein-protein interaction/binding/inhibition
  • Protein-small molecule
  • Gene-gene regulation
  • Gene-gene product interaction
  • Gene-drug relation
  • Protein-subcellular location
  • Amino acid-protein relation
  • Example: relationships between genes and drugs
  • The gene is the drug target
  • The gene confers resistance to the drug
  • The gene metabolizes the drug
  • (a pattern-based sketch follows below)

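A naive pattern-based sketch of relation extraction over already-recognized entities; the toy entity dictionary and verb list are assumptions for illustration (real systems use transducers over NER output, parsing, or machine-learning classifiers):

```python
import re

ENTITY = r"(?:Sos|Ras|Grb2|EGFR)"          # toy dictionary of names
PATTERN = re.compile(
    rf"({ENTITY})\s+(interacts with|binds|inhibits|activates)\s+({ENTITY})")

sentence = "Grb2 binds Sos, and Sos activates Ras."
for a, verb, b in PATTERN.findall(sentence):
    print((a, verb, b))
# ('Grb2', 'binds', 'Sos')
# ('Sos', 'activates', 'Ras')
```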
45
Information Extraction Tasks
Identify Target Named Entities
Identify Relations among Named Entities
Identify Relations among Events and Named Entities
Associate Results with existing database records
46
IE-Systems
Rule-based systems: use rules for extraction
Machine learning: Support Vector Machines (SVM),
Maximum Entropy (ME), Memory-Based Learning (MBL),
Inductive Logic Programming (ILP), Artificial
Neural Networks (ANNs)
Hybrid systems: combine the two approaches
47
GATE - General Architecture for Text Engineering
"GATE is an architecture, a framework and a
development environment for LE (Language
Engineering)." (Cunningham, 2002)
48
Extractor
(Pipeline diagram: the Extractor chains Tokenizer,
Sentence Splitter, POS Tagger, Gene Gazetteer,
Gene-relation transducer, Acronym Resolution and
NP chunking; the legend distinguishes GATE
standard components, newly developed modules and
external modules.)
49
JAPE (Java Annotation Patterns Engine)
50
Examples
(table of JAPE patterns and example matches; image
not transcribed)
51
Gene-gene relation
52
Evaluation
Estimation based on 100 manually checked abstracts
Recall: ?
Precision: 83
A standard for evaluation is necessary: BioCreAtIvE?
GENIA?
53
Collection of Documents
More complicated due to partially filled
templates
54
Recall vs. Precision
  • High recall
  • You get all the right answers, but garbage too.
  • Good when incorrect results are not problematic.
  • More common from automatic systems.
  • High precision
  • When all returned answers must be correct.
  • Good when missing results are not problematic.
  • More common from hand-built systems.
  • In general, one can trade one for the other,
  • but it's harder to score well on both
    (see the sketch below).

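A short sketch of the exact-match evaluation behind these numbers, assuming predicted and gold entities are given as sets of (start, end, label) spans (the spans below are made-up toy data):

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scoring over sets of (start, end, label) spans."""
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 4, "protein"), (10, 19, "DNA"), (25, 30, "cell_type")}
pred = {(0, 4, "protein"), (10, 19, "RNA")}          # one wrong label
print(precision_recall_f1(pred, gold))
# (0.5, 0.3333..., 0.4)
```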
55
Penn Treebank Tagset
  1. CC Coordinating conjunction
  2. CD Cardinal number
  3. DT Determiner
  4. EX Existential there
  5. FW Foreign word
  6. IN Preposition or subordinating conjunction
  7. JJ Adjective
  8. JJR Adjective, comparative
  9. JJS Adjective, superlative
  10. LS List item marker
  11. MD Modal
  12. NN Noun, singular or mass
  13. NNS Noun, plural
  14. NP Proper noun, singular
  15. NPS Proper noun, plural
  16. PDT Predeterminer

  17. POS Possessive ending
  18. PP Personal pronoun
  19. PP$ Possessive pronoun
  20. RB Adverb
  21. RBR Adverb, comparative
  22. RBS Adverb, superlative
  23. RP Particle
  24. SYM Symbol
  25. TO to
  26. UH Interjection
  27. VB Verb, base form
  28. VBD Verb, past tense
  29. VBG Verb, gerund or present participle
  30. VBN Verb, past participle
  31. VBP Verb, non-3rd person singular present
  32. VBZ Verb, 3rd person singular present
  33. WDT Wh-determiner
  34. WP Wh-pronoun
  35. WP$ Possessive wh-pronoun
  36. WRB Wh-adverb
56
Tools for NLP in the biomedical domain