Title: Textmining, Entity Identification, and Relationship Extraction
1Text-mining, Entity Identification, and
Relationship Extraction
- ILS Lecture 11th July
- Loïc Royer
2Outline
- Motivation
- Text-Mining
- Gene Mention Tagging
- Gene Mention Identification
- Relation Extraction
3Motivation
- Biomedical literature is growing at a tremendous pace
- PubMed indexes 16 million articles and grows every year by 600,000 articles
4A Knowledge Explosion
Most biological information is in the form of
text and sequences
5Four Solutions
- Manual curation
- Best precision
- Not scalable
- Text-Mining
- Quite scalable
- Difficult
- Authors formalize their results in a formal language
- Problems to convince people
- Formal language cannot be too complicated
- Wikipedia approach for Life Sciences
- Scalability already demonstrated
- Better Semantic Wiki engines are needed
6Text-Mining
- Text Classification
- Information Retrieval
- Entity Recognition
- Information Extraction
- Question Answering
- Text Summarization
7Outline
- Motivation
- Text-Mining
- Gene Mention Tagging
- Gene Mention Identification
- Relation Extraction
8Gene Mention Tagging
Objective: Locate gene/protein names in text
9Gene Mention Tagging
- The problem can be formulated as a classification of the tokens in the text:
- were reactive for CXCR3 and that
- of the interleukin-1 receptor gene in
- Tokens (words) are either
- part of a gene
- not part of a gene
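The token-classification view can be sketched as follows. This is only an illustration, assuming whitespace tokenization and a tiny hand-made lexicon (`GENE_TOKENS` is hypothetical); a real tagger would learn the decision from features instead of looking it up.

```python
# Hypothetical toy lexicon; a trained classifier would replace this lookup.
GENE_TOKENS = {"CXCR3", "interleukin-1", "receptor"}


def tag_tokens(sentence):
    """Label each whitespace-separated token as part of a gene name or not."""
    return [(token, token in GENE_TOKENS) for token in sentence.split()]


if __name__ == "__main__":
    for token, is_gene in tag_tokens("were reactive for CXCR3 and that"):
        print(token, "GENE" if is_gene else "O")
```

Each token receives exactly one of two labels, which is what turns gene mention tagging into a classification problem.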
10Gene Mention Tagging
- Machine Learning techniques used
- CRF (Conditional Random Fields)
- SVM (Support Vector Machines)
- N-gram (frequencies of word n-tuples)
- Max-Ent (Maximum Entropy)
- HMM (Hidden Markov Model)
- NLP Techniques used
- POS Tagger (Part of Speech)
- NP Chunker (Noun Phrase Chunker)
11Gene Mention Tagging
- Conditional Random Fields
Part of gene name, or not
Words (tokens)
Edges between random variables represent
conditional probabilities
What is the main assumption of such a model?
What is missing here?
12Gene Mention Tagging
- Features used
- Morphological Features
- CAPWORD → A-Za-z
- CAPSMIX → A-Z(A-za-za-zA-Z)A-z
- Character n-gram features
- "ase" in "decarboxylase"
- Part Of Speech Features
- Noun better than Verb
- Lexical features
- Dictionaries of known genes/proteins
- Dictionaries of non-gene/protein.
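The feature families listed above could be computed per token roughly as below. This is a sketch under assumptions: the two regular expressions are guesses at what CAPWORD and CAPSMIX stand for (the exact patterns on the slide are not recoverable), part-of-speech tags are assumed to follow the Penn Treebank convention where nouns start with "NN", and `token_features` and its arguments are names invented for this example.

```python
import re

# Assumed patterns; the slide's exact expressions are not recoverable.
CAPWORD = re.compile(r"^[A-Z][a-z]+$")
CAPSMIX = re.compile(r"^(?=.*[A-Z])(?=.*[a-z])[A-Za-z0-9]+$")


def char_ngrams(token, n=3):
    """Return all character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def token_features(token, pos_tag, gene_dict, non_gene_dict):
    """Build a feature dictionary for one token."""
    lower = token.lower()
    return {
        # Morphological features
        "capword": bool(CAPWORD.match(token)),
        "capsmix": bool(CAPSMIX.match(token)) and not CAPWORD.match(token),
        # Character n-gram features (e.g. the suffix "ase")
        "suffix3": lower[-3:],
        "ngrams": char_ngrams(lower),
        # Part-of-speech feature (assumes Penn Treebank style tags)
        "is_noun": pos_tag.startswith("NN"),
        # Lexical features from dictionaries of lowercased names
        "in_gene_dict": lower in gene_dict,
        "in_non_gene_dict": lower in non_gene_dict,
    }
```

A feature dictionary of this kind is what a CRF, SVM or maximum entropy classifier would consume for each token.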
13Gene Mention Tagging
- Things that make the tagging of gene/protein names difficult
- Delimitation of a gene name in text is subject to discussion activated
- Biological terms such as diseases or phenotypes are also used to name genes,
- Abbreviations defined in the text that accidentally resemble gene mentions, such as cell lines or domains.
14Gene Mention Tagging
- cheap date (id32999): Mutants are especially sensitive to alcohol. Interestingly, another name for the gene is amnesiac, as mutants also have a poor memory.
- ken and barbie (id37785): Both male and female mutants lack external genitalia, as do poor Ken and Barbie.
- icebox (id48456): Female icebox mutants do not care about courting males.
Superman, Hairy, Crooked Legs, Lava lamp, Dreadlocks, Clown, ...
15Gene Mention Tagging
- Results and State of the Art for human genes
- Recall: 85%
- Precision: 88%
- F-Measure: 87%
- Relevance: What biologists really need are the database identifiers of the genes.
16Outline
- Motivation
- Text-Mining
- Gene Mention Tagging
- Gene Mention Identification
- Relation Extraction
17Gene Mention Identification
- Objective: Associate with each document a list of gene identifiers from a reference database
18Gene Mention Identification
MAGUK
- False positives
- Text that does not mention proteins gets annotated
- The wrong EntrezGene identifier is chosen
- False negatives
- A protein is not found at all,
- The wrong EntrezGene identifier is chosen
One mistake that counts for two!
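Why a wrong identifier counts twice can be checked with a small evaluation sketch. It assumes that predictions and the gold standard are sets of identifiers for one document; the identifiers `ID_A`, `ID_B` and `ID_X` are made up for the example.

```python
def evaluate(predicted, gold):
    """Compute precision, recall and F-measure over sets of gene identifiers."""
    tp = len(predicted & gold)   # correct identifiers
    fp = len(predicted - gold)   # predicted but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f_measure": f_measure}


# Hypothetical identifiers for one document.
gold = {"ID_A", "ID_B"}
wrong_identifier = {"ID_A", "ID_X"}  # ID_B was mapped to the wrong entry
not_found = {"ID_A"}                 # ID_B was missed entirely

if __name__ == "__main__":
    print(evaluate(wrong_identifier, gold))
    print(evaluate(not_found, gold))
```

Choosing the wrong identifier produces one false positive and one false negative at once, so it lowers both precision and recall, whereas a missed protein only lowers recall.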
19Gene Mention Identification
- Recall Phase
- First, as many reasonable candidate genes as possible are obtained per document using
- Dictionaries: merging, filtering, classification,
- Known text to gene/protein associations,
- Syntactical variation and matching.
- Precision Phase
- Candidates are ranked according to
- Syntax / Semantics,
- Local / document wide contexts,
- Inter-annotation agreement.
20Gene Mention Identification
- Recall Oriented Techniques
- Preprocess text by interpreting intensive enumerations: freac1 to freac4 → freac1, freac2, freac3, and freac4
- Problem: eiF1-eiF3
- Is it eiF1 and eiF3, or eiF1, eiF2, and eiF3?
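A minimal sketch of such a preprocessing step is shown below. It assumes enumerations are written as a shared alphabetic prefix plus a number on both sides of the word "to"; the pattern and the function name `expand_enumerations` are illustrative choices. Hyphenated forms such as eiF1-eiF3 are deliberately left untouched, because their reading is ambiguous.

```python
import re

# Matches "<prefix><start> to <prefix><end>", e.g. "freac1 to freac4".
RANGE = re.compile(r"\b([A-Za-z]+)(\d+) to \1(\d+)\b")


def expand_enumerations(text):
    """Rewrite 'X1 to X4' as the explicit list 'X1, X2, X3, and X4'."""
    def expand(match):
        prefix = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if end <= start:
            return match.group(0)  # not a forward range, leave as written
        names = [f"{prefix}{i}" for i in range(start, end + 1)]
        if len(names) == 2:
            return " and ".join(names)
        return ", ".join(names[:-1]) + ", and " + names[-1]

    return RANGE.sub(expand, text)
```

After expansion, every member of the enumeration becomes a candidate that the dictionary lookup can find on its own.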
21Gene Mention Identification
- Recall Oriented Techniques
- Collect and merge gene/protein synonyms
dictionaries from different sources.
22Gene Mention Identification
- Recall Oriented Techniques
- Dictionary synonyms classification: divide and conquer strategy. Different types of gene synonyms require different identification strategies
- database identifiers (KIAA0958, HGNC17875),
- abbreviations (CD95L, Lin7c),
- single- or multi-word terms (tumor necrosis factor alpha)
- spurious synonyms (AA, ORF has no N-terminal Met, it may be non-functional).
23Gene Mention Identification
- Recall Oriented Techniques
- Generate variant synonyms using rules
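Rule-based variant generation might look like the sketch below. The three rules are illustrative assumptions inspired by variations such as GPIb-alpha / GPIbalpha, GAR1 protein / Gar1p and APOER2 / ApoER2; they are not the rule set actually used by the authors.

```python
import re


def generate_variants(synonym):
    """Return a set of surface variants for one dictionary synonym."""
    variants = {synonym}
    # Rule 1: hyphens may be dropped or written as spaces.
    variants.add(synonym.replace("-", ""))
    variants.add(synonym.replace("-", " "))
    # Rule 2: "GAR1 protein" may be written "Gar1p".
    match = re.fullmatch(r"([A-Za-z]+\d+) protein", synonym)
    if match:
        variants.add(match.group(1).capitalize() + "p")
    # Rule 3: add lowercase forms so that matching can ignore capitalisation.
    variants |= {variant.lower() for variant in variants}
    return variants
```

Each added rule raises recall at the cost of more candidate matches, which the precision phase then has to filter.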
24Gene Mention Identification
Vesicle Soluble Maleic acid N-ethylimide
Sensitive Fusion Protein Attachment Protein
Receptor
Hunter et al.
25Gene Mention Identification
- Recall Oriented Techniques
- Gather similar documents with known associations to genes/proteins, then transfer the associations
[Figure: document examined and similar documents; gene labels rab5, orc2, p20, eiF2, Rap55]
26Gene Mention Identification
- How to compute similarity between documents?
- Vector space model: a document is represented as a word vector.
- Cosine similarity
- TF-IDF
Why is there a logarithm in this formula?
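The vector space model with TF-IDF weights and cosine similarity can be sketched as follows. The slide's own formula is not reproduced in the text, so the weighting used here is an assumption: the raw term count multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term. Documents are assumed to be already tokenized.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this weighting, a term that occurs in every document gets weight log(1) = 0 and so does not contribute to the similarity.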
27Gene Mention Identification
- Zipf's law: In a corpus of natural language
utterances, the frequency of any word is roughly
inversely proportional to its rank in the
frequency table.
28Gene Mention Identification
- Ambiguity Problem
- In naming
- 1168 genes in EntrezGene named p60
- In species
- 949 species have a gene named p53
- Official names and symbols are not used.
29Gene Mention Identification
Yeast: smallest vocabulary, shortest names, least ambiguity
Mouse: largest vocabulary, longest names, less ambiguity than fly
Fly: large vocabulary, medium names, most ambiguity
Lynette Hirschman, Marc Colosimo, Alexander A. Morgan, Alexander S. Yeh. "Overview of BioCreAtIvE task 1B: Normalized Gene Lists," accepted by BMC Bioinformatics.
30Gene Mention Identification
- Precision Oriented Techniques
- Identify, with high confidence, regions of text that do not refer to genes/proteins
Genes: high confidence
Non-genes: high confidence
31Gene Mention Identification
- Precision Oriented Techniques
- Alignment-based syntactical similarities
- Levenshtein distance (edit distance)
- Needleman-Wunsch distance or Sellers algorithm
- Gap cost function, substitution matrix
- Smith-Waterman distance
- optimal subsequences
- Smith-Waterman-Gotoh distance
- Starting a gap is different from continuing a gap.
What is the weakness of alignment-based methods?
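As a concrete instance of the simplest alignment-based measure, the Levenshtein distance can be computed by dynamic programming. This is the textbook formulation with unit costs for insertion, deletion and substitution; the other measures listed above differ mainly in their gap costs and substitution matrix.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

A variant such as GPIbalpha is one edit away from GPIb-alpha, while a synonym whose tokens appear in a different order receives a large distance even though it names the same gene.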
32Gene Mention Identification
- Precision Oriented Techniques
- Other syntactical similarities
- Jaro distance metric between s1 and s2: d = (1/3) (m/a + m/b + (m - t)/m)
- m: number of matching characters
- a, b: lengths of s1, s2
- t: number of transpositions
Why is this not a distance but a similarity?
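A minimal sketch of the Jaro measure follows. It assumes the commonly cited definition: two characters match only if they are equal and no further apart than max(a, b) / 2 - 1 positions, and t is half the number of matched characters that appear in a different order in the two strings. The result is (m/a + m/b + (m - t)/m) / 3.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    a, b = len(s1), len(s2)
    if a == 0 or b == 0:
        return 0.0
    window = max(max(a, b) // 2 - 1, 0)
    matched1 = [False] * a
    matched2 = [False] * b
    m = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character within the matching window.
        for j in range(max(0, i - window), min(b, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matched characters in order of appearance in each string.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) / 2
    return (m / a + m / b + (m - t) / m) / 3
```

Identical strings score 1 and strings without matching characters score 0, so larger values mean more similar strings, which is the opposite orientation of a distance.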
33Gene Mention Identification
- Precision Oriented Techniques
- Bayesian Estimations
- Knowing the a priori use frequency of gene names.
Given a context and additional evidence: "ken and barbie" in biomedical text relating to the fly organism does refer to a gene and never to the toys...
Why is the marginal probability not needed in practice?
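The Bayesian estimation can be illustrated with a toy computation. All numbers below are invented for the example, as are the hypothesis names; only the structure (prior times likelihood, optionally divided by the marginal) is the point.

```python
def posterior_scores(priors, likelihoods):
    """Return unnormalized and normalized posteriors for each hypothesis."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # the marginal probability
    normalized = {h: score / evidence for h, score in unnormalized.items()}
    return unnormalized, normalized


# Hypothetical values for "ken and barbie" in a fly-related abstract.
priors = {"gene": 0.5, "toy": 0.5}
likelihoods = {"gene": 0.9, "toy": 0.001}

if __name__ == "__main__":
    unnormalized, normalized = posterior_scores(priors, likelihoods)
    print(max(unnormalized, key=unnormalized.get))
    print(max(normalized, key=normalized.get))
```

Dividing by the marginal rescales every hypothesis by the same factor, so the ranking of the hypotheses is the same with or without it.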
34Gene Mention Identification
- Evidence that influences posterior probabilities
35Gene Mention Identification
36Gene Mention Identification: typical recall problems
- Missing substitution rules
- GAR1 protein → Gar1p
- Wrong order of tokens
- IL-receptor, type II → type II IL-receptor
- Abbreviation inside long synonym
- ubiquitin (UBC4/5) → Ubc4
- Capitalisation
- APOER2 → ApoER2
37Gene Mention Identification: typical recall problems
- Missing syntactic variants
- GPIb-alpha → GPIbalpha
- Morphology
- UBC3B → Ubc3
- Token pollution
- Serotonin receptor 6 → serotonin 5-HT(6) receptor
- Unspecific mentions
- Maxi K channel beta subunit → beta2 !!!
38Gene Mention Identification: typical precision problems
- Wrongly delimited match
- complex I NADH dehydrogenase (ubiquinone), Fe-S (20 kDa) EC 1.6.5.3
- Local Context
- inhibitors of PI 3-kinase.
- Unspecific
- NF-kappa B
- Wrong identifier chosen / not found
- Acronym resolution failed
39Gene Mention Identification: State of the Art - BioCreAtIvE II 2006
- Our group got the best results for gene name identification
- Recall: 83%
- Precision: 78%
- F-Measure: 81%
40Outline
- Motivation
- Text-Mining
- Gene Mention Tagging
- Gene Mention Identification
- Relation Extraction
41Relation Extraction
- Entities involved in interactions
- Genes / proteins / chemicals
- Species / Cell Types
- Diseases / Phenotypes
- Qualities of relations that can be extracted
- Co-occurrences,
- Strict semantic relations: protein interactions, protein to function, ...
42Jensen et al.
43alibaba.informatik.hu-berlin.de
44Relation Extraction
- Techniques
- Co-occurrences
- Same sentence, word distance, word composition
- High recall, low precision
- Strict Semantic Relations
- Natural Language Processing (NLP)
- Shallow parsing (manual rule, mined patterns)
- Deep parsing (grammar, linguistic).
- Low Precision, Low Recall.
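The co-occurrence technique can be sketched in a few lines. The sketch assumes that entity names are already known, that sentences end with ".", "!" or "?" followed by whitespace, and that two entities are related whenever they appear in the same sentence; word distance and word composition are not modelled.

```python
import re
from itertools import combinations


def cooccurrences(text, entities):
    """Return pairs of entities mentioned in the same sentence."""
    pairs = set()
    # Naive sentence segmentation on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence.lower()))
        present = sorted(e for e in entities if e.lower() in tokens)
        pairs.update(combinations(present, 2))
    return pairs
```

Every pair in a sentence is reported regardless of what the sentence actually says about it, which is why this approach has high recall and low precision.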
45Relation Extraction
- Relation Extraction Workflow
- Sentence segmentation
- Tokenization
- Part of speech
- Chunking
- Lexical analysis
- Entity identification
- Natural Language Parsing
- Candidate relation filtering
46Relation Extraction
- Natural Language Parsing
- Chomsky Hierarchy
- Type 3: Regular language
- Type 2: Context-free
- Type 1: Context-sensitive
- Type 0: Unrestricted
- Natural Language Grammars
- Head-Driven Phrase Structure Grammar (HPSG)
- Probabilistic context-free grammar (PCFG)
47Relation Extraction
- Parsing "Rab5 interacts with CDC2 and CDC3" with Enju, a wide-coverage probabilistic HPSG parser
- interacts(rab5,cdc2)
- interacts(rab5,cdc3)
- interacts(cdc2,cdc3)
48Relation Extraction
- Problems
- Natural text is very complex
- Dependent on entity identification,
- Anaphora resolution,
- Coverage of grammars is still poor,
- Results: State of the Art is not good :-(
- Recall: 30%
- Precision: 38%
- F-measure: 28%
49Conclusion
- Biomedical knowledge is being dumped as text without computer-readable semantics.
- Text-mining techniques are being developed to mitigate this problem.
- Identifying entities and their relations are the main goals.
- The next iteration in entity identification will be usable by biologists. Relation Extraction does not yet work beyond co-occurrence.
50Thank you for your attention
51A Knowledge Explosion
Jensen et al.
52Relation Extraction
53Relation Extraction
54A Knowledge Explosion
Possible in theory to have semantics, but in
practice only links.
55Gene Mention Identification
Tamames et al., 2003
56Entity Recognition and Information Extraction for
Biology
- Entity Tagging
- Finding the mention of gene/protein names in text
- Entity Identification
- Link a gene/protein mention to a reference database (EntrezGene, Uniprot, ...)
- Relation Extraction
- Identify interactions between genes and proteins.
57Gene Mention Identification: State of the Art BioCreAtIvE II 2006
We have the best F-measure: 81%
We have the best official recall: 88%. However, we can go up to 92.7% with Rlt40