Textmining, Entity Identification, and Relationship Extraction - PowerPoint PPT Presentation

Slides: 58

Provided by: biotecTu

Transcript and Presenter's Notes

1
Text-mining, Entity Identification, and
Relationship Extraction

ILS Lecture 11th July
Loïc Royer

2
Outline

Motivation
Text-Mining
Gene Mention Tagging
Gene Mention Identification
Relation Extraction

3
Motivation

Biomedical literature is growing at a tremendous
pace
PubMed indexes 16 million articles and grows
every year by 600'000 articles

4
A Knowledge Explosion
Most biological information is in the form of
text and sequences
5
Four Solutions

Manual curation
Best precision
Not scalable
Text-Mining
Quite scalable
Difficult
Authors formalize their results in a formal
language
Problems to convince people
Formal language cannot be too complicated
Wikipedia approach for Life Sciences
Scalability already demonstrated
Better Semantic Wiki engines are needed

6
Text-Mining

Text Classification
Information Retrieval
Entity Recognition
Information Extraction
Question Answering
Text Summarization

7
Outline

Motivation
Text-Mining
Gene Mention Tagging
Gene Mention Identification
Relation Extraction

8
Gene Mention Tagging
Objective: Locate gene/protein names in text
9
Gene Mention Tagging

The problem can be formulated as a classification
problem of tokens in text
were reactive for CXCR3 and that
of the interleukin-1 receptor gene in
Tokens (words) are either
part of a gene
not part of a gene

10
Gene Mention Tagging

Machine Learning techniques used
CRF (Conditional Random Fields)
SVM (Support Vector Machines)
N-gram (frequencies of word n-tuples)
Max-Ent (Maximum Entropy)
HMM (Hidden Markov Model)
NLP Techniques used
POS Tagger (Part of Speech)
NP Chunker (Noun Phrase Chunker)

11
Gene Mention Tagging

Conditional Random Fields

Part of gene name, or not
Words (tokens)
Edges between random variables represent
conditional probabilities
What is the main assumption of such a model?
What is missing here?
12
Gene Mention Tagging

Features used
Morphological Features
CAPWORD: A-Za-z
CAPSMIX: A-Z(A-za-za-zA-Z)A-z
Character n-gram features
ase in decarboxylase
Part Of Speech Features
Noun better than Verb
Lexical features
Dictionaries of known genes/proteins
Dictionaries of non-gene/protein.
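
The feature families above could be computed per token roughly as follows. This is a minimal sketch, not the lecture's implementation: the CAPWORD pattern, the mixed-case test, the feature names, the Penn Treebank style POS tag, and the toy dictionary are all assumptions made for illustration, because the slide's exact patterns are not recoverable.

```python
import re

# Assumed pattern: one capital letter followed by lowercase letters.
CAPWORD = re.compile(r"^[A-Z][a-z]+$")

# Hypothetical toy dictionary of known gene/protein names (lowercased).
KNOWN_GENES = {"cxcr3", "rab5"}


def char_ngrams(token, n=3):
    """Return the set of character n-grams of a token."""
    return {token[i:i + n] for i in range(len(token) - n + 1)}


def token_features(token, pos_tag):
    """Build an illustrative feature dictionary for a single token."""
    capword = bool(CAPWORD.match(token))
    # Mixed case: letters only, neither all lower nor all upper, not a CAPWORD.
    capsmix = (token.isalpha() and not token.islower()
               and not token.isupper() and not capword)
    return {
        "capword": capword,
        "capsmix": capsmix,
        "has_digit": any(c.isdigit() for c in token),
        "ngram_ase": "ase" in char_ngrams(token.lower()),
        "is_noun": pos_tag.startswith("NN"),
        "in_gene_dict": token.lower() in KNOWN_GENES,
    }
```

Such dictionaries would then be handed to a sequence classifier (CRF, SVM, ...) as the per-token observation.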

13
Gene Mention Tagging

Things that make the tagging of gene/protein
names difficult
Delimitation of a gene name in text is subject to
discussion: activated
Biological terms such as diseases or phenotypes
are also used to name genes,
Abbreviations defined in the text that
accidentally resemble gene mentions, such as cell
lines or domains.

14
Gene Mention Tagging

cheap date (id32999): Mutants are especially
sensitive to alcohol. Interestingly, another name
for the gene is amnesiac, as mutants also have a
poor memory.
ken and barbie (id37785): Both male and female
mutants lack external genitalia, as do poor Ken
and Barbie.
icebox (id48456): Female icebox mutants do not
care about courting males.

Superman, Hairy, Crooked Legs, Lava lamp,
Dreadlocks, Clown,
15
Gene Mention Tagging

Results and State of the Art for human genes
Recall: 85%
Precision: 88%
F-Measure: 87%
Relevance: What biologists really need are the
database identifiers of the genes.

16
Outline

Motivation
Text-Mining
Gene Mention Tagging
Gene Mention Identification
Relation Extraction

17
Gene Mention Identification

Objective: Associate to each document a list of
gene identifiers from a reference database

18
Gene Mention Identification
MAGUK

False positives
Text that does not mention proteins gets
annotated
The wrong EntrezGene identifier is chosen
False negatives
A protein is not found at all,
The wrong EntrezGene identifier is chosen

One mistake that counts for two !!
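
Why a wrongly chosen identifier counts twice can be seen by scoring a document's predicted identifier set against the gold set. The sketch below is only an illustration; the identifiers `G1`, `G2`, `G3` are made-up placeholders, not real EntrezGene entries.

```python
def evaluate(predicted, gold):
    """Score a set of predicted gene identifiers against the gold set."""
    tp = len(predicted & gold)   # correct identifiers
    fp = len(predicted - gold)   # predicted but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return tp, fp, fn, precision, recall, f_measure


# The mention of G2 was resolved to the wrong identifier G3:
# G3 is a false positive and G2 is a false negative at the same time.
tp, fp, fn, precision, recall, f_measure = evaluate({"G1", "G3"}, {"G1", "G2"})
```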
19
Gene Mention Identification

Recall Phase
First, as many reasonable candidate genes as
possible are obtained per document using:
Dictionary merging, filtering, and
classification,
Known text to gene/protein associations,
Syntactical variation and matching.
Precision Phase
Candidates are ranked according to
Syntax / Semantics,
Local / document wide contexts,
Inter-annotation agreement.

20
Gene Mention Identification

Recall Oriented Techniques
Preprocess text by interpreting intensive
enumerations: "freac1 to freac4" becomes "freac1,
freac2, freac3, and freac4"
Problem: eiF1-eiF3
Is it eiF1 and eiF3, or eiF1, eiF2, and eiF3?
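
A minimal sketch of such a preprocessing step, assuming names of the form "letters + number" and a single regular expression (both are assumptions, not the lecture's actual rules). The explicit "to" form is expanded, while the ambiguous hyphen form is deliberately left untouched.

```python
import re

# prefix + number, then "to" or "-", then the same prefix + number
RANGE = re.compile(r"\b([A-Za-z]+)(\d+)\s*(to|-)\s*\1(\d+)\b")


def expand_enumerations(text):
    """Expand 'freac1 to freac4' style ranges into explicit lists."""
    def expand(match):
        prefix, separator = match.group(1), match.group(3)
        start, end = int(match.group(2)), int(match.group(4))
        # A hyphen may mean "and" or "through": leave it unresolved.
        if separator == "-" or end <= start:
            return match.group(0)
        names = [f"{prefix}{i}" for i in range(start, end + 1)]
        if len(names) == 2:
            return " and ".join(names)
        return ", ".join(names[:-1]) + ", and " + names[-1]

    return RANGE.sub(expand, text)
```

Leaving "eiF1-eiF3" as-is pushes the decision to a later, context-aware step instead of guessing.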

21
Gene Mention Identification

Recall Oriented Techniques
Collect and merge gene/protein synonyms
dictionaries from different sources.

22
Gene Mention Identification

Recall Oriented Techniques
Dictionary synonym classification: divide and
conquer strategy. Different types of gene
synonyms require different identification
strategies:
Database identifiers (KIAA0958, HGNC:17875),
Abbreviations (CD95L, Lin7c),
Single- or multi-word terms (tumor necrosis
factor alpha),
Spurious synonyms (AA, ORF has no N-terminal
Met, it may be non-functional).

23
Gene Mention Identification

Recall Oriented Techniques
Generate variant synonyms using rules
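
As an illustration of rule-based variant generation, the sketch below applies two hypothetical rules (hyphen removal, and a trailing "protein" replaced by the yeast-style "p" suffix) and compares case-insensitively. The rule set and function names are assumptions chosen to cover a few of the mismatches discussed later in the deck; a real system would use many more rules.

```python
def generate_variants(synonym):
    """Return the synonym plus variants produced by two illustrative rules."""
    variants = {synonym}
    # Rule 1: a hyphen may be dropped or written as a space.
    if "-" in synonym:
        variants.add(synonym.replace("-", ""))
        variants.add(synonym.replace("-", " "))
    # Rule 2: "X protein" may appear as "X" or, yeast style, as "Xp".
    if synonym.lower().endswith(" protein"):
        stem = synonym[: -len(" protein")]
        variants.add(stem)
        variants.add(stem + "p")
    return variants


def matches(mention, synonym):
    """Case-insensitive match of a text mention against any variant."""
    key = mention.lower()
    return any(variant.lower() == key for variant in generate_variants(synonym))
```

With these rules `matches("GPIbalpha", "GPIb-alpha")` and `matches("Gar1p", "GAR1 protein")` succeed, while unrelated names still fail to match.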

24
Gene Mention Identification
Vesicle Soluble Maleic acid N-ethylimide
Sensitive Fusion Protein Attachment Protein
Receptor
Hunter et al.
25
Gene Mention Identification

Recall Oriented Techniques
Gather similar documents with known associations
to genes/proteins, then transfer association

rab5
orc2
p20
eiF2
Rap55
Similar documents
Document examined
26
Gene Mention Identification

How to compute similarity between documents?
Vector space model: a document is represented as
a word vector.
Cosine similarity
TF-IDF

Why is there a logarithm in this formula ?
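
The formula itself did not survive in this transcript, so the sketch below assumes one common TF-IDF variant, weight = tf * log(N / df), with whitespace tokenization. Other weightings exist, and the example documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn each document into a sparse {word: tf * log(N / df)} vector."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    document_frequency = Counter()
    for tokens in tokenized:
        document_frequency.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        term_frequency = Counter(tokens)
        vectors.append({
            word: count * math.log(n_docs / document_frequency[word])
            for word, count in term_frequency.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "rab5 regulates endosome fusion",
    "rab5 binds endosome membranes",
    "p53 controls cell cycle",
]
v0, v1, v2 = tfidf_vectors(docs)
```

Here the first two documents share weighted terms and get a positive similarity, while the third shares none with the first and scores zero.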
27
Gene Mention Identification

Zipf's law: In a corpus of natural language
utterances, the frequency of any word is roughly
inversely proportional to its rank in the
frequency table.
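
One way to check this on a corpus is to build the rank-frequency table and look at rank times frequency, which should stay roughly constant if the law holds. The sketch uses a synthetic token list constructed to follow the law exactly, so it only illustrates the bookkeeping, not real data.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# Synthetic corpus: frequencies 12, 6, 4, 3 for ranks 1 to 4.
tokens = ["the"] * 12 + ["of"] * 6 + ["gene"] * 4 + ["rab5"] * 3
table = rank_frequency(tokens)
products = [rank * freq for rank, _, freq in table]
```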

28
Gene Mention Identification

Ambiguity Problem
In naming
1168 genes in EntrezGene named p60
In species
949 species have a gene named p53
Official names and symbols are not used.

29
Gene Mention Identification
Yeast: smallest vocabulary, shortest names, least
ambiguity
Mouse: largest vocabulary, longest names, less
ambiguity than fly
Fly: large vocabulary, medium names, most
ambiguity
Lynette Hirschman, Marc Colosimo, Alexander A.
Morgan, Alexander S. Yeh. "Overview of
BioCreAtIvE task 1B Normalized Gene Lists,"
accepted by BMC Bioinformatics.
30
Gene Mention Identification

Precision Oriented Techniques
Identify with - high confidence - regions of text
that do not refer to genes/proteins

Genes: high confidence
Non-genes: high confidence
31
Gene Mention Identification

Precision Oriented Techniques
Alignment-based syntactical similarities
Levenshtein distance: edit distance
Needleman-Wunsch distance or Sellers algorithm:
gap cost function, substitution matrix
Smith-Waterman distance:
optimal subsequences
Smith-Waterman-Gotoh distance:
starting a gap is different from continuing a gap.

What is the weakness of alignment-based methods?
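
The simplest of these measures, the Levenshtein distance, can be sketched with the standard dynamic programming recurrence and unit costs for insertion, deletion, and substitution. This is a generic textbook version, not code from the lecture.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    previous = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            substitution_cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,                      # delete c1
                current[j - 1] + 1,                   # insert c2
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]
```

For instance `levenshtein("GPIb-alpha", "GPIbalpha")` is a single deletion, whereas names whose tokens appear in a different order end up far apart even though they denote the same gene.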
32
Gene Mention Identification

Precision Oriented Techniques
Other syntactical similarities
Jaro distance metric between s1 and s2
m: number of matching characters
a, b: lengths of s1, s2
t: number of transpositions

Why is this not a distance but a similarity?
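
The slide's formula is missing from this transcript, so the sketch below assumes the standard Jaro definition, (m/a + m/b + (m - t)/m) / 3, where characters match only within a window of max(a, b) // 2 - 1 positions and t is half the number of matched characters that appear in a different order.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    a, b = len(s1), len(s2)
    if a == 0 or b == 0:
        return 0.0
    window = max(max(a, b) // 2 - 1, 0)
    matched1 = [False] * a
    matched2 = [False] * b
    m = 0
    for i, char in enumerate(s1):
        low = max(0, i - window)
        high = min(b, i + window + 1)
        for j in range(low, high):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Compare matched characters in order; each swapped pair counts once.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(x != y for x, y in zip(seq1, seq2)) / 2
    return (m / a + m / b + (m - t) / m) / 3
```

For the classic pair "MARTHA" and "MARHTA", all six characters match and one pair is transposed, giving (1 + 1 + 5/6) / 3.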
33
Gene Mention Identification

Precision Oriented Techniques
Bayesian Estimations
Knowing the a priori use frequency of gene names.
Given a context and additional evidence: "ken
and barbie" in biomedical text relating to the
fly organism does refer to a gene and never to
the toys...

Why is the marginal probability not needed in
practice ?
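
A small numeric sketch of this kind of estimation, with invented priors and likelihoods for the "ken and barbie" case. It assumes the listed hypotheses are exhaustive and mutually exclusive, so that the sum of the unnormalized scores plays the role of the marginal probability.

```python
def rank_hypotheses(priors, likelihoods):
    """Return the best hypothesis and the normalized posteriors."""
    # Unnormalized posterior: prior times likelihood of the evidence.
    scores = {h: priors[h] * likelihoods[h] for h in priors}
    best = max(scores, key=scores.get)
    # The marginal is the same constant for every hypothesis.
    marginal = sum(scores.values())
    posteriors = {h: score / marginal for h, score in scores.items()}
    return best, posteriors


# Hypothetical numbers: the name is rare as a gene a priori, but the
# evidence "text is about the fly organism" strongly favours the gene reading.
priors = {"gene": 0.2, "toy": 0.8}
likelihoods = {"gene": 0.9, "toy": 0.01}
best, posteriors = rank_hypotheses(priors, likelihoods)
```

Dividing by the marginal rescales all scores by the same factor, so the ranking of candidates is already determined by the unnormalized products.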
34
Gene Mention Identification

Evidences that influence posterior probabilities

35
Gene Mention Identification

Z-score

36
Gene Mention Identification: typical recall
problems

Missing substitution rules
GAR1 protein vs. Gar1p
Wrong order of tokens
IL-receptor, type II vs. type II IL-receptor
Abbreviation inside long synonym
ubiquitin ( UBC4/5) vs. Ubc4
Capitalisation
APOER2 vs. ApoER2

37
Gene Mention Identification: typical recall
problems

Missing syntactic variants
GPIb-alpha vs. GPIbalpha
Morphology
UBC3B vs. Ubc3
Token pollution
Serotonin receptor 6 vs. serotonin 5-HT(6)
receptor
Unspecific mentions
Maxi K channel beta subunit vs. beta2 !!!

38
Gene Mention Identification: typical precision
problems

Wrongly delimited match
complex I NADH dehydrogenase (ubiquinone), Fe-S
(20 kDa) EC 1.6.5.3
Local Context
inhibitors of PI 3-kinase.
Unspecific
NF-kappa B
Wrong identifier chosen / not found
Acronym resolution failed

39
Gene Mention Identification State of the Art -
BioCreAtIvE II 2006

Our group got the best results for gene name
identification
Recall: 83%
Precision: 78%
F-Measure: 81%

40
Outline

Motivation
Text-Mining
Gene Mention Tagging
Gene Mention Identification
Relation Extraction

41
Relation Extraction

Entities involved in interactions
Genes / proteins / chemicals
Species / Cell Types
Diseases / Phenotypes
Qualities of relations that can be extracted
Co-occurrences,
Strict semantic relations: protein interactions,
protein to function,

42
Jensen et al.
43
alibaba.informatik.hu-berlin.de
44
Relation Extraction

Techniques
Co-occurrences
Same sentence, word distance, word composition
High recall, low precision
Strict Semantic Relations
Natural Language Processing (NLP)
Shallow parsing (manual rule, mined patterns)
Deep parsing (grammar, linguistic).
Low Precision, Low Recall.
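
Sentence-level co-occurrence, the simplest of these techniques, can be sketched as follows. The sentence splitter, the tokenizer, and the entity list are simplifying assumptions; in a real pipeline the entities would come from the identification step.

```python
import re
from itertools import combinations


def cooccurrences(text, entities):
    """Return all pairs of entities mentioned within the same sentence."""
    pairs = set()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence.lower()))
        found = sorted(e for e in entities if e.lower() in tokens)
        pairs.update(combinations(found, 2))
    return pairs


text = "Rab5 interacts with CDC2 and CDC3. P53 was not studied here."
pairs = cooccurrences(text, ["Rab5", "CDC2", "CDC3", "P53"])
```

The result contains both Rab5 pairs but also (CDC2, CDC3), a pair the sentence does not actually assert, which is the high-recall, low-precision behaviour noted above.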

45
Relation Extraction

Relation Extraction Workflow
Sentence segmentation
Tokenization
Part of speech
Chunking
Lexical analysis
Entity identification
Natural Language Parsing
Candidate relation filtering

46
Relation Extraction

Natural Language Parsing
Chomsky Hierarchy
Type 3 Regular language
Type 2 Context-free
Type 1 Context-sensitive
Type 0 Unrestricted
Natural Language Grammars
Head-Driven Phrase Structure Grammar (HPSG)
Probabilistic context-free grammar (PCFG)

47
Relation Extraction

Parsing "Rab5 interacts with CDC2 and CDC3" with
Enju, a wide-coverage probabilistic HPSG parser
interacts(rab5,cdc2)
interacts(rab5,cdc3)
interacts(cdc2,cdc3)

48
Relation Extraction

Problems
Natural text is very complex
Dependent on entity identification,
Anaphora resolution,
Coverage of grammars is still poor,
Results: State of the Art is not good :-(
Recall: 30%
Precision: 38%
F-measure: 28%

49
Conclusion

Biomedical knowledge is being dumped as text
without computer readable semantics.
Text-mining techniques are being developed to
mitigate this problem.
Identifying entities and their relations are the
main goals.
The next iteration in entity identification will
be usable by biologists. Relation Extraction does
not yet work beyond co-occurrence.

50
Thank you for your attention
51
A Knowledge Explosion
Jensen et al.
52
Relation Extraction
53
Relation Extraction
54
A Knowledge Explosion
Possible in theory to have semantics, but in
practice only links.
55
Gene Mention Identification
Tamames et al., 2003
56
Entity Recognition and Information Extraction for
Biology

Entity Tagging
Finding the mention of gene/protein names in text
Entity Identification
Link a gene/protein mention to a reference
database (EntrezGene, UniProt, ...)
Relation Extraction
Identify interactions between genes and proteins.

57
Gene Mention Identification State of the Art
BioCreAtIvE II 2006
We have the best F-measure: 81%
We have the best official recall: 88%. However, we
can go up to 92.7% with Rlt40
