1
Uses of text mining in molecular biology

Eivind Hovig

The problem: Combining biological knowledge with large-scale biological experiments

3
Information sources

Unstructured text
Medline
Full text papers
Databases
Methodology-based compilations (e.g. interaction data)
Sequence based
Homology
Protein domain databases
etc.
High throughput experiments

4
Medline

13 million titles and abstracts
Annotated using Medical Subject Headings (MeSH), and some other fields
Downloadable (following license agreement)
Problem: Mostly unstructured

[Charts: Number of articles in MEDLINE by year; number of articles mentioning gene OR protein by year; number of gene occurrences in MEDLINE]

6
Full text papers

Few corpora available
BioMed Central (5000 papers)
Scattered sample sets
Available as PDF or eXtensible Markup Language (XML)
Availability depends entirely on publishers
Thus, most work has been focused on Medline

7
Databases
PubGene
8
Databases

A number of databases available for download (e.g. GenBank, InterPro, UniGene, OMIM)
Some are only available commercially (e.g. Transfac) or with restrictive licenses (Mitelman aberration database)
Manual curation at varying levels of precision and completeness
Problems with non-overlapping identifiers

9
Brief history of biological text mining

1980s: Natural language processing strategies for general text
1997-9: Locating protein names
2001: Cocitation networks of biological entities
2002: Development of diverse (sub)domain-specific dictionaries, ontologies
2003: NLP-based strategies for information extraction
2004: Development and evaluation based on test sets

Now a scientific field of its own, with separate tracks at major bioinformatics meetings
10
Domain specific dictionaries

The basis for anything useful
Protein names
Gene names
Chemical substance names
Disease names
Organism names
Ontologies
Unified Medical Language System (UMLS)
etc.

All of these have major problems!!!
11
Names
Hoffmann and Valencia, TIG, 2003
12
Signal to noise ratio

Very short names: II, P2, GS, G2
Short names: ABO
Ambiguous gene names
Mts1: SAI1, S100A4 and CDKN2A
Complicated names: NF-κB, 14-3-3, etc.
Gene families: integrins, etc.

13
Co-occurrence

Extract pairs of genes that have been mentioned together in the literature
Stapley and Benoit, PSB 2000: Yeast genes
Jenssen et al., Nat. Genet. 2001: Human genes
Raychaudhuri et al., Genome Res. 12:203-14, 2002
Opus, X-Mine
(Chaussabel and Sher, Genome Biol. 2002: Clustering of genes and terms)
Others
Common idea: Use predefined lists of gene terms (symbols and names) and search for occurrences of these in order to index the gene literature
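The common idea above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any of the cited systems: the term dictionary `GENE_TERMS`, the function names and the three example abstracts are all made up for the sketch, and matching is done on whole tokens only.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy dictionary: gene symbol -> accepted terms (symbols and names).
GENE_TERMS = {
    "TP53": {"TP53", "p53"},
    "MDM2": {"MDM2"},
    "CDKN2A": {"CDKN2A", "p16"},
}

def genes_in_text(text, gene_terms=GENE_TERMS):
    """Return the set of genes with at least one term occurring as a whole token."""
    tokens = set(re.findall(r"[A-Za-z0-9-]+", text))
    return {gene for gene, terms in gene_terms.items() if tokens & terms}

def cooccurrence_weights(abstracts, gene_terms=GENE_TERMS):
    """For each gene pair, count the abstracts that mention both genes."""
    weights = Counter()
    for text in abstracts:
        genes = sorted(genes_in_text(text, gene_terms))
        weights.update(combinations(genes, 2))
    return weights

# Made-up example abstracts, for illustration only.
abstracts = [
    "MDM2 binds p53 and promotes its degradation.",
    "Loss of p16 and mutation of TP53 were observed.",
    "MDM2 amplification was found in tumours with wild-type TP53.",
]
weights = cooccurrence_weights(abstracts)
for pair, weight in weights.most_common():
    print(pair, weight)
```

Exact, case-sensitive token matching is the simplest choice here; it is also where short and ambiguous names start to produce false hits.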

14
Gene neighbors
Selected strategy
Gene level index
15
Numbers

10 M article records
2M with occurrence of gene terms (1)
1,620,038 gene-article associations (filtered)
5,490 genes have no literature references
6,200 genes with no neighbors
7,537 genes with one or more neighbors
144,685 gene pairs, total pair (link) weight 1,118,639

16
Types of relationships
                         1 art.   >5 art.
Cell biology                43        24
Expression correlation     151       183
Histology                   22        66
Homology                    29        75
Mapping                     53         6
Other                        4         5
Incorrect                  198       141
Sum                        500       500
17
PubGene 2.4
18
Cocitation networks
19
Literature microarray gene clustering
20
K-means statistical clustering with relative
similarity in MeSH/GO terms
21
MeSH and GO forward and reverse mapping
22
Combining literature and sequence

cDNA microarrays etc. are produced containing a number of uncharacterized genes. How to maximize the information contained in each experiment?

Map all printed clones to closest gene symbol
Complete Research Genetics 40K clone set
Identify sequences lacking identified gene symbols
Smith-Waterman search versus all of Human IPI (International Protein Index) version 2.22 with 52870 sequences
Connect the homologous uncharacterized genes to existing genes in the literature space
Cut-off: 10^-4
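The last two steps can be sketched as a simple filter over search hits. This is only an illustration under stated assumptions: the hit tuples, clone names and the function `link_clones_to_genes` are hypothetical, and the search itself (Smith-Waterman against IPI) is assumed to have been run elsewhere and to report an expect value per hit.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-4  # the cut-off given on the slide

def link_clones_to_genes(hits, known_genes, cutoff=E_VALUE_CUTOFF):
    """Link uncharacterized clones to literature-known genes by homology.

    hits: iterable of (clone_id, gene_symbol, e_value) tuples (assumed format).
    Only hits with an expect value below the cut-off are kept.
    """
    links = defaultdict(list)
    for clone_id, gene, e_value in hits:
        if e_value < cutoff and gene in known_genes:
            links[clone_id].append((gene, e_value))
    # Best hit (lowest expect value) first for each clone.
    return {clone: sorted(pairs, key=lambda p: p[1]) for clone, pairs in links.items()}

# Made-up hits, for illustration only.
hits = [
    ("clone_A", "MYBL1", 3e-20),
    ("clone_A", "MYB", 5e-6),
    ("clone_A", "GATA3", 0.2),   # above the cut-off, dropped
    ("clone_B", "CDK8", 1e-3),   # above the cut-off, dropped
]
links = link_clones_to_genes(hits, known_genes={"MYBL1", "MYB", "GATA3", "CDK8"})
print(links)
```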
23
Linking undescribed genes

Example
Expression analysis of blood B cells, activated
versus resting state
Rectangle represents undescribed gene, linked to
MYBL1. Score of MYBL1 and MYBL1 averaged.

Score: 5.56
IMAGE ID 825476, score 8.42: EST with expect < 10^-4
24
Refining and focusing

Template based searches
Mutations
Aberrations
Natural language processing

25
Mutation data
26
Chemicals
27
Natural language processing

Conversion of written or spoken word to common machine-usable representations
Tokenization: breaking up text into units (sentences, words, other)
Part-of-speech tagging (POS tagging): annotation of words with their grammatical category (nouns, verbs etc.) based on local context. Often 7 tags, but may be many more. Various taggers exist, but without biological domain content. Rule-based (contextual), or probabilistic (HMM).
Parsing, shallow parsing: determining the complete syntactic structure of a sentence etc., represented as an abstract syntax tree. Computationally expensive.
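A minimal sketch of the first two steps, tokenization and a rule-based tagger, using only the Python standard library. The lexicon, the suffix rules and the single contextual rule are toy assumptions chosen for this example; real taggers use far larger rule sets or trained models.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Split into word tokens (internal hyphens kept) and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

# Tiny hypothetical lexicon of closed-class words.
LEXICON = {"the": "DT", "a": "DT", "that": "WDT", "it": "PRP", "is": "VBZ",
           "binds": "VBZ", "to": "TO", "by": "IN", "and": "CC"}
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "has", "have", "had"}

def tag(tokens):
    """Assign one tag per token using the lexicon, suffixes and one contextual rule."""
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if low in LEXICON:
            t = LEXICON[low]
        elif not tok[0].isalnum():
            t = "PUNCT"
        elif low.endswith("ed"):
            # Contextual rule: participle after be/have, otherwise past tense.
            t = "VBN" if tagged and tagged[-1][0].lower() in AUXILIARIES else "VBD"
        elif low.endswith("ing"):
            t = "VBG"
        else:
            t = "NN"  # default: treat unknown words as nouns
        tagged.append((tok, t))
    return tagged

text = "NF-kappa-B binds to DNA. It is activated by TNF."
sentences = split_sentences(text)
tagged = [tag(tokenize(s)) for s in sentences]
for sent in tagged:
    print(sent)
```

The noun default is what makes such a tagger usable on gene and protein names it has never seen, and also why abbreviations such as "et al." would break the naive sentence splitter.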

28
Tagging

Example based on standard tagger

30
Noun isolation
[Figure: state diagram for noun isolation over POS tags, in two parts (Noun-Isolate 1 and Noun-Isolate 2). States are labeled A to Y; transitions are labeled with tags and words (WDT "that", VBP, VBD, VBZ, VBN, VBG, MD, "be"/"have", CC "or", DT "that"/"those", IN, RB, TO, Sep) and repetition ranges (0,1; 0,2; 0,5; 1,1). Exit action: STOP, Get Next NN or VB.]
31
Purpose: Information retrieval

To obtain the most pertinent and/or hidden large-scale relationships between entities
Kernel document strategy
Clustering strategies (K-means)
Classification strategies (SVM, Bayes classifiers)
Protein interactions etc.

32
Benchmarking problems

Who defines what is correct for a given biological relationship?
Use of manually curated databases (DIP, KEGG etc.)
Curated corpora
GENIA
BioCreative (http://www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04/handout/)
TREC - Genomics (http://medir.ohsu.edu/genomics/)
Nothing is perfect, but some systems now claim a balanced precision and recall above 80%, based on the BioCreative competition in March 2004
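For reference, precision, recall and their balanced combination can be computed as below. This sketch assumes that "balanced precision and recall" refers to the F-measure with equal weights (F1); the gold and predicted pairs are placeholders, not data from any of the benchmarks named above.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted relations against a curated gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Placeholder relations (pairs of entity names), for illustration only.
gold = {("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")}
predicted = {("A", "B"), ("C", "D"), ("E", "F"), ("B", "E")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```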

33
Application examples
From genes to diseases: Based on chromosome locations
34
Aided Interpretation of QTL Data Via Literature
Network
[Figure: QTL data map to groups of genes (A, B, C; D, E; F, G; labels 8 and 17), which are used to extract a sub-net from the literature network.]
35
Logrank significant gene list

95 genes (clones) significant (of 8024)
Top of list
accession  gene       p-value    good1  groups2
W42723     GRO1       4.441E-16  low    2
R31441     GATA3      3.786E-14  high   3
W96197     DPP6       2.005E-12  high   3
AA452513   KNSL5      1.829E-11  low    3
H95792     ACADSB     2.695E-11  high   2
AA128362   MGC11352   5.209E-11  low    3
R59697     CDK8       7.875E-11  low    3
W73889     TNA        1.287E-10  high   3
AA478585   BTN3A3     5.044E-10  low    3
AA029948   NaN        7.146E-10  high   2
W56526     NaN        9.425E-10  high   2
R53330     Hs.287827  1.069E-09  low    2
N35341     NaN        1.239E-09  high   3
W69094     SCAND1     3.864E-09  low    2
H78560     IGFBP2     6.728E-09  high   3
H62527     GW128      6.893E-09  high   3

36
PubGene analysis (direct links)

Protein phosphatase 2
Symbol PPP2R5A scored 3.902
Gene     Ratio    log-2  level
PPP2R5A  29.301   4.873  1
PPP2R5E  7.623    2.930  2

Casein kinase II
Symbol CSNK2A1 scored 4.185
Gene     Ratio    log-2  level
CSNK2A1  2.833    1.502  1
CSNK2A2  116.711  6.867  2

Two mitochondrial ribosomal proteins
Symbol MRPS12 scored 3.591
Gene     Ratio    log-2  level
MRPS12   39.323   5.297  1
MRPL12   3.695    1.885  2
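The three scores on this slide are numerically consistent with the mean of the log-2 ratios of the gene and its direct literature neighbor. The sketch below checks that reading against the slide's own numbers; it is an inference from the figures shown, not a documented PubGene formula, and the function name `pair_score` is made up here.

```python
import math

def pair_score(ratio_gene, ratio_neighbor):
    """Mean of the log-2 expression ratios of a gene and its literature neighbor."""
    return (math.log2(ratio_gene) + math.log2(ratio_neighbor)) / 2

# (gene, neighbor): (ratio of gene, ratio of neighbor), as listed on the slide.
examples = {
    ("PPP2R5A", "PPP2R5E"): (29.301, 7.623),
    ("CSNK2A1", "CSNK2A2"): (2.833, 116.711),
    ("MRPS12", "MRPL12"): (39.323, 3.695),
}
scores = {pair: pair_score(*ratios) for pair, ratios in examples.items()}
for pair, score in scores.items():
    print(pair, f"{score:.3f}")
```

Averaging on the log scale means a strongly changed neighbor can lift a gene with a modest ratio of its own, as in the CSNK2A1 example.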
37
Full text example

Assumption: Figure captions will tend to represent a higher proportion of direct protein-protein interactions
Figures are directly useful for visualization of interactions and cellular functions
10,000 articles from JBC, NAR and PNAS (XML tagged, from HighWire Press (John Sack)) with 50,000 figures

38
Liu et al., Bioinformatics. In press.
39
Co-workers