Title: Uses of text mining in molecular biology
1Uses of text mining in molecular biology
2- The problem: combining biological knowledge with
large-scale biological experiments
3Information sources
- Unstructured text
- Medline
- Full text papers
- Databases
- Methodology-based compilations (e.g. interaction data)
- Sequence databases
- Homology
- Protein domain databases
- etc.
- High throughput experiments
4Medline
- 13 million titles and abstracts
- Annotated using Medical Subject Headings (MeSH) and some other fields
- Downloadable (following license agreement)
- Problem: mostly unstructured
5- Number of articles in MEDLINE by year
- Number of articles mentioning gene OR protein by year
- Number of gene occurrences in MEDLINE
6Full text papers
- Few corpora available
- BioMed Central (5000 papers)
- Scattered sample sets
- Available as PDF or XML (eXtensible Markup Language)
- Availability depends entirely on publishers
- Thus, most work has been focused on Medline
7Databases
PubGene
8Databases
- A number of databases available for download (e.g. GenBank, InterPro, UniGene, OMIM)
- Some are only available commercially (e.g. Transfac) or with restrictive licenses (Mitelman aberration database)
- Manual curation at varying levels of precision and completeness
- Problems with non-overlapping identifiers
9Brief history of biological text mining
- 1980s: Natural language processing strategies for general text
- 1997-99: Locating protein names
- 2001: Cocitation networks of biological entities
- 2002: Development of diverse (sub)domain-specific dictionaries, ontologies
- 2003: NLP-based strategies for information extraction
- 2004: Development and evaluation based on test sets
Now a scientific field of its own, with separate tracks at major bioinformatics meetings
10Domain specific dictionaries
- The basis for anything useful
- Protein names
- Gene names
- Chemical substance names
- Disease names
- Organism names
- Ontologies
- Unified medical language system (UMLS)
- etc.
All of these have major problems!!!
11Names
Hoffman, Valencia, TIG, 2003
12Signal to noise ratio
- Very short names: II, P2, GS, G2
- Short names: ABO
- Ambiguous gene names
- Mts1SAI1, S100A4 and CDKN2A
- Complicated names: NF-κB, 14-3-3 etc.
- Gene families: integrins etc.
13Co-occurrence
- Extract pairs of genes that have been mentioned together in the literature
- Stapley and Benoit, PSB 2000: yeast genes
- Jenssen et al., Nat. Genet. 2001: human genes
- Raychaudhuri et al., Genome Res. 12:203-14, 2002
- Opus, X-Mine
- Chaussabel and Sher, Genome Biol. 2002: clustering of genes and terms
- Others
- Common idea: use predefined lists of gene terms (symbols and names) and search for occurrences of these in order to index the gene literature
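The common idea above can be sketched in a few lines of Python. This is only an illustration: the term list, the example sentences and the function names (`genes_in_text`, `cooccurrence_weights`) are assumptions made for the sketch, not the implementation behind any of the cited systems.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical term list: gene symbol -> symbols and names to search for.
GENE_TERMS = {
    "S100A4": ["S100A4", "metastasin"],
    "CDKN2A": ["CDKN2A", "p16"],
    "GATA3": ["GATA3"],
}

def genes_in_text(text, gene_terms=GENE_TERMS):
    """Return the set of gene symbols with at least one term in the text."""
    found = set()
    for symbol, terms in gene_terms.items():
        for term in terms:
            # Match whole tokens only, so "p16" does not match "p160".
            pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(symbol)
                break
    return found

def cooccurrence_weights(articles, gene_terms=GENE_TERMS):
    """For every gene pair, count the articles that mention both genes."""
    weights = Counter()
    for text in articles:
        genes = sorted(genes_in_text(text, gene_terms))
        for pair in combinations(genes, 2):
            weights[pair] += 1
    return weights

# Made-up example records standing in for titles and abstracts.
articles = [
    "Expression of S100A4 and p16 in tumour tissue.",
    "GATA3 regulates T-cell development.",
    "CDKN2A deletion and S100A4 (metastasin) overexpression.",
]
weights = cooccurrence_weights(articles)
print(weights)
```

The pair weight here is simply the number of articles in which both genes occur. Ambiguous or very short terms would inflate these counts, which is why filtering of the term list matters in practice.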
14Gene neighbors
Selected strategy
Gene level index
15Numbers
- 10 M article records
- 2M with occurrence of gene terms (1)
- 1,620,038 gene-article associations (filtered)
- 5,490 genes have no literature references
- 6,200 genes with no neighbors
- 7,537 genes with one or more neighbors
- 144,685 gene pairs
- Total pair (link) weight: 1,118,639
16Types of relationships
Relationship             1 art.   >5 art.
Cell biology                 43        24
Expression correlation      151       183
Histology                    22        66
Homology                     29        75
Mapping                      53         6
Other                         4         5
Incorrect                   198       141
Sum                         500       500
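As a small worked example, the share of pairs that reflect a real relationship can be read off the counts above by subtracting the "Incorrect" row from the column sum. The dictionary layout and the function name `fraction_correct` are assumptions of this sketch; the counts are those on the slide.

```python
# Counts from the slide, per column (pairs found in 1 article vs. >5 articles).
counts = {
    "1 art.": {
        "Cell biology": 43, "Expression correlation": 151, "Histology": 22,
        "Homology": 29, "Mapping": 53, "Other": 4, "Incorrect": 198,
    },
    ">5 art.": {
        "Cell biology": 24, "Expression correlation": 183, "Histology": 66,
        "Homology": 75, "Mapping": 6, "Other": 5, "Incorrect": 141,
    },
}

def fraction_correct(column):
    """Fraction of sampled pairs that are not classified as incorrect."""
    total = sum(column.values())
    return (total - column["Incorrect"]) / total

for label, column in counts.items():
    print(label, sum(column.values()), round(fraction_correct(column), 3))
```

Both columns sum to 500, and the fraction of non-incorrect pairs is higher for pairs supported by more than five articles than for pairs seen in a single article.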
17PubGene 2.4
18Cocitation networks
19Literature microarray gene clustering
20K-means statistical clustering with relative
similarity in MeSH/GO terms
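A minimal sketch of K-means clustering on literature term profiles follows. It assumes each gene is described by a vector marking which terms are associated with it, and it uses plain Euclidean distance with deterministic seeding; the term assignments below are invented for illustration, and the similarity measure actually used on the slide is not specified in this transcript.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(vectors, k, iterations=20):
    """Plain K-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignment

# Hypothetical term profiles over four terms:
# [Phosphorylation, Cell Cycle, Ribosomes, Mitochondria]
genes = ["PPP2R5A", "MRPS12", "CSNK2A1", "MRPL12"]
profiles = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 1, 0]]

clusters = k_means(profiles, k=2)
print(dict(zip(genes, clusters)))
```

With these toy profiles the two phosphorylation-related genes end up in one cluster and the two ribosomal genes in the other.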
21MeSH and GO forward and reverse mapping
22Combining literature and sequence
- cDNA microarrays etc. are produced containing a number of uncharacterized genes. How to maximize the information contained in each experiment?
- Map all printed clones to closest gene symbol
- Complete Research Genetics 40K clone set
- Identify sequences lacking identified gene symbols
- Smith-Waterman search versus all of Human IPI (International Protein Index) version 2.22 with 52,870 sequences
- Connect the homologous uncharacterized genes to existing genes in the literature space
- Cut-off: 10^-4
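The linking step can be sketched as a filter over search hits. This assumes the cut-off of 10^-4 is an expectation value and that each clone is attached to its best-scoring hit below that cut-off; the hit list, clone identifiers and the function name `link_clones` are hypothetical.

```python
E_VALUE_CUTOFF = 1e-4  # cut-off quoted on the slide, read here as an E-value

# Hypothetical search hits: (clone id, matched gene symbol, E-value).
hits = [
    ("clone_A", "MYBL1", 3e-12),
    ("clone_A", "MYB", 2e-6),
    ("clone_B", "GATA3", 0.5),
]

def link_clones(hits, cutoff=E_VALUE_CUTOFF):
    """Map each clone to its best homolog with an E-value below the cut-off."""
    best = {}
    for clone, gene, e_value in hits:
        if e_value >= cutoff:
            continue  # not significant enough to link
        if clone not in best or e_value < best[clone][1]:
            best[clone] = (gene, e_value)
    return {clone: gene for clone, (gene, _) in best.items()}

links = link_clones(hits)
print(links)
```

Clones without any hit below the cut-off stay unlinked and therefore remain outside the literature network.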
23Linking undescribed genes
- Example
- Expression analysis of blood B cells, activated versus resting state
- Rectangle represents undescribed gene, linked to MYBL1. Score of MYBL1 and MYBL1 averaged.
Score 5.56
Image ID 825476, score 8.42: EST with expect < 10^-4
24Refining and focusing
- Template based searches
- Mutations
- Aberrations
- Natural language processing
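A template-based search for mutations can be as simple as a regular expression over protein-level substitution patterns such as "R273H" or "Arg175His". The template below is an assumption made for illustration, not the one used in the talk, and it covers only simple amino-acid substitutions.

```python
import re

AMINO_ACIDS_1 = "ACDEFGHIKLMNPQRSTVWY"
AMINO_ACIDS_3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
                 "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Template: <wild-type residue><position><mutant residue>,
# in one-letter or three-letter amino-acid code.
MUTATION_TEMPLATE = re.compile(
    rf"\b(?:[{AMINO_ACIDS_1}]\d+[{AMINO_ACIDS_1}]"
    rf"|(?:{AMINO_ACIDS_3})\d+(?:{AMINO_ACIDS_3}))\b"
)

def find_mutations(text):
    """Return all substrings of the text that fit the substitution template."""
    return MUTATION_TEMPLATE.findall(text)

sentence = "The R273H and Arg175His substitutions in TP53 were detected."
mutations = find_mutations(sentence)
print(mutations)
```

A template this loose also matches unrelated tokens of the same shape, for example the cell line name T47D, so the hits still need filtering against context or dictionaries.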
25Mutation data
26Chemicals
27Natural language processing
- Conversion of written or spoken word to common machine-usable representations
- Tokenization: breaking up text into units (sentences, words, other)
- Part-of-speech tagging (POS tagging): annotation of words with their grammatical category (nouns, verbs etc.) based on local context. Often 7 tags, but may be many more. Various taggers exist, but without biological domain content. Rule-based (contextual) or probabilistic (HMM).
- Parsing, shallow parsing: determining the complete syntactic structure of a sentence etc., represented as an abstract syntax tree. Computationally expensive.
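The tokenization step can be illustrated with two regular expressions. This is a deliberately naive sketch: the sentence-boundary rule and the token pattern are assumptions chosen so that hyphenated biological names such as NF-kappa-B and 14-3-3 stay in one piece, and they would fail on abbreviations such as "et al." followed by a capitalized word.

```python
import re

# Sentence boundary: ".", "!" or "?" followed by whitespace and a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
# Token: alphanumeric runs joined by internal hyphens, or one punctuation mark.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def split_sentences(text):
    """Split text into sentences at the naive boundary defined above."""
    return SENTENCE_BOUNDARY.split(text.strip())

def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)

text = "NF-kappa-B binds 14-3-3 proteins. The complex is then exported."
sentences = split_sentences(text)
for sentence in sentences:
    print(tokenize(sentence))
```

Keeping hyphenated names together at this stage makes later dictionary lookup of gene and protein names simpler, at the cost of occasionally gluing together ordinary hyphenated words.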
28Tagging
- Example based on standard tagger
30Noun isolation
[Figure: finite-state diagram in two parts, "Noun-Isolate 1" and "Noun-Isolate 2". Lettered states are connected by transitions on part-of-speech tags (WDT, VBP, VBD, CC, DT, RB, IN, VBN, TO, VBG, VBZ, MD), on words (that, those, be, have, or) and on separators (Sep), with repetition ranges such as 0,1, 0,2, 0,5 and 1,1. End action: STOP, Get Next NN or VB.]
31Purpose: Information retrieval
- To obtain the most pertinent and/or hidden large-scale relationships between entities
- Kernel document strategy
- Clustering strategies (K-means)
- Classification strategies (SVM, Bayes classifiers)
- Protein interactions etc.
32Benchmarking problems
- Who defines what is correct for a given biological relationship?
- Use of manually curated databases (DIP, KEGG etc.)
- Curated corpora
- Genia
- BioCreative (http://www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04/handout/)
- TREC - Genomics (http://medir.ohsu.edu/genomics/)
- Nothing is perfect, but some systems now claim a balanced precision and recall above 80%, based on the BioCreative competition in March 2004
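For reference, precision, recall and their balanced combination (the F-measure) against a curated gold standard can be computed as below. The gold-standard and predicted pairs are invented placeholders, and the function name `evaluate` is an assumption of this sketch.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F-measure) of predicted items against gold."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical curated relationships and system output (pairs of entities).
gold = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")}
predicted = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("A", "E")}

precision, recall, f_measure = evaluate(predicted, gold)
print(precision, recall, f_measure)
```

The result depends entirely on what the gold standard contains, which is the benchmarking problem raised in the first bullet of this slide.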
33Application examples
From genes to diseases: based on chromosome locations
34Aided Interpretation of QTL Data Via Literature
Network
[Figure: QTL data with gene groups (A, B, C), (D, E) and (F, G), labeled 8 and 17, shown next to a sub-network extracted from the literature network.]
35Logrank significant gene list
- 95 genes (clones) significant (of 8024)
- Top of list
accession   gene        p-value     good1   groups2
W42723      GRO1        4.441E-16   low     2
R31441      GATA3       3.786E-14   high    3
W96197      DPP6        2.005E-12   high    3
AA452513    KNSL5       1.829E-11   low     3
H95792      ACADSB      2.695E-11   high    2
AA128362    MGC11352    5.209E-11   low     3
R59697      CDK8        7.875E-11   low     3
W73889      TNA         1.287E-10   high    3
AA478585    BTN3A3      5.044E-10   low     3
AA029948    NaN         7.146E-10   high    2
W56526      NaN         9.425E-10   high    2
R53330      Hs.287827   1.069E-09   low     2
N35341      NaN         1.239E-09   high    3
W69094      SCAND1      3.864E-09   low     2
H78560      IGFBP2      6.728E-09   high    3
H62527      GW128       6.893E-09   high    3
36PubGene analysis (direct links)
Protein phosphatase 2
Symbol PPP2R5A scored 3.902
Gene      Ratio     log-2   level
PPP2R5A   29.301    4.873   1
PPP2R5E   7.623     2.930   2
Casein kinase II
Symbol CSNK2A1 scored 4.185
Gene      Ratio     log-2   level
CSNK2A1   2.833     1.502   1
CSNK2A2   116.711   6.867   2
Two mitochondrial ribosomal proteins
Symbol MRPS12 scored 3.591
Gene      Ratio     log-2   level
MRPS12    39.323    5.297   1
MRPL12    3.695     1.885   2
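The three scores on this slide are numerically consistent with averaging the log-2 expression ratios of a gene and its direct literature neighbor. That reading is an inference from the numbers shown, not a definition stated in the talk, and the function name `neighbour_score` is hypothetical.

```python
import math

def neighbour_score(ratios):
    """Mean of the log-2 ratios of a gene and its linked neighbors."""
    logs = [math.log2(ratio) for ratio in ratios]
    return sum(logs) / len(logs)

# Expression ratios from the slide: the gene itself, then its direct neighbor.
examples = {
    "PPP2R5A": [29.301, 7.623],    # slide score 3.902
    "CSNK2A1": [2.833, 116.711],   # slide score 4.185
    "MRPS12": [39.323, 3.695],     # slide score 3.591
}

scores = {symbol: neighbour_score(r) for symbol, r in examples.items()}
for symbol, score in scores.items():
    print(symbol, round(score, 3))
```

Under this reading, a gene with a modest ratio of its own can still score highly when a closely linked gene responds strongly, as in the CSNK2A1 and CSNK2A2 pair.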
37Full text example
- Assumption: figure captions will tend to represent a higher proportion of direct protein-protein interactions
- Figures are directly useful for visualization of interactions and cellular functions
- 10,000 articles from JBC, NAR and PNAS (XML-tagged, from HighWire Press (John Sack)) with 50,000 figures
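Restricting the co-occurrence search to figure captions could be sketched as follows. The element names `fig` and `caption`, the sample document and the protein name list are assumptions for illustration; the actual publisher markup is not shown in this transcript.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article markup; real publisher XML will use a different schema.
ARTICLE_XML = """
<article>
  <body><p>Main text mentioning RAF1.</p></body>
  <fig id="F1"><caption>Interaction of RAF1 with YWHAZ in vitro.</caption></fig>
  <fig id="F2"><caption>Growth curves.</caption></fig>
</article>
"""

PROTEIN_NAMES = {"RAF1", "YWHAZ"}  # hypothetical dictionary of protein names

def caption_mentions(xml_text, names=PROTEIN_NAMES):
    """Map figure ids to protein names co-occurring in the figure caption."""
    root = ET.fromstring(xml_text)
    result = {}
    for fig in root.iter("fig"):
        caption = fig.find("caption")
        if caption is None:
            continue
        tokens = set(re.findall(r"[A-Za-z0-9]+", "".join(caption.itertext())))
        found = sorted(names & tokens)
        if len(found) >= 2:  # keep only captions naming at least two proteins
            result[fig.get("id")] = found
    return result

figure_pairs = caption_mentions(ARTICLE_XML)
print(figure_pairs)
```

Only the caption text is searched, so the mention of RAF1 in the body paragraph does not contribute a pair.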
38Liu et al., Bioinformatics. In press.
39Co-workers
- Tor-Kristian Jenssen
- Fang Liu
- Trevor Clancy
- Vegard Nygaard
- Torbjørn Rognes
- Brit Helle Aarskog
- John Sack