Uses of text mining in molecular biology

1
Uses of text mining in molecular biology
  • Eivind Hovig

2
  • The problem: Combining biological knowledge with
    large scale biological experiments

3
Information sources
  • Unstructured text
  • Medline
  • Full text papers
  • Databases
  • Methodology based compilations (ex. interaction
    data)
  • Sequence bases
  • Homology
  • Protein domain databases
  • etc.
  • High throughput experiments

4
Medline
  • 13 million titles and abstracts
  • Annotated using Medical Subject Headings (MeSH),
    and some other fields
  • Downloadable (following license agreement)
  • Problem: Mostly unstructured

5
  • Number of Articles in MEDLINE by year
  • Number of Articles mentioning gene OR
    protein by year
  • Number of gene occurrences in MEDLINE

6
Full text papers
  • Few corpora available
  • Biomed Central (5000 papers)
  • Scattered sample sets
  • Available as pdf or eXtensible Markup Language
  • Availability depends entirely on publishers
  • Thus, most work has been focused on Medline

7
Databases
PubGene
8
Databases
  • A number of databases available for download
    (e.g. GenBank, InterPro, UniGene, OMIM)
  • Some are only available commercially (e.g.
    Transfac) or with restrictive licenses (e.g. the
    Mitelman aberration database)
  • Manual curation at varying level of precision and
    completeness
  • Problems with non-overlapping identifiers

9
Brief history of biological text mining
  • 1980s: Natural language processing strategies
    for general text
  • 1997-99: Locating protein names
  • 2001: Co-citation networks of biological entities
  • 2002: Development of diverse (sub)domain-specific
    dictionaries, ontologies
  • 2003: NLP-based strategies for information
    extraction
  • 2004: Development and evaluation based on test
    sets

Now a scientific field of its own, with separate
tracks at major bioinformatics meetings
10
Domain specific dictionaries
  • The basis for anything useful
  • Protein names
  • Gene names
  • Chemical substance names
  • Disease names
  • Organism names
  • Ontologies
  • Unified medical language system (UMLS)
  • etc.

All of these have major problems!!!
11
Names
Hoffmann and Valencia, TIG, 2003
12
Signal to noise ratio
  • Very short names: II, P2, GS, G2
  • Short names: ABO
  • Ambiguous gene names
  • Mts1SAI1, S100A4 and CDKN2A
  • Complicated names: NF-k-B, 14-3-3 etc.
  • Gene families: integrins etc.

13
Co-occurrence
  • Extract pairs of genes that have been mentioned
    together in the literature
  • Stapley and Benoit, PSB 2000: Yeast genes
  • Jenssen et al., Nat. Genet. 2001: Human genes
  • Raychaudhuri et al., Genome Res. 12:203-14 2002:
    Opus, X-Mine
  • (Chaussabel and Sher, Genome Biol. 2002:
    Clustering of genes and terms)
  • Others
  • Common idea: Use predefined lists of gene terms
    (symbols and names) and search for occurrences of
    these in order to index the gene literature
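
The common idea above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any of the cited systems: the term list GENE_TERMS, the whole-token matching rule and the function names are all assumptions made for the example, and multi-word gene names are not handled.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical term list: each surface form (symbol or alias) maps to a gene symbol.
GENE_TERMS = {
    "S100A4": "S100A4",
    "CDKN2A": "CDKN2A",
    "p16": "CDKN2A",
    "GATA3": "GATA3",
}

def genes_in_text(text, gene_terms=GENE_TERMS):
    """Return the set of gene symbols whose terms occur as whole tokens in text."""
    tokens = set(re.findall(r"[A-Za-z0-9-]+", text))
    return {symbol for term, symbol in gene_terms.items() if term in tokens}

def cooccurrence_counts(abstracts, gene_terms=GENE_TERMS):
    """Count, per unordered gene pair, the number of abstracts mentioning both genes."""
    pair_counts = Counter()
    for text in abstracts:
        genes = sorted(genes_in_text(text, gene_terms))
        pair_counts.update(combinations(genes, 2))
    return pair_counts
```

Each pair count plays the role of a link weight between two genes. Exact, case-sensitive token matching is the simplest possible choice and runs straight into the short and ambiguous names listed on the signal-to-noise slide.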

14
Gene neighbors
Selected strategy
Gene level index
15
Numbers
  • 10 M article records
  • 2M with occurrence of gene terms (1)
  • 1,620,038 gene-article associations (filtered)
  • 5,490 genes have no literature references
  • 6,200 genes with no neighbors
  • 7,537 genes with one or more neighbors
  • 144,685 gene pairs,
  • total pair (link) weight 1,118,639

16
Types of relationships
Relationship type         1 art.   >5 art.
Cell biology                  43        24
Expression correlation       151       183
Histology                     22        66
Homology                      29        75
Mapping                       53         6
Other                          4         5
Incorrect                    198       141
Sum                          500       500
17
PubGene 2.4
18
Cocitation networks
19
Literature microarray gene clustering
20
K-means statistical clustering with relative
similarity in MeSH/GO terms
21
MeSH and GO forward and reverse mapping
22
Combining literature and sequence
  • cDNA microarrays etc. are produced containing a
    number of uncharacterized genes. How to maximize
    the information contained in each experiment?

  • Map all printed clones to closest gene symbol
  • Complete Research Genetics 40K clone set
  • Identify sequences lacking identified gene symbols
  • Smith-Waterman search versus all of Human IPI
    (International Protein Index) version 2.22 with
    52870 sequences
  • Connect the homologous uncharacterized genes to
    existing genes in the literature space
  • Cut-off: 10^-4
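
The linking step above can be sketched as a filter over search hits. This is an illustration under stated assumptions: the hit tuples, the clone identifiers and the function name are hypothetical, and the cut-off is read as an expect-value threshold of 1e-4; the slides do not describe the actual implementation.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-4  # cut-off from the slide, read here as an expect value

def link_uncharacterized(hits, literature_genes, cutoff=E_VALUE_CUTOFF):
    """Link clones without a gene symbol to genes present in the literature index.

    hits: iterable of (clone_id, hit_gene_symbol, e_value) from a homology search.
    literature_genes: set of gene symbols that have literature neighbors.
    Returns {clone_id: [(gene_symbol, e_value), ...]} sorted by ascending e-value.
    """
    links = defaultdict(list)
    for clone_id, symbol, e_value in hits:
        # Keep only significant hits to genes that exist in the literature space.
        if e_value < cutoff and symbol in literature_genes:
            links[clone_id].append((symbol, e_value))
    return {clone: sorted(pairs, key=lambda p: p[1]) for clone, pairs in links.items()}
```

A clone with no hit below the cut-off simply stays unlinked, which keeps weak homologies from pulling unrelated literature into the analysis.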
23
Linking undescribed genes
  • Example
  • Expression analysis of blood B cells, activated
    versus resting state
  • Rectangle represents undescribed gene, linked to
    MYBL1. Score of MYBL1 and MYBL1 averaged.

Score: 5.56
IMAGE ID 825476, score 8.42, EST with expect < 10^-4
24
Refining and focusing
  • Template based searches
  • Mutations
  • Aberrations
  • Natural language processing

25
Mutation data
26
Chemicals
27
Natural language processing
  • Conversion of written or spoken word to common
    machine-usable representations
  • Tokenization: breaking up text into units
    (sentences, words, other)
  • Part-of-speech tagging (POS tagging): annotation
    of words with their grammatical category (nouns,
    verbs etc.) based on local context. Often 7 tags,
    but may be many more. Various taggers exist, but
    without biological domain content. Rule-based
    (contextual), or probabilistic (HMM).
  • Parsing, shallow parsing: determining the complete
    syntactic structure of a sentence etc.,
    represented as an abstract syntax tree.
    Computationally expensive.
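
The tokenization step above can be illustrated with a minimal regex-based sketch. The splitting rules and function names are assumptions chosen for the example; they are deliberately naive (abbreviations such as "et al." would be split incorrectly) and are not the tokenizer used in the presentation.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by
    whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    """Split a sentence into word-like tokens and punctuation marks.
    Hyphen-joined names such as NF-k-B or 14-3-3 stay in one piece."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)
```

Keeping hyphenated strings together is a design choice aimed at the complicated gene names mentioned earlier; a general-purpose tokenizer would often split them.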

28
Tagging
  • Example based on standard tagger

30
Noun isolation
[Figure: state-transition diagram with two sub-networks, Noun-Isolate 1 and Noun-Isolate 2. States carry single-letter labels (B-2 to G-2 in the second sub-network); transitions are labeled with part-of-speech tags (WDT, CC, DT, RB, IN, TO, MD, VBP, VBD, VBN, VBG, VBZ), word lists ("that", "that those", "be have", "or", Sep) and repetition ranges (0,1; 0,2; 0,5; 1,1). Exit action: STOP, get next NN or VB.]
31
Purpose: Information retrieval
  • To obtain the most pertinent and/or hidden large
    scale relationships between entities
  • Kernel document strategy
  • Clustering strategies (K-means)
  • Classification strategies (SVM, Bayes
    classifiers)
  • Protein interactions etc.

32
Benchmarking problems
  • Who defines what is correct for a given
    biological relationship?
  • Use of manually curated databases (DIP, KEGG
    etc.)
  • Curated corpora
  • Genia
  • BioCreative (http://www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04/handout/)
  • TREC - Genomics (http://medir.ohsu.edu/genomics/)
  • Nothing is perfect, but some systems now claim a
    balanced precision and recall above 80%, based on
    the BioCreative competition in March 2004
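
For reference, precision, recall and their balanced combination can be computed as in the sketch below. It assumes that "balanced precision and recall" refers to the balanced F-measure (the harmonic mean of the two), which is a common reading but is not stated on the slide; the function name and the set-based input format are choices made for the example.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted items against a gold-standard set.

    Returns (precision, recall, f1), where f1 is the harmonic mean of
    precision and recall, or 0.0 when both are zero.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

The gold set is exactly where the benchmarking problem enters: the numbers are only as meaningful as the curated database or corpus that defines what counts as correct.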

33
Application examples
From genes to diseases: based on chromosome
locations
34
Aided Interpretation of QTL Data Via Literature
Network
[Figure: QTL data mapped to gene groups (A,B,C; D,E; F,G), with labels 8 and 17, and the sub-network extracted from the literature network.]
35
Logrank significant gene list
  • 95 genes (clones) significant (of 8024)
  • Top of list

accession   gene        p-value     good1   groups2
W42723      GRO1        4.441E-16   low     2
R31441      GATA3       3.786E-14   high    3
W96197      DPP6        2.005E-12   high    3
AA452513    KNSL5       1.829E-11   low     3
H95792      ACADSB      2.695E-11   high    2
AA128362    MGC11352    5.209E-11   low     3
R59697      CDK8        7.875E-11   low     3
W73889      TNA         1.287E-10   high    3
AA478585    BTN3A3      5.044E-10   low     3
AA029948    NaN         7.146E-10   high    2
W56526      NaN         9.425E-10   high    2
R53330      Hs.287827   1.069E-09   low     2
N35341      NaN         1.239E-09   high    3
W69094      SCAND1      3.864E-09   low     2
H78560      IGFBP2      6.728E-09   high    3
H62527      GW128       6.893E-09   high    3

36
PubGene analysis (direct links)
Protein phosphatase 2
Symbol PPP2R5A scored 3.902
Gene       Ratio     log-2   level
PPP2R5A    29.301    4.873   1
PPP2R5E     7.623    2.930   2

Casein kinase II
Symbol CSNK2A1 scored 4.185
Gene       Ratio     log-2   level
CSNK2A1     2.833    1.502   1
CSNK2A2   116.711    6.867   2

Two mitochondrial ribosomal proteins
Symbol MRPS12 scored 3.591
Gene       Ratio     log-2   level
MRPS12     39.323    5.297   1
MRPL12      3.695    1.885   2
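
The scores listed above are numerically consistent with taking the mean of the log-2 ratios of a gene and its literature neighbor, for example (4.873 + 2.930) / 2 = 3.902 for PPP2R5A. The sketch below reproduces that arithmetic. The averaging rule is inferred from the numbers on the slide, not from a documented formula, and the function name is hypothetical.

```python
import math

def pair_score(ratios):
    """Mean of the log-2 expression ratios of a gene and its linked neighbors."""
    logs = [math.log2(r) for r in ratios]
    return sum(logs) / len(logs)

# Ratios as given on the slide for each gene and its direct neighbor.
ppp2r5a_score = pair_score([29.301, 7.623])
csnk2a1_score = pair_score([2.833, 116.711])
mrps12_score = pair_score([39.323, 3.695])
```

Averaging in log space means that a strongly regulated neighbor can lift the score of a gene whose own ratio is modest, as with CSNK2A1 and CSNK2A2.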
37
Full text example
  • Assumption: Figure captions will tend to
    represent a higher proportion of direct
    protein-protein interactions
  • Figures are directly useful for visualization of
    interactions and cellular functions
  • 10,000 articles from JBC, NAR and PNAS (XML-tagged,
    from HighWire Press (John Sack)) with
    50,000 figures

38
Liu et al., Bioinformatics. In press.
39
Co-workers
  • Tor-Kristian Jenssen
  • Fang Liu
  • Trevor Clancy
  • Vegard Nygaard
  • Torbjørn Rognes
  • Brit Helle Aarskog
  • John Sack