BioNLP Tutorial - PowerPoint PPT Presentation

Slides: 51
Provided by: compbi
Transcript and Presenter's Notes

1
BioNLP Tutorial
K. Bretonnel Cohen, Olivier Bodenreider, Lynette Hirschman
  • PSB 2006
  • Wailea, Maui, HI

2
The Biological Data Cycle
Expert curation
Bottleneck: getting knowledge from the literature
into databases

Solution: text mining
1
3
Model Organism Curation Pipeline
1. Select papers
2. List genes for curation
3. Curate genes from paper
1
4
Double exponential growth in the literature
  • New entries in Medline with publication date in
    Jan-Aug 2005: 431,478 (avg. 1,775/day)

1
5
Examples of BioNLP in action
1
6
Examples of BioNLP in action
1
7
Examples of BioNLP in action
1
8
Application types
  • Information retrieval: find documents in response
    to an information need

p53
Resistance to apoptosis, increased growth
potential, and altered gene expression in cells
that survived genotoxic hexavalent chromium
exposure. PMID 16283527
2
9
Application types
  • Question answering: question as input, answer as
    output

What is BRCA1?
A gene located on the seventeenth chromosome
associated with a risk of breast and ovarian
cancer
(Yu and Sable 2005)
2
10
Application types
  • Summarization
  • Input: one or more texts
  • Output: single (shorter) text

Information extraction: Information extraction
systems find statements about some specified type
of relationship in text. Entity identification
is a necessary prerequisite to information
extraction.

Information retrieval: Information retrieval is
classically defined as the location of documents
that are relevant to some information need.
PubMed is a premier example of a sophisticated
biomedical information retrieval system.

Summarization systems benefit from
high-performance entity identification and
normalization. Other approaches involve
information extraction.

Ling et al. (multiple documents); Lu et al.
(single document)
2
11
Application types
  • Information extraction: relationships between
    things
  • BINDING_EVENT
  • Binder
  • Bound

2
12
Application types
  • Met28 binds to DNA.
  • BINDING_EVENT
  • Binder: Met28
  • Bound: DNA
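
The template filling on this slide can be sketched in a few lines of Python. This is only an illustration: the `BindingEvent` class, the `BINDS_TO` pattern, and the `extract_binding_event` helper are hypothetical names, and the single regular expression covers only the literal "X binds to Y" phrasing.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    """The BINDING_EVENT template from the slide, with its two slots."""
    binder: str
    bound: str


# Hypothetical pattern: handles only the "X binds to Y" wording.
BINDS_TO = re.compile(r"(?P<binder>\w[\w-]*) binds to (?P<bound>\w[\w-]*)")


def extract_binding_event(sentence: str) -> Optional[BindingEvent]:
    """Fill the template from a sentence, or return None if nothing matches."""
    match = BINDS_TO.search(sentence)
    if match is None:
        return None
    return BindingEvent(binder=match.group("binder"), bound=match.group("bound"))


print(extract_binding_event("Met28 binds to DNA."))
# BindingEvent(binder='Met28', bound='DNA')
```

A paraphrase such as "Met28 and DNA bind" returns None with this pattern, which is the variability problem discussed later in the tutorial.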

Lussier (gene/phenotype); Maguitman
(protein/family); Chun (gene/disease); Höglund
(protein/location); Stoica (protein/function)
2
13
Application types
  • HSP60
  • Hsp-60
  • heat shock protein 60
  • Cerberus
  • wingless
  • Ken and Barbie
  • the

Entity identification
3
14
Application types
  • Entity normalization: find concepts in text and
    map them to unique identifiers

A locus has been found, an allele of which causes
a modification of some allozymes of the enzyme
esterase 6 in Drosophila melanogaster. There are
two alleles of this locus, one of which is
dominant to the other and results in increased
electrophoretic mobility of affected allozymes.
The locus responsible has been mapped to 3-56.7
on the standard genetic map (Est-6 is at 3-36.8).
Of 13 other enzyme systems analyzed, only leucine
aminopeptidase is affected by the modifier locus.
Neuraminidase incubations of homogenates altered
the electrophoretic mobility of esterase 6
allozymes, but the mobility differences found are
not large enough to conclude that esterase 6 is
sialylated.
3
15
Application types
  • Perfect entity identification finds 5 mentions;
    they correspond to just 2 genes
  • FBgn0000592 (esterase 6)
  • FBgn0026412 (leucine aminopeptidase)

A locus has been found, an allele of which causes
a modification of some allozymes of the enzyme
esterase 6 in Drosophila melanogaster. There are
two alleles of this locus, one of which is
dominant to the other and results in increased
electrophoretic mobility of affected allozymes.
The locus responsible has been mapped to 3-56.7
on the standard genetic map (Est-6 is at 3-36.8).
Of 13 other enzyme systems analyzed, only leucine
aminopeptidase is affected by the modifier locus.
Neuraminidase incubations of homogenates altered
the electrophoretic mobility of esterase 6
allozymes, but the mobility differences found are
not large enough to conclude that esterase 6 is
sialylated.
3
16
Application types
  • Partial list of synonyms for FBgn0000592
  • Esterase 6
  • Carboxyl ester hydrolase
  • CG6917
  • Est6
  • Est-D
  • Est-5
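
A minimal sketch of dictionary-based normalization, assuming the synonym lists shown on these slides are the whole lexicon. The `key` and `normalize` helpers are hypothetical. Matching ignores case, spaces, and hyphens so that "Est-6" and "Est6" collapse to the same key. Real systems also need disambiguation, which this sketch omits.

```python
import re

# Synonyms taken from the slides; the second gene has only its full name here.
SYNONYMS = {
    "FBgn0000592": ["Esterase 6", "Carboxyl ester hydrolase", "CG6917",
                    "Est6", "Est-D", "Est-5"],
    "FBgn0026412": ["leucine aminopeptidase"],
}


def key(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Map each normalized synonym to its gene identifier.
LOOKUP = {key(syn): gene_id for gene_id, syns in SYNONYMS.items() for syn in syns}


def normalize(mention: str):
    """Return the identifier for a mention, or None if it is not in the lexicon."""
    return LOOKUP.get(key(mention))


# The five mentions a perfect entity identifier would find in the abstract.
mentions = ["esterase 6", "Est-6", "leucine aminopeptidase",
            "esterase 6", "esterase 6"]
gene_ids = {normalize(m) for m in mentions}
print(len(mentions), "mentions ->", sorted(gene_ids))
```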

Chun (gene/disease); Johnson (ontology
alignment); Stoica (gene/function); Vlachos
(FlyBase mapping)
3
17
Biological Nomenclature: V-SNARE
Vesicle Soluble Maleic acid N-ethylimide
Sensitive Fusion Protein Attachment Protein
Receptor
4
(A. Morgan)
18
The Biological Data Cycle
Expert curation
What's the organizing principle for all of this?

4
19
Organizing principles
UMLS
4
20
Organizing principles
4
21
Ontologies as text mining resources
Neurofibromatosis type 2 (NF2) is often not
recognised as a distinct entity from peripheral
neurofibromatosis. NF2 is a predominantly
intracranial condition whose hallmark is
bilateral vestibular schwannomas. NF2 results
from a mutation in the gene named merlin, located
on chromosome 22.
(Uppal, S., and A. P. Coatesworth.
Neurofibromatosis Type 2. Int J Clin Pract, 57,
no. 8, 2003, pp. 698-703.)
4
22
Ontologies as text mining resources
Neurofibromatosis type 2 (NF2) is often not
recognised as a distinct entity from peripheral
neurofibromatosis. NF2 is a predominantly
intracranial condition whose hallmark is
bilateral vestibular schwannomas. NF2 results
from a mutation in the gene named merlin, located
on chromosome 22.
  • vestibular schwannoma manifestation of
    neurofibromatosis 2
  • neurofibromatosis 2 associated with mutation of
    merlin
  • merlin located on chromosome 22
  • Tumor manifestation of Disease
  • Disease associated with mutation of Gene
  • Gene located on Chromosome
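
One way to read this slide is that the type-level relations act as a filter on candidate extractions. The sketch below assumes that reading. The `SEMANTIC_TYPE` table, the `ALLOWED` set, and the last (deliberately wrong) candidate triple are illustrative, not part of any particular ontology API.

```python
# Instance -> semantic type, as paired up on the slide.
SEMANTIC_TYPE = {
    "vestibular schwannoma": "Tumor",
    "neurofibromatosis 2": "Disease",
    "merlin": "Gene",
    "chromosome 22": "Chromosome",
}

# Type-level relations from the slide.
ALLOWED = {
    ("Tumor", "manifestation of", "Disease"),
    ("Disease", "associated with mutation of", "Gene"),
    ("Gene", "located on", "Chromosome"),
}


def lift(triple):
    """Replace subject and object by their semantic types."""
    subject, relation, obj = triple
    return (SEMANTIC_TYPE.get(subject), relation, SEMANTIC_TYPE.get(obj))


def is_plausible(triple):
    """Keep a candidate only if its type-level form is licensed."""
    return lift(triple) in ALLOWED


candidates = [
    ("vestibular schwannoma", "manifestation of", "neurofibromatosis 2"),
    ("neurofibromatosis 2", "associated with mutation of", "merlin"),
    ("merlin", "located on", "chromosome 22"),
    ("chromosome 22", "located on", "merlin"),  # hypothetical bad extraction
]
accepted = [t for t in candidates if is_plausible(t)]
print(len(accepted), "of", len(candidates), "candidates accepted")
```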

Semantic types: Disease, Tumor, Gene, Chromosome
4
23
What's the state of the art?
  • Tasks differ greatly: finding human protein
    interactions (Bunescu 05) may be harder than
    finding inhibition relations (Pustejovsky 02)
  • Need a CASP-style competitive evaluation

Precision Specificity Recall Sensitivity
4
24
What's the state of the art?
  • KDD Cup (2002)
  • TREC Genomics (2003, 2004, 2005)
  • BioCreAtIvE (2004)
  • BioNLP (2004)

25
What's the state of the art?
1. Select papers: KDD 2002, TREC Genomics 2004
2. List genes for curation: BioCreAtIvE entity
   identification and entity normalization tasks
3. Curate genes from paper: BioCreAtIvE information
   extraction task (PDB → Gene Ontology)
5
26
What's the state of the art?
F-measure is balanced precision and recall:
F = 2PR / (P + R)
Recall = correctly identified / possible correct
Precision = correctly identified / identified
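
As a worked sketch of these definitions (precision = correct / identified, recall = correct / possible correct, F = 2PR / (P + R)), the helper below computes all three. The function name and the counts in the example call are made up for illustration.

```python
def precision_recall_f(correct: int, identified: int, possible: int):
    """Return (precision, recall, balanced F-measure) from raw counts."""
    precision = correct / identified if identified else 0.0
    recall = correct / possible if possible else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical system: 10 items returned, 8 of them right, 16 right answers exist.
p, r, f = precision_recall_f(correct=8, identified=10, possible=16)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.800 R=0.500 F=0.615
```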
3
27
What's the state of the art?
Blaschke et al.
5
28
What's the state of the art?
  • Cellular Component: 34.61% (561/1621)
  • Molecular Function: 33.00% (933/2827)
  • Biological Process: 23.02% (1011/4391)

Cellular component is easier because the task is
a relation between entities:
located_in(protein, cell_component).
Biological process is hardest because it is the
most abstract.
Blaschke et al.
5
29
2.5 types of solutions
  • Rule-based
    • Patterns
    • Grammars
  • Statistical/machine learning
    • Labelled training data
    • Noisy training data
  • Hybrid statistical/rule-based

Höglund (information extraction, gene →
localization); Maguitman (information extraction,
SWISSPROT → Pfam); Vlachos (entity normalization,
gene → FlyBase); Stoica (gene → GO code)
Chun (IE, multiple gene → UMLS disease); Ling
(summarization, FlyBase)
Johnson (ontology alignment, GO → other OBO); Lu
(summarization, Entrez Gene → GeneRIFs); Lussier
(information extraction, GOA → phenotype); Vlachos
(coreference, FlyBase Sequence Ont.)
5
30
Common tools/techniques
  • Stop word removal: eliminate features that are
    rarely helpful (the, a, and)
  • (Porter) stemming: convert inflected words to
    their stems (promot, mitochondri, cytochrom)
  • POS (part of speech) tagging: 80 categories
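
The first two steps can be sketched as follows. Note the assumption: `crude_stem` is a toy suffix stripper written for this example, not the Porter algorithm, and the suffix list was chosen only so that the slide's three stems come out. A real pipeline would use an actual Porter implementation.

```python
STOP_WORDS = {"the", "a", "and"}

# Toy suffix list, ordered longest first. This is NOT the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "er", "al", "e", "s")


def crude_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def preprocess(text: str):
    """Lowercase, split on whitespace, drop stop words, stem what remains."""
    tokens = text.lower().split()
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("the promoter and a mitochondrial cytochrome"))
# ['promot', 'mitochondri', 'cytochrom']
```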

5
31
Why text mining is difficult
  • Variability
  • Pervasive ambiguity at every level of analysis

5
32
Why text mining is difficult
  • Met28 binds to DNA
  • binding of Met28 to DNA
  • Met28 and DNA bind
  • binding between Met28 and DNA
  • Met28 is sufficient to bind DNA
  • DNA bound by Met28

2(6)
33
Why text mining is difficult
  • binding of Met28 to DNA
  • binding under unspecified conditions of Met28 to
    DNA
  • binding of this translational variant of Met28
    to DNA
  • binding of Met28 to upstream regions of DNA
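
To make the cost of this variability concrete, here is a sketch of a purely pattern-based extractor. It needs one hand-written pattern per paraphrase to cover the six variants, and it still breaks as soon as modifiers are inserted. The patterns and the `extract` helper are illustrative assumptions, not taken from any system in the session.

```python
import re

# One pattern per paraphrase of "Met28 binds to DNA".
PATTERNS = [re.compile(p) for p in (
    r"(?P<binder>\w+) binds to (?P<bound>\w+)",
    r"binding of (?P<binder>\w+) to (?P<bound>\w+)",
    r"(?P<binder>\w+) and (?P<bound>\w+) bind$",
    r"binding between (?P<binder>\w+) and (?P<bound>\w+)",
    r"(?P<binder>\w+) is sufficient to bind (?P<bound>\w+)",
    r"(?P<bound>\w+) bound by (?P<binder>\w+)",
)]


def extract(phrase: str):
    """Return (binder, bound) from the first matching pattern, else None."""
    for pattern in PATTERNS:
        match = pattern.search(phrase)
        if match:
            return match.group("binder"), match.group("bound")
    return None


variants = [
    "Met28 binds to DNA",
    "binding of Met28 to DNA",
    "Met28 and DNA bind",
    "binding between Met28 and DNA",
    "Met28 is sufficient to bind DNA",
    "DNA bound by Met28",
]
for phrase in variants:
    print(phrase, "->", extract(phrase))

# Inserted modifiers defeat every pattern, or produce a wrong slot filler.
print(extract("binding under unspecified conditions of Met28 to DNA"))
print(extract("binding of Met28 to upstream regions of DNA"))
```

The last call returns "upstream" as the bound entity, which is a silent error rather than a miss.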

2(6)
34
Why text mining is difficult
  • binding under unspecified conditions of this
    translational variant of Met28 to upstream
    regions of DNA

3(6)
35
Why text mining is difficult
  • Document segmentation
  • Sentence segmentation
  • Tokenization
  • Part of speech tagging
  • Parsing

5
36
Why text mining is difficult
  • Here, we show that Bifocal (Bif), a putative
    cytoskeletal regulator, is a component of the Msn
    pathway for regulating R cell growth targeting.
    bif displays strong genetic interaction with msn.
  • (Ruan et al. 2002)
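
The example above defeats a common sentence-splitting heuristic, because the second sentence starts with a lowercase gene name. A sketch with two naive regex splitters (both hypothetical, as is the "Ruan et al." test sentence) shows the trade-off:

```python
import re

TEXT = (
    "Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, "
    "is a component of the Msn pathway for regulating R cell growth targeting. "
    "bif displays strong genetic interaction with msn."
)


def split_strict(text: str):
    """Split after . ! ? only when the next sentence starts with a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def split_loose(text: str):
    """Split after every . ! ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


print(len(split_strict(TEXT)))  # 1: misses the boundary before "bif"
print(len(split_loose(TEXT)))   # 2: finds it

# The loose rule over-splits on abbreviations (made-up sentence).
print(split_loose("Ruan et al. 2002 reported this interaction."))
```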

6
(Baumgartner, in prep.)
37
Why text mining is difficult
  • lead
  • 69 tokens in GENIA
  • Bare stem verb: 34
  • 3rd person singular present tense verb: 29
  • Noun: 3
  • Past tense verb: 2
  • Past participle: 1
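
A quick calculation on these counts shows what the ambiguity costs a tagger that ignores context. The sketch assumes a most-frequent-tag baseline, which is a standard reference point but is not mentioned on the slide.

```python
from collections import Counter

# Tag counts for "lead" in GENIA, as listed on the slide.
LEAD_TAGS = Counter({
    "bare stem verb": 34,
    "3rd person singular present tense verb": 29,
    "noun": 3,
    "past tense verb": 2,
    "past participle": 1,
})

total = sum(LEAD_TAGS.values())
top_tag, top_count = LEAD_TAGS.most_common(1)[0]
baseline = top_count / total

print(f"{total} tokens; always guessing '{top_tag}' is right {baseline:.1%} of the time")
```

Always choosing the most frequent tag is correct for fewer than half of the 69 tokens, so context is required.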

6
38
Why text mining is difficult
  • HUNK
  • Human natural killer (cell type)
  • HUN kinase (gene/protein)
  • Radiological/orthopedic classification scheme
  • Piece of something

6
39
Why text mining is difficult
  • NaCT is expressed in liver, testis and brain in
    rat and shows preference for citrate over
    dicarboxylates (GeneRIF 266998, PMID 12177002)

NACT:
neoadjuvant chemotherapy (PMID 8898170)
N-acetyltransferase (PMID 10725313)
Na-coupled citrate transporter (PMID 12177002)
6
40
Why text mining is difficult
  • NaCT is expressed in liver, testis and brain in
    rat and shows preference for citrate over
    dicarboxylates (GeneRIF 266998, PMID 12177002)
  • (liver), (testis) and (brain in rat)
  • liver, (testis and brain in rat)
  • (liver, testis and brain in rat)
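
The three bracketings correspond to the modifier "in rat" attaching to the last one, two, or three conjuncts. A small sketch that enumerates those readings follows. The `modifier_scopes` function is hypothetical and considers only right-edge attachment, which is the case shown on the slide.

```python
def modifier_scopes(conjuncts, modifier):
    """Enumerate readings where a trailing modifier covers the last k conjuncts."""
    readings = []
    for k in range(1, len(conjuncts) + 1):
        unmodified = conjuncts[:-k]
        modified = [f"{c} {modifier}" for c in conjuncts[-k:]]
        readings.append(unmodified + modified)
    return readings


for reading in modifier_scopes(["liver", "testis", "brain"], "in rat"):
    print(reading)
# ['liver', 'testis', 'brain in rat']
# ['liver', 'testis in rat', 'brain in rat']
# ['liver in rat', 'testis in rat', 'brain in rat']
```

The number of readings grows with the number of conjuncts, and each additional ambiguous attachment in the sentence multiplies it.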

6
41
Why text mining is difficult
  • NaCT is expressed in liver, testis and brain in
    rat and shows preference for citrate over
    dicarboxylates (GeneRIF 266998, PMID 12177002)
  • shows preference for (citrate over
    dicarboxylates)
  • shows preference (for citrate) (over
    dicarboxylates)

7
42
Why text mining is difficult
  • regulation of cell migration and proliferation
    (PMID )
  • serine phosphorylation, translocation, and
    degradation of IRS-1 (PMID 16099428)
  • proliferation and regulation of cell migration
  • regulation of proliferation and cell migration
  • regulation of cell migration and regulation of
    cell proliferation

7
43
Why text mining is difficult
  • regulation of cell migration and proliferation
    (PMID )
  • serine phosphorylation, translocation, and
    degradation of IRS-1 (PMID 16099428)
  • degradation of IRS-1, translocation, and serine
    phosphorylation
  • serine phosphorylation, serine translocation, and
    serine degradation (of IRS-1)

7
44
Most biomedical text mining to date is ungrounded
  • Drosophila OBP76a is necessary for fruit flies to
    respond to the aggregation pheromone 11-cis
    vaccenyl acetate (PMID 15664166)
  • lush is completely devoid of evoked activity to
    the pheromone 11-cis vaccenyl acetate (VA),
    revealing that this binding protein is absolutely
    required for activation of pheromone-sensitive
    chemosensory neurons (PMID 15664171)

Entrez Gene ID 40136
7
45
The next step
  • Text mining can be a key tool for linking
    biological knowledge from the literature to
    structured data in biological databases
  • and databases to each other.

7
46
Papers in the text mining session
  • 5 papers on linkage to ontologies
    • Höglund et al.: generating cellular
      localization annotations
    • Lussier et al.: PhenoGO for capture of phenome
      data
    • Stoica and Hearst: functional annotation of
      proteins
    • Johnson et al.: ontology alignments
    • Vlachos et al.: ontology for name extraction,
      anaphora
  • 2 papers linking other sets of resources
    • Maguitman et al.: on bibliome to reproduce Pfam
      classes
    • Chun et al.: on linking genes and diseases
  • 2 papers on summarization, using linked resources
    • Lu et al.: automated GeneRIF extraction
    • Ling et al.: automated gene summary generation

7
47
Acknowledgements
  • Alex Morgan for several slides
  • Christian Blaschke for data and slides
  • Bill Baumgartner for sentence segmenter
    performance data
  • Helen Johnson for data on POS ambiguity in GENIA
  • Lu Zhiyong for syntactic ambiguity examples
  • Larry Hunter for current PubMed graph

7
48
How big is a humuhumunukunukuapuaa?
49
How big is a humuhumunukunukuapuaa?
50
(No Transcript)