Title: Exploring and Exploiting the Biological Maze
1Exploring and Exploiting the Biological Maze
- Zoé Lacroix
- Arizona State University
2Data collection queries
- Scientific protocol
- Must be able to reproduce the process
- Involve multiple resources
- Data sources
- Applications
3Expressing scientific protocols
- Scientific protocols mix design and implementation
- Design
- What the protocol does (tasks)
- Scientific objects involved
- Implementation
- How the protocol is executed
- Data sources and applications
4Expressing scientific protocols
- Scientific protocols are driven by their
implementation
- Scientists use the resources they know
- data (quality)
- access to data
- format, limits, etc.
- Scientists may not exploit better resources
because they do not know them
- Queries should be driven by the design; the
implementation should meet the design needs
5Example - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs
- The alternative splicing pipeline will provide
a complete characterization of variations in
proteins due to splice variation or SNPs evident
in repositories of contiguous genome sequence
data and expressed sequence tags (ESTs). The
pipeline applies secondary structure, tertiary
structure, domain motif detection and sequence
comparison tools to proteins encoded by genes
with alternatively spliced forms or SNPs.
- Courtesy of Dr. Marta Janer, Institute for
Systems Biology
6Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs
- From GenBank, dbEST and the RIKEN Clone
Collection, collect all EST and full-length cDNA
sequences from the target organisms of interest
(in this case, human and mouse) that match the
query proteins (mouse DNA binding proteins) using
tblastn. Map the query protein to the target DNA
sequences, keeping track of which query amino
acids correspond to which nucleotides.
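The bookkeeping in the last sentence of this step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name and arguments are hypothetical, the aligned strings are assumed to come from one tblastn alignment segment (protein query against translated subject), and frameshifts are ignored.

```python
def map_residues_to_codons(query_aln, subject_aln, query_start,
                           subject_start, forward=True):
    """Map each aligned query amino acid to the nucleotides of its codon.

    query_aln, subject_aln: aligned protein strings ('-' marks a gap).
    query_start: 1-based position of the first aligned query residue.
    subject_start: 1-based nucleotide position where the alignment
        starts on the subject DNA sequence.
    forward: True for a plus-strand frame, False for a minus-strand frame.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have equal length")
    step = 3 if forward else -3
    q_pos, s_pos = query_start, subject_start
    mapping = {}
    for q_char, s_char in zip(query_aln, subject_aln):
        # Only columns where both sequences have a residue yield a pair.
        if q_char != "-" and s_char != "-":
            if forward:
                mapping[q_pos] = (s_pos, s_pos + 1, s_pos + 2)
            else:
                mapping[q_pos] = (s_pos, s_pos - 1, s_pos - 2)
        if q_char != "-":
            q_pos += 1      # one query amino acid consumed
        if s_char != "-":
            s_pos += step   # one subject codon (three nucleotides) consumed
    return mapping
```

For example, `map_residues_to_codons("MK-L", "MKAL", 10, 100)` maps query residue 12 to nucleotides 109 to 111, because the subject codon at 106 to 108 has no query counterpart.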
7Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs
Data sources
8Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs
tools
9Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs
tasks
10Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs
Scientific objects
11Pipeline Selecting Target Proteins
- Step 1: retrieve all proteins from SMART and
Swiss-Prot with textual search with the keyword
apoptosis
- Step 2: retrieve all proteins from Swiss-Prot
with a signal peptide feature and the keyword
apoptosis
- Step 3: retrieve their binding partners from
DIP, BIND and the C. elegans dataset
- Step 4: run through a signal peptide prediction
program such as SigPep to check for the presence
of signal peptides in each of the sequences
- Step 5: homology search using BLAST of the
retrieved sequences with proteins predicted from
the Drosophila melanogaster genome might yield
additional candidates
- Output: final set of signal peptide proteins
involved in apoptosis
- Courtesy of Dr. Terry Gaasterland, The
Rockefeller University
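The five steps can be read as a sequence of set operations. The sketch below is one possible reading under stated assumptions: the slide does not say how the results of the steps are combined, so the unions and the final filter are guesses, and every name here (`Protein`, `select_targets`, the toy in-memory inputs standing in for SMART, Swiss-Prot, the interaction datasets, the predictor and BLAST) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    acc: str                         # accession, used as the identifier
    keywords: frozenset              # annotation keywords
    has_signal_feature: bool = False


def keyword_search(source, keyword):
    """Accessions of all entries in `source` annotated with `keyword`."""
    return {p.acc for p in source if keyword in p.keywords}


def select_targets(smart, swissprot, interactions, has_signal_peptide,
                   homologs, keyword="apoptosis"):
    # Step 1: textual keyword search in both sources.
    by_keyword = keyword_search(smart, keyword) | keyword_search(swissprot, keyword)
    # Step 2: Swiss-Prot entries with a signal peptide feature and the keyword.
    annotated = {p.acc for p in swissprot
                 if keyword in p.keywords and p.has_signal_feature}
    retrieved = by_keyword | annotated
    # Step 3: binding partners of the proteins retrieved so far
    # (interactions is a set of unordered pairs given as tuples).
    partners = {b for a, b in interactions if a in retrieved}
    partners |= {a for a, b in interactions if b in retrieved}
    candidates = retrieved | partners
    # Step 4: keep candidates with an annotated or predicted signal peptide.
    confirmed = {acc for acc in candidates
                 if acc in annotated or has_signal_peptide(acc)}
    # Step 5: homologs are reported separately as additional candidates.
    extra = set()
    for acc in confirmed:
        extra |= homologs.get(acc, set())
    return confirmed, extra - confirmed
```

Returning the homologs separately keeps the slide's wording ("might yield additional candidates") visible in the result instead of silently merging unchecked sequences into the final set.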
12Design and implementation
13Expressing scientific pipelines with BioNavigation
- Queries are expressed at a conceptual level
(design)
- Conceptual level, scientific classes: Disease,
Protein Seq., DNA Seq., Gene, Citation
14Conceptual graph
- Labeled edges
- Scientifically meaningful edges
15Conceptual graph
16Mapping to physical resources
17Mapping to physical resources
- Conceptual level, scientific classes: Disease,
Protein Seq., DNA Seq., Gene, Citation
- Physical level, data sources: GenBank, PubMed,
HUGO, OMIM, NCBI Protein
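A minimal sketch of how a conceptual path could be mapped to physical resources. The pairing of classes with sources is an assumption (the slide lists both levels but the extracted text does not state which source implements which class), and `SOURCES_FOR_CLASS` and `physical_paths` are hypothetical names.

```python
from itertools import product

# Assumed pairing of scientific classes with data sources.
SOURCES_FOR_CLASS = {
    "Disease": ["OMIM"],
    "Gene": ["HUGO"],
    "DNA Seq.": ["GenBank"],
    "Protein Seq.": ["NCBI Protein"],
    "Citation": ["PubMed"],
}


def physical_paths(conceptual_path, sources_for_class=SOURCES_FOR_CLASS):
    """List every sequence of data sources that implements a conceptual path.

    conceptual_path: list of scientific class names, in traversal order.
    When a class is served by several sources, each choice yields a
    separate physical path.
    """
    choices = [sources_for_class[cls] for cls in conceptual_path]
    return [list(combo) for combo in product(*choices)]
```

With the assumed mapping, `physical_paths(["Disease", "Gene", "Citation"])` yields the single path OMIM, HUGO, PubMed; adding a second source for a class multiplies the number of candidate paths, which is where resource selection becomes a real choice.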
18Exploring biological metadata
- Return all citations that are related to some
disease or condition
- Diabetes: 11, Aging: 71, Cancer: 391
- Link: Entrez provides an index with the Links in
the display option from each entry
- Parse: parsing each entry to retrieve its
related entries
- All: Entrez provides an index with the Links in
the display option, which allows looking at a set
of entries at a time
19Selecting biological resources
- 3 resources that look the same
- Are they the same?
- 3 paths that will retrieve PubMed entries related
to citations
- Do they have the same semantics?
20Results for the disease conditions diabetes,
aging and cancer
21Overlap results for the disease conditions
diabetes
22Evaluating resources
- Similar applications
- Different outputs
- Similar data sources
- Different output
- Number of resources
- Different output
- Order of resources
- Different output
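One way to quantify "different output" is to compare the sets of entries returned by each alternative. The sketch below is an illustration only: `overlap_report` is a hypothetical helper, and any identifiers passed to it in examples are invented, not the results shown on the slides.

```python
from itertools import combinations


def overlap_report(results):
    """Pairwise comparison of result sets.

    results: dict mapping a method or path name to the set of entry
        identifiers it retrieved.
    Returns a dict keyed by name pairs (in sorted order) with the number
    of shared entries, the entries unique to each side, and the Jaccard
    index (shared / union).
    """
    report = {}
    for a, b in combinations(sorted(results), 2):
        shared = results[a] & results[b]
        union = results[a] | results[b]
        report[(a, b)] = {
            "shared": len(shared),
            "only_first": len(results[a] - results[b]),
            "only_second": len(results[b] - results[a]),
            # Two empty result sets are treated as identical.
            "jaccard": len(shared) / len(union) if union else 1.0,
        }
    return report
```

A Jaccard index of 1.0 means two alternatives returned exactly the same entries; anything lower indicates that resources which look the same do not have the same semantics for that query.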
23Exploiting semantics of resources
- Number of entries
- Characterization of entries (number of
attributes)
- Time
24Exploiting the semantics of links
25BioNavigation (joint work with Louiqa Raschid and
Maria-Esther Vidal)
- Conceptual graph
- No labeled links
- Queries
- Regular expressions of concepts
- ESearch
- Path cardinality: number of instances of paths
of the result. For a path of length 1 between two
sources S1 and S2, it is the number of pairs (e1,
e2) of entries e1 of S1 linked to an entry e2 of
S2.
- Target object cardinality: number of distinct
objects retrieved from the final data source.
- Evaluation cost: cost of the evaluation plan,
which involves both the local processing cost and
remote network access delays.
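The first two metrics can be computed directly from link data. The sketch below assumes the links between consecutive sources are available as sets of entry pairs, and it extends the length-1 definition to longer paths by counting chains of linked entries, which is an assumption about the intended meaning. `evaluate_path` is a hypothetical name, and evaluation cost is not modeled.

```python
def evaluate_path(links):
    """Compute (path cardinality, target object cardinality).

    links: list with one set of (e1, e2) pairs per edge of the path,
        where e1 is an entry of the edge's source and e2 an entry of
        its target.
    """
    if not links:
        raise ValueError("a path needs at least one edge")
    # counts[e] = number of path instances that currently end at entry e
    counts = {}
    for _, e2 in links[0]:
        counts[e2] = counts.get(e2, 0) + 1
    for edge in links[1:]:
        extended = {}
        for e1, e2 in edge:
            if e1 in counts:  # only links that continue an existing path
                extended[e2] = extended.get(e2, 0) + counts[e1]
        counts = extended
    return sum(counts.values()), len(counts)
```

For a single edge the first value is simply the number of pairs, matching the definition above; the second value counts each reachable entry of the final source once, however many paths lead to it.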
26Work in progress
- Conceptual graph
- Labeled links
- Queries
- Complex dataflows
- Physical graph
- Access to a BioMetaDatabase
- Data sources
- Applications
27Representing the conceptual graph in Protégé
28Visualization Limitations in Protégé
- Using the GraphViz plugin
- Shows only IsA hierarchy
30Conclusion
- Scientists need support to select resources to
express their protocols
- Semantics of resources may be exploited to
enhance the data collection process
- Need for a repository of biological metadata
(BioMetaDatabase)