Exploring and Exploiting the Biological Maze

1
Exploring and Exploiting the Biological Maze

Zoé Lacroix
Arizona State University

2
Data collection queries

Scientific protocol
Must be able to reproduce the process
Involve multiple resources
Data sources
Applications

3
Expressing scientific protocols

Scientific protocols mix design and
implementation
Design
What the protocol does (tasks)
Scientific objects involved
Implementation
How the protocol is executed
Data sources and applications

4
Expressing scientific protocols

Scientific protocols are driven by their
implementation
Scientists use the resources they know
data (quality)
access to data
format, limits, etc.
Scientists may not exploit better resources
because they do not know them
Queries should be driven by the design; the
implementation should meet the design needs

5
Example - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs

The alternative splicing pipeline will provide
a complete characterization of variations in
proteins due to splice variation or SNPs evident
in repositories of contiguous genome sequence
data and expressed sequence tags (ESTs). The
pipeline applies secondary structure, tertiary
structure, domain motif detection and sequence
comparison tools to proteins encoded by genes
with alternatively spliced forms or SNPs.
Courtesy of Dr. Marta Janer, Institute for
Systems Biology

6
Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs

From GenBank, dbEST and the RIKEN Clone
Collection, collect all EST and full-length cDNA
sequences from the target organisms of interest
(in this case, human and mouse) that match the
query proteins (mouse DNA binding proteins) using
tblastn. Map the query protein to the target DNA
sequences, keeping track of which query amino
acids correspond to which nucleotides.
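
The bookkeeping in the last sentence (which query amino acid corresponds to which nucleotides) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names and parameters are hypothetical, coordinates are assumed to be 1-based as in typical BLAST reports, and the alignment is assumed to contain no gaps or frameshifts.

```python
def codon_for_residue(query_pos, query_start, subject_start, strand=1):
    """Return the (first, last) nucleotide coordinates on the target DNA
    that encode query residue `query_pos`.

    Assumptions (hypothetical helper, not from the slides):
    - coordinates are 1-based
    - the alignment is ungapped, so each residue covers exactly 3 bases
    - `subject_start` is the first base of the codon aligned to `query_start`
    - strand is +1 (forward frames) or -1 (reverse frames)
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if query_pos < query_start:
        raise ValueError("residue lies before the aligned region")
    offset = 3 * (query_pos - query_start)
    if strand == 1:
        first = subject_start + offset
        return (first, first + 2)
    # On the minus strand the coordinates decrease along the protein.
    first = subject_start - offset
    return (first, first - 2)


def map_alignment(query_start, query_end, subject_start, strand=1):
    """Map every aligned query residue to its codon coordinates."""
    return {
        pos: codon_for_residue(pos, query_start, subject_start, strand)
        for pos in range(query_start, query_end + 1)
    }
```

For example, `map_alignment(1, 2, 7)` pairs residue 1 with bases 7 to 9 and residue 2 with bases 10 to 12. A real tblastn alignment with gaps would need the aligned strings themselves to keep this mapping correct.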

7
Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs

Data sources in Step 2: GenBank, dbEST and the RIKEN Clone Collection

8
Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs

Tools in Step 2: tblastn

9
Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs

Tasks in Step 2: collect the matching EST and full-length cDNA
sequences; map the query protein to the target DNA sequences

10
Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and SNPs

Scientific objects in Step 2: EST and full-length cDNA sequences,
query proteins, amino acids, nucleotides

11
Pipeline Selecting Target Proteins

Step 1: retrieve all proteins from SMART and
Swiss-Prot with a textual search on the keyword
"apoptosis"
Step 2: retrieve all proteins from Swiss-Prot
with a signal peptide feature and the keyword
"apoptosis"
Step 3: retrieve their binding partners from DIP,
BIND and the C. elegans dataset
Step 4: run through a signal peptide prediction
program such as SigPep to check for the presence
of signal peptides in each of the sequences
Step 5: a homology search using BLAST of the
retrieved sequences with proteins predicted from
the Drosophila melanogaster genome might yield
additional candidates
Output: final set of signal peptide proteins
involved in apoptosis

Courtesy of Dr. Terry Gaasterland, The
Rockefeller University
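
One possible reading of Steps 1 to 4 as set operations is sketched below. The data flow between the steps is not spelled out on the slide, so the function, its parameters and the toy identifiers are all assumptions made for illustration; Step 5 (the BLAST search) is left out, and `has_signal_peptide` stands in for a prediction program such as SigPep.

```python
def select_targets(keyword_hits, annotated_signal, partners_of, has_signal_peptide):
    """Sketch of the target-selection pipeline (hypothetical helper).

    keyword_hits       -- proteins found by the "apoptosis" keyword search (Step 1)
    annotated_signal   -- Swiss-Prot proteins with a signal peptide feature
                          and the keyword (Step 2)
    partners_of        -- dict: protein -> set of binding partners (Step 3)
    has_signal_peptide -- callable standing in for a predictor (Step 4)
    """
    candidates = set(keyword_hits) | set(annotated_signal)
    # Step 3 (assumed reading): add the partners of the Step 2 proteins.
    for protein in annotated_signal:
        candidates |= set(partners_of.get(protein, ()))
    # Step 4: keep only sequences the predictor flags as signal peptide proteins.
    return {p for p in candidates if has_signal_peptide(p)}


# Toy data, invented for the example.
toy_result = select_targets(
    keyword_hits={"P1", "P2"},
    annotated_signal={"P2"},
    partners_of={"P2": {"P3"}},
    has_signal_peptide=lambda p: p in {"P2", "P3"},
)
```

Here `toy_result` contains `P2` and `P3`: `P1` is dropped by the predictor, and `P3` enters only as a binding partner of `P2`.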
12
Design and implementation
13
Expressing scientific pipelines with BioNavigation

Queries are expressed at a conceptual level
(design)

Conceptual level (scientific classes): Disease, Gene, Protein Seq., DNA Seq., Citation
14
Conceptual graph

Labeled edges
Scientifically meaningful edges

15
Conceptual graph
16
Mapping to physical resources
17
Mapping to physical resources
Conceptual level (scientific classes): Disease, Protein Seq., DNA Seq., Gene, Citation
Physical level (data sources): GenBank, PubMed, HUGO, OMIM, NCBI Protein
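
The idea of mapping a conceptual query onto physical resources can be sketched as a lookup table plus an expansion step. The transcript does not preserve which class the slide connects to which source, so the pairing below is an assumption based on what each source is commonly used for, and `physical_paths` is a hypothetical helper.

```python
from itertools import product

# Assumed pairing of scientific classes with data sources (not from the slide).
CLASS_TO_SOURCES = {
    "Disease": ["OMIM"],
    "Gene": ["HUGO"],
    "Citation": ["PubMed"],
    "DNA Seq.": ["GenBank"],
    "Protein Seq.": ["NCBI Protein"],
}


def physical_paths(conceptual_path, mapping=CLASS_TO_SOURCES):
    """Expand a path over scientific classes into every path over data sources.

    Each class may be implemented by several sources, so one conceptual
    path can yield several physical paths.
    """
    choices = [mapping[concept] for concept in conceptual_path]
    return [list(combo) for combo in product(*choices)]
```

With the table above, `physical_paths(["Disease", "Citation"])` yields the single path OMIM to PubMed. Adding a second source for a class multiplies the number of candidate paths, which is where the choice between resources becomes a real question.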
18
Exploring biological metadata

Return all citations that are related to some
disease or condition
Diabetes: 11, Aging: 71, Cancer: 391

Link: Entrez provides an index with the Links in
the display option from each entry
Parse: parsing each entry to retrieve its
related entries
All: Entrez provides an index with the Links in
the display option, which allows looking at a set
of entries at a time

19
Selecting biological resources

3 resources that look the same
Are they the same?
3 paths that will retrieve PubMed entries related
to citations
Do they have the same semantics?

20
Results for the disease conditions diabetes,
aging and cancer
21
Overlap of results for the disease condition
diabetes
22
Evaluating resources

Similar applications
Different outputs
Similar data sources
Different output
Number of resources
Different output
Order of resources
Different output
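
A simple way to check whether alternative resources or paths really return "the same" result is to compare their result sets. The sketch below assumes each path has already been evaluated into a set of entry identifiers; the function name, the path labels and the toy identifiers are invented for illustration and are not results from the presentation.

```python
def overlap_report(results):
    """Compare result sets retrieved through different paths.

    results -- dict: path name -> set of retrieved entry identifiers
               (must contain at least one path)
    Returns the number of entries common to all paths and, per path,
    its total size and the number of entries no other path returned.
    """
    names = sorted(results)
    common = set.intersection(*(set(results[n]) for n in names))
    report = {"common": len(common), "paths": {}}
    for name in names:
        others = set().union(*(results[m] for m in names if m != name))
        report["paths"][name] = {
            "total": len(results[name]),
            "unique": len(set(results[name]) - others),
        }
    return report


# Toy identifiers, invented for the example.
toy_report = overlap_report({
    "Link": {1, 2, 3},
    "Parse": {2, 3, 4},
    "All": {3, 5},
})
```

In this toy case only entry 3 is returned by all three paths, and each path also returns one entry that the others miss, which is the kind of difference in output the slide points at.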

23
Exploiting semantics of resources

Number of entries
Characterization of entries (number of
attributes)
Time

24
Exploiting the semantics of links
25
BioNavigation (joint work with Louiqa Raschid and
Maria-Esther Vidal)

Conceptual graph
No labeled links
Queries
Regular expressions of concepts
ESearch
Path cardinality: number of instances of paths
of the result. For a path of length 1 between two
sources S1 and S2, it is the number of pairs (e1,
e2) of entries e1 of S1 linked to an entry e2 of
S2.
Target object cardinality: number of distinct
objects retrieved from the final data source.
Evaluation cost: cost of the evaluation plan,
which involves both the local processing cost and
remote network access delays.
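
The first two metrics can be illustrated on toy link data. This is a sketch under stated assumptions, not BioNavigation's implementation: links between consecutive sources are given as sets of (entry, entry) pairs, paths longer than 1 are assumed to be counted by joining consecutive link sets, and all names and identifiers are hypothetical. Evaluation cost is not modeled.

```python
def path_instances(link_tables):
    """Enumerate path instances through a chain of sources.

    link_tables -- list of sets of (source_entry, target_entry) pairs,
                   one set per hop (S1->S2, S2->S3, ...)
    """
    paths = sorted(link_tables[0])
    for links in link_tables[1:]:
        by_source = {}
        for src, dst in links:
            by_source.setdefault(src, []).append(dst)
        # Extend every partial path with each entry linked from its last entry.
        paths = [p + (dst,) for p in paths for dst in sorted(by_source.get(p[-1], []))]
    return paths


def path_cardinality(link_tables):
    """Number of path instances in the result."""
    return len(path_instances(link_tables))


def target_object_cardinality(link_tables):
    """Number of distinct objects reached in the final data source."""
    return len({p[-1] for p in path_instances(link_tables)})


# Toy links: disease entries -> gene entries -> citation entries.
hop1 = {("d1", "g1"), ("d1", "g2"), ("d2", "g2")}
hop2 = {("g1", "c1"), ("g2", "c1"), ("g2", "c2")}
```

For the single hop, `path_cardinality([hop1])` is 3, matching the pair-counting definition for a path of length 1. Over both hops there are 5 path instances but only 2 distinct citations, which shows why the two metrics are reported separately.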