Integrated Data Systems for Genomic Analysis. Genomics and Bioinformatics for the Advancement of Clinical Sciences, Thomas Jefferson University, Oct. 14, 2002

1
Integrated Data Systems for Genomic Analysis
Genomics and Bioinformatics for the Advancement of Clinical Sciences
Thomas Jefferson University, Oct. 14, 2002
  • Chris Stoeckert, Ph.D.
  • Dept. of Genetics, Center for Bioinformatics
  • University of Pennsylvania

2
Plasmodium genomics: Genomics and proteomics pave
the way for controlling malaria
(Nature, October 3, 2002)
3
Thinking Genomically
Genome to Phenotype
  • Genome structure
  • Genes and function
  • Pathways
  • Expression patterns
  • (Complex) diseases

4
Using a Genomics Unified Schema (GUS) to ask
genomic questions
The Genomics Unified Schema (GUS) is a relational
database that warehouses and integrates
biological sequence, sequence annotation, and
gene expression data from a number of
heterogeneous sources. User-friendly web
interfaces present slices of the GUS database and
allow researchers to execute structured queries
for information concerning gene structure,
function, and expression.
5
GUS Powers Multiple Genomics Projects
AllGenes
AllGenes is based on a comprehensive mouse and
human gene index. The genes are approximated by
transcripts predicted from EST and mRNA clustering.
PlasmoDB
PlasmoDB is the official database of the
Plasmodium falciparum genome project. It
provides an integrated view of genome sequence
data, including expression data from EST, SAGE,
and microarray projects.
EPConDB
EPConDB is an index of genes expressed in
endocrine pancreas. Expression is defined either
through microarray experiments or sequence
annotation.
6
allgenes.org query
"Is my cDNA similar to any mouse genes that are
predicted to encode transcription factors and
have been localized to mouse chromosome 5?"
This query illustrates several aspects of the GUS
database including
Data Integration Data Analysis Tools
RHMap GOFunction Sequence GOFunction assignments Boolean function History function BLAST
http://www.allgenes.org/ Steve Fischer, Debbie
Pinney, Brian Brunk, Joan Mazzarelli, Jonathan
Crabtree, Yongchang Gan, Sharon Diskin, Nikolay
Kolchanov, Alexey Katohkin
7
Select the allgenes.org boolean query page
8
Choose the RH map and GO function queries
9
There are 26 mouse RNAs (assemblies) that meet
these criteria
10
Now use the BLAST page to identify RNAs similar
to my cDNA
11
Intersect ("AND") the BLAST search with the
previous query
And we have our answer (the third row on the
query history page)
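The history-page intersection can be pictured as a set operation on the assembly identifiers returned by each query. The Python sketch below is only an illustration of that idea: the identifiers are made up, and the QueryHistory class and its method names are hypothetical, not part of allgenes.org or GUS.

```python
from dataclasses import dataclass, field


@dataclass
class QueryHistory:
    """Hypothetical stand-in for a query history page: stores named result sets."""
    results: list = field(default_factory=list)

    def add(self, label, ids):
        """Record a query result and return its 1-based row number."""
        self.results.append((label, frozenset(ids)))
        return len(self.results)

    def intersect(self, row_a, row_b):
        """Combine two earlier rows with AND and store the result as a new row."""
        label_a, ids_a = self.results[row_a - 1]
        label_b, ids_b = self.results[row_b - 1]
        return self.add(f"({label_a}) AND ({label_b})", ids_a & ids_b)


history = QueryHistory()
# Made-up assembly identifiers, for illustration only.
row1 = history.add("RH map chr5 AND GO function", {"DT.1", "DT.2", "DT.3"})
row2 = history.add("BLAST hits for my cDNA", {"DT.2", "DT.9"})
row3 = history.intersect(row1, row2)
answer = sorted(history.results[row3 - 1][1])
print(row3, answer)
```

Each query keeps its own row, so the combined result lands on a new row (here the third) while the earlier rows stay available for other combinations.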
12
(No Transcript)
13
PlasmoDB query
"List all genes whose proteins are predicted to
contain a signal peptide and for which there is
evidence that they are expressed in Plasmodium
falciparum's merozoite stage."
This query illustrates several aspects of the GUS
database including
Data Integration Data Analysis Tools
Genome annotation Mass spec Sequence analysis History function
http://plasmodb.org/ David Roos, Jessie
Kissinger, Bindu Gajria, Martin Fraunholz, Jules
Milgram, Phil Labo, Amit Bahl, Dave Pearson,
Dinesh Gupta, Hagai Ginsburg, Jonathan Crabtree,
Jonathan Schug, Brian Brunk, Greg Grant, Trish
Whetzel, Matt Mailman, Li Li
14
Select Queries from the PlasmoDB homepage
15
Choose chromosome and gene/prediction type, then submit
16
Choose Gene Expression from the queries page,
then Proteomics. Then choose chromosome, lifecycle
stage, and evidence, and submit
17
Go to the history page and choose which simple
queries to combine. Select intersect.
18
There is a variety of information available from
the report page including
19
EPConDB query
"Which DOTS assemblies (RNA) represented on the
Endocrine Pancreas Consortium's chip 2.0 are
constituents of the insulin-initiated signal
transduction pathway?"
Data Integration Data Analysis Tools
Sequence Microarray experiment Transduction pathway BLAST History function
20
http://www.cbil.upenn.edu/EPConDB Klaus Kaestner,
Marie Scearce, John Brestelli, Phillip
Le, Elisabetta Manduchi, Angel Pizarro, Debbie
Pinney, Greg Grant, Joan Mazzarelli, Jonathan
Crabtree, Hongxian He, Shannon McWeeney, Matt
Mailman
21
Go to the gene information query page and click
on DOTS assemblies involved in a pathway
22
Choose the insulin pathway, a p-value, pancreas,
the species, and whether an assembly must include
an mRNA - submit
23
Return to the gene information query page and
select clone sets. Choose chip 2.0 - submit
24
Go to the history page, select the queries to
combine, and select intersect to view the results
25
Using Databases to Think Genomically
  • Draw attention to these resources
  • Show how different data sources and approaches
    can be used to ask powerful questions
  • This can be done for different organisms,
    different systems

26
How GUS Works
AllGenes
PlasmoDB
EPConDB
Java Servlets

Other sites, Other projects, e.g. GeneDB
Oracle RDBMS
Object Layer for Data Loading
27
Goals of GUS
  • Generic platform for model organism or disease
    specific databases
  • Integration of genome, transcript and protein
    data, including
  • Sequence
  • Function
  • Expression
  • Interaction
  • Regulation
  • Orthologs and paralogs
  • Support for
  • automated annotation and integration
  • manual curation
  • data mining/analysis and sophisticated queries
  • web access

http://www.gusdb.org Jonathan Crabtree, Jonathan
Schug, Steve Fischer, Elisabetta Manduchi, Angel
Pizarro, Junmin Liu, Debbie Pinney, Greg Grant,
Trish Whetzel, Li Li, Sharon Diskin, Hongxian He
28
Architecture of GUS
Data sources: QTL, POP, SNP, clinical; genomic sequence; GenBank, InterPro, GO, etc.; microarray and SAGE experiments; mapping data; GSSs, ESTs; annotation
Object Layer
Oracle/SQL
Domains: DoTS, TESS, RAD, Core, SRes
Interfaces: Java Servlets, Perl CGI
29
Five domains
GUS is divided into 5 domains (separate name
spaces)
Protein Abundance DB domain underway
30
DoTS central dogma schema
Gene
Gene Instance
Gene Feature (isa NA Feature)
Genomic Sequence (isa NA Sequence)
RNA
RNA Instance
RNA Feature (isa NA Feature)
RNA Sequence (isa NA Sequence)
Protein
Protein Instance
Protein Feature (isa AA Feature)
Protein Sequence (isa AA Sequence)
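The isa relationships on this slide can be read as subtyping. A minimal Python sketch under that reading follows. The class names mirror the slide, but the constructor fields and the span method are assumptions made for illustration and do not reproduce the actual GUS tables.

```python
class NASequence:
    """Nucleic acid sequence: the base type named on the slide."""
    def __init__(self, residues):
        self.residues = residues


class AASequence:
    """Amino acid sequence: the base type for protein sequences."""
    def __init__(self, residues):
        self.residues = residues


class GenomicSequence(NASequence):
    """Genomic Sequence isa NA Sequence."""


class RNASequence(NASequence):
    """RNA Sequence isa NA Sequence."""


class ProteinSequence(AASequence):
    """Protein Sequence isa AA Sequence."""


class NAFeature:
    """A located feature on a nucleic acid sequence (fields are hypothetical)."""
    def __init__(self, sequence, start, end):
        self.sequence = sequence
        self.start = start
        self.end = end

    def span(self):
        """Return the residues covered by this feature (0-based, end-exclusive)."""
        return self.sequence.residues[self.start:self.end]


class GeneFeature(NAFeature):
    """Gene Feature isa NA Feature."""


class RNAFeature(NAFeature):
    """RNA Feature isa NA Feature."""


genomic = GenomicSequence("TTATGGCCTAA")
gene_feature = GeneFeature(genomic, 2, 8)
print(gene_feature.span(), isinstance(gene_feature, NAFeature))
```

Modelling the subtypes this way means any code written against NA Sequence or NA Feature also accepts the gene-level and RNA-level types.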
31
RAD schema uses MAGE/MIAME
MAGE: Experiment, Array, BioMaterial, BioAssay, BioAssayData, Protocol, Descr., HigherLevelAnalysis
MIAME: Experimental Design, Array design, Samples, Hybridization, Measure, Normalization
32
http://www.mged.org
33
Journals are Adopting the MGED Standards
Use of Minimum Information About a Microarray
Experiment (MIAME)
34
TESS Schema
TESS.Moiety: Moiety, MoietyHeterodimer, MoietyMultimer, MoietyComplex
TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity
TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
TESS.FootprintInstance
TESS.TrainingSet
TESS.ParameterGroup
TESS.Note
35
RAD
DoTS
EST clustering and assembly
TESS
36
Using GUS for Genomic Research
  • Annotating mouse chromosome 5
    (Maja Bucan)
  • Identifying novel genes expressed in the
    endocrine pancreas
    (Klaus Kaestner, Alan Permutt, Doug Melton)
  • Identifying genes regulated by CREB
    (Allan Pack, Mirek Mackiewicz)

37
Annotation of Mouse Chromosome 5
  • What are all the genes?
  • What is their structure and function?
  • Where are they expressed and how is this
    regulated?

Maja Bucan, Otto Valladeres, Kyle
Gaulton, Jonathan Crabtree, Yongchang Gan, Joan
Mazzarelli, Jonathan Schug
38
Areas of Focus on Mouse Chromosome 5
Rw as a balancer
39
Approach to Annotating Mouse Chromosome 5
  • Genomic sequence
  • Public release chromosome 5 has many gaps
  • Celera
  • Combine to eliminate gaps where possible
  • Gene models
  • Ensembl predictions
  • Celera predictions
  • BLAT alignment of DoTS
  • Comparison to human regions

40
Known RefSeq Genes in the 72-76 Mb Region as Viewed
in the UCSC Genome Browser
Only 14 RefSeq Genes plus an additional 7 from
Ensembl
41
Known Genes on Mouse Chromosome 5
MGI approved symbols
72Mb
5033405K12Rik, 6030432N09Rik, 1810027I20Rik, AI836376,
Sgcb, 1700067I02Rik, C78283, 2700023E23Rik,
1190017B18Rik, 6720475M21Rik, 1300019H17Rik, Lnx1, Chic2, Gsh2,
Pdgfra, Kit, Kdr, Gabarapl2 (homolog), Srd5a2l, Tparl,
Clock, Pdcl2, Nmu
Gene symbol synonyms
KIAA1458 KIAA0826 LOC231293 KIAA0276 FLJ12552
Identified 28 known genes
15 genes have assigned GO Functions
5 enzyme, 4 signal transducer, 4 ligand binding or
carrier, 3 nucleic acid binding, 2 transporter
76Mb
42
Example of Known Mouse Chromosome 5 Gene - Chic2

Alignment reveals exon differences between RNAs
belonging to gene (Alternative forms)
43
Putative Genes on Mouse Chromosome 5
putative gene mouse chr5. Note: multi-exon alignment; single IMAGE clone 583253; polyA signal suggests 3' end of gene
putative gene mouse chr5. Note: singleton ESTs from IMAGE clone 551428 align
putative gene mouse chr5. Note: multi-exon alignment; ESTs from single IMAGE clone 515319; possible polyA signal in 3' sequence
putative gene mouse chr5. Note: multiple span alignment; 9/02 - RNAs also aligning to another region of mouse chr5
putative gene mouse chr5. Note: 3 ESTs in assembly from embryo
. .
Total: 21 (some putative genes may later be merged)
44
Example of a Putative Mouse Gene
Example: DT.40155293, IMAGE clone sequences (5'
and 3' in same assembly)
45
Genes on Mouse Chromosome 5
  • 72-76 Mb region
  • 65 genes from automated DoTS analysis
  • 49 manual evaluation
  • 21 Ensembl genes
  • 14 RefSeq genes
  • Whole chromosome 5 (151 Mb)
  • 2157 genes from automated DoTS analysis
  • 1275 Ensembl genes
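As a quick check on how these counts compare, the ratios can be computed directly. The Python sketch below uses only the numbers listed on this slide; the variable names and the ratio helper are illustrative.

```python
# Gene counts taken from the slide.
region_72_76_mb = {
    "DoTS automated": 65,
    "DoTS manual evaluation": 49,
    "Ensembl": 21,
    "RefSeq": 14,
}
whole_chr5 = {"DoTS automated": 2157, "Ensembl": 1275}


def ratio(count, baseline):
    """Return count / baseline rounded to two decimal places."""
    return round(count / baseline, 2)


# Manually evaluated DoTS genes relative to Ensembl genes in the 72-76 Mb region.
region_manual_vs_ensembl = ratio(
    region_72_76_mb["DoTS manual evaluation"], region_72_76_mb["Ensembl"]
)
# Automated DoTS genes relative to Ensembl genes across the whole chromosome.
whole_auto_vs_ensembl = ratio(whole_chr5["DoTS automated"], whole_chr5["Ensembl"])

print(region_manual_vs_ensembl, whole_auto_vs_ensembl)
```

The comparison only relates the counts to each other; it does not say how many of the DoTS-derived genes overlap the Ensembl or RefSeq sets.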

46
Summary
  • To make links between genotype and phenotype, the
    output of technologies such as genomic
    sequencing, microarrays, mass spec, etc., must be
    integrated
  • Our solution is GUS, the Genomics Unified Schema,
    used for multiple systems: AllGenes, PlasmoDB,
    EPConDB
  • GUS is freely available as a system for use and
    development
  • RAD is part of GUS and uses the microarray standards
    now available
  • Using GUS for genomic research such as annotating
    mouse chromosome 5.
  • Possibly doubling the number of genes in
    annotated regions!

http://www.cbil.upenn.edu