Genomes - PowerPoint PPT Presentation


Genomes

Provided by: stuart67

Transcript and Presenter's Notes

1
Genomes & Databases
2
Biologists Collect Lots of Data
  • Hundreds of thousands of species
  • Millions of articles in scientific journals
  • Genetic information
  • gene names
  • phenotype of mutants
  • location of genes/mutations on chromosomes
  • linkage (distances between genes)

3
Gene Sequences
  • 1980s - sequenced one gene at a time
  • 1990s - DNA sequencing machines!!
  • whole genomes
  • cDNA (made from RNA)
  • short Expressed Sequence Tags (ESTs)
  • Now we have the genome of the month club
  • The Poodle ??!!?? (Science. Sep 26, 2003)

4
Protein Sequences
  • Protein sequences are derived from genome and
    cDNA
  • using the genetic code and gene-prediction
    software
  • Functional information is mostly about proteins
  • metabolic pathways
  • conserved domains
  • 3-Dimensional structure
  • protein-protein interaction
  • DNA binding (regulatory proteins, transcription
    factors)
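
The "genetic code" step above can be sketched in a few lines of Python. This is only an illustration, assuming a cDNA that already starts at the first codon of the coding region; real gene-prediction software also has to find the reading frame, introns, and so on. The names CODON_TABLE and translate are made up for this sketch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cdna: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        codon = cdna[i:i + 3].upper()
        aa = CODON_TABLE.get(codon, "X")  # "X" marks codons with ambiguous bases such as N
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))
```

Ambiguous bases (the N characters often seen in ESTs) simply become X here; a real tool would handle them more carefully.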

5
Related Data
  • Gene expression
  • Measure levels of each gene in different types of
    cells, in response to different stimuli, disease,
    etc
  • Polymorphism
  • mutations and allelic differences
  • population biology

6
What is a Database?
  • Organized data
  • Like a spreadsheet
  • Columns are fields
  • Rows are records
  • Fields contain data of the same type
  • Rows contain data that is related to one object
  • Can search for a term within just one field
  • speed and specificity
  • Or combine searches in several fields
  • this can't be done in a simple text file
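
A minimal sketch of the fields-and-records idea, in Python. The three records and their accession numbers are invented for illustration and do not correspond to real database entries; the function name search is also just a placeholder.

```python
# Each dict is one record (row); each key is one field (column).
# All values below are made up for illustration.
RECORDS = [
    {"accession": "ZZ000001", "organism": "Homo sapiens", "description": "insulin precursor mRNA"},
    {"accession": "ZZ000002", "organism": "Mus musculus", "description": "insulin receptor gene"},
    {"accession": "ZZ000003", "organism": "Homo sapiens", "description": "hemoglobin beta chain"},
]


def search(records, **criteria):
    """Return records where every named field contains its search term (case-insensitive)."""
    def matches(record):
        return all(
            term.lower() in record[field].lower()
            for field, term in criteria.items()
        )
    return [r for r in records if matches(r)]


# Search within just one field
human = search(RECORDS, organism="Homo sapiens")

# Combine searches in several fields
human_insulin = search(RECORDS, organism="Homo sapiens", description="insulin")
print([r["accession"] for r in human_insulin])
```

Searching a plain text file for "insulin" would return both insulin records; restricting the term to one field and combining it with a second field is what narrows the result to a single record.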

7
(No Transcript)
8
Large Databases
  • Once upon a time, GenBank sent out sequence
    updates on CD-ROM disks a few times per year.
  • Now (Oct/2003) GenBank is over 250 Gigabytes
  • (100 billion bases)
  • Most biocomputing sites update their copy of
    GenBank every day over the internet.
  • Scientists access GenBank directly over the Web

9
GenBank is a Flatfile Database
  • Composed entirely of text
  • you could print the whole thing out
  • Each submitted sequence is a record
  • Has fields for Organism, Date, Author, etc.
  • Unique identifier for each sequence
  • Locus and Accession
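
Because a flatfile is just text, a short script can split it into records and fields. The sketch below assumes the usual GenBank layout (a keyword at the start of a line, indented continuation lines, and "//" ending each record). The two sample records are invented and heavily shortened, and parse_flatfile is a hypothetical helper, not part of any GenBank tool.

```python
# Two made-up, heavily shortened records in GenBank-like flatfile layout.
SAMPLE = """\
LOCUS       EXAMPLE01    24 bp    DNA
DEFINITION  Made-up record for illustration only.
ACCESSION   ZZ000001
SOURCE      Homo sapiens
ORIGIN
        1 atggccattg taatgggccg ctga
//
LOCUS       EXAMPLE02    12 bp    DNA
DEFINITION  Second made-up record.
ACCESSION   ZZ000002
SOURCE      Mus musculus
ORIGIN
        1 atggcctggt aa
//
"""


def parse_flatfile(text):
    """Split flatfile text into records; keep the top-level keyword lines as fields."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            if current:
                records.append(current)
            current = {}
        elif line and not line[0].isspace():  # keyword lines start in column 1
            keyword, _, value = line.partition(" ")
            current[keyword] = value.strip()
        # indented lines (continuations, sequence data) are skipped in this sketch
    return records


records = parse_flatfile(SAMPLE)
by_accession = {r["ACCESSION"]: r for r in records}
print(by_accession["ZZ000002"]["SOURCE"])
```

Indexing the records by accession is what makes the unique identifier useful: one lookup returns exactly one sequence record.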

10
Finding Sequences in Databases
  • The public DNA and protein sequence databases are
    huge.
  • In order for these databases to be useful, the
    data must be readily accessible to researchers.

11
Raw Genome Data
12
What Are You Looking For?
  • A gene?
  • DNA or protein sequence?
  • DNA sequences are essentially all in GenBank
  • Genomic, mRNA, cDNA, EST?
  • Proteins are harder to pin down
  • GenPept (GenBank Peptides) is huge and poorly
    annotated - lots of junk
  • SwissProt is carefully annotated, but not fully
    comprehensive
  • PIR is somewhere in between
  • PDB has protein 3-D structures

13
Federated Databases
  • EMBL (European Molecular Biology Lab)
  • has the same data as GenBank, but in a different
    format and sometimes with different accession
    numbers
  • uses SRS as query tool
  • DDBJ (DNA DataBank of Japan)
  • also the same data, in a different format, with
    different accession numbers

14
Finding Genes in GenBank
  • These billions of G, A, T, and C letters would
    be almost useless without descriptions of what
    genes they contain, the organisms they come from,
    etc.
  • All of this information is contained in the
    "annotation" part of each sequence record.

15
(No Transcript)
16
Accession Numbers!!
  • Databases are designed to be searched by
    accession numbers (and locus IDs)
  • These are guaranteed to be non-redundant,
    accurate, and not to change.
  • Searching by gene names and keywords is doomed to
    frustration and probable failure
  • Neither scientists nor computers can be trusted
    to accurately and consistently annotate database
    entries
  • If only scientists would refer to genes by
    accession numbers in all published work!

17
(No Transcript)
18
Relational Databases
  • GenBank is a DNA sequence database
  • What about proteins?
  • Could just add them into the same database
  • some fields are the same
  • (Organism, Author, Date, etc)
  • but some are different (phosphorylation site)
  • Better to make a separate table
  • But how to connect DNA and protein for the same
    gene?
  • Need a primary key (accession number)
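
The two-table idea can be tried out with Python's built-in sqlite3 module. This is a sketch only: the table names, column names, and the single row in each table are invented, and it does not reflect how GenBank or Entrez actually store their data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- DNA table: the accession number is the primary key
    CREATE TABLE dna (
        accession TEXT PRIMARY KEY,
        organism  TEXT,
        sequence  TEXT
    );
    -- Protein table: has its own fields, plus a link back to the DNA record
    CREATE TABLE protein (
        protein_id           TEXT PRIMARY KEY,
        dna_accession        TEXT REFERENCES dna(accession),
        phosphorylation_site INTEGER
    );
""")

# Made-up rows for illustration
conn.execute("INSERT INTO dna VALUES (?, ?, ?)",
             ("ZZ000001", "Homo sapiens", "ATGGCCTGGTAA"))
conn.execute("INSERT INTO protein VALUES (?, ?, ?)",
             ("PP000001", "ZZ000001", 2))

# Connect DNA and protein for the same gene through the shared accession
rows = conn.execute("""
    SELECT d.accession, d.organism, p.protein_id, p.phosphorylation_site
    FROM dna AS d
    JOIN protein AS p ON p.dna_accession = d.accession
""").fetchall()
print(rows)
```

Shared fields such as Organism live once in the DNA table, protein-only fields live in the protein table, and the accession ties the two together.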

19
Other Related Information
  • Also related to GenBank are protein structures,
    literature in PubMed, UniGene collections of cDNA
    sequences, etc
  • Each dataset deserves its own table - but by
    linking to a GenBank accession, they can all be
    cross referenced

20
Entrez is a Relational Database
  • The Entrez database contains all of the
    nucleotide and protein sequences in GenBank
    (updated daily) along with all of the literature
    in MEDLINE and the 3-D protein structures in PDB
    (Protein Data Bank).
  • Entrez is much more than a database: it is both
    a powerful search engine and a pre-computed list
    of relationships between all of its data elements

21
Entrez is Internally Cross-linked
  • DNA and protein sequences are linked to other
    similar sequences
  • Medline citations are linked to other citations
    that contain similar keywords
  • 3-D structures are linked to similar structures

22
Databases contain more than just DNA & protein
sequences
23
Type in a Query term
  • Enter your search words in the query box and hit
    the Go button

24

25
Related Items
  • You can search for a text term in sequence
    annotations or in MEDLINE abstracts, and find all
    articles, DNA, and protein sequences that mention
    that term.
  • Then from any article or sequence, you can move
    to "related articles" or "related sequences".
  • Relationships between sequences are computed with
    BLAST
  • Relationships between articles are computed with
    "MeSH" terms (shared keywords)
  • Relationships between DNA and protein sequences
    rely on accession numbers
  • Relationships between sequences and MEDLINE
    articles rely on both shared keywords and the
    mention of accession numbers in the articles.

26
  • These pre-computed relationships might include
    genes in the same multi-gene family, articles
    written about genes that have the same function,
    or other proteins that function in the same
    biochemical pathway
  • This potential for horizontal movement through
    the linked databases makes Entrez really
    exciting - you can think like a biologist, not
    a database programmer!
  • A researcher can start with only a vague set of
    keywords or a sequence identified in the
    laboratory and rapidly access a set of relevant
    literature and a list of related database
    sequences.

27
Refine the Query
  • Often a search finds too many (or too few)
    sequences, so you can go back and try again with
    more (or fewer) keywords in your query
  • The History feature allows you to combine any
    of your past queries.
  • The Limits feature allows you to limit a query
    to specific organisms, sequences submitted during
    a specific period of time, etc.
  • Many other features are designed to search for
    literature in MEDLINE
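
The same refine-the-query idea can be sketched as code. The slides describe the Entrez web page; the example below instead assumes NCBI's E-utilities ESearch address and its db, term, and retmax parameters, which are not mentioned in this presentation. It only builds the query URL and does not contact NCBI. The helper names build_term and esearch_url are made up.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint (not part of the original slides)
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(keywords, organism=None):
    """Join keywords with AND; optionally limit the query to one organism."""
    term = " AND ".join(keywords)
    if organism:
        term += f" AND {organism}[Organism]"  # field tag restricts the match to one field
    return term


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Too many hits: start broad ...
broad = build_term(["insulin"])
# ... then narrow with more keywords and an organism limit
narrow = build_term(["insulin", "receptor"], organism="Homo sapiens")

url = esearch_url("nucleotide", narrow)
print(url)
```

Going from broad to narrow changes only the term string, which mirrors adding keywords and setting Limits on the web form.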

28
Other Sequence Search Tools
  • SRS (Sequence Retrieval System) was created by
    Dr. Thure Etzold (CABIOS 9(1): 49-57, 1993)
  • It is a meta search engine for all types of
    biological data in hundreds of databases as well
    as about 20 sequence analysis programs
  • SRS can be accessed over the WWW from many
    servers (mostly in Europe)
  • http://srs.ebi.ac.uk/
  • http://www.infobiogen.fr/srs6bin/cgi-bin/wgetz?-pagetop
  • http://www.sanger.ac.uk/srs6bin/cgi-bin/wgetz?-pagetop
  • http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?-pagetop

29
(No Transcript)
30
(No Transcript)
31
Why So Many Databases?
  • If GenBank has all sequence data and Entrez is
    such a good query tool, then why are there so
    many other sequence databases?
  • Specialized data (single species,
    immunoglobulins, etc.)
  • Better annotation (e.g. SwissProt)
  • Sequences linked to other data (ACEDB)
  • Stubbornness and local pride - EMBL, DDBJ
  • Well designed databases are interlinked with
    others for supplemental data
  • It is very hard to get all relevant information
    across all databases for any gene

32
Other Genetic Databases
  • Genome Sequence - where does a gene fall on the
    genome
  • integrate multiple layers of information
  • Sequence contigs, mRNAs, predicted exons, etc.
  • Single species?
  • ESTs: dbEST at NCBI
  • SNPs: dbSNP at NCBI,
  • http://snp.cshl.org (SNP Consortium)
  • Metabolism/Pathways
  • Gene Function (Gene Ontology)
  • Protein motifs/domains and protein families

33
Genome Databases
  • New area - in desperate need of development
  • Chromosomes, sequence, contigs, clones, STS
    markers, genetic markers, genes, features,
    expression data, phenotype
  • No single database can hold it all
  • UCSC is probably the best right now
  • genome.ucsc.edu
  • Need a data exchange and linkage infrastructure

34
(No Transcript)
35
(No Transcript)
36
Ensembl at EBI/EMBL
37
ESTs (Expressed Sequence Tags)
  • partial cDNA sequences
  • dbEST at NCBI
  • a comprehensive set of all public EST data
  • UniGene at NCBI
  • clusters of ESTs and known genes from key species
  • does NOT have consensus sequences
  • has far too many clusters to be representative of
    real genes (129 K human clusters)

38
IDENTIFIERS
  dbEST Id:            101883
  EST name:            yb01a01.s1
  GenBank Acc:         T48601
  GenBank gi:          650461
  GDB Id:              490761
CLONE INFO
  Clone Id:            IMAGE:69864 (3')
  Other ESTs on clone: yb01a01.r1
  DNA type:            cDNA
PRIMERS
  Sequencing:          -21m13
  PolyA Tail:          Unknown
SEQUENCE
  GGCGGCTCAGTAGCAGGTGCC
  GTCCACCTCCGCCATGACAACAGACACATTGACATGGGT
  GGGTTTACCACCAAGCGTCCGATGGTCTTCTGTGTGAAGGCCAG
  CCAGGCGCCTCCATGG
  CACCATGCAGGAGAAGGNCTCCCCCTTCTTCCAGTCCTCGGCTGCCACGC
  GCAGTATGCT GGTCACACGAAGGTCGTGGTGCC
  CTGGCTGGNTCCTNCANGGATGCCCAAGTCAGGTACT
  TNTCGCGGGGCAGCTCCTGTGACCCCTGCAGCCAGCGAACCAGCAC
  GTCCTTGGGGCTTN AAGCNGCGCTACCAGGCAC
  TTCAACCGTTCNCCAGCTTCGTTCAGGGCCANCTTC
  Quality:             High quality sequence stops at base 277
  Entry Created:       Feb 6 1995
  Last Updated:        Feb 6 1995
COMMENTS
  High quality sequence stops: 277
  Source: IMAGE Consortium, LLNL
  This clone is available royalty-free through LLNL; contact
  the IMAGE Consortium (info@image.llnl.gov) for further
  information.
PUTATIVE ID
  Assigned by submitter: similar to gbS71043_rna1 IG ALPHA-2
  CHAIN C REGION (HUMAN)
LIBRARY
  Lib Name:            Stratagene placenta (937225)
  Organism:            Homo sapiens
  Sex:                 male
  Organ:               placenta
  Lab host:            SOLR cells (kanamycin resistant)
  Vector:              pBluescript SK-
  R. Site 1:           EcoRI
  R. Site 2:           XhoI
  Description:         Cloned unidirectionally. Primer: Oligo dT.
                       Caucasian. Average insert size: 1.2 kb;
                       Uni-ZAP XR Vector;
                       5' adaptor sequence: 5' GAATTCGGCACGAG 3';
                       3' adaptor sequence: 5' CTCGAGTTTTTTTTTTTTTTTTTT 3'
39
(No Transcript)
40
(No Transcript)
41
Digital Differential Display
42
(No Transcript)
43
SNPs (Single Nucleotide Polymorphisms)
  • Genetic variation
  • Can be alleles of genes
  • also differences in non-coding regions collected
    from genome sequencing of different individuals
  • dbSNP at the NCBI - all public SNP data
  • SNP Consortium at CSHL - high quality set
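
A single nucleotide polymorphism is easy to picture as code. The sketch below assumes two sequences that are already aligned and of equal length, with no insertions or deletions, which is a large simplification of how SNPs are actually called from genome sequencing. The sequences and the function name find_snps are invented for illustration.

```python
def find_snps(reference: str, individual: str):
    """List single-base differences between two pre-aligned, equal-length sequences.

    Returns (position, reference_base, individual_base) tuples, positions 1-based.
    """
    if len(reference) != len(individual):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, ind_base)
        for pos, (ref_base, ind_base) in enumerate(zip(reference, individual), start=1)
        if ref_base != ind_base
    ]


# Made-up sequences differing at two positions
reference = "ATGGCCATTGTAATG"
individual = "ATGACCATTGTCATG"
print(find_snps(reference, individual))
```

Each reported position is a candidate SNP; whether it is a real allele or a sequencing error is the kind of question a curated, high quality set is meant to address.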

44
(No Transcript)
45
(No Transcript)
46
Human Genetic Variation
  • Every human has essentially the same set of genes
  • But there are different forms of each gene -
    known as alleles
  • blue vs. brown eyes
  • genetic diseases such as cystic fibrosis or
    Huntington's disease are caused by dysfunctional
    alleles

47
(No Transcript)
48
KEGG: Kyoto Encyclopedia of Genes and Genomes
  • Enzymatic and regulatory pathways
  • Mapped out by EC number and cross-referenced to
    genes in all known organisms
  • (wherever sequence information exists)
  • Parallel maps of regulatory pathways

49
(No Transcript)
50
(No Transcript)
51
(No Transcript)
52
Protein-Protein Interactions
  • Metabolic and regulatory pathways
  • Transcription factors
  • Co-expression
  • Biochemical data
  • crosslinking
  • yeast 2-hybrid
  • affinity tagging
  • Useful feedback to genome annotation/protein
    function and gene expression

53
(No Transcript)
54
BIND - The Biomolecular Interaction Network
Database
55
Gene Ontology
  • Genetics is a messy science
  • Scientists have been working in isolation on
    individual species for many years - naming genes,
    mutants, odd phenotypes
  • sonic hedgehog
  • Now that we have complete genome sequences, how
    to reconcile the names across all species?
  • Gene Ontology uses a single 3-part system
  • Molecular function (specific tasks)
  • Biological process (broad biological goals - e.g.
    cell division)
  • Cellular component (location)

56
(No Transcript)
57
Database Search Strategies
  • General search principles - not limited to
    sequence (or to biology)
  • Use accession numbers whenever possible
  • Start with broad keywords and narrow the search
    using more specific terms
  • Try variants of spelling, numbers, etc.
  • Search all relevant databases
  • Be persistent!!