1
Bioinformatic Databases
  • Norie de la Cruz, PhD

2
Take home
  • The internet is a powerful resource containing a
    large volume of data and tools to manipulate
    them. Unfortunately, connecting data between them
    can sometimes be tricky.

3
Overview
  • Whirlwind tour of Web databases
  • The Rat Genome Database: data, tools, and
    operations

4
Bioinformatic databases on the WWW
  • Loose definition of database here
  • Vary widely in terms of offerings, data, tools
    and specialization
  • Vary widely in terms of data collection
    methodologies

5
Some classifications per NAR
  • Major sequence repositories
  • Gene Expression
  • Comparative genomics
  • Gene Identification and Structure
  • Genetic and physical maps
  • Genomic Databases
  • Intermolecular interactions
  • Metabolic Pathways and Cellular Regulation
  • Mutation Databases
  • Pathology

6
Some classifications per NAR
  • Protein Databases
  • Protein sequence Motifs
  • Proteome Resources
  • Retrieval systems
  • RNA Sequences
  • Structure
  • Transgenics
  • Varied Biomedical Content

7
Major Sequence Repositories
  • GenBank
  • RefSeq
  • DDBJ
  • Ensembl
  • UniGene
  • Collection of sequence data
      • Genomic
      • Markers
      • Genes
      • Proteins
  • Some provide tools to expedite access
      • BLAST search
      • Alignment tools
      • Translation tools, etc.
  • Varying degrees of quality control
      • Machine data upload
      • Human curation and QC
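
As a rough illustration of what a "translation tool" does, the sketch below translates a DNA coding sequence into protein using the standard genetic code. The function name `translate` and the example sequence are made up for illustration; the tools offered by the repositories themselves handle reading frames, alternative genetic codes, and ambiguity codes that this sketch ignores.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Illustrative sequence only (ATG GCC TGG TAA).
print(translate("ATGGCCTGGTAA"))
```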

8
Major Sequence Repositories: GenBank
  • All known nucleotide and protein sequences
  • Provides submission system for various authors
  • Little QC

9
Major Sequence Repositories: RefSeq
  • Non-redundant collection of naturally occurring
    biological molecules
  • Human QC
  • Comprehensive, integrated set of sequences for
    major research organisms
  • Provides a stable reference for further
    characterization of sequences including
    comparative analyses, mutations, expression, etc.

10
Major Sequence Repositories: UniGene
  • Attempts to cluster GenBank sequences into
    gene-oriented clusters
  • Each cluster contains sequences that represent
    one gene
  • Provides a stable reference for further
    characterization of sequences including
    comparative analyses, mutations, expression, etc.

11
Major Sequence Repositories: DDBJ (DNA Data Bank
of Japan)
  • Japanese equivalent to NCBI efforts
  • Attempting to gather all known nucleotide and
    protein sequences
  • Part of the International Nucleotide Sequence
    Database Collaboration

12
Major Sequence Repositories: EMBL Nucleotide
Sequence Database
  • European equivalent to NCBI efforts
  • Attempting to gather all known nucleotide and
    protein sequences
  • Part of the International Nucleotide Sequence
    Database Collaboration

13
Major Sequence Repositories: UCSC Genome Browser
  • Visual representation of genome and sequence data
  • Run by University of California at Santa Cruz

14
Comparative Genomics
  • Examines the similarities and differences in
    genome organization
  • Clustering of like data across multiple genomes,
    e.g. protein motifs
  • Cross referencing of genome data across genomes

15
Comparative Genomics: Microbial Genome Database
for Comparative Analysis (MBGD)
  • MBGD is a database for comparative analysis of
    completely sequenced microbial genomes
  • MBGD aims to facilitate comparative genomics from
    various points of view such as ortholog
    identification, paralog clustering, motif
    analysis and gene order comparison

16
Comparative Genomics: Some specialized sites
  • Homophila: human disease and Drosophila gene
    relationships
  • CORG: conserved non-coding sequence blocks
  • ParaDB: paralog mapping in human genomes

17
Comparative Genomics: Clusters of Orthologous
Groups (COGs)
  • Phylogenetic classification of the proteins
    encoded in complete genomes
  • Proteins grouped according to sequence by a
    program called COGNITOR
  • Must be represented in at least three species in
    a group of 43 species representing phylogenetic
    lineages
  • Each COG consists of individual proteins or
    groups of paralogs from at least 3 lineages and
    thus corresponds to an ancient conserved domain.
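
The "at least three lineages" rule can be sketched as a simple filter over candidate clusters. This is not COGNITOR and does not reproduce how COGs are actually built; it only illustrates the membership criterion stated above. The function name `filter_clusters`, the tuple layout, and the toy identifiers are all hypothetical.

```python
from collections import defaultdict


def filter_clusters(memberships, min_lineages=3):
    """Keep clusters whose proteins come from at least `min_lineages` lineages.

    `memberships` is an iterable of (cluster_id, protein_id, lineage) tuples.
    Returns a dict mapping each surviving cluster_id to its protein_ids.
    """
    proteins = defaultdict(list)
    lineages = defaultdict(set)
    for cluster_id, protein_id, lineage in memberships:
        proteins[cluster_id].append(protein_id)
        lineages[cluster_id].add(lineage)
    return {
        cid: members
        for cid, members in proteins.items()
        if len(lineages[cid]) >= min_lineages
    }


# Hypothetical toy data: all identifiers below are placeholders.
toy = [
    ("c1", "p1", "lineageA"), ("c1", "p2", "lineageB"), ("c1", "p3", "lineageC"),
    ("c1", "p4", "lineageA"),  # a paralog from a lineage already represented
    ("c2", "p5", "lineageA"), ("c2", "p6", "lineageB"),
]
kept = filter_clusters(toy)
print(kept)
```

Cluster `c2` is dropped here because its proteins span only two lineages, while the paralog `p4` stays in `c1` without counting as an extra lineage.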

18
Gene Expression
  • Analysis of gene expression patterns
  • Repositories of microarray data
  • Analysis of tissue specificities of gene
    expression
  • Analysis of expression patterns for genes linked
    to specific diseases
  • Analysis of gene expression regulatory networks

19
Gene Expression: ArrayExpress
  • ArrayExpress is a new public database of
    microarray gene expression data at the EBI
  • The ArrayExpress infrastructure consists of
      • the database itself,
      • data submissions in MAGE-ML format or via an
        online submission tool, MIAMExpress,
      • an online database query interface, and
      • the Expression Profiler online analysis tool.

20
Gene Expression: Edinburgh Mouse Atlas Project
  • A database intended as a resource for spatially
    mapped data such as in situ gene expression and
    cell lineage
  • The gene expression database (EMAGE) is being
    developed as part of the Mouse Gene Expression
    Information Resource (MGEIR) in collaboration
    with the Jackson Laboratory, USA

21
Gene Expression: HugeIndex (Human Gene Expression
Index)
  • aims to provide a comprehensive database to
    understand the expression of human genes in
    normal human tissues
  • mRNA expression levels of thousands of genes are
    obtained using high-density oligonucleotide array
    technology and used to create a public
    database.

22
Gene Expression: Other specialized sites
  • Kidney Development Database
  • TRIPLES: transposon-insertion phenotypes,
    localization and expression in Saccharomyces
  • Tooth Development Database
  • MethDB: DNA methylation data, patterns and
    profiles

23
Gene Identification and Structure
  • Focuses on the analysis of sequences to determine
    gene structures
  • Analysis of gene expression control signals
  • Analysis of coding signals
  • Analysis of variations in the exons (alleles)
  • Analysis of codon usage

24
Gene Identification and Structure: The SNP
Consortium (TSC) database
  • collaboration that has to date discovered and
    characterized nearly 1.8 million SNPs
  • Now that the SNP discovery phase of the TSC
    project is essentially complete, the emphasis has
    shifted to studying SNPs in populations

25
Gene Identification and Structure: Alternative
Splicing Annotation Project (ASAP)
  • Lets biologists access and mine the enormous
    wealth of alternative splicing information coming
    from genomics and proteomics
  • Uses the UniGene clusters of human Expressed
    Sequence Tags (ESTs) to identify splices

26
Gene Identification and Structure: PromEC
  • Database of promoters of characterized genes in
    E. coli

27
Gene Identification and Structure: Some other
specialized sites
  • PLACE: plant cis-acting regulatory elements
  • Sputnik: functional annotation of clustered
    plant ESTs
  • VIDA: virus open reading frames
  • HS3D: human exon, intron, and splice regions

28
Genetic and physical maps
  • Repository for marker information
  • Data on gene locations within the genome
  • Map of cloned sequences
  • Tools to integrate information across genomes

29
Genetic and Physical Maps: HuGeMap
  • Collections of human genetic maps from Genethon
    and the Cooperative Human Linkage Center
  • Collections of physical maps from Genethon and
    the Whitehead Institute

30
Genetic and Physical Maps: GeneMap99
  • A map of 30,181 human gene-based markers was
    assembled and integrated with the current genetic
    map by radiation hybrid mapping.
  • constitutes an important infrastructure and tool
    for the study of complex genetic traits, the
    positional cloning of disease genes, the
    cross-referencing of mammalian genomes, and
    validated human transcribed sequences for
    large-scale studies of gene expression

31
Genomic Databases
  • Data repositories for research results on various
    model organisms
      • Rat
      • Human
      • Fruit fly
      • Worm
      • Arabidopsis
      • Some other rodent
  • Linking information across databases
  • Tools to organize and integrate information

32
Genomic Databases: The Rat Genome Database
  • Consolidates and integrates rat research data
  • Presents data on genes, QTLs, SSLPs, ESTs, etc.
  • Fields a series of tools to help analysis and
    integration with data within and without.

33
Genomic Databases: FlyBase
  • Focuses on Drosophila genome data
  • Presents data on genes, stocks, ESTs,
    transposons, sequences.
  • Not a lot of tools

34
Genomic Databases: EcoGene
  • EcoGene is a collection of information about the
    genes, proteins, and intergenic regions of the E.
    coli K-12 genome and proteome
  • Collaborative effort between many laboratories

35
Genomic Databases: Some other examples
  • WormBase: C. elegans
  • Oryzabase: rice
  • TAIR: Arabidopsis
  • IRIS: rice germplasm
  • MitoDat: mitochondrial proteins
  • MGI: Medicago
  • CropNet: crop plants
  • MGD: another rodent

36
Mutation Databases
  • Allele distributions in populations
  • Inherited genetic diseases
  • Mutations in proteins implicated in disease
    development

37
Mutation Databases: ALFRED
  • designed to make allele frequency data on
    anthropologically defined human population
    samples readily available to the scientific
    community
  • link these polymorphism data to the molecular
    genetics-human genome databases

38
Mutation Databases: Human Gene Mutation Database
  • an attempt to collate known (published) gene
    lesions responsible for human inherited disease
  • provides information of practical diagnostic
    importance to
      • researchers and diagnosticians in human
        molecular genetics
      • physicians interested in a particular inherited
        condition in a given patient or family
      • genetic counsellors.

39
Mutation Databases: Online Mendelian Inheritance
in Man (OMIM)
  • catalog of human genes and genetic disorders
  • contains textual information, pictures, and
    reference information

40
Mutation Databases: Other examples
  • Atlas of Genetics and Cytogenetics in Oncology
    and Haematology
  • Database of Germline p53 Mutations
  • SV40 Large T-Antigen Mutant Database
  • KinMutBase: disease-causing kinase mutations

41
Protein Databases
  • Protein sequences collection
  • Clustering of protein data into families
  • Specialized protein sites
  • Organism
  • Function
  • Large variety of enzymes

42
Protein Databases: InterPro
  • a database of protein families, domains and
    functional sites in which identifiable features
    found in known proteins can be applied to unknown
    protein sequences
  • amalgamates the major protein signature
    databases, whose data have been manually
    integrated and curated
      • PROSITE
      • Pfam
      • PRINTS
      • ProDom
      • SMART
      • TIGRFAMs

43
Protein Databases: ProtoNet
  • provides global classification of the proteins
    from the SWISS-PROT database into hierarchical
    clusters
  • clustering is based on an all-against-all BLAST
    similarity search
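
To make "hierarchical clusters from an all-against-all similarity search" concrete, here is a minimal sketch of average-linkage agglomerative clustering over pairwise scores. It is a generic textbook procedure, not ProtoNet's own algorithm, and the scores below are invented placeholders standing in for BLAST results. The function name `agglomerate` and the protein labels are hypothetical.

```python
from itertools import combinations


def agglomerate(items, similarity):
    """Average-linkage agglomerative clustering.

    `similarity` maps frozenset({a, b}) to a score (higher = more similar);
    missing pairs count as 0. Returns the merges in order, each as a
    (cluster_a, cluster_b, average_similarity) tuple.
    """
    clusters = [frozenset([item]) for item in items]
    merges = []

    def avg_sim(c1, c2):
        scores = [similarity.get(frozenset({a, b}), 0.0) for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: avg_sim(*pair))
        merges.append((a, b, avg_sim(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Invented pairwise scores, for illustration only.
scores = {
    frozenset({"P1", "P2"}): 90.0,
    frozenset({"P1", "P3"}): 20.0,
    frozenset({"P2", "P3"}): 30.0,
    frozenset({"P3", "P4"}): 80.0,
}
merges = agglomerate(["P1", "P2", "P3", "P4"], scores)
for a, b, s in merges:
    print(sorted(a), sorted(b), s)
```

The sequence of merges defines the hierarchy: the earliest merges join the most similar proteins, and later merges join progressively looser groups.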

44
Protein Databases: iProClass
  • an integrated resource that provides
    comprehensive family relationships and
    structural/functional features of proteins
  • currently consists of non-redundant PIR and
    SwissProt/TrEMBL proteins
  • 36,200 PIR superfamilies
  • 145,300 families
  • 5,720 domains
  • 1,300 motifs
  • 280 post-translational modification sites
  • links to over 50 biological databases.

45
Protein Databases: Other Examples
  • Nuclear Protein Database: proteins localized in
    the nucleus
  • PLANT-PIs: plant protease inhibitors
  • SWISS-PROT/TrEMBL: curated protein sequences
  • SENTRA: sensory signal transduction proteins
  • Ribonuclease P Database

46
Protein Sequence Motifs
  • Alignment of protein sequences
  • Organization of proteins into families

47
Protein Sequence Motifs: BLOCKS
  • multiply aligned ungapped segments corresponding
    to the most highly conserved regions of proteins
  • Tools
      • Block Searcher -- compare a protein or DNA
        sequence to a database of protein blocks
      • Get Blocks -- retrieve blocks
      • Block Maker -- create new blocks

48
Protein Sequence Motifs: Pfam
  • a large collection of multiple sequence
    alignments and hidden Markov models covering many
    common protein domains and families.
  • For each family in Pfam you can
      • Look at multiple alignments
      • View protein domain architectures
      • Examine species distribution
      • Follow links to other databases
      • View known protein structures

49
Protein Sequence Motifs: PROSITE
  • database of protein families and domains. It
    consists of biologically characterized sites,
    patterns and profiles that help to reliably
    identify to which known protein family (if any) a
    new sequence belongs
  • currently contains patterns and profiles specific
    for more than a thousand protein families or
    domains.
  • each of these signatures comes with documentation
    providing background information on the structure
    and function of these proteins
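
PROSITE patterns are written in a compact syntax that maps closely onto regular expressions. The sketch below converts a simplified subset of that syntax (`x`, `[..]`, `{..}`, repeat counts, and `<`/`>` anchors) into a Python regex and scans a sequence with it. The function name `prosite_to_regex` and the test sequence are illustrative; the example pattern is the well-known N-glycosylation site signature, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern (simplified subset) to a Python regex."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":               # x means any residue
            core = "."
        elif core.startswith("{"):    # {..} means any residue except those listed
            core = "[^" + core[1:-1] + "]"
        # [..] sets and single residues are already valid regex as written.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)

# Made-up sequence for illustration.
hit = re.search(regex, "MKNGSAQ")
print(hit.start(), hit.group())
```

Profiles, the other kind of PROSITE signature, are weight matrices rather than patterns and cannot be expressed this way.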

50
Protein Sequence Motifs: Other Examples
  • ASC (Active Sequence Collection): biologically
    active oligopeptides
  • CluSTr: automatic classification of SWISS-PROT
    and TrEMBL proteins
  • TMPDB: experimentally characterized
    transmembrane topology
  • O-GLYCBASE: O- and C-linked glycosylation sites
    in proteins

51
RNA Sequences
  • Repository of RNA sequences
  • RNA structure data
  • RNA metabolism information
  • Specialized sites by organism, function, etc.

52
RNA Sequences: HyPaLib
  • contains annotated structural elements
    characteristic for certain classes of structural
    and/or functional RNAs
  • developing software tools that allow a user to
    search sequence databases for any pattern in
    HyPaLib

53
RNA Sequences: Rfam
  • a collection of multiple sequence alignments and
    covariance models representing non-coding RNA
    families
  • allow the user to search a query sequence against
    a library of covariance models, and view multiple
    sequence alignments and family annotation

54
RNA Sequences: tRNA sequences
  • compilation of tRNA Sequences and Sequences of
    tRNA genes

55
RNA Sequences: Other Examples
  • 16S and 23S Ribosomal RNA Mutation Database
  • ACTIVITY: functional DNA/RNA site activity
  • PLANTncRNAs: plant non-coding RNAs
  • RNA Modification Database: naturally modified
    nucleosides in RNA

56
Structure
  • Information on protein structure derived from
    physical data: crystallography, NMR
  • Classification of proteins according to tertiary
    structures
  • Specialized sites for specific proteins

57
Structure: ASTRAL
  • provides databases and tools useful for analyzing
    protein structures and their sequences
  • Partially derived from the SCOP database
    (Structural Classification of Proteins)

58
Structure: SCOP
  • Comprehensive ordering of proteins of known
    structure based on their evolutionary and
    structural relationships
  • Protein domains are grouped into species and
    hierarchically classified into families,
    superfamilies, folds, and classes

59
Structure: PDB
  • Structure data determined by X-ray
    crystallography and NMR

60
Structure: Other Examples
  • CADB: conformation angles of protein structures,
    with associated crystallographic data
  • Database of Macromolecular Movements
  • DSDBase: disulfide bonds in proteins
  • PSSH: alignment between sequences and tertiary
    structures
  • SUPERFAMILY: assignments of proteins to
    structural superfamilies

61
Other Databases
  • Intermolecular Interactions
  • Metabolic Pathways and Cellular Regulation
  • Pathology
  • Proteome Resources
  • Retrieval Systems and Database Structure
  • Transgenics
  • Varied Medical Content

62
Other Databases: Intermolecular Interactions
  • BIND: molecular interactions, complexes and
    pathways
  • DIP (Database of Interacting Proteins):
    experimentally determined protein-protein
    interactions
  • KDBI: kinetic data on biomolecular interactions

63
Other Databases: Metabolic Pathways and Cellular
Regulation
  • KEGG: Kyoto Encyclopedia of Genes and Genomes
  • MetaCyc: metabolic pathways and enzymes from
    various organisms
  • PathDB
  • EcoCyc: E. coli K-12 genome and pathway data
  • PRODORIC: gene regulation and regulatory
    networks in prokaryotes

64
Other Databases: Pathology
  • BayGenomics: cardiovascular and pulmonary
    disease
  • INFEVERS: hereditary inflammatory disorders
  • GOLD.db: lipid-associated disorders
  • Mouse Tumor Biology Database

65
Other Databases: Proteome Resources
  • GELBANK: 2D gel data repository
  • REBASE: restriction enzymes and associated
    methylases
  • SWISS-2DPAGE: annotated two-dimensional gel
    electrophoresis database

66
Other Databases: Retrieval Systems and Database
Structure
  • TESS: Transcription Element Search System
  • Virgil: database interconnectivity

67
Other Databases: Transgenics
  • Cre Transgenic Database: Cre transgenic
    mouse lines
  • Transgenic/Targeted Mutation Database:
    information on transgenic animals and targeted
    mutations

68
Other Databases: Varied Medical Content
  • Tree of Life: phylogeny and biodiversity
  • PubMed: biomedical literature
  • NCBI Taxonomy Browser: organisms with at least
    one sequence deposited in the database
  • PharmGKB: pharmacogenomics and variations in
    drug response based on human variation

69
The Rat Genome Database
  • Data
  • Tools
  • Operations

70
The Rat Genome Database: data
  • Genes
  • Maps and Markers
  • QTLs
  • Strains
  • Homologs

71
The Rat Genome Database: tools
  • VCMap
  • Mapserver
  • Meta Gene
  • Genome Scanner
  • Ontology Browser

72
The Rat Genome Database: operations
  • Curation
  • Data QC and Loading
  • Data development
  • Tool development

73
The Rat Genome Database Operations: Curation
  • Information gathering from peer-reviewed work
  • Coordination with other model organism databases
  • Data quality policy development and assessment

74
The Rat Genome Database Operations: data
development
  • Development of data integration strategies
  • Development of ontology annotation protocols
  • Some development of curation policies
  • Outreach
  • Ontology development

75
The Rat Genome Database Operations: tool
development
  • Ontology system development
  • Systems analysis
  • Tool integration
  • Tool building
  • Software system migration