Information Management for Genome Level Bioinformatics - PowerPoint PPT Presentation

About This Presentation
Title:

Information Management for Genome Level Bioinformatics

Description:

Title: Incremental Maintenance of Materialized OQL Views Author: Norman Paton Last modified by: norm Created Date: 11/4/2000 11:38:08 AM Document presentation format – PowerPoint PPT presentation

Slides: 164
Provided by: Norman118
Transcript and Presenter's Notes

Title: Information Management for Genome Level Bioinformatics


1
Information Management for Genome Level
Bioinformatics
  • Norman Paton and Carole Goble
  • Department of Computer Science
  • University of Manchester
  • Manchester, UK
  • <norm, carole>@cs.man.ac.uk

2
Structure of Tutorial
  • Introduction - why it matters.
  • Genome level data.
  • Modelling challenges.
  • Genomic databases.
  • Integrating biological databases.
  • Analysing genomic data.
  • Summary and challenges.

3
What is the Genome?
  • All the genetic material in the chromosomes of
    a particular organism.

4
What is Genomics?
  • The systematic application of (high throughput)
    molecular biology techniques to examine the whole
    genetic content of cells.
  • Understand the meaning of the genomic information
    and how and when this information is expressed.

5
What is Bioinformatics?
  • The application and development of computing and
    mathematics to the management, analysis and
    understanding of the rapidly expanding amount of
    biological information to solve biological
    questions
  • Straddles the interface between traditional
    biology and computer science

6
Human Genome Project
  • The systematic cataloguing of individual gene
    sequences and mapping data to large
    species-specific collections
  • An inventory of life
  • June 25, 2000 draft of entire human genome
    announced
  • Mouse, fruit fly, C. elegans, ...
  • Sequence is just the beginning

http://www.nature.com/genomics/human/papers/articles.html
7
Functional Genomics
  • An integrated view of how organisms work and
    interact in growth, development and pathogenesis
  • From single gene to whole genome
  • From single biochemical reactions to whole
    physiological and developmental systems
  • What do genes do?
  • How do they interact?

8
Comparative Genomics
[Figure: genome comparison; surviving labels: 9,000, 14,000, 31,000, 30,000, 6,000]
http://wit.integratedgenomics.com/GOLD/
9
Of Mice and Men
10
Genotype to Phenotype
  • Link the observable behaviour of an organism with
    its genotype
  • Drug Discovery, Agro-Food, Pharmacogenomics
    (individualised medicine)

11
Disease Genetics Pharmacogenomics
Hypotheses
Design
Integration
Clinical Resources, Individualised Medicine
Data Mining, Case-Based Reasoning
Information Fusion
12
In silico experimentation
Which compounds interact with (alpha-adrenergic
receptors) ((over expressed in (bladder
epithelial cells)) but not (smooth muscle
tissue)) of ((patients with urinary flow
dysfunction) and a sensitivity to the
(quinazoline family of compounds))?
Enzyme database
SNPs database
Tissue database
Drug formulary
High throughput screening
Receptor database
Clinical trials database
Chemical database
Expression database
13
A Paradigm Shift
Hypothesis-driven
  • Hunter gatherers
  • Harvesters

Collection-driven
14
Size, complexity, heterogeneity, instability
  • EMBL July 2001
  • 150 Gbytes
  • Microarray
  • 1 Petabyte per annum
  • Sanger Centre
  • 20 terabytes of data
  • Genome sequences increase 4x per annum

http://www3.ebi.ac.uk/Services/DBStats/
15
High throughput experimental methods
  • Micro arrays for gene expression
  • Robot-based capture
  • 10K data points per chip
  • 20 x per chip
  • Cottage industry -> industrial scale

16
Complexity, size, heterogeneity, instability
  • Multiple views
  • Interrelated
  • Intra and inter cell interactions and
    bio-processes

"Courtesy U.S. Department of Energy Genomes to
Life program (proposed) DOEGenomesToLife.org."
17
Heterogeneity, size, complexity, instability
  • Multimedia
  • Images and video (e.g. microarrays)
  • Text: annotations, literature
  • Over 500 different databases
  • Genomic, proteomic, transcriptomic, metabolomic,
    protein-protein interactions, regulatory
    bio-networks, alignments, disease, patterns and
    motifs, protein structure, protein
    classifications, specialist proteins (enzymes,
    receptors), ...
  • Different formats, structure, schemas, coverage
  • Web interfaces, flat file distribution,

18
Instability, size, complexity, heterogeneity
  • Exploring the unknown
  • At least 5 definitions of a gene
  • The sequence is a model
  • Other models are work in progress
  • Names unstable
  • Data unstable
  • Models unstable

19
Genome Level Data
20
Biological Macromolecules
  • DNA: the source of the program.
  • mRNA: the compiled class definitions.
  • Protein: the runtime object instances.

DNA -> mRNA -> Protein
Biological Teaching Resources: http://www.accessexcellence.com/
21
Genome
  • The genome is the entire DNA sequence of an
    organism.

The yeast genome (Saccharomyces cerevisiae). A
friendly fungus: brewer's and baker's yeast.
http://genome-www.stanford.edu/Saccharomyces/
22
A Genome Data Model
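[Diagram: a Genome comprises Chromosomes (1 to many); a Chromosome
comprises Chromosome Fragments (1 to many); a Chromosome Fragment is
either a Transcribed Region or a NonTranscribed Region.
Annotation: everything in this model is DNA.]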
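The model on this slide could be sketched as Python dataclasses. This is only an illustration under stated assumptions: the class names follow the diagram, while the `start`/`end` coordinates, the `name` and `organism` fields, and the example values are hypothetical and not part of the original slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChromosomeFragment:
    """A stretch of DNA on a chromosome (coordinates are an assumption)."""
    start: int
    end: int

    @property
    def length(self) -> int:
        # Inclusive coordinates, so a fragment 100..400 has 301 bases.
        return self.end - self.start + 1


@dataclass
class TranscribedRegion(ChromosomeFragment):
    """A fragment that is transcribed into RNA."""


@dataclass
class NonTranscribedRegion(ChromosomeFragment):
    """A fragment that is not transcribed."""


@dataclass
class Chromosome:
    name: str
    fragments: List[ChromosomeFragment] = field(default_factory=list)


@dataclass
class Genome:
    organism: str
    chromosomes: List[Chromosome] = field(default_factory=list)

    def transcribed_regions(self) -> List[TranscribedRegion]:
        # Walk the 1-to-many links: genome -> chromosomes -> fragments.
        return [f for c in self.chromosomes for f in c.fragments
                if isinstance(f, TranscribedRegion)]


# Made-up coordinates, purely to exercise the structure.
chr3 = Chromosome("III", [NonTranscribedRegion(1, 99), TranscribedRegion(100, 400)])
genome = Genome("S. cerevisiae", [chr3])
```

Modelling the two region kinds as subclasses mirrors the specialisation in the diagram; a single class with a boolean flag would also work.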
23
Chromosome
  • A chromosome is a DNA molecule containing genes
    in linear order.

Chromosome III from yeast. Genes are shown
shaded on the different strands of DNA.
24
Gene
  • A gene is a discrete unit of inherited
    information.
  • A gene is transcribed into RNA, which either:
  • Functions directly in the cell, or
  • Is translated into protein.

non transcribed
transcribed
25
Model Revisited
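[Diagram: the same model as before. A Genome comprises Chromosomes
(1 to many); a Chromosome comprises Chromosome Fragments (1 to many);
a Chromosome Fragment is either a Transcribed Region or a
NonTranscribed Region.
Annotation on NonTranscribed Region: not all junk DNA.]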
26
Translation Data Model
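[Diagram: a Transcribed Region (DNA) is linked by transcription to
RNA; RNA is specialised into tRNA, rRNA, snRNA and mRNA; mRNA is
linked by translation to Protein, which is made up of Amino Acids.]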
27
Transcription
  • In transcription, DNA is used as a template for
    the creation of RNA.

DNA            RNA
A  Adenine     A  Adenine
C  Cytosine    C  Cytosine
G  Guanine     G  Guanine
T  Thymine     U  Uracil
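The base mapping in the table above can be sketched in a few lines of Python. This is a simplification under a stated assumption: the input is taken to be the coding strand, so the RNA has the same sequence with U in place of T (in the cell, the RNA is built as the complement of the template strand). The function name `transcribe` is hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA matching a DNA coding strand (T is replaced by U)."""
    dna = coding_strand.upper()
    unknown = set(dna) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected bases: {sorted(unknown)}")
    return dna.replace("T", "U")


# The first twelve bases of the coding region in the EMBL entry shown later.
rna = transcribe("atggattttatt")
```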
28
Translation
  • In translation a protein sequence is synthesised
    according to the sequence of an mRNA molecule.
  • Four nucleotides (A, C, G, U) contribute to mRNA.
  • Twenty amino acids contribute to protein.

Codons                 Amino acid
AAA, AAG               Lysine (Lys)
GCU, GCC, GCA, GCG     Alanine (Ala)
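A minimal sketch of codon lookup, assuming Python and using only the six codons listed on this slide. The full genetic code has 64 codons, so this partial table is deliberately incomplete and the `translate` helper is hypothetical.

```python
# Partial codon table: only the entries shown on the slide.
CODON_TABLE = {
    "AAA": "Lys", "AAG": "Lys",
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
}


def translate(mrna: str) -> list:
    """Map each three-base codon of an mRNA to its amino acid."""
    if len(mrna) % 3 != 0:
        raise ValueError("mRNA length must be a multiple of 3")
    codons = [mrna[i:i + 3] for i in range(0, len(mrna), 3)]
    # A codon missing from the partial table raises KeyError.
    return [CODON_TABLE[codon] for codon in codons]


peptide = translate("AAAGCUAAGGCG")
```

Several codons map to the same amino acid, which is why four nucleotides in triplets (64 combinations) are enough to encode twenty amino acids.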

29
Molecular Structures
An abstract view of a globular protein of unknown
function (Zarembinski et al., PNAS 95, 1998)
The double helix of DNA (http://www.bio.cmu.edu/Programs/Courses/)
30
Genome Facts
Organism   Chromosomes   Genes    Base pairs
Human      22 + X,Y      25000    3.2 billion
Yeast      16            6000     12 million
E. coli    1             3500     4.6 million
31
Growth in Data Volumes
  • Non-redundant growth of sequences during
    1988-1998 (black) and the corresponding growth in
    the number of structures (red).

32
General Growth Patterns
[Figure: growth in experimental production of stuff over time
(recently, now, soon), rising from "some" through "lots" to "loads".]
An emphasis on quantity could lead to oversights
relating to complexity
33
Making Sense of Sequences
  • The sequencing of a genome leaves two crucial
    questions
  • What is the individual behaviour of each protein?
  • How does the overall behaviour of a cell follow
    from its genetic make-up?

In yeast, the function of slightly over 50% of
the proteins has been detected experimentally or
predicted through sequence similarity.
34
Reverse Engineering
  • The genome is the source of a program by an
    inaccessible author, for which no documentation
    is available.
  • Functional genomics seeks to develop and document
    the functionality of the program by observing its
    runtime behaviour.

35
Functional Genomics
Sequence data
Functional data
36
The 'omes
  • Genome: the total DNA sequence of an organism
    (static).
  • Transcriptome: a measure of the mRNA present in a
    cell at a point in time.
  • Proteome: a measure of the protein present in a
    cell at a point in time.
  • Metabolome: a record of the metabolites in a cell
    at a point in time.

37
Transcriptome
  • Microarrays (DNA Chips) can measure many
    thousands of transcript levels at a time.
  • Arrays allow transcript levels to be compared at
    different points in time.

38
Transcriptome Features
  • Loads of data
  • Comprehensive in coverage.
  • High throughput.
  • Challenging to interpret
  • Normalisation.
  • Clustering.
  • Time series.

39
Proteome
  • Most proteome experiments involve separation then
    measurement.
  • 2D Gels separate a sample according to mass and
    pH, so that (hopefully) each spot contains one
    protein.

Proteome Database: http://www.expasy.ch/
40
Mass Spectrometry
  • Individual spots can be analysed using (one of
    many) mass spectrometry techniques.
  • This can lead to the identification of specific
    proteins in a sample.

Mass spec. results for yeast. (http://www.cogeme.man.ac.uk)
41
Modelling Proteome Data
  • Describing individual functional data sets is
    often challenging in itself.

42
Proteome Features
  • Moderate amounts of data
  • Partial in coverage.
  • Medium throughput.
  • Challenging to interpret
  • Protein identification.

43
Protein Interactions
  • Experimental techniques can also be used to
    identify protein interactions.

Protein-protein interaction viewer
highlighting proteins based on cellular location
(http://img.cs.man.ac.uk/gims)
44
Metabolome
  • A metabolic pathway describes a series of
    reactions.
  • Such pathways bring together collections of
    proteins and the small molecules with which they
    interact.

Glucose metabolism in yeast from WIT
(http://wit.mcs.anl.gov/WIT2/)
45
Summary: Genome Level Data
  • Genome sequencing is moving fast
  • Several genomes fully sequenced.
  • Many genomes partially sequenced.
  • The sequence is not the whole story
  • Many genes are of unknown function.
  • Developments in functional genomics are yielding
    new and challenging data sets.

46
Useful URLs on Genomic Data
  • Nature Genome Gateway
  • http://www.nature.com/genomics/
  • UK Medical Research Council Demystifying Genomics
    document
  • http://www.mrc.ac.uk/PDFs/dem_gen.pdf
  • Genomic glossary
  • http://www.genomicglossaries.com/
  • Teaching resources
  • http://www.iacr.bbsrc.ac.uk/notebook/index.html

47
Genomic Databases
http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html
http://srs.ebi.ac.uk/
48
Key points
  • What do the databases contain?
  • Broad vs deep
  • Primary vs secondary
  • What are the database services?
  • Architecture Web browsing paradigm
  • How are the databases published?
  • How are the data represented?
  • Annotation
  • How are the databases curated?

49
A Paradigm Shift
  • Publishing journals
  • Publishing data

Re-analysable
50
Broad vs Deep Databases
  • Broad: clustered around data type or biological
    system across multiple species
  • Sequence: protein (Swiss-Prot), nucleotide
    (EMBL), patterns (InterPro)
  • Genomic: transcriptome (MaxD), pathway (WIT)
  • Deep: data integrated across a species
  • Saccharomyces cerevisiae: MIPS, SGD, YPD
  • FlyBase, MouseBase, XXXBase

51
Broad Example - MaxD
MaxD is a relational implementation of the
ArrayExpress proposal for a transcriptome
database.
http://www.bioinf.man.ac.uk
52
Broad Example - WIT
WIT is a WWW resource providing access to
metabolic pathways from many species.
http://wit.mcs.anl.gov
53
Deep Example - MIPS
MIPS is one of several sites providing access,
principally for browsing, to both sequence and
functional data.
http://www.mips.biochem.mpg.de/
54
Deep Example - SGD
SGD contains sequence, function and literature
information on S. cerevisiae, mostly for browsing
and viewing.
http://genome-www.stanford.edu/Saccharomyces/
55
Primary Databases
  • Primary source: generated by experimentalists.
  • Role: standards, quality thresholds,
    dissemination
  • Sequence databases: EMBL, GenBank
  • Increasingly other data types: micro-array

56
Secondary databases
  • Secondary source: derived from repositories, other
    secondary databases, analysis and expertise.
  • Role: distilled and accumulated specialist
    knowledge. Value-added commentary called
    annotation
  • Swiss-Prot, PRINTS, CATH, PAX6, Enzyme, dbSNP
  • Role: warehouses to support analysis over
    replicated data
  • GIMS, aMAZE, InterPro

57
Database collection flows
Secondary Database
Primary Database
Secondary Database
Secondary Database
Analysis
58
InterPro Data Flow
http://www.ebi.ac.uk/interpro/dataflow_scheme.html
59
Services Architecture
Browser
Visualisation
User Programs
Access manager
Analysis Library
Storage manager
RDBMS
OODBMS
Home brewed DBMS
Flat files
60
How do I use a database?
  • Web browser
  • Cut and paste, point and click
  • Query by navigation
  • Results in flat file formats or graphical
  • Screen-scraping
  • Perl scripts over downloaded flat files
  • The most popular form
  • XML formats taking hold
  • Beginnings of APIs in CORBA
  • But still limited to call-interface rather than
    queries

61
Example Visualisation
Mouse Atlas: http://genex.hgu.mrc.ac.uk/
62
Query and Browse
63
Browse
64
Inter-database referential integrity
65
Inter-database references
66
Query based retrieval?
Browser
Visualisation
User Programs
Access manager
Analysis Library
Query manager
Storage manager
RDBMS
OODBMS
Home brewed DBMS
67
Query Expressions
  • A query interface through a web browser or
    command line
  • AceDB language (SGD)
  • Icarus (SRS)
  • SQL?
  • APIs generally don't allow query submission

68
Two (three) tier delivery
Browsing Analysis
Production Database
Publication Server
RDBMS OODBMS Home-grown
Local copy database
Bundled application or export files
69
EMBL Flat File Format part 1
  • ID TRBG361 standard; RNA; PLN; 1859 BP.
  • AC X56734; S46826;
  • SV X56734.1
  • DT 12-SEP-1991 (Rel. 29, Created)
  • DT 15-MAR-1999 (Rel. 59, Last updated, Version 9)
  • DE Trifolium repens mRNA for non-cyanogenic
    beta-glucosidase
  • KW beta-glucosidase.
  • OS Trifolium repens (white clover)
  • OC Eukaryota; Viridiplantae; Streptophyta;
    Embryophyta; Tracheophyta;
  • OC Spermatophyta; Magnoliophyta;
    eudicotyledons; core eudicots; Rosidae;
  • OC eurosids I; Fabales; Fabaceae;
    Papilionoideae; Trifolium.
  • RN [5]
  • RP 1-1859
  • RX MEDLINE; 91322517.
  • RA Oxtoby E., Dunn M.A., Pancoro A., Hughes
    M.A.;
  • RT "Nucleotide and derived amino acid sequence
    of the cyanogenic
  • RT beta-glucosidase (linamarase) from white
    clover (Trifolium repens L.).";
  • RL Plant Mol. Biol. 17:209-219(1991).

70
EMBL Flat File Format part 2
  • DR AGDR; X56734; X56734.
  • DR MENDEL; 11000; Trirp;1162;11000.
  • DR SWISS-PROT; P26204; BGLS_TRIRP.
  • FH Key Location/Qualifiers
  • FH
  • FT source 1..1859
  • FT /db_xref="taxon:3899"
  • FT /organism="Trifolium repens"
  • FT /tissue_type="leaves"
  • FT /clone_lib="lambda gt10"
  • FT /clone="TRE361"
  • FT CDS 14..1495
  • FT /db_xref="SWISS-PROT:P26204"
  • FT /note="non-cyanogenic"
  • FT /EC_number="3.2.1.21"
  • FT /product="beta-glucosidase"
  • FT /protein_id="CAA40058.1"

71
EMBL Flat File Format part 3
  • FT /translation="MDFIVAIFALFVISS
    FTITSTNAVEASTLLDIGNLSRSSFPRGFI
  • FT FGAGSSAYQFEGAVNEGGRGPSIWDTFTH
    KYPEKIRDGSNADITVDQYHRYKEDVGIMK
  • FT DQNMDSYRFSISWPRILPKGKLSGGINHE
    GIKYYNNLINELLANGIQPFVTLFHWDLPQ
  • FT VLEDEYGGFLNSGVINDFRDYTDLCFKEF
    GDRVRYWSTLNEPWVFSNSGYALGTNAPGR
  • FT CSASNVAKPGDSGTGPYIVTHNQILAHAE
    AVHVYKTKYQAYQKGKIGITLVSNWLMPLD
  • FT DNSIPDIKAAERSLDFQFGLFMEQLTTGD
    YSKSMRRIVKNRLPKFSKFESSLVNGSFDF
  • FT IGINYYSSSYISNAPSHGNAKPSYSTNPM
    TNISFEKHGIPLGPRAASIWIYVYPYMFIQ
  • FT EDFEIFCYILKINITILQFSITENGMNEF
    NDATLPVEEALLNTYRIDYYYRHLYYIRSA
  • FT IRAGSNVKGFYAWSFLDCNEWFAGFTVRF
    GLNFVD"
  • FT mRNA 1..1859
  • FT /evidence=EXPERIMENTAL
  • XX
  • SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581
    T; 0 other;
  • aaacaaacca aatatggatt ttattgtagc catatttgct
    ctgtttgtta ttagctcatt 60
  • cacaattact tccacaaatg cagttgaagc ttctactctt
    cttgacatag gtaacctgag 120
  • tcggagcagt tttcctcgtg gcttcatctt tggtgctgga
    tcttcagcat accaatttga 180
  • aggtgcagta aacgaaggcg gtagaggacc aagtatttgg
    gataccttca cccataaata 240
  • etc.
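Since each line of the flat file starts with a two-letter code, a script can group lines by code, which is roughly what the Perl scripts mentioned earlier do. The following is a minimal Python sketch under stated assumptions: the sample text is an abbreviated copy of the entry above with the usual EMBL punctuation and column layout assumed, and `parse_entry` is a hypothetical helper, not part of any EMBL tool.

```python
from collections import defaultdict

# Abbreviated copy of the entry above; separators and layout are assumed.
SAMPLE = """\
ID   TRBG361    standard; RNA; PLN; 1859 BP.
AC   X56734; S46826;
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
KW   beta-glucosidase.
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolium.
//
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        code, value = line[:2], line[5:].strip()
        if code.strip() and code != "XX":   # skip spacer lines
            fields[code].append(value)
    return dict(fields)


entry = parse_entry(SAMPLE)
accessions = [a.strip() for a in " ".join(entry["AC"]).split(";") if a.strip()]
lineage = [t.strip(" .") for t in " ".join(entry["OC"]).split(";") if t.strip(" .")]
```

Even this small example has to know that `OC` continues over several lines and that `;` separates values, which illustrates why flat files are easy to publish but awkward to query.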

72
XML embraced
  • Side effect of publication through flat files and
    textual annotation
  • XML for distribution, storage and interoperation,
  • e.g. BLASTXML, Distributed Annotation System
  • Many XML genome annotation DTDs
  • Sequence BIOML, BSML, AGAVE, GAME
  • Function MAML, MaXML
  • http://www.bioxml.org/
  • I3C vendors attempt to coordinate activities and
    promote XML for integration
  • http://i3c.open-bio.org

73
Move to OO interfaces
  • OO APIs to RDBMS or flat files
  • CORBA activity
  • EMBL CORBA Server
  • OMG Life Sciences Research
  • OMG not yet taken hold

http://corba.ebi.ac.uk/EMBL_embl.html
http://lsr.ebi.ac.uk/
74
Annotation and Curation
  • The elucidation and description of biologically
    relevant features in a sequence
  • Computationally formed: e.g. cross references to
    other database entries, date collected
  • Intellectually formed: the accumulated knowledge
    of an expert distilling the aggregated
    information drawn from multiple data sources and
    analyses, and the annotator's knowledge.

75
Annotation Distillation
76
Swiss-Prot Annotation
ID   PRIO_HUMAN STANDARD; PRT; 253 AA.
AC   P04156;
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H., Prusiner S.B., Dearmond S.J.;
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [6]
RP   STRUCTURE BY NMR OF 23-231.
RX   MEDLINE=97424376; PubMed=9280298;
RA   Riek R., Hornemann S., Wider G., Glockshuber R., Wuethrich K.;
RT   "NMR characterization of the full-length recombinant murine prion protein, mPrP(23-231).";
RL   FEBS Lett. 413:282-288(1997).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE HOST GENOME AND IS
CC       EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND ANIMALS INFECTED WITH
CC       NEURODEGENERATIVE DISEASES KNOWN AS TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC       DISEASES, LIKE CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME (GSS),
CC       FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE IN SHEEP AND GOAT; BOVINE
CC       SPONGIFORM ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME);
CC       CHRONIC WASTING DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM ENCEPHALOPATHY
CC       (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY(EUE) IN NYALA AND GREATER KUDU. THE
CC       PRION DISEASES ILLUSTRATE THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED FOODSTUFFS.
CC   -!- SIMILARITY: BELONGS TO THE PRION FAMILY.
DR   HSSP; P04925; 1AG2.
DR   MIM; 176640; -.
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism; Disease mutation.
77
PRINTS Annotation
gc PRION
gx PR00341
gt Prion protein signature
gp INTERPRO IPR000817
gp PROSITE PS00291 PRION_1 PS00706 PRION_2
gp BLOCKS BL00291
gp PFAM PF00377 prion
bb
gr 1. STAHL, N. AND PRUSINER, S.B.
gr Prions and prion proteins.
gr FASEB J. 5 2799-2807 (1991).
gr
gr 2. BRUNORI, M., CHIARA SILVESTRINI, M. AND POCCHIARI, M.
gr The scrapie agent and the prion hypothesis.
gr TRENDS BIOCHEM.SCI. 13 309-313 (1988).
gr
gr 3. PRUSINER, S.B.
gr Scrapie prions.
gr ANNU.REV.MICROBIOL. 43 345-374 (1989).
bb
gd Prion protein (PrP) is a small glycoprotein found in high quantity in the brain of animals infected with
gd certain degenerative neurological diseases, such as sheep scrapie and bovine spongiform encephalopathy (BSE),
gd and the human dementias Creutzfeldt-Jacob disease (CJD) and Gerstmann-Straussler syndrome (GSS). PrP is
gd encoded in the host genome and is expressed both in normal and infected cells. During infection, however, the
gd PrP molecules become altered and polymerise, yielding fibrils of modified PrP protein.
gd
gd PrP molecules have been found on the outer surface of plasma membranes of nerve cells, to which they are
gd anchored through a covalent-linked glycolipid, suggesting a role as a membrane receptor. PrP is also
gd expressed in other tissues, indicating that it may have different functions depending on its location.
gd
gd The primary sequences of PrP's from different sources are highly similar: all bear an N-terminal domain
gd containing multiple tandem repeats of a Pro/Gly rich octapeptide; sites of Asn-linked glycosylation; an
gd essential disulphide bond; and 3 hydrophobic segments. These sequences show some similarity to a chicken
gd glycoprotein, thought to be an acetylcholine receptor-inducing activity (ARIA) molecule. It has been
gd suggested that changes in the octapeptide repeat region may indicate a predisposition to disease, but it is
gd not known for certain whether the repeat can meaningfully be used as a fingerprint to indicate susceptibility.
gd
gd PRION is an 8-element fingerprint that provides a signature for the prion proteins. The fingerprint was
gd derived from an initial alignment of 5 sequences: the motifs were drawn from conserved regions spanning
gd virtually the full alignment length, including the 3 hydrophobic domains and the octapeptide repeats
gd (WGQPHGGG). Two iterations on OWL18.0 were required to reach convergence, at which point a true set comprising
gd 9 sequences was identified. Several partial matches were also found: these include a fragment (PRIO_RAT)
gd lacking part of the sequence bearing the first motif, and the PrP homologue found in chicken - this matches
gd well with only 2 of the 3 hydrophobic motifs (1 and 5) and one of the other conserved regions (6), but has an
gd N-terminal signature based on a sextapeptide repeat (YPHNPG) rather than the characteristic PrP octapeptide.
78
The Annotation Pipeline
PRINTS
EMBL
Swiss-Prot
GPCRDB
TrEMBL
Analysis
79
UnStructured Literature
  • Biology is knowledge based
  • The insights are in the literature

80
Semi-Structured
  • Schemaless Descriptions
  • Evolving
  • Non-predictive
  • The structured part of the schema is open to
    change
  • Hence the prevalence of flat file mark-up

81
Typical Database Services
  1. Browsing
  2. Visualisation
  3. Querying
  4. Analysis
  5. API

? ? ?
? ? ?
?
-
?
Focus on a person sitting in front of a Web
browser pointing and clicking
82
Typical Genomic Databases
Single Genome Multiple Genomes Sequence Function
Browse ? ? ? ? ?
Visualise ? ? ? ? ?
Query
Analyse
Broad Databases
Deep Databases
83
Description based Data
Semi-structured Data
standards
Information Extraction
Ontologies Controlled vocabularies
84
InterPro Relational Schema
ENTRY2METHOD(entry_ac, method_ac, timestamp)
ENTRY2PUB(entry_ac, pub_id)
ENTRY2ENTRY(entry_ac, parent_ac)
ENTRY2COMP(entry1_ac, entry2_ac)
ABSTRACT(entry_ac, abstract)
ENTRY(entry_ac, name, entry_type)
METHOD(method_ac, name, dbcode, method_date)
EXAMPLE(entry_ac, protein_ac, description)
ENTRY_ACCPAIR(entry_ac, secondary_ac, timestamp)
PROTEIN(protein_ac, name, CRC64, dbcode, length, fragment, seq_date, timestamp)
CV_ENTRYTYPE(code, abbrev, description)
MATCH(protein_ac, method_ac, pos_from, pos_to, status, seq_date, timestamp)
PROTEIN2GENOME(protein_ac, oscode)
CV_DATABASE(dbcode, dbname, dborder)
ORGANISM(oscode, taxid, name)
PROTEIN2ACCPAIR(protein_ac, secondary_ac)
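To show how such a schema supports queries, here is a sketch using Python's built-in `sqlite3` module with four of the tables above. Table and column names come from the slide; the column types, keys and the `entries_for_protein` helper are assumptions. The accession numbers are those from the prion examples earlier, and the assumption that both signatures belong to entry IPR000817 is made only for illustration. Unknown values are left NULL rather than invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE entry (entry_ac TEXT PRIMARY KEY, name TEXT, entry_type TEXT);
CREATE TABLE method (method_ac TEXT PRIMARY KEY, name TEXT, dbcode TEXT, method_date TEXT);
CREATE TABLE entry2method (
    entry_ac  TEXT REFERENCES entry(entry_ac),
    method_ac TEXT REFERENCES method(method_ac),
    timestamp TEXT
);
-- MATCH is an SQL keyword in SQLite, so the table name is quoted.
CREATE TABLE "match" (
    protein_ac TEXT, method_ac TEXT REFERENCES method(method_ac),
    pos_from INTEGER, pos_to INTEGER, status TEXT, seq_date TEXT, timestamp TEXT
);
''')
conn.execute("INSERT INTO entry VALUES ('IPR000817', NULL, NULL)")
conn.executemany("INSERT INTO method VALUES (?, ?, NULL, NULL)",
                 [("PR00341", "PRION"), ("PF00377", "prion")])
conn.executemany("INSERT INTO entry2method VALUES ('IPR000817', ?, NULL)",
                 [("PR00341",), ("PF00377",)])
conn.executemany('INSERT INTO "match" (protein_ac, method_ac) VALUES (?, ?)',
                 [("P04156", "PR00341"), ("P04156", "PF00377")])


def entries_for_protein(protein_ac):
    """Find the entries whose member signatures match a protein."""
    return conn.execute('''
        SELECT DISTINCT e2m.entry_ac, m.method_ac, m.name
        FROM "match" AS ma
        JOIN method AS m ON m.method_ac = ma.method_ac
        JOIN entry2method AS e2m ON e2m.method_ac = m.method_ac
        WHERE ma.protein_ac = ?
        ORDER BY m.method_ac
    ''', (protein_ac,)).fetchall()


rows = entries_for_protein("P04156")
```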
85
Controlled Vocabularies
ID   PRIO_HUMAN STANDARD; PRT; 253 AA.
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE HOST GENOME AND IS
CC       EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND ANIMALS INFECTED WITH
CC       NEURODEGENERATIVE DISEASES KNOWN AS TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC       DISEASES, LIKE CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME (GSS),
CC       FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE IN SHEEP AND GOAT; BOVINE
CC       SPONGIFORM ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME);
CC       CHRONIC WASTING DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM ENCEPHALOPATHY
CC       (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY(EUE) IN NYALA AND GREATER KUDU. THE
CC       PRION DISEASES ILLUSTRATE THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED FOODSTUFFS.
CC   -!- SIMILARITY: BELONGS TO THE PRION FAMILY.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism; Disease mutation.
86
Controlled Vocabularies
  • Data resources have been built introspectively
    for human researchers
  • Information is machine readable not machine
    understandable
  • Sharing vocabulary is a step towards unification

87
Ontologies in Bioinformatics
  • Controlled vocabularies for genome annotation
  • Gene Ontology, MGED, Mouse Anatomy
  • Searching and retrieval
  • above MeSH
  • Communication framework for resource mediation
  • TAMBIS
  • Knowledge acquisition and hypothesis generation
  • Ecocyc, Riboweb
  • Information extraction and annotation generation
  • EMPathIE and PASTA
  • BioOntology Consortium (BOC)

88
Gene Ontology
  • Controlled vocabularies for the description of
    the molecular function, biological process and
    cellular component of gene products.
  • Terms are used as attributes of gene products by
    collaborating databases, facilitating uniform
    queries across them.
  • 6,000 concepts

http://www.geneontology.org/
89
Gene Ontology
90
How GO is used by databases
  • Making database cross-links between GO terms and
    objects in their database (typically, gene
    products, or their surrogates, genes), and then
    providing tables of these links to GO
  • Supporting queries that use these terms in their
    database
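A minimal sketch of the two uses listed above: a table of links between terms and gene products, and a query that uses a term. Everything here is hypothetical and for illustration only: the term identifiers, the gene names, the is-a links, and the choice to have a query on a term also return products annotated with its more specific terms.

```python
# Hypothetical is-a links between terms (child -> parents).
PARENTS = {
    "T:plasma_membrane": {"T:membrane"},
    "T:membrane": {"T:cellular_component"},
    "T:cellular_component": set(),
}

# Hypothetical link table: (gene product, term it is annotated with).
ANNOTATIONS = [
    ("geneA", "T:plasma_membrane"),
    ("geneB", "T:membrane"),
    ("geneC", "T:cellular_component"),
]


def ancestors(term):
    """Collect every term reachable by following is-a links upwards."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def products_for(term):
    """Gene products annotated with the term or any more specific term."""
    return sorted(gene for gene, annotated in ANNOTATIONS
                  if annotated == term or term in ancestors(annotated))


membrane_products = products_for("T:membrane")
```

Because every collaborating database uses the same term identifiers, the same query could be run unchanged against each database's link table.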

91
Information Extraction
  • Annotation to annotation
  • Irbane SWISS-PROT to PRINTS annotations
  • Protein Annotators Workbench
  • From online searchable journal articles
  • EMPathIE: Enzyme and Metabolic Path Information
    Extraction
  • PASTA: Protein structure extraction from texts to
    support the annotation of PDB
  • http://www.dcs.shef.ac.uk/research/groups/nlp/
  • PIES: Protein interaction extraction system
  • BioPATH: http://www.lionbioscience.com/

92
Research on Term Extraction in Biology
  • Rule based (linguistics)
  • terminology lexicons derived from biology
    databases and annotated corpora
  • Hybrid (statistics and linguistics)
  • pattern extraction, information categorisation
    using clustering, automated term recognition
  • Machine Learning (Decision Trees, HMM)
  • Text in Biology (BRIE OAP) 2001
  • http://bioinformatics.org/bof/brie-oap-01/
  • Natural language processing of biology text
  • http://www.ccs.neu.edu/home/futrelle/bionlp/

93
PASTA Protein Structure
94
Summary (1)
  • Sequence data has a good data abstraction: the
    sequence
  • No obvious or good abstractions for functional
    genomic data yet
  • Descriptive models
  • Unstable schemas
  • Retain all results in primary database just in
    case (e.g. microarray images)

95
Summary (2)
  • Reliance on description
  • Semi-structured data
  • Controlled vocabularies
  • Text extraction
  • High value on expert curation
  • Knowledge warehouses
  • Labour intensive

96
Summary (3)
  • Current dominant delivery paradigms
  • Document publication flat files
  • Web browsing interactive visualisation
  • Human readable vs machine understandable
  • High connectivity between different databases for
    making links between pieces of evidence
  • Poor mechanisms for maintaining the connectivity
  • Integration considered essential

97
Biological Database Integration
98
Motivation
  • Quantity of biological resources
  • Databases.
  • Analysis tools.
  • Databases represented in Nucleic Acids Research,
    January 2001: 96.
  • Many meaningful requests require access to data
    from multiple sources.

99
Difficulties
  • All the usual ones
  • Heterogeneity.
  • Autonomy.
  • Distribution.
  • Inconsistency.
  • And a few more as well
  • Focus on interactive interfaces.
  • Widespread use of free text.

100
Example Queries
  • Retrieve the motifs of proteins from S.
    cerevisiae.
  • Retrieve proteins from A. fumigatus that are
    homologous to those in S. cerevisiae.
  • Retrieve the motifs of proteins from A. fumigatus
    that are homologous to those in S. cerevisiae.

101
Possible Solutions
  • Many different approaches have been tried
  • SRS: file-based indexing and linking.
  • BioNavigator: type-based linking of resources.
  • Kleisli: semi-structured database querying.
  • DiscoveryLink: database-oriented middleware.
  • TAMBIS: ontology-based integration.
  • Some standards are emerging
  • OMG Life Sciences.
  • I3C.

102
SRS
Sequence Retrieval System: http://srs.ebi.ac.uk/
103
SRS In Use
List of Databases
Search Interfaces
Selected Databases
104
Searching in SRS
Search Fields
Boolean Condition
105
SRS Results
Links to Result Records
106
PRINTS Database Record
File Format from Source
Link to Other Databases
107
Link Following
Related record from SPTREMBL
Reference back to PRINTS
108
Features of SRS
  • Single access point to many sources.
  • Consistent, if limited, searching.
  • Fast.
  • No global model, so suffers from the N² problem
    when linking sources.
  • No reorganisation of source data.
  • Minimal transparency.
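The quadratic growth in links can be made concrete with a little arithmetic. This Python sketch assumes the worst case in which every source links directly to every other source, and compares it with one mapping per source to a shared global model. The function names are hypothetical; the source counts 96 and 500 are the database counts quoted elsewhere in this tutorial.

```python
def pairwise_links(n: int) -> int:
    """Directed links needed if each of n sources links to every other one."""
    return n * (n - 1)


def global_model_mappings(n: int) -> int:
    """Mappings needed if each source is mapped once to a shared model."""
    return n


counts = {n: (pairwise_links(n), global_model_mappings(n)) for n in (5, 96, 500)}
```

For 96 sources this gives 9120 pairwise links against 96 mappings, which is the motivation for the global-schema approaches described later.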

109
BioNavigator
  • BioNavigator combines data sources and the tools
    that act over them.
  • As tools act on specific kinds of data, the
    interface makes available only tools that are
    applicable to the data in hand.
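Offering only the tools that apply to the data in hand can be sketched as a lookup on data type. This is not BioNavigator's implementation, just an illustration in Python; the tool names, type names and the `applicable_tools` helper are all hypothetical.

```python
from typing import List, NamedTuple


class Tool(NamedTuple):
    name: str
    accepts: str    # kind of data the tool takes as input
    produces: str   # kind of data the tool returns


# Hypothetical registry of tools and the data types they act on.
REGISTRY = [
    Tool("translate", "nucleotide_sequence", "protein_sequence"),
    Tool("motif_scan", "protein_sequence", "motif_list"),
    Tool("structure_viewer", "protein_structure", "image"),
]


def applicable_tools(data_type: str) -> List[str]:
    """Names of the tools that can act on data of the given type."""
    return [tool.name for tool in REGISTRY if tool.accepts == data_type]


options = applicable_tools("protein_sequence")
```

Because each tool also declares what it produces, the output type of one step determines the options offered at the next, which is what makes chained navigation possible.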

Online trial from https://www.bionavigator.com/
110
Initiating Navigation
Select database
Enter accession number
111
Viewing Selected Data
Navigate to related programs
Relevant display options
112
Listing Possible Applications
Programs acting on protein structures
113
Viewing Results
Several views of result available
114
Chaining Analyses in Macros
Chained collections of navigations can be saved
as macros and restored for later use.
115
Features of BioNavigator
  • Single access point for many tools over a
    collection of databases.
  • Easy-to-use interface.
  • Not really query oriented.
  • User selects order of access.
  • Possible to minimise exposure to file formats.

116
Kleisli
  • Many biological sources make data available as
    structured flat files.
  • Such structures can be naturally represented and
    manipulated using complex value models.
  • Kleisli uses a comprehension-based query language
    (CPL) over such models.

117
Architecture
Kleisli supports client-side wrapping of sources,
which surface to CPL as functions.
Online demos: http://sdmc.krdl.org.sg:8080/kleisli/demos/
118
Queries
  • Queries can refer to multiple sources by calling
    driver functions.
  • Example: Which motifs are components of guppy
    proteins?

{m | \p <- get-sp-entry-by-os("guppy"),
     \m <- go-prosite-scan-by-entry-rec(p)}
Query calls drivers from two sources
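For readers unfamiliar with comprehension syntax, the same query shape can be written as a Python set comprehension. This is an analogy, not CPL: the two functions below are stand-ins for the Kleisli drivers and return canned placeholder records (hypothetical identifiers, not real Swiss-Prot or PROSITE data).

```python
def get_sp_entry_by_os(organism):
    """Stand-in for the Swiss-Prot driver; returns placeholder records."""
    canned = {"guppy": [{"id": "entry_1"}, {"id": "entry_2"}]}
    return canned.get(organism, [])


def go_prosite_scan_by_entry_rec(entry):
    """Stand-in for the PROSITE scan driver; returns placeholder motifs."""
    canned = {"entry_1": ["motif_a", "motif_b"], "entry_2": ["motif_b"]}
    return canned.get(entry["id"], [])


# Each generator binds a variable, as \p and \m do in the CPL query.
motifs = {m
          for p in get_sp_entry_by_os("guppy")
          for m in go_prosite_scan_by_entry_rec(p)}
```

The second generator depends on the variable bound by the first, so the second source is called once per record returned by the first.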
119
Features of Kleisli
  • Query-oriented access to many sources.
  • Comprehensive querying.
  • No global model as such.
  • Not really a user level language.
  • Some barriers to optimisation.

L. Wong, Kleisli, its Exchange Format, Supporting
Tools and an Application in Protein Interaction
Extraction, Proc. BIBE, 21-28, IEEE Press, 2000.
S.B. Davidson, et al., K2/Kleisli and GUS:
Experiments in integrated access to genomic data
sources, IBM Systems Journal, 40(2), 512-531,
2001.
120
DiscoveryLink
  • DiscoveryLink: Garlic and DataJoiner technology
    applied to bioinformatics.
  • In contrast with Kleisli:
  • Relational not complex value model.
  • SQL not CPL for querying.
  • More emphasis on optimisation.
  • Wrappers map sources to relational model.

121
DiscoveryLink Example
  • Not much to see: an SQL query ranges over tables
    from different databases.

SELECT a.nsc, b.compound_name
FROM nci_results a, nci_names b
WHERE panel_number = <user selected>
AND cell_number = <user selected>
AND a.nsc = b.nsc
Description online: www.ibm.com/discoverylink
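
A single SQL query ranging over tables held in different databases can be tried out locally with SQLite. This is only a sketch of the idea: two attached in-memory databases stand in for two wrapped sources, the column types and all rows are invented, and nothing here reflects how DiscoveryLink itself is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two separate in-memory databases stand in for two wrapped sources.
conn.execute("ATTACH DATABASE ':memory:' AS results_db")
conn.execute("ATTACH DATABASE ':memory:' AS names_db")

conn.execute(
    "CREATE TABLE results_db.nci_results "
    "(nsc INTEGER, panel_number INTEGER, cell_number INTEGER)"
)
conn.execute("CREATE TABLE names_db.nci_names (nsc INTEGER, compound_name TEXT)")

# Invented rows, for illustration only.
conn.executemany(
    "INSERT INTO results_db.nci_results VALUES (?, ?, ?)",
    [(1, 7, 3), (2, 7, 4), (3, 8, 3)],
)
conn.executemany(
    "INSERT INTO names_db.nci_names VALUES (?, ?)",
    [(1, "compound A"), (2, "compound B"), (3, "compound C")],
)


def compounds_for(panel_number, cell_number):
    """Join results and names across the two databases for user-selected values."""
    query = """
        SELECT a.nsc, b.compound_name
        FROM results_db.nci_results a, names_db.nci_names b
        WHERE a.panel_number = ?
          AND a.cell_number = ?
          AND a.nsc = b.nsc
    """
    return conn.execute(query, (panel_number, cell_number)).fetchall()


print(compounds_for(7, 3))
```

The user-selected values are passed as query parameters rather than pasted into the SQL text.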
122
On Relational Integration
  • Relational model has reasonable presence in
    bioinformatics.
  • More commercial than public domain sources are
    relational.
  • Wrapping certain sources as relations will be
    challenging.

123
TAMBIS
  • TAMBIS: Transparent Access to Multiple
    Bioinformatics Information Sources.
  • In contrast with Kleisli/DiscoveryLink:
  • Important role for global schema.
  • Global schema is a domain ontology.
  • Sources not visible to users.

124
TAMBIS Architecture
  • Ontology described using Description Logic.
  • Query formulation: ontology browsing and concept
    construction.
  • Wrapper service: Kleisli.

Query Formulation Interface
125
Ontology Browsing
Current Concept
Buttons for changing current concept
Online demo: http://img.cs.man.ac.uk/tambis
126
Query Construction
Query: retrieve the motifs that are both
components of guppy proteins and associated with
post-translational modification.
127
Genome Level Integration
  • Few integration proposals have focused on genome
    level information sources.
  • Possible reasons:
  • Most mature sources are gene-level.
  • Lack of standards for genome-level sources.
  • Species-specific genome databases are highly
    heterogeneous.
  • There are few functional genomics databases.

128
Standardisation
  • Most standards in bioinformatics have been de
    facto.
  • The OMG has an ongoing Life Sciences Research
    Activity with standardisation activities in
    sequence analysis, gene expression and
    macromolecular structure.
  • http://www.omg.org/homepages/lsr/
  • XML approach: I3C
  • http://i3c.open-bio.org
  • Open Bio consortium
  • http://www.open-bio.org

129
I3C
  • The Interoperable Informatics Infrastructure
    Consortium (I3C)
  • Open XML-in, XML-out paradigm
  • Services-based for accessing remote analysis
    services
  • http://i3c.open-bio.org/

130
Business vs Biology Data Warehouses
Classical Business | Biological Science
High number of queries over a priori known data aggregates | Query targets frequently change due to new scientific insights/questions
Pre-aggregation easy since business processes/models are straightforward, stable and known a priori | Pre-aggregation not easy since body of formal background knowledge is complex and growing fast
Data necessary often owned by enterprise | Most relevant data resides on globally distributed information systems owned by many organisations
Breakdown of data into N-cubes of few simple dimensions | Complex underlying data structures that are inherently difficult to reduce to many dimensions
Temporal view of data (week, month, year) snapshots | Temporal modelling important but more complex
Source: Dubitzky et al, NETTAB 2001
131
Integrated Genomic Resources
  • For yeast, by way of illustration:
  • MIPS (http://www.mips.biochem.mpg.de/).
  • SGD (http://genome-www.stanford.edu/Saccharomyces/).
  • YPD (http://www.proteome.com/).
  • General features:
  • Integrate data from single species.
  • Limited support for analyses.
  • Limited use of generic integration technologies.

132
Analysing Genomic Data
133
Gene Level Analysis
  • Conventional bioinformatics provides the
    principal gene level analyses, such as:
  • Sequence homology.
  • Sequence alignment.
  • Pattern matching.
  • Structure prediction.

134
Sequence Homology
  • Basic idea:
  • Organisms evolve.
  • Individual genes evolve.
  • Sequences are homologous if they have diverged
    from a common ancestor.
  • Comparing sequences allows inferences to be drawn
    on the presence of homology.
  • Well known similarity search tools:
  • BLAST (http://www.ncbi.nlm.nih.gov/BLAST/).
  • FASTA (http://fasta.genome.ad.jp/).

135
Running BLAST
Search Sequence
Aligned Result
136
Multiple Alignments
  • Multiple sequences can be aligned, possibly with
    gaps or substitutions.
  • Sequence alignment is important to the
    classification of sequences and to function.

CINEMA alignment applet: http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
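
Alignment with gaps and substitutions is easiest to see for two sequences. The sketch below implements Needleman-Wunsch global alignment, a standard dynamic programming method for the pairwise case. The scoring values are arbitrary assumptions, and multiple alignment tools use considerably more elaborate procedures.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: best of substitution, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("ACGT", "AGT"))
```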
137
Pattern Databases
  • Pattern databases are secondary databases of
    patterns associated with alignments.
  • Conserved regions in alignments are known as
    motifs.

InterPro pattern database (http://www.ebi.ac.uk/interpro/)
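
Searching a sequence for a motif can be illustrated by translating a pattern written in PROSITE-style syntax into a regular expression. The converter below is a simplified sketch that handles only a subset of the syntax (residue sets, exclusions, `x` wildcards, repetition counts and sequence-end anchors). The example pattern is the N-glycosylation site pattern, and the protein sequence is invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repetition counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


def motif_positions(pattern: str, sequence: str) -> list:
    """Return start positions of non-overlapping matches of the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(motif_positions("N-{P}-[ST]-{P}.", "MKNGSAANPTL"))
```

Because `finditer` reports non-overlapping matches, overlapping occurrences of a motif would need a lookahead-based search instead.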
138
Protein Structure
  • Structural data is important for understanding
    and explaining protein function.
  • Predicting structure from sequence is an ongoing
    challenge (http://predictioncenter.llnl.gov/).

139
Relevance to Genome Level
  • Making sense of sequence data needs:
  • Identification of gene function.
  • Understanding of evolutionary relationships.
  • Genome level functional data is often understood
    in terms of the results of gene level analyses.
  • Genome sequencing has given new impetus to gene
    level bioinformatics (e.g. in structural genomics,
    http://www.structuralgenomics.org).

140
Genome Level Analysis
  • Genome level analyses can be classified according
    to the data they use.
  • Within a genome:
  • Individual genomic data sets.
  • Multiple genomic data sets.
  • Between genomes:
  • Individual genomic data sets.
  • Multiple genomic data sets.
  • Some examples follow.

141
Sequencing
  • Data management and analysis are essential parts
    of a sequencing project. Typical tasks:
  • Sequence assembly.
  • Gene prediction.
  • Examples of projects supporting the sequencing
    activity:
  • AceDB (http://www.acedb.org/).
  • Ensembl (http://www.ensembl.org/).
  • Providing systematic and effective support for
    sequencing will continue to be important.
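
As a flavour of what gene prediction involves at its very simplest, the sketch below scans the forward strand of a DNA sequence for open reading frames: an ATG start codon followed in the same frame by a stop codon. This is a deliberately naive illustration with an invented input sequence. Real gene predictors also consider the reverse strand, splicing, and statistical models of coding sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    min_codons counts the codons before the stop codon, start codon included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGAAATAGGG"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])
```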

142
ACeDB
  • ACeDB was developed for use in the C. elegans
    genome project.
  • Roles:
  • Storage.
  • Annotation.
  • Browsing.
  • Semi-structured data model.
  • Visual, interactive interface.

C. elegans Genome (http://www.sanger.ac.uk/Projects/C_elegans/)
143
Sequence Similarity
  • Sequence similarity searches can be conducted:
  • Within genomes.
  • Between genomes.
  • Challenges:
  • Performance.
  • Presentation.
  • Interpretation.

Visualisation of regions of sequence similarity
between chromosomes in yeast.
144
Whole Genome Alignment
  • Aligning genomes allows identification of:
  • Homologous genes.
  • Translocations.
  • Single nucleotide changes.
  • Broader studies, for example, might focus on
    understanding pathogenicity.

Comparison of two Staphylococcus strains using
MUMmer (http://www.tigr.org/)
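
Whole genome aligners typically start from short exact matches that anchor the alignment. The sketch below finds fixed-length substrings (k-mers) that occur exactly once in each of two sequences and are shared between them. This is a simplified stand-in chosen for brevity: it is not MUMmer's algorithm, which finds maximal unique matches with a suffix tree, and the input sequences are invented.

```python
from collections import Counter


def unique_shared_kmers(genome_a, genome_b, k):
    """Return sorted (pos_a, pos_b, kmer) for k-mers unique in both sequences."""
    def positions_of_unique(seq):
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        counts = Counter(kmers)
        # Keep only k-mers that occur exactly once, with their position.
        return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}

    unique_a = positions_of_unique(genome_a)
    unique_b = positions_of_unique(genome_b)
    shared = unique_a.keys() & unique_b.keys()
    return sorted((unique_a[kmer], unique_b[kmer], kmer) for kmer in shared)


print(unique_shared_kmers("ACGTTGCA", "TTACGTAG", 4))
```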
145
Another Genome Alignment
  • Fast searching and alignment will grow in
    importance.
  • More sequenced genomes.
  • Sequencing of strains/individuals.
  • Interpreting alignments requires other
    information.

Mycoplasma genitalium v Mycoplasma pneumoniae,
A.L. Delcher et al., Nucleic Acids Res. 27(11),
2369-2376, 1999.
146
Transcriptome
  • Data sets are:
  • Large.
  • Complex.
  • Noisy.
  • Time-varying.
  • Challenges:
  • Normalisation.
  • Clustering.
  • Visualisation.

maxd: http://www.bioinf.man.ac.uk/microarray/
GeneX: http://genex.ncgr.org/
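
One simple form of normalisation is to convert each measurement to a log2 ratio against a reference and then subtract the median, so that the typical gene sits at zero. The sketch below assumes strictly positive intensities and uses invented values. It illustrates the idea only and is not the procedure used by maxd or GeneX.

```python
import math
from statistics import median


def log_ratios(sample, reference):
    """Log2 ratio of sample to reference intensity, gene by gene."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]


def median_centre(values):
    """Shift values so that their median is zero."""
    m = median(values)
    return [v - m for v in values]


# Invented intensities for three genes.
ratios = log_ratios([200, 400, 800], [100, 100, 100])
print(ratios)
print(median_centre(ratios))
```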
147
Transcriptome Results
  • Dot plots allow changes in specific mRNAs to be
    identified.
  • The example shows a comparison of two different
    yeast strains.

148
Transcriptome Clustering
  • The key issue: which genes are co-regulated?
  • Some techniques give absolute and some relative
    expression measures.
  • Experiments compare expression levels for
    different:
  • Strains.
  • Environmental conditions.

Yeast clusters: M.B. Eisen et al., PNAS 95(25),
14863-14868, 1998.
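
A common starting point for finding co-regulated genes is to correlate their expression profiles across conditions and group genes whose profiles move together. The sketch below uses the Pearson correlation and a simple threshold-based single-linkage grouping. The threshold and the profiles are invented, and published analyses such as the one cited use full hierarchical clustering rather than this shortcut.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlated_groups(profiles, threshold=0.9):
    """Group genes linked by a correlation of at least the threshold."""
    groups = []
    for gene, profile in profiles.items():
        # Existing groups containing at least one sufficiently correlated gene.
        linked = [
            group for group in groups
            if any(pearson(profile, profiles[other]) >= threshold for other in group)
        ]
        merged = {gene}.union(*linked)
        groups = [group for group in groups if group not in linked] + [merged]
    return sorted(sorted(group) for group in groups)


# Invented expression profiles over four conditions.
profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8.5],
    "geneC": [4, 3, 2, 1],
    "geneD": [8, 6, 4, 2.5],
}
print(correlated_groups(profiles))
```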
149
Proteome Analysis
  • Driven directly from proteome-centred
    experiments:
  • Identification of proteins in samples.
  • Identification of post-translational
    modifications.
  • Grouping existing protein entries by:
  • Sequence similarity.
  • Sequence family.
  • Structural family.
  • Functional class.

CluSTR: http://www.ebi.ac.uk/proteome/
150
Metabolome Analyses
  • Analysis tasks include:
  • Searching for routes through pathways.
  • Simulating the dynamic behaviour of pathways.
  • Building pathways from known reactions.
  • Other data can be overlaid on pathways (e.g.
    transcriptome).

EcoCyc (Frame Based): http://ecocyc.pangeasystems.com/
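
Searching for a route through a pathway can be treated as path finding in a graph whose nodes are metabolites and whose edges are reactions. The sketch below uses breadth-first search over a tiny hand-written network. The dictionary layout is an assumption made for illustration and is unrelated to EcoCyc's frame-based representation.

```python
from collections import deque

# Small illustrative network: substrate -> direct products.
REACTIONS = {
    "glucose": ["glucose-6-phosphate"],
    "glucose-6-phosphate": ["fructose-6-phosphate", "6-phosphogluconolactone"],
    "fructose-6-phosphate": ["fructose-1,6-bisphosphate"],
}


def find_route(start, goal, reactions=REACTIONS):
    """Breadth-first search for a shortest route between two metabolites."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for product in reactions.get(path[-1], []):
            if product not in seen:
                seen.add(product)
                queue.append(path + [product])
    return None  # no route found


print(find_route("glucose", "fructose-1,6-bisphosphate"))
```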
151
Integrative Analysis
  • Analysing individual data sets is fine.
  • Specialist techniques often required.
  • Many research challenges remain.
  • Analysing multiple data sets is necessary:
  • Understanding the whole story requires all the
    evidence.
  • Most important results yet to come?

152
Further Information
  • IBM Systems Journal 40(2), 2001
  • http://www.research.ibm.com/journal/sj40-2.html

153
Challenges
  • The opportunity for partnership between
    information management providers, researchers
    and biologists is enormous.
  • The challenges of genomic data are even greater
    than for sequence data.
  • There are genuine research issues for information
    management.

154
Information representation
  • Semi-structured description
  • Controlled vocabularies, metadata
  • Complexity of living cells
  • Context: the genome is context-independent and
    static; the transcriptome, proteome etc. are
    context-dependent and dynamic
  • Granularity: molecules to cells to whole
    organisms to populations

155
Information representation
  • Spatial / temporal
  • Time-series data: cell events on different
    timescales
  • Gene expression spatially related to tissue

156
Representational forms
  • A huge digital library
  • Free text
  • literature annotations
  • Images
  • microarray
  • Moving images
  • calcium ion waves, behaviour of transgenic mice

157
Quality and Stability
the problem in the field is not a lack of
good integrating software, Smith says. The
packages usually end up leading back to public
databases. "The problem is the databases are
God-awful," he told BioMedNet. If the data
is still fundamentally flawed, then better
algorithms add little.
(Temple Smith, director of the Molecular
Engineering Research Center at Boston University,
BioMedNet 2000)
  • Data quality
  • Inconsistency, incompleteness
  • Provenance
  • Contamination, noise, experimental rigour
  • Data irregularity
  • Evolution

158
Process Flow
  • Supporting the annotation pipeline
  • Supporting in silico experiments
  • Provenance
  • Change propagation
  • Derived data management
  • Traceability

159
Interoperation
  • Seamless repository and process integration
    and interoperation
  • The Semantic Web for e-Science
  • Genome data warehouses for complex analysis
  • Distributed processing too time consuming
  • Perhaps GRIDs will solve this?

160
Supporting Science
  • Personalisation
  • My view of a metabolic pathway
  • My experimental process flows
  • Science is not linear
  • What did we know then
  • What do we know now
  • Longevity of data
  • It has to be available in 50 years' time.

161
Prediction and Mining
  • Data mining
  • Machine learning
  • Visualisation
  • Information Extraction
  • Simulation

162
Final point
  • "Molecular biologists appear to have eyes for
    data that are bigger than their stomachs. As
    genomes near completion, as DNA arrays on chips
    begin to reveal patterns of gene sequences and
    expressions, as researchers embark on
    characterising all known proteins, the
    anticipated flood of data vastly exceeds in scale
    anything biologists have been used to."
  • (Editorial, Nature, June 10, 1999)

163
Acknowledgements
  • Help with slides
  • Terri Attwood
  • Steve Oliver
  • Robert Stevens
  • Funding
  • UK Research Councils: BBSRC, EPSRC.
  • AstraZeneca.

Further information on bioinformatics: http://www.iscb.org/