Bio-Trac 40 (Protein Bioinformatics)


1
Biomedical Ontologies
  • Bio-Trac 40 (Protein Bioinformatics)
  • October 9, 2008
  • Zhang-Zhi Hu, M.D.
  • Research Associate Professor
  • Protein Information Resource, Department of
    Biochemistry and Molecular & Cellular Biology
  • Georgetown University Medical Center

2
Overview
  • What is ontology?
  • What is biomedical ontology?
  • What is gene ontology?
  • How is it generated?
  • How is it used for annotation?
  • What is protein ontology?
  • Why is it necessary?
  • How to use it?

3
Tree of Porphyry with Aristotle's Categories
Aristotle, 384 BC to 322 BC
4
Ontology
onto-: of being or existence; -logy: study.
Greek origin; Latin ontologia, 1606
  • In philosophy, it seeks to describe basic
    categories and relationships of being or
    existence to define entities and types of
    entities within its framework
  • What do you know? How do you know it?
  • What is existence? What is a physical object?
  • What constitutes the identity of an object?
  • Central goal is to have a definitive and
    exhaustive classification of all entities.

"The science of what is, of the kinds and
structures of objects, properties, events,
processes and relations in every area of reality"
(Barry Smith, U Buffalo)
5
In computer and information science
  • Ontology is a data model that represents a set of
    concepts within a domain and the relationships
    between those concepts. It is used to reason
    about the objects within that domain.

Most ontologies describe individuals (instances),
classes (concepts), attributes, and relations
Classes
Relations
Attributes
Classes (concepts)
e.g. color, engine, door
Individuals (instances)
your Ford, my Ford, his Ford
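As a minimal sketch of this data model, the Python below represents one class, its attributes, two individuals, and the relations between them. It reads color, engine and door as attributes of a hypothetical Car class, which is only one reading of the slide; the class name and the color values are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """A class (concept) together with the attributes its instances can have."""
    name: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class Individual:
    """An individual (instance) of a class, with concrete attribute values."""
    name: str
    instance_of: OntologyClass
    values: dict[str, str] = field(default_factory=dict)


# Hypothetical class: the slide names only the attributes and the instances.
car = OntologyClass("Car", attributes=["color", "engine", "door"])

# The color values are invented for illustration.
your_ford = Individual("your Ford", car, {"color": "red"})
my_ford = Individual("my Ford", car, {"color": "blue"})

# Relations are stored as (subject, relation, object) triples.
relations = [
    (your_ford.name, "instance_of", car.name),
    (my_ford.name, "instance_of", car.name),
]


def instances_of(cls_name: str) -> list[str]:
    """Return the names of all individuals linked to cls_name by instance_of."""
    return [s for s, rel, o in relations if rel == "instance_of" and o == cls_name]


print(instances_of("Car"))
```

Keeping relations as explicit triples is what lets software reason over the domain: a query such as `instances_of` works on the structure alone, without knowing what a car is.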
6
What are ontologies useful for?
Ontology is a form of knowledge representation
about the world or some part of it.
  • Terminology management
  • Integration, interoperability, and sharing of
    data
  • promote precise communication between scientists
  • enable information retrieval across multiple
    resources
  • Knowledge reuse and decision support
  • extend the power of computational approaches to
    perform data exploration, inference, and mining

Biomedical Terminology vs. Biomedical Ontology
  • UMLS (Unified Medical Language System)
  • MeSH (Medical Subject Headings)
  • NCI Thesaurus
  • SNOMED / SNODENT
  • Medical WordNet

7
Ontology Enables Large-Scale Biomedical Science
Ontology is at the center of two major activities
in current biomedical research
  • Structured representation of biomedicine
    • Defining the different types of entities and
      relations used to describe biomedicine
      (ontology content curation).
  • Annotation using ontologies to summarize and
    describe biomedical experimental results, to
    enable
    • Integration of data with other researchers'
      results
    • Cross-species analyses

8
Gene Ontology (GO)
What makes it so wildly successful?
9
GO Consortium
http://www.geneontology.org/
  • The Gene Ontology was originally constructed in
    1998 by a consortium of researchers studying the
    genomes of three model organisms
    • Drosophila melanogaster (fruit fly) (FlyBase)
    • Mus musculus (mouse) (MGD)
    • Saccharomyces cerevisiae (yeast) (SGD)
  • Many other model organism databases have joined
    the GO Consortium, contributing
    • development of the ontologies
    • annotations for the genes of one or more organisms

10
Need for annotation of genome sequences
  • What is Gene Ontology? GO provides a controlled
    vocabulary to describe gene and gene product
    attributes in any organism: how gene products
    behave in a cellular context
  • Three key concepts. Current total: 25804 GO
    terms (Oct. 2008)
    • Biological process: a series of events
      accomplished by one or more ordered assemblies
      of molecular functions, e.g. signal
      transduction, pyrimidine metabolism, or
      alpha-glucoside transport. Total: 15161
    • Molecular function: activities, such as
      catalytic or binding activities, that occur at
      the molecular level and can be performed by
      individual gene products or by assembled
      complexes of gene products, e.g. catalytic
      activity, transporter activity. Total: 8425
    • Cellular component: a component of a cell that
      is part of some larger object, which may be an
      anatomical structure (e.g. ER or nucleus) or a
      gene product group (e.g. ribosome or a protein
      dimer). Total: 2218
  • GO annotation
    • Characterization of gene products using GO
      terms
    • Members submit their data, which are
      available at the GO website.

11
GO Representation: Tree or Network?
GO is a network structure
Node: a concept or a term
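The difference between a tree and a network can be sketched in a few lines of Python: in a tree every node has at most one parent, while in a network such as GO a term may have several. The term names below are placeholders, not real GO terms.

```python
# Each term maps to its list of parents. Term names are placeholders,
# not real GO terms.
parents = {
    "root": [],
    "term_A": ["root"],
    "term_B": ["root"],
    "term_C": ["term_A", "term_B"],  # two parents: impossible in a tree
}


def is_tree(graph: dict[str, list[str]]) -> bool:
    """A rooted tree allows at most one parent per node."""
    return all(len(p) <= 1 for p in graph.values())


def ancestors(term: str, graph: dict[str, list[str]]) -> set[str]:
    """Collect every term reachable by following parent links upward."""
    found: set[str] = set()
    stack = list(graph[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(graph[current])
    return found


print(is_tree(parents))
print(sorted(ancestors("term_C", parents)))
```

Because a term can be reached along more than one path, any traversal has to track visited nodes, as `ancestors` does with the `found` set.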
12
http://www.geneontology.org/
13
GO term (GO:0006366): mRNA transcription from
RNA polymerase II promoter
GO search and display tool
14
Human p53 GO annotation (UniProtKB:P04637)
GO:0006289: nucleotide-excision repair
PMID:7663514, evidence: IMP
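A single annotation such as this one ties together a gene product, a GO term, a literature reference and an evidence code (IMP stands for Inferred from Mutant Phenotype). The sketch below holds those four pieces in one record. The class and field names are hypothetical, and identifiers are assumed to be written in prefix:accession form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    """One GO annotation: a gene product, a GO term, and its supporting evidence."""
    gene_product: str  # database:accession of the annotated protein
    go_id: str
    go_term: str
    reference: str     # literature reference supporting the annotation
    evidence: str      # evidence code, e.g. IMP


def split_identifier(identifier: str) -> tuple[str, str]:
    """Split a 'PREFIX:accession' identifier into its two parts."""
    prefix, _, accession = identifier.partition(":")
    return prefix, accession


# Values taken from the slide; the record layout itself is an assumption.
p53_annotation = GOAnnotation(
    gene_product="UniProtKB:P04637",
    go_id="GO:0006289",
    go_term="nucleotide-excision repair",
    reference="PMID:7663514",
    evidence="IMP",
)

print(split_identifier(p53_annotation.gene_product))
print(split_identifier(p53_annotation.reference))
```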
15
GO annotation of gene products
  • Science basis of the GO: trained experts use the
    experimental observations from literature to
    associate GO terms with gene products (to
    annotate the entities represented in the
    gene/protein databases)
  • Enabling data integration across databases and
    making them available to semantic search

http://www.geneontology.org/GO.current.annotations.shtml
46
Human, mouse, plant, worm, yeast
16
What GO is NOT
  • An ontology of gene products: e.g. cytochrome c is
    not in GO, but attributes of cytochrome c are,
    e.g. oxidoreductase activity.
  • Processes, functions and components unique to
    mutants or diseases: e.g. oncogenesis is not a
    valid GO term.
  • Protein domains or structural features.
  • Protein-protein interactions.
  • Environment, evolution and expression.
  • Anatomical or histological features above the
    level of cellular components, including cell
    types.

Nor is GO an ontology of genes! The name is a misnomer.
17
Missing GO nodes
not deep enough; not broad enough
18
Lack of connections among GOs
19
GO: A Common Standard for Omics Data Analysis
what molecular function?
what biological process?
what cellular component?
20
need more
  • need to improve the quality of GO to support more
    rigorous logic-based reasoning across the data
    annotated in its terms
  • need to extend the GO by engaging ever broader
    community support for addition of new terms and
    for correction of errors
  • need to extend the methodology to other domains,
    including clinical domains, such as
  • disease ontology
  • immunology ontology
  • symptom (phenotype) ontology
  • clinical trial ontology
  • ...

21
http://www.obofoundry.org/
  • Establish common rules governing best practices
    for creating ontologies and for using these in
    annotations
  • Apply these rules to create a complete suite of
    orthogonal interoperable biomedical reference
    ontologies

National Center for Biomedical Ontology (NCBO)
http://bioontology.org/
22
http://www.obofoundry.org/index.cgi?sortdomainshowontologies
23
The OBO Foundry
  • A family of interoperable gold standard
    biomedical reference ontologies to serve
    annotation of
  • scientific literature
  • model organism databases
  • clinical trial data

OBO Foundry: a subset of OBO ontologies whose
developers have agreed in advance to accept a
common set of principles reflecting best practice
in ontology development, designed to ensure
  • tight connection to the biomedical basic sciences
  • compatibility, interoperability, common relations
  • support for logic-based reasoning

OBO Foundry Principles: http://www.obofoundry.org/crit.shtml
24
Rationale of OBO Foundry coverage
GRANULARITY | CONTINUANT, INDEPENDENT | CONTINUANT, INDEPENDENT | CONTINUANT, DEPENDENT | CONTINUANT, DEPENDENT | OCCURRENT
ORGAN AND ORGANISM | Organism (NCBI Taxonomy) | Anatomical Entity (FMA, CARO) | Organ Function (FMP, CPRO) | Phenotypic Quality (PaTO) | Biological Process (GO)
CELL AND CELLULAR COMPONENT | Cell (CL) | Cellular Component (FMA, GO) | Cellular Function (GO) | Phenotypic Quality (PaTO) | Biological Process (GO)
MOLECULE | Molecule (ChEBI, SO, RNAO, PRO) | Molecule (ChEBI, SO, RNAO, PRO) | Molecular Function (GO) | Molecular Function (GO) | Molecular Process (GO)
25
OBO Relation Ontology
Foundational: is_a, part_of
Spatial: located_in, contained_in, adjacent_to
Temporal: transformation_of, derives_from, preceded_by
Participation: has_participant, has_agent
e.g. A is_a B, defined as: every instance of A is an
instance of B. So rose is_a plant means that every
instance of rose is an instance of plant.
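The instance-level definition of is_a is what makes simple inference possible: anything asserted to be a rose can be concluded to be a plant. The sketch below is a minimal illustration. The extra assertion "plant is_a organism" and the individual's name are assumptions added to show that is_a chains can be followed, and each class is given a single parent only to keep the example short.

```python
# Class-level is_a assertions. "plant is_a organism" is an added assumption
# used only to show that is_a chains can be followed.
IS_A = {
    "rose": "plant",
    "plant": "organism",
}

# Instance-level typing: each individual is asserted as an instance of one class.
INSTANCE_OF = {
    "the rose on my desk": "rose",
}


def superclasses(cls: str) -> list[str]:
    """Follow is_a links upward and return every superclass in order."""
    chain = []
    while cls in IS_A:
        cls = IS_A[cls]
        chain.append(cls)
    return chain


def is_instance_of(individual: str, cls: str) -> bool:
    """True if the individual's asserted class is cls or a subclass of cls."""
    asserted = INSTANCE_OF[individual]
    return cls == asserted or cls in superclasses(asserted)


print(superclasses("rose"))
print(is_instance_of("the rose on my desk", "plant"))
```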
26
What is Protein Ontology? Why?
PRO
http://pir.georgetown.edu/pro/
27
The Need for Representation of Various Protein
Forms
Glucocorticoid receptor (GR)
and PTMs
28
Sphingomyelin phosphodiesterase (SMPD1)
(ASM_HUMAN)
  • Cleavage sites
  • Lysosomal: the enzyme is transported from the
    Golgi apparatus to the lysosome after addition
    of mannose-6-phosphate (M6P) moieties and binding
    to the M6P receptor.
  • Secreted: the shorter cleaved form is not
    modified with M6P and is targeted for secretion
    to the extracellular space, with different
    functions such as LDL binding and oxidized LDL
    catabolism.

29
Alternative splicing
a single new contact between Phe32 (F32) of FGF8b
and a hydrophobic groove within Ig domain 3 of
FGFR2c
Olsen et al., Genes Dev. 2006
FGF8a, 8b differ in their ability to pattern
embryonic brain
  • Only FGF8b can transform midbrain to cerebellum
    whereas FGF8a causes an overgrowth of midbrain.

FGF8a FGF8b
FGF8_HUMAN alternative splicing
30
GOA for Transcription factor Ovo-like 2
Form 1 (long): GO:0045892, IDA, negative
regulation of transcription, DNA-dependent
Form 2 (short): GO:0045893, IDA, positive
regulation of transcription, DNA-dependent
Gene. 2004; 336:47-58. PMID:15225875
274 aa
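The Ovo-like 2 case shows why annotation at the level of individual protein forms matters. The sketch below compares two ways of storing the same two annotations: merged under the gene, where the terms appear to contradict each other, and kept per form, where they do not. The data structures and the helper function are assumptions made for illustration.

```python
# (form, GO id, GO term, evidence code) as listed on the slide.
form_annotations = [
    ("long", "GO:0045892", "negative regulation of transcription, DNA-dependent", "IDA"),
    ("short", "GO:0045893", "positive regulation of transcription, DNA-dependent", "IDA"),
]

# Gene-level view: all forms are merged under one gene product.
gene_level = {term for _form, _go_id, term, _ev in form_annotations}

# Form-level view: each protein form keeps its own annotations.
form_level: dict[str, set[str]] = {}
for form, _go_id, term, _ev in form_annotations:
    form_level.setdefault(form, set()).add(term)


def has_opposing_terms(terms: set[str]) -> bool:
    """True if a set holds both a positive and a negative regulation term."""
    has_negative = any(t.startswith("negative regulation") for t in terms)
    has_positive = any(t.startswith("positive regulation") for t in terms)
    return has_negative and has_positive


print(has_opposing_terms(gene_level))
print({form: has_opposing_terms(terms) for form, terms in form_level.items()})
```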
31
The Need for Protein Classes Representing
Protein Evolutionary Relationships
  • Genes/proteins identified in model organisms,
    such as mouse, yeast, fly, may have important
    functional implications in human.
  • Gene function in a model organism may not apply
    to human
  • Animal models for human diseases such as mouse
    models for diabetes, arthritis, and tumor.
  • Genes essential in one species may be redundant
    and nonessential in another species due to
    functional compensation, e.g.
    • mutation of Rb1 causes retinoblastoma in early
      childhood in humans
    • the Rb1 knock-out mouse did not develop
      retinoblastoma because of compensation from a
      functional homolog, p107.
  • Close examination of proteins in phylogenetic
    classes and their functional convergence and
    divergence in an ontological structure is
    important for application of disease models.

32
Implications of Protein Evolution
[Figure: phylogenetic tree of B. subtilis, E. coli, yeast, worm, fly, rat, mouse, chimp and human]
  • Conclusions from experiments performed on
    proteins from one organism are often applicable
    to the homologous protein from another organism.
  • Information learned about existing proteins
    allows us to infer the properties of ancestral
    proteins.

Common ancestor
33
Protein Evolution
Sequence changes
Domain shuffling
With enough similarity, one can trace back to a
common origin
What about these?
34
Functional convergence
  • Protein classes of the same function derived from
    different evolutionary origins, e.g. carbonate
    dehydratase (or carbonic anhydrase EC 4.2.1.1),
    which has three independent gene families with
    functional convergence.

Animal and prokaryotic type
Plant and prokaryotic type
Archaea type
35
Functional divergence
[Figure: gene tree. A gene duplication (TGM3/EPB42 split) gives a TGM3 branch and an EPB42 branch; a speciation (human/mouse split) within each branch gives TGM3 (Human), TGM3 (Mouse), EPB42 (Human) and EPB42 (Mouse).]
TGM3: Protein-glutamine gamma-glutamyltransferase
(transglutaminase involved in protein
modification)
EPB42: Erythrocyte membrane protein band
4.2 (constituent of cytoskeleton involved in
cell shape)
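The tree on this slide can be encoded directly, and the event at which two genes split apart then tells how they are related. In standard terminology (not used on the slide), genes separated by a speciation are orthologs and genes separated by a duplication are paralogs. The nested-tuple encoding and the function names below are assumptions made for this sketch.

```python
# Gene tree from the slide as nested tuples: (event, left, right) for internal
# nodes, plain strings for leaves.
gene_tree = (
    "duplication",
    ("speciation", "TGM3 (Human)", "TGM3 (Mouse)"),
    ("speciation", "EPB42 (Human)", "EPB42 (Mouse)"),
)


def leaves(node) -> set[str]:
    """Return all leaf labels under a node."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def divergence_event(node, a: str, b: str) -> str:
    """Return the event at the node where the distinct leaves a and b split apart."""
    event, left, right = node
    for child in (left, right):
        if not isinstance(child, str) and {a, b} <= leaves(child):
            return divergence_event(child, a, b)
    return event


def relationship(a: str, b: str) -> str:
    """Orthologs split at a speciation node, paralogs at a duplication node."""
    event = divergence_event(gene_tree, a, b)
    return "orthologs" if event == "speciation" else "paralogs"


print(relationship("TGM3 (Human)", "TGM3 (Mouse)"))
print(relationship("TGM3 (Human)", "EPB42 (Human)"))
```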
36
The Need for Protein Ontology
  • Data integration and knowledge management for
    -omics work.
  • A gap exists in OBO for gene products.
  • Protein Ontology (PRO) will contain two connected
    components (or subontologies)
    • ProEvo: captures the protein classes represented
      by protein families at the fold, domain and
      full-length levels that reflect evolutionary
      relationships
    • ProForm: captures the specific protein objects of
      a specific gene resulting from alternative
      splicing, posttranslational modification and
      genetic variation.
  • ProEvo and ProForm are connected through the
    reference (canonical) protein sequence
    currently annotated in UniProtKB.
  • PRO formalization of these detailed protein
    objects and classes will allow accurate and
    consistent proteomics experimental design and
    data analysis/integration.

37
PRO Framework
  • PRO is designed to be a formal and
    well-principled OBO Foundry ontology for protein
    entities.
  • Attributes of objects will take the form of links
    to other ontologies, such as gene (GO), sequence
    (SO), modification (PSI-MOD) and disease (DO)
    ontologies.
  • A PRO prototype for TGF-beta signaling proteins
    was built based on this framework.
  • In this way, PRO aims to provide an ontological
    framework to define protein entities and
    evolutionarily related classes that the community
    can adopt for different purposes, e.g.
    • annotation of entity attributes,
    • mapping of objects in pathways, and
    • modeling of biological system dynamics and
      disease.

38
Protein Ontology (PRO): http://pir.georgetown.edu/pro/
39
Mothers against decapentaplegic homolog 2
Smad 2
GO annotation of SMAD2_HUMAN
Cellular Component: nucleus
Molecular Function: protein binding
Biological Process: signal transduction; regulation of
transcription, DNA-dependent
40
[Figure: TGF-beta signaling through Smad 2. Labels: TGF-b, TGF-beta receptor (I and II), Smad 2, Smad 4, CAMK2, ERK1, phosphate marks (P), Cytoplasm, Nucleus. Steps: 1 phosphorylation, 2 complex formation, 3 nuclear translocation, 4 DNA binding, leading to transcription regulation.]
41
Smad2 gene products (SMAD2_HUMAN)
Form | Complex | Location | Transcription (Txn) | ID
normal | - | Cytoplasmic | - | PRO00000011
TGF-b receptor phosphorylated | Forms complex | Nuclear | Txn upregulation | PRO00000013
ERK1 phosphorylated | Forms complex | Nuclear | Txn upregulation | PRO00000014
CAMK2 phosphorylated | Forms complex | Cytoplasmic | No Txn upregulation | PRO00000015
alternatively spliced short form | - | Cytoplasmic | - | PRO00000016
phosphorylated short form | - | Nuclear | Txn upregulation | PRO00000018
point mutation (causative agent large intestine carcinoma) | Does not form complex | Cytoplasmic | No Txn upregulation | PRO00000019
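Once each Smad2 form has its own identifier, the forms listed on this slide can be queried like any other data. The sketch below stores them as records and selects forms by location and by effect on transcription. The record layout is an assumption, IDs are copied as they appear in the transcript, and `None` marks a value the slide does not give.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProteinForm:
    """One protein form of a gene, following the columns of the slide's table."""
    pro_id: str
    description: str
    forms_complex: Optional[bool]     # None where the slide gives no value
    location: str
    txn_upregulation: Optional[bool]  # None where the slide gives no value


SMAD2_FORMS = [
    ProteinForm("PRO00000011", "normal", None, "Cytoplasmic", None),
    ProteinForm("PRO00000013", "TGF-b receptor phosphorylated", True, "Nuclear", True),
    ProteinForm("PRO00000014", "ERK1 phosphorylated", True, "Nuclear", True),
    ProteinForm("PRO00000015", "CAMK2 phosphorylated", True, "Cytoplasmic", False),
    ProteinForm("PRO00000016", "alternatively spliced short form", None, "Cytoplasmic", None),
    ProteinForm("PRO00000018", "phosphorylated short form", None, "Nuclear", True),
    ProteinForm("PRO00000019", "point mutation", False, "Cytoplasmic", False),
]


def forms_located_in(location: str) -> list[str]:
    """Return the IDs of all forms found in the given location."""
    return [f.pro_id for f in SMAD2_FORMS if f.location == location]


nuclear = forms_located_in("Nuclear")
upregulating = [f.pro_id for f in SMAD2_FORMS if f.txn_upregulation]

print(nuclear)
print(upregulating)
```

In this table the nuclear forms and the transcription-upregulating forms are the same three entries, a pattern that a single gene-level record for SMAD2_HUMAN could not express.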
42
PRO allows proper representations of protein
forms in pathways
TGF-beta signaling pathway (REACT_6844)
Each step in the pathway is described by a
Reactome event ID. Bold PRO IDs indicate objects
that undergo some modification that is relevant
for function (the modified form is underlined).
From Arighi et al., SIG2008.
43
PRO hierarchy in OBO-Edit
ProEvo: representing evolutionary-related protein
classes. In this example, children of
TGF-beta-like cysteine-knot cytokine have a
common architecture consisting of a signal
peptide, a variable propeptide region and a
transforming growth factor beta-like domain that
is a cysteine-knot domain. Pfam:PF00019
"has_part Transforming growth factor beta like
domain".
ProForm: representing multiple protein products of a gene.
Only forms with experimental data are included.
When common protein forms exist in human and
mouse, a single node is created (See details
below).
OBO relations: is_a, derives_from
44
Summary
  • The vision of the biomedical ontology community
    is that all biomedical knowledge and data are
    disseminated on the Internet using principled
    ontologies, such that they are semantically
    interoperable and useful for improving biomedical
    science and clinical care.
  • The scope extends to all knowledge and data that
    are relevant to the understanding or improvement
    of human biology and health.
  • Knowledge and data are semantically interoperable
    when they enable predictable, meaningful
    computation across knowledge sources developed
    independently to meet diverse needs.
  • Principled ontologies are ones that follow
    NCBO-recommended formats and methodologies for
    ontology development, maintenance, and use.