MotifML - PowerPoint PPT Presentation


Title: MotifML


1
  • MotifML
  • A Novel Ontology-based XML Model for
    Data-Exchange of Regulatory DNA Motif Profiles
  • Eric Neumann, Beyond Genomics
  • Tian Niu, Harvard University
  • Ken Baclawski, Northeastern University

2
DNA Motifs

  • human GCTTGAATTAGACAGGATTAAAGGC
    TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
  • bovine GCTTGAATTAAATAGGATTAAAGGC
    TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
  • mouse GCTTGAATTAGACAGGATTAAAGGC
    TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA

  • -70 -45
    -20 1

Alignment Profile
Functional Significance?
3
Motif Finding Tools
  • AlignACE
  • GIBBS
  • Consensus
  • BioProspector

4
The Need for motifML
  • Information resides at multiple sources
  • Data follow multiple Structures
  • Multiple Interfaces

[Diagram: MotifML as an integrated XML view over the output of BioProspector, Consensus, Gibbs, and AlignACE]
5
Motif Function
  • Gene expression regulation that is dependent on
    activated transcriptional factors
  • Key element of gene networks; complex analysis of
    microarrays

Transcriptional Factors
Regulated Gene Expression

Cis-Elements Associated with a Gene
6
motifML Goals
  • to allow the full specification of all
    experimental information known about motifs
  • to provide an extensible framework for this
    annotation and provide a common vehicle for
    exchanging the motif information
  • to provide a single document interface to
    integrate all project information, complete with
    protocols for network data retrieval.

7
motifML Design
  • formal and concise: ontology based
  • motifML documents easy to create
  • clarity more important than brevity
  • use both XML schema and XML DTD

8
motifML Semantics
  • Annotation
  • The collection of features for a given set of
    sequences that have built-in semantics
  • Features
  • Characteristics supported by analytic evidence
  • Analyses
  • Computational
  • Experimental

9
motifML Semantics
[Diagram labels: Ontology; Property; Semantically Definable; Searchable; Pragmatic; Objects; Annotation; Analyses; Features; Motifs; Intentional Extraction; Results]
10
motifML Sequence Item
<seq id="demo_seq" name="Human HAL Gene Exon 18">
  <dbxref>
    <database>GenBank</database>
    <unique_id>14588658</unique_id>
  </dbxref>
  <feature>
    <motif type="cis-regulatory" name="CBE" id="dm312"/>
    <description>CRX Binding Element</description>
    <position start="21" end="32"/>
    <evidence>
      <reference paper="Davies, J Mol Biol. 1993;296:1205-14"/>
    </evidence>
  </feature>
  <residues type="dna">
    ATAATGTCCAAGATCTTCTGGAGAGTGT
    ATCCCATGCTGTGGAGCACTCTGTGGAAGCCACGGGTCCTTTAGACAGCT
    CATCCTATGAGGAGCACTTCTTAACTGGCACTGGTCTCTTGCAGTTTCTG
    AGAACAAGGCTCTGTGCCATCCCTCGTCTGTTGACTCCCTCTCCACCAGC
    GCAGCCACGGAGGACCACGTCTCCATGGGAGGATGGGCAGCAAGGAAAGC
    CCTCAGGGTCATCGAGCATGTGGAGCAAGGTAATGCTGATGAGTTCGGGG
    TGGCGGGCCTGCCTGATAGACCACTGTGCCTGTGGTTCTCAAGTGGGATC
    TCCCACCAGCAACATCAGCATC ACCTGGAAAC
  </residues>
</seq>
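As a rough illustration of how a sequence item like this could be consumed, the Python sketch below reads a feature's position and slices the matching residues. The sample document is abbreviated and hypothetical (the residues and coordinates are made up), and the function name `motif_sites` is an assumption. The slide does not say whether positions are 1-based and inclusive, so that is assumed here too.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical sequence item in the style of the slide;
# the residues and coordinates are invented for illustration.
SEQ_XML = """
<seq id="demo_seq" name="Example sequence">
  <feature>
    <motif type="cis-regulatory" name="CBE" id="dm312"/>
    <description>CRX Binding Element</description>
    <position start="5" end="10"/>
  </feature>
  <residues type="dna">
    ACGTT AATCC GGTAC
  </residues>
</seq>
"""

def motif_sites(xml_text):
    """Return (motif name, start, end, subsequence) for each feature."""
    seq = ET.fromstring(xml_text)
    # Whitespace inside <residues> is layout only, so drop it.
    residues = "".join(seq.findtext("residues").split())
    sites = []
    for feature in seq.findall("feature"):
        name = feature.find("motif").get("name")
        pos = feature.find("position")
        start, end = int(pos.get("start")), int(pos.get("end"))
        # Assumption: coordinates are 1-based and inclusive.
        sites.append((name, start, end, residues[start - 1:end]))
    return sites

print(motif_sites(SEQ_XML))
```

Keeping the residues and the feature coordinates in the same document is what makes this lookup possible without a second data source.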
11
Computational Analysis
<!ELEMENT computational_analysis (date?, program, version?, parameter,
                                  database?, result_set)>
<!ATTLIST computational_analysis seq IDREF #REQUIRED>
<!ELEMENT program (#PCDATA)>
<!ELEMENT result_set (score?, output, result)>
<!ELEMENT result (score, type, subtype?, seq_relationship, output)>
<!ATTLIST result id ID #IMPLIED>
<!ELEMENT seq_relationship (location, alignment?)>
<!ATTLIST seq_relationship seq IDREF #REQUIRED
                           type (query | subject | peer) #REQUIRED>
<!ELEMENT alignment (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT value (#PCDATA)>
<!ELEMENT parameter (type, value)>
<!ELEMENT output (type, value)>
<!ELEMENT database (name, date?, version?)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT score (#PCDATA)>
12
HSP and HSE
  • Heat shock and other environmental and
    pathophysiologic stresses stimulate synthesis of
    heat shock proteins (Hsps). These proteins enable
    the cell to survive and recover from stressful
    conditions by as yet incompletely understood
    mechanisms.
  • A conserved 14 base pair regulatory sequence,
    referred to as the heat shock element (HSE), is
    found in multiple imperfect copies upstream of
    the TATA box of all heat shock genes.
  • Genes with an HSE at the upstream region may be
    co-regulated

13
Dataset (Vertebrates)
  • >gid 3004462, start=1, end=1027
  • >gid 7861931, start=1, end=666
  • >gid 7108904, start=1, end=1519
  • >gid 7739662, start=1, end=800
  • >gid 64795, start=1, end=487
  • >gid 64791, start=1, end=614
  • >gid 64789, start=1, end=1128
  • >gid 64786, start=1, end=374
  • >gid 32480, start=1, end=483
  • >gid 32484, start=1, end=711
  • >gid 7669470, start=1, end=424
  • >gid 5729878, start=1, end=313
  • >gid 5031770, start=1, end=760
  • >gid 1816451, start=1, end=2179
  • >gid 184422, start=1, end=2634
  • >gid 184416, start=1, end=488
  • >gid 188491, start=1, end=959
  • >gid 4691417, start=1, end=2631
  • >gid 188489, start=1, end=485
  • >gid 188487, start=1, end=489
  • >gid 184416, start=1, end=488
  • >gid 211940, start=1, end=391
  • >gid 63508, start=1, end=1421
  • >gid 63512, start=1, end=2300
  • >gid 409185, start=1, end=1231
  • >gid 163160, start=1, end=491
  • >gid 414974, start=1, end=426
  • Data are from GenBank

14
AlignACE program
  • uses a Gibbs sampling strategy which is similar
    to that described by Neuwald et al., 1995
  • An iterative masking procedure is used to allow
    multiple distinct motifs to be found within a
    single data set
  • Reference: Hughes et al., J Mol Biol. 2000;296:1205-14

15
AlignACE Results
  • ...
  • Motif 1
  • GGGGAGGGGGTGGGGGGGC 23 788 0
  • GGCGGGCGGGCGGCGGGGG 23 867 1
  • GGACAGCGGCGGCTGGCTG 11 107 0
  • GGGGTGCGGGGGCAGGCGC 23 1417 1
  • CCGCGGGGGCGGGCGGGGC 13 2034 1
  • ...
  • MAP Score 794.004
  • Motif 2
  • GGGGAGGGGGTGGGGGGGCGGGG 23 784 0
  • GTGCGGGGGCAGGCGCGGAGAGC 23 1420 1
  • GCGGAGCGGGAGGGGGCGTGGCC 13 1932 1
  • GGGGTGCGGGAGGGCGGGCGGGC 23 1448 1
  • GGGCAGTGGGCGGCTGGCAGCTG 14 1452 1
  • ...

16
Gibbs Motif Sampler Program
  • Uses Stochastic Iterative Sampling
  • The Bernoulli motif sampler assumes that each
    sequence can contain zero or more ungapped motif
    elements of each motif type
  • Reference
  • Lawrence et al., Science 1993;262(5131):208-14
  • Neuwald et al., Protein Sci. 1995
    Aug;4(8):1618-32.

17
Gibbs Results
  • ...
  • 4, 1 284 agtgc AGAGTCTGGAGAGC cgaat 271 0.87 R gid 7739662, start=1, end=800
  • 4, 2 425 ggtat AGATGTCGGAGAGT cgttt 412 0.79 R gid 7739662, start=1, end=800
  • 4, 3 643 atgga AGCCTCGGGAAACT tcggg 656 0.86 F gid 7739662, start=1, end=800
  • 5, 1 239 atgga AGCCTCGGGAAACT tcggg 252 0.86 F gid 64795, start=1, end=487
  • 7, 1 401 agtgt GGGTGCTGGAGGCT gacgg 388 0.99 R gid 64789, start=1, end=1128
  • 9, 1 26 ggagt GGCGGTGGGAAGGG tgttg 13 0.99 R gid 32480, start=1, end=483
  • ...
  • ...

18
Consensus Program
  • Uses entropy-based scoring functions
  • References
  • Stormo and Hartzell, PNAS 1989;86:1183-1187
  • Hertz et al., 1990, CABIOS, 6:81-92
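To make "entropy-based scoring" concrete, here is a minimal Python sketch that computes the information content (relative entropy against a background) of a set of aligned sites. It illustrates the general idea only and is not the Consensus program's actual scoring function, which applies further adjustments. The function name, the uniform background, and the absence of pseudocounts are all assumptions.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Sum over columns of sum_b p_b * log2(p_b / q_b) for aligned sites."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {base: 0.25 for base in "ACGT"}
    total = 0.0
    for column in zip(*sites):  # one tuple of bases per alignment column
        counts = Counter(column)
        for base, n in counts.items():
            p = n / len(column)
            total += p * math.log2(p / background[base])
    return total

# Three aligned 14-mers of the kind a motif finder reports.
sites = ["TGCAAGATTTTTAA", "TGGAGGCTTCCAGA", "TGGAGGCTTCCAGA"]
print(round(information_content(sites), 3))
```

A fully conserved column contributes 2 bits under a uniform background, and a column split evenly between two bases contributes 1 bit, so higher totals mean a more sharply defined motif.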

19
Consensus Results
  • MATRIX 1
  • ...
  • 1|23 : 1/593 TGCAAGATTTTTAA
  • 2|9 : 2/8 TGGAGGCTTCCAGA
  • 3|10 : 3/889 TGGAGGCTTCCAGA
  • ...
  • MATRIX 2
  • ...
  • 1|23 : 1/593 TGCAAGATTTTTAA
  • 2|9 : 2/8 TGGAGGCTTCCAGA
  • 3|10 : 3/889 TGGAGGCTTCCAGA
  • ...
  • MATRIX 3
  • 1|23 : 1/593 TGCAAGATTTTTAA
  • 2|9 : 2/8 TGGAGGCTTCCAGA
  • 3|10 : 3/889 TGGAGGCTTCCAGA
  • ...
  • MATRIX 4
  • 1|21 : 1/38 GGGAAAGCTCGAGA

20
BioProspector Program
  • a program that examines the upstream region of
    genes in the same gene expression pattern group
    to search for regulatory sequence motifs.
  • uses zero to third-order Markov background models
  • allows for the searching of gapped motifs and
    motifs with palindromic patterns
  • Reference: Liu et al., Pac Symp Biocomput.
    2001:127-38
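The "zero to third-order Markov background models" mentioned above can be sketched as follows. This is a plain maximum-likelihood estimate written for illustration, not BioProspector's own code. The function name `markov_background` is an assumption, and pseudocounts are deliberately left out.

```python
from collections import defaultdict

def markov_background(sequences, order):
    """Estimate P(next base | previous `order` bases) from sequences.

    order=0 gives plain base frequencies (the context is the empty string).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            counts[context][seq[i]] += 1
    model = {}
    for context, nexts in counts.items():
        total = sum(nexts.values())
        model[context] = {base: n / total for base, n in nexts.items()}
    return model

# Hypothetical upstream sequences, for illustration only.
upstream = ["ACGTACGTTTGA", "TTGACCGTACGA"]
for k in range(4):  # zero- through third-order models
    print(k, len(markov_background(upstream, k)), "contexts")
```

Higher orders capture more of the local composition of non-motif sequence but need more data per context, which is one reason to cap the order at three.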

21
BioProspector Results
  • ...
  • Motif 1
  • ...
  • Seq 1 seg 1 r998 TCATCCAATCAGAG
  • Seq 2 seg 1 f91 TCAACCGAACAGAA
  • Seq 3 seg 1 r638 TCGACCAATCAAAA
  • ...
  • Motif 2
  • ...
  • Seq 1 seg 1 f38 GGGAAAGCTCGAGA
  • Seq 2 seg 1 r648 TGGAAGCCTCCAGT
  • Seq 3 seg 1 r620 TGGAAGCCTCCAGT
  • ...
  • Motif 3
  • ...
  • Seq 1 seg 1 r997 CTCATCCAATCAGA
  • Seq 2 seg 1 f90 CTCAACCGAACAGA
  • Seq 3 seg 1 r637 TTCGACCAATCAAA
  • ...

22
Conceptions and Interactions of the Underlying
Statistical Algorithms Used by the Motif
Searching Programs
Gibbs Sampler Iterative Updating Strategy
Two Block Motif Model
23
Motif Data Representation
  • Common data representation for motif information.
  • Uses XML Schema to specify format.
  • Both human and machine readable.
  • Supports knowledge mining.
  • Statements can be asserted about a motif such as
    a role in gene regulation.

24
Example of a motif
<motif id="GXY1">
  <block>
    <base type="G">0.21</base>
    <base type="C">0.21</base>
    <base type="T">0.59</base>
  </block>
  <block>
    <base type="G">0.44</base>
    <base type="C">0.50</base>
    <base type="T">0.06</base>
  </block>
  <block>
    <base type="A">0.70</base>
    <base type="G">0.29</base>
  </block>
  ...
</motif>

Blk1   A     G     C     T
1      0.00  0.21  0.21  0.59
2      0.00  0.44  0.50  0.06
3      0.70  0.29  0.00  0.00
4      0.32  0.62  0.00  0.06
5      0.03  0.00  0.97  0.00
6      0.00  0.00  1.00  0.00
7      0.85  0.09  0.03  0.03
8      0.88  0.12  0.00  0.00
9      0.03  0.00  0.03  0.94
10     0.03  0.09  0.88  0.00
11     0.70  0.12  0.18  0.00
...
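The mapping from the position matrix to the motif markup on this slide can be sketched in Python as below. The function name `motif_to_xml` is an assumption, as is the rule that zero-probability bases are simply left out of a block. That rule is inferred from the example, where block 1 carries no entry for A.

```python
import xml.etree.ElementTree as ET

def motif_to_xml(motif_id, rows):
    """Build a <motif> element from rows of {base: probability}."""
    motif = ET.Element("motif", id=motif_id)
    for row in rows:
        block = ET.SubElement(motif, "block")
        for base in "AGCT":  # column order used in the matrix
            prob = row.get(base, 0.0)
            if prob > 0:  # assumption: zero entries are omitted
                ET.SubElement(block, "base", type=base).text = f"{prob:.2f}"
    return motif

# First three rows of the matrix shown on the slide.
rows = [
    {"A": 0.00, "G": 0.21, "C": 0.21, "T": 0.59},
    {"A": 0.00, "G": 0.44, "C": 0.50, "T": 0.06},
    {"A": 0.70, "G": 0.29, "C": 0.00, "T": 0.00},
]
xml_text = ET.tostring(motif_to_xml("GXY1", rows), encoding="unicode")
print(xml_text)
```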
25
XML Schema
  • Extends the XML document type language
  • Data format restrictions.
  • Data value (min and max) restrictions.
  • Element occurrence (min and max) restrictions.
  • No sophisticated restrictions, for example
    requiring values to form a probability distribution.
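XML Schema can bound each value individually, but the last point suggests that a constraint across values, such as a block's probabilities summing to one, has to be checked outside the schema. A minimal Python sketch of such a check follows. The function name and the rounding tolerance are assumptions, and the second block in the sample is deliberately invalid.

```python
import xml.etree.ElementTree as ET

def invalid_blocks(motif_xml, tolerance=0.02):
    """Return 1-based indices of blocks that are not probability distributions."""
    motif = ET.fromstring(motif_xml)
    bad = []
    for index, block in enumerate(motif.findall("block"), start=1):
        probs = [float(base.text) for base in block.findall("base")]
        in_range = all(0.0 <= p <= 1.0 for p in probs)
        # The tolerance absorbs two-decimal rounding in published matrices.
        if not in_range or abs(sum(probs) - 1.0) > tolerance:
            bad.append(index)
    return bad

# Hypothetical document: block 1 sums to 1.01, block 2 only to 0.70.
MOTIF = (
    '<motif id="GXY1">'
    '<block><base type="G">0.21</base><base type="C">0.21</base>'
    '<base type="T">0.59</base></block>'
    '<block><base type="A">0.50</base><base type="G">0.20</base></block>'
    '</motif>'
)
print(invalid_blocks(MOTIF))
```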

26
XML Schema for MotifML
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="motif" type="MotifType"/>
  <!-- A motif consists of a sequence of blocks. -->
  <xsd:complexType name="MotifType">
    <xsd:sequence>
      <xsd:element name="block" minOccurs="0"
                   maxOccurs="unbounded" type="BlockType"/>
    </xsd:sequence>
  </xsd:complexType>
  <!-- A block specifies a probability for each DNA base type. -->
  <xsd:complexType name="BlockType">
    <xsd:sequence>
      <xsd:element name="base" minOccurs="1" maxOccurs="4">
      ...
27
Statements about motifs
<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:mml="http://www.beyondgenomics.com/2001/07/motifml"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/ftg6"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>
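Statements of this kind reduce to subject, predicate, object triples. The Python sketch below pulls them out of a shortened copy of the document using only the standard library. A dedicated RDF parser would be the more robust choice, and the function name `triples` is an assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened copy of the statements on this slide.
RDF_XML = """<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>
"""

def triples(rdf_xml):
    """Return (subject, predicate, object) for each property element."""
    root = ET.fromstring(rdf_xml)
    result = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get("about")
        for prop in desc:
            # ElementTree spells qualified names as "{namespace}local".
            predicate = prop.tag.split("}", 1)[1]
            obj = prop.get(f"{{{RDF_NS}}}resource")
            result.append((subject, predicate, obj))
    return result

for triple in triples(RDF_XML):
    print(triple)
```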
28
The Need for Bio-Ontologies
  • How do biologists learn the element structure of
    a document describing the heterogeneous sequence
    alignment output?
  • How do biologists share the structure and
    meta-data on motif profiles efficiently and
    unambiguously?

29
A multiple sequence alignment linked with
TRANSFAC/TRANSPATH

  • human GCTTGAATTAGACAGGATTAAAGGC
    TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
  • bovine GCTTGAATTAAATAGGATTAAAGGC
    TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
  • mouse GCTTGAATTAGACAGGATTAAAGGC
    TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
  • PCE-I -CBE-- AP-4
  • 8888 cETS cETS

  • -70 -45
    -20 1

Alignment Profile
Shown here is the alignment from -70 to +1. The
numbering corresponds to the mouse sequence.
Identical bases are marked above each nucleotide.
Consensus sequence matches conserved among all
three species are the Ret-1/PCE-I element at -65
to -60, the CRX-binding element (CBE) at -55 to
-50, an AP-4 consensus core sequence at -37 to
-34, a cETS consensus core at -35 to -31 and
another at positions -57 to -54, and an S8
homeodomain site, shown by "8888", at -64 to -61.
Only the core bases are marked. The criteria for
searching the TRANSFAC Database by MatInspector
were a match of at least 80% to the core sequence
and of at least 85% to the entire consensus
sequence. The GenBank entries for human, bovine,
and mouse are X53044, M32733, and M32734,
respectively. (Boatright, Mol Vis 1997;3:15)
30
Transcriptional Factors Ontology
[Diagram: ontology relating Composite Element, Site, Gene, Transcript, Context (Tissue, Stage, Disease, Env. Cond., Induced), Observation, Transcriptional Motif Elements, and Transcriptional Factors through the relations "contains", "Upstream to", "Within", "Kind of", "produces", "Part of", "Found in", and "Binds to"]
31
MotifML Applications
  • Develop a data exchange format for DNA motif data
  • Handling output from motif analyses
  • Annotation and data mining of micro-array data
  • Important in modeling transcriptional regulatory
    networks in eukaryotes

32
Future Directions
  • Distributed Annotation System (Lincoln Stein,
    Open-Bio)
  • Exchange with Other XML Dialects
  • DAML development