Title: MotifML
1- MotifML
- A Novel Ontology-based XML Model for Data-Exchange of Regulatory DNA Motif Profiles
- Eric Neumann, Beyond Genomics
- Tian Niu, Harvard University
- Ken Baclawski, Northeastern University
2DNA Motifs
human   GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
bovine  GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
mouse   GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
        -70                       -45                       -20                +1
Alignment Profile
Functional Significance?
3Motif Finding Tools
- AlignACE
- GIBBS
- Consensus
- BioProspector
4The Need for motifML
- Information resides at multiple sources
- Data follow multiple Structures
- Multiple Interfaces
Integrated XML view
MotifML
BioProspector
Consensus
Gibbs
AlignACE
5Motif Function
- Gene expression regulation that is dependent on activated transcriptional factors
- Key element of Gene Networks
- Complex analysis of microarrays
Transcriptional Factors
Regulated Gene Expression
Cis-Elements Associated with a Gene
6motifML Goals
- to allow the full specification of all experimental information known about motifs
- to provide an extensible framework for this annotation and provide a common vehicle for exchanging the motif information
- to provide a single document interface to integrate all project information, complete with protocols for network data retrieval.
7motifML Design
- formal and concise
- ontology based
- motifML documents easy to create
- clarity more important than brevity
- use both XML schema and XML DTD
8motifML Semantics
- Annotation
- The collection of features for a given set of sequence(s) that have built-in semantics
- Features
- Characteristics supported by analytic evidence
- Analyses
- Computational
- Experimental
9motifML Semantics
Ontology
Property
Semantically Definable Searchable
Pragmatic
Objects
Annotation
Analyses
Features Motifs
Intentional Extraction
Results
10motifML Sequence Item
<seq id="demo_seq" name="Human HAL Gene Exon 18">
  <dbxref>
    <database>GenBank</database>
    <unique_id>14588658</unique_id>
  </dbxref>
  <feature>
    <motif type="cis-regulatory" name="CBE" id="dm312"/>
    <description>CRX Binding Element</description>
    <position start="21" end="32"/>
    <evidence>
      <reference paper="Davies, J Mol Biol. 1993 296:1205-14"/>
    </evidence>
  </feature>
  <residues type="dna">
    ATAATGTCCAAGATCTTCTGGAGAGTGT
    ATCCCATGCTGTGGAGCACTCTGTGGAAGCCACGGGTCCTTTAGACAGCT
    CATCCTATGAGGAGCACTTCTTAACTGGCACTGGTCTCTTGCAGTTTCTG
    AGAACAAGGCTCTGTGCCATCCCTCGTCTGTTGACTCCCTCTCCACCAGC
    GCAGCCACGGAGGACCACGTCTCCATGGGAGGATGGGCAGCAAGGAAAGC
    CCTCAGGGTCATCGAGCATGTGGAGCAAGGTAATGCTGATGAGTTCGGGG
    TGGCGGGCCTGCCTGATAGACCACTGTGCCTGTGGTTCTCAAGTGGGATC
    TCCCACCAGCAACATCAGCATC ACCTGGAAAC
  </residues>
</seq>
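As a rough illustration of how such a sequence item could be consumed, the Python sketch below reads a feature's motif name and position and cuts the matching bases out of the residues. The function name `motif_sites` is hypothetical, the residues are truncated to the first 40 bases of the slide's sequence, and treating `start`/`end` as 1-based inclusive coordinates is an assumption that the slide does not state.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sequence item; residues truncated to 40 bases.
SEQ_XML = """
<seq id="demo_seq" name="Human HAL Gene Exon 18">
  <feature>
    <motif type="cis-regulatory" name="CBE" id="dm312"/>
    <description>CRX Binding Element</description>
    <position start="21" end="32"/>
  </feature>
  <residues type="dna">
    ATAATGTCCAAGATCTTCTGGAGAGTGT
    ATCCCATGCTGT
  </residues>
</seq>
"""


def motif_sites(seq_xml):
    """Return (motif name, start, end, bases) for each feature of a seq item."""
    seq = ET.fromstring(seq_xml)
    # Remove the line breaks and indentation inside <residues>.
    residues = "".join(seq.findtext("residues").split())
    sites = []
    for feature in seq.findall("feature"):
        motif = feature.find("motif")
        position = feature.find("position")
        start = int(position.get("start"))
        end = int(position.get("end"))
        # Assumption: coordinates are 1-based and inclusive.
        sites.append((motif.get("name"), start, end, residues[start - 1:end]))
    return sites


sites = motif_sites(SEQ_XML)
for name, start, end, bases in sites:
    print(name, start, end, bases)
```

Because the annotation and the sequence travel in the same document, a consumer needs no second lookup to recover the bases a feature refers to.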
11Computational Analysis
<!ELEMENT computational_analysis (date?, program, version?, parameter,
                                  database?, result_set)>
<!ATTLIST computational_analysis seq IDREF #REQUIRED>
<!ELEMENT program (#PCDATA)>
<!ELEMENT result_set (score?, output, result)>
<!ELEMENT result (score, type, subtype?, seq_relationship, output)>
<!ATTLIST result id ID #IMPLIED>
<!ELEMENT seq_relationship (location, alignment?)>
<!ATTLIST seq_relationship seq  IDREF #REQUIRED
                           type (query | subject | peer) #REQUIRED>
<!ELEMENT alignment (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT value (#PCDATA)>
<!ELEMENT parameter (type, value)>
<!ELEMENT output (type, value)>
<!ELEMENT database (name, date?, version?)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT score (#PCDATA)>
12HSP and HSE
- Heat shock and other environmental and pathophysiologic stresses stimulate synthesis of heat shock proteins (Hsps). These proteins enable the cell to survive and recover from stressful conditions by as yet incompletely understood mechanisms.
- A conserved 14 base pair regulatory sequence, referred to as the heat shock element (HSE), is found in multiple imperfect copies upstream of the TATA box of all heat shock genes.
- Genes with an HSE at the upstream region may be co-regulated
13Dataset (Vertebrates)
- >gid 3004462, start=1, end=1027
- >gid 7861931, start=1, end=666
- >gid 7108904, start=1, end=1519
- >gid 7739662, start=1, end=800
- >gid 64795, start=1, end=487
- >gid 64791, start=1, end=614
- >gid 64789, start=1, end=1128
- >gid 64786, start=1, end=374
- >gid 32480, start=1, end=483
- >gid 32484, start=1, end=711
- >gid 7669470, start=1, end=424
- >gid 5729878, start=1, end=313
- >gid 5031770, start=1, end=760
- >gid 1816451, start=1, end=2179
- >gid 184422, start=1, end=2634
- >gid 184416, start=1, end=488
- >gid 188491, start=1, end=959
- >gid 4691417, start=1, end=2631
- >gid 188489, start=1, end=485
- >gid 188487, start=1, end=489
- >gid 184416, start=1, end=488
- >gid 211940, start=1, end=391
- >gid 63508, start=1, end=1421
- >gid 63512, start=1, end=2300
- >gid 409185, start=1, end=1231
- >gid 163160, start=1, end=491
- >gid 414974, start=1, end=426
- Data are from GenBank
14AlignACE program
- uses a Gibbs sampling strategy which is similar to that described by Neuwald et al., 1995
- An iterative masking procedure is used to allow multiple distinct motifs to be found within a single data set
- Reference: Hughes et al., J Mol Biol. 2000;296:1205-14
15AlignACE Results
- ...
- Motif 1
- GGGGAGGGGGTGGGGGGGC 23 788 0
- GGCGGGCGGGCGGCGGGGG 23 867 1
- GGACAGCGGCGGCTGGCTG 11 107 0
- GGGGTGCGGGGGCAGGCGC 23 1417 1
- CCGCGGGGGCGGGCGGGGC 13 2034 1
- ...
-
- MAP Score 794.004
- Motif 2
- GGGGAGGGGGTGGGGGGGCGGGG 23 784 0
- GTGCGGGGGCAGGCGCGGAGAGC 23 1420 1
- GCGGAGCGGGAGGGGGCGTGGCC 13 1932 1
- GGGGTGCGGGAGGGCGGGCGGGC 23 1448 1
- GGGCAGTGGGCGGCTGGCAGCTG 14 1452 1
- ...
16Gibbs Motif Sampler Program
- Uses Stochastic Iterative Sampling
- The Bernoulli motif sampler assumes that each sequence can contain zero or more ungapped motif elements of each motif type
- Reference
- Lawrence et al., Science 1993;262(5131):208-14
- Neuwald et al., Protein Sci. 1995 Aug;4(8):1618-32.
17Gibbs Results
- ...
- 4, 1 284 agtgc AGAGTCTGGAGAGC cgaat 271 0.87 R gid 7739662, start=1, end=800
- 4, 2 425 ggtat AGATGTCGGAGAGT cgttt 412 0.79 R gid 7739662, start=1, end=800
- 4, 3 643 atgga AGCCTCGGGAAACT tcggg 656 0.86 F gid 7739662, start=1, end=800
- 5, 1 239 atgga AGCCTCGGGAAACT tcggg 252 0.86 F gid 64795, start=1, end=487
- 7, 1 401 agtgt GGGTGCTGGAGGCT gacgg 388 0.99 R gid 64789, start=1, end=1128
- 9, 1 26 ggagt GGCGGTGGGAAGGG tgttg 13 0.99 R gid 32480, start=1, end=483
- ...
-
- ...
18Consensus Program
- Uses entropy-based scoring functions
- References
- Stormo and Hartzell, PNAS 1989;86:1183-1187
- Hertz et al., 1990, CABIOS, 6:81-92
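One common entropy-based quantity for a motif profile is the information content (relative entropy) of each column against a background base distribution. The sketch below shows only that basic calculation. The function name `information_content` and the uniform default background are assumptions, and this is not the Consensus program's actual scoring code.

```python
import math


def information_content(column, background=None):
    """Relative entropy, in bits, of one profile column against a background.

    `column` and `background` map each base to a probability.
    """
    if background is None:
        # Assumption: equal background frequency for the four bases.
        background = {base: 0.25 for base in "ACGT"}
    return sum(
        p * math.log2(p / background[base])
        for base, p in column.items()
        if p > 0  # a zero-probability base contributes nothing
    )


# A fully conserved column and a completely uninformative one.
conserved = information_content({"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0})
uniform = information_content({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
print(conserved, uniform)
```

Summing this value over all columns gives a single score for an alignment, so alignments with strongly conserved positions score higher than those that resemble the background.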
19Consensus Results
- MATRIX 1
- ...
- 123 1/593 TGCAAGATTTTTAA
- 29 2/8 TGGAGGCTTCCAGA
- 310 3/889 TGGAGGCTTCCAGA
- ...
- MATRIX 2
- ...
- 123 1/593 TGCAAGATTTTTAA
- 29 2/8 TGGAGGCTTCCAGA
- 310 3/889 TGGAGGCTTCCAGA
- ...
- MATRIX 3
- 123 1/593 TGCAAGATTTTTAA
- 29 2/8 TGGAGGCTTCCAGA
- 310 3/889 TGGAGGCTTCCAGA
- ...
- MATRIX 4
- 121 1/38 GGGAAAGCTCGAGA
20BioProspector Program
- a program that examines the upstream region of genes in the same gene expression pattern group to search for regulatory sequence motifs.
- uses zero to third-order Markov background models
- allows for the searching of gapped motifs and motifs with palindromic patterns
- Reference: Liu et al., Pac Symp Biocomput. 2001:127-38
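A Markov background model of order k gives the probability of each base conditioned on the k bases before it, and order zero reduces to plain base frequencies. The sketch below estimates such a model from raw counts. The function name `markov_background` is hypothetical, no pseudocounts are applied, and this is an illustration of the idea rather than BioProspector's implementation.

```python
from collections import Counter, defaultdict


def markov_background(sequences, order):
    """Estimate P(base | previous `order` bases) from raw counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]  # empty string when order is 0
            counts[context][seq[i]] += 1
    return {
        context: {base: n / sum(nexts.values()) for base, n in nexts.items()}
        for context, nexts in counts.items()
    }


# Toy input, not taken from the heat shock dataset.
first_order = markov_background(["ACGTACGT"], order=1)
zero_order = markov_background(["ACGTACGT"], order=0)
print(first_order)
print(zero_order)
```

Higher orders capture local composition such as dinucleotide bias, at the cost of needing more background sequence to estimate every context reliably.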
21BioProspector Results
- ...
- Motif 1
- ...
- Seq 1 seg 1 r998 TCATCCAATCAGAG
- Seq 2 seg 1 f91 TCAACCGAACAGAA
- Seq 3 seg 1 r638 TCGACCAATCAAAA
- ...
- Motif 2
- ...
- Seq 1 seg 1 f38 GGGAAAGCTCGAGA
- Seq 2 seg 1 r648 TGGAAGCCTCCAGT
- Seq 3 seg 1 r620 TGGAAGCCTCCAGT
- ...
- Motif 3
- ...
- Seq 1 seg 1 r997 CTCATCCAATCAGA
- Seq 2 seg 1 f90 CTCAACCGAACAGA
- Seq 3 seg 1 r637 TTCGACCAATCAAA
- ...
22Conceptions and Interactions of the Underlying Statistical Algorithms Used by the Motif Searching Programs
Gibbs Sampler Iterative Updating Strategy
Two Block Motif Model
23Motif Data Representation
- Common data representation for motif information.
- Uses XML Schema to specify format.
- Both human and machine readable.
- Supports knowledge mining.
- Statements can be asserted about a motif such as
a role in gene regulation.
24Example of a motif
<motif id="GXY1">
  <block>
    <base type="G">0.21</base>
    <base type="C">0.21</base>
    <base type="T">0.59</base>
  </block>
  <block>
    <base type="G">0.44</base>
    <base type="C">0.50</base>
    <base type="T">0.06</base>
  </block>
  <block>
    <base type="A">0.70</base>
    <base type="G">0.29</base>
  </block>
  ...
</motif>

Blk1  A     G     C     T
1     0.00  0.21  0.21  0.59
2     0.00  0.44  0.50  0.06
3     0.70  0.29  0.00  0.00
4     0.32  0.62  0.00  0.06
5     0.03  0.00  0.97  0.00
6     0.00  0.00  1.00  0.00
7     0.85  0.09  0.03  0.03
8     0.88  0.12  0.00  0.00
9     0.03  0.00  0.03  0.94
10    0.03  0.09  0.88  0.00
11    0.70  0.12  0.18  0.00
...
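To show how the XML form maps back onto the profile table, the sketch below parses the first three blocks into one row per position and multiplies the per-position probabilities for a candidate site. The function names `motif_matrix` and `site_probability` are hypothetical, and treating a base that is absent from a block as probability 0.00 is an assumption based on the table.

```python
import xml.etree.ElementTree as ET

# First three blocks of the example motif.
MOTIF_XML = """
<motif id="GXY1">
  <block><base type="G">0.21</base><base type="C">0.21</base><base type="T">0.59</base></block>
  <block><base type="G">0.44</base><base type="C">0.50</base><base type="T">0.06</base></block>
  <block><base type="A">0.70</base><base type="G">0.29</base></block>
</motif>
"""


def motif_matrix(motif_xml):
    """Return (motif id, list of {base: probability} rows, one per block)."""
    motif = ET.fromstring(motif_xml)
    rows = []
    for block in motif.findall("block"):
        # Assumption: a base with no <base> element has probability 0.0.
        row = dict.fromkeys("AGCT", 0.0)
        for base in block.findall("base"):
            row[base.get("type")] = float(base.text)
        rows.append(row)
    return motif.get("id"), rows


def site_probability(rows, site):
    """Probability of `site` under the profile, positions taken as independent."""
    probability = 1.0
    for row, base in zip(rows, site):
        probability *= row[base]
    return probability


motif_id, rows = motif_matrix(MOTIF_XML)
print(motif_id, rows[0], site_probability(rows, "TCA"))
```

Omitting zero-probability bases keeps the documents short, but every consumer then has to agree on the default, which is worth stating in the format description.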
25XML Schema
- Extends the XML document type language
- Data format restrictions.
- Data value (min and max) restrictions.
- Element occurrence (min and max) restrictions.
- No sophisticated restrictions
- Probability distribution.
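A constraint such as "the probabilities in a block form a distribution" therefore has to be checked by the application after schema validation. The sketch below is one possible check. The function name `invalid_blocks` and the tolerance of 0.02 are assumptions, the tolerance being chosen because the two-decimal values in the example motif sum to 0.99 or 1.01 in some rows.

```python
def invalid_blocks(rows, tolerance=0.02):
    """Return 1-based indices of blocks that are not probability distributions."""
    bad = []
    for index, row in enumerate(rows, start=1):
        in_range = all(0.0 <= p <= 1.0 for p in row)
        # Allow for rounding of the published two-decimal values.
        if not in_range or abs(sum(row) - 1.0) > tolerance:
            bad.append(index)
    return bad


rows = [
    (0.00, 0.21, 0.21, 0.59),  # position 1 of the example motif, sums to 1.01
    (0.00, 0.44, 0.50, 0.06),  # position 2
    (0.70, 0.29, 0.00, 0.00),  # position 3, sums to 0.99
    (0.50, 0.50, 0.50, 0.00),  # deliberately invalid row, not from the slides
]
bad = invalid_blocks(rows)
print(bad)
```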
26XML Schema for MotifML
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="motif" type="MotifType"/>
  <!-- A motif consists of a sequence of blocks. -->
  <xsd:complexType name="MotifType">
    <xsd:sequence>
      <xsd:element name="block" minOccurs="0" maxOccurs="unbounded"
                   type="BlockType"/>
    </xsd:sequence>
  </xsd:complexType>
  <!-- A block specifies a probability for each DNA base type. -->
  <xsd:complexType name="BlockType">
    <xsd:sequence>
      <xsd:element name="base" minOccurs="1" maxOccurs="4">
      ...
27Statements about motifs
<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:mml="http://www.beyondgenomics.com/2001/07/motifml"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/ftg6"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>
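Statements of this kind reduce to subject, predicate, object triples, which is what makes them easy to query. The sketch below extracts the triples with Python's standard `xml.etree.ElementTree`. The function name `triples` is hypothetical, the namespace URIs are written as reconstructed from the slide, and a real application would more likely use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/ftg6"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>
"""


def triples(rdf_xml):
    """Return (subject, predicate local name, object) for each statement."""
    root = ET.fromstring(rdf_xml)
    result = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get("about")
        for prop in description:
            # ElementTree tags look like "{namespace}localname".
            predicate = prop.tag.split("}", 1)[1]
            target = prop.get(f"{{{RDF_NS}}}resource")
            result.append((subject, predicate, target))
    return result


statements = triples(RDF_XML)
for statement in statements:
    print(statement)
```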
28The Need for Bio-Ontologies
- How do biologists learn the element structure of a document describing the heterogeneous sequence alignment output?
- How do biologists share the structure and meta-data on motif profiles efficiently and unambiguously?
29A multiple sequence alignment linked with
TRANSFAC/TRANSPATH
human   GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
bovine  GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
mouse   GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
             PCE-I     -CBE--             AP-4
              8888   cETS                   cETS
        -70                       -45                       -20                +1
Alignment Profile
Shown here is the alignment from -70 to +1. The
numbering shown corresponds to the mouse
sequence. Identical bases are marked above each
nucleotide. Consensus sequence matches conserved
among all three species are the Ret-1/PCE-I
element at -65 to -60, the CRX-binding element
(CBE) at -55 to -50, an AP-4 consensus core
sequence at -37 to -34, a cETS consensus core at
-35 to -31 and another at positions -57 to -54.
An S8 homeodomain site is shown by "8888" at -64
to -61. Only the core bases are marked. The
criteria for searching the TRANSFAC Database by
MatInspector were a match to the core sequence of
at least 80% and to the entire consensus sequence
of at least 85%. The GenBank entries for human,
bovine, and mouse are X53044, M32733, and M32734,
respectively. (Boatright, Mol Vis 1997; 3:15)
30Transcriptional Factors Ontology
Diagram concepts: Composite Element, Site, Gene, Context, Transcript, Observation, Transcriptional Motif Elements, Transcriptional Factors
Diagram relations: contains, Upstream to, Within, Kind of, produces, Part of, Found in, Binds to
- Tissue
- Stage
- Disease
- Env.Cond.
- Induced
31MotifML Applications
- Develop a data exchange format for DNA motif data
- Handling output from motif analyses
- Annotation and data mining of micro-array data
- Important in modeling transcriptional regulatory
networks in eukaryotes
32Future Directions
- Distributed Annotation System (Lincoln Stein, Open-Bio)
- Exchange with Other XML Dialects
- DAML development