Title: Master of Science
1TCGR: A Novel DNA/RNA Visualization Technique
Margaret H. Dunham and Donya Quick
Southern Methodist University
Dallas, Texas 75275
mhd_at_engr.smu.edu
Some slides presented at IEEE BIBE 2006
2Outline
- Introduction
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
3Outline
- Introduction
- Background
- CGR/FCGR
- Motivation
- Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
4DNA
- Deoxyribonucleic Acid
- Basic building blocks of organisms
- Located in nucleus of cells
- Composed of 4 nucleotides
- Adenine (A)
- Cytosine (C)
- Guanine (G)
- Thymine (T)
- Two strands bound together
- Contains genetic information
Image source: http://www.visionlearning.com/library/module_viewer.php?mid=63
5Nucleotide Bases
http://www.people.virginia.edu/rjh9u/gif/bases.gif
6Transcription
- During transcription, DNA is converted into mRNA
- RNA is processed and noncoding regions are removed
- Coding regions are converted into protein
- An enzyme (RNA polymerase) starts transcription
by binding to the DNA
7Transcription
http://ghs.gresham.k12.or.us/science/ps/sci/ibbio/chem/nucleic/chpt15/transcription.gif
8RNA
- Ribonucleic Acid
- Contains A,C,G but U (Uracil) instead of T
- Single Stranded
- May fold back on itself
- Needed to create proteins
- Moves around cells; can act like a messenger
- mRNA moves out of nucleus to other parts of cell
9Translation
- Synthesis of Proteins from mRNA
- Nucleotide sequence of mRNA converted into amino
acid sequence of protein
- Four nucleotides
- Twenty amino acids
- Codon: group of 3 nucleotides
- An amino acid may be encoded by several codons
10Central Dogma: DNA -> RNA -> Protein
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
Source: www.bioalgorithms.info, Chapter 6: Gene Prediction
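The DNA -> RNA -> protein example on this slide can be reproduced with a short sketch. It is illustrative only: the function names `transcribe` and `translate` are made up here, transcription is simplified to replacing T with U on the coding strand, and the codon table is the standard genetic code written in the usual compact U, C, A, G ordering.

```python
# Illustrative sketch of the central dogma (hypothetical helper names).
BASES = "UCAG"
# Standard genetic code, one letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def transcribe(dna):
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Read the mRNA three nucleotides (one codon) at a time."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        a, b, c = (BASES.index(n) for n in mrna[i:i + 3])
        protein.append(AMINO[16 * a + 4 * b + c])
    return "".join(protein)

mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The table has 64 entries for only twenty amino acids (plus stop, shown as `*`), which is why one amino acid can have several codons.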
11- http://www.time.com/time/magazine/article/0,9171,1541283,00.html
12Human Genome
- Scientists originally thought there would be
about 100,000 genes
- Appear to be about 20,000
- WHY?
- Almost identical to that of chimps. What makes
the difference?
- Answers appear to lie in the noncoding regions of
the DNA (formerly thought to be junk)
13More Questions
- If each cell in an organism contains the same DNA:
- How does each cell behave differently?
- Why do cells behave differently during childhood
development?
- What causes some cells to act differently, such
as during disease?
- DNA contains many genes, but only a few are being
transcribed. Why?
- One answer: miRNA
14miRNA
- Short (20-25nt) sequence of noncoding RNA
- Single strand
- Previously assumed to be garbage
- Impacts/prevents translation of mRNA
- Binds to target areas in mRNA. The problem is that
this binding is not perfect (particularly in
animals)
- mRNA may have multiple (nonoverlapping) binding
sites for one miRNA
15miRNA Functions
- Causes some cancers
- Embryo Development
- Cell Differentiation
- Cell Death
- Prevents the production of a protein that causes
lung cancer
- Controls brain development in zebra fish
- Associated with HIV
17Chaos Game Representation (CGR)
- Scatter plot showing occurrence of patterns of
nucleotides.
University of the Basque Country
http://insilico.ehu.es/genomics/my_words/
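For readers unfamiliar with how a CGR scatter plot is built, here is a minimal sketch of the usual construction: start at the center of the unit square and, for each nucleotide, move halfway toward that nucleotide's corner. The corner assignment below (A, C, G, T at the four corners) is one common convention and is an assumption here, as is the helper name `cgr_points`; the slides do not specify either.

```python
# Minimal CGR sketch. Corner layout is an assumed convention, not taken from the slides.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return one (x, y) point per nucleotide; U is treated as T."""
    x, y = 0.5, 0.5                      # start in the middle of the square
    points = []
    for base in seq.upper().replace("U", "T"):
        if base not in CORNERS:          # skip ambiguity codes such as N
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway toward the corner
        points.append((x, y))
    return points

print(cgr_points("AG"))  # [(0.25, 0.25), (0.625, 0.625)]
```

Because each step halves the distance to a corner, every point's sub-square identifies the most recent nucleotides, so dense regions of the plot correspond to frequent patterns.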
18Frequency CGR (FCGR)
- Shows the frequencies of oligonucleotides using a
color scheme normalized to the distribution of
frequency of occurrence of associated patterns.
19Chaos Game Representation (CGR)
FCGR
- 2D technique to visually see the distribution of
subpatterns
- Our technique is based on the following:
- Generate totals for each subpattern
- Scale totals to a [0,1] range. (Note: scaling can
be a problem)
- Convert range to red/blue
- 0-0.5: White to Blue
- 0.5-1: Blue to Red
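The three steps above (totals, scaling, color) can be sketched as follows. This is not the authors' implementation: the function names are hypothetical, scaling by the largest total is only one possible choice (the slide itself notes that scaling can be a problem), and linear interpolation between white, blue, and red is assumed.

```python
from collections import Counter
from itertools import product

def kmer_totals(seq, k, alphabet="ACGU"):
    """Total occurrences of every length-k subpattern, in alphabetic order."""
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(p): seen.get("".join(p), 0) for p in product(alphabet, repeat=k)}

def scale_unit(totals):
    """Scale totals to [0, 1] by dividing by the largest total (an assumption)."""
    peak = max(totals.values())
    return {p: (c / peak if peak else 0.0) for p, c in totals.items()}

def to_rgb(v):
    """Map 0-0.5 from white to blue and 0.5-1 from blue to red."""
    white, blue, red = (255, 255, 255), (0, 0, 255), (255, 0, 0)
    if v <= 0.5:
        lo, hi, t = white, blue, v / 0.5
    else:
        lo, hi, t = blue, red, (v - 0.5) / 0.5
    return tuple(round(a + (b - a) * t) for a, b in zip(lo, hi))

scaled = scale_unit(kmer_totals("UUCUUCGUG", 3))
colors = {p: to_rgb(v) for p, v in scaled.items()}
```

Each subpattern then gets one colored cell in the FCGR image; the cell layout itself is not shown here.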
20FCGR
Figures courtesy of Eamonn Keogh, UCR
21FCGR Example
Figure: FCGR of all Homo sapiens mature miRNA, patterns of
length 3 (cells UUC and GUG labeled)
23Motivation
2000 bp flanking upstream region of mir-258.2 in C. elegans
a) All 2000 bp b) First 240 bp c) Last 240 bp
24Research Objectives
- Identify, develop, and implement algorithms which
can be used for identifying potential miRNA
functions.
- Create an online tool which can be used by other
researchers to apply our algorithms to new data.
25Outline
- Introduction
- CGR/FCGR
- miRNA
- Motivation
- Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
26Temporal CGR (TCGR)
- Temporal version of Frequency CGR
- In our context, temporal means the starting
location of a window
- 2D Array
- Each row represents counts for a particular
window in the sequence
- First row: first window
- Last row: last window
- We start successive windows at the next character
location
- Each column represents the counts for the
associated pattern in that window
- Initially we have assumed the order of patterns is
alphabetic
- Size of TCGR depends on sequence length and
subpattern length
- As sequence lengths vary, we only examine
complete windows
- We only count patterns completely contained in
each window.
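The construction described above can be sketched in a few lines. This is a sketch under stated assumptions rather than the authors' code: the function name `tcgr` and its parameters are hypothetical, windows slide one character at a time, columns follow alphabetic pattern order, and only complete windows and fully contained patterns are counted.

```python
from itertools import product

def tcgr(seq, window, k, alphabet="ACGT"):
    """Return (patterns, rows): one row of pattern counts per complete window."""
    seq = seq.upper()
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]  # alphabetic order
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    for start in range(len(seq) - window + 1):   # complete windows only
        win = seq[start:start + window]
        row = [0] * len(patterns)
        for i in range(window - k + 1):          # patterns fully inside the window
            j = column.get(win[i:i + k])
            if j is not None:                    # ignore characters outside the alphabet
                row[j] += 1
        rows.append(row)
    return patterns, rows

patterns, rows = tcgr("acgtgcacgt", window=9, k=1)
print(patterns)  # ['A', 'C', 'G', 'T']
print(rows)      # [[2, 3, 3, 1], [1, 3, 3, 2]]
```

The input here is the first ten characters of the example sequence on the next slides, so the two rows match the counts listed there for Pos 0-8 and Pos 1-9. The array has (sequence length - window + 1) rows and (alphabet size)^k columns.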
27TCGR Example
            A    C    G    T
Pos 0-8     2    3    3    1
Pos 1-9     1    3    3    2
...
Pos 34-42   2    4    2    1

            A    C    G    T
Pos 0-8     0.4  0.6  0.6  0.2
Pos 1-9     0.2  0.6  0.6  0.4
...
Pos 34-42   0.4  0.8  0.4  0.2
28TCGR Example (contd)
- TCGRs for sub-patterns of length 1, 2, and 3
29TCGR Example (contd)
Columns: A C G T
Window 0    Pos 0-8     acgtgcacg
Window 1    Pos 1-9     cgtgcacgt
Window 17   Pos 17-25   tccggaacc
Window 18   Pos 18-26   ccggaacca
Window 34   Pos 34-42   ccacgtcga
30TCGR Viruses miRNA (Window 9, Patterns 1, 2, 3)
Figure: TCGRs for Epstein Barr, Human Cytomegalovirus,
Kaposi sarcoma Herpesvirus, and Mouse Gammaherpesvirus;
panels for Pattern 1, Pattern 2, Pattern 3
31TCGR Mature miRNA (Window 5, Pattern 3)
33EMM Overview
- Time Varying Discrete First Order Markov Model
- Nodes are clusters of real world states.
- Learning continues during prediction phase.
- Learning
- Transition probabilities between nodes
- Node labels (centroid of cluster)
- Nodes are added and removed as data arrives
34EMM Definition
- Extensible Markov Model (EMM): at any time t, an EMM
consists of a Markov chain (MC) with a designated
current node, Nn, and algorithms to modify it, where
the algorithms include:
- EMMCluster, which defines a technique for
matching between input data at time t + 1 and
existing states in the MC at time t.
- EMMIncrement algorithm, which updates the MC at time
t + 1 given the MC at time t and the clustering
measure result at time t + 1.
- EMMDecrement algorithm, which removes nodes from
the EMM when needed.
35EMM Cluster
- Find closest node to incoming event.
- If none close create new node
- Labeling of cluster is centroid of members in
cluster
- O(n)
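A nearest-node clustering step of this kind might look like the sketch below. The slides do not say which distance measure or closeness threshold is used, so Euclidean distance and the `threshold` parameter are assumptions, and `emm_cluster` with its dictionary-based node layout is a hypothetical illustration rather than the EMMCluster algorithm itself.

```python
import math

def emm_cluster(nodes, event, threshold):
    """Match event to the closest node, or create a new node if none is close.

    nodes is a list of {"centroid": [...], "count": n}; returns the node index.
    """
    best, best_d = None, math.inf
    for i, node in enumerate(nodes):               # linear scan over nodes: O(n)
        d = math.dist(node["centroid"], event)     # Euclidean distance (assumed)
        if d < best_d:
            best, best_d = i, d
    if best is None or best_d > threshold:         # nothing close: add a node
        nodes.append({"centroid": list(event), "count": 1})
        return len(nodes) - 1
    node = nodes[best]
    node["count"] += 1
    n = node["count"]
    # Keep the label equal to the centroid (running mean) of the cluster members.
    node["centroid"] = [c + (x - c) / n for c, x in zip(node["centroid"], event)]
    return best
```

For example, with an empty node list and a threshold of 1.0, the events [0, 0] and [0.5, 0] fall into the same node (whose centroid becomes [0.25, 0.0]), while [10, 10] creates a second node.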
36EMM Increment
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
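The bookkeeping behind an increment step can be sketched as transition counting: each time an input vector is assigned to a node, the link from the previous current node to the new one is counted, and transition probabilities are read off as ratios. The class and method names below are hypothetical, and node ids are assumed to come from a separate clustering step.

```python
from collections import defaultdict

class TransitionCounts:
    """Illustrative first-order transition bookkeeping (not the authors' code)."""

    def __init__(self):
        self.link = defaultdict(int)   # (from_node, to_node) -> times observed
        self.out = defaultdict(int)    # from_node -> total transitions leaving it
        self.current = None            # designated current node

    def increment(self, node_id):
        """Record a move from the current node to node_id, then make it current."""
        if self.current is not None:
            self.link[(self.current, node_id)] += 1
            self.out[self.current] += 1
        self.current = node_id

    def probability(self, src, dst):
        """Estimated transition probability; 0.0 if src has no outgoing links."""
        total = self.out.get(src, 0)
        return self.link.get((src, dst), 0) / total if total else 0.0

counts = TransitionCounts()
for node_id in [0, 1, 0, 1, 2]:        # hypothetical sequence of matched nodes
    counts.increment(node_id)
print(counts.probability(0, 1))        # 1.0
print(counts.probability(1, 0))        # 0.5
```

Because the counts are updated on every input, learning can continue while the model is also being used for prediction.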
38Research Approach
- Represent potential miRNA sequence with TCGR
sequence of count vectors
- Create EMM using count vectors for known miRNA
(miRNA stem loops, miRNA targets)
- Predict unknown sequence to be miRNA (miRNA stem
loop, miRNA target) based on normalized product
of transition probabilities along clustering path
in EMM
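One way to read "normalized product of transition probabilities" is as a geometric mean over the steps of the path, which keeps scores comparable across sequences of different lengths. That choice of normalization is an assumption; the slides do not define it. The function name `path_score` and the `prob` callback are likewise hypothetical.

```python
import math

def path_score(path, prob):
    """Geometric mean of transition probabilities along a node path.

    path is a list of node ids; prob(src, dst) returns a transition probability.
    Returns 0.0 if the path has no transitions or uses an unseen transition.
    """
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    log_sum = 0.0
    for src, dst in steps:
        p = prob(src, dst)
        if p <= 0.0:                   # one unseen transition zeroes the product
            return 0.0
        log_sum += math.log(p)         # sum logs to avoid underflow on long paths
    return math.exp(log_sum / len(steps))

# Hypothetical transition table for illustration only.
table = {(0, 1): 0.5, (1, 2): 0.5}
score = path_score([0, 1, 2], lambda s, d: table.get((s, d), 0.0))
```

Under this reading, a sequence whose windows follow transitions never seen in training scores exactly zero.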
39Related Work [1]
- Predicted occurrence of pre-miRNA segments from a
set of hairpin sequences
- No assumptions about biological function or
conservation across species.
- Used SVMs to differentiate the structure of
hairpin segments that contained pre-miRNAs from
those that did not.
- Sensitivity of 93.3%
- Specificity of 88.1%
- [1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X.
Zhang, "Classification of Real and Pseudo
MicroRNA Precursors using Local
Structure-Sequence Features and Support Vector
Machine," BMC Bioinformatics, vol. 6, no. 310.
40Preliminary Test Data [1]
- Positive Training: This dataset consists of 163
human pre-miRNAs with lengths of 62-119.
- Negative Training: This dataset was obtained
from protein coding regions of human RefSeq
genes. As these are from coding regions it is
likely that there are no true pre-miRNAs in this
data. This dataset contains 168 sequences with
lengths between 63 and 110 characters.
- Positive Test: This dataset contains 30
pre-miRNAs.
- Negative Test: This dataset contains 1000
randomly chosen sequences from coding regions.
- [1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X.
Zhang, "Classification of Real and Pseudo
MicroRNA Precursors using Local
Structure-Sequence Features and Support Vector
Machine," BMC Bioinformatics, vol. 6, no. 310.
41TCGRs for Xue Training Data
POSITIVE
NEGATIVE
42TCGRs for Xue Test Data
POSITIVE
NEGATIVE
43Predictive Probabilities with Xue's Data
EMM       Test Data   Mean     Std Dev   Max      Min
Negative  Test-Neg    0        0         0        0
Negative  Test-Pos    0        0         0        0
Negative  Train-Neg   0.37963  0.050085  0.91256  0.2945
Negative  Train-Pos   0        0         0        0
Positive  Test-Neg    0        0         0        0
Positive  Test-Pos    0.25894  0.18701   0.42075  0
Positive  Train-Neg   0        0         0        0
Positive  Train-Pos   0.38926  0.048439  0.91155  0.32209
44Preliminary Test Results
- Positive EMM
- Cutoff Probability: 0.3
- False Positive Rate: 0%
- True Positive Rate: 66%
- Test results could be improved by meta
classifiers combining multiple positive and
negative classifiers together.
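For reference, applying a cutoff to predictive probabilities and deriving the two rates can be sketched as below. The scores in the usage line are made-up placeholders, not the test data, and treating a score at or above the cutoff as a positive prediction is an assumption.

```python
def rates(pos_scores, neg_scores, cutoff):
    """Return (true positive rate, false positive rate) for a score cutoff."""
    tp = sum(s >= cutoff for s in pos_scores)   # known positives predicted positive
    fp = sum(s >= cutoff for s in neg_scores)   # known negatives predicted positive
    tpr = tp / len(pos_scores) if pos_scores else 0.0
    fpr = fp / len(neg_scores) if neg_scores else 0.0
    return tpr, fpr

# Placeholder scores for illustration only.
tpr, fpr = rates([0.40, 0.35, 0.10], [0.0, 0.2], cutoff=0.3)
```

Raising the cutoff can only lower both rates, so the cutoff trades missed positives against false alarms.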
46Future Research
- Obtain all known mature miRNA sequences for a
species, initially the 119 C. elegans miRNAs.
- Create TCGR count vectors for each sequence and
each sub-pattern length (1,2,3,4,5).
- Train EMMs using this data for each sub-pattern
length. Thus five EMMs will be created.
- Obtain negative data (much as Xue did in his
research) from coding regions for C. elegans.
- Train EMMs using this data for each sub-pattern
length. Thus five EMMs will be created.
- Construct a meta-classifier based on the combined
results of prediction from each of these ten
EMMs.
- Apply the EMM classifier to the existing 75 x 10^6
base pairs of non-exonic sequence in the C.
elegans genome to search for miRNAs. Note: all
119 validated C. elegans miRNAs are contained in
the non-exonic part of the genome and thus the
first pass of the algorithm will be tested for
its ability to detect all 119 validated miRNAs.
- Validate the prediction of novel miRNAs using
molecular biology.