Master of Science - PowerPoint PPT Presentation

About This Presentation

Title:

Master of Science

Description:

Margaret H. Dunham and Donya Quick Southern Methodist University ... Codon Group of 3 nucleotides. Amino acids have many codings. 3/14/08, UMKC. 10. Protein ... – PowerPoint PPT presentation

Slides: 44

Provided by: dream1

Learn more at: http://lyle.smu.edu

Transcript and Presenter's Notes

1
TCGR: A Novel DNA/RNA Visualization Technique

Margaret H. Dunham and Donya Quick
Southern Methodist University
Dallas, Texas 75275
mhd@engr.smu.edu
Some slides presented at IEEE BIBE 2006
2
Outline

Introduction
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work

3
Outline

Introduction
Background
CGR/FCGR
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work

4
DNA

Deoxyribonucleic Acid
Basic building blocks of organisms
Located in nucleus of cells
Composed of 4 nucleotides
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
Two strands bound together
Contains genetic information

Image source: http://www.visionlearning.com/library/module_viewer.php?mid=63
5
Nucleotide Bases
http://www.people.virginia.edu/rjh9u/gif/bases.gif
6
Transcription

During transcription, DNA is converted into mRNA
RNA is processed and noncoding regions are removed
Coding regions are converted into protein
An enzyme (RNA polymerase) starts transcription
by binding to the DNA code

7
Transcription
http://ghs.gresham.k12.or.us/science/ps/sci/ibbio/chem/nucleic/chpt15/transcription.gif
8
RNA

Ribonucleic Acid
Contains A, C, and G, but U (uracil) instead of T
Single stranded
May fold back on itself
Needed to create proteins
Moves around the cell and can act as a messenger
mRNA moves out of the nucleus to other parts of the cell

9
Translation

Synthesis of proteins from mRNA
Nucleotide sequence of mRNA is converted into the
amino acid sequence of a protein
Four nucleotides
Twenty amino acids
Codon: group of 3 nucleotides
Most amino acids have more than one coding (codon)

10
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
www.bioalgorithms.info, Chapter 6: Gene Prediction
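
The DNA -> RNA -> Protein example on this slide can be reproduced with a short sketch. The function names `transcribe` and `translate` are illustrative and not part of the presentation, and the codon table is deliberately partial: it holds only the seven standard-genetic-code codons that this example needs.

```python
# Partial codon table (standard genetic code), limited to this example's codons.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}


def transcribe(dna: str) -> str:
    """Return the mRNA that matches the DNA coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(mrna) - len(mrna) % 3
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

A full translator would need all 64 codons plus start and stop handling; this sketch raises `KeyError` for any codon outside the partial table.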
11

http://www.time.com/time/magazine/article/0,9171,1541283,00.html

12
Human Genome

Scientists originally thought there would be
about 100,000 genes
There appear to be about 20,000
Why?
The human genome is almost identical to that of chimps.
What makes the difference?
Answers appear to lie in the noncoding regions of
the DNA (formerly thought to be junk)

13
More Questions

If each cell in an organism contains the same DNA,
how does each cell behave differently?
Why do cells behave differently during childhood
development?
What causes some cells to act differently, such
as during disease?
DNA contains many genes, but only a few are being
transcribed. Why?
One answer: miRNA

14
miRNA

Short (20-25 nt) sequence of noncoding RNA
Single strand
Previously assumed to be garbage
Impacts or prevents translation of mRNA
Binds to target areas in mRNA. The problem is that
this binding is not perfect (particularly in
animals)
An mRNA may have multiple (nonoverlapping) binding
sites for one miRNA

15
miRNA Functions

Causes some cancers
Embryo Development
Cell Differentiation
Cell Death
Prevents the production of a protein that causes
lung cancer
Controls brain development in zebrafish
Associated with HIV

16
Outline

Introduction
Background
CGR/FCGR
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work

17
Chaos Game Representation (CGR)

Scatter plot showing occurrence of patterns of
nucleotides.

University of the Basque Country
http://insilico.ehu.es/genomics/my_words/
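
For readers unfamiliar with the chaos game, a minimal sketch of how CGR scatter-plot points are usually generated follows. The corner assignment in `CORNERS` and the starting point at the center of the unit square are assumptions (several layouts are in use), and the function name `cgr_points` is illustrative rather than taken from the presentation.

```python
# Assumed corner layout of the unit square; other assignments are also used.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence: str) -> list[tuple[float, float]]:
    """Return one CGR point per nucleotide, starting from the square's center."""
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper().replace("U", "T"):  # treat RNA U like DNA T
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point toward the corner of this base.
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


points = cgr_points("ACGT")
print(points)
```

Because each step halves the distance to a corner, sequences that end in the same subpattern land in the same sub-square, which is why the scatter plot reveals pattern occurrence.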
18
Frequency CGR (FCGR)

Shows the frequencies of oligonucleotides using a
color scheme normalized to the distribution of
frequency of occurrence of associated patterns.

19
Chaos Game Representation (CGR)
FCGR

2D technique to visually see the distribution of
subpatterns
Our technique is based on the following:
Generate totals for each subpattern
Scale totals to a [0,1] range. (Note: scaling can
be a problem)
Convert range to red/blue
0-0.5: white to blue
0.5-1: blue to red
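
The three steps listed on this slide (totals, scaling, color conversion) might be sketched as follows. Dividing by the largest total and interpolating the colors linearly are assumptions, since the slide does not say how the scaling or the color ramp is computed. The input sequence and all function names are made up for illustration.

```python
from collections import Counter
from itertools import product


def subpattern_totals(sequence: str, k: int, alphabet: str = "ACGU") -> dict[str, int]:
    """Count every overlapping subpattern of length k, including absent ones."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    patterns = ("".join(p) for p in product(alphabet, repeat=k))
    return {pattern: counts.get(pattern, 0) for pattern in patterns}


def scale_to_unit(totals: dict[str, int]) -> dict[str, float]:
    """Scale totals to [0, 1] by dividing by the largest total (assumed scaling)."""
    top = max(totals.values()) or 1  # avoid division by zero on empty input
    return {pattern: n / top for pattern, n in totals.items()}


def to_rgb(value: float) -> tuple[int, int, int]:
    """Map [0, 0.5] from white to blue and (0.5, 1] from blue to red."""
    if value <= 0.5:
        fade = round(255 * (1 - value / 0.5))  # 255 at white, 0 at pure blue
        return (fade, fade, 255)
    t = (value - 0.5) / 0.5                    # 0 at pure blue, 1 at pure red
    return (round(255 * t), 0, round(255 * (1 - t)))


totals = subpattern_totals("UUCGUGUUC", 3)  # hypothetical input sequence
scaled = scale_to_unit(totals)
colors = {pattern: to_rgb(value) for pattern, value in scaled.items()}
print(totals["UUC"], scaled["GUG"], colors["UUC"])
```

Scaling by the maximum means one very frequent subpattern compresses every other value toward white, which may be the scaling problem the slide notes.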

20
FCGR
Figures courtesy of Eamonn Keogh, UCR
21
FCGR Example
Homo sapiens, all mature miRNA; patterns of length 3
Patterns labeled in figure: UUC, GUG
22
Outline

Introduction
Background
CGR/FCGR
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work

23
Motivation
2000 bp flanking upstream region of mir-258.2 in C. elegans
a) All 2000 bp  b) First 240 bp  c) Last 240 bp
24
Research Objectives

Identify, develop, and implement algorithms which
can be used for identifying potential miRNA
functions.
Create an online tool which can be used by other
researchers to apply our algorithms to new data.

25
Outline

Introduction
CGR/FCGR
miRNA
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work

26
Temporal CGR (TCGR)

Temporal version of Frequency CGR
In our context, temporal means the starting
location of a window
2D array
Each row represents the counts for a particular
window in the sequence
First row: first window
Last row: last window
We start successive windows at the next character
location
Each column represents the counts for the
associated pattern in that window
Initially we have assumed the order of patterns is
alphabetic
Size of TCGR depends on sequence length and
subpattern length
As sequence lengths vary, we only examine
complete windows
We only count patterns completely contained in
each window.
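
The construction described on this slide can be sketched in a few lines. The function name `tcgr` and its parameters are illustrative and not the authors' implementation. The input is the first ten characters of the example sequence (windows "acgtgcacg" and "cgtgcacgt"), so the two rows should match the counts given for Pos 0-8 and Pos 1-9.

```python
from itertools import product


def tcgr(sequence: str, window: int, k: int, alphabet: str = "ACGT") -> list[list[int]]:
    """Build a TCGR count matrix: one row per window, one column per pattern."""
    # Columns follow alphabetic pattern order, as the slide assumes.
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    column = {pattern: i for i, pattern in enumerate(patterns)}
    seq = sequence.upper()
    rows = []
    # Only complete windows are examined; each window starts one character later.
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        row = [0] * len(patterns)
        # Only patterns completely contained in the window are counted.
        for i in range(window - k + 1):
            row[column[chunk[i:i + k]]] += 1
        rows.append(row)
    return rows


rows = tcgr("acgtgcacgt", window=9, k=1)
print(rows)  # [[2, 3, 3, 1], [1, 3, 3, 2]]
```

A sequence of length n yields n - window + 1 rows and 4^k columns, which is why the TCGR size depends on both sequence length and subpattern length. Characters outside the alphabet raise `KeyError` in this sketch.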

27
TCGR Example
Counts
            A    C    G    T
Pos 0-8     2    3    3    1
Pos 1-9     1    3    3    2
...
Pos 34-42   2    4    2    1

Scaled to [0,1]
            A    C    G    T
Pos 0-8     0.4  0.6  0.6  0.2
Pos 1-9     0.2  0.6  0.6  0.4
...
Pos 34-42   0.4  0.8  0.4  0.2
28
TCGR Example (contd)

TCGRs for Sub-patterns of length 1, 2, and 3

29
TCGR Example (contd)
Columns: A C G T
Window 0   Pos 0-8     acgtgcacg
Window 1   Pos 1-9     cgtgcacgt
Window 17  Pos 17-25   tccggaacc
Window 18  Pos 18-26   ccggaacca
Window 34  Pos 34-42   ccacgtcga
30
TCGR: Viruses miRNA (Window = 9; Pattern = 1, 2, 3)
Viruses: Epstein Barr, Human Cytomegalovirus, Kaposi sarc Herpesvirus, Mouse Gammaherpesvirus
Pattern lengths shown: Pattern 1, Pattern 2, Pattern 3
31
TCGR: Mature miRNA (Window = 5; Pattern = 3)
32
Outline

Introduction
CGR/FCGR
miRNA
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work

33
EMM Overview

Time Varying Discrete First Order Markov Model
Nodes are clusters of real world states.
Learning continues during prediction phase.
Learning
Transition probabilities between nodes
Node labels (centroid of cluster)
Nodes are added and removed as data arrives

34
EMM Definition

Extensible Markov Model (EMM): at any time t, an EMM
consists of a Markov chain (MC) with a designated
current node, Nn, and algorithms to modify it, where
the algorithms include:
EMMCluster, which defines a technique for
matching between input data at time t + 1 and
existing states in the MC at time t.
EMMIncrement, which updates the MC at time
t + 1 given the MC at time t and the clustering
measure result at time t + 1.
EMMDecrement, which removes nodes from
the EMM when needed.

35
EMM Cluster

Find the closest node to the incoming event.
If none is close, create a new node.
The label of a cluster is the centroid of the
members in the cluster.
O(n)

36
EMM Increment
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
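
A toy sketch of the cluster and increment steps, fed with the six vectors from this slide, is given below. It is not the authors' implementation: the class `SimpleEMM`, the Euclidean distance measure, and the threshold value of 2.0 are all assumptions chosen for illustration, and node removal (EMMDecrement) is left out. The node path it prints is a consequence of those assumptions and need not match the figure on the original slide.

```python
import math
from collections import defaultdict


class SimpleEMM:
    """Toy EMM: nodes are cluster centroids, links carry transition counts."""

    def __init__(self, threshold: float):
        self.threshold = threshold        # hypothetical "close enough" distance
        self.centroids: list[list[float]] = []
        self.sizes: list[int] = []
        self.links = defaultdict(int)     # (from_node, to_node) -> count
        self.current = None               # index of the current node

    def cluster(self, vector) -> int:
        """Cluster step: index of the nearest node, or -1 if none is close."""
        best, best_dist = -1, math.inf
        for i, centroid in enumerate(self.centroids):  # O(n) in the node count
            dist = math.dist(centroid, vector)
            if dist < best_dist:
                best, best_dist = i, dist
        return best if best_dist <= self.threshold else -1

    def increment(self, vector) -> int:
        """Increment step: add or update a node, then count the transition."""
        node = self.cluster(vector)
        if node == -1:                    # nothing close: create a new node
            self.centroids.append([float(v) for v in vector])
            self.sizes.append(1)
            node = len(self.centroids) - 1
        else:                             # keep the label at the members' centroid
            n = self.sizes[node] + 1
            old = self.centroids[node]
            self.centroids[node] = [c + (v - c) / n for c, v in zip(old, vector)]
            self.sizes[node] = n
        if self.current is not None:
            self.links[(self.current, node)] += 1
        self.current = node
        return node

    def transition_probability(self, src: int, dst: int) -> float:
        """Estimate P(dst | src) from the link counts seen so far."""
        total = sum(n for (a, _), n in self.links.items() if a == src)
        return self.links.get((src, dst), 0) / total if total else 0.0


vectors = [
    (18, 10, 3, 3, 1, 0, 0), (17, 10, 2, 3, 1, 0, 0), (16, 9, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 1, 0, 0), (14, 8, 2, 3, 0, 0, 0), (18, 10, 3, 3, 1, 1, 0),
]
emm = SimpleEMM(threshold=2.0)
path = [emm.increment(v) for v in vectors]
print(path)
print(emm.transition_probability(0, 1))
```

Updating the centroid as a running mean keeps each node label equal to the centroid of its members without storing the members themselves.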
37
Outline

Introduction
CGR/FCGR
miRNA
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work

38
Research Approach

Represent each potential miRNA sequence as a TCGR
sequence of count vectors
Create an EMM using count vectors for known miRNA
(miRNA stem loops, miRNA targets)
Predict whether an unknown sequence is a miRNA
(miRNA stem loop, miRNA target) based on the
normalized product of transition probabilities
along the clustering path in the EMM
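
The slide does not state how the product of transition probabilities is normalized. One plausible reading, shown here purely as an assumption, is the geometric mean, which removes the dependence on path length. The function name and the example probabilities are hypothetical.

```python
import math


def normalized_path_score(probabilities: list[float]) -> float:
    """Geometric mean of the transition probabilities along a clustering path.

    Assumed normalization: the slide only says "normalized product".
    """
    if not probabilities or any(p == 0.0 for p in probabilities):
        return 0.0  # a single unseen transition zeroes the whole product
    log_sum = sum(math.log(p) for p in probabilities)  # logs avoid underflow
    return math.exp(log_sum / len(probabilities))


score = normalized_path_score([0.5, 0.5, 0.25, 0.25])  # hypothetical path
print(round(score, 4))
```

Under this reading, any path that uses a transition never seen in training scores exactly 0.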

39
Related Work [1]

Predicted occurrence of pre-miRNA segments from a
set of hairpin sequences
No assumptions about biological function or
conservation across species.
Used SVMs to differentiate the structure of
hairpin segments that contained pre-miRNAs from
those that did not.
Sensitivity of 93.3%
Specificity of 88.1%
[1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X.
Zhang, "Classification of Real and Pseudo
MicroRNA Precursors using Local
Structure-Sequence Features and Support Vector
Machine," BMC Bioinformatics, vol. 6, no. 310.

40
Preliminary Test Data [1]

Positive Training: This dataset consists of 163
human pre-miRNAs with lengths of 62-119.
Negative Training: This dataset was obtained
from protein coding regions of human RefSeq
genes. As these are from coding regions, it is
likely that there are no true pre-miRNAs in this
data. This dataset contains 168 sequences with
lengths between 63 and 110 characters.
Positive Test: This dataset contains 30
pre-miRNAs.
Negative Test: This dataset contains 1000
randomly chosen sequences from coding regions.
[1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X.
Zhang, "Classification of Real and Pseudo
MicroRNA Precursors using Local
Structure-Sequence Features and Support Vector
Machine," BMC Bioinformatics, vol. 6, no. 310.

41
TCGRs for Xue Training Data
POSITIVE
NEGATIVE
42
TCGRs for Xue Test Data
POSITIVE
NEGATIVE
43
Predictive Probabilities with Xue's Data
EMM       Test Data  Mean     Std Dev   Max      Min
Negative  Test-Neg   0        0         0        0
Negative  Test-Pos   0        0         0        0
Negative  Train-Neg  0.37963  0.050085  0.91256  0.2945
Negative  Train-Pos  0        0         0        0
Positive  Test-Neg   0        0         0        0
Positive  Test-Pos   0.25894  0.18701   0.42075  0
Positive  Train-Neg  0        0         0        0
Positive  Train-Pos  0.38926  0.048439  0.91155  0.32209
44
Preliminary Test Results

Positive EMM
Cutoff probability: 0.3
False positive rate: 0%
True positive rate: 66%
Test results could be improved by meta-classifiers
that combine multiple positive and negative
classifiers.
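
As a sketch of how rates like these follow from a cutoff, the function below labels a sequence positive when its score reaches the cutoff. Whether the comparison is inclusive is an assumption, and the scores are invented for illustration; they are not the presentation's data.

```python
def rates_at_cutoff(pos_scores, neg_scores, cutoff):
    """Return (true positive rate, false positive rate) for scores >= cutoff."""
    tpr = sum(s >= cutoff for s in pos_scores) / len(pos_scores)
    fpr = sum(s >= cutoff for s in neg_scores) / len(neg_scores)
    return tpr, fpr


# Hypothetical scores from a positive EMM: three true pre-miRNAs, four negatives.
tpr, fpr = rates_at_cutoff([0.42, 0.35, 0.0], [0.0, 0.0, 0.0, 0.0], cutoff=0.3)
print(tpr, fpr)
```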

45
Outline

Introduction
CGR/FCGR
miRNA
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work

46
Future Research

Obtain all known mature miRNA sequences for a
species; initially, the 119 C. elegans miRNAs.
Create TCGR count vectors for each sequence and
each sub-pattern length (1, 2, 3, 4, 5).
Train EMMs using this data for each sub-pattern
length. Thus five EMMs will be created.
Obtain negative data (much as Xue did in his
research) from coding regions for C. elegans.
Train EMMs using this data for each sub-pattern
length. Thus five more EMMs will be created.
Construct a meta-classifier based on the combined
results of prediction from each of these ten
EMMs.
Apply the EMM classifier to the existing 75 x 10^6
base pairs of non-exonic sequence in the C.
elegans genome to search for miRNAs. Note: all
119 validated C. elegans miRNAs are contained in
the non-exonic part of the genome, and thus the
first pass of the algorithm will be tested for
its ability to detect all 119 validated miRNAs.
Validate the prediction of novel miRNAs using
molecular biology.