1
Identifying Patterns in DNA Change
Jason Gilder
Bioinformatics Research Group, Wright State University
MAICS Presentation, April 12, 2003
2
ALU Background
  • Well-known sequences broken into families
  • Short Interspersed Repetitive Elements (SINEs)
  • Approximately 280 bp long
  • 10% of human genome

3
ALU Background (Cont.)
  • Proliferate through retrotransposition
    - Copy is transcribed, reverse transcribed, and reinserted at a distant site
  • Original progenitor sequence known
  • Can trace evolutionary path
    - Number of changes
    - Types of changes

4
Problem
  • Use evolutionary computation (EC) to predict substitution rates
    (e.g., the number of Cs that used to be As)
  • Only use features of the repeat itself, not the progenitor
    - Content information for repeat
    - GC content in flanking regions
  • Feasible?
    - Enough features?
    - Correct features?
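
A substitution count of the kind described above ("the number of Cs that used to be As") can be sketched as a tally over an aligned progenitor/repeat pair. This is only an illustration: the function name, the gap handling, and the toy sequences are assumptions, and the slides do not say how counts were normalized into rates.

```python
from collections import Counter

BASES = "ACGT"

def substitution_counts(progenitor: str, repeat: str) -> Counter:
    """Tally (progenitor_base, repeat_base) pairs over two aligned strings.

    Columns with a gap ('-') or a non-ACGT symbol are skipped (an assumption).
    """
    if len(progenitor) != len(repeat):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for p, r in zip(progenitor.upper(), repeat.upper()):
        if p in BASES and r in BASES and p != r:
            counts[(p, r)] += 1
    return counts

# Toy aligned pair (not data from the presentation)
counts = substitution_counts("ACGTACGA", "CCGTATGA")
# counts[("A", "C")] is the number of Cs that used to be As
```

With four bases there are 12 possible (from, to) substitution types, which matches the 12 rates mentioned in the conclusions.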

5
Feature Set
  • 16 features
    - Length of repeat, # of As, # of Gs, # of Cs, # of Ts, and GC content percentage of the ALU
    - GC content for 10 flanking regions (500 to 20,000 nts)
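
The 16 features listed above could be assembled roughly as follows. This is a minimal sketch: the ten window sizes are hypothetical (the slide only gives the 500 to 20,000 nt range), and taking each flanking region as the window on both sides of the repeat is an assumption.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T characters of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def repeat_features(repeat: str) -> list:
    """Length, # of As, Gs, Cs, Ts, and GC percentage of the repeat."""
    s = repeat.upper()
    return [len(s), s.count("A"), s.count("G"), s.count("C"), s.count("T"),
            100.0 * gc_fraction(s)]

def flanking_gc(chrom: str, start: int, end: int, window: int) -> float:
    """GC percentage of up to `window` nts on each side of chrom[start:end]."""
    left = chrom[max(0, start - window):start]
    right = chrom[end:end + window]
    return 100.0 * gc_fraction(left + right)

# Hypothetical window sizes; only the 500..20,000 range comes from the slide.
WINDOWS = [500, 1000, 2000, 4000, 6000, 8000, 10000, 12000, 15000, 20000]

def feature_vector(chrom: str, start: int, end: int, windows=WINDOWS) -> list:
    """6 repeat features followed by one flanking GC value per window."""
    return repeat_features(chrom[start:end]) + [
        flanking_gc(chrom, start, end, w) for w in windows
    ]
```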

6
Data Set
  • ALU Y Family in Chromosome 1
  • CENSOR used to get substitution data
  • http://www.girinst.org/Censor_Server.html
  • 6,749 examples
  • 5,000 Training (74%)
  • 1,749 Holdout Testing (26%)
  • Each chosen randomly
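
A random training/holdout split of this size can be sketched as below. The function name and the fixed seed are assumptions added for reproducibility; the slides only state that each set was chosen randomly.

```python
import random

def holdout_split(examples, n_train, seed=0):
    """Shuffle a copy of the examples and cut it into training and holdout sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Integer ids stand in for the 6,749 ALU Y examples.
train, test = holdout_split(range(6749), n_train=5000)
```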

7
CENSOR Alignments
ALUY 11 282 CONTIG-1P1 317456 317723 0.88 0.10 1.53 272 206.52

ALUY        GGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGA
CONTIG-1P1  GCTGGATGTC-CCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCATGAGGTCAGGAGATCGA

ALUY        GACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTGGTGGC
CONTIG-1P1  GACCATTCTGGCTAACACAGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCAGGCATGGTGGC

ALUY        GGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTT
CONTIG-1P1  ACACGCCTCTAGTCCCAACTACTCAGGAGGCTGACACAGGAGAATCACTTGGACCCGGGAGGTGGAGGTT

ALUY        GCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCA
CONTIG-1P1  GCAGTGAGCTGAGATCACGCCACTGCACTCCAGCCTGGGTGA-AAA--GAGACTCCGCCTCA

Containing 239 matches, 3 gaps, and 29 mismatches, including 19 transitions
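
Summary counts like those reported for the alignment above can be derived from any aligned pair. The sketch below is an illustration, not CENSOR's own procedure: treating each run of gap characters as a single gap is an assumption, and the two sequences are toy data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def alignment_stats(ref: str, query: str) -> dict:
    """Count matches, mismatches, transitions, and gap runs in an aligned pair."""
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")
    stats = {"matches": 0, "mismatches": 0, "transitions": 0, "gaps": 0}
    in_gap = False
    for r, q in zip(ref.upper(), query.upper()):
        if r == "-" or q == "-":
            if not in_gap:          # a run of gap columns counts once
                stats["gaps"] += 1
            in_gap = True
            continue
        in_gap = False
        if r == q:
            stats["matches"] += 1
        else:
            stats["mismatches"] += 1
            # A<->G and C<->T are transitions; every other mismatch is a transversion
            if {r, q} <= PURINES or {r, q} <= PYRIMIDINES:
                stats["transitions"] += 1
    return stats

# Toy aligned pair (not data from the presentation)
stats = alignment_stats("ACGTACGTAC", "ACATAC-TCC")
```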
8
Genetic Programming
  • Equations built like parse trees
  • Operator, input, and constant nodes
  • (I6 35) 2.87

[Figure: parse tree for the example, with input node I6 and constant nodes 35 and 2.87]
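
An equation built as a parse tree of operator, input, and constant nodes can be evaluated recursively. In the sketch below the class names are hypothetical, and because the operator symbols in the slide's example did not survive, addition and multiplication are used purely as placeholders.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Sequence, Union

@dataclass
class Const:
    value: float

@dataclass
class Input:
    index: int  # Input(6) reads feature I6

@dataclass
class Op:
    fn: Callable[[float, float], float]
    left: "Node"
    right: "Node"

Node = Union[Const, Input, Op]

def evaluate(node: Node, features: Sequence[float]) -> float:
    """Evaluate a parse tree against one example's feature vector."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Input):
        return features[node.index]
    return node.fn(evaluate(node.left, features), evaluate(node.right, features))

# Placeholder operators: (I6 + 35) * 2.87 is an assumed reading of the example.
tree = Op(operator.mul, Op(operator.add, Input(6), Const(35)), Const(2.87))
```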
9
GP Reproduction
[Figure: Parent 1 and Parent 2 recombine to produce Child 1 and Child 2; node labels shown: I2, I4, 2.87, I6, 35]
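
Reproduction of this kind is commonly implemented as subtree crossover: a subtree is picked in each parent and the two are swapped. The sketch below assumes that reading. Trees are nested tuples of the form (operator, left, right), and the operator names op_a, op_b, and op_c are hypothetical stand-ins.

```python
import random

def all_paths(tree, prefix=()):
    """Yield the index path of every node in a nested-tuple tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_paths(child, prefix + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(p1, p2, path1, path2):
    """Swap the subtrees at path1 in p1 and path2 in p2."""
    s1, s2 = get_subtree(p1, path1), get_subtree(p2, path2)
    return replace_subtree(p1, path1, s2), replace_subtree(p2, path2, s1)

parent1 = ("op_a", ("op_b", "I6", 35), 2.87)
parent2 = ("op_c", "I2", "I4")
child1, child2 = crossover(parent1, parent2, path1=(1,), path2=(1,))

# In a GP run the crossover points would be drawn at random, for example:
rng = random.Random(0)
random_point = rng.choice(list(all_paths(parent1)))
```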
10
Operator Node Set
  • Min: returns the minimum of two nodes or subtrees
  • Max: returns the maximum of two nodes or subtrees
  • Cos: if the connected nodes are x and y, it returns x Cos(y)
  • Sin: if the connected nodes are x and y, it returns x Sin(y)
  • Ave: if the connected nodes are x and y, it returns (x + y)/2
  • Log: if the connected nodes are x and y, it returns x Log(y)
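
The operator set can be written as a table of two-argument functions. Several details here are assumptions: "x Cos(y)" is read as x times cos(y) (and likewise for Sin and Log), the logarithm is taken as natural, and non-positive arguments to Log return 0 so that evaluation never fails.

```python
import math

def safe_log(y: float) -> float:
    """Natural log with an assumed guard: non-positive input yields 0.0."""
    return math.log(y) if y > 0 else 0.0

# Each operator takes the values x and y of its two connected nodes.
OPERATORS = {
    "Min": min,
    "Max": max,
    "Cos": lambda x, y: x * math.cos(y),   # assumed product
    "Sin": lambda x, y: x * math.sin(y),   # assumed product
    "Ave": lambda x, y: (x + y) / 2,
    "Log": lambda x, y: x * safe_log(y),   # assumed product, natural log
}
```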
11
Mask Operator Nodes
  • All features, mutable binary mask
  • Summation: f_i m_i
  • Multiplication: f_i m_i
  • SumSquareRoot: f_i m_i
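
The exact formulas for the mask operators are not preserved in the transcript, so the following is one possible reading only: each operator combines the features f_i selected by a binary mask m_i. Taking Multiplication as the product of the selected features and SumSquareRoot as the sum of their square roots are both assumptions.

```python
import math

def masked_sum(features, mask):
    """Sum of f_i * m_i over all features."""
    return sum(f * m for f, m in zip(features, mask))

def masked_product(features, mask):
    """Product of the features whose mask bit is 1 (assumed reading)."""
    return math.prod(f for f, m in zip(features, mask) if m)

def masked_sum_sqrt(features, mask):
    """Sum of square roots of the selected features (assumed reading)."""
    return sum(math.sqrt(f) for f, m in zip(features, mask) if m)

# Toy features and mask; in a GP run the mask bits would be open to mutation.
features = [4.0, 9.0, 16.0]
mask = [1, 0, 1]
```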

12
Initial GP Results
  • Classifying C -> G
  • Fitness: average absolute error
  • Classification rate: 46%
  • Average absolute error: 0.75
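
The fitness measure named above, average absolute error, is the mean of |predicted - actual| over the examples. A minimal sketch with toy numbers (the function name is an assumption):

```python
def average_absolute_error(predicted, actual):
    """Mean of |p - a| over paired predictions and true substitution counts."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Toy values, not results from the presentation
error = average_absolute_error([1.0, 2.5, 4.0], [1.0, 2.0, 6.0])
```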

13
Theta Factor Offset
  • Average absolute error: 0.75
  • If error < 0.5, correct classification
  • Subtract theta from solution to get correct solution
  • Using theta = 0.30, classification rates jumped to 66%
  • Linear search for best theta in [0, 1] in 0.01 increments
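
The theta offset and its linear search can be sketched as follows. A prediction counts as correct when its error after subtracting theta is below 0.5, and theta is scanned from 0 to 1 in steps of 0.01. The function names, the tie-breaking (first best value wins), and the toy data are assumptions.

```python
def classification_rate(predicted, actual, theta=0.0):
    """Fraction of examples with |(p - theta) - a| < 0.5."""
    correct = sum(1 for p, a in zip(predicted, actual)
                  if abs((p - theta) - a) < 0.5)
    return correct / len(actual)

def best_theta(predicted, actual, step=0.01, upper=1.0):
    """Linear search over theta = 0, step, 2*step, ..., upper."""
    n_steps = round(upper / step)
    candidates = [i * step for i in range(n_steps + 1)]
    # max() keeps the first candidate that reaches the highest rate
    return max(candidates,
               key=lambda t: classification_rate(predicted, actual, t))

# Toy data in which every prediction overshoots by 0.75
predicted = [1.75, 2.75, 0.75, 3.75]
actual = [1, 2, 0, 3]
theta = best_theta(predicted, actual)
```

On this toy data no example is classified correctly without the offset, while every example is classified correctly at the theta found by the search.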

14
Initial Results
[Table: classification rates indexed by Progenitor Sequence base and Alu base; entries are (training classification, test classification)]
15
Regional GC Analysis
  • All experiments redone using only flanking GC content
  • No ALU information used

16
Regional GC Analysis Results
  • Features: regional GC content

[Table: classification rates indexed by Progenitor Sequence base and Alu base; previous classification rates in parentheses]
17
Context Analysis: Masking CpGs
  • Substitution rates from Cs and Gs redone
  • All CpGs were masked
  • Removed some independent mutation factors
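
Masking CpGs before counting Cs and Gs can be sketched as below. Replacing each CG dinucleotide with NN is an assumption about what "masked" means here, and the function names are hypothetical.

```python
def mask_cpgs(seq: str, mask_char: str = "N") -> str:
    """Replace every CG dinucleotide with two mask characters."""
    return seq.upper().replace("CG", mask_char * 2)

def unmasked_c_g_counts(seq: str) -> tuple:
    """Counts of C and G that are not part of a CpG."""
    masked = mask_cpgs(seq)
    return masked.count("C"), masked.count("G")

# Toy sequence (not data from the presentation)
masked = mask_cpgs("ACGTCGGA")
```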

18
Masked CpG Results
  • Features: flanking GC with masked CpGs

[Table: classification rates indexed by Progenitor Sequence base and Alu base; previous classification rates in parentheses]
19
Conclusions
  • Successfully predicted substitution rates
  • Regional GC Content holds needed information
  • 10 / 12 rates > 80%
  • 6 / 12 rates > 90%
  • Future work: classifying entire genome

20
Acknowledgements
  • Dr. Dan Krane
  • Dr. Travis Doom
  • Dr. Michael Raymer
  • Dr. Mateen Rizki

Bioinformatics Research Group: http://birg.cs.wright.edu
This work was supported in part by the National
Science Foundation (grant EIA-0122582), and by
the Dayton Area Graduate Studies Institute.