Evaluation of the Haplotype Motif Model using the Principle of Minimum Description

1
Evaluation of the Haplotype Motif Model using the
Principle of Minimum Description
  • Srinath Sridhar, Kedar Dhamdhere,
  • Guy E. Blelloch, R. Ravi and Russell Schwartz
  • Computer Science Dept, Tepper School of Business
    and Dept of Biological Sciences, Carnegie Mellon
    University

2
Motivation
  • Individual characteristics depend on genetic
    factors
  • Model the structure of correlated genetic
    variation (haplotypes) in the DNA
  • Extract haplotype patterns (motifs) from the
    model
  • Perform association studies comparing motifs to
    susceptibility to diseases

3
Single Nucleotide Polymorphism (SNP)
  • Rows: individual samples; columns: nucleotides
  • ACCTGTATACGTA 0000000000000
  • ACATGTAGACGGA 0010000100010
  • ACCTGTAGACGGA 0000000100010
  • ACATGTATACGTA 0010000000000
  • ACATGTATACGTA 0010000000000
  • ACCTGTAGACGGA 0000000100010
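
The 0/1 strings on this slide can be reproduced by marking, column by column, where each sample differs from the first row. A minimal sketch, assuming the first sequence serves as the reference; the function name `to_snp_matrix` is illustrative and not from the talk:

```python
def to_snp_matrix(sequences):
    """Encode each sequence as a 0/1 string: 1 where it differs from the first row."""
    reference = sequences[0]
    return [
        "".join("0" if base == ref else "1" for base, ref in zip(seq, reference))
        for seq in sequences
    ]


# The four distinct sample rows shown on the slide.
samples = [
    "ACCTGTATACGTA",
    "ACATGTAGACGGA",
    "ACCTGTAGACGGA",
    "ACATGTATACGTA",
]
for seq, bits in zip(samples, to_snp_matrix(samples)):
    print(seq, bits)
```

Columns that are 0 in every row carry no variation, so only the SNP columns need to be kept for the later models.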

4
Related Article
5
Evolution
  • Two types of events: mutation and recombination
  • Mutation (one strand of one chromosome shown)
  • ACGTACCGTATATA → ACGTACTGTATATA
  • Recombination (one strand of two homologous
    chromosomes shown)
  • ACGTACCGTATATA → ACGTACCGTACGTA
  • GTACTACGTACGTA → GTACTACGTATATA
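
Both events can be replayed on the slide's own sequences. A small sketch: the helper names `mutate` and `recombine` are hypothetical, and the mutation position (seventh base) and crossover point (after the tenth base) are inferred from the sequences shown rather than stated in the talk:

```python
def mutate(seq, pos, base):
    """Point mutation: replace the base at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def recombine(a, b, crossover):
    """Single crossover: the two homologous sequences swap tails."""
    return a[:crossover] + b[crossover:], b[:crossover] + a[crossover:]


mutated = mutate("ACGTACCGTATATA", 6, "T")
child1, child2 = recombine("ACGTACCGTATATA", "GTACTACGTACGTA", 10)
print(mutated)
print(child1, child2)
```

A mutation changes one column, whereas a recombination splices together segments of existing sequences, which is why shared segments are a natural unit for modeling.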

6
Recombination
[Figure: Ancestral Sequences; Current Population]

7
Comparison of blocks and motifs
Blocks (Daly et al., 2000)
Motifs (Schwartz, 2003)
8
Minimum Description Length (MDL)
  • Let
  • M represent the parameters of the model
  • I represent the input matrix
  • E be the explanation of I using M
  • L be the length of encoding
  • Objective
  • Minimize L(M) + L(E(I) | M)
  • Complicated models are penalized
  • Prevents over-fitting
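
To see how the two terms trade off, here is a toy two-part code. It is only a sketch under stated assumptions: rows are cut into fixed-width chunks, L(M) is taken to be one bit per column of every distinct chunk pattern, and L(E(I) | M) is the Shannon code length of each row's pattern choices. This encoding and the name `description_length` are illustrative and are not the encoding used in the talk.

```python
import math
from collections import Counter


def description_length(rows, width):
    """Two-part code length in bits when rows are cut into chunks of `width` columns."""
    n = len(rows[0])
    model_bits = 0.0
    explanation_bits = 0.0
    for start in range(0, n, width):
        chunks = Counter(row[start:start + width] for row in rows)
        # L(M): spell out each distinct pattern, one bit per column.
        model_bits += sum(len(pattern) for pattern in chunks)
        # L(E(I) | M): -log2 of the empirical frequency of each row's pattern.
        total = sum(chunks.values())
        explanation_bits += sum(-c * math.log2(c / total) for c in chunks.values())
    return model_bits + explanation_bits


rows = ["000000", "000000", "111111", "111111", "000111", "000111"]
for width in (1, 3, 6):
    print(width, round(description_length(rows, width), 2))
```

On this toy input the intermediate width gives the shortest total: width 6 pays for a large model, width 1 pays for long explanations.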

9
Dynamic Program - Blocks
  • Dynamic program (Koivisto et al., 2003), where
    C(j+1, i) is the cost of creating a single block
    spanning columns j+1 to i
  • Running time: O(n^2)
  • Work space: O(n)
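
A minimal sketch of a block dynamic program, assuming the usual segmentation recurrence best(i) = min over j < i of best(j) + C(j+1, i), with best(0) = 0. This form is an assumption consistent with the definition of C above, not a transcription of Koivisto et al.; the function name `best_blocks` and the toy cost are hypothetical.

```python
def best_blocks(n, cost):
    """Minimum-cost partition of columns 1..n into consecutive blocks.

    cost(j, i) is a caller-supplied cost of a single block spanning columns j..i.
    """
    best = [0.0] + [float("inf")] * n  # best[i]: cheapest cover of columns 1..i
    prev = [0] * (n + 1)               # prev[i]: boundary j that achieves best[i]
    for i in range(1, n + 1):
        for j in range(i):
            candidate = best[j] + cost(j + 1, i)
            if candidate < best[i]:
                best[i], prev[i] = candidate, j
    # Walk the stored boundaries backwards to recover the blocks.
    blocks, i = [], n
    while i > 0:
        blocks.append((prev[i] + 1, i))
        i = prev[i]
    return best[n], blocks[::-1]


# Toy cost, purely illustrative: fixed overhead per block plus a term
# that grows with block length.
total, blocks = best_blocks(4, lambda j, i: 3 + (i - j + 1) ** 2)
print(total, blocks)
```

The double loop makes O(n^2) calls to the cost function and keeps two arrays of length n+1, which matches the stated bounds when each cost lookup takes constant time.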

[Figure labels: best, i, single block]
10
Expectation-Maximization Algorithm - Motifs
  • Create a DAG of all possible motifs with a
    start vertex
  • Initialize probabilities
  • For each EM iteration:
      • For each row r in R: in the sub-graph
        corresponding to r, find the ML
        (maximum-likelihood) path from start
      • Re-normalize probabilities based on the number
        of times the vertices were used in ML paths
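
The loop above can be sketched as a hard-assignment EM. This is an illustration under several assumptions that are not from the talk: candidate motifs are all substrings up to `max_len` seen in the data, the DAG is implicit in the column positions, and probabilities are normalized among motifs that share a start column. The names `ml_path` and `hard_em` are hypothetical.

```python
import math
from collections import defaultdict


def ml_path(row, probs, max_len):
    """Most likely way to tile `row` with motifs (Viterbi-style DP over columns)."""
    n = len(row)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = probs.get((start, row[start:end]), 0.0)
            if p > 0 and best[start] + math.log(p) > best[end]:
                best[end] = best[start] + math.log(p)
                back[end] = start
    path, end = [], n
    while end > 0:
        path.append((back[end], row[back[end]:end]))
        end = back[end]
    return path[::-1]


def hard_em(rows, max_len=3, iterations=5):
    # Candidate motifs: every substring of length <= max_len seen in some row.
    by_start = defaultdict(set)
    for row in rows:
        for s in range(len(row)):
            for e in range(s + 1, min(len(row), s + max_len) + 1):
                by_start[s].add(row[s:e])
    # Initialize uniformly among motifs that share a start column.
    probs = {(s, m): 1.0 / len(ms) for s, ms in by_start.items() for m in ms}
    for _ in range(iterations):
        counts = defaultdict(int)
        for row in rows:
            for key in ml_path(row, probs, max_len):
                counts[key] += 1
        totals = defaultdict(int)
        for (s, _motif), c in counts.items():
            totals[s] += c
        # Re-normalize: usage counts become probabilities per start column.
        probs = {key: c / totals[key[0]] for key, c in counts.items()}
    return probs


model = hard_em(["0000", "0000", "1111", "1111"], max_len=4)
print(model)
```

Unused motifs drop to probability zero after the first pass in this sketch; every row's previous path keeps positive probability, so a path always exists.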

11
Example
[Figure: worked example, all counts initialized to 0]
12
Example
[Figure: worked example, counts incremented along ML paths]
13
Example
[Figure: worked example, counts incremented along ML paths]
14
Example
[Figure: worked example, accumulated counts]
15
Example - Re-normalize
[Figure: worked example, counts re-normalized into probabilities such as 0.33, 0.33, 0.17, 0.17, 1.0 and 0.0]
16
Heuristics
  • EM finds P1 and P2, but cost(P1) + cost(P2) ≠
    cost(P1 ∪ P2)
  • Use knowledge from previous EM iteration
  • Multiple shortest paths with weight
    (1+ε)^(-cost(P))
  • Addition of small constants to prevent zero
    probability in first few iterations
  • Initialize the probabilities to favor smaller
    motifs
  • Restrict maximum length of motifs

17
Experimental Results
Simulated data using the ms program (Hudson, 2002)

Num Seqs / Recomb Rate   Num SNPs   Desc Length, Motifs (bits)   Desc Length, Blocks (bits)
100 / low                229.00     4528.26                      5029.53
200 / low                241.00     6600.21                      8528.48
300 / low                248.33     9944.45                      13341.84
400 / low                219.00     12735.52                     18270.20
100 / high               202.33     5787.31                      6025.29
200 / high               250.33     8852.31                      10511.31
300 / high               209.67     11921.38                     15418.65
400 / high               211.00     19626.85                     26233.36
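
One way to read the table is as the fraction of bits the motif model saves relative to blocks. The short script below does that arithmetic on the figures copied from the table; the helper name `savings` is illustrative.

```python
# (number of sequences, recombination rate, motif bits, block bits), from the table.
results = [
    (100, "low", 4528.26, 5029.53),
    (200, "low", 6600.21, 8528.48),
    (300, "low", 9944.45, 13341.84),
    (400, "low", 12735.52, 18270.20),
    (100, "high", 5787.31, 6025.29),
    (200, "high", 8852.31, 10511.31),
    (300, "high", 11921.38, 15418.65),
    (400, "high", 19626.85, 26233.36),
]


def savings(motif_bits, block_bits):
    """Fraction of the block description length saved by the motif model."""
    return 1.0 - motif_bits / block_bits


for n, rate, motif_bits, block_bits in results:
    print(f"{n}/{rate}: motifs save {savings(motif_bits, block_bits):.1%}")
```

Within each recombination regime the saving grows with the number of sequences, from roughly 10% to roughly 30% at low recombination and from roughly 4% to roughly 25% at high recombination.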
18
Experimental Results
19
Experimental Results
[Figure panels: Motifs, high recombination; Motifs, low recombination; Blocks, high recombination; Blocks, low recombination]
20
Conclusion
  • Characterized the problem of inferring haplotype
    structure as an optimization problem that is
    robust against over-fitting
  • Haplotype motif model better captures the
    structure than haplotype blocks
  • Furthermore, motif method performs progressively
    better with larger input size

21
Discussion and Future Work
  • Extensions
      • Polynomial time algorithm/NP-hardness
      • Clustering and error models
      • Real data: recombination hot-spots
  • Future directions

[Figure labels: Genotype data; phasing; Haplotype data; current work; Motifs/Blocks/?; htSNP, association tests, ?; Disease analysis/Drug design; direct optimization]
22
Encoding Motifs
  • Let s_i be the start locations of motifs
  • Let t_{i,j} be the number of motifs that start at i
    and end at j
  • Let E_i = {e_{i,1}, ..., e_{i,k}} be the ordered set of
    end locations for motifs that start at i
  • Cost for encoding model
  • Additional cost for encoding motif probabilities

23
Explanation
  • Explanation of a row: specify the ordered set of
    block haplotypes that produce the bits of the row
  • Cost for explanation of row r
  • Cost for explanation
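
A hedged sketch of how such costs could be computed, assuming each choice on a row's path is charged its ideal Shannon code length, -log2 of its probability. This formula, the names `row_cost` and `explanation_cost`, and the toy probabilities are assumptions for illustration, not taken from the talk.

```python
import math


def row_cost(path, probs):
    """Bits needed to name each pattern on one row's path."""
    return sum(-math.log2(probs[pattern]) for pattern in path)


def explanation_cost(paths, probs):
    """Total explanation cost: the sum of the per-row costs."""
    return sum(row_cost(path, probs) for path in paths)


# Toy probabilities and one path per row, purely illustrative.
probs = {"00": 0.5, "11": 0.25, "01": 0.25}
paths = [["00", "11"], ["00", "00"], ["01", "11"]]
total_bits = explanation_cost(paths, probs)
print(total_bits)
```

Frequent patterns are cheap to name and rare ones are expensive, so rows built from common patterns have short explanations.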

24
Human Genetic Structure
  • Chromosomes in the nucleus of cells
  • 23 pairs of chromosomes
  • Double helix structure of chromosomes
  • Chromosomes: genes and inter-genic regions
  • Genes: encode proteins

25
Single Nucleotide Polymorphism (SNP)
  • Human genomes are very similar
  • SNP: a single base with a high probability of
    variation
  • Bi-allelic: only two of the four possible
    nucleotides occur
  • In humans, reduction in size by a factor of about 300

26
Encoding Blocks
  • Let
  • s_i represent the start columns of blocks
  • t_i represent the number of blocks starting at s_i
  • Cost of encoding model
  • Additionally, encoding of probabilities for
    block haplotypes

27
Encoding Blocks
  • Explanation of a row: specify the ordered set of
    block haplotypes that produce the bits of the row
  • Cost for explanation of row r
  • Cost for explanation

28
DNA
  • Building blocks (nucleotides): Adenine (A),
    Cytosine (C), Guanine (G) and Thymine (T)
  • Adenine (A) pairs with Thymine (T); Cytosine (C)
    pairs with Guanine (G)

29
Haplotypes
  • Contiguous regions of correlated genetic
    variation
  • Two models: blocks and motifs
  • Blocks
      • Popular and widely assumed (Daly et al., 2000)
      • Boundary-aligned block haplotypes
  • Motifs
      • Recently introduced (Schwartz, 2003)
      • Overlapping haplotype motifs

30
Comparison of Blocks and Motifs
Two models: haplotype blocks (Daly et al., 2000)
and haplotype motifs (Schwartz, 2003)
31
Recent Article dogs helping humans