Informative SNP Selection Based on Multiple Linear Regression


Slides: 19
Provided by: Gan999
Learn more at: https://www.cs.gsu.edu

Transcript and Presenter's Notes

Title: Informative SNP Selection Based on Multiple Linear Regression


1
Informative SNP Selection Based on Multiple
Linear Regression
  • Jingwu He
  • Alex Zelikovsky

2
Outline
  • SNPs, haplotypes, and genotypes
  • Tagging problem formulation
  • Tagging based on multiple linear regression
  • Experimental results

3
Human Genome
  • Length of human genome (DNA)
  • ≈ 3 billion base pairs, each A, C, G, or T.
  • Our DNA is similar.

99.9% of DNA is common.
4
SNPs
  • Genome difference between any two people ≈ 0.1%
    of genome
  • These differences are Single Nucleotide
    Polymorphisms (SNPs).
  • Total number of SNPs in human genome ≈ 10^7

[Figure: four aligned DNA sequences with three SNP positions marked]
5
Haplotypes and Genotypes
  • Humans are diploid organisms: two different copies
    of each chromosome, one from the mother, one from
    the father.

[Figure: the two chromosome copies of individual A and of individual B, with the SNP alleles of each copy highlighted]
  • Since individuals differ in SNPs, we keep only
    SNPs.
  • Haplotype: SNPs in a single copy of a
    chromosome
  • Genotype: a pair of haplotypes

6
Cause of Variation: Mutations and Recombinations
  • Mutation: one nucleotide is replaced with another (G -> A)
  • Recombination: one chromatid recombines with another
7
Encoding
  • SNPs are generally bi-allelic
  • Only two alleles in a single SNP: wild type and
    mutation
  • 0 stands for wild type, 1 stands for mutation

[Figure: examples of homozygous and heterozygous SNP sites]
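
The encoding above can be illustrated with a small sketch that combines two 0/1 haplotypes into one genotype vector. The slides do not spell out the genotype code. Using 2 for a heterozygous site is an assumption here, suggested only by the values (0, 1, 2) that appear on the later tagging slides, and the function name is hypothetical.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into one genotype vector.

    Assumed code (not stated in the slides):
      0 = homozygous wild type, 1 = homozygous mutation, 2 = heterozygous.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    genotype = []
    for a, b in zip(h1, h2):
        if a == b:
            genotype.append(a)  # homozygous: both copies carry the same allele
        else:
            genotype.append(2)  # heterozygous: the two copies differ
    return genotype


# Toy data: one chromosome copy from each parent, four SNPs.
maternal = [0, 1, 1, 0]
paternal = [0, 1, 0, 1]
genotype = genotype_from_haplotypes(maternal, paternal)
```

Note that the genotype no longer records which parent contributed which allele at a heterozygous site.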
8
Outline
  • SNPs, haplotypes, and genotypes
  • Tagging problem formulation
  • Tagging based on multiple linear regression
  • Experimental results

9
Tagging Motivation
  • Decrease SNP genotyping cost and data analysis
  • Many SNPs are linked (strongly correlated)
  • Genotype only informative SNPs (tag SNPs); other
    SNPs are inferred from tag SNPs
  • Perform data analysis only on tag SNPs.
  • Cost-saving ratio: m/k (m total SNPs, k tag SNPs)

Use only tag SNPs to infer non-tag SNPs
10
Tagging Problem
Step 1: Find tags (SNP positions) in sample
Find tags (0, 1, 2)
Step 2: Reconstruct complete haplotype
  • Problem formulation
      • Given the full pattern of all SNPs in a sample,
      • find the minimum number of tag SNPs that will
        allow the reconstruction of the complete
        haplotype for each individual.
  • Computation methods
      • Tag Selection Algorithm
      • SNP Prediction Algorithm

11
Tagging Methods
  • HapBlock (K. Zhang, M.S. Waterman, et al.)
      • Greedy algorithm for tag selection
      • Majority voting for prediction
  • V. Bafna, B.V. Halldórsson, et al.
      • Graph algorithm for tag selection
      • Majority voting for prediction
  • STAMPA (E. Halperin and R. Shamir)
      • Dynamic programming for tag selection
      • Majority voting for prediction
  • ...
  • Tagging based on Multiple Linear Regression
      • Greedy selection
      • Multiple linear regression for prediction

12
SNP Prediction Algorithm
Predicting
13
Tag Selection based on Prediction
  • Choose the optimal k tags
  • The problem is NP-hard; exhaustive search must
    consider (m choose k) tag sets
  • (m: total number of SNPs, k: number of tags)
  • Use the Stepwise (greedy) Tag Selection Algorithm
    (STA) to reduce cost and time
  • STA starts with the best tag t0, i.e., the tag that
    minimizes the error when predicting all other
    SNPs with A_k.
  • Then STA finds the tag t1 that is the best
    extension of t0, and continues adding the best
    tags until the tag set reaches the given
    size k.
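
The greedy procedure on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes genotypes are stored as a NumPy matrix with values 0, 1, 2 (rows are individuals, columns are SNPs). The error measure is a simplification chosen for the sketch: an in-sample least-squares fit rounded to the nearest value in 0..2. The names `prediction_error` and `stepwise_tag_selection` are hypothetical.

```python
import numpy as np


def prediction_error(G, tags):
    """Fraction of non-tag entries mispredicted from the tag columns.

    Simplified stand-in for the slide's prediction error: fit every
    non-tag SNP by least squares on the tag columns of the same sample,
    then round the fit to the nearest value in {0, 1, 2}.
    """
    rest = [j for j in range(G.shape[1]) if j not in tags]
    if not rest:
        return 0.0  # every SNP is a tag, so nothing is left to predict
    T = G[:, tags].astype(float)
    R = G[:, rest].astype(float)
    coef, *_ = np.linalg.lstsq(T, R, rcond=None)
    predicted = np.clip(np.rint(T @ coef), 0, 2)
    return float(np.mean(predicted != R))


def stepwise_tag_selection(G, k):
    """Greedy selection: repeatedly add the SNP that lowers the error most."""
    tags = []
    for _ in range(k):
        candidates = [j for j in range(G.shape[1]) if j not in tags]
        best = min(candidates, key=lambda j: prediction_error(G, tags + [j]))
        tags.append(best)
    return tags


# Toy sample: SNPs 0 and 1 are identical, and so are SNPs 2 and 3.
a = [0, 1, 2, 1, 0, 2]
b = [1, 1, 0, 2, 0, 1]
G = np.array([a, a, b, b]).T  # 6 individuals x 4 SNPs
tags = stepwise_tag_selection(G, 2)
```

Each greedy step evaluates at most m candidates, so selecting k tags costs about m·k error evaluations instead of the (m choose k) required by exhaustive search. On the toy sample the two selected tags come from different duplicate pairs, which is enough to recover the remaining two SNPs.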

14
Projection Method for SNP Prediction
[Figure: possible resolutions s0, s1, s2 of the unknown SNP (values 0, 1, 2), their projections onto span(T) of tags t1 and t2, and the distances d0, d1, d2]
  • Choose the resolution minimizing its distance d to
    the span of the tag space, span(T)
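
A minimal sketch of the projection idea, under assumptions not stated on the slide: the sample's tag columns and the SNP column are each extended by one entry for the new individual, every candidate value (0, 1, 2) is tried for the unknown entry, and the candidate whose extended SNP vector lies closest to the span of the extended tag columns wins. The helper names and the toy data are hypothetical, and the numeric encoding is taken as given.

```python
import numpy as np


def distance_to_span(T, s):
    """Euclidean distance from vector s to the column span of matrix T."""
    coef, *_ = np.linalg.lstsq(T, s, rcond=None)
    return float(np.linalg.norm(s - T @ coef))


def predict_snp(T_sample, s_sample, x_tags, resolutions=(0, 1, 2)):
    """Pick the resolution of the unknown SNP closest to span(T).

    T_sample: tag values of the sample (individuals x tags)
    s_sample: known values of the non-tag SNP in the sample
    x_tags:   tag values of the individual whose SNP is unknown
    """
    # Extend the tag matrix with the new individual's tag values.
    T = np.vstack([T_sample, x_tags]).astype(float)
    distances = {}
    for v in resolutions:
        # Extend the SNP column with the candidate value v.
        s_v = np.append(s_sample, v).astype(float)
        distances[v] = distance_to_span(T, s_v)
    best = min(distances, key=distances.get)
    return best, distances


# Toy sample: the non-tag SNP always equals the first tag.
T_sample = np.array([[0, 1], [1, 0], [2, 1], [1, 2], [0, 0]])
s_sample = T_sample[:, 0]
predicted, distances = predict_snp(T_sample, s_sample, [2, 1])
```

In this toy case the candidate equal to the individual's first tag value makes the extended SNP vector coincide with a tag column, so its distance to span(T) is essentially zero and it is chosen.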

15
Data Sets
  • Daly et al.
      • 616 kilobase region of human chromosome 5q31,
        genotyping 103 SNPs for 129 trios.
  • Seven ENCODE regions from HapMap.
      • Regions ENr123 and ENm010 from two populations: 45
        singles Han Chinese (HCB) and 44 singles
        Japanese (JPT).
      • Three regions (ENm013, ENr112, ENr113) from 30
        CEPH family trios obtained from HapMap
  • STAMPA (E. Halperin and R. Shamir)
      • Two gene regions, STEAP and TRPM8,
        genotyping 23 and 102 SNPs for 30 trios

16
Experimental Results
Directly to genotype data
17
Multivariate Linear Regression Tagging
  • Genotype tagging
  • Uses fewer tags (e.g., up to two times fewer tags
    to reach 90% prediction accuracy) than STAMPA (E.
    Halperin and R. Shamir, ISMB 2005 and
    Bioinformatics)
  • Statistical tagging
  • A linear combination of tags statistically covers
    non-tag SNPs
  • Traditional methods use a single tag to cover
    non-tag SNPs
  • Uses on average 30% fewer tags than IdSelect
    (C.S. Carlson et al. 2004) for statistically
    covering all SNPs.

18
Thank you
Any Questions?