Title: Detection of Natural Selection in the Human Genome by Hidden Markov Model
1. Detection of Natural Selection in the Human Genome by Hidden Markov Model
Capstone Project
- Presenter: Sang-Gook Han
- Advisor: Prof. Matthew Hahn
2. Contents
- Objective
- Background
- Problems and Motivation
- Algorithm
- Procedure
- Results
- Acknowledgement
3. Objective
Mutation
A mutation is not neutral if it affects function.
Natural selection can be divided broadly into two types:
- Positive selection favors a new mutation (allele).
- Negative selection disfavors a new mutation.
Speciation
Mutation causes DNA polymorphism.
4. Objective
Single Nucleotide Polymorphism (SNP)
Allele counts at the two variable sites: C(5)/G(7) and T(10)/G(2)
AAACTCATAGTCCGATTTCCCCGGGAACCCTA
AAACTCATAGTCCGATTTCCCCGGGAACCCTA
AAACTCATAGTCCGATTTCCCCGGGAACCCTA
AAACTCATAGTCCCATTTCCCCGGGAACCCTA
AAACTCATAGTCCCATTTCCCCGGGAACCCTA
AAACTCATAGTCCCATTTCCCCGGGAACCCTA
AAACTCATAGTCCGATTTCCCCGGGAACCCTA
AAACTCATAGTCCCATTTCCCCGGGAACCCGA
AAACTCATAGTCCCATTTCCCCGGGAACCCTA
AAACTCATAGTCCGATTTCCCCGGGAACCCTA
AAACTCATAGTCCGATTTCCCCGGGAACCCGA
AAACTCATAGTCCGATTTCCCCGGGAACCCTA
Polymorphism in a population provides a characteristic of the genome.
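The allele counts on this slide can be checked with a short sketch. It rebuilds the 12 aligned sequences from a shared base sequence plus the alleles at the two variable sites, then counts the bases in every column. The names `BASE`, `ALLELES`, `make_seq`, and `find_snps` are illustrative and not part of the original slides; positions are 0-based.

```python
from collections import Counter

# Shared sequence; only index 13 and index 30 vary among the 12 copies.
BASE = "AAACTCATAGTCCGATTTCCCCGGGAACCCTA"

# Alleles at (index 13, index 30) for the 12 sequences, in slide order.
ALLELES = ["GT", "GT", "GT", "CT", "CT", "CT",
           "GT", "CG", "CT", "GT", "GG", "GT"]


def make_seq(pair):
    """Return BASE with the two variable sites set to the given alleles."""
    seq = list(BASE)
    seq[13], seq[30] = pair[0], pair[1]
    return "".join(seq)


SEQUENCES = [make_seq(pair) for pair in ALLELES]


def find_snps(seqs):
    """Map each polymorphic column index to its base counts."""
    snps = {}
    for index, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) > 1:  # more than one base observed: a SNP
            snps[index] = dict(counts)
    return snps


print(find_snps(SEQUENCES))
```

Only two columns come out as polymorphic, with counts that match the C(5)/G(7) and T(10)/G(2) labels.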
5. Objective
From polymorphism:
- Understanding the types of selection can aid in understanding human evolution and the human genome.
- Infer evolutionary processes in different regions of the genome, e.g. genes evolving adaptively in humans.
6. Background: Derived Allele Frequency
[Figure: a C/G polymorphic site in which G is the derived allele (four Gs).]
Infer natural selection from the derived allele frequency.
7. Background: Allele-Freq. Spectrum
- A single frequency is not enough to infer the type of natural selection.
- So we need the allele-frequency spectrum (the frequencies of many alleles).
        Site 1  Site 2  Site 3  Site 4  Site 5  Site 6
IND 1     0       0       0       1       0       0
IND 2     0       0       1       1       0       0
IND 3     1       1       0       0       0       0
IND 4     0       0       0       0       0       1
IND 5     0       0       0       0       1       0
Negative Selection
1 = derived allele, 0 = ancestral allele
- 5 sites (sites 1, 2, 3, 5, 6) have only one derived allele
- 1 site (site 4) has two derived alleles
- 0 sites have three derived alleles
- 0 sites have four derived alleles
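The tally above can be reproduced with a minimal sketch that counts derived alleles per site and then counts how many sites fall in each class. The function name `allele_frequency_spectrum` and the matrix layout (rows are individuals, columns are sites) are assumptions made for illustration.

```python
# Rows are individuals (IND 1..5), columns are sites (Site 1..6).
# 1 = derived allele, 0 = ancestral allele, as in the table on this slide.
GENOTYPES = [
    [0, 0, 0, 1, 0, 0],  # IND 1
    [0, 0, 1, 1, 0, 0],  # IND 2
    [1, 1, 0, 0, 0, 0],  # IND 3
    [0, 0, 0, 0, 0, 1],  # IND 4
    [0, 0, 0, 0, 1, 0],  # IND 5
]


def allele_frequency_spectrum(genotypes):
    """Return a list whose entry i-1 is the number of sites
    carrying exactly i derived alleles, for i = 1 .. n-1."""
    n = len(genotypes)
    derived_counts = [sum(column) for column in zip(*genotypes)]
    return [derived_counts.count(i) for i in range(1, n)]


print(allele_frequency_spectrum(GENOTYPES))
```

For this table the spectrum is 5, 1, 0, 0: five singleton sites, one site with two derived alleles, and none with three or four.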
8. Background: Allele-Freq. Spectrum
[Figure: allele-frequency spectra. X axis: number of derived alleles. Y axis: number of sites.]
- Negative: low genetic variation
- Neutral: high genetic variation
- Positive: low genetic variation
The distribution of alleles enables us to infer natural selection.
9. Background: the PRF (Poisson Random Field) model
- Evolutionary theory gives predictions of allele frequencies when they are:
  - Neutral (no selection)
  - Positively selected
  - Negatively selected
γ = 0: Neutral; γ > 0: Positive; γ < 0: Negative
γ: natural selection intensity
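The slides do not give the formula behind these predictions. As a hedged illustration, the sketch below uses the commonly cited Poisson Random Field form for the expected number of sites with i derived alleles in a sample of n, with the selection intensity written as γ (`gamma` in the code) and the mutation rate as `theta`. The function name, the midpoint-rule integration, and the parameter values are assumptions, not the presenter's implementation.

```python
import math


def prf_expected_sfs(theta, gamma, n, steps=4000):
    """Expected number of sites with i = 1 .. n-1 derived alleles.

    Integrates theta * w(x) * C(n, i) * x**i * (1 - x)**(n - i) over (0, 1),
    where w(x) = (1 - exp(-2*gamma*(1 - x))) / ((1 - exp(-2*gamma)) * x * (1 - x)).
    As gamma -> 0, w(x) -> 1/x and the result is the neutral value theta / i.
    """
    expected = []
    for i in range(1, n):
        binom = math.comb(n, i)
        total = 0.0
        for k in range(steps):
            x = (k + 0.5) / steps  # midpoint rule avoids x = 0 and x = 1
            if abs(gamma) < 1e-8:
                weight = 1.0 / x  # neutral limit
            else:
                weight = (1.0 - math.exp(-2.0 * gamma * (1.0 - x))) / (
                    (1.0 - math.exp(-2.0 * gamma)) * x * (1.0 - x))
            total += weight * binom * x ** i * (1.0 - x) ** (n - i)
        expected.append(theta * total / steps)
    return expected


neutral = prf_expected_sfs(theta=1.0, gamma=0.0, n=10)
positive = prf_expected_sfs(theta=1.0, gamma=5.0, n=10)
negative = prf_expected_sfs(theta=1.0, gamma=-5.0, n=10)
```

Under these assumptions, a positive γ shifts the spectrum toward high-frequency derived alleles and a negative γ shifts it toward rare ones, relative to the neutral case.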
10. Problems of the PRF model
- Even though γ is estimated, the value is not exact: the estimate has variance.
- Thus, we cannot rely on the point estimate alone to determine positive or negative selection.
- Conduct a likelihood ratio test (LRT) against the null hypothesis of no selection.
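A likelihood ratio test of this kind can be sketched as follows. The sketch assumes the null model fixes the selection intensity at zero and the alternative frees it, so the statistic is compared with a chi-square distribution with 1 degree of freedom. The function name and the log-likelihood inputs are hypothetical; the slides do not state how the likelihoods were computed.

```python
import math

# 95% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_CRITICAL_05 = 3.841


def likelihood_ratio_test(loglik_neutral, loglik_selection):
    """Return (statistic, p_value) for nested models differing by one parameter."""
    stat = max(2.0 * (loglik_selection - loglik_neutral), 0.0)
    # For 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


stat, p = likelihood_ratio_test(loglik_neutral=-100.0, loglik_selection=-97.0)
reject_neutrality = stat > CHI2_1DF_CRITICAL_05
```

With the made-up log-likelihoods above, the statistic is 6.0, which exceeds the critical value, so neutrality would be rejected at the 5% level.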
11. Motivation
- Polymorphism in a genome can be described as a Markov random field.
- So, neighboring regions tend to be predicted as the same natural selection type.
- A Hidden Markov Model (HMM) can improve detection of natural selection because of the similarity among neighboring regions.
- The transition probabilities and emission function can reduce the effect of estimation variance on the determination of a selection type.
12. Algorithm
State  Name               γ      Criterion
SP     Strongly Positive  γ > 0  confidence interval of γ > 0
WP     Weakly Positive    γ > 0  confidence interval of γ includes both + and -
NE     Neutral                   LRT
WN     Weakly Negative    γ < 0  confidence interval of γ includes both + and -
SN     Strongly Negative  γ < 0  confidence interval of γ < 0
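The five states can be expressed as a small decision function. This is a sketch under one stated assumption: a window is called NE when the likelihood ratio test fails to reject neutrality, and otherwise the sign of the estimated selection intensity and its confidence interval decide among the other four states. The slides do not spell out the order of these checks, and `classify_window` and its parameters are hypothetical names.

```python
def classify_window(gamma_hat, ci_low, ci_high, lrt_rejects_neutral):
    """Assign one of SP, WP, NE, WN, SN to a window.

    gamma_hat: point estimate of the selection intensity
    ci_low, ci_high: bounds of its confidence interval
    lrt_rejects_neutral: True if the LRT rejects the no-selection null
    """
    if not lrt_rejects_neutral:
        return "NE"  # assumption: neutrality is decided by the LRT first
    if gamma_hat > 0:
        # Strong only if the whole interval lies above zero.
        return "SP" if ci_low > 0 else "WP"
    if gamma_hat < 0:
        # Strong only if the whole interval lies below zero.
        return "SN" if ci_high < 0 else "WN"
    return "NE"


print(classify_window(2.5, 0.4, 4.6, True))    # interval entirely positive
print(classify_window(-1.0, -2.5, 0.5, True))  # interval spans zero
```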
13. Procedure
Data: allele frequencies from unrelated parents in 60 Europeans (CEU), 60 Africans (YRI), and 45 Chinese (CHB).
Flowchart: derived allele -> allele-frequency spectrum -> γ estimation -> trained HMM -> natural selection
14. Procedure: HMM
γ estimation on a sliding window from the PRF model:
γ: 0.2 0.4 1 1.1 1.2 3 4 5 10 11 11 9 8 5 3 3 2 1 0.5 0.3 0.2 0.1 0 -0.1
(figure annotation: NE, γ = 1.1)
Build the emission function, P(γ | S), for each state and for each chromosome.
- Training
  - Input: emission function and estimated γs from the PRF model
  - Baum-Welch algorithm
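To illustrate how an HMM lets neighboring windows inform each other, the sketch below runs Viterbi decoding over a sequence of per-window selection-intensity estimates. It is not the Baum-Welch training step named on the slide: the Gaussian emission means, the standard deviation, and the transition probabilities are all hypothetical values chosen for the example, whereas in the project they would come from the trained model.

```python
import math

STATES = ["SN", "WN", "NE", "WP", "SP"]
# Hypothetical emission parameters: one Gaussian per state.
MEANS = {"SN": -6.0, "WN": -2.0, "NE": 0.0, "WP": 2.0, "SP": 6.0}
SD = 2.0
STAY = 0.9  # hypothetical probability of staying in the same state


def log_emission(gamma, state):
    """Log density of the estimate under the state's Gaussian."""
    z = (gamma - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2.0 * math.pi))


def log_transition(prev, cur):
    """Sticky transitions: STAY on the diagonal, the rest shared equally."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1.0 - STAY) / (len(STATES) - 1))


def independent_calls(gammas):
    """Label each window on its own, ignoring its neighbors."""
    return [max(STATES, key=lambda s: log_emission(g, s)) for g in gammas]


def viterbi(gammas):
    """Most probable state path for a non-empty list of estimates."""
    score = {s: math.log(1.0 / len(STATES)) + log_emission(gammas[0], s)
             for s in STATES}
    backpointers = []
    for g in gammas[1:]:
        new_score, pointer = {}, {}
        for cur in STATES:
            best = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best] + log_transition(best, cur)
                              + log_emission(g, cur))
            pointer[cur] = best
        score = new_score
        backpointers.append(pointer)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


noisy = [6.0, 6.0, 6.0, 0.0, 6.0, 6.0, 6.0]
print(independent_calls(noisy))
print(viterbi(noisy))
```

With these made-up parameters, the single low estimate in the middle is labeled NE when windows are called independently, but the decoded path keeps it in SP because switching states twice costs more than one unlikely emission.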
15. Results: HMM
[Figures: state transition probabilities; emission function]
16. Results: HMM
- A snapshot of a region on chromosome 13 of CEU, with annotation of natural selection before and after PRF-HMM
Population       changes/total   undetermined/changes
European (CEU)   3.80            5.94
African (YRI)    4.11            1.74
Chinese (CHB)    7.31            5.69
17. Results: Distribution of Natural Selection
A) Whole chromosome
B) Coding region
Ratio of B to A:
- Negative (SN and WN): B > A
- Positive (SP and WP): B < A
Coding regions are under negative selection.
18. Results: Distribution of Natural Selection
- One SNP can be shared by more than one protein.
Legend: 1 = not shared; 3 = shared by at least 3 proteins; 6 = shared by at least 6 proteins
19. Results: Interesting genes in Europeans
- CSN3: Kappa-casein
- ZDHHC12: Palmitoyltransferase
- MCL1: Myeloid leukemia cell differentiation protein
20. Results: Interesting genes in Africans
- TP53BP1: Tumor suppressor p53-binding protein
- GFER, MPV17, GNMT: Liver disease-related genes
21. Results: Interesting genes in Chinese
- SERF2: Gastric cancer-related protein
- ZMYND10: Lung cancer-related protein
22. Conclusion
- The PRF-HMM improves on the PRF model and overcomes its numerical problems.
- Differences in SP and SN among populations might be related to different environments.
- Negative: cSNP > chromosome. Positive: cSNP < chromosome. This implies that the human genome is under negative selection.
- The more proteins share a SNP, the larger the proportions of SN and WN and the smaller the proportions of SP and WP.
- Positive selection on a disease-related gene, in a specific population with low genetic diversity, might appear as a weakness of that population.
23. Acknowledgement
Thanks to Prof. Matthew Hahn, Prof. Haixu Tang, and the people in Hahn's lab for their advice on this project.
Thank you (merci, grazias, grazie, danke)