Title: Biostat 830
1Introduction
- Biostat 830
- Winter 2006
- Lecture 1
2Title
- SNPs, haplotypes and association studies
- Special topics on human genetics and population
genetics
3Who should take this class?
- For those who are or will be involved in association studies:
- Design.
- Conduct of the study.
- Data analysis.
- For those who are looking for interesting
research topics in statistics and genetics.
4Overall objective
- Connecting phenotype with genotype.
- Identify DNA sequence variations that cause specific traits.
5Brief history
- Genome-wide linkage analysis using DNA polymorphisms was first proposed by Botstein et al. (1980).
- Many disease-causing genes (mostly Mendelian) have been identified using positional cloning.
- Cystic fibrosis: Kerem et al. 1989, Riordan et al. 1989.
- Huntington's disease: Gusella et al. 1983.
6Limitation of linkage study
- Successful linkage studies need a near-perfect phenotype-genotype match.
- In complex diseases, misdiagnosis, heterogeneity, complex inheritance, and phenocopies are common.
- Linkage mapping limits resolution to the order of 1-10 cM. The dominant limit on resolution is the number of meioses in which crossovers might have occurred.
7Motivation
- Identify genetic variants that cause complex diseases.
- Association studies have emerged as a promising approach for this endeavor.
- Problems that need to be addressed:
- Population stratification, tagSNP selection, multiple testing, study design.
8Contents of the class I
- Lectures
- LD, haplotype block.
- TagSNP selection.
- HapMap project.
- Haplotype inference.
- Population structure.
- Coalescent theory.
- Multiple testing.
9Contents of the class II
- Lectures
- Association studies
- Genotyping assays: Affy 500K, Illumina 250K.
- Design strategy: power, case-control, two-stage.
- Quality control: HWE, genotyping errors.
- Data analysis: haplotype-based, multiple testing, gene-gene interaction, gene-environment interaction, adjustment for population stratification.
10Contents of the class III
- Paper discussion
- Two papers each time.
- Two groups.
- Each group makes up ten questions about the paper and challenges the other group to answer them.
11Contents of the class IV
- Student presentations
- Each student presents once, on an analysis performed on a real dataset using HapMap resources.
12Evaluation
- Class participation,
- Data analysis exercise/homework,
- Analysis project and presentation.
13Human Genome Variation
14Terminology
- Locus: the physical location of a gene.
- e.g., D1S80, D4S43, D16S126
- Allele: an alternative form of a gene.
- e.g., A, a; A, B, O
- Genotype: the observed alleles at a genetic locus for an individual.
- e.g., AA, Aa, aa; AA, AB, AO, BB, BO, OO
- Homozygous: AA, aa
- Heterozygous: Aa
- Phenotype: the expression of a particular genotype.
- Continuous: plasma LDL level, blood pressure
- Dichotomous: hyperlipidemia, hypertension
15DNA Polymorphism
- Restriction Fragment Length Polymorphism
- Tandem Repeats
- Satellites
- Minisatellites
- Microsatellites
- Single Nucleotide Polymorphism
- Single-base substitutions
- Single-base insertion/deletions
16RFLP
- Discovered in 1975
- Only two alleles: present or not present
- 10,000 in the genome
17RFLP
18Tandem repeats
- They include three subclasses
- Satellites.
- Minisatellites.
- Microsatellites.
- The name "satellites" comes from their optical spectra: buoyant density gradient centrifugation can separate DNA fragments with significantly different base compositions. The main band represents the bulk DNA, and the "satellite" bands originate from tandem repeats.
19Satellites
- Satellites
- The size of a satellite DNA ranges from 100 kb to over 1 Mb.
- Example: the alphoid DNA located at the centromere of all chromosomes. Its repeat unit is 171 bp.
20Minisatellite
- Large family of repetitive sequences (Jeffreys et al. 1985, Armour et al. 1992).
- Many of the original tandem repeat families could be purified from the rest of the genome as a satellite fraction of DNA.
- 9-24 bp monomer, total length 0.5-30 kb.
- Located in non-coding regions.
- Used for
- Paternity test
- Forensic identification
21VNTR
22
- D1S80
- 16 base pair repeats
- Non-coding region on chromosome 1
- Repeat number: 14-40
BS 50 Genetics and Genomics Spring 2001, Prof.
Dan Hartl
23Microsatellite
- Nucleotide repeat markers (Weber et al. 1989, Litt and Luty 1989).
- Short tandem repeats (STRs).
- Mostly located in introns and UTRs, some in coding regions.
- Used for
- DNA fingerprinting and DNA testing.
- Linkage analysis.
- Genetic and physical mapping of genes.
24Huntington's Disease
- On the short arm of chromosome 4.
- CAG repeats:
- Normal: 11-29 times.
- Disease: 40-80 times.
THIS LAND IS YOUR LAND words and music by Woody
Guthrie This land is your land, this land is
my land From California, to the New York Island
From the redwood forest, to the gulf stream
waters This land was made for you and me As I
was walking a ribbon of highway I saw above me
an endless skyway I saw below me a golden valley
This land was made for you and me
Woody Guthrie
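As a rough illustration of the repeat ranges on this slide, the sketch below counts the longest uninterrupted run of CAG units in a DNA string and compares it with the slide's normal (11-29) and disease (40-80) ranges. The function names and the toy sequence are hypothetical, and counts outside both ranges are reported as "unclassified" because the slide gives no rule for them.

```python
import re


def longest_cag_run(seq: str) -> int:
    """Return the largest number of consecutive CAG units found in seq."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_repeat_count(n: int) -> str:
    """Label a repeat count using only the two ranges given on the slide."""
    if 11 <= n <= 29:
        return "normal"
    if 40 <= n <= 80:
        return "disease"
    return "unclassified"  # the slide states no rule for other counts


# Toy sequence (made up for illustration): 20 CAG units with short flanks.
toy_sequence = "ATG" + "CAG" * 20 + "CCGCCA"
count = longest_cag_run(toy_sequence)
print(count, classify_repeat_count(count))
```

This is only a string-matching toy; a real assay measures repeat length experimentally.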
25SNP
26What is a SNP?
27SNP Key Concepts
- Definition: more than one alternative base occurs at an appreciable frequency.
- Availability: over 10 million SNPs have been identified in the human genome (dbSNP Build 125).
- Function: most SNPs are neutral, and less than 1% are present in protein-coding regions.
28SNP
- The most common genetic polymorphism
- Distributed throughout the genome at high density.
- More stable and easier to assay.
- Major cause of genetic diversity among different (normal) individuals, e.g. drug response, disease susceptibility.
- Facilitate large-scale genetic association studies as genetic markers.
29Total Number of SNPs in Phase II HapMap
Total: 5,894,684.
30SNP
- Most SNPs neither change protein synthesis nor cause disease directly. Rather, they serve as landmarks, since they may be physically close to the mutation site on the chromosome. Because of this proximity, SNPs may be shared among groups of people with common characteristics.
- Analyzing SNP patterns among different groups of people may shed light on human evolution and help in understanding ethnic groups and races.
31SNP Types
32SNP Locations
[Figure: a gene with exons 1-4 and introns 1-3, with 5' and 3' ends labeled. DNA is transcribed to pre-mRNA, spliced to mature mRNA (ORF with poly-A tail; AUG - B1...Bn - STOP), and translated to a protein sequence and protein 3D structure, leading to a phenotype change (e.g. asthma).]
33Haplotype
- Definition: an ordered list of alleles at multiple linked loci on a single chromosome.
34Genotype vs Haplotype
Single locus:
- Homozygous wild type: AA
- Homozygous mutant: aa
- Heterozygous: Aa
Multiple loci (loci 1, 2, 3):
- AA BB CC
- aa bb cc
- Aa Bb Cc
Haplotypes: ABC, ABc, AbC, Abc, aBC, aBc, abC, abc
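The eight haplotypes listed on this slide are all combinations of one allele per locus, so k biallelic loci give 2^k possible haplotypes. A minimal sketch of that enumeration follows; the function name and the input layout are assumptions made for illustration.

```python
from itertools import product


def all_haplotypes(loci):
    """Enumerate every haplotype for a list of loci.

    loci is a list of allele tuples, one tuple per locus, in chromosome order.
    """
    return ["".join(alleles) for alleles in product(*loci)]


# Three biallelic loci, as on the slide.
loci = [("A", "a"), ("B", "b"), ("C", "c")]
haplotypes = all_haplotypes(loci)
print(len(haplotypes), haplotypes)
# 8 ['ABC', 'ABc', 'AbC', 'Abc', 'aBC', 'aBc', 'abC', 'abc']
```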
35Haplotypes vs. SNPs
- ADVANTAGES
- Haplotypes are more informative
- Haplotypes may enhance the power for LD analysis
- Haplotypes can be used to study the evolutionary relationship of SNPs
- DISADVANTAGE
- May not be completely resolved in the absence of family data or experimentation
36Haplotype phasing
- To determine the haplotypes from genotypes
containing tightly linked SNPs from a set of n
individuals
Subject 1: AA BB cc
Subject 2: Aa BB cc
Subject 3: AA Bb Cc
Subject 4: aa BB Cc
Subject 5: Aa Bb CC
...
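To see why phasing is ambiguous, the sketch below lists every unordered pair of haplotypes that is consistent with one multi-locus genotype. A genotype with m heterozygous loci (m >= 1) has 2^(m-1) such pairs, so subjects 3 and 5 above each have two possible phasings while subjects 1, 2 and 4 have one. This is a brute-force illustration with hypothetical names, not a phasing algorithm; statistical methods choose among the candidate pairs using the whole sample.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    genotype is a list of two-character strings, one per locus,
    e.g. ["Aa", "Bb", "CC"].
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on chromosome 1;
    # the other allele goes on chromosome 2.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(g[c] for g, c in zip(genotype, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


subjects = {
    1: ["AA", "BB", "cc"],
    2: ["Aa", "BB", "cc"],
    3: ["AA", "Bb", "Cc"],
    4: ["aa", "BB", "Cc"],
    5: ["Aa", "Bb", "CC"],
}
for subject, genotype in subjects.items():
    print(subject, compatible_haplotype_pairs(genotype))
```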
37Thank you
38
- Recurrence risk ratio ($\lambda_R$)
- Ratio of the risk of disease in a particular class of relatives to the risk of disease in the general population
$\lambda_R = K_R / K_0$, where $K_R = \Pr(X_2 = 1 \mid X_1 = 1)$
- $K_R$: recurrence risk for a type R relative of an affected individual
- $K_0$: prevalence of the disease in the general population
- $X_1$ and $X_2$ represent relative 1 and relative 2; 1 means affected, 0 means unaffected
Reference: ???
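A small numerical sketch of the ratio defined on this slide: the recurrence risk is estimated as the fraction of second relatives who are affected among pairs whose first relative is affected, and then divided by the population prevalence. The function names, the toy pairs, and the prevalence value are all hypothetical and chosen only to show the arithmetic.

```python
def recurrence_risk(pairs):
    """Estimate K_R = Pr(X2 = 1 | X1 = 1) from (x1, x2) relative pairs.

    Each value is 1 (affected) or 0 (unaffected).
    """
    x2_values = [x2 for x1, x2 in pairs if x1 == 1]
    if not x2_values:
        raise ValueError("no pairs with an affected first relative")
    return sum(x2_values) / len(x2_values)


def recurrence_risk_ratio(k_r, k_0):
    """Return lambda_R = K_R / K_0."""
    if k_0 <= 0:
        raise ValueError("prevalence K_0 must be positive")
    return k_r / k_0


# Made-up data: 4 pairs have an affected first relative, 1 of them recurs.
toy_pairs = [(1, 1), (1, 0), (1, 0), (1, 0), (0, 0), (0, 1)]
k_r = recurrence_risk(toy_pairs)   # 1 / 4 = 0.25
k_0 = 0.05                         # assumed population prevalence
print(k_r, recurrence_risk_ratio(k_r, k_0))
```

A ratio well above 1 indicates that relatives of affected individuals are at higher risk than the general population.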
39Evolution
- Mutation.
- Spontaneous heritable changes in genes.
- Migration.
- Movement of a subpopulation within a larger population.
- Natural selection.
- Differences in the ability to survive and reproduce.
- Random genetic drift.
- Chance.
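40
- Recombination fraction ($\theta$)
- The frequency of crossing over between two loci.
[Figure: three loci A, B, C in order along a chromosome, with recombination fractions $\theta_{AB}$, $\theta_{BC}$ and $\theta_{AC}$.]
$\theta_{AB} + \theta_{BC} - 2\theta_{AB}\theta_{BC} \le \theta_{AC} \le \min(\theta_{AB} + \theta_{BC}, 1/2)$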
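The inequality on this slide bounds the recombination fraction between the outer loci A and C given the two adjacent fractions: the lower bound is theta_AB + theta_BC - 2 * theta_AB * theta_BC, and the upper bound is the smaller of theta_AB + theta_BC and 1/2. The lower bound is the value obtained when crossovers in the two intervals occur independently (no interference), and the sum is the value under complete interference. The sketch below just evaluates both bounds; the function name and example values are hypothetical.

```python
def theta_ac_bounds(theta_ab, theta_bc):
    """Return (lower, upper) bounds on theta_AC for ordered loci A, B, C."""
    for theta in (theta_ab, theta_bc):
        if not 0.0 <= theta <= 0.5:
            raise ValueError("recombination fractions must lie in [0, 0.5]")
    # Lower bound: independent crossovers in the two intervals.
    lower = theta_ab + theta_bc - 2.0 * theta_ab * theta_bc
    # Upper bound: fractions add, but can never exceed 1/2.
    upper = min(theta_ab + theta_bc, 0.5)
    return lower, upper


# Example values chosen for illustration only.
lower, upper = theta_ac_bounds(0.1, 0.2)
print(f"{lower:.2f} <= theta_AC <= {upper:.2f}")
```

For small fractions the two bounds nearly coincide, which is why recombination fractions are approximately additive over short distances.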
41Goals of Linkage Studies
- To obtain a crude chromosomal location of the gene or genes associated with a phenotype of interest, e.g. a genetic disease or an important quantitative trait.
- Examples: cystic fibrosis (found), diabetes, multiple sclerosis, and blood pressure.
42Linkage and Association
- Linkage studies use individual families with affected members and attempt to demonstrate linkage between the occurrence of the disease and genetic markers (linkage creates associations within families, but not among unrelated people).
- Association studies are based on populations and attempt to show an association between a particular allele and susceptibility to disease (a statistical statement about the co-occurrence of alleles or phenotypes).
43Linkage analysis
44Association Studies
[Figure: cases versus controls.]
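One common way to test for an association between an allele and disease status in a case-control sample is a Pearson chi-square test on the 2 x 2 table of allele counts. The sketch below is a minimal version of that test under stated assumptions: the counts are made up, the function name is hypothetical, and no adjustment is made for population stratification or multiple testing. The p-value uses the identity that a chi-square variable with 1 degree of freedom exceeds x with probability erfc(sqrt(x / 2)).

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 df) on a 2 x 2 allele-count table.

    Each argument is (count of allele A, count of allele a).
    Returns (chi-square statistic, p-value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / math.prod(margins)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts: 100 chromosomes in cases, 100 in controls.
chi2, p_value = allelic_chi_square((60, 40), (40, 60))
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```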