Title: Human Genome: sequence, structure, diseases
1. Human Genome: sequence, structure, diseases
- Lecture 6
- BINF 7580
- Spring 2005
2. Human Genome Project (1990-2003), from the site
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
- Project goals were to
- identify all the approximately 20,000-25,000 genes in human DNA,
- determine the sequences of the 3 billion chemical base pairs that make up human DNA,
- store this information in databases,
- improve tools for data analysis,
- transfer related technologies to the private sector, and
- address the ethical, legal, and social issues (ELSI) that may arise from the project.
3. The International Human Genome Sequencing
Consortium (IHGSC) used a hierarchical mapping
and sequencing strategy to construct the working
draft of the human genome. This clone-based
approach involves generating an overlapping
series of clones that covers the entire genome.
Each clone is 'fingerprinted' on the basis of
the pattern of fragments generated by restriction
enzyme digestion. Clones are then selected for
shotgun sequencing and the whole genome sequence
is reconstructed by map-guided assembly of
overlapping clone sequences.
4. Genomic DNA was partially digested and the fragments were cloned to form
bacterial artificial chromosome (BAC) libraries.
Each BAC clone carries a fragment of about 80 to 350 kb,
which can be propagated in E. coli.
As a result we have two types of maps:
- genetic maps: an indirect estimate of the distance
- physical maps: an estimate of the true distance, measured in base pairs
5. Look at this figure. Do you understand the
details? Can you answer questions about shotgun
sequencing? If yes, think about the following
questions.
- Questions
- Why do we need shotgun sequencing?
- What is the main idea of shotgun sequencing?
- What is a contig?
If you have trouble answering, look at the
next slide.
6. The DNA is randomly sheared into small
pieces, which are then sequenced. These sequence reads are assembled
into contigs, and the complete sequence of the
clone is generated.
A contig is a set of overlapping segments of DNA.
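The idea of assembling reads into a contig can be sketched in a few lines of Python. This is a toy illustration only, not the method used by the consortium: it assumes error-free reads that all come from the same strand, and the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all hypothetical.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping reads (made-up data) collapse into a single contig.
reads = ["ATGGCGT", "GCGTGCA", "TGCAACT"]
print(greedy_assemble(reads))
```

Real assemblers also have to cope with sequencing errors, reads from both strands, and repeated sequences, which is one reason the map-guided, clone-by-clone strategy was used.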
7. In June 2000, scientists announced the first
working draft of the human genome; the draft papers were published in February 2001.
A physical map of the human genome. McPherson JD,
Marra M, Hillier L, Waterston RH, Chinwalla A,
Wallis J, Sekhon M, Wylie K, Mardis ER, Wilson
RK, Fulton R, Kucaba TA, Wagner-McPherson C,
Barbazuk WB, Gregory SG, Humphray SJ, French L,
Evans RS, Bethel G, Whittaker A, Holden JL,
McCann OT, Dunham A, Soderlund C, Scott CE,
Bentley DR, Schuler G, Chen HC, Jang W, Green ED,
Idol JR, Maduro VV, Montgomery KT, Lee E, Miller
A, Emerling S, Kucherlapati, Gibbs R, Scherer S,
Gorrell JH, Sodergren E, Clerc-Blankenburg K,
Tabor P, Naylor S, Garcia D, de Jong PJ, Catanese
JJ, Nowak N, Osoegawa K, Qin S, Rowen L, Madan A,
Dors M, Hood L, Trask B, Friedman C, Massa H,
Cheung VG, Kirsch IR, Reid T, Yonescu R,
Weissenbach J, Bruls T, Heilig R, Branscomb E,
Olsen A, Doggett N, Cheng JF, Hawkins T, Myers
RM, Shang J, Ramirez L, Schmutz J, Velasquez O,
Dixon K, Stone NE, Cox DR, Haussler D, Kent WJ,
Furey T, Rogic S, Kennedy S, Jones S, Rosenthal
A, Wen G, Schilhabel M, Gloeckner G, Nyakatura G,
Siebert R, Schlegelberger B, Korenberg J, Chen
XN, Fujiyama A, Hattori M, Toyoda A, Yada T, Park
HS, Sakaki Y, Shimizu N, Asakawa S, Kawasaki K,
Sasaki T, Shintani A, Shimizu A, Shibuya K, Kudoh
J, Minoshima S, Ramser J, Seranski P, Hoff C,
Poustka A, Reinhardt R, Lehrach H
International Human Genome Mapping
Consortium. Nature. 2001 Feb 15;409(6822):934-41.
8. Question: What is the physical map of the human
genome?
There are two kinds of genome maps: genetic maps
and physical maps. Both are graphical
representations of the arrangement of genes on a
chromosome.
9. Physical maps. Distance is measured in base
pairs; the highest-resolution map would be the
complete nucleotide sequence of the chromosomes.
10. Two types of information are important:
1. The house (the goal of your search) is 25 miles north
of your house. This is a physical map: the exact
distance from one point to another, in miles,
kilometers, or base pairs. But sometimes it is useful,
and easier for the search, to know that the place you
are looking for is near a known landmark.
2. The house is near the Empire State Building. In a
genetic map, a landmark is a unique site in the DNA
sequence.
11. genetic distance
physical distance
12. Distance in genetic maps is based on genetic
markers. What will serve as a landmark? The
Sequence Tagged Site (STS) is ideal.
An STS is any
short DNA segment which is present at only one
location in the genome and whose sequence is
known.
If two different clones contain the same STS, then
they must overlap in their content of genomic
DNA. Normally one would want two or three STSs
in common to really trust the result.
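The STS rule above amounts to a set intersection. The sketch below is a minimal illustration, assuming each clone is described simply by the set of STS names detected in it; the clone contents, the STS names, and the helper names `shared_sts` and `clones_overlap` are hypothetical.

```python
def shared_sts(clone_a, clone_b):
    """Return the STS markers that were detected in both clones."""
    return set(clone_a) & set(clone_b)


def clones_overlap(clone_a, clone_b, min_shared=2):
    """Call two clones overlapping only if they share enough STSs."""
    return len(shared_sts(clone_a, clone_b)) >= min_shared


# Made-up STS content for three clones.
clone_1 = {"sts01", "sts02", "sts03"}
clone_2 = {"sts02", "sts03", "sts04"}
clone_3 = {"sts04", "sts05"}

print(clones_overlap(clone_1, clone_2))  # two shared markers
print(clones_overlap(clone_2, clone_3))  # only one shared marker
```

Setting `min_shared=2` reflects the advice to require two or three STSs in common before trusting an overlap.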
13. Distance in genetic maps is based on genetic
markers.
Short Tandem Repeat Polymorphism (STR)
Polymorphisms: variations in inherited regions
of DNA sequence which vary from person to
person. STRs are short sequences of DNA, normally
of length 2-5 base pairs, that are repeated
numerous times in a head-to-tail manner, e.g.
"gatagatagatagata". In each case, the number of
times a sequence is repeated may vary.
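Counting the copies of an STR motif is straightforward to sketch. The helper below is an illustration only; the name `longest_tandem_run` is hypothetical, and it assumes the motif is already known and that repeats are exact.

```python
import re


def longest_tandem_run(sequence, motif):
    """Largest number of back-to-back copies of `motif` found in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    # The lookahead lets a run start at any position, so no run is missed.
    pattern = re.compile(f"(?=((?:{re.escape(motif)})+))")
    runs = [len(m.group(1)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


print(longest_tandem_run("gatagatagatagata", "gata"))
```

Two people who carry different numbers of copies at the same locus would give different counts here, which is what makes STRs usable as markers.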
- RFLPs, restriction fragment length polymorphisms:
variations at the recognition site of a bacterial
restriction enzyme. This enzyme breaks apart strands
of DNA wherever they contain a certain nucleotide
sequence (a restriction site).
14.
- SNPs, or single nucleotide polymorphisms, are
individual point mutations, or substitutions of a
single nucleotide.
- Genetic diseases are frequently used in humans as
gene markers, with the disease state being one
allele and the healthy state the second allele.
Genetic maps show where markers are in relation
to each other on the chromosome, but do not show
the actual "mileage" between those markers.
15. NATURE, VOL 431, 21 OCTOBER 2004. Finishing the
euchromatic sequence of the human genome.
International Human Genome Sequencing Consortium.
The question:
What about the previous papers in 2001, where
the sequence was announced?
In fact, the 2001 sequence was just a draft, full
of holes. It omitted about 10% of the euchromatic
genome and was interrupted by about 150,000 gaps.
16. NATURE, VOL 431, 21 OCTOBER 2004. From the
abstract: In 2001, the International Human Genome
Sequencing Consortium reported a draft sequence
of the euchromatic portion of the human genome.
Since then, the international
collaboration has worked to convert this draft
into a genome sequence with high accuracy and
nearly complete coverage.
Here, we report the result
of this finishing process.
The current genome sequence (Build 35) contains
2.85 billion nucleotides interrupted by only 341
gaps.
It covers about 99% of the euchromatic genome and is
accurate to an error rate of about 1 event per 100,000
bases. ... Notably,
the human genome seems to encode only
20,000-25,000 protein-coding genes.
The genome sequence reported here should serve as a
firm foundation for biomedical research in the
decades ahead.
17. Question: What is the euchromatic sequence of
the human genome?
Euchromatin is the relatively loose, gel-like
portion of human DNA, which is gene-rich and is
the most actively transcribed part of the genome.
Some of our DNA, however, is more densely packed
into heterochromatin, which contains a smaller
but by no means negligible number of genes.
The 2001 drafts largely neglected
heterochromatin, so that they covered only 70% of
the whole genome. Even the 'completed' sequence
announced in 2003 only meant that the euchromatin
sequence was 98% finished.
18. Thus, the euchromatic genome is 2.88 Gb
and the overall human genome is 3.08
Gb. Decoding the heterochromatic genome
(one-fifth of the whole human genome) could take
another five years or more.
The central goal of decoding the human genome is
the identification of all
genes that code for particular proteins.
The results: a decade ago, most scientists thought
humans had about 100,000 genes. Three years ago,
we estimated there were about 30,000 to 35,000
genes, which surprised many. In 2004, researchers
confirmed the existence of 19,599
protein-coding genes and identified another 2,188
DNA segments that are predicted to be
protein-coding genes.
19. Important findings: the birth and death of genes.
Scientists have identified more than 1,000 new
genes that arose in the human genome after our
divergence from rodents some 75 million years
ago. They arose through recent gene duplications
and are involved with immune,
olfactory
(relating to the sense of
smell), and reproductive functions.
For example, there are two families of genes
recently duplicated in the human genome that
encode sets of proteins (pregnancy-specific
beta-1 glycoprotein and choriogonadotropin beta
proteins) that may be involved in the extended
period of pregnancy unique to humans.
20. Important findings: 33 nearly intact genes
have recently acquired one or more mutations,
causing them to stop functioning, or "die."
Scientists pinpointed these non-functioning
genes, referred to as pseudogenes, in the human
genome by aligning them with the mouse and rat
genomes, in which the corresponding genes have
maintained their functionality. Researchers
determined that 10 of these pseudogenes in the
human genome sequence appear to have coded for
proteins involved in olfactory reception, which
helps to explain why humans have fewer functional
olfactory receptors and, consequently, a poorer
sense of smell than rodents.
21. All genome information is collected at NCBI.
NCBI creates public databases, conducts research in
computational biology, develops software tools
for analyzing genome data, and disseminates
biomedical information, all for the better
understanding of molecular processes affecting
human health and disease.
Public databases:
GenBank - a database of nucleotide
sequences from >130,000 organisms. Records that
are annotated with coding region (CDS) features
also include amino acid translations.
GenBank is updated daily in
NCBI search systems, and a full release is issued
on the FTP site around the 15th of every
February, April, June, August, October, and
December.
22. LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
- The locus name is SCU49845.
- The sequence length is 5028 bp: the number of nucleotide base pairs (or
amino acid residues).
- The molecule type is DNA.
- The GenBank division is PLN (plant, fungal, and
algal sequences).
- The date of last modification is 21-JUN-1999.
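The fields of this LOCUS line can be pulled apart with a few lines of Python. This is a minimal sketch that assumes exactly the seven whitespace-separated tokens shown on the slide; LOCUS lines in other records can carry additional fields, which this sketch rejects rather than guesses at. The function name `parse_locus` and the dictionary keys are hypothetical.

```python
def parse_locus(line):
    """Split a LOCUS line of the form shown on the slide into named fields."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, molecule, division, date = fields
    return {
        "name": name,
        "length": int(length),   # number of base pairs
        "molecule": molecule,
        "division": division,    # GenBank division code, e.g. PLN
        "date": date,            # date of last modification
    }


record = parse_locus("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999")
print(record["name"], record["length"], record["division"])
```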
23. DEFINITION Saccharomyces cerevisiae TCP1-beta
gene, partial cds, and Axl2p (AXL2) and Rev7p
(REV7) genes, complete cds
A brief description of the
sequence; it includes information such as source
organism, gene name/protein name, or some
description of the sequence's function (if the
sequence is non-coding). If the sequence has a
coding region (CDS), the description may be followed
by a completeness qualifier, such as "complete cds".
ACCESSION U49845
The unique identifier for a sequence record. An
accession number is usually a combination of
one or more letters and numbers. Accession numbers do not
change; however, an original accession number
might become secondary to a newer accession
number if the authors make a new submission that
combines previous sequences.
24. VERSION U49845.1 GI:1293613
The VERSION line uses the
accession.version format. If there is any change
to the sequence data (even a single base), the
version number will be increased, e.g., U12345.1
-> U12345.2, but the accession portion will remain
stable.
GI ("GenInfo Identifier") is a sequence
identification number, in this case for the
nucleotide sequence. If a sequence changes in any
way, a new GI number will be assigned.
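The accession.version rule can be illustrated with a small sketch. The helper names `parse_version` and `is_newer_version` are hypothetical, and the sketch assumes the identifier has the plain `ACCESSION.N` shape shown above.

```python
def parse_version(identifier):
    """Split 'U49845.1' into the stable accession and the integer version."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def is_newer_version(old, new):
    """True if `new` is a later version of the same accession as `old`."""
    old_acc, old_ver = parse_version(old)
    new_acc, new_ver = parse_version(new)
    return old_acc == new_acc and new_ver > old_ver


print(parse_version("U49845.1"))
print(is_newer_version("U12345.1", "U12345.2"))
```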
25. SOURCE Saccharomyces cerevisiae (baker's yeast)
ORGANISM Saccharomyces cerevisiae
Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae; Saccharomyces.
The formal scientific name for
the source organism and its lineage, based on the
phylogenetic classification scheme used in the
NCBI Taxonomy Database.
26. REFERENCE 1 (bases 1 to 5028)
AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
TITLE Cloning and sequence of REV7, a gene whose
function is required for DNA damage-induced
mutagenesis in Saccharomyces cerevisiae
JOURNAL Yeast 10 (11), 1503-1509 (1994)
MEDLINE 95176709
PUBMED 7871890
REFERENCE 2 (bases 1 to 5028)
27. CDS 687..3158
/gene="AXL2"
/note="plasma membrane glycoprotein"
/codon_start=1
/function="required for axial budding pattern of S. cerevisiae"
/product="Axl2p"
/protein_id="AAA98666.1"
/db_xref="GI:1293615"
/translation="MTQLQISLLLTATISLLHLVVATP...
BASE COUNT 1510 a 1074 c 835 g 1609 t
ORIGIN
1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac
...
4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc
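The BASE COUNT line is simply a tally of each nucleotide in the ORIGIN section, and the four numbers add up to the 5028 bp sequence length (1510 + 1074 + 835 + 1609 = 5028). The sketch below shows how such a tally could be computed; the function name `base_count` is hypothetical, and it is applied only to the first ORIGIN line, since the rest of the sequence is elided on the slide.

```python
from collections import Counter


def base_count(sequence_text):
    """Tally the letters of a sequence, ignoring spaces and digits."""
    letters = [ch for ch in sequence_text.lower() if ch.isalpha()]
    return Counter(letters)


first_line = "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac"
counts = base_count(first_line)
print(counts["a"], counts["c"], counts["g"], counts["t"])
print(sum(counts.values()))  # number of bases on this line
```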
28. The other nucleotide databases:
dbEST is a division of GenBank that contains sequences of
Expressed Sequence Tags from a number of
organisms.
An expressed sequence tag (EST) is a small part
of the active part of a gene, made from cDNA,
which can be used to fish the rest of the gene
out of the chromosome, by matching base pairs
with part of the gene.
dbSNP - database of single nucleotide
polymorphisms, small-scale insertions/deletions,
polymorphic repetitive elements, and variation.
29. UniGene
is a system for automatically
partitioning GenBank sequences into a
non-redundant set of gene-oriented clusters. Each
UniGene cluster contains sequences that represent
a unique gene, as well as related information
such as the tissue types in which the gene has
been expressed and map location.
30. Question:
What is the next step after the genome
sequence is completed?
- The new research challenges in genetics:
- Gene number, exact locations, and functions
- Gene regulation
- DNA sequence organization
- Chromosomal structure and organization
- Noncoding DNA types, amount, distribution, information content, and functions
- Coordination of gene expression, protein synthesis, and post-translational events
- Interaction of proteins in complex molecular machines
31.
- Protein conservation (structure and function)
- Proteomes (total protein content and function) in organisms
- Correlation of SNPs (single-base DNA variations among individuals) with health and disease
- Disease prediction based on gene sequence variation
- Genes involved in complex traits and multigene diseases
- Developmental genetics, genomics
32. Nucleotide and Amino Acid Sequence Analysis
- Here is a short list of problems:
- Sequence comparison: compare two sequences and show the similarities and differences.
- The trivial method to compare two sequences is to compare them character by character, allowing for gaps.
- The best alignment?
- Try every possible alignment between two sequences
- and give each aligned position a score according to the scoring matrix.
- The alignment with the highest score is the best.
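Scoring one candidate alignment position by position can be sketched as follows. This is a simplified illustration: instead of a full scoring matrix it assumes a single score for any match, a single score for any mismatch, and a fixed gap penalty. The values 2, -1, and -2, the function name `score_alignment`, and the example rows are assumptions for illustration.

```python
def score_alignment(row1, row2, match=2, mismatch=-1, gap=-2):
    """Sum the column scores of two equal-length aligned rows ('-' is a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# Columns: S/S, D/R, V/V, gap/L, Y/Y
print(score_alignment("SDV-Y", "SRVLY"))
```

With a real substitution matrix, the `match` and `mismatch` constants would be replaced by a lookup keyed on the pair of residues.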
33. The question: How many alignments are
possible?
34. Unfortunately, the number of possible alignments of one
sequence against another is enormous.
Therefore, the main problem is
to make the
alignment process feasible in a relatively short
time.
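To get a feel for how fast the number of alignments grows, one can count them with a simple recurrence. The sketch below assumes one particular counting convention: every path through the alignment grid made of diagonal, horizontal, and vertical steps counts as a distinct alignment, so two alignments that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Number of grid paths (diagonal, right, down) for lengths m and n."""
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (
                counts[i - 1][j]        # gap in one sequence
                + counts[i][j - 1]      # gap in the other sequence
                + counts[i - 1][j - 1]  # two residues aligned
            )
    return counts[m][n]


print(count_alignments(6, 5))      # two short sequences of length 6 and 5
print(count_alignments(100, 100))  # two sequences of modest length
```

Under this convention, sequences of length 6 and 5 already have 3653 alignments, and the count for two 100-residue sequences is far beyond what could be enumerated one by one.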
35. Bioinformatics uses a computational method,
dynamic programming,
to align two proteins or nucleic acids. The
term dynamic programming describes the process
of solving problems where one needs to find the
best decisions one after another.
First, we select the best path from Start to
A;
then we select the best path from A to
Finish. The choice of the best path from A to
Finish is independent of the choice of path from
Start to A.
36. Thus the path is subdivided into a set of
steps. The goal is to find the optimal way for
each step. Any segment of the true optimal path
must itself be an optimal path. This is the
main idea of the dynamic programming method. Dynamic
programming is typically used when a problem has
many possible solutions and an optimal one needs
to be found.
37. sequence 1: S D V - Y
sequence 2: S R V L Y
Score: 3 = 2 - 1 + 2 - 2 + 2
(sum of residue pair scores minus gap penalty; the gap penalty is -2)
Score of new alignment = score of previous alignment + score of new aligned pair
sequence 1: S D V - Y T
sequence 2: S R V L Y T
Score: 5 = 3 + 2
38. There are two sequences:
A = ACGCTG,
B = CATGT. What is the best alignment?
Question: explain the cell in
the first row and the first column.
39 A C G... C A T...
40. Question: How do we score the gap?
41. Question: How do we calculate the score of this alignment?
42. How do we calculate the scores?
43. Question: How do we score the mismatch? 0, -1, 1?
44. Question: How do we score the match? 0, 1, 2?
Thus in this alignment the penalty for a gap is ...,
and the score for a mismatch is ...
45. Explain the score in the cell G3/C1. Check the
score for a mismatch against the previous slides.
46. Check the score in the cell G3/A2.
47. After filling in all of the values, the score
matrix is as follows:
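The matrix fill can be sketched in Python for the two sequences A = ACGCTG and B = CATGT. The scoring values are an assumption: match = 1, mismatch = 0, gap = 0 are used here because they give a final score of 3 for these sequences, which is the maximum score the traceback slides mention, but the slides themselves leave the values as questions. The function name `fill_score_matrix` and the layout (sequence 1 across the columns, sequence 2 down the rows) are also choices made for this sketch.

```python
def fill_score_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Global-alignment score matrix; scores[i][j] aligns seq2[:i] with seq1[:j]."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    scores = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):              # first row: only gaps in sequence 2
        scores[0][j] = scores[0][j - 1] + gap
    for i in range(1, rows):              # first column: only gaps in sequence 1
        scores[i][0] = scores[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq2[i - 1] == seq1[j - 1] else mismatch
            scores[i][j] = max(
                scores[i - 1][j - 1] + pair,  # diagonal: match or mismatch
                scores[i][j - 1] + gap,       # left: gap in sequence 2
                scores[i - 1][j] + gap,       # above: gap in sequence 1
            )
    return scores


matrix = fill_score_matrix("ACGCTG", "CATGT")
for row in matrix:
    print(row)
print("final score:", matrix[-1][-1])
```

Each cell depends only on its three neighbors (diagonal, left, above), which is what lets dynamic programming avoid enumerating every alignment.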
48. The next procedure is the traceback step. The
traceback step determines the actual alignment
that results in the maximum score. The traceback
step begins in the N,M position in the matrix,
i.e. the position where both sequences are
globally aligned.
49. The algorithm of the traceback
a) The traceback begins
with the last cell (G6/T5 in this case).
It takes the current cell and looks at the
neighbor cells that could be direct predecessors:
- the neighbor to the left (gap in sequence 2),
- the diagonal neighbor (match/mismatch), and
- the neighbor above it (gap in sequence 1).
50. For the current cell there are two possible
predecessors with the maximum score 3.
b) If more than one possible predecessor
(left and above) with the same
maximum score exists, any can be chosen. If the
diagonal neighbor has the same maximum score,
the diagonal is selected to avoid a gap.
Variant 1: select the left cell as the predecessor.
TG
T-
Select the best alignment and compare it with the
alignment on the next slide.
51. Question: Does your alignment coincide with this
one?
Make another possible alignment (Variant 2) and
then compare it with the alignment on the next
slide.
52. Variant 2
Question:
What are the maximum scores of these two
possible alignments?
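The traceback rules can be sketched in Python. As assumptions for this sketch: the scores are match = 1, mismatch = 0, gap = 0 (chosen because they reproduce the maximum score of 3 mentioned for the last cell; the slides do not state the values), sequence 1 runs across the columns and sequence 2 down the rows, and a neighbor counts as a predecessor only if stepping from it reproduces the current cell's score. The names `fill_score_matrix`, `traceback`, and the `prefer` argument are hypothetical.

```python
def fill_score_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """scores[i][j] is the best score for aligning seq2[:i] with seq1[:j]."""
    scores = [[0] * (len(seq1) + 1) for _ in range(len(seq2) + 1)]
    for j in range(1, len(seq1) + 1):
        scores[0][j] = scores[0][j - 1] + gap
    for i in range(1, len(seq2) + 1):
        scores[i][0] = scores[i - 1][0] + gap
        for j in range(1, len(seq1) + 1):
            pair = match if seq2[i - 1] == seq1[j - 1] else mismatch
            scores[i][j] = max(scores[i - 1][j - 1] + pair,
                               scores[i][j - 1] + gap,
                               scores[i - 1][j] + gap)
    return scores


def traceback(seq1, seq2, match=1, mismatch=0, gap=0, prefer="left"):
    """Walk back from the last cell; diagonal first, then the preferred gap."""
    scores = fill_score_matrix(seq1, seq2, match, mismatch, gap)
    i, j = len(seq2), len(seq1)
    row1, row2 = [], []
    while i > 0 or j > 0:
        diag_ok = left_ok = up_ok = False
        if i > 0 and j > 0:
            pair = match if seq2[i - 1] == seq1[j - 1] else mismatch
            diag_ok = scores[i][j] == scores[i - 1][j - 1] + pair
        if j > 0:
            left_ok = scores[i][j] == scores[i][j - 1] + gap
        if i > 0:
            up_ok = scores[i][j] == scores[i - 1][j] + gap
        if diag_ok:                       # diagonal avoids a gap
            row1.append(seq1[j - 1]); row2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif left_ok and (prefer == "left" or not up_ok):
            row1.append(seq1[j - 1]); row2.append("-")   # gap in sequence 2
            j -= 1
        else:
            row1.append("-"); row2.append(seq2[i - 1])   # gap in sequence 1
            i -= 1
    return "".join(reversed(row1)), "".join(reversed(row2))


for choice in ("left", "up"):
    top, bottom = traceback("ACGCTG", "CATGT", prefer=choice)
    print(choice, top, bottom, sep="\n")
```

Choosing "left" at the last cell gives an alignment that ends with TG over T-, as in Variant 1; choosing "up" gives a different alignment with the same score.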
53. Nucleotide Sequence Analysis
- HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.
- BLAST - a set of sequence similarity searching programs
- Nucleotide-nucleotide BLAST (blastn)
- Search for short, nearly exact matches
- Translated query vs. protein database (blastx)
- Protein query vs. translated database (tblastn)
- Immunoglobulin BLAST (IgBlast)
54. Home assignment: BLAST - sequence similarity
searching program.