Title: Human Genome: sequence, structure, diseases
1Human Genome sequence, structure, diseases
2Question
What is the next step after the genome
sequence is completed?
- The new research challenges in genetics:
- Gene number, exact locations, and functions
- Gene regulation
- DNA sequence organization
- Chromosomal structure and organization
- Noncoding DNA types, amount, distribution, information content, and functions
- Coordination of gene expression, protein synthesis, and post-translational events
- Interaction of proteins in complex molecular machines
3- Protein conservation (structure and function)
- Proteomes (total protein content and function) in organisms
- Correlation of SNPs (single-base DNA variations among individuals) with health and disease
- Disease prediction based on gene sequence variation
- Genes involved in complex traits and multigene diseases
- Developmental genetics, genomics
For most of these problems we need to locate a DNA fragment on a chromosome.
4- Chromosomes
- Each chromosome contains one long piece of DNA
- Chromosomes are visible in the light microscope
- Banding: the chromosomes look striped. Dark regions (bands) alternate with light regions (interbands). The bands are dark because their DNA is highly compacted and coiled. The interbands are regions where uncoiled DNA connects the bands.
- As long as the DNA in a band remains tightly coiled, it is not available for transcription. A puff in a band marks a site of RNA synthesis (transcription).
- Each chromosome has a characteristic length and banding pattern.
5Identification of chromosomes
- The human autosomes are numbered 1-22; the sex chromosomes are X and Y.
- Each arm is divided into sub-regions, each identified by a number.
- Each sub-region is divided into bands, each identified by a number.
p arm (short arm)
Centromere
q arm (long arm)
Example - 1q2.4 (usually written 1q24): chromosome 1, long arm, second region of the arm, fourth band of that sub-region.
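The naming scheme above can be read mechanically. The following is a minimal sketch, assuming a label consists of a chromosome (1-22, X or Y), an arm letter, one region digit and one band digit; the function name `parse_band` and the choice to accept the label with or without the dot are assumptions made for illustration, not part of the slides.

```python
import re

# Hypothetical pattern: chromosome, arm (p or q), region digit, optional dot, band digit.
BAND_PATTERN = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)\.?(\d)$")


def parse_band(label):
    """Split a band label such as '1q2.4' or '1q24' into its parts."""
    match = BAND_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a recognised band label: {label!r}")
    chromosome, arm, region, band = match.groups()
    return {
        "chromosome": chromosome,
        "arm": "short" if arm == "p" else "long",
        "region": int(region),
        "band": int(band),
    }


print(parse_band("1q2.4"))
```

The label from the example is split into chromosome 1, long arm, region 2, band 4.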
6H.A. Prepare a couple of slides about chromosome analysis and present data about the karyotype. How do karyotypes differ between species? Which technique is used to visualize all the chromosome pairs of an organism in different colors?
Spectral karyotype of a human female
7Nucleotide and Amino Acid Sequence Analysis
- Here is a short list of problems:
- Sequence comparison: compare two sequences and show their similarities and differences.
- The trivial method is to compare the two sequences character by character, allowing for gaps.
- The best alignment?
- Try every possible alignment between the two sequences
- and give each aligned position a score according to the scoring matrix.
- The alignment with the highest score is the best.
8The question: how many alignments are possible?
9Unfortunately, the number of possible alignments of one sequence against another is enormous.
Therefore, the main problem is to make the alignment process run in a relatively short time.
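To get a feeling for how fast the number grows, here is a small sketch that counts alignments. It assumes that an alignment is a sequence of columns, each being a residue pair, a gap in sequence 1, or a gap in sequence 2, and that alignments differing only in the order of adjacent gaps count as different. The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of gapped alignments of sequences with lengths m and n."""
    if m == 0 or n == 0:
        # Only one way left: align the remaining residues with gaps.
        return 1
    return (
        count_alignments(m - 1, n - 1)  # last column pairs two residues
        + count_alignments(m - 1, n)    # last column: residue of sequence 1 with a gap
        + count_alignments(m, n - 1)    # last column: residue of sequence 2 with a gap
    )


for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

Under this counting, two sequences of length 3 already have 63 alignments, and the count keeps multiplying as the sequences get longer.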
10Sequence comparison
In bioinformatics, a sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity. Similarity may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Sequence alignment: compare two words
BANANA-
-ANANAS
How many conserved positions?
How many gaps?
The goal of sequence alignment is to find optimal residue-to-residue correspondences. The optimization maximizes the number of conserved positions, i.e. positions occupied by identical or similar residues in all aligned sequences. To achieve this goal one sometimes needs to allow for gaps within sequences so that chemically similar amino acids can be aligned to each other.
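The two questions for the BANANA / ANANAS example can be checked with a few lines. This is a minimal sketch that counts only identical residues as conserved (similar residues are ignored) and uses "-" as the gap symbol; the function name `compare_aligned` is an assumption.

```python
def compare_aligned(top, bottom, gap="-"):
    """Return (conserved positions, gaps) for two already aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    # A position is conserved when both rows carry the same residue.
    conserved = sum(1 for a, b in zip(top, bottom) if a == b and a != gap)
    # Every gap symbol in either row counts as one gap.
    gaps = top.count(gap) + bottom.count(gap)
    return conserved, gaps


print(compare_aligned("BANANA-", "-ANANAS"))  # (5, 2)
```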
11Bioinformatics uses a computational method, dynamic programming, to align two proteins or nucleic acids.
The term dynamic programming describes the process of solving problems where one needs to find the best decisions one after another.
First we select the best path from Start to A, then we select the best path from A to Finish. The choice of the best path from A to Finish is independent of the choice of path from Start to A.
12How to determine an optimal path?
The crucial observation: the choice of the best path from A to Finish is independent of the choice of path from Start to A.
If we determine the best of the 6 paths from Start to A and the best of the 6 paths from A to Finish, then the best path from Start to Finish is the best path from Start to A followed by the best path from A to Finish.
Question: how many paths do we need to consider?
Answer: no more than 12 paths (6 + 6), instead of all 36 complete paths (6 x 6).
Several paths may share the best score. The method does not single out a unique best path, but it does find one of the optimal solutions.
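The saving from 36 to 12 can be illustrated with a toy calculation. The path costs below are hypothetical (the slide gives no numbers), and "best" is taken to mean the lowest total cost; both function names are assumptions.

```python
from itertools import product

# Hypothetical costs of the 6 paths on each leg (lower is better).
START_TO_A = [7, 3, 9, 4, 8, 5]
A_TO_FINISH = [6, 2, 5, 9, 4, 7]


def best_by_enumeration(first_leg, second_leg):
    """Score every complete Start -> A -> Finish route."""
    totals = [a + b for a, b in product(first_leg, second_leg)]
    return min(totals), len(totals)


def best_by_stages(first_leg, second_leg):
    """Pick the best path on each leg independently, then join them."""
    evaluated = len(first_leg) + len(second_leg)
    return min(first_leg) + min(second_leg), evaluated


print(best_by_enumeration(START_TO_A, A_TO_FINISH))  # (5, 36)
print(best_by_stages(START_TO_A, A_TO_FINISH))       # (5, 12)
```

Both approaches return the same best cost, but the staged version looks at 12 paths instead of 36.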
13Thus the path is subdivided into a set of steps. The goal is to find the optimal way through each step: any part of the true optimal path must itself be an optimal path. This is the main idea of the dynamic programming method. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found.
14Dynamic Programming: an example of global sequence alignment
The two sequences to be globally aligned are
G A A T T C A G T T A (sequence 1), M = 11 (length of sequence 1)
G G A T C G A (sequence 2), N = 7 (length of sequence 2)
Step 1. COST.
We have to assign a cost to each comparison. A simple scoring scheme for a residue at position i of sequence 1 and a residue at position j of sequence 2:
$S_{i,j} = 1$ if the two residues are identical (match score)
$S_{i,j} = 0$ if the two residues differ (mismatch score)
$S_{i,j} = 0$ if a residue is aligned with a gap (gap penalty)
15Step 2. The solution for each alignment position is saved in a matrix with M + 1 columns and N + 1 rows, where M and N are the lengths of the sequences to be aligned.
The first row and first column of the matrix can initially be filled with 0.
WHY?
[Figure: the score matrix with column index i and row index j; cells $M_{1,1}$ and $M_{1,0}$ are marked]
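The matrix from step 2 could be set up as in the sketch below. It assumes sequence 1 runs along the columns and sequence 2 along the rows, as in the slide; the function name `empty_score_matrix` is hypothetical. The extra row and column stand for "nothing aligned yet" on one of the sequences.

```python
def empty_score_matrix(seq1, seq2):
    """Matrix with len(seq1) + 1 columns and len(seq2) + 1 rows, filled with 0."""
    columns = len(seq1) + 1  # M + 1
    rows = len(seq2) + 1     # N + 1
    # Build each row separately so that the rows do not share one list object.
    return [[0] * columns for _ in range(rows)]


matrix = empty_score_matrix("GAATTCAGTTA", "GGATCGA")
print(len(matrix), "rows x", len(matrix[0]), "columns")  # 8 rows x 12 columns
```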
16Matrix Fill Step
Step 3. Find the maximal score $M_{i,j}$ for each position i, j in the matrix.
GAATT...
GGATC...
The question is: how best to align the residues at positions i and j?
For example:
GAA
GGA
or
G A
GGA
or ?
17To find the score $M_{i,j}$ for position i, j we have to know the scores of the matrix positions to the left ($M_{i-1,j}$), above ($M_{i,j-1}$), and on the diagonal ($M_{i-1,j-1}$), so that all possible alignments are checked.
Why?
[Figure: example sequences A C T D Q (index i) and F H A S Y (index j)]
Because the positions to the left ($M_{i-1,j}$), above ($M_{i,j-1}$), and on the diagonal ($M_{i-1,j-1}$) are the positions that come before $M_{i,j}$. We have to select the best previous position to make the next step to $M_{i,j}$.
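The fill step can be sketched as follows, using the usual global-alignment recurrence (known as Needleman-Wunsch, a name the slides do not use): each cell takes the best of the diagonal neighbor plus the match/mismatch score, the left neighbor plus the gap penalty, and the neighbor above plus the gap penalty. The function name and the default scores (1, 0, 0 from step 1) are assumptions for illustration.

```python
def fill_score_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill the score matrix; seq1 runs along the columns (i), seq2 along the rows (j)."""
    rows, columns = len(seq2) + 1, len(seq1) + 1
    # First row and column stay 0, as on the slide. This fits a gap penalty of 0;
    # with a non-zero penalty the border is normally filled with multiples of it.
    scores = [[0] * columns for _ in range(rows)]
    for j in range(1, rows):
        for i in range(1, columns):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            scores[j][i] = max(
                scores[j - 1][i - 1] + pair,  # diagonal: align the two residues
                scores[j][i - 1] + gap,       # left: gap in sequence 2
                scores[j - 1][i] + gap,       # above: gap in sequence 1
            )
    return scores


table = fill_score_matrix("GAATTCAGTTA", "GGATCGA")
for row in table:
    print(" ".join(str(value) for value in row))
print("maximum global score:", table[-1][-1])
```

The bottom-right cell holds the score of the best global alignment; for the two example sequences with scores 1, 0, 0 it is 6.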
18There are two sequences:
A = ACGCTG
B = CATGT
The best alignment?
Question: explain the cell in the first row and the first column.
19 A C G... C A T...
20Question: how do we score a gap?
21Question: how do we calculate the score of this alignment?
22How do we calculate the scores?
23Question: how do we score a mismatch? 0, -1, 1?
24Question: how do we score a match? 0, 1, 2?
Thus in this alignment the penalty for a gap is ... and the score for a mismatch is ...
25Explain the score in the cell G3/C1. Check the score for a mismatch against the previous slides.
26Check the score in the cell G3/A2.
27After filling in all of the values the score
matrix is as follows
28The next procedure is the traceback step. The traceback step determines the actual alignment that results in the maximum score. The traceback begins at position N, M of the matrix, i.e. the cell where both sequences are globally aligned.
29The traceback algorithm
a) The traceback begins with the last cell (G6/T5 in this case).
Traceback takes the current cell and looks at the neighbor cells that could be its direct predecessors:
- the neighbor to the left (gap in sequence 2),
- the diagonal neighbor (match/mismatch), and
- the neighbor above it (gap in sequence 1).
30For the current cell there are two possible predecessors with the maximum score 3.
b) If more than one possible predecessor (left and above) with the same maximum score exists, either can be chosen. If the diagonal neighbor has the same maximum score, the diagonal is selected to avoid a gap.
Variant 1: select the left cell as the predecessor.
TG
T-
Select the best alignment and compare it with the alignment on the next slide.
31Question: does your alignment coincide with this one?
Make another possible alignment (Variant 2) and then compare it with the alignment on the next slide.
32Variant 2
Question: what are the maximum scores of these two possible alignments?
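The traceback can be sketched in code as well. This version is an assumption-laden illustration, not the exact procedure of the slides: it follows a predecessor from which the current score can actually be reached, prefers the diagonal on ties (to avoid a gap), and returns only one of the possible optimal alignments. The function name `align` and the default scores 1, 0, 0 are hypothetical choices.

```python
def align(seq1, seq2, match=1, mismatch=0, gap=0):
    """Global alignment of seq1 (columns, i) and seq2 (rows, j) with traceback."""
    rows, columns = len(seq2) + 1, len(seq1) + 1
    scores = [[0] * columns for _ in range(rows)]  # zero border, fits gap = 0
    for j in range(1, rows):
        for i in range(1, columns):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            scores[j][i] = max(scores[j - 1][i - 1] + pair,
                               scores[j][i - 1] + gap,
                               scores[j - 1][i] + gap)

    top, bottom = [], []
    i, j = len(seq1), len(seq2)  # start in the last cell
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if scores[j][i] == scores[j - 1][i - 1] + pair:
                # Diagonal predecessor: match or mismatch, no gap.
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and (j == 0 or scores[j][i] == scores[j][i - 1] + gap):
            # Left predecessor: gap in sequence 2.
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            # Predecessor above: gap in sequence 1.
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), scores[-1][-1]


aligned1, aligned2, score = align("GAATTCAGTTA", "GGATCGA")
print(aligned1)
print(aligned2)
print("score:", score)
```

Choosing a different predecessor on a tie (for example left instead of above) gives another alignment with the same maximum score, which is the situation behind Variant 1 and Variant 2.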
33H.A. Create an alignment according to this matrix.
H.A. Construct the table (calculate the values of all cells) for the same sequences but with different scores:
$S_{i,j} = 2$ (match score)
$S_{i,j} = -1$ (mismatch score)
$S_{i,j} = -2$ (gap penalty)
Find the optimal alignment and compare it with the previous one.
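When comparing alignments under two scoring schemes, it helps to be able to score a finished alignment directly. The sketch below does only that (it does not build the table); the function name `score_alignment`, the "-" gap symbol and the default values 2, -1, -2 are assumptions taken from the assignment text, and the BANANA / ANANAS pair is reused purely as an illustration.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2, gap_char="-"):
    """Sum the column scores of two already aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            total += gap       # residue aligned with a gap
        elif a == b:
            total += match     # identical residues
        else:
            total += mismatch  # different residues
    return total


# Same alignment under the two scoring schemes used in the slides.
print(score_alignment("BANANA-", "-ANANAS", match=1, mismatch=0, gap=0))  # 5
print(score_alignment("BANANA-", "-ANANAS"))                              # 6
```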
34- Nucleotide Sequence Analysis
- HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.
- BLAST - a set of programs for sequence similarity searching
- Nucleotide-nucleotide BLAST (blastn)
- Search for short, nearly exact matches
- Translated query vs. protein database (blastx)
- Protein query vs. translated database (tblastn)
- Immunoglobulin BLAST (IgBlast)
H.A. Look at A user-friendly introduction to
BLAST http//www.geospiza.com/outreach/BLAST/slide
1.html
35H.A. BLAST - sequence similarity searching program. Prepare a short PowerPoint presentation.