Bioinformatics - PowerPoint PPT Presentation

Title:

Bioinformatics

Description:

Bioinformatics CSC 391/691; PHY 392; BICM 715 Importance of bioinformatics A more global perspective in experimental design The ability to capitalize on the emerging ...

Provided by: Jacquely46
Transcript and Presenter's Notes

Title: Bioinformatics


1
Bioinformatics
  • CSC 391/691; PHY 392; BICM 715

2
Importance of bioinformatics
  • A more global perspective in experimental design
  • The ability to capitalize on the emerging
    technology of database-mining--the process by
    which testable hypotheses are generated regarding
    the function or structure of a gene or protein of
    interest by identifying similar sequences in
    better characterized organisms.

3
Amino acids: chemical composition or digital
symbols for proteins
http://wbiomed.curtin.edu.au/teach/biochem/tutorials/AAs/AA.html
Link found on the Research Collaboratory for
Structural Biology web site: www.rcsb.org/pdb/education.html
See also Table 2.2 (Mount)
4
Nucleotides: chemical composition or digital
symbols for nucleic acids
http://ndbserver.rutgers.edu/NDB/archives/NAintro/
http://www.web-books.com/MoBio/Free/Ch3A.htm
Link found on the Research Collaboratory for
Structural Biology web site: www.rcsb.org/pdb/education.html
See also Table 2.1 (Mount)
5
The Genetic Code: how DNA nucleotides encode
protein amino acids
http://www.accessexcellence.org/AB/GG/genetic.html
6
Biologists think it's a lot of data, but maybe
it's really not
He made fun of biologists for complaining that
the human genome, which takes up about 3
gigabytes, is "a lot of data". He offered the
comparison of the DVD movie "Evita", which is
about 12 gigabytes, with the genome of Madonna
(3 gigabytes). "The movie contains four times
more information than Madonna's genome. And
Madonna shares 99% of her DNA with a chimp... And
90% with Craig Venter's dog." More proof that
the genome is not a lot of data: about
90-something percent of genetic information is
common to all humans. "The unique part of you
will fit on a floppy disk."
Nathan Myhrvold, former Chief Technology Officer
for Microsoft, Keynote Speech at NIH Digital
Biology Meeting 2003
7
Review of Lab 1
  • What did you learn about the sites you visited:
    SGD, SwissProt, EntrezRefSeq, EntrezNeighbor,
    EntrezProtein, PIR-US?
  • Can you define the term protein function?
  • Does the term gene function have any meaning?
  • Questions?

9
Biologists think it's a lot of data, and maybe it
really is
  • The genome is not a static, one-time picture
  • Genome changes over time: mutations and other
    changes
  • Genes expressed to make proteins
  • Set of genes that are expressed changes with cell
    type
  • Set of genes that are expressed changes over time
    and state

10
Definition of a Biological Database
A biological database is a large, organized body
of persistent data, usually associated with
computerized software designed to update, query,
and retrieve components of the data stored within
the system.
11
Sources of sequence data
  • GenBank at the National Center for Biotechnology
    Information, National Library of Medicine,
    Washington, DC (nucleotides and proteins)
    http://www.ncbi.nlm.nih.gov/Entrez
  • European Molecular Biology Laboratory (EMBL)
    Outstation at Hinxton, England
    http://www.ebi.ac.uk/embl/index.html
  • DNA DataBank of Japan (DDBJ) at Mishima, Japan
    http://www.ddbj.nig.ac.jp/
  • Protein Information Resource (PIR) database at
    the National Biomedical Research Foundation in
    Washington, DC (see Barker et al. 1998)
    http://www-nbrf.georgetown.edu/pirwww/
  • The SwissProt protein sequence database at ISREC,
    Swiss Institute for Experimental Cancer Research
    in Epalinges/Lausanne
    http://www.expasy.ch/cgi-bin/sprot-search-de
  • The Sequence Retrieval System (SRS) at the
    European Bioinformatics Institute allows both
    simple and complex concurrent searches of one or
    more sequence databases. The SRS system may also
    be used on a local machine to assist in the
    preparation of local sequence databases.
    http://srs6.ebi.ac.uk

Table 2.5. Mount
12
Sources of protein structure data
  • RCSB Protein Data Bank (PDB): www.rcsb.org
  • BioMagResBank: http://www.bmrb.wisc.edu/
  • MMDB: http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml

13
Review of Lab 2
  • What did you learn about the RCSB web page?
  • What are your thoughts about the PDB file format?
  • Was RasMol easy or hard to use? Is there
    anything you tried to do, but couldn't figure out
    how?
  • What is the difference between the two
    glutaredoxin structures (1aaz and 1die)?
  • MMDB: database of protein structures, ASN.1
    format (http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml)
  • Other questions?

14
Levels of protein structure
  • Primary structure
  • Secondary structure
  • (Super secondary structure)
  • Tertiary structure
  • Quaternary structure

15
Databases of protein structure classification
  • SCOP: Murzin A. G., Brenner S. E., Hubbard T.,
    Chothia C. (1995). J. Mol. Biol. 247, 536-540.
    scop@mrc-lmb.cam.ac.uk
  • CATH: Orengo, C.A., Michie, A.D., Jones, S.,
    Jones, D.T., Swindells, M.B., and Thornton, J.M.
    (1997) Vol 5. No 8. p.1093-1108.
    http://www.biochem.ucl.ac.uk/bsm/cath/
  • Dali: L. Holm and C. Sander (1996) Science
    273:595-602.
    http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
  • VAST: S. H. Bryant and C. Hogue.
    http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

16
RNA Structure
  • Primary structure: sequence of G, A, C, U nucleotides
  • Secondary structure: stem-loop structures
  • Tertiary structure
  • http://www.rnabase.org/

17
DNA structure
  • Primary structure: sequence of G, A, C, T nucleotides
  • Secondary structure: double helix
  • Higher levels of structure: nucleosome,
    chromatin, chromosome

18
An example of pairwise alignment
  • (A) ./wwwtmp/lalign/.17728.1.seq Glutaredoxin, T4,
    1AAZ.pdb - 87 aa
  • (B) ./wwwtmp/lalign/.17728.2.seq Unknown protein
    - 93 aa
  • using matrix file BL50, gap penalties -14/-4
  • 27.0% identity in 89 aa overlap; score 101;
    E(10,000) 0.0014

[LALIGN output; the position ruler and the match line between the rows are not reproduced, so column positions are approximate]
Glutar KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDDEKIAELLTKLGR
Unknow EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM

Glutar DTQIGLTMPQVFAPDGSHIGGFDQLREYF
Unknow QTR-PTSFPRIFV-DGQYIGSLKQFKDLY
19
Pairwise Sequence Alignment
  • The alignment of two sequences (either protein or
    nucleic acid) based on some algorithm
  • What is the right answer?
  • Align (pairwise) the following words:
    instruction, insurrection, incision
  • There is NO unique, precise, and universally
    applicable method of pairwise alignment
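
The word exercise above can be tried in code. What follows is a minimal sketch of one common approach, Needleman-Wunsch global alignment by dynamic programming. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not values from the slides; changing them can change which alignment comes out, which is the point of the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two strings; returns (score, aligned_a, aligned_b).

    The scoring parameters are illustrative assumptions.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two letters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


for x, y in [("instruction", "insurrection"),
             ("instruction", "incision"),
             ("insurrection", "incision")]:
    s, ax, ay = needleman_wunsch(x, y)
    print(s, ax, ay, sep="\n")
```

When several alignments tie for the best score, the traceback reports only one of them; a different tie-breaking order would report another, equally optimal one.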

21
An example of pairwise alignment
  • ./wwwtmp/lalign/.17731.1.seq unknown protein,
    Arabidopsis - 201 aa
  • /wwwtmp/lalign/.17731.2.seq Haemophilus influenzae
    Hybrid-Prx5, 1nm3 - 241 aa
  • using matrix file BL50, gap penalties -14/-4
  • 36.4% identity in 140 aa overlap; score 288;
    E(10,000) 3.5e-19

[LALIGN output; the position ruler and the match line between the rows are not reproduced]
unknow KFSTTPLSDIFKGKKVVIFGLPGAYTGVCSQQHVPSYKSHIDKFKAKGIDSVICVSVNDP
Heamop KWVDVTTSELFDNKTVIVFSLPGAFTPTCSSSHLPRYNELAPVFKKYGVDDILVVSVNDT

unknow FAINGWAEKLGAKDAIEFYGDFDGKFHKSLGLDKDLSAALLGPRSERWSAYVEDGKVKAV
Heamop FVXNAWKEDEKSEN-ISFIPDGNGEFTEGXGXLVGKEDLGFGKRSWRYSXLVKNGVVEKX

unknow NVE-EAPSD-FKVTGAEVIL
Heamop FIEPNEPGDPFKVSDADTXL
22
Global vs Local Alignment
Figure 3.1, Mount
23
Pairwise Sequence Alignment Websites
  • Bayes block aligner
    http://www.wadsworth.org/resres/bioinfo
    Zhu et al. (1998)
  • BCM Search Launcher: pairwise sequence alignment
    http://searchlauncher.bcm.tmc.edu/seq-search/alignment.html
  • SIM: local similarity program for finding
    alternative alignments
    http://www.expasy.ch/tools/sim.html
    Huang et al. (1990); Huang and Miller (1991);
    Pearson and Miller (1992)
  • Global alignment programs (GAP, NAP)
    http://genome.cs.mtu.edu/align/align.html
    Huang (1994)
  • FASTA program suite
    http://fasta.bioch.virginia.edu/fasta/fasta_list.html
    Pearson and Miller (1992); Pearson (1996)
  • BLAST 2 sequence alignment (BLASTN, BLASTP)
    http://www.ncbi.nlm.nih.gov/gorf/bl2.html
    Altschul et al. (1990)
  • LALIGN
    http://www.ch.embnet.org/software/LALIGN_form.html
    Huang and Miller, published in Adv. Appl. Math.
    (1991) 12:337-357
  • Likelihood-weighted sequence alignment (lwa)
    http://stateslab.bioinformatics.med.umich.edu/service/lwa.html
Table 3.1, Mount
24
What is multiple sequence alignment?
  • Multiple sequence alignment is the alignment of
    more than two nucleotide or protein sequences
  • Compare: pairwise sequence alignment vs. multiple
    sequence alignment

25
Issues with multiple sequence alignment
  • Try creating a multiple sequence alignment of the
    three words
  • Insurrection
  • Incision
  • Instruction

26
Issues with multiple sequence alignment
  • What's the right answer?
  • Computational complexity
  • What is a reasonable method for obtaining a
    cumulative score?
  • Placement and scoring of gaps

[Four candidate alignments of the three words; gap widths are shown as transcribed]
in cision
insurrec tion
instr uc tion

in cision
insurrection
ins truction

in cision
insurrection
instr uction

inci sion
insurrection
ins truction
27
Pairwise sequence alignment: LALIGN of OVCA2 and
DYR_SCHPO (global)
[LALIGN output; the position ruler and the match line between the rows are not reproduced, so column positions are approximate]
./wwwt MAAQRPLRVLCLAGFRQSERGFREKTGALRKALRGRAELVCLSGPHPVPDPPGPEGARSD
dihydr MSKPLKVLCLHGWIQSGPVFSKKMGSVQKYLSKYAELHFPTGPVVADEEADPNDEEEK

./wwwt FGSCPPEEQPRGWWFSEQEADVFSALEEPAVCRGLEESLGMVAQALNRLGPFDGLLGFSQ
dihydr KRLAALGGEQNGGKFGWFEVEDFKN-----TYGSWDESLECINQYMQEKGPFDGLIGFSQ

./wwwt GAALAALVCALGQAGDPRFPL---PRFILLVSSFCPRGIGFKESILQRPLSLPSLHVF
dihydr GAGIGAMLAQMLQPGQPPNPYVQHPPFKFVVFVGGFRAEKPEF-DHFYNPKLTTPSLHIA

./wwwt GDTDKVIPSQESVQLASQFPGAITLTHSGGHFIPA-------------AAP---------
dihydr GTSDTLVPLARSKQLVERCENAHVLLHPGQHIVPQQAVYKTGIRDFMFSAPTKEPTKHPR

19.2% sequence identity; score -413
28
Multiple sequence alignment
29
What is multiple sequence alignment used for?
  • Consensus sequences: which residues can be used
    to identify other members of the family?
  • Gene and protein families: which residues are
    functionally important? Functional families
  • Relationships and phylogenies: contains
    evolutionary history of sequences
  • Data underlying some protein structure prediction
    algorithms
  • Genome sequencing: sequence random, overlapping
    fragments; automation of assembly (in this case,
    there is a RIGHT answer)
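
As a small illustration of the first use, a consensus sequence can be read off the columns of a multiple sequence alignment. The sketch below is one simple way to do it, assuming a majority rule with a 75% threshold and "x" for unconserved columns. The four aligned fragments are made up for illustration and are not taken from any database.

```python
from collections import Counter


def consensus(alignment, threshold=0.75, unknown="x"):
    """Column-wise consensus of equal-length aligned sequences.

    A column reports its most common residue only if that residue is not a
    gap and reaches the threshold fraction; otherwise it reports `unknown`.
    The threshold and the `unknown` symbol are illustrative assumptions.
    """
    out = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        conserved = residue != "-" and count / len(column) >= threshold
        out.append(residue if conserved else unknown)
    return "".join(out)


# Hypothetical aligned fragments (toy data)
toy_alignment = [
    "GLPGAYTGVCS",
    "SLPGAFTPTCS",
    "GLPGAFTPVCS",
    "GLPGSYTGVCS",
]
print(consensus(toy_alignment))  # GLPGAxTxVCS
```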

30
Consensus sequences and important functional
residues
Baxter, et al, Mol Cell Prot 2003
31
Relationships and phylogenies
  • Serine-threonine protein phosphatases
  • Same biochemical function
  • Clustering clearly shows PP1, PP2a and PP2B
    families
  • What is different about these families?

Fetrow, Siew, Skolnick, FASEB J, 1999
32
Possible redox site in PP1 family
Only a clustering, not a true phylogenetic tree
33
Methods to solve computational complexity
  • Progressive global alignment
  • Iterative methods
  • Alignments based on locally conserved patterns
  • Statistical methods and probabilistic models

34
Multiple Sequence Alignment: Global
  • CLUSTALW or CLUSTALX (latter has graphical
    interface)
    FTP to ftp.ebi.ac.uk/pub/software
    Thompson et al. (1994a, 1997); Higgins et al. (1996)
  • MSA
    http://www.psc.edu/
    http://www.ibc.wustl.edu/ibc/msa.html
    FTP to fastlink.nih.gov/pub/msa
    Lipman et al. (1989); Gupta et al. (1995)
  • PRALINE
    http://mathbio.nimr.mrc.ac.uk/jhering/praline/
    Heringa (1999)
Table 4.1, Mount
35
Multiple Sequence Alignment: Iterative
  • DIALIGN: segment alignment
    http://www.gsf.de/biodv/dialign.html
    Morgenstern et al. (1996)
  • MultAlin
    http://protein.toulouse.inra.fr/multalin.html
    Corpet (1988)
  • Parallel PRRN: progressive global alignment
    http://prrn.ims.u-tokyo.ac.jp/
    Gotoh (1996)
  • SAGA: genetic algorithm
    http://igs-server.cnrs-mrs.fr/cnotred/Projects_home_page/saga_home_page.html
    Notredame and Higgins (1996)
Table 4.1, Mount
36
Multiple Sequence Alignment: Local
  • Aligned Segment Statistical Evaluation Tool (Asset)
    FTP to ncbi.nlm.nih.gov/pub/neuwald/asset
    Neuwald and Green (1994)
  • BLOCKS Web site
    http://blocks.fhcrc.org/blocks/
    Henikoff and Henikoff (1991, 1992)
  • eMOTIF Web server
    http://dna.Stanford.EDU/emotif/
    Nevill-Manning et al. (1998)
  • GIBBS, the Gibbs sampler statistical method
    FTP to ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
    Lawrence et al. (1993); Liu et al. (1995);
    Neuwald et al. (1995)
  • HMMER: hidden Markov model software
    http://hmmer.wustl.edu/
    Eddy (1998)
  • MACAW, a workbench for multiple alignment
    construction and analysis
    FTP to ncbi.nlm.nih.gov/pub/macaw/
    Schuler et al. (1991)
  • MEME Web site, expectation maximization method
    http://meme.sdsc.edu/meme/website/
    Bailey and Elkan (1995); Grundy et al. (1996, 1997);
    Bailey and Gribskov (1998)
  • Profile analysis at UCSD
    http://www.sdsc.edu/projects/profile/
    Gribskov and Veretnik (1996)
  • SAM: hidden Markov model Web site
    http://www.cse.ucsc.edu/research/compbio/sam.html
    Krogh et al. (1994); Hughey and Krogh (1996)
Table 4.1, Mount
37
Methods to solve computational complexity
  • Progressive global alignment
      • Start with most related sequences
      • Problem: errors in the initial alignments are
        propagated
  • Iterative methods
      • Iteratively align subgroups of sequences to
        find the best alignment, then align the subgroups
  • Alignments based on locally conserved patterns
      • Block analysis
  • Statistical methods and probabilistic models
      • Expectation maximization, Gibbs sampler, hidden
        Markov models

38
Profile Methods
  • Perform a global multiple sequence alignment on a
    group of sequences
  • Extract more highly conserved regions
  • Profile: scoring matrix for these highly
    conserved regions
  • Used to search unknown sequences for membership
    in the family

Figures 4.11 (p. 162) and 4.12 (p. 166-167)
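
The profile steps above can be sketched in a few lines. This is a simplified illustration, not the method used by any particular profile program: it assumes a gap-free block of equal-length sequences, a uniform background frequency, and a pseudocount of 1. The four-residue block is made up for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Log-odds scoring matrix: one dict of residue scores per block column.

    Assumptions for this sketch: equal-length, gap-free sequences and a
    uniform background frequency for every residue.
    """
    width = len(block[0])
    assert all(len(seq) == width for seq in block)
    background = 1.0 / len(alphabet)
    total = len(block) + pseudocount * len(alphabet)
    profile = []
    for col in range(width):
        counts = Counter(seq[col] for seq in block)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return profile


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index)."""
    width = len(profile)
    return max(
        (sum(col[aa] for col, aa in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical conserved block (toy data) and an unknown sequence to search
toy_block = ["CPYC", "CGYC", "CPFC", "CPYC"]
profile = build_profile(toy_block)
score, start = best_hit(profile, "AAKCPYCDN")
print(round(score, 2), start)
```

A high-scoring window suggests family membership; how high is high enough depends on the family and has to be calibrated against known members and non-members.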
39
Limitations of such profiles
  • Limited by sequences in original msa
  • Sequence bias (too many of one type of sequence)
  • Sequences in msa not representative of entire
    family

40
Blocks
  • Blocks are conserved regions of msa (like
    profiles) but no gaps allowed
  • Servers for producing Blocks
  • Blocks server
  • eMotif server
  • Block libraries for database searching
  • Blocks (Henikoff and Henikoff)
  • Prosite (Bairoch)
  • Prints (Attwood)

41
Blocks that might be extracted from an msa
Baxter, et al, Mol Cell Prot 2003
42
Blocks that might be extracted from an msa
Baxter, et al, Mol Cell Prot 2003
43
Database searching
  • Identify a new sequence by experimental methods:
    what is it?
  • Search databases to find similar sequences
  • If enough similarity, can say that function of
    new sequence is same as known sequence function:
    annotation transfer
  • What is enough similarity?
  • What is function?

Chapter 7, Mount
44
Relationships between family members
  • Sequence relationships between family members
  • Not all members of family have significant
    sequence similarity to all others
  • Can be represented by nodes and edges of a graph

[Figure: graph of family members with nodes A, B, C, D, E, F, Z]
45
Beware of issues with function annotation transfer
  • Multiple domains
  • High sequence identity, but functional residues
    not conserved
  • Sequence repeats (low complexity regions)

[Figure labels: New; Function B, Function A; Function A. Known serine hydrolase: H, S, D. New sequence: S, D, L.]
46
Methods for database searching
  • Sequence similarity with query sequence: FASTA,
    BLAST (Fig 7.5, p. 305)
  • Profile search: ProfileSearch
  • Position-specific scoring matrix: MAST
  • Iterative alignment (combination of sequence
    searching and profile search): PSI-BLAST
  • Patterns: Prosite, Blocks, Prints, CDD/Impala

Table 7.1, Mount
47
The problem with speed
  • Dynamic programming
      • Guaranteed to find optimal answer
      • Too slow (number of searches performed and number
        of sequences in databases that are searched):
        Smith-Waterman dynamic programming algorithm is
        50X slower than BLAST or FASTA
      • Faster hardware has made this problem feasible
  • Heuristic methods
      • FASTA: short patterns common to the query and
        database sequences
      • BLAST: similar, but searches for rarer, more
        significant patterns
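
The speed-up in the heuristic methods comes from looking up short words instead of filling a full dynamic programming matrix for every database sequence. The sketch below shows only that first word-lookup step in a simplified form; it is not the actual FASTA or BLAST implementation, and the word length k = 3 is an assumption. The first two database entries are fragments of the two sequences aligned on the earlier LALIGN slide, and the third is a fragment of the example sequence from the method-evaluation slides.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map each length-k word to the set of (sequence name, offset) where it occurs."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add((name, i))
    return index


def count_word_hits(query, index, k):
    """Count word matches (query offset, database offset pairs) per sequence."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name, _offset in index.get(query[i:i + k], ()):
            hits[name] += 1
    return dict(hits)


database = {
    "unknown_arabidopsis": "VVIFGLPGAYTGVCS",
    "hybrid_prx5": "VIVFSLPGAFTPTCS",
    "unrelated": "MSTKQHSAKDELLYL",
}
index = build_word_index(database, k=3)
print(count_word_hits("FGLPGAYTG", index, k=3))
```

Only sequences with enough word hits would then be passed on to a slower, more careful alignment step, which is why the heuristics can miss matches that full dynamic programming would find.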

48
Searches on DNA vs Protein Sequences
  • 20-letter alphabet vs 4-letter alphabet
  • Fivefold larger variety of sequence characters in
    proteins: easier to detect patterns
  • Searches with DNA sequences produce fewer
    significant matches
  • What if you don't know the reading frame?
  • Sometimes must do nucleic acid searches
    (searching for similarities in non-coding regions)
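
When the reading frame is unknown, one common workaround is to translate the DNA in all six frames (three on each strand) and search with each protein translation. The sketch below uses the standard genetic code. The frame labels ("+1" to "-3") and the short test sequence are illustrative choices, and ambiguous bases such as N are not handled.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; trailing bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    return {
        f"{sign}{offset + 1}": translate(strand[offset:])
        for sign, strand in (("+", dna), ("-", reverse))
        for offset in range(3)
    }


frames = six_frames("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```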

49
Sensitivity vs selectivity
  • Sensitivity: a method's ability to find most
    members of the protein family
  • Selectivity: a method's ability to distinguish
    true members from non-members
  • Want a method to have high sensitivity (get all
    true positives) and high selectivity (not get
    false positives)
  • Can be a difficult test with biological data
    sets: not all true positives are known

50
Scoring matrices commonly used
  • PAM250 (point accepted mutation): Dayhoff, M.,
    Schwartz, R. M., and Orcutt, B. C., Atlas of
    Protein Sequence and Structure (1978) 5(3):345
  • BLOSUM62 (blocks amino acid substitution
    matrices): Henikoff and Henikoff, Amino acid
    substitution matrices from protein blocks. (1992)
    Proc. Natl. Acad. Sci. USA 89:10915-10919.

51
PAM250
  • Calculated for families of related proteins (>85%
    identity)
  • 1 PAM is the amount of evolutionary change that
    yields, on average, one substitution in 100 amino
    acid residues
  • A positive score signifies a common replacement
    whereas a negative score signifies an unlikely
    replacement
  • PAM250 matrix assumes/is optimized for sequences
    separated by 250 PAM, i.e. 250 substitutions in
    100 amino acids (longer evolutionary time)

52
BLOSUM62
  • BLOSUM matrices are based on local alignments
    (blocks or conserved amino acid patterns)
  • BLOSUM 62 is a matrix calculated from blocks in
    which sequences at least 62% identical are
    clustered, so the comparisons counted are between
    sequences less than 62% identical
  • All BLOSUM matrices are based on observed
    alignments; they are not extrapolated from
    comparisons of closely related proteins
  • BLOSUM 62 is the default matrix in BLAST 2.0

53
Comparison of PAM250 and BLOSUM62
Less divergent
  BLOSUM80    PAM1
  BLOSUM62    PAM120
  BLOSUM45    PAM250
More divergent
The relationship between BLOSUM and PAM
substitution matrices. BLOSUM matrices with
higher numbers and PAM matrices with low numbers
are both designed for comparisons of closely
related sequences. BLOSUM matrices with low
numbers and PAM matrices with high numbers are
designed for comparisons of distantly related
proteins. If distant relatives of the query
sequence are specifically being sought, the
matrix can be tailored to that type of search.
54
Scoring matrices commonly used
  • PAM250
      • Represents a period of time during which only
        about 20% of amino acids will remain unchanged
      • Shown to be appropriate for searching for
        sequences of 17-27% identity
  • BLOSUM62
      • Matrix calculated from blocks in which sequences
        at least 62% identical are clustered
      • Though it is tailored for comparisons of
        moderately distant proteins, it performs well in
        detecting closer relationships
  • BLOSUM50
      • Shown to be better for FASTA searches

55
Methods for database sequence searching
  • Sequence similarity with query sequence: FASTA,
    BLAST
  • Profile search: ProfileSearch
  • Position-specific scoring matrix: MAST
  • Iterative alignment (combination of sequence
    searching and profile search): PSI-BLAST
  • Patterns: Prosite, PFAM, CDD/Impala

56
Review of protein structure
  • Primary structure: sequence of amino acids
  • Secondary structure: local segments of protein
    structure
  • Tertiary structure: three-dimensional structure
    of a single protein chain
  • Quaternary structure: packing of 2 or more
    protein chains

57
Classification of protein tertiary structure
  • All alpha proteins
  • All beta proteins
  • Alpha+beta proteins
  • Alpha/beta proteins
  • Irregular proteins

Classify these proteins: T-cell protein CD8
(1cd8), myoglobin, triose phosphate isomerase,
G-specific endonuclease (1rnb)
58
Representations of protein structures
  • All atom
  • CPK models
  • Cartoons (ribbons, etc)
  • Topology diagrams

59
Protein structure databases
  • RCSB (PDB): http://www.rcsb.org/pdb
      • General repository for all protein coordinate
        files
  • MMDB: http://www.ncbi.nlm.nih.gov/Structure
      • NCBI structure database: structures from PDB
      • Links to sequence and genome databases
  • BioMagResBank: http://www.bmrb.wisc.edu/
      • General repository for NMR structure data

60
Alignment of protein structure
  • Superposition of protein 3D structures
  • Used in searching for structural similarity and
    grouping proteins into fold families
  • Structural similarity is common and does not
    necessarily indicate an evolutionary relationship
    (different from sequence similarity)

61
Structure Alignment: A difficult problem
  • Alignment in atom positions in 3D space
  • Pieces of proteins may align
  • What is significant and what is not? (Is
    alignment of two helices significant?)
  • Alignment of topology or secondary structure
    packing give different answers

Easy example (Eidhammer and Jonassen)
More difficult examples: http://www.sbg.bio.ic.ac.uk/people/rob/sf/sf.html
62
Structure alignment used to classify (group)
protein structures
  • SCOP (Structural Classification Of Proteins;
    http://scop.mrc-lmb.cam.ac.uk/scop/)
      • Class (all alpha, all beta, alpha+beta,
        alpha/beta), family, superfamily, fold
      • Reflects structural and evolutionary
        relationships
      • Mostly done by hand (expert analysis)
  • CATH (classification by class, architecture,
    topology and homology;
    http://www.biochem.ucl.ac.uk/bsm/cath)
      • Class (all alpha, all beta, alpha/beta),
        architecture, fold, superfamily, family
      • Uses SSAP structure alignment program
  • FSSP (fold classification based on
    structure-structure protein alignment;
    http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
      • Based on pairwise alignment of all non-redundant
        proteins in PDB
      • Divides proteins into structures and domains;
        each represents a unique configuration of
        secondary structure elements
      • Uses Dali structure alignment program
  • MMDB (molecular modeling database;
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
      • Proteins classified into structurally related
        groups by VAST, based on arrangements of
        secondary structures
      • Groupings of all PDB structures
  • SARF (spatial arrangement of backbone fragments;
    http://123d.ncifcrf.gov/)

63
Web sites for structure alignment
  • VAST: http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
      • NCBI structure comparison
      • Comparison of orientations of secondary
        structures (vector representation of secondary
        structures)
      • Approach from graph theory
  • Dali: http://www.ebi.ac.uk/dali/
      • FSSP structure comparison
      • Protein represented as distance matrix between
        alpha carbons
      • Monte Carlo simulation to do random search for
        sub-distance-matrices
  • SSAP: http://www.biochem.ucl.ac.uk/cgi-bin/cath/GetSsapRasmol.pl
      • CATH structure comparison
      • Set structure environment for each residue, then
        align residue by residue using double dynamic
        programming
      • Structure environment can use beta carbon vectors
        or phi/psi backbone dihedral angles
  • Others: many, such as Structal (Gerstein and
    Levitt), Minarea (Falicov and Cohen), Lock (Singh
    and Brutlag)

64
Protein Structure Prediction
  • Goal is to understand the relationship between
    the primary amino acid sequence and the structure
    of the protein
  • Relationship between sequence and structure is
    not simple and is not understood
  • Protein folding problem remains unsolved

65
Protein Structure Prediction
  • Secondary structure prediction: unsolved?
  • Tertiary structure prediction: unsolved problem
    (CASP competition)
  • Quaternary structure prediction: unsolved
    problem
      • Docking of two subunits

66
Secondary structure prediction
  • Prediction of three classes of secondary
    structure: helix, strand, coil
  • Solved problem? 70-80% correct predictions
  • Methods (web sites) can give very different
    answers
  • Prediction of non-regular secondary structure
    (loops and turns) not as successful

67
Secondary structure prediction
  • Method development
  • Frequencies of the residue types found in each
    type of secondary structure
  • Frequencies calculated from database of known
    structures (training set)
  • Method evaluation
  • Test method on proteins whose structures are
    known (testing set)
  • Training and testing sets must not be the same

68
Secondary structure prediction methods and
references
Method categories: single residue statistics; explicit rules; nearest neighbors; neural networks; hidden Markov models
  • 1st generation: Chou/Fasman (74); GOR I; Lim (74)
  • 2nd generation: GOR III (87); Predator (96);
    Levin (86); Nishikawa and Ooi (86); Yi and Lander
    (93); Qian and Sejnowski (88); Holley and Karplus
    (89); Yi and Lander (93); Asai/Handa (93)
  • 3rd generation: GOR IV; DSC (Prof) (96); NNssp
    (95); NNssp (95); PHD (93); Jnet (99); PsiPred
    (99); PASSML (98)
See Table 9.7, Mount, for list of servers
69
GOR IV secondary structure prediction
  • Three state prediction: helix, strand, loop
  • Statistics of pair frequencies observed within a
    window of 17 amino acid residues
  • Based on information theory: sound statistical
    basis and no ad hoc rules
  • Mean accuracy of 64.4% for a three state
    prediction (Q3)

Garnier, Gibrat, Robson: http://abs.cit.nih.gov/index.html
70
PHD secondary structure prediction
  • Three state prediction: helix, strand, loop
  • Predicts secondary structure from multiple
    sequence alignments
  • Three consecutive neural networks (feed forward)
      • Raw 3-state prediction for each position, based
        on alignment composition in 13-residue window
      • Filter: 3-state probabilities based on
        probabilities of flanking positions in 17-residue
        window
      • Jury network using several raw/filter
        combinations trained separately
  • Expected average accuracy > 72% for three state
    prediction (Q3)

Rost and Sander: http://www.predictprotein.org
71
Method evaluation: how good is good?
  • Testing of prediction methods involves
      • Applying the method to a set of proteins whose
        secondary structures are known experimentally and
        comparing prediction results to known results
      • Calculating measures of how good the performance
        is
  • Q1 (h, s, or c)
      • (number of residues correctly predicted in one
        state / number of residues in that state) × 100
  • Q3 (h, s, and c)
      • (number of residues correctly predicted in each
        of 3 states / number of all residues) × 100
  • Matthews correlation coefficient (Cs)
      • (Tp×Tn - Fp×Fn) / sqrt((Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp))

[Example; spacing of the coil positions in the Actu and Pred rows is as transcribed]
Num  ....,....1....,....2....,....3....,....4....,....5....,....6
Res  MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKREN
Actu HHHHHHHHHHHH EE HHHHHHHHHHHHHHH EE HHHHHHHHHHHHHH
Pred HHHHHH EEEEE HHHHHHHHHHHH EEEEEE HHHHHHHH
Pred HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
72
Method evaluation: how good is good?
  • Matthews correlation coefficient (Cs)
      • (Tp×Tn - Fp×Fn) / sqrt((Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp))
  • Where:
      • Tp, true positive predictions (method predicts
        helix, and residue is in a helix)
      • Tn, true negative predictions (method predicts
        not helix, and residue is not in a helix)
      • Fp, false positive predictions (method predicts
        helix, but residue is not in a helix)
      • Fn, false negative predictions (method predicts
        not helix, but residue is in a helix)

[Example; spacing of the coil positions in the Actu and Pred rows is as transcribed]
Num  ....,....1....,....2....,....3....,....4....,....5....,....6
Res  MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKREN
Actu HHHHHHHHHHHH EE HHHHHHHHHHHHHHH EE HHHHHHHHHHHHHH
Pred HHHHHH EEEEE HHHHHHHHHHHH EEEEEE HHHHHHHH

Q1 (helix) = (4+12+8) / (12+15+14) × 100 = 58%
Q3 = (4+12+8+2+2) / 60 × 100 = 47%
Tp = 4+12+8 = 24; Tn = 9+8 = 17; Fp = 2; Fn = 8+1+2+5+1 = 17
Ch = ((24×17) - (2×17)) / sqrt((24+2)(17+17)(24+17)(17+2))
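
The measures on this slide are easy to check numerically. The sketch below recomputes Q1 (helix), Q3 and the Matthews coefficient for helix from the counts given in the worked example; the function names are illustrative.

```python
import math


def q_percent(correct, total):
    """Percentage of residues predicted correctly (used for both Q1 and Q3)."""
    return 100.0 * correct / total


def matthews(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts taken from the worked example on the slide
q1_helix = q_percent(4 + 12 + 8, 12 + 15 + 14)  # correct helix / actual helix
q3 = q_percent(4 + 12 + 8 + 2 + 2, 60)          # correct in any state / all residues
c_helix = matthews(tp=24, tn=17, fp=2, fn=17)

print(f"Q1(helix) = {q1_helix:.1f}%")  # 58.5%
print(f"Q3        = {q3:.1f}%")        # 46.7%
print(f"C(helix)  = {c_helix:.2f}")    # 0.45
```

Q1 and Q3 come out at 58.5% and 46.7%, matching the 58% and 47% on the slide after rounding, and the Matthews coefficient for helix evaluates to about 0.45.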
73
Tertiary Structure Prediction
  • Homology modeling: identifiable sequence
    similarity
  • Fold recognition (threading; see Table 9.8 for
    server list)
  • Ab initio methods

74
Homology modeling
  • Sequence alignment
  • Side chain modeling
  • Modeling insertions and deletions
  • Optimizing the model
  • Model evaluation
  • Repeat?

75
Fold Recognition (threading)
  • Template identification/sequence
    alignment/alignment optimization
  • Side chain modeling
  • Modeling insertions and deletions
  • Optimizing the model
  • Model evaluation
  • Repeat?

76
Ab initio methods: folding from scratch
  • Start with unfolded protein or random
    conformation
  • Use atomic-level forces, solve energetic
    equations
  • Identify most stable conformation (lowest free
    energy)
  • Computational demands are high for a protein of
    100 amino acids
      • Assume constant bond lengths and angles
      • Allow 2 of the 3 backbone torsion angles per
        amino acid to rotate
      • Do not allow side chain torsion angles to move
      • Assuming 10 allowed conformations per residue,
        must explore 10^100 conformations
      • Calculation of 10^100 energies (one for each
        conformation) is not possible
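
A quick calculation shows why exhaustive enumeration is out of reach. The sketch below takes the slide's figures (100 residues, 10 conformations each) and combines them with an evaluation rate of 10^15 energies per second, which is a purely hypothetical number chosen to be generous.

```python
RESIDUES = 100
CONFORMATIONS_PER_RESIDUE = 10
EVALUATIONS_PER_SECOND = 10**15      # hypothetical, deliberately generous rate
SECONDS_PER_YEAR = 365 * 24 * 3600

conformations = CONFORMATIONS_PER_RESIDUE ** RESIDUES
years = conformations // (EVALUATIONS_PER_SECOND * SECONDS_PER_YEAR)

# Report the order of magnitude rather than a 78-digit number
print(f"conformations: 10^{len(str(conformations)) - 1}")
print(f"years needed:  about 10^{len(str(years)) - 1}")
```

Even at that rate the search would take on the order of 10^77 years, which is why sampling and simplified models are used instead.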

77
Ab initio methods: simplifications
  • Lattice models to simplify the conformational
    search space
  • Monte Carlo statistical sampling of
    conformational space
  • Stepwise processes
  • Predict regular secondary structures
  • Pack secondary structures to form tertiary
    structures
  • Others

78
Review of Definitions
  • Cell: fundamental working unit of biology
  • DNA: encodes all information to create cells and
    allow them to function
      • Linear arrangement of bases (AGTC)
  • Genome: an organism's complete set of DNA
  • Chromosome: physically distinct molecules of DNA
      • Genomes can be composed of 1, 2 or more
        chromosomes
  • Gene: basic physical and functional unit of
    heredity
      • Linear arrangement of bases along the chromosome
      • Contain instructions for encoding protein
      • (Remember genetic code?)

79
Genomes and proteomes
  • Genome: sum of all genes and intergenic DNA
    sequences in a cell
      • The smallest known genome for a free-living
        organism (a bacterium) contains about 600,000 DNA
        base pairs
      • Human and mouse genomes have about 3 billion
      • Relatively unchanging from cell to cell
  • Proteome: the entire set of proteins encoded in
    the genome of an organism and produced by that
    organism
      • Constellation of proteins in cells is highly
        dynamic

80
The Human Genome
  • 24 chromosomes
  • Chromosomes range in size from 50 million to 250
    million base pairs
  • Total size of the human genome is over 3 billion
    base pairs (3.1647 billion)
  • 99.9% of all bases are the same in all people
  • Genes comprise only 2% of the total genome
  • Human genome is estimated to contain 30,000 to
    40,000 genes
  • Average gene size is about 3000 bases
  • Largest identified so far is 2.4 million bases
    (dystrophin)
  • Functions for less than 50% of genes and gene
    products are known
  • Remainder of genome is non-coding regions
  • Chromosomal structural integrity
  • Repetitive sequences
  • Regulation of protein production
  • Other functions that we don't know about
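
The figures on this slide can be turned into absolute numbers with a little arithmetic. The sketch below is a rough check using only the slide's own values; the variable and function names are made up for illustration.

```python
GENOME_BP = 3_164_700_000   # 3.1647 billion base pairs
SHARED_FRACTION = 0.999     # 99.9% of bases identical between people
GENE_FRACTION = 0.02        # genes make up about 2% of the genome
AVG_GENE_BP = 3000          # average gene size in bases

# Bases that differ between two people (the remaining 0.1%).
differing_bp = GENOME_BP * (1 - SHARED_FRACTION)
# Bases that fall inside genes if genes are 2% of the genome.
genic_bp = GENOME_BP * GENE_FRACTION


def gene_share(n_genes: int) -> float:
    """Fraction of the genome covered by n_genes genes of average size."""
    return n_genes * AVG_GENE_BP / GENOME_BP


print(f"differing bases: {differing_bp:,.0f}")
print(f"bases in genes at 2%: {genic_bp:,.0f}")
for n in (30_000, 40_000):
    print(f"{n:,} genes x {AVG_GENE_BP} bases = {gene_share(n):.1%} of the genome")
```

Multiplying the gene count by the average gene size gives roughly 3 to 4 percent, the same order of magnitude as the quoted 2 percent, which suggests the slide's numbers are rounded estimates rather than exact values.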

81
Human Genome Sequencing Project Goals
  • Determine the sequences of the 3 billion chemical
    base pairs that make up human DNA
  • Identify all the approximately 30,000 genes in
    human DNA
  • Store this information in databases
  • Improve tools for data analysis
  • Transfer related technologies to the private
    sector
  • Address the ethical, legal, and social issues
    that may arise from the project

Human Genome Project (DOE): http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
NIH: http://www.ncbi.nlm.nih.gov/genome/guide/human/
82
Other sequencing projects
  • Over 200 genomes sequenced
  • Range of archaea, bacteria, eukaryotic genomes
  • Organisms that have been well-studied in the
    laboratory
  • Organisms that are pathogenic to humans
  • Organisms of special scientific or technical
    interest

NCBI list of sequenced genomes (NIH): http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
83
Prokaryotes and eukaryotes
  • Prokaryotes (bacteria and archaea)
  • No true nucleus
  • DNA generally circular (one chromosome)
  • Eukaryotes
  • True nucleus contains (most) DNA
  • DNA linear and arranged in chromosomes

Phylogenetic analysis of small subunit ribosomal
RNAs, C. Woese, 1987
84
Anatomy of a prokaryotic genome
  • DNA compact and circular
  • ORFs (open reading frames) with start and stop
    codons
  • No introns
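
Because prokaryotic genes have no introns, a first-pass gene finder can simply scan for open reading frames. The following is a minimal Python sketch under simplifying assumptions: it searches only the three forward-strand frames, uses ATG as the only start codon, and the function name, `min_codons` parameter, and example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ORF on the forward strand.

    start is the 0-based index of the ATG; end is exclusive and includes
    the stop codon. min_codons counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

A fuller scan would also search the reverse complement strand and allow alternative start codons.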

85
Anatomy of a eukaryotic genome
  • Linear DNA chromosomes
  • Centromeres
  • Telomeres
  • Tandem repeats
  • Transposable elements
  • Introns
  • Pseudogenes

Example of chromosome maps: http://www.ncbi.nlm.nih.gov/genome/guide/human/
86
DNA sequencing
A G C T
  • Separate strands of DNA
  • Anneal primer to one strand
  • Replicate using fluorescently labeled ddNTPs (as
    opposed to normal dNTPs)
  • Separate fragments by size
  • Image gel for fluorescent labels

See also electropherogram, Fig. 2.2 (Mount)
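
The logic of the steps above (chain termination by labeled ddNTPs, size separation, reading the labels) can be sketched as a small Python simulation. It is an idealized toy: it assumes exactly one fragment per termination position and perfect size resolution, and the function names, template, and primer length are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def chain_terminated_fragments(template, primer_len):
    """One fragment per possible termination point past the primer.

    Each fragment is (length, terminal_base), where terminal_base is the
    fluorescently labeled ddNTP that stopped synthesis.
    """
    # The newly synthesized strand is the reverse complement of the template.
    new_strand = "".join(COMPLEMENT[b] for b in reversed(template))
    return [
        (n, new_strand[n - 1])
        for n in range(primer_len + 1, len(new_strand) + 1)
    ]


def read_gel(fragments):
    """Order fragments by size and read off the labeled terminal bases."""
    return "".join(base for _, base in sorted(fragments))


fragments = chain_terminated_fragments("GATTACAGGC", primer_len=3)
random.Random(1).shuffle(fragments)  # the reaction tube holds an unordered mix
print(read_gel(fragments))
```

The read covers the new strand downstream of the primer, so it is the reverse complement of the template minus the primed region.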
87
Methods of genome sequencing
  • Mapping method
  • Fragment chromosome
  • Identify markers and order them
  • Arrange fragments, then sequence
  • Shotgun method
  • Fragment chromosome
  • Sequence fragments, then arrange
  • cDNA sequencing (ESTs)
  • Isolate mRNA (expressed in cell)
  • Reverse transcribe mRNA to create cDNA
  • Sequence cDNA
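
The "sequence fragments, then arrange" step of the shotgun method can be illustrated with a greedy overlap merge. This is a toy Python sketch, not how production assemblers work: it assumes error-free reads from a single strand, and the function names, `min_len` threshold, and example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left; what remains are separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ACGTTAGC", "ATGGCGTA", "GCGTACGT"])
print(contigs)
```

Repetitive sequences are what make real assembly hard: identical overlaps in different places can lead a greedy merge to join the wrong fragments, which is one reason the mapping method orders fragments before sequencing.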

88
Maps
  • Gene map
  • Chromosome map
  • Sequence map
  • Maps important for obtaining sequence information
    (mapping method)
  • Restriction map
  • Contig (contiguous clone) map

NCBI map viewer: http://www.ncbi.nlm.nih.gov/mapview/
89
Prediction of genes
  • Method
  • Difference between prokaryotes and eukaryotes
  • Tests for validation of predictions

90
Genome Analysis
  • General approach (p. 492)
  • Comparative genomics
  • Self-comparison reveals gene families and
    duplication
  • Between-genome comparison reveals orthologs, gene
    families and domains
  • Gene ordering on chromosomes
  • Phylogenetic analysis
  • Genetic diversity