CS177 Review/Summary of the Madej lectures - PowerPoint PPT Presentation

1 / 52
About This Presentation
Title:

CS177 Review/Summary of the Madej lectures

Description:

Superfolds and examples Globin 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin; 1colA colicin -up-down 2hmqA hemerythrin; 256bA cytochrome B562; ... – PowerPoint PPT presentation

Number of Views:87
Avg rating:3.0/5.0
Slides: 53
Provided by: made66
Category:

less

Transcript and Presenter's Notes

Title: CS177 Review/Summary of the Madej lectures


1
CS177 Review/Summary of the Madej lectures
  • Tom Madej 12.07.05

2
Overview
  • Basic biology.
  • Protein/DNA sequence comparison.
  • Protein structure comparison/classification.
  • NCBI databases overview.
  • Miscellaneous topics.

3
(No Transcript)
4
Lodish et al. Molecular Cell Biology, W.H.
Freeman 2000
5
Protein/DNA sequence comparison
  • What is the meaning of a sequence alignment?
  • Scoring methods amino acid substitution
    matrices, PSSMs.
  • Basic computational methods e.g. BLAST.
  • Know how to run PSI-BLAST, interpret the results.

6
Homology
  • whenever statistically significant sequence or
    structural similarity between proteins or protein
    domains is observed, this is an indication of
    their divergent evolution from a common ancestor
    or, in other words, evidence of homology.
  • E.V. Koonin and M.Y. Galperin, Sequence
    Evolution Function, Kluwer 2003

7
(No Transcript)
8
A simple phylogenetic tree
9
Human hemoglobin and more distantly related
globins
  • Human and horse
  • Human and fish
  • Human and insect
  • Human and bacteria

10
Alignment notation different notations for the
same alignment!
VISDWNMPN-------MDGLE CILVV----AANDGPMPQTRE
VISDWnm---pnMDGLE CILVVaandgpmPQTRE
11
Computing sequence alignments
  • You must be able to recognize the answer
    (correct alignment) when you see it (scoring
    system).
  • You must be able to find the answer i.e. compute
    it efficiently.

12
Scoring and computing alignments
  • Position independent amino acid substitution
    tables e.g. BLOSUM62.
  • Global alignment algorithms such as
    Smith-Waterman (dynamic programming) or fast
    heuristics such as BLAST.

13
(No Transcript)
14
Score this alignment
VISDWnm---pnMDGLE CILVVaandgpmPQTRE
Use BLOSUM62 matrix gap opening penalty 10 gap
extension penalty 1
(-1 4 2 3 3) 10 111 (-2 0 2 2
5) -27
15
BLAST (Basic Local Alignment Search Tool)
  • Extremely fast, can be on the order of 50-100
    times faster than Smith-Waterman.
  • Method of choice for database searches.
  • Statistical theory for significance of results
    (extreme value distribution).
  • Heuristic does not guarantee optimal results.
  • Many variants, e.g. PHI-, PSI-, RPS-BLAST.

16
Why database searches?
  • Gene finding.
  • Assigning likely function to a gene.
  • Identifying regulatory elements.
  • Understanding genome evolution.
  • Assisting in sequence assembly.
  • Finding relations between genes.

17
Issues in database searches
  • Speed.
  • Relevance of the search results (selectivity).
  • Recovering all information of interest
    (sensitivity).
  • The results depend on the search parameters, e.g.
    gap penalty, scoring matrix.
  • Sometimes searches with more than one matrix
    should be performed.

18
E-values, P-values
  • E-value, Expectation value this is the expected
    number of hits of at least the given score, that
    you would expect by random chance for the search
    database.
  • P-value, Probability value this is the
    probability that a hit would attain at least the
    given score, by random chance for the search
    database.
  • E-values are easier to interpret than P-values.
  • If the E-value is small enough, e.g. no more than
    0.10, then it is essentially a P-value.

19
PSI-BLAST
  • Position Specific Iterated BLAST
  • As a first step runs a (regular) BLAST.
  • Hits that cross the threshold are used to
    construct a position specific score matrix
    (PSSM).
  • A new search is done using the PSSM to find more
    remotely related sequences.
  • The last two steps are iterated until convergence.

20
PSSM (Position Specific Score Matrix)
  • One column per residue in the query sequence.
  • Per-column residue frequencies are computed so
    that log-odds scores may be assigned to each
    residue type in each column.
  • There are difficulties e.g. pseudo-counts are
    needed if there are not a lot of sequences, the
    sequences must be weighted to compensate for
    redundancy.

21
Two key advantages of PSSMs
  • More sensitive scoring because of improved
    estimates of probabilities for a.a.s at specific
    positions.
  • Describes the important motifs that occur in the
    protein family and therefore enhances the
    selectivity.

22
Position Specific Substitution Rates
Weakly conserved serine
Active site serine
23
Position Specific Score Matrix (PSSM)
A R N D C Q E G H I L K M
F P S T W Y V 206 D 0 -2 0 2 -4 2 4
-4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G
-2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2
-1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1
-4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3
-3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1
-4 0 210 D -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6
-4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4
-4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3
212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0
-7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0
-2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G
-2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3
-5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5
-7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4
-2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6
-5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7
-5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5
-6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7
9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6
-7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N
-1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2
-1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1
-1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1
4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3
-4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1
-2 -2 -3 0 -2 -2 -2 -3
Serine scored differently in these two positions
Active site nucleophile
24
PSI-BLAST key points
  • The first PSSM is constructed from all hits that
    cross the significance threshold using standard
    BLAST.
  • The search is then carried out with the PSSM to
    draw in new significant hits.
  • If new hits are found then a new PSSM is
    constructed these last two steps are iterated.
  • The computation terminates upon convergence,
    i.e. when no new sequences are found to cross the
    significance threshold.

25
Protein structure comparison/classification
  • Protein secondary structure elements.
  • Supersecondary structures (simple structure
    motifs).
  • Folds and domains.
  • Comparing structures (VAST).
  • Superfolds.
  • Fold classification (SCOP).
  • Conserved Domain Database (CDD).

26
a-helix (3chy)
backbone atoms
with sidechains
27
Parallel ß-strands (3chy)
28
Anti-parallel ß-strands (1hbq)
29
Higher level organization
  • A single protein may consist of multiple domains.
    Examples 1liy A, 1bgc A. The domains may or
    may not perform different functions.
  • Proteins may form higher-level assemblies.
    Useful for complicated biochemical processes that
    require several steps, e.g. processing/synthesis
    of a molecule. Example 1l1o chains A, B, C.

30
Supersecondary structures
  • ß-hairpin
  • a-hairpin
  • ßaß-unit
  • ß4 Greek key
  • ßa Greek key

31
Supersecondary structure simple units
G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981
32
Supersecondary structure Greek key motifs
G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981
33
Protein folds
  • There is a continuum of similarity!
  • Fold definition two folds are similar if they
    have a similar arrangement of SSEs (architecture)
    and connectivity (topology). Sometimes a few
    SSEs may be missing.
  • Fold classification To get an idea of the
    variety of different folds, one must adjust for
    sequence redundancy and also try to correctly
    assign homologs that have low sequence identity
    (e.g. below 25).

34
Vector Alignment Search Tool (VAST)
  • Fast structure comparison based on representing
    SSEs by vectors.
  • A measure of statistical significance (VAST
    E-value) is computed (very differently from a
    BLAST E-value).
  • VAST structure neighbor lists useful for
    recognizing structural similarity.

35
Superfolds (Orengo, Jones, Thornton)
  • Distribution of fold types is highly non-uniform.
  • There are about 10 types of folds, the
    superfolds, to which about 30 of the other folds
    are similar. There are many examples of
    isolated fold types.
  • Superfolds are characterized by a wide range of
    sequence diversity and spanning a range of
    non-similar functions.
  • It is a research question as to the evolutionary
    relationships of the superfolds, i.e. do they
    arise by divergent or convergent evolution?

36
Superfolds and examples
  • Globin 1hlm sea cucumber hemoglobin 1cpcA
    phycocyanin 1colA colicin
  • a-up-down 2hmqA hemerythrin 256bA cytochrome
    B562 1lpe apolipoprotein E3
  • Trefoil 1i1b interleukin-1ß 1aaiB ricin 1tie
    erythrina trypsin inhibitor
  • TIM barrel 1timA triosephosphate isomerase 1ald
    aldolase 5rubA rubisco
  • OB fold 1quqA replication protein A 32kDa
    subunit 1mjc major cold-shock protein 1bcpD
    pertussis toxin S5 subunit
  • a/ß doubly-wound 5p21 Ras p21 4fxn flavodoxin
    3chy CheY
  • Immunoglobulin 2rhe Bence-Jones protein 2cd4
    CD4 1ten tenascin
  • UB aß roll 1ubq ubiquitin 1fxiA ferredoxin 1pgx
    protein G
  • Jelly roll 2stv tobacco necrosis virus 1tnfA
    tumor necrosis factor 2ltnA pea lectin
  • Plaitfold (Split aß sandwich) 1aps
    acylphosphatase 1fxd ferredoxin 2hpr
    histidine-containing phosphocarrier

37
Fold classification (when you have the structure)
  • First, look up PubMed abstracts for any relevant
    papers. E.g. if this is from a PDB file there
    will be references in it.
  • Try checking SCOP or CATH.
  • Look at VAST neighbors. See if the structure in
    question is highly similar to another structure
    with a known fold.

38
SCOP (Structural Classification of Proteins)
  • http//scop.mrc-lmb.cam.ac.uk/scop/
  • Levels of the SCOP hierarchy
  • Family clear evolutionary relationship
  • Superfamily probable common evolutionary origin
  • Fold major structural similarity

39
(No Transcript)
40
Bioinformatics databases
  • Entrez is by far the most useful, because of the
    links between the individual databases, e.g.
    literature, sequence, structure, taxonomy, etc.
  • Other specialty databases available on the
    internet can also be very useful, of course!

41
The (ever expanding) Entrez System
NLM Catalog
PubChem
Compounds
BioAssays
Substances
Literature
Organism
Expression
HomoloGene
Gene
42
Links Between and Within Nodes
Word weight
Computational
3-D Structure
3 -D Structures
VAST
Phylogeny
Computational
Protein sequences
BLAST
BLAST
Computational
Computational
43
Entrez queries
  • Be able to formulate queries using index terms
    (Preview/Index), and limits.

44
(No Transcript)
45
Exercises!
  • How many protein structures are there that
    include DNA and are from bacteria?
  • In PubMed, how many articles are there from the
    journal Science and have Alzheimer in the title
    or abstract, and amyloid beta anywhere? How
    many since the year 2000?
  • Notice that the results are not 100 accurate!
  • In 3D Domains, how many domains are there with no
    more than two helices and 8 to 10 strands and are
    from the mouse?

46
P53 tumor suppressor protein
  • Li-Fraumeni syndrome only one functional copy of
    p53 predisposes to cancer.
  • Mutations in p53 are found in most tumor types.
  • p53 binds to DNA and stimulates another gene to
    produce p21, which binds to another protein cdk2.
    This prevents the cell from progressing thru the
    cell cycle.

47
G. Giglia-Mari, A. Sarasi, Hum. Mutat. (2003) 21
217-228.
48
Exercise!
  • Use Cn3D to investigate the binding of p53 to
    DNA.
  • Formulate a query for Structure that will require
    the DNA molecules to be present (there are 2
    structures like this).

49
Miscellaneous topics
  • BLAST a sequence against a genome locate hits on
    chromosomes with map viewer.
  • Obtain genomic sequence with map viewer.
  • Spidey to predict intron/exon structure (but we
    wont use spidey on the exam!).
  • How sequence variations can affect protein
    structure/function.

50
EST exercise summary
  • BLAST the EST (or other DNA seq) against the
    genome.
  • From the BLAST output you can get the genomic
    coordinates of any nucleotide differences.
  • Use map viewer to locate the hit on a chromosome
    assume the hit is in the region of a gene.
  • By following the gene link you can get an
    accession for mRNA.
  • By using the dl link you can get an accession
    for the genomic sequence.
  • Use spidey with the mRNA and genomic sequence
    to locate changed residues in the protein.

51
EST exercise summary (cont.)
  • From the gene report you can follow the protein
    link, and then Blink.
  • From the BLAST link page you can get to CDD and
    related structures.
  • Since you know where are the changed residues you
    can use the structures to study what effect the
    changes might have on the function of the protein.

52
Gene variants that can affect protein function
  • Mutation to a stop codon truncates the protein
    product!
  • Insertion/deletion of multiple bases changes the
    sequence of amino acid residues.
  • Single point change could alter folding
    properties of the protein.
  • Single point change could affect the active site
    of the protein.
  • Single point change could affect an interaction
    site with another molecule.

53
Important note!
  • Most diseases (e.g. cancer) are complex and
    involve multiple factors (not just a single
    malfunctioning protein!).

54
Investigating a genetic disease
  • The following EST comes from a hemochromatosis
    patient your task is to identify the gene and
    specific mutation causing the illness, and why
    the protein is not functioning properly.
  • The sequence
  • TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG
  • TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA
  • ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT
  • GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA
  • TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG
  • GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC
  • TGGATCAGCCCCTCATTGTGATCTGGG

55
ESTs
  • Expressed Sequence Tags useful for discovering
    genes, obtaining data on gene expression/regulatio
    n, and in genome mapping.
  • Short nucleotide sequences (200-500 bases or so)
    derived from mRNA expressed in cells.
  • The introns from the genes will already be
    spliced out.
  • mRNA is unstable, however, and so it is reverse
    transcribed into cDNA.

56
Hemochromatosis 2
  • BLAST the EST vs. the Human genome (could take a
    few minutes).
  • - Which chromosome is hit?
  • - What is the contig that is hit (reference
    assembly)?
  • - Is the EST identical to the genomic sequence?
  • - Take note of the coords of the difference.
  • Click on Genome View.
  • Select the map element at the bottom
    corresponding to the contig.

57
Hemochromatosis 3
  • What gene is hit? Zoom in on the BLAST hit a few
    times.
  • Display the entire gene sequence vi dl and
    Display.
  • Copy and save the genomic sequence.
  • Record the coords for the start of the genomic
    sequence.

58
Hemochromatosis 4
  • Click on a UniGene link Hs.233325.
  • Note Expression profile presents data for the
    expression level of the gene in various tissues.
  • How many mRNAs and ESTs are there for the HFE
    gene?
  • Take note of the mRNA accession NM_000410.

59
Hemochromatosis 5
  • Go to spidey http//www.ncbi.nlm.nih.gov/spidey
    /
  • To determine the intron/exon structure, paste the
    HFE gene sequence into the upper box, and enter
    the HFE mRNA accession NM_000410 in the lower
    box.
  • Click Align.

60
Hemochromatosis 6
  • How many exons are there?
  • Which exon codes the residue that is changed in
    the original EST? (You have to do a little
    arithmetic!)
  • Record some of the protein sequence around the
    changed residue EQRYTCQVEHPG

61
Hemochromatosis 7
  • From the Map Viewer page click on the HFE gene
    link.
  • How many HFE transcripts are there? Which is the
    longest isoform?
  • Follow Links to Protein and then to the
    report for NP_000410.
  • Determine the residue number that corresponds to
    the mutation.

62
RNA splicing and isoforms
63
Hemochromatosis 8
  • What effect does the mutation in the original EST
    have on the protein? (Look at the table for the
    Genetic Code.)
  • Go back to the Gene Report read the summary and
    take note of the GeneRIF bibliography.
  • Now go to Links and then to GeneView in dbSNP
    to a list of known SNPs.

64
Hemochromatosis 9
  • In the SNP list note that the one you want is
    currently shown.
  • Select view rs in gene region and then click on
    view rs.
  • How many nonsynonomous substitutions do you see?
  • Do you see the one we are particularly interested
    in?

65
Digression SNPs
  • Single Nucleotide Polymorphisms.
  • A single base change that can occur in a persons
    DNA.
  • On average SNPs occur about 1 of the time, most
    are outside of protein coding regions.
  • Some SNPs may cause a disease some may be
    associated with a disease others may affect
    disposition to a disease others may be simple
    genetic variation.
  • dbSNP archives SNPs and other variations such as
    small-scale deletion/insertion polymorphisms
    (DIPs), etc.

66
(No Transcript)
67
Hemochromatosis 10
  • Back to the Gene Report, click on Links and go
    to OMIM (can also get there via the Map
    Viewer).
  • In the OMIM entry you can read a bit also click
    on View List for Allelic Variants, where you
    can see the mutation again.

68
Hemochromatosis 11
  • From the Gene Report again follow Links to
    Protein and scroll down to NP_000401.
  • Click on Domains and then Show Details.
  • What is the Conserved Domain in the region of
    interest?
  • Follow the link to the CD.
  • Click on View 3D Structure.

69
Hemochromatosis 12
  • Look for residue position 282 in the query
    sequence.
  • Highlight that column.
  • Is the Cys282 conserved in the family?
  • The C282Y mutation therefore likely has the
    effect of
Write a Comment
User Comments (0)
About PowerShow.com