Computing Patterns in Biology - PowerPoint PPT Presentation

Provided by: med57

Transcript and Presenter's Notes

Title: Computing Patterns in Biology


1
Computing Patterns in Biology
  • Stuart M. Brown
  • New York University School of Medicine

2
Why Compute Biological Patterns?
  • Because we can
  • (computer scientists love to find interesting
    problems)
  • patterns are beautiful
  • It's practical - helps with gene cloning
    experiments, predicting functions of new proteins
  • Systems biology - figure out circuits of
    regulation, predict outcome of changes, design
    new biological systems

3
Overview
  • DNA Patterns
  • Restriction sites
  • Finding genes in DNA sequences
  • Regulatory sites in DNA
  • Protein Patterns
  • signals (transport and processing)
  • Protein functional Motifs
  • Protein families
  • Protein 3-D structure

4
DNA Information Content
  • Just a 4 letter alphabet (GATC)
  • Encodes proteins with 3 letter codons
  • Punctuation determines transcription starts and
    stops
  • Transcriptional regulation (promoters, enhancers,
    etc.)

5
Restriction Sites
  • Bacteria make restriction enzymes that cut DNA
    at specific sequences (4-8 base patterns)
  • Very simple to find these patterns - can even use
    the Find function of your web browser or word
    processor
  • Exact matches only - these sites never vary
  • Open any page of text and look for CAT
  • you now have a restriction site search program!
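The "Find function" idea above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's method: the function name find_sites is hypothetical, and the search is a plain exact-match scan of one strand only.

```python
def find_sites(sequence, site):
    """Return 0-based start positions of every exact occurrence of site."""
    sequence = sequence.upper()
    site = site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are reported too.
        start = sequence.find(site, start + 1)
    return positions


# The slide's own example: look for CAT in any page of text.
print(find_sites("the cat sat on the catalog", "cat"))  # prints [4, 19]
```

The same call works for a real recognition sequence such as GAATTC (EcoRI) on a DNA string. A fuller search would also scan the reverse complement for sites that are not palindromic.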

6
NEBcutter2
  • http://tools.neb.com/NEBcutter2/

7
Finding Genes in Genomic DNA
  • Translate (in all 6 reading frames) and look for
    similarity to known protein sequences
  • Look for long Open Reading Frames (ORFs) between
    start and stop codons (start = ATG, stop = TAA,
    TAG, TGA)
  • Look for known gene markers
  • TAATAA box, intron splice sites, etc.
  • Statistical methods (codon preference)
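The ORF approach listed above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the min_codons threshold are hypothetical, only ATG...stop ORFs are reported, and introns are ignored, so it suits bacterial-style sequence far better than eukaryotic genomic DNA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for each ATG...stop ORF on one strand.

    start is the index of the A in ATG; end is exclusive and includes
    the stop codon. min_codons counts codons before the stop.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None


def find_orfs(seq, min_codons=30):
    """Search all 6 reading frames; return (strand, frame, start, end) tuples.

    Coordinates for the "-" strand are relative to the reverse complement.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(s, min_codons):
            results.append((strand, frame, start, end))
    return results


print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Raising min_codons reduces chance ORFs in random sequence at the cost of missing short genes. Real gene finders add codon-preference statistics and splice-site models on top of this.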

8
GCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGCCTTGTACAA
AAATCAACTCCAGATGGATCTAAGATTTAAATCTAACACCTGAAACCATA
AAAATTCTAGGAGATAACACTGGCAAAGCTATTCTAGACATTGGCTTAGG
CAAAGAGTTCGTGACCAAGAACCCAAAAGCAAATGCAACAAAAACAAAAA
TAAATAGGTGGGACCTGATTAAACTGAAAAGCCTCTGCACAGCAAAAGAA
ATAATCAGCAGAGTAAACAGACAACCCACAGAATGAGAGAAAATATTTGC
AAACCATGCATCTGATGACAAAGGACTAATATCCAGAATCTACAAGGAAC
TCAAACAAATCAGCAAGAAAAAAATAACCCCATCAAAAAGTGGGCAAAGG
AATGAATAGACAATTCTCAAAATATACAAATGGCCAATAAACATACGAAA
AACTGTTCAACATCACTAATTATCAGGGAAATGCAAATTAAAACCACAAT
GAGATGCCACCTTACTCCTGCAAGAATGGCCATAATAAAAAAAAATCAAA
AAAGAATAAATGTTGGTGTGAATGTGGTGAAAAGAGAACACTTTGACACT
GCTGGTGGGAATGGAAACTAGTACAACCACTGTGGAAAACAGTACCGAGA
TTTCTTAAAGAACTACAAGTAGAACTACCATTTGATCCAGCAATCCCACT
ACTGGGTATCTACCCAGAGGAAAAGAAGTCATTATTTGAAAAAGACACTT
GTACATACATGTTTATAGCAGCACAATTTGCAATTGCAAAGATATGGAAC
CAGTCTAAATGCCCATCAACCAACAAATGGATAAAGAAAATATGGTATAT
ATACACCATGGAACACTACTCAGCCATAAAAAGGAACAAAATAATGGCAA
CTCACAGATGGAGTTGGAGACCACTATTCTAAGTGAAATAACTCAGGAAT
GGAAAACCAAATATTGTATGTTCTCACTTATAAGTGGGAGCTAAGCTATG
AGGACAAAAGGCATAAGAATTATACTATGGACTTTGGGGACTCGGGGGAA
AGGGTGGGAGGGGGATGAGGGACAAAAGACTACACATTGGGTGCAGTGTA
CACTGCTGAGGTGATGGGTGCACCAAAATCTCAGAAATTACCACTAAAGA
ACTTATCCATGTAACTAAAAACCACCTCTACCCAAATAATTTTGAAATAA
AAAATAAAAATATTTTAAAAAGAACTCTTTAAAATAAATAATGAAAAGCA
CCAACAGACTTATGAACAGGCAATAGAAAAAATGAGAAATAGAAAGGAAT
ACAAATAAAAGTACAGAAAAAAAATATGGCAAGTTATTCAACCAAACTGG
TAATTTGAAATCCAGATTGAAATAATGCAAAAAAAAGGCAATTTCTGGCA
CCATGGCAGACCAGGTACCTGGATGATCTGTTGCTGAAAACAACTGAAAA
TGCTGGTTAAAATATATTAACACATTCTTGAATACAGTCATGGCCAAAGG
AAGTCACATGACTAAGCCCACAGTCAAGGAGTGAGAAAGTATTCTCTACC
TACCATGAGGCCAGGGCAAGGGTGTGCACTTTTTTTTTTCTTCTGTTCAT
TGAATACAGTCACTGTGTATTTTACATACTTTCATTTAGTCTTATGACAA
TCCTATGAAACAAGTACTTTTAAAAAAATTGAGATAACAGTTGCATACCG
TGAAATTCATCCATTTAAAGTGAGCAATTCACAGGTGCAGCTAGCTCAGT
CAGCAGAGCATAAGACTCTTAAAGTGAACAATTCAGTGCTTTTTAGTATA
TTCACAGAGTTGTGCAACCATCACCACTATCTAATTGGTCTTAGTCTGTT
TGGGCTGCCATAACAAAATACCACAAACTGGATAGCTCATAAACAACAGG
CATTTATTGCTCACAGTTCTAGAGGCTGGAAGTGCAAGATTAAGATGCCA
GCAGATTCTGTGTCTGCTGAGGGCCTGTTCCTCATAGAAGGTGCCCTCTT
GCTGAATTCTCACATGGTGGAAGGGGGAAAACAAGCTTGCATTGCAAAGA
GGTGGGCCTCTTTAATCCCAAAGGCCCCACCTCTAAAAGGCCCCACTTCT
GAATACCATTACATTGAGAATTAAGTTTCAACATAGGAATTTGGGGGAAC
ACAAATATCCAGACTGTAGCATAATTCCAGAACGGATTCAT
9
Intron/Exon structure
  • Gene finding programs work well in bacteria
  • None of the gene prediction programs do an
    adequate job predicting intron/exon boundaries
  • The only reasonable gene models are based on
    alignment of cDNAs to genome sequence
  • Perhaps 50% of all human genes still do not have
    a correct coding sequence defined
  • (transcription start, intron splice sites)

10
(No Transcript)
11
Truth?
  • There may not be a "correct" answer to the gene
    finding problem
  • Some genes have more than one start and stop
    position on the DNA
  • Alternative splicing
  • (a portion of the DNA is sometimes in an exon,
    sometimes in an intron)
  • Pseudogenes - look like genes, but no longer
    function
  • All computational gene predictions need to be
    experimentally verified

12
Gene Finding on the Web
  • GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
  • http://compbio.ornl.gov/grailexp
  • ORFfinder: NCBI
  • http://www.ncbi.nlm.nih.gov/gorf/gorf.html
  • DNA translation: Univ. of Minnesota Med. School
  • http://alces.med.umn.edu/webtrans.html
  • GenLang
  • http://cbil.humgen.upenn.edu/sdong/genlang.html
  • BCM GeneFinder: Baylor College of Medicine,
    Houston, TX
  • http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
  • http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

13
Genomic Sequence
  • Once each gene is located on the chromosome, it
    becomes possible to get upstream genomic sequence
  • This is where transcription factor (TF) binding
    sites are located
  • promoters and enhancers
  • Search for known TF sites, and discover new ones
    (among co-regulated genes)

14
Phage CRO repressor bound to DNA
(Andrew Coulson and Roger Sayle with RasMol, Univ. of Edinburgh, 1993)
15
Websites for Promoter finding
  • Promoter Scan: NIH Bioinformatics (BIMAS)
  • http://bimas.dcrt.nih.gov/molbio/proscan/
  • Promoter Scan II: Univ. of Minnesota and Axys
    Pharmaceuticals
  • http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
  • Signal Scan: NIH Bioinformatics (BIMAS)
  • http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
  • Transcription Element Search (TESS): Center for
    Bioinformatics, Univ. of Pennsylvania
  • http://www.cbil.upenn.edu/tess/
  • Search TransFac at GBF with MatInspector,
    PatSearch, and FunSiteP
  • http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
  • TargetFinder: Telethon Inst. of Genetics and
    Medicine, Milan, Italy
  • http://hercules.tigem.it/TargetFinder.html

16
Many DNA Regulatory Sequences are Known
  • Databases of promoters, enhancers, etc.
  • TransFac: the Transcription Factor database
  • 4342 entries w/ known protein binding and
    transcriptional regulatory functions
  • Maintained by Gesellschaft für Biotechnologische
    Forschung mbH (Braunschweig, Germany)
  • The Eukaryotic Promoter Database (EPD)
  • Bucher & Trifonov (1986) NAR 14: 10009-26
  • 1314 entries taken directly from scientific
    literature
  • Maintained by ISREC (Lausanne, Switzerland) as a
    subset of the EMBL

17
DE  IFI-6-16 (interferon-induced gene 6-16); G000176.
SQ  gGGAAAaTGAAACT
SF  -127
ST  -89
BF  T00428; ISGF-3; Quality: 6; Species: human, Homo sapiens.
TF Binding sites lack information
  • Most TF binding sites are determined by just a
    few base pairs (typically 6-12)
  • Sequence is variable (consensus)
  • This is not enough information for proteins to
    locate unique promoters for each gene in a 3
    billion base genome
  • TF's bind cooperatively and combinatorially
  • The key is in the location in relation to each
    other and to the transcription units of genes
  • Can use multiple alignments to predict binding
    sites
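One way to turn a multiple alignment of known binding sites into a predictor is a position weight matrix. The sketch below is illustrative only: the function names, the pseudocount, the uniform 0.25 background, and the toy TATA-like sites are all assumptions, not data from TransFac or EPD.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos].upper() for s in sites]
        total = len(column) + 4 * pseudocount
        scores = {}
        for base in "ACGT":
            # The pseudocount keeps unseen bases from scoring minus infinity.
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; return (score, position), best first."""
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        # Characters outside ACGT (e.g. N) contribute a neutral score of 0.
        score = sum(pwm[j].get(base, 0.0) for j, base in enumerate(window))
        hits.append((score, i))
    return sorted(hits, reverse=True)


# Hypothetical aligned sites, for illustration only.
toy_sites = ["TATAAA", "TATAAA", "TATATA"]
best_score, best_pos = scan("GGGTATAAAGGG", build_pwm(toy_sites))[0]
print(best_pos, round(best_score, 2))
```

Because short matrices match often by chance in a large genome, scores are usually combined with position relative to the transcription start and with neighboring sites, as the slide notes.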

18
Sequence Logos
19
Pattern Finding Tools
  • Simple pattern search: perfect matches only
  • Regular expression: defined sets of mismatches
  • Fuzzy match: allow a specified number of
    mismatches in any location
  • Matrix: use letter frequency from multiple
    alignment
  • HMM: more complex matrix that uses info from
    adjacent pairs of letters
  • Challenges: sensitivity and false positives
    (and the ability to search large amounts of data)
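The fuzzy-match idea (a pattern that tolerates a set number of mismatches at any position) can be sketched as a sliding-window comparison. The function name fuzzy_find is hypothetical, and this simple version counts substitutions only, not insertions or deletions.

```python
def fuzzy_find(sequence, pattern, max_mismatches=1):
    """Return (position, mismatches) for each window within the mismatch limit."""
    sequence = sequence.upper()
    pattern = pattern.upper()
    hits = []
    for i in range(len(sequence) - len(pattern) + 1):
        window = sequence[i:i + len(pattern)]
        # Count positions where window and pattern differ (Hamming distance).
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


print(fuzzy_find("AAGAATTCAAGACTTC", "GAATTC", max_mismatches=1))
# prints [(2, 0), (10, 1)]
```

Setting max_mismatches=0 reduces this to the simple exact search. Each extra allowed mismatch raises sensitivity but also the number of false positives.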
20
Tools to find patterns in DNA
  • Signal Scan, Promoter Scan - Mac, Windows, Unix
  • (Dr. Dan S. Prestridge, Univ. of Minnesota)
  • EMBOSS tools - Unix
  • tfscan: scans DNA sequences for transcription
    factors
  • fuzznuc: nucleic acid pattern search
  • fuzzpro: protein pattern search
  • fuzztran: translate DNA -> protein, search for
    protein patterns
  • restrict: finds restriction enzyme cleavage sites
  • repeats (G. Benson) - tandem repeats
  • palindrome - inverted repeats
  • REPuter (whole genome repeat search) Unix

21
Protein Sequence
22
Protein Sequence Analysis
  • Molecular properties (pH, mol. wt., isoelectric
    point, hydrophobicity)
  • Simple Motifs (signal peptide, coiled-coil,
    trans-membrane, etc.)
  • Protein Families
  • Secondary Structure (helix vs. beta-sheet)
  • 3-D prediction, Threading

23
Simple Motifs
  • Common structural motifs
  • Membrane spanning
  • Signal peptide
  • Coiled coil
  • Helix-turn-helix

24
Web Sites for Simple Protein Analysis
  • Protein Hydrophobicity Server: Bioinformatics
    Unit, Weizmann Institute of Science, Israel
  • http://bioinformatics.weizmann.ac.il/hydroph/
  • SAPS - statistical analysis of protein sequences:
    hydrophobic and transmembrane segments,
    cysteine spacings, repeats and periodicity
  • http://www.isrec.isb-sib.ch/software/SAPS_form.html

25
Protein Signal Peptides
  • Proteins are sorted within the cell using 15-25
    amino acid tags at their N-terminal end (beginning)
  • Chopped off once they reach their destination

26
Some Signal Peptides
27
Protein Signal Prediction
  • ChloroP - Prediction of chloroplast transit
    peptides
  • LipoP - Prediction of lipoproteins and signal
    peptides in Gram negative bacteria
  • MITOPROT - Prediction of mitochondrial targeting
    sequences
  • PATS - Prediction of apicoplast targeted
    sequences
  • PlasMit - Prediction of mitochondrial transit
    peptides in Plasmodium falciparum
  • Predotar - Prediction of mitochondrial and
    plastid targeting sequences
  • PTS1 - Prediction of peroxisomal targeting signal
    1 containing proteins
  • SignalP - Prediction of signal peptide cleavage
    sites

28
EMBOSS Protein Analysis Tools
Program name     Description
antigenic        Finds antigenic sites in proteins
digest           Protein proteolytic enzyme or reagent cleavage digest
epestfind        Finds PEST motifs as potential proteolytic cleavage sites
fuzzpro          Protein pattern search
fuzztran         Protein pattern search after translation
helixturnhelix   Report nucleic acid binding motifs
Pepcoil          Predicts coiled-coil regions
oddcomp          Find protein sequence regions with a biased composition
patmatdb         Search a protein sequence with a motif
patmatmotifs     Search a PROSITE motif database with a protein sequence
tmap             Predicts membrane spanning regions
preg             Regular expression search of a protein sequence
pscan            Scans proteins using PRINTS
sigcleave        Reports protein signal cleavage sites
emast            Motif detection
meme             Motif detection
Profit           Scan a sequence or database with a matrix or profile
Prophecy         Creates matrices/profiles from multiple alignments
Prophet          Gapped alignment for profiles
29
Web servers that predict these structures
  • Predict Protein server: EMBL Heidelberg
  • http://www.embl-heidelberg.de/predictprotein/
  • SOSUI: Tokyo Univ. of Ag. Tech., Japan
  • http://www.tuat.ac.jp/mitaku/adv_sosui/submit.html
  • TMpred (transmembrane prediction): ISREC (Swiss
    Institute for Experimental Cancer Research)
  • http://www.isrec.isb-sib.ch/software/TMPRED_form.html
  • COILS (coiled coil prediction): ISREC
  • http://www.isrec.isb-sib.ch/software/COILS_form.html
  • SignalP (signal peptides): Tech. Univ. of Denmark
  • http://www.cbs.dtu.dk/services/SignalP/

30
Protein Domains/Motifs
  • Proteins are built out of functional units known
    as domains (or motifs)
  • These domains have conserved sequences
  • Often much more similar than their respective
    proteins
  • Exon splicing theory (W. Gilbert)
  • Exons correspond to folding domains which in
    turn serve as functional units
  • Unrelated proteins may share a single similar
    exon (e.g., ATPase or DNA binding function)

31
Protein Domains (Pattern analysis)
32
Motifs are built from Multiple Alignments
33
Protein Motif Databases
  • Known protein motifs have been collected in
    databases
  • Best database is PROSITE
  • The Dictionary of Protein Sites and Patterns
  • maintained by Amos Bairoch, at the Univ. of
    Geneva, Switzerland
  • contains a comprehensive list of documented
    protein domains constructed by expert molecular
    biologists
  • Alignments and patterns built by hand!

34
PROSITE is based on Patterns
  • Each domain is defined by a simple pattern
  • Patterns can have alternate amino acids in each
    position and defined spaces, but no gaps
  • Pattern searching is by exact matching, so any
    new variant will not be found (can allow
    mismatches, but this weakens the algorithm)
  • Grep
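Since PROSITE patterns are essentially regular expressions, a grep-style search can be sketched by translating the pattern syntax into a Python regex. The function names below are hypothetical and the translation covers only the common syntax elements (x, [..], {..}, (n) or (n,m), < and >). The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the test protein string is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repeat count.
        element = element.replace("(", "{").replace(")", "}")
        # x is any residue; < and > anchor to the sequence ends.
        element = element.replace("x", ".").replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motif(protein, pattern):
    """Return (position, matched text) for every match, overlaps included."""
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein.upper())]


print(prosite_to_regex("N-{P}-[ST]-{P}."))            # prints N[^P][ST][^P]
print(find_motif("MKNASAANGTW", "N-{P}-[ST]-{P}."))   # prints [(2, 'NASA'), (7, 'NGTW')]
```

As the slide says, this is exact matching against the pattern: a new variant with a residue outside the listed alternatives is simply not found.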

35
(No Transcript)
36
(No Transcript)
37
Tools for Pattern searches
  • Free Mac program: MacPattern
  • ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx
  • Free PC program (DOS): PATMAT
  • ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos
  • EMBOSS: fuzzpro

38
Websites for PROSITE Searches
  • ScanProsite at ExPASy: Univ. of Geneva
  • http://expasy.hcuge.ch/sprot/scnpsit1.html
  • Network Protein Sequence Analysis: Institut de
    Biologie et Chimie des Protéines, Lyon, France
  • http://pbil.ibcp.fr/NPSA/npsa_prosite.html
  • PPSRCH: EBI, Cambridge, UK
  • http://www2.ebi.ac.uk/ppsearch/

39
Profiles
  • Profiles are tables of amino acid frequencies at
    each position in a motif
  • They are built from multiple alignments
  • PROSITE entries also contain profiles built from
    an alignment of proteins that match the pattern
  • Profile searching is more sensitive than pattern
    searching - uses an alignment algorithm, allows
    gaps

40
(No Transcript)
41
Websites for Profile searching
  • PROSITE ProfileScan: ExPASy, Geneva
  • http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
  • BLOCKS (builds profiles from PROSITE entries and
    adds all matching sequences in SwissProt): Fred
    Hutchinson Cancer Research Center, Seattle,
    Washington, USA
  • http://www.blocks.fhcrc.org/blocks_search.html
  • PRINTS (profiles built from automatic alignments
    of OWL non-redundant protein databases)
  • http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi

42
More Protein Motif Databases
  • PFAM (1344 protein family HMM profiles built by
    hand): Washington Univ., St. Louis
  • http://pfam.wustl.edu/hmmsearch.shtml
  • ProDom (profiles built from PSI-BLAST automatic
    multiple alignments of the SwissProt database):
    INRA, Toulouse, France
  • http://www.toulouse.inra.fr/prodom/doc/blast_form.html
  • This is my favorite protein database - nicely
    colored results

43
Sample ProDom Output
44
Psi-BLAST
  • Use BLAST to find a group of sequences that share
    a region of similarity with a seed sequence
  • Build a profile from the alignment at this region
  • Use the profile to make a more sensitive search
    of the database for more matches
  • Rebuild the alignment and profile, repeat search
  • Profile is only as good as the results from the
    initial BLAST search: no good matches = useless
    profile

45
Hidden Markov Models
  • Hidden Markov Models (HMMs) are a more
    sophisticated form of profile analysis.
  • Rather than build a table of amino acid
    frequencies at each position, they model the
    transition from one amino acid to the next.
  • Pfam is built with HMMs.
  • EMBOSS HMM tools (HMMER)
  • HmmerBuild HmmerCalibrate
  • HmmerSearch HmmerPfam
  • HmmerAlign HmmerEmit
  • HmmerFetch HmmerIndex

46
HMM model
47
Discovery of new Motifs
  • All of the tools discussed so far rely on a
    database of existing domains/motifs
  • How to discover new motifs
  • Start with a set of related proteins
  • Make a multiple alignment
  • Build a pattern or profile
  • You will need access to a fairly powerful UNIX
    computer to search databases with custom built
    profiles or HMMs.
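The "build a pattern" step can be sketched by collapsing each column of a gap-free multiple alignment into one pattern element. This is a toy illustration: the function name, the max_alternatives cutoff, and the aligned sequences are all hypothetical, and real motif building also has to handle gaps and sequence weighting.

```python
def pattern_from_alignment(aligned, max_alternatives=3):
    """Build a PROSITE-style pattern from equal-length, gap-free aligned sequences."""
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(r.upper() for r in column))
        if len(residues) == 1:
            # Fully conserved column: a single fixed residue.
            elements.append(residues[0])
        elif len(residues) <= max_alternatives:
            # A few alternatives: list them in brackets.
            elements.append("[" + "".join(residues) + "]")
        else:
            # Too variable to be informative: any residue.
            elements.append("x")
    return "-".join(elements)


# Hypothetical aligned motif region from three related proteins.
print(pattern_from_alignment(["CAHC", "CGHC", "CSHC"]))  # prints C-[AGS]-H-C
```

A pattern built this way only describes the sequences already in the alignment, which is why profiles and HMMs, which score partial matches, are more sensitive for database searches.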

48
Patterns in Unaligned Sequences
  • Sometimes sequences may share just a small common
    region
  • transcription factors
  • MEME: San Diego Supercomputing Facility
  • http://www.sdsc.edu/MEME/meme/website/meme.html
  • EMBOSS also includes the MEME program
  • Gibbs Sampler
  • http://bayesweb.wadsworth.org/gibbs/gibbs.html