Computing Patterns in Biology

1
Computing Patterns in Biology

Stuart M. Brown
New York University School of Medicine

2
Why Compute Biological Patterns?

Because we can
(computer scientists love to find interesting
problems)
Patterns are beautiful
It's practical - helps with gene cloning
experiments, predicts functions of new proteins
Systems biology - figure out circuits of
regulation, predict outcome of changes, design
new biological systems

3
Overview

DNA Patterns
Restriction sites
Finding genes in DNA sequences
Regulatory sites in DNA
Protein Patterns
signals (transport and processing)
Protein functional Motifs
Protein families
Protein 3-D structure

4
DNA Information Content

Just a 4 letter alphabet (GATC)
Encodes proteins with 3 letter codons
Punctuation determines transcription starts and
stops
Transcriptional regulation (promoters, enhancers,
etc.)

5
Restriction Sites

Bacteria make restriction enzymes that cut DNA
at specific sequences (4-8 base patterns)
Very simple to find these patterns - can even use
the Find function of your web browser or word
processor
Exact matches only - these sites never vary
Open any page of text and look for CAT
you now have a restriction site search program!
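
The "Find function" idea above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a tool such as NEBcutter: the function name find_sites and the test fragment are made up for this example, only exact matches on the given strand are reported, and GAATTC (the EcoRI recognition site) is used as the search pattern.

```python
def find_sites(dna, site):
    """Return 1-based start positions of every exact occurrence of site."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start + 1)
        # Advance by one base so overlapping occurrences are also found.
        start = dna.find(site, start + 1)
    return positions


# Hypothetical test fragment; GAATTC is the EcoRI recognition site.
fragment = "TTGCTGAATTCTCACATGGTGGAATTCAA"
print(find_sites(fragment, "GAATTC"))  # [6, 22]
print(find_sites(fragment, "CAT"))     # [15]
```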

6
NEBcutter2

http://tools.neb.com/NEBcutter2/

7
Finding Genes in Genomic DNA

Translate (in all 6 reading frames) and look for
similarity to known protein sequences
Look for long Open Reading Frames (ORFs) between
start and stop codons (start = ATG, stop = TAA,
TAG, TGA)
Look for known gene markers
TATA box, intron splice sites, etc.
Statistical methods (codon preference)
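
The ORF search described above can be sketched as follows. This is a simplified illustration under stated assumptions: an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, introns are ignored (so it is only meaningful for bacterial-style genes or cDNA), and the names find_orfs, orfs_in_frame and min_codons are invented for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield (start, end) 0-based half-open spans of ATG...stop in one frame."""
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first ATG opens the ORF
        elif start is not None and codon in STOPS:
            yield start, i + 3             # stop codon is included in the span
            start = None


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, start, end, sequence) for ORFs in all 6 frames.

    Coordinates for the "-" strand refer to the reverse-complemented sequence.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame):
                if (end - start) // 3 >= min_codons:
                    results.append((strand, frame, start, end, seq[start:end]))
    return results


print(find_orfs("CCATGAAATTTTAGGG"))
```

On real genomic DNA a much larger min_codons (for example 100) would be a more sensible cutoff, since short ORFs occur frequently by chance.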

8
GCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGCCTTGTACAA
AAATCAACTCCAGATGGATCTAAGATTTAAATCTAACACCTGAAACCATA
AAAATTCTAGGAGATAACACTGGCAAAGCTATTCTAGACATTGGCTTAGG
CAAAGAGTTCGTGACCAAGAACCCAAAAGCAAATGCAACAAAAACAAAAA
TAAATAGGTGGGACCTGATTAAACTGAAAAGCCTCTGCACAGCAAAAGAA
ATAATCAGCAGAGTAAACAGACAACCCACAGAATGAGAGAAAATATTTGC
AAACCATGCATCTGATGACAAAGGACTAATATCCAGAATCTACAAGGAAC
TCAAACAAATCAGCAAGAAAAAAATAACCCCATCAAAAAGTGGGCAAAGG
AATGAATAGACAATTCTCAAAATATACAAATGGCCAATAAACATACGAAA
AACTGTTCAACATCACTAATTATCAGGGAAATGCAAATTAAAACCACAAT
GAGATGCCACCTTACTCCTGCAAGAATGGCCATAATAAAAAAAAATCAAA
AAAGAATAAATGTTGGTGTGAATGTGGTGAAAAGAGAACACTTTGACACT
GCTGGTGGGAATGGAAACTAGTACAACCACTGTGGAAAACAGTACCGAGA
TTTCTTAAAGAACTACAAGTAGAACTACCATTTGATCCAGCAATCCCACT
ACTGGGTATCTACCCAGAGGAAAAGAAGTCATTATTTGAAAAAGACACTT
GTACATACATGTTTATAGCAGCACAATTTGCAATTGCAAAGATATGGAAC
CAGTCTAAATGCCCATCAACCAACAAATGGATAAAGAAAATATGGTATAT
ATACACCATGGAACACTACTCAGCCATAAAAAGGAACAAAATAATGGCAA
CTCACAGATGGAGTTGGAGACCACTATTCTAAGTGAAATAACTCAGGAAT
GGAAAACCAAATATTGTATGTTCTCACTTATAAGTGGGAGCTAAGCTATG
AGGACAAAAGGCATAAGAATTATACTATGGACTTTGGGGACTCGGGGGAA
AGGGTGGGAGGGGGATGAGGGACAAAAGACTACACATTGGGTGCAGTGTA
CACTGCTGAGGTGATGGGTGCACCAAAATCTCAGAAATTACCACTAAAGA
ACTTATCCATGTAACTAAAAACCACCTCTACCCAAATAATTTTGAAATAA
AAAATAAAAATATTTTAAAAAGAACTCTTTAAAATAAATAATGAAAAGCA
CCAACAGACTTATGAACAGGCAATAGAAAAAATGAGAAATAGAAAGGAAT
ACAAATAAAAGTACAGAAAAAAAATATGGCAAGTTATTCAACCAAACTGG
TAATTTGAAATCCAGATTGAAATAATGCAAAAAAAAGGCAATTTCTGGCA
CCATGGCAGACCAGGTACCTGGATGATCTGTTGCTGAAAACAACTGAAAA
TGCTGGTTAAAATATATTAACACATTCTTGAATACAGTCATGGCCAAAGG
AAGTCACATGACTAAGCCCACAGTCAAGGAGTGAGAAAGTATTCTCTACC
TACCATGAGGCCAGGGCAAGGGTGTGCACTTTTTTTTTTCTTCTGTTCAT
TGAATACAGTCACTGTGTATTTTACATACTTTCATTTAGTCTTATGACAA
TCCTATGAAACAAGTACTTTTAAAAAAATTGAGATAACAGTTGCATACCG
TGAAATTCATCCATTTAAAGTGAGCAATTCACAGGTGCAGCTAGCTCAGT
CAGCAGAGCATAAGACTCTTAAAGTGAACAATTCAGTGCTTTTTAGTATA
TTCACAGAGTTGTGCAACCATCACCACTATCTAATTGGTCTTAGTCTGTT
TGGGCTGCCATAACAAAATACCACAAACTGGATAGCTCATAAACAACAGG
CATTTATTGCTCACAGTTCTAGAGGCTGGAAGTGCAAGATTAAGATGCCA
GCAGATTCTGTGTCTGCTGAGGGCCTGTTCCTCATAGAAGGTGCCCTCTT
GCTGAATTCTCACATGGTGGAAGGGGGAAAACAAGCTTGCATTGCAAAGA
GGTGGGCCTCTTTAATCCCAAAGGCCCCACCTCTAAAAGGCCCCACTTCT
GAATACCATTACATTGAGAATTAAGTTTCAACATAGGAATTTGGGGGAAC
ACAAATATCCAGACTGTAGCATAATTCCAGAACGGATTCAT
9
Intron/Exon structure

Gene finding programs work well in bacteria
None of the gene prediction programs do an
adequate job predicting intron/exon boundaries
The only reasonable gene models are based on
alignment of cDNAs to genome sequence
Perhaps 50% of all human genes still do not have
a correct coding sequence defined
(transcription start, intron splice sites)

10
(No Transcript)
11
Truth?

There may not be a "correct" answer to the gene
finding problem
Some genes have more than one start and stop
position on the DNA
Alternative splicing
(a portion of the DNA is sometimes in an exon,
sometimes in an intron)
Pseudogenes - look like genes, but no longer
function
All computational gene predictions need to be
experimentally verified

12
Gene Finding on the Web

GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
http://compbio.ornl.gov/grailexp
ORFfinder: NCBI
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
DNA translation: Univ. of Minnesota Med. School
http://alces.med.umn.edu/webtrans.html
GenLang
http://cbil.humgen.upenn.edu/sdong/genlang.html
BCM GeneFinder: Baylor College of Medicine,
Houston, TX
http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

13
Genomic Sequence

Once each gene is located on the chromosome, it
becomes possible to get upstream genomic sequence
This is where transcription factor (TF) binding
sites are located
promoters and enhancers
Search for known TF sites, and discover new ones
(among co-regulated genes)

14
Phage CRO repressor bound to DNA (Andrew Coulson
& Roger Sayle with RasMol, Univ. of Edinburgh,
1993)
15
Websites for Promoter finding

Promoter Scan: NIH Bioinformatics (BIMAS)
http://bimas.dcrt.nih.gov/molbio/proscan/
Promoter Scan II: Univ. of Minnesota and Axyx
Pharmaceuticals
http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
Signal Scan: NIH Bioinformatics (BIMAS)
http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
Transcription Element Search (TESS): Center for
Bioinformatics, Univ. of Pennsylvania
http://www.cbil.upenn.edu/tess/
Search TransFac at GBF with MatInspector,
PatSearch, and FunSiteP
http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
TargetFinder: Telethon Inst. of Genetics and
Medicine, Milan, Italy
http://hercules.tigem.it/TargetFinder.html

16
Many DNA Regulatory Sequences are Known

Databases of promoters, enhancers, etc.
TransFac: the Transcription Factor database
4342 entries w/ known protein binding and
transcriptional regulatory functions
Maintained by Gesellschaft für Biotechnologische
Forschung mbH (Braunschweig, Germany)
The Eukaryotic Promoter Database (EPD)
Bucher & Trifonov (1986) NAR 14: 10009-26
1314 entries taken directly from scientific
literature
Maintained by ISREC (Lausanne, Switzerland) as a
subset of the EMBL

17
TF Binding sites lack information

DE  IFI-6-16 (interferon-induced gene 6-16) G000176.
SQ  gGGAAAaTGAAACT
SF  -127
ST  -89
BF  T00428 ISGF-3 Quality: 6 Species: human, Homo sapiens.

Most TF binding sites are determined by just a
few base pairs (typically 6-12)
Sequence is variable (consensus)
This is not enough information for proteins to
locate unique promoters for each gene in a 3
billion base genome
TFs bind cooperatively and combinatorially
The key is in the location in relation to each
other and to the transcription units of genes
Can use multiple alignments to predict binding
sites
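
The "not enough information" argument above can be made concrete with a little arithmetic. The sketch below assumes all four bases are equally frequent and independent (not true of real genomes), counts one strand only, and uses the 3 billion base genome size from the slide; the function name expected_chance_hits is invented for this example.

```python
GENOME_SIZE = 3_000_000_000  # roughly the human genome, as stated above


def expected_chance_hits(site_length, genome_size=GENOME_SIZE):
    """Expected exact matches on one strand for a fully specified site.

    Assumes equal, independent base frequencies (a simplification).
    """
    return genome_size / 4 ** site_length


for n in (6, 8, 12, 16):
    print(n, round(expected_chance_hits(n), 1))
```

Under these assumptions a 6-base site is expected to occur by chance several hundred thousand times and a 12-base site still roughly 179 times; only at about 16 fully specified bases does the expectation drop below one. Variable (consensus) positions make chance matches even more frequent.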

18
Sequence Logos
19
Pattern Finding Tools

Simple pattern search: perfect matches only
Regular expression: defined sets of mismatches
Fuzzy match: allow a specified number of
mismatches in any location
Matrix: use letter frequency from multiple
alignment
HMM: more complex matrix that uses info from
adjacent pairs of letters
Challenges: sensitivity and false positives (and
the ability to search large amounts of data)
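
The difference between a regular expression search and a fuzzy match can be sketched as follows. This is an illustration only: the sequence and patterns are made up, the function names regex_hits and fuzzy_hits are hypothetical, and the fuzzy match counts substitutions only (no insertions or deletions).

```python
import re


def regex_hits(seq, pattern):
    """1-based start positions of regular-expression matches (overlaps included)."""
    # The lookahead lets matches that overlap each other all be reported.
    return [m.start() + 1 for m in re.finditer(f"(?=({pattern}))", seq)]


def fuzzy_hits(seq, pattern, max_mismatches):
    """(1-based start, mismatches) for windows within max_mismatches of pattern."""
    k = len(pattern)
    hits = []
    for i in range(len(seq) - k + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + k], pattern))
        if mismatches <= max_mismatches:
            hits.append((i + 1, mismatches))
    return hits


seq = "AAGGAAACTGAAACTTTGGTAAACC"
print(regex_hits(seq, "GG[AT]AA"))   # [3, 18]
print(fuzzy_hits(seq, "GGAAA", 1))   # [(3, 0), (9, 1), (18, 1), (19, 1)]
```

The regular expression only tolerates variation at the position where it was written in ([AT]), while the fuzzy match accepts one mismatch anywhere and therefore reports more hits, some of which are likely to be false positives.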
20
Tools to find patterns in DNA

Signal Scan, Promoter Scan - Mac, Windows, Unix
(Dr. Dan S. Prestridge, Univ. of Minnesota)
EMBOSS tools (Unix)
tfscan: scans DNA sequences for transcription
factors
fuzznuc: nucleic acid pattern search
fuzzpro: protein pattern search
fuzztran: translate DNA -> protein, search for
protein patterns
restrict: finds restriction enzyme cleavage sites
repeats (G. Benson) - tandem repeats
palindrome - inverted repeats
REPuter (whole genome repeat search) - Unix

21
Protein Sequence
22
Protein Sequence Analysis

Molecular properties (pH, mol. wt., isoelectric
point, hydrophobicity)
Simple Motifs (signal peptide, coiled-coil,
trans-membrane, etc.)
Protein Families
Secondary Structure (helix vs. beta-sheet)
3-D prediction, Threading
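
As an illustration of one of the simple properties listed above, hydrophobicity is commonly plotted as a sliding-window average. The sketch below uses the Kyte-Doolittle hydropathy scale; the function names, the default window sizes and the 1.6 threshold for candidate membrane-spanning windows are choices made for this example, not settings taken from any particular server.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Mean hydropathy of each full window along the sequence.

    Raises KeyError for non-standard residue codes such as X.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_windows(protein, window=19, threshold=1.6):
    """1-based starts of windows whose mean hydropathy exceeds threshold."""
    profile = hydropathy_profile(protein, window)
    return [i + 1 for i, score in enumerate(profile) if score > threshold]


# Hypothetical peptide: a leucine-rich stretch flanked by charged residues.
peptide = "KKKKK" + "L" * 19 + "KKKKK"
print(hydrophobic_windows(peptide))
```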

23
Simple Motifs

Common structural motifs
Membrane spanning
Signal peptide
Coiled coil
Helix-turn-helix

24
Web Sites for Simple Protein Analysis

Protein Hydrophobicity Server: Bioinformatics
Unit, Weizmann Institute of Science, Israel
http://bioinformatics.weizmann.ac.il/hydroph/
SAPS - statistical analysis of protein sequences:
hydrophobic and transmembrane segments,
cysteine spacings, repeats and periodicity
http://www.isrec.isb-sib.ch/software/SAPS_form.html

25
Protein Signal Peptides

Proteins are sorted within the cell using 15-25
amino acid tags at their N-terminus (beginning)
Chopped off once they reach their destination

26
Some Signal Peptides
27
Protein Signal Prediction

ChloroP - Prediction of chloroplast transit
peptides
LipoP - Prediction of lipoproteins and signal
peptides in Gram negative bacteria
MITOPROT - Prediction of mitochondrial targeting
sequences
PATS - Prediction of apicoplast targeted
sequences
PlasMit - Prediction of mitochondrial transit
peptides in Plasmodium falciparum
Predotar - Prediction of mitochondrial and
plastid targeting sequences
PTS1 - Prediction of peroxisomal targeting signal
1 containing proteins
SignalP - Prediction of signal peptide cleavage
sites

28
EMBOSS Protein Analysis Tools

Program name     Description
antigenic        Finds antigenic sites in proteins
digest           Protein proteolytic enzyme or reagent cleavage digest
epestfind        Finds PEST motifs as potential proteolytic cleavage sites
fuzzpro          Protein pattern search
fuzztran         Protein pattern search after translation
helixturnhelix   Report nucleic acid binding motifs
pepcoil          Predicts coiled-coil regions
oddcomp          Find protein sequence regions with a biased composition
patmatdb         Search a protein sequence with a motif
patmatmotifs     Search a PROSITE motif database with a protein sequence
tmap             Predicts membrane spanning regions
preg             Regular expression search of a protein sequence
pscan            Scans proteins using PRINTS
sigcleave        Reports protein signal cleavage sites
emast            Motif detection
meme             Motif detection
profit           Scan a sequence or database with a matrix or profile
prophecy         Creates matrices/profiles from multiple alignments
prophet          Gapped alignment for profiles
29
Web servers that predict these structures

Predict Protein server: EMBL Heidelberg
http://www.embl-heidelberg.de/predictprotein/
SOSUI: Tokyo Univ. of Ag. Tech., Japan
http://www.tuat.ac.jp/mitaku/adv_sosui/submit.html
TMpred (transmembrane prediction): ISREC (Swiss
Institute for Experimental Cancer Research)
http://www.isrec.isb-sib.ch/software/TMPRED_form.html
COILS (coiled coil prediction): ISREC
http://www.isrec.isb-sib.ch/software/COILS_form.html
SignalP (signal peptides): Tech. Univ. of Denmark
http://www.cbs.dtu.dk/services/SignalP/

30
Protein Domains/Motifs

Proteins are built out of functional units known
as domains (or motifs)
These domains have conserved sequences
Often much more similar than their respective
proteins
Exon shuffling theory (W. Gilbert)
Exons correspond to folding domains which in
turn serve as functional units
Unrelated proteins may share a single similar
exon (e.g., ATPase or DNA binding function)

31
Protein Domains (Pattern analysis)
32
Motifs are built from Multiple Alignments
33
Protein Motif Databases

Known protein motifs have been collected in
databases
Best database is PROSITE
The Dictionary of Protein Sites and Patterns
maintained by Amos Bairoch, at the Univ. of
Geneva, Switzerland
contains a comprehensive list of documented
protein domains constructed by expert molecular
biologists
Alignments and patterns built by hand!

34
PROSITE is based on Patterns

Each domain is defined by a simple pattern
Patterns can have alternate amino acids in each
position and defined spaces, but no gaps
Pattern searching is by exact matching, so any
new variant will not be found (can allow
mismatches, but this weakens the algorithm)
Grep
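
Because PROSITE-style patterns are exact-match descriptions, they map closely onto regular expressions of the kind grep uses. The sketch below translates the common parts of the pattern syntax (x for any residue, [..] for allowed alternatives, {..} for excluded residues, (n) or (n,m) for repeats, < and > for sequence ends) into a Python regex. The function names are invented for this example, rarer syntax such as an anchor placed inside square brackets is not handled, and the test protein is made up. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.strip().rstrip(".")      # drop the trailing period
    regex = regex.replace("-", "")           # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}    -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
    regex = regex.replace("x", ".")          # x -> any residue
    regex = regex.replace("<", "^").replace(">", "$")    # sequence-end anchors
    return regex


def scan(protein, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))        # N[^P][ST][^P]
print(scan("MKNGSAANPTQ", "N-{P}-[ST]-{P}"))      # [(3, 'NGSA')]
```

The second asparagine in the test protein is followed by a proline, so it is excluded by {P}. A new variant with a residue outside the listed alternatives would be missed in the same way, which is the weakness of exact pattern matching noted above.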

35
(No Transcript)
36
(No Transcript)
37
Tools for Pattern searches

Free Mac program: MacPattern
ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx
Free PC program (DOS): PATMAT
ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos
EMBOSS: fuzzpro

38
Websites for PROSITE Searches

ScanProsite at ExPASy: Univ. of Geneva
http://expasy.hcuge.ch/sprot/scnpsit1.html
Network Protein Sequence Analysis: Institut de
Biologie et Chimie des Protéines, Lyon, France
http://pbil.ibcp.fr/NPSA/npsa_prosite.html
PPSRCH: EBI, Cambridge, UK
http://www2.ebi.ac.uk/ppsearch/

39
Profiles

Profiles are tables of amino acid frequencies at
each position in a motif
They are built from multiple alignments
PROSITE entries also contain profiles built from
an alignment of proteins that match the pattern
Profile searching is more sensitive than pattern
searching - uses an alignment algorithm, allows
gaps
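
The idea of a profile as a table of per-position amino acid frequencies can be sketched as below. This is a deliberately reduced illustration: it assumes an ungapped alignment block of equal-length sequences, a uniform background frequency, and a simple pseudocount, and it scores ungapped windows only, whereas real profile searches use an alignment algorithm that allows gaps. The function names and the toy alignment are invented for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per column of log2 odds versus a uniform background."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(protein, profile):
    """Score every ungapped window; return (score, 1-based start) of the best."""
    width = len(profile)
    scored = [
        (sum(profile[j][protein[i + j]] for j in range(width)), i + 1)
        for i in range(len(protein) - width + 1)
    ]
    return max(scored)


# Toy alignment block (hypothetical sequences).
block = ["GKST", "GKTT", "GRST", "GKSS"]
profile = build_profile(block)
print(best_window("AAGKSTAA", profile))
print(best_window("AAGRTSAA", profile))
```

The second query, GRTS, does not appear in the alignment as a whole, yet it still scores well because each of its residues was seen at the corresponding position. This is the sense in which a profile is more sensitive than an exact pattern.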

40
(No Transcript)
41
Websites for Profile searching

PROSITE ProfileScan: ExPASy, Geneva
http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
BLOCKS (builds profiles from PROSITE entries and
adds all matching sequences in SwissProt): Fred
Hutchinson Cancer Research Center, Seattle,
Washington, USA
http://www.blocks.fhcrc.org/blocks_search.html
PRINTS (profiles built from automatic alignments
of OWL non-redundant protein databases)
http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi

42
More Protein Motif Databases

PFAM (1344 protein family HMM profiles built by
hand): Washington Univ., St. Louis
http://pfam.wustl.edu/hmmsearch.shtml
ProDom (profiles built from PSI-BLAST automatic
multiple alignments of the SwissProt database):
INRA, Toulouse, France
http://www.toulouse.inra.fr/prodom/doc/blast_form.html
This is my favorite protein database - nicely
colored results

43
Sample ProDom Output
44
Psi-BLAST

Use BLAST to find a group of sequences that share
a region of similarity with a seed sequence
Build a profile from the alignment at this region
Use the profile to make a more sensitive search
of the database for more matches
Rebuild the alignment and profile, repeat search
Profile is only as good as the results from the
initial BLAST search: no good matches = useless
profile

45
Hidden Markov Models

Hidden Markov Models (HMMs) are a more
sophisticated form of profile analysis.
Rather than build a table of amino acid
frequencies at each position, they model the
transition from one amino acid to the next.
Pfam is built with HMMs.
EMBOSS HMM tools (HMMER)
HmmerBuild HmmerCalibrate
HmmerSearch HmmerPfam
HmmerAlign HmmerEmit
HmmerFetch HmmerIndex

46
HMM model
47
Discovery of new Motifs

All of the tools discussed so far rely on a
database of existing domains/motifs
How to discover new motifs
Start with a set of related proteins
Make a multiple alignment
Build a pattern or profile
You will need access to a fairly powerful UNIX
computer to search databases with custom built
profiles or HMMs.
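
The last step of the recipe above (build a pattern from a multiple alignment) can be sketched as follows. This is a rough illustration that assumes an ungapped alignment of equal-length sequences; the function name alignment_to_pattern, the max_alternatives cutoff and the toy alignment are invented for this example, and hand-curated patterns involve much more expert judgment than this.

```python
def alignment_to_pattern(alignment, max_alternatives=3):
    """Build a PROSITE-style pattern from an ungapped alignment.

    A column with one residue is written as that residue, a column with a
    few residues as [..], and a more variable column as x (any residue).
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    elements = []
    for col in range(length):
        residues = sorted({row[col] for row in alignment})
        if len(residues) == 1:
            elements.append(residues[0])
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")
        else:
            elements.append("x")
    return "-".join(elements)


# Toy alignment (hypothetical sequences).
print(alignment_to_pattern(["GKST", "GKTT", "GRST", "GKSS"]))  # G-[KR]-[ST]-[ST]
print(alignment_to_pattern(["ACA", "ADA", "AEA", "AFA"]))      # A-x-A
```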

48
Patterns in Unaligned Sequences