Introduction to Bioinformatics - PowerPoint PPT Presentation

Slides: 48
Provided by: Rober409
1
Introduction to Bioinformatics
Lecture 15 Predicting Protein Function
Centre for Integrative Bioinformatics VU
(IBIVU)
2
Protein Function Prediction
The deluge of genomic information begs the
following question: what do all these genes do?
Many genes are not annotated, and many more are
partially or erroneously annotated. Given a
genome which is partially annotated at best, how
do we fill in the blanks? For each sequenced
genome, the functions of 20-50% of the encoded
proteins remain unknown!
3
Protein Function Prediction
We are faced with the problem of predicting
protein function from sequence, genomic,
expression, interaction and structural data. For
all these reasons and many more, automated
protein function prediction is rapidly gaining
interest among bioinformaticians and
computational biologists.
4
Outline
  • Sequence-based function prediction
  • Structure-based function prediction
  • Sequence-structure comparison
  • Structure-structure comparison
  • Motif-based function prediction
  • Phylogenetic profile analysis
  • Protein interaction prediction and databases
  • Functional inference at systems level

5
Classes of function prediction methods
  • Sequence-based approaches
  • protein A has function X, and protein B is a
    homolog (ortholog) of protein A; hence B has
    function X
  • Structure-based approaches
  • protein A has structure X, and X has
    such-and-such structural features; hence A's
    functional sites are ...
  • Motif-based approaches
  • a group of genes have function X and they all
    have motif Y; protein A has motif Y; hence
    protein A's function might be related to X
  • Function prediction based on guilt-by-association
  • gene A has function X and gene B is often
    associated with gene A; hence B might have a
    function related to X

6
Sequence-based function prediction: Homology
searching
  • Sequence comparison is a powerful tool for
    detecting homologous genes, but it is limited to
    genomes that are not too distantly related

Query   2  LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL  61
           LSD     V   W K        G   L R     P T   F         D    S
Sbjct   3  LSDKDKAAVRALWSKIGKSSDAIGNDALSRMIVVYPQTKIYFSHWP-----DVTPGSPNI  57

Query  62  KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG  121
           K HG  V         K          L   HA K             CI  V     P
Sbjct  58  KAHGKKVMGGIALAVSKIDDLKTGLMELSEQHAYKLRVDPSNFKILNHCILVVISTMFPK  117

Query 122  DFGADAQGAMNKALELFRKDMASNYK  147
            F   A     K L       A  Y
Sbjct 118  EFTPEAHVSLDKFLSGVALALAERYR  143

We have done homology searching (FASTA, BLAST,
PSI-BLAST) in earlier lectures
7
Structure-based function prediction
  • Structure-based methods could possibly detect
    remote homologues that are not detectable by
    sequence-based methods
  • using structural information in addition to
    sequence information
  • protein threading (sequence-structure alignment)
    is a popular method

Structure-based methods could provide more than
just homology information
8
Threading
Template sequence
Compatibility score
Query sequence
Template structure
9
Threading
Template sequence
Compatibility score
Query sequence
Template structure
10
Structure-based function prediction: Threading
  • Scoring function for measuring to what extent
    the query sequence fits into the template
    structure
  • For scoring we have to map an amino acid
    (query sequence) onto a local environment
    (template structure)
  • We can use the following structural features
    for scoring
  • Secondary structure
  • Is the environment inside or outside? Residue
    accessible surface area (ASA)
  • Polarity of environment
  • The best (highest-scoring) thread through
    the structure gives a so-called structural
    alignment; this looks exactly the same as a
    sequence alignment but is based on structure.
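
The scoring idea on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scoring function of any particular threading program: the environment labels ("H", "buried", "exposed"), the numeric preferences, the ungapped placement and all function names are made up for the example.

```python
# Toy threading score. All numbers and names below are illustrative
# assumptions, not values taken from a real threading potential.
HYDROPHOBIC = set("AVLIMFWC")

def env_score(residue, environment):
    """Score one amino acid against one local template environment."""
    sec_struct, burial = environment  # e.g. ("H", "buried")
    if burial == "buried":
        # Buried positions prefer hydrophobic residues.
        score = 1.0 if residue in HYDROPHOBIC else -1.0
    else:
        # Exposed positions prefer polar residues.
        score = -0.5 if residue in HYDROPHOBIC else 0.5
    # Toy secondary-structure term: penalise Pro/Gly inside helices.
    if sec_struct == "H" and residue in "PG":
        score -= 1.0
    return score

def thread_score(query, template_envs, offset):
    """Ungapped score of the query placed on the template at `offset`."""
    return sum(env_score(query[offset + i], env)
               for i, env in enumerate(template_envs))

def best_thread(query, template_envs):
    """Try every ungapped placement; return (best score, offset)."""
    n_offsets = len(query) - len(template_envs) + 1
    if n_offsets < 1:
        raise ValueError("query is shorter than the template")
    return max((thread_score(query, template_envs, off), off)
               for off in range(n_offsets))

def most_compatible_fold(query, fold_library):
    """Fold recognition: name of the template with the highest score."""
    return max(fold_library,
               key=lambda name: best_thread(query, fold_library[name])[0])

helix = [("H", "buried"), ("H", "exposed"), ("H", "buried")]
print(best_thread("KDLELKK", helix))
```

A real threading method would allow gaps (usually via dynamic programming) and would use statistically derived environment preferences rather than hand-picked values.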

11
Threading (inverse folding): map sequence to
structural environments
Query
Template
?
What is the optimal thread for each local
environment? Find the best compromise over all
environments
environment
  • Secondary structure
  • ASA
  • Polarity of environment

C
N
hydrophobic
hydrophilic
12
Fold recognition by threading
Fold 1 Fold 2 Fold 3 Fold N
Query sequence
What is the most compatible structure? The one
with the highest threading score
Compatibility scores
13
Structure-based function prediction
  • SCOP (http://scop.berkeley.edu/) is a protein
    structure classification database where proteins
    are grouped into a hierarchy of families,
    superfamilies, folds and classes, based on their
    structural and functional similarities

14
Structure-based function prediction
  • SCOP hierarchy - top level: 11 classes

15
Structure-based function prediction
All-alpha protein
Alpha-beta protein
membrane protein
Coiled-coil protein
All-beta protein
16
Structure-based function prediction
  • SCOP hierarchy - second level: 800 folds

17
Structure-based function prediction
  • SCOP hierarchy - third level: 1294 superfamilies

18
Structure-based function prediction
  • SCOP hierarchy - fourth level: 2327 families

19
Structure-based function prediction
  • Using a sequence-structure alignment method, one
    can predict that a protein belongs to a SCOP
    family, superfamily or fold
  • Proteins predicted to be in the same SCOP family
    are orthologous
  • Proteins predicted to be in the same SCOP
    superfamily are homologous
  • Proteins predicted to be in the same SCOP fold
    are structurally analogous

folds
superfamilies
families
20
Structure-based function prediction
  • Prediction of ligand binding sites
  • For 85% of ligand-binding proteins, the
    largest cleft is the ligand-binding site
  • For an additional 10% of ligand-binding proteins,
    the second largest cleft is the ligand-binding
    site

21
Structure-based function prediction
  • Prediction of macromolecular binding site
  • there is a strong correlation between
    macromolecular binding site (with protein, DNA
    and RNA) and disordered protein regions
  • disordered regions in a protein sequence can be
    predicted using computational methods

http://www.pondr.com/
22
Motif-based function prediction
  • Prediction of protein functions based on
    identified sequence motifs
  • PROSITE contains patterns specific for more than
    a thousand protein families.
  • ScanPROSITE scans a protein sequence for
    occurrences of the patterns and profiles
    stored in PROSITE

23
Motif-based function prediction
  • Search PROSITE using ScanPROSITE
  • The sequence has an ASN_GLYCOSYLATION
    (N-glycosylation) site at 242-245: NETL

MSEGSDNNGDPQQQGAEGEAVGENKMKSRLRKGALKKKNVFNVKDHCFIA
RFFKQPTFCSHCKDFICGYQSGYAWMGFGKQGFQCQVCSYVVHKRCHEYV
TFICPGKDKGNETLIDSDSPKTQH ..
24
Regular expressions
Alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Regular expression
[AS]-D-[IVL]-G-x(4)-{PG}-C-[DE]-R-[FY](2)-Q
{PG}: not (P or G)
For short sequence stretches, regular expressions
are often more suitable to describe the
information than alignments (or profiles)
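
As a small illustration, a PROSITE-style pattern can be translated into an ordinary regular expression and checked against the four aligned sequences. The sketch below assumes the pattern for this alignment reads [AS]-D-[IVL]-G-x(4)-{PG}-C-[DE]-R-[FY](2)-Q in PROSITE notation; the function name prosite_to_regex is made up for this example, and the translation covers only the basic syntax (no `<` or `>` anchors).

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex.

    x      -> any residue          (.)
    [ABC]  -> one of A, B, C       ([ABC])
    {AB}   -> anything but A or B  ([^AB])
    e(n)   -> element e repeated n times, e(n,m) -> n to m times
    """
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as (4) or (2,3).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

pattern = "[AS]-D-[IVL]-G-x(4)-{PG}-C-[DE]-R-[FY](2)-Q"
regex = re.compile(prosite_to_regex(pattern))

aligned = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
           "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
for seq in aligned:
    print(seq, bool(regex.fullmatch(seq)))
```

To scan a longer protein sequence for occurrences, `regex.finditer(sequence)` can be used instead of `fullmatch`.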
25
Regular expressions
Regular expression              No. of exact matches in DB
D-A-V-I-D                       71
D-A-V-I-[DENQ]                  252
[DENQ]-A-V-I-[DENQ]             925
[DENQ]-A-[VLI]-I-[DENQ]         2739
[DENQ]-[AG]-[VLI](2)-[DENQ]     51506
D-A-V-E                         1088
26
Prosite
  • In addition to regular expressions, the Prosite
    database also contains so-called extended
    profiles
  • Extended profiles contain more explicit
    information than classical profiles, for example
    to describe expected gap lengths, etc.
  • This is because some patterns are better
    described using regular expressions (e.g. short
    motifs), while others are better formalised using
    (extended) profiles

27
Phylogenetic profile analysis
  • Function prediction of genes based on
    guilt-by-association: a non-homologous
    approach
  • The phylogenetic profile of a protein is a string
    that encodes the presence or absence of the
    protein in every sequenced genome
  • Because proteins that participate in a common
    structural complex or metabolic pathway are
    likely to co-evolve, the phylogenetic profiles of
    such proteins are often similar

28
Phylogenetic profile analysis
  • Phylogenetic profile (against N genomes)
  • For each gene X in a target genome (e.g., E.
    coli), build a phylogenetic profile as follows:
  • If gene X has a homolog in genome i, the ith bit
    of X's phylogenetic profile is 1; otherwise it
    is 0
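
The construction above can be sketched as follows. This is a minimal example: the gene and genome names and the lookup table are hypothetical, and in practice the "has a homolog" decision would come from a homology search such as BLAST with some score cutoff.

```python
def phylogenetic_profile(gene, genomes, has_homolog):
    """Return the profile of `gene` as a string with one bit per genome."""
    return "".join("1" if has_homolog(gene, genome) else "0"
                   for genome in genomes)

# Hypothetical data: which genomes contain a homolog of each gene.
genomes = ["genome1", "genome2", "genome3", "genome4"]
homolog_table = {
    "geneX": {"genome1", "genome3"},
    "geneY": {"genome1", "genome3", "genome4"},
}

def lookup(gene, genome):
    """Stand-in for a real homology search."""
    return genome in homolog_table.get(gene, set())

for gene in sorted(homolog_table):
    print(gene, phylogenetic_profile(gene, genomes, lookup))
```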

29
Phylogenetic profile analysis
  • Example: phylogenetic profiles based on 60
    genomes

gene     genome
orf1034  1110110110010111110100010100000000111100011111110110111010101
orf1036  1011110001000001010000010010000000010111101110011011010000101
orf1037  1101100110000001110010000111111001101111101011101111000010100
orf1038  1110100110010010110010011100000101110101101111111111110000101
orf1039  1111111111111111111111111111111111111111101111111111111111101
orf104   1000101000000000000000101000000000110000000000000100101000100
orf1040  1110111111111101111101111100000111111100111111110110111111101
orf1041  1111111111111111110111111111111101111111101111111111111111101
orf1042  1110100101010010010110000100001001111110111110101101100010101
orf1043  1110100110010000010100111100100001111110101111011101000010101
orf1044  1111100111110010010111010111111001111111111111101101100010101
orf1045  1111110110110011111111111111111101111111101111111111110010101
orf1046  0101100000010001011000000111110000010100000001010010100000000
orf1047  0000000000000001000010000001000100000000000000010000000000000
orf105   0110110110100010111101101010111001101100101111100010000010001
orf1054  0100100110000001100001000100000000100100100001000100100000000
By correlating the rows (open reading frames
(ORFs) or genes) you find out about joint presence
or absence of genes: this is a signal for a
functional connection.
Genes with similar phylogenetic profiles have
related functions or are functionally linked
(D. Eisenberg and colleagues, 1999)
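
One simple way to correlate rows is to count the genomes in which two profiles differ (the Hamming distance) and report gene pairs below a cutoff. The sketch below uses that measure as an assumption for illustration; the slides do not state which similarity measure was used, and the gene names and profiles are made up.

```python
from itertools import combinations

def hamming(p, q):
    """Number of genomes in which two profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))

def linked_pairs(profiles, max_distance):
    """Gene pairs whose profiles differ in at most `max_distance` genomes."""
    pairs = []
    for g, h in combinations(sorted(profiles), 2):
        d = hamming(profiles[g], profiles[h])
        if d <= max_distance:
            pairs.append((d, g, h))
    return sorted(pairs)

# Hypothetical toy profiles over six genomes.
profiles = {
    "geneA": "110100",
    "geneB": "110100",
    "geneC": "001011",
    "geneD": "110110",
}
print(linked_pairs(profiles, max_distance=1))
```

Profiles that are almost all 1s or all 0s match many other genes by chance, so a real analysis would weight or filter such uninformative rows.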
30
Phylogenetic profile analysis
  • Phylogenetic profiles contain a great amount of
    functional information
  • Phylogenetic profile analysis can be used to
    distinguish orthologous genes from paralogous
    genes
  • Subcellular localization: 361 yeast
    nucleus-encoded mitochondrial proteins are
    identified at 50% accuracy with 58% coverage
    through phylogenetic profile analysis
  • Functional complementarity: by examining inverse
    phylogenetic profiles, one can find functionally
    complementary genes that have evolved through one
    of several mechanisms of convergent evolution.

31
Prediction of protein-protein interactions: Rosetta
stone
  • Gene fusion is an effective method for
    prediction of protein-protein interactions
  • If proteins A and B are homologous to two domains
    of a protein C, A and B are predicted to
    interact

A
B
C
Though gene fusion has low prediction coverage,
its false-positive rate is low
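
The Rosetta stone rule can be sketched as a search for pairs of proteins that both match the same protein C, but in different regions of it. In this minimal example the hit list, the protein names and the "non-overlapping regions" criterion are assumptions made for illustration; real hits would come from a homology search.

```python
from collections import defaultdict
from itertools import combinations

def rosetta_stone_links(hits):
    """Predict interacting pairs from homology hits.

    hits: iterable of (query_protein, target_protein, start, end),
    where start/end give the matched region on the target.
    Returns sorted (protein_a, protein_b, fused_protein) triples.
    """
    by_target = defaultdict(list)
    for query, target, start, end in hits:
        by_target[target].append((query, start, end))

    links = set()
    for target, regions in by_target.items():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(regions, 2):
            if a == b:
                continue
            # Non-overlapping regions: A and B match different
            # domains of the same protein C.
            if a_end < b_start or b_end < a_start:
                first, second = sorted((a, b))
                links.add((first, second, target))
    return sorted(links)

# Hypothetical hits: A and B cover separate parts of C, D overlaps both.
hits = [
    ("A", "C", 1, 120),
    ("B", "C", 150, 300),
    ("D", "C", 100, 200),
    ("E", "F", 1, 50),
]
print(rosetta_stone_links(hits))
```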
32
Domain fusion example
  • Vertebrates have a multi-enzyme protein
    (GARs-AIRs-GARt) comprising the enzymes GAR
    synthetase (GARs), AIR synthetase (AIRs), and GAR
    transformylase (GARt).
  • In insects, the polypeptide appears as
    GARs-(AIRs)2-GARt.
  • In yeast, GARs-AIRs is encoded separately from
    GARt
  • In bacteria each domain is encoded separately
    (Henikoff et al., 1997).
  • GAR: glycinamide ribonucleotide
  • AIR: aminoimidazole ribonucleotide

33
Protein interaction database
  • There are numerous databases of protein-protein
    interactions
  • DIP is a popular protein-protein interaction
    database

The DIP database catalogs experimentally
determined interactions between proteins. It
combines information from a variety of sources to
create a single, consistent set of
protein-protein interactions.
34
Protein interaction databases
  • BIND - Biomolecular Interaction Network Database
  • DIP - Database of Interacting Proteins
  • PIM Hybrigenics
  • PathCalling Yeast Interaction Database
  • MINT - a Molecular Interactions Database
  • GRID - The General Repository for Interaction
    Datasets
  • InterPreTS - protein interaction prediction
    through tertiary structure
  • STRING - predicted functional associations among
    genes/proteins
  • Mammalian protein-protein interaction database
    (PPI)
  • InterDom - database of putative interacting
    protein domains
  • FusionDB - database of bacterial and archaeal
    gene fusion events
  • IntAct Project
  • The Human Protein Interaction Database (HPID)
  • ADVICE - Automated Detection and Validation of
    Interaction by Co-evolution
  • InterWeaver - protein interaction reports with
    online evidence
  • PathBLAST - alignment of protein interaction
    networks
  • ClusPro - a fully automated algorithm for
    protein-protein docking
  • HPRD - Human Protein Reference Database

35
Protein interaction database
36
Network of protein interactions and predicted
functional links involving silencing information
regulator (SIR) proteins. Filled circles
represent proteins of known function; open
circles represent proteins of unknown function,
represented only by their Saccharomyces genome
sequence numbers
(http://genome-www.stanford.edu/Saccharomyces).
Solid lines show experimentally determined
interactions, as summarized in the Database of
Interacting Proteins (ref. 19;
http://dip.doe-mbi.ucla.edu). Dashed lines show
functional links predicted by the Rosetta Stone
method (ref. 12). Dotted lines show functional
links predicted by phylogenetic profiles (ref. 16).
Some predicted links are omitted for clarity.
37
Network of predicted functional linkages
involving the yeast prion protein Sup35 (ref. 20).
The dashed line shows the only experimentally
determined interaction. The other functional
links were calculated from genome and expression
data (ref. 11) by a combination of methods,
including phylogenetic profiles, Rosetta stone
linkages and mRNA expression. Linkages predicted
by more than one method, and hence particularly
reliable, are shown by heavy lines. Adapted from
ref. 11.
38
STRING - predicted functional associations among
genes/proteins
  • STRING is a database of predicted functional
    associations among genes/proteins.
  • Genes of similar function tend to be maintained
    in close neighborhood, tend to be present or
    absent together, i.e. to have the same
    phylogenetic occurrence, and can sometimes be
    found fused into a single gene encoding a
    combined polypeptide.
  • STRING integrates this information from as many
    genomes as possible to predict functional links
    between proteins.

Berend Snel and Martijn Huynen (RUN) and the group
of Peer Bork (EMBL, Heidelberg)
39
STRING - predicted functional associations among
genes/proteins
  • STRING is a database of known and predicted
    protein-protein interactions. The interactions
    include direct (physical) and indirect
    (functional) associations; they are derived from
    four sources:
  • Genomic Context (Synteny)
  • High-throughput Experiments
  • (Conserved) Co-expression
  • Previous Knowledge
  • STRING quantitatively integrates interaction
    data from these sources for a large number of
    organisms, and transfers information between
    these organisms where applicable. The database
    currently contains 736429 proteins in 179 species
40
STRING - predicted functional associations among
genes/proteins
Conserved Neighborhood: this view shows runs
of genes that occur repeatedly in close
neighborhood in (prokaryotic) genomes. Genes
located together in a run are linked with a black
line (maximum allowed intergenic distance is 300
bp). Note that if there are multiple runs for a
given species, these are separated by white
space. If there are other genes in the run that
are below the current score threshold, they are
drawn as small white triangles. Gene fusion
occurrences are also drawn, but only if they are
present in a run (see also the Fusion section
below for more details).
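
The notion of a run can be illustrated by grouping genes whose intergenic distance does not exceed 300 bp. The sketch below handles a single genome only and ignores strand; the gene names and coordinates are hypothetical, and it is not a description of how STRING itself computes runs.

```python
def gene_runs(genes, max_gap=300):
    """Group genes into runs of close neighbours.

    genes: iterable of (name, start, end) coordinates in bp on one
    chromosome. A new run starts whenever the gap to the previous
    gene exceeds `max_gap`.
    """
    runs = []
    current = []
    prev_end = None
    for name, start, end in sorted(genes, key=lambda g: g[1]):
        if current and start - prev_end > max_gap:
            runs.append(current)
            current = []
        # Track the furthest end seen in the current run.
        prev_end = max(prev_end, end) if current else end
        current.append(name)
    if current:
        runs.append(current)
    return runs

# Hypothetical coordinates: g1-g2 are 50 bp apart, g2-g3 400 bp apart.
genes = [
    ("g1", 100, 900),
    ("g2", 950, 1800),
    ("g3", 2200, 3000),
    ("g4", 3100, 3500),
]
print(gene_runs(genes))
```

Finding runs that are conserved would additionally require comparing the runs of homologous genes across many genomes.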
41
Functional inference at systems level
  • Function prediction of individual genes could be
    made in the context of biological
    pathways/networks
  • Example: phoB is predicted to be a transcription
    regulator that regulates all the genes in the
    pho-regulon (a group of co-regulated operons);
    within this regulon, gene A interacts
    with gene B, etc.

42
Functional inference at systems level
  • KEGG is a database of biological pathways and
    networks

43
Functional inference at systems level
44
Functional inference at systems level
45
Functional inference at systems level
  • By homology searching, one can map a known
    biological pathway in one organism to another
    organism, and hence predict gene functions in
    the context of biological pathways/networks
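
A minimal sketch of this mapping, assuming the homology search has already produced a table of best homologs between a reference organism and a target organism. All step, gene and function names are hypothetical placeholders.

```python
def map_pathway(reference_pathway, homologs):
    """Transfer pathway annotation from a reference organism.

    reference_pathway: dict mapping pathway step -> reference gene.
    homologs: dict mapping reference gene -> homologous target gene.
    Returns a dict mapping each step to the target gene predicted to
    carry it out, or None if no homolog was found.
    """
    return {step: homologs.get(gene)
            for step, gene in reference_pathway.items()}

# Hypothetical pathway with three steps in the reference organism.
reference_pathway = {
    "step1": "refGene1",
    "step2": "refGene2",
    "step3": "refGene3",
}
# Hypothetical result of a homology search against the target genome.
homologs = {"refGene1": "targetGene17", "refGene3": "targetGene42"}

mapped = map_pathway(reference_pathway, homologs)
missing = [step for step, gene in mapped.items() if gene is None]
print(mapped)
print("steps without a homolog:", missing)
```

Steps without a homolog point to pathway gaps, which are natural candidates for unannotated genes in the target genome.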

46
Wrapping up
  • We have seen a number of ways to infer a putative
    function for a protein sequence
  • To gain confidence, it is important to combine as
    many different prediction protocols as possible
    (the STRING server is an example of this)
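
One simple way to combine evidence from several methods is to treat each method's confidence as an independent probability that the functional link is real. The sketch below makes exactly that independence assumption for illustration; the slides do not say how any particular server combines its scores, and the example values are made up.

```python
def combine_scores(scores):
    """Combine per-method confidences (each in [0, 1]).

    Assuming independent evidence, the link is missed only if every
    method misses it, so combined = 1 - product(1 - s_i).
    """
    remaining = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
        remaining *= 1.0 - s
    return 1.0 - remaining

# Hypothetical confidences from three prediction methods for one link.
evidence = {"phylogenetic profile": 0.6, "gene fusion": 0.7, "co-expression": 0.2}
print(round(combine_scores(evidence.values()), 3))
```

Under this scheme a link supported by several weak predictions can outscore one supported by a single moderate prediction, which matches the point that agreement between methods increases confidence.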

47
Homework
  • Give an example of two proteins having the same
    structural fold but different biological
    functions by searching SCOP and Swiss-Prot
  • What is the biological function of phoR in the
    two-component system of prokaryotic organisms,
    based on a KEGG database search?