Gene Clustering - PowerPoint PPT Presentation

1
Gene Clustering
Haleh Ashki, School of Informatics, Indiana University, Aug 2008
Advisor: Professor Sun Kim
2
Goal of the project
Gene cluster prediction algorithms are useful in
discovering sets of genes conserved in a pair
of genomes. However, the prediction results
depend highly on the phylogenetic distance between
the two genomes. In particular, when two genomes
are close, the predicted gene clusters are
large and contain several functional gene sets in
one cluster.
3
Figure panels: E. coli - Shigella; E. coli - Salmonella
4
  • Thus a new computational tool is needed to
    predict functionally related gene sets.
  • In this study, we developed a novel computational
    method to predict functionally related gene sets
    from gene clusters, using gene-ontology-based
    clustering of genes and one-dimensional dynamic
    programming techniques.

5
  • The input to this algorithm is the output of the
    EGGS clustering algorithm.
  • EGGS: Extraction of Gene clusters by iteratively
    using Genome context based Sequence matching
    techniques.
  • Genes are matched between two genomes using
    two concepts, pairs of close bidirectional best
    hits (PCBBHs) and pairs of close homologs (PCHs),
    where the term "close" means physical
    proximity, say within 300 bp.
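As a rough illustration of the PCBBH idea, the sketch below finds bidirectional best hits and then keeps pairs of them whose genes lie within 300 bp of each other in both genomes. This is only one plausible reading of the definition on the slide, not the EGGS implementation; the best-hit dictionaries, the coordinates, and all function names are hypothetical.

```python
from itertools import combinations

def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    return [(a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a]

def gap(g1, g2):
    """Base pairs between two genes given as (start, end); 0 if they overlap."""
    (s1, e1), (s2, e2) = sorted([g1, g2])
    return max(0, s2 - e1)

def close_bbh_pairs(bbhs, pos_a, pos_b, max_gap=300):
    """Pairs of BBHs whose genes are within max_gap bp in both genomes."""
    pairs = []
    for (a1, b1), (a2, b2) in combinations(bbhs, 2):
        close_in_a = gap(pos_a[a1], pos_a[a2]) <= max_gap
        close_in_b = gap(pos_b[b1], pos_b[b2]) <= max_gap
        if close_in_a and close_in_b:
            pairs.append(((a1, b1), (a2, b2)))
    return pairs

# Hypothetical toy data: best hits in each direction and gene coordinates.
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b9"}
best_b_to_a = {"b1": "a1", "b2": "a2", "b9": "a7"}
pos_a = {"a1": (100, 900), "a2": (1000, 1800), "a3": (5000, 5600)}
pos_b = {"b1": (50, 800), "b2": (950, 1700), "b9": (9000, 9500)}

bbhs = bidirectional_best_hits(best_a_to_b, best_b_to_a)
pcbbhs = close_bbh_pairs(bbhs, pos_a, pos_b)
print(bbhs)
print(pcbbhs)
```

Here a3 is dropped because its best hit b9 points back to a different gene, and the two remaining best-hit pairs form one close pair because their gaps (100 bp and 150 bp) are both under the 300 bp threshold.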

6
This cluster contains 54 genes that differ in
operon, pathway and strand information.
Gene ID   Strand  Operon  Pathway        Description
16128413  -       84      path:eco00190  protoheme IX farnesyltransferase (haeme O biosynthesis)
16128414  -       84      path:eco00190  cytochrome o ubiquinol oxidase subunit IV
16128415  -       84      path:eco00190  cytochrome o ubiquinol oxidase subunit III
...
16128423          85                     ATP-dependent specificity component of clpP serine protease, chaperone
16128424          85                     DNA-binding, ATP-dependent protease La heat shock K-protein
16128425          85                     DNA-binding protein HU-beta, NS1 (HU-1)
16128426          85                     peptidyl-prolyl cis-trans isomerase D
...
16128433          86      path:eco02010  ATP-binding component of a transport system
16128434          86      path:eco02010  putative ATP-binding component of a transport system
16128435          87                     nitrogen regulatory protein P-II 2
16128436          87                     probable ammonium transporter
16128437  -               path:eco00632  acyl-CoA thioesterase II
...
16128450  -       90                     orf, hypothetical protein
16128451  -       90                     primosomal replication protein N''
16128454          91      path:eco00230  DNA polymerase III, tau and gamma subunits DNA elongation factor III
16128455          91                     orf, hypothetical protein
16128456          91                     recombination and repair
...
16128474          94      path:eco02010  putative ATP-binding component of a transport system
16128477  -       95                     putative oxidoreductase
16128478  -       95      path:eco00632  acyl-CoA thioesterase I also functions as protease I
16128479          96      path:eco02010  putative ATP-binding component of a transport system
7
  • Predicted clusters are often too long and need to
    be dissected, but how?
  • Goal: predict biologically meaningful gene
    clusters from conserved gene clusters.
  • A conserved gene cluster depends strongly on the
    phylogenetic distance between the two genomes, and
    it often contains multiple biologically meaningful
    clusters.
  • Our method applies a clustering technique based on
    gene ontology information.
  • Results from our method are shown to be
    biologically meaningful in terms of operons (sets
    of genes transcribed as a single unit) and
    biological pathways.

8
GO: Gene Ontology
  • The GO project has developed three structured
    controlled vocabularies (ontologies) that
    describe gene products in terms of their
    associated biological processes, cellular
    components and molecular functions in a
    species-independent manner.
  • The ontologies are structured as directed acyclic
    graphs.
  • GO terms can be linked by different types of
    relationships: is_a, part_of.
  • Each gene has more than one GO term, spread over
    the different ontologies and over different
    levels of the hierarchy.
  • Here the UniProt IDs have been used as keys to
    get the GO terms of each gene.

9
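Semantic Similarity Value (SS)
  • There are different methods to calculate the
    semantic similarity value.
  • Resnik: based solely on the information
    content of the shared parents of the two terms. If
    there is more than one shared parent, the one with
    the minimum probability p(t), that is the highest
    information content, is taken. The similarity
    score is then derived as follows

\[ \mathrm{sim}_{\mathrm{Resnik}}(t_1, t_2) = -\ln p_{ms}(t_1, t_2), \qquad p_{ms}(t_1, t_2) = \min_{t \in S(t_1, t_2)} p(t) \]

where S(t1, t2) is the set of parent terms shared
by t1 and t2.
Lin and Jiang: both methods use not only the
information content of the shared parents, but
also that of the query terms

\[ \mathrm{sim}_{\mathrm{Lin}}(t_1, t_2) = \frac{2 \ln p_{ms}(t_1, t_2)}{\ln p(t_1) + \ln p(t_2)} \]

\[ d_{\mathrm{Jiang}}(t_1, t_2) = 2 \ln p_{ms}(t_1, t_2) - \ln p(t_1) - \ln p(t_2) \]

where p(t1), p(t2) and p(t) are the probabilities
from which the information content values -ln p of
t1, t2 and their parents are computed,
respectively.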
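The information-content measures can be tried out on a small example. The sketch below uses the commonly cited forms (Resnik: -ln of the smallest probability among shared ancestors; Lin: that value rescaled by the two query terms; Jiang: a distance built from the same quantities). The toy ontology and the probabilities are made up for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents.
parents = {"root": set(), "A": {"root"}, "B": {"root"}, "C": {"A"}, "D": {"A", "B"}}
# Hypothetical annotation probabilities p(t); information content is -ln p(t).
p = {"root": 1.0, "A": 0.25, "B": 0.4, "C": 0.1, "D": 0.2}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def p_ms(t1, t2):
    """Smallest probability among the terms shared by both ancestor sets."""
    return min(p[t] for t in ancestors(t1) & ancestors(t2))

def resnik(t1, t2):
    return -math.log(p_ms(t1, t2))

def lin(t1, t2):
    return 2 * math.log(p_ms(t1, t2)) / (math.log(p[t1]) + math.log(p[t2]))

def jiang_distance(t1, t2):
    return 2 * math.log(p_ms(t1, t2)) - math.log(p[t1]) - math.log(p[t2])

print(resnik("C", "D"), lin("C", "D"), jiang_distance("C", "D"))
```

For C and D the shared ancestors are A and root, so the most informative shared term is A with p = 0.25.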
10
Method used here (by James Z. Wang, Zhidian Du et al.)
  • The semantics of a GO term is determined by its
    location in the entire GO graph and its semantic
    relations with all of its ancestor terms.
  • So we use the subgraph that starts at the
    specific GO term and ends at the root (biological
    process, cellular component or molecular function).
  • In this study I have worked with molecular
    function GO terms.

DAG_A = (A, T_A, E_A): T_A is the set of GO terms
including A and all its ancestors in the subgraph;
E_A is the set of edges.
SV(A) = 4.52
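A minimal sketch of how a semantic value SV(A) can be computed on the subgraph DAG_A, following the published description of the Wang et al. method: the term itself contributes 1, each ancestor contributes the largest product of edge weights along any path down to the term, and SV is the sum of these contributions. The edge weights 0.8 (is_a) and 0.6 (part_of) are the values reported in that paper; the toy graph and the function names are hypothetical.

```python
# Edge weights as reported by Wang et al. (2007).
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy GO subgraph: child -> list of (parent, relation).
parents = {
    "A": [("B", "is_a"), ("C", "part_of")],
    "B": [("root", "is_a")],
    "C": [("root", "is_a")],
    "root": [],
}

def s_values(term):
    """Contribution of every term in DAG_term to the semantics of `term`."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents[child]:
            candidate = WEIGHTS[relation] * s[child]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)  # propagate the improved value upward
    return s

def semantic_value(term):
    """SV(term): sum of all contributions in DAG_term."""
    return sum(s_values(term).values())

def term_similarity(a, b):
    """Shared contributions divided by the two semantic values."""
    sa, sb = s_values(a), s_values(b)
    shared = sa.keys() & sb.keys()
    return sum(sa[t] + sb[t] for t in shared) / (semantic_value(a) + semantic_value(b))

print(s_values("A"))
print(semantic_value("A"))
print(term_similarity("B", "C"))
```

In this toy graph the root is reached from A along two paths (0.8 x 0.8 and 0.6 x 0.8) and the larger value, 0.64, is kept, giving SV(A) = 1 + 0.8 + 0.6 + 0.64 = 3.04.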
11
Sim(ADh4, Ldb3) = 0.693 (example from the paper)
max: 0.427 0.427 0.664 0.814 0.482 0.664 0.482
max: 0.664 0.664 0.814 0.390 0.480
Here I have used the online tool to measure the
semantic similarity value for each pair of genes
based on their GO terms. I made a matrix of
semantic similarity values for each group of
genes. The values are normalized between 0 and 1.
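The gene-level similarity combines term-level similarities. A minimal sketch, assuming the best-match average described in the Wang et al. paper: each GO term of one gene is matched to its most similar term in the other gene, and the maxima from both directions are averaged. The term-by-term matrix below is invented for illustration and the function name is hypothetical.

```python
def gene_similarity(term_sim):
    """term_sim[i][j]: similarity of term i of gene 1 and term j of gene 2."""
    m, n = len(term_sim), len(term_sim[0])
    row_max = [max(row) for row in term_sim]
    col_max = [max(term_sim[i][j] for i in range(m)) for j in range(n)]
    return (sum(row_max) + sum(col_max)) / (m + n)

# Hypothetical term-by-term similarities: gene 1 has 3 terms, gene 2 has 2.
toy = [
    [1.0, 0.2],
    [0.4, 0.6],
    [0.0, 0.5],
]
print(gene_similarity(toy))
```

Because every term-level value lies between 0 and 1, the averaged gene-level value does too, which matches the normalization mentioned above.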
12
  • Make the Cluster based on Semantic Similarity
    Matrix

      1      2      3      4      5      6      7      8      9      10
1   1.000  0.250  0.000  0.000  0.000  0.000  0.313  0.571  0.433  0.250
2   0.250  1.000  0.000  0.000  0.000  0.000  0.000  0.250  0.278  0.188
3   0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
4   0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
5   0.000  0.000  0.000  0.000  1.000  0.500  0.000  0.000  0.000  0.000
6   0.000  0.000  0.000  0.000  0.500  1.000  0.000  0.000  0.000  0.000
7   0.313  0.000  0.000  0.000  0.000  0.000  1.000  0.313  0.222  0.438
8   0.571  0.250  0.000  0.000  0.000  0.000  0.313  1.000  0.900  0.286
9   0.433  0.278  0.000  0.000  0.000  0.000  0.222  0.900  1.000  0.233
10  0.250  0.188  0.000  0.000  0.000  0.000  0.438  0.286  0.233  1.000

Clustering Result
Value  Genes
0.9    8 9
0.2    1 2
0.4    7 10
0.5    6 5
This method groups the genes based on their SS
values, in descending order (0.9 to 0.1), so each
gene is grouped based on its highest SS value. The
genes with an SS value of 0 are omitted at this step.
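One reading of this grouping step that reproduces the clustering result shown above is a greedy pass over the gene pairs in descending SS order, pairing two genes only if neither is grouped yet and labelling the group with the SS value cut to one decimal. This is an assumption about the procedure, not a confirmed description of it; the function name is hypothetical, and only the non-zero off-diagonal entries of the matrix are listed.

```python
# Non-zero upper-triangle entries of the semantic similarity matrix above.
pairs = {
    (1, 2): 0.250, (1, 7): 0.313, (1, 8): 0.571, (1, 9): 0.433, (1, 10): 0.250,
    (2, 8): 0.250, (2, 9): 0.278, (2, 10): 0.188,
    (5, 6): 0.500,
    (7, 8): 0.313, (7, 9): 0.222, (7, 10): 0.438,
    (8, 9): 0.900, (8, 10): 0.286,
    (9, 10): 0.233,
}

def group_by_highest_ss(pairs):
    """Greedily pair genes in descending SS order; skip genes already grouped."""
    grouped, result = set(), []
    for (g1, g2), ss in sorted(pairs.items(), key=lambda item: -item[1]):
        if ss > 0 and g1 not in grouped and g2 not in grouped:
            grouped.update((g1, g2))
            level = int(round(ss * 1000)) // 100 / 10  # cut to one decimal
            result.append((level, (g1, g2)))
    return result

clusters = group_by_highest_ss(pairs)
for level, genes in clusters:
    print(level, genes)
```

Genes 3 and 4 never appear because all of their SS values are 0. Gene 1 ends up with gene 2 at level 0.2 because its stronger partners (8, 9 and 7) are already taken by higher-scoring pairs.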
HCluster
  • This is one of the features of R that builds
    clusters based on the dissimilarity values of a
    group of elements. I have used it to visualize
    the clustering based on my semantic similarity
    matrix.

13
Hcluster visualization
14
Now each EGGS cluster is grouped based on the
semantic similarity value. I made a key of the form
FirstGenome.SecondGenome.EggsClusterNumber.SSvalue,
for example ESC12S0.8 = E. coli, Salmonella,
cluster 12, subcluster 0.8.
In this study I used clusters from four pairs of
genomes: E. coli - Salmonella, E. coli - Yersinia,
E. coli - Shigella and E. coli - Shewanella.
I gathered all existing keys for each gene in the
E. coli genome. More conserved genes have more
keys across the four groups.

Gene ID and keys (dashed lines mark break points):
16131330  ESGc102s0.8  ESc125s0.8               EYc25s0.8
16131335  ESGc102s0.8  ESc125s0.8  EShc106s0.6  EYc25s0.8
--------------------------------------------------------- break point
16131350  ESGc102s0.8  ESc126s0.8  EShc107s0.8  EYc99s0.3
--------------------------------------------------------- break point
16131351  ESGc102s0.9                           EYc99s0.3
--------------------------------------------------------- break point
16131352  ESGc102s0.9                           EYc99s0.5

15
Break Point and Cluster Score
  • Break points are defined on the target genome
    (E. coli). Break points are the genes at which
    the keys change, in either the cluster number
    or the subcluster value.
  • All break points are collected and redundancies
    are removed.
  • Formula for gene set score:

\[ \text{gene set score} = \frac{\left( \dfrac{\#\,\text{same keys inside the cluster}}{\#\,\text{same keys outside the cluster}} \right)^{2}}{\text{size of cluster (number of genes)}} \]
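The break point definition can be sketched in a few lines. The key pattern (a genome-pair prefix, `c` plus the cluster number, `s` plus the subcluster value) is read off the example keys in the slides. Comparing only the genome pairs that are present for both neighbouring genes is an assumption, chosen because it reproduces the break points marked in the example; the function names are hypothetical.

```python
import re

# Key pattern inferred from the example keys, e.g. "EShc106s0.6".
KEY = re.compile(r"^([A-Za-z]+?)c(\d+)s([\d.]+)$")

def parse_keys(keys):
    """Map genome-pair prefix -> (cluster number, subcluster value)."""
    parsed = {}
    for key in keys:
        prefix, cluster, sub = KEY.match(key).groups()
        parsed[prefix] = (int(cluster), sub)
    return parsed

def break_points(genes):
    """genes: ordered (gene_id, keys) pairs; return gene ids where a key changes."""
    points, prev = [], None
    for gene_id, keys in genes:
        cur = parse_keys(keys)
        # Assumption: only genome pairs present for both genes are compared.
        if prev is not None and any(prev[k] != cur[k] for k in prev.keys() & cur.keys()):
            points.append(gene_id)
        prev = cur
    return points

# Keys copied from the example on the previous slide.
genes = [
    (16131330, ["ESGc102s0.8", "ESc125s0.8", "EYc25s0.8"]),
    (16131335, ["ESGc102s0.8", "ESc125s0.8", "EShc106s0.6", "EYc25s0.8"]),
    (16131350, ["ESGc102s0.8", "ESc126s0.8", "EShc107s0.8", "EYc99s0.3"]),
    (16131351, ["ESGc102s0.9", "EYc99s0.3"]),
    (16131352, ["ESGc102s0.9", "EYc99s0.5"]),
]
points = break_points(genes)
print(points)
```

Because the function returns each gene id at most once, collecting break points over all genome pairs in a single pass also takes care of removing redundancies.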

16
Breakpoint1-breakpoint2  Key         Inner gene  Outer gene  Size  Gene set score
16127996-16128002        EYc174s0.6  2           2           5     1
16127996-16128002        ESc3s0.6    2           4           5     0.36
16127996-16128002        EYc174s0.3  2           2           5     1
16127996-16128002        ESc3s0.3    2           2           5     1
16127996-16128002        EShc3s0.4   3           3           5     1

Break point interval score = sum of gene set
scores / number of genes = 4.36 / 5 = 0.872

Break point interval  Interval score
16127996-16127997     0.583
16127996-16127998     0.830
16127996-16128000     0.901
16127996-16128002     0.872
16127996-16128008     0.815
16127996-16128014     0.782
16127996-16128019     0.840
16127996-16128020     0.889
16127996-16128021     0.939
16127996-16128025     0.94
16127996-16128026     0.920
16127996-16128029     0.870
16127996-16128030     0.846
16127996-16128035     0.760
16127996-16128042     0.709
  • Each group is defined as the genes between a
    break point and the 5th, 10th, 15th break point
    ahead.
  • Here there are 15 break points in the group.
17
  • Problem definition
  • Any pair of break points can define a
    functionally related gene set, but there are too
    many candidates, O(n^2) for n break points.
  • We formulate the problem of functional gene set
    prediction as generating a maximal cover of genes
    based on the break point interval score.
  • This problem is similar to the exon chaining
    problem, which predicts exons from a number of
    intron-exon boundaries.
  • Thus we used a one-dimensional dynamic
    programming technique to solve the functional
    gene set prediction problem.
  • Select non-overlapping break point
    intervals that maximize the sum of break point
    interval scores.

18
One dimensional dynamic programming

In each group (each break point paired with the
next 5th, ... break point), the four intervals with
the highest scores are chosen as blocks for the
dynamic programming. The dynamic programming takes
each block as a potential cluster, with its start
and stop positions and its weight (the break
point interval score), and finally generates the
set of clusters with the highest total score. The
algorithm is modified to suit our data, for
example to handle overlaps at end points.
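The selection of non-overlapping blocks with maximal total score is the classic weighted interval scheduling problem, which is also the core of exon chaining. The sketch below is a generic version of that dynamic programme, not the modified algorithm used in this work. The `allow_shared_endpoints` flag is a hypothetical stand-in for the end-point handling mentioned above, and the block coordinates are invented gene indices.

```python
from bisect import bisect_left, bisect_right

def best_chain(blocks, allow_shared_endpoints=True):
    """blocks: (start, stop, weight) tuples. Return (total weight, chosen blocks)."""
    blocks = sorted(blocks, key=lambda b: b[1])   # order by stop position
    stops = [b[1] for b in blocks]
    best = [(0.0, [])]                            # best[i]: optimum over first i blocks
    for i, (start, stop, weight) in enumerate(blocks):
        # k = number of earlier blocks that end before this block starts.
        if allow_shared_endpoints:
            k = bisect_right(stops, start, 0, i)  # earlier stop <= start
        else:
            k = bisect_left(stops, start, 0, i)   # earlier stop < start
        take = (best[k][0] + weight, best[k][1] + [(start, stop, weight)])
        skip = best[i]
        best.append(max(take, skip, key=lambda option: option[0]))
    return best[-1]

# Hypothetical blocks: (start index, stop index, break point interval score).
blocks = [(0, 3, 0.90), (3, 6, 0.80), (2, 5, 0.95), (6, 9, 0.70)]

total_shared, chosen_shared = best_chain(blocks)
total_strict, chosen_strict = best_chain(blocks, allow_shared_endpoints=False)
print(total_shared, chosen_shared)
print(total_strict, chosen_strict)
```

When blocks may touch at a break point, the three consecutive blocks are chained; when they may not, the single best middle block plus the last block wins. Sorting dominates the cost, so the chain is found in O(n log n) for n blocks rather than by enumerating all subsets.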
19
  • One more step to refine the predicted clusters:
    strand information.
  • Reference: Connected gene neighborhoods in
    prokaryotic genomes, Nucleic Acids Research,
    2002, Vol. 30, No. 10, 2212-2223.
  • Genes that have the same function are
    in the same direction.
  • So the strand information of the E. coli genome,
    as the target, is used to dissect each cluster.
  • In this step the clusters are dissected
    based on the strand information.
  • New clusters with only one gene are removed.

20
Gene Id   Start Position  End Position  Strand  Operon ID  Pathway
16132180  4595173         4597425       -
16132182  4598261         4598998       -       787
16132183  4599001         4599540       -       787
16132188  4602898         4603686       -
16132189  4604692         4605723       -
-----------------------------------------------------------------
16132190  4605826         4606239               789        eco00230
16132191  4606208         4606654               789
16132192  4606669         4607346               789        eco00230
16132193  4607437         4609026
16132195  4610434         4611507
-----------------------------------------------------------------
16132196  4612703         4613566       -       790
-----------------------------------------------------------------
16132198  4615346         4616125               791        eco00030
16132199  4616252         4617574               791        eco00230
16132200  4617626         4618849               791        eco00030
16132201  4618906         4619625               791        eco00230
-----------------------------------------------------------------
16132203  4621124         4622140       -                  eco00785
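The strand-based dissection can be sketched as splitting a cluster into runs of consecutive genes on the same strand and dropping runs of a single gene. The gene ids come from the table above; treating rows without a "-" as forward-strand ("+") genes is an assumption, and the function name is hypothetical.

```python
from itertools import groupby

def dissect_by_strand(genes, min_size=2):
    """Split ordered (gene_id, strand) pairs into same-strand runs; drop small runs."""
    runs = [[gene_id for gene_id, _ in run]
            for _, run in groupby(genes, key=lambda gene: gene[1])]
    return [run for run in runs if len(run) >= min_size]

# Gene ids from the table; "+" is assumed where no "-" strand is shown.
cluster = [
    (16132180, "-"), (16132182, "-"), (16132183, "-"), (16132188, "-"), (16132189, "-"),
    (16132190, "+"), (16132191, "+"), (16132192, "+"), (16132193, "+"), (16132195, "+"),
    (16132196, "-"),
    (16132198, "+"), (16132199, "+"), (16132200, "+"), (16132201, "+"),
    (16132203, "-"),
]
subclusters = dissect_by_strand(cluster)
for run in subclusters:
    print(run)
```

The two single-gene runs (16132196 and 16132203) are removed, leaving three subclusters.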

21
Predicted gene clusters are verified in terms of
  • Definition of each gene (NCBI)
  • Operon information
  • Detecting uber-operons in prokaryotic
    genomes, Dongsheng Che, Guojun Li, Nucleic Acids
    Research, 2006
  • Database: http://csbl.bmb.uga.edu/uber/
  • This DB groups genes based on the
    operons they belong to. Each uber-operon group
    represents a rich set of footprints of operon
    evolution.
  • KEGG Pathway
  • A metabolic pathway is a series of chemical
    reactions occurring within a cell. In each
    pathway, a principal chemical is modified by
    chemical reactions. Enzymes catalyze these
    reactions.
  • Database: http://www.genome.jp/kegg/
  • The absence of information for non-enzyme genes
    makes it less useful.

22
Summary
                    EGGS (E. coli - Salmonella)  Our Method
Number of clusters  167                          483
Gene range          2-130 (2-50)                 2-25 (2-10)
Operon ID range     0-42                         0-6
23
Conclusion
  • By dissecting large conserved clusters we obtain
    functionally meaningful clusters of related
    genes, without having to worry about the
    phylogenetic distance between the genomes.

24
Literature
  • Resnik P: Semantic similarity in a taxonomy: an
    information-based measure and its application to
    problems of ambiguity in natural language. J
    Artif Intell Res 1999, 11:95-130.
  • Lin D: An information-theoretic definition of
    similarity. In: International Conference on
    Machine Learning 1998; San Francisco: Morgan
    Kaufmann; 1998: 296-304.
  • Jiang JJ, Conrath DW: Semantic similarity based
    on corpus statistics and lexical taxonomy. In:
    Proceedings of 10th International Conference on
    Research In Computational Linguistics; Taiwan;
    1997: 19-33.
  • Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F: A
    new method to measure the semantic similarity of
    GO terms. Bioinformatics 2007, 23(10):1274-1281.
  • EGGS: Extraction of Gene clusters using Genome
    context based Sequence matching techniques.
    Kwangmin Choi, Bharath Kumar Maryada, Sun Kim.
  • Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori
    M: The KEGG resource for deciphering the
    genome. Nucl Acids Res 2004, 32(90001):D277-280.
    Database: http://www.genome.jp/kegg/
  • Connected gene neighborhoods in prokaryotic
    genomes. Nucleic Acids Research, 2002, Vol. 30,
    No. 10, 2212-2223.
  • Yuri I. Wolf, Igor B. Rogozin, Alexey S.
    Kondrashov, and Eugene V. Koonin: Genome
    Alignment, Evolution of Prokaryotic Genome
    Organization, and Prediction of Gene Function
    Using Genomic Context. Genome Research 11(3):
    356-372 (2001).
  • Dongsheng Che, Guojun Li: Detecting uber-operons
    in prokaryotic genomes. Nucleic Acids
    Research, 2006.

25
Online resources
  • http://bioinformatics.clemson.edu/G-SESAME
  • http://csbl.bmb.uga.edu/uber/
  • http://www.geneontology.org/
  • http://bioconductor.org
  • http://www.r-project.org
  • http://platcom.org/EGGS
  • http://www.genome.jp/kegg/
  • http://www.ncbi.nlm.nih.gov/

26
Thanks
  • Professor Sun Kim
  • Professor Dalkilic
  • Kwangmin Choi, Youngik Yang
  • Professor Tang, Professor Radivojac and all other
    Informatics faculty
  • Informatics staff, Ms. Linda Hostetter
  • All graduate students (my friends)
  • Professor Kehoe
  • School of Informatics