GO Fuzzy c-means Clustering Application

Slides: 16
Provided by: may973

1
GO Fuzzy c-means Clustering Application
  • Presented by
  • Neha Somani

2
Outline of the presentation
  • Clustering and different algorithms (In general)
  • GO Fuzzy c-means algorithm (My understanding)
  • Basic details about the algorithm
  • Differences from Fuzzy c-means algorithm.
  • The assigned project: Application for the
    underlying algorithm
  • What's already there.
  • What I plan to do.

3
Clustering
  • Clustering - Classification of objects into
    different groups such that elements of each group
    share some common features.
  • In gene expression data, it is used to observe
    simple gene-to-gene and sample-to-sample
    similarities so that potential functions of genes
    can be found.
  • An important criterion for any clustering
    algorithm is the distance measure, which
    determines how similar two elements are, e.g.
    the Euclidean distance.
  • Commonly used stopping criteria: distance-wise
    or number-wise.
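
As an illustration of the distance measure mentioned above, here is a minimal sketch of the Euclidean distance between two gene expression vectors. The function name and the sample values are made up for this example and do not come from the presentation.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical expression profiles of two genes across three samples.
gene_a = [1.0, 2.0, 3.0]
gene_b = [4.0, 6.0, 3.0]
distance = euclidean_distance(gene_a, gene_b)  # sqrt(9 + 16 + 0) = 5.0
```

A smaller distance means the two genes behave more alike across the samples.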

4
Standard Clustering Methodologies
  • Hierarchical
    • Agglomerative (bottom-up approach)
    • Divisive (top-down approach)
  • Partitional
    • K-means and its derivatives
      • K-means
      • Fuzzy c-means
      • QT clustering
    • Locality-sensitive hashing
    • Graph-theoretic methods
  • Other common classifications
    • Symmetric / asymmetric distances
    • Two-way clustering: clusters objects as well as
      the features of the objects.
  • Source: Wikipedia, Cluster Analysis
5
GO Fuzzy c-means algorithm
  • It's an extension of Fuzzy c-means that uses prior
    biological knowledge.
  • Allows assignment of multiple gene functions and
    incorporates prior biological knowledge, so that
    genes can be associated with more than one
    biological function, which helps capture the
    actual behavior of genes.
  • The prior knowledge used here is obtained from
    Gene Ontology and is used for initial membership
    assignment. (Can support other sources of
    knowledge as well.)
  • Uses the biological process ontology.
    Specifically, GOSlim biological process terms are
    used to interpret the functions of genes at a
    higher level.
  • GOSlim terms: a set of general GO terms for
    various organisms and generic use, e.g. Cell
    cycle, Meiosis.
  • Developed by Luis Tari under guidance of Dr. Kim
    and Dr. Baral.
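
One way to picture interpreting gene functions "at a higher level" is to walk from a specific GO term up its parents until a term in the GOSlim set is reached. The sketch below is only an illustration: the function name, the toy term names and the choice to stop climbing at the first GOSlim term on each path are assumptions, not details taken from the presentation or the paper.

```python
from collections import deque


def find_goslim_ancestors(term, parents, goslim_terms):
    """Return the nearest GOSlim terms reachable by walking up from `term`.

    parents: dict mapping a term to a list of its direct parent terms.
    goslim_terms: set of terms that act as clusters.
    If `term` itself is a GOSlim term, it is returned as is.
    """
    found = set()
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in goslim_terms:
            found.add(current)
            continue  # assumption: do not climb past a GOSlim term
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


# Toy hierarchy with made-up term names (not real GO identifiers).
toy_parents = {
    "mitotic_prophase": ["mitosis"],
    "mitosis": ["cell_cycle"],
    "cell_cycle": ["biological_process"],
}
toy_goslim = {"cell_cycle", "meiosis"}
clusters = find_goslim_ancestors("mitotic_prophase", toy_parents, toy_goslim)
```

Because a GO term can have several parents, the result is a set: a gene may land in more than one cluster, which matches the idea of multiple gene functions.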

6
Some important terms
  • GOSlim
  • Narrows down the clustering to general biological
    processes.
  • For experimental purposes, 32 GOSlim biological
    process terms have been used.
  • GO Annotations
  • Annotations provided by experts.
  • Evidence codes determine the degree of belief
    (0 < p_ij < 1) of these annotations.
  • E.g. IEA - 0.5, NAS - 0.6, IC - 0.7, IDA - 0.9
  • Two customizable parameters play an important
    role in initialization (range: 0 < alpha, r < 1)
  • Alpha: degree of dependency.
  • Alpha = 0 implies that the initialization of
    membership is totally dependent on gene
    annotation; it becomes less dependent as alpha
    approaches 1.
  • r: degree of belief when a gene is not
    associated with a particular biological process.
  • The value of r should be small, but not too
    small.
  • Validity of cluster measure S
  • A minimal value of S implies that the cluster is
    most compact and the furthest separation exists
    between clusters (intra- and inter-cluster
    distances).
  • Stopping Criteria
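
The terms above can be made concrete with a small lookup sketch. It uses only the four evidence-code values quoted on this slide and falls back to r when a gene is not annotated with a process. The function name, the gene name and the value r = 0.1 are hypothetical. How alpha combines these beliefs into the initial membership is given in the paper and is not reproduced here.

```python
# Degrees of belief quoted on the slide; other evidence codes are not listed there.
EVIDENCE_BELIEF = {"IEA": 0.5, "NAS": 0.6, "IC": 0.7, "IDA": 0.9}


def annotation_belief(gene, process, annotations, r):
    """Degree of belief p_ij that `gene` is associated with `process`.

    annotations: dict mapping (gene, process) to an evidence code.
    r: belief used when the gene is not annotated with the process.
    """
    if not 0 < r < 1:
        raise ValueError("r must lie strictly between 0 and 1")
    code = annotations.get((gene, process))
    if code is None:
        return r
    return EVIDENCE_BELIEF[code]


# Hypothetical annotation table with one made-up gene.
toy_annotations = {("geneA", "cell cycle"): "IDA"}
annotated = annotation_belief("geneA", "cell cycle", toy_annotations, r=0.1)
unannotated = annotation_belief("geneA", "meiosis", toy_annotations, r=0.1)
```

Here the annotated pair gets the belief of its evidence code, while the pair without an annotation gets r.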

7
Algorithm steps
  • Initial membership assignment
  • Each GOSlim biological process term (GOBP) is
    considered as a cluster. (Thus, number of
    clusters is automatically determined).
  • Suppose as per the annotation, gene g is
    associated with a process bp. Now using GO
    hierarchy the parent (direct or indirect) sbp of
    bp is found such that sbp belongs to GOBP. In
    this case g is assigned to the process sbp and
    thus associated with the corresponding cluster.
  • The actual membership value is calculated based
    on evidence codes (degree of belief), alpha
    (degree of dependency) and r (constant that acts
    as degree of belief when g is not associated with
    a process b).
  • Repeat the following steps until the stopping
    criterion is met.
  • At each step (k), compute the fuzzy centroid. The
    parameters involved are m (fuzzy parameter), the
    expression vector of each gene, the number of
    genes, and the current membership.
  • Update the fuzzy membership based on the
    expression vector, centroid and initial
    membership.
  • Calculate the validity of cluster measure (S).
  • If the validity measure S computed at step k is
    less than the best value so far (S*), then assign
    S* = S and update the optimal clusters and
    memberships with the values calculated above (in
    the kth step).
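
The iterative part of the steps above can be sketched roughly as follows. This is not the GO Fuzzy implementation: the membership update is the standard fuzzy c-means formula and leaves out the initial-membership (annotation) term, the validity measure is a Xie-Beni style ratio of compactness to separation chosen as an assumption, and the stopping rule (small membership change or an iteration cap) is also assumed. All names and the toy data are hypothetical.

```python
import math


def fcm_with_validity(data, memberships, m=2.0, max_iter=50, tol=1e-5):
    """Fuzzy c-means from given initial memberships, keeping the best-S iteration.

    data: list of n expression vectors.
    memberships: n x c list of lists with initial memberships (c >= 2).
    """
    n, c, dim = len(data), len(memberships[0]), len(data[0])
    u = [row[:] for row in memberships]
    best_s, best_u, best_centroids = math.inf, None, None
    for _ in range(max_iter):
        # Fuzzy centroid: mean of expression vectors weighted by membership^m.
        centroids = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centroids.append(
                [sum(w[i] * data[i][d] for i in range(n)) / total for d in range(dim)]
            )
        # Squared gene-to-centroid distances, floored to avoid division by zero.
        d2 = [
            [max(sum((x - v) ** 2 for x, v in zip(data[i], centroids[j])), 1e-12)
             for j in range(c)]
            for i in range(n)
        ]
        # Standard FCM membership update (no annotation term in this sketch).
        new_u = [
            [1.0 / sum((d2[i][j] / d2[i][k]) ** (1.0 / (m - 1)) for k in range(c))
             for j in range(c)]
            for i in range(n)
        ]
        # Validity S = compactness / separation (Xie-Beni style, an assumption).
        compactness = sum(
            new_u[i][j] ** m * d2[i][j] for i in range(n) for j in range(c)
        ) / n
        separation = min(
            sum((a - b) ** 2 for a, b in zip(centroids[j], centroids[k]))
            for j in range(c) for k in range(c) if j != k
        )
        s = compactness / separation
        if s < best_s:  # S* starts at infinity, so the first step always wins
            best_s, best_u, best_centroids = s, new_u, centroids
        change = max(abs(new_u[i][j] - u[i][j]) for i in range(n) for j in range(c))
        u = new_u
        if change < tol:
            break
    return best_centroids, best_u, best_s


# Hypothetical toy data: four genes, two samples, two clusters.
expression = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
initial = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
centroids, memberships, s = fcm_with_validity(expression, initial)
```

Because the initial memberships are supplied rather than drawn at random, running the sketch twice on the same input gives the same result.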

8
Results: Output
  • On reaching the stopping criterion, the optimal
    clusters and memberships are provided as output.
  • A cluster is considered optimal if the validity
    measure (S, initialized to infinity) of the
    cluster is minimal among the iterations.
  • For the mathematical equations and the results of
    the experimental evaluation, please refer to the
    paper. (It's beyond the scope of this
    presentation.)

9
Major differences from Fuzzy c-means
  • In fuzzy c-means, the initial assignment of
    memberships is random, while GOFuzzy uses gene
    annotations for initialization.
  • This ensures that the results generated are
    repeatable.
  • It uses predefined classes based on GOSlim
    biological processes, so the number of clusters
    is not needed as input from the user.
  • In the iterative steps, the membership is updated
    based on data as well as GO annotations.
  • Flexible degree of reliance on the GO
    annotations for clustering, using alpha (degree
    of dependency) and r (degree of belief).
  • GOFuzzy eliminates the extra effort to identify
    the functions associated with the clusters as
    they are pre-defined based on biological
    processes.

10
The assigned project: Application for the
underlying algorithm
  • Developing a user-friendly and robust interface
    for the clustering algorithm developed.
  • The functionalities should include:
  • Visualization of results
  • Regular updates of annotation files, data, etc.
    through the online repositories
  • Input parameters for user customization.
  • File input/output.
  • Report generation based on some input and
    generated parameters.
  • The objective is to make a stand-alone
    application for GO Fuzzy c-means clustering in
    order to make it useful for all users,
    especially biologists.

11
What's already present?
  • GOFuzzyv1.0 - A command line program for the
    algorithm has been written in Java (1.5) and
    runs fine.
  • Input file format
  • Tab-delimited .txt file
  • Tab-delimited file in Eisen's CDT format (.cdt
    file)
  • The command line program takes the input file
    name, taxonId, number of genes, number of samples
    and type of input file as input.
  • TaxonId: the current version can take yeast and
    human data sets
  • Uses static files saved in the working directory
    for
  • GO data files
  • GO Slim yeast file
  • GO annotation files
  • File containing probabilities for various
    evidence codes
  • Output
  • Three .txt output files are generated.
  • out-geneexp: indicates the initial assignment of
    genes according to the GO annotation file
  • multiple-out-geneexp: all genes assigned to the
    optimal clusters

12
What I plan to do?
  • Start with a basic version
  • Interface for file input.
  • The following input parameters can be set by the
    user (need more input)
  • alpha (0 < alpha < 1)
  • r (0 < r < 1)
  • Type of organism/species for which the dataset is
    to be clustered
  • Output
  • Display alpha and r in the output file. Also, as
    users can change these parameters, it is a good
    idea to name the generated file so that it
    indicates the values of alpha and r.
  • Generate output files that can be viewed in the
    Maple Tree software. (Luis has worked some more
    on this particular function.)
  • Display compactness and separation in the output
    file.
  • Build upon the basic version to add the required
    features
  • Use the Maple Tree source code or develop our own
    method so that the user can view the clustering
    output directly in our application and need not
    open another application to view the results.
  • Allow the user to get periodic/manual updates of
    GO Annotations (based on evidence codes), GO data
    files and GO Slim files.
  • Give users the flexibility to choose which
    version of the above-mentioned files they want to
    use.
  • Additional features (if time permits)
  • Tools to be implemented, or collaborations set
    up, to match gene names for species where they
    are not uniform across all datasets (e.g.
    human).
  • More input required in this area. GO Browser?
13
References
  • Tari L et al., Fuzzy c-means clustering with
    prior biological knowledge, J Biomed Inform
    (2008), doi:10.1016/j.jbi.2008.05.009
  • http://sysbio.fulton.asu.edu/gofuzzy/
  • http://en.wikipedia.org/wiki/Data_clustering
  • http://www.scholarpedia.org/article/Fuzzy_C-means_cluster_analysis
  • http://www.geneontology.org

14
Any Questions?
  • Hope there aren't any!
  • Thanks everyone!

15
Files for references