GO Fuzzy cmeans Clustering Application - PowerPoint PPT Presentation

1
GO Fuzzy c-means Clustering Application

Presented by
Neha Somani

2
Outline of the presentation

Clustering and different algorithms (In general)
GO Fuzzy c-means algorithm (My understanding)
Basic details about the algorithm
Differences from Fuzzy c-means algorithm.
The assigned project: an application for the
underlying algorithm
What's already there.
What I plan to do.

3
Clustering

Clustering - Classification of objects into
different groups such that elements of each group
share some common features.
In gene expression data, it is used to observe
simple gene-to-gene and sample-to-sample
similarities so that potential functions of genes
can be found.
An important criterion for any clustering
algorithm is the distance measure that helps in
determining the similarity of two elements, for
example the Euclidean distance.
Commonly used stopping criteria: distance-wise
or number-wise.
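
As a small illustration of the distance measure mentioned above, here is a minimal sketch of the Euclidean distance between two expression vectors. The function name and the example values are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical expression profiles of two genes across three samples.
gene_a = [1.0, 2.0, 3.0]
gene_b = [4.0, 6.0, 3.0]
distance = euclidean_distance(gene_a, gene_b)  # sqrt(9 + 16 + 0) = 5.0
```

A smaller distance means the two genes behave more alike across the samples.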

4
Standard Clustering Methodologies
Hierarchical: agglomerative (bottom-up approach),
divisive (top-down approach)
Partitional:
K-means and its derivatives: K-means, fuzzy
c-means, QT clustering
Locality-sensitive hashing
Graph-theoretic methods

Other common classifications
Symmetric/asymmetric distances
Two-way clustering: clusters objects as well as
the features of the objects.
Source: Wikipedia, Cluster Analysis

5
GO Fuzzy c-means algorithm

It is an extension of fuzzy c-means with prior
biological knowledge.
Allows assignment of multiple gene functions and
incorporates prior biological knowledge, so that
genes can be associated with more than one
biological function, which helps capture the
actual behavior of genes.
The prior knowledge used here is obtained from
Gene Ontology and is used for initial membership
assignment. (Can support other sources of
knowledge as well.)
Uses the biological process ontology.
Specifically, GOSlim biological process terms are
used to interpret the functions of genes at a
higher level.
GOSlim terms: a set of general GO terms for
various organisms and generic use, e.g. cell
cycle, meiosis.
Developed by Luis Tari under guidance of Dr. Kim
and Dr. Baral.
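
To make "interpreting functions at a higher level" concrete, here is a rough sketch of mapping a specific GO term to its GOSlim ancestors by walking parent links. The function name, the `parents` dictionary and the toy term names are all hypothetical and not taken from GO. Stopping the climb at the first GOSlim term on each path is an assumption of this sketch, not something stated in the slides.

```python
from collections import deque


def goslim_ancestors(term, parents, goslim_terms):
    """Return the GOSlim terms reachable from `term` via parent links.

    `parents` maps a term to a list of its direct parents (hypothetical
    structure); `goslim_terms` is the set of GOSlim terms in use.
    """
    found = set()
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in goslim_terms:
            found.add(current)
            continue  # assumption: stop climbing once a GOSlim term is hit
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


# Invented toy hierarchy, for illustration only.
parents = {
    "bp1": ["bp2"],
    "bp2": ["slimA", "bp3"],
    "bp3": ["slimB"],
    "slimA": ["root"],
    "slimB": ["root"],
}
goslim = {"slimA", "slimB"}
clusters_for_bp1 = goslim_ancestors("bp1", parents, goslim)
```

In this toy hierarchy a gene annotated with `bp1` would be associated with both `slimA` and `slimB`, which matches the idea that one gene can belong to more than one biological process.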

6
Some important terms

GOSlim
Narrows down the clustering to general biological
processes.
For experimental purposes, 32 GOSlim biological
process terms have been used.
GO Annotations
Annotations provided by experts.
Evidence codes determine the degree of belief
(0 < p_ij < 1) of these annotations.
E.g. IEA: 0.5, NAS: 0.6, IC: 0.7, IDA: 0.9
Two customizable parameters play an important
role in initialization (range 0 < alpha, r < 1)
Alpha: degree of dependency.
Alpha = 0 implies that initialization of membership
is totally dependent on gene annotation; it becomes
less dependent as alpha approaches 1.
r: degree of belief when a gene is not
associated with a particular biological process.
The value of r should be small, but not too
small.
Validity of Cluster Measure S
Minimal value of S implies that the cluster is
most compact and the furthest separation exists
between clusters. (Inter and Intra cluster
distances)
Stopping Criteria
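
The evidence-code idea above can be sketched as follows. The four code values are the ones listed on this slide; everything else is an assumption for illustration: the function names, the default r = 0.1, the cluster name "transport", and the final normalization so that a gene's values sum to 1. The role of alpha is defined in the paper and is deliberately left out here.

```python
# Degrees of belief per evidence code, as listed on the slide.
EVIDENCE_BELIEF = {"IEA": 0.5, "NAS": 0.6, "IC": 0.7, "IDA": 0.9}


def belief_row(gene_annotations, clusters, r=0.1):
    """Degree of belief of one gene for each cluster.

    `gene_annotations` maps a GOSlim term to the evidence code of the
    gene's annotation; clusters without an annotation get the constant r.
    """
    if not 0 < r < 1:
        raise ValueError("r must satisfy 0 < r < 1")
    row = []
    for term in clusters:
        code = gene_annotations.get(term)
        row.append(EVIDENCE_BELIEF[code] if code is not None else r)
    return row


def normalize(row):
    """Scale a row so it sums to 1 (an assumption of this sketch)."""
    total = sum(row)
    return [value / total for value in row]


# Hypothetical gene annotated to two of three clusters.
clusters = ["cell cycle", "meiosis", "transport"]
annotations = {"cell cycle": "IDA", "meiosis": "IEA"}
beliefs = belief_row(annotations, clusters)
initial = normalize(beliefs)
```

The unannotated cluster still receives a small non-zero value through r, so the data can later pull the gene toward it.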

7
Algorithm steps

Initial membership assignment
Each GOSlim biological process term (GOBP) is
considered as a cluster. (Thus, number of
clusters is automatically determined).
Suppose as per the annotation, gene g is
associated with a process bp. Now using GO
hierarchy the parent (direct or indirect) sbp of
bp is found such that sbp belongs to GOBP. In
this case g is assigned to the process sbp and
thus associated with the corresponding cluster.
The actual membership value is calculated based
on evidence codes (degree of belief), alpha
(degree of dependency) and r (constant that acts
as degree of belief when g is not associated with
a process b).
Repeat the following steps until the stopping
criterion is met.
At each step (k), compute the fuzzy centroid. The
parameters involved are m (fuzzy parameter),
expression vector for gene, number of genes, and
current membership.
Update the fuzzy membership based on expression
vector, centroid and initial membership.
Calculate the validity of cluster measure (S).
If the validity measure S computed at this step is
smaller than the best S found so far, then record
it as the new best S and update the optimal
cluster and memberships with the values calculated
above (in the kth step).
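
For orientation, here is a minimal sketch of one iteration using the standard fuzzy c-means update rules (fuzzy centroid, then membership update). This is plain fuzzy c-means, not the GO variant: per the steps above, GO Fuzzy c-means also feeds the initial membership into the update, and that formula is in the paper and not reproduced here. Function names and data are illustrative.

```python
import math


def fuzzy_centroids(data, memberships, m=2.0):
    """Fuzzy centroid of each cluster; memberships[i][j] is gene j in cluster i."""
    dims = len(data[0])
    centroids = []
    for u_i in memberships:
        weights = [u ** m for u in u_i]
        total = sum(weights)
        centroids.append([
            sum(w * x[d] for w, x in zip(weights, data)) / total
            for d in range(dims)
        ])
    return centroids


def update_memberships(data, centroids, m=2.0):
    """Standard fuzzy c-means membership update (requires m > 1)."""
    if m <= 1:
        raise ValueError("the fuzzy parameter m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    memberships = [[0.0] * len(data) for _ in centroids]
    for j, x in enumerate(data):
        dists = [math.dist(x, v) for v in centroids]
        zeros = [i for i, d in enumerate(dists) if d == 0.0]
        if zeros:
            # Gene sits exactly on a centroid: split full membership there.
            for i in zeros:
                memberships[i][j] = 1.0 / len(zeros)
            continue
        for i, d_i in enumerate(dists):
            memberships[i][j] = 1.0 / sum(
                (d_i / d_k) ** exponent for d_k in dists
            )
    return memberships


# Hypothetical one-dimensional expression values for four genes.
data = [[0.0], [1.0], [10.0], [11.0]]
start = [[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]]
centroids = fuzzy_centroids(data, start)
updated = update_memberships(data, centroids)
```

After the update, each gene's memberships across the clusters still sum to 1, and genes close to a centroid get a membership near 1 for that cluster.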

8
Results Output

On reaching the stopping criteria, Optimal
Clusters and memberships are provided as output.
A cluster is considered optimal if the validity
measure (S, initialized to infinity) of the
cluster is minimal among the iterations.
For the mathematical equations and the results of
the experimental evaluation, please refer to the
paper. (It's beyond the scope of this presentation.)

9
Major differences from Fuzzy c-means

In fuzzy c-means, initial assignment of
memberships is random while GOFuzzy uses gene
annotations for initialization.
This ensures that results generated are
repeatable.
It uses predefined classes based on GOSlim
biological processes, so the number of clusters
is not needed as input from the user.
In the iterative steps, the membership is updated
based on the data as well as the GO annotations.
Flexibility in how much the clustering relies on
the GO annotations, through alpha (degree of
dependency) and r (degree of belief).
GOFuzzy eliminates the extra effort to identify
the functions associated with the clusters as
they are pre-defined based on biological
processes.

10
The assigned project: an application for the
underlying algorithm

Developing a user friendly and robust interface
for the clustering algorithm developed.
The functionalities should include
Visualization of results
Regular updates of annotation files, data, etc.
through the online repositories
Input Parameters for user customization.
File input/output.
Report generation based on some input and
generated parameters.
The objective is to make a stand-alone
application for GO Fuzzy c-means clustering in
order to make it useful for all the users,
especially, biologists.

11
What's already present?

GOFuzzyv1.0 - a command-line program for the
algorithm has been written in Java (1.5) and
runs fine.
Input file format
Tab-delimited format .txt file
Tab-delimited format in Eisen's CDT format - .cdt
file
The command line program takes input file name,
taxonId, number of genes, number of samples and
type of input file as input.
TaxonId: the current version accepts yeast and
human data sets
Uses static files saved in the working directory
for
GO data files
GO Slim yeast file
GO annotation files
File containing probabilities for various
evidence codes
Output
Three .txt output files are generated.
out-geneexp: indicates the initial assignment of
genes according to the GO annotation file
multiple-out-geneexp: all genes assigned to the
optimal clusters
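
As a rough illustration of the plain tab-delimited .txt input, here is a sketch of a reader. The layout is an assumption, since the slides do not specify it: one header row, gene identifier in the first column, one numeric column per sample. The function name and the example values are made up, and the CDT format with its extra columns is not handled.

```python
import csv
import io


def read_expression_table(handle):
    """Read a tab-delimited expression table from an open text handle.

    Assumed layout: header row, then one row per gene with the gene
    identifier first and one expression value per sample.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    genes, matrix = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        genes.append(row[0])
        matrix.append([float(value) for value in row[1:]])
    return samples, genes, matrix


# Tiny in-memory example with invented expression values.
example = "gene\ts1\ts2\nYAL001C\t0.5\t-1.2\nYAL002W\t1.1\t0.3\n"
samples, genes, matrix = read_expression_table(io.StringIO(example))
number_of_genes = len(genes)
number_of_samples = len(samples)
```

Reading the table this way would let the number of genes and samples be derived from the file instead of being passed on the command line.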

12
What I plan to do?

Start with a basic version
Interface for file input.
The following input parameters can be set by the
user (need more input)
alpha (0 < alpha < 1)
r (0 < r < 1)
Type of organism/species for which the dataset is
to be clustered
Output
Display alpha and r in the output file. Also, as
users can change these parameters, it would be
better to name the generated file so that it
indicates the values of alpha and r.
Generate output files such that they can be viewed
in Maple Tree software. (Luis has worked some
more on this particular function.)
Display compactness and separation in the output
file.
Build upon the basic version to add the required
features
Use the Maple Tree source code, or develop our own
method, so that the user can view the output of
clustering directly in our application and need
not open another application to view the results.
Allow the user to get periodic/manual updates of
GO Annotations (based on evidence codes), GO data
files and GO Slim files.
Give users the flexibility to choose which
version of the above mentioned files they want to
use.
Additional features (if time permits)
Implement or integrate some tool to match the
names of genes for species where they are not
uniform across all the data sets (e.g. human).
More input required in this area. GO Browser ?

13
References

Tari L et al., Fuzzy c-means clustering with
prior biological knowledge, J Biomed Inform
(2008), doi:10.1016/j.jbi.2008.05.009
http://sysbio.fulton.asu.edu/gofuzzy/
http://en.wikipedia.org/wiki/Data_clustering
http://www.scholarpedia.org/article/Fuzzy_C-means_cluster_analysis
http://www.geneontology.org

14
Any questions?