Title: Functional annotation and identification of
candidate disease genes by computational analysis
of normal tissue gene expression data
L. Miozzi (1), U. Ala (1), R. Piro (2), F. Rosa (3),
F. Di Cunto (1) and P. Provero (1)
(1) Dipartimento di Genetica, Biologia e Biochimica,
Università di Torino, Torino, Italy
(2) INFN, Sezione di Torino, Torino, Italy
(3) ISI Foundation, Torino, Italy
Introduction
Among the open problems of molecular
biology in the post-genomic era the functional
annotation of the human genome and the
identification of genes involved in genetic
diseases are especially important. Expression
data on a genomic scale have been available for
several years thanks to a set of new experimental
techniques, and are widely believed to contain
much information potentially relevant towards the
solution of such problems. Here we present the
results of a computational analysis of publicly
available expression data on human normal
tissues, based on the integration of data
obtained with the two most important experimental
platforms (microarrays and SAGE) and different
measures of dissimilarity between expression
profiles. The building blocks of the procedure
are the Gene Expression Neighborhoods (GEN),
small sets of tightly coexpressed genes which are
analyzed in terms of functional annotation and
relevance to human diseases. This analysis
provides putative functional annotations for many
genes, and identifies promising candidate disease
genes for experimental verification.
The guilt by association principle
The presented work is based on the following
principle: since there is a strong correlation
between coexpression and functional relatedness,
a gene found to be coexpressed with several
others involved in the same biological process
can be putatively given the same functional
annotation (Brazma A. and Vilo J., 2000, FEBS
Lett. 480:17-24).
Method
In this work we analyze publicly
available expression data on human normal tissues
obtained with Affymetrix microarrays
(http://symatlas.gnf.org/SymAtlas/) and with SAGE
(Serial Analysis of Gene Expression,
http://cgap.nci.nih.gov/). We considered 158
experiments concerning 12109 genes for Affymetrix
and 62 experiments concerning 11741 genes for
SAGE. Different measures of dissimilarity
between expression profiles have been defined and
integrated: Euclidean distance and Pearson linear
dissimilarity for the microarray data; Euclidean
distance and a dissimilarity measure based on the
Poisson distribution (developed in a different
context in Van Helden J., 2004, Bioinformatics
20(3):399-406) for SAGE data. The unit of functional
analysis, named Gene Expression Neighborhood
(GEN), has been defined as a gene plus its k
nearest expression neighbors, with k typically a
rather small number (the results we report were
obtained with k = 6). For each dataset and each
choice of dissimilarity measure we identified a
number of GENs equal to the number of genes
represented in the dataset. A GEN was considered
functionally characterized if there was at least
one Gene Ontology term
(http://www.geneontology.org/) shared by the
majority (K) of its genes (K = 4 genes in the
results presented). To avoid too
generic GO terms, the analysis has been limited
to those terms shared by no more than a given
maximum number M of genes in the whole
experimental dataset under investigation (M = 300
in the results presented). This limit ensures
that the majority rule used to define
functionally characterized GENs automatically
implies statistically significant
overrepresentation of the GO term involved. The
false discovery rate for the functionally
characterized GENs has been estimated as follows:
random GENs have been generated by reshuffling the
gene names in the whole dataset (thus preserving
the characteristics of the actual GENs, such as
their degree of self-overlapping) and subjected to
the same functional analysis. A leave-one-out
analysis has been performed to estimate how many
known annotations the method can correctly
recover. Characterized GENs have been used to
determine putative new functional annotations:
for each functionally characterized GEN and for
each GO term associated with it (shared by the
majority of its genes), the same GO term has been
putatively attributed to the genes in the GEN not
yet associated with it. Finally, we looked for
functionally characterized GENs containing at
least 3 genes associated with a genetic disease
in the OMIM database
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM).
When the relevant OMIM entries were related to
each other, the genes in the GEN not associated
with OMIM entries have been considered interesting
candidates for involvement in similar pathologies.
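
The procedure described in the Method section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. Expression profiles are assumed to be plain lists of numbers keyed by gene name, and GO annotations a dictionary mapping each gene to a set of terms. The last function uses a hypergeometric null model to illustrate why the limit M makes the majority rule stringent. That model is an assumption of this sketch; the source does not say how significance was assessed.

```python
import math
from collections import Counter


def pearson_dissimilarity(x, y):
    """1 - Pearson correlation of two profiles (undefined for constant profiles)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def build_gens(profiles, k=6, dissim=pearson_dissimilarity):
    """One GEN per gene: the gene itself followed by its k nearest neighbours."""
    gens = {}
    for g, p in profiles.items():
        # Ties in dissimilarity are broken by gene name (an arbitrary choice).
        others = sorted((dissim(p, q), h) for h, q in profiles.items() if h != g)
        gens[g] = [g] + [h for _, h in others[:k]]
    return gens


def characterizing_terms(gen, go_terms, term_sizes, K=4, M=300):
    """GO terms shared by at least K genes of the GEN and by at most M genes overall."""
    counts = Counter(t for g in gen for t in go_terms.get(g, ()))
    return [t for t, c in counts.items() if c >= K and term_sizes[t] <= M]


def propose_annotations(gens, go_terms, K=4, M=300):
    """Attribute each characterizing term to the GEN members that lack it."""
    term_sizes = Counter(t for terms in go_terms.values() for t in terms)
    proposals = set()
    for gen in gens.values():
        for term in characterizing_terms(gen, go_terms, term_sizes, K, M):
            for g in gen:
                if term not in go_terms.get(g, ()):
                    proposals.add((g, term))
    return proposals


def majority_tail_probability(n_genes, M, gen_size=7, K=4):
    """Hypergeometric P(at least K of gen_size random genes carry a term held by M genes)."""
    total = math.comb(n_genes, gen_size)
    hits = sum(
        math.comb(M, x) * math.comb(n_genes - M, gen_size - x)
        for x in range(K, gen_size + 1)
    )
    return hits / total
```

With the values reported above (12109 microarray genes, M = 300, k = 6 and K = 4), `majority_tail_probability(12109, 300)` is on the order of 10^-5 for a single GEN and a single term, before any correction for multiple testing. Replacing `dissim` with a Euclidean or Poisson-based function would give the other dataset and measure combinations.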
Results
- The leave-one-out analysis showed that 1026
correct GO annotations involving 644 genes and 94
GO terms would have been correctly identified by
the method (see table 1).
Dataset (dissimilarity) | Disease (OMIM entry) | Gene (Ensembl ID)
Microarray (Pearson) | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG00000069482
Microarray (Pearson) | AORTIC ANEURYSM, FAMILIAL THORACIC 1 | ENSG00000149591
Microarray (Pearson) | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000107796
Microarray (Pearson) | CHARCOT-MARIE-TOOTH DISEASE, AXONAL, TYPE 2G; CMT2G | ENSG00000166986
Microarray (Pearson) | CHARCOT-MARIE-TOOTH DISEASE, DOMINANT INTERMEDIATE A | ENSG00000166197
Microarray (Pearson) | CONVULSIONS, BENIGN FAMILIAL INFANTILE, 2 | ENSG00000087258
Microarray (Pearson) | CONVULSIONS, FAMILIAL INFANTILE, WITH PAROXYSMAL CHOREOATHETOSIS; ICCA | ENSG00000087258
Microarray (Pearson) | DEAFNESS, NEUROSENSORY, AUTOSOMAL RECESSIVE 46; DFNB46 | ENSG00000101608
Microarray (Pearson) | EPILEPSY, IDIOPATHIC GENERALIZED, SUSCEPTIBILITY TO, 3; EIG3 | ENSG00000078725
Microarray (Pearson) | EPILEPSY, PARTIAL, WITH VARIABLE FOCI | ENSG00000100095
Microarray (Pearson) | FACIOSCAPULOHUMERAL MUSCULAR DYSTROPHY 1A; FSHMD1A | ENSG00000154553
Microarray (Pearson) | MUSCULAR DYSTROPHY, LIMB-GIRDLE, TYPE 1F; LGMD1F | ENSG00000128595
Microarray (Pearson) | PARKINSON DISEASE 3, AUTOSOMAL DOMINANT LEWY BODY; PARK3 | ENSG00000075340
Microarray (Pearson) | POLYDACTYLY, PREAXIAL II; PPD2 | ENSG00000106538
Microarray (Pearson) | ROSSELLI-GULIENETTI SYNDROME | ENSG00000137699
Microarray (Pearson) | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000139329
Microarray (Pearson) | VACUOLAR NEUROMYOPATHY | ENSG00000077009
Microarray (Pearson) | VACUOLAR NEUROMYOPATHY | ENSG00000099800
Microarray (Pearson) | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG00000131808
Microarray (Pearson) | BREAST CANCER, 11-22 TRANSLOCATION ASSOCIATED | ENSG00000137713
Microarray (Pearson) | BREAST CANCER, DUCTAL, 1; BRCD1 | ENSG00000139618
Microarray (Pearson) | ELECTROENCEPHALOGRAM, LOW-VOLTAGE | ENSG00000075043
Microarray (Pearson) | EOSINOPHILIA, FAMILIAL | ENSG00000113721
Microarray (Pearson) | MICROCEPHALY, PRIMARY AUTOSOMAL RECESSIVE, 4; MCPH4 | ENSG00000156970
Microarray (Pearson) | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG00000143632
Microarray (Pearson) | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000011465
Microarray (Pearson) | TRIPHALANGEAL THUMB-POLYSYNDACTYLY SYNDROME | ENSG00000106538
Microarray (Pearson) | TUMOR SUPPRESSOR GENE ON CHROMOSOME 11 | ENSG00000137713
Microarray (Pearson) | CARDIOMYOPATHY, DILATED, 1F; CMD1F | ENSG00000118523
Microarray (Pearson) | CARDIOMYOPATHY, DILATED, 1Q; CMD1Q | ENSG00000091136
Microarray (Pearson) | DEAFNESS, AUTOSOMAL RECESSIVE 51; DFNB51 | ENSG00000026508
Microarray (Pearson) | MYOPATHY, LIMB-GIRDLE, WITH BONE FRAGILITY | ENSG00000147872
Microarray (Euclidean) | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG00000160808
Microarray (Euclidean) | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG00000130598
Microarray (Euclidean) | SCAPULOPERONEAL MYOPATHY; SPM | ENSG00000011465
Microarray (Euclidean) | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG00000143632
Microarray (Euclidean) | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000122367
SAGE (Euclidean) | ANEURYSM, INTRACRANIAL BERRY, 3 | ENSG00000158747
SAGE (Euclidean) | MYOPIA 5 | ENSG00000108821
SAGE (Euclidean) | MYOPIA 6 | ENSG00000100122
SAGE (Euclidean) | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG00000130598
SAGE (Euclidean) | MICROPHTHALMIA-CATARACT | ENSG00000167971
SAGE (Euclidean) | EXFOLIATIVE ICHTHYOSIS, AUTOSOMAL RECESSIVE, ICHTHYOSIS BULLOSA OF SIEMENS-LIKE | ENSG00000186081
SAGE (Euclidean) | MACULAR DYSTROPHY, RETINAL, 2, BULL'S EYE | ENSG00000007062
SAGE (Euclidean) | CATARACT, CONGENITAL NUCLEAR, AUTOSOMAL RECESSIVE 1; CATCN1 | ENSG00000105370
SAGE (Euclidean) | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG00000122367
SAGE (Euclidean) | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG00000160808
SAGE (Euclidean) | ACHROMATOPSIA 1 | ENSG00000129535
SAGE (Euclidean) | ACHROMATOPSIA 1 | ENSG00000139988
SAGE (Euclidean) | CONE-ROD DYSTROPHY 5; CORD5 | ENSG00000109047
SAGE (Euclidean) | CONE-ROD DYSTROPHY 5; CORD5 | ENSG00000179036
SAGE (Euclidean) | POSTERIOR COLUMN ATAXIA WITH RETINITIS PIGMENTOSA; AXPC1 | ENSG00000116703
SAGE (Euclidean) | MYOPIA 6 | ENSG00000196431
SAGE (Euclidean) | GLAUCOMA 3, PRIMARY INFANTILE, B; GLC3B | ENSG00000158747
SAGE (Euclidean) | MICROPHTHALMIA-CATARACT | ENSG00000197253
SAGE (Euclidean) | DUPUYTREN CONTRACTURE | ENSG00000087245
SAGE (Euclidean) | CORNEAL DYSTROPHY, CRYSTALLINE, OF SCHNYDER | ENSG00000158747
SAGE (Euclidean) | CATARACT, AUTOSOMAL RECESSIVE, EARLY-ONSET, PULVERULENT | ENSG00000172014
SAGE (Euclidean) | CATARACT, POSTERIOR POLAR 3 | ENSG00000125864
a)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 428 | 788 | / | 958 | 428
SAGE | 50 | / | 51 | 50 | 92
Microarray+SAGE | 468 | 788 | 51 | 992 | 504
b)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 318 | 546 | / | 598 | 318
SAGE | 48 | / | 48 | 48 | 82
Microarray+SAGE | 353 | 546 | 48 | 625 | 376
Table 1 - Leave-one-out analysis results showing
the number of GO annotations (a) and annotated
genes (b) correctly identified. A slash marks a
dissimilarity measure not used for that dataset.
- The distribution of GO terms among the three Gene
Ontology branches varies significantly across
combinations of experimental dataset and
dissimilarity measure, showing that different
combinations capture different aspects of
coexpression.
- Different definitions of the dissimilarity measure
describe different aspects of coexpression,
correlated with different kinds of functional
annotation (see tables 1 and 2): only a small
fraction of GO annotations is predicted by more
than one combination of dissimilarity measure and
dataset.
c)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 688 | 1215 | / | 1731 | 688
SAGE | 188 | / | 230 | 188 | 407
Microarray+SAGE | 866 | 1215 | 230 | 1906 | 1081
d)
Dataset | Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray | 569 | 950 | / | 1240 | 569
SAGE | 173 | / | 216 | 173 | 362
Microarray+SAGE | 720 | 950 | 216 | 1378 | 892
Table 2 - Number of obtained putative new
functional GO annotations (c) and new annotated
genes (d). A slash marks a dissimilarity measure
not used for that dataset.
- We have obtained 2113 putative new GO annotations
involving 1540 genes and 194 GO terms (see table
2).
Table 3 - List of candidate genes potentially
involved in human genetic diseases.
- The integration of our functional annotation
results with the OMIM database allowed us to
identify at least 59 interesting candidate genes
potentially involved in human genetic disease
(see table 3).
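
The first, automatic part of this candidate search can be sketched as below. It is an illustration only: the function name and the data layout (GENs as lists of gene identifiers, OMIM associations as a dictionary from gene to a set of disease names) are assumptions of this example. The further requirement that the OMIM entries be related to each other is not encoded here and is assumed to be checked afterwards.

```python
def disease_candidates(gens, disease_genes, min_disease_genes=3):
    """For each GEN holding enough OMIM disease genes, list its members without OMIM entries."""
    results = []
    for gen in gens:
        known = [g for g in gen if g in disease_genes]
        if len(known) >= min_disease_genes:
            # Diseases already linked to the GEN, to be checked for relatedness afterwards.
            diseases = sorted({d for g in known for d in disease_genes[g]})
            candidates = [g for g in gen if g not in disease_genes]
            results.append((candidates, diseases))
    return results
```

Each returned pair holds the candidate genes of one GEN together with the diseases already associated with its other members.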
Conclusion
We have developed an approach to analyze and
integrate information obtained with different
experimental techniques and different
dissimilarity measures, each able to explore a
different aspect of coexpression. The results
show that this integration increases the amount
of useful information obtained.