Title: A text-mining analysis of the human phenome
1 A text-mining analysis of the human phenome
European Journal of Human Genetics (2006) 14,
535-542
- Marc A van Driel1, Jorn Bruggeman2, Gert Vriend1,
Han G Brunner3 and Jack AM Leunissen2
1 Centre for Molecular and Biomolecular
Informatics, Radboud University Nijmegen, the
Netherlands
2 Department of Bioinformatics,
Wageningen University and Research Centre
3 Department of Human Genetics, University Medical
Centre Nijmegen
Speaker: Yu-Ching Fang
Advisors: Hsueh-Fen Juan and Hsin-Hsi Chen
2 Outline
- Introduction
- Methods
- Results
- Discussion
3 Introduction
- Functional annotation of genes is an important
challenge once the sequence of a genome has been
completed.
- Previous studies have correlated various
attributes of human genes with the chance of
causing a disease.
4 Introduction (cont.)
- However, few attempts have been made to
systematically classify relationships between
genes and proteins at the phenotype level.
5 Introduction (cont.)
- The Online Mendelian Inheritance in Man (OMIM)
database contains human disease phenotype data
and record-based textual information, one gene or
one genetic disorder per record.
- Goal: systematic grouping of genes by their
associated phenotypes from the OMIM database.
6 Methods: The OMIM database
- Full text (TX) field: 5132 (disease) of 16357
records
7 Methods: The OMIM database (cont.)
- Clinical synopsis (CS) field
8 Creation of feature vectors
- MeSH terms and their components are concepts.
- MeSH concepts serve as phenotype features
characterizing OMIM records.
- Ex: OMIM_1 -> MeSH_1, MeSH_2, ...
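The mapping from a record to its concepts can be illustrated with a minimal sketch. It assumes a tiny stand-in vocabulary (`TOY_MESH`) and simple single-word matching; both are illustrative assumptions, not the matching procedure used in the paper, and multi-word MeSH terms are not handled here.

```python
import re
from collections import Counter

# Hypothetical mini-vocabulary standing in for MeSH concepts.
TOY_MESH = {"eye", "retina", "photoreceptors", "brain", "deafness"}

def feature_vector(text, vocabulary=TOY_MESH):
    """Count how often each vocabulary concept occurs in a record's text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tok for tok in tokens if tok in vocabulary)

# Made-up record text, used only to exercise the function.
record_text = "Degeneration of the retina; retina and brain involvement."
vector = feature_vector(record_text)
print(dict(vector))
```

Each record then becomes a sparse vector of concept counts, which the later refinement steps re-weight.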
9 Refinement of the feature vectors
- MeSH concepts can be very broad, like Eye, or
more specific, like Retina.
- A concept hierarchy describes relationships
such as Eye-Retina-Photoreceptors.
- Retina is a hyponym of Eye.
10 Refinement of the feature vectors (cont.)
- To ensure that the concepts eye and retina are
recognized as similar, the MeSH hierarchy was
used to encode this similarity in the feature
vectors by increasing the value of all hypernyms.
r_c: relevance of concept c
r_{c,counted}: count of the concept c in a document
r_{hypos}: relevance of the concept c's hyponyms
n_{hypo,c}: the number of the concept c's hyponyms
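The slide's formula did not survive extraction, so the sketch below assumes one form that is consistent with the listed symbols: r_c = r_{c,counted} + (sum of r_{hypos}) / n_{hypo,c}, that is, a concept's own count plus the average relevance of its hyponyms. The hierarchy `TOY_HIERARCHY` is hypothetical and tree-shaped; the real MeSH hierarchy allows a term to sit under several parents.

```python
# Hypothetical slice of a concept hierarchy (hypernym -> hyponyms).
TOY_HIERARCHY = {
    "eye": ["retina", "lens"],
    "retina": ["photoreceptors"],
}

def relevance(concept, counts, hierarchy=TOY_HIERARCHY):
    """Assumed form: own count plus the mean relevance of the hyponyms."""
    counted = counts.get(concept, 0)
    hyponyms = hierarchy.get(concept, [])
    if not hyponyms:
        return float(counted)
    hypo_sum = sum(relevance(h, counts, hierarchy) for h in hyponyms)
    return counted + hypo_sum / len(hyponyms)

# Made-up raw counts for one record.
counts = {"photoreceptors": 2, "retina": 1}
expanded = {
    c: relevance(c, counts)
    for c in ["eye", "retina", "lens", "photoreceptors"]
}
print(expanded)
```

Under this assumption a record that only mentions Retina and Photoreceptors still receives a non-zero value for Eye, which is what lets it resemble a record that mentions Eye directly.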
11 Refinement of the feature vectors (cont.)
- Example of concept expansion using the MeSH
hierarchical structure.
12 Refinement of the feature vectors (cont.)
- Not all concepts in the OMIM records are equally
informative.
- Ex: retina pigment epithelium occurs rarely,
and thus provides more specific information than
very frequent terms such as Brain.
- Inverse document frequency measure
gw_c: inverse document frequency or global weight
of concept c
N = 5080
n_c: the number of records that contain concept c
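A minimal sketch of an inverse document frequency weight follows. The exact formula is not preserved on the slide; the form gw_c = log2(N / n_c) and in particular the base of the logarithm are assumptions, and the record counts passed in are made up for illustration.

```python
import math

def global_weight(n_records, n_with_concept):
    """Assumed form: gw_c = log2(N / n_c)."""
    if n_with_concept <= 0:
        raise ValueError("concept must occur in at least one record")
    return math.log2(n_records / n_with_concept)

N = 5080  # number of records, as stated on the slide

rare = global_weight(N, 5)       # hypothetical count for a rare concept
common = global_weight(N, 2540)  # hypothetical count for a frequent concept
print(rare, common)
```

A concept found in every record gets weight 0 under this form, while a rare concept is weighted up, matching the intent that rare terms carry more specific information.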
13 Refinement of the feature vectors (cont.)
- Not all OMIM records contain equally extensive
descriptions (record size differences).
- These differences make a comparison between
records difficult because the diversity and the
frequency of concepts in the larger records will
exceed those in the smaller records.
r_c: relevance of concept c
r_mf: the frequency of the most frequent MeSH
concept in that record
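One way to correct for record size with the two symbols above is to scale each relevance by the record's largest one. The sketch assumes the form 0.5 + 0.5 * r_c / r_mf, a common normalisation in text retrieval; the slide's own formula is not preserved, so this choice is an assumption.

```python
def normalise(relevances):
    """Assumed form: 0.5 + 0.5 * r_c / r_mf for each concept in a record."""
    if not relevances:
        return {}
    r_mf = max(relevances.values())  # value of the most frequent concept
    if r_mf == 0:
        return {c: 0.0 for c in relevances}
    return {c: 0.5 + 0.5 * r / r_mf for c, r in relevances.items()}

# Made-up records: same concept proportions, ten-fold size difference.
short_record = normalise({"retina": 2, "brain": 1})
long_record = normalise({"retina": 20, "brain": 10})
print(short_record, long_record)
```

After normalisation the short and the long record have identical vectors, so record length alone no longer drives the comparison.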
14 Comparing OMIM records
- The similarity between OMIM records can be
quantified by comparing the feature vectors that
are expanded and corrected.
- Similarities between feature vectors were
determined by the cosines of their angles.
s(X,Y): the similarity between the feature
vectors X and Y
x_i, y_i: concept frequencies
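The cosine of the angle between two vectors is s(X,Y) = sum_i(x_i * y_i) / (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2)). A small sketch over sparse vectors stored as dictionaries is given below; the dictionary representation and the example values are illustrative assumptions.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * y.get(concept, 0.0) for concept, value in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_x * norm_y)

# Made-up feature vectors.
a = {"retina": 3.0, "eye": 4.0}
b = {"retina": 1.0}
c = {"brain": 1.0}

print(cosine_similarity(a, a), cosine_similarity(a, b), cosine_similarity(a, c))
```

Because the feature values are non-negative, the score ranges from 0 (no shared concepts) to 1 (identical direction).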
15 Results: Comparing OMIM records
- 5080 of the 5132 OMIM records matched one or
more MeSH terms.
- The 5080x5080 pair-wise feature vector
similarities form the phenomap (all-to-all
similarities).
- Most phenotype-phenotype pairs have a low
similarity score.
16 Comparing OMIM records: the best scores for all
phenotypes in the disease phenotype data set
- For each OMIM record, the most similar of the
other 5079 records was identified.
- Moderately similar phenotype pairs might still
yield reasonable hypotheses.
- Ex: Fibromuscular Dysplasia of Arteries and
Cardiomyopathy, Familial Hypertrophic have a
similarity score of 0.31.
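Finding the best-scoring partner for each record can be sketched as a brute-force all-to-all comparison. The record identifiers and vectors below are hypothetical, and the function needs at least two records; with about 5000 records a matrix-based implementation would be a more practical choice.

```python
import math

def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * y.get(k, 0.0) for k, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def best_matches(vectors):
    """Map each record id to (most similar other record id, score)."""
    best = {}
    for rid, vec in vectors.items():
        scored = (
            (cosine(vec, other), oid)
            for oid, other in vectors.items()
            if oid != rid
        )
        score, oid = max(scored)
        best[rid] = (oid, score)
    return best

# Made-up feature vectors for three hypothetical records.
toy = {
    "rec_A": {"retina": 1.0, "eye": 1.0},
    "rec_B": {"retina": 1.0},
    "rec_C": {"brain": 1.0, "retina": 0.1},
}
best = best_matches(toy)
print(best)
```

Note that every record gets a best match even when the score is low, so the score itself has to be inspected before treating a pair as a hypothesis.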
17 Comparing OMIM records (cont.)
- Conclusion: the more phenotypes resemble each
other, the more likely they are to share an
interaction.
18 Discussion
- Developed a text-mining approach to map
relationships between more than 5000 human
genetic disease phenotypes from the OMIM
database.
- Phenotype clustering reflects the modular nature
of human disease genetics. Thus, the phenomap may
be used to predict candidate genes for diseases.