Annotating Gene List From Literature


1
Annotating Gene List From Literature
  • Xin He
  • Department of Computer Science
  • UIUC

2
Motivation
  • Biologists often need to understand the
    commonalities of a list of genes (e.g. whether
    they are involved in the same pathway).
  • These genes typically come from clustering
    results of microarray expression data
  • Given a list of gene names, is there any
    automatic way to find the common themes from
    literature articles?

3
Related Work
  • The most popular approach analyzes the Gene
    Ontology (GO) terms associated with genes.
  • Method: each gene is associated with a set of GO
    terms. Find the GO terms that are overrepresented
    in the input list.
  • Hypergeometric test: p-value of a GO term

    \[ p = \sum_{i=k}^{\min(n, M)} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}} \]

    N: total number of genes
    M: total number of genes annotated with this term
    n: number of genes in the list
    k: number of genes in the list annotated with this term

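
As a rough illustration of the hypergeometric test named on this slide, the sketch below computes the upper-tail p-value from N, M, n and k as defined here. The function name and the example numbers are hypothetical and are not taken from the presentation.

```python
from math import comb

def hypergeom_pvalue(N: int, M: int, n: int, k: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    M of which are annotated with the GO term."""
    upper = min(n, M)                 # X cannot exceed the list size or M
    total = comb(N, n)                # number of ways to pick the gene list
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical numbers: 6000 genes in the genome, 40 annotated with the term,
# a list of 6 genes of which 4 carry the annotation.
p = hypergeom_pvalue(N=6000, M=40, n=6, k=4)
print(f"p-value = {p:.3e}")
```

A small p-value suggests the term is overrepresented in the list. In practice one such test is run per GO term, so some multiple-testing correction is usually applied as well.
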
4
Problems with GO-based Approach
  • GO cannot cover all the important concepts in the
    literature. E.g. GO has relatively low coverage
    of behavior terms (compared with a specialized
    behavior ontology).
  • The associations between genes and concepts change
    very rapidly. E.g. new functions of known genes
    are constantly being found.

5
Text-based Gene List Annotation
  • Hypothesis testing approach:
    • find terms that are overrepresented for each
      gene: Poisson distribution
    • find common terms across the gene list:
      hypergeometric distribution
  • Comparative text mining approach: find the common
    themes in multiple collections (one for each gene)
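
The per-gene step of the hypothesis testing approach could look roughly like the sketch below. It assumes (this is not stated on the slide) that a term's expected count in a gene's article collection is its background rate times the collection size, and that the p-value is the Poisson upper tail. All names and counts are hypothetical.

```python
from math import exp

def poisson_tail(observed: int, expected: float) -> float:
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = exp(-expected)             # P(X = 0)
    cdf = term
    for i in range(1, observed):      # accumulate P(X = 1) .. P(X = observed - 1)
        term *= expected / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def term_pvalue(term_count: int, collection_tokens: int,
                background_count: int, background_tokens: int) -> float:
    """Overrepresentation p-value of one term in one gene's collection."""
    rate = background_count / background_tokens   # background rate per token
    expected = rate * collection_tokens           # expected count for this gene
    return poisson_tail(term_count, expected)

# Hypothetical counts: "dorsoventral" occurs 12 times in 20,000 tokens about
# one gene, and 300 times in a 10,000,000-token background corpus.
print(term_pvalue(12, 20_000, 300, 10_000_000))
```

Terms that pass a threshold for individual genes can then be tested for overrepresentation across the whole gene list with the hypergeometric distribution, as in the GO-based approach.
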

6
Comparative Text Mining
  • For each gene, find a collection of articles that
    discuss this gene
  • Each article in a collection is a mixture of two
    distributions: a theme common to all collections
    and a collection-specific theme
  • Parameter estimation in the mixture model: the
    standard EM algorithm
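
The following is a minimal sketch of EM for a two-theme mixture of this kind, not the presenter's implementation. It makes simplifying assumptions that are not in the slides: the articles of each gene are pooled into one bag of words, the mixing weight `lam` is fixed rather than estimated, and the function name `comparative_em` is hypothetical.

```python
from collections import Counter

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()} if total else weights

def comparative_em(collections, lam=0.5, iters=50):
    """collections: one list of word tokens per gene.
    Returns (common_theme, [specific_theme per collection])."""
    counts = [Counter(words) for words in collections]
    vocab = sorted(set().union(*counts))
    pooled = Counter()
    for c in counts:
        pooled.update(c)
    # Initialize the common theme from pooled counts and each specific
    # theme from its own collection's counts.
    common = normalize({w: float(pooled[w]) for w in vocab})
    specific = [normalize({w: float(c[w]) for w in vocab}) for c in counts]

    for _ in range(iters):
        new_common = dict.fromkeys(vocab, 0.0)
        new_specific = []
        for c, spec in zip(counts, specific):
            new_spec = dict.fromkeys(vocab, 0.0)
            for w, n in c.items():
                pc = lam * common[w]
                ps = (1.0 - lam) * spec[w]
                z = pc / (pc + ps)          # E-step: P(word came from common theme)
                new_common[w] += n * z      # M-step: expected counts per theme
                new_spec[w] = n * (1.0 - z)
            new_specific.append(normalize(new_spec))
        common = normalize(new_common)
        specific = new_specific
    return common, specific

# Toy input: three hypothetical one-line "collections".
docs = [["toll", "signaling", "pathway"],
        ["pelle", "signaling", "pathway"],
        ["tube", "signaling", "pathway"]]
common, specific = comparative_em(docs)
print(sorted(common, key=common.get, reverse=True)[:2])
```

On this toy input the shared words drift into the common theme while each gene name is absorbed by its own collection-specific theme, which is the behavior the comparative approach relies on.
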

7
Results: Pelle System
  • Pelle system in Drosophila: Spätzle, Toll, Pelle,
    Tube, Cactus, Dorsal
  • Among the top-50 words: signaling, pathway,
    receptor, embryo, ventral, dorsoventral,
    patterning, embryonic

8
Results: MET cluster
  • MET cluster from yeast cell-cycle data: MET28,
    MET14, MET16, MET10, MET2, MUP1
  • Among the top-50 words: amino, met25, sulphite

9
Problems and Plan
  • Many common words (such as stop words) appear in
    the top list because scores are not properly
    normalized
    • Using the entire Medline corpus as background
      does not work
    • Plan: hypothesis testing approach as an
      alternative
  • Single words are not very suggestive
    • Plan: phrase extraction as a postprocessing step