Data Mining Techniques in Identifying Protein Functions

1
Data Mining Techniques in Identifying Protein
Functions
  • Chen Xin HT00-6077W
  • Luo Chong HT00-6126A

2
Problem
  • A vast corpus of biology literature is available
    electronically, e.g. the abstracts in Medline
    (PubMed)
  • Using keywords to narrow the search often
    produces far more candidates than can be properly
    read (or processed)

3
Seek Solution in Data Mining
  • Identify papers that directly state the
    relationship between p53 and cancer
  • Examples
    • Positive: Assessment of p53 protein expression is
      more discriminative than TP53 mutation to predict
      the outcome of Dukes' stage B tumours and could
      be a useful tool to identify patients who might
      benefit from adjuvant therapy
    • Negative: p21/WAF1, and p53 are considered as
      regulating each other based on in vitro studies,
      the relation in human lung cancer is not fully
      understood

4
Data Collection
  • All abstracts were retrieved from PubMed
  • Every abstract must contain both p53 and cancer,
    or synonyms of them, in its Abstract field
  • Every record is labeled either Positive or
    Negative by human inspection

5
Proposed Framework
  • Database structure
    • Attributes: particular words
    • Values: 1 (word appears), 0 (word does not appear)
    • Class: P (direct relation), N (indirect
      relation)
  • Generate CARs (Class Association Rules) using CBA
    (Classification Based on Associations), by Liu,
    B. et al.
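
The record layout above (binary word attributes plus a P/N class) can be illustrated with a small rule miner. This is only a minimal sketch of what a class association rule is, not the CBA algorithm itself: it enumerates word sets by brute force instead of CBA's candidate generation, and the function name `mine_cars`, the thresholds, and the toy records are all assumptions made up for illustration.

```python
from itertools import combinations

def mine_cars(records, min_sup=0.2, min_conf=0.7, max_len=2):
    """Brute-force class association rules (illustrative, not CBA).

    records: list of (set_of_words, label) pairs, label "P" or "N".
    Returns (word_tuple, label, support, confidence) for every rule
    "words -> label" that meets both thresholds.
    """
    n = len(records)
    vocab = sorted({w for words, _ in records for w in words})
    labels = sorted({label for _, label in records})
    rules = []
    for size in range(1, max_len + 1):
        for itemset in combinations(vocab, size):
            items = frozenset(itemset)
            # Labels of all records that contain every word in the itemset.
            covered = [label for words, label in records if items <= words]
            if not covered:
                continue
            for label in labels:
                hits = covered.count(label)
                support = hits / n                 # share of all records
                confidence = hits / len(covered)   # share of covered records
                if support >= min_sup and confidence >= min_conf:
                    rules.append((itemset, label, support, confidence))
    return rules

# Toy records, invented for illustration only.
toy_records = [
    ({"predict", "outcome"}, "P"),
    ({"predict", "useful"}, "P"),
    ({"relation", "understood"}, "N"),
    ({"relation", "predict"}, "N"),
]
toy_rules = mine_cars(toy_records, min_sup=0.5, min_conf=0.6)
for words, label, sup, conf in toy_rules:
    print(words, "->", label, f"sup={sup:.2f} conf={conf:.2f}")
```

Support is the fraction of all records that contain the words and carry the class; confidence is the fraction of records containing the words that carry the class.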

6
Processing
  • Pick out key sentences
    • Key sentences are the sentences in an abstract
      that contain all the keywords (p53, cancer)
  • Filter out stop words
    • Stop words are high-frequency words that do
      not help classification
  • Vectorization
    • Convert key sentences into records
  • Experiment design
    • Separate training and testing sets (4 sets)
    • Optimize minimum support (min sup.) and
      minimum confidence (min conf.)
7
Experimental Results

8
Analysis
  • Statistically, the precision is 73.8% and the recall
    is 81.2% using this relatively primitive method
  • The experimental results are comparable to those
    of recent publications
  • It may reduce overall human effort, though it needs
    more training data
  • It may uncover implicit conventions of scientific
    language
  • There is still much room for improving
    performance.
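
For reference, precision and recall for the Positive class follow the standard definitions, sketched below. The function name and the toy labels are assumptions for illustration; they are not the experimental data behind the figures above.

```python
def precision_recall(true_labels, predicted_labels, positive="P"):
    """Precision and recall for one class, from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels, invented for illustration only.
truth = ["P", "P", "P", "N", "N"]
guess = ["P", "P", "N", "P", "N"]
toy_precision, toy_recall = precision_recall(truth, guess)
print(f"precision={toy_precision:.3f} recall={toy_recall:.3f}")
```

Precision is the share of sentences predicted Positive that really are Positive; recall is the share of truly Positive sentences that the classifier finds.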

9
Further Work for Improvement
  • Consider grammatical patterns rather than
    single keywords
  • Consider the order of the important words
    relative to the keywords
  • Consider the order of the important words
    relative to each other
  • Consider pattern mining
10
Conclusion
  • Data mining techniques can be applied in this
    field even in a quite primitive form
  • The same techniques could be applied to other
    fields without much modification
  • Further study of this class of approaches is
    desirable and promising

11
Q & A
  • Thank You