Data mining in bioinformatics: problems and challenges

Slides: 30
Provided by: iroUmo

1
Data mining in bioinformatics: problems and challenges
Sorin Draghici
Email: sod_at_cs.wayne.edu
WWW: http://vortex.cs.wayne.edu, http://www.cs.wayne.edu/sod

2
Why bioinformatics?
  • We are witnessing a "biotechnology revolution"
  • Biotechnology
  • has the potential to improve our lives
    dramatically (new drugs, treatments, etc.)
  • also has a huge destructive potential (careless
    genetic manipulations, etc.)

3
Why bioinformatics?
  • Human genome project
  • completed by Celera
  • How is that to be used?
  • map functions on genes
  • find/treat/correct/eliminate genetic diseases
  • gene treatment
  • patient-oriented treatment and drugs
    (pharmacogenomics), e.g. ACE inhibitors (blood
    pressure medication)

4
The HIV virus
  • HIV is a retrovirus that attacks the immune
    system
  • Replication mechanism
  • RNA based
  • makes lots of mistakes during the replication
  • Compensates for the primitive replication through
    a high replication speed

5
Why is it so deadly?
  • 10 billion copies of HIV are produced every day
  • High replication speed
  • Many random mutations
  • Selection pressure from the drug
  • very good search ability in the version space of
    all viable HIV viruses

6
Current treatments
  • Protease inhibitors
  • Reverse transcriptase (RT) inhibitors

7
Current problems
  • Very few drugs available
  • 5 FDA approved protease inhibitors
  • 9 FDA approved RT inhibitors
  • Cross-resistance
  • patient treated with drug A may develop
    resistance to drug B as well

8
Current problems
  • Drug development is
  • very slow (10 years)
  • very expensive (10-30 million/year)
  • Viral mutations are
  • very probable in each generation
  • very rapid (10 billion copies a day)
  • The result: throwing stones at fighter planes

9
Our approach
  • Find the structural features which
  • cause drug resistance
  • are common to several mutants
  • Design drugs to counteract such common features
    as opposed to individual mutants
  • secondary therapy

10
[Diagram: wild type HIV and mutant HIV 1-3; drug development options 1-3; resistance genotyping; first antiretroviral therapy (FAT) and second antiretroviral therapy (SAT); outcomes labeled "effective" and "less effective"]

11
Our data
  • Genotypic data (genetic sequences of mutants)
  • easy to obtain
  • there are lots of them
  • Structural data (X ray crystallography)
  • difficult to obtain
  • not very many
  • Phenotypic data (drug resistance)
  • very difficult to obtain
  • very few available

12
Our data
  • Genotypic data
  • PQITLWQRPLVTIKIGGQLKEALLDTGADDT... (approx. 200
    residues for protease)
  • Structure data
  • Phenotypic data
  • IC90: 3.51
  • fold resistance = IC90 (mutant) / IC90 (wild type)
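
The fold-resistance ratio on this slide can be written as a one-line function. This is only a sketch: the function name and the example IC90 values are made up for illustration, and both values are assumed to be in the same concentration unit.

```python
def fold_resistance(ic90_mutant: float, ic90_wild_type: float) -> float:
    """Ratio of the mutant IC90 to the wild-type IC90 (dimensionless)."""
    if ic90_wild_type <= 0:
        raise ValueError("wild-type IC90 must be positive")
    return ic90_mutant / ic90_wild_type


# Hypothetical IC90 values, not taken from the presentation.
print(fold_resistance(7.0, 2.0))  # 3.5
```

Because the result is a ratio, it stays comparable across assays that report IC90 on different absolute scales, as long as mutant and wild type were measured the same way.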

13
Our work
  • Develop a structure-function model of HIV drug
    resistance

[Diagram labels: sequence, structure, resistance]
14
Dataflow
[Diagram labels: Sequence, Structures, Contacts/PDB, Machine learning]
15
Supervised learning
  • Inputs
  • Atomic contacts between the inhibitor and the
    protease
  • Atomic distances
  • Output
  • Fold resistance
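
The slides do not say which learning algorithm was used, so the following is only a stand-in: a k-nearest-neighbour regressor that maps a vector of inhibitor-protease atomic distances to a fold-resistance value. The function name `knn_predict`, the choice of k-NN, and all numbers are assumptions made for illustration.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict fold resistance as the mean over the k nearest training vectors."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k


# Toy, made-up data: each row holds two inhibitor-protease atomic distances
# (angstroms) for one mutant; each target is that mutant's fold resistance.
distances = [[3.1, 3.4], [3.0, 3.5], [4.2, 4.8], [4.4, 4.6]]
fold = [1.0, 2.0, 12.0, 14.0]

print(knn_predict(distances, fold, [4.3, 4.7], k=2))  # 13.0
```

Any regressor with the same input and output shape could replace the k-NN step; the point is only the mapping from structural features to a resistance value.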

16
Ligplot Contacts File
17
Atomic contacts - resistance
18
Unsupervised learning
  • Inputs
  • Contact residues (21 distinct contacts)
  • Output
  • A self-organizing map embedding structural
    information
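
A self-organizing map of the kind named here can be sketched in a few lines. This is a generic, minimal SOM, not the authors' implementation: the grid size, decay schedules, and the encoding of each mutant as a 21-element 0/1 contact vector are all assumptions, and the two input patterns are invented.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(samples, rows=3, cols=3, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks over time
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])  # pull node toward the sample
    return weights


# Toy input: two made-up contact patterns over 21 possible contact residues.
pattern_a = [1] * 10 + [0] * 11
pattern_b = [0] * 10 + [1] * 11
som = train_som([pattern_a, pattern_b] * 5)
print(best_matching_unit(som, pattern_a), best_matching_unit(som, pattern_b))
```

After training, mutants with similar contact patterns tend to land on the same or neighbouring grid nodes, which is what lets the map be overlaid with resistance values afterwards.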

19
Ligplot Contacts File
20
Self-organizing feature maps
21
Residue contacts - resistance
22
Problems and challenges in bioinformatics
  • Insufficient data
  • Example
  • Largest data set has 50 mutants
  • Why?
  • The field is very recent
  • Data collection can be very difficult (one
    structure may take 1-2 years if done from
    scratch; one IC90 value may take up to two weeks)
  • Data has commercial value
  • Solutions
  • Get more data
  • Cross-validate very carefully
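
With a data set of about 50 mutants, one careful option is leave-one-out cross-validation, where every sample is held out exactly once. The sketch below assumes that choice (the slides do not name a specific scheme); the `fit`/`predict` callables, the mean-only baseline model, and the numbers are hypothetical.

```python
def leave_one_out_error(xs, ys, fit, predict):
    """Mean absolute error when each sample is held out once and predicted from the rest."""
    if len(xs) < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # everything except sample i
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        total += abs(predict(model, xs[i]) - ys[i])
    return total / len(xs)


# Deliberately trivial baseline: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


# Made-up fold-resistance values for four mutants; the baseline ignores the features.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 2.0, 3.0, 6.0]
print(leave_one_out_error(xs, ys, fit_mean, predict_mean))
```

Comparing a real model against a trivial baseline like this one under the same hold-out scheme is a quick check that the model has learned more than the average resistance.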

23
Problems and challenges in bioinformatics
  • Data consistency
  • Example
  • Same sample sent to two different labs can come
    back with different IC90 values
  • Why?
  • The experimental tools are not mature yet
  • Solutions
  • Select your data carefully
  • Use data from consistent sources
  • If not possible, pre-process the data to make it
    consistent (not very good since you actually
    change the data!)

24
Problems and challenges in bioinformatics
  • Data accuracy
  • Example
  • Same sample sent to the same lab at different
    times can be reported with different IC90 values
    (4 fold error)
  • Why?
  • The experimental tools are not mature yet
  • Solutions
  • Use relative values to reduce the requirement for
    high numerical precision
  • Map data into clusters and attach values to
    clusters (1-4 no resistance, 4-10 reduced
    resistance, >10 resistance)
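
The three bins above translate directly into a small classifier. The slide leaves the shared boundary at 4 and values below 1 unspecified, so the handling of both cases here is an assumption, as is the function name.

```python
def resistance_class(fold_resistance: float) -> str:
    """Map a fold-resistance value to one of the three coarse classes from the slide."""
    if fold_resistance < 0:
        raise ValueError("fold resistance cannot be negative")
    if fold_resistance > 10:
        return "resistance"
    if fold_resistance >= 4:   # assumption: the boundary value 4 goes to the higher class
        return "reduced resistance"
    return "no resistance"     # assumption: values below 1 also count as no resistance


print([resistance_class(v) for v in (1.5, 4.0, 9.9, 25.0)])
# ['no resistance', 'reduced resistance', 'reduced resistance', 'resistance']
```

With a reported 4-fold measurement error, coarse classes like these are less sensitive to noise than the raw values, although samples near a boundary can still flip class.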

25
Problems and challenges in bioinformatics
  • Data quality
  • Example
  • Papers reporting IC90 values do not give the
    whole sequence
  • Why?
  • People are not aware of its importance
  • Data may have commercial value
  • Solutions
  • Never trust your data...

26
Problems and challenges in bioinformatics
  • The choice of features
  • Example
  • Atoms?, Residues?, Genes?, Larger structures?
  • Why?
  • The phenomena are very complex and span different
    scales in time and space
  • Solutions
  • Try to merge different types of data in order to
    capture the complexity of the phenomenon
  • Use several qualitatively different analysis and
    machine learning techniques

27
Problems and challenges in bioinformatics
  • Lack of tools
  • Example
  • There were no tools able to correlate
    sequence/structure/resistance data for the HIV
    virus
  • We wrote more than 15,000 lines of code for this
    problem
  • Why?
  • The field is new
  • The structure/function problem is just starting
    to be addressed
  • Solutions
  • Develop your own software
  • Partnerships with bioinformatics companies?

28
Problems and challenges in bioinformatics
  • Difficult communication between the "bio" and the
    "informatics" sides
  • Example
  • Definition of "successful prediction"
  • Why?
  • Different backgrounds, different traditions
  • Solution
  • Cross-training
  • Exposure to "the other" field

29
Conclusions
  • Data mining in bioinformatics is
  • Challenging
  • Interesting
  • Useful