Opportunities of Systems EngineeringOperations Research in Bioinformatics - PowerPoint PPT Presentation

About This Presentation
Title:

Opportunities of Systems EngineeringOperations Research in Bioinformatics

Description:

Introduction on Bioinformatics. Paradigm Shift in Biology ... What is Bioinformatics/Computational Molecular Biology? ... Why Bioinformatics? ... – PowerPoint PPT presentation

Number of Views:93
Avg rating:3.0/5.0
Slides: 55
Provided by: hyoung9
Category:

less

Transcript and Presenter's Notes

Title: Opportunities of Systems EngineeringOperations Research in Bioinformatics


1
Opportunities of Systems Engineering/Operations
Research in Bioinformatics
  • Hyoungtae Kim
  • (Joint work with Wiljeana Jackson, S.C. LIN and
    Dr. JC LU)

2
Outline
  • Introduction on Bioinformatics
  • Paradigm Shift in Biology
  • Systems Engineering/Operations Research
    for Bioinformatics
  • About Funding Opportunities
  • Conclusions

3
What is Bioinformatics/Computational Molecular
Biology?
  • An application of mathematical, statistical, and
    computational tools in the analysis of the huge
    size biological data
  • Most of the cases, it involves analyzing
    information stored in large databases
  • Multi-disciplinary
  • -Biology -Mathematics -Statistics
  • -Physics -Chemistry -Computer Science
  • -Engineering

It has not yet found its own natural home
department
4
Why Bioinformatics?
  • Current data analysis tools are far from being
    efficient for analyzing vast amount of biological
    data
  • The pace of biological understanding is much
    slower than the pace of the technology advance
    that have powered experimental discovery and data
    collection
  • Benefits
  • Advances in detection and treatment of disease
    and the production of genetically engineered
    foods
  • ?Profound impact on health and medicine

5
Three Elements of Bioinformatics Research
  • Significant Biological problems
  • Gene, motif, signal recognition
  • Protein structure prediction
  • Metabolic pathway deduction
  • Etc.

Bioinformatics
  • Data
  • Microarrays
  • Mass Spectroscopy
  • Etc.
  • Theory Methods
  • Algorithms
  • Statistical Methods
  • Ontologies
  • Etc.

6
Prerequisites of Bioinformatics
Scientific Mind
  • Basic knowledge in Molecular Biology
  • Prokaryotic and Eukaryotic cells
  • Genes, Codons, DNA, RNA, Central dogma of biology
  • Etc.
  • Computing Skills
  • Program Languages Python, Perl, Java, etc.
  • Knowledge in Relational Databases, etc.
  • Other Skills
  • General Statistical Knowledge
  • Optimization Tools Math Programming, Network
    Optimization, etc.

7
Various Problems in Bioinformatics
Standard Problems
  • DNA and Protein Sequence Analysis
  • Gene Finding and Prediction
  • Etc.
  • Microarray Experiment and Data Analysis
  • Protein Structure Prediction
  • Deduction of Metabolic Pathways
  • And more

Emerging Problems
8
Outline
  • Introduction to Bioinformatics
  • Paradigm Shift in Biology
  • Systems Engineering/Operations Research
    for Bioinformatics
  • About Funding Opportunities
  • Concluding Remarks

9
Paradigm Shift in Biology
  • The Human Genome Project (HGP)
  • Working Draft of the human genome (2001)
  • Goal of the HGP sequencing of the human genome
  • Hypothesis driven reductionism ?discovery science
    approach
  • Drive-forced the development of high throughput
    technologies and computer applications to
    transmit, analyze, and model very large size data
    sets

10
Paradigm Shift in Biology
  • High-throughput Technologies
  • Microarrays allow the expression of thousands
    of genes to be surveyed at one time
  • Protein Arrays can examine all proteins in a
    cell and check if they are interacting under
    designed conditions
  • Mass Spectrometry The basic modality is protein
    mass fingerprinting

11
Paradigm Shift in Biology
  • Microarray Chip Technology
  • Allows data collection in high-throughput manner
  • Can put all genes in a microbe on a chip
  • Interpretation of the data is very challenging

12
253x15154 Microarray Gene Expression Data 162
cancer vs 91 normal patients
13
Paradigm Shift in Biology
Genes and proteins
Protein-protein interaction data
Gene activity data
Black box
Protein structure data
Proteomic data
Regulatory elements
Metabolite data
Gigantic amount of biological information is
hidden in these data and their inter-data
relationship!
14
Paradigm Shift in Biology
  • Concept of Systems Biology
  • The Reductionist paradigm has been phenomenally
    successful in biology since 1950s
  • Genomics era ? exhaustive lists of biological
    parts (i.e. genes and proteins) together with
    their functional characteristics
  • A System-level perspective is required to make
    sense of how all of these individual parts emerge
    and act collectively to perform a biological
    function

15
Outline
  • Introduction to Bioinformatics
  • Paradigm Shift in Biology
  • Systems Engineering/Operations Research tools
    for Bioinformatics
  • About Funding Opportunities
  • Concluding Remarks

16
Systems Engineering/Operations Research tools
  • Three Categories
  • Network Optimization
  • Combinatorial
  • Integer Programming
  • Dynamic Programming
  • Network Optimization
  • Minimum Spanning Tree
  • Etc.
  • Stochastics
  • Hidden Markov Models
  • MCMC
  • Simulation Models
  • Etc.
  • Statistics
  • MLE
  • Regression
  • Sampling
  • Linear Model
  • Cross Validation
  • Statistical Estimation and Test
  • Multivariate Analysis (or ANOVA)
  • Wavelet Transformation
  • Bayesian Networks, Etc.

17
Systems Engineering tools for Bioinformatics
  • Some Examples
  • Hidden Markov Model for Gene Finding
  • Dynamic Programming for Sequence Alignment
  • Integer Programming for Protein Folding
  • Minimum Spanning Tree approach to Clustering for
    Motif Identification (Xu et al. (2001)
  • And many more

18
A Significant Biological Problem
  • Identification of Transcription Factor Binding
    Sites(Motifs)
  • A genes transcriptional level is regulated by
    proteins (transcription factors), which bind to
    specific sites in the genes promoter region,
    called binding sites
  • The binding-site identification problem is to
    find short conserved fragments, from a set of
    genomic sequences
  • ? Features of transcription factor binding site
  • These short DNA fragments in the upstream regions
    of genes are generally very similar to each other
  • Relatively high frequencies compared to other
    sequence fragments

19
Data Collection
  • Data Set (D) Set of All Short DNA fragments in
    the upstream regions of genes
  • Microarray gene expression technologies allow
    simultaneous view of the transcription levels of
    many thousands of genes under various cellular
    conditions

Upstream regions of genes
GATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACG CCATCTCTA
CTAAAAATAGGAAATTCACCTGGTGGCAGGT CCAGCTACTCGGGAGGCT
GAGGCAGAAGAATCGCTTGAAT GAGATTGCACTGAGCTGAGATCACGCC
ACTGCGCTCCAGC GAGCAAGACTCCATAAAAAAAAAAATTATAACCTAA
TGAT AGGGAAGAGCTTACCACAATTGCTGGCCCATGGCCAATGC ACAG
CTACTGCAAACAACCATGATGATGATACATCTCTTG GGTTGTTTGAGAC
ACATTCTATGCTCCTTGATTTGATTGG GGTTCCTTGGGGACTTGGAGGT
GACGAAAGCCTCCCTGGG ACCTTCACTTCTCTAATATCAAGCTTCAGCA
ACCTGCTCC CAGGGTTGGACAGGCCCAACAACAGAGGAAATCCACAAAG
CACATACATCCACGGGGTCTAACGAGGTGAGGCCAATGAC CACCCCAG
CCAGACTCTGACTTCACTCCCGGCAGGTTTCA CAGCAGTTGGAGCGAGC
TGGCTTCTTGCGGTAGGCAGCCA GCTCCCAATAGTCCTCGTTTCCTGGT
AATCTCATGCTTGG
Experiment
Find group of genes having correlated expression
profiles
20
  • Some testing data sets are available on the
    internet or in the literature
  • For example ?
  • CRP binding sites 18 sequences with 105 BPs
  • Yeast binding sites 8 sequences with 1000 BPs
  • Human binding sites 113 sequences with 30 BPs

21
CRP binding sites 18 sequences with 105 BPs
22
Theory Methods
  • Traditional approaches
  • Various sampling techniques including Gibbs
    sampling
  • EM Algorithm
  • Greedy Algorithm
  • Multi-Order Markov Chain Algorithm
  • All these are heuristic algorithms so this
    problem remains as a challenging and unsolved
    problem

23
Brief Review Minimum Spanning Tree
  • Input A graph, G (V,E), with weighted edges
  • Output the cheapest subset of edges that keeps
    the graph in one connected component
  • Two Popular Algorithms
  • Kruskals Algorithm
  • Prims Algorithm

24
Theory Methods
  • Minimum Spanning Tree approach
  • Step1 Define a distance measure (?) on the data
    set (D), and compute distances b/w each pair of
    data points (i.e., ?(A,B) for all A, B in D)
  • Higher the sequence similarity b/w two fragments,
    smaller the distance is b/w their mapped
    positions

25
Theory Methods
  • Minimum Spanning Tree approach
  • Step2 Find the MST ,T, representing D with its
    edge weight defined by ? and treat it as a data
    clustering problem

c1
c4
T
e1
e2
c2
e3
c3
Remove three edges e1,e2,e3
4 Clusters, c1c4, are identified
26
Evaluation of the MST Method
  • Comparison with Other Methods
  • MST is based on a combinatorial approach
  • ? can identify all clusters of possible binding
    sites
  • While existing heuristic methods are likely to
    miss some clusters
  • Implemented result is at least as good as results
    by other methods
  • While Simple structure of a tree facilitates
    efficient implementations of rigorous algorithm

27
Outline
  • Introduction to Bioinformatics
  • Paradigm Shift in Biology
  • Systems Engineering/Operations Research tools
    for Bioinformatics
  • About Funding Opportunities
  • Concluding Remarks

28
Funding Overviews by Funding Institutions(Top)/Fie
ld of Research(Bot)
Total of 54.1 billion in FY2004
Environmental science
Physical science
Life science
Engineering
9.1 billion
29.3 billion
Percentage of Total Federal Funding Preliminary
2004 Statistics Source National Science
Foundation/Division of Science Resources
Statistics, Survey of Federal Funds for Research
29
How to Search for Funding Opportunities?
  • NIH Computer Retrieval of Information on
    Scientific Projects (CRISP)
  • http//crisp.cit.nih.gov
  • NIH Office of Extramural Research (OER)
  • http//grants1.nih.gov
  • Other Websites
  • http//www.grants.gov
  • http//fedgrants.gov
  • http//www.nsf.gov/pubsys/ods/index.html

30
Growing Opportunities in Bioinformatics
From CRISP Search Data
31
NIH Funded Projects in 2004
From CRISP Search Data
  • Searched all Related Institutes, Centers, and
    States for the 2004 Fiscal Year

NIH Grants in Bioinformatics, 826
Microarray, 214 grants
Systems Biology, 80 grants
Cancer,63 grants
32
NIH Funding Opportunities for 2004
From http//grants1.nih.gov
  • 2004 Program Announcement (PA)
  • Total 171 PAs
  • Larger variety of topics
  • Cancer most prevalent topic
  • Many wish to have multidisciplinary outlook on
    topics
  • 2005 Requests For Application (RFA)
  • Total 68 RFAs
  • Although listed for 2005, some application
    deadlines have passed
  • 2 directly related to bioinformatics
  • Cancer still most prevalent topic

33
Outline
  • Introduction to Bioinformatics
  • Paradigm Shift in Biology
  • Systems Engineering/Operations Research
    for Bioinformatics
  • About Funding Opportunities
  • Conclusions

34
Developing Potential Research Plans
  • Two Takeaways
  • Systems Engineers/Operations Research Society
    already have tools to solve various
    bioinformatics problems
  • Moneys are there to support your research

Then, what do we need to start?
Biological Problems to solve
35
Concluding Remarks!!
  • The main driving force of bioinformatics/computati
    onal biology is the high-throughput data
    production
  • I.E. tools together with computing power can play
    an important role in this process
  • Funding opportunities in this area are very rich

36
Thank you!
Any Questions?
37
(No Transcript)
38
Level of Organization and Related Field of Study
39
Central Dogma of Biology
Transcription
Translation
example
Transcription
Translation
TTG CTG CGG
UUG CUG CGG
Leu Leu Arg
40
Transcription and Translation
41
Gene
  • A gene is a region of DNA that controls a
    hereditary characteristic, usually corresponding
    to a single mRNA carrying the information for
    constructing a protein.
  • The human genome contains about 30,000 genes.
    (February 2001)

42
Introns and Exons
43
(No Transcript)
44
Pair-wise Sequence Alignment
VLSPADKTNVKAAWAKVGAHAAGHG
VLSEAEWQLVLHVWAKVEADVAGHG
45
Sequence Alignment
  • Purposes
  • Learn about evolutionary relationships
  • Finding genes, domains, signals
  • Classify protein families (function, structure).
  • Identify common domains (function, structure).

46
Multiple Sequence Alignment
47
Scoring Systems for Alignment
Simple case
Sequence 1 Sequence 2
A G C T A 1 0 0 0 G 0 1 0 0 C 0 0 1 0 T 0 0 0 1
Scoring matrix
Match 1 Mismatch 0 Score 5
DNA
48
Scoring Systems for Alignment
Complex case
Sequence 1 Sequence 2
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
C S T P A G N D . . C 9 S -1 4 T -1 1
5 P -3 -1 -1 7 A 0 1 0 -1 4 G -3 0 -2 -2
0 6 N -3 1 0 -2 -2 0 5 D -3 0 -1 -1 -2 -1
1 6 . .
Scoring matrix
TG -2 TT 5 Score 48
Protein
49
Protein Structure
50
Public Databases
  • Big 3 Centers

National Center for Biotechnology Information
EBI
DNA Database Bank of Japan
51
The Human Genome
  • 23 pairs of chromosomes comprise the human
    genome.
  • The human genome contains 3,164.7 million (or 3
    Billion) nucleotide base.
  • The average gene consists of 3,000 bases, but
    sizes vary greatly, with the largest known human
    gene being dystrophin at 2.4 million bases.
  • The total number of genes is estimated at 30,000
    to 40,000
  • The total number of protein variant is
    estimated as 1 Million.

52
Various Fields in Biology
Genomics
DNA
Transcriptomics
RNA
Proteomics
Proteins
Metabolomics
Metabolites
53
Trends in Molecular Biology
Reverse Genetics
Functional Genomics
Gene
Function (Mutation)
Function
Gene
Genome Project High Throughput Tech
Genome
Genomics
Structural Genomics
Functional Genomics
54
DNA Bases
A (Adenine), G (Guanine), C (Cytosine),
T(Thymine)
Write a Comment
User Comments (0)
About PowerShow.com