Title: Opportunities of Systems Engineering/Operations Research in Bioinformatics
1. Opportunities of Systems Engineering/Operations Research in Bioinformatics
- Hyoungtae Kim
- (Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU)
2. Outline
- Introduction to Bioinformatics
- Paradigm Shift in Biology
- Systems Engineering/Operations Research for Bioinformatics
- About Funding Opportunities
- Conclusions
3. What is Bioinformatics/Computational Molecular Biology?
- The application of mathematical, statistical, and computational tools to the analysis of very large biological data sets
- In most cases, it involves analyzing information stored in large databases
- Multi-disciplinary: Biology, Mathematics, Statistics, Physics, Chemistry, Computer Science, Engineering
It has not yet found its own natural home department
4. Why Bioinformatics?
- Current data analysis tools are far from efficient at analyzing the vast amount of biological data
- The pace of biological understanding is much slower than the pace of the technological advances that have powered experimental discovery and data collection
- Benefits
- Advances in the detection and treatment of disease and in the production of genetically engineered foods
- Profound impact on health and medicine
5. Three Elements of Bioinformatics Research
- Significant biological problems: gene, motif, and signal recognition; protein structure prediction; metabolic pathway deduction; etc.
- Data: microarrays, mass spectrometry, etc.
- Theory and methods: algorithms, statistical methods, ontologies, etc.
6. Prerequisites of Bioinformatics
Scientific Mind
- Basic knowledge of molecular biology: prokaryotic and eukaryotic cells; genes, codons, DNA, RNA, the central dogma of biology; etc.
- Computing skills: programming languages (Python, Perl, Java, etc.); knowledge of relational databases, etc.
- Other skills: general statistical knowledge; optimization tools (math programming, network optimization, etc.)
7. Various Problems in Bioinformatics
Standard Problems
- DNA and Protein Sequence Analysis
- Gene Finding and Prediction
- Etc.
- Microarray Experiment and Data Analysis
- Protein Structure Prediction
- Deduction of Metabolic Pathways
- And more
Emerging Problems
8. Outline
- Introduction to Bioinformatics
- Paradigm Shift in Biology
- Systems Engineering/Operations Research for Bioinformatics
- About Funding Opportunities
- Concluding Remarks
9. Paradigm Shift in Biology
- The Human Genome Project (HGP)
- Working draft of the human genome (2001)
- Goal of the HGP: sequencing of the human genome
- Shift from hypothesis-driven reductionism to a discovery-science approach
- Drove the development of high-throughput technologies and computer applications to transmit, analyze, and model very large data sets
10. Paradigm Shift in Biology
- High-throughput Technologies
- Microarrays: allow the expression of thousands of genes to be surveyed at one time
- Protein arrays: can examine all proteins in a cell and check whether they interact under designed conditions
- Mass spectrometry: the basic modality is protein mass fingerprinting
11. Paradigm Shift in Biology
- Microarray Chip Technology
- Allows data collection in a high-throughput manner
- Can put all genes in a microbe on a chip
- Interpretation of the data is very challenging
12. 253 x 15154 Microarray Gene Expression Data: 162 cancer vs. 91 normal patients
13. Paradigm Shift in Biology
Genes and proteins
Protein-protein interaction data
Gene activity data
Black box
Protein structure data
Proteomic data
Regulatory elements
Metabolite data
A gigantic amount of biological information is hidden in these data and in the relationships among them!
14. Paradigm Shift in Biology
- Concept of Systems Biology
- The reductionist paradigm has been phenomenally successful in biology since the 1950s
- The genomics era has produced exhaustive lists of biological parts (i.e., genes and proteins) together with their functional characteristics
- A system-level perspective is required to make sense of how all of these individual parts emerge and act collectively to perform a biological function
15. Outline
- Introduction to Bioinformatics
- Paradigm Shift in Biology
- Systems Engineering/Operations Research tools for Bioinformatics
- About Funding Opportunities
- Concluding Remarks
16. Systems Engineering/Operations Research tools
- Optimization: combinatorial, integer programming, dynamic programming, network optimization, minimum spanning tree, etc.
- Stochastics: hidden Markov models, MCMC, simulation models, etc.
- Statistics: MLE, regression, sampling, linear models, cross-validation, statistical estimation and testing, multivariate analysis (or ANOVA), wavelet transformation, Bayesian networks, etc.
17. Systems Engineering tools for Bioinformatics
- Hidden Markov Model for Gene Finding
- Dynamic Programming for Sequence Alignment
- Integer Programming for Protein Folding
- Minimum Spanning Tree approach to Clustering for Motif Identification (Xu et al., 2001)
- And many more
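As an illustration of dynamic programming for sequence alignment, here is a minimal sketch of a global alignment score in the Needleman-Wunsch style. The function name and the gap penalty of -1 are assumptions made for this example; they do not come from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    match, mismatch and gap are illustrative defaults, not slide values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

The table is filled row by row, so the cost is proportional to the product of the two sequence lengths.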
18. A Significant Biological Problem
- Identification of Transcription Factor Binding Sites (Motifs)
- A gene's transcriptional level is regulated by proteins (transcription factors), which bind to specific sites in the gene's promoter region, called binding sites
- The binding-site identification problem is to find short conserved fragments in a set of genomic sequences
- Features of transcription factor binding sites
- These short DNA fragments in the upstream regions of genes are generally very similar to each other
- They occur at relatively high frequencies compared to other sequence fragments
19. Data Collection
- Data set (D): the set of all short DNA fragments in the upstream regions of genes
- Microarray gene expression technologies allow a simultaneous view of the transcription levels of many thousands of genes under various cellular conditions
Upstream regions of genes
GATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACG CCATCTCTA
CTAAAAATAGGAAATTCACCTGGTGGCAGGT CCAGCTACTCGGGAGGCT
GAGGCAGAAGAATCGCTTGAAT GAGATTGCACTGAGCTGAGATCACGCC
ACTGCGCTCCAGC GAGCAAGACTCCATAAAAAAAAAAATTATAACCTAA
TGAT AGGGAAGAGCTTACCACAATTGCTGGCCCATGGCCAATGC ACAG
CTACTGCAAACAACCATGATGATGATACATCTCTTG GGTTGTTTGAGAC
ACATTCTATGCTCCTTGATTTGATTGG GGTTCCTTGGGGACTTGGAGGT
GACGAAAGCCTCCCTGGG ACCTTCACTTCTCTAATATCAAGCTTCAGCA
ACCTGCTCC CAGGGTTGGACAGGCCCAACAACAGAGGAAATCCACAAAG
CACATACATCCACGGGGTCTAACGAGGTGAGGCCAATGAC CACCCCAG
CCAGACTCTGACTTCACTCCCGGCAGGTTTCA CAGCAGTTGGAGCGAGC
TGGCTTCTTGCGGTAGGCAGCCA GCTCCCAATAGTCCTCGTTTCCTGGT
AATCTCATGCTTGG
Experiment
Find groups of genes having correlated expression profiles
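One simple way to look for genes with correlated expression profiles is to compute the Pearson correlation between every pair of profiles and keep the pairs above a cutoff. The sketch below assumes this approach; the gene names, the expression values, and the 0.9 threshold are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlated_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above threshold."""
    names = sorted(profiles)
    return [
        (g, h)
        for i, g in enumerate(names)
        for h in names[i + 1:]
        if pearson(profiles[g], profiles[h]) >= threshold
    ]


# Hypothetical expression levels under four cellular conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(correlated_pairs(profiles))
```

Here geneA and geneB rise together across conditions, while geneC moves in the opposite direction and is left out.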
20.
- Some test data sets are available on the internet or in the literature
- For example:
- CRP binding sites: 18 sequences with 105 bp
- Yeast binding sites: 8 sequences with 1000 bp
- Human binding sites: 113 sequences with 30 bp
21. CRP binding sites: 18 sequences with 105 bp
22. Theory and Methods
- Various sampling techniques, including Gibbs sampling
- EM Algorithm
- Greedy Algorithm
- Multi-Order Markov Chain Algorithm
- All of these are heuristic algorithms, so the problem remains challenging and unsolved
23. Brief Review: Minimum Spanning Tree
- Input: a graph G = (V, E) with weighted edges
- Output: the cheapest subset of edges that keeps the graph in one connected component
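The input/output description above can be sketched with Kruskal's algorithm, one standard way to build a minimum spanning tree. The function name and the small example graph are assumptions for illustration only.

```python
def minimum_spanning_tree(num_vertices, edges):
    """Kruskal's algorithm.

    edges is a list of (weight, u, v) tuples with vertices 0..num_vertices-1.
    Returns the list of edges chosen for the spanning tree.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the component root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):      # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # edge joins two components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical 4-vertex graph.
edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3)]
mst = minimum_spanning_tree(4, edges)
print(mst, sum(w for w, _, _ in mst))
```

An edge is accepted only when it joins two different components, so no cycle is ever created.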
24. Theory and Methods
- Minimum Spanning Tree approach
- Step 1: Define a distance measure on the data set (D), and compute the distance between each pair of data points (i.e., the distance between A and B for all A, B in D)
- The higher the sequence similarity between two fragments, the smaller the distance between their mapped positions
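A minimal sketch of Step 1, assuming the Hamming distance (number of mismatched positions) as the distance measure. The slides do not say which measure was used, so this is a stand-in; the fragments are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length fragments differ."""
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(fragments):
    """Map each index pair (i, j), i < j, to the distance between fragments."""
    return {
        (i, j): hamming_distance(fragments[i], fragments[j])
        for i in range(len(fragments))
        for j in range(i + 1, len(fragments))
    }


# Hypothetical short DNA fragments of equal length.
fragments = ["TTGACA", "TTGACT", "TTTACA", "GCCGTA"]
distances = pairwise_distances(fragments)
print(distances)
```

Similar fragments get small distances, which is the property Step 1 asks of the measure.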
25. Theory and Methods
- Minimum Spanning Tree approach
- Step 2: Find the MST, T, representing D, with its edge weights defined by the distance measure, and treat the task as a data clustering problem
(Figure: tree T, with edges e1, e2, e3 joining clusters c1, c2, c3, c4)
Remove three edges e1, e2, e3
4 clusters, c1 to c4, are identified
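The edge-removal idea in the figure can be sketched as follows: build the MST over all fragments, drop the k-1 heaviest tree edges, and read off the remaining connected components as clusters. This is a simplified illustration under assumed choices (Hamming distance, cutting the heaviest edges, a known cluster count k); it is not claimed to be the procedure of Xu et al.

```python
def hamming(a, b):
    """Mismatch count between two equal-length fragments (assumed measure)."""
    return sum(x != y for x, y in zip(a, b))


def mst_clusters(points, k, dist=hamming):
    """Split points into k clusters by cutting the k-1 heaviest MST edges.

    Assumes 1 <= k <= len(points).
    """
    n = len(points)
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal's algorithm: tree edges are collected in ascending weight order.
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((w, i, j))

    # Keep the n-k lightest tree edges, i.e. remove the k-1 heaviest ones.
    parent = list(range(n))
    for _, i, j in tree[:n - k]:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())


# Hypothetical fragments forming two obvious groups.
fragments = ["TTGACA", "TTGACT", "TTTACA", "GCCGTA", "GCCGTT", "ACCGTA"]
clusters = mst_clusters(fragments, k=2)
print(clusters)
```

Removing k-1 edges from a tree always leaves exactly k connected pieces, which is why the figure's three removed edges yield four clusters.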
26. Evaluation of the MST Method
- Comparison with Other Methods
- MST is based on a combinatorial approach
- It can identify all clusters of possible binding sites, while existing heuristic methods are likely to miss some clusters
- Implemented results are at least as good as the results of other methods
- The simple structure of a tree facilitates efficient implementations of rigorous algorithms
27. Outline
- Introduction to Bioinformatics
- Paradigm Shift in Biology
- Systems Engineering/Operations Research tools for Bioinformatics
- About Funding Opportunities
- Concluding Remarks
28. Funding Overviews by Funding Institution (Top)/Field of Research (Bottom)
Total of $54.1 billion in FY2004
(Chart labels: Environmental science, Physical science, Life science, Engineering; values shown: 9.1 billion, 29.3 billion)
Percentage of Total Federal Funding, Preliminary 2004 Statistics. Source: National Science Foundation/Division of Science Resources Statistics, Survey of Federal Funds for Research
29. How to Search for Funding Opportunities?
- NIH Computer Retrieval of Information on Scientific Projects (CRISP)
- http://crisp.cit.nih.gov
- NIH Office of Extramural Research (OER)
- http://grants1.nih.gov
- Other Websites
- http://www.grants.gov
- http://fedgrants.gov
- http://www.nsf.gov/pubsys/ods/index.html
30. Growing Opportunities in Bioinformatics
From CRISP Search Data
31. NIH Funded Projects in 2004
From CRISP Search Data
- Searched all related Institutes, Centers, and States for the 2004 fiscal year
NIH grants in Bioinformatics: 826 grants
Microarray: 214 grants
Systems Biology: 80 grants
Cancer: 63 grants
32. NIH Funding Opportunities for 2004
From http://grants1.nih.gov
- 2004 Program Announcements (PA)
- Total of 171 PAs
- Larger variety of topics
- Cancer is the most prevalent topic
- Many seek a multidisciplinary outlook on their topics
- 2005 Requests for Applications (RFA)
- Total of 68 RFAs
- Although listed for 2005, some application deadlines have passed
- 2 directly related to bioinformatics
- Cancer is still the most prevalent topic
33. Outline
- Introduction to Bioinformatics
- Paradigm Shift in Biology
- Systems Engineering/Operations Research for Bioinformatics
- About Funding Opportunities
- Conclusions
34. Developing Potential Research Plans
- The Systems Engineering/Operations Research community already has tools to solve various bioinformatics problems
- Money is available to support your research
Then, what do we need to start?
Biological problems to solve
35. Concluding Remarks!!
- The main driving force of bioinformatics/computational biology is high-throughput data production
- Industrial engineering (IE) tools together with computing power can play an important role in this process
- Funding opportunities in this area are very rich
36. Thank you!
Any Questions?
38. Level of Organization and Related Field of Study
39. Central Dogma of Biology
DNA (transcription) RNA (translation) Protein
Example:
DNA: TTG CTG CGG
RNA (after transcription): UUG CUG CGG
Protein (after translation): Leu Leu Arg
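The example on this slide can be reproduced with a short sketch. It follows the slide's simplified view of transcription (replace T with U in the listed DNA sequence) and uses a codon table restricted to the three codons shown; the function names are assumptions for illustration.

```python
# Partial codon table: only the codons that appear on the slide.
CODON_TABLE = {"UUG": "Leu", "CUG": "Leu", "CGG": "Arg"}


def transcribe(dna):
    """Slide-level transcription: write the DNA sequence with U in place of T."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time and look up each codon."""
    usable = len(rna) - len(rna) % 3       # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


rna = transcribe("TTGCTGCGG")
protein = translate(rna)
print(rna, protein)
```

A full implementation would use the complete 64-codon genetic code and handle start and stop codons, which this sketch leaves out.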
40. Transcription and Translation
41. Gene
- A gene is a region of DNA that controls a hereditary characteristic, usually corresponding to a single mRNA carrying the information for constructing a protein.
- The human genome contains about 30,000 genes. (February 2001)
42. Introns and Exons
44. Pair-wise Sequence Alignment
VLSPADKTNVKAAWAKVGAHAAGHG
VLSEAEWQLVLHVWAKVEADVAGHG
45. Sequence Alignment
- Purposes
- Learn about evolutionary relationships
- Find genes, domains, signals
- Classify protein families (function, structure).
- Identify common domains (function, structure).
46. Multiple Sequence Alignment
47. Scoring Systems for Alignment
Simple case
Sequence 1 Sequence 2
Scoring matrix
   A  G  C  T
A  1  0  0  0
G  0  1  0  0
C  0  0  1  0
T  0  0  0  1
Match: 1, Mismatch: 0, Score: 5
DNA
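With a match score of 1 and a mismatch score of 0, the score of two aligned DNA sequences is just the number of matching positions. The sketch below shows this; the two sequences are hypothetical, since the ones on the slide did not survive in the transcript.

```python
def identity_score(seq1, seq2, match=1, mismatch=0):
    """Score two already-aligned, equal-length sequences position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))


# Hypothetical aligned DNA sequences (not the ones from the slide).
score = identity_score("AGCTAGC", "AGCTTGA")
print(score)
```

Replacing the match/mismatch rule with a lookup in a substitution matrix gives the protein scoring scheme of the complex case.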
48. Scoring Systems for Alignment
Complex case
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
Scoring matrix
    C   S   T   P   A   G   N   D  . .
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   5
D  -3   0  -1  -1  -2  -1   1   6
. .
T-G: -2, T-T: 5, Score: 48
Protein
49. Protein Structure
50. Public Databases
National Center for Biotechnology Information (NCBI)
European Bioinformatics Institute (EBI)
DNA Data Bank of Japan (DDBJ)
51. The Human Genome
- 23 pairs of chromosomes comprise the human genome.
- The human genome contains 3,164.7 million (about 3 billion) nucleotide bases.
- The average gene consists of 3,000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases.
- The total number of genes is estimated at 30,000 to 40,000.
- The total number of protein variants is estimated at 1 million.
52. Various Fields in Biology
- Genomics: DNA
- Transcriptomics: RNA
- Proteomics: Proteins
- Metabolomics: Metabolites
53. Trends in Molecular Biology
Reverse Genetics
Functional Genomics
Gene
Function (Mutation)
Function
Gene
Genome Project High Throughput Tech
Genome
Genomics
Structural Genomics
Functional Genomics
54. DNA Bases
A (Adenine), G (Guanine), C (Cytosine), T (Thymine)