Title: An Overview of Bioinformatics
1An Overview of Bioinformatics
2Overview
- Genomics, Bioinformatics and Medicine
Molecular Diagnosis
Genomics
Identify Drug Targets
Molecular Epidemiology
Genetic Therapy
Rational Drug Design
Bioinformatics
Information Theory
Graph Theory
Artificial Intelligence
Robotics
Machine Learning
Databases
Statistics
Algorithm
3- Introduction to Bioinformatics
- Bioinformatics: a strategic discipline at the frontier between biology and computer science
- Loosely defined as the intersection of molecular and computational biology
- Contributions come primarily from the academic user community
- Driving force: the advent of new, efficient experimental techniques, especially in DNA sequencing
- Major goals: understanding life and evolution, and discovering new drugs and therapies
4- Significant advances
- Collecting and managing data
- Databases of various types
- Nucleotide sequences
- Protein sequences
- Structures
- Gene expression
- Integrated data retrieval
- Entrez at NCBI
- Data Analysis
- Sequence
- Structure
- Expression
7- Entrez
- Developed at NCBI; freely available; allows integrated access to PubMed records, nucleotide and protein sequence data, and 3D structure information
- http://www.ncbi.nlm.nih.gov/Entrez
9Central Dogma of Molecular Biology
- Replication: DNA -> DNA
- Transcription: DNA -> RNA
- Translation: RNA -> Protein
- Reverse transcription: RNA -> DNA
- Molecules
- Structure
- Function
- Processes
- Mechanism
- Regulation
10What Information to organize?
- Molecular Biology as Information Science
- Central Dogma of Molecular Biology
- DNA -> RNA -> Protein -> Phenotype
- Biological process: performed or facilitated by proteins, executing instructions encoded in DNA
- Central Paradigm for Bioinformatics
- Genomic sequence -> mRNA (gene expression level) -> Protein sequence -> Protein function -> Phenotype
- Process large amounts and different types of information; apply various informatics techniques
11Organize Information
- Redundancy and Multiplicity
- Different sequences have the same/similar structure
- An organism has many similar genes
- A single gene may have multiple functions
- How to find the similarities?
- Modular parts
- shared parts (bolts, washers)
- unique parts (steering wheel)
- Vast growth in data but limited increase of fundamentally new parts
- More new protein structures, but not many new protein families
- Simplification by grouping conserved elements (e.g. sequences, structures) into modular parts
12- Examining and analyzing data
- Gene finding
- ML techniques have been applied to almost all steps in computational gene finding, including the assignment of translation start and stop sites, quantification of reading frames, gene modeling, etc.
- Promoter recognition: transcription initiation and termination
- Transcription initiation: the first step in gene expression
- RNA polymerase recognizes and binds to certain sequences called promoters.
- Why is it difficult?
- Large, variable distances between the DNA signals recognized by RNA polymerase
- Many other factors are involved in expression regulation
13- Gene Expression
- Clustering: group genes with similar expression behavior together
- Find target genes correlated with diseases
- Construct genetic networks: understanding gene interactions
- Motif prediction: patterns showing regularity
- Sequence motifs: DNA, proteins
- Structural motifs: RNA, proteins
- Protein structure prediction
- Classification
- Protein family
- DNA-protein interaction
14- Information Retrieval from Biological databases
- Exponential growth of databases
- GenBank is an annotated collection of all publicly available DNA sequences. It contains over 1.6 million sequence records covering over 1 billion nucleotides.
- Much effort has gone into making such data accessible to average users; the programs and interfaces resulting from these efforts are the focus.
- A possible information retrieval scenario
- Found a paper in PubMed that describes a gene in GenBank
- May want to know the protein encoded by the gene
- May want to know its 3D structure as recorded in a structure database
- Entrez at NCBI (http://www.ncbi.nlm.nih.gov/Entrez)
15 16Informatics Techniques in Bioinformatics
- Databases
- Building
- Querying
- String comparison
- Text search
- Sequence alignment
- Significance statistics
- Pattern finding
- AI/Machine Learning
- Data mining
- Geometry
- Robotics
- Graphics
- 3D matching
- Physical simulation
- Numerical analysis
- Simulation
- Visualization
17- Microarrays for Genomic Studies
- Why is it important?
- It's good to have information on (all) genes that are present in a genome.
- Genetics is the study of the interactions among individual genes in an organism.
- Biologists want to know the interplay of all genes simultaneously.
- This leads to the need for high-throughput and large-scale technologies.
18- Gene Expression Studies
- The pattern of genes expressed in a cell is characteristic of its current state.
- Virtually all differences in cell state are correlated with changes in the mRNA levels of many genes.
- Comparing the expression patterns of genes may provide clues to their functions.
19- DNA Microarray Technology
- Initially conceived to detect the expression of thousands of genes simultaneously.
- Potential applications
- Identification of complex genetic diseases
- Drug discovery
- Differing expression of genes over time, between tissues, or between disease states
- Potential impacts
- Preventive medicine
- Ability to subtype diseases and design drugs based on their causes instead of their symptoms.
- Personalized drugs
- Rapid diagnosis
20- Microarray Technology
- Basic technology is the same
- DNA sequence complementary to the gene of interest is generated and then laid out in microscopic quantities on solid surfaces at predefined positions
- DNAs from samples are washed over the surface; only complementary DNAs remain, because they bind
- Presence of bound DNAs is detected by fluorescence following laser excitation
21- Microarray Fabrication
- Photolithography
- Photomasks direct DNA synthesis
- Add one base at a time to the growing chain
- Ink-jetting
- Utilize ink-jet printing to dispense sub-nanolitre volumes of reagent to defined positions
- Microspotting
- Robot with a printhead
- What DNA sequences are laid on the surface?
- A series of DNA fragments vs. a complete gene sequence
22- How it works
- Fluorescent samples are prepared from the two mRNA sources to be compared
- Cy3 (green) used for one sample
- Cy5 (red) used for the other sample
- Samples are mixed and washed over the microarray
- The microarray is then scanned by a laser scanner to detect fluorescence levels.
- The ratio of Cy3 to Cy5 (G/R) is calculated for each array element.
- The relative G/R intensity is a reliable measure of the relative abundance of a specific mRNA in each sample.
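The per-element G/R calculation can be sketched as follows. This is a minimal illustration, not the scanner software's method: the log2 transform, the `floor` guard against zero intensities, and the sample intensity values are all assumptions added here.

```python
import math

def log_ratios(cy3, cy5, floor=1.0):
    """Return log2(Cy3 / Cy5) for each array element.

    `floor` is an assumed guard that keeps zero or negative
    intensities from breaking the division or the logarithm.
    """
    if len(cy3) != len(cy5):
        raise ValueError("Cy3 and Cy5 must have one value per array element")
    return [math.log2(max(g, floor) / max(r, floor)) for g, r in zip(cy3, cy5)]

# Hypothetical intensities for three array elements.
green = [1200.0, 300.0, 800.0]   # Cy3 channel
red = [300.0, 1200.0, 800.0]     # Cy5 channel
ratios = log_ratios(green, red)
```

A positive value means the mRNA is more abundant in the Cy3-labelled sample, a negative value means it is more abundant in the Cy5-labelled sample, and zero means equal abundance.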
27Gene Regulation Analysis
- Motivation
- Microarray technology provides a global view of changes in gene expression on a genomic scale.
- Obtain temporal patterns of gene expression
- What else may we want to know?
- Consensus sequence motifs
- Correlation between motifs
- What causes genes to behave differently
- To learn beyond how genes behave over time
- Look into control regions of genes
- Find hypotheses
28Gene Regulation Analysis
- Method
- Use an Affymetrix GeneChip machine to collect gene expression data.
- Cluster genes across the genome based on temporal patterns, e.g., slope, distance, etc.
- For each cluster, use a motif-finding algorithm to find motifs.
- Given multiple clusters of interest, transform raw sequences into a higher-level representation.
- Apply constructive induction to find motif interactions.
- Apply an inductive learner to generate hypotheses.
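The clustering step can be sketched as below. This is one simple reading of "slope", assumed here for illustration: each gene's time course is reduced to the sign of its change between consecutive time points, and genes with the same sign pattern are grouped. The tolerance `tol`, the gene names, and the expression values are hypothetical, and the slides do not say which clustering procedure was actually used.

```python
from collections import defaultdict

def slope_signature(profile, tol=0.1):
    """Encode a time course as '+', '-' or '0' per consecutive interval."""
    signature = []
    for earlier, later in zip(profile, profile[1:]):
        change = later - earlier
        if change > tol:
            signature.append("+")
        elif change < -tol:
            signature.append("-")
        else:
            signature.append("0")
    return "".join(signature)

def cluster_by_slope(profiles, tol=0.1):
    """Group gene names whose time courses share a slope signature."""
    clusters = defaultdict(list)
    for gene, profile in profiles.items():
        clusters[slope_signature(profile, tol)].append(gene)
    return dict(clusters)

# Hypothetical expression levels at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 2.0],
    "geneB": [0.5, 1.5, 2.5, 1.0],
    "geneC": [3.0, 2.0, 1.0, 1.0],
}
clusters = cluster_by_slope(profiles)
```

Each resulting cluster would then be passed to the motif-finding step.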
30RNA Secondary Structure Prediction Basics
- Like protein secondary structure, RNA secondary structure can be viewed as an intermediate step in the formation of a 3D structure.
- In predicting RNA secondary structure, several simplifying assumptions are usually made.
- The most likely structure is similar to the energetically most stable structure.
- The energy associated with any position in the structure is influenced only by local sequence and structure. This assumption is most reliable for standard Watson-Crick base pairs and single G/U pairs surrounded by Watson-Crick pairs.
- The structure is assumed to form by the chain folding back on itself in a manner that does not produce any knots.
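To make the knot-free assumption concrete, here is a minimal sketch of Nussinov-style base-pair maximization. It is a deliberately simplified stand-in for free energy minimization: it counts pairs instead of summing measured energies, and the minimum hairpin loop length of 3 is an assumption. Because the recursion only ever combines nested or side-by-side substructures, it cannot produce knots.

```python
# Watson-Crick pairs plus the G/U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    best[i][j] holds the answer for the subsequence seq[i..j].
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

pairs_in_hairpin = max_pairs("GGGAAAUCC")
```

For the hairpin `GGGAAAUCC`, the three stem pairs (G-C, G-C, G-U) around the `AAA` loop are found. Real predictors replace the pair count with experimentally derived energy terms.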
31Types of RNA Secondary Structure Prediction Methods
- Based on objective functions
- Free energy minimization
- Covariance analysis from sequence comparison
- Based on the number of RNA sequences for which to predict
- Single-sequence prediction
- To find the possible folding of a single RNA sequence
- Multiple-sequence prediction
- To find a global structure alignment for a set of RNA sequences
- To find common structure elements within a set of RNA sequences
32Motif Prediction vs. Concept Learning
- Target concept: common motifs
- Training examples: biosequences
- Motif prediction as supervised learning
- Positive examples
- a given set of coregulated RNAs
- Negative examples
- the same number of sequences, randomly generated based on the observed frequencies of the sequence alphabet in the positive examples.
- Target concept
- The common structural motifs that can be used to distinguish the given coregulated RNAs from the random sequences.
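Generating the negative examples can be sketched as follows. The slide fixes only the number of sequences and the use of observed letter frequencies. Giving each random sequence the length of the corresponding positive example, pooling frequencies over all positives, and seeding the generator are assumptions made here. The example RNAs are hypothetical.

```python
import random
from collections import Counter

def random_negatives(positives, seed=0):
    """Return one random sequence per positive example.

    Letters are drawn independently, weighted by how often each
    letter occurs across all positive examples.
    """
    rng = random.Random(seed)
    counts = Counter("".join(positives))
    alphabet = sorted(counts)
    weights = [counts[letter] for letter in alphabet]
    return [
        "".join(rng.choices(alphabet, weights=weights, k=len(seq)))
        for seq in positives
    ]

# Hypothetical coregulated RNAs.
positives = ["GGGAAAUCC", "GCGAAAGCA", "GGCUUCGGCC"]
negatives = random_negatives(positives)
```

Drawing letters independently preserves base composition but destroys any shared structure, which is what lets the learner treat motifs found in the positives as discriminative.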
33GPRM: Genetic Programming for RNA Motifs
- Focus on finding Watson-Crick complementary base pairs
- C-G and A-U
- RNA secondary structures are typically formed by base-pairing interactions.
- Three components of GPRM
- Population of putative structural motifs
- Fitness function of motifs
- Genetic operators that simulate the natural evolution process of motifs
34Representing Individuals in A Population
- Each individual in a population is a putative motif
- Structural motif description
- Watson-Crick complementary segments
- Non-pairing segments
35Fitness Function
- Interested in those motifs that reflect the characteristics conserved in a family of coregulated RNAs
- Assign higher values to motifs that are commonly shared by the given family of RNAs and rarely contained in random RNA sequences.
- We define the fitness function as
36Genetic Operators
- Reproduction
- Pass the better half of the population to the next generation
- Accelerate the reproduction process
- Mutation
- If a complementary segment is picked, its segment length and corresponding pairing segment are both randomly changed.
- If a non-pairing segment is selected, then only its length is randomly modified.
- Crossover
- Exchange segment configuration between two putative motifs.
- Either a pair of complementary segments or a non-pairing segment is randomly chosen for exchange.
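The three operators can be sketched under a simplified, assumed representation: a motif is a list of `(kind, length)` tuples, where `"pair"` stands for a pair of complementary segments (so changing its length changes both halves together) and `"loop"` stands for a non-pairing segment. This encoding, the length bounds, and the caller-supplied `fitness` function are illustrative assumptions, not GPRM's actual data structures.

```python
import random

def reproduce(population, fitness):
    """Keep the better half of the population, ranked by fitness."""
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[: max(1, len(ranked) // 2)]

def mutate(motif, rng, min_len=1, max_len=10):
    """Randomly change the length of one randomly chosen segment."""
    child = list(motif)
    index = rng.randrange(len(child))
    kind, _ = child[index]
    child[index] = (kind, rng.randint(min_len, max_len))
    return child

def crossover(first, second, rng):
    """Swap one segment of a randomly chosen kind between two motifs."""
    kind = rng.choice(["pair", "loop"])
    first_sites = [i for i, seg in enumerate(first) if seg[0] == kind]
    second_sites = [i for i, seg in enumerate(second) if seg[0] == kind]
    if not first_sites or not second_sites:
        return list(first), list(second)  # nothing of that kind to exchange
    i, j = rng.choice(first_sites), rng.choice(second_sites)
    child_a, child_b = list(first), list(second)
    child_a[i], child_b[j] = second[j], first[i]
    return child_a, child_b

# Two hypothetical hairpin-like motifs: one stem plus one loop each.
rng = random.Random(1)
motif_a = [("pair", 4), ("loop", 3)]
motif_b = [("pair", 6), ("loop", 5)]
child_a, child_b = crossover(motif_a, motif_b, rng)
mutant = mutate(motif_a, rng)
```

Swapping only segments of the same kind keeps every child a well-formed motif, which is why the sketch picks the kind first and the positions second.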
39Reconstruction of Transcriptional Regulatory Networks
- Various genome projects produce a sufficient amount of sequence data
- Microarray technologies generate a large amount of gene expression data
- Large-scale, high-throughput sequence data and expression profiles are considered among the most promising resources for reconstructing gene networks
- Computational analysis and reconstruction of genetic regulatory networks is now feasible
- Goal: combine different types of information to model transcriptional regulatory networks for genes and TFs of interest
40Methods
- A Bottom-up Approach
- transcription modules -> transcription network
- Transcription module: a functional unit consisting of a TF, its target genes, and the genes producing this TF
41Objective
- Incorporate several hypotheses and analyze multiple data sources to enhance performance in predicting transcription modules, and construct transcription networks from these modules.
- The transcription binding site information
- Expression profile similarity of potential co-regulated target genes
- Correlation between the expression profile of the genes producing the TF and that of the genes it regulates
42Background Hypotheses
- The development of large-scale expression monitoring and the availability of complete genome sequences allow the refinement of computational analysis.
- A candidate gene is considered a target gene of a particular TF if
- The upstream region of the gene contains the binding sites of the TF.
- The PEA associated with the gene is significantly small.
- Expression profile similarity of potential co-regulated target genes
- The PF associated with the gene is significantly small.
- Correlation between a target gene and the genes producing the TF
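The three conditions combine into a simple filter, sketched below. PEA and PF are treated as precomputed inputs, since the slides do not give their formulas. The default cutoffs reuse the PF 0.03 and PEA 0.65 settings quoted in the results slide. Treating the cutoffs as inclusive is an assumption, and the gene names and scores are hypothetical.

```python
def is_candidate_target(has_binding_site, pea, pf, pea_max=0.65, pf_max=0.03):
    """True when all three hypotheses hold for one candidate gene."""
    return bool(has_binding_site) and pea <= pea_max and pf <= pf_max

def select_targets(candidates, pea_max=0.65, pf_max=0.03):
    """Return the names of candidates that pass every criterion.

    `candidates` maps gene name -> (has_binding_site, PEA, PF).
    """
    return [
        name
        for name, (site, pea, pf) in candidates.items()
        if is_candidate_target(site, pea, pf, pea_max, pf_max)
    ]

# Hypothetical candidates for a single TF.
candidates = {
    "gene1": (True, 0.10, 0.001),   # passes all three tests
    "gene2": (True, 0.90, 0.001),   # PEA too large
    "gene3": (False, 0.10, 0.001),  # no binding site upstream
    "gene4": (True, 0.10, 0.20),    # PF too large
}
targets = select_targets(candidates)
```

Requiring all three conditions is what cuts false positives: a binding-site match alone would have accepted `gene2` and `gene4` as well.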
43Synergy of Binding Sites, PEA, PF
- Binding sites
- IUPAC/IUB code
- Matched against upstream regions to provide preliminary candidate cis-regulatory sequences
- Problem: false positives
- Combination of sequence similarity and expression phenotype to reduce false positives
- PEA (Probabilistic Element Assessment)
- PF (P-value of F test)
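Matching an IUPAC-coded binding site against an upstream region can be sketched by translating the consensus into a regular expression. The consensus `WCGCGW` and the upstream sequence are hypothetical, and only the forward strand is searched here, which is a simplification.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_sites(consensus, upstream):
    """Return 0-based start positions where the consensus matches.

    The lookahead makes overlapping matches visible.
    """
    pattern = "".join(IUPAC[code] for code in consensus.upper())
    return [m.start() for m in re.finditer(f"(?=({pattern}))", upstream.upper())]

# Hypothetical consensus and upstream region.
sites = find_sites("WCGCGW", "GGACGCGTCC")
```

Short degenerate patterns like this match by chance fairly often, which is the false-positive problem the expression-based PEA and PF filters are meant to address.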
44Synergy of PF and PEA
- Varying PF and PEA (0.4 to 10^-6)
- PF-PEA combination compared against PF or PEA alone
- Over all 27 TFs, an appropriate PF-PEA threshold combination outperforms the best single PF or PEA threshold
- Demonstrates the synergy of the PF-PEA combination
45Results: Reconstruction of Transcription Networks
- Goal: to reconstruct the transcriptional regulatory network for a given set of TFs and genes of interest
- TFs of interest: MCM1, ACE2, SWI5, SBF, MBF
- Parameter setting: PF threshold 0.03 and PEA threshold 0.65
- Case 1: CLB1, CLB2, SWI5, ACE2, CDC5, CLN3, SWI4, FAR1, RME1, SIC1, CDC6, CLN1, CLN2, CLB5 and CLB6, related to the cell cycle
- Case 2: randomly pick 130 mitotic cell cycle-related genes from CYGD
46Advantages of Combinatorial Approach
- Different metrics cover different background knowledge
- Exploit more information to avoid false positives when building transcription modules
- These metrics complement each other by characterizing different biological activities
- e.g., similar expression profiles among co-regulated genes, and the association between regulators and their target genes
- More robust
- e.g., in case TF binding sites are unavailable, regression analysis can still be applied to identify reasonable transcription modules.