An Overview of Bioinformatics - PowerPoint PPT Presentation

1
An Overview of Bioinformatics
  • Yuh-Jyh Hu
  • CIS, NCTU

2
Overview
  • Genomics, Bioinformatics and Medicine

  • Genomics and medicine: Molecular Diagnosis,
    Identifying Drug Targets, Molecular Epidemiology,
    Genetic Therapy, Rational Drug Design
  • Fields contributing to bioinformatics: Information
    Theory, Graph Theory, Artificial Intelligence,
    Robotics, Machine Learning, Databases, Statistics,
    Algorithms
3
  • Introduction to Bioinformatics
  • Bioinformatics: a strategic discipline at the
    frontier between biology and computer science
  • Loosely defined as the intersection of molecular
    and computational biology
  • Contributions come primarily from the academic
    user community
  • Driving force: the advent of new, efficient
    experimental techniques, especially in DNA
    sequencing
  • Major goals: understanding life and evolution,
    and discovering new drugs and therapies.

4
  • Significant advances
  • Collecting and managing data
  • Databases of various types
  • Nucleotide sequences
  • Protein sequences
  • Structures
  • Gene expression
  • Integrated data retrieval
  • Entrez at NCBI
  • Data Analysis
  • Sequence
  • Structure
  • Expression

7
  • Entrez
  • Developed at NCBI, freely available, and allows
    integrated access to PubMed records, nucleotide
    and protein sequence data, and 3D structure
    information
  • http://www.ncbi.nlm.nih.gov/Entrez

9
Central Dogma of Molecular Biology
  • DNA → DNA: replication
  • DNA → RNA: transcription
  • RNA → Protein: translation
  • RNA → DNA: reverse transcription
  • Molecules
  • Structure
  • Function
  • Processes
  • Mechanism
  • Regulation

10
What Information to organize?
  • Molecular Biology as Information Science
  • Central Dogma of Molecular Biology:
    DNA → RNA → Protein → Phenotype
  • Central Paradigm for Bioinformatics:
    Genomic Sequence → mRNA (gene expression level) →
    Protein Sequence → Protein Function → Phenotype
  • Biological process: performed or facilitated by
    proteins, executing instructions encoded in DNA
  • Bioinformatics: process large amounts and
    different types of information, applying various
    informatics techniques
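
The information flow from DNA to RNA to protein can be illustrated with a minimal sketch. The function names are hypothetical, the standard genetic code is assumed, and translation simply starts at the first base and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna):
    """DNA coding strand -> mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")

def reverse_transcribe(rna):
    """RNA -> DNA (U becomes T)."""
    return rna.upper().replace("U", "T")

def translate(mrna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    dna = reverse_transcribe(mrna)
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTAA")
protein = translate(mrna)
```

Real gene expression involves splicing, regulation, and reading-frame selection, none of which this sketch models.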

11
Organize Information
  • Redundancy and Multiplicity
  • Different sequences have the same/similar
    structure
  • Organism has many similar genes
  • A single gene may have multiple functions
  • How to find the similarities?
  • Modular parts
  • shared parts (bolts, washers)
  • unique parts (steering wheel)
  • Vast growth in data but limited increase of
    fundamentally new parts
  • More new protein structures, but not many new
    protein families
  • Simplification by grouping conserved elements
    (e.g. sequences, structures) into modular parts

12
  • Examining and analyzing data
  • Gene finding
  • ML techniques have been applied to almost all
    steps in computational gene finding, including
    the assignment of translation start and stop,
    quantification of reading frames, gene modeling,
    etc.
  • Promoter recognition: transcription initiation
    and termination
  • Transcription initiation: the first step in gene
    expression
  • RNA polymerase recognizes and binds to certain
    sequences called promoters.
  • Why difficult?
  • Large variable distance between DNA signals
    recognized by RNA polymerase
  • Many other factors involved in expression
    regulation

13
  • Gene Expression
  • Clustering: group genes with similar expression
    behavior together
  • Find target genes correlated with diseases
  • Construct genetic networks: understanding gene
    interactions
  • Motif prediction: patterns showing regularity
  • Sequence motifs: DNA, proteins
  • Structural motifs: RNA, proteins
  • Protein structure prediction
  • Classification
  • Protein family
  • DNA-protein interaction

14
  • Information Retrieval from Biological databases
  • Exponential growth of databases
  • GenBank is an annotated collection of all
    publicly available DNA sequences. It contains
    over 1.6 million sequence records covering over 1
    billion nucleotides.
  • Much effort has gone into making such data
    accessible to average users; the programs and
    interfaces resulting from these efforts are the
    focus here.
  • A possible scenario of information retrieval
  • Found a paper in PubMed which describes a gene in
    GenBank
  • May want to know the protein coded from the gene
  • May want to know its 3D structure recorded in
    structure database
  • Entrez at NCBI (http://www.ncbi.nlm.nih.gov/Entrez)

15
  • Visualization tools
  • Cn3D

16
Informatics Techniques in Bioinformatics
  • Databases
  • Building
  • Querying
  • String comparison
  • Text search
  • Sequence alignment
  • Significance statistics
  • Pattern finding
  • AI/Machine Learning
  • Data mining
  • Geometry
  • Robotics
  • Graphics
  • 3D matching
  • Physical simulation
  • Numerical analysis
  • Simulation
  • Visualization

17
  • Microarrays for Genomic Studies
  • Why is it important?
  • It is good to have information on (all) genes
    that are present in a genome.
  • Genetics is the study of the interactions among
    individual genes in an organism.
  • Biologists want to know the interplay of all
    genes simultaneously.
  • This leads to the need for high-throughput and
    large-scale technologies.

18
  • Gene Expression Studies
  • Pattern of genes expressed in a cell is
    characteristic of its current state.
  • Virtually all differences in cell state are
    correlated with changes in mRNA levels of many
    genes.
  • Expression patterns of genes may provide clues to
    their functions by comparison.

19
  • DNA Microarray Technology
  • Initially conceived to detect the expression of
    thousands of genes simultaneously.
  • Potential applications
  • Identification of complex genetic diseases
  • Drug discovery
  • Differing expression of genes over time, between
    tissues, or between disease states
  • Potential impacts
  • Preventive medicine
  • Ability to subtype diseases, and design drugs
    based on their causes instead of symptoms.
  • Personalized drugs
  • Rapid diagnosis

20
  • Microarray Technology
  • Basic technology is the same
  • DNA sequence complementary to the gene of
    interest is generated and then laid out in
    microscopic quantities on solid surfaces at
    predefined positions
  • DNAs from samples are washed over the surface;
    only complementary DNAs remain, because they bind
    to the laid-out sequences
  • Presence of bound DNAs is detected by
    fluorescence following laser excitation

21
  • Microarray Fabrication
  • Photolithography
  • Photomasks direct DNA synthesis
  • Add one base to growing chain at a time
  • Ink-jetting
  • Utilize ink-jet printing to dispense
    sub-nanolitre volumes of reagent to defined
    positions
  • Microspotting
  • Robot with a printhead
  • What DNA sequences are laid on the surface?
  • A series of DNA fragments vs. the complete gene
    sequence

22
  • How it works
  • Fluorescent samples are prepared from two mRNA
    sources to be compared
  • Cy3 (green) used for one sample
  • Cy5 (red) used for the other sample
  • Samples are mixed and washed over the microarray
  • The microarray is then scanned by a laser scanner
    to detect fluorescence levels.
  • The ratio of Cy3 to Cy5 (G/R) is calculated for
    each array element.
  • Relative intensity of G/R is a reliable measure
    of the relative abundance of specific mRNA in
    each sample.
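
The G/R calculation above can be sketched as follows. The slide only specifies the Cy3/Cy5 ratio; taking the base-2 logarithm is an added assumption (a common convention that makes up- and down-regulation symmetric around zero), and the function name and dictionary layout are hypothetical.

```python
import math

def log_ratios(cy3, cy5):
    """Return log2(Cy3 / Cy5) per array element.

    cy3, cy5: dicts mapping an element (gene) name to its measured
    green and red fluorescence intensity. Elements that are missing
    from either channel or have a non-positive intensity are skipped.
    """
    ratios = {}
    for gene, green in cy3.items():
        red = cy5.get(gene)
        if red is None or green <= 0 or red <= 0:
            continue
        ratios[gene] = math.log2(green / red)
    return ratios

# Illustrative intensities only, not measured data.
green = {"geneA": 200.0, "geneB": 50.0, "geneC": 100.0}
red = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0}
ratios = log_ratios(green, red)
```

A positive value means the mRNA is more abundant in the Cy3-labelled sample, a negative value means it is more abundant in the Cy5-labelled sample.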

23
  • Microarray Image

27
Gene Regulation Analysis
  • Motivation
  • Microarray technology provides a global view of
    changes in gene expression on a genomic scale.
  • Obtain temporal patterns of gene expression
  • What else may we want to know?
  • Consensus sequence motifs
  • Correlation between motifs
  • What causes genes to behave differently
  • To learn beyond how genes behave over time
  • Look into control regions of genes
  • Find hypotheses

28
Gene Regulation Analysis
  • Method
  • Use an Affymetrix GeneChip machine to collect gene
    expression data.
  • Cluster the genes of the genome based on temporal
    patterns, e.g., slope, distance, etc.
  • For each cluster, use a motif-finding algorithm to
    find motifs.
  • Given multiple clusters of interest, transform
    raw sequences into a higher-level representation.
  • Apply constructive induction to find motif
    interactions.
  • Apply an inductive learner to generate hypotheses.
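
As an illustration of the clustering step only, the sketch below groups genes by the slope of their temporal expression pattern. The least-squares slope, the three labels, and the `flat` tolerance are assumptions chosen for the example; the slides do not specify the clustering method actually used.

```python
def slope(values):
    """Least-squares slope of an expression time series.

    Time points are assumed to be equally spaced (0, 1, 2, ...);
    at least two values are required.
    """
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den

def cluster_by_slope(profiles, flat=0.1):
    """Group genes into rising, falling, and flat temporal patterns."""
    clusters = {"rising": [], "falling": [], "flat": []}
    for gene, values in profiles.items():
        s = slope(values)
        if s > flat:
            clusters["rising"].append(gene)
        elif s < -flat:
            clusters["falling"].append(gene)
        else:
            clusters["flat"].append(gene)
    return clusters

# Illustrative profiles only, not measured data.
profiles = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1], "g3": [2, 2, 2, 2]}
clusters = cluster_by_slope(profiles)
```

Each resulting cluster would then be passed to a motif-finding algorithm, as the next step in the method describes.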

30
RNA Secondary Structure Prediction Basics
  • Like protein secondary structure, RNA secondary
    structure can be viewed as an intermediate step
    in the formation of a 3D structure.
  • In predicting RNA secondary structure, several
    simplifying assumptions are usually made.
  • The most likely structure is similar to the
    energetically most stable structure.
  • The energy associated with any position in the
    structure is only influenced by local sequence
    and structure. This assumption is most reliable
    for standard Watson-Crick base pairs and single
    G/U pairs surrounded by Watson-Crick pairs.
  • The structure is assumed to be formed by folding
    of the chain back on itself in a manner that does
    not produce any knots.

31
Types of RNA Secondary Structure Prediction
Methods
  • Based on objective functions
  • Free energy minimization
  • Covariance analysis from sequence comparison
  • Based on number of RNA sequences for which to
    predict
  • Single-sequence prediction
  • To find the possible folding of a single RNA
    sequence
  • Multiple-sequence prediction
  • To find a global structure alignment for a set of
    RNA sequences
  • To find common structure elements within a set of
    RNA sequences
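
For single-sequence prediction, a minimal sketch is given below. It swaps free energy minimization for a simpler objective, maximizing the number of nested base pairs with a Nussinov-style dynamic program, so it is a stand-in and not the energy model the slides refer to. It allows Watson-Crick and G/U pairs and forbids knots; the `min_loop` parameter (minimum unpaired bases in a hairpin) is an assumption.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested (knot-free) base pairs in seq."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j is unpaired
            # case: position j pairs with some k, leaving a loop of
            # at least min_loop bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

hairpin_pairs = max_base_pairs("GGGAAAUCC")
```

A traceback over the `dp` table would recover the pairs themselves; energy-based methods follow the same recursion shape but score stacks and loops with thermodynamic parameters.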

32
Motif Prediction vs. Concept Learning
  • Target concept: common motifs
  • Training examples: biosequences
  • Motif prediction as supervised learning
  • Positive examples
  • a given set of coregulated RNAs
  • Negative examples
  • the same number of sequences randomly generated
    based on the observed frequencies of sequence
    alphabet in positive examples.
  • Target concept
  • The common structural motifs that can be used to
    distinguish the given coregulated RNAs from the
    random sequences.
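
Generating the negative examples can be sketched as below. The slide states only that the same number of sequences is drawn from the observed alphabet frequencies; giving each random sequence the same length as the corresponding positive example, and the function name and `seed` parameter, are assumptions.

```python
import random
from collections import Counter

def random_negatives(positives, seed=0):
    """One random sequence per positive example, drawn from the
    letter frequencies observed across all positive examples."""
    rng = random.Random(seed)
    counts = Counter("".join(positives))
    letters = sorted(counts)
    weights = [counts[c] for c in letters]
    return [
        "".join(rng.choices(letters, weights=weights, k=len(seq)))
        for seq in positives
    ]

# Illustrative sequences only, not real coregulated RNAs.
positives = ["AUGGC", "CCGAU", "GGAUC"]
negatives = random_negatives(positives, seed=1)
```

A learner is then asked to find motifs that separate `positives` from `negatives`.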

33
GPRM: Genetic Programming for RNA Motifs
  • Focus on finding Watson-Crick complementary
    basepairs
  • C-G and A-U
  • RNA secondary structures are typically formed by
    basepairing interactions.
  • Three components of GPRM
  • Population of putative structural motifs
  • Fitness function of motifs
  • Genetic operators that simulate the natural
    evolution process of motifs

34
Representing Individuals in a Population
  • Each individual in a population is a putative
    motif
  • Structural motif description
  • Watson-Crick complementary segments
  • Non-pairing segments

35
Fitness Function
  • Interested in those motifs that can reflect the
    characteristics conserved in a family of
    coregulated RNAs
  • Assign higher values to those motifs commonly
    shared by the given family of RNAs, and rarely
    contained in random RNA sequences.
  • We define the fitness function as

36
Genetic Operators
  • Reproduction
  • Pass the better half of the population to the
    next generation
  • Accelerate the reproduction process
  • Mutation
  • If a complementary segment is picked, its segment
    length and corresponding pairing segment are both
    randomly changed.
  • If a non-pairing segment is selected, then only
    its length is randomly modified.
  • Crossover
  • Exchange segment configuration between two
    putative motifs.
  • Either a pair of complementary segments or a
    non-pairing segment is randomly chosen for
    exchange.
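
The three operators can be sketched under a simplified, hypothetical representation: a motif is a list of elements, each either a "pair" (two complementary segments of equal length) or a "loop" (a non-pairing segment). Resizing a "pair" element changes both of its segments together, which is one reading of the mutation rule above. The fitness function is passed in as a parameter because its definition is not given in the transcript.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Element:
    kind: str    # "pair" (complementary segments) or "loop" (non-pairing)
    length: int  # segment length in nucleotides

def reproduce(population, fitness):
    """Pass the better half of the population to the next generation."""
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[: max(1, len(ranked) // 2)]

def mutate(motif, rng, min_len=1, max_len=8):
    """Randomly change the length of one randomly chosen element."""
    i = rng.randrange(len(motif))
    child = list(motif)
    child[i] = replace(motif[i], length=rng.randint(min_len, max_len))
    return child

def crossover(a, b, rng):
    """Exchange one element of a randomly chosen kind between two motifs."""
    kind = rng.choice(["pair", "loop"])
    in_a = [i for i, e in enumerate(a) if e.kind == kind]
    in_b = [i for i, e in enumerate(b) if e.kind == kind]
    if not in_a or not in_b:  # nothing of that kind to exchange
        return list(a), list(b)
    i, j = rng.choice(in_a), rng.choice(in_b)
    child_a, child_b = list(a), list(b)
    child_a[i], child_b[j] = b[j], a[i]
    return child_a, child_b

rng = random.Random(0)
parent_a = [Element("pair", 4), Element("loop", 5)]
parent_b = [Element("pair", 6), Element("loop", 3), Element("loop", 2)]
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, rng)
```

Keeping elements immutable (`frozen=True`) means parents are never modified in place, so the better half can be carried over unchanged.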

39
Reconstruction of Transcriptional Regulatory
Networks
  • Various Genome Projects produce a sufficient
    amount of sequence data
  • Microarray technologies generate a large amount
    of gene expression data
  • Large-scale, high-throughput sequence data and
    expression profiles are considered among the
    most promising resources for reconstructing gene
    networks
  • Computational analysis and reconstruction of
    genetic regulatory networks is now feasible
  • Goal: combine different types of information to
    model transcriptional regulatory networks for
    genes and TFs of interest

40
Methods
  • A Bottom-up Approach
  • transcription modules → transcription network
  • Transcription module: a functional unit
    consisting of a TF, its target genes, and the
    genes producing this TF

41
Objective
  • Incorporate several hypotheses and analyze
    multiple data sources to enhance the performance
    in predicting transcription modules and construct
    transcription networks from these modules.
  • The transcription binding site information
  • Expression profile similarity of potential
    co-regulated target genes
  • Correlation between expression profile of the
    genes producing the TF and that of those
    regulated

42
Background Hypotheses
  • The development of large-scale expression
    monitoring and the availability of complete
    genome sequence allow the refinement of
    computational analysis.
  • A candidate gene is considered a target gene of a
    particular TF if
  • The upstream region of the gene contains the
    binding sites of the TF.
  • The PEA associated with the gene is significantly
    small.
  • Expression profile similarity of potential
    co-regulated target genes
  • The PF associated with the gene is significantly
    small.
  • Correlation between a target gene and the genes
    producing TF
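
The three conditions above can be read as a simple filter, sketched below. The function and argument names are hypothetical, how PEA and PF are computed is not shown, and the non-strict comparisons are an assumption. The default thresholds are the PF 0.03 and PEA 0.65 values quoted in the deck's parameter settings.

```python
def is_target_gene(has_binding_site, pea, pf,
                   pea_threshold=0.65, pf_threshold=0.03):
    """Apply the three hypotheses to one candidate gene of a TF.

    has_binding_site: True if the gene's upstream region contains
                      a binding site of the TF.
    pea, pf:          precomputed PEA and PF values for the gene.
    """
    return bool(has_binding_site
                and pea <= pea_threshold
                and pf <= pf_threshold)

def select_targets(candidates):
    """candidates: dict of gene name -> (has_binding_site, pea, pf)."""
    return [gene for gene, args in candidates.items()
            if is_target_gene(*args)]

# Illustrative values only, not results from the study.
candidates = {
    "geneA": (True, 0.20, 0.01),
    "geneB": (True, 0.90, 0.01),   # PEA too large
    "geneC": (False, 0.20, 0.01),  # no binding site
    "geneD": (True, 0.20, 0.50),   # PF too large
}
targets = select_targets(candidates)
```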

43
Synergy of Binding Sites, PEA, PF
  • Binding sites
  • IUPAC/IUB code
  • Matched against upstream to provide preliminary
    candidate cis-regulatory sequences
  • Problem: false positives
  • Combination of sequence similarity and expression
    phenotype to reduce false positives
  • PEA (Probabilistic Element Assessment)
  • PF (P-value of F test)
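
Matching an IUPAC-coded binding site against an upstream region can be sketched with a regular expression. The function name and the example consensus are hypothetical, only the forward strand is scanned, and overlapping matches are reported.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def find_sites(consensus, upstream):
    """Start positions (0-based) where the consensus matches upstream."""
    pattern = "".join(IUPAC[c] for c in consensus.upper())
    # The lookahead lets overlapping occurrences all be found.
    return [m.start()
            for m in re.finditer(f"(?=({pattern}))", upstream.upper())]

# Made-up consensus and sequence, for illustration only.
hits = find_sites("WGATAR", "AAGATAAGTGATAG")
```

Hits found this way are only preliminary candidates; short degenerate patterns match often by chance, which is the false-positive problem the PEA and PF filters are meant to reduce.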

44
Synergy of PF and PEA
  • Varying PF and PEA thresholds (from 0.4 down to
    10^-6)
  • PF-PEA combination against PF or PEA alone
  • Over all 27 TFs, an appropriate PF-PEA threshold
    combination outperforms the best single PF or PEA
    threshold
  • Demonstrates the synergy of the PF-PEA combination

45
Results: Reconstruction of Transcription Networks
  • Goal: to reconstruct the transcriptional
    regulatory network for a given set of TFs and
    genes of interest
  • TFs of interest: MCM1, ACE2, SWI5, SBF, MBF
  • Parameter setting: PF threshold 0.03 and PEA
    threshold 0.65
  • Case 1: CLB1, CLB2, SWI5, ACE2, CDC5, CLN3, SWI4,
    FAR1, RME1, SIC1, CDC6, CLN1, CLN2, CLB5 and CLB6,
    all related to the cell cycle
  • Case 2: randomly pick 130 mitotic cell
    cycle-related genes from CYGD

46
Advantages of Combinatorial Approach
  • Different metrics cover different background
    knowledge
  • Exploit more information to avoid false positives
    when building transcription modules
  • These metrics complement each other by
    characterizing different biological activities
  • e.g., similar expression profiles among
    co-regulated genes, and the association between
    regulators and their target genes
  • More robust
  • e.g., when TF binding sites are unavailable,
    regression analysis can still be applied to
    identify reasonable transcription modules.
