MSCS282: Bioinformatics I - PowerPoint PPT Presentation


1
MSCS282 Bioinformatics I
  • Introduction
  • Craig A. Struble, Ph.D.
  • Department of Mathematics, Statistics, and
    Computer Science
  • Marquette University
  • Michael A. Thomas, Ph.D.
  • Bioinformatics Research Center
  • Medical College of Wisconsin

2
Michael A. Thomas, Ph.D.
3
Overview
  • Welcome
  • Syllabus
  • Student Introductions
  • Introduction to Bioinformatics

4
What Is Bioinformatics?
  • Bioinformatics is a new subject of genetic data
    collection, analysis and dissemination to the
    research community. (Hwa A. Lim, 1987)
  • Bioinformatics: research, development, or
    application of computational tools and approaches
    for expanding the use of biological, medical,
    behavioral or health data, including those to
    acquire, store, organize, archive, analyze, or
    visualize such data. (NIH working definition,
    2000)

5
What is Bioinformatics?
[Diagram: Bioinformatics shown together with three
groups of fields]
  • Informatics, Computer Science, Computer
    Engineering, Information Science
  • Biology, Other Natural Sciences
  • Mathematics, Statistics
6
Bioinformatics is sometimes called
  • Computational biology
  • Computational molecular biology
  • Biomolecular informatics
  • Computational genomics

7
Different perspectives on Bioinformatics
  • Bioinformatics is a tool
      • Biologists, biochemists, medical professionals,
        etc.
      • Obtain meaningful and understandable results
  • Bioinformatics is a discipline
      • Informaticians, mathematicians, statisticians,
        etc.
      • Generate meaningful and understandable results

8
Biological Data
  • Genomes
      • DNA sequences of A, T, C, G
      • Annotated with function, interesting features
  • Proteins
      • Amino acid sequences
      • Sequences of 20 letters
      • Annotated with structure, function, etc.

9
Biological Data
  • Gene Expression
      • Dynamic behavior of genes
  • Protein Expression
      • Dynamic behavior of proteins
  • Structural Features
      • RNA and proteins

10
Biological Data: Sus scrofa agouti-related protein gene
        1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac
       61 tgctggcaat gcccaccatg ctgggggccc agataggctt ggcccccctg gagggtatcg
      121 gaaggcttga ccaagccttg ttcccagaac tccaaggtca gtgcgggcag gagtgggttg
      181 ggtggggctt ggacatcctc tggccacaaa gtattctgct tgtatgagcc ctttcttccc
      241 cttcccaatc ccaggcctgg gaggtgggtg ttttgtgcat gggtggttct gccctcacat
      301 catctgtccc agatctaggc ctgcagcccc cactgaagag gacaactgca gaacgggcag
      361 aagaggctct gctgcagcag gccgaggcca aggccttggc agaggtaaca gctcagggaa
      421 agggctgagg ccacaagtct tgagtgggtg tgtcaagcat caacctctat ctgtgcttgg
      481 agttgccact gtggtacaac gggattggcg gtgtcttggg agcgctggga cgtggtttca
      541 tccccggcca gcacaagtgg gttaaggatc tggccttgcc atcccttcag cttaggctga
      601 gactgtggct tggagctgat ctctgaccgg aagctccata tgctctgggg tgaccaaaaa
      661 tggaaaaaca aacatacaaa acacctctac ctgcacttcc tgaccccctc acccggggcg
      721 acactgcaga ccatcccgtt cacgctccac ttccatcctg ccttgatctg gcgcattcca
      781 tgaatgtgct tttggaagtc cttgtttccc aacccttgta ggtgctagat cctgaaggac
      841 gcaaggcacg ctccccacgt cgctgcgtaa ggctgcacga atcctgtctg ggacaccagg
      901 taccatgctg cgacccatgt gctacatgct actgccgttt cttcaacgcc ttctgctact
      961 gccgcaagct gggtactgcc acgaacccct gcagccgcac ctagctggcc agccaatgtc
     1021 gtcg
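
A record in this numbered layout (a position, then blocks of ten bases) is easy to turn back into a plain sequence string. The following is a minimal sketch, not part of the original slides: the function names `parse_origin_block` and `gc_fraction` are made up for illustration, and only the first and last lines of the record are used as sample input.

```python
from collections import Counter


def parse_origin_block(text):
    """Join numbered sequence lines into one lowercase string.

    Position numbers, bullet characters and whitespace are discarded,
    so wrapped or bulleted copies of the record parse the same way.
    """
    bases = []
    for line in text.splitlines():
        tokens = [t for t in line.split() if t != "•"]
        # Drop the leading position number if the line has one.
        if tokens and tokens[0].isdigit():
            tokens = tokens[1:]
        bases.extend(tokens)
    return "".join(bases).lower()


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    counts = Counter(seq.lower())
    return (counts["g"] + counts["c"]) / len(seq)


# First and last lines of the record shown on the slide.
example = """
        1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac
     1021 gtcg
"""
seq = parse_origin_block(example)
print(len(seq), round(gc_fraction(seq), 3))
```

Feeding the whole record through `parse_origin_block` gives the full sequence, which can then be passed to alignment or search tools.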

11
Genome Sizes
12
Database Growth
13
Database Growth
14
Database Growth
15
Database Growth
  • Exponential growth in sequence data
  • Not much growth in sequence size
  • Expect exponential growth in annotation
    information
  • What are we to do with all this data?

16
Challenges of Large Databases
  • Storage
      • Indexing, physical layout, memory management
  • Modeling
      • Relational, hierarchical, semi-structured
  • Efficiency
      • Update, query, analysis
  • Interpretation
      • Visualization

17
Problems in Bioinformatics
  • Consider just sequence analysis
  • Sequence alignment
  • Gene discovery
  • Promoter discovery
  • Intron splice sites
  • Protein and RNA structure prediction

18
Applications of Bioinformatics
  • VCMAP
  • DORR and ASAP
  • miRNA

19
VCMAP
  • Comparative mapping is a strategy that allows
    cross-organism study of physiological genomics
  • Virtual Comparative Map (VCMap) combines homology
    analysis with mathematical predictions to
    construct cross-organism maps, not yet tested in
    the wet lab, between human, rat, mouse and
    zebrafish
  • This application provides a highly modular
    investigative environment for:
      • Analysis of multiple organisms, including
        zebrafish
      • Collection of genetic and radiation hybrid maps
      • Prediction of genes based on homology

20
VCMAP
  • Homology analysis was based on sequence
    similarity (Altschul et al. 1990) and curated
    homologous genes.
  • 85% similarity over a 100 bp stretch across all
    species was used to create the maps
  • NCBI's UniGene sequence sets and RH and genetic
    maps were chosen to create anchor objects
    (Kwitek-Black et al. 2001).
  • 1-to-1 homologous objects were used to build
    the virtual comparative maps with a pipeline
    architecture
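
The filtering and 1-to-1 selection described above can be sketched in a few lines. This is an illustration only, not the VCMap implementation. It assumes the hits come from BLAST tabular output (`-outfmt 6`, whose first four columns are query ID, subject ID, percent identity and alignment length), it reads the slide's threshold as at least 85% identity over at least 100 bp, and it uses percent identity as a stand-in for "similarity". The sequence IDs and function names are hypothetical.

```python
from collections import Counter, namedtuple

Hit = namedtuple("Hit", "query subject pident length")


def parse_tabular(lines):
    """Parse BLAST tabular lines: qseqid, sseqid, pident, length, ..."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), int(fields[3])))
    return hits


def filter_hits(hits, min_identity=85.0, min_length=100):
    """Keep hits that meet both the identity and the length threshold."""
    return [h for h in hits if h.pident >= min_identity and h.length >= min_length]


def one_to_one(hits):
    """Keep pairs whose query and subject each occur in exactly one pair."""
    pairs = {(h.query, h.subject) for h in hits}
    query_count = Counter(q for q, _ in pairs)
    subject_count = Counter(s for _, s in pairs)
    return sorted(
        (q, s) for q, s in pairs if query_count[q] == 1 and subject_count[s] == 1
    )


# Hypothetical hits between two UniGene-style sequence sets.
rows = [
    "Hs.1\tRn.7\t91.20\t140\t12\t0\t1\t140\t5\t144\t1e-50\t180",
    "Hs.2\tRn.8\t88.00\t120\t14\t0\t1\t120\t1\t120\t1e-40\t150",
    "Hs.2\tRn.9\t86.00\t110\t15\t0\t1\t110\t1\t110\t1e-35\t140",  # ambiguous query
    "Hs.3\tRn.5\t80.00\t300\t60\t0\t1\t300\t1\t300\t1e-60\t200",  # identity too low
    "Hs.4\tRn.6\t95.00\t60\t3\t0\t1\t60\t1\t60\t1e-20\t100",      # too short
]
anchors = one_to_one(filter_hits(parse_tabular(rows)))
print(anchors)
```

Dropping every ambiguous pair is the strictest reading of "1-to-1". A reciprocal-best-hit rule would be a common, more permissive alternative.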

21
VCMAP
[Figure: VCMap pipeline. Box labels: Download UniGene
data from NCBI; Mask UniGene sequences; Load UniGene
data to DB; DB; Format masked sequences; BLAST; Map
Data; Search UniGene; VC Maps Building; Anchor Report;
Generate anchor report; Create Homolog UniGene Object
and Scoring; 1-to-1 Objects]
22
VCMAP
23
Disease Oriented Research Resource
  • The major goal of the RGD Disease Oriented
    Research Resource is to build collaborative
    relationships between RGD and 20 disease-focused
    rat research communities in order to identify,
    collect, and integrate disease-specific data and
    information, down to specific genes of interest,
    into RGD Disease portals.

Specific Goals for RGD
  • Prioritize data for curation and addition to RGD
    based on targeted disease areas
  • Effectively combine automated and manual data
    acquisition and curation methods
  • Provide a way to integrate Rat Genome Sequencing
    Project results with RGD activities
  • Help RGD incorporate tools developed at the BRC,
    adding a data mining and analysis focus to
    traditional curation and database functions

24
DORR Workflow
[Figure: DORR workflow. Box labels: Genes; Curation;
Strains; ASAP; QTLs; VCMap; Biomedical Literature;
Identify Disease; GROIs; Microarray; In Development;
Pathways; Phenotypes]
25
Data Acquisition
26
Automated Sequence Analysis Pipeline
[Figure: ASAP pipeline. Box labels, in the order
extracted: Markers; seqs mapped in ROI; Seqs predicted
in ROI; Seq (Fasta/Trace); VCMap; input sequences;
RepeatMask (appears four times); Search extra genomic
sequence (HTGS, TraceDB, nr); Assembly Seq; Assembled
Seq; Unassembled Seq; MetaGene; UniGene Search; ePCR;
Homolog Search; Hot Zone; BLAST (nrpir)]
[Outputs: UniGene Report; Repeat Report; Marker Report;
Homolog Report; MetaGene Report; Gene Report;
Visualization (Clickable Image Map); Additional
Reports]
27
miRNA Gene and Target Prediction
  • microRNA genes (miRNAs) were recently recognized
    as a class of functional non-coding genes
  • 70 nt precursor which has a hairpin fold
  • 20 nt RNA molecule produced when Dicer cuts the
    stem loop
  • First identified were lin-4 and let-7
  • Developmental role in C. elegans

28
miRNA Examples
29
miRNA Gene Prediction
  • Can we predict where miRNA genes might be?
  • MiRscan (Burge Lab)

Scan C. elegans for 70 nt hairpin folds (structure
prediction)
Compare with C. briggsae (sequence alignment)
  • Score alignments
  • 3' and 5' conservation
  • Overall conservation
  • Size of loop, etc.

Select sequences with score > threshold
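
The hairpin scan and the final thresholding step can be illustrated with a toy scorer. This is a crude stand-in, not the Burge Lab method: real tools use RNA structure prediction, whereas the sketch below only counts positions in a window that could base-pair (Watson-Crick or G-U wobble) if the window folded exactly in half around a fixed-size loop. The conservation comparison against C. briggsae is not covered. The function names, loop size and thresholds are all assumptions.

```python
# Pairs allowed in an RNA stem: Watson-Crick plus G-U wobble.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}


def hairpin_score(window, loop=10):
    """Count stem positions that can pair when the window folds in half."""
    rna = window.lower().replace("t", "u")
    arm = max(0, (len(rna) - loop) // 2)
    five_prime = rna[:arm]
    # Reverse the 3' arm so position i faces position i of the 5' arm.
    three_prime = rna[len(rna) - arm:][::-1]
    return sum((x, y) in PAIRS for x, y in zip(five_prime, three_prime))


def scan(seq, width=70, loop=10, threshold=25):
    """Yield (start, score) for every window scoring above the threshold."""
    for start in range(len(seq) - width + 1):
        score = hairpin_score(seq[start:start + width], loop)
        if score > threshold:
            yield start, score


# A perfect 10 bp stem around a 4 nt loop (made-up sequence).
stem = "gacugacuga"
complement = {"a": "u", "u": "a", "g": "c", "c": "g"}
window = stem + "aaaa" + "".join(complement[b] for b in reversed(stem))
print(hairpin_score(window, loop=4))
print(list(scan(window, width=24, loop=4, threshold=9)))
```

With a genome-sized input, the windows returned by `scan` would be the candidates passed on to the cross-species comparison.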
30
miRNA Target Prediction
  • lin-4 and let-7 interact with the 3' UTR
  • Idea: look for conserved 3' UTR regions which are
    complementary to discovered miRNA genes

Find conserved sequences of 20 nt in length from
3' UTR database
Align with miRNA genes
  • Score alignment
  • 3' and 5' matches
  • Overall matching

Select high scoring matches
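
The complementarity search can be sketched as follows. This is an ungapped toy version under stated assumptions: it slides the reverse complement of the miRNA along a UTR and counts matching positions, with no separate weighting of 3' and 5' matches and no conservation filter. The sequences and function names are invented for the example.

```python
COMPLEMENT = {"a": "u", "u": "a", "g": "c", "c": "g"}


def to_rna(seq):
    """Lowercase the sequence and write it in the RNA alphabet."""
    return seq.lower().replace("t", "u")


def reverse_complement(rna):
    """Reverse complement of an RNA string; unknown bases become 'n'."""
    return "".join(COMPLEMENT.get(b, "n") for b in reversed(rna))


def score_sites(mirna, utr):
    """Return (start, matches) for every ungapped placement on the UTR."""
    target = reverse_complement(to_rna(mirna))
    utr = to_rna(utr)
    width = len(target)
    results = []
    for start in range(len(utr) - width + 1):
        site = utr[start:start + width]
        matches = sum(a == b for a, b in zip(site, target))
        results.append((start, matches))
    return results


def best_sites(mirna, utr, min_matches):
    """Keep only placements with at least min_matches complementary bases."""
    return [(s, m) for s, m in score_sites(mirna, utr) if m >= min_matches]


# Made-up miRNA and a UTR carrying one perfectly complementary site.
mirna = "acguacguuagc"
utr = "aaaa" + reverse_complement(mirna) + "cccc"
print(best_sites(mirna, utr, min_matches=len(mirna)))
```

Lowering `min_matches` admits imperfect sites, which is where a weighted score for the 3' and 5' ends would start to matter.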
31
Goals of the Course
  • For everyone
      • Communication
  • For the biologist
      • Incorporate bioinformatics into research
      • Understand computational modeling
  • For the computational scientist
      • Develop tools for biological research
      • Create new algorithms for mining biological data
      • Understand how to find biologically meaningful
        information

32
Summary
  • Bioinformatics is truly interdisciplinary
      • Biology (natural sciences), informatics,
        mathematics, statistics
  • Databases
      • Large, semistructured, incomplete, inaccurate
  • Wide range of problems
      • Solutions employ knowledge from the sciences
        with algorithms and models from informatics,
        mathematics, and statistics

33
Biological Data
  • DNA and protein sequences are annotated
      • Source
      • Organism
      • Function
      • Updates
      • Etc.

34
Classic examples
  • Sequence alignment
  • Multiple sequence alignment
  • Examples from Setubal/Meidanis (1997)
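
As a concrete instance of pairwise sequence alignment, here is a minimal sketch of Needleman-Wunsch global alignment with a linear gap penalty. It is not taken from the cited textbook, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = global_alignment("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

The table needs time and memory proportional to the product of the two sequence lengths, which is one reason database-scale searches rely on heuristics such as BLAST.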