Title: MSCS282: Bioinformatics I
1MSCS282 Bioinformatics I
- Introduction
- Craig A. Struble, Ph.D.
- Department of Mathematics, Statistics, and
Computer Science
- Marquette University
- Michael A. Thomas, Ph.D.
- Bioinformatics Research Center
- Medical College of Wisconsin
2Michael A. Thomas, Ph.D.
3Overview
- Welcome
- Syllabus
- Student Introductions
- Introduction to Bioinformatics
4What Is Bioinformatics?
- Bioinformatics is a new subject of genetic data
collection, analysis and dissemination to the
research community. Hwa A. Lim (1987)
- Bioinformatics: Research, development, or
application of computational tools and approaches
for expanding the use of biological, medical,
behavioral or health data, including those to
acquire, store, organize, archive, analyze, or
visualize such data. NIH working definition
(2000)
5What is Bioinformatics?
(Diagram: Bioinformatics at the intersection of three areas)
- Informatics: Computer Science, Computer
Engineering, Information Science
- Biology and other natural sciences
- Mathematics and Statistics
6Bioinformatics is sometimes called
- Computational biology
- Computational molecular biology
- Biomolecular informatics
- Computational genomics
7Different perspectives on Bioinformatics
- Bioinformatics is a tool
- Biologists, biochemists, medical professionals,
etc.
- Obtain meaningful and understandable results
- Bioinformatics is a discipline
- Informaticians, mathematicians, statisticians,
etc.
- Generate meaningful and understandable results
8Biological Data
- Genomes
- DNA Sequences of A, T, C, G
- Annotated with function, interesting features
- Proteins
- Amino Acid Sequences
- Sequences of 20 letters
- Annotated with structure, function, etc.
9Biological Data
- Gene Expression
- Dynamic behavior of genes
- Protein Expression
- Dynamic behavior of proteins
- Structural Features
- RNA and proteins
10Biological Data
- Sus scrofa agouti-related protein gene
        1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac
       61 tgctggcaat gcccaccatg ctgggggccc agataggctt ggcccccctg gagggtatcg
      121 gaaggcttga ccaagccttg ttcccagaac tccaaggtca gtgcgggcag gagtgggttg
      181 ggtggggctt ggacatcctc tggccacaaa gtattctgct tgtatgagcc ctttcttccc
      241 cttcccaatc ccaggcctgg gaggtgggtg ttttgtgcat gggtggttct gccctcacat
      301 catctgtccc agatctaggc ctgcagcccc cactgaagag gacaactgca gaacgggcag
      361 aagaggctct gctgcagcag gccgaggcca aggccttggc agaggtaaca gctcagggaa
      421 agggctgagg ccacaagtct tgagtgggtg tgtcaagcat caacctctat ctgtgcttgg
      481 agttgccact gtggtacaac gggattggcg gtgtcttggg agcgctggga cgtggtttca
      541 tccccggcca gcacaagtgg gttaaggatc tggccttgcc atcccttcag cttaggctga
      601 gactgtggct tggagctgat ctctgaccgg aagctccata tgctctgggg tgaccaaaaa
      661 tggaaaaaca aacatacaaa acacctctac ctgcacttcc tgaccccctc acccggggcg
      721 acactgcaga ccatcccgtt cacgctccac ttccatcctg ccttgatctg gcgcattcca
      781 tgaatgtgct tttggaagtc cttgtttccc aacccttgta ggtgctagat cctgaaggac
      841 gcaaggcacg ctccccacgt cgctgcgtaa ggctgcacga atcctgtctg ggacaccagg
      901 taccatgctg cgacccatgt gctacatgct actgccgttt cttcaacgcc ttctgctact
      961 gccgcaagct gggtactgcc acgaacccct gcagccgcac ctagctggcc agccaatgtc
     1021 gtcg
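A numbered listing like the one above is convenient to read but awkward to compute on. As a minimal sketch, assuming the listing is available as plain text, the position numbers and spacing can be stripped to get one continuous sequence, from which simple statistics such as length and GC content follow. The function names here are illustrative and do not come from the slides; only the first 80 bases of the listing are used.

```python
import re


def parse_numbered_listing(text: str) -> str:
    """Join a numbered, space-separated sequence listing into one uppercase string."""
    # Remove position numbers, whitespace, and anything that is not a base letter.
    return re.sub(r"[^acgtnACGTN]", "", text).upper()


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# First 80 bases of the listing shown on the slide.
listing = """
        1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac
       61 tgctggcaat gcccaccatg
"""

seq = parse_numbered_listing(listing)
print(len(seq), round(gc_content(seq), 3))
```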
11Genome Sizes
12Database Growth
13Database Growth
14Database Growth
15Database Growth
- Exponential growth in sequence data
- Not much growth in sequence size
- Expect exponential growth in annotation
information
- What are we to do with all this data?
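To make "exponential growth" concrete: if a database holds N1 records at one time and N2 records some years later, and growth is assumed to be exponential, the doubling time is elapsed * ln 2 / ln(N2 / N1). The sketch below is illustrative only. The function names are hypothetical and the numbers in the usage lines are made up, not actual database statistics.

```python
import math


def doubling_time(n_start: float, n_end: float, elapsed_years: float) -> float:
    """Doubling time in years, assuming purely exponential growth between two counts."""
    if n_start <= 0 or n_end <= n_start or elapsed_years <= 0:
        raise ValueError("need 0 < n_start < n_end and elapsed_years > 0")
    return elapsed_years * math.log(2) / math.log(n_end / n_start)


def project(n_start: float, doubling_years: float, years_ahead: float) -> float:
    """Projected count after years_ahead, given a fixed doubling time."""
    return n_start * 2 ** (years_ahead / doubling_years)


# Hypothetical counts, for illustration only.
t_double = doubling_time(n_start=1_000_000, n_end=4_000_000, elapsed_years=3)
print(t_double, project(4_000_000, t_double, years_ahead=3))
```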
16Challenges of Large Databases
- Storage
- Indexing, physical layout, memory management
- Modeling
- Relational, hierarchical, semi-structured
- Efficiency
- Update, query, analysis
- Interpretation
- Visualization
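As one small illustration of the indexing challenge, a k-mer index maps every length-k substring to the places where it occurs, so exact-match lookups do not need to scan every sequence. This is a minimal in-memory sketch, assuming the data fit in memory; it is not how any particular sequence database is implemented, and the names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(sequences: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map each k-mer to a list of (sequence name, 0-based start position)."""
    if k <= 0:
        raise ValueError("k must be positive")
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for name, seq in sequences.items():
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            index[seq[start:start + k]].append((name, start))
    return dict(index)


# Toy sequences, for illustration only.
toy = {"s1": "ACGTAC", "s2": "TTACG"}
index = build_kmer_index(toy, k=3)
print(index.get("ACG", []))
```

The trade-off is memory: the index stores one entry per sequence position, which is exactly the kind of storage and layout question the slide raises.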
17Problems in Bioinformatics
- Consider just sequence analysis
- Sequence alignment
- Gene discovery
- Promoter discovery
- Intron splice sites
- Protein and RNA structure prediction
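Sequence alignment, the first problem listed, is commonly introduced with dynamic programming. The sketch below computes a global alignment score in the style of Needleman-Wunsch with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not values given in the course.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b with a linear gap penalty."""
    cols = len(b) + 1
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(cols)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two characters, gap in b, gap in a.
            curr[j] = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the table are kept, which is enough for the score; recovering the alignment itself would require keeping the full table or traceback pointers.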
18Applications of Bioinformatics
- VCMAP
- DORR and ASAP
- miRNA
19VCMAP
- Comparative mapping is a strategy that allows
cross-organism study of physiological genomics
- Virtual Comparative Map (VCMap) performs homology
analysis with mathematical predictions to
construct untested (in the wet lab)
cross-organism maps between human, rat, mouse and
zebrafish
- This application provides a highly modular
investigative environment for the
- Analysis of multiple organisms including
Zebrafish
- Collection of genetic and radiation hybrid maps
- Prediction of Genes based on homology
20VCMAP
- Homology analysis was based on sequence
similarity (Altschul et al., 1990) and curated
homologous genes.
- 85% similarity over a 100 bp stretch across all
species was used to create the maps
- NCBI's UniGene sequence sets, RH and Genetic
maps were chosen to create anchor objects
(Kwitek-Black et al., 2001).
- 1-to-1 homologous objects were used for building
the virtual comparative maps with a pipeline
architecture
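The filtering idea described above can be sketched as follows: keep only hits that meet the similarity and length cutoffs, then keep only pairs in which each sequence has exactly one partner. This is a hedged illustration, not the VCMap implementation. Reading the slide's "85" as a percent identity cutoff is an assumption, and the `Hit` record, its field names, and the cluster IDs are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    query: str               # hypothetical ID, e.g. a rat UniGene cluster
    subject: str             # hypothetical ID, e.g. a human UniGene cluster
    percent_identity: float
    alignment_length: int    # aligned length in bp


def one_to_one_pairs(hits: list[Hit], min_identity: float = 85.0,
                     min_length: int = 100) -> list[tuple[str, str]]:
    """Pairs passing both cutoffs where query and subject each occur exactly once."""
    passing = {(h.query, h.subject) for h in hits
               if h.percent_identity >= min_identity
               and h.alignment_length >= min_length}
    query_counts = Counter(q for q, _ in passing)
    subject_counts = Counter(s for _, s in passing)
    return sorted(p for p in passing
                  if query_counts[p[0]] == 1 and subject_counts[p[1]] == 1)


# Made-up hits, for illustration only.
hits = [
    Hit("Rn.1", "Hs.1", 92.0, 150),   # passes, unique on both sides
    Hit("Rn.2", "Hs.2", 90.0, 120),   # passes, but Rn.2 has two partners
    Hit("Rn.2", "Hs.3", 88.0, 110),
    Hit("Rn.3", "Hs.4", 80.0, 300),   # identity too low
    Hit("Rn.4", "Hs.5", 95.0, 60),    # alignment too short
]
print(one_to_one_pairs(hits))
```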
21VCMAP
(Pipeline diagram; box labels as extracted)
- Download UniGene data from NCBI
- Mask UniGene sequences
- Load UniGene data to DB
- DB
- Format masked sequences
- Blast
- Map Data
- Search UniGene
- VC Maps Building
- Anchor Report
- Generate anchor report
- Create Homolog UniGene Object and Scoring
- 1-to-1 Objects
22VCMAP
23Disease Oriented Research Resource
- The major goal of the RGD Disease Oriented
Research Resource is to create collaborative
relationships between RGD and 20 disease-specific
rat research communities, in order to identify,
collect, and integrate disease-specific data and
information, down to specific genes of interest,
into RGD Disease portals.
Specific Goals for RGD
- Prioritize data for curation and addition to RGD
based on targeted disease areas
- Effectively combine automated and manual data
acquisition and curation methods
- Provide a way to integrate Rat Genome Sequencing
Project results with RGD activities
- Help RGD incorporate tools developed in BRC to
add focus of data mining and analysis to
traditional curation and database functions
24DORR Workflow
(Workflow diagram; labels as extracted)
- Genes
- Curation
- Strains
- ASAP
- QTLs
- VCMap
- Biomedical Literature
- Identify Disease
- GROIs
- Microarray
- In Development
- Pathways
- Phenotypes
25Data Acquisition
26Automated Sequence Analysis Pipeline
(Pipeline diagram; box labels as extracted)
- Markers
- seqs mapped in ROI
- Seqs predicted in ROI
- Seq (Fasta/Trace)
- VCMap
- input sequences
- RepeatMask
- Search extra genomic sequence (HTGS, TraceDB, nr)
- RepeatMask
- Assembly Seq
- Assembled Seq
- Unassembled Seq
- MetaGene
- UniGene Search
- ePCR
- RepeatMask
- Homolog Search
- RepeatMask
- Hot Zone
- BLAST (nrpir)
- UniGene Report
- Repeat Report
- Marker Report
- Homolog Report
- MetaGene Report
- Gene Report
- Visualization (Clickable Image Map)
- Additional Reports
27miRNA Gene and Target Prediction
- microRNA genes (miRNAs) were recently recognized
as a class of functional non-coding genes
- 70 nt precursor which has a hairpin fold
- 20 nt RNA molecule from Dicer cutting the stem
loop
- First identified were lin-4 and let-7
- Developmental role in C. elegans
28miRNA Examples
29miRNA Gene Prediction
- Can we predict where miRNA genes might be?
- MiRscan (Burge Lab)
- Scan C. elegans for 70 nt hairpin folds (structure
prediction)
- Compare with C. briggsae (sequence alignment)
- Score alignments
- 3' and 5' conservation
- Overall conservation
- Size of loop, etc.
- Select sequences with score > threshold
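The scan-and-threshold shape of the procedure above can be sketched with a deliberately crude stand-in for structure prediction: slide a window along the sequence, check how well the 5' end of the window base-pairs with the 3' end read backwards, and keep windows scoring above a threshold. This is not the Burge Lab tool's method and ignores real RNA folding energetics and cross-species conservation. The window size, stem length, threshold, and function names are all assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_pairing_fraction(window: str, stem_len: int) -> float:
    """Fraction of stem positions where the 5' arm pairs with the 3' arm."""
    rna = window.upper().replace("T", "U")
    if stem_len <= 0 or 2 * stem_len > len(rna):
        raise ValueError("stem_len must be positive and fit twice in the window")
    five_prime_arm = rna[:stem_len]
    three_prime_arm = rna[-stem_len:]
    # In a hairpin, the first base of the 5' arm pairs with the last base of the 3' arm.
    paired = sum((x, y) in PAIRS for x, y in zip(five_prime_arm, reversed(three_prime_arm)))
    return paired / stem_len


def candidate_hairpins(genome: str, window: int = 70, stem_len: int = 25,
                       threshold: float = 0.8) -> list[tuple[int, float]]:
    """(start, score) for every window whose pairing fraction exceeds the threshold."""
    hits = []
    for start in range(len(genome) - window + 1):
        score = stem_pairing_fraction(genome[start:start + window], stem_len)
        if score > threshold:
            hits.append((start, score))
    return hits


# Toy 20 nt hairpin: 8 nt stem, 4 nt loop, 8 nt complementary stem.
toy = "GGGGAAAA" + "ACAC" + "TTTTCCCC"
print(candidate_hairpins(toy, window=20, stem_len=8, threshold=0.8))
```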
30miRNA Target Prediction
- lin-4 and let-7 interact with the 3' UTR
- Idea: look for conserved 3' UTR regions which are
complementary to discovered miRNA genes
- Find conserved sequences of 20 nt in length from
3' UTR database
- Align with miRNA genes
- Score alignment
- 3' and 5' matches
- Overall matching
- Select high scoring matches
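The complementarity search above can be sketched as an ungapped scan: take the reverse complement of the miRNA (the UTR sequence that would pair with it perfectly), slide it along the UTR, and count matching positions. This sketch scores only overall matching; the separate 3' and 5' weighting and the conservation filter from the slide are left out. The function names, the short toy miRNA, and the toy UTR are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(seq: str) -> str:
    """Uppercase and convert DNA-style T to U."""
    return seq.upper().replace("T", "U")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA string; unknown bases become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(rna))


def score_target_sites(utr: str, mirna: str, min_score: int) -> list[tuple[int, int]]:
    """(start, matches) for UTR windows complementary to the miRNA, best first."""
    utr_rna = to_rna(utr)
    perfect_site = reverse_complement(to_rna(mirna))
    n = len(perfect_site)
    hits = []
    for start in range(len(utr_rna) - n + 1):
        site = utr_rna[start:start + n]
        matches = sum(x == y for x, y in zip(site, perfect_site))
        if matches >= min_score:
            hits.append((start, matches))
    # Stable sort: ties keep their left-to-right order in the UTR.
    return sorted(hits, key=lambda hit: -hit[1])


# Toy inputs, for illustration only.
print(score_target_sites("AAACUACCUCAGGG", "UGAGGUAG", min_score=8))
```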
31Goals of the Course
- For everyone
- Communication
- For the biologist
- Incorporate bioinformatics into research
- Understand computational modeling
- For the computational scientist
- Develop tools for biological research
- Create new algorithms for mining biological data
- Understand how to find biologically meaningful
information
32Summary
- Bioinformatics is truly interdisciplinary
- Biology (natural sciences), informatics,
mathematics, statistics
- Databases
- Large, semistructured, incomplete, inaccurate
- Wide range of problems
- Solutions employ knowledge from sciences with
algorithms and models from informatics,
mathematics, and statistics
33Biological Data
- DNA and Protein Sequences are annotated
- Source
- Organism
- Function
- Updates
- Etc.
34Classic examples
- Sequence alignment
- Multiple sequence alignment
- Examples from Setubal/Meidanis (1997)