Title: BIIN200: Bioinformatics I
1BIIN200 Bioinformatics I
- Introduction
- Craig A. Struble, Ph.D.
- Department of Mathematics, Statistics, and Computer Science
- Marquette University
- Norie Dela Cruz, Ph.D.
- Rat Genome Database
- Medical College of Wisconsin
2Overview
- Introduction to Bioinformatics
- Syllabus
- Student Introductions
3What Is Bioinformatics?
- Bioinformatics is a new subject of genetic data collection, analysis and dissemination to the research community. Hwa A. Lim (1987)
- Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. NIH working definition (2000)
4What is Bioinformatics?
[Diagram: Bioinformatics drawing on three areas: informatics (computer science, computer engineering, information science); biology and other natural sciences; mathematics and statistics]
5Bioinformatics Related Fields
- Computational biology
- Computational molecular biology
- Biomolecular informatics
- Computational genomics
6Biological Data
- Genomes
- DNA Sequences of A, T, C, G
- Annotated with function, interesting features
- Proteins
- Amino Acid Sequences
- Sequences of 20 letters
- Annotated with structure, function, etc.
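The two alphabets above can be made concrete with a small check. This is a minimal sketch: the names `DNA_ALPHABET`, `PROTEIN_ALPHABET`, and `is_valid` are illustrative and not part of the course material, and the 20 letters are the standard one-letter amino acid codes.

```python
# Illustrative alphabets; the names are assumptions made for this sketch.
DNA_ALPHABET = set("ACGT")
# One-letter codes for the 20 standard amino acids.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(sequence, alphabet):
    """Return True if every character of the sequence is in the alphabet."""
    return all(ch in alphabet for ch in sequence.upper())


print(is_valid("ggcacattct", DNA_ALPHABET))      # a DNA fragment
print(is_valid("HEAGAWGHEE", PROTEIN_ALPHABET))  # a protein fragment
print(is_valid("HEAGAWGHEE", DNA_ALPHABET))      # H, E and W are not DNA letters
```

Real sequence files often also contain ambiguity codes (for example N for an unknown base), which this sketch would reject.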
7Biological Data
- Gene Expression
- Dynamic behavior of genes
- Protein Expression
- Dynamic behavior of proteins
- Structural Features
- RNA and proteins
8Biological Data
Sus scrofa agouti-related protein gene
   1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac
  61 tgctggcaat gcccaccatg ctgggggccc agataggctt ggcccccctg gagggtatcg
 121 gaaggcttga ccaagccttg ttcccagaac tccaaggtca gtgcgggcag gagtgggttg
 181 ggtggggctt ggacatcctc tggccacaaa gtattctgct tgtatgagcc ctttcttccc
 241 cttcccaatc ccaggcctgg gaggtgggtg ttttgtgcat gggtggttct gccctcacat
 301 catctgtccc agatctaggc ctgcagcccc cactgaagag gacaactgca gaacgggcag
 361 aagaggctct gctgcagcag gccgaggcca aggccttggc agaggtaaca gctcagggaa
 421 agggctgagg ccacaagtct tgagtgggtg tgtcaagcat caacctctat ctgtgcttgg
 481 agttgccact gtggtacaac gggattggcg gtgtcttggg agcgctggga cgtggtttca
 541 tccccggcca gcacaagtgg gttaaggatc tggccttgcc atcccttcag cttaggctga
 601 gactgtggct tggagctgat ctctgaccgg aagctccata tgctctgggg tgaccaaaaa
 661 tggaaaaaca aacatacaaa acacctctac ctgcacttcc tgaccccctc acccggggcg
 721 acactgcaga ccatcccgtt cacgctccac ttccatcctg ccttgatctg gcgcattcca
 781 tgaatgtgct tttggaagtc cttgtttccc aacccttgta ggtgctagat cctgaaggac
 841 gcaaggcacg ctccccacgt cgctgcgtaa ggctgcacga atcctgtctg ggacaccagg
 901 taccatgctg cgacccatgt gctacatgct actgccgttt cttcaacgcc ttctgctact
 961 gccgcaagct gggtactgcc acgaacccct gcagccgcac ctagctggcc agccaatgtc
1021 gtcg
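A numbered listing like the one above has to be stripped of its position numbers and spacing before it can be analyzed. The following is a minimal sketch of that step; the function name `parse_numbered_sequence` is hypothetical, and only the first two rows of the listing are used as sample input.

```python
import re
from collections import Counter


def parse_numbered_sequence(text):
    """Strip position numbers, separators and whitespace; return the bases."""
    # Keep only runs of letters, so digits, hyphens and spaces are dropped.
    return "".join(re.findall(r"[A-Za-z]+", text)).lower()


# First two rows of the listing, used as a small sample.
sample = """
 1 ggcacattct cctgttgagc caggctatgc tgaccacaat gttgctgagc tgtgccctac
61 tgctggcaat gcccaccatg ctgggggccc agataggctt ggcccccctg gagggtatcg
"""

sequence = parse_numbered_sequence(sample)
print(len(sequence))      # number of bases in the sample
print(Counter(sequence))  # base composition of the sample
```

Because the function keeps every letter, it assumes the input contains only the sequence rows, not a title or annotation text.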
9Genome Sizes
10Database Growth
11Database Growth
12Database Growth
13Database Growth
- Exponential growth in sequence data
- Not much growth in sequence size
- Expect exponential growth in annotation information
- We have lots of data, but it's difficult to make sense of it.
14Fundamental Problems in Bioinformatics
- Pairwise Sequence Alignment
- Multiple Sequence Alignment
- Phylogenetic Analysis
- Sequence Based Database Searches
- Gene Prediction
- Structure Prediction (RNA and Protein)
- Protein Classification
- Gene Expression
- ...
15Pairwise Sequence Alignment
- Given two DNA or AA sequences, find the best way to line them up
- Biology allows for variation
- Gaps, mismatches, etc.
Sequences:
HEAGAWGHEE
PAWHEAE
Two possible alignments:
HEAGAWGHE-E
P-A--W-HEAE

HEAGAWGHE-E
--P-AW-HEAE
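One common way to define "the best way to line them up" is global alignment by dynamic programming (the Needleman-Wunsch algorithm). The sketch below is illustrative only: the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for simplicity, whereas protein alignment in practice usually uses a substitution matrix, so the result need not match either alignment shown on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, a_aln, b_aln)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a_aln, b_aln = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a_aln.append(a[i - 1])
            b_aln.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a_aln.append(a[i - 1])
            b_aln.append("-")
            i -= 1
        else:
            a_aln.append("-")
            b_aln.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(a_aln)), "".join(reversed(b_aln))


best, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(best)
print(top)
print(bottom)
```

Changing the scoring parameters changes which alignment is considered best, which is one reason the choice of scoring scheme matters biologically.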
16Multiple Sequence Alignment
- Extend pairwise problem to multiple sequences
17Phylogenetic Analysis
- Study relationships between organisms
- Characteristic similarity
- Sequence similarity
- Whole genome comparison
18Phylogenetic Analysis
19Sequence Based Database Searches
- Keyword
- Find all sequences named cytochrome c
- Sequence
- Find all sequences similar to HEAGAWGHEE
- Remember, there are gigabytes to search, and I'm not about to wait two days for an answer!
- BLAST, FASTA, ...
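Fast similarity search tools avoid aligning the query against every database sequence in full. One idea behind that speed is to index short words (k-mers) and only look closely at sequences that share words with the query. The sketch below illustrates only that indexing idea and is not how BLAST or FASTA are actually implemented; the function names, the toy database, and its third sequence are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every length-k word to the set of sequence names that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def search(index, query, k):
    """Count shared k-mers between the query and each indexed sequence."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    # Most shared words first; these are candidates for a full alignment.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Toy database; the names and the third sequence are hypothetical.
database = {
    "seq1": "HEAGAWGHEE",
    "seq2": "PAWHEAE",
    "seq3": "MKVLAAGIVG",
}
index = build_kmer_index(database, k=3)
print(search(index, "HEAGAWGHEE", k=3))
```

The index is built once and reused for every query, so the cost of a search depends mostly on the query length rather than on the size of the database.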
20Gene Prediction
- Does the following sequence contain a gene?
- How many introns? Exons? Promoters? Other
features?
TTGTAATCTCCTCTGTGACTATAATGACTAGTCTCAGGCCTGCCTTCCCCAGAAACCTCTCTTTTGGCTATTTCTCTTTC
TAGTTCTCTGTTTAAACAAAATTTATTCTATATATCTATCTATCTGTCTATCTATCTATCTATCTATCTATCTATCTATC
TATCTATCTATCTATCATCTACTTATCATCTGTCTAGCCATTTGAAGCATCTTTGTGTTTTAGGTCCTGTTAGATTCTCC
TTTCAGCCAGTGGAGGATCTGGACAGAGCTATTTCTTAGCTTCCCCTAAGCCATGTTGTTAGAACGAATCCCCCACACCT
CCTCTGAGTGCTACGTCTCCGTCAAGAATTATGTATGTGGGATCCAGATGGCCCAGTGGATAAAACTGCAAGTGTCATGA
CCATGACCTGACTTCAAGGGATTGTGTAGAAAGGGAGTTATCACAGTGTGAGGGACAGGGCTAAGGACACTAACCCGTAT
GTTGAGGGGCACAGACGCTAGCAACAACAGTGAAGTGTTTAAAAAGGCAAAAATCATGTTTCTAGAAGTCAGGAAGAGCC
TAACTTGTGGACAAGGACCAACAGGCAGCAGTTGTAATGGGGCAGGGCAGAGGGAGAGCGGACACGCAGCTTTTGGCATC
AAACACACCCAGAGTGTGGATAGAGAGTAGGGAAATACTCTAGTCTCTGGCTAGGATACTCCCCTCTCTTTTTGACATTT
CTCATTGGCAGCCCCAAGTGGTCACTGGAGAGCCAGGAAGCCTAAAGGACACAGTTAGTAGCAGCCAGCTCCTTTGGTGG
AATTTTGGGGACATGGTGGGGTGACTTGGCTCTATCCAGGCCAGGGCTGGGTGTGAGTATACACTTAGTGACTGGCCTTC
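A first, very rough way to ask whether a sequence might contain a gene is to scan for open reading frames (ORFs): stretches that begin with a start codon and run in frame to a stop codon. The sketch below is a naive illustration under strong assumptions: it looks only at the forward strand, ignores introns, promoters, and other features, and uses a hypothetical `min_codons` cutoff. Real gene predictors rely on statistical models and additional evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Yield (start, end, frame) for ATG...stop stretches on the forward strand."""
    dna = "".join(dna.split()).upper()  # tolerate spaces and line breaks
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                # Count codons from ATG up to, but not including, the stop.
                if (pos - start) // 3 >= min_codons:
                    yield start, pos + 3, frame
                start = None


# Small made-up example: ATG, three GCT codons, then a TAA stop.
toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
print(list(find_orfs(toy, min_codons=2)))
```

Finding an ORF does not show that a gene is present, and in eukaryotic sequence a gene split by introns may not appear as a single long ORF at all.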
21Gene Prediction
22Structure Prediction (RNA, Protein)
- From sequence, predict 2D and 3D structures.
23Protein Classification
- From sequence, identify characteristics of a protein
- Active sites
- Families (e.g. globin)
- Blocks
- Domains
- Folds
- Motifs
- Etc.
24Gene Expression
- Study of gene activity under experimental conditions
- Large-scale studies with microarrays
25Bioinformatics-Based Application: VCMAP
- Comparative mapping is a strategy that allows cross-organism study of physiological genomics
- Virtual Comparative Map (VCMap) performs homology analysis with mathematical predictions to construct cross-organism maps between human, rat, mouse and zebrafish that are untested in the wet lab
- This application provides a highly modular investigative environment for:
- Analysis of multiple organisms including zebrafish
- Collection of genetic and radiation hybrid maps
- Prediction of genes based on homology
26VCMAP
- Homology analysis was based on sequence similarity (Altschul et al., 1990) and curated homologous genes.
- 85% similarity over a 100 bp stretch across all species was used to create the maps
- NCBI's UniGene sequence sets, RH and genetic maps were chosen to create anchor objects (Kwitek-Black et al., 2001).
- 1-to-1 homologous objects were used for building the virtual comparative maps with a pipeline architecture
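The filtering described above can be sketched as follows. This is not the VCMap implementation: the record layout, identifiers, and function name are hypothetical, and reading the threshold as "at least 85% identity over at least 100 aligned bp" is an assumption about what the slide means.

```python
from collections import Counter

# Hypothetical hit records: (query_id, subject_id, percent_identity, aligned_bp).
hits = [
    ("rat_A", "human_X", 91.0, 240),
    ("rat_A", "human_Y", 86.5, 80),   # aligned stretch too short
    ("rat_B", "human_Z", 79.0, 300),  # identity too low
    ("rat_C", "human_W", 88.0, 150),
    ("rat_D", "human_W", 90.0, 120),  # human_W is matched twice: not 1-to-1
]


def one_to_one_pairs(hits, min_identity=85.0, min_length=100):
    """Keep hits passing both thresholds whose query and subject are unique."""
    passing = [h for h in hits if h[2] >= min_identity and h[3] >= min_length]
    query_counts = Counter(h[0] for h in passing)
    subject_counts = Counter(h[1] for h in passing)
    return [
        (q, s) for q, s, _, _ in passing
        if query_counts[q] == 1 and subject_counts[s] == 1
    ]


print(one_to_one_pairs(hits))
```

Dropping every object that matches more than one partner is the simplest way to enforce a 1-to-1 relationship; a real pipeline might instead score the candidates and keep the best one.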
27VCMAP
[Pipeline diagram; box labels in extraction order]
- Download UniGene data from NCBI
- Mask UniGene sequences
- Load UniGene data to DB
- DB
- Format masked sequences
- BLAST
- Map Data
- Search UniGene
- VC Maps Building
- Anchor Report
- Generate anchor report
- Create Homolog UniGene Object and Scoring
- 1-to-1 Objects
28VCMAP
29Different perspectives on Bioinformatics
- Bioinformatics is a tool
- Biologists, biochemists, medical professionals, etc.
- Obtain meaningful and understandable results
- Bioinformatics is a discipline
- Informaticians, mathematicians, statisticians, etc.
- Generate meaningful and understandable results
30Goals of the Course
- Communication between biologists and computational scientists
- Access, retrieve, and analyze bioinformatic data
- Know fundamental problems in bioinformatics
- Use standard bioinformatic tools to answer biological questions
- Understand theories used to build the tools
- Critically assess solutions to bioinformatic problems
31How Are We Going To Get There?
- References
- Cynthia Gibas and Per Jambeck, Developing Bioinformatics Computer Skills, O'Reilly Publishers, 2001, ISBN 1-56592-644-1.
- David W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory, 2001, ISBN 0879696087.
32How Are We Going To Get There?
- Lab Assignments
- Nine (9) assignments covering the major topics
- Maintain a lab notebook, collected 3 times for review
- Bistro Lab
- 368 Cudahy Hall
- Windows workstations, Sun server
- Variety of software
- Lab orientation (when should we have it?)
33How Are We Going To Get There?
- Lab Web Page
- http://bistro.mscs.mu.edu
- For a 70% grade
- Post 3 stories with commentary
- Post 3 links to bioinformatic tools, properly categorized
- Post 5 comments on others' stories
- More posts, writing plug-ins, writing lab HOWTOs, etc. will increase grade
34How Are We Going To Get There?
- Exams
- Midterm
- Final
- Intangibles
- Discussion with instructors
- Being engaged in the class
- Suggestions about the lab
35Grading
36Who We Are
- Craig A. Struble, Ph.D.
- Ph.D. in Computer Science, 2000 from Va. Tech
- 3rd year at Marquette
- Interests: Microarray data analysis, medical literature mining, miRNA, ...
- Norie Dela Cruz, Ph.D.
- Rat Genome Database
37Who Are You?
- Name
- Where are you from?
- Background
- Why bioinformatics?
38Summary
- Bioinformatics is truly interdisciplinary
- Biology (natural sciences), informatics, mathematics, statistics
- Databases
- Large, semistructured, incomplete, inaccurate
- Wide range of problems
- Solutions employ knowledge from sciences with
algorithms and models from informatics,
mathematics, and statistics