Title: Introductory Biological Sequence Analysis Through Spreadsheets
1 Introductory Biological Sequence Analysis Through Spreadsheets
- Stephen J. Merrill
- Sandra E. Merrill
- Marquette University
- Milwaukee, WI
2 Teaching Mathematics to Students of Biology
- Need to make the math in the courses correlate with the math that is needed in the discipline
- The most important math needed is statistics
- The molecular biology revolution presents data in a form on which calculus has little impact (sequences of letters)
3 The Nature of Biological Sequence Data
- The primary structures of DNA, RNA, and proteins are sequences of letters -- 4 letters in the case of DNA (ATGC) and RNA (AUGC), and 20 letters representing the sequence of amino acids that makes up a protein
- Secondary and tertiary structures (bending, folding, and twisting) determine function -- hints of them can be seen in the primary structure
4 Use of Spreadsheets in this setting
- Commonly found and used in biological labs for data acquisition, storage and organization, and data analysis
- Commonly present on student computers and computer labs
- Unlike calculators -- able to handle data sets typical of real world applications
- R.F. Murphy at CMU has developed a set of worksheets for sequence analysis
5 Meaningful Questions and Problems
- 1. Measuring the similarity between two strings -- alignment or homology
- 2. Finding instances of a pattern in a string
- 3. Describing the composition and properties of a string
- 4. Graphing the evolutionary process and construction of phylogenetic trees
6 Measuring the Similarity between Strings
- Given a gene, suggest the function of the protein it codes for by finding a similar sequence (possibly in another species)
- Simple homology assigns a 1 for agreement and a 0 for disagreement at each site, then sums over all sites
- Homology is the fraction of the highest possible score, in
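The scoring rule on this slide can be sketched outside a spreadsheet as follows. This is a minimal illustration, not the worksheet itself: the function name `simple_homology` is hypothetical, the sequences are assumed to be already aligned and of equal length (no gaps), and the result is reported as a fraction between 0 and 1.

```python
def simple_homology(seq_a: str, seq_b: str) -> float:
    """Fraction of sites at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length (gaps are not handled)")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Score 1 for agreement and 0 for disagreement at each site, then sum.
    score = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    # The highest possible score is one point per site.
    return score / len(seq_a)
```

For example, `simple_homology("ATGCATGC", "ATGAATGG")` compares eight sites, six of which agree, and returns 0.75.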
7 Spreadsheet 1: Simple Homology
8 Spreadsheet 1 (cont.): comparing random sequences
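Comparing random sequences gives a baseline for judging a homology score. The sketch below is an illustration under stated assumptions, not a reproduction of the spreadsheet: letters are drawn independently and uniformly from ATGC, and the helper names `random_dna` and `mean_random_homology` are hypothetical. Under those assumptions two random sequences agree at about one site in four.

```python
import random


def random_dna(length: int, rng: random.Random) -> str:
    """Random DNA string with each letter drawn uniformly from ATGC (an assumption)."""
    return "".join(rng.choice("ATGC") for _ in range(length))


def mean_random_homology(length: int, trials: int, seed: int = 0) -> float:
    """Average simple homology over many pairs of independent random sequences."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    total = 0.0
    for _ in range(trials):
        a = random_dna(length, rng)
        b = random_dna(length, rng)
        matches = sum(1 for x, y in zip(a, b) if x == y)
        total += matches / length
    return total / trials
```

A call such as `mean_random_homology(200, 500)` should land close to 0.25, so a homology score well above that level is what suggests a real relationship between two DNA sequences.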
9 Finding Instances of a Particular Pattern in a String
- Locating genes involves finding regions of a DNA sequence that contain patterns resembling those of known genes
- Identifying sites on DNA where a restriction enzyme can cleave it -- also of interest is the size of the fragments that result
- Identifying regions of RNA that correspond to particular features (e.g. loops) which may be splice sites
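The restriction-site question can be sketched as an exact string search followed by a subtraction of cut positions. The function names and the `cut_offset` parameter are hypothetical, the molecule is assumed to be linear, and only the given strand is searched. EcoRI, which recognizes GAATTC and cuts between the G and the first A, serves as the example enzyme.

```python
def find_sites(dna: str, pattern: str) -> list[int]:
    """0-based start positions of every occurrence of pattern, overlaps included."""
    positions = []
    start = dna.find(pattern)
    while start != -1:
        positions.append(start)
        start = dna.find(pattern, start + 1)
    return positions


def fragment_lengths(dna: str, pattern: str, cut_offset: int) -> list[int]:
    """Lengths of the pieces of a linear molecule after cutting at every site.

    cut_offset is the number of letters of the pattern left of the cut
    (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = [p + cut_offset for p in find_sites(dna, pattern)]
    bounds = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(bounds, bounds[1:])]
```

With `dna = "AA" + "GAATTC" + "CCGG" + "GAATTC" + "TT"`, the sites are found at positions 2 and 12, and `fragment_lengths(dna, "GAATTC", 1)` gives fragments of 3, 10, and 7 letters.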
10 Describing the Composition and Properties of a String
- Counts or frequencies of particular letters, of interest because of their properties (e.g. regions rich in GC or AT in DNA)
- Properties of proteins (e.g. charge or hydrophobicity) that depend on the nature and frequencies of the particular amino acids
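Letter counts and GC-rich regions can be sketched as below. The function names and the sliding-window approach are assumptions made for illustration; the slides do not say how the composition is tabulated.

```python
from collections import Counter


def letter_counts(seq: str) -> Counter:
    """Number of times each letter occurs in the sequence."""
    return Counter(seq)


def gc_fraction(dna: str) -> float:
    """Fraction of letters that are G or C."""
    if not dna:
        raise ValueError("sequence must be non-empty")
    return (dna.count("G") + dna.count("C")) / len(dna)


def windowed_gc(dna: str, window: int) -> list[float]:
    """GC fraction in each window of the given width, sliding one site at a time."""
    return [gc_fraction(dna[i:i + window]) for i in range(len(dna) - window + 1)]
```

For `"GGCCAATT"` with a window of 4, the profile falls from 1.0 to 0.0 as the window moves from the GC-rich end to the AT-rich end.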
11 Spreadsheet 2: Hydropathy Plot
12 Spreadsheet 2 (cont.)
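A hydropathy plot is commonly built by averaging a per-residue hydrophobicity value over a sliding window. The sketch below assumes the Kyte-Doolittle scale and a default window of 7 residues; the slides do not state which scale or window the spreadsheet uses, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
# Using this particular scale is an assumption, not something the slides specify.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 7) -> list[float]:
    """Mean hydropathy in each window, sliding one residue at a time."""
    if window < 1:
        raise ValueError("window must be at least 1")
    values = [KYTE_DOOLITTLE[aa] for aa in protein]  # KeyError on unknown letters
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Plotting the returned list against window position gives the hydropathy plot; stretches with high positive averages are candidates for hydrophobic regions.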
13 Graphing Evolution and Phylogenetic Trees
- Evolutionary distance between two DNA sequences is used to determine the process of the changes in the sequences over time (e.g. the evolution of HIV or the flu viruses)
- Trees are constructed to express the relationship between related sequences -- distance in the tree is a monotone function of homology
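One simple monotone function of homology is distance = 1 - homology, the fraction of sites that differ. The sketch below uses that choice as an assumption (the slides do not name the function used), builds a table of pairwise distances, and picks the closest pair, which is the usual first step when joining sequences into a tree. It does not build the tree itself, and all names are hypothetical.

```python
from itertools import combinations


def distance(seq_a: str, seq_b: str) -> float:
    """1 - simple homology: the fraction of sites at which the sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 1 - matches / len(seq_a)


def distance_table(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Distance for every unordered pair of named sequences."""
    return {
        (name_a, name_b): distance(seqs[name_a], seqs[name_b])
        for name_a, name_b in combinations(seqs, 2)
    }


def closest_pair(seqs: dict[str, str]) -> tuple[str, str]:
    """The pair of names with the smallest distance."""
    table = distance_table(seqs)
    return min(table, key=table.get)
```

For three sequences `a`, `b`, and `c`, the closest pair would be joined first, and the remaining sequence attached at a greater distance.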
14 Spreadsheet 3: Mutation Evolution
15 Spreadsheet 3 (cont.)
To study the evolution of a sequence, we randomly pick a site for mutation, then change its letter.
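The mutation step described here can be sketched as a small simulation that also tracks homology to the starting sequence. Two assumptions are made that the slides do not state: the site is chosen uniformly at random, and the new letter is always different from the old one. The function names are hypothetical.

```python
import random


def mutate_once(seq: str, rng: random.Random, alphabet: str = "ATGC") -> str:
    """Pick one site at random and replace its letter with a different one."""
    site = rng.randrange(len(seq))
    # Assumption: the replacement letter always differs from the current letter.
    choices = [letter for letter in alphabet if letter != seq[site]]
    return seq[:site] + rng.choice(choices) + seq[site + 1:]


def homology_over_time(seq: str, steps: int, seed: int = 0) -> list[float]:
    """Simple homology to the original sequence after each successive mutation."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    current = seq
    history = []
    for _ in range(steps):
        current = mutate_once(current, rng)
        matches = sum(1 for a, b in zip(seq, current) if a == b)
        history.append(matches / len(seq))
    return history
```

Plotting the returned list against the step number shows homology to the ancestor drifting downward as mutations accumulate, with occasional increases when a site mutates back to its original letter.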
16 Conclusion
- Use of a spreadsheet makes possible an experimental approach to introducing the mathematics of sequence analysis
- The use of spreadsheets makes possible the use of real-world data and presents the computational tool in a meaningful context
- The importance of the topics to all educated individuals suggests that the topics be included in many liberal arts math courses