Title: Methodologies and computation: tools of comparative genomics
1Methodologies and computation tools of
comparative genomics
2Contents
- Databases
- Tools for using databases
- Sequence alignment
- Use of ESTs
- Finding genes
- Gene function
- Gene expression
- Microarrays
- Association studies
3The first step
Usually the first thing a researcher would like
to do with a sequence or group of sequences that
have been generated is compare them to other
sequences and look for similarities or
differences. This can lead to information or
inferences on evolution, gene function, gene
origin, etc. This is possible because
millions of sequences are now available for
hundreds of organisms. These are usually publicly
available, stored in databases.
4Databases
A database is simply a collection of data,
information of any kind. Even an Excel file can
be a database, but they can be far more complex.
Databases used for genomics can include many
types of data, such as EST sequences, genetic or
physical map information, phenotype, function or
annotation data, and even images.
A relational database contains cross-references
to the various types of data in the database (e.g.
sequence names, library sources, etc.) which can
be cross-queried. For example, one might know
only one piece of information but still be able
to retrieve all the data related to that
query. This is an example of the schema used to
develop a relational database of club member
information.
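As a minimal sketch of what cross-querying means, the example below uses Python's built-in sqlite3 module. The table names, column names and records are all hypothetical, invented only for illustration; real genomics databases use far larger schemas.

```python
import sqlite3

# Hypothetical schema and records, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE library (lib_id INTEGER PRIMARY KEY, tissue TEXT, organism TEXT);
CREATE TABLE est (est_name TEXT PRIMARY KEY,
                  lib_id INTEGER REFERENCES library(lib_id),
                  sequence TEXT);
INSERT INTO library VALUES (1, 'leaf', 'tomato');
INSERT INTO library VALUES (2, 'root', 'potato');
INSERT INTO est VALUES ('EST_A', 1, 'AGAATCCAAG');
INSERT INTO est VALUES ('EST_B', 1, 'TTGGCCAAAT');
INSERT INTO est VALUES ('EST_C', 2, 'GCCGTAACTG');
""")


def ests_from_tissue(tissue):
    """Return (est_name, organism) pairs for every EST from the given tissue."""
    rows = conn.execute(
        "SELECT est.est_name, library.organism "
        "FROM est JOIN library ON est.lib_id = library.lib_id "
        "WHERE library.tissue = ? ORDER BY est.est_name",
        (tissue,),
    )
    return rows.fetchall()


# Knowing only the tissue, we retrieve the related sequence names and organism.
print(ests_from_tissue("leaf"))
```

The cross-reference here is the lib_id column shared by the two tables, which is what lets one known fact (the tissue) lead to the other data.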
5Purpose of databases
The purpose of a database may be simply to store
data, but they may also be designed to
distribute, publicize, or make the data available
to a larger user group.
Databases usually also contain software tools for
searching and comparing the sequence data, and
often statistical tools for analyzing it. This
is a screenshot of some of the tools offered by
the MIPS database (Munich Information Center for
Protein Sequences, http://mips.gsf.de/).
6Example of a database: GenBank
One of the most widely used databases is GenBank,
a genetic sequence database funded by the U.S.
National Institutes of Health and hosted by the
National Center for Biotechnology Information at
http://www.ncbi.nlm.nih.gov/. In addition to
GenBank, NCBI also has a searchable literature
database, a huge amount of other information, and
free tools such as BLAST, that we will look at in
upcoming slides.
7Finding databases
Periodically, the journal Nucleic Acids Research
publishes a special issue on Databases. This
issue is available for free online at
http://nar.oxfordjournals.org/. Note that while
many databases are open access, meaning the
data is available to anyone, some are private,
available only to certain users who are given a
user name and password. Using any online
search engine, such as Google, is very likely to
identify possible databases for any organism you
are interested in.
The January 2007 Database issue of Nucleic Acids
Research
8Using databases
Although so much data is now freely available in
a wide variety of databases, it is often
difficult to figure out how to use the data,
depending on the web interface presented to the
user. Some database websites include tools
(software programs) for using their data, but
often these databases are created specifically
for the laboratory that generated the data (and
collaborators) so it can be very difficult to
find specific data or know how to take advantage
of the data that is available.
Looking for specific information in a database
may be called querying, searching, or database
mining.
9Databases are dynamic and variable
Most databases are dynamic, meaning that they
change frequently, sometimes constantly, as new
data is added. This means that results from
searching a database can vary from day to day (or
more often). Even the statistics resulting from
these searches may change, because the kind of
statistics used takes into account the number of
sequences in the database, which of course
changes every time someone adds new sequences.
Searching different databases with the same
sequence will also give variable results. In
addition, since in many cases anyone can
contribute data, it is difficult to monitor the
quality of the data. People using the data must
be careful to confirm any results independently
or check the source of the data.
In August, 2005, the International Nucleotide
Sequence Database Collaboration, of which GenBank
is a member, announced that the DNA sequence
database had exceeded 100 gigabases (100 billion
bases)!
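To make the dependence of search statistics on database size concrete, here is a small sketch using the textbook relation between a BLAST bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the total length of the database. The query length, bit score and database sizes are hypothetical, and real BLAST implementations also apply length corrections that are ignored here.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-S').

    bit_score is S', query_length is m, and database_length is n
    (the total number of residues in the database).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers: the same 50-bit hit for a 500-base query,
# evaluated against databases of growing size.
for db_size in (1e9, 1e10, 1e11):
    e_value = expected_chance_hits(50, 500, db_size)
    print(f"database of {db_size:.0e} bases: E = {e_value:.2e}")
```

The same alignment becomes less significant as the database grows, which is one reason a search repeated later can report different statistics.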
10Tools for using databases
Because genomics databases contain large amounts
of complex data, special tools are required for
retrieving, comparing, and interpreting the data.
There are many software programs available for
this, often housed together with the data on the
same webpage. The next slides explain some of
these tools.
11Phred and Phrap
Phred is a much-used program that helps turn the
raw data from sequencers (called trace data) into
the strings of bases (Gs, As, Ts, and Cs) that
are easier to store and work with. A related
program is Phrap, which helps assemble short DNA
sequences into longer ones. For more about
these programs see http://www.phrap.com/. However,
these programs take a bit of computer
expertise. Depending on the type of sequencer
that you use, you may have other, more
user-friendly programs available to you from your
institution.
AAACTGATTGAGTTTGAGAATT
12FASTA
FASTA was one of the earliest widely used
database searching tools (Lipman 1985, Pearson
and Lipman 1988). It is still available and in use
but has been gradually replaced by newer
programs. However, its particular format for
defining a sequence and its name has remained the
convention. When you retrieve a sequence or
receive one from a sequencing facility, it will
probably be in FASTA format, which is now the
standard format accepted by researchers and
software tools used in genomics. An example is
shown below.
>sequencename
AGAATCCAAGCATACTCAGTCCAAGATTCTAAAAGA
TTGGCCAAATGGCAGCAGCAGCAGCAGCTTCTACTTCCATGGCGGCTACT
GCCGT
Note that the > symbol is very important, as it
marks the line that holds the name of the
sequence, and in a file containing many sequences
it marks the start of each separate sequence.
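The role of the > symbol can be seen in a minimal FASTA reader. This is a sketch only, not the parser used by any particular tool; the function name is made up, and the second record in the example is hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; the first word is the name.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # Sequence lines are joined until the next '>' line.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = """>sequencename
AGAATCCAAGCATACTCAGTCCAAGATTCTAAAAGA
TTGGCCAAATGGCAGCAGCAGCAGCAGCTTCTACTTCCATGGCGGCTACT
GCCGT
>second_sequence
AAACTGATTGAGTTTGAGAATT
"""

for seq_name, sequence in parse_fasta(example).items():
    print(seq_name, len(sequence))
```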
13Sequence retrieval
To use a database to find more information there
are several possible starting points:
- You may have a DNA sequence of your own (perhaps
you had a PCR product from your research
sequenced).
- You may be interested in a particular sequence,
for example one that you saw in a publication or
an online map.
- You may be interested in sequences that relate to
a specific function, for example drought
tolerance.
Most databases allow you to search in any of
these ways. Below is an example from Gramene
(http://www.gramene.org/).
14Sequence alignment
Once you have a sequence (or set of sequences),
the next step is usually to see if there are any
sequences already in the database similar to
yours, if they have been assigned a function, and
what else, if anything, is known about this
sequence. Essentially, this requires comparing
your sequence(s) against all the other sequences
in the database, which is done by trying to align
your sequence against all the other
sequences. Below is a simplified example of
alignment (taken from Schneider and La Rota
2000). Note that gaps (signified by -) are
introduced by the algorithm to compensate for
insertions or deletions that have occurred in the
sequences.
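How an algorithm decides where to place gaps can be illustrated with a short sketch of Needleman-Wunsch global alignment, a classical dynamic-programming method. This is not the heuristic that BLAST uses, and the scoring values (match, mismatch, gap) are arbitrary assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Align two sequences end to end; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two bases
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


top, bottom, total = global_align("GATTACA", "GATACA")
print(top)
print(bottom)
print("score:", total)
```

A gap (-) appears in the shorter sequence where a base is missing relative to the other, which is how an insertion or deletion shows up in an alignment.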
15BLAST
The Basic Local Alignment Search Tool (BLAST)
program is by far the most widely used program to
look at sequence alignments and similarities
(Altschul et al. 1990). BLAST searches a database
for sequences similar to your sequence (the
query sequence) by using the 2-step approach
shown in the next 2 slides. The basic concept
is that the more similar segments two sequences
share, and the longer those segments are, the
less divergent the sequences are, and therefore
the more closely related (homologous) they are
likely to be. Because of the widespread use of
this algorithm, we will look at it more in the
next few slides. Also helpful are the tutorials
available from NCBI at
http://www.ncbi.nlm.nih.gov/Education/.
16How BLAST works
1. BLAST first searches the sequences in the
database (target sequences) for short regions of
a given length (W), called words (or substrings),
that score at least T when compared to the query
sequence using a substitution matrix.
2. For every pair of sequences (query and
target) that have a word or words in common,
BLAST extends the alignment in both directions to
find alignments that score greater (are more
similar) than a certain score threshold (S).
These alignments are called high scoring pairs,
or HSPs; the maximal scoring HSPs are called
maximal segment pairs, or MSPs.
Note that this is a vastly simplified explanation
of what BLAST does. For a more detailed look at
the statistical algorithms involved see, for
example, Saccone and Pesole (2003). The next
slide shows a graphical picture of this process.
17How BLAST works - pictorial
Query Sequence
words (subsequences of the query sequence)
Query words are compared to the database (target
sequences) and exact matches identified
For each word match, alignment is extended in
both directions to find alignments that score
greater than some threshold (maximal segment
pairs, or MSPs)
(Schneider and La Rota 2000)
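The two steps pictured above can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST itself: it uses exact word matches only (no threshold T or substitution matrix), ungapped extension, and arbitrary assumed scores. The function names and the drop-off rule (stop extending once the score falls x_drop below the best seen) are choices made for this sketch.

```python
def find_seeds(query, target, w):
    """Return (query_pos, target_pos) for every exact shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(target) - w + 1):
        for i in index.get(target[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, target, qpos, tpos, w, match=1, mismatch=-2, x_drop=4):
    """Extend a word match in both directions without gaps.

    Returns (score, query_segment, target_segment) for the best extension.
    """
    best = w * match
    # Extend to the right of the word.
    run, best_right, k = best, 0, 0
    while qpos + w + k < len(query) and tpos + w + k < len(target):
        run += match if query[qpos + w + k] == target[tpos + w + k] else mismatch
        k += 1
        if run > best:
            best, best_right = run, k
        elif best - run >= x_drop:
            break
    # Extend to the left of the word.
    run, best_left, k = best, 0, 0
    while qpos - k - 1 >= 0 and tpos - k - 1 >= 0:
        run += match if query[qpos - k - 1] == target[tpos - k - 1] else mismatch
        k += 1
        if run > best:
            best, best_left = run, k
        elif best - run >= x_drop:
            break
    q_seg = query[qpos - best_left:qpos + w + best_right]
    t_seg = target[tpos - best_left:tpos + w + best_right]
    return best, q_seg, t_seg


def high_scoring_pairs(query, target, w=4, s=8):
    """Return the distinct extended matches scoring at least s, best first."""
    hits = {extend_seed(query, target, i, j, w) for i, j in find_seeds(query, target, w)}
    return sorted((h for h in hits if h[0] >= s), reverse=True)


print(high_scoring_pairs("ACGTACGTTAGC", "TTTACGTACGTTAGCAAA"))
```

Indexing the query words first is what makes the search fast: only positions that share a word are ever extended.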
18Interpreting BLAST output
Here is a sample BLAST result (from BLAST on the
NCBI site, using a tomato sequence):
- The list of hits is organized starting with the
best (most similar).
- E-value: the expected number of chance
alignments; the lower the E-value, the more
significant the score.
- First in the list is the sequence finding itself,
which obviously has the best score.
- To the left is the Accession Number: a unique
code that identifies a sequence in a database (in
this case it is the GenBank number).
Continued next slide
19Interpreting BLAST output, cont'd
It is important to know that there is no set
cut-off that determines whether a match is
considered significant or similar enough; this
must be determined according to the goals of the
project. In this example we see that the second
hit (remember that the first is the sequence
hitting itself) is a different EST sequence, but
has exactly the same significance level, implying
that this may indeed be a redundant (the same)
EST sequence; in any case it is exactly the same
at least for the length of these sequences.
20Interpreting BLAST output, cont'd
Further down the list is an EST from potato
leaves (EST462540) that has an E-value of only
e-11. This might be considered borderline similar.
Any hits below this would in most cases not be
considered significantly similar to the query
sequence.
21Interpreting BLAST output, cont'd
By clicking on the Score (bits) in the right
column (highlighted in blue), we can see the
detailed alignment. Below is the result from
clicking on the score of the second tomato hit,
which shows clear 100% alignment at all possible
bases.
22Interpreting BLAST output, cont'd
Below is the result of clicking on the much less
significant potato hit. It only has a very short
stretch where the sequences can be aligned, and
even then there are still places where the
nucleotides are different between the 2
sequences, thus leading to the different score.
Note that query is the sequence we used to
search the database with (in this case the tomato
sequence) while subject is the sequence in the
database that has been found to be similar.
23BLAST parameters
Many of the parameters of the BLAST algorithm
can be changed if the user so desires, such as
the various thresholds and the type of
substitution matrices used to find matches. For
most purposes, the default settings work fine.
However, if you are interested in finding out
more about the possible changes you can make, see
the tutorials that NCBI makes available at
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
The screenshot at the left shows some of the
parameters that can be changed.
24The BLAST suite
BLAST is really more than just one program. In
fact, there are several versions, depending on
the type of sequences being aligned:
- blastn compares a nucleotide query sequence
against a nucleotide sequence database.
- blastp compares an amino acid query sequence
against a protein sequence database.
- blastx compares the six-frame conceptual
translation products of a nucleotide query
sequence (both strands) against a protein
sequence database. This translation is the simple
conversion of a nucleotide string into six
separate strings of amino acids (one for each
possible reading frame).
- tblastn compares a protein query sequence
against a nucleotide sequence database
dynamically translated in all six reading frames
(both strands).
- tblastx compares the six-frame translation of a
nucleotide query sequence against the six-frame
dynamic translations of a nucleotide sequence
database. As you can imagine, this program is
doing 36 comparisons (6x6) for each comparison
between the query sequence and any of the target
sequences in the database. This will of course
affect the speed of the program, making this one
the slowest of the pack. However, this
simultaneous translation into protein of both the
query (nucleotide) and the target database (also
nucleotide) allows us to find more distantly
related sequences.
(from Schneider and La Rota 2000)
Deciding which to use depends on the type of data
you have (amino acid sequence? DNA sequence?),
how closely related the organisms you are
interested in are, and what you are hoping to
learn from the search.
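The six-frame conceptual translation used by blastx, tblastn and tblastx can be sketched as follows. The sketch assumes the standard genetic code and an unambiguous DNA sequence; the function names are invented for this example, and stop codons are written as *.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict of the six translations, keyed '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```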
25Large scale computational genomics
Of course, many researchers wish to do more than
one database query at one time, and perhaps even
BLAST an entire data set of many sequences at one
time. There are methods of doing this,
sometimes called batch BLASTing, using
additional software programs or scripts
(written, for example, in Perl) that
automatically repeat a BLAST search with one
sequence after the other. This scaling up
usually requires the assistance of a
bioinformatics specialist, a computer scientist
or at least someone with some extra computer
skills. We will not go into these methods in
more detail here, but instead will move on to
other genomics tools.
Perl is a dynamic programming language created by
Larry Wall and first released in 1987
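For readers curious what such a script might look like, here is a rough sketch in Python rather than Perl. It assumes that the NCBI BLAST+ command-line program blastn is installed and that a local nucleotide database has already been built; the database name, file names and E-value cut-off are hypothetical. The option names used (-query, -db, -evalue, -outfmt, -out) are standard blastn options.

```python
import shutil
import subprocess
from pathlib import Path


def blastn_command(query_file, database, out_file, evalue=1e-5):
    """Build the argument list for one blastn run with tabular output."""
    return [
        "blastn",
        "-query", str(query_file),
        "-db", database,
        "-evalue", str(evalue),
        "-outfmt", "6",  # tab-separated, one hit per line
        "-out", str(out_file),
    ]


def batch_blast(query_files, database, out_dir):
    """Run blastn once per query file and return the list of result files."""
    if shutil.which("blastn") is None:
        raise RuntimeError("blastn was not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for query_file in query_files:
        out_file = out_dir / (Path(query_file).stem + ".tsv")
        subprocess.run(blastn_command(query_file, database, out_file), check=True)
        results.append(out_file)
    return results


# Example call (hypothetical file and database names):
# batch_blast(["est_001.fasta", "est_002.fasta"], "my_est_db", "blast_results")
```

Note that blastn also accepts a single FASTA file containing many query sequences, which is often the simpler route.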
26Sampling the genome
Sequencing an entire genome is often not
economically feasible. However, there are methods
to sequence much of the genome, or parts of it. Some of
the alternatives are listed below. These are
extremely technical so we will not go into great
detail in this module, but they are described in
brief in the next few slides, with emphasis given
to ESTs as that is the most popular and
cost-effective method currently in use.
- BAC end sequencing
- Methyl filtration or methyl-restriction libraries
- Cot analysis
- ESTs
"DNA sequencing costs have fallen more than
50-fold over the past decade, fueled in large
part by tools, technologies and process
improvements developed as part of the successful
project to sequence the human genome. However, it
still costs around $10 million to sequence 3
billion base pairs - the amount of DNA found in
the genomes of humans and other mammals." From
NIH news release (2006) NHGRI Aims to Make DNA
Sequencing Faster, More Cost Effective, available
at http://www.genome.gov/19518500
27BAC end sequencing and Methyl-filtration libraries
BAC end sequencing: although BAC clones are too
long to sequence in their entirety, each BAC in a
library can be sequenced from both ends. While
this does generate new sequences, because they
are random, only a fraction of these can be
expected to represent genes.
Methyl filtration or methyl-restriction
libraries: these approaches are based on the
tendency (not always strict) of genes to be less
methylated than non-genic regions. Thus
highly-methylated regions of the genome are
filtered out or selected against using
methylation-sensitive restriction enzymes,
leaving mainly gene-rich regions of the genome to
be sequenced.
For an overview of the effectiveness of various
gene-enrichment techniques used in maize, see
Springer NM, Xu X, Barbazuk WB (2004) Utility of
different gene enrichment approaches toward
identifying and sequencing the maize gene space.
Plant Physiology 136: 3023-3033.
28Use of Cot analysis
Cot analysis: an old technique that is now being
used again in combination with new genomics
techniques to help fractionate the genome into
low copy, moderately repetitive and highly
repetitive sequences. It uses the principles of
DNA renaturation kinetics, where the rate at
which a particular sequence reassociates (returns
to the double-stranded state) is proportional to
the number of times it is found in the genome
(Cot stands for nucleotide concentration times
reassociation time) (Cullis 2004, Peterson 2002).
In this way, highly repetitive sequences can be
eliminated and only single and/or low copy
sequences selected for cloning and sequencing,
aiding in the discovery of previously
unidentified genes. This is particularly helpful
in dealing with large genomes containing a high
amount of repetitive DNA, which would be
difficult and expensive to completely sequence.
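The renaturation kinetics behind Cot analysis can be sketched with the ideal second-order equation, in which the fraction of DNA still single-stranded is 1 / (1 + k * Cot). The rate constants below are hypothetical values chosen only to contrast a repetitive component with a single-copy one; real Cot curves are mixtures of several components.

```python
def fraction_single_stranded(cot, k):
    """Fraction of a component still single-stranded at a given Cot value.

    Ideal second-order kinetics: C / C0 = 1 / (1 + k * C0 * t),
    where cot = C0 * t and k is the reassociation rate constant.
    """
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """Cot value at which half of the component has reassociated."""
    return 1.0 / k


# Hypothetical rate constants: repeats reassociate much faster than
# single-copy sequences because they are present many times in the genome.
K_REPETITIVE = 100.0
K_SINGLE_COPY = 0.01

for cot in (0.01, 1.0, 100.0):
    rep = fraction_single_stranded(cot, K_REPETITIVE)
    single = fraction_single_stranded(cot, K_SINGLE_COPY)
    print(f"Cot {cot:>6}: repetitive {rep:.3f}, single-copy {single:.3f} single-stranded")
```

At an intermediate Cot value most repeats are already double-stranded while single-copy DNA is still largely single-stranded, which is what allows the two fractions to be separated.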
For a nice example of how this was used in
sorghum, see Peterson DG, Schulze SR, Sciara EB,
Lee SA, Bowers JE, Nagel A, Jiang N, Tibbitts
DC, Wessler SR, Paterson AH (2002) Integration of
Cot analysis, DNA cloning, and high-throughput
sequencing facilitates genome characterization
and gene discovery. Genome Research 12(5):
795-807. Also available at
http://www.genome.org/cgi/content/full/12/5/795
Image from USDA: http://www.usda.gov/oc/photo/94cs2601.htm
29ESTs
Expressed Sequence Tags (ESTs) are short (usually
300-500 nucleotides) DNA sequences
generated from one or both ends of clones from
cDNA libraries.
- Because ESTs are generated from cDNA, which was
synthesized from RNA, they are more likely to be
from genes.
- Sequencing only the beginning portion of the cDNA
produces what is called a 5' EST. This part of
the cDNA transcript usually codes for a protein;
therefore, this is the more widely used type.
These regions tend to be conserved across
species.
- Sequencing the end portion of the cDNA molecule
produces what is called a 3' EST. Because these
ESTs are generated from the 3' end of a
transcript, they are more likely to fall within
non-coding, or untranslated regions (UTRs), and
therefore tend to be less conserved across
species (i.e. more divergent).
30EST production pictorial
From National Center for Biotechnology
Information (NCBI). A Science Primer. Available
at http://www.ncbi.nlm.nih.gov/About/primer/index.html
31Use of ESTs
- ESTs are a relatively inexpensive and quick way
to generate a lot of sequence information in
cases where whole genome sequencing is not
possible.
- Since they are generated from cDNA they are more
likely to represent genic regions of the genome.
- ESTs can be used simply as markers and mapped to
a chromosomal location just like any other marker
(see de Vicente and Fulton, 2002 for more on
molecular markers).
- However, they are especially useful in the hunt
for genes, as they greatly reduce the time and
expense in the search.
32Use of ESTs, cont'd
A particular advantage of ESTs is that, if there
is a large enough set of them, overlaps among
them can be used to order them into larger
sequences called contigs, which can often be
the sequences of whole genes.
Overlapping EST sequences can be computationally
ordered into one long sequence, a contig.
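How overlaps can be used to order ESTs into a contig is sketched below with a simple greedy merge: repeatedly join the two sequences with the longest exact end-to-start overlap. This is an illustration under strong assumptions (error-free reads, all on the same strand, a hypothetical minimum overlap of 5 bases) and is not how production assemblers such as Phrap work.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(reads, min_len=5):
    """Greedily merge reads by their longest overlaps; return the contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads wholly contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping fragments cut from one longer sequence.
full = "AGAATCCAAGCATACTCAGTCCAAGATTCTAAAAGA"
ests = [full[20:36], full[0:15], full[10:25]]
print(assemble(ests))
```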
33Use of ESTs, cont'd
Thus, ESTs have many uses, some of which are
listed below:
- Discovering new genes
- Confirming coding regions of genomic sequence
- Studying phylogenetic relationships
- Developing genome maps
- Producing expression arrays
34Available ESTs
Because of their utility, speed with which they
may be generated, and low cost, many individual
scientists as well as large genome sequencing
centers have been generating hundreds of
thousands of ESTs, many of which are made freely
available. NCBI houses a large database of ESTs
called dbEST. As of January 5, 2007 there were
over 40 million entries.
http://www.ncbi.nlm.nih.gov/dbEST/index.html
35Limitations of ESTs
- It is very difficult to isolate mRNA from
some tissues. This results in a deficiency of data
about certain genes that may only be found in
these tissues, or that occur in low-abundance
transcripts.
- Important gene regulatory sequences may be found
within an intron. Because ESTs are small segments
of cDNA generated from an mRNA in which the
introns have been removed, much valuable
information may be lost by focusing only on cDNA
sequencing.
- ESTs may contain errors due to the library
construction process or the sequencing (this can
be overcome by confirming ESTs of interest via
resequencing or contigging).
- As they do not include non-transcribed regions of
the genome, information about gene regulatory
regions (e.g. promoters) will be missing.
- Since ESTs are developed from mRNA, and genes may
be expressed as mRNA many times, there is a lot
of redundancy in groups of ESTs, as well as
overlap among them. This problem can be mitigated
by clustering the ESTs into unigenes.
36Finding genes using gene-finding programs
One of the key goals of genomics research is
finding the genes of an organism, either specific
genes or as many of the genes as possible. There
are several methods of doing this.
If genomic sequence is available, there are a
number of gene-finding software programs that
can identify possible genes, mainly by
identifying open reading frames (by finding start
and stop codons). However, due to differences
among genomes, these programs are not 100%
accurate and may need to be adjusted for specific
species. Some of the programs available are
GENSCAN (from MIT), ORF Finder (from NCBI) and
TigrScan (from TIGR). For reviews of
gene-finding programs and a good explanation of
how they work see Brent MR and Guigo R (2004)
Recent advances in gene structure prediction.
Current Opinion in Structural Biology 14:
264-272; Zhang MQ (2002) Computational prediction
of eukaryotic protein-coding genes. Nat. Rev.
Genet. 3: 698-709.
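The idea of identifying open reading frames by their start and stop codons can be sketched very simply. The version below scans only the forward strand, assumes ATG as the start codon, and uses a hypothetical minimum length; real gene finders also model introns, codon usage and other signals, which this sketch ignores.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each forward-strand open reading frame.

    start and end are 0-based positions with end exclusive; the ORF runs
    from an ATG to the first in-frame stop codon (included). frame is 1-3.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the last stop
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None
    return orfs


sequence = "CCATGGCCAAATAGGG"
for start, end, frame in find_orfs(sequence):
    print(f"frame {frame}: {sequence[start:end]}")
```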
37Finding genes using other known genes
If genome sequence is available, another method
of identifying genes in the sequence is by using
sequences from other species that have already
been identified as genes and using these to
search for corresponding genes in the new
sequence (this is possible due to the high level
of conservation of genes among most organisms).
For example, the known gene sequence could be
compared to the new genomic sequence using BLAST.
Known gene sequence
New genomic sequence
Using BLAST, the known gene sequence finds the
homologous gene in the genomic sequence
38Other methods for finding genes
Other methods for finding genes include:
- Generating ESTs, which, as previously mentioned,
produces large numbers of short sequences that,
as they originate from cDNA, are likely to be
genes.
- Generating full-length cDNA clones. As these are
longer than EST sequences, these libraries are
somewhat more difficult to construct and require
other sequencing strategies.
- Identification of genes by mutagenesis. These
methods identify genes by first identifying that
a particular sequence has a function. This can be
done by:
  - Knock-out methods, where a T-DNA element or
  transposon is inserted into a sequence and the
  effect is noted
  - Targeting Induced Local Lesions in Genomes
  (TILLING), which uses chemical mutagenesis to
  induce point mutations that are identified by
  screens of pooled PCR products
  - RNA interference (RNAi), where all members of a
  particular gene family are silenced (turned
  off) by the use of a specific double-stranded RNA
For more on the overall topic see Cullis (2004);
for insertional mutagenesis see Azpiroz-Leehan
and Feldmann (1997); for TILLING see McCallum
(2000); for RNAi see Tang et al. (2003).
The Nobel Prize in Physiology or Medicine for
2006 was awarded jointly to Andrew Z. Fire and
Craig C. Mello for their discovery of "RNA
interference -- gene silencing by double-stranded
RNA."
39Identifying the function of a gene
Often the same methods used to find genes also
aid in identifying their function. For example,
by knocking out or silencing a gene and noting
the result, we can sometimes infer the function
of that particular gene (if it results in a
change that is noticeable but not lethal to the
organism). An example of where knock-outs are
being used to help identify gene function is
the Arabidopsis 2010 project (see The Arabidopsis
Information Resource, http://www.arabidopsis.org/).
However, this can be complicated, since in most
cases an organism has evolved to have duplicate
copies of genes, such that knocking out or
silencing one does not necessarily cause a
visible phenotype change.
Wild-type Arabidopsis (left) and with the MPK4 gene
knocked out (a gene related to pathogen
resistance).
From L Frank (2001) Paranoid but popular. Genome
News Network
http://www.genomenewsnetwork.org/articles/02_01/Cress_plant_resistance.shtml
40The study of gene expression
Traditionally, Northern blotting was used to
study gene expression using one labeled probe
hybridized to an RNA target. With new
high-throughput methods, in particular
microarrays, the level of expression in tens of
thousands of genes, in some cases the whole
genome, can be visualized at once. Expression
profiling is the identification of all of the
RNAs that are present in a specific tissue sample
at a particular time (Cullis 2004). The simple
pictorial below depicts how many genes can be
analyzed at once on a microarray as compared to a
very small number on a Northern blot (more about
microarrays later).
Figure: microarray compared with Northern blotting.
RNA from healthy leaves and diseased or stressed
leaves is extracted.
Microarray: cDNAs or oligonucleotides affixed to
chip; hybridization; fluorescent color
intensities of thousands of genes can be
quantified at once.
Northern blotting: labeled probe and RNA blot;
hybridization; band sizes and intensities are
analyzed.
41Gene expression analysis
A number of methods have been developed to study
gene expression. For a good review of these
systems, see Alba et al. (2004). The most widely
used is the microarray (next slide). Some of the
others are listed below.
Differential display
- Brief description: uses low-stringency PCR,
primers and gel electrophoresis to amplify and
visualize cDNAs
- Advantages: little RNA is required; parallel
profiling is possible
- Disadvantages: output not quantitative;
positives difficult to confirm
cDNA-AFLP
- Brief description: uses the principles of AFLP
with cDNA templates
- Advantages: higher stringency leads to clearer
data which can be quantified; a wide variety of
tissue types, developmental stages or time points
can be compared
- Disadvantages: substantial resources required
for cloning and sequencing; less sensitive to
low-abundance transcripts
SAGE (serial analysis of gene expression)
- Brief description: combines differential
display and cDNA sequencing techniques
- Advantages: is quantitative
- Disadvantages: laborious; requires extensive
sequence information
Adapted from Alba R, Fei Z, Payton P, Liu Y,
Moore SL, Debbie P, Cohn J, D'Ascenzo M, Gordon
JS, Rose JKC, Martin G, Tanksley SD, Bouzayen M,
Jahn MM, Giovannoni J (2004) ESTs, cDNA
microarrays, and gene expression profiling: tools
for dissecting plant physiology and development.
The Plant Journal 39(5): 697-714.
42Microarrays
Microarrays are a way to study the expression of
many genes or even the whole genome at once.
These are chips, usually made of glass or
plastic, that contain thousands of small
molecules (typically PCR products, cDNAs, or
oligonucleotides) affixed using robotics as small
dots in rows. These chips are very small.
Next, other small pieces of DNA that have been
tagged with fluorescence are hybridized to the
microarray chip where they bind to any
complementary sequences among the molecules
affixed to the chip. A scanning machine reads the
amount of fluorescence on each dot and a computer
analyzes the pattern of the active, or turned
on, genes.
A picture of the Q-bot robotic system for
producing microarrays, and 2 examples of partial
microarray chips.
43Use of microarrays
Microarrays can contain tens of thousands of dots
representing different RNA transcripts on a chip
the size of a microscope slide. They have the
advantages over the expression analysis systems
mentioned in a previous slide of being
quantitative and sensitive to low-abundance
transcripts. In addition microarrays can, if
enough sequence is available for a particular
organism, represent the expression of an entire
genome in one experiment. As of this writing,
microarray chips for some organisms are already
available either commercially or at cost from
universities, for example, mouse, human,
Arabidopsis, tomato and many others.
While the cost of these pre-made chips as well as
the construction of new ones can still be
prohibitive, prices are decreasing. Analyzing the
large amount of data they contain, however, can
be overwhelming and expensive.
Keep in mind, too, that a microarray is just a
snapshot of a moment in the life of an
organism. The real potential is in comparing
microarrays of an organism over time, or under
different conditions.
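Comparing two conditions usually starts with a ratio of the fluorescence intensities for each gene. The sketch below computes log2 ratios and lists the genes that change at least two-fold. The gene names and intensities are hypothetical, the pseudocount is an assumption added to avoid dividing by zero, and a real analysis would also need normalization, replicates and statistical testing.

```python
import math


def log2_ratios(treated, control, pseudocount=1.0):
    """Return {gene: log2((treated + p) / (control + p))} for shared genes."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in treated
        if gene in control
    }


def changed_genes(ratios, threshold=1.0):
    """Genes whose expression changed at least 2**threshold-fold, up or down."""
    return sorted(gene for gene, ratio in ratios.items() if abs(ratio) >= threshold)


# Hypothetical spot intensities for four genes in stressed and healthy leaves.
stressed = {"geneA": 4000.0, "geneB": 510.0, "geneC": 120.0, "geneD": 900.0}
healthy = {"geneA": 1000.0, "geneB": 500.0, "geneC": 480.0, "geneD": 880.0}

ratios = log2_ratios(stressed, healthy)
for gene in sorted(ratios):
    print(f"{gene}: log2 ratio {ratios[gene]:+.2f}")
print("changed at least two-fold:", changed_genes(ratios))
```

A positive log2 ratio means the gene is more strongly expressed in the stressed sample, a negative one that it is less strongly expressed.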
44Limitations of microarrays
Although microarrays have great potential in
genomics and comparative genomics research, there
are a number of limitations as well:
- They may require a large amount of high-quality
RNA, which is difficult to acquire.
- Varying protocols are required, depending on the
goals and priorities of the experiment (whether
high-throughput is necessary, low abundance
transcripts are important, what tissue types are
used, etc.).
- They are still very expensive.
- Analyzing the results requires a high level of
technical skill and software.
- Artifacts occur frequently.
45Association Studies
Genetic linkage mapping has long been used to
pinpoint the chromosomal location of a gene or
other marker, using structured progeny
populations (e.g. F2, backcross) and the
principles of recombination frequencies. A newer
type of mapping is called association mapping or
linkage disequilibrium mapping. This is
similar to linkage mapping except that it does
not require a structured population, simply a
group of unrelated individuals. DNA sequences are
compared in an attempt to find an association
(correlation) between a particular sequence,
marker or SNP and a disease or other measured
phenotype. Since a dense marker map, a physical
map, or sequence information is a prerequisite,
association studies are generally used mainly for
fine mapping.
For more on this see Borevitz and Nordborg
(2003); Doerge (2002).
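A very simple way to test for an association between a marker allele and a phenotype is a chi-square test on a 2x2 table of counts. The sketch below is only an illustration: the counts are hypothetical, the 3.841 cut-off is the standard critical value for one degree of freedom at the 5% level, and real association studies use far more stringent thresholds and corrections.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table of counts.

    a = carriers with the trait      b = carriers without the trait
    c = non-carriers with the trait  d = non-carriers without the trait
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_VALUE_5_PERCENT = 3.841  # chi-square, 1 degree of freedom

# Hypothetical counts: 40 carriers and 40 non-carriers of a SNP allele.
statistic = chi_square_2x2(30, 10, 10, 30)
print(f"chi-square = {statistic:.2f}")
print("associated at the 5% level:", statistic > CRITICAL_VALUE_5_PERCENT)
```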
46Association studies: limitations
- For statistical reasons, these associations can
usually only be identified between tightly linked
loci, thus a dense molecular marker map or DNA
sequence is a prerequisite.
- The statistics needed to assess these
associations must be very stringent; false
positives and false negatives due, for example,
to population structure are common. A popular
example, noted in Lander and Schork (1994), is
that in San Francisco, skill with chopsticks is
strongly associated with the HLA-A1 allele (one
of the human leukocyte antigens). This is, as you
might guess, simply because this allele is found
more often among Chinese people than other groups.
- The strength of the associations is dependent
upon many variables, for example the rate of
recombination between the loci, the history of
mutation at both loci, the historical
heterozygosity of the population, the size of the
population, etc. (Borevitz and Nordborg 2003).
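The effect of population structure can be shown with made-up numbers. In the sketch below, two hypothetical groups differ in both allele frequency and trait frequency, but within each group the allele and the trait are completely independent. Pooling the groups nevertheless produces a strong apparent association, measured here with a simple 2x2 chi-square statistic.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table of counts."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Counts are (carriers with trait, carriers without,
#             non-carriers with trait, non-carriers without).
# Hypothetical group 1: allele and trait both common (80% each), independent.
group_1 = (64, 16, 16, 4)
# Hypothetical group 2: allele and trait both rare (20% each), independent.
group_2 = (4, 16, 16, 64)
# Ignoring group membership mixes the two tables together.
pooled = tuple(x + y for x, y in zip(group_1, group_2))

print("group 1:", chi_square_2x2(*group_1))
print("group 2:", chi_square_2x2(*group_2))
print("pooled :", round(chi_square_2x2(*pooled), 2))
```

Within each group the statistic is zero, yet the pooled sample appears highly significant, which is the kind of false positive described in the chopsticks example.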
47Resources additional information
- Friend, S.H. and Stoughton, R.B. (2002,
February). The magic of microarrays. Scientific
American, pp. 44-53.
- Schneider, D and La Rota M (2000) Tutorial
Introduction to Software for the Analysis of
Sequence Similarity. Tutorial prepared for the
Cornell University course PL BR 607.
http://cbsu.tc.cornell.edu/resources/seq_comp/TOC.html
- National Center for Biotechnology Information
(NCBI). A Science Primer. Available at
http://www.ncbi.nlm.nih.gov/About/primer/index.html.
See also the Tutorials available at
http://www.ncbi.nlm.nih.gov/Education/
- dbEST: a database of ESTs housed by NCBI.
http://www.ncbi.nlm.nih.gov/dbEST/index.html
- http://learn.genetics.utah.edu/units/biotech/microarray/
The Genetic Science Learning Center (Utah) has a
tutorial about microarrays (it is animated,
however).
- Genome Research Limited (GRL) (2001)
http://www.yourgenome.org Has helpful background
information and news.