Discovery of Genes for Improved Cellulose - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Discovery of Genes for Improved Cellulose and
Cellulose-Extractability from Poplar Secondary Xylem
Jill L. Wegrzyn (1), Jennifer M. Lee (2), Andrew J. Eckert (2),
Charlyn J. Suarez (2), Brian J. Stanton (3), Mark F. Davis (4),
Chung-Jui Tsai (5), David B. Neale (1)
(1) Department of Plant Sciences, University of California at Davis, Davis, CA
(2) Department of Evolution and Ecology, University of California at Davis, Davis, CA
(3) Genetic Resources Conservation Program, Greenwood Resources, Portland, OR
(4) National Renewable Energy Lab, Golden, CO
(5) School of Forest Resources, Michigan Technological University, Houghton, MI
2
Project Objectives
  • Resequence 40 candidate genes using a discovery
    panel of 15 unrelated poplar individuals
  • Identify SNPs in the 40 genes using an automated
    alignment and SNP calling bioinformatics pipeline
  • Genotype 456 poplar clones at 1,536 SNPs
    (Illumina GoldenGate assay)
  • Harvest wood increment cores from 2-3 ramets of
    each of the 456 poplar clones (1100 trees in
    total)
  • Molecular Beam Mass Spectrometry (MBMS) analysis
    on all 1100 wood cores to develop secondary xylem
    metabolomic profiles
  • Association genetics analyses to identify genes
    controlling cellulose quantity and quality
    phenotypic variation in poplar

3
Poplar Biofuels Genome Project: Project Overview
4
Selected Candidate Genes
40 genes highly expressed in wood-forming tissues and
associated with lignin and cellulose biosynthesis
5
(No Transcript)
6
Primer Design and Sequencing: Agencourt Biosciences
  • Primer Design
      - mRNA sequences were used to direct custom
        software to use 1000 bp upstream along with
        intronic sequence from the poplar genome
      - 517 primers were designed across 40 genes
      - 203 non-overlapping primers were finally
        selected based on quality score, position, and
        homopolymer regions (bioinformatic validation)
  • Goal: Fully re-sequence 40 candidate genes to
    facilitate SNP discovery
      - between 3 and 12 amplicons/gene
      - total of 202 amplicons from Agencourt
      - forward and reverse sequencing

7
Candidate Genes Re-Sequenced from a Panel of 15
Unrelated Poplar Clones
DNA landmarks responsible for extraction
8
Alignment and SNP Calling Pipeline: Challenges in
High-Throughput SNP Identification
  • Alignment
      - Critical in the automation of base calls
      - The commonly used Phrap (from PhredPhrap) is an
        assembler and is NOT ideal for alignments
      - Many commonly used aligners work best with
        protein sequences or with a reference sequence
      - Preservation of quality scores for input into
        SNP identification programs
      - Speed for high-throughput programs
  • Automated SNP Calls
      - Reference sequence required
      - Traditional approaches without a reference
        sequence include eSNPs (human, maize, and pine)
          - Very little redundancy outside of abundant
            genes
          - Overall high number of false positives
            (single-pass reads)
      - Not specific to frequencies observed in
        different organisms
      - High number of false positives in currently
        accepted methods (Polybayes, PolyPhred)

9
Identification of SNPs in the 40 Candidate Genes:
Automated Alignment and SNP Identification Pipeline
Re-Sequencing data from Agencourt -> Initial Processing
-> Base Calling -> Sequence Alignment -> SNP Identification
-> Machine Learning -> Data Storage -> Release
10
Base Calling and Sequence Alignment
  • Modified PhredPhrap: allows for trimming of bases
    from the start and end of a sequence based on trace
    quality
  • Ace2FASTA: converts native PhredPhrap output (ace
    file) into an unaligned FASTA file
  • ProbconsRNA: optimal DNA sequence alignment program
  • AlignedContig2ReadFASTA: provides a single
    multi-FASTA file with all reads aligned to the
    contig from PhredPhrap AND the contig's alignment
    to the other contigs from ProbconsRNA
  • FASTA2Ace: converts the resulting FASTA file back
    into an ace file for SNP identification
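
The end-trimming idea in the first step can be illustrated with a minimal Python sketch. It is not the modified PhredPhrap itself: the function name `trim_by_quality`, the quality threshold of 20, and the sliding-window rule are all assumptions chosen for illustration.

```python
def trim_by_quality(bases, quals, min_q=20, window=5):
    """Trim low-quality ends; return (trimmed_bases, trimmed_quals).

    Hypothetical rule: keep the span from the first to the last
    window whose mean quality is at least min_q.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have equal length")
    n = len(quals)
    if n < window:
        return "", []
    # Start indices of every window that passes the threshold.
    good = [i for i in range(n - window + 1)
            if sum(quals[i:i + window]) / window >= min_q]
    if not good:
        return "", []
    start, end = good[0], good[-1] + window
    return bases[start:end], quals[start:end]
```

With `window=1` the rule reduces to cutting at the first and last base that meets the threshold; larger windows tolerate isolated dips in quality.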
11
Alignment and SNP Identification: SNP Identification
Overview
  • Examine features to improve the accuracy of SNP
    location prediction
  • Utilize machine learning to apply the features
  • Refine the accuracy of the learning algorithm
    through adjustments to feature representation
  • Utilize the classifier against the large
    re-sequenced set to improve accuracy of SNP calls
    originating from Polybayes and Polyphred

12
Alignment and SNP Identification: Existing SNP
Identification Software
  • Polyphred
      - http://droog.mbt.washington.edu/PolyPhred.html
      - PolyPhred identifies potential SNPs using the
        base calls and peak information provided by
        Phred and the sequence alignments provided by
        Phrap
      - SNP score based on base quality and sequence
        depth
  • Polybayes
      - http://genome.wustl.edu/tools/software/polybayes.cgi
      - Fully probabilistic SNP detection algorithm
        that identifies SNPs based on discrepancies at
        a given location of a multiple alignment.
      - SNP score is based on a Bayesian-statistical
        formulation and can take in prior frequency
        information

13
Alignment and SNP Identification: Feature Selection
Description                           Representation
Sequence depth                        Continuous
Variation type                        Categorical
Polybayes score                       Continuous
Polyphred score                       Continuous
Freq of major/minor alleles           Continuous
Max quality of major/minor alleles    Continuous
Local average quality                 Continuous
Overall average quality               Continuous
Alignment quality                     Continuous
14
Alignment and SNP Identification: Feature
Representation
  • Sequence depth
      - Count of the number of sequences in the
        alignment at the position of variation.
      - Not all sequences in the alignment overlap the
        position of variation, so this number can
        differ from the total number of sequences in
        the alignment.
  • Variation type
      - Variation type can be SNP or INDEL.
  • PolyBayes score
      - The PolyBayes program assigns a Bayesian
        posterior probability value to each called SNP
        using the frequency priors given for observing
        a variation at that position.

15
Alignment and SNP Identification: Feature
Representation
  • Polyphred score
      - Polyphred assigns a score calculated primarily
        from sequence depth and quality score.
  • Base frequencies
      - The number of occurrences of different bases at
        the position of variation is important in
        determining a polymorphic position.
      - Frequencies of the first (major allele) and the
        second (minor allele) most common bases are
        represented as a ratio to sequence depth.
  • Relative distance
      - Sequence quality at the ends of the alignment
        tends to be poor due to inherent limitations of
        current sequencing technology.
      - SNP position was represented as a ratio using
        its distance in the consensus sequence from the
        closest end (the relative distance).
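
The depth, allele-frequency, and relative-distance features could be computed along the following lines. This is a sketch under stated assumptions, not the pipeline's code: the function name `site_features`, the use of a space to mark a read that does not overlap the site, and the choice to divide the distance from the nearest end by the consensus length are all illustrative guesses, since the slides do not give the exact denominator.

```python
from collections import Counter

def site_features(column, position, consensus_length):
    """Features for one candidate site.

    column: one character per read in the alignment; a space marks a
    read that does not overlap the site (an assumed encoding).
    position: 0-based index of the site in the consensus sequence.
    """
    covering = [base for base in column if base != " "]
    depth = len(covering)  # reads that actually overlap the site
    if depth == 0:
        raise ValueError("no read covers this position")
    ranked = Counter(covering).most_common(2)
    major_base, major_count = ranked[0]
    minor_base, minor_count = ranked[1] if len(ranked) > 1 else (None, 0)
    # Assumed definition: distance to the nearest end over total length.
    from_nearest_end = min(position, consensus_length - 1 - position)
    return {
        "depth": depth,
        "major": major_base,
        "major_freq": major_count / depth,
        "minor": minor_base,
        "minor_freq": minor_count / depth,
        "relative_distance": from_nearest_end / consensus_length,
    }
```

Note that depth counts only overlapping reads, so the eight-read column `"AAAAGG  "` has a depth of 6, not 8.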

16
Alignment and SNP Identification: Feature
Representation
  • Sequence quality
      - Variation may be observed because of a
        poor-quality base.
      - Based on the base frequencies calculated:
          - maximum qualities of the major and minor
            alleles
          - average qualities of the major and minor
            alleles
  • Alignment quality
      - Misalignment of bases caused by sequence
        alignment programs sometimes results in an
        erroneous SNP call.
      - In the neighborhood of the SNP (+/- 10 bases),
        all mismatches with the consensus sequence are
        given a penalty, and the penalty is larger when
        the mismatches are contiguous.
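
One way to picture the alignment-quality feature is the sketch below. The slides give only the window size (10 bases on either side); the penalty weights, the function name `alignment_penalty`, and the decision to leave the candidate site itself unscored are assumptions made for illustration.

```python
def alignment_penalty(read, consensus, snp_pos, flank=10,
                      mismatch=1.0, run_extra=1.0):
    """Penalty for mismatches near a candidate SNP.

    read and consensus are aligned strings of equal length.
    Each mismatch costs `mismatch`; a mismatch that directly follows
    another mismatch costs `run_extra` more (hypothetical weights).
    """
    if len(read) != len(consensus):
        raise ValueError("read and consensus must be aligned")
    lo = max(0, snp_pos - flank)
    hi = min(len(consensus), snp_pos + flank + 1)
    penalty = 0.0
    prev_mismatch = False
    for i in range(lo, hi):
        if i == snp_pos:
            # Assumption: the candidate site itself is not scored.
            prev_mismatch = False
            continue
        is_mismatch = read[i] != consensus[i]
        if is_mismatch:
            penalty += mismatch + (run_extra if prev_mismatch else 0.0)
        prev_mismatch = is_mismatch
    return penalty
```

Two adjacent mismatches therefore score higher than two isolated ones, which mirrors the heavier penalty for contiguous mismatches.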

17
Alignment and SNP Identification: SNP
Identification Datasets
  • Training set for loblolly pine was composed of a
    total of 300 validated sequences.
  • Divided to represent the relative percentages of
    sequence source
  • Testing set is composed of 120 validated sequence
    sets
  • Training set for poplar was composed of 42
    validated sequences selected at random
  • Testing set is composed of a total of 30
    validated sequence sets.
  • Validation: FP, FN, TP, and TN SNP calls were
    determined manually by inspecting trace files in
    Consed.

18
Alignment and SNP Identification: Classification
  • GOAL: Prediction
  • Learn a function or set of functions that assign
    a record to one of several predefined classes.
  • Decision tree: the C4.5 program is open-source C
    code; J48 is the implementation in WEKA
  • At each point in the construction of the decision
    tree, C4.5 selects the feature to test based on
    maximum information gain.
  • The goal is to generate a minimum size tree that
    correctly classifies all the SNP calls in the
    training set.
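
The information-gain criterion can be made concrete with a short sketch. It shows plain information gain for a single threshold on a continuous feature, as named on the slide; C4.5 and J48 go further and normalize this into a gain ratio, which is not shown. The function names and the TP/FP labels in the usage note are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain from splitting a continuous feature at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    # Weighted entropy remaining after the split.
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder
```

For example, with feature values `[5, 6, 20, 25]` and labels `["FP", "FP", "TP", "TP"]`, a threshold of 10 separates the classes perfectly and yields a gain of 1 bit, while a threshold of 5 yields a much smaller gain, so the tree builder would prefer the first split.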

19
(No Transcript)
20
Alignment and SNP Identification: Evaluation
Criteria
  • Accuracy = (TP + TN)/total
  • Sensitivity = TP/(TP + FN)
  • Specificity = TN/(FP + TN)

Evaluation (%)   J48     Polyphred   Polybayes
Accuracy         93.6    76.25       78.02
Sensitivity      88.21   83.22       86.54
Specificity      98.73   N/A         N/A

Evaluation (%)   J48     Polyphred   Polybayes
Accuracy         94.6    79.35       80.24
Sensitivity      90.54   85.01       88.14
Specificity      97.23   N/A         N/A
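
The three criteria translate directly into code. The sketch below is a plain restatement of the formulas; the function name `evaluate` is illustrative, and the counts in the usage note are made up for demonstration, not taken from the presentation's validation sets.

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # correct calls over all calls
        "sensitivity": tp / (tp + fn),      # true SNPs that were found
        "specificity": tn / (fp + tn),      # non-SNPs correctly rejected
    }
```

With hypothetical counts `evaluate(tp=90, tn=95, fp=5, fn=10)`, accuracy is 185/200, sensitivity is 90/100, and specificity is 95/100.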
21
PineSAP
  • PineSAP alignment improves on
      - inaccuracies introduced by using Phrap to align
        sequences
      - the time that would be required by using an
        aligner such as ProbconsRNA or ClustalW on its
        own
  • PineSAP has a 98% success rate when used to align
    loblolly pine resequencing data.
  • PineSAP identified a successful list of features
    to enhance polymorphism predictions
  • PineSAP obtained an overall prediction accuracy
    of 93% in SNP identification
  • PineSAP provided a full alignment and
    polymorphism detection system that can be adapted
    to specific genomes

22
Alignment and SNP Identification: SNPs Identified
  • Total of 202 amplicons
  • Number of SNPs identified: 1486
      - Meet a minimum confidence score from the
        PineSAP pipeline
      - Average number of SNPs/amplicon: 7
      - Amplicon length: 600-700 bp
  • Remaining SNPs generated from 232 additional
    genes.
      - Utilized an eSNP method with publicly available
        EST data and reference genome from JGI.
      - Identified a total of 1,232 potential SNPs

23
Alignment and SNP Identification: SNP Formatting
  • Polyphred style output is transformed into
    Illumina style input
      - adding IUPAC codes for SNPs in flanking sequence
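
A minimal sketch of this formatting step follows. It assumes the target SNP is written in bracket notation (for example `[A/G]`) between its flanking sequences and that any other known SNP in the flanks is replaced by its two-base IUPAC ambiguity code. The function name, argument layout, and flank length are hypothetical; the actual Illumina submission file has further columns that are not reproduced here.

```python
# Standard IUPAC ambiguity codes for two-base polymorphisms.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def illumina_style(seq, target_pos, target_alleles, flanking_snps, flank=60):
    """Return 'LEFT[a/b]RIGHT' with IUPAC codes at other SNP sites.

    flanking_snps: {0-based position: (allele1, allele2)} for the
    additional SNPs known in this amplicon (assumed input shape).
    """
    chars = list(seq)
    for pos, alleles in flanking_snps.items():
        if pos != target_pos:
            chars[pos] = IUPAC[frozenset(alleles)]
    left = "".join(chars[max(0, target_pos - flank):target_pos])
    right = "".join(chars[target_pos + 1:target_pos + 1 + flank])
    a, b = target_alleles
    return f"{left}[{a}/{b}]{right}"
```

For instance, `illumina_style("ACGTACGTAC", 4, ("A", "G"), {2: ("G", "A"), 7: ("T", "C")}, flank=3)` marks the two neighbouring SNPs as `R` and `Y` in the flanks around the bracketed target.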

24
SNP Genotyping: Illumina GoldenGate Assay
25
Alignment and SNP Identification: Illumina Design
26
Alignment and SNP Identification: SNP Selection
  • All SNP and amplicon information is databased.
  • SQL queries can be used to select specific SNPs
  • Pair-wise comparisons of all SNPs
  • Scores were assigned to each pair of SNPs in each
    amplicon, accounting for distance between the
    SNPs, Illumina score for both SNPs, and frequency
    of minor allele
  • We can also use SQL queries to select SNPs and
    minimize additional SNPs in flanking sequence
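
The kind of SQL-driven pairwise selection described here might look like the sketch below, written with Python's built-in `sqlite3` module. The table name `snp`, its columns, the demo rows, and the thresholds are all hypothetical; the presentation does not describe the real schema or the scoring formula, so this only filters pairs by Illumina score and distance.

```python
import sqlite3

def build_demo_db():
    """In-memory database with made-up rows, for illustration only."""
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE snp (
        id INTEGER PRIMARY KEY, amplicon TEXT, position INTEGER,
        illumina_score REAL, maf REAL)""")
    con.executemany(
        "INSERT INTO snp VALUES (?, ?, ?, ?, ?)",
        [(1, "amp01", 100, 0.95, 0.30),
         (2, "amp01", 130, 0.80, 0.10),
         (3, "amp01", 400, 0.90, 0.25),
         (4, "amp02", 50, 0.40, 0.45)])
    return con

def candidate_pairs(con, min_score=0.6, min_distance=60):
    """SNP pairs in the same amplicon that pass both filters."""
    return con.execute("""
        SELECT a.id, b.id, b.position - a.position AS distance
        FROM snp AS a JOIN snp AS b
          ON a.amplicon = b.amplicon AND a.position < b.position
        WHERE a.illumina_score >= ? AND b.illumina_score >= ?
          AND b.position - a.position >= ?
        ORDER BY a.id, b.id""",
        (min_score, min_score, min_distance)).fetchall()
```

The self-join enumerates each pair once per amplicon; a fuller version could add minor allele frequency to the query to compute a combined pair score.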

27
Pyrolysis Molecular Beam Mass Spectrometry
Analysis
cell wall chemistry
lignin
hemicellulose
cellulose
28
(No Transcript)
29
Acknowledgements
Mike Davis
Chung-Jui Tsai
David Neale, Jill Wegrzyn, Jennifer Lee, Andrew
Eckert, John Liechty
Brian Stanton, Rich Shuren
Funding