Next Generation Sequencing Data Analysis

1
Next Generation Sequencing Data Analysis
  • Nadia Pisanti, University of Pisa

2
Why sequencing?
  • The knowledge of DNA and RNA sequences has become
    a crucial tool for:
  • Basic research in biology, pharmacology and
    medicine.
  • Many applied fields: diagnostics (detection of
    genetic diseases), pharmacogenomics (influence of
    genetic variation on drug response) and
    personalized medicine, forensic biology, gene
    therapies, biological systematics (the study of
    the diversification of living forms).

3
Sequencing: some history
  • "Rapid DNA sequencing", developed by Frederick
    Sanger (UK) in the 1970s, became the method of
    choice for DNA sequencing, and earned him his 2nd
    Nobel Prize in Chemistry in 1980.
  • The Sanger method for sequencing DNA was used in
    the Human Genome Project (HGP), which produced the
    first reference sequence of the human genome.
  • The HGP started in 1990 and was expected to take
    15 years.
  • A first "rough draft" was finished in 2000 and
    announced in a press conference by Bill Clinton
    and Tony Blair!
  • The complete genome was announced in 2003.
  • Why announce the rough draft in 2000?
  • Why did the HGP take less time than expected?

4
Celera Genomics (cut and pasted from Wikipedia
and my memory)
  • In 1998, the American researcher Craig Venter,
    formerly at the NIH, announced that his private
    company Celera Genomics would sequence the human
    genome at a fraction of the cost of the public
    project.
  • A significant portion of the human genome had
    already been sequenced when Celera entered the
    field and was freely available to the public from
    GenBank.
  • Celera used a technique called whole genome
    shotgun sequencing. This novelty spurred the HGP
    to change its own strategy, leading to a rapid
    acceleration of the public effort.
  • Celera filed preliminary ("place-holder") patent
    applications on 6,500 whole or partial genes.
    Celera also promised to publish their findings in
    accordance with the terms of the 1996 "Bermuda
    Statement," by releasing new data annually (the
    HGP released its new data daily), although,
    unlike the publicly funded project, they would
    not permit free redistribution or scientific use
    of the data. For this reason, the public
    competitor was compelled to publish the first
    draft of the human genome before Celera.
  • In 2000, the HGP released a first working draft
    on the web. The scientific community downloaded
    one-half trillion bytes of information from the
    UCSC genome server in the first 24 hours of free
    and unrestricted access to the first ever
    assembled blueprint of our human species.
  • Also in 2000, President Clinton announced that
    the genome sequence could not be patented, and
    should be made freely available to all
    researchers. The statement sent Celera's stock
    plummeting and dragged down the
    biotechnology-heavy Nasdaq. The biotechnology
    sector lost about $50 billion in market
    capitalization in two days. But the public
    release of the data ensured its fair use and
    availability.

5
Shotgun sequencing
  • The Sanger sequencing technology could only be
    used for short DNA fragments (from 100 to 1000
    bases). DNA must thus be divided into small
    pieces, and then be re-assembled.
  • This can be done in two ways:
  • Chromosome walking: sequencing consecutive
    fragments piece by piece.
  • Shotgun sequencing: breaking several copies of the
    DNA strand into random overlapping fragments,
    sequencing them, and then re-assembling them in
    silico by exploiting the overlaps.
  • Since Celera applied shotgun sequencing to the
    whole human genome, it has been the method of
    choice for large-scale sequencing.

6
Shotgun sequencing: assembly
  • Wikipedia, about shotgun sequencing: "faster but
    more complex".
  • The "complexity" of the approach comes from
    algorithmic issues.
  • (Eu)gene Myers, a string algorithms expert, led
    the computer scientists at Celera: he made the
    difference.
  • Challenges in the assembly phase: finding
    prefix/suffix overlaps, data structures for
    storing fragments and the "overlap graph", the
    assembly algorithm, managing duplications.
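
The overlap and assembly steps listed above can be sketched in a few lines. This is a toy illustration only, not the method used at Celera: it assumes error-free reads, a single strand, no repeats, and no read contained inside another. The function names (`overlap`, `overlap_graph`, `greedy_assemble`) and the `min_len` parameter are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    """
    start = 0
    while True:
        # Look for the first min_len bases of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # Check that the whole suffix of a from there is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_len=3):
    """Map each ordered pair of distinct reads to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        olen = overlap(a, b, min_len)
        if olen > 0:
            graph[(a, b)] = olen
    return graph


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        graph = overlap_graph(reads, min_len)
        if not graph:
            break  # nothing overlaps any more: several contigs remain
        (a, b), olen = max(graph.items(), key=lambda kv: kv[1])
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[olen:])
    return reads
```

On three overlapping reads such as `["ACGTTGCA", "TGCAAGGC", "AAGGCTA"]` the greedy merge rebuilds one contig. With repeats, the largest overlap is not always the right one, which is why managing duplications is listed as a challenge.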

7
Fragment Assembly
The problem of sequence assembly can be compared
to taking many copies of a book, passing them all
through a shredder, and piecing the text of the
book back together just by looking at the
shredded pieces. Besides the obvious difficulty
of this task, there are some extra practical
issues: the original may have many repeated
paragraphs, and some shreds may be modified
during shredding to have typos. Excerpts from
another book may also be added in, and some
shreds may be completely unrecognizable.
8
What is NGS?
  • Next/New Generation Sequencing
  • Massively Parallel Sequencing
  • Third Generation Sequencing
  • High Throughput Sequencing

Millions of fragments (reads) in a single run!!
This is achieved by means of new technologies
developed mainly by:
  • Lynx Therapeutics, which merged with Solexa; they
    were then bought by Illumina.
  • ABI SOLiD
  • Ion Torrent Systems
  • 454 Life Sciences, acquired by Roche Diagnostics

These platforms actually differ quite a lot in
performance and characteristics.
9
What's new with NGS?
  • Sequencing the whole human genome took the HGP:
  • 3,000,000,000 dollars
  • 13 years
  • Sequencing a whole human genome now with NGS
    techniques takes:
  • about 1,000 dollars
  • 4-5 days

Sequencing is much faster and (thus) cheaper!!
10
What is NGS great for
  • Re-sequencing: no assembly, just mapping onto a
    known reference genome.
  • Metagenomics
  • Transcriptome sequencing: RNA-Seq
  • Chromatin immunoprecipitation combined with DNA
    sequencing: ChIP-Seq

11
re-sequencing
  • Sequencing a new individual of a species for
    which the reference genome is known (and well
    annotated).
  • Important applications:
  • Medicine
  • Building datasets of several strains of the same
    organism to investigate intra-species evolution.

12
Re-sequencing: medical applications (we will get
back to this later)
  • Genotyping: testing for known mutations
    (sequencing can possibly be targeted to specific
    regions).
  • Variation analysis: scanning for any mutation
    such as Single Nucleotide Polymorphisms (SNPs),
    Copy Number Variations (CNVs) or other
    Structural Variants (SVs) that can be associated
    with congenital diseases, predisposition for
    certain pathologies, or drug response.
  • Most NGS platforms come with software to detect
    mutations.
  • With NGS these tests can be performed on a large
    scale... and back in time: Roche sequenced the
    Neanderthal genome in 2006!

13
Re-sequencing
  • Challenges for computer science:
  • Indexing data and (quickly) mapping onto the
    reference genome.
  • SNP and SV calling.
  • Mind the repeats up there!
  • Challenges for informatics:
  • Building tools for geneticists.
  • Interpreting SNPs and SVs by cross-referencing
    database information.
  • Database management.
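
Indexing and mapping can be illustrated with a k-mer index and a seed-and-extend lookup. This is a minimal sketch, assuming substitutions only (no indels) and a single strand. Production mappers rely on far more compact indexes. The names `build_kmer_index` and `map_read` and the parameters `k` and `max_mismatches` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Seed-and-extend mapping without indels.

    Each k-mer of the read is a seed. Every seed hit proposes a start
    position on the reference. The read is then compared base by base
    against that window. Returns {start: [(ref_pos, ref_base, read_base)]}
    for every placement with at most `max_mismatches` substitutions.
    """
    hits = {}
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # the read would hang off the reference
            if start in hits:
                continue  # placement already accepted via another seed
            window = reference[start:start + len(read)]
            mismatches = [
                (start + i, ref_base, read_base)
                for i, (ref_base, read_base) in enumerate(zip(window, read))
                if ref_base != read_base
            ]
            if len(mismatches) <= max_mismatches:
                hits[start] = mismatches
    return hits
```

The mismatch list of a placement is the raw material for SNP calling: a position where many reads disagree with the reference in the same way is a candidate SNP. A read taken from a repeated region comes back with more than one start position, which is the "mind the repeats" problem.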

14
metagenomics
  • Metagenomics essentially entails brute force
    sequencing of DNA fragments obtained from an
    uncultured, unpurified, microbial and/or viral
    population, followed by bioinformatics-based
    analyses that attempt to answer the question
    "Who's there?" (E.R. Mardis, Trends in Genetics,
    2008)
  • Characterizing the human microbiome: we live in
    symbiosis with millions of microbial species.
    There is a theory saying that these symbiotic
    microbes provide an extension of the human genome
    and hence contribute to its genetic potential in
    terms of protective immunity, added enzymatic
    capability...
  • Metagenomics applies not only to the human body,
    but also to important ecosystems such as oceans,
    soil, and deep mines.
  • Metagenomics costs are affordable only now with
    NGS (mostly Roche 454, whose longer reads are
    better suited to de novo sequencing).

15
What is RNA-Seq
  • NGS opened a new phase in transcriptomics (aka
    expression profiling) thanks to
  • low requirements of nucleotide sequence product
  • and deep coverage

16
Why RNA-Seq
  • Among the goals of the HGP was mapping the
    genotypes associated with (the predisposition
    for) diseases.
  • It is now very clear (and it was not then) that
    reading the genome is not enough.
  • Same genome, different phenotypes and different
    diseases: how come?
  • Environmental effects (food, pollution,
    lifestyle) act on gene transcription.
  • We ought to investigate the transcriptome!
  • The transcriptome is the set of genes that are
    being actively expressed at a given time.
  • The role of miRNA in gene regulation.

17
RNA-Seq
  • Sequencing the transcriptome to investigate
    differentially expressed genes:
  • under different conditions, or
  • in different tissues, or
  • in different alleles.
  • The different expression can be in quantitative
    terms or in alternative splicing terms
    (eukaryotes only).

de novo transcriptome assembly
18
RNA-Seq
  • Sequencing the transcriptome to investigate
    differentially expressed genes:
  • under different conditions, or
  • in different tissues, or
  • in different alleles.
  • The different expression can be in quantitative
    terms or in alternative splicing terms
    (eukaryotes only).

transcriptome re-sequencing
19
RNA-Seq quantification
  • RNA-Seq (Quantification) is used to analyze gene
    expression of certain biological objects under
    specific conditions.
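
As a rough illustration of quantification, read counts per gene can be normalized by sequencing depth and compared between two conditions. This is a sketch under stated assumptions: counts-per-million normalization and a pseudocount are common simple choices picked here for illustration, not something the slides prescribe, and both function names are hypothetical. It also assumes both conditions report the same set of genes.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1_000_000 * n / total for gene, n in counts.items()}


def log2_fold_change(cond_a, cond_b, pseudocount=1.0):
    """log2 ratio of normalized expression, condition B over condition A.

    The pseudocount (an assumption of this sketch) avoids dividing by
    zero for genes with no reads in one condition.
    """
    cpm_a = counts_per_million(cond_a)
    cpm_b = counts_per_million(cond_b)
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cond_a
    }
```

A positive value means the gene takes a larger share of the reads in condition B. Deciding whether such a difference is statistically meaningful requires replicates and a proper test, which this sketch leaves out.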

20
Alternative Splicing (we will get back to this
later)
  • AS is when several mRNAs can be produced from a
    single pre-mRNA.
  • E.g. in humans there are approximately 30,000
    genes and it is estimated that 70% of human
    protein-coding genes undergo alternative splicing
    to generate up to 150,000-200,000 mRNAs and
    proteins through alternative splice site usage.
  • In 2008, an experiment revealed that 34% of human
    transcripts were not from known genes (Science
    321).

21
non coding RNA
  • ncRNA includes a wide class of regulatory RNA
    molecules whose function is as crucial as it is
    poorly understood.
  • Discovering their sequences and (hence) genomic
    locations is hard because they are (mostly) small
    and poorly conserved over evolutionary time.
  • In silico prediction methods are of high
    importance and very promising, but so far of
    little use.
  • Currently, ncRNAs are mostly discovered by
    sequencing small RNA fragments, a task for which
    NGS tools are ideal!
  • In silico analysis of such data will be crucial
    for understanding it (secondary structure
    prediction, putative function prediction based
    on learning methods).
  • A new class of miRNA (or small RNA) is being
    discovered every day.

22
ChIP-Seq
  • ChIP-seq combines chromatin immunoprecipitation
    (ChIP) with massively parallel DNA sequencing to
    identify the binding sites of DNA-associated
    proteins.
  • The goal is to analyze protein interactions with
    DNA (e.g. how transcription factors, which are
    proteins, regulate gene expression).

23
The bad side of NGS
  • Even shorter fragments: from the 1000 bases of
    Sanger technology to 25, then 50, then 75, now
    100 bases.
  • Even more errors (whenever a new read size is
    released).

Fragment assembly is even harder!!
24
From M.L. Metzker, "Sequencing technologies: the
next generation", Nature Reviews Genetics 11,
31-46, 2010
What is best depends on what you need it for
and how much money you have
25
Roche 454 Genome Sequencer
  • It was the first to be introduced on the market,
    in 2005.
  • Its technology produces relatively long reads
    (400-700 bases).
  • Its base calling cannot handle long (>6)
    stretches of the same nucleotide, resulting in
    insertion and deletion errors there.
  • On the other hand, a very low substitution error
    rate.
  • Overall error rate at 1%.
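
Since long stretches of one nucleotide are where 454 indel errors concentrate, it can help to flag them in a sequence. A minimal sketch: the function name `long_homopolymers` is hypothetical, and the threshold of 6 is simply the figure quoted above.

```python
from itertools import groupby


def long_homopolymers(seq, max_reliable=6):
    """List (start, base, length) for runs longer than `max_reliable`."""
    runs = []
    pos = 0
    # groupby yields one group per maximal run of identical bases.
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length > max_reliable:
            runs.append((pos, base, length))
        pos += length
    return runs
```

Indels reported inside or next to the returned regions deserve extra suspicion before being called as real variants.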

26
Illumina Genome Analyzer (aka Solexa sequencer)
  • The most widely available NGS technology.
  • Reads up to 100 bases long.
  • Error rate at 1-1.5%, mostly substitutions
    (indels are much less common).

27
ABI's SOLiD
  • Probably the second most widely used.
  • The workflow is similar to Solexa/Illumina's.
  • An interesting difference: SOLiD uses a di-base
    sequencing technique in which two nucleotides are
    read simultaneously. The 16 di-bases are
    represented by only 4 "colors", but the one-base
    shift resolves the ambiguity.
  • As a consequence:
  • Sequencing errors may propagate.
  • Read alignment can be sped up.
  • Error rate around 2-4%.
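
The di-base idea can be made concrete with the commonly described color code: if the bases are numbered A=0, C=1, G=2, T=3, the color of a di-base is the XOR of its two numbers, giving 4 colors for 16 di-bases. The sketch below assumes that mapping and a known first base. The names `encode` and `decode` are hypothetical.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def encode(seq):
    """Return (first base, list of colors), one color per adjacent pair."""
    codes = [BASES.index(b) for b in seq]
    return seq[0], [x ^ y for x, y in zip(codes, codes[1:])]


def decode(first_base, colors):
    """Rebuild the bases by walking the colors from the known first base."""
    code = BASES.index(first_base)
    out = [first_base]
    for color in colors:
        code ^= color  # the next base follows from the previous base and the color
        out.append(BASES[code])
    return "".join(out)


first, colors = encode("ATGGCA")       # colors == [3, 1, 0, 3, 1]
assert decode(first, colors) == "ATGGCA"

# A single miscalled color changes every base decoded after it.
corrupted = [3, 2, 0, 3, 1]
assert decode(first, corrupted) == "ATCCGT"
```

Decoding each base depends on the previous one, so one wrong color corrupts the rest of the read when translated naively. This is the error propagation mentioned above, and a reason to work in color space rather than decoding reads first.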

28
Paired-end and Mate-pairs
  • Two very different objects from the point of view
    of the technology, as they are obtained with very
    different procedures.
  • Available from all NGS platforms.
  • From the computational point of view, they are
    the same: two sequences at an approximately
    known distance from each other in the genome
    (insert size).
  • They are crucial to:
  • Correctly map/assemble repeated fragments.
  • Detect Structural Variants and Copy Number
    Variations.
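
One way the insert size helps to detect structural variants is to flag pairs whose mapped distance is far from the typical one. A minimal sketch, assuming each pair is given as the mapped positions of its two mates on the reference and that most pairs are normal. The name `discordant_pairs` and the `tolerance` parameter are hypothetical.

```python
from statistics import median


def discordant_pairs(pairs, tolerance=0.3):
    """Flag pairs whose mapped distance deviates from the typical one.

    `pairs` is a list of (left_pos, right_pos) positions on the reference.
    The expected insert size is estimated as the median distance.
    Returns (expected_size, [(left, right, size, hint), ...]).
    """
    sizes = [right - left for left, right in pairs]
    expected = median(sizes)
    flagged = []
    for (left, right), size in zip(pairs, sizes):
        if abs(size - expected) > tolerance * expected:
            # Mates mapping too far apart suggest the sample lacks a stretch
            # present in the reference; too close suggests extra sequence.
            hint = "possible deletion" if size > expected else "possible insertion"
            flagged.append((left, right, size, hint))
    return expected, flagged
```

A single discordant pair proves little. A cluster of pairs with a consistent deviation around the same locus is the usual evidence for a structural variant.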

29
Fragment Assembly with NGS data
  • It is like a diabolic sudoku
  • - with very few initial numbers
  • - many solutions satisfy the constraints: the
    choice is arbitrary
  • - only one of the many solutions is the good one,
    and there is no clue as to which

30
NGS and Informatics: the challenges 1
  • Massive image processing and base calling within
    the sequencing technology.
  • Growing need to manage big data.
  • Indexing issues.
  • Efficient mapping and alignments.
  • Parallel and High Performance computing.
  • New emphasis on efficient data structures and
    algorithms with special care on memory usage.

31
NGS and Informatics: the challenges 2
  • Designing and producing tools for data analysis
    that integrate information from different sources
    (e.g. genome browsers).
  • Designing and producing tools for assembling...
  • Designing and producing tools for genotyping: a
    new one every day, hard to compare...
  • Customized analysis: informatics is needed for
    any project and in any lab.
  • "Curiously", back to old-style stuff such as the
    command line and machine language programming.