Displaying associations, improving alignments and gene sets at UCSC

1
Displaying associations, improving alignments and
gene sets at UCSC
Jim Kent and the UCSC Genome Bioinformatics Group
2
Wellcome Trust Case Control Consortium rheumatoid
arthritis data
3
Wellcome Trust Case Control Consortium rheumatoid
arthritis data
4
Sort Genes to see candidates
5
Case Control Consortium rheumatoid arthritis,
type 1 diabetes and bipolar disorder data.
National Institute of Mental Health bipolar
disorder data in US and German populations (different
scale).
6
(No Transcript)
7
In the long term we hope to import data from
GAIN, dbGaP and other sources as well.
8
28-way multiple alignment
Still based on the Penn State/UCSC
blastz/chain/net/multiz pipeline. Have added syntenic filtering
for high-coverage genomes and reciprocal-best
filtering for 2x genomes to reduce artifacts from
paralogs.
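A minimal sketch of the reciprocal-best idea, assuming each alignment is reduced to a hypothetical (query_id, target_id, score) tuple. The real pipeline filters chains and nets, so the function and the sample identifiers below are illustrative only.

```python
def reciprocal_best(alignments):
    """Keep alignments that are the top-scoring hit in both directions.

    `alignments` is a list of (query_id, target_id, score) tuples; the
    identifiers are hypothetical stand-ins for aligned regions.
    """
    best_for_query = {}
    best_for_target = {}
    for query, target, score in alignments:
        if query not in best_for_query or score > best_for_query[query][1]:
            best_for_query[query] = (target, score)
        if target not in best_for_target or score > best_for_target[target][1]:
            best_for_target[target] = (query, score)
    return [
        (q, t, s)
        for q, t, s in alignments
        if best_for_query[q] == (t, s) and best_for_target[t] == (q, s)
    ]


hits = [
    ("geneA", "chr1:100", 950),
    ("geneA", "chr7:500", 700),  # weaker second hit, e.g. to a paralog
    ("geneB", "chr1:100", 600),  # weaker hit onto an already-claimed target
    ("geneB", "chr3:200", 880),
]
filtered = reciprocal_best(hits)
```

Only the two pairs that are each other's best match survive, which is how this kind of filter suppresses paralog-driven alignments.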
9
PhyloP vs. PhastCons
Existing conservation track uses the PhastCons
algorithm, which computes the probability that a
region is conserved. As more species are added
this converges to 0 or 1. The PhyloP track instead
shows the degree of conservation of each base.
10
UCSC Genes Goals
  • Include noncoding as well as coding genes
  • Increase sensitivity of gene set in general.
  • Increase coverage of alternative splice forms
    (but not too much).
  • Apply comparative genomics to protein (CDS)
    prediction.
  • Create permanent accessions for transcripts.

11
Make graph
Snap soft ends to hard ends within 6 bp
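The snapping step might look like the sketch below. It assumes ends are plain integer coordinates, and that "hard" ends are boundaries fixed by splice sites while "soft" ends are transcript starts and stops; the function name and sample positions are hypothetical.

```python
def snap_soft_ends(soft_ends, hard_ends, max_dist=6):
    """Move each soft end onto the nearest hard end within max_dist bp.

    Soft ends with no hard end close enough are left where they are.
    """
    snapped = []
    for pos in soft_ends:
        nearest = min(hard_ends, key=lambda h: abs(h - pos), default=None)
        if nearest is not None and abs(nearest - pos) <= max_dist:
            snapped.append(nearest)
        else:
            snapped.append(pos)
    return snapped


snapped = snap_soft_ends([98, 205, 400], hard_ends=[100, 200, 300])
```

Here 98 and 205 move onto the hard ends at 100 and 200, while 400 stays put because no hard end lies within 6 bp.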
12
Extend soft ends to hard ends
13
Consensus of soft ends weighted 3/4 of way
towards long
14
Weigh edges by number of transcripts that make
them
[Figure: splicing graph with edge weights 1, 3, 2, 3, 3, 2, 4, 1, 3, 1]
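As a rough illustration of edge weighting, the sketch below assumes each transcript is a list of boundary coordinates and that consecutive boundaries form one edge of the graph. The coordinates are made up and do not correspond to the figure.

```python
from collections import Counter


def edge_weights(transcripts):
    """Count how many transcripts support each edge.

    Each transcript is a list of boundary positions; consecutive
    positions form an edge (alternating exon and intron edges).
    """
    weights = Counter()
    for boundaries in transcripts:
        # set() so a transcript counts at most once per edge
        for edge in set(zip(boundaries, boundaries[1:])):
            weights[edge] += 1
    return weights


transcripts = [
    [100, 200, 300, 400],
    [100, 200, 300, 400],
    [100, 200, 350, 400],  # alternative boundary at 350
]
weights = edge_weights(transcripts)
```

The shared first exon gets weight 3, while the edges unique to the third transcript get weight 1.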
15
Make graphs from various other sources: Exoniphy, ESTs, mouse splicing
Merge in weights from other graphs
[Figure: splicing graph; edge weights 1, 3, 2, 3, 3, 2, 4, 1, 3, 1 become 2, 4, 3, 5, 4, 5, 6, 3, 5, 3 after merging]
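One way to picture the merge is shown below. The sketch assumes each graph is a mapping from edge to weight and that the extra evidence only adds weight to edges already in the main graph; that restriction is an assumption, based only on the figure showing the same number of edges before and after the merge.

```python
from collections import Counter


def merge_weights(primary, *others):
    """Add weights from other graphs onto edges already in the primary graph."""
    merged = Counter(primary)
    for other in others:
        for edge, weight in other.items():
            if edge in merged:  # assumption: evidence never creates new edges
                merged[edge] += weight
    return merged


rna_graph = {("a", "b"): 1, ("b", "c"): 3}
est_graph = {("a", "b"): 1, ("x", "y"): 5}  # ("x", "y") is absent from rna_graph
exoniphy_graph = {("b", "c"): 2}
merged = merge_weights(rna_graph, est_graph, exoniphy_graph)
```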
16
Initial transcripts (ordered by exon count): A, B, D, C, E
Walk graph to get nonredundant transcripts,
starting with first transcript and continuing
until all edges in graph of weight above a
threshold are emitted.
[Figure: splicing graph with edge weights 2, 4, 3, 5, 4, 5, 6, 3, 5, 3; transcript A highlighted]
17
[Figure: same graph and transcripts A, B, D, C, E; walk continues from transcript A]
18
[Figure: same graph and transcripts A, B, D, C, E; labels "> 3" beside A and "> 2" beside B]
19
[Figure: same graph and transcripts A, B, D, C, E; labels "> 3" beside A and "> 2" beside B]
DONE
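The walk can be approximated by a greedy loop like the one below. This is a sketch under stated assumptions, not the UCSC implementation: transcripts are given as lists of hypothetical edge names already ordered by exon count, and a transcript is skipped if any of its edges falls at or below the threshold.

```python
def walk_transcripts(transcripts, weights, threshold):
    """Greedily emit transcripts until every well-supported edge is covered.

    transcripts: dict of name -> list of edges, assumed to be ordered by
                 exon count (most exons first).
    weights:     dict of edge -> weight.
    """
    needed = {edge for edge, w in weights.items() if w > threshold}
    emitted = []
    for name, edges in transcripts.items():
        if not needed:
            break  # every edge above the threshold has been emitted
        # Assumption: a transcript using a weakly supported edge is skipped.
        if any(weights.get(edge, 0) <= threshold for edge in edges):
            continue
        new_edges = needed.intersection(edges)
        if new_edges:  # emit only transcripts that add something new
            emitted.append(name)
            needed -= new_edges
    return emitted


weights = {"e1": 5, "e2": 4, "e3": 2, "e4": 6, "e5": 3}
transcripts = {
    "A": ["e1", "e2", "e4"],
    "B": ["e1", "e5", "e4"],
    "C": ["e1", "e3", "e4"],  # relies on the weak edge e3
    "D": ["e1", "e2"],        # adds nothing beyond A
}
emitted = walk_transcripts(transcripts, weights, threshold=2)
```

With these made-up inputs the walk stops after A and B, because together they cover every edge whose weight exceeds the threshold.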
20
Evidence type and weights
Minimum total weight of 3 for spliced
transcripts, 4 for unspliced.
21
Assigning Coding Regions
  • Take top scoring ORF using a program,
    txCdsPredict, that considers:
      • Length of ORF
      • Kozak consensus sequence
      • Nonsense mediated decay
      • Upstream open reading frames
      • Length of orthologous ORF in other species
  • txCdsPredict agrees with RefSeq reviewed 96% of
    the time.
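To make the first criterion concrete, here is a minimal sketch that finds the longest ATG-to-stop open reading frame on the forward strand. It is not txCdsPredict: it ignores the Kozak sequence, nonsense mediated decay, upstream ORFs and orthology, and the function name and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF, or None.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    Only the forward strand is scanned.
    """
    seq = seq.upper()
    best = None
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Step through codons in frame until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if best is None or pos + 3 - start > best[1] - best[0]:
                    best = (start, pos + 3)
                break
    return best


# Two complete ORFs: a short one at 0-9 and a longer one at 11-26.
orf = longest_orf("ATGAAATAGCCATGCCCGGGAAATGA")
```

A real scorer would weigh each candidate ORF on all of the listed criteria instead of picking purely by length.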

22
Gene Statistics
Transcript Statistics
23
Coding
Non-coding
Near-coding
  • 38% of UCSC noncoding genes are < 200 bp
    transcripts, primarily of known types such as
    snoRNAs, piRNAs, miRNAs etc.
  • 62% are long, with a size distribution much like
    coding.
  • (For Ensembl only 21% of noncoding are long)

24
Long noncoding genes have lower expression levels
Coding
Non coding
Absolute expression values from Affymetrix human
exon arrays
25
Other characteristics of long noncoding
  • Long noncoding have lower tissue specificity.
  • Poor conservation. Average phastCons score is
    0.09 for long noncoding vs 0.73 for coding.
  • BLAST analysis suggests 20% of long noncoding may
    be transcribed pseudogenes.
  • Conclusion - long noncoding but transcribed genes
    are slippery. Most are likely nonfunctional.
  • Xist is poorly conserved overall but has some
    peaks and is reasonably well expressed.

26
Acknowledgements
  • Programming and analysis
      • Galt Barber - Genome Graphs extensions
      • Webb Miller Lab - Alignments
      • Adam Siepel - Evolutionary analysis
      • Dorota Retelska - UCSC noncoding genes
  • Data
      • Sanger, Wash U, Broad, JGI, NCBI, EBI, Affy
      • Contributors to scientific databases worldwide
  • Funding
      • NHGRI, NCI, HHMI, State of California

27
The End
28
UCSC Genes Overall Pipeline
  • Start with genomic/RNA alignments
  • Remove antibody fragments
  • Clean alignments and project to genome
  • Cluster into splicing graph
  • Add EST, Exoniphy, OrthoSplice info.
  • Walk unique well supported transcripts out of
    graph.
  • Assign coding regions (CDS) to transcripts.
  • Classify into coding, antisense, noncoding.
  • Assign accessions.

30
Classifying transcripts
  • Coding: CDS survives trimming stage
  • Near-coding: overlap coding by at least 20 bases
    on same strand
  • Near-coding junk: near-coding transcripts that
    show signs of incomplete splicing. These are
    removed.
  • Antisense: overlap coding by at least 20 bases on
    opposite strand
  • Noncoding: other transcripts
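These rules could be sketched as follows. The code assumes transcripts are dicts with hypothetical keys (start, end, strand, has_cds) on a single chromosome, measures overlap on whole transcript spans, not exons, and gives same-strand overlap precedence over opposite-strand overlap. The "junk" test for incomplete splicing is left out.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def classify(tx, coding_txs, min_overlap=20):
    """Classify one transcript against a list of coding transcripts."""
    if tx["has_cds"]:
        return "coding"
    same_strand = opposite_strand = False
    for coding in coding_txs:
        shared = overlap(tx["start"], tx["end"], coding["start"], coding["end"])
        if shared >= min_overlap:
            if coding["strand"] == tx["strand"]:
                same_strand = True
            else:
                opposite_strand = True
    if same_strand:  # assumption: near-coding takes precedence over antisense
        return "near-coding"
    if opposite_strand:
        return "antisense"
    return "noncoding"


coding = [{"start": 1000, "end": 2000, "strand": "+", "has_cds": True}]
labels = [
    classify({"start": 1900, "end": 2500, "strand": "+", "has_cds": False}, coding),
    classify({"start": 1900, "end": 2500, "strand": "-", "has_cds": False}, coding),
    classify({"start": 1990, "end": 2500, "strand": "+", "has_cds": False}, coding),
    classify({"start": 5000, "end": 6000, "strand": "+", "has_cds": True}, coding),
]
```

The third transcript overlaps the coding gene by only 10 bases, so it falls through to noncoding.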