DTFNPACISDSC SAN Plans PowerPoint PPT Presentation

Title: DTFNPACISDSC SAN Plans


1
The Integrative Biosciences Program Within
NPACI/SDSC As Exemplified by the PDB
2
PDB Background
  • The PDB is vital biological infrastructure
    150,000 Web hits/day; one file every second, 7/24;
    over ½ of all SDSC/NPACI web access
  • The NSF is the lead agency
  • Protein structure is the focus as a follow-on to the
    human genome project

3
PDB Relevance to Drug Discovery
Search Google for "Protein": the PDB comes up
number 1 > Many novice users access the
PDB. Our role is to educate this population;
examples 1. 2. On-line. As developed by
Aston Taylor, an NSF REU student, in 2000.
4
What NPACI/SDSC Brings to PDB
  • 7/24 operation (off-line only 3 hrs in 2000)
  • Standards development and deployment (thrust)
  • Large scale distributed databases (DAKS)
  • Grid deployment (CE, electrostatics)
  • Visualization (molecular biology toolkit)
  • Outreach and Education

5
The Encyclopedia of Life
A National Proteomics Resource
  • Integrates our existing strengths
  • The largest biological computation ever
    undertaken

6
"Almost everything in the body is either made
of proteins or made by them."
Matt Ridley, Genome: The Autobiography of a
Species in 23 Chapters
  • Types of Questions which can be addressed by
    EOL
  • Is protein X found in anthrax?
  • Is protein X a drug target, that is, does it
    exist predominantly in pathogenic bacteria or is
    it found in eukaryotes also?
  • Has caspase-1, a protein involved in cell death
    and aging, been identified in any plants? If so,
    in what species, and do the proposed protein
    structures look similar?
  • Give me all available information on caspase-1
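
Questions of this kind amount to filters over per-protein annotation records. A minimal sketch, assuming a hypothetical record layout (protein, species, kingdom, pathogenic); the field names, the helper functions, and the toy rows are illustrative placeholders, not EOL's schema or real annotation results. The drug-target rule here is a deliberate simplification of "predominantly in pathogenic bacteria": present in at least one pathogenic bacterium and absent from eukaryotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    protein: str      # protein name or family (hypothetical field)
    species: str      # organism the protein was annotated in
    kingdom: str      # e.g. "bacteria" or "eukaryota"
    pathogenic: bool  # whether the organism is a pathogen


# Toy placeholder rows, invented purely to exercise the functions below.
TOY_RECORDS = [
    Annotation("protein_X", "species_A", "bacteria", True),
    Annotation("protein_X", "species_B", "bacteria", True),
    Annotation("protein_Y", "species_A", "bacteria", True),
    Annotation("protein_Y", "species_C", "eukaryota", False),
]


def species_with(records, protein):
    """Return the sorted species names in which `protein` is annotated."""
    return sorted({r.species for r in records if r.protein == protein})


def is_candidate_target(records, protein):
    """Simplified rule: found in a pathogenic bacterium and in no eukaryote."""
    hits = [r for r in records if r.protein == protein]
    in_pathogen = any(r.kingdom == "bacteria" and r.pathogenic for r in hits)
    in_eukaryote = any(r.kingdom == "eukaryota" for r in hits)
    return in_pathogen and not in_eukaryote
```

With the toy rows, `is_candidate_target(TOY_RECORDS, "protein_X")` is true and the same call for `"protein_Y"` is false, because `protein_Y` also appears in a eukaryote.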

7
Computational Pipeline
800 genomes with 10k-20k ORFs (Open Reading
Frames) each: 10^7 ORFs/hours
Genome protein sequences
sequence info
structure info
Prediction of signal peptides (SignalP,
PSORT), transmembrane regions (TMHMM, PSORT), coiled
coils (COILS), low complexity regions (SEG)
NR, PFAM
SCOP, PDB
4 CPU years
10^4 entries
Building FOLDLIB: PDB chains, SCOP domains, PDP
domains, CE matches PDB vs. SCOP; 90% sequence
non-identical, minimum size 25 aa, coverage (90%,
gaps <30, ends <30)

228 CPU years
Create PSI-BLAST profiles for Protein sequences
3 CPU years
Structural assignment of domains by PSI-BLAST on
FOLDLIB
Only sequences w/out A-prediction
9 CPU years
Structural assignment of domains by 123D on
FOLDLIB
Only sequences w/out A-prediction
252 CPU years
Functional assignment by PFAM, NR, PSIPred
assignments
3 CPU years
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
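
The CPU-year figures quoted in the pipeline can be totalled and turned into a rough wall-clock estimate. A minimal sketch, assuming the six figures are additive, a 365-day year, and perfect parallel scaling; the function names are illustrative, and the figures are listed in slide order without tying each one to a particular step.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year

# CPU-year figures in the order they appear on the slide.
CPU_YEARS = [4, 228, 3, 9, 252, 3]


def total_cpu_years(figures):
    """Sum the per-step CPU-year figures."""
    return sum(figures)


def wall_clock_hours(cpu_years, processors):
    """Ideal wall-clock hours if the work divides evenly across processors."""
    return cpu_years * HOURS_PER_YEAR / processors


def orf_range(genomes=800, low=10_000, high=20_000):
    """Lower and upper bound on total ORFs for the quoted genome count."""
    return genomes * low, genomes * high
```

Under these assumptions the figures total 499 CPU years, which is roughly 4,371 hours on 1000 processors, and 800 genomes at 10k-20k ORFs each give 8 to 16 million ORFs, i.e. on the order of 10^7. The processor speed behind the CPU-year figures is not stated, so the wall-clock number is indicative only.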
8
Initial Prototype: Annotation of Arabidopsis
thaliana Proteins
  • Prototype computation (one genome) took
    40,000 hours on 4-64 processors at 0.1 TF
  • Post-TeraGrid, computation (all genomes in EOL)
    will take
    4,500 hours on 1000 processors at 5 TF
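
The two estimates can be put on a common footing by multiplying run time by the quoted aggregate rate. A rough sketch, assuming the hours are wall-clock hours sustained at the stated teraflop rate (the slide does not say whether they are wall-clock or CPU hours); the names are illustrative.

```python
PROTOTYPE = {"hours": 40_000, "teraflops": 0.1}   # one genome
POST_TERAGRID = {"hours": 4_500, "teraflops": 5}  # all genomes in EOL


def teraflop_hours(hours, teraflops):
    """Work estimate: run time multiplied by the aggregate rate."""
    return hours * teraflops


def rate_ratio(new, old):
    """How many times faster the new aggregate rate is than the old one."""
    return new["teraflops"] / old["teraflops"]
```

Under this reading, `teraflop_hours(**PROTOTYPE)` is about 4,000 TF-hours, `teraflop_hours(**POST_TERAGRID)` is 22,500 TF-hours, and the aggregate rate rises 50-fold. If the prototype hours are CPU hours instead, the comparison changes accordingly.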

Arabidopsis annotation: joint work with SDSC and
Ceres