Title: DTF/NPACI/SDSC SAN Plans
1 The Integrative Biosciences Program Within NPACI/SDSC, As Exemplified by the PDB
2 PDB Background
- The PDB is vital biological infrastructure
- 150,000 Web hits/day; one file every second, 7/24
- Over ½ of all SDSC/NPACI web access
- The NSF is the lead agency
- Protein structure is the focus as a follow-on to the human genome project
3 PDB Relevance to Drug Discovery
- Search Google for "Protein": the PDB comes up number 1
- Many novice users access the PDB
- Our role is to educate this population; examples: 1. 2. On-line, as developed by Aston Taylor, an NSF REU student, in 2000.
4 What NPACI/SDSC Brings to PDB
- 7/24 availability (off-line for only 3 hrs in 2000)
- Standards development and deployment (thrust)
- Large scale distributed databases (DAKS)
- Grid deployment (CE, electrostatics)
- Visualization (molecular biology toolkit)
- Outreach and Education
5 The Encyclopedia of Life: A National Proteomics Resource
- Integrates our existing strengths
- The largest biological computation ever undertaken
6 "Almost everything in the body is either made of proteins or made by them."
Matt Ridley, Genome: The Autobiography of a Species in 23 Chapters
Types of questions which can be addressed by EOL:
- Is protein X found in anthrax?
- Is protein X a drug target, that is, does it exist predominantly in pathogenic bacteria, or is it found in eukaryotes also?
- Has caspase-1, a protein involved in cell death and aging, been identified in any plants? If so, in what species, and do the proposed protein structures look similar?
- Give me all available information on caspase-1
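The questions above amount to simple lookups over per-species protein annotations. The following is a minimal sketch of that idea, not the EOL implementation: the `Annotation` record, the placeholder data, and the "more than half of hits in pathogenic bacteria, none in eukaryotes" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One hypothetical annotation: a protein identified in a species."""
    protein: str
    species: str
    kingdom: str      # e.g. "bacteria" or "eukaryote"
    pathogenic: bool


# Placeholder records for illustration only; not real EOL data.
RECORDS = [
    Annotation("protein-X", "species-A", "bacteria", True),
    Annotation("protein-X", "species-B", "bacteria", True),
    Annotation("protein-X", "species-C", "bacteria", False),
    Annotation("protein-Y", "species-A", "bacteria", True),
    Annotation("protein-Y", "species-D", "eukaryote", False),
]


def found_in(protein, species, records=RECORDS):
    """Is the protein annotated in the given species?"""
    return any(r.protein == protein and r.species == species for r in records)


def species_with(protein, records=RECORDS):
    """All species in which the protein has been annotated, sorted."""
    return sorted({r.species for r in records if r.protein == protein})


def is_candidate_drug_target(protein, records=RECORDS):
    """Assumed rule: most hits are pathogenic bacteria and none are eukaryotes."""
    hits = [r for r in records if r.protein == protein]
    if not hits:
        return False
    in_eukaryotes = any(r.kingdom == "eukaryote" for r in hits)
    pathogenic_hits = sum(r.kingdom == "bacteria" and r.pathogenic for r in hits)
    return not in_eukaryotes and pathogenic_hits > len(hits) / 2
```

With the placeholder data, `is_candidate_drug_target("protein-X")` is true and `is_candidate_drug_target("protein-Y")` is false, because protein-Y also appears in a eukaryote. A real resource would answer the same questions from a database rather than an in-memory list.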
7 Computational Pipeline
800 genomes with 10k-20k ORFs (Open Reading Frames) each: 10^7 ORFs/hours
Genome protein sequences
sequence info
structure info
Prediction of: signal peptides (SignalP, PSORT), transmembrane (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
NR, PFAM
SCOP, PDB
4 CPU years
10^4 entries
Building FOLDLIB: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
228 CPU years
Create PSI-BLAST profiles for protein sequences
3 CPU years
Structural assignment of domains by PSI-BLAST on FOLDLIB
Only sequences w/out A-prediction
9 CPU years
Structural assignment of domains by 123D on FOLDLIB
Only sequences w/out A-prediction
252 CPU years
Functional assignment by PFAM, NR, PSIPred assignments
3 CPU years
Domain location prediction by sequence
FOLDLIB
Store assigned regions in the DB
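The scale of the pipeline can be checked with a few lines of arithmetic. This sketch only multiplies and adds the figures printed on the slide; the function names are made up for illustration, and it does not claim which pipeline stage each CPU-year figure belongs to.

```python
# Figures as they appear on the pipeline slide.
GENOMES = 800
ORFS_PER_GENOME = (10_000, 20_000)   # "10k-20k ORFs" per genome

# CPU-year figures in slide order; the stage-to-figure pairing is not
# asserted here, only the total.
CPU_YEARS = [4, 228, 3, 9, 252, 3]


def orf_range(genomes=GENOMES, per_genome=ORFS_PER_GENOME):
    """Lower and upper bound on the total number of ORFs."""
    low, high = per_genome
    return genomes * low, genomes * high


def total_cpu_years(figures=CPU_YEARS):
    """Sum of all CPU-year figures on the slide."""
    return sum(figures)


if __name__ == "__main__":
    low, high = orf_range()
    print(f"Total ORFs: {low:,} to {high:,}")
    print(f"Total compute: {total_cpu_years()} CPU years")
```

The ORF total comes to 8 to 16 million, consistent with the order of 10^7 quoted on the slide, and the CPU-year figures sum to 499.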
8 Initial Prototype: Annotation of Arabidopsis thaliana Proteins
- Prototype computation (one genome) took 40,000 hours on 4-64 processors at 0.1 TF
- Post-TeraGrid, computation (all genomes in EOL) will take 4,500 hours on 1000 processors at 5 TFs
Arabidopsis annotation: joint work with SDSC and Ceres