Title: The Shocking Details of Genome.ucsc.edu
1The Shocking Details of Genome.ucsc.edu
2History of the Code
- Started in 1999 in C after Java proved hopelessly unportable across browsers.
- Early modules include a worm genome browser (Intronerator) and GigAssembler, which produced the working draft of the human genome.
- In 2001 a few other grad students started working on the code.
- In 2002 hired staff to help with the Genome Browser.
- Currently the project employs 20 full-time people.
3The Genome Browser Staff
- 5 programmers - Mark, Angie, Hiram, Kate, Rachel, Fan, Jim
- 4 quality assurance engineers - Heather, Bob, Mike, Galt
- 3 post-docs - Terry, Gill, Katie
- 9 grad students - Chuck, Daryl, Brian, Robert, Yontao, Krish, Adam, Ryan, Andy
- 3 system administrators - Paul, Jorge, Patrick
- 1 writer - Donna
- David Haussler and CBSE Staff
- About 1/3 of staff (including me 3 days a week) telecommutes.
4The Goal
Make the human genome understandable by humans.
5Prognosis
Maybe we'll understand it one of these days.
7Cardiac Troponin T2
8Comparative Genomics at BMP10
9Normalized eScores
10Conservation Levels of Regulatory Regions
11Complex Transcription
12Add Your Own Tracks
- Users can extend the browser with their own tracks.
- User tracks can be private or public.
- No programming required.
- GFF, GTF, PSL or BED formats supported

    chrom  start    end      name  strand  score
    chr1   1302347  1302357  SP1   +       800
    chr1   1504778  1504787  SP2   +       980
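Reading a track line with the six columns named on this slide takes only a few lines of C. The following is a minimal sketch, assuming whitespace-separated fields in the order the slide lists them; the struct `trackItem` and the function `trackItemParse` are hypothetical names for illustration, not part of the browser source.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one custom-track line.  Field order follows
 * the slide: chrom start end name strand score. */
struct trackItem
    {
    char chrom[32];   /* Chromosome name, such as chr1. */
    unsigned start;   /* Start position. */
    unsigned end;     /* End position. */
    char name[64];    /* Item name. */
    char strand;      /* '+' or '-'. */
    int score;        /* Score. */
    };

int trackItemParse(const char *line, struct trackItem *item)
/* Parse one whitespace-separated line into item.
 * Returns 1 on success, 0 if the line does not have six fields. */
{
int count;
assert(line != NULL && item != NULL);
count = sscanf(line, "%31s %u %u %63s %c %d",
    item->chrom, &item->start, &item->end, item->name,
    &item->strand, &item->score);
return count == 6;
}
```

A caller would read the track file line by line and skip or report any line for which `trackItemParse` returns 0.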
13The Underlying Database
- Power users and bioinformaticians sometimes want the underlying database.
- There is a table for each track.
- Larger tracks have a table for each chromosome.
- Format of a track table is generally similar to the add-your-own track formats.
- Pieces of the database are available from the tables browser.
- The whole database is available as tab-separated files.
- Most of the database is served via DAS.
14Parasol and Kilo Cluster
- UCSC cluster has 1000 CPUs running Linux
- 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment
- We wrote the Parasol job scheduler to keep up.
- Very fast and free.
- Jobs are organized into batches.
- Error checking at job and at batch level.
15Science is Hard
16Coding Discipline Is Required
- "While software development is immune from almost all physical laws, entropy hits us hard." - The Pragmatic Programmer
- To keep the system from devolving into disorder we have to follow code conventions and insist on a lot of testing.
- We use CVS (Concurrent Versions System) to help all of us work on the same code at once.
17Obtaining the Code from CVS
- See http://genome.ucsc.edu/admin/cvs.html
- This gets you a sandbox - a local copy of the source to compile and edit.
- Type make in the lib and utilities directory.
- You can do a cvs update to get our updates to the code base.
- To add permanently to the code base, email me to enable cvs commit.
18Expand Your Mental Capacity With
19Lagging Edge Software
- C language - compilers still available!
- CGI Scripts - portable if not pretty.
- SQL database - at least MySQL is free.
20Problems with C
- Missing booleans and strings.
- No real objects.
- Must free things
21Advantages of C
- Very fast at runtime.
- Very portable.
- Language is simple.
- No tangled inheritance hierarchy.
- Excellent free tools are available.
- Libraries and conventions can compensate for
language weaknesses.
22Coping with Missing Data Types in C
- #define boolean int
- Fixing lack of real string type much harder
- lineFile/common modules and autoSql code generator make parsing files relatively painless
- dyString module not a horrible string class
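To show what a "not horrible" string class in C can look like, here is a minimal sketch of a growable string. It assumes nothing about the real dyString module: the names `growString`, `growStringNew`, `growStringAppend` and `growStringFree` are hypothetical, and the buffer-doubling policy is just one reasonable choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string; not the actual dyString API. */
struct growString
    {
    char *string;     /* Current contents, always null terminated. */
    size_t size;      /* Length of contents, not counting the null. */
    size_t capacity;  /* Bytes allocated for string. */
    };

struct growString *growStringNew(void)
/* Allocate an empty growable string. */
{
struct growString *gs = calloc(1, sizeof(*gs));
if (gs == NULL)
    abort();
gs->capacity = 16;
gs->string = calloc(gs->capacity, 1);
if (gs->string == NULL)
    abort();
return gs;
}

void growStringAppend(struct growString *gs, const char *text)
/* Append text, doubling the buffer until it fits. */
{
size_t len;
assert(gs != NULL && text != NULL);
len = strlen(text);
while (gs->size + len + 1 > gs->capacity)
    gs->capacity *= 2;
gs->string = realloc(gs->string, gs->capacity);
if (gs->string == NULL)
    abort();
memcpy(gs->string + gs->size, text, len + 1);  /* Copies the null too. */
gs->size += len;
}

void growStringFree(struct growString **pGs)
/* Free string and set the caller's pointer to NULL. */
{
if (*pGs != NULL)
    {
    free((*pGs)->string);
    free(*pGs);
    *pGs = NULL;
    }
}
```

The caller never counts bytes or worries about overflow; it appends pieces and reads `gs->string` when done.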
23Object Oriented Programming in C
- Build objects around structures.
- Make families of functions with names that start with the structure name, and that take the structure as the first argument.
- Implement polymorphism/virtual functions with function pointers in structure.
- Inheritance is still difficult. Perhaps this is not such a bad thing.
24
struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's.  Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
 * white space and numbers to a dnaSeq. */

void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */

void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeqs. */
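The slide gives only declarations. One plausible way to implement them, using the standard library rather than the project's own memory routines, is sketched below. The function bodies are assumptions made for illustration, not the code in the source tree: this version keeps alphabetic characters, lower-cases them, and leaves `name` unset.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct dnaSeq
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* Bases, null terminated. */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string)
/* Keep the letters in string, lower-cased, and drop
 * white space and numbers. */
{
struct dnaSeq *seq = calloc(1, sizeof(*seq));
char *s;
assert(string != NULL);
if (seq == NULL)
    abort();
seq->dna = calloc(strlen(string) + 1, 1);
if (seq->dna == NULL)
    abort();
for (s = string; *s != 0; ++s)
    if (isalpha((unsigned char)*s))
        seq->dna[seq->size++] = (char)tolower((unsigned char)*s);
return seq;
}

void dnaSeqFree(struct dnaSeq **pSeq)
/* Free dnaSeq and set pointer to NULL. */
{
struct dnaSeq *seq = *pSeq;
if (seq != NULL)
    {
    free(seq->name);
    free(seq->dna);
    free(seq);
    *pSeq = NULL;
    }
}

void dnaSeqFreeList(struct dnaSeq **pList)
/* Free list of dnaSeqs. */
{
struct dnaSeq *seq, *next;
for (seq = *pList; seq != NULL; seq = next)
    {
    next = seq->next;
    dnaSeqFree(&seq);
    }
*pList = NULL;
}
```

Taking a pointer to the caller's pointer lets the free functions set it to NULL, so a stale pointer fails fast instead of silently reading freed memory.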
25
struct screenObj
/* A two dimensional object in a sleazy video game. */
    {
    struct screenObj *next;  /* Next in list. */
    char *name;              /* Object name. */
    int x,y,width,height;    /* Bounds of object. */
    void (*draw)(struct screenObj *obj);  /* Draw object */
    boolean (*in)(struct screenObj *obj, int x, int y);
                             /* Return true if x,y is in object */
    void *custom;            /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);
                             /* Free custom data. */
    };

#define screenObjDraw(obj) (obj->draw(obj))
/* Draw object. */

void screenObjFree(struct screenObj **pObj);
/* Free up screen object including custom part. */
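To see the function-pointer polymorphism at work, the sketch below gives two kinds of screen object different `in` methods and calls them through one macro. It is a cut-down illustration: the struct keeps only the fields needed here, and `screenObjNew`, `screenObjIn`, `boxIn` and `ballIn` are hypothetical names that do not appear on the slide.

```c
#include <assert.h>
#include <stdlib.h>

#define boolean int

struct screenObj
/* Cut-down version of the slide's structure. */
    {
    struct screenObj *next;  /* Next in list. */
    char *name;              /* Object name.  Not copied, not freed. */
    int x,y,width,height;    /* Bounds of object. */
    boolean (*in)(struct screenObj *obj, int x, int y);
                             /* Return true if x,y is in object. */
    };

#define screenObjIn(obj, x, y) ((obj)->in((obj), (x), (y)))
/* Call the object's own hit test. */

boolean boxIn(struct screenObj *obj, int x, int y)
/* A point is in a box if it is inside the bounds. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

boolean ballIn(struct screenObj *obj, int x, int y)
/* A point is in a ball if it is within the circle centered in
 * the bounds.  Assumes width equals height. */
{
int radius = obj->width / 2;
int dx = x - (obj->x + radius);
int dy = y - (obj->y + radius);
return dx*dx + dy*dy <= radius*radius;
}

struct screenObj *screenObjNew(char *name, int x, int y, int width,
    int height, boolean (*in)(struct screenObj *obj, int x, int y))
/* Make a new object with the given bounds and hit test. */
{
struct screenObj *obj = calloc(1, sizeof(*obj));
assert(in != NULL);
if (obj == NULL)
    abort();
obj->name = name;
obj->x = x;
obj->y = y;
obj->width = width;
obj->height = height;
obj->in = in;
return obj;
}
```

Code that walks a list of screen objects calls `screenObjIn` on each one without knowing which kind it holds; adding a new kind means writing one more function, not editing the callers.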
26Naming Conventions
- Code is constrained by few natural laws.
- There are many ways to do things, so programmers make arbitrary decisions.
- Arbitrary decisions are hard to remember.
- Conventions make decisions less arbitrary.
- varName vs. VarName vs. varname vs. var_name. We use varName.
- variable vs. var vs. vrbl vs. vble vs. varible: if you need to abbreviate, keep it short.
27Commenting Conventions
- Each module has a comment describing its overall purpose.
- Each function also has an overall comment.
- Each field in a structure has a comment.
- Longer functions are broken into paragraphs that each begin with a comment.
- The module, function, and structure comments are replicated in the .h file, which serves as an index to the module.
28Error Handling
- Code prints out a message and aborts (via the errAbort function) when there is a problem.
- This saves loads of error handling code and is generally the right thing to do.
- You can catch an errAbort if necessary, though it rarely is.
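The abort-by-default, catch-if-needed pattern can be sketched with the standard `setjmp`/`longjmp` calls. This is an illustration of the idea only, not the project's errAbort implementation: `sketchAbort`, `parsePositive` and `tryParsePositive` are hypothetical names, and the single global catcher is a simplifying assumption.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf *catcher = NULL;  /* Active catcher, or NULL if none. */
static char lastMessage[256];    /* Most recent abort message. */

void sketchAbort(const char *format, ...)
/* Format a message, then jump to the catcher if there is one,
 * otherwise print the message and exit. */
{
va_list args;
va_start(args, format);
vsnprintf(lastMessage, sizeof(lastMessage), format, args);
va_end(args);
if (catcher != NULL)
    longjmp(*catcher, 1);
fprintf(stderr, "%s\n", lastMessage);
exit(1);
}

int parsePositive(const char *text)
/* Return text as a positive integer.  Aborts on bad input, so
 * callers have no error code to check. */
{
int value;
assert(text != NULL);
value = atoi(text);
if (value <= 0)
    sketchAbort("Expecting positive number, got '%s'", text);
return value;
}

int tryParsePositive(const char *text, int *retValue)
/* Catch an abort from parsePositive.  Returns 1 and sets
 * *retValue on success, returns 0 if the parse aborted. */
{
jmp_buf env;
catcher = &env;
if (setjmp(env) == 0)
    {
    *retValue = parsePositive(text);
    catcher = NULL;
    return 1;
    }
catcher = NULL;
return 0;
}
```

Most callers would use `parsePositive` directly and let the program stop with a message; only the rare caller that can recover needs the `tryParsePositive` wrapper.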
29Memory
- Uninitialized memory leads to difficult bugs.
- Compiler set to warn of uninitialized vars
- Dynamic memory goes through needMem. It is always zeroed.
- Memory usually freed with freez(), which sets pointer to null as well as freeing it.
- Careful memory handler can be pushed to help track down memory bugs
- Sentinel values to detect writing past end of array
- Detects memory freed twice or not freed
- Detects heap corruption in general.
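Three of these ideas (zeroed allocation, free-and-null, and a sentinel after each block) fit in a short sketch. The names `carefulAlloc`, `carefulIsIntact`, `carefulFree` and `carefulFreez` are hypothetical stand-ins, and the header layout is an assumption chosen for illustration; this is not the project's needMem or careful memory handler, and it does not attempt double-free detection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Arbitrary marker bytes written just past each block. */
static const unsigned char sentinel[4] = {0xDE, 0xAD, 0xBE, 0xEF};

union carefulHeader
/* Hidden header placed in front of each block. */
    {
    size_t size;        /* Size the caller asked for. */
    max_align_t align;  /* Keeps the caller's block aligned. */
    };

void *carefulAlloc(size_t size)
/* Return zeroed memory followed by a sentinel. */
{
union carefulHeader *header =
    calloc(1, sizeof(*header) + size + sizeof(sentinel));
unsigned char *block;
if (header == NULL)
    abort();
header->size = size;
block = (unsigned char *)(header + 1);
memcpy(block + size, sentinel, sizeof(sentinel));
return block;
}

int carefulIsIntact(void *pt)
/* Return 1 if nothing has written over the sentinel. */
{
union carefulHeader *header = (union carefulHeader *)pt - 1;
return memcmp((unsigned char *)pt + header->size,
    sentinel, sizeof(sentinel)) == 0;
}

void carefulFree(void *pt)
/* Check the sentinel, then free the block.  NULL is ignored. */
{
if (pt != NULL)
    {
    assert(carefulIsIntact(pt));
    free((union carefulHeader *)pt - 1);
    }
}

#define carefulFreez(pPt) (carefulFree(*(pPt)), *(pPt) = NULL)
/* Free the block and set the caller's pointer to NULL. */
```

Because every block starts zeroed and every freed pointer becomes NULL, the two most common memory mistakes show up as an immediate, repeatable failure rather than as occasional corruption.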
31Generally Useful Modules
- String handling - common, dystring, wildcmp
- Collections - common (singly linked lists), hash, dlist, binRange, rbTree
- DNA - dnautils, dnaseq
- Web - htmshell, cheapcgi, htmlPage
- I/O - linefile, xap (XML), fa, nib, twoBit, blastParse, blastOut, maf, chain, gff
- Graphics - memgfx, gifwrite, psGfx, vGfx
32Anatomy of a CGI Script
- Gets called by Web Server when user clicks submit or follows a cgi link.
- Input is in environment variables and sometimes also stdin. Routines in cheapCgi move this to a hash table.
- Output is to stdout. Routines in htmshell help with output formatting.
- In the middle often access a database.
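The input step can be sketched as follows: a query string of the form name=value&name=value is split into a list of variables. This is an illustration under simplifying assumptions, not the cheapCgi code: it uses a linked list instead of a hash table, skips URL decoding, and the names `cgiVarSketch`, `cgiVarsFromQuery` and `cgiVarFind` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct cgiVarSketch
/* One name/value pair from a query string. */
    {
    struct cgiVarSketch *next;  /* Next in list. */
    char *name;                 /* Variable name. */
    char *value;                /* Variable value, not URL-decoded. */
    };

char *cloneRange(const char *start, size_t size)
/* Return a null-terminated copy of size bytes at start. */
{
char *copy = calloc(size + 1, 1);
if (copy == NULL)
    abort();
memcpy(copy, start, size);
return copy;
}

struct cgiVarSketch *cgiVarsFromQuery(const char *query)
/* Split name=value&name=value into a list, in input order.
 * A name with no '=' gets an empty value. */
{
struct cgiVarSketch *list = NULL, **pTail = &list;
const char *s = query;
assert(query != NULL);
while (*s != 0)
    {
    size_t pairLen = strcspn(s, "&");
    size_t nameLen = strcspn(s, "=&");
    if (pairLen > 0)
        {
        struct cgiVarSketch *var = calloc(1, sizeof(*var));
        if (var == NULL)
            abort();
        var->name = cloneRange(s, nameLen);
        if (s[nameLen] == '=')
            var->value = cloneRange(s + nameLen + 1, pairLen - nameLen - 1);
        else
            var->value = cloneRange(s, 0);
        *pTail = var;
        pTail = &var->next;
        }
    s += pairLen;
    if (*s == '&')
        ++s;
    }
return list;
}

char *cgiVarFind(struct cgiVarSketch *list, const char *name)
/* Return value of the named variable, or NULL if absent. */
{
struct cgiVarSketch *var;
for (var = list; var != NULL; var = var->next)
    if (strcmp(var->name, name) == 0)
        return var->value;
return NULL;
}
```

In a real CGI program the string would come from `getenv("QUERY_STRING")`, which can return NULL and should be checked first. The list is never freed here because a CGI process exits after a single request.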
33Challenges of CGI
- Each click launches program anew.
- User state can be kept in cart variables
- Run from Web Server, harder to debug
- Use cgiSpoof to run from command line
- Push an error handler that will close out web page, so you can see your error messages. htmShell does this, but webShell may not.
- Ideally should run in less than 2 seconds.
34Relational Databases
- Relational databases consist of tables, indices, and the Structured Query Language (SQL).
- Tables are much like tab-separated files

    chrom  start     end       name   strand  score
    chr22  14600000  14612345  ldlr   +       0.989
    chr21  18283999  18298577  vldlr  -       0.998

- Fields are simple - no lists or substructures.
- Can join tables based on a shared field. This is flexible, but only as fast as the index.
- Tables and joins are accessed a row at a time.
- The row is represented as an array of strings.
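The "row as an array of strings" idea carries over directly to tab-separated files. As a minimal sketch, the hypothetical function `chopTabLine` below splits one line in place and fills an array of field pointers, which is the same shape a database row arrives in; it is written for illustration and is not taken from the project libraries.

```c
#include <assert.h>
#include <string.h>

int chopTabLine(char *line, char *row[], int maxFields)
/* Split line in place at tabs, storing the start of each field
 * in row.  Returns the number of fields stored, at most maxFields.
 * Fields beyond maxFields are ignored. */
{
int count = 0;
char *s = line;
assert(line != NULL && row != NULL);
while (count < maxFields)
    {
    char *tab;
    row[count++] = s;
    tab = strchr(s, '\t');
    if (tab == NULL)
        break;
    *tab = 0;     /* Terminate this field. */
    s = tab + 1;  /* Next field starts after the tab. */
    }
return count;
}
```

No memory is allocated: the pointers in `row` refer into the caller's line buffer, so they stay valid only as long as that buffer does.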
35Converting A Row to Object
struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from exoFish
 * from database.  Dispose of this with exoFishFree(). */
{
struct exoFish *ret;

AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}
36Motivation for AutoSql
- Row to object code is tedious at best.
- Also have save object, free object code to write.
- SQL create statement needs to match C structure.
- Lack of lists without doing a join can seriously
impact performance and complicate schema.
37AutoSql Data Declaration
table exoFish
"An evolutionarily conserved region (ecore) with Tetraodon"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "Ecore name in Genoscope database"
    uint score;        "Score from 0 to 1000"
    )

See autoSql.doc for more details. See also autoXml.
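To make the mapping concrete, here is a hand-written sketch of the kind of C a generator could derive from this declaration: the structure, a save routine and a free routine. It is an illustration only and should not be read as actual autoSql output; in particular `exoFishTabLine`, which writes into a caller-supplied buffer, is a hypothetical name and signature chosen to keep the example self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct exoFish
/* An evolutionarily conserved region (ecore) with Tetraodon. */
    {
    struct exoFish *next;  /* Next in singly linked list. */
    char *chrom;           /* Human chromosome or FPC contig. */
    unsigned chromStart;   /* Start position in chromosome. */
    unsigned chromEnd;     /* End position in chromosome. */
    char *name;            /* Ecore name in Genoscope database. */
    unsigned score;        /* Score from 0 to 1000. */
    };

int exoFishTabLine(struct exoFish *el, char *buf, size_t bufSize)
/* Write el as one tab-separated line into buf.  Returns the
 * length snprintf reports; a result >= bufSize means the line
 * was truncated. */
{
assert(el != NULL && buf != NULL);
return snprintf(buf, bufSize, "%s\t%u\t%u\t%s\t%u\n",
    el->chrom, el->chromStart, el->chromEnd, el->name, el->score);
}

void exoFishFree(struct exoFish **pEl)
/* Free a single exoFish and set pointer to NULL. */
{
struct exoFish *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}
```

Each field of the declaration appears once in the structure, once in the format string and, for strings, once in the free routine, which is exactly the repetition a generator removes.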
38Coding Conclusion
- It's always safer on the lagging edge
- Consider redesigning system as COBOL character-based application
39UCSC Gene Family Browser
Expression and other information on genes in a
big sorted, linked table
51Up in Testes, Down in Brain
64Conclusions
- Genome browser - good for exploring genome and displaying your custom tracks
- kent code base - a good starting point for many programming projects
- Family browser - a fine way to collect data sets.
- Browser staff - helpful but overworked.