Title: HTC in Research & Education
1. HTC in Research & Education
2. Claims for benefits provided by Distributed Processing Systems
- High Availability and Reliability 
- High System Performance 
- Ease of Modular and Incremental Growth 
- Automatic Load and Resource Sharing 
- Good Response to Temporary Overloads 
- Easy Expansion in Capacity and/or Function 
"What is a Distributed Data Processing System?", P.H. Enslow, Computer, January 1978
3. Democratization of Computing: You do not need to be a super-person to do super-computing
4. Searching for small RNA candidates in a kingdom: 45 CPU days
[Pipeline diagram: NCBI FTP sequence files (.ffn, .fna, .ptt, .gbk) are parsed by IGRExtract3 and FFN_parse into intergenic regions (IGRs); candidates are screened with RNAMotif, FindTerm, TransTerm, QRNA, Patser, and repeated BLAST runs against terminators, secondary-structure conservation, homology, paralogy, synteny, TFBS matrices, and known sRNAs and riboswitches; sRNAPredict selects candidate loci and sRNA_Annotate emits annotated candidate sRNA-encoding genes.]
5. Education and Training
- Computer Science: develop and implement novel HTC technologies (horizontal)
- Domain Sciences: develop and implement end-to-end HTC capabilities that are fully integrated in the scientific discovery process (vertical)
- Experimental Methods: develop and implement a curriculum that harnesses HTC capabilities to teach how to use modeling and numerical data to answer scientific questions
- System Management: develop and implement a curriculum that uses HTC resources to teach how to build, deploy, maintain, and operate distributed systems
6- "As we look to hire new graduates, both at the 
 undergraduate and graduate levels, we find that
 in most cases people are coming in with a good,
 solid core computer science traditional education
 ... but not a great, broad-based education in all
 the kinds of computing that near and dear to our
 business."
- Ron BrachmanVice President of Worldwide Research 
 Operations, Yahoo
7. Yahoo! Inc., a leading global Internet company, today announced that it will be the first in the industry to launch an open source program aimed at advancing the research and development of systems software for distributed computing. Yahoo's program is intended to leverage its leadership in Hadoop, an open source distributed computing sub-project of the Apache Software Foundation, to enable researchers to modify and evaluate the systems software running on a 4,000-processor supercomputer provided by Yahoo. Unlike other companies and traditional supercomputing centers, which focus on providing users with computers for running applications and for coursework, Yahoo's program focuses on pushing the boundaries of large-scale systems software research.
8. 1986-2006: Celebrating 20 years since we first installed Condor in our CS department
9. Integrating Linux Technology with Condor. Kim van der Riet, Principal Software Engineer
10. What will Red Hat be doing?
- Red Hat will be investing in the Condor project locally in Madison, WI, in addition to driving work required in upstream and related projects. This work will include:
- Engineering on Condor features and infrastructure
- Should result in tighter integration with related technologies
- Tighter kernel integration
- Information transfer between the Condor team and Red Hat engineers working on things like Messaging, Virtualization, etc.
- Creating and packaging Condor components for Linux distributions
- Support for Condor packaged in Red Hat distributions
- All work goes back to upstream communities, so this partnership will benefit all.
- Shameless plug: If you want to be involved, Red Hat is hiring...
11. High Throughput Computing on Blue Gene
- IBM Rochester: Amanda Peters, Tom Budnik
- With contributions from:
- IBM Rochester: Mike Mundy, Greg Stewart, Pat McCarthy
- IBM Watson Research: Alan King, Jim Sexton
- UW-Madison Condor: Greg Thain, Miron Livny, Todd Tannenbaum
12. Condor and IBM Blue Gene Collaboration
- Both IBM and Condor teams engaged in adapting 
 code to bring Condor and Blue Gene technologies
 together
- Initial Collaboration (Blue Gene/L) 
- Prototype/research Condor running HTC workloads 
 on Blue Gene/L
- Condor developed dispatcher/launcher running HTC 
 jobs
- Prototype work for Condor being performed on 
 Rochester On-Demand Center Blue Gene system
- Mid-term Collaboration (Blue Gene/L) 
- Condor supports HPC workloads along with HTC 
 workloads on Blue Gene/L
- Long-term Collaboration (Next Generation Blue 
 Gene)
- I/O Node exploitation with Condor 
- Partner in design of HTC services for Next 
 Generation Blue Gene
- Standardized launcher, boot/allocation services, 
 job submission/tracking via database, etc.
- Study ways to automatically switch between 
 HTC/HPC workloads on a partition
- Data persistence (persisting data in memory 
 across executables)
- Data affinity scheduling 
- Petascale environment issues 
13. The Grid: Blueprint for a New Computing Infrastructure, edited by Ian Foster and Carl Kesselman, July 1998, 701 pages.
The grid promises to fundamentally change the way 
we think about and use computing. This 
infrastructure will connect multiple regional and 
national computational grids, creating a 
universal source of pervasive and dependable 
computing power that supports dramatically new 
classes of applications. The Grid provides a 
clear vision of what computational grids are, why 
we need them, who will use them, and how they 
will be programmed. 
14. "We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in production mode continuously even in the face of component failures."
Miron Livny & Rajesh Raman, "High Throughput Resource Management", in The Grid: Blueprint for a New Computing Infrastructure.
16. CERN 92
17. The search for SUSY (Super-Symmetry)
- Sanjay Padhi is a UW Chancellor Fellow who is working in the group of Prof. Sau Lan Wu located at CERN (Geneva)
- Using Condor technologies he established a grid access point in his office at CERN
- Through this access point he managed to harness, in 3 months (12/05-2/06), more than 500 CPU years from the LHC Computing Grid (LCG), the Open Science Grid (OSG), the Grid Laboratory Of Wisconsin (GLOW), and local group-owned desktop resources.
18. High Throughput Computing
- We first introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center in July of 1996, and a month later at the European Laboratory for Particle Physics (CERN). In June of 1997, HPCWire published an interview on High Throughput Computing.
19. Why HTC?
- For many experimental scientists, scientific 
 progress and quality of research are strongly
 linked to computing throughput. In other words,
 they are less concerned about instantaneous
 computing power. Instead, what matters to them is
 the amount of computing they can harness over a
 month or a year --- they measure computing power
 in units of scenarios per day, wind patterns per
week, instruction sets per month, or crystal
 configurations per year.
20. High Throughput Computing is a 24-7-365 activity
FLOPY ≠ (60 × 60 × 24 × 7 × 52) × FLOPS
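A quick sanity check of the arithmetic behind the slogan: a year holds 60 × 60 × 24 × 7 × 52 = 31,449,600 seconds, yet no system sustains its peak FLOPS through all of them. Here is a minimal sketch of the distinction; the 1 TFLOPS machine and the utilization figures are illustrative assumptions, not measurements:

```python
# FLOPY (floating-point operations per year) is not peak FLOPS times the
# seconds in a year: outages, scheduling gaps, and idle cycles make
# sustained throughput, not instantaneous speed, the quantity that counts.

SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52  # 31,449,600

def flopy(peak_flops: float, utilization: float) -> float:
    """Operations actually delivered in a year at a given average utilization."""
    return peak_flops * utilization * SECONDS_PER_YEAR

peak = 1e12  # hypothetical 1 TFLOPS machine
print(f"upper bound (100%): {flopy(peak, 1.00):.3e} FLOP/year")
print(f"at 30% utilization: {flopy(peak, 0.30):.3e} FLOP/year")
```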
21. High Throughput Computing
EPFL 97
- Miron Livny 
- Computer Sciences 
- University of Wisconsin-Madison 
- miron@cs.wisc.edu
22. Customers of HTC
- Most HTC applications follow the Master-Worker paradigm, where a group of workers executes a loosely coupled heap of tasks controlled by one or more masters (see the sketch below)
- Job Level: tens to thousands of independent jobs
- Task Level: a parallel application (PVM, MPI-2) that consists of a small group of master processes and tens to hundreds of worker processes
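To make the Master-Worker paradigm concrete, here is a minimal sketch using Python's standard multiprocessing module; the squaring task, pool size, and task count are illustrative assumptions, not part of the original slide:

```python
# Master-worker in miniature: one master process hands a loosely coupled
# heap of independent tasks to a pool of workers and collects results in
# whatever order the workers finish them.
from multiprocessing import Pool

def run_task(task_id: int) -> tuple[int, int]:
    """Stand-in for one independent unit of work (hypothetical)."""
    return task_id, task_id * task_id

if __name__ == "__main__":
    tasks = range(1000)              # the "heap" of tasks
    results = {}
    with Pool(processes=8) as pool:  # the workers
        # imap_unordered yields results as they complete; the master
        # does not care about ordering, only about total throughput.
        for task_id, value in pool.imap_unordered(run_task, tasks):
            results[task_id] = value
    print(len(results), "tasks completed")
```

The same shape scales from a single machine's process pool up to tens of thousands of independent jobs on a grid; only the dispatch mechanism changes.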
23. The Challenge
-  Turn large collections of existing 
 distributively owned computing resources into
 effective High Throughput Computing Environments
- Minimize Wait while Idle
24. Obstacles to HTC
- Ownership Distribution (Sociology)
- Size and Uncertainties (Robustness)
- Technology Evolution (Portability)
- Physical Distribution (Technology)
25. Sociology
- Make owners (and system administrators) happy.
- Give owners full control of:
- when and by whom private resources are used for HTC
- impact of HTC on private Quality of Service
- membership and information on HTC related activities
- No changes to existing software, and make it easy to install, configure, monitor, and maintain (a policy sketch follows below)
Happy owners → more resources → higher throughput
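One way to picture "full control" is as an owner-supplied predicate that the HTC system must satisfy before, and while, it borrows a resource. A minimal sketch in that spirit; the idle-time and load thresholds are illustrative assumptions, and this is not Condor's actual policy syntax:

```python
# Owner-controlled usage policy: HTC work may start only when the
# predicate holds, and must vacate the instant the owner returns.
from dataclasses import dataclass

@dataclass
class MachineState:
    keyboard_idle_secs: float  # time since the owner last touched the machine
    load_avg: float            # current CPU load average

def may_start_htc(m: MachineState) -> bool:
    """Owner's rule: borrow the machine only when it is clearly idle."""
    return m.keyboard_idle_secs > 15 * 60 and m.load_avg < 0.3

def must_vacate(m: MachineState) -> bool:
    """Any owner activity reclaims the machine immediately."""
    return m.keyboard_idle_secs < 5

print(may_start_htc(MachineState(keyboard_idle_secs=1800, load_avg=0.1)))  # True
print(must_vacate(MachineState(keyboard_idle_secs=2.0, load_avg=0.8)))     # True
```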
26. Sociology
- Owners look for a verifiable contract with the HTC environment that spells out the rules of engagement.
- System administrators do not like weird 
 distributed applications that have the potential
 of interfering with the happiness of their
 interactive users.
27. Robustness
- To be effective, an HTC environment must run as a 24-7-365 operation.
- Customers count on it
- Debugging and fault isolation may be a very time-consuming process
- In a large distributed system, everything that might go wrong will go wrong (see the requeue sketch below)
Robust system → less down time → higher throughput
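Since failure is the normal case at this scale, a throughput-oriented system requeues failed work instead of stopping. A minimal sketch of that pattern; the task runner and the retry limit are illustrative assumptions:

```python
# Requeue-on-failure: a failed task goes back on the queue rather than
# halting the system, so throughput degrades gracefully under faults.
from collections import deque

MAX_ATTEMPTS = 3

def execute(task: str) -> None:
    """Stand-in for running a task on some remote resource (hypothetical);
    assume it raises an exception when a component fails."""

def run_all(tasks: list[str]) -> list[str]:
    """Run every task, retrying failures; return the tasks given up on."""
    queue = deque((task, 1) for task in tasks)
    abandoned = []
    while queue:
        task, attempt = queue.popleft()
        try:
            execute(task)
        except Exception:
            if attempt < MAX_ATTEMPTS:
                queue.append((task, attempt + 1))  # requeue and try again later
            else:
                abandoned.append(task)  # escalate to a human for fault isolation
    return abandoned
```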
28. Portability
- To be effective, the HTC software must run on and support the latest and greatest hardware and software.
- Owners select hardware and software according to their needs and tradeoffs
- Customers expect it to be there.
- Application developers expect only a few (if any) changes to their applications.
Portability → more platforms → higher throughput
29. Technology
- An HTC environment is a large, dynamic and evolving Distributed System
- Autonomous and heterogeneous resources 
- Remote file access 
- Authentication 
- Local and wide-area networking
30. Robust and Portable Mechanisms Hold The Key To High Throughput Computing
Policies play only a secondary role in HTC
31. Leads to a bottom-up approach to building and operating distributed systems
32. My jobs should run ...
- on my laptop if it is not connected to the network
- on my group resources if my certificate expired
- on my campus resources if the meta-scheduler is down
- on my national resources if the trans-Atlantic link was cut by a submarine (see the fallback sketch below)
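A minimal sketch of the bottom-up fallback this slide describes: try each resource layer from the widest pool inward and settle on the closest one that is reachable. The layer names and the availability probes are illustrative assumptions:

```python
# Bottom-up placement: an outage at any outer layer (severed link, dead
# meta-scheduler, expired certificate) pushes the job inward instead of
# stranding it; the laptop is the layer of last resort.
from typing import Callable

def place_job(job: str, layers: list[tuple[str, Callable[[], bool]]]) -> str:
    for name, is_reachable in layers:
        if is_reachable():
            return name  # run on the widest reachable layer
    raise RuntimeError(f"no resource layer can run {job!r}")

# Hypothetical probes, widest pool first; here every outer layer is down.
layers = [
    ("national", lambda: False),  # trans-Atlantic link cut
    ("campus",   lambda: False),  # meta-scheduler down
    ("group",    lambda: False),  # certificate expired
    ("laptop",   lambda: True),   # always available locally
]
print(place_job("analysis-42", layers))  # -> laptop
```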
33. The Open Science Grid (OSG)
- Miron Livny, OSG PI and Facility Coordinator
- Computer Sciences Department
- University of Wisconsin-Madison
Supported by the Department of Energy Office of 
Science SciDAC-2 program from the High Energy 
Physics, Nuclear Physics and Advanced Software 
and Computing Research programs, and the 
National Science Foundation Math and Physical 
Sciences, Office of CyberInfrastructure and 
Office of International Science and Engineering 
Directorates. 
34. The Evolution of the OSG
[Timeline, 1999-2009: the PPDG (DOE), GriPhyN (NSF), iVDGL (NSF), and DOE Science Grid projects come together as Trillium and then Grid3, which evolves into the OSG; in parallel, LIGO moves from preparation to operation, the LHC from construction and preparation to operations, and the European grid and Worldwide LHC Computing Grid grow alongside campus and regional grids.]
35. The Open Science Grid vision
- Transform processing- and data-intensive science through a cross-domain, self-managed, national distributed cyber-infrastructure that brings together campus and community infrastructure and facilitates the needs of Virtual Organizations (VOs) at all scales
36. D0 Data Re-Processing
[Plots: Total Events and OSG CPU-Hours/Week]
- 12 sites contributed up to 1,000 jobs/day
- ~2M CPU hours, 286M events, 286K jobs on OSG
- 48TB input data, 22TB output data
37. The Three Cornerstones: National, Campus, Community
Need to be harmonized into a well-integrated whole.
38. OSG challenges
- Develop the organizational and management 
 structure of a consortium that drives such a
 Cyber Infrastructure
- Develop the organizational and management 
 structure for the project that builds, operates
 and evolves such Cyber Infrastructure
- Maintain and evolve a software stack capable of 
 offering powerful and dependable capabilities
 that meet the science objectives of the NSF and
 DOE scientific communities
- Operate and evolve a dependable and well managed 
 distributed facility
39. 6,400 CPUs available
- Campus Condor pool backfills idle nodes in PBS clusters: provided 5.5 million CPU-hours in 2006, all from idle nodes in clusters
- Use on TeraGrid: 2.4 million hours in 2006 spent building a database of hypothetical zeolite structures; 2007: 5.5 million hours allocated to TG
http://www.cs.wisc.edu/condor/PCW2007/presentations/cheeseman_Purdue_Condor_Week_2007.ppt
40. Clemson Campus Condor Pool
- Machines in 27 different locations on campus
- 1,700 job slots
- >1.8M hours served in 6 months
- Users from Industrial and Chemical Engineering, and Economics
- Fast ramp-up of usage
- Accessible to the OSG through a gateway
41. Grid Laboratory of Wisconsin
2003 initiative funded by NSF(MIR)/UW at $1.5M. Second phase funded in 2007 by NSF(MIR)/UW at $1.5M. Six initial GLOW sites:
- Computational Genomics, Chemistry 
- Amanda, Ice-cube, Physics/Space Science 
- High Energy Physics/CMS, Physics 
- Materials by Design, Chemical Engineering 
- Radiation Therapy, Medical Physics 
- Computer Science
Diverse users with different deadlines and usage 
patterns. 
42. GLOW Usage 4/04-11/08: Over 35M CPU hours served!
43. The next 20 years
- We all came to this meeting because we believe in 
 the value of HTC and are aware of the challenges
 we face in offering researchers and educators
 dependable HTC capabilities.
- We all agree that HTC is not just about technologies but is also very much about people: users, developers, administrators, accountants, operators, policy makers, ...