Title: Campus & State Grids in Texas
1. Campus & State Grids in Texas
- Jay Boisseau
- Texas Advanced Computing Center
- The University of Texas at Austin
- June 23, 2005
2. TACC Grid Deployment Projects
- TACC is involved in grids at five scales
- Campus: UT Grid
- State: Texas Internet Grid for Research & Education (new)
- Regional: SURA Grid (planning phases)
- National: TeraGrid
- International: Open Science Grid (just joining)
3. TACC Grid Technology Projects
- GridPort: grid portal toolkit
- also building grid portals, of course
- GPIR: Web services-based grid resource information system
- GridShell: shell environment for managing jobs and data on grids
- MyCluster: virtualizing grid resources for local clusters
- Scheduling Prediction Services: providing estimates for queue waits, execution times, data transfer times
4. UT GRID
5. UT Grid: Develop and Provide a Unique, Comprehensive Cyberinfrastructure
- The strategy of the UT Grid project is to integrate
- common security/authentication
- scheduling and provisioning
- aggregation and coordination
- diverse campus resources
- computational (PCs, servers, clusters)
- storage (local HDs, NASes, SANs, archives)
- visualization (PCs, workstations, displays, projection rooms)
- data collections (sci/eng, social sciences, communications, etc.)
- instruments & sensors (CT scanners, telescopes, etc.)
- from personal scale to terascale
- personal laptops and desktops
- department servers and labs
- institutional (and national) high-end facilities
6. That Provides Maximum Opportunity & Capability for Impact in Research & Education
- into a campus cyberinfrastructure
- evaluate existing grid computing technologies
- develop new grid technologies
- deploy and support appropriate technologies for production use
- continue evaluation and R&D on new technologies
- share expertise, experiences, software & techniques
- that provides simple access to all resources
- through web portals
- from personal desktop/laptop PCs, via custom CLIs and GUIs
- to the entire community for maximum impact on
- computational research in application domains
- educational programs
- grid computing R&D
7. Add Services Incrementally, Driven by User Requirements
8. Hub & Spoke Approach
- Deploying a P2P campus grid requires overcoming two trust issues
- trusting grid software reliability, security, and performance
- trusting each other not to abuse one another's resources
- An advanced computing center presents an opportunity to build a centrally managed grid as a step toward a P2P grid
- already has trust relationships with users
- so, when facing both issues, install grid software centrally first
- create centrally managed services
- create spokes from the central hub
- then, when the grid software is trusted
- show usage and capability data to demonstrate opportunity
- show policies and procedures to ensure fairness
- negotiate spokes among willing participants
9. UT Grid Logical View
- Integrate a set of resources (clusters, storage systems, etc.) within TACC first
[Diagram: TACC compute, visualization, storage, and data resources (actually spread across two campuses)]
10. UT Grid Logical View
- Next, add other UT resources using the same tools and procedures
[Diagram: TACC compute, vis, storage, and data hub with spokes to the ACES cluster, ACES data, and ACES PCs]
11. UT Grid Logical View
- Next, add other UT resources using the same tools and procedures
[Diagram: TACC hub with spokes to ACES resources (cluster, data, PCs) and GEO resources (two clusters, data)]
12. UT Grid Logical View
- Next, add other UT resources using the same tools and procedures
[Diagram: TACC hub with spokes to ACES (cluster, data, PCs), GEO (two clusters, data), PGE (cluster, data, instrument), and BIO (data, instrument) resources]
13. UT Grid Logical View
- Finally, negotiate connections between spokes for willing participants to develop a P2P grid.
[Diagram: the same TACC, ACES, GEO, PGE, and BIO resources, now with peer-to-peer connections negotiated between spokes]
14. Distributed Serial Computing: Roundup
- Roundup consists of UT Austin campus desktops and servers running the United Devices Grid MP software
- Clients are pooled together to make up a single UT Grid resource
- Resources contributed by several UT organizations: TACC, ICES, CoE, ITS, etc.
- 1,500 CPUs available today
- Integrated into the TACC User Portal
- Production usage began April 1
- Identified future R&D opportunities for/with UD
15. Distributed Serial Computing: Rodeo
- Rodeo is a set of Condor pools of dedicated and non-dedicated resources
- Dedicated resources
- Condor Central Manager (collector and negotiator)
- TACC Condor pool can flock to CS and ICES pools as needed
- Non-dedicated resources
- Linux, Windows, and Mac resources are managed by Condor (similar to United Devices)
- Usage policy is configured by the resource owner, e.g. (a sample owner policy is sketched below):
- when there is no other activity
- when load (utilization) is low
- give preference to certain groups or users
- 700 CPUs across multiple pools
- In production since April 1
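The owner-controlled policies above map onto Condor's standard startd configuration expressions. A minimal sketch, assuming illustrative thresholds, user names, and a local config path rather than the actual Rodeo settings:

```python
# Sketch: write a Condor local configuration expressing an owner policy of
# "run jobs only when my machine is idle and lightly loaded, suspend them
# when I come back, and prefer certain users". START/SUSPEND/CONTINUE/RANK
# and the KeyboardIdle/LoadAvg attributes are standard Condor knobs; the
# thresholds, user names, and file path are illustrative assumptions.
POLICY = """
START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
SUSPEND  = (KeyboardIdle < 60) || (LoadAvg > 0.7)
CONTINUE = (KeyboardIdle > 5 * 60) && (LoadAvg < 0.3)
RANK     = (Owner == "alice") || (Owner == "bob")
"""

with open("/etc/condor/condor_config.local", "a") as cfg:  # path is an assumption
    cfg.write(POLICY)
```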
16. Distributed Parallel Computing: CSF
- Community Scheduling Framework (CSF) is an open source framework for meta-scheduling
- Coordinates communications between multiple heterogeneous resource managers
- LSF, GRAM
- Issues
- provides a metascheduler framework (only)
- current functionality inadequate
- tightly coupled with the Globus Toolkit
- requires significant investment in development, maintenance, and support
17. Distributed Parallel Computing: CSF
- UT Grid team
- ported CSF to Globus Toolkit 3.2, 3.2.1
- started development of a scheduler plug-in
- contributed development back to the CSF project
- completed a technology evaluation and a DeveloperWorks article
- Current status: stop and monitor
- current functionality inadequate
- requires significant investment in development, maintenance, and support
- CSF is tightly coupled to the Globus Toolkit, which makes it hard to upgrade
- currently monitoring CSF discussion lists to keep track of the project
18. Distributed Parallel Computing: Metascheduler Evaluation
Criterion                    Condor                 CSF             Moab
RM coverage
  LSF                        Y                      Y               Y
  PBS                        Y                      Y               Y
  SGE                        Y                      N               Y
  Condor                     Y                      N               N
  LoadLeveler                Y                      N               Y
Grid integration
  GSI                        Y                      Y               Y
  GRAM                       Y                      Y               Y
  GridFTP                    Y                      Y               Y
  Index Service              N                      Y               N
Customizable policies        Y                      N               Y
Availability / licensing     Open source, free (Y)  Free (Y)        Commercial (N)
Administration
  Interface                  Y                      Command line    Y
  Tools                      Y                      N               Y
  Dynamic updates            Y                      N               Y
NMI                          Y                      N               N
Research opportunities       Y                      Y               N
- Results of the evaluation
- CSF: development resources required; monitor evolution
- Condor: use as the resource broker for UT Grid
- MOAB, LSF MultiCluster: too expensive
- Portable designs to support other metaschedulers in the future
19. Condor for Distributed Parallel Computing
- The MPI universe can be used for running parallel jobs
- MPI jobs can be run only on dedicated Condor resources
- Condor does not preempt MPI jobs
- a single Condor submit description file is required (see the sketch below)
- supports staging specified files to each compute node (in case a shared file system does not exist)
- Alternate method: submit the MPI job using Condor-G
- submit a Globus universe job to a native resource manager (such as LSF)
- the Globus job uses bsub to submit the MPI job to LSF
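A minimal sketch of the single submit description file mentioned above, generated and submitted from Python; the executable, machine count, and file names are hypothetical, while the submit keywords are standard Condor ones:

```python
# Sketch: build an MPI-universe submit description and hand it to Condor.
# Executable, machine count, and file names are hypothetical examples.
import subprocess

SUBMIT = """\
universe                = MPI
executable              = my_mpi_app
machine_count           = 8
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat
log                     = my_mpi_app.log
output                  = my_mpi_app.out
error                   = my_mpi_app.err
queue
"""

with open("my_mpi_app.submit", "w") as f:
    f.write(SUBMIT)

# The job runs on dedicated resources and is not preempted once started.
subprocess.run(["condor_submit", "my_mpi_app.submit"], check=True)
```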
20. Resource Broker Service
- The Resource Broker is a central service of UT Grid
- advertises capabilities and specifics of resources
- Resource broker components
- catalog of resources
- Resource broker query from GUN, GUP
- sends a query string to the catalog service
- selected resources are returned, based on the input string
- Users select resources based on some criteria
- the user gets back the names of qualified resources
- scheduling decisions can be based on these results
- example: GridShell job submission
- In development! (a minimal query sketch follows)
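A minimal sketch of the catalog-query step, assuming illustrative record fields and resource entries rather than the actual UT Grid catalog schema or query syntax:

```python
# Conceptual sketch: match query criteria against advertised resource records
# and return the names of qualified resources. Field names and the example
# entries are illustrative assumptions, not the real catalog schema.

catalog = [
    {"name": "tacc-cluster", "arch": "x86_64", "cpus": 512, "rm": "LSF"},
    {"name": "aces-cluster", "arch": "i686",   "cpus": 64,  "rm": "PBS"},
    {"name": "geo-cluster",  "arch": "x86_64", "cpus": 128, "rm": "SGE"},
]

def query(criteria):
    """Return names of resources whose records satisfy every criterion."""
    return [r["name"] for r in catalog
            if all(r.get(key) == value for key, value in criteria.items())]

# A GUN or GUP client would issue a query like this and base its scheduling
# decision (e.g., a GridShell job submission) on the returned names.
print(query({"arch": "x86_64", "rm": "LSF"}))   # -> ['tacc-cluster']
```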
21. Scheduling Logic for the Resource Broker
- Serial jobs can be scheduled dynamically based on availability; parallel jobs are usually queued to busy systems
- Ideally, use the same model for parallel as for serial, but can't if data transfer times are long, which is common for big parallel jobs
- Can't stage data to all systems due to resource limitations
- Solution: predict the system likely to complete the job first
- estimate data transfer time to each possible system
- estimate queue wait time on each system
- estimate execution time on each system
- for each system, calculate max(t_trans, t_queue) + t_exec, then take the minimum across systems (see the sketch below)
- Currently working to design, then develop, a broker that includes predictions based on these three variables
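A minimal sketch of the prediction step, assuming data staging and queue waiting overlap; the inputs below are illustrative stand-ins for the real transfer, queue-wait, and execution-time prediction services:

```python
# Completion-time model for one system: the job can start only once its data
# has arrived AND it has reached the front of the queue, so
#     T = max(t_trans, t_queue) + t_exec
# and the broker picks the system with the smallest T.

def predicted_completion(t_trans, t_queue, t_exec):
    """Earliest predicted completion time on one system, in seconds."""
    return max(t_trans, t_queue) + t_exec

def pick_system(estimates):
    """estimates: {system: (t_trans, t_queue, t_exec)} -> best system name."""
    return min(estimates, key=lambda s: predicted_completion(*estimates[s]))

# Illustrative numbers only (seconds):
estimates = {
    "tacc-cluster": (1200,  600, 3600),   # slow staging, short queue
    "geo-cluster":  ( 300, 7200, 3000),   # local data, long queue
}
print(pick_system(estimates))             # -> 'tacc-cluster'
```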
22. File Transfer Services
- GridFTP
- high-performance, secure, reliable data transfer protocol
- incorporates GSI to enable secure authentication and communication over an open network
- enables third-party transfers between remote servers while the client manages the transfer (see the sketch below)
- Comprehensive File Transfer Portlet
- developed multiple file transfer capabilities
- uses NWS to estimate file transfer times
- enables monitoring and persistent storage of file transfers
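A minimal sketch of a client-mediated third-party GridFTP transfer, plus the kind of transfer-time estimate an NWS bandwidth prediction enables. It assumes a valid GSI proxy, globus-url-copy on the PATH, and illustrative host names, paths, file size, and bandwidth:

```python
# Third-party transfer: the client only mediates; the data moves directly
# between the two GridFTP servers.
import subprocess

def third_party_transfer(src_url, dst_url, parallel_streams=4):
    subprocess.run(
        ["globus-url-copy", "-p", str(parallel_streams), src_url, dst_url],
        check=True)

def estimated_transfer_seconds(size_bytes, predicted_bandwidth_bytes_per_s):
    """Crude estimate from an NWS-style bandwidth prediction."""
    return size_bytes / predicted_bandwidth_bytes_per_s

# ~160 s for 2 GB at a predicted 100 Mb/s (12.5 MB/s).
print(estimated_transfer_seconds(2e9, 12.5e6))

third_party_transfer(
    "gsiftp://source.example.edu/collections/nexrad/2005-06-01.tar",
    "gsiftp://dest.example.edu/scratch/nexrad/2005-06-01.tar")
```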
23. Grid Visualization
- Network bandwidth growth means remote visualization is possible
- bandwidth growing faster than display sizes!
- now possible to leverage powerful central/remote rendering/visualization resources, just like HPC
- requires s/w tools, demand/reservation scheduling, etc.
- With remote visualization enabled, collaborative visualization is possible
- requires further advances in tools, incl. integration of multiple keyboard/mouse inputs
- Want to enable both for grid visualization resources!
24. Grid Visualization
- Grid rendering is like traditional grid batch computing
- already used by animation studios
- Grid remote/collaborative visualization is our goal
- identify a rendering resource based on data, technique, availability
- move data to the rendering system based on reservation or demand
- calculate geometries and push geometry to the local device if bandwidth is not sufficient and the local graphics hardware is (a toy version of this decision is sketched below)
- render images and push pixels to the local display if bandwidth is sufficient
- still requires GSI, scheduling (on-demand and advance reservation), data management, etc.
- requires multi-platform clients for remote and collaborative vis
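The pixels-versus-geometry choice above reduces to a simple comparison; a toy sketch, with the threshold and the returned strings as illustrative assumptions:

```python
# Toy sketch of the delivery decision: stream rendered pixels when the link
# can carry them, otherwise ship geometry for local rendering if the local
# graphics hardware can handle it.
def choose_delivery(bandwidth_mbps, local_gpu_capable, pixel_stream_mbps=200):
    if bandwidth_mbps >= pixel_stream_mbps:
        return "render remotely, push pixels to the local display"
    if local_gpu_capable:
        return "compute geometry remotely, push geometry, render locally"
    return "neither path viable: schedule batch rendering instead"

print(choose_delivery(bandwidth_mbps=1000, local_gpu_capable=False))
print(choose_delivery(bandwidth_mbps=45,   local_gpu_capable=True))
```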
25. Initial UT Grid Visualization Services
- Installed Maverick as a terascale visualization resource with parallelism over commodity graphics
- TeraBurst V2D hardware for remote visualization
- high performance
- multi-tiled displays
- Sun 3D Server software for remote visualization
- evaluating versions of 3D Server not based on the X protocol to increase interactive performance
- Leveraging NSF TeraGrid activities heavily
26. Deploying Initial Remote Visualization Tools on Maverick
[Diagram: two remote visualization paths from Maverick: Sun 3D Server software serving a 3D Server client over Ethernet, and a TeraBurst V2D transmitter feeding a TeraBurst V2D receiver over Ethernet]
27. Data Collections Services
- TACC is hosting four UT scientific data collections
- Data were already available/used by researchers at low bandwidth
- Multiple data sets will be used in the Flood Modeling SGW
- Leverages strong relationships with UT Geosciences
- Enables researchers to use these data in high-end simulations and analyses, science gateways, etc.
Collection                 Initial Size   Projected Growth
NEXRAD Precipitation       200 GB         1-1.5 TB/year
MODIS Satellite Imagery    6 TB           5-6 TB/year
LiDAR Terrain              15 GB          ?
X-Ray CT Scan              1 TB           2 TB/year
28. Data Collections Activities
- Evaluating issues for using DBs for scientific data
- data schemas
- extensions for scientific data types (bio, geo)
- clusters for database I/O performance
- TeraGrid
- Leveraging NSF TeraGrid activities heavily
- Currently analyzing data collection requirements in TeraGrid; will utilize in UT Grid
29. Grid User Node (GUN)
- Campus users have PCs for research & education projects, and they are used to their local systems
- Researchers also often need additional resources
- need to be able to keep doing what they know best
- issuing the same commands, yet reaching additional resources
- would like to access those resources easily and transparently
- data available to both local and remote resources, etc.
- The Grid User Node (GUN) concept is designed to address these needs by integrating local resources into UT Grid
- removes the distinction between local and remote resources
- GUN will probably be adopted in TeraGrid, TIGRE, etc.
30. Grid User Node (GUN)
- Two types of GUNs are available
- TACC-hosted GUNs (Linux for now; Windows and Mac coming)
- allow an easy start and testing of the environment
- hosted GUNs are already fully integrated into UT Grid
- Personal GUN via downloadable GUN software
- links to downloadable packages and user guides make it easy
- can then be further customized to suit needs and tastes
- the user's PC is now fully integrated into UT Grid
- Currently have Linux and Mac versions in production; Windows version under discussion
31. Grid User Node (GUN)
- Developed GridShell software to enable GUNs
- GridShell incorporates features to transparently execute commands and data transfers across computational resources integrated by grid computing technologies (the basic idea is sketched below)
- Built on top of GSI, GRAM, GridFTP, Condor, LSF
- GridShell v1.0
- bash and tcsh
- Linux and Mac OS X
- Implementing GridShell for TeraGrid as well as UT Grid
- Already in use by researchers on UT Grid and TeraGrid
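A conceptual illustration of transparent remote execution, not GridShell's actual implementation: the same command can run locally or be handed to a remote resource's GRAM service. The resource contact string is a hypothetical example.

```python
# Conceptual sketch only: run a command locally, or route it to a grid
# resource through GRAM (globus-job-run) when a resource is specified.
# This illustrates the idea GridShell builds on; it is not GridShell itself.
import subprocess

def run(command, resource=None):
    if resource is None:
        return subprocess.run(command, shell=True)          # local execution
    # Hand the command to the remote resource's GRAM gatekeeper.
    return subprocess.run(["globus-job-run", resource, "/bin/sh", "-c", command])

run("ls -l data/")                                           # runs locally
run("namd2 apoa1.namd", resource="cluster.example.edu/jobmanager-lsf")
```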
32. Grid User Node (GUN)
- GUN already enables
- information queries about grid resources
- Roundup and Rodeo job submission
- monitoring job status
- reviewing job results
- resource brokering based on ClassAd catalogs
- GridFTP-enabled GSIFTP
- on-demand glide-in of UD resources into the Condor pool
- expand, generalize resource broker design & implementation
- integrated real-life applications: NAMD, SNOOP3D, POVray
33. Grid User Portal (GUP)
- Portals lower the barrier to entry for novice users
- Also provide alternatives to the CLI for advanced users
- Enable easy access to multiple resources through a single interface
- Offer a simple GUI interface to complex grid computing capabilities
- Can host applications for domain-specific scientific research using grid technology
- Present a Virtual Organization view of the grid as a whole
- Increase productivity of UT researchers: do more science!
34. Grid User Portal (GUP)
- Developed a UT Grid-specific portal based on GridPort3, focused on the following functionality:
- view information on resources within UT Grid, including status, load, jobs, queues, etc.
- view network bandwidth and latency between systems, and aggregate capabilities for all systems
- submit user jobs and run hosted applications
- manage files across systems, and move/copy multiple files between resources with transfer time estimates
- browse data collections
35. Grid User Portal (GUP)
- Incorporated UT Grid resources into the current production TACC User Portal
- Developing GUP components (portlets) using GPv3 & JSR-168
- Portlets will be compatible with other JSR-168 frameworks (WebSphere, GridSphere, uPortal, etc.)
- Enables sharing of portlets with other communities (IBM, OGCE)
- Current JSR-168 implementations
- GPIR Browser
- Comprehensive File Transfer
- Comprehensive Job Management
- Data Collection browsing
- NAMD hosted application (in progress)
- Leading development of the TeraGrid User Portal
- Driving requirements for GPv4
36. Serial Compute Services Plans
- Increase client counts to 10,000 through UT BevoWare
- Increase the user community through training, docs, consulting
- Simplify usage through hosted applications, application portals
- Add UD screen saver educational content
- Develop Condor glide-in for United Devices
- Develop UD support for multiple grids, GSI
- Long-term plans/possibilities
- explore P2P algorithms for traditional sci/eng apps
- integrate each into TIGRE, TeraGrid
- explore development environments for multi-platform client execution
- ask Texas Exes to support, distribute clients
37. Parallel Compute Services Plans
- Develop the UT Grid Resource Broker for integrating clusters with different queuing systems
- use Condor, Globus Toolkit v4, and our own development
- address TeraGrid requirements, other partner requirements
- use data transfer times, queue wait predictions, application execution time predictions
- provide to TeraGrid, TIGRE, partners
- Integrate the UT Grid Resource Broker into GUP, GUN
- Install UT clusters as spokes; later convert to peer clusters
- Long-term plans/possibilities
- explore more sophisticated scheduling algorithms
- explore WS-Agreement-based scheduling
- evaluate and develop workflow tools
38. Storage Services Plans
- File Services
- explore network performance issues and their impact on GridFTP
- harden and distribute the file transfer portlet
- integrate comprehensive file transfer services into GridShell
- Grid File Systems
- GPFS: discuss results with SDSC, set up a TACC testbed, evaluate ease of deployment, robustness, performance
- GridNFS: track development, invite a speaker from the project to visit TACC
- select one technology for campus deployment
39. Visualization Services Plans
- Complete installation of Maverick
- begin measuring the effectiveness and impact of remote visualization
- Complete determination and documentation of campus vis tools for personal-scale and high-end vis; provide support
- Work with IBM to develop Linux remote visualization tools
- continue meeting with the Deep graphics team
- deploy a Linux visualization cluster? SMP?
- Define and develop remote and collaborative visualization software in 2Q05 and beyond
- leverage TeraGrid funding
- prepare and submit NSF proposals on grid visualization
40. Data Collections Plans
- Continue work with SRB
- develop better interfaces for collections, such as a query mechanism
- complete technology evaluations
- complete hosting of the initial four data collections
- Continue evaluating Avaki for collections
- will this technology meet our needs for data collections?
- complete technology evaluations
- Evaluate DB2, SQL Server, Oracle, etc. for collection hosting capabilities
- Solicit the UT community for additional data collections and user requirements
41. Grid User Portal & Grid User Node
- Complete and distribute
- GridShell v1.0 (bash and tcsh)
- GridPort v4.0
- GridPort portlets, application portlets
- Develop and deploy
- TeraGrid User Portal 1.0
- TACC User Portal v3.0
- TACC GUN v1.0
- Write DeveloperWorks articles on
- GridShell v1.0 and GridPort v4.0
- GUN and GUP concepts, value, implementations
- Evangelize the GUN concept to TeraGrid for deployment, support
- Develop GridShell agents for additional grid technologies
42. TIGRE
43. About TIGRE
- High Performance Computing Across Texas (HiPCAT) is a consortium of Texas higher ed and medical research institutions
- Texas Internet Grid for Research & Education (TIGRE) is a new project of HiPCAT to build a state grid for higher ed and medical research
- Lonestar Education And Research Network (LEARN) will connect 30 higher ed and medical research institutions in Texas
44. About TIGRE
- TIGRE is a $2.5M, two-year project for UT Austin, Texas Tech, Texas A&M, Rice, and U. Houston to deploy a grid
- Limited-funding, limited-duration project
- must be lightweight, easy to extend to all Texas institutions
- Must be reliable, easy to support at Texas institutions
- TIGRE needs the GRIDS Center!
45. TIGRE Requirements
- Initial application communities include
- atmospheric modeling/environmental issues
- biomedical research (diverse)
- petroleum modeling/engineering
- Technology requirements
- Sharing data in these domains
- Aggregating compute resources
- Maximizing throughput of compute jobs
- Integration with campus grids
- Integration with TeraGrid
46. TIGRE Request to GRIDS Center
- TIGRE/GRIDS design meeting in late July
- Determine a minimal complete software stack for
- setting up grid usage
- conducting grid accounting
- deploying a user portal
- enabling data sharing
- distributing compute jobs
- Develop an aggressive plan and timeline for deployment
- SC05 as a driver for initial capabilities, demonstrations?
- Summer '06 for initial production with at least one app domain?
- Summer '07 for completion
- Regular consulting meetings with TIGRE teams
- Journal the entire process, publish jointly as a case study
47. More About TACC
- Texas Advanced Computing Center
- www.tacc.utexas.edu
- info@tacc.utexas.edu
- (512) 475-9411