Title: SC03 TeraGrid Tutorial: Applications in the TeraGrid Environment
1. SC03 TeraGrid Tutorial: Applications in the TeraGrid Environment
- John Towns, NCSA <jtowns@ncsa.edu>
- Nancy Wilkins-Diehr, SDSC <wilkinsn@sdsc.edu>
- Sharon Brunett, CACR <sharon@cacr.caltech.edu>
- Sandra Bittner, ANL <bittner@mcs.anl.gov>
- Derek Simmel, PSC <dsimmel@psc.edu>
- and many others participating in the TeraGrid Project
2. Tutorial Outline - Morning
- TeraGrid Overview
- John Towns, Slide 4, 20 mins
- Introduction to TeraGrid Resources and Services
- John Towns, Slide 15, 60 mins
- TeraGrid Computing Paradigms
- Sharon Brunett, Slide 55, 20 mins
- BREAK
- TeraGrid User Environment & Job Execution
- Sandra Bittner, Slide 69, 60 mins
- TeraGrid Support Services and Resources
- Nancy Wilkins-Diehr, Slide 129, 20 mins
3. Tutorial Outline - Afternoon
- LUNCH
- Getting Started with User Certificates on the TeraGrid
- Derek Simmel, Slide 138
- Fractals with MPI and MPICH-G2 Exercise
- Sandra Bittner, Slide 160
- Pipelined Application Exercise with MCell
- Nancy Wilkins-Diehr, Slide 166
- We will take a break when it is time
4. Brief Overview of the TeraGrid
- John Towns
- NCSA / Univ of Illinois
- Co-Chair, TG Users Services WG
- jtowns@ncsa.edu
5. The TeraGrid Vision: Distributing the resources is better than putting them at one site
- Build new, extensible, grid-based infrastructure to support grid-enabled scientific applications
- new hardware, new networks, new software, new practices, new policies
- Expand centers to support cyberinfrastructure
- distributed, coordinated operations center
- Exploit unique partner expertise and resources to make the whole greater than the sum of its parts
- Leverage homogeneity to make distributed computing easier and simplify initial development and standardization
- run a single job across the entire TeraGrid
- move executables between sites
6. TeraGrid Objectives
- Create unprecedented capability
- integrated with extant PACI capabilities
- supporting a new class of scientific research
- Deploy a balanced, distributed system
- not a distributed computer, but rather
- a distributed system using Grid technologies
- computing and data management
- visualization and scientific application analysis
- Define an open and extensible infrastructure
- an enabling cyberinfrastructure for scientific research
- extensible beyond the original sites
- NCSA, SDSC, ANL, Caltech, PSC (under ETF)
- ETF2 awards to TACC, Indiana/Purdue, ORNL
7. Measuring Success
- Breakthrough science via new capabilities
- integrated capabilities more powerful than existing PACI resources
- current PACI users and new communities requiring Grids
- An extensible Grid
- design principles assume heterogeneity and more than four sites
- Grid hierarchy; scalable, replicable, and interoperable
- formally documented design, standard protocols and specifications
- encourage, support, and leverage open source software
- A pathway for current users
- evolutionary paths from current practice
- provide examples, tools, and training to exploit Grid capabilities
- user support, user support, and user support
8. TeraGrid Application Targets
- Multiple classes of user support
- each with differing implementation complexity
- minimal change from current practice
- new models, software, and applications
- Usage exemplars
- traditional supercomputing made simpler
- remote access to data archives and computers
- distributed data archive access and correlation
- remote rendering and visualization
- remote sensor and instrument coupling
9. TeraGrid Components
- Compute hardware
- Intel/Linux clusters, Alpha SMP clusters, POWER4 cluster, ...
- Large-scale storage systems
- hundreds of terabytes for secondary storage
- Very high-speed network backbone
- bandwidth for rich interaction and tight coupling
- Grid middleware
- Globus, data management, ...
- Next-generation applications
10. Wide Variety of Usage Scenarios
- Tightly coupled jobs storing vast amounts of data, performing visualization remotely, and making data available through online collections (ENZO)
- Thousands of independent jobs using data from a distributed data collection (NVO)
- Applications employing novel latency-hiding algorithms adapting to a changing number of processors (PPM)
- High-throughput applications of loosely coupled jobs (MCell)
11. Prioritization to Ensure Success
- Diagnostic apps to test functionality (ENZO, PPM)
- Flagship apps provide early requirements for software and hardware functionality
- Cactus, ENZO, EOL, Gadu, LSMS, MCell, MM5, Montage, NAMD, NekTar, PPM, Quake, real-time brain mapping
- Plans to approach existing grid communities
- GriPhyN, NEES, BIRN, etc.
12. TeraGrid Roaming
- Apply for a TeraGrid account
- Receive account info, pointers to training, a point of contact for User Services/Ops, pointers to login resources, and an atlas of TG resources
- Attend a TeraGrid training class or access web-based TG training materials
- Develop and optimize code at Caltech
- Run a large job at SDSC, storing data using SRB
- Run a large job at NCSA; move data from SRB to local scratch and store results in SRB
- Run a larger job using both the SDSC and PSC systems together; move data from SRB to local scratch, storing results in SRB
- Move a small output set from SRB to the ANL cluster, do visualization experiments, render a small sample, and store results in SRB
- Move a large output data set from SRB to the remote-access storage cache at SDSC, render using ANL hardware, and store results in SRB
- (A recompile may be necessary in some cases)
13. Strategy: Define and Build Standard Services
- Finite number of TeraGrid services
- defined as specifications, protocols, APIs
- separate from implementation
- Extending TeraGrid
- adoption of TeraGrid specifications, protocols, APIs
- protocols, data formats, behavior specifications, SLAs
- Engineering and verification
- shared software repository
- build sources, scripts
- a service must be accompanied by a test module
14. TeraGrid Extensibility
- "You must be this high to ride the TeraGrid"
- fast network
- non-trivial resources
- meet SLA (testing and QA requirements)
- become a member of the virtual organization
- capable of TG hosting (peering arrangements)
- TG Software Environment
- user (download, configure, install, and run TG 1.0)
- developer (join the distributed engineering team)
- TG Virtual Organization
- Operations, User Services
- Add new capability
- make the whole greater than the sum of its parts
- repo.teragrid.org
15. Introduction to TeraGrid Resources and Services
- John Towns
- NCSA / Univ of Illinois
- Co-Chair, TG Users Services WG
- jtowns@ncsa.edu
16. TeraGrid Components
- Compute hardware
- Phase I
- Intel Linux clusters
- open source software and community
- Madison processors for commodity leverage
- Alpha SMP clusters
- Phase II
- more Linux cluster hardware
- POWER4 cluster
- ETF2
- resources from additional sites: TACC, Indiana/Purdue, ORNL
- Large-scale storage systems
- hundreds of terabytes for secondary storage
17. TeraGrid Components
- Very high-speed network backbone
- bandwidth for rich interaction and tight coupling
- Grid middleware
- Globus, data management, ...
- Next-generation applications
- breakthrough versions of today's applications
- but also reaching beyond traditional supercomputing
18. Introduction to TeraGrid Resources and Services
- Compute Resources
- Data Resources and Data Management Services
- Visualization Resources
- Network Resources
- Grid Services
- Grid Scheduling
- Allocations and Proposals
19. Compute Resources Overview
[Diagram: the TeraGrid backbone (4 lambdas between the LA and Chicago hubs) connects the five sites. Resources shown: ANL - 96 GeForce4 graphics pipes, 96 Pentium 4 and 64 2p Madison nodes on Myrinet, 20 TB; Caltech - 32 Pentium 4, 52 2p Madison and 20 2p Madison nodes on Myrinet, 100 TB DataWulf; NCSA - 256 2p Madison and 667 2p Madison nodes on Myrinet, 230 TB FCS SAN; SDSC - 128 2p Madison and 256 2p Madison nodes on Myrinet, 1.1 TF POWER4 Federation, 500 TB FCS SAN; PSC. Contacts: Charlie Catlett <catlett@mcs.anl.gov>, Pete Beckman <beckman@mcs.anl.gov>]
20. Compute Resources: NCSA - 2.6 TF -> 10.6 TF w/ 230 TB
[Diagram: 30 Gbps to the TeraGrid network via a GbE fabric. 2.6 TF Madison (256 nodes) plus 8 TF Madison (667 nodes); nodes are 2p Madison with 4 GB memory and 2x73 GB disk (existing 2p 1.3 GHz nodes: 4 or 12 GB memory, 73 GB scratch). Storage I/O over Myrinet and/or GbE at 250 MB/s/node (670-node and 256-node groups). The Myrinet fabric and Brocade 12000 switches (256 and 92 2x FC links) front 230 TB of FC SAN storage. Interactive/spare nodes: 8 4p Madison nodes for login and FTP.]
21. Compute Resources: SDSC - 1.3 TF -> 4.3 TF + 1.1 TF w/ 500 TB
[Diagram: 30 Gbps to the TeraGrid network via a GbE fabric. 1.3 TF Madison (128 nodes) plus 3 TF Madison (256 nodes); nodes are 2p Madison with 4 GB memory and 2x73 GB disk (existing 2p 1.3 GHz nodes: 4 GB memory, 73 GB scratch). Storage I/O at 250 MB/s per node across three 128-node groups over the Myrinet fabric; Brocade 12000 switches with 128 and 256 2x FC links front 500 TB of SAN storage. Interactive/spare nodes: 6 4p Madison nodes for login and FTP.]
22. Compute Resources: ANL - 1.4 TF w/ 20 TB, Viz
[Diagram: 30 Gbps to the TeraGrid network via a GbE fabric. Visualization: 0.9 TF Pentium IV, 96 nodes (2p 2.4 GHz, 4 GB RAM, 73 GB disk, Radeon 9000), providing 96 visualization streams to viz devices and network viz. Compute: 0.5 TF Madison, 64 nodes (2p Madison, 4 GB memory, 2x73 GB). Storage and viz I/O run at 250 MB/s/node over Myrinet and/or GbE to the TG network. Storage nodes: 8 2x FC links to 20 TB. Interactive nodes: 4 2p Pentium IV and 4 4p Madison nodes for login and FTP.]
23. Compute Resources: Caltech - 100 GF w/ 100 TB
[Diagram: 30 Gbps to the TeraGrid network via a GbE fabric. 72 GF Madison on 36 IBM/Intel nodes and 34 GF Madison on 17 HP/Intel nodes (2p Madison, 6 GB memory, 73 GB scratch or 2x73 GB disk); 33 IA-32 storage nodes (2p, 6 GB memory) serve the 100 TB /pvfs Datawulf; 6 Opteron nodes (4p, 8 GB memory) provide 66 TB RAID5 for HPSS. Myrinet fabric at 250 MB/s per node in the 36-, 33-, and 17-node groups; 13 2x FC links to 13 tape drives in a silo of 1.2 PB raw capacity. Interactive node: one 2p IBM Madison node for login and FTP.]
24. Compute Resources: PSC - 6.4 TF w/ 150 TB
[Diagram: 30 Gbps to the TeraGrid network via a GbE fabric. Linux Cache Nodes (LCNs) with 150 TB of RAID disk, Application Gateways, and hierarchical storage (DMF) sit in front of the Quadrics-interconnected compute system.]
25. PSC Integration Strategy
- TCS (lemieux.psc.edu), Marvel (rachel.psc.edu), and visualization nodes
- OpenPBS, SIMON scheduler
- OpenSSH/SSL
- Compaq C/C++ and Fortran, gcc
- Quadrics MPI (lemieux)
- Marvel native MPI (rachel)
- Python with XML libraries
- GridFTP via Linux Cache Nodes / HSM
- Adding
- Globus 2.x.y GRAM, GRIS
- softenv
- gsi-openssh, gsi-ncftp
- Condor-G
- INCA test harness
- more as the Common TeraGrid Software Stack develops
26. SDSC POWER4 Integration Strategy
- Software stack test suite
- porting core and basic services to AIX 5L
- POWER4 cluster with a common TeraGrid software stack as close as practical to the IA-64 and TCS Alpha versions, to support TeraGrid Roaming
- Network attachment architecture
- Federation switch on every node
- Fibre Channel on every node
- GbE to TeraGrid via a Force10 switch
27. Data Resources and Data Management Services
- Approach
- deploy core services
- drive the system with data-intensive flagship applications
- TG Data Services Plan
- integrate mass storage systems at sites into TG
- GridFTP-based access to mass storage systems
- SRB-based access to data
- HDF5 libraries
28. Common Data Services
- Database systems: five systems (5x32 IBM Regatta) acquired at SDSC for DB2 and other related DB apps
- Oracle and DB2 clients planned at NCSA
29. The TeraGrid Visualization Strategy
- Combine existing resources and current technology
- commodity clustering and commodity graphics
- Grid technology
- Access Grid collaborative tools
- Efforts, expertise, and tools from each of the ETF sites
- Volume Rendering (SDSC)
- Coupled Visualization (PSC)
- Volume Rendering (Caltech)
- VisBench (NCSA)
- Grid and Visualization Services (ANL)
- ... to enable new and novel ways of visually interacting with simulations and data
30. Two Types of Loosely Coupled Visualization
[Diagram: Interactive Visualization - a user computationally steers through pre-computed data from a TeraGrid simulation over the TeraGrid network, using short-term storage. Batch Visualization - batch jobs, such as movie generation, process data from long-term storage.]
31. On-Demand and Collaborative Visualization
[Diagram: On-Demand Visualization - coupling a TeraGrid simulation with interaction, including preprocessing, filtering, and feature detection. Collaborative Visualization - multi-party viewing and collaboration over the Access Grid, with Voyager recording.]
32. Visualization Sample Use Cases
33. The TeraGrid Networking Strategy
- TeraGrid backplane
- provides sufficient connectivity (bandwidth, latency) to support a virtual machine room
- core backbone: 40 Gbps
- connectivity to each site: 30 Gbps
- Local networking
- provide support for all nodes at each site to have adequate access to the backplane
- support for distributed simulations
34. TeraGrid Wide Area Network
35. TeraGrid Optical Network
[Diagram: Ciena CoreStream long-haul DWDM, operated by Qwest, spans the ~2200 mi between Los Angeles (818 W. 7th St., CENIC hub) and Chicago (455 N. Cityfront Plaza, Qwest fiber collocation facility), with a DTF backbone core router at each hub. Ciena Metro DWDM (operated by the sites) and Cisco long-haul DWDM (operated by CENIC) connect the hubs to the sites - Caltech and SDSC to Los Angeles; ANL, NCSA, and PSC to Chicago/Starlight, which also serves additional sites and networks - over spans of roughly 115, 25, 140, and 25 miles (one distance not shown). Each site has a site border router and a cluster aggregation switch in front of its systems (Caltech, SDSC, NCSA, ANL, PSC), plus local resources and external network connections.]
36. NCSA TeraGrid Network
[Diagram: a Juniper T640 site border router connects to the backbone at 30 Gbps (3x10GbE). A Force10 cluster aggregation switch (160 Gbps) feeds four further Force10 switches over 4x10GbE links each.]
37. SDSC TeraGrid Network
[Diagram: a Juniper T640 site border router connects to the backbone at 30 Gbps and to the site at 40 Gbps, over two 2x10GbE bundles to a Force10 cluster switch.]
38. Caltech TeraGrid Network
[Diagram: a Juniper T640 site border router connects to the backbone at 30 Gbps and to a Force10 cluster switch at 30 Gbps (3x10GbE).]
39. Argonne TeraGrid Network
[Diagram: a Juniper T640 site border router connects to the backbone at 30 Gbps and to a Force10 cluster switch at 30 Gbps (3x10GbE).]
40. PSC TeraGrid Network
[Diagram: a Cisco site border router connects to the backbone at 30 Gbps (3x10GbE). Behind it, 30x1GbE (30 Gbps) links reach the Linux Cache Nodes (LCNs) with 150 TB of RAID disk and the Application Gateways; SGI DMF hierarchical storage attaches via GbE and Fibre Channel. A Quadrics fabric interconnects the 4x32p EV7 SMPs and 20 viz nodes.]
41. Grid Services: A Layered Grid Architecture
[Diagram, layers recovered from the figure:]
- Connectivity - "Talking to things": communication (Internet protocols) and security
- Fabric - "Controlling things locally": access to, and control of, resources
42. TeraGrid Runtime Environment
[Diagram: single sign-on via a grid-id. The user's certificate yields a Globus credential, and credentials are assigned to user proxies; the proxy performs mutual user-resource authentication at each site (e.g. Site 2), supporting authenticated interprocess communication and mapping to local ids.]
43. Homogeneity Strategies
- Common Grid middleware layer
- TeraGrid software stack
- Collectively designed
- prerequisite for adding services (components) to the common stack is an associated INCA test and build module
- Multiple layers coordinated
- environment variables
- pathnames
- versions for system software, libraries, tools
- Minimum requirements plus site value-added
- multiple environments possible
- special services and tools on top of the common TeraGrid software stack
44. Common Authentication Service
- Standardized GSI authentication across all TeraGrid systems allows use of the same certificate
- Developed a coordinated cert acceptance policy
- today accept
- NCSA/Alliance
- SDSC
- PSC
- DOE Science Grid
- Developing procedures and tools to simplify the management of certificates
- grid-mapfile distribution
- simplified certificate request/retrieval
- Sandra and Derek will cover these in more detail later
45. Grid Information Services
- Currently leveraging the Globus Grid Information Service
- each service/resource is an information source
- index servers at each of the TG sites
- full mesh between index servers for fault tolerance
- access control as needed
- Resource information management strategies
- TG GIS for systems-level information
- generic non-sensitive information
- access control on sensitive info such as job-level information
- application-specific GIS services
- access controls applied as needed
- user control
46. TeraGrid Software Stack v1.0
- A social contract with the user
- LORA: Learn Once, Run Anywhere
- Precise definitions
- services (done, in CVS)
- software (done, in CVS)
- user environment (done, in CVS)
- Reproducibility
- standard configure, build, and install
- single CVS repository for software
- initial releases for IA-64, IA-32, POWER4, Alpha
47. Current TG Software Stack
- SuSE SLES
- X-cat
- OpenPBS
- Maui scheduler
- MPICH, MPICH-G2, MPICH-VMI
- gm drivers
- VMI/CRM
- Globus
- Condor-G
- gsi-ssh
- GPT Wizard
- GPT
- SoftEnv
- MyProxy
- Intel compilers
- GNU compilers
- HDF4/5
- SRB client
48. Grid Scheduling & Job Management: Condor-G, the User Interface
- Condor-G is the preferred job management interface
- job scheduling, submission, tracking, etc.
- allows for complex job relationships and data staging issues
- interfaces to Globus layers transparently
- allows you to use your workstation as your interface to the grid
- The ability to determine current system loads and queue status will come in the form of a web interface
- allows for user-driven load balancing across resources
- might look a lot like the PACI HotPage: https://hotpage.paci.org/
49. Pipelined Jobs Execution
- Scheduling of such jobs can be done now
- Condor-G helps significantly with job dependencies (a minimal DAGMan sketch follows below)
- Can be coordinated with non-TG resources
- Nancy will go through an exercise on this in the afternoon
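As a concrete illustration (not taken from the slides), job dependencies of this kind can be expressed in a Condor DAGMan input file; the job names, submit files, and the two-stage simulate-then-render pipeline below are hypothetical:

  # pipeline.dag - hypothetical two-stage pipeline
  JOB simulate simulate.sub     # Condor-G submit description for the compute stage
  JOB render   render.sub       # submit description for the post-processing stage
  PARENT simulate CHILD render  # render starts only after simulate completes

Such a DAG would be submitted with condor_submit_dag pipeline.dag, and DAGMan then drives the Condor-G submissions in dependency order.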
50. Multi-Site, Single Execution
- Support for execution via MPICH-G2 and MPICH-VMI2
- MPI libraries optimized for WAN execution
- Scheduling is still very much a CS research area
- investigating product options
- Maui, Catalina, PBSPro
- tracking Globus developments
- tracking GGF standards
51. Advanced Reservations
- Allow scheduled execution time for jobs
- provides support for co-scheduling of resources for multi-site execution
- Still need to manually schedule across sites
- provides support for co-scheduling with non-TG resources (instruments, detectors, etc.)
- Send a note to help@teragrid.org if you want to do co-scheduling
52. Allocations Policies
- TG resources are allocated via the PACI allocations and review process
- modeled after the NSF process
- TG considered as a single resource for grid allocations
- Different levels of review for different-size allocation requests
- DAC: up to 10,000 SUs
- PRAC/AAB: <200,000 SUs/year
- NRAC: 200,000 SUs/year and above
- Policies/procedures posted at
- http://www.paci.org/Allocations.html
- Proposal submission through the PACI On-Line Proposal System (POPS)
- https://pops-submit.paci.org/
53. Accounts and Account Management
- TG accounts are created on ALL TG systems for every user
- information regarding accounts on all resources delivered
- working toward a single US mail packet arriving for the user
- accounts synched through a centralized database
- certificates provide uniform access for users
- jobs can be submitted to and run on any TG resource
- NMI Account Management Information Exchange (AMIE) used to manage accounts and transport usage records in GGF format
54. And now...
- on to the interesting details
55. TeraGrid Computing Paradigm
- Sharon Brunett
- CACR / Caltech
- Co-Chair, TG Performance Eval WG
- sharon@cacr.caltech.edu
56. TeraGrid Computing Paradigm
- Traditional parallel processing
- Distributed parallel processing
- Pipelined/dataflow processing
57. Traditional Parallel Processing
- Tightly coupled multicomputers are meeting the traditional needs of large-scale scientific applications
- compute-bound codes
- faster and more CPUs
- memory-hungry codes
- deeper cache, more local memory
- tightly coupled, communications-intensive codes
- high-bandwidth, low-latency interconnect for message passing between tasks
- I/O-bound codes
- large-capacity, high-performance disk subsystems
58. Traditional Parallel Processing - When Have We Hit the Wall?
- Applications can outgrow or be limited by a single parallel computer
- heterogeneity desirable due to application components
- storage, memory, and/or computing demands exceed the resources of a single system
- more robustness desired
- integrate remote instruments
59. Traditional Parallel Processing
- Single executable to be run on a single remote machine
- big assumptions
- runtime necessities (e.g. executables, input files, shared objects) available on the remote system!
- login to a head node, choose a submission mechanism
- Direct, interactive execution
- mpirun -np 16 ./a.out
- Through a batch job manager
- qsub my_script
- where my_script describes the executable location, runtime duration, redirection of stdout/err, and the mpirun specification (a sample script sketch follows this list)
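A minimal my_script for the qsub path above might look like the following sketch; the resource limits, job name, and process count are placeholder values, not settings from the tutorial:

  #!/bin/sh
  #PBS -l nodes=8:ppn=2,walltime=01:00:00   # 8 two-processor nodes for one hour (example values)
  #PBS -N example_run                       # job name (placeholder)
  #PBS -j oe                                # merge stdout and stderr into one file
  cd $PBS_O_WORKDIR                         # run from the directory the job was submitted from
  mpirun -np 16 ./a.out                     # launch the MPI executable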
60. Traditional Parallel Processing II
- Through Globus
- globusrun -r some-teragrid-head-node.teragrid.org/jobmanager -f my_rsl_script
- where my_rsl_script describes the same details as in the qsub my_script! (a sample RSL sketch follows)
- Through Condor-G
- condor_submit my_condor_script
- where my_condor_script describes the same details as the Globus my_rsl_script!
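For reference, an RSL file carrying the same information as the PBS script might look like the sketch below; the executable path, process count, and wall time are placeholders, and the attributes a given jobmanager honors can vary:

  & (executable=/home/train00/a.out)
    (count=16)
    (jobtype=mpi)
    (maxWallTime=60)
    (stdout=run.out)
    (stderr=run.err)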
61. Distributed Parallel Processing
- Decompose an application over geographically distributed resources
- functional or domain decomposition fits well
- take advantage of load-balancing opportunities
- think about latency impact
- Improved utilization of many resources
- Flexible job management
62. Overview of Distributed TeraGrid Resources
[Diagram: each site - Caltech, Argonne, NCSA/PACI (10.3 TF, 240 TB), and SDSC (4.1 TF, 225 TB) - contributes site resources, archival storage (HPSS, or UniTree at NCSA), and connections to external networks.]
63. Distributed Parallel Processing II
- Multiple executables to run on multiple remote systems
- tools for pushing runtime necessities to remote sites
- Storage Resource Broker, gsiscp, ftp, globus-url-copy - copy files between sites
- globus-job-submit my_script
- returns an https address for monitoring and post-processing control
64. Distributed Parallel Processing III
- Multi-site runs need co-allocated resources
- VMI-MPICH jobs can run multi-site
- vmirun -np local_cpus -grid_vmi -gnp total_cpus -crm crm_name -key key_value ./a.out
- server/client socket-based data exchanges between sites
- Globus and Condor-G based multi-site job submission
- create an appropriate RSL script
65. Pipelined/Dataflow Processing
- Suited for problems which can be divided into a series of sequential tasks where
- multiple instances of the problem need executing
- a series of data needs processing with multiple operations on each series
- information from one processing phase can be passed to the next phase before the current phase is complete
66. Pipelined/Dataflow Processing
- Key requirement for efficiency
- fast communication between adjacent processes in a pipeline
- the interconnect on TeraGrid resources meets this need
- Common examples
- frequency filters
- Monte Carlo
- MCell example this afternoon!
67. Pipeline/Dataflow Example: CMS (Compact Muon Solenoid) Application
- Schedule and run 100s of Monte Carlo detector response simulations on TG compute cluster(s)
- Transfer each job's 1 GB of output to the mass storage system at a selected TG site
- Schedule and run 100s of jobs on a TG cluster to reconstruct physics from the simulated data
- Transfer results to the mass storage system
68. Pipelined CMS Job Flow
[Diagram; credits: Vladimir Litvin (Caltech), Scott Koranda (NCSA/Univ of Wisconsin-Milwaukee)]
- A master Condor job runs on a Caltech workstation.
- 2) The master launches a secondary Condor job on a remote pool of nodes; input files are fetched via Globus tools (GASS).
- 3a) 75 Monte Carlo jobs run on the remote Condor pool; 3b) 25 Monte Carlo jobs run on remote nodes via Condor.
- 4) 100 data files (1 GB each) are transferred via gsiftp to a TeraGrid Globus-enabled FTP server.
- 5) The secondary job reports complete to the master.
- 6) The master starts reconstruction jobs via the Globus jobmanager on a TG (or other Linux) cluster.
- 7) gsiftp fetches the data from mass storage.
- 8) The processed database is stored to mass storage.
- 9) The reconstruction job reports complete to the master.
69. The TeraGrid User Environment & Job Execution
- Sandra Bittner
- Argonne National Laboratory
- bittner@mcs.anl.gov
70. The TG User Environment & Job Execution
- Development Environment
- Grid Mechanisms
- Data Handling
- Job Submission and Monitoring
71. Development Environment
72. SoftEnv System
- Software package management system instituting symbolic keys for user environments
- Replaces traditional UNIX dot files
- Supports community keys
- Programmable, similar to other dot files
- Integrated user environment transfer
- Well suited to software lifecycles
- Offers a unified view of heterogeneous platforms
73. Manipulating the Environment
- /home/<username>/.soft
- @teragrid
- softenv
- displays symbolic software key names
- soft add <package-name>
- temporary addition of a package to the environment
- soft delete <package-name>
- temporary package removal from the environment
- resoft
- modify the dotfile and apply it to the present environment (a sample .soft sketch follows)
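A hypothetical ~/.soft illustrating the key-based setup described above (the extra keys are examples, not a recommended configuration):

  # ~/.soft - processed top to bottom by SoftEnv
  @teragrid          # the common TeraGrid environment macro
  +intel-compilers   # add the Intel C/Fortran compiler key
  +mpich-g2          # add the Grid-enabled MPICH key

After editing the file, run resoft to apply the changes to the current session.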
74. softenv output (part 1 of 2)
  > softenv
  SoftEnv version 1.4.2
  The SoftEnv system is used to set up environment variables. For details, see 'man softenv-intro'.
  This is a list of keys and macros that the SoftEnv system understands.
  In this list, the following symbols indicate:
    *  This keyword is part of the default environment, which you get by putting "@default" in your .soft
    U  This keyword is considered generally "useful".
    P  This keyword is for "power users", people who want to build their own path from scratch. Not recommended unless you know what you are doing.
75. softenv output (part 2 of 2)
  These are the keywords explicitly available:
    P  atlas             ATLAS
    P  globus            Globus -- The Meta Scheduler
    P  gm                Myricom GM networking software
    P  goto              goto BLAS libraries
    P  gsi-openssh       GSI OpenSSH
    P  hdf4              HDF4
    P  hdf5              HDF5
    P  intel-compilers   Intel C & Fortran Compilers
       java              Java Environment flags
    P  maui              Maui Scheduler
    P  mpich-g2          MPICH for G2
    P  mpich-vmi         MPICH for VMI
    P  myricom           GM Binaries
    P  openpbs-2.3.16    Open Portable Batch System 2.3.16
    P  pbs               Portable Batch System
    P  petsc             PETSc 2.1.5
    P  srb-client        SRB Client
76. SoftEnv Documentation
- Overview
- man softenv
- User Guide
- man softenv-intro
- Administrator's Guide
- man softenv-admin
- The Msys Toolkit
- http://www.mcs.anl.gov/systems/software
77. Communities
- Creating and organizing communities
- Registering keys
- Adding software
- Software versions and life cycle
78. Software Layers
- Breaking down a directory name
- /soft/globus-2.4.3_intel-c-7.1.025-f-7.1.028_ssh-3.5p1_gm-2.0.6_mpich-m_1.2.5..10_mpicc64dbg_vendorcc64dbg
- or one softkey of globus-2.4.3-intel
79. Compilers & Scripting Languages
- Intel C, Intel Fortran
- may differ across platforms/architectures
- pre-production: v7.1, v8.0
- GNU Compiler Collection, GCC
- may differ across platforms/architectures
- pre-production: v3.2-30, v3.2.2-5
- Scripting languages
- Perl
- Python
80. Grid Mechanisms
81. Certificates: Your TeraGrid Passport
- Reciprocal agreements
- NCSA, SDSC, PSC, DOEGrids
- what happens to DOE Science Grid certs
- what about Globus certificates
- Apply from the command line
- ncsa-cert-req
- sdsc-cert-req
- Register and distribute your certificate on TG
- gx-map
- $GLOBUS_LOCATION/grid-proxy-init
82. Globus Foundations
83. GSI in Action: Create Processes at A and B that Communicate and Access Files at C
[Diagram: the user performs single sign-on via a grid-id and generates a proxy credential (or retrieves one from an online repository). Remote process creation requests go, with mutual authentication, to GSI-enabled GRAM servers at Site A (Kerberos) and Site B (Unix); each server authorizes the request, maps it to a local id, creates the process, and generates credentials (including a Kerberos ticket at Site A). The two processes communicate over authenticated channels. Using a restricted proxy, a remote file access request goes to a GSI-enabled FTP server at Site C (Kerberos), which authorizes, maps to a local id, and accesses the file on the storage system.]
84. globus-job-run
- For running interactive jobs
- Additional functionality beyond rsh
- Ex: run a 2-process job with executable staging
- globus-job-run host -np 2 -s myprog arg1 arg2
- Ex: run 5 processes across 2 hosts
- globus-job-run \
    -: host1 -np 2 -s myprog.linux arg1 \
    -: host2 -np 3 -s myprog.aix arg2
- For a list of arguments, run
- globus-job-run -help
85. globus-job-submit
- For running batch/offline jobs (a short end-to-end sketch follows this list)
- globus-job-submit: submit job
- same interface as globus-job-run
- returns immediately
- globus-job-status: check job status
- globus-job-cancel: cancel job
- globus-job-get-output: get job stdout/err
- globus-job-clean: clean up after job
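Putting these commands together, a batch session might look like the sketch below; the host name and program are placeholders, and the contact URL printed by globus-job-submit is captured in a shell variable rather than shown verbatim:

  > CONTACT=`globus-job-submit tg-login1.ncsa.teragrid.org -np 2 -s ./myprog arg1`
  > globus-job-status $CONTACT          # poll until the job reports DONE
  > globus-job-get-output $CONTACT      # retrieve stdout/stderr
  > globus-job-clean $CONTACT           # remove cached job state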
86. globusrun
- Flexible job submission for scripting
- uses an RSL string to specify the job request
- contains an embedded globus-gass-server
- defines a GASS URL prefix in an RSL substitution variable
- (stdout=$(GLOBUSRUN_GASS_URL)/stdout)
- supports both interactive and offline jobs
- Complex to use
- must write RSL by hand
- must understand its esoteric features
- generally you should use the globus-job-* commands instead
87. GridFTP
- Moving a test file (a second, site-to-site sketch follows)
- globus-url-copy -s "`grid-cert-info -subject`" \
    gsiftp://localhost:5678/tmp/file1 \
    file:///tmp/file2
- Examples during hands-on session
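globus-url-copy can also drive a third-party transfer directly between two GridFTP servers; the host names and paths below are placeholders, not actual TeraGrid endpoints:

  > globus-url-copy \
      gsiftp://gridftp.site-a.example.org/scratch/train00/file1 \
      gsiftp://gridftp.site-b.example.org/scratch/train00/file1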
88. Condor-G
- Combines the strengths of Condor and the Globus Toolkit
- Advantages when managing grid jobs
- full-featured queuing service
- credential management
- fault tolerance
- (a minimal submit-file sketch follows)
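A minimal Condor-G submit description for a Globus-universe job might look like the sketch below; the resource contact, executable, and file names are placeholders:

  # my_condor_script - hypothetical Condor-G submit file
  universe        = globus
  globusscheduler = some-teragrid-head-node.teragrid.org/jobmanager-pbs
  executable      = myprog
  arguments       = arg1 arg2
  output          = myprog.out
  error           = myprog.err
  log             = myprog.log
  queue

It would be submitted with condor_submit my_condor_script and monitored with condor_q.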
89. Standard Condor-G
- Examples during hands-on demonstration
90-94. How It Works
[Animation, five frames: a job submitted to the Condor-G Schedd causes it to start a GridManager; the GridManager contacts the Grid Resource, where a Globus JobManager is created; the JobManager submits the job to the local PBS system; and the User Job runs on the resource.]
95. Condor-G with Glide-In
- Examples during hands-on session
96-107. How It Works (Glide-In)
[Animation, twelve frames: as before, the Condor-G Schedd starts a GridManager, which works through the Globus JobManager to submit glide-in jobs to PBS on the Grid Resource. The glide-ins start Condor Startd daemons on the allocated nodes, which report to the Condor Collector; the User Job is then matched to and run on the glided-in resources.]
108. MPI: Message Passing Interface
- MPICH-G2
- Grid-enabled implementation of the MPI v1 standard
- harnesses services from the Globus Toolkit to run MPI jobs across heterogeneous platforms
- MPICH-GM
- used to exploit the lower latency and higher data rates of Myrinet networks
- may be used alone or layered with other MPI implementations
- MPICH-VMI2
- exploits the network layer and integrates profiling behaviors for optimization
109. MPI
- TG default is MPI v1; for MPI v2, use a softkey
- ROMIO
- high-performance, portable MPI-IO
- optimized for the noncontiguous data access patterns common in parallel applications
- optimized I/O collectives
- C, Fortran, and profiling interfaces provided
- does not include file interoperability or user-defined error handlers for files
110. MPICH-G2
- Excels at cross-site or inter-cluster jobs
- Offers multiple MPI receive behaviors to enhance job performance under known conditions
- Not recommended for intra-cluster jobs
- Examples during hands-on session
111. MPICH-G2
- Three different receive behaviors of MPICH-G2
- offers topology-aware communicators using GLOBUS_LAN_ID
- enhanced non-vendor MPI through point-to-point messaging
- data exchange through UDP-enabled GridFTP
- SC 2003 demo: ANL booth
- Offers MPI_Comm_connect/accept from the MPI-2 standard
- Uses standard MPI commands such as mpirun, mpicc, mpif77, etc. (a cross-site launch sketch follows)
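A cross-site MPICH-G2 launch is typically driven by mpirun over an RSL multi-request; the sketch below shows the shape of such a request, with hypothetical host names, counts, and paths, and the exact attributes (e.g. jobtype) depend on how each site's jobmanager is configured:

  > mpirun -globusrsl my_multisite.rsl

  where my_multisite.rsl contains one subjob per site, for example:

  + ( &(resourceManagerContact="site-a.example.org/jobmanager-pbs")
       (count=32) (executable=/home/train00/a.out) )
    ( &(resourceManagerContact="site-b.example.org/jobmanager-pbs")
       (count=32) (executable=/home/train00/a.out) )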
112. MPICH-VMI2
- Operates across varied network protocols, such as TCP, InfiniBand, Myrinet
- Uses standard MPI commands such as mpirun, mpicc, and mpif77
- Harnesses profiling routines to provide execution optimization when application characteristics are not previously known
- Examples during hands-on session
113. Data Handling
114. Where's the disk?
- Local node disk
- Shared writable global areas
- scratch space: TG_SCRATCH
- parallel filesystems: GPFS, PVFS
- Home directories: /home/<username>
- Project/database space
- LORA: learn once, run anywhere (a staging sketch follows this list)
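As an illustration of this layout, a job script might stage input into $TG_SCRATCH and copy results back when the run completes; the directory names and files here are placeholders:

  # stage input to fast scratch before the run (hypothetical paths)
  mkdir -p $TG_SCRATCH/$USER/run01
  cp $HOME/inputs/config.dat $TG_SCRATCH/$USER/run01/
  cd $TG_SCRATCH/$USER/run01
  mpirun -np 16 $HOME/bin/a.out config.dat
  # copy results back home (or archive via SRB/GridFTP) before scratch is purged
  cp results.dat $HOME/results/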
115. Data Responsibilities & Expectations
- Storage lifetimes
- check the local policy command and TG documentation
- Data transfer
- SRB, GridFTP, scp
- Data restoration services / back-ups
- varies by site
- Job checkpointing
- responsibility rests with the user
- Email: relay only, no local delivery
- forwarded to the address of registration
116. File Systems
- GPFS
- available for IA-32 based clusters
- under development for IA-64 based clusters
- fast file system - initial tests promising
- PVFS
- parallel file system
- used for high-performance scratch
117. PVFS
- Parallel file system providing shared access to a high-performance scratch space
- the software is quite stable but does not try to handle single-point hardware (node) failures
- Excellent place to store a replica of input data, or output data prior to archiving
- Quirks
- is/can be very slow
- executing off PVFS has traditionally been buggy (not suggested)
- no client caching means poor small read/write performance
- For more information, visit the PVFS BOF on Wed. at 5pm
118. Integrating complex resources
- SRB
- Visualization Resources
- ANL booth demos
- fractal demo during hands-on session
- Real-time equipment
- shake tables
- microscopy
- haptic devices
- Integration work in progress
- A research topic
119. Job Submission & Monitoring
120. Scheduling
- Metascheduling
- user-settable reservations
- pre-allocated advanced reservations
- Local scheduling
- PBS/Maui
- Condor-G
- Peer scheduling
- may be considered in the future
121. Job Submission Methods
- TG-wide submissions
- Condor-G
- MPICH-G2
- MPICH-VMI2
- TG local cluster submissions
- Globus
- Condor-G
- PBS batch & interactive (a short sketch follows this list)
- Examples during hands-on session
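For the local PBS path, batch and interactive submissions look like the sketch below; node counts and walltime are example values only:

  > qsub my_script                                # batch submission of the script shown earlier
  > qsub -I -l nodes=2:ppn=2,walltime=00:30:00    # interactive: 2 nodes, 2 CPUs each, 30 minutes
  > qstat -u $USER                                # check your jobs in the queue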
122. The TeraGrid Pulse
- Inca system
- test harness
- unit reporters
- version reporters
- Operation monitor
- system resources
- job submissions
123. What is the Inca Test Harness?
- Software built to support the Grid Hosting Environment
- Is SRB working at all sites?
- Should we upgrade Globus to version 2.4?
- Is TG_SCRATCH available on the compute nodes?
- Framework for automated testing, verification, and monitoring
- Find problems before users do!
124. Architecture Overview
- Reporter - a script or executable
- version, unit test, and integrated test
- assembled into suites
- Harness - Perl daemons
- planning and execution of reporter suites
- archiving
- publishing
- Client - user-friendly web interface, application, etc.
125. How will this help you?
- Example pre-production screenshots
126. Network Characteristics
- Each site is connected at 30 Gb/s
- Cluster nodes are connected at 1 Gb/s
- Real-world performance node to node is 990 Mb/s
- TCP tuning is essential within your application to attain good throughput (an illustrative sketch follows this list)
- TCP has issues along high-speed, high-latency paths, which are current research topics
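In practice, "TCP tuning" means making sure the socket buffers your transfers use are large enough to cover the bandwidth-delay product of the wide-area path; the commands and values below are illustrative only, not TeraGrid-recommended settings, and the host name is a placeholder:

  # inspect the kernel's maximum TCP socket buffer sizes (Linux)
  sysctl net.core.rmem_max net.core.wmem_max
  # many grid tools expose the same knob directly, e.g. a larger TCP buffer for a GridFTP transfer:
  globus-url-copy -tcp-bs 2097152 gsiftp://source-host.example.org/path/file file:///tmp/file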
127. Ongoing Iperf Tests (single-day example)
- Iperf tests run once an hour between dedicated test platforms at each site (IA-32 based)
- Will report to INCA soon
- Deployed on a variety of representative machines soon
- Code to be made available under an open source license soon
- http://network.teragrid.org/tgperf/
128. Detailed Iperf Graph (single-day close-up)
129. For More Information
- TeraGrid: http://www.teragrid.org/userinfo
- Condor: http://www.cs.wisc.edu/condor
- Globus: http://www.globus.org
- PBS: http://www.openpbs.org
- MPI: http://www.mcs.anl.gov/mpi
- MPICH-G2: http://www.niu.edu/mpi
- MPICH-VMI: http://vmi.ncsa.uiuc.edu
- SoftEnv: http://www.mcs.anl.gov/systems/software
130. TeraGrid Support Services and Resources
- Nancy Wilkins-Diehr
- San Diego Supercomputer Center
- Co-Chair, TG Users Services WG
- wilkinsn@sdsc.edu
131. Production in January!
- The first phase of the TeraGrid will be available to allocated users in January 2004
- Variety of disciplines represented by the first users
- groundwater and oil reservoir modeling
- Large Hadron Collider support
- Southern California Earthquake Center (SCEC)
- Apply by Jan 6 for April access
132. Complete User Support
- Documentation
- Applications
- Consulting
- Training
133. Documentation
- TeraGrid-wide documentation
- simple
- high-level
- what works on all resources
- Site-specific documentation
- full details on the unique capabilities of each resource
- http://www.teragrid.org/docs
134. Common Installation of Applications
- ls $TG_APPS_PREFIX
  ATLAS            globus-2.4.2-2003-07-30-test2   netcdf-3.5.0
  HPSS             goto                            papi
  LAPACK           gx-map                          pbs
  PBSPro_5_2_2_2d  gx-map-0.3                      perfmon
  bin              hdf4                            petsc
  crm              hdf5
135. 24/7 Consulting Support
- help@teragrid.org
- advanced ticketing system for cross-site support
- staffed 24/7
- 866-336-2357, 9-5 Pacific Time
- http://news.teragrid.org/
- Extensive experience solving problems for early access users
- networking, compute resources, extensible TeraGrid resources
136. Training that Meets User Needs
- Asynchronous training
- this tutorial and its materials will be available online
- Synchronous training
- TeraGrid training incorporated into ongoing training activities at all sites
- Training at your site
- with sufficient participants
137. Questions?
- help@teragrid.org
- Come visit us in any TeraGrid site booth
- bittner@mcs.anl.gov
- dsimmel@psc.edu
- jtowns@ncsa.edu
- sharon@cacr.caltech.edu
- wilkinsn@sdsc.edu
138. Lunch! Then hands-on lab
- Certificate creation and management, SRB initialization
- Fractals
- single-site MPI
- cross-site MPICH-G2
- cross-site VMI2
- visualization
- MCell
- PBS
- Globus
- Condor-G
- Condor DAGMan
- SRB
139. Getting Started with User Certificates on the TeraGrid
- Derek Simmel
- Pittsburgh Supercomputing Center
- dsimmel@psc.edu
140. Requesting a TeraGrid Allocation
141. TeraGrid Accounts
- Principal Investigators (PIs) identify who should have TeraGrid accounts that can charge against their project's allocation
- PIs initiate the account creation process for authorized users via an administrative web page
- Units of a project's allocation are charged at rates corresponding to the resources used
142. Approaches to TeraGrid Use
- Log in interactively to a login node at a TeraGrid site and work from there
- no client software to install/maintain yourself
- execute tasks from your interactive session
- Work from your local workstation and authenticate remotely to TeraGrid resources
- comfort and convenience of working "at home"
- may have to install/maintain additional TG software
143. User Certificates for TeraGrid
- Why use certificates for authentication?
- Facilitates single sign-on
- enter your pass-phrase only once per session, regardless of how many systems and services you access on the Grid during that session
- one pass-phrase to remember (to protect your private key), instead of one for each system
- Widespread use and acceptance
- certificate-based authentication is standard for modern Web commerce and secure services
144. Certificate-Based Authentication
[Diagram: a client (Z) requests a certificate through a Registration Authority (RA), which verifies the request for the Certificate Authority (CA); the CA issues the signed certificate (A) to the client.]
145. TeraGrid Authentication -> Tasks
[Diagram: with an RA/CA-issued credential, a user authenticates once and then reaches TeraGrid services: the GIIS information service, HPC compute resources at multiple sites, data resources, and visualization resources.]
146. TeraGrid-Accepted CAs
- NCSA CA
- SDSC/NPACI CA
- PSC CA
- DOEGrids CA
- The NCSA CA and SDSC/NPACI CA will generate new TeraGrid User Certificates
147. New TeraGrid Account TODO List
- Use Secure Shell (SSH) to log into a TeraGrid site
- Change your password (WE'RE SKIPPING THIS STEP TODAY)
- Obtain a TeraGrid-acceptable User Certificate, and install it in your home directory, assuming you do not already have one
- Register your User Certificate in the Globus grid-mapfile on TeraGrid systems
- Securely copy your TeraGrid User Certificate and private key to your home workstation
- Test your User Certificate for remote authentication
- Initialize your TeraGrid user SRB collection (if applicable) (WE'RE SKIPPING THIS STEP TODAY)
148. 0. Logging into your Classroom Laptop Computer
- You have each been assigned a temporary TeraGrid user account trainNN, where NN is the number assigned to you for the duration of this course
- Log in to the laptop by entering your user account name (sc03) and the password provided (sc2003)
- Once logged in, open a Terminal (xterm)
- STOP and await further instructions...
149. 1. SSH to a TeraGrid Site
- ssh trainNN@tg-login1.ncsa.teragrid.org
- (Enter the password provided when prompted to do so)
- STOP and await further instructions...
150. 2a. Change your Account Password (WE'RE SKIPPING THIS STEP TODAY)
- Good password selection rules apply
- Do not use words that could be in any dictionary, including common or trendy misspellings of words
- Pick something easy for you to remember, but impossible for others to guess
- Pick something that you can learn to type quickly, using many different fingers
- Combine letters, digits, punctuation symbols, and capitalization
- Never use the same password for two different systems, nor for two different accounts
- If you must write your password down, do so away from prying eyes and lock it securely away!
151. 2b. Change your Account Password (WE'RE SKIPPING THIS STEP TODAY)
- Means for changing local passwords vary among systems
- local password on Linux and similar operating systems
- passwd
- Kerberos environments
- kpasswd
- systems managed using NIS
- yppasswd
- See site documentation for the correct method
- http://www.teragrid.org/docs/
152. 2c. Change your Account Password (WE'RE SKIPPING THIS STEP TODAY)
- kpasswd
- (Follow the prompts to enter your current user account password and then to enter (twice) your newly selected password)
- exit - to log out from tg-master2.ncsa.teragrid.org
- STOP and await further instructions...
153. 3a. User Certificate Request
- For this exercise, we will execute a command-line program to request a new TeraGrid User Certificate from the NCSA CA
- NCSA CA User Cert instructions are available at
- http://www.ncsa.uiuc.edu/UserInfo/Grid/Security/GetUserCert.html
- For SDSC/NPACI CA User Certificates, a similar program may be used, or the web interface at
- https://hotpage.npaci.edu/accounts/cgi-bin/create_certificate.cgi
154. 3b. User Certificate Request (WE'RE SKIPPING THIS STEP TODAY)
- Log into a TeraGrid login node at NCSA
- > ssh trainNN@tg-login1.ncsa.teragrid.org
- (use your new password to log in)
- STOP and await further instructions...
155. A1. New step for today...
- Execute ls -a in your home directory on tg-login1.ncsa.teragrid.org
- If you see a directory named .globus, AND no directory named .globus-sdsc, then STOP and await instructions to correct this
- We want to make sure you have the right .globus in place later for the exercises...
156. 3c. User Certificate Request
- Execute the NCSA CA User Certificate request script (NCSA Kerberos)
- > ncsa-cert-request
- (use your new password again to authenticate)
- STOP and await further instructions...
157. 3d. User Certificate Request
- When prompted, enter a pass-phrase for your new certificate (and a second time to verify)
- A pass-phrase may be a sentence with spaces
- Make it as long as you care to type "in the dark"
- Good password selection rules apply
- Write your pass-phrase down, but store it securely!
- Never allow your pass-phrase to be discovered by others - especially since this gets you in to multiple systems...
- If you lose your pass-phrase, it cannot be recovered - you must get a new certificate
158. 3e. User Certificate Request
- The certificate request script will place your new user certificate and private key into a .globus directory in your home directory
  > ls -la .globus
  total 24
  drwxr-xr-x  3 train00 train00 4096 Nov 17 13:45 .
  drwx------ 33 train00 train00 4096 Oct 17 20:17 ..
  -r--r--r--  1 train00 train00 2703 Nov 17 13:55 usercert.pem
  -r--r--r--  1 train00 train00 1420 Nov 17 13:50 usercert_request.pem
  -r--------  1 train00 train00  963 Nov 17 13:50 userkey.pem
- Your pass-phrase protects your private key
159. 3f. User Certificate Request
- Examine your new certificate
  > grid-cert-info -issuer -subject -startdate -enddate
  /C=US/O=National Center for Supercomputing Applications/CN=Certification Authority
  /C=US/O=National Center for Supercomputing Applications/CN=Training User00
  Jul 11 21:16:05 2003 GMT
  Jul 10 21:16:05 2004 GMT
- Your certificate's Subject is your certificate DN
- DN = Distinguished Name
160. 3g. User Certificate Request
- Test Globus certificate proxy generation
  > grid-proxy-init -verify -debug
  User Cert File: /home/train00/.globus/usercert.pem
  User Key File: /home/train00/.globus/userkey.pem
  Trusted CA Cert Dir: /etc/grid-security/certificates
  Output File: /tmp/x509up_u500
  Your identity: /C=U