Title: GRID superscalar: a programming model for the Grid
GRID superscalar: a programming model for the Grid
Doctoral Thesis, Computer Architecture Department, Technical University of Catalonia
- Raül Sirvent Pardell
- Advisor: Rosa M. Badia Sala
Outline
- Introduction
- Programming interface
- Runtime
- Fault tolerance at the programming model level
- Conclusions and future work
Outline
- Introduction
- 1.1 Motivation
- 1.2 Related work
- 1.3 Thesis objectives and contributions
- Programming interface
- Runtime
- Fault tolerance at the programming model level
- Conclusions and future work
1.1 Motivation
- The Grid architecture layers:
  - Applications
  - Grid Middleware (job management, data transfer, security, information, QoS, ...)
  - Distributed Resources
1.1 Motivation
- What middleware should I use?
1.1 Motivation
- Programming tools: are they easy to use? Grid-aware vs. Grid-unaware approaches.
1.1 Motivation
- Can I run my programs in parallel? Explicit parallelism vs. implicit parallelism.
- Implicit: write plain sequential code and let the system extract the parallelism:

  for (i = 0; i < MSIZE; i++)
    for (j = 0; j < MSIZE; j++)
      for (k = 0; k < MSIZE; k++)
        matmul(A(i,k), B(k,j), C(i,j));

- Explicit: drawing the fork/join structure by hand.
1.1 Motivation
- The Grid: a massive, dynamic and heterogeneous environment, prone to failures
- Study different techniques to detect and overcome failures:
  - Checkpointing
  - Retries
  - Replication
1.2 Related work

  System    | Grid unaware | Implicit parallelism | Language
  ----------|--------------|----------------------|------------
  Triana    | No           | No                   | Graphical
  Satin     | Yes          | No                   | Java
  ProActive | Partial      | Partial              | Java
  Pegasus   | Yes          | Partial              | VDL
  Swift     | Yes          | Partial              | SwiftScript
1.3 Thesis objectives and contributions
- Objective: create a programming model for the Grid:
  - Grid unaware
  - Implicit parallelism
  - Sequential programming
  - Allows the use of well-known imperative languages
  - Speeds up applications
  - Includes fault detection and recovery
1.3 Thesis objectives and contributions
- Contribution: GRID superscalar
  - Programming interface
  - Runtime environment
  - Fault tolerance features
Outline
- Introduction
- Programming interface
- 2.1 Design
- 2.2 User interface
- 2.3 Programming comparison
- Runtime
- Fault tolerance at the programming model level
- Conclusions and future work
2.1 Design
- Interface objectives
- Grid unaware
- Implicit parallelism
- Sequential programming
- Allows the use of well-known imperative languages
2.1 Design
- Target applications:
  - Algorithms that can easily be split into tasks: branch and bound computations, divide and conquer algorithms, recursive algorithms, ...
  - Coarse-grained tasks
  - Independent tasks: scientific workflows, optimization algorithms, parameter sweeps, ...
  - Main parameters are FILES: external simulators, finite element solvers, BLAST, GAMESS
2.1 Design
- Application architecture: a master-worker paradigm
  - The master-worker parallel paradigm fits our objectives
  - Main program: the master
  - Functions: the workers
  - Function: a generic representation of a task
- Glue to transform a sequential application into a master-worker application: stubs and skeletons (as in RMI, RPC, ...)
  - Stub: call to the runtime interface
  - Skeleton: binary which calls the user function
2.1 Design
- Local scenario:

  void matmul(char *f1, char *f2, char *f3)
  {
      getBlocks(f1, f2, f3, &A, &B, &C);
      for (i = 0; i < A->rows; i++)
          for (j = 0; j < B->cols; j++)
              for (k = 0; k < A->cols; k++)
                  C->data[i][j] += A->data[i][k] * B->data[k][j];
      putBlocks(f1, f2, f3, A, B, C);
  }

  for (i = 0; i < MSIZE; i++)
      for (j = 0; j < MSIZE; j++)
          for (k = 0; k < MSIZE; k++)
              matmul(A(i,k), B(k,j), C(i,j));
172.1 Design
app.c
app-functions.c
app-functions.c
app-functions.c
app-functions.c
app-functions.c
app-functions.c
Middleware
Master-Worker paradigm
2.1 Design
- Intermediate language concept: assembler code
- In GRIDSs:
  - The Execute generic interface
  - The instruction set is defined by the user
  - Single entry point to the runtime
  - Allows easy building of programming language bindings (Java, Perl, Shell Script)
  - Easier technology adoption
- Analogy: just as C/C++ is compiled to assembler for processor execution, with GRIDSs a C/C++ program becomes a workflow for Grid execution.
2.2 User interface
- Steps to program an application:
  - Task definition:
    - Identify those functions/programs in the application that are going to be executed in the computational Grid
    - All parameters must be passed in the header (remote execution)
  - Interface Definition Language (IDL):
    - For every task defined, identify which parameters are input/output files and which are input/output scalars
  - Programming API, master and worker:
    - Write the main program and the tasks using the GRIDSs API
2.2 User interface
- Interface Definition Language (IDL) file:
  - CORBA-IDL-like interface
  - in/out/inout files
  - in/out/inout scalar values
  - The functions listed in this file will be executed in the Grid

  interface MATMUL {
      void matmul(in File f1, in File f2, inout File f3);
  };
2.2 User interface
- Programming API: master (app.c) and worker (app-functions.c); a usage sketch follows below.
- Master side:
  - GS_On
  - GS_Off
  - GS_FOpen/GS_FClose
  - GS_Open/GS_Close
  - GS_Barrier
  - GS_Speculative_End
- Worker side:
  - GS_System
  - gs_result
  - GS_Throw
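As an illustration, a minimal master program using this API might look as follows. This is a sketch, not code from the thesis: the header name and the exact GS_FOpen signature are assumptions, and sim is a hypothetical IDL-declared task.

  #include <stdio.h>
  #include "gs_master.h"               /* assumed name of the generated master header */

  void sim(char *input, char *output); /* hypothetical IDL task: in File, out File */

  int main(void)
  {
      GS_On();                              /* start the runtime */

      sim("input1.txt", "output1.txt");     /* submitted as Grid tasks; being */
      sim("input2.txt", "output2.txt");     /* independent, they may run in parallel */

      GS_Barrier();                         /* wait for all running tasks */

      /* A file produced by a task must be opened through GS_FOpen so the
         runtime can synchronize with its last writer (fopen-like use assumed). */
      FILE *f = GS_FOpen("output1.txt", "r");
      /* ... inspect the results ... */
      GS_FClose(f);

      GS_Off(0);                            /* final barrier and remote cleanup */
      return 0;
  }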
2.2 User interface
- Task constraints and cost specification (sketch below):
  - Constraints allow specifying the needs of a task (CPU, memory, architecture, software, ...)
    - Build an expression in a constraint function (evaluated for every machine), e.g. other.Mem > 1024
  - Cost: the estimated execution time of a task (in seconds), useful for scheduling
    - Calculated in a cost function, e.g. cost = operations / GS_GFlops()
    - GS_GFlops / GS_Filesize may be used
    - An external estimator can also be called
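For instance, the two expressions above could be supplied through per-task interface functions along these lines; the function names and signatures are illustrative assumptions, only the expressions come from the slide.

  /* Constraint function: a ClassAd-style expression evaluated for every
     candidate machine ("other" denotes the machine's attributes). */
  char *sim_constraints(void)
  {
      return "other.Mem > 1024";       /* the task needs more than 1 GB of memory */
  }

  /* Cost function: the estimated execution time of one task, in seconds. */
  double sim_cost(void)
  {
      double operations = 12.5;        /* assumed workload of the task, in GFlop */
      return operations / GS_GFlops(); /* GS_GFlops(): speed of the candidate machine */
  }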
2.3 Programming comparison
- Grid-aware, explicit parallelism (Globus GRAM):

  int main()
  {
      rsl = "&(executable=/home/user/sim)(arguments=input1.txt output1.txt)"
            "(file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input1.txt /home/user/input1.txt))"
            "(file_stage_out=/home/user/output1.txt gsiftp://bscgrid01.bsc.es/path/output1.txt)"
            "(file_clean_up=/home/user/input1.txt /home/user/output1.txt)";
      globus_gram_client_job_request("bscgrid02.bsc.es", rsl, NULL, NULL);

      rsl = "&(executable=/home/user/sim)(arguments=input2.txt output2.txt)"
            "(file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input2.txt /home/user/input2.txt))"
            "(file_stage_out=/home/user/output2.txt gsiftp://bscgrid01.bsc.es/path/output2.txt)"
            "(file_clean_up=/home/user/input2.txt /home/user/output2.txt)";
      globus_gram_client_job_request("bscgrid03.bsc.es", rsl, NULL, NULL);

      rsl = "&(executable=/home/user/sim)(arguments=input3.txt output3.txt)"
            "(file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input3.txt /home/user/input3.txt))"
            "(file_stage_out=/home/user/output3.txt gsiftp://bscgrid01.bsc.es/path/output3.txt)"
            "(file_clean_up=/home/user/input3.txt /home/user/output3.txt)";
      globus_gram_client_job_request("bscgrid04.bsc.es", rsl, NULL, NULL);
  }
2.3 Programming comparison
- The same application with GRID superscalar (Grid-unaware, implicit parallelism):

  void sim(File input, File output)
  {
      command = "/home/user/sim " + input + ' ' + output;
      gs_result = GS_System(command);
  }

  int main()
  {
      GS_On();
      sim("/path/input1.txt", "/path/output1.txt");
      sim("/path/input2.txt", "/path/output2.txt");
      sim("/path/input3.txt", "/path/output3.txt");
      GS_Off(0);
  }
2.3 Programming comparison
[Figure: workflow with tasks A, B, C and D: A precedes B and C; B and C precede D.]
- Condor DAGMan (explicit parallelism; no if/while clauses):

  JOB A A.condor
  JOB B B.condor
  JOB C C.condor
  JOB D D.condor
  PARENT A CHILD B C
  PARENT B C CHILD D

- GRID superscalar:

  int main()
  {
      GS_On();
      task_A(f1, f2, f3);
      task_B(f2, f4);
      task_C(f3, f5);
      task_D(f4, f5, f6);
      GS_Off(0);
  }
2.3 Programming comparison
- Grid-aware, explicit parallelism (GridRPC):

  int main()
  {
      grpc_initialize("config_file");
      grpc_object_handle_init_np("A", &A_h, "class");
      grpc_object_handle_init_np("B", &B_h, "class");
      for (i = 0; i < 25; i++) {
          grpc_invoke_async_np(A_h, "foo", &sid, f_in[2*i],   f_out[2*i]);
          grpc_invoke_async_np(B_h, "foo", &sid, f_in[2*i+1], f_out[2*i+1]);
      }
      grpc_wait_all();
      grpc_object_handle_destruct_np(A_h);
      grpc_object_handle_destruct_np(B_h);
      grpc_finalize();
  }

- GRID superscalar:

  int main()
  {
      GS_On();
      for (i = 0; i < 50; i++)
          foo(f_in[i], f_out[i]);
      GS_Off(0);
  }
2.3 Programming comparison
- VDL (no if/while clauses):

  DV trans1( a2=@output:tmp.0, a1=@input:filein.0 );
  DV trans2( a2=@output:fileout.0, a1=@input:tmp.0 );
  DV trans1( a2=@output:tmp.1, a1=@input:filein.1 );
  DV trans2( a2=@output:fileout.1, a1=@input:tmp.1 );
  ...
  DV trans1( a2=@output:tmp.999, a1=@input:filein.999 );
  DV trans2( a2=@output:fileout.999, a1=@input:tmp.999 );

- GRID superscalar:

  int main()
  {
      GS_On();
      for (i = 0; i < 1000; i++) {
          tmp = "tmp." + i;
          filein = "filein." + i;
          fileout = "fileout." + i;
          trans1(tmp, filein);
          trans2(fileout, tmp);
      }
      GS_Off(0);
  }
Outline
- Introduction
- Programming interface
- Runtime
- 3.1 Scientific contributions
- 3.2 Developments
- 3.3 Evaluation tests
- Fault tolerance at the programming model level
- Conclusions and future work
3.1 Scientific contributions
- Runtime objectives:
  - Extract the implicit parallelism in sequential applications
  - Speed up execution using the Grid
- Main requirement: Grid middleware for
  - Job management
  - Data transfer
  - Security
3.1 Scientific contributions
- Apply computer architecture knowledge to the Grid (the superscalar processor analogy): instructions take ~ns, while Grid tasks take seconds, minutes or hours.
3.1 Scientific contributions
- Data dependence analysis to allow parallelism (example below):
  - task1(..., f1); task2(f1, ...): Read after Write (RaW)
  - task1(f1, ...); task2(..., f1): Write after Read (WaR)
  - task1(..., f1); task2(..., f1): Write after Write (WaW)
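For example, with a hypothetical IDL task update(in File, out File), the three dependence types show up as follows:

  void update(char *in_file, char *out_file); /* hypothetical task: reads 1st, writes 2nd */

  void example(void)
  {
      update("a.txt", "f1.txt");  /* writes f1.txt                                */
      update("f1.txt", "b.txt");  /* reads  f1.txt: RaW, must wait for first call */
      update("c.txt", "f1.txt");  /* writes f1.txt: WaR with the read above       */
      update("d.txt", "f1.txt");  /* writes f1.txt: WaW with the first write      */
  }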
3.1 Scientific contributions

  for (i = 0; i < MSIZE; i++)
      for (j = 0; j < MSIZE; j++)
          for (k = 0; k < MSIZE; k++)
              matmul(A(i,k), B(k,j), C(i,j));

[Figure: dependence graph under construction. For i = 0, j = 0, the k = 0, 1, 2 calls matmul(A(0,0), B(0,0), C(0,0)), matmul(A(0,1), B(1,0), C(0,0)) and matmul(A(0,2), B(2,0), C(0,0)) form a chain on C(0,0); for i = 0, j = 1 an independent chain starts on C(0,1).]
3.1 Scientific contributions

  for (i = 0; i < MSIZE; i++)
      for (j = 0; j < MSIZE; j++)
          for (k = 0; k < MSIZE; k++)
              matmul(A(i,k), B(k,j), C(i,j));

[Figure: the complete graph contains one independent three-task chain per (i, j) pair (i = 0, j = 0..2; i = 1, j = 0..2; ...), so all chains can run in parallel.]
3.1 Scientific contributions
- File renaming to increase parallelism:
  - task1(..., f1); task2(f1, ...): Read after Write, unavoidable
  - task1(f1, ...); task2(..., f1): Write after Read, avoidable by renaming: task2(..., f1_NEW)
  - task1(..., f1); task2(..., f1): Write after Write, avoidable by renaming: task2(..., f1_NEW)
3.2 Developments
- Basic functionality:
  - Job submission (middleware usage): select sources for input files; submit, monitor or cancel jobs; collect results
- API implementation:
  - GS_On: read the configuration file and environment
  - GS_Off: wait for tasks, clean up remote data, undo renamings
  - GS_(F)Open: create a local task
  - GS_(F)Close: notify the end of a local task
  - GS_Barrier: wait for all running tasks to finish
  - GS_System: translate paths
  - GS_Speculative_End: barrier until a throw; if one occurs, discard the tasks between the GS_Throw and the GS_Speculative_End
  - GS_Throw: uses gs_result to notify it
3.2 Developments
[Figure: task scheduling: the runtime builds a Directed Acyclic Graph of tasks and submits ready tasks through the middleware.]
3.2 Developments
- Task scheduling: resource brokering
  - A resource broker is needed (but is not an objective of this thesis)
  - Grid configuration file:
    - Information about hosts (hostname, limit of jobs, queue, working directory, quota, ...)
    - Initial set of machines (can be changed dynamically)

  <?xml version="1.0" encoding="UTF-8"?>
  <project isSimple="yes" masterBandwidth="100000" masterBuildScript=""
           masterInstallDir="/home/rsirvent/matmul-master" masterName="bscgrid01.bsc.es"
           masterSourceDir="/datos/GRID-S/GT4/doc/examples/matmul" name="matmul"
           workerBuildScript="" workerSourceDir="/datos/GRID-S/GT4/doc/examples/matmul">
  ...
  <workers>
    <worker Arch="x86" GFlops="5.985" LimitOfJobs="2" Mem="1024" NCPUs="2"
            NetKbps="100000" OpSys="Linux" Queue="none" Quota="0"
            deploymentStatus="deployed" installDir="/home/rsirvent/matmul-worker"
            name="bscgrid01.bsc.es">
3.2 Developments
- Task scheduling: resource brokering
  - Scheduling policy: estimate the total execution time of a single task on every candidate resource (see the sketch below)
    - FileTransferTime: the time to transfer the files the task needs to a resource (calculated from the hosts information and the current location of files); the fastest source is selected for each file
    - ExecutionTime: an estimation of the task's run time on a resource, given by an interface function (it can be calculated, or estimated by an external entity); the fastest resource is selected for execution
  - The resource with the smallest total estimate is selected
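A minimal sketch of that selection rule, with illustrative types rather than the runtime's actual code:

  #include <float.h>

  typedef struct {
      double transfer_time;   /* FileTransferTime: stage this task's inputs here    */
      double execution_time;  /* ExecutionTime: cost-function estimate on this host */
  } Estimate;

  /* Return the index of the resource with the smallest total estimate. */
  int pick_resource(const Estimate *est, int n_resources)
  {
      double best = DBL_MAX;
      int chosen = -1;
      for (int i = 0; i < n_resources; i++) {
          double total = est[i].transfer_time + est[i].execution_time;
          if (total < best) {
              best = total;
              chosen = i;
          }
      }
      return chosen;
  }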
3.2 Developments
- Task scheduling: resource brokering
  - Match task constraints against machine capabilities
  - Implemented using the ClassAd library:
    - A machine offers capabilities (from the Grid configuration file: memory, architecture, ...)
    - A task demands capabilities
  - Candidate machines are filtered for each particular task. For example, a machine offering SoftwareList = { "BLAST", "GAMESS" } satisfies a task demanding Software = "BLAST", while a machine offering only SoftwareList = { "GAMESS" } is filtered out.
403.2 Developments
f3
f3
Middleware
Task scheduling File locality
3.2 Developments
- Other file locality exploitation mechanisms:
  - Shared input disks (NFS or replicated data)
  - Shared working directories (NFS)
  - Erasing unused versions of files (decreases disk usage)
  - Disk quota control (locality increases disk usage, and the quota may be lower than expected)
3.3 Evaluation
- NAS Grid Benchmarks: a representative benchmark suite; includes different types of workflows which emulate a wide range of Grid applications
- Simple optimization example: representative of optimization algorithms; workflow with two-level synchronization
- New product and process development: production application; workflow with parallel chains of computation
- Potential energy hypersurface for acetone: massively parallel, long-running application
- Protein comparison: production application; a big computational challenge; massively parallel; high number of tasks
- fastDNAml: well-known application in the context of MPI for Grids; workflow with synchronization steps
3.3 Evaluation
[Figure: the four NAS Grid Benchmarks workflows: HC (Helical Chain), ED (Embarrassingly Distributed), MB (Mixed Bag) and VP (Visualization Pipe).]
3.3 Evaluation
- Run with classes S, W and A (2 machines x 4 CPUs)
- The VP benchmark must be analyzed in detail (it does not scale beyond 3 CPUs)
3.3 Evaluation
- Performance analysis:
  - The GRID superscalar runtime was instrumented
  - Paraver tracefiles were generated at the client side
  - The lifecycle of all tasks has been studied in detail
  - Identified overhead: the GRAM Job Manager polling interval
3.3 Evaluation
- VP.S task assignment:
  - Only 14.7% of the transfers remain when exploiting locality
  - VP is parallel, but its last part is executed sequentially

[Figure: VP.S task assignment: three BT -> MF -> MG -> MF -> FT chains mapped onto the machines Kadesh8 and Khafre, with the remote file transfers marked.]
3.3 Evaluation
- Conclusion: the workflow shape and the task granularity are important to achieve speed-up
3.3 Evaluation
- Two-dimensional potential energy hypersurface for acetone as a function of the φ1 and φ2 angles
3.3 Evaluation
- Number of executed tasks: 1120
- Each task runs for between 45 and 65 minutes
- Speed-up: 26.88 (32 CPUs), 49.17 (64 CPUs)
- Long-running test on a heterogeneous, transatlantic Grid (sites contributing 22, 14 and 28 CPUs)
3.3 Evaluation
- 15 million protein sequences have been compared using BLAST and GRID superscalar
[Figure: the 15 million proteins extracted from the genomes are compared against the full set of 15 million proteins.]
3.3 Evaluation
- 100,000 tasks on 4,000 CPUs (~1,000 exclusive nodes)
- A Grid of 1,000 machines with very low latency between them
- A stress test for the runtime
- Spares the user from working with the queuing system directly
- Saves the queuing system from handling a huge set of independent tasks
GRID superscalar: programming interface and runtime
- Publications:
  - Raül Sirvent, Josep M. Pérez, Rosa M. Badia, Jesús Labarta, "Automatic Grid workflow based on imperative programming languages", Concurrency and Computation: Practice and Experience, John Wiley & Sons, vol. 18, no. 10, pp. 1169-1186, 2006.
  - Rosa M. Badia, Raül Sirvent, Jesús Labarta, Josep M. Pérez, "Programming the GRID: An Imperative Language-based Approach", Engineering The Grid: Status and Perspective, Section 4, Chapter 12, American Scientific Publishers, January 2006.
  - Rosa M. Badia, Jesús Labarta, Raül Sirvent, Josep M. Pérez, José M. Cela and Rogeli Grima, "Programming Grid Applications with GRID Superscalar", Journal of Grid Computing, vol. 1, issue 2, 2003.
GRID superscalar: programming interface and runtime
- Work related to standards:
  - R.M. Badia, D. Du, E. Huedo, A. Kokossis, I.M. Llorente, R.S. Montero, M. de Palol, R. Sirvent and C. Vázquez, "Integration of GRID superscalar and GridWay Metascheduler with the DRMAA OGF Standard", Euro-Par, 2008.
  - Raül Sirvent, Andre Merzky, Rosa M. Badia, Thilo Kielmann, "GRID superscalar and SAGA: forming a high-level and platform-independent Grid programming environment", CoreGRID Integration Workshop, Integrated Research in Grid Computing, Pisa (Italy), 2005.
Outline
- Introduction
- Programming interface
- Runtime
- Fault tolerance at the programming model level
- 4.1 Checkpointing
- 4.2 Retry mechanisms
- 4.3 Task replication
- Conclusions and future work
4.1 Checkpointing
- Inter-task checkpointing (sketch below):
  - Recovers sequential consistency in the out-of-order execution of tasks
  - A single version of every file is saved
  - No runtime data structures need to be saved
  - Drawback: some completed tasks may be lost
    - An application-level checkpoint can avoid this

[Figure: a workflow of tasks 0-6 executing out of order, with the current checkpoint position marked at task 3.]
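A minimal sketch of the mechanism, assuming a single program-order sequence number is persisted (the file name and format are illustrative):

  #include <stdio.h>

  static int last_committed = -1;  /* highest task committed in program order */

  /* Called when the task numbered `seq` (in program order) completes.
     Only in-order completions advance the checkpoint, which is why some
     completed (out-of-order) tasks may be lost on a failure. */
  void checkpoint_task(int seq)
  {
      if (seq == last_committed + 1) {
          last_committed = seq;
          FILE *f = fopen(".gs_checkpoint", "w");
          if (f) {
              fprintf(f, "%d\n", last_committed);
              fclose(f);
          }
      }
  }

  /* On restart, tasks numbered <= the stored value are skipped. */
  int restore_checkpoint(void)
  {
      int n = -1;
      FILE *f = fopen(".gs_checkpoint", "r");
      if (f) {
          if (fscanf(f, "%d", &n) != 1)
              n = -1;
          fclose(f);
      }
      return n;
  }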
4.1 Checkpointing
- Conclusions:
  - Low complexity to checkpoint a task
  - ~1% overhead introduced
  - Can deal with both application-level and Grid-level errors
  - Most important when an unrecoverable error appears
  - Transparent to end users
4.2 Retry mechanisms
[Figure: automatic dropping of machines: when a machine stops responding, its task C is resubmitted to another machine through the middleware.]
4.2 Retry mechanisms
[Figure: soft and hard timeouts for tasks: a soft timeout triggers a check on the task through the middleware, which may conclude success or failure; the hard timeout gives up on the task entirely. A sketch follows below.]
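One plausible encoding of the two timers (the names and the exact actions taken are assumptions; the slide only fixes the soft/hard distinction):

  typedef struct {
      double elapsed; /* seconds since the task was submitted       */
      double soft;    /* soft timeout: suspicion, check on the task */
      double hard;    /* hard timeout: assume failure               */
  } TaskTimer;

  typedef enum { KEEP_WAITING, CHECK_STATUS, RESUBMIT } TimeoutAction;

  TimeoutAction on_timer(const TaskTimer *t)
  {
      if (t->elapsed > t->hard)
          return RESUBMIT;      /* give up: cancel and retry elsewhere */
      if (t->elapsed > t->soft)
          return CHECK_STATUS;  /* possible degradation: probe the task/machine */
      return KEEP_WAITING;
  }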
4.2 Retry mechanisms
[Figure: retry of operations: a failed middleware operation is retried, possibly against a different machine, until it succeeds.]
4.2 Retry mechanisms
- Conclusions:
  - The application keeps running despite failures
  - Dynamic decisions: when and where to resubmit
  - Performance degradations are detected
  - No overhead when no failures are detected
  - Transparent to end users
4.3 Task replication
[Figure: running tasks are replicated depending on their number of successors in the workflow (tasks 0-7); a sketch of the heuristic follows below.]
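A hedged sketch of such a heuristic: replicate the running task that blocks the most successors. All names are illustrative; the thesis only states that the number of successors drives the decision.

  #include <stddef.h>

  typedef struct {
      int n_successors; /* tasks that depend on this one in the workflow */
      int running;
      int replicated;
  } TaskInfo;

  /* Pick the running, not-yet-replicated task with the most successors. */
  TaskInfo *pick_replication_candidate(TaskInfo *tasks, int n)
  {
      TaskInfo *best = NULL;
      for (int i = 0; i < n; i++) {
          if (tasks[i].running && !tasks[i].replicated &&
              (best == NULL || tasks[i].n_successors > best->n_successors)) {
              best = &tasks[i];
          }
      }
      return best; /* submit a second copy; the first result to arrive is used */
  }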
4.3 Task replication
[Figure: running tasks are replicated to speed up the execution on a heterogeneous Grid (tasks 0-7).]
4.3 Task replication
- Conclusions:
  - Dynamic replication: application-level knowledge (the workflow) is used
  - Replication can deal with failures while hiding the retry overhead
  - Replication can speed up applications in heterogeneous Grids
  - Transparent to end users
  - Drawback: increased usage of resources
4. Fault tolerance features
- Publications:
  - Vasilis Dialinos, Rosa M. Badia, Raül Sirvent, Josep M. Pérez and Jesús Labarta, "Implementing Phylogenetic Inference with GRID superscalar", Cluster Computing and Grid 2005 (CCGRID 2005), Cardiff, UK, 2005.
  - Raül Sirvent, Rosa M. Badia and Jesús Labarta, "Graph-based task replication for workflow applications", submitted to HPCC 2009.
Outline
- Introduction
- Programming interface
- Runtime
- Fault tolerance at the programming model level
- Conclusions and future work
5. Conclusions and future work
- A Grid-unaware programming model
- Features transparent to users: parallelism exploitation and failure treatment
- Used in REAL systems and REAL applications
- Some future research is already ONGOING (StarSs)
5. Conclusions and future work
- Future work:
  - A Grid of supercomputers (Red Española de Supercomputación)
  - Larger-scale tests (hundreds? thousands?)
  - More complex brokering:
    - Resource discovery/monitoring
    - New scheduling policies based on the workflow
    - Automatic prediction of execution times
  - New policies for task replication
  - New architectures for StarSs