Title: MONARC Simulation Framework
1 MONARC Simulation Framework
- Corina Stratan, Ciprian Dobre (UPB)
- Iosif Legrand, Harvey Newman (CALTECH)
2 The GOALS of the Simulation Framework
- The aim of this work is to continue and improve the development of the MONARC simulation framework:
- To perform realistic simulation and modelling of large-scale distributed computing systems, customised for specific HEP applications.
- To offer a dynamic and flexible simulation environment to be used as a design tool for large distributed systems.
- To provide a design framework to evaluate the performance of a range of possible computer systems, as measured by their ability to provide the physicists with the requested data in the required time, and to optimise the cost.
3 A Global View for Modelling
[Diagram: MONITORING of REAL Systems and Testbeds feeding the modelling process]
4 Design Considerations
- This simulation framework is not intended to be a detailed simulator for basic components such as operating systems, database servers or routers.
- Instead, based on realistic mathematical models and on parameters measured on test-bed systems for all the basic components, it aims to correctly describe the performance and limitations of large distributed systems with complex interactions.
5 Simulation Engine
6 Design Considerations of the Simulation Engine
- A process-oriented approach to discrete event simulation is well suited to describing concurrently running programs.
- Active objects (having an execution thread, a program counter, stack...) provide an easy way to map the structure of a set of distributed running programs into the simulation environment.
- The simulation engine supports an interrupt scheme. This allows effective and correct simulation of concurrent processes with very different time scales, by using a DES approach with a continuous process flow between events.
7 The Simulation Engine: Tasks and Events
- Task: simulates an entity with time-dependent behavior (active object, server, ...).
- A task has 5 possible states: CREATED, READY, RUNNING, FINISHED, WAITING.
- [State diagram] Created -> Ready -> Running when the task is assigned to a worker thread; Running -> Waiting via semaphore.p(), and Waiting -> Ready via semaphore.v() when an event happens or the sleeping period is over; Running -> Finished.
- Each task maintains an internal semaphore necessary for switching between states.
- Event: used for communication and synchronization between tasks. When a task must notify another task about something that happened or will happen in the future, it creates an event addressed to that task. The events are queued and sent to the destination tasks by the engine's scheduler.
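The semaphore-based state switching above can be sketched in a few lines. This is a minimal illustration, not the MONARC engine's actual classes; all names (SimTask, wakeUp, waitForEvent) are hypothetical:

```java
// Hypothetical sketch of a task with an internal semaphore: the task's
// thread blocks on p() in WAITING and the scheduler resumes it with v().
import java.util.concurrent.Semaphore;

public class SimTask implements Runnable {
    public enum State { CREATED, READY, RUNNING, WAITING, FINISHED }

    private final Semaphore sem = new Semaphore(0);  // internal semaphore
    public volatile State state = State.CREATED;

    // Called by the engine's scheduler: semaphore.v(), wakes the task
    // when an event arrives or its sleeping period is over.
    public void wakeUp() { sem.release(); }

    // Called from the task's own thread: semaphore.p(), blocks until
    // the scheduler wakes the task up again.
    public void waitForEvent() {
        state = State.WAITING;
        sem.acquireUninterruptibly();
        state = State.RUNNING;
    }

    @Override public void run() {
        state = State.RUNNING;
        waitForEvent();          // do some work, then wait for an event
        state = State.FINISHED;
    }
}
```

Pre-releasing the semaphore before acquiring it lets a notification that arrives "early" still be consumed, which is why the semaphore (rather than a plain wait/notify flag) fits this scheme.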
8 Tests of the Engine
- Processing a TOTAL of 100,000 simple jobs on 1, 10, 100, 1000, 2000, 4000, 10000 CPUs, using the same number of parallel threads.
- More tests: http://monarc.cacr.caltech.edu/
9 Basic Components
10 Basic Components
- These basic components are capable of simulating the core functionality of general distributed computing systems. They are constructed on top of the simulation engine and make efficient use of the interrupt functionality implemented for the active objects.
- These components should be considered the base classes from which specific components can be derived and constructed.
11 Basic Components
- Computing Nodes
- Network Links and Routers, I/O protocols
- Data Containers
- Servers
  - Data Base Servers
  - File Servers (FTP, NFS)
- Jobs
  - Processing Jobs
  - FTP jobs
- Scripts (graph execution schemes)
- Basic Scheduler
- Activities (a time sequence of jobs)
12 Multitasking Processing Model
- Concurrent running tasks share resources (CPU, memory, I/O).
- Interrupt-driven scheme: for each new task, or when one task is finished, an interrupt is generated and all processing times are recomputed.
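The interrupt-driven recomputation can be sketched as follows, assuming equal CPU sharing among the active jobs (a simplification; names like SharedCpu and onInterrupt are hypothetical, not MONARC API):

```java
// Hypothetical sketch: on every interrupt (job arrival or completion),
// charge the work done since the last interrupt under equal sharing,
// then re-estimate each job's completion time.
import java.util.ArrayList;
import java.util.List;

public class SharedCpu {
    public static class Job {
        public double remainingWork;   // work left, in CPU-power x seconds
        public Job(double w) { remainingWork = w; }
    }

    private final double cpuPower;          // total power of the node
    private final List<Job> active = new ArrayList<>();
    private double lastInterrupt = 0.0;     // time of the last recompute

    public SharedCpu(double cpuPower) { this.cpuPower = cpuPower; }

    // Interrupt handler: first account for the elapsed interval,
    // then apply the arrival / completion that caused the interrupt.
    public void onInterrupt(double now, Job arriving, Job finished) {
        double share = active.isEmpty() ? 0 : cpuPower / active.size();
        for (Job j : active) j.remainingWork -= share * (now - lastInterrupt);
        if (finished != null) active.remove(finished);
        if (arriving != null) active.add(arriving);
        lastInterrupt = now;
    }

    // Estimated completion time, assuming no further interrupts occur.
    public double estimatedFinish(double now, Job j) {
        return now + j.remainingWork / (cpuPower / active.size());
    }
}
```

Between interrupts nothing needs to be simulated, which is exactly the "continuous process flow between events" mentioned for the engine.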
13 LAN/WAN Simulation Model
[Diagram: Nodes and Links on LANs, connected through ROUTERs and Internet Connections]
- Interrupt-driven simulation: for each new message an interrupt is created, and for all the active transfers the speed and the estimated time to complete the transfer are recalculated.
- Continuous flow between events! An efficient and realistic way to simulate concurrent transfers having different sizes / protocols.
14 Network Model
- data traffic simulated for both local and wide area networks
- a simulation at the packet level is practically impossible, so we adopted a larger-scale approach based on an interrupt mechanism
- Components of the network model: Network Entity (LAN, WAN, LinkPort)
  - main attribute: bandwidth
  - keeps track of the messages that traverse it
15 Simulating the Network Transfers
- interrupt mechanism similar to the one used for job execution simulation
- the initial speed of a message is determined by evaluating the bandwidth that each entity on the route can offer
- different network protocols can be modelled
[Diagram: a new message travels from a CPU through a LinkPort, the Caltech LAN, Caltech Router, Caltech WAN, CERN WAN, CERN Router and the CERN LAN to a CPU; Message1, Message2 and Message3 on the route receive interrupts (INT)]
1. The route and the available bandwidth for the new message are determined.
2. The messages on the route are interrupted and their speeds are recalculated.
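The two steps above can be sketched with a simple equal-share bandwidth model (the real framework models protocols more carefully; NetworkModel, Entity and startTransfer are hypothetical names):

```java
// Hypothetical sketch: a message's speed is the smallest share offered
// by the entities on its route, and a new message forces every transfer
// sharing those entities to be recalculated.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NetworkModel {
    public static class Entity {                 // LAN, WAN or LinkPort
        public final double bandwidth;           // main attribute
        public final List<Message> transfers = new ArrayList<>();
        public Entity(double bw) { bandwidth = bw; }
        double offeredShare() { return bandwidth / transfers.size(); }
    }

    public static class Message {
        public final List<Entity> route;
        public double speed;
        public Message(List<Entity> route) { this.route = route; }
    }

    // Step 1-2: register the new message on its route, then interrupt
    // and recalculate every transfer touching that route.
    public static void startTransfer(Message m) {
        for (Entity e : m.route) e.transfers.add(m);
        Set<Message> affected = new HashSet<>();
        for (Entity e : m.route) affected.addAll(e.transfers);
        for (Message t : affected) recalc(t);
    }

    public static void recalc(Message m) {
        double s = Double.MAX_VALUE;
        for (Entity e : m.route) s = Math.min(s, e.offeredShare());
        m.speed = s;
    }
}
```

With the speed fixed between interrupts, the time to complete each transfer follows directly from its remaining size, so no per-packet events are needed.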
16 Job Scheduling and Execution
[Diagram: Activity1 and Activity2 submit Jobs 3-7 (30-50 CPU units each) to a farm of CPU 1-3; the unit receiving a new job is interrupted (INT)]

    class Activity1 extends Activity {
        public void pushJobs() {
            Job newJob = new Job();
            addJob(newJob);
        }
    }

1. The activity class creates a job and submits it to the farm.
2. The job scheduler sends the new job to a CPU unit. All the jobs executing on that CPU are interrupted.
3. CPU power is reallocated on the unit where the new job was scheduled. The interrupted jobs re-estimate their completion times.
17 Output of the Simulation
[Diagram: components (Node, Router, User C) and the Simulation Engine publish through Output Listener Filters to GRAPHICS, a DB and Log Files / EXCEL]
- Any component in the system can generate generic results objects.
- Any client can subscribe with a filter and will receive the results it is interested in.
- VERY SIMILAR structure as in MonALISA. We will soon integrate the output of the simulation framework into MonALISA.
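The subscribe-with-a-filter scheme above is a publish/subscribe pattern; a minimal sketch could look like this (OutputManager, Result and the listener interface are hypothetical names, not the framework's real classes):

```java
// Hypothetical sketch of filtered result delivery: components publish
// generic result objects and each listener only receives the results
// matching the predicate it registered with.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

public class OutputManager {
    public static class Result {
        public final String component, parameter;
        public final double value;
        public Result(String c, String p, double v) {
            component = c; parameter = p; value = v;
        }
    }

    public interface OutputListener { void notifyResult(Result r); }

    private static final Map<OutputListener, Predicate<Result>> listeners =
            new LinkedHashMap<>();

    // A client subscribes with a filter selecting the results it wants.
    public static void subscribe(OutputListener l, Predicate<Result> filter) {
        listeners.put(l, filter);
    }

    // Any simulated component (node, router, DB...) publishes here.
    public static void publish(Result r) {
        listeners.forEach((l, f) -> { if (f.test(r)) l.notifyResult(r); });
    }
}
```

Because filtering happens at the publishing point, a graphics client, a log writer and an external system like MonALISA can each receive only the subset of results they care about.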
18 Specific Components
19 Specific Components
- These components should be derived from the basic components and must implement the specific characteristics and the way they will operate.
- Major parts:
  - Data Model
  - Data Flow Diagrams for Production and especially for Analysis Jobs
  - Scheduling / pre-allocation policies
  - Data Replication Strategies
20 Data Model
- Generic Data Container: Size, Event Type, Event Range, Access Count, INSTANCE
[Diagram: a META DATA Catalog / Replication Catalog maps containers to instances; instances live in a Data Base, on an FTP Server Node, a DB Server or an NFS Server (Custom Data Servers), with network FILEs moved via Export / Import]
21 Data Model (2)
[Diagram: a Data Processing JOB issues a Data Request to the META DATA Catalog / Replication Catalog, selects a Data Container from the options, and the JOB builds a List Of IO Transactions]
22 Database Functionality
- Client-server model.
- Automatic storage management is possible, with data being sent to mass storage units.
- 3 kinds of requests for the database server:
  - write
  - read
  - get (read the data and erase it from the server)
Automatic storage management example:
[Diagram: myJob writes to a DatabaseServer holding DB1 (DContainer 1, 2, 3, 20-24) and DB2 (DContainer 15, 16), backed by Mass Storage 1 and Mass Storage 2]
1. The job wants to write a container into the database DB1, but the server is out of storage space.
2. The least frequently used container is moved to a mass storage unit. The new container is written to the database.
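The least-frequently-used eviction in the example can be sketched as follows (a simplified illustration with hypothetical names; the access count plays the role of the container's Access Count attribute from the data model):

```java
// Hypothetical sketch: when a write does not fit, the container with
// the lowest access count is moved to mass storage until space frees up.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DatabaseServer {
    public static class DContainer {
        public final String name;
        public final double size;
        public int accessCount;
        public DContainer(String n, double s) { name = n; size = s; }
    }

    private final double capacity;
    private double used = 0;
    private final List<DContainer> stored = new ArrayList<>();
    public final List<DContainer> massStorage = new ArrayList<>();

    public DatabaseServer(double capacity) { this.capacity = capacity; }

    public void write(DContainer c) {
        while (used + c.size > capacity && !stored.isEmpty()) {
            // evict the least frequently used container to mass storage
            DContainer lfu = Collections.min(stored,
                    Comparator.comparingInt(x -> x.accessCount));
            stored.remove(lfu);
            massStorage.add(lfu);
            used -= lfu.size;
        }
        stored.add(c);
        used += c.size;
    }
}
```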
23 Data Flow Diagrams for JOBS
- Input and output is a collection of data, described by type and range.
- A process is described by its name.
- A fine-granularity decomposition of processes which can be executed independently, and of the way they communicate, can be very useful for optimization and parallel execution!
[Diagram: a graph of Processing 1-4 steps connected through Input/Output data collections; one branch is repeated 10x]
24 Job Scheduling: Centralized Scheme
[Diagram: the JobSchedulers of Site A and Site B are coordinated by a GLOBAL Job Scheduler, a dynamically loadable module]
25 Job Scheduling: Distributed Scheme (market model)
[Diagram: the JobScheduler of Site A sends a Request to the JobSchedulers of the other sites, receives a COST from each, and takes the DECISION]
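The market-model decision above amounts to asking every site for a cost and exporting the job to the cheapest one. A sketch, with a deliberately simple illustrative cost function (all names are hypothetical; real policies could also price the required data transfers):

```java
// Hypothetical sketch of the request/cost/decision cycle: each site
// quotes a cost for running the job, the scheduler picks the minimum.
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class MarketScheduler {
    public static class Site {
        public final String name;
        public final int cpus;
        public int queuedJobs;
        public Site(String name, int cpus) { this.name = name; this.cpus = cpus; }

        // Illustrative cost: projected load per CPU if the job lands here.
        public double cost() { return (double) (queuedJobs + 1) / cpus; }
    }

    // Collect the quotes and export the job to the cheapest site.
    public static Site schedule(List<Site> sites) {
        Site best = Collections.min(sites,
                Comparator.comparingDouble(Site::cost));
        best.queuedJobs++;
        return best;
    }
}
```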
26 Computing Models
27 Activities: Arrival Patterns
- A flexible mechanism to define the stochastic process of how users perform data processing tasks.
- Dynamic loading of Activity tasks, which are threaded objects and are controlled by the simulation scheduling mechanism.
- Physics Activities Injecting Jobs: each Activity thread generates data processing jobs.
- These dynamic objects are used to model the users' behavior.
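The slides leave the arrival distribution open; one common choice for such a stochastic job-injection process is exponential inter-arrival times (a Poisson process). A sketch under that assumption, with hypothetical names:

```java
// Hypothetical sketch of an Activity injecting jobs with exponential
// inter-arrival times, i.e. a Poisson arrival pattern with a mean rate.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PhysicsActivity {
    private final Random rnd;
    private final double jobsPerHour;        // mean arrival rate

    public PhysicsActivity(double jobsPerHour, long seed) {
        this.jobsPerHour = jobsPerHour;
        this.rnd = new Random(seed);
    }

    // Exponential inter-arrival time (in hours) via inverse transform.
    public double nextInterArrival() {
        return -Math.log(1.0 - rnd.nextDouble()) / jobsPerHour;
    }

    // Submission times of the jobs injected during one 24h day.
    public List<Double> submissionTimes() {
        List<Double> times = new ArrayList<>();
        for (double t = nextInterArrival(); t < 24.0; t += nextInterArrival())
            times.add(t);
        return times;
    }
}
```

A daily usage pattern (e.g. more jobs during working hours) can be modelled by making the rate a function of the time of day.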
28 Regional Centre Model
[Diagram: simplified topology of the Centers A, B, C, D, E]
29 MONARC - Main Classes
30 Monitoring
31 Real Need for Flexible Monitoring Systems
- It is important to measure and monitor the key applications in a well-defined test environment and to extract the parameters we need for modeling.
- Monitor the farms used today, try to understand how they work, and simulate such systems.
- This requires a flexible monitoring system able to dynamically add new parameters and to provide access to historical data.
- Interfacing monitoring tools to get the parameters we need in simulations in a nearly automatic way.
- MonALISA was designed and developed based on the experience with the simulation problems.
32 EXAMPLES
33 FTP and NFS Clusters
- These examples evaluate the performance of a local area network with a server and several worker stations. The server stores events used by the processing nodes.
- NFS example: the server concurrently delivers the events, one by one, to the clients.
- FTP example: the server sends a whole file of events in a single transfer.
34 FTP Cluster
- 50 CPU units x 2 jobs per unit; 100 events per job, event size 1 MB; LAN bandwidth 1 Gbps, server's effective bandwidth 60 Mbps.
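As a rough sanity check on this scenario (not from the slide), the server's effective bandwidth gives a lower bound on the total transfer time, since it is far below the LAN's 1 Gbps:

```java
// Back-of-envelope lower bound for the FTP-cluster transfer time,
// ignoring protocol overhead and any overlap with processing.
// All figures come from the slide's parameters.
public class FtpClusterEstimate {
    public static double lowerBoundSeconds() {
        double jobs = 50 * 2;               // 50 CPU units x 2 jobs each
        double mbPerJob = 100 * 1.0;        // 100 events x 1 MB
        double totalMegabits = jobs * mbPerJob * 8;
        double serverMbps = 60;             // server's effective bandwidth
        return totalMegabits / serverMbps;  // LAN (1 Gbps) is not the bottleneck
    }
}
```

This works out to roughly 1330 seconds just to move the 10 GB of events through the server.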
35 NFS Cluster
36 Distributed Scheduling
- Job migration: when a regional center is assigned too many jobs, it sends a part of them to other centers with more free resources.
- A new job scheduler was implemented which supports job migration, applying load balancing criteria.
[Diagram: regional centers export() jobs to each other]
- We tested different configurations, with 1, 2 and 4 regional centers, and with different numbers of CPUs per regional center. The number of jobs submitted is kept constant, the job arrival rate varying during a day.
37 Distributed Scheduling (2)
- Test case:
  - 4 regional centers, 20 CPUs per center
  - average job processing time 3 h; approx. 500 jobs per day submitted in a center
- Average processing time and CPU usage measured for 1, 2, 4, 6 centers.
38 Distributed Scheduling (3)
- similar to the previous example, but the jobs are more complex, involving network transfers
- centers connected in a chain configuration (chain WAN connection)
- Every job submitted to a regional center needs an amount of data located in that center. If the job is exported to another center, would the benefits be great enough to compensate for the cost of the data transfer?
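The trade-off posed by that question can be made concrete: exporting pays off only if the remote center's shorter queue outweighs the time to move the job's input data over the WAN. A sketch with hypothetical names and deliberately simple time estimates:

```java
// Hypothetical sketch: compare the estimated completion time of running
// a job locally versus exporting it (queue wait + processing, plus the
// WAN transfer of the job's input data in the exported case).
public class ExportDecision {
    // Estimated completion time if the job runs in its home center.
    public static double localTime(double queueWaitLocal, double procTime) {
        return queueWaitLocal + procTime;
    }

    // Estimated completion time if the job is exported: its input data
    // must first cross the WAN link.
    public static double remoteTime(double queueWaitRemote, double procTime,
                                    double dataMegabits, double wanMbps) {
        return dataMegabits / wanMbps + queueWaitRemote + procTime;
    }

    public static boolean shouldExport(double queueWaitLocal,
                                       double queueWaitRemote,
                                       double procTime, double dataMegabits,
                                       double wanMbps) {
        return remoteTime(queueWaitRemote, procTime, dataMegabits, wanMbps)
             < localTime(queueWaitLocal, procTime);
    }
}
```

In the chain configuration, the WAN bandwidth term is what makes exporting from the middle centers more expensive, consistent with the traffic pattern reported on the next slides.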
39 Distributed Scheduling (4)
- The average processing time increases significantly when reducing the bandwidth and the number of CPUs.
- The network transfers are more intense in the centers from the middle of the chain (like Caltech).
40 Distributed Scheduling (5)
41 Local Data Replication
- Evaluates the performance improvements that can be obtained by replicating data.
- We simulated a regional center which has a number of database servers, and another four centers which host jobs that process the data on those database servers.
- Better performance can be obtained if the data from the servers is replicated into the other regional centers.
42 Local Data Replication (2)
43 WAN Data Replication
- similar to the previous example, but now with two central servers, each holding an equal amount of replicated data, and eight satellite regional centers hosting worker jobs
- a worker job gets a number of events from one of the central regional centers (one event at a time) and processes them locally
- Two strategies are compared: the workers choose the best server to get the data from, using a replication load-balancing service (which knows the load of the network and of the servers), VS the server is chosen randomly.
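The two server-selection strategies being compared can be sketched side by side (ReplicaSelector and the load metric are hypothetical; the real load-balancing service also accounts for network load):

```java
// Hypothetical sketch of the compared strategies: pick the replica
// server with the lowest observed load, versus pick one at random.
import java.util.List;
import java.util.Random;

public class ReplicaSelector {
    public static class Server {
        public final String name;
        public double load;       // e.g. fraction of maximum supported load
        public Server(String name, double load) {
            this.name = name; this.load = load;
        }
    }

    // Load-balancing choice: the least loaded server.
    public static Server byLoad(List<Server> servers) {
        Server best = servers.get(0);
        for (Server s : servers) if (s.load < best.load) best = s;
        return best;
    }

    // Baseline: uniformly random choice, blind to server load.
    public static Server random(List<Server> servers, Random rnd) {
        return servers.get(rnd.nextInt(servers.size()));
    }
}
```

The random baseline sends half the requests to each server regardless of capacity, which is why the gap between the strategies widens in the asymmetric case on the next slide.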
44 WAN Data Replication (2)
- Both servers have the same bandwidth and support the same maximum load: better average response time, and the total execution time is smaller when taking decisions based on load balancing.
- Also tested: one server has half of the other's bandwidth and supports half of its maximum load.
45 Summary
- Modelling and understanding current systems, their performance and limitations, is essential for the design of large-scale distributed processing systems. This will require continuous iterations between modelling and monitoring.
- Simulation and modelling tools must provide the functionality to help in designing complex systems and in evaluating different strategies and algorithms for the decision-making units and the data flow management.

http://monarc.cacr.caltech.edu/