Slot Acquisition - PowerPoint PPT Presentation

Provided by: daniel282

Transcript and Presenter's Notes

Title: Slot Acquisition


1
Slot Acquisition
  • Presenter Daniel Nurmi

2
Scope
  • One aspect of a VGDL request is the time slot
    when resources are needed
      • Earliest time when the resource set is needed
      • Maximum duration the resource set will be used
  • Three classes of resources
      • Dedicated: always available
      • Batch controlled: lag before available
      • Advanced reservation: guaranteed availability
        in the future

3
Acquisition Routines
  • Each class of resource needs the following
    (logical) routines
      • Prob = Query(cluster, nodes, walltime, starttime)
      • Id = BindInit(cluster, nodes, walltime,
        starttime, success_prob)
      • Status = Check(id)
      • Status = Install(id)
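
The four logical routines can be written down as an abstract interface. This is only a sketch: the class name SlotResource, the Status enum, and the Python types are assumptions made for illustration, and only the routine names and argument lists come from the slide.

```python
from abc import ABC, abstractmethod
from enum import Enum


class Status(Enum):
    """Hypothetical status values; the slides only mention true/false/abort."""
    TRUE = "true"
    FALSE = "false"
    ABORT = "abort"


class SlotResource(ABC):
    """One class of resource: dedicated, batch controlled, or advanced reservation."""

    @abstractmethod
    def query(self, cluster, nodes, walltime, starttime) -> float:
        """Return the probability (0.0 to 1.0) that the slot can be acquired."""

    @abstractmethod
    def bind_init(self, cluster, nodes, walltime, starttime, success_prob) -> str:
        """Start binding the slot and return an id for later calls."""

    @abstractmethod
    def check(self, slot_id) -> Status:
        """Report whether the slot identified by slot_id is bound yet."""

    @abstractmethod
    def install(self, slot_id) -> Status:
        """Install the PBS glide-in on the bound slot."""
```

Each resource class would then subclass SlotResource and fill in the four methods with its own behavior.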

4
Slot Manager Acquisition Procedure
[Diagram: the Slot Manager calls the four routines in order]
  • Query(): is the slot available? Returns a probability
  • BindInit(): initiate the bind
  • Check(): bound yet? Returns true/false/abort
  • Install(): install the PBS glide-in when it is time
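
The acquisition procedure can be sketched as a small driver loop. The function name acquire_slot, the min_prob threshold, the polling interval, and the Status enum are assumptions for illustration; the slides only give the order of the calls and the true/false/abort result of Check().

```python
import time
from enum import Enum


class Status(Enum):
    """Hypothetical status values matching the true/false/abort outcomes."""
    TRUE = "true"
    FALSE = "false"
    ABORT = "abort"


def acquire_slot(resource, cluster, nodes, walltime, starttime,
                 min_prob=0.75, poll_seconds=30, sleep=time.sleep):
    """Drive one resource through Query -> BindInit -> Check -> Install.

    Returns the Install status, or None if the slot looks unavailable
    or the bind is aborted. min_prob and poll_seconds are assumed values.
    """
    prob = resource.query(cluster, nodes, walltime, starttime)
    if prob < min_prob:
        return None  # not available with the required probability
    slot_id = resource.bind_init(cluster, nodes, walltime, starttime, prob)
    while True:
        status = resource.check(slot_id)
        if status is Status.TRUE:
            break  # bound: time to install
        if status is Status.ABORT:
            return None
        sleep(poll_seconds)  # not bound yet, poll again later
    return resource.install(slot_id)


class AlwaysReady:
    """Stand-in resource whose routines are all NOPs that succeed."""

    def query(self, cluster, nodes, walltime, starttime):
        return 1.0

    def bind_init(self, cluster, nodes, walltime, starttime, success_prob):
        return "slot-0"

    def check(self, slot_id):
        return Status.TRUE

    def install(self, slot_id):
        return Status.TRUE


result = acquire_slot(AlwaysReady(), "clusterA", 4, 3600, 0)
```

Passing sleep as a parameter keeps the loop testable without real waiting.
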
5
Dedicated
  • Query
      • NOP (prob = 1)
  • BindInit
      • NOP (always true)
  • Check
      • NOP (always true)
  • Install
      • Installs PBS glide-in

6
Advanced Reservation
  • Query
      • Makes a request to the advanced reservation system
      • Prob = 1 if we can make the reservation
      • Prob = 0 if we cannot
  • BindInit
      • Make the advanced reservation request
  • Check
      • NOP (always returns true)
  • Install
      • Submit the PBS glide-in installation job to the
        specialized advanced reservation queue

7
Batch Controlled
  • Query
      • Runs an algorithm to determine the probability of
        meeting the slot requirement through the regular
        batch queue
  • BindInit
      • Use the values calculated by Query for the job
        dimensions and the time to wait before submission
  • Check
      • When the time to wait has elapsed, return true
  • Install
      • Submit the PBS glide-in installation job
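
A minimal sketch of how the four batch-controlled routines might fit together. The class name, the predict_plan and submit_job callables, and the injected clock are all hypothetical; the slide only says that Query computes the values, BindInit reuses them, Check waits for the delay to elapse, and Install submits the job.

```python
import itertools
import time


class BatchControlled:
    """Batch-queue resource; predict_plan and submit_job are supplied by the caller."""

    def __init__(self, predict_plan, submit_job, clock=time.time):
        # predict_plan(cluster, nodes, walltime, starttime)
        #   -> (probability, seconds_to_wait, job_walltime)   [assumed shape]
        # submit_job(cluster, nodes, job_walltime) -> result  [assumed shape]
        self._predict_plan = predict_plan
        self._submit_job = submit_job
        self._clock = clock
        self._plans = {}   # request -> (wait, job_walltime), filled by query()
        self._bound = {}   # slot id -> (cluster, nodes, job_walltime, submit_at)
        self._ids = itertools.count()

    def query(self, cluster, nodes, walltime, starttime):
        prob, wait, job_walltime = self._predict_plan(
            cluster, nodes, walltime, starttime)
        self._plans[(cluster, nodes, walltime, starttime)] = (wait, job_walltime)
        return prob

    def bind_init(self, cluster, nodes, walltime, starttime, success_prob):
        # Reuses the values from query(); raises KeyError if query() was skipped.
        # success_prob is accepted to match the routine signature but not used here.
        wait, job_walltime = self._plans[(cluster, nodes, walltime, starttime)]
        slot_id = f"batch-{next(self._ids)}"
        self._bound[slot_id] = (cluster, nodes, job_walltime, self._clock() + wait)
        return slot_id

    def check(self, slot_id):
        # True once the time to wait before submission has elapsed.
        return self._clock() >= self._bound[slot_id][3]

    def install(self, slot_id):
        cluster, nodes, job_walltime, _ = self._bound[slot_id]
        return self._submit_job(cluster, nodes, job_walltime)
```

Injecting the clock and the two callables keeps the sketch independent of any particular predictor or batch system.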

8
The Algorithm
  • Routines
      • deadline is in seconds from now
      • P = bqp_pred(machine, nodes, walltime, deadline)
  • Algorithm

    Preq = 0.75
    past = 0
    P = bqp_pred(M, N, W+D, D)
    while ((D - past) > 0)
        if (P >= Preq)
            wait = past
            real_walltime = W + (D - past)
        past += 30
        P = bqp_pred(M, N, W + (D - past), (D - past))
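
The loop above can be sketched in Python as follows. The function name plan_submission and the idea of passing bqp_pred in as a callable are assumptions for illustration; the real predictor is not shown in the slides, so the example uses a toy stand-in.

```python
def plan_submission(bqp_pred, machine, nodes, walltime, deadline,
                    p_req=0.75, step=30):
    """Scan candidate submit delays and keep the last acceptable one.

    bqp_pred(machine, nodes, walltime, deadline) is assumed to return the
    probability that such a job starts within `deadline` seconds.
    Returns (wait, real_walltime), or None if no delay meets p_req.
    """
    plan = None
    past = 0
    while deadline - past > 0:
        remaining = deadline - past          # time left until the slot starts
        padded = walltime + remaining        # job must cover the early start too
        if bqp_pred(machine, nodes, padded, remaining) >= p_req:
            plan = (past, padded)            # later delays overwrite earlier ones
        past += step
    return plan


def toy_pred(machine, nodes, walltime, deadline):
    """Toy predictor: confident only when at least 100 s remain."""
    return 0.9 if deadline >= 100 else 0.5


plan = plan_submission(toy_pred, "clusterA", 4, walltime=600, deadline=300)
```

Because later acceptable delays overwrite earlier ones, the result is the last submit time that still meets the target probability, which keeps the padded walltime as small as possible.
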
9
Batch Experiment
[Timeline diagram: from "now" to the submit time chosen at probability 0.75]
  • 75% is the target probability
  • 356 total requests
  • 257 total batch submissions
  • 99 requests resulted in an initial "not possible"
    response
  • 192 slots successfully acquired
  • 257 × 0.75 ≈ 193
  • Choose the last acceptable time to minimize waste
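
The arithmetic behind these figures can be checked in a few lines. The variable names are illustrative; the numbers are the ones on the slide.

```python
total_requests = 356
not_possible = 99          # requests rejected up front
acquired = 192             # slots successfully acquired
target_prob = 0.75

submissions = total_requests - not_possible   # batch submissions actually made
expected = submissions * target_prob          # successes expected at the target
observed_rate = acquired / submissions        # fraction that actually succeeded

print(submissions, expected, round(observed_rate, 3))
```

The observed success rate lands very close to the 75% target, which is the point of the comparison between 192 and 193.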

10
Near Term Experiments
  • Try other probability levels
  • Try other deadlines

11
PBS Glide-in
  • A basic batch queue system assumes a one-to-one
    mapping of job to resource set (slot)
  • Idea: once a single slot has been acquired,
    install a personal resource manager and scheduler
    within it in order to support multiple jobs
    within a single slot
  • Have instrumented Torque (PBS) to fulfill this
    task
  • Plays the role that Condor would play as
    infrastructure scheduler
  • PBS glide-in: simpler, supports MPI, etc.

12
PBS Overview
[Diagram: qsub scriptA is submitted to the PBS Server and PBS Sched. scriptA gets node1, node2, and node3, and scriptA is transferred to a PBS Mom. A PBS Mom runs on each of node1 through node4.]
13
PBS Overview
[Diagram: PBS Server and PBS Sched, with a PBS Mom on each of node1 through node4. scriptA runs its commands on its nodes, shown as "cmd" and "ssh cmd".]
14
PBS glide-in
[Diagram: qsub pglide.pbs submits the glide-in job pglide.pbs to the PBS Server and PBS Sched. A PBS Mom runs on each of node1 through node4.]
15
PBS glide-in
[Diagram: the site PBS Server, PBS Sched, and PBS Moms (node1 through node4) run pglide.pbs, which starts its own pbs_server, pbs_sched, and three pbs_mom daemons on the acquired nodes.]
16
PBS glide-in
[Diagram: globusrun-ws jobA and globusrun-ws jobB go through GRAM, which issues qsub scriptA and qsub scriptB. scriptA and scriptB run under the pbs_server, pbs_sched, and pbs_mom daemons started by pglide.pbs, on nodes managed by the site PBS Server, PBS Sched, and PBS Moms (node1 through node4).]
17
PBS glide-in TODO
  • To implement this, we needed to disable some of
    PBS's internal security features (dropping
    privileges, root check, privileged ports, user
    auth checks, host auth checks)
  • Streamline the installation process (good but not
    great)
  • Architecture discussion: one server per slot, or
    one server for all slots on a single machine?
  • Requires reworking the Torque software a bit

18
Slot Acquisition Status
  • BQP virtual advanced reservation system in
    place
  • PBS glide-in working on all machines Dan has
    access to
  • Need to investigate advanced reservation
    interface(s)
  • Need to figure out how to properly submit PBS
    jobs using GRAM

19
Thanks!
  • Questions?

20
Statistics TODO
  • More reactive change point detection
      • Machine downtime constitutes a change point we
        can detect better
  • Better understanding of autocorrelation and
    quantiles
  • Non-statistical case
      • One user submits 20,000 single-processor jobs

21
Current Cluster Status