Experiences with Distributed and Parallel MATLAB - PowerPoint PPT Presentation

1
Experiences with Distributed and Parallel MATLAB
on CCS
  • Daniel Goodman, Stef Salvini
  • and Anne Trefethen

2
Who we are
  • The focus of the OeRC is the development and
    application of new advances in computational and
    information technology to allow groups of
    researchers to tackle problems with increasing
    scale and complexity, facilitating
    interdisciplinary research and creating
    appropriate research infrastructure.
  • The Centre supports a community of
    multidisciplinary researchers who are engaged in
    e-Research, providing suitable education and
    training and an interface to industry.

3
Our CCS Cluster
  • 20 dual CPU, dual core SMP nodes with 8 GB of RAM
  • 2 quad CPU, dual core SMP nodes with 32 GB of
    RAM
  • Gigabit private network
  • 10 terabyte file store
  • Installed libraries include MS-MPI, Intel Math
    Kernel Libraries, Numerical Algorithms Group
    Windows libraries and ITK
  • 32 Distributed MATLAB licenses

4
Users of the OeRC Cluster
  • Financial Computing: both for research and
    teaching, led by Prof. Mike Giles
  • Zoology Department: analysing homologous
    recombination in bacteria
  • Experiments using CCS as the backend for large
    Excel workbooks
  • OxGrid: Globus gateway based on software from
    Southampton

5
Users of the OeRC Cluster
  • ClimatePrediction.net: the world's largest climate
    experiment

6
Users of the OeRC Cluster
  • Optical Grid: high-bandwidth collaboration
    between Oxford and UCSD

7
What we are going to cover in this talk
  • Introduce MATLAB Distributed toolbox
  • Introduce two existing MATLAB projects
  • Examine the different techniques available to
    port these to the CCS cluster
  • Our thoughts on the MATLAB Distributed toolbox
  • Our thoughts on the CCS cluster
  • Recommendations for improvement

8
MATLAB Distributed Toolbox
  • Allows instances of MATLAB to run as workers on
    clusters.
  • These workers can be used to run a range of
    different styles of job. (Condor, message
    passing, global operations)
  • Supports a set of distributed matrices that can
    be used to abstract the parallelisation from the
    system.

9
Electron Microscope Data
Thanks to Rick Lawrence of UCSD
10
Electron Microscope Data
First take images of a slide from many different
angles
Thanks to Rick Lawrence of UCSD
13
Electron Microscope Data
Slice 1
Then using CT style techniques convert this into
slices of your sample.
Thanks to Rick Lawrence of UCSD
14
Electron Microscope Data
Slice 2
Then using CT style techniques convert this into
slices of your sample.
Thanks to Rick Lawrence of UCSD
15
Electron Microscope Data
Slice 3
Then using CT style techniques convert this into
slices of your sample.
Thanks to Rick Lawrence of UCSD
16
Electron Microscope Data
Slice 4
Then using CT style techniques convert this into
slices of your sample.
Thanks to Rick Lawrence of UCSD
17
Electron Microscope Data
Slice 5
Then using CT style techniques convert this into
slices of your sample.
Thanks to Rick Lawrence of UCSD
18
Electron Microscope Data
Sweep
Slices
19
Merging Medical Information
Take different sources of information and merge
them to produce a more detailed image
Thanks to Vicente Grau of Oxford University
20
Merging Medical Information
Take different sources of information and merge
them to produce a more detailed image
Beating Heart
Static Heart With Annotations
Thanks to Vicente Grau of Oxford University
21
Merging Medical Information
Take different sources of information and merge
them to produce a more detailed image
Beating Heart
Alignment Function
Static Heart With Annotations
Thanks to Vicente Grau of Oxford University
22
Merging Medical Information
Take different sources of information and merge
them to produce a more detailed image
Beating Heart
Alignment Function
Beating Heart With Annotations
Static Heart With Annotations
Thanks to Vicente Grau of Oxford University
23
Independent Tasks
  • Both problems are embarrassingly parallel, so
    in theory they are easily split into independent
    tasks.
  • Used standard distributed toolbox objects to
    parallelise the code and executed on the cluster.
  • Return results to the client for marshalling and
    saving.

24
Independent Tasks
  • jm = findResource('scheduler', 'configuration', 'CCS')
  • job1 = createJob(jm)
  • createTask(job1, @projective_reconstruction_core, ...
    1, {imodfile_in, numtlts, xsize, ysize, ...
    plane_coeffs2, blocksize, z_inc, homography_3D, ...
    z_block_start})
  • f = {'projective_reconstruction_core.m', ...
    'imod_fileread_slice.m', 'mrc_head_read.m', ...
    'mrc_read_slice.m', 'init_mrc_head.m', ...
    'get_datumsize.m'}
  • set(job1, 'FileDependencies', f)

25
Independent Tasks
  • submit(job1)
  • waitForState(job1, 'finished')
  • blocks = getAllOutputArguments(job1)
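
The cell array returned by getAllOutputArguments has one row per task and one column per output argument. A minimal sketch of the marshalling step on the client follows. It assumes, which the slides do not state, that each task returns a single 3-D block of slices and that tasks were created in z order. The sizes and the fake data are hypothetical stand-ins so the sketch runs without a cluster.

```matlab
% Stand-in for: blocks = getAllOutputArguments(job1)
% One row per task, one column per output argument (here a single output).
% All sizes below are hypothetical.
xsize = 4; ysize = 3; blocksize = 2; numtasks = 5;
blocks = cell(numtasks, 1);
for k = 1:numtasks
    % Fake block k: every voxel holds the task number.
    blocks{k} = k * ones(ysize, xsize, blocksize);
end

% Marshal the per-task blocks into one volume along the z dimension.
volume = cat(3, blocks{:});

disp(size(volume));
```

With real output, the assembled volume could then be written out from the client with save or a format-specific writer.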

26
Analysis of Method 1
  • Lack of refactoring tools and tools to determine
    file dependencies makes construction from legacy
    code fiddly. (MATLAB)
  • Limited and sometimes fiddly control of output
    (MATLAB)
  • Requires construction of custom submission and
    activation filters (CCS)

27
Communicating Tasks
  • Allow tasks to communicate so they can save the
    results from the nodes directly to the file
    system.
  • Use labSend and labReceive commands to pass a
    token that controls access to the file system.
  • Construct code to control which tasks each node
    will perform based on their index.

28
Communicating Tasks
  • sched = findResource('scheduler', 'configuration', 'CCS')
  • pjob = createParallelJob(sched)
  • set(pjob, 'MaximumNumberOfWorkers', 30)
  • set(pjob, 'MinimumNumberOfWorkers', 15)
  • f = {'mrc_write_slice.m', 'imod_filewrite_slice.m', ...
    'mrc_head_write.m', 'imod_filewrite_first_slice.m', ...
    'projective_reconstruction.m', ...
    'imod_fileread_slice.m', 'mrc_head_read.m', ...
    'mrc_read_slice.m', 'init_mrc_head.m', ...
    'get_datumsize.m'}
  • set(pjob, 'FileDependencies', f)

29
Communicating Tasks
  • task = createTask(pjob, @projective_reconstruction, ...
    0, {basename})
  • submit(pjob)
  • waitForState(pjob)

30
Communicating Tasks
  • Initialise
  • iblock = labindex
  • Save output
  • if iblock > 1
  •   out_ptr = labReceive(mod(labindex-2,numlabs)+1)
  • end
  • Save output
  • if iblock < numblocks
  •   labSend(out_ptr, mod(labindex,numlabs)+1)
  • end
  • Advance
  • iblock = iblock + numlabs
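
The index arithmetic in the token-passing scheme can be checked without a cluster. The sketch below is an illustration only: nlabs and nblocks are hypothetical values, and plain variables stand in for labindex and numlabs. It tabulates the lab each worker receives the token from, the lab it sends the token to, and which lab owns each block under the round-robin advance.

```matlab
% Hypothetical sizes, chosen only for illustration.
nlabs   = 4;    % number of workers in the parallel job
nblocks = 10;   % number of blocks to be written in order

prev  = zeros(1, nlabs);    % lab each worker receives the token from
next  = zeros(1, nlabs);    % lab each worker sends the token to
owner = zeros(1, nblocks);  % lab that processes each block

for lab = 1:nlabs
    prev(lab) = mod(lab - 2, nlabs) + 1;   % wraps lab 1 back to the last lab
    next(lab) = mod(lab, nlabs) + 1;       % wraps the last lab round to lab 1
    owner(lab:nlabs:nblocks) = lab;        % iblock = lab, lab+nlabs, ...
end

disp([1:nlabs; prev; next]);   % rows: lab, previous lab, next lab
disp(owner);
```

Because block b+1 always belongs to the lab after the owner of block b, passing the token around the ring lets the blocks be written to the file in order.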

31
Analysis of Method 2
  • Same issues as before with refactoring and
    determining file dependencies (MATLAB)
  • Lack of multi-threading wastes resources on tasks
    with heterogeneous execution times. This is being
    addressed (MATLAB)
  • Again custom submission and activation filters
    need to be constructed (CCS)
  • Much more vulnerable to failing nodes (CCS and
    MATLAB)

32
Performance
  • Both methods provided almost linear speedup.
  • Using 30 nodes the time to perform the analysis
    of the microscope data is reduced from 3.4 hours
    to 7 minutes.
  • Using 19 nodes the time to run the heart analysis
    is reduced from 5.7 hours to 18 minutes.
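
The "almost linear" claim can be checked from the figures above: they work out to a speedup of roughly 29 on 30 nodes and 19 on 19 nodes. A minimal sketch of the arithmetic follows. The variable names are illustrative, and because the reported times are rounded, the efficiencies are approximate.

```matlab
% Figures as reported on the slide: [microscope data, heart analysis].
serial_min   = [3.4 5.7] * 60;   % serial run time, converted to minutes
parallel_min = [7 18];           % parallel run time in minutes
nodes        = [30 19];          % nodes used for each run

speedup    = serial_min ./ parallel_min;   % serial time / parallel time
efficiency = speedup ./ nodes;             % fraction of ideal linear speedup

% fprintf cycles through the matrix column by column, one line per run.
fprintf('speedup %.1f on %d nodes (efficiency %.2f)\n', ...
        [speedup; nodes; efficiency]);
```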

33
Thoughts on MATLAB
  • Easy to install
  • Easy to configure
  • Easy to use
  • Lacks tooling for refactoring jobs out of
    existing code, and setting configuration
    parameters
  • Data model needs extending
  • Lack of ability to have threads sharing data
    wastes time and memory

34
Thoughts on CCS
  • Mostly a good experience
  • A few specific difficulties:
  • Submission and Activation Filters
  • Authentication
  • Shared folders
  • Error Messages
  • Failover function for head node

35
Submission and Activation Filters
  • Single executable for each makes management of
    multiple applications hard
  • It can be hard to determine which application the
    user is attempting to run
  • No means of the activation filter feeding back
    why the job was rejected
  • Would be nice to have more control over job
    license restrictions without the use of filters

36
Authentication
  • On some client machines it appears not to be
    possible to get the client to remember the user's
    password and automatically authenticate.

37
Shared Folders
  • When copying large data files to nodes, the file
    server ceases to appear as a network resource,
    resulting in transfers failing.

38
Error Messages
  • Often when a job fails, no error message is
    provided to assist in debugging.

39
Failover of head node
  • When the head node fails it will remain in its
    failed state indefinitely.

40
Recommendations
  • Tool for picking up console output and load
    information from your job.
  • Better way of managing licenses
  • Mandatory field identifying the program to be
    executed
  • Better control of job distribution across nodes
  • Make it easier to integrate legacy systems
  • Include SFU and SUA
  • Include more information from active directory in
    CCS Administrator
  • Add more descriptive filtering to the job queue