AI, HPF, Grid Computing: Chronicles of failures

1
AI, HPF, Grid Computing Chronicles of failures
  • How to play funding agencies.
  • Patrick Geoffray
  • patrick@myri.com

2
HPC lies and the lying liars telling them
  • How the HPC community gave up on checks and
    balances.
  • Patrick Geoffray
  • patrick@myri.com

3
Hardware folks are from Mars, Software guys are
from Venus.
  • Design choices for Myrinet / MX.
  • Patrick Geoffray
  • patrick@myri.com

4
Addressing woes
  • Hardware folks use IO space addresses, only valid
    on the IO bus.
  • Software people manipulate virtual memory
    addresses, allocated to them by the OS, only
    meaningful in the OS virtual memory mapping.
  • Sometimes, the IO address space is mapped
    one-to-one onto the physical memory space.
  • X86, AMD64, IA-64 (some).
  • Sometimes, an IOMMU is sitting between the IO bus
    and the memory bus and maps the IO space into the
    physical memory space.
  • PPC (Power, G5), IA-64 (some).
  • For zero-copy, we need to convert addresses from
    virtual memory space into IO space -> memory
    registration/deregistration (a sketch of the
    address conversion follows below).
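
A minimal sketch of that conversion using the Linux DMA API (generic
kernel code, not MX's actual driver; a real driver would also check
dma_mapping_error()):

```c
#include <linux/dma-mapping.h>

/* Map one page for the NIC and return the address the NIC must use on
 * the IO bus: essentially the physical address on x86/AMD64 without an
 * IOMMU, or an IO-space address that the IOMMU translates back to
 * physical memory on PPC and some IA-64 systems. */
static dma_addr_t map_for_nic(struct device *dev, struct page *pg)
{
    return dma_map_page(dev, pg, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
}
```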

5
Memory registration/deregistration
  • Memory registration
  • Trap into OS kernel.
  • Lock access to OS page table.
  • Loop over every page:
  • Walk the OS page table to find the virtual page.
  • If necessary, swap the virtual page in.
  • Increment the page reference count.
  • Pin the virtual page in the OS page table (marked
    not swappable).
  • Get the IO address of the related physical memory
    page (may require using the IOMMU).
  • Unlock access to OS page table.
  • Return to user space.
  • Memory deregistration
  • Trap into OS kernel.
  • Lock access to OS page table.
  • Loop over every page:
  • Walk the OS page table to find the virtual page.
  • Unpin the virtual page in the OS page table
    (marked swappable).
  • Decrement the page reference count.
  • If present, clear the related entry in the IOMMU.
  • If present, clear the related entry in any cache
    on the NIC.
  • Unlock access to the OS page table.
  • Return to user space (both paths are sketched
    below).
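
A minimal sketch of both paths using today's Linux helpers
(pin_user_pages_fast()/unpin_user_pages() postdate these slides, but
they package up exactly the walk/swap-in/refcount/pin steps listed
above; error handling is simplified):

```c
#include <linux/mm.h>
#include <linux/slab.h>

struct reg_entry {
    struct page **pages;     /* one pinned page per virtual page */
    unsigned long npages;
};

/* Register [uaddr, uaddr + len): pin each page so it cannot be swapped
 * out while the NIC is doing DMA. */
static int register_region(struct reg_entry *e, unsigned long uaddr, size_t len)
{
    unsigned long first = uaddr >> PAGE_SHIFT;
    unsigned long last  = (uaddr + len - 1) >> PAGE_SHIFT;
    long pinned;

    e->npages = last - first + 1;
    e->pages = kcalloc(e->npages, sizeof(*e->pages), GFP_KERNEL);
    if (!e->pages)
        return -ENOMEM;

    /* Faults pages in if needed, bumps refcounts, marks them pinned:
     * the "loop over every page" of the slide, in one call. */
    pinned = pin_user_pages_fast(uaddr & PAGE_MASK, e->npages,
                                 FOLL_WRITE | FOLL_LONGTERM, e->pages);
    if (pinned != (long)e->npages) {
        if (pinned > 0)
            unpin_user_pages(e->pages, pinned);
        kfree(e->pages);
        return pinned < 0 ? (int)pinned : -EFAULT;
    }
    /* The driver would now call dma_map_page() on each page to get the
     * IO-bus addresses to program into the NIC. */
    return 0;
}

/* Deregister: drop the pins so the pages become swappable again. */
static void deregister_region(struct reg_entry *e)
{
    unpin_user_pages(e->pages, e->npages);
    kfree(e->pages);
}
```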

6
Dodging the bullet
  • Memory registration can be very expensive.
  • Hardware folks did not think that it would be
    used in the critical path: explicit memory
    registration in low-level, hardware-driven APIs
    such as VIA and InfiniBand.
  • No explicit memory registration in higher-level
    communication libraries such as MPI or Sockets.
  • Various methods attempt to dodge the bullet:
  • Do not do zero-copy: makes sense for small
    messages, where a memory copy is cheaper.
  • Thrashes the cache, uses CPU.
  • Implement a registration cache: lazy
    deregistration with a garbage collector (a toy
    version is sketched below).
  • Need to hijack malloc to catch when memory pages
    are released to the OS.
  • Maintenance nightmare.
  • Poor efficiency on complex code (vs. a 99% cache
    hit rate on Linpack).
  • Do not register memory: maintain a copy of the OS
    page table in the NIC.
  • No OS support: requires patching the OS;
    portability issues.
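
A toy version of such a registration cache (illustrative only:
nic_register()/nic_deregister() are hypothetical stand-ins for the real
driver calls, a real cache would use a tree plus locking, and the
invalidation hook is exactly why malloc must be hijacked):

```c
#include <stdlib.h>

typedef struct cache_entry {
    char  *base;
    size_t len;
    void  *handle;                 /* opaque NIC registration handle */
    struct cache_entry *next;
} cache_entry;

static cache_entry *cache;

extern void *nic_register(void *base, size_t len);   /* hypothetical */
extern void  nic_deregister(void *handle);           /* hypothetical */

/* Return a registration covering [base, base+len), reusing one if found. */
void *regcache_get(void *base, size_t len)
{
    char *b = base;
    for (cache_entry *e = cache; e; e = e->next)
        if (b >= e->base && b + len <= e->base + e->len)
            return e->handle;                        /* cache hit */

    cache_entry *e = malloc(sizeof *e);              /* error handling elided */
    e->base = b;
    e->len = len;
    e->handle = nic_register(base, len);             /* the expensive path */
    e->next = cache;
    cache = e;
    return e->handle;
}

/* Must run when memory is returned to the OS (hence the malloc
 * hijacking): overlapping entries are stale and get deregistered. */
void regcache_invalidate(void *base, size_t len)
{
    char *b = base;
    for (cache_entry **pp = &cache; *pp; ) {
        cache_entry *e = *pp;
        if (e->base < b + len && b < e->base + e->len) {
            *pp = e->next;
            nic_deregister(e->handle);
            free(e);
        } else {
            pp = &e->next;
        }
    }
}
```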

7
State of the Union
8
Design choice in Myrinet / MX
  • Patching the OS is bad for many reasons.
  • Customers don't like it; maintenance nightmare.
  • Memory registration is in the critical path,
    let's live with it.
  • Optimize the register/deregister code path.
  • Factor page-table accesses, aggressive locking.
  • 4 µs for the first page, 0.15 µs asymptotically
    (> 8 pages).
  • No long-term caching in the NIC.
  • Will deregister after a single use.
  • No explicit memory registration for the Send/Recv
    semantic.
  • Keep the door open for a registration cache.
  • Just for Linux, where it's easy to intercept
    free() (one technique is sketched below).
  • No big reward, but it's so simple.
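
One way to "catch" free() on Linux is symbol interposition: build the
hook as a shared object and inject it with LD_PRELOAD. A sketch,
reusing the hypothetical regcache_invalidate() from the earlier toy
cache (MX's actual mechanism may differ):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

extern void regcache_invalidate(void *base, size_t len);  /* see cache sketch */

/* Interposed free(): warn the registration cache before the memory can
 * go back to the OS, then call the real free(). Illustrative only: the
 * freed length is unknown here, recursion through dlsym() is ignored,
 * and munmap()/sbrk() need the same treatment. These gaps are the
 * "maintenance nightmare" of the earlier slide. */
void free(void *ptr)
{
    static void (*real_free)(void *);
    if (!real_free)
        real_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");
    if (ptr)
        regcache_invalidate(ptr, 1);
    real_free(ptr);
}
```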

9
MX/GM MPI Pingpong
10
MX/GM MPI Pingpong (no regcache)
11
The Right Thing
  • Hardware would provide an IOMMU mapping virtual
    memory addresses into physical memory addresses.
    All IO devices would then directly manipulate
    virtual memory addresses.
  • If the IOMMU resolution misses, interrupt the OS
    and Nack the IO transaction.
  • Software would maintain a copy of the OS page
    table in this IOMMU, or only in a specific
    device. At the very least, the OS should provide
    well-defined hooks to let the driver maintain a
    copy of the OS page table by itself (see the
    sketch below).
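
Linux did eventually grow such a hook: MMU notifiers, merged years
after this talk, let a driver mirror page-table changes. A sketch (the
callback set and signatures have varied across kernel versions):

```c
#include <linux/mmu_notifier.h>

/* The OS is changing its page table: drop the device's cached
 * translations covering [start, end). */
static void nic_invalidate_range(struct mmu_notifier *mn,
                                 struct mm_struct *mm,
                                 unsigned long start, unsigned long end)
{
    /* ... tell the NIC to flush its translation cache ... */
}

static const struct mmu_notifier_ops nic_mn_ops = {
    .invalidate_range = nic_invalidate_range,
};

/* The driver attaches a struct mmu_notifier using these ops to a
 * process address space with mmu_notifier_register(&mn, mm). */
```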

Who said "As the memory copy is always going to
be much faster than the network, zero-copy just
does not make sense"?
Linus Torvalds, 2002
12
Message passing woes
  • Hardware folks like simple semantics such as PUT
    or GET.
  • Called RDMA by the marketing department.
  • Software guys use two-sided interfaces such as
    MPI or Sockets.
  • Two-sided interfaces are easier to manipulate for
    large, complex codes.
  • MPI is the de facto programming standard in HPC.
  • Mapping MPI on top of RDMA is like:
  • Training an AI to be a shrink.
  • Using HPF to generate non-trivial parallel code.
  • Running HPC codes on a set of loosely coupled,
    geographically widely distributed machines.

It sounds easy but it is a pain to implement and
it performs poorly.
13
MPI over RDMA
  • History repeats itself: Memory Channel, Giganet
    (VIA), SCI, IB.
  • How do you implement a matching semantic on top
    of a one-sided API?
  • Matching has to be done somewhere, sometime, by
    someone.
  • Matching can be done after sending data (both
    options are sketched after this list):
  • Eager mode, copy on the receive side.
  • Where do you PUT the eager message in the first
    place?
  • Shared queue with tokens? Multiple windows?
    Polling or blocking?
  • Matching can be done before sending data:
  • Rendez-vous with small packets containing the
    matching information.
  • Matching is done by the host whenever the
    rendez-vous happens.
  • If the host is not available, either wait or
    interrupt it.
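
A schematic of the two strategies in C, under assumed one-sided
primitives (rdma_put(), send_ctrl(), eager_slot(), and wait_cts() are
hypothetical, not any vendor's API; the 4 KB switchover is an
assumption):

```c
#include <stdint.h>
#include <string.h>

#define EAGER_MAX 4096   /* assumed eager/rendezvous switchover point */

/* Hypothetical one-sided primitives. */
extern void rdma_put(int peer, const void *src, uint64_t remote_addr, size_t len);
extern void send_ctrl(int peer, const void *hdr, size_t len);   /* small packet */
extern uint64_t eager_slot(int peer);            /* token-managed queue slot */
extern uint64_t wait_cts(int peer, uint64_t match_bits);   /* rendezvous reply */

void mpi_like_send(int peer, uint64_t match_bits, const void *buf, size_t len)
{
    if (len <= EAGER_MAX) {
        /* Matching after the data: PUT match bits plus payload into a
         * shared queue slot; the receiver matches and copies it out. */
        char slot[sizeof match_bits + EAGER_MAX];
        memcpy(slot, &match_bits, sizeof match_bits);
        memcpy(slot + sizeof match_bits, buf, len);
        rdma_put(peer, slot, eager_slot(peer), sizeof match_bits + len);
    } else {
        /* Matching before the data: send only the match bits; the remote
         * host matches when available (or is interrupted), replies with
         * a registered address, and the bulk data goes zero-copy. */
        send_ctrl(peer, &match_bits, sizeof match_bits);
        uint64_t dest = wait_cts(peer, match_bits);
        rdma_put(peer, buf, dest, len);
    }
}
```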

14
State of the Union
15
Classical MPI overlap problem
16
Matching in the NIC?
  • Matching between incoming message and posted
    receives in MPI.
  • 64 bits of matching information.
  • Receives are posted by the host.
  • If no matching receive is found -> unexpected
    message.
  • Asynchronous aspect is important for large
    messages.
  • Bad:
  • Limited resources:
  • NIC cycles (linear matching, as in the sketch
    below).
  • NIC memory (number of posted receives).
  • MPI state in the NIC.
  • Good:
  • Perfect overlap of communication with
    computation.
  • Does not interrupt the host.
  • Very loose rendez-vous for zero-copy.
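
A sketch of the linear matching loop as the NIC would run it
(illustrative; the match-bits layout is an assumption, not MX's wire
format):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct posted_recv {
    uint64_t match_bits;     /* e.g. context id | source rank | tag */
    uint64_t mask;           /* wildcard bits for MPI_ANY_SOURCE/TAG */
    struct posted_recv *next;
} posted_recv;

/* Compare the 64 bits of match information in an incoming header
 * against each posted receive; NULL means unexpected message. O(n) in
 * posted receives: the "NIC cycles" cost listed above. */
posted_recv *match(posted_recv *list, uint64_t incoming)
{
    for (posted_recv *r = list; r; r = r->next)
        if ((incoming & r->mask) == (r->match_bits & r->mask))
            return r;
    return NULL;
}
```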

17
Design choice in Myrinet / MX
  • Limited number of posted receives in the NIC: 32
    or 64 per endpoint.
  • Pass it to the host if:
  • No match is found.
  • Race between an incoming message and the posting
    of a receive handle.
  • When the host receives the message:
  • Check posted receives, starting after the ones
    tried by the NIC (see the fallback sketch below).
  • If no match is found -> unexpected message.
  • Progression thread: the NIC may need to ask the
    host to get involved:
  • Matching large messages unmatched by the NIC.
  • Host queues are full.
  • No periodic timer, only when needed.
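
A sketch of that host-side fallback, reusing posted_recv and match()
from the previous sketch (NIC_SLOTS and the simple skip-ahead are
illustrative; a real implementation must also handle the posting race
noted above):

```c
#define NIC_SLOTS 64   /* per-endpoint posted receives kept in the NIC */

posted_recv *host_match(posted_recv *posted, uint64_t incoming)
{
    posted_recv *r = posted;
    for (int i = 0; i < NIC_SLOTS && r; i++)
        r = r->next;             /* skip the entries the NIC already tried */
    return match(r, incoming);   /* NULL -> unexpected message */
}
```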

18
The Right Thing
  • The interconnects should natively support the
    interfaces that people use in real-world
    applications:
  • Reliable matching (MPI, Sockets, Storage).
  • Datagram (IP, Databases).
  • The hardware people should not impose their
    simplistic notion of low-level communication
    software.
  • The software people should not passively accept
    the domination from the hardware clan and do the
    best they can with what they are given.

Who is behind RDMA-based low-level interfaces?
Intel
19
What to take home
  • Hardware and software people like each other;
    they just try to make their own lives easier.
  • Linus Torvalds does not know much about high
    speed networking.
  • Memory registration is a false problem.
  • Size of the pipe is only one factor.
  • Appropriate semantics and overlap are also
    important.
  • RDMA will not save the world.
  • It just makes the hardware easier, at the cost of
    software complexity.
  • Look for MX at SC04!
  • Physicists are from Pluto, Grid Computing folks
    are from Melmac.
  • Physicists claim that the speed of light is a
    constant, but Grid Computing people say that
    latency is not a problem.