Title: AI, HPF, Grid Computing: Chronicles of failures
1. AI, HPF, Grid Computing: Chronicles of failures
- How to play funding agencies.
- Patrick Geoffray
- patrick_at_myri.com
2. HPC lies and the lying liars telling them
- How the HPC community gave up on checks and balances.
- Patrick Geoffray
- patrick_at_myri.com
3. Hardware folks are from Mars, software guys are from Venus.
- Design choices for Myrinet / MX.
- Patrick Geoffray
- patrick_at_myri.com
4. Addressing woes
- Hardware folks use IO-space addresses, only valid on the IO bus.
- Software people manipulate virtual memory addresses, allocated to them by the OS, only meaningful in the OS virtual memory mapping.
- Sometimes the IO address space is mapped one-to-one onto the physical memory space:
  - x86, AMD64, IA-64 (some).
- Sometimes an IOMMU sits between the IO bus and the memory bus and maps the IO space into the physical memory space:
  - PPC (Power, G5), IA-64 (some).
- For zero-copy, we need to convert addresses from virtual memory space into IO space -> memory registration/deregistration.
5. Memory registration/deregistration
- Memory registration:
  - Trap into the OS kernel.
  - Lock access to the OS page table.
  - Loop over every page:
    - Walk the OS page table to find the virtual page.
    - Swap the virtual page in, if needed.
    - Increment the page reference count.
    - Pin the virtual page in the OS page table (mark it not swappable).
    - Get the IO address of the related physical memory page (may require the IOMMU).
  - Unlock access to the OS page table.
  - Return to user space.
- Memory deregistration:
  - Trap into the OS kernel.
  - Lock access to the OS page table.
  - Loop over every page:
    - Walk the OS page table to find the virtual page.
    - Unpin the virtual page in the OS page table (mark it swappable).
    - Decrement the page reference count.
    - Clear the related entry in the IOMMU, if any.
    - Clear the related entry in any cache on the NIC, if any.
  - Unlock access to the OS page table.
  - Return to user space.
6. Dodging the bullet
- Memory registration can be very expensive.
- Hardware folks did not think it would be used in the critical path: explicit memory registration in low-level, hardware-driven APIs such as VIA and InfiniBand.
- No explicit memory registration in higher-level communication libraries such as MPI or Sockets.
- Various methods to attempt to dodge the bullet:
  - Do not do zero-copy: makes sense for small messages, where a memory copy is cheaper.
    - Trashes the cache, uses the CPU.
  - Implement a registration cache: lazy deregistration with a garbage collector.
    - Need to hijack malloc to catch when memory pages are released to the OS.
    - Maintenance nightmare.
    - Poor efficiency on complex codes; the high cache-hit rate on Linpack does not generalize.
  - Do not register memory: maintain a copy of the OS page table in the NIC.
    - No OS support: requires patching the OS, portability issues.
7. State of the Union
8. Design choice in Myrinet / MX
- Patching the OS is bad for many reasons.
  - Customers don't like it; maintenance nightmare.
- Memory registration is in the critical path; let's live with it.
  - Optimize the register/deregister code path.
  - Factorize page table accesses, aggressive locking.
  - 4 us for the first page, 0.15 us asymptotically (8 pages).
- No long-term caching in the NIC.
  - Will deregister after a single use.
- No explicit memory registration for the Send/Recv semantic.
- Keep the door open for a registration cache.
  - Just for Linux, where it is easy to catch free() calls.
  - No big reward, but it is so simple.
9. MX/GM MPI Pingpong
10. MX/GM MPI Pingpong (no regcache)
11. The Right Thing
- Hardware would provide an IOMMU mapping virtual memory addresses into physical memory addresses. All IO devices would then directly manipulate virtual memory addresses.
  - If the IOMMU resolution misses, interrupt the OS and Nack the IO transaction.
- Software would maintain a copy of the OS page table in this IOMMU, or only in a specific device. At the very least, the OS should provide well-defined hooks to let the driver maintain its own copy of the OS page table.

Who said "As the memory copy is always going to be much faster than the network, zero-copy just does not make sense"?
Linus Torvalds, 2002
12. Message passing woes
- Hardware folks like simple semantics like PUT or GET.
  - Called RDMA by the marketing department.
- Software guys use two-sided interfaces such as MPI or Sockets.
  - Two-sided interfaces are easier to manipulate for large, complex codes.
  - MPI is the de facto programming standard in HPC.
- Mapping MPI on top of RDMA is like:
  - Training an AI to be a shrink.
  - Using HPF to generate non-trivial parallel code.
  - Running HPC codes on a set of loosely coupled, geographically widely distributed machines.

It sounds easy, but it is a pain to implement and it performs poorly.
13. MPI over RDMA
- History repeats itself: Memory Channel, Giganet (VIA), SCI, IB.
- How to implement a matching semantic on top of a one-sided API?
- Matching has to be done somewhere, sometime, by someone.
- Matching can be done after sending data:
  - Eager mode, copy on the receive side.
  - Where do you PUT the eager message in the first place?
  - Shared queue with tokens? Multiple windows? Polling or blocking?
- Matching can be done before sending data:
  - Rendez-vous with small packets containing the matching information.
  - Matching done by the host whenever the rendez-vous happens.
  - If the host is not available, either wait or interrupt it.
14. State of the Union
15. Classical MPI overlap problem
16. Matching in the NIC?
- Matching between incoming messages and posted receives in MPI.
  - 64 bits of matching information.
  - Receives are posted by the host.
  - If no matching receive is found -> unexpected message.
  - The asynchronous aspect is important for large messages.
- Bad:
  - Limited resources:
    - NIC cycles (linear matching).
    - NIC memory (number of posted receives).
  - MPI state in the NIC.
- Good:
  - Perfect overlap of communication with computation.
  - Does not interrupt the host.
  - Very loose rendez-vous for zero-copy.
17. Design choice in Myrinet / MX
- Limited number of posted receives in the NIC: 32 or 64 per endpoint.
- Pass it to the host if:
  - No matching is found.
  - Race between an incoming message and the posting of a receive handle.
- When the host receives a message:
  - Check posted receives, starting after the ones tried by the NIC.
  - If no match is found -> unexpected message.
- Progression thread: the NIC may need to ask the host to be involved.
  - Matching large messages unmatched by the NIC.
  - Host queues are full.
  - No periodic timer, only when needed.
18. The Right Thing
- Interconnects should natively support the interfaces that people use in real-world applications:
  - Reliable matching (MPI, Sockets, Storage).
  - Datagram (IP, databases).
- The hardware people should not impose their simplistic notion of low-level communication software.
- The software people should not passively accept the domination of the hardware clan and just do the best they can with what they are given.

Who is behind RDMA-based low-level interfaces?
Intel
19. What to take home
- Hardware and software people like each other; they just try to make their own lives easier.
- Linus Torvalds does not know much about high-speed networking.
- Memory registration is a false problem.
- The size of the pipe is only one factor.
  - Appropriate semantics and overlap are also important.
- RDMA will not save the world.
  - It just makes the hardware easier, at the cost of software complexity.
- Look for MX at SC04!
- Physicists are from Pluto, Grid Computing folks are from Melmak.
  - Physicists claim that the speed of light is a constant, but Grid Computing people say that latency is not a problem.