Title: AI, HPF, Grid Computing: Chronicles of failures
1. AI, HPF, Grid Computing: Chronicles of failures
- How to play funding agencies.
- Patrick Geoffray
- patrick_at_myri.com
2. HPC lies and the lying liars telling them
- How the HPC community gave up on checks and balances.
- Patrick Geoffray
- patrick_at_myri.com
3. Hardware folks are from Mars, software guys are from Venus.
- Design choices for Myrinet / MX.
- Patrick Geoffray
- patrick_at_myri.com
4. Addressing woes
- Hardware folks use IO-space addresses, only valid on the IO bus.
- Software people manipulate virtual memory addresses, allocated to them by the OS, only meaningful in the OS virtual memory mapping.
- Sometimes the IO address space is mapped one-to-one onto the physical memory space:
  - x86, AMD64, IA-64 (some).
- Sometimes an IOMMU sits between the IO bus and the memory bus and maps the IO space into the physical memory space:
  - PPC (Power, G5), IA-64 (some).
- For zero-copy, we need to convert addresses from virtual memory space into IO space -> memory registration/deregistration.
5. Memory registration/deregistration
- Memory registration:
  - Trap into the OS kernel.
  - Lock access to the OS page table.
  - Loop over every page:
    - Walk the OS page table to find the virtual page.
    - Swap the virtual page in, if needed.
    - Increment the page reference count.
    - Pin the virtual page in the OS page table (mark it not swappable).
    - Get the IO address of the related physical memory page (may require the IOMMU).
  - Unlock access to the OS page table.
  - Return to user space.
- Memory deregistration:
  - Trap into the OS kernel.
  - Lock access to the OS page table.
  - Loop over every page:
    - Walk the OS page table to find the virtual page.
    - Unpin the virtual page in the OS page table (mark it swappable).
    - Decrement the page reference count.
    - Clear the related entry in the IOMMU, if any.
    - Clear the related entry in any cache on the NIC, if any.
  - Unlock access to the OS page table.
  - Return to user space.
6. Dodging the bullet
- Memory registration can be very expensive.
- Hardware folks did not think it would be used in the critical path: explicit memory registration in low-level, hardware-driven APIs such as VIA and InfiniBand.
- No explicit memory registration in higher-level communication libraries such as MPI or Sockets.
- Various methods to attempt to dodge the bullet:
  - Do not do zero-copy: makes sense for small messages, where a memory copy is cheaper.
    - Trashes the cache, uses the CPU.
  - Implement a registration cache: lazy deregistration with a garbage collector.
    - Need to hijack malloc to catch when memory pages are released to the OS.
    - Maintenance nightmare.
    - Poor efficiency on complex codes; the high cache-hit rate on Linpack does not generalize.
  - Do not register memory: maintain a copy of the OS page table in the NIC.
    - No OS support: requires patching the OS, portability issues.
7. State of the Union
8. Design choice in Myrinet / MX
- Patching the OS is bad for many reasons.
  - Customers don't like it; maintenance nightmare.
- Memory registration is in the critical path; let's live with it.
  - Optimize the register/deregister code path.
  - Factorize page table accesses, aggressive locking.
  - 4 us for the first page, 0.15 us asymptotically (8 pages).
- No long-term caching in the NIC.
  - Will deregister after a single use.
- No explicit memory registration for the Send/Recv semantic.
- Keep the door open for a registration cache.
  - Just for Linux, where it is easy to catch free() calls.
  - No big reward, but it is so simple.
9. MX/GM MPI Pingpong
10. MX/GM MPI Pingpong (no regcache)
11. The Right Thing
- Hardware would provide an IOMMU mapping virtual memory addresses into physical memory addresses. All IO devices would then directly manipulate virtual memory addresses.
  - If the IOMMU resolution misses, interrupt the OS and Nack the IO transaction.
- Software would maintain a copy of the OS page table in this IOMMU, or only in a specific device. At the very least, the OS should provide well-defined hooks to let the driver maintain its own copy of the OS page table.

Who said "As the memory copy is always going to be much faster than the network, zero-copy just does not make sense"?
Linus Torvalds, 2002
12. Message passing woes
- Hardware folks like simple semantics like PUT or GET.
  - Called RDMA by the marketing department.
- Software guys use two-sided interfaces such as MPI or Sockets.
  - Two-sided interfaces are easier to manipulate for large, complex codes.
  - MPI is the de facto programming standard in HPC.
- Mapping MPI on top of RDMA is like:
  - Training an AI to be a shrink.
  - Using HPF to generate non-trivial parallel code.
  - Running HPC codes on a set of loosely coupled, geographically widely distributed machines.

It sounds easy, but it is a pain to implement and it performs poorly.
13. MPI over RDMA
- History repeats itself: Memory Channel, Giganet (VIA), SCI, IB.
- How to implement a matching semantic on top of a one-sided API?
- Matching has to be done somewhere, sometime, by someone.
- Matching can be done after sending data:
  - Eager mode, copy on the receive side.
  - Where do you PUT the eager message in the first place?
  - Shared queue with tokens? Multiple windows? Polling or blocking?
- Matching can be done before sending data:
  - Rendez-vous with small packets containing the matching information.
  - Matching done by the host whenever the rendez-vous happens.
  - If the host is not available, either wait or interrupt it.
14. State of the Union
15. Classical MPI overlap problem
16. Matching in the NIC?
- Matching between incoming messages and posted receives in MPI.
  - 64 bits of matching information.
  - Receives are posted by the host.
  - If no matching receive is found -> unexpected message.
  - The asynchronous aspect is important for large messages.
- Bad:
  - Limited resources:
    - NIC cycles (linear matching).
    - NIC memory (number of posted receives).
  - MPI state in the NIC.
- Good:
  - Perfect overlap of communication with computation.
  - Does not interrupt the host.
  - Very loose rendez-vous for zero-copy.
17. Design choice in Myrinet / MX
- Limited number of posted receives in the NIC: 32 or 64 per endpoint.
- Pass it to the host if:
  - No matching is found.
  - Race between an incoming message and the posting of a receive handle.
- When the host receives a message:
  - Check posted receives, starting after the ones tried by the NIC.
  - If no match is found -> unexpected message.
- Progression thread: the NIC may need to ask the host to be involved.
  - Matching large messages unmatched by the NIC.
  - Host queues are full.
  - No periodic timer, only when needed.
18. The Right Thing
- Interconnects should natively support the interfaces that people use in real-world applications:
  - Reliable matching (MPI, Sockets, Storage).
  - Datagram (IP, databases).
- The hardware people should not impose their simplistic notion of low-level communication software.
- The software people should not passively accept the domination of the hardware clan and just do the best they can with what they are given.

Who is behind RDMA-based low-level interfaces?
Intel
19. What to take home
- Hardware and software people like each other; they just try to make their own lives easier.
- Linus Torvalds does not know much about high-speed networking.
- Memory registration is a false problem.
- The size of the pipe is only one factor.
  - Appropriate semantics and overlap are also important.
- RDMA will not save the world.
  - It just makes the hardware easier, at the cost of software complexity.
- Look for MX at SC04!
- Physicists are from Pluto, Grid Computing folks are from Melmak.
  - Physicists claim that the speed of light is a constant, but Grid Computing people say that latency is not a problem.