Source slides: 2010.rmll.info/img/pdf/lsm2010-os-hpc-2.pdf
TRANSCRIPT
1
Linux vs HPC: Life (and death) of strange features
Brice Goglin
RMLL – Talence - 2010/07/09
2
High Performance Computing
Solving very large/complex problems
• Numerical simulations
● Weather forecasting, seismology, …
Huge amount of computation
• A single machine cannot do it
• Interconnect many servers and make them work together
Eats computing power like a black hole
• Computing time does not decrease with bigger machines
● Problem size and solution accuracy increase
3
Linux rules the HPC world
http://top500.org
4
Open-source in the HPC world
Why Linux for high performance computing?
• Long history of manual tuning/tweaking/modifying the system for better performance
● Tune the OS to improve hardware usage
• Strong links with the academic world
● Many researchers involved, from computer science and other sciences
Open source everywhere in HPC?
• No, many applications/libraries/drivers are proprietary
5
Linux rules HPC but HPC doesn't rule Linux
Linus doesn't like HPC that much
• or doesn't like what HPC people do
HPC needs strange features that seem very specific
HPC tries to avoid the kernel because it's slow
• or not clever enough
Long history of hacks to improve performance
• Performance at any cost
6
Linux modified a lot for HPC
Performance records drive development
• Roadrunner broke the Petaflop/s barrier
● Who will break the 10 Peta- or Exaflop/s barrier?
Breaking records more important than portability
• There are very few huge computing machines in the world
● Portability does not appear so important for these people
Ugly hardware and/or software hacks
7
HPC doesn't like when Linux tries to be clever
Operating systems are full of tradeoffs
• Desktop/server/embedded/… have different workloads
● Cannot support all of them optimally at the same time
• Try to support all of them satisfactorily
Operating systems are full of heuristics
• Need to predict the future to anticipate
● Load pages from disk in advance, …
• Try to be clever and guess what the user wants to do
HPC doesn't want any of this
• HPC wants the best performance, no tradeoffs, no heuristics
• HPC wants Linux to be dumb and just do what we want
8
Example : Accessing files with O_DIRECT
Operating systems try to reduce disk accesses by reading files earlier and writing them later
• Not always efficient
● The kernel doesn't know what the application really wants
• Not good for memory consumption
• Not good when the application does it better
● Only the application knows what it really wants
● and what it will really do in the future
9
Example : Accessing files with O_DIRECT
[Diagram: three ways for an application to reach the disks. The OS doing clever things on behalf of a naive application: likely OK. The OS doing clever things on top of an application that already does clever things itself: not OK. The application doing clever things and accessing the disks directly: OK.]
10
Example : Accessing files with O_DIRECT
Operating systems try to reduce disk accesses by reading files earlier and writing them later
• Not always efficient
● The kernel doesn't know what the application really wants
• Not good for memory consumption
• Not good when the application does it better
● The application knows what it really wants
● and what it will really do in the future
High-performance applications (databases, out-of-core computation, ...) want the kernel to be dumb
• Stop doing these heuristics that don't help us!
A new way to open files was added (O_DIRECT)
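A minimal sketch of what O_DIRECT asks of the application, assuming Linux and a filesystem that supports it. The flag bypasses the page cache entirely, but in exchange the kernel requires the buffer, the file offset, and the transfer size to be aligned (typically to the block size), which is why an anonymous mmap (always page-aligned) is used here. Filesystems such as tmpfs reject O_DIRECT, so the sketch reports that case instead of failing.

```python
import mmap
import os
import tempfile

def write_direct(path, block_size=4096):
    """Write one block bypassing the page cache with O_DIRECT.

    Returns "ok" on success, or "unsupported" when the filesystem
    refuses O_DIRECT (e.g. tmpfs and some overlay filesystems).
    """
    # O_DIRECT demands aligned buffers; an anonymous mmap is
    # page-aligned, which satisfies the usual 512B/4KiB requirement.
    buf = mmap.mmap(-1, block_size)
    buf.write(b"x" * block_size)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
    except OSError:
        return "unsupported"
    try:
        # This write goes straight to the device, no page cache involved:
        # exactly the "kernel, stay out of the way" behavior HPC wants.
        os.write(fd, buf)
        return "ok"
    except OSError:
        return "unsupported"
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_direct(os.path.join(d, "data")))
```

The alignment burden is the price of the flag: with a plain `bytes` buffer at an arbitrary address, the same write would fail with EINVAL.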
11
What do people actually think of O_DIRECT?
« The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances »
Linus Torvalds – man 2 open
What's the actual problem?
• Some nice interfaces exist (e.g. fadvise) but they may be a bit less efficient
● And people don't want to rewrite/retune their HPC code for this other interface
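For contrast with O_DIRECT, here is a small sketch of the fadvise interface mentioned above, via Python's `os.posix_fadvise` wrapper (Linux). Unlike O_DIRECT, these are only hints: the kernel may ignore them, but existing read/write code needs no alignment gymnastics.

```python
import os

def drop_cached_pages(path):
    """Hint that this file's cached pages can be evicted.

    POSIX_FADV_DONTNEED tells the kernel the application will not
    reuse the data, so the page cache can reclaim it immediately.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

def read_sequentially(path):
    """Read a whole file after hinting a one-pass sequential access,
    which lets the kernel enlarge its readahead window."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        data = b""
        while chunk := os.read(fd, 65536):
            data += chunk
        return data
    finally:
        os.close(fd)
```

This is the "nice interface" side of the argument: cooperative hints rather than an opt-out, at the cost of leaving the final decision to the kernel's heuristics.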
12
Another Example : Large Receive Offload
Aggregation of incoming packets
• Process a single big packet instead of many small ones
The history of LRO in Linux isn't really clear
• Custom implementation inside the Neterion 10Gbit/s driver
● Added to Linux 2.6.17
• Other custom implementations rejected later
● Myricom (2.6.19) and Chelsio (2.6.20)
• Kernel maintainers want a generic implementation that all drivers would use!
● They're right!
13
Another Example : Large Receive Offload
History of LRO in Linux isn't really clear (continued)
• Another custom implementation in the NetXen driver!?
● Hidden and undocumented in a big commit in 2.6.20
● Not really reviewed by kernel maintainers?
« It was an error on my part […] I would gladly accept patches to rip out the code from NetXen. »
• Generic implementation added in 2.6.24
● Pushed by drivers that didn't have a custom implementation
● Mostly Intel and Myricom
● Others still use their custom implementation
● They won't convert to the new generic implementation unless the kernel maintainers remove their custom code by force?
14
Yet Another Example : TCP Offload Engine
Advanced network cards embed the TCP/IP stack
• Nothing to do in the processor and operating system
● Supposedly very nice for performance
Rejected by kernel devs because it's a bad idea
• No coordination/compatibility between duplicated components
● Firewall, quality of service, …
• Based on closed-source blackbox firmware
● Unclear security, maintenance, updates, …
http://www.linuxfoundation.org/collaborate/workgroups/networking/toe
15
So what ?
Conflict between (sane?) people
• who want nice code and features
and (crazy?) people
• who want the best performance ever at any cost
● Want random stuff in the kernel for performance reasons
● For some HPC people, 10ns is worth uglifying the code
● being non-portable, not supporting some corner cases, …
● Could lead to breakage, security risks, ...
Many discussions that may be constructive or not
• The way HPC features are accepted/rejected isn't so clear
16
Reasons for rejecting high-performance features
• The whole idea is stupid
• The implementation is wrong
• No users
• Not enough users
• Stupid because the kernel can be cleverer than that, or than your application
17
Don't try to abuse kernel maintainers
« How about we just remove the RDMA stack altogether ? [...] If you guys can't stay in your sand box and need to cause problems for the normal network stack, it's unacceptable. […] It seems an at least bi-monthly event that the RDMA folks need to put their fingers into something else in the normal networking stack. No more. »
18
Adding things to the Linux kernel
[Diagram: kernel layers, from applications down through system calls, internal "stuff", and drivers to the hardware.
• New hardware needs a new driver: easy
• New needs require a new system call: careful review. Is the application sane? Is the interface safe?
• Improved or new internal stuff: what improvement? Who will use it?]
19
The easy way
The InfiniBand stack added in 2.6.11
• High-performance networking technology
• Supported by many vendors, described in lengthy specifications, …
Many new drivers
• No problem
New application interface to access new hardware
• Many existing applications already ported
• Only minor technical problems were raised
No intrusive internal changes in the kernel
• Advanced features not included (see later in this talk)
20
I want my cool feature added in Linux !
You need somebody to use it in the official kernel
• Code that's not used isn't tested/maintained
• If only your external module uses it, you need to add your module to the official kernel first
That's even true for some bugs
• If a bug only occurs with a non-official module, it's not an important bug :)
● Hence some work-arounds in external HPC modules
Isn't InfiniBand the HPC user that you needed?
• Having a more widespread user may be better to convince people that your feature is really useful
21
Another example: Page Attribute Table (PAT)
Since the Pentium III, caching may be tuned precisely
• Enables very fast data transfers from the processor to I/O devices (write-combining)
• Critical for networking latency!
● Supported on Windows but not on Linux?!
● Lots of hacks in HPC network stacks
● Custom non-portable PAT implementations within HPC drivers
InfiniBand (in the kernel) wanted PAT support
• But PAT support required a lot of work in Linux
● Discussed and rejected periodically since 2006
• And PAT support is buggy on many old processors
22
Another example: Page Attribute Table (PAT)
Linux finally got PAT support in 2.6.26
• Not because HPC needed it
• Who else needs high-performance transfers to I/O devices?
GPUs!
• The latest Linux graphics stack pushed PAT support for improved performance
• HPC may now benefit from PAT too :)
● No need for ugly custom hacks anymore
23
MMU Notifiers
HPC needs deep knowledge of virtual memory
• Applications use virtual memory, hardware uses physical memory
• HPC needs to know how they correspond to each other
● It eases data transfers without expensive memory copies
HPC stacks have been hacking the kernel for 10 years to extract this knowledge
• No HPC stack in the kernel, no official user, no way to get official support for this feature
What about now?
• InfiniBand would be a user in the kernel
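As a concrete illustration of the virtual-to-physical correspondence discussed here, Linux does expose one user-visible window into it: `/proc/self/pagemap`, where each page of the process's address space has a 64-bit entry (bit 63: page present in RAM; bits 0-54: physical frame number). This sketch reads one entry; note that since Linux 4.0 the frame number is zeroed for unprivileged readers, precisely because leaking physical addresses is a security risk. HPC network stacks need this mapping in the kernel, at DMA time; this is only a user-space peek at the same information.

```python
import ctypes
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def pagemap_entry(vaddr):
    """Read this process's /proc/self/pagemap entry for a virtual address.

    Returns (present, pfn). The pfn reads as 0 without CAP_SYS_ADMIN
    on modern kernels.
    """
    with open("/proc/self/pagemap", "rb") as f:
        # One 64-bit entry per virtual page, indexed by page number.
        f.seek((vaddr // PAGE_SIZE) * 8)
        entry, = struct.unpack("<Q", f.read(8))
    present = bool((entry >> 63) & 1)
    pfn = entry & ((1 << 55) - 1)
    return present, pfn

if __name__ == "__main__":
    # Touch a buffer so its page is resident, then look it up.
    buf = ctypes.create_string_buffer(b"x" * PAGE_SIZE)
    present, pfn = pagemap_entry(ctypes.addressof(buf))
    print(f"present={present} pfn={pfn:#x}")
```

What MMU Notifiers add, and what this read-only view cannot provide, is the other direction: being told when a mapping changes after you looked it up.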
24
MMU Notifiers
This feature is highly specific
• Some people think the whole idea is wrong...
• Nobody envisions any usage outside of HPC...
What about virtualization?
• KVM needs similar knowledge of virtual/physical memory correspondence
• KVM is in the kernel
● And virtualization is widely used, more than HPC
KVM developers pushed MMU Notifiers into 2.6.27
• Should solve what HPC has been wanting for 10 years!
25
ummunotify
MMU Notifiers are the kernel side
• Some HPC software wants the same notifications in user space too
Hard work to design/implement what HPC still needs
• Not accepted (yet?)
« The interface claims to be generic, but is really just a hack for a single use case that very few people care about. I find the design depressingly stupid, even if the code itself is at least small and simple. […] Can't you crazy RDMA people just agree on an RDMA interface, and making it part of that ? It still makes zero sense outside of that small niche as far as I can tell. »
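To see why user space wants these notifications at all, here is a purely conceptual sketch (no real RDMA API, all names hypothetical): MPI libraries cache memory registrations keyed by virtual address to avoid re-pinning pages on every transfer, but `free()` followed by `malloc()` can return the same virtual address backed by different physical pages, so a cache with no invalidation events silently serves stale registrations.

```python
class RegistrationCache:
    """Hypothetical sketch of a user-space registration cache, as MPI
    implementations keep to avoid re-registering memory per transfer."""

    def __init__(self):
        self._cache = {}  # virtual address -> cached registration

    def register(self, vaddr, registration):
        self._cache[vaddr] = registration

    def lookup(self, vaddr):
        return self._cache.get(vaddr)

    def invalidate(self, vaddr):
        # With a mechanism like ummunotify, the kernel would trigger
        # this when the mapping behind vaddr changes. Without it, the
        # library must intercept malloc/free itself -- a classic hack.
        self._cache.pop(vaddr, None)


# The hazard, simulated with a made-up address:
cache = RegistrationCache()
cache.register(0x7F0000000000, "registration for old buffer")
# The application frees that buffer and gets a new one at the same
# virtual address; without invalidation, the lookup still "hits":
stale = cache.lookup(0x7F0000000000)
print(stale)  # stale entry, pointing at pages that were remapped
```

This staleness problem, not the caching itself, is what the rejected ummunotify interface was meant to solve.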
26
Summary
HPC uses Linux intensively
• But Linux support for HPC always comes very late
HPC has special requirements
• Doesn't want the kernel to be clever
● Wants the kernel to let HPC applications do what they want
• Specific needs that are rarely used in any other context
• Makes new features hard to merge into the Linux kernel
Things are getting better on the networking side
• But HPC is more than networking
• e.g. storage still has problems with the POSIX API being too restrictive for parallel file systems
27
Last example: HPC vs. complex modern architectures
HPC wants to know what the hardware is made of
• Try to exploit the cores and memory in the best way
• Very important on modern machines
● Many processors, cores, shared caches, …
Needs Linux to show the hardware structure
• Many things shown in /sys/
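A small sketch of what "shown in /sys/" means in practice: each logical CPU directory under `/sys/devices/system/cpu/` exposes topology files such as `core_id` and `physical_package_id`, which user-space tools (the hwloc library, for instance) assemble into a full machine map. The fallback path below is an assumption for minimal environments where sysfs is not mounted.

```python
import glob
import os

def cpu_topology():
    """Map each logical CPU to its package and core, read from sysfs.

    Returns {cpu_number: {"package": int|None, "core": int|None}},
    falling back to bare CPU numbers if /sys is unavailable.
    """
    def read_int(path):
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return None  # file missing (offline CPU, minimal container)

    cpus = glob.glob("/sys/devices/system/cpu/cpu[0-9]*")
    if not cpus:
        # sysfs not mounted: degrade to a flat list of CPUs.
        return {i: {"package": None, "core": None}
                for i in range(os.cpu_count() or 1)}
    topo = {}
    for cpu in cpus:
        n = int(os.path.basename(cpu)[3:])
        topo[n] = {
            "package": read_int(os.path.join(cpu, "topology/physical_package_id")),
            "core": read_int(os.path.join(cpu, "topology/core_id")),
        }
    return topo

if __name__ == "__main__":
    topo = cpu_topology()
    print(f"{len(topo)} logical CPUs")
```

Two logical CPUs sharing the same (package, core) pair are SMT siblings; spotting such sharing is exactly why HPC runtimes read these files before placing threads.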
28
Kernel developers want to drive things they can't
The AMD Magny-Cours processor adds a new type of structure
« First I must say it's unclear to me if CPU topology is really generally useful to export to the user. »
« It would be very nice to propagate this info to where it really matters : the sched-domains topology info. »
Is exposing this info to the scheduler enough?
• Assumes the scheduler is clever enough for HPC
● Far from true...