Ultrafast X-Ray Imaging of Scientific Processes
Agenda
• Synchrotron Tomography
• UFO Project
• Parallel Architectures
• Hardware-aware Computing
• UFO Reconstruction Platform
• Performance
Authors / Collaboration:
Suren A. Chilingaryan (KIT), Michele Caselle (KIT), Thomas van de Kamp (KIT), Andreas Kopmann (KIT), Alessandro Mirone (ESRF), Uros Stevanovic (KIT), Tomy dos Santos Rolo (KIT), Matthias Vogelgesang (KIT)
Synchrotron
Electrons are accelerated in the linear accelerator and booster and injected into the synchrotron storage ring in bunches.
Bending magnets curve the electron trajectories and force the electrons to emit a fan-shaped beam of X-ray radiation.
ANKA synchrotron at KIT
Bending Magnets
ANKA Synchrotron
2.5 GeV storage ring, 16 beamlines:
Radiography, topography, microdiffraction imaging, tomography, laminography, reflectivity, spectroscopy, microscopy, lithography.
Characterization of nano- and micro-structures; life sciences.
Tomography Beam-line at ANKA
Beam path: storage ring → bending magnet → DMM monochromator → attenuator → slits → Be window → slits → experiment (sample → scintillator → CCD detector).
The rotating sample in front of a pixel detector is penetrated by X-rays. The absorption at different angles is registered by the camera, and a 3D map of the sample density is reconstructed.
Scintillators are used to convert the X-rays to visible light, which allows the use of standard CCD cameras.
X-ray Microtomography
Heads of a newt larva showing bone formation and muscle insertions (top) and a stick insect (bottom).
Left: Volume rendering of a tomographic stack of Endler’s guppy (Poecilia wingei). Right: isolated gonopodium (modified anal fin).
Synchrotron Tomography
The sample is rotated at constant speed while the pixel detector registers a series of parallel 2D projections of the sample density at angles θ = 0 … 180°.
Reconstruction
[Image series: back projection with an increasing number of projections, n = 1, 2, 3, 10, 50, 1000.]
Tomographic reconstruction is used to produce 3D images from a set of two-dimensional projections. Vertical slices are processed independently. For each slice, all projections are smeared back onto the cross-section along the direction of incidence, yielding an integrated image.
Filtered Back Projection
For each pixel (x, y) of the slice and each projection at angle α:
1. Compute the detector coordinate x●cos(α) - y●sin(α).
2. Interpolate between the neighboring detector bins.
3. Sum the interpolated values over all projections.
4. The sum is the reconstructed value at (x, y).
For each texel of the output volume and for each projection we perform a single linear interpolation.
The reconstruction consists of two steps:
1. Filtering: multiplication with the configured filter in Fourier space.
2. Back projection.
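To make the procedure concrete, here is a minimal CUDA sketch of the back-projection step (an illustration with hypothetical names, not the production kernel, which uses the texture engine and the per-architecture tuning discussed later):

    __global__ void backproject(float *slice, int n,
                                const float *sino, int nbins, int nproj,
                                const float *cos_a, const float *sin_a)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= n || y >= n) return;

        float cx = x - n / 2.0f, cy = y - n / 2.0f;  /* center of rotation */
        float sum = 0.0f;
        for (int p = 0; p < nproj; p++) {
            /* step 1: detector coordinate of (x, y) under projection p */
            float v = cx * cos_a[p] - cy * sin_a[p] + nbins / 2.0f;
            int   i = (int)v;
            float w = v - i;
            if (i >= 0 && i + 1 < nbins)     /* step 2: linear interpolation */
                sum += (1.0f - w) * sino[p * nbins + i]
                     +         w  * sino[p * nbins + i + 1];
        }
        slice[y * n + x] = sum;              /* steps 3-4: sum over projections */
    }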
Reconstruction Problem
PCO.edge camera: resolution 2560 x 2160, dynamic range 16 bit, frame rate 100 fps.
Tomographic reconstruction: 3D image of 2000³ voxels from 2000 projections; acquisition time 20 seconds.
FBP complexity: 144 Tflops (roughly 2000³ voxels x 2000 projections at a few flops per update). Xeon performance: ~100 Gflops. Minimum time: ~15 minutes on a dual-processor machine; in practice ~1 hour.
20 seconds of acquisition, 1 hour of reconstruction.
Ultra Fast X-ray Imaging of Scientific Processes with On-Line Assessment and Data-Driven Process Control
UFO data flow: ANKA beam line → optics and sample manipulators → smart high-speed camera → online monitoring and evaluation → offline storage.
Goals:
• High-speed tomography: increase sample throughput, enable tomography of temporal processes, allow interactive quality assessment.
• Data-driven control: auto-tuning of the optical system, tracking dynamic processes, finding areas of interest.
Example: 4D cine-tomography experiment
In vivo X-ray 4D cine-tomography experiment. (a) Photograph of Sitophilus granarius, dorsal view. (b) Experimental set-up for ultra-fast X-ray microtomography showing the bending magnet (1), rotation stage (2), fixed specimen (3), and detector system (4). (c) Radiographic projection. (d) 3D rendering of the reconstructed volume with the thorax cut open, revealing the hip joints (arrows). (e) In vivo cine-tomographic sequence of the moving weevil.
Parallel Architectures
[Chart: peak GFlops (log scale) of Xeon vs. Tesla, 2007-2012. Single precision grows to 3950 GFlops for Tesla vs. 192 for Xeon; double precision to 1170 vs. 96.]
                 E7-8870   Xeon Phi   Tesla K20X   GeForce Titan   HD7970   Power7+
SP (Gflops)      192       2020       3950         4500            3790     265
DP (Gflops)      96        1010       1310         1300            950      132
Memory (GB/s)    34.11     320        250          288             264      68
Texture (GT/s)   –         –          180          187.5           118.4    –
Max devices      8         ?          8            8               ≥4       32
Price            $4,800    $2,800     $3,200       $1,000          $400     $$$$$$$$
Efficiency
[Charts: peak vs. measured performance of GTX Titan, GTX 680, GTX 580, HD7970, and Core i7-980X for matrix multiplication (GFlops, up to ~5000), 1D fast Fourier transform (GFlops, up to ~500), and memory bandwidth (GB/s, up to ~300). The measured fraction of peak ranges roughly from 63-98% for memory bandwidth and 43-72% for matrix multiplication down to 3-52% for the FFT. GTX Titan performance is taken from AnandTech.]
GPU programming considerations
• Special programming tools and techniques are required
  – Multiple different and ever-changing architectures
  – Branching and many other operations are very expensive
  – Optimized mathematical libraries are sometimes missing
• Limited amount of memory and expensive data transfers
  – x16 PCIe gen2 (8 GB/s), gen3 (16 GB/s)
  – Specially allocated (pinned) memory is required for full performance and to overlap computations with data transfers
• Reduced caches, low memory-to-computation ratio, strict access patterns
  – 177 GB/s per Tflop for a Xeon, only 60-70 GB/s per Tflop for GPUs
  – Varying cache hierarchies on different architectures
  – Special access patterns are required for good performance. For instance, the bandwidth of a matrix transpose on a GTX280 (142 GB/s memory bandwidth) is 2 GB/s for the naive approach, 17 GB/s if shared memory is used, and 80 GB/s if care is taken with shared-memory banks and global-memory partitions (see the transpose sketch after this list)
• I/O problem
  – HDDs provide ~100 MB/s of sequential writing while the camera produces ~1 GB/s
  – Handling big data sets that do not fit in memory (up to 500 GB)
• Various problems with a growing number of GPUs connected to one system
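For illustration, a minimal CUDA transpose sketch (hypothetical names; block dimensions assumed to be TILE x TILE) showing shared-memory tiling for coalesced access and the padding that avoids shared-memory bank conflicts:

    #define TILE 32

    __global__ void transpose(float *out, const float *in, int width, int height)
    {
        /* the +1 padding avoids shared-memory bank conflicts */
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];   /* coalesced read */

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;   /* transposed block origin */
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y]; /* coalesced write */
    }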
NUMA Architecture and Data Transfers
Bandwidth: QPI 14.4 GB/s; DDR3 10.6 GB/s (DDR3-1333); PCIe 16 GB/s (gen3 x16).
[Diagram: dual-socket Xeon E5-2640 NUMA topology.]
The QPI bus alone is not even enough to feed both GPU cards.
[Charts: host-to-device and device-to-host bandwidth. GTX590 (gen2, up to ~8 GB/s): non-NUMA vs. NUMA-aware vs. pinned allocation. AMD HD7970 (gen3, up to ~12 GB/s): pageable vs. pinned memory.]
Pinned-memory support is limited in OpenCL.
NVIDIA does not support gen3 mode on X79 boards (a workaround exists for Windows, but not for Linux).
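For reference, the pinned-memory pattern in CUDA (a minimal sketch; buffer names and sizes are hypothetical). Pageable malloc() memory caps PCIe throughput; pinned buffers enable full bandwidth and asynchronous copies that overlap with kernel execution:

    float *host_buf, *dev_buf;
    cudaHostAlloc((void **)&host_buf, size, cudaHostAllocDefault);  /* pinned */
    cudaMalloc((void **)&dev_buf, size);

    cudaStream_t s;
    cudaStreamCreate(&s);

    /* returns immediately; the copy overlaps with work in other streams */
    cudaMemcpyAsync(dev_buf, host_buf, size, cudaMemcpyHostToDevice, s);
    /* ... launch kernels in stream s while the next buffer is prepared ... */
    cudaStreamSynchronize(s);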
Image Processing Framework
Pipeline: acquisition → flat-field correction → noise reduction (preprocessing, executed on CPUs) → sinogram generation → FFT → filter → iFFT → back projection (reconstruction, executed on GPUs) → segmentation/meshing → storage, with storage of the raw data in parallel.
Open source: http://ufo.kit.edu
Features:
➢ Easy algorithm exchange
➢ Camera abstraction
➢ Pipelined processing
➢ GLib/GObject, scripting-language support through introspection
➢ OpenCL + automated management of OpenCL buffers
Filters: tomography & laminography (FBP, DFI, algebraic methods), non-local means for noise reduction, optical flow.
[Pie chart: back projection takes ~90% of the time; filtering and PCIe transfers make up the rest.]

Filtered Back Projection pipeline: an image loader reads data into a pool of sinograms in host memory. A pool of CPU and GPU processing threads fetches slices for processing; each GPU thread runs a two-stage pipeline (filtering, then back projection of the W x H slice through the texture engine) and uses double buffering on both ends so that the PCIe data transfers overlap with computation. Results are stored into a pool of vertical slices in host memory and from there written to data storage.
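A condensed CUDA sketch of the double-buffering scheme (hypothetical names; sino_bytes, grid, block, load_sinogram, and filter_and_backproject are placeholders), which hides the PCIe transfers behind kernel execution:

    cudaStream_t stream[2];
    float *host_sino[2], *dev_sino[2];    /* two pinned/device buffer pairs */
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&stream[i]);
        cudaHostAlloc((void **)&host_sino[i], sino_bytes, cudaHostAllocDefault);
        cudaMalloc((void **)&dev_sino[i], sino_bytes);
    }

    for (int s = 0; s < num_slices; s++) {
        int b = s % 2;                     /* ping-pong between the buffers  */
        cudaStreamSynchronize(stream[b]);  /* wait until buffer b is free    */
        load_sinogram(host_sino[b], s);    /* CPU-side fetch of next slice   */
        cudaMemcpyAsync(dev_sino[b], host_sino[b], sino_bytes,
                        cudaMemcpyHostToDevice, stream[b]);
        filter_and_backproject<<<grid, block, 0, stream[b]>>>(dev_sino[b]);
        /* while stream b computes, the next iteration already fills 1 - b */
    }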
Tuning for hardware architectures
GT200 – base version: uses the texture engine.
Fermi – high computational power but a slow texture unit: reduce the load on the texture engine by using shared memory to cache the fetched data and then performing the linear interpolation in the computation units.
Kepler – low bandwidth of integer instructions but a high register count: use the texture engine, but process 16 projections at once and 16 points per thread to improve the cache hit rate.
GCN – high performance of both the texture engine and the computation units: balance the load between them to reach the highest performance.
VLIW – executes 5 independent operations per thread: compute 16 points per thread to provide a sufficient flow of independent instructions to the VLIW engine.
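A simplified CUDA sketch of the "several points per thread" idea used on Kepler and VLIW (illustrative only: 4 points per thread instead of 16, hypothetical names, detector-center offset and texture border handling omitted):

    __global__ void backproject_multi(float *slice, int n,
                                      cudaTextureObject_t sino,
                                      int nproj, const float2 *trig)
    {
        const int PTS = 4;              /* points reconstructed per thread */
        int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * PTS;
        int y  =  blockIdx.y * blockDim.y + threadIdx.y;
        if (x0 + PTS > n || y >= n) return;

        float sum[PTS] = {0.0f};
        for (int p = 0; p < nproj; p++) {
            float c = trig[p].x, s = trig[p].y;         /* cos(α), sin(α) */
            float v = (x0 - n / 2.0f) * c - (y - n / 2.0f) * s + 0.5f;
            /* neighboring points differ by c in detector coordinate, so  */
            /* the fetches are independent and keep the pipeline full     */
            for (int k = 0; k < PTS; k++)
                sum[k] += tex2D<float>(sino, v + k * c, p + 0.5f);
        }
        for (int k = 0; k < PTS; k++)
            slice[y * n + x0 + k] = sum[k];
    }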
[Chart: back-projection performance in giga-interpolations per second (up to ~180) for E5-2640, GTX280, HD5970, GTX580, GTX680, and HD7970; the architecture-specific tuning yields gains of +75% to +530% over the base version.]
Handling large data sets
Random-access throughput (MB/s, log scale):
A7K2000 (single HDD)     28.71
RAID of 12 drives        29.74
Intel X25-E              140
2 x Crucial C300         417.39
4 x Crucial C300         965.21
8 x Samsung 840 Pro      1575.85
RAM disk                 2859.88
Using SSDs may significantly increase random-access performance for data sets that do not fit in memory completely. Big arrays of magnetic hard drives will not help unless multiple readers are involved.
FBP Reconstruction Pipeline
Moderate-size data sets are stored in memory; huge data sets are cached on the SSD RAID.

4-stage pipeline (I/O + computations), sketched below:
1. Reading data from the fast SSD RAID-0 (random reads are effective).
2. Scheduling and preprocessing using SIMD instructions of the x86 CPUs.
3. Reconstructing on the GPUs.
4. Storing to a RAID of magnetic disks (sequential writes are effective).
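A minimal sketch of the stage-to-stage hand-off that such a pipeline needs (plain C threads with a bounded queue; names and depth are hypothetical). Each stage runs in its own thread(s) and blocks only when a neighbor falls behind, so SSD reads, CPU preprocessing, GPU reconstruction, and disk writes proceed concurrently:

    #include <pthread.h>

    #define QDEPTH 4
    typedef struct {
        void *slot[QDEPTH];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } queue_t;

    /* producer side: blocks while the consumer stage is QDEPTH items behind */
    void queue_push(queue_t *q, void *item)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QDEPTH)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->slot[q->tail] = item;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* consumer side: blocks while the producer stage has nothing ready */
    void *queue_pop(queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        void *item = q->slot[q->head];
        q->head = (q->head + 1) % QDEPTH;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return item;
    }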
High performance data storage
In the default Linux data flow, data passes through the buffer cache on its way to storage; the buffer cache significantly limits the maximum write performance. Kernel AIO may be used to program the I/O scheduler to issue read requests without delays. We optimize I/O for maximum streaming performance with a single data source/receiver.
[Chart: read and write throughput (MB/s, up to ~4500) for buffered, direct, and AIO I/O.]
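A minimal sketch of bypassing the buffer cache with O_DIRECT (plain POSIX; the file name and chunk size are illustrative, and buffers and transfer sizes must be aligned to the device block size):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    enum { ALIGN = 4096, CHUNK = 64 * 1024 * 1024 };   /* 64 MB aligned chunks */

    int main(void)
    {
        int fd = open("/mnt/ssd-raid/stream.raw",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        void *buf;
        posix_memalign(&buf, ALIGN, CHUNK);   /* O_DIRECT requires alignment */

        /* fill buf with camera data, then write bypassing the buffer cache */
        ssize_t n = write(fd, buf, CHUNK);
        (void)n;
        close(fd);
        free(buf);
        return 0;
    }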
UFO Computing Infrastructure
The server connects to the cameras (PCO.edge, PCO.dimax, PCO.4000) via a CameraLink frame grabber (850 MB/s), to an external GPU box via PCIe x16 (8 GB/s), to SAS-attached storage via SFF-8088 (2.4 GB/s), and to the LSDF (Large Scale Data Facility) via 10 Gb/s Ethernet.

SuperMicro 7046GT-TRF (dual Intel 5520 chipset)
CPU: 2 x Xeon X5650 (12 cores total at 2.66 GHz)
GPUs: 4 x GTX590, external
Memory: 96 GB in 12 DDR3 slots (192 GB max)
Network: Intel 82598EB (10 Gb/s)
Camera Link frame grabber (850 MB/s)
Storage: Areca ARC-1880-ix-12 SAS RAID; 16 x Hitachi A7K2000 (RAID-6); 8 x Samsung 840 Pro 512 (RAID-0)

[Charts: internal vs. external GPU performance (GT/s, up to ~600); camera-storage sequential write with 16 vs. 32 disks (MB/s, up to ~2000); SSD RAID read/write (MB/s, up to ~4000).]
PCIe Extension Box
External GPU enclosure by One Stop Systems: 1 x PCIe x16 2.0 host connection, 4 x GTX590 (8 GPU cores).
[Charts, with the external box configuration: per-slice transfer time (ms, up to ~9) for NUMA vs. non-NUMA placement; compute time (ms per 8 slices) for each of the 8 GPU cores; speed-up vs. number of GPUs (1-8), scaling near-linearly to ~7-7.8x at 8 GPUs for both internal and external placement; and the 8-GPU speed-up compared between NUMA and non-NUMA allocation.]
Performance of Filtered Back Projection
[Charts: reconstruction throughput (MB/s, up to ~1200) for small (4 GB), standard (11 GB), big (120 GB), and huge (300 GB) data sets, reading from memory and from SSD. On the 11 GB data set the GPU pipeline reaches 1016.26 MB/s versus 17.93 MB/s for the CPU (log scale).]
Scaling up to Cluster
Camera → readout PC (CUDA + GPUDirect, SSD cache, up to 8 GB/s) → master node with the task scheduler (lots of memory) → OpenCL compute nodes → storage nodes → distributed storage in the LSDF (Large Scale Data Facility). Transfer rates drop along the chain: 4 GB/s, < 2 GB/s, 0.25 GB/s. The nodes are connected by optical InfiniBand through an IB router; a real-time control loop closes at the readout PC and an outer control loop runs through the cluster.
Storage Protocols
[Chart: write and read throughput (MB/s, up to ~4500) for the raw device, iSER, XFS local and over iSER, OCFS2 local and over iSER, FhGFS over XFS, and Gluster over XFS.]
Network file systems (NFS, Samba, SSHFS): slow.
Cluster file systems (Lustre with a patched kernel, Gluster, FhGFS (closed-source)): slow if only a few nodes are used.
Network block devices: iSCSI is slow, iSER is fast; OCFS2 can be run on top.
UFO Storage Subsystem
Two storage nodes, each a RAID-6 of 16 Hitachi 7K3000 drives (28 TB), export their capacity as a 20 TB and an 8 TB volume. The two 8 TB volumes are attached to the camera PC over iSER and combined into a 16 TB XFS soft-RAID level 0, giving high-speed streaming storage (read: 2.3 GB/s, write: 2.5 GB/s). The 20 TB volumes are served to the compute nodes via GlusterFS, and clients access the data over NFS.
Summary
• We are in the age of parallel architectures.
• Getting good performance is rather easy; getting ultimate performance while managing multi-gigabyte streams of data is complicated and needs care on multiple levels. The hardware should be carefully selected according to the planned tasks and data rates, and the software should be tuned to the selected hardware.
• Streams of about 500 MB/s may be processed with a single reconstruction station; a cluster is required to handle more data in near real-time.
• A hybrid CUDA/OpenCL system is probably the best approach.
• The UFO Parallel Processing Framework is provided to help you with some of these difficulties and will be presented in the next talk.
Extra Slides
Parallel Programming Environments
• CUDA – the oldest GPU programming technology, from NVIDIA.
• OpenCL – an open-standard technology close to CUDA, but working with a wide range of hardware platforms including CPUs and GPUs.
• OpenACC – a declarative technology similar to OpenMP.
• MATLAB and other mathematical packages with integrated GPU support – only some operations are parallelized, and the need to transfer data over the slow PCIe bus to execute the non-parallelized operations kills the performance.

CUDA
• Supports the latest NVIDIA technologies:
  – GPUDirect – direct transfers between the GPU and InfiniBand adapters, etc.; integration with MPI frameworks.
  – Dynamic parallelism – GPUs are able to spawn new jobs.
• NVIDIA provides a set of highly optimized libraries (BLAS, FFT, LAPACK, reduction, etc.) – see the filtering sketch after this list.
• Only NVIDIA GPUs are supported.
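As an example of the library support, a minimal cuFFT-based sketch of the FBP filtering step (multiplication with a filter in Fourier space; function names, buffer layout, and the precomputed filter are illustrative):

    #include <cufft.h>

    /* hypothetical kernel: multiply each row's spectrum with a 1D filter */
    __global__ void apply_filter(cufftComplex *f, const float *filt,
                                 int nbins_c, int total)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < total) {
            float w = filt[i % nbins_c];   /* same filter for every row */
            f[i].x *= w;
            f[i].y *= w;
        }
    }

    void filter_sinogram(float *dev_rows, float *dev_filter,
                         cufftComplex *dev_spectrum, int nbins, int nrows)
    {
        cufftHandle fwd, inv;
        cufftPlan1d(&fwd, nbins, CUFFT_R2C, nrows);   /* batched 1D FFTs */
        cufftPlan1d(&inv, nbins, CUFFT_C2R, nrows);

        cufftExecR2C(fwd, dev_rows, dev_spectrum);
        int nbins_c = nbins / 2 + 1;
        int total   = nbins_c * nrows;
        apply_filter<<<(total + 255) / 256, 256>>>(dev_spectrum, dev_filter,
                                                   nbins_c, total);
        cufftExecC2R(inv, dev_spectrum, dev_rows);    /* unnormalized inverse */

        cufftDestroy(fwd);
        cufftDestroy(inv);
    }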
Programming Environments
OpenCL
• Syntax is very similar to CUDA (easy porting).
• Well-written code is as fast as CUDA.
• Works with CPUs and GPUs from multiple vendors (Intel, AMD, IBM, NVIDIA).
• There is still no way to run code simultaneously on an NVIDIA GPU and the CPU, but it is possible with AMD cards.
• Many libraries exist, but they are generally slightly slower than their CUDA counterparts, and some are only available commercially.
• No GPUDirect and significantly limited options to use pinned memory (i.e. slower data transfers).

OpenACC
• Existing applications may be easily parallelized, and developing new code is easy compared to OpenCL/CUDA.
• No free compilers exist at the moment, though there is a similar technology, OmpSs, developed at the Barcelona Supercomputing Center.
• At the current level the technology does not support shared memory and some other features available with direct programming (i.e. it is slower).
[Chart: cuFFT vs. oclFFT throughput (GFlops, 0-300).]
Memory: Space vs. Speed
[Diagram: CPU with three memory channels and up to three DIMMs per channel (DPC = DIMMs per channel).]

Bandwidth vs. DIMMs per channel:
DPC        Xeon X5500   Xeon E5-2600
1 DIMM     10.6 GB/s    12.8 GB/s
2 DIMMs    8.5 GB/s     12.8 GB/s
3 DIMMs    6.4 GB/s     8.5 GB/s
Fermi Optimizations
GTX280 vs. GTX580: computational units 240 cores at 1.3 GHz vs. 512 cores at 1.55 GHz (2.5x speedup); texture engine 48 GT/s vs. 49.4 GT/s (1.0x, no speedup).

Standard version: the texture engine is heavily loaded, performing N²●P texture fetches (an N x N image, P projections).

Fermi-optimized version: both the texture and computation engines are used. A block of threads actually accesses only 3●N/2 bins per projection (for an N x N thread block, e.g. 16 x 16 pixels), so these bins are fetched once through the texture engine into shared memory ((3/2)●N●P texture fetches) and the N² linear interpolations are performed in the computation units.
v = x●cos(α) - y●sin(α)
max_{x,y<N}(v) - min_{x,y<N}(v) < N√2, and N√2 < 1.5 N
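A schematic CUDA fragment of the Fermi strategy (hypothetical names, simplified coordinates without the detector-center offset): the block first stages its bin window in shared memory through a cooperative load, then interpolates with ordinary arithmetic:

    #define BS 16                          /* N: thread block is BS x BS */

    __global__ void backproject_shmem(float *slice, int n,
                                      const float *sino, int nbins, int nproj,
                                      const float2 *trig)
    {
        /* per projection, a BS x BS block touches < 1.5*BS bins */
        __shared__ float bins[3 * BS / 2 + 2];

        int x = blockIdx.x * BS + threadIdx.x;
        int y = blockIdx.y * BS + threadIdx.y;
        float sum = 0.0f;

        for (int p = 0; p < nproj; p++) {
            float c = trig[p].x, s = trig[p].y;      /* cos(α), sin(α) */

            /* min of v over the block is attained at one of the 4 corners */
            float x0 = blockIdx.x * BS, x1 = x0 + BS - 1;
            float y0 = blockIdx.y * BS, y1 = y0 + BS - 1;
            float base = floorf(fminf(fminf(x0 * c - y0 * s, x0 * c - y1 * s),
                                      fminf(x1 * c - y0 * s, x1 * c - y1 * s)));

            /* stage the bin window cooperatively into shared memory */
            int tid = threadIdx.y * BS + threadIdx.x;
            for (int i = tid; i < 3 * BS / 2 + 2; i += BS * BS) {
                int b = (int)base + i;
                bins[i] = (b >= 0 && b < nbins) ? sino[p * nbins + b] : 0.0f;
            }
            __syncthreads();

            /* linear interpolation in the computation units */
            float v = x * c - y * s - base;
            int   i = (int)v;
            float w = v - i;
            sum += (1.0f - w) * bins[i] + w * bins[i + 1];
            __syncthreads();
        }
        if (x < n && y < n)
            slice[y * n + x] = sum;
    }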
Performance of Back Projection Step
[Chart: back-projection performance in giga-interpolations per second (equivalently, texture fill rate in GT/s, 0-180) for HD7970, GTX680, GTX580, HD5970, GTX280, and E5-2640 with linear and oversampled interpolation; utilization relative to the peak texture fill rate ranges from 39% to 195%.]
Adding more GPU devices
[Chart: initialization time (seconds, 5-25) vs. number of GTX590 cores used (1-8).]

NVIDIA:
• Maximum 8-9 GPU cores (not cards) per system; the system will not turn on otherwise.
• The LAN option ROM has to be turned off in the BIOS.
• The PCIe slots where the storage adapters are inserted have to be disabled.
• ASTRA Lab reported running 13 GPU cores with a modified BIOS.
• To run more than 5 GPUs, the NVIDIA driver has to be forced to use MSI interrupts (e.g. via the NVreg_EnableMSI module parameter); crashes will occur otherwise.

AMD:
• 4 single-core GPU cards work fine, no configuration modifications required.
• Dual-core cards work in single-core mode only.
Data Streaming
[Charts: streaming read and write throughput (MB/s, up to ~4000) at the start and at the end of the partition for XFS, Ext4, Ext2, JFS, Btrfs, and Reiser3. OpenSuSE 12.1, kernel 3.3.1; 32 disks on 2 RAID controllers, RAID-60.]
The file system used matters, and it should be adapted to the RAID configuration (stripe size, read-ahead). Unless a really big number of disks is used, the start of the partition will be faster than the end. fallocate may significantly improve performance; the allocation unit may be increased at FS creation/mount time (XFS supports allocation sizes up to 1 GB). Ext4 does not yet support partitions over 16 TB. The real-time feature of XFS is unstable; data loss is likely.
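A minimal sketch of preallocating a large file before streaming into it (plain POSIX/Linux; the path and size are illustrative):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/raid/scan0001.raw", O_WRONLY | O_CREAT, 0644);

        /* reserve 120 GB up front so the FS can allocate large contiguous
           extents instead of growing the file write by write */
        posix_fallocate(fd, 0, 120LL * 1024 * 1024 * 1024);

        /* ... stream camera data into the preallocated file ... */
        close(fd);
        return 0;
    }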
Building a server for real-time imaging
• Too much external hardware is required:
  – high-speed network
  – storage system (preferably with a separate SSD cache)
  – high-speed frame grabber for the camera
  – normally only 4-6 high-speed PCIe slots per server, leaving space for 1-2 GPUs only
• System cooling is complicated: GPUs, HDDs, and SSDs all produce a lot of heat.
• Extensibility: there is no space to add more storage or computing power.
• Tomography is the focus: mainly Filtered Back Projection is used, and data transfer composes only ~10% of the task and may be completely hidden behind computations.
GPUDirect and Frame Grabbing
[Chart: memcpy bandwidth (GB/s, up to ~35) using 1 core vs. all cores on Core i7-950, Core i7-980X, Core i7-3820, and 2 x E5-2640.]
And we get ~1 GB/s from the camera; with 3 memcpy operations it is already on the border.
High-speed Programmable Camera
[Diagram: first prototype. Vacuum or N2 enclosure, optical body and lens, CMOS pixel sensor, daughter card, readout mother card "made in house" with a Virtex-6 FPGA, flex electrical cable (max f = 183 MHz), heat exchanger with Peltier cells, readout link.]

High-speed CMOS sensor: 1 Mpix, 5000 fps, 10 bits. Self-triggering and data compression, on-line processing and control, full programmability, direct connection to an InfiniBand cluster.
Infiniband: Connection and Protocols
[Charts: Mellanox ConnectX-3 VPI. Link throughput (GB/s, up to ~6) for 2 x QDR direct (gen3), QDR direct (gen3), QDR switched (gen3), QDR switched (gen2), and QDR optical (gen2); protocol throughput (GB/s, up to ~5) and latency across one switch for RDMA, MVAPICH, SDP, and TCPoIB.]
SDP has been declared obsolete by the OpenFabrics Alliance, but we have patches for the latest kernels.