Ultrafast X-Ray Imaging of Scientific Processes
Agenda
• Synchrotron Tomography
• UFO Project
• Parallel Architectures
• Hardware-aware Computing
• UFO Reconstruction Platform
• Performance
Authors / Collaboration:
Suren A. Chilingaryan (KIT), Michele Caselle (KIT), Thomas van de Kamp (KIT), Andreas Kopmann (KIT), Alessandro Mirone (ESRF), Uros Stevanovic (KIT), Tomy dos Santos Rolo (KIT), Matthias Vogelgesang (KIT)
Synchrotron
Electrons are accelerated in the linear accelerator and booster and injected into the synchrotron storage ring in bunches.
Bending magnets curve the electron trajectories and force the electrons to emit a fan-shaped beam of X-ray radiation.
ANKA synchrotron at KIT
Bending Magnets
ANKA Synchrotron
2.5 GeV storage ring, 16 beamlines:
Radiography, topography, microdiffraction imaging, tomography, laminography, reflectivity, spectroscopy, microscopy, lithography.
Characterization of nano- and micro-structures; life sciences.
Tomography Beam-line at ANKA
Beam path: storage ring → bending magnet → DMM monochromator → attenuator → slits → Be window → slits → experiment (sample → scintillator → CCD detector).
The rotating sample in front of a pixel detector is penetrated by X-rays. The absorption at different angles is registered by the camera, and a 3D map of the sample density is reconstructed.
Scintillators are used to convert the X-rays to visible light, which allows the use of standard CCD cameras.
X-ray Microtomography
Heads of a newt larva showing bone formation and muscle insertions (top) and a stick insect (bottom).
Left: Volume rendering of a tomographic stack of Endler’s guppy (Poecilia wingei). Right: isolated gonopodium (modified anal fin).
Synchrotron Tomography
The sample is rotated at constant speed while the pixel detector registers a series of parallel 2D projections of the sample density at angles θ = 0 … 180°.
Reconstruction
[Image series: back projection with an increasing number of projections, n = 1, 2, 3, 10, 50, 1000.]
Tomographic reconstruction is used to produce 3D images from a set of two-dimensional projections. Vertical slices are processed independently. For each slice, all projections are smeared back onto the cross-section along the direction of incidence, yielding an integrated image.
Filtered Back Projection
For each pixel (x, y) of the slice and each projection at angle α:
1. Compute the detector coordinate x●cos(α) - y●sin(α).
2. Interpolate between the neighboring detector bins.
3. Sum the interpolated values over all projections.
4. The sum is the reconstructed value at (x, y).
For each texel of the output volume and for each projection we perform a single linear interpolation.
The reconstruction consists of two steps:
1. Filtering: multiplication with the configured filter in Fourier space.
2. Back projection.
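To make the procedure concrete, here is a minimal CUDA sketch of the back-projection step (an illustration with hypothetical names, not the production kernel, which uses the texture engine and the per-architecture tuning discussed later):

    __global__ void backproject(float *slice, int n,
                                const float *sino, int nbins, int nproj,
                                const float *cos_a, const float *sin_a)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= n || y >= n) return;

        float cx = x - n / 2.0f, cy = y - n / 2.0f;  /* center of rotation */
        float sum = 0.0f;
        for (int p = 0; p < nproj; p++) {
            /* step 1: detector coordinate of (x, y) under projection p */
            float v = cx * cos_a[p] - cy * sin_a[p] + nbins / 2.0f;
            int   i = (int)v;
            float w = v - i;
            if (i >= 0 && i + 1 < nbins)     /* step 2: linear interpolation */
                sum += (1.0f - w) * sino[p * nbins + i]
                     +         w  * sino[p * nbins + i + 1];
        }
        slice[y * n + x] = sum;              /* steps 3-4: sum over projections */
    }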
Reconstruction Problem
PCO.edge camera: resolution 2560 x 2160, dynamic range 16 bit, frame rate 100 fps.
Tomographic reconstruction: 3D image of 2000³ voxels from 2000 projections; acquisition time 20 seconds.
FBP complexity: 144 Tflops (roughly 2000³ voxels x 2000 projections at a few flops per update). Xeon performance: ~100 Gflops. Minimum time: ~15 minutes on a dual-processor machine; in practice ~1 hour.
20 seconds of acquisition, 1 hour of reconstruction.
Ultra Fast X-ray Imaging of Scientific Processes with On-Line Assessment and Data-Driven Process Control
UFO data flow: ANKA beam line → optics and sample manipulators → smart high-speed camera → online monitoring and evaluation → offline storage.
Goals:
• High-speed tomography: increase sample throughput, enable tomography of temporal processes, allow interactive quality assessment.
• Data-driven control: auto-tuning of the optical system, tracking dynamic processes, finding areas of interest.
Example: 4D cine-tomography experiment
In vivo X-ray 4D cine-tomography experiment. (a) Photograph of Sitophilus granarius, dorsal view. (b) Experimental set-up for ultra-fast X-ray microtomography showing the bending magnet (1), rotation stage (2), fixed specimen (3), and detector system (4). (c) Radiographic projection. (d) 3D rendering of the reconstructed volume with the thorax cut open, revealing the hip joints (arrows). (e) In vivo cine-tomographic sequence of the moving weevil.
Parallel Architectures
[Chart: peak GFlops (log scale) of Xeon vs. Tesla, 2007-2012. Single precision grows to 3950 GFlops for Tesla vs. 192 for Xeon; double precision to 1170 vs. 96.]
                 E7-8870   Xeon Phi   Tesla K20X   GeForce Titan   HD7970   Power7+
SP (Gflops)      192       2020       3950         4500            3790     265
DP (Gflops)      96        1010       1310         1300            950      132
Memory (GB/s)    34.11     320        250          288             264      68
Texture (GT/s)   –         –          180          187.5           118.4    –
Max devices      8         ?          8            8               ≥4       32
Price            $4,800    $2,800     $3,200       $1,000          $400     $$$$$$$$
Efficiency
[Charts: peak vs. measured performance of GTX Titan, GTX 680, GTX 580, HD7970, and Core i7-980X for matrix multiplication (GFlops, up to ~5000), 1D fast Fourier transform (GFlops, up to ~500), and memory bandwidth (GB/s, up to ~300). The measured fraction of peak ranges roughly from 63-98% for memory bandwidth and 43-72% for matrix multiplication down to 3-52% for the FFT. GTX Titan performance is taken from AnandTech.]
GPU programming considerations
• Special programming tools and techniques are required
  – Multiple different and ever-changing architectures
  – Branching and many other operations are very expensive
  – Optimized mathematical libraries are sometimes missing
• Limited amount of memory and expensive data transfers
  – x16 PCIe gen2 (8 GB/s), gen3 (16 GB/s)
  – Specially allocated (pinned) memory is required for full performance and to overlap computations with data transfers
• Reduced caches, low memory-to-computation ratio, strict access patterns
  – 177 GB/s per Tflop for a Xeon, only 60-70 GB/s per Tflop for GPUs
  – Varying cache hierarchies on different architectures
  – Special access patterns are required for good performance. For instance, the bandwidth of a matrix transpose on a GTX280 (142 GB/s memory bandwidth) is 2 GB/s for the naive approach, 17 GB/s if shared memory is used, and 80 GB/s if care is taken with shared-memory banks and global-memory partitions (see the transpose sketch after this list)
• I/O problem
  – HDDs provide ~100 MB/s of sequential writing while the camera produces ~1 GB/s
  – Handling big data sets that do not fit in memory (up to 500 GB)
• Various problems with a growing number of GPUs connected to one system
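For illustration, a minimal CUDA transpose sketch (hypothetical names; block dimensions assumed to be TILE x TILE) showing shared-memory tiling for coalesced access and the padding that avoids shared-memory bank conflicts:

    #define TILE 32

    __global__ void transpose(float *out, const float *in, int width, int height)
    {
        /* the +1 padding avoids shared-memory bank conflicts */
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];   /* coalesced read */

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;   /* transposed block origin */
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y]; /* coalesced write */
    }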
NUMA Architecture and Data Transfers
Bandwidth: QPI 14.4 GB/s; DDR3 10.6 GB/s (DDR3-1333); PCIe 16 GB/s (gen3 x16).
[Diagram: dual-socket Xeon E5-2640 NUMA topology.]
The QPI bus alone is not even enough to feed both GPU cards.
[Charts: host-to-device and device-to-host bandwidth. GTX590 (gen2, up to ~8 GB/s): non-NUMA vs. NUMA-aware vs. pinned allocation. AMD HD7970 (gen3, up to ~12 GB/s): pageable vs. pinned memory.]
Pinned-memory support is limited in OpenCL.
NVIDIA does not support gen3 mode on X79 boards (a workaround exists for Windows, but not for Linux).
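For reference, the pinned-memory pattern in CUDA (a minimal sketch; buffer names and sizes are hypothetical). Pageable malloc() memory caps PCIe throughput; pinned buffers enable full bandwidth and asynchronous copies that overlap with kernel execution:

    float *host_buf, *dev_buf;
    cudaHostAlloc((void **)&host_buf, size, cudaHostAllocDefault);  /* pinned */
    cudaMalloc((void **)&dev_buf, size);

    cudaStream_t s;
    cudaStreamCreate(&s);

    /* returns immediately; the copy overlaps with work in other streams */
    cudaMemcpyAsync(dev_buf, host_buf, size, cudaMemcpyHostToDevice, s);
    /* ... launch kernels in stream s while the next buffer is prepared ... */
    cudaStreamSynchronize(s);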
Image Processing Framework
Pipeline: acquisition → flat-field correction → noise reduction (preprocessing, executed on CPUs) → sinogram generation → FFT → filter → iFFT → back projection (reconstruction, executed on GPUs) → segmentation/meshing → storage, with storage of the raw data in parallel.
Open source: http://ufo.kit.edu
Features:
➢ Easy algorithm exchange
➢ Camera abstraction
➢ Pipelined processing
➢ GLib/GObject, scripting-language support through introspection
➢ OpenCL + automated management of OpenCL buffers
Filters: tomography & laminography (FBP, DFI, algebraic methods), non-local means for noise reduction, optical flow.
[Pie chart: back projection takes ~90% of the time; filtering and PCIe transfers make up the rest.]

Filtered Back Projection pipeline: an image loader reads data into a pool of sinograms in host memory. A pool of CPU and GPU processing threads fetches slices for processing; each GPU thread runs a two-stage pipeline (filtering, then back projection of the W x H slice through the texture engine) and uses double buffering on both ends so that the PCIe data transfers overlap with computation. Results are stored into a pool of vertical slices in host memory and from there written to data storage.
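A condensed CUDA sketch of the double-buffering scheme (hypothetical names; sino_bytes, grid, block, load_sinogram, and filter_and_backproject are placeholders), which hides the PCIe transfers behind kernel execution:

    cudaStream_t stream[2];
    float *host_sino[2], *dev_sino[2];    /* two pinned/device buffer pairs */
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&stream[i]);
        cudaHostAlloc((void **)&host_sino[i], sino_bytes, cudaHostAllocDefault);
        cudaMalloc((void **)&dev_sino[i], sino_bytes);
    }

    for (int s = 0; s < num_slices; s++) {
        int b = s % 2;                     /* ping-pong between the buffers  */
        cudaStreamSynchronize(stream[b]);  /* wait until buffer b is free    */
        load_sinogram(host_sino[b], s);    /* CPU-side fetch of next slice   */
        cudaMemcpyAsync(dev_sino[b], host_sino[b], sino_bytes,
                        cudaMemcpyHostToDevice, stream[b]);
        filter_and_backproject<<<grid, block, 0, stream[b]>>>(dev_sino[b]);
        /* while stream b computes, the next iteration already fills 1 - b */
    }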
Tuning for hardware architectures
GT200 – base version: uses the texture engine.
Fermi – high computational power but a slow texture unit: reduce the load on the texture engine by using shared memory to cache the fetched data and then performing the linear interpolation in the computation units.
Kepler – low bandwidth of integer instructions but a high register count: use the texture engine, but process 16 projections at once and 16 points per thread to improve the cache hit rate.
GCN – high performance of both the texture engine and the computation units: balance the load between them to reach the highest performance.
VLIW – executes 5 independent operations per thread: compute 16 points per thread to provide a sufficient flow of independent instructions to the VLIW engine.
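A simplified CUDA sketch of the "several points per thread" idea used on Kepler and VLIW (illustrative only: 4 points per thread instead of 16, hypothetical names, detector-center offset and texture border handling omitted):

    __global__ void backproject_multi(float *slice, int n,
                                      cudaTextureObject_t sino,
                                      int nproj, const float2 *trig)
    {
        const int PTS = 4;              /* points reconstructed per thread */
        int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * PTS;
        int y  =  blockIdx.y * blockDim.y + threadIdx.y;
        if (x0 + PTS > n || y >= n) return;

        float sum[PTS] = {0.0f};
        for (int p = 0; p < nproj; p++) {
            float c = trig[p].x, s = trig[p].y;         /* cos(α), sin(α) */
            float v = (x0 - n / 2.0f) * c - (y - n / 2.0f) * s + 0.5f;
            /* neighboring points differ by c in detector coordinate, so  */
            /* the fetches are independent and keep the pipeline full     */
            for (int k = 0; k < PTS; k++)
                sum[k] += tex2D<float>(sino, v + k * c, p + 0.5f);
        }
        for (int k = 0; k < PTS; k++)
            slice[y * n + x0 + k] = sum[k];
    }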
[Chart: back-projection performance in giga-interpolations per second (up to ~180) for E5-2640, GTX280, HD5970, GTX580, GTX680, and HD7970; the architecture-specific tuning yields gains of +75% to +530% over the base version.]
Handling large data sets
Random-access throughput (MB/s, log scale):
A7K2000 (single HDD)     28.71
RAID of 12 drives        29.74
Intel X25-E              140
2 x Crucial C300         417.39
4 x Crucial C300         965.21
8 x Samsung 840 Pro      1575.85
RAM disk                 2859.88
Using SSDs may significantly increase random-access performance for data sets that do not fit in memory completely. Big arrays of magnetic hard drives will not help unless multiple readers are involved.
FBP Reconstruction Pipeline
Moderate-size data sets are stored in memory; huge data sets are cached on the SSD RAID.

4-stage pipeline (I/O + computations), sketched below:
1. Reading data from the fast SSD RAID-0 (random reads are effective).
2. Scheduling and preprocessing using SIMD instructions of the x86 CPUs.
3. Reconstructing on the GPUs.
4. Storing to a RAID of magnetic disks (sequential writes are effective).
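A minimal sketch of the stage-to-stage hand-off that such a pipeline needs (plain C threads with a bounded queue; names and depth are hypothetical). Each stage runs in its own thread(s) and blocks only when a neighbor falls behind, so SSD reads, CPU preprocessing, GPU reconstruction, and disk writes proceed concurrently:

    #include <pthread.h>

    #define QDEPTH 4
    typedef struct {
        void *slot[QDEPTH];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } queue_t;

    /* producer side: blocks while the consumer stage is QDEPTH items behind */
    void queue_push(queue_t *q, void *item)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QDEPTH)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->slot[q->tail] = item;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* consumer side: blocks while the producer stage has nothing ready */
    void *queue_pop(queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        void *item = q->slot[q->head];
        q->head = (q->head + 1) % QDEPTH;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return item;
    }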
High performance data storage
In the default Linux data flow, data passes through the buffer cache on its way to storage; the buffer cache significantly limits the maximum write performance. Kernel AIO may be used to program the I/O scheduler to issue read requests without delays. We optimize I/O for maximum streaming performance with a single data source/receiver.
[Chart: read and write throughput (MB/s, up to ~4500) for buffered, direct, and AIO I/O.]
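A minimal sketch of bypassing the buffer cache with O_DIRECT (plain POSIX; the file name and chunk size are illustrative, and buffers and transfer sizes must be aligned to the device block size):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    enum { ALIGN = 4096, CHUNK = 64 * 1024 * 1024 };   /* 64 MB aligned chunks */

    int main(void)
    {
        int fd = open("/mnt/ssd-raid/stream.raw",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        void *buf;
        posix_memalign(&buf, ALIGN, CHUNK);   /* O_DIRECT requires alignment */

        /* fill buf with camera data, then write bypassing the buffer cache */
        ssize_t n = write(fd, buf, CHUNK);
        (void)n;
        close(fd);
        free(buf);
        return 0;
    }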
UFO Computing Infrastructure
The server connects to the cameras (PCO.edge, PCO.dimax, PCO.4000) via a CameraLink frame grabber (850 MB/s), to an external GPU box via PCIe x16 (8 GB/s), to SAS-attached storage via SFF-8088 (2.4 GB/s), and to the LSDF (Large Scale Data Facility) via 10 Gb/s Ethernet.

SuperMicro 7046GT-TRF (dual Intel 5520 chipset)
CPU: 2 x Xeon X5650 (12 cores total at 2.66 GHz)
GPUs: 4 x GTX590, external
Memory: 96 GB in 12 DDR3 slots (192 GB max)
Network: Intel 82598EB (10 Gb/s)
Camera Link frame grabber (850 MB/s)
Storage: Areca ARC-1880-ix-12 SAS RAID; 16 x Hitachi A7K2000 (RAID-6); 8 x Samsung 840 Pro 512 (RAID-0)

[Charts: internal vs. external GPU performance (GT/s, up to ~600); camera-storage sequential write with 16 vs. 32 disks (MB/s, up to ~2000); SSD RAID read/write (MB/s, up to ~4000).]
PCIe Extension Box
External GPU enclosure by One Stop Systems: 1 x PCIe x16 2.0 host connection, 4 x GTX590 (8 GPU cores).
[Charts, with the external box configuration: per-slice transfer time (ms, up to ~9) for NUMA vs. non-NUMA placement; compute time (ms per 8 slices) for each of the 8 GPU cores; speed-up vs. number of GPUs (1-8), scaling near-linearly to ~7-7.8x at 8 GPUs for both internal and external placement; and the 8-GPU speed-up compared between NUMA and non-NUMA allocation.]
Performance of Filtered Back Projection
[Charts: reconstruction throughput (MB/s, up to ~1200) for small (4 GB), standard (11 GB), big (120 GB), and huge (300 GB) data sets, reading from memory and from SSD. On the 11 GB data set the GPU pipeline reaches 1016.26 MB/s versus 17.93 MB/s for the CPU (log scale).]
Scaling up to Cluster
Camera → readout PC (CUDA + GPUDirect, SSD cache, up to 8 GB/s) → master node with the task scheduler (lots of memory) → OpenCL compute nodes → storage nodes → distributed storage in the LSDF (Large Scale Data Facility). Transfer rates drop along the chain: 4 GB/s, < 2 GB/s, 0.25 GB/s. The nodes are connected by optical InfiniBand through an IB router; a real-time control loop closes at the readout PC and an outer control loop runs through the cluster.
Storage Protocols
[Chart: write and read throughput (MB/s, up to ~4500) for the raw device, iSER, XFS local and over iSER, OCFS2 local and over iSER, FhGFS over XFS, and Gluster over XFS.]
Network file systems (NFS, Samba, SSHFS): slow.
Cluster file systems (Lustre with a patched kernel, Gluster, FhGFS (closed-source)): slow if only a few nodes are used.
Network block devices: iSCSI is slow, iSER is fast; OCFS2 can be run on top.
UFO Storage Subsystem
Two storage nodes, each a RAID-6 of 16 Hitachi 7K3000 drives (28 TB), export their capacity as a 20 TB and an 8 TB volume. The two 8 TB volumes are attached to the camera PC over iSER and combined into a 16 TB XFS soft-RAID level 0, giving high-speed streaming storage (read: 2.3 GB/s, write: 2.5 GB/s). The 20 TB volumes are served to the compute nodes via GlusterFS, and clients access the data over NFS.
Summary
• We are in the age of parallel architectures.
• Getting good performance is rather easy; getting ultimate performance while managing multi-gigabyte streams of data is complicated and needs care on multiple levels. The hardware should be carefully selected according to the planned tasks and data rates, and the software should be tuned to the selected hardware.
• Streams of about 500 MB/s may be processed with a single reconstruction station; a cluster is required to handle more data in near real-time.
• A hybrid CUDA/OpenCL system is probably the best approach.
• The UFO Parallel Processing Framework is provided to help you with some of these difficulties and will be presented in the next talk.
Extra Slides
Parallel Programming Environments
• CUDA – the oldest GPU programming technology, from NVIDIA.
• OpenCL – an open-standard technology close to CUDA, but working with a wide range of hardware platforms including CPUs and GPUs.
• OpenACC – a declarative technology similar to OpenMP.
• MATLAB and other mathematical packages with integrated GPU support – only some operations are parallelized, and the need to transfer data over the slow PCIe bus to execute the non-parallelized operations kills the performance.

CUDA
• Supports the latest NVIDIA technologies:
  – GPUDirect – direct transfers between the GPU and InfiniBand adapters, etc.; integration with MPI frameworks.
  – Dynamic parallelism – GPUs are able to spawn new jobs.
• NVIDIA provides a set of highly optimized libraries (BLAS, FFT, LAPACK, reduction, etc.) – see the filtering sketch after this list.
• Only NVIDIA GPUs are supported.
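As an example of the library support, a minimal cuFFT-based sketch of the FBP filtering step (multiplication with a filter in Fourier space; function names, buffer layout, and the precomputed filter are illustrative):

    #include <cufft.h>

    /* hypothetical kernel: multiply each row's spectrum with a 1D filter */
    __global__ void apply_filter(cufftComplex *f, const float *filt,
                                 int nbins_c, int total)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < total) {
            float w = filt[i % nbins_c];   /* same filter for every row */
            f[i].x *= w;
            f[i].y *= w;
        }
    }

    void filter_sinogram(float *dev_rows, float *dev_filter,
                         cufftComplex *dev_spectrum, int nbins, int nrows)
    {
        cufftHandle fwd, inv;
        cufftPlan1d(&fwd, nbins, CUFFT_R2C, nrows);   /* batched 1D FFTs */
        cufftPlan1d(&inv, nbins, CUFFT_C2R, nrows);

        cufftExecR2C(fwd, dev_rows, dev_spectrum);
        int nbins_c = nbins / 2 + 1;
        int total   = nbins_c * nrows;
        apply_filter<<<(total + 255) / 256, 256>>>(dev_spectrum, dev_filter,
                                                   nbins_c, total);
        cufftExecC2R(inv, dev_spectrum, dev_rows);    /* unnormalized inverse */

        cufftDestroy(fwd);
        cufftDestroy(inv);
    }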
Programming Environments
OpenCL
• Syntax is very similar to CUDA (easy porting).
• Well-written code is as fast as CUDA.
• Works with CPUs and GPUs from multiple vendors (Intel, AMD, IBM, NVIDIA).
• There is still no way to run code simultaneously on an NVIDIA GPU and the CPU, but it is possible with AMD cards.
• Many libraries exist, but they are generally slightly slower than their CUDA counterparts, and some are only available commercially.
• No GPUDirect and significantly limited options to use pinned memory (i.e. slower data transfers).

OpenACC
• Existing applications may be easily parallelized, and developing new code is easy compared to OpenCL/CUDA.
• No free compilers exist at the moment, though there is a similar technology, OmpSs, developed at the Barcelona Supercomputing Center.
• At the current level the technology does not support shared memory and some other features available with direct programming (i.e. it is slower).
[Chart: cuFFT vs. oclFFT throughput (GFlops, 0-300).]
Memory: Space vs. Speed
[Diagram: CPU with three memory channels and up to three DIMMs per channel (DPC = DIMMs per channel).]

Bandwidth vs. DIMMs per channel:
DPC        Xeon X5500   Xeon E5-2600
1 DIMM     10.6 GB/s    12.8 GB/s
2 DIMMs    8.5 GB/s     12.8 GB/s
3 DIMMs    6.4 GB/s     8.5 GB/s
Fermi Optimizations
GTX280 vs. GTX580: computational units 240 cores at 1.3 GHz vs. 512 cores at 1.55 GHz (2.5x speedup); texture engine 48 GT/s vs. 49.4 GT/s (1.0x, no speedup).

Standard version: the texture engine is heavily loaded, performing N²●P texture fetches (an N x N image, P projections).

Fermi-optimized version: both the texture and computation engines are used. A block of threads actually accesses only 3●N/2 bins per projection (for an N x N thread block, e.g. 16 x 16 pixels), so these bins are fetched once through the texture engine into shared memory ((3/2)●N●P texture fetches) and the N² linear interpolations are performed in the computation units.
v = x●cos(α) - y●sin(α)
max_{x,y<N}(v) - min_{x,y<N}(v) < N√2, and N√2 < 1.5 N
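A schematic CUDA fragment of the Fermi strategy (hypothetical names, simplified coordinates without the detector-center offset): the block first stages its bin window in shared memory through a cooperative load, then interpolates with ordinary arithmetic:

    #define BS 16                          /* N: thread block is BS x BS */

    __global__ void backproject_shmem(float *slice, int n,
                                      const float *sino, int nbins, int nproj,
                                      const float2 *trig)
    {
        /* per projection, a BS x BS block touches < 1.5*BS bins */
        __shared__ float bins[3 * BS / 2 + 2];

        int x = blockIdx.x * BS + threadIdx.x;
        int y = blockIdx.y * BS + threadIdx.y;
        float sum = 0.0f;

        for (int p = 0; p < nproj; p++) {
            float c = trig[p].x, s = trig[p].y;      /* cos(α), sin(α) */

            /* min of v over the block is attained at one of the 4 corners */
            float x0 = blockIdx.x * BS, x1 = x0 + BS - 1;
            float y0 = blockIdx.y * BS, y1 = y0 + BS - 1;
            float base = floorf(fminf(fminf(x0 * c - y0 * s, x0 * c - y1 * s),
                                      fminf(x1 * c - y0 * s, x1 * c - y1 * s)));

            /* stage the bin window cooperatively into shared memory */
            int tid = threadIdx.y * BS + threadIdx.x;
            for (int i = tid; i < 3 * BS / 2 + 2; i += BS * BS) {
                int b = (int)base + i;
                bins[i] = (b >= 0 && b < nbins) ? sino[p * nbins + b] : 0.0f;
            }
            __syncthreads();

            /* linear interpolation in the computation units */
            float v = x * c - y * s - base;
            int   i = (int)v;
            float w = v - i;
            sum += (1.0f - w) * bins[i] + w * bins[i + 1];
            __syncthreads();
        }
        if (x < n && y < n)
            slice[y * n + x] = sum;
    }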
Performance of Back Projection Step
[Chart: back-projection performance in giga-interpolations per second (equivalently, texture fill rate in GT/s, 0-180) for HD7970, GTX680, GTX580, HD5970, GTX280, and E5-2640 with linear and oversampled interpolation; utilization relative to the peak texture fill rate ranges from 39% to 195%.]
Adding more GPU devices
[Chart: initialization time (seconds, 5-25) vs. number of GTX590 cores used (1-8).]

NVIDIA:
• Maximum 8-9 GPU cores (not cards) per system; the system will not turn on otherwise.
• The LAN option ROM has to be turned off in the BIOS.
• The PCIe slots where the storage adapters are inserted have to be disabled.
• ASTRA Lab reported running 13 GPU cores with a modified BIOS.
• To run more than 5 GPUs, the NVIDIA driver has to be forced to use MSI interrupts (e.g. via the NVreg_EnableMSI module parameter); crashes will occur otherwise.

AMD:
• 4 single-core GPU cards work fine, no configuration modifications required.
• Dual-core cards work in single-core mode only.
Data Streaming
[Charts: streaming read and write throughput (MB/s, up to ~4000) at the start and at the end of the partition for XFS, Ext4, Ext2, JFS, Btrfs, and Reiser3. OpenSuSE 12.1, kernel 3.3.1; 32 disks on 2 RAID controllers, RAID-60.]
The file system used matters, and it should be adapted to the RAID configuration (stripe size, read-ahead). Unless a really big number of disks is used, the start of the partition will be faster than the end. fallocate may significantly improve performance; the allocation unit may be increased at FS creation/mount time (XFS supports allocation sizes up to 1 GB). Ext4 does not yet support partitions over 16 TB. The real-time feature of XFS is unstable; data loss is likely.
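A minimal sketch of preallocating a large file before streaming into it (plain POSIX/Linux; the path and size are illustrative):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/raid/scan0001.raw", O_WRONLY | O_CREAT, 0644);

        /* reserve 120 GB up front so the FS can allocate large contiguous
           extents instead of growing the file write by write */
        posix_fallocate(fd, 0, 120LL * 1024 * 1024 * 1024);

        /* ... stream camera data into the preallocated file ... */
        close(fd);
        return 0;
    }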
Building a server for real-time imaging
• Too much external hardware is required:
  – high-speed network
  – storage system (preferably with a separate SSD cache)
  – high-speed frame grabber for the camera
  – normally only 4-6 high-speed PCIe slots per server, leaving space for 1-2 GPUs only
• System cooling is complicated: GPUs, HDDs, and SSDs all produce a lot of heat.
• Extensibility: there is no space to add more storage or computing power.
• Tomography is the focus: mainly Filtered Back Projection is used, and data transfer composes only ~10% of the task and may be completely hidden behind computations.
GPUDirect and Frame Grabbing
[Chart: memcpy bandwidth (GB/s, up to ~35) using 1 core vs. all cores on Core i7-950, Core i7-980X, Core i7-3820, and 2 x E5-2640.]
And we get ~1 GB/s from the camera; with 3 memcpy operations it is already on the border.
High-speed Programmable Camera
[Diagram: first prototype. Vacuum or N2 enclosure, optical body and lens, CMOS pixel sensor, daughter card, readout mother card "made in house" with a Virtex-6 FPGA, flex electrical cable (max f = 183 MHz), heat exchanger with Peltier cells, readout link.]

High-speed CMOS sensor: 1 Mpix, 5000 fps, 10 bits. Self-triggering and data compression, on-line processing and control, full programmability, direct connection to an InfiniBand cluster.
Infiniband: Connection and Protocols
[Charts: Mellanox ConnectX-3 VPI. Link throughput (GB/s, up to ~6) for 2 x QDR direct (gen3), QDR direct (gen3), QDR switched (gen3), QDR switched (gen2), and QDR optical (gen2); protocol throughput (GB/s, up to ~5) and latency across one switch for RDMA, MVAPICH, SDP, and TCPoIB.]
SDP has been declared obsolete by the OpenFabrics Alliance, but we have patches for the latest kernels.