CUDA Image Registration 29 Oct 2008 Richard Ansorge
Medical Image Registration
A Quick Win
Richard Ansorge
The problem
• CT, MRI, PET and Ultrasound produce 3D volume images
• Typically 256 x 256 x 256 = 16,777,216 image voxels.
• Combining modalities (inter modality) gives extra information.
• Repeated imaging over time with the same modality, e.g. MRI (intra modality), is equally important.
• Have to spatially register the images.
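As a rough size check, assuming each voxel is held as a 4-byte float (which is how the kernel and host code later in these slides store the images), one volume of this size occupies:

$$256 \times 256 \times 256 \times 4\ \text{bytes} = 67{,}108{,}864\ \text{bytes} = 64\ \text{MiB}$$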
Example – brain lesion
CT MRI PET
PET-MR Fusion
The PET image shows metabolic activity.
This complements the MR structural information
Registration Algorithm
[Flowchart: Im A and Im B are the inputs. Transform Im B to match Im A, giving Im B′. Compute Cost Function from Im A and Im B′. Good fit? Yes: Done. No: Update transform parameters and transform Im B again.]
NB Cost function calculation dominates for 3D images and is inherently parallel
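The loop in the flowchart can be sketched on a toy problem. This is a minimal CPU illustration in C++, not the method used in the talk: the "images" are 1D arrays, the only transform parameter is an integer shift, and the update rule is a simple neighbour search. The names `toy_cost` and `toy_register` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical toy cost: sum of absolute differences between image A and
// image B read at a position shifted by `shift` (out-of-range reads give 0).
double toy_cost(const std::vector<float>& a, const std::vector<float>& b, int shift) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    double cost = 0.0;
    for (int i = 0; i < n; ++i) {
        const int j = i + shift;                        // transformed position in B
        const float bj = (j >= 0 && j < n) ? b[j] : 0.0f;
        cost += std::fabs(a[i] - bj);
    }
    return cost;
}

// The flowchart loop: transform, compute cost, test the fit, update parameters.
int toy_register(const std::vector<float>& a, const std::vector<float>& b, int max_iter) {
    int shift = 0;
    double best = toy_cost(a, b, shift);
    for (int it = 0; it < max_iter; ++it) {
        const double left = toy_cost(a, b, shift - 1);
        const double right = toy_cost(a, b, shift + 1);
        if (left >= best && right >= best) break;       // good fit: no neighbour is better
        if (left < right) { shift -= 1; best = left; }  // update transform parameter
        else              { shift += 1; best = right; }
    }
    return shift;
}
```

Every pass through the loop calls the cost function at least twice, which is why the cost function dominates the run time once the images are 3D volumes.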
Transformations
General affine transform has 12 parameters:
$$\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
Polynomial transformations can be useful for e.g. pin-cushion type distortions:
$$x' = a_{11}x + a_{12}y + a_{13}z + a_{14} + b_1 x^2 + b_2 xy + b_3 y^2 + b_4 z^2 + b_5 xz + b_6 yz + \cdots$$
$$y' = \cdots$$
$$z' = \cdots$$
Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular but very computationally demanding.
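For the affine case, applying the transform to one voxel position is three dot products. The sketch below is a CPU illustration in C++. The names `Point3`, `Affine` and `apply_affine` are hypothetical. The row-major layout is chosen to match the way the kernel later in these slides reads `c_aff`.

```cpp
#include <array>
#include <cassert>

struct Point3 { float x, y, z; };

// Row-major 4x4 matrix: entries 0..3 are the first row, 4..7 the second, etc.
using Affine = std::array<float, 16>;

Point3 apply_affine(const Affine& a, const Point3& p) {
    // The bottom row must be (0, 0, 0, 1), so only 12 entries carry parameters.
    assert(a[12] == 0.0f && a[13] == 0.0f && a[14] == 0.0f && a[15] == 1.0f);
    return Point3{
        a[0] * p.x + a[1] * p.y + a[2]  * p.z + a[3],
        a[4] * p.x + a[5] * p.y + a[6]  * p.z + a[7],
        a[8] * p.x + a[9] * p.y + a[10] * p.z + a[11]};
}
```

The fourth column (entries 3, 7 and 11) holds the translation; the 3x3 block holds rotation, scale and shear combined.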
We tried this before
6 Parameter Rigid Registration - done 8 years ago
[Figure: time (secs, 0 to 1200) and speedup factor (0 to 64) against number of processors (0 to 64). Series: SR2201, PC 333MHz, Speedup, perfect scaling.]
Now - Desktop PC - Windows XP
Needs 400 W power supply
Free Software: CUDA & Visual C++ Express
Visual C++ SDK in action
Architecture
9600 GT Device Query
Current GTX 280 has 240 cores!
Matrix Multiply from SDK
NB using 4-byte floats
Matrix Multiply (from SDK)
GPU v CPU for NxN Matrix Multiply
[Figure: GPU speedup (axis 0 to 400) against N (0 to 6144). Series: average speedup.]
Matrix Multiply (from SDK)
GPU v CPU for NxN Matrix Multiply
[Figure: GPU speedup (axis 0 to 800) against N (0 to 6144). Series: speedup, average speedup.]
Matrix Multiply (from SDK)
GPU v CPU for NxN Matrix Multiply
[Figure: GPU speedup (left axis, 0 to 800) and speed in mads/ns or mads/100 ns (right axis, 0 to 40) against N (0 to 6144). Series: speedup, CPU mads/100 ns, GPU mads/ns.]
Image Registration
CUDA Code
#include <cutil_math.h>

texture<float, 3, cudaReadModeElementType> tex1; // Target Image in texture
__constant__ float c_aff[16];                    // 4x4 Affine transform

// Function arguments are image dimensions and pointers to output buffer b
// and Source Image s. These buffers are in device memory
__global__ void d_costfun(int nx,int ny,int nz,float *b,float *s)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x; // Thread ID matches
    int iy = blockIdx.y*blockDim.y + threadIdx.y; // Source Image x-y
    float x = (float)ix;
    float y = (float)iy;
    float z = 0.0f;                               // start with slice zero
    float4 v = make_float4(x,y,z,1.0f);
    float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
    float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
    float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
    float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]); // 0,0,0,1?
    float tx = dot(r0,v);                         // Matrix Multiply using dot products
    float ty = dot(r1,v);
    float tz = dot(r2,v);
    float source = 0.0f;
    float target = 0.0f;
    float cost = 0.0f;
    uint is = iy*nx+ix;
    uint istep = nx*ny;
    for(int iz=0;iz<nz;iz++) {                    // process all z's in same thread here
        source = s[is];
        target = tex3D(tex1, tx, ty, tz);
        is += istep;
        v.z += 1.0f;
        tx = dot(r0,v);
        ty = dot(r1,v);
        tz = dot(r2,v);
        cost += fabs(source-target);              // other costfuns here as required
    }
    b[iy*nx+ix]=cost;                             // store thread sum for host
}
texture<float, 3, cudaReadModeElementType> tex1;
__constant__ float c_aff[16];
tex1: moving image, stored as 3D texture
c_aff: affine transformation matrix, stored as constants
// device function declaration
__global__ void d_costfun(int nx,int ny,int nz,float *b,float *s)
nx, ny & nz: image dimensions (assumed the same for both images)
b: output array for partial sums
s: reference image (mislabelled in code)
int ix = blockIdx.x*blockDim.x + threadIdx.x; // Thread ID matches
int iy = blockIdx.y*blockDim.y + threadIdx.y; // Source Image x-y
float x = (float)ix;
float y = (float)iy;
float z = 0.0f;                               // start with slice zero
Which thread am I? (Similar to MPI.) However, there is one thread for each x-y pixel: 240x256 = 61440 threads (cf. ~128 nodes for MPI).
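The index arithmetic can be checked on the CPU. In this C++ sketch `Dim2` is a hypothetical stand-in for CUDA's built-in index types, and the parameter names simply mirror the CUDA built-ins. The 16x16 block size is the one set in the host code later in these slides.

```cpp
#include <cassert>

struct Dim2 { int x, y; };   // hypothetical stand-in for CUDA's built-in index types

// Same arithmetic as the kernel: global pixel index from block and thread indices.
Dim2 global_index(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    assert(blockDim.x > 0 && blockDim.y > 0);
    return Dim2{blockIdx.x * blockDim.x + threadIdx.x,
                blockIdx.y * blockDim.y + threadIdx.y};
}

// Total number of threads launched for a grid of blocks.
int total_threads(Dim2 gridDim, Dim2 blockDim) {
    return gridDim.x * gridDim.y * blockDim.x * blockDim.y;
}
```

For a 240x256 slice with 16x16 blocks the grid is 15x16 blocks, and the last thread of the last block lands on pixel (239, 255).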
float4 v = make_float4(x,y,z,1.0f);
float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]); // 0,0,0,1?
float tx = dot(r0,v); // Matrix Multiply using dot products
float ty = dot(r1,v);
float tz = dot(r2,v);
float source = 0.0f;
float target = 0.0f;
float cost = 0.0f;    // accumulates cost function contributions
v.z=0.0f;             // z of first slice is zero (redundant as done above)
uint is = iy*nx+ix;   // this is index of my voxel in first z-slice
uint istep = nx*ny;   // stride to index same voxel in subsequent slices
Initialisations and first matrix multiply.
“v” is a 4-vector holding the current voxel x,y,z address.
“tx,ty,tz” hold the corresponding transformed position.
for(int iz=0;iz<nz;iz++) {            // process all z's in same thread here
    source = s[is];
    target = tex3D(tex1, tx, ty, tz); // NB very FAST trilinear interpolation!!
    is += istep;
    v.z += 1.0f;                      // step to next z slice
    tx = dot(r0,v);
    ty = dot(r1,v);
    tz = dot(r2,v);
    cost += fabs(source-target);      // other costfuns here as required
}
b[iy*nx+ix]=cost;                     // store thread sum for host
Loop sums contributions for all z values at fixed x,y position. Each thread updates a different element of the 2D results array b.
[Figure: image volume with X, Y and Z axes.]
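A CPU reference for the same sum is handy when checking kernel output. The C++ sketch below is an approximation under stated assumptions, not a bit-exact model of `tex3D`: it places voxel centres at integer coordinates and clamps at the volume edges, whereas the texture unit uses its own coordinate offset, addressing mode and reduced-precision weights. `Volume`, `trilinear` and `cost_reference` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Volume {
    int nx, ny, nz;
    std::vector<float> v;                       // x fastest, then y, then z
    float at(int x, int y, int z) const {       // clamp-to-edge read (an assumption)
        x = std::min(std::max(x, 0), nx - 1);
        y = std::min(std::max(y, 0), ny - 1);
        z = std::min(std::max(z, 0), nz - 1);
        return v[(static_cast<std::size_t>(z) * ny + y) * nx + x];
    }
};

// Software trilinear interpolation with voxel centres at integer coordinates.
float trilinear(const Volume& vol, float x, float y, float z) {
    const float fx = std::floor(x), fy = std::floor(y), fz = std::floor(z);
    const int x0 = static_cast<int>(fx), y0 = static_cast<int>(fy), z0 = static_cast<int>(fz);
    const float wx = x - fx, wy = y - fy, wz = z - fz;
    auto mix = [](float a, float b, float w) { return a + w * (b - a); };
    const float c00 = mix(vol.at(x0, y0,     z0),     vol.at(x0 + 1, y0,     z0),     wx);
    const float c10 = mix(vol.at(x0, y0 + 1, z0),     vol.at(x0 + 1, y0 + 1, z0),     wx);
    const float c01 = mix(vol.at(x0, y0,     z0 + 1), vol.at(x0 + 1, y0,     z0 + 1), wx);
    const float c11 = mix(vol.at(x0, y0 + 1, z0 + 1), vol.at(x0 + 1, y0 + 1, z0 + 1), wx);
    return mix(mix(c00, c10, wy), mix(c01, c11, wy), wz);
}

// Sum of absolute differences over every voxel; aff is a row-major 4x4 matrix.
double cost_reference(const Volume& ref, const Volume& moving, const float aff[16]) {
    assert(ref.v.size() == static_cast<std::size_t>(ref.nx) * ref.ny * ref.nz);
    double total = 0.0;
    for (int iy = 0; iy < ref.ny; ++iy)
        for (int ix = 0; ix < ref.nx; ++ix) {
            double cost = 0.0;                  // the work of one GPU thread
            for (int iz = 0; iz < ref.nz; ++iz) {
                const float tx = aff[0] * ix + aff[1] * iy + aff[2]  * iz + aff[3];
                const float ty = aff[4] * ix + aff[5] * iy + aff[6]  * iz + aff[7];
                const float tz = aff[8] * ix + aff[9] * iy + aff[10] * iz + aff[11];
                cost += std::fabs(ref.at(ix, iy, iz) - trilinear(moving, tx, ty, tz));
            }
            total += cost;
        }
    return total;
}
```

Because of the convention differences listed above, agreement with the GPU result should be expected only to within a tolerance.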
Host Code Initialization Fragment
...
blockSize.x = blockSize.y = 16; // multiples of 16 a VERY good idea
gridSize.x = (w2+15) / blockSize.x;
gridSize.y = (h2+15) / blockSize.y;

// allocate working buffers, image is W2 x H2 x D2
cudaMalloc((void**)&dbuff,w2*h2*sizeof(float)); // passed as "b" to kernel
bufflen = w2*h2;
Array1D<float> shbuff = Array1D<float>(bufflen);
shbuff.Zero();
hbuff = shbuff.v;

cudaMalloc((void**)&dnewbuff,w2*h2*d2*sizeof(float)); // passed as "s" to kernel
cudaMemcpy(dnewbuff,vol2,w2*h2*d2*sizeof(float),cudaMemcpyHostToDevice);

e = make_float3((float)w2/2.0f,(float)h2/2.0f,(float)d2/2.0f); // fixed rotation origin
o = make_float3(0.0f);           // translations
r = make_float3(0.0f);           // rotations
s = make_float3(1.0f,1.0f,1.0f); // scale factors
t = make_float3(0.0f);           // tans of shears
...
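The `(w2+15) / blockSize.x` expression is a round-up division. The C++ sketch below (hypothetical helper names) shows it in general form and counts the threads that fall outside the image when the size is not a multiple of the block size. The kernel as shown has no bounds check, so such surplus threads would index outside the buffers; the 240x256 example images produce none.

```cpp
#include <cassert>

// Blocks needed to cover n pixels with `block` threads per block, rounding up.
// With block = 16 this is the slide's (n + 15) / 16.
int blocks_for(int n, int block) {
    assert(n > 0 && block > 0);
    return (n + block - 1) / block;
}

// Threads launched along one axis that fall outside the image.
int surplus_threads(int n, int block) {
    return blocks_for(n, block) * block - n;
}
```

For image sizes that leave a surplus, one option is an early return in the kernel when `ix >= nx` or `iy >= ny`; that guard is a suggestion and is not part of the code on these slides.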
Calling the Kernel
double nr_costfun(Array1D<double> &a)
{
    static Array2D<float> affine = Array2D<float>(4,4); // a holds current transformation
    double sum = 0.0;
    make_affine_from_a(nr_fit,affine,a);                // convert to 4x4 matrix of floats

    cudaMemcpyToSymbol(c_aff,affine.v[0],4*4*sizeof(float));     // load constant mem
    d_costfun<<<gridSize, blockSize>>>(w2,h2,d2,dbuff,dnewbuff); // run kernel
    CUT_CHECK_ERROR("kernel failed");                            // OK?
    cudaThreadSynchronize();                                     // make sure all done

    // copy partial sums from device to host
    cudaMemcpy(hbuff,dbuff,bufflen*sizeof(float),cudaMemcpyDeviceToHost);

    for(int iy=0;iy<h2;iy++) for(int ix=0;ix<w2;ix++) sum += hbuff[iy*w2+ix]; // final sum
    calls++;
    if(verbose>1){
        printf("call %d costfun %12.0f, a:",calls,sum);
        for(int i=0;i<a.sizex();i++)printf(" %f",a.v[i]);
        printf("\n");
    }
    return sum;
}
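The final sum on the host is accumulated in a `double` even though the partial sums are `float`. The C++ sketch below (hypothetical function name) isolates that step. A 4-byte float represents consecutive integers exactly only up to 2^24 = 16,777,216, and the total cost printed in the example run (127946102) is well above that, so a `float` accumulator could silently drop small contributions.

```cpp
#include <cassert>
#include <vector>

// Add up per-thread partial sums in double precision, as nr_costfun does.
double sum_partials(const std::vector<float>& hbuff) {
    double sum = 0.0;
    for (float p : hbuff) sum += p;   // each float is widened to double before adding
    return sum;
}
```

For example, adding 1.0 three times to 16777216.0 gives 16777219 in double, a value that a 4-byte float cannot represent.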
Example Run (240x256x176 images)
C:>airwc
airwc v2.5 Usage: AirWc <target> <source> <result> opts(12rtdgsf)

C:>airwc sb1 sb2 junk 1f
NIFTI Header on File sb1.nii
converting short to float 0 0.000000
NIFTI Header on File sb2.nii
converting short to float 0 0.000000

Using device 0: GeForce 9600 GT

Initial correlation 0.734281
using cost function 1 (abs-difference)
using cost function 1 (abs-difference)
Amoeba time: 4297, calls 802, cost:127946102

Cuda Total time 4297, Total calls 802
File dofmat.mat written
Nifti file junk.nii written, bswop=0
Full Time 6187

timer 0 1890 ms
timer 1 0 ms
timer 2 3849 ms
timer 3 448 ms
timer 4 0 ms
Total 6.187 secs
Final Transformation:
 0.944702 -0.184565  0.017164  40.637428
 0.301902  0.866726 -0.003767 -38.923237
-0.028792 -0.100618  0.990019  18.120852
 0.000000  0.000000  0.000000   1.000000
Final rots and shifts 6.096217 -0.156668 -19.187197 -0.012378 0.072203 0.122495
scales and shears 0.952886 0.912211 0.995497 0.150428 -0.101673 0.009023
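A rough per-call figure follows from this log, assuming the "Amoeba time" is in milliseconds (the timers sum to 6187 ms, matching "Total 6.187 secs") and that every call evaluates the cost over all voxels:

$$t_{\text{call}} \approx \frac{4297\ \text{ms}}{802\ \text{calls}} \approx 5.36\ \text{ms}$$

$$N = 240 \times 256 \times 176 = 10{,}813{,}440\ \text{voxels}, \qquad \frac{N}{t_{\text{call}}} \approx 2.0 \times 10^{9}\ \text{voxels/s}$$

The per-call time includes the constant-memory upload, the kernel launch, the copy of partial sums back to the host and the host-side sum, so the kernel alone is somewhat faster than this estimate.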
Desktop 3D Registration
Registration with CUDA: 6 Seconds
Registration with FLIRT 4.1: 8.5 Minutes
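Taking the two times on this slide at face value, and reading the FLIRT time as 8.5 minutes, the ratio is:

$$\frac{8.5 \times 60\ \text{s}}{6\ \text{s}} = 85$$

This compares two different programs end to end, so it is an indication of the practical gain and not a like-for-like kernel speedup.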
Comments
• This is actually already very useful. Almost interactive (add visualisation)
• Further speedups possible
– Faster card
– Smarter optimiser
– Overlap IO and Kernel execution
– Tweak CUDA code
• Extend to non-linear local registration
Intel Larrabee?
Figure 1: Schematic of the Larrabee many-core architecture: The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.
Porting from CUDA to Larrabee should be easy
Thank you