Split Primitive
Split can be defined as performing ::append( x, List[category(x)] )
for each element x; each List holds elements of the same category together
Split Sequential Algorithm
I. Count the number of elements falling into each bin
– for each element x of list L do
• histogram[category(x)]++ [possible clashes on a category]
II. Find the starting index for each bin (Prefix Sum)
– for each category m do
• startIndex[m] = startIndex[m-1] + histogram[m-1]
III. Assign each element to the output
– for each element x of list L do [initialize localIndex[m] = 0 for every category m]
• itemIndex = localIndex[category(x)]++ [possible clashes on a category]
• globalIndex = startIndex[category(x)]
• outArray[globalIndex + itemIndex] = x
Split Operation in Parallel
• To parallelize the above split algorithm, we require a clash-free method for building the histogram on the GPU
• This can be achieved on a parallel machine using one of the following two methods:
– A personal histogram for each processor, followed by merging the histograms
– Atomic operations on the histogram array(s)
Global Memory Atomic Split
• Code:
__global__ void globalHist( unsigned int *histogram, int *gArray, int *category )
{
    int curElement, curCategory;
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        curElement = gArray[blockIdx.x * blockDim.x * ELEMENTS_PER_THREAD + ( i * blockDim.x ) + threadIdx.x];
        curCategory = category[curElement];
        atomicInc( &histogram[curCategory], 99999 );
    }
}
• Global memory is too slow to access
• Single histogram in global memory (the number of clashes is data dependent)
• Overuse of shared memory limits the maximum number of categories to 64
Non-Atomic Approach (He et al.)
• A histogram for each 'thread'
• Combine all the histograms to get the final histogram
__global__ void nonAtomicHistogram( int *gArray, int *category, unsigned int *tHistGlobal )
{
    int curElement, curCategory;
    int tx = threadIdx.x;
    __shared__ unsigned int tHist[NUMBINS * NUMTHREADS];
    for ( int i = 0; i < NUMBINS; i++ )
        tHist[tx * NUMBINS + i] = 0;
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + tx];
        curCategory = category[curElement];
        tHist[tx * NUMBINS + curCategory]++;
    }
    for ( int i = 0; i < NUMBINS; i++ )
        tHistGlobal[i * NUMBLOCKS * NUMTHREADS + blockIdx.x * NUMTHREADS + tx] = tHist[tx * NUMBINS + i];
}
Shared Memory Atomic
• Global Atomic does not use the fast shared memory available
• The Non-Atomic approach overuses the shared memory
• Incorporating atomic operations on fast shared memory may perform better than the above two approaches
• Shared Memory Atomic can be performed using one of the following techniques:
– H/W atomic operations
– Clash-serial atomic operations
– Thread-serial atomic operations
SM Atomic :: H/W Atomic
• Latest GPUs (G2xx and later) support atomic operations on the shared memory
__global__ void histkernel( unsigned int *blockHists, int *gArray, unsigned int *category )
{
    extern __shared__ unsigned int s_Hist[];
    unsigned int curElement, curCategory;
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        s_Hist[pos] = 0;
    __syncthreads();
    for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
        curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + threadIdx.x];
        curCategory = category[curElement];
        atomicInc( &s_Hist[curCategory], 9999999 );
    }
    __syncthreads();
    for ( int pos = threadIdx.x; pos < NUMBINS; pos += blockDim.x )
        blockHists[blockIdx.x + gridDim.x * pos] = s_Hist[pos];
}
SM Atomic :: Thread Serial
• Threads can be serialized within a 'warp' in order to avoid clashes.
............
for ( int i = 0; i < ELEMENTS_PER_THREAD; i++ ) {
    curElement = gArray[blockIdx.x * NUMTHREADS * ELEMENTS_PER_THREAD + ( i * NUMTHREADS ) + threadIdx.x];
    curCategory = category[curElement];
    for ( int j = 0; j < WARPSIZE; j++ )
        if ( ( threadIdx.x % WARPSIZE ) == j )
            s_Hist[curCategory]++;
}
............
SM Atomic :: Clash Serial
• Each thread writes to the common histogram of the block until it succeeds.
• A write is tagged with the thread's ID in order to find out whether that thread successfully updated the histogram
// Main
for ( int pos = globalTid; pos < NUMELEMENTS; pos += numThreads ) {
    unsigned int curElement = gArray[pos];
    unsigned int curCategory = category[curElement];
    addData256( s_Hist, curCategory, threadTag );
}
// Clash-serializing function for a warp
__device__ void addData256( volatile unsigned int *s_WarpHist,
                            unsigned int data, unsigned int threadTag )
{
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU;
        count = threadTag | ( count + 1 );
        s_WarpHist[data] = count;
    } while ( s_WarpHist[data] != count );
}
Split using Shared Atomic
• Shared Atomic is used to build block-level histograms
• Parallel prefix sum is used to compute each bin's starting index
• Split is performed by each block on the same set of elements used in Step 1
Comparison of Split Methods
• Global Atomic suffers for low numbers of categories
• Non-Atomic can handle a maximum of 64 categories in one pass (multiple passes for more categories)
• Shared Atomic performs better than the other two GPU methods and the CPU for a wide range of categories
• Shared memory limits the maximum number of bins to 2048 (for power-of-2 bins)
Multi Level Split
• Bin counts higher than 2K are broken into sub-bins
• A hierarchy of bins is created, and a split is performed at each level for the different sub-bins
• The number of splits to be performed grows exponentially with the number of levels
• With 2 levels we can perform a split for up to 4 million bins
[Figure: a 32-bit bin broken into 4 sub-bins of 8 bits each]
Results for Bins up to 4 Million
Multi-Level Split performed on a GTX 280. Bins from 4K to 512K are handled with 2 passes; results for 1M and 2M bins over 1M elements are computed using 3 passes for better performance.
MLS :: Right to Left
• Using an iterative approach requires a constant number of splits at each level
• Highly scalable due to its iterative nature; an ideal number of bins can be chosen for best performance
• Dividing the bins from right to left requires preserving the order of elements from the previous pass
• The complete list of elements is re-arranged at each level
Ordered Atomic
• Atomic operations perform safe reads/writes by serializing the clashes, but do not guarantee the required order of operations
• An ordered atomic serializes the clashes in a fixed order provided by the user
• In case of a clash at higher levels in a Right-to-Left split, elements should be inserted in the order of their existing positions in the list
Split on 4 Billion Bins
• Right-to-Left split can be used for splitting integers into 4 billion bins (sorting?)
• Integers can be sorted on the desired number of bits
(keys can be 8, 16, 24, or 32 bits long; 64-bit too)
Conclusion
• Various histogram methods implemented on shared memory
• The split operation now handles millions and billions of bins using the Left-to-Right and Right-to-Left methods of Multi-Level Split
• The shared memory split operation is faster and more scalable than the previous implementation (He et al.)
• Fastest sorting achieved by extending split to billions of bins
• Variable bit-length sorting is helpful with keys of varying size (bit length)