Weekly Report - Kmeans
Ph.D. Student: Leo Lee
Date: Nov. 13, 2009


Page 1: Weekly  Report- Kmeans

Weekly Report - Kmeans

Ph.D. Student: Leo Lee
Date: Nov. 13, 2009

Page 2: Weekly  Report- Kmeans

Outline

K-means

◦ CPU-based algorithm workflow;
◦ Reading Kaiyong’s code;
◦ Some naïve thoughts;

Work plan

Page 3: Weekly  Report- Kmeans

K-means

CPU-based algorithm workflow:

N data and K centers, dim dimensions;

◦ Compute D[N][K];
◦ Compute MinD[N];
◦ Compute NewCenter[K];
◦ If NewCenter == center, stop; otherwise set center = NewCenter and repeat.

Page 4: Weekly  Report- Kmeans

K-means

Pseudocode:

while (!bFlag && ++i <= nIterationsTime) {
    ComputeDis(&dis, data, centers);
    FindMinDis(dis, &index);
    ComputeNewCen(&newCen, data, index);
    if (newCen - centers < b)
        bFlag = true;
    else
        centers = newCen;
}
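For reference, a minimal runnable CPU sketch of the same loop is given below. The helper calls are inlined, and the names, memory layouts and convergence test (squared center movement below b) are illustrative assumptions, not Kaiyong's code.

// Minimal CPU k-means sketch (illustrative names and layouts, not Kaiyong's code).
// data:    N x dim, row-major;  centers: K x dim, row-major (updated in place);
// index:   output, nearest-center id for every point.
#include <algorithm>
#include <cfloat>
#include <vector>

void kmeans_cpu(const float* data, float* centers, int* index,
                int N, int K, int dim, int nIterationsTime, float b)
{
    std::vector<float> newCen(K * dim);
    std::vector<int>   count(K);
    bool bFlag = false;
    int  i = 0;

    while (!bFlag && ++i <= nIterationsTime) {
        // ComputeDis + FindMinDis: nearest center for every point
        for (int n = 0; n < N; n++) {
            float best = FLT_MAX;
            for (int k = 0; k < K; k++) {
                float d = 0.0f;
                for (int j = 0; j < dim; j++) {
                    float t = data[n * dim + j] - centers[k * dim + j];
                    d += t * t;
                }
                if (d < best) { best = d; index[n] = k; }
            }
        }

        // ComputeNewCen: mean of the points assigned to each center
        std::fill(newCen.begin(), newCen.end(), 0.0f);
        std::fill(count.begin(),  count.end(),  0);
        for (int n = 0; n < N; n++) {
            count[index[n]]++;
            for (int j = 0; j < dim; j++)
                newCen[index[n] * dim + j] += data[n * dim + j];
        }
        for (int k = 0; k < K; k++)
            for (int j = 0; j < dim; j++) {
                if (count[k] > 0)
                    newCen[k * dim + j] /= (float)count[k];
                else
                    newCen[k * dim + j] = centers[k * dim + j];  // empty cluster: keep old center
            }

        // Convergence test: squared movement of the centers below threshold b
        float move = 0.0f;
        for (int k = 0; k < K * dim; k++) {
            float t = newCen[k] - centers[k];
            move += t * t;
        }
        if (move < b)
            bFlag = true;
        else
            std::copy(newCen.begin(), newCen.end(), centers);
    }
}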

Page 5: Weekly  Report- Kmeans

K-means

Since each iteration relies on the previous one, we run the iterations sequentially but parallelize each function inside an iteration (a host-side sketch follows the list below):

◦ Compute distance;

◦ Find the nearest center;

◦ Compute new centers;
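A minimal host-side sketch of that structure: the iteration loop stays sequential on the CPU while each step inside it runs as a parallel kernel. The three kernel signatures are hypothetical (assumed to be implemented elsewhere), and the launch configurations only mirror the ones on the following slides.

// Host loop sketch: sequential iterations, parallel kernels inside each iteration.
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical kernel prototypes, assumed implemented elsewhere.
__global__ void ComputeDis(float* dis, const float* data, const float* centers,
                           int N, int K, int dim);
__global__ void FindMinDis(const float* dis, int* index, int K);
__global__ void ComputeNewCen(float* newCen, const float* data, const int* index,
                              int N, int K, int dim);

void RunKMeans(float* d_data, float* d_centers, float* d_newCen,
               float* d_dis, int* d_index,
               int N, int K, int dim, int nIterationsTime, float b)
{
    std::vector<float> oldCen(K * dim), curCen(K * dim);
    bool bFlag = false;
    int  i = 0;

    while (!bFlag && ++i <= nIterationsTime) {
        ComputeDis   <<<dim3(K / 32, N / 32), dim3(16, 2)>>>(d_dis, d_data, d_centers, N, K, dim);
        FindMinDis   <<<dim3(1, N),           dim3(16)   >>>(d_dis, d_index, K);
        ComputeNewCen<<<dim3(100),            dim3(32)   >>>(d_newCen, d_data, d_index, N, K, dim);

        // cudaMemcpy on the default stream waits for the kernels to finish,
        // so the results are ready; test convergence on the host.
        cudaMemcpy(oldCen.data(), d_centers, oldCen.size() * sizeof(float), cudaMemcpyDeviceToHost);
        cudaMemcpy(curCen.data(), d_newCen,  curCen.size() * sizeof(float), cudaMemcpyDeviceToHost);
        float move = 0.0f;
        for (int j = 0; j < K * dim; j++)
            move += std::fabs(curCen[j] - oldCen[j]);

        if (move < b)
            bFlag = true;
        else  // new centers become the input of the next iteration
            cudaMemcpy(d_centers, d_newCen, curCen.size() * sizeof(float),
                       cudaMemcpyDeviceToDevice);
    }
}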

Page 6: Weekly  Report- Kmeans

K-means - compute the distance

Data[N][Dim], Centers[Dim][K] -> Dis[N][K]

Nearly the same as matrix multiplication; only replace A[i][k]*B[k][j] with (A[i][k]-B[k][j])^2.

Using the so-called tiles increases the compute-to-memory-access ratio.

[Figure: tiled blocks of the Data, Centers and Distances matrices]
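To make the analogy concrete, a naive (untiled) version of the kernel could look like the sketch below. The kernel name, the row-major layouts and the one-thread-per-output mapping are assumptions for illustration; Kaiyong's tiled kernel on the following slides is the optimized form.

// Naive distance kernel sketch (no tiling): one thread per (n, k) entry of Dis[N][K].
// Assumes Data is row-major N x dim and Centers is row-major K x dim; this only
// illustrates the (a - b)^2 replacement in the matrix-multiplication pattern.
__global__ void ComputeDistanceNaive(const float* Data, const float* Centers,
                                     float* Dis, int N, int K, int dim)
{
    int n = blockIdx.y * blockDim.y + threadIdx.y;   // data point index
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // center index
    if (n >= N || k >= K) return;

    float sum = 0.0f;
    for (int j = 0; j < dim; j++) {
        float t = Data[n * dim + j] - Centers[k * dim + j];
        sum += t * t;                                // squared difference instead of a product
    }
    Dis[n * K + k] = sum;
}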

Page 7: Weekly  Report- Kmeans

K-means - compute the distance

From Kaiyong:

dim3 threads(16, 2, 1);
dim3 grid(k/32, n/32, 1);
ComputeDistance<32,32><<<grid, threads>>>(…);

template <unsigned int B_WIDTH, unsigned int C_HIGH>
__global__ void ComputeDistance(….)
{
}

Page 8: Weekly  Report- Kmeans

K-means - compute the distance

// Per-thread base pointers into the query (data), reference (centers)
// and output (distance) matrices.
float* indexQ = Query + threadIdx.x
              + (blockIdx.y * C_HIGH + threadIdx.y) * dim;
float* indexR = Ref + blockIdx.x * B_WIDTH
              + threadIdx.y * blockDim.x + threadIdx.x;
float* indexC = C + blockIdx.x * C_HIGH + threadIdx.y * blockDim.x
              + threadIdx.x + blockIdx.y * C_HIGH * wB;

Page 9: Weekly  Report- Kmeans

K-means - compute the distance

__shared__ float as[16][C_HIGH + 1];   // +1 column of padding (commonly used to avoid bank conflicts)
do {
    // load a 16 x C_HIGH tile of the query data into shared memory
    for (int i = 0; i < C_HIGH; i += 2)
        as[threadIdx.x][threadIdx.y + i] = indexQ[i * dim];
    indexQ += 16;
    __syncthreads();

    // accumulate squared differences against the current tile
    for (int i = 0; i < 16; i++, indexR += wB)
        for (int j = 0; j < C_HIGH; j++) {
            c_temp = indexR[0] - as[i][j];
            c[j] += c_temp * c_temp;
        }
    __syncthreads();
} while (indexQ < Alast);

Page 10: Weekly  Report- Kmeans

K-means - compute the distance

// Write this thread's C_HIGH accumulated distances back to global memory.
for (int i = 0; i < C_HIGH; i++, indexC += wB) {
    indexC[0] = c[i];
}

Page 11: Weekly  Report- Kmeans

K-means - compute the distance

Questions

◦ Why template <unsigned int B_WIDTH, unsigned int C_HIGH>, not ordinary parameters?
◦ Why load the sub-matrix in that way? Something to do with the warp? If we use a 16*16 tile instead of 32*32, should the load method change?
◦ This algorithm is nearly the same as the so-called most efficient matrix multiplication: threads(16, 4), grid(wc/4, hc/16).
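One likely answer to the first question (my assumption, not stated on the slides): template arguments are compile-time constants, so they can size static shared-memory arrays and let the compiler fully unroll loops, which an ordinary runtime parameter cannot. A minimal, hypothetical illustration:

// Hypothetical illustration: TILE must be a compile-time constant so that the
// static __shared__ array can be sized and the loop can be fully unrolled;
// a runtime "int tile" argument would allow neither.
template <unsigned int TILE>
__global__ void block_sum(const float* in, float* out)
{
    __shared__ float buf[TILE];                 // size fixed at compile time
    unsigned int base = blockIdx.x * TILE;
    buf[threadIdx.x] = in[base + threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0) {
        float s = 0.0f;
#pragma unroll                                   // bound is a compile-time constant
        for (unsigned int j = 0; j < TILE; j++)
            s += buf[j];
        out[blockIdx.x] = s;
    }
}
// e.g. block_sum<32><<<n / 32, 32>>>(d_in, d_out);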

Page 12: Weekly  Report- Kmeans

Compute the distance

Very useful in data mining:

◦ K-means;
◦ K-nn;
◦ Hierarchical clustering;
◦ …

Page 13: Weekly  Report- Kmeans

K-means - find the nearest center

N reductions:

◦ Sum, max, min…
◦ Sequential addressing
◦ Completely unroll
◦ n/logn threads, logn steps;

[Figure: the N x K distance matrix, one min-reduction per row of K values]

Page 14: Weekly  Report- Kmeans

K-means - find the nearest center

dim3 threads_find(16, 1, 1);
dim3 grid_find(1, data_height, 1);

template <unsigned int blockSize> __global__ void
cpu_FindSmallDistance(float* Dist, int* D_index, int k)
{
    __shared__ float sdata[blockSize];
    __shared__ int   d_index[blockSize];

    // perform first level of reduction, reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockSize + threadIdx.x;
    float* p_data = Dist + blockIdx.y * k;   // this block's row of k distances
    sdata[tid]   = p_data[tid];
    d_index[tid] = tid;
    if (i < k)
        if (sdata[tid] > p_data[i]) {
            sdata[tid]   = p_data[i];
            d_index[tid] = i;
        }
    EMUSYNC;

Page 15: Weekly  Report- Kmeans

K-means - find the nearest center

    // completely unrolled min-reduction over the 16 entries in shared memory
    if (sdata[tid] > sdata[tid + 8]) { sdata[tid] = sdata[tid + 8]; d_index[tid] = d_index[tid + 8]; } EMUSYNC;
    if (sdata[tid] > sdata[tid + 4]) { sdata[tid] = sdata[tid + 4]; d_index[tid] = d_index[tid + 4]; } EMUSYNC;
    if (sdata[tid] > sdata[tid + 2]) { sdata[tid] = sdata[tid + 2]; d_index[tid] = d_index[tid + 2]; } EMUSYNC;
    if (sdata[tid] > sdata[tid + 1]) { d_index[tid] = d_index[tid + 1]; } EMUSYNC;

    // write result for this block to global memory
    if (tid == 0)
        D_index[blockIdx.y] = d_index[0];
}

Since K is presumed to be 32 or smaller, this implementation is optimal.
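A launch sketch (assumed, not shown on the slides) using the configuration above: one 16-thread block per data point handles up to 32 centers, because each thread first compares two distances and the unrolled reduction then covers the remaining 16 candidates.

// Hypothetical usage: d_Dist holds the N x k distance matrix (data_height == N),
// d_index receives the index of the nearest center for each data point.
dim3 threads_find(16, 1, 1);
dim3 grid_find(1, data_height, 1);
cpu_FindSmallDistance<16><<<grid_find, threads_find>>>(d_Dist, d_index, k);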

Page 16: Weekly  Report- Kmeans

K-means - compute new centers

CPU-based algorithm:

◦ For each data point i: C[Index[i]] += Data[i]

Not direct addressing; not as clean as matrix multiplication and reduction.

// Sum up the points assigned to each nearest center; split the data into 100 groups of 512 points each.
dim3 threads_collect(32, 1, 1);
dim3 grid_collect(100, 1, 1);

[Figure: the Data matrix (N x dim) with the per-point Index array]

Page 17: Weekly  Report- Kmeans

K-means - compute new centers

__shared__ int   index[32];
__shared__ float as[32 * 34];

int    idx          = threadIdx.x;
float* p_Data       = Data    + blockIdx.x * GroupSize * Dim;
int*   p_index      = D_index + blockIdx.x * GroupSize;
int*   p_index_last = p_index + GroupSize;
float* p_centro     = Centro  + blockIdx.x * k * Dim;
int*   p_counter    = Counter + blockIdx.x * 32;
int    index_i      = 0;
int    centro_count = 0;

// initialize the shared-memory centers to zero
if (idx < k)
    for (int i = 0; i < Dim; i++)
        as[i * k + idx] = 0;
EMUSYNC;


Page 18: Weekly  Report- Kmeans

K-means - compute new centers

// Fetch 32 indices per round, so with 512 data points in total there are 512/32 = 16 rounds.
for (; p_index < p_index_last; p_index += blockDim.x) {
    // Load 32 indices into shared memory so the global-memory reads can be coalesced.
    index[idx] = p_index[idx];
    EMUSYNC;

    // Loop 32 times, handling one data point per pass.
    for (int i = 0; i < 32; i++, p_Data += Dim) {
        index_i = index[i];
        // Each thread corresponds to one center (centro), so when a data point
        // assigned to it is processed, increment its counter here.
        if (idx == index_i)
            centro_count++;

        for (int j = idx; j < Dim; j += blockDim.x) {
            as[j * k + index_i] += p_Data[j];
        }
        EMUSYNC;
    }
}


Page 19: Weekly  Report- Kmeans

K-means - compute new centers

// Only threads within the center range write data back; each thread is responsible for one center (centro).
if (idx < k) {
    p_counter[idx] = centro_count;
    for (int j = 0; j < Dim; j++, p_centro += k) {
        p_centro[idx] = as[j * k + idx];
    }
}

Now we have 100 partial Dim*K matrices:
◦ Matrix adding-reduction, where each element is a matrix;
◦ Kaiyong’s code first reduces the 100 down to 10, then computes the final result (see the sketch below).

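A minimal host-side sketch of that final step (an assumption about the layouts and the finishing division, not Kaiyong's actual reduction code): sum the 100 partial matrices and counters, then divide each center by its point count.

// Hypothetical host-side finish of the new-center computation.
// partial:  groups x (Dim * K) partial sums, laid out like as[] (dim-major, k entries per row)
// counters: groups x K per-group point counts (assumed layout)
void FinishNewCenters(const float* partial, const int* counters,
                      float* newCenters, int groups, int Dim, int K)
{
    for (int k = 0; k < K; k++) {
        long long count = 0;
        for (int g = 0; g < groups; g++)
            count += counters[g * K + k];

        for (int d = 0; d < Dim; d++) {
            float sum = 0.0f;
            for (int g = 0; g < groups; g++)
                sum += partial[g * Dim * K + d * K + k];
            // Keep the previous value if no point was assigned to this center.
            if (count > 0)
                newCenters[d * K + k] = sum / (float)count;
        }
    }
}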

Page 20: Weekly  Report- Kmeans

Work plan - K-means

Test the program:

◦ Each function, GPU vs. CPU;

Compare with other papers.

Page 21: Weekly  Report- Kmeans

Work plan

K-means;

Learn data mining, prepare for final exam;

Continue reading parallel computing books.

Page 22: Weekly  Report- Kmeans

Thanks