In image processing, histogram equalization treats the histogram as a probability density function (pdf) and typically integrates it into a cumulative distribution function (cdf), for example:
        pdf = { 1, 2, 3, 4, 5, 6, 7, 8}
        cdf = { 1, 3, 6, 10, 15, 21, 28, 36}
In code:

     cdf[0] = pdf[0];
     for (int i = 1; i < N; i++) {
         cdf[i] = cdf[i - 1] + pdf[i];
     }

 
There are two kinds of SCAN. The inclusive form is computed as:
        output[i] = output[i-1] operator input[i]
and the exclusive form as:
        output[i] = output[i-1] operator input[i-1]
The difference is whether the running result at position i includes the current input element or only the elements up to the previous one. Because SCAN has a data dependency on the previous result, at first glance it seems hard to parallelize. However, if the operator is associative the result does not depend on the order of evaluation, and with some rearrangement of the work the computation can be parallelized. [1][2] introduce two such algorithms, Hillis-Steele and Blelloch; the figure below shows the computation steps of the Hillis-Steele algorithm:

[Figure: computation steps of the Hillis-Steele scan algorithm]
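To make the two forms concrete, here is a minimal sequential sketch (not from the original post; the array and variable names are mine) that computes both scans of the pdf array above, using addition as the operator:

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        int input[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        int incl[N], excl[N];

        incl[0] = input[0];   // inclusive: incl[i] = incl[i-1] + input[i]
        excl[0] = 0;          // exclusive: excl[i] = excl[i-1] + input[i-1], starts at the identity of +
        for (int i = 1; i < N; i++) {
            incl[i] = incl[i - 1] + input[i];
            excl[i] = excl[i - 1] + input[i - 1];
        }

        for (int i = 0; i < N; i++)
            printf("%d: input=%d inclusive=%d exclusive=%d\n", i, input[i], incl[i], excl[i]);
        return 0;
    }

The inclusive result is exactly the cdf { 1, 3, 6, 10, 15, 21, 28, 36 } computed earlier; the exclusive result is the same sequence shifted right by one position with a leading 0.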

 
The following code performs the whole computation within a single block. Since CUDA currently allows at most 1024 threads per block, the array a[] can have at most 1024 elements. The implementation first copies a[] into shared memory and operates only on the shared copy, so the original contents of a[] are not modified.

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <stdio.h>

 

#define BlockSize 1

#define ThreadSize 19

#define ArraySize (BlockSize*ThreadSize)

 

__device__ void __syncthreads();

__global__ void scanHillisSteele(int *b, int *a)
{
     __shared__ int x[ThreadSize];

     int id = blockIdx.x * blockDim.x + threadIdx.x;
     x[threadIdx.x] = a[id];
     __syncthreads(); // wait until the copy to shared memory is complete

     for (int d = 1; d < blockDim.x; d <<= 1)
     {
          // read the partner element first, then synchronize before writing,
          // so no thread overwrites a value that another thread still needs
          int t = (threadIdx.x >= d) ? x[threadIdx.x - d] : 0;
          __syncthreads();
          x[threadIdx.x] += t;
          __syncthreads();
     }

     b[id] = x[threadIdx.x];
}

 

int main()
{
     int host_a[ArraySize];
     int host_b[ArraySize];
     int *dev_a = 0;
     int *dev_b = 0;
     float elapsedTime;

     // setup performance meter from CUDA ----------
     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);

     cudaSetDevice(0);

     cudaMalloc((void**)&dev_a, ArraySize * sizeof(int));
     for (int i = 0; i < ArraySize; i++)
         host_a[i] = i + 1;
     cudaMemcpy(dev_a, host_a, ArraySize * sizeof(int), cudaMemcpyHostToDevice);

     cudaMalloc((void**)&dev_b, ArraySize * sizeof(int));
     //cudaMemset(dev_b, 0, ArraySize * sizeof(int));

     // Run scanHillisSteele
     cudaEventRecord(start, 0); //keep start time
     scanHillisSteele<<<BlockSize, ThreadSize>>>(dev_b, dev_a); //calculate
     cudaEventRecord(stop, 0); //keep stop time
     cudaEventSynchronize(stop); //wait for the stop event
     cudaEventElapsedTime(&elapsedTime, start, stop);

     cudaMemcpy(host_b, dev_b, ArraySize * sizeof(int), cudaMemcpyDeviceToHost);

     //Print result
     printf("pdf:\n");
     for (int i = 0; i < ArraySize; i++) {
         printf("%4d ", host_a[i]);
     }
     printf("\n");

     printf("cdf:\n");
     for (int i = 0; i < ArraySize; i++) {
         printf("%4d ", host_b[i]);
     }
     printf("\nt=%f\n\n", elapsedTime);
     //cudaDeviceSynchronize();
     getchar();

     cudaFree(dev_a);
     cudaFree(dev_b);
     return 0;
}
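As a quick sanity check, the GPU result could also be compared against a sequential inclusive scan computed on the host. The helper below is only a sketch and is not part of the original program (the name verifyScan is made up):

    // Hypothetical helper: compare the GPU scan result against a
    // sequential inclusive prefix sum computed on the CPU.
    // Returns 1 if every element matches, 0 otherwise.
    static int verifyScan(const int *in, const int *out, int n)
    {
        int expected = 0;
        for (int i = 0; i < n; i++) {
            expected += in[i];   // running inclusive sum
            if (out[i] != expected) {
                printf("mismatch at %d: got %d, expected %d\n", i, out[i], expected);
                return 0;
            }
        }
        return 1;
    }

Calling verifyScan(host_a, host_b, ArraySize) right after the device-to-host copy should return 1 if the kernel is correct.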

 
The execution result is shown below:
 
[Screenshot: program output listing the pdf array, the GPU-computed cdf, and the elapsed time]
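For comparison, [1] also describes the Blelloch (work-efficient) scan, which performs an up-sweep (reduction) followed by a down-sweep and naturally produces an exclusive scan. The single-block kernel below is only a sketch of that idea, not code from this post; it assumes n equals blockDim.x, that n is a power of two, and that the kernel is launched with n * sizeof(int) bytes of dynamic shared memory:

    // Sketch of a single-block Blelloch (work-efficient) exclusive scan.
    // Assumes n == blockDim.x is a power of two and n <= 1024.
    __global__ void scanBlelloch(int *out, const int *in, int n)
    {
        extern __shared__ int x[];
        int tid = threadIdx.x;
        x[tid] = in[tid];
        __syncthreads();

        // up-sweep (reduce): build partial sums in place
        for (int d = 1; d < n; d <<= 1) {
            int idx = (tid + 1) * 2 * d - 1;
            if (idx < n) x[idx] += x[idx - d];
            __syncthreads();
        }

        // clear the last element, then down-sweep to distribute the partial sums
        if (tid == 0) x[n - 1] = 0;
        __syncthreads();
        for (int d = n >> 1; d >= 1; d >>= 1) {
            int idx = (tid + 1) * 2 * d - 1;
            if (idx < n) {
                int t = x[idx - d];
                x[idx - d] = x[idx];
                x[idx] += t;
            }
            __syncthreads();
        }

        out[tid] = x[tid];   // exclusive scan result
    }

A launch such as scanBlelloch<<<1, 16, 16 * sizeof(int)>>>(dev_b, dev_a, 16) would compute the exclusive prefix sums of 16 elements; note that ThreadSize = 19 in the program above is not a power of two, so the data would need to be padded (e.g. to 32) for this kernel.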
 
 
References:
[1] "Parallel Prefix Sum (Scan) with CUDA", GPU Gems 3, Chapter 39, https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
[2] Adam O'Donovan, "Parallel Prefix Sum on the GPU (Scan)", University of Maryland Institute for Advanced Computer Studies, http://users.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf