implementing a speech recognition system on a gpu using cuda presented by omid talakoub astrid yi
TRANSCRIPT
![Page 1: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/1.jpg)
Implementing a Speech Recognition System on a
GPU using CUDA
Presented by
Omid Talakoub
Astrid Yi
![Page 2: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/2.jpg)
OutlineBackground
Motivation
Speech recognition algorithm
Implementation steps
GPU implementation strategies
Data flow and representation
Profiler results
Floating point accuracy
Future optimizationsApril 23rd, 2009 2
![Page 3: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/3.jpg)
BackgroundSpeech recognition system:
1. Speaker-dependent or speaker-independent
2. Isolated words or continuous speech
Practical applications use isolated-word recognition systems
April 23rd, 2009 3
![Page 4: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/4.jpg)
Motivation2 phases in speech recognition:
1. Memorize a set of reference templates
2. Take a test template and return the closest reference template match
Recognition accuracy improved with a larger set of reference templates
But, it takes more time to find the closest match
April 23rd, 2009 4
![Page 5: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/5.jpg)
Algorithm Overview
April 23rd, 2009 5
Sound Recorder
Voice Activity
Detection (VAD)
Segmentation
Mel Frequency Cepstral
Coefficients (MFCC)
Dynamic Time
WarpingTest Word
Recognized Word
MFCC
![Page 6: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/6.jpg)
6
Words are parameterised on a frame-by-frame basis
Choose frame length, over which speech remains reasonably stationary
Overlap frames e.g. 25ms frames, 10ms frame shift
We want to compare frames of test and reference words i.e. calculate distances between them
25ms
10ms
Hamming Window
![Page 7: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/7.jpg)
7
• Problem:
Number of frames won’t always correspond
• Easy:
Sum differences between corresponding frames
Calculating Distances
![Page 8: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/8.jpg)
Feature Extraction
Calculating Mel-frequency cepstral coefficients (MFCCs):
April 23rd, 2009 8
MFCCs are coefficients of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
![Page 9: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/9.jpg)
“seven”
Cepstral domain
DCT
Log energy
Mel-scaled filter bank
Fourier
x(t)
Time
Filter #
MFCC algorithm
![Page 10: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/10.jpg)
Dynamic Time Warping
i
j
t(i-1)
r(j)
),1(
)1,1(
)1,(
min
)(),(),(
jiD
jiD
jiD
jritjiD
),( jiD
r(j-1)
t(i)
t: input MFCC matrix (Each row is a frame’s feature.)r: reference MFCC matrixLocal paths: 0-45-90 degrees
DTW recurrence:
![Page 11: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/11.jpg)
Local Path Constraints
0-45-90 local paths
jiD ,
),1(
)1,1(
)1,(
min
)()(),(
jiD
jiD
jiD
jritjiD
1, jiD 1,1 jiD
jiD ,1
![Page 12: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/12.jpg)
Other VariantsLocal constraints Start/ending area
![Page 13: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/13.jpg)
DTW Paths of “Match Corners”
i
j
![Page 14: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/14.jpg)
14
Dynamic Time Warping The Brute Force of the Engineering Approach
TE
MP
LA
TE
(W
OR
D 7
)
UNKNOWN WORD
![Page 15: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/15.jpg)
Another Example of DTW Minimum Path
April 23rd, 2009 15
![Page 16: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/16.jpg)
Reference ImplementationReference code implemented on CPU
Provided a base on which to compare results from GPU
Based on original MATLAB code
Used as a base to compare performance increase by shifting work to the GPU
Single threaded with no SIMD type execution used
April 23rd, 2009 16
![Page 17: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/17.jpg)
GPU Implementation StrategiesTwo main approaches to implementation
1. Do more work in the same amount of time
2. To the same amount of work in less time
How to choose an approach in general
1. Amdahl’s law states that performance is dictated by how much can be parallelized
2. If algorithm is serial, do more work simultaneously instead of faster
April 23rd, 2009 17
![Page 18: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/18.jpg)
Data Flow and RepresentationMatrices stored in row major format with a pitch
equal to width
After initial data load, all operations on data take place on device and transform the data in place or to another location
Transformation implemented in the form of CUDA kernels
Kernels can be invoked to process all the data at a specific transformation stage
April 23rd, 2009 18
![Page 19: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/19.jpg)
Profiler ResultsCPU implementation was used as a base
Profiled the generation of the MFCCs which is the computationally expensive portion
Averaging the results over 5 runs of 45 MFCC calls yields the following results:
1. GPU Time: 1.4080285 seconds
2. CPU Time: 8.6354016 seconds
3. Speedup: 5.8
April 23rd, 2009 19
![Page 20: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/20.jpg)
Kernel GPU UtilizationHalf of GPU runtime is limited to a single kernel
Further optimizations should concentrate on first three listed kernels
April 23rd, 2009 20
![Page 21: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/21.jpg)
Kernel Runtimes
April 23rd, 2009 21
![Page 22: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/22.jpg)
Floating Point Accuracy
April 23rd, 2009 22
Floating point accuracy was an issue during verification
Two main problem areas:
1. FFT
2. DCT
Problems from the FFT could not be fixed due to the existing implementation in CUFFT
Problems in the DCT occurred due to cancellation
![Page 23: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/23.jpg)
Floating Point Accuracy con’tDetermined that differences between
implementations did not affect results
To continue comparing against reference results, CPU data was loaded for the GPU during verification
April 23rd, 2009 23
![Page 24: Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi](https://reader036.vdocuments.us/reader036/viewer/2022062718/56649eb15503460f94bb7946/html5/thumbnails/24.jpg)
Future OptimizationsUse the already optimized CUBLAS library for
some operations
Requires reordering of data to use column major ordering
IIR filters used have coefficients such that their invocations can be better parallelized
Allocate memory for matrices based using pitches to align data, ie. cudaMalloc2D
April 23rd, 2009 24