
ASR Characteristics and Software Architecture

Challenge 1: Handling irregular data structures with data-parallel operations
Solution 1: Construct an efficient dynamic vector data structure to handle irregular data accesses

Challenge 2: Eliminating redundant work when threads are computing results for an unpredictable, input-dependent subset of the problem
Solution 2: Implement an efficient find-unique function by leveraging the GPU global-memory write-conflict-resolution policy

Challenge 3: Performing a conflict-free reduction during graph traversal to implement the Viterbi beam-search algorithm
Solution 3: Implement lock-free accesses to a shared map, leveraging advanced GPU atomic operations to enable conflict-free reductions

Challenge 4: Constructing a shared queue in parallel while avoiding sequential bottlenecks when atomically accessing queue control variables
Solution 4: Use hybrid local/global atomic operations and local buffers to construct the shared global queue, avoiding sequential bottlenecks on the global queue control variables

Jike Chong, Ekaterina Gonina, Kurt Keutzer
Department of Electrical Engineering and Computer Science, University of California, Berkeley

[email protected], [email protected], [email protected]

Parallel Computing Lab


Automatic Speech Recognition (ASR)

!"#$%&'()*+,-)+%.'/0)12%

!!!"

#$%""&&''"("

!!!"

$)"""''"*"

!!!"

%$%""("''"("

!!!"

''"

&&"

*"

#$%"

$)"%$%"

+,-"

#,-"

.)"

-#/"

!!!"

!!!"

!!!"

!!!"

!!!"

+,-"

#,-"

!!!"

!!!"

#$%"

.)"

!!!"

$)"

%$%"

!!!"

-#/"

!!!"

#00",123451""

%&2*6"02768" %92*3*1:'52*"02768"

;:<9'=""

>'*<3'<6"02768"

?'344:'*"0:@A396"02768""

B29"$*6"%&2*6"CA'A6"

… D"

D"

3,4/51'%6)78)+'+/9%

6)785-+*%%

:,9/;+('%/)%

';(<%7,4/51'%%

()78)+'+/9%

6)785-+*%

0',*</':%957%

)=%;>>%()78)+'+/9%

[Figure: ASR application framework — Voice Input feeds the Speech Feature Extractor; the Inference Engine matches the Speech Features against the Recognition Network (compiled from the Acoustic Model, Pronunciation Model, and Language Model) and emits the Word Sequence, e.g. "I think therefore I am"]

[Figure: Viterbi trellis — speech model states (State 1 … State N) against time (Obs 1 … Obs 4); 1. Forward Pass, 2. Backward Pass; in each step, the engine considers alternative interpretations of the observations]

The engine iterates through the input one time step at a time; in each iteration, it performs the beam-search algorithm.

Inference Engine: Beam Search with Graph Traversal

[Figure: alternative interpretations of the voice input phone sequence "r eh k ax g n ay z s p iy ch" — "Recognize Speech", "Wreck a nice beach", "Reckon eyes peach"]

!"Automatic speech recognition (ASR) allows multimedia content to be transcribed from acoustic waveforms to word sequences

!"This is a challenging task as there can be exponentially many ways to interpret an utterance (a sequence of phones) into words

!"ASR uses the hidden Markov model (HMM)

•" States are hidden because phones are indirectly observed through the waveform

•" Must infer the most likely interpretation while taking the language model into account

!"The Viterbi algorithm is used to infer the most likely interpretation of the observed waveform

•" It has a forward pass and a

backward pass

!"The forward pass has two phases of execution

•" Phase 1 evaluates the observation

probability, which matches the observation to known speech model

states (dashed arrows)

•" Phase 2 references historic information and evaluates the

likelihood of going from one time step to the next (solid arrows)
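The two phases correspond to the two factors of the standard Viterbi forward recurrence, written here in the notation of the figure legend below:

$$ m[t][s_t] \;=\; P(x_t \mid s_t) \cdot \max_{s_{t-1}} \Big( P(s_t \mid s_{t-1}) \cdot m[t-1][s_{t-1}] \Big) $$

Phase 1 computes the observation probability $P(x_t \mid s_t)$; Phase 2 performs the max-reduction over the incoming transitions weighted by $P(s_t \mid s_{t-1})$.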

[Figure: software architecture of the inference engine — the CPU handles Read Files, Initialize data structures, Backtrack, and Output Results; the manycore GPU runs, under Iteration Control, Phase 0 (Prepare ActiveSet), Phase 1 (Compute Observation Probability), Phase 2 (Graph Traversal), Collect Backtrack Info, and Save Backtrack Log, reading and writing the Active Set, LM, HMM, and Backtrack Table]

Alternative approach (hash insertion):
- Hash write (0.030)
- Duplicate removal: unique-index prefix-scan (0.020), unique-list gathering (0.005)
- Real Time Factor: 0.055

Traditional approach (list sorting):
- Sort (0.310)
- Duplicate removal: cluster-boundary detection (0.007), unique-index prefix-scan (0.025), unique-list gathering (0.007)
- Real Time Factor: 0.349

!" In the forward pass of the Viterbi algorithm, there are 1,000s to 10,000s of

concurrent tasks that represent the most likely alternative interpretations of the

input being tracked

!"To track these alternative interpretations, one has to reference a selected subset of

data from the WFST Recognition Network with a sparse irregular graph structure

!"The concurrent access of irregular data structure requires “uncoalesced” memory

accesses in the middle of important algorithm steps, which degrades performance

!" Instantiate a Phase 0 in the

implementation to gather all

operands necessary for the current

time step of the algorithm

!"Caching them in a memory-

coalesced runtime data structure

allows any uncoalesced accesses

to happen only once for the full time

step
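A minimal CUDA sketch of such a Phase 0 gather step; the arc layout and all names here (ArcData, gatherOperands, activeArcIdx) are illustrative assumptions, not the poster's actual data structures:

```cuda
// Illustrative Phase 0 gather kernel (assumed data layout).
struct ArcData { int dest; float weight; int label; };

__global__ void gatherOperands(const int *activeArcIdx,  // arcs needed this time step
                               int numActiveArcs,
                               const ArcData *wfstArcs,   // sparse, irregular WFST storage
                               ArcData *coalescedArcs)    // dense per-time-step buffer
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numActiveArcs) {
        // The irregular (uncoalesced) read happens exactly once here; Phase 1
        // and Phase 2 then read coalescedArcs with fully coalesced accesses.
        coalescedArcs[i] = wfstArcs[activeArcIdx[i]];
    }
}
```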

The figure illustrates the various components of a speech model that are compiled together off-line with Weighted Finite State Transducer (WFST) techniques to form a flat probabilistic finite state machine. The network often contains millions of states and tens of millions of arcs. At runtime, only a small (<1%) subset of the recognition network is referenced at any one time step.

!" In a recognition network, there are millions

of states, each labeled with one of ~3,000

tied triphone labels

!" In a typical recognition sequence, only

20% of the triphone states are used for

observation probability computation

!"During traversal many of the states being

tracked have duplicate labels

!" In a sequential execution, memoization is

often used to avoid redundant

computation

!"What do we do for data-parallel platforms?

[Figure legend: an observation; a state; a pruned state; $P(x_t \mid s_t)$; $P(s_t \mid s_{t-1})$; $m[t-1][s_{t-1}]$; $m[t][s_t]$]

Model size for a WFST language model: 4 million states, 10 million arcs, 100 observations/sec; average number of active states per time step: 10,000 – 20,000

[Chart: Real Time Factor for observation probability computation — all encountered labels: 1.9; all unique labels: 0.709; encountered unique labels: 0.146]

Real Time Factor is the number of seconds required to process one second of input data.

!"The traditional approach for finding unique

elements in a list involves

•" Sorting the list

•" Detecting consecutive identical elements

•" Scanning to identify unique element indices

•" Copying out unique elements

!"Sorting is an computationally expensive

step on highly parallel platforms

!"For scenarios where the number of

possible labels is within 100,000, one can

create a hash table of all possible labels

!"We leverage the semantics of conflicting

non-atomic write of the GPU to use the hash table as a flag array:

•" CUDA guarantees at least one conflicting write to a device memory location to be

successful, which is enough to build a flag array

!"The alternative “Hash Insertion” step greatly simplifies the find-unique operation

!"During graph traversal, active states are being

processed by parallel threads on different cores

!"Write-conflicts frequently arise when threads

are trying to update the same destination states

!"To further complicate things, in statistical inference,

we would like to only keep the most likely result

!"Efficiently resolving these write conflicts while keeping

just the most likely result for each state is essential for achieving good performance

[Figure: a section of a Weighted Finite State Transducer network, traversed by Thread 0, Thread 1, and Thread 2 — legend: ε arc; non-ε arc; active state (contains a token); inactive state (contains no token)]

!"CUDA offers atomic operations with various

flavors of arithmetic operations

!"The “atomicMax” operation is ideal for

statistical inference on GPU

!"By using it in all threads, the final result in each atomically accessed memory

location will be the maximum of all results that was attempted to be written to that

memory location

!"This type of access is lock-free from the software perspective, as the write-conflict

resolution is performed by hardware

!"Atomically writing results in to a memory location is a process of reduction

!"Hence, this process is performing a conflict-free reduction

[Figure: Thread 0 and Thread 1 both call atomicMax(address, val) on the same memory location; the hardware resolves the conflict so the location ends up holding the maximum]
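CUDA's atomicMax operates on integer types, so applying it to floating-point likelihoods needs a small detour; one common workaround (an assumption here, not necessarily the authors' exact method) maps each float to an unsigned integer whose unsigned ordering matches the float ordering:

```cuda
// Map a finite float to an unsigned int whose unsigned ordering matches the
// float's numeric ordering (a standard bit-twiddling trick).
__device__ unsigned int orderedUint(float f)
{
    unsigned int u = __float_as_uint(f);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Conflict-free max-reduction: every thread holding a candidate likelihood
// for destination state dst calls this; the hardware keeps only the maximum.
__device__ void updateState(unsigned int *stateScores, int dst, float likelihood)
{
    atomicMax(&stateScores[dst], orderedUint(likelihood));
}
```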

[Figure: left — threads T0 … Tn on Multiprocessor 0 through Multiprocessor m all contend for the Global Task Queue's head pointer and size; right — each multiprocessor first fills its own Local Task Queue (with a local head pointer and size) and then commits it to the Global Task Queue in a batch]

!"When many threads are trying to

insert tasks into a global task queue,

significant serialization occurs at the

point of the queue control variables

!"By using hybrid global/local queues,

we can eliminate the single point of

serialization

!"Each multiprocessor can build up

its local queue using local atomic

operations, which have much lower

latency than the global atomic operations

!"The writes to the shared global queue

are performed in one batch process, and

thus are significantly more efficient
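A minimal CUDA sketch of the hybrid scheme; the capacity, kernel name, and the computeSuccessor helper are hypothetical:

```cuda
#define LOCAL_QUEUE_CAPACITY 512             // assumed per-block capacity

__device__ int computeSuccessor(int state);  // hypothetical expansion step

__global__ void enqueueTasks(const int *activeStates, int numActive,
                             int *globalQueue, int *globalQueueSize)
{
    __shared__ int localQueue[LOCAL_QUEUE_CAPACITY];  // per-multiprocessor buffer
    __shared__ int localCount;
    __shared__ int globalBase;

    if (threadIdx.x == 0) localCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numActive) {
        int task = computeSuccessor(activeStates[i]);
        if (task >= 0) {
            // Low-latency shared-memory atomic: no contention on global state.
            int pos = atomicAdd(&localCount, 1);
            if (pos < LOCAL_QUEUE_CAPACITY)           // overflow handling omitted
                localQueue[pos] = task;
        }
    }
    __syncthreads();

    // One global atomic per block reserves a contiguous slice of the queue.
    if (threadIdx.x == 0)
        globalBase = atomicAdd(globalQueueSize, localCount);
    __syncthreads();

    // Batched, coalesced copy of the local queue into the global queue.
    for (int j = threadIdx.x; j < localCount; j += blockDim.x)
        globalQueue[globalBase + j] = localQueue[j];
}
```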

!"An ASR application extracts features from a waveform, compares them to the recognition network, and infers the most likely word sequence

!"The recognition network is compiled off-line from a variety of knowledge sources and trained using powerful statistical learning techniques

!"The inference process traverses a graph-based recognition network using the Viterbi algorithm

!"This architecture is modular and flexible:

•" It can be adapted to recognize different languages by swapping in different recognition networks and different speech feature extractors, the inference engine would remain unchanged

!"The fine-grained concurrency in ASR lies in the evaluation of each algorithmic step in Phase 1 and Phase 2:

•" The algorithm typically tracks

10,000 – 20,000 alternative interpretations (or active states) at

the same time

!"The various components of the recognition network can be composed using Weighted Finite State Transducer (WFST) techniques

•" Models typically contain millions of

states and tens of millions of arcs

!"The software architecture of the inference process is defined above:

•" There is an iterative outer loop that

examines one input observation (corresponding to a 10ms time

step) at a time

•" The two phases of execution

dominate each iteration

•" Concurrency lies in each algorithm steps in the two phases

!"The software architecture presents significant challenges when implemented on the GPU, see below for details

This research is supported in part by an Intel Ph.D. Fellowship. This research is also supported in part by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227).

!"!# !"$# %"!# %"$# &"!# &"$# '"!# '"$#

()*+),-./#

0.,1234)#

5&"67#839:+;)#<,;),=>?)##

%6"'7#8399+,>2.-3,#<,;),=>?)##

@A"!7#839:+;)#<,;),=>?)##

$%"!7#8399+,>2.-3,#<,;),=>?)##

B)23C>,D#E>9)#:)4#

()23,C#3F#(:))2G##%!"$H#

I=)2J#!"The speech model is taken from the SRI CALO real time meeting recognition system

!"The acoustic model includes 52K triphone states clustered into 2,613 mixtures of 128 Gaussian components

!"The pronunciation model contains 59K words with 80k pronunciations

!"A small back-off bigram language model with 167k bigram transitions was used

!"Results presented are based on:

•" Manycore: GTX280 GPU, 1.296GHz, 1GB

•" Sequential: Core i7 920, 2.66GHz, 6GB

!"The accuracy achieved for CPU and GPU implementations are identical

Key Lessons

- Challenge 1: Having the freedom to improve the data layout of the runtime data structures is crucial to effectively exploiting the fine-grained concurrency in ASR
- Challenge 2: An effective sequential algorithm often cannot be directly translated into a parallel algorithm; e.g., memoization has no equally efficient parallel form, and the sort-and-filter approach for finding unique elements in a list has to be dramatically modified to execute efficiently on a GPU
- Challenge 3: Hardware atomic operation support is extremely important for highly parallel application development; the various flavors of atomic instructions with arithmetic and logic operations enable highly efficient implementations of statistical inference problems in machine-learning-based applications
- Challenge 4: Local synchronization scope is important to leverage for relieving global synchronization bottlenecks

!"An order of magnitude speed up was achieved as compared to a SIMD optimized sequential implementation running on one core of Core i7 processor

!"The compute intensive phase was accelerated by 17.7x

!"The communication intensive phase was accelerated by 3.7x

•" Synchronization overhead is dominating execution time as we

leverage more computing platform parallelism