
CRIM’S TRANSCRIPTION AND CALL SIGN DETECTION SYSTEM FOR ATC CONVERSATIONS

VISHWA GUPTA, LISE REBOUT, GILLES BOULIANNE, PIERRE-ANDRE MENARD AND JAHANGIR ALAM

OCT 4, 2018

CRIM | 2

TRANSCRIPTION SYSTEM FOR ATC AUDIO

1. ATC TRAINING DATA

Divided into:

• 27149 files for training

• 839 files for development

2. DICTIONARY

• Starting dictionary from: Hub4, RT03, RT04, Market, WSJ, Librispeech, switchboard, Fisher

• Add new words from training transcripts

• Abbreviations (QNH, ILS, etc.) transcribed letter by letter (alphabet transcription)

• New words transcribed using the CMU dictionary, falling back to a phonetizer if necessary

• A total of 190k distinct words
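The letter-by-letter transcription of abbreviations can be sketched as follows. This is a minimal illustration, not the system's actual lexicon code: the function name `spell_out` is hypothetical, and the phone set is assumed to be CMU-dictionary-style ARPAbet letter names without stress markers.

```python
# Letter-name pronunciations in ARPAbet (CMU-dictionary style, stress omitted).
# Assumption: the real lexicon may use a different phone inventory.
LETTER_PHONES = {
    "A": "EY", "B": "B IY", "C": "S IY", "D": "D IY", "E": "IY",
    "F": "EH F", "G": "JH IY", "H": "EY CH", "I": "AY", "J": "JH EY",
    "K": "K EY", "L": "EH L", "M": "EH M", "N": "EH N", "O": "OW",
    "P": "P IY", "Q": "K Y UW", "R": "AA R", "S": "EH S", "T": "T IY",
    "U": "Y UW", "V": "V IY", "W": "D AH B AH L Y UW", "X": "EH K S",
    "Y": "W AY", "Z": "Z IY",
}


def spell_out(abbreviation):
    """Build a pronunciation by concatenating the letter-name phones."""
    return " ".join(LETTER_PHONES[ch] for ch in abbreviation.upper())


for word in ("QNH", "ILS"):
    print(word, spell_out(word))
```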


LANGUAGE MODELS FOR DECODING

1. GENERATED FROM TRAINING TRANSCRIPTS

• 409k words in the training transcript

• RNNLM: Mikolov’s toolkit: 200 classes, hidden layer size of 300

Language model    Perplexity
3-gram            11.5
4-gram            9.9
5-gram            9.4
RNNLM             7.7
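For reference, perplexity is the exponential of the negative mean log-probability that the model assigns to each word, so a perplexity of 7.7 roughly corresponds to a uniform choice among about 7.7 words at each position. A minimal sketch, assuming the per-word probabilities are already available (the function name and the input format are illustrative, not taken from any toolkit):

```python
import math


def perplexity(word_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the N word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# A model that always assigns probability 1/8 behaves like a uniform
# choice among 8 words.
print(perplexity([0.125, 0.125, 0.125, 0.125]))
```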


ACOUSTIC MODELS

1. TRAIN INITIAL MODELS FROM: HUB4 RT03 RT04 MARKET WSJ LIBRISPEECH SWITCHBOARD FISHER DATA

• 40 dimensional MFCC + 100 dimensional ivectors

• Bi-directional LSTM (model BLSTM1)

• LF-MMI (lattice free MMI) TDNN (model TDNN1)

• Adapt model to the ATC training data

• Bi-directional LSTM (model BLSTM2), LF-MMI TDNN (model TDNN2)

• Adapt model to the ATC training data 2nd time

• Bi-directional LSTM (model BLSTM3), LF-MMI TDNN (model TDNN3)

• Train LF-MMI TDNN from scratch on ATC training data (model TDNN4)


ACOUSTIC MODEL COMPARISON USING THE DEVELOPMENT SET

LSTM model                                       WER
Original (BLSTM1)                                41.4%
Adapted (BLSTM2)                                 10.7%
Adapted 2nd time (BLSTM3)                        9.4%
BLSTM3 + 5-g LM                                  9.36%
BLSTM4 from training + leaderboard data          9.3%
BLSTM4 + 5-g + RNNLM                             9.06%
BLSTM4 + LM adapted to training + leaderboard    9.03%

LF-MMI TDNN                                      WER
Original (TDNN1)                                 34.6%
Trained from scratch                             16.2%
Adapted from TDNN1                               13.5%
sMBR training of adapted TDNN (TDNN2)            11.7%
Adapted 2nd time (TDNN3)                         11.35%
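The WER figures above count substitutions, deletions and insertions against the reference transcript, divided by the number of reference words. A minimal sketch of that computation follows; it assumes whitespace tokenisation and a non-empty reference, and it is not the scoring tool used in the evaluation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("clear takeoff three two right",
                      "clear takeoff two right"))
```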


NOISE ROBUST FEATURES

1. TRIED MANY NOISE ROBUST FEATURES

• RMCC, PNCC, RMFCC, MMFCC etc.

• WER significantly higher for training from scratch

• No models already trained from Hub4, RT03, RT04, Libri-Speech, WSJ, Switchboard etc

• Add ATCOSIM data: did not help after ROVER

• Did not try filter-bank features

• Multi-style features did give good WER

• Add aircraft noise, babble noise, music, noise and reverberation

• Training time is longer (x5 training data)

• WER is 9.4% with BLSTM and 5-g LM
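Multi-style training data is typically produced by mixing noise into clean speech at a chosen signal-to-noise ratio. The sketch below shows only that mixing step on plain lists of samples; the function name `mix_at_snr`, the looping of short noise clips, and the SNR values are assumptions, and reverberation is not covered.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to the requested SNR (dB)."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Pick the gain so that speech_power / (gain**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    # Loop the noise clip if it is shorter than the speech.
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]


# Toy signals standing in for an utterance and a noise recording.
rng = random.Random(0)
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```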


FINAL RESULTS FOR EVALUATION DATA

1. ROVER OF 6 CTM FILES FROM 6 DECODES:

• BLSTM adapted 2nd time, 5-g LM, RNNLM

• Multi-style BLSTM adapted 2nd time, 5-g LM, RNNLM

• LF-MMI TDNN adapted 2nd time, 5-g LM, RNNLM

• BLSTM adapted with training + leaderboard data, 5-g LM, RNNLM adapted with leaderboard

• LF-MMI TDNN adapted with training + leaderboard, 5-g LM, RNNLM adapted with leaderboard

• Multi-style BLSTM adapted with training + leaderboard data, 5-g LM, RNNLM

• OUTPUT of ROVER is our 1st submission
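ROVER combines several decodes by aligning them word by word and voting in each aligned slot. The sketch below illustrates only the voting step and is a deliberate simplification: it assumes the hypotheses have already been aligned into slots of equal length (with an empty string marking "no word"), whereas the real tool builds the alignment itself and can also weight votes by confidence.

```python
from collections import Counter

NULL = ""  # placeholder for "no word" in an alignment slot


def vote(aligned_hyps):
    """Pick the most frequent entry in each slot; drop slots won by NULL."""
    output = []
    for slot in zip(*aligned_hyps):
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:
            output.append(word)
    return output


hyps = [
    ["easy", "two", "six", "one", "quebec"],
    ["easy", "two", "six", "nine", "quebec"],
    ["easy", "too", "six", "one", NULL],
]
print(" ".join(vote(hyps)))
```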


OUR 2ND AND 3RD SUBMISSIONS

1. 2ND SUBMISSION

• Add evaluation set with ROVER text to the training + leaderboard data

• Train Bi-directional LSTM again from this combined data

• Decode with BLSTM + 5-g LM + RNNLM adapted with leaderboard

2. 3RD SUBMISSION

• Adapt above BLSTM to evaluation data only using a small learning rate

• Decode with BLSTM + 5-g LM + RNNLM adapted with leaderboard

• WER on EVALUATION SET: 9.41%

• (BEST RESULTS AFTER ROVER: DEV SET 8.45%, LEADERBOARD 9.98%)


CALL SIGN DETECTION

1. USE N-GRAM FOR CALL SIGN DETECTION

• Training CSV row (ID, call sign, transcript): StHC9fqp7GxR6C8y, Easy two six one quebec, clear takeoff three two right Easy two six one Quebec

• Transcript with the call sign marked: clear takeoff three two right <cs> Easy two six one Quebec <\cs>

• Train 5-gram LM.

• Use the 5-gram LM to find the beginning and end of the call sign

• Best result on leaderboard: 0.7957
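Preparing the marked transcripts for LM training could look like the sketch below. It assumes the CSV fields are ID, call sign and transcript, that matching is case-insensitive, and that only the first occurrence is marked; the function name `tag_call_sign` is hypothetical, and the tags reproduce the notation on the slide.

```python
def tag_call_sign(call_sign, transcript, open_tag="<cs>", close_tag="<\\cs>"):
    """Wrap the first case-insensitive occurrence of the call sign in tags."""
    cs = call_sign.lower().split()
    words = transcript.split()
    lowered = [w.lower() for w in words]
    for start in range(len(words) - len(cs) + 1):
        if cs and lowered[start:start + len(cs)] == cs:
            end = start + len(cs)
            return " ".join(words[:start] + [open_tag] + words[start:end]
                            + [close_tag] + words[end:])
    return transcript  # call sign not found: leave the transcript unmarked


row = ("StHC9fqp7GxR6C8y,Easy two six one quebec, "
       "clear takeoff three two right Easy two six one Quebec")
# Split on the first two commas only, so commas in the transcript survive.
utt_id, call_sign, transcript = [f.strip() for f in row.split(",", 2)]
print(tag_call_sign(call_sign, transcript))
```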

Call sign detection using RNN - CRF

Address Call sign detection as a sequence labeling problem:

● Training data : the 28045 training examples of training_dataset.csv containing 20438 call signs

● Transform data to use the BIO tagset (Begin - In - Out)

Example (ID, call sign, transcript):
StHC9fqp7GxR6C8y, Easy two six one quebec, clear takeoff three two right Easy two six one quebec

clear takeoff three two right Easy two six one quebec
O     O       O     O   O     B    I   I   I   I
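The BIO encoding, and the reverse step of reading call sign spans back from predicted tags, can be sketched as follows. The helper names are hypothetical, and spans are given as token index ranges with an exclusive end.

```python
def bio_tags(num_tokens, start, end):
    """Tags for a sentence whose call sign covers tokens[start:end]."""
    tags = ["O"] * num_tokens
    for i in range(start, end):
        tags[i] = "B" if i == start else "I"
    return tags


def spans_from_bio(tags):
    """Recover (start, end) spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:  # "B" or "O" closes an open span
            spans.append((start, i))
            start = None
        if tag == "B":
            start = i
    if start is not None:  # span running to the end of the sentence
        spans.append((start, len(tags)))
    return spans


tokens = "clear takeoff three two right Easy two six one quebec".split()
tags = bio_tags(len(tokens), 5, 10)
print(" ".join(tags))
print(spans_from_bio(tags))
```

An utterance without a call sign is simply encoded with `start == end`, which yields all `O` tags.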

Call sign detection: RNN - CRF

[Figure: word-level network. Word embeddings (200 dim) for tokens such as "right", "Easy", ... feed LSTM blocks, followed by a CRF classifier that outputs the tags (O, O, B, ...).]

- Two Bi-LSTMs: one at character level (50 dim), one at word level (200 dim)
- Pretrained word embeddings: GloVe (200 dim)
- Classifier: Conditional Random Field (CRF) that considers not only the word and its context but also the previous tag
- Loss function: difference for a sentence between the gold tags and the predicted tags

Training details:
- Using PyTorch
- Validation set to do early stopping
- Dropout between the two LSTMs (with dropout probability 0.25)
- Gradient clipping for regularization
- Adam optimizer for gradient descent

Results:
- On validation set: 0.96
- Best result on leaderboard: 0.8207
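The benefit of a CRF layer that also looks at the previous tag shows up at decoding time: the best tag sequence is found jointly with Viterbi search rather than by picking the best tag per word. The sketch below is a plain-Python illustration with made-up scores, not the PyTorch model described above; start and stop transitions are omitted for brevity.

```python
TAGS = ["B", "I", "O"]


def viterbi(emissions, transitions, tags=TAGS):
    """Best tag path for a linear-chain CRF.

    emissions[t][tag]      : score of `tag` at position t (e.g. from a Bi-LSTM)
    transitions[prev][tag] : score of moving from `prev` to `tag`
    """
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][tag])
            scores[tag] = (best[-1][prev] + transitions[prev][tag]
                           + emissions[t][tag])
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores: per-word argmax would give O I I, which is not valid BIO.
emissions = [
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 1.0, "I": 1.5, "O": 0.0},
    {"B": 0.0, "I": 2.0, "O": 0.0},
]
transitions = {p: {t: 0.0 for t in TAGS} for p in TAGS}
transitions["O"]["I"] = -10.0  # an "I" should not directly follow an "O"
print(viterbi(emissions, transitions))
```

With the penalty on the O-to-I transition, the decoder prefers opening the span with `B` even though `I` has the higher emission score on the second word.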

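[Figure: character-level network. Character embeddings (25 dim) for the letters of a word (e.g. "t w o") feed a Bi-LSTM followed by a fully-connected layer; the figure also carries a "100 dim" label.]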

Call sign detection - Post processing

- Restrict call signs and possible ambiguities
  - To match possible call signs encountered for each airline
  - To eliminate airport-related references (tower, radio frequency, etc.)

- Pattern generalisation from training data
  - Generalize digits and letters into generic placeholders (9 for a digit, Z for a letter)
  - Airline-based patterns, e.g. “Airbus zero two india quebec” >> “Airbus 9 9 Z Z”
  - Airport-based patterns, e.g. “tower two one three” >> “tower 9 9 9”

- Pattern-based reduction of the neural network output
  - Remove detected call signs outside the airline patterns
  - Remove call signs overlapping airport patterns
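The pattern generalisation and filtering can be sketched as below. The word lists, function names and the exact-match filter are assumptions made for illustration: in particular, the "overlap" test against airport patterns is simplified here to a whole-candidate pattern match.

```python
DIGITS = {"zero", "one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine", "niner"}
LETTERS = {"alfa", "alpha", "bravo", "charlie", "delta", "echo", "foxtrot",
           "golf", "hotel", "india", "juliett", "juliet", "kilo", "lima",
           "mike", "november", "oscar", "papa", "quebec", "romeo", "sierra",
           "tango", "uniform", "victor", "whiskey", "xray", "x-ray",
           "yankee", "zulu"}


def generalise(phrase):
    """Keep the first word (airline or airport keyword), abstract the rest."""
    head, *rest = phrase.split()
    out = [head]
    for word in rest:
        w = word.lower()
        out.append("9" if w in DIGITS else "Z" if w in LETTERS else word)
    return " ".join(out)


def filter_candidates(candidates, airline_patterns, airport_patterns):
    """Keep candidates that match an airline pattern and no airport pattern."""
    kept = []
    for cand in candidates:
        pattern = generalise(cand)
        if pattern in airline_patterns and pattern not in airport_patterns:
            kept.append(cand)
    return kept


airline_patterns = {generalise("Airbus zero two india quebec")}
airport_patterns = {generalise("tower two one three")}
print(filter_candidates(["Airbus one four bravo kilo", "tower one two three"],
                        airline_patterns, airport_patterns))
```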

Call sign detection - Post processing

[Figure: processing pipeline. Audio test data -> Transcription -> Transcribed text -> RNN-CRF -> Candidate call signs -> Post processing -> Final call signs. Call sign patterns and airport patterns are derived from the textual training data (csv files). The figure numbers its steps 1 to 3.]

Pattern examples:
- Airbus zero two india quebec ==> Airbus 9 9 Z Z
- tower two one three ==> tower 9 9 9

Post-processing examples:
- Filter tagged call signs (marked "x" in the figure): "left via mike four Easy four zero one niner tower one two three expediting", pattern: Easy 9 9 9 9
- Tag missed call signs (marked "+" in the figure): "two bravo, Easy two zero two bravo requesting for the descent", pattern: Easy 9 9 9 Z


CALL SIGN DETECTION RESULTS

DATA          ALGORITHM                                          F1
Leaderboard   5-gram                                             0.7957
Leaderboard   RNN-CRF                                            0.8207
Leaderboard   RNN-CRF with majority vote                         0.8289
EVALUATION    RNN-CRF with majority vote + post processing(?)    0.8017
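F1 is the harmonic mean of precision and recall. The sketch below computes it from detection counts; how the challenge matched predicted call signs against the reference is not stated here, so the counts in the example are purely hypothetical.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, or 0.0 with no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the evaluation.
print(f1_score(true_positives=80, false_positives=20, false_negatives=20))
```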


CONCLUSIONS

1. ADAPTING ACOUSTIC MODELS TRAINED FROM LARGE DATA TO ATC DATA GAVE OVER 2% ABSOLUTE REDUCTION IN WER.

2. REPEATED ADAPTATION ALSO REDUCED WER.

3. ADAPTED BLSTM GAVE LOWER WER THAN ADAPTED LF-MMI TRAINED TDNN.

4. ROVER GAVE SIGNIFICANT REDUCTION IN WER.

5. CALL SIGN DETECTION USING RNN-CRF GAVE THE BEST RESULTS

6. MAJORITY VOTE OVER CALL SIGNS IMPROVED F1 SCORE

7. OVERALL WE RANKED 3RD IN THE EVALUATION


CRIM, AN IT EXPERTISE CENTRE SERVING BUSINESSES AND ORGANIZATIONS FOR MORE THAN 30 YEARS!


MISSION: CRIM is an applied research and expertise centre in information technology that makes organizations more effective and competitive through the development of innovative technologies and the transfer of leading-edge know-how, while contributing to scientific advancement.
