CRIM’S TRANSCRIPTION AND CALL SIGN DETECTION SYSTEM FOR ATC CONVERSATIONS
VISHWA GUPTA, LISE REBOUT, GILLES BOULIANNE, PIERRE-ANDRE MENARD AND JAHANGIR ALAM
OCT 4, 2018
CRIM | 2
TRANSCRIPTION SYSTEM FOR ATC AUDIO
1. ATC TRAINING DATA
Divided into:
• 27149 files for training
• 839 files for development
2. DICTIONARY
• Starting dictionary from: Hub4, RT03, RT04, Market, WSJ, LibriSpeech, Switchboard, Fisher
• Add new words from training transcripts
• Abbreviations (QNH, ILS, etc.) transcribed letter by letter, using the pronunciation of each letter
• New words transcribed using the CMU dictionary, followed by a phonetizer (if necessary)
• A total of 190k distinct words
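The letter-by-letter transcription of abbreviations can be sketched as follows. This is a minimal illustration, not the CRIM implementation: the `LETTER_PHONES` table and the `spell_abbreviation` helper are hypothetical names, and the ARPAbet letter pronunciations are written out here by hand with stress markers omitted.

```python
# Letter pronunciations in ARPAbet (stress markers omitted).
# Assumption: written out for illustration, not taken from the CRIM dictionary.
LETTER_PHONES = {
    "A": "EY", "B": "B IY", "C": "S IY", "D": "D IY", "E": "IY",
    "F": "EH F", "G": "JH IY", "H": "EY CH", "I": "AY", "J": "JH EY",
    "K": "K EY", "L": "EH L", "M": "EH M", "N": "EH N", "O": "OW",
    "P": "P IY", "Q": "K Y UW", "R": "AA R", "S": "EH S", "T": "T IY",
    "U": "Y UW", "V": "V IY", "W": "D AH B AH L Y UW", "X": "EH K S",
    "Y": "W AY", "Z": "Z IY",
}


def spell_abbreviation(word):
    """Return a pronunciation that reads `word` one letter at a time."""
    return " ".join(LETTER_PHONES[ch] for ch in word.upper())


print(spell_abbreviation("QNH"))  # K Y UW EH N EY CH
print(spell_abbreviation("ILS"))  # AY EH L EH S
```

A word containing anything other than the 26 letters raises `KeyError`, which is a simple way to route such words to the CMU dictionary or phonetizer path instead.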
LANGUAGE MODELS FOR DECODING
1. GENERATED FROM TRAINING TRANSCRIPTS
• 409k words in the training transcripts
• RNNLM (Mikolov’s toolkit): 200 classes, hidden layer size of 300
Language model    Perplexity
3-gram            11.5
4-gram            9.9
5-gram            9.4
RNNLM             7.7
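For readers unfamiliar with the measure, perplexity is the exponential of the average negative log probability that the model assigns to each word. The sketch below assumes the per-word probabilities are already available; the `perplexity` function is a hypothetical helper, not part of any toolkit named on the slides.

```python
import math


def perplexity(word_probs):
    """Perplexity of a word sequence, given the model probability of each word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# A model that gives every word probability 0.1 is as uncertain as a
# uniform choice among 10 words, so its perplexity is about 10.
print(perplexity([0.1] * 5))
```

Read this way, a perplexity of 7.7 means the RNNLM is on average about as uncertain as a uniform choice among roughly 8 words, which reflects how constrained ATC phraseology is.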
ACOUSTIC MODELS
1. TRAIN INITIAL MODELS FROM: HUB4, RT03, RT04, MARKET, WSJ, LIBRISPEECH, SWITCHBOARD, FISHER DATA
• 40 dimensional MFCC + 100 dimensional i-vectors
• Bi-directional LSTM (model BLSTM1)
• LF-MMI (lattice-free MMI) TDNN (model TDNN1)
• Adapt model to the ATC training data
  • Bi-directional LSTM (model BLSTM2), LF-MMI TDNN (model TDNN2)
• Adapt model to the ATC training data a 2nd time
  • Bi-directional LSTM (model BLSTM3), LF-MMI TDNN (model TDNN3)
• Train LF-MMI TDNN from scratch on ATC training data (model TDNN4)
ACOUSTIC MODEL COMPARISON USING THE DEVELOPMENT SET
LSTM model                                       WER
Original (BLSTM1)                                41.4%
Adapted (BLSTM2)                                 10.7%
Adapted 2nd time (BLSTM3)                        9.4%
BLSTM3 + 5-g LM                                  9.36%
BLSTM4 from training + leaderboard data          9.3%
BLSTM4 + 5-g + RNNLM                             9.06%
BLSTM4 + LM adapted to training + leaderboard    9.03%

LF-MMI TDNN                                      WER
Original (TDNN1)                                 34.6%
Trained from scratch                             16.2%
Adapted from TDNN1                               13.5%
sMBR training of adapted TDNN (TDNN2)            11.7%
Adapted 2nd time (TDNN3)                         11.35%
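The slides do not say which scoring tool produced these numbers. Assuming the usual definition of word error rate (substitutions, deletions and insertions divided by the number of reference words), a minimal sketch looks like this; `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of five reference words.
print(word_error_rate("clear takeoff three two right",
                      "clear takeoff three right"))  # 0.2
```

The reference must contain at least one word, otherwise the division fails.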
NOISE ROBUST FEATURES
1. TRIED MANY NOISE ROBUST FEATURES
• RMCC, PNCC, RMFCC, MMFCC etc.
• WER significantly higher when training from scratch with these features
• No models with these features already trained from Hub4, RT03, RT04, LibriSpeech, WSJ, Switchboard, etc.
• Add ATCOSIM data: did not help after ROVER
• Did not try filter-bank features
• Multi-style features did give good WER
• Add aircraft noise, babble noise, music, noise and reverberation
• Training time is longer (x5 training data)
• WER is 9.4% with BLSTM and 5-g LM
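Multi-style training mixes noise into clean recordings to create extra training copies. The sketch below shows one common way to add noise at a chosen signal-to-noise ratio, using plain Python lists of samples. The function names and the SNR-based scaling are assumptions for illustration; the slides do not describe how the noise was mixed.

```python
import math


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so that the mix has the requested SNR in dB."""
    # Repeat the noise if it is shorter than the speech, then trim to length.
    reps = len(speech) // len(noise) + 1
    noise = (noise * reps)[:len(speech)]
    # SNR (dB) = 10 * log10(speech power / noise power), solved for noise power.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [((i % 20) - 10) / 10 for i in range(1000)]   # toy "speech" samples
noise = [0.5 if i % 2 == 0 else -0.5 for i in range(300)]  # toy noise samples
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Generating one noisy copy per noise type in addition to the clean data multiplies the amount of training audio, which is consistent with the longer training time noted above.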
FINAL RESULTS FOR EVALUATION DATA
1. ROVER OF 6 CTM FILES FROM 6 DECODES:
• BLSTM adapted 2nd time, 5-g LM, RNNLM
• Multi-style BLSTM adapted 2nd time, 5-g LM, RNNLM
• LF-MMI TDNN adapted 2nd time, 5-g LM, RNNLM
• BLSTM adapted with training + leaderboard data, 5-g LM, RNNLM adapted with leaderboard
• LF-MMI TDNN adapted with training + leaderboard, 5-g LM, RNNLM adapted with leaderboard
• Multi-style BLSTM adapted with training + leaderboard data, 5-g LM, RNNLM
• OUTPUT of ROVER is our 1st submission
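The voting step at the heart of ROVER can be illustrated with a simplified sketch. It assumes the hypotheses have already been aligned slot by slot, with a placeholder for missing words; the real ROVER tool also performs that alignment itself and can weigh confidence scores, neither of which is shown here. The `NULL` marker and the `vote` function are hypothetical names.

```python
from collections import Counter

NULL = "@"  # hypothetical marker for "no word" in an aligned slot


def vote(aligned_hyps):
    """Keep the most frequent word in each aligned slot; drop slots won by NULL."""
    output = []
    for slot in zip(*aligned_hyps):
        # On a tie, Counter keeps the word from the earliest-listed system.
        word, _ = Counter(slot).most_common(1)[0]
        if word != NULL:
            output.append(word)
    return output


hyps = [
    "clear takeoff three two right".split(),
    "clear takeoff tree two right".split(),
    "clear takeoff three two @".split(),
]
print(" ".join(vote(hyps)))  # clear takeoff three two right
```

Errors made by a single system ("tree", the dropped "right") are outvoted by the other decodes, which is why combining several different models can lower WER.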
OUR 2ND AND 3RD SUBMISSIONS
1. 2ND SUBMISSION
• Add evaluation set with ROVER text to the training + leaderboard data
• Train Bi-directional LSTM again from this combined data
• Decode with BLSTM + 5-g LM + RNNLM adapted with leaderboard
2. 3RD SUBMISSION
• Adapt above BLSTM to evaluation data only using a small learning rate
• Decode with BLSTM + 5-g LM + RNNLM adapted with leaderboard
• WER on EVALUATION SET: 9.41%
• (BEST RESULTS AFTER ROVER: 8.45% on DEV SET, 9.98% on LEADERBOARD)
CALL SIGN DETECTION
1. USE N-GRAM FOR CALL SIGN DETECTION
• Training line (id, call sign, transcript): StHC9fqp7GxR6C8y,Easy two six one quebec, clear takeoff three two right Easy two six one Quebec
• Tagged text for LM training: clear takeoff three two right <cs> Easy two six one Quebec <\cs>
• Train 5-gram LM on the tagged text.
• Use the 5-gram LM to find the beginning and end of the call sign
• Best result on leaderboard: 0.7957
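Preparing the tagged LM training text from the csv fields can be sketched as below. This covers only the data preparation step; how the 5-gram LM is then used to place the tags at test time is not detailed on the slide. The `tag_call_sign` helper and its case-insensitive matching are assumptions for illustration.

```python
def tag_call_sign(transcript, call_sign, open_tag="<cs>", close_tag="<\\cs>"):
    """Wrap the first occurrence of `call_sign` in `transcript` with tags."""
    words = transcript.split()
    lowered = [w.lower() for w in words]
    target = [w.lower() for w in call_sign.split()]
    n = len(target)
    if n == 0:
        return transcript
    for start in range(len(words) - n + 1):
        if lowered[start:start + n] == target:
            tagged = (words[:start] + [open_tag] + words[start:start + n]
                      + [close_tag] + words[start + n:])
            return " ".join(tagged)
    return transcript  # call sign not found: leave the line untagged


print(tag_call_sign("clear takeoff three two right Easy two six one Quebec",
                    "Easy two six one quebec"))
```

The match ignores case because the call sign field and the transcript differ in capitalisation in the example above ("quebec" vs "Quebec").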
Call sign detection using RNN-CRF
Address call sign detection as a sequence labeling problem:
● Training data: the 28045 training examples of training_dataset.csv containing 20438 call signs
● Transform data to use the BIO tagset (Begin - In - Out)
StHC9fqp7GxR6C8y , Easy two six one quebec , clear takeoff three two right Easy two six one quebec

clear  takeoff  three  two  right  Easy  two  six  one  quebec
O      O        O      O    O      B     I    I    I    I
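The conversion to BIO tags can be sketched as follows, assuming the call sign appears as a contiguous word sequence in the transcript. The `bio_tags` helper is a hypothetical name, and it tags only the first occurrence.

```python
def bio_tags(transcript, call_sign):
    """Tag each transcript word as B (begin), I (inside) or O (outside) the call sign."""
    words = [w.lower() for w in transcript.split()]
    target = [w.lower() for w in call_sign.split()]
    tags = ["O"] * len(words)
    n = len(target)
    if n == 0:
        return tags
    for start in range(len(words) - n + 1):
        if words[start:start + n] == target:
            tags[start] = "B"
            tags[start + 1:start + n] = ["I"] * (n - 1)
            break  # tag the first occurrence only
    return tags


tags = bio_tags("clear takeoff three two right Easy two six one quebec",
                "Easy two six one quebec")
print(" ".join(tags))  # O O O O O B I I I I
```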
Call sign detection: RNN-CRF
[Figure: RNN-CRF architecture. Labels: character input ("r i g h t", "E a s y"), word embedding (200 dim), stacked LSTM layers, CRF classifier, output tags (O O B).]
- Two Bi-LSTMs: one at character level (50 dim), one at word level (200 dim)
- Pretrained word embeddings: GloVe (200 dim)
- Classifier: Conditional Random Field (CRF) that considers not only the word and its context but also the previous tag
- Loss function: difference for a sentence between the gold tags and the predicted tags
Training details:
- Using PyTorch
- Validation set to do early stopping
- Dropout between the two LSTMs (with dropout probability 0.25)
- Gradient clipping for regularization
- Adam optimizer for gradient descent
Results:
- On validation set: 0.96
- Best result on leaderboard: 0.8207
[Figure: character-level encoder. Labels: character input ("t w o"), character embedding (25 dim), Bi-LSTM, fully-connected layer, 100 dim.]
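The role of the CRF layer, which scores a tag using the previous tag as well as the word, can be illustrated with a small Viterbi decoder. This is a plain-Python sketch with made-up scores, not the PyTorch model from the slides: in the real system the per-word scores would come from the Bi-LSTMs and the transition scores would be learned.

```python
TAGS = ["B", "I", "O"]


def viterbi(emissions, transitions, tags=TAGS):
    """Best tag sequence under per-word scores plus previous-tag scores.

    emissions: one dict per word, mapping tag -> score
    transitions: dict mapping (previous_tag, tag) -> score
    """
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (emissions[0][t], [t]) for t in tags}
    for scores in emissions[1:]:
        new_best = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p][0] + transitions[(p, t)])
            total = best[prev][0] + transitions[(prev, t)] + scores[t]
            new_best[t] = (total, best[prev][1] + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Made-up scores for three words. Taken word by word, the best tags
# would be O I I, which is not a valid BIO sequence.
emissions = [
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 1.0, "I": 1.5, "O": 0.0},
    {"B": 0.0, "I": 2.0, "O": 0.0},
]
transitions = {(p, t): 0.0 for p in TAGS for t in TAGS}
transitions[("O", "I")] = -100.0  # "I" may not follow "O"

print(viterbi(emissions, transitions))  # ['O', 'B', 'I']
```

The transition penalty turns the invalid sequence O I I into O B I, which is the kind of correction that per-word classification alone cannot make.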
Call sign detection - Post processing
- Restrict call signs and possible ambiguities
  - To match possible call signs encountered for each airline
  - To eliminate airport-related references (tower, radio frequency, etc.)
- Pattern generalisation from training data
  - Generalize digits and letters into generic placeholders (9 for a digit, Z for a letter)
  - Airline-based patterns, e.g. “Airbus zero two india quebec” >> “Airbus 9 9 Z Z”
  - Airport-based patterns, e.g. “tower two one three” >> “tower 9 9 9”
- Pattern-based reduction of the neural network output
  - Remove detected call signs outside the airline patterns
  - Remove call signs overlapping airport patterns
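The pattern generalisation and filtering steps can be sketched as follows. The word lists, the `generalize` and `filter_candidates` helpers, and the exact-match filtering rule are assumptions for illustration; in particular, the "overlapping" test is simplified here to an exact pattern match.

```python
# Spoken digits and phonetic-alphabet letters (assumed vocabulary).
DIGIT_WORDS = {"zero", "one", "two", "three", "four", "five", "six",
               "seven", "eight", "nine", "niner"}
LETTER_WORDS = {"alfa", "alpha", "bravo", "charlie", "delta", "echo",
                "foxtrot", "golf", "hotel", "india", "juliett", "juliet",
                "kilo", "lima", "mike", "november", "oscar", "papa",
                "quebec", "romeo", "sierra", "tango", "uniform", "victor",
                "whiskey", "xray", "x-ray", "yankee", "zulu"}


def generalize(phrase):
    """Replace spoken digits with 9 and spoken letters with Z."""
    out = []
    for word in phrase.split():
        key = word.lower()
        if key in DIGIT_WORDS:
            out.append("9")
        elif key in LETTER_WORDS:
            out.append("Z")
        else:
            out.append(word)
    return " ".join(out)


def filter_candidates(candidates, airline_patterns, airport_patterns):
    """Keep candidates whose pattern is a known airline pattern and not an airport one."""
    kept = []
    for cand in candidates:
        pattern = generalize(cand)
        if pattern in airline_patterns and pattern not in airport_patterns:
            kept.append(cand)
    return kept


print(generalize("Airbus zero two india quebec"))  # Airbus 9 9 Z Z
print(generalize("tower two one three"))           # tower 9 9 9
print(filter_candidates(["Easy two zero two bravo", "tower one two three"],
                        {"Easy 9 9 9 Z"}, {"tower 9 9 9"}))
```

In this sketch the pattern sets would be built by running `generalize` over the call signs and airport references found in the training csv files.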
[Figure: call sign detection post-processing pipeline. Call sign patterns and airport patterns are derived from the textual training data (csv files), e.g. "Airbus zero two india quebec" ==> "Airbus 9 9 Z Z" and "tower two one three" ==> "tower 9 9 9". The audio test data is transcribed, and the transcribed text goes through the RNN-CRF to give candidate call signs. Post processing then filters tagged call signs and tags missed call signs to give the final call signs.
Examples shown in the figure:
  "left via mike four Easy four zero one niner tower one two three expediting" (pattern: Easy 9 9 9 9; marked x)
  "two bravo, Easy two zero two bravo requesting for the descent" (pattern: Easy 9 9 9 Z; marked +)]
CALL SIGN DETECTION RESULTS
DATA          ALGORITHM                                          F1
Leaderboard   5-gram                                             0.7957
Leaderboard   RNN-CRF                                            0.8207
Leaderboard   RNN-CRF with majority vote                         0.8289
EVALUATION    RNN-CRF with majority vote + post processing(?)    0.8017
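The slides do not say what the majority vote is taken over. As one possible reading, the sketch below assumes several models or decodes each propose a call sign for the same utterance and the most frequent proposal wins; `majority_call_sign` is a hypothetical helper.

```python
from collections import Counter


def majority_call_sign(proposals):
    """Return the most frequently proposed call sign, or None if there are none."""
    # Compare in lower case and ignore empty proposals.
    votes = Counter(p.lower() for p in proposals if p)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(majority_call_sign([
    "Easy two six one quebec",
    "easy two six one quebec",
    "Easy two six quebec",
]))  # easy two six one quebec
```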
CONCLUSIONS
1. ADAPTING ACOUSTIC MODELS TRAINED FROM LARGE DATA TO ATC DATA GAVE OVER 2% ABSOLUTE REDUCTION IN WER.
2. REPEATED ADAPTATION ALSO REDUCED WER.
3. ADAPTED BLSTM GAVE LOWER WER THAN ADAPTED LF-MMI TRAINED TDNN.
4. ROVER GAVE SIGNIFICANT REDUCTION IN WER.
5. CALL SIGN DETECTION USING RNN-CRF GAVE THE BEST RESULTS.
6. MAJORITY VOTE OVER CALL SIGNS IMPROVED F1 SCORE.
7. OVERALL WE RANKED 3RD IN THE EVALUATION.
CRIM,
AN IT EXPERTISE CENTRE
SERVING BUSINESSES
AND ORGANIZATIONS FOR MORE THAN 30 YEARS!
MISSION CRIM is an applied research and expertise centre in information technology that makes organizations more effective and competitive through the development of innovative technologies and the transfer of leading-edge know-how, while contributing to scientific advancement.