speech technology using in wechatrose1.ntu.edu.sg/wemagechallenge2014/gallery/slides...microsoft...

14
26/9/2014 1 Speech Technology Using in Wechat FENG RAO Outline Introduce Algorithm of Speech Recognition Acoustic Model Language Model Decoder SpeechTechnologyOpenPlatform Framework of Speech Recognition Products of Speech Recognition Speech Synthesis Speaker Verification

Upload: others

Post on 25-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

  • 26/9/2014

    1

    Speech Technology

    Using in Wechat

    Powered by WeChatFENG RAO

    Outline

    • Introduce Algorithm of Speech Recognition – Acoustic Model

    – Language Model

    – Decoder

    • Speech Technology Open Platform– Framework of Speech Recognition

    – Products of Speech Recognition

    – Speech Synthesis

    – Speaker Verification

  • 26/9/2014

    2

    Speech Recognition

    ˆ W = argmaxW ∈L

    P(W | O) ˆ W = argmaxW ∈L

    P(O |W )P(W )

    P(O)

    ˆ W = argmaxW ∈L

    P(O |W )P(W )

    Acoustic Model

    • Spoken words: “I think there are”• Phonemes: ‘ay-th-in-nk-kd dh-eh-r-aa-r’• Each tri-phone correspond to a hmm.• H.M.M : 5 state representation• Each state correspond to a mixture Gaussian

    model

  • 26/9/2014

    3

    Acoustic Model

    P(O | S) = ϖ iN (O | µi,i

    ∑ )i=1

    M

    P(O |W ) = P(O | S)i=1

    M

    Deep Neural Network

    outputs

    hidden

    layers

    input vector

    Compare outputs with

    correct answer to get

    error signal

    Back-propagate

    error signal to

    get derivatives

    for learning

    P(Y = i | x,W ,b) = soft maxi(Wx + b) =

    eW

    ix+b

    i

    eW

    jx+b

    j

    j

  • 26/9/2014

    4

    Language Model

    1 2 1 2 1 1 2 3 1( ) ( , ,..., ) ( ) ( | )... ( | , , ,..., )k k kp S p W W W p W p W W p W W W W W −= =

    • N-Gram Model• Build the LM by calculating n-gram probabilities from

    text training corpus: how likely is one word to follow another? To follow the two previous words?

    • Smooth methods– KN, GT ,Stupid Backoff

    • Grammar• ABNF, is to describe a formal system of a language to be

    used as bidirectional communication protocol.

    • Quick , Small

    • \data\• ngram 1=4• ngram 2=3• ngram 3=2• \1-grams:-• 0.60206 hello -0.39794• -0.60206 world -0.3979• -0.60206 -0.39794• -0.60206 -0.39794• \2-grams:0• 0 hello world -0.39794• 0 world -0.39794• 0 hello -0.39794• \3-grams:• 0 hello world • 0 hello world\end\

    N-Gram

    Grammar

    public $basicCmd = $digit;

    $digit = (0|1|2|3|4|5|6|7|8);

  • 26/9/2014

    5

    Decoder

    • Find the best hypothesis P(O|W) P(W) given– A sequence of acoustic feature vectors (O)– A trained HMM (AM)– Lexicon (PM)– Probabilities of word sequences (LM)

    • For O– Weighted finite state transducer– Build network composed with HMM trip-hone and words in Am and Lm.– Calculate most likely state sequence in HMM given transition and

    observation probs.– Trace back through state sequence to get the word sequence.– Viterbi decoder– N best vs. 1 best vs. lattice output

    • Limiting search– Lattice minimization and determination– Pruning: beam search

    Decoder Network

    • Viterbi Decoder Process

    t

    start End

    Phoneme / word

  • 26/9/2014

    6

    Decoder Network

    • Viterbi Decoder Process

    t

    start End

    Decoder Network

    • Viterbi Decoder Process

    t

    start End

  • 26/9/2014

    7

    Decoder Network

    • Viterbi Decoder Process

    t

    start End

    Decoder Network

    • Viterbi Decoder Process

    t

    start End

  • 26/9/2014

    8

    Challenge Under Internet

    • Big Training Data – Txt corpus is TB level and thousand hours of speech

    data as training data

    – Speed Optimized methods

    • Large Mount of Users– Real time response

    – More machines, Robust service

    • Quick Update– Content in Internet in changing every day.

    – Update model especially on language model

    Speech Open Platform

    Speech recognition

    Speech synthesis

    Speaker verification

    --using in wechat

  • 26/9/2014

    9

    One Network

    Multiple Products

    Universal Interface

    decoder

    General

    Filed LM

    decoder

    Map filed

    LM

    decoder

    Command

    Filed LM

    decoder

    Others…

    One Network

    Multiple Technology

    Universal Interface

    One Pass Decoder

    Lattice Decoder

    Parallel Decoding Space

    Ngram Model

    ABNF

    Parallel Decoding Space

    GMM

    DNN

    One –Best

    N-Best

    One –Best

    N-Best

  • 26/9/2014

    10

    Non-Finite FieldThe core performance of Speech Recognition is optimized and developing� Accuracy rate: 94% (Audio sampling at 16kHz)� Usage amount: 18 million per day 92.60% 94.10% 94.70% 94.30% 95.10% 94.90%91.00%91.50%92.00%92.50%93.00%93.50%94.00%94.50%95.00%

    95.50%

    Week 33 Week 34 Week 35 Week 36 Week 37 Week 38

    Sampling Accuracy of Current

    Availablity

    * Source: Accuracy Assessment from Third Party in April'14

    Recognition rate

    Multi VerticalsUnify entrance with Parallel

    decoding of space technology

    � Parallel recognition supports 11 classifications of verticals � 30% better in performance than

    speech input in Verticals

    � recognition rate: 96%, more

    accurate than common* Source: Accuracy Assessment from Third Party in April'1492.70%

    93.40%

    94.40%

    96.20%95.80%

    95.50%

    93.80%

    90.00%

    91.00%

    92.00%

    93.00%

    94.00%

    95.00%

    96.00%

    97.00%

    Vertical Fields

  • 26/9/2014

    11

    Speech Technology Product

    21

    • Speech to text

    Input ToolQQ InputWechat Input

    Speech Technology Product

    • Vertical ApplicationMusic

    SearchingQQ Map

  • 26/9/2014

    12

    Contact Searching

    Voice Quality Identify

    Voice Awaken To Unlock Mobile Phone

    Speech Synthesis

    • Features– 1. High efficient synthesis.

    – 2. Available SDK for Android and iOS clients.

    – 3. Offline and Online TTS

    • Applications

    – 1. WeChat Official Account.

    – 2. WeCall.

    .

  • 26/9/2014

    13

    Speaker verification

    �Application of scene

    �User login verification

    �bank transfer, payment verification

    �Forgot password

    �Advantage:

    �Convenient , fast

    �Safety

    �Good user experience

    How To Get Speech Technology

    • http://pr.weixin.qq.com/voice/intro

  • 26/9/2014

    14

    Thanks