

CLEAR September 2014 Volume-3 Issue-3

CLEAR Journal (Computational Linguistics in Engineering And Research)
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in, [email protected]

Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors: Dr. Ajeesh Ramanujan, Raseek C, Nisha M, Anagha M

Cover page and Layout: Sarath K S, Manu V Nair

Content Translation Tool for Translating Wikipedia Contents - Santhosh Thottingal

Speech Recognition - Divya Das

Feature Extraction for Speaker Recognition System - Neethu Johnson

FreeSpeech: Real-Time Speech Recognition - Vidya P V

CMU Sphinx 4: Speech Recognition System - Raveena R Kumar

Deep Learning in Speech Processing - Rajitha K, Alen Jacob

Parallelising NLP Tasks using MapReduce Paradigm - Freny Clara Davis, Shalini M, Nidhin M Mohan, Shibin Mohan

Editorial

News & Updates

Events

CLEAR Dec 2014 Invitation

Last word


Dear Readers!

Greetings!

In this season of Onam, the September edition of CLEAR comes to you as a journal decorated by flowers in the form of articles, mainly on speech processing, and centred around the special article on the Content Translation tool for Wikipedia by Santhosh Thottingal, Senior Software Engineer, Language Engineering team, Wikimedia Foundation. It is heartening to note that the present batch of M.Tech students is also trying to extend the frontiers of knowledge by venturing into unexplored areas. On top of all this, we have glimpses of the visit by the Acharya Prof. R. Kalyana Krishnan to our institution, which we consider a blessing.

We will try to include articles from other departments also in future editions, keeping in mind the broad objectives of the journal.

Hope you all will enjoy the colours of this edition. Do send in your feedback.

Best Regards,

P.C. Reghu Raj

(Chief Editor)


Adieu to Second Batch of SIMPLE Groups

Yet another milestone was reached by SIMPLE Groups when the students of the second batch (2012-14) of Computational Linguistics successfully completed their M.Tech course. It was an academically eventful period they spent at the institution. Research projects were taken up and several publications resulted from them. Workshops, expert talks and similar events were organized by them, and they also attended such programmes at various other institutions, which brought about a sharing of knowledge. Their examination results were spectacular. Besides academics, they also have a story of radiant life at the institution: a story of friendship sprinkled with love, helping and taking care of each other.

A farewell party was organized by the junior batch (2013-15) for their outgoing seniors. The forenoon session of the function was a get-together of staff, faculty and students. First, Dr. P. C. Reghu Raj shared his memories and experiences with the batch, gave them a final word of advice and wished them every success in their future endeavours. The other staff members also recalled their times with the batch and offered their good wishes. The students shared their memories of the college and gave their feedback on the course and the institution. They had suggestions on how to further improve the system here, and they also gave valuable directions to their junior batch regarding various aspects of the course and project work.

The forenoon function concluded with a lavish lunch. Then the students of both batches gathered for some games and other fun activities. The party was full of laughter and excitement. Towards the end, when everyone recollected their past days and took a trip down memory lane, it turned nostalgic and heart-warming. Thus the second batch of Computational Linguistics bid goodbye to the college with a bunch of treasured memories and colourful achievements, and with a promise to stay connected. (Content prepared by Kavitha Raju)


Talk by Prof. Kalyana Krishnan R

Professor R. Kalyana Krishnan, one among the most reputed professors of IIT Madras, visited our college on 16 September 2014. Prof. Kalyana Krishnan retired from IIT Madras in 2012 after a long service lasting nearly four decades. He visited our college and delivered expert talks on various engineering fields for both postgraduate and undergraduate students.

The interaction with PG students began in the afternoon as per the schedule. Both the first- and second-year M.Tech Computational Linguistics batches attended the class. The talk was centred around text processing and its associated issues. After a short introduction to the mathematical basis of text processing, the discussion advanced to the much anticipated topic of multilingual text processing.

Prof. Kalyana Krishnan compared English with Malayalam in terms of the size of the alphabet and homophones. The talk then proceeded to issues regarding character encodings in Malayalam, mostly those associated with koottaksharams (conjunct characters). In the following discussion the Professor made us realize the beauty of Malayalam and asked us to appreciate the expertise with which Indian literature was written. The beauty of Indian literature further drove the discussion towards the Rama Krishna Viloma Kavyam, a sloka built of palindromes, and the excellence of the epic Mahabharata.

Opportunities like this, where we could share our thoughts and ideas with experts like Prof. Kalyana Krishnan, come only rarely in a lifetime. We, the students of M.Tech CL, are hence sincerely thankful to our Head of Department for providing us such an opportunity.

(Content prepared by Amal Babu)


Talk by Dr. T. Asokan

Dr. T. Asokan is a Professor in the Department of Engineering Design at IIT Madras. He completed his B.Tech and M.Tech in Mechanical Engineering from Calicut University and received his Ph.D in Mechanical Engineering from the Indian Institute of Technology Madras in the year 2000. His area of specialization was electro-hydraulic controls for robotic applications. He visited our college and delivered an expert talk on "Underwater Robotics" for undergraduate students of the Mechanical department and some students from the Electrical department. Later he interacted with faculty members.


Mangalyaan

The Mars Orbiter Mission, Mangalyaan, launched into Earth orbit on 5th November 2013 by Indian Space Research Organisation, was successfully inserted into Mars orbit on 24th September 2014, making India the first nation to send a satellite into Mars orbit on its first attempt, and the first Asian nation to do so.

The Mangalyaan robotic probe is one of the cheapest interplanetary missions ever. Only the US, Russia and the European Space Agency have previously sent missions to Mars, and India has succeeded on its first attempt - an achievement that eluded even the Americans and the Soviets. It is India's first interplanetary mission and ISRO has become the fourth space agency to reach Mars, after the Soviet space program, NASA, and the European Space Agency.

The specific objectives of the Mars Orbiter Mission are primarily associated with spacecraft construction and mission operations as Mangalyaan serves as a pathfinder, being India’s first mission beyond the Moon which brings its own unique challenges such as the 20-minute average signal delay to Mars. The Indian Space Science Data Center has provided the following Mission Objectives:

1. Develop the technologies required for design, planning, management and operations of an interplanetary mission.
2. Orbit maneuvers to transfer the spacecraft from an elliptical Earth orbit to a heliocentric trajectory and finally insert it into Mars orbit.
3. Development of force models and algorithms for orbit and attitude computations and analyses.
4. Navigation in all mission phases.
5. Maintain the spacecraft in all phases of the mission, meeting power, communications, thermal and payload requirements.
6. Incorporate autonomous features to handle contingency situations.

The following scientific objectives have been set for the Mars Orbiter Mission:
1. Study the climate, geology, origin and evolution of Mars.
2. Study the sustainability of life on the planet.

MOM will be set on a highly elliptical orbit around Mars, with a period of 3.2 days and a planned periapsis of 423 km (263 mi) and apoapsis of 80,000 km (50,000 mi). Commissioning and checkout operations are planned over the coming weeks to prepare MOM's instruments for scientific operations.

(Content prepared by Vidya P V)


Content Translation Tool for Translating Wikipedia contents

Santhosh Thottingal Senior Software Engineer

Language Engineering Team Wikimedia Foundation

[email protected]

Wikipedia, one of the most widely used reference sites on the Internet, is available in 287 languages. Of these, the largest and most popular is the English edition. Wikipedia also exists in over 20 Indian languages, but the Indian-language editions are very small compared to the English one; after English, the European-language editions are the next largest and most popular. This comparison holds for the number of articles and the size of the content, but not for the number of speakers: Hindi, for instance, is among the most widely spoken languages in the world, yet while the Hindi Wikipedia has a little over one lakh articles, the English Wikipedia has more than 40 lakh. The Malayalam Wikipedia has about 35,000 articles.

For such small wikis, translating articles from the larger wikis is a practical way to grow, and many Wikipedia editors already translate articles, at least partially, in this manner. Machine translation services that ease translation are available for many language pairs, but for Indian languages machine translation technology has not yet matured enough for this purpose. Even for languages where machine translation works, Wikipedia itself has offered editors no facility for translating articles: editors translate the content outside the wiki using tools such as Google Translate or Bing and copy the result into Wikipedia, after which the links, references, templates and so on have to be fixed by hand.

To remedy this, Wikipedia has begun an effort to provide translation aids within the wiki itself. With the help of these aids, articles can be translated from one language into another inside Wikipedia. Besides dictionaries and machine translation, the links in an article can be mapped automatically to the target language, since the interlanguage links of articles on a subject are already maintained by the Wikidata project. References and templates can likewise be mapped, fully or partially, from one language to another. One reason many people hesitate to edit Wikipedia articles is the effort of learning the somewhat difficult wiki markup; in this system, translated articles are prepared in a rich-text editor, much like editing a document in Google Docs.

To begin with, the tool supports the Spanish and Catalan languages, using Apertium (http://www.apertium.org/), a free and open-source machine translation system. It will be extended to other languages in the coming months.

Figure: The Content Translation tool translating an article from Spanish to Catalan.

Watson Analytics merges big data with natural language tools

IBM has announced the launch of Watson Analytics, a cloud-based natural language service that aims to simplify and streamline predictive data analytics for businesses, creating handy visualizations in the process. It can help companies source and clean up data, so that the results seen are always relevant.

Visit: http://www.wired.co.uk/news/archive/2014-09/16/ibm-watson-analytics


Speech Recognition

Divya Das
Project Engineer-II, CDAC Thiruvananthapuram
[email protected]

I. Introduction

Speech processing is the study of speech signals and the processing methods of these signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. It is also closely tied to natural language processing (NLP), as its input can come from and its output can go to NLP applications.

The two main research areas of speech processing are:

- Speech recognition (also called voice recognition), which deals with analysis of the linguistic content of a speech signal and its conversion into a computer-readable format.

- Speech synthesis, the artificial synthesis of speech, which usually means computer-generated speech.

This article briefly explains the fundamental steps followed in speech recognition systems.

II. Speech Recognition

Speech recognition is a complex decoding process which translates speech into its corresponding textual representation. Because of the stochastic nature of speech, stochastic models are used for its decoding by modeling relevant acoustic speech features. Speech recognition engines usually require two basic components in order to recognize speech. One component is an acoustic model, created by taking audio recordings of speech and their transcriptions. The other component is called a language model, which gives the probabilities of sequences of words. The following figure shows the important speech recognition modules.

Figure 1: Speech Recognition


III. Acoustic Modelling

The first step toward building an automated speech recognition system is to create a module for acoustic representation of speech. The main goal of this module is the computation of the acoustic model probability as it describes the probability of a sequence of acoustic observations conditioned on the word sequence. Two main branches of possible model types have gained popularity, namely neural networks (NNs) and hidden Markov models (HMMs). HMMs are commonly used for stochastic modelling, especially in the field of automated speech recognition. This is because they have been found to be eminently suited to the task of acoustic modelling.

The hidden Markov model is a (first order) Markov model whose topology is optimized for the task of speech recognition. It is strictly a left-to-right model consisting of states and transitional edges. It is called hidden because the state sequence is effectively hidden from the resulting sequence of observation vectors. The number of states depends on the speech unit modelled by the HMM. Possible speech units are phones or phone groups (e.g. bi-phones or tri-phones), syllables, words or even sentences. The link between the speech signal and the corresponding speech units is made by acoustic modelling.
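As a concrete, purely illustrative picture of that left-to-right topology, the short Python sketch below builds the transition matrix of a hypothetical three-state HMM; the state count and probabilities are invented for the example, since real systems estimate them from training data.

import numpy as np

# Transition matrix of a hypothetical 3-state left-to-right HMM.
# Each state may only loop on itself or advance to the next state,
# which is what the strictly left-to-right topology means in practice.
A = np.array([
    [0.6, 0.4, 0.0],   # state 0 -> {0, 1}
    [0.0, 0.7, 0.3],   # state 1 -> {1, 2}
    [0.0, 0.0, 1.0],   # state 2 (exit state) only loops on itself
])

assert np.allclose(A.sum(axis=1), 1.0)   # each row is a probability distribution
print(A)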

A. Feature Extraction

The main tasks of the acoustic feature extraction procedure are the conversion of the analog speech signal to its discrete representation and the extraction of the

relevant acoustic features in terms of best speech recognition capability. Mel Frequency Cepstral Coefficients (MFCCs) are features widely used in automatic speech recognition. The mel frequency is used as a perceptual weighting that more closely resembles how we perceive sounds such as music and speech. For example, when we listen to a recording of music, most of what we "hear" is below 2000 Hz; we are not particularly aware of higher frequencies, though they also play an important part in audio perception. The cepstrum is the spectrum of a spectrum. A spectrum gives information about the frequency components of a signal; a cepstrum gives information about how those frequencies change. The combination of the two, the mel weighting and the cepstral analysis, makes MFCCs particularly useful in audio recognition, such as determining timbre (i.e. the difference between a flute and a trumpet playing the same frequency), which forms the basis of instrument or speech recognition.

B. Training HMMs

The next step is the training of the acoustic model parameters. In this context, training means the computation of model parameters based on appropriate training material in order to emulate the stochastic nature of the speech signal. Therefore, the training material needs to be representative for the speech domain for whose recognition the acoustic models will be used later. Over iterations through the training data, efficient estimation approaches used by standard training methods converge to a local optimum.


There are several well established training methods such as the maximum likelihood (ML) or maximum a posteriori (MAP) approaches. Baum-Welch training and Viterbi training are commonly used implementations of the ML training approach. One main characteristic of Viterbi training is the direct assignment of speech frames to HMM states. The Baum-Welch training algorithm is more flexible and allows overlaps in the frame to state assignment during the training procedure. In Viterbi training, the HMM parameters are estimated based on an initial segmentation of the training data. Each iteration successively improves the estimation of the acoustic model probability. The training procedure is finished when no further significant improvement can be achieved.

IV. Language Modelling

The language model (LM), also known as the grammar, describes the probability of the estimated sequence of words. The LM can be defined as a context-free grammar (CFG), a stochastic model (n-gram) or a combination of the two. Context-free grammars are used by simple speech recognition systems where the input sentences are often modelled by grammars. CFGs allow only utterances which are explicitly covered/defined by the grammar. Since CFGs of reasonable complexity can never foresee all the spontaneous variations of the user's input, n-gram language models are preferred for the task of large-vocabulary spontaneous speech recognition.

N-gram language models represent an nth order stochastic Markov model which describes the probability of word occurrences conditioned on the prior

occurrence of n-1 other words. The probabilities are obtained from a large text corpus, and the resulting models are called unigram, bigram or n-gram language models depending on their complexity. The assumption behind such an LM is that the probability of a specific n-gram can be estimated from the frequency of its occurrence in a training set. The simplest n-gram is the unigram language model, which simply attaches a prior probability to each word. The prior probability describes the frequency of the specific word normalized by the total number of words.
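To make the counting idea concrete, here is a minimal sketch of estimating unigram and bigram probabilities from a toy corpus; the sentences and the unsmoothed maximum-likelihood estimates are illustrative assumptions rather than part of any particular recognizer.

from collections import Counter

# Toy training corpus (illustrative only).
corpus = [
    "open the file",
    "close the file",
    "open the window",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_unigram(w):
    # Prior probability: word frequency normalized by the total word count.
    return unigrams[w] / sum(unigrams.values())

def p_bigram(w, prev):
    # P(w | prev) estimated from counts (no smoothing in this sketch).
    return bigrams[(prev, w)] / unigrams[prev]

print(p_unigram("the"))          # 3/15
print(p_bigram("file", "the"))   # 2/3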

V. Decoding Process

The search space of the speech decoding process is given by a network of HMM states. The connection rules within this network are defined at different hierarchy levels such as the word, the HMM and the state level. Words are connected based on language model rules, whereas each word is constructed of HMMs defined by the pronunciation dictionary. The primary objective of the search process is to find the optimal state sequence in this network associated with a given speech utterance.

The Viterbi algorithm is an application of the dynamic programming principle and it performs the maximum likelihood decoding. The Viterbi algorithm provides a solution of finding the optimal word sequence associated with a given sequence of feature vectors by using the acoustic model and the language model.
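The sketch below implements the Viterbi recursion over a generic HMM in log space to illustrate the dynamic-programming search described above; it is not the Sphinx or HTK implementation, and the two-state, four-frame model is a made-up example.

import numpy as np

def viterbi(log_A, log_B, log_pi):
    # log_A  : (S, S) log transition probabilities
    # log_B  : (T, S) log observation likelihoods per frame and state
    # log_pi : (S,)   log initial state probabilities
    T, S = log_B.shape
    delta = np.full((T, S), -np.inf)    # best path score ending in state s at time t
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery

    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (S, S): previous -> current state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]

    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny made-up example: 2 states, 4 frames.
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
log_pi = np.log([0.6, 0.4])
print(viterbi(log_A, log_B, log_pi))   # [0, 0, 1, 1]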

VI. Speech Recognition Tool

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and


manipulating hidden Markov models. HTK is primarily used for speech recognition. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis.

VII. References

[1] Mikael Nilsson and Marcus Ejnarsson, "Speech Recognition using Hidden Markov Model", Department of Telecommunications and Signal Processing.

[2] http://www.voxforge.org/ Visited on September 2014.

Brain-to-brain verbal communication in humans achieved for the first time

A team of researchers has successfully achieved brain-to-brain human communication using non-invasive technologies across a distance of 5,000 miles. The team, comprising researchers from Harvard Medical School teaching affiliate Beth Israel Deaconess Medical Center, Starlab Barcelona in Spain, and Axilum Robotics in Strasbourg, France, used a number of technologies that enabled them to send messages from India to France, a distance of 5,000 miles (8,046.72 km), without performing invasive surgery on the test subjects.

This experiment, the researchers said, represents an important first step in exploring the feasibility of complementing or bypassing traditional means of communication, despite its current limitations; the bit rates were quite low, for example, at two bits per minute. Potential applications, however, include communicating with stroke patients.

Visit: http://www.cnet.com/news/brain-to-brain-verbal-communication-in-humans-achieved-for-the-first-time/


Feature Extraction for Speaker Recognition System

Neethu Johnson M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad

[email protected]

ABSTRACT: Speech processing has emerged as an important application area of digital signal processing. Various fields of research in speech processing are speech recognition, speaker recognition, speech synthesis, speech coding etc. The objective of automatic speaker recognition is to extract, characterize and recognize the information about speaker identity. Feature extraction is the first step of speaker recognition. Many algorithms have been developed by researchers for feature extraction, out of which the Mel Frequency Cepstrum Coefficient (MFCC) feature has been widely used for designing text-dependent speaker identification systems.

I. Speaker recognition

Anatomical structure of the vocal tract is unique for every person, and hence the voice information available in the speech signal can be used to identify the speaker. Recognizing a person by her/his voice is known as speaker recognition.

Speaker recognition systems involve two phases, namely training and testing. Training is the process of familiarizing the system with the voice characteristics of the speakers registering. Testing is the actual recognition task. Feature vectors representing the voice characteristics of the speaker are extracted from the training utterances and are used for building the reference models. During testing, similar feature vectors are extracted from the test utterance, and the degree of their match with the reference is obtained using some matching technique. The level of match is used to arrive at the decision. For speaker recognition it is important to extract features from each frame which can capture the speaker-specific characteristics.

II. Feature extraction

Feature extraction is the process of extracting a limited amount of useful information from the speech signal while discarding redundant information. The extraction and selection of the best parametric representation of acoustic signals


is an important task in the design of any speaker recognition system; it significantly affects the recognition performance. The features can be extracted either directly from the time-domain signal or from a transformation domain, depending upon the choice of the signal analysis approach. Some of the signal features that have been successfully used for speech processing tasks include Mel-frequency cepstral coefficients (MFCC), Linear predictive coding (LPC) and Local discriminant bases (LDB). Some techniques generate a pattern from the features and use it, while others use the numerical values of the features directly.

A. LPC

In an LPC system, each sample of the signal is expressed as a linear combination of the previous samples. This equation is called a linear predictor, and hence the method is called linear predictive coding. The coefficients of the difference equation (the prediction coefficients) characterize the formants. LPC analyses the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue.

B. MFCC

MFCC is based on the human peripheral auditory system. The human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus, for each tone with an actual frequency f measured in Hz, a subjective pitch is measured on a scale called the mel scale. The mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1 kHz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.

C. LDB

LDB is a speech signal feature extraction and a multi group classification scheme that focuses on identifying discriminatory time-frequency subspaces. Two dissimilarity measures are used in the process of selecting the LDB nodes and extracting features from them. The extracted features are then fed to a linear discriminant analysis based classifier for a multi-level hierarchical classification of speech signals.

III. Mel Frequency Cepstral Coefficients

The most widely used acoustic features for speech and speaker recognition are MFCCs. They are the result of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale. MFCCs have proved efficient because they take into account the perception characteristics of the human ear. In deriving the MFCC features, human perception is considered nearly linear up to 1000 Hz and non-linear above that, with the importance of a frequency component decreasing as the frequency increases. As a result, we need to have a better (constant) resolution up


to 1000 Hz and a decreasing resolution as the frequency increases. This means that up to 1000 Hz the mel filter banks have a constant bandwidth, smaller than the bandwidth of the filter banks above 1000 Hz; beyond 1000 Hz the filter bank bandwidths increase with frequency. The calculation of the MFCC includes the following steps.

A. Frames

The most fundamental process common to all forms of speaker and speech recognition systems is that of extracting vectors of features uniformly spaced across time from the time-domain sampled acoustic waveform.

a. Pre-emphasis

The pre-emphasis refers to filtering that emphasizes the higher frequencies in the speech signal. Its purpose is to balance the spectrum of voiced sounds that have a steep roll-off in the higher frequency region.

b. Framing

The speech signal is a slowly time-varying or quasi-stationary signal. For stable acoustic characteristics, speech signal needs to be examined over a sufficiently large duration of time over which it could be considered to be stationary. Further, samples between adjacent frames are overlapped to ensure continuity in the features extracted, and thus avoid any abrupt changes. The time-domain waveform of the utterance under consideration is divided into overlapping fixed duration segments called frames. In speaker recognition, a frame size of 20 ms is seen to be the optimum and 10 ms for the

overlap between the adjacent frames. Advancing the time window every 10 ms enables the temporal characteristics of the individual speech sounds to be tracked and the 20 ms analysis window is usually sufficient to provide good spectral resolution, and at the same time short enough to resolve significant temporal characteristics.

c. Windowing

Each frame is multiplied by a window function. The window function is needed to smooth the effect of using a finite-sized segment for the subsequent feature extraction by tapering each frame at the beginning and end edges. Any of the window functions can be deployed, with the Hamming window function being the most popular.

B. MFCC features

A Fast Fourier Transform (FFT) is applied to each frame to yield complex spectral values. Subsequently, the FFT coefficients are binned into 24 mel filter banks and the spectral energies in these 24 filter banks are calculated. Then, a Discrete Cosine Transform (DCT) is applied to the log of the mel filter bank energies to obtain the MFCC coefficients, and the first 13 coefficients are selected as the features for the speaker recognition system. The DCT also serves the purpose of de-correlating the mel frequency band energies. It may also be interpreted that the discarded higher-order coefficients correspond to the fast variations in the signal spectrum, and they are found not to add value to speaker or speech recognition experiments.
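A compact NumPy/SciPy sketch of that pipeline is shown below. It assumes a mono signal at 8 kHz, 20 ms frames with a 10 ms shift, 24 triangular mel filters and 13 retained coefficients; the omission of liftering and other refinements found in toolkits such as HTK is a deliberate simplification.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # Mel frequency is proportional to the log of the linear frequency.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, frame_len=0.02, frame_shift=0.01,
         n_filters=24, n_ceps=13, n_fft=256):
    # 1. Pre-emphasis and framing (20 ms frames, 10 ms shift);
    #    assumes the signal is at least one frame long.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    step, size = int(sr * frame_shift), int(sr * frame_len)
    n_frames = 1 + max(0, (len(signal) - size) // step)
    frames = np.stack([signal[i * step: i * step + size] for i in range(n_frames)])
    frames = frames * np.hamming(size)                  # 2. Windowing

    # 3. Power spectrum via FFT.
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # 4. Triangular mel filter bank covering 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 5. Log filter-bank energies, then DCT; keep the first n_ceps coefficients.
    energies = np.log(spec @ fbank.T + 1e-10)
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Usage on one second of random stand-in "audio" (illustrative only).
features = mfcc(np.random.randn(8000))
print(features.shape)   # (frames, 13)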


Subsequently, the temporal delta and acceleration coefficients are calculated and appended to the 13 baseline features, making the total number of features 39. The mel frequency scale is represented as

Mel(f) = 2595 log10(1 + f/700),

where f is the linear frequency in Hz. Mel frequency is proportional to the logarithm of the linear frequency, reflecting similar effects in the human's subjective and aural perception.

As the vocal tract is smooth, the filter bank energies measured in adjacent bands tend to be correlated. The DCT is applied to the transformed mel frequency coefficients to produce a set of cepstral coefficients. Prior to computing the DCT, the mel spectrum is usually represented on a log scale. Since most of the signal information is represented by the first few MFCC coefficients, the system can be made robust by extracting only those coefficients, ignoring or truncating the higher-order DCT components. Traditional MFCC systems use only 8-13 cepstral coefficients. The 0th coefficient is often excluded since it represents the average log-energy of the input signal, which carries little speaker-specific information.

The cepstral coefficients are static features that contain information from a given frame, while information about the temporal dynamics of the signal is represented by the first and second derivatives of the cepstral coefficients. The first-order derivatives, called delta coefficients, represent information about the speech rate (velocity), and the second-order derivatives, called delta-delta coefficients, represent information about the acceleration of the speech signal.
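Building on the MFCC sketch above, the delta and delta-delta streams can be approximated with simple frame differences; the two-frame difference window used here is an assumption, and production front ends usually apply a weighted regression over several neighbouring frames.

import numpy as np

def deltas(feats):
    # Simple first-order difference with edge padding; feats is (frames, coeffs).
    padded = np.pad(feats, ((1, 1), (0, 0)), mode='edge')
    return (padded[2:] - padded[:-2]) / 2.0

def add_dynamics(static):
    # Append velocity (delta) and acceleration (delta-delta) features:
    # 13 static + 13 delta + 13 delta-delta = 39 features per frame.
    d = deltas(static)
    dd = deltas(d)
    return np.hstack([static, d, dd])

full = add_dynamics(np.random.randn(99, 13))   # stand-in for MFCC frames
print(full.shape)                              # (99, 39)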

IV. Tool for feature extraction

A. HTK

The Hidden Markov Model Toolkit (HTK) is a toolkit for building hidden Markov models (HMMs), which can be used to model any time series. It is primarily designed for building HMM-based speech processing tools, in particular recognisers.

Although all HTK tools can parameterise waveforms (e.g. into MFCC features) on the fly, in practice it is usually better to parameterise the data just once. The tool HCopy is used for this. As the name suggests, HCopy is used to copy one or more source files to an output file. Normally, HCopy copies the whole file, but a variety of mechanisms are provided for extracting segments of files and concatenating files. By setting the appropriate configuration variables, all input files can be converted to parametric form as they are read in.

A sample configuration file for HCopy is shown below:

#Feature configuration
TARGETKIND = MFCC_E_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = F
SAVEWITHCRC = T
WINDOWSIZE = 200000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 24
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T
#Input file format (headerless 8 kHz 16-bit linear PCM)
SOURCEKIND = WAVEFORM
SOURCEFORMAT = NOHEAD
SOURCERATE = 1250
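A hypothetical way to drive HCopy from Python is sketched below; it assumes HTK's HCopy is on the PATH, that the configuration above has been saved as hcopy.conf, and that the wav and feature file names are made up for the example.

import subprocess
from pathlib import Path

# Write an HTK script file listing "source wav  target feature file" pairs,
# then run HCopy with the configuration shown above (saved as hcopy.conf).
pairs = [("data/s1_utt1.wav", "feat/s1_utt1.mfc"),
         ("data/s1_utt2.wav", "feat/s1_utt2.mfc")]

Path("feat").mkdir(exist_ok=True)
Path("files.scp").write_text("\n".join(f"{src} {dst}" for src, dst in pairs) + "\n")

# Equivalent to running: HCopy -C hcopy.conf -S files.scp
subprocess.run(["HCopy", "-C", "hcopy.conf", "-S", "files.scp"], check=True)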

Thus, it is simple to parameterise the speech signal, i.e. extract MFCC features from it, using HTK. These features can then be used for speaker recognition, spoken language identification, speech recognition or any other speech processing task.

V. CONCLUSION

Speaker recognition is a commonly used biometric for controlling access to information services or user accounts, as it can be used to replace or augment personal identification numbers or passwords. The speech signal can be represented as a sequence of feature vectors in order to apply mathematical tools. Such spectral-based features are used for speaker recognition in most systems. Mel Frequency Cepstral Coefficients (MFCCs) are features widely used in automatic speech and speaker recognition. They describe the signal characteristics relative to the speaker-discriminative vocal tract properties. High accuracy and low complexity are the major advantages of MFCCs. The Hidden Markov Model Toolkit (HTK), a freely available portable toolkit for building and manipulating hidden Markov models, is primarily used for speech recognition research. Its tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis.

Microsoft Unveils Real-Time Speech Translation for Skype

At Re/code's inaugural Code Conference, Microsoft unveiled its real-time speech translator for Skype, a technology that conjures up references to "Star Trek" and "The Hitchhiker's Guide to the Galaxy" and that has been in the works for years.

While Speaker A is talking, Speaker B will actually hear their voice, at a lower volume, even as Skype Translator begins to do its work and starts delivering translated, spoken words. Moreover, the system looks for natural pauses, or "silence detection", in speech to start translating. The length of time it takes to translate depends entirely on the length of the sentence or phrase. The alternative would have been to have the speaker hold a button while speaking and let it go when they wanted to deliver a sentence or phrase. This approach should be more natural.

The "Star Trek"-like translator will become available before the end of 2014.

Visit: http://research.microsoft.com/en-us/news/features/translator-052714.aspx


FreeSpeech: Real-Time Speech Recognition

Vidya P V M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad

[email protected]

ABSTRACT: Speech recognition is the process of translation of spoken words into text. Real-time continuous speech recognition has opened up a wide range of research opportunities in human-computer interactive applications. PocketSphinx is an open-source embedded speech recognition system capable of real-time, medium-vocabulary continuous speech recognition, developed at Carnegie Mellon University. FreeSpeech is a free and open-source real-time speech recognition application that uses PocketSphinx.

I. Introduction

Speech recognition is the process of converting spoken words to written format. It provides exciting opportunities for creating great user experiences and efficiencies in real-time interactive applications. The advent of hand-held devices has paved the way for a wide variety of speech recognition applications, including voice user interfaces such as voice dialing, call routing, search, simple data entry and speech-to-text processing.

CMU Sphinx is a popular open-source large-vocabulary continuous speech recognition system developed at Carnegie Mellon University. PocketSphinx is a version of Sphinx that can be used in embedded systems and is capable of real-time, medium-vocabulary continuous speech recognition. FreeSpeech is a free and open-source real-time speech recognition application that provides off-line, speaker-independent voice recognition with dynamic language learning capability using the PocketSphinx speech recognition engine.

II. FreeSpeech

FreeSpeech is a free and open-source dictation, voice transcription and real-time speech recognition application which provides offline, speaker-independent voice recognition with dynamic language learning capability using the PocketSphinx speech recognition engine and the GStreamer open-source multimedia framework. FreeSpeech is truly cross-platform, written in Python.

CMU Sphinx, or simply Sphinx, describes a group of speech recognition systems developed at Carnegie Mellon University. PocketSphinx is a version of Sphinx that can be used in embedded systems. It is a research system and is a lightweight, multi-platform,


speaker independent, large vocabulary continuous speech recognition engine.
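For a flavour of how the engine is reached from Python, the fragment below uses the LiveSpeech helper shipped with some versions of the pocketsphinx Python bindings; the exact API, model paths and defaults vary between releases, so treat this as a sketch rather than the FreeSpeech code itself.

# Decode live microphone audio with the PocketSphinx engine (illustrative only).
# Assumes the pocketsphinx Python bindings that provide LiveSpeech and the
# default US-English acoustic model, dictionary and language model.
from pocketsphinx import LiveSpeech

for phrase in LiveSpeech():     # blocks, yielding one hypothesis per utterance
    print(phrase)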

A. Installation

In order to make FreeSpeech work reliably on Linux, the following packages must be installed. These can be installed through the package manager.

- Python 2.7
- pygtk2
- python-xlib
- python-simplejson
- gstreamer, including gstreamer-python
- pocketsphinx and sphinxbase
- CMU-Cambridge Statistical Language Modelling Toolkit v2

The CMU-Cambridge Statistical Language Modelling Toolkit can be downloaded and, after unpacking, installed by following the instructions in the README file and editing the Makefile. Manually copy the tools from the bin directory to somewhere in $PATH, such as /usr/local/bin. Similarly, PocketSphinx and Sphinxbase can be downloaded, unpacked and installed as per the instructions given in their README files. FreeSpeech can be downloaded from Google Code, and its installation requires only setting an environment variable to a user-writeable location if it isn't already set.

B. Using FreeSpeech

As already said, FreeSpeech is written in Python, and hence the program can be launched using the Python interpreter. The application then starts working and recognizes what is being spoken.

The following figure indicates the window that shows the spoken text in written format.

Figure 1. FreeSpeech Window

The dictionary available along with the FreeSpeech application defines the vocabulary and can be referred to for further information regarding different special characters, pronunciations of words etc.

Voice commands are also included in this application. A menu listing the various voice commands pops up upon running the FreeSpeech program with the Python interpreter.

The voice commands supported by the FreeSpeech application are listed below:

- file quit - quits the program
- file open - open a text file in the editor
- file save (as) - save the file
- show commands - pops up a customizable list of spoken commands
- editor clear - clears all text in the editor and starts over
- delete - delete [text] or erase selected text
- insert - move cursor after word or punctuation, example: "insert after period"
- select - select [text], example: "select the states"
- go to the end - put cursor at end of document
- scratch that - erase last spoken text
- back space - erase one character
- new paragraph - equivalent to pressing Enter twice

Figure 2. Command Preferences

C. Corpus and Dictionary

The FreeSpeech application contains a very limited language corpus, freespeech.ref.txt. The application can be trained by entering text in the textbox provided and clicking the Learn button, which adds the contents of the text box to the language corpus, thereby making the application understand better next time.

Similarly, the FreeSpeech dictionary can also be edited if there is any word that the application refuses to recognize even after teaching it several sentences. New words may be added to the dictionary manually, along with their phonetic representation.

III. Conclusion and Future Works

The FreeSpeech real-time speech recognition application provides a platform for real-time speech-to-text conversion and voice control. The speech recognition engine used, PocketSphinx, is a research system still at an early stage, and several tools need to be made available to make it complete. The small size of the language corpus provided with FreeSpeech is also one of its limitations; manual effort may be required to do the learning initially so as to grow the language corpus to suit our needs. The difficulty in handling pronunciation variations is another major challenge, which can be solved to an extent by editing the FreeSpeech dictionary.

IV. REFERENCES

[1] David Huggins-Daines, Mohit Kumar, Arthur Chan, Alan W. Black, Mosur Ravishankar, and Alex I. Rudnicky, "PocketSphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices", IEEE, 2006.

[2] http://thenerdshow.com/freespeech.html Visited on September 2014.


CMU Sphinx 4: Speech Recognition System

Raveena R Kumar
M.Tech Computational Linguistics, GEC Sreekrishnapuram, Palakkad
[email protected]

ABSTRACT: Speech is a continuous audio stream where rather stable states mix with dynamically changing states. The common way to recognize speech is to take the waveform, split it into utterances by silences and then try to recognize what is being said in each utterance. The CMU Sphinx toolkit is a leading speech recognition toolkit with various tools used to build speech applications. The Sphinx-4 speech recognition system is the latest addition to the Sphinx family of speech recognition systems.

I. Introduction

The CMU Sphinx toolkit, also called Sphinx for short, has a number of packages for different tasks and applications. These include a series of speech recognizers (Sphinx 2-4) and an acoustic model trainer (SphinxTrain). Sphinx is a continuous-speech, speaker-independent recognition system making use of hidden Markov acoustic models (HMMs) and an n-gram statistical language model.

The original Sphinx demonstrated the feasibility of continuous-speech, speaker-independent, large-vocabulary recognition. Sphinx 2 focuses on real-time recognition suitable for spoken language applications; as such it incorporates functionality such as end-pointing, partial hypothesis generation, dynamic language model switching and so on, and it is used in dialog systems and language learning systems. Sphinx 2 used a semi-continuous representation for acoustic modelling. Sphinx 3 adopted the prevalent continuous HMM representation and has been used primarily for high-accuracy, non-real-time recognition. Sphinx 3 is under active development and, in conjunction with SphinxTrain, provides access to a number of modern modeling techniques, such as LDA/MLLT, MLLR and VTLN, that improve recognition accuracy. PocketSphinx is a version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). PocketSphinx is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation.

Sphinx 4 is a redesign of the earlier Sphinx systems in terms of modularity, flexibility and algorithmic aspects. It uses newer search strategies and is universal in its acceptance of various kinds of grammars and language models, types of acoustic models and feature streams. It has been built entirely in the Java programming language. The Sphinx-4 system is an open source project.


II. Sphinx 4

The Sphinx-4 architecture has been designed for modularity. Any module in the system can be smoothly exchanged for another without requiring any modification of the other modules. One can, for instance, change the language model from a statistical N-gram language model to a context free grammar (CFG) or a stochastic CFG by modifying only one component of the system, namely the linguist. Similarly, it is possible to run the system using continuous, semi-continuous or discrete state output distributions by appropriate modification of the acoustic scorer. The system permits the use of any level of context in the definition of the basic sound units. Information from multiple information streams can be incorporated and combined at any level, i.e., state, phoneme, word or grammar. The search module can also switch between depth-first and breadth-first search strategies. One by-product of the system’s modular design is that it becomes easy to implement it in hardware.

A. Installation

Sphinx-4 is written in Java and therefore requires the JVM to run. To install Java on Ubuntu Linux, run:

sudo apt-get install sun-java6-jre

Download the Sphinx-4 1.0beta4 package from SourceForge. Next:

unzip sphinx4-1.0beta4-bin.zip
cd sphinx4/lib
sh jsapi.sh

Now accept the BCL license agreement, which will unpack jsapi.jar. Then test Sphinx-4:

cd ..
java -jar bin/Dialog.jar

Press Ctrl-C to exit the Sphinx-4 dialog demo.

B. Basic Usage

There are several high-level recognition interfaces in Sphinx-4:

- Live Speech Recognizer
- Stream Speech Recognizer
- Speech Aligner

The Live Speech Recognizer uses the microphone as the speech source. The Stream Speech Recognizer uses an audio file as the speech source. The Speech Aligner time-aligns text with audio speech. For most speech recognition jobs the high-level interfaces should be enough; one only needs to set up four attributes:

- Acoustic model
- Dictionary
- Grammar/Language model
- Source of speech

III. Architecture of the Sphinx-4 Decoder

Figure 1 shows the overall architecture of the Sphinx-4 decoder. The speech signal is parameterized at the front-end module, which communicates the derived features to the decoding block. The decoding block has three components: the search manager, the linguist, and the acoustic


scorer. These work in tandem to perform the decoding.

Figure 1: The overall architecture of Sphinx 4

A. Front end

The module consists of several communicating blocks, each with an input and an output. Each block has its input linked to the output of its predecessor. When a block is ready for more data, it reads data from the predecessor, and interprets it to find out if the incoming information is speech data or a control signal. The control signal might indicate beginning or end of speech.

One of the features of this design is that the output of any of the blocks can be tapped. Similarly, the actual input to the system need not be at the first block, but can be at any of the intermediate blocks. The current implementation permits us to run the system using not only speech signals, but also spectra, etc. In addition, any of the blocks can be replaced. Additional blocks can also be introduced between any two blocks, to permit noise cancellation or compensation on the signal, on its spectrum or on the outputs of any of the intermediate blocks. Features computed using

independent information sources, such as visual features, can be directly fed into the decoder, either in parallel with the features from the speech signal, or bypassing the latter altogether.

B. Decoder

The decoder block consists of three modules: search manager, linguist, and acoustic scorer.

a. Search Manager

The primary function of the search manager is to construct and search a tree of possibilities for the best hypothesis. The construction of the search tree is done based on information obtained from the linguist. The search manager makes use of a token tree. Each token contains the overall acoustic and language scores of the path at a given point, a Sentence HMM reference, an input feature frame identification, and a reference to the previous token, thus allowing backtracking. The Sentence HMM reference allows the search manager to fully categorize a token to its senone, context-dependent phonetic unit, pronunciation, word, and grammar state. Search through the token tree and the sentence HMM is performed in two ways: depth-first or breadth-first. Depth-first search is similar to conventional stack decoding. In Sphinx-4, breadth-first search is performed using the standard Viterbi algorithm as well as a new algorithm called Bush-derby.
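The hypothetical Python sketch below mirrors that token bookkeeping: each token carries its scores, a state reference, the frame it covers and a backpointer, and a hypothesis is recovered by walking the backpointers. The field names and scores are illustrative and are not the actual Sphinx-4 classes.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    # One partial path hypothesis in the search manager's token tree.
    acoustic_score: float        # accumulated acoustic log score
    language_score: float        # accumulated language-model log score
    state_id: int                # reference into the sentence HMM
    frame: int                   # input feature frame this token covers
    prev: Optional["Token"]      # backpointer for recovering the best path

    def total(self):
        return self.acoustic_score + self.language_score

def backtrack(token):
    # Follow the backpointers to recover the state sequence of the hypothesis.
    states = []
    while token is not None:
        states.append(token.state_id)
        token = token.prev
    return states[::-1]

# Hypothetical two-frame path.
root = Token(-2.0, -1.0, state_id=0, frame=0, prev=None)
leaf = Token(-4.5, -1.8, state_id=1, frame=1, prev=root)
print(backtrack(leaf), leaf.total())   # [0, 1] -6.3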

b. Linguist

The linguist translates linguistic constraints provided to the system into an internal data structure called the grammar, which is usable by the search manager. Linguistic constraints are typically provided in the form of context-free grammars, N-gram language models, finite state machines etc. The grammar is a directed graph, where each node represents a set of words that may be spoken at a particular time. The nodes are connected by arcs which have associated language and acoustic probabilities that are used to predict the likelihood of transiting from one node to another. Sphinx-4 provides several grammar loaders that load various external grammar formats and generate the internal grammar structure. The pluggable nature of Sphinx-4 allows new grammar loaders to be easily added to the system. Grammar nodes are decomposed into a series of word states, one for each word represented by the node. Word states are further decomposed into pronunciation states, based on pronunciations extracted from a dictionary maintained by the linguist. Each pronunciation state is then decomposed into a series of unit states, where units may represent phonemes, diphones, etc., and can be specific to contexts of arbitrary length. Each unit is then further decomposed into its sequence of HMM states. The Sentence HMM thus comprises all of these states. States are connected by arcs that have language, acoustic and insertion probabilities associated with them.

c. Acoustic Scorer

The task of the acoustic scorer is to compute state output probability or density values for the various states, for any given input vector. The acoustic scorer provides these scores on demand to the search module. In order to compute these scores, the scorer must communicate with the front-end module to obtain the features for which the scores must be computed.

C. Knowledge Base

This module consists of the language model and the acoustic model. An acoustic model contains acoustic properties for each senone. There are context-independent models that contain properties (the most probable feature vectors for each phone) and context-dependent ones (built from senones with context). A language model is used to restrict the word search. It defines which words could follow previously recognized words (remember that matching is a sequential process) and helps to significantly restrict the matching process by stripping out words that are not probable. There are two types of models that describe language: grammars and statistical language models. Grammars describe very simple types of languages for command and control, and they are usually written by hand or generated automatically with plain code.

IV. Training

If one wants to create an acoustic model for a new language or dialect, or needs a specialized model for a small-vocabulary application, training should be done. The trainer learns the parameters of the models of the sound units using a set of sample speech signals, called a training database. This information is provided to the trainer through a file called the transcript file, in which the sequence of words and non-speech sounds is written exactly as it occurred in a speech signal, followed by a tag which can be used to associate this sequence with the corresponding speech signal. You have to design the database prompts and post-process the results to ensure that the audio actually corresponds to the prompts.


The file structure for the database is:

etc/
    your_db.dic                    - Phonetic dictionary
    your_db.phone                  - Phoneset file
    your_db.lm.DMP                 - Language model
    your_db.filler                 - List of fillers
    your_db_train.fileids          - List of files for training
    your_db_train.transcription    - Transcription for training
    your_db_test.fileids           - List of files for testing
    your_db_test.transcription     - Transcription for testing
wav/
    speaker_1/
        file_1.wav
    speaker_2/
        file_2.wav

The following packages are required for training:

sphinxbase-0.8

SphinxTrain-0.8

pocketsphinx-0.8

To start the training, change to the database folder and run:

sphinxtrain -t an4 setup

Replace an4 with your task name. After that, go to the database directory:

cd an4

To train, just run:

sphinxtrain run

V. Conclusion

Sphinx 4 is developed entirely in the Java programming language and is thus highly portable. It also supports multithreading and permits highly flexible user interfacing. Algorithmic innovations included in the system design enable it to incorporate multiple information sources in a more elegant manner compared to the other systems in the Sphinx family. Sphinx 4 is very flexible in its configuration, and in order to carry out speech recognition jobs it provides a context class that removes the need to set up each parameter of the object graph separately.

VI. References

[1] Paul Lamere, Philip Kwok et al., "The CMU Sphinx-4 Speech Recognition System", http://www.cs.cmu.edu/~rsingh/homepage/papers/icassp03-sphinx4_2.pdf

[2] http://cmusphinx.sourceforge.net/ Visited on September 2014.


Deep learning in Speech Processing

Alen Jacob M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad

[email protected]

Rajitha K M. Tech Computational Linguistics GEC, Sreekrishnapuram, Palakkad

[email protected]

ABSTRACT: Deep learning is becoming a mainstream technology in speech processing. In this article we present some work in deep learning related to speech processing. The main applications of speech processing are speech recognition and speech synthesis. In this document, Deep Neural Network based speech synthesis and Deep Tensor Neural Network based speech recognition are explained.

I. Introduction

Deep learning refers to a class of machine learning techniques where many layers of information processing stages in hierarchical architectures are exploited for pattern classification and for feature or representation learning. It lies at the intersection of the research areas of neural networks, graphical modeling, optimization, pattern recognition and signal processing.

Deep learning algorithms are based on distributed representations, a concept used in machine learning. The underlying assumption behind distributed representations is that observed data is generated by the interactions of many different factors on different levels. Deep learning adds the assumption that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition. Varying numbers of layers and layer sizes can be used to provide different amounts of abstraction.

The automatic conversion of written to spoken language is commonly called text-to-speech or simply TTS. The input is text and the output is a speech waveform. A TTS system is almost always divided into two main parts. The first of these converts text into what we will call a linguistic specification, and the second part uses that specification to generate a waveform. This division of the TTS system into two parts makes a lot of sense both theoretically and for practical implementation: the front end is typically language-specific, whilst the waveform generation component can be largely independent of the language.

Nowadays many speech synthesis systems exist, and the quality of a system is measured by the naturalness and intelligibility of the speech generated. Statistical parametric speech synthesis based on hidden Markov models (HMMs) has grown in popularity in the last decade. This system

Page 29: Clear Journal September 2014

CLEAR September 2014 29

simultaneously models spectrum, excitation, and duration of speech using context-dependent HMMs and generates speech waveforms from the HMMs themselves. This system offers the ability to model different styles without requiring the recording of very large databases. The major limitation of this method is the quality of synthesized speech.

To address the limitations of the context-dependent HMM-based speech synthesis method, an alternative scheme based on a deep architecture has been introduced. The decision trees in HMM-based statistical parametric speech synthesis perform a mapping from linguistic contexts extracted from the text to probability densities of speech parameters; here, the decision trees are replaced by a deep neural network (DNN). Until recently, neural networks with one hidden layer were popular, as they can represent arbitrary functions if they have enough units in the hidden layer. Although it is known that neural networks with multiple hidden layers can represent some functions more efficiently than those with one hidden layer, learning such networks was impractical due to the computational cost. However, recent progress in both hardware (e.g. GPUs) and software enables us to train a DNN from a large amount of training data. Deep neural networks have achieved large improvements over conventional approaches in various machine learning areas, including speech recognition and acoustic-articulatory inversion mapping. Note that neural networks have been used in speech synthesis since the 1990s.

Automatic speech recognition, the translation of spoken words into text, is still a challenging task due to the high variability in speech signals. Deep learning, sometimes referred to as representation learning or unsupervised feature learning, is a new area of machine learning. It is becoming a mainstream technology for speech recognition and has successfully replaced Gaussian mixtures for speech recognition and feature coding at an increasingly large scale.

II. Deep Neural Network Based Speech Synthesis

A DNN, which is a neural network with multiple hidden layers, is a typical implementation of a deep architecture: we obtain a deep architecture simply by adding multiple hidden layers to a neural network.
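As a toy illustration of the regression such a DNN performs, from linguistic context features for a frame to the corresponding speech parameters, the following sketch trains a small multi-layer network on random synthetic data. The feature and output dimensions, and the use of scikit-learn's MLPRegressor, are illustrative assumptions, not the setup of the systems cited here.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy stand-ins: 300 frames, 50 binary linguistic context features per frame
# (phone identity flags, position-in-syllable features, etc. in a real system),
# and 25 acoustic parameters per frame (e.g. mel-cepstral coefficients).
X = rng.integers(0, 2, size=(300, 50)).astype(float)
Y = rng.normal(size=(300, 25))

# The DNN here is a regressor with several hidden layers, replacing the
# decision-tree mapping from contexts to speech-parameter statistics.
dnn = MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=1000, random_state=0)
dnn.fit(X, Y)

frame_features = rng.integers(0, 2, size=(1, 50)).astype(float)
print(dnn.predict(frame_features).shape)   # (1, 25) predicted speech parameters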

The properties of the DNN are contrasted with those of the decision tree as follows:

- Decision trees are inefficient at expressing complicated functions of the input features, such as XOR, the d-bit parity function, or multiplexer problems. To represent such cases, decision trees must become prohibitively large, whereas these functions can be represented compactly by DNNs (a small sketch of this blow-up follows this list).

- Decision trees rely on partitioning the input space and using a separate set of parameters for each region associated with a terminal node. This reduces the amount of data per region and leads to poor generalization. Yu et al. showed that weak input features, such as word-level emphasis in reading speech, were thrown away while building decision trees. DNNs provide better generalization.
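A small sketch of the first point above (our own illustration, not taken from the cited papers): a decision tree fitted to the d-bit parity function needs a number of leaves that grows exponentially in d, whereas a neural network can represent the same function with on the order of d hidden units.

import itertools
import numpy as np
from sklearn.tree import DecisionTreeClassifier

for d in (4, 6, 8):
    # All 2**d binary inputs and their parity labels.
    X = np.array(list(itertools.product([0, 1], repeat=d)))
    y = X.sum(axis=1) % 2
    tree = DecisionTreeClassifier().fit(X, y)
    # The tree must isolate essentially every input pattern to fit parity exactly.
    print(f"d={d}: leaves in the fitted decision tree = {tree.get_n_leaves()}")
# A neural network, by contrast, can represent d-bit parity with roughly d hidden units.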


Fig. 3(c) is an alternative view of the same DTNN shown in Fig. 3(b). By defining $v^{l}$, the input to layer $l$, as

$v^{l} = \mathrm{vec}(h_{1}^{l-1} \otimes h_{2}^{l-1}) = \mathrm{vec}(h_{1}^{l-1} (h_{2}^{l-1})^{T}),$

this rewriting allows us to reduce and convert tensor layers into conventional matrix layers and to define the same interface for describing these two different types of layers. For example, in Fig. 3(c) the hidden layer $h^{l}$ can now be considered a conventional layer, as in Fig. 3(a), and can be learned using the conventional back-propagation (BP) algorithm. This rewriting also indicates that the tensor layer can be considered a conventional layer whose input comprises the cross product of the values passed from the previous layer.
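A minimal numpy sketch of this reduction (sizes and weights are arbitrary): the tensor layer's input is formed as the vec of an outer product, after which layer l behaves like any conventional sigmoid layer.

import numpy as np

d1, d2, n_hidden = 4, 3, 5
rng = np.random.default_rng(0)
h1 = rng.random(d1)              # h1^{l-1}, first part of the double-projection layer
h2 = rng.random(d2)              # h2^{l-1}, second part

# Tensor-layer input: vec of the outer product h1 (h2)^T.
v = np.outer(h1, h2).reshape(-1)         # shape (d1 * d2,)

# Once v is formed, layer l is an ordinary sigmoid layer, trainable with BP.
W = rng.random((n_hidden, d1 * d2))
b = rng.random(n_hidden)
h = 1.0 / (1.0 + np.exp(-(W @ v + b)))   # h^l, computed exactly as in a standard DNN layer
print(h.shape)                           # (5,)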

IV. Conclusion and Future Works

The DNN-based approach has the potential to address the limitations of the conventional decision-tree-clustered, context-dependent HMM-based approach.

Future work includes the reduction of computations in the DNN-based systems, adding more input features including weak features such as emphasis, and exploring a better log F0 modeling scheme.

A novel deep model, the DTNN, in which one or more layers are double-projection (DP) and tensor layers, was described for speech recognition tasks. An approach to map the tensor layers to conventional sigmoid layers was also shown, so that the former can be treated and trained in the same way as the latter. With this mapping, a DTNN can be considered a DNN augmented with DP layers, and the BP learning algorithm for DTNNs can be cleanly derived.

V. References

[1] Heiga Zen, Andrew Senior, and Mike Schuster, "Statistical Parametric Speech Synthesis Using Deep Neural Networks", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2013.

[2] Li Deng, "Three Classes of Deep Learning Architectures and Their Applications: A Tutorial Survey", Microsoft Research, Redmond, WA 98052, USA.

[3] Dong Yu, Li Deng, and Frank Seide, "Large Vocabulary Speech Recognition Using Deep Tensor Neural Networks", Interspeech, ISCA, September 2012.

[4] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large Vocabulary Speech Recognition", IEEE Trans. Audio, Speech, and Lang. Proc., vol. 20, no. 1, pp. 33-42, Jan. 2012.


Parallelising NLP Tasks Using the MapReduce Paradigm

Freny Clara Davis, Shalini M, Nidhin M Mohan, Shibin Mohan
S7, Computer Science & Engineering

GEC, Sreekrishnapuram, Palakkad

I. Introduction

POS Tagging: Part of speech tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus. Taggers play an important role in speech recognition, natural language parsing and information retrieval. The input to a tagging algorithm is a string of words and the specified tagset. The output is the best tag for each word.

Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form.
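As a quick illustration of both tasks, here is a small sketch using NLTK as one possible toolkit; it assumes NLTK is installed and that its tokenizer and POS-tagger data have already been downloaded, and the example sentence is arbitrary.

import nltk
from nltk.stem import PorterStemmer

sentence = "The cats were running quickly"
tokens = nltk.word_tokenize(sentence)

# POS tagging: one lexical-class tag per word, e.g. ('cats', 'NNS'), ('running', 'VBG').
print(nltk.pos_tag(tokens))

# Stemming: reduce each token to its stem, e.g. 'cats' -> 'cat', 'running' -> 'run'.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])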

POS tagging and stemming done manually are extremely limited in quality and quantity, and not everyone has the skill to do them well. Users therefore benefit from systems that perform these NLP tasks for them.

POS tagging is useful when we need to know what function a word plays in a sentence, rather than depending on the word form itself.

POS tagging and stemming can be useful in the following areas:

- Speech synthesis: pronunciation
- Speech recognition: class-based N-grams
- Information retrieval: stemming, selection of high-content words
- Word-sense disambiguation
- Corpus analysis of language, lexicography
- Information extraction
- Question answering (QA)
- Machine translation

Stemming can be used in the field of information retrieval, where it is required to find documents relevant to an information need from a large document set.

POS tagging and stemming are performed here using the MapReduce paradigm. MapReduce is a programming model and an associated implementation for processing large data sets with a parallel, distributed algorithm on a cluster, and it is expected to improve the performance of the current NLP system in which the tasks are done sequentially.

ABSTRACT: Natural Language Processing (NLP) refers to applications that deal with natural language in one way or another. The proposed system tries to parallelise NLP tasks using the map-reduce paradigm; the main NLP tasks it is intended to perform are POS tagging and stemming. This paper presents an approach to parallelising these tasks using the map-reduce paradigm.


II. Methodology

Usually, NLP tasks like tagging and stemming are done sequentially, and doing so may take a lot of time. This problem can be solved by performing the tasks using the map-reduce paradigm as implemented in the Hadoop framework. Hadoop scales up linearly to handle larger data sets by adding more nodes to the cluster, and it allows users to quickly write efficient parallel code. MapReduce is the associated programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map procedure that performs filtering and sorting and a Reduce procedure that performs a summary operation. The two steps performed in a map-reduce program are:

Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

MapReduce allows for distributed processing of the map and reduce operations. Provided that each mapping operation is independent of the others, all maps can be performed in parallel.
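As a concrete sketch of how POS tagging and stemming fit this model, the following mapper and reducer could be run under Hadoop Streaming. The file names mapper.py and reducer.py, and the use of NLTK for tagging and stemming, are illustrative assumptions; NLTK and its tagger data would have to be available on every worker node.

#!/usr/bin/env python
# mapper.py - for every token of an input line, emit "word POS stem" as the key and 1 as the value.
import sys
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for line in sys.stdin:
    tokens = nltk.word_tokenize(line.strip())
    for word, tag in nltk.pos_tag(tokens):
        print(f"{word} {tag} {stemmer.stem(word)}\t1")

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so identical keys are contiguous; sum their counts.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key == current_key:
        total += int(value)
    else:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")

Outside Hadoop, the same pipeline can be simulated locally with: cat input.txt | python mapper.py | sort | python reducer.py. Under Hadoop Streaming the two scripts are supplied as the mapper and reducer of a streaming job, and the framework provides the splitting, sorting and parallel execution described above.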

III. Conclusion

In this paper we presented an approach to parallelising NLP tasks using the map-reduce paradigm, and showed how the large amount of time required to perform these tasks sequentially can be reduced.

IV. References

[1] Chuck Lam, "Hadoop in Action", Manning Publications, 2012.

[2] Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition", Pearson Education, 2012.


TEAM INDIA DOES INDIA PROUD AT THE INTERNATIONAL LINGUISTICS OLYMPIAD

At the recently concluded 12th International Linguistics Olympiad held in Beijing, China, the Indian team bagged the following awards:

* Bronze Medal - Anindya Sharma, Bangalore

* Bronze Medal - Rajan Dalal, Ranchi

* Best Solution Award for Problem No.3 - Anindya Sharma, Bangalore

The International Linguistics Olympiad (IOL) is one of the newest in a group of twelve International Science Olympiads, and has been steadily growing in popularity over the last few years. The goal of the Olympiad is to introduce students to linguistics, a subject which, per se, is not taught in high schools across the world. This year, IOL participants had to decipher the grammar rules, kinship terms and word meanings of the Benabena, Kiowa, Engenni, Gbaya and Tangut (now extinct) languages, spoken in Papua New Guinea, North America, Nigeria, Congo, and central China respectively, each of which has fewer than a few thousand speakers and is on the verge of extinction.

India first competed in the IOL in 2009, and has participated in 6 Olympiads till date. Over the years, Team India has brought home 7 medals (3 Silver and 4 Bronze), 4 Best-Solution prizes, and 3 Honorable Mentions. Team India is chosen through the Panini Linguistics Olympiad conducted by the University of Mumbai, and actively supported by Microsoft Research India, as well as several other premier institutes from across the country, including JNU, IIT Guwahati, IIT Patna, IIT Kharagpur, SNLTR, EFLU and the Chennai Mathematical Institute. In a country like India with many languages, we need a lot more linguists and computational linguists to drive Indian language technology and research. The Linguistics Olympiad aims to realize this goal by exposing young minds to the concepts of linguistics and computational linguistics, presented in the form of interesting yet challenging puzzles.

The highlight of this year's IOL is that India won the bid to host the Olympiad in 2016. The Linguistics Olympiad is much less known in India than the other science Olympiads, primarily because linguistics is not taught in the schools. On the other hand, exposure to many languages makes Indian students naturally adept at this Olympiad. We need more support from the NLPAI community in spreading awareness about this Olympiad and helping us scale up our activities.

For more information on the Panini Linguistics Olympiad, visit the website https://sites.google.com/site/paninilinguisticsolympiad/ or send an email to [email protected]


M.Tech Computational Linguistics

2012-2014 Batch

Abitha Anto

Lekshmi T S Indu Meledathu Indhuja K

Gopalakrishnan G Divya M Deepa C A

Ancy Antony Athira Sivaprasad


M.Tech Computational Linguistics

2012-2014 Batch

Neethu Johnson

Varsha K V Sruthimol M P Sreejith C

Sreeja M Sincy V T Reshma O K

Nibeesh K Prajitha U


Article Invitation for CLEAR- Dec-2014

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of the CLEAR (Computational Linguistics in Engineering And Research) Journal, publishing in Dec 2014. The suggested areas of discussion are:

The articles may be sent to the Editor on or before 10th Dec, 2014 through the email [email protected].

For more details visit: www.simplegroups.in

Editor, CLEAR Journal

Representative, SIMPLE Groups
M.Tech Computational Linguistics
Dept. of Computer Science and Engg.,
Govt. Engg. College, Sreekrishnapuram, Palakkad
www.simplegroups.in
[email protected]

SIMPLE Groups: Students Innovations in Morphology, Phonology and Language Engineering


Hello World,

This is a very proud moment for all of us, as India has become the first Asian nation to reach Mars. This edition marks an important milestone in the history of CLEAR as well: it is the first printed edition of the CLEAR Journal. I am very glad to be a part of the editorial team to witness this precious moment.

With this issue of the CLEAR Journal, we bring you an edition on speech processing. It will provide a forum for students to enhance their background and get exposed to intricate research areas in the field of speech and audio signal processing. The exponential growth of audio and speech data, coupled with the increase in computing power, has led to the increasing popularity of deep learning for speech processing. This edition also includes some NLP-related work carried out by UG students and events conducted by the IT, EC and CS departments in our college.

I would like to sincerely thank the contributing authors for their effort to bring their insights and perspectives on the latest developments in speech and language engineering. Technical advances in speech processing and synthesis are posing new challenges and opportunities to researchers.

The pace of innovation continues. I wish our college could also fly high beyond the horizon of speech and language processing.

SIMPLE Groups welcomes more aspirants in this area.

Wish you all the best!!!

Nisha M

[email protected]
