Permanent Magnetic Articulograph (PMA) vs Electromagnetic Articulograph (EMA) in Articulation-to-Speech Synthesis for Silent Speech Interface

Beiming Cao1, Nordine Sebkhi3, Ted Mau4, Omer T. Inan3, Jun Wang1,2

1 Speech Disorders & Technology Lab, Department of Bioengineering
2 Callier Center for Communication Disorders
University of Texas at Dallas, Richardson, TX, USA
3 Inan Research Lab, School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA, USA
4 Department of Otolaryngology - Head and Neck Surgery
University of Texas Southwestern Medical Center, Dallas, TX, USA


Abstract

Silent speech interfaces (SSIs) are devices that enable speech communication when audible speech is unavailable. Articulation-to-speech (ATS) synthesis is a software design in SSI that directly converts articulatory movement information into audible speech signals. Permanent magnetic articulograph (PMA) is a wireless articulator motion tracking technology that is similar to commercial, wired Electromagnetic Articulograph (EMA). PMA has shown great potential for practical SSI applications because it is wireless. The ATS performance of PMA relative to current EMA, however, is unknown. In this study, we compared the performance of ATS using a PMA we recently developed and a commercially available EMA (NDI Wave system). Datasets with the same stimuli and size, collected from the tongue tip, were used in the comparison. The experimental results indicated that the performance of PMA was close to, although not as good as, that of EMA. Furthermore, in PMA, converting the raw magnetic signals to positional signals did not significantly affect the performance of ATS, which suggests that future PMA-based ATS work can focus on positional signals to maximize the benefit of spatial analysis.

1 Introduction

People who have had a laryngectomy have had their larynx surgically removed in the treatment of a condition such as laryngeal cancer (Bailey et al., 2006). The removal of the larynx prevents laryngectomees from producing speech sounds and inhibits their ability to communicate. Current approaches for improving their ability to communicate include the (intra- or extra-oral) artificial larynx (Baraff, 1994), tracheoesophageal puncture (TEP) (Robbins et al., 1984), and esophageal speech (Hyman, 1955). All of these approaches generate abnormal speech, such as the hoarse voicing of tracheoesophageal speech or the robotic voicing of an artificial larynx (Mau, 2010; Mau et al., 2012). These patients may feel depressed because of their health status and anxiety during social interactions, as they think that other people perceive them as abnormal, or they directly experience symbolic violence (Mertl et al., 2018). As a result, the development of communication aids that can produce normal-sounding speech is essential to improving the quality of life for patients in this population.

Silent speech interfaces (SSIs) are devices that convert non-audio biological signals, such as the movement of articulators, to audible speech (Denby et al., 2010). Unlike existing methods, SSIs are able to produce natural-sounding synthesized speech and even have the potential to recover the patients’ own voices. There are currently two types of software designs in SSI. One is a “recognition-and-synthesis” approach, which converts articulatory movement to text and then drives speech output using a text-to-speech synthesizer (Kim et al., 2017). The other design is direct articulation-to-speech (ATS) synthesis, which is more promising for SSI applications because ATS can run in real time. Currently, the prominent methods for capturing articulatory motion data include electromagnetic articulograph (EMA) (Schonle et al., 1987; Cao et al., 2018; Bocquelet et al., 2016), permanent magnet articulograph (PMA) (Gonzalez et al., 2014; Kim et al., 2018), ultrasound imaging (Csapo et al., 2017), surface electromyography (sEMG) (Diener et al., 2018), and non-audible murmur (NAM) (Nakajima et al., 2003). All of these technologies have their own advantages and disadvantages. PMA has recently shown its potential for SSI because it is wireless and suitable for future practical applications.

Figure 1: Our recently developed headset PMA device, where a small magnet is attached to the tongue tip.

Unlike EMA, which uses wired sensors attached to the articulators with an external magnetic field generator, PMA attaches (wireless) permanent magnets to articulators and uses magnetometers to capture the changes in the magnetic field generated by the motion of the magnets. These magnetic readings are then fed into a localization algorithm that estimates the 3D position of the magnet in the oral cavity (Sebkhi et al., 2017). Both EMA and PMA have been used in prior research on ATS (Cao et al., 2018; Gonzalez et al., 2017a; Cheah et al., 2018) with varying results. Although EMA has been shown to yield more precise measurements (Yunusova et al., 2009; Berry, 2011) than PMA (Sebkhi et al., 2017), EMA devices are normally cumbersome, as they require wired sensors to be attached to articulators. Additionally, EMA devices are normally expensive. In contrast, PMA devices are mostly very light and portable, rely on wireless tracking with permanent magnets as the tracers, and are affordable compared to EMA. Because PMA is wireless, portable, and low-cost, it offers an appealing alternative to EMA if it can achieve a level of performance similar to EMA in ATS systems. To our knowledge, however, no prior studies have directly compared the performance of these two technologies for SSI applications.

In this study, we compared the ATS performance of our recently developed PMA-based wireless tongue tracking system and a commercial EMA (NDI Wave system). We first examined whether it is more effective to use raw magnetic field signals than to use the converted magnet positional data (x, y, z coordinates) of PMA in ATS. Second, we compared the performance of EMA and PMA using tongue tip data only. A deep neural network (DNN)-based ATS model was used to evaluate the ATS performance for both EMA and PMA data. The dataset was collected from two groups of subjects who spoke the same stimuli using PMA or EMA, respectively. The tongue tip is the common flesh point in the PMA and EMA datasets, so tongue tip data were used for analysis in this study.

Figure 2: Wave System (EMA), where multiple sensors are attached to the tongue and lips. Only the tongue tip sensor data was used in the comparison with PMA.

2 Dataset

2.1 PMA Data Collection

Ten subjects (6 males and 4 females, average age: 24.1 ± 4.84 years) participated in the PMA data collection session, in which they repeated a list of 132 phrases twice at their habitual speaking rate. The first repetition was normal voiced speech, and the second repetition was unvoiced speech. In this study, only the voiced speech data was used. The phrases in the list are frequently spoken by users of augmentative and alternative communication (AAC) devices (Glennen and DeCoste, 1997). The PMA data was collected at the Georgia Institute of Technology.

The PMA data used in this study was collected with our newly developed wearable headset system, which is based on the same magnetic technology as the prior benchtop version, the multimodal speech capture system (MSCS) (Sebkhi et al., 2017). Figure 1 shows the wearable, wireless tongue tracking system, which uses PMA and a camera for tongue and lip motion capture, respectively. A microphone was used for audio recording. This PMA system has an embedded array of magnetometers that measure the change in the magnetic field generated by a magnetic tracer attached close to the tongue tip.

Figure 3: Lateral view samples of tongue tip trajectory captured by PMA and EMA when saying: “That is perfect!” (by two different subjects). [Panels: (a) EMA, (b) PMA; axis: z (mm).]

During a data collection session, a disk-shaped magnetic tracer (diameter = 3 mm, thickness = 1.5 mm, D21BN52, K&J Magnetics) was attached about 1 cm from the tongue tip. An array of 24 external 3-axis magnetometers (LSM303D, STMicroelectronics) is divided into six modules of 4 magnetometers each, which are positioned near the mouth so that there are two groups of 12 sensors, one near the right cheek and one near the left cheek. These sensors capture the magnetic field fluctuations generated by the tracer, which are fed into a localization algorithm that estimates the 3D position of the magnet every 10 ms (100 Hz). The spatial tracking accuracy of the PMA varies from 0.44 to 2.94 mm depending on the position and orientation of the tracer (Sebkhi et al., 2017). The audio was recorded at a sampling rate of 96000 Hz.

Previous studies (Gonzalez et al., 2017a; Cheah et al., 2018) show that combining multiple tracers on the tongue performed better than a single tracer (i.e., tongue tip). However, a smaller number of magnetic tracers on the tongue is critical for practical use in daily life (Kim et al., 2018). Future users of this technology would likely prefer to have only one permanently or semi-permanently attached magnetic tracer on their tongue. Even in lab experiments, attaching multiple tracers to the tongue takes longer and is logistically more difficult. In addition, with only one tracer on the tongue tip, the risk of accidentally biting it is very small (Laumann et al., 2015).

To provide the best tracking performance with a single tracer, the system relies on 24 magnetometers positioned outside the mouth to accurately track the tongue motion (Kim et al., 2018). The six magnetometer modules are connected via serial peripheral interface (SPI) to a sensor controller module (Kim et al., 2018) that also includes a USB interface to communicate with the PC. More technical details about the tracking technology can be found in (Sebkhi et al., 2017). In this study, although the headset is wearable, it was anchored to a support to avoid possible head motion during recording and thus provide the best positional accuracy.

2.2 EMA Data Collection

Another group of 10 gender- and age-matched subjects (6 males and 4 females, average age: 24.3 ± 3.50 years) participated in the EMA data collection session. These individuals read the same list of 132 phrases used in the PMA data collection session. The EMA dataset was collected at the University of Texas at Dallas.

The Wave system (Northern Digital Inc., Waterloo, Canada) was used for EMA data collection (Figure 2). Four small wired sensors were attached to the tongue tip (0.5 to 1 cm from the tongue apex), tongue back (20-30 mm back from the tongue tip, TT), upper lip, and lower lip using dental glue or tape. Additionally, a fifth (head) sensor was attached to the middle of the forehead for head correction. The 3D EMA data was sampled at 100 Hz, the same rate as the PMA data. The spatial precision of motion tracking is about 0.5 mm (Berry, 2011). Figure 3(a) gives an example of a two-dimensional (2D) EMA tongue tip movement trajectory (lateral view) when saying: “That is perfect!”. The sampling rate of the audio data was 22050 Hz. The NDI Wave system does not provide the raw magnetic signals.

Figure 4: ATS using DNN.

To ensure a fair comparison with the PMA device, only the tongue tip data collected using EMA was used in this study.

2.3 Data Preprocessing

To provide consistent acoustic features for EMA and PMA, the audio data in both datasets were resampled to the same sampling rate. The audio data in the PMA dataset was downsampled from 96000 Hz to 48000 Hz, and the audio data in the EMA dataset was upsampled from 22050 Hz to 48000 Hz. After that, the spectral envelope was extracted with the CheapTrick algorithm (Morise, 2015) and then converted to 60-dimensional mel-cepstral coefficients (MCCs) as the output acoustic features of the ATS model. The MCCs were extracted at a rate of 200 frames per second; therefore, the PMA and EMA data were upsampled from 100 Hz to 200 Hz to match the acoustic features.
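The rate conversions described above can be sketched as follows. The snippet derives the integer up/down factors for each audio conversion and doubles the rate of a 100 Hz articulatory channel by linear interpolation. The interpolation method is an assumption (the text does not say which method was used), and the function names and sample values are hypothetical.

```python
from fractions import Fraction


def resample_ratio(src_hz: int, dst_hz: int) -> Fraction:
    """Return dst/src as a reduced fraction (numerator = up factor, denominator = down factor)."""
    return Fraction(dst_hz, src_hz)


def upsample_2x_linear(samples: list[float]) -> list[float]:
    """Double the sampling rate by inserting the midpoint between neighbouring samples.

    Assumption: linear interpolation; the paper does not specify the method.
    The result has 2 * len(samples) - 1 points because nothing is extrapolated
    past the last sample.
    """
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + following) / 2.0)
    out.append(samples[-1])
    return out


pma_audio = resample_ratio(96000, 48000)  # down by a factor of 2
ema_audio = resample_ratio(22050, 48000)  # up by 320, then down by 147

tongue_z_100hz = [30.0, 32.0, 35.0]  # made-up z positions in mm
tongue_z_200hz = upsample_2x_linear(tongue_z_100hz)
```

The non-integer EMA audio ratio (320/147) is why a polyphase or similar rational resampler is normally used for that conversion, while the PMA audio only needs low-pass filtering and decimation by 2.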

Our PMA device captures the motion of the tongue tip as 72-channel raw magnetic signals (3 axes × 24 magnetometers). In addition to the raw magnetic signals, the 3D Cartesian positions of the magnetic tracer were obtained by localizing the raw magnetic signals with a nonlinear optimization method (Sebkhi et al., 2017). Figure 3(b) gives an example of a 2D trajectory (lateral view) of the magnetic tracer when saying “That is perfect!”, obtained by localizing the raw magnetic signals. Both the raw magnetic signals and the 3D positional signals were used in this study.

3 Method

3.1 Articulation-to-Speech Synthesis (ATS) Using Deep Neural Network (DNN)

The ATS model in this study uses a DNN to map articulatory signals (PMA or EMA) to acoustic features (MCCs) (Figure 4). The first- and second-order derivatives of both the input articulatory and the output acoustic data frames were computed and concatenated to the original frames for context information.
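A minimal sketch of appending first- and second-order derivatives (∆ and ∆∆) to a frame sequence is shown below. It assumes a simple central difference with the edge frames repeated; the paper does not state the exact derivative window, and the function names and sample frames are hypothetical.

```python
def delta(frames: list[list[float]]) -> list[list[float]]:
    """Central difference along time; the first and last frames are repeated at the edges."""
    n = len(frames)
    result = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        result.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return result


def add_deltas(frames: list[list[float]]) -> list[list[float]]:
    """Concatenate each frame with its delta and delta-delta, tripling the dimension."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Three made-up frames of 3-dimensional positional data (x, y, z).
positions = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
features = add_deltas(positions)  # each frame now has 3 + 3 + 3 = 9 values
```

Applied to a 3-dimensional positional frame this yields 9 values per frame, and applied to a 60-dimensional MCC frame it yields 180.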

The DNN has 6 hidden layers, each with 512 nodes and a rectified linear unit (ReLU) activation function. During DNN training, the Adam optimizer (Kingma and Ba, 2014) was used, the maximum number of training epochs was 50, and the learning rate was 0.008 for PMA data and 0.005 for EMA data. The performance of the ATS system was assessed using EMA positional data, PMA raw data, PMA positional data, and the combination of PMA raw and positional data. Therefore, the input dimensions of ATS in this study are 9 (3-dim. PMA or EMA positional + ∆ + ∆∆), 216 (72-dim. PMA raw magnetic signals + ∆ + ∆∆), and 225 (concatenation of 9-dim. and 216-dim.). The output dimension is 180 (60-dim. MCCs + ∆ + ∆∆). The DNN model in this study was implemented with the TensorFlow machine learning library (Abadi et al., 2016).
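The dimension bookkeeping in this paragraph can be checked with a few lines of arithmetic. The sketch below also counts trainable parameters under the assumption of plain fully connected layers with biases and a linear output layer; the paper does not describe the output layer, so that count is illustrative only. All names are hypothetical.

```python
HIDDEN_LAYERS = 6
HIDDEN_UNITS = 512


def with_deltas(base_dim: int) -> int:
    """Static features plus delta plus delta-delta."""
    return base_dim * 3


def layer_sizes(input_dim: int, output_dim: int) -> list[int]:
    """Layer widths from input to output for the 6 x 512 network."""
    return [input_dim] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [output_dim]


def count_parameters(sizes: list[int]) -> int:
    """Weights plus biases of consecutive fully connected layers (assumed layout)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


positional_dim = with_deltas(3)           # x, y, z of the tongue tip
raw_dim = with_deltas(3 * 24)             # 3 axes x 24 magnetometers
combined_dim = positional_dim + raw_dim   # raw and positional concatenated
output_dim = with_deltas(60)              # 60-dimensional MCCs

positional_model_params = count_parameters(layer_sizes(positional_dim, output_dim))
raw_model_params = count_parameters(layer_sizes(raw_dim, output_dim))
```

Under these assumptions the hidden layers dominate the parameter count, so switching from 9 to 216 inputs changes the model size only modestly.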

3.2 Experimental Setup

As mentioned previously, we first compared the ATS performance using raw PMA signals, converted positional data, or both. This experiment helps to understand which type of PMA data leads to the best performance. NDI Wave is a commercial system that does not provide any magnetic signals that have not been localized; thus, this experiment was conducted for our PMA system only. Second, we compared the best performance in PMA with the performance in EMA. The results reveal which technology (PMA or EMA) performs better.

A speaker-dependent setup was used in both experiments, as speaker-independent ATS is currently considered challenging due to the physiological differences among speakers. The ATS performance on each subject was averaged to obtain the final performance. Of the 132 phrases in both the PMA and EMA data, 110 phrases were used for training, 10 for validation, and 12 for testing.
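The per-speaker data partition and the averaging over subjects can be sketched as below. The paper does not say how phrases were assigned to the three sets, so the seeded random shuffle here is an assumption, and the function names are hypothetical.

```python
import random
from statistics import mean


def split_phrases(phrase_ids, n_train=110, n_valid=10, n_test=12, seed=0):
    """Shuffle phrase ids reproducibly and cut them into train/validation/test lists.

    Assumption: random assignment; the source only gives the set sizes.
    """
    if n_train + n_valid + n_test != len(phrase_ids):
        raise ValueError("split sizes must add up to the number of phrases")
    shuffled = list(phrase_ids)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def average_over_subjects(per_subject_scores):
    """Final score = mean of the speaker-dependent scores, one per subject."""
    return mean(per_subject_scores)


train_ids, valid_ids, test_ids = split_phrases(range(132))
```

In a speaker-dependent setup, one model is trained and tested per subject on that subject's own split, and only the resulting scores are pooled.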


Figure 5: Average MCD of 10 PMA subjects and 10 EMA subjects. Statistical significance between the results using EMA and all types of PMA data on the ATS model was computed with ANOVA tests. [Bar labels: Raw, Position, Both, Position; values shown: 7.83, 7.88, 7.74; significance markers: P < 0.01 (three).]

The ATS results were measured with mel-cepstral distortion (MCD). MCD is calculated by equation (1), where C and C^{gen} denote the original and generated mel-cepstral coefficients (MCCs), respectively, m is the frame step (or time), and d denotes the dth dimension in frame m. D is the dimension of MCCs, which is 60 in this study.

$$\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \left(C_{m,d} - C^{gen}_{m,d}\right)^{2}} \qquad (1)$$
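As a rough illustration, MCD can be computed as below, assuming the commonly used form MCD = (10 / ln 10) · sqrt(2 · Σ_d (C_{m,d} − C^{gen}_{m,d})²) for each frame m. Averaging the per-frame values over an utterance is also an assumption, and the function names are hypothetical.

```python
import math


def frame_mcd(original: list[float], generated: list[float]) -> float:
    """MCD in dB for one frame of mel-cepstral coefficients (assumed standard form)."""
    if len(original) != len(generated):
        raise ValueError("frames must have the same number of coefficients")
    squared_error = sum((c - g) ** 2 for c, g in zip(original, generated))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_error)


def mean_mcd(original_frames, generated_frames) -> float:
    """Average the per-frame MCD over all frames of an utterance (assumed pooling)."""
    if len(original_frames) != len(generated_frames) or not original_frames:
        raise ValueError("need two non-empty sequences of equal length")
    per_frame = [frame_mcd(o, g) for o, g in zip(original_frames, generated_frames)]
    return sum(per_frame) / len(per_frame)
```

Lower values mean the generated coefficients are closer to the original ones; identical frames give an MCD of zero.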


As mentioned, lip movement information was not used in this study, since PMA and EMA devices use different approaches for lip motion capture. PMA uses a computer vision algorithm to recognize the shape of the lips from images captured by an embedded camera, whereas EMA tracks the motion of sensors attached to the vermilion borders of the lips to estimate lip gestures. In addition, due to the relatively small data size, the synthesized audio samples did not have sufficiently high speech intelligibility for a listening test. Therefore, subjective listening tests were not conducted in this study.

4 Results and Discussion

4.1 Magnetic signals vs positional data in PMA

Experimental results are presented in Figure 5, where three-way ANOVA tests were used in the statistical analysis. First, for PMA, the performance using raw magnetic data was not significantly different from the performance using positional data only (p < 0.85), and was also not significantly different from that using combined raw magnetic field signals and positional data (p < 0.76). There was also no significant difference between the ATS performance using positional data only and that using combined raw magnetic field signals and positional data (p < 0.60).
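To make the kind of group comparison reported here concrete, the sketch below computes a one-way ANOVA F statistic from per-subject scores. This is a simplified stand-in, not the analysis used in the study (the text refers to three-way ANOVA tests), it stops at the F statistic without deriving a p-value, and the scores are made-up numbers rather than results from the paper.

```python
from statistics import mean


def one_way_anova_f(groups: list[list[float]]) -> float:
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical per-subject MCD values for two input types (not from the paper).
scores_input_a = [7.6, 7.9, 8.1, 7.8]
scores_input_b = [7.7, 8.0, 7.9, 7.9]
f_statistic = one_way_anova_f([scores_input_a, scores_input_b])
```

An F statistic close to or below 1 indicates that the spread between group means is no larger than the spread within groups, which corresponds to a non-significant difference.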

These findings suggest that, for PMA, either raw magnetic field signals or converted positional data can be used for a similar level of performance. Combining these two signals may not improve the performance. This finding is inconsistent with our prior study on silent speech recognition (SSR) using PMA data, where magnetic signals outperformed converted positional data (Kim et al., 2018). Further studies are needed to reveal why magnetic signals outperformed positional data in SSR while their performance in ATS was not significantly different.

The finding that positional data can perform similarly to magnetic data is encouraging for our future development of ATS using PMA. Although mapping the raw magnetic signals directly to acoustic features is more straightforward, transforming these signals to positional signals allows the use of articulation data processing methods, such as Procrustes matching (Gower, 1975; Kim et al., 2017), that cannot be easily applied to the raw data. In addition, a PMA positional data-based ATS can be decoupled from the device configuration, which makes it easier to change the number of sensors, their positions, their model, and their settings. Finally, a PMA positional data-based ATS has the potential to use EMA data for training, since both track the 3D motion of articulators.
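As an illustration of why positional data is convenient, the sketch below applies a basic Procrustes-style alignment to a 2D (lateral view) trajectory: remove translation, normalize scale, then rotate one shape onto a reference. It is a simplified 2D version without reflection, not the procedure used in the cited work, and the function names and sample points are hypothetical.

```python
import math

Point = tuple[float, float]


def center(points: list[Point]) -> list[Point]:
    """Translate the shape so its centroid is at the origin."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(x - cx, y - cy) for x, y in points]


def normalize(points: list[Point]) -> list[Point]:
    """Scale the (centered) shape to unit overall size."""
    norm = math.sqrt(sum(x * x + y * y for x, y in points))
    return [(x / norm, y / norm) for x, y in points]


def procrustes_align(shape: list[Point], reference: list[Point]):
    """Return (aligned shape, normalized reference) with point-to-point correspondence."""
    a = normalize(center(shape))
    b = normalize(center(reference))
    # Closed-form least-squares rotation angle for paired 2D points.
    cross = sum(x * v - y * u for (x, y), (u, v) in zip(a, b))
    dot = sum(x * u + y * v for (x, y), (u, v) in zip(a, b))
    theta = math.atan2(cross, dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    aligned = [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in a]
    return aligned, b


# A made-up reference shape and the same shape rotated 90 degrees, doubled, and shifted.
reference_shape = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
recorded_shape = [(5.0, 5.0), (5.0, 7.0), (3.0, 7.0), (3.0, 5.0)]
aligned_shape, reference_normalized = procrustes_align(recorded_shape, reference_shape)
```

A transform like this has no direct counterpart for the 72 raw magnetometer channels, since those values depend on sensor placement rather than on the geometry of the articulator path.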

4.2 PMA vs EMA

Second, when comparing the ATS performance using PMA data and EMA data, the results obtained using PMA are not as good as those obtained using EMA. EMA significantly outperformed all three PMA configurations (raw, positional, and raw + positional data) (p < 0.01, also in an ANOVA test).

Although the EMA-based ATS system outperformed the PMA-based system in our experiment, this finding does not negate the merits of PMA technology. PMA has been shown to reach a sufficiently good level of performance in ATS (Gonzalez et al., 2014, 2017a,b; Cheah et al., 2018); therefore, it is still a good fit for SSI applications.

In this study, we focused on the comparison of PMA and EMA, and only tongue tip motion was used for ATS. Other studies in the literature that have incorporated lip motion and other tongue flesh point motion have achieved high performance for PMA-based ATS (Gonzalez et al., 2014, 2017a,b; Cheah et al., 2018). In addition, this study used only MCD as the ATS performance measure. While MCD is a widely used measure of ATS performance, it does not fully represent the vocal quality of the resulting speech. Other acoustic measures, including band aperiodicity distortion (BAP) (Morise, 2016), root mean square error of fundamental frequencies (F0-RMSE), and voiced/unvoiced (V/UV) error rate, as well as listening tests, are needed to truly assess the differences between PMA and EMA; as explained, these have not been conducted at the current stage of this study.

Although the subjects in the two groups (PMA vs EMA) were age- and gender-matched and followed the same protocol (stimuli and data size), they were different subjects. The PMA and EMA systems were located in two different research laboratories, and they could not be placed at the same location for this study. Because the data were collected by two different teams and with different subjects for EMA and PMA, there are likely to be variations between the datasets that affect the outcome of the study. This issue will be resolved in a future study in which the same subjects will use both devices and the same operators will supervise the data collection sessions.

5 Conclusion and Future Work

In this study, we compared the ATS performance of a PMA-based tongue motion tracking device and a commercially available EMA (NDI Wave). We found that the raw magnetic signals and the transformed positional signals acquired from PMA have similar ATS performance. Although the PMA-based system did not perform as well as the EMA-based system in this single-tracer comparison, PMA still has great potential for SSI applications because it is wireless, affordable, portable, and easy to use. Future work will verify these findings using a larger dataset (both EMA and PMA) collected from the same speakers, and further improve the PMA measurement accuracy as well as the localization approach that converts raw magnetic signals to positional data.


Acknowledgments

This work was supported by the National Institutes of Health (NIH) under award number R03DC013990 and by the American Speech-Language-Hearing Foundation through a New Century Scholar Research Grant. We also thank Dr. Maysam Ghovanloo and the volunteering participants.

