Airwriting: Hands-free Mobile Text Input by Spotting and Continuous
Recognition of 3d-Space Handwriting with Inertial Sensors
Christoph Amma, Marcus Georgi, Tanja Schultz
Cognitive Systems Lab, Institute for Anthropomatics
Karlsruhe Institute of Technology (KIT)
Karlsruhe, Germany
{christoph.amma, tanja.schultz}@kit.edu, [email protected]
Abstract—We present an input method which enables complex hands-free interaction through 3d handwriting recognition. Users can write text in the air as if they were using an imaginary blackboard. Motion sensing is done wirelessly by accelerometers and gyroscopes which are attached to the back of the hand. We propose a two-stage approach for spotting and recognition of handwriting gestures. The spotting stage uses a Support Vector Machine to identify data segments which contain handwriting. The recognition stage uses Hidden Markov Models (HMM) to generate the text representation from the motion sensor data. Individual characters are modeled by HMMs and concatenated to word models. Our system can continuously recognize arbitrary sentences, based on a freely definable vocabulary with over 8000 words. A statistical language model is used to enhance recognition performance and restrict the search space. We report the results from a nine-user experiment on sentence recognition for person-dependent and person-independent setups on 3d-space handwriting data. For the person-independent setup, a word error rate of 11% is achieved; for the person-dependent setup, 3% is achieved. We evaluate the spotting algorithm in a second experiment on a realistic dataset including everyday activities and achieve a sample-based recall of 99% and a precision of 25%. We show that additional filtering in the recognition stage can detect up to 99% of the false positive segments.
Keywords—Wearable computers, User interfaces, Handwriting recognition, Accelerometers
I. INTRODUCTION
Gestures facilitate new forms of user interfaces, which
will be especially suited for mobile and wearable computer
systems. Instead of forcing a user to manually operate a
device, hand gestures allow operation with empty hands
and without the need to focus on tiny screens and keys.
Various sensing techniques (e.g. cameras, inertial sensors)
are used for the purpose of gesture recognition [1]. Ac-
celerometers are especially appealing for mobile usage
because of their small size, low cost and independence
from the environmental conditions. Past research mostly
concentrated on the recognition of a limited set of pre-
defined single gestures, which can then be mapped to
commands. This limits the number of possible commands
to the number of recognizable gestures. Operations like
the input of text or other complex operations require more
expressive power than a small set of isolated gestures can
offer.
Our approach combines the mobility and intuitiveness of
gestures as an input modality with the expressiveness
of handwriting. This paper describes a wearable gesture
recognition method based on inertial sensors that allows
for spotting and continuously recognizing whole sentences
written in the air. This comprises two main challenges.
Firstly, in a real-world application scenario, the system
will continuously measure hand motion, but only a small
part of the signal will actually contain handwriting. The
large part of the data will contain all sorts of everyday
motions which are irrelevant for the text input interface.
The data segments likely to contain handwriting must
be spotted in the data stream. Secondly, the actual text
must be recognized from the sensor signals. We use a
two-stage approach for the spotting and recognition task
which is illustrated in Figure 1. In the spotting stage, we
use Support Vector Machines to discriminate motion that
potentially contains handwriting from motion that does
not. In the recognition stage, we use Hidden Markov
Models in conjunction with a statistical language model to
recognize the written words. The two stages are evaluated
on different data sets independently but we show that
combining them by applying additional filtering of false
positive segments in the recognition stage can significantly
boost performance of the spotting stage.
While our current system is focused on the recognition
of text from continuous character gestures, it can serve
as a proof-of-concept system for any sort of gesture
recognition system that needs to continuously recognize
gestures which are constructed from an alphabet of primi-
tive gestures. We show that continuous gesture recognition
with inertial sensors is feasible for gesture vocabularies
that are several orders of magnitude larger than those of
known systems.
A. Related Work
The question of how to interact efficiently and intu-
itively with wearable and ubiquitous computer systems
has led to a multitude of different approaches. Besides
the widely used touchscreen keyboards in current smart-
phones, methods using only very few keys to reduce
size and to allow one-handed usage have been proposed
[2], [3]. Recently, research moved towards the exam-
ination of alternative interaction paradigms for mobile
computing by allowing hands-free operation. Gestures are
used to develop interfaces that do not rely on any sort
of device that needs to be held in the hand and therefore
allow seamless integration into people's everyday activities.
Miniature projectors display the screen on any surface in
Figure 3: Schematic illustration of the spotting stage.
one 3-dimensional gyroscope, both sampled at 819.2 Hz,
which is the maximum rate of the sensor. Therefore, each
sensor sample $s_t$ is described by the six-dimensional vector
$$s_t = (\mathbf{a}, \mathbf{g}) = ((a_x, a_y, a_z), (g_x, g_y, g_z))$$
with $\mathbf{a}$ being the vector of acceleration signals and $\mathbf{g}$ the
vector of angular rate (gyroscope) signals. The design
goals of this prototype were ease of development and
integration. Further miniaturization could be achieved with
off-the-shelf components. One possibility would be to
integrate all components in an unobtrusive wristlet.
III. SPOTTING STAGE
The spotting stage is used to perform a binary classifi-
cation of the data stream into segments that likely contain
handwriting and segments that do not. The segments
classified as potential handwriting motion are passed on
to the recognition stage. Ideally, the spotting algorithm
identifies all handwriting segments without producing a
large number of false positives and imposes a minimal
delay in the processing chain. We use a binary Support
Vector Machine (SVM) classifier with an RBF kernel
(γ = 2, C = 128) to discriminate non-writing signal
parts from potential writing signal parts. In order to allow
real-time usage and operation on continuous data streams,
we use a sliding window approach. Individual overlapping
windows are classified and the classification results of
all overlapping windows are then combined and can be
passed on to the recognition stage immediately. Figure 3
shows the architecture of the spotting stage. We use a
sliding window of length 1.1 s and a shift width of 0.12 s.
Successive windows are illustrated in the middle of Figure
3 as horizontal bars. The overlap is chosen to retain a fine
resolution at the beginning and end of relevant motion
patterns. On these windows the following features are
computed, where N denotes the number of sensor samples
per window:
• Average angular velocity: $\frac{1}{N} \sum_{i=1}^{N} \lVert g_i \rVert_2$
• Average mean-shifted acceleration: $\frac{1}{N} \sum_{i=1}^{N} \lVert a_i - \bar{a} \rVert_2$
• Distribution of power per frequency between 0 Hz and 12 Hz in 8 bins
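For concreteness, the following is a minimal sketch of this feature computation, assuming NumPy and a window passed as an (N, 6) array of samples (a_x, a_y, a_z, g_x, g_y, g_z); the exact frequency binning and normalization details are not spelled out in the paper and are assumptions here.

```python
import numpy as np

FS = 819.2               # sensor sampling rate in Hz (Section II)
WIN = int(1.1 * FS)      # window length: 1.1 s  -> 901 samples
SHIFT = int(0.12 * FS)   # shift width:   0.12 s ->  98 samples

def window_features(win):
    """10-dim feature vector for one window of shape (N, 6):
    columns 0-2 are (ax, ay, az), columns 3-5 are (gx, gy, gz)."""
    acc, gyr = win[:, :3], win[:, 3:]
    # average angular velocity: mean L2 norm of the gyroscope samples
    avg_gyr = np.linalg.norm(gyr, axis=1).mean()
    # average mean-shifted acceleration (removes the constant gravity part)
    avg_acc = np.linalg.norm(acc - acc.mean(axis=0), axis=1).mean()
    # power per frequency between 0 Hz and 12 Hz in 8 bins,
    # pooled over all six channels (assumed pooling)
    spec = np.abs(np.fft.rfft(win - win.mean(axis=0), axis=0)) ** 2
    freqs = np.fft.rfftfreq(win.shape[0], d=1.0 / FS)
    bin_idx = np.digitize(freqs, np.linspace(0.0, 12.0, 9)) - 1
    power = np.array([spec[bin_idx == b].sum() for b in range(8)])
    return np.concatenate(([avg_gyr, avg_acc], power))
```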
Visual inspection of the signals showed that the handwrit-
ing parts had high frequency and amplitude compared to
the non-writing parts in both acceleration and angular rate.
A frequency peak is present at around 3 Hz, depending on
the writing speed. Therefore, we chose features representing
angular rate and acceleration amplitude as well as the
frequency distribution. Defined by the window and shift
length, every single sensor sample is part of 9 windows.

Table I: Data corpus used for the spotting experiment; data
sets are disjoint.

Data set    Writing    No writing    Total
Training    272 min    111 min       383 min
Testing       4 min    111 min       115 min
For every window $w_t$ the SVM classifier $C(w_t)$ returns
0 if no handwriting is detected and 1 otherwise. This is
illustrated by the color of the horizontal bars in Figure 3.
One sensor sample $s_t$ is classified as writing motion if at
least one window containing $s_t$ is classified as handwriting
motion:
$$C(s_t) = \max_{k:\, s_t \in w_k} C(w_k) \qquad (1)$$
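Continuing the sketch above (reusing FS, WIN, SHIFT and window_features), the window classification and the max-combination of Eq. (1) could look as follows; the SVM is assumed to have been trained beforehand on a labeled corpus such as the one in Table I.

```python
import numpy as np
from sklearn.svm import SVC

# RBF-kernel SVM with the parameters stated above (gamma = 2, C = 128);
# svm.fit(features, labels) is assumed to have been run on training data
svm = SVC(kernel="rbf", gamma=2, C=128)

def spot(stream, svm):
    """Per-sample decision C(s_t) over a (T, 6) signal array:
    a sample is labeled as writing if ANY window covering it is
    classified as writing (Eq. 1)."""
    labels = np.zeros(len(stream), dtype=int)
    for start in range(0, len(stream) - WIN + 1, SHIFT):
        window = stream[start:start + WIN]
        if svm.predict([window_features(window)])[0] == 1:
            # max over covering windows: mark every sample in this window
            labels[start:start + WIN] = 1
    return labels
```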
We call a sequence of sensor samples classified as hand-
writing a handwriting segment. The lower bar in Figure
3 shows the resulting signal segmentation (one segment)
after combining the individual classification results. In
contrast to other classifier combination
schemes (e.g. majority vote) our combination scheme has a
bias towards the detection of handwriting motion, i.e. some
samples not belonging to actual handwriting will likely be
classified as handwriting. Thereby, we try to avoid losing
any important samples, since all handwriting samples
that we miss in the spotting stage are never forwarded
to the recognition stage and therefore lost. Furthermore,
short pauses during writing will not lead to gaps in the
detected handwriting segment. On the other hand, we
will get many non-handwriting segments forwarded to the
recognition stage. Usually, these segments are very short
and the recognition stage therefore produces empty or very
short hypotheses (1-3 characters in total). Thus, additional
filtering of the produced hypotheses in the recognition
stage could discard most of the false positive segments.
A sensor sample is passed on to the recognition stage as
soon as the last window it belongs to was classified. As a
result, the spotting stage imposes a delay of one window
length, i.e. 1.1 s, on further processing; choosing the
window length is always a trade-off between classification
accuracy and system delay. Our real-time demonstration
system shows that the chosen values are reasonable [19].
A. Experiments
We collected data from 3 subjects to evaluate our
spotting approach. We assume that inter-individual dif-
ferences are not as important for the discrimination of
handwriting and non-handwriting data as the variety of
recorded activities. Therefore we did not focus on a data
corpus including many subjects. Instead we gave priority
to collecting realistic everyday data. Table I lists the
data recorded and used for the experiments. The training
data containing handwriting was taken from our existing
corpus, including the data recorded for the recognition
containing the averaged 3d acceleration $(a_x, a_y, a_z)$ and
the averaged 3d angular rate $(g_x, g_y, g_z)$.
B. Recognition
We use continuous density Hidden Markov Models
to model each character’s signal sequences. HMMs are
graphical state-based statistical models that can be used
to model a wide range of signal sequences. A detailed
introduction to HMMs is beyond the scope of this paper,
the classic paper by Rabiner [21] gives an overview of
theory and application. One important property of HMMs
is the possibility to concatenate individual models to form
a new HMM. This concatenated HMM then models the
complete signal sequence of the two individual models.
In our case we can therefore use a small set of primitive
character models and can construct an arbitrary amount
of word models by concatenation of character models. We
use 26 character models for the capital letters from A to
Z and one model to represent the repositioning motion
that occurs between single characters. All our character
HMMs have a left-right topology with 30 states, the
repositioning model with 10 states. Transition probabilities
are not trained. The observation probabilities for each
state are modeled with Gaussian Mixture Models with
six components each and diagonal covariance matrices.
The character models were initialized on data of 14 writ-
ers from former experiments. These former experiments
mostly contained only recordings of single characters and
therefore had pauses at the beginning and the end. This
makes the models unsuitable for concatenation, but we
used their parameters to initialize the new models prior to
an EM training on the sentence data. None of the subjects
in the experiments described in this paper contributed to
the older data set.
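To make the concatenation property concrete, here is a minimal sketch that chains the left-right transition matrices of character models into a word model. The 0.5/0.5 self-loop/advance split is an assumption for illustration, since the paper states that transition probabilities are not trained; each state keeps its GMM observation density when copied into the word model.

```python
import numpy as np

CHAR_STATES, REPOS_STATES = 30, 10  # states per character / repositioning model

def left_right(n, p_stay=0.5):
    """Left-right topology: each state either loops or advances."""
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i], A[i, i + 1] = p_stay, 1.0 - p_stay
    A[-1, -1] = 1.0
    return A

def concat(models):
    """Chain character HMM transition matrices into one word HMM:
    the final state of each model feeds the first state of the next."""
    total = sum(m.shape[0] for m in models)
    A = np.zeros((total, total))
    off = 0
    for m in models:
        n = m.shape[0]
        A[off:off + n, off:off + n] = m
        if off + n < total:                  # not the last model:
            A[off + n - 1, off + n - 1] = 0.5
            A[off + n - 1, off + n] = 0.5    # exit into the next model
        off += n
    return A

# word model for "TO": T -> repositioning -> O
A_word = concat([left_right(CHAR_STATES),
                 left_right(REPOS_STATES),
                 left_right(CHAR_STATES)])
```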
Recognition results are output on word level, i.e. a
set of words, called the vocabulary, defines all words
that can be recognized. All possible word combinations,
including repetitions, constitute the actual search space
of the recognition problem, which therefore is several
orders of magnitude larger than the size of the vocabulary.
For example, the sentences in our data corpus contained
4.5 words on average. Considering the size of the large
8k vocabulary, there exist $8231^4 \approx 4.5 \cdot 10^{15}$ possible
sentences with 4 words. Since the number of words is not
known in advance, the search space is even larger. Clearly,
it is not feasible to construct one HMM for each of all
the possible word combinations and perform a Viterbi
alignment for each to find the most probable sequence.
We use a Viterbi beam search together with a prefix-
tree based search graph that allows for decoding word
beginnings only once if words share their prefix (e.g.
“this”, “then”, “their”, ...) [22]. A statistical language
model, which models the probabilities of certain word
sequences, is used to restrict the search space to word
sequences that occur in real-world text data. We use the
Janus Recognition Toolkit [23], originally developed for
speech recognition, to perform our experiments.
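The prefix sharing can be illustrated with a minimal trie sketch (the actual search graph lives inside the Janus decoder): word beginnings map to shared paths of character models and are therefore decoded only once.

```python
def build_prefix_tree(vocabulary):
    """Nested-dict trie; each edge corresponds to one character
    model during decoding."""
    root = {}
    for word in vocabulary:
        node = root
        for ch in word.upper():               # character models cover A-Z
            node = node.setdefault(ch, {})
        node.setdefault("$", set()).add(word)  # words ending at this node
    return root

tree = build_prefix_tree(["this", "then", "their"])
# "this", "then" and "their" share the path T -> H,
# which is therefore decoded only once
```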
C. Vocabulary
We use two vocabularies of different sizes in our
experiments to evaluate the effect of vocabulary size on
scalability. If the vocabulary gets larger, the recogni-
tion task gets harder, since the search space grows. The
small vocabulary (V1k) contains 986 words and the large
vocabulary (V8k) contains 8231 words. Both vocabularies
are taken from the University of Leipzig’s list of frequent
English words, which is obtained by webcrawling and
provided for free¹. We applied additional cleaning steps
on the list, because it contains entries that are not actually
words. All entries consisting of only one character were
deleted, except “i” and “a”. All entries consisting of two
or three letters which either are not correct words by
themselves (e.g. “gl”) or are abbreviations (e.g. “os”) were
deleted. All words longer than three characters remained
unchanged. Finally, all words contained in the sentences
written by the subjects were added to the vocabularies.
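A sketch of these cleaning rules; valid_short_words is a hypothetical whitelist of legitimate two- and three-letter words, since the paper does not state how these were judged.

```python
def clean_vocabulary(entries, valid_short_words):
    """Apply the cleaning rules described above to a raw word list."""
    cleaned = []
    for entry in entries:
        w = entry.lower()
        if len(w) == 1 and w not in {"i", "a"}:
            continue  # drop one-character entries except "i" and "a"
        if len(w) in (2, 3) and w not in valid_short_words:
            continue  # drop fragments ("gl") and abbreviations ("os")
        cleaned.append(entry)  # everything else is kept unchanged
    return cleaned
```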
D. Language Model
A statistical n-gram language model is used to model
the dependencies between successive words in order to
improve the recognition performance. Statistical language
models give information on the probability of a certain
sequence of words and are normally generated in a data-
driven manner from large text corpora. We use a 3-gram language model,
which means it contains the probabilities of word se-
quences up to a length of three. Given a sequence of three
words $w_1 w_2 w_3$, the language model returns the conditional
probability $P(w_3 \mid w_1 w_2)$ of $w_3$, given that the words $w_1 w_2$
were recognized so far. This probability is multiplied with
the probability of the signal sequence given by the HMM
whenever a complete word was recognized. The language
model we use contains 60000 words. It was generated for
a speech recognition system by crawling English internet
sites [24]. The relevant statistical measures for the language
model are given in Table II. The 1-gram coverage for
both vocabularies used in the experiments is 100%, i.e. the
language model contains all words in the vocabularies.
The 2-gram and 3-gram coverage is lower, but if for
example a 3-gram cannot be found, the language model
falls back to bi- or unigrams. We took the language model
as is, i.e. we did not change it to optimize it towards
the specific task described in this paper, like giving the
word sequences occurring in our chosen sentences higher
probabilities. This is reflected in the reported perplexities.
The perplexity is an information theoretic measure to
evaluate the predictive quality of a language model on
a given set of word sequences. It can be interpreted as the
average number of possibilities for the next word if the
model had to choose uniformly and independently among
all possibilities. The perplexities in Table II are given for
the language model restricted to the vocabularies, i.e. the
language model is reduced to contain only words that are
contained in the vocabulary.
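For reference, the standard definition behind this interpretation, with the trigram model backing off to bi- and unigrams as described above:
$$\mathrm{PP} = P(w_1, \ldots, w_N)^{-\frac{1}{N}} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-2}, w_{i-1})},$$
where $N$ is the number of words in the test set.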
¹ wortschatz.uni-leipzig.de/html/wliste
Table II: Perplexity and coverage of the language model.

              Perplexity          Coverage (%)
              V1k      V8k        1-gram   2-gram   3-gram
              294.44   304.67     100.00   97.98    64.13
E. Experiments
We collected handwriting data from nine subjects. None
of the subjects had used the system before. Every subject
had to write 80 English sentences taken from broadcast
news. The sentence length varied from 2 to 6 words. In
total, every subject wrote 366 words and 1778 characters,
resulting in 3 h and 56 min of recorded data over all
subjects. Each character of the alphabet appeared in the
sentences multiple times. The subjects were sitting in a
chair in front of a computer screen showing the next
sentence to write. They were told to write in the air in front
of them as if they were writing on an imaginary blackboard.
Instead of writing horizontally from left to right, the
subjects were asked to write in place. All subjects were
told to write in block capital letters with fixed wrist and
an approximate height of 20 cm per character. No further
constraints on speed or quality of writing were set. Due
to the lack of visual feedback and the unfamiliar writing
style, it is harder to write without errors, but a writer
normally realizes if he or she makes an error. The subjects were
told to correct their writing errors themselves by repeating
the respective sentence. The segmentation of the data was
done by the participants with a key press before and after
each sentence, i.e. participants were able to make pauses
between sentences, which was necessary since writing in
the air gets fatiguing after several sentences. Although
this experiment is done on manually segmented data, we
can transfer the results to the combined setting with spotting
and recognition. The recognition performance on actual
handwriting segments will be approximately the same as
the one reported in this experiment. All false positive
segments that cannot be filtered out would produce errors.
Since spotting together with filtering works well in our
spotting experiments, the overall performance is expected
to be only slightly worse.
F. Performance Metric
We measure our system by calculating the word error
rate, which is the metric commonly used to evaluate
the performance of speech and handwriting recognition
systems. The hypothesis of the recognizer is aligned with
the true reference sentence by computing the minimal edit
distance (Levenshtein distance) on word level. This gives
the number of substitution (S), insertion (I) and deletion
(D) errors that occur by aligning hypothesis and reference.
For a set of sentences $S = \{s_1, \ldots, s_k\}$ the word error
rate is computed by
$$\mathrm{WER}_S = \frac{\sum_{i=1}^{k} \left( S_{s_i} + I_{s_i} + D_{s_i} \right)}{\sum_{i=1}^{k} N_{s_i}} \cdot 100,$$
where $N_{s_i}$ is the total number of words in the reference
sentence $s_i$.

Table III: Results of the person-independent evaluation
with and without language model.

                no LM           3-gram LM
                V1k    V8k      V1k    V8k
average WER     37%    49%       9%    11%
stddev WER      17%    19%       8%     9%

The following example illustrates WER computation for
one single sentence:
Reference:   we   had   a   lot   of   expertise
Hypothesis:  he   had       lot   of   expert ease
Error:       S          D              S      I

The word error rate in the above example is WER =
(2 + 1 + 1)/6 · 100 ≈ 66.7%. One should notice that the
WER can be larger than 100% because the hypothesis can
be longer than the reference.
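A compact sketch of this computation as word-level Levenshtein distance, assuming whitespace-tokenized strings:

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: minimal edit distance (substitutions, insertions,
    deletions) between word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[-1][-1] / len(ref)

# the example above: 4 errors over 6 reference words -> ~66.7%
print(word_error_rate("we had a lot of expertise",
                      "he had lot of expert ease"))
```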
G. Person Independent
To evaluate person-independent performance, a per-
person cross-validation was performed. The data of one
person was chosen as test set and the data of all other
persons was taken as training set. As stated in IV-B, we
took the initialized models from former experiments and
performed three iterations of EM training. Table III shows
the summary of cross-validation results. The average word
error rate is 9% for the small vocabulary and 11% for the
large one.
The comparison of the two vocabulary sizes is promis-
ing. Vocabulary V8k is approximately eight times larger
than V1k; however, the error rate on the large vocabulary
is only a factor of 1.2 higher than for the small
vocabulary. Therefore, we assume our approach scales to
even larger vocabularies.
If we compare the results acquired by using the lan-
guage model to the results without any language model
we see a significant boost in recognition performance.
As explained in section IV-B the search space is very
large and if no language model is used, ambiguities
between words cannot be resolved on the language level.
Therefore results are expected to be worse. Nevertheless,
still more than 50% of all words are correctly recognized.
Considering the size of the search space, this is still a
notable result and shows the discriminative quality of our
HMM based recognition system.
Figure 5 shows the breakdown of results for each
subject individually. On the data of subjects D and I,
the performance is very low compared to the average.
From observation during the experiments it is known that
subject D did not write with fixed wrist all the time, i.e.
there was more rotational motion in the wrist joint. As
a result the fraction of gravitational force in the sensor
signals cannot be assumed constant during writing and
the mean normalization of the signals does not remove
the gravitational acceleration. This might account for the
high word error rate.
Figure 5: Per-subject word error rates (WER in %, subjects
A–I) for the person-independent (PI) and person-dependent
(PD) case.
For subject I, there is no such obvious reason why the
performance is so low. We showed in our past experiments
[18] that even for the case of block capital letters, people
do write characters very differently. In our dataset of
ten writers in [18], five different variants of writing an
“E” were observed, mainly distinguished in the order and
direction of strokes. If one writer has a very different
writing style compared to the majority, we would expect
the system to have a much lower performance. This could
be a possible reason for the performance drop for subject
I. On average, the system achieves very good accuracies.
For some subjects a word error rate of 2% on the large
vocabulary is reached, which means that approximately
191 of 195 words were recognized correctly.
H. Person Dependent
Although our aim is to build a system that works
independent of its user, we evaluated the person dependent
performance. First, this gives an upper bound to which
the user-independent system can be compared, since
we generally expect a higher performance if the system
is specifically trained on a user's own data. Second, the
results give insight into what could be expected from runtime
adaptation methods. Runtime adaptation was not used for
the experiments described in this paper but might be a
future option. The person-independent models from the
experiments described in the former section were taken
as initialization. The person dependent performance of the
system was evaluated on each subject's data with a ten-fold
cross validation on the 80 sentences of each user with one
iteration of EM training. The large vocabulary (V8k) and
the 3-gram language model were used for the evaluation.
Figure 5 shows the results of the person dependent evalu-
ation for each subject together with the word error rate for
the person independent case. Subject specific retraining of
the models greatly improves recognition performance. The
average word error rate drops from 11% to 3%. Subjects
D and I, for which the person independent performance
is rather low, also reach very good performance after
retraining.
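A sketch of this protocol, assuming scikit-learn for the fold splits; retrain_one_em_iteration and decode are hypothetical stand-ins for the toolkit-specific EM step and decoding run.

```python
from sklearn.model_selection import KFold

def person_dependent_eval(sentences, pi_models,
                          retrain_one_em_iteration, decode):
    """Ten-fold CV over one subject's 80 sentences: each fold starts
    from the person-independent models, retrains with one EM iteration
    on 72 sentences and decodes the held-out 8 (V8k, 3-gram LM)."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=10).split(sentences):
        models = retrain_one_em_iteration(
            pi_models, [sentences[i] for i in train_idx])
        errors.append(decode(models, [sentences[i] for i in test_idx]))
    return sum(errors) / len(errors)  # average WER over folds
```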
V. CONCLUSION
We presented a wearable gestural input method which
is capable of spotting and continuously recognizing text
written in the air based on inertial sensors attached to
the back of the hand. We conducted experiments on con-
tinuous recognition of sentences written in the air with
underlying vocabularies of more than 8000 words.
To our knowledge, a gesture vocabulary of more than
8000 continuously recognizable gestures is significantly
larger than in any previously reported system. For
the person-independent case an average word error rate
of 11% was reached. The results can be transferred to
other domains of gesture recognition tasks where specific
gestures are built from a smaller set of primitives. None
of the used techniques is specific to the problem of
handwriting recognition. We proposed a spotting algorithm
and showed its applicability to our task. The proposed
architecture and methods allow the implementation of a
system operating in realtime on continuous data [19].
REFERENCES
[1] S. Mitra and T. Acharya, “Gesture Recognition: A Survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 37, pp. 311–324, 2007.
[2] K. Lyons, T. Starner, D. Plaisted, J. Fusia, A. Lyons, A. Drew, and E. W. Looney, “Twiddler typing: one-handed chording text entry for mobile phones,” in Proc. of the SIGCHI Conference on Human Factors in Computing Systems (CHI’04), 2004.
[3] I. S. MacKenzie, R. W. Soukoreff, and J. Helga, “1 thumb, 4 buttons, 20 words per minute: design and evaluation of H4-Writer,” in Proc. of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST’11), 2011.
[4] P. Mistry, P. Maes, and L. Chang, “WUW - wear ur world: a wearable gestural interface,” in Proc. of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’09), 2009.
[5] E. Tamaki, T. Miyaki, and J. Rekimoto, “Brainy hand: an ear-worn hand gesture interaction device,” in Proc. of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’09), 2009.
[6] S. Gustafson, D. Bierwirth, and P. Baudisch, “Imaginary interfaces: spatial interaction with empty hands and without visual feedback,” in Proc. of the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST’10), 2010.
[7] H. Junker, O. Amft, P. Lukowicz, and G. Tröster, “Gesture spotting with body-worn inertial sensors to detect user activities,” Pattern Recognition, vol. 41, no. 6, pp. 2010–2024, 2008.
[8] G. Raffa, J. Lee, L. Nachman, and J. Song, “Don’t slow me down: Bringing energy efficiency to continuous gesture recognition,” in Proc. International Symposium on Wearable Computers (ISWC’10), 2010.
[9] H.-K. Lee and J. Kim, “An HMM-based threshold model approach for gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 961–973, Oct. 1999.
[10] T. Stiefmeier, D. Roggen, G. Tröster, G. Ogris, and P. Lukowicz, “Wearable activity tracking in car manufacturing,” IEEE Pervasive Computing, vol. 7, no. 2, p. 42, 2008.
[11] S. Kallio, J. Kela, and J. Mäntyjärvi, “Online gesture recognition system for mobile interaction,” in Proc. IEEE International Conference on Systems, Man and Cybernetics (ICSMC’03), 2003.
[12] A. Hein, A. Hoffmeyer, and T. Kirste, “Utilizing an accelerometric bracelet for ubiquitous gesture-based interaction,” in Universal Access in Human-Computer Interaction. Intelligent and Ubiquitous Interaction Environments, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2009, vol. 5615, pp. 519–527.
[13] O. Amft, R. Amstutz, A. Smailagic, D. Siewiorek, and G. Tröster, “Gesture-controlled user input to complete questionnaires on wrist-worn watches,” in Human-Computer Interaction. Novel Interaction Methods and Techniques, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2009, vol. 5611, pp. 131–140.
[14] D. Kim, H. Choi, and J. Kim, “3d space handwriting recognition with ligature model,” in Ubiquitous Computing Systems, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2006, vol. 4239, pp. 41–56.
[15] W.-C. Bang, W. Chang, K.-H. Kang, E.-S. Choi, A. Potanin, and D.-Y. Kim, “Self-contained spatial input device for wearable computers,” in Proc. IEEE International Symposium on Wearable Computers (ISWC’03), 2003.
[16] R. Plamondon and S. Srihari, “Online and off-line handwriting recognition: a comprehensive survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63–84, 2000.
[17] R. McGuire, J. Hernandez-Rebollar, T. Starner, V. Henderson, H. Brashear, and D. Ross, “Towards a one-way American sign language translator,” in Proc. Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FGR’04), 2004.
[18] C. Amma, D. Gehrig, and T. Schultz, “Airwriting recognition using wearable motion sensors,” in Proc. of the 1st Augmented Human International Conference (AH’10), 2010.
[19] C. Amma and T. Schultz, “Airwriting: Demonstrating Mobile Text Input by 3d-Space Handwriting,” in Proc. of the ACM International Conference on Intelligent User Interfaces (IUI’12), 2012.
[20] O. J. Woodman, “An introduction to inertial navigation,” University of Cambridge, Tech. Rep., 2007.
[21] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[22] X. Huang, A. Acero, and H. Hon, Spoken Language Processing. Prentice Hall, 2001.
[23] H. Soltau, F. Metze, C. Fügen, and A. Waibel, “A one-pass decoder based on polymorphic linguistic context assignment,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’01), 2001.
[24] T. Schultz, “GlobalPhone: A multilingual speech and text database developed at Karlsruhe University,” in Proc. of the International Conference on Spoken Language Processing (ICSLP’02), 2002.