Airwriting: Hands-free Mobile Text Input by Spotting and Continuous
Recognition of 3d-Space Handwriting with Inertial Sensors
Christoph Amma, Marcus Georgi, Tanja Schultz
Cognitive Systems Lab, Institute for Anthropomatics
Karlsruhe Institute of Technology (KIT)
Karlsruhe, Germany
{christoph.amma, tanja.schultz}@kit.edu, [email protected]
Abstract—We present an input method which enables complex hands-free interaction through 3d handwriting recognition. Users can write text in the air as if they were using an imaginary blackboard. Motion sensing is done wirelessly by accelerometers and gyroscopes which are attached to the back of the hand. We propose a two-stage approach for spotting and recognition of handwriting gestures. The spotting stage uses a Support Vector Machine to identify data segments which contain handwriting. The recognition stage uses Hidden Markov Models (HMM) to generate the text representation from the motion sensor data. Individual characters are modeled by HMMs and concatenated to word models. Our system can continuously recognize arbitrary sentences, based on a freely definable vocabulary with over 8000 words. A statistical language model is used to enhance recognition performance and restrict the search space. We report the results from a nine-user experiment on sentence recognition for person-dependent and person-independent setups on 3d-space handwriting data. For the person-independent setup, a word error rate of 11% is achieved; for the person-dependent setup, 3% is achieved. We evaluate the spotting algorithm in a second experiment on a realistic dataset including everyday activities and achieve a sample-based recall of 99% and a precision of 25%. We show that additional filtering in the recognition stage can detect up to 99% of the false positive segments.
Keywords—Wearable computers, User interfaces, Handwriting recognition, Accelerometers
I. INTRODUCTION
Gestures facilitate new forms of user interfaces, which
will be especially suited for mobile and wearable computer
systems. Instead of forcing a user to manually operate a
device, hand gestures allow operation with empty hands
and without the need to focus on tiny screens and keys.
Various sensing techniques (e.g. cameras, inertial sensors)
are used for the purpose of gesture recognition [1]. Ac-
celerometers are especially appealing for mobile usage
because of their small size, low cost and independence
from the environmental conditions. Past research mostly
concentrated on the recognition of a limited set of pre-
defined single gestures, which can then be mapped to
commands. This limits the number of possible commands
to the number of recognizable gestures. Operations like
the input of text or other complex operations require more
expressive power than a small set of isolated gestures can
offer.
Our approach combines the mobility and intuitiveness of
gestures as an input modality with the expressiveness
of handwriting. This paper describes a wearable gesture
recognition method based on inertial sensors that allows
for spotting and continuously recognizing whole sentences
written in the air. This comprises two main challenges.
Firstly, in a real-world application scenario, the system
will continuously measure hand motion, but only a small
part of the signal will actually contain handwriting. The
large part of the data will contain all sorts of everyday
motions which are irrelevant for the text input interface.
The data segments likely to contain handwriting must
be spotted in the data stream. Secondly, the actual text
must be recognized from the sensor signals. We use a
two-stage approach for the spotting and recognition task
which is illustrated in Figure 1. In the spotting stage, we
use Support Vector Machines to discriminate motion that
potentially contains handwriting from motion that does
not. In the recognition stage, we use Hidden Markov
Models in conjunction with a statistical language model to
recognize the written words. The two stages are evaluated
on different data sets independently but we show that
combining them by applying additional filtering of false
positive segments in the recognition stage can significantly
boost performance of the spotting stage.
While our current system is focused on the recognition
of text from continuous character gestures, it can serve
as a proof-of-concept system for any sort of gesture
recognition system that needs to continuously recognize
gestures which are constructed from an alphabet of primi-
tive gestures. We show that continuous gesture recognition
with inertial sensors is feasible for gesture vocabularies
that are several orders of magnitude larger than those of
known systems.
A. Related Work
The question of how to interact efficiently and intu-
itively with wearable and ubiquitous computer systems
has led to a multitude of different approaches. Besides
the widely used touchscreen keyboards in current smart-
phones, methods using only very few keys to reduce
size and to allow one-handed usage have been proposed
[2], [3]. Recently, research moved towards the exam-
ination of alternative interaction paradigms for mobile
computing by allowing hands-free operation. Gestures are
used to develop interfaces that do not rely on any sort
of device that needs to be held in the hand and therefore
allow seamless integration into people's everyday activities.
Miniature projectors display the screen on any surface in
Figure 3: Schematic illustration of the spotting stage.
one 3-dimensional gyroscope, both sampled at 819.2 Hz,
which is the maximum rate of the sensor. Therefore, each
sensor sample $s_t$ is described by the six-dimensional vector
$$s_t = (\mathbf{a}, \mathbf{g}) = ((a_x, a_y, a_z), (g_x, g_y, g_z))$$
with $\mathbf{a}$ being the vector of acceleration signals and $\mathbf{g}$ the
vector of angular rate (gyroscope) signals. The design
goals of this prototype were ease of development and
integration. Further miniaturization could be achieved with
off-the-shelf components. One possibility would be to
integrate all components in an unobtrusive wristlet.
III. SPOTTING STAGE
The spotting stage is used to perform a binary classifi-
cation of the data stream into segments that likely contain
handwriting and segments that do not. The segments
classified as potential handwriting motion are passed on
to the recognition stage. Ideally, the spotting algorithm
identifies all handwriting segments without producing a
large number of false positives and imposes a minimal
delay in the processing chain. We use a binary Support
Vector Machine (SVM) classifier with an RBF kernel
(γ = 2, C = 128) to discriminate non-writing signal
parts from potential writing signal parts. In order to allow
real-time usage and operation on continuous data streams,
we use a sliding window approach. Individual overlapping
windows are classified and the classification results of
all overlapping windows are then combined and can be
passed on to the recognition stage immediately. Figure 3
shows the architecture of the spotting stage. We use a
sliding window of length 1.1 s and a shift width of 0.12 s.
Successive windows are illustrated in the middle of Figure
3 as horizontal bars. The overlap is chosen to retain a fine
resolution at the beginning and end of relevant motion
patterns. On these windows the following features are
computed, where N denotes the number of sensor samples
per window:
• Average angular velocity: $\frac{1}{N} \sum_{i=1}^{N} \lVert g_i \rVert_2$
• Average mean-shifted acceleration: $\frac{1}{N} \sum_{i=1}^{N} \lVert a_i - \bar{a} \rVert_2$
• Distribution of power per frequency between 0 Hz and 12 Hz in 8 bins
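For concreteness, the following is a minimal sketch of this feature computation, assuming NumPy and a window passed as an (N, 6) array of samples (a_x, a_y, a_z, g_x, g_y, g_z); the exact frequency binning and normalization details are not spelled out in the paper and are assumptions here.

```python
import numpy as np

FS = 819.2               # sensor sampling rate in Hz (Section II)
WIN = int(1.1 * FS)      # window length: 1.1 s  -> 901 samples
SHIFT = int(0.12 * FS)   # shift width:   0.12 s ->  98 samples

def window_features(win):
    """10-dim feature vector for one window of shape (N, 6):
    columns 0-2 are (ax, ay, az), columns 3-5 are (gx, gy, gz)."""
    acc, gyr = win[:, :3], win[:, 3:]
    # average angular velocity: mean L2 norm of the gyroscope samples
    avg_gyr = np.linalg.norm(gyr, axis=1).mean()
    # average mean-shifted acceleration (removes the constant gravity part)
    avg_acc = np.linalg.norm(acc - acc.mean(axis=0), axis=1).mean()
    # power per frequency between 0 Hz and 12 Hz in 8 bins,
    # pooled over all six channels (assumed pooling)
    spec = np.abs(np.fft.rfft(win - win.mean(axis=0), axis=0)) ** 2
    freqs = np.fft.rfftfreq(win.shape[0], d=1.0 / FS)
    bin_idx = np.digitize(freqs, np.linspace(0.0, 12.0, 9)) - 1
    power = np.array([spec[bin_idx == b].sum() for b in range(8)])
    return np.concatenate(([avg_gyr, avg_acc], power))
```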
Visual inspection of the signals showed that the handwrit-
ing parts had high frequency and amplitude compared to
the non-writing parts in both acceleration and angular rate.
A frequency peak is present at around 3 Hz, depending on
the writing speed. Therefore, we chose features representing
angular rate and acceleration amplitude as well as the
frequency distribution. Defined by the window and shift
length, every single sensor sample is part of 9 windows.

Table I: Data corpus used for the spotting experiment; data
sets are disjoint.

Data set    Writing    No writing    Total
Training    272 min    111 min       383 min
Testing       4 min    111 min       115 min
For every window $w_t$ the SVM classifier $C(w_t)$ returns
0 if no handwriting is detected and 1 otherwise. This is
illustrated by the color of the horizontal bars in Figure 3.
One sensor sample $s_t$ is classified as writing motion if at
least one window containing $s_t$ is classified as handwriting
motion:
$$C(s_t) = \max_{k:\, s_t \in w_k} C(w_k) \qquad (1)$$
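Continuing the sketch above (reusing FS, WIN, SHIFT and window_features), the window classification and the max-combination of Eq. (1) could look as follows; the SVM is assumed to have been trained beforehand on a labeled corpus such as the one in Table I.

```python
import numpy as np
from sklearn.svm import SVC

# RBF-kernel SVM with the parameters stated above (gamma = 2, C = 128);
# svm.fit(features, labels) is assumed to have been run on training data
svm = SVC(kernel="rbf", gamma=2, C=128)

def spot(stream, svm):
    """Per-sample decision C(s_t) over a (T, 6) signal array:
    a sample is labeled as writing if ANY window covering it is
    classified as writing (Eq. 1)."""
    labels = np.zeros(len(stream), dtype=int)
    for start in range(0, len(stream) - WIN + 1, SHIFT):
        window = stream[start:start + WIN]
        if svm.predict([window_features(window)])[0] == 1:
            # max over covering windows: mark every sample in this window
            labels[start:start + WIN] = 1
    return labels
```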
We call a sequence of sensor samples classified as hand-
writing a handwriting segment. The lower bar in Figure
3 shows the resulting signal segmentation (one segment)
after combining the individual classification results. In
contrast to other classifier combination
schemes (e.g. majority vote) our combination scheme has a
bias towards the detection of handwriting motion, i.e. some
samples not belonging to actual handwriting will likely be
classified as handwriting. Thereby, we try to avoid losing
any important samples, since all handwriting samples
that we miss in the spotting stage are never forwarded
to the recognition stage and therefore lost. Furthermore,
short pauses during writing will not lead to gaps in the
detected handwriting segment. On the other hand, we
will get many non-handwriting segments forwarded to the
recognition stage. Usually, these segments are very short
and the recognition stage therefore produces empty or very
short hypotheses (1-3 characters in total). Thus, additional
filtering of the produced hypotheses in the recognition
stage could discard most of the false positive segments.
A sensor sample is passed on to the recognition stage as
soon as the last window it belongs to was classified. As a
result, the spotting stage imposes a delay of one window
length, i.e. 1.1 s, on further processing; choosing the
window length is always a trade-off between classification
accuracy and system delay. Our real-time demonstration
system shows that the chosen values are reasonable [19].
A. Experiments
We collected data from 3 subjects to evaluate our
spotting approach. We assume that inter-individual dif-
ferences are not as important for the discrimination of
handwriting and non-handwriting data as the variety of
recorded activities. Therefore we did not focus on a data
corpus including many subjects. Instead we gave priority
to collecting realistic everyday data. Table I lists the
data recorded and used for the experiments. The training
data containing handwriting was taken from our existing
corpus, including the data recorded for the recognition
containing the averaged 3d acceleration $(a_x, a_y, a_z)$ and
the averaged 3d angular rate $(g_x, g_y, g_z)$.
B. Recognition
We use continuous density Hidden Markov Models
to model each character’s signal sequences. HMMs are
graphical state-based statistical models that can be used
to model a wide range of signal sequences. A detailed
introduction to HMMs is beyond the scope of this paper,
the classic paper by Rabiner [21] gives an overview of
theory and application. One important property of HMMs
is the possibility to concatenate individual models to form
a new HMM. This concatenated HMM then models the
complete signal sequence of the two individual models.
In our case we can therefore use a small set of primitive
character models and can construct an arbitrary amount
of word models by concatenation of character models. We
use 26 character models for the capital letters from A to
Z and one model to represent the repositioning motion
that occurs between single characters. All our character
HMMs have a left-right topology with 30 states, the
repositioning model with 10 states. Transition probabilities
are not trained. The observation probabilities for each
state are modeled with Gaussian Mixture Models with
six components each and diagonal covariance matrices.
The character models were initialized on data of 14 writ-
ers from former experiments. These former experiments
mostly contained only recordings of single characters and
therefore had pauses at the beginning and the end. This
makes the models unsuitable for concatenation, but we
used their parameters to initialize the new models prior to
an EM training on the sentence data. None of the subjects
in the experiments described in this paper contributed to
the older data set.
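To make the concatenation property concrete, here is a minimal sketch that chains the left-right transition matrices of character models into a word model. The 0.5/0.5 self-loop/advance split is an assumption for illustration, since the paper states that transition probabilities are not trained; each state keeps its GMM observation density when copied into the word model.

```python
import numpy as np

CHAR_STATES, REPOS_STATES = 30, 10  # states per character / repositioning model

def left_right(n, p_stay=0.5):
    """Left-right topology: each state either loops or advances."""
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i], A[i, i + 1] = p_stay, 1.0 - p_stay
    A[-1, -1] = 1.0
    return A

def concat(models):
    """Chain character HMM transition matrices into one word HMM:
    the final state of each model feeds the first state of the next."""
    total = sum(m.shape[0] for m in models)
    A = np.zeros((total, total))
    off = 0
    for m in models:
        n = m.shape[0]
        A[off:off + n, off:off + n] = m
        if off + n < total:                  # not the last model:
            A[off + n - 1, off + n - 1] = 0.5
            A[off + n - 1, off + n] = 0.5    # exit into the next model
        off += n
    return A

# word model for "TO": T -> repositioning -> O
A_word = concat([left_right(CHAR_STATES),
                 left_right(REPOS_STATES),
                 left_right(CHAR_STATES)])
```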
Recognition results are output on word level, i.e. a
set of words, called the vocabulary, defines all words
that can be recognized. All possible word combinations,
including repetitions, constitute the actual search space
of the recognition problem, which therefore is several
orders of magnitude larger than the size of the vocabulary.
For example, the sentences in our data corpus contained
4.5 words on average. Considering the size of the large
8k vocabulary, there exist $8231^4 \approx 4.5 \cdot 10^{15}$ possible
sentences with 4 words. Since the number of words is not
known in advance, the search space is even larger. Clearly,
it is not feasible to construct one HMM for each of all
the possible word combinations and perform a Viterbi
alignment for each to find the most probable sequence.
We use a Viterbi beam search together with a prefix-
tree based search graph that allows for decoding word
beginnings only once if words share their prefix (e.g.
“this”, “then”, “their”, ...) [22]. A statistical language
model, which models the probabilities of certain word
sequences, is used to restrict the search space to word
sequences that occur in real-world text data. We use the
Janus Recognition Toolkit [23], originally developed for
speech recognition, to perform our experiments.
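The prefix sharing can be illustrated with a minimal trie sketch (the actual search graph lives inside the Janus decoder): word beginnings map to shared paths of character models and are therefore decoded only once.

```python
def build_prefix_tree(vocabulary):
    """Nested-dict trie; each edge corresponds to one character
    model during decoding."""
    root = {}
    for word in vocabulary:
        node = root
        for ch in word.upper():               # character models cover A-Z
            node = node.setdefault(ch, {})
        node.setdefault("$", set()).add(word)  # words ending at this node
    return root

tree = build_prefix_tree(["this", "then", "their"])
# "this", "then" and "their" share the path T -> H,
# which is therefore decoded only once
```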
C. Vocabulary
We use two vocabularies of different sizes in our
experiments to evaluate the effect of vocabulary size on
scalability. If the vocabulary gets larger, the recogni-
tion task gets harder, since the search space grows. The
small vocabulary (V1k) contains 986 words and the large
vocabulary (V8k) contains 8231 words. Both vocabularies
are taken from the University of Leipzig’s list of frequent
English words, which is obtained by webcrawling and
provided for free¹. We applied additional cleaning steps
on the list, because it contains entries that are not actually
words. All entries consisting of only one character were
deleted, except “i” and “a”. All entries consisting of two
or three letters which either are not correct words by
themselves (e.g. “gl”) or are abbreviations (e.g. “os”) were
deleted. All words longer than three characters remained
unchanged. Finally, all words contained in the sentences
written by the subjects were added to the vocabularies.
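A sketch of these cleaning rules; valid_short_words is a hypothetical whitelist of legitimate two- and three-letter words, since the paper does not state how these were judged.

```python
def clean_vocabulary(entries, valid_short_words):
    """Apply the cleaning rules described above to a raw word list."""
    cleaned = []
    for entry in entries:
        w = entry.lower()
        if len(w) == 1 and w not in {"i", "a"}:
            continue  # drop one-character entries except "i" and "a"
        if len(w) in (2, 3) and w not in valid_short_words:
            continue  # drop fragments ("gl") and abbreviations ("os")
        cleaned.append(entry)  # everything else is kept unchanged
    return cleaned
```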
D. Language Model
A statistical n-gram language model is used to model
the dependencies between successive words in order to
improve the recognition performance. Statistical language
models give information on the probability of a certain
sequence of words and are normally generated in a data-
driven manner from large text corpora. We use a 3-gram language model,
which means it contains the probabilities of word se-
quences up to a length of three. Given a sequence of three
words $w_1 w_2 w_3$, the language model returns the conditional
probability $P(w_3 \mid w_1 w_2)$ of $w_3$, given that the words $w_1 w_2$
were recognized so far. This probability is multiplied with
the probability of the signal sequence given by the HMM
whenever a complete word was recognized. The language
model we use contains 60000 words. It was generated for
a speech recognition system by crawling English internet
sites [24]. The relevant statistical measures for the language
model are given in Table II. The 1-gram coverage for
both vocabularies used in the experiments is 100%, i.e. the
language model contains all words in the vocabularies.
The 2-gram and 3-gram coverage is lower, but if for
example a 3-gram cannot be found, the language model
falls back to bi- or unigrams. We took the language model
as is, i.e. we did not change it to optimize it towards
the specific task described in this paper, like giving the
word sequences occurring in our chosen sentences higher
probabilities. This is reflected in the reported perplexities.
The perplexity is an information theoretic measure to
evaluate the predictive quality of a language model on
a given set of word sequences. It can be interpreted as the
average number of possibilities for the next word if the
model had to choose uniformly and independently among
all possibilities. The perplexities in Table II are given for
the language model restricted to the vocabularies, i.e. the
language model is reduced to contain only words that are
contained in the vocabulary.
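For reference, the standard definition behind this interpretation, with the trigram model backing off to bi- and unigrams as described above:
$$\mathrm{PP} = P(w_1, \ldots, w_N)^{-\frac{1}{N}} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-2}, w_{i-1})},$$
where $N$ is the number of words in the test set.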
¹ wortschatz.uni-leipzig.de/html/wliste
Table II: Perplexity and coverage of the language model.

              Perplexity          Coverage (%)
              V1k      V8k        1-gram   2-gram   3-gram
              294.44   304.67     100.00   97.98    64.13
E. Experiments
We collected handwriting data from nine subjects. None
of the subjects had used the system before. Every subject
had to write 80 English sentences taken from broadcast
news. The sentence length varied from 2 to 6 words. In
total, every subject wrote 366 words and 1778 characters,
resulting in 3 h and 56 min of recorded data over all
subjects. Each character of the alphabet appeared in the
sentences multiple times. The subjects were sitting in a
chair in front of a computer screen showing the next
sentence to write. They were told to write in the air in front
of them as if they were writing on an imaginary blackboard.
Instead of writing horizontally from left to right, the
subjects were asked to write in place. All subjects were
told to write in block capital letters with fixed wrist and
an approximate height of 20 cm per character. No further
constraints on speed or quality of writing were set. Due
to the lack of visual feedback and the unfamiliar writing
style, it is harder to write without errors, but a writer
normally realizes if he or she makes an error. The subjects were
told to correct their writing errors themselves by repeating
the respective sentence. The segmentation of the data was
done by the participants with a key press before and after
each sentence, i.e. participants were able to make pauses
between sentences, which was necessary since writing in
the air gets fatiguing after several sentences. Although
this experiment is done on manually segmented data, we
can transfer the results to the combined setting with spotting
and recognition. The recognition performance on actual
handwriting segments will be approximately the same as
the one reported in this experiment. All false positive
segments that cannot be filtered out would produce errors.
Since spotting together with filtering works well in our
spotting experiments, the overall performance is expected
to be only slightly worse.
F. Performance Metric
We measure our system by calculating the word error
rate, which is the metric commonly used to evaluate
the performance of speech and handwriting recognition
systems. The hypothesis of the recognizer is aligned with
the true reference sentence by computing the minimal edit
distance (Levenshtein distance) on word level. This gives
the number of substitution (S), insertion (I) and deletion
(D) errors that occur by aligning hypothesis and reference.
For a set of sentences $S = \{s_1, \ldots, s_k\}$ the word error
rate is computed by
$$\mathrm{WER}_S = \frac{\sum_{i=1}^{k} \left( S_{s_i} + I_{s_i} + D_{s_i} \right)}{\sum_{i=1}^{k} N_{s_i}} \cdot 100,$$
where $N_{s_i}$ is the total number of words in the reference
sentence $s_i$.

Table III: Results of the person-independent evaluation
with and without language model.

                no LM           3-gram LM
                V1k    V8k      V1k    V8k
average WER     37%    49%       9%    11%
stddev WER      17%    19%       8%     9%

The following example illustrates WER computation for
one single sentence:
Reference:   we   had   a   lot   of   expertise
Hypothesis:  he   had       lot   of   expert ease
Error:       S          D              S      I

The word error rate in the above example is WER =
(2 + 1 + 1)/6 · 100 ≈ 66.7%. One should notice that the
WER can be larger than 100% because the hypothesis can
be longer than the reference.
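A compact sketch of this computation as word-level Levenshtein distance, assuming whitespace-tokenized strings:

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: minimal edit distance (substitutions, insertions,
    deletions) between word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[-1][-1] / len(ref)

# the example above: 4 errors over 6 reference words -> ~66.7%
print(word_error_rate("we had a lot of expertise",
                      "he had lot of expert ease"))
```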
G. Person Independent
To evaluate person-independent performance, a per-
person cross-validation was performed. The data of one
person was chosen as test set and the data of all other
persons was taken as training set. As stated in IV-B, we
took the initialized models from former experiments and
performed three iterations of EM training. Table III shows
the summary of cross-validation results. The average word
error rate is 9% for the small vocabulary and 11% for the
large one.
The comparison of the two vocabulary sizes is promis-
ing. Vocabulary V8k is approximately eight times larger
than V1k; however, the error rate on the large vocabulary
is only a factor of 1.2 higher than for the small
vocabulary. Therefore, we assume our approach scales to
even larger vocabularies.
If we compare the results acquired by using the lan-
guage model to the results without any language model
we see a significant boost in recognition performance.
As explained in section IV-B the search space is very
large and if no language model is used, ambiguities
between words cannot be resolved on the language level.
Therefore results are expected to be worse. Nevertheless,
still more than 50% of all words are correctly recognized.
Considering the size of the search space, this is still a
notable result and shows the discriminative quality of our
HMM based recognition system.
Figure 5 shows the breakdown of results for each
subject individually. On the data of subjects D and I,
the performance is very low compared to the average.
From observation during the experiments it is known that
subject D did not write with fixed wrist all the time, i.e.
there was more rotational motion in the wrist joint. As
a result the fraction of gravitational force in the sensor
signals cannot be assumed constant during writing and
the mean normalization of the signals does not remove
the gravitational acceleration. This might account for the
high word error rate.
Figure 5: Per-subject word error rates (WER in %, subjects
A–I) for the person-independent (PI) and person-dependent
(PD) case.
For subject I, there is no such obvious reason why the
performance is so low. We showed in our past experiments
[18] that even for the case of block capital letters, people
do write characters very differently. In our dataset of
ten writers in [18], five different variants of writing an
“E” were observed, mainly distinguished in the order and
direction of strokes. If one writer has a very different
writing style compared to the majority, we would expect
the system to have a much lower performance. This could
be a possible reason for the performance drop for subject
I. On average, the system achieves very good accuracies.
For some subjects a word error rate of 2% on the large
vocabulary is reached, which means that approximately
191 of 195 words were recognized correctly.
H. Person Dependent
Although our aim is to build a system that works
independent of its user, we evaluated the person dependent
performance. First, this gives an upper bound to which
the user-independent system can be compared, since
we generally expect a higher performance if the system
is specifically trained on a user's own data. Second, the
results give insight into what could be expected from runtime
adaptation methods. Runtime adaptation was not used for
the experiments described in this paper but might be a
future option. The person-independent models from the
experiments described in the former section were taken
as initialization. The person dependent performance of the
system was evaluated on each subject's data with a ten-fold
cross validation on the 80 sentences of each user with one
iteration of EM training. The large vocabulary (V8k) and
the 3-gram language model were used for the evaluation.
Figure 5 shows the results of the person dependent evalu-
ation for each subject together with the word error rate for
the person independent case. Subject specific retraining of
the models greatly improves recognition performance. The
average word error rate drops from 11% to 3%. Subjects
D and I, for which the person independent performance
is rather low, also reach very good performance after
retraining.
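A sketch of this protocol, assuming scikit-learn for the fold splits; retrain_one_em_iteration and decode are hypothetical stand-ins for the toolkit-specific EM step and decoding run.

```python
from sklearn.model_selection import KFold

def person_dependent_eval(sentences, pi_models,
                          retrain_one_em_iteration, decode):
    """Ten-fold CV over one subject's 80 sentences: each fold starts
    from the person-independent models, retrains with one EM iteration
    on 72 sentences and decodes the held-out 8 (V8k, 3-gram LM)."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=10).split(sentences):
        models = retrain_one_em_iteration(
            pi_models, [sentences[i] for i in train_idx])
        errors.append(decode(models, [sentences[i] for i in test_idx]))
    return sum(errors) / len(errors)  # average WER over folds
```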
V. CONCLUSION
We presented a wearable gestural input method which
is capable of spotting and continuously recognizing text
written in the air based on inertial sensors attached to
the back of the hand. We conducted experiments on con-
tinuous recognition of sentences written in the air with
underlying vocabularies of more than 8000 words.
To our knowledge, a gesture vocabulary of more than
8000 continuously recognizable gestures is significantly
larger than in any previously reported system. For
the person-independent case an average word error rate
of 11% was reached. The results can be transferred to
other domains of gesture recognition tasks where specific
gestures are built from a smaller set of primitives. None
of the used techniques is specific to the problem of
handwriting recognition. We proposed a spotting algorithm
and showed its applicability to our task. The proposed
architecture and methods allow the implementation of a
system operating in realtime on continuous data [19].
REFERENCES
[1] S. Mitra and T. Acharya, “Gesture Recognition: A Survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 37, pp. 311–324, 2007.
[2] K. Lyons, T. Starner, D. Plaisted, J. Fusia, A. Lyons, A. Drew, and E. W. Looney, “Twiddler typing: one-handed chording text entry for mobile phones,” in Proc. of the SIGCHI Conference on Human Factors in Computing Systems (CHI’04), 2004.
[3] I. S. MacKenzie, R. W. Soukoreff, and J. Helga, “1 thumb, 4 buttons, 20 words per minute: design and evaluation of H4-Writer,” in Proc. of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST’11), 2011.
[4] P. Mistry, P. Maes, and L. Chang, “WUW - wear ur world: a wearable gestural interface,” in Proc. of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’09), 2009.
[5] E. Tamaki, T. Miyaki, and J. Rekimoto, “Brainy hand: an ear-worn hand gesture interaction device,” in Proc. of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’09), 2009.
[6] S. Gustafson, D. Bierwirth, and P. Baudisch, “Imaginary interfaces: spatial interaction with empty hands and without visual feedback,” in Proc. of the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST’10), 2010.
[7] H. Junker, O. Amft, P. Lukowicz, and G. Tröster, “Gesture spotting with body-worn inertial sensors to detect user activities,” Pattern Recognition, vol. 41, no. 6, pp. 2010–2024, 2008.
[8] G. Raffa, J. Lee, L. Nachman, and J. Song, “Don’t slow me down: Bringing energy efficiency to continuous gesture recognition,” in Proc. International Symposium on Wearable Computers (ISWC’10), 2010.
[9] H.-K. Lee and J. Kim, “An HMM-based threshold model approach for gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 961–973, Oct. 1999.
[10] T. Stiefmeier, D. Roggen, G. Tröster, G. Ogris, and P. Lukowicz, “Wearable activity tracking in car manufacturing,” IEEE Pervasive Computing, vol. 7, no. 2, p. 42, 2008.
[11] S. Kallio, J. Kela, and J. Mäntyjärvi, “Online gesture recognition system for mobile interaction,” in Proc. IEEE International Conference on Systems, Man and Cybernetics (ICSMC’03), 2003.
[12] A. Hein, A. Hoffmeyer, and T. Kirste, “Utilizing an accelerometric bracelet for ubiquitous gesture-based interaction,” in Universal Access in Human-Computer Interaction. Intelligent and Ubiquitous Interaction Environments, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2009, vol. 5615, pp. 519–527.
[13] O. Amft, R. Amstutz, A. Smailagic, D. Siewiorek, and G. Tröster, “Gesture-controlled user input to complete questionnaires on wrist-worn watches,” in Human-Computer Interaction. Novel Interaction Methods and Techniques, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2009, vol. 5611, pp. 131–140.
[14] D. Kim, H. Choi, and J. Kim, “3d space handwriting recognition with ligature model,” in Ubiquitous Computing Systems, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2006, vol. 4239, pp. 41–56.
[15] W.-C. Bang, W. Chang, K.-H. Kang, E.-S. Choi, A. Potanin, and D.-Y. Kim, “Self-contained spatial input device for wearable computers,” in Proc. IEEE International Symposium on Wearable Computers (ISWC’03), 2003.
[16] R. Plamondon and S. Srihari, “Online and off-line handwriting recognition: a comprehensive survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63–84, 2000.
[17] R. McGuire, J. Hernandez-Rebollar, T. Starner, V. Henderson, H. Brashear, and D. Ross, “Towards a one-way American sign language translator,” in Proc. Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FGR’04), 2004.
[18] C. Amma, D. Gehrig, and T. Schultz, “Airwriting recognition using wearable motion sensors,” in Proc. of the 1st Augmented Human International Conference (AH’10), 2010.
[19] C. Amma and T. Schultz, “Airwriting: Demonstrating Mobile Text Input by 3d-Space Handwriting,” in Proc. of the ACM International Conference on Intelligent User Interfaces (IUI’12), 2012.
[20] O. J. Woodman, “An introduction to inertial navigation,” University of Cambridge, Tech. Rep., 2007.
[21] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[22] X. Huang, A. Acero, and H. Hon, Spoken Language Processing. Prentice Hall, 2001.
[23] H. Soltau, F. Metze, C. Fügen, and A. Waibel, “A one-pass decoder based on polymorphic linguistic context assignment,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’01), 2001.
[24] T. Schultz, “GlobalPhone: A multilingual speech and text database developed at Karlsruhe University,” in Proc. of the International Conference on Spoken Language Processing (ICSLP’02), 2002.