

WiHF: Gesture and User Recognition with WiFi
Chenning Li, Student Member, IEEE, Manni Liu, and Zhichao Cao, Member, IEEE

Abstract—User identified gesture recognition is a fundamental step towards ubiquitous WiFi based sensing. We propose WiHF, which is the first to simultaneously enable cross-domain gesture recognition and user identification using commodity WiFi in a real-time manner. The basic idea of WiHF is to derive a domain-independent motion change pattern of arm gestures from WiFi signals, rendering the unique gesture characteristics and the personalized user performing styles. To extract the motion change pattern in real time, we develop an efficient method based on the seam carving algorithm. Moreover, taking the motion change pattern as input, a Deep Neural Network (DNN) is adopted for both gesture recognition and user identification tasks. In the DNN, we apply splitting and splicing schemes to optimize collaborative learning for the dual tasks. We implement WiHF and extensively evaluate its performance on a public dataset including 6 users and 6 gestures performed across 5 locations and 5 orientations in 3 environments. Experimental results show that WiHF achieves 97.65% and 96.74% accuracy for in-domain gesture recognition and user identification, respectively. The cross-domain gesture recognition accuracy is comparable with the state-of-the-art method, but the processing time is reduced by 30×.

Index Terms—Gesture Recognition; User Identification; Channel State Information; Commodity WiFi; Cross-domain Sensing


1 INTRODUCTION

"Every time I lift my arm, it distorts a small electromagnetic field that is maintained continuously across the room. Slightly different positions of my hand and fingers produce different distortions and my robots can interpret these distortions as orders. I only use it for simple orders: Come here! Bring tea! and so on." Such an amazing scenario was described in the science fiction novel "The Robots of Dawn" [1] by Isaac Asimov in 1983. Nowadays, WiFi based sensing is making it happen, and researchers have proposed several WiFi based systems for gesture recognition [2], [3], [4], [5], [6], [7], [8], which can improve the efficiency and quality of human living in the smart home. The fundamental principle that enables humans to naturally interact with smart devices or even robots via WiFi is that WiFi channels get distorted by arm or hand gestures, as shown in Figure 1.

Besides the semantic meaning of diverse gestures, many applications, such as the smart factory, smart home, and VR gaming, usually require user identities for access control, content recommendation, and VR customization. Specifically, access control is usually required for gesture commands issued to different types of equipment in the smart factory. When the identity of children or parents is known, a smart home can recommend different contents when they are watching TV or listening to music. Besides, VR gaming can be further enhanced by associating the performed gestures with specific users, such as recording a user's information. As shown in Figure 1, the true potential of WiFi based gesture recognition can be unleashed only when it can associate the performed gesture with a specific user simultaneously, just like the HuFu used by the ancient Chinese military, which simultaneously provides authentication (e.g., matching a pair of HuFu pieces) and semantic meaning (e.g., deploying a military force). Hence, user identified gesture recognition becomes an emerging research topic.

• Chenning Li, Manni Liu and Zhichao Cao are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824.


Fig. 1. The illustration of the WiHF system, inspired by the Chinese HuFu, which is designed for both authentication and application purposes. The TX/RX pair senses gestures such as Push&Pull, Sweep, Slide, and Draw Circle.


Nevertheless, user identified gesture recognition encounters three fundamental challenges in practice. First, since WiFi signals are usually noisy, it is difficult to derive an effective feature that represents both the unique gesture characteristics and the personalized user performing styles, so gesture recognition and user identification cannot naturally be supported simultaneously. Second, the computational complexity of feature extraction should be low enough that user identified gesture recognition can run in real time. Third, since a user may perform the same gesture in diverse locations, orientations, and environments, the recorded WiFi signals are no longer the same at all. It is challenging to keep user identified gesture recognition accurate across domains (e.g., locations, orientations, environments) [7] with low extra overhead.

State-of-the-art approaches fail to resolve all three challenges. For example, Widar3.0 [7] enables zero-effort cross-domain gesture recognition, which means extra efforts are unnecessary in either data collection or model retraining when gestures are performed in new domains.


However, the extracted feature cannot support user identification, and the computational complexity of its feature extraction is too high to achieve real-time operation. Moreover, WiID [9] achieves user identification for gestures, but it must take the performed gesture as input, which can significantly restrict the user identification accuracy. Consequently, real-time efficiency and cross-domain ability are degraded.

In this paper, we propose WiHF (WiFi HuFu), a pioneering attempt to achieve cross-domain user identified gesture recognition using commodity WiFi in a real-time manner. It aims to capture the personalized motion change pattern caused by arm gestures, which includes the rhythmic velocity fluctuation and the characteristic pause distribution, and which remains consistent across domains. Then an efficient method based on the seam carving algorithm [10] is developed to carve out the motion change pattern with low computational cost. Finally, we design a collaborative dual-task DNN model to simultaneously achieve accurate user identified gesture recognition, which further applies a splitting and splicing scheme to bootstrap the cross-domain ability and collaborative learning.

We implement WiHF and evaluate it extensively on a public dataset [7]. Results demonstrate that WiHF achieves 97.65% and 96.73% accuracy for in-domain gesture recognition and user identification, respectively. Moreover, WiHF demonstrates zero-effort cross-domain characteristics for gesture recognition, comparable with the state-of-the-art method [7], while the processing time is reduced by 30× so that it can run in real time. In a nutshell, the contributions of this paper are as follows:

• We design a domain-independent motion change pattern of arm gestures to reduce the deployment cost for real applications, and we develop several efficient algorithms for real-time operation.

• We propose a dual-task DNN framework that recognizes gestures and identifies users collaboratively, and that can be further generalized to multi-task sensing using wireless signals.

• We implement WiHF and conduct extensive experiments to evaluate its performance. Evaluations demonstrate the feasibility and effectiveness of our system.

The rest of this paper is organized as follows. Related works are reviewed in Section 2, followed by the preliminary and observations in Section 3. The system design is detailed in Section 4, before the performance evaluation in Section 5. We provide the discussion and summarize the conclusion in Sections 6 and 7, respectively.

2 RELATED WORK

2.1 User Identification

Prior work can be divided into passive and active user identification, depending on whether extra actions from users are required. Passive user identification generally leverages intrinsic physiological distinctions or behavioral features, such as gait [11], [12], [13], [14], [15], [16], [17] and location-oriented activity patterns [18], [19], [20], [21], [22]. Specifically, WiWho [11], WifiU [16], WiFi-ID [13], and Wii [17] represent user-specific gait features with statistical features extracted from WiFi measurements. AutoID [14] proposes the C3SL shapelet learning framework to extract unique fine-grained gait patterns. All of them, however, require users to walk along pre-defined paths and keep their walking style while avoiding any turns or breaks. For activity based user identification, WiPIN [20] extracts 4 human biologic features and 9 statistical signal features using an SVM, while Shi et al. [21] leverage statistical features and employ a deep learning model to capture the fingerprints of WiFi measurements for different users. All passive user identification methods, however, cannot convey extra information. Besides, they are vulnerable to domain variance such as the location, orientation, and environment of users. In contrast, active user identification relies on users' actions and conveys extra information such as users' gesture commands. Traditional methods validate credentials actively by requiring a fingerprint, an iris scan, or an RFID tag in a specific position. RF-Mehndi [23] leverages the unique phase patterns of the backscatter signals caused by the user's fingertip touching a card; it requires the RFID card as a credential and works at distances of less than 30 cm. To the best of our knowledge, WiID [9] is the first scheme that can recognize gestures while identifying the user at runtime, by integrating existing WiFi based gesture recognition systems as an add-on module. However, the system requires data collection and model retraining when employed in a new domain.

2.2 Gesture Recognition across Domains

WiFi based gesture recognition systems generally need to be adapted for new domains. Researchers have proposed to either translate features between domains [3] or generate domain-independent features [7]. Widar3.0 [7] extracts the domain-independent feature BVP from CSI but requires the accurate location of the transceivers; otherwise, it may suffer from noise and outliers [24]. Importantly, none of the aforementioned solutions can run in real time. Domain adversarial architectures [25] can also reduce the deployment cost for new domains by removing the domain-dependent components from the extracted features [26], [27].

3 PRELIMINARY AND OBSERVATION

In this section, we first discuss the preliminary principles behind user identification based on arm gestures across domains (§3.1). Then, we demonstrate the feasibility of utilizing the motion change pattern of arm gestures for user identified gesture recognition with WiFi (§3.2).

3.1 Arm Gesture based User Identification across Domains

We survey the existing works to verify two fundamental questions as follows:

3.1.1 Are the characteristics of arm gestures representative enough for user identification?

The answer is yes, due to the following observations. While performing an arm gesture, the potential biometric feature [9], [28], [29], [30], [31], which is associated with the shape of the arm and hand, is tightly coupled with the corresponding movements and brings an opportunity for user identification.


Moreover, arm gestures, including the arm sweep [30] and gesture vocabularies (e.g., drawing a line, circle, or rectangle) [28], are performed diversely from user to user due to their different semantic understandings and performing styles, and have thus been used for user authentication. Several works have shown that these unique characteristics of arm gestures can be captured by either WiFi signals or inertial sensors. Specifically, WiID [9] conducts a comprehensive measurement study to validate that the time series of frequencies appearing in WiFi Channel State Information (CSI) measurements while performing a given gesture differs from that of the same gesture performed by different users, but is similar to that performed by the same user over a long period. Moreover, wrist acceleration samples collected with inertial sensors while performing arm gestures are able to provide personalized characteristics with long-term stability over a month [29], [31]. Hence, the characteristics of arm gestures are indeed representative enough for user identification.

3.1.2 Can we achieve accurate user identification while avoiding extra re-training efforts across domains?

The key challenge in adapting a user identification system to the varying inputs of the same arm gesture performed by the same user across domains is to extract a domain-independent feature free from the induced variances. For example, Widar3.0 [7] derives a domain-independent feature called the Body-coordinate Velocity Profile (BVP) from the Doppler Frequency Shift (DFS) spectrogram of raw CSI measurements. It summarizes the relative velocity components of arm motion in the user's body coordinate system, which is irrelevant to the user's orientation, location, and surrounding environment. With BVP, Widar3.0 achieves cross-domain gesture recognition without the extra cost of data collection and model retraining. BVP shows the possibility of domain-independent feature design, but user identification remains challenging because BVP cannot preserve the personalized characteristics exhibited while performing arm gestures. Besides, BVP is too computation-intensive to run in real time. Hence, a possible way to solve the problem is to derive a new domain-independent feature for both gesture recognition and user identification.

Overall, to enable user identified gesture recognition in real time, we need to derive a new feature of arm gestures that is domain-independent and supports both gesture recognition and user identification with low computational complexity.

3.2 Empirical Study of WiFi Motion Features

We intend to investigate several feasible personalized features preserved by WiFi CSI measurements while performing arm gestures. Upon receiving the raw CSI measurements, we can derive the DFS spectrogram using the Short-Term Fourier Transform (STFT) to capture the distortion of channels induced by arm motion. By observing the DFS spectrogram, we notice that both power and temporal features (called carving paths) have the potential to indicate the personalized arm motion from the velocity and rhythm aspects. Specifically, DFS spectrograms [5] can separate the movements of different arm parts when they move at different speeds, since the spectrogram power varies as the reflection areas change for a certain velocity component over time. Based on the power of the DFS spectrograms, two carving paths can be derived. One is the Dominant Power path, which tracks the most dominant power in the DFS spectrogram. The other is the Power Bound path, which profiles the dominant power area and velocity bounds [32]. Note that both are power based features. Moreover, an arm gesture can usually be divided into atomic motions (e.g., drawing a line or an arc) in temporal order; for example, drawing a rectangle contains four lines towards four different directions. The switches between two adjacent atomic motions are called Motion Changes, which indicate a motion pause/restart. We extract a third carving path, called the Motion Change Pattern, to represent the pattern of motion changes, namely the temporal rhythmic motion during performing the gesture.
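For concreteness, the following minimal sketch shows one way the two power based carving paths could be traced on a denoised DFS spectrogram. The array names, the 95% energy threshold for the power bound, and the symmetric-band definition are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def power_carving_paths(spec, energy_frac=0.95):
    """Trace power based carving paths on a DFS power spectrogram.

    spec: non-negative power spectrogram, shape (n_freq_bins, n_time_bins),
          with the zero-Doppler bin at the center (fftshift-ed).
    Returns the dominant-power path and (lower, upper) power-bound paths,
    each expressed as frequency-bin indices per time bin.
    """
    n_freq, n_time = spec.shape
    # Dominant Power: the frequency bin carrying the strongest power per instant.
    dominant = np.argmax(spec, axis=0)

    # Power Bound: the smallest symmetric band around zero Doppler capturing
    # `energy_frac` of the per-instant power (an illustrative choice).
    lower = np.empty(n_time, dtype=int)
    upper = np.empty(n_time, dtype=int)
    center = n_freq // 2
    for t in range(n_time):
        col = spec[:, t]
        total = col.sum() + 1e-12
        half = 0
        while half < center:
            if col[center - half:center + half + 1].sum() / total >= energy_frac:
                break
            half += 1
        lower[t], upper[t] = center - half, center + half
    return dominant, (lower, upper)
```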

To validate whether the three carving paths are distinguishable among different users and stable for any single user across domains, we conduct some empirical experiments. With the collected CSI measurements of each gesture instance, we manually extract and label the three carving paths from the DFS spectrograms. We have three observations, as follows. Note that WiHF builds all of its work on the public dataset, and details about the experimental setup can be found in Widar3.0 [7].

3.2.1 The motion change pattern of an individual user performing the same gesture stays consistent over time, while different users manifest various motion change patterns for the same gesture; this does not always hold for power based carving paths

The top three sub-figures of Figure 2 show the three carving paths of three different users performing the same gesture (e.g., drawing the rectangle). Specifically, the black and red dashed lines denote the dominant power and power bound carving paths. The pink dashed lines, distributed along the frequency-shift axis, indicate the motion changes. With the visualization of all carving paths, we can observe that both the power based features and the motion change pattern vary among different users. Intuitively, different users perform the same gesture with personalized action understandings and performing styles, and their diverse arm shapes and sizes reinforce the impacts on WiFi signals, leading to diverse power based features and motion change patterns.

Further, we collect three instances of the same gesture from all three users and superpose their carving paths for each user in the bottom three sub-figures of Figure 2. We can see that the power based carving paths of different instances may shift along the time axis, especially for the third user. In contrast, for all users, their motion changes can be grouped into three clusters, which correspond to the three pauses during drawing the rectangle. In all clusters, the largest gap between two motion changes across different instances is less than 70 ms (the first cluster of the third user, appearing at about 500 ms), which demonstrates the consistency of the motion change pattern for each user across instances. The reason is that the inevitable noise (e.g., multi-path, body motion) has a significant influence on the power based carving paths, but motion changes are less affected.


Fig. 2. When three different users perform a gesture (e.g., drawing the rectangle), three carving paths (e.g., 3 different dashed lines with the corresponding colors) of their denoised DFS spectrograms are shown in the top three sub-figures, respectively. For each user, given the same gesture, the power bound and motion change carving paths of three different instances (e.g., 3 colored bold and dashed lines) are shown in each bottom sub-figure.

Fig. 3. The distributions of the DFS power bound and motion changes while a user performs two different gestures (e.g., drawing the zigzag and the triangle) at three different locations (e.g., 3 different colored lines).

3.2.2 An individual user introduces similar motion changes when performing a gesture across domains, but the power based carving paths vary significantly

To verify the stability of different carving paths across domains, a user performs two gestures (e.g., drawing the zigzag and the triangle) at three different locations. For each gesture, the distribution of the derived carving paths is shown in Figure 3. We can see that the motion change pattern is consistent across locations for both gestures. The largest gap between two motion changes in a cluster is 50 ms (the second cluster in the top sub-figure for drawing the zigzag). Meanwhile, the power based carving paths exhibit noticeable dynamics when a gesture is performed at different locations. The reason is that the motion change pattern reflects the rhythm of the arm motion, and this temporal characteristic is only related to the user's performing style and the gesture composition, not to where the user performs the gesture; the diverse arm shapes and sizes of users further reinforce the impacts on WiFi signals. In contrast, from the view of the WiFi transceiver pairs, the velocity components of the detected arm motion vary with location changes, which incurs DFS power dynamics, since the human body is best modeled as a quasi-specular reflector [33].

3.2.3 The motion change pattern does not work if we treat gesture recognition and user identification as separate tasks, but works when the two tasks are operated collaboratively

As Figure 2 shows, we observe that among different users, the motion change patterns demonstrate noticeable variances while performing the same gesture. This inconsistency makes it difficult to utilize the pattern for gesture recognition alone. But if we know the user identity, as shown in Figure 3, the motion change patterns of two different arm gestures performed by the same user are quite different, since the distributions of noticeable pauses/restarts usually differ, which inspires us to design both classification tasks in a collaborative manner. The corresponding components contained in the motion change pattern can be extracted for each task and bootstrap each other, leading to the design of the dual-task module for collaborative learning (§4.3).

Overall, in comparison with power based features, the motion change pattern is a better feature for achieving user identified gesture recognition across domains. The experimental results further verify the above intuition quantitatively (§5.5).

4 SYSTEM DESIGN

Based on the empirical observations, we propose WiHF, which leverages the motion change pattern and designs a collaborative dual-task module to recognize gestures and identify users simultaneously. Figure 4 provides an overview of the architecture of WiHF. WiHF consists of three modules, from bottom to top: data acquisition, pattern extraction, and the collaborative dual-task module.

Data Acquisition Module: Upon receiving raw CSI measurements from the WiFi transceiver pairs, WiHF first sanitizes the CSI series using a band-pass filter and conjugate multiplication [34], [35]. Then the dominant DFS spectrogram components reflected by different body parts (e.g., hand, elbow, arm) are collected using Principal Component Analysis (PCA). Thus we can remove the interference while retaining the unique gesture characteristics and the personalized user performing styles (§4.1).



Fig. 4. The system architecture of WiHF.

Pattern Extraction Module: This module extracts the domain-independent motion change pattern (§4.2). To derive the DFS spectrogram, we first perform time-frequency analysis by applying STFT to the sanitized CSI. Then, WiHF develops an efficient method based on the seam carving algorithm [10] to capture the motion change pattern readily, which is fed into the collaborative dual-task module.

Collaborative Dual-task Module: This module performs the collaborative classification tasks, gesture recognition and user identification, at runtime (§4.3). First, the motion change pattern is filtered and split into the corresponding dual inputs for the dual tasks. Then a DNN of convolution based Gated Recurrent Units (GRU) [36] extracts the spatial (e.g., different body parts) and temporal (e.g., motion change pattern) features of the gesture motion. Next, a gradient block layer is integrated for splicing the respective features while ensuring that they do not affect each other during back-propagation with the loss function. Finally, it outputs the predictions for gesture recognition and user identification simultaneously.

4.1 CSI Acquisition

A time series of CSI matrices characterizes MIMO channel variations along different dimensions, including time (packet), frequency (subcarrier), and space (transceivers). For a MIMO-OFDM channel with M transmit antennas, N receive antennas, K subcarriers, and T received packets, the 4-D CSI tensor H ∈ C^{N×M×K×T} can be formulated as follows at packet t, frequency f, and receive antenna a [37], representing the amplitude attenuation and phase shift of the multi-path channels:

$$ H(f,t,a) = \big(H_s(f,t,a) + H_d(f,t,a) + N(f,t,a)\big)\, e^{j\varepsilon(f,t)} \qquad (1) $$

where $\varepsilon(f,t)$ is the phase offset caused by cyclic shift diversity, the sampling time offset, and the sampling frequency offset [37]. $N$ is the complex white Gaussian noise capturing the background noise [38]. $H_s$ is the static component with zero DFS, while the dynamic component $H_d$ is a superposition of vectors with time-varying phases and amplitudes [5].

Generally, we remove the phase offset $\varepsilon(f,t)$ by calculating the conjugate multiplication of the raw CSI measurements from two antennas on the same WiFi receiver [7], [38]. Then $H_s$ and $N$ are filtered out using a band-pass filter [39].

Fig. 5. DFS spectrograms of the first three PCA components for clapping.

The remaining $H_d$ is only affected by the motion of multiple body parts. We further obtain the dominant components using PCA. The key question is how many components should be selected for feature extraction. The more components are selected, the more motion features can be extracted, but at the cost of processing time. In existing works, Widar uses the first component with filters [7], [39], while some others [3], [9] demonstrate that the third component contains the most feasible motion features without filters.

We illustrate the diverse PCA components using the public dataset of Widar3.0 [7]. Figure 5 shows that different components contain DFS spectrograms at diverse scales, which is significant for retaining the unique gesture characteristics and the personalized user performing style. For example, PCA #1 shows the dominant power distribution, while PCA #3 provides a finer-grained motion change pattern that can represent the symmetrical arm movements of the clap gesture. Note that the first three components are selected for motion change pattern profiling while balancing the computation cost.
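As a concrete illustration of this sanitization pipeline, the sketch below performs conjugate multiplication between two receive antennas, band-pass filtering, and a PCA projection across subcarriers. The sampling rate, cut-off frequencies, and the stage at which PCA is applied are assumptions in the spirit of Widar-style pipelines, not the exact WiHF implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def sanitize_csi(csi, fs=1000.0, band=(2.0, 80.0), n_components=3):
    """Sanitize CSI and keep dominant motion components.

    csi: complex array of shape (n_antennas, n_subcarriers, n_packets)
         from one receiver (assumed layout).
    Returns an array of shape (n_components, n_packets).
    """
    # 1. Conjugate multiplication between antenna 0 and antenna 1 cancels
    #    the common phase offset eps(f, t) shared by antennas on one NIC.
    conj_mul = csi[0] * np.conj(csi[1])          # (n_subcarriers, n_packets)

    # 2. Band-pass filter each subcarrier to suppress the static component
    #    (near-zero Doppler) and high-frequency noise. Real and imaginary
    #    parts are filtered separately, which is equivalent for a linear filter.
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    filtered = filtfilt(b, a, conj_mul.real, axis=-1) \
        + 1j * filtfilt(b, a, conj_mul.imag, axis=-1)

    # 3. PCA across subcarriers: eigen-decompose the Hermitian covariance and
    #    project onto the leading eigenvectors to keep the dominant motion.
    centered = filtered - filtered.mean(axis=-1, keepdims=True)
    cov = centered @ centered.conj().T / centered.shape[-1]
    _, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]     # leading components
    return top.conj().T @ centered               # (n_components, n_packets)
```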

4.2 Motion Change Pattern Extraction

Upon receiving the dominant CSI tensor series, WiHF first applies STFT and obtains the DFS component $f_D$, which is incurred by the movement of the arm and associated with the velocities of different body parts, as in Equation (2), where $\alpha_l$ is the complex attenuation factor of the $l$-th path [38] and $P_d$ is the set of dynamic paths ($f_D \neq 0$) [5], [7], [39]. Thus we derive the DFS spectrogram with dimensions $R \times P \times F \times T$, where $R$ and $P$ are the numbers of transceiver links and PCA components, and $F$ and $T$ denote the numbers of sampling points in the frequency and time domains, respectively.

$$ H_d(f,t,a) = \sum_{l \in P_d} \alpha_l(f,t,a)\, e^{j 2\pi \int_{-\infty}^{t} f_{D_l}(u)\, du} \qquad (2) $$
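For reference, a minimal sketch of this time-frequency analysis step is shown below; the 1 kHz sampling rate and the 256-sample window are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def dfs_spectrogram(dominant, fs=1000.0, nperseg=256, noverlap=224):
    """Compute a two-sided DFS spectrogram from one complex dominant component.

    dominant: complex array of shape (n_packets,), e.g. one row of the PCA
              output from the sanitization sketch above.
    Returns (freqs, times, power) with the zero-Doppler bin centered.
    """
    freqs, times, zxx = stft(dominant, fs=fs, window="hann",
                             nperseg=nperseg, noverlap=noverlap,
                             return_onesided=False)
    power = np.abs(zxx) ** 2
    # Reorder from FFT order so negative Doppler shifts sit below the zero bin.
    freqs = np.fft.fftshift(freqs)
    power = np.fft.fftshift(power, axes=0)
    return freqs, times, power
```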

As mentioned in §3.2, users express unique personalized styles while performing the same gesture, resulting in rhythmic increases, drops, or even pauses at certain instants. Thus the signals reflected by diverse body parts generate consistent motion change patterns and form the corresponding DFS spectrogram sequence. The remaining challenge is how to extract the motion change pattern efficiently.


Intuitively, the rhythmic increase, drop, or even pause usually induces a noticeable fluctuation of the moving velocity detected by the DFS spectrogram, and it occurs at certain moments representing velocity change peaks in the time domain. However, the intensive computation of the derivative operation on velocity sacrifices the real-time characteristic. To retain the personalized features while balancing the processing time, the motion change pattern is derived by carving the DFS spectrograms.

The basic idea is to extract the motion change pattern, comparable with acceleration, as biometrics for the dominant body parts, such as the wrist, elbow, and arm. However, we face three challenges. First, DFS only demonstrates the power value directly for a specific velocity component over time; it cannot provide accurate fine-grained acceleration information corresponding to the various body parts, due to the superposition of velocity components at the receiver. Second, the motion change pattern requires derivative calculation on high dimensional data, which is computation-intensive and cannot run in real time. Third, the DFS spectrogram contains excessive irrelevant interference, resulting in unnecessary computation and memory.

For the challenge of the superposition of velocity components, we propose a model to fill the gap between the power distribution of the DFS spectrogram and the body part acceleration information, leading to the motion change pattern. As CARM [5] elaborates, the power distribution of the spectrogram changes as the reflection area S varies for a specific Doppler shift frequency at instant t. Thus the power $P_{ds}$ of the DFS spectrogram can be defined with a scaling factor c due to propagation loss, as in Equation (3a). Assuming K body parts are dominant in defining the gesture, the relation between $P_{ds}$ and the superposed reflected areas S of the body parts can be modeled as Equation (3b):

$$ P_{ds}(f_D, t) = c \cdot S(f_D, t) \qquad (3a) $$
$$ \qquad\qquad\;\; = c \cdot \sum_{k=1}^{K} Ref(k,t)\cdot \mathbb{1}\big(f_{dfs}(k,t) = f_D\big) \qquad (3b) $$

where $Ref(\cdot)$ and $f_{dfs}(\cdot,\cdot)$ denote the individual reflection area and the Doppler shift frequency at time t for the k-th body part, $\mathbb{1}(\cdot)$ represents the indicator function, and c and $f_D$ denote the scaling factor and the DFS frequency component, respectively.

However, the accurate $P_{ds}$ is not reachable due to the resolution of the WiFi signals and the approximate estimation of the body parts K. Thus we can only obtain an experimental approximation of $P_{ds}$ with an acceptable computing error $\varepsilon$ and an attenuation factor $a_{at}$, as in (4a). For accessibility and differentiability, a Gaussian distribution is applied to model the superposition of the movements of body parts, as in (4b):

$$ P_{ds}(f_D, t) \approx c \sum_{k=1}^{K} Ref(k,t)\, a_{at}\, \mathrm{Rect}\!\left(\frac{f_{dfs}(k,t)-f_D}{2\varepsilon}\right) \qquad (4a) $$
$$ \qquad\qquad\;\; \approx c \sum_{k=1}^{K} Ref(k,t)\cdot e^{-\frac{(f_{dfs}(k,t)-f_D)^2}{2(\varepsilon/3)^2}} \qquad (4b) $$

Since $f_{dfs}$ reflects the moving velocity v with $v = f_{dfs} \times \lambda/2$ [38], [39], the corresponding acceleration $a_k$ of each body part can be denoted as $\frac{\partial}{\partial t}|f_{dfs}(k,t) - f_{D_0}|$ for a fixed $f_{D_0}$. Assuming the variance of $Ref(\cdot)$ can be omitted compared with the superposition effects on $P_{ds}$ between consecutive DFS spectrograms, the power change rate can be derived as Equation (5a). With the limitation of the rigid body parts of a human and the continuity of acceleration, we can reasonably simplify the derivative of the power as Equation (5b) to relate the DFS power change to the acceleration $a_k$:

$$ \left|\frac{\partial P_{ds}(f_{D_0}, t)}{\partial t}\right| \approx \left| c \sum_{k=1}^{K} Ref(k,t)\cdot \frac{-9\Delta_k}{\varepsilon^2}\, e^{-\frac{9\Delta_k^2}{2\varepsilon^2}}\cdot a_k \right| \qquad (5a) $$
$$ \qquad\qquad\qquad\; \approx c \sum_{k=1}^{K} Ref(k,t)\cdot \frac{9\Delta_k}{\varepsilon^2}\, e^{-\frac{9\Delta_k^2}{2\varepsilon^2}}\cdot |a_k| \qquad (5b) $$
$$ \left(\Delta_k = |f_{dfs}(k,t) - f_{D_0}|,\quad a_k = \frac{\partial \Delta_k}{\partial t}\right) $$
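As a brief worked step connecting (4b) and (5a), a sketch assuming $Ref(k,t)$ is treated as constant during differentiation:

$$ \frac{\partial}{\partial t}\, e^{-\frac{(f_{dfs}(k,t)-f_{D_0})^2}{2(\varepsilon/3)^2}} = -\frac{9\Delta_k}{\varepsilon^2}\cdot \frac{\partial \Delta_k}{\partial t}\cdot e^{-\frac{9\Delta_k^2}{2\varepsilon^2}} = -\frac{9\Delta_k}{\varepsilon^2}\, a_k\, e^{-\frac{9\Delta_k^2}{2\varepsilon^2}}, $$

since $2(\varepsilon/3)^2 = 2\varepsilon^2/9$ and $\Delta_k = |f_{dfs}(k,t)-f_{D_0}|$.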

Thus we find that the power change rate increases as $a_k$ rises for all K body parts. The personalized acceleration information as biometrics over time [28], [29], [30], [31] can therefore be detected when users perform gestures by computing the derivative of the DFS spectrograms.

Further, the remaining challenges are the efficiency of the derivative calculation and the interference of redundant data. We therefore design an algorithm based on the seam carving problem in computer graphics for content-aware image resizing [10]. First, we filter the redundant interference with edge detection methods while optimizing the derivative calculation using a difference scheme with a convolution operator. Then we develop an efficient method based on the seam carving algorithm to generate multiple dominant carving paths (mentioned in §3.2) as the motion change pattern for each DFS spectrogram of the PCA components. Suppose K dominant carving paths are considered and each path traces the most significant motion changes over time. We use $w_{i,j} \in (0, 1]$ to denote the weight of $|\frac{\partial}{\partial t} P_{ds}(f_D, t)|_{i,j}$ for the i-th frequency bin at the j-th packet. Thus, the optimal Motion Change Pattern (MCP) along the frequency axis, as a function of timestamp indices, can be defined as:

$$ MCP_{opt} = MCP\left(\operatorname*{arg\,max}_{t_1,\cdots,t_{F_D}} \sum_{i=1}^{F_D} w_{i,t_i} \left|\frac{\partial P_{ds}(f_D, t)}{\partial t}\right|_{i,t_i}\right) \qquad (6) $$
$$ \text{s.t.}\;\; |t_i - t_{i-1}| < 2,\quad i = 1,\cdots,F_D $$

where $F_D$ denotes the number of frequency bins of the STFT. For computational efficiency, the Sobel operator along the time axis [40] is applied to the DFS spectrogram $P_{ds}(f_D, t)$, so we can obtain the temporal gradient matrix for each DFS spectrogram.
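A minimal sketch of this gradient step is shown below; treating the spectrogram as an image and using SciPy's Sobel filter along the time axis is an implementation assumption.

```python
import numpy as np
from scipy.ndimage import sobel

def temporal_gradient(power):
    """Approximate |dP_ds/dt| on a DFS power spectrogram.

    power: array of shape (n_freq_bins, n_time_bins).
    Returns the absolute temporal gradient of the same shape, which serves
    as the energy map carved by Algorithm 1.
    """
    # Sobel along axis=1 (time) is a smoothed central difference, so it
    # approximates the derivative while suppressing per-bin noise.
    return np.abs(sobel(power, axis=1, mode="nearest"))
```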

Algorithm 1 elaborates the motion change pattern carving algorithm. Upon receiving the complete DFS spectrograms from each receiver, a mean compression method is applied to retain the unique patterns while avoiding redundant computation. Assigning an appropriate value to the number of segments $T_s$ is crucial: if $T_s$ is too large, each segment becomes nearly instantaneous and the small sliding window cannot guarantee a robust feature extracted by the carving paths; in contrast, with a small $T_s$, the carving paths can become similar because the unique motion change patterns of individual users are averaged out by the large sliding window and can no longer be distinguished.



Fig. 6. Visualization examples of DFS spectrograms and motion change patterns using Algorithm 1 when two users draw the rectangle.

Algorithm 1 Motion Change Pattern Extraction
Input: [M_Powds]_{F_D×T}, W_{F_D×T_s}, ker_sobel, K_path, T_s
Output: MCP_{velocity bins × T_s}

1:  [M_Powds]_{F_D×T_s} = meanCompressing([M_Powds]_{F_D×T}, T_s)
2:  for n = 1 to K_path do
3:      initialize weight matrix W_{F_D×T_s}
4:      P_ds = W ∘ M_Powds;  M_gradient = Conv(P_ds, ker_sobel)
5:      M_sum(1, 1:T_s) = P_ds(1, 1:T_s);  M_index(1, 1:T_s) = 1
6:      for i = 2 to F_D do
7:          for j = 1 to T_s do
8:              [Val, Index] = max{M_sum(i−1, max(j−1, 1) : min(j+1, T_s))}
9:              M_index(i, j) = (Index − min(2, j)) + M_index(i, j)
10:             M_sum(i, j) = Val + M_sum(i, j)
11:         end for
12:     end for
13:     [∼, Index_tail] = max{M_sum(F_D, 1:T_s)}
14:     [M_path(n, 1:F_D), Index] = BackTrack(M_index, Index_tail)
15:     W = UpdateWeight(W, Index)
16: end for
17: MCP_{velocity bins × T_s} = VelocityMapping(M_path)

Moreover, an adaptive algorithm is too computation-intensive to run in real time, so we set $T_s$ to a constant value of 60. Therefore, the duration of each segment is restricted to between 35 ms and 70 ms considering the total sample length, which has been demonstrated to be an appropriate segmentation guideline range for DFS spectrograms [9]. Besides, the empirical study of the motion change pattern (§3.2) also shows that the most evident difference between adjacent carving paths is less than 70 ms. Note that we update $W_{F_D \times T_s}$ for each path to avoid the overlapping of carving paths, so the algorithm can carve the DFS spectrogram comprehensively. We further apply a velocity mapping function to alleviate the error from the resolution of the WiFi signals and the grid estimation of the body parts, since the real moving velocity component may be projected to a DFS bin adjacent to the true one [7]. We set a 0.16 m/s resolution within the [−1.6, 1.6] m/s velocity range to achieve 20 velocity bins. $[MCP_{VB \times T_s}]_{R \times P}$ is derived from each sample and fed into the dual-task module, where VB denotes the number of velocity bins, $T_s$ the number of segments, R the number of receivers, and P the number of PCA components.
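To make the dynamic programming behind Algorithm 1 concrete, the sketch below extracts maximum-energy carving paths from a temporal gradient map under the $|t_i - t_{i-1}| < 2$ constraint; the weight down-scaling factor used to avoid overlapping paths is an illustrative assumption.

```python
import numpy as np

def carve_paths(grad, n_paths=3, damp=0.3):
    """Extract dominant carving paths from |dP_ds/dt| (shape F_D x T_s).

    Each path assigns one time index to every frequency bin, with adjacent
    indices differing by at most 1. Returns (n_paths, F_D) time indices.
    """
    fd, ts = grad.shape
    weight = np.ones_like(grad)
    paths = np.zeros((n_paths, fd), dtype=int)
    for n in range(n_paths):
        energy = weight * grad
        m_sum = energy.copy()
        m_from = np.zeros((fd, ts), dtype=int)
        # Forward pass: accumulate the best reachable energy per cell.
        for i in range(1, fd):
            for j in range(ts):
                lo, hi = max(j - 1, 0), min(j + 1, ts - 1)
                k = lo + int(np.argmax(m_sum[i - 1, lo:hi + 1]))
                m_from[i, j] = k
                m_sum[i, j] += m_sum[i - 1, k]
        # Backtrack from the best column of the last frequency row.
        j = int(np.argmax(m_sum[-1]))
        for i in range(fd - 1, -1, -1):
            paths[n, i] = j
            weight[i, j] *= damp      # down-weight used cells (assumed scheme)
            j = m_from[i, j]
    return paths
```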

To evaluate the effectiveness of Algorithm 1 in extracting the motion change pattern from DFS spectrograms, Figure 6 illustrates two examples in which two users draw the rectangle. First, we can observe a motion change pattern consistent with the preliminary study, which captures the motion changes as the temporal rhythmic motion during performing the gesture. Second, the motion change patterns of the two users are distinguishable for the same gesture, representing their personalized performing styles. We further evaluate Algorithm 1 separately for gesture recognition and user identification in the ablation study (§5.4).

4.3 Dual-task Module

Data Adaptation and Feature Extraction: Upon receiving the motion change pattern, it is necessary to adapt it before feeding it into the DNN. Note that WiFi signals need a specific DNN architecture and data modification for CSI measurements, since CSI contains a lot of noise and is super-sensitive to environmental changes [37]. We reshape the motion change pattern into the dimensions $[MCP_{(VB \times R) \times P}]_{T_s}$, so that it resembles a digital image with a spatial resolution of VB × R and P as color channels. The underlying principle of this modification is that signals across receivers convey the Angle-of-Arrival (AoA) information [37], [38], while velocity bins contain the body part movements [7], [32]. In contrast, the PCA components are set as the color channels of the images, carrying information at different scales.

Inspired by Widar3.0 [7], we first extract spatial features from the individual motion change pattern and then profile the temporal dependencies of the whole feature sequence. To do this, a DNN of Convolutional Neural Network (CNN) based GRUs is adopted with the input tensor $[MCP_{(VB \times R) \times P}]_{T_s}$. For the $t_s$-th sampling component, the matrix $[MCP_{(VB \times R) \times P}]$ is fed into a CNN, which contains 16 3×3 filters and two 64-unit dense layers sequentially. A ReLU function and a flatten layer are further applied for non-linear feature mapping and dimension reshaping, resulting in the final CNN output characterizing the $t_s$-th sampling component. Then the spatial feature series is fed into the following GRU for temporal profiling. Empirically, we adopt 128-unit single-layer GRUs to profile the temporal relationships. Next, a dropout layer is added to avoid over-fitting, followed by a Softmax layer with a cross-entropy loss for dual-task prediction. Note that an early stopping scheme is utilized to halt the training at the right time, with a patience of 30 epochs on the validation loss [41].
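The following PyTorch sketch mirrors this per-frame CNN plus GRU feature extractor; the layer ordering (conv, ReLU, flatten, two dense layers) and the tensor layout (batch, T_s, P, VB, R) are assumptions where the text leaves them open.

```python
import torch
import torch.nn as nn

class CnnGruBranch(nn.Module):
    """Spatial-temporal feature extractor: per-frame CNN followed by a GRU."""

    def __init__(self, vb=20, r=6, p=3, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(p, 16, kernel_size=3, padding=1),   # 16 filters of 3x3
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * vb * r, 64), nn.ReLU(),        # two 64-unit dense layers
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)

    def forward(self, x):
        # x: (batch, Ts, P, VB, R) -- one motion change pattern "frame" per step.
        b, ts, p, vb, r = x.shape
        frames = self.cnn(x.reshape(b * ts, p, vb, r)).reshape(b, ts, -1)
        _, h_n = self.gru(frames)                          # h_n: (1, batch, hidden)
        return h_n.squeeze(0)                              # (batch, hidden)

class WiHFHead(nn.Module):
    """Classification head: dropout + linear; softmax is applied in the loss."""

    def __init__(self, in_dim=256, n_classes=6, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(nn.Dropout(p_drop), nn.Linear(in_dim, n_classes))

    def forward(self, feat):
        return self.net(feat)
```

Here the head input size of 256 anticipates the splicing of two 128-dimensional branch features described next; this sizing is an assumption.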

Splitting and Splicing Scheme: As illustrated in Figure 7, the dual-task module needs to splice the individual unique features to bootstrap the performance of the dual tasks collaboratively.



Fig. 7. The structure of the collaborative dual-task network for gesture recognition and user identification.

First, we split $[MCP_{(VB \times R) \times P}]_{T_s}$ evenly along the time axis. Thus we can avoid the issue of vanishing gradients caused by an overly long time series ($T_s \approx 60$). On the other hand, splitting the input generates dual inputs for the module, and the correlation between them can enhance the performance of the user identification. Then a gradient block layer is tailored for feature splicing, inspired by Multi-Task Learning (MTL). However, different from traditional MTL, the dual tasks here are expected to only splice their respective features for collaborative learning while avoiding the impact of each other's loss during the back-propagation process. Taking the gesture recognition task as an example, we utilize the output of its own CNN based GRU module as the superior feature, while the feature extracted by the CNN-GRU module for user identification is taken as an inferior one. Both are then spliced together and fed into the final layer to predict the gesture. The key point is that we do not back-propagate the gesture prediction loss to the CNN based GRU for user identification; in other words, we keep the CNN based GRU of user identification from being influenced by gesture predictions. Thus we introduce the Gradient Reversal Layer (GRL) [25] and adapt it into the gradient block layer by setting the splicing factor to zero, whereas a negative factor corresponds to adversarial learning and a positive factor to normal MTL back-propagation.

The underlying rationale comes from both theoretical analysis and experimental validation. First, the dual tasks are defined as sub-type tasks (user-defined gesture vs. gesture-defined user) instead of main-type tasks (gesture vs. user), as required by the preliminary study (§3.2). That means the feature for the main-type task can assist the sub-type task as a superior indicator, while it should not be influenced by the loss of the inferior feature of the sub-type task. On the other hand, it has been validated that user-specific features act as noise (domain information) for gesture recognition [7]. Previous work [25], [26], [27] applies GRL with a negative factor to eliminate domain noise and extract cross-domain features, while sacrificing the performance of the predictor. Since the motion change pattern already contains cross-domain knowledge, we no longer need to apply GRL to eliminate noise at the expense of predictor performance. Experiments demonstrate the efficiency of the zero splicing factor for gradient blocking (§5.6).
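A minimal sketch of such a gradient block layer is shown below; the class and variable names are illustrative. A factor of 0 reproduces the blocking behaviour (equivalent to detaching the inferior feature), while -1 would give the classic gradient reversal and +1 plain MTL sharing.

```python
import torch

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by `factor` backwards."""

    @staticmethod
    def forward(ctx, x, factor):
        ctx.factor = factor
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return ctx.factor * grad_output, None

def splice(superior, inferior, factor=0.0):
    """Concatenate a task's own feature with the other task's feature.

    factor = 0.0 blocks the back-propagated loss from reaching the other
    branch; -1.0 would reverse it (GRL) and 1.0 would share it as in MTL.
    """
    return torch.cat([superior, GradScale.apply(inferior, factor)], dim=-1)

# Usage sketch: each head sees its own feature plus a gradient-blocked copy
# of the other task's feature (head names are hypothetical).
# gesture_logits = gesture_head(splice(feat_g, feat_u, factor=0.0))
# user_logits    = user_head(splice(feat_u, feat_g, factor=0.0))
```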

TABLE 1
Design of the different feature datasets

Dataset            HuFuM     HuFuE       HuFu
Average duration   1.619 s   3.238 s     3.669 s
Concatenation      No        Instances   Gestures

5 IMPLEMENTATION AND EVALUATION

We implement WiHF and evaluate its performance through extensive experiments. The detailed settings are as follows:

Dataset: We evaluate WiHF on the public dataset from Widar3.0 [7], which contains 9 gestures of 16 users collected from 5 locations and 5 orientations in 3 environments. We adopt the subset including 9 users and randomly select 6 users for the overall performance evaluation, since the selected users share the same 6 gestures across the same 75 domains (5 positions × 5 orientations × 3 environments), while the distribution is non-uniform for the other users, making them not equivalent to the adopted ones for cross-domain testing. Besides, working with 6 users is sufficient for WiHF given the smart home scenario. To make the comparison study convincing while fully extracting the motion change pattern, we build three feature datasets using the pattern extraction module, shown in Table 1. First, the HuFuM (HuFu Mini) feature dataset is built from the original dataset for the comparison with Widar3.0. To fully extract the characteristics representing the personalized motion change pattern, we concatenate HuFuM across instances and gestures respectively, delivering the HuFuE (HuFu Extend) feature dataset with a doubled average sample length of 3.238 s and the HuFu feature dataset (see the sketch below). The rationale is that the motion change pattern is impacted by the duration and complexity of gestures. For example, drawing the rectangle is more sophisticated than sweeping, since the former has a much larger influential coverage area and a more explicit rhythmic velocity fluctuation.
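As an illustration of how the extended feature datasets might be assembled, the sketch below concatenates motion change pattern samples along the segment axis; the pairing strategy (same user, same gesture for HuFuE; same user, different gestures for HuFu) follows the text, while the array layout is an assumption.

```python
import numpy as np

def concat_mcp(sample_a, sample_b):
    """Concatenate two MCP samples of shape (R, P, VB, Ts) along time (Ts)."""
    return np.concatenate([sample_a, sample_b], axis=-1)   # (R, P, VB, 2*Ts)

# HuFuE: join two instances of the same gesture by the same user.
# hufue_sample = concat_mcp(instance_1, instance_2)
# HuFu: join instances of two different gestures by the same user.
# hufu_sample = concat_mcp(gesture_a_instance, gesture_b_instance)
```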

Metrics: To characterize WiHF's performance, prediction ACCuracy (ACC) and latency are the two main metrics for both gesture recognition and user identification, where the former measures the confidence of the prediction for each instance. Besides, the False Authorized Rate (FAR) and False Unauthorized Rate (FUR) are adopted for user identification, as in [23]. FAR measures the likelihood that WiHF incorrectly accepts a gesture attempt by an unauthorized user, while FUR measures the likelihood that WiHF incorrectly rejects a gesture performed by an authorized user. All the metrics can be calculated from the confusion matrix as:

$$ ACC = \frac{TruePositive}{TruePositive + FalsePositive} \qquad (7) $$
$$ FAR = \frac{FalsePositive}{FalsePositive + TrueNegative} \qquad (8) $$
$$ FUR = \frac{FalseNegative}{FalseNegative + TruePositive} \qquad (9) $$
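For reference, a small sketch computing these per-class metrics from a multi-class confusion matrix in a one-vs-rest fashion; the one-vs-rest reduction is an assumption, while the formulas follow (7)-(9) literally.

```python
import numpy as np

def per_class_metrics(confusion):
    """confusion[i, j]: number of samples of true class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp        # predicted as the class, but wrong
    fn = confusion.sum(axis=1) - tp        # belong to the class, but missed
    tn = confusion.sum() - tp - fp - fn
    acc = tp / (tp + fp)                   # Eq. (7), as defined in the paper
    far = fp / (fp + tn)                   # Eq. (8)
    fur = fn / (fn + tp)                   # Eq. (9)
    return acc, far, fur
```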

5.1 Overall Accuracy

We first evaluate the performance of distinguishing gestures and users on the HuFu feature dataset.



Fig. 8. Overall performance for gesture recognition and user identification on the HuFu feature dataset.

Fig. 9. Comparison with Widar3.0 for gesture recognition across domains.

Taking all domain factors into consideration, WiHF achieves an overall accuracy of 97.65% and 96.74%, with 80% of the data used for training and 20% for testing. Figure 8 shows the overall metrics, including the accuracy of gesture and user recognition as well as the FAR and FUR of user identification. WiHF achieves consistently high accuracy of over 95% for all 6 gestures and 6 users in domain. Meanwhile, the FAR values are even less than 1%, meaning that WiHF rejects almost all unauthorized users. However, we notice that the FUR values increase slightly, while still remaining below 5%. The reason is that some concatenated gestures are still simple and provide limited motion details for users to express their consistent performing styles.

5.2 Cross-Domain Evaluation

As for the cross-domain characteristics of the motion change pattern, we first compare it with Widar3.0 [7] using BVP and the HuFuM feature dataset. Then the HuFuE feature dataset is adopted to explore the impact of the gesture duration on the motion change pattern. We calculate the average accuracy across domains by holding out one of the domain instances for testing while using the others for training; the other domain components are kept unchanged when evaluating a specific domain component. By doing so, we can evaluate the zero-effort cross-domain ability in the same way Widar3.0 does. Figure 9 plots the accuracy distribution for each domain component.
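This evaluation protocol can be sketched as a leave-one-domain-out loop; `train_and_score` and the sample/domain containers below are hypothetical placeholders.

```python
import numpy as np

def leave_one_domain_out(features, labels, domains, train_and_score):
    """Average accuracy when each domain value is held out for testing.

    domains: array of per-sample domain ids (e.g., location index 1..5).
    train_and_score: hypothetical callable(train_x, train_y, test_x, test_y) -> accuracy.
    """
    scores = []
    for d in np.unique(domains):
        test = domains == d
        scores.append(train_and_score(features[~test], labels[~test],
                                       features[test], labels[test]))
    return float(np.mean(scores))
```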

We find that WiHF achieves a median accuracy of 93.92% for in-domain recognition using HuFuM, while Widar3.0 [7] reaches 90.58%.

TABLE 2
Accuracy for dual tasks on HuFu feature dataset

Target label^a                  1         2         3         4         5
Gesture   In-domain                             97.65%
          Location           90.39%    91.33%    92.00%    90.89%    95.72%
          Orientation        70.17%    87.11%    86.44%    90.06%    75.67%
User      In-domain                             96.74%
          Location           73.83%    66.69%    73.44%    76.22%    86.33%
          Orientation        57.17%    77.33%    73.17%    77.39%    59.00%

^a The target label denotes the data used for testing, while the others are used for training.

Compared with Widar3.0 for cross-domain gesture recognition, the average accuracy of WiHF with the HuFuM feature dataset increases slightly across locations and environments but drops across orientations. Specifically, WiHF achieves average accuracies of 90.32%, 79.14%, and 89.67% for gestures across locations, orientations, and environments, comparable with the state-of-the-art [7] at 88.78%, 82.97%, and 88.95%, respectively. The performance of the worst instance is only 68.19%, with the edge orientations (orientation #1 and #5) [7] as the target orientation. Such an accuracy decrease is also observed for Widar3.0, which drops to 73.26%. The reason is that gestures may be shadowed by body parts in edge orientations, resulting in unrecoverable signal loss. WiHF drops more since the shadowing effect induces the loss of detailed information in the fine-grained motion change pattern. To evaluate the impact of motion details, we increase the gesture duration by repeating the gesture in the HuFuE dataset. The performance on HuFuE demonstrates that longer gesture instances improve the overall cross-domain recognition and compensate for the pattern missing due to body shadowing. For example, we find that the gesture of drawing a rectangle is more resilient in cross-domain scenarios. Thus we can provide additional resilience to the shadowing effect by optimizing the gesture design.

Then we evaluate the performance of the cross-domain dual tasks on the HuFu feature dataset. Table 2 shows that the cross-domain performance for gesture recognition remains above 85% for orientations 2, 3, 4 and all the locations, while it declines by over 10% at the edge orientations 1 and 5, which is consistent with and better than the accuracy of Widar3.0, HuFuM, and HuFuE. This verifies the benefit of designing new gestures with more complexity and longer duration. Nevertheless, the decrease also results from the body-part shadowing, which blocks effective wireless signal sources for the motion change pattern.

For user identification, Table 2 shows an apparent decrease compared with gesture recognition, even descending to the worst case of 57.17% when testing at orientation 1. Overall, WiHF achieves 75.31% and 69.52% across locations and orientations, respectively. The reason for the performance decrease is that the collaborative dual-task module identifies users with finer-grained information than gestures, which is more sensitive to cross-domain noise, especially for short and simple gestures. We can alleviate the decrease across domains by extracting more PCA components and carving paths. On the other hand, the performance loss can be minimized by designing more


TABLE 3
Time Consumption Comparison in Parallel

                            Widar3     HuFuM     HuFuE     HuFu
Signal Processing           0.162s     0.992s    1.312s    1.557s
Feature Extraction          70.29s     0.194s    0.358s    0.379s
Total Time Consumption^a    70.61s     1.462s    2.162s    2.488s

^a It includes the procedures for loading data, signal processing, feature extraction, and recognition & identification.

sophisticated gestures with more complexity and longer duration, which minimizes the body shadowing effect. Besides, more receivers can be added to increase the coverage and reduce the possibility of missing signals.

To conclude, WiHF achieves average accuracies of 92.07% and 82.38% for gestures across locations and orientations. It simultaneously reaches 75.31% across all locations and 75.96% across the centering orientations for user identification.

5.3 Latency

In practice, the time consumption of WiHF mainly comes from feature extraction in the Pattern-carving Module and recognition as well as identification in the Dual-task Module. Table 3 shows the time consumption distribution. Note that all results are computed assuming WiHF runs in parallel across receivers, since the feature $[\mathrm{MCP}_V^{B\times R\times P}]_{T_s}$ can be derived from each receiver individually and concatenated afterwards. WiHF spends more time on signal processing due to the STFT operation over more PCA components. Widar3.0 demands 70.61s for feature extraction, while WiHF takes 69.932s less than Widar3.0. The total time consumption for the HuFu feature extraction is 2.488s, with an average gesture duration of 3.669s. We believe this is sufficient for most user identified gesture recognition scenarios. The remarkable improvement in time consumption is two-fold. First, Widar3.0 derives the BVP in the body-part coordinate system by solving an $\ell_0$ optimization problem with respect to the Earth Mover's Distance (EMD) metric. It estimates a high-dimensional BVP whose size is the square of the velocity-bin resolution ($20^2$ variables) and is therefore computation-intensive, even though Widar3.0 controls the number of estimated variables using sparsity coefficients. Second, WiHF designs the optimization problem individually for each receiver over a low-dimensional carving path whose size equals the velocity-bin resolution (20 variables). Moreover, it leverages an efficient method based on the seam carving algorithm instead of the constrained nonlinear multi-variable optimization used by Widar3.0.
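To illustrate why the carving-path formulation is cheap, the sketch below applies the classic seam-carving dynamic program [10] to a (velocity bins × time) energy map: each frame contributes exactly one bin, and the path may shift by at most one bin between consecutive frames. This is a generic illustration under our own assumptions; WiHF's actual objective and constraints are those of Algorithm 1.

import numpy as np

def carve_path(energy):
    # energy: (bins, frames) map, e.g., DFS power per velocity bin over time.
    bins, frames = energy.shape
    cost = np.full((bins, frames), -np.inf)
    parent = np.zeros((bins, frames), dtype=int)
    cost[:, 0] = energy[:, 0]
    for t in range(1, frames):
        for b in range(bins):
            lo, hi = max(0, b - 1), min(bins, b + 2)   # connectivity constraint
            prev = lo + int(np.argmax(cost[lo:hi, t - 1]))
            parent[b, t] = prev
            cost[b, t] = energy[b, t] + cost[prev, t - 1]
    path = np.zeros(frames, dtype=int)                 # backtrack the best path
    path[-1] = int(np.argmax(cost[:, -1]))
    for t in range(frames - 1, 0, -1):
        path[t - 1] = parent[path[t], t]
    return path

The dynamic program visits each of the 20 velocity bins a constant number of times per frame, which is why it scales far better than a constrained nonlinear solve over $20^2$ variables.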

In a nutshell, WiHF demonstrates cross-domain characteristics comparable with Widar3.0 for gesture recognition using the motion change pattern, while the processing time is reduced by 30×.

5.4 Ablation Study

To evaluate the motion change pattern extracted by Algorithm 1 separately, we remove the dual-task network and adopt the simple classifier of Widar3.0 [7], which takes the motion change pattern as input. Table 4 shows that it reaches 90.67% and 91.16% for gesture recognition

TABLE 4
Accuracy for the in-domain testing on HuFu feature dataset for different setups

Setup                   Algorithm 1    WiID      WiHF
Gesture Recognition     90.67%         -         97.65%
User Identification     91.16%         68.95%    96.74%

Fig. 10. Performance comparison with the existing work on the HuFu feature dataset.

and user identification, respectively. Both drop slightly without the dual-task network, i.e., when the dual tasks are divided into two separate tasks. This demonstrates the effectiveness of the collaborative dual-task network in fully exploiting the spatial and temporal features of the motion change pattern.

5.5 Comparative Study

We compare WiHF with WiID [9] to demonstrate its superiority. On one hand, WiID utilizes the motion contour of body parts as a power-based feature for user identification, while WiHF leverages the motion change pattern to represent both the unique gesture characteristics and the personalized user performing styles (§3.2). On the other hand, WiHF bootstraps the dual-task learning collaboratively, while WiID can only identify users from a gesture already recognized by an existing gesture recognition system.

We implement WiID in MATLAB to extract the motion contour and compare it with WiHF on the HuFu feature dataset. Note that WiID requires 50 instances for each specific domain when training with its simple classifier [9], while only 10 instances per domain are available in the HuFu feature dataset.

As shown in Table 4, WiHF achieves 96.74% accuracy for user identification on the HuFu feature dataset, much higher than WiID with only 68.95%. Besides, we also evaluate the performance of user identification across domains. As shown in Figure 10, the cross-location accuracy of WiHF is 30% higher than that of WiID, and WiHF achieves a 20% improvement in the cross-orientation testing over the three centering orientations. The reason behind the poor performance of WiID is two-fold. First, the power-based feature is not robust for user identification, especially in cross-domain scenarios, and the evaluations are consistent with our preliminary observations (§3.2).


Second, WiID is data-hungry and requires 50 instances for training [9]. Thus it suffers when only 10 instances are available for each specific domain in the HuFu feature dataset. Note that we over-estimate the performance of WiID since we assume the gesture is truly known. However, the state-of-the-art methods only achieve 92% accuracy for gesture recognition [7], which further affects the user identification performance of WiID.

5.6 Parameter Study

5.6.1 Impact of the Number of PCA Components:
As shown in Table 5, the accuracy gradually drops for the dual tasks in the in-domain, cross-location, and cross-orientation scenarios as the number of PCA components is reduced from 3 to 1. The rationale is that more PCA components provide more fine-grained motion change patterns at different scales for the Pattern-carving Module, as illustrated in Figure 5. They also reduce the information loss induced by the shadowing effect and supplement detailed motion of different body parts.

TABLE 5
Impact of the number of PCA components

# PCA components                    1         2         3
Gesture          In-domain       90.63%    94.33%    97.65%
Recognition      Location        79.00%    88.33%    92.07%
                 Orientation     69.50%    76.47%    82.38%
User             In-domain       89.33%    94.07%    96.74%
Identification   Location        64.43%    68.53%    75.31%
                 Orientation     59.93%    65.15%    69.52%

5.6.2 Impact of Gesture and User Type Numbers:
To study the impact of the number of gestures and subjects, various subsets of the HuFu feature dataset are used for different total type numbers (the default is 6 types for both tasks). Table 6 shows that the in-domain accuracy remains above 92% even when the number of gestures or users increases to 9. The accuracy of WiHF is therefore not significantly affected by the number of gestures and users. Moreover, we find that the type numbers of the dual tasks barely influence each other; that is, the performance of gesture recognition stays consistent as the number of users varies, even across domains, resulting from the effect of the gradient block layer.

5.6.3 Impact of the Splicing Scheme:
We analyze the effectiveness of the feature splicing scheme in predicting gestures and users collaboratively. Different factors for the gradient block layer are adopted to validate the relationship between the features of gestures and users. Figure 11 shows that WiHF achieves the best performance in both in-domain and cross-domain scenarios with a zero splicing factor. Besides, the in-domain performance declines with a positive factor, while the existence of a GRL cannot improve the cross-domain performance as expected. The reason may be that noisy signal from the domain features suppresses the target feature with a fixed factor [25]. The performance also decreases with a negative factor because, as in multi-task learning, the loss function of the inferior feature for the sub-type task can

TABLE 6
Impact of Gesture and User Type Numbers^a

#Gesture                            6         7         8         9
Gesture          In-domain       97.65%    96.14%    95.33%    93.11%
Recognition      Location        92.07%    85.81%    84.92%    83.81%
                 Orientation     82.38%    74.46%    72.72%    74.55%
User             In-domain       96.74%    97.19%    97.29%    95.33%
Identification   Location        75.31%    68.00%    70.65%    71.36%
                 Orientation     69.52%    66.43%    68.34%    70.59%

#User                               6         7         8         9
Gesture          In-domain       97.65%    96.17%    96.99%    97.21%
Recognition      Location        92.07%    90.94%    91.62%    91.22%
                 Orientation     82.38%    83.81%    79.62%    80.64%
User             In-domain       96.74%    92.56%    93.76%    94.43%
Identification   Location        75.31%    66.98%    64.70%    65.26%
                 Orientation     69.52%    63.26%    55.86%    57.26%

^a The target label denotes the test dataset.

Fig. 11. Impact of the splicing factor for the gradient block layer.

induce noise for the superior feature of the main-type task. Thus it should be blocked using the gradient block layer with a zero splicing factor.
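The gradient block layer can be realized as an identity mapping whose backward gradient is scaled by the splicing factor: a zero factor blocks the gradient, and a negative factor yields a gradient reversal layer [25]. The Keras/TensorFlow sketch below is our own minimal illustration, not WiHF's exact layer:

import tensorflow as tf

class GradientBlock(tf.keras.layers.Layer):
    # Identity in the forward pass; multiplies the backward gradient by `factor`.
    def __init__(self, factor=0.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def call(self, inputs):
        @tf.custom_gradient
        def _identity(x):
            def grad(dy):
                # factor = 0 blocks the gradient; factor < 0 reverses it (GRL).
                return self.factor * dy
            return tf.identity(x), grad
        return _identity(inputs)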

6 LIMITATION & DISCUSSION

Although our proposed WiHF system can recognize gestures and identify users simultaneously across domains, it still has some limitations.

Background interference: First, WiFi based sensing systems generally suffer from common issues such as the background interference induced by other moving objects or APs, which introduces unwanted amplitude attenuation and phase shifts into the CSI measurements and affects the distinctness and stability of the designed feature. To assess the impact of the interference statistically, we adopt data from three environments with different levels of interference (density of APs and background objects) [7]. As the interference grows stronger across the three environments, gesture recognition achieves 97.65%, 96.54%, and 96.19%, and user identification achieves 96.74%, 95.37%, and 93.64%, respectively. The evaluations show that a cleaner environment introduces less interference from background objects and other APs and yields better performance for the dual tasks. WiHF nevertheless delivers acceptable performance with limited degradation, owing to the adopted noise reduction algorithms.

To alleviate the interference, PCA [4], [5], [16], [39] and band-pass filters [7], [39] are employed to extract the principal


components of the raw measurements and filter out uninterested frequency components. Besides, we optimize the deployment for CSI acquisition. For example, the transceivers can be set to work on channel 165 at 5.825 GHz, where there are few interfering radios [7], [42]. WiHF utilizes the CSI collected from a relatively clean channel and employs PCA and filtering for interference cancellation.
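As a concrete illustration of this pre-processing, the sketch below band-passes the per-subcarrier CSI amplitude and keeps the first few principal components; the sampling rate, cut-off frequencies, and component count are placeholders rather than WiHF's exact settings:

import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.decomposition import PCA

def denoise_csi(csi_amp, fs=1000.0, low=2.0, high=80.0, n_components=3):
    # csi_amp: (samples, subcarriers) amplitude series from one receiver.
    nyq = fs / 2.0
    b, a = butter(4, [low / nyq, high / nyq], btype="band")
    filtered = filtfilt(b, a, csi_amp, axis=0)   # keep motion-induced frequencies
    return PCA(n_components=n_components).fit_transform(filtered)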

Behavior dynamics: Moreover, humans cannot always act consistently, which causes unacceptable shifts in the DFS spectrogram and the motion change pattern (see Section 2). Thus our system can suffer from variances in the extracted gesture characteristics and the personalized performing style in some cases; for example, a gesture may be misclassified or an authorized user may occasionally be rejected due to irregular behavior. WiHF can alleviate this issue with online adaptation. Specifically, in application scenarios such as the smart home and VR gaming, users are usually in a relatively relaxed state and seldom change their personalized performing style abruptly, which has been verified through long-term experiments [9], [29], [31]. To provide additional resilience to variance in the performing style, we can also adapt the deep learning model to capture the variance pattern of users by re-training the dual-task network using users' gestures collected over time.
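One lightweight way to realize such online adaptation is to freeze the shared feature layers and re-train only the task-specific head on a user's recent samples. The sketch below assumes a Keras model whose shared layers are named with a "feature_" prefix; the layer names and hyper-parameters are hypothetical, not WiHF's settings:

import tensorflow as tf

def adapt_to_user(model, recent_x, recent_y, epochs=5):
    # Freeze the shared trunk; only the task head keeps learning.
    for layer in model.layers:
        layer.trainable = not layer.name.startswith("feature_")
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy")
    model.fit(recent_x, recent_y, epochs=epochs, verbose=0)
    return model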

Deployment cost: Besides, DNNs have some common deployment issues, such as cumbersome data collection and model retraining, especially when the model is deployed in a new environment. Existing DNN based works [27], [43] adopt machine learning techniques such as transfer learning and adversarial learning to improve the cross-domain generalization ability for gesture recognition. Specifically, EI [27] designs an adversarial architecture with specific loss functions to exploit the characteristics of unlabeled data in the target domain, while CrossSense [43] employs an ANN-based roaming model to translate features from the source domain to the target domain. Both, however, require extra training efforts in either data collection or model re-training each time a new target domain is added to the recognition model [7].

Given the domain-independent feature of our system, the evaluations in Section 5.2 demonstrate that WiHF can reduce the deployment cost in cross-domain scenarios. Specifically, we train the model with data from four domains and test it in a new one, e.g., a different location or orientation. We also train the model in one environment and test it in another environment directly. WiHF achieves average accuracies of 90.32%, 79.14%, and 89.67% for gestures across locations, orientations, and environments, comparable with the state-of-the-art [7] at 88.78%, 82.97%, and 88.95%, respectively. As for user identification, a user is only required to perform the gestures in some locations, orientations, and environments, and the system can still achieve acceptable performance when adopted in a new location, orientation, or environment. From this view, it can reduce the deployment cost significantly compared with similar user identification works, such as WifiU [16] and WiID [9] in Section 5.5.

Shadowing effect: Furthermore, the cross-orientation evaluation suffers more than the others. The rationale is that the shadowing effect of body parts induces motion information loss by blocking reflected WiFi signals. One possible solution is to utilize more antennas and optimize the deployment of receivers, delivering more vantage views of the target person. Besides, some specific gestures can be designed for WiHF to fully represent the personalized performing styles and reduce the information loss from shadowing. The evaluations in Section 5.2 demonstrate that gestures with longer duration improve the recognition accuracy, as shown in Figure 9. Note that the processing overhead does not increase significantly since we can extract the motion change pattern from the antennas in parallel, as sketched below. Besides, samples can be segmented for feature extraction with complex gestures, keeping the system real-time.
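The parallel per-receiver extraction mentioned above (and assumed in Section 5.3) can be organized as in the sketch below; extract_fn stands in for the per-receiver pattern-carving step and is an assumption for illustration:

from multiprocessing import Pool

def extract_all_receivers(csi_per_receiver, extract_fn, workers=None):
    # Carve the motion change pattern of every receiver in parallel,
    # then return the per-receiver features for concatenation.
    with Pool(processes=workers) as pool:
        return pool.map(extract_fn, csi_per_receiver)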

Future work: Dedicated algorithms and experimental setups can be designed to further improve the robustness and accuracy of WiHF. For example, we can employ background subtraction [34], [44], [45] to remove the initial background interference. To provide additional resilience to noise, we can incorporate multi-dimensional features [34], [38] to capture more individual information of users, such as the Doppler shift induced by the moving arm [4], [5] and the BVP [7]. Besides, we can adapt our system over time to capture the performing variances of humans. Furthermore, specific gestures can be designed to fully represent the personalized performing style. We leave these as future work.

7 CONCLUSIONS

To unleash the potential of WiFi based gesture recognition, we propose WiHF to enable gesture recognition and user identification simultaneously in real time. WiHF derives a domain-independent motion change pattern of arm gestures from WiFi signals, rendering the unique gesture characteristics and the personalized user performing styles. To be real-time, we carve the dominant motion change patterns with an efficient method based on the seam carving algorithm. Taking the carving path of the motion change pattern as input, a collaborative dual-task DNN with the splitting and splicing schemes is adopted. We implement WiHF and evaluate its performance on a public dataset. Experimental results show that WiHF achieves 97.65% and 96.74% accuracy for in-domain gesture recognition and user identification, respectively. The cross-domain gesture recognition performance is comparable with the state-of-the-art methods, while the processing time is reduced by 30×.

REFERENCES

[1] I. Asimov, The Robots of Dawn (Robot, 4). Spectra, 1983.
[2] Q. Pu, S. Gupta, S. Gollakota, and S. Patel, “Whole-home gesture recognition using wireless signals,” in Proceedings of ACM MobiCom, 2013.
[3] A. Virmani and M. Shahzad, “Position and orientation agnostic gesture recognition using WiFi,” in Proceedings of ACM MobiSys, 2017.
[4] R. H. Venkatnarayan, G. Page, and M. Shahzad, “Multi-user gesture recognition using WiFi,” in Proceedings of ACM MobiSys, 2018.
[5] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, “Understanding and modeling of WiFi signal based human activity recognition,” in Proceedings of ACM MobiCom, 2015.
[6] K. Ali, A. X. Liu, W. Wang, and M. Shahzad, “Keystroke recognition using WiFi signals,” in Proceedings of ACM MobiCom, 2015.
[7] Y. Zheng, Y. Zhang, K. Qian, G. Zhang, Y. Liu, C. Wu, and Z. Yang, “Zero-effort cross-domain gesture recognition with WiFi,” in Proceedings of ACM MobiSys, 2019.
[8] Z. Zhou, C. Wu, Z. Yang, and Y. Liu, “Sensorless sensing with WiFi,” Tsinghua Science and Technology, vol. 20, no. 1, 2015.
[9] M. Shahzad and S. Zhang, “Augmenting user identification with WiFi based gesture recognition,” ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 3, 2018.
[10] S. Avidan and A. Shamir, “Seam carving for content-aware image resizing,” in Proceedings of ACM SIGGRAPH, 2007.
[11] Y. Zeng, P. H. Pathak, and P. Mohapatra, “WiWho: WiFi-based person identification in smart spaces,” in Proceedings of ACM/IEEE IPSN, 2016.
[12] F. Hong, X. Wang, Y. Yang, Y. Zong, Y. Zhang, and Z. Guo, “WFID: Passive device-free human identification using WiFi signal,” in Proceedings of EAI Mobiquitous, 2016.
[13] J. Zhang, B. Wei, W. Hu, and S. S. Kanhere, “WiFi-ID: Human identification using WiFi signal,” in Proceedings of the IEEE International Conference on Distributed Computing in Sensor Systems, 2016.
[14] H. Zou, Y. Zhou, J. Yang, W. Gu, L. Xie, and C. J. Spanos, “WiFi-based human identification via convex tensor shapelet learning,” in Proceedings of AAAI, 2018.
[15] T. Xin, B. Guo, Z. Wang, M. Li, Z. Yu, and X. Zhou, “FreeSense: Indoor human identification with WiFi signals,” in Proceedings of IEEE GLOBECOM, 2016.
[16] W. Wang, A. X. Liu, and M. Shahzad, “Gait recognition using WiFi signals,” in Proceedings of ACM UbiComp, 2016.
[17] J. Lv, W. Yang, D. Man, X. Du, M. Yu, and M. Guizani, “Device-free passive identity identification via WiFi signals,” 2017.
[18] Y. Wang, J. Liu, Y. Chen, M. Gruteser, J. Yang, and H. Liu, “E-eyes: Device-free location-oriented activity identification using fine-grained WiFi signatures,” in Proceedings of ACM MobiCom, 2014.
[19] D. Vasisht, A. Jain, C.-Y. Hsu, Z. Kabelac, and D. Katabi, “Duet: Estimating user position and identity in smart homes using intermittent and incomplete RF-data,” ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 2, 2018.
[20] F. Wang, J. Han, Z. Dai, H. Ding, and D. Huang, “WiPIN: Operation-free person identification using WiFi signals,” in Proceedings of IEEE GLOBECOM, 2019.
[21] C. Shi, J. Liu, H. Liu, and Y. Chen, “Smart user authentication through actuation of daily activities leveraging WiFi-enabled IoT,” in Proceedings of ACM MobiHoc, 2017.
[22] B. Wei, W. Hu, M. Yang, and C. T. Chou, “From real to complex: Enhancing radio-based activity recognition using complex-valued CSI,” ACM Transactions on Sensor Networks, vol. 15, no. 3, 2019.
[23] C. Zhao, Z. Li, T. Liu, H. Ding, J. Han, W. Xi, and R. Gui, “RF-Mehndi: A fingertip profiled RF identifier,” in Proceedings of IEEE INFOCOM, 2019.
[24] Z. Yang, L. Jian, C. Wu, and Y. Liu, “Beyond triangle inequality: Sifting noisy and outlier distance measurements for localization,” ACM Transactions on Sensor Networks, vol. 9, no. 2, 2013.
[25] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Proceedings of ICML, 2015.
[26] M. Zhao, S. Yue, D. Katabi, T. S. Jaakkola, and M. T. Bianchi, “Learning sleep stages from radio signals: A conditional adversarial architecture,” in Proceedings of ICML, 2017.
[27] W. Jiang, C. Miao, F. Ma, S. Yao, Y. Wang, Y. Yuan, H. Xue, C. Song, X. Ma, D. Koutsonikolas et al., “Towards environment independent device free human activity recognition,” in Proceedings of ACM MobiCom, 2018.
[28] J. Liu, L. Zhong, J. Wickramasuriya, and V. Vasudevan, “uWave: Accelerometer-based personalized gesture recognition and its applications,” 2009.
[29] J. Guerra-Casanova, C. Sanchez-Avila, G. Bailador, and A. de Santos Sierra, “Authentication in mobile devices through hand gesture recognition,” International Journal of Information Security, vol. 11, no. 2, 2012.
[30] F. Okumura, A. Kubota, Y. Hatori, K. Matsuo, M. Hashimoto, and A. Koike, “A study on biometric authentication based on arm sweep action with acceleration sensor,” in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems, 2006.
[31] K. Matsuo, F. Okumura, M. Hashimoto, S. Sakazawa, and Y. Hatori, “Arm swing identification method with template update for long term stability,” in Proceedings of the IAPR International Conference on Biometrics, 2007.
[32] P. Van Dorp and F. Groen, “Feature-based human motion parameter estimation with radar,” IET Radar, Sonar & Navigation, vol. 2, no. 2, 2008.
[33] F. Adib, C.-Y. Hsu, H. Mao, D. Katabi, and F. Durand, “Capturing the human figure through a wall,” ACM Transactions on Graphics, vol. 34, no. 6, 2015.
[34] X. Li, D. Zhang, Q. Lv, J. Xiong, S. Li, Y. Zhang, and H. Mei, “IndoTrack: Device-free indoor human tracking with commodity WiFi,” ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 3, p. 72, 2017.
[35] K. Qian, C. Wu, Z. Zhou, Y. Zheng, Z. Yang, and Y. Liu, “Inferring motion direction using commodity WiFi for interactive exergames,” in Proceedings of ACM CHI, 2017.
[36] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” in Proceedings of NIPS Workshop on Deep Learning, 2014.
[37] Y. Ma, G. Zhou, and S. Wang, “WiFi sensing with channel state information: A survey,” ACM Computing Surveys, vol. 52, no. 3, 2019.
[38] K. Qian, C. Wu, Y. Zhang, G. Zhang, Z. Yang, and Y. Liu, “Widar2.0: Passive human tracking with a single WiFi link,” in Proceedings of ACM MobiSys, 2018.
[39] K. Qian, C. Wu, Z. Yang, Y. Liu, and K. Jamieson, “Widar: Decimeter-level passive tracking via velocity monitoring with commodity WiFi,” in Proceedings of ACM MobiHoc, 2017.
[40] N. Kanopoulos, N. Vasanthavada, and R. L. Baker, “Design of an image edge detection filter using the Sobel operator,” IEEE Journal of Solid-State Circuits, 1988.
[41] F. Chollet et al., “Keras,” https://keras.io, 2015.
[42] Y. Zheng, C. Wu, K. Qian, Z. Yang, and Y. Liu, “Detecting radio frequency interference for CSI measurements on COTS WiFi devices,” in Proceedings of IEEE ICC, 2017.
[43] J. Zhang, Z. Tang, M. Li, D. Fang, P. Nurmi, and Z. Wang, “CrossSense: Towards cross-site and large-scale WiFi sensing,” in Proceedings of ACM MobiCom, 2018.
[44] F. Adib, Z. Kabelac, and D. Katabi, “Multi-person localization via RF body reflections,” in Proceedings of USENIX NSDI, 2015.
[45] F. Adib, Z. Kabelac, D. Katabi, and R. C. Miller, “3D tracking via body radio reflections,” in Proceedings of USENIX NSDI, 2014.

Chenning Li (S’20) received the B.Eng. degree from Tsinghua University, Beijing, China, in 2018. He is currently working toward the Ph.D. degree in the Department of Computer Science and Engineering, Michigan State University, USA. His primary research interests include wireless sensing, indoor localization, mobile computing, and artificial intelligence of things.

Manni Liu received the B.Eng. degree from the Department of Mathematics and Applied Mathematics at Sun Yat-sen University. She is currently pursuing the Ph.D. degree in the Department of Computer Science and Engineering, Michigan State University, USA. Her research interests include the Internet of Things, mobile computing, and indoor localization.


Zhichao Cao (M’12) received the B.E. degree from the Department of Computer Science and Technology at Tsinghua University, and the Ph.D. degree from the Department of Computer Science and Engineering at Hong Kong University of Science and Technology. He worked as an assistant professor in the School of Software, Tsinghua University. He is currently with the Department of Computer Science and Engineering, Michigan State University. His research interests include IoT systems and mobile and edge computing. He is a member of the IEEE.
