
Fine-Grained Glove-Based Activity Recognition using Instance Aggregation and Abstraction

Donald J. Patterson, Dieter Fox, Henry Kautz

University of Washington, Department of Computer Science and Engineering

Box 352350, Seattle, WA, USA 98195-2350
{djp3,fox,kautz}@cs.washington.edu

Matthai Philipose, Intel Research Seattle

1100 NE 45th St., Seattle, WA, USA 98105
[email protected]

Abstract— In this paper we present results related to achieving fine-grained activity recognition. Such recognition holds the promise of giving wearable computers access to contextual information which will make them more effective and less obtrusive to their users. Recent RFID sensor technologies have allowed wearable computers to have unambiguous knowledge about the objects in their environment. In this paper we examine the advantages and challenges of reasoning with object instances detected by an RFID glove. We present a sequence of increasingly powerful probabilistic graphical models for activity recognition. We demonstrate the advantages of adding additional complexity and conclude with a model that can reason tractably about object instances, and uses a technique called abstraction smoothing that generalizes gracefully from instances to object classes. Using these models we demonstrate situations in which both aspects are crucial for disambiguating activities. We demonstrate the use of these techniques on data collected from a typical morning household routine.

I. INTRODUCTION

The convergence of advances in wearable sensing technology and representations and algorithms for probabilistic reasoning is pushing the boundaries of the types of context with which a computer may reason. Context has different meanings in different application domains, but in the realm of wearable computers it is primarily used to refer to the state of the user: his or her location, current activity, and social environment.

In this paper we provide evidence that reasoning about objects in an environment, both in general and specific terms, using a radio-frequency identification (RFID) glove is fruitful for inferring aspects of context related to activity performance. Not only are RFIDs a novel sensing modality, but they are also able to recognize individual objects with a 0% false positive rate [1]. We believe that this is a promising development toward fine-grained activity recognition which will allow a computer to interpret and understand day to day human experience.

Previous work has approached activity recognition from several different directions. Early work on plan recognition [2], [3] had the insight that much behavior follows stereotypical patterns, but lacked well-founded and efficient algorithms for learning and inference, or any way to ground their theories in directly sensed experience. More recent work on plan-based behavior recognition [4] uses more robust probabilistic inference algorithms, but still does not directly connect to sensor data.

In the vision community there is much work on behavior recognition using probabilistic models, but it usually focuses on recognizing simple low-level behaviors in controlled environments [5]. Recognizing complex, high-level activities using machine vision has only been achieved by carefully engineering domain-specific solutions, such as for hand-washing [6] or operating a home medical appliance [7].

Much of the recent work in wearable computing has been along the same lines as this paper, justifying the value of various sensing modalities and inference techniques for interpreting various aspects of context. This work ranges from unimodal evaluations of activity recognition and social context from audio [8][9][10], video [11], and accelerometers [12], to multi-modal sensing which included these and other sensors [13], and some which has attempted to optimize sensor selection for arbitrary activity recognition [14]. A theme of much of this recent work is that “heavyweight” sensors such as machine vision can be replaced by large numbers of tiny, robust, easily worn sensors [15][16].

RFID-based activity recognition relies on the “invisible human hypothesis” of [17], which states that activities are well characterized by the objects that are manipulated during their performance, and therefore can be recognized from streams of sensor data about object touches. Although very little work has explored this hypothesis, one notable exception is the work of Philipose et al. [17], who worked on the problem of detecting the performance of 65 activities of daily living (ADLs), from 14 classes such as cooking, cleaning, and doing laundry, and used data from a wearable RFID tag reader deployed in a home with tagged objects. Although the work succeeded in recognizing a variety of activities performed in a realistic setting, it functioned at a coarse grain. It was able to categorize activities that were easily disambiguated by key objects (for example, using a toothbrush versus using a vacuum cleaner), used hand-made models, and made no attempt to model how activities interact and flow between each other.

Fig. 2. An example of the RFID tag types used in this experiment next to a ruler for scale.

In this paper we begin to explore extending object-interaction based behavior recognition in a much more challenging and realistic setting. We address the problem of fine-grained activity recognition — for example, not just recognizing that a person is cooking, but determining what they are cooking — in the presence of complexities such as interleaved and interrupted activities. We used data collected during the actual performance of a household morning routine. In particular, in this paper we put forward the following contributions related to accurately tracking and distinguishing activities based on the objects that are touched during their performance:

• We demonstrate accurate activity recognition in the presence of interleaved and interrupted activities.

• We accomplish this using automatically learned models.

• We demonstrate the ability to disambiguate activities when the models share common objects.

• We show that object-based activity recognition can be improved by distinguishing object instances from object classes, and learning dependencies on aggregate features.

• We introduce abstraction smoothing, a form of relational learning, that can provide robustness and prevent drastic loss in recognition accuracy in the face of expected variation in activity performance.

II. PROBLEM DOMAIN

For the purpose of our experiments, we focused on a morning routine in a small home. To create our data set, one of the authors performed 11 activities 12 times each. Table I lists the activities which were included. Each activity was performed by itself twice, and then on 10 mornings all of the activities were performed together in a typical, but naturally varying manner.

In order to capture the identity of the objects being manipulated, the kitchen was outfitted with 61 RFID tags placed on every object touched by the user during a practice trial. RFID tags have a form factor comparable to a postage stamp. The two tag technologies that we used are shown in figure 2. Unlike barcodes they do not require line-of-sight in order to be read, and they identify specific globally unique instances of objects, rather than merely classes of objects. The user wore two gloves (see figure 1), built by Intel Research Seattle [18], outfitted with antennas in the palms that were able to detect when an RFID tag was touched. The time and id of every object touched was sent wirelessly by the glove to a database for offline analysis.

TABLE I
ACTIVITIES PERFORMED DURING DATA COLLECTION

1   Using the bathroom
2   Making Oatmeal
3   Making Soft-Boiled Eggs
4   Preparing Orange Juice
5   Making coffee
6   Making tea
7   Making or answering a phone call
8   Taking out the trash
9   Setting the table
10  Eating Breakfast
11  Clearing the table

The mean length of the 10 interleaved runs was 27.1 minutes (σ = 1.7) and events were captured at approximately 10 Hz. The mean length of each uninterrupted portion of the interleaved tasks was 74 seconds. Most tasks were interleaved with or interrupted by others during the 10 full data collection sessions.

This data represents a significant increase in inference difficulty over previous work [17] by introducing three confounding factors:

• We collected data in the presence of multiple people. Although only one was instrumented, four people were present in the home during all data collection. This added natural variability and additional relevance to our data.

• The activities that the user performed were not performed sequentially or in isolation from the others. Each activity was performed in such a way that during pauses in one activity, progress was made in other activities (such as when waiting for water to boil) and some activities interrupted others at unexpected times (such as answering the phone).

• The activities that the user was asked to perform had many objects in common. Between the interleaving and object commonality, activity recognition was much more difficult than simply associating a characteristic object with an activity (such as a vacuum cleaner indicating a vacuuming activity).

III. MODEL COMPARISON

In order to justify the inference model that we developed, we proceeded systematically by developing the simplest possible probabilistic model, evaluating its performance, and then augmenting it with features that were sufficient to disambiguate errors. In this section we will present the three baseline models that we used and the final model that incorporated reasoning with aggregate features. The models increase in complexity by adding representational power.

We chose to perform activity recognition using the probabilistic generative framework of Hidden Markov Models and their factored analog, the Dynamic Bayes Net. We chose this approach since the algorithms for inference and parameter learning are well understood [19] and the way in which these models capture timing constraints of activities is well characterized (e.g. [20]).

Fig. 1. The wearable RFID glove.

Fig. 3. A state diagram for model baseline A consisting of 11 independent one-state HMMs. The log likelihood of each HMM was calculated on overlapping successive short windows of data.

A. Baseline Model A: Independent Hidden Markov Models

As a baseline we modeled the activities individually as 11 independent one-state Hidden Markov Models (HMMs). A diagram of this model is shown in figure 3. Used in a generative context, each state emits an object-X-touched event or a no-object-touched event at each tick of the clock. Each state’s emission probability was trained on 12 examples of a user performing the corresponding activity. After training, the probability of emitting a no-object-touched event was equalized across all models so that the timing characteristic of the model was completely captured by the self-transition probability.

To infer the activity being performed at each second, each HMM was presented with a 74 second window of data ending at the query second (this was the mean amount of atomic time spent performing any portion of an activity in the training data). This produced a log-likelihood value for each model at each tick of the clock. The activity model with the highest log likelihood was used as the system’s estimate of the current activity.
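As a rough sketch of this per-activity scoring, the Python fragment below trains a one-state model per activity (a smoothed emission distribution plus a self-transition probability) and labels each tick with the activity whose trailing window receives the highest log-likelihood. The function names, the Laplace smoothing, and the duration-based estimate of the self-transition probability are our own illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def train_model(sequences, vocab, alpha=1.0):
    """One-state HMM for one activity: a smoothed categorical emission
    distribution over events (vocab must include the no-object-touched
    symbol) plus a self-transition probability from the mean segment length."""
    counts = Counter(e for seq in sequences for e in seq)
    total = sum(counts.values())
    emit = {o: (counts[o] + alpha) / (total + alpha * len(vocab)) for o in vocab}
    mean_len = sum(len(s) for s in sequences) / len(sequences)
    stay = max(1.0 - 1.0 / mean_len, 1e-6)   # geometric duration assumption
    return {"emit": emit, "stay": stay}

def window_loglik(model, window):
    """Log-likelihood of an observation window under a one-state HMM."""
    return sum(math.log(model["stay"]) + math.log(model["emit"][obs])
               for obs in window)

def classify(models, stream, window=74):
    """At each tick, score the trailing window under every activity model and
    report the activity with the highest log-likelihood.  (The paper also
    equalizes the no-object-touched emission across models; omitted here.)"""
    labels = []
    for t in range(len(stream)):
        w = stream[max(0, t - window + 1): t + 1]
        labels.append(max(models, key=lambda a: window_loglik(models[a], w)))
    return labels
```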

This model was trained and tested on data in which object types were equalized so that there was no distinction made between spoon #1 and spoon #2, for example, but both were identified as “spoon” to the model.

Fig. 4. A state diagram for model baseline B consisting of an 11 state HMM. At any moment in time the user is in exactly one of the 11 states. The most likely state sequence was recovered using the Viterbi algorithm over the entire data sequence.

B. Baseline Model B: Connected HMMs

As a second baseline we connected the states from the 11 independent HMMs of baseline A in order to capture knowledge about transitions between activities. We retrained this HMM using the 10 examples of the user performing the 11 interleaved activities. The no-object-touched emission probability was again equalized across all states. A diagram of the resulting model is shown in Figure 4.

This HMM was evaluated over the entire data window and the Viterbi algorithm [19] was used to recover the activity at every time point given by the maximum likelihood path through the state-space. Again, this model was presented data in which object types were used to eliminate distinctions between instantiations of objects.
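A minimal sketch of the Viterbi decoding used for the connected HMM could look as follows, assuming log-space parameter tables (log_init, log_trans, log_emit) estimated from the ten interleaved training runs; the data structures are illustrative rather than the authors' code.

```python
def viterbi(obs, states, log_init, log_trans, log_emit):
    """Most likely activity sequence for a fully connected HMM (model B):
    log_trans[i][j] is the log transition probability from activity i to j,
    log_emit[i][o] the log probability that activity i emits observation o."""
    V = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for j in states:
            best = max(states, key=lambda i: V[-1][i] + log_trans[i][j])
            ptr[j] = best
            scores[j] = V[-1][best] + log_trans[best][j] + log_emit[j][o]
        V.append(scores)
        back.append(ptr)
    # Trace back the maximum likelihood path through the state space.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```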

C. Baseline Model C: Object-Centered HMMs

As a third baseline we split the states in baseline B into a separate state for each activity and each object that may be observed. This allowed the model to capture some information about how objects were used at different points (i.e. early versus later) in the execution of an activity, at the expense of introducing more trainable parameters. We retrained this HMM using the 10 examples of the user performing the 11 interleaved activities. The no-object-touched emission probability was equalized across all states. The conditional probability table associated with state observations was degenerate for this model since each state could only emit an observation of one particular object. A diagram of the resulting model is shown in Figure 5.

Fig. 5. A state diagram for model baseline C consisting of a 671 (61 tags × 11 activities) state HMM. At any moment in time the user is in exactly one of the 671 states. The most likely state sequence was recovered using the Viterbi algorithm over the entire data sequence.

This HMM was also evaluated over the entire data window using the Viterbi algorithm, and was presented data with no distinction made between instantiations of object types.

D. Aggregate Model D: Dynamic Bayes Net with Aggregates

Figure 6 shows a Dynamic Bayesian Network (DBN) that we propose to improve activity recognition. Unlike figures 3, 4, and 5, which are state diagrams, this figure is a dependency diagram in which the state of the system is factored into independent random variables indicated by nodes in the graph. It has been rolled out in time showing nodes from three different time steps, and dependency diagrams for models A-C are also shown for comparison.

The top row of model D has the same dependency structure as models A-C. Model D adds an additional deterministic state variable, the boolean exit node “E”, which captures dynamic information about when an activity has changed. It is true if the state has changed from one time step to the next.

This model also adds support for reasoning about instances of objects: the gray box denotes a template which is instantiated for each class of objects with multiple object instantiations. Each “Obj” node in the template is a deterministic boolean node indicating whether a given instantiation of an object has been touched since the last time the activity changed.

The “Obj” nodes are aggregated by a summation node, “+”. When the DBN changes activities (“Exits”), the system explicitly reasons about the number of different instantiations of the objects in the template that were touched. This is captured by the dependence of the aggregate distribution node, “AD”, on the state node, “S”. The “AD” node is constrained to be equal to the summation node from the previous time step through the observed equality node, “=”. The equality node is always observed to be true and, coupled with its conditional probability table, forces “AD” to equal “+”.

TABLE II
SUMMARY OF MODEL REPRESENTATIONAL POWER

                        Exponential Timing   Inter-Activity   Intra-Activity   Aggregate
                        Distribution         Transitions      Transitions      Info.
Model A (Ind. HMMs)            √
Model B (Conn. HMMs)           √                   √
Model C (Object HMMs)          √                   √                 √
Model D (Agg. DBN)             √                   √                                  √

The overall effect is that the probability of the model inferring whether or not an activity has ended is mediated by an expectation over the number of objects in each class which have been touched.
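Outside of the DBN machinery, the aggregate quantity itself is straightforward to compute from a touch stream. The sketch below, with assumed data structures (class/instance pairs plus the set of ticks at which the exit node fired), tracks how many distinct instances of each class have been touched since the last activity change; the function name and representation are ours, not the authors'.

```python
from collections import defaultdict

def aggregate_counts(touches, activity_changes):
    """For every object class, count distinct instances touched since the
    activity last changed -- the quantity the '+' node aggregates and the
    'AD' node is conditioned on in model D.

    touches: list of (tick, object_class, instance_id)
    activity_changes: set of ticks at which the exit node 'E' fired
    """
    seen = defaultdict(set)          # class -> instances touched so far
    history = []
    for tick, cls, inst in touches:
        if tick in activity_changes:
            seen.clear()             # a new activity resets the history
        seen[cls].add(inst)
        history.append((tick, {c: len(s) for c, s in seen.items()}))
    return history

# Example: two different spoons touched within the same activity segment.
counts = aggregate_counts(
    [(0, "spoon", "spoon#1"), (1, "bowl", "bowl#1"), (2, "spoon", "spoon#2")],
    activity_changes=set())
print(counts[-1][1])   # {'spoon': 2, 'bowl': 1}
```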

This model, including the aggregate distribution for activities, was also automatically learned from the 10 labeled training traces, but in contrast to models A, B and C, this model differentiates between instantiations of objects so that, for example, spoon #1 and spoon #2 created different observations.

E. Model Summary

The various features of the models are summarized in table II. In this table, “Exponential Timing” distributions refer to the fact that the model expects the length of an uninterrupted portion of an activity to occur with a duration which is distributed according to an exponential distribution. The parameters of the distribution are learned from the data. This timing distribution is a result of the structure of the HMMs and DBNs used. “Inter-Activity” transitions refers to the ability of the model to represent the tendency of certain activities to follow or interrupt other activities more or less often. “Intra-Activity Transitions” refers to the ability of the model to represent biases about when in the course of an activity certain objects are used (e.g., one uses a kettle early in the process of making tea). Finally, “Aggregate Information” refers to the ability of the model to represent aggregations over individual objects.
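As a side note on the “Exponential Timing” row: with a constant self-transition probability, the implied segment-length distribution is, strictly speaking, geometric, the discrete-time analog of the exponential. Under the assumption of a self-transition probability a per tick, the duration D of an uninterrupted segment satisfies

```latex
P(D = d) = a^{d-1}(1-a), \qquad \mathbb{E}[D] = \frac{1}{1-a}
```

Under the further illustrative assumption of one tick per second, the 74 second mean segment length reported in Section II would correspond to a self-transition probability of roughly 1 - 1/74 ≈ 0.986.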

IV. EXPERIMENTS

Fig. 6. A dependency diagram showing the DBN used for inference in comparison to the baseline models, rolled out for three time steps. Time steps are separated by a vertical dotted line. Observed variables are shaded, hidden variables are not. All variables are discrete multinomials.

For our experiments we conducted leave-one-out cross validation to train and test the three baseline models and the fourth aggregate model as described previously. We calculated two performance metrics. The first was the percentage of the time the model correctly inferred the true activity. This metric is biased against slight inaccuracies in the start and end times and will vary based on the time granularity with which the experiments were conducted. We also evaluated our models using a string edit distance measure. In this case we treated the output of the inference as a string over an 11 character alphabet, one character per activity, with all repeating characters merged. We calculated the minimum string edit distance between the inference and the ground truth. A string edit distance of 1 means that the inferred activity sequence either added a segment of an activity that didn’t occur (insertion), missed a segment that did occur (deletion), or inserted an activity that didn’t occur into the middle of another activity (reverse splicing). A perfect inference will have a string edit distance of 0. The string edit distance is biased against rapid changes in the activity estimate and is tolerant of inaccuracies in the start and end times of activities. Table III summarizes the results of the experiments.
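One way to compute this segment-level metric is to collapse runs of identical per-tick labels and take the Levenshtein distance between the resulting strings; the helpers below are an illustrative implementation, not the exact procedure used in the paper.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def segment_edit_distance(inferred, truth):
    """Collapse per-tick activity labels into runs of identical labels and
    compare the two resulting segment strings."""
    collapse = lambda seq: [x for i, x in enumerate(seq) if i == 0 or x != seq[i - 1]]
    return edit_distance(collapse(inferred), collapse(truth))

# e.g. ground truth A A B B, inference A A C B -> one spurious segment
print(segment_edit_distance(list("AACB"), list("AABB")))   # 1
```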

A. Baseline Model A

Figure 7-A shows a short portion of the inference in which model A performed badly and demonstrates some of the key reasons why inference with this model only obtained 67% accuracy. In this figure ground truth is indicated with a thin line and the inference is indicated by a dot at each time slice. The model clearly needs to be smoothed to prevent rapid switching between activities. Since independent HMMs cannot represent transitions between themselves, rapid switching is a problem. This was the primary motivation for using model B.

TABLE III
SUMMARY OF ACCURACY RESULTS

                           Time-Slice Accuracy   Edit Distance
                           µ (σ)                 µ (σ)
Model A (Ind. HMMs)        68% (5.9)             12 (2.9)
Model B (Connected HMMs)   88% (4.2)             9 (6.2)
Model C (Object HMMs)      87% (9.3)             14 (10.4)
Model D (Agg. DBN)         88% (3.1)             7 (2.2)

B. Baseline Model B

Figure 7-B shows a short portion of the inference performed by model B. In this graph the benefits of learning a distribution over the sequence and ordering of activities is displayed in the relatively smooth trace. The tendency of model A to jump between activities was eliminated. There are two places in the figure, however, in which the model inferred the wrong activity. In both cases the activity “Eat Breakfast” is confused with the activity “Clear the Table”. By examining the training data, it is apparent that these two activities weren’t differentiable by the objects that were used in their performance: both activities used plates, spoons and cups. Two possibilities emerged for representations in which these two activities could be disambiguated. The first was to allow the model to learn a tendency for objects to be used earlier or later in the activity; this inspired model C. The second was to allow the model to differentiate activities based on the number of objects which were touched; this capability inspired model D.

Fig. 7. Small segment of inference with various models (Model A: Independent HMMs, Model B: Connected HMM, Model C: Object-Centered HMM, Model D: Aggregate DBN). Ground truth is indicated by the thin line. Inference is indicated by the dots.

C. Object Model C

Figure 7-C shows a short portion of the inference performed by model C. The graph demonstrates how this model is able to correct for the errors of model B, but it clearly makes an inter-activity error by allowing the inference to conclude that a transition from eating breakfast to setting the table and then to clearing the table occurred. Given a sufficient amount of training data, this model will not make such an inference. However, a huge number of parameters are required to specify this model: it required 671² transition parameters to be learned, and our data simply did not contain enough examples to capture statistically significant information for every transition. Unlike model B, which captures information about transitions from activity A_i to A_j, model C represents information about transitions from (A_i, Obj_j) to (A_k, Obj_m).
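For a rough sense of scale (our arithmetic, not a figure from the paper), the fully connected transition structure of model C compares to that of model B as

```latex
\underbrace{671^2}_{\text{model C}} = 450{,}241 \quad \text{vs.} \quad \underbrace{11^2}_{\text{model B}} = 121 \ \text{transition parameters.}
```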

D. Aggregate Model D

Figure 7-D shows a short portion of the inference performed by model D. In this graph, learning a distribution over the aggregate number of object instantiations eliminates the ambiguity in Figures 7-A, B and C without requiring an inordinate number of parameters to be introduced into the model.

V. ABSTRACTION SMOOTHING

One of the concerns of all of these models is how well they will respond if someone used a functionally similar object, but one which was nonetheless unexplained by the model. For example, in our model we trained on cooking oatmeal using a cooking spoon. Our inference should not fail if the user performed the same task using a table spoon. Likewise, if the user makes tea in a cup rather than a mug, that should be a less likely, but still plausible, alternative.

Fig. 8. A portion of the object abstraction hierarchy mined from an Internet shopping site. Objects in our training data are shaded. Abstractions are not shaded.

A. Abstraction Smoothing Over Objects

In order to perform smoothing over objects we created a relational model inspired by [21]. Unlike the full power of that work, however, we used a single hierarchical object relation rather than a lattice. The hierarchy that we used was mined from an internet shopping site, and the semantics we applied to it were that objects that were close to each other in the graph were functionally similar. Figure 8 shows a portion of the object hierarchy that we used.

We weighted all edges on the graph equally and created an all-pairs functional equivalence metric according to the following formula:

P(O_i \rightarrow O_j) = \frac{\exp\left(-\mathrm{Dist}(O_i, O_j)/2\right)}{\sum_{j} \exp\left(-\mathrm{Dist}(O_i, O_j)/2\right)}    (1)

where Dist(O_i, O_j) is the distance between O_i and O_j on the graph. This says that when object O_i is expected in the model, it will be substituted by object O_j with probability P(O_i → O_j).
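As an illustration of equation (1), the following sketch computes the substitution probabilities from shortest-path distances in the abstraction hierarchy; the use of networkx, the helper name, and the toy three-object hierarchy are our own assumptions.

```python
import math
import networkx as nx   # illustrative choice of graph library

def substitution_probs(graph, objects):
    """For every expected object O_i, compute P(O_i -> O_j) over candidate
    objects O_j following equation (1): a normalized exp(-Dist/2) over
    graph distances in the abstraction hierarchy."""
    dist = dict(nx.all_pairs_shortest_path_length(graph))
    probs = {}
    for oi in objects:
        weights = {oj: math.exp(-dist[oi][oj] / 2.0) for oj in objects}
        z = sum(weights.values())
        probs[oi] = {oj: w / z for oj, w in weights.items()}
    return probs

# Tiny example: cup and mug share the parent "Glassware", so a cup is a
# plausible stand-in for a mug when making tea, while a plate is less so.
g = nx.Graph([("Glassware", "Mug"), ("Glassware", "Cup"),
              ("Tableware", "Glassware"), ("Tableware", "Plate")])
p = substitution_probs(g, ["Mug", "Cup", "Plate"])
print(p["Mug"])   # highest mass on Mug itself, then Cup, then Plate
```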

Figure 9 shows how the abstraction function is inserted graphically into model D. This changes the semantics of the model somewhat, so that the node O_i now represents the object that “should” be touched in order to complete the activity, and what is observed is O_j. The conditional probability table associated with the dependency is captured by equation 1.

B. Validating Abstraction Smoothing

To validate how well this technique worked when objects were substituted, we reran our experiments with abstraction smoothing added to model D. This resulted in an insignificant decrease in accuracy of 0.1% and -0.1 in edit distance.

Fig. 9. The O_i and O_j nodes replace the previous “OBS” node in the DBN, with P(O_i → O_j) embodied in the CPT.

Then we reran our experiments with the same data streams in which all instances of one object were replaced by instances of a functionally similar object. Table IV shows our results for one scenario.

Whereas abstraction smoothing doesn’t greatly harm normal activity recognition, it greatly increases robustness to object substitution, as Table IV demonstrates. The metrics from this table were generated by replacing a mug in all of the testing sequences with a cup. In two of the baseline models accuracy is dramatically lowered, but the abstraction model suffers a relatively modest decrease in accuracy.

TABLE IV
ACCURACY METRICS FOR WHEN AN OBJECT IN THE TEST TRACE IS REPLACED BY A FUNCTIONALLY SIMILAR OBJECT

                   Model A       Model B       Model D
                   Ind. HMMs     Single HMM    Aggregates w/ Abstract.
Mean Accuracy      52.5%         77.4%         81.2%
Net Change         -15.1%        -10.9%        -6.4%
Mean Edit Dist.    -12.7         -26.6         -1.1

VI. DISCUSSION AND CONCLUSIONS

The previous sections described the progression of increasingly sophisticated models we applied to an activity recognition problem generated by an RFID glove. We made a methodological point of proceeding in a bottom-up manner, starting with the simplest possible approach, and adding complexity only when it was necessary to improve accuracy and illuminate relevant features related to activity recognition. Such an approach makes clear how and why each feature of the system contributes to successful recognition. A second benefit is that the final system supports highly efficient inference and learning.

We saw that the popular approach of training separate HMMs (Model A) for each activity, and then performing classification by selecting the HMM that accords the current observation the highest likelihood (for example, see [14]), performs poorly once activities interact in time or when object use is shared among activities. This led us to consider a combined HMM model, where the observations still correspond to using an object of a given type. Although this greatly improves accuracy by learning transitions between activities, we saw that it could not distinguish between activities that used the same objects. Needing to distinguish among activities that use the same objects occurs often in modeling household activities — not only in kitchen activities, as in our experiments, but in many other ADLs: for example, doing laundry versus getting dressed, reading a book versus organizing a bookshelf, etc.

Modeling observations as touches of particular objects (as an RFID glove does) does not, by itself, solve the activity recognition problem, because we would then lose any ability to generalize in those cases where object identity is irrelevant. The heart of the matter is that many activities are naturally non-Markovian, i.e., they are best modeled in terms of the history of objects that were used during their execution. This led to our DBN model (D) that maintained such a history, and which allowed us to infer the aggregate feature of the number of objects used.

Differentiating between a model which allows for objects to be used at different temporal points in an activity versus a model which reasons about aggregations of objects (Model C vs Model D) allowed us to tease apart different ways in which two activities which used the same objects could be disambiguated. Our results demonstrated that both model C and model D had power to disambiguate these activities. Using aggregations of object touches, however, performed the best without requiring as much training data.

Finally, we saw that the use of smoothing over an abstraction hierarchy of object types greatly enhances the robustness of the system, while adding no significant computational burden.

In summary, these results argue that key features for modeling activities of everyday life on the basis of object interaction (and, very likely, on the basis of other kinds of direct sensor data) include a way to capture transition probabilities between activities; a way to compute aggregate features over the history of instantiated events in the activity; and a way to handle abstract classes of events. With these features alone, it is possible to perform quite fine-grained activity recognition. The activity recognition is successful even when the activities are interleaved and interrupted, when the models are automatically learned, when the activities have many objects in common, and when the user deviates from the expected way of performing an activity.

REFERENCES

[1] R. Want, “A Key to Automating Everything,” Scientific American, vol. 290, no. 1, pp. 56–65, 2003.

[2] R. Wilensky, Planning and Understanding: A Computational Approach to Human Reasoning. Reading, MA: Addison-Wesley, 1983.

[3] H. A. Kautz and J. F. Allen, “Generalized plan recognition,” in AAAI, 1986, pp. 32–37.

[4] D. Colbry, B. Peintner, and M. Pollack, “Execution Monitoring with Quantitative Temporal Bayesian Networks,” in 6th International Conf. on AI Planning and Scheduling, April 2002.

[5] T. Jebara and A. Pentland, “Action reaction learning: Automatic visual analysis and synthesis of interactive behaviour,” in ICVS ’99: Proceedings of the First International Conference on Computer Vision Systems. London, UK: Springer-Verlag, 1999, pp. 273–292.

[6] A. Mihailidis, B. Carmichael, and J. Boger, “The Use of Computer Vision in an Intelligent Environment to Support Aging-in-Place, Safety, and Independence in the Home,” IEEE Transactions on Information Technology in BioMedicine, vol. 8, no. 3, pp. 238–247, 2004.

[7] Y. Shi, Y. Huang, D. Minnen, A. Bobick, and I. Essa, “Propagation Networks for Recognition of Partially Ordered Sequential Action,” in Proceedings of IEEE CVPR04, 2004.

[8] M. Stager, P. Lukowicz, N. Perera, T. von Buren, G. Troster, and T. Starner, “SoundButton: Design of a Low Power Wearable Audio Classification System,” in Proceedings of Seventh IEEE International Symposium on Wearable Computers. IEEE, October 2003, pp. 12–17.

[9] M. Stager, P. Lukowicz, N. Perera, and G. Troster, “Implementation and Evaluation of a Low-Power Sound-Based User Activity Recognition System,” in Proceedings of Eighth IEEE International Symposium on Wearable Computers, vol. 1. IEEE, October 2004, pp. 138–141.

[10] T. Choudhury and A. Pentland, “Sensing and Modeling Human Networks Using the Sociometer,” in Proceedings of Seventh IEEE International Symposium on Wearable Computers. IEEE, October 2003, pp. 216–222.

[11] P. Fitzpatrick and C. Kemp, “Shoes as a Platform for Vision,” in Proceedings of Seventh IEEE International Symposium on Wearable Computers. IEEE, October 2003, pp. 231–234.

[12] K. Van Laerhoven and H. Gellersen, “Spine Versus Porcupine: A Study in Distributed Wearable Activity Recognition,” in Proceedings of Eighth IEEE International Symposium on Wearable Computers, vol. 1. IEEE, October 2004, pp. 142–149.

[13] N. Kern, S. Antifakos, B. Schiele, and A. Schwaninger, “A model for human interruptability: Experimental evaluation and automatic estimation from wearable sensors,” in Proceedings of Eighth IEEE International Symposium on Wearable Computers, vol. 1. IEEE, October 2004, pp. 158–165.

[14] T. Choudhury, J. Lester, and G. Borriello, “A hybrid discriminative/generative approach for modeling human activities,” in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05). Edinburgh, UK: Morgan-Kaufmann Publishers, 2005.

[15] J. L. Hill and D. E. Culler, “Mica: A wireless platform for deeply embedded networks,” IEEE Micro, vol. 22, no. 6, pp. 12–24, 2002.

[16] E. M. Tapia, S. S. Intille, and K. Larson, “Activity recognition in the home using simple and ubiquitous sensors,” in Pervasive, 2004, pp. 158–175. [Online]. Available: http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=3001&spage=158

[17] M. Philipose, K. P. Fishkin, M. Perkowitz, D. J. Patterson, D. Hahnel, D. Fox, and H. Kautz, “Inferring Activities from Interactions with Objects,” IEEE Pervasive Computing: Mobile and Ubiquitous Systems, vol. 3, no. 4, pp. 50–57, 2004.

[18] Intel Research Seattle, “Intel Research Seattle Guide Project Website,” 2003, http://seattleweb.intel-research.net/projects/guide/.

[19] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” in Readings in Speech Recognition, A. Waibel and K.-F. Lee, Eds. San Mateo, CA: Kaufmann, 1990, pp. 267–296.

[20] T. Osogami and M. Harchol-Balter, “A Closed-Form Solution for Mapping General Distributions to Minimal PH Distributions,” in Computer Performance Evaluation / TOOLS, ser. Lecture Notes in Computer Science, P. Kemper and W. H. Sanders, Eds., vol. 2794. Springer, 2003, pp. 200–217.

[21] C. Anderson, P. Domingos, and D. Weld, “Relational markov models and their application to adaptive web navigation,” in Proc. of the Eighth International Conf. on Knowledge Discovery and Data Mining. ACM Press, 2002, pp. 143–152, Edmonton, Canada.