

ASSESSING OBSERVER ACCURACY IN CONTINUOUS RECORDING OF RATE AND DURATION:

THREE ALGORITHMS COMPARED

OLIVER C. MUDFORD

UNIVERSITY OF AUCKLAND, NEW ZEALAND

NEIL T. MARTIN

TREEHOUSE TRUST, LONDON

AND

JASMINE K. Y. HUI AND SARAH ANN TAYLOR

UNIVERSITY OF AUCKLAND, NEW ZEALAND

The three algorithms most frequently selected by behavior-analytic researchers to compute interobserver agreement with continuous recording were used to assess the accuracy of data recorded from video samples on handheld computers by 12 observers. Rate and duration of responding were recorded for three samples each. Data files were compared with criterion records to determine observer accuracy. Block-by-block and exact agreement algorithms were susceptible to inflated agreement and accuracy estimates at lower rates and durations. The exact agreement method appeared to be overly stringent for recording responding at higher rates (23.5 responses per minute) and for higher relative duration (72% of session). Time-window analysis appeared to inflate accuracy assessment at relatively high but not at low response rate and duration (4.8 responses per minute and 8% of session, respectively).

DESCRIPTORS: continuous recording, interobserver agreement, observer accuracy, observational data, recording and measurement

_______________________________________________________________________________

A review of the use of observational recording across 168 research articles (1995–2005) from the Journal of Applied Behavior Analysis found that 55% of research articles that presented data on free-operant human behavior reported using continuous recording (Mudford, Taylor, & Martin, 2009). Observers collected data on portable laptop or handheld computers in the majority (95%) of those applications. Most data were obtained for discrete behaviors, with 95% of articles reporting rate of occurrence of responding, although duration measures were reported in 36% of articles reviewed.

Despite the ubiquity of continuous recording, there has been little research investigating this type of behavioral measurement. Three empirical studies have investigated variables affecting interobserver agreement and observer accuracy with continuous recording. Interobserver agreement concerns the extent to which observers’ records agree with one another. By contrast, observer accuracy concerns agreement between observers’ records and criterion records and has been typically quantified using interobserver-agreement computational methods (Boykin & Nelson, 1981).

Jasmine Hui is now employed at Rescare Homes Trust, Manukau City, New Zealand. Sarah Taylor is now employed by Odyssey House, Auckland, New Zealand.

Data from observers using Psion handheld computers were presented at the third international conference of the Association for Behavior Analysis, Beijing, China, November 2005. Data from iPAQ users were from the third author’s master’s thesis research. Contributions to the funding of this research were obtained from the University of Auckland Research Committee (first author) and the Faculty of Science Summer Scholarship Programme (fourth author). Thanks for assistance to Philippa Hall.

Address correspondence to Oliver C. Mudford, Applied Behaviour Analysis Programme, Department of Psychology, University of Auckland (Tamaki Campus), PB 92-019, Auckland, New Zealand (e-mail: o.mudford@auckland.ac.nz).

doi: 10.1901/jaba.2009.42-527

JOURNAL OF APPLIED BEHAVIOR ANALYSIS 2009, 42, 527–539 NUMBER 3 (FALL 2009)


Van Acker, Grant, and Getty (1991) found that observers were more accurate when videotaped observational materials included more predictable and more frequent occurrences of responses to be recorded than when responses occurred less often or were less predictable. Kapust and Nelson (1984), to the contrary, found that observers were more accurate at recording low-rate responding than when observing high-rate behavior. Agreement between observers generally exceeded accuracy when compared with criterion records of the behavior. A third study (Fradenburg, Harrison, & Baer, 1995) reported that agreement between observers using continuous recording was affected by whether the observed individual was alone (75.1% mean agreement) or with peers (87.2% agreement).

Mudford et al. (2009) identified three interobserver agreement algorithms used more than 10 times in the 93 articles that reported continuously recorded data: exact agreement (all intervals version; Repp, Deitz, Boles, Deitz, & Repp, 1976), block-by-block agreement (Page & Iwata, 1986; also known as mean count-per-interval agreement in Cooper, Heron, & Heward, 2007), and time-window analysis (MacLean, Tapp, & Johnson, 1985; Tapp & Wehby, 2000). Each has been applied with observational records of discrete behavioral events and for behaviors measured with duration.

Exact agreement has been subject to considerable theoretical and empirical research relevant to its use with discontinuous recording (e.g., interval recording; Bijou, Peterson, & Ault, 1968; Hawkins & Dotson, 1975; Hopkins & Hermann, 1977; Repp et al., 1976). Block-by-block agreement is a derivative of exact agreement (Page & Iwata, 1986). The relative strengths, weaknesses, and biases of these two methods have been identified for discontinuous recording. However, exact agreement and block-by-block agreement have yet to be empirically investigated concerning their suitability for continuous recording. Time-window analysis has no direct analogue in discontinuous recording, because it was developed specifically for continuous recording (MacLean et al., 1985). There have been anecdotal recommendations from users of time-window analysis (MacLean et al.; Repp, Harman, Felce, Van Acker, & Karsh, 1989; Tapp & Wehby, 2000), but there are no empirical studies to guide researchers on interpreting agreement or accuracy indexes produced by time-window analysis.

Because there has been no published research to date comparing interobserver agreement algorithms for assessing the accuracy of continuously recorded behaviors, the purpose of the present study was to provide preliminary illustrative data on the relative performance of the three commonly used computational procedures: exact agreement, block-by-block agreement, and time-window analysis. The algorithms were applied to observers’ recordings of behaviors defined as discrete responses and for behaviors recorded as a duration measure.

METHOD

Participants and Setting

Twelve observers participated voluntarily, having replied to an invitation to acquire experience with continuous recording. Their ages ranged from 20 to 45 years, and 1 was male. Nine had completed (and 1 was studying for) master’s degrees in behavior analysis. The other 2 were advanced undergraduate students who received training to assist with data collection during functional analyses conducted in a research project. No participants had any previous experience with computer-assisted continuous recording, although they had experience with discontinuous methods (partial- and whole-interval recording and momentary time sampling). None had previous experience recording the particular behaviors defined for the present study.

The study was conducted in a quiet university seminar room (measuring approximately 6 m by 5 m). Depending on scheduling, 1 to 3 participants at a time recorded observations from six video recordings projected onto a white wall of the room. If more than one observer was present, they were seated approximately 1.5 m apart so they were unable to see another observer’s recording responses.

Recording Apparatus and Software

Observers recorded defined behaviors using one of two types of handheld computer, either a Psion Organiser II LZ64 through its alphanumeric keypad or a Hewlett Packard iPAQ Pocket PC rx1950 with a touch-screen wand. The raw data were downloaded to a PC for analysis by ObsWin 3.2 software (Martin, Oliver, & Hall, 2003), which had been supplemented with the three agreement algorithms studied.

Observational Materials

Six video recordings were of 5-min (300-s) brief functional analysis sessions (as described by Northup et al., 1991). There was a different client–behavior combination in each sample. The first 50 s of each sample was shown to allow the observers to identify the client and therapists in the video, the behavior of interest, and the setting in the particular video sample. Each video sample was continuous (i.e., unedited) except that there was a 3-s countdown and start indicator superimposed to prompt observers to start recording for the remaining 250 s of the sample.

A criterion record had been created for the 250-s samples using a procedure similar to that of Boykin and Nelson (1981). Two graduate research assistants independently examined the video samples to determine the second in which events (i.e., discrete behaviors) occurred or the second in which the onsets and offsets of behaviors with duration occurred. Repeated viewings, slow motion, and frame-by-frame playback were used to measure the samples consistently. Records were compared and differences resolved, except for seven discrepancies concerning occurrence (3% of 233 criterion record events or onsets) that were settled by agreement with a third observer (the first author). The criterion records were used to quantify the relevant dimension of the recorded behaviors in the video samples. We selected the video samples to provide a range of people, behaviors, rates (expressed as responses per minute), and relative durations (expressed as the percentage of each session in which the behavior was observed). Figure 1 shows the six 250-s criterion records, with a vertical line for each target response for event recording or for each second of occurrence of bouts of responding in samples used for duration recording.

Definitions of Behaviors and Their Dimensions

Discrete responses recorded were spitting, hand to mouth, and vocalizing. Spitting was defined as an audible “puh” sound. The overall criterion record rate of spitting (Figure 1, first panel) was 4.8 responses per minute. Hand to mouth was defined as the start of new contact between hand or wrist and the triangular area defined by the tip of the nose and imaginary lines drawn from there to the jawline through the corners of the mouth. The criterion rate for this behavior was 11.3 responses per minute (Figure 1, second panel). Vocalizing was the sound of any word (e.g., “car”) or speech-like phrase (e.g., “ga-ga”), and the rate was 23.5 responses per minute (Figure 1, third panel).

Duration behaviors recorded were card holding, therapist attention, and body rocking. Card holding was defined as visible contact between the hand and a greeting card. The criterion record showed that card holding occurred for 8% of the session. Therapist attention was verbal and addressed to the client (e.g., “Come and sit down,” “Do this one,” “later”), and the onset was defined as occurring in the 1st second that the therapist spoke a new sentence after a 1-s break in talking. Offset of attention was defined as the 1st second without any therapist vocalization. Therapist attention occurred for 44.4% of the observation. Body rocking was defined as repetitive rhythmic motion of the torso back and forth, with onset defined as the movement entering its second cycle and offset defined as rhythm disrupted for 1 s, and occurred during 72% of the observation. Panels 4, 5, and 6 in Figure 1 show criterion records for card holding, therapist attention, and body rocking, respectively.

Procedure

Video samples were recorded in the same order by all observers: lowest to highest rates (i.e., spit, hand to mouth, vocalizing), followed by lowest to highest durations (i.e., card holding, therapist attention, rocking). A 5-min break was scheduled between each observational sample, and all observations occurred consecutively in one session (of approximately 60 min in duration).

Written definitions of behaviors to be recorded and a handheld computer were provided to participants. One response topography only was recorded with each video sample, although multiple (nontargeted) behaviors occurred during each sample. Observers were told which behavior was to be recorded before viewing each sample. The instructions were to watch the video sample and, when the start indicator showed on the projection, either to press the code (Z) assigned to record the start of an observation session in the Psion or to touch the wand to “start” on the iPAQ screen. After that, observers were to press the key on the keyboard for the code assigned to the behavior being measured in that sample (Psion) or touch the name of the behavior on screen (iPAQ). If the behavior was defined as a discrete response, they were instructed to press the key or touch the name of the behavior each time they observed a target response. When observers were measuring duration, they were to press the code key (or touch the name) for its onset and again for its offset. The screen on the computers showed whether the behavior was recorded.

At the end of each observation, the projection went blank. At any time after this, observers entered the code for the end of the session (“/” for Psions, “End” for iPAQ). When they did this was not important, because the observational records were restricted to 250 s from the entry of Z or “start.” No further instructions were provided to the observers. Thereafter, the 5-min break occurred, which was followed by the next sample presentation (or the termination of all data collection).

Figure 1. Criterion records for observational samples. Events are shown as a vertical line in each second in which they occurred in the rate samples. Seconds of occurrence of bouts of responding are shown as vertical bars for duration samples.

Analyses

Accuracy was computed as agreement between each observer’s recorded files and the corresponding criterion records by the exact agreement, block-by-block agreement, and time-window analysis algorithms.

Block-by-block and exact agreement algorithms are similar in that the first step was to impose 10-s intervals on the second-by-second data files to be compared. When computing accuracy, one file was from the observer, and the other file was the criterion record. The level of analysis for discrete data was the number of responses recorded in a given 10-s interval. For duration measures, the number of seconds within a 10-s interval that the response was recorded as occurring was counted for each data file.

The exact agreement algorithm proceeded as follows: First, intervals for which both the observer’s and the criterion record agreed on the exact number of responses (or seconds for duration measures), including when both showed nonoccurrence, were counted as agreements. Second, when both data files showed ≥1 s with occurrence in an interval, but there was not exact agreement on how many responses (frequency) or how many seconds (duration), the intervals were counted as disagreements. Third, the typical percentage agreement formula was applied (i.e., agreements divided by agreements plus disagreements, multiplied by 100%).
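
The computation is mechanical enough to express directly in code. Below is a minimal Python sketch of the exact agreement calculation as just described; the function names and the list-of-integers data representation (one entry per second: a response count for event data, 0 or 1 for duration data) are our own illustrative choices, not part of ObsWin or the original analysis.

```python
from typing import List

def bin_into_intervals(per_second: List[int], interval_len: int = 10) -> List[int]:
    """Sum a second-by-second record into consecutive fixed-length intervals.

    Entries are response counts (event data) or 0/1 occupancy (duration
    data), so interval totals are response counts or seconds of occurrence."""
    return [sum(per_second[i:i + interval_len])
            for i in range(0, len(per_second), interval_len)]

def exact_agreement(observer: List[int], criterion: List[int],
                    interval_len: int = 10) -> float:
    """Percentage of intervals whose totals match exactly,
    counting matching zeros as agreements on nonoccurrence."""
    obs = bin_into_intervals(observer, interval_len)
    crit = bin_into_intervals(criterion, interval_len)
    agreements = sum(1 for o, c in zip(obs, crit) if o == c)
    return 100.0 * agreements / len(obs)
```

For the 250-s samples used here, each record is a 250-element list that yields 25 intervals.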

In the block-by-block agreement method, the smaller (S) of the two data files’ totals in a specific 10-s interval was divided by the larger (L) data file total within that interval to yield a score between 0 and 1 for every interval. When both data files showed no occurrences during an interval, the intervals were scored as 1 (i.e., agreement on nonoccurrence). After each 10-s interval was scored in the manner described above, the scores from each 10-s interval were summed across all 25 10-s intervals, divided by the number of intervals, and the result was multiplied by 100% to provide a percentage agreement index.
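
A corresponding sketch of the block-by-block (mean count-per-interval) computation, under the same assumed second-by-second data representation; again, this is illustrative rather than a reproduction of the software used in the study.

```python
from typing import List

def block_by_block(observer: List[int], criterion: List[int],
                   interval_len: int = 10) -> float:
    """Mean count-per-interval agreement: each interval contributes
    smaller/larger of the two totals, or 1.0 when both totals are zero."""
    def bin_(xs: List[int]) -> List[int]:
        return [sum(xs[i:i + interval_len])
                for i in range(0, len(xs), interval_len)]
    scores = []
    for o, c in zip(bin_(observer), bin_(criterion)):
        if o == 0 and c == 0:
            scores.append(1.0)                    # agreement on nonoccurrence
        else:
            scores.append(min(o, c) / max(o, c))  # partial credit, 0 to 1
    return 100.0 * sum(scores) / len(scores)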

The time-window analysis method involved second-by-second comparisons across the two data files. An agreement was scored when both records contained a recorded response (for discrete behaviors) or 1 s of ongoing occurrence (for behaviors measured with duration). Any second in which only one record contained an event or occurrence of responding was scored as a disagreement. Percentage agreement was calculated by dividing the number of agreements by the total number of agreements plus disagreements and multiplying by 100%. The time-window method has been described as overly stringent for data on discrete responses (MacLean et al., 1985). Consequently, the algorithm allows tolerance for counting agreements by expanding the definition of an agreement to include observations within ±t seconds in the two data files. Thus, rather than requiring that two records must have a response (or second of occurrence) at the exact same point in time, the time-window method permits an agreement to be scored if the records contain a response within a prespecified tolerance interval (t) that is defined by the user. It should be noted that a recorded response (or second of occurrence) in one record can be counted as an agreement with a similar event in the other record once only. The tolerance interval for the present analysis was determined empirically by comparing the effects of varying t from 0 to 5 s on measures of observer accuracy (described below).
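
The tolerance logic is the subtle part of this algorithm, particularly the rule that each event may be counted as an agreement only once. The sketch below implements one plausible reading of the published description, using a simple greedy pairing over event times expressed in whole seconds; for duration measures, each second of recorded occurrence would be treated as an event time. This is not the ObsWin implementation.

```python
from typing import List

def time_window_accuracy(observer: List[int], criterion: List[int],
                         tolerance: int = 2) -> float:
    """Time-window agreement between two lists of event times (s).

    An observer event within +/- tolerance seconds of a criterion event
    is an agreement; each criterion event may be matched only once.
    Unmatched events in either record are disagreements."""
    unmatched = sorted(criterion)
    matches = 0
    for t in sorted(observer):
        for i, c in enumerate(unmatched):
            if abs(t - c) <= tolerance:
                matches += 1
                del unmatched[i]   # consume the matched criterion event
                break
    disagreements = (len(observer) - matches) + len(unmatched)
    # Note: two empty records give 0/0, the incalculable case discussed
    # later in the article.
    return 100.0 * matches / (matches + disagreements)
```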

RESULTS

Accuracy with Different Recording Devices

Observer accuracy, computed as agreement of observers’ records of behavior with criterion records, was examined for observers’ recording on the different handheld computers (Psion and iPAQ) separately. The results showed that accuracy varied minimally and not systematically between the devices (graphs that show lack of effects are available from the first author). Thus, data from both types of handheld computer were pooled in all subsequent analyses.

Effects of Varying Tolerance with Time-Window Analysis

Data on observer accuracy computed using time-window analysis are presented first because there has been no previous empirical research to guide selection of an appropriate tolerance (i.e., time window) for agreement or accuracy. Our review indicated that tolerance interval t has varied from 1 s to 5 s (Mudford et al., 2009); thus, accuracy with tolerance (t) from 0 to ±5 s was computed with the present data sets. Figure 2 shows mean accuracy across tolerances for agreement with a time window ranging from t = 0 s to t = ±5 s.
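
Computationally, this comparison amounts to re-running the agreement calculation at each candidate tolerance and averaging over observers. A sketch of that sweep, reusing the illustrative time_window_accuracy function from the earlier sketch (observer_event_lists and criterion_events are placeholder names for the recorded data):

```python
# Mean accuracy at each tolerance from 0 s to +/-5 s.
for t in range(0, 6):
    scores = [time_window_accuracy(events, criterion_events, tolerance=t)
              for events in observer_event_lists]
    print(f"t = +/-{t} s: mean accuracy {sum(scores) / len(scores):.1f}%")
```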

Accuracy with recording of events (i.e., discrete responses recorded without duration) increased markedly from 22.6% at zero tolerance to 77.8% at ±2 s. At tolerances above ±2 s to the maximum time window (±5 s), accuracy increased by only another 4.1%. Accuracy in recording behaviors with duration increased as tolerance was increased; however, gains were less evident than those observed for event recording. Accuracy increased by 12.1% from zero tolerance to 85.6% at t = ±2 s.

Based on the accuracy data, a time window of ±2 s can be recommended for two reasons: (a) Increases in accuracy were shown to be relatively small above this level and may only be capitalizing on chance agreement that increases as windows for agreement widen from ±2 s to ±5 s; and (b) results from additional analyses of interobserver agreement with the same data sets showed that levels of accuracy and agreement converged at ±2 s tolerance (results from those analyses are available from the first author).

Figure 2. Mean percentage accuracy computed by the time-window analysis algorithm for behaviors measured as events and behaviors with duration at tolerances for agreement from 0 to ±5 s.


Thus, a ±2 s tolerance interval was used as the basis for observer accuracy with the time-window algorithm in the current analysis.

Accuracy Results across Computational Algorithms

Recordings of response rate. We conducted analyses similar to those of Repp et al. (1976) by examining the relation between variations in response rate and variations in percentage accuracy across computational algorithms. Rates were arbitrarily labeled low (4.8 responses per minute), medium (11.3 responses per minute), and high (23.5 responses per minute). Figure 3 (top) presents percentage accuracy computed with exact agreement, block-by-block agreement, and time-window analysis (t = ±2 s) at these different rates. Percentage accuracy measured by exact agreement decreased stepwise from 78.3% to 50.3% as response rate increased from low to high. With block-by-block analysis, low- and high-rate behaviors were recorded with similar levels of accuracy (85.3% and 88%), with the medium-rate behavior being recorded less accurately at 76.8%. Finally, the time-window analysis produced accuracy levels of 68.9%, 67.7%, and 96.8% for the low-, medium-, and high-rate samples, respectively.

Explanation of the reduction in accuracy with increasing response rate when computed by exact agreement can be advanced by consideration of the top three panels in Figure 1. Gaps between vertical lines indicating periods of nonoccurrence are visible. Exact agreement divides the data stream for each sample shown in Figure 1 into 25 10-s intervals, as does the block-by-block agreement algorithm. Further analysis found that the low-, medium-, and high-rate samples contained 16, 12, and 0 intervals with nonoccurrence, respectively. Thus, it was probable that the number of agreements on nonoccurrence decreased with both exact agreement and block-by-block agreement as response rate increased. Conversely, the number of intervals with occurrences, including intervals with multiple occurrences, increased as response rate increased. For example, in the high-rate sample (Figure 1, third panel), every 10-s interval contained two to five events in the criterion record. Consider an interval in which an observer records four events and the criterion record for that interval shows five events. That is scored as an interval of disagreement in exact agreement but as .8 of an agreement with block-by-block agreement. Thus, these data illustrate that it is easier for observers to obtain higher accuracy indexes at lower rates than with higher rates when the exact agreement method is used. Allowance for a partial agreement (e.g., the example of .8 agreement) in the block-by-block agreement algorithm mitigates the effect with high-rate responding.
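
The worked example in the preceding paragraph reduces to a few lines, using the interval-scoring rules from the sketches above:

```python
obs_count, crit_count = 4, 5        # one 10-s interval's totals
exact_score = float(obs_count == crit_count)                 # 0.0: a disagreement
partial_score = min(obs_count, crit_count) / max(obs_count, crit_count)   # 0.8
```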

In time-window analysis, because the algorithm ignores nonoccurrence (unlike exact agreement and block-by-block agreement), low rates of responding in the criterion measure cannot inflate accuracy by increasing agreement on nonoccurrence (Hawkins & Dotson, 1975; Hopkins & Hermann, 1977). However, the question arises as to whether the relatively high level of mean accuracy (96.8%) in the high-rate sample (Figure 3, top) indicates that time-window analysis may exhibit rate-dependent inflationary effects compared to the other two methods. Chance agreement is the proportion of obtained agreement that could result if an observer’s recording responses were randomly distributed through an observational session (Hopkins & Hermann). The mean interresponse time in the criterion record for the high-rate sample was 2.55 s. With ±2 s tolerance for locating an agreement event in an observer’s or criterion record, a window of up to 5 s is opened, so each window included a mean of 1.96 responses. Therefore, obtained agreement indexes with time-window analysis for relatively high-rate behavior (e.g., 23.5 responses per minute) would likely be influenced by chance agreement.

Recordings of response duration. The low, medium, and high durations of behavior occurred in 8%, 44.4%, and 72% of the session in the criterion records. Figure 3 (bottom) presents percentage accuracy computed with exact agreement, block-by-block agreement, and time-window analysis (t = ±2 s) for these different durations. Exact agreement produced lower percentage accuracy than the other algorithms regardless of response duration. Accuracy computed using block-by-block agreement was similar for low and high durations (91.8% and 92.4%) but was lower for medium durations. A similar effect was observed for the time-window analysis (87.4% and 96.1% accuracy for the low- and high-duration samples, respectively). The difference between exact agreement and block-by-block agreement was most noticeable for the medium-duration behavior, with 25.8% accuracy assessed by exact agreement, which was 39.4% less than for block-by-block agreement. The difference was least marked for the low-duration behavior at 85% accuracy measured by exact agreement, compared to 91.8% from block-by-block agreement.

Figure 3. Mean percentage accuracy computed by exact agreement, block-by-block agreement, and time-window analysis with tolerance of ±2 s for low-rate (4.8 responses per minute), medium-rate (11.3 responses per minute), and high-rate (23.5 responses per minute) behaviors (top) and for low-duration (8%), medium-duration (44.4%), and high-duration (72%) behaviors (bottom).

Observers’ accuracy at recording the medium-duration behavior was the lowest by all measures of accuracy, and it shows the most pronounced effect of using different algorithms (see solid bars, Figure 3, bottom). As shown in Figure 1 (Panel 5), the criterion record shows 26 onsets of the recorded behavior and many short-duration episodes (e.g., less than 10 s). Therefore, to match the criterion record precisely, an observer would have entered 52 key presses at an overall rate of 12.5 responses per minute. This can be compared with the low- and high-duration behaviors recorded, which had only three and six onsets (see Figure 1, Panels 4 and 6, respectively). Thus, the complexity of the observers’ task in recording the medium-duration behavior may be the underlying reason for lower accuracy. That is, the relatively high rate of onsets and offsets of the behavior in the medium-duration observational sample might have negatively affected observers’ accuracy on their recording of duration across all measures of accuracy.

Explanation for the differences in accuracy across algorithms with the medium-duration sample can be used to demonstrate additional characteristics of the algorithms. To illustrate, the durations of individual occurrences of the behavior (i.e., the width of the bars in Figure 1, Panel 5) were computed. Among the 26 occurrences, 21 were less than 10 s in duration, of which 18 were less than 5 s in duration. Superimposition of 10-s comparison intervals for exact and block-by-block agreement on the criterion record produced four intervals of nonoccurrence and two with 10-s occurrence (i.e., the behavior continued throughout the entire interval). Therefore, 19 of 25 intervals contained between 1 s and 9 s of occurrence. With only four intervals of nonoccurrence, obtaining high agreement using the exact agreement method relies almost exclusively on observers’ records agreeing with the criterion record on occurrence exactly, to the second. Considering that the distribution of bout lengths was skewed towards durations of less than 5 s, we view the likelihood of observers obtaining exact agreement accuracy at levels approaching those of block-by-block agreement as low, even for trained observers. As with the high-rate sample, the medium-duration sample produced higher accuracy for block-by-block agreement than for exact agreement. The reason for this difference is that during block-by-block agreement, observers can achieve partial agreement on the occurrence of a response within each interval, which may compensate for inexact agreement.
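
This descriptive analysis of the criterion record (bout lengths and seconds of occurrence per imposed interval) is straightforward to reproduce. A minimal sketch, assuming bouts are represented as (onset, offset) second pairs with the offset exclusive; the names are ours:

```python
from typing import List, Tuple

def bout_lengths(bouts: List[Tuple[int, int]]) -> List[int]:
    """Duration in seconds of each (onset, offset) bout."""
    return [off - on for on, off in bouts]

def interval_occupancy(bouts: List[Tuple[int, int]],
                       session_len: int = 250,
                       interval_len: int = 10) -> List[int]:
    """Seconds of occurrence within each imposed fixed interval."""
    occupied = [0] * session_len
    for on, off in bouts:
        for s in range(on, off):
            occupied[s] = 1           # mark each second of the bout
    return [sum(occupied[i:i + interval_len])
            for i in range(0, session_len, interval_len)]
```

Counting how many entries of interval_occupancy equal 0, equal 10, or fall in between reproduces the 4 / 2 / 19 interval breakdown described above, and tallying bout_lengths below 5 s and 10 s reproduces the 18 and 21 figures.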

DISCUSSION

In summary, three commonly used methods for computing interobserver agreement for continuously recorded data were applied to data recorded by observers who had no previous experience with continuous computer-based recording. Rate of responding was measured in three video samples, and duration of responding was measured in another three video samples. First, we analyzed accuracy with the time-window analysis algorithm. A tolerance of ±2 s for agreement with criterion records on occurrences of behaviors was allowed in subsequent analyses. Second, the effects of the exact agreement, block-by-block agreement, and time-window analysis on observers’ accuracy when recording high, medium, and low rates and durations were investigated. Substantial differences in percentage accuracy were related to the characteristics of the algorithm and observational samples.

The effects of changing tolerance for agreement in the time-window analysis method were originally illustrated with hypothetical data by MacLean et al. (1985), who recommended that a tolerance of ±2 s might be allowed by a “conservative user” (p. 69) when discrete behaviors were recorded. Others have reported that they employed ±2 s tolerance with discrete event recording (e.g., Repp et al., 1989). Results of the current analysis support these previous findings, at least for the observation materials and procedures employed herein.

The developers and users of time-window analysis have not recommended investigating tolerance with duration measures (e.g., MacLean et al., 1985; Repp et al., 1989; Tapp & Wehby, 2000). The current data suggest that the same level of tolerance (±2 s) can be justified when duration is the measure of interest. A standard tolerance (e.g., ±2 s) for reporting accuracy and agreement for recording events or durations may assist in the interpretation of data quality assessed using the time-window analysis algorithm. The unnecessary alternative, with multiple different tolerances for different measures within the same data set, would likely complicate data analysis.

Page and Iwata (1986) discussed the relative merits and disadvantages of exact agreement and block-by-block agreement with simulated interval data, specifically events within intervals. Imposing 10-s intervals on continuous data streams replicates that type of behavioral record. Page and Iwata showed how excessively stringent the exact agreement method can be, with 5% agreement, in comparison with the block-by-block agreement method, which produced 54% agreement with the same manufactured data. In the present study using video samples from analogue functional analyses, the difference in level of agreement was apparent (Figure 3). At the extreme, exact agreement produced accuracy measures close to 40% (37.7% and 39.4%) below those of block-by-block agreement for the high-rate behavior and the medium-duration behavior.

It should be noted that, logically, there can be no possible pair of observers’ records for which agreement measured by exact agreement exceeds that measured by block-by-block agreement. Unless observers agree 100% on numbers of occurrences (or number of seconds of occurrence) within all intervals, the exact agreement algorithm will always result in lower apparent agreement than block-by-block agreement, showing that exact agreement is a more stringent measure.

An additional question concerns the extent to which the different methods assess accuracy (or agreement) on particular instances of behavior. Block-by-block and exact agreement present difficulties that result from slicing the data stream into fixed (usually 10-s) intervals. Among other possibilities, two observers’ responses within an imposed interval could be up to 9 s apart and count as an agreement or, at the other extreme, could be 1 s apart and count as two disagreements (one in each interval on either side of a cut). These possibilities can occur regardless of the size of the intervals (e.g., if the second-by-second data streams are divided into 5-s or 20-s intervals instead of the typical 10-s intervals). In time-window analysis the continuous record is not divided into intervals before analysis. Windows of opportunity for agreement are opened around an event or ongoing occurrence in observers’ or criterion records, with the size of the window being twice the tolerance plus 1 s (e.g., a 5-s window for a 2-s tolerance). Therefore, time-window analysis, with its moving interval for agreement, may be superior in addressing agreement on particular instances of recorded behaviors.

Bijou et al. (1968) pointed out a potential problem with an interval-by-interval agreement method when the large majority of intervals of observation contain either occurrences or nonoccurrences of behavior. Specifically, under such circumstances estimates of percentage agreement are inflated. Hopkins and Hermann (1977) graphically illustrated the relation between rate of behavior and interval-by-interval agreement. For instance, if both observers recorded nonoccurrence in 90% of intervals, the minimum possible level of agreement by the interval-by-interval method is 80%, even if they never agreed on a single interval containing an occurrence of the behavior. Agreement reaches the conventional criterion of 80% required to indicate that recording was sufficiently reliable (e.g., Cooper et al., 2007) but, clearly, users of such data cannot rely on them (see Hawkins & Dotson, 1975). These are examples of the problem of chance agreement and may occur with time-window analysis when behavior is of a sufficiently high rate or duration. Exact agreement and block-by-block agreement also are both susceptible to increasing agreement artifactually when behavior occurs in proportionally few intervals.
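
The arithmetic behind the Hopkins and Hermann example can be made explicit (a worked restatement, ours rather than theirs). Suppose each observer records occurrences in proportion p of the N intervals and their occurrence intervals never coincide. Both records then still show nonoccurrence in at least N(1 − 2p) intervals, every one of which counts as an agreement:

\[
\text{minimum agreement} = \frac{N(1 - 2p)}{N} \times 100\% = (1 - 2p) \times 100\%.
\]

With nonoccurrence in 90% of each observer's intervals (p = 0.10), the floor is (1 − 0.20) × 100% = 80%, contributed entirely by agreements on nonoccurrence.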

Overall, the time-window analysis method with ±2 s tolerance was found to be more lenient than the exact agreement method. However, observers’ recording of high-rate (23.5 responses per minute) and high-duration (72% of session) behavioral streams produced accuracy indexes from time-window analysis that, at greater than 96%, were higher than those from exact agreement and block-by-block agreement (Figure 3). Such high percentage accuracy suggests that time-window analysis may provide inflated estimates at high relative durations just as it can for high-rate responding. Hopkins and Hermann (1977) presented formulas for calculating chance agreement with discontinuous records. These formulas require the number of observed intervals to be entered into the denominator. The absence of imposed and fixed intervals for computing agreement by time-window analysis with moving intervals with tolerance (±t s) prohibits counting of intervals in the conventional sense. Thus, chance agreement cannot be computed by conventional methods. Nevertheless, researchers should take caution from the data presented to be aware of artifactually inflated indexes of accuracy from time-window analysis when mean interresponse times are short compared to the duration of the window for agreement defined by tolerance.
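
Although closed-form chance-agreement formulas do not transfer, a randomization estimate is one possible workaround; this is our suggestion, not a procedure from the article or its sources. The sketch reuses the illustrative time_window_accuracy function defined earlier:

```python
import random

def chance_agreement_estimate(n_events: int, session_len: int = 250,
                              tolerance: int = 2, trials: int = 1000) -> float:
    """Estimate chance time-window agreement by scoring pairs of
    randomly placed records (n_events distinct seconds each)."""
    def random_record() -> list:
        return sorted(random.sample(range(session_len), n_events))
    total = 0.0
    for _ in range(trials):
        total += time_window_accuracy(random_record(), random_record(),
                                      tolerance=tolerance)
    return total / trials
```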

Further research on time-window analysis is required. Although the method is resistant to inflated levels of agreement or accuracy at low rates and durations due to chance agreement, it is possible that it might underestimate agreement at those levels of behavior. For accuracy, the time-window method allows decreasing room for errors as response rates decline. For instance, if the criterion record shows one response in 250 s but the observer does not record it, accuracy is zero with time-window analysis (cf. 96% with exact agreement and block-by-block agreement). Another difficulty with time-window analysis is that zero rates in observers’ and criterion records produce an incalculable result (zero divided by zero). The same problem has been overcome by users of block-by-block agreement by defining a 10-s interval with 10 s of agreement on nonoccurrence as being complete agreement and scoring it as 1.0 in their computations, despite the smaller to larger fraction being zero to zero. Applying the same rule in the time-window analysis algorithm would award 100% agreement for an observational session with no responding recorded by observers or in a criterion record, the same as with exact agreement and block-by-block agreement algorithms (Mudford et al., 2009).

The present research employed video samples with uncontrolled, naturally occurring distributions of responses. Samples were selected for their overall rates and durations and to provide a variety of real clients and response definitions to broaden the pedagogical experience for participating observers. Interactions among response rates, distributions, and durations of behaviors that complicated interpretation of some of the current findings may well be a feature of naturally occurring behaviors. Considering the distributions of responding in Figure 1, every panel (with the possible exception of the third panel) exhibits features that suggest that describing these samples merely by their overall rate or duration is an oversimplification. In the top panel, responding was restricted to the first part of the sample. Other panels show some bursts (or bouts) of responding interspersed with relatively long interresponse times. Failure to control the content of samples or to randomize order of presentation may be viewed as limitations of the study. However, the aim was not to compare accuracy across behavioral topographies, clients, rate or duration measures, or observers as they acquired experience. The purpose was to illustrate differences in percentage accuracy depending on the algorithms employed to compute it.

Considerations for researchers selecting methods to compute the accuracy of continuous recording can be derived from the cited studies with discontinuous recording and from the current results. The three methods investigated in the current study all present imperfections if applied across a wide range of rates and durations, as may be experienced in studies that demonstrate reduction of behaviors from high to low (or zero) rates and durations (or vice versa when increasing behaviors). Exact agreement and time-window analysis may be considered unsuitable at high rates and durations for opposite reasons, with the former being excessively conservative and the latter being too liberal. Exact agreement and block-by-block agreement may produce inflated indexes of accuracy at low rates and durations.

The generality of these findings regarding measurements of discrete events (i.e., rates in responses per minute) may be restricted by the rates observed (4.8 to 23.5 responses per minute). The range of investigated rates has been lower in previous studies of interobserver agreement (e.g., 1.4 to 6.6 responses per minute, Repp et al., 1976; 1.1 to 3.0 responses per minute, Kapust & Nelson, 1984; 0.3 to 1.2 responses per minute, Van Acker et al., 1991). If the lower ranges are more typical of data reported in behavioral research, further studies should investigate the effects on agreement and accuracy depending on algorithms used at these lower rates.

The current study may stimulate further efforts to determine appropriate measures of accuracy and agreement for continuous data. Manipulation of observational materials may be required to elucidate the effects of different algorithms more systematically than was possible with the video recordings of free-operant behaviors used in our study. The limitations of the three algorithms suggest that other methods, not in common use, should be investigated as well. Further research with real behavioral streams, theoretical analysis (e.g., Hopkins & Hermann, 1977), scripted behavioral interactions (e.g., Van Acker et al., 1991), and studies with manufactured behaviors (e.g., Powell, Martindale, Kulp, Martindale, & Bauman, 1977) are likely to further influence behavior analysts’ choice of appropriate agreement and accuracy algorithms.

REFERENCES

Bijou, S. W., Peterson, R. F., & Ault, M. H. (1968). A method to integrate descriptive and experimental field studies at the level of data and empirical concepts. Journal of Applied Behavior Analysis, 1, 175–191.

Boykin, R. A., & Nelson, R. O. (1981). The effect of instructions and calculation procedures on observers’ accuracy, agreement, and calculation correctness. Journal of Applied Behavior Analysis, 14, 479–489.

Cooper, J. O., Heron, T. E., & Heward, W. L. (2007). Applied behavior analysis (2nd ed.). Upper Saddle River, NJ: Pearson.

Fradenburg, L. A., Harrison, R. J., & Baer, D. M. (1995). The effect of some environmental factors on interobserver agreement. Research in Developmental Disabilities, 16, 425–437.

Hawkins, R. P., & Dotson, V. A. (1975). Reliability scores that delude: An Alice in Wonderland trip through the misleading characteristics of interobserver agreement scores in interval recording. In E. Ramp & G. Semb (Eds.), Behavior analysis: Areas of research and application (pp. 359–376). Englewood Cliffs, NJ: Prentice Hall.

Hopkins, B. L., & Hermann, J. A. (1977). Evaluating interobserver reliability of interval data. Journal of Applied Behavior Analysis, 10, 121–126.

Kapust, J. A., & Nelson, R. O. (1984). Effects of rate and spatial separation of target behaviors on observer accuracy and interobserver agreement. Behavioral Assessment, 6, 253–262.

MacLean, W. E., Tapp, J. T., & Johnson, W. L. (1985). Alternate methods and software for calculating interobserver agreement for continuous observation data. Journal of Psychopathology and Behavioral Assessment, 7, 65–73.

Martin, N. T., Oliver, C., & Hall, S. (2003). ObsWin: Observational data collection and analysis for Windows. London: Antam Ltd.

Mudford, O. C., Taylor, S. A., & Martin, N. T. (2009). Continuous recording and interobserver agreement algorithms reported in the Journal of Applied Behavior Analysis (1995–2005). Journal of Applied Behavior Analysis, 42, 165–169.

Northup, J., Wacker, D., Sasso, G., Steege, M., Cigrand, K., Cook, J., et al. (1991). A brief functional analysis of aggressive and alternative behavior in an outclinic setting. Journal of Applied Behavior Analysis, 24, 509–522.

Page, T. J., & Iwata, B. A. (1986). Interobserver agreement: History, theory, and current methods. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances (pp. 99–126). New York: Plenum.

Powell, J., Martindale, B., Kulp, S., Martindale, A., & Bauman, R. (1977). Taking a closer look: Time sampling and measurement error. Journal of Applied Behavior Analysis, 10, 325–332.

Repp, A. C., Deitz, D. E. D., Boles, S. M., Deitz, S. M., & Repp, C. F. (1976). Technical article: Differences among common methods for calculating interobserver agreement. Journal of Applied Behavior Analysis, 9, 109–113.

Repp, A. C., Harman, M. L., Felce, D., Van Acker, R., & Karsh, K. G. (1989). Conducting behavioral assessments on computer-collected data. Behavioral Assessment, 10, 249–268.

Tapp, J., & Wehby, J. H. (2000). Observational software for laptop computers and optical bar code readers. In T. Thompson, D. Felce, & F. J. Symons (Eds.), Behavioral observation: Technology and applications in developmental disabilities (pp. 71–81). Baltimore: Paul H. Brookes.

Van Acker, R., Grant, S. H., & Getty, J. E. (1991). Observer accuracy under two different methods of data collection: The effect of behavior frequency and predictability. Journal of Special Education Technology, 11, 155–166.

Received November 9, 2007
Final acceptance December 16, 2008
Action Editor, Henry Roane
