

The Reliability of Selected Techniques in Clinical Arthrometrics

A number of studies which have examined reliability of spinal assessment procedures in manual therapy are reviewed. The tests examined were Passive Accessory Intervertebral Movements, Passive Physiological Intervertebral Movements, Straight Leg Raise and Forward Flexion. In general, tests of pain were found to be much more reproducible than tests of compliance. Straight Leg Raise and Forward Flexion tests were consistently more reliable than the Passive Intervertebral Movement tests. Possible explanations for these findings are advanced. The role of tests of compliance based on passive intervertebral movements in clinical decision-making may need to be re-examined. An appendix on reliability theory is included for the uninitiated reader.

THOMAS A. MATYAS

Thomas Matyas, B.A.(Hons), Ph.D., is a Senior Lecturer in the School of Behavioural Sciences, Lincoln Institute of Health Sciences, Melbourne.

TIMOTHY M. BACH

Timothy Bach, M.Sc., is Lecturer in Biomechanics in the School of Biological Sciences, Lincoln Institute of Health Sciences, Melbourne.

Manual therapy employs a variety of assessment techniques such as the forward flexion (FF) test, the straight leg raise (SLR) test, passive accessory intervertebral movements (PAIVM) and passive physiological intervertebral movements (PPIVM). Collectively these tests and other similar ones may be taken to define the field of 'clinical arthrometrics'.

Clinical arthrometry provides the basis for a laudably empirical approach to treatment. Among other goals, testing is variously employed to help in the selection of a region for treatment, in the selection of appropriate manual techniques and in monitoring case progress. Clearly, then, the adequacy of the assessment procedures is a major issue in the field. However, inspection of the journal literature to 1980 revealed a remarkable dearth of systematic investigations into the reliability, validity and scaling properties of the clinical assessment procedures employed by manual therapists. Consequently, a research programme was initiated in 1980 with the intention of clarifying some of these issues.

The aim of the present paper is to review several studies whose common theme is the reliability of some techniques in clinical arthrometry. The majority of studies reviewed below are part of a continuing programme of research being carried out at the Lincoln Institute of Health Sciences in conjunction with its postgraduate curriculum. Studies were conducted by postgraduate physiotherapists working under the guidance of experienced clinicians and one or both of the authors.

The paper is organized in five sections. The first section describes a method for measuring forces applied during manual procedures. The second section reviews studies on the reliability of pain measurement with three manual techniques: the PAIVM test, the FF test and the SLR test. The third section reviews studies on assessment of spinal compliance with PAIVM and PPIVM tests. The fourth section describes our studies on the reliability of producing two grades of mobilization described by Maitland (1977). Although these are not studies of assessment techniques, the findings are relevant to those of section three. Section five conducts an integrative discussion of the studies performed to date. Each section also attempts to integrate the results of pertinent publications generated outside our programme.

I. An Indirect Method for Estimating Applied Force During Therapeutic Procedures

Studies of the reliability of therapeutic techniques have been limited by a lack of objective measures of therapist performance. While therapist perceptions may be readily obtained, measurement of the mechanical effect of therapeutic intervention is confounded by the requirement that measurement techniques should not interfere with the task. To overcome this restriction, we have developed a method which enables the indirect measurement of forces applied by therapists during mobilization and assessment techniques.

The Australian Journal of Physiotherapy. Vol 31, No 5, 1985 175

The procedure requires that therapists perform their assessment or treatment techniques while standing on a force platform. Figure 1 illustrates the position of the therapist during application of postero-anterior pressure to the lumbar spine of a patient and indicates the three forces acting on the therapist. For this situation we can write:

F + G - W = ma (1)

where W is the weight of the therapist, F is the reaction to the force applied by the therapist to the patient, G is the ground reaction force measured by the force platform, m is the mass of the therapist, and a is the acceleration of the centre of gravity of the therapist. In order to solve this equation for the applied force, F, values of W, G and a must be known. The ground reaction force, G, is readily obtained from the force platform, as is the body weight, W, when F and a are zero. Techniques are available which enable computation of the acceleration of the centre of gravity, a, but these techniques are too tedious and time consuming for routine application. An alternative approach is to make some assumptions about the behaviour of a during mobilization and assessment techniques.

Figure 1: The forces which act on a therapist performing spinal mobilization or palpation are body weight, W; the ground reaction force, G; and the reaction to the force applied to the patient, F.

For some of the experiments reported here these assumptions present little difficulty. If a therapist palpates a point in range and holds that point for a brief period of time (0.5s-1s) while recordings are made, acceleration can be assumed to be virtually zero over this period. For the purposes of this paper, this method will be termed the static force measurement technique. Similarly, if a therapist performs oscillatory mobilizations and force platform data is sampled over a much longer period of time (20 or more oscillations), the average acceleration over the sampling period will be virtually zero (otherwise the therapist would acquire a net positive or negative velocity). The difference between body weight and the measured ground reaction force is an accurate estimate of mean applied force in both cases.
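As a minimal sketch of the two zero-acceleration cases, equation (1) with a = 0 reduces to F = W - G. The example assumes that force-platform readings are available as a list of vertical ground reaction forces in newtons; the function name and the sample values are hypothetical and are not taken from the studies.

```python
from statistics import mean

def mean_applied_force(body_weight_n, ground_reaction_n):
    """Estimate mean applied force F from equation (1) with a = 0.

    body_weight_n: W, the platform reading when no force is applied (newtons).
    ground_reaction_n: samples of G recorded while the force is applied.
    """
    # With zero (or zero-mean) acceleration, F + G - W = 0, so F = W - mean(G).
    return body_weight_n - mean(ground_reaction_n)

# Hypothetical readings for a 700 N therapist holding a point in range.
samples = [612.0, 608.0, 610.0, 611.0, 609.0]
force = mean_applied_force(700.0, samples)
print(f"estimated mean applied force: {force:.1f} N")
```

The same function serves the oscillatory case if the samples span many complete oscillations, since it is the averaging that removes the inertial term.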

In other experiments considered here, estimates of oscillation amplitude and peak applied force were required. The method employed in these studies will be termed dynamic force measurement. Instantaneous values of applied force are much more susceptible to inertial effects than average values. Bach (1985) has adopted an empirical approach to estimate the degree of error involved in using the force platform output (estimated force) as an indirect measure of applied force under different conditions of movement amplitude and frequency. Bach (1985) found that the error associated with oscillation amplitude measurement by this technique was approximately 12%. The error of estimating peak forces by this technique was in the neighborhood of 1-3% depending on characteristics of the applied forces.

In the studies reviewed in this chapter we have measured only the vertical component of force. Many assessment and treatment techniques require that force components other than vertical be applied. However, studies described here concentrated on postero-anterior central vertebral pressures on prone patients and therefore primarily vertical forces were involved. In one experiment (Collis-Brown 1982) involving 192 measurements of applied force during a posterior-anterior PAIVM assessment the mean difference between the vertical component of the applied force and the total applied force was 1.8N. This represented 0.5% of the total range of measured forces. We have therefore chosen to neglect horizontal components of the applied force in techniques involving primarily postero-anterior movements.
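To illustrate why a small horizontal component matters little, the sketch below compares a vertical component with the magnitude of the resultant force. The component values are hypothetical. The last line simply inverts the two reported figures (1.8N and 0.5%) to show the force range they imply; that range is our inference and is not a value stated in the study.

```python
import math

def total_force(vertical_n, horizontal_n):
    """Magnitude of the resultant of vertical and horizontal components."""
    return math.hypot(vertical_n, horizontal_n)

# Hypothetical example: 200 N vertical with a 20 N horizontal component.
vertical, horizontal = 200.0, 20.0
difference = total_force(vertical, horizontal) - vertical
print(f"total exceeds vertical by {difference:.2f} N")

# Range of measured forces implied by 1.8 N being 0.5% of that range.
implied_range = 1.8 / 0.005
print(f"implied range of measured forces: {implied_range:.0f} N")
```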

An unresolved issue is that of the pressure distribution between therapist and patient. In studies of applied force reported here therapists were required to use the pisiform technique as described by Maitland (1977, p.137). This technique involves placing the hands so that the point of contact with the spinous process is the medial border of the hand between the pisiform and the hamate. The purpose of this placement is to localize the pressure distribution as much as possible. The proportion of total applied force which acts on the vertebral body itself could differ between therapists and between patients as a result of anatomical variation in soft tissue distribution in both the hands of therapists and the backs of patients. To our knowledge, there is no method available for obtaining precise information on these pressure distribution patterns but differences are likely to be very small. Furthermore, these errors are fixed by the experimental designs employed: the therapists' hands do not change; the individuals tested and retested are the same; the anatomical loci are the same. Thus only absolute force values will be subject to pressure-distribution error. Reliability coefficients, which are only affected by random error, will not be influenced (see Appendix).

II. The Reliability of Some Movement Tests of Pain in the Lumbar Spine

Tests of pain may employ either passive movements as in PAIVM, PPIVM and SLR tests or active movements as in the FF test. These tests are employed to chart 'pain behaviour' (Maitland 1977). Although other features, such as 'quality' of pain, may also pertain, 'pain behaviour' is often conceived as a two dimensional function: pain versus range of movement (ROM). Key features of this function are: the point in ROM of pain onset (P1); the pain intensity at the limit of movement (when the limit is caused by factors other than pain), or the point in ROM where pain is of sufficient intensity to limit movement (P2); and the dynamics of pain intensity between P1 and P2, i.e. the nature of the change in pain intensity as a function of ROM. Pain assessment is an essential feature of initial diagnosis, acute pre-post evaluation of manual intervention and longer term evaluation of intersession development. Therefore intertherapist reliability, within-session test-retest reliability and between-session test-retest reliability are all relevant practical issues for evaluating P1, P2 and pain dynamics. Our studies to date have examined only some of these issues.

Results with the PAIVM test

Collis-Brown (1982) and McNeill (1982) examined in a within-session design the test-retest and intertherapist reliability of locating P1 in ROM when using PAIVM. Four physiotherapists with postgraduate qualifications in manual therapy examined two segments from each of 12 patients. Patients were included if prior examination revealed: a history of back pain or current back pain; a non-irritable condition; and discernible pain onset in at least two lumbar levels on application of PAIVM. Patients were examined in prone with the two relevant lumbar levels pre-marked. As much of the upper and lower body was covered as was possible in order to reduce body identity. No communication was permitted other than the response 'Now' to the question 'Tell me when the pain starts'. The static force measurement technique described earlier was used to measure applied forces. Therapists recorded their conclusion on a 100mm visual analogue scale (VAS). This permitted simultaneous measurement of the force at which P1 occurred and the subjective distance from ROM origin where P1 occurred according to the therapist. To control for series effects therapists examined the patients in a Latin square design (Meyers and Grossen 1974), with patients randomly allocated to four groups of three. After three patients were examined by all therapists the entire procedure was repeated. The experimental design therefore provided 24 test and 24 retest measurements from each of four therapists under conditions which attempted to minimize information other than the P1 response to PAIVM.
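The counterbalancing idea behind a Latin square can be sketched as follows. The particular square used in the study is not reported here, so the cyclic construction, the therapist labels and the mapping of rows to patient groups are all assumptions made for illustration.

```python
def cyclic_latin_square(items):
    """Build an n x n Latin square by cyclically shifting the item list.

    Each item appears exactly once in every row and once in every column.
    """
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

# Hypothetical labels for the four therapists.
therapists = ["A", "B", "C", "D"]
square = cyclic_latin_square(therapists)

# Row = patient group, column = position in the examination sequence.
for group, order in enumerate(square, start=1):
    print(f"patient group {group}: examined in order {' '.join(order)}")
```

Because every therapist occupies every serial position exactly once across the four groups, any effect of examination order is balanced across therapists rather than confounded with them.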

Collis-Brown (1982) found that the average test-retest reliability coefficient for palpation conclusions was 0.73. This was only a little less than the average correlation between the test and retest forces required to produce a P1 response, which reached a value of 0.83. The difference between the two coefficients was not statistically significant. In terms of classical reliability theory this implies that 27% of the variance in P1 observed by palpation was due to random error. This error may be conceived as a composite result of at least two processes: random changes in the patient's pain condition, or in the verbal report; and random error in the therapist's ability to perceive the point in ROM where P1 was reported and record it on the VAS. An estimate of the first component may be obtained from the test-retest correlation of the forces, which do not depend on therapist perception and recording ability. This method estimates that 17% of the observed score variance was due to random error in the patient's report, although the true value will be somewhat lower because an amount should be allowed for the random error in force measurement. Nevertheless, it was apparent that random error due to therapist perception and recording was small.
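The arithmetic behind these percentages can be sketched as below, assuming the classical result that the proportion of observed-score variance due to random error is one minus the reliability coefficient. The function names are hypothetical, and the remainder attributed to the therapist is only the simple difference between the two estimates, so it should be read as a rough figure.

```python
def error_variance_fraction(reliability):
    """Proportion of observed-score variance attributed to random error."""
    return 1.0 - reliability

# Coefficients reported by Collis-Brown (1982).
judgement_error = error_variance_fraction(0.73)  # therapist VAS judgements
patient_error = error_variance_fraction(0.83)    # forces at the P1 report

# Rough remainder attributable to therapist perception and recording.
therapist_error = judgement_error - patient_error

print(f"total random error:   {judgement_error:.0%}")
print(f"patient report error: {patient_error:.0%}")
print(f"therapist remainder:  {therapist_error:.0%}")
```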

From a practical point of view, however, random error destroys judgement reliability irrespective of its genesis in the patient or the therapist. To make interpretation of patient changes clearer, confidence intervals were computed for the therapist judgements. These estimated that for 95% confidence that a change does not reflect merely random error, a therapist must observe a change of at least 34% of full scale on the VAS. In clinical situations confidence as low as 80% may sometimes suffice. This was estimated to require a change of at least 22% on the VAS. It is difficult, given the lack of evidence on the size of the effect requiring measurement, to decide if the random error is sufficiently small.
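One common way to obtain such thresholds is sketched below. It assumes classical test theory (standard error of measurement = SD x sqrt(1 - r)), a difference between two measurements with error sqrt(2) times that value, and normally distributed error. The exact method used in the studies is described in the Appendix and is not reproduced here, so these functions are illustrative assumptions rather than the authors' calculation.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_change(sd, reliability, confidence=0.95):
    """Smallest test-retest change unlikely to arise from random error alone."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile
    sem = sd * sqrt(1.0 - reliability)                # standard error of measurement
    return z * sqrt(2.0) * sem                        # error of a difference score

def rescale_threshold(threshold, from_conf, to_conf):
    """Convert a change threshold from one confidence level to another."""
    nd = NormalDist()
    z_from = nd.inv_cdf(0.5 + from_conf / 2.0)
    z_to = nd.inv_cdf(0.5 + to_conf / 2.0)
    return threshold * z_to / z_from

# Rescale the reported 95% threshold (34% of VAS) to 80% confidence.
vas_80 = rescale_threshold(34.0, 0.95, 0.80)
print(f"80% threshold: {vas_80:.1f}% of full scale")
```

Under these assumptions the rescaled value comes out close to the 22% reported above, which suggests that the two thresholds differ only in the confidence multiplier applied to the same error estimate.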

McNeill (1982) examined the degree of intertherapist reliability present in the above experiment. The average intertherapist correlation was 0.62. This indicates that a substantial proportion of the variance in observed scores (38%) was attributable to intertherapist variation in performing the test. The intertherapist correlation in forces required to produce P1 was 0.75, which was not significantly lower than the intratherapist value of 0.83. A large portion of the variability in intertherapist correlations was attributable to random error in patient report (25%) and a smaller portion to differences between therapists (13%).

The effect of conducting a broader PAIVM test, including compliance features, spasm, and a complete chart of 'pain behaviour', was investigated subsequently by Flint (1983). Four manual therapists with postgraduate qualifications independently examined one lumbar level from each of twelve patients. The patients were selected from several clinics providing that a screening physiotherapist identified, following a full examination (Maitland 1977), current back pain attributable to the lumbar region. The patients were examined in a Latin square sequence as in the earlier study. The movement diagram described by Maitland (1977) was employed as a two dimensional VAS of Intensity x ROM (67 x 100mm). Therapists recorded P1, P2, the dynamics of pain between P1 and P2, as well as the other features typically required by a Maitland movement diagram: the limit of range (L); the point in ROM of resistance onset (R1); limiting resistance (R2); the dynamics of resistance between R1 and R2; and the behaviour of muscle spasm, if present (Maitland 1977). The screening physiotherapist premarked the level to be tested, which was the 'most symptomatic' level found in the prior examination. Therapists were required to palpate only the marked level using central PAIVM. No other patient information was given to the therapists.

Flint (1983) found that the mean intertherapist correlation for locating P1 in ROM was 0.48, somewhat lower than the 0.62 obtained by McNeill (1982). Although this difference is not statistically significant, the result indicates that additional palpation information failed to improve the reliability of P1 ratings.

Furthermore, Flint's sample had a more acute status than that employed by McNeill. Thus the result also failed to support the hypothesis that P1 ratings from more acute patients would provide better reliability because acute patients are likely to have a clearer pain onset, with a distinct 'bite of pain' (Maitland 1977, Collis-Brown 1982, McNeill 1982).

As a part of the same study, Flint also examined intertherapist reliability in measuring pain intensity at P2. She found a mean intertherapist reliability coefficient of 0.75, a relatively good result and the best intertherapist reliability coefficient obtained to date in our PAIVM investigations. It is interesting to note that this feature of the movement diagram is probably more reliant on the patient's response and less reliant on the therapist's ability than any other PAIVM finding.

A final aspect of the reliability of pain assessment investigated by Flint was the degree of intertherapist agreement on whether pain, spasm or resistance was the cause of movement limitation. The mean pairwise agreement was 66.6% which proved significantly higher than the expected random agreement rate (51.8%) given the obtained base rates. Nevertheless, an intertherapist disagreement rate of 32.4% is substantial in a practical sense, since the decision about the cause of movement limitation plays a significant role in selecting treatment approach (Maitland 1977).
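The comparison between observed and chance agreement can be sketched as follows. The base rates shown are hypothetical, since the obtained base rates are not listed here, and the chance-corrected index (a kappa-style statistic) is our addition for illustration; it was not reported in the study. The chance calculation assumes both raters choose categories independently with the same base rates.

```python
def chance_agreement(base_rates):
    """Expected agreement between two independent raters sharing base rates."""
    return sum(p * p for p in base_rates)

def chance_corrected_agreement(observed, expected):
    """Kappa-style index: agreement beyond chance as a share of the maximum."""
    return (observed - expected) / (1.0 - expected)

# Hypothetical base rates for pain, resistance and spasm as the limiting factor.
example_expected = chance_agreement([0.5, 0.3, 0.2])
print(f"chance agreement for hypothetical base rates: {example_expected:.2f}")

# Figures reported by Flint (1983): 66.6% observed, 51.8% expected by chance.
kappa = chance_corrected_agreement(0.666, 0.518)
print(f"chance-corrected agreement: {kappa:.2f}")
```

Seen this way, agreement that is statistically better than chance can still represent only a modest share of the agreement that was available beyond chance.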

Results with the SLR test

The SLR is a widely used test, recommended (Cyriax 1982) for both diagnosis and progress evaluation. It is associated with a considerable body of literature discussing its underlying processes (Goddard and Reid 1965, DePalma and Rothman 1970, Murphy 1977, Breig and Troup 1979 and Cyriax 1982). Like the PAIVM test it employs passive movement, but the movement is 'physiological' rather than 'accessory'.

McFarlane (1981) examined the reliability of assessing pain onset as a point in ROM during the SLR. Twenty patients with low back pain of recent origin were selected from several Melbourne hospitals. Patients were excluded if they showed an unusually high anxiety component, failed to show a change in symptoms under 80° of SLR, or showed restricted movement or pain in the squatting test. Five SLR tests to pain onset were performed on each subject with a 90 second inter-test interval. A gravitational goniometer was used to record the angle at P1. Medial hip rotation was manually controlled as suggested by Breig and Troup (1979).

A mean test-retest correlation of 0.96 was found between adjacent pairs of trials, indicating a very high reliability for this test. On the basis of McFarlane's data we calculated that a change of at least 13.6° should be observed in P1 if changes due to random error are to be excluded with a certainty of 95%. If typical normal ROM is estimated around 90° (DePalma and Rothman 1970, Cyriax 1982) the 95% confidence interval for test-retest change is 15% of scale, which is better than the 34% obtained with the PAIVM test (Collis-Brown 1982). Thus both metric and metric-free estimates of reliability show better values for the SLR test.

In addition, McFarlane examined the possibility that systematic trends may occur in the SLR data. She found that the range to pain onset increased between successive tests by an average of 1.2° which was a statistically significant effect. Therefore increases in range to pain onset of 15° would probably be safer minima for error free estimates of therapeutic improvement between succeeding tests obtained within session.
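The two SLR figures can be reproduced with simple arithmetic, sketched below. Expressing the threshold as a percentage of a 90° scale follows the text directly. Adding the mean systematic trend to the random-error threshold is our reading of how the 15° recommendation arises and is an assumption, since the calculation is not spelled out.

```python
def percent_of_scale(threshold, full_scale):
    """Express a change threshold as a percentage of the full measurement scale."""
    return 100.0 * threshold / full_scale

def trend_adjusted_threshold(random_error_threshold, mean_trend):
    """Add a systematic between-test drift to a random-error threshold."""
    return random_error_threshold + mean_trend

# 95% random-error threshold of 13.6 degrees on an assumed 90 degree scale.
slr_percent = percent_of_scale(13.6, 90.0)
# Allow for the 1.2 degree mean increase between successive tests.
slr_safe = trend_adjusted_threshold(13.6, 1.2)

print(f"{slr_percent:.1f}% of scale, trend-adjusted threshold {slr_safe:.1f} degrees")
```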

A subsequent experiment performed by Puentedura (1983) to examine the effects of trunk position on the SLR test indirectly yielded confirmatory evidence of high reliability for this test. Puentedura recorded pain onset and limiting pain in seventeen young, non-symptomatic subjects who reported no history of chronic musculoskeletal illness. Electrogoniometric readings were obtained with the trunk in three positions: neutral, maximal contralateral flexion and supported lumbar lordosis. All tests were performed in supine on a flat surface. Within each posture ten pain onset and two limiting pain observations were performed. However, since repeated measures within each posture were obtained with no intervening treatment, we were able to re-examine Puentedura's raw data for test-retest reliability coefficients. Regardless of posture these proved to be uniformly high. The mean test-retest correlation between adjacent pain onset trials within a posture was 0.98. The limiting pain data yielded an average correlation of 0.96. These results confirm and extend those of McFarlane.

In the period between the studies of McFarlane and Puentedura three publications appeared (Hoehler et al 1982, Lankhorst et al 1982, Million et al 1982) which seem to further confirm the high reliability of the SLR test. Million et al (1982) found a within session retest reliability of 0.97 using nineteen patients. Lankhorst et al (1982) using an active SLR reported error components for both interobserver and interday aspects from a factorial design applied to 48 low backache patients. From their results we calculated an interday test-retest reliability of 0.96-0.97 and interobserver reliability of 0.93-0.96. Slightly poorer results were found by Hoehler and Tobis (1982) for interobserver reliability when measuring passive SLR (r = 0.78), although for active SLR the results were comparable (r = 0.95).

Results with the FF test

Like the SLR test, the FF test involves 'physiological' movement. However the test is one of active rather than passive movement. During active forward bending in the sagittal plane, with the knees extended, several parameters may be recorded. These include ROM to pain onset (P1) and ROM to maximum pain tolerance (P2) among others. The test is widely used as a part of various approaches to examination of the lumbar spine (Maitland 1977, Stoddard 1980, Cyriax 1982). The purpose of this subsection is to review four studies our group performed on the reliability of the FF test for measuring pain parameters.

Several methods for recording ROM during FF tests have been reported including skin distraction (Macrae and Wright 1969, Van Adrichem and Van Der Korst 1973), spondylometry (Twomey and Taylor 1979, Stoddard 1980), inclinometry (Loebl 1967), tangential hydrogoniometry (Anderson and Sweetman 1975), radiography (Hauley et al 1976) and photography (Troup et al 1967). Some of the previous literature investigating the adequacy of these measurement methods has been concerned with their relative value for assessing spinal mobility (Troup et al 1967, Van Adrichem and Van Der Korst 1973, Reynolds 1975, Moran et al 1979). Much of the evidence has been collected from normal samples (Loebl 1967, Troup et al 1967, Van Adrichem and Van Der Korst 1973, Reynolds 1975, Moran et al 1979). The purpose of the studies reported below was to examine pain measurement with a view to clinical application. Therefore simplicity was a criterion for selecting the approach to measuring ROM. This excluded radiographic and photographic methods.

The method adopted was to measure fingertip position using a measuring tape (Kapanji 1974). Apart from its simplicity this method seemed appropriate because kinesiological analysis suggests that it is influenced not only by spinal movement, but also by hip movement and a variety of associated structures including muscle and connective tissue (Farfan 1973, Van Adrichem and Van Der Korst 1973, Hart et al 1974). While this is a disadvantage for the assessment of specific mobility in the lumbar spine (Moll and Wright 1976), it may be an advantage in the measurement of pain, particularly pain progress, where a variety of structures may be implicated.

Kwong (1981) investigated the test-retest reliability of assessing P1 with the FF test. Twenty patients attending a physiotherapy clinic were sampled provided they had low back pain without either hip involvement, a list, or scoliosis. Patients were assessed in briefs and bare feet after the prominence of the tibial tuberosity was marked. They were required to bend forward, sliding their hands down their thighs, without deviation from the sagittal plane, until pain onset. Patients with pre-existing background pain were instructed to stop on onset of a pain change. Using the tibial tuberosity mark as origin, ROM to P1 was recorded by measuring the distance to the tip of the midfinger with a nylon tape. Three measurements with an intertrial interval of one minute were obtained from each patient. The mean test-retest correlation between trial pairs was 0.98, indicating very high reliability. Using Kwong's data we calculated that changes of 83mm or more would give 95% confidence that the observed change was not the result of random error of measurement.

Systematic error due to repeated measurement was assessed by comparing the central tendency in the three samples. No statistically significant differences were obtained, although there was a suggestion that an initial practice trial might stabilize the data.

Using the same FF measurement technique, Bruce (1981) investigated the test-retest reliability of assessing the point of limiting pain (P2). Twenty patients with low back pain were selected from a private physiotherapy clinic. Patients were excluded if they were restricted by bilateral hamstring tension, had less than 60° ROM, or had an 'irritable condition'. Patients were randomly allocated to two groups of ten. Three measurements were taken from all subjects. An objective examination of the spine (Maitland 1977) was interpolated in one group between the first and second measure and in the other group between the second and the third measure. The other intertrial intervals were three minute rests. Test-retest correlation, when only rest intervened between the two trials, was 0.98. This was consistent with Kwong's results. Test-retest correlations between trials separated by the objective spinal assessment were 0.87 and 0.99. Using the reliability coefficient of 0.98 and Bruce's raw data we calculated that changes of 33mm or more would give 95% confidence that the observed change was not the result of random error of measurement. No systematic bias due to repeated measurement was found, replicating Kwong's data.

The studies performed by Bruce and Kwong were limited to assessing within-session retest reliability. While evaluation of within-session progress is a main use of assessment in manual therapy, the results of Bruce and Kwong are not necessarily generalizable to between-session retest intervals. Therefore Patterson (1982) examined limiting pain in FF on two consecutive days. Three FF tests were conducted on Day 1 separated by one minute rest intervals. The procedure was repeated on Day 2. A sample of 12 subacute or chronic low back pain patients was selected, using similar criteria to those of Bruce and Kwong. The mean within-session retest reliability was found to be 0.98, confirming the findings obtained by Bruce. The mean between-session retest reliability was 0.97, not significantly lower than that obtained within-session. The 95% confidence interval for measuring changes within-session was 45mm, slightly higher than that obtained by Bruce. The 95% confidence interval for measuring changes between days was 52mm.

Maitland (1977, p.171) recommends that a therapeutic effect should only be assumed if an improvement of 25mm or more in limiting pain is obtained. This conclusion, based on clinical observation and in the absence of formal analysis, compares well to our experimental estimates. In terms of the random error estimate obtained by Bruce, changes in excess of 25mm afford 87% confidence. In terms of Patterson's within-session estimates 25mm changes afford 76% confidence. Between-session conclusions should be taken even more conservatively: our calculations based on Patterson's data estimate only 68% confidence for a minimum change of 25mm.
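The relation between a 95% threshold and the confidence afforded by a smaller change can be sketched as follows. This is a minimal illustration and not the authors' calculation: it assumes that random error in a difference score is normally distributed, and the function name `confidence_for_change` is hypothetical. Under that assumption the results come out close to, but up to a few points below, the percentages quoted above, so the original method (see Appendix) presumably differed in detail.

```python
import math

Z_95 = 1.96  # two-sided 95% point of the standard normal distribution


def confidence_for_change(change_mm, threshold_95_mm):
    """Two-sided confidence that a change of `change_mm` is not random error,
    given the change that affords 95% confidence (normal-error assumption)."""
    sd_diff = threshold_95_mm / Z_95      # implied SD of random difference scores
    z = change_mm / sd_diff
    return math.erf(z / math.sqrt(2))     # equals 2 * Phi(z) - 1


# 95% thresholds (mm) reported in the text for limiting pain in FF
for label, threshold in [("Bruce, within-session", 33),
                         ("Patterson, within-session", 45),
                         ("Patterson, between-session", 52)]:
    pct = 100 * confidence_for_change(25, threshold)
    print(f"{label}: {pct:.0f}% confidence for a 25mm change")
```

The same function can be used to ask how much confidence any other clinically chosen cut-off would afford for a given 95% threshold.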

Another possibility for measurement error on reassessment is that repeated exposure to the same test may create a systematic bias. Serial effects may occur as a result of changes in the relevant anatomy/physiology caused by the initial test, placebo phenomena, or simply skill learning. In Kwong's study no statistically significant differences were obtained among the three trials. The largest mean difference was only 6mm and occurred between the samples of trials 1 and 2.

The most recent study in this series was designed to determine if the high reliability found in the three previous studies was an artefact of the way the test was performed. At least two obvious hypotheses might be invoked to suggest that the high retest reproducibility resulted from factors other than pain sensation. One hypothesis is that visual and tactile feedback was available to patients in these studies since they could see their own performance and the test procedure required the hands to slide down the legs. Another hypothesis, more difficult to test, is that the high reproducibility merely represents memory for movement rather than recurrence of a given pain level at the same point in ROM.

To investigate these hypotheses Munro (1983) examined the within-session retest reliability on a modified FF test. The test was performed with a blindfold. Furthermore, instead of sliding their hands down their legs, subjects were required to bend forwards while depressing a low-friction plunger vertically with the tips of the middle fingers (Moll and Wright 1976). The plunger was part of an apparatus containing a metric scale and pointer which permitted location of movement endpoint to the nearest millimetre. A final modification to the previous procedure was that a simple motor task (manipulation of a nut and bolt) was interpolated between the test and the retest in an effort to produce some disruption in sensorimotor memory. Two groups of subjects were tested. The first group comprised 17 low back pain patients selected along criteria similar to those of the earlier studies. The second group comprised 17 asymptomatic subjects. Subjects were selected in the asymptomatic group on a matched pair basis with a low back pain subject. The matching criteria were gender and age parity (within 6 years). The asymptomatic member of each matched pair was required to perform a task yoked to the initial performance of the low back pain subject. A mechanical block, placed at the same point where the symptomatic subject showed pain onset, was used to stop forward bending of the asymptomatic subject during the test. During the retest, which followed the interpolated task, the block was not present and asymptomatic subjects were required to simply stop at the point as they recalled it from the initial test. Symptomatic patients were required to stop on pain onset during both tests.

Despite the blindfold and the interpolated task, symptomatic subjects showed a test-retest correlation of 0.99. Statistical analysis revealed that this was significantly higher than the correlation shown by the asymptomatic group (0.92). The high reliability obtained confirmed the earlier FF data (Bruce 1981, Kwong 1981, Patterson 1982). More importantly, however, the superior reliability of the symptomatic group under these stringent performance requirements suggests that pain sensation was contributing, rather than visual or tactile feedback. Similarly, performance on memory alone can be rejected, although a more convincing demonstration could probably have been obtained by employing a longer intertest interval and an interpolated task using the same joints as the FF test, but which does not aggravate the pain.

In conclusion, our studies of the FF test for pain have consistently produced high reliability estimates and suggest that pain is indeed being accessed. This finding is in contrast to the deprecatory conclusions of some other authors (Hart et al 1974, Reynolds 1975, Moll and Wright 1976, Moran et al 1979). The FF test is said not to be a good measure of spinal mobility (Hart et al 1974, Reynolds 1975, Moll and Wright 1976). This may be so but the point is irrelevant to the measurement of pain and its progress. The FF test is said to be influenced by structures other than those of the lumbar spine (Reynolds 1975, Moran et al 1979). We have already addressed this issue indicating that from the point of view of monitoring pain progress this may be an advantage. In general therefore, results indicate that the FF test should not be overlooked as a simple and reliable clinical test for assessing pain changes, particularly if other aspects of the assessment have established the nature of the underlying pain process. Finally, it is interesting to note that the reliability coefficients of the SLR and FF tests, both of which involve 'physiological movements', were comparable and consistently higher than those obtained for PAIVM tests of pain.

III. The Reliability of Some Clinical Procedures for Assessing Compliance

Manual tests of spinal compliance probably form the most distinctive contribution of manual therapy to the diagnostic armamentarium. Their objective is to employ the therapist's perception of displacement and 'resistance' to obtain a subjective model of spinal compliance, which can be used for a variety of decisions (Maitland 1977). That this involves a perceptual model of spinal compliance, including dynamic parameters, can be seen most clearly in the development of the two-dimensional movement diagram (Maitland 1977). Manual assessment of compliance contains, in counterpart to pain assessment, some key parameters: the point in ROM of resistance onset (R1); the point in ROM where resistance limits passive movement (R2); and the compliance function which links R1 and R2. Compliance tests have a role in: initial diagnosis, including selection of the level to be treated and type of mobilization to be utilized; the evaluation of progress within-session following treatment; and progress between sessions (Maitland 1977).

Consequently test-retest and intertherapist reliability are relevant issues. The majority of our studies to date have been concerned with PAIVM (Baker 1981, Millman 1981, Wong 1981, Weeks 1982, Allen 1983, Flint 1983) although one study involving PPIVM (Clarkson 1982) is also reported below.

Studies evaluating the reliability of R1 and R2 assessment with PAIVM

Despite the relatively widespread use of the PAIVM assessment procedures described by Maitland (1977), a review of the literature prior to 1981, the time of our group's initial study (Baker 1981, Wong 1981), revealed a remarkable dearth of systematic attempts to evaluate the reliability of these procedures.

An initial study designed to estimate intertherapist reliability for locating R1 and R2 in ROM was conducted by Baker (1981) and Wong (1981). Three therapists independently examined six spinal levels from each of eighteen subjects. The subjects had an age range of 18 to 54 and no history of recent spinal pain. The six levels examined were C2, C6, T2, T10, L2 and L4. The three cephalad processes were examined with thumbs in apposition. The three caudad processes were examined with pisiform technique. Each therapist was required to mark R1, R2 and the compliance function linking them on a 45x60mm movement diagram (Maitland 1977). The ROM to R1 and to R2 was then obtained to the nearest millimetre. Using these measures, intertherapist correlation coefficients for each joint were then obtained for each pairwise combination of therapists.
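The pairwise procedure just described can be sketched in a few lines. This is an illustration only: the R1 distances below are hypothetical numbers invented for the example (not data from Baker or Wong), and the helper names `pearson` and `mean_pairwise_correlation` are assumptions of this sketch.

```python
from itertools import combinations
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_correlation(ratings_by_therapist):
    """Average the correlation over every pairwise combination of therapists."""
    pairs = list(combinations(ratings_by_therapist, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical ROM-to-R1 distances (mm) marked by three therapists
# for the same six subjects at one spinal level.
ratings = [
    [12, 20, 15, 30, 22, 18],
    [14, 18, 21, 25, 19, 24],
    [10, 25, 13, 28, 27, 16],
]
print(round(mean_pairwise_correlation(ratings), 2))
```

With three therapists there are three pairs per spinal level; the mean over pairs is the kind of summary figure reported for each level.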

Intertherapist correlations were lower than those obtained in PAIVM tests of pain. The mean coefficient for R1 across all spinal levels was 0.30. The best mean correlation for a single level was 0.64, obtained from L4. This was significantly superior to the other coefficients obtained. The mean correlation for R2 across all spinal levels was 0.28 and the best mean correlation for a single level was 0.58, obtained from L2. The L2 value was significantly superior to that of C6, T2 and T10. Other differences between the reliabilities given by the six levels were not statistically significant. Although the mean reliability coefficients of 0.30 and 0.28 were statistically significant, they were disappointingly low.

In a subsequent study Weeks (1982) examined the within-session and interweek test-retest reliability for locating R1. Four therapists independently examined three joints from each of twelve subjects. None of the subjects had a history of recent spinal pain. The age range was 20-50 years. Each therapist palpated C2, T4 and L5 on two occasions one week apart. Within each session the joints were assessed twice on a rotational basis across the twelve subjects, ie the examination of eleven subjects intervened between the first and second assessments within the session. Therapists were required to mark the location of R1 on an 80mm VAS marked in quarters. Apart from the areas to be examined, subjects' bodies were draped.

Distances to R1 were then used to compute, for each segment, the within-session and interweek reliability coefficients for each therapist. The within-session correlation was 0.46 when averaged across all four therapists and all three joints. The interweek reliability coefficient averaged an extremely poor 0.09, which was significantly worse than even the disappointingly low within-session correlation.

Since the four therapists examined the same subjects it was also possible to replicate the estimate of intertherapist reliability obtained by Wong (1981). Over both days and across all joints the mean pairwise intertherapist correlation was 0.25, confirming the low estimate obtained by Wong.

In general, therefore, the two studies indicated that PAIVM assessment of compliance parameters has poor reliability. However these estimates should be interpreted in the light of two methodological issues which weaken the generalizability of the estimates. The first issue is that in both studies the sample comprised trainee therapists in the second half of the postgraduate diploma specializing in manual therapy. It is possible to argue that such a sample may not have been representative of the ability which a sample of fully trained and more experienced practitioners would demonstrate.

The second issue of generalizability refers to the sample of subjects used by the two studies, which in both cases had no recent history of spinal pain, unlike the subjects typically seen in clinical practice. The quantification of reliability is affected to a degree by the range of individual differences among the joints examined. The mathematical theory of reliability clearly indicates that restricting the range of variation will tend to reduce the reliability coefficient (see Appendix). The reliability coefficient is the ratio of the true-score variance to total (true-score plus error) variance. The size of the error variance may be assumed to remain constant over the sample as a whole when the same method of measurement is employed. However, if the true-score variance is reduced because true individual differences between the measured entities have been reduced, then the random error component will be a larger proportion of the total variation and the overall correlation coefficient will be reduced. In other words, if the range of variation in compliance parameters which results from individual differences in a non-clinical sample is substantially different from the range obtained in clinical samples, the reliability estimates obtained will tend to be biased. The issue being an empirical one, the logical approach is to examine a clinical sample. Flint (1983), whose results have been reported in part above, chose that approach.
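The range-restriction argument can be made concrete with a small numerical sketch. The variance values below are hypothetical and chosen only to show the direction of the effect; the function name `reliability` is an assumption of this example.

```python
def reliability(true_score_variance, error_variance):
    """Reliability coefficient: true-score variance over total variance."""
    return true_score_variance / (true_score_variance + error_variance)


# Hypothetical variances (mm^2). The error variance is held constant because
# the same method of measurement is assumed in both samples.
error_variance = 4.0
wide_range = reliability(16.0, error_variance)       # large individual differences
restricted_range = reliability(1.0, error_variance)  # small individual differences

print(wide_range)        # 0.8
print(restricted_range)  # 0.2
```

The measurement error is identical in both cases; only the spread of true scores differs, and that alone moves the coefficient from 0.8 to 0.2.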

The study carried out by Flint, in contrast to those of Weeks and Wong, employed a clinical sample; gave therapists an 'ecologically valid' assessment task, since they were required to do a full pain and passive movement diagram on a clinical subject; and used four fully qualified therapists with post-qualification experience ranging from nine months to three years. The intertherapist reliability coefficient for locating R1 in ROM was found to be 0.38 on the average, which is not significantly higher, in either the statistical or the practical sense, than that of 0.30 reported by Wong.

The reliability of differentiating spinal levels on the basis of compliance perception following PAIVMs

In clinical assessment PAIVM tests may be used in the attempt to locate compliance parameters on a perceptual ratio scale so that they may be used to guide diagnosis, assess progress and assist in the selection of grades of therapeutic movement typical of the approach described by Maitland (1977). This purpose guided the orientation of the studies reported in the previous subsection. An alternative purpose for PAIVM tests is to assess the presence of compliance abnormalities by palpation, on a comparative basis, across several spinal levels. Relevant parameters include 'end feel', soft tissue resistance and postero-anterior amplitude of joint movement (Maitland 1977).

Millman (1981) examined test-retest and intertherapist reliability for blind discrimination of the stiffest spinal level. Therapists were blindfolded and required to select, only by performing PAIVM with pisiform, which of the six unidentified levels presented in random sequence was stiffest. The levels included were L4 to T11.

Therapists were permitted to repalpate any levels they were uncertain about until they came to a firm decision. Each of three therapists examined the same thirteen nonclinical subjects on two occasions within one session of testing. The results indicated that preconceptions about anatomical variation in stiffness were adequately controlled by this procedure because therapists' ability to identify which anatomical levels they were on was not significantly better than that attainable by chance. Furthermore, therapists were unable to guess at better than chance rates when they were performing a retest.

Under these conditions, which imposed a strict dependence on palpatory information, the mean test-retest agreement rate was 31%. Statistically, this was significantly better than the agreement rate of 16.7% predicted by a model which assumed that therapists were randomly selecting one level among six. The analysis also showed that 31% was significantly worse than the agreement rate of 50% predicted by a model which assumed that therapists were able to reject four levels with certainty, but were guessing which of the remaining two levels was stiffest. The best model was that which assumed therapists were able to reject three of the levels but guessed among the remaining three. These models are, of course, imaginary. They should not be taken to imply that therapists decide literally following the processes assumed by these models. The models do however provide a valuable frame of reference for interpretation.
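The agreement rates predicted by these guessing models follow from simple probability, as the sketch below shows. It assumes, for illustration, that the same levels are rejected on both occasions and that each of the two selections is an independent, uniform guess among the levels that remain; the function name `chance_agreement` is hypothetical.

```python
from fractions import Fraction


def chance_agreement(n_candidates):
    """Probability that two independent uniform guesses among
    `n_candidates` levels select the same level."""
    p = Fraction(1, n_candidates)
    # Sum over levels of P(first guess = level) * P(second guess = level)
    return sum(p * p for _ in range(n_candidates))


TOTAL_LEVELS = 6
for rejected in (0, 3, 4):
    remaining = TOTAL_LEVELS - rejected
    rate = float(chance_agreement(remaining))
    print(f"reject {rejected}, guess among {remaining}: {rate:.1%}")
# reject 0, guess among 6: 16.7%
# reject 3, guess among 3: 33.3%
# reject 4, guess among 2: 50.0%
```

The observed 31% test-retest agreement lies closest to the three-candidate model, which is why that model was described as the best of the three.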

The analysis of intertherapist agreement showed that the average pairwise agreement was 25.7%. This was significantly better than the 16.7% predicted by a model assuming complete guessing. It was also significantly worse than the 33% predicted by a model which assumed that therapists were able to reject three levels with certainty, but had to guess among the remaining three spinal levels.

Millman's results therefore suggested that by palpation alone therapists can discriminate better than chance those differences in stiffness derived from anatomical variation. Unfortunately, the degree of agreement, though better than chance, was nevertheless low from the point of view of practical diagnostics. For example, it seems likely that a therapist would be able to narrow the range of clinically relevant levels down to three, or perhaps even two, by using the case history, the other test data and epidemiological knowledge.

However, the generalization of Millman's data faces some problems. First, the source of variation between spinal levels was that due to natural anatomical differences in non-symptomatic spines. In clinical decision making the stated objective is to identify the presence of an abnormality. The frame of reference for the therapist presumably is some cumulated memory model of what is normal (Maitland 1977). Whether the difference between the immediate perceptual trace from an abnormal joint and the cognitive template of normality is an easier discrimination to perform than the discrimination between recently experienced perceptions of stiffness which differ according to anatomical variation between spinal levels, seems to be a moot point in the light of the complexity of the issue and the lack of evidence.

A second problem stems from the choice of 'stiffest level' (Millman 1981) as the object of discrimination. In clinical theory the finding of abnormality may involve a broader base of compliance features. These include 'end feel' and soft tissue resistance as well as postero-anterior ROM (Maitland 1977).

A third problem arises from the nature of the therapist sample. The three therapists all had a minimum of four years clinical experience. However, although they had satisfactorily completed more than half of the postgraduate diploma specializing in manual therapy, including the spinal assessment and treatment portion of the course, it may be that the lack of full qualification and post-specialization experience was a factor in their performance.

Allen (1983) conducted a study which attempted to resolve some of the issues raised by Millman's study. Five lumbar levels from each of twelve patients recruited from several clinics were examined. All patients had a history of back pain. Seven patients had symptoms which had persisted over six months. Three physiotherapists with specialist postgraduate qualifications in manual therapy and a minimum of eighteen months of post-specialization experience performed the assessments. Millman's procedure was replicated, but therapists were asked to select which level had the greatest soft tissue resistance, which had the most abnormal 'end-feel', and which had the smallest postero-anterior amplitude of movement. In addition, therapists were required to indicate which of the five levels should be selected for treatment and which of the three indicators of abnormality mentioned above had most influenced their selection.

Allen's data revealed a very high degree of coherence between the specific indicators of abnormality. On over 97% of occasions two or three of these indicators identified the same level as that selected to be 'most abnormal'. Therefore reliability estimates were prepared only for the decision of which level should be selected for treatment. The test-retest agreement rate averaged 47.2%, somewhat higher than Millman's 31%. However, our analysis of the results obtained by these studies did not indicate the improvement to be statistically significant. The intertherapist agreement rate averaged 26.4% on a pairwise basis in Allen's study. This is very similar to Millman's result and not significantly better than the 20% agreement rate which would be expected from a random guess model. The high coherence between specific indicators seems to imply either that abnormal compliance tends to manifest simultaneously through the several parameters, or that therapists tend to be biased towards 'false alarms' of abnormality having found a single abnormal sign from the level in question. The low degree of reliability suggests that the latter explanation should be preferred. Furthermore, since the test-retest reliability indicates that some degree of consistent information was transmitted even though intertherapist agreement was very low, it seems reasonable to hypothesize that therapists make global judgements of abnormality, on perceptual dimensions which are probably not consistent and which may be difficult to verbalize.

The reliability of compliance ratings following PPIVM tests

Passive movement of a 'physiological' type provides another testing approach which may be used for diagnosis or progress evaluation (Maitland 1977, Cyriax 1982).

Kaltenborn and Lindahl (1969) examined the intertherapist reliability of ten therapists during assessment of intervertebral joint mobility. A four-point rating scale consisting of no movement, hypomobility, normal movement and hypermobility was used. Kaltenborn's ratings were used as a criterion for agreement. Each of the therapists independently gave 13 assessments. Their conclusion of 'remarkably good' agreement was not accompanied by a formal analysis. However, the following results were reported: complete agreement from three therapists; 2 disagreements from two therapists; 3 disagreements from one therapist; and 4 or 5 disagreements from the remaining three therapists. This represents an average agreement rate of about 84%.
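The arithmetic behind this average can be sketched from the counts reported above. Two points are assumptions of this sketch rather than statements from the source: the listed counts cover nine therapists (read here as the nine compared against Kaltenborn's criterion ratings), and the '4 or 5' disagreements are bracketed by taking all three therapists at 4 and then all three at 5.

```python
ASSESSMENTS_PER_THERAPIST = 13


def mean_agreement(disagreements):
    """Overall agreement rate with the criterion ratings, given the number
    of disagreements recorded for each therapist."""
    total = ASSESSMENTS_PER_THERAPIST * len(disagreements)
    return 1 - sum(disagreements) / total


# Three therapists with 0, two with 2, one with 3, three with "4 or 5".
lower = mean_agreement([0, 0, 0, 2, 2, 3, 5, 5, 5])  # all three at 5
upper = mean_agreement([0, 0, 0, 2, 2, 3, 4, 4, 4])  # all three at 4

print(f"{lower:.1%} to {upper:.1%}")  # 81.2% to 83.8%
```

On these assumptions the average agreement rate lies between roughly 81% and 84%, depending on how the '4 or 5' cases are resolved.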

Gonnella et al (1982) examined the intertherapist and retest reliability of five therapists employing PPIVM tests on lumbar segments. On each of two days, which were separated by a 13 day interval, each therapist independently evaluated the six segments of five young, nonsymptomatic subjects. Two evaluations, one under 'normal' and one under blindfold conditions, were performed within each session. Forward bending, side bending (left and right) and rotation (left and right) were performed. A seven point rating scale was used, with 'ankylosed' and 'unstable' as the end values. In addition 'plus' and 'minus' qualifiers were permitted, which produced a potential 13-point scale. In practice the scale values employed by the observers were limited to the range 1-4, producing an effective seven-point scale biased towards hypomobility. In fact, the distribution was probably even more restricted because the extreme scale values (1.0 and 4.0) seem to have occurred very infrequently (eg 2% for forward bending, the only test for which sufficient data were available to extract a result). Gonnella et al concluded that 'results on intertherapist reliability were disappointing' (p.442). Although this conclusion is not immediately apparent from their analysis of the data, our reanalysis of the evidence Gonnella et al presented (p.440) confirmed their conclusion. For example, with the forward bending manoeuvre we calculate that intertherapist agreement reached 78% when agreement is defined (Gonnella et al 1982) as ratings differing by less than one full scale unit. However, the agreement rate expected from the chance agreement model is 71%. The high degree of chance agreement is the combined effect of a restricted distribution of mobility together with a definition of agreement which accepts a variation of half a scale value (see Appendix).
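How a restricted distribution and a tolerant definition of agreement inflate chance agreement can be illustrated with a short sketch. The two rating distributions below are hypothetical and are not the distribution observed by Gonnella et al, so the 71% figure is not reproduced here; the function name `chance_agreement_within_tolerance` is likewise an assumption of this example.

```python
def chance_agreement_within_tolerance(values, probs, tolerance=1.0):
    """Expected agreement between two independent raters sharing the same
    rating distribution, where ratings differing by less than `tolerance`
    scale units count as agreement."""
    return sum(p * q
               for v, p in zip(values, probs)
               for w, q in zip(values, probs)
               if abs(v - w) < tolerance)


# Effective seven-point scale (range 1-4 in half-unit steps)
values = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

# Hypothetical distributions of ratings over those scale values
uniform = [1 / 7] * 7                                 # every value used equally
peaked = [0.01, 0.04, 0.25, 0.40, 0.25, 0.04, 0.01]   # extremes rarely used

print(round(chance_agreement_within_tolerance(values, uniform), 2))  # about 0.39
print(round(chance_agreement_within_tolerance(values, peaked), 2))   # about 0.73
```

With ratings bunched in the middle of the scale, two raters responding independently would already 'agree' on most occasions under this definition, which is why an observed rate of 78% carries less weight than it first appears to.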

Thus the PPIVM research literature presented until 1982 a somewhat equivocal overview. One study claimed good results for intertherapist reliability (Kaltenborn and Lindahl 1969), while another found poor results (Gonnella et al 1982). A further problem was that of non-generalizability of findings, either because the evaluation samples included few subjects (Kaltenborn and Lindahl 1969) or nonsymptomatic subjects (Gonnella et al 1982).

Therefore, Clarkson (1982) investigated the intertherapist reliability of four experienced physiotherapists specialized in manual therapy. The test sample comprised ten subjects aged 20-55, all of whom had a history of low back pain. One subject had a radiographically confirmed sacralization of L5. Others included a retired dancer, a footballer and a champion runner. That is, there was an effort to obtain a wide cross-section of test joints. Each therapist independently assessed each vertebral segment from S1 to T12 using the PPIVM technique for forward flexion described by Maitland (1977). Therapists used a five-point scale with the end values being 'ankylosed' and 'hypermobile'. On the average, the pairwise intertherapist agreement rate was 45%. Statistically this was significantly better than the 37% expected to occur from chance agreement. However, from a clinical point of view it does not seem a very encouraging result. When the 'stiff' and 'very stiff' ratings were amalgamated to produce a four-point scale like that of Kaltenborn and Lindahl (1969) the agreement rate became 57%. This seems substantially lower than the 82% obtained by Kaltenborn and Lindahl. The results are also poorer than the 78% agreement rate obtained by Gonnella et al (1982), although the comparison is complicated by differences in the rating scales used.

Further evidence about the reliability of PPIVM is available outside the research literature of physiotherapy. Rotational manoeuvres similar to the techniques employed by physiotherapists are encountered in osteopathy (Johnston 1982). Recently, Johnston et al (1982a) reported on the intertherapist reliability obtained by one osteopathic physician and two student physicians. The tests employed were cervical rotation, cervical sidebending and several trunk motions. The experimental sample comprised 161 volunteers which included 84 students and 71 patients. However the report does not clarify the particular characteristics of the subsamples used to assess the reliability of the different motions. Therapists were required to indicate if resistance to passive motion was symmetrical or asymmetrical for left and right manoeuvres. For cervical rotation the three therapists agreed on 42% of the 43 subjects tested this way. For cervical sidebending they agreed on 33% of 36 subjects. Although these agreement rates appear rather low they were significantly higher than those expected to occur by chance (19% and 14%, respectively). Furthermore, these are three-way agreements rather than pairwise agreement rates as in the other studies reviewed by this section. Unfortunately the report by Johnston et al (1982a) makes extraction of mean pairwise agreement difficult, thereby precluding direct comparisons. In addition the ratings required were somewhat different. Nevertheless, in terms of clinical significance, the results seem rather disappointing, a conclusion shared by Johnston et al (1982a).

In a subsequent study on cervical rotation, Johnston et al (1982b) evaluated intertherapist reliability when only subjects with strong indications of asymmetry were included in the sample. Preselection of subjects was based on agreed examination findings by two faculty osteopaths. Three student therapists then independently examined the subjects. The pairwise agreements for each student with the faculty examiners were 71%, 62% and 57%. While the agreement rate from the first student was significantly higher than expected to occur by chance, this was not the case for the other two sets of ratings. Given the preselected sample of subjects and the restriction of ratings to symmetry or left and right asymmetry, the 63% average agreement rate is disappointing in terms of clinical significance, particularly since cervical rotation seemed the most promising test in the prior study (Johnston et al 1982a).

It may be tempting to dismiss Johnston's low reliability as resulting from therapist inexperience, but Kaltenborn and Lindahl's (1969) 84% agreement rate was based on a group which included a variety of experience. In any case, poor results were also found in studies with experienced therapists (Clarkson 1982, Gonnella et al 1982).

Interpretation of the studies examining PPIVM test reliability is complicated further by the variety of rating scales, subjects and spinal levels used. Furthermore, agreement rates are difficult to compare directly because they are influenced by distributional properties including response base rates, which may vary across studies.

To facilitate comparisons of agreement rates we therefore expressed the results of the above studies in terms of Cohen's kappa (see Appendix). Since kappa expresses the proportion of obtained agreements relative to that expected to occur by chance, it facilitates comparisons across studies which employ different rating scales, test joints, or other methodological features which might alter the statistical properties of the therapists' responses. A second advantage is that it is a correlation-like index, which varies between zero and one (unless observed agreements are less than expected by chance). Using data presented in the published reports, we found kappas of 0.64 for Kaltenborn and Lindahl (1969), 0.37 for Johnston et al (1982b), 0.24 for Gonnella et al (1982) and 0.15 for Clarkson (1982). In general therefore the studies of PPIVM tests do not seem to yield a very good degree of intertherapist reliability, particularly within the framework of clinical requirements. In view of the variety of therapist backgrounds, subjects used (including symptomatic and nonsymptomatic) and other variables, this conclusion probably has good generalizability and concurs with the more recent of previous interpretations (Gonnella et al 1982; Johnston et al 1982).
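As a worked illustration of the index, the sketch below computes kappa from an observed and a chance agreement rate, using the standard definition of Cohen's kappa. The function name `cohens_kappa` is an assumption of this example, and the only figures used are the 78% observed and 71% chance agreement rates quoted above for the forward bending data of Gonnella et al (1982).

```python
def cohens_kappa(p_observed, p_chance):
    """Cohen's kappa: agreement achieved beyond chance, as a proportion of
    the agreement that was attainable beyond chance."""
    return (p_observed - p_chance) / (1 - p_chance)


# Gonnella et al (1982), forward bending: 78% observed, 71% expected by chance
print(round(cohens_kappa(0.78, 0.71), 2))  # 0.24
```

An agreement rate of 78% thus shrinks to a kappa of 0.24 once the high rate of chance agreement is taken into account. The kappas quoted for the other studies depend on chance agreement rates or raw data that are not all stated in this section.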

Reliability of spinal mobility assessment using combined PAIVM and PPIVM tests

In addition to the investigations cited in the previous three subsections which have involved either PPIVM or PAIVM assessment, a number of studies reported in the literature have used combined assessment techniques to rate spinal mobility. Because of the combined nature of the assessment task utilized in these studies, it is not possible to separate the individual reliability of any one of the tests involved. However the studies outlined below provide some insights into therapist performance.

Jull (1978) reported a study which examined the intertherapist reliability of rating the mobility of the upper three cervical joints following PAIVM and PPIVM tests. Each therapist performed 81 tests ranking each joint on a five point scale with the extremes of 'hypermobile' and 'no movement'. A total agreement rate of 88% was claimed, which is highly encouraging. However, a number of methodological issues suggest that this agreement rate should be interpreted with caution. Given the relative infrequency of 'hypermobility' and 'no movement' ratings likely to occur in the population, the effective range of variability may have been somewhat reduced. Unfortunately, no data on the relative frequency of findings in each category were reported. Furthermore, several decisions came from a given spinal segment. This could have introduced further restrictions in the (a priori) subjective range of potential variation. Finally the generalizability of the data is limited by the fact that the smallest sample viable for an intertherapist reliability study was used: two therapists.

In a later report, Jull (1982) provided further evidence of intertherapist reliability for combined PPIVM and PAIVM tests of lumbar segments. Two therapists examined one subject on three successive occasions. The intersession interval was one day. The intertherapist reliability coefficient was 0.35, which has been interpreted to mean that 'examiners correlated highly' (Jull 1982, p.75). Although the result was significantly different from no correlation, in the statistical confidence sense, a reliability coefficient of 0.35 is not high. In fact, the majority of the variance in the observed scores is attributable to error when the coefficient is so low. A similar argument applies to the intersession reliability coefficient reported to be only 0.10.

In a further study, Jull and Lane (1983) published findings related to assessment of lumbar spinal mobility. A subsample of 20 normal subjects from a population of 100 males and 100 females with no history of back pain were examined. Postero-anterior accessory glide and all passive physiological movements were assessed in six intersegmental levels from T12/L1 to L5/S1. Each level was classified on a five point rating scale from 'hypermobile' to 'very stiff'. The retest agreement rate for the single participating therapist was 87.3%. Intertherapist agreement on a subsample of five subjects was reported to be 82.2% between the therapist and an independent observer. Once again, these high agreement rates should be interpreted with caution because limited sample variability will increase the agreement attainable by chance. On the basis of the averaged data published by Jull and Lane (1983) for their full population, we estimate that an agreement rate of 38% could have been typically expected to occur by chance. If the test subsample had consisted of only the younger subjects the chance agreement estimate would have been 61%. Using the chance agreement rate for the whole population we computed a Cohen's kappa for the retest agreements of 0.79 and for the intertherapist agreements of 0.71. Using the 61% estimate, kappa values would have been 0.67 and 0.54 respectively.
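The kappa values quoted for Jull and Lane (1983) can be checked against the usual definition, kappa = (Po - Pc) / (1 - Pc), where Po is the observed agreement rate and Pc the agreement rate expected by chance. The Python sketch below simply reworks that arithmetic from the agreement and chance rates given in the text. The function name `cohen_kappa` is introduced here for illustration only, and the chance rates are the estimates stated above rather than values recomputed from raw data.

```python
def cohen_kappa(observed, chance):
    """Cohen's kappa from observed and chance agreement proportions."""
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance agreement must lie in [0, 1)")
    return (observed - chance) / (1.0 - chance)


# Agreement rates reported by Jull and Lane (1983), as quoted in the text.
retest_agreement = 0.873
intertherapist_agreement = 0.822

# Chance agreement estimates: whole population (0.38), younger subjects (0.61).
for chance in (0.38, 0.61):
    print(
        chance,
        round(cohen_kappa(retest_agreement, chance), 2),
        round(cohen_kappa(intertherapist_agreement, chance), 2),
    )
```

The results agree with the quoted kappas to within about 0.01, the small discrepancy presumably reflecting rounding of the published chance estimates. The sketch also makes the central point concrete: the same observed agreement of 87.3% corresponds to a noticeably lower kappa once the chance rate rises from 38% to 61%.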

The Australian Journal of Physiotherapy. Vol. 31, No. 5, 1985 185

Grant (1980) examined lumbar spinal mobility in groups of dancers and nondancer controls using a number of techniques including passive movement tests. Within the study, two observers performed twenty tests on five subjects rating lumbar levels on a four point scale from 'hypermobile' to 'very stiff' and an interobserver agreement rate of 90% was obtained. The actual frequency distribution of test findings was not included in the report nor was it indicated from which of the experimental groups the subjects were drawn. Therefore we did not proceed to estimate kappa, the more appropriate coefficient.

It should be pointed out that it was not the primary intention of Jull (1978, 1982), Jull and Lane (1983) or Grant (1980) to measure reliability of assessment per se, but only to determine the reliability of the therapists who performed assessments for the various studies. Consequently, the generalizability of these results is in all cases limited by the fact that absolute minimum numbers of therapists were involved in both retest and intertherapist trials. Furthermore it is not clear whether the judgements resulting from the several segments sampled from a given subject were statistically independent. Lack of independence could have artificially raised the estimate of reliability.

Jull and Bogduk (1985) examined the reliability of diagnosis of zygapophyseal joint disorders in a group of twenty patients attending a pain clinic because of cervical pain. A trained therapist stipulated the abnormal cervical level after a full subjective and objective examination, including passive physiological and accessory movements. To provide an objective criterion, medial branch blocks (Bogduk 1985) were used to selectively anaesthetize nerves supplying cervical joints. Perfect agreement between the diagnosis of the therapist and the medial branch block was obtained. A subsample of four subjects was independently examined by another manipulative therapist, with perfect agreement on the abnormal joint. The results of Jull and Bogduk (1985) might suggest that palpatory tests can perfectly diagnose the level to be treated. It should be noted however that the patient sample had severe pain, which was often irritable (Jull and Bogduk 1985, p.163) and that the manual assessment not only included pain reproduction, but also was conducted in the context of other information produced by a full objective and subjective examination. Although the authors claim that the pathological joints had such abnormal compliance features as 'limited range of motion', 'abnormal quality of resistance' and 'abnormal limitation to the movement' (Jull and Bogduk 1985, p.164), they also report that 'reproduction of pain was invariably associated with these abnormal qualities of movement' (p.164). On the basis of our experience with assessment of compliance features (low reliability) and pain (high reliability) an alternative hypothesis is indicated: that provocation and reproduction of pain was the key factor in reliable identification of the injured level. This interpretation seems preferable because it is more parsimonious, being consistent with both our group's results and those of Jull and Bogduk (1985).

IV. Reliability in the Production of Therapeutic Passive Movement

The reliability with which therapeutic movement is produced has received no systematic investigation according to our reviews of the journal literature. The degree of intratherapist or intertherapist variation in production of passive movement is presumably an important factor, at least theoretically, since some descriptions of mobilization techniques do identify various grades and do recommend selective use according to various conditions, eg Maitland (1977). Until systematic empirical studies are conducted to assess the differences in therapeutic outcome due to different grades of mobilization, the actual importance of using selected grades of mobilization, or of the reliability with which they are produced, must remain a problem which is justified only theoretically or through clinical anecdote. Nevertheless, given the broad influence on clinical and educational practice which description of grades of mobilization have attained, the issue seems to require far greater attention than it has received to date.

However, the primary purpose for reporting here two pioneering studies (Banting 1982, Mitchell 1983) conducted in our laboratories on this issue is that the reliability with which selected grades of movement are produced is indirectly related to the reliability with which compliance is assessed. That grades of mobilization are related to assessment of compliance is clear from descriptions of clinical procedures (Maitland 1977). The link was even more explicit in the definitions used by Banting (1982) and Mitchell (1983) when instructing the therapists in their studies. Grade II mobilizations were defined as 'large amplitude movements to the point where R1 is just perceived, at a rate of two to three oscillations per second' (Banting 1982, Mitchell 1983). Grade IV mobilizations were defined as 'a small amplitude movement just up to and touching the end of available joint range' (Mitchell 1983). Again two to three oscillations per second was the recommended oscillation frequency.

To investigate reliability, both studies adopted the strategy of presenting several spinal levels from several individuals thus ensuring a variety of ranges and joint mobilities. The reproducibility of peak force of mobilization can then be examined within the frame of reference provided by the variations due to anatomical and individual differences. When the same levels, from the same subjects are examined, the intertherapist and retest correlations for peak force are then akin to the reliability coefficients for locating

R1 and R2 in range presented by other studies (Baker 1981, Wong 1981, Weeks 1982, Flint 1983), particularly given the explicit definitions used by Banting and Mitchell.

In both studies the force platform technique already described was used to assess the forces of mobilization while therapists performed central PAIVM. The output of the force platform was monitored by computer. This permitted calculation of peak force of mobilization for each oscillation, as well as of oscillation amplitude and frequency by means of the dynamic force measurement technique described earlier. The data on the latter two parameters are important to the wider issue of reproducibility of technique but are less directly relevant to the present theme. They are considered in detail elsewhere (Banting et al 1985).

Banting (1982) examined intertherapist reliability in seven physiotherapists with specialist postgraduate qualifications in manual therapy. The least experienced therapist had more than nine months of clinical practice since completion of the specialist qualification. The sample comprised graduates from schools in three different Australian States. Each therapist mobilized four premarked spinal levels (T11, T9, T7, T5) from each of four subjects using central PAIVM delivered with the pisiform technique. Each level was mobilized for 20 seconds. Among other parameters, the peak forces during a cycle were calculated and averaged for all the cycles of a trial. Scores from the 16 levels mobilized by each therapist were then used to compute pairwise intertherapist correlations. The mean intertherapist correlation was a very poor 0.22. In addition systematic biases were found between the seven therapists when the peak forces were averaged across the 16 spinal levels (Banting 1982). Two therapists showed a 'light touch' (7.6N and 9.8N), three were two to three times more forceful (14.5N, 16.3N, 20.6N) and two showed nine or more times that force (50.2N, 87.1N). An analysis of variance confirmed these differences to be statistically significant (Banting 1982).
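The mean pairwise intertherapist correlation described above can be sketched in a few lines of Python. This is a minimal illustration only: the use of the Pearson coefficient is our assumption about the analysis, the function names `pearson` and `mean_pairwise_r` are hypothetical, and the force values are invented for the example rather than taken from Banting (1982).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_r(scores):
    """Average correlation over all therapist pairs.

    scores maps each therapist to a list of trial-averaged peak forces,
    one per spinal level, in the same level order for every therapist.
    """
    pairs = list(combinations(sorted(scores), 2))
    return sum(pearson(scores[a], scores[b]) for a, b in pairs) / len(pairs)


# Hypothetical mean peak forces (N) for three therapists over four levels.
forces = {
    "A": [8.0, 9.5, 7.0, 10.0],
    "B": [15.0, 14.0, 17.5, 16.0],
    "C": [52.0, 60.0, 47.0, 55.0],
}
print(round(mean_pairwise_r(forces), 2))
```

With seven therapists there are 21 such pairs. Note that a correlation is unaffected by a constant offset between two therapists, which is why systematic differences in mean force have to be examined separately, as in the analysis of variance reported above.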

Mitchell (1983) replicated and extended Banting's study. Subjects were eight experienced physiotherapists with specialist postgraduate qualifications in manual therapy. Each mobilized twenty spinal levels comprising T9, T11, L1, L3 and L5 from one female and three male volunteers with no history of back pain. The same twenty segments were mobilized again one week later. Thus the design assessed both intertherapist and test-retest reliabilities for Grade II and Grade IV movements. In order to maintain comparability all joints were pre-mobilized by the experimenter. Thus all therapists, including the starter, were dealing with previously mobilized spines.

Among other parameters, Mitchell (1983) calculated the peak force for each oscillation. Following the earlier study (Banting 1982), trial averages were computed, from which intertherapist and retest correlations were obtained. Mitchell confirmed that intertherapist reliability for Grade II movements was low (r = 0.25) and showed that this was also the case for Grade IV (r = 0.16). In addition he found poor test-retest reliability for both Grade II (r = 0.22) and Grade IV (r = 0.42).

Systematic biases were also evident in the data. The peak forces for Grade II when averaged over the twenty segments showed an intertherapist range from 2.2N to 46.7N on Day 1. Even the trimmed range, excluding the extreme therapists, was 13.0N to 30.2N. On Day 2 the range was 3.9N to 26.4N. Analysis of variance confirmed that there were significant differences between therapists and between days (Mitchell 1983). Similarly, for Grade IV, the intertherapist range was 150.9N to 329.3N on Day 1 and 89.2N to 222.4N on Day 2. Again analyses of variance confirmed that there were statistically significant differences between therapists and between days (Mitchell 1983).

The studies of Banting and Mitchell relate to those for R1 assessment in the case of Grade II movement peak forces and to those of R2 in the case of Grade IV peak forces. The findings show very good consistency. Thus for intertherapist reliability in locating R1 the comparison figures are 0.30 (Wong 1981), 0.25 (Weeks 1982) and 0.38 (Flint 1983). These confirm the Grade II results (r = 0.22, r = 0.25). The comparison figures for intertherapist reliability in locating R2 are 0.28 (Baker 1981) and 0.24 (Flint 1983), which seem to support Mitchell's Grade IV result (r = 0.16). The poor test-retest correlation obtained by Mitchell for Grade II (r = 0.22) is, if anything, better than the low value obtained by Weeks over the same interval (r = 0.09). Therefore the results obtained by Mitchell and Banting reinforce the conclusion of poor reliability for estimation of spinal compliance during PAIVM.

V. Discussion

An overview of the studies presented above suggests several patterns in the findings (cf also Table 1). In general, pain tests were more reliable than tests assessing features of compliance. This effect was obtained even when very similar testing techniques were used such as when PAIVM was used for both P1 and R1 assessment. A second feature of the results is the excellent reliability obtained with the SLR and FF tests for pain. The correlation coefficients (0.96-0.98) were superior to those obtained by P1 assessment with PAIVM (0.73). These differences are statistically significant. A third aspect is the consistent finding of superior test-retest reliability over intertherapist reliability. This is a common result in most fields of measurement. Before the clinical implications of these findings are considered it is appropriate to discuss some factors which may account for the obtained results.

Table 1: Summary of reliability coefficients

Measure   Test movement   Retest r(K)   Interobserver r(K)   CI (95%)   Source

ROM
PAIVM   .88   Collis-Brown
PAIVM   .86   Grisold
PAIVM   .78   McNeill

R1
PAIVM   .30   Wong
PAIVM   .25   Weeks
PAIVM   .38   Flint
PAIVM   .46   17%   Weeks
PAIVM   .09
PAIVM   .22   Banting
PAIVM   .22   .25   Mitchell

R2
PAIVM   .28   Baker
PAIVM   .42   .16   Mitchell

P1
PAIVM   .73   34%   Collis-Brown
PAIVM   .83   Collis-Brown
PAIVM   .62   McNeill
PAIVM   .75   McNeill
SLR   .96   13.6°   McFarlane
SLR   .98   Puentedura
SLR   .97   Million et al
SLR   .96-.97   .93-.96   Lankhorst et al
SLR   .78   Hoehler
SLR   .95   Hoehler
FF   .98   83mm   Kwong
FF   .99   Munroe
FF   .91   Million et al
FF   .95   .97   Lankhorst et al
FF   .50   Hoehler

Comments

after correction for error in patient report.

after correction for error in patient report.

intrasession
intersession
peak applied force during Grade II
peak applied force during Grade II

peak applied force during Grade IV

therapist location on VAS
measured force at patient report of P1
therapist location on VAS
measured force at patient report of P2

intersession
passive test
active test

skin distraction
intersession, skin distraction
skin distraction

Factors which may account for the superior reliability of pain assessment

Although pain tests showed better reliability than tests of compliance features, there are procedural differences between the pain tests investigated. Therefore the comparison between PAIVM assessment of pain and compliance features is probably the most appropriate for discussion.

In order to locate P1 by PAIVM a physical stimulus is applied. The patient must sense and report pain onset, and the therapist must then relate that event to a point in ROM. In order to locate R1 a similar physical stimulus is applied, the therapist must sense the occurrence of the 'onset of resistance', then relate that event to a point in ROM. For both tests some of the total error will be due to stimulus application and some to the ability to locate a point in ROM. Thus the essential difference between the two judgemental processes is that tests of pain involve only one judgement, that of ROM, while tests of compliance require the judgement of both the compliance feature and ROM. It may appear therefore that the issue is simply a question of which of these contrasting perceptual processes contains more error. However, the quantitative theory of reliability shows clearly that reliability is a function of both error and true score variation (see Appendix). The same amount of error (in metric terms) means poorer reliability if the true score variation is small rather than large.

It is important to note that the low correlations obtained for R1 and R2 are at least in part due to the restricted range of true score variability. R1 tends to be restricted to the lower third of range while R2 tends to be restricted

Table 1 (continued): Summary of reliability coefficients

Measure   Test movement   Retest r(K)   Interobserver r(K)   CI (95%)   Source   Comments

P2
SLR   .96   Puentedura
FF   .98   33mm   Bruce
FF   .97   52mm   Patterson   intersession
FF   .98   45mm   Patterson   intrasession

Compliance
PPIVM   (.64)   Kaltenborn
PPIVM   (.37)   Johnston (1982b)
PPIVM   (.24)   Gonnella et al
PPIVM   (.15)   Clarkson
Mixed   .35   .10   Jull (1982)   combined PPIVM and PAIVM
Mixed   (.67-.79)   (.54-.71)   Jull and Lane   combined PPIVM and PAIVM

Level Selection
PAIVM   (.16)   (.11)   Millman   stiffest level
PAIVM   (.34)   (.08)   Allen   level to be treated
(1.00)   Jull and Bogduk   pathological level; full objective and subjective examination

to the upper third. We re-examined the data of Wong (1981), Weeks (1982) and Flint (1983) to confirm this tendency. The standard deviation of R1 in ROM occupied respectively 8%, 8.3% and 7.9% of scale in the three studies, confirming the restricted variability of R1 in both normal and clinical populations and indicating very good consistency between the three independent studies. In contrast, the standard deviation of P1 was 23.2% of ROM in the Collis-Brown (1982) study. We have applied equation A.21 (see Appendix) to compute what the obtained correlations would have been had the true score variability been the same as that observed by Collis-Brown for P1. The intrasession test-retest correlation of Weeks (0.46) becomes 0.83; the intersession correlation (0.09) becomes 0.25 and the intertherapist correlation obtained by Flint (0.38) becomes 0.76. It seems possible therefore to account for the poorer reliability of compliance feature assessment without suggesting that therapists perceive R1 or R2 more poorly than patients perceive P1 or P2. It seems rather that therapists face a more difficult discrimination problem when attempting to locate R1.
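The effect of restricted variability on a correlation can be illustrated with a short Python sketch. Equation A.21 is not reproduced in this section, so the code assumes a standard correction for restriction of range, r' = rk / sqrt(1 - r^2 + r^2 k^2), where k is the ratio of the target standard deviation to the restricted one. Whether this is identical to equation A.21 is an assumption on our part, and the function name `correct_for_range` is hypothetical. The inputs are the correlations and standard deviations quoted in the text (R1 standard deviations of 8.3% for Weeks and 7.9% for Flint, and 23.2% for P1).

```python
from math import sqrt


def correct_for_range(r, sd_restricted, sd_target):
    """Estimate the correlation expected if variability were sd_target.

    Assumed formula (standard restriction-of-range correction):
        r' = r*k / sqrt(1 - r**2 + r**2 * k**2),  k = sd_target / sd_restricted
    """
    k = sd_target / sd_restricted
    return r * k / sqrt(1.0 - r * r + r * r * k * k)


# (label, obtained correlation, SD of R1 as % of scale), values from the text.
cases = [
    ("Weeks, intrasession retest", 0.46, 8.3),
    ("Weeks, intersession retest", 0.09, 8.3),
    ("Flint, intertherapist", 0.38, 7.9),
]

SD_P1 = 23.2  # SD of P1 as % of ROM (Collis-Brown 1982)

for label, r, sd in cases:
    print(label, round(correct_for_range(r, sd, SD_P1), 2))
```

Under this assumed formula the three adjusted coefficients fall within about 0.01 of the values quoted above (0.83, 0.25 and 0.76), which suggests that the sketch captures the logic of the adjustment even if the Appendix uses a slightly different expression.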

Factors which may account for the inferior reliability of passive intervertebral movement tests of pain

Passive intervertebral tests, whether 'accessory' or 'physiological', invariably yielded poorer reliability coefficients than those of the gross movement tests such as SLR and FF. To avoid the confounding contribution of pain versus compliance assessment, an appropriate comparison available for discussion is between FF or SLR tests of pain versus PAIVM assessment of pain.

In FF or SLR tests a gross 'physiological' movement provides the stimulus for pain elicitation, the patient must then perceive and report pain onset (or similar parameters) and ROM can be recorded via goniometry or measures of relatively large linear displacements. In PAIVM tests a more localized movement is the stimulus for pain elicitation, the patient must perceive and report pain onset (or similar parameters), then the therapist must, through subjective evaluation of ROM, record where the pain occurred.

The issues for discussion therefore seem to be: the reliability of subjective ROM assessment by the therapist versus goniometric or similar methods for ROM assessment; and the reliability of pain elicitation by gross physiological movement versus localized PAIVM.

As might be expected, goniometric assessment is typically reported to show high reliability (Leighton 1955, Myers 1961, Boone et al 1978, Ekstrand et al 1982). However, the reliability of assessing ROM by palpation does not seem to have been previously investigated. Initial evidence that therapists do not introduce a very large amount of error at the stage of locating the P1 report in ROM was obtained by Collis-Brown (1982). His test-retest correlation when based upon force platform data, which does not involve therapist judgement of ROM, was 0.83. When based upon therapist determined data it was 0.73. Thus adding subjective

ROM assessment to the total process did not reduce reliability substantially. The degree of error due to patient report and force measurement technique is represented in the force test-retest correlation (0.83). It is possible to calculate what the test-retest reliability would have been if no error had arisen from these processes (see Appendix). This indirect estimate of test-retest reliability for locating a point in ROM was 0.88.
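The indirect estimate of 0.88 is reproduced by a simple ratio of the two coefficients reported by Collis-Brown (1982): the therapist-based correlation (0.73) divided by the force-based correlation (0.83). The Python sketch below shows that arithmetic. Treating the two error sources as combining multiplicatively is an assumption made here for illustration, the Appendix procedure may differ in detail, and the function name `remove_shared_error` is hypothetical.

```python
def remove_shared_error(r_total, r_shared):
    """Estimate reliability of one stage of a two-stage judgement.

    Assumes (for illustration) that the overall coefficient is the product
    of the coefficients of the two stages, so one stage can be isolated by
    dividing the overall coefficient by that of the other stage.
    """
    if not 0.0 < r_shared <= 1.0:
        raise ValueError("r_shared must lie in (0, 1]")
    return r_total / r_shared


# Collis-Brown (1982): therapist-located P1 (0.73) and force at P1 (0.83).
rom_reliability = remove_shared_error(0.73, 0.83)
print(round(rom_reliability, 2))  # 0.88
```

Read this way, the drop from 0.83 to 0.73 reflects only the additional error introduced when the therapist translates the patient's report into a point in ROM.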

Additional evidence for high intratherapist reliability of ROM assessment was obtained by Grisold (1983). Therapists were asked to palpate end of range of a single lumbar level using the pisiform technique. They were then asked to palpate one, two, three, four, five, six and seven eighths of range in a random order prescribed by the experimenter. This procedure was repeated eight times, varying the order of presentation of point in range each time.

The static force platform technique (see Section I) was used to measure applied force for each of the 64 trials. Average test-retest correlations were 0.86, almost identical to the 0.88 computed from the data of Collis-Brown after correction for variation in patient report.

Nevertheless, although the ability of therapists to locate a point in ROM seems relatively high, particularly in consideration of the difficulty of the task, it is lower than that of goniometric and related techniques (0.96-0.98), thus accounting in part for the lower reliability of P1 assessment through PAIVM. That the assessment of ROM cannot be the full explanation for the superior reliability of the SLR and FF tests is clear from Collis-Brown's (0.83) retest correlations for applied force at P1. This coefficient is analogous to those derived from goniometric measurement during SLR test, or length measurement during FF tests. Our statistical analysis revealed that 0.83 was significantly lower than either 0.98 or 0.96. Thus some of the superiority in reliability exhibited by SLR and FF tests appears attributable to the second factor, ie the way pain is elicited.

Manual application of accessory movement seems to be more susceptible to random error than the application of physiological movement. Our evidence suggests that production of PAIVM is likely to contain significant error in comparison to the limited distribution of R1 and R2 over the ROM (Baker 1981, Wong 1981, Weeks 1982, Flint 1983). Biomechanical studies confirm the difficulty facing the therapists. Punjabe et al (1977) have measured 4mm displacement between lumbar vertebral bodies when forces of about 160N were applied in the anterior direction to the cephalad vertebra in vitro. Collis-Brown (1982) and McNeill (1982) measured maximum forces applied during PAIVM tests of about 350N. It is reasonable to assume that this load is equally distributed between the intervertebral joints on either side of the assessed level. This implies that similar loads (350/2 = 175N) were applied by the therapist to lumbar intervertebral joints during PAIVM as were applied in the in vitro studies of Punjabe et al (1977). Similar intervertebral displacement would therefore be expected in the two cases. The in vitro observations of Punjabe et al have been tentatively confirmed in vivo by Thompson (1983) who developed an apparatus for measuring applied load and relative intervertebral displacement simultaneously. The apparatus consisted of a proof-ring strain gauge through which force was applied centrally to a lumbar vertebra (L3). Two parallel linear-displacement transducers attached to the strain gauge and adjusted to contact the spinous processes of vertebrae immediately above and below the loaded processes were used to measure relative displacement between L2 and L3 and between L3 and L4. Results for three subjects indicated that the caudad joint exhibited more displacement (3-5mm) than the cephalad joint (1-3mm) with applied loads of 250N. Again, if the assumption is made that this load is distributed equally between the intervertebral joints above and below, this represented a force of 125N at each joint. These data suggest that therapists are required to produce very small variations in displacement, sometimes by the application of large forces. Both factors seem conducive to poor performance.

In contrast to the difficulties presented to reliable stimulus production during PAIVM, the procedures of FF and SLR tests seem to be taking advantage of a naturally available system for amplification of joint movements. Anatomical evidence indicates that relatively gross physiological movements will produce very small intervertebral movements. For example, during forward flexion of the trunk, approximately the first 60° is accomplished by spinal structures alone. Farfan (1973) and Allbrook (1957) have shown that approximately 12° of this total is contributed by the L5-S1 joint and a further 12° by the L4-L5 joint. The remaining lumbar joints contribute about 7° each with the remainder distributed over the relatively immobile thoracic vertebrae. It is a commonly held view that for trunk flexion angles less than 60°, the lumbar joints contribute to the total in an amount proportional to their contribution to maximal flexion (although we have been unable to find quantitative evidence which relates to this point). According to this model, as the trunk moves through 5°, the lower lumbar vertebral joints move through an angle of 1° and the higher joints through about 0.5°. At the same time, the shoulders, a distance of 0.5m away from the lumbar vertebral joints, move through an arc length of about 4cm by comparison with the fractions of millimeters displacement at the joints themselves. The amplification effect is quite clear. Similar arguments pertain to structures affected by the SLR.
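The arithmetic behind the amplification argument can be laid out in a brief Python sketch. It takes the proportional model and the figures from the paragraph above (60° of spinal flexion, 12° at each of the two lowest lumbar joints, about 7° at each remaining lumbar joint, shoulders 0.5m from the lumbar joints). The function and variable names are ours, introduced only for illustration.

```python
from math import radians

SPINAL_FLEXION_DEG = 60.0  # flexion attributed to spinal structures alone

# Approximate contribution of each joint to maximal flexion (degrees).
JOINT_SHARE_DEG = {
    "L5-S1": 12.0,
    "L4-L5": 12.0,
    "higher lumbar joint": 7.0,
}


def joint_rotation_deg(trunk_deg, joint_max_deg, total_deg=SPINAL_FLEXION_DEG):
    """Proportional model: a joint moves in proportion to its share of the total."""
    return trunk_deg * joint_max_deg / total_deg


def arc_length_m(radius_m, angle_deg):
    """Arc length swept at a given distance from the centre of rotation."""
    return radius_m * radians(angle_deg)


trunk_movement_deg = 5.0
for joint, share in JOINT_SHARE_DEG.items():
    print(joint, round(joint_rotation_deg(trunk_movement_deg, share), 2), "deg")

# Movement of the shoulders, 0.5 m from the lumbar joints, in centimetres.
print("shoulder arc:", round(arc_length_m(0.5, trunk_movement_deg) * 100, 1), "cm")
```

A 5° trunk movement therefore corresponds to about 1° at the lower lumbar joints and a little over 0.5° at the higher ones, while the shoulders travel roughly 4cm, which is the scale difference the text refers to as amplification.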

The implication is that effects well within the control of the therapist's motor skill (or in the case of active movement tests within the patient's motor skill) would produce quite small changes at the spine, thus improving

the signal to noise ratio of the manoeuvre.

Further possible problems with passive intervertebral movement tests

The argument has already been put that although therapists' ability to locate a point in ROM is reasonably good (r = 0.86-0.88), the narrow range of variation in compliance parameters places particularly high reliability requirements on the therapist, if the measures are to distinguish phenomena of interest. To appreciate the difficulty further, consider the results obtained by Weeks (1982) who demonstrated that an intrasession change of at least 17% of scale would have to occur in R1 for therapists to detect it with 95% confidence. Since R1 in the population probably varies over about a third of the scale according to the data of Baker (1981), Weeks (1982) and Flint (1983), intrasession changes exceeding half of the total range of individual differences in R1 would have to occur for reliable detection by the therapist. It is equivalent to requiring a joint which is in the lower quartile of R1 in the population to change to the upper quartile. This seems a very unlikely proposition. The detection of intersession change, or of absolute location in range for R1 or R2, provides an even bleaker picture.
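The 17% threshold is consistent with a conventional calculation based on the standard error of measurement, as the Python sketch below shows. It uses figures quoted elsewhere in this paper for Weeks (1982): a standard deviation for R1 of 8.3% of scale and an intrasession test-retest correlation of 0.46. That Weeks derived the threshold in exactly this way is an assumption, and the function names are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, r):
    """SEM = SD * sqrt(1 - r), from classical reliability theory."""
    return sd * sqrt(1.0 - r)


def minimal_detectable_change(sd, r, z=1.96):
    """Smallest difference between two measurements detectable at ~95% confidence.

    The factor sqrt(2) allows for error in both the first and second measurement.
    """
    return z * sqrt(2.0) * standard_error_of_measurement(sd, r)


# Weeks (1982): SD of R1 = 8.3% of scale, intrasession retest r = 0.46.
mdc = minimal_detectable_change(sd=8.3, r=0.46)
print(round(mdc, 1), "% of scale")

# Expressed relative to a population range of about one third of the scale.
print(round(mdc / (100.0 / 3.0), 2), "of the range of individual differences")
```

The result is close to 17% of scale, or roughly half of the one-third of the scale over which R1 is thought to vary, in line with the argument made above.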

Another aspect of the judgement task presented to therapists is identification of a specific point within ROM. In assessment of P1, this simply involves judgement of the current point in ROM at the time of patient report of pain. In assessments of compliance features this requires identification of the feature and subsequent estimation of the point in ROM at which this feature occurs. Identification of a feature seems to require that the feature exists in mechanical terms in order to provide a stimulus. It also seems to require that the feature be definable uniquely in terms of the therapist's perceptions of 'joint feel'. The experiments of Banting (1982) and Mitchell (1983), in which therapists were required to perform mobilizations to a particular point in ROM, indicated wide variations in therapists' 'connotations' of R1 and R2, since vastly different forces were utilized to reach the same point in range on the same subject. In the textbook (Maitland, 1977) which established the nomenclature and theory in this field, we have been unable to find a precise operational definition of 'resistance'. Therapists with whom we have discussed this issue have not been able to reach consensus on a definition. A discussion of the distinctions between these definitions and their implications for the construction of the movement diagram is, however, beyond the scope of this review.

In the studies reported here specific features of pain or compliance were recorded on the two dimensional movement diagram. This two dimensional VAS helps clarify the therapist's assessment task and is recommended for summarizing and communicating clinical descriptions (Maitland 1977). It is of interest to examine the demands it makes upon the therapist. For example, the horizontal axis, which scales ROM, is defined by Maitland to represent 'any range of movement from the starting position at A to the limit of normal range at B. It makes no difference whether the movement depicted is small or large . . . Point B is always constant and always at the extreme of normal average range of passive movement' (Maitland 1977, p.317). This definition shows clearly that the therapist is not merely required to respond on a psychophysical scale according to current sensory input, a difficult enough task under the circumstances, but also has to make that scale relative to 'normal average range of movement'.

Several problems may be seen to arise from defining the scale relative to normal average range. First, the therapist is required to alter the scale in relation to past experience. This is likely to introduce a variety of biases (Kahneman et al 1982, Slovic et al 1977). Second, the therapist is apparently required to store many models of normality, since a different model will be required for different joints, different movements and perhaps other subsets as well, such as those generated by gender or age. This requirement places an even larger burden on memory. Third, the parameter for mental modelling is 'average normal range'. This seems rather vague, particularly since it requires statistical interpretation from the observer. Human intuitive perception of the statistical parameters of data samples suffers from several biases (Slovic et al 1977, Kahneman et al 1982). All of these factors are likely to increase the error of scaling. Nowhere in the clinical literature have we been able to discover evidence that therapists can in fact cope with such complexity of judgement. Our data, which consistently returned very poor intertherapist correlations, suggest that the task is too difficult.

A final problem, at least for the central PAIVM data reported above, may be seen to arise from the sensory information afforded by the technique. An essential value of passive movements to clinical theory seems to lie in the highly localized nature of their probing. As such the ROM of interest would appear to be only that which is relative to adjacent structures, rather than the overall movement through space described by the segment tested. However, consider the following statement from Maitland (1977, p.34): 'If the pressure is applied as a single slow pressure, the vertebral movement will not be appreciated at all; if it is applied too quickly it can only be interpreted as shaking. However, if the pressure is then relaxed and reapplied and repeated two or three times a second, the amount of movement which can take place will be readily appreciated'. As this statement indicates, the perception of relative movement relies not on direct sensation of displacement but on perception of phenomena which are not uniquely determined by relative displacement. As such, the movement diagram seems to place a burden of complex and undefined biomechanical


interpretation on the therapist, which will be conducive to the introduction of error. In fact when direct manual sensation of displacement relative to adjacent segments is not concomitantly undertaken, the situation we have usually observed to be the case during PAIVM tests, the movement diagram borders on being a biomechanical non sequitur to PAIVM. An alternative VAS more directly defined in terms of the sensory experience of the performing therapist may be preferable.

Clinical implications

As neither of us is trained with a clinical background in manual therapy, we wish to confine our comments to a series of questions which the psychometric and biomechanical evidence presented above seems to raise.

Tests of pain have generally been considered most important in the assessment procedure (Maitland 1977). The reviewed results suggest good to excellent reliability for this aspect of assessment.

However, the poor reliability shown by tests of vertebral compliance during passive movement raises several questions about their role in clinical practice. Presumably one of the major virtues of passive movement tests of compliance is that they help localize the pathology. However, are there no satisfactory substitutes for achieving this goal? It is not yet clear that these tests would be required, even if reliable, given the plethora of other case information, together with epidemiological knowledge. Jull and Bogduk's (1985) results, interpreted in the context of those reported here, suggest that pain reproduction will very reliably select the level to be treated. How often do joint conditions present in the absence of pain? Furthermore, is precision in selection of level to be treated necessary? If there is no adverse effect associated with intervention at inappropriate levels, the additional resource cost involved would appear to be marginal, thus permitting a 'fail-safe' strategy to locality of intervention.

Another role for passive movement tests of compliance seems to be to aid in the selection of a direction and grade of movement. Again the question arises, could this decision be made on the basis of the other information? Furthermore, the research literature has yet to demonstrate that the grade of movement (in the respects defined by compliance features) selected is critical to clinical outcome. In any case, both inter- and intratherapist reliability in application of movement grades was demonstrably poor. Could the mobilization procedure be made to be more reliant on patient comfort and particularly patient feedback rather than on manual reassessment following treatment? If so, there is ample literature in the experimental psychology of motor skills which suggests that performance with feedback tends to be superior (Sage 1977). Perhaps feedback-based treatment, utilizing pain report as feedback, is the de facto modus operandi and the intertherapist unreliability in the absence of pain merely confirms this.

A third role which might be attributed to passive movement tests of compliance is to evaluate progress. If reliability is the criterion for selecting tests of progress, then the evidence presented indicates clearly superior alternatives. The objection may be raised that localized compliance changes must be uniquely traced. However, the case that compliance changes per se are pathological or uniquely related to pathology has yet to be definitively outlined in the research literature.

A final role which might be attributed to passive movement tests of compliance in clinical decision strategy is that of confirmatory tests. A confirmatory test is undertaken to reassure that a decision taken on another test is adequate. This is a common, but often misused, clinical strategy. If test A correctly predicts a criterion variable (eg pathology of a given type) on 80% of occasions and if test B does likewise, then the final probability of a 'confirmed' decision which is also a correct decision is actually 64%! This arises because 'confirmation' implies that both tests yield the same prediction, thereby invoking the multiplicative law of contingent probability. On 4% of occasions the tests will confirm each other, but be simultaneously wrong (0.20 x 0.20 = 0.04). On 16% of occasions test A will be correct, but test B will disagree (0.80 x 0.20 = 0.16) and on another 16% vice-versa, making a total of 32% of occasions containing difficult disagreements. These figures deteriorate if one of these two tests should have a lower percentage of valid predictions.
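The arithmetic in the preceding paragraph can be reproduced with a short sketch. It assumes, as the multiplicative rule used there does, that the two tests err independently, and that the prediction is binary so that two wrong tests necessarily agree with each other. The function name `confirmation_outcomes` and the dictionary keys are hypothetical labels introduced here for illustration only.

```python
def confirmation_outcomes(p_a, p_b):
    """Joint outcome probabilities for two tests with accuracies p_a and p_b.

    Assumption (not from the studies reviewed): the tests err independently
    and the criterion is binary, so two wrong tests give the same prediction.
    """
    for p in (p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("accuracies must lie between 0 and 1")

    a_right_b_wrong = p_a * (1.0 - p_b)
    b_right_a_wrong = (1.0 - p_a) * p_b
    return {
        # Both tests agree and both are correct.
        "confirmed_and_correct": p_a * p_b,
        # Both tests agree but both are wrong.
        "confirmed_but_wrong": (1.0 - p_a) * (1.0 - p_b),
        # The tests disagree, whichever one happens to be right.
        "disagreement": a_right_b_wrong + b_right_a_wrong,
    }


if __name__ == "__main__":
    for accuracies in [(0.80, 0.80), (0.80, 0.70)]:
        outcome = confirmation_outcomes(*accuracies)
        print(accuracies, {k: round(v, 2) for k, v in outcome.items()})
```

With both accuracies at 0.80 the function returns the 64%, 4% and 32% figures quoted above. Lowering one accuracy to 0.70 (an arbitrary illustrative value) reduces the confirmed-and-correct share to 56% and raises disagreements to 38%.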

In conclusion therefore, the obtained results suggest that the assessment role of passive movement tests of compliance be seriously reconsidered, particularly PAIVM in its present form. If a case can be made that unique, essential information is provided by the passive assessment of compliance, and we reiterate that such a case has not yet been made in accordance with the rigors of empirical science, then it would seem that new methods of testing must be developed which achieve that purpose.

Present limitations and future directions

The conclusions drawn in this review must be understood in the light of the limitations imposed by the methodology of the studies providing the evidence for these conclusions. In particular, since we are reporting on an incomplete series of small studies, which of necessity must be limited in their sampling, several issues require discussion.

The number of therapists investigated in any one study was typically small. However most issues were addressed by more than one study and consistent results were obtained. Some of the studies reviewed here involved student manual therapists who had varying degrees of clinical experience in physiotherapy practice but who had not yet completed their specialist programme in manual therapy. They had


however completed and passed the unit relevant to the particular procedures assessed. Several points can be put forward to argue the case that poor results were not the effect of therapist inadequacy or inexperience. Firstly, it might be argued that student therapists had recently completed a period of very intensive clinical training and were in fact likely to perform better than practising therapists who used some of these techniques less frequently. Secondly, when studies utilizing student therapists were replicated with experienced therapists, no significant differences in results were obtained. Thirdly, motor learning research suggests that when learning occurs in the absence of exteroceptive feedback, variability about some mean performance is reduced but the average performance remains unchanged (Gibson 1969). It is conceivable that upon completion of a period of formal training the therapists no longer receive information about the correctness of skills employed in practice from a common source and are therefore continuing to learn in the absence of shared feedback. We might therefore expect improvements in test-retest reliability in more experienced therapists. However, because their experience may have been individualized, it is possible that there will be no change, or even a deterioration, in intertherapist reliability. Finally, we could argue that the therapists selected represent a cross section of practising therapists and therefore represent the general level of therapeutic skills. We are unaware of factors which could have biased the samples toward the 'poor' therapists; in fact, in some cases, efforts were made to involve the more respected and established members of the therapeutic community.

In addition to 'type of therapist' other variables were sampled. These include anatomical location and assessment technique. Although cervical and thoracic segments were sampled in some studies the lumbar segments were observed much more frequently. The data from non-lumbar segments collected so far does not suggest that significantly better results for PAIVM tests of compliance will be obtained in these segments. Finally, it should be clear that since all studies reported are about spinal joints, no statement can be made about reliability in the assessment of peripheral joints. Clearly this aspect requires further investigation, particularly because substantial differences exist between spinal and peripheral joint assessment. For example, in peripheral joints goniometry is more readily applicable with current techniques. Furthermore, a contralateral joint is available for simple comparison in peripheral joints. Contralateral comparisons in spinal joints, when they are appropriate, seem rather more complex because both joints belong to the affected level.

In the assessment of spinal-joint compliance a number of interesting reliability comparisons remain to be conducted. The PAIVM data collected to date is limited to central PAIVM. The reliability of unilateral PAIVM seems deserving of investigation since, among other differences to central PAIVM, a contralateral comparison of sorts is available. In addition most of the evidence collected so far relates to PAIVM technique. The reliability of PPIVM tests, particularly in the cervical spine, also seems to deserve further investigation. In PPIVM the stimulus movement at the spine may be more controllable than in PAIVM because of the mechanical advantage argument invoked above in the discussion of the superior reliability of SLR and FF tests of pain. This could be particularly so for cervical movement where the therapist has a more manageable structure than the trunk. Furthermore, unlike PAIVM tests, during PPIVM tests the therapist is required to directly palpate the relative movement of adjoining segments in addition to sensing the force required to produce that movement.

A number of lines of research are also suggested by the results obtained. For example, we have already mentioned that the poor reliability of passive assessment of compliance features indicates that their contribution to the overall clinical decision process should be carefully assessed. Our group has taken some initial steps in that direction (Cunningham 1982, Walker 1984). If compliance assessment proves in the future to be essential to clinical decision-making, more reliable tests will need to be developed. It may be necessary to develop instrumented approaches to this problem. Thompson's study (1983) is a first step in that direction in our laboratories. In any case, such instrumentation will be required if adequate surveys of spinal joint compliance are to be completed in order to provide the normative data currently missing from the scientific literature of manual therapy. The poor reliability found for production of selected grades of movement further strengthens the requirement to investigate the dependence of clinical outcome on particular grades of mobilization, a requirement initially posed by the apparent absence of formal study on this central issue in some approaches to mobilization.

Manual therapy, at this point in its development, is in the position of having developed to the stage of a complex clinical theory well in advance of a sound base of verifiable, empirical data. It should be clear from the foregoing that even within the narrow aims selected by the respective investigators a great deal remains to be done. It is our hope that the above will prove to be a seminal contribution in the field of clinical arthrometrics.

Acknowledgement

We are indebted to our clinician colleagues who have been a constant source of inspiration and feedback; to the many therapists who have so generously given their time, skills and other resources; and above all to our students, without whose work this would not have been possible. We hope they will continue to confront the unknown with curiosity, rationality and constructive hard work.


References

Allbrook D (1957), Movements of the lumbar spinal column, The Journal of Bone and Joint Surgery, 39B, 339-345.

Allen D (1983), The reliability of determining the most abnormal lumbar joint using passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Anderson JAD and Sweetman BJ (1975), A combined flexi-rule/hydrogoniometer for measurement of lumbar spine and its sagittal movement, Rheumatology and Rehabilitation, 14, 173-179.

Bach TM (1985), An indirect method for measuring forces applied during therapeutic intervention and assessment techniques. Unpublished manuscript.

Baker M (1981), Interobserver reliability of range impairing stiffness ratings obtained from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Banting J (1982), Intertherapist reliability in the performance of a grade II mobilization movement. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Banting JB, Mitchell WN, Bach TM and Matyas TA (1985), Reliability in the execution of selected grades of mobilization in manual therapy. Unpublished manuscript.

Boone DC, Azen SP, Lin CM, Spence C, Baron C and Lee L (1978), Reliability of goniometric measurement, Physical Therapy, 58, 1355-1360.

Breig A and Troup JDG (1979), Biomechanical considerations in the straight-leg-raising test, Spine, 4, 242-250.

Bogduk N (1985), A scientific approach to cervical diagnosis, Proceedings of the Australian Physiotherapy Association Conference, Brisbane.

Bruce P (1981), The test-retest reliability of physiological movement as a method of assessing range of movement to the level of pain tolerance. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Clarkson M (1982), Intertherapist reliability in assessing stiffness ratings in the lumbar spine obtained from passive physiological intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Cohen J (1960), A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20, 37-46.

Collis-Brown GL (1982), Test retest reliability of pain onset ratings obtained from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Cunningham G (1982), Clinical decision making in manipulative therapy: the effect of antecedent information on palpation findings. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Cyriax J (1982), Textbook of Orthopaedic Medicine, Vol. 1, (9th ed.), Bailliere Tindall, London.

DePalma AF and Rothman RH (1970), The Intervertebral Disc, WB Saunders, Philadelphia.

Edwards AL (1964), Statistics for the Behavioural Sciences, Holt, Rinehart and Winston, New York.

Ekstrand J, Wiktorsson M, Oberg B and Gillquist J (1982), Lower extremity goniometric measurements: a study to determine their reliability, Archives of Physical Medicine and Rehabilitation, 63, 171-175.

Farfan HF (1973), Mechanical Disorders of the Low Back, Lea and Febiger, Philadelphia.

Fleiss JL, Cohen J and Everitt BS (1969), Large sample standard errors of kappa and weighted kappa, Psychological Bulletin, 72, 323-327.

Flint R (1983), Intertherapist reliability for the assessment of joint measurement behavior by means of passive accessory intervertebral movements (PAIVMs). Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Gibson E (1969), Principles of Perceptual Learning and Development, Appleton-Century-Crofts, New York.

Goddard MD and Reid JD (1965), Movements induced by straight leg raising in the lumbo-sacral roots, nerves and plexus, and the intrapelvic section of the sciatic nerve, Journal of Neurology, Neurosurgery and Psychiatry, 28, 12-18.

Gonnella C, Paris SV and Kutner M (1982), Reliability in evaluating passive intervertebral motion, Physical Therapy, 62, 436-444.

Grant R (1980), Lumbar sagittal mobility in hypermobile individuals, Proceedings of the Manipulative Therapy Association of Australia, Adelaide.

Grisold PM (1983), Estimating range of movement from passive accessory intervertebral movements; the nature of the scale and the reliability of performance. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Guilford JP (1954), Psychometric Methods, McGraw-Hill, New York, chs 13, 14.

Hanley EN, Matteri RE and Frymoyer JW (1976), Accurate roentgenographic determination of lumbar flexion-extension, Clinical Orthopaedics and Related Research, 115, 145-148.

Hart FD, Strickland D and Cliffe P (1974), Measurement of spinal mobility, Annals of Rheumatic Diseases, 33, 136-139.

Hartman DP (1977), Considerations in the choice of inter-observer reliability estimates, Journal of Applied Behavior Analysis, 10, 103-116.

Hoehler FK and Tobis JS (1982), Low back pain and its treatment by spinal manipulation: measures of flexibility and asymmetry, Rheumatology and Rehabilitation, 21, 21-26.

Hollenbeck AR (1978), Problems of reliability in observational research, in GP Sackett (Ed.), Observing Behaviour, Vol 2: Data Collection and Analysis Methods, University Park Press, Baltimore.

Hubert L (1977), Kappa revisited, Psychological Bulletin, 84, 289-297.

Johnston WL (1982), Passive gross motion testing: Part I. Its role in physical examination, Journal of the American Osteopathic Association, 81, 298-303.

Johnston WL, Elkiss ML, Marino RV and Blum GA (1982a), Passive gross motion testing: Part II. A study of interexaminer agreement, Journal of the American Osteopathic Association, 81, 304-308.

Johnston WL, Beal MC, Blum GA, Hendra JL, Neff DR and Rosen ME (1982b), Passive gross motion testing: Part III. Examiner agreement on selected subjects, Journal of the American Osteopathic Association, 81, 309-313.

Jull G (1978), Clinical observations of upper cervical mobility, Proceedings of the Inaugural Congress of the Manipulative Therapy Association of Australia, Sydney.

Jull G (1982), Passive intervertebral movements of the lumbar spine, in Toward a better understanding of spinal pain, Proceedings of the Manipulative Therapy Association of Australia Annual Conference, Brisbane.

Jull GA and Lane MB (1983), Aspects of lumbar spine mobility in a normal population, in KD Bower (Ed.), International Conference on Manipulative Therapy Proceedings, Perth.

Jull GA and Bogduk N (1985), Manual examination: An objective test of cervical joint dysfunction, Proceedings of the Australian Physiotherapy Association Conference, Brisbane.

Kahneman D, Slovic P and Tversky A (Eds) (1982), Judgement Under Uncertainty: Heuristics and Biases, Cambridge University Press, Cambridge.

Kaltenborn F and Lindahl O (1969), Reproducibility of the results of manual mobility testing of specific intervertebral segments, Lakartidningen (Swedish Medical Journal), 66, 962-965.

Kapandji IA (1974), The Physiology of the Joints, Vol 3, (2nd ed.), Livingstone, Edinburgh.

Kwong HF (1981), Test-retest reliability of pain onset assessed by active 'physiological' movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Lankhorst GJ, Van de Stadt RJ, Vogelaar TW, Van der Korst JK and Prevo AJH (1982), Objectivity and repeatability of measurements in low back pain, Scandinavian Journal of Rehabilitative Medicine, 14, 21-26.

Leighton JR (1955), Instrument and technic for measurement of range of joint motion, Archives of Physical Medicine and Rehabilitation, 36, 571-578.

Loebl WY (1967), Measurement of spinal posture and range of spinal movement, Annals of Physical Medicine, 9, 103-110.

Macfarlane A (1981), Test-retest reliability of straight leg raise as determined by pain onset. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Macrae IF and Wright V (1969), Measurement of back movement, Annals of Rheumatic Diseases, 28, 584-589.

Maitland GD (1977), Vertebral Manipulation, (4th ed.), Butterworth, London.

McNeill KE (1982), Intertherapist reliability of pain onset from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Meyers LS and Grossen NE (1974), Behavioral Research: Theory, Procedure, Design, WH Freeman and Co., San Francisco, 164-166.

Million R, Hall W, Haavik Nilsen K, Baker RD and Jayson MIV (1982), Assessment of the progress of the back pain patient, Spine, 7, 204-212.

Millman AJ (1981), Test-retest reliability of range impairing stiffness ratings obtained from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Mitchell WN (1983), Reliability in the performance of Grade II and Grade IV mobilizations. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Moll J and Wright V (1976), Measurement of spinal movement, in M Jayson (Ed.), The Lumbar Spine and Back Pain, Sector, London.

Moran HM, Hall MA, Barr A and Ansell BM (1979), Spinal mobility in the adolescent, Rheumatology and Rehabilitation, 18, 181-185.

Munro R (1983), The contribution of pain versus memory for position and exteroceptive feedback in the forward flexion test. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Murphy RW (1977), Nerve roots and spinal nerves in degenerative disc disease, Clinical Orthopaedics and Related Research, 129, 46-57.

Myers H (1961), Range of motion: Part I - introductory review of literature, Physical Therapy Reviews, 29, 195-205.

Nunally JC (1978), Psychometric Theory, (2nd ed.), McGraw-Hill, New York.

O'Keefe PJ (1981), The spinal compliance test. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Patterson S (1982), The test-retest reliability of lumbar flexion when limited by pain. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Puentedura L (1983), The effects of trunk position on straight leg raise in normal subjects. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Panjabi MM, Krag MH, White AA and Southwick WO (1977), Effect of preload on load displacement curves of the lumbar spine, Orthopaedic Clinics of North America, 8, 181-192.

Reynolds PM (1975), Measurement of spinal mobility: a comparison of three methods, Rheumatology and Rehabilitation, 14, 180-185.

Sage GH (1977), Introduction to Motor Behavior: A Neuropsychological Approach, (2nd ed.), Addison Wesley, Reading, Massachusetts, ch 20.

Slovic P, Fischhoff B and Lichtenstein S (1977), Behavioral decision theory, Annual Review of Psychology, 28, 1-39.

Stoddard A (1980), Manual of Osteopathic Technique, (3rd ed.), Hutchinson, London.

Thompson R (1983), Measurement of relative intervertebral displacement in the lumbar spine during application of a PAIVM. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Troup JDG, Hood CA and Chapman AE (1967), Measurements of the sagittal mobility of the lumbar spine and hips, Annals of Physical Medicine, 9, 308.

Twomey LT and Taylor JF (1979), A description of two new instruments for measuring the ranges of sagittal and horizontal plane motions in the lumbar region, Australian Journal of Physiotherapy, 25, 201-203.

Van Adrichem JA and Van Der Korst JK (1973), Assessment of the flexibility of the lumbar spine, Scandinavian Journal of Rheumatology, 2, 87-91.

Walker D (1984), A survey of treatment selection and subjective certainty at different stages of clinical assessment. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Weeks PM (1982), Test-retest reliability of stiffness onset using passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Wong M (1981), Interobserver reliability of stiffness onset ratings obtained from passive accessory intervertebral movements. Unpublished Postgraduate Diploma Dissertation, Lincoln Institute of Health Sciences, School of Physiotherapy, Melbourne.

Appendix

Reliability theory is a highly developed field with ample presentation of its concepts (Guilford 1954, Edwards 1964, Nunally 1978). This appendix will only review selected issues of interest to a number of the studies reported in this review. A knowledge of basic statistical theory (mean, variance, correlation, statistical inference) is assumed in the following discussion.

The reliability of a measurement process refers to the dependability, or reproducibility, of observed scores when these are obtained from measurements of the same events. Reliability classically relates the extent to which observed scores represent the true values of the events measured. Equation (A.1), where $X_o$ = observed score, $X_t$ = true score and $E$ = error component, shows that the observed value can be represented as being partly composed of true quantity and partly error.

$$X_o = X_t + E \tag{A.1}$$

If $X_t$ is known the discrepancy of $X_o$ readily quantifies the error. The larger $E$ is the more unreliable is the observation.

Two patterns of error can occur: systematic error, such that $E$ is constant; and random error, such that $E$ is unpredictably variable from measurement to measurement. If the error is constant, then observed scores will be the same across several measurements of a given true value. The instrument is therefore not considered unreliable under these circumstances. Constant error does affect the truth of the absolute value, but the difference between two observed scores will be equal to the true score difference. However, if the error is random, measurements will vary unpredictably even when the same true value is under observation. The quantitative theory of reliability is concerned therefore with random error.

Since $E$ may vary from one occasion of measurement to the next, a consequent problem is how to summarize the 'typical' size of $E$. Furthermore, the interest usually lies in describing how reliable an observation process is for a variety of objects which lie on a common dimension, rather than in describing the reliability for measuring only one object. This also requires the definition of a method for indexing the 'typical' value of error. Thus in estimating error, a sample of values is usually generated. Hence the issue of 'typical' error is a problem in sampling theory and the associated descriptive statistics.

If a sample consisting of one measurement of several objects is taken, then each score could be expressed as a deviation from the sample mean rather than in raw score units. Equation (A.2) then follows from (A.1):

$$x_o = x_t + e \tag{A.2}$$

where $x_o = X_o - \bar{X}_o$, the deviation of the observed raw score from the mean of the observed scores; $x_t = X_t - \bar{X}_t$, the deviation of the true score from the mean of the true scores; and $e = E - \bar{E}$, the deviation of the error component from the mean of the error components.

Since the problem is to obtain a measure of 'typical' amount of random error, the deviation scores could be averaged over the sample. However, if error is random, there will be just as much positive deviation as negative deviation, yielding a misleading average of zero. To overcome this, statisticians deal with squared deviation, which has the effect of removing the algebraic sign. The mean squared deviation score will not average to zero. In deviation score units the average may be obtained as follows:

$$x_o^2 = (x_t + e)^2 \tag{A.3}$$

That is,

$$x_o^2 = x_t^2 + e^2 + 2x_t e \tag{A.4}$$

Summing over the sample and dividing by the number of cases yields the averages:

$$\frac{\sum x_o^2}{n} = \frac{\sum (x_t^2 + e^2 + 2x_t e)}{n} = \frac{\sum x_t^2}{n} + \frac{\sum e^2}{n} + \frac{\sum 2x_t e}{n} \tag{A.5}$$

Since the error is randomly positive and negative in equal quantity, over the total sample $\sum 2x_t e / n$ will tend to be zero, as in the earlier argument.

Therefore,

$$\frac{\sum x_o^2}{n} = \frac{\sum x_t^2}{n} + \frac{\sum e^2}{n}$$

The average squared deviation from the mean is known as the variance, or $s^2$. Therefore, the variance of observed scores is composed of true variance plus error variance. Hence,

$$s_o^2 = s_t^2 + s_e^2 \tag{A.6}$$

Since $s_e^2$ is the amount of squared error (in deviation units) per case it seems to be an adequate measure of 'typical' error.

However, there are several drawbacks to using $s_e^2$ as the sole index of reliability. One is that the units of error are squared, which makes interpretation awkward. This is easily resolved by defining the square root of $s_e^2$ to be the 'standard error of measurement':

$$s_e = \sqrt{s_e^2} \tag{A.7}$$

This index is more readily interpreted and is commonly cited.

Frequently the interest lies in measuring change from one occasion to another. In these situations each of the two measurements will introduce some error. If $d_o$ is the observed difference score in deviation units, $x_o'$ is the observed deviation score on the second occasion and $e'$ is the random error in deviation units, then it follows from (A.2) that:

$$d_o = x_o - x_o' = (x_t - x_t') + (e - e')$$

If the true difference in deviation units is $d_t = (x_t - x_t')$ then:

$$d_o^2 = [d_t + (e - e')]^2 = d_t^2 + (e - e')^2 + 2d_t(e - e')$$

$$\therefore \sum d_o^2 = \sum d_t^2 + \sum (e - e')^2 + \sum 2d_t(e - e')$$

Note that $\sum (e - e') = \sum e - \sum e'$. Since both $e$ and $e'$ are random within (and between) the measurement samples, $\sum e = \sum e' = 0$ and $\sum (e - e') = 0$ (within the limits of sampling error). Because the errors are also unrelated to the true differences $d_t$, the cross-product $\sum 2d_t(e - e')$ will tend to zero, following the earlier argument for $\sum 2x_t e$. Thus $\sum 2d_t(e - e') = 0$ and:

$$\sum d_o^2 = \sum d_t^2 + \sum (e - e')^2 = \sum d_t^2 + \left(\sum e^2 + \sum e'^2 - \sum 2ee'\right)$$

Again, since $e$ and $e'$ are random, the positive and negative components will be equal (within the limits of sampling error). Thus $\sum 2ee' = 0$ and:

$$\sum d_o^2 = \sum d_t^2 + \sum e^2 + \sum e'^2$$

If $n$ is the number of pairs of observed scores, then:

$$\frac{\sum d_o^2}{n} = \frac{\sum d_t^2}{n} + \left(\frac{\sum e^2}{n} + \frac{\sum e'^2}{n}\right)$$

$$\therefore s_{d_o}^2 = s_{d_t}^2 + s_e^2 + s_{e'}^2$$

Consequently the error of measuring change will be largerthan the error for measuring on either occasion. The stand­ard error of measuring changes (Se dill) will be:

(A. 10)se dill = .Js.; + s;,The standard error of measurement however is a measure

of 'typical' error. The error will sometimes be less, some­times more. Most often, it is assumed that error is variablein both direction and magnitude, with small errors moreprobably than large errors. Although situations may arisewhere other assumptions are better, it is unusual to imaginethat the errors around a true value are normally distributed.Thus the mean of a sample of observed values of the sameevent will be the best estimate of the event's true score. Ifthe errors around this true value follow the assumed normaldistribution it is possible to calculate over what range somespecified proportion of observed values will fall. This sta­tistic is known as the confidence interval (eI):

(A.9)

(A.8)
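The quantities derived above can be checked numerically. The following is a minimal sketch and not part of the original appendix: the deviation scores are invented so that the true and error components are exactly uncorrelated, the function names are hypothetical, and the confidence interval is assumed to take the usual normal-theory form (a score plus or minus $Z_\alpha$ times $s_e$), which is consistent with the definitions given here but is an assumption.

```python
import math
from statistics import NormalDist, pvariance

# Invented deviation scores: the true component and the random error each
# sum to zero and are exactly uncorrelated, so (A.8) holds without
# sampling error.
true_dev = [-2.0, -2.0, 2.0, 2.0]
error_dev = [-1.0, 1.0, -1.0, 1.0]
observed_dev = [t + e for t, e in zip(true_dev, error_dev)]

# Variances as mean squared deviations (dividing by n, as in the text).
var_true = pvariance(true_dev)          # s_t^2
var_error = pvariance(error_dev)        # s_e^2
var_observed = pvariance(observed_dev)  # s_o^2


def standard_error(error_variance):
    """(A.9): the standard error of measurement is the root of s_e^2."""
    return math.sqrt(error_variance)


def standard_error_of_change(error_var_first, error_var_second):
    """(A.10): the error variances of the two occasions add."""
    return math.sqrt(error_var_first + error_var_second)


def confidence_interval(score, se, level=0.95):
    """ASSUMED form: score +/- Z * se for a two-sided normal interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return score - z * se, score + z * se


se = standard_error(var_error)
se_diff = standard_error_of_change(var_error, var_error)
low, high = confidence_interval(50.0, se)  # 50.0 is an arbitrary score

print(var_observed, var_true + var_error)  # the two sides of (A.8)
print(se, round(se_diff, 3))
print(round(low, 2), round(high, 2))
```

With equal error on both occasions the standard error of change is about 1.41 times the single-occasion value, which is why change scores are less dependable than either measurement alone.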


(A.11)

where $(1 - \alpha)$ is the confidence level and $Z_\alpha$ is the appropriate value from the normal distribution. An analogous equation can be written for difference scores by substituting $s_{e\,\mathrm{diff}}$ for $s_e$. The virtue of transforming a standard error into confidence intervals is that it acknowledges the error to be variable and permits calculation of the proportions of observations which will occur within some given error range, or vice versa. It thus more completely models the error of measurement.

Another drawback to both $s_e^2$ and $s_e$ is that they are metric-bound indexes. That is, standard errors of various measures are not readily comparable: different approaches to measurement must often be compared; measurement units are sometimes arbitrary; the comparative reliability of measurement in different fields is an issue at times. In these cases a unit-free index of reliability is preferable. Percentages or proportions are often used to resolve such a problem. From (A.8) it follows that:

$\frac{s_t^2}{s_o^2} + \frac{s_e^2}{s_o^2} = 1$ (A.12)

Thus $s_t^2/s_o^2$ is the proportion of observed score variance due to true score variance and $s_e^2/s_o^2$ is the proportion due to error score variance. The former may be defined as a coefficient of reliability. For a perfectly errorless measurement method $s_e^2 = 0$. Thus $s_t^2 = s_o^2$ and the reliability coefficient $s_t^2/s_o^2$ will be 1. As $s_e^2/s_o^2$ increases so the reliability diminishes. For a measurement method which is maximally errorful all the observed variance is error variance, ie $s_e^2 = s_o^2$. In this case $s_t^2 = 0$ and the coefficient of reliability will be zero.

How then to estimate $s_e$ and its associated statistics in practice? Clearly one way might be to measure a set of events whose values are known, then calculate $s_o^2$ and $s_t^2$ from observed and known values. From this, $s_e^2$ and its derivatives $s_e$, $s_{e\,\mathrm{diff}}$, the reliability coefficient and various confidence intervals could be obtained.

Unfortunately, in practice, particularly in new fields of measurement, this is often impossible, since the true values are not known. However, although the true values are not known, it can be safely assumed that if a variety of events are measured, some variation in true scores should occur. If these events are measured again the initial values should be exactly reproduced provided there is no error. To the extent that there is random error the relative position among the initial observations will not be reproduced.

It should be noted that failure to reproduce scores can result from several processes. Instruments or observers may be unstable over time (test-retest unreliability). Measures taken by two observers may differ (interobserver unreliability). Measures taken by two versions of the same instrument or test may differ (parallel form unreliability). In measurements composed of several observations a part-score from a subset of the observations may be compared to a part-score based on another subset (internal consistency). These are all different practical methods for obtaining two estimates of the same underlying true value. Although the errors introduced in attempting reobservation by different methods are likely to be different, all these practical approaches to establishing reliability have in common the need to quantify the degree to which one set of observations predicts another set of observations of the same events.

The correlation coefficient r (Edwards 1964) is a measure of the degree to which one data set predicts another. If the sample of events is remeasured (eg on another occasion, or by another observer), then equation (A.13) relates the observed scores $Y_o$, the true scores $Y_t$ and the error $E$:

$Y_o = Y_t + E$ (A.13)

All of equations (A.2) - (A.12) can be rewritten for these second measurements. Since reliability can be defined as the extent to which measurements predict remeasurements of the same events, the correlation between X and Y will be an index of reliability. The correlation coefficient is defined as the average cross-product of the standardized scores on X and Y:

$r = \frac{\sum Z_x Z_y}{n}$ (A.14)

We will assume that the reader is already familiar with the theory of correlation, which indicates how this index relates to scattergrams, and how it varies between 0 (when X and Y are randomly related) and 1.0 (when X, Y co-ordinates plot perfectly on a straight line).

An algebraically equivalent equation for r can be written in deviation scores since $Z_x = (X - \bar{X})/S_x$ and $Z_y = (Y - \bar{Y})/S_y$:

$r = \frac{\sum x_o y_o}{\sqrt{\sum x_o^2 \sum y_o^2}}$ (A.15)

If $e_x$ and $e_y$ are the error deviation scores for X and Y respectively, then:

$\sum x_o y_o = \sum (x_t + e_x)(y_t + e_y)$

$= \sum x_t y_t + \sum x_t e_y + \sum y_t e_x + \sum e_x e_y$

Since $e_x$ and $e_y$ are random (with positive and negative values equivalent and randomly paired to particular $x_t$'s or $y_t$'s) it follows that $\sum x_t e_y = 0$, $\sum y_t e_x = 0$, and also that $\sum e_x e_y = 0$. Thus $\sum x_o y_o = \sum x_t y_t$. Since the same events are being remeasured $x_t = y_t$ and therefore

$\sum x_o y_o = \sum x_t^2 = \sum y_t^2$ (A.16)

Furthermore, since the same events are being measured twice, within the limits of sampling error the two samples X and Y should have the same variance, $s_x^2 = s_y^2$. The


variance being the average squared deviation it follows that $\sum x_o^2/n = \sum y_o^2/n$. That is:

$\sum x_o^2 = \sum y_o^2$ (A.17)

From equations (A.16) and (A.17) equation (A.15) may be rewritten:

$r = \frac{\sum x_t^2}{\sum x_o^2} = \frac{\sum y_t^2}{\sum y_o^2}$

Dividing both numerators and denominators by n defines r in terms of variances:

$r = \frac{\sum x_t^2/n}{\sum x_o^2/n} = \frac{\sum y_t^2/n}{\sum y_o^2/n}$

$r = \frac{s_t^2}{s_o^2}$ (A.18)

where $s_t^2$ = true score variance and $s_o^2$ = observed score variance. Equation (A.18) allows the very important conclusion that the correlation between two measures of the same sample of events is in fact the reliability coefficient defined from equation (A.12).

This conclusion not only enhances the interpretation of reliability and its evaluation in practice, but also permits evaluation of the other useful index of reliability, $s_e$, and its derivative the confidence interval. From equation (A.12) it follows that $1 - r = s_e^2/s_o^2$. Thus $s_e^2 = s_o^2(1 - r)$ and therefore:

$s_e = s_o\sqrt{1 - r}$ (A.19)

Note that both r and $s_o$ are readily calculated from experimental data.

The preceding theory outlines the rationale and interpretational basis of the major classical indexes for quantifying reliability: the standard error of measurement ($s_e$); the reliability coefficient (r); and the confidence interval (CI) around the true score (or around the change score). A number of further conclusions are derivable from the foregoing. An entire exposition of these is beyond the scope of this contribution. However two aspects are important to arguments presented in the main text.

One aspect is that the reliability coefficient is sensitive to the amount of true score variance. If the true score variance is for some reason restricted the reliability coefficient will be reduced provided the error of measurement remains constant. This conclusion follows from equations (A.8) and (A.18). Since $r = s_t^2/s_o^2$ and $s_o^2 = s_t^2 + s_e^2$ then:

$r = \frac{s_t^2}{s_t^2 + s_e^2}$ (A.20)

Imagine an experiment where a blindfolded human subject is required to palpate 10 cubes, the sides of which vary in 5 mm steps from 10 mm to 55 mm. The cubes are then repalpated. The subject is required to judge their size on both occasions and a reliability coefficient is calculated. Conversely imagine the same experiment with 10 cubes varying in 1 mm steps from 20 mm to 29 mm. Since the same palpatory technique is employed on similar events the random error of measurement in metric terms ($s_e$) should remain comparable (within the limits of sampling variation). Thus $s_e$ is assumed constant across the two experiments. However the true score variance will be larger in the first experiment with cubes ranging from 10 mm to 55 mm. It follows from equation (A.20) that if $s_t^2$ diminishes when $s_e^2$ remains constant, then the ratio r will diminish also. It is therefore important that reliability studies use stimuli with a range of variability which is representative of the events to which the instrument or observational procedure will be ultimately applied. If a range restriction does occur a correction is available:

$R = \frac{r(S/s)}{\sqrt{1 - r^2 + r^2(S/s)^2}}$ (A.21)

In equation (A.21) R = correlation for uncurtailed distribution, S = standard deviation of uncurtailed distribution, r = correlation of the curtailed distribution, s = standard deviation of the curtailed distribution.

Another conclusion derived from reliability theory which is relevant to the main text is that the observed correlation between two variables will be less than the theoretically possible correlation between their true scores. This occurs because both variables are measured with some random error. If the reliability coefficients for both variables are known, the theoretically possible relationship between the two variables when measured without error can be calculated (2). If X and Y are the two variables, $r_{X_tY_t}$ = the correlation between the true X and Y scores, $r_{XY}$ = observed correlation between X and Y, $r_{XX}$ = the reliability coefficient for measuring X and $r_{YY}$ = the reliability coefficient for measuring Y, then:

$r_{X_tY_t} = \frac{r_{XY}}{\sqrt{r_{XX}\, r_{YY}}}$ (A.22)

The preceding discussion is concerned with the theory of reliability as it applies to variables measured on interval or ratio scales such as might occur in goniometry. Often clinical measurement is categorical in nature, such as when rating abnormality, or when rating the stiffness of a joint along a five point scale. A reliability theory needs to be defined for these situations also.

A frequently employed measure for describing test-retest or interobserver reliability of categorical data is the percentage of agreement between the two sets of observations. The rationale is similar to that of the reliability coefficient: the presence of error will reduce agreement. Although simple and widely used, percent agreement has some deficiencies.

Even if measurement is totally unreliable, that is if the observed categories arose at random, there will be some degree of agreement. Furthermore, that degree will be influenced by the distribution of measurements which arise from random processes. These distributions are different in different circumstances.

For example, the number of categories in the scale will influence randomly obtained agreement levels. On a two point scale, if both responses are equiprobable the expected agreement rate is 0.50. On a four point scale, if all responses are equiprobable, the expected agreement rate is 0.25.

In addition, the assumption of equiprobability may be inappropriate. If the incidence of the two middle categories in the four category example was 0.4 for each and if the incidence for the two extreme categories was 0.1 for each, then basic probability theory indicates the percentage of expected agreement ($P_e$) would be $P_e$ = (.1 x .1) + (.1 x .1) + (.4 x .4) + (.4 x .4), that is 0.34. The two distributions of marginal probability, furthermore, need not be identical as in this example. Nevertheless, basic probability theory can readily yield expected proportions of agreement under the random model.

Another factor which can influence the proportion of agreement under a random model is the definition of agreement. Let A, B, C, D be the four categories of the above rating. Let a, b, c, d be the proportion of ratings in the respective categories obtained on the first round of measurements and let a', b', c', d' be the corresponding proportions on the second round. If agreement is defined as not only the conjunction of identical ratings, but also of adjacent ratings, then elementary probability theory concludes that $P_e$ = aa' + ab' + ba' + bb' + bc' + cb' + cc' + cd' + dc' + dd'. In the above example a = a' = 0.1, b = b' = 0.4, c = c' = 0.4, d = d' = 0.1. Therefore $P_e$ = 0.82, which is substantially larger than 0.34, the result obtained with the stricter agreement rule.

To overcome such disadvantages Cohen (1960) defined the statistic kappa:

$\kappa = \frac{P_o - P_e}{1 - P_e}$ (A.23)

Kappa expresses observed agreement $P_o$ relative to expected agreement. It also expresses that difference as a proportion of the distance between random and perfect agreement. Thus kappa is very similar to the reliability coefficient. If $P_o$ is 100%, then $\kappa$ = 1; if $P_o = P_e$, then $\kappa$ = 0. As $P_o$ exceeds $P_e$ so kappa grows. Although the analogy between $\kappa$ and r is limited, a number of problems are practically resolved by this statistic. The probability distribution of kappa has been investigated (Fleiss and Cohen 1969, Hubert 1977) and it is a method with relevance to a wide variety of reliability problems when categorical data is encountered (Hartman 1977, Hollenbeck 1978). Other correlation-like statistics, such as $\phi$, are applicable to problems of association in categorical data, but a full discussion of their relative values is beyond the scope of this appendix.
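The chance-agreement figures and kappa can be reproduced with a few lines of code. This is an illustrative sketch only: the function names and the `tolerance` parameter are hypothetical, and the observed agreement of 0.80 is an invented value, not a result from any study reviewed here.

```python
from itertools import product


def expected_agreement(first, second, tolerance=0):
    """Chance agreement P_e from two lists of marginal proportions.

    tolerance=0 counts only identical categories as agreement;
    tolerance=1 also counts adjacent categories.
    """
    return sum(
        p * q
        for (i, p), (j, q) in product(enumerate(first), enumerate(second))
        if abs(i - j) <= tolerance
    )


def kappa(p_observed, p_expected):
    """Cohen's kappa as in (A.23)."""
    return (p_observed - p_expected) / (1 - p_expected)


# Marginal proportions for categories A, B, C, D on both rounds.
marginals = [0.1, 0.4, 0.4, 0.1]

pe_strict = expected_agreement(marginals, marginals)
pe_adjacent = expected_agreement(marginals, marginals, tolerance=1)
print(round(pe_strict, 2), round(pe_adjacent, 2))

# Hypothetical observed agreement of 0.80 under the strict rule.
print(round(kappa(0.80, pe_strict), 3))
```

Under these invented figures an observed agreement of 0.80 corresponds to a kappa of about 0.70 with the strict rule, whereas the same 0.80 would fall below the 0.82 expected by chance if adjacent ratings counted as agreement, giving a negative kappa.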
