Developing Theory-Based Diagnostic Tests of English Grammar: Application of Processability Theory
Rosalie Hirch
April 26, 2013
Order of the Presentation
- Introduction
- Literature Review: Processability Theory (PT) & Diagnostic Language Tests
  - Hierarchies
  - Errors
  - Task Types
- Method
  - Participants
  - Instruments
  - Analyses
- Results
- Discussion, Limitations, & Conclusions
Introduction: Background & Motivation
- Bridging the gap between testing and the classroom
- Previous research in diagnostic language assessment: empirical-based and theory-based
- Processability Theory
  - Already used for tests (RapidProfile)
  - Is it sufficient for diagnostic tests?
Introduction: Major Goals & Aims of the Study
- To evaluate the reliability of a diagnostic grammar test for middle school students
- To explore theoretical approaches to diagnostic language assessment
- To investigate the application of Processability Theory to diagnostic grammar tests
Literature Review: Processability Theory & Diagnostic Language Tests
Hierarchies: Processability Theory
- Based on Lexical Functional Grammar
- Levels are implicational
- Levels come from the grammar tree
- Problem: the PT hierarchy is very limited
Hierarchies: Processability Theory
[Diagram: the sentence "Susan decorated a cake while John was playing tennis." annotated with lexical categories (N V D N SC N V PrP N) and mapped onto the PT processing procedures: word/lemma access, category procedure, phrasal procedure, S-procedure, and S'-procedure.]
Hierarchies: Diagnostic Tests
- Other educational diagnostic tests also use hierarchies
  - Used for analyzing problems
  - Some are implicational
- Tend to be very broad (covering as much as possible)
- Suggestion that grammar, in particular, must cover a lot
Errors: Processability Theory
- Learners tend to make 2 types of errors
- These account for interlanguages

Is she at home? (Target Sentence)
She Ø at home? (Deletion)
She is at home? (Overuse)
Errors: Diagnostic Tests
- The primary focus of diagnostic tests
- Can potentially show 2 elements in learner performance
  - Where the problem lies (error: observable outcome)
  - What thinking led to the error (weakness: underlying problem)
- Requires careful planning
  - Before: item design
  - After: rubric design
Types of Tasks: Processability Theory
- Emphasis on implicit knowledge (automaticity)
- Based on Levelt's Speaking Model
- Tasks tend to be productive (speaking, writing)
- Analysis is done afterwards
Types of Tasks: Diagnostic Tests
- It is possible to use productive tasks, but not optimal
  - Difficult to control contexts
- More likely to be discrete and, as a result, "inauthentic"
- Tasks from Norris (2005) and Chapelle et al. (2010)
  - Some qualities of multiple choice
  - Attempt to imitate productive tasks
Research Questions
1. Can we achieve an acceptable level of reliability for the grammatical diagnostic test used for this study?
2. Do the items for the grammatical diagnostic test work well at an item level in terms of item discrimination and difficulty? Were there unexpected patterns?
3. What is the relationship between the subtest, full test, and self-assessment?
4. Were mastery and non-mastery patterns consistent with predictions based on the Processability Theory hierarchy?
Method: Participants, Instruments, Analyses
Participants: Subjects
- 219 middle school students
- Outside Seoul
- No overseas education

                                  Grammar Test             Writing Test
Grade    N    Girls %  Boys %    Mean  SD    Range        Mean  SD   Range
Gr. 3-5  72   52.7     47.2      0.46  0.18  0.10-0.85    3.3   1.8  0-7.5
Gr. 6    89   59.6     40.4      0.50  0.20  0.13-0.87    3.3   1.8  0-8
Gr. 7    39   51.3     48.7      0.47  0.19  0.02-0.79    3.8   1.6  0-7
Gr. 8&9  19   36.8     63.2      0.58  0.22  0.04-0.90    4.2   2.4  0-7
Total    219  53.9     46.1      0.49  0.19  0.02-0.90    3.5   1.8  0-8
Participants: Raters
- 2 rounds of rating
- Round 1: Grammar
  - 6 raters, all experienced in teaching; 4 in preparing tests
  - Scored the grammar tests and writing tests for the specific grammar points
  - Rated once (absolute answers)
- Round 2: Holistic
  - 5 raters, all experienced in scoring writing tests
  - Rated twice (3 times where raters differed by 2 or more)
Instruments
- Grammar test (see handout)
- Writing test: picture task
  - For comparison purposes
  - Covers the PT grammar and additional levels
Analyses
- Descriptive statistics: central tendency & dispersion measures; t-unit analysis
- Test and subsection reliability (alpha)
- Item difficulty and discrimination
- Correlation with the writing test
- Fit to the PT hierarchy
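The reliability analysis listed here, internal consistency via Cronbach's alpha, can be sketched in a few lines. This is an illustrative implementation run on toy data, not the study's code or scores:

```python
# Cronbach's alpha for a set of dichotomously scored items.
# Rows are examinees, columns are items (1 = correct, 0 = incorrect).
def cronbach_alpha(scores):
    k = len(scores[0])                      # number of items
    totals = [sum(row) for row in scores]

    def variance(xs):                       # sample variance (n - 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Toy data: 4 examinees x 3 items
data = [[1, 1, 1],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(round(cronbach_alpha(data), 2))  # 0.75 for this toy matrix
```

The same computation, applied per subsection and to the full test, would yield the alpha values reported on the reliability slide.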
Results
Descriptive Statistics: Grammar Test & Writing Test

           N    Items  Mean  SD    Median  Mode  Range
Version 1  219  52     25.6  10.1  25      15    1-47
Version 2  219  42     20.3  9.0   20      19    0-40
Writing    219  1      3.0   1.8   4.0     4.5   0-8

Writing measures:
     Ave. Word  Word Count  Ave. t-unit  Words per  Words per  Clauses per  Target
N    Count      Range       Count        t-unit     Clause     t-unit       Clauses
219  67.83      0-242       10.78        6.30       5.69       0.11         0.19
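The t-unit ratios in the table above are mechanical once the writing samples have been segmented. A minimal sketch, assuming segmentation into t-units and clauses has already been done (in practice, by raters):

```python
# Complexity ratios from pre-segmented writing samples.
# Each t-unit is a list of clauses; each clause is a list of words.
def complexity_ratios(t_units):
    n_tunits = len(t_units)
    n_clauses = sum(len(tu) for tu in t_units)
    n_words = sum(len(cl) for tu in t_units for cl in tu)
    return {
        "words_per_t_unit": n_words / n_tunits,
        "words_per_clause": n_words / n_clauses,
        "clauses_per_t_unit": n_clauses / n_tunits,
    }

sample = [
    [["She", "is", "at", "home"]],        # 1 clause, 4 words
    [["John", "plays", "tennis"],
     ["while", "Susan", "bakes"]],        # 2 clauses, 6 words
]
print(complexity_ratios(sample))
```

The segmentation step itself is the hard (and judgment-laden) part; the ratios follow directly from the counts.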
Reliability Statistics: Grammar Test and Subsections & Writing Test

Section          Det   NC    PN    Past  PrC   SVsg  SVpl  Prep  SCA   SCB   SCC   SCT   Test  PTest
Number of items  5     5     5     5     5     6     4     5     4     4     4     12    52    42
Alpha            0.18  0.70  0.88  0.85  0.93  0.92  0.73  0.76  0.73  0.74  0.61  0.83  0.92  0.93

Writing test inter-rater statistics (N = 219):
  Correlation:          0.92
  Kappa:                0.41
  Perfect agreement:    0.49
  Adjacent scores:      0.49
  Perfect + adjacent:   0.99
  Rho:                  0.91
  Alpha:                0.96
  P-B Proph (3-rater):  0.98
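The prophecy value here projects reliability to a three-rater composite. If the standard Spearman-Brown prophecy formula is what was applied (an assumption on my part; the slide does not say), it looks like this:

```python
def spearman_brown(r, k):
    """Projected reliability when averaging over k raters,
    given single-rater reliability r (Spearman-Brown prophecy)."""
    return k * r / (1 + (k - 1) * r)

# Illustrative values only, not the study's data:
print(round(spearman_brown(0.75, 3), 2))  # 0.9: three raters at r = 0.75
```

The formula shows why adding raters yields diminishing returns: each additional rater raises the projected reliability by less than the previous one.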
Item Difficulty and Discrimination: Grammar Test
[Figure: item difficulty and discrimination indices plotted against item number.]
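The two classical indices behind a plot like this are difficulty (proportion of examinees answering correctly) and discrimination (the point-biserial correlation between an item and the total score). A sketch with toy data, not the study's items:

```python
def item_stats(scores):
    """scores: examinees x items matrix of 0/1 responses.
    Returns (difficulty, point-biserial discrimination) per item."""
    n = len(scores)
    totals = [sum(row) for row in scores]
    mean_t = sum(totals) / n
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n) ** 0.5

    results = []
    for j in range(len(scores[0])):
        col = [row[j] for row in scores]
        p = sum(col) / n                  # difficulty: proportion correct
        right = [t for t, x in zip(totals, col) if x == 1]
        wrong = [t for t, x in zip(totals, col) if x == 0]
        m1 = sum(right) / len(right)      # mean total, item answered right
        m0 = sum(wrong) / len(wrong)      # mean total, item answered wrong
        r_pb = (m1 - m0) / sd_t * (p * (1 - p)) ** 0.5
        results.append((p, r_pb))
    return results

data = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
for p, d in item_stats(data):
    print(round(p, 2), round(d, 2))
```

Items with near-zero (or negative) discrimination are the ones that flag "unexpected patterns" in research question 2.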
Correlation with the Writing Test: Grammar Test and Subsections

         PlN    Past   PrC    SVsg   SVpl   Prep   SCA    SCB    SCC    SCT    Test   Writing
PlN      1
Past     .37**  1
PrC      .29**  .34**  1
SVsg     .28**  .42**  .43**  1
SVpl     .38**  .36**  .27**  .25**  1
Prep     .28**  .33**  .46**  .45**  .26**  1
SCA      .21**  .28**  .40**  .40**  .26**  .53**  1
SCB      .23**  .34**  .27**  .38**  .23**  .53**  .56**  1
SCC      .15*   .18**  .26**  .39**  .11    .39**  .50**  .42**  1
SCT      .25**  .34**  .39**  .48**  .26**  .60**  .87**  .86**  .69**  1
Test     .55**  .65**  .70**  .75**  .51**  .73**  .67**  .64**  .51**  .76**  1
Writing  .36**  .43**  .44**  .37**  .33**  .47**  .42**  .46**  .31**  .50**  .61**  1

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
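The coefficients in the matrix are pairwise Pearson correlations between subsection, test, and writing scores. For reference, a bare-bones Pearson r (illustrative only):

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [1, 3, 2, 4]), 2))  # 0.8
```

Computing this for every pair of columns in the score matrix reproduces the triangular layout above.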
Fit to Implicational Hierarchies
- Coefficient of scalability:
  - PT only = 94.1%
  - PT + proposed levels = 89.3%
- Criterion: (Total # of cells − Exceptions) / (Total # of cells) = 90%+

[Table: learner counts by number of levels mastered, columns labeled 1, 2, 3, N. Rows as extracted: 3 levels: 5; 2 levels: 8, 2; 1 level: 10, 3; 0 levels: 2.]
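The scalability coefficient on this slide is (total cells − exceptions) / total cells over the implicational table. A sketch under one common Guttman-style convention, where an exception is a level mastered above an unmastered lower level (not necessarily the exact scoring rule used in the study):

```python
# Reproducibility of an implicational table: each learner is a list
# of 0/1 mastery flags ordered from lowest to highest PT level.
def reproducibility(table):
    cells = sum(len(row) for row in table)
    exceptions = 0
    for row in table:
        seen_gap = False
        for mastered in row:
            if not mastered:
                seen_gap = True
            elif seen_gap:
                exceptions += 1   # mastered despite a gap at a lower level
    return (cells - exceptions) / cells

# Toy data: learner 2 ([1, 0, 1]) violates the implicational order once.
learners = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 0]]
print(round(reproducibility(learners), 3))
```

With 12 cells and one exception, the toy table scores about 0.917, just above the 90% criterion cited on the slide.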
Discussion, Limitations, & Conclusion
Discussion
- Overall reliability was quite good
- Determiner and non-count sections did not work
  - Exposed a problem with determiners generally
- Task types have good potential for diagnostic information
- Grammar correlated fairly well with writing scores
  - Follows from complexity and accuracy
  - May also explain determiners & non-count nouns
- Fit of the proposed levels to PT suggests the tasks are plausible
Limitations
- Results are generalizable only to Korean learners; the methods may be universal
- Should have had a larger writing sample
  - Also, more feedback from students and teachers
- More high-level students
Conclusions
- Most of the grammar tasks can work well, but require more planning & research
  - Particular attention to error types
- It may be possible to expand the PT hierarchy
  - Needed for it to be useful for diagnostic purposes