Developing Theory-Based Diagnostic Tests of English Grammar: Application of Processability Theory
Rosalie Hirch
April 26, 2013
Order of the Presentation
- Introduction
- Literature Review: Processability Theory (PT) & Diagnostic Language Tests
  - Hierarchies
  - Errors
  - Task Types
- Method
  - Participants
  - Instruments
  - Analyses
- Results
- Discussion, Limitations, & Conclusions
Introduction: Background & Motivation
- Bridging the gap between testing and the classroom
- Previous research in diagnostic language assessment: empirical-based and theory-based
- Processability Theory
  - Already used for tests (RapidProfile)
  - Is it sufficient for diagnostic tests?
Introduction: Major Goals & Aims of the Study
- To evaluate the reliability of a diagnostic grammar test for middle school students
- To explore theoretical approaches to diagnostic language assessment
- To investigate the application of Processability Theory to diagnostic grammar tests
Literature Review: Processability Theory & Diagnostic Language Tests
Hierarchies: Processability Theory
- Based on Lexical Functional Grammar
- Levels are implicational
- Levels come from the grammar tree
- Problem: the PT hierarchy is very limited
Hierarchies: Processability Theory
[Diagram: the sentence "Susan decorated a cake while John was playing tennis." annotated with lexical categories (N V D N SC N V PrP N) and mapped onto the PT processing procedures: word/lemma access, category procedure, phrasal procedure, S-procedure, and S'-procedure.]
Hierarchies: Diagnostic Tests
- Other educational diagnostic tests also use hierarchies
  - Used for analyzing problems
  - Some are implicational
- Tend to be very broad (covering as much as possible)
- Suggestion that grammar, in particular, must cover a lot
Errors: Processability Theory
- Learners tend to make 2 types of errors
- These account for interlanguages

Is she at home? (Target Sentence)
She Ø at home? (Deletion)
She is at home? (Overuse)
Errors: Diagnostic Tests
- The primary focus of diagnostic tests
- Can potentially show 2 elements in learner performance
  - Where the problem lies (error: observable outcome)
  - What thinking led to the error (weakness: underlying problem)
- Requires careful planning
  - Before: item design
  - After: rubric design
Types of Tasks: Processability Theory
- Emphasis on implicit knowledge (automaticity)
- Based on Levelt's Speaking Model
- Tasks tend to be productive (speaking, writing)
- Analysis is done afterwards
Types of Tasks: Diagnostic Tests
- It is possible to use productive tasks, but not optimal
  - Difficult to control contexts
- More likely to be discrete and, as a result, "inauthentic"
- Tasks from Norris (2005) and Chapelle et al. (2010)
  - Some qualities of multiple choice
  - Attempt to imitate productive tasks
Research Questions
1. Can we achieve an acceptable level of reliability for the grammatical diagnostic test used for this study?
2. Do the items for the grammatical diagnostic test work well at an item level in terms of item discrimination and difficulty? Were there unexpected patterns?
3. What is the relationship between the subtest, full test, and self-assessment?
4. Were mastery and non-mastery patterns consistent with predictions based on the Processability Theory hierarchy?
Method: Participants, Instruments, Analyses
Participants: Subjects
- 219 middle school students
- Outside Seoul
- No overseas education

                                  Grammar Test             Writing Test
Grade    N    Girls %  Boys %    Mean  SD    Range        Mean  SD   Range
Gr. 3-5  72   52.7     47.2      0.46  0.18  0.10-0.85    3.3   1.8  0-7.5
Gr. 6    89   59.6     40.4      0.50  0.20  0.13-0.87    3.3   1.8  0-8
Gr. 7    39   51.3     48.7      0.47  0.19  0.02-0.79    3.8   1.6  0-7
Gr. 8&9  19   36.8     63.2      0.58  0.22  0.04-0.90    4.2   2.4  0-7
Total    219  53.9     46.1      0.49  0.19  0.02-0.90    3.5   1.8  0-8
Participants: Raters
- 2 rounds of rating
- Round 1: Grammar
  - 6 raters, all experienced in teaching; 4 in preparing tests
  - Scored the grammar tests and writing tests for the specific grammar points
  - Rated once (absolute answers)
- Round 2: Holistic
  - 5 raters, all experienced in scoring writing tests
  - Rated twice (3 times where raters differed by 2 or more)
Instruments
- Grammar test (see handout)
- Writing test: picture task
  - For comparison purposes
  - Covers the PT grammar and additional levels
Analyses
- Descriptive statistics: central tendency & dispersion measures; t-unit analysis
- Test and subsection reliability (alpha)
- Item difficulty and discrimination
- Correlation with the writing test
- Fit to the PT hierarchy
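The reliability analysis listed here, internal consistency via Cronbach's alpha, can be sketched in a few lines. This is an illustrative implementation run on toy data, not the study's code or scores:

```python
# Cronbach's alpha for a set of dichotomously scored items.
# Rows are examinees, columns are items (1 = correct, 0 = incorrect).
def cronbach_alpha(scores):
    k = len(scores[0])                      # number of items
    totals = [sum(row) for row in scores]

    def variance(xs):                       # sample variance (n - 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Toy data: 4 examinees x 3 items
data = [[1, 1, 1],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(round(cronbach_alpha(data), 2))  # 0.75 for this toy matrix
```

The same computation, applied per subsection and to the full test, would yield the alpha values reported on the reliability slide.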
Results
Descriptive Statistics: Grammar Test & Writing Test

           N    Items  Mean  SD    Median  Mode  Range
Version 1  219  52     25.6  10.1  25      15    1-47
Version 2  219  42     20.3  9.0   20      19    0-40
Writing    219  1      3.0   1.8   4.0     4.5   0-8

Writing measures:
     Ave. Word  Word Count  Ave. t-unit  Words per  Words per  Clauses per  Target
N    Count      Range       Count        t-unit     Clause     t-unit       Clauses
219  67.83      0-242       10.78        6.30       5.69       0.11         0.19
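The t-unit ratios in the table above are mechanical once the writing samples have been segmented. A minimal sketch, assuming segmentation into t-units and clauses has already been done (in practice, by raters):

```python
# Complexity ratios from pre-segmented writing samples.
# Each t-unit is a list of clauses; each clause is a list of words.
def complexity_ratios(t_units):
    n_tunits = len(t_units)
    n_clauses = sum(len(tu) for tu in t_units)
    n_words = sum(len(cl) for tu in t_units for cl in tu)
    return {
        "words_per_t_unit": n_words / n_tunits,
        "words_per_clause": n_words / n_clauses,
        "clauses_per_t_unit": n_clauses / n_tunits,
    }

sample = [
    [["She", "is", "at", "home"]],        # 1 clause, 4 words
    [["John", "plays", "tennis"],
     ["while", "Susan", "bakes"]],        # 2 clauses, 6 words
]
print(complexity_ratios(sample))
```

The segmentation step itself is the hard (and judgment-laden) part; the ratios follow directly from the counts.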
Reliability Statistics: Grammar Test and Subsections & Writing Test

Section          Det   NC    PN    Past  PrC   SVsg  SVpl  Prep  SCA   SCB   SCC   SCT   Test  PTest
Number of items  5     5     5     5     5     6     4     5     4     4     4     12    52    42
Alpha            0.18  0.70  0.88  0.85  0.93  0.92  0.73  0.76  0.73  0.74  0.61  0.83  0.92  0.93

Writing test inter-rater statistics (N = 219):
  Correlation:          0.92
  Kappa:                0.41
  Perfect agreement:    0.49
  Adjacent scores:      0.49
  Perfect + adjacent:   0.99
  Rho:                  0.91
  Alpha:                0.96
  P-B Proph (3-rater):  0.98
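The prophecy value here projects reliability to a three-rater composite. If the standard Spearman-Brown prophecy formula is what was applied (an assumption on my part; the slide does not say), it looks like this:

```python
def spearman_brown(r, k):
    """Projected reliability when averaging over k raters,
    given single-rater reliability r (Spearman-Brown prophecy)."""
    return k * r / (1 + (k - 1) * r)

# Illustrative values only, not the study's data:
print(round(spearman_brown(0.75, 3), 2))  # 0.9: three raters at r = 0.75
```

The formula shows why adding raters yields diminishing returns: each additional rater raises the projected reliability by less than the previous one.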
Item Difficulty and Discrimination: Grammar Test
[Figure: item difficulty and discrimination indices plotted against item number.]
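The two classical indices behind a plot like this are difficulty (proportion of examinees answering correctly) and discrimination (the point-biserial correlation between an item and the total score). A sketch with toy data, not the study's items:

```python
def item_stats(scores):
    """scores: examinees x items matrix of 0/1 responses.
    Returns (difficulty, point-biserial discrimination) per item."""
    n = len(scores)
    totals = [sum(row) for row in scores]
    mean_t = sum(totals) / n
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n) ** 0.5

    results = []
    for j in range(len(scores[0])):
        col = [row[j] for row in scores]
        p = sum(col) / n                  # difficulty: proportion correct
        right = [t for t, x in zip(totals, col) if x == 1]
        wrong = [t for t, x in zip(totals, col) if x == 0]
        m1 = sum(right) / len(right)      # mean total, item answered right
        m0 = sum(wrong) / len(wrong)      # mean total, item answered wrong
        r_pb = (m1 - m0) / sd_t * (p * (1 - p)) ** 0.5
        results.append((p, r_pb))
    return results

data = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
for p, d in item_stats(data):
    print(round(p, 2), round(d, 2))
```

Items with near-zero (or negative) discrimination are the ones that flag "unexpected patterns" in research question 2.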
Correlation with the Writing Test: Grammar Test and Subsections

         PlN    Past   PrC    SVsg   SVpl   Prep   SCA    SCB    SCC    SCT    Test   Writing
PlN      1
Past     .37**  1
PrC      .29**  .34**  1
SVsg     .28**  .42**  .43**  1
SVpl     .38**  .36**  .27**  .25**  1
Prep     .28**  .33**  .46**  .45**  .26**  1
SCA      .21**  .28**  .40**  .40**  .26**  .53**  1
SCB      .23**  .34**  .27**  .38**  .23**  .53**  .56**  1
SCC      .15*   .18**  .26**  .39**  .11    .39**  .50**  .42**  1
SCT      .25**  .34**  .39**  .48**  .26**  .60**  .87**  .86**  .69**  1
Test     .55**  .65**  .70**  .75**  .51**  .73**  .67**  .64**  .51**  .76**  1
Writing  .36**  .43**  .44**  .37**  .33**  .47**  .42**  .46**  .31**  .50**  .61**  1

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
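The coefficients in the matrix are pairwise Pearson correlations between subsection, test, and writing scores. For reference, a bare-bones Pearson r (illustrative only):

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [1, 3, 2, 4]), 2))  # 0.8
```

Computing this for every pair of columns in the score matrix reproduces the triangular layout above.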
Fit to Implicational Hierarchies
- Coefficient of scalability:
  - PT only = 94.1%
  - PT + proposed levels = 89.3%
- Criterion: (Total # of cells − Exceptions) / (Total # of cells) = 90%+

[Table: learner counts by number of levels mastered, columns labeled 1, 2, 3, N. Rows as extracted: 3 levels: 5; 2 levels: 8, 2; 1 level: 10, 3; 0 levels: 2.]
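The scalability coefficient on this slide is (total cells − exceptions) / total cells over the implicational table. A sketch under one common Guttman-style convention, where an exception is a level mastered above an unmastered lower level (not necessarily the exact scoring rule used in the study):

```python
# Reproducibility of an implicational table: each learner is a list
# of 0/1 mastery flags ordered from lowest to highest PT level.
def reproducibility(table):
    cells = sum(len(row) for row in table)
    exceptions = 0
    for row in table:
        seen_gap = False
        for mastered in row:
            if not mastered:
                seen_gap = True
            elif seen_gap:
                exceptions += 1   # mastered despite a gap at a lower level
    return (cells - exceptions) / cells

# Toy data: learner 2 ([1, 0, 1]) violates the implicational order once.
learners = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 0]]
print(round(reproducibility(learners), 3))
```

With 12 cells and one exception, the toy table scores about 0.917, just above the 90% criterion cited on the slide.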
Discussion, Limitations, & Conclusion
Discussion
- Overall reliability was quite good
- Determiner and non-count sections did not work
  - Exposed a problem with determiners generally
- Task types have good potential for diagnostic information
- Grammar correlated fairly well with writing scores
  - Follows from complexity and accuracy
  - May also explain determiners & non-count nouns
- Fit of the proposed levels to PT suggests the tasks are plausible
Limitations
- Results are generalizable only to Korean learners; the methods may be universal
- Should have had a larger writing sample
  - Also, more feedback from students and teachers
- More high-level students
Conclusions
- Most of the grammar tasks can work well, but require more planning & research
  - Particular attention to error types
- It may be possible to expand the PT hierarchy
  - Needed for it to be useful for diagnostic purposes