ENSL 346/446 Midterm Score:_______/61
SECTION 1 (____/17 points total)
Parts of Speech:
Directions: Match each part of speech in the left column with the appropriate definition in the right column
1. Noun:
_______________
A. modifies a noun or a pronoun by describing, identifying, or quantifying words
2. Verb:
_______________
B. replaces a noun
3. Pronoun:
_______________
C. shows direction, location or time while linking nouns, pronouns and phrases to other words in a sentence.
4. Adjective:
_______________
D. names a person, place, thing or idea.
5. Adverb:
_______________
E. expresses actions, events or states of being.
6. Conjunction:
_______________
F. links words, phrases and clauses
7. Preposition:
_______________
G. modifies a verb, an adjective, an adverb, a phrase or clause. Shows manner, time, place, cause or degree.
Multiple choice:
Identify the parts of speech of the underlined words/phrases:
1. Does this meal come with rice?
a. pronoun c. adjective
b. noun d. preposition
2. Beatriz will stay at school until she finishes her project.
a. adverb c. coordinating conjunction
b. pronoun d. subordinating conjunction
3. The cunning raccoon jumped into the garbage can.
a. article c. adverb
b. preposition d. auxiliary
4. Which of the following is a helping (auxiliary) verb?
a. The can of soda exploded.
b. Throw me a can of soda!
c. Would you like a can of soda?
d. I wouldn’t like a can of soda.
5. Which of the following is an adjective?
a. My brother’s ugly red car lasted him twenty years.
b. My brother’s ugly red car lasted him twenty years.
c. My brother’s ugly red car lasted him twenty years.
d. My brother’s ugly red car lasted him twenty years.
6. Which of the following is a pronoun?
a. Jeremy is excited because he bought new shoes.
b. The shoes are under the table.
c. How much did those shoes cost?
d. He got a discount, thanks to his school ID.
7. Lucille usually jogs on the beach before class.
a. adjective c. helping verb
b. preposition d. adverb
8. Luiz saved his money all summer so he could buy a guitar.
a. coordinating conjunction c. subordinating conjunction
b. verb d. article
9. Which of the following is a noun?
a. The smell of pie fills Mrs. Weasley’s house every Saturday.
b. Mrs. Weasley loves to bake pie.
c. Mrs. Weasley’s pies are baked with love.
d. Harry walked away with a mountain of pie on his plate.
10. Isabel finally hiked in Big Sur last weekend.
a. helping verb c. adjective
b. verb d. adverb
SECTION 2 (____/15 Points Total)
Conjunctions and Transitions:
Directions: Combine the sentences by rewriting them in the space below each one with the proper conjunction, or fill in the blank with the proper conjunction. Use each word from the wordbank once.
Example: The cat sleeps all day. He hunts at night.
The cat sleeps all day and he hunts at night.
Wordbank
but and so now because after although that or
1. Diego likes most food. He doesn’t like spinach or artichokes.
2. No one washes their car in the Monterey Peninsula. There is a drought in California.
3. I need to study hard __________ I can pass the exam.
4. ____________ Sandy was very ill, she didn’t take any medicine.
5. All of the students had a party. The test was finished.
6. Lateisha ran a marathon yesterday. She finished her essay on time.
7. I didn’t know ____________ Anwar was coming home today.
Adventures in the USA
Read the paragraph below about Andres’ experience moving to the US. Then, fill in the correct subordinating conjunctions. Use the words from the list below to complete the task. Use each word once. Not every word in the wordbank will be used.
Wordbank
if when until who now however that and after because
_____________ Andres moved from Colombia to the United States it was the first time
he had travelled outside of the country. Since then, he has been studying biology at
the university. His biology professors, ________________ are very generous, have helped
him a lot with his degree. ________________, it hasn’t always been easy for Andres to
live in a foreign country. In fact, it wasn’t until he started studying biology that he
met a lot of his friends _______________ began to feel comfortable in the US. Upon his
arrival it was difficult to meet people _________________ he was new and unfamiliar
with US culture. His mother noticed this because Andres never went out on the
weekends. She told him, “If you want to make friends, you have to take risks and not
be afraid to talk to people.” _________________ hearing his mother’s advice, Andres
started talking to other students in his classes and he realized ______________ he and
his classmates shared a lot of things in common. _______________ Andres feels
comfortable with his new surroundings and his mother has a difficult time
convincing Andres to stop hanging out with his friends and spend time at home with
his family.
SECTION 3 (____/14 points total)
Short Answers: American Core Values
Directions: The paragraph below tells a story about American values. Read and answer the question(s). Write no more than 4 – 5 sentences.
Story 1:
Giovanni moved from Italy to New York when he was 18 years old to study business at an American university. He planned on moving back to Italy when he finished his studies but he fell in love with an American girl and they ended up getting married. After they graduated and got married, Giovanni became a US citizen. Giovanni and his wife then moved to a small town on the coast of California because his wife got hired as an accountant at a small firm there. Giovanni wanted to open an Italian restaurant in the town because the only Italian restaurant in town was Olive Garden. According to Giovanni, this was not real Italian food. After months of hard work the restaurant finally opened. At first, the restaurant struggled to make money because everyone still went to Olive Garden. To make his restaurant different, Giovanni decided to have a lunch special during the week. The lunch special was a great success but, to Giovanni’s surprise, Olive Garden started to have a lunch special, too. Then Giovanni decided to stay open an hour later than Olive Garden, but after two weeks Olive Garden changed their hours to the same as Giovanni’s!
Which of the six American values does Giovanni’s story represent? Explain with examples from the case study.
SECTION 4 (____/15 points total)
Written Response: Analyzing Advertisements
Directions: Choose one ad and analyze it using the categories we studied and practiced. Write a clear paragraph with a topic sentence and supporting sentences. Please include the following in your response:
a. Product
b. Target market
c. Description of the ad using adjectives of description
d. American value used
e. Strategy(s) used to sell the product and examples from the ad
f. Proper use of coordinating and subordinating conjunctions and transitions
Running Head: ORIGINAL TEST PROJECT
Original Writing Test for Monterey Peninsula College
Rachel Musgrove and Brock Ketterling
Middlebury Institute of International Studies at Monterey
Background
The ESL sequence at Monterey Peninsula College (MPC) prepares students for mainstream academic classes. The courses range from beginner (level 1) to university-level (level 6). For the original test project, we designed a test for Penny Partch’s High-Intermediate Writing about American Culture course (level 5). The objectives for this course are to develop writing skills and cultural literacy, with an emphasis on writing essays relevant to U.S. government, diversity, values, and innovations (MPC, n.d.).
From the beginning, we wanted to design a writing test. Brock already had an interest in teaching college writing and Rachel, who had only ever taught young learners, wanted to try something new. Brock had previously observed Penny’s class for the Teaching of Writing course taught by John Hedgcock at MIIS, at which point he gathered the initial information for our needs analysis for making the test (Appendix A). Graves (2014) mentions that a needs analysis should take into account the purpose of the course, the test developer’s own beliefs about testing, and information that is already known about the learners’ goals and proficiency levels. So after reviewing Brock’s observation notes, we interviewed Penny and gathered information about the goals of the course and the strengths and weaknesses of the students. She also provided us with course materials, such as the course textbook and previous assignments (Appendix B). This information helped us contextualize our test, establish our constructs, and rework any irrelevant tasks we had previously envisioned for the test. With Penny’s guidance, we began to design her midterm exam.
Purpose of Test
The midterm exam functioned as a progress test. Penny wanted to know what her students understood halfway through the term so she would know what to focus on during the second half of the term. The Writing for American Culture course (ENSL 346/446) is a content-based course. The students improve their writing skills by exploring topics concerning history, multiculturalism, and immigration in the United States. At the point we began designing the test, the students had learned about American values and advertising techniques. They had also covered grammar points such as parts of speech, conjunctions, transitions, sentence types and adjective clauses. Thus, we had to blend the grammatical portion of our test into the context of American advertising and values. Our goals were to test the students on their grammatical knowledge and writing ability, and to familiarize them with cultural practices that they will face in their mainstream academic courses and life in the US.
Target Audience
The learners who took the midterm exam were ESL community college students ranging from 18 to 40 years old. The class was made up of a mix of resident immigrants and generation 1.5 students. There were at least thirteen countries represented in this class, so the L1 of the students varied, as did their cultures and schooling histories. Some of the students had taken part in the whole ESL sequence at MPC and were used to the format of the class. Others had tested in from an outside institution and still needed help with more basic subject matter, like parts of speech. Regardless of their differences, the students viewed English as an important skill they needed to develop, and they understood the value of learning about the culture they live in.
Constructs
The ENSL 346/446 Midterm measures three overarching constructs: linguistic competence, sociolinguistic competence, and discoursal competence. These constructs were chosen because of their relevance to the curriculum of the course and their pertinence to academic writing in general. These three areas of knowledge will be crucial to the learners’ success for the duration of their studies, mostly with respect to formal writing. We have labelled the subsections of our test which measure linguistic competence as “Grammar” and the sections which measure discoursal and sociolinguistic competence as “Writing.” We realize that there is overlap between the constructs and the sections. For example, students cannot respond to the writing prompts without some degree of linguistic competence. To control the overlapping of constructs, we devised a rubric that took into consideration the multiple aspects of students’ writing. For instance, in the writing sections we only factored in specific grammar points such as adjective clauses, but not general syntax or grammatical accuracy. The rubric will be discussed in depth later in this paper. For additional definitions of our constructs, see Appendix C.
Linguistic competence
We modified Canale and Swain’s (1980) definition of linguistic competence as “knowledge of lexical items and of rules of morphology, syntax [and] sentence-grammar semantics” (Canale & Swain, 1980, p. 29). We omitted Canale and Swain’s inclusion of phonology within this definition because our test is mainly a test of grammar and writing and does not include any speaking or listening. Why test on grammar in what is meant to be a communicative EAP course? Ellis and Shintani (2014) say that grammar is redundant and that much of the meaning which humans communicate to one another can be conveyed through context and lexis (p. 54). However, as Widdowson (1990) puts it, “grammar frees us from a dependency on context and the limitations of a purely lexical categorization of reality” (p. 86). A focus on grammar as a meaning-making resource (Celce-Murcia & Larsen-Freeman, 2016) is especially important in writing, as a piece of text can communicate meaning across space and time, whereas spoken language is often limited to a specific instance for those who hear it. Therefore, it’s especially important that the English language learners in an EAP course learn to communicate their messages accurately, as their writing will ultimately exist as a decontextualized entity when it is assessed by their instructors or read by their peers.
The specific grammar points that we assess in our midterm exam are important to meaning-making for a variety of reasons. The first grammar subtest focuses on parts of speech, or the terminology assigned to various categories of words in accordance with their syntactic functions. The parts of speech that we included in our test were nouns, verbs, pronouns, adjectives, adverbs, prepositions and conjunctions. Although it is not exactly necessary to know the name of an adjective or adverb in order to write an academic paper (as many native English speakers informed us while we were preparing this test!), knowing this terminology helps students develop a “meta-language,” or a way of talking about grammar that helps learners to conceptualize it (Celce-Murcia & Larsen-Freeman, 2016, p. 17). Being able to break down a sentence into its basic building blocks will help these English language learners understand what makes “effective” college writing. More specifically, it will enable them to see why certain words or combinations of words sound more formal or more persuasive than others. Knowing the terminology for the individual elements of language will help the students combine them to build socio-culturally appropriate sentences and, ultimately, more convincing academic papers.
Conjunctions, which are the main focus of the second grammar subtest, eliminate redundancies in academic writing (Celce-Murcia & Larsen-Freeman, 2016, p. 481). Words like “and,” “but,” “because,” “so” and “although” help make writing less choppy and redundant (p. 489). Being able to identify conjunctions and deploy them correctly helps English language learners achieve cohesion and coherence (also important in discoursal competence), and finally a more “sophisticated” level of writing. Linguistic competence, then, is essentially the foundation for our next two constructs as well: sociolinguistic competence and discoursal competence.
Sociolinguistic competence
Canale and Swain (1980) refer to sociolinguistic competence as the rules that “specify the ways in which utterances are produced and understood appropriately” (p. 30). Bailey and Curtis (2015) extend this definition to include the ability to apply these rules to discourse. Swain (1984) emphasizes that sociocultural competence has mostly to do with an awareness of contextual factors such as “topic, status of participants and purposes of the interaction” (p. 188). In other words, it is the learners’ ability to adapt their language to certain situations, depending on who they are talking to and the relative social distance between themselves and their interlocutors. In the specific context of our test, test takers will be responsible for writing in the appropriate register corresponding to US-American academic writing, which they have been learning both in this course and in previous courses at MPC (if they’ve progressed through the MPC ESL system). Specifically, they should write in a formal manner with complete sentences, attend to correct grammar and punctuation, and avoid slang terms or addressing the reader as “you.”
Sociolinguistic awareness is important for these particular language learners because some of them may be entering a different realm than the ones in which they have previously used English. As previously mentioned, ENSL 346/446 consists of a diverse range of students from various countries and educational backgrounds. Even those who were born in or moved to the U.S. at a young age may be “ear learners” of English and may not be aware yet of different registers. Therefore, it is crucial for them to realize how grammar and lexis differ between spoken and written language, or even different modes of written language. For example, the way English is used on social media differs vastly from that expected in a college essay. The ability to use the appropriate language in the appropriate setting keeps learners from embarrassing themselves or potentially offending an interlocutor by leading them to believe that the learner does not take the situation seriously enough. Sociolinguistic competence is closely linked with discoursal competence because both are related to situational awareness of language. Discoursal competence, however, has more to do with the genre of writing, as we will explain in the next section.
Discoursal competence
Discoursal competence is associated with the overall organization of a text. Swain (1984) elaborates that discoursal competence is knowing “how to combine grammatical forms and meanings to achieve a unified spoken or written text in different genres” (p. 189). The focus on genres is important because discourse is essentially what distinguishes an academic essay from a bulleted list or a repetitive formula. Establishing this unity throughout a text relies on the writer’s control over what Canale (1983) calls “coherence and cohesive devices.” These devices refer back to our description of linguistic competence. In order to establish cohesion and coherence in a text, a learner must know how to correctly use elements such as pronouns, synonyms, conjunctions and ellipsis to relate individual utterances in a logical manner. In our test specifically, we assessed the learners on their ability to write an argumentative paragraph (Section IV), which included a thesis statement, supporting evidence, and concluding statements. Test takers were also expected, obviously, to use the grammatical features discussed earlier in this essay, namely coordinating and subordinating conjunctions, so that their writing flowed smoothly and showed a variety of sentence types.
Penny wanted us to focus especially on paragraph structure in our test because she found that her students had quite a varied understanding of how to sequence their writing in a logical manner. The instructors of the mainstream academic courses at MPC have expectations about what effective writing structure looks like, so it is vital that the ESL students master these patterns now before they advance to higher-level courses.
Test Methods and Organization
Our test (Appendix D) has four sections. The first two sections are objectively scored and feature discrete-point items. The last two sections are subjectively scored constructed-response items. The test is worth a total of 61 points, and the sections were weighted more or less evenly.
Section I (Grammar) – Parts of Speech
Section I, part i of our test consists of seven matching items which assess students’ ability to define parts of speech. We began our test with this task because it laid the foundation for the rest of the subtests and was the most basic knowledge the students were tested on. In the left column was a list of terminology like “noun” and “adverb.” Students then had to match those words with the definitions in the right column. The definitions were modified from a handout Penny provided her students (Appendix B), so the concepts were already familiar to them, especially if they had been in the previous ESL courses at MPC. To work for positive washback, we always tried to adapt our tasks from assignments they had already done in class.
Section I, part ii consists of ten multiple-choice items in which we assessed students’ ability to identify parts of speech within sentences. We created two types of multiple-choice questions so that students had to pay extra attention to what the question was asking. Multiple choice can be redundant, so we figured that switching up how the questions were formatted would cause students to focus on what was being asked of them. For one type of multiple-choice question, we created our own sentences for the stem and underlined the part of speech being tested for the question. Students then had to pick the correct term in the list of options, labeled a through d. For the other type of multiple-choice question, we posed a question in the stem that asked students to find a particular part of speech that was written in the list of options. In the options there were sentences with different parts of speech underlined. For each of our multiple-choice questions, there was only one key.
Section II (Grammar) – Conjunctions and Transition Words
Section II of our test assessed the use of conjunctions and transition words. Penny had been teaching her students about subordinating and coordinating conjunctions and transition words so that they could create different types of sentences and paragraphs that were cohesive and coherent. To create the task for this subsection, we took one of her existing assignments and adapted it for our test (Appendix B). This task had two different types of questions. One asked the students to rewrite two independent sentences by combining them with the correct conjunction found in a wordbank. The other question asked students to fill in the blank of an already completed sentence with the correct conjunction also found in the wordbank. The main objective for this task was for students to analyze conjunctions on a sentential level.
Section II, part ii was a paragraph-long cloze passage telling the fictionalized story of an immigrant student in the United States. The students had to read the passage and choose the correct conjunctions and transition words from the wordbank provided for this subsection. Since this was a content-based course about American culture, we tried to create tasks that reflected the aspects of American culture the students were studying. We also wanted this subsection to highlight the importance of paragraph structure; the realistic story contextualized how conjunctions and transitions provide cohesion and coherence in writing.
Section III (Writing) – American Values
Section III of our test was a constructed-response section. We created a fictionalized case study about the story of an immigrant entrepreneur in the US and the students were asked to identify the American values that were associated with his story. The students were asked to write no more than five sentences. Their answers did not have to be formatted into a paragraph, either. The main objective for this task was that they could identify the American values in a contextualized story.
We designed this section based on two assignments the students had done in class (Appendix B). One assignment was a case study about plagiarism. Students had to read a story about either a Russian or Japanese student (depending on which handout the student received) at an American university who was caught cheating during a quiz. Students then had to answer questions to discover why the Russian or Japanese student had cheated and make inferences about cultural differences between the student’s home country and the US. For homework, they were asked to write a paragraph about their findings and discussions.
The other assignment was a chart that focused on six American values (individualism, self-reliance, equality of opportunity, competition, material wealth, and hard work). Students had previously been discussing the different aspects of these values, but for this assignment students had to find a US example of each value to illustrate its significance and then compare the example to how it would be viewed in their home country. Students discussed their answers with classmates, acting as cultural ambassadors for their countries.
By combining and adapting these two activities, we were able to create a writing task that not only reflected what the students had been practicing and studying in class, but also challenged them to analyze a story they were not familiar with. This subsection prepared students for the final section, which was more demanding and all-encompassing.
Section IV (Writing) – Analyzing Advertisements
Section IV of our test was also a constructed response; students had to compose a paragraph that analyzed a US advertisement. The students had the choice between two different advertisements, either a Burger King advertisement or a Chanel advertisement featuring Brad Pitt. We picked these advertisements for two reasons. Firstly, the students were likely to be familiar with at least one of the companies and the products advertised. Secondly, even if they were not, the products were pictured in the advertisements themselves, making it obvious what was being sold. After the students chose their advertisement, they had to write a paragraph with a topic sentence and supporting sentences that included a description of the product, the target market, the American values used, and the strategies used. These objectives were listed so that students understood what was expected of them.
We developed this section of the test by combining the other sections of the test with an advertising assignment the students had been working on that focused on interpreting advertising strategies in the US (Appendix B). Our goal for the final section was to incorporate all of the constructs in our test; students had to produce their own paragraphs, supply their own conjunctions and transitions, and use their knowledge of the concepts they learned in class to create a well-structured paragraph.
Pre-piloting
The ENSL 346/446 midterm was pre-piloted on October 14th, 2015. Rachel administered the test to Dr. Kathleen Bailey and Jennifer Dowrie, a classmate in the Language Assessment course at MIIS. Both test-takers were able to offer some very valuable feedback (Appendix E). In conjunction with some feedback from Penny, we drafted a second version of the test with some notable changes from the first. One of the biggest changes between the first and second draft was the Section IV writing task. In the initial version of the test, learners were presented with four different advertisements and asked to choose two and write an essay comparing and contrasting them. Penny informed us that this task was a little beyond what the learners had already practiced in class, so we narrowed the task to include only two advertisements, from which the learners had to pick one and simply analyze it using the concepts they had studied in class. The task was also shortened from an entire essay to only a paragraph, which we all thought would be more manageable for the students (and also easier for us to grade!). We also omitted one of the case studies that we had originally drafted for the test in Section III, because it turned out not to align with Penny’s curriculum as much as we originally thought.
Jennifer and Dr. Bailey also suggested some formatting changes that overall made our test much more visually appealing and easier to read. For instance, we condensed the multiple-choice options wherever we could and trimmed some unnecessarily long sentences. We also changed some options for the multiple choice which turned out to be misleading or confusing. One especially helpful tip from Dr. Bailey was the suggestion to add extra words to the wordbank for the cloze passage and the sentence rewrite in Section II. This is a tactic to “bias for best,” so that the test-takers have some leeway to make an error on one question without jeopardizing their chances to answer other questions correctly. After Penny and Dr. Bailey approved our newly edited version, we were ready to bring our test to the people who mattered most: the learners in ENSL 346/446.
Piloting
There was an air of excitement and a nervous buzzing in Penny’s classroom when we arrived to pilot our test. Penny had wanted us to come and introduce the test ourselves so that we could explain it and field any questions the learners might have. The only question the learners had, however, was “can we start already?” No one seemed surprised or confused by the tasks, because the structure was familiar to them: they had completed similar tasks before. Penny thought that it wouldn’t be a good idea for us to stay while the students took the test, as the presence of strangers in the room might make the test-takers more nervous than they already were. So after introducing the test, we left and returned later in the week to pick up the results for analysis.
Scoring Process & Results
We graded a total of 34 tests from Penny’s class. We began by scoring the objective sections of the test with the key in Appendix E. As we graded the test, some of the test-takers surprised us by responding with keyable answers that we had not previously noticed. Take, for example, question 5 in Section II, part i:
5. All of the students had a party. The test was finished.
Test-takers were asked to conjoin the two sentences using one of the conjunctions from the wordbank provided. The correct answer we had initially written in the key was “after,” but a number of students responded with “because.” Both answers are actually perfectly grammatical and make sense from a semantic standpoint. We couldn’t penalize the students for choosing a correct answer just because they couldn’t read the test developers’ minds, so we adapted our key to include “because” as a correct answer. The same actually happened with the next question (question 6 in Section II, part i) when we discovered that with some creative maneuvering, “although” would actually be an acceptable answer when it is in the sentence-initial position. The experience of modifying our answer key taught us a lesson about controlling for potentially keyable distractors. When developing future tests, we should be more thorough in trying out all of the distractors in a wordbank before finalizing the test.
Even with some surprises, the grading of the objective portions of our test was much quicker and more straightforward than our grading of the subjective sections. We developed two different rubrics for the two subjectively scored sections of the test, which can be found in Appendix G. Each rubric is task specific. That is to say, their criteria and descriptors reflect specific features of their elicited performance (CARLA, n.d.). We chose this type of rubric because Penny’s expectations for each subtest were very precise. Section III was designed to test learners’ knowledge of American values and their ability to support an argument with examples from a text. Section IV was designed to assess learners’ ability to identify aspects of an advertisement, describe an image with adjectives and build a coherent and cohesive paragraph. To our knowledge, there existed no rubric which contained these descriptors exactly, so we developed one ourselves. Although we had initially expected to use an analytic rubric for these sections, we eventually settled on a holistic rubric instead. Holistic rubrics require raters to make judgments based on an overall impression of a performance, which is then assigned a score based on bands (Weigle, 2002, p. 113), or descriptors of each level. We chose a holistic rubric mostly for practical reasons; there were many test takers and each one wrote two short essays. A holistic rubric served to save time, minimizing the number of decisions that we as raters had to make (CARLA, n.d.). An analytic rubric would have taken more time to create and use to evaluate each text. The holistic rubric also reduced the chances that we as raters would disagree, since we each only had to settle on one number, therefore increasing our potential inter-rater reliability.
Typically in holistic scoring, each band corresponds to a single score, based on descriptions of what a writing sample at this band-level should look like. We interpreted this process in a somewhat original way and assigned each band instead to a range of scores (for example, 10–12 or 13–15). We did this in an effort to keep each section weighted relatively equally, without having to write a separate band description for each score level 0 through 14 (Section III) or 1 through 15 (Section IV). However, had we an opportunity to redo our scoring process from the beginning, we might have chosen a different method, as the act of converting the initial band score to a number between 1 and 14 or 0 and 15 proved to be somewhat confusing at the start. Regardless of our rough start, we were fairly consistent in our awarding of the same or similar scores, as can be seen in Table 1. We achieved such consistency by going through a “norming process,” in which each rater read the same three papers and assigned them a grade based on the scale, and then we compared our scores to see if we agreed on all the criteria. Luckily, our first three sample papers provided a broad range of scores, which acted as benchmarks for us as we graded the remaining 31 papers. Weigle (2002) explains how benchmark scripts serve as an “anchor” for raters, as they perfectly exemplify the criteria for that level. By referencing the benchmarks, raters can be carefully trained to adhere to the rubric when scoring scripts (p. 112). Since we were fortunate enough to establish our scoring criteria together, we essentially trained ourselves and each other in the use of the rubric, which is one explanation for our relatively consistent scoring method.
Table 1
Midterm Exam Scores by Learner
Learner  Grammar Subtest 1  Grammar Subtest 2  Writing Subtest 1        Writing Subtest 2        Total Score
1        9                  15                 R1(12) R2(14) = 13       R1(14) R2(14) = 14       51
2        11                 15                 R1(12) R2(13) = 12.5     R1(13) R2(14) = 13.5     52
3        10                 12                 R1(12) R2(12) = 12       R1(12) R2(12) = 12       46
4        11                 13                 R1(10) R2(10) = 10       R1(14) R2(14) = 14       48
5        9                  13                 R1(12) R2(13) = 12.5     R1(8) R2(9) = 8.5        43
6        4                  14                 R1(4) R2(2) = 3          R1(6) R2(6) = 6          27
7        8                  15                 R1(12) R2(13) = 12.5     R1(13) R2(13) = 13       48.5
8        17                 15                 R1(13) R2(13) = 13       R1(13) R2(14) = 13.5     58.5
9        16                 13                 R1(12) R2(12) = 12       R1(12) R2(12) = 12       53
10       10                 15                 R1(14) R2(14) = 14       R1(12) R2(12) = 12       51
11       8                  15                 R1(13) R2(14) = 13.5     R1(13) R2(14) = 13.5     50
12       17                 14                 R1(14) R2(14) = 14       R1(14) R2(15) = 14.5     59.5
13       10                 14                 R1(13) R2(12) = 12.5     R1(14) R2(13) = 13.5     50
14       17                 11                 R1(12) R2(13) = 12.5     R1(12) R2(14) = 13       53.5
15       14                 12                 R1(12) R2(12) = 12       R1(11) R2(11) = 11       49
16       16                 15                 R1(13) R2(13) = 13       R1(13) R2(13) = 13       57
17       16                 15                 R1(14) R2(14) = 14       R1(14) R2(14) = 14       59
18       13                 12                 R1(12) R2(12) = 12       R1(12) R2(12) = 12       49
19       17                 12                 R1(13) R2(13) = 13       R1(12) R2(12) = 12       54
20       13                 10                 R1(13) R2(14) = 13.5     R1(13) R2(13) = 13       49.5
21       16                 15                 R1(12) R2(12) = 12       R1(12) R2(12) = 12       55
22       12                 13                 R1(14) R2(14) = 14       R1(14) R2(14) = 14       53
23       16                 11                 R1(13) R2(14) = 13.5     R1(13) R2(14) = 13.5     54
24       16                 15                 R1(13) R2(14) = 13.5     R1(15) R2(15) = 15       59.5
25       16                 14                 R1(11) R2(11) = 11       R1(14) R2(14) = 14       55
26       14                 15                 R1(12) R2(14) = 13       R1(14) R2(13) = 13.5     55.5
27       8                  9                  R1(9) R2(9) = 9          R1(11) R2(11) = 11       37
28       15                 15                 R1(13) R2(13) = 13       R1(13) R2(13) = 13       56
29       15                 12                 R1(11) R2(10) = 10.5     R1(11) R2(11) = 11       48.5
30       11                 13                 R1(13) R2(14) = 13.5     R1(10) R2(10) = 10       47.5
31       17                 14                 R1(14) R2(14) = 14       R1(15) R2(15) = 15       60
32       14                 12                 R1(13) R2(14) = 13.5     R1(15) R2(15) = 15       54.5
33       16                 14                 R1(13) R2(12) = 12.5     R1(12) R2(12) = 12       54.5
34       17                 15                 R1(12) R2(14) = 13       R1(14) R2(15) = 14.5     59.5
R1 = Rater One; R2 = Rater Two
Table 2 Midterm Exam Descriptive Statistics (n=34)
Test       Points    Mean   Mode  Median  Range  Standard   Variance
           Possible                              Deviation
Subtest 1  17        13.21  16    14      13     3.51       12.29
Subtest 2  15        13.44  15    14      6      1.65       2.74
Subtest 3  14        12.37  13    13      11     2.02       4.10
Subtest 4  15        12.69  12    13      9      1.88       3.55
Total      61        51.41  59.5  52.5    33     6.67       44.55
The frequency histograms (Appendix H) are all negatively skewed, showing that in
general, students did well on the exam. Most students had a total score higher than 47,
which is a grade of 77% (“C” on a typical US letter grading scale). Section I – Grammar
(Parts of Speech) was the most difficult, judging from the fact that it had the broadest range
and most even mix of scores. We can tell just from a superficial analysis that Section I, part i
tripped a lot of people up; it seems that they were not as familiar as we had hoped with the
parts of speech terminology and definitions. The matching section was also tricky for some
students because there were exactly as many definitions as there were terms to match to.
This issue goes back to what Dr. Bailey said about “biasing for best” – once students missed
one question they were doomed to miss at least another one. However, we don’t think that
the students’ struggles with this section are due entirely to our format of the task. Many
students also misidentified the parts of speech in the multiple choice sections, even with
contextualized examples of the words. In particular, many test-takers missed question 2
and question 8, which deal with coordinating and subordinating conjunctions. Interestingly
enough, Section II, which dealt with conjunctions and transitions specifically, was relatively
easy for the students in comparison. The scores for this section only had a range of six, and
the histogram is clearly negatively skewed. This shows that the test-takers seem to
understand how conjunctions work in context; they just don’t know how to label them. This
information was very useful to Penny, who then knew she should review these terms
again before the end of the term.
Swain’s Communicative Testing Framework as it Applies to our Test
Swain (1984) puts forth four criteria by which communicative tests should be
evaluated: (1) start from somewhere, (2) concentrate on content, (3) bias for best, and (4)
work for washback. “Starting from somewhere” saves test developers from having to
“reinvent the wheel,” so to speak, and refers to the fact that the test should be relevant to
the learners’ needs, goals, identities and previous knowledge in some way. To “concentrate
on content,” test developers must ensure that all of the material on the test (stimuli and
tasks posed to the learner) gives learners the opportunity to show off all four components of
communicative competence: grammatical, sociolinguistic, discoursal and strategic
performance (p. 190). Test developers have “biased for best” if they have done “everything
possible to elicit the learners’ best performance” (Swain, 1984, p. 195). Bailey and Curtis
(2015) define washback as the “effect a test has on teaching and learning” (p. 349). Ideally,
a test should work for positive washback, meaning that the act of preparing for and taking a
test should help learners achieve an overall desirable level of fluency (Swain, 1984, p. 196–
197). In Table 3, we have outlined how our test exemplifies Swain’s four communicative
testing principles.
Table 3: Swain’s (1984) Four Principles of Communicative Language Testing
Swain’s Test Analysis Principles    ENSL Midterm

Start from somewhere
● Progress test for college intermediate writing skills.
● Our constructs: Linguistic competence, sociolinguistic competence, discoursal competence
● Content: The course materials Penny gave us, observations and previous knowledge about the course.

Concentrate on content
● Motivating presentation – color photos of familiar, eye-catching advertisements in Section IV
● Substantive – Cloze passage and Section III case study contained relevant stories which were new to students, presented new perspective on the immigration stories they had encountered in class.
● Integrated – Sections II, III and IV revolved around familiar themes of immigration, American values and multiculturalism. Admittedly, Grammar subtests (Sections I and II) could have been more integrated, as we will explain more in the reflection.
● Interactive – New substantive content in the form of the case study and the advertisements presented, gave learners an opportunity to respond with their opinions and original ideas.

Bias for best
● Presenting the test in person – we explained what test-takers needed to do and answered any questions they had
● Based all tasks off of previous assignments in class (no surprises, except for the original content)
● Students were informed ahead of time of what general topics would be tested on.
● Explicit instructions – condensed version of the rubric included in the writing sections so that test-takers knew exactly what they would be scored on.
● Sequence of test materials – starting with basic parts of speech scaffolded the following sections, test takers could refer back to the definitions as a resource throughout the test.

Work for washback
● Penny was involved in the development of the test, its administration and its scoring (as a consultant)
● Test gives opportunity to prepare for mainstream academia and also life in the US, being aware of advertisements and the values they impart.
● Test-takers become familiar with American academic discourse and American culture as well
● Sharing answers with Penny provided feedback on exactly which students struggled with which concepts, she will use the information to guide her instruction for the rest of the semester
Wesche’s Framework as it Applies to our Test
The following tables illustrate Wesche’s (1983) model for language testing. The first
component in Wesche’s framework is stimulus material, which refers to information
presented to test takers that prompts them to demonstrate the skills intended to be assessed. The
second component is task posed to the learners, which is how students understand the task
presented to them, and the third is learner’s response, which is their response to the task
and how well they do so. The final component is scoring criteria, which is what is used to
score the task; without scoring criteria the task is merely an activity (Bailey & Curtis, 2015).
Table 4 analyzes sections I and II of our test, which focus on grammar; Table 5 analyzes
sections III and IV, which focus on writing.
Table 4: Wesche’s (1983) four components of a language test - Section I & II
Wesche’s Test Analysis Components    Grammar

Stimulus material
Subtest 1
● matching: The mismatched concepts and definitions of parts of speech are stimulus materials.
● multiple choice: The sentences found in the item stem or item options with underlined parts of speech are stimulus materials.
Subtest 2
● sentence rewrite: The two separate sentences, the sentences with missing conjunctions, and the word bank with the list of conjunctions are stimulus materials.
● cloze passage: The story of the immigrant student and the word bank with the list of conjunctions and transitions are stimulus materials.

Task posed to the learner
Subtest 1
● matching: This task asks test-takers to sort through the mismatched concepts and definitions of parts of speech and find the correct match.
● multiple choice: This task asks students to identify certain parts of speech that are underlined in sentences. Students then pick the correct answer in a list of options labeled a through d.
Subtest 2
● sentence rewrite: This task asks test-takers to either rewrite two separate sentences by connecting them with the appropriate conjunctions found in a word bank, or fill-in-the-blank part of a sentence with an appropriate conjunction found in a word bank. (As mentioned above, these questions are formatted in two different ways)
● cloze passage: This task asks test-takers to read a paragraph with missing conjunctions and transitions. Test-takers have to choose the appropriate conjunctions and transition word in the word bank to make the paragraph cohesive and coherent.

Learner’s response
Subtest 1
● matching: Since sections I and II are discrete-point items, the test-takers only need to match each part of speech that is represented in the left column of the table with the appropriate definition that is listed in the right column. Each definition has a letter representing it, so test-takers write the letter next to the part of speech in the left column.
● multiple choice: After reading the question stem, the test-takers circle the correct letter in the list of options.
Subtest 2
● sentence rewrite: Test-takers either rewrite the two sentences by combining them with the appropriate conjunction from the word bank, or they just write the correct conjunction in the blank space provided in the sentence. (As mentioned above, these questions are formatted in two different ways)
● cloze passage: Test-takers choose the correct conjunction or transition word from the word bank and write it in the blank spaces of the paragraph

Scoring criteria
Subtest 1
● matching: There is only one correct answer since this is a discrete-point item. There is a key available for all the objectively scored items (Appendix F)
● multiple choice: (See above)
Subtest 2
● sentence rewrite: (See above)
● cloze passage: (See above)
Table 5: Wesche’s (1983) four components of a language test - Section III & IV
Wesche’s Test Analysis Components    Writing

Stimulus material
Subtest 3: The fictionalized story is stimulus material.
Subtest 4: The pictures of the advertisements are stimulus materials.

Task posed to the learner
Subtest 3: This task asks test-takers to read a story and write four to five sentences about which American values apply to it. In addition to identifying the values present in the story, test-takers must support their claims with concrete examples from the text.
Subtest 4: This task asks test-takers to choose one of the two advertisements and write a paragraph with a topic sentence and supporting sentences. The supporting sentences must include:
a) descriptions of the product
b) descriptions of the target market
c) description of the ad using descriptive adjectives
d) description of the American values used in the ad
e) description of the ad strategies

Learner’s response
Subtest 3: This component is the actual written response. It could also include notes taken about the story.
Subtest 4: This component is also the written response. It could also include any outlining that the test-taker prepared before writing the paragraph.

Scoring criteria
Subtest 3: We created a holistic rubric for this section, scaled 0 to 14 (Appendix G).
Subtest 4: We created a holistic rubric for this section, scaled 0 to 15 (Appendix G).
*To determine the total scores of these sections, we averaged the two raters’ scores.
Reflection
This was the first time either of us had designed our own test and we were fortunate
to work together on such a demanding task. It was easy to overlook the most minute details
in our design so it was beneficial to test our ideas on one another and our classmates. And
even with the amount of time spent constructing this test, Penny’s students answered
items on our test in ways that we had not anticipated. The whole process has been
revealing and through our semester-long research on language assessment some powerful
lessons have emerged that have helped shape our testing philosophies.
For instance, initially, neither of us found multiple-choice questions very valuable
when testing a learner’s language ability. Multiple-choice allows students to guess even if
they do not know the answer, so that is potentially an issue when assessing someone’s
language knowledge; the assessor does not always know if the test-taker truly understands
the concept being tested or not. However, multiple choice can also be a relief for the
test-taker for it provides options for the test-taker to consider. We both ended up feeling that
teachers should proceed with caution when using multiple-choice items because we would
not want a student’s grade to hinge on these types of items. Multiple choice is not
necessarily unhelpful, but perhaps these items should only be included in sections of a test
that do not carry a lot of weight, or only used for small quizzes in-class. It is also worth
mentioning that good multiple-choice items are difficult to create, and, in turn, students do
not always benefit from them because of their flaws.
Another step we could have taken to improve the test takers’ experience would be
to concentrate more on content. We could have integrated the themes of American values
or multiculturalism more into our discrete-point tasks by perhaps taking or adapting the
sample sentences from texts they had read already for class. This would have made the test
overall more cohesive and unified around the content-based theme.
Additionally, we learned a lot about scoring subjective portions of a test. Not only
did we gain experience in creating rubrics and norming, we also realized some ways in
which we could streamline this process to avoid confusing and time-consuming
conversions between the band-levels and the scores. If we could redo the whole scoring
process, we would develop a 7-band scale for Section III instead of a 3-band one.
Then, rather than converting the score to one between 0 and 14, we would simply add our
two rater scores together to get each test taker’s final score for that section. We would
follow a similar process with Section IV so that there were fewer decisions
involved in the scoring process for each subjectively scored section. Unfortunately, this
solution did not occur to us until after we had completed our analysis of the test, but it’s
valuable insight for the next time we create a test.
There are, of course, some positive takeaways from this test developing experience.
For example, we are pleased with the visual formatting of our test, which we feel has a nice
balance of white-space and text. We were told by Penny that a crowded layout can make
test-takers feel anxious, but ours was apparently very “non-threatening” in appearance. We
also feel that our directions for the various testing tasks were very clear and succinct. We
also presented them in several modes – both written and spoken, which helps to bias for
best. What’s more, we also like the idea of each test task scaffolding the next one, which we
tried to accomplish by starting with the more foundational tasks like identifying parts of
speech, then moving gradually to more complex tasks like writing a whole paragraph or
essay. These are all practices which we will continue throughout our careers as educators
and test developers.
When we finished grading the test, we had coffee with Penny and showed her our
findings. We were able to pinpoint what topics her students were still struggling with and
what topics they understood. She was pleased with our findings and she said our test
measured what she was looking for. She also was able to provide us with background
information about certain students and describe why they may have performed well or
poorly on the test. Some of her students have families to provide for. Some students have
full-time jobs and still travel long distances to school. Some students come from war-torn
countries and are still assimilating into the culture. These individual stories reminded us
that students and test-takers have lives outside of the language classroom and should be
treated as persons in context; there are many factors that contribute to how a test-taker
performs on a test.
References
Bailey, K. M., & Curtis, A. (2015). Learning about language assessment: Dilemmas,
decisions, and directions. Boston, MA: National Geographic Learning.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to
second language teaching and testing. Applied Linguistics, 1, 1-47.
Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller (Ed.),
Issues in language testing research (pp. 333-342). Rowley, MA: Newbury House.
CARLA: Center for Advanced Research on Language Acquisition. (n.d.). Types of rubrics.
Retrieved October 15, 2015 from
http://www.carla.umn.edu/assessment/vac/improvement/p_5.html
Celce-Murcia, M., & Larsen-Freeman, D. (2016). The grammar book. Boston, MA: National
Geographic Learning.
Ellis, R., & Shintani, N. (2014). Exploring language pedagogy through second language
acquisition research. New York, NY: Routledge.
Graves, K. (2014). Syllabus and curriculum design for second language teaching. In
M. Celce-Murcia, D. M. Brinton, & M. A. Snow (Eds.), Teaching English as a second
or foreign language (pp. 46-62). Boston, MA: Heinle.
Swain, M. (1984). Large-scale communicative language testing: A case study. In S. J.
Savignon & M. Berns (Eds.), Initiatives in communicative language teaching (pp. 185-201).
Reading, MA: Addison-Wesley.
Tedick, D. J. (2002). Proficiency-oriented language instruction and assessment: Standards,
philosophies, and considerations for assessment. In Minnesota Articulation Project,
D. J. Tedick (Ed.), Proficiency-oriented language instruction and assessment: A
curriculum handbook for teachers (Rev. Ed.). CARLA Working Paper Series.
Minneapolis, MN: University of Minnesota, The Center for Advanced Research on
Language Acquisition.
Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.
Wesche, M. B. (1983). Communicative testing in a second language. The Modern Language
Journal, 67, 41-55.
Widdowson, H. G. (1990). Grammar, nonsense and learning. In H. Widdowson (Ed.), Aspects
of language teaching. Oxford: Oxford University Press.
Monterey Peninsula College. (n.d.). Courses Offered: ENSL. Retrieved from
http://www.mpc.edu/academics/academic-divisions/humanities-division/english-as-a-second-language-ensl-/esl-program-sequence/ensl-346-446
Wiggins, G., & McTighe, J. (2005). Understanding by design. Columbus, OH: Pearson
Running Head: ORIGINAL TEST PROJECT 1
Original Writing Test for Monterey Peninsula College, Part II
Rachel Musgrove and Brock Ketterling
Middlebury Institute of International Studies at Monterey
In part one of our paper, we described the background and design of the ENSL 346/446
midterm for Penny Partch’s High-Intermediate Writing: American Culture class. With this
foundation, we can now analyze some of the data from the test piloting process. Specifically, we
will examine item facility, item discrimination, distractor analysis, response frequency
distribution, split-half reliability, inter-rater reliability and subtest relationships as they relate to
our data. We will interpret what these statistics mean in terms of the success of individual test
items and the test as a whole, as well as its reliability, validity, practicality and washback.
Item Facility
Table 1 Section I Item Facility (n=34)
Item  Students who answered   Item Facility (I.F.)
      the item correctly
1     33                      0.97
2     32                      0.94
3     31                      0.91
4     23                      0.68
5     22                      0.65
6     27                      0.79
7     19                      0.56
8     30                      0.88
9     15                      0.44
10    28                      0.82
11    24                      0.71
12    28                      0.82
13    29                      0.85
14    21                      0.62
15    25                      0.74
16    29                      0.85
17    32                      0.94
Average I.F. = 0.77
Table 2 Section II Item Facility (n=34)
Item  Students who answered   Item Facility (I.F.)
      the item correctly
1     32                      0.94
2     30                      0.88
3     31                      0.91
4     33                      0.97
5     34                      1.00
6     30                      0.88
7     34                      1.00
8     26                      0.76
9     33                      0.97
10    32                      0.94
11    27                      0.79
12    32                      0.94
13    25                      0.74
14    30                      0.88
15    29                      0.85
Average I.F. = 0.90
Item Facility, according to Bailey and Curtis (2015), is “an index of how easy an
individual item was” for the people who took the test (p. 198). To calculate IF for each test item,
we divided the number of test-takers who answered the item correctly by the total number of
test-takers. Tables 1 and 2 show the IF values for the first two subsections of the ENSL 346/446
midterm, both of which measure grammar knowledge. Oller (1979) describes an ideal IF value as
falling between 0.15 and 0.85, because such values indicate more variance among test takers. IF
values closer to 0 or 1.00 do not yield enough variance to be “useful.” If Oller saw our IF scores for
Sections I and II, he would regard them as quite dismal. In Section I, items 1, 2, 3, 8, and 17
yielded IF values of 0.88 or higher. In Section II, nearly every item except 8, 11, 13, and 15 had a
higher IF than “preferred.” In fact, items 5 and 7 exhibited the ceiling effect, which is when every
test-taker gets the item correct.
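
As an illustration, the IF calculation described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function names `item_facility` and `outside_preferred_band` are hypothetical, the counts are copied from Table 2, and the 0.15 to 0.85 band is the one attributed to Oller (1979) in the text.

```python
def item_facility(num_correct: int, num_test_takers: int) -> float:
    """Return the proportion of test-takers who answered an item correctly."""
    if num_test_takers <= 0:
        raise ValueError("num_test_takers must be positive")
    return num_correct / num_test_takers


def outside_preferred_band(if_value: float, low: float = 0.15, high: float = 0.85) -> bool:
    """Flag an IF value that falls outside the 0.15-0.85 band (hypothetical helper)."""
    return if_value < low or if_value > high


N_TEST_TAKERS = 34  # size of the pilot group

# Correct-answer counts for a few Section II items, copied from Table 2.
section2_correct = {5: 34, 8: 26, 15: 29}

for item, correct in section2_correct.items():
    # Round to two decimals first, as the tables do, so 29/34 is treated as 0.85.
    if_value = round(item_facility(correct, N_TEST_TAKERS), 2)
    print(item, if_value, outside_preferred_band(if_value))
```

Rounding before the band check is a design choice made here so that the sketch agrees with the way the tables report values; an unrounded comparison would treat 29/34 as slightly above 0.85.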
If we were to follow Oller’s (1979) advice, we would rewrite these items to be more
challenging. But as Bailey and Curtis (2015) mention, criterion-referenced tests are usually
meant to be a measure of individual students’ knowledge, and not to yield a normal distribution
of test scores. Our goal in developing the ENSL 346/446 midterm was not to obtain a broad
variance of scores, but rather to help Penny see how well her class has understood the course
material thus far in the semester. Nevertheless, these numbers are very informative, especially
the average IF for both subtests. For example, Section I has an average IF of 0.77, which falls
within Oller’s range of 0.15 to 0.85 and shows that it was moderately difficult for test-takers.
This value is especially revealing in comparison to the average IF for Section II, which is 0.90.
Such a high IF value for Section II indicates that this subtest was much easier for the learners.
This stark contrast comes as no surprise when we consider that Section I was intended to
measure metalinguistic terminology of grammar, whereas Section II was designed to test
students’ knowledge of grammar in context. Even though Penny asked us to test both of these
subject areas, students were more familiar with the sentence rewrites and fill-in-the-blank tasks,
both of which are very prevalent in their ESL composition textbook. After conferring with
Penny, we learned that unless students had progressed through the entire ESL sequence at MPC,
they were unlikely to have encountered much metalinguistic terminology. The lower IF on the
Parts of Speech portion of the test is probably due to the high number of transfer students in
ENSL 346/446, who are not as familiar with these concepts.
Item Discrimination
Table 3 Section I Item Discrimination (n=34)
Item  High scorers (top nine)   Low scorers (bottom nine)   Item Discrimination
      with correct answers      with correct answers        (I.D.)
1     9                         8                           0.11
2     9                         7                           0.21
3     9                         7                           0.21
4     9                         3                           0.64
5     9                         3                           0.64
6     9                         5                           0.42
7     9                         2                           0.75
8     9                         7                           0.21
9     8                         2                           0.64
10    9                         6                           0.32
11    7                         7                           0.00
12    9                         5                           0.43
13    8                         6                           0.21
14    7                         1                           0.21
15    7                         6                           0.11
16    9                         8                           0.11
17    9                         7                           0.21
Average I.D. = 0.32
Table 4 Section II Item Discrimination (n=34)
Item  High scorers (top nine)   Low scorers (bottom nine)   Item Discrimination
      with correct answers      with correct answers        (I.D.)
1     9                         8                           0.11
2     9                         8                           0.11
3     9                         8                           0.11
4     9                         8                           0.11
5     9                         9                           0.00
6     9                         8                           0.11
7     9                         9                           0.00
8     7                         9                           –0.21
9     9                         9                           0.00
10    9                         8                           0.11
11    8                         6                           0.21
12    9                         8                           0.11
13    9                         6                           0.32
14    9                         5                           0.43
15    9                         6                           0.32
Average I.D. = 0.12
Like Item Facility, Item Discrimination (ID) is calculated for each item on the test. Unlike IF,
however, ID gives a more detailed look at how individual students
performed, with a focus on how the “high scorers” did in relation to the “low scorers” (Bailey &
Curtis, 2015). Furthermore, ID shows us whether items with a low IF score are actually difficult
or if there are other factors at play. Following Flanagan’s Method for Estimating Item
Discrimination, we calculated ID by ranking our scored tests from highest total score to lowest
total score. We then identified the top 27.5% and bottom 27.5% of test takers, which would have
equaled 9.35 people. We rounded this value down to 9, so as not to count “partial people.” We
then constructed Tables 3 and 4, which display the ID values for the top nine high scorers and
bottom nine low scorers on Sections I and II, respectively.
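
The steps above can be sketched as follows. This is a hedged illustration, not the exact procedure used for the tables: the function name `item_discrimination` is hypothetical, and the choice of denominator is an assumption. The tabled values appear to be consistent with dividing the high-minus-low difference by the unrounded group size of 9.35 rather than by 9, which is an inference from the numbers in Tables 3 and 4, so the group size is left as a parameter.

```python
def item_discrimination(high_correct: int, low_correct: int, group_size: float) -> float:
    """Difference between high and low scorers answering correctly, per group member."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (high_correct - low_correct) / group_size


N_TEST_TAKERS = 34
GROUP_SHARE = 0.275  # top and bottom 27.5% of test-takers

# Assumption: the unrounded group size (about 9.35) is used as the denominator.
unrounded_group = GROUP_SHARE * N_TEST_TAKERS

# (high scorers correct, low scorers correct), copied from Tables 3 and 4.
examples = {
    "Section I, item 7": (9, 2),
    "Section I, item 11": (7, 7),
    "Section II, item 8": (7, 9),
}

for label, (high, low) in examples.items():
    print(label, round(item_discrimination(high, low, unrounded_group), 2))
```

A negative result, as for Section II item 8, means that more low scorers than high scorers answered the item correctly.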
Mertler (2003) states that a strong test item has an ID higher than 0.50. For a test item to
be usable, it must have an ID higher than 0.30, because such a value indicates that the high scorers
performed better on the item than the low scorers. A lower ID indicates that high scorers and low
scorers performed more or less equally on the item. A negative ID would indicate that the low
scorers outperformed the high scorers. According to Oller (1979), any value under 0.25 is an
unacceptable ID value. Once again, Oller would be disturbed by our ID values, especially in
Section II, where the average ID is 0.12. Section I’s average ID falls within the range of “fair
quality” according to Mertler (2003). But once again, the purpose of our test was to check
learners’ progress throughout the semester, not how they compare to one another. Therefore, we
as test developers did not write any items with the intent of discriminating between stronger and
weaker learners. It is worth reiterating the diversity of the learners in this particular course, and
that their English learning backgrounds are inconsistent with one another. Therefore, some seem
to have different aptitudes in different areas of grammar, which might explain why some of the
low scorers actually performed as well as or better than the high scorers on most of the items in
Section II and on many of the items in Section I.
Distractor Analysis and Response Frequency Distribution
Table 5 Multiple Choice, Section I Distractor Analysis (n=34)
Item  A     B     C     D     Omitted Response
8     1     30*   3     0     0
9     6     0     12    15*   1
10    1     28*   4     1     0
11    1     4     24*   5     0
12    5     1     0     28*   0
13    29*   1     3     0     1
14    4     3     6     21*   0
15    25*   1     7     1     0
16    1     2     29*   1     1
17    0     32*   1     1     0
In the first draft of the ENSL 346/446 midterm, Dr. Bailey critiqued several of our
distractors for the multiple choice items for being too confusing or misleading to test takers. One way
test developers can see which “distractors” have tricked students is through distractor analysis
(Bailey & Curtis, 2015). The goal in a norm-referenced multiple choice test is for every
distractor to be chosen by at least one person (ibid., p. 200). If a distractor isn’t chosen by
anyone, it should be reconsidered and possibly replaced. Distractor analysis only applies to
multiple choice items, which in our test are items 8 through 17 in Section I. Our breakdown of
the test-takers’ responses to these questions is shown in Table 5. Not surprisingly, only a few of
our items successfully “distracted” students. Item 9 in Section I threw off the most test takers.
This item was meant to assess knowledge of subordinating conjunctions versus coordinating
conjunctions, a distinction that many of the learners in Penny’s class reportedly struggled with.
Beatriz will stay at school until she finishes her project.
a. adverb     c. coordinating conjunction
b. pronoun    d. subordinating conjunction
The correct answer to this item was D, but almost as many students chose option C. We
speculate that this misconception is due to the fact that test-takers are most familiar with the
mnemonic device FANBOYS (for, and, nor, but, or, yet, and so) when identifying
coordinating conjunctions. We noticed while scoring the tests that those test takers who got item
9 correct wrote this acronym on their testing papers next to the question. It’s obvious that most of
the test takers (27 out of 34) knew that until was a conjunction; they just didn’t know which type
it was. One way we could have “biased for best” more in developing this test would have been to
include the acronym FANBOYS somewhere near this test item, since that’s the terminology the
students are more familiar with. One implication of this test item is that, when writing tests, it is
always important to use the terminology with which students are familiar consistently.
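
The check described above, whether every distractor was chosen by at least one test-taker, can be sketched in Python. The function name `unchosen_distractors` and the dictionary layout are hypothetical; the counts and the starred correct options are copied from Table 5 for four of the ten items.

```python
# Response counts per option for some Section I multiple choice items (Table 5).
responses = {
    8: {"A": 1, "B": 30, "C": 3, "D": 0},
    9: {"A": 6, "B": 0, "C": 12, "D": 15},
    11: {"A": 1, "B": 4, "C": 24, "D": 5},
    14: {"A": 4, "B": 3, "C": 6, "D": 21},
}

# Correct option for each item (the starred column in Table 5).
answer_key = {8: "B", 9: "D", 11: "C", 14: "D"}


def unchosen_distractors(counts: dict, correct_option: str) -> list:
    """Return the incorrect options that no test-taker selected."""
    return sorted(
        option
        for option, count in counts.items()
        if option != correct_option and count == 0
    )


for item, counts in responses.items():
    unused = unchosen_distractors(counts, answer_key[item])
    if unused:
        print(f"Item {item}: distractor(s) never chosen: {', '.join(unused)}")
    else:
        print(f"Item {item}: every distractor attracted at least one response")
```

Under the norm-referenced guideline cited above, the options flagged by this check would be the first candidates for revision.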
Table 6 Response Frequency Distribution on Grammar Subtest
Item  High/Low Scorers  A    B    C    D    Omitted Response
1     High              0    9*   0    0    0
      Low               1    7    1    0    0
2     High              0    0    1    8*   0
      Low               0    0    6    2    1
3     High              0    9*   0    0    0
      Low               0    6    2    1    0
4     High              0    0    7*   2    0
      Low               0    2    7    0    0
5     High              0    0    0    9*   0
      Low               3    1    0    5    0
6     High              8*   0    1    0    0
      Low               6    1    1    0    1
7     High              1    1    0    7*   0
      Low               2    2    4    1    0
8     High              7*   0    2    0    0
      Low               6    1    2    0    0
9     High              0    0    9*   0    0
      Low               0    0    8    0    1
10    High              0    9*   0    0    0
      Low               0    7    1    1    0
Table 6 offers an even more revealing look at the choices these test-takers made in the
multiple choice section by breaking the distractor analysis down by which options the high
scorers chose in comparison to the low scorers. Returning again to item 9 (numbered 2 in Table 6,
which counts the multiple choice items from 1 to 10), it’s interesting to see
that most of the top scorers got this item right, whereas the majority of the low scorers were
tricked by option C.
If we had a norm-referenced test, Table 6 would look a little more uniform in terms of the
response frequency distribution. Because many of the distractors were never chosen, we might
consider rewriting some items or distractors so that they would be more difficult for test-takers.
Item 14 (numbered 7 in Table 6) was the only item on which the low scorers were spread
somewhat evenly across all the options. This item required test-takers to identify which part of
speech the word “usually” belongs to. It makes sense that so many test-takers were unable to
identify it as an adverb because a large proportion of test takers also missed item 5 (as shown by
its item facility of 0.65 in Table 1), which asked them to define an adverb. Table 6 shows that it
was mostly the low scorers who were confused by this item, while only two high scorers got it
wrong. It is reassuring to
see that the distractor analysis and response frequency distribution align with the item facility
values for both questions in Section I regarding adverbs.
Item 11 also shows an interesting distribution of responses, in which an equal number of
high and low scorers answered correctly, but those that were distracted were fooled by different
options.
Which of the following is a helping (auxiliary) verb?
a. The can of soda exploded.
b. Throw me a can of soda!
c. Would you like a can of soda?
d. I wouldn’t like a can of soda.
Their confusion could be due to the wording of the correct answer (Option C), which is in the
interrogative. The learners might be more likely to identify an auxiliary verb in a declarative
sentence where there isn’t any subject-auxiliary inversion. This is also an item which we reworked
after our pre-pilot because Dr. Bailey mentioned that some of the options were too similar, which she
thought may be a “giveaway” to test-takers. Perhaps by reworking the wording of these options,
we made the question more “sufficiently” difficult, as Oller (1979) would say.
Reliability
Table 7 Internal Consistency Measures
Subtest     Split-Half    Reliability after using    Standard    Confidence   Points
            Reliability   Spearman Brown             Deviation   Interval     Possible
                          Prophecy Formula
Section I   0.76          0.86                       3.51        0.09         17.00
Section II  0.60          0.75                       1.65        0.08         15.00
Brown (2005) defines reliability as “the extent to which the results [of a test] can be
considered consistent or stable” (p. 175). In other words, if we were to administer the ENSL
346/446 midterm again several weeks after the initial test date, we should expect the learners to
score very much the same as they did the first time. Because of the impracticalities of
administering the same test to the same population twice in a short time period, we opted to
measure the reliability of the objectively scored test items through internal consistency methods.
Specifically, we used the split-half reliability method by splitting the test into two similar parts
based on odd-numbered items and even-numbered items (Appendix A). We then correlated the
scores of the test-takers on the two halves of the test with Cronbach’s alpha, as if they were
separate tests (Brown, 2005; Hatch and Farhady, 1982). Once we had obtained the reliability for
the two halves of the test, we used Spearman Brown’s prophecy formula to determine the
reliability of the full test. The coefficients for both internal consistency methods are indicated in
Table 7.
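
For readers who want to retrace the second column of Table 7, the Spearman Brown adjustment can be sketched as below. The function name `spearman_brown` and the `length_factor` parameter are hypothetical; the formula is the standard prophecy formula, and the two half-test coefficients are the split-half values reported in Table 7.

```python
def spearman_brown(half_reliability: float, length_factor: float = 2.0) -> float:
    """Estimate full-test reliability from the reliability of a shorter form.

    length_factor is how many times longer the full test is than the form
    the coefficient was computed on (2 for a split-half estimate).
    """
    return (length_factor * half_reliability) / (
        1 + (length_factor - 1) * half_reliability
    )


# Split-half coefficients reported in Table 7.
split_half = {"Section I": 0.76, "Section II": 0.60}

for section, r_half in split_half.items():
    print(section, round(spearman_brown(r_half), 2))
```

With the default factor of 2, the function reduces to 2r / (1 + r), which is the form usually applied to odd-even splits.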
We are satisfied with the relatively high internal consistencies of both of our Grammar
subtests. The results of the Spearman Brown prophecy formula align with our previous
assumptions about the questions regarding adverbs in Section I. We can be fairly confident that
Section I consistently measures knowledge of Parts of Speech, both internally and if we were to
administer the test a second time.
Reliability for Section II was comparatively lower than for Section I. This could be because
the format of the test items varied throughout the section. Section II, part i required test-takers
to rewrite sentences, whereas part ii was a gap-fill requiring test-takers to select the correct word
to fill in the blanks in a paragraph. Although both task types were designed to measure
knowledge of conjunctions, perhaps the inconsistent formatting contributed to the overall lower
reliability score for this section. To improve reliability for Section II, we might consider making
the task types more homogeneous, not only to improve consistency but also to make sure they are
indeed testing the same constructs. This last point is more related to validity than reliability, but
as Bachman (1990) states, “when we increase the reliability of our measures, we are also
satisfying a necessary need for validity: in order for a test score to be valid, it must be reliable”
(p. 160).
Since we have a criterion-referenced test, we did not calculate Standard Error of
Measurement (SEM). Instead, we calculated confidence intervals, which are a “zone within
which a test-taker’s score would fall if he [or she] were tested repeatedly over the same
constructs without learning or forgetting taking place” (Bailey & Curtis, 2015, p. 244).
Confidence Intervals carry out the same function as SEM, but are specific to criterion-referenced
tests. For example, if a student scored a proportion of 0.88 on Section I of the midterm, with a
confidence interval of 0.09, that same student could be expected to score between 0.79 and 0.97
on the same section if she were tested repeatedly, at least 68 percent of the time (Brown, 2005).
This is a fairly wide band for scores in Section I, which could be due to the fact that there are so
few questions. The difference between a score of 79% and 97% is only three questions. The
confidence intervals add depth and context to our previous measures of internal consistency, and
show us how test-takers’ scores might fluctuate over time. Even though we have a high
reliability for the objectively scored portions of our test according to the Spearman-Brown
prophecy formula, our confidence intervals let us know that scores could vary quite widely if we were to
administer the test again.
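The arithmetic behind the example above can be sketched in a few lines of Python. This is only an illustration: it takes the proportion score (0.88) and the confidence interval (0.09) as given, assumes Section I is worth 17 points, and uses a helper name, `score_band`, that is ours rather than part of any statistics package.

```python
def score_band(proportion, interval, n_items):
    """Return the lower bound, upper bound, and width (in items) of a score band.

    proportion: the test-taker's observed proportion score (e.g. 0.88)
    interval:   the confidence interval for the section (e.g. 0.09)
    n_items:    number of points in the section (assumed to be 17 for Section I)
    """
    low = proportion - interval
    high = proportion + interval
    width_in_items = (high - low) * n_items
    return low, high, width_in_items


low, high, width = score_band(0.88, 0.09, 17)
print(f"Band: {low:.2f} to {high:.2f}, roughly {width:.1f} items wide")
```

On a 17-point section, a band of 0.79 to 0.97 spans roughly three items, which is why a single section with so few questions produces such a wide-looking interval.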
Inter-rater Reliability
Table 8 Inter-rater Reliability for Section III
Learner             Rater 1  Rater 2  Rater 1 + Rater 2
1                   12       14       26
2                   12       13       25
3                   12       12       24
4                   10       10       20
5                   12       13       25
6                   4        2        6
7                   12       13       25
8                   13       13       26
9                   12       12       24
10                  14       14       28
11                  13       14       27
12                  14       14       28
13                  13       12       25
14                  12       13       25
15                  12       12       24
16                  13       13       26
17                  14       14       28
18                  12       12       24
19                  13       13       26
20                  13       14       27
21                  12       12       24
22                  14       14       28
23                  13       14       27
24                  13       14       27
25                  11       11       22
26                  12       14       26
27                  9        9        18
28                  13       13       26
29                  11       10       21
30                  13       14       27
31                  14       14       28
32                  13       14       27
33                  13       12       25
34                  12       14       26
Mean                12.21    12.53    24.74
Standard Deviation  1.82     2.29     4.04
Variance            3.32     5.23     16.32
Coefficient Alpha = 0.95
Table 9 Inter-rater Reliability for Section IV
Learner             Rater 1  Rater 2  Rater 1 + Rater 2
1                   14       14       28
2                   13       14       27
3                   12       12       24
4                   14       14       28
5                   8        9        17
6                   6        6        12
7                   13       13       26
8                   13       14       27
9                   12       12       24
10                  12       12       24
11                  13       14       27
12                  14       15       29
13                  14       13       27
14                  12       14       26
15                  11       11       22
16                  13       13       26
17                  14       14       28
18                  12       12       24
19                  12       12       24
20                  13       13       26
21                  12       12       24
22                  14       14       28
23                  13       14       27
24                  15       15       30
25                  14       14       28
26                  14       13       27
27                  11       11       22
28                  13       13       26
29                  11       11       22
30                  10       10       20
31                  15       15       30
32                  15       15       30
33                  12       12       24
34                  14       15       29
Mean                12.59    12.79    25.38
Standard Deviation  1.89     1.92     3.77
Variance            3.58     3.68     14.18
Coefficient Alpha = 0.98

Until now, our reliability measures have only applied to the objectively scored portions of
our test. The subjectively scored portions (Sections III and IV) presented their own unique
challenges to score. As mentioned in Part I of our paper, we developed a holistic rubric with
which both of us scored each of the subjective test items. In order to measure how consistent
both of us were at using the same rating system, we used Cronbach’s alpha to measure inter-rater
reliability. Bailey and Curtis (2015) define inter-rater reliability as “the consistency with which
two or more raters evaluate the same data using the same scoring criteria” (p. 164). Ideally, those
ratings should be identical or very similar. The closer the value is to 1.00, the greater the inter-
rater reliability.
As shown in Tables 8 and 9, our coefficient alphas for Sections III and IV are 0.95 and 0.98,
respectively. These strong coefficient values are due to the rubric, which we developed together, and
the norming process we underwent before using it. As we mentioned in Part I of this paper, this
collaborative process helped us achieve a very high inter-rater reliability. However, we wouldn’t
expect such strong reliability if we were to give the rubric to two other raters and ask them to
score responses from the same test, since we created the rubric and were more familiar with the
nuances of the different descriptions for each level. If we were to pass this test along to be used
in another setting, we would have to write a detailed protocol for using the rubric and also
provide benchmark examples for each level.
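As a rough check on the coefficient alphas reported in Tables 8 and 9, the sketch below applies the standard formula for Cronbach's alpha, treating each rater as an "item": alpha = (k / (k - 1)) * (1 - sum of rater variances / variance of the summed ratings). The formula is our assumption about how the values were obtained, since it is not spelled out in the text, and the function name `cronbach_alpha` is ours. The inputs are the variances reported in the two tables.

```python
def cronbach_alpha(rater_variances, total_variance):
    """Cronbach's alpha from per-rater variances and the variance of summed ratings.

    rater_variances: list with one variance per rater
    total_variance:  variance of the summed ratings (Rater 1 + Rater 2)
    """
    k = len(rater_variances)
    return (k / (k - 1)) * (1 - sum(rater_variances) / total_variance)


# Variances taken from Table 8 (Section III) and Table 9 (Section IV).
alpha_section_iii = cronbach_alpha([3.32, 5.23], 16.32)
alpha_section_iv = cronbach_alpha([3.58, 3.68], 14.18)

print(round(alpha_section_iii, 2))  # 0.95
print(round(alpha_section_iv, 2))   # 0.98
```

Under this assumption, the rounded results agree with the coefficients reported in the tables.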
Subtest Relationships
Table 10 Subtest Relationships (df = 32, p < .05)
Correlation Coefficients (Pearson’s r)
Test        Writing 2  Writing 1  Grammar 2  Grammar 1  Total Test
Total Test  0.81       0.79       0.35       0.79       -
Grammar 1   0.51       0.50       0.04       -          0.79
Grammar 2   0.24       0.18       -          0.04       0.35
Writing 1   0.68       -          0.18       0.50       0.79
Writing 2   -          0.68       0.24       0.51       0.81
Table 11 r-squared for Subtest Relationships
Overlapping Variance
Test        Writing 2  Writing 1  Grammar 2  Grammar 1  Total Test
Total Test  0.66       0.62       0.69       0.77       -
Grammar 1   0.26       0.25       0.00       -          0.77
Grammar 2   0.05       0.03       -          0.00       0.69
Writing 1   0.46       -          0.03       0.25       0.62
Writing 2   -          0.46       0.05       0.26       0.66
We used Pearson’s r correlation coefficient to calculate the relationship between the
scores for each subtest and the test as a whole. We then used r-squared to determine the
overlapping variances between the subtests and the total test. At first glance, our subtest
relationships seem quite abysmal. For example, the r-squared value between Section I (Grammar
1) and Section II (Grammar 2) shows no overlap whatsoever! What this means in terms of our
test is unclear. It could mean, as Jean Turner explained to us (personal communication), that our
subtests measure different skills in terms of our original test constructs. Or as Oller (1979)
argues, low correlation coefficients may not necessarily mean that the subtests measure different
areas of knowledge. They could, in fact, be measuring the same kinds of knowledge but not in
adequate ways. Low correlations could indicate low reliability in the test as a whole or in one
section, or perhaps that the test was “poorly calibrated with respect to the tested subjects” (Oller,
1979, p. 188).
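To make the step from a correlation coefficient to overlapping variance concrete, the short sketch below squares a few of the Pearson's r values from Table 10. The function name `overlapping_variance` and the selection of subtest pairs are ours, chosen only for illustration.

```python
def overlapping_variance(r):
    """Proportion of variance shared by two sets of scores, given Pearson's r."""
    return r ** 2


# Pearson's r values for three subtest pairs, as reported in Table 10.
subtest_correlations = {
    "Grammar 1 / Grammar 2": 0.04,
    "Grammar 1 / Writing 1": 0.50,
    "Writing 1 / Writing 2": 0.68,
}

for pair, r in subtest_correlations.items():
    print(f"{pair}: r = {r:.2f}, r-squared = {overlapping_variance(r):.2f}")
```

Because squaring shrinks any value between 0 and 1, even a moderate correlation such as 0.50 corresponds to only a quarter of shared variance, and a correlation of 0.04 rounds to no overlap at all.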
Between Turner’s and Oller’s different interpretations of subtest relationships, we would
side with Turner’s assumption that our various subsections demand different skills from the test-
takers. As we mentioned in Part I of our paper, knowledge of parts of speech is not necessarily
indicative of a test-taker’s grammatical awareness as a whole. Perhaps a section that we initially
thought was testing linguistic competence turned out to measure only a narrow portion of that
construct. Additionally, our test boasts an overall high level of reliability according to our
internal consistency measures, further contradicting Oller’s argument that the sections are
“poorly calibrated” to one another.
One other potential issue that we noticed in our correlation calculation is that our testing
group did not seem to match the conditions for the Pearson’s r statistic (Turner, 2014). Namely, r
is a parametric statistic and our testing sample was too small at only 34 members. Also, although
our data were interval-like and rankable, they were not normally distributed, which is what Pearson’s r
calls for. We decided to retry our correlation calculations using Kendall’s tau, which is
a non-parametric counterpart to Pearson’s r (Appendix B). Tau also handles tied ranks better
than other non-parametric correlation statistics like Spearman’s rho, and gives a more precise
estimate of correlation strength (ibid.). Unfortunately, Kendall’s tau did not yield much higher
correlation values than Pearson’s r. In fact, almost all of our subtest relationship correlations
were lower once we calculated them using Kendall’s tau.
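For readers unfamiliar with Kendall's tau, the sketch below shows one common variant, tau-b, which adjusts for tied scores. It is not the calculation reported in Appendix B: the scores are hypothetical, the function name `kendall_tau_b` is ours, and we assume tau-b simply because it is the usual choice when ties are present.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length lists of scores.

    Every pair of test-takers is classified as concordant (ordered the same
    way on both measures), discordant (ordered in opposite ways), or tied.
    """
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both measures: contributes to neither term
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    if denominator == 0:
        return float("nan")  # undefined when one measure has no variation
    return (concordant - discordant) / denominator


# Hypothetical scores for five test-takers on two subtests (not our real data).
subtest_a = [12, 14, 14, 10, 13]
subtest_b = [24, 28, 26, 20, 26]
print(round(kendall_tau_b(subtest_a, subtest_b), 2))
```

Because tau counts ordered pairs rather than measuring distances between scores, it tends to produce smaller absolute values than Pearson's r on the same data, which is consistent with the lower correlations we observed.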
Discussion
Bailey and Curtis (2015) mention that there are four traditional criteria for evaluating
tests: reliability, validity, practicality and washback. Reliability, as previously mentioned, has to
do with how consistent and stable test results are across time. Validity, a related concept, refers
to “how well a test does what it’s supposed to do” (Oller, 1979, p. 4). In other words, does the
test measure the construct that it claims to? Practicality deals with procedures for developing,
administering and scoring a test and how feasible they are in context. Washback is “the effect a
test has on teaching and learning... either positive or negative” (Bailey & Curtis, 2015, p. 3).
In terms of reliability, our test had pros and cons, which emerged through the scoring
process. One strength was definitely our inter-rater reliability, which was extremely high, due to
our detailed rubric and thorough norming process. Our main concerns are with the reliability of
the objectively scored sections. We may want to review these sections and rework the item
formats, which are quite varied throughout the two sections. Perhaps these differences played a
role in how learners answered the questions, along with their knowledge of conjunctions or parts
of speech as a whole.
Our subtest relationships made us question the validity of our test, especially between
sections that measured seemingly similar skills, such as grammar in Sections I and II and writing
in Sections III and IV. We would like to investigate further whether these low correlations are
due to the fact that the subtests measure truly different areas of knowledge, or if there was some
other intervening factor at play. We do think, however, that our test has face validity,
especially in the last two sections. The test items reflected material that test-takers were already
familiar with and addressed relevant issues in their academic careers and personal lives. Perhaps
one weakness in our design of Section I was that it lacked face validity for students. As language
educators and test developers, we realize the importance of parts of speech and we were asked to
include them, but perhaps we didn’t integrate the Section I test items well enough with the
content that the test-takers had been learning, or what they were tested on in other subsections.
From a practical standpoint, a great deal of time went into designing, pre-piloting,
editing, piloting, scoring and interpreting our test. But this effort was to be expected for two
novice test developers, creating our first exam for a real teaching context. In comparison to other
types of tests, however, the administration of the ENSL 346/446 midterm was simple. For
instance, with no listening section or speaking section, we did not need to spend extra time and
manpower playing an audio clip or interviewing the test-takers. All we had to do essentially was
explain the test format and leave the test-takers to their own devices. The scoring, although
time-consuming, was also relatively straightforward. Our design of a holistic rubric helped
considerably, because it reduced the amount of time we spent with each text and the number of
decisions we needed to make. One change we might make to the multiple-choice items in Section
I would be to add a space in the margin where the test-takers could write their letter answers.
One would be surprised how many different interpretations test-takers will come up with for
marking answers when asked simply to “identify the parts of speech of the underlined
words/phrases.” It became confusing to decipher test-takers’ responses when some circled
just the letter of the response, some circled the entire response, and some underlined or
crossed out certain options, even if they didn’t end up choosing them. Having a uniform
space for answers would have streamlined the grading process.
Upon our final meeting with Penny, we received evidence of positive washback from
our test. By looking at our data and pinpointing specific test items that many learners
struggled with (for example, questions 9 and 14 in Section I), Penny was able to see how
effective her instruction of these concepts had been, and use this feedback to guide her
curriculum for the rest of the semester. Furthermore, our test was very well grounded in
the teaching context because of our thorough needs analysis. We know that the material
will be relevant to students’ further studies at MPC, and may affect their interpretations of
American culture in their day-to-day lives.
Conclusion
In our analysis of item facility, item discrimination, response frequency distribution,
reliability and subtest relationships, several strengths and shortcomings of our test became
apparent. We are grateful that we took the time to scrutinize such minute details of the
exam because it gave us insight into how we can create more effective tests in the future.
Overall, despite the weaknesses in our test design, we feel that our test accomplished its
intended purpose: to measure the progress of Penny’s students at the mid-point in the
semester. It was level-appropriate and incorporated the content of the course. Penny was
pleased with the revealing results, and we are confident that it helped her identify the
strengths and weaknesses of her class.
References
Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Bailey, K. M., & Curtis, A. (2015). Learning about language assessment: Dilemmas, decisions,
and directions. Boston, MA: National Geographic Learning.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language
assessment. New York, NY: McGraw-Hill.
Hatch, E. M., & Farhady, H. (1982). Research design and statistics for applied linguistics.
Rowley, MA: Newbury House.
Mertler, C. A. (2003). Classroom assessment: A practical guide for educators. Los Angeles, CA:
Pyrczak Publishers.
Oller, J. W. (1979). Language tests at school. London: Longman Group.
Turner, J. (2014). Using statistics in small-scale research: Focus on non-parametric data. New
York, NY: Routledge.