
ENSL 346/446 Midterm Score:_______/61

SECTION 1 (____/17 points total)

Parts of Speech:

Directions: Match each part of speech in the left column with the appropriate definition in the right column

1. Noun:

_______________

A. modifies a noun or a pronoun by describing, identifying, or quantifying words

2. Verb:

_______________

B. replaces a noun

3. Pronoun:

_______________

C. shows direction, location or time while linking nouns, pronouns and phrases to other words in a sentence.

4. Adjective:

_______________

D. names a person, place, thing or idea.

5. Adverb:

_______________

E. expresses actions, events or states of being.

6. Conjunction:

_______________

F. links words, phrases and clauses

7. Preposition:

_______________

G. modifies a verb, an adjective, an adverb, a phrase or clause. Shows manner, time, place, cause or degree.

Multiple choice:

Identify the parts of speech of the underlined words/phrases:

1. Does this meal come with rice?

a. pronoun c. adjective

b. noun d. preposition

2. Beatriz will stay at school until she finishes her project.

a. adverb c. coordinating conjunction

b. pronoun d. subordinating conjunction

3. The cunning raccoon jumped into the garbage can.

a. article c. adverb

b. preposition d. auxiliary

4. Which of the following is a helping (auxiliary) verb?

a. The can of soda exploded.

b. Throw me a can of soda!

c. Would you like a can of soda?

d. I wouldn’t like a can of soda.

5. Which of the following is an adjective?

a. My brother’s ugly red car lasted him twenty years.

b. My brother’s ugly red car lasted him twenty years.

c. My brother’s ugly red car lasted him twenty years.

d. My brother’s ugly red car lasted him twenty years.

6. Which of the following is a pronoun?

a. Jeremy is excited because he bought new shoes.

b. The shoes are under the table.

c. How much did those shoes cost?

d. He got a discount, thanks to his school ID.

7. Lucille usually jogs on the beach before class.

a. adjective c. helping verb

b. preposition d. adverb

8. Luiz saved his money all summer so he could buy a guitar.

a. coordinating conjunction c. subordinating conjunction

b. verb d. article

9. Which of the following is a noun?

a. The smell of pie fills Mrs. Weasley’s house every Saturday.

b. Mrs. Weasley loves to bake pie.

c. Mrs. Weasley’s pies are baked with love.

d. Harry walked away with a mountain of pie on his plate.

10. Isabel finally hiked in Big Sur last weekend.

a. helping verb c. adjective

b. verb d. adverb

SECTION 2 (____/15 points total)

Conjunctions and Transitions:

Directions: Combine the sentences by rewriting them in the space below each one with the proper conjunction, or fill in the blank with the proper conjunction. Use each word from the wordbank once.

Example: The cat sleeps all day. He hunts at night.

The cat sleeps all day and he hunts at night.

Wordbank

but and so now because after although that or

1. Diego likes most food. He doesn’t like spinach or artichokes.

2. No one washes their car in the Monterey Peninsula. There is a drought in California.

3. I need to study hard __________ I can pass the exam.

4. ____________ Sandy was very ill, she didn’t take any medicine.

5. All of the students had a party. The test was finished.

6. Lateisha ran a marathon yesterday. She finished her essay on time.

7. I didn’t know ____________ Anwar was coming home today.

Adventures in the USA

Read the paragraph below about Andres’ experience moving to the US. Then, fill in the correct subordinating conjunctions. Use the words from the list below to complete the task. Use each word once. Not every word in the wordbank will be used.

Wordbank

if when until who now however that and after because

_____________ Andres moved from Colombia to the United States it was the first time

he had travelled outside of the country. Since then, he has been studying biology at

the university. His biology professors, ________________ are very generous, have helped

him a lot with his degree. ________________, it hasn’t always been easy for Andres to

live in a foreign country. In fact, it wasn’t until he started studying biology that he

met a lot of his friends _______________ began to feel comfortable in the US. Upon his

arrival it was difficult to meet people _________________ he was new and unfamiliar

with US culture. His mother noticed this because Andres never went out on the

weekends. She told him, “If you want to make friends, you have to take risks and not

be afraid to talk to people.” _________________ hearing his mother’s advice, Andres

started talking to other students in his classes and he realized ______________ he and

his classmates shared a lot of things in common. _______________ Andres feels

comfortable with his new surroundings and his mother has a difficult time

convincing Andres to stop hanging out with his friends and spend time at home with

his family.

SECTION 3 (____/14 points total)

Short Answers: American Core Values

Directions: The paragraph below tells a story about American values. Read and answer the question(s). Write no more than 4 – 5 sentences.

Story 1:

Giovanni moved from Italy to New York when he was 18 years old to study business at an American university. He planned on moving back to Italy when he finished his studies but he fell in love with an American girl and they ended up getting married. After they graduated and got married, Giovanni became a US citizen. Giovanni and his wife then moved to a small town on the coast of California because his wife got hired as an accountant at a small firm there. Giovanni wanted to open an Italian restaurant in the town because the only Italian restaurant in town was Olive Garden. According to Giovanni, this was not real Italian food. After months of hard work the restaurant finally opened. At first, the restaurant struggled to make money because everyone still went to Olive Garden. To make his restaurant different, Giovanni decided to have a lunch special during the week. The lunch special was a great success but, to Giovanni’s surprise, Olive Garden started to have a lunch special, too. Then Giovanni decided to stay open an hour later than Olive Garden, but after two weeks Olive Garden changed their hours to the same as Giovanni’s!

Which of the six American values does Giovanni’s story represent? Explain with examples from the case study.

SECTION 4 (____/15 points total)

Written Response: Analyzing Advertisements

Directions: Choose one ad and analyze it using the categories we studied and practiced. Write a clear paragraph with a topic sentence and supporting sentences. Please include the following in your response:

a. Product

b. Target market

c. Description of the ad using adjectives of description

d. American value used

e. Strategy(s) used to sell the product and examples from the ad

f. Proper use of coordinating and subordinating conjunctions and transitions

Running Head: ORIGINAL TEST PROJECT

Original Writing Test for Monterey Peninsula College

Rachel Musgrove and Brock Ketterling

Middlebury Institute of International Studies at Monterey


Background

The ESL sequence at Monterey Peninsula College (MPC) prepares students for mainstream academic classes. The courses range from beginner (level 1) to university-level (level 6). For the original test project, we designed a test for Penny Partch’s High-Intermediate Writing about American Culture course (level 5). The objectives for this course are to develop writing skills and cultural literacy, with an emphasis on writing essays relevant to U.S. government, diversity, values, and innovations (MPC, n.d.).

From the beginning, we wanted to design a writing test. Brock already had an interest in teaching college writing and Rachel, who had only ever taught young learners, wanted to try something new. Brock had previously observed Penny’s class for the Teaching of Writing course taught by John Hedgcock at MIIS, at which point he gathered the initial information for our needs analysis for making the test (Appendix A). Graves (2014) mentions that a needs analysis should take into account the purpose of the course, the test developer’s own beliefs about testing, and information that is already known about the learners’ goals and proficiency levels. So after reviewing Brock’s observation notes, we interviewed Penny and gathered information about the goals of the course and the strengths and weaknesses of the students. She also provided us with course materials, such as the course textbook and previous assignments (Appendix B). This information helped us contextualize our test, establish our constructs, and rework any irrelevant tasks we had previously envisioned for the test. With Penny’s guidance, we began to design her midterm exam.

Purpose of Test

The midterm exam functioned as a progress test. Penny wanted to know what her students understood halfway through the term so she would know what to focus on during the second half of the term. The Writing for American Culture course (ENSL 346/446) is a content-based course. The students improve their writing skills by exploring topics concerning history, multiculturalism, and immigration in the United States. At the point we began designing the test, the students had learned about American values and advertising techniques. They had also covered grammar points such as parts of speech, conjunctions, transitions, sentence types and adjective clauses. Thus, we had to blend the grammatical portion of our test into the context of American advertising and values. Our goals were to test the students on their grammatical knowledge and writing ability, and to familiarize them with cultural practices that they will face in their mainstream academic courses and life in the US.

Target Audience

The learners who took the midterm exam were ESL community college students ranging from 18 to 40 years old. The class was made up of a mix of resident immigrants and generation 1.5 students. There were at least thirteen countries represented in this class, so the L1 of the students varied, as did their cultures and schooling histories. Some of the students had taken part in the whole ESL sequence at MPC and were used to the format of the class. Others had tested in from an outside institution and still needed help with more basic subject matter, like parts of speech. Regardless of their differences, the students viewed English as an important skill they needed to develop, and they understood the value of learning about the culture they live in.

Constructs

The ENSL 346/446 Midterm measures three overarching constructs: linguistic competence, sociolinguistic competence, and discoursal competence. These constructs were chosen because of their relevance to the curriculum of the course and their pertinence to academic writing in general. These three areas of knowledge will be crucial to the learners’ success for the duration of their studies, mostly with respect to formal writing. We have labelled the subsections of our test which measure linguistic competence as “Grammar” and the sections which measure discoursal and sociolinguistic competence as “Writing.” We realize that there is overlap between the constructs and the sections. For example, students cannot respond to the writing prompts without some degree of linguistic competence. To control the overlapping of constructs we devised a rubric that took into consideration the multiple aspects of students’ writing. For instance, in the writing sections we only factored in specific grammar points such as adjective clauses, but not general syntax or grammatical accuracy. The rubric will be discussed in depth later in this paper. For additional definitions of our constructs, see Appendix C.

Linguistic competence

We modified Canale and Swain’s (1980) definition of linguistic competence as “knowledge of lexical items and of rules of morphology, syntax [and] sentence-grammar semantics” (Canale & Swain, 1980, p. 29). We omitted Canale and Swain’s inclusion of phonology within this definition because our test is mainly a test of grammar and writing and does not include any speaking or listening. Why test on grammar in what is meant to be a communicative EAP course? Ellis and Shintani (2014) say that grammar is redundant and that much of the meaning which humans communicate to one another can be conveyed through context and lexis (p. 54). However, as Widdowson (1990) puts it, “grammar frees us from a dependency on context and the limitations of a purely lexical categorization of reality” (p. 86). A focus on grammar as a meaning-making resource (Celce-Murcia & Larsen-Freeman, 2016) is especially important in writing, as a piece of text can communicate meaning across space and time, whereas spoken language is often limited to a specific instance for those who hear it. Therefore, it’s especially important that the English language learners in an EAP course learn to communicate their messages accurately, as their writing will ultimately exist as a decontextualized entity when it is assessed by their instructors or read by their peers.

The specific grammar points that we assess in our midterm exam are important to meaning-making for a variety of reasons. The first grammar subtest focuses on parts of speech, or the terminology assigned to various categories of words in accordance with their syntactic functions. The parts of speech that we included in our test were nouns, verbs, pronouns, adjectives, adverbs, prepositions and conjunctions. Although it is not exactly necessary to know the name of an adjective or adverb in order to write an academic paper (as many native English speakers informed us while we were preparing this test!), knowing this terminology helps students develop a “meta-language,” or a way of talking about grammar that helps learners to conceptualize it (Celce-Murcia & Larsen-Freeman, 2016, p. 17). Being able to break down a sentence into its basic building blocks will help these English language learners understand what makes “effective” college writing. More specifically, it will enable them to see why certain words or combinations of words sound more formal or more persuasive than others. Knowing the terminology for the individual elements of language will help the students combine them to build socio-culturally appropriate sentences and, ultimately, more convincing academic papers.


Conjunctions, which are the main focus of the second grammar subtest, eliminate redundancies in academic writing (Celce-Murcia & Larsen-Freeman, 2016, p. 481). Words like and, but, because, so and although help make writing less choppy and redundant (p. 489). Being able to identify conjunctions and deploy them correctly helps English language learners achieve cohesion and coherence (also important in discoursal competence), and finally a more “sophisticated” level of writing. Linguistic competence, then, is essentially the foundation for our next two constructs as well: sociolinguistic competence and discoursal competence.

Sociolinguistic competence

Canale and Swain (1980) refer to sociolinguistic competence as the rules that “specify the ways in which utterances are produced and understood appropriately” (p. 30). Bailey and Curtis (2015) further this definition to include the ability to apply these rules to discourse. Swain (1984) emphasizes that sociocultural competence has mostly to do with an awareness of contextual factors such as “topic, status of participants and purposes of the interaction” (p. 188). In other words, it is the learners’ ability to adapt their language to certain situations, depending on who they are talking to and the relative social distance between themselves and their interlocutors. In the specific context of our test, test takers will be responsible for writing in the appropriate register corresponding to US-American academic writing, which they have been learning both in this course and in previous courses at MPC (if they’ve progressed through the MPC ESL system). Specifically, they should write in a formal manner with complete sentences, pay attention to correct grammar and punctuation, and avoid slang terms or addressing the reader as “you.”


Sociolinguistic awareness is important for these particular language learners because some of them may be entering a different realm than the ones in which they have previously used English. As previously mentioned, ENSL 346/446 consists of a diverse range of students from various countries and educational backgrounds. Even those who were born in or moved to the U.S. at a young age may be “ear learners” of English and may not be aware yet of different registers. Therefore, it is crucial for them to realize how grammar and lexis differ between spoken and written language, or even different modes of written language. For example, the way English is used on social media differs vastly from that expected in a college essay. The ability to use the appropriate language in the appropriate setting keeps learners from embarrassing themselves or potentially offending an interlocutor by leading them to believe that the learner does not take the situation seriously enough. Sociolinguistic competence is closely linked with discoursal competence because both are related to situational awareness of language. Discoursal competence, however, has more to do with the genre of writing, as we will explain in the next section.

Discoursal competence

Discoursal competence is associated with the overall organization of a text. Swain (1984) elaborates that discoursal competence is knowing “how to combine grammatical forms and meanings to achieve a unified spoken or written text in different genres” (p. 189). The focus on genres is important because discourse is essentially what distinguishes an academic essay from a bulleted list or a repetitive formula. Establishing this unity throughout a text relies on the writer’s control over what Canale (1983) calls “coherence and cohesive devices.” These devices refer back to our description of linguistic competence. In order to establish cohesion and coherence in a text, a learner must know how to correctly use elements such as pronouns, synonyms, conjunctions and ellipsis to relate individual utterances in a logical manner. In our test specifically, we assessed the learners on their ability to write an argumentative paragraph (Section IV), which included a thesis statement, supporting evidence, and concluding statements. Test takers were also expected, obviously, to use the grammatical features discussed earlier in this essay, namely coordinating and subordinating conjunctions, so that their writing flowed smoothly and showed a variety of sentence types.

Penny wanted us to focus especially on paragraph structure in our test because she found that her students had quite a varied understanding of how to sequence their writing in a logical manner. The instructors of the mainstream academic courses at MPC have expectations about what effective writing structure looks like, so it is vital that the ESL students master these patterns now before they advance to higher-level courses.

Test Methods and Organization

Our test (Appendix D) has four sections. The first two sections are objectively scored and feature discrete-point items. The last two sections are subjectively scored constructed-response items. The total points possible for the test is 61 and each section was more or less evenly weighted.

Section I (Grammar) – Parts of Speech

Section I, part i of our test consists of seven matching items which assess students’ ability to define parts of speech. We began our test with this task because it laid the foundation for the rest of the subtests and was the most basic knowledge the students were tested on. In the left column was a list of terminology like noun and adverb. Students then had to match those words with the definitions in the right column. The definitions were modified from a handout Penny provided her students (Appendix B), so the concepts were already familiar to them, especially if they had been in the previous ESL courses at MPC. To work for positive washback, we always tried to adapt our tasks from assignments they had already done in class.

Section I, part ii consists of ten multiple-choice items in which we assessed students’ ability to identify parts of speech within sentences. We created two types of multiple-choice questions so that students had to pay extra attention to what the question was asking. Multiple choice can be redundant, so we figured that switching up how the questions were formatted would cause students to focus on what was being asked of them. For one type of multiple-choice question, we created our own sentences for the stem and underlined the part of speech being tested for the question. Students then had to pick the correct term in the list of options, labeled a through d. For the other type of multiple-choice question, we posed a question in the stem that asked students to find a particular part of speech that was written in the list of options. In the options there were sentences with different parts of speech underlined. For each of our multiple-choice questions, there was only one key.

Section II (Grammar) – Conjunctions and Transition Words

Section II of our test assessed the use of conjunctions and transition words. Penny had been teaching her students about subordinating and coordinating conjunctions and transition words so that they could create different types of sentences and paragraphs that were cohesive and coherent. To create the task for this subsection, we took one of her existing assignments and adapted it for our test (Appendix B). This task had two different types of questions. One asked the students to rewrite two independent sentences by combining them with the correct conjunction found in a wordbank. The other question asked students to fill in the blank of an already completed sentence with the correct conjunction also found in the wordbank. The main objective for this task was for students to analyze conjunctions on a sentential level.

Section II, part ii was a paragraph-long cloze passage telling the fictionalized story of an immigrant student in the United States. The students had to read the passage and choose the correct conjunctions and transition words from the wordbank provided for this subsection. Since this was a content-based course about American culture, we tried to create tasks that reflected the aspects of American culture the students were studying. We also wanted this subsection to highlight the importance of paragraph structure; the realistic story contextualized how conjunctions and transitions provide cohesion and coherence in writing.

Section III (Writing) – American Values

Section III of our test was a constructed-response section. We created a fictionalized case study about the story of an immigrant entrepreneur in the US and the students were asked to identify the American values that were associated with his story. The students were asked to write no more than five sentences. Their answers did not have to be formatted into a paragraph, either. The main objective for this task was that they could identify the American values in a contextualized story.

We designed this section based on two assignments the students had done in class (Appendix B). One assignment was a case study about plagiarism. Students had to read a story about either a Russian or Japanese student (depending on which handout the student received) taking a test in an American university who was caught cheating during a quiz. Students then had to answer questions to discover why the Russian or Japanese student had cheated and make inferences about cultural differences between the student’s home country and the US. For homework, they were asked to write a paragraph about their findings and discussions.

The other assignment was a chart that focused on six American values (individualism, self-reliance, equality of opportunity, competition, material wealth, and hard work). Students had previously been discussing the different aspects of these values, but for this assignment students had to find a US example of each value to illustrate its significance and then compare the example to how it would be viewed in their home country. Students discussed their answers with classmates, acting as cultural ambassadors for their countries.

By combining and adapting these two activities, we were able to create a writing task that not only reflected what the students had been practicing and studying in class, but also challenged them to analyze a story they were not familiar with. This subsection prepared students for the final section, which was more demanding and all-encompassing.

Section IV (Writing) – Analyzing Advertisements

Section IV of our test was also a constructed response; students had to compose a paragraph that analyzed a US advertisement. The students had the choice between two different advertisements, either a Burger King advertisement or a Chanel advertisement featuring Brad Pitt. We picked these advertisements for two reasons. Firstly, the students were likely to be familiar with at least one of the companies and the products advertised. Secondly, if they were not, the products were pictured on the advertisements themselves, making it obvious what was being sold. After the students chose their advertisement, they had to write a paragraph with a topic sentence and supporting sentences that included a description of the product, the target market, the American values used, and the strategies used. These objectives were listed so that students understood what was expected of them.

We developed this section of the test by combining the other sections of the test with an advertising assignment the students had been working on that focused on interpreting advertising strategies in the US (Appendix B). Our goal for the final section was to incorporate all of the constructs in our test; students had to produce their own paragraphs, their own conjunctions and transitions, and use their knowledge of the concepts they learned in class to create a well-structured paragraph.

Pre-piloting

The ENSL 346/446 midterm was pre-piloted on October 14th, 2015. Rachel administered the test to Dr. Kathleen Bailey and Jennifer Dowrie, a fellow classmate of the Language Assessment course at MIIS. Both test-takers were able to offer some very valuable feedback (Appendix E). In conjunction with some feedback from Penny, we drafted a second version of the test with some notable changes from the first. One of the biggest changes between the first and second draft was the Section IV writing task. In the initial version of the test, learners were presented with four different advertisements, asked to choose two and write an essay comparing and contrasting them. Penny informed us that this task was a little beyond what the learners had already practiced in class, so we narrowed the task to include only two advertisements, from which the learners had to pick one and simply analyze it using the concepts they had studied in class. The task was also shortened from an entire essay to only a paragraph, which we all thought would be more manageable for the students (and also easier for us to grade!). We also omitted one of the case studies that we had originally drafted for the test in Section III, because it turned out not to align with Penny’s curriculum as much as we originally thought.

Jennifer and Dr. Bailey also suggested some formatting changes that overall made our test much more visually appealing and easier to read. For instance, we condensed the multiple-choice options wherever we could and trimmed some unnecessarily long sentences. We also changed some options for the multiple choice which turned out to be misleading or confusing. One especially helpful tip from Dr. Bailey was the suggestion to add extra words to the wordbank for the cloze passage and the sentence rewrite in Section II. This is a tactic to “bias for best,” so that the test-takers have some leeway to make an error on one question without jeopardizing their chances to answer other questions correctly. After Penny and Dr. Bailey approved our newly edited version, we were ready to bring our test to the people who mattered most: the learners in ENSL 346/446.

Piloting

There was an air of excitement and a nervous buzzing in Penny’s classroom when we arrived to pilot our test. Penny had wanted us to come and introduce the test ourselves so that we could explain it and field any questions the learners might have. The only question the learners had, however, was “can we start already?” No one seemed surprised or confused by the tasks, because the structure was familiar to them; they had completed similar tasks before. Penny thought that it wouldn’t be a good idea for us to stay while the students took the test, as the presence of strangers in the room might make the test-takers more nervous than they already were. So after introducing the test, we left and returned later in the week to pick up the results for analysis.

Scoring Process & Results

We graded a total of 34 tests from Penny’s class. We began by scoring the objective sections of the test with the key in Appendix E. As we graded the test, some of the test-takers surprised us by responding with keyable answers that we had not previously noticed. Take, for example, question 5 in Section II, part i:

5. All of the students had a party. The test was finished.

Test-takers were asked to conjoin the two sentences using one of the conjunctions from the wordbank provided. The correct answer we had initially written in the key was after, but a number of students responded with because. Both answers are actually perfectly grammatical and make sense from a semantic standpoint. We couldn’t penalize the students for choosing a correct answer just because they couldn’t read the test developers’ minds, so we adapted our key to include because as a correct answer. The same actually happened with the next question (question 6 in Section II, part i) when we discovered that with some creative maneuvering, although would actually be an acceptable answer when it is in the sentence-initial position. The experience of modifying our answer key taught us a lesson about controlling for potentially keyable distractors. When developing future tests, we should be more thorough in trying out all of the distractors in a wordbank before finalizing the test.
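One way to picture the revised key is as a mapping from each item to a set of acceptable answers rather than to a single answer. The following is only a minimal sketch in Python: the `ANSWER_KEY` name and the `score_item` helper are hypothetical, and only the entry for question 5 (after, because) is taken from the discussion above.

```python
# Hypothetical sketch: an answer key in which an item may have
# more than one keyable answer. Only the entry for question 5
# reflects the revision described in the text.
ANSWER_KEY = {
    5: {"after", "because"},
}


def score_item(item_number, response, key=ANSWER_KEY):
    """Return 1 point if the response is an acceptable answer, else 0."""
    acceptable = key.get(item_number, set())
    # Normalize case and surrounding whitespace before comparing.
    return 1 if response.strip().lower() in acceptable else 0


print(score_item(5, "because"))  # 1
print(score_item(5, "but"))      # 0
```

Storing a set per item means a newly discovered keyable answer can be added without changing the scoring logic.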

Even with some surprises, the grading of the objective portions of our test was much quicker and more straightforward than our grading of the subjective sections. We developed two different rubrics for the two subjectively scored sections of the test, which can be found in Appendix G. Each rubric is task specific. That is to say, their criteria and descriptors reflect specific features of their elicited performance (CARLA, n.d.). We chose this type of rubric because Penny’s expectations for each subtest were very precise. Section III was designed to test learners’ knowledge of American values and their ability to support an argument with examples from a text. Section IV was designed to assess learners’ ability to identify aspects of an advertisement, describe an image with adjectives and build a coherent and cohesive paragraph. To our knowledge, there existed no rubric which contained these descriptors exactly, so we developed one ourselves. Although we had initially expected to use an analytic rubric for these sections, we eventually settled on a holistic rubric instead. Holistic rubrics require raters to make judgments based on an overall impression of a performance, which is then assigned a score based on bands (Weigle, 2002, p. 113), or descriptors of each level. We chose a holistic rubric mostly for practical reasons; there were many test takers and each one wrote two short essays. A holistic rubric served to save time, minimizing the number of decisions that we as raters had to make (CARLA, n.d.). An analytic rubric would have taken more time to create and use to evaluate each text. The holistic rubric also reduced the chances that we as raters would disagree, since we each only had to settle on one number, therefore increasing our potential inter-rater reliability.

Typically in holistic scoring, each band corresponds to a single score, based on descriptions of what a writing sample at this band-level should look like. We interpreted this process in a somewhat original way and assigned each band instead to a range of scores (for example, 10–12 or 13–15). We did this in an effort to keep each section weighted relatively equally, without having to write a separate band description for each score level 0 through 14 (Section III) or 1 through 15 (Section IV). However, had we an opportunity to redo our scoring process from the beginning, we might have chosen a different method, as the act of converting the initial band score to a number between 1 and 14 or 0 and 15 proved to be somewhat confusing at the start. Regardless of our rough start, we were fairly consistent in our awarding of the same or similar scores, as can be seen in Table 1. We achieved such consistency by going through a “norming process,” in which each rater read the same three papers, assigned them a grade based on the scale, and then we compared our scores to see if we agreed on all the criteria. Luckily, our first three sample papers provided a broad range of scores, which acted as benchmarks for us as we graded the remaining 31 papers. Weigle (2002) explains how benchmark scripts serve as an “anchor” for raters, as they perfectly exemplify the criteria for that level. By referencing the benchmarks, raters can be carefully trained to adhere to the rubric when scoring scripts (p. 112). Since we were fortunate enough to establish our scoring criteria together, we essentially trained ourselves and each other in the use of the rubric, which is one explanation for our relatively consistent scoring method.

Table 1 Midterm Exam Scores by Learner

Learner  Grammar    Grammar    Writing Subtest 1        Writing Subtest 2        Total
         Subtest 1  Subtest 2                                                    Score
1        9          15         R1(12) R2(14) = 13       R1(14) R2(14) = 14       51
2        11         15         R1(12) R2(13) = 12.5     R1(13) R2(14) = 13.5     52
3        10         12         R1(12) R2(12) = 12       R1(12) R2(12) = 12       46
4        11         13         R1(10) R2(10) = 10       R1(14) R2(14) = 14       48
5        9          13         R1(12) R2(13) = 12.5     R1(8) R2(9) = 8.5        43
6        4          14         R1(4) R2(2) = 3          R1(6) R2(6) = 6          27
7        8          15         R1(12) R2(13) = 12.5     R1(13) R2(13) = 13       48.5
8        17         15         R1(13) R2(13) = 13       R1(13) R2(14) = 13.5     58.5
9        16         13         R1(12) R2(12) = 12       R1(12) R2(12) = 12       53
10       10         15         R1(14) R2(14) = 14       R1(12) R2(12) = 12       51
11       8          15         R1(13) R2(14) = 13.5     R1(13) R2(14) = 13.5     50
12       17         14         R1(14) R2(14) = 14       R1(14) R2(15) = 14.5     59.5
13       10         14         R1(13) R2(12) = 12.5     R1(14) R2(13) = 13.5     50
14       17         11         R1(12) R2(13) = 12.5     R1(12) R2(14) = 13       53.5
15       14         12         R1(12) R2(12) = 12       R1(11) R2(11) = 11       49
16       16         15         R1(13) R2(13) = 13       R1(13) R2(13) = 13       57
17       16         15         R1(14) R2(14) = 14       R1(14) R2(14) = 14       59
18       13         12         R1(12) R2(12) = 12       R1(12) R2(12) = 12       49
19       17         12         R1(13) R2(13) = 13       R1(12) R2(12) = 12       54
20       13         10         R1(13) R2(14) = 13.5     R1(13) R2(13) = 13       49.5
21       16         15         R1(12) R2(12) = 12       R1(12) R2(12) = 12       55
22       12         13         R1(14) R2(14) = 14       R1(14) R2(14) = 14       53
23       16         11         R1(13) R2(14) = 13.5     R1(13) R2(14) = 13.5     54
24       16         15         R1(13) R2(14) = 13.5     R1(15) R2(15) = 15       59.5
25       16         14         R1(11) R2(11) = 11       R1(14) R2(14) = 14       55
26       14         15         R1(12) R2(14) = 13       R1(14) R2(13) = 13.5     55.5
27       8          9          R1(9) R2(9) = 9          R1(11) R2(11) = 11       37
28       15         15         R1(13) R2(13) = 13       R1(13) R2(13) = 13       56
29       15         12         R1(11) R2(10) = 10.5     R1(11) R2(11) = 11       48.5
30       11         13         R1(13) R2(14) = 13.5     R1(10) R2(10) = 10       47.5
31       17         14         R1(14) R2(14) = 14       R1(15) R2(15) = 15       60
32       14         12         R1(13) R2(14) = 13.5     R1(15) R2(15) = 15       54.5
33       16         14         R1(13) R2(12) = 12.5     R1(12) R2(12) = 12       54.5
34       17         15         R1(12) R2(14) = 13       R1(14) R2(15) = 14.5     59.5

R1 = Rater One; R2 = Rater Two
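The arithmetic behind each row of Table 1 can be sketched as follows. This is a minimal illustration, assuming (as the rows suggest) that each writing subtest score is the mean of the two raters' scores and that the total is the sum of the four subtest scores. The function names are hypothetical and are not part of the original scoring procedure.

```python
def rated_score(rater1: float, rater2: float) -> float:
    """Average the two raters' scores for one subjectively scored subtest."""
    return (rater1 + rater2) / 2


def total_score(grammar1, grammar2, writing1_raters, writing2_raters):
    """Add the two grammar scores to the two averaged writing scores."""
    return (grammar1 + grammar2
            + rated_score(*writing1_raters)
            + rated_score(*writing2_raters))


# Learner 1 in Table 1: 9 + 15 + 13 + 14
print(total_score(9, 15, (12, 14), (14, 14)))  # 51.0
```

Keeping the rater scores as pairs, rather than storing only the average, preserves the information needed later for inter-rater reliability.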

Table 2

Midterm Exam Descriptive Statistics (n=34)

Test       Points    Mean    Mode   Median  Range  Standard   Variance
           Possible                                Deviation
Subtest 1  17        13.21   16     14      13     3.51       12.29
Subtest 2  15        13.44   15     14      6      1.65       2.74
Subtest 3  14        12.37   13     13      11     2.02       4.10
Subtest 4  15        12.69   12     13      9      1.88       3.55
Total      61        51.41   59.5   52.5    33     6.67       44.55
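As a rough check, the central-tendency figures for Subtest 1 can be recomputed from the Grammar Subtest 1 column of Table 1. The sketch below uses Python's standard `statistics` module; the `describe` helper is a hypothetical name. Standard deviation and variance are left out because the paper does not say whether the sample or the population formula was used.

```python
import statistics

# Grammar Subtest 1 scores for learners 1-34, copied from Table 1.
subtest1 = [9, 11, 10, 11, 9, 4, 8, 17, 16, 10, 8, 17, 10, 17, 14, 16, 16,
            13, 17, 13, 16, 12, 16, 16, 16, 14, 8, 15, 15, 11, 17, 14, 16, 17]


def describe(scores):
    """Return the mean, mode, median and range of a list of scores."""
    return {
        "mean": round(statistics.mean(scores), 2),
        "mode": statistics.mode(scores),
        "median": statistics.median(scores),
        "range": max(scores) - min(scores),
    }


print(describe(subtest1))
```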

The frequency histograms (Appendix H) are all negatively skewed, showing that, in general, students did well on the exam. Most students had a total score higher than 47, which is a grade of 77% (a “C” on a typical US letter grading scale). Section I, Grammar (Parts of Speech), was the most difficult, judging from the fact that it had the broadest range and the most even mix of scores. We can tell from even a superficial analysis that Section I, part i tripped up many test-takers; it seems that they were not as familiar as we had hoped with the parts-of-speech terminology and definitions. The matching section was also tricky for some students because there were exactly as many definitions as there were terms to match. This issue goes back to what Dr. Bailey said about “biasing for best”: once students missed one question, they were doomed to miss at least one other. However, we don’t think that the students’ struggles with this section are due entirely to our format of the task. Many students also misidentified the parts of speech in the multiple-choice sections, even with contextualized examples of the words. In particular, many test-takers missed question 2 and question 8, which deal with coordinating and subordinating conjunctions. Interestingly enough, Section II, which dealt with conjunctions and transitions specifically, was relatively easy for the students in comparison. The scores for this section had a range of only six, and the histogram is clearly negatively skewed. This suggests that the test-takers understand how conjunctions work in context; they just don’t know how to label them. This information was very useful to Penny, who then knew she should review these terms again before the end of the term.

Swain’s Communicative Testing Framework as it Applies to our Test

Swain (1984) puts forth four criteria by which communicative tests should be evaluated: (1) start from somewhere, (2) concentrate on content, (3) bias for best, and (4) work for washback. “Starting from somewhere” saves test developers from having to “reinvent the wheel,” so to speak, and refers to the fact that the test should be relevant to the learners’ needs, goals, identities and previous knowledge in some way. To “concentrate on content,” test developers must ensure that all of the material on the test (stimuli and tasks posed to the learner) gives learners the opportunity to show off all four components of communicative competence: grammatical, sociolinguistic, discoursal and strategic performance (p. 190). Test developers have “biased for best” if they have done “everything possible to elicit the learners’ best performance” (Swain, 1984, p. 195). Bailey and Curtis (2015) define washback as the “effect a test has on teaching and learning” (p. 349). Ideally, a test should work for positive washback, meaning that the act of preparing for and taking a test should help learners achieve an overall desirable level of fluency (Swain, 1984, pp. 196-197). In Table 3, we have outlined how our test exemplifies Swain’s four communicative testing principles.

Table 3: Swain’s (1984) Four Principles of Communicative Language Testing

Swain’s Test Analysis Principles    ENSL Midterm

Start from somewhere
● Progress test for college intermediate writing skills.
● Our constructs: Linguistic competence, sociolinguistic competence, discoursal competence
● Content: The course materials Penny gave us, observations and previous knowledge about the course.

Concentrate on content
● Motivating presentation: color photos of familiar, eye-catching advertisements in Section IV
● Substantive: Cloze passage and Section III case study contained relevant stories which were new to students and presented a new perspective on the immigration stories they had encountered in class.
● Integrated: Sections II, III and IV revolved around familiar themes of immigration, American values and multiculturalism. Admittedly, the Grammar subtests (Sections I and II) could have been more integrated, as we will explain further in the reflection.
● Interactive: New substantive content, in the form of the case study and the advertisements presented, gave learners an opportunity to respond with their opinions and original ideas.

Bias for best
● Presenting the test in person: we explained what test-takers needed to do and answered any questions they had
● Based all tasks on previous assignments in class (no surprises, except for the original content)
● Students were informed ahead of time of what general topics would be tested.
● Explicit instructions: a condensed version of the rubric was included in the writing sections so that test-takers knew exactly what they would be scored on.
● Sequence of test materials: starting with basic parts of speech scaffolded the following sections; test-takers could refer back to the definitions as a resource throughout the test.

Work for washback
● Penny was involved in the development of the test, its administration and its scoring (as a consultant)
● Test gives opportunity to prepare for mainstream academia and also life in the US, being aware of advertisements and the values they impart.
● Test-takers become familiar with American academic discourse and American culture as well
● Sharing answers with Penny provided feedback on exactly which students struggled with which concepts; she will use the information to guide her instruction for the rest of the semester

Wesche’s Framework as it Applies to our Test

The following tables illustrate Wesche’s (1983) model for language testing. The first component in Wesche’s framework is stimulus material, which refers to information presented to test takers that prompts them to demonstrate the skills intended to be assessed. The second component is the task posed to the learners, which is how students understand the task presented to them, and the third is the learner’s response, which is their response to the task and how well they perform it. The final component is scoring criteria, which is what is used to score the task; without scoring criteria, the task is merely an activity (Bailey & Curtis, 2015). Table 4 analyzes Sections I and II of our test, which focus on grammar; Table 5 analyzes Sections III and IV, which focus on writing.

Table 4: Wesche’s (1983) four components of a language test - Section I & II

Wesche’s Test Analysis Components    Grammar

Stimulus material

Subtest 1
● matching: The mismatched concepts and definitions of parts of speech are stimulus materials.
● multiple choice: The sentences found in the item stem or item options with underlined parts of speech are stimulus materials.

Subtest 2
● sentence rewrite: The two separate sentences, the sentences with missing conjunctions, and the word bank with the list of conjunctions are stimulus materials.
● cloze passage: The story of the immigrant student and the word bank with the list of conjunctions and transitions are stimulus materials.

Task posed to the learner

Subtest 1
● matching: This task asks test-takers to sort through the mismatched concepts and definitions of parts of speech and find the correct match.
● multiple choice: This task asks students to identify certain parts of speech that are underlined in sentences. Students then pick the correct answer in a list of options labeled a through d.

Subtest 2
● sentence rewrite: This task asks test-takers to either rewrite two separate sentences by connecting them with the appropriate conjunctions found in a word bank, or fill in the blank part of a sentence with an appropriate conjunction found in a word bank. (As mentioned above, these questions are formatted in two different ways)
● cloze passage: This task asks test-takers to read a paragraph with missing conjunctions and transitions. Test-takers have to choose the appropriate conjunctions and transition words in the word bank to make the paragraph cohesive and coherent.

Learner’s response

Subtest 1
● matching: Since Sections I and II are discrete-point items, the test-takers only need to match each part of speech that is represented in the left column of the table with the appropriate definition that is listed in the right column. Each definition has a letter representing it, so test-takers write the letter next to the part of speech in the left column.
● multiple choice: After reading the question stem, the test-takers circle the correct letter in the list of options.

Subtest 2
● sentence rewrite: Test-takers either rewrite the two sentences by combining them with the appropriate conjunction from the word bank, or they just write the correct conjunction in the blank space provided in the sentence. (As mentioned above, these questions are formatted in two different ways)
● cloze passage: Test-takers choose the correct conjunction or transition word from the word bank and write it in the blank spaces of the paragraph

Scoring criteria

Subtest 1
● matching: There is only one correct answer since this is a discrete-point item. There is a key available for all the objectively scored items (Appendix F)
● multiple choice: (See above)

Subtest 2
● sentence rewrite: (See above)
● cloze passage: (See above)

Table 5: Wesche’s (1983) four components of a language test - Section III & IV

Wesche’s Test Analysis Components    Writing

Stimulus material

Subtest 3: The fictionalized story is stimulus material.
Subtest 4: The pictures of the advertisements are stimulus materials.

Task posed to the learner

Subtest 3: This task asks test-takers to read a story and write four to five sentences about which American values apply to it. In addition to identifying the values present in the story, test-takers must support their claims with concrete examples from the text.

Subtest 4: This task asks test-takers to choose one of the two advertisements and write a paragraph with a topic sentence and supporting sentences. The supporting sentences must include:
a) descriptions of the product
b) descriptions of the target market
c) description of the ad using descriptive adjectives
d) description of the American values used in the ad
e) description of the ad strategies

Learner’s response

Subtest 3: This component is the actual written response. It could also include notes taken about the story.
Subtest 4: This component is also the written response. It could also include any outlining that the test-taker prepared before writing the paragraph.

Scoring criteria

Subtest 3: We created a holistic rubric for this section, scaled 0 to 14 (Appendix G).
Subtest 4: We created a holistic rubric for this section, scaled 0 to 15 (Appendix G).
*To determine the total scores of these sections, we averaged the two raters’ scores.

Reflection

This was the first time either of us had designed our own test, and we were fortunate to work together on such a demanding task. It was easy to overlook the most minute details in our design, so it was beneficial to test our ideas on one another and our classmates. Even with the amount of time spent constructing this test, Penny’s students answered items on our test in ways that we had not anticipated. The whole process has been revealing, and through our semester-long research on language assessment some powerful lessons have emerged that have helped shape our testing philosophies.

For instance, initially, neither of us found multiple-choice questions very valuable when testing a learner’s language ability. Multiple choice allows students to guess even if they do not know the answer, which is potentially an issue when assessing someone’s language knowledge; the assessor does not always know if the test-taker truly understands the concept being tested or not. However, multiple choice can also be a relief for the test-taker, for it provides options for the test-taker to consider. We both ended up feeling that teachers should proceed with caution when using multiple-choice items because we would not want a student’s grade to hinge on these types of items. Multiple choice is not necessarily unhelpful, but perhaps these items should only be included in sections of a test that do not carry a lot of weight, or only used for small in-class quizzes. It is also worth mentioning that good multiple-choice items are difficult to create, and, in turn, students do not always benefit from them because of their flaws.

Another step we could have taken to improve the test takers’ experience would have been to concentrate more on content. We could have integrated the themes of American values or multiculturalism more into our discrete-point tasks, perhaps by taking or adapting the sample sentences from texts they had already read for class. This would have made the test more cohesive overall and unified around the content-based theme.

Additionally, we learned a lot about scoring subjective portions of a test. Not only did we gain experience in creating rubrics and norming, we also realized some ways in which we could streamline this process to avoid confusing and time-consuming conversions between the band levels and the scores. If we could redo the whole scoring process, we would develop a 7-band scale for Section III instead of a 3-band one. Then, rather than converting the score to one between 0 and 14, we would simply add our two rater scores together to get each test taker’s final score for that section. We would follow a similar process with Section IV so that there were fewer decisions involved in the scoring process for each subjectively scored section. Unfortunately, this solution did not occur to us until after we had completed our analysis of the test, but it’s valuable insight for the next time we create a test.

There are, of course, some positive takeaways from this test-developing experience. For example, we are pleased with the visual formatting of our test, which we feel has a nice balance of white space and text. We were told by Penny that a crowded layout can make test-takers feel anxious, but ours was apparently very “non-threatening” in appearance. We also feel that our directions for the various testing tasks were very clear and succinct. We presented them in two modes, both written and spoken, which helps to bias for best. What’s more, we like the idea of each test task scaffolding the next one, which we tried to accomplish by starting with the more foundational tasks like identifying parts of speech, then moving gradually to more complex tasks like writing a whole paragraph or essay. These are all practices which we will continue throughout our careers as educators and test developers.

When we finished grading the test, we had coffee with Penny and showed her our findings. We were able to pinpoint what topics her students were still struggling with and what topics they understood. She was pleased with our findings and she said our test measured what she was looking for. She also was able to provide us with background information about certain students and describe why they may have performed well or poorly on the test. Some of her students have families to provide for. Some students have full-time jobs and still travel long distances to school. Some students come from war-torn countries and are still assimilating into the culture. These individual stories reminded us that students and test-takers have lives outside of the language classroom and should be treated as persons in context; there are many factors that contribute to how a test-taker performs on a test.


References

Bailey, K. M., & Curtis, A. (2015). Learning about language assessment: Dilemmas, decisions, and directions. Boston, MA: National Geographic Learning.

Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1, 1-47.

Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller (Ed.), Issues in language testing research (pp. 333-342). Rowley, MA: Newbury House.

CARLA: Center for Advanced Research on Language Acquisition. (n.d.). Types of rubrics. Retrieved October 15, 2015 from http://www.carla.umn.edu/assessment/vac/improvement/p_5.html

Celce-Murcia, M., & Larsen-Freeman, D. (2016). The grammar book. Boston, MA: National Geographic Learning.

Ellis, R., & Shintani, N. (2014). Exploring language pedagogy through second language acquisition research. New York, NY: Routledge.

Graves, K. (2014). Syllabus and curriculum design for second language teaching. In M. Celce-Murcia, D. M. Brinton, & M. A. Snow (Eds.), Teaching English as a second or foreign language (pp. 46-62). Boston: Heinle.

Swain, M. (1984). Large-scale communicative language testing: A case study. In S. J. Savignon & M. Berns (Eds.), Initiatives in communicative language teaching (pp. 185-201). Reading, MA: Addison-Wesley.

Tedick, D. J. (2002). Proficiency-oriented language instruction and assessment: Standards, philosophies, and considerations for assessment. In Minnesota Articulation Project, D. J. Tedick (Ed.), Proficiency-oriented language instruction and assessment: A curriculum handbook for teachers (Rev Ed.). CARLA Working Paper Series. Minneapolis, MN: University of Minnesota, The Center for Advanced Research on Language Acquisition.

Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.

Wesche, M. B. (1983). Communicative testing in a second language. The Modern Language Journal, 67, 41-55.

Widdowson, H. G. (1990). Grammar, nonsense and learning. In H. Widdowson (Ed.), Aspects of language teaching. Oxford: Oxford University Press.

Monterey Peninsula College (n.d.). Courses Offered: ENSL. Retrieved from http://www.mpc.edu/academics/academic-divisions/humanities-division/english-as-a-second-language-ensl-/esl-program-sequence/ensl-346-446

Wiggins, G., & McTighe, J. (2005). Understanding by design. Columbus, OH: Pearson

Running Head: ORIGINAL TEST PROJECT 1

Original Writing Test for Monterey Peninsula College, Part II

Rachel Musgrove and Brock Ketterling

Middlebury Institute of International Studies at Monterey


In part one of our paper, we described the background and design of the ENSL 346/446

midterm for Penny Partch’s High-Intermediate Writing: American Culture class. With this

foundation, we can now analyze some of the data from the test piloting process. Specifically, we

will examine item facility, item discrimination, distractor analysis, response frequency

distribution, split-half reliability, inter-rater reliability and subtest relationships as they relate to

our data. We will interpret what these statistics mean in terms of the success of individual test

items and the test as a whole, as well as its reliability, validity, practicality and washback.

Item Facility

Table 1 Section I Item Facility (n=34)

Item   Students who answered     Item Facility (I.F.)
       the item correctly
1      33                        0.97
2      32                        0.94
3      31                        0.91
4      23                        0.68
5      22                        0.65
6      27                        0.79
7      19                        0.56
8      30                        0.88
9      15                        0.44
10     28                        0.82
11     24                        0.71
12     28                        0.82
13     29                        0.85
14     21                        0.62
15     25                        0.74
16     29                        0.85
17     32                        0.94

Average I.F. = 0.77


Table 2 Section II Item Facility (n=34)

Item   Students who answered     Item Facility (I.F.)
       the item correctly
1      32                        0.94
2      30                        0.88
3      31                        0.91
4      33                        0.97
5      34                        1.00
6      30                        0.88
7      34                        1.00
8      26                        0.76
9      33                        0.97
10     32                        0.94
11     27                        0.79
12     32                        0.94
13     25                        0.74
14     30                        0.88
15     29                        0.85

Average I.F. = 0.90

Item Facility, according to Bailey and Curtis (2015), is “an index of how easy an

individual item was” for the people who took the test (p. 198). To calculate IF for each test item,

we divided the number of test-takers who answered the item correctly by the total number of

test-takers. Tables 1 and 2 show the IF values for the first two subsections of the ENSL 346/446

midterm, both of which measure grammar knowledge. Oller (1979) describes an ideal IF value as

falling between 0.15 and 0.85, because values in this range indicate more variance among test takers. IF values

closer to 0 or 1.00 do not yield enough variance to be “useful.” If Oller saw our IF values for

Sections I and II, he would regard them as quite dismal. In Section I, items 1, 2, 3, 8, and 17

yielded IF values higher than 0.85. In Section II, every item except 8, 11, 13, and 15 had a


higher IF than “preferred.” In fact, items 5 and 7 exhibited the ceiling effect, which is when every

test-taker gets the item correct.
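The IF calculation described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and comparing the two-decimal rounded value against the 0.85 ceiling is an assumption made so that the result matches how the values are reported in Tables 1 and 2.

```python
def item_facility(num_correct: int, num_test_takers: int) -> float:
    """Proportion of test-takers who answered an item correctly."""
    if num_test_takers <= 0:
        raise ValueError("num_test_takers must be positive")
    return num_correct / num_test_takers


def items_above_ceiling(correct_counts, num_test_takers, ceiling=0.85):
    """Return 1-based item numbers whose rounded IF exceeds the ceiling."""
    flagged = []
    for item_number, correct in enumerate(correct_counts, start=1):
        # Assumption: compare the rounded value, as reported in the tables.
        if round(item_facility(correct, num_test_takers), 2) > ceiling:
            flagged.append(item_number)
    return flagged


# Correct-answer counts for Section II, items 1-15, copied from Table 2.
section2 = [32, 30, 31, 33, 34, 30, 34, 26, 33, 32, 27, 32, 25, 30, 29]
print(items_above_ceiling(section2, 34))
```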

If we were to follow Oller’s (1979) advice, we would rewrite these items to be more

challenging. But as Bailey and Curtis (2015) mention, criterion-referenced tests are usually

meant to be a measure of individual students’ knowledge, and not to yield a normal distribution

of test scores. Our goal in developing the ENSL 346/446 midterm was not to obtain a broad

variance of scores, but rather to help Penny see how well her class has understood the course

material thus far in the semester. Nevertheless, these numbers are very informative, especially

the average IF for both subtests. For example, Section I has an average IF of 0.77, which falls

within the range of 0.15 and 0.85, which shows that it was moderately difficult for test-takers.

This value is especially revealing in comparison to the average IF for Section II, which is 0.90.

Such a high IF value for Section II indicates that this subtest was much easier for the learners.

This stark contrast comes as no surprise when we consider that Section I was intended to

measure metalinguistic terminology of grammar, whereas Section II was designed to test

students’ knowledge of grammar in context. Even though Penny asked us to test both of these

subject areas, students were more familiar with the sentence rewrites and fill-in-the-blank tasks,

both of which are very prevalent in their ESL composition textbook. After conferring with

Penny, we learned that unless students had progressed through the entire ESL sequence at MPC,

they were unlikely to have encountered much metalinguistic terminology. The low IF on the

Parts of Speech portion of the test is probably due to the high number of transfer students in

ENSL 346/446, who are not as familiar with these concepts.

Item Discrimination


Table 3 Section I Item Discrimination (n=34)

Item   High scorers (top nine)    Low scorers (bottom nine)    Item Discrimination
       with correct answers       with correct answers         (I.D.)
1      9                          8                            0.11
2      9                          7                            0.21
3      9                          7                            0.21
4      9                          3                            0.64
5      9                          3                            0.64
6      9                          5                            0.42
7      9                          2                            0.75
8      9                          7                            0.21
9      8                          2                            0.64
10     9                          6                            0.32
11     7                          7                            0.00
12     9                          5                            0.43
13     8                          6                            0.21
14     7                          1                            0.21
15     7                          6                            0.11
16     9                          8                            0.11
17     9                          7                            0.21

Average I.D. = 0.32

Table 4 Section II Item Discrimination (n=34)

Item   High scorers (top nine)    Low scorers (bottom nine)    Item Discrimination
       with correct answers       with correct answers         (I.D.)
1      9                          8                            0.11
2      9                          8                            0.11
3      9                          8                            0.11
4      9                          8                            0.11
5      9                          9                            0.00
6      9                          8                            0.11
7      9                          9                            0.00
8      7                          9                            -0.21
9      9                          9                            0.00
10     9                          8                            0.11
11     8                          6                            0.21
12     9                          8                            0.11
13     9                          6                            0.32
14     9                          5                            0.43
15     9                          6                            0.32

Average I.D. = 0.12

Like Item Facility, Item Discrimination (ID) is calculated for each individual test item. ID,

however, gives a more detailed look at how individual students

performed, with a focus on how the “high scorers” did in relation to the “low scorers” (Bailey &

Curtis, 2015). Furthermore, ID shows us whether items with a low IF score are actually difficult

or if there are other factors at play. Following Flanagan’s method for estimating item

discrimination, we calculated ID by first ranking our scored tests from highest total score to lowest

total score. We then identified the top 27.5% and bottom 27.5% of test takers, which would have

equaled 9.35 people. We rounded this value down to 9, so as not to count “partial people.” We

then constructed Tables 3 and 4, which display the ID values based on the nine highest and nine

lowest scorers on Sections I and II, respectively.
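The grouping step and a conventional upper-lower discrimination index can be sketched as follows. This is a sketch under stated assumptions, not the exact procedure behind Tables 3 and 4: the function names are hypothetical, and dividing the difference by the group size of nine is the common textbook form of the index. The tabled values appear to rest on a slightly different denominator, so the second decimal place may not match.

```python
import math


def group_size(num_test_takers: int, proportion: float = 0.275) -> int:
    """Size of the high and low groups, rounded down to whole people."""
    return math.floor(num_test_takers * proportion)


def item_discrimination(high_correct: int, low_correct: int, size: int) -> float:
    """Upper-lower index: difference in correct answers over group size."""
    if size <= 0:
        raise ValueError("size must be positive")
    return (high_correct - low_correct) / size


n = group_size(34)                          # 34 * 0.275 = 9.35, rounded down
print(n)                                    # 9
print(item_discrimination(9, 9, n))         # 0.0: both groups did equally well
print(item_discrimination(7, 9, n) < 0)     # True: low scorers did better
```

A negative result, as for item 8 in Section II, signals that the low group outperformed the high group on that item.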

Mertler (2003) states that a strong test item has an ID higher than 0.50. For a test item to

be usable, it must have an ID higher than 0.30, because such a value indicates that the high scorers

performed better on the item than the low scorers. A lower ID indicates that high scorers and low

scorers performed more or less equally on the item. A negative ID would indicate that the low

scorers outperformed the high scorers. According to Oller (1979), any value under 0.25 is an

unacceptable ID value. Once again, Oller would be disturbed by our ID values, especially in

Section II, where the average ID is 0.12. Section I’s average ID falls within the range of “fair

quality” according to Mertler (2003). But once again, the purpose of our test was to check

learners’ progress throughout the semester, not how they compare to one another. Therefore, we


as test developers did not write any items with the intent of discriminating between groups

of learners. It is worth reiterating the diversity of the learners in this particular course and

the variety of their English learning backgrounds. Some therefore seem

to have different aptitudes in different areas of grammar, which might explain why the

low scorers performed as well as, or nearly as well as, the high scorers on most of the items in

Section II and on several of the items in Section I.

Distractor Analysis and Response Frequency Distribution

Table 5 Multiple Choice, Section I Distractor Analysis (n=34)

Item   A      B      C      D      Omitted Response
8      1      30*    3      0      0
9      6      0      12     15*    1
10     1      28*    4      1      0
11     1      4      24*    5      0
12     5      1      0      28*    0
13     29*    1      3      0      1
14     4      3      6      21*    0
15     25*    1      7      1      0
16     1      2      29*    1      1
17     0      32*    1      1      0

* indicates the correct option

In the first draft of the ENSL 346/446 midterm, Dr. Bailey critiqued several of our

distractors in the multiple-choice items for being too confusing or misleading to test takers. One way

test developers can see which “distractors” have tricked students is through distractor analysis

(Bailey & Curtis, 2015). The goal in a norm-referenced multiple choice test is for every

distractor to be chosen by at least one person (ibid., p. 200). If a distractor isn’t chosen by

anyone, it should be reconsidered and possibly replaced. Distractor analysis only applies to

multiple choice items, which in our test are items 8 through 17 in Section I. Our breakdown of


the test-takers’ responses to these questions is shown in Table 5. Not surprisingly, only a few of

our items successfully “distracted” students. Item 9 in Section I threw off the most test takers.

This item was meant to assess knowledge of subordinating conjunctions versus coordinating

conjunctions, a distinction that many of the learners in Penny’s class reportedly struggled with.

Beatriz will stay at school until she finishes her project.

a. adverb    c. coordinating conjunction
b. pronoun   d. subordinating conjunction

The correct answer to this item was D, but almost as many students chose option C. We

speculate that this misconception is due to the fact that test-takers are most familiar with the

mnemonic device FANBOYS (for, and, nor, but, or, yet, so) when identifying

coordinating conjunctions. We noticed while scoring the tests that those test takers who got item

9 correct wrote this acronym on their testing papers next to the question. It’s obvious that most of

the test takers (27 out of 34) knew that “until” was a conjunction; they just didn’t know which type

it was. One way we could have “biased for best” more in developing this test would have been to

include the acronym FANBOYS somewhere near this test item, since that’s the terminology the

students are more familiar with. One implication of this test item is that, when writing tests, it is

important to use the terminology with which students are already familiar.
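The check described above, whether every distractor attracted at least one test-taker, can be sketched as follows. The function name and the dictionary layout are hypothetical; the response counts are copied from Table 5.

```python
def unchosen_distractors(counts: dict, key: str) -> list:
    """Return the incorrect options that no test-taker selected."""
    return [option for option, chosen in counts.items()
            if option != key and chosen == 0]


# Response counts from Table 5 (omitted responses left out).
item_8 = {"A": 1, "B": 30, "C": 3, "D": 0}    # correct answer: B
item_9 = {"A": 6, "B": 0, "C": 12, "D": 15}   # correct answer: D

print(unchosen_distractors(item_8, "B"))  # ['D']
print(unchosen_distractors(item_9, "D"))  # ['B']
```

Any option returned by this check is a candidate for reconsideration, since it did not function as a distractor for this group.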

Table 6 Response Frequency Distribution on Grammar Subtest

Item   High/Low Scorers   A      B      C      D      Omitted Response
1      High               0      9*     0      0      0
       Low                1      7      1      0      0
2      High               0      0      1      8*     0
       Low                0      0      6      2      1
3      High               0      9*     0      0      0
       Low                0      6      2      1      0
4      High               0      0      7*     2      0
       Low                0      2      7      0      0
5      High               0      0      0      9*     0
       Low                3      1      0      5      0
6      High               8*     0      1      0      0
       Low                6      1      1      0      1
7      High               1      1      0      7*     0
       Low                2      2      4      1      0
8      High               7*     0      2      0      0
       Low                6      1      2      0      0
9      High               0      0      9*     0      0
       Low                0      0      8      0      1
10     High               0      9*     0      0      0
       Low                0      7      1      1      0

* indicates the correct option. Items 1 to 10 in this table correspond to multiple-choice items 8 to 17 of Section I.

Table 6 offers an even more revealing look at the choices these test-takers made in the

multiple choice section by breaking the distractor analysis down by which options the high

scorers chose in comparison to the low scorers. Returning again to item 9, it’s interesting to see

that most of the top scorers got this item right, whereas the majority of the low scorers were

tricked by option C.

If we had a norm-referenced test, Table 6 would look a little more uniform in terms of the

response frequency distribution. Because many of the distractors were never chosen, we might

consider rewriting some items or distractors so that they would be more difficult for test-takers.

Item 14 was the only item in which the low scorers were spread somewhat evenly across all

the options. This item required test-takers to identify which part of speech the word “usually”

belongs to. It makes sense that so many test-takers were unable to identify it as an adverb

because a large proportion of test takers also missed item 5 (as shown by its item facility of 0.65

in Table 1), which asked them to define an adverb. Table 6 shows that it was mostly the low

scorers who were confused by this item, while only two high scorers got it wrong. It is reassuring to


see that the distractor analysis and response frequency distribution align with the item facility

values for both questions in Section I regarding adverbs.

Item 11 also shows an interesting distribution of responses, in which an equal number of

high and low scorers answered correctly, but those that were distracted were fooled by different

options.

Which of the following is a helping (auxiliary) verb?

a. The can of soda exploded.

b. Throw me a can of soda!

c. Would you like a can of soda?

d. I wouldn’t like a can of soda.

Their confusion could be due to the wording of the correct answer (Option C), which is in the

interrogative. The learners might be more likely to identify an auxiliary verb in a declarative

sentence where there isn’t any subject-auxiliary inversion. This is also an item which we reworked after our

pre-pilot because Dr. Bailey mentioned that some of the options were too similar, which she

thought might be a “giveaway” to test-takers. Perhaps by reworking the wording of these options,

we made the question “sufficiently” difficult, as Oller (1979) would say.

Reliability

Table 7 Internal Consistency Measures

Subtest      Split-Half    Reliability after using    Standard    Confidence   Points
             Reliability   Spearman-Brown             Deviation   Interval     Possible
                           Prophecy Formula
Section I    0.76          0.86                       3.51        0.09         17.00
Section II   0.60          0.75                       1.65        0.08         15.00

Brown (2005) defines reliability as “the extent to which the results [of a test] can be

considered consistent or stable” (p. 175). In other words, if we were to administer the ENSL


346/446 midterm again several weeks after the initial test date, we should expect the learners to

score very much the same as they did the first time. Because of the impracticalities of

administering the same test to the same population twice in a short time period, we opted to

measure the reliability of the objectively scored test items through internal consistency methods.

Specifically, we used the split-half reliability method by splitting the test into two similar parts

based on odd-numbered items and even-numbered items (Appendix A). We then correlated the

scores of the test-takers on the two halves of the test with Cronbach’s alpha, as if they were

separate tests (Brown, 2005; Hatch and Farhady, 1982). Once we had obtained the reliability for

the two halves of the test, we used Spearman Brown’s prophecy formula to determine the

reliability of the full test. The coefficients for both internal consistency methods are indicated in

Table 7.
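The Spearman-Brown step can be illustrated with the split-half coefficients from Table 7. The sketch below assumes the standard two-halves form of the prophecy formula, r_full = 2r / (1 + r); the function name is hypothetical.

```python
def spearman_brown(half_test_reliability: float) -> float:
    """Estimate full-test reliability from a split-half coefficient."""
    r = half_test_reliability
    return (2 * r) / (1 + r)


print(round(spearman_brown(0.76), 2))  # 0.86 (Section I)
print(round(spearman_brown(0.60), 2))  # 0.75 (Section II)
```

The correction is needed because each half contains only half as many items as the full subtest, and shorter tests tend to yield lower reliability estimates.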

We are satisfied with the relatively high internal consistencies of both of our Grammar

subtests. The results of the Spearman Brown prophecy formula align with our previous

assumptions about the questions regarding adverbs in Section I. We can be fairly confident that

Section I consistently measures knowledge of Parts of Speech, both internally and if we were to

administer the test a second time.

Reliability for Section II was comparatively lower than for Section I. This could be because

the format of the test items varied throughout the section. Section II, part i required test-takers

to rewrite sentences, whereas part ii was a gap-fill requiring test-takers to select the correct word

to fill in the blanks in a paragraph. Although both task types were designed to measure

knowledge of conjunctions, perhaps the inconsistent formatting contributed to the overall lower

reliability score for this section. To improve reliability for Section II, we might consider making

the task types more homogeneous, not only to improve consistency but also to make sure they are


indeed testing the same constructs. This last point is more related to validity than reliability, but

as Bachman (1990) states, “when we increase the reliability of our measures, we are also

satisfying a necessary need for validity: in order for a test score to be valid, it must be reliable”

(p. 160).

Since we have a criterion-referenced test, we did not calculate Standard Error of

Measurement (SEM). Instead, we calculated confidence intervals, which are a “zone within

which a test-taker’s score would fall if he [or she] were tested repeatedly over the same

constructs without learning or forgetting taking place” (Bailey & Curtis, 2015, p. 244).

Confidence Intervals carry out the same function as SEM, but are specific to criterion-referenced

tests. For example, if a student scored a proportion of 0.88 on Section I of the midterm, with a

confidence interval of 0.09, that same student could be expected to score between 0.79 and 0.97

on the same section if she were tested repeatedly, at least 68 percent of the time (Brown, 2005).

This is a fairly wide band for scores in Section I, which could be due to the fact that there are so

few questions. The difference between a score of 79% and 97% is only three questions. The

confidence intervals add depth and context to our previous measures of internal consistency, and

show us how test-takers’ scores might fluctuate over time. Even though we have a high

reliability for the objectively scored portions of our tests according to the Spearman-Brown

prophecy formula, our confidence intervals let us know that scores could vary quite widely if we were to

administer the test again.
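The arithmetic behind this band can be sketched in a few lines of Python. This is only an illustration of the example above: the function name `score_band` is our own (hypothetical), the 17-item length of Section I is taken from the point total printed on the test, and the confidence interval of 0.09 is treated as given rather than derived here.

```python
def score_band(proportion, ci, n_items):
    """Return the (low, high) proportion band and its width in test items."""
    low = max(0.0, proportion - ci)    # a proportion score cannot fall below 0
    high = min(1.0, proportion + ci)   # or rise above 1
    width_in_items = (high - low) * n_items
    return low, high, width_in_items


# Values from the example in the text: a score of 0.88 with a CI of 0.09
# on Section I, which is assumed here to contain 17 one-point items.
low, high, width = score_band(proportion=0.88, ci=0.09, n_items=17)
print(f"band: {low:.2f} to {high:.2f}, about {width:.0f} items wide")
```

Expressing the width in items rather than percentages makes it easier to see why a short section produces such a wide band: each single item carries a large share of the section score.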

Inter-rater Reliability

Table 8 Inter-rater Reliability for Section III

Learner              Rater 1   Rater 2   Rater 1 + Rater 2
1                    12        14        26
2                    12        13        25
3                    12        12        24
4                    10        10        20
5                    12        13        25
6                    4         2         6
7                    12        13        25
8                    13        13        26
9                    12        12        24
10                   14        14        28
11                   13        14        27
12                   14        14        28
13                   13        12        25
14                   12        13        25
15                   12        12        24
16                   13        13        26
17                   14        14        28
18                   12        12        24
19                   13        13        26
20                   13        14        27
21                   12        12        24
22                   14        14        28
23                   13        14        27
24                   13        14        27
25                   11        11        22
26                   12        14        26
27                   9         9         18
28                   13        13        26
29                   11        10        21
30                   13        14        27
31                   14        14        28
32                   13        14        27
33                   13        12        25
34                   12        14        26
Mean                 12.21     12.53     24.74
Standard Deviation   1.82      2.29      4.04
Variance             3.32      5.23      16.32

Coefficient Alpha = 0.95

Table 9 Inter-rater Reliability for Section IV

Learner              Rater 1   Rater 2   Rater 1 + Rater 2
1                    14        14        28
2                    13        14        27
3                    12        12        24
4                    14        14        28
5                    8         9         17
6                    6         6         12
7                    13        13        26
8                    13        14        27
9                    12        12        24
10                   12        12        24
11                   13        14        27
12                   14        15        29
13                   14        13        27
14                   12        14        26
15                   11        11        22
16                   13        13        26
17                   14        14        28
18                   12        12        24
19                   12        12        24
20                   13        13        26
21                   12        12        24
22                   14        14        28
23                   13        14        27
24                   15        15        30
25                   14        14        28
26                   14        13        27
27                   11        11        22
28                   13        13        26
29                   11        11        22
30                   10        10        20
31                   15        15        30
32                   15        15        30
33                   12        12        24
34                   14        15        29
Mean                 12.59     12.79     25.38
Standard Deviation   1.89      1.92      3.77
Variance             3.58      3.68      14.18

Coefficient Alpha = 0.98

Until now, our reliability measures have only applied to the objectively scored portions of

our test. The subjectively scored portions (Sections III and IV) presented their own unique


challenges to score. As mentioned in Part I of our paper, we developed a holistic rubric with

which both of us scored each of the subjective test items. In order to measure how consistent

both of us were at using the same rating system, we used Cronbach’s alpha to measure inter-rater

reliability. Bailey and Curtis (2015) define inter-rater reliability as “the consistency with which

two or more raters evaluate the same data using the same scoring criteria” (p. 164). Ideally, those

ratings should be identical or very similar. The closer the value is to 1.00, the greater the inter-

rater reliability.
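As a rough check on the coefficients reported in Tables 8 and 9, the calculation can be sketched as follows. We assume here the standard form of Cronbach's alpha, with each rater treated as one "item" (k = 2); the function name `cronbach_alpha` is our own, and the inputs are the rounded variances reported in the tables, so the results are approximate.

```python
def cronbach_alpha(rater_variances, total_variance):
    """Cronbach's alpha from per-rater variances and the variance of summed ratings.

    Assumed formula: alpha = k / (k - 1) * (1 - sum of rater variances / total variance),
    where k is the number of raters.
    """
    k = len(rater_variances)
    return (k / (k - 1)) * (1 - sum(rater_variances) / total_variance)


# Rounded variances for Rater 1, Rater 2 and Rater 1 + Rater 2 (Tables 8 and 9).
alpha_section_3 = cronbach_alpha([3.32, 5.23], 16.32)
alpha_section_4 = cronbach_alpha([3.58, 3.68], 14.18)
print(round(alpha_section_3, 2), round(alpha_section_4, 2))
```

With two raters, alpha approaches 1.00 only when the variance of the summed ratings is close to twice the sum of the individual rater variances, which is what happens when both raters rank the learners in nearly the same way.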

As shown in Tables 8 and 9, our coefficient alphas for Sections III and IV are 0.95 and 0.98

respectively. These strong coefficient values are likely due to the rubric which we developed together and

the norming process we underwent before using it. As we mentioned in Part I of this paper, this

collaborative process helped us achieve a very high inter-rater reliability. However, we wouldn’t

expect such a strong reliability if we were to give the rubric to two other raters and ask them to

score responses from the same test, since we created the rubric and were more familiar with the

nuances of the different descriptions for each level. If we were to pass this test along to be used

in another setting, we would have to write a detailed protocol for using the rubric and also

provide benchmark examples for each level.

Subtest Relationships

Table 10 Subtest Relationships (df = 32, p < .05)

Correlation Coefficients (Pearson’s r)

Test         Writing 2   Writing 1   Grammar 2   Grammar 1   Total Test
Total Test   0.81        0.79        0.35        0.79        -
Grammar 1    0.51        0.50        0.04        -           0.79
Grammar 2    0.24        0.18        -           0.04        0.35
Writing 1    0.68        -           0.18        0.50        0.79
Writing 2    -           0.68        0.24        0.51        0.81


Table 11 r-squared for Subtest Relationships

Overlapping Variance

Test         Writing 2   Writing 1   Grammar 2   Grammar 1   Total Test
Total Test   0.66        0.62        0.69        0.77        -
Grammar 1    0.26        0.25        0.00        -           0.77
Grammar 2    0.05        0.03        -           0.00        0.69
Writing 1    0.46        -           0.03        0.25        0.62
Writing 2    -           0.46        0.05        0.26        0.66

We used Pearson’s r correlation coefficient to calculate the relationship between the

scores for each subtest and the test as a whole. We then used r-squared to determine the

overlapping variances between the subtests and the total test. At first glance, our subtest

relationships seem quite weak. For example, the r-squared value between Section I (Grammar

1) and Section II (Grammar 2) is 0.00, showing virtually no overlap. What this means in terms of our

test is unclear. It could mean, as Jean Turner explained to us (personal communication), that our

subtests measure different skills in terms of our original test constructs. Or as Oller (1979)

argues, low correlation coefficients may not necessarily mean that the subtests measure different

areas of knowledge. They could, in fact, be measuring the same kinds of knowledge but not in

adequate ways. Low correlation could indicate an overall low reliability in the test or in one

section, or perhaps that the test was “poorly calibrated with respect to the tested subjects” (Oller,

1979, p. 188).
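The relationship between a correlation coefficient and its overlapping variance can be illustrated with a short sketch. The functions `pearson_r` and `overlapping_variance` are our own illustrative names, and the five pairs of scores are invented purely to show the computation; they are not learner data from this test. The final loop squares three of the coefficients reported in Table 10.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(ss_x * ss_y)


def overlapping_variance(r):
    """r-squared: the proportion of variance two sets of scores share."""
    return round(r ** 2, 2)


# Hypothetical scores for five learners on two subtests (illustration only).
grammar = [10, 12, 9, 14, 11]
writing = [20, 25, 19, 27, 24]
r = pearson_r(grammar, writing)
print(round(r, 2), overlapping_variance(r))

# Squaring coefficients reported in Table 10 (Writing 1 with Writing 2,
# Grammar 1 with Writing 2, Grammar 1 with Grammar 2).
for reported_r in (0.68, 0.51, 0.04):
    print(reported_r, overlapping_variance(reported_r))
```

Because squaring shrinks values below 1, even a moderate coefficient such as 0.51 corresponds to only about a quarter of shared variance, and a coefficient of 0.04 rounds to none at all.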

Between Turner’s and Oller’s different interpretations of subtest relationships, we would

side with Turner’s assumption that our various subsections demand different skills from the test-

takers. As we mentioned in Part I of our paper, knowledge of parts of speech is not necessarily

indicative of a test-taker’s grammatical awareness as a whole. Perhaps a section that we initially

thought was testing linguistic competence turned out to measure only a narrow portion of that


construct. Additionally, our test boasts an overall high level of reliability according to our

internal consistency measures, further contradicting Oller’s argument that the sections are

“poorly calibrated” to one another.

One other potential issue that we noticed in our correlation calculation is that our testing

group did not seem to match the conditions for the Pearson’s r statistic (Turner, 2014). Namely, r

is a parametric statistic and our testing sample was too small at only 34 members. Also, although

our data was interval-like and rankable, it was not normally distributed, which is what Pearson’s r

calls for. We decided to retry our correlation calculation statistics using Kendall’s tau, which is

the non-parametric counterpart to Pearson’s r (Appendix B). Tau also handles tied ranks better

than other non-parametric correlation statistics like Spearman’s rho, and gives a more precise

estimate of correlation strength (ibid.). Unfortunately, Kendall’s tau did not yield much higher

correlation values than Pearson’s r. In fact, almost all of our subtest relationship correlations

were lower once we calculated them using Kendall’s tau.
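To show how a tie-aware rank correlation works, here is a minimal sketch of Kendall's tau. We assume the tau-b variant, which adjusts the denominator for tied pairs; the source does not say which variant was used for Appendix B, so this is an assumption. The function name `kendall_tau_b` and the four pairs of scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length lists of scores, allowing ties.

    Assumes at least one pair of learners differs on each variable;
    otherwise the denominator is zero.
    """
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                 # tied on both variables: not counted
        elif dx == 0:
            tied_x_only += 1         # tied on x only
        elif dy == 0:
            tied_y_only += 1         # tied on y only
        elif dx * dy > 0:
            concordant += 1          # both variables order the pair the same way
        else:
            discordant += 1          # the variables order the pair in opposite ways
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denominator


# Hypothetical scores with ties (illustration only, not learner data).
subtest_a = [1, 2, 2, 3]
subtest_b = [1, 2, 3, 3]
print(round(kendall_tau_b(subtest_a, subtest_b), 2))
```

Since tau counts agreements between pairs of learners rather than measuring distance from a line, it is common for tau values to come out lower than Pearson's r on the same data, which is consistent with the pattern described above.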

Discussion

Bailey and Curtis (2015) mention that there are four traditional criteria for evaluating

tests: reliability, validity, practicality and washback. Reliability, as previously mentioned, has to

do with how consistent and stable test results are across time. Validity, a related concept, refers

to “how well a test does what it’s supposed to do” (Oller, 1979, p. 4). In other words, does the

test measure the construct that it claims to? Practicality deals with procedures for developing,

administering and scoring a test and how feasible they are in context. Washback is “the effect a

test has on teaching and learning... either positive or negative” (Bailey & Curtis, 2015, p. 3).

In terms of reliability, our test had pros and cons, which emerged through the scoring

process. One strength was definitely our inter-rater reliability, which was extremely high, due to


our detailed rubric and thorough norming process. Our main concerns are with the reliability of

the objectively scored sections. We may want to review these sections and rework the item

formats, which are quite varied throughout the two sections. Perhaps these differences played a

role in how learners answered the questions, along with their knowledge of conjunctions or parts

of speech as a whole.

Our subtest relationships made us question the validity of our test, especially between

sections that measured seemingly similar skills, such as grammar in Sections I and II and writing

in Sections III and IV. We would like to investigate further whether these low correlations are

due to the fact that the subtests measure truly different areas of knowledge, or if there was some

other intervening factor at play. We do think, however, that our test contains face validity,

especially in the last two sections. The test items reflected material that test-takers were already

familiar with and addressed relevant issues in their academic careers and personal lives. Perhaps

one weakness in our design of Section I was that it lacked face validity to students. As language

educators and test developers, we realize the importance of parts of speech and we were asked to

include them, but perhaps we didn’t integrate the Section I test items well enough with the

content that the test-takers had been learning, or what they were tested on in other subsections.

From a practical standpoint, a great deal of time went into designing, pre-piloting,

editing, piloting, scoring and interpreting our test. But this effort was to be expected for two

novice test developers, creating our first exam for a real teaching context. In comparison to other

types of tests, however, the administration of the ENSL 346/446 midterm was simple. For

instance, with no listening section or speaking section, we did not need to spend extra time and

manpower playing an audio clip or interviewing the test-takers. All we had to do essentially was

explain the test format and leave the test-takers to their own devices. The scoring, although

time-consuming, was also relatively straightforward. Our design of a holistic rubric helped

considerably, because it saved on the amount of time we spent with each text and the number of

decisions we needed to make. One change we might make to the multiple-choice items in Section

I would be to add a space in the margin where the test-takers could write their letter answers.

One would be surprised how many different interpretations test-takers will come up with for

marking answers when asked simply to “identify the parts of speech of the underlined

words/phrases.” It became confusing to decipher test-takers’ responses when some circled

just the letter of the response, some circled the entire response, and some underlined or

crossed out certain options, even if they didn’t end up choosing them. Having a uniform

space for answers would have streamlined the grading process.

Upon our final meeting with Penny, we received evidence of positive washback from

our test. By looking at our data and pinpointing specific test items that many learners

struggled with (for example, questions 9 and 14 in Section I), Penny was able to see how

effective her instruction of these concepts had been, and use this feedback to guide her

curriculum for the rest of the semester. Furthermore, our test was very well grounded in

the teaching context because of our thorough needs analysis. We know that the material

will be relevant to students’ further studies at MPC, and may affect their interpretations of

American culture in their day-to-day lives.

Conclusion

In our analysis of item facility, item discrimination, response frequency distribution,

reliability and subtest relationships, several strengths and shortcomings of our test became

apparent. We are grateful that we took the time to scrutinize such minute details of the

exam because it gave us insight into how we can create more effective tests in the future.

Overall, despite the weaknesses in our test design, we feel that our test accomplished its

intended purpose: to measure the progress of Penny’s students at the mid-point in the

semester. It was level-appropriate and incorporated the content of the course. Penny was

pleased with the revealing results, and we are confident that it helped her identify the

strengths and weaknesses of her class.


References

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford

University Press.

Bailey, K. M., & Curtis, A. (2015). Learning about language assessment: Dilemmas, decisions,

and directions. Boston, MA: National Geographic Learning.

Brown, J.D. (2005). Testing in language programs: A comprehensive guide to English language

assessment. New York, NY: McGraw-Hill.

Hatch, E.M., & Farhady, H. (1982). Research design and statistics for applied linguistics.

Rowley, MA: Newbury House.

Mertler, C.A. (2003). Classroom Assessment: A practical guide for educators. Los Angeles, CA:

Pyrczak Publishers.

Oller, J. W. (1979). Language tests at school. London: Longman Group.

Turner, J. (2014). Using statistics in small-scale research: Focus on non-parametric data. New

York, NY: Routledge.
