
VISUAL FEATURE LEARNING

A Dissertation Presented

by

JUSTUS H. PIATER

Submitted to the Graduate School of the
University of Massachusetts Amherst in partial fulfillment
of the requirements for the degree of

DOCTOR OF PHILOSOPHY

February 2001 (revised June 14, 2001)

Department of Computer Science

© Copyright by Justus H. Piater 2001

All Rights Reserved

VISUAL FEATURE LEARNING

A Dissertation Presented

by

JUSTUS H. PIATER

Approved as to style and content by:

Roderic A. Grupen, Chair

Edward M. Riseman, Member

Andrew H. Fagg, Member

Neil E. Berthier, Member

James F. Kurose, Department Chair
Department of Computer Science

For my daughter Zoe, born 3 days and 18 hours after my dissertation defense,
her siblings, Bennett and Talia, and my wife Petra.

I will love you all forever!

ACKNOWLEDGMENTS

Few substantial pieces of work are the achievement of a single person. This dissertation is no exception. Many people have contributed to its development. First of all, this endeavor would have been impossible without the continual support of my wife Petra. She walked many extra miles and backed me during times of highly concentrated effort. I love you! I also thank my three children, who – one after the other – emerged on the scene during my five years of graduate studies at the University of Massachusetts. They always wanted me to be home with them all day. Forgive me for having a job! Bennett in particular has unknowingly made his contributions to this research. The way he learned to recognize hand-drawn shapes during his early years was very inspiring.

I am indebted to my advisor Rod Grupen for many inspiring discussions, unshakable confidence, his vision for task-driven learning by interaction, and for simply being a great guy. Thank you for inviting me into your lab! The first ideas about visual learning grew out of my earlier work with Ed Riseman. He has been a constant source of encouragement, and his keen observations and suggestions have greatly improved the quality of this presentation. Great credit is due to Andy Fagg, who has reviewed the ideas and their presentation with unparalleled scrutiny, and suggested many insightful improvements. Of course, I am to blame for any remaining errors. Thanks to their valuable feedback, this dissertation turned into a picture book that includes 336 individual graphics files in 62 figures.

I am grateful to Paul Utgoff for sharing his insight during numerous discussions of machine learning issues that have shaped my thinking forever. He introduced me to the Kolmogorov-Smirnov distance that serves several pivotal roles in this work. I deeply regret that we never got around to playing music together! Special thanks go to the psychologists Rachel Clifton and Neil Berthier, who introduced me to the world of human perceptual development and learning. This had a strong impact on my research interests, and has greatly enhanced the interdisciplinary aspects of this dissertation. Their sharp insights have added great value to my studies.

On the more mundane side, I acknowledge the helpful support of Hugin A/S. Hey Danes, neighbors! They provided me with a low-cost Ph.D. license of their Bayesian network library for Solaris, and later with a Linux port. This library was used in the implementation of the system described in Chapters 4 and 5. I also want to use this opportunity to express my appreciation of the Free Software / Open Source community, whose efforts have produced some of the best software in existence. Without Linux and many GNU tools, my life would be considerably harder. I wonder how people ever lived without Emacs? Thanks to the excellent LaTeX 2ε class file by my fellow graduate student John Ridgway (that builds on code contributed by Rod Grupen, and others), this document looks about as nice as our Graduate School allows ...

Many friends and colleagues helped make my time at the UMass Computer Science department very pleasant. The Laboratory of Perceptual Robotics is a wonderful bunch of creative people. Thanks to Elizeth for reminding me of my own birthday so the lab would get their well-deserved cake. She and her co-veterans Jefferson and Manfred take most of the credit for making LPR a great place to hang out. Their successors in the lab now have to start over. And thanks to my wife for baking those cakes. I think they will be remembered longer than I. Thanks to my South-American and Italian friends for restoring my sanity by dragging me away from my desk and onto the soccer field. I apologize to my fellow grads from LPR, ANW, and EKSL, as my floods of long-running experiments left them with few clock cycles on our shared compute cluster. A big thank-you goes to my special friends and brothers in Christ, Craig the scientist and Achim the musician, for being there for me in my joys and sorrows. How empty would life be without true friends!

Most of all, I want to thank God for calling our world into being. It is so wonderfully rich and complex that no one but You will ever fully understand it. Exploring the why's and how's of physics and life makes for a fulfilled professional life. The more I learn about Your creation, the more I am in awe. My life belongs to You.

O Lord, how manifold are Your works!
In wisdom You have made them all.

Psalm 104:24

ABSTRACT

VISUAL FEATURE LEARNING

FEBRUARY 2001 (REVISED JUNE 14, 2001)

JUSTUS H. PIATER

Dipl.-Inform., UNIVERSITY OF MAGDEBURG, GERMANY

M.Sc., UNIVERSITY OF MASSACHUSETTS AMHERST

Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Roderic A. Grupen

Humans learn robust and efficient strategies for visual tasks through interaction with their environment. In contrast, most current computer vision systems have no such learning capabilities. Motivated by insights from psychology and neurobiology, I combine machine learning and computer vision techniques to develop algorithms for visual learning in open-ended tasks. Learning is incremental and makes only weak assumptions about the task environment.

I begin by introducing an infinite feature space that contains combinations of local edge and texture signatures not unlike those represented in the human visual cortex. Such features can express distinctions over a wide range of specificity or generality. The learning objective is to select a small number of highly useful features from this space in a task-driven manner. Features are learned by general-to-specific random sampling. This is illustrated on two different tasks, for which I give very similar learning algorithms based on the same principles and the same feature space.

The first system incrementally learns to discriminate visual scenes. Whenever it fails to recognize a scene, new features are sought that improve discrimination. Highly distinctive features are incorporated into dynamically updated Bayesian network classifiers. Even after all recognition errors have been eliminated, the system can continue to learn better features, resembling mechanisms underlying human visual expertise. This tends to improve classification accuracy on independent test images, while reducing the number of features used for recognition.

In the second task, the visual system learns to anticipate useful hand configurations for a haptically-guided dextrous robotic grasping system, much like humans do when they pre-shape their hand during a reach. Visual features are learned that correlate reliably with the orientation of the hand. A finger configuration is recommended based on the expected grasp quality achieved by each configuration.

The results demonstrate how a largely uncommitted visual system can adapt and specialize to solve particular visual tasks. Such visual learning systems have great potential in application scenarios that are hard to model in advance, e.g. autonomous robots operating in natural environments. Moreover, this dissertation contributes to our understanding of human visual learning by providing a computational model of task-driven development of feature detectors.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ..... v
ABSTRACT ..... vii
LIST OF TABLES ..... xiii
LIST OF FIGURES ..... xv

CHAPTER

1. INTRODUCTION ..... 1
   1.1 Closed and Open Task Domains ..... 1
   1.2 Scope ..... 2
   1.3 Outline ..... 3

2. THIS WORK IN PERSPECTIVE ..... 5
   2.1 Human Visual Skill Learning ..... 5
   2.2 Machine Perceptual Skill Learning ..... 6
   2.3 Objective ..... 8
   2.4 Motivation ..... 10

3. AN UNBOUNDED FEATURE SPACE ..... 11
   3.1 Related Work ..... 11
   3.2 Objective ..... 12
   3.3 Background ..... 13
       3.3.1 Gaussian-Derivative Filters ..... 14
       3.3.2 Steerable Filters ..... 15
       3.3.3 Scale-Space Theory ..... 16
   3.4 Primitive Features ..... 19
       3.4.1 Edgels ..... 20
       3.4.2 Texels ..... 20
       3.4.3 Salient Points ..... 21
       3.4.4 Feature Responses and the Value of a Feature ..... 24
   3.5 Compound Features ..... 25
       3.5.1 Geometric Composition ..... 25
       3.5.2 Boolean Composition ..... 26
   3.6 Discussion ..... 27

4. LEARNING DISTINCTIVE FEATURES ..... 29
   4.1 Related Work ..... 29
   4.2 Objective ..... 30
   4.3 Background ..... 32
       4.3.1 Classification Using Bayesian Networks ..... 32
       4.3.2 Kolmogorov-Smirnov Distance ..... 34
   4.4 Feature Learning ..... 34
       4.4.1 Classifier ..... 35
       4.4.2 Recognition ..... 38
       4.4.3 Feature Learning Algorithm ..... 40
       4.4.4 Impact of a New Feature ..... 44
   4.5 Experiments ..... 45
       4.5.1 The COIL Task ..... 46
       4.5.2 The Plym Task ..... 52
       4.5.3 The Mel Task ..... 56
   4.6 Discussion ..... 59

5. EXPERT LEARNING ..... 63
   5.1 Related Work ..... 63
   5.2 Objective ..... 64
   5.3 Expert Learning Algorithm ..... 64
   5.4 Experiments ..... 65
       5.4.1 The COIL Task ..... 66
       5.4.2 The Plym Task ..... 74
       5.4.3 The Mel Task ..... 75
       5.4.4 Incremental Tasks ..... 82
       5.4.5 Computational Demands ..... 94
   5.5 Discussion ..... 95

6. LEARNING FEATURES FOR GRASP PRE-SHAPING ..... 97
   6.1 Related Work ..... 97
   6.2 Objective ..... 97
   6.3 Background ..... 98
       6.3.1 Haptically-Guided Grasping ..... 98
       6.3.2 The von Mises Distribution ..... 100
   6.4 Feature Learning ..... 100
       6.4.1 Fitting a Parametric Orientation Model ..... 103
       6.4.2 Selecting Object-Specific Data Points ..... 104
       6.4.3 Predicting the Quality of a Grasp ..... 105
       6.4.4 Feature Learning Algorithm ..... 106
   6.5 Experiments ..... 108
       6.5.1 Training Visual Models ..... 109
       6.5.2 Using Visual Models to Cue Haptic Grasping ..... 110
   6.6 Discussion ..... 114

7. CONCLUSIONS ..... 117
   7.1 Summary of Contributions ..... 117
   7.2 Future Directions ..... 119
       7.2.1 Primitive Features and their Composition ..... 119
       7.2.2 Higher-Level Features ..... 120
       7.2.3 Efficient Feature Search ..... 120
       7.2.4 Redundancy ..... 120
       7.2.5 Integrating Visual Skills ..... 121
   7.3 Uncommitted Learning in Open Environments ..... 121

APPENDIX: EXPECTATION-MAXIMIZATION FOR VON MISES MIXTURES ..... 123

BIBLIOGRAPHY ..... 125

LIST OF TABLES

1.1  Typical characteristics of closed vs. open task domains ..... 2
2.1  Summary of important characteristics of the two applications discussed in this dissertation ..... 9
4.1  Selecting true and mistaken classes using posterior class probabilities ..... 42
4.2  Summary of empirical results on all three tasks ..... 47
4.3  Results on the COIL task ..... 48
4.4  Results on the COIL task: Confusion matrices of ambiguous recognitions ..... 49
4.5  Results on the Plym task ..... 55
4.6  Results on the Plym task: Confusion matrices of ambiguous recognitions ..... 56
4.7  Results on the Mel task ..... 58
4.8  Results on the Mel task: Confusion matrices of ambiguous recognitions ..... 59
5.1  Summary of empirical results after expert learning on all tasks ..... 67
5.2  Results on the COIL task after Expert Learning ..... 71
5.3  Results on the COIL task after Expert Learning: Confusion matrices of ambiguous recognitions ..... 74
5.4  Results on the Plym task after Expert Learning ..... 77
5.5  Results on the Plym task after Expert Learning: Confusion matrices of ambiguous recognitions ..... 78
5.6  Results on a two-class Plym subtask after 15-fold Expert Learning ..... 78
5.7  Results on the Mel task after Expert Learning ..... 81
5.8  Results on the Mel task after Expert Learning: Confusion matrices of ambiguous recognitions ..... 82
5.9  Results on a two-class Mel subtask after 12-fold Expert Learning ..... 83

LIST OF FIGURES

2.1  A general model of incremental feature learning ..... 8
3.1  “Eigen-faces” – eigenvectors corresponding to the eight largest eigenvalues of a set of face images ..... 11
3.2  Visualization of a two-dimensional Gaussian function and some oriented derivatives (cf. Equation 3.3) ..... 14
3.3  Example manipulations with first-derivative images ..... 16
3.4  Visualization of a linear scale space generated by smoothing with Gaussian kernels of increasing standard deviation ..... 16
3.5  Example of a scale function s(·) ..... 17
3.6  Illustration of the 200 strongest scale-space maxima of s_blob(·) ..... 18
3.7  Illustration of scale-space maxima of s_corner(·) ..... 19
3.8  Extracting edges from the projected scale space ..... 21
3.9  Salient edgels and texels extracted from an example image ..... 22
3.10 Stability of texels across viewpoints ..... 23
3.11 Matching a texel (a) across rotation and scale, without (b) and with (c) rotational normalization ..... 24
3.12 A geometric compound feature consisting of three primitive features ..... 26
4.1  Definition of the Kolmogorov-Smirnov distance ..... 34
4.2  A model of incremental discrimination learning (cf. Figure 2.1, page 8) ..... 35
4.3  A hypothetical example Bayesian network ..... 36
4.4  Highly uneven and overlapping feature value distributions ..... 37
4.5  KSD of a feature f that, if not found in an image, causes its Bayesian network to infer that its class is absent with absolute certainty ..... 45
4.6  Objects used in the COIL task ..... 46
4.7  Examples of features learned for the COIL task ..... 50
4.8  Features characteristic of obj1 located in all images of this class where they are present ..... 51
4.9  Spatial distribution of feature responses of selected features characteristic of obj1 (left columns), for comparison also shown for obj4 (right columns) ..... 53
4.10 Objects used in the Plym task ..... 54
4.11 Examples of features learned for the Plym task ..... 54
4.12 All images used in the Mel task ..... 57
4.13 Examples of features learned for the Mel task ..... 60
5.1  Expert Learning results on the COIL task ..... 68
5.2  Examples of expert features learned for the COIL task, Fold 1 ..... 70
5.3  Expert features characteristic of obj1 located in all images of this class where they are present ..... 72
5.4  Spatial distribution of expert feature responses of all features characteristic of obj1 learned in Fold 1 (left columns), for comparison also shown for obj4 (right columns) ..... 73
5.5  Variation due to randomness in the learning procedure ..... 75
5.6  Expert Learning results on the Plym task ..... 76
5.7  Examples of expert features learned for the Plym task, Fold 1 ..... 79
5.8  Expert Learning results on the Mel task ..... 80
5.9  Examples of expert features learned for the Mel task, Fold 1 ..... 83
5.10 Objects used in the 10-object COIL task ..... 85
5.11 Incremental Expert Learning results on the COIL-inc task, Fold 1 ..... 86
5.12 Incremental Expert Learning results on the COIL-inc task, Fold 2 ..... 87
5.13 Numbers of features sampled during the COIL-inc task ..... 88
5.14 Incremental Expert Learning results on the Plym-inc task, Fold 1 ..... 89
5.15 Incremental Expert Learning results on the Plym-inc task, Fold 2 ..... 90
5.16 Numbers of features sampled during the Plym-inc task ..... 91
5.17 Incremental Expert Learning results on the Mel-inc task, Fold 1 ..... 92
5.18 Incremental Expert Learning results on the Mel-inc task, Fold 2 ..... 93
5.19 Number of features sampled for various tasks, plotted versus the number of classes ..... 95
6.1  Grasp synthesis as closed-loop control. (Reproduced with permission from Coelho and Grupen [21].) ..... 98
6.2  The Stanford/JPL dextrous hand performing haptically-guided closed-loop grasp synthesis ..... 99
6.3  Object wrench phase portraits traced during grasp syntheses ..... 99
6.4  A hypothetical phase portrait of a native controller c (left) and all possible context transitions (right) ..... 100
6.5  Illustration of four von Mises probability densities on a circular domain ..... 101
6.6  Scenario for learning features to recommend grasp parameters (cf. Figure 2.1, page 8) ..... 102
6.7  Definition of the hand orientation (azimuthal angle) for two- and three-fingered grasps ..... 102
6.8  By Coelho’s formulation, some objects are better grasped with two fingers (a), some with three (b), and for some this choice is unimportant (c, d) ..... 102
6.9  Left: Data points induced by a given feature on various images of an object form straight lines on a torus (two in this case); right: A mixture of two von Mises distributions was fitted to these data ..... 103
6.10 Kolmogorov-Smirnov distance KSD_f between the conditional distributions of feature response magnitudes given correct and false predictions ..... 105
6.11 Representative examples of synthesized objects and converged simulated grasp configurations using Coelho’s haptically-guided grasping system ..... 109
6.12 Example views of objects used in the grasping simulations ..... 109
6.13 Quantitative results of hand orientation prediction ..... 110
6.14 Results on two-fingered grasps of cubes (top half), and three-fingered grasps of triangular prisms (bottom half) ..... 111
6.15 Examples of features obtained by training two-fingered models on cubes and three-fingered models on triangular prisms, using clean data separated by object type (cf. Figure 6.14) ..... 112
6.16 Examples of features obtained by training two- and three-fingered models on cubes and triangular prisms, using a single data set containing relatively noise-free grasps of cubes and triangular prisms (cf. Figure 6.17) ..... 113
6.17 Grasp results on cubes and triangular prisms ..... 113

CHAPTER 1

INTRODUCTION

Humans have a remarkable ability to act reasonably based on perceptual information about their environment. Our perceptual system functions with such speed and reliability that we are deluded into underestimating the complexity of everyday perceptual tasks. In particular, humans rely heavily on visual perception. We orient ourselves, recognize environments, objects, and people, and manipulate items based on vision without ever thinking about it. Given the importance of vision to humans, it is not surprising that vision has been the most-studied mode of machine perception since the early days of artificial intelligence. Nevertheless, despite fifty years of active research in artificial intelligence, robotics and computer vision, many real-world visuomotor tasks remain that are easily performed by humans but are still unsolved by machines. The robustness and versatility of biological sensorimotor interaction cannot yet be matched in robotic systems.

What is it that enables higher animals, first and foremost humans, to outperform machines so dramatically on real-world visuomotor tasks? I believe that the answer is grounded in the following two theses that form the basis of this dissertation:

• The human visual system is adaptive. During the first years of life, the visual capabilities of children increase dramatically. These capabilities are not limited by the design of the visual system alone, but are modified by learning. All throughout life, the human visual system continues to improve performance on both novel and well-practiced tasks.

  In contrast, most current machine vision systems do not learn in this way. They are designed to perform, and their performance is limited by the design. They do not usually improve over time or adapt to novel situations unforeseen by the designer.

• The human visual system is inextricably linked with human activity. Activity operates in synergy with vision and facilitates visual learning, and vision subserves activity. Human vision operates in a highly task-dependent way and is tuned to deliver exactly the information needed.

  In contrast, most research in computer vision has focused on task-independent visual functionality. In a typical scenario, a computer vision system produces a generic result or representation for use by subsequent processing stages.

Both of these points will be further discussed in Chapter 2. The remainder of this opening chapter serves to define the scope and organization of this dissertation.

1.1 Closed and Open Task Domains

The field of computer vision is commonly subdivided into low-level and high-level vision. Low-level vision is typically concerned with task-independent image analysis such as edge extraction or computation of disparity or optical flow. High-level vision considers application-level problems, e.g. object recognition or vision-guided grasping. Considerable progress has been made in both areas during the past decade. For example, machine recognition systems have achieved unprecedented levels of performance [73, 104, 71, 77, 78]. Increasingly impressive recognition results are reported on large databases of various objects. Automated character recognition systems are involved in sorting most of the U.S. mail. Optical biometric identification systems have reached commercial maturity.

While these successful systems are truly remarkable, most of them are designed for tasks that are limited in scope and well defined at design time. For instance, most object recognition systems operate on a fixed set of known objects. Many algorithms in fact require access to a complete set of training images during a dedicated training phase. Deployed OCR and biometric identification systems operate under highly controlled conditions where parameters such as size, location, and approximate appearance of the visual target are known. Similar arguments can be made for other computer vision problems such as face detection, terrain classification, or vision-guided navigation.

Task domains that share these characteristics I call closed. I argue that many practical vision problems are not closed. For instance, a human activity recognition system should be able to operate under many different lighting conditions and in a variety of contexts; indoors or outdoors, with any number of people in the scene. A visually navigated mobile robot should be able to learn distinctive landmarks by itself. If the robot is moved from one environment to another, one does not want to redesign the recognition algorithm – the same algorithm should be applicable in, and adaptive to, a variety of environments. An autonomous robot that traverses unknown terrain or collects specimens should be able to learn to predict the effect of its actions based on perceptual information in order to improve its actions with growing experience. Ultimately, it should be the interaction of an agent with its environment – as opposed to a supervisory training signal – that drives the formation of perceptual capabilities while performing a task [67, 114].

Table 1.1. Typical characteristics of closed vs. open task domains.

                      Closed Tasks               Open Tasks
  Task parameters:    all known at the outset    some to be discovered
                      stationary                 may be non-stationary
  Training data:      fixed                      dynamic, generative
                      fully accessible           partially accessible via interaction
  Learning:           batch                      incremental
                      off-line                   on-line
  Visual features:    may be fixed               must be learned

Thus, many realistic visual problems constitute open task domains (see Table 1.1). Closed and open tasks constitute two extremes along a continuum of task characteristics. Open tasks are characterized by parameters that are unknown at design time and that may even change over time. Therefore, the perceptual system of the artificial agent cannot be completely specified at the outset, but must be refined through learning during interaction with the environment. There is no fixed set of training data, complete or otherwise, that could be used to train the system off-line. Training information is available in small amounts at a time through interaction of the agent with its environment. Therefore, learning must be on-line and incremental.

1.2 Scope

Autonomous robots that perform nontrivial sensorimotor tasks in the real world must be able to learn in both sensory and motor domains. A well-established research community is addressing issues in motor learning, which has resulted in learning algorithms that allow an artificial agent to improve its actions based on sensory feedback. Little work is being conducted in sensory learning, here understood as the problem of improving an agent's perceptual skills with growing experience. The sophistication of perceptual capabilities must ultimately be measured in terms of their value to the agent in executing its task.


This dissertation addresses a subset of the problems outlined above. At a broad level, its goal is to make progress toward computational models for perceptual learning in open task domains. While most work in computer vision has focused on closed tasks, the following chapters present learning methods for open tasks. These methods are designed to be very general. They make few prior assumptions about the tasks, and can learn incrementally and on-line. I hope that this work will spark new research aimed at expanding the scope of machine perception and autonomous robots to increasingly open task domains.

A key unit of visual information employed by biological and machine vision systems is a feature. Loosely speaking, a visual feature is a representation of some aspect of local appearance, e.g. a corner formed by two intensity edges, a spatially localized texture signature, or color. Most current feature-based machine vision systems employ hand-crafted feature sets. In the context of open tasks, I argue that learning must take place even at the level of visual feature extraction. Any fixed, finite feature set would constrain the range of tasks that can be learned by the agent. This does not necessarily mean that the features cannot be computed by a fixed operator set. If they are, then these operators must be suitably parameterized to provide for the flexibility and adaptability demanded by a variety of open tasks.

The key technical contribution of this dissertation consists of methods for learning visual features in support of specific visual or visuomotor tasks. These features capture local appearance properties and are sampled from an infinite feature space. Most parameters of the system are derived on-line using probabilistic methods. The validity of the proposed methods is demonstrated using two very different example skills, one based on categorization (visual discrimination) and the other on regression (visual support of haptically-guided grasping). The sampling methods for feature generation and the associated model-fitting techniques can be adapted to other visual and non-visual perceptual tasks.

This work constitutes exploratory research in an area where computer vision and machine learning meet. This area has received relatively little attention from these two subdisciplines of computer science, which remain relatively distinct even though both grew out of the artificial intelligence community. While the research questions are addressed primarily from the perspective of computer vision and machine learning, much of the motivation is drawn from observations in psychology and visual neuroscience. Parts of the model account for relevant aspects of the function or performance of the human visual system. The behavior of the model resembles critical aspects of phenomena in human learning.

1.3 Outline

The next chapter places this work into the context of research in psychology and artificial intelligence and discusses the motivation, goals and means against the backdrop of related research within and outside of computer science. Chapters 3–6 constitute the heart of this dissertation. These technical chapters share the same general structure: A review of related work is followed by a precise problem statement, a presentation of the contributed solution, experimental results where applicable, and a discussion. In these chapters, a section titled “Background” briefly introduces prerequisite concepts, terminology, and notation.

• Chapter 3 is self-contained and defines an infinite feature space designed to overcome the limitations of finite feature sets for open learning tasks. Features are constructed hierarchically by composing primitive features in specific ways. The learning algorithms discussed in subsequent chapters are based on this feature space.

• Chapter 4 describes a system for learning features in an open recognition task, building on the feature space defined in the preceding chapter. To reinforce the point of learning the features themselves, the algorithm is biased to find few but highly distinctive features. Probabilistic pattern classification is combined with information-theoretic feature selection.

• Chapter 5 extends the previous chapter by introducing the concept of learned visual expertise in a way that resembles human expert visual behavior. In both cases, a visual expert exhibits faster and more reliable recognition performance than a non-expert. It is conjectured that superior features underlie expertise, and a method is introduced for continuing to learn improved features with growing experience by the learning system.

• Chapter 6 describes a system for learning features to support a haptically-guided robotic grasping process. Features are learned that enable reliable initializations of hand configurations before the onset of the grasp, resembling human reach-and-grasp behavior. This chapter again builds on the feature space introduced in Chapter 3, but is otherwise self-contained.

Finally, Chapter 7 concludes with a general discussion of the impact of this work and future directions.


CHAPTER 2

THIS WORK IN PERSPECTIVE

The questions addressed in this dissertation span a wide range of disciplines in- and outside of computer science. This chapter places these questions into the context of other research in perceptual skill learning in psychology and artificial intelligence. Motivated by this interdisciplinary background, I pose a general research challenge that is larger than the scope of this dissertation. Finally, I state some important personal preferences and biases – many of which are motivated by insights from psychology and neurobiology – that influenced many of the design choices presented in the subsequent technical chapters.

2.1 Human Visual Skill Learning

The human visual system is truly remarkable. It routinely solves a wide variety of visual tasks with a reliability and deceptive ease that belittle their actual difficulty. These spectacular capabilities appear to rest on at least two foundations. First, the human brain devotes enormous computational resources to vision: About half of our brain is more or less directly involved in processing visual information [54]. Second, essential visual skills are learned in a long process that extends throughout the first years of an individual's life. At the lowest level, the formation of receptive fields of neurons along the early visual pathway is likely influenced by retinal stimulation. Some visual functions do not develop at all without adequate perceptual stimulation within a sensitive period during maturation, e.g. stereo vision [12, 44]. Higher-order visual functions such as pattern discrimination capabilities are also subject to a developmental schedule [40]:

• Neonates can distinguish certain patterns, apparently based on statistical features such as spatial intensity variance or contour density [103].

• Infants begin to note simple coarse-level geometric relationships, but perform poorly in the presence of distracting cues. They do not consistently pay attention to contours and shapes [101].

• At the age of about two years, children begin to discover fine-grained details and higher-order geometric relationships. However, attention is still limited to “salient” features [123].

• Over much of childhood, humans learn to discover distinctive features even if they are overshadowed by more salient distractors.

There is growing evidence that even adults learn new features when faced with a novel recognition task [107]. In a typical experiment, subjects are presented with computer-generated renderings of unfamiliar objects that fall into categories based on specifically designed but unobvious features. Before and after learning the categorization, the subjects are asked to delineate what they perceive to be characteristic features of the shapes. Before learning, the subjects show little agreement in their perception of the features. However, after learning most subjects point out those features that by design characterize the categories [108, 123].


Schyns and Rodet demonstrated convincingly that humans learn features in task-driven and task-dependent ways [109]. Subjects were presented with three categories of “Martian cells”, two-dimensional grayscale patterns that loosely resemble biological cells containing intracellular particles. The first category was characterized by a feature X, the second by a feature Y, and the third by a feature XY, which was a composite of X and Y. Subjects were divided into two groups that differed in the order that they had to learn the categories. Subjects in one group first learned to discriminate categories X and Y and then learned category XY, whereas the other group learned XY and X first, then Y. After learning the categorization, the subjects were asked to categorize other “Martian cells” that exhibited controlled variations of the diagnostic features. Their category assignments revealed the features used for categorization: Subjects of the first group learned to categorize all objects based on two features (X and Y), whereas the subjects of the second group learned three features, not realizing that XY was a compound consisting of the other two. Evidently, feature generation was driven by the recognition task.

Feature learning does not necessarily stop after learning a concept. Tanaka and Taylor [120] found that bird experts were as fast to recognize objects at the subordinate level (“robin”) as they were at the basic level (“bird”). In contrast, non-experts are consistently faster on basic-level discriminations as compared to subordinate-level discriminations. Gauthier and Tarr [38] trained novices to become experts on unfamiliar objects and obtained similar results. These findings indicate that the way experts perform recognition is qualitatively different from the way novices do. It has been suggested that experts have developed specialized features, facilitating rapid and reliable recognition in their domain of expertise [107].

Despite this accumulated evidence of visual feature learning in humans, little is known about the mechanisms of visual learning. At least, recent neurophysiological and psychological studies have shed some light on what the features represent [129]. The bulk of the evidence points to view-specific appearance-based representations in terms of local features. The view-dependence of human object recognition has been firmly established [122]. Recognition performance decreases as a function of viewpoint disparity from previously learned views. In the light of this evidence, Wallis and Bülthoff dismiss recognition theories based on geometric models – two- or three-dimensional – by declaring that “there remains little or no neurophysiological evidence for the explicit encoding of spatial relations or the representation of geon primitives” [129].

The strong viewpoint dependency of human visual representations is even more apparent in the context of spatial orientation [11]. Here, the representation employed by the brain is clearly based on a viewer-centered perceptual reference frame. Abstract spatial reasoning is a cognitive skill that requires extensive training, involving multiple perceptual modalities as well as physical activity [1, 2, 95, 97].

Few definitive statements can be made about the spatial extent of the features used by the visual brain. Nevertheless, it has been shown that even for the recognition of faces – often cited as a prime example of holistic representations – local features play a major role. Solso and McCarthy generated artificial face images by composing facial features taken from real face images. Subjects regarded these artificial faces as highly familiar if the constituent features were taken from known faces, even though the complete (artificial) faces had never been seen before [113].

2.2 Machine Perceptual Skill Learning

For the purpose of this work, machine learning is concerned with methods and algorithms that allow an artificial sensorimotor agent to improve or adapt its actions over time, based on perceptual feedback from the environment that is acted upon. This definition stresses that learning is task-driven and incremental. It excludes non-incremental mechanisms for discovering structure in data, such as many conventional classification and regression algorithms typically considered machine learning methods. Nevertheless, such algorithms may form important components of the type of learning methods of interest here.

For a sensorimotor agent, deriving the next action to be performed involves the following two steps:

1. Analyze the current set of sensory input to extract information – so-called features – suitable for action selection.

2. On the basis of these features, and possibly other state information, derive the next action to be taken.

This dichotomy is somewhat idealized, as many machine learning algorithms involve transformations of the feature set, e.g. feature selection or principal component analysis. Most work on machine learning has focused on the second step, with the goal of identifying improved methods for generalizing from experience given a set of (possibly transformed) features. In contrast, a mechanism for perceptual learning focuses on improving the extraction of features. The following paragraphs touch on a few representative examples of perceptual learning systems. Previous work related to specific methods and problems will be discussed in later chapters.

The goal of Draper et al.'s ADORE system is to learn reactive strategies for recognition of man-made objects in aerial images [29]. The task is formulated as a Markov decision process, a well-founded probabilistic framework in which a policy maps perceptual states to actions. In ADORE, an action is one of a set of image processing and computer vision operators. The output data produced by taking an action characterize the state, and the policy is built using reinforcement learning techniques [115]. ADORE learned reactive policies superior to any static sequence of operators assembled by the authors.

Steels and his collaborators investigate the problems of perceptual categorization and language acquisition by a group of communicating agents. Sensory information is available to an agent in the form of continuous-valued “streams” [114], possibly extracted from live video [27]. An agent can decide on a “topic” characterized by certain ranges of values in a subset of the sensory channels, and can invent “words” to designate this topic. The agents interact in the form of “language games” in which one of them chooses a known topic or invents a new one, and utters the corresponding word(s). A listening agent matches the utterance with its current sensory input. If the utterance does not match the listening agent's concept of the words, or if it contains unknown words, the agent refines its sensory characterization associated with the words. This is done by successively subdividing sensory ranges and choosing additional sensory channels if necessary. The sensory categorization thus formed is adequate for the current topic, but it does not necessarily match the concept of the speaker. However, over the course of many language games the agents form a shared vocabulary through which they can communicate about what they perceive.

A similarly motivated mechanism for discovering informative structure in sensory channels was used by Cohen et al. [25]. Here, concepts are formed that relate perceptions to the agent's interaction with its environment, in contrast to Steels' social communication.

In these two examples, perceptual distinctions are learned by carving up the perceptual space into meaningful regions. This can be done successfully as long as the information contained in an instantaneous perception is sufficient to make these distinctions. In many practical cases, however, world states that require different actions by the agent may not be distinguishable on the basis of an instantaneous perception. In this case, the recent history of percepts may allow the disambiguation of these perceptually aliased states. This is the basic idea underlying McCallum's Utile Distinction Memory and Utile Suffix Memory algorithms [68, 69]. In addition to resolving hidden state, his U-Tree algorithm [67] performs selective perception by testing which perceptual distinctions are meaningful.

Temporal state information is also the basis of Coelho's system for learning haptically-guided grasping strategies [22, 23]. Grasping experience is recorded as trajectories in a phase space. These trajectories are clustered and represented by parametric models that define discrete states of the grasping system during its interaction with a target object. A reinforcement learning procedure is used to select appropriate closed-loop grasp controllers at each of these states.

2.3 Objective

Draper's ADORE system [29] is a rare example of perceptual strategy learning. The other systems cited in the previous section perform perceptual learning by subdividing the feature space spatially [114, 27, 25, 67] and/or temporally [67, 22, 23], but they do not learn the features themselves.

It is often argued that multilayer neural networks learn features that are represented by the nodes of the hidden layer(s). Likewise, the basis vectors extracted by eigen-subspace decomposition and similar methods can be regarded as learned features. These methods essentially project the input space into another space, typically of lower dimension. However, they provide little control over the properties of the resulting features with respect to criteria external to the projection method. For example, in computer vision one may want to preserve locality of features or invariance to various imaging conditions. This motivates the use of other types of features that are not based on subspace decomposition, but exhibit the desired properties by design.

The central technical contribution of this dissertation is a mechanism for learning such features. These are extracted directly from the raw image data, and express image properties that are very hard to discover using general techniques for dimensionality reduction. Features are chosen from a very large feature space. The specification of this space constrains the structure of the learned features. Thus, it is possible to bias or limit the learning system to features that have specific representational or invariance properties. For example, one may want features to encode corners under rotational invariance.

Figure 2.1. A general model of incremental feature learning. (Diagram: the system senses an image, then analyzes, interprets, and acts; if the current features prove useful it proceeds to the next percept, otherwise it learns a new feature. Dashed paths (1) and (2) mark the two alternative routes back into the loop, discussed below.)

The feature learning procedure is incremental and is suitable for interactive learning. Features are learned as demanded by a given task. Figure 2.1 depicts a general model of the interaction of the system with its environment. The system starts from a blank slate, and is constrained only by the specification of the feature space and the feature learning algorithm. It operates in a perception-action cycle that begins with the acquisition of an image. This image is analyzed in order to derive a decision or action in response to the current perception. It is assumed that the system can observe the quality or usefulness of this decision or action. If this response reveals the adequacy of the action, the agent proceeds with the next perception-action loop. Otherwise, one or more new features are generated with the goal of finding one that will result in superior future actions. The dashed lines designate two alternative ways of operation:

1. If the features can be evaluated immediately without changing the environment, then the agent can iterate feature learning and decision making until a suitable feature has been found. The recognition system described in Chapter 4 operates in this way. When the system misrecognizes an image, new candidate features are generated one at a time, and are evaluated immediately by rerunning the recognition procedure on the same image. This is indicated by the lower dashed line in Figure 2.1.

2. If a feature cannot be evaluated immediately, then the agent proceeds with the next perception-action loop. In this case, the generated features are evaluated over time. In the grasping application (Chapter 6), immediate feature evaluation would require regrasping of an object. To avoid this interference with the task context, after generating new candidate features the system instead proceeds with the next object, as indicated by the upper dashed line. Evaluation of the newly sampled candidate features is distributed over many future grasps.
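
To make this control flow concrete, the following sketch outlines the perception-action loop of Figure 2.1 with the two evaluation modes just described. It is a schematic illustration only: the task interface and its methods (sense, analyze, act, evaluate, sample_feature) are hypothetical placeholders, not the implementation presented in later chapters.

```python
# Schematic sketch of the incremental feature-learning loop (Figure 2.1).
# The task object and all of its method names are hypothetical placeholders.

def feature_learning_loop(task, features, immediate_evaluation=True):
    """Run the perception-action cycle with on-demand feature learning."""
    while task.has_next():
        image = task.sense()                      # acquire an image
        decision = task.analyze(image, features)  # interpret it with current features
        feedback = task.act(decision)             # act and observe usefulness

        if feedback.adequate:
            continue                              # current features suffice

        if immediate_evaluation:
            # Mode 1 (Chapter 4): sample and re-evaluate on the same image
            # until a feature is found that repairs the error.
            while not feedback.adequate:
                candidate = task.sample_feature()  # general-to-specific sampling
                trial = task.analyze(image, features + [candidate])
                feedback = task.evaluate(trial)
                if feedback.adequate:
                    features.append(candidate)
        else:
            # Mode 2 (Chapter 6): add candidates now; their utility is assessed
            # over many future perception-action cycles.
            features.extend(task.sample_feature() for _ in range(task.n_candidates))
    return features
```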

These two applications differ significantly not only in their learning mechanisms, but also in their learning objectives. The recognition application is a supervised classification task. The utility of a feature is immediately assessed in terms of its contribution to the classification procedures. In contrast, the grasping application is primarily a regression task, where features are learned that predict a category-specific angular parameter. The utility of a feature depends on its contribution to the value of future behavior of the robot. Training is grounded in the robot's interaction with the grasped object, and does not involve an external supervisor. Nonetheless, both of these applications employ the same feature space, introduced in the following chapter, and their feature learning algorithms are based on the same basic principles. These differences and similarities are summarized in Table 2.1.

Table 2.1. Summary of important characteristics of the two applications discussed in this dissertation.

  Recognition (Ch. 4)              Grasping (Ch. 6)
  classification                   regression
  external supervisor              learning grounded in interaction
  immediate feature evaluation     features evaluated over time
            same feature space
            same principles for feature learning

All learning algorithms presented in the following chapters are designed to operate incrementally. They are very uncommitted to any specific task, and make few assumptions about the task environment. For example, the recognition algorithm does not know how many known object categories – if any – are present in any given scene. This is in contrast to most existing recognition algorithms that always assign one (or no) class label to each scene. Also, neither application has any prior knowledge about the number or nature of the categories to be learned. In short, the algorithms do not assume a closed world, which has many implications for their design. A consistent set of learning algorithms for open domains constitutes the second key contribution of this dissertation.


2.4 Motivation

In addition to the objectives stated in the preceding section, there are further, somewhat less tangible biases and motivations that influenced the design of the representations and algorithms presented in the following chapters. These are introduced briefly here.

The primary goal of this dissertation is to contribute to our understanding of mechanisms of perceptual learning in humans that can be applied in machine vision systems. Many of the ideas manifested in this work are motivated by insights from perceptual psychology. At the outset, I believe that humans (children and adults) learn visual features as demanded by tasks they encounter. Above I presented evidence collected by Schyns and others in support of this belief. A subgoal of this work is to contribute to our understanding of human vision, by devising plausible computational models that describe certain aspects of human vision. I firmly believe that we can learn a great deal from biology for the purpose of advancing technology in the service of human interests. Specifically, I believe that task-driven, on-line, incremental visual learning is essential for building sophisticated applied vision systems that operate in general, uncontrolled environments.

With this long-term objective in mind, the mechanisms contributed are applicable in scenarios where no external supervisor is available. While the basic learning framework is supervised, the supervisory signal can be produced by the agent itself. In subsequent chapters, I will show how this can be done.

The construction of the algorithms and the prototype implementation involved a number of design choices, many of them concerning inessential details. Wherever reasonable, choices were made to emulate biological vision. Some of the resulting algorithms are very efficiently implemented on massively parallel neural hardware, but are quite expensive when run on a conventional serial computer.


CHAPTER 3

AN UNBOUNDED FEATURE SPACE

The necessity for a very large feature space has been motivated briefly above. This chapter introduces a particular feature space suitable for the objectives of this work.

3.1 Related Work

The idea of representing data in terms of features from an infinite feature set (i.e., a feature space) is quite old. A natural method is Principal Component Analysis (PCA), which returns the eigenvectors of the covariance matrix of the data. These can be regarded as a small set of "features" selected from an N-dimensional continuous space, where N is the dimensionality of the data. If the data resemble a low-dimensional zero-mean Gaussian cloud in this space, then they can be faithfully reconstructed using only a few features corresponding to the largest eigenvalues. Turk and Pentland [125] popularized these eigen-features for face recognition. Figure 3.1 shows the eight principal components of a set of faces. Interestingly, some of them correspond to intuitively meaningful facial features.

Figure3.1. “Eigen-faces”– eigenvectorscorrespondingto theeightlargesteigenvaluesof asetof faceimages.The brightnessof eachpixel representsthe magnitudeof a coe� cient. Zero-valuedcoe� cientsarerenderedin intermediategray; largepositive andnegative coe� cientsinwhite andblack, respectively. (Reproducedwith permissionfrom MoghaddamandPentland[72].)

Nayar, Muraseand collaborators[74, 73, 75] developedthis generalapproachfor view-invariantobjectrecognition,trackingandsimplerobotic control. It hasbeenpointedout thatPCA is not necessarilywell suitedfor discriminationbetweenobject classes[117]. Severaladaptationsof the ideaof PCA to generatediscriminative (asopposedto descriptive) featureshave beensuggested,suchastheFisherLinearDiscriminant[31], theFukunaga-Koontztrans-form [37], andtheMostDiscriminatingFeature[117]. TalukderandCasasent[118, 119] suggestan adaptationof nonlinearPCA that simultaneouslyoptimizesthedescriptive anddiscrimina-tive power of theextractedfeatures.Sometypesof eigen-featurescanbe learnedandupdatedincrementally[130].

Hiddennodesin a neuralnetwork maybeseenascomputingfeaturesfrom aninfinite set.In particular, neuralnetworkscanimplementseveralprojectionpursuitmethodsaswell asPCAin biologicallyplausibleways[47, 48, 91]. Projectionpursuititeratively seekslow-dimensionalprojectionsof high-dimensionaldatathat maximizea given projectionindex. The projectionindex encodessomemeasureof “interestingness”of thedata,typically basedon thedeviation


from Gaussian normality [59, 36, 99]. The projections generated by a projection index based on second-order statistics correspond to the principal components of the data.

All of these methods, with the exception of certain types of neural networks, compute global features. In contrast, it is often desirable to produce local features that represent spatially localized image characteristics. Localized variants of eigen-features have been explored [26].

A variety of local features have been proposed in the context of recognition systems. In a typical approach, a feature is a vector of responses to a set of locally applied basis filter kernels. Koenderink [57] suggests that the neighborhood around an image point can be represented by a set of Gaussian derivatives of various orders, termed a local jet, derived from a local Taylor series expansion of the image intensity surface. Some related schemes are based on steerable bases of Gaussian-derivative filters [90, 104, 94, 43], which will be briefly discussed in Section 3.3.2. Others employ Gabor filters [128, 20], curved variants known as banana wavelets [58], or Haar wavelets [81] to represent local appearance.

A different way to generate an unbounded feature space is by defining a feature as a parameterized composition of several components. Cho and Dunn [18] define a corner feature by the distance and angle between two straight line segments. Their feature set is finite because these parameters are quantized.

Segen [110] was one of the first to consider an infinite combinatorial feature space in a computer vision context. His shape recognition system is based on structural compounds of local features. Geometric relations between local curvature descriptors are represented by a multilevel graph, and concept models of 2-D shapes are constructed by matching and merging multiple instances of such graphs. Califano and Mohan [16] combined triplets of local curvature descriptors to form a multidimensional index used for recognition of 2-D shapes. Their contributions include a quantitative probabilistic argument demonstrating that the discriminative power of features increases with their degrees of freedom.

Amit, Geman and their coworkers create a potentially infinite variety of features from a quantized space by means of combinatorics. Primitive local features are composed to produce increasingly complex arrangements. Primitive features are defined by co-occurrences of small numbers of binarized pixel values, and compounds are characterized by relaxed geometric relationships. Discriminative features are constructed and queried efficiently using a decision tree procedure. This type of approach has been applied to handwritten character recognition [8, 5] and face detection [7, 6] with remarkable success, and has also been extended to recognition of three-dimensional objects [50]. The approach employed in this dissertation borrows and extends key ideas from this work.

3.2 Objective

The representational power of any system that uses features to represent concepts is determined to a large extent by the available features. If the feature set is not sufficiently expressive for the task at hand, the performance of the system will be limited forever. If the feature set is too expressive, the system will be prone to a variety of problems, including overfitting, and to biases introduced through pairwise correlated features or distractor features that do not correlate well with important distinctions. Clearly, different features may be relevant for different tasks. Moreover, given a task, different feature sets may be suitable for different algorithms employed to solve the same task [52].

These considerations require that features either be hand-crafted for a particular setting (consisting of task, algorithm and data), or that they be learned by the system. This chapter defines a feature space for use with a feature learning system, subject to the following objectives:

1. As this work is concerned with visual tasks, features are computed on intensity images.


2. The feature space should not be geared toward any task, but it should contain features that are suitable for a variety of visual tasks at various levels of specificity and generality. In other words, there should be some features that are highly useful for any of a variety of tasks that the system may be asked to solve.

3. Since the characteristics of a feature learning system are best demonstrated on a system that learns few but powerful features, the feature space should contain features that are individually very useful for given tasks. This is in contrast to most existing feature-based vision systems that derive their power primarily from a large number of features that are not necessarily very powerful individually.

4. The feature space should be sufficiently complete to demonstrate success on example tasks using feature learning systems discussed in subsequent chapters.

5. Since image-plane orientation is important in some tasks and irrelevant in others, features should inherently allow the measurement of feature orientation in a way that can be exploited to achieve normalization for rotation.

6. Since in many visual tasks there is no prior scale information available, features should be robust to variations in scale to simplify computation.

7. To enhance robustness to clutter and occlusion and to facilitate the learning of categories characterized by unique object parts, features should be local in the image array.

8. For generality, features should be applicable in general classification and regression frameworks.

9. The features should be a plausible functional model of relevant aspects of the human visual system.¹

All of the methods discussed in the preceding section satisfy some of these objectives, but none of them satisfies all. Eigen-features and their variations are well suited for a variety of tasks, but rotational invariance and image-plane locality are hard to achieve. Wavelets and Gabor filters and their variants are local, but are not rotationally invariant. Filters based on steerable Gaussian-derivative filters satisfy all of the above objectives. The steerability property, to be explained in Section 3.3.2, allows the synthesis of filters and their responses at arbitrary orientations from a small set of basis filters, and can be exploited to achieve rotational invariance at very little computational cost. However, the representational and distinctive power of individual, well-localized (Objective 7) Gaussian-derivative filters is not sufficient for many tasks (Objective 3). Amit and Geman [8, 5] used relatively weak individual features, sparse binary pixel templates, and showed how to increase their power by composing them in a combinatorial fashion.

The key idea to achieving all of the above objectives is to combine steerable Gaussian-derivative filters with an adaptation of Amit and Geman's idea, producing features that consist of spatial combinations of local appearance characteristics. Before the details are presented, the following section provides some helpful background.

3.3 Background

This section presents brief introductions to relevant concepts that have been well established in the literature. Some familiarity with these concepts is required to understand the contributions presented in subsequent sections.

¹ Since this contradicts Objective 5, rotational (non-)invariance is not considered a relevant aspect here.


3.3.1 Gaussian-Derivative Filters

The isotropic zero-mean Gaussian function with variance σ² of a two-dimensional parameter space at a point x = [x, y]^T ∈ R² is defined as

    G(x, σ) = (1 / (2πσ²)) e^{−x^T x / (2σ²)}.    (3.1)

For the purposes of this work, the oriented derivative of a Gaussian of order d at orientation θ = 0, i.e. in the x direction, is written as

    G_d^0(x, σ) = (∂^d / ∂x^d) G(x, σ).    (3.2)

For general orientations θ, G_d^θ is defined as an appropriately rotated version of G_d^0(x, σ), i.e.

    G_d^θ(x, σ) = G_d^0(R_θ x, σ)    (3.3)

with a rotation matrix

    R_θ = [ cos θ   −sin θ ]
          [ sin θ    cos θ ].

Gaussian-derivative filter kernels are discrete versions of Gaussian point spread functions computed in this way. To generate a filtered version I_G(x) of an image I(x), a discrete convolution with a kernel G is performed. Two-dimensional convolutions with Gaussians and their derivatives can be computed very efficiently because the kernels are separable, and highly accurate and efficient recursive approximations exist [127]. Unfortunately, this is not generally true of rotated derivatives (Equation 3.3).

Figure 3.2. Visualization of a two-dimensional Gaussian function and some oriented derivatives (cf. Equation 3.3): the Gaussian G_0; the first-derivative basis G_1^0, G_1^{π/2}; and the second-derivative basis G_2^0, G_2^{π/3}, G_2^{2π/3}. Zero values are shown as intermediate gray, positive values are lighter, negative values darker.

Figure 3.2 illustrates some oriented Gaussian-derivative filters used in the context of this work. Gaussian filters act as low-pass filters, and their derivatives as bandpass filters. The center frequency increases (see Figure 3.2) and the bandwidth decreases as the order of the derivative increases. First-order derivative kernels respond to intensity gradients in I. To extract edge information from an image I, I is convolved with the two orthogonal basis filters G_1^0 and G_1^{π/2}. The gradient magnitudes and orientations in I are then easily computed:

    ‖∇I‖ = √( I²_{G_1^0} + I²_{G_1^{π/2}} )    (3.4)


    tan θ_{∇I} = I_{G_1^{π/2}} / I_{G_1^0}    (3.5)

When computing θ_{∇I} using Equation 3.5, the full angular range of 0–2π can be recovered by taking into account the signs of the numerator and the denominator when computing the arctangent. For a thorough discussion of Gaussian-derivative filters see a recent book chapter [93].
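As a concrete illustration of Equations 3.4 and 3.5, the following minimal sketch computes gradient magnitude and orientation from the two first-derivative basis responses using SciPy's Gaussian-derivative filtering. The scale σ = 2 and the synthetic test image are arbitrary choices for illustration, not values prescribed by the text.

    import numpy as np
    from scipy import ndimage

    def gradient_magnitude_orientation(image, sigma=2.0):
        """Combine the two first-derivative basis responses (Eqs. 3.4 and 3.5)."""
        # Axis 1 is the x (column) direction, axis 0 the y (row) direction.
        i_gx = ndimage.gaussian_filter(image, sigma, order=(0, 1))  # response to G_1^0
        i_gy = ndimage.gaussian_filter(image, sigma, order=(1, 0))  # response to G_1^{pi/2}
        magnitude = np.hypot(i_gx, i_gy)                    # Equation 3.4
        orientation = np.arctan2(i_gy, i_gx) % (2 * np.pi)  # Equation 3.5, full 0..2pi range
        return magnitude, orientation

    # Example use on a synthetic image containing a vertical step edge:
    image = np.zeros((128, 128))
    image[:, 64:] = 1.0
    mag, ori = gradient_magnitude_orientation(image)

The two-argument arctangent takes the signs of numerator and denominator into account, exactly as described above.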

3.3.2 Steerable Filters

A class of filters is steerable if a filter of a particular orientation can be synthesized as a linear combination of a set of basis filters [35]. For example, first-order Gaussian-derivative filters are steerable, since

    G_1^θ = G_1^0 cos θ + G_1^{π/2} sin θ,    (3.6)

as is easily verified. Due to the linearity of the convolution operation, an image filtered at any orientation can be computed from the corresponding basis images:

    I_{G_1^θ} = I_{G_1^0} cos θ + I_{G_1^{π/2}} sin θ,    (3.7)

which is much more efficiently computed for many different values of θ than by explicit convolution with the individual filters G_1^θ.

Gaussian-derivative filters of any order d are steerable using d + 1 basis filters G_d^{θ_{k,d}} that are equally spaced in orientation between 0 and π, i.e.,

    θ_{k,d} = kπ / (d + 1),    k = 0, ..., d.

Incidentally, Figure 3.2 shows the basis filters G_d^{θ_{k,d}} for the first two derivatives. To synthesize a filter at any given orientation θ, these basis filters are combined using the rule

    G_d^θ = Σ_{k=0}^{d} G_d^{θ_{k,d}} c_{k,d}^θ    (3.8)

where

    c_{k,1}^θ = cos(θ − kπ/2),                              k = 0, 1
    c_{k,2}^θ = (1/3) [1 + 2 cos(2(θ − kπ/3))],             k = 0, 1, 2
    c_{k,3}^θ = (1/2) [cos(θ − kπ/4) + cos(3(θ − kπ/4))],   k = 0, 1, 2, 3,

which contains Equation 3.6 as a particular case [35, 90]. Again, due to the linearity of convolution, the same operation can be performed on the filtered images, as opposed to the filters themselves. See Freeman and Adelson [35] for a thorough treatment of steerable filters.
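To make the steering rule concrete, the sketch below synthesizes first- and second-derivative responses at an arbitrary orientation from their basis responses (Equations 3.7 and 3.8). The second-derivative basis images are obtained here as directional second derivatives at the basis orientations, which is equivalent to filtering with the rotated kernels; scale and angles are illustrative values only.

    import numpy as np
    from scipy import ndimage

    def steer_first_derivative(image, theta, sigma=2.0):
        """I_{G_1^theta} from the two basis images (Equation 3.7)."""
        i_g0 = ndimage.gaussian_filter(image, sigma, order=(0, 1))   # G_1^0
        i_g90 = ndimage.gaussian_filter(image, sigma, order=(1, 0))  # G_1^{pi/2}
        return i_g0 * np.cos(theta) + i_g90 * np.sin(theta)

    def steer_second_derivative(image, theta, sigma=2.0):
        """I_{G_2^theta} from three basis images spaced pi/3 apart (Equation 3.8)."""
        i_xx = ndimage.gaussian_filter(image, sigma, order=(0, 2))
        i_yy = ndimage.gaussian_filter(image, sigma, order=(2, 0))
        i_xy = ndimage.gaussian_filter(image, sigma, order=(1, 1))
        # Basis responses at the orientations 0, pi/3 and 2*pi/3.
        basis = [i_xx * np.cos(a) ** 2 + 2 * i_xy * np.sin(a) * np.cos(a) + i_yy * np.sin(a) ** 2
                 for a in (0.0, np.pi / 3, 2 * np.pi / 3)]
        # Steering coefficients c_{k,2}^theta.
        coeffs = [(1 + 2 * np.cos(2 * (theta - k * np.pi / 3))) / 3 for k in range(3)]
        return sum(c * b for c, b in zip(coeffs, basis))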

Example 3.1 (Operations on Gaussian-filtered images)  Figure 3.3 illustrates some manipulations on filtered images that are of interest in this context. The image shown in Figure 3.3a is convolved with the kernels G_1^0 and G_1^{π/2}, resulting in the vertical and horizontal edge images shown in Figures 3.3b and 3.3c. Edge energy is computed directly from these images using Equation 3.4 (Figure 3.3d), and edge orientation using Equation 3.5 (Figure 3.3e). In the orientation image, note that the angular domain is mapped to the linear gray scale. Consequently, both black and white correspond to near-zero angles. Figure 3.3f was generated by steering the edge images using Equation 3.7. This image is identical to the one that would be obtained by convolving the original image (Figure 3.3a) with the kernel G_1^{π/4}.


Figure 3.3. Example manipulations with first-derivative images: (a) original image I; (b) I_{G_1^0}; (c) I_{G_1^{π/2}}; (d) √(I²_{G_1^0} + I²_{G_1^{π/2}}); (e) tan^{−1}(I_{G_1^{π/2}} / I_{G_1^0}); (f) I_{G_1^0} cos(π/4) + I_{G_1^{π/2}} sin(π/4). Image a shows the original image of size 128 × 128 pixels. In images b, c, and f, zero values are encoded as intermediate gray, and positive and negative values as lighter and darker shades, respectively. In image d, black represents zero. In image e, the angular range of 0–2π is mapped to the range from black to white. The derivatives were computed with σ = 2 pixels. See Example 3.1 for discussion.

3.3.3 Scale-Space Theory

A fundamental problem in computer vision is to decide on the right scale for image analysis. A typical question is: "How large are the features of interest relative to the pixel size?" In many applications, no prior scale information is available. Therefore, many computer vision systems perform their analysis over an exhaustive range of scales and integrate the results. Scale-space theory provides a sophisticated mathematical framework for multi-scale image analysis [62], and is consulted in this work to answer the above question. A (linear) scale-space representation, or simply scale space for short, of an image is the stack of images generated by increasingly blurring the original image. This process can be described by the diffusion equation [56]. In practice, a scale-space representation is computed by successive smoothing (Figure 3.4). The Gaussian kernel has been shown to be the unique smoothing operator for generating a linear scale space [34].

Figure 3.4. Visualization of a linear scale space generated by smoothing with Gaussian kernels of increasing standard deviation. The image size is 168 × 234 pixels.
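A minimal sketch of such a stack of increasingly smoothed images is given below; the number of levels and the half-octave spacing are illustrative choices (the half-octave spacing is the one adopted later in this chapter).

    import numpy as np
    from scipy import ndimage

    def linear_scale_space(image, n_levels=8, sigma0=1.0):
        """Return a list of (sigma, smoothed image) pairs at half-octave spacing."""
        sigmas = [sigma0 * np.sqrt(2) ** i for i in range(n_levels)]
        return [(s, ndimage.gaussian_filter(image, s)) for s in sigmas]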


For the purposes of this work, scale-space theory offers well-motivated ways to identify appropriate scales for image analysis at each location in an image, based on the following scale-selection principle:

"In the absence of other evidence, assume that a scale level, at which some (possibly non-linear) combination of normalized derivatives assumes a local maximum over scales, can be treated as reflecting a characteristic length of a corresponding structure in the data." [64]

Lindeberg [64] provides some theoretical justification for this principle derived from the observation that if an image is rescaled by a constant factor, then the scale at which a combination of normalized derivatives assumes a maximum is multiplied by the same factor. The choice of a specific combination of normalized derivatives – in the following denoted a scale function s(σ) – depends on the type of image structure of interest. Such structures and their associated scales can then be detected simultaneously in an image by finding local maxima of s in the scale space. Thus, there are typically multiple intrinsic scales associated with each image location. The normalization is done with respect to the operator scale σ, and ensures that the values of the scale function can be compared across scales.

Figure 3.5. Example of a scale function s(σ). The graph shows the scale function computed for 13 discrete values of σ at the center pixel of the image (121 × 121 pixels). Each local maximum defines an intrinsic scale at this point. The two intrinsic scales are illustrated by circles of corresponding radii.

Example 3.2 (Intrinsic Scale)  Figure 3.5 illustrates the computation of the intrinsic scales at an image location. The graph plots the values of s(σ) at this point, computed for discrete values of σ at half-octave spacing. The local maxima of this graph define the intrinsic scales, which are illustrated by circles of corresponding radii in the image. The stronger maximum (s(5.7) = 73.6) captures the size of the dark, solid region at the center of the sunflower, while the weaker maximum (s(22.6) = 7.1) roughly corresponds to the size of the entire flower. The maximum at the minimal value of σ = 1 is not considered a scale-space maximum because it is not enclosed between lower values of s.

To illustrate how s(σ) can be used to detect image structures of interest at their intrinsic scale, Figure 3.6 shows the 200 strongest local maxima of the same scale function s(σ) computed everywhere in the image. To generate this figure, the value of s(σ) was computed at each pixel for each value of σ. The corresponding local maxima within the resulting scale space are illustrated by circles of the corresponding radius σ at the corresponding location x. The figure clearly illustrates the correspondence of intrinsic scale with the size of local image structures, which is here dominated by sunflowers. The scale found for the large sunflower at the lower right is smaller than expected because no larger scales were considered at this point to avoid


artifacts caused by the filter kernel exceeding the image boundary. The scale function s(σ) used for these examples is defined by Equation 3.9, discussed next.

Figure 3.6. Illustration of the 200 strongest scale-space maxima of s_blob(σ). The radius of each circle corresponds to the local scale σ. See Example 3.2.

The types of structures of interest in this dissertation are blobs, corners, and edges. Blob is a commonly used term referring to a roughly circular (or ellipsoidal) image region of roughly homogeneous intensity (or texture). A suitable isotropic scale function is defined in terms of the trace of the Hessian:

    s_blob(σ) = σ² | I_{G_2^0} + I_{G_2^{π/2}} |    (3.9)

where the scale is given by σ, which is the scale (standard deviation) of the filters G_2^0 and G_2^{π/2}. The normalization factor, σ², ensures that the scale-selection principle holds [64], and the absolute value is taken to remove sensitivity to the sign of the Hessian. This scale function is rotation invariant, as are the other two mentioned below. Figure 3.6, described in Example 3.2, illustrates the image property described by a blob.

A commonly used measure for corner detection is the curvature of level curves in the intensity data combined with the gradient magnitude. Lindeberg [64] recommends the product of the level curve curvature and the gradient magnitude raised to the third power, which leads to the scale function

    s_corner(σ) = σ⁴ | I²_{G_1^{π/2}} I_{G_2^0} − 2 I_{G_1^0} I_{G_1^{π/2}} I_{G_{x,y}} + I²_{G_1^0} I_{G_2^{π/2}} |    (3.10)

where G_{x,y} = ∂²/(∂x ∂y) G. Again, the normalization factor σ⁴ ensures that the scale-selection principle holds, and the absolute value removes sensitivity to the sign. Figure 3.7 shows how the scale-space maxima detected by s_corner cluster at corners, and also occur at high-contrast edges.


Figure 3.7. Illustration of scale-space maxima of s_corner(σ). The left image shows the strongest 15 scale-space maxima, and the right image the strongest 80. The radius of each circle corresponds to the local scale σ.

For intensity edges, a useful scale function is based on the first spatial derivative [63]:

    s_edge(σ) = σ ( I²_{G_1^0} + I²_{G_1^{π/2}} )    (3.11)

The normalization factor again ensures the scale-selection principle. The normalization factors for blobs and corners as formulated above are chosen such that the recovered scales correspond to the size of the local image structure, and will thus typically be related to the size of objects depicted in the image. In contrast, the normalization factor for edges is chosen to reflect the sharpness of the edge. Sharp edges attain their scale-space maxima at smaller scales than blurry edges. For the purposes of this dissertation, the reason for this choice of scale normalization is that it ensures perfect localization of the recovered edges at the maxima in the gradient magnitude along the gradient direction.
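To make the scale-selection machinery concrete, the following sketch evaluates the blob scale function of Equation 3.9 over a set of scales and picks out the intrinsic scales at a single pixel, in the spirit of Example 3.2. It recomputes full filtered images for each scale, which is wasteful but keeps the sketch short; the scale range is an arbitrary choice.

    import numpy as np
    from scipy import ndimage

    def s_blob(image, sigma):
        """sigma^2 * |trace of the Hessian| (Equation 3.9), over the whole image."""
        i_xx = ndimage.gaussian_filter(image, sigma, order=(0, 2))
        i_yy = ndimage.gaussian_filter(image, sigma, order=(2, 0))
        return sigma ** 2 * np.abs(i_xx + i_yy)

    def intrinsic_scales(image, point, sigmas):
        """Local maxima of s_blob over sigma at one (row, col) location."""
        values = [s_blob(image, s)[point] for s in sigmas]
        return [(sigmas[i], values[i]) for i in range(1, len(sigmas) - 1)
                if values[i] > values[i - 1] and values[i] > values[i + 1]]

    # Example: half-octave scales from 1 to 16 pixels.
    sigmas = [np.sqrt(2) ** i for i in range(9)]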

As noted above, blob and corner features are identified as local maxima of the scale function in the scale space. To extract edges in scale space, gradient-parallel maxima are extracted from the scale space according to the notion of non-maximum suppression [17]. This results in thin and mostly contiguous edges. Since the details of scale-space edge extraction are unimportant in the context of this work, the interested reader is referred to the literature [63].

3.4 Primitive Features

Sections 3.1 and 3.2 discussed important properties and various ways to generate global and local features. Since eigen-features have a number of desirable properties, it is worth considering how their primary drawback – their global spatial extent – can be overcome. Colin de Verdiere and Crowley investigated this question and developed a recognition system based on local descriptors derived by PCA [26]. Many of the most important eigen-features generated by their system resemble oriented bandpass filters. This is no accident: If natural images are decomposed into linearly independent (but not necessarily orthogonal) basis functions, using a sparseness or minimum-description-length objective that favors representations with many vanishing coefficients, then basis functions emerge that resemble oriented bandpass filters. Notably, these are strongly localized – even without an explicit constraint or bias toward locality [80, 91]. This result has been well established in the literature, and serves as a natural explanation for the


shape of receptive fields in the mammalian early visual pathway. The receptive fields of visual neurons in the primary visual cortex have often been modeled by oriented derivatives of Gaussian functions [132, 57].

In this dissertation, these insights motivate the choice of Gaussian-derivative filters as a fundamental representation of local appearance. Since principled and biologically plausible feature extraction mechanisms tend to result in features resembling Gaussian derivatives, they are used here directly, avoiding the added effort of eigen-subspace decomposition. Moreover, Gaussians and their derivatives have a number of useful properties; those that are important for this work were introduced in Section 3.3.

The feature space considered in this work is defined recursively: A feature from this space is either a primitive feature, or a compound feature. Compound features consist of other compound features and/or primitive features, and are discussed in Section 3.5. A primitive feature is defined by a vector of responses of Gaussian-derivative filters all computed at the same image location x. The following sections describe two types of primitive features.

3.4.1 Edgels

Edges are a fundamental concept in vision, deemed important by both machine and biological vision research. Edges describe aspects of shape in terms of intensity boundaries. Here, individual edge elements – so-called edgels – are considered in isolation, without attempting to join them into contours. An edgel is defined by the 2-vector

    f_edgel(x) = [ I_{G_1^0}(x),  I_{G_1^{π/2}}(x) ]^T

where the filters G are computed at the intrinsic scale σ_x that maximizes s_edge(σ) at the image location x. An edgel encodes the orientation and magnitude of the local intensity gradient at the intrinsic scale. Note that the gradient orientation is always numerically stable at a scale-space maximum, because the numerator and denominator in Equation 3.5 cannot both be close to zero at a maximum of the scale function s_edge.

3.4.2 Texels

Edgels are a weak descriptor of the local intensity surface in that they are defined by merely two parameters. It may often be desirable to have a more expressive local descriptor that uses more parameters and captures more structure than just the local intensity gradient. Intuitively, this structure corresponds to a local pattern or texture, which motivates the term texel (for texture element). A texel f_texel is defined by a vector containing the steerable bases of the first n_d Gaussian derivatives at n_s scales. One of these scales is the intrinsic scale that maximizes s_blob(σ) or s_corner(σ) at an image location.

Besides the choice of the scale function, an edgel is simply a texel with n_d = n_s = 1. The motivation for using texels with several derivatives and scales is that the additional parameters represent more information about local image structure (or texture), increasing the specificity of the feature [90]. This increased specificity may or may not be desirable in a given task. Therefore, both edgels and texels are included in the feature space, and it is up to the feature learning algorithms (Chapters 4 and 6) to use any or both of these primitive feature types.

This work consistently uses texels composed of the first two derivatives at three scales, yielding a total of (2 + 3) × 3 = 15 filter responses. The middle scale is the intrinsic scale at that image location, and the other two scales are spaced half an octave below and above. Derivatives of an order greater than about two or three rarely respond significantly in practice because of their narrow bandpass characteristics. Therefore, they contribute little to the specificity of a feature. The specificity can arbitrarily be increased by adding scales (within the limits imposed


by the pixel and image sizes). However, adding larger scales reduces the locality of the feature. Also, good generalization properties are usually expected of a feature. This choice of parameters constitutes a compromise between expressive power and locality of the filter in space and scale, and has been found useful in other applications [93].
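The sketch below assembles such a 15-dimensional texel response vector: the steerable bases of the first two derivatives (2 + 3 filters) at three half-octave-spaced scales, sampled at one pixel. The point and its intrinsic scale are assumed to come from the salient-point detection of Section 3.4.3; the second-derivative basis responses are again obtained as directional second derivatives, which is equivalent to the rotated filters.

    import numpy as np
    from scipy import ndimage

    def texel_response(image, point, intrinsic_sigma):
        """Responses of the first- and second-derivative bases at three scales."""
        scales = [intrinsic_sigma / np.sqrt(2), intrinsic_sigma, intrinsic_sigma * np.sqrt(2)]
        responses = []
        for s in scales:
            # First-derivative basis: orientations 0 and pi/2.
            responses.append(ndimage.gaussian_filter(image, s, order=(0, 1))[point])
            responses.append(ndimage.gaussian_filter(image, s, order=(1, 0))[point])
            # Second-derivative basis: orientations 0, pi/3 and 2*pi/3.
            i_xx = ndimage.gaussian_filter(image, s, order=(0, 2))[point]
            i_yy = ndimage.gaussian_filter(image, s, order=(2, 0))[point]
            i_xy = ndimage.gaussian_filter(image, s, order=(1, 1))[point]
            for a in (0.0, np.pi / 3, 2 * np.pi / 3):
                responses.append(i_xx * np.cos(a) ** 2
                                 + 2 * i_xy * np.sin(a) * np.cos(a)
                                 + i_yy * np.sin(a) ** 2)
        return np.array(responses)  # (2 + 3) filters x 3 scales = 15 values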

The orientation of a texel is defined in terms of the two first-derivative responses. At what scale should these responses be measured? This is a difficult question because the edge energy at the intrinsic scale as determined according to s_blob and s_corner may be very low, rendering the texel orientation unreliable. This was pointed out by Chomat et al. [19], who recommended that the orientation of blobs be defined using the scale that maximizes s_edge at this location. However, this scale may be very different than the texel scale, resulting in an orientation that corresponds to an image structure other than that described by the texel. In particular, this can be a problem in the presence of non-rigid objects or substantial clutter in the scene. Most of the imagery considered in this dissertation shows rigid objects and no clutter. Chomat et al.'s empirical results and my own informal experiments confirmed the clear superiority of their method, which is therefore applied throughout this work.

3.4.3 Salient Points

In principle, edgel and texel features can be computed at any location in an image. In practice, it is often useful to concentrate on salient image locations. For a given scale function s(σ), any local maximum within the three-dimensional scale space defines a salient point and its intrinsic scale σ at which this maximum is attained (cf. Example 3.2 on page 17).

In order to identify scale-space maxima for edges, the scale space of s_edge is first projected onto the image plane by taking the maximum of s_edge over the scales σ at each pixel. The resulting image specifies the gradient magnitude, orientation, and intrinsic scale for each pixel. Then, gradient-parallel maxima are extracted on this 2-D plane in the conventional fashion, resulting in well-localized and mostly contiguous contour segments.

Figure 3.8. Extracting edges from the projected scale space: (a) intrinsic scales; (b) gradient orientations; (c) gradient magnitudes; (d) extracted edges. See Example 3.3 for explanation.

Example 3.3 (Extracting Edges from the Projected Scale Space)  Figure 3.8a shows the intrinsic scale corresponding to each pixel location. At each pixel, the gray value encodes the scale at which s_edge is maximized at that image location. Bright shades correspond to large scales. High-contrast regions in the original image (cf. Figure 3.3a, page 16) clearly correspond to smaller scales. The blocky artifacts are boundary effects; filters are not applied near the image boundary where the filter kernel exceeds the image. Edges (Figure 3.8d) are extracted as gradient-parallel maxima from the image shown in Figure 3.8c, containing at each pixel the strongest edge energy across all scales, attained at the scale shown in Figure 3.8a, using the corresponding gradients shown in Figure 3.8b.

For texels, local maxima are identified in each of the two scale spaces generated by s_blob and s_corner. These local maxima are then projected onto the image plane. On those relatively rare occasions where two local maxima project to the same image location, the maximum corresponding to the larger scale is shifted slightly. The set of salient texel points is then given by the union of the projected maxima of both scale spaces. As a result, each salient image point has exactly one intrinsic scale (Figure 3.9).

689 edgels; 130 texels (54 blobs, 76 corners)
430 edgels; 77 texels (30 blobs, 47 corners)

Figure 3.9. Salient edgels and texels extracted from an example image. On the left, each dark line segment corresponds to an edgel (i.e., a scale-space maximum of s_edge) located at the midpoint of the line, with scale identical to half the length of the line. On the right, square markers correspond to scale-space maxima of s_corner, and circular markers to scale-space maxima of s_blob. Collectively, these are the salient texels. The orientations of these are illustrated by the center lines of each marker. The radius of each marker is equal to the local scale σ. See Example 3.4 for discussion.

Scale spaces are computed at half-octave spacing, i.e., σ_{i+1} = σ_i √2. This choice provides sufficient resolution at moderate computational cost [93]. The minimum plausible scale is related to the pixel size, and the maximum is limited by the size of the image. This work uses scales in the range between one pixel and one quarter of the lesser image dimension.
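A two-line sketch of this scale set, assuming the half-octave spacing just described and a rectangular image shape:

    import numpy as np

    def scale_set(image_shape):
        """Half-octave scales from one pixel up to a quarter of the lesser dimension."""
        sigma_max = min(image_shape) / 4.0
        n = int(np.floor(np.log(sigma_max) / np.log(np.sqrt(2)))) + 1
        return [np.sqrt(2) ** i for i in range(n)]  # 1, 1.41, 2, 2.83, ...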

Example 3.4 (Salient Edgels and Texels)  Figure 3.9 illustrates the salient edgels and texels extracted from an example image. The radius of each edgel and texel marker shown in the figure is equal to the local scale σ. Circles and squares correspond to blobs and corners, respectively. The bottom row illustrates the degree of robustness of the salient points with respect to rotation and scale. These images were obtained by rotating the original duck image by 60 degrees and scaling it by a factor of 0.7 and subsampling, but are shown enlarged to match the original size to facilitate visual comparison. Naturally, some amount of detail is lost through the subsampling, which accounts for the reduced number of salient points.

Most edgels are located at high-contrast image locations and are extracted at a very small scale. In addition, some edgels were extracted at much larger scales, capturing large-scale image structures as opposed to pixel-level localized gradients. Likewise, texels cluster around high-contrast image regions. While edgel scales reflect the smoothness of the local edge, texel scales correspond to the size of the image structure. Note, for example, the large corner behind the neck, or the small blob at the tip of the tail. All of these are located at local maxima in the scale function; there was no threshold on the prominence of these local maxima. Many of the weaker maxima tend to occur in clusters along intensity edges, especially at smaller scales. As the figure illustrates, such texels are less robust to image transformations than texels at more pronounced locations, e.g. the blob centered at the eye, the large corner behind the neck, the small blob at the tip of the tail, or the blob at the chest underneath the chin.

Figure 3.10 illustrates the extent to which texels are stable across viewpoint changes. Again, those texels that correspond to pronounced locations tend to be more stable. Texels located at accidental maxima along intensity edges are reliably present, but their precise number and location varies across viewpoints.

Figure 3.10. Stability of texels across viewpoints. Adjacent viewpoints differ by 10 degrees. See Example 3.4 for discussion.


3.4.4 Feature Responses and the Value of a Feature

As introduced in Sections 3.4.1 and 3.4.2, a specific instance f(p) of a primitive feature is defined by the numerical values comprising the feature response vector, typically generated by computing the applicable filter responses at a location p in some image. To find occurrences of f(p) in an image I, the response strength or value f_{f(p)}(x) of f(p) is measured at each point x in I. To measure the feature value at a location x, the feature response vector f(x) is computed at x, using the local intrinsic scale σ_x. The feature value is then given by the normalized inner product of the observed response f(x) with the reference response f(p):

    f_{f(p)}(x) = max( 0,  f(p)^T f(x) / (‖f(p)‖ ‖f(x)‖) )    (3.12)

Negative values are considered as weak as orthogonal response vectors, and are therefore cut off at zero. Hence, 0 ≤ f_{f(p)} ≤ 1. The normalized inner product is pleasing in that it returns the cosine of the angle between the vectors in question which, for edgels, is identical to the cosine of the angle between the edgels in the image plane: f_{f(p)}(x) = max(0, cos(θ_p − θ_x)). Moreover, the feature value is invariant to linear changes in image intensity.

If the feature value is to be computed regardless of the orientation of a feature in the image plane, the measured feature response vector f(x) is first steered to match the orientation of the pattern vector f(p) before Equation 3.12 is applied. This is done using a function f̃(x, p) that applies the steering equation (3.8, page 15) to all components of f(x), using θ = θ_p − θ_x. Thus, Equation 3.12 becomes

    f̃_{f(p)}(x) = max( 0,  f(p)^T f̃(x, p) / (‖f(p)‖ ‖f̃(x, p)‖) )    (3.13)

for rotationally invariant feature value computation. Note that this rotationally-invariant matching implies that f̃_f is always equal to one if f is an edgel. Hence, individual edgels carry no information. Rotation-invariant edgels are only useful as part of a compound feature, as introduced in the following section.

Figure 3.11. Matching a texel (a) across rotation and scale, without (b) and with (c) rotational normalization: (a) f(p); (b) f(x*); (c) f(x*_t). See Example 3.5 for explanation.

Example 3.5 (Localizing a Feature)  Figure 3.11a shows one of the salient texels extracted from the familiar duck image (cf. Figure 3.9). Figures 3.11b and 3.11c show the same rotated and scaled version as used in Figure 3.9. Let p = [71, 49]^T, which is the location of the texel shown in Figure 3.11a. Then, f(p) is the feature response vector defining this texel feature. We now want to find this feature in the rotated and scaled image. To do this without regard to feature orientation, the location x* that maximizes f_{f(p)}(x) is identified using Equation 3.12. The salient point x* that maximizes Equation 3.12 is that shown in Figure 3.11b. Here, f_{f(p)}(x*) = 0.956. It is no accident that this texel has the same orientation as the pattern feature f(p) shown in Figure 3.11a, but the two locations do not correspond (even though they are spatially close to one another in this example). The true corresponding feature is shown in Figure 3.11c; let us call this location x*_t. Disregarding the difference in orientation, f_{f(p)}(x*_t) = 0.578 < f_{f(p)}(x*). If at each location x the feature response vector f(x) is first steered to match the orientation of f(p) using Equation 3.13, then the correct location is found with f̃_{f(p)}(x*_t) = 0.996 (Figure 3.11c). In contrast, the value at x* increases only marginally from f_{f(p)}(x*) = 0.956 to f̃_{f(p)}(x*) = 0.960, remaining below the value at the correct location x*_t.

In all parts of this work, feature values are computed with rotational normalization using Equation 3.13. From now on, the tilde will be omitted for simplicity, and the notation f_f will be used with the understanding that feature response vectors are always steered to achieve rotational invariance.

A feature f(p) is present in an image I to the degree

    f_{f(p)}(I) = max_{x ∈ I} f_{f(p)}(x).    (3.14)

This computation is quite expensive, as it involves the measurement of the intrinsic scale and the computation of f(x) and f_{f(p)}(x) at each location in the image. A large proportion of this computation is going to be wasted because most image regions will not contain features of practical relevance. Therefore, the assumption is made that all features of practical relevance are located at salient points. To identify the salient points it is still necessary to evaluate s(σ) over the entire range of scales σ everywhere in the image, which constitutes a highly parallelizable operation. Moreover, if the application provides a fixed set of training images, this computation can be performed off-line. In practice, this leads to a great reduction in computation time and space, as the maximum in Equation 3.14 is only taken over the salient points rather than the entire image.
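A minimal sketch of Equations 3.12 and 3.14 (without the rotational steering of Equation 3.13) follows: the value of a feature at a point is the normalized inner product cut off at zero, and its value in an image is the maximum over the salient points. The dictionary mapping salient points to their response vectors is an assumed data structure, not something defined in the text.

    import numpy as np

    def feature_value_at(f_p, f_x):
        """Equation 3.12: normalized inner product, cut off at zero."""
        denom = np.linalg.norm(f_p) * np.linalg.norm(f_x)
        return max(0.0, float(np.dot(f_p, f_x)) / denom) if denom > 0 else 0.0

    def feature_value_in_image(f_p, salient_responses):
        """Equation 3.14: maximum feature value over the salient points."""
        return max(feature_value_at(f_p, f_x) for f_x in salient_responses.values())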

3.5 Compound Features

A core objective of this work is to demonstrate that features can be learned that are highly meaningful individually. The limited descriptive power of individual primitive features is overcome by composing two or more primitive features into compound features. Since compound features contain more parameters and cover a larger image area than do primitive features, they provide a potentially more specific and robust description of relevant aspects of appearance. What it means for a feature to be relevant, and how relevant features are sought, depends on the given task. Chapters 4 and 6 present two illustrative examples. This section describes three complementary ways in which primitive features can be composed into compound features.

There is evidence that the mammalian visual cortex also composes simple and generic features hierarchically into more complex and specific features. The primary visual cortex contains retinotopic arrays of local and oriented receptive fields [45] which have been modeled as Gaussian-derivative functions [132, 57]. Many neurons found in higher-order visual cortices are tuned to more complex stimuli. Riesenhuber and Poggio [96] proposed a hierarchical model of recognition in the visual cortex, using spatial combinations of view-tuned units not unlike the compound features introduced below.

3.5.1 Geometric Composition

The structure of a geometric compound of n_f ≥ 2 primitive features f_0, f_1, ..., f_{n_f−1} (Figure 3.12) is defined by the attitudes α_i of each subfeature, and the angles Δθ_i, distances d_i, and scale ratios Δσ_i between the constituent subfeatures f_i. The distances d_i are given relative to the intrinsic


scale σ_{i−1} at point i − 1. Each primitive feature is either an edgel or a texel. A specific instance f_geom of a geometric feature with a given structure is defined by the concatenation of the vectors comprising the primitive subfeatures. Its orientation is denoted by θ, which is the orientation of the reference point.

Figure 3.12. A geometric compound feature consisting of three primitive features: a reference point f_0 and two subfeatures f_1 and f_2 placed at distances d_1 and d_2 from it.

The location of a feature is defined by the location of the reference point. To extract the filter response vector defining a geometric feature f(x) at location x, the following procedure is used:

Algorithm 3.6 (Extraction of a Geometric-Compound Feature Response Vector)

1. The response vector f_0(x) is computed at x. This defines the orientation θ_x and intrinsic scale σ_x of f(x). Initially, let i = 1.

2. If i = n_f, then stop.

3. The location x_i, orientation θ_i, and scale σ_i of the next constituent subfeature f_i are computed:

       x_i = x_{i−1} + σ_{i−1} d_i [cos(θ_{i−1} + α_i), sin(θ_{i−1} + α_i)]^T
       θ_i = θ_{i−1} + Δθ_i
       σ_i = σ_{i−1} Δσ_i

4. The feature response vector f_i(x_i) is computed at scale σ_i, and is rotated to orientation θ_i using the steering equation 3.8 (page 15) with angular argument θ = Δθ_i. Increment i, and continue at step 2.

The filter response vector is given by the concatenation of the individual response vectors of the constituent primitive features f(x_i), i = 0, 1, ..., n_f − 1.

The value f_{f_geom}(I) of a geometric compound feature f_geom in an image I is computed in the same way as for a primitive feature (Equations 3.13 and 3.14), using the concatenated response vectors. This involves running Algorithm 3.6 at each of the n_sal applicable salient image locations. These are either all salient edgels or all salient texels, depending on the type of the first subfeature f_0 of f_geom. It is easily seen that the time required to compute f_{f_geom} is proportional to n_sal n_f.
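A schematic rendering of Algorithm 3.6 is sketched below. The helper extract_primitive(image, x, sigma, theta), which measures a primitive (edgel or texel) response vector at a given location, scale and orientation, is hypothetical and stands in for the machinery of Section 3.4; the structural parameters follow the notation of Section 3.5.1.

    import numpy as np

    def geometric_response(image, x0, theta0, sigma0, structure, extract_primitive):
        """structure: list of (d_i, alpha_i, dtheta_i, dsigma_i), i = 1 .. n_f - 1."""
        x, theta, sigma = np.asarray(x0, dtype=float), theta0, sigma0
        parts = [extract_primitive(image, x, sigma, theta)]       # step 1: f_0 at x
        for d_i, alpha_i, dtheta_i, dsigma_i in structure:         # steps 2-4
            x = x + sigma * d_i * np.array([np.cos(theta + alpha_i),
                                            np.sin(theta + alpha_i)])
            theta = theta + dtheta_i
            sigma = sigma * dsigma_i
            parts.append(extract_primitive(image, x, sigma, theta))
        return np.concatenate(parts)  # concatenated response vector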

3.5.2 Boolean Composition

A Boolean compound feature consists of exactly two subfeatures, each of which can be a primitive or a compound feature. The relative locations of the subfeatures are unimportant. Two types of Boolean features have been defined:


- The value of a conjunctive feature is given by the product of the values of the two subfeatures.

- The value of a disjunctive feature is given by the maximum of the values of the two subfeatures.

For conjunctive features, the product is used instead of the minimum because it is natural to assign a higher value to a feature if the larger of the two component values is increased even further. For example, the value f_f of a conjunctive feature with subfeature values f_{f_1} = 0.6 and f_{f_2} = 0.9 should be larger than for a feature with subfeature values f_{f_1} = 0.6 and f_{f_2} = 0.7.

Feature values of Boolean compound features are measured recursively by first measuring the values of each of the two subfeatures individually, and then combining their values using the product or the maximum, respectively. Hence, the time required to compute the value f_{f_bool}(I) of a Boolean compound feature f_bool in an image I is proportional to n_{sal,0} t_0(I) + n_{sal,1} t_1(I), where n_{sal,i} denotes the number of salient image points applicable to subfeature f_i, and t_i(I) denotes the time required to compute the value f_{f_i}(I) of subfeature f_i in the image. Note that each subfeature of a Boolean compound feature can be any type of feature, including primitives, or geometric, conjunctive or disjunctive compounds.
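Since Boolean compounds nest arbitrarily, their value computation is naturally recursive. A minimal sketch, assuming each leaf is a callable that returns the value of a primitive or geometric feature in the image:

    def boolean_value(node, image):
        """Recursive value of a Boolean compound ("and" = product, "or" = maximum)."""
        kind, left, right = node  # e.g. ("and", f1, f2), where f1, f2 are leaves or nodes
        v_left = left(image) if callable(left) else boolean_value(left, image)
        v_right = right(image) if callable(right) else boolean_value(right, image)
        return v_left * v_right if kind == "and" else max(v_left, v_right)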

3.6 Discussion

Section 3.2 formulated a list of objectives (page 12) to be met by the feature design. All of these are addressed by the compositional feature space:

1. Features are computed on intensity images.

2. The feature space was designed without a specific task in mind. Features can be composed to arbitrary structural complexity to increase their specificity.

3. Therefore, it seems justified to hope that powerful individual features are contained in this feature space. The extent to which this goal was achieved will be discussed in Chapters 4–6.

4. Edgels and texels are complementary and capture important aspects of image structure. Geometric and Boolean compositions are also complementary and have intuitive semantics. Subsequent chapters will illustrate possibilities and limitations of this feature space for practical tasks.

5. Feature orientations are easily computed and compensated for using the steerability property of the filters employed.

6. The features are computed at the intrinsic scale observed in the images. Therefore, they are useful for scale-invariant image analysis.

7. The features are local, since the filters employed have local support. The locality of geometric features can be controlled by placing constraints on the distances between primitives and the number of primitives allowed.

8. Features return a scalar value, which makes them applicable in general classification and regression frameworks. The orientation attribute of a feature can also be used for regression, as will be seen in Chapter 6.

9. The biological relevance of this feature space was briefly discussed in the introductions to Sections 3.4 and 3.5.


A critical property of this compositional feature space is that it is partially ordered by the structural complexity of a feature, according to the number of primitives contained in a compound. Also, the three compositional rules are ordered by their expressive power. A disjunctive compound can be regarded as more complex than a conjunctive compound in that it can express a strictly larger class of concepts. For example, a hierarchy of disjunctions of sufficiently complex geometric features can memorize almost any arbitrary set of images. Conjunctive features are more expressive than geometric features because the former can be composed out of the latter, but not vice versa.

This orderly structure will play an important role in the following chapters, where algorithms are described for learning features from this space. These algorithms create features in a simple-to-complex manner, in the sense that features containing few primitives are preferred over features with many primitives, and geometric composition is preferred over conjunctive composition, which in turn is preferred over disjunctive composition. The underlying assumption is that simpler features generally provide for better generalization, or in other words, that the "real" structure underlying a task tends to be expressible by simple features.

In summary, all design goals laid out in Section 3.2 are addressed by the feature space defined in this chapter. It is unique in its hierarchical and parametric organization that facilitates simple-to-complex search for highly distinctive, local features in an infinite space. Because of these key properties it is well suited for open tasks, as argued in Chapters 1 and 2. The following chapters will demonstrate how important visual skills can be learned on the basis of this feature space.


CHAPTER 4

LEARNING DISTINCTIVE FEATURES

This chapter introduces the first of two feature learning applications that build on the feature space introduced in Chapter 3. The task is to learn to distinguish visual contexts by discovering few but highly distinctive features. The scenario addressed in this chapter is substantially less constrained than in most conventional approaches to visual recognition. The learning system does not assume a closed world. Learning is incremental, and new target object classes can be added at any time. An image presented to the system may show any number of target objects, including none. No explicit assumptions are made about the presence or absence of occlusions or clutter, and no object segmentation is performed. Consequently, this chapter does not set out to improve on the recognition rates achieved by today's more specialized recognition systems. Rather, the goal is to demonstrate that a highly uncommitted system, starting from a nearly blank slate, can learn useful recognition skills.

4.1 Related Work

The general problem of visual recognition is too broad and underconstrained to be addressed comprehensively by today's artificial systems. Therefore, the following three subproblems have been identified in the literature that are more easily addressed separately by specialized solutions:

Specific Object Recognition. In a typical scenario, a test image depicts exactly one target object (with or without occlusion and/or clutter). The task is to recognize that object as one of several specific objects [46, 73, 104, 71], e.g. a particular toy or a particular landmark. The appearance or shape of the objects in the database are precisely known. Naturally, most work concentrates on rigid objects, with the notable exception of face recognition.

Generic Object Classification. The scenario is the same as in Specific Object Recognition, but here the task is to label the target object as belonging to one of several known object categories or classes [133, 111]. The individual objects within a class may vary in appearance. A class ("car", "chair") can be defined in various ways, e.g. by object prototypes, by abstract descriptions, or by models containing enough detail to discriminate between classes, but not enough to distinguish between objects within a class.

Object Detection. Test images contain various items and typically show natural (indoor or outdoor) scenes. The task is to localize regions of the image that depict instances of a target object or class of objects, e.g. faces or people [81, 7, 6]. Object detection is closely related to fundamental problems in image retrieval [92].

The requirements and difficulties differ between these subproblems. Therefore, many existing recognition systems are designed to solve only one of them. Moreover, some practical recognition problems further specialize these categories. For example, face recognition [125] is a distinct subcategory of specific object recognition that must cope with gross non-rigid transformations that are hard to model mathematically. Optical character recognition is a distinct


subcategory of generic object recognition [8], etc. Other practical recognition problems are often addressed in related ways. For example, Weng et al. [131] treat indoor landmark recognition as a simplified object recognition problem without generalization across size or pose. Riseman et al. [100] represent outdoor landmarks by line models and recognize them by many-to-many model matching in two and three dimensions. Very few promising attempts have been made to develop systems that are applicable to more than one of these subproblems. Among these, most systems employ features that explicitly allow certain image variations [128, 5, 77, 78].

For visual recognition, two major classes of techniques can be identified:

Model-based methods employ geometric models of the target objects for feature extraction or alignment [46, 100].

Appearance-based or model-free methods extract features directly from example images without explicitly modeling shape properties of the target objects. Various types of features have been employed, e.g. principal components computed from the image space [73, 125], local image statistics [116, 104, 71], locally computed templates [128], or primitive local shape descriptors such as oriented edges or corners [104, 71].

Hybrid techniques have also been proposed in which appearance-based methods served as indexing mechanisms to reduce the number of candidate object models [55]. In fact, it appears that this was historically the primary motivation for appearance-based methods. Only later was their full potential discovered, and systems began to emerge that omitted the model-matching phase and relied on appearance-based methods alone.

Most appearance-based visual recognition systems express recognition as a pattern classification problem, and many employ standard pattern recognition algorithms [128, 71, 8, 81]. Also very common in computer vision, custom classifiers are designed for specific recognition paradigms. For example, Swain and Ballard [116] and Schiele and Crowley [104] proposed methods for comparing histograms. Turk and Pentland [125] and Murase and Nayar [73] find nearest neighbors in eigen-subspaces. Fisher and MacKirdy [32] use a perceptron to learn the weights for a weighted multivariate cross-correlation procedure.

A large variety of features have been proposed (see also Section 3.1). Those that relate to this work fall into two broad categories: eigen-subspace methods that operate in image space [125, 48, 73, 131, 119], and locally computed image properties [128, 116, 104, 71, 5, 32, 66]. Eigen-subspace methods are well suited for learning highly descriptive [125, 73, 119] or discriminative features [131, 119]. Locally computed features are typically defined a priori [116, 104, 71], or they are derived to be descriptive of specific object classes [128, 32, 6, 33, 66]. Few approaches exist that attempt to derive discriminative local features [81, 8, 5].

4.2 Objective

Object recognition has reached a remarkable maturity in recent years. The problem of finding and recognizing specific objects under controlled conditions can almost be considered solved, even in the presence of significant clutter and occlusion. Generic object recognition is considerably harder because the within-class variability in object appearance can be arbitrarily large. If the within-class variability is large in comparison to the between-class variability, it can be very difficult to find feature sets that cleanly separate the object classes in feature space. This problem has not yet received much systematic attention. Essentially all current object recognition and detection systems operate on the assumption that the within-class appearance variability is small compared to the between-class variability, i.e., objects within a class appear more similar (to a human observer) than objects from distinct classes. Consequently, researchers choose feature sets that reflect our intuitive notion of visual similarity.

In practical tasks, this assumption does not always hold. I believe that this constitutes one of the key difficulties in generic object recognition: In practice, meaningful object classes are often


defined not by their appearance, but by some other similarity metric, e.g. functional properties. A classical example is that of chairs: A chair is something that affords sitting. In a given context, anything that has this functional property is a chair. However, chairs can look quite dissimilar. It is very hard to specify a generic appearance model of a chair for object recognition purposes.

In contrast, I argue that humans recognize chairs on the basis of their experience that objects that afford sitting tend to exhibit certain visual properties, e.g. a roughly horizontal surface of appropriate size and height. It is our experience that created the association between function and specific aspects of appearance. None of the work cited above – arguably with the exception of Papageorgiou and Poggio's wavelet-based trainable object detection system [81] – considers object classes that arise from an application context other than subjective similarity of appearance.

A corollary of the necessity of learning object models from experience is that learning must be incremental. Some of the recognition algorithms mentioned above can be altered to learn incrementally [130, 81], but none of these authors noted incremental learning as a desirable property of machine vision systems.

I perceive the distinction between specific and generic object recognition and detection as rather artificial. It is the necessity to define problems clearly and to design testable and comparable algorithms that has produced this taxonomy of approaches, rather than well-established principles of vision. The same basic mechanism should produce and refine features at various levels of distinction, as demanded by a task. Features should express just the level of specificity or generality required along a continuum between specific and generic recognition [122, 121]. Likewise, it is hard to think of visual search or object detection without doing some recognition of the detected objects at the same time. Ultimately, we need a uniform framework for visual learning that addresses these subtasks in a coherent fashion.

Building on the feature space introduced in Chapter 3, this chapter presents a framework for incremental learning of features for visual discrimination. The task scenario is not restricted to specific objects, landmarks, etc. A design goal is to be able to learn a wide variety of perceivable distinctions between visual contexts that may be important in a given task. A visual context may correspond to an individual object, a category of objects, or any other visually distinguishable entity or category. In agreement with pattern recognition terminology, these distinct entities or categories will be referred to as classes. The following properties distinguish this framework from most conventional recognition schemes (see also Table 1.1 on page 2):

- At the outset, the number and nature of the classes to be distinguished are unknown to the learner.

- Learning is performed incrementally on a stream of training images. No fixed set of training data is assumed. Only a small number of training images need to be accessed at any one time.

- There is no distinction between training and test phases. Learning takes place on-line, during execution of an interactive task.

- There can be any number of target objects in a scene, not just one. Prior to recognition, this number is unknown to the system.

- In principle, there is no distinction between specific recognition, generic recognition, and detection. The unified framework encompasses each of these types of problems.

- Prior to learning, the algorithms and representations are largely uncommitted to any particular task. The objective is to learn distinctive local features.

Definition 4.1 (Distinctive Feature)  A feature is distinctive if it has both of the following properties:


- It is characteristic of a particular class or collection of classes. These classes are called the characterized classes. Objects of a characterized class are likely to exhibit this feature.

- It is discriminative with respect to another class or disjoint collection of classes. Objects of such a discriminated class are unlikely to exhibit this feature.

Taken together, these properties allow one to argue that if a given distinctive feature is found in a scene, an instance of the characterized class(es) is likely to be present in the scene. This reasoning forms the basis for the feature learning algorithm presented in Section 4.4. The goal is to show that useful recognition performance can be achieved by an uncommitted system that learns distinctive features. To reinforce this point, the system starts out with a blank slate, without any task-specific tuning, and learns a small number of features to solve a task.

It is worth pointing out that, strictly speaking, neither of these two characteristics is biologically correct: Firstly, the human visual system indeed shows very little hard-wired specialization and draws much of its power from its versatility and adaptability. However, it is also true that infant visual learning is guided by highly structured developmental processes. Secondly, we are capable of learning individual, distinctive high-level features, e.g. corresponding to object parts. On the other hand, the robustness of our visual capabilities is largely due to the redundancy of a large number of low-level features such as line segments, corners and other descriptors of local image structure represented by neurons along the visual pathway between the primary visual cortex and the inferotemporal cortex.

4.3 Background

The framework for incremental learning of discriminative features, introduced in the following section, builds on two important existing concepts – Bayesian networks and the Kolmogorov-Smirnov distance – that are briefly described here.

4.3.1 ClassificationUsingBayesianNetworks

In the context of this work, classification refers to the process of assigning a class label to an instance described by a feature vector. The elements of a feature vector may be scalar or categorical values, whereas a class label is always categorical. An abstract device that performs classification is called a classifier. Here, a classifier is trained inductively, i.e., by generalization from training examples. There are many well-established frameworks for classification, e.g. decision trees, neural networks, nearest-neighbor methods, and support vector machines.

Bayesian networks constitute a rather general graphical framework for representing and reasoning under uncertainty. A Bayesian network encodes a joint probability distribution. Each node represents a random variable, here assumed to be discrete. The network structure specifies a set of conditional independence statements: The variable represented by a node is conditionally independent of its non-descendants in the graph, given the values of the variables represented by its parent nodes. In more intuitive terms, an arc indicates direct probabilistic influence of one variable on another. Each node X has an associated table specifying the conditional probabilities of the value of X given the values of its parents. In other words, the conditional probability table of X contains the values of P(X = x | A = a, B = b, ...), where A, B, ... are the parents of X in the graph, and x, a, b, ... each denote a particular value – commonly called a state – of the corresponding random variable.

Bayesian networks permit the computation of many interesting pieces of information. Most importantly, one can infer the probabilistic value of some variables, given the values of some other variables. This involves the following two essential steps:


Instantiation of evidence: The values of some variables are set to their observed values. These values may themselves be (unconditional) probability distributions over the possible values of these variables.

Propagation of evidence: Using the conditional probability tables, the posterior probabilities of neighboring nodes are iteratively computed, generally until some nodes of interest – target variables – are reached. These nodes now contain the posterior probabilities of their values given all evidence entered into the net, also called beliefs: BEL(X = x) = P(X = x | A = a, B = b, ...), where A, B, ... include all nodes in the net that contain evidence [83]. If evidence is propagated exhaustively throughout the entire network, the net is said to be in equilibrium.

Mathematical methods exist that allow these two steps to be freely intermixed. In general, exact inference is NP-hard. In practice however, due to the sparseness of most practical networks, inference is performed very quickly for up to a few tens or hundreds of nodes.

Often it is useful to know the influence of one variable on another. The influence of variable X on variable Y is conveniently defined by the mutual information [112] between X and Y, that is, the potential of knowing X to reduce uncertainty regarding the value of Y:

    I(Y; X) = H(Y) − H(Y | X)                                          (4.1)

where¹ the uncertainty in Y is given by the entropy function

    H(Y) = −∑_y P(y) log P(y)

and

    H(Y | X) = ∑_x H(Y | x) P(x)
             = −∑_{x,y} P(y | x) P(x) log P(y | x)
             = −∑_{x,y} P(y, x) log P(y | x)

is the expected residual uncertainty given that the value of X is known. Expanding the conditional probabilities and using belief notation, Equation 4.1 becomes

    I(Y; X) = ∑_{x,y} BEL(y, x) log [ BEL(y, x) / ( BEL(y) BEL(x) ) ]   (4.2)

To compute I(Y; X), note that BEL(y, x) = BEL(x | y) BEL(y). One can compute BEL(x | y) by temporarily setting Y to value y, propagating beliefs through the network to establish equilibrium, and using the resulting posterior probabilities at node X as the BEL(x | y) [83, 98]. The unconditional beliefs BEL(y) and BEL(x) are available at the respective nodes prior to application of this procedure.
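To make Equation 4.2 concrete, the following sketch computes the mutual information between two discrete variables from their joint and marginal beliefs. The dictionary-based representation and the variable names are illustrative assumptions of this sketch, not part of the system described in this chapter.

```python
import math

def mutual_information(bel_joint, bel_y, bel_x):
    """Mutual information I(Y; X) as in Equation 4.2.

    bel_joint[(y, x)] holds BEL(y, x); bel_y[y] and bel_x[x] hold the
    marginal beliefs.  Joint states with zero belief contribute nothing.
    """
    info = 0.0
    for (y, x), p_yx in bel_joint.items():
        if p_yx > 0.0:
            info += p_yx * math.log(p_yx / (bel_y[y] * bel_x[x]))
    return info

# Hypothetical two-state example: a class variable Y and a feature variable X.
bel_joint = {("present", "found"): 0.4, ("present", "not found"): 0.1,
             ("absent", "found"): 0.1, ("absent", "not found"): 0.4}
bel_y = {"present": 0.5, "absent": 0.5}
bel_x = {"found": 0.5, "not found": 0.5}
print(mutual_information(bel_joint, bel_y, bel_x))  # positive: knowing X reduces uncertainty in Y
```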

The theory of Bayesian networks has been best developed for discrete random variables. With some restrictions, inference is also possible with so-called conditional Gaussian distributions [60], and the theory is constantly being advanced. For more information on Bayesian networks, excellent texts are available [83, 51].

A naive Bayesian classifier is easily represented using a Bayesian network. It has the topology of a star, with a node representing the class random variable in the center, and arcs going out to each feature variable. Intuitively, each arc specifies that the presence of a class gives rise to the feature pointed to by the arc. A general Bayesian classifier models dependencies between features by inserting arcs between them, where applicable.

¹ The notation I, conventionally used to denote mutual information, is not to be confused with the use of I for an image intensity function introduced in Section 3.3.1. In the following, the context will always indicate the referent of I.


4.3.2 Kolmogorov-Smirnoff Distance

Let X be a continuous random variable, and let p(x | a) and p(x | b) describe the conditional probability densities of X under two conditions a and b. Suppose one is asked to partition the domain of X into two bins using a single cutpoint θ such that most values of X that fall into one bin are generated under condition a, and most values in the other bin under condition b. A Bayes-optimal value of θ is one that minimizes the probability of any given instance x ending up in the wrong bin.

Figure 4.1. Definition of the Kolmogorov-Smirnoff distance. (The plot shows the cumulative probability, from 0 to 1, of X under conditions a and b over the domain of X; KSD_{a,b}(X) is the maximum vertical separation of the two curves, attained at the cutpoint θ.)

Let cdf(x | a) and cdf(x | b) denote the cumulative conditional densities corresponding to p(x | a) and p(x | b), respectively. The Kolmogorov-Smirnoff distance or KSD for variable X under conditions a and b is defined as

    KSD_{a,b}(X) = max_θ | cdf(θ | a) − cdf(θ | b) |                    (4.3)

(see Figure 4.1). A value of θ maximizing this expression constitutes a Bayes-optimal cutpoint in the above sense, if the prior probabilities of a given instance x having been generated under conditions a and b are equal.

In practice, the cumulative distributions are often not available in analytic form. Fortunately, it is easy to estimate these from sampled data. Given a particular cutpoint θ, simply count the data samples that fall into each bin, and normalize the counts to convert them into cumulative probabilities [126]. To find an optimal θ, this procedure is performed for all candidate cutpoints that include all midpoints between adjacent values in the sorted sequence of the available data.
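As a sketch of this procedure, and under the assumption that the feature values observed under the two conditions are available as plain Python lists, the following function estimates the KSD of Equation 4.3 and returns the maximizing cutpoint. The function name and the use of midpoints of the pooled values are illustrative choices, not the dissertation's implementation.

```python
def ksd_and_cutpoint(values_a, values_b):
    """Empirical Kolmogorov-Smirnoff distance between two samples.

    Candidate cutpoints are the midpoints between adjacent values of the
    pooled, sorted data; the cutpoint maximizing |cdf(t|a) - cdf(t|b)| is
    returned together with the corresponding KSD (Equation 4.3).
    """
    pooled = sorted(set(values_a) | set(values_b))
    candidates = [(lo + hi) / 2.0 for lo, hi in zip(pooled, pooled[1:])]
    best_ksd, best_cut = 0.0, None
    for theta in candidates:
        cdf_a = sum(v <= theta for v in values_a) / len(values_a)
        cdf_b = sum(v <= theta for v in values_b) / len(values_b)
        if abs(cdf_a - cdf_b) > best_ksd:
            best_ksd, best_cut = abs(cdf_a - cdf_b), theta
    return best_ksd, best_cut

# Two well-separated samples give KSD = 1.0 at a cutpoint between them:
print(ksd_and_cutpoint([0.2, 0.3, 0.4, 0.5], [0.6, 0.7, 0.8, 0.9]))  # (1.0, 0.55)
```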

4.4 Feature Learning

Figure 4.2 is intended to serve as a road map to the algorithms defined in the present chapter. It shows an instantiation of the generic model of incremental learning (Figure 2.1 on page 8) applied to learning distinctive features. This is the task scenario underlying this chapter: At the outset, the agent does not know anything about the types of distinctions and classes to be learned, and it has no features available. As it performs its task, it acquires one image at a time, and attempts to recognize its contents as far as they are relevant to the task. If the recognition succeeds, the agent simply continues to perform its task. If not, it attempts to learn a new feature that solves the recognition problem.

In the basic form introduced below, the feature learning algorithm requires two key properties of the task environment:

1. The agent must be able to compare its recognition result with the correct answer (i.e., the learning algorithm is supervised).

2. The agent must be able to acquire random example images of scenes containing instances of specific classes. This permits statistical evaluation of candidate features.

Figure 4.2. A model of incremental discrimination learning (cf. Figure 2.1, page 8). (The flowchart, governed by Algorithm 4.7, starts out empty and cycles through acquiring an image and recognizing classes (Algorithm 4.4); if the result is not correct, a distinctive feature is learned with Algorithm 4.8, using Algorithm 4.10.)

The second requirement is not a critical limitation in many realistic interactive scenarios. For example, an infant can pick up a known object and view it from various viewpoints; or a child receives various examples of letters of the alphabet from a teacher. A robot may aim its camera at known locations, or may pick up a known object from a known location.

The subsequent sections introduce the key components of the visual learning algorithm in turn; namely, the classifier subsystem, the recognition algorithm, and the feature learning procedure.

4.4.1 Classifier

Conventionally, a Bayesian network classifier models the classes using a single discrete random variable that has one state for each class. This representation assumes that there is exactly one correct class label associated with each instance. Also, each feature – if instantiated – always contributes to the classification process, no matter what the current class is, because each feature node contains a full conditional probability table that for each class represents the probability of this feature being observed. This is contrary to the concept of distinctive features that specialize in specific distinctions between a characteristic class and a small subset of the other classes. In a naive Bayesian classifier, the number of entries contained in the probability table at each feature node is equal to the product of the number of classes and the number of states of the feature variable. If there are many classes, these tables can become quite large. All their conditional probability entries must be estimated. Even worse, all of them must be re-estimated whenever a new class is added to the system.

A natural way to avoid these problems is to represent each class by an individual Bayesian network classifier. An example is shown in Figure 4.3. The class node of each network has two states, representing presence or absence of an instance of this class. Here, each feature is characteristic (in the sense of Definition 4.1, page 31) of the class represented by its Bayesian network, and is discriminative with respect to one or more specific other classes. In Figure 4.3, all features are characteristic of class C, and the discriminated classes are shown inside the feature nodes.


Figure 4.3. A hypothetical example Bayesian network. (The class node, Class C, has arcs to feature nodes Feature 1 through Feature 5; each feature node is annotated with the class or classes it discriminates, e.g. −Class A, −Class B, −Classes A,B, −Class E.)

This representation keeps the classes and their characteristic features separate. When a new class is learned, a corresponding network is instantiated, but the existing networks and their conditional probability tables remain unaffected. Moreover, this representation builds in the assumption that the probability of the presence of each class is independent of all other classes. Since each network separately decides whether an object of its class is present, multiple target classes within a scene are easily handled without interference. Each network effectively acts as an object detection system. In contrast, most current recognition systems act as classification systems and assign exactly one class label to each image.

Together with the concept of distinctive features, an important implication of this representation is that the absence of a feature in an image never increases the belief in the presence of a class. Inference in favor of a class is only drawn as a result of detected features. This behavior is a design choice reflecting the openness of the uncommitted learning system. It is noteworthy that this design is in contrast to conventional recognition systems that always assign exactly one of a given set of class labels to an image. Under this paradigm, called forced-choice recognition by psychologists, not detecting a characteristic feature will inevitably increase the belief in the presence of other classes. This effect may be desirable in a closed world, but can be detrimental in realistic open scenarios. Many conventional recognition systems adopt confidence thresholds that avoid the assignment of a class label in the absence of sufficient evidence for any class, which reduces but does not eliminate this problem.

Like class nodes, feature nodes also have two states, corresponding to presence or absence of the feature. As in a typical naive Bayesian network classifier, arcs are directed from the class node to the feature nodes. To model known dependencies between features, additional arcs connect features where appropriate. Such dependencies arise by the construction of compound features. Specifically, if a geometric compound feature (say, Feature 3 in Figure 4.3) has a component feature that is also present in the network (e.g., Feature 2 in Figure 4.3), then it is clear that the presence of Feature 3 implies a high probability of finding Feature 2 in the scene also. Feature 2 does not necessarily have to be present because the thresholds that define "presence" are determined independently for each feature node (see Definition 4.2 below). An analogous argument holds for the components of a conjunctive feature. In the case of a disjunctive feature, the direction of the argument – and that of the arcs – is reversed. Say, Feature 4 is a disjunctive feature, and Feature 5 is one of its components. Then, finding Feature 5 in a scene implies the likely presence of the disjunctive Feature 4 in the same scene. The reverse is not necessarily true because Feature 4 may be present by virtue of its other subfeature, with Feature 5 being completely absent. Therefore, the causal influence is modeled by an arc from the component feature to the compound.

Note that the same feature (defined by its structure and its response vector) may occur in more than one Bayesian network classifier. Feature values are measured using Equation 3.14 (page 25). Since these feature values are real numbers, they must be discretized by applying a threshold marking the minimum value required of a feature in order to be considered present. This threshold is determined individually for each feature node so as to maximize its distinctive power:

Definition 4.2 (Distinctive Power of a Feature) The discriminative power of a feature f is the Kolmogorov-Smirnoff distance (KSD) between the conditional distributions of feature values observed under (a) the discriminated classes versus (b) the characterized class (cf. Equation 4.3 and Figure 4.1). The distinctive power of a feature is identical to its discriminative power if cdf(θ_f | a) ≥ cdf(θ_f | b) for the cutpoint θ_f associated with the KSD, i.e., the feature is in fact characteristic of its class. Otherwise, the distinctive power is defined to be zero.

As was noted in Section 4.3.2, this definition yields an optimal cutpoint in the Bayesian sense that if the presence of a class is to be inferred on the basis of this feature alone, the probability of error is minimized, given that the prior probability of the presence of this class equals 0.5. In general, however, the prior probability of a class will not be equal to 0.5. In this case, this choice of a cutpoint is not optimal in any known sense. Nevertheless, Utgoff and Clouse [126] make a strong argument in favor of using the KSD without paying attention to the prior class probabilities as a test selection metric for decision trees. It is based on the observation that even in cases where the Bayesian choice is always the same regardless of the test, the cutpoint given by the KSD separates two very different regions of certainty regarding the presence of the class.
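A minimal sketch of Definition 4.2, reusing the hypothetical ksd_and_cutpoint function from the example in Section 4.3.2, might look as follows; condition a stands for the discriminated classes and condition b for the characterized class.

```python
def distinctive_power(values_discriminated, values_characterized):
    """Distinctive power of a feature (Definition 4.2, sketched).

    The discriminative power is the KSD between the two conditional value
    distributions.  It counts as distinctive power only if, at the chosen
    cutpoint, the discriminated classes lie mostly below the threshold,
    i.e., the feature is in fact characteristic of its class.
    """
    ksd, theta = ksd_and_cutpoint(values_discriminated, values_characterized)
    if theta is None:
        return 0.0, theta
    cdf_a = sum(v <= theta for v in values_discriminated) / len(values_discriminated)
    cdf_b = sum(v <= theta for v in values_characterized) / len(values_characterized)
    return (ksd if cdf_a >= cdf_b else 0.0), theta
```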

Example 4.3 (KSD and Disparate Priors) Consider the class-conditional distributions shown in Figure 4.4. There are many more samples from the discriminated classes than from the characterized class. In other words, given an unclassified sample, the prior probability that it belongs to the discriminated classes is much higher than that it belongs to the characterized class. Where should the Bayes-optimal cutpoint be placed that allows one to guess what conditional distribution generated a given sample? The two distributions fully overlap! Because of this and the disparate priors, it is clear from Figure 4.4 that no matter where a cutpoint is placed, the expected error rate is minimized by always guessing that a sample came from the discriminated classes, irrespective of the sample's feature value. However, it is also obvious that a cutpoint can be found that is meaningful in the sense that almost all samples of the characterized class fall above it, and almost none below it. While a decision rule based on such a cutpoint is not Bayes-optimal, this cutpoint would tell us something about the posterior probability of a sample belonging to each of the two conditional distributions. Utgoff and Clouse simply suggest that the simple definition of the KSD given in Equation 4.3 be applied, which – being based on cumulative distributions – ignores the prior probabilities.

Figure 4.4. Highly uneven and overlapping feature value distributions. (The plot shows frequency counts of feature values, between 0 and 1, observed under the discriminated classes and under the characterized class.)

The computation of a KSD requires a description of the two conditional probability distributions. Here, these distributions are approximated by the cumulative experience of the agent: The agent maintains an instance list of experiences. Each image encountered adds to the instance list a new instance vector containing the feature values measured and the true class label of this training image. All probability estimates relating to the Bayesian network classifiers are estimated by counting entries in the instance list. This includes the cumulative distributions for estimating the KSDs to discretize feature variables, the conditional probability tables at the feature nodes, and the prior class probabilities at the class nodes.
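A minimal sketch of this bookkeeping is shown below. The Instance structure, the feature identifiers, and the assumption of exactly one true class label per training image are simplifications made for illustration.

```python
from collections import namedtuple

# One entry per processed training image: measured feature values plus the true class label.
Instance = namedtuple("Instance", ["feature_values", "class_label"])
instance_list = []

def add_instance(feature_values, class_label):
    instance_list.append(Instance(dict(feature_values), class_label))

def prior_class_probability(class_label):
    """Prior probability of a class, estimated by counting instances."""
    if not instance_list:
        return 0.0
    return sum(i.class_label == class_label for i in instance_list) / len(instance_list)

def prob_feature_given_class(feature_id, theta, class_label, class_present=True):
    """P(feature present | class present/absent), estimated by counting
    instances whose measured value reaches the presence threshold theta."""
    pool = [i for i in instance_list
            if (i.class_label == class_label) == class_present
            and feature_id in i.feature_values]
    if not pool:
        return 0.0
    return sum(i.feature_values[feature_id] >= theta for i in pool) / len(pool)
```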

4.4.2 Recognition

Recognition of a scene can be performed in the conventional way by first measuring the strength of each feature in the image, setting the feature nodes of the Bayesian networks to the corresponding values, and computing the posterior probability of the presence of each class. In this case, the absence of a feature is meaningful to the system. Alternatively, robustness to occlusion can be built into the system by setting only feature nodes corresponding to found features, and leaving the others unspecified. In this case, the posterior probability of these features being present (but occluded) can easily be computed. Incorporating positive evidence into the net will monotonically increase the posterior probability of a class. In particular, the posterior probability is always greater than or equal to the prior. Without effective countermeasures however, in practice this monotonic increase will often result in very high posterior probabilities for many classes. Since the posterior class probabilities are computed independently by separate networks for each class, evidence found for one class does not reduce the posterior belief in any other class. To avoid these problems, in this work both positive and negative evidence is incorporated into the network by setting the applicable feature nodes to the corresponding state, presence or absence.

Computing a feature value in an image is expensive (cf. Equation 3.14) in comparison to propagating evidence in a Bayesian network of moderate size. Moreover, the contribution of individual features to the classification decision is unclear if all of them are computed at once. Therefore, features are computed sequentially, and the classification decisions are made incrementally within each Bayesian network, according to the following algorithm.

Algorithm 4.4 (Sequential Computation of Class Beliefs)

1. Bring the Bayesian network to equilibrium by propagating evidence.

2. Find the maximally informative feature among all features of all classes. Let C and F_f denote two-state random variables representing the presence or absence of a class and of a feature f, respectively. Then, the maximally informative feature f_max is the one that maximizes

       I_max = max_f max_C I(C; F_f)

   where I(C; F_f) is the mutual information defined in Equation 4.2 on page 33.

3. If I_max = 0, then stop.

4. Measure the value of this feature f_max, f_fmax, in the current image using Equation 3.14.

5. Instantiate all nodes corresponding to f_max in all networks to the appropriate values by comparing f_fmax against the respective thresholds θ_f (cf. Definition 4.2).

6. To keep track of the usage of each feature, label f_max with the running number of the current training image. Then, continue with Step 1.

Note that at each iteration of this algorithm, one feature is instantiated. For an instantiated feature f, the uncertainty in its characterized class C is H(C) = H(C | F_f), and thus I(C; F_f) = 0 (Equation 4.1). Therefore, the algorithm stops at Step 3 after all features have been instantiated.


In practice, it often stops sooner because among the uninstantiated features there are none with nonzero mutual information. This results if the entropies in all of the class nodes have been reduced to zero, or if none of the remaining features are informative with respect to any class variable with nonzero entropy. The labeling in Step 6 allows the algorithm to detect obsolete features. If a feature ceases to be used, it can be discarded. This will further be discussed in Section 4.5.
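The control flow of Algorithm 4.4 can be summarized in the sketch below. The network interface (propagate, features, instantiate, threshold), as well as the measure_feature and mutual_info callables standing in for Equation 3.14 and Equation 4.2, are assumptions of this illustration rather than interfaces defined by the dissertation; the usage labeling of Step 6 is reduced to simple bookkeeping.

```python
def sequential_class_beliefs(networks, image, measure_feature, mutual_info):
    """Sketch of Algorithm 4.4: query features one at a time, most
    informative first, until no feature can reduce any class entropy."""
    instantiated = set()
    while True:
        for net in networks.values():
            net.propagate()                          # Step 1: reach equilibrium
        best, best_info = None, 0.0                  # Step 2: maximally informative feature
        for net in networks.values():
            for f in net.features():
                if f in instantiated:
                    continue
                info = mutual_info(net, f)
                if info > best_info:
                    best, best_info = f, info
        if best is None:                             # Step 3: I_max = 0, stop
            break
        value = measure_feature(image, best)         # Step 4: Equation 3.14
        for net in networks.values():                # Step 5: instantiate in all networks
            if best in net.features():
                net.instantiate(best, value >= net.threshold(best))
        instantiated.add(best)                       # Step 6: record usage, continue
```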

There are various ways to add sophistication to this basic algorithm. Instead of consulting features exhaustively, a stopping criterion can be introduced that terminates Algorithm 4.4 once I(C; F_f) drops below a threshold at Step 2. By definition, features with low mutual information have little potential to affect the posterior class distributions. The tradeoff between computational expense and confidence in the recognition result can be adjusted in terms of this threshold. Another computational tradeoff concerns the relative cost of measuring feature values. Large compound features consisting of many subfeatures are more costly to evaluate than smaller features, as are Boolean compounds relative to geometric compounds (see Section 3.5.1 on page 26, and Section 3.5.2). This cost can be considered at Step 2 of Algorithm 4.4, for example using the framework of decision theory [102, 98]. Such thresholds and costs depend on specific applications. For simplicity, these issues are avoided here.

Since each class is represented by its own Bayesian network, the possibility of multiple classes present in a scene is built into the system. The presence of each class is determined independently of all others. The possibility of arbitrary numbers of target classes within a scene makes the evaluation of recognition results somewhat more complicated than the conventional correct/wrong dichotomy. A recognition result is derived from the posterior class probabilities according to the following definition:

Definition 4.5 (Evaluation of Recognition Results) Those classes with a posterior probability greater than 0.5 are said to be recognized. A class is called true if it is present in the scene, and false otherwise. Each recognition falls into exactly one of the following categories:

correct: Exactly the true classes are recognized.

wrong: The correct number of classes are recognized, but these include at least one false class. At least one true class was missed.

ambiguous: All of the true classes are recognized, plus one or more false classes.

confused: None or some (but not all) of the true classes, and some false classes are recognized. The number of recognized classes is different from the number of target classes present in the scene. (Otherwise, it is considered a wrong recognition.)

ignorant: None or some (but not all) of the true classes, and none of the false classes are recognized.

The category of wrong recognitions constitutes a special case of misrecognitions that are otherwise in the confused category. Wrong recognitions are important in that they contain pairwise errors where one class is mistaken for one other class. In a classical forced-choice scenario, this is the only type of incorrect recognition.

In conventional forced-choice classification, there are only correct and wrong categories. Since this system has the additional freedom to assign multiple class labels or no label at all, the number of correct recognitions may be lower than in a forced-choice situation. To facilitate comparison with forced-choice scenarios, at least the number of ambiguous recognitions should somehow be taken into account. These contain the right answer, which is valuable especially if the number of false positive labels in an ambiguous recognition is low. Similarly, an ignorant recognition should contribute to the same extent as a random result. This idea is formalized in the following definition.


Definition 4.6 (Accuracy) A recognition result is a set of classes recognized in an image I by Algorithm 4.4. This set includes n_t true positives and n_f false positives, where n_t + n_f ≤ n_c, the total number of classes known to the system. Let n_T denote the actual number of target objects depicted in the scene. Then,

    acc(I) = n_T / n_c              if n_t + n_f = 0
    acc(I) = n_t / (n_T + n_f)      otherwise

is the accuracy of this recognition result. The accuracy achieved on a set T of images is given by

    acc(T) = (1 / |T|) ∑_{I ∈ T} acc(I).

This definition is a generalization of the notion of correct recognition that values ambiguous recognitions according to the number of false positives involved. An ignorant recognition is regarded as equivalent to a fully ambiguous recognition, i.e., a result that lists all n_c class labels. The accuracy is a value between zero and one, where the maximum corresponds to 100% correct recognition. If all recognitions are ignorant or fully ambiguous, then the accuracy is the inverse of the number of classes known to the system. If n_T = 1 everywhere, this is identical to the chance-level proportion of correct recognitions attained by uniformly random forced-choice classifiers. Thus, the accuracy can be directly compared to the percentage-correct result delivered by a conventional forced-choice classifier.
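A small sketch of Definition 4.6 follows; the set-based representation of a recognition result is an assumption made for illustration, and the numbers in the usage example are hypothetical.

```python
def accuracy(recognized, true_classes, n_classes):
    """Accuracy of one recognition result (Definition 4.6).

    recognized and true_classes are sets of class labels; n_classes is the
    total number of classes known to the system (n_c)."""
    n_t = len(recognized & true_classes)       # true positives
    n_f = len(recognized - true_classes)       # false positives
    n_T = len(true_classes)                    # targets actually in the scene
    if n_t + n_f == 0:                         # ignorant: treated like listing all classes
        return n_T / n_classes
    return n_t / (n_T + n_f)

# Single-target scene, six known classes:
print(accuracy({"obj2"}, {"obj2"}, 6))           # correct   -> 1.0
print(accuracy({"obj2", "obj5"}, {"obj2"}, 6))   # ambiguous -> 0.5
print(accuracy(set(), {"obj2"}, 6))              # ignorant  -> 1/6, the chance level
```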

4.4.3 Feature Learning Algorithm

In the incremental learning scenario (Figure 4.2 on page 35), the agent receives training images one by one. Initially, the agent does not know about any classes or features. When it is presented with the first image, it simply remembers the correct answer. When it is shown the second image, it will guess the same answer. This will continue, and no features will be learned, until this answer turns out to be incorrect.

Algorithm 4.7 (Operation of the Learning System) This algorithm provides details of the basic model of incremental discrimination learning shown in Figure 4.2, page 35.

1. A new training image is acquired, denoted I.

2. Image I is run through the recognition algorithm 4.4 as described in Section 4.4.2. If recognition is correct according to Definition 4.5, then continue at Step 1.

3. Assume recognition failed because of inaccurate conditional probability estimates stored in the Bayesian networks. To remedy, all feature nodes are re-discretized to maximize their distinctive power (cf. Definition 4.2, page 37). Their conditional probabilities are estimated by counting co-occurring feature values and class labels in the instance list (cf. Section 4.4.1, page 38). In the same way, the prior probabilities associated with the class nodes are re-estimated.

4. Re-run the recognition algorithm 4.4 on the image I. If I is now recognized correctly, then continue at Step 1.

5. Conclude that at least one new feature must be added to at least one Bayesian network to recognize I correctly. Attempt to do this by invoking Algorithm 4.8 (described below). Then, continue at Step 1.


This algorithm re-estimates the conditional probability tables of the Bayesian networks only when an image is misrecognized (Step 3). Alternatively, after each recognition – successful or otherwise – the conditional probability tables of all nodes related to queried features could be updated, maintaining a best estimate of the actual probabilities at all times. A drawback of this approach is that the behavior of the system would fluctuate even in the absence of any mistakes. For example, two consecutive presentations of the same image could result in two different recognition results. Since it is easier to characterize the behavior of the system if its behavior does not depend on a recent history of correct recognitions, it is designed to learn only from mistakes.

At Step 5 in this algorithm, the objective is to derive a new feature to discriminate one or more true classes from one or more mistaken classes, i.e., those false classes that were erroneously recognized. Two cases must be distinguished. In the first case, a true class is not among the recognized classes (wrong, confused, or ignorant recognition). The agent needs to find a new feature of this true class in the current image – the sample image – that, if detected, will cause this class to be recognized. In the second case, some false classes were recognized in addition to the true classes (ambiguous recognition). Here, the agent needs to find a feature in an image of a false class that is not present in the currently misrecognized image, and that, if not detected, will prevent this false class from being recognized. Again, this image is called the sample image.

Algorithm 4.8 (Feature Learning)

1. For the purpose of learning this feature, the true and mistaken classes are (re)defined so that there is generally exactly one true and one mistaken class. The classes selected are the true class with minimum belief and the false class with maximum belief, respectively, representing the most drastic case. In either case, if there is no such unique class, belief values from earlier (non-final) recognition stages (Algorithm 4.4) are consulted. In particular, this covers the case of ignorant recognition. See below for further discussion and an example.

2. A set of example images is retrieved from the environment. This set contains n_ex random views of each of the true class and the mistaken class. The precise value of n_ex is unimportant, but involves a tradeoff that will be discussed below.

3. A new feature is generated, generally by random sampling from the sample image, using Algorithm 4.10. Algorithm 4.10 may be called only a limited number of times within a single execution of Algorithm 4.8. If this limit is exceeded, Algorithm 4.8 terminates unsuccessfully. The details will be discussed below in the context of Algorithm 4.10.

4. A new node representing this new feature is added to the Bayesian network that models the class of the sample image, along with an arc from the class node and any arcs required to or from any other feature nodes, to model any known interdependencies with pre-existing features (cf. Section 4.4.1).

5. The values of this feature, as well as any existing features represented by nodes linked to or from the new feature node in the Bayes network, are measured within each example image. The resulting instance vectors are added to the instance list.

6. The new feature node is discretized such that the cutpoint maximizes the distinctive power between the true and mistaken classes (see Definition 4.2 on page 37), as estimated by counting entries in the instance list. If the distinctive power is zero, the algorithm continues at Step 3.

7. The conditional probability tables of the affected nodes are (re)estimated by counting entries in the instance list.

8. The recognition procedure (Algorithm 4.4) is run on the sample image, using the added feature.

   (a) If the sample image is of a true class that is erroneously not recognized, or if the sample image is of a false class that is erroneously recognized, then the new feature node is removed from the Bayes net. However, the definition of the new feature (consisting of its structural description and response vector) is retained until this algorithm terminates. Now, it proceeds with Step 3.

   (b) If the image is of a true class and is now recognized successfully, or if it is of a false class and is successfully not recognized, then this algorithm terminates successfully.

Note that this feature search procedure requires that each new feature individually solves an existing recognition problem. This results in a strong tendency to learn few but informative features. Each feature is learned specifically to distinguish one class from one other class (see Step 1). Instead of using just one true and one mistaken class, one could use all misrecognized classes, or even all true and false classes. However, in general it will be much easier to find features that discriminate well between pairs of classes, than features that discriminate between two groups of several classes. In practice, this may in fact be all that is required to solve a given visual task. Moreover, using only a pair of classes ensures that the execution time of this algorithm is independent of the total number of classes.

Example 4.9 (Selecting True and Mistaken Classes) Table 4.1 shows some examples of how the characterized and discriminated classes are chosen in Step 1 of Algorithm 4.8. The first case shows a wrong recognition result. Since all classes with posterior beliefs greater than 0.5 are said to be recognized, the system erred on classes B and C. Here, it is clear that a new feature needs to be learned for class B that will cause class B to be recognized in the current image. This new feature should be discriminative with respect to class C, because the system mistook class B for class C. Note that the new feature will not affect the recognition result for class C (nor for any other class but the characterized class B). Likely, it will cause an ambiguous result like that shown in the second row. Now, a feature must be learned that is characteristic of class C, and that, if not found in an image, will prevent it from being recognized as class C. Class A is chosen as the discriminated class because it has the lowest posterior belief among all true classes, and is therefore presumably most likely confused with class C.

Table 4.1. Selecting true and mistaken classes using posterior class probabilities. Bold type indicates the characterized class of the new feature to be learned. The discriminated classes are shown in italics.

             True Classes      False Classes
             A      B          C      D      E
wrong        0.7    0.4        0.8    0.3    0.2
ambiguous    0.7    0.9        0.8    0.3    0.2
ignorant     0.3    0.4        0.4    0.3    0.2
             0.4    0.4        0.3    0.3    0.2
ignorant     0.0    0.0        0.0    0.0    0.0

The first (wrong) example would constitute a confused recognition if, for example, class D had received a posterior belief of greater than 0.5. The third example is an ignorant recognition. Like in wrong and confused cases, the characteristic class is the true class with lowest posterior belief, and the discriminated class is the false class with highest posterior belief. In the final example, Algorithm 4.4 resulted in zero belief for all classes. In such a case where true and characteristic classes cannot be determined from the final recognition result, the system consults the intermediate recognition result generated by the penultimate iteration of Algorithm 4.4. This result is guaranteed to contain beliefs with nonzero entropy. In this example, there are two potential characteristic classes because both true classes have identical belief. In this case, one of them is chosen at random because Algorithm 4.4 requires a single sample image. There are also two potential discriminated classes, both of which are simply used. In practice, such ambiguous cases occur very rarely.
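The selection rule of Step 1 can be sketched directly from the posterior beliefs. The beliefs dictionary below mirrors the "wrong" row of Table 4.1, and tie-breaking via earlier recognition stages (as in the ignorant examples) is deliberately omitted from this sketch.

```python
def select_true_and_mistaken(beliefs, true_classes):
    """Characterized (true) and discriminated (mistaken) class per Step 1 of
    Algorithm 4.8: the true class with minimum belief and the false class
    with maximum belief."""
    true_cls = min(true_classes, key=lambda c: beliefs[c])
    false_cls = max((c for c in beliefs if c not in true_classes),
                    key=lambda c: beliefs[c])
    return true_cls, false_cls

beliefs = {"A": 0.7, "B": 0.4, "C": 0.8, "D": 0.3, "E": 0.2}   # the "wrong" row of Table 4.1
print(select_true_and_mistaken(beliefs, {"A", "B"}))            # ('B', 'C')
```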

The number of example images per class, n_ex (Algorithm 4.8, Step 2), determines how accurately the conditional probabilities associated with the new feature are estimated at Step 7. A large value yields accurate estimates, but these are expensive to evaluate. The danger of using too small a value of n_ex is that many good features are discarded because of overly pessimistic probability estimates. Optimistic estimates result in the addition of features that later turn out to be of little value. Consequently, these will cease to be used as they are superseded by more discriminative features. Thus, using a small n_ex will result in more feature searches, each of which will take less time than when using a large n_ex. In a practical interactive application, the choice of this parameter should depend on the relative cost of acquiring images vs. generating experience in the real world. If experience is costly to obtain (e.g. because it involves sophisticated maneuvers with physical manipulators), but a variety of representative images are cheaply obtained at Step 2 of Algorithm 4.8, then n_ex should be chosen large. On the other hand, if each real-world experience is generated cheaply but example images are expensive to obtain, then a small value of n_ex should be used. The value does not need to be fixed, nor does it need to be the same for the true and the mistaken class. All experiments reported here use n_ex = 5. This value was chosen relatively small because it emphasizes the incremental nature of learning even if a small, fixed training set is used. At the other extreme, n_ex could be chosen so large that the example images always include the entire training set (if one exists), yielding immediately accurate estimates of the distinctive power of features without the need for incremental updating.

It remains to be explained how a feature is generated at Step 3 of Algorithm 4.8. There are five ways to generate a new feature that differ in the structural complexity of the resulting feature. In accord with Occam's razor, simple features are preferred over complex features. Also, they are more likely to be reusable and are computed more rapidly than complex features. The following algorithm implements a bias toward structurally simple features by considering features in increasing order of structural complexity.

Algorithm 4.10 (Feature Generation) These methods are applied in turn at Step 3 of Algorithm 4.8:

1. (Reuse) From a Bayesian network representing an unrelated class (other than the true or mistaken classes), select a random feature that is not in use for the true or mistaken classes, to see if it can be reused for the current recognition problem. This promotes the emergence of general features that are characteristic of more than one class.

2. (New Primitive) Sample a new feature directly from the sample image by either picking two points and turning them into a geometric compound of two edgels, or by picking one point and measuring a texel feature response vector. (Recall that features are measured under rotational invariance. Therefore, individual edgels do not carry any information.) This is the only step where new features are generated from scratch.

3. (Geometric Expansion) From among the features currently in use, or from among the failed candidate features tried earlier during this execution of Algorithm 4.8 (see Algorithm 4.8, Step 8a), choose a feature at random. Find a location in the sample image where this feature occurs most strongly (Equation 3.14), and expand it geometrically by adding a randomly chosen salient edgel or texel nearby. This expansion yields a new feature that is more specific than the original feature.

4. (Conjunction) Pick two random existing features and combine them into a conjunctive feature. Again, the result is a new feature that is more specific than either of the constituent features individually.

5. (Disjunction) Pick two random existing features and combine them into a disjunctive feature. Here, the result is a new feature that is less specific than the constituent features.

For a given run of Algorithm 4.8 (Feature Learning, page 41), each time Algorithm 4.8 reaches its Step 3, one method of Algorithm 4.10 is applied. The first n_fgen times Algorithm 4.8 reaches Step 3, Method 1 of Algorithm 4.10 is performed; the next n_fgen times, Method 2, and so on. This continues until either a suitable feature is found, or until the last method has been applied n_fgen times, at which point the feature search (Algorithm 4.8) terminates unsuccessfully. The parameter n_fgen in effect controls the bias toward simple features. A large value will expend more effort using each method, while a small value will give up sooner and move on to the next method. The current implementation uses n_fgen = 10, which is considered to be a relatively small value in relation to the roughly 100–1000 salient points present in typical images used here (cf. Chapter 3).
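This schedule can be expressed compactly as below; the methods list and the acceptable predicate (standing for the checks at Steps 6 and 8 of Algorithm 4.8) are placeholders for the machinery described in the text.

```python
def generate_feature(methods, acceptable, n_fgen=10):
    """Feature generation schedule of Algorithms 4.8/4.10 (sketch): try each
    method, in order of increasing structural complexity, up to n_fgen times,
    and return the first acceptable candidate.  Returns None if the search
    terminates unsuccessfully."""
    for method in methods:   # reuse, new primitive, expansion, conjunction, disjunction
        for _ in range(n_fgen):
            candidate = method()
            if acceptable(candidate):
                return candidate
    return None
```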

The specific sequence of methods defined in Algorithm 4.10 was chosen because it implements a bias toward few and structurally simple features. A new feature is only generated if the system fails to find a reusable feature. Before existing features are expanded, new primitive features are generated. Geometric expansions are preferred over conjunctive features because the former are local and rigid, while the latter can encompass any variety of individual features. Local features are more robust to partial occlusions and minor object distortions than are non-local features. Nevertheless, conjunctive features may be important because they encode the presence of several subfeatures without asserting a geometric relationship between them. This makes them more robust to object distortions than geometric features.

The last step involves the generation of disjunctive features. These are susceptible to overfitting because a single disjunctive feature can in principle encode an entire class as a disjunction of highly specific features, each characterizing an individual training image. Thus, they can help learn a training set without generalization. On the other hand, disjunctive features are necessary in general because objects within a class may differ strongly in their appearance. For example, a disjunctive feature can encode a statement such as "If I see a dial or a number pad, I am probably looking at a telephone."

4.4.4 Impact of a New Feature

In Algorithm 4.8 (Feature Learning), Step 3, a new feature is generated in the hope that it will fix the recognition problem that triggered the execution of Algorithm 4.8. Clearly, a necessary condition for success is that the new feature is actually consulted by the recognition algorithm 4.4, when called at Step 8 of Algorithm 4.8. Algorithm 4.4 continues to consult features as long as there are features that can potentially reduce the uncertainty in any class nodes. Therefore, the new feature will be queried if the entropy in its own class node does not vanish as a result of querying other features of this Bayesian network. The only way Algorithm 4.4 can terminate without querying the new feature is if there is a feature f in this Bayesian network that drives the entropy in its class node to zero, and that has a greater mutual information with respect to the class node than the new feature, causing it to be queried before the new feature. Note that this recognition is incorrect, as it triggered the present execution of the feature learning algorithm 4.8 at Step 5 of Algorithm 4.7 (Operation of the Learning System). The newly generated feature is not going to be used on this image, irrespective of any discriminative power it may have.

How likely is this worst-case situation to occur? Consider a case where the system failed to recognize a true class. Figure 4.5 illustrates the class-conditional cumulative probabilities of a feature f that can drive the entropy in the class node to zero, concluding that its class is absent. The critical property of the graph is that if the class is present, the feature f is always present with f_f ≥ θ, allowing one to infer the absence of the class if f_f < θ. (The figure also illustrates that the KSD of such a feature can be quite low – roughly 0.5 in this case.) Can this feature prevent the successful generation of new features by keeping them from getting queried? The answer is no because it will quickly lose its power to reduce the entropy to zero. To see this, observe that each processed image generates an instance that is added to the instance list, recording the measured feature values together with the true class labels. In the situation described here, f_f < θ_f, but the class was present. This situation is incompatible with the critical property of Figure 4.5. Consequently, the next time the conditional probabilities of f are re-estimated based on the instance list, at least one of two changes must occur: The probability of f_f < θ_f given the presence of the class becomes greater than zero, or the threshold θ_f is reduced. In the first case, the feature immediately loses its power to determine the absence of the class with absolute certainty. The second case clearly cannot reoccur indefinitely because θ_f is chosen to maximize the KSD, ensuring that f_f < θ_f for the bulk of those cases where the class is absent. As specified in Algorithm 4.7 (Operation of the Learning System) on page 40, the cutpoints θ and the conditional probability tables of all feature nodes are re-estimated after every incorrect recognition.

Figure 4.5. KSD of a feature f that, if not found in an image, causes its Bayesian network to infer that its class is absent with absolute certainty. (The plot shows the cumulative probability, from 0 to 1, of the feature value f_f under the conditions "class absent" and "class present"; the two curves separate at the cutpoint θ_f.)

In summary, a newly sampled feature is not guaranteed to be used immediately. However, those cases where it is not used can be expected to be rare, since they require the presence of a feature f with the specific properties described above. Importantly, any such degenerate situation is guaranteed to resolve itself soon because the feature node parameters are re-estimated after every incorrect recognition. This self-repairing property is a key aspect of the design of the learning algorithm, and is critical to its successful operation.

4.5 Experiments

The algorithms developed in the preceding sections were designed for open task scenarios in which images are encountered while the agent performs a task. In effect, each newly acquired visual scene first serves as a test image that the agent attempts to recognize. If it succeeds, the agent moves on. If the result of recognition is incorrect, this scene becomes a training image. Additional example images are then acquired to enable the agent to generate an initial estimate of the distinctive power of newly generated features.

Testing this algorithm on such an interactive task is a nontrivial endeavor. Implementing a robot that can interact in meaningful ways with many objects is challenging in its own right and beyond the scope of this dissertation. The results cited in this section were generated on simulated tasks.

A second simplification concerns the number of objects shown in an image. The algorithms described above make no assumptions regarding the number of target or distractor objects present in any given image. For simplicity and to facilitate comparison of the results with conventional methods, all experiments were performed in the conventional way, with each image containing exactly one object. It is important to note however that the learning system did not take advantage of this in any way. The algorithms were applied in their full generality as introduced above.

A simulated task consists of a closed set of images. This set of images was divided into equally sized training and test sets, such that the number of images per class was identical to the extent possible. Each experiment was run a second time with the roles of training and test sets reversed. In the machine learning terminology, this procedure is called two-fold stratified cross validation. Training took place as shown in the overview given in Figure 4.2 on page 35, and as described in detail in Section 4.4.3. Images from the training set were presented to the system one by one, cycling through the training set in random order. A new permutation was used for each cycle. On receipt of an image, the system analyzed it using Algorithm 4.4. If the recognition result was not correct (in the sense of Definition 4.5 on page 39), then a new feature was sought according to Algorithm 4.8. At Step 2 of this algorithm, the example images were chosen at random from the training set (not from the cycling order). The system cycled through the entire training set until all recognition results were correct.
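The split used for the simulated tasks can be sketched as below; the label_of callable and the per-class alternating assignment (matching the rule of assigning neighboring views alternately to training and test sets) are assumptions of this illustration.

```python
from collections import defaultdict

def two_fold_stratified_split(images, label_of):
    """Two-fold stratified split: images of each class are assigned
    alternately to the two folds, keeping the per-class counts as equal as
    possible.  Each fold serves once as training set and once as test set."""
    by_class = defaultdict(list)
    for img in images:
        by_class[label_of(img)].append(img)
    folds = ([], [])
    for members in by_class.values():
        for i, img in enumerate(members):
            folds[i % 2].append(img)
    return folds
```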

The empirical performance of the learning system is discussed in terms of the evaluation categories given in Definition 4.5, which apply to the general case of an arbitrary number of target classes. Since all experiments were performed using only a single target class, these categories are restated here for this special case, to simplify their interpretation.

correct: Exactly the true class is recognized.

wrong: Exactly one false class is recognized.

ambiguous: The true class is recognized, plus one or more false classes.

confused: More than one false class is recognized, but not the true class.

ignorant: No class is recognized.

Experiments were conducted using three data sets with very different characteristics. These will be discussed in the following sections. The next chapter will introduce a simple extension to the basic learning algorithm that results in pronounced performance improvements.

4.5.1 The COIL Task

The COIL task consisted of 20 images of each of the first six objects of the 20-object Columbia Object Image Library [79]. One sample image of each class, representing the middle view, is shown in Figure 4.6. Neighboring views are spaced 5 degrees apart, at constant elevation. Neighboring images were assigned alternately to training and test sets. All images are of size 128 × 128 pixels.

Figure 4.6. Objects used in the COIL task (obj1, obj2, obj3, obj4, obj5, obj6). Shown are middle views from the sorted sequence of viewpoints.


Table 4.2 summarizes important performance parameters for the COIL task (and also for the other two tasks, Plym and Mel, but these will be discussed later). The most striking observation is that the trained recognition system is very cautious in that it makes very few mistakes. The proportions of wrong and ambiguous recognitions are both very low. Unfortunately, the proportion of correct recognition is also fairly low at about 50%. In the remaining 33% of cases (ignorant), no class was recognized at all.

Table 4.2. Summary of empirical results on all three tasks. The first four entries give the name of the task, the number of classes, the fold index, and the accuracy according to Definition 4.6. The five entries under "Recognition Results" in each row sum to one, barring roundoff errors. Zero entries are left blank for clarity. The columns under the heading "# Features" give the total number of feature nodes in all Bayesian networks (BN), the total number of different features (dif.), the average number of features queried (qu'd) during a single recognition procedure (Algorithm 4.4), and the total cumulative number of features sampled (sp'd), respectively. The rightmost two columns list the number of cycles through the training set, and the total number of training image presentations.

                       Recognition Results             # Features              Tr. Set
Task  Cls. Fold  acc.  cor.  wrg.  amb.  cnf.  ign.   BN   dif.  qu'd.  sp'd.   cyc.  imgs.
COIL   6    1    .58   .48   .08   .08         .35    46    35   22.1    471     6     360
COIL   6    2    .61   .52   .08   .10         .30    60    40   22.1    911     8     480
COIL   6   avg.  .60   .50   .08   .09         .33    53    38   22.1    691     7     420
Plym   8    1    .57   .50   .06   .05   .02   .38    84    59   22.0    977     8     448
Plym   8    2    .61   .52   .02   .09         .38    86    55   32.1   1019     8     512
Plym   8   avg.  .59   .51   .04   .07   .01   .38    85    57   27.1    998     8     480
Mel    6    1    .49   .36   .14   .11         .39    50    34   17.4    608     8     288
Mel    6    2    .47   .33   .08   .14         .44    47    29   19.6    477     6     216
Mel    6   avg.  .48   .35   .11   .13         .42    48    32   18.5    543     7     252

The next two columns give the number of feature nodes summed over all class-specific Bayesian network classifiers, and the number of features that are actually different. The difference between these two numbers corresponds to features that are used by more than one classifier (see Algorithm 4.10, Feature Generation, Step 1). Here, about one-third of all feature nodes share the same feature with at least one other node. How many of these features are actually queried during an average recognition procedure (Algorithm 4.4)? The next column shows that on average, only a little more than half of all available features are queried before the entropies in the class nodes are reduced to zero, or until none of the unqueried features has any potential to reduce a nonzero entropy. On average, the classifier system ended up with about 9 feature nodes per class (53 feature nodes / 6 classes).

The rightmost three columns in Table 4.2 give a flavor of the total number of features generated during learning, the number of iterations through the training set performed until it was perfectly learned, and the total number of training images seen by the learning system. These numbers become more meaningful in the context of the following chapter.

Table 4.3 provides deeper insight into the recognition performance, separately for each class. Interestingly, it shows that the performance characteristics differ widely between the two folds of the two-fold cross validation. For example, in Fold 1 almost all instances of obj3 were recognized correctly, while in Fold 2 most of them were not recognized at all. However, some trends can still be found: Performance was consistently high on obj2, and consistently low on obj6 and obj4. Moreover, instances of obj4 tended to be misclassified as obj6, but not vice versa. The variation between the two folds is mostly due to the randomness inherent in the feature learning algorithm. This topic will be further discussed in Section 4.6 and in Chapter 5.


Table 4.3. Results on the COIL task. The table shows, for each class, how many instances were recognized as which class for correct and wrong recognitions, and the number of ambiguous/confused/ignorant recognition outcomes. In each row, these entries sum to the number of test images of each class (here, 10). The rightmost column shows the number of feature nodes that are part of the Bayesian network representing each class. The bottom row of the bottom panel gives the average proportion of the test set that falls into each recognition category.

Confusion Matrix correct/wrong; # Feats.
Class:  obj1 obj2 obj3 obj4 obj5 obj6  amb. cnf. ign.  BN

Fold 1:
obj1: 6 4 8
obj2: 8 1 1 4
obj3: 8 2 9
obj4: 1 2 1 2 2 2 11
obj5: 3 2 5 5
obj6: 1 2 7 9
Sum: 6 8 10 2 4 4 5 0 21 46

Fold 2:
obj1: 5 5 10
obj2: 10 7
obj3: 2 1 7 16
obj4: 4 2 2 2 9
obj5: 7 2 1 8
obj6: 2 3 2 3 10
Sum: 5 10 2 4 10 5 6 0 18 60

Average of both folds (proportions):
obj1: .55 .45 9
obj2: .90 .05 .05 6
obj3: .50 .05 .45 13
obj4: .05 .30 .05 .20 .20 .20 10
obj5: .50 .20 .30 7
obj6: .05 .10 .25 .10 .50 10
Avg.: .09 .15 .10 .05 .12 .08 .09 .00 .33 9


In the cases of ambiguous recognition, which classes were the false positives? The answer is given by Table 4.4. In both folds, obj4 and obj5 could often not be disambiguated from each other. These wrong and ambiguous recognitions do not seem intuitive, given that they appear very dissimilar to a casual observer (Figure 4.6). However, in Fold 1, obj4 shares one feature with each of obj5 and obj6. This is an indication that at some level, these classes do seem similar to the learning system. Even more strongly in Fold 2, obj4 shares one feature with obj5, two features with obj6, and one more feature with both obj5 and obj6. Examples of the features learned for each class are shown in Figure 4.7. Edgels and texels and their orientations and scales are represented as in Figure 3.9 on page 22. In addition, the subfeatures of a compound feature are linked with a line, that is solid for geometric and dotted for Boolean compounds.

Table 4.4. Results on the COIL task: Confusion matrices of ambiguous recognitions. The italic numbers on the diagonal indicate how many times the class given on the left was recognized ambiguously (cf. Table 4.3), and the upright numbers in each row specify the false-positive classes.

Class:  obj1 obj2 obj3 obj4 obj5 obj6

Fold 1:
obj1:
obj2: 1 1
obj3:
obj4: 2 1 1
obj5: 1 1 2
obj6:

Fold 2:
obj1:
obj2:
obj3:
obj4: 2 2 1
obj5: 1 2 1
obj6: 1 1 2

To a casual human observer, the two cars (obj3 and obj6) look very similar. Why does the learning system not seem to have any difficulty with these two objects? The answer is that the similarity occurs at a high level, concerning the overall shape of the body, the location of the wheels, etc. However, the learning system does not have direct access to these high-level features. It only looks at a small number of strongly localized features, and at combinations of these. At the level of local features, the two cars in fact look rather different even to humans.

What are the appearance characteristics captured by individual features? Figures 4.8 and 4.9 give an intuition. Figure 4.8 shows all images of class obj1. Each image displays all features characteristic of this class that are present in this image (i.e., f_f ≥ θ_f; cf. Algorithm 4.4, Step 5, page 38). Each feature is annotated with a unique identifier k (feature k was the kth feature learned by the system). The images show that some features are reliably located at corresponding parts of the duck (features 5, 8, 25). Other features change their location with the viewpoint (features 11, 13). Most features appear on most views, while feature 22 only occurs in a few images. Feature 9 is a disjunctive feature. In most views, its dominant subfeature is a large-scale geometric compound of two edgels. Only the first image shows its other subfeature, a texel.

Figure 4.7. Examples of features learned for the COIL task. For each class, all features characterizing this class in Fold 1 are shown that are considered to be present in the image by the classifier.

Figure 4.8. Features characteristic of obj1 located in all images of this class where they are present. These features were learned in Fold 1, and are here identified by unique ID numbers.

Figure 4.9 illustrates the specificity of a representative subset of the features characteristic of class obj1. The two center columns indicate how strongly a feature responded everywhere in the image. For purposes of illustration, the feature response value was computed at all image points (not just at salient points). At each point, all scales corresponding to scale-space maxima were considered. The gray value at a pixel corresponds to the maximum feature value f_max over these scales. The location of a feature is identified with the location of its reference point (cf. Figure 3.12, page 26). To enhance detail, the gray value is proportional to f_max^4, as most feature responses are quite close to unity. A pixel was left white if the scale function did not assume any local maxima at this image location. The figure shows that some features are more specific than others. Feature 13 is the least specific feature characteristic of obj1. It responds strongly in many areas both around the periphery and in the interior of the duck, and also the cat. Feature 5 is the most specific feature of obj1, and responds primarily to certain areas interior to the duck. This feature is very similar to feature 25. Not accidentally, feature 13 assumed its strongest response at various places in Figure 4.8, while features 5 and 25 are stable with respect to their location. Feature 18 is an example of an edge feature. In fact, its response is much more strongly localized to edges than the texel features shown in Figure 4.9. Feature 9 responds only weakly everywhere, as its constituent features were sampled from other images. Its presence threshold θ_f is correspondingly lower (not shown in the figure).

4.5.2 The Plym Task

The second task used eight artificially rendered, geometric objects,² that are shown in Figure 4.10. There are 15 views per object, covering a range of ±10 degrees about the vertical axis relative to the middle position shown in Figure 4.10 in steps of two degrees, at constant elevation. Two views are rendered from elevations 10 degrees above and below the middle position, and two views at 10 degrees in-plane rotation. The object surfaces are free of texture. This makes the Plym task difficult to learn using the feature space used here, as the only discriminant information available to the learning system is given by local contour shading and angles between edges – and both are adversely affected by perspective distortions. The rendering viewpoints are closer together and cover a smaller section of the viewing sphere than the COIL task. Again, neighboring views were assigned alternately to training and test sets, yielding one training/test set with 8 and one with 7 images per class. The objects are rendered at similar size, and the images were cropped to leave a 16-pixel boundary around the object, resulting in image sizes of about 100 × 170 pixels.

Table 4.2 reveals that the performance of the learning system on the Plym task was very similar to the COIL task. There was about the same number of correct recognitions, but slightly fewer wrong and ambiguous recognitions, which is reflected by a higher proportion of ignorant results. One of the very few confused recognition results ever encountered occurred here: A cucon was recognized as a cucy and a cyl3. Incidentally, in this fold the cucon Bayesian network shared one feature each with the cucy and cyl3 networks, which indicates that these classes were considered similar by the learning system. All other numbers shown in Table 4.2 are roughly in proportion to the COIL task.

The confusion matrices (Table 4.5) again reveal striking consistencies and differences between the two folds. For example, cyl3 was perfectly recognized in Fold 1, but poorly in Fold 2. Object cycu was almost perfectly recognized in both folds. This object is characterized by a unique rectangular protrusion attached to a smooth surface. This turned out to be a powerful clue to the learning system: The compound feature (from Fold 2) sitting precisely at the attachment location in Figure 4.11 has a KSD of 1.0; it is a perfect predictor of this object. In Fold 1, a coarse-scale texel feature was learned that responds to the unique concavity created by the protrusion. This feature (not shown) also has a KSD of 1.0.

The consistently most difficult objects to recognize were cyl6 and tub6. Most severely, none of these was recognized correctly in Fold 1. It is no coincidence that cyl6 is the most featureless object in the data set, and it is identical to tub6 except that the latter is hollow.

2These objects were generated by Levente Toth at the Centre For Intelligent Systems, University of Plymouth, UK, and are available under the URL http://www.cis.plym.ac.uk/cis/levi/UoP CIS 3D Archive/8obj set.tar.


Figure 4.9. Spatial distribution of feature responses of selected features characteristic of obj1 (left columns), for comparison also shown for obj4 (right columns). In the center columns, the gray level encodes the response strength of the feature shown in the outer panels. Black represents the maximal response of f = 1.0; in white areas no response was computed. Features are labeled by their ID numbers.


Figure 4.10. Objects used in the Plym task (con6, cube, cucon, cucy, cycu, cyl3, cyl6, tub6). Shown are middle views from the sorted sequence of viewpoints.

Figure 4.11. Examples of features learned for the Plym task (con6, cube, cucon, cucy, cycu, cyl3, cyl6, tub6). For each class, all features characterizing this class in Fold 2 are shown that are considered to be present in the image by the classifier.


Table 4.5. Results on the Plym task. The table shows, for each class, how many instances were recognized as which class for correct and wrong recognitions, and the number of ambiguous/confused/ignorant recognition outcomes. In each row, these entries sum to the number of test images of each class (here, 8 in Fold 1 and 7 in Fold 2). The rightmost column shows the number of feature nodes that are part of the Bayesian network representing each class. The bottom row of the bottom panel gives the average proportion of the test set that falls into each recognition category.

Confusion Matrix (correct/wrong), amb., cnf., ign., # Feats. (BN)
Class: con6 cube cucon cucy cycu cyl3 cyl6 tub6

Fold 1:
con6 6 2 13
cube 3 5 16
cucon 1 5 1 1 5
cucy 3 3 2 10
cycu 7 1 8
cyl3 8 4
cyl6 8 15
tub6 2 1 5 13
Sum 7 3 5 3 7 10 1 0 3 1 24 84

Fold 2:
con6 4 3 6
cube 5 2 12
cucon 1 5 1 6
cucy 3 4 7
cycu 7 6
cyl3 3 4 14
cyl6 3 4 21
tub6 1 3 3 14
Sum 4 5 1 3 7 3 4 3 5 0 21 86

Average of both folds (proportions):
con6 .67 .33 10
cube .53 .47 14
cucon .07 .40 .33 .07 .13 6
cucy .40 .20 .40 9
cycu .93 .07 7
cyl3 .73 .27 9
cyl6 .20 .80 18
tub6 .13 .13 .20 .53 14
Avg. .09 .07 .05 .05 .12 .11 .04 .03 .07 .01 .38 11


This is a situation where the lack of a closed-world assumption (cf. Section 4.4.1) impairs recognition performance: The only way to decide that an object is a cyl6 as opposed to a tub6 is by asserting the absence of features corresponding to the cavity. This type of inference is not performed by the recognition system. Therefore, the learning system tries hard to identify features characteristic of cyl6, finding features that respond to the individual training images, but do not generalize. These features fail to respond to unseen test views, which results in ignorant recognitions. The similarity between these two objects is further illustrated by the fact that in both folds, their respective Bayesian networks share three features.

No pattern is apparent in the few ambiguous recognitions (Table 4.6). In each fold, only one class experienced ambiguous recognitions. Neither of these ambiguities is reflected in shared features.

Table 4.6. Results on the Plym task: Confusion matrices of ambiguous recognitions. The italic numbers on the diagonal indicate how many times the class given on the left was recognized ambiguously (cf. Table 4.5), and the upright numbers in each row specify the false-positive classes.

Class con6 cube cucon cucy cycu cyl3 cyl6 tub6

Fold 1:
con6
cube
cucon
cucy 3 2 1
cycu
cyl3
cyl6
tub6

Fold 2:
con6
cube
cucon 1 5 4
cucy
cycu
cyl3
cyl6
tub6

4.5.3 The Mel Task

The Mel task used all 12 images of each of six non-rigid objects from the image database used to train and test Mel's SEEMORE system [71]. All images are shown in Figure 4.12. To speed up processing, the images were subsampled to 320 × 240 from their original size of 640 × 480 pixels. The views represent random samples from the objects' configuration space, taken at two different scales that differed by a factor of two. Images were randomly assigned to training and test sets, each containing six images. This data set was intended to test the limits of the learning system. As the feature space derives most of its expressive power from geometric compound features, it is best suited for rigid objects. The Mel task, however, consists of non-rigid objects in widely varying configurations, limiting useful geometric compounds to very small scales.

Table 4.2 shows that the system did poorly on the Mel database, as reflected by all performance categories except for confused recognitions.


Figure 4.12. All images used in the Mel task: abalone, bike-chain, grapes, necktie, phone-cord, sock.


Ignorant recognitions even exceeded correct recognitions. According to Table 4.7, the four weakest-performing classes were phone-cord, grapes, abalone, and sock. In particular, no phone cords were recognized correctly. One problem was that for the smaller-scale views, much of the crucial detail was represented at scales that approached the pixel size. Each of the other three classes reveals a limitation of the system as presented here. A sock is primarily characterized by the gray-level statistics of its structureless surface. However, the features described in Chapter 3 are designed to characterize gray-level structure, not statistics.

Table 4.7. Results on the Mel task. The table shows, for each class, how many instances were recognized as which class for correct and wrong recognitions, and the number of ambiguous/confused/ignorant recognition outcomes. In each row, these entries sum to the number of test images of each class (here, 6). The rightmost column shows the number of feature nodes that are part of the Bayesian network representing each class. The bottom row of the bottom panel gives the average proportion of the test set that falls into each recognition category.

Confusion Matrix (correct/wrong), amb., cnf., ign., # Feats. (BN)
Class: abalone chain grapes necktie cord sock

Fold 1:
abalone 2 2 2 10
bike-chain 4 2 8
grapes 2 1 3 7
necktie 1 3 2 7
phone-cord 1 1 4 9
sock 2 2 1 1 9
Sum 3 5 2 5 1 2 4 0 14 50

Fold 2:
abalone 2 1 3 6
bike-chain 4 2 6
grapes 1 1 1 3 7
necktie 3 3 8
phone-cord 1 5 12
sock 2 1 3 8
Sum 3 5 1 3 0 3 5 0 16 47

Average of both folds (proportions):
abalone .33 .25 .42 8
bike-chain .67 .33 7
grapes .08 .25 .08 .08 .50 7
necktie .08 .50 .25 .17 8
phone-cord .08 .08 .08 .75 11
sock .17 .33 .17 .33 9
Avg. .08 .14 .04 .11 .01 .07 .13 .00 .42 8

The abalone board contains strong gray-level structure, but this structure varies widely between images (Figure 4.12). The most meaningful texel features occur at about the scale of the abalone balls. These do not have a meaningful orientation, as they will mostly be measured at scales that are affected by changes in the board configuration. This prevents the construction of powerful geometric compound features, as these rely on predictable relative orientations between features.

For practical reasons, the feature sampling procedure relies on salient points and their intrinsic scales as extracted by scale-space methods.


The grapes class exposes the critical impact of this procedure on recognition performance. Due to the low contrast and the high degree of self-occlusion and clutter present in the grapes images, the extracted texels and their scales rarely corresponded to individual grapes. Moreover, similarly to the abalone board, the orientation of individual texels is rarely meaningful for larger-scale texels in the grapes images.

The confusion matrix of ambiguous recognitions (Table 4.8) does not reveal any conclusive insights. Figure 4.13 shows examples of features learned for the Mel task. Curiously, two background features were found characteristic of the abalone class. From the viewpoint of task-driven learning, this is not considered bad. Any feature that is helpful on a task should be considered. If the background is predictive of a target class, then background features may contribute to performance.

Table 4.8. Results on the Mel task: Confusion matrices of ambiguous recognitions. The italic numbers on the diagonal indicate how many times the class given on the left was recognized ambiguously (cf. Table 4.7), and the upright numbers in each row specify the false-positive classes.

Class abalone chain grapes necktie cord sock

Fold 1:
abalone 2 2 1
bike-chain
grapes
necktie
phone-cord 1 1
sock 1 1

Fold 2:
abalone 1 1
bike-chain
grapes
necktie 1 1 3 1
phone-cord
sock 1 1 1

4.6 Discussion

This chapter presented an incremental and task-driven method for learning distinctive features from the feature space introduced in the previous chapter. Although the experiments described above used conventional data sets common in object recognition research, the method developed here is far more general than most existing recognition systems. In contrast to most methods based on eigen-subspace decomposition, this approach does not rely on global appearance or background segmentation (but see Colin de Verdiere and Crowley [26] for an example of locally applied eigen-features, and Black and Jepson [14] and Leonardis and Bischof [61] for robust versions of global eigen-features). In contrast to most approaches based on local features, this system does not require a hand-built feature set, but learns distinctive features from a feature space. No prior knowledge about the total number of target classes is used, and no assumptions are made regarding the number of target classes present in any given image. The learner learns distinctions as required, adding classes as they arise.

The feature learning algorithm relies on two key concepts for incremental learning of conditional probabilities: the example images, and the instance list. The concept of example images is analogous to the human capacity to keep a small number of example views in short-term memory.


Figure 4.13. Examples of features learned for the Mel task. For each class, all features characterizing this class in Fold 1 are shown that are considered to be present in the image by the classifier.


A human acquires these views simply by looking at an object for an extended period of time while changing the viewpoint. Importantly, these acquired views are not distinct, but form a coherent motion sequence, which allows the observer to track features across viewpoints. This can greatly facilitate the discovery of stable and distinctive features [129]. In the system discussed in the present chapter, no such assumptions are made about the example images. Future work should investigate how to take advantage of passively provided or actively acquired motion sequences.

Notably, example images can be avoided altogether. In this case, evaluation of newly sampled features is deferred until a sufficient number of training views of the object classes in question has been acquired over time. However, this deferred evaluation precludes the goal-directed, simple-to-complex feature search implemented by Algorithms 4.8 and 4.10. Instead, a number of features must be generated and kept around until they can be evaluated. To illustrate how this can be implemented, the second application discussed in Chapter 6 operates without example images.

The second key component, the instance list, is not easily dispensable because it is used to estimate class-conditional feature probabilities. The dynamic recomputation of the presence thresholds (cf. Definition 4.2 (Distinctive Power of a Feature) on page 37, and Algorithm 4.8 (Feature Learning), Step 7 on page 41) relies on the ability to count the individual instances. However, this dynamic recomputation is not an essential part of the overall algorithm. If one is willing to commit to an early estimate of the thresholds, then the conditional probability distributions can be maintained in the form of parametric approximations that can be updated incrementally as new training examples become available. In this case, the instance list can be avoided completely. If an instance list is used, it may be desirable to truncate it at a given length. This has two important consequences: First, the memory requirements of the feature learning mechanism are essentially constant for a fixed number of classes. Second, the learner would continually adapt to a nonstationary environment, learning new features as required, and forgetting old features as they become obsolete.
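To illustrate the trade-off just described, the following is a minimal sketch of a truncated instance list together with a counting estimate of a class-conditional presence probability. The class and method names are hypothetical, and the fixed length of 200 is an arbitrary assumption; this is not the data structure used by the system itself.

    from collections import deque

    class InstanceList:
        """Truncated instance list for one feature.

        Keeps at most `maxlen` (class label, feature value) pairs, so memory
        stays roughly constant and old instances are eventually forgotten.
        """

        def __init__(self, maxlen=200):
            self.instances = deque(maxlen=maxlen)

        def add(self, class_label, feature_value):
            self.instances.append((class_label, feature_value))

        def p_present(self, class_label, threshold):
            """Estimate P(feature value >= threshold | class) by counting."""
            matching = [v for c, v in self.instances if c == class_label]
            if not matching:
                return 0.0
            return sum(v >= threshold for v in matching) / len(matching)

Because the deque discards its oldest entries automatically, the two consequences noted above (constant memory, gradual forgetting) follow directly from the chosen length.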

The recognition system is designed to be very general. Three important design choices that were motivated by this goal are the use of local features, which are more robust to the presence of clutter and occlusions than global features; the use of scale-space theory to infer the appropriate local scales for image analysis; and the use of separate classifiers for each class, each of which decides whether an object of its class is present in the scene, independently of the other classes. These characteristics result in a very general system that makes few assumptions about the viewed scenes. On the other hand, if such prior information is available in the context of a given task, any general system is clearly put at a disadvantage with respect to more specialized systems that explicitly exploit this information. For example, if the scale of the target objects is known, scale-space theory (or any other method for multiscale processing) introduces unnecessary ambiguity. If it is known that there is exactly one target object present in each scene, it would be preferable to have the classes compete in the sense that evidence found in favor of one class reduces the belief in the presence of other classes. This is implemented by the classical model of a Bayesian classifier, where a single class variable has one state for each class. If objects are rigid and presented in isolation on controlled backgrounds, as was the case in the COIL and Plym tasks, then superior results are typically achieved using global features.

To avoid unnecessary complication, all experiments were performed using controlled imagery with one well-contrasted object present in each scene. Moreover, with the exception of the Mel task, objects were rigid and subject to only minor changes in scale. As is to be expected, somewhat superior results were achieved by earlier versions of the feature learning system that took advantage of known task constraints. Instead of the multiple Bayesian network classifiers used in this chapter, a single Bayesian network classifier [85] and a decision tree [84, 86] assigned exactly one class label to each image. Other systems processed imagery at a single scale [84] or at multiple scales, spaced by a factor of two [86, 88, 87, 89].


The principal reason for the relatively low recognition accuracies achieved in this chapter is that the learning algorithm accepts any feature that solves a given recognition problem (Algorithm 4.8 (Feature Learning), Step 8b, page 42). However, such features may have very little distinctive power. Therefore, the generalization properties of the resulting classifiers are subject to the randomness of the feature sampling procedure. The following Chapter 5 will present a method for learning improved features, which leads to improved recognition results while preserving the openness of the system. Therefore, a more detailed discussion of the performance characteristics of the feature learning system is deferred to Chapter 5.

Another important limitation of the system as described in this chapter is related to the computation of the salient points. Due to limited computational resources, computation of feature values is restricted to salient points. The fewer salient points are extracted, the more the learning success will depend on the variety and stability of the salient points, which is the flip side of the computational savings. Moreover, in the experiments reported here, the scale space was relatively coarsely sampled at half-octave spacing. If more intermediate scales are used, more local scale-space maxima and thus more salient points will generally be found, and intrinsic scales will be measured more accurately, again at increased computational expense. Section 5.4.5 will examine the computational cost in more detail.

The goal of this work is not to achieve high recognition rates by exploiting known prior constraints, but to begin with a highly uncommitted system that learns useful visual skills with experience. To fully evaluate the general system, large-scale experiments would be required to test it under various degrees of clutter and occlusion at various scales, with various numbers of target objects present in a scene. There is no known suitable image data set currently available. Hence, a full-scale performance evaluation would constitute a substantial endeavor, which is beyond the scope of this dissertation.


CHAPTER 5

EXPERT LEARNING

Chapter 4 presented an algorithm for learning distinctive features from the feature space described in Chapter 3. This brief chapter introduces a simple extension to the learning algorithm that produces markedly improved features. A resulting significant improvement in recognition performance is reflected by all evaluation criteria.

5.1 Related Work

Most current machine vision systems are constructed or trained off-line or in a dedicated training phase. In contrast, human visual performance improves with practice. For example, recognition accuracy and speed both increase with growing experience. It is not clear what the biological mechanisms are for this type of visual learning. Nevertheless, there is substantial evidence that at least part of the performance improvement can be attributed to better features. Tanaka and Taylor [120] found that bird experts were as fast to recognize objects at a subordinate level ("robin") as they were at the basic level ("bird"). In contrast, non-experts are consistently faster at basic-level discriminations. This effect is so pronounced that it has been proposed as a definition of expertise in visual recognition [120, 38]. It seems that experts, colloquially speaking, know what to look for, suggesting they have developed highly discriminative features.

Gauthier and Tarr [38] investigated the phenomenon of recognition expertise using artificial, unfamiliar stimuli, the so-called Greebles. Greebles are organized into genders and families that are characterized by the shapes of an individual's body parts. Subjects were trained to recognize Greebles at three levels (gender, family, and individual) until they reached the expert criterion, i.e., they were as fast to recognize Greebles at the individual level as at the family or gender level. Training required between 2700 and 5400 trials, spread across a total of 7 to 10 one-hour sessions. Experts thus trained were slower to recognize Greebles with slightly transformed body parts, as compared to the Greebles they saw during training. All expert subjects reported noticing these transformations. In contrast, for non-experts there was no such speed or accuracy difference, and none of them noticed the transformations. Again, this result strongly suggests that experts had developed highly specific features. In fact, Gauthier and Tarr write:

"In our experiment, expertise training may have led to the assembly of complex feature-detectors, extracted from the statistical properties of the Greeble set that proved useful for performing the training discriminations." [38]

The feature space introduced in Chapter 3 fits this description. The present chapter provides a computational model that is consistent with the empirical results and proposed mechanisms reported by these authors. Regarding a possible neural substrate, Gauthier and Tarr speculate that the complex feature detector neurons found in the inferotemporal cortex are not fixed but can be modified in response to experience, citing Logothetis and Pauls [65] who found that these neurons can become highly selective for previously novel stimuli.

I am not aware of any work in machine vision that is related to the development of visual expertise and the formation of improved features. Feature extraction methods based on discriminative eigen-subspaces generate highly distinctive features, but these methods are usually applied off-line, and the extracted features are typically non-local [31, 37, 117].


5.2 Objective

The basic feature learning procedure introduced in the preceding chapter is driven by a single criterion, that is, to learn the training images. In the absence of any misrecognitions, no learning occurs. In analogy to human visual learning, the resulting visual system remains a life-long non-expert. The objective in this chapter is to extend the basic learning procedure such that learning can continue even in the absence of misrecognitions. As motivated by the phenomenon of human expertise, learning then focuses on improving the features.

How can the learner improve its features? What is the criterion that measures the quality of a feature? In the previous chapter, no attention was paid to the quality of a feature. The only requirement of a candidate feature for inclusion in a classifier was that it solved the recognition problem that triggered the present feature search (Algorithm 4.8 (Feature Learning), Step 8b, page 42). The algorithm was biased toward finding few and powerful features (by adding a maximum of one feature for each failed recognition), and toward structurally simple features (by virtue of the simple-to-complex ordering of steps in the feature-generation algorithm 4.10 on page 43). However, there was no explicit pressure on the number or quality of features. The following section defines an obvious measure for the quality of a feature, and a simple mechanism for learning improved features. A desired side effect is the reduction of the number of features used during recognition, which resembles and might explain the fast recognition performed by human experts.

5.3 Expert Learning Algorithm

The goal of the learning system is to learn distinctive features (Definition 4.1, page 31). To discretize the continuous feature values, a cutpoint is chosen that maximizes the distinctive power of a feature (Definition 4.2, page 37). Hence, a natural way to define the quality of a feature for the purposes of expert learning is to equate it with its distinctive power between the characterized class and one or more discriminated classes.
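Since this distinctive power is measured by the Kolmogorov-Smirnov distance (KSD) between a feature's response distributions on the characterized and discriminated classes, it can be estimated directly from two samples of response values; the maximizing cutpoint then serves as the discretization threshold. The following is a minimal sketch of that computation, assuming the responses are simply collected in two Python lists; the function name and the sample values are illustrative only.

    def ksd_and_cutpoint(values_char, values_disc):
        """Kolmogorov-Smirnov distance between two empirical distributions.

        Returns the maximal difference of the two empirical cumulative
        distribution functions and the cutpoint at which it is attained.
        """
        n_c, n_d = len(values_char), len(values_disc)
        best_ksd, best_cut = 0.0, None
        for cut in sorted(set(values_char) | set(values_disc)):
            # Fraction of samples at or below the candidate cutpoint.
            cdf_c = sum(v <= cut for v in values_char) / n_c
            cdf_d = sum(v <= cut for v in values_disc) / n_d
            if abs(cdf_c - cdf_d) > best_ksd:
                best_ksd, best_cut = abs(cdf_c - cdf_d), cut
        return best_ksd, best_cut

    # A feature that responds strongly on the characterized class:
    print(ksd_and_cutpoint([0.9, 0.95, 0.8], [0.2, 0.4, 0.85]))

A KSD of 1.0 corresponds to a perfectly separating cutpoint, which is the sense in which a feature with KSD 1.0 is called a perfect predictor elsewhere in this chapter.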

The basic idea of the expert learning algorithm is to reduce uncertainty in the recognition result by learning features that discriminate well between classes that are difficult to distinguish by the recognition system. This difficulty is measured in terms of the KSD achieved by any feature available to discriminate between these classes. If for a given pair of classes this KSD falls below a global threshold tKSD, then a new feature is sought that discriminates between these classes with a KSD of at least tKSD. A high-level view of the expert learning idea is given by the following algorithm.

Algorithm 5.1 (Expert Learning) On receipt of an image, the following steps are performed:

1. The image is recognized using Algorithm 4.4 (page 38). If the result is correct (in the sense of Definition 4.5), then continue at Step 2. Otherwise, the feature learning algorithm (4.8, page 41) is run, with one slight but important modification: At Step 3, Algorithm 4.10 is called repeatedly until it returns a candidate feature with a KSD greater than tKSD. The completion of Algorithm 4.8 also concludes this Algorithm 5.1.

2. If there is no class with nonzero residual entropy after recognition, the algorithm stops. No expert learning is performed in this case; the learner is considered a fully trained expert on this image.

3. Identify a pair (c, d) of classes for which an improved distinctive feature is to be learned. The details are given in Algorithm 5.2 below.

4. A new feature is learned that is characteristic of class c and discriminative with respect to class d, using Algorithm 4.8, beginning at Step 2, subject to the same minimum-KSD requirement as described at Step 1 above.
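To make this control flow concrete, the following sketch renders one pass of Algorithm 5.1 on a single image. The helpers recognize, residual_entropy, learn_feature, and choose_class_pair are placeholders standing in for Algorithm 4.4, the class-node entropy computation, Algorithm 4.8, and Algorithm 5.2, respectively; their names and signatures are assumptions, not the dissertation's own code.

    def expert_learning_step(image, true_labels, classifiers, t_ksd,
                             recognize, residual_entropy,
                             learn_feature, choose_class_pair):
        """One pass of Algorithm 5.1 on a single training image (sketch)."""
        correct, result = recognize(image, classifiers)         # Step 1 (Algorithm 4.4)
        if not correct:
            # Misrecognition: basic feature learning (Algorithm 4.8), but keep
            # sampling candidate features until one exceeds the KSD threshold.
            learn_feature(image, true_labels, classifiers, min_ksd=t_ksd)
            return
        if all(residual_entropy(c) == 0 for c in classifiers):
            return                                              # Step 2: nothing left to improve
        c, d = choose_class_pair(result, classifiers)           # Step 3 (Algorithm 5.2)
        learn_feature(image, true_labels, classifiers,          # Step 4: expert feature for (c, d)
                      characterized=c, discriminated=d, min_ksd=t_ksd)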


A crucial part of this straightforward algorithm is Step 3. There are many ways to define a meaningful pair of classes that warrant learning an improved distinction. The method described in Algorithm 5.2 below identifies a pair of classes that constitutes a "near miss" in the recognition. Hence, one of the classes is a true class (i.e., a target class present in the current image), and the other is a false class. The specific pair of classes is chosen that is least well discriminated by any feature consulted during the recognition procedure (Step 1 in Algorithm 5.1). Therefore, this is the pair of classes that would benefit the most from a better feature. Crucially, the characterized class of the new feature must have had nonzero residual entropy after recognition. This guarantees that the new feature is going to be queried the next time the same training image is seen. Thus, the new feature actually improves the situation, and endless loops are avoided.

Algorithm 5.2 (Choosing a Pair of Classes for an Expert Feature) This algorithm is run at Step 3 of Algorithm 5.1, and determines a characteristic class c and a discriminated class d that are least well discriminated by existing features:

1. For all classes c_i with nonzero residual entropy:

   (a) If c_i is a true class, then define D as the set of all false classes; otherwise, let D be the set of all true classes.

   (b) Determine the class d_i^min in D that is least well discriminated by any feature characteristic of class c_i, in terms of the KSD_{c_i, d_i^min}. This is accomplished by computing, for all classes d_j in D: Over all features f that are characteristic of class c_i, and that have been consulted during the recognition procedure, compute the maximum KSD_{c_i, d_j}(f). Each KSD_{c_i, d_j}(f) is computed by counting applicable instances in the instance list. Then, the least-well discriminated class d_i^min is given by the feature with the weakest value of the KSD_{c_i, d_j}(f):

       KSD_{c_i, d_i^min} = min_{d_j in D} max_f KSD_{c_i, d_j}(f)

2. The characterized and discriminated classes are given by the pair with the weakest associated KSD. Formally, c = c_i and d = d_i^min, where

       i = argmin_j KSD_{c_j, d_j^min}.                                        (5.1)
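The min-max selection above translates almost literally into code. In the sketch below, true_classes and false_classes are assumed to be sets of class identifiers, consulted_features maps each class to its characteristic features that were consulted during recognition, residual_entropy reports the entropy left in a class node, and ksd(c, d, f) computes KSD_{c,d}(f) from the instance list; all of these names are assumptions made only for illustration.

    def choose_class_pair(true_classes, false_classes, consulted_features,
                          residual_entropy, ksd):
        """Pick the (characterized, discriminated) class pair with the weakest
        best-feature KSD, as in Algorithm 5.2 (illustrative sketch)."""
        best_pair, weakest_ksd = None, float("inf")
        for c in true_classes | false_classes:
            if residual_entropy(c) == 0:
                continue                                  # Step 1: only uncertain classes
            others = false_classes if c in true_classes else true_classes
            for d in others:
                # Best discrimination any consulted feature of c achieves against d.
                best_feature_ksd = max(
                    (ksd(c, d, f) for f in consulted_features[c]), default=0.0)
                # Track the pair that is least well discriminated overall (Eq. 5.1).
                if best_feature_ksd < weakest_ksd:
                    weakest_ksd, best_pair = best_feature_ksd, (c, d)
        return best_pair

Collapsing the two minima of Equation 5.1 into a single search over class pairs yields the same result, since the overall minimum over pairs equals the minimum over classes of their per-class minima.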

The feature learning procedure defined by Algorithms 5.1 and 5.2 attempts to learn improved features as long as there are classes with nonzero residual entropy, and as long as there are classes that are discriminated at a KSD less than tKSD by any currently available feature. The threshold tKSD can be increased over time. There are other possible ways to learn improved features. For example, one can attempt to learn highly distinctive features between all pairs of classes exhaustively. However, this is impractical for even moderate numbers of classes, and it would likely generate a large number of features that are never going to be used in practice. A nice property of the above algorithms is that expert feature learning, like the basic feature learning, is driven by actual experience. Successfully learned features are guaranteed to be used should the same situation arise again. Thus, each new feature has a well-defined utility to the recognition system. When there is no measurable need for better features, the expert learning algorithm stops.

5.4 Experiments

Experiments were performed using the same simulated, incremental paradigm as in Section 4.5, and using the same data sets. Here, training occurred in stages.


The purpose of these stages was to progressively increase the threshold tKSD, below which a feature was considered weak and triggered expert learning.

Algorithm 5.3 (Training Stages)

1. Initially, set the stage index to s = 0, and let tKSD(s) = 0.

2. Train the system by cycling through the training set, as described in Section 4.5. Cycle until either the training set has been learned completely (i.e., all recognitions are correct), or give up after 20 iterations. During feature learning, apply the tKSD(s) threshold as described in the expert learning algorithm 5.1, Step 1.

   At the end of each cycle, discard all features that have not been consulted at all during this past cycle.

   During recognition, monitor the minimum KSD encountered by Algorithm 5.2 during the final iteration through the training set. This value is given by

       KSD_min(s) = min_recognitions min_i KSD_{c_i, d_i^min}

   where the first minimum is taken over all executions of Algorithm 5.1 during the final iteration through the training set (cf. Equation 5.1).

3. For the next stage s + 1, set

       tKSD(s + 1) = 1 - (1 - min{KSD_min(s), tKSD(s)}) / 2.

   This rule attempts to ramp up the target KSD tKSD exponentially, asymptoting at unity, but only if the target KSD was reached during the stage just completed, i.e., if KSD_min(s) >= tKSD(s). (Starting from zero and meeting the goal at every stage produces the sequence 0.5, 0.75, 0.88, 0.94, 0.97, ..., which is the progression visible in the top panels of Figure 5.1.) If tKSD(s + 1) > 0.98, then it was set to tKSD(s + 1) = 1.

4. Increment s, and continue at Step 2.
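For illustration, the threshold schedule of Step 3 can be written compactly as follows. Here train_one_stage is a hypothetical stand-in for Step 2 that is assumed to return the monitored KSD_min(s), and the stage limit is an arbitrary choice for the sketch; neither name comes from the system itself.

    def run_training_stages(train_one_stage, max_stages=20):
        """Ramp the target KSD toward unity, as in Step 3 of Algorithm 5.3 (sketch)."""
        t_ksd = 0.0
        for stage in range(max_stages):
            ksd_min = train_one_stage(stage, t_ksd)   # Step 2: cycle through the training set
            # Halve the remaining gap to unity, bounded by the weakest KSD
            # observed during the stage; a weak stage can lower the threshold.
            t_ksd = 1.0 - (1.0 - min(ksd_min, t_ksd)) / 2.0
            if t_ksd > 0.98:
                t_ksd = 1.0
        return t_ksd

    # With the goal met at every stage, the schedule is 0.5, 0.75, 0.875, ...
    print(run_training_stages(lambda s, t: 1.0, max_stages=6))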

According to this algorithm, the first stage (s = 0) is precisely the basic feature learning procedure used in Section 4.5. In fact, the results reported there are those obtained after the first stage of a comprehensive training procedure, the results of which are presented below.

5.4.1 The COIL Task

The first block of results listed in Table 5.1 corresponds to the COIL results in Table 4.2 on page 47, but was attained after expert learning. (The '-inc' tasks in Table 5.1 will be discussed in Section 5.4.4.) Most of the reported parameters are visualized for all training stages in Figure 5.1. In these graphs, training stage 0 corresponds to the data in Table 4.2, and the last training stage corresponds to Table 5.1.

The recognition results in Table 5.1, corresponding to the top panels in Figure 5.1, indicate a dramatic performance increase. With cumulative expert training, the proportion of correct recognitions rises by about 50% compared to the initial training stage, ignorant recognitions are reduced by about 70%, and wrong recognitions are almost eliminated. The number of ambiguous recognitions also increases somewhat. This is not a desirable effect, but ambiguity is the best recognition outcome short of correctness. The accuracy reaches more than 80%, which begins to be comparable to object recognition results reported for more specialized machine vision systems.

Another dramatic development can be observed in the evolving feature set. The middle panels in Figure 5.1 reveal a typical behavior. Throughout expert learning, the number of features queried during a recognition procedure decreases sharply.


Table 5.1. Summary of empirical results after expert learning on all tasks. The first four entries give the name of the task, the number of classes, the fold index, and the accuracy according to Definition 4.6. The five entries under "Recognition Results" in each row sum to one, barring roundoff errors. Zero entries are left blank for clarity. The columns under the heading "# Features" give the total number of feature nodes in all Bayesian networks (BN), the total number of different features (dif.), the average number of features queried (qu'd) during a single recognition procedure (Algorithm 4.4), and the total cumulative number of features sampled (smp'd), respectively. The rightmost two columns list the number of cycles through the training set, and the total number of training image presentations.

Task Cls. Fld. acc. | Recognition Results: cor. wrg. amb. cnf. ign. | # Features: BN dif. qu'd smp'd | Tr. Set: cyc. imgs.

COIL 6 1 .86 .78 .02 .13 .07 10 8 6.4 19448 60 3600
COIL 6 2 .78 .70 .05 .12 .13 23 17 11.1 19203 68 4080
COIL 6 av. .82 .74 .03 .13 .10 17 13 8.8 19326 64 3840
COIL-inc 6 1 .87 .80 .12 .08 7 6 5.5 13867 6850
COIL-inc 6 2 .88 .82 .02 .10 .07 11 11 6.5 17491 11000
COIL-inc 6 av. .87 .81 .01 .11 .08 9 9 6.0 15679 8925
COIL-inc 10 1 .82 .77 .01 .08 .14 34 21 12.4 69450 43520
COIL-inc 10 2 .80 .73 .10 .01 .16 28 16 13.0 87996 57060
COIL-inc 10 av. .81 .75 .01 .09 .01 .15 31 19 12.7 78723 50290
Plym 8 1 .70 .58 .06 .22 .14 23 17 8.9 33711 155 8680
Plym 8 2 .80 .75 .04 .09 .13 11 10 8.5 25513 122 7808
Plym 8 av. .75 .67 .05 .15 .13 17 14 8.7 29612 139 8244
Plym-inc 8 1 .77 .73 .02 .08 .17 21 11 8.2 71183 36274
Plym-inc 8 2 .83 .79 .02 .05 .14 16 10 7.5 28078 15640
Plym-inc 8 av. .80 .76 .02 .07 .16 19 11 7.9 49631 25957
Mel 6 1 .61 .47 .03 .19 .03 .28 43 35 15.0 3766 38 1368
Mel 6 2 .55 .42 .14 .22 .03 .19 26 25 8.8 13461 64 2304
Mel 6 av. .58 .44 .08 .21 .03 .24 35 30 11.9 8614 51 1836
Mel-inc 6 1 .64 .44 .06 .39 .11 25 22 10.4 8313 3774
Mel-inc 6 2 .56 .42 .03 .17 .39 21 19 7.5 15181 3630
Mel-inc 6 av. .60 .43 .04 .28 .25 23 21 9.0 11747 3702


[Figure 5.1: six panels, arranged as COIL Fold 1 (left) and COIL Fold 2 (right). Top row: proportion of the test set in each recognition category (correct, wrong, ambiguous, confused, ignorant) versus training stage; the tKSD values per stage are 0, .5, .5, .5, .75 for Fold 1 and 0, .5, .5, .5, .75, .88, .94, .97 for Fold 2. Middle row: number of features (total feature nodes, different features, and maximum/minimum/average queried per recognition). Bottom row: cumulative number of features sampled and images presented (×1000).]

Figure 5.1. Expert Learning results on the COIL task. The numbers shown inside the top graphs give the value of tKSD at the respective training stage. Where this number appears in bold face, the training set was learned with perfect accuracy, and the tKSD goal was achieved. Otherwise, either the training set was not learned perfectly, or the tKSD goal was not attained. The error bars in the middle plots have a width of one standard deviation.


During Stage 1, the number of different features present in all Bayesian networks increases, eliminating most of the re-used features. This constitutes a qualitative change in the feature set, during which general features are replaced by more specialized, distinctive features. The total number of features is hardly affected (Fold 1) or is even reduced (Fold 2). During further expert learning stages, a relatively small number of highly distinctive features are learned, rendering a large proportion of the existing features obsolete. Thus, the number of features stored by the system is reduced from 53 (average over both folds) to 17. Due to the high distinctive power of the remaining features, few are queried during a recognition procedure. Fold 1 constitutes a quite drastic example: On average, little more than six features are queried to recognize an image (Table 5.1). This is just one feature per class!

Figure 5.2 shows some of the expert features. Compared to the novice features shown in Figure 4.7 on page 50, these are much more meaningful intuitively. One car is characterized by the texture on the roof, the other by a wheel feature. Both the duck and the cat are characterized by an eye/forehead feature. In fact, this feature is absolutely identical; it is one of the two surviving shared features in Fold 1. Curiously, this is the only feature characteristic of a cat (Table 5.2). How can it alone suffice to characterize a cat if it is also a distinctive duck feature? The answer is that the threshold for this feature to be considered present is much higher in the cat's Bayesian network than in the duck's (cf. Definition 4.2, Distinctive Power of a Feature, page 37). Evidently, on all duck images contained in the training set, the response value of this feature remained below the corresponding presence threshold of the cat's network, while on all cat images, it exceeded this threshold. Otherwise, this feature alone would not have been sufficient to characterize a cat.

The confusion matrices of ambiguous recognitions are shown in Table 5.3. Some of these ambiguities are not surprising. For instance, in Fold 1 a cat (obj4) was also labeled as a duck (obj1). In both folds, the two cars (obj3 and obj6) tended to be confused. Car/car and duck/cat confusions also accounted for two of the four wrong recognitions shown in Table 5.2.

Finding highly distinctive features is not easy. The training set was first learned after a mere seven iterations (Table 5.1), during which about 700 candidate features were generated. After complete expert training, almost 20,000 features had been tried in more than 60 iterations. As shown in the bottom panels in Figure 5.1, the bulk of this effort is spent at the first stage of expert learning. The reason is that initially, most of the recognitions end with a high degree of uncertainty, and most pairwise uncertainties discovered by Algorithm 5.2 have KSD_{c_i, d_i^min} = 0.

The superior power of expert features over novice features not only results in fewer features queried per recognition, but is also manifested in increased stability across viewpoints and increased specificity. Figure 5.3 shows all features characteristic of obj1. Remarkably, all three features are present in all views except for the very last, where feature 665 is missing. All features responded reliably to corresponding object parts across most of the viewpoint range. Figure 5.4 demonstrates the superior specificity of these expert features compared to the novice features (see Figure 4.9 on page 53 for comparison). The least specific expert feature 557 responds more specifically to distinct parts of the duck than the most specific novice feature 5. Even more significantly, its response on the non-duck image is weaker in most places. Feature 566 acts as an eye detector; it responds to a small dark spot with a neighboring larger light spot. As mentioned earlier, this feature is also characteristic of the cat (obj4). Feature 665 detects the wing in a specific spatial relation to an outer edge of the duck object. It responds selectively to a few locations on the duck image, but almost nowhere on the cat image.

It is quite remarkable that the learning system was able to construct such powerful features: They are specific to an object, and at the same time are general in that they respond well over a wide range of viewpoints. Interestingly, they correspond to intuitively meaningful object parts. It would not have been easy to construct such specific and yet viewpoint-invariant features by hand.


Figure 5.2. Examples of expert features learned for the COIL task, Fold 1.


Table 5.2. Results on the COIL task after Expert Learning. The table shows, for each class, how many instances were recognized as which class for correct and wrong recognitions, and the number of ambiguous/confused/ignorant recognition outcomes. In each row, these entries sum to the number of test images of each class (here, 10). The rightmost column shows the number of feature nodes that are part of the Bayesian network representing each class. The bottom row of the bottom panel gives the average proportion of the test set that falls into each recognition category.

Confusion Matrix (correct/wrong), amb., cnf., ign., # Feats. (BN)
Class: obj1 obj2 obj3 obj4 obj5 obj6

Fold 1:
obj1 10 3
obj2 8 2 1
obj3 8 1 1 3
obj4 8 2 1
obj5 6 2 2 1
obj6 7 1 2 1
Sum 10 8 8 8 6 8 8 0 4 10

Fold 2:
obj1 9 1 8
obj2 8 2 2
obj3 5 2 3 4
obj4 1 4 1 4 1
obj5 2 8 1
obj6 8 2 7
Sum 10 8 5 6 8 8 7 0 8 23

Average of both folds (proportions):
obj1 .95 .05 6
obj2 .80 .20 2
obj3 .65 .05 .15 .15 4
obj4 .05 .60 .15 .20 1
obj5 .10 .70 .10 .10 1
obj6 .75 .15 .10 4
Avg. .17 .13 .11 .12 .12 .13 .13 .00 .10 3


Figure 5.3. Expert features characteristic of obj1 located in all images of this class where they are present. See Figure 4.8 on page 51 for comparison.


Figure 5.4. Spatial distribution of expert feature responses of all features characteristic of obj1 learned in Fold 1 (left columns), for comparison also shown for obj4 (right columns). In the center columns, the gray level encodes the response strength of the feature shown in the outer panels. Black represents the maximal response of f = 1.0; in white areas no response was computed. Features are labeled by their ID numbers. See Figure 4.9 on page 53 for comparison.


Table 5.3. Results on the COIL task after Expert Learning: Confusion matrices of ambiguous recognitions. The italic numbers on the diagonal indicate how many times the class given on the left was recognized ambiguously (cf. Table 5.2), and the upright numbers in each row specify the false-positive classes.

Class obj1 obj2 obj3 obj4 obj5 obj6

Fold 1:
obj1
obj2 2 2
obj3 1 1
obj4 1 1 2
obj5 2 2
obj6 1 1

Fold 2:
obj1
obj2 1 2 1
obj3 1 1 2 1
obj4 1 1
obj5
obj6 1 1 2

The feature learning algorithm employs a random feature generation procedure (Algorithm 4.10, page 43). How sensitive is the performance of the trained system to this randomness? To give an intuition, the system was trained on Fold 1 a total of ten times, seeding the random-number generator with unique values. Figure 5.5 shows the average performance of the trained novices and experts. According to a paired-sample one-tailed t test, the superior accuracy of experts vs. novices is highly significant (p < 0.0001). There is considerable variation in the accuracies, while the number of features queried stabilizes reliably during expert learning. The reason for this is that learning ends as soon as all recognitions leave all classifiers with zero residual entropy. By this time, all objects are characterized by a small number of features that are highly distinctive on the training set; however, no definitive statement can be made about their generalization properties on unseen test images. In realistic, open learning scenarios, this is not going to be a problem: If an image is misrecognized or is recognized with less than perfect certainty, it immediately becomes a training image, and learning continues. Thus, in practice the final performance is expected to depend much less on the randomness intrinsic to the learner, but is determined by the expressiveness of the feature space.
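A test of this kind can be reproduced with standard tools. The sketch below assumes SciPy (version 1.6 or later, which provides the alternative argument of ttest_rel) and uses placeholder accuracy values for the ten trials rather than the measured ones, which are not tabulated here.

    from scipy import stats

    # Hypothetical per-trial accuracies from 10 independent learning runs
    # (placeholders, not the dissertation's measured values).
    novice_acc = [0.55, 0.60, 0.52, 0.58, 0.61, 0.57, 0.54, 0.59, 0.56, 0.60]
    expert_acc = [0.82, 0.86, 0.80, 0.85, 0.88, 0.83, 0.81, 0.87, 0.84, 0.86]

    # Paired, one-tailed test: is expert accuracy greater than novice accuracy?
    t_stat, p_value = stats.ttest_rel(expert_acc, novice_acc, alternative="greater")
    print(t_stat, p_value)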

5.4.2 The Plym Task

All of the general comments made in the previous section about the COIL task apply also to the Plym task (Figure 5.6). As is also seen in Table 5.1 in comparison to Table 4.2, all performance parameters improved substantially as a result of expert learning, with the exception of an increased number of ambiguous recognitions. In Fold 1, the progress of expert learning was markedly non-monotonic. For many stages, it failed to achieve the goal of tKSD = 0.5. At the end, however, it learned the training set with a remarkable tKSD = 0.94.

Similarly to the basic learning stage, cyl6 and cucon were problematic classes (Tables 5.4 and 5.5). Examples of the features learned are shown in Figure 5.7. Similarly to the COIL task, these few but highly distinctive features make much more intuitive sense than the features learned during the first stage (see Figure 4.11 on page 54).


[Figure 5.5: two panels comparing Novice and Expert performance, showing accuracy (left) and the average number of features queried per recognition (right).]

Figure 5.5. Variation due to randomness in the learning procedure. The bars represent the mean performance achieved on 10 independent learning trials of COIL Fold 1. The error bars extend one standard deviation above and below the mean. Extrema are indicated by asterisks.

Notice the con6 feature characterizing the converging edges, the cucon, cucy, cycu, and cyl3 features located at characteristic folds or corners, and the tub6 feature that responds to the top rim.

What is the critical limiting factor of the performance on the Plym task, the feature space or the learning procedure? There is no absolute answer to this question. However, if the system can learn to discriminate well between two particularly difficult classes when more training data are used, then this would indicate that the limited performance is primarily an effect of the learning procedure. A leave-one-out cross validation procedure was run on classes cube and cucon. In Fold 1 of the full-scale eight-object task, these were two of the most difficult classes to recognize (Table 5.4), and in two cases a cube appeared as a false positive together with a cucon (Table 5.5). In Fold 2, cucon was one of the weakest classes.

Table 5.6 shows that this distinction was learned quite well, with an accuracy of 0.93. This suggests that the feature space is capable of expressing such distinctions. The critical limiting factor is probably the fixed training set: Once features have been learned that always achieve perfect recognition with zero residual entropy, expert learning ends (cf. Section 5.3). At this stage, the learned feature set will generally still be imperfect in that it does not fully disambiguate all pairs of classes with a KSD of 1.0. It is features with a weak KSD and the potential to eliminate all entropy that cause the system to fail on unseen test images. In a practical, open task, this is not a problem because any imperfectly recognized test image automatically becomes a training image, and learning continues.

5.4.3 The Mel Task

The Mel task reflects the general phenomena observed in the COIL and Plym tasks, but to a lesser extent (Figure 5.8 and Table 5.7). Contrary to this development is the slight increase of wrong recognitions in Fold 2, and the emergence of one confused recognition in both folds (in Fold 1, an abalone board was mistaken for a phone-cord and a bike-chain; in Fold 2, a sock was mislabeled as grapes and bike-chain). In Fold 2, no instances of phone-cord were recognized correctly, as was the case for both folds after the initial learning stage (see Table 4.7 on page 58). In contrast, Fold 1 achieved some success on this class, but only at the expense of 20 features dedicated to it. This is the largest number of features observed for any object in all experiments after expert learning. In fact, for both folds the number of phone-cord features increased significantly during expert learning, while it decreased for almost all other classes in all experiments.


[Figure 5.6: six panels, arranged as Plym Fold 1 (left) and Plym Fold 2 (right). Top row: proportion of the test set in each recognition category (correct, wrong, ambiguous, confused, ignorant) versus training stage; the tKSD values per stage are 0, .5, .5, .5, .5, .5, .5, .75, .88, .94 for Fold 1 and 0, .5, .75, .5, .75, .88, .5, .75 for Fold 2. Middle row: number of features (total feature nodes, different features, and maximum/minimum/average queried per recognition). Bottom row: cumulative number of features sampled and images presented (×1000).]

Figure 5.6. Expert Learning results on the Plym task. The numbers shown inside the top graphs give the value of tKSD at the respective training stage. Where this number appears in bold face, the training set was learned with perfect accuracy, and the tKSD goal was achieved. Otherwise, either the training set was not learned perfectly, or the tKSD goal was not attained. The error bars in the middle plots have a width of one standard deviation.


Table 5.4. Results on the Plym task after Expert Learning. The table shows, for each class, how many instances were recognized as which class for correct and wrong recognitions, and the number of ambiguous/confused/ignorant recognition outcomes. In each row, these entries sum to the number of test images of each class (here, 8 in Fold 1 and 7 in Fold 2). The rightmost column shows the number of feature nodes that are part of the Bayesian network representing each class. The bottom row of the bottom panel gives the average proportion of the test set that falls into each recognition category.

Confusion Matrix (correct/wrong), amb., cnf., ign., # Feats. (BN)
Class: con6 cube cucon cucy cycu cyl3 cyl6 tub6

Fold 1:
con6 6 2 1
cube 2 2 4 7
cucon 3 4 1 1
cucy 4 1 2 1 1
cycu 6 1 1 1
cyl3 8 3
cyl6 1 2 3 2 8
tub6 5 3 1
Sum 7 2 3 4 8 8 3 6 14 0 9 23

Fold 2:
con6 7 1
cube 5 1 1 4
cucon 4 2 1 1
cucy 6 1 1
cycu 5 2 1
cyl3 7 1
cyl6 4 1 2 1
tub6 4 1 2 1
Sum 7 5 4 6 5 8 4 5 5 0 7 11

Average of both folds (proportions):
con6 .87 .13 1
cube .47 .07 .13 .33 6
cucon .47 .40 .13 1
cucy .67 .07 .13 .13 1
cycu .73 .20 .07 1
cyl3 1.00 2
cyl6 .07 .13 .47 .07 .27 5
tub6 .60 .27 .13 1
Avg. .12 .06 .06 .08 .11 .13 .06 .09 .16 0 .13 2


Table 5.5. Results on the Plym task after Expert Learning: Confusion matrices of ambiguous recognitions. The italic numbers on the diagonal indicate how many times the class given on the left was recognized ambiguously (cf. Table 5.4), and the upright numbers in each row specify the false-positive classes.

Class con6 cube cucon cucy cycu cyl3 cyl6 tub6

Fold 1:
con6 2 1 1
cube 2 2
cucon 2 4 1 1
cucy 1 2 1
cycu 1 1
cyl3
cyl6
tub6 1 1 1 2 3

Fold 2:
con6
cube
cucon 2 2 1
cucy
cycu 1 1 1 2
cyl3
cyl6
tub6 1 1

Table 5.6. Results on a two-class Plym subtask after 15-fold Expert Learning.

correct/wrong
Class cube cucon amb. cnf. ign. Sum
cube 13 1 1 15
cucon 14 1 15
Sum 13 15 1 0 1 30


Figure 5.7. Examples of expert features learned for the Plym task, Fold 1 (con6, cube, cucon, cucy, cycu, cyl3, cyl6, tub6).


[Figure 5.8: six panels, arranged as Mel Fold 1 (left) and Mel Fold 2 (right). Top row: proportion of the test set in each recognition category (correct, wrong, ambiguous, confused, ignorant) versus training stage; the tKSD values per stage are 0, .5, .5, .75 for Fold 1 and 0, .5, .5, .5, .75 for Fold 2. Middle row: number of features. Bottom row: cumulative number of features sampled and images presented (×1000).]

Figure 5.8. Expert Learning results on the Mel task. The numbers shown inside the top graphs give the value of tKSD at the respective training stage. Where this number appears in bold face, the training set was learned with perfect accuracy, and the tKSD goal was achieved. Otherwise, either the training set was not learned perfectly, or the tKSD goal was not attained. The error bars in the middle plots have a width of one standard deviation.


Table 5.7. Results on the Mel task after Expert Learning. The table shows, for each class, how many instances were recognized as which class for correct and wrong recognitions, and the number of ambiguous/confused/ignorant recognition outcomes. In each row, these entries sum to the number of test images of each class (here, 6). The rightmost column shows the number of feature nodes that are part of the Bayesian network representing each class. The bottom row of the bottom panel gives the average proportion of the test set that falls into each recognition category.

Confusion Matrix (correct/wrong), amb., cnf., ign., # Feats. (BN)
Class: abalone chain grapes necktie cord sock

Fold 1:
abalone 2 2 1 1 5
bike-chain 3 3 7
grapes 1 1 4 7
necktie 2 2 2 1
phone-cord 4 2 20
sock 5 1 3
Sum 2 3 1 2 4 6 7 1 10 43

Fold 2:
abalone 3 1 2 1
bike-chain 1 1 2 2 1
grapes 4 2 8
necktie 4 2 1
phone-cord 3 3 14
sock 3 1 2 1
Sum 3 1 7 5 0 4 8 1 7 26

Average of both folds (proportions):
abalone .42 .08 .33 .08 .08 3
bike-chain .33 .08 .42 .17 4
grapes .42 .08 .17 .33 8
necktie .50 .33 .17 1
phone-cord .25 .33 .42 17
sock .67 .08 .25 2
Avg. .07 .06 .11 .10 .06 .14 .21 .03 .24 6


In contrast to the other two tasks, performance on the Mel task is characterized by widely varying results and a lack of discernible patterns, even after expert learning. This is also reflected by the many ambiguous results and the lack of agreement between the folds (see Table 5.8). It seems surprising that even though features were learned that disambiguated all classes with nonzero entropy with a KSD of 0.75 (see Figure 5.8, and Figure 5.9 for examples of the features), the test-set results were still relatively poor. This suggests that the training set was too small in relation to the enormous within-class variety of appearances, enabling the feature learning procedure to discover highly discriminative features that nevertheless generalized poorly. Also, recall that expert learning acts only in the case of nonzero residual entropy in at least one class node. In the Mel task, expert learning was apparently hampered by the abundant occurrence of overly specific features that responded to very few images. Such features, if not found in an image, are likely to drive the belief in their class to zero, as discussed in Section 4.4.4 on page 44, disabling expert learning.

Table 5.8. Results on the Mel task after Expert Learning: Confusion matrices of ambiguous recognitions. The italic numbers on the diagonal indicate how many times the class given on the left was recognized ambiguously (cf. Table 5.7), and the upright numbers in each row specify the false-positive classes.

Class abalone chain grapes necktie cord sock

Fold 1:
abalone 2 2
bike-chain 3 1 1 1
grapes
necktie 1 2 1 1
phone-cord
sock

Fold 2:
abalone 2 1 2
bike-chain 2 1 1
grapes 2 2
necktie 1 2 2
phone-cord
sock

In order to find out to what extent the Mel task is learnable by the system, a leave-one-out cross validation procedure was run on a two-class subtask, using two of the most problematic classes, grapes and phone-cord. In Fold 1 of the full six-class task, grapes was the poorest class to be recognized (Table 5.7); in Fold 2, phone-cord was the poorest class, and half of the test images were mislabeled as grapes. A confusion matrix summarizing the results of the twelve folds on the two-class subtask is shown in Table 5.9. The results show that the feature space has some power to express this class distinction, but they are still not outstanding. The accuracy is 0.83, with 0.5 being chance performance. Clearly, the structure of the Mel task is not easily captured by this learning algorithm. This topic will receive further attention in Section 7.3.

5.4.4 Incremental Tasks

In all experiments discussed so far, training was incremental in that images were presented to the learner sequentially. Training images were chosen from all classes in random order.


Figure 5.9. Examples of expert features learned for the Mel task, Fold 1.

Table 5.9. Results on a two-class Mel subtask after 12-fold Expert Learning.

correct/wrong
Class grapes cord amb. cnf. ign. Sum
grapes 9 2 1 12
phone-cord 1 8 1 2 12
Sum 10 8 3 0 3 24


In many practical situations, a more realistic scenario will require the learner to acquire concepts sequentially. In particular, this is how humans are taught. It helps us to master skills if we learn only a limited number of them in parallel. In this section, tasks are presented to the learner incrementally, as described in the following algorithm.

Algorithm 5.4 (Incremental Task) To train the learning system on an incremental task, begin with a blank-slate learner, and then perform the following steps.

1. Create a training set consisting of all training images of the first two classes.

2. Train the system using Algorithm 5.3, with one minor modification: At Step 4, stop after a maximum of 10 stages of expert learning.

3. If there are untaught classes, then add all training images of the next untaught class to the training set, and continue with Step 2. Otherwise, stop.

Importantly, the experience of the learner is not discarded between executions of Step 2.
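As an illustration of this schedule, the following sketch introduces one class at a time while carrying the learner's state across stages. Here train_stages stands in for Algorithm 5.3 restricted to at most 10 expert stages, and all names are assumptions made only to render the control flow concretely.

    def run_incremental_task(classes, images_by_class, learner, train_stages):
        """Algorithm 5.4: introduce one class at a time (illustrative sketch)."""
        training_set = []
        for i, cls in enumerate(classes):
            training_set += images_by_class[cls]   # Step 1 / Step 3: grow the training set
            if i == 0:
                continue                            # wait until two classes are available
            # Step 2: train on everything seen so far; the learner's state is
            # carried over, so earlier experience is never discarded.
            train_stages(learner, training_set, max_expert_stages=10)
        return learner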

The results achieved on the three tasks using Algorithm 5.4 are included in Table 5.1, with '-inc' appended to the task name. Interestingly, in all cases the achieved accuracy was slightly better than in the corresponding non-incremental tasks. Moreover, the number of features sampled, which is closely related to the overall running time of the system, is slightly reduced for the COIL task. The number of cycles through the training set is not given in Table 5.1. It has little meaning for incremental tasks, as the size of the training set increases over time. The number of training image presentations is drastically increased. The reason for this is that the growing number of already-learned images is included with every cycle, while learning concentrates primarily on images of the new class. This effect can be avoided by more sophisticated incremental learning schemes. For example, the simulated environment could choose the class of a training image according to the empirical misrecognition rate attained by the learner. In any case, the choice of a learning strategy depends on the interaction dynamics between the learner and its environment. An environment or a task may or may not allow the learner to influence the distribution of training images.

The COIL-inc task was continued up to a total of 10 classes, which are illustrated in Figure 5.10. Figures 5.11 and 5.12 show performance parameters on a test set as they evolve during incremental expert training. The test set contains the same classes as the training set. Usually, the introduction of a new class causes a temporary drop in performance that is reflected by most performance parameters: The proportion of correct recognitions drops, while most of the other parameters increase. Over the next few stages of expert training, the system recovers. Importantly, the transient drops in performance co-occur with an imperfectly learned training set, as indicated by the non-bold KSD indicators in Figures 5.11 and 5.12. In other words, good performance on the training images generally predicts good performance on unseen test images. This is good news, given the often non-monotonic performance improvement during expert learning. Nevertheless, a slight drop in accuracy is noticeable as the number of classes increases.

How difficult are individual classes to learn? There is no way to answer this question for the non-incremental version of expert training. However, incremental training lends insight into a related question: How much effort does it take to learn a new class, after having learned some other classes? Figure 5.13 reveals that some classes are learned after sampling only a small number of features, while others require a lot of sampling effort. Both folds learned the first four objects quite effortlessly, while obj7 and obj8 were more difficult. Interestingly, both folds spent most effort disambiguating obj7 from obj2; these are the two wooden blocks in the data set. In addition, Fold 1 often confused obj7 with obj6, possibly due to the striped texture shared by both objects. In the case of obj8, there was no such singular source of confusion. Nevertheless, the two classes that accounted by far for most confusions were obj1 and obj6.


Figure 5.10. Objects used in the 10-object COIL task (obj1 through obj10).

most recently introduced class, especially during the early stages of expert learning. Beyond these qualitative remarks, it is difficult to draw strong conclusions from only two folds of cross-validation, and from a single presentation order of new classes.

The number of expert learning stages taken until a class was learned is clearly correlated with the number of features sampled. However, it is not as reliable an indicator of class difficulty, because – in contrast to the number of features sampled – it is largely unrelated to the proportion of difficult training images of a class. A single difficult image can stretch expert learning over many stages, but far fewer features are sampled than in the case of many difficult images. For example, in Fold 1, 11,470 features were sampled during 4 stages for obj6, whereas only 5,760 features were sampled during 9 stages for obj10.

The results for the incremental Plym task (Figures 5.14, 5.15 and 5.16) show the same characteristics as the COIL-inc task. Here, both folds agree that tub6, the eighth class, is the most difficult class (Figure 5.16). Not surprisingly, confusions with cyl6 accounted for the overwhelming majority of the feature learning efforts. Notably, Fold 1 had trouble with a few individual images while learning the sixth and seventh classes (cyl3 and cyl6). At almost all of these stages, Algorithm 5.3 at Step 2 cycled through the training set the maximum of 20 times, while sampling very few features.

The Mel-inc results again show similar behavior, but to a lesser extent (Figures 5.17 and 5.18). These graphs reflect the same variability as discussed for the other Mel experiments above. In contrast to the COIL and Plym tasks, it appears that the number of features sampled grows purely as a function of the number of classes. This is another indicator of the intrinsic difficulty of the Mel data set, which makes it hard for the learning procedure to find powerful and well-generalizing features. Interestingly, while learning the fourth class (necktie) in Fold 2, only 69 features were sampled, all of them during Stage 0. The following four stages of expert learning only re-estimated the conditional probabilities in the Bayesian networks, but never required a new feature.

In both folds, the system failed to recover after the introduction of the sixth class (sock). This is somewhat surprising, given that this class performed better than any other class after non-incremental training (Table 5.7), and is yet another indicator that expert learning was severely hampered on the Mel task. Expert learning essentially failed here because a large number of overly specialized features caused most recognitions to result in zero entropy in all class nodes. Such an effect is only likely to occur on static, small, and heterogeneous training sets such as that used in the Mel task.


Figure 5.11. Incremental Expert Learning results on the COIL-inc task, Fold 1. The vertical dotted lines indicate the addition of a new class to the training set. Below the graphs, these are annotated with the total number of classes represented in the training set. Data points between these lines correspond to expert learning stages (cf. Figure 5.1, page 68), which are not numbered here. The numbers inside the top graph indicate the values of t_KSD as in Figure 5.1.


Figure 5.12. Incremental Expert Learning results on the COIL-inc task, Fold 2. The vertical dotted lines indicate the addition of a new class to the training set. Below the graphs, these are annotated with the total number of classes represented in the training set. Data points between these lines correspond to expert learning stages that are not numbered here. The numbers inside the top graph indicate the values of t_KSD.


Figure 5.13. Numbers of features sampled during the COIL-inc task.


Figure 5.14. Incremental Expert Learning results on the Plym-inc task, Fold 1. The vertical dotted lines indicate the addition of a new class to the training set. Below the graphs, these are annotated with the total number of classes represented in the training set. Data points between these lines correspond to expert learning stages that are not numbered here. The numbers inside the top graph indicate the values of t_KSD.


Figure 5.15. Incremental Expert Learning results on the Plym-inc task, Fold 2. The vertical dotted lines indicate the addition of a new class to the training set. Below the graphs, these are annotated with the total number of classes represented in the training set. Data points between these lines correspond to expert learning stages that are not numbered here. The numbers inside the top graph indicate the values of t_KSD.


Figure 5.16. Numbers of features sampled during the Plym-inc task.



Figure 5.17. Incremental Expert Learning results on the Mel-inc task, Fold 1. The vertical dotted lines indicate the addition of a new class to the training set. Below the graphs, these are annotated with the total number of classes represented in the training set. Data points between these lines correspond to expert learning stages that are not numbered here. The numbers inside the top graph indicate the values of t_KSD.



Figure 5.18. Incremental Expert Learning results on the Mel-inc task, Fold 2. The vertical dotted lines indicate the addition of a new class to the training set. Below the graphs, these are annotated with the total number of classes represented in the training set. Data points between these lines correspond to expert learning stages that are not numbered here. The numbers inside the top graph indicate the values of t_KSD.


5.4.5 Computational Demands

The current method attempts to find features by random sampling in image space, which is guided by a set of simple-to-complex heuristics defined by the feature generation algorithm 4.10 (page 43). Highly distinctive features are rare and difficult to find, as illustrated by the graphs in this chapter. This opens up the question of computational demands. The current MATLAB(TM) implementation of the learning system takes many days of computation on a 700 MHz Pentium III processor to learn a typical task. The bulk of this time is spent measuring the value of features in images (Equation 3.14 on page 25). This is the primary reason why the extraction of salient points is so important (Section 3.4.3, page 21). For every feature f sampled, its value f_f(I) must be computed on 2n_ex example images (Algorithm 4.8, Feature Learning, page 41, Steps 2 and 5).

Given the number of pixels in an image (16,384 for the COIL images, 76,800 for the Mel images) and the infinite combinatorics opened up by the feature space, sampling 10,000 or 100,000 features to solve a task involving 100 training images does not seem outrageous. What would it take to run the feature learning algorithms on massively parallel neural hardware? Computing feature values is a pixel-level parallel operation that can plausibly be performed by the human visual cortex. Such a parallel implementation could measure a feature value in an image in constant time, independent of the number of salient points in an image. Assuming an average of 1,000 salient points in an image, and assuming that 99% of the compute time is spent computing feature values, this would reduce the time to solve a reasonably large problem from, say, 10 days to less than 2 hours 40 minutes – about two orders of magnitude.
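The arithmetic behind this estimate is straightforward. Under the stated assumptions – a 10-day (240-hour) run, 99% of which is spent computing feature values, with that portion sped up by a factor of roughly 1,000 (the assumed number of salient points per image):

\[
T \;\approx\; 0.01 \times 240\,\mathrm{h} \;+\; \frac{0.99 \times 240\,\mathrm{h}}{1000}
  \;\approx\; 2.4\,\mathrm{h} + 0.24\,\mathrm{h} \;\approx\; 2\,\mathrm{h}\,38\,\mathrm{min}.
\]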

The above argument assumes that the time spent to compute a feature value is approximately constant. This is not strictly true, as it depends on the length of a response vector, which is 15 filter responses for texels and 2 for edgels. It also depends on the local scale of a feature and its subfeatures, which determines the size of the filter kernels used in Algorithm 3.6 (Extraction of a Geometric-Compound Feature Response Vector, page 26) at Steps 1 and 4. This size varies from about 25 coefficients up to about 1/16 the size of the image. The early visual cortex extracts such response vectors in parallel, for all image locations, over a range of scales, reducing this computation to constant time. If these factors are taken into account, the computation time can be reduced by at least two more orders of magnitude.

In summary, the computational demands of the expert learning procedure limit the size of the problems that can be addressed using today's conventional computing hardware. However, a biologically plausible implementation that exploits the high degree of inherent parallelism should reduce the computational demands by about four orders of magnitude. In principle, this would permit much larger-scale applications.

With respect to scalability, an important consideration is the number of distinctions that has to be learned by the system. In general, this number is quadratic in the number of classes: Each class is to be discriminated from each of the other classes. Figure 5.19 plots the cumulative number of features sampled in the incremental tasks versus the number of classes. Clearly, the growth is superlinear. However, there are not sufficient data to allow a conclusive statement regarding the complexity of the learning algorithm with respect to the number of classes. More experiments using more classes are required. In any case, this superlinear growth is likely to limit the maximal problem size that can be addressed by this system. However, this need not constitute a practical limitation. Humans clearly do not learn expert-level distinctions between all categories that have any meaning to them in any context. Rather, at any given time, the behavioral context constrains the relevant categories to a relatively small set. Expert learning is only required where distinctions need to be refined that are relevant within a given context.


Figure 5.19. Number of features sampled for various tasks, plotted versus the number of classes.

5.5 Discussion

Motivated by the human phenomenon of visual expertise, this chapter introduced a simple algorithm for learning improved features. The idea is to seek features with increased discriminative power between pairs of classes. Thus, expert learning has exactly the opposite effect of overtraining. Because explicit requirements of discriminative power are enforced, learning can – in principle – continue forever, without any danger of overfitting. Resembling human expert behavior, this has the desired effects of increased recognition accuracy and efficiency. I am not aware of any existing related work in machine learning on incremental feature improvement. The vast majority of comparable machine learning methods either use a fixed feature set, perform feature selection, or compute new features on the basis of a fixed feature set (e.g. PCA). The expert learning idea presented here critically depends on the ability of the algorithm to search a virtually unlimited feature space.

The most severe limitation of the expert learning algorithm, as presented here, is that it depends on class nodes with nonzero entropy at the end of a recognition procedure. Consequently, features that have the ability to infer classification results with absolute certainty (cf. Section 4.4.4) can effectively disable expert learning. This is especially a problem if these features are poor, i.e. they have little discriminative power, because they prevent their own replacement by better features. This was the case in the Mel data set. However, such features are only likely to occur in small, static, and heterogeneous data sets such as the Mel data. They emerge because the exact same images are presented to the learner over and over, until they are recognized correctly. On an interactive task in a physical environment, no training image is ever going to be seen again, and training images from a class are drawn from a continuum of appearances. This prevents the memorization of individual training images.

Other ways to trigger expert learning could possibly avoid this problem even for static feature sets. A brute-force method would attempt to learn highly distinctive features between all pairs of classes, regardless of any empirical uncertainties. However, this requires a number of discriminations that is quadratic in the number of classes. Moreover, many of these will usually be unnecessary because many pairs of classes are never confused anyway. Expert learning – like the basic learning studied in Chapter 4 – ought to be driven by the task: Where uncertainties are likely to occur, improved features should be learned. This principle is implemented by the expert learning algorithm 5.1.


CHAPTER 6

LEARNING FEATURES FOR GRASP PRE-SHAPING

The preceding chapters applied feature learning in a supervised object recognition scenario. Many of the visual skills that humans acquire from experience affect their behavior at a very fine granularity. For example, between about five to nine months of age, infants learn to pre-orient their hands to match the orientation of a target object during reaching [70]. In this chapter, visual features are learned that enable similar prospective behavior in a haptically-guided robotic grasping system.

6.1 Related Work

There is extensive literature on visual servoing; a thorough review is beyond the scope of this dissertation. Most of this work falls into one of two main categories according to the coordinate frame in which error signals are computed, namely, three-dimensional workspace coordinates or two-dimensional image coordinates. Three-dimensional methods typically rely on a reasonably complete and accurate visual reconstruction of the workspace. More related to this chapter are two-dimensional approaches where manipulator control signals are directly derived from image-space errors [105, 49]. Here, local features of the manipulator and of workpieces are identified, and their actual positions are compared to reference positions. Image-space errors are then transformed into errors in the degrees of freedom of the manipulator. Instead of local appearance, another line of research builds on global appearance and computes error signals in eigen-subspaces [74].

6.2 Objective

To date, most work in sensor-guided robotic manipulation has concentrated on exclusively vision-based systems. There has been much less work on haptically-guided grasping [21, 23], or on combining haptic and visual feedback [3, 42]. This seems somewhat surprising, as these two sensory modalities ideally complement each other, especially when a dextrous manipulator is used. The visual system is most beneficial during the reach phase when no haptic feedback is available. Relatively coarse differential error signals in image space, using appearance features of the wrist or the arm, are sufficient to achieve a highly efficient vision-guided reach. When the end effector draws close to the target object, visual information can further contribute to the extent that fingertips and object details are accurately identified. However, this is difficult to achieve because these features will often be occluded by the hand itself. Moreover, even if detailed information about finger configurations can be obtained by the vision system, it will be difficult to exploit in the context of a dextrous manipulator due to the many degrees of freedom involved, many of which are redundant. This stage of the grasp is ideally suited for haptic feedback. Tactile fingertip sensors are highly sensitive to relevant surface characteristics such as the orientation of the local surface tangent. Coelho [21, 23] has demonstrated that this information can be exploited by local grasp controllers that converge toward stable grasp configurations, thereby obviating the need to solve a high-dimensional (and often intractable) planning problem in global joint space.


McCarty et al. [70] demonstrated that infants learn to pre-orient their hand during a reach to match the orientation of the target object. Notably, this prospective behavior does not depend on the visibility of the hand or of the object during the reach. Evidently, humans use an initial visual cue to initiate a reach, which is then guided by proprioception. Once the target object is contacted, the grasp is completed using tactile feedback.

Coelho's closed-loop grasping system is "blind" in the sense that no visual information is exploited during the grasp process. The objective in this chapter is to add a visual component to this system that extracts minimal visual information to pre-orient the hand appropriately in relation to the target object. Specifically, it recommends two critical parameters to the haptic system:

- the orientation (azimuthal angle) of the hand, and
- the number of fingers to use (two or three).

These recommendations are purely based on features of appearance that correlate robustly with these two parameters, using a feature learning system similar to that discussed in the preceding chapters. The learning procedure is driven entirely by the haptic grasping system, without an external supervisor.

6.3 Background

The grasping task discussed in this chapter is based on a haptically-guided grasping system developed by my colleague Jefferson Coelho, which is introduced in the next section. Then, Section 6.3.2 briefly introduces and illustrates probability distributions on angular domains. These play an important role in the algorithms presented in this chapter.

6.3.1 Haptically-Guided Grasping

Coelho [21] describes a framework for closed-loop grasping using a dextrous robotic hand (Figure 6.1) equipped with force/torque touch sensors at the fingertips. Reference and feedback error signals are expressed in terms of the object-frame wrench residual, a six-element vector summarizing the residual forces and torques acting on the object. The current object wrench is computed using the instantaneous grip Jacobian G, which in turn is locally estimated using sensor feedback in the form of contact positions and normals. The grasp controller receives an error signal ε, which for a zero reference wrench is simply the squared norm of the wrench residual, and computes incremental displacements for a subset of the fingertip contacts so as to reduce the error signal. As a result, the fingers probe the surface of the object until their positions converge to a local minimum in the squared-wrench-residual surface.

Figure 6.1. Grasp synthesis as closed-loop control. (Reproduced with permission from Coelho and Grupen [21].)


The three-fingered Stanford/JPL robotic hand (Figure 6.2) employed in this work can perform grasps using any two- or three-fingered grasp configuration. Each of these finger combinations c gives rise to an individual grasp controller, yielding a total of four controllers. For the purpose of this chapter, it is sufficient to distinguish between two classes of grasp controllers, namely, two- and three-fingered grasps.

Figure 6.2. The Stanford/JPL dextrous hand performing haptically-guided closed-loop grasp synthesis.

As a grasp controller executes, the control error ε and its time derivative ε̇ trace out phase portraits that are characteristic of object shapes (Figure 6.3). In this way, the grasping system forms haptic categories. Qualitatively different regions in a phase portrait constitute haptic contexts and can be represented as states of a finite state machine (Figure 6.4). Coelho [22, 23] describes a reinforcement learning framework for learning grasp control policies that select a controller based on the context currently observed. The learned policies achieve grasps of consistently higher quality than any fixed native controller by escaping local minima in the wrench residual surface.

Figure 6.3. Object wrench phase portraits traced during grasp syntheses. The left panel depicts the evolution of a two-fingered grasp trial from configuration (a) to (b) and to (c), the convergent configuration. The complete, two-fingered phase portrait for the irregular triangle is shown on the right. (Reproduced with permission from Coelho et al. [24])


Figure 6.4. A hypothetical phase portrait of a native controller (left) and all possible context transitions (right). (Reproduced with permission from Coelho et al. [24])

The quality of a grasp is characterized by the minimum friction coefficient μ0 required for a null space to exist in the grip Jacobian G, with rank ≥ 1. For an ideal grasp, μ0 = 0, meaning that the grasp configuration is suitable for fixing the object in place using frictionless point contacts. Depending on the object shape and the number of fingers available, this ideal may not be achievable. The right panel in Figure 6.3 is annotated with the values of μ0 associated with the three converged configurations.

6.3.2 The von Mises Distribution

Predicting hand orientations in relation to objects involves the estimation of relative angular information. Most conventional continuous probability distributions have a linear domain, for example, the normal distribution. In the case where a random variable represents an angle, one probability distribution that is often used in lieu of a normal distribution on a circular domain is the von Mises distribution [30]. It has the probability density function

p_{\mathrm{vM}}(\theta \mid \mu, \kappa) = \frac{1}{2\pi I_0(\kappa)}\, e^{\kappa \cos(\theta - \mu)}   (6.1)

where 0 ≤ θ < 2π, 0 ≤ κ < ∞, and I_0(κ) is the modified Bessel function of order zero. The mean direction of the distribution is given by μ, and κ is a concentration parameter with κ = 0 corresponding to a circular uniform distribution, and κ → ∞ to a point distribution. Figure 6.5 illustrates four von Mises distributions with different parameters. The probability density at an angle corresponds to the radial distance from the unit circle to the density curve.

The parameter μ of the von Mises distribution is not to be confused with the friction coefficient μ0 introduced in the previous section. I use this notation in agreement with established conventions. In this chapter, the use of μ is consistent: μ0 (occasionally followed by one more index) always refers to the friction coefficient, and μ (often with an index other than zero) always denotes the mean direction of a von Mises distribution.
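As a concrete illustration (a sketch in Python rather than the MATLAB implementation used in this work), the density of Equation 6.1 can be evaluated directly; only scipy's modified Bessel function of order zero is needed.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of order zero

def von_mises_pdf(theta, mu, kappa):
    """Von Mises density p_vM(theta | mu, kappa) of Equation 6.1.

    kappa = 0 gives the circular uniform density 1/(2*pi); larger kappa
    concentrates the probability mass around the mean direction mu."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * i0(kappa))

# The four parameter settings illustrated in Figure 6.5; each density
# integrates to one over the circle.
thetas = np.linspace(0.0, 2.0 * np.pi, 721)
for mu, kappa in [(0.0, 0.0), (2.36, 2.0), (1.57, 10.0), (0.0, 60.0)]:
    print(mu, kappa, np.trapz(von_mises_pdf(thetas, mu, kappa), thetas))
```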

6.4 Feature Learning

In Coelho's haptically-guided grasping system, no visual cues guide the placement of the fingers on the object. Here, the goal is to learn visual features that predict successful hand orientations,


Figure 6.5. Illustration of four von Mises probability densities on a circular domain: the unit circle and the densities for (μ, κ) = (0.00, 0.00), (2.36, 2.00), (1.57, 10.00), and (0.00, 60.00).

given prior haptic experience. These can be used to choose a two- or a three-fingered grasp controller, and to pre-shape the hand in order to place contacts reliably near high-quality grasp configurations for a task. In terms of Figure 6.4, this corresponds to initializing the haptic system in a context near the minimum achievable wrench residual. From here, a single native grasp controller can lead reliably to the preferred solution. In effect, pre-grasp visual cues are learned to subsume haptically-guided grasp policies.

The general scenario for the grasping problem is shown in Figure 6.6. The visual system attempts to identify features of the object that correlate with successful grasp parameters, including the relative hand orientation and the number of fingers used. An overhead view of the target object is obtained before the onset of the reach, and a prediction of useful grasp parameters is derived. Subsequently, the grasp is executed using the predicted grasp parameters. Upon convergence of the haptic controller to a final grasp configuration, grasp parameters are recorded and used for learning. The orientation of the hand during a given grasp configuration, θ_H, is defined as illustrated in Figure 6.7.

The robot may encounter a variety of objects that differ in their shapes. Each type of object may require a dedicated feature to recommend a hand orientation. Object identities are not known to the system; the need for dedicated features must be discovered by grasping experience. Visual features that respond selectively to haptic categories permit a recommendation to be made regarding the number of fingers to use for a grasp (Figure 6.8).

Features of the type defined in Chapter 3 are sampled from images taken by an overhead camera. Assuming that these features respond to the object itself, their image-plane orientation θ_f (Equation 3.5, page 15) should be related to the azimuthal orientation θ_H of the robotic hand by a constant additive offset μ. A given feature, measured during many grasping experiences, generates data points that lie on straight lines on the toroidal surface spanned by the hand and feature orientations (Figure 6.9). Here, μ = θ_H − θ_f. There may be more than one straight line because a given visual feature may respond to more than one specific object location (e.g., due to object symmetries), or to several distinct objects that differ in shape. Given μ, one can then infer appropriate hand orientations as a function of observed feature orientations:

\theta_h = \theta_f + \mu   (6.2)

The remaining problem is to find the offsets μ. Assuming these can be modeled as random variables with unimodal and symmetric probability distributions p_k(μ), with the k corresponding to the clusters of points in Figure 6.9, then this is an instance of a K-Means problem in one-


Figure 6.6. Scenario for learning features to recommend grasp parameters (cf. Figure 2.1, page 8). The step indices correspond to Algorithm 6.2, to be discussed later in this chapter.

Figure 6.7. Definition of the hand orientation (azimuthal angle) for two- and three-fingered grasps.

Figure 6.8. By Coelho's formulation, some objects are better grasped with two fingers (a), some with three (b), and for some this choice is unimportant (c, d).


dimensional circular (angular) space, with K unknown, and can be represented as a mixture distribution

p_{\mathrm{mix}}(\mu) = \sum_{k=1}^{K} p_k(\mu)\, P(k)   (6.3)

with mixture proportions 0 ≤ P(k) ≤ 1, \sum_k P(k) = 1. The following section shows how to find the parameters of this mixture distribution, which collectively constitute the feature model of f for a particular grasp controller. Subsequent sections describe how to select data points for fitting object-specific feature models, and how to use features to predict the quality of a grasp. Section 6.4.4 puts these pieces together and presents the full Algorithm 6.2 (cf. Figure 6.6) for learning multiple features with multiple grasp controllers, and addresses the interaction with the haptic grasping system.

Figure 6.9. Left: Data points induced by a given feature on various images of an object form straight lines on a torus (two in this case); right: A mixture of two von Mises distributions was fitted to these data.

6.4.1 Fitting a Parametric Orientation Model

While the system learns and performs, all features are evaluated on all images. The response magnitudes f_f of all features f, their orientations θ_f, the actual hand orientations θ_H (the training signal), and the prediction errors Δθ_f = |θ_H − θ_h| produced by each feature are stored in an instance list. To compute the mixture model for a feature, this feature's data points (such as those shown in Figure 6.9) are taken from this instance list.

Assume the μ are drawn independently from a mixture of von Mises distributions. The mixture distribution 6.3 (see Figure 6.9) then becomes

p_{\mathrm{mix}}(\mu \mid a) = \sum_{k=1}^{K} p_{\mathrm{vM}}(\mu \mid \mu_k, \kappa_k)\, P(k)   (6.4)

where a is shorthand for the collection of parameters μ_k, κ_k and P(k), 1 ≤ k ≤ K. For all plausible numbers of clusters K, a (3K − 1)-dimensional non-linear optimization problem is to be solved to find the μ_k, κ_k and P(k). The appropriate number K of mixture components is then chosen as the one that maximizes the Integrated Completed Likelihood criterion [13], an adaptation to clustering problems of the more well-known Bayesian Information Criterion [106]. In practice, the method of choosing the right number of components is not critical,


because as better features are learned, the clusters of data points will become increasingly well separated.

For a particular value of K, a maximum-a-posteriori (MAP) estimate of a mixture model is a parameterization a that maximizes the posterior probability given by Bayes' rule:

p(a \mid \mathcal{D}) = \frac{L(\mathcal{D} \mid a)\, p(a)}{\sum_m L(\mathcal{D} \mid a_m)\, p(a_m)}

where L(\mathcal{D} \mid a) is the likelihood of the observed data \mathcal{D} = \{\mu_1, \ldots, \mu_N\}. We now assume a uniform prior probability density over all possible model parameterizations a_m. In this case, the MAP estimate is identical to the maximum-likelihood estimate that maximizes the log-likelihood of the data:

\log L(\mathcal{D} \mid a) = \sum_{i=1}^{N} \log p_{\mathrm{mix}}(\mu_i \mid a)   (6.5)

An elegant way to fit mixture models such as this is via the Expectation-Maximization (EM) algorithm. The EM algorithm is most practical if the model parameters μ_k, κ_k and P(k) can be computed in closed form at the Maximization step. Unfortunately, this is not possible here because the Bessel functions cannot be inverted algebraically (see the Appendix). Facing this added difficulty, the mixture model (6.4) is more easily fitted directly, subject to the applicable constraints on the κ_k and P(k), via gradient descent using the partial derivatives of the objective function (6.5):

\frac{\partial}{\partial \mu_k} \log L(\mathcal{D} \mid a) = \sum_{i=1}^{N} \frac{P(k)}{p_{\mathrm{mix}}(\mu_i \mid a)} \, \frac{\partial}{\partial \mu_k} p_{\mathrm{vM}}(\mu_i \mid \mu_k, \kappa_k)

\frac{\partial}{\partial \kappa_k} \log L(\mathcal{D} \mid a) = \sum_{i=1}^{N} \frac{P(k)}{p_{\mathrm{mix}}(\mu_i \mid a)} \, \frac{\partial}{\partial \kappa_k} p_{\mathrm{vM}}(\mu_i \mid \mu_k, \kappa_k)

where

\frac{\partial}{\partial \mu} p_{\mathrm{vM}}(\theta \mid \mu, \kappa) = \kappa \sin(\theta - \mu)\, p_{\mathrm{vM}}(\theta \mid \mu, \kappa),   (6.6)

\frac{\partial}{\partial \kappa} p_{\mathrm{vM}}(\theta \mid \mu, \kappa) = -\frac{I_{-1}(\kappa) + I_1(\kappa) - 2 I_0(\kappa)\cos(\theta - \mu)}{2 I_0(\kappa)}\, p_{\mathrm{vM}}(\theta \mid \mu, \kappa).   (6.7)

The K − 1 partial derivatives with respect to P(k) must take into account the constraint \sum_{k=1}^{K} P(k) = 1:

\frac{\partial}{\partial P(k)} \log L(\mathcal{D} \mid a)
  = \frac{\partial}{\partial P(k)} \sum_{i=1}^{N} \log \left[ \sum_{k'=1}^{K-1} p_{\mathrm{vM}}(\mu_i \mid \mu_{k'}, \kappa_{k'})\, P(k') + p_{\mathrm{vM}}(\mu_i \mid \mu_K, \kappa_K) \left( 1 - \sum_{k'=1}^{K-1} P(k') \right) \right]
  = \sum_{i=1}^{N} \frac{1}{p_{\mathrm{mix}}(\mu_i \mid a)} \left[ p_{\mathrm{vM}}(\mu_i \mid \mu_k, \kappa_k) - p_{\mathrm{vM}}(\mu_i \mid \mu_K, \kappa_K) \right], \qquad k = 1, \ldots, K-1.

Each feature is annotated with its model parameters, which include the μ_k, κ_k, P(k), and the total number N_f of valid data points modeled by this mixture.
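A minimal sketch of this direct fitting procedure is given below, assuming the data are the angular offsets μ_i taken from the instance list. Function and parameter names are hypothetical, initialization and step-size control are deliberately naive, and the selection of K (done via the Integrated Completed Likelihood criterion in the actual system) is omitted.

```python
import numpy as np
from scipy.special import i0, i1, iv

def vm_pdf(theta, mu, kappa):
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * i0(kappa))

def fit_vm_mixture(offsets, K, steps=2000, lr=1e-3, seed=0):
    """Direct gradient ascent on the log-likelihood (Eq. 6.5) of a K-component
    von Mises mixture, using the analytic derivatives of Eqs. 6.6 and 6.7."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.0, 2.0 * np.pi, K)   # mean directions
    kappa = np.full(K, 1.0)                 # concentrations
    P = np.full(K, 1.0 / K)                 # mixture proportions
    for _ in range(steps):
        comp = np.array([vm_pdf(offsets, mu[k], kappa[k]) for k in range(K)])  # K x N
        pmix = P @ comp                     # mixture density at each data point
        w = comp / pmix                     # p_vM / p_mix, one row per component
        # Eq. 6.6: d/dmu p_vM = kappa * sin(theta - mu) * p_vM
        grad_mu = np.array(
            [(P[k] * w[k] * kappa[k] * np.sin(offsets - mu[k])).sum() for k in range(K)])
        # Eq. 6.7: d/dkappa p_vM = -(I_{-1} + I_1 - 2 I_0 cos(theta - mu)) / (2 I_0) * p_vM
        grad_kappa = np.array(
            [(P[k] * w[k] * -(iv(-1, kappa[k]) + i1(kappa[k])
              - 2.0 * i0(kappa[k]) * np.cos(offsets - mu[k])) / (2.0 * i0(kappa[k]))).sum()
             for k in range(K)])
        # Constrained derivative with respect to P(k), k = 1..K-1 (last entry is zero);
        # renormalizing afterwards is a crude way to stay on the probability simplex.
        grad_P = (w - w[-1]).sum(axis=1)
        mu = np.mod(mu + lr * grad_mu, 2.0 * np.pi)
        kappa = np.maximum(kappa + lr * grad_kappa, 1e-3)
        P = np.clip(P + lr * grad_P, 1e-6, None)
        P /= P.sum()
    loglik = np.log(P @ np.array([vm_pdf(offsets, mu[k], kappa[k]) for k in range(K)])).sum()
    return mu, kappa, P, loglik
```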

6.4.2 Selecting Object-Specific Data Points

If different types of objects are encountered, dedicated features may have to be learned. Without a supervisor providing object identities, the data collections (Figure 6.9) will be an indiscernible


mix of feature responses corresponding to various objects, and any reliable patterns will be obscured. The key to learning dedicated features is to ignore data points corresponding to weak feature responses. This permits features to emerge that respond strongly only to specific, highly characteristic object parts, but that respond weakly in any image that does not contain such an object. These weak responses will be ignored, and reliable models of μ can be fitted to the strong responses.

Deciding whether a given point is "strong" in this sense involves a threshold. Such thresholds τ_f, specific to each feature, can be determined optimally in the Bayesian sense that the number of poor recommendations made by the resulting model is minimized.¹ To do this for a given feature f, the history of feature responses f_f and the associated prediction errors Δθ_f are analyzed in order to find a threshold τ_f such that most predictions for cases with f_f ≥ τ_f are correct. To formalize this intuitive notion, a global threshold t_Δθ is introduced, meaning that a prediction with Δθ_f < t_Δθ is correct, and false otherwise. The optimal threshold τ_f is then defined as a value that maximizes the Kolmogorov-Smirnoff distance KSD_f between the two conditional distributions of f_f under the two conditions that the associated predictions are correct and false, respectively (Figure 6.10). The feature model of f is then fitted to all data points μ with f_f ≥ τ_f. The threshold t_Δθ is a global run-time parameter that can gradually be reduced over time to encourage the learning of improved features.

Figure 6.10. Kolmogorov-Smirnoff distance KSD_f between the conditional cumulative distributions of feature response magnitudes f_f given correct (Δθ_f < t_Δθ) and false (Δθ_f ≥ t_Δθ) predictions.
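A sketch of this threshold selection is given below (hypothetical helper; in the actual system the response magnitudes and prediction errors come from the instance list). The KSD is simply the largest gap between the two empirical cumulative distributions, and the maximizing response value serves as the threshold.

```python
import numpy as np

def ksd_and_threshold(responses, pred_errors, t_dtheta):
    """Kolmogorov-Smirnoff distance KSD_f and optimal response threshold tau_f.

    responses   : feature response magnitudes f_f from the instance list
    pred_errors : angular prediction errors |theta_H - theta_h| of the same cases
    t_dtheta    : global threshold separating correct from false predictions
    """
    correct_resp = responses[pred_errors < t_dtheta]
    false_resp = responses[pred_errors >= t_dtheta]
    if len(correct_resp) == 0 or len(false_resp) == 0:
        return 0.0, np.inf              # degenerate case: no separation possible
    best_ksd, best_tau = 0.0, np.inf
    for tau in np.unique(responses):
        # gap between the two empirical cumulative distributions at tau
        gap = abs((false_resp <= tau).mean() - (correct_resp <= tau).mean())
        if gap > best_ksd:
            best_ksd, best_tau = gap, tau
    return best_ksd, best_tau
```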

6.4.3 Predicting the Quality of a Grasp

The quality of a grasp is defined by the minimum friction coefficient μ0 required to execute the grasp using a given finger configuration (cf. Section 6.3.1). In this model, the lower (closer to zero) a value of μ0, the better the grasp. It is not possible to separate good from poor grasps based on a generic threshold on μ0 alone because, for any given object, the best achievable grasp depends on the object properties, the number of fingers, and – in general – on the end-to-end task. For example, the best achievable grasp of a triangular prism using two fingers is far worse than if three fingers are used. Under Coelho's formulation, cubes are best grasped with two fingers to take advantage of symmetric contact forces across parallel opposing surfaces (Figure 6.8).

It is possible to estimate the expected value of μ0 associated with a given feature f. For this purpose, the actually experienced value of μ0 is stored in the instance list along with each

¹ Bayesian optimality holds only under the assumption that the prior probabilities of the two conditions are equal. See Sections 4.3.2 and 4.4.1 for a discussion.


executed grasp. These μ0 values are regarded as samples of a continuous random variable M_0 with probability density function p(μ0) and expected value

E[M_0] = \int_0^{\infty} \mu_0\, p(\mu_0)\, d\mu_0 .

Observe that each measured value of μ0 is associated with the measured prediction error Δθ_f = |θ_H − θ_h| generated by the same grasp. If f is indeed highly predictive of the grasp outcome, then there should be an approximate proportional relationship between the μ0 and the Δθ_f: Good grasps with small μ0 should tend to exhibit a small angular error Δθ_f. Thus, the probability density function p(μ0) should be well approximated by the distribution p_vM(Δθ_f) of the angular offsets observed between the feature and hand orientations. Therefore, a sample estimate of E[M_0(f)] for feature f can be approximated as

E[M_0(f)] = \frac{\sum_i \mu_{0,i}\, p_{\mathrm{vM}}(\Delta\theta_{f,i})}{\sum_i p_{\mathrm{vM}}(\Delta\theta_{f,i})}   (6.8)

where the corresponding pairs of μ_{0,i} and Δθ_{f,i} are taken from the instance list, using only instances corresponding to occurrences of f with f_{f,i} ≥ τ_f.
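A sketch of this estimate (Equation 6.8), assuming the relevant instances have already been filtered by f_f ≥ τ_f and that pvm is the von Mises density used to weight the angular errors; all names are hypothetical.

```python
import numpy as np

def expected_mu0(mu0_samples, angular_errors, pvm):
    """Sample estimate of E[M0(f)] (Equation 6.8): the experienced friction
    coefficients mu_0,i are weighted by the von Mises density pvm evaluated
    at the corresponding angular prediction errors."""
    w = pvm(np.asarray(angular_errors))
    return float(np.sum(w * np.asarray(mu0_samples)) / np.sum(w))
```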

6.4.4 Feature Learning Algorithm

Using the concepts introduced in the preceding sections, the procedure for fitting a feature model is summarized in the following algorithm.

Algorithm 6.1 (Fitting a Feature Model) For a given feature f, the following steps are taken to compute the associated orientation model, consisting of the μ_k, κ_k, KSD_f, and E[M_0(f)]:

1. Using the current value of t_Δθ, identify those occurrences of f in the instance list that correspond to correct and false predictions. Using the response magnitudes f_f of these occurrences, compute KSD_f and the associated threshold τ_f (Section 6.4.2).

2. Using the N_f of these instances with f_f ≥ τ_f, fit a parametric orientation model (Equation 6.4), obtaining K_f, and values for μ_k, κ_k, and P(k), k = 1, ..., K_f (Section 6.4.1).

3. Using this orientation model and the values of μ0 and Δθ_f corresponding to the measurements of f in the instance list, compute E[M_0(f)] (Equation 6.8, Section 6.4.3).
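Tying the previous sketches together (all helper and field names hypothetical; the likelihood-based choice of K and the weighting density passed to expected_mu0 are simplifications of the procedure described above), Algorithm 6.1 roughly amounts to:

```python
def fit_feature_model(inst, t_dtheta, max_K=4):
    """Sketch of Algorithm 6.1 for one feature f.  `inst` is a hypothetical
    record holding arrays of response magnitudes, angular offsets
    mu = theta_H - theta_f, prediction errors, and friction coefficients."""
    # Step 1: KSD_f and threshold tau_f from correct/false predictions.
    ksd, tau = ksd_and_threshold(inst.responses, inst.errors, t_dtheta)
    strong = inst.responses >= tau
    # Step 2: fit the angular mixture to the strong responses; K is chosen
    # naively by likelihood here (the system uses the Integrated Completed
    # Likelihood criterion).
    fits = [fit_vm_mixture(inst.offsets[strong], K) for K in range(1, max_K + 1)]
    mu_k, kappa_k, P_k, _ = max(fits, key=lambda fit: fit[3])
    # Step 3: expected friction coefficient, weighting by a density over the
    # prediction errors (here, arbitrarily, the most concentrated component).
    e_mu0 = expected_mu0(inst.mu0[strong], inst.errors[strong],
                         lambda d: von_mises_pdf(d, 0.0, kappa_k.max()))
    return dict(ksd=ksd, tau=tau, mu=mu_k, kappa=kappa_k, P=P_k, E_mu0=e_mu0)
```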

The system operates by recommending hand orientations and observing the outcomes of grasping procedures. New features are added to the repertoire in parallel to the execution of grasping tasks, and are evaluated over the course of many future grasps. This is in contrast to the distinction-learning application (Chapter 4), where the discriminative power of new features was immediately estimated using example images. Here, no example images are necessary. The visual system does not have any influence on the types or identities of the grasped objects. Two separate feature sets F_2 and F_3 and two separate instance lists are maintained for two- and three-fingered grasps. Out of these two sets, the feature with the lowest expected friction coefficient μ0 provides a recommendation of grasp parameters to the haptic system. The operation of the learning system is detailed in the following algorithm.

Algorithm 6.2 (Feature Learning and Application of Feature Models) Initially, the sets F_2 and F_3 of features known to the system are empty.

1. A new grasp task is presented to the grasping system, and an overhead image of the grasped object is acquired.

2. The following steps are performed for each feature set F_c:


(a) The responses f_f of all features f ∈ F_c are measured in the current image.

(b) Identify the above-threshold feature with the highest predictive power:

f^*_c = \arg\max_{\{f \in F_c \,:\, f_f \geq \tau_f\}} \mathrm{KSD}_f

Now, choose the number c of grasp contacts that will result in the better grasp:

c = \arg\min_{c \in \{2,3\}} E[M_0(f^*_c)]

3. From the K components in the angular mixture model associated with f^*_c, identify the most concentrated component distribution of angular errors:

k^* = \arg\max_{\{k \,:\, 1 \leq k \leq K_{f^*_c},\ N_{f^*_c} P(k) \geq 3\}} \kappa_k

To avoid singular modes, the maximum is only taken over those component distributions that model at least three data points.

4. Using the orientation of f^*_c measured in the image and the angular offset associated with k^*, form a recommended hand orientation (cf. Equation 6.2):

\theta_h = \theta_{f^*_c} + \mu_{k^*}

5. The robot performs a c-contact grasp with initial azimuthal angle of θ_h. Upon convergence of the haptic system, a new instance vector is added to the instance list, containing the response magnitudes, orientations, and prediction errors Δθ_f = |θ_H − θ_h| of all features f ∈ F_c, as well as the actual final hand orientation θ_H, and the achieved friction coefficient μ0.

6. If Δθ_{f^*_c} < t_Δθ, i.e. the recommended grasp parameters were accurate and required little adjustment by the haptic system, then continue at Step 1.

7. Using Algorithm 6.1, the KSD_f are recomputed for all f ∈ F_c, and all associated mixture models are re-estimated based on the cases recorded in the instance list of previous experiences.

8. A new prediction θ_h is formed based on the new models, following the same procedure as above, but leaving c fixed (Steps 2a, 2b, 3, 4).

9. If now |θ_H − θ_h| < t_Δθ, continue at Step 1.

10. Add two new features to F_c: One is created by sampling a random texel or a random pair of edgels from salient points in the current image (cf. Algorithm 4.10 (Feature Generation, page 43), Step 2). The other feature is generated by randomly expanding an existing feature geometrically by adding a new point (cf. Algorithm 4.10, Step 3). Then, continue at Step 1.
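A minimal sketch of the recommendation part of this algorithm (Steps 2–4) is given below, assuming each feature record carries the fields produced by a feature-model fit (KSD, threshold, expected μ0, mixture components) together with its response and orientation measured in the current image; all field names are hypothetical.

```python
import numpy as np

def recommend_grasp(feature_sets):
    """Sketch of Steps 2-4 of Algorithm 6.2.  `feature_sets` maps the number
    of contacts c in {2, 3} to lists of feature records with the hypothetical
    fields f_f, tau, ksd, E_mu0, theta_f, mu, kappa, P, N_f."""
    best = {}
    for c, features in feature_sets.items():
        above = [f for f in features if f["f_f"] >= f["tau"]]
        if above:                                   # Step 2b: highest KSD_f
            best[c] = max(above, key=lambda f: f["ksd"])
    # Step 2b (continued): the contact number with the lowest expected mu_0.
    c = min(best, key=lambda n: best[n]["E_mu0"])
    f_star = best[c]
    # Step 3: most concentrated component modeling at least three data points.
    valid = [k for k in range(len(f_star["kappa"]))
             if f_star["N_f"] * f_star["P"][k] >= 3]
    k_star = max(valid, key=lambda k: f_star["kappa"][k])
    # Step 4: recommended hand orientation theta_h = theta_f* + mu_k*.
    theta_h = np.mod(f_star["theta_f"] + f_star["mu"][k_star], 2.0 * np.pi)
    return c, theta_h
```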

Many of the features randomly sampled at Step 10 will perform poorly, e.g. because they respond to parts of the scene unrelated to the object to be grasped. Such features will develop a poor KSD, as there is no systematic association between their response strengths and their prediction accuracies. Due to their low KSD, such features will cease to be used at all (Step 2b). On the other hand, if a feature performs well, its KSD will be increased, and it will more likely be employed. Moreover, since only features are consulted that are asserted to be present in the image (i.e., f_f ≥ τ_f), features can be learned that selectively respond to different object shapes requiring different offsets μ. Unused features should be discarded periodically. The details of when this is done are unimportant, but if all features are kept around, then those steps in Algorithm 6.2 that involve iterating over all features will incur unnecessary computational cost.

This algorithm uses only a single feature to derive a hand orientation. Similarly to the preceding two chapters, the reason for this choice is to reinforce the point that a highly uncommitted adaptive visual system can learn powerful features. The robustness of the recommended angles can probably be increased by pooling the recommendations of several features. At Step 2, the feature of choice is determined according to its KSD_f. Then, at Step 3, one of the modes k of this feature's model is chosen to form the actual prediction. At first glance, it seems that this choice of a model is unstable because the actual reliability of the individual modes of a given feature model may differ widely. However, this variation does not cause instability because the KSD is derived according to the actual behavior of the feature in practice, as recorded in the instance list. Thus, a high KSD can only result if the practically used modes k perform consistently well.

In contrast to the feature learning algorithm for distinctive features (Algorithm 4.8 on page 41), Algorithm 6.2 does not evaluate features immediately, as this would require repeated grasps of the same object in lieu of the example images used by Algorithm 4.8. This algorithm instead chooses to spread the evaluation of newly sampled features across multiple future grasps. This prevents the use of an explicit simple-to-complex search procedure such as Algorithm 4.10. However, the expansion of features at Step 10 above has the same effect, spread over multiple grasps. Boolean compound features are not used here because their orientations are less meaningful than those of geometric compounds.

Incidentally, this search for good predictive models constitutes an Expectation-Maximization (EM) algorithm. An EM algorithm alternates between estimating a set of unknown data values based on a current estimate of a model (Expectation step), and estimating the model based on the full data (Maximization step). Here, the parametric model to be optimized is the collection of all feature-specific models that are used by the angular recommendation process. The unknown data values specify which recorded data points should participate in the feature-specific models. At the Expectation step, these hidden parameters are estimated by computing the KSD_f such that the probability of making the right choice for each data point is maximized, given the current model. At the Maximization step, the likelihood of the model given the participating data points is maximized by optimizing the model parameters according to Equation 6.5. As the system operates, these two steps alternate.

This instance of the EM algorithm is unusual in that the Expectation step does not utilize all available current data. Instead, the instance list of past experiences is consulted for previous prediction results, which were generated by models derived from all data available at that time. Taking the correct expectation using the most recent model would involve revisiting all previously-seen images at each Expectation step, which is clearly impractical. Nevertheless, the convergence properties of the EM algorithm are unaffected. As data accumulate, the accuracy of recent expectations can only increase, and the influence of possibly inaccurate data from early history diminishes.

6.5 Experiments

The learning algorithm just presented is designed to operate on-line. However, a single grasp performed by the real robot using Coelho's system (Section 6.3.1) takes on the order of ten minutes, while the learning algorithm requires hundreds of grasps to converge. Therefore, a thorough on-line evaluation was impractical. Instead, experiments were conducted in simulation in a way that closely resembles the actual integrated grasping system. These experiments utilized a detailed kinematic and dynamic simulator of the robotic hand/arm system, developed by Jefferson Coelho, that cycles through the following steps:


1. Generate a random polyhedral object geometry by drawing object parameters from parametric probability distributions derived from actual experiments using the real robot. Random parameters include the dimensions and angles defining the object.

2. Perform a simulated grasp of this object. The haptic control system is the same as that used for the real robot. The simulator interprets the motor commands and generates sensory feedback in the form of joint angles and (relatively noisy) force/torque readings of the fingertip sensors.

The three types of object geometries used included quadrangular (roughly square) and triangular (roughly equilateral) prisms, and cylinders. Typical object geometries and converged grasp configurations are illustrated in Figure 6.11. The ray originating from the object center indicates the wrist orientation in accordance with Figure 6.7. The categorization into "good" and "poor" grasps is for illustration only. As mentioned earlier, there is no way to grasp a triangle well using two fingers, and all grasps of the cylinder were of consistently high quality. Visual input for the visual learning system was generated by producing photo-realistically rendered and noise-degraded views on the basis of the object specifications generated at Step 1 above (Figure 6.12). All results reported below were computed in two-fold cross-validation. For simplicity and speed, no texels were used. However, other unpublished experiments indicate that the presence or absence of texels does not have a noticeable impact on this task.

Figure 6.11. Representative examples of synthesized objects and converged simulated grasp configurations using Coelho's haptically-guided grasping system, grouped into good and poor grasps.

Figure 6.12. Example views of objects used in the grasping simulations.

6.5.1 Training Visual Models

To generate training data, Coelho's grasp simulator performed several hundred simulated grasps on instances of the three object types, using two- and three-fingered grasp controllers, and


recorded the final grasp configurations on convergence, along with the grasp quality index μ0. These data were used by Algorithm 6.2 to train visual feature models. Lacking the ability to perform an actual grasp at Step 5, the recommended grasps were simulated by comparing the recommended hand orientation θ_h with the previously executed hand orientation θ_H associated with the training image, modulo the known rotational symmetry properties of the object. Since cylinders have infinite-fold rotational symmetry, any hand orientation works equally well. Consequently, no features were ever learned for cylinders.
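For illustration, the comparison modulo rotational symmetry can be written as follows (hypothetical helper; the symmetry order appropriate for each object type is assumed to be known, e.g. 3 for an equilateral triangular prism).

```python
import numpy as np

def angular_error_mod_symmetry(theta_h, theta_H, symmetry_order):
    """Smallest angular difference between the recommended and the executed
    hand orientation, taking an n-fold rotational symmetry into account."""
    period = 2.0 * np.pi / symmetry_order
    diff = np.mod(theta_H - theta_h, period)
    return float(min(diff, period - diff))

# Example: a recommendation 92 degrees away from the executed orientation is
# effectively only 2 degrees off for an object with 4-fold symmetry.
print(np.degrees(angular_error_mod_symmetry(np.radians(0.0), np.radians(92.0), 4)))
```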

Figure 6.13. Quantitative results of hand orientation prediction: histograms of prediction error (in degrees) for (a) clean, separated data and (b) noisy, unseparated data. The rightmost bin of the right histogram includes all cases where the prediction error exceeded 60 degrees.

Figure 6.13 demonstrates that the system succeeds in learning features that are indicative of hand orientations. The data in Figure 6.13a were produced by feature models that were trained using 2-fold cross-validation on the 20 best two-fingered grasps of cubes, and the 20 best three-fingered grasps of triangular prisms available in the simulated training set. As always in this chapter, the quality of a grasp was assessed in terms of the friction coefficient μ0. These data can be expected to contain only little noise in the training signal, i.e., the actual hand orientation θ_H on grasp convergence. Performance on an independent test set is almost always excellent, with prediction error magnitudes on the order of the variation in the training signal.

The data in Figure 6.13b were produced by feature models that were trained on a representative sample of 80 grasps, including 20 grasps each of cubes and triangular prisms using two- and three-fingered grasp controllers. On such noisy training data, which contain hand orientations that produced a poor grasp, the system expends a lot of effort trying to learn these outliers. However, performance degrades gracefully because features are selected by Kolmogorov-Smirnoff distance, which prefers generic features that work well for the majority of useful training examples. On a noisy test set, most poor recommendations occur on outliers. Notably, two-fingered grasps of the triangular object are inherently unstable and unpredictable. Here, prediction errors produced by the trained system depend on the error threshold that divides "good" from "poor" predictions during training. Choosing a low threshold generally produces more accurate predictions on a test set, as long as this threshold is larger than the variation contained in the majority of the training data.

6.5.2 Using Visual Models to Cue Haptic Grasping

Several experiments were performed to evaluate the effect of visual priming on the haptic grasping procedure. The two primary evaluation criteria were the number of haptic probes executed before the grasp controller converged, and the quality of the achieved grasp on convergence in terms of the friction coefficient μ0. All experiments employ the native grasp controllers described in Section 6.3.1.

In the first experiment, a two-fingered native grasp controller was cued using features trained on the best 20 two-fingered grasps of cubes available in the training set. An analogous experi-


Figure 6.14. Results on two-fingered grasps of cubes (top half), and three-fingered grasps of triangular prisms (bottom half): histograms of the number of haptic probes and of the friction coefficient μ0. The left column shows purely haptic grasps; in the right column, grasps were cued using learned visual feature models. The rightmost bin in each histogram includes all instances with a number of probes ≥ 20, or μ0 ≥ 1, respectively.


ment was performed using three-fingered grasps of triangular prisms. The feature models were the same as those that generated the results shown in Figure 6.13a. The results are shown in Figure 6.14. For cubes, the number of lengthy grasps that required more than about 13 haptic probes was drastically reduced. Such grasps typically do not converge to a stable configuration. Likewise, the number of extremely fast grasps that required only a single probe increased substantially. For both cubes and triangular prisms, the expected number of probes was not significantly affected. However, in both cases the number of poor grasps (in terms of μ0) was dramatically reduced. Figure 6.15 shows examples of the features learned during the training procedure. Interestingly, one feature captures both the cube and the boundary of its shadow. Apparently there were enough fitting examples in the data that such a feature was advantageous.

Figure 6.15. Examples of features obtained by training two-fingered models on cubes and three-fingered models on triangular prisms, using clean data separated by object type (cf. Figure 6.14).

In a second experiment, a set of visual feature models was trained using a training set of 80 grasps, including 20 grasps each of cubes and triangular prisms using two- and three-fingered grasp controllers. This training set differed from the one used in Figure 6.13b: There, the training set was representative of the population of grasp experiences, including outliers. Here, the best 20 grasps in each of the four categories were used for training. Both the two- and three-fingered visual feature models were exposed both to cubes and to triangular prisms, and the learning procedure learned features that were specific to cubes or triangular prisms exclusively. Thus, these features gathered meaningful statistics of the experienced friction coefficients, which could then be used to recommend a two- or a three-fingered grasp (see Section 6.4.3, and Algorithm 6.2, Step 2). This experiment was run in accordance with Algorithm 6.2, Steps 1–5. Figure 6.16 shows examples of the features learned. Note that in the case of the triangular prism, a feature spuriously responded to the shadow, which is not correlated with the orientation of the object.

The results are shown in Figure 6.17. The two columns on the left display the results achieved by the purely haptic system, for two- and three-contact native controllers. In both cases, grasped objects included both cubes and triangular prisms. This explains the bimodal distributions of μ0 shown in the bottom row: The modes centered near μ0 = 0.2 mostly correspond to two-fingered grasps of cubes and three-fingered grasps of triangular prisms, while the modes centered near μ0 = 0.6 tend to correspond to three-fingered grasps of cubes and two-fingered grasps of triangular prisms. This illustrates that neither two- nor three-fingered


Figure 6.16. Examples of features obtained by training two- and three-fingered models on cubes and triangular prisms, using a single data set containing relatively noise-free grasps of cubes and triangular prisms (cf. Figure 6.17).

Figure 6.17. Grasp results on cubes and triangular prisms. The number of grasp contacts was chosen using the learned visual feature models.


This illustrates that neither two- nor three-fingered native controllers alone are sufficient to execute high-quality grasps reliably for both cubes and triangular prisms.

The rightmost column shows the results achieved if the learned visual models determine which native policy to use, and how to orient the hand at the outset of a grasp. Almost all grasps result in a very good μ₀; the second mode has almost completely disappeared. Moreover, very long trials (more than about 20 probes) are practically eliminated (cf. the two-fingered native controller on the left). However, the expected number of probes increases slightly. The reasons for this undesired effect are unclear. One possible explanation is that the angular recommendations made by the learned feature models were relatively poor, and possibly skewed by a systematic bias, due to the presence of many unreliable grasps in the training set, predominantly two-fingered grasps of triangular prisms. Even if this is true, this slight drawback is more than outweighed by the enormous gain in the quality of the resulting grasps, as evidenced by the low values of μ₀.

6.6 Discussion

This chapter described a system for learning to recommend hand orientations and finger configurations to a haptically-guided grasping system. Localized appearance-based features of the visual scene are learned that correlate reliably with observed hand orientations. If the training data are not overly noisy, specialized features emerge that respond to distinct object classes. These can then be used to recommend the number of grasp contacts to be used, based on the expected quality of the grasp. In this way, visual guidance takes place without prior knowledge, explicit segmentation or geometric reconstruction of the scene. The interaction between haptic and visual systems is a plausible model of human grasping behavior. Learning is on-line and incremental; there is no distinction between learning and execution phases. The results demonstrate the feasibility of the concept.

The main limitation of the system, as used in the experiments above, is that its performance was adversely affected by noise in the training data. It was unable to identify and ignore noisy training examples. However, it is important to note that this effect is inherent in the off-line nature of the training process. While the algorithm is designed to operate on-line, the experiments above were generated off-line using a fixed set of training data. In effect, the visual system learned solely by observing the haptic system in action. Without additional information, there was no way for it to distinguish good from poor grasp outcomes. Off-line training severs an important feedback loop between the visual and the haptic system. During on-line training, as described in Algorithm 6.2, the haptic system always uses the cues given by the visual system. If these cues are accurate, then the resulting grasps are of consistently high quality (Figures 6.14 and 6.17). In effect, once highly predictive features emerge, the resulting training data will contain little noise! Due to the KSD feature selection metric, good features will be used more often than poor features. Thus, their increased use will change the distribution of the training data, with the effect that the noise is increasingly reduced. This is a strong reason to expect that an on-line training procedure will bootstrap itself to yield reliable features by producing pure training data, which are in turn produced by reliable features. A merely observant learner is permanently confronted with the same stationary distribution of noisy training data, and wastes a lot of effort trying to learn the noise, contaminating its feature set with spurious features. In contrast, an interactive learner can actively eliminate the noise from the training data! The experimental verification of this argument must be left to future work.

The sensorimotor grasping system described in this chapter learns about objects, and how they are best grasped. This way of learning about activity afforded by the environment is reminiscent of the Gibsonian ecological approach to perception [41, 39]. The notion of affordances is a central concept in this theory. An affordance refers to the fit between an agent's capabilities and the environmental properties that make given actions possible.


According to the Gibsons, an agent learns about affordances by interacting with the environment. It discovers effects of actions and perceptual characteristics indicative of affordances. The ecological approach emphasizes that perception is direct; behaviorally useful information is extracted more or less directly from the egocentric sensory stream, without intermediate comprehensive representations such as geometric models. The Gibsonian theory is appealing to many roboticists because of its direct link between perception and action. Unfortunately, the underlying computational mechanisms are largely unclear. This is addressed by the methods described in this chapter. The haptic sensorimotor system learns about the types of grasps afforded by different objects, that is, the fit between hand and object. At the same time, the visual system learns to associate these affordances with visual appearance. Features of appearance are extracted directly from the sensory stream, which is compatible with the Gibsonian notion of direct perception. It will be interesting to explore such activity-driven learning mechanisms in more elaborate task scenarios.


CHAPTER 7

CONCLUSIONS

The first two chapters of this dissertation outlined a scenario in which autonomous robots refine their perceptual skills with growing experience, as demanded by their interaction with the environment. The following four technical chapters introduced a general framework for learning useful visual features, and applied it to two very different task domains. This final chapter collects the insights gained in the technical chapters, and reassembles them into parts of the big picture drawn at the outset. It discusses the contributions of this dissertation, and highlights some important directions for future research.

7.1 Summary of Contributions

The desire to construct autonomous robots that perform nontrivial tasks in the real world is one of the driving forces of research in artificial intelligence. It has long been argued that the world is too complex to allow the design of canned, robust action policies. Therefore, learning capabilities are generally agreed to be a crucial ingredient of sophisticated autonomous robots. However, the computer vision research community has not as widely embraced this viewpoint. In Section 1.2 I argued that visual learning is indispensable for agents that need to extract task-relevant information from uncontrolled environments. This dynamic and purposive paradigm of vision is commonly called active vision [4, 15]. The crucial role of visual learning in active vision has been pointed out in its early days [10, 9]. Since then, however, relatively little emphasis has been placed on the learning aspects of task-driven vision, although some progress is being made [76]. A central aim of this dissertation is the advancement of visual learning as an essential ingredient of task-driven vision systems. The following items summarize the key contributions of this dissertation:

• The field of computer vision has a strong tradition of viewing vision problems in isolation. Under this paradigm, the goal of vision is to generate scene descriptions for use by higher-level reasoning processes. I argue that this approach is inappropriate in many visual tasks that arise in the context of autonomous systems, and instead advocate adaptive visual systems that learn to extract just the information needed for a task.

• I proposed an iterative, on-line paradigm for task-driven perceptual learning that cycles through sensation, action, and evaluation steps (Figures 2.1 on page 8, 4.2 on page 35, and 6.6 on page 102). Learning takes place only when needed, and is highly focused on solving problems encountered on line. This view of incremental visual learning departs from established traditions in computer vision.

• Extending Amit and Geman's work on combinatorial features [8, 5] and combining it with highly expressive and biologically plausible feature primitives, I introduced an infinite combinatorial and parametric feature space that contains useful features for a variety of tasks. Powerful features are sought in a simple-to-complex fashion by random sampling directly in training images.


• I demonstrated that powerful features can be learned in a task-driven manner by a general system that starts out with very little prior commitment to any particular task scenario. To reinforce this message, my algorithms are designed to learn only a small number of features that are highly distinctive individually.

• In the context of visual recognition, I generalized the conventional true/false dichotomy of classification by a multi-class framework that allows each class to be treated separately. This paradigm unifies the traditional paradigms of object recognition and detection. I also motivated an implementation on the basis of multiple Bayesian networks. This representation abandons the closed-world assumption made by conventional recognition systems.

• The efficacy of these algorithms and representations was evaluated on object discrimination tasks that simulated an on-line learning scenario.

• Motivated by the phenomenon of human expertise in visual recognition, I proposed a method for learning improved features by introducing explicit requirements of discriminative power. As a result, test-set performance significantly improves according to all performance metrics.

• On the basis of the feature space, I developed a method for learning features to recommend configuration parameters for a haptically-guided grasping system that resembles the way humans use visual information to pre-shape their hands while reaching for an object. The method results in improved grasps, without any explicit extraction or representation of object geometry.

Two very different applications demonstrated the potential of the general approach. The object discrimination task is essentially a classification problem, with the goal of finding features that are highly predictive of particular classes. A strongly focused feature search procedure explicitly applies simple-to-complex search heuristics. Candidate features are evaluated immediately, which requires a cooperative environment that can provide example images. The grasping task mainly constitutes a regression problem. Here, features are sought whose image-plane orientations robustly correlate with hand orientations. Feature evaluation is distributed over many successive grasping trials, and does not require example images. The simple-to-complex feature construction strategy is more implicit in this algorithm.

All experiments were performed in simulation. On-line learning tasks in real environments will benefit from an efficient implementation, especially if it capitalizes on the high degree of parallelism inherent in the feature search algorithms. Importantly, the algorithms presented here can in principle be applied in on-line, interactive tasks. This constitutes an advancement in the technology for building autonomous robots that perform on-line perceptual learning. The methodology begins with an uncommitted learning system that makes far weaker assumptions about the world and the task than most existing vision systems. In particular, no explicit assumptions are made regarding object segmentation, contour extraction, presence of clutter or occlusions, the number of target objects or classes present in a scene, or the number of classes to be learned. Nevertheless, target concepts are learned with remarkable accuracy, as reported in Chapters 5 and 6.

In addition to advancing technology, this dissertation also contributes to our understanding of human perceptual learning in that it provides a computational model of feature development. Sections 2.1 and 5.1 presented strong evidence that humans learn features not unlike those contained in the feature space introduced in Chapter 3. For the first time, a detailed computational model of feature learning was presented that explains many of the phenomena observed by psychologists [109, 107, 120, 38, 41, 39]. Key principles of the model are biologically plausible, including the primitive features and the principle of composition.


It is not unlikely that humans perform a simple-to-complex feature search, producing empirically useful features that are discarded or refined as required by additional experience. Clearly, biological vision systems employ more sophisticated features not contained in the constrained space used here, as will be discussed in the following section.

7.2 Future Directions

The algorithms developed in this dissertation are far more general than the restricted scenarios used in the experiments. Much more systematic experimentation is required to illuminate their properties. The recognition system needs to be evaluated in the presence of multiple target objects, occlusions, and non-target clutter objects. Individual target classes should include different objects, including highly heterogeneous sets of objects that differ widely in appearance. Incremental introduction of target classes (in the sense of Section 5.4.4) requires further study, e.g. by examining the effect of presentation order on the learning process. Most importantly, both the recognition and the grasping system need to be integrated in real-world interactive tasks, breaking away from closed training sets. It was pointed out that both the expert learning and the grasping applications are expected to undergo a noticeable performance boost if training data are generated dynamically. Possible testbeds include humanoid or autonomous mobile robots, smart rooms, etc.

This dissertation explored two different paradigms of the general feature learning method. One was concerned with supervised classification, the other with regression. A third practical paradigm that should be explored involves an agent that learns multi-step reactive policies in a reinforcement learning framework. The agent takes policy actions based on the currently observed state. The only feedback available to the agent consists of a scalar reward that may be delayed over several policy actions, or may be given only at sparse intervals. In this framework, the state can be represented in terms of learned features that capture task-relevant information about the environment. For example, the agent can begin with an impoverished perceptual state space such that all world states are aliased. Then, the feature learning subsystem seeks perceptual features that resolve hidden state, similar in spirit to McCallum's U-Tree algorithm [67].

The following sections discuss important limitations and possible enhancements of the ideas and methods developed in this dissertation, and outline how these can be addressed by future research.

7.2.1 Primitive Features and their Composition

The two specific primitive feature types that form the basis of the feature space introduced in Chapter 3 were chosen because they constitute a small set and express complementary aspects of appearance. For general tasks, other types of primitive features should be introduced. Two of the most important features missing in the current system are color and statistical texture features. Gaussian-derivative features of color-opponent image signals have been explored for appearance-based object recognition [43]. Statistical features analyze the local intensity distribution without regard to its spatial organization, and thus complement the highly structured and oriented texels. For example, such features would probably prove valuable in the (gray-scale) Mel task.

An important characteristic of the feature space introduced in Chapter 3 is that it can be augmented almost arbitrarily by adding other types of primitive features and rules of composition. These additions are not limited to visual operators, but can include information from other perceptual modalities, or generally, any type of assertion about the world. Different modalities can even be combined within the same compound feature through appropriate compositional rules. For example, combining visual and haptic information using this framework would open up new perspectives in cross-modally-guided grasping.


The temporal and three-dimensional nature of the world gives rise to more features, many of which fit into this framework. For example, primitive or compound features can be computed on optical flow or disparity fields. These features are easily accessible to interactive agents, and are likely important in many everyday scenarios.

The current parametric representation of compound features is not very plausible biologically. For representations compatible with current knowledge about the organization of the visual brain, Riesenhuber and Poggio's model [96] can serve as a starting point.

7.2.2 Higher-Level Features

In its present form, the feature space contains primitive and compound features. This space excludes many types of features that are very informative to humans, but are not easily expressible within this framework. Examples include Gestalt features such as parallelism, symmetry (axial and radial), and continuity of contours. Another example is cardinality. To illustrate, a bicycle wheel has many spokes, and a triangle has three corners. These abstract concepts are not directly expressible by the current feature set, but appear to be very powerful cues to humans for scene interpretation. Some higher-level features may in principle be learnable by composing them from lower-level features.

7.2.3 Efficient Feature Search

The current feature learning algorithm engages a blind and random generate-and-test search procedure in static images. The only guidance available is given in the form of salient points, which constrain the search space to points that are deemed to be most important, independently of the task. In reality, there are much more powerful ways to draw attention to important features. For example, I postulate that the ability to track feature points of a moving object can give powerful cues about what appearance-based features are stable across a wide range of viewpoints. Also, such tracking allows the construction of (at least partially) view-invariant features of local three-dimensional object structure.

There may also be other, more direct ways to extract distinctive features, in the spirit of discriminative eigen-subspaces. I doubt that this is always possible, as even humans do not – at least not always – extract distinctive features directly in this way. On sufficiently difficult discrimination tasks, we struggle to identify distinctive features unless they are pointed out to us. There is strong psychological evidence suggesting that the subjective discriminability and organization of objects is altered by experience [107].

However, there is certainly much room for improvement on the current, blind generate-and-test method. One interesting possibility is to design a meta-learning process that notices patterns in the ways distinctive features are found, and thus learns how to look for good features efficiently.

7.2.4 Redundancy

Both mammalian and state-of-the-art feature-based computer vision systems owe their performance largely to their use of a large number of features [71, 104, 96, 77, 78, 8, 6, 82, 66]. The resulting redundancy or overcompleteness creates robustness to various kinds of image variation and degradation. The present work on feature learning purposely did not use this type of redundancy because without it, the distinctive power of individual features – and thus the success in learning – stands out much more. However, for practical applications such redundancy is required to increase robustness. Future work should investigate ways of learning large numbers of partially redundant features that are still powerful individually.

Recall from Chapter 5 that the trained recognition system rarely produces wrong results.


Almost all incorrect recognitions are either ignorant or ambiguous, which are caused by a failure to find all relevant distinctive features in an image. This is an artifact of the extremely small number of features learned by the system. The availability of redundant features would greatly enhance the likelihood of finding distinctive features, which in turn would strongly reduce the number of ambiguous and ignorant recognitions. This makes me very confident that an extension of the current learning system with many redundant (but powerful) features would achieve results competitive with the best of today's, much more specialized recognition systems.

7.2.5 Integrating Visual Skills

Along with related work, this dissertation demonstrated some of the great potential of appearance-based vision. However, there are tasks that are not solved on the basis of appearance alone. There is a role for other techniques such as shape-from-X, geometric model matching, or a functional interpretation of items seen. Currently, such different paradigms are usually applied in isolation. We do not know how to integrate them into a unified whole, as the human visual system appears to do. Ideally, an autonomous robot should have at its disposal a repertoire of essential visual tools, and should learn how to use these jointly or individually.

7.3 Uncommitted Learning in Open Environments

The above discussion pointed out some features and rules for composition that should be added to extend the range of concepts expressible by the learning system. However, there are two important caveats: First, the combinatorics of the feature search grow badly with the number of primitives and rules (see Algorithm 4.10 on page 43). The second caveat is an instance of a critical problem shared by all inductive learning systems¹: Even if unlimited computational resources were available to cope with arbitrary numbers of primitives and rules, having too many of these will necessarily hurt the learning process. The more degrees of freedom are available to an inductive learning system, the more ways exist to find patterns in the data. If the number of degrees of freedom is large in relation to the number of variables in the data, many spurious patterns will be found. There is no way to know which of these are valid in the sense that they will generalize to unseen test instances. Therefore, all inductive learners are necessarily limited in their expressive power. The guiding principles implementing this limitation are collectively called the inductive bias of the learner. In practice, the designer of a learning procedure must tailor the inductive bias to the presumed structure of the task.

A novelty of this work is the introduction of an infinite feature space based on commonly used visual feature primitives, expressive enough to learn training sets of almost any classification task – if necessary, by memorizing the training images. Without any guiding principles, this learning system would be bias-free, and therefore unable to generalize. The bias is given by a simple-to-complex feature sampling procedure that makes the learner prefer simple features. The underlying assumption is that simple features generally provide for the best generalization, or in other words, that the "real" structure underlying a task tends to be expressible by simple features from the feature space (cf. Section 3.6, page 28).

This assumption is not always true, as exemplified by the Mel task in Chapters 4 and 5. The features found by the learner worked well on the training set, but generalized poorly nevertheless. It appears that the features represented accidental structure in the training set, which does not generalize to test images. In some cases, spurious structure represented by a feature may be supported by a large number of training images. In other cases, it may be supported by only very few images, resulting in overfitted features. Data sets that violate the structural assumptions reflected in the inductive bias are especially prone to this problem.

¹Informally, an inductive learning system learns a function from training examples, and then generalizes to unseen test instances on the basis of this function.


This is apparently true of the Mel task, where simple combinations of edgels and texels did not tend to capture the real structure in the data.

How do humans solve the bias problem? Adult humans seem to be able to learn from very few examples, and are often able to pick out distinctive features with apparent ease. I suspect that the answer is related to our – learned – ability to control our learning bias. Guided by a wealth of background knowledge and functional understanding of the world, we are able to constrain our feature space adaptively and efficiently, so that relevant structure in the data is easily found. In other words, humans are not simple inductive learners, but our learning is supplemented by insights into functional and causal structure of the world. Likewise for artificial systems, the bias problem implies that straightforward inductive mechanisms are insufficient to build effective uncommitted learners; they must be supplemented by other mechanisms [124].

Notably, it seems plausible that this knowledge is itself acquired by inductive mechanisms. One can envision a hierarchy of inductive learners. Many learners learn specialized tasks on the basis of low-level features such as sensory signals. Their output signals constitute highly structured input signals to other learners, enabling these to learn functions that would be impossible to learn directly using the low-level features, because of their excessive degrees of freedom. These learning components can be trained in order, individually and largely independently, generally from the bottom up. This order can be given by a teacher who knows effective learning schedules. Where does the teacher come from? Imagine a world populated by a large number of such multilayer learning agents, all of which apply random learning schedules in an attempt to discover useful structure in the world. Because of their large number, some of them will eventually be successful. These can then become teachers of the others, sharing their empirically successful learning strategies.

This dissertation made progress toward building uncommitted perceptual systems that can learn to become expert at specific tasks. The next quantum leap toward understanding human learning on the one hand, and toward constructing autonomous artificial learning systems for open environments on the other hand, requires the defeat of the bias problem. This poses a great challenge to computer vision and machine learning research.


APPENDIX

EXPECTATION-MAXIMIZATION FOR VON MISES MIXTURES

The Expectation-Maximization (EM) algorithm [28] is an elegant class of methods for estimating the parameters $a$ of a probabilistic model where only some of the underlying variables are known. The EM algorithm begins with any suitable initial estimate of the parameters $a$, and then alternates the following two steps until convergence:

Expectation: Given the current estimate of the model parameters $a$ and the observed data, compute the expected values of the unobserved portion of the data.

Maximization: Given the observed data and the expected values of the unobserved data, compute a new set of model parameters $a$ that maximizes some objective function.

If the objective function is continuous, this procedure will converge to a local maximum. A typical application of EM is the estimation of the parameters of a mixture model

$$p_{\mathrm{mix}}(\theta \mid a) \;=\; \sum_{k=1}^{K} p(\theta \mid a_k)\, P(k) \tag{A.1}$$

to fit an observed set $\Theta$ of data points $\theta_i$, $i = 1, \ldots, N$. The mixing proportions $P(k)$ and the components $k_i$ that generated each data point $\theta_i$ are unknown. The objective is to find the parameter vector $a_k$ describing each component density $p(\theta \mid a_k)$.

The Expectation step computes the probabilities $P(k \mid \theta_i)$ that each data point $\theta_i$ was generated by component $k$, given the current parameter estimates $a_k$ and $P(k)$, using Bayes' Rule:

$$P(k \mid \theta_i) \;=\; \frac{p(\theta_i \mid a_k)\, P(k)}{\sum_{j=1}^{K} p(\theta_i \mid a_j)\, P(j)} \;=\; \frac{p(\theta_i \mid a_k)\, P(k)}{p_{\mathrm{mix}}(\theta_i \mid a)} \tag{A.2}$$

At the Maximization step, a new set of parameters $a_k$, $k = 1, \ldots, K$, is computed to maximize the log-likelihood of the observed data:

$$\log L(\Theta \mid a) \;=\; \sum_{i=1}^{N} \log p_{\mathrm{mix}}(\theta_i \mid a) \tag{A.3}$$

At the maximum, the partial derivatives with respect to all parameters vanish:

$$\begin{aligned}
0 \;=\; \frac{\partial}{\partial a_k} \log L(\Theta \mid a) &\;=\; \sum_{i=1}^{N} \frac{P(k)}{p_{\mathrm{mix}}(\theta_i \mid a)}\, \frac{\partial}{\partial a_k} p(\theta_i \mid a_k) \\
&\;=\; \sum_{i=1}^{N} \frac{P(k \mid \theta_i)}{p(\theta_i \mid a_k)}\, \frac{\partial}{\partial a_k} p(\theta_i \mid a_k)
\end{aligned} \tag{A.4}$$

where the second line of (A.4) follows from substituting Equation A.2. The Maximization is then computed by solving this system (A.4) for all $a_k$.


Moreover, the estimates of the component priors are updated by averaging the data-conditional component probabilities computed at the Expectation step:

$$P(k) \;=\; \frac{1}{N} \sum_{i=1}^{N} P(k \mid \theta_i) \tag{A.5}$$
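To make these two steps concrete, here is a minimal sketch in Python/NumPy (my own illustration, not part of the dissertation's implementation; the function names are hypothetical). It assumes the component densities are standard von Mises densities, $p(\theta \mid a_k) = \exp(\kappa_k \cos(\theta - \mu_k)) / (2\pi I_0(\kappa_k))$, consistent with the distributions used in Chapter 6, and computes the responsibilities of Equation A.2 and the prior update of Equation A.5:

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def von_mises_pdf(theta, mu, kappa):
    """Standard von Mises density on the circle (assumed component density)."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * i0(kappa))

def e_step(theta, mu, kappa, prior):
    """Equation A.2: responsibilities P(k | theta_i) for all data points.

    theta: (N,) angles; mu, kappa, prior: (K,) component parameters.
    Returns an (N, K) matrix whose rows sum to one."""
    dens = von_mises_pdf(theta[:, None], mu[None, :], kappa[None, :])
    joint = dens * prior[None, :]                      # p(theta_i | a_k) P(k)
    return joint / joint.sum(axis=1, keepdims=True)    # divide by p_mix(theta_i | a)

def update_priors(resp):
    """Equation A.5: new mixing proportions as averaged responsibilities."""
    return resp.mean(axis=0)
```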

For the particular case of a mixture of von Mises distributions, Equation A.4 is instantiated for the parameters $\mu$ and $\kappa$ for each mixture component $k$. For $\mu_k$, Equation A.4 becomes (cf. Equation 6.6, page 104)

$$\frac{\partial}{\partial \mu_k} \log L(\Theta \mid a) \;=\; \kappa_k \sum_{i=1}^{N} P(k \mid \theta_i)\, \sin(\theta_i - \mu_k) \;=\; 0 \tag{A.6}$$

$$\hat{\mu}_k \;=\; \pm \sin^{-1} \frac{\displaystyle\sum_{i=1}^{N} P(k \mid \theta_i)\, \sin\theta_i}{\sqrt{\displaystyle\sum_{i=1}^{N} \sum_{j=1}^{N} P(k \mid \theta_i)\, P(k \mid \theta_j)\, \cos(\theta_i - \theta_j)}} \tag{A.7}$$

Only one of the two solutions given by Equation A.7 is actually a root of Equation A.6. Which one this is can be determined by backsubstitution into Equation A.6. The root $\hat{\mu}_k$ thus determined either minimizes or maximizes the log-likelihood (A.3), as can be verified by consulting the second derivative:

$$\frac{\partial^2}{\partial \mu_k^2} \log L(\Theta \mid a) \;=\; -\kappa_k \sum_{i=1}^{N} P(k \mid \theta_i)\, \cos(\theta_i - \mu_k) \tag{A.8}$$

If the second derivative (A.8) at $\hat{\mu}_k$ is less than zero, then $\mu_k = \hat{\mu}_k$ maximizes the log-likelihood (A.3); otherwise, $\hat{\mu}_k$ minimizes the log-likelihood, and its mirror image on the circle, $\mu_k = \hat{\mu}_k + \pi$, is to be used instead.
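The following sketch (again my own, with hypothetical names) illustrates this mean-direction update for a single component: it forms the two candidate solutions of Equation A.7, keeps the one that satisfies Equation A.6, and applies the second-derivative test of Equation A.8, adding $\pi$ when the chosen root is a minimum:

```python
import numpy as np

def update_mu(theta, resp_k, kappa_k):
    """Mean-direction update for one von Mises component (Equations A.6-A.8).

    theta: (N,) observed angles; resp_k: (N,) responsibilities P(k | theta_i)."""
    S = np.sum(resp_k * np.sin(theta))
    C = np.sum(resp_k * np.cos(theta))
    # The double sum in (A.7) equals S**2 + C**2, so the arcsin argument is S / R.
    R = np.hypot(S, C)
    mu_hat = np.arcsin(S / R)
    candidates = (mu_hat, -mu_hat)        # the +/- branches of Equation A.7
    score = lambda mu: kappa_k * np.sum(resp_k * np.sin(theta - mu))   # Equation A.6
    mu_root = min(candidates, key=lambda mu: abs(score(mu)))
    # Second-derivative test (Equation A.8): a negative value indicates a maximum.
    second = -kappa_k * np.sum(resp_k * np.cos(theta - mu_root))
    if second >= 0.0:
        mu_root += np.pi                  # use the mirror image on the circle
    return np.mod(mu_root, 2.0 * np.pi)
```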

In the case of $\kappa$, the partial derivatives (cf. Equation 6.7)

$$\frac{\partial}{\partial \kappa_k} \log L(\Theta \mid a) \;=\; -\sum_{i=1}^{N} P(k \mid \theta_i)\, \frac{I_{-1}(\kappa_k) + I_1(\kappa_k) - 2\, I_0(\kappa_k)\, \cos(\theta_i - \mu_k)}{2\, I_0(\kappa_k)} \;=\; 0$$

cannot be solved for $\kappa_k$ in closed form because the modified Bessel functions are not invertible algebraically. The $\kappa_k$ have to be approximated via numerical optimization, e.g. gradient descent. In other words, each iteration of the EM algorithm involves $K$ one-dimensional numerical optimizations. In the face of this complexity, it seems simpler to perform a single $(3K-1)$-dimensional numerical optimization to find the parameters $\mu_k$, $\kappa_k$, and $P(k)$ of the mixture model (A.1) directly, instead of using the iterative EM algorithm.
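One possible form of such a one-dimensional numerical search is sketched below (my own illustration; the use of SciPy's bounded scalar minimizer and the upper bound on $\kappa$ are assumptions, not part of the dissertation). It maximizes the responsibility-weighted von Mises log-likelihood of a single component with respect to $\kappa_k$, which, with the responsibilities held fixed, has the same stationary points as the score equation above:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import i0e  # exponentially scaled modified Bessel function I_0

def update_kappa(theta, resp_k, mu_k, kappa_max=500.0):
    """Numerical update of the concentration kappa_k of one von Mises component.

    theta: (N,) observed angles; resp_k: (N,) responsibilities P(k | theta_i);
    mu_k: current mean direction of the component."""
    def neg_weighted_loglik(kappa):
        # log I_0(kappa) computed as log(i0e(kappa)) + kappa to avoid overflow
        log_norm = np.log(2.0 * np.pi) + np.log(i0e(kappa)) + kappa
        log_dens = kappa * np.cos(theta - mu_k) - log_norm
        return -np.sum(resp_k * log_dens)

    res = minimize_scalar(neg_weighted_loglik, bounds=(1e-6, kappa_max),
                          method="bounded")
    return res.x
```

In a complete implementation, these updates for $\mu_k$, $\kappa_k$, and $P(k)$ would simply be alternated with the Expectation step until the log-likelihood (A.3) stops improving; the $(3K-1)$-dimensional direct optimization mentioned above is an equally valid alternative.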


BIBLIOGRAPHY

[1] Acredolo, L. P. Behavioral approaches to spatial orientation in infancy. In The Development and Neural Bases of Higher Cognitive Functions, A. Diamond, Ed., vol. 608. Annals of the New York Academy of Sciences, 1990, pp. 596–607.

[2] Acredolo, L. P., Adams, A., and Goodwyn, S. W. The role of self-produced movement and visual tracking in infant spatial orientation. Journal of Experimental Child Psychology 38 (1984), 312–327.

[3] Allen, P. K. Surface descriptions from vision and touch. In Proceedings of the 1984 Conference on Robotics (Atlanta, GA, Mar. 1984), IEEE, pp. 394–397.

[4] Aloimonos, Y., Weiss, I., and Bandopadhay, A. Active vision. International Journal of Computer Vision 1 (1988), 333–356.

[5] Amit, Y., and Geman, D. Shape quantization and recognition with randomized trees. Neural Computation 9 (1997), 1545–1588.

[6] Amit, Y., and Geman, D. A computational model for visual selection. Neural Computation 11, 7 (Oct. 1999), 1691–1715.

[7] Amit, Y., Geman, D., and Jedynak, B. Efficient focusing and face detection. In Face Recognition: From Theory to Applications, H. Wechsler, P. Phillips, V. Bruce, F. F. Soulie, and T. Huang, Eds., NATO ASI Series F. Springer, Berlin, 1998.

[8] Amit, Y., Geman, D., and Wilder, K. Joint induction of shape features and tree classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 11 (1997), 1300–1305.

[9] Bajcsy, R., and Campos, M. Active and exploratory perception. Computer Vision, Graphics, and Image Processing: Image Understanding 56, 1 (July 1992), 31–40.

[10] Ballard, D. H. Animate vision. Artificial Intelligence 48 (1991), 57–86.

[11] Ballard, D. H., Hayhoe, M. M., Pook, P. K., and Rao, R. P. N. Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences 20, 4 (Dec. 1995), 723–767.

[12] Banks, M. S., Aslin, R. N., and Letson, R. D. Sensitive period for the development of human binocular vision. Science 190 (November 1975), 675–677.

[13] Biernacki, C., Celeux, G., and Govaert, G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 7 (July 2000), 719–725.

[14] Black, M. J., and Jepson, A. D. EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. In European Conference on Computer Vision (Cambridge, UK, 1996), pp. 329–342.

[15] Blake, A., and Yuille, A., Eds. Active Vision. MIT Press, Cambridge, Massachusetts, 1992.


[16] Califano, A., and Mohan, R. Multidimensional indexing for recognizing visual shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 4 (April 1994), 373–392.

[17] Canny, J. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 6 (1986), 679–698.

[18] Cho, K., and Dunn, S. M. Learning shape classes. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 9 (1994), 882–888.

[19] Chomat, O., Colin de Verdiere, V., Hall, D., and Crowley, J. L. Local scale selection for Gaussian based description techniques. In European Conference on Computer Vision (2000).

[20] Chomat, O., and Crowley, J. Probabilistic recognition of activity using local appearance. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'99) (Ft. Collins, CO, June 1999), vol. 2, IEEE Computer Society, pp. 104–109.

[21] Coelho, Jr., J. A., and Grupen, R. A. A control basis for learning multifingered grasps. Journal of Robotic Systems 14, 7 (1997), 545–557.

[22] Coelho, Jr., J. A., and Grupen, R. A. Dynamic control models as state abstractions. In NIPS'98 Workshop on Abstraction and Hierarchy in Reinforcement Learning (Breckenridge, CO, 1998).

[23] Coelho, Jr., J. A., and Grupen, R. A. Learning in non-stationary conditions: A control theoretic approach. In Proceedings of the Seventeenth International Conference on Machine Learning (2000), Morgan Kaufmann.

[24] Coelho, Jr., J. A., Piater, J. H., and Grupen, R. A. Developing haptic and visual perceptual categories for reaching and grasping with a humanoid robot. In First IEEE-RAS International Conference on Humanoid Robots (2000).

[25] Cohen, P. R., Atkin, M. S., Oates, T., and Beal, C. R. Neo: Learning conceptual knowledge by sensorimotor interaction with an environment. In Proceedings of the First International Conference on Autonomous Agents (1997), pp. 170–177.

[26] Colin de Verdiere, V., and Crowley, J. L. Visual recognition using local appearance. In Fifth European Conference on Computer Vision (June 1998), vol. 1 of Lecture Notes in Computer Science, Springer, pp. 640–654.

[27] de Jong, E. D., and Steels, L. Generation and selection of sensory channels. In Evolutionary Image Analysis, Signal Processing and Telecommunications First European Workshops, EvoIASP'99 and EuroEcTel'99 Joint Proceedings (Berlin, May 1999), LNCS, Springer-Verlag, pp. 90–100.

[28] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B) 39, 1 (1977), 1–38.

[29] Draper, B. A., Bins, J., and Baek, K. ADORE: Adaptive object recognition. Videre 1, 4 (2000), 86–99.

[30] Fisher, N. I. Statistical Analysis of Circular Data. Cambridge University Press, 1993.

[31] Fisher, R. A. Contributions to Mathematical Statistics. John Wiley, New York, 1950.


[32] Fisher, R. B., and MacKirdy, A. Integrating iconic and structured matching. In European Conference on Computer Vision (1998).

[33] Fleuret, F., and Geman, D. Graded learning for object detection. Tech. rep., Department of Mathematics and Statistics, University of Massachusetts, Amherst, Feb. 1999.

[34] Florack, L. M. J. The Syntactic Structure of Scalar Images. PhD thesis, University of Utrecht, 1993.

[35] Freeman, W. T., and Adelson, E. H. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 9 (1991), 891–906.

[36] Friedman, J. H., and Tukey, J. W. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers 23 (1974), 881–890.

[37] Fukunaga, K., and Koontz, W. L. G. Applications of the Karhunen-Loeve expansion to feature selection and ordering. IEEE Transactions on Computers C-19 (1970), 917–923.

[38] Gauthier, I., and Tarr, M. J. Becoming a "Greeble" expert: Exploring mechanisms for face recognition. Vision Research 37, 12 (1997), 1673–1682.

[39] Gibson, E. J., and Pick, A. An Ecological Approach to Perceptual Learning and Development. Oxford University Press, in press.

[40] Gibson, E. J., and Spelke, E. S. The development of perception. In Handbook of Child Psychology Vol. III: Cognitive Development, J. H. Flavell and E. M. Markman, Eds., 4th ed. Wiley, 1983, ch. 1, pp. 2–76.

[41] Gibson, J. J. The Ecological Approach to Visual Perception. Houghton Mifflin, Boston, 1979.

[42] Grimson, W. E. L., and Lozano-Perez, T. Model-based recognition and localization from tactile data. Journal of Robotics Research 3, 3 (1986), 3–35.

[43] Hall, D., Colin de Verdiere, V., and Crowley, J. L. Object recognition using coloured receptive fields. In European Conference on Computer Vision (2000).

[44] Held, R. Binocular vision – behavioral and neuronal development. In Neonate Cognition: Beyond the Blooming Buzzing Confusion, J. Mehler and R. Fox, Eds. Lawrence Erlbaum Associates, 1985, pp. 37–44. Reprinted in [53], pp. 152–158.

[45] Hubel, D. H., and Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology (London) 160 (1962), 106–154.

[46] Huttenlocher, D. P., and Ullman, S. Recognizing solid objects by alignment with an image. International Journal of Computer Vision 5, 2 (1990), 195–212.

[47] Intrator, N. Feature extraction using an unsupervised network. Neural Computation 4 (1992), 98–107.

[48] Intrator, N., and Gold, J. I. Three-dimensional object recognition using an unsupervised BCM network: The usefulness of distinguishing features. Neural Computation 5 (1993), 61–74.

[49] Jagersand, M., and Nelson, R. Visual space task specification, planning and control. In Proc. IEEE Symposium on Computer Vision (1995), pp. 521–526.


[50] Jedynak, B., and Fleuret, F. Reconnaissance d'objets 3D à l'aide d'arbres de classification. In Image'Com '96 (Bordeaux, France, 20–22 May 1996).

[51] Jensen, F. V. An introduction to Bayesian networks. Springer, New York, May 1996.

[52] John, G. H., Kohavi, R., and Pfleger, K. Irrelevant features and the subset selection problem. In Proceedings of the Eleventh International Conference on Machine Learning (San Francisco, 1994), W. W. Cohen and H. Hirsh, Eds., Morgan Kaufmann.

[53] Johnson, M. H., Ed. Brain Development and Cognition. Bracewell, New York, 1993.

[54] Kandel, E. R., Schwartz, J. H., and Jessell, T. M., Eds. Essentials of Neural Science and Behavior. Appleton & Lange, Stamford, Connecticut, 1995.

[55] Knoll, T. F., and Jain, R. C. Recognizing partially visible objects using feature indexed hypotheses. IEEE Journal of Robotics and Automation RA-2, 1 (March 1986), 3–13.

[56] Koenderink, J. J. The structure of images. Biological Cybernetics 50 (1984), 363–370.

[57] Koenderink, J. J., and van Doorn, A. J. Representation of local geometry in the visual system. Biological Cybernetics 55 (1987), 367–375.

[58] Kruger, N., and Ludtke, N. ORASSYLL: Object recognition with autonomously learned and sparse symbolic representations based on local line detectors. In British Machine Vision Conference (1998).

[59] Kruskal, J. B. Linear transformation of multivariate data to reveal clustering. In Multidimensional Scaling: Theory and Application in the Behavioural Sciences, R. N. Shephard, A. K. Romney, and S. K. Nerlove, Eds. Seminar Press, New York, 1972, pp. 179–191.

[60] Lauritzen, S. L. Propagation of probabilities, means, and variances in mixed graphical models. Journal of the American Statistical Association (Theory and Methods) 87, 420 (1992), 1098–1108.

[61] Leonardis, A., and Bischof, H. Robust recognition using eigenimages. Computer Vision and Image Understanding 78, 1 (Apr. 2000), 99–118.

[62] Lindeberg, T. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Dordrecht, Netherlands, 1994.

[63] Lindeberg, T. Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision 30, 2 (1998), 117–154.

[64] Lindeberg, T. Feature detection with automatic scale selection. International Journal of Computer Vision 30, 2 (1998), 77–116.

[65] Logothetis, N. K., and Pauls, J. Psychophysical and physiological evidence for viewer-centered object representation in the primate. Cerebral Cortex 3 (1995), 270–288.

[66] Lowe, D. G. Towards a computational model for object recognition in IT cortex. In Proceedings of the IEEE International Workshop on Biologically Motivated Computer Vision (May 2000), S.-W. Lee, H. H. Bülthoff, and T. Poggio, Eds., vol. 1811 of LNCS, Springer-Verlag, pp. 20–31.

[67] McCallum, A. K. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, Department of Computer Science, University of Rochester, Rochester, NY, 1995. Revised 1996.


[68] McCallum, R. A. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning (1993), Morgan Kaufmann.

[69] McCallum, R. A. Utile suffix memory for reinforcement learning with hidden state. Tech. Rep. 549, Dept. of Computer Science, University of Rochester, Rochester, NY, Dec. 1994.

[70] McCarty, M. E., Clifton, R. K., Ashmead, D. H., Lee, P., and Goubet, N. How infants use vision for grasping objects. Child Development (in press).

[71] Mel, B. W. SEEMORE: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition. Neural Computation 9 (1997), 777–804.

[72] Moghaddam, B., and Pentland, A. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 7 (July 1997), 696–710.

[73] Murase, H., and Nayar, S. K. Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision 14 (1995), 5–24.

[74] Nayar, S. K., Murase, H., and Nene, S. A. Learning, positioning, and tracking visual appearance. In Proceedings of the International Conference on Robotics and Automation (San Diego, CA, May 1994), IEEE.

[75] Nayar, S. K., Nene, S. A., and Murase, H. Subspace methods for robot vision. IEEE Transactions on Robotics and Automation 12, 5 (Oct. 1996), 750–758.

[76] Nayar, S. K., and Poggio, T., Eds. Early Visual Learning. Oxford University Press, Feb. 1996.

[77] Nelson, R. C., and Selinger, A. A cubist approach to object recognition. In Proceedings of the International Conference on Computer Vision (1998).

[78] Nelson, R. C., and Selinger, A. Large-scale tests of a keyed, appearance-based 3-D object recognition system. Vision Research 38, 15–16 (1998).

[79] Nene, S. A., Nayar, S. K., and Murase, H. Columbia object image library (COIL-20). Tech. Rep. CUCS-005-96, Columbia University, New York, Feb. 1996.

[80] Olshausen, B. A., and Field, D. J. Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature 381 (1996), 607–609.

[81] Papageorgiou, C. P., Oren, M., and Poggio, T. A general framework for object detection. In Proceedings of the International Conference on Computer Vision (Jan. 1998).

[82] Papageorgiou, C. P., and Poggio, T. A pattern classification approach to dynamical object recognition. In Proceedings of the International Conference on Computer Vision (Corfu, Greece, Sept. 1999).

[83] Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, 1988.

[84] Piater, J. H., and Grupen, R. A. A framework for learning visual discrimination. In Proceedings of the Twelfth International FLAIRS Conference (1999), Florida Artificial Intelligence Research Society, AAAI Press, pp. 84–88.


[85] Piater, J. H., and Grupen, R. A. Learning visual recognition with Bayesian networks. Computer Science Technical Report 99-43, University of Massachusetts, Amherst, MA, Mar. 1999.

[86] Piater, J. H., and Grupen, R. A. Toward learning visual discrimination strategies. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (June 1999), vol. 1, IEEE Computer Society, pp. 410–415.

[87] Piater, J. H., and Grupen, R. A. Constructive feature learning and the development of visual expertise. In Proceedings of the Seventeenth International Conference on Machine Learning (2000), Morgan Kaufmann, pp. 751–758.

[88] Piater, J. H., and Grupen, R. A. Distinctive features should be learned. In Proceedings of the IEEE International Workshop on Biologically Motivated Computer Vision (May 2000), S.-W. Lee, H. H. Bülthoff, and T. Poggio, Eds., vol. 1811 of LNCS, Springer-Verlag, pp. 52–61.

[89] Piater, J. H., and Grupen, R. A. Feature learning for recognition with Bayesian networks. In Proceedings of the International Conference on Pattern Recognition (Sept. 2000).

[90] Rao, R. P. N., and Ballard, D. H. An active vision architecture based on iconic representations. Artificial Intelligence 78 (1995), 461–505.

[91] Rao, R. P. N., and Ballard, D. H. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation 9, 4 (May 1997), 721–763.

[92] Ravela, S., and Manmatha, R. Retrieving images by appearance. In Proceedings of the International Conference on Computer Vision (1998), pp. 608–613.

[93] Ravela, S., and Manmatha, R. Gaussian filtered representations of images. In The Encyclopedia on Electrical & Electronics Engineering, J. G. Webster, Ed. John Wiley & Sons, 1999, pp. 289–307.

[94] Ravela, S., Manmatha, R., and Riseman, E. M. Image retrieval using scale space matching. In Fourth European Conference on Computer Vision (Cambridge, U.K., April 1996), Springer, pp. 273–282.

[95] Rider, E. A., and Rieser, J. J. Pointing at objects in other rooms: Young children's sensitivity to perspective after walking with and without vision. Child Development 59 (1988), 480–494.

[96] Riesenhuber, M., and Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience 2 (1999), 1019–1025.

[97] Rieser, J. J., Garing, A. E., and Young, M. F. Imagery, action, and young children's spatial orientation: It's not being there that counts, it's what one has in mind. Child Development 65 (1994), 1262–1278.

[98] Rimey, R. D., and Brown, C. M. Task-oriented vision with multiple Bayes nets. In Active Vision, A. Blake and A. Yuille, Eds. MIT Press, Cambridge, Massachusetts, 1992, ch. 14, pp. 217–236.

[99] Ripley, B. D. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.


[100] Riseman, E. M., Hanson, A. R., Beveridge, J. R., Kumar, R., and Sawhney, H. Landmark-based navigation and the acquisition of environmental models. In Visual Navigation: From Biological Systems to Unmanned Ground Vehicles, Y. Aloimonos, Ed. Lawrence Erlbaum Associates, 1997, ch. 11, pp. 317–374.

[101] Ruff, H. A. Infant recognition of the invariant form of objects. Child Development 49 (1978), 293–306.

[102] Russell, S., and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice-Hall, Upper Saddle River, NJ, 1995.

[103] Salapatek, P., and Banks, M. S. Infant sensory assessment: Vision. In Communicative and Cognitive Abilities – Early Behavioral Assessment, F. D. Minifie and L. L. Lloyd, Eds. University Park Press, Baltimore, 1978.

[104] Schiele, B., and Crowley, J. L. Object recognition using multidimensional receptive field histograms. In Fourth European Conference on Computer Vision (Cambridge, UK, Apr. 1996).

[105] Schnackertz, T. J., and Grupen, R. A. A control basis for visual servoing tasks. In Proceedings of the International Conference on Robotics and Automation (1995), IEEE.

[106] Schwarz, G. Estimating the dimensions of a model. Annals of Statistics 6 (1978), 461–464.

[107] Schyns, P. G., Goldstone, R. L., and Thibaut, J.-P. The development of features in object concepts. Behavioral and Brain Sciences 21, 1 (1998), 1–54.

[108] Schyns, P. G., and Murphy, G. L. The ontogeny of part representation in object concepts. In The Psychology of Learning and Motivation, Medin, Ed., vol. 31. Academic Press, San Diego, CA, 1994, pp. 305–354.

[109] Schyns, P. G., and Rodet, L. Categorization creates functional features. Journal of Experimental Psychology: Learning, Memory, and Cognition 23, 3 (1997), 681–696.

[110] Segen, J. Learning graph models of shape. In Proceedings of the 5th International Conference on Machine Learning (Ann Arbor, June 12–14 1988), J. Laird, Ed., Morgan Kaufmann, pp. 29–35.

[111] Shams, S. Multiple elastic modules for visual pattern recognition. Neural Networks 8, 9 (1995), 1439–1456.

[112] Shannon, C. E., and Weaver, W. The Mathematical Theory of Communication. University of Illinois Press, Urbana, 1949.

[113] Solso, R. L., and McCarthy, J. E. Prototype formation of faces: A case of pseudo-memory. British Journal of Psychology 72 (1981), 499–503.

[114] Steels, L. Constructing and sharing perceptual distinctions. In Proceedings of the European Conference on Machine Learning (Berlin, 1997), M. van Someren and G. Widmer, Eds., Springer-Verlag.

[115] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.

[116] Swain, M. J., and Ballard, D. H. Color indexing. International Journal of Computer Vision 7, 1 (1991), 11–32.


[117] Swets, D. L., and Weng, J. Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 8 (Aug. 1996), 831–836.

[118] Talukder, A., and Casasent, D. Classification and pose estimation of objects using nonlinear features. In Applications and Science of Computational Intelligence (1998), vol. 3390, SPIE.

[119] Talukder, A., and Casasent, D. A general methodology for simultaneous representation and discrimination of multiple object classes. Optical Engineering (Mar. 1998). Special issue on Advances in Recognition Techniques.

[120] Tanaka, J. W., and Taylor, M. Object categories and expertise: Is the basic level in the eye of the beholder? Cognitive Psychology 23 (1991), 457–482.

[121] Tarr, M. J. Visual pattern recognition. In Encyclopedia of Psychology, A. Kazdin, Ed. American Psychological Association, Washington, DC, in press.

[122] Tarr, M. J., and Bülthoff, H. H. Is human object recognition better described by geon-structural-descriptions or by multiple-views? Journal of Experimental Psychology: Human Perception and Performance 21, 6 (1995), 1494–1505.

[123] Thibaut, J.-P. The development of features in children and adults: The case of visual stimuli. In Proc. 17th Annual Meeting of the Cognitive Science Society (1995), Lawrence Erlbaum, pp. 194–199.

[124] Thrun, S., and Mitchell, T. M. Lifelong robot learning. Robotics and Autonomous Systems 15 (1995), 25–46.

[125] Turk, M., and Pentland, A. Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 1 (1991), 71–86.

[126] Utgoff, P. E., and Clouse, J. A. A Kolmogorov-Smirnoff metric for decision tree induction. Computer Science Technical Report 96-3, University of Massachusetts, Amherst, 1996.

[127] van Vliet, L. J., Young, I. T., and Verbeek, P. W. Recursive Gaussian derivative filters. In Proc. 14th International Conference on Pattern Recognition (ICPR'98) (Brisbane, Australia, Aug. 1998), vol. 1, IEEE Computer Society Press, pp. 509–514.

[128] Viola, P. A. Complex feature recognition: A Bayesian approach for learning to recognize objects. AI Memo 1591, Massachusetts Institute of Technology, November 1996.

[129] Wallis, G., and Bülthoff, H. Learning to recognize objects. Trends in Cognitive Science 3, 1 (1999), 22–31.

[130] Weng, J., and Chen, S. Incremental learning for vision-based navigation. In Proceedings of the International Conference on Pattern Recognition (Vienna, Austria, 1996), pp. 45–49.

[131] Weng, J., and Chen, S. Vision-guided navigation using SHOSLIF. Neural Networks 11 (1998), 1511–1529.

[132] Young, R. A. The Gaussian derivative model for spatial vision: I. Retinal mechanisms. Spatial Vision 2, 4 (1987), 273–293.

[133] Zerroug, M., and Medioni, G. The challenge of generic object recognition. In Object Representation in Computer Vision: Proc. NSF-ARPA Workshop (1994), pp. 217–232.
