evaluation of interactive machine learning systems › ial2019 › pdf ›...

EvaluationofInteractiveMachineLearningSystems

NadiaBoukhelifaIAL,September2019

EvaluationofInteractiveMachineLearningSystems

NadiaBoukhelifaIAL,September2019

visual

(IVMLs)

WhoamI?

NadiaBOUKHELIFA

Ph.D.2007—UniversityofKent,UKComputerScience,InformationVisualization

Post-doc—INRIA,TélécomParisTech,FranceVisualization,Human-ComputerInteraction

Researcher,Tenured,2016—INRA,FranceMulti-dimensionalDataVisualization,InteractiveModelling

INRA—NationalInstituteofAgriculturalResearch

NadiaBOUKHELIFA

INRA—NationalInstituteofAgriculturalResearch

NadiaBOUKHELIFA

http://www.cfsg.fr/site-de-grignon

MALICESTeam:ModellingandKnowledgeIntegration

Expertiseformalisation

Uncertainty

Deterministicandstochasticmodelling

DecisionmakingOptimisation,visualisation

Expertiseformalisation

Uncertainty

Deterministicandstochasticmodelling

DecisionmakingOptimisation,visualisation

NotexpertinAI

MALICESTeam:ModellingandKnowledgeIntegration

Context:complexsystems,complexdatasets,complexmodels…

Azodyn-blé (Jeuffroy, Recous, Girard, Barré, Champeil, Meynard) modèle complet

Caractéristiques du sol (Clay%, CaCO3%, N%)

Données climatiques (température, pluie, Irr)

Nature du précédent

Amendements organiques (date, dose, forme)

Minéralisation nette de l�humus, des résidus et des amendements organiques (Mh, Mr, Ma)

Engrais minéral ou organique (date et dose)

Nmin dans sol

Vitesse de croissance la semaine avant apport

Indice de nutrition azotée (NNI)

Quantité de N aérien réel

Courbe de dilution maxi, Vmax

Climat (rayonnement, température)

Quantité maxi de N accumulé dans la culture

Biomasse aérienne

Rayonnement intercepté

Indice foliaire

NG/m²réel Quantité max de N dans les grains

Max Biomass in grains

Biomasse des grains = rendement

Quantité d�N dans les grains

Teneur en protéines

Acides aminés accumulés

N remobilisé

N dans organes végétatifs

Nmin dans sol à récolte climat (température)

CAU de l�engrais

NE/m²

climat avant floraison (température, rayonnement, déficit hydrique) NG/m² max

ENTREES

SORTIES

Jeuffroy, M-H., and Sylvie Recous. "Azodyn: a simple model simulating the date of nitrogen deficiency for decision support in wheat fertilization." European journal of Agronomy 10, no. 2 (1999): 129-144.

Context:InteractiveModelExploration

Whyhumanintheloopinmodelling?

• tointegratevaluableexpertsknowledgethatmaybehardtoencodedirectlyintomathematicalorcomputationalmodels.

• tohelpresolveexistinguncertaintiesasaresultof,forexample,biasanderrorthatmayarisefromautomaticmachinelearning.

• tobuildtrustbymakinghumansinvolvedinthemodellingorlearningprocesses.

• tocaterforindividualhumandifferencesandsubjectiveassessmentssuchasinartandcreativeapplications.

VisualisationOptimisationModel Simulations

Experimental System

Application!Multiple Computational Stages!

Multiple Actors!

Feedback

Interaction!Results !

Boukhelifa,Nadia,etal."AnExploratoryStudyonVisualExplorationofModelSimulationsbyMultipleTypesofExperts."Proceedingsofthe2019CHIConferenceonHumanFactorsinComputingSystems.ACM,2019.

OurapproachtoInteractiveModelExploration

Experimental System

Multiple Actors!

Feedback

12 Boukhelifa et al., CHI2019

Collaborativesetup

Experimental System

Multiple Actors!

Feedback

• Modelswrittenbythirdparties• Expertsspecialistsinpartsofthemodelledprocess

=>Co-locatedmultipleexpertise:domain,model,optimisation,visualization.

Collaborativesetup

Boukhelifa et al., CHI2019

Experimental System

Multiple Actors!

Feedback

Data & Models 14

Experimental System

Multiple Actors!

Feedback

Data & Models 15

Trade-offs

Experimental System

Multiple Actors!

Feedback

Data & Models Pareto Fronts 16

setofnon-dominatedcompromisepoints,wherenoobjectivecanbeimprovedwithoutsacrificingatleastoneotherobjective.

QUANTITY OF ITEM 1

Experimental System

Multiple Actors!

Feedback

Visualization SystemData & Models Pareto Fronts 17

QUANTITY OF ITEM 1

http://www.seguetech.com/

a way to generate pretty images from data

Definition#1:Visualization

Thepurposeofvisualizationisinsight,notpictures!

Definition#1:Visualization

Informa<onVisualisa<on

“Theuseofcomputer-supported,interactive,visualrepresentationsofabstractdatatoamplifycognition.”[Cardetal.,1999]

Thepurposeofvisualizationisinsight,notpictures!

Card,StuartK.,JockD.Mackinlay,andBenShneiderman."Usingvisiontothink."Readingsininformationvisualization.MorganKaufmannPublishersInc.,1999.Card,S.

WhyvisualiseyourdataRawDatafromAnscombe’sQuartet

FrankAnscombe

StatisticalanalysisRawDatafromAnscombe’sQuartet

StatisticalProperties

Meanofx 9.0

Varianceofx 11.0

Meanofy 7.5

Varianceofy 4.12

Correlationbetweenxandy 0.816

Linearregressionline y=3+0.5x

VisualrepresentationofthedataRawDatafromAnscombe’sQuartet

VisualProperties

Samestats,differentgraphs!

Nevertrustsummarystatisticsalone;visualiseyourdata!

DatasaurusDozenDataset,AutodeskResearchhttps://www.autodeskresearch.com/publications/samestat

Visualrepresentationofthedata

Communicate visually

Explore interactively

Evaluate

InformationVisualization

Definition#2:Human-ComputerInteraction

“Human-computer interaction is a discipline concerned with the design, evaluation and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them.”

Hewett,T.etal.ACMCurriculaforHuman-ComputerInteraction.1992;

“algorithms that can interact with both computational agents and human agents (in active learning: oracles) and can optimize their learning behavior through these interactions.”

Definition#3:iML

“Users Are People, Not Oracles”

Takingintoaccounthumanfactors

Amershietal.,PowertothePeople:TheRoleofHumansinInteractiveMachineLearning,2014

AndreasHolzinger.Interactivemachinelearning(iml).InformatikSpektrum,39(1),2016.DanielKottke,InteractiveAdaptiveLearning,IntelligentEmbeddedSystems,UniversityofKassel,2018

Definition#4:IVML

“In interactive visual machine learning (IVML), a human operator and a machine collaborate to achieve a task mediated by an interactive visual interface.”

Ourdefinition:

N.Boukhelifa,A.Bezerianos,andE.Lutton.EvaluationofInteractiveMachineLearningSystems.InHumanandMachineLearning,pp.341-360.Springer,Cham,2018.

EvaluationofInteractiveSystems

✔Userperformance

✔Userexperience

✔Algorithmicperformance

Weknowhowtoassess:

EvaluationofIVMLSystems

https://unsplash.com/photos/VBe9zj-JHBs

OurexperiencewithEVE-EvolutionaryVisualExploration

N.Boukhelifa,A.Bezerianos,I-CTrelea,N.MPerrot,andELutton.AnExploratoryStudyonVisualExplorationofModelSimulationsbyMultipleTypesofExperts."ACMCHIConferenceonHumanFactorsinComputingSystems,2019

N.Boukhelifa,A.Bezerianos,W.CancinoandE.Lutton.EvolutionaryVisualExploration:EvaluationofanIECFrameworkforGuidedVisualSearch.EvolutionaryComputationJournal,MITpress,2017.

N.Boukhelifa,A.BezerianosandE.Lutton.AMixedApproachfortheEvaluationofaGuidedExploratoryVisualization System.EuroVisWorkshoponReproducibility,Verification,andValidationinVisualization(EuroRV3)2015.

W.Cancino,N.Boukhelifa,A.BezerianosandE.Lutton.EvolutionaryVisualExploration:ExperimentalAnalysisofAlgorithmBehaviour.GECCOworkshoponGeneticandEvolutionaryComputation(VizGEC2013).

N.Boukhelifa,W.Cancino,A.BezerianosandE.Lutton.EvolutionaryVisualExploration:EvaluationWithExpertUsers.ComputerGraphicsForum(EuroVis2013),EurographicsAssociation,2013,32(3).

N.Boukhelifa,A.Bezerianos,I-CTrelea,N.MPerrot,andELutton.AnExploratoryStudyonVisualExplorationofModelSimulationsbyMultipleTypesofExperts."ACMCHIConferenceonHumanFactorsinComputingSystems,2019

N.Boukhelifa,A.Bezerianos,W.CancinoandE.Lutton.EvolutionaryVisualExploration:EvaluationofanIECFrameworkforGuidedVisualSearch.EvolutionaryComputationJournal,MITpress,2017.

N.Boukhelifa,A.BezerianosandE.Lutton.AMixedApproachfortheEvaluationofaGuidedExploratoryVisualization System.EuroVisWorkshoponReproducibility,Verification,andValidationinVisualization(EuroRV3)2015.

W.Cancino,N.Boukhelifa,A.BezerianosandE.Lutton.EvolutionaryVisualExploration:ExperimentalAnalysisofAlgorithmBehaviour.GECCOworkshoponGeneticandEvolutionaryComputation(VizGEC2013).

w. Cancino A. Bezerinaos E. Lutton

Our experience with EVE - Evolutionary Visual Exploration

EvolutionaryVisualExploration

Needtoexplorecombineddimensionshttps://unsplash.com/photos/YTbFHT9_IhY

Howtoexploreahugesearchspaceofpossiblecombined

dimensions?

Howtoexploreahugesearchspaceofpossiblecombined

dimensions?

n-DdatasetInteresting2D projectionsEV

E IEAs

EvolutionaryVisualExploration

WhyIEAs?

✔cancombineobjective&subjectivemeasures

✔supportexploration&exploitation

✔adapttouserchangeofinterest

Suitableforexploratoryvisualization:

EVEPrototype

t al.,

ArtificialEvolution

NaturalEvolution EvolutionaryAlgorithms(EAs)

Wikipedia

EVE:CreatingNewProjections

Biscuit data (6D)

Top_heat, Bottom_heat

+,-,*,/, (.)(.), exp and log

offspring1: 0.2 * Top_heat + 0.7 * Bottom_heat

offspring2: Exp(Top_heat)

offspring3: 0.99 - (0.69 * log(Top_heat)) …

Favor linear patterns

Choose

offspringX

InteractiveEvolution(IEAs)

EvaluationofProjections

User assessment ComplexitySurrogate function

Scagnostics, Wilkinson and Wills, 2008

learned

interactivecomputed

TheSearchSpace

TheSetofalldimensionsencodedastrees(GPframework,Koza92)

initialdimensions,operators,constants

SubspaceExploration

KeyfeaturesofEVE

Intuitive

Adaptable

Interactive

Flexible

EvaluatingEVE

Amixedapproach

Algorithm centred evaluation

User centred Evaluation

Isthehuman-machinecooperationproducingtherightresults?

Whenwehaveground-truth

• istheexplorationactuallysteeredtowardaninterestingareaofthesearchspace?

• aretheproposedsolutionsvaried?

Quantitativestudymethodology

• syntheticdataset

• pre-specifiedtask

Evaluatinghuman-AIcollaboration

• 12participants• mean28.5years• noexperiencerequired• Syntheticdataset5D• 20minutes

Participants

Gameseparatetwopointclusters

IEAisabletofollowtheorderofuserranking

Convergenceanalysis

failure

failuresuccess

success

Interfaceevaluation-usestrategiesVisualpatternselectionstrategies

failure

failuresuccess

success

Interfaceevaluation-usestrategiesVisualpatternselectionstrategies

Keyfindings

Userstakedifferentsearchandevaluationstrategiesevenforasimpletask.

Onaveragethesurrogatefunctionfollowstheorderofuserrankingfairlyconsistently.

Linkbetweenuserevaluationstrategy,andoutcomeofexploration&speedofconvergence.

Promisingresultsbut…

• Realworldsituationshavemorecomplexdatasetsandtasks.

• Usersarenotalwaysconsistentorgivedetailedfeedback.

• Wedonotalwayshavegroundtruth.

What happens when we do not have ground truth ?

User-centredevaluation

Insightandusabilityevaluation

• areexpertsabletoconfirmoldknowledge?

• areexpertsabletogainnewinsight?

Qualitativestudymethodology

• thinkaloud,observe,interview&questionnaire

• videotapedandlogdatacapture

• 5domainexperts

• mean34.2years

• owndatasets

• pre-questionnaire

• 2.5hours

Participants

Training

T1:showinthetoolwhatyoualreadyknowaboutthedata

T2:explorethedatainlightofaresearchquestion

EVEResults

Hypothesisgeneration

“thiscombinationmaybeanimportantfindingbecauseitinvolvesparametersthataffectonlyonepartofthe

simulationmodel...”

Cityemergencemodel

EVEResults

Hypothesisquantification

‘‘wealwaystalkaboutthisqualitatively.Thisisthefirst

timeIseeconcreteweights...’’

Electricityconsumptionprofiles

EVEResults

expertswereableto:• tryoutalternativescenarios• thinklaterally• quantifyaqualitativehypothesis• formulateanewhypothesisorrefineoldone• (domainvalue)

But…

« On one hand, humans perform unpredictable and sophisticated reasoning; on the other, artificial solvers are technically complex and adopt solving strategies which are very different from those employed by humans.

Secondly, the environment from which the problem to be solved is drawn is usually uncontrollable and uncertain.

Together, these factors complicate the task of designing precise and effective evaluation studies. »

G. Cortellessa and A. Cesta, AAAI

EvaluationofIVMLisChallenging

Cortellessa,Gabriella,andAmedeoCesta."EvaluatingMixed-InitiativeSystems:AnExperimentalApproach."ICAPS.Vol.6.2006.

EvaluationofIVMLisChallenging

Ch#1ComplexHumanFactors

Ch#2MultipleExpertise

Ch#3StochasticProcesses

Ch#4Co-adaptation

Ch#5Uncertainty

Ch#1ComplexHumanFactors

“Users Are People, Not Oracles”

FrustratedBored

FatigueAnnoyed

Inconsistent

Reluctanttogivefeedback

Interruptibility

Amershi et al., Power to the People: The Role of Humans in Interactive Machine Learning , 2014

e.g.,Needbettertechniquestocaptureuserintent.

ShouldIVMLhelpgroupsreachconsensus?Encouragemultiple

views?

Forus:buildcommongroundandselectbest

trade-offs

ShouldIVMLhelpgroupsreachconsensus?Encouragemultiple

views?

Forus:buildcommongroundandselectbest

trade-offs

Weneedtocaterforindividualaswellascollective

exploration

Needsetupsthatcanswitchbetweenindividualand

collectivelearning

• Riskofgettingstuckinlocaloptima?

• Effectofstochasticityonuser’smentalmodeoftheML

Ch#3StochasticProcesses

Ch#3Co-adaptation

“ Co-adaptive phenomena are defined as those in which the environment affects human behavior and at the same time, human behavior affects the environment. Such phenomena pose theoretical and methodological challenges and are difficult to study in traditional ways. ”

W. E. Mackay, Users and Customizable Software: A Co-Adaptive Phenomenon, 1990.

Ch#4Uncertainty

Differentsourcesofuncertaintyarisingfrom:• Automatic

inferences• Analysts

reasoning

https://www.facebook.com/pedromics

N.Boukhelifa,A.Bezerianos,andE.LuQon.Evalua<onofInterac<veMachineLearningSystems.InHumanandMachineLearning,pp.341-360.Springer,Cham,2018.

ReviewofCHI/VIZpublications2012-2017Keywordsearch:IML+Evaluation19papersfromdifferentdomains

Notanexhaustivesurvey!

IMLEvaluation-HCIStudies

HumanFeedback:

• implicit(7papers)

• explicit(8papers)

• mixed(4papers):• implicithumanfeedbackhelpsinferinformationtocomplementimplicithumanfeedback.

SystemFeedback:

• toinformhumansaboutthestateofthemachinelearningalgorithm,andprovenanceofsystemsuggestions.

• canbevisual,progressive,andcanindicateuncertainty.

• mostsystemsgavefeedback.

• challenge:toinformwithoutoverwhelmingtheuser.

Typesofstudy:

• 12/19studiesinvolvingusers(someforofcontrolledstudy)

• Difficulttoconduct:• potentialconfoundingfactors

• nogroundtruth

EvaluationMetrics:

objective:• time,precision,re-call,#Insights

• iMLvs.baselineorvariantsofML;with/outsystemfeedback;explicitvs.implicit;impactofuserfeedback

• difficultyseparatingusabilityissuesfromtaskresults

EvaluationMetrics:

subjective:• aspectsofuserexperience,e.g.,reported,easiness,speed,taskload,trust,andconfidence.

EvaluationMetrics:

subjective:• aspectsofuserexperience,e.g.,reported,easiness,speed,taskload,trust,andconfidence.

“One sign of success of iML systems is when humans forget that they are feeding information to an algorithm, and rather focus on

synthesising information relevant to their task”. Endertetal.,2012[38]

GuidelinesforIVMLevaluation

Weknowhowtoevaluateinterfaces&interactivesystems,butveryfewguidelinesexistspecificallyforiMLsystems!

JakobNielsen's10UsabilityHeuristicsforUserInterfaceDesign

G1:VisibilityofsystemstatusG2:MatchbetweensystemandtherealworldG3:UsercontrolandfreedomG4:ConsistencyandstandardsG5:ErrorpreventionG6:RecognitionratherthanrecallG7:FlexibilityandefficiencyofuseG8:AestheticandminimalistdesignG9:Helpusersrecognize,diagnose,&recoverfromerrorsG10:Helpanddocumentation

G1Developingsignificantvalue-addedautomation.

G2Consideringuncertaintyaboutauser’sgoals.

G3Consideringthestatusofauser’sattentioninthetimingofservices.

G4Inferringidealactioninlightofcosts,benefits,anduncertainties.

G5Employingdialogtoresolvekeyuncertainties.

G6Allowingefficientdirectinvocationandtermination.

G7Minimizingthecostofpoorguessesaboutactionandtiming.

G8Scopingprecisionofservicetomatchuncertainty,variationingoals.

G9Providingmechanismsforefficientagent−usercollaborationtorefineresults.

G10Employingsociallyappropriatebehaviorsforagent−userinteraction.

G11Maintainingworkingmemoryofrecentinteractions.

G12Continuingtolearnbyobserving.

PrinciplesofMixed-InitiativeUserInterfaces-EricHorvitz,1999

G1Makeclearwhatthesystemcando.

G2Makeclearhowwellthesystemcandowhatitcando.G3Timeservicesbasedoncontext.G4Showcontextuallyrelevantinformation.G5Matchrelevantsocialnorms.G6Mitigatesocialbiases.

G7Supportefcientinvocation.

G8Supportefcientdismissal.

G9Supportefcientcorrection.

G10Scopeserviceswhenindoubt.

G11Makeclearwhythesystemdidwhatitdid.G12Rememberrecentinteractions.G13Learnfromuserbehavior.G14Updateandadaptcautiously.G15Encouragegranularfeedback.

G16Conveytheconsequencesofuseractions.

G17Provideglobalcontrols.

G18Notifyusersaboutchanges.

Newguidelinesproposed,e.g.,Amershietal.,2019.

Noconclusions-researchquestions!

Q1WhataspectsorcomponentsoftheIVMLsystemaremostimportanttoevaluate?

Q2Whattypesoftaskscanbedelegatedtomachinelearningandwhicharebestlefttohumans?

Q3WhoarethetargetusersofIVMLsystems?

Q4Whatistheroleofexpertiseinthiscontext?

Q5ShouldwehavedomainexpertstrainIVMLsystems?

Q6Whataretherisksandbenefitsofintroducinghumanexpertiseintothe(machine)learningprocess?

Q7Whatmetricsshouldweusetoevaluate?

Q8Canweestablishapplicationordomain-independentmetrics?

Q9DoweestablishdifferentevaluationsmeasuresfortheunderstandingofIVMLsystemsandfortheirperformance?

Q10Canweestablishbenchmarkdatasetsandtestuse-casestohelpevaluateIVMLsystems?

Q11Shouldweseekreplicationinthiscontext,andifso,howdowesupportreplicationofresultsinaco-learningoradaptiveenvironment?

Q12Howdowecommunicatetheevaluationresultstootherdisciplines?

eviva-ml.github.io

iML-eval:On-goingResearchTopic

21 October, 2019

eviva-ml.github.io

eviva-ml.github.ioEarly registration deadline 20/09

Organizing Committee

• Nadia Boukhelifa (INRA, FR)• Anastasia Bezerianos (Univ. Paris-Sud and INRIA, FR)• Enrico Bertini (NYU Tandon School of Engineering, USA)• Christopher Collins (Uni. of Ontario Institute of Technology, CA)• Steven Drucker (Microsoft Research, USA)• Alex Endert (Georgia Tech, USA)• Jessica Hullman (Northwestern University, USA)• Michael Sedlmair (University of Stuttgart, DE) • Remco Chang (Tufts University, USA)• Chris North (Virginia Tech, USA)

Noconclusions-researchquestions!

Q1WhataspectsorcomponentsoftheIVMLsystemaremostimportanttoevaluate?

Q2Whattypesoftaskscanbedelegatedtomachinelearningandwhicharebestlefttohumans?

Q3WhoarethetargetusersofIVMLsystems?

Q4Whatistheroleofexpertiseinthiscontext?

Q5ShouldwehavedomainexpertstrainIVMLsystems?

Q6Whataretherisksandbenefitsofintroducinghumanexpertiseintothe(machine)learningprocess?

Q7Whatmetricsshouldweusetoevaluate?

Q8Canweestablishapplicationordomain-independentmetrics?

Q9DoweestablishdifferentevaluationsmeasuresfortheunderstandingofIVMLsystemsandfortheirperformance?

Q10Canweestablishbenchmarkdatasetsandtestuse-casestohelpevaluateIVMLsystems?

Q11Shouldweseekreplicationinthiscontext,andifso,howdowesupportreplicationofresultsinaco-learningoradaptiveenvironment?

Q12Howdowecommunicatetheevaluationresultstootherdisciplines?

eviva-ml.github.io

Danke!

evaluation of interactive machine learning systems › ial2019 › pdf ›...

Documents

nyai - interactive machine learning by daniel hsu

adatao: interactive, visual, predictive analytics for big...

three assertions about interactive machine learning ·...

predicting housing prices with azure machine learning ·...

interactive machine learning: robotics• gain intuition for...

northwestern's mccormick school of engineering ·...

interactive language learning by question answering ·...

analytics perspective towards better analysis of machine...

interactive learning

machine learning in logistics: machine learning...

power to the people: the role of humans in interactive...

an interactive machine-learning approach for defect

guided interactive machine learning

obstacle avoidance and path traversal using interactive...

reinforcement learning of motor skills using policy search...

three assertions about interactive machine...

interactive machine learning

interactive proofs for verifying machine learning

mlsploit: a framework for interactive experimentation with...

effective, interactive strategies for facilitating...