

Usher: Improving Data Quality with Dynamic Forms

    Kuang Chen, Student Member, IEEE, Harr Chen, Neil Conway,

    Joseph M. Hellerstein, Member, IEEE Computer Society, and Tapan S. Parikh

Abstract: Data quality is a critical problem in modern databases. Data-entry forms present the first and arguably best opportunity for detecting and mitigating errors, but there has been little research into automatic methods for improving data quality at entry time. In this paper, we propose USHER, an end-to-end system for form design, entry, and data quality assurance. Using previous form submissions, USHER learns a probabilistic model over the questions of the form. USHER then applies this model at every step of the data-entry process to improve data quality. Before entry, it induces a form layout that captures the most important data values of a form instance as quickly as possible and reduces the complexity of error-prone questions. During entry, it dynamically adapts the form to the values being entered by providing real-time interface feedback, reasking questions with dubious responses, and simplifying questions by reformulating them. After entry, it revisits question responses that it deems likely to have been entered incorrectly by reasking the question or a reformulation thereof. We evaluate these components of USHER using two real-world data sets. Our results demonstrate that USHER can improve data quality considerably at a reduced cost when compared to current practice.

Index Terms: Data quality, data entry, form design, adaptive form.

    1 INTRODUCTION

ORGANIZATIONS and individuals routinely make important decisions based on inaccurate data stored in supposedly authoritative databases. Data errors in some domains, such as medicine, may have particularly severe consequences. These errors can arise at a variety of points in the life cycle of data, from data entry, through storage, integration, and cleaning, all the way to analysis and decision making [1].

While each step presents an opportunity to address data quality, entry time offers the earliest opportunity to catch and correct errors. The database community has focused on data cleaning once data have been collected into a database, and has paid relatively little attention to data quality at collection time [1], [2]. Current best practices for quality during data entry come from the field of survey methodology, which offers principles that include manual question orderings and input constraints, and double entry of paper forms [3]. Although this has long been the de facto quality assurance standard in data collection and transformation, we believe this area merits reconsideration. For both paper forms and direct electronic entry, we posit that a data-driven and more computationally sophisticated approach can significantly outperform these decades-old static methods in both accuracy and efficiency of data entry.

The problem of data quality is magnified in low-resource data collection settings. Recently, the World Health Organization likened the lack of quality health information in developing regions to "a gathering storm," saying, "[to] make people count, we first need to be able to count people" [4]. Indeed, many health organizations, particularly those operating with limited resources in developing regions, struggle with collecting high-quality data. Why is data collection so challenging? First, many organizations lack expertise in paper and electronic form design: designers approach question and answer choice selection with a defensive, catchall mind-set, adding answer choices and questions that may not be necessary; furthermore, they engage in ad hoc mapping of required data fields to data-entry widgets by intuition [5], [6], often ignoring or specifying ill-fitting constraints. Second, double entry is too costly. In some cases this means it is simply not performed, resulting in poor data quality. In other cases, particularly when double entry is mandated by third parties, it results in delays and other unintended negative consequences. We observed this scenario in an HIV/AIDS program in Tanzania, where time-consuming double entry was imposed upon a busy local clinic. The effort required to do the double entry meant that the transcription was postponed for months and handled in batch. Although the data eventually percolated up to national and international agencies, in the interim the local clinic was operating as usual via paper forms, unable to benefit from an electronic view of the data latent in their organization. Finally, many organizations in developing regions are beginning to use mobile devices like smartphones for data collection; for instance, community health workers are doing direct digital data entry in remote locations. Electronic data-entry devices offer different affordances than those of paper, displacing the role of traditional form design and double entry [5]. We often found that there were no data quality checks at all in naively implemented mobile interfaces, compounding the fact that mobile data-entry quality is 10 times worse than dictation to a human operator [7].

. K. Chen, N. Conway, and J.M. Hellerstein are with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, 2599 Hearst Ave., Berkeley, CA 94720. E-mail: {kuangc, nrc, hellerstein}@cs.berkeley.edu.

. H. Chen is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St., Cambridge, MA 02139. E-mail: [email protected].

. T.S. Parikh is with the School of Information, University of California, Berkeley, 102 South Hall, Berkeley, CA 94720. E-mail: [email protected].

Manuscript received 23 Mar. 2010; revised 19 Oct. 2010; accepted 25 Oct. 2010; published online 1 Feb. 2011. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDESI-2010-03-0172. Digital Object Identifier no. 10.1109/TKDE.2011.31.

To address this spectrum of data quality challenges, we have developed USHER, an end-to-end system that can improve data quality and efficiency at the point of entry by learning probabilistic models from existing data, which stochastically relate the questions of a data-entry form. These models form a principled foundation on which we develop information-theoretic algorithms for form design, dynamic form adaptation during entry, and answer verification:

1. Since form layout and question selection is often ad hoc, USHER optimizes question ordering according to a probabilistic objective function that aims to maximize the information content of form answers as early as possible; we call this the greedy information gain principle. Applied before entry, the model generates a static but entropy-optimal ordering, which focuses on important questions first; during entry, it can be used to dynamically pick the next best question based on answers so far, appropriate in scenarios where question ordering can be flexible between instances.

2. Applying its probabilistic model during data entry, USHER can evaluate the conditional distribution of answers to a form question, and make it easier for likely answers to be entered; we call this the appropriate entry friction principle. For difficult-to-answer questions, such as those with many extraneous choices, USHER can opportunistically reformulate them to be easier and more congruous with the available information. In this way, USHER effectively allows for a principled, controlled trade-off between data quality and form-filling effort and time.

3. Finally, the stochastic model is consulted to predict which responses may be erroneous, so as to reask those questions in order to verify their correctness; we call this the contextualized error likelihood principle. We consider reasking questions both during the data-entry process (integrated reasking) and after data entry has been finished (post hoc reasking). In both cases, intelligent question reasking approximates the benefits of double entry at a fraction of the cost.

In addition, we may extend USHER's appropriate entry friction approach to provide a framework for reasoning about feedback mechanisms for the data-entry user interface. During data entry, using the likelihood of unanswered fields given entered answers, and following the intuition that multivariate outliers are values warranting reexamination by the data-entry worker, USHER can guide the user with much more specific and context-aware feedback. In Section 9, we offer initial thoughts on design patterns for USHER-inspired dynamic data-entry interfaces.

    The contributions of this paper are fourfold:

1. We describe the design of USHER's core: probabilistic models for arbitrary data-entry forms.

2. We describe USHER's application of these models to provide guidance along each step of the data-entry lifecycle: reordering questions for greedy information gain, reformulating answers for appropriate entry friction, and reasking questions according to contextualized error likelihood.

3. We present experiments showing that USHER has the potential to improve data quality at reduced cost. We study two representative data sets: direct electronic entry of survey results about political opinion, and transcription of paper-based patient intake forms from an HIV/AIDS clinic in Tanzania.

4. Extending our ideas on form dynamics, we propose new user-interface principles for providing contextualized, intuitive feedback based on the likelihood of data as they are entered. This provides a foundation for incorporating data cleaning mechanisms directly in the entry process.

    2 RELATED WORK

Our work builds upon several areas of related work. We provide an overview in this section.

    2.1 Data Cleaning

In the database literature, data quality has typically been addressed under the rubric of data cleaning [1], [2]. Our work connects most directly to data cleaning via multivariate outlier detection; it is based in part on interface ideas first proposed by Hellerstein [8]. By the time such retrospective data cleaning is done, the physical source of the data is typically unavailable; thus, errors often become too difficult or time-consuming to be rectified. USHER addresses this issue by applying statistical data quality insights at the time of data entry. Thus, it can catch errors when they are made and when ground-truth values may still be available for verification.

    2.2 User Interfaces

Past research on improving data entry is mostly focused on adapting the data-entry interface for user efficiency improvements. Several such projects have used learning techniques to automatically fill or predict a top-k set of likely values [9], [10], [11], [12], [13], [14], [15]. For example, Ali and Meek [9] predicted values for combo-boxes in web forms and measured improvements in the speed of entry, Ecopod [15] generated type-ahead suggestions that were improved by geographic information, and Hermens and Schlimmer [10] automatically filled leave-of-absence forms using decision trees and measured predictive accuracy and time savings. In these approaches, learning techniques are used to predict form values based on past data, and each measures the time savings of particular data-entry mechanisms and/or the proportion of values their model was able to correctly predict. USHER's focus is on improving data quality, and its probabilistic formalism is based on learning relationships within the underlying data that guide the user toward correct entries. In addition to predicting question values, we develop and exploit probabilistic models of user error, and target a broader set of interface adaptations for improving data quality, including question reordering, reformulation, and reasking, and widget customizations that provide feedback to the user based on the likelihood of their entries. Some of the enhancements we make for data quality could also be applied to improve the speed of entry.

    2.3 Clinical Trials

Data quality assurance is a prominent topic in the science of clinical trials, where the practice of double entry has been questioned and dissected, but nonetheless remains the gold standard [16], [17]. In particular, Kleinman takes a probabilistic approach toward choosing which forms to reenter based on the individual performance of data-entry staff [18]. This cross-form validation has the same goal as our approach of reducing the need for complete double entry, but does so at a much coarser level of granularity. It requires historical performance records for each data-entry worker, and does not offer dynamic reconfirmation of individual questions. In contrast, USHER's cross-question validation adapts to the actual data being entered in light of previous form submissions, and allows for a principled assessment of the trade-off between cost (of reconfirming more questions) versus quality (as predicted by the probabilistic model).

    2.4 Survey Design

The survey design literature includes extensive work on form design techniques that can improve data quality [3], [19]. This literature advocates the use of manually specified constraints on response values. These constraints may be univariate (e.g., a maximum value for an age question) or multivariate (e.g., disallowing gender to be "male" and pregnant to be "yes"). Some constraints may also be soft and only serve as warnings regarding unlikely combinations (e.g., age being 60 and pregnant being "yes").

The manual specification of such constraints requires a domain expert, which can be prohibitive in many scenarios. By relying on prior data, USHER learns many of these same constraints without requiring their explicit specification. When these constraints are violated during entry, USHER can then flag the relevant questions, or target them for reasking.

However, USHER does not preclude the manual specification of constraints. This is critical, because previous research into the psychological phenomena of survey filling has yielded common constraints not inherently learnable from prior data [3]. This work provides heuristics such as "groups of topically related questions should often be placed together" and "questions about race should appear at the end of a survey." USHER complements these human-specified constraints, accommodating them while leveraging any remaining flexibility to optimize question ordering in a data-driven manner.

    3 SYSTEM

USHER builds a probabilistic model for an arbitrary data-entry form in two steps: first, by learning the relationships between form questions via structure learning, resulting in a Bayesian network; and second, by estimating the parameters of that Bayesian network, which then allows us to generate predictions and error probabilities for the form.

After the model is built, USHER uses it to automatically order a form's questions for greedy information gain. Section 5 describes both static and dynamic algorithms that employ criteria based on the magnitude of statistical information gain that is expected in answering a question, given the answers that have been provided so far. This is a key idea in our approach. By front-loading predictive potential, we increase the model's capacity in several ways. First, from an information-theoretic perspective, we improve our ability to do multivariate prediction and outlier detection for subsequent questions. As we discuss in more detail in Sections 7 and 9, this predictive ability can be applied by reformulating error-prone form questions, parameterizing data-entry widgets (type-ahead suggestions and default values), assessing answers (outlier flags), and performing in-flight reasking (also known as cross validation in survey design parlance). Second, from a psychological perspective, front-loading information gain also addresses the human issues of user fatigue and limited attention span, which can result in increasing error rates over time and unanswered questions at the end of the form.

Our approach is driven by the same intuition underlying the practice of curbstoning, which was related to us in discussion with survey design experts [6]. Curbstoning is a way in which an unscrupulous door-to-door surveyor shirks work: he or she asks an interviewee only a few important questions, and then uses those responses to complete the remainder of a form while sitting on the curb outside the home. The constructive insight here is that a well-chosen subset of questions can often enable an experienced agent to intuitively predict the remaining answers. USHER's question ordering algorithms formalize this intuition via the principle of greedy information gain, and use them (scrupulously) to improve data entry.

USHER's learning algorithm relies on training data. In practice, a data-entry backlog can serve as this training set. In the absence of sufficient training data, USHER can bootstrap itself on a uniform prior, generating a form based on the assumption that all inputs are equally likely; this is no worse than standard practice. Subsequently, a training set can gradually be constructed by iteratively capturing data from designers and potential users in learning runs. It is a common approach to first fit to the available data, and then evolve a model as new data become available. This process of semiautomated form design can help institutionalize new forms before they are deployed in production.

USHER adapts to a form and data set by crafting a custom model. Of course, as in many learning systems, the model learned may not translate across contexts. We do not claim that each learned model would or should fully generalize to different environments. Instead, each context-specific model is used to ensure data quality for a particular situation, where we expect relatively consistent patterns in input data characteristics. In the remainder of this section, we illustrate USHER's functionality with examples. Further details, particularly regarding the probabilistic model, follow in the ensuing sections.

    3.1 Examples

We present two running examples. First, the patient data set comes from paper patient-registration forms transcribed by data-entry workers at an HIV/AIDS program in Tanzania.1 Second, the survey data set comes from a phone survey of political opinion in the San Francisco Bay Area, entered by survey professionals directly into an electronic form.

1. We have pruned out questions with identifying information about patients, as well as free-text comment fields.

In each example, a form designer begins by creating a simple specification of form questions and their prompts, response data types, and constraints. The training data set is made up of prior form responses. Using the learning algorithms we present in Section 4, USHER builds a Bayesian network of probabilistic relationships from the data, as shown in Figs. 1 and 2. In this graph, an edge captures a close stochastic dependency between two random variables (i.e., form questions). Two questions with no path between them in the graph are probabilistically independent. Fig. 2 illustrates a denser graph, demonstrating that political survey responses tend to be highly correlated. Note that a standard joint distribution would show correlations among all pairs of questions; the sparsity of these examples reflects conditional independence patterns learned from the data. Encoding independence in a Bayesian network is a standard method in machine learning that clarifies the underlying structure, mitigates data overfitting, and improves the efficiency of probabilistic inference.

The learned structure is subject to manual control: a designer can override any learned correlations that are believed to be spurious or that make the form more difficult to administer.

For the patient data set, USHER generated the static ordering shown in Fig. 3. We can see in Fig. 3 that the structure learner predicted RegionCode to be correlated with DistrictCode. Our data set is collected mostly from clinics in a single region of Tanzania, so RegionCode provides little information. It is not surprising, then, that USHER's suggested ordering has DistrictCode early and RegionCode last: once we observe DistrictCode, RegionCode has very little additional expected conditional information gain. When it is time to input the RegionCode, if the user selects an incorrect value, the model can be more certain that it is unlikely. If the user stops early and does not fill in RegionCode, the model can infer the likely value with higher confidence. In general, static question orderings are appropriate as an offline process for paper forms where there is latitude for (re)ordering questions, within designer-specified constraints.

During data entry, USHER uses the probabilistic machinery to drive dynamic updates to the form structure. One type of update is the dynamic selection of the best next question to ask among questions yet to be answered. This can be appropriate in several situations, including surveys that do not expect users to finish all questions, or direct-entry interfaces (e.g., mobile phones) where one question is asked at a time. We note that it is still important to respect the form designer's a priori specified question-grouping and ordering constraints when a form is dynamically updated.

USHER is also used during data entry to provide dynamic feedback, by calculating the conditional distribution for the question in focus and using it to influence the way the question is presented. We tackle this via two techniques: question reformulation and widget decoration. For the former, we could, for example, choose to reformulate the question about RegionCode into a binary yes/no question based on the answer to DistrictCode, since DistrictCode is such a strong predictor of RegionCode. As we discuss in Section 7, the reduced selection space for responses in turn reduces the chances of a data-entry worker selecting an incorrect response. For the latter, possibilities include using a split drop-down menu for RegionCode that features the most likely answers "above the line," and, after entry, coloring the chosen answer red if it is a conditional outlier. We discuss in Section 9 the design space and potential impact of data-entry feedback that is more specific and context-aware.

Fig. 1. Bayesian network for the patient data set, showing automatically inferred probabilistic relationships between form questions.

Fig. 2. Bayesian network for the survey data set. The probabilistic relationships are more dense. Some relationships are intuitive (Political Ideology-Political Party); others show patterns incidental to the data set (race-gender).

Fig. 3. Example question layout generated by our ordering algorithm. The arrows reflect the probabilistic dependencies from Fig. 1.

As a form is being filled, USHER calculates contextualized error probabilities for each question. These values are used for reasking questions in two ways: during primary form entry and for reconfirming answers after an initial pass. For each form question, USHER predicts how likely the response provided is erroneous, by examining whether it is likely to be a multivariate outlier, i.e., that it is unlikely with respect to the responses for other fields. In other words, an error probability is conditioned on all answered values provided by the data-entry worker so far. If there are responses with error probabilities exceeding a preset threshold, USHER reasks those questions, ordered by techniques to be discussed in Section 6.

    3.2 Implementation

We have implemented USHER as a web application (Fig. 4). The UI loads a simple form specification file containing form question details and the location of the training data set. Form question details include question name, prompt, data type, widget type, and constraints. The server instantiates a model for each form. The system passes information about question responses to the model as they are filled in; in exchange, the model returns predictions and error probabilities.
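For concreteness, such a specification might resemble the following Python literal. The exact file format used by USHER is not published in this paper, so the field names and example values below are purely illustrative.

```python
# Hypothetical form specification, shown as a Python literal; USHER's
# actual file format is not published, so these field names and values
# are illustrative only.
form_spec = {
    "training_data": "patients.csv",  # location of prior submissions
    "questions": [
        {
            "name": "DistrictCode",
            "prompt": "District code:",
            "type": "categorical",
            "widget": "drop-down",
            "constraints": {"values": ["A", "B", "C"]},
        },
        {
            "name": "Age",
            "prompt": "Patient age:",
            "type": "integer",
            "widget": "text-box",
            "constraints": {"min": 0, "max": 120},
        },
    ],
}
```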

Models are created from the form specification, the training data set, and a graph of learned structural relationships. We perform structure learning offline with BANJO [20], an open source Java package for structure learning of Bayesian networks. Our graphical model is implemented in two variants: the first model, used for ordering, is based on a modified version of JavaBayes [21], open source Java software for Bayesian inference. Because JavaBayes only supports discrete probability variables, we implemented the error prediction version of our model using Infer.NET [22], a Microsoft .NET Framework toolkit for Bayesian inference.

    4 LEARNING A MODEL FOR DATA ENTRY

The core of the USHER system is its probabilistic model of the data, represented as a Bayesian network over form questions. This network captures relationships between a form's question elements in a stochastic manner. In particular, given input values for some subset of the questions of a particular form instance, the model can infer probability distributions over values of that instance's remaining unanswered questions. In this section, we show how standard machine learning techniques can be used to induce this model from previous form entries.

We will use $F = \{F_1, \ldots, F_n\}$ to denote a set of random variables representing the values of $n$ questions comprising a data-entry form. We assume that each question response takes on a finite set of discrete values; continuous values are discretized by dividing the data range into intervals and assigning each interval one value.2 To learn the probabilistic model, we assume access to prior entries for the same form.

USHER first builds a Bayesian network over the form questions, which will allow it to compute probability distributions over arbitrary subsets $G \subseteq F$ of form question random variables, given already entered question responses $G' = g'$ for that instance, i.e., $P(G \mid G' = g')$. Constructing this network requires two steps: first, the induction of the graph structure of the network, which encodes the conditional independencies between the question random variables $F$; and second, the estimation of the resulting network's parameters.

The naive approach to structure selection would be to assume complete dependence of each question on every other question. However, this would blow up the number of free parameters in our model, leading to both poor generalization performance of our predictions and prohibitively slow model queries. Instead, we learn the structure using the prior form submissions in the database. USHER searches through the space of possible structures using simulated annealing, and chooses the best structure according to the Bayesian Dirichlet Equivalence criterion [23]. This criterion optimizes for a trade-off between model expressiveness (using a richer dependency structure) and model parsimony (using a smaller number of parameters), thus identifying only the prominent, recurring probabilistic dependencies. Figs. 1 and 2 show automatically learned structures for two data domains.3

In certain domains, form designers may already have strong commonsense notions of questions that should or should not depend on each other (e.g., education level and income are related, whereas gender and race are independent). As a postprocessing step, the form designer can manually tune the resulting model to incorporate such intuitions. In fact, the entire structure could be manually constructed in domains where an expert has comprehensive prior knowledge of the questions' interdependencies. However, a casual form designer is unlikely to consider the complete space of question combinations when identifying correlations. In most settings, we believe an automatic approach to learning multivariate correlations would yield more effective inference.

Fig. 4. USHER components and data flow: 1) model a form and its data, 2) generate question ordering according to greedy information gain, 3) instantiate the form in a data-entry interface, and 4) during and immediately after data entry, provide dynamic reordering, feedback, and reconfirmation according to contextualized error likelihood.

2. Using richer distributions to model fields with continuous or ordinal answers (e.g., with Gaussian models) could provide additional improvement, and is left for future work.

3. It is important to note that the arrows in the network do not represent causality, only that there is a probabilistic relationship between the questions.

Given a graphical structure of the questions, we can then estimate the conditional probability tables that parameterize each node in a straightforward manner, by counting the proportion of previous form submissions with those response assignments. The probability mass function for a single question $F_i$ with $m$ possible discrete values, conditioned on its set of parent nodes $P(F_i)$ from the Bayesian network, is

$$P(F_i = f_i \mid \{F_j = f_j : F_j \in P(F_i)\}) = \frac{N(F_i = f_i, \{F_j = f_j : F_j \in P(F_i)\})}{N(\{F_j = f_j : F_j \in P(F_i)\})}. \qquad (1)$$

In this notation, $P(F_i = f_i \mid \{F_j = f_j : F_j \in P(F_i)\})$ refers to the conditional probability of question $F_i$ taking value $f_i$, given that each question $F_j$ in $P(F_i)$ takes on value $f_j$. Here, $N(X)$ is the number of prior form submissions that match the conditions $X$: in the denominator, we count the number of times a previous submission had the subset $P(F_i)$ of its questions set according to the listed $f_j$ values; and in the numerator, we count the number of times when those previous submissions additionally had $F_i$ set to $f_i$.

Because the number of prior form instances may be limited, and thus may not account for all possible combinations of prior question responses, (1) may assign zero probability to some combinations of responses. Typically, this is undesirable; just because a particular combination of values has not occurred in the past does not mean that combination cannot occur at all. We overcome this obstacle by smoothing these parameter estimates, interpolating each with a background uniform distribution. In particular, we revise our estimates to

$$P(F_i = f_i \mid \{F_j = f_j : F_j \in P(F_i)\}) = (1 - \alpha)\,\frac{N(F_i = f_i, \{F_j = f_j : F_j \in P(F_i)\})}{N(\{F_j = f_j : F_j \in P(F_i)\})} + \frac{\alpha}{m}, \qquad (2)$$

where $m$ is the number of possible values question $F_i$ can take on, and $\alpha$ is the fixed smoothing parameter, which was set to 0.1 in our implementation. This approach is essentially a form of Jelinek-Mercer smoothing with a uniform backoff distribution [24].
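A minimal sketch of this estimator, assuming prior submissions are represented as dictionaries mapping question names to discrete values (a representation we choose for illustration; the counting and interpolation follow (1) and (2)):

```python
from collections import Counter

ALPHA = 0.1  # smoothing parameter alpha, set to 0.1 as in the paper

def estimate_cpt(submissions, question, parents, domain):
    """Smoothed conditional probability estimate for one question,
    per (1) and (2): empirical counts interpolated with a uniform
    background distribution over the m possible values."""
    m = len(domain)
    joint = Counter()    # N(F_i = f_i, parent assignment)
    context = Counter()  # N(parent assignment)
    for row in submissions:
        pa = tuple(row[p] for p in parents)
        context[pa] += 1
        joint[(row[question], pa)] += 1

    def prob(value, parent_values):
        pa = tuple(parent_values)
        # Unseen parent assignments fall back to the uniform background.
        empirical = joint[(value, pa)] / context[pa] if context[pa] else 1.0 / m
        return (1 - ALPHA) * empirical + ALPHA / m

    return prob

# Toy usage with hypothetical data:
rows = [{"district": "A", "region": "R1"},
        {"district": "A", "region": "R1"},
        {"district": "B", "region": "R2"}]
p = estimate_cpt(rows, "region", ["district"], ["R1", "R2"])
print(p("R1", ["A"]))  # 0.95: empirical 1.0, smoothed toward 1/2
```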

Once the Bayesian network is constructed, we can infer distributions of the form $P(G \mid G' = g')$ for arbitrary $G, G' \subseteq F$; that is, the marginal distributions over sets of random variables, optionally conditioned on observed values for other variables. Answering such queries is known as the inference task. There exist a variety of inference techniques. In our experiments, the Bayesian networks are small enough that exact techniques such as the junction tree algorithm [25] can be used. For larger models, faster approximate inference techniques like Loopy Belief Propagation or Gibbs Sampling, common and effective approaches in machine learning, may be preferable.
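For intuition, the inference task can be illustrated by brute-force enumeration of the joint distribution, which is tractable only for toy networks; this is a didactic stand-in, not the junction tree algorithm the system actually uses, and the two-question example distribution is hypothetical.

```python
import itertools
from collections import defaultdict

def enumerate_marginal(names, domains, joint_prob, query, evidence):
    """Compute P(query | evidence) by summing the full joint over all
    assignments consistent with the evidence. Exponential in the number
    of questions, hence suitable only for tiny networks."""
    scores = defaultdict(float)
    for values in itertools.product(*(domains[v] for v in names)):
        row = dict(zip(names, values))
        if any(row[k] != v for k, v in evidence.items()):
            continue
        scores[row[query]] += joint_prob(row)
    total = sum(scores.values())
    return {value: s / total for value, s in scores.items()}

# Joint for a two-question chain: P(district) * P(region | district).
def joint(row):
    p_d = {"A": 0.7, "B": 0.3}[row["district"]]
    p_r = {("R1", "A"): 0.9, ("R2", "A"): 0.1,
           ("R1", "B"): 0.2, ("R2", "B"): 0.8}[(row["region"], row["district"])]
    return p_d * p_r

domains = {"district": ["A", "B"], "region": ["R1", "R2"]}
print(enumerate_marginal(["district", "region"], domains, joint,
                         query="region", evidence={"district": "A"}))
```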

    5 QUESTION ORDERING

Having described the Bayesian network, we now turn to its applications in the USHER system. We first consider ways of automatically ordering the questions of a data-entry form. The key idea behind our ordering algorithm is greedy information gain; that is, to reduce the amount of uncertainty of a single form instance as quickly as possible. Note that regardless of how questions are ordered, the total amount of uncertainty about all of the responses taken together (and hence the total amount of information that can be acquired from an entire form submission) is fixed. By reducing this uncertainty as early as possible, we can be more certain about the values of later questions. The benefits of greater certainty about later questions are twofold. First, it allows us to more accurately provide data-entry feedback for those questions. Second, we can more accurately predict missing values for incomplete form submissions.

We can quantify uncertainty using information entropy. A question whose random variable has high entropy reflects greater underlying uncertainty about the responses that question can take on. Formally, the entropy of random variable $F_i$ is given by

$$H(F_i) = -\sum_{f_i} P(f_i) \log P(f_i), \qquad (3)$$

where the sum is over all possible values $f_i$ that question $F_i$ can take on.

As question values are entered for a single form instance, the uncertainty about the remaining questions of that instance changes. For example, in the race and politics survey, knowing the respondent's political party provides strong evidence about his or her political ideology. We can quantify the amount of uncertainty remaining in a question $F_i$, assuming that other questions $G = \{F_1, \ldots, F_n\}$ have been previously encountered, with its conditional entropy:

$$H(F_i \mid G) = -\sum_{g = (f_1, \ldots, f_n)} \sum_{f_i} P(G = g, F_i = f_i) \log P(F_i = f_i \mid G = g), \qquad (4)$$

where the sum is over all possible question responses in the Cartesian product of $F_1, \ldots, F_n, F_i$. Conditional entropy measures the weighted average of the entropy of question $F_i$'s conditional distribution, given every possible assignment of the previously observed variables. This value is obtained by performing inference on the Bayesian network to compute the necessary distributions. By taking advantage of the conditional independencies encoded in the network, we can typically drop many terms from the conditioning in (4) for faster computation.4

Our full static ordering algorithm based on greedy information gain is presented in Algorithm 1. We select the entire question ordering in a stepwise manner, starting with the first question. At the $i$th step, we choose the question with the highest conditional entropy, given the questions that have already been selected. We call this ordering static because the algorithm is run offline, based only on the learned Bayesian network, and does not change during the actual data-entry session.

4. Conditional entropy can also be expressed as the incremental difference in joint entropy due to $F_i$, that is, $H(F_i \mid G) = H(F_i, G) - H(G)$. Writing out the sum of entropies for an entire form using this expression yields a telescoping sum that reduces to the fixed value $H(F)$. Thus, this formulation confirms our previous intuition that no matter what ordering we select, the total amount of uncertainty is still the same.
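The greedy selection itself is straightforward. The following sketch estimates the conditional entropies in (4) directly from prior submissions as a stand-in for Bayesian network inference; the list-of-dictionaries representation of past submissions is our assumption.

```python
import math
from collections import Counter

def conditional_entropy(submissions, question, chosen):
    """Empirical H(question | chosen), per (4): a weighted average of
    conditional surprisal over assignments observed in prior data."""
    joint = Counter(tuple(r[q] for q in chosen + [question]) for r in submissions)
    context = Counter(tuple(r[q] for q in chosen) for r in submissions)
    n = len(submissions)
    h = 0.0
    for key, count in joint.items():
        p_joint = count / n
        p_cond = count / context[key[:-1]]
        h -= p_joint * math.log2(p_cond)
    return h

def static_order(submissions, questions):
    """Greedy information gain ordering (Algorithm 1): repeatedly pick
    the question with maximum conditional entropy given those already
    selected."""
    order, remaining = [], list(questions)
    while remaining:
        best = max(remaining,
                   key=lambda q: conditional_entropy(submissions, q, order))
        order.append(best)
        remaining.remove(best)
    return order
```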

In many scenarios, the form designer would like to specify natural groupings of questions that should be presented to the user as one section. Our model can be easily adapted to handle this constraint by maximizing entropy between specified groups of questions. We can select these groups according to joint entropy:

$$\arg\max_{G} \; H(G \mid F_1, \ldots, F_{i-1}), \qquad (5)$$

where $G$ ranges over the form designer's specified groups of questions. We can then further apply the static ordering algorithm to order questions within each individual section. In this way, we capture the highest possible amount of uncertainty while still conforming to ordering constraints imposed by the form designer.

Form designers may also want to specify other kinds of constraints on form layout, such as a partial ordering over the questions that must be respected. The greedy approach can accommodate such constraints by restricting the choice of fields at every step to match the partial order.

    5.1 Reordering Questions during Data Entry

In electronic form settings, we can take our ordering notion a step further and dynamically reorder questions in a form as an instance is being entered. This approach can be appropriate for scenarios when data-entry workers input one or several values at a time, such as on a mobile device. We can apply the same greedy information gain criterion as in Algorithm 1, but update the calculations with the previous responses in the same form instance. Assuming questions $G = \{F_1, \ldots, F_\ell\}$ have already been filled in with values $g = \{f_1, \ldots, f_\ell\}$, the next question is selected by maximizing:

$$H(F_i \mid G = g) = -\sum_{f_i} P(F_i = f_i \mid G = g) \log P(F_i = f_i \mid G = g). \qquad (6)$$

Notice that this objective is the same as (4), except that it uses the actual responses entered for previous questions, rather than taking a weighted average over all possible values. Constraints specified by the form designer, such as topical grouping, can also be respected in the dynamic framework by restricting the selection of next questions at every step.
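A sketch of the dynamic criterion (6), again using empirical counts over prior submissions in place of network inference; the back-off to the unconditioned distribution when no prior submission matches the entered responses is our simplification.

```python
import math
from collections import Counter

def next_question(submissions, remaining, answered):
    """Pick the unanswered question maximizing the entropy in (6),
    conditioned on the responses entered so far. `answered` maps
    question names to the values already input."""
    matches = [r for r in submissions
               if all(r[q] == v for q, v in answered.items())]
    if not matches:          # no prior submission matches: back off
        matches = submissions

    def entropy(q):
        counts = Counter(r[q] for r in matches)
        n = sum(counts.values())
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    return max(remaining, key=entropy)
```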

In general, dynamic reordering can be particularly useful in scenarios where the input of one value determines the value of another. For example, in a form with questions for gender and pregnant, a response of "male" for the former dictates the value and potential information gain of the latter. However, dynamic reordering may be confusing to data-entry workers who routinely enter information into the same form, and have come to expect a specific question order. Determining the trade-off between these opposing concerns is a human factors issue that depends on both the application domain and the user interface employed.

    6 QUESTION REASKING

The next application of USHER's probabilistic model is for the purpose of identifying errors made during entry. Because this determination is made during and immediately after form submission, USHER can choose to reask questions during the same entry session. By focusing the reasking effort only on questions that were likely to be misentered, USHER is likely to catch mistakes at a small incremental cost to the data-entry worker. Our approach is a data-driven alternative to the expensive practice of double entry. Rather than reasking every question, we focus reasking effort only on question responses that are unlikely with respect to the other form responses.

Before exploring how USHER performs reasking, we explain how it determines whether a question response is erroneous. USHER estimates contextualized error likelihood for each question response, i.e., a probability of error that is conditioned on every other previously entered field response. The intuition behind error detection is straightforward: questions whose responses are unexpected with respect to the rest of the known input responses are more likely to be incorrect. These error likelihoods are measured both during and after the entry of a single form instance.

    6.1 Error Model

To formally model the notion of error, we extend our Bayesian network from Section 4 to a more sophisticated representation that ties together intended and actual question responses. We call the Bayesian network augmented with these additional random variables the error model. Specifically, we posit a network where each question is augmented with additional nodes to capture a probabilistic view of entry error. For question $i$, we have the following set of random and observed variables:

. $F_i$: the correct value for the question, which is unknown to the system, and thus a hidden variable.

. $D_i$: the question response provided by the data-entry worker, an observed variable.

. $\theta_i$: the observed variable representing the parameters of the probability distribution of mistakes across possible answers, which is fixed per question.5 We call the distribution with parameters $\theta_i$ the error distribution. For the current version of our model, $\theta_i$ is set to yield a uniform distribution.

. $R_i$: a binary hidden variable specifying whether an error was made in this question. When $R_i = 0$ (i.e., when no error is made), then $F_i$ takes the same value as $D_i$.

Additionally, we introduce a hidden variable $\lambda$, shared across all questions, specifying how likely errors are to occur for a typical question of that form instance.


5. Note that in a hierarchical Bayesian formulation such as ours, random variables can represent not just specific values but also parameters of distributions. Here, $\theta_i$ is the parameters of the error distribution.


Intuitively, $\lambda$ plays the role of a prior error rate, and is modeled as a hidden variable so that its value can be learned directly from the data.

Note that the relationships between field values discovered during structure learning are still part of the graph, so that the error predictions are contextualized in the answers of other related questions.

Within an individual question, the relationships between the newly introduced variables are shown in Fig. 5. The diagram follows standard plate diagram notation [26]. In brief, the rectangle is a plate containing a group of variables specific to a single question $i$. This rectangle is replicated for each of the form questions. The $F$ variables in each question group are connected by edges $z \in Z$, representing the relationships discovered in the structure learning process; this is the same structure used for the question ordering component. The remaining edges represent direct probabilistic relationships between the variables that are described in greater detail below. Shaded nodes denote observed variables, and clear nodes denote hidden variables.

Node $R_i \in \{0, 1\}$ is a hidden indicator variable specifying whether an error will happen at this question. Our model posits that a data-entry worker implicitly flips a coin for $R_i$ when entering a response for question $i$, with probability of one equal to $\lambda$. Formally, this means $R_i$ is drawn from a Bernoulli distribution with parameter $\lambda$:

$$R_i \mid \lambda \sim \mathrm{Bernoulli}(\lambda). \qquad (7)$$

The value of $R_i$ affects how $F_i$ and $D_i$ are related, which is described in detail later in this section.

We also allow the model to learn the prior probability for $\lambda$ directly from the data. This value represents the probability of making a mistake on any arbitrary question. Note that $\lambda$ is shared across all form questions. Learning a value for $\lambda$ rather than fixing it allows the model to produce an overall probability of error for an entire form instance as well as for individual questions. The prior distribution for $\lambda$ is a Beta distribution, which is a continuous distribution over the real numbers from zero to one:

$$\lambda \sim \mathrm{Beta}(\alpha, \beta). \qquad (8)$$

The Beta distribution takes two hyperparameters $\alpha$ and $\beta$, which we set to fixed constants (1, 19). The use of a Beta prior distribution for a Bernoulli random variable is standard practice in Bayesian modeling due to mathematical convenience and the interpretability of the hyperparameters as effective counts [27].

We now turn to true question value $F_i$ and observed input $D_i$. First, $P(F_i \mid \ldots)$ is still defined as in Section 4, maintaining as before the multivariate relationships between questions. Second, the user question response $D_i$ is modeled as being drawn from either the true answer $F_i$ or the error distribution $\theta_i$, depending on whether a mistake is made according to $R_i$:

$$D_i \mid F_i, \theta_i, R_i \sim \begin{cases} \mathrm{PointMass}(F_i), & \text{if } R_i = 0, \\ \mathrm{Discrete}(\theta_i), & \text{otherwise}. \end{cases} \qquad (9)$$

If $R_i = 0$, no error occurs and the data-entry worker inputs the correct value for $D_i$, and thus $F_i = D_i$. Probabilistically, this means $D_i$'s probability is concentrated around $F_i$ (i.e., a point mass at $F_i$). However, if $R_i = 1$, then the data-entry worker makes a mistake, and instead chooses a response for the question from the error distribution. Again, this error distribution is a discrete distribution over possible question responses parameterized by the fixed parameters $\theta_i$, which we set to be the uniform distribution in our current model.6
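The generative story of (7)-(9) can be made concrete with a short simulation; computing the posterior over $R_i$ requires the inference machinery of Section 6.2, but sampling from the model clarifies the roles of $\lambda$, $R_i$, and $\theta_i$. The error rate below is an arbitrary example value.

```python
import random

def simulate_response(true_value, domain, error_rate):
    """One draw from the error model's generative story, (7)-(9):
    R_i ~ Bernoulli(lambda); on an error, D_i is drawn from the
    uniform error distribution theta_i, otherwise D_i = F_i."""
    if random.random() < error_rate:   # R_i = 1: a mistake occurs
        return random.choice(domain)   # D_i ~ Discrete(theta_i)
    return true_value                  # R_i = 0: point mass at F_i

# Example: entries of a blood-type field with a 5 percent error rate.
domain = ["A", "B", "AB", "O"]
print([simulate_response("O", domain, 0.05) for _ in range(10)])
```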

    6.2 Error Model Inference

The ultimate variable of interest in the error model is $R_i$: we wish to induce the probability of making an error for each previously answered question, given the actual question responses that are currently available:

$$P(R_i \mid D = d), \qquad (10)$$

where $D = \{F_1, \ldots, F_\ell\}$ are the fields that currently have responses, the values of which are $d = \{f_1, \ldots, f_\ell\}$, respectively. This probability represents a contextualized error likelihood due to its dependence on other field values through the Bayesian network.

Again, we can use standard Bayesian inference procedures to compute this probability. These procedures are black box algorithms whose technical descriptions are beyond the scope of this paper. We refer the reader to standard graphical model texts for an in-depth review of different techniques [25], [28]. In our implementation, we use the Infer.NET toolkit [22] with the Expectation Propagation algorithm [29] for this estimation.


6. A more precise error distribution would allow the model to be especially wary of common mistakes. However, learning such a distribution is itself a large undertaking involving carefully designed user studies with a variety of input widgets, form layouts, and other interface variations, and a post hoc labeling of data for errors. This is another area for future work.

Fig. 5. The error model. Observed variable $D_i$ represents the actual input provided by the data-entry worker for the $i$th question, while hidden variable $F_i$ is the true value of that question. The rectangular plate around the center variables denotes that those variables are repeated for each of the form questions with responses that have already been input. The $F$ variables are connected by edges $z \in Z$, representing the relationships discovered in the structure learning process; this is the same structure used for the question ordering component. Variable $\theta_i$ represents the error distribution, which in our current model is uniform over all possible values. Variable $R_i$ is a hidden binary indicator variable specifying whether the entered data were erroneous; its probability $\lambda$ is drawn from a Beta prior with fixed hyperparameters $\alpha$ and $\beta$. Shaded nodes denote observed variables, and clear nodes denote hidden variables.


    6.3 Deciding When to Reask

Once we have inferred a probability of error for each question, we can choose to perform reasking either during entry, where the error model is consulted after each response, or after entry, where the error model is consulted once with all form responses, or both.

The advantage of integrating reasking into the primary entry process is that errors can be caught as they occur, when the particular question being entered is still recent in the data-entry worker's attention. The cognitive cost of reentering a response at this point is lower than if that question were reasked later. However, these error likelihood calculations are made on the basis of incomplete information: a question response may at first appear erroneous per se, but when viewed in the context of concordant responses from the same instance may appear less suspicious. In contrast, by batching reasking at the end of a form instance, USHER can make a holistic judgment about the error likelihoods of each response, improving its ability to estimate error. This trade-off reveals an underlying human-computer interaction question about how recency affects ease and accuracy of question response, a full exploration of which is beyond the scope of this paper.

To decide whether to reask an individual question, we need to consider the trade-off between improved data quality and the cost of additional time required for reasking. USHER allows the form designer to set an error probability threshold for reasking. When that threshold is exceeded for a question, USHER will reask that question, up to a predefined budget of reasks for the entire form instance. If the response to the reasked question differs from the original response, the question is flagged for further manual reconciliation, as in double entry. For in-flight reasking, we must also choose either the original or reasked response for this field for further predictions of error probabilities in the same form instance. We choose the value that has the lesser error probability, since in the absence of further disambiguating information that value is more likely to be correct according to the model.
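The resulting reask policy reduces to a thresholded, budgeted selection over error probabilities. In this sketch, `error_probability` is a hypothetical oracle standing in for the inference described in Section 6.2, and the threshold and budget defaults are illustrative, not values taken from the paper.

```python
def questions_to_reask(responses, error_probability, threshold=0.2, budget=3):
    """Flag up to `budget` answered questions whose contextualized error
    probability exceeds `threshold`, most suspicious first.
    `error_probability(question, responses)` is assumed to return
    P(R_i = 1 | D = d) from the error model."""
    scored = [(error_probability(q, responses), q) for q in responses]
    flagged = sorted((s for s in scored if s[0] > threshold), reverse=True)
    return [q for _, q in flagged[:budget]]
```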

    7 QUESTION REFORMULATION

Our next application of USHER's probabilistic formalism is question reformulation. To reformulate a question means to simplify it in a way that reduces the chances of the data-entry worker making an error. Fig. 6 presents an example of a question as originally presented and a reformulated version thereof. Assuming that the correct response is Foot, the data-entry worker would only need to select it out of two rather than eight choices, reducing the chance of making a mistake. Moreover, reformulation can enable streamlining of the input interface, improving data-entry throughput.

In this work, we consider a constrained range of reformulation types, emphasizing an exploration of the decision to reformulate rather than the interface details of the reformulation itself. Specifically, our target reformulations are binary questions confirming whether a particular response is the correct response, such as in the example. If the response to the binary question is negative, then we ask again the original, nonreformulated question. We emphasize that the same basic data-driven approach described here can be applied to more complex types of reformulation, for instance, reformulating to $k$ values where $k < D$, the size of the original answer domain.

The benefits of question reformulation rely on the observation that different ways of presenting a question will result in different error rates. This assumption is borne out by research in the HCI literature. For example, in recent work, Wobbrock et al. [30] showed that users' chances of clicking an incorrect location increase with how far and small the target location is. As a data-entry worker is presented with more options to select from, they will tend to make more mistakes. We also note that it takes less time for a data-entry worker to select from fewer rather than more choices, due to reduced cognitive load and shorter movement distances. These findings match our intuitions about question formulation: as complexity increases, response time and errors increase as well.

However, question reformulation comes with a price as well. Since reformulating a question reduces the number of possible responses presented to the data-entry worker, it is impossible for the reformulated question to capture all possible legitimate responses. In the example above, if Foot was not the correct response, then the original question would have to be asked again to reach a conclusive answer. Thus, the decision to reformulate should rest on how confident we are about getting the reformulation right; in other words, on the probability of the most likely answer as predicted by a probabilistic mechanism.

In light of the need for a probabilistic understanding of the responses, USHER's reformulation decisions are driven by the underlying Bayesian network. We consider reformulation in three separate contexts: static reformulation, which occurs during the form layout process; dynamic reformulation, which occurs while responses are being entered; and postentry reformulation, which is applied in conjunction with reasking to provide another form of cross validation for question responses.


    Fig. 6. An example of a reformulated question.


    7.1 Static Reformulation

In the static case, we decide during the design of a form layout which questions to reformulate, if any, in conjunction with the question ordering prediction from Section 5. This form of reformulation simplifies questions that tend to have predominant responses across previous form instances. Static reformulation is primarily appropriate for situations when forms are printed on paper and question ordering is fixed. Standard practice in form design is to include skip-logic, a notation to skip the full version of the question should the answer to the reformulated question be true. Alternatively, false responses to reformulated questions can be compiled and subsequently reasked after the standard form instance is completed.

For each question, we decide whether to reformulate on the basis of the probability of its expected response. If that response exceeds a tunable threshold, then we choose to reformulate the question into its binary form. Formally, we reformulate when

$$\max_j P(F_i = f_j) \geq T_i, \qquad (11)$$

where $T_i$ is the threshold for question $i$. In this work, we consider values of $T_i$ that are fixed for an entire form, though in general, it could be adjusted on the basis of the original question's complexity or susceptibility to erroneous responses. We note that this reformulation mechanism is directly applicable for questions with discrete answers, either categorical (e.g., blood type) or ordinal (e.g., age); truly continuous values (e.g., weight) must be discretized before reformulation. However, continuous questions with large answer domain cardinalities are less likely to trigger reformulation, especially if their probability distributions are fairly uniform.

Setting the threshold $T$ provides a mechanism for trading off improvements in data quality with the potential drawback of having to reask more questions. At one extreme, we can choose to never reformulate; at the other, if we set a low threshold we would provide simplified versions of every question, at the cost of doubling the number of questions asked in the worst case.
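Both the static rule (11) and the dynamic rule (12, below) amount to a thresholded check on the most probable response. In the sketch below, `distribution` is assumed to be a mapping from answers to probabilities produced by the model, unconditional for (11) or conditioned on prior responses for (12); the example distribution is hypothetical.

```python
def maybe_reformulate(distribution, threshold):
    """Return the answer to confirm in a binary reformulation if the
    most likely response clears the threshold T_i, else None. Passing
    an unconditional distribution implements (11); passing one
    conditioned on prior responses implements (12)."""
    value, prob = max(distribution.items(), key=lambda kv: kv[1])
    return value if prob >= threshold else None

# Example: reformulate a transport-mode question into "Foot? (yes/no)".
dist = {"Foot": 0.82, "Bus": 0.10, "Car": 0.05, "Bicycle": 0.03}
print(maybe_reformulate(dist, threshold=0.75))  # -> "Foot"
```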

    7.2 Dynamic Reformulation

Paralleling the dynamic approach we developed for question ordering (Section 5.1), we can also decide to reformulate questions during form entry, making the decision based on previous responses. The advantage of dynamic reformulation is that it has the flexibility to change a question based on context: as a simple example, conditioned on the answer for an age question being 12, we may choose to reformulate a question about occupation into a binary is-student question. Dynamic reformulation is appropriate in many electronic, nonpaper-based workflows. In this case, the reformulation decision is based on a conditional expected response; for question $i$, we reformulate when

$$\max_j P(F_i = f_j \mid G = g) \geq T_i, \qquad (12)$$

where previous questions $G = \{F_1, \ldots, F_\ell\}$ have already been filled in with values $g = \{f_1, \ldots, f_\ell\}$. Note the similarities in how the objective function is modified for both ordering and reformulation (compare (11) and (12) to (4) and (6)).

    7.3 Reformulation for Reasking

Finally, another application of reformulation is for reasking questions. As discussed in Section 6, the purpose of reasking is to identify when a response may be in error, either during or after the primary entry of a form instance. One way of reducing the overhead associated with reasking is to simplify the reasked questions. Observe that a reask question does not have to elicit the true answer, but rather a corroborating answer. For example, for the question age, a reformulated reask question could be the discretization bucket in which the age falls (e.g., 21-30). From the traditional data quality assurance perspective, this technique enables dynamic cross-validation questions based on contextualized error likelihood.

The actual mechanics of the reformulation process are the same as before. Unlike the other applications of reformulation, however, here we have an answer for which we can compute error likelihood.

    8 EVALUATION

We evaluated the benefits of USHER by simulating two data-entry scenarios to show how our system can improve data quality. We focused our evaluation on the quality of our model and its predictions. While we believe that the data-entry user interface can also benefit from value prediction, as we discuss in Section 9, we factor out the human-computer interaction concerns of form widget design by automatically simulating user entry. As such, we set up experiments to measure our model's ability to predict users' intended answers, to catch artificially injected errors, and to reduce error using reformulated questions. We first describe the experimental data sets, and then present our simulation experiments and results.

    8.1 Data Sets and Experimental Setup

We examine the benefits of USHER's design using two data sets, previously described in Section 3. The survey data set comprises responses from a 1986 poll about race and politics in the San Francisco-Oakland metropolitan area [31]. The UC Berkeley Survey Research Center interviewed 1,113 persons by random-digit telephone dialing. The patient data set was collected from anonymized patient intake records at a rural HIV/AIDS clinic in Tanzania. In total, we had 15 questions for the survey and nine for the patient data. We discretized continuous values using fixed-length intervals and treated the absence of a response to a question as a separate value to be predicted.

For both data sets, we randomly divided the available prior submissions into training and test sets, split 80 to 20 percent, respectively. For the survey, we had 891 training instances and 222 test; for patients, 1,320 training and 330 test. We performed structure learning and parameter estimation using the training set. As described in Section 4, this resulted in the graphical models shown in Figs. 1 and 2. The test portion of each data set was then used for the data-entry scenarios presented below.

In our simulation experiments, we aim to verify hypotheses regarding three components of our system: first, that our data-driven question orderings ask the most uncertain questions first, improving our ability to predict missing responses; second, that our reasking model is able to identify erroneous responses accurately, so that we can target those questions for verification; and third, that question reformulation is an effective mechanism for trading off between improved data quality and user effort.

    8.2 Ordering

For the ordering experiment, we posit a scenario where the data-entry worker is interrupted while entering a form submission, and thus is unable to complete the entire instance. Our goal is to measure how well we can predict the remaining questions under four different question orderings: USHER's precomputed static ordering, USHER's dynamic ordering (where the order can be adjusted in response to individual question responses), the original form designer's ordering, and a random ordering. In each case, predictions are made by computing the mode of the probability distribution over each unentered question, given the known responses. Results are averaged over each instance in the test set.
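In code, the prediction step looks roughly like the following sketch, where `posterior` is a hypothetical hook into the learned model's inference routine:

```python
def predict_unfilled(posterior, entered, unfilled):
    """Predict each unentered question as the mode of its conditional
    distribution given the responses entered before interruption."""
    predictions = {}
    for q in unfilled:
        dist = posterior(q, entered)         # P(F_q = . | entered)
        predictions[q] = max(dist, key=dist.get)
    return predictions

# Toy posterior over two remaining questions:
def posterior(q, entered):
    table = {"smoker": {"no": 0.8, "yes": 0.2},
             "employed": {"yes": 0.6, "no": 0.4}}
    return table[q]

print(predict_unfilled(posterior, {"age": 34}, ["smoker", "employed"]))
# -> {'smoker': 'no', 'employed': 'yes'}
```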

The left-hand graphs of Fig. 7 measure the average number of correctly predicted unfilled questions, as a function of how many responses the data-entry worker entered before being interrupted. In each case, the USHER orderings are able to predict question responses with greater accuracy than both the original form ordering and a random ordering for most truncation points. Similar relative performance is exhibited when we measure the percentage of test set instances where all unfilled questions are predicted correctly, as shown on the right side of Fig. 7.

The original form orderings tend to underperform their USHER counterparts. Human form designers typically do not optimize for asking the most difficult questions first, instead often focusing on boilerplate material at the beginning of a form. Such a design methodology does not optimize for greedy information gain.

As expected, between the two USHER approaches, the dynamic ordering yields slightly greater predictive power than the static ordering. Because the dynamic approach is able to adapt the form to the data being entered, it can focus its question selection on high-uncertainty questions specific to the current form instance. In contrast, the static approach effectively averages over all possible uncertainty paths.

8.3 Reasking

For the reasking experiment, our hypothetical scenario is one where the data-entry worker enters a complete form instance, but with erroneous values for some question responses.7 Specifically, we assume that for each data value the data-entry worker has some fixed chance p of making a mistake. When a mistake occurs, we assume that an erroneous value is chosen uniformly at random. Once the entire instance is entered, we feed the entered values to our error model and compute the probability of error for each question. We then reask the questions with the highest error probabilities, and measure whether we chose to reask the questions that were actually wrong. Results are averaged over 10 random trials for each test set instance.
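The trial logic is easy to state in code; this sketch uses a placeholder scoring function where USHER would plug in its error model:

```python
import random

def reask_trial(truth, domains, error_probability, p, k, rng):
    """One simulated trial: corrupt each response with probability p
    (uniform over wrong values), rank questions by the error model's
    score, reask the top k, and report whether every injected error
    was among the reasked questions."""
    entered = dict(truth)
    for q, v in truth.items():
        if rng.random() < p:
            entered[q] = rng.choice([x for x in domains[q] if x != v])
    ranked = sorted(entered, key=lambda q: error_probability(q, entered),
                    reverse=True)
    wrong = {q for q in truth if entered[q] != truth[q]}
    return wrong <= set(ranked[:k])

rng = random.Random(0)
domains = {"sex": ["m", "f"], "age": list(range(10, 100, 10))}
truth = {"sex": "f", "age": 30}
toy_model = lambda q, entered: 1.0 / len(domains[q])  # placeholder scores
print(reask_trial(truth, domains, toy_model, p=0.1, k=1, rng=rng))
```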

Fig. 8 plots the percentage of instances where we chose to reask all of the erroneous questions, as a function of the number of questions that are reasked, for error probabilities of 0.05, 0.1, and 0.2. In each case, our error model is able to make significantly better choices about which questions to reask than a random baseline. In fact, for p = 0.05, which is a representative error rate observed in the field [7], USHER successfully reasks all errors over 80 percent of the time within the first three questions in both data sets. We observe that the traditional approach of double entry corresponds to reasking every question; under reasonable assumptions about the occurrence of errors, our model is able to achieve the same result of identifying all erroneous responses at a substantially reduced cost.

7. Our experiments here do not include simulations with in-flight reasking, as quantifying the benefit of reasking during entry (question recency) is a substantial human factors research question that is left as future work.

Fig. 7. Results of the ordering simulation experiment. In each case, the x-axis measures how many questions are filled before the submission is truncated. In the charts on the left side, the y-axis plots the average proportion of remaining questions whose responses are predicted correctly. In the charts on the right side, the y-axis plots the proportion of form instances for which all remaining questions are predicted correctly. Results for the survey data are shown at top, and for the HIV/AIDS data at bottom.

8.4 Reformulation

For the reformulation experiment, we simulate form filling with a background error rate and time cost in order to evaluate the impact of reformulated questions. During simulated entry, when a possible response a is at the mode position of the conditional probability distribution and has a likelihood greater than a threshold t, we ask whether the answer is a as a reformulated binary question. If a is not the true answer, we must reask the full question. Results are averaged over each instance in the test set.
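A sketch of the per-question entry logic in this simulation; the cost constants are illustrative, and `dist` stands in for the conditional distribution at that point in the form:

```python
def enter_question(dist, truth, t, cost_full, cost_binary):
    """Simulated entry of one question. If the conditional mode a has
    likelihood >= t, first ask the binary question "is it a?"; a
    negative response forces a reask of the full question, so both
    costs are paid."""
    guess = max(dist, key=dist.get)
    if dist[guess] >= t:
        if guess == truth:
            return cost_binary               # correct reformulation
        return cost_binary + cost_full       # wrong guess: full reask
    return cost_full                         # no reformulation triggered

dist = {"O+": 0.85, "A+": 0.10, "B+": 0.05}
print(enter_question(dist, "O+", t=0.8, cost_full=4.0, cost_binary=2.0))  # 2.0
print(enter_question(dist, "A+", t=0.8, cost_full=4.0, cost_binary=2.0))  # 6.0
```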

Before discussing these results, we motivate the choice of error rates and cost functions that we employ in this experiment. As mentioned in Section 7, the intuition behind question reformulation is grounded in prior literature, specifically the notion that simpler questions enjoy both lower error rates and lower user effort. However, the downside of reformulation is that entry forms may cost more to complete, due to reformulated questions with negative responses.

In order to bootstrap this experiment, we need to derive a representative set of error rates and entry costs that vary with the complexity of a question. Previous work [30], [32] has shown that both entry time and error rate increase as a function of interface complexity. In particular, Fitts' Law [32] describes the time complexity of interface usage via an index of difficulty (ID), measured in bits as log(A/W + 1). This is a logarithmic function of the ratio between target size W and target distance A. Mapping this to typical data-entry widgets such as radio buttons and drop-down menus, where W is fixed and A increases linearly with the number of selections, we model time cost as log D, where D is the domain cardinality of the answer. In other words, time cost grows with how many bits it takes to encode the answer. For our experiments, we set the endpoints at 2 seconds for D = 2 up to 4 seconds for D = 128.

We also increase error probabilities logarithmically as a function of domain cardinality D, relying on the intuition that error will also tend to increase as complexity increases [30]. Our error rates vary from 1 percent for D = 2 to 5 percent for D = 128.
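Concretely, under these assumptions the cost and error curves can be written as simple log-scale interpolations; the linear interpolation between endpoints is our reading of the text, since only the endpoint values are stated:

```python
import math

def interp_log2(D, lo, hi):
    """Interpolate between the stated endpoints at D = 2 and D = 128,
    linearly on a log2 scale of the answer domain cardinality D."""
    frac = (math.log2(D) - 1.0) / (math.log2(128) - 1.0)
    return lo + (hi - lo) * frac

time_cost = lambda D: interp_log2(D, 2.0, 4.0)      # seconds per answer
error_rate = lambda D: interp_log2(D, 0.01, 0.05)   # probability of error

print(time_cost(2), time_cost(128))    # 2.0 4.0
print(round(error_rate(8), 4))         # 0.0233
```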

We do not claim the generalizability of these specific numbers, which are derived from a set of strong assumptions. Rather, the values we have selected are representative for studying the general trends of the trade-off between data quality and cost that reasking enables, and are in line with typical values observed in the field [7]. Furthermore, we attempted this experiment with other error and cost parameters and found similar results.

The results of question reformulation can be found in Fig. 9. In the pair of graphs entitled (a), we measure the error rate over reformulation thresholds for each data set. Our results confirm the hypothesis that the greater the number of additional reformulated questions we ask, the lower the error rate. In the pair of graphs entitled (b), we observe that as the selectivity (threshold) of reformulation goes up, the likelihood that we pick the correct answer in reformulation also rises. Observe that reformulation accuracy is greater than 80 and 95 percent for the survey and patient data sets, respectively, at a threshold of 0.8. In the pair of graphs entitled (c), we see an unexpected result: entry with reformulation has a time cost that quickly converges with that of standard entry at thresholds beyond 0.6, and in the case of the patient data set even dips below it. Finally, in the pair of graphs entitled (d), we summarize the time cost incurred by additional questions versus the time savings of the simpler reformulated questions. Of course, given our assumptions, we cannot make a strong conclusion about the cost of question reformulation. Rather, the important takeaway is that the decrease in effort won by correct reformulations can help to offset the increase due to incorrect reformulations.


Fig. 8. Results of the reasking simulation experiment. In each case, the x-axis measures how many questions we are allowed to reask, and the y-axis measures whether we correctly identify all erroneous questions within that number of reasks. The error probability indicates the rate at which we simulate errors in the original data. Results for the survey data are shown at top, and for the HIV/AIDS data at bottom.


9 DISCUSSION: DYNAMIC INTERFACES FOR DATA ENTRY

In the sections above, we described how USHER uses statistical information traditionally associated with offline data cleaning to improve interactive data entry via question ordering and reasking. This raises questions about the human-computer interactions inherent in electronic form filling, which are typically device- and application-dependent. In one application, we are interested in how data quality interactions play out on mobile devices in developing countries, as in the Tanzanian patient forms we examined above. Similar questions arise in traditional online forms like web surveys. In this section, we outline some design opportunities that arise from the probabilistic power of the models and algorithms in USHER. We leave the investigation of specific interface designs and their evaluation in various contexts to future work.

While an interactive USHER-based interface is presenting questions (either one-by-one or in groups), it can infer a probability for each possible answer to the next question; those probabilities are contextualized (conditioned) by previous responses. The resulting quantitative probabilities can be exposed to users in different manners and at different times. We present some of these design options in the following:

1. Time of exposure, pre- and postentry: Before entry, USHER's probabilistic model can be used to improve data-entry speed by adjusting the friction of entering different answers: likely results can be made easy or attractive to enter, while unlikely results can be made to require more work. One example of this is the previously described reformulation technique. Additional examples of data-driven variance in friction include type-ahead mechanisms in text fields, popular choice items repeated at the top of drop-down lists, and direct decoration (e.g., coloring or font size) of each choice in accordance with its probability. A downside of beforehand exposure of answer probabilities is the potential to bias answers. Alternatively, probabilities may be exposed in the interface only after the user selects an answer. This becomes a form of assessment: for example, by flagging unlikely choices as potential outliers. This can be seen as a soft, probabilistic version of the constraint violation visualizations commonly found in web forms (e.g., the red star that often shows up next to forbidden or missing entries). Post hoc assessment arguably has less of a biasing effect than friction. This is both because users choose initial answers without knowledge of the model's predictions, and because users may be less likely to modify previous answers than change their minds before entry.

Fig. 9. Results of the question reformulation experiment. In each chart, the x-axis shows the reformulation thresholds; when the threshold is 1, no question is reformulated. (a) The overall error rate of reformulated versus standard entry. (b) The likelihood that a reformulated answer was, in fact, the correct answer. (c) The impact of reformulation on user effort, measured as time (average number of seconds per form). (d) The gain/loss in effort due to when reformulation is correct versus incorrect.

2. Explicitness of exposure: Feedback mechanisms in adaptive interfaces vary in terms of how explicitly they intervene in the user's task. Adaptations can be considered elective versus mandatory. For instance, a drop-down menu with items sorted based on likelihood is mandatory with a high level of friction; whereas a split drop-down menu, as mentioned above, is elective: the user can choose to ignore the popular choices. Another important consideration is the cognitive complexity of the feedback. For instance, when encoding expected values into a set of radio buttons, we can directly show the numeric probability of each choice, forcing a user to interpret these discrete probabilities. Alternatively, we can scale the opacity of answer labels, giving the user an indication of relative salience without the need for interpretation. Even more subtly, we can dynamically adjust the size of answer labels' clickable regions, similar to the adjustments made by the iPhone's soft keyboard in response to the likelihood of various letters.

3. Contextualization of interface: USHER uses conditional probabilities to assess the likelihood of subsequent answers. However, this is not necessarily intuitive to a user. For example, consider a question asking for favorite beverage, where the most likely answers shown are milk and apple juice. This might be surprising in the abstract, but would be less so in a case where a previous question had identified the age of the person in question to be under 5 years old. The way that the interface communicates the context of the current probabilities is an interesting design consideration. For example, type-ahead text interfaces have this flavor, showing the likely suffix of a word contextualized by the previously entered prefix. More generally, USHER makes it possible to show a history of already entered answers that correlate highly with the value at hand.

Note that these design discussions are not specifically tied to any particular widgets. In Fig. 10, we show some examples of user-interface widgets that have been adapted using information provided by USHER's probabilistic model: the drop-down menu in part (a) features first an elective split-menu adaptation before entry and a color-encoded value assessment after entry; the text field in part (b) shows type-ahead suggestions ordered by likelihood, thus decreasing the physical distance (a form of friction) for more likely values; the radio buttons in part (c) directly communicate probabilities to the user.

Fig. 10. Mockups of some simple dynamic data-entry widgets illustrating various design options.

While these broad design properties help clarify the potential user experience benefits of USHER's data-driven philosophy, there are clearly many remaining questions about how to do this embedding effectively for different settings and users. Those questions are beyond the scope of this paper. In separate work, we used USHER's predictive ability to design intelligent user-interface adaptations, studied them with data-entry clerks in a rural Ugandan health clinic, and showed that our adaptations have the potential to reduce error (by up to 78 percent) [33].

    10 DISCUSSION AND FUTURE WORK

In this paper, we have shown that a probabilistic approach can be used to design intelligent data-entry forms that promote high data quality. USHER leverages data-driven insights to automate multiple steps in the data-entry pipeline. Before entry, we find an ordering of form fields that promotes rapid information capture, driven by a greedy information gain principle, and can statically reformulate questions to promote more accurate responses. During entry, we dynamically adapt the form based on entered values, facilitating reasking, reformulation, and real-time interface feedback in the spirit of providing appropriate entry friction. After entry, we automatically identify possibly erroneous inputs, guided by contextualized error likelihoods, and reask those questions, possibly reformulated, to verify their correctness. Our simulated empirical evaluations demonstrate the data quality benefits of each of these components: question ordering, reformulation, and reasking.

The USHER system we have presented is a cohesive synthesis of several disparate approaches to improving data quality for data entry. The three major components of the system (ordering, reasking, and reformulation) can all be applied under various guises before, during, and after data entry. This suggests a principled road map for future research in data entry. For example, one combination we have not explored here is reasking before entry. At first glance this may appear strange, but in fact that is essentially the role that cross-validation questions in paper forms serve, as preemptive reformulated reasked questions. Translating such static cross-validation questions to dynamic forms is a potential direction of future work.

Another major piece of future work alluded to in Section 9 is to study how our probabilistic model can inform effective adaptations of the user interface during data entry. We intend to answer this question in greater depth through user studies and field deployments of our system.

We can also extend this work by enriching the underlying probabilistic formalism. Our current probabilistic approach assumes that every question is discrete and takes on a series of unrelated values. Relaxing these assumptions would make for a potentially more accurate predictive model for many domains. Additionally, we would want to consider models that reflect temporal changes in the underlying data. Our present error model makes strong assumptions both about how errors are distributed and what errors look like. On that front, an interesting line of future work would be to learn a model of data-entry errors and adapt our system to catch them.

Finally, we are in the process of measuring the practical impact of our system, by piloting USHER with our field partners, the United Nations Development Program's Millennium Villages Project [34] in Uganda, and a community health care program in Tanzania. These organizations' data quality concerns were the original motivation for this work and thus serve as an important litmus test for our system.

    ACKNOWLEDGMENTS

The authors thank Maneesh Agrawala, Yaw Anokwa, Michael Bernstein, S.R.K. Branavan, Tyson Condie, Heather Dolan, Jeff Heer, Max van Kleek, Neal Lesh, Alice Lin, Jamie Lockwood, Bob McCarthy, Tom Piazza, Christine Robson, and the anonymous reviewers for their helpful comments and suggestions. We acknowledge the Yahoo Labs Technology for Good Fellowship, the Natural Sciences and Engineering Research Council of Canada, the US National Science Foundation (NSF) Graduate Fellowship, and NSF Grant 0713661.

REFERENCES
[1] T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning. Wiley, 2003.
[2] C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques. Springer, 2006.
[3] R.M. Groves, F.J. Fowler, M.P. Couper, J.M. Lepkowski, E. Singer, and R. Tourangeau, Survey Methodology. Wiley-Interscience, 2004.
[4] J. Lee, "Address to WHO Staff," http://www.who.int/dg/lee/speeches/2003/21_07/en, 2003.
[5] J.V.D. Broeck, M. Mackay, N. Mpontshane, A.K.K. Luabeya, M. Chhagan, and M.L. Bennish, "Maintaining Data Integrity in a Rural Clinical Trial," Clinical Trials, vol. 4, pp. 572-582, 2007.
[6] R. McCarthy and T. Piazza, Personal Interview, Univ. of California at Berkeley Survey Research Center, 2009.
[7] S. Patnaik, E. Brunskill, and W. Thies, "Evaluating the Accuracy of Data Collection on Mobile Phones: A Study of Forms, SMS, and Voice," Proc. IEEE/ACM Int'l Conf. Information and Comm. Technologies and Development (ICTD), 2009.
[8] J.M. Hellerstein, "Quantitative Data Cleaning for Large Databases," United Nations Economic Commission for Europe (UNECE), 2008.
[9] A. Ali and C. Meek, "Predictive Models of Form Filling," Technical Report MSR-TR-2009-1, Microsoft Research, Jan. 2009.
[10] L.A. Hermens and J.C. Schlimmer, "A Machine-Learning Apprentice for the Completion of Repetitive Forms," IEEE Expert: Intelligent Systems and Their Applications, vol. 9, no. 1, pp. 28-33, Feb. 1994.
[11] D. Lee and C. Tsatsoulis, "Intelligent Data Entry Assistant for XML Using Ensemble Learning," Proc. ACM 10th Int'l Conf. Intelligent User Interfaces (IUI), 2005.
[12] J.C. Schlimmer and P.C. Wells, "Quantitative Results Comparing Three Intelligent Interfaces for Information Capture," J. Artificial Intelligence Research, vol. 5, pp. 329-349, 1996.
[13] S.S.J.R. Warren, A. Davidovic, and P. Bolton, "Mediface: Anticipative Data Entry Interface for General Practitioners," Proc. Australasian Conf. Computer Human Interaction (OzCHI), 1998.
[14] J. Warren and P. Bolton, "Intelligent Split Menus for Data Entry: A Simulation Study in General Practice Medicine," Proc. Am. Medical Informatics Assoc. (AMIA) Ann. Symp., 1999.
[15] Y. Yu, J.A. Stamberger, A. Manoharan, and A. Paepcke, "Ecopod: A Mobile Tool for Community Based Biodiversity Collection Building," Proc. Sixth ACM/IEEE-CS Joint Conf. Digital Libraries (JCDL), 2006.
[16] S. Day, P. Fayers, and D. Harvey, "Double Data Entry: What Value, What Price?" Controlled Clinical Trials, vol. 19, no. 1, pp. 15-24, 1998.
[17] D.W. King and R. Lashley, "A Quantifiable Alternative to Double Data Entry," Controlled Clinical Trials, vol. 21, no. 2, pp. 94-102, 2000.
[18] K. Kleinman, "Adaptive Double Data Entry: A Probabilistic Tool for Choosing Which Forms to Reenter," Controlled Clinical Trials, vol. 22, no. 1, pp. 2-12, 2001.
[19] K.L. Norman, Online Survey Design Guide, http://lap.umd.edu/survey_design, 2011.
[20] A. Hartemink, "Banjo: Bayesian Network Inference with Java Objects," http://www.cs.duke.edu/~amink/software/banjo, 2011.
[21] F.G. Cozman, "JavaBayes: Bayesian Networks in Java," http://www.cs.cmu.edu/~javabayes, 2011.
[22] Microsoft Research Cambridge, "Infer.NET," http://research.microsoft.com/en-us/um/cambridge/projects/infernet, 2011.
[23] D. Heckerman, D. Geiger, and D.M. Chickering, "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data," Machine Learning, vol. 20, no. 3, pp. 197-243, 1995.
[24] F. Jelinek and R.L. Mercer, "Interpolated Estimation of Markov Source Parameters from Sparse Data," Proc. Workshop Pattern Recognition in Practice, 1980.
[25] C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[26] W.L. Buntine, "Operations for Learning with Graphical Models," J. Artificial Intelligence Research, vol. 2, pp. 159-225, 1994.
[27] J.M. Bernardo and A.F. Smith, Bayesian Theory. Wiley, 2000.
[28] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[29] T.P. Minka, "Expectation Propagation for Approximate Bayesian Inference," Proc. Conf. Uncertainty in Artificial Intelligence, 2001.
[30] J.O. Wobbrock, E. Cutrell, S. Harada, and I.S. MacKenzie, "An Error Model for Pointing Based on Fitts' Law," Proc. 26th Ann. SIGCHI Conf. Human Factors in Computing Systems (CHI '08), pp. 1613-1622, 2008.
[31] "Survey Documentation and Analysis," UC Berkeley, http://sda.berkeley.edu, 2011.
[32] P.M. Fitts, "The Information Capacity of the Human Motor System in Controlling the Amplitude of Movement," J. Experimental Psychology, vol. 47, no. 6, pp. 381-391, 1954.
[33] K. Chen, T. Parikh, and J.M. Hellerstein, "Designing Adaptive Feedback for Improving Data Entry Accuracy," Proc. ACM Symp. User Interface Software and Technology, 2010.
[34] The Millennium Villages Project, http://www.millenniumvillages.org, 2011.

Kuang Chen received the BS degree in computer science and the BA degree in comparative history of ideas from the University of Washington, and the MS degree in computer science from the University of California, Berkeley. He is a PhD candidate at the University of California, Berkeley. His research focuses on data management systems that help low-resource organizations in the developing world, aiming to improve local practices in data collection, data quality, information integration, and analytics. He is a student member of the IEEE.

Harr Chen received the BS degrees (summa cum laude) in computer science and in applied and computational mathematical sciences from the University of Washington, and the SM degree in computer science from the Massachusetts Institute of Technology. He is a PhD candidate at the Massachusetts Institute of Technology. His research interests are in statistical natural language processing, information retrieval, and machine learning. He is a recipient of the National Science Foundation Graduate Fellowship and the National Defense Science and Engineering Graduate Fellowship.

Neil Conway received the BSc degree from Queen's University and the MSc degree from the University of California, Berkeley. He is currently working toward the PhD degree in computer science at the University of California, Berkeley. His research interests include distributed systems, logic programming, and large-scale data management.

Joseph M. Hellerstein is a professor of computer science at the University of California, Berkeley, whose work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research fellow, the recipient of two ACM-SIGMOD Test of Time awards for his research, and a member of the IEEE Computer Society. In 2010, Fortune Magazine included him in their list of the 50 smartest people in technology, and Technology Review magazine included his work on Distributed Programming on their 2010 TR10 list of the 10 technologies most likely to change our world. Key ideas from his research have been incorporated into commercial and open source software from IBM, Oracle, and PostgreSQL. He is a past director of Intel Research Berkeley, and currently serves on the technical advisory boards of a number of computing and Internet companies.

Tapan S. Parikh received the BSc degree (honors) in molecular modeling from Brown University and the MS and PhD degrees in computer science from the University of Washington. He is an assistant professor at the UC Berkeley School of Information. His research interests include human-computer interaction (HCI), mobile computing, and information systems for microfinance, smallholder agriculture, and global health. He was named Technology Review magazine's Humanitarian of the Year in 2007.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
