Data Analytics for Marketing Decision Support: Introduction and a Wallet Estimation Case Study

IBM Research, © 2006 IBM Corporation
Saharon Rosset, IBM T.J. Watson Research Center


  • IBM Research

    2006 IBM Corporation

    Data Analytics for Marketing Decision Support: Introduction and a Wallet Estimation Case Study

    Saharon Rosset, IBM T.J. Watson Research Center

  • IBM Research

    Two parts

    • Introduction to the use of data mining in marketing applications (collaborator: Naoki Abe)
      • What are the problems we address?
      • Comparison of data mining and marketing science approaches
      • Some of the challenges for data mining approaches
    • Customer wallet and opportunity estimation: analytical approaches and applications (collaborators: Claudia Perlich, Rick Lawrence, Srujana Merugu and others)
      • Define the problem
      • Describe analytic solutions
      • Demonstrate performance in a real application

  • IBM Research

    The grand challenges of marketing

    • Maximize profits (duh)
    • Initiate, maintain and improve relationships with customers:
      • Acquire customers
      • Create loyalty, prevent churn
      • Improve profitability (lifetime value)
    • Optimize use of resources:
      • Sales channels
      • Advertising
      • Customer targeting

  • IBM Research

    Some of the concrete modeling problems

    • Channel optimization
    • Cross/up-sell (customer targeting)
    • New customer acquisition
    • Churn analysis
    • Product life-cycle analysis
    • Customer lifetime value (LTV) modeling
      • Effect of marketing actions on LTV?
    • Advertising allocation
    • RFM (Recency, Frequency, Monetary) analysis
    • ...

  • IBM Research

    Data analytics for decision support: grand challenge

    Beyond modeling the current situation, we need to offer insight about the effect or potential of possible actions and decisions:
    • How would different channels / incentives affect the LTV of our customers?
    • How much more money could this customer be spending with us (customer wallet)?
    • Can we predict the effects of new actions that have never been tried in historical data?
      • What if they have been tried on a non-representative set?
    • Can we be confident our results are actionable?
      • Can we differentiate causality from correlation in our models?

  • IBM Research

    Typical marketing analytics vs. data mining

    CRM analytics:
    • Relies on primary research (= surveys) to understand needs and wants
    • Relies on (more or less) detailed models of customer behavior
      • Usually parametric statistical models
      • Often estimates customer-level parameters

    Data mining:
    • Typically relies on data in a data warehouse / mart
    • Uses a minimum of parametric assumptions
    • Often attempts to fit the problem into a standard modeling framework: classification, regression, clustering...

  • IBM Research

    Comparison of approaches

    Criterion                                                               Marketing   DM
    Parametric models formalize knowledge of domain and problems                +       -
    Rely on existing, abundant data in corporate data warehouses                -       +
    Actively collect the data to estimate model quantities (active learning)    +       -
    Robust against incorrect assumptions about domain and problems              -       +
    Use data to learn new, surprising patterns about customer behavior          -       +
    Integrate expert input from managers and customers (wants and needs)        +       -

  • IBM Research

    Example 1: modeling and improving LTV

    Rust, Lemon and Zeithaml (2004), Return on Marketing: Using Customer Equity to Focus Marketing Strategy, J. of Marketing

    • Modeling customer equity / lifetime value
    • Combine several previous approaches
    • Model the brand switching matrix as a function of customer preference, history and product properties
    • Want to identify drivers of satisfaction (levers)
    • Calculate effect (ROI) of marketing actions pulling levers
    • Mostly relies on primary research collected specifically for this study
      • Interviews with managers
      • Survey of consumer preferences

  • IBM Research

    Simplified version of the paper's business model

    [Diagram: Marketing investment -> Pulling levers -> Increased equity -> Return on marketing investment; Marketing investment -> Costs -> Return on marketing investment]

    Main goals:
    • Identify relevant levers
    • Quantify their effect

  • IBM Research

    Analytic setup (main components only)

    $\mathrm{logit}(p_{ijk}) = \beta_{0k}\,\mathrm{LAST}_{ijk} + x_{ik}\,\beta_k$

    • $p_{ijk}$ is the probability that customer $i$ buys item $k$ given that they bought item $j$ previously
    • LAST is a dummy variable for inertia
    • $x_{ik}$ is a feature vector for customer $i$ and product $k$

    This is used to compute the brand switching matrix $\{p_{ijk}\}$, and customer lifetime value is calculated as:

    $\mathrm{CLV}_{ij} = \sum_t \mathrm{PROF}_{ij}\, B_{ijt}$

    • PROF is a profit measure considering discounting, price and cost (assumed known)
    • $B_{ijt}$ is the probability that customer $i$ buys product $j$ at time $t$, calculated from the stochastic matrix $\{p_{ijk}\}$
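The setup above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the switching probabilities are computed as a multinomial logit (softmax), a single coefficient vector `BETA` is shared across brands for brevity (the slide allows a brand-specific $\beta_k$), and all feature values, coefficients, profits and the discount factor are hypothetical numbers, not figures from the paper.

```python
import math

def switching_matrix(features, inertia, beta):
    """Row j, column k: P(next purchase is brand k | last purchase was brand j)."""
    n = len(features)
    matrix = []
    for j in range(n):
        # Score of brand k: inertia bonus if k was bought last, plus x_k . beta.
        scores = [
            (inertia[k] if k == j else 0.0)
            + sum(b * x for b, x in zip(beta, features[k]))
            for k in range(n)
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix

def clv(matrix, last_brand, focal_brand, profit, discount, horizon):
    """Sum over periods of discounted profit times P(customer buys the focal brand)."""
    n = len(matrix)
    dist = [1.0 if j == last_brand else 0.0 for j in range(n)]
    value = 0.0
    for t in range(1, horizon + 1):
        # One step of the Markov chain: B_t = B_{t-1} P.
        dist = [sum(dist[j] * matrix[j][k] for j in range(n)) for k in range(n)]
        value += profit * discount ** t * dist[focal_brand]
    return value

# Hypothetical data: three brands described by (price, quality) for one customer.
FEATURES = [(1.0, 0.8), (0.7, 0.5), (1.2, 0.9)]
INERTIA = [0.85, 0.85, 0.85]   # assumed beta_0k values
BETA = (-0.2, 0.44)            # assumed weights on price and quality
P = switching_matrix(FEATURES, INERTIA, BETA)
CLV_BRAND0 = clv(P, last_brand=0, focal_brand=0, profit=100.0, discount=0.9, horizon=12)
print(P[0], round(CLV_BRAND0, 2))
```

Changing a component of `FEATURES` (say, the quality of brand 0) and recomputing `CLV_BRAND0` is the kind of "pull a lever, measure the effect on equity" calculation the slide describes.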

  • IBM Research

    Data definitions

    • Potential drivers (marketing activities) are reflected in the components of $x_i$
      • Price
      • Quality of service
      • etc.
    • The data to estimate the logit model is based on:
      • Expert (manager) input
      • Questionnaires of customers
      • Corporate data warehouse (not implemented in their case study...)

  • IBM Research

    Results: important drivers for airline industry?

    Driver        Coefficient   Std error   Z score (coeff/std)
    Inertia       .849          .075        11.34
    Quality       .441          .041        10.87
    Price         .199          .020        9.86
    Convenience   .609          .093        6.56
    Etc. (all factors deemed important)

  • IBM Research

    What would a data miner do?

    • Count more (or only) on historical data in the data warehouse
      • Variables would have different meaning
      • Identify correlations, not necessarily drivers
    • Could use the same analytic formulation, but also try alternative approaches
      • Relate LTV directly to variables observed?
      • Model transaction sizes in addition to switching?
      • Use non-parametric modeling tools?
      • Etc.

  • IBM Research

    Example 2: the segmentation approach

    Common practice in marketing:
    • Define static, fixed customer segments
      • Supposed to capture the "true essence" of customers' behaviors, needs and wants
      • Often given catchy names representing the "average" profile, e.g. "Upwardly mobile businessmen"
    • Make marketing decisions at segment level, based on understanding of needs and wants

  • IBM Research

    A market segmentation methodology

    Based on Kotler (2000), Marketing Management, Prentice-Hall
    1. Survey stage: primary research to capture motivations, attitudes, behaviors
    2. Analysis stage: factor analysis, then clustering of survey data
       • Identify segments
    3. Profiling stage: analyze segments and give them names

    An additional stage often taken is to assign all customers to the defined segments:
    4. Assignment stage: build a classification model to assign all customers to learned segments

  • IBM Research

    What would a data miner do?

    Option 1: clustering
    • Replace primary research by warehouse data
    • Cluster all customers
    • Lose the "needs and wants" aspect

    Option 2: supervised learning
    • Treat each decision problem as a separate modeling task
      • E.g., find positive and negative examples for each binary decision, learn a model
    • Advantage: customized
    • Disadvantages:
      • May not have the right data to model the decisions we want to make
      • Past correlations may not be indicative of future outcomes

  • IBM Research

    Comparison of approaches

    Criterion                                                               Marketing   DM
    Parametric models formalize knowledge of domain and problems                +       -
    Rely on existing, abundant data in corporate data warehouses                -       +
    Actively collect the data to estimate model quantities (active learning)    +       -
    Robust against incorrect assumptions about domain and problems              -       +
    Use data to learn new, surprising patterns about customer behavior          -       +
    Integrate expert input from managers and customers (wants and needs)        +       -

  • IBM Research

    An integrated approach

    • Count on historical data as much as possible
    • Avoid complex parametric models
      • Let the data guide us
    • Still want to integrate domain knowledge
    • Analyze and understand the special aspects of marketing modeling problems
      • Importance of long-term relationship (lifetime value, loyalty)
      • Effects of competition (customer wallet vs. customer spending)
    • Modify existing, or develop new, data analytics approaches to address problems properly

  • IBM Research

    Moving beyond revenue modeling

    To really understand the profitability and potential of our customers, we need to move beyond modeling their short-term revenue contribution.

    • Revenue over time: Lifetime Value modeling
      • How much can we expect to gain from the customer over time?
      • Incorporates loyalty/churn and prediction of future customer revenue
      • $\mathrm{LTV} = \int_t S(t)\, v(t)\, D(t)\, dt$, where $S(t)$ is the customer survival function, $v(t)$ the customer value over time, and $D(t)$ the discounting factor
    • Potential revenue: Customer Wallet Estimation
      • How much revenue could we be generating from this customer?
      • Incorporates competition, brand switching etc.
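As a worked illustration of the LTV integral, the sketch below evaluates it numerically with the trapezoid rule. The exponential survival and discount curves, the constant value function, and every numeric parameter are assumptions chosen for the example; with these choices the integral has the closed form $v\,(1 - e^{-(c+r)T})/(c+r)$, which serves as a check.

```python
import math

def ltv(survival, value, discount, horizon, steps=10_000):
    """Trapezoid-rule approximation of the integral of S(t) v(t) D(t) over [0, horizon]."""
    h = horizon / steps

    def integrand(t):
        return survival(t) * value(t) * discount(t)

    total = 0.5 * (integrand(0.0) + integrand(horizon))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h

# Hypothetical customer: 20% yearly churn hazard, 10% discount rate,
# constant value of 100 per year, 50-year horizon.
CHURN, RATE, VALUE, HORIZON = 0.2, 0.1, 100.0, 50.0
LTV_NUMERIC = ltv(
    survival=lambda t: math.exp(-CHURN * t),
    value=lambda t: VALUE,
    discount=lambda t: math.exp(-RATE * t),
    horizon=HORIZON,
)
LTV_CLOSED_FORM = VALUE / (CHURN + RATE) * (1.0 - math.exp(-(CHURN + RATE) * HORIZON))
print(LTV_NUMERIC, LTV_CLOSED_FORM)
```

In practice $S(t)$ would come from a fitted survival or churn model and $v(t)$ from a revenue model; the numerical form accepts any callables.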

  • IBM Research

    LTV and Wallet: beyond standard modeling

    [Figure: axes Time (now, next year, future) and Revenue (actual sales, potential sales); labeled regions: sales / revenue modeling, sales forecasting, LTV modeling, wallet estimation.]

  • IBM Research

    Types of decision support

    • Passive decision support
      • Understand more about problems and causes
      • Identify areas of need, under-performance etc.
      • Help in making better decisions
    • Active decision support
      • Model the effect of actions
      • Actively help in deciding between alternative actions
    • Active decision support is typically more challenging in terms of the data needed to learn models

  • IBM Research

    Depth and actionability of insights

    [Figure: depth (from basic concepts to real insight, from correlation to causality) against actionability (from passive to active); plotted items: revenue modeling, revenue forecast, lever identification, LTV modeling, wallet estimation, and understanding the effect of potential actions on LTV and wallet attainment.]

  • IBM Research

    The causality challenge

    • Predictive models discover correlation
      • Example: linear regression. A significant t-statistic for a coefficient implies a statistically significant association with the response, not that the variable is actually causing the response
    • For active decision support we need to identify levers to pull to affect the outcome
      • Only works with causality
    • Causality is difficult to find or prove from observational data
      • If we have knowledge about causality, we can formalize it as (say) a Bayesian network and use it in our models
    • We can get closer to causality by case-control experiments

  • IBM Research

    Illustration: predictive power is not causality

    Assume we observe for some companies:
      X = company's marketing budget
      Y = company's sales
    and want to understand how to affect Y by controlling X.

    Assume we find that X is very predictive of Y. Possible scenarios:
    • X -> Y: causality, successfully identified lever
    • Y -> X: fixed percent of revenue allocated to marketing?
    • Z -> X and Z -> Y: Z = company size, independently determining both quantities?
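The third scenario can be made concrete with a small simulation. In the sketch below, company size Z drives both marketing budget X and sales Y, and X has no effect on Y at all. All distributions and coefficients are invented for illustration. The observed correlation between X and Y is strong, yet it vanishes once the budget is set independently of size, which is what "pulling the lever" would amount to.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
size = [rng.uniform(10, 100) for _ in range(5000)]   # Z: company size

# Observational data: Z determines both X and Y; X does not enter Y.
budget_observed = [0.1 * z + rng.gauss(0, 1) for z in size]
sales_observed = [z + rng.gauss(0, 5) for z in size]

# Intervention: budget is set at random, independently of Z.
budget_set = [rng.uniform(1, 10) for _ in size]
sales_after = [z + rng.gauss(0, 5) for z in size]

CORR_OBSERVED = pearson(budget_observed, sales_observed)
CORR_INTERVENED = pearson(budget_set, sales_after)
print(round(CORR_OBSERVED, 3), round(CORR_INTERVENED, 3))
```

A regression of Y on X fitted to the observational data would show a highly significant coefficient, even though changing X would leave Y untouched.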

  • IBM Research

    Some other challenges

    • Modeling effects of new/unobserved actions
      • Critical for active support, often difficult or impossible
      • Even established actions may have been applied in a different context than our planned campaign
    • Integrating expert knowledge into the process
      • Can be done formally via graphical models
    • Handling data issues: matching, leaks, cleaning
      • Always critical
    • Delivering solutions and results

  • IBM Research

    Example: Telecom Churn Management

    A cell phone company has a set of customers, some of whom leave (churn) every month.

    The goals of a Churn Management system:
    • Analyze the process of churn
      • Causes
      • Dynamics
      • Effects on company
    • Design policies and actions to improve the situation
      • Marketing campaigns
      • Incentive allocation (offer new features or presents)
      • Change in plans to contend with competition

  • IBM Research

    First step: understand current situation

    • Who is likely to churn (predictive patterns)?
      • Phone features / plans
      • Usage patterns
      • Demographics
      • Tools: segmentation, classification, etc.
    • Which of these patterns are causal?
      • Tools: expert knowledge, Bayesian networks, etc.
    • Which causal effects are not in the data?
      • Competition, economy etc.
    • Which of these customers are profitable?
      • Short term: customer value
      • Long term: lifetime value
      • Growth potential: customer wallet

  • IBM Research

    Second step: design actions

    • Can we affect causal churn patterns?
      • For example, by improving customer service
    • Given possible incentives and marketing actions, what effect will they have?
      • Loyalty and relationship
      • Current customer value and wallet attainment
      • Customer lifetime value
      • Cost to company
    • How can we optimize use of our marketing resources?
      • Identify segments we want to retain
      • Identify effective marketing actions

  • IBM Research

    Survey of useful methodologies

    • Utility-based classification*: cost-sensitive and active learning
      • Motivation: need to handle utility of decision and cost of data acquisition in marketing decision problems
      • Example domains: targeted marketing, brand switch modeling
    • Markov Decision Processes (MDP) and Reinforcement Learning
      • Motivation: need to consider long-term profit maximization
      • Example domain: customer lifetime value modeling
    • Bayesian Networks
      • Motivation: need to address the causality vs. correlation issue; need to formalize domain knowledge about relationships in data
      • Example domain: customer wallet estimation

    * cf. Utility-Based Data Mining Workshop at KDD'05 and KDD'06

  • IBM Research

    Cost-sensitive learning for marketing decision support

    • Use of basic machine learning (e.g. classification and regression) in marketing decision support is well accepted
      • Example applications include targeted marketing, credit rating, and others
    • But is it the best we have to offer?
    • Regression solves an inherently harder problem than is required
      • One does not necessarily need to predict business outcomes, customer behavior, etc.; one is merely required to make business decisions
      • Regression may fail to detect significant patterns, especially when data is noisy
    • Classification is an oversimplification
      • By mapping to classification, one loses information on the degree of goodness/badness of a business decision in the past data
    • Cost-sensitive classification provides the desired middle ground
      • It simplifies the problem almost to classification and thus allows discovery of significant patterns
      • Yet it retains and exploits the information on the degree of goodness of business decisions, in a way that is motivated by utility theory

  • IBM Research

    Cost-sensitive learning a.k.a. utility-based classification

    • In regression: given $(x, r) \in X \times \mathbb{R}$, generated from a sampling distribution, find $F$ such that $F(x) \approx r$
      • E.g. $r$ = profit obtained by targeting customer $x$
    • In classification: given $(x, y) \in X \times \{0,1\}$, generated from a sampling distribution, find $F$ such that $F(x) \approx y$
      • E.g. $y = 1$ if customer $x$ is good, 0 otherwise
    • In utility-based classification: given a (stochastic) utility function $U$ and $(x, y) \in X \times \{0,1\}$ generated from a sampling distribution, find $F$ such that $E[U(x, y, F(x))]$ is maximized (or equivalently $E[C(x, y, F(x))]$ is minimized)
      • E.g. $U(x, 1, 1) = \mathrm{Profit}(x)$ = profit obtained by targeting customer $x$, when $x$ is indeed a good customer

  • IBM Research

    Example cost and utility functions

    Simple formulation (cost/benefit matrix)

    Classification utility matrix
    True \ Predicted    0    1
    0                   1    0
    1                   0    1

    Misclassification cost matrix
    True \ Predicted    0    1
    0                   0    1
    1                   1    0

    More realistic formulation (utility/cost dependent on individuals)

    Credit rating utility
    True \ Predicted    bad    good
    bad                 0      - Default Amt
    good                0      Interest

    Targeted marketing utility
    True \ Predicted    bad    good
    bad                 0      - C
    good                0      Profit - C

  • IBM Research

    Bayesian approach with regression

    For each example $x$, choose the class that minimizes the expected cost:

    $i^*(x) = \arg\min_i \sum_j P(j \mid x)\, C(x, i, j)$

    Both $P(j \mid x)$ and $C(x, i, j)$ need to be estimated!

    • Problem: requires conditional density estimation and regression to solve a classification problem. The price is high computational and sample complexity
    • Merit: more flexibility and general applicability
      • Business constraints
      • Variability in fixed costs
    • But is it necessary?
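The decision rule above is easy to state in code once the probabilities and costs are available. The sketch below assumes both are already estimated and uses the targeted-marketing utilities as costs (cost = negative utility) with made-up numbers: a mailing cost of 1 and a profit of 20 from a responding customer. The function name and data layout are illustrative only.

```python
def min_expected_cost_label(class_probs, cost):
    """Return (best label, expected costs).

    class_probs[j] = P(j | x); cost[i][j] = C(x, i, j), the cost of
    predicting label i when the true label is j.
    """
    expected = [
        sum(p * cost[i][j] for j, p in enumerate(class_probs))
        for i in range(len(cost))
    ]
    best = min(range(len(expected)), key=expected.__getitem__)
    return best, expected

# Labels: 0 = bad customer / do not mail, 1 = good customer / mail.
MAIL_COST, PROFIT = 1.0, 20.0
COST = [
    [0.0, 0.0],                       # do not mail: nothing gained or lost
    [MAIL_COST, MAIL_COST - PROFIT],  # mail: lose C if bad, gain Profit - C if good
]

# A customer with only a 10% chance of being "good" is still worth mailing...
DECISION_10, EXPECTED_10 = min_expected_cost_label([0.9, 0.1], COST)
# ...while one with a 2% chance is not.
DECISION_02, EXPECTED_02 = min_expected_cost_label([0.98, 0.02], COST)
print(DECISION_10, EXPECTED_10, DECISION_02, EXPECTED_02)
```

A plain classifier would label both customers "bad", since that is the more probable class in each case; the cost-sensitive rule separates them.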

  • IBM Research

    A classification approach: cost-sensitive boosting algorithm [AZL 2004]

    Define the expanded sample $S'$ as: $S' = \{(x, y') \mid (x, y) \in S,\ y' \in Y\}$

    GBSE (Learner $A$, expanded data $S'$, count $T$)
    (1) For all $(x,y) \in S'$, initialize $H_0(x,y) = 1/|Y|$
    (2) For all $(x,y) \in S'$, initialize weight $w_{x,y} = E_{y' \sim H_0}[C(x,y')] - C(x,y)$
    (3) For $t = 1$ to $T$ do
        (a) For all $(x,y) \in S'$, update weight $w_{x,y} = E_{y' \sim H_{t-1}}[C(x,y')] - C(x,y)$
        (b) Let $T' = \{((x,y),\ I(w_{x,y} > 0)) \mid (x,y) \in S'\}$
        (c) Let $h_t = A(T', |w|)$
        (d) $f_t = \mathrm{Stochastic}(h_t)$
        (e) $H_t = (1-\alpha)\,H_{t-1} + \alpha f_t$
    (4) Output $h(x) = \arg\max_y \sum_{t=1}^{T} h_t(x,y)$

    The weight is updated in each iteration: it is the difference between the average cost under the current ensemble and the cost of $y$.

  • IBM Research

    Gradient Boosting with Stochastic Ensembles: illustration

    [Figure: cost $C(x,y)$ against predicted label $y$, with the average cost $E[C(x,y)]$ marked, at learning iterations $t$ and $t+1$; each label is marked + or - as its training label.]

    • The difference between the current average cost and the cost associated with a particular label is the boosting weight
    • The sign of the weight, $E[C(x,y)] - C(x,y)$, is the training label
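The weighting step illustrated above can be written out directly. The sketch below covers only that single step for one example $x$, not the full GBSE procedure: given the costs of each candidate label and the current ensemble's label distribution, it returns the binary training label (the sign of the weight) and the importance weight (its magnitude). The function name and the numbers are assumptions for illustration.

```python
def gbse_relabel(costs, ensemble_probs):
    """For one example x, turn label costs into (label, training label, weight) triples.

    costs[y] = C(x, y); ensemble_probs[y] = probability the current ensemble
    assigns to label y for this x.
    """
    average_cost = sum(p * c for p, c in zip(ensemble_probs, costs))
    relabeled = []
    for y, c in enumerate(costs):
        w = average_cost - c
        # Labels cheaper than the current average become positive examples.
        relabeled.append((y, int(w > 0), abs(w)))
    return average_cost, relabeled

# Three candidate labels with costs 0, 4 and 10, and a uniform initial ensemble.
AVERAGE, RELABELED = gbse_relabel([0.0, 4.0, 10.0], [1 / 3, 1 / 3, 1 / 3])
for label, target, weight in RELABELED:
    print(label, target, round(weight, 3))
```

As the ensemble concentrates on cheap labels, the average cost drops, so a label that was a positive example early on (cost 4 here) can become a negative one in later iterations.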

  • IBM Research

    Cost-sensitive boosting outperforms existing methods of cost-sensitive learning as well as classification and regression

    Average test set cost (± SE); Bagging, AvgCost and MetaCost are existing methods

    Data set    Bagging       AvgCost     MetaCost      GBSE
    Annealing   1059 ± 174    127 ± 12    207 ± 42      34 ± 4
    Solar       5403 ± 397    237 ± 38    5317 ± 390    48 ± 10
    KDD-99      319 ± 42      42 ± 8      49 ± 9        2 ± 1
    Letter      151 ± 3       92 ± 1      130 ± 2       85 ± 2
    Splice      64 ± 5        61 ± 4      50 ± 3        58 ± 4
    Satellite   190 ± 10      108 ± 6     104 ± 6       93 ± 6

  • IBM Research

    Active learning a.k.a. query learning

    [Diagram: Active/Query Learning loop. The learner selects from the domain* the points necessary for learning; their labels are added to the training sample, which is fed back to the learner.]

    * The domain size is generally exponential

    • The goal is to achieve data-efficient and computationally efficient learning by obtaining labeled data for points of the algorithm's choosing
    • Existing approaches can be classified into two main categories
      • Algorithmic approach (cf. [Angluin])
      • Information-theoretic approach (cf. [SOS 1992])

  • IBM Research

    The Query by Committee algorithm [STS 1991]

    [Diagram: Query by Committee. Agents (idealized randomized algorithms) are trained on the input sample and predict on randomly selected points; to maximize uncertainty, the point with maximum spread among the predictions is queried.]

    • This is a representative information-theoretic active learning method
    • Main idea is to query points at which the agent algorithms disagree the most (to maximize information gain)
    • Merit: data-efficient learning is theoretically guaranteed, subject to assumptions of representability of the target
    • Weakness: the theory requires an ideal (Gibbs) agent learner, and is generally not computationally feasible

  • IBM Research

    An efficient variant: Query by Bagging and Query by Boosting [AM 1998]

    [Diagram: Query by Bagging/Boosting. The input sample is re-sampled T times; learner A is run on each re-sample to produce agents $h_1, h_2, \ldots, h_T$; the method queries the point $x^*$ at which the component algorithms disagree most.]

    • Bagging: re-sampling with uniform distribution
    • Boosting: weighted sampling with boosting weights
    • These methods combine the computational approach of ensemble methods with the information-theoretic Query by Committee method
    • The method allows arbitrary deterministic agent algorithms
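A minimal sketch of Query by Bagging follows, assuming a one-dimensional input, a threshold ("decision stump") base learner, and a tiny hand-made data set; none of these come from the slides. Each committee member is trained on a bootstrap re-sample of the labeled data, and the unlabeled pool point on which the votes are most evenly split is chosen as the next query.

```python
import random

def train_stump(sample):
    """Return the threshold th with the fewest errors for the rule 'predict 1 if x > th'."""
    xs = sorted({x for x, _ in sample})
    candidates = (
        [float("-inf")]
        + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
        + [float("inf")]
    )

    def errors(th):
        return sum((x > th) != bool(y) for x, y in sample)

    return min(candidates, key=errors)

def query_by_bagging(labeled, pool, committee_size, rng):
    """Pick the pool point on which the bagged committee disagrees the most."""
    # Bagging: each member sees a bootstrap re-sample (uniform, with replacement).
    stumps = [
        train_stump(rng.choices(labeled, k=len(labeled)))
        for _ in range(committee_size)
    ]

    def disagreement(x):
        ones = sum(x > th for th in stumps)
        return min(ones, committee_size - ones)  # size of the minority vote

    return max(pool, key=disagreement)

# Hypothetical labeled data: class 0 on the left, class 1 on the right, a gap in between.
LABELED = [(i / 10, 0) for i in range(5)] + [(i / 10, 1) for i in range(6, 11)]
POOL = [0.05, 0.5, 0.95]
QUERY = query_by_bagging(LABELED, POOL, committee_size=51, rng=random.Random(0))
print(QUERY)
```

The points at 0.05 and 0.95 lie deep inside each class, so the committee agrees on them; the point in the gap is where re-sampling moves the learned threshold back and forth, which is why it is the informative one to label.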

  • IBM Research

    Active learning can accelerate learning

    [Figure: two learning-curve panels, "WDBC (UCI ML repository) and C4.5" and "Breast Cancer Wisconsin (UCI) and C4.5".]

    • It has been observed that active learning can drastically accelerate the rate of learning (e.g. 10- to 100-fold) over passive learning
    • Application to primary research (surveys) in marketing analytics is promising but has not been exploited extensively

  • IBM Research

    Sequential cost-sensitive decision making by reinforcement learning

    • Cost-sensitive classification provides an adequate framework for single marketing decision making
    • Real-world marketing decisions are rarely made in isolation; they are made sequentially
    • Need to address the sequential dependency in decision making
      • Cost-sensitive classification maximizes $E[U(x, h(x))]$
      • We now wish to maximize $\sum_t E[U(x_t, h(x_t))]$, where $x_t$ may depend on earlier decisions
    • This is nothing but reinforcement learning, if we view $x$ as the state
      • Maximize $\sum_t E[U(s_t, \pi(s_t))]$, where $s_t$ is determined stochastically according to a transition probability determined by $s_{t-1}$ and $\pi(s_{t-1})$

  • IBM Research

    Review: Markov Decision Process (MDP)

    • At any given time $t$, the agent is in some state $s$
    • It takes an action $a$, and makes a transition to the next state $s'$, dictated by the transition probability $T(s,a)$
    • It then receives a reward, or utility, $U(s,a)$, which also depends on state $s$ and action $a$
    • The goal of a reinforcement learner in an MDP is to learn a policy, namely $\pi: S \to A$, mapping states to actions, so as to maximize the cumulative discounted reward:

    $R = \sum_{t=0}^{\infty} \gamma^t\, U(s_t, a_t)$

  • IBM Research

    MDP and Reinforcement Learning provide an advanced framework for modeling customer lifetime value

    Modeling the CRM process using a Markov Decision Process (MDP)
    • The customer is in some "state" (his/her attributes) at any point in time
    • The retailer's action will move the customer into another state
    • The retailer's goal is to take a sequence of actions that guides the customer's path so as to maximize the customer's lifetime value

    Reinforcement Learning produces optimized targeting rules of the form
    • If the customer is in state "s", then take marketing action "a"
    • Customer state "s" is represented by the current customer attribute vector
    • Reinforcement learning estimates LTV(s,a); the best policy is to choose a to maximize LTV(s,a)

    [Diagram: typical CRM process. Customer states (One Timer, Bargain Hunter, Repeater, Potentially Valuable, Loyal Customer, Valuable Customer, Defector) connected by transitions triggered by Campaigns A to E.]

  • IBM Research

    MDP enables genuine lifetime value modeling, in contrast to existing approaches that use observed lifetime value

    • Observed lifetime value reflects only the customers' lifetime value attained by the current marketing policy, and therefore fails to capture their potential lifetime value
    • MDP-based lifetime value modeling allows modeling of lifetime value based on the optimized marketing policy (= the output of the system!)

    [Diagram: the same CRM state graph (One Timer, Bargain Hunter, Repeater, Potentially Valuable, Loyal Customer, Valuable Customer, Defector; Campaigns A to E), showing customer A's path under the current marketing policy and under the optimized marketing policy.]

    • Estimated (potential) lifetime value will be based on the optimal path
    • The output policy will lead the customer through the same path

  • IBM Research

    And here is how this is possible

    • The MDP enables the use of data for many customers in various stages (states) to determine the potential lifetime value of a particular customer in a particular state
    • Reinforcement Learning can estimate the lifetime value (function) without explicitly estimating the MDP itself
    • The key lies in the value iteration procedure based on Bellman's equation:

    $Q(s,a) = E[U(s,a)] + \gamma \max_{a'} Q(s',a')$

    LTV of a state = reward now + LTV of best next state

    [Diagram: chain of customer states (Repeater, Potentially Valuable, Loyal Customer, Valuable Customer) linked by Rules a to d. Each rule is, in effect, trained with data corresponding to all subsequent states.]

  • IBM Research

    Reinforcement learning methods with function approximation

    • Value iteration (based on the Bellman equation)
      $Q_0(s,a) = E[U(s,a)]$
      $Q_{k+1}(s,a) = E[U(s,a)] + \gamma \max_{a'} Q_k(s',a')$
      $\pi^*(s) = \arg\max_a Q_\infty(s,a)$
      • Provides the basis for classic reinforcement learning methods like Q-learning
    • Batch Q-learning (with function approximation)
      $Q_0(s,a) = U(s,a)$
      $Q_{k+1}(s,a) = (1-\alpha)\,Q_k(s,a) + \alpha\,\bigl(U(s,a) + \gamma \max_{a'} Q_k(s',a')\bigr)$
      • Solves value iteration as iterative regression problems: each $Q_{k+1}$ is estimated using function approximation (regression)
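The value iteration recursion can be run by hand on a toy problem. The sketch below uses a three-state, two-action customer MDP with deterministic transitions; the states, rewards and discount factor are invented for illustration, and Q is kept in a table rather than fitted by regression as batch Q-learning with function approximation would do.

```python
GAMMA = 0.9

# state -> action -> (immediate reward, next state); all values hypothetical.
MDP = {
    "one_timer": {"none": (0.0, "one_timer"), "mail": (-1.0, "repeater")},
    "repeater":  {"none": (2.0, "one_timer"), "mail": (1.0, "loyal")},
    "loyal":     {"none": (5.0, "loyal"),     "mail": (4.0, "loyal")},
}

def value_iteration(mdp, gamma, sweeps=500):
    """Iterate Q_{k+1}(s,a) = U(s,a) + gamma * max_a' Q_k(s',a'), starting from Q_0 = U."""
    q = {(s, a): r for s, acts in mdp.items() for a, (r, _) in acts.items()}
    for _ in range(sweeps):
        q = {
            (s, a): r + gamma * max(q[(nxt, b)] for b in mdp[nxt])
            for s, acts in mdp.items()
            for a, (r, nxt) in acts.items()
        }
    return q

def greedy_policy(mdp, q):
    """pi(s) = argmax_a Q(s,a)."""
    return {s: max(acts, key=lambda a: q[(s, a)]) for s, acts in mdp.items()}

def myopic_policy(mdp):
    """Baseline that maximizes the immediate reward only."""
    return {s: max(acts, key=lambda a: acts[a][0]) for s, acts in mdp.items()}

Q = value_iteration(MDP, GAMMA)
POLICY = greedy_policy(MDP, Q)
MYOPIC = myopic_policy(MDP)
print(POLICY, MYOPIC)
```

The myopic baseline never mails, because mailing always lowers the immediate reward. The Q-based policy accepts that short-term loss for one-timers and repeaters in order to move them toward the loyal state.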

  • IBM Research

    Lifetime value modeling based on reinforcement learning can achieve greater long-term profits than the traditional approach

    The graph plots profits per campaign obtained in monthly campaigns over 2 years (in an empirical evaluation using benchmark data, i.e. KDD Cup 98 data).

    [Figure: profits per campaign (0 to 80000) against campaign number, for the Single and CCOM policies.]

    The output policy of the MDP approach (CCOM) invests in initial campaigns to yield greater long-term profits.

  • IBM Research

    Bayesian Network a.k.a. Graphical Model

    A Bayesian Network is a directed acyclic graphical model and defines a probability model. Here is a simple example:

    [Diagram: Economy -> Marketing, Economy -> Competition, Marketing -> Revenue, Competition -> Revenue]

    $P(M,E,C,R) = P(E)\,P(M|E)\,P(C|E)\,P(R|M,C)$

    P(E)    P(¬E)
    0.3     0.7

    E    P(M)    P(¬M)
    F    0.3     0.7
    T    0.9     0.1

    E    P(C)    P(¬C)
    F    0.4     0.6
    T    0.7     0.3

    M  C    P(R)    P(¬R)
    F  F    0.3     0.7
    T  F    0.9     0.1
    F  T    0.2     0.8
    T  T    0.6     0.4
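The factorization $P(M,E,C,R) = P(E)\,P(M|E)\,P(C|E)\,P(R|M,C)$ can be evaluated by brute-force enumeration. The sketch below does this with plain dictionaries. The probability values are one reading of the slide's tables (the extraction lost which column is the negated one), so they should be treated as assumptions, as should the variable names.

```python
from itertools import product

# Assumed reading of the conditional probability tables.
P_E = 0.3                                   # P(E = T)
P_M_GIVEN_E = {True: 0.9, False: 0.3}       # P(M = T | E)
P_C_GIVEN_E = {True: 0.7, False: 0.4}       # P(C = T | E)
P_R_GIVEN_MC = {                            # P(R = T | M, C)
    (True, True): 0.6, (False, True): 0.2,
    (True, False): 0.9, (False, False): 0.3,
}

def bernoulli(p_true, value):
    """Probability of a boolean value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(m, e, c, r):
    """P(M=m, E=e, C=c, R=r) = P(E) P(M|E) P(C|E) P(R|M,C)."""
    return (
        bernoulli(P_E, e)
        * bernoulli(P_M_GIVEN_E[e], m)
        * bernoulli(P_C_GIVEN_E[e], c)
        * bernoulli(P_R_GIVEN_MC[(m, c)], r)
    )

ASSIGNMENTS = list(product([True, False], repeat=4))
TOTAL = sum(joint(m, e, c, r) for m, e, c, r in ASSIGNMENTS)
P_REVENUE = sum(joint(m, e, c, r) for m, e, c, r in ASSIGNMENTS if r)
# Conditioning on observed marketing: P(R = T | M = T).
P_REVENUE_GIVEN_M = (
    sum(joint(m, e, c, r) for m, e, c, r in ASSIGNMENTS if r and m)
    / sum(joint(m, e, c, r) for m, e, c, r in ASSIGNMENTS if m)
)
print(TOTAL, P_REVENUE, P_REVENUE_GIVEN_M)
```

Enumeration is exponential in the number of variables, so it only serves for small examples like this one; real networks need dedicated inference algorithms.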

  • IBM Research

    Bayesian Network as a general unifying framework

    • Bayesian Networks provide a general framework that subsumes numerous known classes of probabilistic models, e.g.
      • Naive Bayes classification
      • Clustering (mixture models)
      • Auto-regressive models
      • Hidden Markov models, etc.
    • Bayesian Networks provide a framework for discussing modeling, inference, causality, hidden variables, etc.

    [Diagrams: Naive Bayes classification (Class -> Variable 1 ... Variable N); clustering/mixture (unobserved Class -> Variable 1 ... Variable N); Hidden Markov Model (unobserved State -> State chain, each State -> Symbol).]

  • IBM Research

    Bayesian Network and causality

    Causality is not necessarily implied by the edge direction.

    An example Bayesian Network:
    [Diagram: Economy -> Marketing, Economy -> Competition, Marketing -> Revenue, Competition -> Revenue]
    $P(M,E,C,R) = P(E)\,P(M|E)\,P(C|E)\,P(R|M,C)$

    For Economy, Marketing and Competition, this is actually ambiguous between:
    • $P(M,E,C) = P(E)\,P(M|E)\,P(C|E)$ (Marketing <- Economy -> Competition)
    • $P(M,E,C) = P(M)\,P(E|M)\,P(C|E)$ (Marketing -> Economy -> Competition)
    • $P(M,E,C) = P(C)\,P(E|C)\,P(M|E)$ (Competition -> Economy -> Marketing)

  • IBM Research

    Causal Network and Causal Pattern

    • Causal Network
      • Is a directed graph in which the direction of an edge means causality
    • Causal Pattern
      • Is an equivalence class of causal networks

    [Diagrams: a causal network and a causal pattern over Economy, Marketing, Competition and Revenue.]

    This pattern shows that the causal relationship between E, M, and C is ambiguous.

  • IBM Research


    Edge Orientation in Bayesian/Causal Networks

    [P. Spirtes, C. Glymour, and R. Scheines (2000)]

  • IBM Research

    Inferring structure of Bayesian/causal network from data

    • Economy, Marketing and Revenue, with $M \perp R \mid E$:
      $P(M,E,R) = P(E)\,P(M|E)\,P(R|E)$
      $P(M,E,R) = P(M)\,P(E|M)\,P(R|E)$
      $P(M,E,R) = P(R)\,P(E|R)\,P(M|E)$
      All three structures imply the same conditional independence, so the causal structure cannot be determined from data!
    • Marketing, Competition and Revenue, with $M \perp C$:
      $P(M,C,R) = P(M)\,P(C)\,P(R|M,C)$
      The causal structure can be determined from data! It can be inferred that Marketing can be a lever for controlling Revenue!

  • IBM Research

    Estimation and inference with Bayesian Networks

    • Inferring causal structure from data
      • Sometimes possible, but in general not
    • Bayesian network structure learning from data
      • It is known to be intractable for general classes
      • It is even NP-complete to estimate polytrees robustly
    • Parameter estimation from data, given structure
      • It is efficiently solvable for many model classes
    • Inference given model
      • Exact inference is known to be NP-complete for sub-classes including undirected cycles
      • It is efficiently solvable for tree structures and many models used in practice
    • Latent variable estimation, given structure
      • Local optimum estimation is often possible via EM algorithms

    Given these facts, determining the network structure using domain knowledge, and then using it to do parameter estimation and inference, is common practice.

  • IBM Research

    Lifetime Value Modeling and Cross-Channel Optimized Marketing (CCOM)

    [Diagram: marketing channels (Direct Mail, Kiosk, Web, Store, Call Center), each generating revenue ($).]

    • Optimizes targeted marketing across multiple channels for lifetime value maximization
    • Combines scalable data mining and reinforcement learning methods to realize a unique capability

  • IBM Research

    CCOM pilot project with Saks Fifth Avenue

    • Business problem addressed: optimizing direct mailing to maximize lifetime revenue at the store (and other channels)
    • Provided a solution for the cross-channel challenge: no explicit linking between marketing actions in one channel and revenue in another
    • CCOM mailing policy shown to achieve a 7-8% increase in expected revenue in the store (in laboratory experiments)!

    [Diagram (reminder): CCOM-pilot business problem, with Direct Mail actions and revenue ($) at the Store.]

  • IBM Research

    Some example features

    Feature                      Description                                                  action   reward

    Demographic features
    FULL_LINE_STORE_OF_RES.      If a full-line store exists in the area                      0.018    0.004
    NON_FL_STORE_OF_RES.         If a non-full-line store exists in the area                  0.012    -0.004

    Transaction features (concerning divisions relevant to current campaign)
    CUR_DIV_PURCHASE_AMT_1M      Purchase amount in last month in current division            0.065    0.090
    CUR_DIV_PURCHASE_AMT_2_3M    Purchase amount in 2-3 months in current division            0.099    0.080
    CUR_DIV_PURCHASE_AMT_4_6M    Purchase amount in 4-6 months in current division            0.133    0.091
    CUR_DIV_PURCHASE_AMT_1Y      Purchase amount in last year in current division             0.162    0.128
    CUR_DIV_PURCHASE_AMT_TOT     Total purchase amount in current division                    0.153    0.147

    Promotion history features (on divisions relevant to current campaign)
    CUR_DIV_N_CATS_1M            Number of catalogs sent last month in current division       0.294    0.028
    CUR_DIV_N_CATS_2_3M          Number of catalogs sent 2-3 months ago in current division   0.260    0.025
    CUR_DIV_N_CATS_4_6M          Number of catalogs sent 4-6 months ago in current division   0.158    0.062
    CUR_DIV_N_CATS_TOT           Total number of catalogs sent in current division to date    0.254    0.062

    Control variable
    ACTION                       To mail or not to mail                                       1.000    0.008

    Target (response) variable
    REWARD                       Expected cumulative profits                                  0.008    1.000

  • IBM Research

    2006 IBM Corporation58

    The Cross-Channel Challenge and Solution

    The Challenge: No explicit linking between actions in one channel (mailing) and rewards in another (revenue)

    Very low correlation observed between actions and responses

    Other factors determining lifetime value may dominate over the control variable (marketing action) in estimation of expected value

    Obtained models can be independent of the action and give rise to useless rules!

    The Cross-Channel Solution: Learn the relative advantage of competing actions!

    [Figure: Standard Method versus Proposed Method; the panels plot value in states s1 and s2 against actions a1 and a2, with one panel labelled "Approximation"]

  • IBM Research

    2006 IBM Corporation59

    The Learning Method

    Definition of Advantage: $A(s,a) := \frac{1}{\Delta t}\left(Q(s,a) - \max_{a'} Q(s,a')\right)$

    Advantage Updating Procedure [Baird 94]

    Modifications:
      1. Initialization with empirical lifetime value
      2. Batch learning with optional function approximation

    Repeat

    1. Learn

      1.1. $A(s,a) := (1-\alpha)A(s,a) + \alpha\left(A_{\max}(s) + \frac{R(s,a) + \gamma^{\Delta t} V(s') - V(s)}{\Delta t}\right)$

      1.2. Use regression to estimate $A(s,a)$

      1.3. $V(s) := (1-\beta)V(s) + \beta\left(V(s) + \frac{A_{\max\text{-new}}(s) - A_{\max\text{-old}}(s)}{\alpha}\right)$

    2. Normalize

      $A(s,a) := (1-\omega)A(s,a) + \omega\left(A(s,a) - A_{\max}(s)\right)$
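
    The Learn and Normalize steps above can be sketched in tabular form. This is only an illustration under stated assumptions: the step sizes `alpha`, `beta`, `omega`, the discount `gamma` and the time step `dt` are names assumed here (the symbols are lost in the slide text), the regression step 1.2 is skipped because the table stores A(s,a) directly, and the single transition is made-up data rather than anything from the CCOM pilot.

```python
from collections import defaultdict


def advantage_update(A, V, transitions, actions,
                     alpha=0.5, beta=0.5, omega=0.5, gamma=0.9, dt=1.0):
    """One sweep of tabular advantage updating over (s, a, r, s_next) tuples."""
    def a_max(s):
        return max(A[(s, b)] for b in actions)

    for s, a, r, s_next in transitions:
        old_max = a_max(s)
        # Step 1.1: move A(s,a) toward A_max(s) + (R + gamma^dt * V(s') - V(s)) / dt
        target = old_max + (r + gamma ** dt * V[s_next] - V[s]) / dt
        A[(s, a)] = (1 - alpha) * A[(s, a)] + alpha * target
        # Step 1.3: shift V(s) by the change in A_max(s), scaled by 1/alpha
        new_max = a_max(s)
        V[s] = (1 - beta) * V[s] + beta * (V[s] + (new_max - old_max) / alpha)

    # Step 2: normalize, pulling the best action's advantage toward zero
    for s in {t[0] for t in transitions}:
        m = a_max(s)
        for b in actions:
            A[(s, b)] = (1 - omega) * A[(s, b)] + omega * (A[(s, b)] - m)
    return A, V


# Hypothetical single observation: mailing in state "s1" earned reward 1.0
A = defaultdict(float)
V = defaultdict(float)
A, V = advantage_update(A, V, [("s1", "mail", 1.0, "s1")], ["mail", "skip"])
```

    After this one sweep the mailed action carries a higher advantage than the unmailed one in state "s1", which is the relative quantity the proposed method learns.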

  • IBM Research

    2006 IBM Corporation60

    Evaluation Results

    Significant policy advantage observed with small number of iterations

    Obtained policy with 7-8% policy advantage, i.e. a 7-8% increase in expected revenue (for the 1.6 million customers considered)

    Mailing policy was constrained to mail same number of catalogues in each campaign as last year

    CCOM to evaluate sequence of models and output best model

    [Figure: two plots of Policy Advantage (percentage) against Learning iterations (1 to 5): Typical run (version 1) and Typical run (version 2)]

  • IBM Research

    2006 IBM Corporation61

    Evaluation Method

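    Challenge in Evaluation: Need to evaluate a new policy using data collected by the existing (sampling) policy

    Solution: Use bias-corrected estimation of policy advantage using data collected by the sampling policy

    Definition of policy advantage, where $\pi$ is the sampling policy and $\pi'$ is the new policy:

    (Discrete Time) Advantage: $A_{\pi}(s,a) := Q_{\pi}(s,a) - \max_{a'} Q_{\pi}(s,a')$

    Policy Advantage: $A_{s\sim\pi}(\pi') := E_{\pi}\left[E_{a\sim\pi'}\left[A_{\pi}(s,a)\right]\right]$

    Estimating policy advantage with bias-corrected sampling: $A_{s\sim\pi}(\pi') := E_{\pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}\, A_{\pi}(s,a)\right]$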
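
    The bias-corrected estimate above amounts to an importance-weighted average. The sketch below assumes the advantage of each logged (state, action) pair is already available and that both policies can report the probability of an action in a state. The function name, the policy tables and the numbers are hypothetical.

```python
def estimate_policy_advantage(samples, new_policy, sampling_policy):
    """Importance-weighted mean advantage.

    samples: (state, action, advantage) tuples logged under sampling_policy.
    new_policy, sampling_policy: functions (action, state) -> probability.
    """
    total = 0.0
    for s, a, adv in samples:
        # Reweight each logged action by how much more (or less) often
        # the new policy would have chosen it
        weight = new_policy(a, s) / sampling_policy(a, s)
        total += weight * adv
    return total / len(samples)


# Hypothetical example: data logged under a 50/50 policy,
# evaluated for a new policy that mails 80% of the time
sampling_probs = {"mail": 0.5, "skip": 0.5}
new_probs = {"mail": 0.8, "skip": 0.2}
logged = [("s", "mail", 0.0), ("s", "skip", -2.0),
          ("s", "mail", 0.0), ("s", "skip", -2.0)]

estimate = estimate_policy_advantage(
    logged,
    lambda a, s: new_probs[a],
    lambda a, s: sampling_probs[a],
)
```

    Here the estimate matches the direct expectation under the new policy, 0.8 * 0.0 + 0.2 * (-2.0) = -0.4, even though the data were collected under the sampling policy.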

  • IBM Research

    2006 IBM Corporation62

    Combination of reinforcement learning (MDP) with predictive data mining enables automatic generation of trigger-based marketing targeting rules that are:

    Optimized with respect to the customer's potential lifetime value

    Stated in simple "if-then" style, which supports flexibility and compatibility

    Refined to make reference to detailed customer attributes and hence well-suited to event- and trigger-based marketing

    This is made possible by:

    Representing the states in the MDP by customers' attribute vectors

    Combining reinforcement learning with predictive data mining to estimate lifetime value as a function of customer attributes and marketing actions

    [Figure: an example marketing targeting rule output by the CCOM system]

  • IBM Research

    2006 IBM Corporation63

    Some examples of rules output by CCOM

    Avoid saturation effects
      Interpretation: If a customer has spent in the current division but enough catalogues have been sent, then don't mail

    Differentiate between customers who may be near saturation and those who are not
      Interpretation: If a customer has spent in the current division and has received moderately many relevant catalogues, then mail

    Invest in a customer until it knows it is not worth it
      Interpretation: If a customer has spent significantly in the past and yet has not spent much in the current division (product group), then don't mail

  • IBM Research

    2006 IBM Corporation64

    CCOM - Logical Data Model

    CCOM is generically applicable by mapping physical data to this model. Entities and their attributes (the diagram legend marks some entities as optional):

    Marketing Event: Event Identifier, Channel Identifier, Event Date, Event Category Description, Fixed Cost

    Customer: Customer Identifier, First Name, Last Name, Age, Gender

    Transaction: Customer Identifier, Transaction Date, Product Category Identifier, Event Identifier, Channel Identifier, Transaction Revenue, Transaction Profit

    Customer Marketing Action: Event Identifier, Customer Identifier, Marketing Action Date, Marketing Action

    Period: Period Identifier, Period Duration

    Customer Profile History: Customer Identifier, Profile History Date, Period Identifier, Product Category Identifier, Channel Identifier, Aggregated Count of Event, Aggregated Revenue, Aggregated Profit

    Channel: Channel Identifier, Channel Description

    Product Category: Product Category Identifier, Product Category Description

    Customer Loyalty Level History: Customer Identifier, Loyalty Level Start Date, Loyalty Level End Date, Loyalty Level

    Event Product Category: Event Identifier, Product Category Identifier, Weight

    CCOM Output Models
      Marketing Policy Model: Model Identifier, Model Type, Model
      Lifetime Value Model: Model Identifier, Model Type, Model

    *Developed with CBO

  • IBM Research

    2006 IBM Corporation

    Customer Wallet and Opportunity Estimation: Analytical Approaches and Applications

    Saharon Rosset, Claudia Perlich, Rick Lawrence
    IBM T. J. Watson Research Center

  • IBM Research

    2006 IBM Corporation66

    Outline

    Wallet estimation: problems and solutions
      The different wallet definitions
      How can we evaluate wallet models?
      Modeling approaches
      Empirical evaluation

    MAP (Market Alignment Program)
      Description of application and goals
      The interview process and the feedback loop
      Evaluation of wallet models' performance in MAP

  • IBM Research

    2006 IBM Corporation67

    What is Wallet (AKA Opportunity)?

    Total amount of money a company can spend on a certain category of products.

    IBM sales ≤ IT wallet ≤ Company revenue

    [Figure: nested regions showing IBM Sales inside IT Wallet inside Company Revenue]

  • IBM Research

    2006 IBM Corporation68

    Why Are We Interested in Wallet?

    Better evaluation of growth potential by combining wallet estimates and past sales history
      Enables focus on high wallet, low share-of-wallet customers

    Intelligent marketing using wallet estimates for sub-categories, e.g., software, hardware

    Evaluating success of sales personnel and sales channel by the share-of-wallet they attain

    Making resource assignment decisions

    [Slide labels: OnTarget, MAP]
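
    As a small illustration of the "high wallet, low share-of-wallet" focus, share-of-wallet can be taken as IBM sales divided by the estimated wallet. The function name, the thresholds and the customer figures below are hypothetical, and wallets are assumed to be positive.

```python
def high_wallet_low_share(customers, min_wallet, max_share):
    """Names of customers with a large wallet but a small IBM share of it.

    customers: dict name -> (ibm_sales, wallet), with wallet > 0.
    """
    picked = []
    for name, (ibm_sales, wallet) in customers.items():
        share = ibm_sales / wallet  # share-of-wallet
        if wallet >= min_wallet and share <= max_share:
            picked.append(name)
    return picked


# Made-up figures: (IBM sales, estimated wallet)
customers = {"A": (90, 100), "B": (100, 1000), "C": (10, 50)}
targets = high_wallet_low_share(customers, min_wallet=500, max_share=0.2)
```

    Only customer B qualifies here: A is nearly saturated and C has a small wallet.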

  • IBM Research

    2006 IBM Corporation69

    Wallet Modeling Problem

    Given:
      customer firmographics x (from D&B): industry, employee number, company type, etc.
      customer revenue r
      IBM relationship variables z: historical sales by product
      IBM sales s

    Goal: model customer wallet w, then use it to predict present/future wallets

    No direct training data on w or information about its distribution!

  • IBM Research

    2006 IBM Corporation70

    Historical Approaches

    Top down: this is the approach used by IBM Market Intelligence in North America (called ITEM)
      Use econometric models to assign total opportunity to a segment (e.g., industry × geography)
      Assign to companies in the segment proportionally to their size (e.g., D&B employee counts)

    Bottom up: learn a model for individual companies
      Get true wallet values through surveys or appropriate data repositories (these exist, e.g., for credit cards)

    Many issues with both approaches (won't go into detail)

    We would like a predictive approach from raw data
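
    The top-down allocation step is simple proportional arithmetic. The sketch below assumes employee counts are the size measure; the segment total and the company names are made up.

```python
def allocate_top_down(segment_opportunity, employee_counts):
    """Split a segment-level opportunity across companies in proportion to size."""
    total = sum(employee_counts.values())
    return {name: segment_opportunity * n / total
            for name, n in employee_counts.items()}


# Hypothetical segment with a total opportunity of 1,000,000
wallets = allocate_top_down(1_000_000, {"A": 100, "B": 300, "C": 600})
```

    Every company in a segment receives the same wallet per employee under this scheme, which is one reason a company-level predictive approach is attractive.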

  • IBM Research

    2006 IBM Corporation71

    Agenda

    Introduction and analytical issues

    Different wallet definitions

    How can we evaluate wallet models? The quantile regression loss function

    Modeling approaches and results:
      Nearest neighbor approach
      Quantile regression approach
      Model decomposition approach

  • IBM Research

    2006 IBM Corporation72

    Multiple Wallet Definitions

    TOTAL: Total customer available budget in the relevant area (e.g., total IT)
      Can we really hope to attain all of it?

    SERVED: Total customer spending on IT products covered by IBM
      Better definition for our marketing purposes

    REALISTIC: IBM spending of the best similar customers
      This can be concretely defined as a high percentile of: P(IBM revenue | customer attributes)

    REALISTIC ≤ SERVED ≤ TOTAL

    [Figure: nested regions showing the Realistic wallet inside the Served Wallet inside the Total Wallet]

  • IBM Research

    2006 IBM Corporation73

    REALISTIC Wallet: Percentile of Conditional

    Distribution of IBM sales to the customer given customer attributes: $s \mid r,x,z \sim f_{\theta,r,x,z}$

    E.g., the standard linear regression assumption: $s = \alpha x + \beta r + \gamma z + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$

    What we are looking for is the (say) 90th percentile of this distribution

    [Figure: conditional density of s, marking E(s|r,x,z) and the REALISTIC wallet]
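
    Under the normal-error assumption named on this slide, the 90th percentile sits a fixed number of standard deviations above the conditional mean. A minimal check with the Python standard library follows; the expected sales of 100 and the sigma of 20 are made-up numbers.

```python
from statistics import NormalDist


def realistic_wallet_normal(expected_sales, sigma, p=0.9):
    """p-th percentile of N(expected_sales, sigma^2)."""
    return NormalDist(mu=expected_sales, sigma=sigma).inv_cdf(p)


# Standard normal 90th percentile: roughly 1.28 standard deviations above the mean
z90 = NormalDist().inv_cdf(0.9)

# Hypothetical customer: E(s|r,x,z) = 100, sigma = 20
wallet = realistic_wallet_normal(100.0, 20.0)
```

    So under this assumption the REALISTIC wallet would be E(s|r,x,z) plus about 1.28 sigma. The quantile methods on the following slides avoid relying on normality.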

  • IBM Research

    2006 IBM Corporation74

    Agenda

    Introduction and analytical issues

    Different wallet definitions

    How can we evaluate wallet models? The quantile regression loss function

    Modeling approaches and results:
      Nearest neighbor approach
      Quantile regression approach
      Model decomposition approach

  • IBM Research

    2006 IBM Corporation75

    Traditional Approaches to Model Evaluation

    Evaluate models based on surveys
      Cost and reliability issues

    Evaluate models based on high-level performance indicators:
      Do the wallet numbers sum up to numbers that make sense at segment level (e.g., compared to macro-economic models)?
      Does the distribution of differences between predicted Wallet and actual IBM Sales and/or Company Revenue make sense? In particular, is the percentage of predictions that come out bigger/smaller the percentage we expect?

    Problem: no observation-level evaluation

  • IBM Research

    2006 IBM Corporation76

    The Quantile Loss Function

    Our REALISTIC wallet definition calls for estimating the p-th quantile of P(s|data).

    Can we devise a loss function which is optimized in expectation when we succeed? Answer: yes, the quantile loss function for quantile p:

    $L_p(y, \hat{y}) = \begin{cases} p\,(y - \hat{y}) & \text{if } y \ge \hat{y} \\ (1-p)\,(\hat{y} - y) & \text{if } \hat{y} > y \end{cases}$

    This loss function is optimized in expectation when we correctly predict REALISTIC:

    $\arg\min_{\hat{y}} E\left(L_p(y, \hat{y}) \mid x\right) = p\text{-th quantile of } P(y \mid x)$
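
    The claim that the quantile loss is minimized at the p-th quantile can be checked numerically on a toy sample. The sketch below searches only over constant predictions taken from the sample itself, and the sample (the integers 1 to 99) is made up.

```python
def quantile_loss(y, y_hat, p):
    """L_p(y, y_hat): p*(y - y_hat) if y >= y_hat, else (1 - p)*(y_hat - y)."""
    if y >= y_hat:
        return p * (y - y_hat)
    return (1 - p) * (y_hat - y)


def mean_quantile_loss(ys, y_hat, p):
    return sum(quantile_loss(y, y_hat, p) for y in ys) / len(ys)


# Toy sample 1, 2, ..., 99; look for the constant prediction with the smallest loss
ys = list(range(1, 100))
p = 0.9
best = min(ys, key=lambda c: mean_quantile_loss(ys, c, p))
```

    For this sample the minimizer is 90, the smallest value with at least 90% of the sample at or below it. With p = 0.5 the same search returns the median, matching the "absolute loss" special case.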

  • IBM Research

    2006 IBM Corporation77

    Some Quantile Loss Functions

    [Figure: quantile loss plotted against Residual (observed-predicted) from -3 to 3, for p=0.8 and p=0.5 (absolute loss)]

  • IBM Research

    2006 IBM Corporation78

    Which Wallet Definitions to Model?

    We are generally interested in modeling REALISTIC and SERVED wallets
      TOTAL wallets are not of real marketing interest

    For REALISTIC (or opportunity) we have multiple modeling approaches:
      Quantile k-nearest neighbors
      Quantile regression approaches: linear quantile regression; tree-based regression; kernel quantile regression, quanting, ...

    For SERVED we have developed a graphical modeling approach (will not discuss here)

  • IBM Research

    2006 IBM Corporation79

    Modeling REALISTIC Wallets

    REALISTIC defines wallet as the 90th percentile of the conditional of spending given customer attributes
      Implies some 10% of the customers are spending their full wallet with IBM

    Two obvious ways to get at the 90th percentile:

    Estimate the conditional by integrating over a neighborhood of specific customers
      Take the 90th percentile of spending in the neighborhood

    Create a global model for the 90th percentile
      Build regression models using the quantile loss function

  • IBM Research

    2006 IBM Corporation80

    K-Nearest Neighbors

    Distance metric: industry match; Euclidean distance on firmographics and past IBM sales; normalization

    Neighborhood sizes (k): Neighborhood size has a significant effect on prediction quality

    Prediction: Quantile of firms in the neighborhood

    [Figure: universe of IBM customers with D&B information plotted by Industry, Employees and Revenue, with the neighborhood of target company i highlighted; a histogram of IBM Sales frequency in the neighborhood marks the Wallet Estimate]
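
    The neighborhood approach can be sketched as follows, assuming features are already normalized, industry must match exactly, and a simple order-statistic quantile is used. The field names and the toy companies are hypothetical.

```python
import math


def knn_quantile_wallet(target, companies, k=3, p=0.9):
    """Quantile of IBM sales among the k nearest same-industry companies.

    target / companies: dicts with 'industry' and 'features' (a normalized
    numeric tuple); companies also carry 'ibm_sales'.
    """
    same_industry = [c for c in companies if c["industry"] == target["industry"]]
    nearest = sorted(
        same_industry,
        key=lambda c: math.dist(c["features"], target["features"]),
    )[:k]
    sales = sorted(c["ibm_sales"] for c in nearest)
    # Empirical quantile: smallest value with at least a fraction p at or below it
    idx = max(0, math.ceil(p * len(sales)) - 1)
    return sales[idx]


companies = [
    {"industry": "banking", "features": (1.0,), "ibm_sales": 10},
    {"industry": "banking", "features": (1.1,), "ibm_sales": 50},
    {"industry": "banking", "features": (0.9,), "ibm_sales": 30},
    {"industry": "banking", "features": (5.0,), "ibm_sales": 1000},
    {"industry": "retail", "features": (1.0,), "ibm_sales": 999},
]
target = {"industry": "banking", "features": (1.0,)}
estimate = knn_quantile_wallet(target, companies, k=3, p=0.9)
```

    The distant banking firm and the retail firm are excluded from the neighborhood, so the estimate is the top of the three nearby sales values (50).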

  • IBM Research

    2006 IBM Corporation81

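    Quantile Regression

    Traditional Regression: Estimation of conditional expected value by minimizing sum of squares

    $\min_{\beta} \sum_{i=1}^{n} \left(y_i - f(x_i, \beta)\right)^2$

    Quantile Regression: Minimize quantile loss

    $\min_{\beta} \sum_{i=1}^{n} L_p\left(y_i, f(x_i, \beta)\right)$

    with the quantile regression loss function

    $L_p(y, \hat{y}) = \begin{cases} p\,(y - \hat{y}) & \text{if } y \ge \hat{y} \\ (1-p)\,(\hat{y} - y) & \text{if } \hat{y} > y \end{cases}$

    Implementation: assume a linear function, $y = \beta x + \epsilon$; solution using linear programming

    Linear quantile regression package in R (Koenker, 2001)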
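
    For intuition, linear quantile regression with one feature and an intercept can be solved on toy data without a linear programming solver. The sketch below is a brute-force stand-in, not the R package mentioned above. It relies on the linear programming property that some optimal line passes through two data points, so it tries every such pair. The data are made up: x is a 0/1 indicator, so the answer can be checked by hand.

```python
from itertools import combinations


def quantile_loss(y, y_hat, p):
    return p * (y - y_hat) if y >= y_hat else (1 - p) * (y_hat - y)


def fit_linear_quantile(xs, ys, p):
    """Fit y_hat = a + b*x minimizing total quantile loss by brute force.

    Tries every line through two data points with distinct x.
    Cost is O(n^3), so this is for toy sizes only.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # no unique line y = a + b*x through these two points
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = sum(quantile_loss(y, a + b * x, p) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]


# Two groups of 19 observations: y = 1..19 at x = 0 and y = 10, 20, ..., 190 at x = 1
xs = [0] * 19 + [1] * 19
ys = list(range(1, 20)) + [10 * j for j in range(1, 20)]
a, b = fit_linear_quantile(xs, ys, p=0.9)
```

    With a binary feature the fitted line passes through each group's 90th percentile (18 at x = 0 and 180 at x = 1), so a = 18 and b = 162.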

  • IBM Research

    2006 IBM Corporation82

    Quantile Regression Tree

    Motivation: Identify a locally optimal definition of neighborhood; inherently nonlinear

    Adjustments of M5/CART for quantile prediction:
      Predict the percentile rather than the mean of the leaf
      Splitting/pruning criteria do not require adjustment

    [Figure: example tree with splits on industry (Industry = Banking) and on sales (10K threshold), with yes/no branches; each leaf shows the frequency distribution of IBM Sales and the resulting Wallet Estimate]

  • IBM Research

    2006 IBM Corporation83

    Empirical Evaluation: Quantile Loss

    Setup
      4 domains with a monetary dependent variable, including direct mailing, housing prices, income data, IBM sales
      Performance on test set in terms of quantile loss
      Approaches: kNN, linear quantile regression, quantile tree, quanting

    Baselines
      Constant model
      Traditional regression models for expected values (for skewed distributions, the expected value is actually a high quantile)

  • IBM Research

    2006 IBM Corporation84

    Performance on Quantile Loss

    Conclusions
      If there is a time-lagged variable, the linear quantile model is best
      Quanting (using decision trees) and the quantile tree perform comparably
      Generalized kNN is not competitive

  • IBM Research

    2006 IBM Corporation85

    Residuals for Quantile Regression

    Total positive holdout residuals: 90.05% (18009/20000)

  • IBM Research

    2006 IBM Corporation86

    Market Alignment Project (MAP): Background

    MAP - Objective:
      Optimize the allocation of the sales force
      Focus on customers with growth potential
      Set evaluation baselines for sales personnel

    MAP - Components:
      Web interface with customer information
      Analytical component: wallet estimates
      Workshops with sales personnel to review and correct the wallet predictions
      Shift of resources towards customers with lower wallet share

  • IBM Research

    2006 IBM Corporation87

    The MAP tool captures expert feedback from the Client Facing teams

    [Figure: MAP workflow diagram with stages Data Integration, Analytics and Validation, Insight Delivery and Capture, and Post-processing; elements shown: Transaction Data, D&B Data, Wallet models (Predicted Opportunity), Web Interface, MAP interview process between the MAP Interview Team and the Client Facing Unit (CFU) Team, Expert-validated Opportunity, Resource Assignments; scope: all Integrated and Aligned Coverages]

    The objective here is to use expert feedback (i.e. validated revenue opportunity) from last year's workshops to evaluate our latest opportunity models

  • IBM Research

    2006 IBM Corporation88

    MAP workshops overview

    Calculated 2005 opportunity using naive k-NN approach

    2005 MAP workshops
      Displayed opportunity by brand
      Expert can accept or alter the opportunity

    Select 3 brands for evaluation: DB2, Rational, Tivoli

    Build ~100 models for each brand using different approaches

    Compare expert opportunity to model prediction
      Error measures: absolute, squared
      Scale: original, log, root

  • IBM Research

    2006 IBM Corporation89

    Displayed Model Predictions of kNN

    Distance metric: identical industry; Euclidean distance on size (revenue or employees)

    Neighborhood size: 20

    Prediction: Median of the non-zero neighbors (alternatives: max, percentile)

    Post-processing: Floor prediction by max of last 3 years' revenue

    [Figure: universe of IBM customers with D&B information plotted by Industry, Employees and Revenue, with the neighborhood of target company i highlighted]
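
    The prediction and post-processing steps of the displayed model can be sketched as below, assuming the neighborhood has already been selected. The function name and the numbers are hypothetical.

```python
from statistics import median


def displayed_opportunity(neighbor_sales, last_3y_revenue):
    """Median of non-zero neighbor sales, floored by the max of the last 3 years' revenue."""
    non_zero = [s for s in neighbor_sales if s > 0]
    estimate = median(non_zero) if non_zero else 0.0
    # Post-processing: never display less than the customer already spent
    return max(estimate, max(last_3y_revenue))


kept = displayed_opportunity([0, 0, 10, 20, 30], [5, 8, 12])   # neighbor median wins
floored = displayed_opportunity([0, 4, 6], [5, 8, 12])         # floor wins
```

    In the first call the neighbor median (20) exceeds past revenue; in the second the median (5) is floored up to the best of the last three years (12).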

  • IBM Research

    2006 IBM Corporation90

    Expert Feedback (Log Scale) to Original Model (DB2)

    [Figure: scatter plot of Expert Feedback against MODEL_OPPTY, both axes running from 0 to 20 on a log scale]

    Experts accept opportunity (45%)

    Experts change opportunity (40%): Increase (17%), Decrease (23%)

    Experts reduced opportunity to 0 (15%)

  • IBM Research

    2006 IBM Corporation91

    Observations

    Many accounts are set to zero for external reasons
      Exclude from evaluation since no model can predict this

    Exponential distribution of opportunities
      Residual-based evaluation on the original scale suffers from huge outliers

    Experts seem to make percentage adjustments
      Consider log scale evaluation in addition to original scale, and root as intermediate

    Suspect strong anchoring bias: 45% of opportunities were not touched

  • IBM Research

    2006 IBM Corporation92

    Evaluation Measures

    Different scales to avoid outlier artifacts:
      Original: e = model - expert
      Root: e = root(model) - root(expert)
      Log: e = log(model) - log(expert)

    Statistics on the distribution of the errors:
      Mean of e^2
      Mean of |e|

    Total of 6 criteria
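
    The six criteria (three scales times two statistics) can be computed as below. The sketch assumes all values are strictly positive, in line with excluding accounts that experts set to zero. The function name and the two toy accounts are made up.

```python
import math


def evaluation_measures(model, expert):
    """Mean squared and mean absolute error on the original, root and log scales."""
    scales = {"original": lambda v: v, "root": math.sqrt, "log": math.log}
    results = {}
    for name, f in scales.items():
        errors = [f(m) - f(x) for m, x in zip(model, expert)]
        results[(name, "mean_sq")] = sum(e * e for e in errors) / len(errors)
        results[(name, "mean_abs")] = sum(abs(e) for e in errors) / len(errors)
    return results


# Two hypothetical accounts: the model matches the expert on one, is 4x too high on the other
results = evaluation_measures(model=[100.0, 400.0], expert=[100.0, 100.0])
```

    On the original scale the single miss dominates (mean absolute error 150), while the root and log scales compress it, which is the motivation for evaluating on all three.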

  • IBM Research

    2006 IBM Corporation93

    Model comparison results: Count how often a model scores within the top 10 and 20 for each of the 6 measures

    TivoliDB2RationalModel

    2362204

    1033616

    4564435

    1041316

    40Quantile Tree 0.840Decomposition Center62kNN 50 + flooring21Regression Tree55Linear Quantile 0.841Max 03-05 Revenue66Displayed Model (kNN) (Anchoring)

    (Best)

  • IBM Research

    2006 IBM Corporation94

    Conclusions

    kNN performs very well after flooring but is typically low prior to flooring

    Empirically, the linear 80th quantile performs consistently well (flooring has a minor effect)

    Experts are strongly influenced by displayed opportunity (and displayed revenue of previous years)

    Models without last year's revenue don't perform well

    Use Linear Quantile Regression with q=0.8 in MAP 06

  • IBM Research

    2006 IBM Corporation95

    Ongoing and Future Work

    Extend MAP to other geographies

    Quantile estimation performance of different methods as a function of the quantile

    Performance as a function of the shape of the conditional distribution of the dependent variables

    Theoretical generalization of the decomposition approach

  • IBM Research

    2006 IBM Corporation96

    A graphical model approach

    Wallet is unobserved, all other variables are observed

    Two families of variables, firmographics and IBM relationship, are conditionally independent given wallet

    We develop inference procedures and demonstrate them

    In some cases this leads to simple linear regression as ML inference on wallet

    See poster in this conference: Merugu, Rosset, Perlich: A new multi-view learning approach with an application to customer wallet estimation.

    [Figure: graphical model with nodes Company firmographics, Company IT Wallet, Historical relationship with IBM, and IT spend with IBM]

  • IBM Research

    2006 IBM Corporation97

    References

    Marketing Science
      R. Rust, K. Lemon and V. Zeithaml, Return on Marketing: Using Customer Equity to Focus Marketing Strategy, J. of Marketing, 2004.
      P. Kotler, Marketing Management, Millennium Ed., Prentice-Hall, 2000.

    Cost-sensitive Learning
      P. Domingos, MetaCost: A general method for making classifiers cost-sensitive, The 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
      N. Abe, B. Zadrozny and J. Langford, An Iterative Method for Multi-class Cost-sensitive Learning, The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2004.

    Active Learning
      H.S. Seung, M. Opper and H. Sompolinsky, Query by committee, Proceedings of the Fifth Workshop on Computational Learning Theory, 1992.
      D. Angluin, Queries and concept learning, Machine Learning, 1988.

    MDP and Reinforcement Learning
      R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
      L. P. Kaelbling, M. L. Littman and A. W. Moore, Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, 1996.

  • IBM Research

    2006 IBM Corporation98

    References

    Bayesian Networks and Causal Networks
      K. Murphy, A brief introduction to Bayesian Networks and Graphical Models, http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html
      D. Heckerman, A tutorial on learning with Bayesian Networks, Microsoft Research MSR-TR-95-06, March 1995.
      J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, 2000.
      P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction, and Search, 2nd Edition, MIT Press, 2000.

    Case Study: Customer Wallet Estimation
      S. Rosset, C. Perlich, B. Zadrozny, S. Merugu, S. Weiss and R. Lawrence, Customer Wallet Estimation, 1st NYU workshop on CRM and Data Mining, 2005.
      S. Merugu, S. Rosset and C. Perlich, A New Multi-View Regression Method with an Application to Customer Wallet Estimation, The Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006.

  • IBM Research

    2006 IBM Corporation99

    Thank you!