

    Prognosis - An Approach to Predictive Analytics

    WHITE PAPER

    Abstract

    Prediction is a statement made about the future, an anticipatory vision or
    perception. This white paper discusses the emergence of technology that
    enables precise predictions in varied fields, and the application of
    exploratory and normative methods to augment decision making.

    Forecasting is primarily based on mining historical data sets, extracting
    hidden patterns and transforming them into valuable information through a
    process of classification, clustering, regression and association rule
    learning.

    The white paper also describes Impetus' implementation of Behavioral
    Targeting for the ad world: a widely accepted statistical machine learning
    approach that helps select the most relevant ads to be displayed to a web
    user based on that user's historical data.

    Impetus Technologies Inc.

    www.impetus.com

    November 2011


    Table of Contents

    Introduction
        Large scale data analytics
        Algorithms for forecasting and prediction
    Behavioral Targeting
        Advantages and threats
        Industry impact
        Generic approach to BT problem solving
    Large scale implementation of BT
        Linear Poisson Regression
        Implementing BT using Linear Poisson Regression
            1. Data Preparation
            2. Model Training
            3. Model Evaluation
    Summary

    Introduction

    A prediction is a statement about the way things will happen in the future,
    often but not always based on experience or knowledge. Prediction is
    necessary to allow plans to be made about possible developments. Large
    corporations invest heavily in this kind of activity to help focus
    attention on possible events, risks and business opportunities. Such work
    brings together all available past and current data as a basis for
    developing reasonable expectations about the future.

    The basic idea behind any such algorithm is to gather very large volumes of
    behavioral data describing the historical series of events, actions and
    behavior of the entity in question. This data is fed into machines and run
    through complex machine learning algorithms to derive models. The models
    serve as the basis for predictions, i.e. given input criteria, the models
    infer the expected behavior of the entity.

    The application of prediction algorithms has gained prominence in a wide
    range of fields such as finance (stock market predictions), insurance
    (predicting life expectancy), science (weather forecasting, predicting
    natural disasters), medical science (treating developmental disabilities),
    marketing (behavioral targeting) and many more.

    Typically, with predictions, there is a huge amount of historical data,
    time is of the essence, and there is always a current activity happening
    that impacts the future. In many cases, freshness of data is a key factor
    and plays a major role in forecasting the future course of action. In
    other instances, the entire data set has equal relevance and contributes
    to determining the future.

    Large scale data analytics

    Projects related to future predictions and forecasting point to a huge
    increase in the amount of data that must not only be stored but also
    processed quickly and efficiently. These challenges present at once a
    daunting and an exciting chance to use data to create a positive impact.

    Often, there is an immediate need to analyze the data at hand, to discover
    patterns, reveal threats, monitor critical systems, and make decisions
    about the direction the organization should take. Several constraints are
    always present: the need to implement new analytics quickly enough to
    capitalize on new data sources, limits on the scope of development
    efforts, and the pressure to expand mission capability without an increase
    in budgets. For many of these applications, the large data processing
    stack (which includes the simplified programming model Map-Reduce,
    distributed file systems, semi-structured stores, and integration
    components, all running on commodity-class hardware) has opened up a new
    avenue for scaling out efforts and enabling analytics that were impossible
    in previous architectures. This new ecosystem has proved remarkably
    versatile at handling various types of data and classes of analytics.

    Perhaps the most exciting benefit of moving to these highly scalable
    architectures, however, is that after the immediate issues have been
    solved, often with a system that can handle today's requirements and scale
    up to 10x or more, new analytics and capabilities can be developed,
    evaluated and integrated easily. This is owing to the speed and ease of
    Map-Reduce, Pig, Hive, and other technologies. More than ever, the
    large-scale data analysis software stack is proving to be a platform for
    innovation.

    Algorithms for forecasting and prediction

    There are several classes of statistical algorithms that are well suited
    to these kinds of problems, which are associated with trend analysis,
    pattern generation and artificial intelligence based predictions. Some of
    the most common ones are:

    - Conjoint Analysis: expert opinion and Delphi surveys
    - Quantitative: statistical, suited to predicting trends, e.g. Linear
      Poisson Regression, exponential smoothing
    - Qualitative: subjective, providing a range of possible outcomes, e.g.
      the Bayesian approach
    - Statistical combination: a mix of quantitative and qualitative
      techniques, e.g. Quasi-Bayes

    Behavioral Targeting

    Behavioral targeting (BT) leverages historical user behavior to select the
    most relevant ads to display. The state of the art in BT derives a Linear
    Poisson Regression model from fine-grained user behavioral data and
    predicts click-through rate (CTR) from user history.

    Behavioral targeting is an application of modern statistical machine
    learning methods to online advertising. Unlike other computational
    advertising techniques, BT does not primarily rely on contextual
    information such as the query (sponsored search) or the web page (content
    match). Instead, BT learns from past user behavior, especially the
    implicit feedback (i.e., ad clicks), to match the best ads to users.

    This gives BT a broader applicability, such as graphical display ads, or
    at least a valuable user dimension complementary to other contextual
    advertising techniques. In today's practice, behaviorally targeted
    advertising inventory comes in the form of a demand-driven taxonomy;
    hierarchical examples are Finance/Investment and Technology/Consumer
    Electronics/Cellular Telephones. Within a category of interest, a BT model
    derives a relevance score for each user from past activity. Should the
    user appear online during a targeting time window, the ad serving system
    will qualify this user (to be shown an ad in this category) if the score
    is above a certain threshold. One de facto measure of relevance is CTR,
    and the threshold is predetermined in such a way that both a desired level
    of relevance (measured by the cumulative CTR of a collection of targeted
    users) and the volume of targeted ad impressions (also called reach) can
    be achieved.
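    As a hedged illustration of this qualification rule, the following sketch
    scores a user by predicted CTR and applies the threshold; the class, names
    and the example threshold are ours, not the actual ad-serving API.

        // Sketch of the qualification rule described above (illustrative names).
        public final class BtQualifier {
            private final double ctrThreshold; // tuned so relevance and reach targets are both met

            public BtQualifier(double ctrThreshold) {
                this.ctrThreshold = ctrThreshold;
            }

            // Qualify a user for a category if the predicted CTR clears the threshold.
            public boolean qualify(double predictedClicks, double predictedViews) {
                if (predictedViews <= 0) {
                    return false; // no expected impressions, nothing to target
                }
                return predictedClicks / predictedViews >= ctrThreshold;
            }

            public static void main(String[] args) {
                BtQualifier q = new BtQualifier(0.002); // example threshold: 0.2% CTR
                System.out.println(q.qualify(1.5, 500.0)); // true: predicted CTR = 0.003
            }
        }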

    The impact of behavioral targeting can be negative if consumers feel
    annoyed or threatened by the use of their personal data. However, as
    demonstrated by Amazon, when personal information and technology enhance
    the online experience, there is less risk of a negative response.

    Advantages and threats

    There are many advantages attributed to ad targeting and behavioral
    analysis, but at the same time it is important to look at the downsides
    and surface the threats they pose. Some of the advantages that can be seen
    right away are:

    - Reaching the right audience at the right time (of the day, week or life
      stage), with clear behavioral assumptions
    - Standing out in a cluttered category
    - Reaching target audiences when context inventory is sold out (reaching
      the same target in alternative content)
    - Avoiding the high cost of entry into desired content (reaching the same
      target in alternative content at lower cost)
    - Tailoring the message to behavioral patterns to make it more relevant

    As mentioned earlier, there are some downsides to BT:

    - Achieving high reach is difficult. Within extremely targeted segments,
      the potential universe available may be very limited, and there may be a
      limit to the sites currently allowing behavioral targeting.
    - Inconsistencies within segment classifications. The definition of a
      common behavioral segment may differ by publisher (e.g., a job seeker
      searching Monster.com is not the same job seeker as one reading a
      job-related article on iVillage). Also, as the technology is cookie
      enabled, it suffers the usual issues of cookie stability and data
      accuracy.
    - The ultimate issue of behavioral targeting clutter. Other advertisers
      within the same vertical will compete in the same space/segments. This
      is not an issue today, but in time the current positives of cost,
      clutter and inventory availability will become challenges (as seen in
      paid search). In the future, as targeting matures and advertisers have
      measurable results, historical data will be a key indicator of which
      assumptions work. This will provide optimization insights. Collecting
      and analyzing response data generated from different segments are
      important prerequisites for success.

    Industry impact

    Behavioral targeting, as a concept, has wide acceptance in the industry.

    Indicated below are some use-cases where it is being successfully implemented

    as a tool for predicting user behavior:

    - Ad targeting and predicting the buying behavior of users
    - Relationship building
    - Audience targeting
    - Presidential candidates using BT to target persuasion
    - Treatment of mental disorders and developmental disabilities


    There is a vast horizon where BT, or BT-based solutions, are being used to
    successfully predict and forecast behavior in order to increase reach,
    accessibility and revenue.

    Generic approach to BT problem solving

    - Data mining involves extracting hidden patterns from data to transform
      it into valuable information, using computer power to apply knowledge
      discovery methodologies.
    - It applies knowledge discovery and prediction through a process of
      classification, clustering, regression and association rule learning.
    - The value of the information depends on the collection of indicative and
      representative data.
    - Cookies for behavioral advertising usually contain text that uniquely
      identifies the browser, so that advertisers or ad networks can recognize
      the same Internet user across different web sites or multiple areas on
      the same site.

    Large scale implementation of BT

    Linear Poisson Regression

    This is a statistical method used to calculate the probability of an
    event, given the rate of occurrence of the event in disjoint timeframes.
    It is suited to analyzing outcomes that take non-negative values.

    Linear Poisson Regression works particularly well where the input data is
    sparse, i.e. its results are valid for rare events. It can model rare
    events both when everyone is followed for the same length of time and when
    people have follow-ups of different lengths.
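    In symbols (the notation is ours, but this is the standard statement of
    the model): for example i with feature-count vector x_i, the count y_{ik}
    of target event k is modeled as

        y_{ik} \sim \mathrm{Poisson}(\lambda_{ik}), \qquad
        \lambda_{ik} = \mathbf{w}_k^{\top}\mathbf{x}_i = \sum_j w_{kj}\, x_{ij},

    with non-negative weights w_{kj} fitted by maximizing the log-likelihood,
    which up to a constant is \sum_i ( y_{ik} \log \lambda_{ik} - \lambda_{ik} ).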

    Implementing BT using Linear Poisson Regression

    Behavioral targeting can be effectively implemented using the Linear
    Poisson Regression algorithm, as it maps well to the nature of the input
    data and the kind of predictions that organizations are looking for. The
    flow of the algorithm is explained step by step below.


    Impetus Technologies implemented behavioral targeting using the Linear
    Poisson Regression algorithm. The algorithm was deployed on the Hadoop
    ecosystem: it was decomposed into individual steps, each step was
    implemented as a Hadoop M/R job, and the jobs were run sequentially using
    the Oozie workflow engine. The results of the implementation were models
    for different categories. These models were stored in the HBase data store
    and later consumed for analytics and behavioral predictions. The steps
    involved in this implementation are explained below.
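    As an illustration of this sequential chaining (the production deployment
    expressed the ordering declaratively in Oozie; the job names and HDFS
    paths below are invented, not Impetus' actual ones), a plain Java driver
    might look like:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        // Illustrative driver chaining the BT steps one after another.
        public class BtPipelineDriver {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                runStep(conf, "feature-extractor", "/bt/raw", "/bt/extracted");
                runStep(conf, "feature-generator", "/bt/extracted", "/bt/features");
                runStep(conf, "poisson-entity-dictionary", "/bt/features", "/bt/dictionary");
                // ... remaining steps: feature vectors, initializer, multiplicative passes
            }

            private static void runStep(Configuration conf, String name,
                                        String in, String out) throws Exception {
                Job job = Job.getInstance(conf, name);
                // Each real step would set its own mapper, reducer and key/value classes here.
                FileInputFormat.addInputPath(job, new Path(in));
                FileOutputFormat.setOutputPath(job, new Path(out));
                if (!job.waitForCompletion(true)) {
                    throw new IllegalStateException("Step failed: " + name);
                }
            }
        }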

    1. Data Preparation

    In this preprocessing step, the data fields of interest were extracted
    from the raw data feeds, thus reducing the size of the data.

    The raw data was related to user behavior with respect to one or more ads.
    It included ad clicks, ad views, page views, searches, organic clicks and
    overture clicks.


    1. The raw data came from the user base.
    2. The system stored the raw data in HDFS.
    3. The raw data was sent to the data preparation module, which:
       a. Aggregated event counts over a configurable period of time, to
          further shrink the data size
       b. Merged the counts into a single entry, with the cookie as the unique
          key
       c. Ran two M/R jobs: Feature-Extractor and Feature-Generator

    1.1 Feature-Extractor

    Input- Raw data feeds

    Output-

    1.2 Feature-Generator

    Input-

    Output-
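    As a hedged sketch of the Feature-Extractor job, a mapper might look as
    follows. The raw feed layout assumed here (tab-separated cookie,
    timestamp, event type, entity id) is our guess, not the actual format:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Emits (cookie, "eventType:entityId") pairs; a downstream reducer counts them.
        public class FeatureExtractorMapper
                extends Mapper<LongWritable, Text, Text, Text> {

            private final Text outKey = new Text();
            private final Text outValue = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t");
                if (fields.length < 4) {
                    return; // skip malformed records
                }
                outKey.set(fields[0]);                     // cookie
                outValue.set(fields[2] + ":" + fields[3]); // e.g. adClick:ad42
                context.write(outKey, outValue);
            }
        }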

    2. Model Training

    This fitted the Linear Poisson Regression model from the preprocessed data
    and involved the following:

    1. Feature selection
    2. Generation of training examples
    3. Model weights initialization
    4. Multiplicative recurrence to converge the model weights

    2.1 Poisson-Entity-Dictionary

    It mainly performed feature selection and inverted indexing. It did this
    by counting entity frequency in terms of touching cookies and selecting
    the most frequent entities in the given feature space.

    Output: a hashmap from entity name to index (the inverted index) for all
    entity types

    An entity referred to the name (unique identifier) of an event (e.g. an ad
    id, a space id for a page, or a query). An entity was different from a
    feature, since the latter was uniquely identified by the pair of event
    type and entity. In the context of BT, there were three types of entities:
    ad, page and search.

    The Poisson entity dictionary included three M/R jobs: PoissonEntityUnit,
    PoissonEntitySum, and PoissonEntityHash.
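    A minimal, single-machine sketch of this selection rule (count how many
    distinct cookies touch each entity, keep the top N, and assign each a
    dense index); the method name and data layout are ours, not the M/R
    implementation:

        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.Set;
        import java.util.stream.Collectors;

        public class EntityDictionary {
            // cookiesByEntity: entity name -> set of cookies that touched it.
            static Map<String, Integer> build(Map<String, Set<String>> cookiesByEntity,
                                              int topN) {
                List<String> selected = cookiesByEntity.entrySet().stream()
                        .sorted((a, b) -> Integer.compare(b.getValue().size(),
                                                          a.getValue().size()))
                        .limit(topN)
                        .map(Map.Entry::getKey)
                        .collect(Collectors.toList());
                Map<String, Integer> invertedIndex = new HashMap<>();
                for (int i = 0; i < selected.size(); i++) {
                    invertedIndex.put(selected.get(i), i); // entity name -> feature index
                }
                return invertedIndex;
            }
        }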

    2.2 Poisson-Feature-Vector

    This generated the training examples (feature vectors) that were directly
    used later by model initialization and multiplicative recurrence.

    It used a sparse data structure for feature vectors, because behavioral
    count data is very sparse by nature: for a given user, in a given time
    period, his or her activity involves only a limited number of events.
    Impetus used a pair of arrays of the same length to represent a feature
    vector or a target vector: an integer array for features and a float array
    for values (float to allow for possible decaying), with a common array
    index linking each feature-value pair.

    Feature selection and inverted indexing: with the feature space selected
    from PoissonEntityDictionary, Impetus discarded the unselected events from
    the training data on the feature (input variable) side. On the target
    (response variable) side, Impetus took the option of using either all
    features or only the selected features to categorize them into target
    event counts.

    With the inverted index built from PoissonEntityDictionary, from the
    PoissonFeatureVector step onwards Impetus referenced an original feature
    name by its index. The same idea was also applied to cookies, since the
    cookie field itself was irrelevant to the model.

    Several pre-computations were performed at this stage:

    1. Impetus further aggregated feature counts into a time window with a
       size larger than or equal to the resolution from data preparation.
    2. It decayed counts over time using a configurable factor.
    3. It realized a causal approach to generating examples (the causal
       approach collects features before targets temporally, while the
       non-causal approach generates targets and features from the same period
       of history).
    4. Impetus used binary representation (serialized objects in Java) and
       data compression (sequence files with BLOCK compression in the Hadoop
       framework) for feature vectors.

    Data structure for the feature vector:

        int[targetLength]   targetIndexArray
        float[targetLength] targetValueArray
        int[inputLength]    inputIndexArray
        float[inputLength]  inputValueArray
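    In Java, this data structure amounts to the following; the dot-product
    helper is our addition, anticipating the Poisson mean computation in step
    2.4:

        // Sparse vector exactly as described above: parallel index/value arrays
        // for the target portion and the input portion.
        public final class PoissonFeatureVector {
            final int[]   targetIndex;
            final float[] targetValue;
            final int[]   inputIndex;
            final float[] inputValue;

            PoissonFeatureVector(int[] ti, float[] tv, int[] ii, float[] iv) {
                this.targetIndex = ti; this.targetValue = tv;
                this.inputIndex  = ii; this.inputValue  = iv;
            }

            // Dot product of a dense weight vector with the sparse input portion,
            // i.e. the Poisson mean lambda used by the multiplicative update.
            double dot(float[] denseWeights) {
                double sum = 0.0;
                for (int p = 0; p < inputIndex.length; p++) {
                    sum += denseWeights[inputIndex[p]] * inputValue[p];
                }
                return sum;
            }
        }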

    Input-

    Output-

    Target counts were collected from a sliding time window, and feature
    counts were aggregated (possibly with decay) from a time period preceding
    the target window. The size of the sliding window was kept relatively
    small, because a large window effectively discards many co-occurrences
    within that window. For example, the following setup yielded superior
    long-term models (the window arithmetic is sketched in code after this
    list):

    a. A target window of size one day
    b. Sliding over a one-week period
    c. Preceded by a four-week feature window (also sliding along with the
       target window)
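    A minimal sketch of that window arithmetic, assuming day granularity and a
    zero-length gap (the gap window is introduced in the algorithm below):

        // Prints the four sliding-window boundaries used by the algorithm below.
        public class SlidingWindows {
            public static void main(String[] args) {
                int featureDays = 28, targetDays = 1, gapDays = 0, slideOverDays = 7;
                for (int slide = 0; slide < slideOverDays; slide++) {
                    int featureBegin = slide;
                    int featureEnd   = featureBegin + featureDays;
                    int targetBegin  = featureEnd + gapDays; // gap emulates online latency
                    int targetEnd    = targetBegin + targetDays;
                    System.out.printf("features [%d,%d) -> targets [%d,%d)%n",
                            featureBegin, featureEnd, targetBegin, targetEnd);
                }
            }
        }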

    The algorithm included the following:

    1. For each cookie, Impetus cached all the event count data.
    2. It sorted events by time, forming an event stream for this particular
       cookie covering the entire time period of interest.
    3. Impetus pre-computed the boundaries of the sliding window. Four
       boundaries were specified: featureBegin, featureEnd, targetBegin and
       targetEnd. Separating featureEnd and targetBegin allowed a gap window
       in between, which was necessary to emulate possible latency in online
       prediction.


    4. The company maintained three iterators on the event stream, referencing
       the previous featureBegin, the current featureBegin, and targetBegin.
       It used a pair of TreeMap objects (inputMap and targetMap) to hold the
       features and targets of a feature vector as the data was being
       processed.

    2.3 Poisson-Initializer

    It initialized the model weights (the coefficients of the regressors) by
    scanning the training data once. In the notation used here:

    k: index of target variables
    j: index of features or input variables
    i: index of examples

    A unigram (j) is one occurrence of feature j; a bigram (k, j) is one
    co-occurrence of target k and feature j. The basic idea of this
    bigram-based initialization was to allocate each weight w(k, j) as a
    normalized number of co-occurrences of (k, j).

    The output of PoissonInitializer was an initialized weight matrix of
    dimensionality: number of targets by number of features.

    1. Impetus distributed the computation of counting the bigrams by a
       composite key, and effectively pre-computed the total bigram counts of
       all examples before the final stage.
    2. The M/R framework provides a single-key data structure. In order to
       distribute the bigram counts, Impetus needed an efficient function to
       transform a composite key (two integers) into a single key, and to
       recover the composite key when needed (see the sketch after this
       list):

       bigramKey(k, j) = a long integer obtained by bitwise left shift of k by
       32 bits, then bitwise OR with j

    3. The Impetus team cached the output of the first mapper (the emitted
       key-value pairs).
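    In Java, the packing and unpacking from step 2 look like this; the masking
    of j's sign bits is our addition, to keep negative values from corrupting
    the key:

        public final class BigramKey {
            static long pack(int k, int j) {
                return ((long) k << 32) | (j & 0xFFFFFFFFL);
            }
            static int targetOf(long key)  { return (int) (key >>> 32); }
            static int featureOf(long key) { return (int) key; }

            public static void main(String[] args) {
                long key = pack(7, 42);
                System.out.println(targetOf(key) + "," + featureOf(key)); // 7,42
            }
        }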

    2.4 Poisson-Multiplicative

    It updated the model weights by scanning the training data iteratively,
    using a highly effective multiplicative recurrence. Computing the
    normalizer (the Poisson mean) involved a dot product of the previous
    weight vector with a feature vector (the input portion).

    Input:

    Output: updated w(k) for all k
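    The recurrence itself is not reproduced in the text. For a linear Poisson
    model, the standard likelihood-ascending multiplicative update, which we
    take to be the one meant here, is

        w_{kj} \leftarrow w_{kj} \,
        \frac{\sum_i x_{ij}\, y_{ik} / \lambda_{ik}}{\sum_i x_{ij}},
        \qquad \lambda_{ik} = \mathbf{w}_k^{\top}\mathbf{x}_i,

    where the numerator uses the Poisson mean (the dot product mentioned
    above) and the denominator is the total occurrence count of feature j.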


    1. Impetus represented the model weight matrix as K dense weight vectors
       (arrays) of length J, where K was the number of targets and J the
       number of features.
    2. Using weight vectors was more scalable in terms of memory footprint
       than a matrix representation, but it raised challenges in disk IO.
       Impetus addressed this problem via in-memory caching. Caching the
       weight vectors was not the solution; the trick was to cache the input
       examples. After caching, Impetus maintained a hashmap that recorded all
       relevant targets for the cached feature vectors, and provided a
       constant-time lookup from target index to array index.
    3. Impetus also used Hadoop's distributed cache, which copied the
       requested files from HDFS to the slave nodes before the task was
       executed. It copied the files only once per job for each task tracker,
       to be shared by that tracker's M/R tasks.
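    For reference, a hedged sketch of that distributed-cache setup, using the
    org.apache.hadoop.filecache API current at the time; the file paths are
    invented:

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.filecache.DistributedCache;

        public class CacheSetup {
            // Files are copied once per job to each tasktracker, then shared by
            // that tracker's M/R tasks, as described above.
            public static void addModelFiles(Configuration conf) throws Exception {
                DistributedCache.addCacheFile(new URI("/bt/dictionary/part-00000"), conf);
                DistributedCache.addCacheFile(new URI("/bt/weights/previous-pass"), conf);
            }
        }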

    3. Model Evaluation

    This tested the trained model on a test data set. The main tasks were:

    1. Predicting expected target counts (clicks and views)
    2. Scoring (CTR)
    3. Ranking the scores of a test set
    4. Calculating and reporting performance metrics such as CTR lift and area
       under the ROC curve
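    A hedged sketch of tasks 2 and 3, scoring users by smoothed predicted CTR
    and ranking them; the smoothing constants are our assumption, not values
    from the implementation:

        import java.util.Comparator;
        import java.util.List;

        public class CtrScorer {
            static final double ALPHA = 1.0, BETA = 100.0; // assumed smoothing priors

            record UserPrediction(String cookie, double expectedClicks,
                                  double expectedViews) {}

            // Smoothed CTR estimate from the model's expected counts.
            static double score(UserPrediction p) {
                return (p.expectedClicks() + ALPHA) / (p.expectedViews() + BETA);
            }

            // Rank a test set from most to least relevant.
            static void rank(List<UserPrediction> users) {
                users.sort(Comparator.comparingDouble(CtrScorer::score).reversed());
            }
        }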

    This component contained three sequential steps:

    3.1 Poisson-Feature-Vector-Eval

    It was identical to Poisson-Feature-Vector, with the following
    differences:

    - There was no need to bookkeep the summary statistics used for training,
      such as the total counts of examples, feature unigrams and target
      unigrams.
    - Decay was typically necessary in generating test data, since it enabled
      efficient incremental predicting as new events flowed in, while
      exponentially diminishing the obsolete long history.
    - Sampling and heuristic-based robot filtering were not applied when
      generating test data.
    - Impetus could remove examples without a target from the test dataset,
      since these records did not impact the measured performance, no matter
      how the model predicted them. However, examples with targets were kept,
      even those without any inputs, because these records still counted
      toward the evaluation metrics.


    Summary

    As explained above, a prediction is a statement made about the future. A
    very popular area of application that has flourished in recent times is
    behavioral targeting (BT). BT is a large scale machine learning problem
    that leverages historical user behavior to select the most relevant ads to
    display. The process basically involves mining historical data sets and
    extracting hidden patterns (trends) to predict user interests.

    Major IT giants like Yahoo, Google and Amazon have used behavioral
    targeting and achieved major gains in terms of reach and CTR increase.
    There are several implementations of BT that employ various statistical
    algorithms and processes to extract the behavioral traits of the users in
    question.

    The input to the BT engine is a historical sequence of the activities
    undertaken by users over the Internet. These activities include ad clicks,
    ad views, page views, search queries and search clicks. As users browse
    the Internet they unknowingly leave a trail of footprints in terms of
    visited pages, ads, cookies, etc. These footprints reveal a lot about
    their personality traits. BT leverages these subtle inputs and, without
    compromising the privacy of the users, draws their personality sketch.
    Based on these inferences, advertisers are able to target their audience
    and show them relevant ads.

    Impetus applied the Linear Poisson Regression algorithm for its
    implementation. This was deployed on the Hadoop environment using chained
    Map-Reduce jobs as an Oozie workflow.

    Disclaimers

    The information contained in this document is the proprietary and
    exclusive property of Impetus Technologies Inc. except as otherwise
    indicated. No part of this document, in whole or in part, may be
    reproduced, stored, transmitted, or used for design purposes without the
    prior written permission of Impetus Technologies Inc.

    About Impetus

    Impetus Technologies offers Product Engineering and Technology R&D
    services for software product development. With ongoing investments in
    research and application of emerging technology areas, innovative business
    models, and an agile approach, we partner with our client base comprising
    large scale ISVs and technology innovators to deliver cutting-edge
    software products. Our expertise spans the domains of Big Data, SaaS,
    Cloud Computing, Mobility Solutions, Test Engineering, Performance
    Engineering, and Social Media among others.

    Impetus Technologies, Inc.

    5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
    Tel: 408.252.7111 | Email: [email protected]

    Regional Development Centers - INDIA: New Delhi, Bangalore, Indore,
    Hyderabad

    To know more, visit: http://www.impetus.com