pollyanna document classifier
Pollyanna: A machine learning system for classifying product pages on the Internet
What is Pollyanna?
• Pollyanna is a Machine Learning System that uses ‘Supervised Learning’ techniques to associate words and categories quantitatively, based on the examples in the training set.
• The training system is programmed to interpret the association between words and categories using theories in probability and statistics
• It applies the training knowledge to classify documents based on the text contained in the document using the ‘linear classifier’ function
What does Pollyanna do?
• It reads the text in the product pages of Internet merchants and retailers,
• Quantitatively associates the words in the title, meta and body tags with the product categories in its taxonomy, and
• Predicts the top 3 categories to which the products in the product page may belong
The Context
What is Pollyanna’s business context?
The Comparison Shopping Engine (CSE) Eco-system
Internet Retailer
Comparison Shopping Engine
Internet Buyer
The Process
Retailer Offer Classification
Retailer Offer Alignment
Product
Attribution
Internet Shopper Internet Retailer
Search
Identify/Shortlist
Purchase
Comparison Shopping Engine
Sample Product Taxonomy for Classification
Classification
• Classification of retailers’ offers is a critical process for most Comparison Shopping Sites
• Classification enables a focused search for a product within a specific product category
Efficiency of existing classification methods
• Current classification algorithms in the comparison shopping space are approximately 65% accurate
• About 10% of merchant offers are manually classified
• About 10% of merchant offers are always misclassified
Problem Definition
How to most effectively classify merchant/retailer offers accurately at
the lowest cost?
The Solution
The Pollyanna System
A fresh perspective of the process and inputs
Disregard retailer’s data-feed to search engine
Train the system on retailer’s website content
Use the product web page text as input
A new viewpoint on support vectors in a machine learning system
A new predictive coefficient not used in any other publicly known machine learning system
Synthesis of statistical theories widely applied in social sciences and medical research
Pollyanna’s current one-dimensional relationship analysis
Keyword
Product Category 3
Product Category 4
Product Category 1
Product Category 2
Example of the one dimensional relationship
Word Relationship Product Category
acrylic 0.950338803 Men’s Hats
acrylic 0.944220332 Men’s Socks
acrylic 0.061613565 Men’s Sweaters / Vests
acrylic 0.002798075 Miscellaneous Men’s Accessories
acrylic 0.001157465 Miscellaneous Women’s Accessories
acrylic 0.772611278 Women’s Hats
acrylic 0.442448187 Women’s Socks & Hosiery
Conditional Probability
• Conditional probability is the probability of some event A, given the occurrence of some other event B. Conditional probability is written as P(A|B), and is read as "the probability of A, given B".
• Bayes’ Theorem provides the equation for conditional probability, which can be stated as:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• Because P(B|A) P(A) = P(A ∩ B), this can also be written as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Conditional Probability
Attribute | Document contains the word ‘Drawstring’ (B) | Document does not contain the word ‘Drawstring’ (b) | Total
Women’s Pants (A) | 195 | 12053 | 12248
Not Women’s Pants (a) | 628 | 434347 | 434975
Total | 823 | 446400 | 447223
Data from Pollyanna
In this example CP = P(A|B) = 195/823 = 0.2369380316
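The calculation above can be reproduced in a few lines. This is a minimal sketch using only the counts from the ‘Drawstring’ table; the function and variable names are illustrative assumptions and are not taken from Pollyanna itself.

```python
# Counts copied from the 2x2 table above ("Data from Pollyanna").
womens_pants_with_word = 195    # Women's Pants (A) and contains 'Drawstring' (B)
other_with_word = 628           # Not Women's Pants (a) and contains 'Drawstring' (B)
womens_pants_total = 12248      # all Women's Pants documents
total_documents = 447223


def conditional_probability(joint_count, condition_count):
    """Return P(A | B) = count(A and B) / count(B)."""
    return joint_count / condition_count


docs_with_word = womens_pants_with_word + other_with_word   # 823 documents
cp = conditional_probability(womens_pants_with_word, docs_with_word)

# The same value via Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
p_a = womens_pants_total / total_documents
p_b = docs_with_word / total_documents
p_b_given_a = womens_pants_with_word / womens_pants_total
cp_bayes = p_b_given_a * p_a / p_b

print(f"P(Women's Pants | 'Drawstring') = {cp:.10f}")
```

Both routes give the same number, which is why the slide can state the formula either way.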
Risk Ratio
Example from cohort studies in medicine.
Outcome | High Systolic BP | Normal BP
Congestive Heart Failure (CHF) | 400 | 400
No CHF | 1100 | 2600
Total | 1500 | 3000
$$RR = \frac{400/1500}{400/3000} = 2.0$$
Risk Ratio
Category | Document contains “Oxford” | Does not contain “Oxford”
Men’s Shoes | 738 | 29689
Not Men’s Shoes | 808 | 415988
Total | 1546 | 445677
$$RR = \frac{738/1546}{29689/445677} = 7.165912894$$
Data from Pollyanna
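Both risk ratios above follow the same arithmetic: the rate of the outcome in the group with the attribute, divided by the rate in the group without it. A minimal sketch follows, assuming the standard cohort-study definition; the helper name `risk_ratio` is hypothetical.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    exposed_risk = exposed_cases / exposed_total
    unexposed_risk = unexposed_cases / unexposed_total
    return exposed_risk / unexposed_risk


# Medical example: CHF among patients with high systolic BP vs. normal BP.
rr_chf = risk_ratio(400, 1500, 400, 3000)

# Pollyanna example: Men's Shoes among documents with vs. without "Oxford".
rr_oxford = risk_ratio(738, 1546, 29689, 445677)

print(f"RR (CHF, high systolic BP) = {rr_chf:.4f}")
print(f"RR (Men's Shoes, 'Oxford') = {rr_oxford:.6f}")
```

A ratio well above 1, as in the “Oxford” case, indicates that the word is strongly associated with the category.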
Pollyanna is a Linear Classifier
• If the input feature vector to the classifier is a real vector x, then the output score is
$$y = f(\vec{w} \cdot \vec{x}) = f\Big(\sum_j w_j x_j\Big)$$
• where w is a real vector of weights and f is a function that converts the scalar product of the two vectors into the desired output.
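To make the linear function concrete, here is a minimal sketch of scoring a page against several categories and keeping the top 3. The weights are the ‘acrylic’ values from the one-dimensional relationship example earlier in this deck. Treating x as a 0/1 word-presence vector and f as the identity are assumptions made for illustration, as are the function names; the deck does not state Pollyanna's actual choices.

```python
# Word-to-category weights copied from the 'acrylic' example table.
WEIGHTS = {
    "Men's Hats": {"acrylic": 0.950338803},
    "Men's Socks": {"acrylic": 0.944220332},
    "Men's Sweaters / Vests": {"acrylic": 0.061613565},
    "Miscellaneous Men's Accessories": {"acrylic": 0.002798075},
    "Miscellaneous Women's Accessories": {"acrylic": 0.001157465},
    "Women's Hats": {"acrylic": 0.772611278},
    "Women's Socks & Hosiery": {"acrylic": 0.442448187},
}


def score(category_weights, words):
    """Scalar product w . x, where x_j is 1 if word j is present, else 0."""
    return sum(w for word, w in category_weights.items() if word in words)


def top_categories(weights, text, k=3):
    """Return the k categories with the highest linear score for the text."""
    words = set(text.lower().split())
    scores = {category: score(w, words) for category, w in weights.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(top_categories(WEIGHTS, "Acrylic knit beanie"))
```

With only one weighted word the ranking simply mirrors the weight table; a real page would contribute many words, and their weights would be summed per category.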
Solution Statement
• Pollyanna is a Machine Learning System that uses new processes, inputs and statistical theories
• It provides highly accurate automated classification (87% ± 3%)
• Other classification algorithms in the e-commerce space depend on retailers’ data-feeds, are less accurate (approximately 65%), and are supported by manual classification
• We have assembled a highly accurate, cost-effective classification system that does not require ongoing manual support
Pollyanna Demo
Architecture
[Architecture diagram: User/Client, Front End Tool, Internet Cloud, Sampling, Statistical Elimination, Statistical Validation, Knowledge, Training Module; components labeled Perl]
Pollyanna can be applied to predictive analytics in online payment fraud
If the following conditions are met:
The problem must be clearly defined in terms of:
• Input
– Type of data: Integer, String, Floating, Boolean
– File type: XML, Delimited, Database
• Process
– Human intelligence and any other methods, procedures required for arriving at a decision, prediction or forecast
• Output
– All possible decisions/outcomes
Examples:
• Bucketing a transaction into fraud risk category
• Forecasting fraud losses on completed transactions
Historical data should be available
• Reliable data
– Data is sufficiently complete and error free
• Valid data
– Data actually represents what you think is being measured
• Sufficient data
– Data is adequate to support the outcome of the process or the decision
• Spatial data
• Time series data
Data should yield binomial probability distribution for each attribute
• Example
• A key attribute of an online transaction is the location of the “IP” address and the location of the physical address of the credit card holder:
– Two outcomes are possible for the above attribute
• The “IP” address and the physical address are located geographically in the same country
• The “IP” address and the physical address are not located geographically in the same country
• The Machine Learning system is supplied 200 online payment transactions received in the previous year.
• The machine learning system should be able to determine, for each possible outcome, the number of Yes or No events observed
– Example: For the outcome “The IP address and the physical address are located geographically in the same country” – 20 Yes and 180 No
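Counting Yes and No events for an attribute can be sketched as follows. The four transaction records and the field names `ip_country` and `card_country` are made up for illustration; they are assumptions, not a format used by Pollyanna.

```python
from collections import Counter

# Hypothetical transaction records (illustrative only).
transactions = [
    {"ip_country": "US", "card_country": "US"},
    {"ip_country": "IN", "card_country": "US"},
    {"ip_country": "GB", "card_country": "GB"},
    {"ip_country": "US", "card_country": "CA"},
]


def same_country(txn):
    """True when the IP address and the cardholder address share a country."""
    return txn["ip_country"] == txn["card_country"]


# Each transaction yields exactly one of two outcomes for this attribute.
counts = Counter("Yes" if same_country(t) else "No" for t in transactions)

print(f"Same country: {counts['Yes']} Yes, {counts['No']} No")
```

Because every transaction falls into exactly one of the two outcomes, the Yes and No counts always sum to the number of transactions supplied.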
How will the support vector be calculated in the context of online payment transaction
Illustration with a hypothetical case
Support Vector Computation Example
• To simplify the problem let us say that every transaction has to be bucketed into one of the two classes:
– A genuine transaction
– A fraudulent transaction
• The training module’s goal is to calculate the relationship between an attribute of a transaction and each of the classes mentioned above. This relationship is the ‘Support Vector’
Support Vector Computation Example
• The training module is supplied with 200 sample transactions (historical data) representing the population
• Of the 200 transactions 20 are fraudulent and 180 are genuine
• A key attribute of the transaction is: The IP address and the physical address of the credit card holder are not located geographically in the same country. Of the 200 transactions 40 had the above attribute and 160 did not have the above attribute. Let us call the above attribute ‘X’.
• The training module will analyze the data and arrive at the following matrix:
Class | Attribute ‘X’ observed | Attribute ‘X’ not observed | Total
Fraudulent | 15 | 5 | 20
Not Fraudulent | 25 | 155 | 180
Total | 40 | 160 | 200
Association between Fraudulent Transaction and Attribute ‘X’
Support Vector Computation Example
• Applying a synthesis of theories in probability and statistics the support vector is calculated as 4.040816
• The support vector is a measure of the relationship between a Fraudulent Transaction and the attribute: “The IP address and the physical address of the credit card holder are not located geographically in the same country”.
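The deck does not give the formula behind the 4.040816 figure, so the sketch below does not attempt to reproduce it. It only shows, under that stated limitation, how the two measures introduced earlier (conditional probability and risk ratio) would be computed from the hypothetical fraud table. All names in the code are illustrative assumptions.

```python
# Counts copied from the hypothetical 2x2 table for attribute 'X'.
table = {
    "fraudulent": {"x_observed": 15, "x_not_observed": 5},
    "not_fraudulent": {"x_observed": 25, "x_not_observed": 155},
}

# Column totals: transactions with and without attribute 'X'.
x_total = sum(row["x_observed"] for row in table.values())
not_x_total = sum(row["x_not_observed"] for row in table.values())

# Conditional probability of fraud given each value of the attribute.
p_fraud_given_x = table["fraudulent"]["x_observed"] / x_total
p_fraud_given_not_x = table["fraudulent"]["x_not_observed"] / not_x_total

# Risk ratio: how much more likely fraud is when 'X' is observed.
# NOTE: this is NOT the 4.040816 coefficient, whose formula is not published here.
fraud_risk_ratio = p_fraud_given_x / p_fraud_given_not_x

print(f"P(fraud | X)     = {p_fraud_given_x}")
print(f"P(fraud | not X) = {p_fraud_given_not_x}")
print(f"Risk ratio       = {fraud_risk_ratio}")
```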
How will the machine learning system forecast fraud loss
Illustration with a hypothetical case
Forecasting Fraud Loss
• The problem:
– To forecast the value of losses on all fraudulent credit card payment transactions that have been successfully executed in a given month
• There are two steps to doing this:
– Step 1: Determine whether each transaction is fraudulent or not based on attributes of the transaction
– Step 2: Sum the values of the fraudulent transactions to arrive at the forecast of loss for that month
Forecasting Fraud Loss – Step 1: Determining whether a transaction is fraudulent or not
• Let us hypothetically say that there are two outcomes for each transaction: either it is a Fraudulent transaction or it is a Genuine transaction.
• For each outcome the following linear function is applied:
$$y = f(\vec{w} \cdot \vec{x}) = f\Big(\sum_j w_j x_j\Big)$$
• Refer to the slide “Pollyanna is a Linear Classifier” for a brief explanation of the function
Forecasting Fraud Loss – Step 1: Determining whether a transaction is fraudulent or not
• So the linear function is applied for the observed attributes in the transaction (X vector) weighted by the Support Vector (W vector) calculated in the training module
• For our example there are 2 outcomes for each transaction – Fraudulent or Genuine
• For every transaction, the linear function gives the values for both the outcomes and the prediction will be in favor of the outcome with the higher value
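Step 1 can be sketched as scoring each transaction once per outcome and picking the larger score. The only weight taken from this deck is 4.040816 (fraudulent, country mismatch). Every other weight, the second attribute, and all names are hypothetical values chosen for illustration, and f is assumed to be the identity.

```python
# Attribute order in the X vector: [country_mismatch, country_match].
# 4.040816 comes from the Support Vector Computation Example;
# the remaining weights are hypothetical placeholders.
WEIGHTS = {
    "fraudulent": [4.040816, 0.2],
    "genuine": [0.5, 1.5],
}


def linear_score(weights, x):
    """Scalar product of the weight vector W and the attribute vector X."""
    return sum(w * xi for w, xi in zip(weights, x))


def predict(x):
    """Return the outcome whose linear score is highest for this transaction."""
    return max(WEIGHTS, key=lambda outcome: linear_score(WEIGHTS[outcome], x))


mismatch_txn = [1, 0]   # IP and cardholder address in different countries
match_txn = [0, 1]      # IP and cardholder address in the same country

print(predict(mismatch_txn))
print(predict(match_txn))
```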
Forecasting Fraud Loss – Step 2: Summing the values of all fraudulent transactions
• Step 1 is performed on each transaction in a given period
• The values of fraudulent transactions are totaled to arrive at the forecast of losses due to fraud in a given period
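Step 2 is a filtered sum. A minimal sketch follows, assuming Step 1 has already attached a predicted label to each transaction; the amounts and labels below are made-up illustrative data.

```python
# (transaction value, label predicted in Step 1) - hypothetical data.
scored_transactions = [
    (120.00, "fraudulent"),
    (35.50, "genuine"),
    (410.25, "fraudulent"),
    (89.99, "genuine"),
]


def forecast_fraud_loss(transactions):
    """Total the values of transactions predicted to be fraudulent."""
    return sum(value for value, label in transactions if label == "fraudulent")


loss = forecast_fraud_loss(scored_transactions)
print(f"Forecast fraud loss for the period: {loss:.2f}")
```

The forecast is only as good as the Step 1 predictions, since any misclassified transaction is added to or dropped from the total in full.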
All the details contained in the examples above are imaginary. They serve only for the purpose of understanding the system and its application to the field of fraud analytics.
Benefits of adopting Pollyanna
• For a process that is currently supported by human intelligence, Pollyanna may confer a cost saving benefit ranging from 40% to 80% from reduction of human resources
• For a process that is already automated or uses machine intelligence, Pollyanna may bring efficiency or accuracy improvement ranging from 10% to 25%
“Any sufficiently advanced technology is indistinguishable from magic”
Sir Arthur C. Clarke
Thank You
Contact Details
• PG Vijay (Consultant – Machine Learning Systems)
– Mobile: +91 98418 21167
– E-Mail: [email protected]
– LinkedIn Public Profile: http://www.linkedin.com/in/machinelearning