Prediction of Grupo Bimbo Inventory Demand
by Jinzhong Zhang 90620
Nikita Sonthalia 89679
Team: Bimbo Kagglers
July 30, 2014
San Jose
Acknowledgments
Thanks to:
• the Bimbo Group for raising this problem and providing funding;
• the www.kaggle.com platform for holding this competition;
• Prof. Wang for organizing this project.
Jinzhong Zhang & Nikita Sonthalia, September 1, 2016
Preface
This report describes a general business problem: predicting the demand in a future week given the demands in the past weeks. It reduces the exhausting calculation of traditional co-occurrence into a map-reduce problem, and then uses a general machine learning algorithm to make the prediction. The cutting-edge tool Spark is used to perform the map-reduce calculation. The results show that (number) predictions can be finished in (time) with (percent) accuracy.
Table of Contents
Acknowledgments
Preface
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Objective
1.2 What is the problem
1.3 Why this is a project related to this class
1.4 Why other approach is not good
1.5 Why our approach should be better
1.6 Statement of the problem
1.7 Scope of investigation

Chapter 2 Theoretical Bases and Literature Review
2.1 Definition of the problem
2.2 Theoretical background of the problem
2.3 Related research to solve the problem
2.4 Advantage/disadvantage of those research
2.5 Our solution to solve this problem
2.6 Where your solution different from others
2.7 Why your solution is better

Chapter 3 Hypothesis
3.1 Hypothesis
3.2 Positive/negative hypothesis

Chapter 4 Methodology
4.1 How to generate/collect input data?
4.2 How to solve the problem?
4.3 How to generate output?
4.4 How to test against hypothesis?
4.5 How to prove correctness

Chapter 5 Implementation
5.1 Split the training data week by week
5.2 The collection of user behavior
5.3 The calculation of co-occurrence matrix of products and depots
5.4 The calculation of popularity of each product and each depot
5.5 Summary to the Analytic Based Table (ABT)
5.6 Build the Predictive Model
5.7 Make Predictions

Chapter 6 Data Analysis and Discussion
6.1 Output Generation
6.2 Output Analysis
6.3 Compare output against hypothesis
6.4 Discussion

Chapter 7 Conclusion and Recommendation
7.1 Summary and conclusions
7.2 Recommendations for future studies

Bibliography

Appendix A Flowchart

Appendix B Code
B.1 Split the training data week by week (Bash)
B.2 The collection of user behavior (Scala - Spark)
B.3 The calculation of co-occurrence matrix of products and depots (Scala - Spark)
B.4 The calculation of popularity of each product and each depot (Python - Spark)
B.5 Summary to the Analytic Based Table (Python - Spark)
B.6 Build the model and calculate the validation accuracy
B.7 Make Predictions
List of Figures
6.1 Our Kaggle RMSLE for the public and private data sets
6.2 Rank and private RMSLE of the winners, our team (highlighted), and the sample. The sample submission is just a constant prediction of demand at 7.
A.1 Flowchart of the procedure: from the original database dump to the prediction.
List of Tables
4.1 Training Table
4.2 Test Table
4.3 Sample Submission
4.4 Client Table
4.5 Product Table
4.6 Town Table
Chapter 1
Introduction
1.1 Objective
Grupo Bimbo, S.A.B. de C.V., known as Bimbo, is a Mexican multinational bakery product manufacturing company headquartered in Mexico City, Mexico. It is the world's largest baking company. Grupo Bimbo strives to meet daily consumer demand for fresh bakery products on the shelves of over 1 million stores along its 45,000 routes across Mexico.
Currently, daily inventory calculations are performed by direct delivery sales em-
ployees who must single-handedly predict the forces of supply, demand, and hunger
based on their personal experiences with each store. With some breads carrying a
one week shelf life, the acceptable margin for error is small.
Grupo Bimbo invites Kagglers to develop a model to accurately forecast inventory
demand based on historical sales data.[1]
1.2 What is the problem
The dataset consists of 9 weeks of sales transactions in Mexico. Every week, there
are delivery trucks that deliver products to the vendors. Each transaction consists
of sales and returns. Returns are the products that are unsold or expired. The demand for a product in a certain week is defined as the sales this week minus the returns next week. [1]
We will forecast the demand of a product for a given week (10th or 11th week),
at a particular store.
1.3 Why this is a project related to this class
The techniques that we learned in class can be applied to this topic, e.g., the co-occurrence matrix, decision trees, k-nearest neighbors (kNN), Bayesian methods, and support vector machines.
1.4 Why other approach is not good
In business-to-customer (B2C) problems, previous approaches most commonly use the co-occurrence matrix for customer recommendations [2]. The usage of the co-occurrence matrix is rarely seen in demand prediction problems.
In demand prediction problems, previous approaches [3, 4] mainly use past data to predict demand in bulk with time-sequential prediction models. Those approaches address business-to-business (B2B) problems. B2C problems are more fickle, more sophisticated, and less predictable, and the data size of B2C is several orders of magnitude larger than that of B2B, which makes the problem even harder.
1.5 Why our approach should be better
Our approach combines the co-occurrence matrix and probability-based machine learning for the B2C problem. It considers both the co-occurrence among products and stores and the time series. Since the algorithm extracts the most important information from the data with little loss, it should perform well.
1.6 Statement of the problem
The week-to-week account of transactions over 9 weeks is given. The information in each transaction includes the store ID, the product ID, the customer ID, and the amounts the customer bought and returned. Our task is to predict the demand in the 10th or the 11th week. The demand equals the amount that the customer buys minus the amount that the customer returns in the coming week. Supplementary information is also given, which includes the store locations, the product names and weights, and the customer names. Details of the data are shown in Chapter 4.
1.7 Scope of investigation
We are going to investigate the correlations among products and depots in the past n weeks. The co-occurrence matrices will be used as a look-up table while building the analytic based table (ABT). The features of the analytic based table describe the querying customer's behavior in the past n weeks. The targets are the demands of the querying customer in the (n+1)th week and the (n+2)th week. Models M(x_{1..n}, n+1) and M(x_{1..n}, n+2) will be learned by the computer to make the prediction.
Chapter 2
Theoretical Bases and Literature
Review
2.1 Definition of the problem
The week-to-week account of transactions over 9 weeks is given. The information in each transaction includes the store ID, the product ID, the customer ID, and the amounts the customer bought and returned. Our task is to predict the demand in the 10th or the 11th week. The demand equals the amount that the customer buys minus the amount that the customer returns in the coming week. Supplementary information is also given, which includes the store locations, the product names and weights, and the customer names. Details of the data are shown in Chapter 4.
2.2 Theoretical background of the problem
The sequential supervised learning problem can be formulated as follows. Let {(x_i, y_i)}_{i=1}^{N} be a set of N training examples. Each example is a pair of sequences (x_i, y_i), where x_i = (x_{i,1}, x_{i,2}, ..., x_{i,T_i}) and y_i = (y_{i,1}, y_{i,2}, ..., y_{i,T_i}). The goal is to construct a classifier M that can correctly predict a new label sequence y = M(x) given an input sequence x [5].
This problem is a kind of sequential supervised learning problem. The following
algorithms are generally used for the sequential supervised learning problems:
• Sliding Window,
• Recurrent Sliding Window,
• Hidden Markov Models,
• Conditional Random Fields,
• Neural Networks.
Our approach is a combination of co-occurrence analysis and sliding-window analysis. The final predictive model is a gradient boosted regression tree built with the XGBoost algorithm [6]. In short, each regression tree partitions the input space into J disjoint regions R_{1m}, ..., R_{Jm} and predicts a constant value in each region. A loss function is then calculated and a gradient descent step is applied to it.
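As background, here is a minimal sketch of the regularized boosting objective in LaTeX notation (it roughly follows the XGBoost paper [6]; T denotes the number of leaves of a tree and w its vector of leaf values, which are our notation, not symbols from this report):

\hat{y}_i = \sum_{m=1}^{M} f_m(x_i), \qquad f_m \in \mathcal{F}

\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{m=1}^{M} \Omega(f_m),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}

Each new tree is fit to the gradients of this loss, which is the gradient descent step mentioned above.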
2.3 Related research to solve the problem
In business-to-customer (B2C) problems, previous approaches most commonly use the co-occurrence matrix for customer recommendations [2].
In business-to-business (B2B) demand prediction problems, previous approaches [3, 4] mainly use past data to predict demand in bulk with time-sequential prediction models.
2.4 Advantage/disadvantage of those research
B2C problems are more fickle, more sophisticated, and less predictable, and the data size of B2C is several orders of magnitude larger than that of B2B, which makes the problem even harder. The use cases of the previous approaches are limited, especially when facing store-by-store and time-sensitive demand problems: the computation scale is too large to make a profitable prediction. Thus, a new workflow and new algorithms are necessary.
2.5 Our solution to solve this problem
We are going to investigate the correlations among products and depots in the past n weeks. The co-occurrence matrices will be used as a look-up table while building the analytic based table (ABT). The features of the analytic based table describe the querying customer's behavior in the past n weeks. The targets are the demands of the querying customer in the (n+1)th week and the (n+2)th week. Models M(x_{1..n}, n+1) and M(x_{1..n}, n+2) will be learned by the computer to make the prediction.
2.6 Where your solution different from others
Our approach is a combination of co-occurrence analysis and sliding-window analysis (Section 2.2), followed by a gradient boosted regression tree, rather than a single sequential model such as a hidden Markov model.
2.7 Why your solution is better
Our approach combines the co-occurrence matrix and probability-based machine learning for the B2C problem. It considers both the co-occurrence among products and stores and the time series. Since the algorithm extracts the most important information from the data with little loss, it should perform well. Our model builder (XGBoost) is also among the most successful tools on Kaggle. As claimed by the XGBoost authors, "Among the 29 challenge winning solutions published at Kaggle's blog during 2015, 17 solutions used XGBoost" [6].
It does not need to perform an exhausting relational search for each query. Instead of searches, it uses a model to make predictions, which makes the calculation fast.
Chapter 3
Hypothesis
3.1 Hypothesis
We assume that the behavior of a customer buying a certain product at a particular depot only relates to his/her behavior in the past n weeks. In this report, we try n = 3. Under this hypothesis, we generate a machine-learning model from the training dataset and then use the test dataset to predict the output. For example:
If customer A demands k pieces of product B at depot C in the 4th week, his/her behavior in week 1 to week 3 becomes the features, and the demand k is the target.
Particularly, the features of the analytic based table include:
================1st week====================
1. Demand of product B in the 1st week,
2. Demand of the product most related to product B in the 1st week, times the co-occurrence weight,
3. Demand of the product second most related to product B in the 1st week, times the co-occurrence weight,
4. Demand of the product third most related to product B in the 1st week, times the co-occurrence weight,
5. Demand at depot C in the 1st week,
6. Demand at the most related depot in the 1st week, times the co-occurrence weight,
================2nd week====================
7. Demand of product B in the 2nd week,
8. Demand of the product most related to product B in the 2nd week, times the co-occurrence weight,
9. Demand of the product second most related to product B in the 2nd week, times the co-occurrence weight,
10. Demand of the product third most related to product B in the 2nd week, times the co-occurrence weight,
11. Demand at depot C in the 2nd week,
12. Demand at the most related depot in the 2nd week, times the co-occurrence weight,
================3rd week====================
13. Demand of product B in the 3rd week,
14. Demand of the product most related to product B in the 3rd week, times the co-occurrence weight,
15. Demand of the product second most related to product B in the 3rd week, times the co-occurrence weight,
16. Demand of the product third most related to product B in the 3rd week, times the co-occurrence weight,
17. Demand at depot C in the 3rd week,
18. Demand at the most related depot in the 3rd week, times the co-occurrence weight,
================General Info====================
19. The popularity of product B (how many units of product B were sold from the 1st week to the 3rd week),
20. The popularity of depot C (how many units were sold at depot C from the 1st week to the 3rd week),
================Target====================
21. Target: the demand for the product at the depot in the (n+1)th week.
We consider only the single most related depot because we found that people rarely went to more than two depots in the same month. We consider three related products because we want to limit the number of features and save computation; the number of related products ("three") can be adjusted.
Here is an example of a sample in the order mentioned above with target demand
1:
8, 3222510, 2990304, 2930552, 167, 1026716, 2, 3488688, 2564233, 1119980, 128,
786944, 3, 2578008, 1495704, 1351815, 134, 823832, 1495100, 382391, 1.
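To make the layout concrete, here is a minimal Python sketch that labels the 21 values of the sample above; the field names are our own shorthand, not part of the dataset:

FEATURE_NAMES = [
    # 1st week
    "w1_product_demand", "w1_rel_product_1", "w1_rel_product_2", "w1_rel_product_3",
    "w1_depot_demand", "w1_rel_depot_1",
    # 2nd week
    "w2_product_demand", "w2_rel_product_1", "w2_rel_product_2", "w2_rel_product_3",
    "w2_depot_demand", "w2_rel_depot_1",
    # 3rd week
    "w3_product_demand", "w3_rel_product_1", "w3_rel_product_2", "w3_rel_product_3",
    "w3_depot_demand", "w3_rel_depot_1",
    # general info and target
    "product_popularity", "depot_popularity", "target_demand",
]

sample = [8, 3222510, 2990304, 2930552, 167, 1026716,
          2, 3488688, 2564233, 1119980, 128, 786944,
          3, 2578008, 1495704, 1351815, 134, 823832,
          1495100, 382391, 1]

for name, value in zip(FEATURE_NAMES, sample):
    print("{0:>20s}: {1}".format(name, value))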
3.2 Positive/negative hypothesis
This is a regression problem, so the error and R² will be calculated to accept or reject the hypothesis.
Chapter 4
Methodology
4.1 How to generate/collect input data?
We got the data from the Kaggle site, where the company has uploaded its data. The data contain the following tables:
Week  Sales Depot ID  Sales Channel ID  Route ID  Client ID  Product ID  Sales units this week  Sales this week (pesos)  Return units next week  Returns next week (pesos)  Adjusted Demand
3     1110            7                 3301      15766      1212        3                      25.14                    0                       0.0                        3
3     1110            7                 3301      15766      1216        4                      33.52                    0                       0.0                        4
3     1110            7                 3301      15766      1238        4                      39.32                    0                       0.0                        4
3     1110            7                 3301      15766      1240        4                      33.52                    0                       0.0                        4
3     1110            7                 3301      15766      1242        3                      22.92                    0                       0.0                        3
Table 4.1: Training Table
To simplify the problem, we used the training table (Table 4.1) only.
ID week Sales Depot ID Sales Channel ID Route ID Client ID Product ID
0 11 4037 1 2209 4639078 35305
1 11 2237 1 1226 4705135 1238
2 10 2045 1 2831 4549769 32940
3 11 1227 1 4448 4717855 43066
4 11 1219 1 1130 966351 1277
Table 4.2: Test Table
ID Demand (to be predicted)
0 ?
1 ?
2 ?
Table 4.3: Sample Submission
Cliente ID Name of the client
0 SIN NOMBRE
1 OXXO XINANTECATL
2 SIN NOMBRE
3 EL MORENO
Table 4.4: Client Table
Product ID Name of the product
0 NO IDENTIFICADO 0
9 Capuccino Moka 750g NES 9
41 Bimbollos Ext sAjonjoli 6p 480g BIM 41
53 Burritos Sincro 170g CU LON 53
Table 4.5: Product Table
Depot ID Town State
1110 2008 AG. LAGO FILT MÉXICO, D.F.
1111 2002 AG. AZCAPOTZALCO MÉXICO, D.F.
1112 2004 AG. CUAUTITLAN ESTADO DE MÉXICO
1113 2008 AG. LAGO FILT MÉXICO, D.F.
1114 2029 AG.IZTAPALAPA 2 MÉXICO, D.F.
Table 4.6: Town Table
4.2 How to solve the problem?
• We used the training dataset to build the learning model week by week.
• From the training dataset, we used probability to find the predicted demand.
4.2.1 Algorithm Design
• We consider (past n weeks, p related products, d related depots):
• First try n=3, p=3, d=1:
• Procedure:
1. split data into weeks
2. generate the training data:
(a) Target: in the (n+1)th week, the demand for the product for the (person, depot) pair,
(b) Find the p most related products from the 1st to the nth weeks' data, instead of calculating the entire co-occurrence matrix,
(c) Find the d most related depots in the same way.
3. Features:
(a) In weeks 1 to n, the demand of the product for the (person, depot) pair: n features,
(b) the demand of the p related products in the past n weeks: n*p features,
(c) the demand of the d related depots: n*d features,
(d) the popularity of the product and the depot.
4. Train it by gradient boosted regression tree.
4.2.2 Language used
• Python, Scala
4.2.3 Tools used
• Bash Script, Python Libraries, Spark [7], Scikit-Learn [8], XGBOOST [6]
4.3 How to generate output?
The test dataset has rows for which we have to generate the demand. We take each row as a unique query, find the related data in the training dataset, and split it week by week to build features in the same way as for training; the training data produce the learning model. That learning model is then used to generate the output.
4.4 How to test against hypothesis?
1. Currently, we assume that the demand of the current week only relates to the previous 3 weeks. This number could be adjusted from 1 to 7. By calculating R² = 1 − (residual sum of squares) / (total sum of squares), we can find out which hypothesis is the best;
2. Currently, we assume that the demand does not relate to the route of the depot and the channel of the depot. That information could be added to the model. Again, R² is the benchmark.
4.5 How to prove correctness
Calculate R². It should be above 0 and as close to 1 as possible.
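As a minimal sketch of this check (toy numbers; scikit-learn's r2_score implements the definition given in Section 4.4):

from sklearn.metrics import r2_score

actual = [3, 4, 4, 4, 3, 10, 0, 7]      # toy true demands
predicted = [2, 4, 5, 4, 3, 8, 1, 6]    # toy predicted demands

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
print("R2 =", r2_score(actual, predicted))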
Chapter 5
Implementation
In the first step, we split the training data into weekly training data. This part is
implemented by a simple Bash script. In the second step, from the weekly training
data, for each week, we collect each user’s behavior, calculate the co-occurrence
matrix for products and depots, and calculate the popularities of the products and
depots. This part is implemented using Spark 2.0 [7] with Scala and Python interface.
In the third step, we build the analytic based table (ABT) from the results provided
above. This part is implemented using Spark 2.0 with Python interface. In the fourth
step, we split the ABT for training/validation/test and calculate the performance of
the hypothesis. Appendix A shows the flowchart.
5.1 Split the training data week by week
It is implemented by a simple Bash script, as shown in Appendix B.1. It takes several
hours to finish.
5.2 The collection of user behavior
We use the user ID as the key and apply Spark groupByKey to collect a given user's information in one week into one row. This reformatting of the data lets the following programs query quickly by the user ID hash.
This part of code is implemented by Scala. They are in Appendix B.2. It takes
several minutes.
5.3 The calculation of co-occurrence matrix of prod-
ucts and depots
It calculates the co-occurrence matrix of the products and depots, respectively. Par-
ticularly, if one user bought k product A and l product B in a week, the addition to
the co-occurrence of A and B is the product of the demand of A and the demand of
B in that week.
This part of code is implemented by Scala. They are in Appendix B.3.
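A minimal plain-Python sketch of this weighting rule on a toy weekly basket (the distributed Spark version is in Appendix B.3):

from itertools import combinations

# One user's weekly basket: product ID -> demand in that week (toy data).
basket = {1212: 3, 1216: 4, 1240: 2}

cooccurrence = {}  # (product_A, product_B) -> accumulated co-occurrence weight
for (prod_a, dem_a), (prod_b, dem_b) in combinations(sorted(basket.items()), 2):
    # The weight added for this user and week is the product of the two demands.
    key = (prod_a, prod_b)
    cooccurrence[key] = cooccurrence.get(key, 0) + dem_a * dem_b

print(cooccurrence)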
5.4 The calculation of popularity of each product
and each depot
It calculates how popular a product is and how popular a depot is. Particularly, the popularity of a product in a week is the number of units of this product sold in that week. The popularity of a depot is the total number of units sold at this depot in that week.
This part of code is implemented by Python. They are in Appendix B.4.
5.5 Summary to the Analytic Based Table (ABT)
It reads the collection of each user’s behavior, the co-occurrence matrix for products
and depots, and the popularities of the products and depots into memory. From
those hashed look-up tables (up to 2.0 GB in memory), this builder summarizes the
behavior of the customer in the previous weeks (we try 3 weeks).
This part of code is implemented by Python. They are in Appendix B.5.
5.6 Build the Predictive Model
In this analysis, we use the 4th week (week 6 in the original train.csv) demand as
the training sample and use the 5th week (week 7 in the original train.csv) demand
as the validation data. The test data is held on Kaggle.com and is hidden from the
user.
First, we used MinMaxScaler to normalize the data because of those huge weight
numbers.
Second, we feed the normalized data into the XGBoost model maker. The parameters are set as follows: the maximum depth of the tree is 4 ('bst:max_depth':4), the learning rate is 0.3 ('eta':0.3), the objective is linear regression, and the loss function is root mean squared error ('eval_metric':'rmse').
This part of code is implemented by Python, using Scikit-learn and XGBOOST.
They are in Appendix B.6.
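As a minimal, self-contained sketch of such an xgboost.train call with the parameters quoted above (X and y are hypothetical placeholders for the scaled ABT; the full script in Appendix B.6 uses slightly different settings):

import numpy as np
import xgboost as xgb

X = np.random.rand(200, 20)       # placeholder for the normalized ABT features
y = np.random.rand(200) * 10      # placeholder for the demand targets

params = {
    "max_depth": 4,               # maximum tree depth
    "eta": 0.3,                   # learning rate
    "objective": "reg:linear",    # linear regression (called reg:squarederror in newer XGBoost)
    "eval_metric": "rmse",        # root mean squared error as the loss metric
}
model = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=10)
print(model.predict(xgb.DMatrix(X[:5])))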
5.7 Make Predictions
It uses the scaler and model generated in the last step to predict the future behavior.
This part of code is implemented by Python, using Scikit-learn and XGBOOST.
They are in Appendix B.7.
Chapter 6
Data Analysis and Discussion
6.1 Output Generation
Using the model described in the previous chapter, we take the fifth-week data as the test sample. Applying the same scaler as for the training sample, we normalize the features of the test sample. The model is then applied to the normalized features of the test sample. The code can be seen in Appendix B.6.
6.2 Output Analysis
The following benchmarks have been calculated:
1. R²: 0.59
2. Root mean squared error: 13.9
3. Root mean squared percentage error: 124.6%
4. 50th percentile of the percentage error: 50%
5. 75th percentile of the percentage error: 86.4%
6. 90th percentile of the percentage error: 200%
7. 99.5th percentile of the percentage error: 400%
In the percentage error calculation, we applied a strict rule: if the true value is 0 but the prediction is not, we considered it a 100% error. The code is shown in Appendix B.6.
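A minimal sketch of this strict rule on toy arrays (the full evaluation code is in Appendix B.6):

actual = [3, 0, 4, 10, 0]
predicted = [2, 1, 4, 5, 0]

errors = []
for a, p in zip(actual, predicted):
    if a > 0:
        errors.append(abs(a - p) / a)          # ordinary percentage error
    else:
        errors.append(1.0 if p > 0 else 0.0)   # strict rule: 100% error when the truth is 0
errors.sort()
print("median percentage error:", errors[len(errors) // 2])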
6.3 Compare output against hypothesis
This indicates that, by using our model, the company will be able to predict half of the customers' behavior within 50% error and almost all (99.5%) of the customers' behavior within 400% error. It shows that our hypothesis is positive but can be further tuned.
There are some special cases:
1. If the user is new, the prediction of demand will be made purely by the popu-
larity of the product and depot,
2. If the product is new, the prediction of demand will be made purely by the
popularity of the depot. However, this may not be suitable. We have not
further tuned this.
3. If the depot is new, the prediction of demand will be made purely by the
popularity of the product. We have not further tuned this.
4. If both product and depot are new, all features in the ABT will be 0. This
case is unpredictable.
6.4 Discussion
In the Kaggle challenge, Bimbo provides the sales information in week 3, 4, 5, 6, 7,
8, and 9. Bimbo asks challengers to predict the demand in week 10 and week 11.
Because we try to include as much data as possible in training, we used the previous 5 weeks as historical data. Bimbo asks the challengers to predict the demand of the next week and of the week after next, so two training samples are made. One training sample contains the week 3, 4, 5, 6, and 7 historical data as features and the week 8 demand as the target. The other sample contains the same features but the week 9 demand as the target. In this way, two models are trained: one predicts the next week's demand while the other predicts the demand of the week after next.
For the week 10 queries, we take the relevant information in weeks 5, 6, 7, 8, and 9 as features and use the first model to make the prediction. For the week 11 queries, we use the same features but the other model. Because the demand should always be greater than or equal to zero, we forced all negative predictions to zero. A negative regression output occurs only about once in 7 million predictions, which is not a big problem.
After submission, Kaggle evaluates the competition metric, the Root Mean Squared Logarithmic Error (RMSLE) [9].
The RMSLE is calculated as

ε = sqrt( (1/n) · Σ_{i=1}^{n} (log(p_i + 1) − log(a_i + 1))² )

where:
• ε is the RMSLE value (score),
• n is the total number of observations in the (public/private) data set,
• p_i is the prediction of demand,
• a_i is the actual demand for observation i, and
• log(x) is the natural logarithm of x.
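A minimal sketch of this metric in Python (toy numbers, not our submission code):

import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error, as defined above."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((np.log1p(p) - np.log1p(a)) ** 2))

print(rmsle([3, 0, 7, 2], [4, 0, 6, 2]))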
Figure 6.1 shows the scores we got in two submissions. We made the tree deeper
after the challenge deadline but did not get a better result.
Figure 6.1: Our Kaggle RMSLE for the public and private data sets
The public score is what we receive back upon each submission. The public score is determined from only a fraction of the test data set, usually between 25% and 33%. When the competition ends, Kaggle takes our selected submissions and scores the predictions against the remaining fraction of the test set, which gives the private score. Final competition results are based on the private score [10].
Figure 6.2 shows the ranks and the private scores of the winners, our team, and the sample submission.
We ranked 1263rd among 1969 competitors. The full private scoreboard can be seen at https://www.kaggle.com/c/grupo-bimbo-inventory-demand/leaderboard/private.
Future adjustments are discussed in Section 7.2.
Figure 6.2: Rank and private RMSLE of the winners, our team (highlighted), and the sample. The sample submission is just a constant prediction of demand at 7.
Chapter 7
Conclusion and Recommendation
7.1 Summary and conclusions
We extracted the current demand data and the historical data of the previous 3 weeks as the training sample. The model predicted the behavior of the customers in the next week. The R² of the demand prediction is 0.59 and the root mean squared error is 13.9. Because some customers have very large demands, we also calculated the root mean squared percentage error, which is 124.6%. The R² and the root mean squared percentage error indicate that it is a useful prediction.
7.2 Recommendations for future studies
1. We assumed that the demand of the current week only relates to the previous 3 weeks. This number could be adjusted from 1 to 7. By calculating R² = 1 − (residual sum of squares) / (total sum of squares), we could find out which hypothesis is the best;
2. We assumed that the demand does not relate to the route of the depot and the channel of the depot. That information could be added to the model.
3. We did not clean the cases where one person registered multiple user IDs. These should be cleaned.
4. Fine-tune the special cases.
5. We considered the co-occurrence of the depots but not their state and town; the state and town could be considered as well.
6. Try different models.
Bibliography
[1] B. Group, Maximize sales and minimize returns of bakery goods,
Wednesday, 8 June, 2016. [Online]. Available: https://www.kaggle.com/
c/grupo-bimbo-inventory-demand
[2] J. D. Kelleher, B. Mac Namee, and A. D’Arcy, Fundamentals of machine learn-
ing for predictive data analytics: algorithms, worked examples, and case studies.
MIT Press, 2015.
[3] O. Kotaro, “Predictive analytics solution for fresh food demand using
heterogeneous mixture learning technology,” NEC Technical Journal, vol. 10,
no. 1, 2015. [Online]. Available: http://www.nec.com/en/global/techrep/
journal/g15/n01/pdf/150117.pdf
[4] L. G. Maria Elena Nenni and L. Pirolo, “Demand forecasting
in the fashion industry: A review,” Int J Eng Bus Manag,
vol. 5, no. 37, 2013. [Online]. Available: http://www.intechopen.
com/journals/international journal of engineering business management/
demand-forecasting-in-the-fashion-industry-a-review
[5] T. G. Dietterich, Machine Learning for Sequential Data: A Review. [Online].
Available: http://web.engr.oregonstate.edu/%7Etgd/publications/mlsd-ssspr.
[6] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” CoRR,
vol. abs/1603.02754, 2016. [Online]. Available: http://arxiv.org/abs/1603.02754
[7] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica,
“Spark: Cluster computing with working sets,” in Proceedings of the 2Nd
USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud’10.
Berkeley, CA, USA: USENIX Association, 2010, pp. 10–10. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1863103.1863113
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Ma-
chine learning in Python,” Journal of Machine Learning Research, vol. 12, pp.
2825–2830, 2011.
[9] Kaggle. Evaluation. [Online]. Available: https://www.kaggle.com/c/
grupo-bimbo-inventory-demand/details/evaluation
[10] ——. Member faq. [Online]. Available: https://www.kaggle.com/wiki/
KaggleMemberFAQ
Appendix A
Flowchart
Figure A.1: Flowchart of the procedure: from the original database dump to the prediction.
Appendix B
Code
The code includes both the Scala and Python interfaces of Spark. The Scala code is implemented by Nikita Sonthalia. The Bash scripts and Python code are implemented by Jinzhong Zhang.
B.1 Split the training data week by week (Bash)
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
week=`echo $line | sed 's/,.*$//'`
echo "$line">>train_week$week.csv
done < "$1"
To run:
./Splitter.sh train.csv
B.2 The collection of user behavior (Scala - Spark)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object sparkweek3 {
def main(args: Array[String]){
val sc = new SparkContext(new SparkConf().setAppName("week"));
for( a <- 3 to 10){
val textFile = sc.textFile("MLprojectOutput/train_week"+a+".csv")
val counts = textFile.map(line => {
val token = line.split(",");
val key = token(4);
(key,(token(1),token(2),token(5),token(6),token(7),token(8),token(9),token(10)))}).groupByKey;
counts.coalesce(1).saveAsTextFile("MLprojectOutput/week"+a+"objectoutput")
}
}
}
B.3 The calculation of co-occurrence matrix of prod-
ucts and depots (Scala - Spark)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
object Cooccurrence {
def main(args: Array[String]){
val sc = new SparkContext(new SparkConf().setAppName("matrix")
.setMaster("local[7]")
.set("spark.executor.memory", "4g")
.set("spark.driver.memory","5g"));
for( a <- 3 to 10){
val textFile = sc.textFile("MLprojectOutput/train_week"+a+".csv")
val pro = textFile.map(line => {
val token = line.split(",");
val key = token(4);
// For product matrix
val valueP = token(5) +"#"+ token(10);
(key,valueP)}).groupByKey;
val dep= textFile.map(line => {
val token = line.split(",");
val key = token(4);
//For depot matrix
val valueD = token(1) +"#"+ token(10);
(key,valueD)}).groupByKey;
pro.coalesce(1).saveAsObjectFile("MLprojectOutput/week"+a+"Pobjectoutput")
dep.coalesce(1).saveAsObjectFile("MLprojectOutput/week"+a+"Dobjectoutput")
val valuefileP = sc.objectFile[(String, Seq[String])]("MLprojectOutput/week"+a+"Pobjectoutput/part-00000");
val valuefileD = sc.objectFile[(String, Seq[String])]("MLprojectOutput/week"+a+"Dobjectoutput/part-00000");
val outputP= valuefileP.persist(StorageLevel.MEMORY_AND_DISK_SER).flatMap( {
case (userid, values) =>
{
var productIds = values.map(value=>value.split("#")(0));
var demand = values.map(value=>value.split("#")(1));
productIds.combinations(2).map(
pairs => {
{(pairs.mkString("#"), 1)}
})
}}).reduceByKey(_ + _);
val outputD= valuefileD.persist(StorageLevel.MEMORY_AND_DISK_SER).flatMap( {
case (userid, values) =>
{
var productIds = values.map(value=>value.split("#")(0));
var demand = values.map(value=>value.split("#")(1));
productIds.combinations(2).map(
pairs => {
{(pairs.mkString("#"), 1)}
})
}}).reduceByKey(_ + _);
outputP.coalesce(1).saveAsTextFile("MLprojectOutput/week"+a+"ProductMatrix")
outputD.coalesce(1).saveAsTextFile("MLprojectOutput/week"+a+"DepotMatrix")
}
}
}
B.4 The calculation of popularity of each product
and each depot (Python - Spark)
from pyspark import SparkContext, SparkConf
def parser(line, i):
    tokens = line.split(',')
    return (int(tokens[i]), int(tokens[-1]))

if __name__ == "__main__":
    conf = SparkConf()
    sc = SparkContext(conf=conf)
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)
    for i in range(3, 10):
        prod = (sc.textFile("train_week{0}.csv".format(i))
                .map(lambda line: parser(line, 5)).reduceByKey(lambda a, b: a + b))
        prod.coalesce(1).saveAsTextFile("MLprojectOutput/week{0}ProductPopularity".format(i))
        depot = (sc.textFile("train_week{0}.csv".format(i))
                 .map(lambda line: parser(line, 1)).reduceByKey(lambda a, b: a + b))
        depot.coalesce(1).saveAsTextFile("MLprojectOutput/week{0}DepotPopularity".format(i))
B.5 Summary to the Analytic Based Table (Python
- Spark)
import sys, os
from pyspark import SparkContext, SparkConf
from ast import literal_eval
import numpy as np
TRAIN_WEEKS = [3,4,5,6]
def parse(x):
res = x[1:-1].split(',CompactBuffer')
products = literal_eval(res[1][:-2]+',)')
n_products = {}
n_depots = {}
for product in products:
try:
product_ID = product[2]
depot_ID = product[0]
demand = product[-1]
if product_ID not in n_products:
n_products[product_ID] = demand
else:
n_products[product_ID] += demand
if depot_ID not in n_depots:
n_depots[depot_ID] = demand
else:
n_depots[depot_ID] += demand
except:
sys.stdout.write("{0}\n".format(products))
raise
return (int(res[0]), n_products, n_depots)
def load_customer(filename):
prod_dict={}
depot_dict={}
with open(filename, 'r') as f:
lines = f.readlines()
tot_lines = len(lines)
iline = 0
for line in lines:
userID, products, depots = parse(line)
prod_dict[userID] = products
depot_dict[userID] = depots
iline += 1
if iline%1000==0:
sys.stdout.write("\rRead {0}/{1} lines from {2}"
.format(iline, tot_lines, filename))
sys.stdout.flush()
sys.stdout.write("\n")
return prod_dict, depot_dict
def load_occurrence_matrix(filename, item_dict={}):
with open(filename, 'r') as f:
lines = f.readlines()
tot_lines = len(lines)
iline = 0
for line in lines:
items, weight = line[1:-2].split(',')
item1, item2 = items.split('#')
item1 = int(item1)
item2 = int(item2)
weight = int(weight)
# create a bi-direction search dictionary
if item1 not in item_dict:
item_dict[item1] = {item2:weight}
elif item2 not in item_dict[item1]:
item_dict[item1][item2] = weight
else:
item_dict[item1][item2] += weight
if item2 not in item_dict:
item_dict[item2] = {item1:weight}
elif item1 not in item_dict[item2]:
item_dict[item2][item1] = weight
else:
item_dict[item2][item1] += weight
iline += 1
if iline%1000==0:
sys.stdout.write("\rRead {0}/{1} lines from {2}"
.format(iline, tot_lines, filename))
sys.stdout.flush()
sys.stdout.write("\n")
return item_dict
def load_popularity(filename, item_pop={}):
with open(filename, 'r') as f:
for line in f:
item, pop = line[1:-2].split(', ')
item = int(item)
if item not in item_pop:
item_pop[item] = int(pop)
else:
item_pop[item] += int(pop)
sys.stdout.write("Read popularity from {0}, done.\n".format(filename))
return item_pop
N_HISTORY_WEEKS = len(TRAIN_WEEKS)-1
weeks_prod = [{}]*N_HISTORY_WEEKS
weeks_depot = [{}]*N_HISTORY_WEEKS
prod_occurrence = {}
depot_occurrence = {}
product_popularity = {}
depot_popularity = {}
for i, week in enumerate(TRAIN_WEEKS[:-1]):
weeks_prod[i], weeks_depot[i] = load_customer("MLprojectOutput/week{0}objectoutput/part-00000".format(week))
prod_occurrence = load_occurrence_matrix("MLprojectOutput/week{0}ProductMatrix/part-00000".format(week), prod_occurrence)
depot_occurrence = load_occurrence_matrix("MLprojectOutput/week{0}DepotMatrix/part-00000".format(week), depot_occurrence)
product_popularity = load_popularity("MLprojectOutput/week{0}ProductPopularity/part-00000".format(week), product_popularity)
depot_popularity = load_popularity("MLprojectOutput/week{0}DepotPopularity/part-00000".format(week), depot_popularity)
MAX_RELATIVE_PRODUCTS = 3
MAX_RELATIVE_DEPOTS = 1
def createSample(line):
token = line.split(",")
userID = int(token[4])
product = int(token[5])
depot = int(token[1])
try:
demand = int(token[-1])
except ValueError:
sys.stdout.write("-{0}-\n".format(token[-1]))
try:
demand = int(token[-1][:-1])
except ValueError:
sys.stdout.write("..........................")
demand = 0
row = []
for i in range(N_HISTORY_WEEKS):
n_prod = 0
n_rel_prod = []
n_depot = 0
n_rel_depot = []
# product and relative products
week_prod = weeks_prod[i]
if userID in week_prod:
shopping_list = week_prod[userID]
if product in shopping_list:
n_prod = shopping_list[product]
if product in prod_occurrence:
relative_prods = prod_occurrence[product]
for prod, number in shopping_list.items():
if prod in relative_prods:
n_rel_prod.append(relative_prods[prod]*number)
n_rel_prod.sort(reverse=True)
# depot and relative depots
week_depot = weeks_depot[i]
if userID in week_depot:
depot_list = week_depot[userID]
if depot in depot_list:
n_depot = depot_list[depot]
if depot in depot_occurrence:
relative_depots = depot_occurrence[depot]
# Use a separate loop variable so the query 'depot' is not overwritten.
for dep, number in depot_list.items():
if dep in relative_depots:
n_rel_depot.append(relative_depots[dep]*number)
n_rel_depot.sort(reverse=True)
# fulfill the remaining entries
n = len(n_rel_prod)
if n > MAX_RELATIVE_PRODUCTS:
n_rel_prod=n_rel_prod[:MAX_RELATIVE_PRODUCTS]
else:
n_rel_prod.extend([0]*(MAX_RELATIVE_PRODUCTS-n))
n = len(n_rel_depot)
if n > MAX_RELATIVE_DEPOTS:
n_rel_depot=n_rel_depot[:MAX_RELATIVE_DEPOTS]
else:
n_rel_depot.extend([0]*(MAX_RELATIVE_DEPOTS-n))
# put the data in row
row.append(n_prod)
row.extend(n_rel_prod)
row.append(n_depot)
row.extend(n_rel_depot)
#calculate the popularities
if product in product_popularity:
prod_pop = product_popularity[product]
else:
prod_pop = 0
if depot in depot_popularity:
depot_pop = depot_popularity[depot]
else:
depot_pop = 0
row.extend([prod_pop,depot_pop,demand])
return row
if __name__ == "__main__":
os.system("rm -rf MLprojectOutput/week{0}Formated".format(TRAIN_WEEKS[-1]))
conf = SparkConf()
sc = SparkContext(conf=conf)
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN )
logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN )
formated_data = sc.textFile("train_week{0}.csv"
.format(TRAIN_WEEKS[-1])).map(createSample)
formated_data.coalesce(1).saveAsTextFile("MLprojectOutput/week{0}Formated".format(TRAIN_WEEKS[-1]))
B.6 Build the model and calculate the validation
accuracy
##############################################################
# xgboost sklearn to train and test the model
# Author: Jinzhong Zhang
###############################################################
import xgboost as xgb
import csv, sys, math
import numpy as np
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.metrics import *
from sklearn.externals import joblib
def readData(filename, start_row=0, end_row=-1):
irow = -1
data = []
with open(filename, 'r') as f_handle:
for row in f_handle:
irow += 1
if irow<start_row:
continue
elif irow>end_row and end_row>0:
break
data.append([np.float64(x) for x in row[1:-2].split(',')])
if irow%1000==0:
sys.stdout.write("\rRead {0} lines from {1}".format(irow, filename))
sys.stdout.flush()
data = np.array(data)
return (data[:,:-1],data[:,-1])
def evaluate(data, pred):
print("R2=", r2_score(data,pred))
print("Mean Squared Error=", mean_squared_error(data,pred))
errors = []
sum_ = 0
total_ = 0
for i, v in enumerate(data):
if v>0:
errors.append(abs(v-pred[i])/v)
else:
errors.append(1.0 if pred[i]>0 else 0)
sum_ += errors[-1]**2
total_ += v-pred[i]
errors.sort()
print ("mean=",math.sqrt(sum_/len(pred)))
print ("total=",total_)
print ("50 Percentile=", errors[int(len(pred)*0.5)], ", 75 Percentile=", errors[int(len(pred)*0.75)] )
print ("90 Percentile=", errors[int(len(pred)*0.9)], ", 99.5 Percentile=", errors[int(len(pred)*0.995)] )
def train(mode):
if mode == "NextWeek":
DATA = "MLprojectOutput/week34567to8Formated/part-00000"
else:
DATA = "MLprojectOutput/week34567to9Formated/part-00000"
X, Y = readData(DATA, 10000, -1)
X_Scaler = MinMaxScaler().fit(X)
joblib.dump(X_Scaler, 'Predict{0}_Scaler.pkl'.format(mode))
X = X_Scaler.transform(X)
dtrain = xgb.DMatrix(X, label = Y)
param = { 'booster':"gbtree",
'eta':0.3,
'max_depth':6,
'subsample':0.85,
'colsample_bytree':0.7,
'silent':0,
'objective':'reg:linear',
'nthread':10,
'eval_metric':'rmse'}
__model = xgb.train(param.items(), dtrain)
__model.save_model('Predict{0}.model'.format(mode))
X_TEST, Y_TEST = readData(DATA, 0, 10000)
X_TEST = X_Scaler.transform(X_TEST)
dtest = xgb.DMatrix(X_TEST)
Y_pred = list(map(lambda x: int(x), __model.predict(dtest)))
evaluate(Y_TEST,Y_pred)
if __name__ == '__main__':
train('NextWeek')
train('NextNextWeek')
B.7 Make Predictions
##############################################################
# xgboost sklearn to train and test the model
# Author: Jinzhong Zhang
###############################################################
import xgboost as xgb
import csv, sys
import numpy as np
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.metrics import *
from sklearn.externals import joblib
def readTestData(filename, nrow=100):
irow = 0
data = []
with open(filename, 'r') as f_handle:
for row in f_handle:
data.append([np.float64(x) for x in row[1:-2].split(',')])
irow += 1
if irow%1000==0:
sys.stdout.write("\rRead {0} lines from {1}".format(irow, filename))
sys.stdout.flush()
if irow>=nrow and nrow>0:
break
data = np.array(data)
sys.stdout.write("\n")
return (data[:,0].astype(int), data[:,1:])
def reformat(pred):
y=int(pred)
if y<0:
y=0
return y
def predict(DATA, mode):
IDs, test_X = readTestData(DATA, -1)
X_Scaler = joblib.load('Predict{0}_Scaler.pkl'.format(mode))
__model = xgb.Booster({'nthread':4})  # init model
__model.load_model("Predict{0}.model".format(mode)) # load data
test_X = X_Scaler.transform(test_X)
dtest = xgb.DMatrix(test_X)
return (IDs, list(map(lambda x: reformat(x), __model.predict(dtest))))
if __name__ == '__main__':
IDs_1, pred_1 = predict("MLprojectOutput/week56789to10Formated/part-00000","NextWeek")
# Week 11 queries use the second model, trained on the week-after-next target.
IDs_2, pred_2 = predict("MLprojectOutput/week56789to11Formated/part-00000","NextNextWeek")
writer = csv.writer(open("submission.csv", "w"))
writer.writerow(['id','Demanda_uni_equil'])
writer.writerows(zip(np.append(IDs_1,IDs_2), pred_1+pred_2))