Prediction of Grupo Bimbo Inventory Demand
by Jinzhong Zhang 90620
Nikita Sonthalia 89679
Team: Bimbo Kagglers
July 30, 2014
San Jose
Acknowledgments
Thanks to:
• the Bimbo Group for raising this problem and providing funding;
• the www.kaggle.com platform for holding this competition;
• Prof. Wang for organizing this project.
Jinzhong Zhang & Nikita Sonthalia, September 1, 2016
Preface
This report describes a general business problem: predicting the demand in a future week given the demands in the past weeks. It reduces the exhausting calculation of traditional co-occurrence into a map-reduce problem, and then uses a general machine learning algorithm to make the prediction. The cutting-edge tool Spark is used to perform the map-reduce calculation. The results show that (number) predictions can be finished in (time) with (percent) accuracy.
Table of Contents
Acknowledgments
Preface
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Objective
1.2 What is the problem
1.3 Why this is a project related to this class
1.4 Why other approach is not good
1.5 Why our approach should be better
1.6 Statement of the problem
1.7 Scope of investigation

Chapter 2 Theoretical Bases and Literature Review
2.1 Definition of the problem
2.2 Theoretical background of the problem
2.3 Related research to solve the problem
2.4 Advantage/disadvantage of those research
2.5 Our solution to solve this problem
2.6 Where your solution different from others
2.7 Why your solution is better

Chapter 3 Hypothesis
3.1 Hypothesis
3.2 Positive/negative hypothesis

Chapter 4 Methodology
4.1 How to generate/collect input data?
4.2 How to solve the problem?
4.3 How to generate output?
4.4 How to test against hypothesis?
4.5 How to prove correctness

Chapter 5 Implementation
5.1 Split the training data week by week
5.2 The collection of user behavior
5.3 The calculation of co-occurrence matrix of products and depots
5.4 The calculation of popularity of each product and each depot
5.5 Summary to the Analytic Based Table (ABT)
5.6 Build the Predictive Model
5.7 Make Predictions

Chapter 6 Data Analysis and Discussion
6.1 Output Generation
6.2 Output Analysis
6.3 Compare output against hypothesis
6.4 Discussion

Chapter 7 Conclusion and Recommendation
7.1 Summary and conclusions
7.2 Recommendations for future studies

Bibliography

Appendix A Flowchart

Appendix B Code
B.1 Split the training data week by week (Bash)
B.2 The collection of user behavior (Scala - Spark)
B.3 The calculation of co-occurrence matrix of products and depots (Scala - Spark)
B.4 The calculation of popularity of each product and each depot (Python - Spark)
B.5 Summary to the Analytic Based Table (Python - Spark)
B.6 Build the model and calculate the validation accuracy
B.7 Make Predictions
List of Figures
6.1 Our Kaggle RMSLE for the public and private data sets
6.2 Rank and private RMSLE of the winners, our team (highlighted), and the sample. The sample submission is just a constant prediction of demand at 7.
A.1 Flowchart of the procedure: from the original database dump to the prediction.
List of Tables
4.1 Training Table
4.2 Test Table
4.3 Sample Submission
4.4 Client Table
4.5 Product Table
4.6 Town Table
Chapter 1
Introduction
1.1 Objective
Grupo Bimbo, S.A.B. de C.V., known as Bimbo, is a Mexican multinational bakery product manufacturing company headquartered in Mexico City, Mexico. It is the world's largest baking company. Grupo Bimbo strives to meet daily consumer demand for fresh bakery products on the shelves of over 1 million stores along its 45,000 routes across Mexico.
Currently, daily inventory calculations are performed by direct delivery sales em-
ployees who must single-handedly predict the forces of supply, demand, and hunger
based on their personal experiences with each store. With some breads carrying a
one week shelf life, the acceptable margin for error is small.
Grupo Bimbo invites Kagglers to develop a model to accurately forecast inventory
demand based on historical sales data.[1]
1.2 What is the problem
The dataset consists of 9 weeks of sales transactions in Mexico. Every week, there
are delivery trucks that deliver products to the vendors. Each transaction consists
of sales and returns. Returns are the products that are unsold or expired. The demand for a product in a certain week is defined as the sales this week minus the returns next week. [1]
We will forecast the demand of a product for a given week (10th or 11th week),
at a particular store.
1.3 Why this is a project related to this class
The techniques that we learned in class can be applied to this topic, e.g., the co-occurrence matrix, decision trees, k-nearest neighbors (kNN), Bayesian methods, and support vector machines.
1.4 Why other approach is not good
In business-to-customer (B2C) problems, previous approaches most commonly use the co-occurrence matrix for customer recommendations [2]. The usage of the co-occurrence matrix is rarely seen in demand prediction problems.
In demand prediction problems, previous approaches [3, 4] mainly use past data to predict demand in bulk with time-sequential prediction models. Those approaches address business-to-business (B2B) problems. B2C problems are more fickle, more sophisticated, and less predictable, and the data size of B2C is several orders of magnitude larger than that of B2B, which makes the problem even harder.
1.5 Why our approach should be better
Our approach combines the co-occurrence matrix and probability-based machine learning for the B2C problem. It considers both the co-occurrence among products and stores and the time series. Since the algorithm extracts the most important information from the data with little loss, it should perform well.
1.6 Statement of the problem
The week-to-week account of transactions over 9 weeks is given. The information in each transaction includes the store ID, the product ID, the customer ID, and the amounts the customer bought and returned. Our task is to predict the demand in the 10th or the 11th week. The demand equals the amount that the customer buys minus the amount that the customer returns in the coming week. Supplementary information is also given, which includes the store locations, the product names and weights, and the customer names. Details of the data are shown in Chapter 4.
1.7 Scope of investigation
We are going to investigate the correlations among products and depots in the past n weeks. The co-occurrence matrices will be used as a look-up table while building the analytic based table (ABT). The features of the analytic based table describe the querying customer's behavior in the past n weeks. The targets are the demands of the querying customer in the (n+1)th week and the (n+2)th week. Models M(x_{1..n}, n+1) and M(x_{1..n}, n+2) will be learned by the computer to make the prediction.
Chapter 2
Theoretical Bases and Literature
Review
2.1 Definition of the problem
The week-to-week account of transactions over 9 weeks is given. The information in each transaction includes the store ID, the product ID, the customer ID, and the amounts the customer bought and returned. Our task is to predict the demand in the 10th or the 11th week. The demand equals the amount that the customer buys minus the amount that the customer returns in the coming week. Supplementary information is also given, which includes the store locations, the product names and weights, and the customer names. Details of the data are shown in Chapter 4.
2.2 Theoretical background of the problem
The sequential supervised learning problem can be formulated as follows. Let {(x_i, y_i)}_{i=1}^{N} be a set of N training examples. Each example is a pair of sequences (x_i, y_i), where x_i = (x_{i,1}, x_{i,2}, ..., x_{i,T_i}) and y_i = (y_{i,1}, y_{i,2}, ..., y_{i,T_i}). The goal is to construct a classifier M that can correctly predict a new label sequence y = M(x) given an input sequence x [5].
This problem is a kind of sequential supervised learning problem. The following
algorithms are generally used for the sequential supervised learning problems:
• Sliding Window,
• Recurrent Sliding Window,
• Hidden Markov Models,
• Conditional Random Fields,
• Neural Networks.
Our approach is a combination of co-occurrence analysis and sliding-window analysis. The final predictive model is a gradient boosted regression tree built with the XGBoost algorithm [6]. In short, each regression tree partitions the input space into J disjoint regions R_{1m}, ..., R_{Jm} and predicts a constant value in each region. A loss function is then calculated and a gradient descent step is applied to it.
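As background, here is a minimal sketch of the regularized boosting objective in LaTeX notation (it roughly follows the XGBoost paper [6]; T denotes the number of leaves of a tree and w its vector of leaf values, which are our notation, not symbols from this report):

\hat{y}_i = \sum_{m=1}^{M} f_m(x_i), \qquad f_m \in \mathcal{F}

\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{m=1}^{M} \Omega(f_m),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}

Each new tree is fit to the gradients of this loss, which is the gradient descent step mentioned above.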
2.3 Related research to solve the problem
In business-to-customer (B2C) problems, previous approaches most commonly use the co-occurrence matrix for customer recommendations [2].
In business-to-business (B2B) demand prediction problems, previous approaches [3, 4] mainly use past data to predict demand in bulk with time-sequential prediction models.
2.4 Advantage/disadvantage of those research
B2C problems are more fickle, more sophisticated, and less predictable, and the data size of B2C is several orders of magnitude larger than that of B2B, which makes the problem even harder. The use cases of the previous approaches are limited, especially when facing store-by-store and time-sensitive demand problems: the computation scale is too large to make a profitable prediction. Thus, a new workflow and new algorithms are necessary.
2.5 Our solution to solve this problem
We are going to investigate the correlations among products and depots in the past n weeks. The co-occurrence matrices will be used as a look-up table while building the analytic based table (ABT). The features of the analytic based table describe the querying customer's behavior in the past n weeks. The targets are the demands of the querying customer in the (n+1)th week and the (n+2)th week. Models M(x_{1..n}, n+1) and M(x_{1..n}, n+2) will be learned by the computer to make the prediction.
2.6 Where your solution different from others
Our approach is a combination of co-occurrence analysis and sliding-window analysis (Section 2.2), followed by a gradient boosted regression tree, rather than a single sequential model such as a hidden Markov model.
2.7 Why your solution is better
Our approach combines the co-occurrence matrix and probability-based machine learning for the B2C problem. It considers both the co-occurrence among products and stores and the time series. Since the algorithm extracts the most important information from the data with little loss, it should perform well. Our model builder (XGBoost) is also among the most successful tools on Kaggle. As claimed by the XGBoost authors, "Among the 29 challenge winning solutions published at Kaggle's blog during 2015, 17 solutions used XGBoost" [6].
It does not need to perform an exhausting relational search for each query. Instead of searches, it uses a model to make predictions, which makes the calculation fast.
Chapter 3
Hypothesis
3.1 Hypothesis
We assume that the behavior of a customer buying a certain product at a particular depot only relates to his/her behavior in the past n weeks. In this report, we try n = 3. Under this hypothesis, we generate a machine-learning model from the training dataset and then use the test dataset to predict the output. For example:
If customer A demands k pieces of product B at depot C in the 4th week, his/her behavior in week 1 to week 3 becomes the features, and the demand k is the target.
Particularly, the features of the analytic based table include:
================1st week====================
1. Demand of product B in the 1st week,
2. Demand of the product most related to product B in the 1st week, times the co-occurrence weight,
3. Demand of the product second most related to product B in the 1st week, times the co-occurrence weight,
4. Demand of the product third most related to product B in the 1st week, times the co-occurrence weight,
5. Demand at depot C in the 1st week,
6. Demand at the most related depot in the 1st week, times the co-occurrence weight,
================2nd week====================
7. Demand of product B in the 2nd week,
8. Demand of the product most related to product B in the 2nd week, times the co-occurrence weight,
9. Demand of the product second most related to product B in the 2nd week, times the co-occurrence weight,
10. Demand of the product third most related to product B in the 2nd week, times the co-occurrence weight,
11. Demand at depot C in the 2nd week,
12. Demand at the most related depot in the 2nd week, times the co-occurrence weight,
================3rd week====================
13. Demand of product B in the 3rd week,
14. Demand of the product most related to product B in the 3rd week, times the co-occurrence weight,
15. Demand of the product second most related to product B in the 3rd week, times the co-occurrence weight,
16. Demand of the product third most related to product B in the 3rd week, times the co-occurrence weight,
17. Demand at depot C in the 3rd week,
18. Demand at the most related depot in the 3rd week, times the co-occurrence weight,
================General Info====================
19. The popularity of product B (how many units of product B were sold from the 1st week to the 3rd week),
20. The popularity of depot C (how many units were sold at depot C from the 1st week to the 3rd week),
================Target====================
21. Target: the demand for the product at the depot in the (n+1)th week.
We consider only the single most related depot because we found that people rarely went to more than two depots in the same month. We consider three related products because we want to limit the number of features and save computation; the number of related products ("three") can be adjusted.
Here is an example of a sample in the order mentioned above with target demand
1:
8, 3222510, 2990304, 2930552, 167, 1026716, 2, 3488688, 2564233, 1119980, 128,
786944, 3, 2578008, 1495704, 1351815, 134, 823832, 1495100, 382391, 1.
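To make the layout concrete, here is a minimal Python sketch that labels the 21 values of the sample above; the field names are our own shorthand, not part of the dataset:

FEATURE_NAMES = [
    # 1st week
    "w1_product_demand", "w1_rel_product_1", "w1_rel_product_2", "w1_rel_product_3",
    "w1_depot_demand", "w1_rel_depot_1",
    # 2nd week
    "w2_product_demand", "w2_rel_product_1", "w2_rel_product_2", "w2_rel_product_3",
    "w2_depot_demand", "w2_rel_depot_1",
    # 3rd week
    "w3_product_demand", "w3_rel_product_1", "w3_rel_product_2", "w3_rel_product_3",
    "w3_depot_demand", "w3_rel_depot_1",
    # general info and target
    "product_popularity", "depot_popularity", "target_demand",
]

sample = [8, 3222510, 2990304, 2930552, 167, 1026716,
          2, 3488688, 2564233, 1119980, 128, 786944,
          3, 2578008, 1495704, 1351815, 134, 823832,
          1495100, 382391, 1]

for name, value in zip(FEATURE_NAMES, sample):
    print("{0:>20s}: {1}".format(name, value))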
3.2 Positive/negative hypothesis
This is a regression problem, so the error and R² will be calculated to accept or reject the hypothesis.
Chapter 4
Methodology
4.1 How to generate/collect input data?
We got the data from the Kaggle site, where the company has uploaded its data. The data contain the following tables:
Week  Sales Depot ID  Sales Channel ID  Route ID  Client ID  Product ID  Sales units this week  Sales this week (pesos)  Return units next week  Returns next week (pesos)  Adjusted Demand
3     1110            7                 3301      15766      1212        3                      25.14                    0                       0.0                        3
3     1110            7                 3301      15766      1216        4                      33.52                    0                       0.0                        4
3     1110            7                 3301      15766      1238        4                      39.32                    0                       0.0                        4
3     1110            7                 3301      15766      1240        4                      33.52                    0                       0.0                        4
3     1110            7                 3301      15766      1242        3                      22.92                    0                       0.0                        3
Table 4.1: Training Table
To simplify the problem, we used the training table (Table 4.1) only.
ID week Sales Depot ID Sales Channel ID Route ID Client ID Product ID
0 11 4037 1 2209 4639078 35305
1 11 2237 1 1226 4705135 1238
2 10 2045 1 2831 4549769 32940
3 11 1227 1 4448 4717855 43066
4 11 1219 1 1130 966351 1277
Table 4.2: Test Table
ID Demand (to be predicted)
0 ?
1 ?
2 ?
Table 4.3: Sample Submission
Cliente ID Name of the client
0 SIN NOMBRE
1 OXXO XINANTECATL
2 SIN NOMBRE
3 EL MORENO
Table 4.4: Client Table
Product ID Name of the product
0 NO IDENTIFICADO 0
9 Capuccino Moka 750g NES 9
41 Bimbollos Ext sAjonjoli 6p 480g BIM 41
53 Burritos Sincro 170g CU LON 53
Table 4.5: Product Table
Depot ID Town State
1110 2008 AG. LAGO FILT MÉXICO, D.F.
1111 2002 AG. AZCAPOTZALCO MÉXICO, D.F.
1112 2004 AG. CUAUTITLAN ESTADO DE MÉXICO
1113 2008 AG. LAGO FILT MÉXICO, D.F.
1114 2029 AG.IZTAPALAPA 2 MÉXICO, D.F.
Table 4.6: Town Table
4.2 How to solve the problem?
• We used the training dataset to build the learning model week by week.
• From the training dataset, we used probability to find the predicted demand.
4.2.1 Algorithm Design
• We consider (past n weeks, p related products, d related depots):
• First try n=3, p=3, d=1:
• Procedure:
1. split data into weeks
2. generate the training data:
(a) Target: in the (n+1)th week, the demand for the product for the (person, depot) pair,
(b) Find the p most related products from the 1st to the nth weeks' data, instead of calculating the entire co-occurrence matrix,
(c) Find the d most related depots in the same way.
3. Features:
(a) In weeks 1 to n, the demand of the product for the (person, depot) pair: n features,
(b) the demand of the p related products in the past n weeks: n*p features,
(c) the demand of the d related depots: n*d features,
(d) the popularity of the product and the depot.
4. Train it by gradient boosted regression tree.
4.2.2 Language used
• Python, Scala
4.2.3 Tools used
• Bash Script, Python Libraries, Spark [7], Scikit-Learn [8], XGBOOST [6]
4.3 How to generate output?
The test dataset has rows for which we have to generate the demand. We take each row as a unique query, find the related data in the training dataset, and split it week by week to build features in the same way as for training; the training data produce the learning model. That learning model is then used to generate the output.
4.4 How to test against hypothesis?
1. Currently, we assume that the demand of the current week only relates to the previous 3 weeks. This number could be adjusted from 1 to 7. By calculating R² = 1 − (residual sum of squares) / (total sum of squares), we can find out which hypothesis is the best;
2. Currently, we assume that the demand does not relate to the route of the depot and the channel of the depot. That information could be added to the model. Again, R² is the benchmark.
4.5 How to prove correctness
Calculate R². It should be above 0 and as close to 1 as possible.
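As a minimal sketch of this check (toy numbers; scikit-learn's r2_score implements the definition given in Section 4.4):

from sklearn.metrics import r2_score

actual = [3, 4, 4, 4, 3, 10, 0, 7]      # toy true demands
predicted = [2, 4, 5, 4, 3, 8, 1, 6]    # toy predicted demands

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
print("R2 =", r2_score(actual, predicted))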
Chapter 5
Implementation
In the first step, we split the training data into weekly training data. This part is
implemented by a simple Bash script. In the second step, from the weekly training
data, for each week, we collect each user’s behavior, calculate the co-occurrence
matrix for products and depots, and calculate the popularities of the products and
depots. This part is implemented using Spark 2.0 [7] with Scala and Python interface.
In the third step, we build the analytic based table (ABT) from the results provided
above. This part is implemented using Spark 2.0 with Python interface. In the fourth
step, we split the ABT for training/validation/test and calculate the performance of
the hypothesis. Appendix A shows the flowchart.
5.1 Split the training data week by week
It is implemented by a simple Bash script, as shown in Appendix B.1. It takes several
hours to finish.
5.2 The collection of user behavior
We use the user ID as the key and apply Spark groupByKey to collect a given user's information in one week into one row. This reformatting of the data lets the following programs query quickly by the user ID hash.
This part of code is implemented by Scala. They are in Appendix B.2. It takes
several minutes.
5.3 The calculation of co-occurrence matrix of prod-
ucts and depots
It calculates the co-occurrence matrix of the products and depots, respectively. Par-
ticularly, if one user bought k product A and l product B in a week, the addition to
the co-occurrence of A and B is the product of the demand of A and the demand of
B in that week.
This part of code is implemented by Scala. They are in Appendix B.3.
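A minimal plain-Python sketch of this weighting rule on a toy weekly basket (the distributed Spark version is in Appendix B.3):

from itertools import combinations

# One user's weekly basket: product ID -> demand in that week (toy data).
basket = {1212: 3, 1216: 4, 1240: 2}

cooccurrence = {}  # (product_A, product_B) -> accumulated co-occurrence weight
for (prod_a, dem_a), (prod_b, dem_b) in combinations(sorted(basket.items()), 2):
    # The weight added for this user and week is the product of the two demands.
    key = (prod_a, prod_b)
    cooccurrence[key] = cooccurrence.get(key, 0) + dem_a * dem_b

print(cooccurrence)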
5.4 The calculation of popularity of each product
and each depot
It calculates how popular a product is and how popular a depot is. Particularly, the popularity of a product in a week is the number of units of this product sold in that week. The popularity of a depot is the total number of units sold at this depot in that week.
This part of code is implemented by Python. They are in Appendix B.4.
5.5 Summary to the Analytic Based Table (ABT)
It reads the collection of each user’s behavior, the co-occurrence matrix for products
and depots, and the popularities of the products and depots into memory. From
those hashed look-up tables (up to 2.0 GB in memory), this builder summarizes the
behavior of the customer in the previous weeks (we try 3 weeks).
This part of code is implemented by Python. They are in Appendix B.5.
5.6 Build the Predictive Model
In this analysis, we use the 4th week (week 6 in the original train.csv) demand as
the training sample and use the 5th week (week 7 in the original train.csv) demand
as the validation data. The test data is held on Kaggle.com and is hidden from the
user.
First, we used MinMaxScaler to normalize the data because of those huge weight
numbers.
Second, we feed the normalized data into the XGBoost model maker. The parameters are set as follows: the maximum depth of the tree is 4 ('bst:max_depth':4), the learning rate is 0.3 ('eta':0.3), the objective is linear regression, and the loss function is root mean squared error ('eval_metric':'rmse').
This part of code is implemented by Python, using Scikit-learn and XGBOOST.
They are in Appendix B.6.
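As a minimal, self-contained sketch of such an xgboost.train call with the parameters quoted above (X and y are hypothetical placeholders for the scaled ABT; the full script in Appendix B.6 uses slightly different settings):

import numpy as np
import xgboost as xgb

X = np.random.rand(200, 20)       # placeholder for the normalized ABT features
y = np.random.rand(200) * 10      # placeholder for the demand targets

params = {
    "max_depth": 4,               # maximum tree depth
    "eta": 0.3,                   # learning rate
    "objective": "reg:linear",    # linear regression (called reg:squarederror in newer XGBoost)
    "eval_metric": "rmse",        # root mean squared error as the loss metric
}
model = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=10)
print(model.predict(xgb.DMatrix(X[:5])))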
5.7 Make Predictions
It uses the scaler and model generated in the last step to predict the future behavior.
This part of code is implemented by Python, using Scikit-learn and XGBOOST.
They are in Appendix B.7.
Chapter 6
Data Analysis and Discussion
6.1 Output Generation
Using the model described in the previous chapter, we take the fifth-week data as the test sample. Applying the same scaler as for the training sample, we normalize the features of the test sample. The model is then applied to the normalized features of the test sample. The code can be seen in Appendix B.6.
6.2 Output Analysis
The following benchmarks have been calculated:
1. R²: 0.59
2. Root mean squared error: 13.9
3. Root mean squared percentage error: 124.6%
4. 50th percentile of the percentage error: 50%
5. 75th percentile of the percentage error: 86.4%
6. 90th percentile of the percentage error: 200%
7. 99.5th percentile of the percentage error: 400%
In the percentage error calculation, we applied a strict rule: if the true value is 0 but the prediction is not, we considered it a 100% error. The code is shown in Appendix B.6.
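A minimal sketch of this strict rule on toy arrays (the full evaluation code is in Appendix B.6):

actual = [3, 0, 4, 10, 0]
predicted = [2, 1, 4, 5, 0]

errors = []
for a, p in zip(actual, predicted):
    if a > 0:
        errors.append(abs(a - p) / a)          # ordinary percentage error
    else:
        errors.append(1.0 if p > 0 else 0.0)   # strict rule: 100% error when the truth is 0
errors.sort()
print("median percentage error:", errors[len(errors) // 2])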
6.3 Compare output against hypothesis
This indicates that, by using our model, the company will be able to predict half of the customers' behavior within 50% error and almost all (99.5%) of the customers' behavior within 400% error. It shows that our hypothesis is positive but can be further tuned.
There are some special cases:
1. If the user is new, the prediction of demand will be made purely by the popu-
larity of the product and depot,
2. If the product is new, the prediction of demand will be made purely by the
popularity of the depot. However, this may not be suitable. We have not
further tuned this.
3. If the depot is new, the prediction of demand will be made purely by the
popularity of the product. We have not further tuned this.
4. If both product and depot are new, all features in the ABT will be 0. This
case is unpredictable.
6.4 Discussion
In the Kaggle challenge, Bimbo provides the sales information in week 3, 4, 5, 6, 7,
8, and 9. Bimbo asks challengers to predict the demand in week 10 and week 11.
Because we try to include as much data as possible in training, we used the previous 5 weeks as historical data. Bimbo asks the challengers to predict the demand of the next week and of the week after next, so two training samples are made. One training sample contains the week 3, 4, 5, 6, and 7 historical data as features and the week 8 demand as the target. The other sample contains the same features but the week 9 demand as the target. In this way, two models are trained: one predicts the next week's demand while the other predicts the demand of the week after next.
For the week 10 queries, we take the relevant information in weeks 5, 6, 7, 8, and 9 as features and use the first model to make the prediction. For the week 11 queries, we use the same features but the other model. Because the demand should always be greater than or equal to zero, we forced all negative predictions to zero. A negative regression output occurs only about once in 7 million predictions, which is not a big problem.
After submission, Kaggle evaluates the competition metric, the Root Mean Squared Logarithmic Error (RMSLE) [9].
The RMSLE is calculated as

ε = sqrt( (1/n) · Σ_{i=1}^{n} (log(p_i + 1) − log(a_i + 1))² )

where:
• ε is the RMSLE value (score),
• n is the total number of observations in the (public/private) data set,
• p_i is the prediction of demand,
• a_i is the actual demand for observation i, and
• log(x) is the natural logarithm of x.
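A minimal sketch of this metric in Python (toy numbers, not our submission code):

import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error, as defined above."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((np.log1p(p) - np.log1p(a)) ** 2))

print(rmsle([3, 0, 7, 2], [4, 0, 6, 2]))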
Figure 6.1 shows the scores we got in two submissions. We made the tree deeper
after the challenge deadline but did not get a better result.
Figure 6.1: Our Kaggle RMSLE for the public and private data sets
The public score is what we receive back upon each submission. The public score is determined from only a fraction of the test data set, usually between 25% and 33%. When the competition ends, Kaggle takes our selected submissions and scores the predictions against the remaining fraction of the test set, which gives the private score. Final competition results are based on the private score [10].
Figure 6.2 shows the ranks and the private scores of the winners, our team, and the sample submission.
We ranked 1263rd among 1969 competitors. The full private scoreboard can be seen at https://www.kaggle.com/c/grupo-bimbo-inventory-demand/leaderboard/private.
Future adjustments are discussed in Section 7.2.
Figure 6.2: Rank and private RMSLE of the winners, our team (highlighted), and the sample. The sample submission is just a constant prediction of demand at 7.
Chapter 7
Conclusion and Recommendation
7.1 Summary and conclusions
We extracted the current demand data and the historical data of the previous 3 weeks as the training sample. The model predicted the behavior of the customers in the next week. The R² of the demand prediction is 0.59 and the root mean squared error is 13.9. Because some customers have very large demands, we also calculated the root mean squared percentage error, which is 124.6%. The R² and the root mean squared percentage error indicate that it is a useful prediction.
7.2 Recommendations for future studies
1. We assumed that the demand of the current week only relates to the previous 3 weeks. This number could be adjusted from 1 to 7. By calculating R² = 1 − (residual sum of squares) / (total sum of squares), we could find out which hypothesis is the best;
2. We assumed that the demand does not relate to the route of the depot and the channel of the depot. That information could be added to the model.
3. We did not clean the cases where one person registered multiple user IDs. These should be cleaned.
4. Fine-tune the special cases.
5. We considered the co-occurrence of the depots but not their state and town; the state and town could be considered as well.
6. Try different models.
Bibliography
[1] B. Group, Maximize sales and minimize returns of bakery goods,
Wednesday, 8 June, 2016. [Online]. Available: https://www.kaggle.com/
c/grupo-bimbo-inventory-demand
[2] J. D. Kelleher, B. Mac Namee, and A. D’Arcy, Fundamentals of machine learn-
ing for predictive data analytics: algorithms, worked examples, and case studies.
MIT Press, 2015.
[3] O. Kotaro, “Predictive analytics solution for fresh food demand using
heterogeneous mixture learning technology,” NEC Technical Journal, vol. 10,
no. 1, 2015. [Online]. Available: http://www.nec.com/en/global/techrep/
journal/g15/n01/pdf/150117.pdf
[4] L. G. Maria Elena Nenni and L. Pirolo, “Demand forecasting
in the fashion industry: A review,” Int J Eng Bus Manag,
vol. 5, no. 37, 2013. [Online]. Available: http://www.intechopen.
com/journals/international journal of engineering business management/
demand-forecasting-in-the-fashion-industry-a-review
[5] T. G. Dietterich, Machine Learning for Sequential Data: A Review. [Online].
Available: http://web.engr.oregonstate.edu/%7Etgd/publications/mlsd-ssspr.
[6] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” CoRR,
vol. abs/1603.02754, 2016. [Online]. Available: http://arxiv.org/abs/1603.02754
[7] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica,
“Spark: Cluster computing with working sets,” in Proceedings of the 2Nd
USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud’10.
Berkeley, CA, USA: USENIX Association, 2010, pp. 10–10. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1863103.1863113
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Ma-
chine learning in Python,” Journal of Machine Learning Research, vol. 12, pp.
2825–2830, 2011.
[9] Kaggle. Evaluation. [Online]. Available: https://www.kaggle.com/c/
grupo-bimbo-inventory-demand/details/evaluation
[10] ——. Member faq. [Online]. Available: https://www.kaggle.com/wiki/
KaggleMemberFAQ
Appendix A
Flowchart
Figure A.1: Flowchart of the procedure: from the original database dump to the prediction.
Appendix B
Code
The code includes both the Scala and Python interfaces of Spark. The Scala code is implemented by Nikita Sonthalia. The Bash scripts and Python code are implemented by Jinzhong Zhang.
B.1 Split the training data week by week (Bash)
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
week=`echo $line | sed 's/,.*$//'`
echo "$line">>train_week$week.csv
done < "$1"
To run:
./Splitter.sh train.csv
B.2 The collection of user behavior (Scala - Spark)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object sparkweek3 {
def main(args: Array[String]){
val sc = new SparkContext(new SparkConf().setAppName("week"));
for( a <- 3 to 10){
val textFile = sc.textFile("MLprojectOutput/train_week"+a+".csv")
val counts = textFile.map(line => {
val token = line.split(",");
val key = token(4);
(key,(token(1),token(2),token(5),token(6),token(7),token(8),token(9),token(10)))}).groupByKey;
counts.coalesce(1).saveAsTextFile("MLprojectOutput/week"+a+"objectoutput")
}
}
}
B.3 The calculation of co-occurrence matrix of prod-
ucts and depots (Scala - Spark)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
object Cooccurrence {
def main(args: Array[String]){
val sc = new SparkContext(new SparkConf().setAppName("matrix")
.setMaster("local[7]")
.set("spark.executor.memory", "4g")
.set("spark.driver.memory","5g"));
for( a <- 3 to 10){
val textFile = sc.textFile("MLprojectOutput/train_week"+a+".csv")
val pro = textFile.map(line => {
val token = line.split(",");
val key = token(4);
// For product matrix
val valueP = token(5) +"#"+ token(10);
(key,valueP)}).groupByKey;
val dep= textFile.map(line => {
val token = line.split(",");
val key = token(4);
//For depot matrix
val valueD = token(1) +"#"+ token(10);
(key,valueD)}).groupByKey;
pro.coalesce(1).saveAsObjectFile("MLprojectOutput/week"+a+"Pobjectoutput")
dep.coalesce(1).saveAsObjectFile("MLprojectOutput/week"+a+"Dobjectoutput")
val valuefileP = sc.objectFile[(String, Seq[String])]("MLprojectOutput/week"+a+"Pobjectoutput/part-00000");
val valuefileD = sc.objectFile[(String, Seq[String])]("MLprojectOutput/week"+a+"Dobjectoutput/part-00000");
val outputP= valuefileP.persist(StorageLevel.MEMORY_AND_DISK_SER).flatMap( {
case (userid, values) =>
{
var productIds = values.map(value=>value.split("#")(0));
var demand = values.map(value=>value.split("#")(1));
productIds.combinations(2).map(
pairs => {
{(pairs.mkString("#"), 1)}
})
}}).reduceByKey(_ + _);
val outputD= valuefileD.persist(StorageLevel.MEMORY_AND_DISK_SER).flatMap( {
case (userid, values) =>
{
var productIds = values.map(value=>value.split("#")(0));
var demand = values.map(value=>value.split("#")(1));
productIds.combinations(2).map(
pairs => {
{(pairs.mkString("#"), 1)}
})
}}).reduceByKey(_ + _);
outputP.coalesce(1).saveAsTextFile("MLprojectOutput/week"+a+"ProductMatrix")
outputD.coalesce(1).saveAsTextFile("MLprojectOutput/week"+a+"DepotMatrix")
}
}
}
B.4 The calculation of popularity of each product
and each depot (Python - Spark)
from pyspark import SparkContext, SparkConf
def parser(line, i):
    tokens = line.split(',')
    return (int(tokens[i]), int(tokens[-1]))

if __name__ == "__main__":
    conf = SparkConf()
    sc = SparkContext(conf=conf)
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)
    for i in range(3, 10):
        prod = (sc.textFile("train_week{0}.csv".format(i))
                .map(lambda line: parser(line, 5)).reduceByKey(lambda a, b: a + b))
        prod.coalesce(1).saveAsTextFile("MLprojectOutput/week{0}ProductPopularity".format(i))
        depot = (sc.textFile("train_week{0}.csv".format(i))
                 .map(lambda line: parser(line, 1)).reduceByKey(lambda a, b: a + b))
        depot.coalesce(1).saveAsTextFile("MLprojectOutput/week{0}DepotPopularity".format(i))
B.5 Summary to the Analytic Based Table (Python
- Spark)
import sys, os
from pyspark import SparkContext, SparkConf
from ast import literal_eval
import numpy as np
TRAIN_WEEKS = [3,4,5,6]
def parse(x):
res = x[1:-1].split(',CompactBuffer')
products = literal_eval(res[1][:-2]+',)')
n_products = {}
n_depots = {}
for product in products:
try:
product_ID = product[2]
depot_ID = product[0]
demand = product[-1]
if product_ID not in n_products:
n_products[product_ID] = demand
else:
n_products[product_ID] += demand
if depot_ID not in n_depots:
n_depots[depot_ID] = demand
else:
n_depots[depot_ID] += demand
except:
sys.stdout.write("{0}\n".format(products))
raise
return (int(res[0]), n_products, n_depots)
def load_customer(filename):
prod_dict={}
depot_dict={}
with open(filename, 'r') as f:
lines = f.readlines()
tot_lines = len(lines)
iline = 0
for line in lines:
userID, products, depots = parse(line)
prod_dict[userID] = products
depot_dict[userID] = depots
iline += 1
if iline%1000==0:
sys.stdout.write("\rRead {0}/{1} lines from {2}"
.format(iline, tot_lines, filename))
sys.stdout.flush()
sys.stdout.write("\n")
return prod_dict, depot_dict
def load_occurrence_matrix(filename, item_dict={}):
with open(filename, 'r') as f:
lines = f.readlines()
tot_lines = len(lines)
iline = 0
for line in lines:
items, weight = line[1:-2].split(',')
item1, item2 = items.split('#')
item1 = int(item1)
item2 = int(item2)
weight = int(weight)
# create a bi-direction search dictionary
if item1 not in item_dict:
item_dict[item1] = {item2:weight}
elif item2 not in item_dict[item1]:
item_dict[item1][item2] = weight
else:
item_dict[item1][item2] += weight
if item2 not in item_dict:
item_dict[item2] = {item1:weight}
elif item1 not in item_dict[item2]:
item_dict[item2][item1] = weight
else:
item_dict[item2][item1] += weight
iline += 1
if iline%1000==0:
sys.stdout.write("\rRead {0}/{1} lines from {2}"
.format(iline, tot_lines, filename))
sys.stdout.flush()
sys.stdout.write("\n")
return item_dict
def load_popularity(filename, item_pop={}):
with open(filename, 'r') as f:
for line in f:
item, pop = line[1:-2].split(', ')
item = int(item)
if item not in item_pop:
item_pop[item] = int(pop)
else:
item_pop[item] += int(pop)
sys.stdout.write("Read popularity from {0}, done.\n".format(filename))
return item_pop
N_HISTORY_WEEKS = len(TRAIN_WEEKS)-1
weeks_prod = [{}]*N_HISTORY_WEEKS
weeks_depot = [{}]*N_HISTORY_WEEKS
prod_occurrence = {}
depot_occurrence = {}
product_popularity = {}
depot_popularity = {}
for i, week in enumerate(TRAIN_WEEKS[:-1]):
weeks_prod[i], weeks_depot[i] = load_customer("MLprojectOutput/week{0}objectoutput/part-00000".format(week))
prod_occurrence = load_occurrence_matrix("MLprojectOutput/week{0}ProductMatrix/part-00000".format(week), prod_occurrence)
depot_occurrence = load_occurrence_matrix("MLprojectOutput/week{0}DepotMatrix/part-00000".format(week), depot_occurrence)
product_popularity = load_popularity("MLprojectOutput/week{0}ProductPopularity/part-00000".format(week), product_popularity)
depot_popularity = load_popularity("MLprojectOutput/week{0}DepotPopularity/part-00000".format(week), depot_popularity)
MAX_RELATIVE_PRODUCTS = 3
MAX_RELATIVE_DEPOTS = 1
def createSample(line):
token = line.split(",")
userID = int(token[4])
product = int(token[5])
depot = int(token[1])
try:
demand = int(token[-1])
except ValueError:
sys.stdout.write("-{0}-\n".format(token[-1]))
try:
demand = int(token[-1][:-1])
except ValueError:
sys.stdout.write("..........................")
demand = 0
row = []
for i in range(N_HISTORY_WEEKS):
n_prod = 0
n_rel_prod = []
n_depot = 0
n_rel_depot = []
# product and relative products
week_prod = weeks_prod[i]
if userID in week_prod:
shopping_list = week_prod[userID]
if product in shopping_list:
n_prod = shopping_list[product]
if product in prod_occurrence:
relative_prods = prod_occurrence[product]
for prod, number in shopping_list.items():
if prod in relative_prods:
n_rel_prod.append(relative_prods[prod]*number)
n_rel_prod.sort(reverse=True)
# depot and relative depots
week_depot = weeks_depot[i]
if userID in week_depot:
depot_list = week_depot[userID]
if depot in depot_list:
n_depot = depot_list[depot]
if depot in depot_occurrence:
relative_depots = depot_occurrence[depot]
# Use a separate loop variable so the query 'depot' is not overwritten.
for dep, number in depot_list.items():
if dep in relative_depots:
n_rel_depot.append(relative_depots[dep]*number)
n_rel_depot.sort(reverse=True)
# fulfill the remaining entries
n = len(n_rel_prod)
if n > MAX_RELATIVE_PRODUCTS:
n_rel_prod=n_rel_prod[:MAX_RELATIVE_PRODUCTS]
else:
n_rel_prod.extend([0]*(MAX_RELATIVE_PRODUCTS-n))
n = len(n_rel_depot)
if n > MAX_RELATIVE_DEPOTS:
n_rel_depot=n_rel_depot[:MAX_RELATIVE_DEPOTS]
else:
n_rel_depot.extend([0]*(MAX_RELATIVE_DEPOTS-n))
# put the data in row
row.append(n_prod)
row.extend(n_rel_prod)
row.append(n_depot)
row.extend(n_rel_depot)
#calculate the popularities
if product in product_popularity:
prod_pop = product_popularity[product]
else:
prod_pop = 0
if depot in depot_popularity:
depot_pop = depot_popularity[depot]
else:
depot_pop = 0
row.extend([prod_pop,depot_pop,demand])
return row
if __name__ == "__main__":
os.system("rm -rf MLprojectOutput/week{0}Formated".format(TRAIN_WEEKS[-1]))
conf = SparkConf()
sc = SparkContext(conf=conf)
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN )
logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN )
formated_data = sc.textFile("train_week{0}.csv"
.format(TRAIN_WEEKS[-1])).map(createSample)
formated_data.coalesce(1).saveAsTextFile("MLprojectOutput/week{0}Formated".format(TRAIN_WEEKS[-1]))
B.6 Build the model and calculate the validation
accuracy
##############################################################
# xgboost sklearn to train and test the model
# Author: Jinzhong Zhang
###############################################################
import xgboost as xgb
import csv, sys, math
import numpy as np
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.metrics import *
from sklearn.externals import joblib
def readData(filename, start_row=0, end_row=-1):
irow = -1
data = []
with open(filename, 'r') as f_handle:
for row in f_handle:
irow += 1
if irow<start_row:
continue
elif irow>end_row and end_row>0:
break
data.append([np.float64(x) for x in row[1:-2].split(',')])
if irow%1000==0:
sys.stdout.write("\rRead {0} lines from {1}".format(irow, filename))
sys.stdout.flush()
data = np.array(data)
return (data[:,:-1],data[:,-1])
def evaluate(data, pred):
print("R2=", r2_score(data,pred))
print("Mean Squared Error=", mean_squared_error(data,pred))
errors = []
sum_ = 0
total_ = 0
for i, v in enumerate(data):
if v>0:
errors.append(abs(v-pred[i])/v)
else:
errors.append(1.0 if pred[i]>0 else 0)
sum_ += errors[-1]**2
total_ += v-pred[i]
errors.sort()
print ("mean=",math.sqrt(sum_/len(pred)))
print ("total=",total_)
print ("50 Percentile=", errors[int(len(pred)*0.5)], ", 75 Percentile=", errors[int(len(pred)*0.75)] )
print ("90 Percentile=", errors[int(len(pred)*0.9)], ", 99.5 Percentile=", errors[int(len(pred)*0.995)] )
def train(mode):
if mode == "NextWeek":
DATA = "MLprojectOutput/week34567to8Formated/part-00000"
else:
DATA = "MLprojectOutput/week34567to9Formated/part-00000"
X, Y = readData(DATA, 10000, -1)
X_Scaler = MinMaxScaler().fit(X)
joblib.dump(X_Scaler, 'Predict{0}_Scaler.pkl'.format(mode))
X = X_Scaler.transform(X)
dtrain = xgb.DMatrix(X, label = Y)
param = { 'booster':"gbtree",
'eta':0.3,
'max_depth':6,
'subsample':0.85,
'colsample_bytree':0.7,
'silent':0,
'objective':'reg:linear',
'nthread':10,
'eval_metric':'rmse'}
__model = xgb.train(param.items(), dtrain)
__model.save_model('Predict{0}.model'.format(mode))
X_TEST, Y_TEST = readData(DATA, 0, 10000)
X_TEST = X_Scaler.transform(X_TEST)
dtest = xgb.DMatrix(X_TEST)
Y_pred = list(map(lambda x: int(x), __model.predict(dtest)))
evaluate(Y_TEST,Y_pred)
if __name__ == '__main__':
train('NextWeek')
train('NextNextWeek')
B.7 Make Predictions
##############################################################
# xgboost sklearn to train and test the model
# Author: Jinzhong Zhang
###############################################################
import xgboost as xgb
import csv, sys
import numpy as np
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.metrics import *
from sklearn.externals import joblib
def readTestData(filename, nrow=100):
irow = 0
data = []
with open(filename, 'r') as f_handle:
for row in f_handle:
data.append([np.float64(x) for x in row[1:-2].split(',')])
irow += 1
if irow%1000==0:
sys.stdout.write("\rRead {0} lines from {1}".format(irow, filename))
sys.stdout.flush()
if irow>=nrow and nrow>0:
break
data = np.array(data)
sys.stdout.write("\n")
return (data[:,0].astype(int), data[:,1:])
def reformat(pred):
y=int(pred)
if y<0:
y=0
return y
def predict(DATA, mode):
IDs, test_X = readTestData(DATA, -1)
X_Scaler = joblib.load('Predict{0}_Scaler.pkl'.format(mode))
__model = xgb.Booster({'nthread':4})  # init model
__model.load_model("Predict{0}.model".format(mode)) # load data
test_X = X_Scaler.transform(test_X)
dtest = xgb.DMatrix(test_X)
return (IDs, list(map(lambda x: reformat(x), __model.predict(dtest))))
if __name__ == '__main__':
IDs_1, pred_1 = predict("MLprojectOutput/week56789to10Formated/part-00000","NextWeek")
# Week 11 queries use the second model, trained on the week-after-next target.
IDs_2, pred_2 = predict("MLprojectOutput/week56789to11Formated/part-00000","NextNextWeek")
writer = csv.writer(open("submission.csv", "w"))
writer.writerow(['id','Demanda_uni_equil'])
writer.writerows(zip(np.append(IDs_1,IDs_2), pred_1+pred_2))