using scanner data of supermarkets to simplify the

21
Using scanner data of supermarkets to simplify the calculation of the consumer price index Lynette Joosten 11290390 Master thesis Credits: 18 EC Master Information Studies, track Data Science University of Amsterdam Faculty of Science Science Park 904 1098 XH Amsterdam Supervisor Dr. E. Kanoulas Informatics Institute Faculty of Science University of Amsterdam Science Park 904 1098 XH Amsterdam 16-06-2017

Upload: others

Post on 04-Oct-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using scanner data of supermarkets to simplify the

Using scanner data of supermarkets tosimplify the calculation of the consumer

price index

Lynette Joosten11290390

Master thesisCredits: 18 EC

Master Information Studies, track Data Science

University of AmsterdamFaculty of ScienceScience Park 904

1098 XH Amsterdam

SupervisorDr. E. Kanoulas

Informatics InstituteFaculty of Science

University of AmsterdamScience Park 904

1098 XH Amsterdam

16-06-2017

Page 2: Using scanner data of supermarkets to simplify the

Abstract

The Consumer Price Index (CPI) measures the price development of all consumption expenditureof consumers, and is the main measurement of inflation. Scanner data of supermarkets has beenused in the calculation of the CPI since 2002, but poses a lot of challenges. To answer the re-search question ”Can a simple method be created to calculate the CPI to overcome thechallenges posed by scanner data of supermarkets?”, three challenges of scanner data wereexplored in this research: aggregation, classification and weights of items. Five experiments wereconducted to construct price indexes in a simple way, by using aggregation over all items, aggrega-tion on COICOP and ISBA level, constructing weighting schemes, and using linear regression withL1 regularization, but not one of the newly constructed price indexes was close to the price indexconstructed by CBS. The conclusion is thus: these simple methods were not sufficient to calculatethe CPI based on scanner data of supermarkets.

Page 3: Using scanner data of supermarkets to simplify the

AcknowledgementsFirstly, I would like to thank my thesis supervisor Dr. Evangelos Kanoulas of the Informatics Insti-tute at the University of Amsterdam. By providing regular feedback and asking critical questions,he has helped me to find the right path in writing this thesis.

Secondly, I would also like to thank my supervisor at CBS, Rosa Meijer, who challenged mewith all the right questions which made me improve my thesis. Thirdly, I would like to thank PietDaas from CBS, for providing me with this research problem and his help with all practical issuesI encountered.

And last but not least, I would like to thank Tom and my family for all the moral support andtheir extensive feedback.

Lynette Joosten16-06-2017

Page 4: Using scanner data of supermarkets to simplify the

Contents

1 Introduction 1

2 Literature research 22.1 Scanner data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2 Calculation of Dutch CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3 Data 4

4 Experimental setup 54.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74.3 Differences in methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

5 Results and analysis 85.1 Experiment: base price index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85.2 Experiment: classification on ISBA and COICOP level . . . . . . . . . . . . . . . . 95.3 Experiment: weighted index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105.4 Experiment: subset of COICOPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

6 Conclusion and discussion 146.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

References

Appendix A Ten chosen ISBAs and COICOPs

Page 5: Using scanner data of supermarkets to simplify the

1. Introduction

The Consumer Price Index (CPI) measures the price development of all consumption expenditureof consumers, and is the main measurement of inflation [1]. It includes:

Food, durable goods such as household appliances and cars, energy and housing costsand a number of government services. Also included in the CPI are consumption-relatedtaxes such as motor vehicle tax and municipal taxes such as sewerage charges [2].

The total CPI consists of 100,000 points, which are divided between the products and servicesbased on National Accounts and budget research by Statistics Netherlands (from now on CBS,Central Bureau of Statistics) [2]. The calculation of the CPI was initially based on a fixed basketof goods. The prices of this basket of goods were collected through a telephone survey or by visitingshops to record the prices. In the seventies, barcode scanners were introduced at shops, which alsomeant that transaction data (from now on scanner data) were recorded electronically [3]. Thescope of this research is scanner data obtained from supermarkets. Supermarkets make up 5% ofthe CPI [4]. Scanner data of two supermarket chains has been used in the Dutch CPI since 2002,and since 2013 supermarket surveys have not been carried out anymore in the Netherlands. Thenumber of supermarket chains that supplied scanner data grew to ten in 2015 [5]. Scanner datais not only provided by supermarket chains, but also by department stores, do-it-yourself stores,and many other stores [6].Multiple methods can be applied to use scanner data in the calculation of the CPI [3]. Theseare based on the methods and formulas used in the days of sampling the basket of goods inthe supermarket itself. Since the first methods were devised by statistics bureaus, technology haschanged a lot. To analyse the scanner data from supermarkets, CBS uses the Jevons method whichis recommended by the International CPI Manual when expenditure or quantity information is notavailable [7], even though this information is available in scanner data from supermarkets. Chessa,Verborg and Willenborg determine in their research that unweighted methods such as the Jevonsmethod should not be used with scanner data [3]. The Laspeyres method is used by CBS forscanner data of other retailers, although this method is known to lead to bias in the CPI. CBS isconducting research to replace the Jevons and Laspeyres methods with one method able to processdata from all retailers. Current methods are not able to process all data, except for multilateralmethods, which are typically used to compare price levels across regions or over time, and yieldtransitive price comparisons [3]. With multilateral methods the price indices of past periods arealtered for every new publication period, which poses a problem since published index numberscan not be revised [3]. One way to deal with this problem is to chain the different indices overthe months, but this leads to chain drift, when a lot of fluctuation in prices occurs over a shortperiod, which is the case with scanner data [7]. Although multilateral methods seem the way togo, there are still some challenges. Thus, at the moment no consensus exists on the best methodfor calculating the CPI based on scanner data. CBS is looking into incorporating a multilateralmethod into their calculation of the CPI.Chain drift, which is caused by items that are on sale, is not the only challenge scanner data posesto the current methods [8]. One of the other challenges for example is on which level to use thescanner data: should there be aggregation over stores and item groups? More challenges will beconsidered in the literature research.CBS launched the Center for Big Data Statistics in September 2016 as a first step to incorporatebig data practices into their research and work [9]. One of their research subjects is whether it ispossible to create a simple method to calculate the CPI based on scanner data from supermarkets,which led to this thesis. All current methods have some problems, which are solved by adding morecomplexity to the methods, as seen for example with multilateral methods and chaining them. Theaim of this research is to see if a simple method can be created, based on one formula, to calculatethe Dutch CPI based on scanner data from supermarkets. Using scanner data can increase theaccuracy of the CPI [3], but at the moment CBS does not even use all scanner data in their Jevonsmethod. A simple method using all scanner data can thus lead to a more accurate and even a moreefficient calculation of the CPI, if some complexity is removed from the calculation. This leads tothe following research question:

1

Page 6: Using scanner data of supermarkets to simplify the

Can a simple method be created to calculate the CPI to overcome the challengesposed by scanner data of supermarkets?

In section 2 literature research has been conducted about the current methods used by CBS,attributes of scanner data and challenges of scanner data. In section 3 the data is described,followed by section 4 where the experimental setup is described. Section 5 contains the results andanalysis of all conducted experiments. The conclusion, discussion and some ideas for future workcan be found in section 6.

2. Literature research

2.1 Scanner data

Scanner data contains all transactions of all goods sold, the prices paid by consumers, and thequantities sold for each item. When the basket of goods was collected through surveys, the in-formation about quantities sold was not available. Furthermore, that basket of goods was just asmall sample of all available goods, while scanner data consists of all goods [3]. Cost efficiencyis an important benefit of scanner data: visiting supermarkets to collect the basket of goods canbe very expensive for statistics bureaus. In The Netherlands using scanner data attributes to areduction of 15,000 price quotes each month [4]. Although scanner data sounds promising, in 2016only seven European countries used scanner data in compiling their CPI: Belgium, Denmark, theNetherlands, Norway, Sweden and Switzerland. This can be explained by the fact that it is along process to initiate contact with a retailer, obtaining the data, and implementing these datainto the CPI [3]. The awareness of scanner data is growing however, among others by the recentEurostat guidelines for scanner data acquisition and processing, and through yearly workshopsabout scanner data [3].

For each item in the scanner data, a unit value is composed by taking the average price paid byconsumers for that item in a certain period. The period depends on how often the scanner data issupplied. The items are identified by their Global Trade Item Number (GTIN)1, an identifier fortrade items [3]. Goods and services are classified using the COICOP system, which is publishedby the UN Statistics division. There are fourteen divisions in the COICOP, with subdivisions intogroups and classes [10].

2.2 Calculation of Dutch CPI

In 2006 CBS started revising their CPI. The old method used before 2010 is explained by Van derGrient and De Haan:

In the old method a large sample of items was selected at the beginning of the yearwhich was representative for the previous year. Each item is identified by the EuropeanArticle Number (EAN) and was given a weight representing its relative importance,i.e. its turnover share within the supermarket chain. The monthly price index foran item was calculated as ratio of the unit value in the current month and the unitvalue in the base year. Next, elementary price indexes were computed for each 4 digitCOICOP group as a weighted average of the constituting items’ indexes. That is,during a calendar year the product category price indexes were calculated accordingto the Laspeyres formula. At each level of aggregation these short-term indexes weresubsequently chained in December to create long-term index series [4].

However, the construction of the item sample was a very time-consuming practice for CBS.Van der Grient and De Haan explain that it would not be feasible to use this method for thescanner data for more than two supermarket chains, which is why they proposed a new method.The Jevons formula was chosen, since this formula does not lead to chain drift [4].

πy,m/y,m−1a =

n(y,m−1),(y,m)a ∏

i=1

(py,mi

py,m−1i

)1/n(y,m−1),(y,m)a

(2.1)

1GTIN is the same as EAN, which is used by CBS

2

Page 7: Using scanner data of supermarkets to simplify the

Formula 2.1 is a month-to-month Jevons price index, as proposed by Van der Grient and De Haan[4], where the price change is calculated for elementary aggregate a, which is either a productgroup on COICOP level or a product group on EAN level, over a month in a certain year y,m.

n(y,m−1),(y,m)a is the amount of matched items between months m and m-1 of year y. The price

changes are then chained using formula 2.2, where Py,m/y0,m0a is the price index going from the

starting month m to month m-1 of year y [4].

P y,m/y0,m0a = P y,m−1/y0,m0

a ∗ πy,m/y,m−1a (2.2)

Since the Jevons formula does not use a weighting scheme and weights are available in scannerdata in terms of expenditure shares, CBS introduced an implicit weighting scheme by using cut-offsampling:

Important items of an elementary aggregate are included in the sample with certaintywhereas unimportant items are excluded. [...] an item will be used in the computationof the index between two consecutive months if its average expenditure share (withrespect to the set of matched items) in those months is above a certain threshold value[4].

The above mentioned threshold value is 1.25. The Jevons formula furthermore does not take itemsinto account if they are temporarily unobserved, which is why missing prices are imputed by CBSby using the last available price. The Dutch CPI figures are published in the first week of montht+1, which leaves not enough time to use the data of the last week of month t, thus the CPIis calculated by taking the first full three weeks of a month t [4]. Furthermore, the prices aresubjected to two data-cleaning processes [4]:

Firstly, month-to-month price changes of a factor greater than 4 are considered implau-sible and declared invalid. Thus, items for which the current price is 300% higher of75% lower than the price in the previous month will be deleted. Secondly, an algorithm,referred to as a dumping filter, has been developed to exclude items from the compu-tation which exhibit a strong price decrease in combination with a strong decrease inexpenditures. ‘Dumping’ occasionally occurs in case of stock clearances when an itemis sold at an extraordinary low price.

The biggest differences between the old and new method are the prices (shelf price versus averagetransaction price), the sample size (small in a survey compared to scanner data), and the indexnumber formulae [4]. CBS is still looking to develop a better method for working with scannerdata, as described by Chessa et al. in their paper at the end of 2016 [3].

2.3 Challenges

Scanner data leads to highly volatile estimates of price changes, since scanner data records thechanges in price which lead to large and frequent shifts in quantities purchased, for example when aproduct is on sale [11]. A solution to this high volatility is to use aggregation, which is a challengeby itself. Aggregation is “the calculation of average prices and total quantities over some unitsuch as time, space, or entity” [8]. There are multiple ways scanner data can be aggregated. Whenitems are aggregated it is implied that these items are homogeneous and thus perfect substitutes foreach other. It is important to define when items are considered homogeneous. Over time, scannerdata can be aggregated quarterly, monthly, weekly or daily. Scanner data can also be aggregatedover product groups or over different stores or chains of stores [8]. Which aggregation methodis chosen, has a big impact on the resulting price index [12]. Before items can be aggregated,the level on which an item is defined must be chosen, which is another challenge. Items areusually identified by their EAN, which can prove to be too detailed [13], [14]. The COICOPclassification system can also be used to identify items, but this classification may not be detailedenough. Additional information, for example internal product codes from the retailer, is needed todetermine at what level of detail items are used in the price index computation [14]. According toDe Haan, Willenborg and Chessa: ”Index number theory suggests that items should be weightedaccording to economic importance. As weighting information (quantities or expenditures) is alwaysavailable in scanner data, it ought to be used.” They have noticed that not all statistics agencieshave incorporated weighting in their index numbers, even though the data is readily available [3].

3

Page 8: Using scanner data of supermarkets to simplify the

Expenditure shares are considered a useful measure of economic importance [6]. Another challengeis posed by consumers returning items. Ideally, returned items are part of the data supplied bythe retailer, so the correct turnover can be calculated [14]. An important challenge is the fact thatnew items appear and other items get discontinued or relaunched. If an item is discontinuedand replaced by a new item, there might be a quality change between these products [15].The Jevons formula used by CBS does not use an explicit weighting scheme and cannot handlenew items [3]. To account for the high volatility of scanner data the two data-cleaning processesmentioned above are in place. CBS uses monthly aggregation over chains of stores. They have notyet found a way to work with consumers returning items [4].The challenges that come from aggregation, the level on which an item is defined and the weightof items are explored in this research.

3. DataThe data provided by CBS consists of all transactions of the supermarket chain Albert Heijn(AH from now on) from 2015 and 2016. For each week two relevant files were provided: one filecontaining all transactions and one file containing all item information on the items sold in thatweek. The first file consists of the information found in Table 3.1.

Week Between 01 and 52Year Either 2015 or 2016Code of the store Either 1, 2, 3, 4 or 5Description of the store Storecode 1 corresponds to Albert (online supermar-

ket by AH), 2 corresponds to AH To Go franchises,3 corresponds to AH To Go stores owned by Ahold(the parent company of AH), 4 corresponds to AHfranchises, and 5 corresponds to AH stores owned byAhold

EAN identifier of the item Code made up by 13 numbersWas the item on sale? Depicted by ’J’ for yes and ’N’ for noTurnover Turnover of the itemAmount Amount of items soldAmount grams Amount of grams of the item sold

Table 3.1: Data contained in first file.

The second file consists of the following information: EAN identifier, ID of the item as given byAH, description of the item, volume of the item, unit in which the item is sold, ISBA classification,description of the ISBA, and three more classification groups and their descriptions.

For each week the transactions and item information file were merged on the EAN identifier.These files were then combined into one big file containing all weeks of 2015 and 2016. The followingcolumns were deemed irrelevant or superfluous and thus deleted: the store description, the volumeand unit of the items, the amount of grams of the items, and the three other classification groupsand their descriptions. Two columns were added: the price, which was calculated by dividing theturnover for each item by the amount sold of each item, and the corresponding month to eachweek.

CBS furthermore provided two of their constructed price indexes: one on ISBA level and oneon COICOP level. These indexes were based on 12-2014, so they were resampled to be based on01-2015. These indexes consists of the following information: month, year, identifier (either ISBAor COICOP), code of the store, computed index. The COICOP index also had a column for theweight of that COICOP, which is the same for each month in a certain year, but changes the nextyear.

4

Page 9: Using scanner data of supermarkets to simplify the

4. Experimental setup

4.1 Experiments

As mentioned in the literature research, the challenges that come from aggregation, classificationof items and weights of items are input for the experiments. Since the obtained data does notcontain information about disappearing, new and return items, it is not possible to conduct anyexperiments on this subject. The other challenges are out of the scope of this research.

Using formula 4.1 the average price of all items was calculated per month to explore the chal-lenge of aggregation. All formulas and a description per formula can be found in Table 4.1, whileTable 4.2 contains an explanation of the notation of the formulas. The average price was firstcalculated per EAN by taking the total turnover of that EAN in that month divided by the totalamount of items sold of that EAN in that month, after which these average prices were averagedper month.

Pt =1

n

n∑i=1

Pit (4.1)

Formula 4.2 was then used to create a base price index, where Pt is respectively the average priceof all items of the base month, which is either the previous month or January 2015, and P ′t is theaverage price of all items of the current month.

Xt =P ′t − Pt

Pt∗ 100 + 100 (4.2)

Two indexes were constructed in this way, one based on the change compared to January 2015 andone where the change was compared to the previous month. These two newly constructed indexeswere compared to the indexes provided by CBS. Since the CBS indexes contained index numbersfor respectively all COICOPs and ISBAs, the CBS indexes were averaged over all these COICOPsand ISBAs per month to get one index number per month on either COICOP or ISBA level.

One would expect that indexes on a more detailed level would yield better results, which iswhy ten COICOPs and ten ISBAs were chosen, after which the average price for each COICOPand ISBA was calculated per month by coupling the COICOP and ISBA classifications to thescanner data and using formula 4.1 for the COICOPs and ISBAs. Formula 4.2 was then used tocreate indexes based on January 2015 for each of these COICOPs and ISBAs. Twenty indexeswere created in total. These twenty indexes were compared to the corresponding COICOP andISBA indexes provided by CBS to explore the challenge of classification.

Weighting information such as expenditure shares is deemed useful. To test whether addingweights to items leads to a very different index, formula 4.3 was used to calculate the weightedprice of an item based on amount of an item sold in each month. The same formula was used tocalculate the weighted price of an item based on the turnover of an item in each month.

Pt =1

n

n∑i∈C

(Pit ∗Ait∑Ait

) (4.3)

These weighted prices were then used to create two indexes, both using formula 4.2. Only itemsthat were part of one of 57 COICOPs that were complete from 01-2015 to 12-2016 in both the CBSCOICOP index and the scanner data were used in this experiment. The CBS index on COICOPlevel used in the previous experiments was unweighted, but was modified in this experiment tocontain the weights provided by CBS. The two newly constructed weighted indexes were comparedto the CBS weighted index. A third index was newly constructed by combining the weight providedby CBS with the scanner data, after which this index was also compared to the CBS weightedindex.

Lastly, a linear regression model using L1 regularization was constructed to determine whichsubset of COICOPs could be used to explain the consumer price index [16]. Two linear regression

5

Page 10: Using scanner data of supermarkets to simplify the

models were created: one with the average prices of the COICOPs per month as features, and onewith the CBS price indexes of the COICOPs per month as features. The data was split in a train(67% or 18 months) and test set (33% or 6 months). The Lasso model was used in Python toconduct the L1 regularization. The coefficients and intercepts obtained from the L1 regularizationwere used to create two new weighted indexes, one based on the coefficients of the average pricesand one based on the coefficients of the price indexes. The same 57 COICOPs as used in theweighted indexes were used in the linear regression models.

DescriptionAverage price in

month tIndex formula

EAN index

Based onaverage price of

all items permonth

Pt =1

n

n∑i=1

Pit

(4.4)

Xt =P ′t − Pt

Pt∗100+100

(4.5)

ISBAindex

Based onaverage price ofone ISBA per

month

Pt =1

|S|∑i∈S

Pit

(4.6)

Xt =P ′t − Pt

Pt∗100+100

(4.7)

COICOPindex

Based onaverage price ofone COICOP

per month

Pt =1

|C|∑i∈C

Pit

(4.8)

Xt =P ′t − Pt

Pt∗100+100

(4.9)

Weightedindex -Amount

Based onaverage price of

all items permonth, weight

based onamount of items

sold

Pt =∑i∈C

(Pit∗Ait∑Ait

)

(4.10)

Xt =P ′t − Pt

Pt∗100+100

(4.11)

Weightedindex -

Turnover

Based onaverage price of

all items permonth, weight

based onturnover of

items

Pt =∑i∈C

(Pit∗Uit∑Uit

)

(4.12)

Xt =P ′t − Pt

Pt∗100+100

(4.13)

Combinedweighted

index

Weightsprovided by

CBS, based onaverage price of

all items

Pt =∑i∈C

Pict∗Wct∑Wt

(4.14)

Xt =P ′t − Pt

Pt∗100+100

(4.15)

CBS ISBAindex

Provided byCBS, averagedover months

Xt =1

n

n∑s=1

Xst

(4.16)

CBSCOICOP

index

Provided byCBS, averagedover months

Xt =1

n

n∑c=1

Xct

(4.17)CBS

weightedCOICOP

index

Provided byCBS, weights

added, averagedover months

X ′t =1

NwctXt ∗

Wct∑Wt

(4.18)

Table 4.1: Formulas used in experiments

6

Page 11: Using scanner data of supermarkets to simplify the

Symbol MeaningPt Average price in month t, in index formula January 2015n Amount of itemsi ItemPit Price of item i in month tXt Index number for month tP ′t Average price in month t to compare with January 2015Xst Index of ISBA in month tS Set of items with ISBAXct Index of COICOP in month tC Set of items with COICOP

Nwct Amount of weighted COICOPs in month tWct Weight of COICOP c in month t∑Wt Sum of all weights in month t

Ait Amount of item i sold in month tUit Turnover of item i in month t

Table 4.2: Symbols used of formulas used in experiments

4.2 Evaluation

There is not one clear way to compare price indexes. The following formula used to evaluatethe indexes is used internally at CBS. The newly computed indexes were evaluated by taking themutations of each month between 2015 and 2016, after which the average of these mutations werecalculated, also called the Mean Absolute Error. This average was compared to the average of theprice index of CBS, as can be seen in the following formula, where i1 corresponds to the first indexand i2 corresponds to the second index, and 2016/01, 2015/01 etc. correspond to January 2016,January 2015 etc.:

(|(2016/01i1 − 2015/01i1)− (2016/01i2 − 2015/01i2)|+ |(2016/02i1 − 2015/02i1)

−(2016/02i2 − 2015/02i2)|+ ...+ |(2016/12i1 − 2015/12i1)− (2016/12i2 − 2015/12i2)|)/12

(4.19)

The newly computed index was considered similar to the index of CBS when the absolutedifference between the mean of monthly mutations is equal to or below 0.01. The indexes werefurthermore analysed using graphical analysis.

4.3 Differences in methods

First of all, CBS uses a threshold of 1.25 to determine which items will be taken into account inthe consumer price index. The expenditure share of an item has to be over the threshold for twoconsecutive months to include the item in the calculation of the consumer price index. In all newlyconstructed indexes no threshold is used.Secondly, only the first three full weeks of a month are used in the calculation by CBS. Theypublish the consumer price index at the first week of the next month, which means they do nothave time to collect the last week of each month. In the newly constructed indexes, all weeks ofall months are used.Thirdly, if items have a price increase greater than 4, they are considered invalid by CBS and thusexcluded. Furthermore, if the price of an item decreases strongly, just as the expenditure shareof that item, CBS considers the item as being ”dumped”, which means that the supermarket isprobably clearing their stock at extremely low prices. These dumped items are also excluded fromthe calculation of the consumer price index. In the newly constructed indexes, items with a largeprice increase or items that are being dumped are all taken into account.Fourthly, items that are grouped together by the supermarket chain for a certain holiday or aim(barbecue products for a barbecue in summer time), can not be coupled to a certain COICOP byCBS and are thus excluded. These items are not excluded from the newly constructed indexes.Lastly, if there is no price available for a certain item in one month, the price is imputed by

7

Page 12: Using scanner data of supermarkets to simplify the

taking the price of the previous month of that item by CBS. No price imputation is done whenconstructing the new indexes.

5. Results and analysis

5.1 Experiment: base price index

Figure 5.1 contains the four indexes, where EAN-index corresponds to the newly devised indexbased on January 2015 and EAN-monthly corresponds to the newly devised index based on theprevious month. The ISBA and COICOP index are based on the ISBA and COICOP indexesprovided by CBS. The two newly constructed indexes follow the same trend, but the differenceswithin the EAN-monthly index are bigger, which makes sense since this index is based on theprevious month instead of a base month. Table 5.1 contains the absolute difference between the

Figure 5.1: Indexes based on average price of all items permonth.

means of the monthly mutations between 2015 and 2016 for all four indexes.

EAN-index EAN-monthly ISBA-index COICOP-indexEAN-index - 0.337855 1.690465 0.026333EAN-monthly 0.337855 - 2.02832 0.311522ISBA-index 1.690465 2.02832 - 1.716798COICOP-index 0.026333 0.311522 1.716798 -

Table 5.1: Absolute differences in mean of monthly muta-tions of the four indexes

There is a notable difference in the ISBA and COICOP index, which can be explained by thefact that both indexes take different groups into account. In the ISBA index, only ISBAs weretaken into account that had an index for the complete time period, which led to the removal ofseveral ISBAs from the index. These ISBAs can be converted to COICOPs by removing the lastthree digits. Only 102 COICOPs remained in the ISBA index, while in the COICOP index therewere 142 COICOPs. There were only 76 overlapping COICOPs in both indexes.The difference between the EAN-monthly index and the CBS indexes can be explained by the factthat the first is constructed based on the previous month, while the CBS indexes are based on Jan-uary 2015. Interestingly enough, the difference between the EAN-monthly index and the COICOPindex is only 0.31, which is not even that big. The EAN-index is closest to the COICOP indexwith an absolute difference of just 0.026, which makes sense considering the fact that the COICOPindex contained more product groups than the ISBA index. The EAN-index follows roughly thesame trend as the COICOP index. The biggest difference between the two indexes is at December2015. The peak at December 2015 could be explained by seasonality. Around Christmas and NewYear’s Eve, a lot of seasonal items are sold, which are not taken into account in the CBS indexes

8

Page 13: Using scanner data of supermarkets to simplify the

as explained in section 4.3 Differences in methods. Apparently, these items have a high averageprice.As mentioned in section 2.3 Challenges, when using aggregation, items are expected to be homo-geneous and thus perfect substitutes for each other. Naturally, when aggregating over all EANs,the items are not homogeneous, which is another explanation why the indexes do not match.

The difference between the EAN index and the two CBS indexes can furthermore be explainedby the method of calculation, as explained in section 4.3 Differences in methods.The absolute difference between the mean of monthly mutations of the indexes is never below 0.01,although the difference between the EAN and COICOP index comes close at 0.026, which meansthe EAN index is not similar to the CBS indexes.

5.2 Experiment: classification on ISBA and COICOP level

One would expect that taking the average prices on an item group level would yield better resultsthan taking the average price over all products, as done in the previous experiment, since the pricedifferences within an ISBA or COICOP are not as big as the price differences between all items.The hypothesis of this experiment is thus:

Hypothesis 1 The absolute difference between indexes is lower on COICOP or ISBAlevel than the absolute difference between the base price and CBS COICOP index.

Table 5.2 shows the mean of the monthly mutations for the ten chosen ISBAs and COICOPs ofthe CBS index and the newly computed indexes.

ISBA Absolute difference COICOP Absolute difference11140796 6.4545 11140 6.342211830701 8.6836 11830 6.648111840702 1.4046 11840 10.628211990701 36.807 11990 1.268611610708 3.1137 11610 2.625511270702 77.6787 11270 2.712412210701 18.7447 12210 4.769211130704 0.9083 11130 0.475721310701 1.0439 21310 1.044211910701 2.3633 11910 0.1991

Table 5.2: Absolute difference in mean of monthly muta-tions of ISBA and COICOP CBS index and newly com-puted indexes

The corresponding graphs can be found in Figure A.1 and A.2 in Appendix A. As can begathered from the differences between the means in Table 5.2 and the graphs in in Appendix A,almost all newly computed indexes are very different from the CBS indexes, although the differencesbetween the COICOP indexes are somewhat smaller than the difference between ISBA indexes.The only ISBA index that has a difference below 1, which can also be seen in Figure A.1i inAppendix A, is ISBA 11130704. Two COICOP indexes have differences below 1: 11910 and 11130.The absolute differences are all not below 0.01, which means none of the newly constructed ISBAand COICOP indexes are considered to be similar to their CBS counterparts. Furthermore, noneof the differences between the ISBA and COICOP indexes is smaller than the difference betweenthe base price index and CBS COICOP index, thus hypothesis 1 is rejected. It seems taking theaverage prices on a lower item classification level does not work better than aggregating over allitems.

The mean price of all items was 4.79, with a standard deviation of 5.21. On ISBA and COICOPlevel however, the standard deviations ranged between 0.48 and 2.52. This might mean thatconstructing price indexes by taking the average price does not work when the price differenceis small for example on COICOP and ISBA level, which is very counterintuitive. Based on theliterature review, one would expect that homogeneity plays an important role in how a price indexis constructed. However, these results indicate that heterogeneity (thus having a bigger price

9

Page 14: Using scanner data of supermarkets to simplify the

difference) is better on ISBA and COICOP level to construct the new index than homogeneity.An explanation for the difference of 77.67 between the CBS index and newly computed index forISBA 11270702 can be found in the amount of items that are taken into account each month. Forexample, in January 2015, there are 165 items that fall within that specific ISBA, but in January2016 there are only 91 items left. In February 2016 there are just 5 items associated with thatspecific ISBA, which are all part of the same EAN. That particular EAN has an average price ofe3.85, while other EANs have a much lower average price. Apparently, some of these other EANsdisappear after December 2015, which results in an index based on just one EAN.The same explanation can not be applied to ISBA 12210701, which has an absolute difference of18.74 between the two indexes. The amount of items ranges between 250 and 350 in all months,with no outliers. It seems there is just a big increase in the average price, from e1.49 in January2015 to e2.35 in December 2016.The biggest absolute difference of the COICOP indexes is 10.63 between the indexes of COICOP11840. As with ISBA 11270702, the amount of items was used as a starting point for the analysis.There is quite a big difference in the amount of items between the start and end of 2015, goingfrom 2789 items to 4219 items. However, in 2016 the amount of items ranges between 3366 and3569, which is not that big a difference. When looking at the graphical analysis of the two indexesof COICOP 11840 in Figure A.2d, the new index increases in 2016 as well as in 2015, which wouldsuggest that the amount of items does not contribute to the difference in the two indexes.The differences between the newly constructed ISBA and COICOP indexes, and the CBS indexescan furthermore be explained by the differences in methods, described in section 4.3. Using theaverage price of all items of a particular ISBA or COICOP does not seem to work to approximatethe CBS index.

5.3 Experiment: weighted index

To construct the weighted index, only a subset of COICOPs was used. The COICOP indexprovided by CBS contained high level and sub-level COICOPs: only the sub-level COICOPs thatwere also present in the scanner data were used, which resulted in a subset of 57 COICOPs. Toexplore the challenge of weights of items, the following hypothesis was constructed:

Hypothesis 2 Weights of items based on scanner data can be used to construct aconsumer price index similar to the CBS weighted index.

In Table 5.3 the absolute differences of the mean of monthly mutations are shown between theweighted indexes. The indexes differ quite a bit, as can also be seen in Figure 5.2a. Interestinglyenough, the two newly constructed indexes based on amount of items sold and turnover of theitems follow roughly the same trend. They do differ quite a bit from the CBS index. The ab-solute difference in the mean of monthly mutations is furthermore not below 0.01 for any of theconstructed weighted indexes, which means that they are not deemed similar to the weighted CBSindex, thus also rejecting hypothesis 2. The differences between the indexes can be explained bythe differences in methods, described in section 4.3. Furthermore, the weights of the CBS indexare determined using National Accounts, budget research and other external sources, whereas thenewly devised weighted indexes are based on turnover or amount sold of the items.

Weighted CBS index Index turnover Index amountWeighted CBS index - 3.8250 1.4125Index turnover 3.8250 - 2.4125Index amount 1.4125 2.4125 -

Table 5.3: Absolute differences in mean of monthly muta-tions of the three indexes

A third index for the 57 overlapping COICOPs was constructed using the weights provided byCBS and the average prices from the scanner data. The absolute difference between the monthlymutations of these two indexes is 3.18, which is quite big and definitely not below 0.01. As canbe seen in Figure 5.2b, the indexes are not similar. The new index does seem to follow the same

10

Page 15: Using scanner data of supermarkets to simplify the

(a) Weighted indexes, where Indexturnover and Index amount werenewly constructed

(b) Newly constructed index usingCBS weights and prices from scannerdata and CBS weighted index

Figure 5.2: Figures weighted index experiment

trend as the two indexes constructed in Figure 5.2a.

5.4 Experiment: subset of COICOPs

To see whether it is possible to determine which COICOPs are more important than others in thecalculation of the CPI, the following hypothesis was constructed:

Hypothesis 3 A subset of COICOPs can be used to calculate the consumer price index.

First, the parameter alpha needed to be determined. The alpha controls the degree of sparsity ofthe estimated coefficients, thus a higher value of alpha will set more coefficients to zero. A goodregression model has a high R2 value [17]. The regression model was fitted using the train set, afterwhich the R2 scores were calculated using the test set. Different alphas were plotted against theirR2 scores, as can be seen in Figure 5.3a and 5.3b, to determine which alpha lead to the highest R2

value. Two more graphs were created to zoom in on the highest R2 scores and their correspondingalphas, as can be seen in Figure 5.3c and 5.3d. Based on these scores, alpha 0.03 was chosen forthe data based on the average price, whereas alpha 0.0035 was chosen for the data based on theprice indexes.

These alphas were then used to fit the model on the train set. The mean squared error and R2

score were calculated based on the test set. The mean squared error and R2 score for both modelscan be found in Table 5.4. These results indicate that the model based on the price indexes worksbetter than the model based on the average prices, since the mean squared error is lower and theR2 is higher.

Mean squared error R2

Average price 0.2325 0.5018Price index 0.1188 0.7454

Table 5.4: Mean squared error and R2 of Lasso models

Table 5.5 contains the COICOPs, their resulting coefficients of the model based on averageprices, and their original weight provided by CBS. All other COICOPs were set to zero. Theintercept for the model based on the average prices was 84.1357.

Table 5.6 contains the COICOPs, their resulting coefficients of the model based on CBS priceindexes, and their original weight provided by CBS. All other COICOPs were set to zero. Theintercept for the model based on the average prices was 57.5963.

Only COICOP 11280 is part of both linear regression models, albeit with different weights.It seems that, based on both models, there is not one subset of COICOPs that can be used to

11

Page 16: Using scanner data of supermarkets to simplify the

(a) Alphas plotted against R2,data based on average price

(b) Alphas plotted against R2,data based on price indexes

(c) Alphas plotted against R2,data based on average price

(d) Alphas plotted against R2,data based on price indexes

Figure 5.3: Determining alphas

COICOP Coefficient Original weight11120 0.502769 0.0023411130 6.96982 0.04946111280 0.314735 0.03152511710 1.326465 0.03948711920 5.343874 0.01328212130 -1.94153 0.00107521320 -0.20295 0.009432

Table 5.5: COICOPs, coefficients and original weights usingLasso model on average prices

calculate the consumer price index, thus hypothesis 3 is rejected. The coefficient of COICOP 11280of the model based on price indexes is closer to the original weight than the coefficient of the modelbased on average prices. For that matter, all coefficients of the model based on the price indexesare closer to the original weights, whereas the coefficients of the model based on average pricesdiffer a lot from their original weights.

To explore the challenge of weights further, the coefficients and intercepts were used to createtwo new weighted price indexes, one based on the average price and one based on the price indexes.The coefficients were multiplied with respectively the average price and the price indexes of theirCOICOPs, after which the intercept was added. These new indexes can be found in Figure 5.4.The absolute difference between the mean of monthly mutations can be found in Table 5.7. Thedifference between the two newly constructed indexes and the weighted COICOP index is very

12

Page 17: Using scanner data of supermarkets to simplify the

COICOP Coefficient Original weight11140 0.010369 0.05974211150 0.023722 0.0156911170 0.037018 0.00551211220 0.00898 0.03470911250 0.032475 0.00368211280 0.023933 0.03152511310 0.015933 0.00594911360 0.015564 0.01044611460 0.082382 0.02209611520 0.045476 0.01370511720 0.081561 0.00587311850 0.010186 0.0129211990 0.014869 0.0412120 0.009018 0.00514712230 0.013315 0.024289

Table 5.6: COICOPs, coefficients and original weights usingLasso model on price indexes

small. The index based on the average prices follows a slightly different trend, as can be seen inFigure 5.4. The R2 value of the index based on average prices is 0.5, which means that just 50%of the variance can be explained by the explanatory variables. The model uses 7 COICOPs asexplanatory variables instead of all 57 COICOPs. The R2 value of the index based on the priceindexes is 0.74, which is quite high. 15 COICOPs are used as explanatory variables, which is twiceas much as in the index based on average prices. Based on the mean squared error, R2 value, theabsolute difference in the mean of monthly mutations and graphical analysis, the index based onthe price indexes seems to be closer to the actual CBS COICOP index than the index based on theaverage prices. The index based on the average prices is the closest to the CBS COICOP index incomparison to all other newly constructed indexes in previous experiments. However, the absolutedifference in the mean of monthly mutations is not below 0.01, so the indexes constructed usingthe linear regression are not similar enough to the CBS index.

COICOP indexIndex - Average price 0.02204Index - Price index 0.0205

Table 5.7: Absolute difference in mean of monthly mutations betweennewly constructed indexes and CBS COICOP index

Figure 5.4: Newly constructed weighted indexes based on coefficientsfrom L1 regularization

13

Page 18: Using scanner data of supermarkets to simplify the

6. Conclusion and discussion

6.1 Conclusion

Can a simple method be created to calculate the CPI to overcome thechallenges posed by scanner data of supermarkets?

The research question can be answered as follows: the newly constructed indexes did not lead toresults that were deemed sufficient. The methods underlying these new indexes (aggregating overall items, using classification on lower levels, constructing weighting schemes, and linear regression)are thus not good enough to be used in constructing the consumer price index. Furthermore, thechallenges were not overcome by using these methods. This leads to the conclusion that theproposed methods in this research cannot be used to calculate the CPI to overcome the challengesposed by scanner data of supermarkets.

6.2 Discussion

The results of the linear regression model based on the price indexes provided by CBS were by farthe best obtained results. Since these price indexes were not based on scanner data, these resultsshould actually not be taken into account. The results of the linear regression model based on theaverage prices of the COICOPs are good in terms of the mean squared error, but the explainedvariance was only 50%.

Since there is no consensus at statistics agencies about when a new index is close enough tothe old index, the threshold of 0.01 was picked based on recommendations by people from CBS.Maybe the mean of monthly mutations of 0.2095 between the EAN index and COICOP index inthe first experiment is small enough for the newly constructed index to be considered a good index,but this is hard to say.

There are obvious differences between the methods used in these experiments and used by CBS,which explains the differences in the indexes. It is hard to say whether these new indexes mightresemble the reality even better than the CBS indexes, since CBS considers their index to be theground truth even though they do not use all data that is available.

In some experiments the data was filtered on a subset of COICOPs, which yielded some goodresults. Not all data was used in constructing these indexes, which means that they do not representthe complete truth.

6.3 Future work

Using the first three weeks of a month instead of all weeks of a month, just as CBS does, could bea next experiment, to see whether this yields a better result and whether there is a big differencewith using all weeks.It would be interesting to explore the other challenges that scanner data poses, such as how to han-dle items that are on sale, new and disappearing items, and items that are returned by customers.It would also be interesting to see how items could be classified further, for example by using aclassification algorithm on certain attributes of the items such as volume, unit, price, quantitysold, or (part of) the description.The current research could be expanded on by using scanner data from different supermarkets oreven different types of suppliers, such as department stores or do-it-yourself stores. The methodsused in the experiments could then be tested on data from these stores, to see whether the methodswould work for these datasets.

14

Page 19: Using scanner data of supermarkets to simplify the

References

[1] J. Walschots, “Het mandje van de consumentenprijsindex, de bestedingen van consumentenen de meting van de cpi in 2016”, 2016.

[2] S. Netherlands. (2017). Definition consumer price index, [Online]. Available: https://www.cbs.nl/en- gb/our- services/methods/definitions?tab=c#id=consumer- price-

index--cpi-- (visited on 06/01/2017).

[3] A. Chessa, J. De Haan, J. Verburg, and L. Willenborg, “A comparison of price index methodsfor scanner data”, unpublished paper, 2016.

[4] H. Van der Grient and J. De Haan, “The use of supermarket scanner data in the dutch cpi”,in Joint ECE/ILO Workshop on Scanner Data, vol. 10, 2010.

[5] A. Chessa, “Towards a generic price index method for scanner data in the Dutch CPI”, May2015.

[6] J. De Haan, A framework for large scale use of scanner data in the dutch cpi. Ottawa Group,2015.

[7] I. L. Office, Consumer price index manual: Theory and practice. International Labour Office,2004.

[8] L. Ivancic, W. E. Diewert, and K. J. Fox, “Scanner data, time aggregation and the construc-tion of price indexes”, Journal of Econometrics, vol. 161, no. 1, pp. 24–35, 2011.

[9] S. Netherlands. (2016). Cbs start uniek initiatief voor big data-onderzoek, [Online]. Available:https://www.cbs.nl/nl-nl/nieuws/2016/39/cbs-start-uniek-initiatief-voor-

big-data-onderzoek (visited on 06/07/2017).

[10] U. N. S. Division. (2017). Coicop, [Online]. Available: https://unstats.un.org/unsd/cr/registry/regcst.asp?Cl=5 (visited on 03/25/2017).

[11] L. Ivancic and K. J. Fox, “Understanding price variation across stores and supermarketchains: Some implications for cpi aggregation methods”, Review of Income and Wealth, vol.59, no. 4, pp. 629–647, 2013.

[12] D. Melser, “Accounting for the effects of new and disappearing goods using scanner data”,Review of Income and Wealth, vol. 52, no. 4, pp. 547–568, 2006.

[13] M. B. Reinsdorf, “Using scanner data to construct cp1 basic component indexes”, Journalof Business & Economic Statistics, vol. 17, no. 2, pp. 152–160, 1999.

[14] J. De Haan, L. Willenborg, and A. Chessa, “A review of price index methods for scannerdata”, unpublished paper, 2016.

[15] J. D. Richardson, “Scanner indexes for the consumer price index”, in Scanner Data and PriceIndexes, University of Chicago Press, 2003, pp. 39–66.

[16] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Sta-tistical Society. Series B (Methodological), pp. 267–288, 1996.

[17] M. F. Triola, Elementary statistics. Pearson/Addison-Wesley Reading, MA, 2006.

Page 20: Using scanner data of supermarkets to simplify the

A. Ten chosen ISBAs and COICOPs

(a) ISBA 11140796 (b) ISBA 11830701 (c) ISBA 11840702

(d) ISBA 11990701 (e) ISBA 11610708 (f) ISBA 11270702

(g) ISBA 12210701 (h) ISBA 11130704 (i) ISBA 21310701

(j) ISBA 11910701

Figure A.1: Graphs for ten chosen ISBAs, with the blue line representing the CBS-index and thegreen line representing the newly computed index.

Page 21: Using scanner data of supermarkets to simplify the

(a) COICOP 11610 (b) COICOP 11830 (c) COICOP 11140

(d) COICOP 11840 (e) COICOP 11990 (f) COICOP 11270

(g) COICOP 12210 (h) COICOP 11130 (i) COICOP 21310

(j) COICOP 11910

Figure A.2: Graphs for ten chosen COICOPs, with the blue line representing the CBS-index andthe green line representing the newly computed index.