ustomer analytics for a retailer...

Customer analytics for a retailer chain

Project laboratory report

Author: Szabó Péter Supervisors: Pálfi Attila Petróczi Attila

Date: 5/10/2012 Course code: VITMM376

Budapest University of Technology and Economics Faculty of

Electrical Engineering and Informatics Department of Telecommunications and Media Informatics

Table of contents

2

Table of contents

Table of contents .......................................................................................................................................... 2

Introduction .................................................................................................................................................. 4

1 Project task ........................................................................................................................................... 6

Software specifications ................................................................................................................. 6 1.1

2 Research of literature ........................................................................................................................... 8

Data warehousing ......................................................................................................................... 8 2.1

Data mining ................................................................................................................................. 11 2.2

CRISP-DM .................................................................................................................................... 12 2.3

2.3.1 Business understanding ...................................................................................................... 13

2.3.2 Data understanding ............................................................................................................ 13

2.3.3 Data preparation ................................................................................................................. 13

2.3.4 Modeling ............................................................................................................................. 13

2.3.5 Evaluation............................................................................................................................ 13

2.3.6 Deployment ......................................................................................................................... 14

Customer analytics ...................................................................................................................... 14 2.4

3 Implementation of the task ................................................................................................................ 15

Business Understanding .............................................................................................................. 15 3.1

Data Understanding .................................................................................................................... 15 3.2

Data Preparation ......................................................................................................................... 16 3.3

Market basket analysis ............................................................................................................... 20 3.4

3.4.1 Task specific data preparation ............................................................................................ 20

3.4.2 Modeling ............................................................................................................................. 20

3.4.3 Evaluation............................................................................................................................ 22

Seasonality analysis ..................................................................................................................... 22 3.5

3.5.1 Task specific data understanding ........................................................................................ 22


3.5.3 Modeling ............................................................................................................................. 24

3.5.4 Evaluation............................................................................................................................ 27

Geographical analysis.................................................................................................................. 27 3.6


Table of contents

3

3.6.2 Modeling ............................................................................................................................. 28

3.6.3 Evaluation............................................................................................................................ 29

Customer-power analysis............................................................................................................ 30 3.7


3.7.2 Modeling ............................................................................................................................. 31

3.7.3 Evaluation............................................................................................................................ 34

Discounting method analysis ...................................................................................................... 34 3.8

4 Summary ............................................................................................................................................. 36

Table of figures, tables and charts .............................................................................................................. 38

Bibliography ................................................................................................................................................ 39

Introduction

4

Introduction

We are living in the age of the information. Almost 8 new users go online every seconds and it

is estimated that in 2015 there will be generated more than 10000 Petabytes (1Petabyte = 1000

Terabytes) business data per month globally. [1]

This huge amount of data could not be managed without mathematical and computational

support. Mathematicians and analyst from all over the world are working on this data to reveal

nontrivial coherency between the records and looking for the explanation where could the

company gain one more cent profit. Besides of the overall analysis and reengineering of the

business models, processes it is important to manage the relationship with the customers too.

Although the customer data is only a small slice of the whole business data, it is not a question

that even this amount of data is impossible to handle manually. The support systems are called

Customer Relationship Management (CRM) systems. They widely implemented the strategy for

managing company’s interactions with their clients. CRM systems provide organizing,

automatizing and synchronizing business processes. They are primary based on sales processes,

but they also support marketing, customer service and technical support.

Based on CRM systems companies were getting able to manage their customers even better.

Therefor they needed even more customer data. They realized rapidly that profiling the

customers could help them to decrease marketing cost so indirectly increase profit. The only

one task was to make the customers interested to give as much information freely to the

company as much it needed. The solution was the customer loyalty card. While customers

demand these cards they fill the registration form and give all the identical and demographic

data that companies needed. Furthermore using that card during shopping companies can

identify that which customer buys certain goods. This process help the companies gather

information that needed to create customers profile.

On the other hand customers join these programs voluntarily because for using the loyalty card

during shopping they get some percent discount or special offers that make them feel more

valuable.

Introduction

5

Apparently this is a win-win situation, but behind the scenes the companies are the only

winners. All their moves are to raise their indicators.

Therefore customer analytics – the analysis of customer data – has become a well-supported

field of data mining too. There are a lot of fields of the business where managers are able to

improve the revenues by using data mining techniques. Customer data is useful when the board

should decide where to open the new shop or who to deliver the special offers.

In the most impressive way Sir Colin Marshall summarized the significance of customer analysis:

“It all comes back in the end to value for money. If you can deliver something extra that others

are not or cannot, some people will pay a slight premium for it. I want to stress that when I say

“slight,” I mean precisely that. In our case, we’re talking about an average of 5%. On our

revenues of £5 billion, however, that 5% translates into an extra £250 million, or $400 million, a

year.” [2]

Project task

6

1 Project task Describe and summarize those marketing strategies which could be served to improve sale

indicators. On the available date examine the particular marketing strategies and forecast their

success ratio in the practice. Look at the coherence between the location and the range of

goods then suggest an upgrade for the actual method. Using SPSS Statistics you should

implement a predictive model which helps making the right decision about the product range in

certain shops and planning the best discounts.

Software specifications 1.1The main goal of my project was to discover the functions of the IBM SPSS (Statistical Program

for Social Sciences) and using it for customer analysis. For the project task I had to setup the

work environment.

I had got the input data in different Excel tables so I made the data understanding task in Excel.

For example there were the customer’s, the item’s and the sales’ data in different

spreadsheets.

The following task was data preparation. I decided that besides of separate Excel worksheets I

am going to use a data warehouse. For that I chose the Microsoft SQL Server 2008 R2 because I

already had known it well. For using a data warehouse of course I had to make the ETL

(Extract-Transfer-Load) process. I prepared the ETL in Microsoft SQL Server Business

Intelligence Development Studio which is a plugin for Visual Studio. It makes much easier to

create ETL processes because you do not have to know SQL as well. In addition you are able to

create a picturesque flow chart diagrams easily so it is also good for presentation. Furthermore

it provides a wide range of data source/destination, DLL function calls in a graphical user

interface versus dozens of lines of code in command prompt.

I setup the whole infrastructure on a single instance. Of course there are a lot of advantages

but also many disadvantages too of this choice. Because of both components run on the same

computer there was no need for secure the network communication. On the other hand

running the data warehouse on a big server it would be enough a cheaper client for running

SPSS and making the analysis and reports.

Project task

7

All in all I decided to run both components on the same computer, because it made much

easier to develop and debug my processes. When my solution will be realized in commercial

use I would suggest running them on separate infrastructure.

Research of literature

8

2 Research of literature

Data warehousing [3] 2.1The data warehousing is about ten years old so a really new field of information technology.

Due to the rapid improvement it is developed on some different ways. Data warehousing is a

union of proved and immature techniques so it is hard to define. Bill Inmon’s is one of the

apostles of the topic. His well-accepted definition says: “A data warehouse is a subject oriented,

integrated, nonvolatile, and time variant collection of data in support of management's

decisions.”

This few words are very complex, let’s examine it a bit:

Subject oriented: Data warehouse is built along a well-defined business goal. Reaching

these goals requires certain business data, for example to making customer analytics

sales and customer data are necessary.

Integrated: Integrated design has a strong connection to the previous point’s data-

driven design. In this field integration means that the data of a certain field of interest is

suggested to store in separated groups.

Nonvolatile: The main expectation of the data loaded into the warehouse is that it must

be the same as it was in the source system. When the transferred data had changed in

the source system, it should be migrate to the data warehouse with a timestamp.

Time variant: Usually we load the data from the past but we use it to make prediction

about the future so it is important to think over that which data should we use and

which should we discard.

The building aspects of a data warehouse differ from those that I learned in the class called

Databases. [3]

In this case the main goal is to store the data in the way that support high query performance

the most. So it is not a huge problem if the data is stored with a little bit redundancy as long as

it gains any extra performance for reporting.

The data warehouses are built usually by star or snowflake scheme. In both cases the important

data is located in the fact table, in the middle. Here is situated the enterprise resource planning


9

(ERP) system’s correctly transformed master data. Around that are placed the dimension tables.

The data that is not exactly the analyzed one are loaded these dimension tables. There is for

example a dimension data table for time, location, and etc. data. The dimension tables gather

all the information that could be a category by the analysis. The main differences between the

two design patterns are the connections between the database tables.

As shown on the figure below, in the way of the snowflake model the fact table is in the center.

All the dimension tables are connected to it. There are all the key attributions of the dimension

tables stored in fact table as a foreign key and throw that could be identified all the matching

rows of the dimension tables.

Figure 1 – Entity relationship diagram of the star scheme

Besides that in the snowflake scheme the dimension tables are ordered and connected

hierarchically and only the top of the hierarchies are directly connected to the fact table. For

example the Shop dimension table is connected to the City where it is located, the City table to

the County etc.


10

Figure 2 – ER diagram of the snowflake scheme

During my project task I built a star scheme data warehouse.

The comparison of the star and the snowflake scheme:

Pros of the star scheme:

easy, intuitive data model

requires less join operations than the snowflake

requires only a few table reading

easy to build, the metadata of the model are simple

Cons:

hard to create aggregations

the handling of the huge hierarchical components slows down the queries

the handling of the dimension data need a lot of redundancy


11

Data mining [5] 2.2Data mining or knowledge is digging through, analyze and discovering new patterns in a huge

amount of data. It is a relatively young and interdisciplinary field of computer science. It is

interdisciplinary because it needed a massive computer support, uses database systems, a lot of

algorithms and applied mathematics. Besides that many people list data mining not to

computer science, neither to business but into social sciences. They rank it into social sciences

because of the individual privacy that every single row of data usually means.

Major elements of data mining are:

Extract, transform, and load transaction data onto the data warehouse system.

Provide data access to business analysts and information technology professionals.

Analyze the data by application software.

Present the data in a useful format, such as a graph or table.

The analytical techniques used in data mining are often well-known mathematical algorithms

and techniques. The new is the application of these techniques. As the stored data increase

more and more company define business problems, that’s solution may be rooted in the heap

of the collected data. The algorithms are abstract so they can apply in very different problems

with high efficiency.

These main tools that data mining usually apply are:

Artificial neural networks - Non-linear predictive models that learn through training and

its structure was inspirited by the natural neural systems.

Decision trees - Tree-shaped structures that represent sets of decisions. These sets

makes the business users able to apply easy rules during the decision making process.

Rule induction - Useful if-then rules from data based on statistical significance. Similar to

the decision trees.

Genetic algorithms - Optimization techniques based on the concepts of genetic

combination, mutation, and natural selection. There are a lot of generations of

algorithms which are improved versions of their ancestry.

Nearest neighbor - A classification technique that classifies each record based on the

records most similar to it. [6]


12

During the years it was necessary to standardize these elements and the data mining process.

In 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004

Java Data Mining standard were based on succeed projects. As the effect of their success in

2006 were released the 2.0 edition of both standards. In the next session I will introduce the

CRISP-DM methodology in details.

I personally understand with Sir Colin Marshall (former British Airways CEO), who said that “It

all comes back in the end to value for money.” [2]

All the same when we are talking about genetics, medicine, social sciences or education the

data mining is always business-driven.

CRISP-DM 2.3As I already mentioned CRISP-DM (CRoss Industrial Standard Process for Data Mining) is a data

mining methodology and standard. As a methodology it is a circular model so very similar to the

concept of the spiral development methodology. In this chapter I am going to introduce it

detailed because I used it during my work because the IBM SPSS follows this methodology too.


13

Figure 3 – Processes of CRISP-DM

As shown on the previous figure CRISP-DM has six phases. In the following am going to

introduce each one in some sentences. [7]

2.3.1 Business understanding I the first phase the most important is that all the participants should understand the project

goal from the business perspective. This problem is going to be translated into data mining

goals and into a problem definition. For the end of this phase the project plan is prepared.

2.3.2 Data understanding Now the task is to collect initial data. If there are multiple data sources integration is an

additional issue. Possible this leads into data preparation steps. In DU phase describing data is

an important task too. It means the examining the gross and surface properties of the acquired

data. In exploring data step we examine the distribution of the key attributes, creating some

visualization reports. As a last step we need to verify data quality as answering questions like

“Are data fields completed?” or “Are there any missing values?”

2.3.3 Data preparation Usually data preparation step requires the most time of a project. It could be needed up to 90%

of the time of the whole project. It involves constructing necessary data, integrate data from

multiple sources, clean it and transfer into the destination format.

2.3.4 Modeling In modeling phase we need to choose the modeling technique. If multiple techniques are

applied we need both tasks separately. For these models we need to generate test designs and

building the selected model(s). As the end of this phase we should assess the model, examine

that the data mining goals have been satisfied.

2.3.5 Evaluation In evaluation step we need to translate to model to the business environment, asses the

business model and check that business goals have been completed. Whether all the goals are

satisfied we should review the data mining process looking for upgrade possibilities.


14

2.3.6 Deployment As a start of the last phase we should make the deployment and maintenance/monitoring plan.

After the deployment we should close the project by making the final report and review the

project to summarize what went go/wrong and collect the work experience for further projects.

Customer analytics 2.4Customer analytics means analytics of customer data. It is a really new field of data mining. It is

trigger was the wide spread of Customer Relationship Management (CMR) systems so the

stored data about the customer involved processes increased dramatically. As a field of the

data mining CA does not end up with descriptive statistics but applies predictive models too.

Customer Analytics is the key to unlocking the hidden value of customer data. It’s the

difference between passive, reactive marketing and agile, active marketing—where every step

is sure, every goal is rooted in relevance, and every insight learned makes your next endeavor

that much smarter and more successful. [8]

The main goal in this field is to predict the behavior of the customer. If we need what is the

next move we will be able to make the right discounts or recommend the whished product to

the specified customer.

Implementation of the task

15

3 Implementation of the task As I already mentioned in the task specification chapter my exercise was to analyze the

customers’ data of a retailer chain.

I searched the answer for the next questions:

What are the usual market baskets?

Is there any seasonal good? Which are those and when?

Is there any connection between the location of the shops and its likability?

What is the customer power of the certain shops?

Who are the discounts made? Is there a better solution?

The business and data understanding and the first, data-warehouse-building step of the data

preparation phases were the same for all the questions so I will describe then together.

Business Understanding 3.1

The main goal of my task is to present to the manager layer of the client company the power of

data mining and show them its business significance. My target was to open the client’s eye to

see how magnificent possibilities are hidden by data mining. Hopefully after my work they are

getting interested in using exact, mathematical models in the strategic planning.

The client in this project was a retailer shop chain who gave me all the sales data from the

previous months. I was needed to work with the sales and customer data to figure out how

they could become more profitable.

For convenience the client I assemble the previous list of questions. My exercise was to answer

those. My personal goal was to get know with SPSS.

Data Understanding 3.2

After that I had been getting know the aim of the project I dug myself into the data. I had

become four excel files. One of them contained the customer database of the client. These

customers are identified by a loyalty card and besides their shopping behavior we can identify

them by their gender, address or lifetime.


16

Besides of the customer data I got some data about the available items too such as identity,

prize or last selling’s date.

The rest of the data is for managing the sales. I had got the sales data in two parts. There was

the sales header and the sales line data in different spreadsheets. The sales header table

contains general data about the sales. Furthermore the shop and the cash desk are located

here, where the transaction happened. If the customer possessed any loyalty card, the

identifier had been stored here. Of course any other information is located here which belongs

to the invoice itself as the shopping time, the summarized prize, the summarized prize after the

discounts and some other information about the transaction type.

In the last document there was stored the detailed data of each sale. For every lines of the

invoice stand here the item identifier, the unit prize, the quantity and the line summary.

Right after that I understood the logic of the data storing here and I could see through the

connections between the attributes I decided that I am going to build a data warehouse. This

decision needed to examine the input data more in a more detailed way to get enough

information about the required data types too.

Summarize in this term I was look the input data over and over again to reveal logical

connection and technical properties which are going to be useful during the further steps.

Data Preparation 3.3

In my opinion building of a data warehouse is well-founded in this situation, because there is a

big amount of data that will be the input of all the data mining steps. So I am going to introduce

the Extract-Transfer-Load (ETL) procedure which I made for prepare the data and transfer into

the data warehouse. As I already mentioned it I made the ETL process with Microsoft SQL

Server Business Intelligence Development Studio. It is a graphical designing tool for dataflow.


17

Figure 4 – Control flow diagram of the ETL process

On the state chart diagram (Figure 4) could be seen the control flow of the ETL process that I

made. Every time the process starts it’s running with the initialization script. The initialization

script contains some SQL statement that check if the tables are exists. If they do not, then the

script creates them. As an example there is a part of the script. This few lines of code check,

that the postal code table exist or not. If it does not, then it creates the table:

IF NOT EXISTS (SELECT * FROM sysobjects WHERE id = object_id(N'[dbo].[Postal code]')

AND OBJECTPROPERTY(id, N'IsUserTable') = 1 ) BEGIN CREATE TABLE [Postal code]( [POSTAL CODE] nvarchar(4), [CITY] nvarchar(255), [COUNTY] nvarchar(255) ) END

After that we are sure, that the structure of the data warehouse exists, we are able to start fill it

with data. As it is seen on the state chart diagram, there is two independent thread of ETL

process. So prepared the solution to process the customer and the item data paralelly.

On the path that process the item data there were only some simple data preparation steps,

such as datatype conversations.

The another path was a little bit more difficult. In the input dataset there was only the postal

code of the customers who participate in the loyality program so I decided to add some extra

data to make more interesting the analysis. I loaded some geographical data into the

warehouse, such as the county where the customer lives. After the database task loads the

sd ETL control flow

InitDWH

ETL for items

ETL for

postal code

ETL for

customers

ETL for sales

header

ETL for sales

lines


18

geographical data into the data warehouse it is able to start the processing of the customer

data, because now we can join to it the geodata. While it was loading the customer data I made

some data transfomartion on it, such as I replaced the missing values with NULL values and I

made the lookup of the geographical data too.

While both threads succesfully finished the customer’s and the item’s data have loaded into the

data warehouse so I could load in and join the sales data too. Because the sales lines should be

connected to the sales header rows, at first add the header data to the database. While I had

been inserting the sales data I added the foreign keys to the table to estabilish the relationship

between the customers and the invoices.

After that the database involved the invoice data too, I finally added the sales lines and

estabilished the required connections between the tables to keep data dependency.

Although the input data comes from an ERP system I needed to make some data cleaning steps

before I could store the data into the data warehouse. Mostly I tried to fill missing os invalid

arguments with NULL’s.

The Figure 5 shows an example for the data flows I defined. This starts with an Excel data

source and after that the data goes through a lot of transtormation node it is giong to store in

the data warehouse. In Clear data node I remove the * character from the KEDVAZON attribute

if it contains. Moreover the receipting time is stored as a four digit long decimal number the

node transtorm it into DB_TIME format. After this conversion the following node cast the ID

from a decimal number stored in a string into integer format. The next node is a conditional

split whether the payment was completed by creditcard or not. The nodes on the two output of

the conditional node add a bimary variable to the table. 1 if the customer paid by creditcard

and 0 if not. Then I made a union node to rejoin cases of the condition and added another

condition whether the customer used a coupon during the shoping or not. If he had used, then I

stored the coupon identifier, when he not, then I filled the attribute with NULL. At the end I

rejoin the datasets and load them into the data warehouse.


19

Figure 5 – The data flow diagram of the ETL for sales header


20

The analysis spezificant data preparation I will describe right when I will introduce the certain

analysis and its results.

Market basket analysis 3.4

The aim of this analysis was to reveal the most common shopping baskets. Exploring that which

products are bought usually together is very useful to make cross-selling.

3.4.1 Task specific data preparation

The required data is located in the Sales lines table of the data warehouse, so I got the Receipt

ID, Item ID and Quantity columns for every transaction. For the modeling I had to create a pivot

table from it. The table has as many columns as many items are in the database and the

number of its lines is the number of transactions. An arbitrary mij element of generated matrix

is the quantity of the j item which was sold in the i transaction.

3.4.2 Modeling

In the modeling phase I chose to divide the customers into some clusters. For the clustering I

had chosen k-means clustering method.

K-means divide the input points into k clusters. The algorithm is very simple:

1. Choose k random point in the modeling space. These points will be the centers of the

clusters.

2. For each points of the model its distance is calculated from the center points. Every

point belongs to the cluster with the nearest center point.

3. When all the point belongs exactly to one cluster, then the position of the center points

should be recalculated. The new position is the centroid of the cluster. Then we should

step back to the second step and refresh the model.

The second and the third steps should be iterated until none of the center points’ position

changes.

Most of the transactions consists a few listed goods, so during the distance calculation there

will be a lot of dimensions (items) where the value is zeros. In those dimensions where the

coordinate is not zero (there were some sales of the item) the distance will be a relevant value.

When two shopping basket contained more consonant goods, then the distance between then


21

is going to get smaller, so those transactions are likely to getting ordered in the same cluster,

which contained more of the same goods.

Table 1 – Number of cases in each Cluster

Besides, that SPSS gave me some statistical info about the running of k-means process, it also

generated the cluster ID’s into the input file. The automatic classification process revealed that

three clusters were enough, but I aimed to make smaller, more exact groups so I doubled

number of clusters. As it could be seen on the Table 1, almost all of the cases are included in

the third cluster. Those are that transactions, when the variance of the sold items were too big.

After filtering of the particular clusters I made some intersect calculation to

reveal the common market basket and

as a part of the evaluation tried to prove, that the items in cluster three have to huge

variance.

Cluster No. Usual item in cluster

1 26218, 31598

2 5573, 26218, 31598

4 31598, 31598, 4582, 6250, 3797, 31004, 25122,

12408, 31598, 26218, 424, 31598, 26218

5 425

6 17579

Table 2 – Usual sold items in the separate clusters


22

3.4.3 Evaluation

By the Table 2, I could assess, that there is not a typical market basket in the examined chain.

The range of goods is too wide and the sales data was not enough to make me able to declare a

simple basket to The Usual. In the 1st and 2nd clusters there are some item sets, but they

occurrence cardinality is not significant enough. The 4th cluster contains the data of only one

transaction, it is rather special than usual. The 5th and 6th clusters contains only one product, so

this could be the effect of any marketing campaign or some special offers, but definitely not a

massive phenomenon too.

Seasonality analysis 3.5

In this section I was searching for some recurrence in the shopping periods. I examined the

correlation between the sale amounts in different shops during a period of time and try to

make global consequences.

3.5.1 Task specific data understanding

At first I was wondering that is there any connection between the daily sales and the shops, so I

created the following diagram:

Chart 1– Average daily sales value per shop

But it was not as picturesque as I was expected, so I calculated all the correlation between the

certain shops sales. The figure was right there was not any strength relation between the


23

shops, the highest correlation coefficient was only 0.176 between shops V1 and P2, but the

correlation’s significance level was also low. So I started to examine other time periods. I deal

with the correlation between monthly and daily sales.


In the data preparation phase I got the required data for monthly analysis from the data

warehouse for each shop one by one with the following query:

SELECT [BOLT] AS "SHOP" ,DATEPART("MONTH",[RDAT]) AS "MONTH" ,DATEPART("DAY",[RDAT]) AS "DAY OF MONTH" ,AVG([SALEVALUE]) "AVG SALES VALUE" FROM [ProjectLab1].[dbo].[Sales header] WHERE [BOLT] = @BOLT GROUP BY [BOLT], DATEPART("MONTH",[RDAT]), DATEPART("DAY",[RDAT]) ORDER BY [BOLT], "MONTH", "DAY OF MONTH"

The result set of the query is not exactly the excepted so I made some other transformation on

it in Excel. The initial data set for SPSS was a table which contains in the first column the days of

a month and in the following columns the summarized income for this day. I decided to make

those columns for every each store and examine them separately to get more detailed, more

exact results.

DAYS January February March … December

1 Income 01/01 Income 01/02 Income 01/03 … Income 01/12

2 Income 02/01 Income 02/02 Income 02/03 … Income 02/12

… … … … … …

31 Income 31/01 Income 31/03 … Income 31/12 Table 3 – SPSS input format for monthly seasonality analysis

The data preparation for the daily analysis was quite the same. I got the data with this SQL

statement by shops:

SELECT [BOLT] AS "SHOP" ,DATEPART("HH",[ReceiptingTime]) AS "HOUR" ,AVG([ELADERT]) "AVG BY HOUR" FROM [ProjectLab1].[dbo].[Sales header] WHERE [BOLT] = @BOLT GROUP BY [BOLT], DATEPART("HH",[ReceiptingTime]) ORDER BY [BOLT], "HOUR"


24

The result was processed again in Excel to the format below:

HOUR SHOP1 SHOP2 SHOP3 … SHOP n

0 Income1/0 Income2/0 Income3/0 … Income n/0

1 Income1/1 Income2/1 Income3/1 … Income n/1

… … … … … …

23 Income1/23 Income2/23 Income3/23 … Income n/23 Table 4 - SPSS input format for daily seasonality analysis

3.5.3 Modeling

In the modeling period I made correlation analysis for each store, if it exist any connection

between the sales in each month, then what is it?

First I was trying to make some charts to help realizing the relationships, but they did not work

anyhow. All of them were like on Chart 2, to crowded to simply show the relevant information.

Chart 2 – Average daily income in shop B4

Although the line charts did not give any relevant information it was good to make some

suspicion to work with.


25

For all the shops in the chain that is had got some information, I calculated the correlation

coefficients in SPSS for each month to the others. The most important part of the output is

visible in the Table 5. There are two important indicators in this table:

Pearson Correlation: Pearson’s correlation coefficient. In the interval between 1- and 1.

It means the strength of the relationship. Bigger value is better.

Significance: It is the measurement of how much is it unlikely to have occurred by

chance. It is a possibility so it is between 0 and 1. Smaller number means more reliable

connection.

N means the size of the dataset, in this case the working days of the month.

Table 5 – Correlation calculation result for shop B4

It is important to realize that the correlation table is symmetric, because the operand of

correlation is symmetric too to the arguments.


26

In Table 5 I realized the following coherences: The correlation coefficient is relative high in

August-February-January, September-January and October-April. Those relations are quiet

significant too. There are similar relationships in the correlation matrix of other shops too, so it

is very likely, that they sold most of their goods in spring and in the autumn.

The correlation between the hourly sales is much stronger in the certain shops, as it seems on

the Chart 3, the lines are hardly on each other.

Chart 3 - Average hourly income in certain shops

It means that the correlation between the variables is much bigger, so the customer’s behavior

is very similar in each store. As it seems on Table 6 (it is only a part of the whole matrix, to keep

its well-readability) in most of the time the correlation coefficient is almost absolute one. So the

variables look very similar or very opponent.


27

Table 6 – Part of he hourly correlation matrix

3.5.4 Evaluation

All in all the periodicity is most important in a year, the peaks are in spring, and in autumn.

Inside of the day in most of the shops periodicity is negligible, but in shop B4 there is a massive

rush after 21 o’clock!

Geographical analysis 3.6

In this section I was looking for some correlation between the locations of the shops and the

customers, who are shopping there.


28


At first I made some correspondence analysis from the shopping logs. I examined where the

customers usually do their shopping. For this I needed only the store ID and the customer ID

from the sales header table. The only difficulty was that the SPSS can make the analysis only on

continuous set of numbers, but the customer ID’s are not continuous. As a solution I generated

a sequence in SQL server to the customer ID with the script below.

SELECT DISTINCT IDENTITY(int,1,1) AS RowNumber, vevok INTO #Temp FROM [Sales header] WHERE vevok is not null ORDER BY vevok SELECT * FROM #Temp ORDER BY RowNumber DROP TABLE #Temp

In the SPSS I also created a continuous variable set from the shop codes.

3.6.2 Modeling

After the data preparation I made a correspondence analysis, where the rows were the

shopping logs identifiers and the columns were the stores ones. On the Chart 4 is visible the

result of the correspondence analysis. I removed the labels of the cases from the diagram for

best visibility; in the SPSS output file we are able to turn it on during the analysis.

This chart is the most picturesque output of this function. It is a very powerful tool to get some

information about which store is more preferred by the customers. On the chart customers are

located closer to those store, which they already had visited most frequent. For example those

customers, who did their shopping only in shop “B7”, are located exactly on the point of “B7”.

Those customers, who looks like a dotted line between two shops, they are visited only those

two. For example the customer, whose ID is 6523, sold items in the shops B1, B4 and in B7.

From his location it is visible, that he visited more time the B1 than the others, because the

cross in the cart is closer to the circle of B1. In this case the coordinates of the customers are

the centroid of a triangle, which has the some weight in its vertexes. The weights are the count

of the visits of the certain shops.


29

Chart 4 – Correspondence analysis between customers and shops

On the other hand, correspondence analysis is very useful to identify, the loyalty cycles of the

shops or the preferred shops of the customers. The only thing to do is to connect the selected

point to the origin point with a line. Then I had to calculate the distance of the other point from

the line. Which point is closer to the line, it is in their closer surroundings of the original point

to. Let’s see an example: when we draw a line throw the position of the customer 6523 and the

origin, then calculate the distance of the shops from this line, than the order would be

something like: B1, B7, B4, P1, V1. So it shows too, that the customer 6523 did their shopping

most in the shop B1.

3.6.3 Evaluation

As it is visible from the examples in the modeling chapter, correspondence analysis is a very

useful and strength tool to identify the likeability of some cases. In this data set it was very easy

6523

B1

B4

B7


30

to use, but on bigger data it could mean problem for SPSS (it can solve correspondence analysis

up to 3000 cases, and this dataset must be continuous). I review the process on some special

cases with my supervisor and the result was imposing enough.

Customer-power analysis 3.7

In this chapter I am looking for some connection between the customers and the shops, sales.

Furthermore I am looking for some another indicators from which we are able to get

knowledge about the relationship of the customers to the sold items. In short I am looking the

answer of the following questions:

Where the customers did spent the most money?

How much do they spend average in certain shops?

In which shops are the products sold better? Where the sales indicators are lower?


To answer those questions I got the required data from SQL Server too with the next queries:

SELECT shop, receiptId, salesPrize AS "SUMARY OF SHOPPINGS" FROM [Sales header] WHERE costomer IS NOT NULL ORDER BY receiptId

The query returns the shopId, receiptId and salesPrize of the shopping. I had been used it for

made the boxplot and gained some descriptive information about this dataset.

SELECT itemId FROM [Sales lines] GROUP BY itemId HAVING SUM(quantity) = 0 ORDER BY itemId

I used the previous query to get those items, which already had not been sold or had been

reaturned already by the customers.

The following query returns those items, that had already been returned by the customers. It is

necessary for analysing that why had they been returned?

SELECT shop, itemId, SUM(quantity) AS "SUMMARIZED RETURNED QUANTITY" FROM [Sales lines] GROUP BY itemId, shop HAVING SUM(quantity) < 0 ORDER BY shop, [SUMMARIZED RETURNED QUANTITY]


31

The last query is a bit more complicated, than the others. It list those stores and items, that are

sold more than the average.

At first I calculated, that how money times were the certain items sold in each shops. Then I

calculated, that what is the average size of solds for the certain items. Finally, I listed that

where and which items are sold above the average.

SELECT BOLT, CKOD, SUM(MENY) "summary by shops" INTO #Temp FROM [Sales lines] GROUP BY BOLT, CKOD ORDER BY CKOD SELECT CKOD, AVG([summary by shops]) "average globally sold quantity" INTO #Temp2 FROM #Temp GROUP BY CKOD ORDER BY CKOD SELECT #Temp.BOLT, #Temp.CKOD FROM #Temp INNER JOIN #Temp2 ON #Temp.CKOD = #Temp2.CKOD WHERE #Temp.[summary by shops]>#Temp2.[average globally sold quantity] ORDER BY #Temp.BOLT, #Temp.CKOD DROP TABLE #Temp, #Temp2;

In the modeling section I interpreted these input datasets and tried to answer the questions in

the introduction.

3.7.2 Modeling

Well at first let’s see the result of the first query. I made from it a box chart (Chart 5), which

shows a lot of useful descriptive statistical information about the transactions. The highness of

the box contains 50% of the values. Those values which are involved in the box are called the

interquartile range (IQR) or middle fifty. The wide line inside the box means the median value

of the data set. The location of the median line shows us some information about the

distribution of the dataset. When the line is in the middle of the box, then the value set is

normally distributed. The shop B4 and V1 shows some normally distribution’s nature in the IQR.

When the median line is closer to the bottom of the box, then is it more likely that the

distribution is tailed to the higher values, the value set has a positive canting. When the median

line is closer to the upper limit of the box, then the dataset has a negative canting.


32

The T-shaped lines above the boxes are called the “whiskers”. Their sizes are one and a half of

the size of the boxes if there is any value in this range. If there is not any value in the range,

then the upper/lower bound of the box means the maximum/minimum limit of the dataset. In

Chart 5 there is not any category with a bottom whisker, the bottom of the boxes means the

lowest values. By standard distribution 95% of the values are located in the range between the

lower and upper whiskers.

As it is seen on the Chart 5 there are some values marked with a circle or a star. Those values

are the outliers. The outliers are extreme values. The SPSS divides them into two categories.

The outliers marked with a circle are those values which are out of the one and a half size of the

boxes; while the extreme outliers are marked by stars are out of the size of three boxes. As we

can see, most of the shops have a lot of transactions that are more than ten times bigger than

the median values, so the distribution functions of the transactions are likely to have some

extreme big values, than hardly tailed.

Chart 5 – Box char of the transactions total prize for each shop


33

Well the Chart 5 is picturesque enough, but the most important indicators are hard to find on it

exactly, so as a solution for this problem I created Table 7. On the table it is simple to see, that

the shop B1 generates the biggest traffic in the chain, it is 40% of all. I found very interesting,

that in the shop V1 spent the customers only 5.9% of all of their spending, but the average cost

per shopping is the highest, and also the variance of the shopping is quiet low. The shop P2

generated only a small amount of traffic and the total cost of the shopping is also the lowest

there, maybe it needs some change in the variety of goods or other business political changes

to become more profitable.

Sales prize

Sum Mean Median Maximum Table Sum %

Shop ID

B1 23788047,00 15268,32 7124,00 252720,00 40,0%

B4 14689954,00 18225,75 10000,00 189400,00 24,7%

B7 15790458,00 15375,32 6792,00 242890,00 26,6%

P2 1699571,00 6026,85 1000,00 84500,00 2,9%

V1 3501982,00 21887,39 15114,00 186070,00 5,9%

Table 7 – Description analysis data about the first input dataset

The result of the second query shows, those items what have not been sold yet. There was only

38 items in the result set, so in the full order of the items it is under 10%, so it was not bad at

all.

The result set of the following query showed me that, which was those items, what the

customers returned to the shops. Well I could not find any unnatural in those result. There

were no any items, which did not satisfy the customers. But it took my attention that most of

the returns were in shop V1. It should be expostulated, why these happened. Are the

customers not informed correctly about the item, they are going to buy? Made the shopkeeper

too many receipts gone wrong?

The result set of the next query needed a little bit more transformation before the result of the

analysis ended up here. Only that information that which items were sold better in certain

shops are not enough informative about the performance of the whole chain so I create a pivot

table from the result set and computed, that how many items are sold better than the average

in certain shops? This is shown on the Chart 6. In my opinion this indicator could be the most


34

informative for the board about the performance of the single shops, when the bonuses are

calculated for the employees. From this chart it is shown too, that the customers did not really

preferred P2 and B7 shops, so the management should going to spin up the sales there or try to

renew the business to be more profitability.

Chart 6 - Number of items sold above the average per shops

3.7.3 Evaluation

All in all I think that in this section of analysis I reviewed the task exactly enough to make the

conclusions that are in the modeling phase. I examined the behavior of the customers widely

and tried to examine carefully it from a lot of aspects.

Discounting method analysis 3.8

In the examined retailer chain there is not any method for making discounts. The shopkeepers

are allowed to give discount up to 10% for the certain customers.

In my belief, in very small size this technique can be ready to work. It is ok when there is a

strong relationship between the shopkeepers and the customers, but it needs a lot of work to

keep the standard of this relationship. Over a limit of customers it needs more time (= lost

money) to keep this relationship than if the shopkeeper could use those time to serve new

customers and earn extra money. In this case the board and the team that responsible for

B4; 315

B1; 284

V1; 183

B7; 143

P2; 68


35

discounts would be able to make bigger and more effective campaign from advertising and

marketing to managing the customer relationships.

By this size of the chain I think that it would be more effective if the discounts were managed in

a centralized way. For example the installation of a system which contains all the ERP systems

data in a data warehouse. Behind the DWH there is the business logic, which makes the users

able to make different discounts by the value/amount of the shopping. The system would give

automatically 5% discount for those customers who buy items over five times of the average

sold amount, etc.

Summary

36

4 Summary During my project task I was learning a lot both in theory and in practice. After the literature

review I was creating a data warehouse in Microsoft SQL Server. The creating of a data

warehouse was even not a problem, because in my project laboratory and thesis work in BSc I

already had made some, but this time it was necessary a huge amount of data cleaning tasks.

During my work I had learned a lot about the CRISP-DM methodology, I tried to keep following

it as much as it was possible, but the goal of the project was not really business goal driven. The

main aim of this project work was to present to the data owner company, that by using data

mining techniques the common workflows are better verifiable and the revenue is easily

increasable.

During the analysis part of my task I tried to collect most of the important aspects where the

chain connects to the customers. Also I wanted to get as complex knowledge about the SPSS as

it is possible during this project. In my opinion this was successful. I gained a strong, basic

knowledge that is useful for further projects. Furthermore I tried to make the various analysis

tasks in different data mining or statistical methods for more reasons. Moreover that I want to

learn the basics of more techniques in SPSS, I also tried to make the certain analysis projects in

the best method for it. Of course it needed some extra work to find the best methods, but for

reducing the length of this report usually I documented here only the finally chosen ones.

Among the project laboratory task I have learned a lot of new techniques and getting known

with the SPSS Statistics. I getting knows the customer data and its structure in details, so I

became able to realize some fields of the customer data that indicates maybe some

problematic area of business. In my report I was trying not only to identify the problems, but

find and suggest some possible solution for it. I thought that it is important to reflect for other

possible problems, which are not so closely connected to the customer data, but maybe visible

from it. Such as the returning of the items is not exactly a customer analytical problem, but

when the customer are not satisfied by the sold goods, or the shopkeepers did not give them

enough detailed information before the shopping, then it might influence the visiting ratio and

the revenue, so in figuratively way it is in connection to the customer analysis.

Summary

37

The hugest difficulty during my task was that there was not a concrete, measurable business

goal to reach, rather I needed to make questions which I thought to be interesting. Also after

the analysis there was not a strict limit which shows, that the chosen method was good or

effective enough.

The final presentation of project for the customer will be in the near future, I hope, that they

will be satisfied and in the future projects they will have well-defined goals to reach.

Table of figures, tables and charts

38

Table of figures, tables and charts

Figure 1 – Entity relationship diagram of the star scheme ............................................................. 9

Figure 2 – ER diagram of the snowflake scheme .......................................................................... 10

Figure 3 – Processes of CRISP-DM ................................................................................................ 13

Figure 4 – Control flow diagram of the ETL process ..................................................................... 17

Figure 5 – The data flow diagram of the ETL for sales header ..................................................... 19

Table 1 – Number of cases in each Cluster ................................................................................... 21

Table 2 – Usual sold items in the separate clusters ..................................................................... 21

Table 3 – SPSS input format for monthly seasonality analysis ..................................................... 23

Table 4 - SPSS input format for daily seasonality analysis ............................................................ 24

Table 5 – Correlation calculation result for shop B4 .................................................................... 25

Table 6 – Part of he hourly correlation matrix ............................................................................. 27

Table 7 – Description analysis data about the first input dataset ................................................ 33

Chart 1– Average daily sales value per shop ................................................................................ 22

Chart 2 – Average daily income in shop B4 .................................................................................. 24

Chart 3 - Average hourly income in certain shops ........................................................................ 26

Chart 4 – Correspondence analysis between customers and shops ............................................ 29

Chart 5 – Box char of the transactions total prize for each shop ................................................. 32

Chart 6 - Number of items sold above the average per shops ..................................................... 34

Bibliography

39

Bibliography

[1] Cisco, "Scribd.," [Online]. Available: http://www.scribd.com/doc/59541956/Global-

Internet-Expansion-Infographic. [Accessed 8th april 2012].

[2] S. E. Prokesch, "Competing on Customer Service: An Interview with British Airways’ Sir

Colin Marshall," Harward Bisiness Review, november 1995. [Online]. Available:

http://hbr.org/1995/11/competing-on-customer-service-an-interview-with-british-

airways-sir-colin-marshall/ar/1. [Accessed 9th april 2012].

[3] C. Sildó, „Adattárház összefoglaló,” august 2004. [Online]. Available:

http://scs.web.elte.hu/Work/DW/adattarhazak.htm. [Hozzáférés dátuma: 9th april 2012].

[4] S. Gajdos, Adatbázisok, Budapest: Műegyetemi Kiadó, 2006.

[5] B. Palace, "Data Mining: What is Data Mining?," 1996. [Online]. Available:

http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamini

ng.htm. [Accessed 9th april 2012].

[6] A. Doug, "Data Mining," [Online]. Available:

http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/. [Accessed 9th april

2012].

[7] D. Micci-Barreca és S. Ramachandran, „Elite Analytics,” [Online]. Available:

http://www.spss.ch/upload/1122641565_Improving%20tax%20administration%20with%2

0data%20mining.pdf. [Hozzáférés dátuma: 12th april 2012].

[8] Customer Analytics, „Customer Analytics,” Customer Analytics, 2011. [Online]. Available:

http://www.customeranalytics.com/customeranalytics/. [Hozzáférés dátuma: 12th april

2012].

ustomer analytics for a retailer...

Documents