ustomer analytics for a retailer...
TRANSCRIPT
Customer analytics for a retailer chain
Project laboratory report
Author: Szabó Péter Supervisors: Pálfi Attila Petróczi Attila
Date: 5/10/2012 Course code: VITMM376
Budapest University of Technology and Economics Faculty of
Electrical Engineering and Informatics Department of Telecommunications and Media Informatics
Table of contents
2
Table of contents
Table of contents .......................................................................................................................................... 2
Introduction .................................................................................................................................................. 4
1 Project task ........................................................................................................................................... 6
Software specifications ................................................................................................................. 6 1.1
2 Research of literature ........................................................................................................................... 8
Data warehousing ......................................................................................................................... 8 2.1
Data mining ................................................................................................................................. 11 2.2
CRISP-DM .................................................................................................................................... 12 2.3
2.3.1 Business understanding ...................................................................................................... 13
2.3.2 Data understanding ............................................................................................................ 13
2.3.3 Data preparation ................................................................................................................. 13
2.3.4 Modeling ............................................................................................................................. 13
2.3.5 Evaluation............................................................................................................................ 13
2.3.6 Deployment ......................................................................................................................... 14
Customer analytics ...................................................................................................................... 14 2.4
3 Implementation of the task ................................................................................................................ 15
Business Understanding .............................................................................................................. 15 3.1
Data Understanding .................................................................................................................... 15 3.2
Data Preparation ......................................................................................................................... 16 3.3
Market basket analysis ............................................................................................................... 20 3.4
3.4.1 Task specific data preparation ............................................................................................ 20
3.4.2 Modeling ............................................................................................................................. 20
3.4.3 Evaluation............................................................................................................................ 22
Seasonality analysis ..................................................................................................................... 22 3.5
3.5.1 Task specific data understanding ........................................................................................ 22
3.5.2 Task specific data preparation ............................................................................................ 23
3.5.3 Modeling ............................................................................................................................. 24
3.5.4 Evaluation............................................................................................................................ 27
Geographical analysis.................................................................................................................. 27 3.6
3.6.1 Task specific data preparation ............................................................................................ 28
Table of contents
3
3.6.2 Modeling ............................................................................................................................. 28
3.6.3 Evaluation............................................................................................................................ 29
Customer-power analysis............................................................................................................ 30 3.7
3.7.1 Task specific data preparation ............................................................................................ 30
3.7.2 Modeling ............................................................................................................................. 31
3.7.3 Evaluation............................................................................................................................ 34
Discounting method analysis ...................................................................................................... 34 3.8
4 Summary ............................................................................................................................................. 36
Table of figures, tables and charts .............................................................................................................. 38
Bibliography ................................................................................................................................................ 39
Introduction
4
Introduction
We are living in the age of the information. Almost 8 new users go online every seconds and it
is estimated that in 2015 there will be generated more than 10000 Petabytes (1Petabyte = 1000
Terabytes) business data per month globally. [1]
This huge amount of data could not be managed without mathematical and computational
support. Mathematicians and analyst from all over the world are working on this data to reveal
nontrivial coherency between the records and looking for the explanation where could the
company gain one more cent profit. Besides of the overall analysis and reengineering of the
business models, processes it is important to manage the relationship with the customers too.
Although the customer data is only a small slice of the whole business data, it is not a question
that even this amount of data is impossible to handle manually. The support systems are called
Customer Relationship Management (CRM) systems. They widely implemented the strategy for
managing company’s interactions with their clients. CRM systems provide organizing,
automatizing and synchronizing business processes. They are primary based on sales processes,
but they also support marketing, customer service and technical support.
Based on CRM systems companies were getting able to manage their customers even better.
Therefor they needed even more customer data. They realized rapidly that profiling the
customers could help them to decrease marketing cost so indirectly increase profit. The only
one task was to make the customers interested to give as much information freely to the
company as much it needed. The solution was the customer loyalty card. While customers
demand these cards they fill the registration form and give all the identical and demographic
data that companies needed. Furthermore using that card during shopping companies can
identify that which customer buys certain goods. This process help the companies gather
information that needed to create customers profile.
On the other hand customers join these programs voluntarily because for using the loyalty card
during shopping they get some percent discount or special offers that make them feel more
valuable.
Introduction
5
Apparently this is a win-win situation, but behind the scenes the companies are the only
winners. All their moves are to raise their indicators.
Therefore customer analytics – the analysis of customer data – has become a well-supported
field of data mining too. There are a lot of fields of the business where managers are able to
improve the revenues by using data mining techniques. Customer data is useful when the board
should decide where to open the new shop or who to deliver the special offers.
In the most impressive way Sir Colin Marshall summarized the significance of customer analysis:
“It all comes back in the end to value for money. If you can deliver something extra that others
are not or cannot, some people will pay a slight premium for it. I want to stress that when I say
“slight,” I mean precisely that. In our case, we’re talking about an average of 5%. On our
revenues of £5 billion, however, that 5% translates into an extra £250 million, or $400 million, a
year.” [2]
Project task
6
1 Project task Describe and summarize those marketing strategies which could be served to improve sale
indicators. On the available date examine the particular marketing strategies and forecast their
success ratio in the practice. Look at the coherence between the location and the range of
goods then suggest an upgrade for the actual method. Using SPSS Statistics you should
implement a predictive model which helps making the right decision about the product range in
certain shops and planning the best discounts.
Software specifications 1.1The main goal of my project was to discover the functions of the IBM SPSS (Statistical Program
for Social Sciences) and using it for customer analysis. For the project task I had to setup the
work environment.
I had got the input data in different Excel tables so I made the data understanding task in Excel.
For example there were the customer’s, the item’s and the sales’ data in different
spreadsheets.
The following task was data preparation. I decided that besides of separate Excel worksheets I
am going to use a data warehouse. For that I chose the Microsoft SQL Server 2008 R2 because I
already had known it well. For using a data warehouse of course I had to make the ETL
(Extract-Transfer-Load) process. I prepared the ETL in Microsoft SQL Server Business
Intelligence Development Studio which is a plugin for Visual Studio. It makes much easier to
create ETL processes because you do not have to know SQL as well. In addition you are able to
create a picturesque flow chart diagrams easily so it is also good for presentation. Furthermore
it provides a wide range of data source/destination, DLL function calls in a graphical user
interface versus dozens of lines of code in command prompt.
I setup the whole infrastructure on a single instance. Of course there are a lot of advantages
but also many disadvantages too of this choice. Because of both components run on the same
computer there was no need for secure the network communication. On the other hand
running the data warehouse on a big server it would be enough a cheaper client for running
SPSS and making the analysis and reports.
Project task
7
All in all I decided to run both components on the same computer, because it made much
easier to develop and debug my processes. When my solution will be realized in commercial
use I would suggest running them on separate infrastructure.
Research of literature
8
2 Research of literature
Data warehousing [3] 2.1The data warehousing is about ten years old so a really new field of information technology.
Due to the rapid improvement it is developed on some different ways. Data warehousing is a
union of proved and immature techniques so it is hard to define. Bill Inmon’s is one of the
apostles of the topic. His well-accepted definition says: “A data warehouse is a subject oriented,
integrated, nonvolatile, and time variant collection of data in support of management's
decisions.”
This few words are very complex, let’s examine it a bit:
Subject oriented: Data warehouse is built along a well-defined business goal. Reaching
these goals requires certain business data, for example to making customer analytics
sales and customer data are necessary.
Integrated: Integrated design has a strong connection to the previous point’s data-
driven design. In this field integration means that the data of a certain field of interest is
suggested to store in separated groups.
Nonvolatile: The main expectation of the data loaded into the warehouse is that it must
be the same as it was in the source system. When the transferred data had changed in
the source system, it should be migrate to the data warehouse with a timestamp.
Time variant: Usually we load the data from the past but we use it to make prediction
about the future so it is important to think over that which data should we use and
which should we discard.
The building aspects of a data warehouse differ from those that I learned in the class called
Databases. [3]
In this case the main goal is to store the data in the way that support high query performance
the most. So it is not a huge problem if the data is stored with a little bit redundancy as long as
it gains any extra performance for reporting.
The data warehouses are built usually by star or snowflake scheme. In both cases the important
data is located in the fact table, in the middle. Here is situated the enterprise resource planning
Research of literature
9
(ERP) system’s correctly transformed master data. Around that are placed the dimension tables.
The data that is not exactly the analyzed one are loaded these dimension tables. There is for
example a dimension data table for time, location, and etc. data. The dimension tables gather
all the information that could be a category by the analysis. The main differences between the
two design patterns are the connections between the database tables.
As shown on the figure below, in the way of the snowflake model the fact table is in the center.
All the dimension tables are connected to it. There are all the key attributions of the dimension
tables stored in fact table as a foreign key and throw that could be identified all the matching
rows of the dimension tables.
Figure 1 – Entity relationship diagram of the star scheme
Besides that in the snowflake scheme the dimension tables are ordered and connected
hierarchically and only the top of the hierarchies are directly connected to the fact table. For
example the Shop dimension table is connected to the City where it is located, the City table to
the County etc.
Research of literature
10
Figure 2 – ER diagram of the snowflake scheme
During my project task I built a star scheme data warehouse.
The comparison of the star and the snowflake scheme:
Pros of the star scheme:
easy, intuitive data model
requires less join operations than the snowflake
requires only a few table reading
easy to build, the metadata of the model are simple
Cons:
hard to create aggregations
the handling of the huge hierarchical components slows down the queries
the handling of the dimension data need a lot of redundancy
Research of literature
11
Data mining [5] 2.2Data mining or knowledge is digging through, analyze and discovering new patterns in a huge
amount of data. It is a relatively young and interdisciplinary field of computer science. It is
interdisciplinary because it needed a massive computer support, uses database systems, a lot of
algorithms and applied mathematics. Besides that many people list data mining not to
computer science, neither to business but into social sciences. They rank it into social sciences
because of the individual privacy that every single row of data usually means.
Major elements of data mining are:
Extract, transform, and load transaction data onto the data warehouse system.
Provide data access to business analysts and information technology professionals.
Analyze the data by application software.
Present the data in a useful format, such as a graph or table.
The analytical techniques used in data mining are often well-known mathematical algorithms
and techniques. The new is the application of these techniques. As the stored data increase
more and more company define business problems, that’s solution may be rooted in the heap
of the collected data. The algorithms are abstract so they can apply in very different problems
with high efficiency.
These main tools that data mining usually apply are:
Artificial neural networks - Non-linear predictive models that learn through training and
its structure was inspirited by the natural neural systems.
Decision trees - Tree-shaped structures that represent sets of decisions. These sets
makes the business users able to apply easy rules during the decision making process.
Rule induction - Useful if-then rules from data based on statistical significance. Similar to
the decision trees.
Genetic algorithms - Optimization techniques based on the concepts of genetic
combination, mutation, and natural selection. There are a lot of generations of
algorithms which are improved versions of their ancestry.
Nearest neighbor - A classification technique that classifies each record based on the
records most similar to it. [6]
Research of literature
12
During the years it was necessary to standardize these elements and the data mining process.
In 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004
Java Data Mining standard were based on succeed projects. As the effect of their success in
2006 were released the 2.0 edition of both standards. In the next session I will introduce the
CRISP-DM methodology in details.
I personally understand with Sir Colin Marshall (former British Airways CEO), who said that “It
all comes back in the end to value for money.” [2]
All the same when we are talking about genetics, medicine, social sciences or education the
data mining is always business-driven.
CRISP-DM 2.3As I already mentioned CRISP-DM (CRoss Industrial Standard Process for Data Mining) is a data
mining methodology and standard. As a methodology it is a circular model so very similar to the
concept of the spiral development methodology. In this chapter I am going to introduce it
detailed because I used it during my work because the IBM SPSS follows this methodology too.
Research of literature
13
Figure 3 – Processes of CRISP-DM
As shown on the previous figure CRISP-DM has six phases. In the following am going to
introduce each one in some sentences. [7]
2.3.1 Business understanding I the first phase the most important is that all the participants should understand the project
goal from the business perspective. This problem is going to be translated into data mining
goals and into a problem definition. For the end of this phase the project plan is prepared.
2.3.2 Data understanding Now the task is to collect initial data. If there are multiple data sources integration is an
additional issue. Possible this leads into data preparation steps. In DU phase describing data is
an important task too. It means the examining the gross and surface properties of the acquired
data. In exploring data step we examine the distribution of the key attributes, creating some
visualization reports. As a last step we need to verify data quality as answering questions like
“Are data fields completed?” or “Are there any missing values?”
2.3.3 Data preparation Usually data preparation step requires the most time of a project. It could be needed up to 90%
of the time of the whole project. It involves constructing necessary data, integrate data from
multiple sources, clean it and transfer into the destination format.
2.3.4 Modeling In modeling phase we need to choose the modeling technique. If multiple techniques are
applied we need both tasks separately. For these models we need to generate test designs and
building the selected model(s). As the end of this phase we should assess the model, examine
that the data mining goals have been satisfied.
2.3.5 Evaluation In evaluation step we need to translate to model to the business environment, asses the
business model and check that business goals have been completed. Whether all the goals are
satisfied we should review the data mining process looking for upgrade possibilities.
Research of literature
14
2.3.6 Deployment As a start of the last phase we should make the deployment and maintenance/monitoring plan.
After the deployment we should close the project by making the final report and review the
project to summarize what went go/wrong and collect the work experience for further projects.
Customer analytics 2.4Customer analytics means analytics of customer data. It is a really new field of data mining. It is
trigger was the wide spread of Customer Relationship Management (CMR) systems so the
stored data about the customer involved processes increased dramatically. As a field of the
data mining CA does not end up with descriptive statistics but applies predictive models too.
Customer Analytics is the key to unlocking the hidden value of customer data. It’s the
difference between passive, reactive marketing and agile, active marketing—where every step
is sure, every goal is rooted in relevance, and every insight learned makes your next endeavor
that much smarter and more successful. [8]
The main goal in this field is to predict the behavior of the customer. If we need what is the
next move we will be able to make the right discounts or recommend the whished product to
the specified customer.
Implementation of the task
15
3 Implementation of the task As I already mentioned in the task specification chapter my exercise was to analyze the
customers’ data of a retailer chain.
I searched the answer for the next questions:
What are the usual market baskets?
Is there any seasonal good? Which are those and when?
Is there any connection between the location of the shops and its likability?
What is the customer power of the certain shops?
Who are the discounts made? Is there a better solution?
The business and data understanding and the first, data-warehouse-building step of the data
preparation phases were the same for all the questions so I will describe then together.
Business Understanding 3.1
The main goal of my task is to present to the manager layer of the client company the power of
data mining and show them its business significance. My target was to open the client’s eye to
see how magnificent possibilities are hidden by data mining. Hopefully after my work they are
getting interested in using exact, mathematical models in the strategic planning.
The client in this project was a retailer shop chain who gave me all the sales data from the
previous months. I was needed to work with the sales and customer data to figure out how
they could become more profitable.
For convenience the client I assemble the previous list of questions. My exercise was to answer
those. My personal goal was to get know with SPSS.
Data Understanding 3.2
After that I had been getting know the aim of the project I dug myself into the data. I had
become four excel files. One of them contained the customer database of the client. These
customers are identified by a loyalty card and besides their shopping behavior we can identify
them by their gender, address or lifetime.
Implementation of the task
16
Besides of the customer data I got some data about the available items too such as identity,
prize or last selling’s date.
The rest of the data is for managing the sales. I had got the sales data in two parts. There was
the sales header and the sales line data in different spreadsheets. The sales header table
contains general data about the sales. Furthermore the shop and the cash desk are located
here, where the transaction happened. If the customer possessed any loyalty card, the
identifier had been stored here. Of course any other information is located here which belongs
to the invoice itself as the shopping time, the summarized prize, the summarized prize after the
discounts and some other information about the transaction type.
In the last document there was stored the detailed data of each sale. For every lines of the
invoice stand here the item identifier, the unit prize, the quantity and the line summary.
Right after that I understood the logic of the data storing here and I could see through the
connections between the attributes I decided that I am going to build a data warehouse. This
decision needed to examine the input data more in a more detailed way to get enough
information about the required data types too.
Summarize in this term I was look the input data over and over again to reveal logical
connection and technical properties which are going to be useful during the further steps.
Data Preparation 3.3
In my opinion building of a data warehouse is well-founded in this situation, because there is a
big amount of data that will be the input of all the data mining steps. So I am going to introduce
the Extract-Transfer-Load (ETL) procedure which I made for prepare the data and transfer into
the data warehouse. As I already mentioned it I made the ETL process with Microsoft SQL
Server Business Intelligence Development Studio. It is a graphical designing tool for dataflow.
Implementation of the task
17
Figure 4 – Control flow diagram of the ETL process
On the state chart diagram (Figure 4) could be seen the control flow of the ETL process that I
made. Every time the process starts it’s running with the initialization script. The initialization
script contains some SQL statement that check if the tables are exists. If they do not, then the
script creates them. As an example there is a part of the script. This few lines of code check,
that the postal code table exist or not. If it does not, then it creates the table:
IF NOT EXISTS (SELECT * FROM sysobjects WHERE id = object_id(N'[dbo].[Postal code]')
AND OBJECTPROPERTY(id, N'IsUserTable') = 1 ) BEGIN CREATE TABLE [Postal code]( [POSTAL CODE] nvarchar(4), [CITY] nvarchar(255), [COUNTY] nvarchar(255) ) END
After that we are sure, that the structure of the data warehouse exists, we are able to start fill it
with data. As it is seen on the state chart diagram, there is two independent thread of ETL
process. So prepared the solution to process the customer and the item data paralelly.
On the path that process the item data there were only some simple data preparation steps,
such as datatype conversations.
The another path was a little bit more difficult. In the input dataset there was only the postal
code of the customers who participate in the loyality program so I decided to add some extra
data to make more interesting the analysis. I loaded some geographical data into the
warehouse, such as the county where the customer lives. After the database task loads the
sd ETL control flow
InitDWH
ETL for items
ETL for
postal code
ETL for
customers
ETL for sales
header
ETL for sales
lines
Implementation of the task
18
geographical data into the data warehouse it is able to start the processing of the customer
data, because now we can join to it the geodata. While it was loading the customer data I made
some data transfomartion on it, such as I replaced the missing values with NULL values and I
made the lookup of the geographical data too.
While both threads succesfully finished the customer’s and the item’s data have loaded into the
data warehouse so I could load in and join the sales data too. Because the sales lines should be
connected to the sales header rows, at first add the header data to the database. While I had
been inserting the sales data I added the foreign keys to the table to estabilish the relationship
between the customers and the invoices.
After that the database involved the invoice data too, I finally added the sales lines and
estabilished the required connections between the tables to keep data dependency.
Although the input data comes from an ERP system I needed to make some data cleaning steps
before I could store the data into the data warehouse. Mostly I tried to fill missing os invalid
arguments with NULL’s.
The Figure 5 shows an example for the data flows I defined. This starts with an Excel data
source and after that the data goes through a lot of transtormation node it is giong to store in
the data warehouse. In Clear data node I remove the * character from the KEDVAZON attribute
if it contains. Moreover the receipting time is stored as a four digit long decimal number the
node transtorm it into DB_TIME format. After this conversion the following node cast the ID
from a decimal number stored in a string into integer format. The next node is a conditional
split whether the payment was completed by creditcard or not. The nodes on the two output of
the conditional node add a bimary variable to the table. 1 if the customer paid by creditcard
and 0 if not. Then I made a union node to rejoin cases of the condition and added another
condition whether the customer used a coupon during the shoping or not. If he had used, then I
stored the coupon identifier, when he not, then I filled the attribute with NULL. At the end I
rejoin the datasets and load them into the data warehouse.
Implementation of the task
19
Figure 5 – The data flow diagram of the ETL for sales header
Implementation of the task
20
The analysis spezificant data preparation I will describe right when I will introduce the certain
analysis and its results.
Market basket analysis 3.4
The aim of this analysis was to reveal the most common shopping baskets. Exploring that which
products are bought usually together is very useful to make cross-selling.
3.4.1 Task specific data preparation
The required data is located in the Sales lines table of the data warehouse, so I got the Receipt
ID, Item ID and Quantity columns for every transaction. For the modeling I had to create a pivot
table from it. The table has as many columns as many items are in the database and the
number of its lines is the number of transactions. An arbitrary mij element of generated matrix
is the quantity of the j item which was sold in the i transaction.
3.4.2 Modeling
In the modeling phase I chose to divide the customers into some clusters. For the clustering I
had chosen k-means clustering method.
K-means divide the input points into k clusters. The algorithm is very simple:
1. Choose k random point in the modeling space. These points will be the centers of the
clusters.
2. For each points of the model its distance is calculated from the center points. Every
point belongs to the cluster with the nearest center point.
3. When all the point belongs exactly to one cluster, then the position of the center points
should be recalculated. The new position is the centroid of the cluster. Then we should
step back to the second step and refresh the model.
The second and the third steps should be iterated until none of the center points’ position
changes.
Most of the transactions consists a few listed goods, so during the distance calculation there
will be a lot of dimensions (items) where the value is zeros. In those dimensions where the
coordinate is not zero (there were some sales of the item) the distance will be a relevant value.
When two shopping basket contained more consonant goods, then the distance between then
Implementation of the task
21
is going to get smaller, so those transactions are likely to getting ordered in the same cluster,
which contained more of the same goods.
Table 1 – Number of cases in each Cluster
Besides, that SPSS gave me some statistical info about the running of k-means process, it also
generated the cluster ID’s into the input file. The automatic classification process revealed that
three clusters were enough, but I aimed to make smaller, more exact groups so I doubled
number of clusters. As it could be seen on the Table 1, almost all of the cases are included in
the third cluster. Those are that transactions, when the variance of the sold items were too big.
After filtering of the particular clusters I made some intersect calculation to
reveal the common market basket and
as a part of the evaluation tried to prove, that the items in cluster three have to huge
variance.
Cluster No. Usual item in cluster
1 26218, 31598
2 5573, 26218, 31598
4 31598, 31598, 4582, 6250, 3797, 31004, 25122,
12408, 31598, 26218, 424, 31598, 26218
5 425
6 17579
Table 2 – Usual sold items in the separate clusters
Implementation of the task
22
3.4.3 Evaluation
By the Table 2, I could assess, that there is not a typical market basket in the examined chain.
The range of goods is too wide and the sales data was not enough to make me able to declare a
simple basket to The Usual. In the 1st and 2nd clusters there are some item sets, but they
occurrence cardinality is not significant enough. The 4th cluster contains the data of only one
transaction, it is rather special than usual. The 5th and 6th clusters contains only one product, so
this could be the effect of any marketing campaign or some special offers, but definitely not a
massive phenomenon too.
Seasonality analysis 3.5
In this section I was searching for some recurrence in the shopping periods. I examined the
correlation between the sale amounts in different shops during a period of time and try to
make global consequences.
3.5.1 Task specific data understanding
At first I was wondering that is there any connection between the daily sales and the shops, so I
created the following diagram:
Chart 1– Average daily sales value per shop
But it was not as picturesque as I was expected, so I calculated all the correlation between the
certain shops sales. The figure was right there was not any strength relation between the
Implementation of the task
23
shops, the highest correlation coefficient was only 0.176 between shops V1 and P2, but the
correlation’s significance level was also low. So I started to examine other time periods. I deal
with the correlation between monthly and daily sales.
3.5.2 Task specific data preparation
In the data preparation phase I got the required data for monthly analysis from the data
warehouse for each shop one by one with the following query:
SELECT [BOLT] AS "SHOP" ,DATEPART("MONTH",[RDAT]) AS "MONTH" ,DATEPART("DAY",[RDAT]) AS "DAY OF MONTH" ,AVG([SALEVALUE]) "AVG SALES VALUE" FROM [ProjectLab1].[dbo].[Sales header] WHERE [BOLT] = @BOLT GROUP BY [BOLT], DATEPART("MONTH",[RDAT]), DATEPART("DAY",[RDAT]) ORDER BY [BOLT], "MONTH", "DAY OF MONTH"
The result set of the query is not exactly the excepted so I made some other transformation on
it in Excel. The initial data set for SPSS was a table which contains in the first column the days of
a month and in the following columns the summarized income for this day. I decided to make
those columns for every each store and examine them separately to get more detailed, more
exact results.
DAYS January February March … December
1 Income 01/01 Income 01/02 Income 01/03 … Income 01/12
2 Income 02/01 Income 02/02 Income 02/03 … Income 02/12
… … … … … …
31 Income 31/01 Income 31/03 … Income 31/12 Table 3 – SPSS input format for monthly seasonality analysis
The data preparation for the daily analysis was quite the same. I got the data with this SQL
statement by shops:
SELECT [BOLT] AS "SHOP" ,DATEPART("HH",[ReceiptingTime]) AS "HOUR" ,AVG([ELADERT]) "AVG BY HOUR" FROM [ProjectLab1].[dbo].[Sales header] WHERE [BOLT] = @BOLT GROUP BY [BOLT], DATEPART("HH",[ReceiptingTime]) ORDER BY [BOLT], "HOUR"
Implementation of the task
24
The result was processed again in Excel to the format below:
HOUR SHOP1 SHOP2 SHOP3 … SHOP n
0 Income1/0 Income2/0 Income3/0 … Income n/0
1 Income1/1 Income2/1 Income3/1 … Income n/1
… … … … … …
23 Income1/23 Income2/23 Income3/23 … Income n/23 Table 4 - SPSS input format for daily seasonality analysis
3.5.3 Modeling
In the modeling period I made correlation analysis for each store, if it exist any connection
between the sales in each month, then what is it?
First I was trying to make some charts to help realizing the relationships, but they did not work
anyhow. All of them were like on Chart 2, to crowded to simply show the relevant information.
Chart 2 – Average daily income in shop B4
Although the line charts did not give any relevant information it was good to make some
suspicion to work with.
Implementation of the task
25
For all the shops in the chain that is had got some information, I calculated the correlation
coefficients in SPSS for each month to the others. The most important part of the output is
visible in the Table 5. There are two important indicators in this table:
Pearson Correlation: Pearson’s correlation coefficient. In the interval between 1- and 1.
It means the strength of the relationship. Bigger value is better.
Significance: It is the measurement of how much is it unlikely to have occurred by
chance. It is a possibility so it is between 0 and 1. Smaller number means more reliable
connection.
N means the size of the dataset, in this case the working days of the month.
Table 5 – Correlation calculation result for shop B4
It is important to realize that the correlation table is symmetric, because the operand of
correlation is symmetric too to the arguments.
Implementation of the task
26
In Table 5 I realized the following coherences: The correlation coefficient is relative high in
August-February-January, September-January and October-April. Those relations are quiet
significant too. There are similar relationships in the correlation matrix of other shops too, so it
is very likely, that they sold most of their goods in spring and in the autumn.
The correlation between the hourly sales is much stronger in the certain shops, as it seems on
the Chart 3, the lines are hardly on each other.
Chart 3 - Average hourly income in certain shops
It means that the correlation between the variables is much bigger, so the customer’s behavior
is very similar in each store. As it seems on Table 6 (it is only a part of the whole matrix, to keep
its well-readability) in most of the time the correlation coefficient is almost absolute one. So the
variables look very similar or very opponent.
Implementation of the task
27
Table 6 – Part of he hourly correlation matrix
3.5.4 Evaluation
All in all the periodicity is most important in a year, the peaks are in spring, and in autumn.
Inside of the day in most of the shops periodicity is negligible, but in shop B4 there is a massive
rush after 21 o’clock!
Geographical analysis 3.6
In this section I was looking for some correlation between the locations of the shops and the
customers, who are shopping there.
Implementation of the task
28
3.6.1 Task specific data preparation
At first I made some correspondence analysis from the shopping logs. I examined where the
customers usually do their shopping. For this I needed only the store ID and the customer ID
from the sales header table. The only difficulty was that the SPSS can make the analysis only on
continuous set of numbers, but the customer ID’s are not continuous. As a solution I generated
a sequence in SQL server to the customer ID with the script below.
SELECT DISTINCT IDENTITY(int,1,1) AS RowNumber, vevok INTO #Temp FROM [Sales header] WHERE vevok is not null ORDER BY vevok SELECT * FROM #Temp ORDER BY RowNumber DROP TABLE #Temp
In the SPSS I also created a continuous variable set from the shop codes.
3.6.2 Modeling
After the data preparation I made a correspondence analysis, where the rows were the
shopping logs identifiers and the columns were the stores ones. On the Chart 4 is visible the
result of the correspondence analysis. I removed the labels of the cases from the diagram for
best visibility; in the SPSS output file we are able to turn it on during the analysis.
This chart is the most picturesque output of this function. It is a very powerful tool to get some
information about which store is more preferred by the customers. On the chart customers are
located closer to those store, which they already had visited most frequent. For example those
customers, who did their shopping only in shop “B7”, are located exactly on the point of “B7”.
Those customers, who looks like a dotted line between two shops, they are visited only those
two. For example the customer, whose ID is 6523, sold items in the shops B1, B4 and in B7.
From his location it is visible, that he visited more time the B1 than the others, because the
cross in the cart is closer to the circle of B1. In this case the coordinates of the customers are
the centroid of a triangle, which has the some weight in its vertexes. The weights are the count
of the visits of the certain shops.
Implementation of the task
29
Chart 4 – Correspondence analysis between customers and shops
On the other hand, correspondence analysis is very useful to identify, the loyalty cycles of the
shops or the preferred shops of the customers. The only thing to do is to connect the selected
point to the origin point with a line. Then I had to calculate the distance of the other point from
the line. Which point is closer to the line, it is in their closer surroundings of the original point
to. Let’s see an example: when we draw a line throw the position of the customer 6523 and the
origin, then calculate the distance of the shops from this line, than the order would be
something like: B1, B7, B4, P1, V1. So it shows too, that the customer 6523 did their shopping
most in the shop B1.
3.6.3 Evaluation
As it is visible from the examples in the modeling chapter, correspondence analysis is a very
useful and strength tool to identify the likeability of some cases. In this data set it was very easy
6523
B1
B4
B7
Implementation of the task
30
to use, but on bigger data it could mean problem for SPSS (it can solve correspondence analysis
up to 3000 cases, and this dataset must be continuous). I review the process on some special
cases with my supervisor and the result was imposing enough.
Customer-power analysis 3.7
In this chapter I am looking for some connection between the customers and the shops, sales.
Furthermore I am looking for some another indicators from which we are able to get
knowledge about the relationship of the customers to the sold items. In short I am looking the
answer of the following questions:
Where the customers did spent the most money?
How much do they spend average in certain shops?
In which shops are the products sold better? Where the sales indicators are lower?
3.7.1 Task specific data preparation
To answer those questions I got the required data from SQL Server too with the next queries:
SELECT shop, receiptId, salesPrize AS "SUMARY OF SHOPPINGS" FROM [Sales header] WHERE costomer IS NOT NULL ORDER BY receiptId
The query returns the shopId, receiptId and salesPrize of the shopping. I had been used it for
made the boxplot and gained some descriptive information about this dataset.
SELECT itemId FROM [Sales lines] GROUP BY itemId HAVING SUM(quantity) = 0 ORDER BY itemId
I used the previous query to get those items, which already had not been sold or had been
reaturned already by the customers.
The following query returns those items, that had already been returned by the customers. It is
necessary for analysing that why had they been returned?
SELECT shop, itemId, SUM(quantity) AS "SUMMARIZED RETURNED QUANTITY" FROM [Sales lines] GROUP BY itemId, shop HAVING SUM(quantity) < 0 ORDER BY shop, [SUMMARIZED RETURNED QUANTITY]
Implementation of the task
31
The last query is a bit more complicated, than the others. It list those stores and items, that are
sold more than the average.
At first I calculated, that how money times were the certain items sold in each shops. Then I
calculated, that what is the average size of solds for the certain items. Finally, I listed that
where and which items are sold above the average.
SELECT BOLT, CKOD, SUM(MENY) "summary by shops" INTO #Temp FROM [Sales lines] GROUP BY BOLT, CKOD ORDER BY CKOD SELECT CKOD, AVG([summary by shops]) "average globally sold quantity" INTO #Temp2 FROM #Temp GROUP BY CKOD ORDER BY CKOD SELECT #Temp.BOLT, #Temp.CKOD FROM #Temp INNER JOIN #Temp2 ON #Temp.CKOD = #Temp2.CKOD WHERE #Temp.[summary by shops]>#Temp2.[average globally sold quantity] ORDER BY #Temp.BOLT, #Temp.CKOD DROP TABLE #Temp, #Temp2;
In the modeling section I interpreted these input datasets and tried to answer the questions in
the introduction.
3.7.2 Modeling
Well at first let’s see the result of the first query. I made from it a box chart (Chart 5), which
shows a lot of useful descriptive statistical information about the transactions. The highness of
the box contains 50% of the values. Those values which are involved in the box are called the
interquartile range (IQR) or middle fifty. The wide line inside the box means the median value
of the data set. The location of the median line shows us some information about the
distribution of the dataset. When the line is in the middle of the box, then the value set is
normally distributed. The shop B4 and V1 shows some normally distribution’s nature in the IQR.
When the median line is closer to the bottom of the box, then is it more likely that the
distribution is tailed to the higher values, the value set has a positive canting. When the median
line is closer to the upper limit of the box, then the dataset has a negative canting.
Implementation of the task
32
The T-shaped lines above the boxes are called the “whiskers”. Their sizes are one and a half of
the size of the boxes if there is any value in this range. If there is not any value in the range,
then the upper/lower bound of the box means the maximum/minimum limit of the dataset. In
Chart 5 there is not any category with a bottom whisker, the bottom of the boxes means the
lowest values. By standard distribution 95% of the values are located in the range between the
lower and upper whiskers.
As it is seen on the Chart 5 there are some values marked with a circle or a star. Those values
are the outliers. The outliers are extreme values. The SPSS divides them into two categories.
The outliers marked with a circle are those values which are out of the one and a half size of the
boxes; while the extreme outliers are marked by stars are out of the size of three boxes. As we
can see, most of the shops have a lot of transactions that are more than ten times bigger than
the median values, so the distribution functions of the transactions are likely to have some
extreme big values, than hardly tailed.
Chart 5 – Box char of the transactions total prize for each shop
Implementation of the task
33
Well the Chart 5 is picturesque enough, but the most important indicators are hard to find on it
exactly, so as a solution for this problem I created Table 7. On the table it is simple to see, that
the shop B1 generates the biggest traffic in the chain, it is 40% of all. I found very interesting,
that in the shop V1 spent the customers only 5.9% of all of their spending, but the average cost
per shopping is the highest, and also the variance of the shopping is quiet low. The shop P2
generated only a small amount of traffic and the total cost of the shopping is also the lowest
there, maybe it needs some change in the variety of goods or other business political changes
to become more profitable.
Sales prize
Sum Mean Median Maximum Table Sum %
Shop ID
B1 23788047,00 15268,32 7124,00 252720,00 40,0%
B4 14689954,00 18225,75 10000,00 189400,00 24,7%
B7 15790458,00 15375,32 6792,00 242890,00 26,6%
P2 1699571,00 6026,85 1000,00 84500,00 2,9%
V1 3501982,00 21887,39 15114,00 186070,00 5,9%
Table 7 – Description analysis data about the first input dataset
The result of the second query shows, those items what have not been sold yet. There was only
38 items in the result set, so in the full order of the items it is under 10%, so it was not bad at
all.
The result set of the following query showed me that, which was those items, what the
customers returned to the shops. Well I could not find any unnatural in those result. There
were no any items, which did not satisfy the customers. But it took my attention that most of
the returns were in shop V1. It should be expostulated, why these happened. Are the
customers not informed correctly about the item, they are going to buy? Made the shopkeeper
too many receipts gone wrong?
The result set of the next query needed a little bit more transformation before the result of the
analysis ended up here. Only that information that which items were sold better in certain
shops are not enough informative about the performance of the whole chain so I create a pivot
table from the result set and computed, that how many items are sold better than the average
in certain shops? This is shown on the Chart 6. In my opinion this indicator could be the most
Implementation of the task
34
informative for the board about the performance of the single shops, when the bonuses are
calculated for the employees. From this chart it is shown too, that the customers did not really
preferred P2 and B7 shops, so the management should going to spin up the sales there or try to
renew the business to be more profitability.
Chart 6 - Number of items sold above the average per shops
3.7.3 Evaluation
All in all I think that in this section of analysis I reviewed the task exactly enough to make the
conclusions that are in the modeling phase. I examined the behavior of the customers widely
and tried to examine carefully it from a lot of aspects.
Discounting method analysis 3.8
In the examined retailer chain there is not any method for making discounts. The shopkeepers
are allowed to give discount up to 10% for the certain customers.
In my belief, in very small size this technique can be ready to work. It is ok when there is a
strong relationship between the shopkeepers and the customers, but it needs a lot of work to
keep the standard of this relationship. Over a limit of customers it needs more time (= lost
money) to keep this relationship than if the shopkeeper could use those time to serve new
customers and earn extra money. In this case the board and the team that responsible for
B4; 315
B1; 284
V1; 183
B7; 143
P2; 68
Implementation of the task
35
discounts would be able to make bigger and more effective campaign from advertising and
marketing to managing the customer relationships.
By this size of the chain I think that it would be more effective if the discounts were managed in
a centralized way. For example the installation of a system which contains all the ERP systems
data in a data warehouse. Behind the DWH there is the business logic, which makes the users
able to make different discounts by the value/amount of the shopping. The system would give
automatically 5% discount for those customers who buy items over five times of the average
sold amount, etc.
Summary
36
4 Summary During my project task I was learning a lot both in theory and in practice. After the literature
review I was creating a data warehouse in Microsoft SQL Server. The creating of a data
warehouse was even not a problem, because in my project laboratory and thesis work in BSc I
already had made some, but this time it was necessary a huge amount of data cleaning tasks.
During my work I had learned a lot about the CRISP-DM methodology, I tried to keep following
it as much as it was possible, but the goal of the project was not really business goal driven. The
main aim of this project work was to present to the data owner company, that by using data
mining techniques the common workflows are better verifiable and the revenue is easily
increasable.
During the analysis part of my task I tried to collect most of the important aspects where the
chain connects to the customers. Also I wanted to get as complex knowledge about the SPSS as
it is possible during this project. In my opinion this was successful. I gained a strong, basic
knowledge that is useful for further projects. Furthermore I tried to make the various analysis
tasks in different data mining or statistical methods for more reasons. Moreover that I want to
learn the basics of more techniques in SPSS, I also tried to make the certain analysis projects in
the best method for it. Of course it needed some extra work to find the best methods, but for
reducing the length of this report usually I documented here only the finally chosen ones.
Among the project laboratory task I have learned a lot of new techniques and getting known
with the SPSS Statistics. I getting knows the customer data and its structure in details, so I
became able to realize some fields of the customer data that indicates maybe some
problematic area of business. In my report I was trying not only to identify the problems, but
find and suggest some possible solution for it. I thought that it is important to reflect for other
possible problems, which are not so closely connected to the customer data, but maybe visible
from it. Such as the returning of the items is not exactly a customer analytical problem, but
when the customer are not satisfied by the sold goods, or the shopkeepers did not give them
enough detailed information before the shopping, then it might influence the visiting ratio and
the revenue, so in figuratively way it is in connection to the customer analysis.
Summary
37
The hugest difficulty during my task was that there was not a concrete, measurable business
goal to reach, rather I needed to make questions which I thought to be interesting. Also after
the analysis there was not a strict limit which shows, that the chosen method was good or
effective enough.
The final presentation of project for the customer will be in the near future, I hope, that they
will be satisfied and in the future projects they will have well-defined goals to reach.
Table of figures, tables and charts
38
Table of figures, tables and charts
Figure 1 – Entity relationship diagram of the star scheme ............................................................. 9
Figure 2 – ER diagram of the snowflake scheme .......................................................................... 10
Figure 3 – Processes of CRISP-DM ................................................................................................ 13
Figure 4 – Control flow diagram of the ETL process ..................................................................... 17
Figure 5 – The data flow diagram of the ETL for sales header ..................................................... 19
Table 1 – Number of cases in each Cluster ................................................................................... 21
Table 2 – Usual sold items in the separate clusters ..................................................................... 21
Table 3 – SPSS input format for monthly seasonality analysis ..................................................... 23
Table 4 - SPSS input format for daily seasonality analysis ............................................................ 24
Table 5 – Correlation calculation result for shop B4 .................................................................... 25
Table 6 – Part of he hourly correlation matrix ............................................................................. 27
Table 7 – Description analysis data about the first input dataset ................................................ 33
Chart 1– Average daily sales value per shop ................................................................................ 22
Chart 2 – Average daily income in shop B4 .................................................................................. 24
Chart 3 - Average hourly income in certain shops ........................................................................ 26
Chart 4 – Correspondence analysis between customers and shops ............................................ 29
Chart 5 – Box char of the transactions total prize for each shop ................................................. 32
Chart 6 - Number of items sold above the average per shops ..................................................... 34
Bibliography
39
Bibliography
[1] Cisco, "Scribd.," [Online]. Available: http://www.scribd.com/doc/59541956/Global-
Internet-Expansion-Infographic. [Accessed 8th april 2012].
[2] S. E. Prokesch, "Competing on Customer Service: An Interview with British Airways’ Sir
Colin Marshall," Harward Bisiness Review, november 1995. [Online]. Available:
http://hbr.org/1995/11/competing-on-customer-service-an-interview-with-british-
airways-sir-colin-marshall/ar/1. [Accessed 9th april 2012].
[3] C. Sildó, „Adattárház összefoglaló,” august 2004. [Online]. Available:
http://scs.web.elte.hu/Work/DW/adattarhazak.htm. [Hozzáférés dátuma: 9th april 2012].
[4] S. Gajdos, Adatbázisok, Budapest: Műegyetemi Kiadó, 2006.
[5] B. Palace, "Data Mining: What is Data Mining?," 1996. [Online]. Available:
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamini
ng.htm. [Accessed 9th april 2012].
[6] A. Doug, "Data Mining," [Online]. Available:
http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/. [Accessed 9th april
2012].
[7] D. Micci-Barreca és S. Ramachandran, „Elite Analytics,” [Online]. Available:
http://www.spss.ch/upload/1122641565_Improving%20tax%20administration%20with%2
0data%20mining.pdf. [Hozzáférés dátuma: 12th april 2012].
[8] Customer Analytics, „Customer Analytics,” Customer Analytics, 2011. [Online]. Available:
http://www.customeranalytics.com/customeranalytics/. [Hozzáférés dátuma: 12th april
2012].