intro to data analysis project

49
Data Analysis & Data Fluency

Upload: manojkumar

Post on 05-Sep-2015

14 views

Category:

Documents


3 download

DESCRIPTION

accounting

TRANSCRIPT

  • Data Analysis & Data Fluency

  • Topics of todays discussionReading Data Tables to make ConclusionsMeasures of Central Tendency Mean, Median, ModeMeasures of Dispersion Range, Standard DeviationLooking at data over a period of time as a trendCorrelation and Causality

  • Topics of todays discussionReading Data Tables to make ConclusionsMeasures of Central Tendency Mean, Median, ModeMeasures of Dispersion Range, Standard DeviationLooking at data over a period of time as a trendCorrelation and Causality

  • Reading Data Tables Situation 1Let us assume a city has 4 modern format stores (named Store 1, Store 2, etc) of a single retail player They are more or less of similar size and have similar monthly SalesHowever, the Sales by different categories are different for example, one Store might have a higher Sales of FMCG and another a higher Sale of StaplesIn such a scenario, let us look at the buyers of instant noodles in these 4 stores

  • Reading Data Tables Situation 1Brands Purchased in each Store Instant Noodles

    Store 1Store 2Store 3Store 4Among buyers of Instant Noodles in each Store% Buying Maggi variants only70%75%55%85%% Buying Yippee, Top Ramen, etc20%20%35%10%% Buying both 10%5%10%5%

  • Reading Data Tables Situation 1% Contribution of Buyers of Instant Noodles Brands from each Store

    Store 1Store 2Store 3Store 4

    % Buying Maggi variants only18%27%42%13%% Buying Yippee, Top Ramen, etc13%18%66%4%% Buying both 20%14%60%6%

  • Reading Data Tables Situation 1 Assignment Reading the 2 tables what will you conclude about Instant Noodles sales from the 4 stores?Can you make some guesses about the difference in the catchment profiles of these stores?

  • Reading Data Tables Situation 1 Hint Brands Purchased in each Store Instant Noodles

    Store 1Store 2Store 3Store 4Base: Buyers of Instant Noodles in each Store5007001500300% Buying Maggi variants only70%75%55%85%% Buying Yippee, Top Ramen, etc20%20%35%10%% Buying both 10%5%10%5%

  • Reading Data Tables Situation 1 Hint% Contribution of Buyers of Instant Noodles Brands from each Store

    Store 1Store 2Store 3Store 4% of ALL INSTANT NOODLES BUYERS17%23%50%10%% Buying Maggi variants only18%27%42%13%% Buying Yippee, Top Ramen, etc13%18%66%4%% Buying both 20%14%60%6%

  • Reading Data Tables Situation 1SOME SIMPLE CONCLUSIONSStore 3 has a substantially large number of buyers of Instant Noodles as category Store 4 has the least

    Among their respective buyers, Stores 1, 2 and 4 have high (70%+) solus Maggi buyers, especially Store 4 (85%)

    Store 3 has lesser (55%) solus Maggi buyers. But, being the largest seller of instant noodles, contributes maximum to Maggi sales, as well as to the other brands sales

  • Reading Data Tables Situation 1SOME SIMPLE CONCLUSIONSSame-sized Stores, yet Store 3 has Higher Instant Noodles sales and Higher % of new brand (Yippee, Smoodles, etc) Sales So, the catchment profile might be - younger, with more double income hhlds, bachelors, etc- also, psychographically, more open to trying new brands- more exposed to media hence aware of new brands

    Similarly, Store 4 catchment profile might be just the opposite

  • Sales in a Hyper Store Month-wise, last Quarter

    OctNovDecTotal Qtr Sales (in Rs. Lakhs): STAPLES 30.40 28.63 29.05 88.08 F & V 20.00 20.72 18.69 59.41 FISH & MEAT 9.73 10.68 11.02 31.44 BAKERY 6.39 6.54 9.78 22.72 DAIRY & FROZEN 8.84 8.72 8.22 25.78 FMCG 46.65 47.33 48.55 142.53 LIQUOR 29.68 28.73 39.02 97.43 APPAREL 6.17 5.16 6.51 17.84 E & E 0.84 0.48 0.66 1.98 HWP 9.55 9.28 10.06 28.90 COMMON 0.38 0.39 0.43 1.20 TOTAL 168.65 166.66 181.99 517.30

  • Col% - Each month, what is the contribution of each category?

    OctNovDecTotal Qtr Sales: STAPLES 18.0%17.2%16.0%17.0% F & V 11.9%12.4%10.3%11.5% FISH & MEAT 5.8%6.4%6.1%6.1% BAKERY 3.8%3.9%5.4%4.4% DAIRY & FROZEN 5.2%5.2%4.5%5.0% FMCG 27.7%28.4%26.7%27.6% LIQUOR 17.6%17.2%21.4%18.8% APPAREL 3.7%3.1%3.6%3.4% E & E 0.5%0.3%0.4%0.4% HWP 5.7%5.6%5.5%5.6% COMMON 0.2%0.2%0.2%0.2%

  • Row% - For each category, what is the contribution from each month?

    Turnaround Hyper StoreOctNovDecTotal Qtr Sales: STAPLES 34.5%32.5%33.0%100.0% F & V 33.7%34.9%31.5%100.0% FISH & MEAT 31.0%34.0%35.1%100.0% BAKERY 28.1%28.8%43.1%100.0% DAIRY & FROZEN 34.3%33.8%31.9%100.0% FMCG 32.7%33.2%34.1%100.0% LIQUOR 30.5%29.5%40.1%100.0% APPAREL 34.6%28.9%36.5%100.0% E & E 42.6%24.2%33.3%100.0% HWP 33.1%32.1%34.8%100.0% COMMON 31.8%32.4%35.8%100.0%

  • Topics of todays discussionReading Data Tables to make ConclusionsMeasures of Central Tendency Mean, Median, ModeMeasures of Dispersion Range, Standard DeviationLooking at data over a period of time as a trendCorrelation and Causality

  • Measures of Central TendencyIf we have a lot of data and we want to tell something without giving all the data, there is a need to describe them with a single numberFor example, if a shopkeeper has sold 10 soap bars in a day and someone asks At what price did you sell the soaps? the person wants to hear one number. How does the shopkeeper respond?- Maybe, he will give the typical price of a soap bar sold- Maybe, he will give some number that represents the middle of all the prices- Maybe, he will give the most frequent price at which they have been soldIn all cases, he is trying to give a number that somehow represents the centre of all the prices at which the soap bars have been sold (i.e. a Measure of Central Tendency)

  • Mean, Median and ModeThere are three Measures of Central Tendency that are used commonly Arithmetic Mean, Median and Mode.Let us take the situation of the sales of 10 soap bars to explain the Measures of Central Tendency. Suppose, the prices (in Rs) of the soap bars sold throughout the day were as follows:Rs 20, Rs 25, Rs 15, Rs 20, Rs 10, Rs 10, Rs 40, Rs 20, Rs 20, Rs 30Arithmetic Mean - Most commonly used Measure of Central Tendency- Tries to give the centre of the data by computing the sum of numbers and dividing it with the number of numbers- So, in our example, it is (Sum of prices of 10 soap bars)/10, i.e. Rs 210/10 = Rs 21

  • Mean, Median and ModeMedian - Gives the middle number from a set of the data. It is computed in two steps:STEP 1: The numbers are ordered from lowest to highest (or reverse) STEP 2: - if the total numbers in the set is odd, the single middle-most number is the median- if the total numbers in the set is even, the average of the two middle numbers is the median

  • Mean, Median and ModeLet us compute the Median price of the soap bars sold.STEP 1: Prices of 10 soap bars arranged in order (ascending) Rs 10, Rs 10, Rs 15, Rs 20, Rs 20, Rs 20, Rs 20, Rs 25, Rs 30, Rs 40STEP 2: Since there are 10 soap bars sold (i.e. an even number of soap bars sold), there are two middle numbers the 5th and the 6th onesRs 10, Rs 10, Rs 15, Rs 20,Rs 20, Rs 20, Rs 20, Rs 25, Rs 30, Rs 40So the Median will be the mean of Rs 20 and Rs 20, i.e. Rs 20NOTE: Median is the middle number (or mid-point) and not the centre-of-gravity. For example, if the highest price was Rs 400 instead of Rs 40 the median would still be Rs 20. The Mean would change

  • Mean, Median and ModeMode- The most typical number or the one that shows up the most number of timesFor example, in the case of 10 soap bars sold - One soap each has been sold at the price points of Rs 15, Rs 25, Rs 30 and Rs 40 - Two soaps have been sold at Rs 10 each - Four soaps have been sold at Rs 20 eachSince the maximum number (four) of soaps have been sold at Rs 20, Mode will be Rs 20.A set of data can have one or more Modes

  • Mean, Median and ModeArithmetic Mean Most commonly used Measure However, Arithmetic Mean should be avoided in cases where the data has some outliers Some very large or very small values in the data can skew the Mean In such cases, Median can be a better one number conclusion Mode could also be used, especially when one number occurs most frequently in the data set

  • Exercise 1: Measures of Central TendencyIn a shop selling wrist watches, around 500 wrist watches were sold in a month90% of the watches were sold at a price ranging from Rs 700 to Rs 30009% of the watches were sold at prices below Rs 7002 very high-end expensive watches were sold at prices above Rs 2 lakhs during the month

    If you are asked for a one number to give an answer to the question: At what prices are wrist watches sold from this shop? Which measure of Central Tendency will you use Mean, Median or Mode?

  • Exercise 2: Measures of Central TendencyIn another shop selling wrist watches in a relatively down-market area, around 200 wrist watches were sold in a month.

    All the wrist watches were in the price range of Rs 500 to Rs 2000. If you are asked for a one number to give an answer to the question: At what prices are wrist watches sold from this shop? Which measure of Central Tendency will you use Mean, Median or Mode?

  • Exercise 3: Measures of Central TendencyIn a modern format outlet the sales of the largest SKU of potato wafers, on a particular day, has been as follows:

    If you are asked for a one number to give an answer to the question: At what price are branded potato wafers sold from this outlet? Which measure of Central Tendency will you use Mean, Median or Mode?

    Brands (selling at Rs X)Units soldLays (at Rs 20)43Bingo (at Rs 20)32Pringles (at Rs 80)4Other local brands (at Rs 20)31

  • Exercise 4: Measures of Central TendencyIn an outlet selling footwear in Kolkata, following were the sales (out of 350 footwears sold) by the Sizes of footwear, during a particular week:If you are asked for a one number to give an answer to the question: What is the Size of footwear of Kolkattans, in general? Which measure of Central Tendency will you use Mean, Median or Mode?Are there any other conclusions that you can draw from this data? Would you want to look at the given data in some other way?

    Size of footwearUnits sold3 or below304405756507758609 or above20

  • The concept of Weighted Mean Brands Purchased in each Store Instant Noodles

    AllStore 1Store 2Store 3Store 4Base: Buyers of Instant Noodles in each Store30005007001500300% Buying Maggi variants only?70%75%55%85%% Buying Yippee, Top Ramen, etc?20%20%35%10%% Buying both? 10%5%10%5%

  • The concept of Weighted Mean Brands Purchased in each Store Instant Noodles

    AllStore 1Store 2Store 3Store 4Base: Buyers of Instant Noodles in each Store30005007001500300% Buying Maggi variants only65%70%75%55%85%% Buying Yippee, Top Ramen, etc27%20%20%35%10%% Buying both8% 10%5%10%5%

  • Topics of todays discussionReading Data Tables to make ConclusionsMeasures of Central Tendency Mean, Median, ModeMeasures of Dispersion Range, Standard DeviationLooking at data over a period of time as a trendCorrelation and Causality

  • Measures of DispersionLot of times, a single number (Mean, Median or Mode) does not lead to sufficient conclusions. Let us look at the following data:From 3 outlets selling wrist watches, the average price of watches sold is as follows:Looking at Arithmetic Mean, can we conclude that 3 outlets have similar priced watches sold and similar profile of customers?The answer is NO! So, we need some Measures to know how the data is spread across the Arithmetic Mean.

    Outlet 1Outlet 2Outlet 3Mean price of watches soldRs 890Rs 886Rs 891

  • Range, Variance and Standard DeviationTo understand how spread apart the data is from Mean, three measures are used Range, Variance and Standard DeviationRange is the difference between the highest value and the lowest value in the data set a simple measure, giving clear idea of outliersStandard Deviation, very simply put, is the Measure that shows by how much do the individual members of a data set differ from the Mean value of the dataSo, high Standard Deviation (or SD) individual data points are very much different or spread apart from the MeanLow SD individual data points are close to the Mean

  • Example: Measures of Dispersion Let us look at the 3 outlets selling wrist watches once again now, with the Measures of DispersionKind of Sales happening in the 3 outlets are completely different.Outlet 1 caters to a very homogeneous segment all buying watches within a narrow price range.Outlet 2 has Sales at a wide range of price points (maybe have outliers too). caters to diverse demographic profiles some very affluent ones and some from middle/lower income groups

    Outlet 1Outlet 2Outlet 3Mean price of watches soldRs 890Rs 886Rs 891RangeRs 2500Rs 99,500Rs 7,500Standard Deviation Rs 22.40Rs 255.90Rs 98.20

  • Assignment 2In the catchment area of a store, 1000 people were asked some questions in a survey. All 1000 people themselves shop for day-to-day household items, belong in the age group of 21 45 yrs, and are housewives or single earning members.

    They were asked to agree or disagree with a statement I love buying day-to-day items from modern format outlets rather than going to traditional Kirana stores in a five point scale:Of the 1000 people responding to this question, the mean score obtained was 3.1 out of 5. What can you conclude from this?

  • Assignment 2If you are now given the following distribution:

    What would you conclude? Can you make some hypotheses on the sub-groups of people giving this opinion? Would you want the data to be analyzed in some other sub-groups?

  • Assignment 2We may need to look at an output like:

    to check, Is the polarization of findings due to different attitudes among different age groups and different stages of life?

  • Topics of todays discussionReading Data Tables to make ConclusionsMeasures of Central Tendency Mean, Median, ModeMeasures of Dispersion Range, Standard DeviationLooking at data over a period of time as a trendCorrelation and Causality

  • Data over a period of time as a trendLot of times, data has to be looked at, over a period of time as a trend (e.g. Day-to-day Sales Reports, Weekly Reports or Monthly Sales Reports) Following is the monthly sales data (for 10 months) of consumer durables from a store:

    MONTHSSALES (No. of Units)11102100312041405170615071608190920010190

  • Data over a period of time as a trendSince there can be lot of month-on-month fluctuations in the data, it can be difficult to understand the direction of the data To observe the medium or long term trend, one way of looking at the data is to graphically see these Sales as a Scatter Plot and then draw a trend-line

  • Data over a period of time as a trendThe other way of smoothening the fluctuations and looking at the real picture is by computing Moving Averages. Following table shows 3-monthly moving averages:

    MONTHSSALES SALES (Moving Avg)11102100312011041401205170143615015371601608190167920018310190193

  • Data over a period of time as a trendThe Moving Average shows that there is a clear and consistent upward trend in Sales over this period of 10 months.This is evident from the following graph:

  • Topics of todays discussionReading Data Tables to make ConclusionsMeasures of Central Tendency Mean, Median, ModeMeasures of Dispersion Range, Standard DeviationLooking at data over a period of time as a trendCorrelation and Causality

  • Correlation between two variablesWhen we look at data for two or more variables, we sometimes see that data for two variables move in the same direction. For example, if we record the heights and weights of a large number of people, we would observe that there are many taller people who also weigh higher and similarly there are many shorter people who also weigh lesserAlso, there can be two variables that move mostly in opposite directions For example, the Power of the engine and Mileage of the car mostly, cars with higher Power would have lower Mileage.

    We refer to the term Correlation to explain the strength of linear association between two variables and a co-efficient r is used to measure this strengthThe value of r can range between 1 and +1

  • Correlation between two variablesr value closer to +1 a strong positive association r value closer to -1 a strong negative association r value closer to 0 weak association between the two variablesIn Retail Sales data too, it will be interesting to observe the Correlation of Sales of certain categories over the period of time- Is there a high positive Correlation between Sales of Shampoos and Conditioners?- Is there a negative Correlation between Sales of Shower Gels and Soaps?by looking at long-term Sales data.

  • Correlation does not imply CausalityHowever, one must note CORRELATION DOES NOT IMPLY CAUSALITYMeaning, a high positive Correlation between A and B does not mean that A causes B or A leads to B e.g. Brand Imagery vs Brand Usage generally a high positive correlation but does it mean increase in Brand Imagery would lead to increase in Brand Usage?NO!

  • Assignment 3: CorrelationsCorrelations between the Sales of product categories:

    What can you conclude from this?

  • Assignment 3: CorrelationsCorrelations between the Sales of product categories:

    STAPLES F & V FISH & M BAKERY D & F FMCG LIQUOR CORRELATIONS STAPLES 1.00 0.63 0.62 0.37 0.68 0.96 -0.18 F & V 0.63 1.00 0.39 -0.24 0.83 0.73 -0.29 FISH & MEAT 0.62 0.39 1.00 0.61 0.15 0.61 0.15 BAKERY 0.37 -0.24 0.61 1.00 -0.32 0.31 0.56 DAIRY & FROZEN 0.68 0.83 0.15 -0.32 1.00 0.71 -0.39 FMCG 0.96 0.73 0.61 0.31 0.71 1.00 -0.06 LIQUOR -0.18 -0.29 0.15 0.56 -0.39 -0.06 1.00

  • Correlation does not imply CausalityLet us look at the following article:EATING BREAKFAST MAY BEAT TEEN OBESITY In the study, published in Pediatrics, researchers analyzed the dietary and weight patterns of a group of 2,216 adolescents over a five-year period (1998-1999 to 2003-2004) from public schools in Minneapolis-St. Paul, Minn.The researchers write that teens who ate breakfast regularly had a lower percentage of total calories from saturated fat and ate more fiber and carbohydrates than those who skipped breakfast. In addition, regular breakfast eaters seemed more physically active than breakfast skippers.Over time, researchers found teens who regularly ate breakfast tended to gain less weight and had a lower body mass index than breakfast skippers."[Source: WebMD Health News, March 2008]From the data and argument given in the article, is it fair to conclude that not eating breakfast leads to obesity or Eating breakfast reduces chance of obesity? Why?

  • Correlation does not imply CausalityEATING BREAKFAST MAY BEAT TEEN OBESITY In the study, published in Pediatrics, researchers analyzed the dietary and weight patterns of a group of 2,216 adolescents over a five-year period (1998-1999 to 2003-2004) from public schools in Minneapolis-St. Paul, Minn.The researchers write that teens who ate breakfast regularly had a lower percentage of total calories from saturated fat and ate more fiber and carbohydrates than those who skipped breakfast. In addition, regular breakfast eaters seemed more physically active than breakfast skippers.Over time, researchers found teens who regularly ate breakfast tended to gain less weight and had a lower body mass index than breakfast skippers."[Source: WebMD Health News, March 2008]Open to Discussion

  • Correlation does not imply CausalityThe Title and part of content, tries to say:Eating breakfast No Obesity, Breakfast Skipping ObesityAlso, Breakfast eating Physically activeIs it ? Or is it that these two things go together?Maybe, its the other way round People who have high body fat are less likely to get hungry in morning, so Obesity Breakfast Skipping!!!Or maybe, Physical Activity Breakfast (as you are hungry) and Physical Activity No Obesity So, high Correlation, by no means can give Causality

    generally a high positive correlation but does it mean increase in Brand Imagery would lead to increase in Brand Usage?NO!

  • THANK YOU!

    *********