Statistics for Data Scientists

Upload: ajay-ohri

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Statistics for  data scientists

Statistics for Data Scientists

Page 2: Statistics for  data scientists

Agenda

Revision

Data

Statistics - Descriptive, Central Tendency, Variation, Distributions

Data Mining

Page 3: Statistics for  data scientists

Basics of Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

the culture of academia, which does not reward researchers for understanding technology.

DANGER ZONE: this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created.

Being able to manipulate text files at the command line, understanding vectorized operations, thinking algorithmically; these are the hacking skills that make for a successful data hacker.

Data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science.

Page 4: Statistics for  data scientists

What is Business Analytics

Definition – study of business data using statistical techniques and programming for creating decision support and insights for achieving business goals

Predictive- To predict the future.

Descriptive- To describe the past.

Page 5: Statistics for  data scientists

Data

Data is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist's handwritten notes about her interviews. Data is collected by a huge range of organizations and institutions, including businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations). Data is measured, collected, reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools.

https://en.wikipedia.org/wiki/Data

Data is distinct pieces of information, usually formatted in a special way. All software is divided into two general categories: data and programs. Programs are collections of instructions for manipulating data. Data can exist in a variety of forms: as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind.

http://www.webopedia.com/TERM/D/data.html

Page 6: Statistics for  data scientists

Data

https://en.oxforddictionaries.com/definition/data Definition of data in English:

data

noun [mass noun]

Facts and statistics collected together for reference or analysis: ‘there is very little data available’

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Philosophy: Things known or assumed as facts, making the basis of reasoning or calculation.

Page 7: Statistics for  data scientists

Variable

Something that varies.

Page 8: Statistics for  data scientists

Variable

Nominal variables are variables that have two or more categories, but which do not have an intrinsic order.

Dichotomous variables are nominal variables which have only two categories or levels.

Ordinal variables are variables that have two or more categories, just like nominal variables, only the categories can also be ordered or ranked, e.g. from Excellent to Horrible.

Interval variables are variables whose central characteristic is that they can be measured along a continuum and have a numerical value (for example, temperature measured in degrees Celsius or Fahrenheit).

Ratio variables are interval variables with the added condition that 0 (zero) of the measurement indicates that there is none of that variable. For example, a distance of ten metres is twice the distance of five metres.

https://statistics.laerd.com/statistical-guides/types-of-variable.php

Page 9: Statistics for  data scientists

Central Tendency

Mean

Arithmetic mean: the sum of the values divided by the number of values.

Geometric mean: an average that is useful for sets of positive numbers that are interpreted according to their product and not their sum (as is the case with the arithmetic mean), e.g. rates of growth.

Median: the number separating the higher half of a data sample, a population, or a probability distribution from the lower half.

Mode: the value that occurs most often.
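As a rough illustration of the four measures above, here is a minimal R sketch. The sample vector is hypothetical, and the mode helper is an assumption of this example: base R has no built-in statistical mode, so the most frequent value is read off a frequency table.

```r
# Hypothetical sample values, chosen only for illustration
x <- c(2, 4, 4, 8, 16)

arith_mean <- mean(x)             # sum of the values / number of values
geo_mean   <- exp(mean(log(x)))   # n-th root of the product; needs x > 0
med        <- median(x)           # middle value of the sorted sample

# mode() in base R returns the storage type, not the statistical mode,
# so take the most frequent value from a frequency table instead.
freq      <- table(x)
stat_mode <- as.numeric(names(freq)[which.max(freq)])

print(c(mean = arith_mean, geometric = geo_mean,
        median = med, mode = stat_mode))
```

If several values tie for the highest count, which.max() keeps only the first one, so this helper reports a single mode even for multimodal data.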

Page 10: Statistics for  data scientists

Dispersion

Range: the difference between the largest and smallest values in a set of data.

Variance: the mean of the squared differences of the values from their mean.

Standard deviation: the square root of the variance.

Frequency: a frequency distribution is a table that displays the frequency of various outcomes in a sample.
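The dispersion measures above might be computed in R as follows. This is a sketch on a hypothetical sample. One point worth noting: the slide defines variance as the mean of the squared differences (dividing by n), whereas R's var() and sd() divide by n - 1, so the two are shown side by side.

```r
# Hypothetical sample values, chosen only for illustration
x <- c(2, 4, 4, 8, 16)
n <- length(x)

value_range <- max(x) - min(x)       # range(x) itself returns c(min, max)

pop_var  <- mean((x - mean(x))^2)    # mean of squared differences (divide by n)
samp_var <- var(x)                   # R's var() divides by n - 1
pop_sd   <- sqrt(pop_var)            # square root of the variance
samp_sd  <- sd(x)                    # square root of var(x)

freq <- table(x)                     # frequency distribution of the sample

print(c(range = value_range, pop_var = pop_var, samp_var = samp_var))
print(freq)
```

The two variances differ by the factor (n - 1) / n, which matters for small samples and becomes negligible for large ones.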

Page 11: Statistics for  data scientists

Distribution

The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of individuals in each group.

http://www.dummies.com/education/math/statistics/what-the-distribution-tells-you-about-a-statistical-data-set/

Page 12: Statistics for  data scientists

Distributions

Normal

The simplest case of a normal distribution is known as the standard normal distribution. This is the special case where μ=0 and σ=1.

Page 13: Statistics for  data scientists

Skewed Distribution

Page 14: Statistics for  data scientists

Skewed Distribution

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined.

Image https://en.wikipedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg

Page 15: Statistics for  data scientists

Skewed Distribution

Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, it is a descriptor of the shape of a probability distribution.

Image http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
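Base R has no skewness or kurtosis function, so the sketch below computes both from standardized moments. The helper name sample_moments and the simulated samples are assumptions made for this illustration, not part of the slides. Kurtosis is reported in its non-excess form, for which a normal distribution gives about 3.

```r
# Hypothetical helper: moment-based sample skewness and (non-excess) kurtosis
sample_moments <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # standard deviation with divisor n
  z <- (x - m) / s             # standardized values
  c(skewness = mean(z^3), kurtosis = mean(z^4))
}

set.seed(42)                   # fixed seed so the simulation is repeatable
symmetric <- rnorm(10000)      # normal sample: skewness near 0, kurtosis near 3
right_skewed <- rexp(10000)    # exponential sample: long right tail

print(sample_moments(symmetric))
print(sample_moments(right_skewed))
```

A positive skewness indicates a longer right tail and a negative one a longer left tail; a kurtosis well above 3 suggests heavier tails than the normal distribution.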

Page 17: Statistics for  data scientists

Distributions

Bernoulli

Distribution of a random variable which takes the value 1 with success probability p and the value 0 with failure probability 1 - p. It can be used, for example, to represent the toss of a coin.
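In R, a Bernoulli variable can be treated as a binomial with a single trial. The sketch below assumes a hypothetical success probability of 0.3 and simulates 1000 draws.

```r
p <- 0.3                         # hypothetical success probability

# Probability mass function: a binomial with size = 1 is a Bernoulli
prob_success <- dbinom(1, size = 1, prob = p)   # equals p
prob_failure <- dbinom(0, size = 1, prob = p)   # equals 1 - p

set.seed(1)                      # fixed seed so the simulation is repeatable
draws <- rbinom(1000, size = 1, prob = p)       # vector of 0s and 1s

print(c(success = prob_success, failure = prob_failure))
print(mean(draws))               # sample proportion of successes, close to p
```

For a fair coin toss, p would be 0.5.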

Page 18: Statistics for  data scientists

Distributions

Chi Square

The distribution of a sum of the squares of k independent standard normal random variables.
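This definition can be checked by simulation. The sketch below draws k independent standard normal values many times, sums their squares, and compares the result with R's built-in chi-squared functions. The choice k = 7 and the number of simulations are arbitrary assumptions for the illustration.

```r
set.seed(123)                    # fixed seed so the simulation is repeatable
k      <- 7                      # degrees of freedom (arbitrary choice)
n_sims <- 20000                  # number of simulated sums

# Each row holds k independent standard normal draws
z      <- matrix(rnorm(n_sims * k), nrow = n_sims, ncol = k)
sum_sq <- rowSums(z^2)           # sum of squares per row

sim_mean <- mean(sum_sq)                            # theory: k
sim_q95  <- unname(quantile(sum_sq, 0.95))          # simulated 95th percentile
theo_q95 <- qchisq(0.95, df = k)                    # theoretical 95th percentile

print(c(sim_mean = sim_mean, sim_q95 = sim_q95, theo_q95 = theo_q95))
```

The simulated mean and percentile should land close to the theoretical values, with the gap shrinking as n_sims grows.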

Page 19: Statistics for  data scientists

Distributions

Poisson

A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event.
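A short R sketch of Poisson probabilities, assuming a hypothetical average rate of 3 events per interval. It uses dpois for the probability of an exact count and ppois for cumulative probabilities.

```r
lambda <- 3                      # hypothetical average number of events per interval

p_exactly_2   <- dpois(2, lambda)                      # P(X = 2)
p_at_most_2   <- ppois(2, lambda)                      # P(X <= 2)
p_more_than_2 <- ppois(2, lambda, lower.tail = FALSE)  # P(X > 2)

# The same P(X = 2) from the formula exp(-lambda) * lambda^k / k!
p_formula <- exp(-lambda) * lambda^2 / factorial(2)

print(c(exactly_2 = p_exactly_2, at_most_2 = p_at_most_2,
        more_than_2 = p_more_than_2))
```

Both the mean and the variance of a Poisson distribution equal lambda.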

Page 20: Statistics for  data scientists

Probability

Probability Distribution

The figure shows the probability density function (pdf) of the normal distribution, also called the Gaussian or "bell curve", the most important continuous probability distribution. As notated on the figure, the probabilities of intervals of values correspond to the area under the curve.
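The link between interval probabilities and area under the curve can be illustrated numerically. The sketch below integrates the standard normal density dnorm over an interval and compares the result with the difference of cumulative probabilities from pnorm. The interval endpoints are an arbitrary choice.

```r
# Area under the standard normal density between -1.96 and 1.96
area <- integrate(dnorm, lower = -1.96, upper = 1.96)$value

# The same probability from the cumulative distribution function
cdf_diff <- pnorm(1.96) - pnorm(-1.96)

# The total area under any probability density is 1
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

print(c(area = area, cdf_diff = cdf_diff, total_area = total_area))
```

Both routes give roughly 0.95, which is why 1.96 standard deviations is the usual cutoff for a 95% interval.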

Page 21: Statistics for  data scientists

Refresher in Statistics

Page 22: Statistics for  data scientists

Using RCmdr for Statistics

Page 23: Statistics for  data scientists

Using RCmdr for Statistics

Page 24: Statistics for  data scientists

Using RCmdr for Statistics

Page 25: Statistics for  data scientists

Using RCmdr

Page 26: Statistics for  data scientists

Central Limit Theorem

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
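A small simulation can make the theorem concrete. The sketch below repeatedly averages samples from an exponential distribution, which is strongly skewed, and checks that the sample means behave like a normal variable. The sample size, number of repetitions, and choice of distribution are all assumptions made for this illustration.

```r
set.seed(2017)                   # fixed seed so the simulation is repeatable
n       <- 50                    # size of each sample
n_means <- 5000                  # number of sample means to simulate

# Exponential(rate = 1) has mean 1 and standard deviation 1
sample_means <- replicate(n_means, mean(rexp(n, rate = 1)))

center <- mean(sample_means)     # theory: 1
spread <- sd(sample_means)       # theory: 1 / sqrt(n)

# Share of means within 1.96 standard errors of the true mean (about 0.95)
coverage <- mean(abs(sample_means - 1) < 1.96 / sqrt(n))

print(c(center = center, spread = spread, coverage = coverage))

# A histogram shows the bell shape; drawn only in an interactive session
if (interactive()) hist(sample_means, breaks = 40)
```

Increasing n should make the histogram of sample_means look more symmetric, even though the individual exponential draws are not.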

Page 27: Statistics for  data scientists

Hypothesis testing

Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true. The usual process of hypothesis testing consists of four steps.

1. Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the alternative hypothesis (commonly, that the observations show a real effect combined with a component of chance variation).

2. Identify a test statistic that can be used to assess the truth of the null hypothesis.

3. Compute the P-value, which is the probability that a test statistic at least as significant as the one observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis.

4. Compare the P-value to an acceptable significance value α (sometimes called an alpha value). If p ≤ α, the observed effect is statistically significant, the null hypothesis is ruled out, and the alternative hypothesis is valid.

http://mathworld.wolfram.com/HypothesisTesting.html
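The four steps above can be sketched in R for a one-sample t test. Everything specific here is an assumption for illustration: the simulated sample, the null value mu0 = 0, and the significance level alpha = 0.05. The manual calculation is cross-checked against R's t.test.

```r
set.seed(10)                              # fixed seed so the example is repeatable
x <- rnorm(30, mean = 0.5, sd = 1)        # hypothetical sample of 30 observations

# Step 1: null hypothesis "true mean equals mu0", two-sided alternative
mu0   <- 0
alpha <- 0.05

# Step 2: test statistic (one-sample t statistic)
n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 3: two-sided P-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Step 4: compare the P-value with alpha
reject_null <- p_value <= alpha

# Cross-check against the built-in test
check <- t.test(x, mu = mu0)
print(c(t = t_stat, p = p_value, builtin_p = check$p.value))
print(reject_null)
```

In practice t.test alone is enough; the manual version is shown only to expose what each step computes.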

Page 28: Statistics for  data scientists

Hypothesis testing

Page 30: Statistics for  data scientists

Hypothesis testing

Page 31: Statistics for  data scientists

Hypothesis testing

Page 32: Statistics for  data scientists

Hypothesis testing

Page 33: Statistics for  data scientists

T test

http://statistics.berkeley.edu/computing/r-t-tests

> x = rnorm(10)
> y = rnorm(10)
> t.test(x,y)

> ttest = t.test(x,y)
> names(ttest)

> ttest$statistic
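As a follow-up sketch, the object returned by t.test holds more than the statistic. The example below pulls out the P-value, confidence interval, and degrees of freedom, and contrasts R's default Welch test with the pooled-variance version selected by var.equal = TRUE. The seed is an assumption added so the run is repeatable.

```r
set.seed(99)                     # fixed seed so the example is repeatable
x <- rnorm(10)
y <- rnorm(10)

welch  <- t.test(x, y)                     # default: Welch test, unequal variances
pooled <- t.test(x, y, var.equal = TRUE)   # classic pooled-variance t test

print(welch$p.value)             # P-value of the test
print(welch$conf.int)            # confidence interval for the difference in means
print(welch$parameter)           # Welch degrees of freedom (usually not an integer)
print(pooled$parameter)          # pooled degrees of freedom: 10 + 10 - 2 = 18
```

The Welch version is generally the safer default when the two groups may have different variances.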

Page 34: Statistics for  data scientists

Chi Square Distribution

Problem

Find the 95th percentile of the Chi-Squared distribution with 7 degrees of freedom.

Solution

We apply the quantile function qchisq of the Chi-Squared distribution against the decimal value 0.95.

> qchisq(.95, df=7) # 7 degrees of freedom

[1] 14.067
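One way to sanity-check a quantile is to feed it back into the cumulative distribution function. The sketch below does this with pchisq, which is the inverse of qchisq.

```r
q95 <- qchisq(0.95, df = 7)                        # 95th percentile, as above

lower_prob <- pchisq(q95, df = 7)                      # P(X <= q95), should be 0.95
upper_prob <- pchisq(q95, df = 7, lower.tail = FALSE)  # P(X > q95), should be 0.05

print(c(q95 = q95, lower = lower_prob, upper = upper_prob))
```

The upper-tail value of 0.05 is why this percentile serves as the critical value for a chi-squared test at the 5% level with 7 degrees of freedom.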

http://www.r-tutor.com/elementary-statistics/probability-distributions/chi-squared-distribution

Page 35: Statistics for  data scientists

Normal Distribution

We are looking for the percentage of students scoring higher than 84, so we apply the function pnorm of the normal distribution with mean 72 and standard deviation 15.2. We are interested in the upper tail of the normal distribution.

> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)

[1] 0.21492
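The same answer can be reached by first converting the score to a z-score and then using the standard normal distribution. This is a sketch of that equivalent route, using the mean and standard deviation from the problem.

```r
score <- 84
mu    <- 72
sigma <- 15.2

z <- (score - mu) / sigma                  # standardized score

upper_tail_z   <- pnorm(z, lower.tail = FALSE)                          # standard normal
upper_tail_raw <- pnorm(score, mean = mu, sd = sigma, lower.tail = FALSE)
complement     <- 1 - pnorm(score, mean = mu, sd = sigma)               # 1 - lower tail

print(c(z = z, upper_tail = upper_tail_z))
```

All three expressions give the same probability, so about 21.5% of students score higher than 84.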

Page 36: Statistics for  data scientists

Student T Distribution

Problem

Find the 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom.

Solution

We apply the quantile function qt of the Student t distribution against the decimal values 0.025 and 0.975.

> qt(c(.025, .975), df=5) # 5 degrees of freedom

[1] -2.5706 2.5706
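These two percentiles are the critical values for a 95% confidence interval with 5 degrees of freedom. The sketch below uses them on a hypothetical sample of six observations and cross-checks the interval against t.test.

```r
# Hypothetical sample of 6 observations, so df = 6 - 1 = 5
x <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.6)
n <- length(x)

t_crit     <- qt(0.975, df = n - 1)            # 97.5th percentile, about 2.5706
half_width <- t_crit * sd(x) / sqrt(n)         # critical value times standard error
ci         <- mean(x) + c(-1, 1) * half_width  # 95% confidence interval for the mean

builtin_ci <- as.numeric(t.test(x)$conf.int)   # same interval from t.test

print(ci)
print(builtin_ci)
```

Because the t distribution is symmetric about zero, the 2.5th percentile is just the negative of the 97.5th.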

Page 37: Statistics for  data scientists

Some code

http://rpubs.com/newajay/stats1

Page 38: Statistics for  data scientists

Some code

http://rpubs.com/newajay/stats4

Page 40: Statistics for  data scientists

Bayes Theorem

https://en.wikipedia.org/wiki/Bayes'_theorem