Statistics for Data Scientists

Agenda

Revision

Data

Statistics: Descriptive, Central Tendency, Variation, Distributions

Data Mining

Basics of Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

the culture of academia, which does not reward researchers for understanding technology.

Danger zone: this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created.

Being able to manipulate text files at the command line, understanding vectorized operations, thinking algorithmically: these are the hacking skills that make for a successful data hacker.

Data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science.



What is Business Analytics?

Definition: the study of business data using statistical techniques and programming to create decision support and insights for achieving business goals.

Predictive: to predict the future.

Descriptive: to describe the past.

Data

Data is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist's handwritten notes about her interviews. Data is collected by a huge range of organizations and institutions, including businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations). Data is measured, collected, reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools.

https://en.wikipedia.org/wiki/Data

Data is distinct pieces of information, usually formatted in a special way. All software is divided into two general categories: data and programs. Programs are collections of instructions for manipulating data. Data can exist in a variety of forms: as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind.

http://www.webopedia.com/TERM/D/data.html



Data

Definition of data in English (https://en.oxforddictionaries.com/definition/data):

data, noun [mass noun]

Facts and statistics collected together for reference or analysis: 'there is very little data available'.

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Philosophy: things known or assumed as facts, making the basis of reasoning or calculation.

Variable: something that varies.

Nominal variables have two or more categories with no intrinsic order.

Dichotomous variables are nominal variables with only two categories or levels.

Ordinal variables have two or more categories, just like nominal variables, but the categories can also be ordered or ranked, e.g. from Excellent to Horrible.

Interval variables are measured along a continuum and have a numerical value (for example, temperature measured in degrees Celsius or Fahrenheit).

Ratio variables are interval variables with the added condition that a measurement of 0 (zero) indicates that there is none of that variable. For example, a distance of ten metres is twice a distance of five metres.

https://statistics.laerd.com/statistical-guides/types-of-variable.php
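
The variable types above can be sketched in R, where factors hold categorical data and numeric vectors hold interval and ratio data. All values below are hypothetical and chosen only for illustration.

```r
# Hypothetical example values, chosen only to illustrate each variable type.

# Nominal: two or more categories with no intrinsic order.
blood_group <- factor(c("A", "B", "O", "AB", "O"))

# Dichotomous: a nominal variable with exactly two levels.
passed <- factor(c("yes", "no", "yes", "yes"))

# Ordinal: categories with a meaningful order, from Horrible up to Excellent.
rating <- factor(
  c("Good", "Horrible", "Excellent"),
  levels = c("Horrible", "Poor", "Good", "Excellent"),
  ordered = TRUE
)
horrible_below_good <- rating[2] < rating[1]  # ordered factors support comparisons

# Interval: differences are meaningful, but zero is an arbitrary point.
temp_celsius <- c(10, 20, 30)
temp_steps <- diff(temp_celsius)

# Ratio: a true zero makes ratios meaningful.
distance_m <- c(5, 10)
distance_ratio <- distance_m[2] / distance_m[1]
```

Comparisons such as `<` are only defined for the ordered factor, which is one practical reason to declare ordinal data explicitly.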



Central Tendency

Mean

Arithmetic mean: the sum of the values divided by the number of values.

Geometric mean: an average that is useful for sets of positive numbers that are interpreted according to their product and not their sum (as is the case with the arithmetic mean), e.g. rates of growth.

Median: the number separating the higher half of a data sample, a population, or a probability distribution from the lower half.

Mode: the value that occurs most often.

http://en.wikipedia.org/wiki/Geometric_mean
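
A minimal sketch of these measures in R, using a small made-up vector. Base R has no built-in function for the statistical mode, so `stat_mode` below is a hypothetical helper written for this example.

```r
# Hypothetical sample values.
x <- c(2, 4, 4, 5, 10)

arithmetic_mean <- mean(x)            # (2 + 4 + 4 + 5 + 10) / 5 = 5
geometric_mean  <- exp(mean(log(x)))  # fifth root of the product; needs positive values
sample_median   <- median(x)          # middle value of the sorted data: 4

# Hypothetical helper: returns the most frequent value(s) of a numeric vector.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
sample_mode <- stat_mode(x)           # 4 occurs twice, more than any other value

# Growth rates are a typical use of the geometric mean: hypothetical yearly
# growth factors of +10%, +20% and -10%.
growth_factors <- c(1.10, 1.20, 0.90)
average_growth_factor <- prod(growth_factors)^(1 / length(growth_factors))
```

Compounding `average_growth_factor` over three years gives the same total growth as the three individual factors, which the arithmetic mean of the factors would not.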

Dispersion

Range: the difference between the largest and smallest values in a set of data.

Variance: the mean of the squared differences of the values from the mean.

Standard deviation: the square root of the variance.

Frequency: a frequency distribution is a table that displays the frequency of various outcomes in a sample.

Distribution: the distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of individuals in each group.

http://www.dummies.com/education/math/statistics/what-the-distribution-tells-you-about-a-statistical-data-set/
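
The dispersion measures above can be sketched in R on a made-up vector. One detail worth knowing: the definition above divides by the number of values (population variance), while R's `var()` and `sd()` divide by n - 1 (sample variance), so the two are shown side by side.

```r
# Hypothetical sample values.
x <- c(2, 4, 4, 5, 10)

value_range <- diff(range(x))                 # 10 - 2 = 8

# Population variance: mean of squared differences from the mean.
population_variance <- mean((x - mean(x))^2)  # 36 / 5 = 7.2

# R's var() and sd() use the sample formulas with n - 1 in the denominator.
sample_variance <- var(x)                     # 36 / 4 = 9
sample_sd <- sd(x)                            # sqrt(9) = 3

# Frequency distribution: how often each value occurs, as counts and shares.
frequency_table <- table(x)
relative_frequency <- prop.table(frequency_table)
```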



Distributions

Normal: the simplest case of a normal distribution is known as the standard normal distribution. This is the special case where μ = 0 and σ = 1.

Skewed Distribution

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined.

Image: https://en.wikipedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg




Skewed Distribution

Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. It is a descriptor of the shape of a probability distribution.

Image: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm



Skewed Distribution

In the R package moments, skewness returns the value of skewness and kurtosis returns the value of kurtosis.

https://cran.r-project.org/web/packages/moments/moments.pdf
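
As a sketch that runs without installing extra packages, skewness and kurtosis can be computed in base R from central moments. The helper names `skewness_moment` and `kurtosis_moment` are hypothetical. They implement the standard moment-ratio definitions, which should agree with the moments package functions, though that is not verified here.

```r
# Hypothetical helpers based on central moments (not the moments package itself).
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5   # third central moment over variance^(3/2)
}

kurtosis_moment <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2     # fourth central moment over variance^2 (not excess)
}

# Hypothetical data: one symmetric sample and one with a long right tail.
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)

skew_symmetric <- skewness_moment(symmetric)     # 0 for a symmetric sample
skew_right     <- skewness_moment(right_skewed)  # positive: tail on the right
kurt_symmetric <- kurtosis_moment(symmetric)     # 6.8 / 2^2 = 1.7
```

On this scale a normal distribution has kurtosis 3, so values below 3 indicate lighter tails than the normal.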

Image: http://www.janzengroup.net/stats/lessons/descriptive.html




Distributions

Bernoulli: the distribution of a random variable which takes value 1 with success probability p and value 0 with failure probability q = 1 - p. It can be used, for example, to represent the toss of a coin.
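
A coin toss can be sketched in R as a binomial draw with a single trial, since base R has no separate Bernoulli function. The success probability and the seed below are arbitrary choices for this example.

```r
set.seed(42)   # arbitrary seed so the simulation is reproducible

p <- 0.5       # assumed success probability (a fair coin)

# A Bernoulli variable is a binomial variable with size = 1.
tosses <- rbinom(n = 1000, size = 1, prob = p)

observed_share <- mean(tosses)              # share of 1s, should be near p

prob_one  <- dbinom(1, size = 1, prob = p)  # P(X = 1) = p
prob_zero <- dbinom(0, size = 1, prob = p)  # P(X = 0) = 1 - p

theoretical_variance <- p * (1 - p)         # variance of a Bernoulli variable
```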

Distributions

Chi Square: the distribution of a sum of the squares of k independent standard normal random variables.

http://en.wikipedia.org/wiki/Independence_(probability_theory)

http://en.wikipedia.org/wiki/Standard_normal
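
The definition can be checked with a small simulation: draw k independent standard normal values, square and sum them, and compare the result with R's chi-squared functions. The values of `k`, `n_sim` and the seed are arbitrary choices for this sketch.

```r
set.seed(1)      # arbitrary seed for reproducibility

k <- 7           # degrees of freedom (number of squared normals per sum)
n_sim <- 10000   # number of simulated sums

# Each row holds k independent standard normal draws.
z <- matrix(rnorm(n_sim * k), nrow = n_sim, ncol = k)
sum_of_squares <- rowSums(z^2)

simulated_mean <- mean(sum_of_squares)   # theoretical mean of chi-squared is k

# Compare a simulated quantile with the theoretical one.
simulated_q95   <- unname(quantile(sum_of_squares, 0.95))
theoretical_q95 <- qchisq(0.95, df = k)
```

With enough simulated sums, both the mean and the 95th percentile should land close to their theoretical values.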

Distributions

Poisson: a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event.

http://en.wikipedia.org/wiki/Discrete_probability_distribution

http://en.wikipedia.org/wiki/Statistical_independence
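
A short sketch of Poisson probabilities in R, assuming a hypothetical average rate of 3 events per interval.

```r
lambda <- 3   # assumed average number of events per interval

# Probability of exactly 2 events, from R and from the formula
# exp(-lambda) * lambda^k / k!
p_exactly_2 <- dpois(2, lambda)
p_manual    <- exp(-lambda) * lambda^2 / factorial(2)

# Probability of at most 2 events, and of more than 2 events.
p_at_most_2   <- ppois(2, lambda)
p_more_than_2 <- ppois(2, lambda, lower.tail = FALSE)
```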

Probability

Probability Distribution

The figure shows the probability density function (pdf) of the normal distribution, also called the Gaussian or "bell curve", the most important continuous probability distribution. As annotated on the figure, the probabilities of intervals of values correspond to the area under the curve.

http://en.wikipedia.org/wiki/Probability_density_function

http://en.wikipedia.org/wiki/Normal_distribution
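
The link between probability and area under the curve can be illustrated numerically in R: integrating the standard normal density over an interval should give the same value as the difference of the cumulative distribution function at its endpoints.

```r
# Area under the standard normal density between -1 and 1, by numerical integration.
area_within_1sd <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the cumulative distribution function.
cdf_within_1sd <- pnorm(1) - pnorm(-1)   # about 0.683

# The total area under any probability density is 1.
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
```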

Refresher in Statistics

Using Rcmdr for Statistics

Central Limit Theorem

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
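
A small simulation can illustrate the theorem. The sketch below draws many samples from an exponential distribution, which is strongly skewed, and looks at the distribution of their means. The sample size, number of samples, rate and seed are arbitrary choices for this example.

```r
set.seed(123)       # arbitrary seed for reproducibility

n_samples   <- 5000 # number of repeated samples
sample_size <- 50   # observations per sample
rate        <- 1    # exponential rate: mean 1/rate, standard deviation 1/rate

# Mean of each sample drawn from a skewed (exponential) distribution.
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = rate)))

mean_of_means <- mean(sample_means)  # should be near the population mean, 1
sd_of_means   <- sd(sample_means)    # should be near 1 / sqrt(sample_size)

# hist(sample_means, breaks = 40) would show the roughly bell-shaped result.
```

Although individual exponential draws are far from normal, the sample means cluster symmetrically around 1 with a spread close to 1 / sqrt(50).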


Hypothesis testing

Hypothesis testing is the use of statistics to judge how compatible the observed data are with a given hypothesis. The usual process of hypothesis testing consists of four steps.

1. Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the alternative hypothesis (commonly, that the observations show a real effect combined with a component of chance variation).

2. Identify a test statistic that can be used to assess the truth of the null hypothesis.

3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis.

4. Compare the P-value to an acceptable significance value α (sometimes called an alpha value). If P ≤ α, the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is accepted.

http://mathworld.wolfram.com/HypothesisTesting.html
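
The four steps above can be sketched in R with a one-sample t-test. The data are simulated, and the null value `mu0`, the significance level `alpha` and the seed are assumptions made for this example only.

```r
set.seed(2024)   # arbitrary seed for reproducibility

# Hypothetical data: 30 measurements simulated with a true mean of 0.5.
x <- rnorm(30, mean = 0.5, sd = 1)

# Step 1: null hypothesis H0: mu = 0, alternative H1: mu != 0.
mu0 <- 0

# Step 2: test statistic, here the one-sample t statistic.
n <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 3: two-sided P-value from the t distribution with n - 1 degrees of freedom.
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Step 4: compare the P-value with the chosen significance level.
alpha <- 0.05
reject_null <- p_value <= alpha

# Cross-check against R's built-in test.
check <- t.test(x, mu = mu0)
```

The manual statistic and P-value should match `check$statistic` and `check$p.value`, which is a quick way to confirm that the steps were followed correctly.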




Hypothesis testing

http://cmapskm.ihmc.us/rid=1052458963987_678930513_8647/Hypothesis%20testing.cmap



Hypothesis testing

T test

> x = rnorm(10)          # two independent samples of 10 standard normal draws
> y = rnorm(10)
> t.test(x, y)           # Welch two-sample t-test (R's default)

> ttest = t.test(x, y)   # store the result, a list of class "htest"
> names(ttest)           # list the components of the result

> ttest$statistic        # extract the t statistic

http://statistics.berkeley.edu/computing/r-t-tests

Chi Square Distribution

Problem: Find the 95th percentile of the Chi-Squared distribution with 7 degrees of freedom.

Solution: We apply the quantile function qchisq of the Chi-Squared distribution against the decimal value 0.95.

> qchisq(.95, df=7) # 7 degrees of freedom

[1] 14.067

http://www.r-tutor.com/elementary-statistics/probability-distributions/chi-squared-distribution





Normal Distribution

Solution: We are looking for the percentage of students scoring higher than 84, so we apply the function pnorm of the normal distribution with mean 72 and standard deviation 15.2. We are interested in the upper tail of the normal distribution.

> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)

[1] 0.21492

Student t Distribution

Problem: Find the 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom.

Solution: We apply the quantile function qt of the Student t distribution against the decimal values 0.025 and 0.975.

> qt(c(.025, .975), df=5) # 5 degrees of freedom

[1] -2.5706 2.5706
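
As a quick sanity check on the quantile examples above, a quantile function and its cumulative distribution function are inverses of each other, so feeding a quantile back into the CDF should return the original probability. A minimal sketch:

```r
# Chi-squared: the 95th percentile with 7 degrees of freedom, then back again.
q_chisq <- qchisq(0.95, df = 7)
p_back_chisq <- pchisq(q_chisq, df = 7)   # returns 0.95

# Student t: the 2.5th and 97.5th percentiles with 5 degrees of freedom.
t_quantiles <- qt(c(0.025, 0.975), df = 5)
p_back_t <- pt(t_quantiles, df = 5)       # returns 0.025 and 0.975

# Probability mass between the two t quantiles.
central_mass <- diff(p_back_t)            # 0.975 - 0.025 = 0.95
```

The two t quantiles are symmetric around zero and enclose 95% of the distribution, which is why they are used for two-sided 95% intervals.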



Some code

http://rpubs.com/newajay/stats1







Bayes Theorem

https://artax.karlin.mff.cuni.cz/r-help/library/LaplacesDemon/html/BayesTheorem.html

https://en.wikipedia.org/wiki/Bayes'_theorem
