![Page 1: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/1.jpg)
Introduction to Web Data Analytics Using R and Python
Mamata Jenamani
Associate Professor Department of Industrial & Systems Engineering,
Indian Institute of Technology, Kharagpur
![Page 2: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/2.jpg)
![Page 3: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/3.jpg)
http://kr.renesas.com/edge_ol/global/13/index.jsp
![Page 4: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/4.jpg)
http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/#5c7790b4869a
![Page 5: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/5.jpg)
Predicted Demand
http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/#5c7790b4869a
![Page 6: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/6.jpg)
What is the right environment for analytics?
![Page 7: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/7.jpg)
http://www.kdnuggets.com/2016/07/burtchworks-sas-r-python-analytics-pros-prefer.html
![Page 8: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/8.jpg)
http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages
The 2015 Top Ten Programming Languages
2015 ranking 2014 ranking
![Page 9: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/9.jpg)
Recent trend in programming languages for teaching
http://cacm.acm.org/blogs/blog-cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-us-universities/fulltext
![Page 10: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/10.jpg)
Recent trend in programming languages
http://blog.codeeval.com/codeevalblog/2016/2/2/most-popular-coding-languages-of-2016
Based on hundreds of thousands of data points we've collected by processing over 1,200,000+ challenge submissions in 26 different programming languages.
![Page 11: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/11.jpg)
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
![Page 12: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/12.jpg)
Recent trend in programming languages
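| Rank | Language | Share | Trend |
|------|----------|-------|-------|
| 1 | Java | 23.8 % | -0.7 % |
| 2 | Python | 13.0 % | +2.3 % |
| 3 | PHP | 10.5 % | -0.7 % |
| 4 | C# | 9.0 % | -0.3 % |
| 5 | JavaScript | 7.7 % | +0.7 % |
| 6 | C++ | 7.2 % | -0.4 % |
| 7 | C | 7.0 % | -0.2 % |
| 8 | Objective-C | 4.5 % | -0.8 % |
| 9 | R | 3.2 % | +0.6 % |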
Worldwide share of language tutorial searches (PYPL index), July 2016 compared to a year ago:
http://pypl.github.io/PYPL.html
![Page 13: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/13.jpg)
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
![Page 14: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/14.jpg)
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
![Page 15: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/15.jpg)
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
![Page 16: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/16.jpg)
http://www.kdnuggets.com/polls/2015/analytics-data-mining-data-science-software-used.html
![Page 17: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/17.jpg)
Top 10 most in-demand skills in big data market
http://www.forbes.com/sites/louiscolumbus/2014/12/29/where-big-data-jobs-will-be-in-2015/#48c70566404a
![Page 18: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/18.jpg)
http://www.forbes.com/sites/louiscolumbus/2015/11/16/where-big-data-jobs-will-be-in-2016/#14b0163df7f1
![Page 19: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/19.jpg)
What is Analytics
• Analytics is the scientific process of transforming data into insights for the purpose of making better decisions. The Institute for Operations Research and the Management Sciences (INFORMS),
https://www.informs.org/About-INFORMS/News-Room/INFORMS-in-the-News/Best-definition-of-analytics
• Analytics is the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.269.7294&rep=rep1&type=pdf
• Analytics is the discovery, interpretation, and communication of meaningful patterns in data. Analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favours data visualization to communicate insight.
https://en.wikipedia.org/wiki/Analytics
![Page 20: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/20.jpg)
Data Analysis and Data Analytics
• Data analysis refers to hands-on data exploration and evaluation.
• Data analytics is a broader term and includes data analysis as a necessary subcomponent. Analytics defines the science behind the analysis. The science means understanding the cognitive processes an analyst uses to understand problems and explore data in meaningful ways.
• Analytics also includes data extraction, transformation, and loading; specific tools, techniques, and methods; and how to successfully communicate results. http://www.kdnuggets.com/2015/02/interview-david-kasik-boeing-data-analytics.html
![Page 21: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/21.jpg)
Data Mining and Data Analytics
• Data analytics is distinguished from data mining by the scope, purpose and focus of the analysis. Data miners sort through huge data sets using sophisticated software to identify undiscovered patterns and establish hidden relationships. Data analytics focuses on inference, the process of deriving a conclusion based on what is already known by the researcher. http://searchdatamanagement.techtarget.com/definition/data-analytics
![Page 22: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/22.jpg)
Web Data Analytics
• Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data.
• Although Web mining uses many data mining techniques, it is not purely an application of traditional data mining techniques due to the heterogeneity and semi-structured or unstructured nature of the Web data.
Bing Liu, Web Data Mining
• Web data analytics aims to use the discovered information to aid decision making.
![Page 23: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/23.jpg)
Web data mining categories
• Web structure mining
  – Discovers useful knowledge from hyperlinks
  – Ex. Discovering communities of users who share common interests (Social Network Mining)
• Web content mining
  – Extracts useful information from Web page contents
  – Ex. Analyzing customer reviews and forum postings to discover consumer opinions and sentiment
• Web usage mining
  – Discovers user access patterns from Web usage logs (clickstream data)
  – Ex. Web user behavior modeling, Website personalization
![Page 24: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/24.jpg)
Data analytics steps
• Pre-processing
  – Discretization, data cleaning, data integration, data transformation, data reduction
• Processing
  – Application of data mining (statistics and machine learning) and operations research tools
• Post-processing
  – Application of evaluation and visualization techniques
• Decision making
  – Application of the domain knowledge to interpret the data for decision making
![Page 25: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/25.jpg)
Characteristics of Data
• Data on a user, item, or rating are described by a random variable and are categorized as continuous or discrete.
• Continuous variable: a variable that can assume any value on a continuous scale within a range is said to be continuous.
  – Example: 1) time spent by a buyer on a particular page, 2) interestingness of a joke as rated by the user in Jester
• Discrete variable: variables that can assume a finite or countably infinite number of values are said to be discrete.
  – Example: 1) profession of a user, 2) rating of a product on Amazon
![Page 26: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/26.jpg)
Scales of measurement
• Measurement means assigning numbers or other symbols to characteristics of objects according to certain pre-specified rules.
  – One-to-one correspondence between the numbers and the characteristics being measured.
  – The rules for assigning numbers should be standardized and applied uniformly.
  – Rules must not change over objects or time.
• Scaling involves creating a continuum upon which measured objects are located.
Naresh K. Malhotra, Marketing Research: An Applied Orientation, 6th edition, Pearson, 2009
![Page 27: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/27.jpg)
Characteristics of Scale
• Description: By description, we mean the unique labels or descriptors that are used to designate each value of the scale. All scales possess description.
• Order: By order, we mean the relative sizes or positions of the descriptors. Order is denoted by descriptors such as greater than, less than, and equal to.
• Distance: The characteristic of distance means that absolute differences between the scale descriptors are known and may be expressed in units.
• Origin: The origin characteristic means that the scale has a unique or fixed beginning or true zero point.
![Page 28: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/28.jpg)
Primary Scales of Measurement
| Scale | Illustration | Values for three runners |
|-------|--------------|--------------------------|
| Nominal | Numbers assigned to runners | 7, 3, 8 |
| Ordinal | Rank order of winners | Third place, Second place, First place |
| Interval | Performance rating on a 0 to 10 scale | 8.2, 9.1, 9.6 |
| Ratio | Time to finish in seconds | 15.2, 14.1, 13.4 |
![Page 29: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/29.jpg)
Primary Scales of Measurement
| Scale | Basic Characteristics | Common Examples | Permissible Statistics: Descriptive | Permissible Statistics: Inferential |
|-------|-----------------------|-----------------|-------------------------------------|-------------------------------------|
| Nominal | Numbers identify & classify objects | Social Security nos., numbering of football players | Percentages, mode | Chi-square, binomial test |
| Ordinal | Nos. indicate the relative positions of objects but not the magnitude of differences between them | Quality rankings, rankings of teams in a tournament | Percentile, median | Rank-order correlation, Friedman ANOVA |
| Interval | Differences between objects can be compared, zero point is arbitrary | Temperature (Fahrenheit, Celsius) | Range, mean, standard deviation | Product-moment correlation, t tests, regression |
| Ratio | Zero point is fixed, ratios of scale values can be compared | Length, weight | Geometric mean, harmonic mean | Coefficient of variation |
![Page 30: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/30.jpg)
Statistical method for understanding the data
![Page 31: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/31.jpg)
Descriptive Statistics
• Descriptive statistics are brief descriptive coefficients that summarize a given data set
• Univariate analysis: involves describing the distribution of a single variable, including
  – Central tendency: mean, median, and mode
  – Dispersion
    • Range, quantiles, interquartile range
    • Spread: variance and standard deviation
    • Shape of the distribution: skewness and kurtosis
  – Characteristics of a variable's distribution in graphical form
    • Distribution, histograms, stem-and-leaf plot, box plot
• Bivariate analysis: to describe the relationship between pairs of variables
  – Cross-tabulations and contingency tables
  – Graphical representation via scatterplots
  – Quantitative measures of dependence: correlation, covariance
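The two quantitative measures of dependence named under bivariate analysis can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the sample (n - 1) form of covariance and the Pearson product-moment correlation; the variable names and the small data set are hypothetical.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical example: pages viewed and minutes spent by five visitors.
pages_viewed = [1, 2, 3, 4, 5]
minutes_on_site = [2, 4, 6, 8, 10]

print("covariance:", covariance(pages_viewed, minutes_on_site))
print("correlation:", correlation(pages_viewed, minutes_on_site))
```

Covariance depends on the units of both variables, while correlation is unit-free and bounded between -1 and 1, which is why it is usually the easier of the two to interpret.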
![Page 32: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/32.jpg)
Univariate analysis for understanding the data
• Central tendency: mean, median, and mode
Data: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
Mean ≈ 24, Median = 24, Mode = 24
Same data with one extreme value added: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50, 517
Mean ≈ 54.7, Median = 24, Mode = 24
The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure of central tendency for highly skewed distributions.
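The comparison on this slide can be reproduced with Python's standard `statistics` module. The sketch below is only an illustration; the helper name `summarize` is hypothetical.

```python
import statistics

scores = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]
with_outlier = scores + [517]  # the same data plus one extreme value


def summarize(data):
    """Return (mean, median, mode) for a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)


for label, data in (("original", scores), ("with outlier", with_outlier)):
    mean, median, mode = summarize(data)
    print(f"{label}: mean={mean:.2f} median={median} mode={mode}")
```

Adding the single value 517 more than doubles the mean, while the median and mode stay at 24.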
![Page 33: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/33.jpg)
Univariate analysis for understanding the data
• Dispersion
  – Range, quantiles, interquartile range
  – Spread: variance and standard deviation
• Five-number summary: minimum, first quartile, median, third quartile, maximum
• Example: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
  – Range = Max - Min = 50 - 6 = 44
  – Standard deviation (SD) = s = 11.2
  – Variance = s^2 = 126.4
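The dispersion figures for this data set can be checked with the standard `statistics` module. This is a sketch under two assumptions: the slide's variance is the sample variance (dividing by n - 1), and quartiles follow the default "exclusive" convention of `statistics.quantiles`; other quartile conventions can give slightly different values.

```python
import statistics

scores = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]

value_range = max(scores) - min(scores)
sample_variance = statistics.variance(scores)  # divides by n - 1
sample_sd = statistics.stdev(scores)           # square root of the sample variance

# Three cut points that split the sorted data into four equal-probability groups.
q1, median, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print("range:", value_range)
print("variance:", round(sample_variance, 1), "sd:", round(sample_sd, 1))
print("five-number summary:", min(scores), q1, median, q3, max(scores))
print("interquartile range:", iqr)
```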
![Page 34: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/34.jpg)
Univariate analysis for understanding the data
• Skewness: measures asymmetry of data
  – Positive or right skewed: longer right tail
  – Negative or left skewed: longer left tail

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,

$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i-\bar{x})^{3}}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right)^{3/2}}$$
• Kurtosis: measures peakedness of the distribution of data. The kurtosis of the normal distribution is 0.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,

$$\text{Kurtosis} = \frac{n\,\sum_{i=1}^{n}(x_i-\bar{x})^{4}}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right)^{2}} - 3$$
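The moment-based definitions of skewness and kurtosis on this slide can be sketched directly in Python. The version below assumes the population (uncorrected) forms, skewness = sqrt(n) * sum(d^3) / sum(d^2)^(3/2) and kurtosis = n * sum(d^4) / sum(d^2)^2 - 3, where d is the deviation from the mean; spreadsheet tools often apply small-sample corrections and so report slightly different numbers.

```python
import math


def skewness(xs):
    """Moment-based skewness: sqrt(n) * sum(d^3) / (sum(d^2))^(3/2)."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    sum_cube = sum((x - mean) ** 3 for x in xs)
    return math.sqrt(n) * sum_cube / sum_sq ** 1.5


def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    sum_fourth = sum((x - mean) ** 4 for x in xs)
    return n * sum_fourth / sum_sq ** 2 - 3


scores = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]
print("skewness:", skewness(scores))
print("excess kurtosis:", excess_kurtosis(scores))
```

A symmetric sample such as 1, 2, 3, 4, 5 gives a skewness of 0, while the scores above have a longer right tail and therefore a positive value.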
![Page 35: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/35.jpg)
Univariate analysis for understanding the data
• Characteristics of a variable's distribution in graphical form
  – Bar diagrams and pie charts are used for categorical variables
  – Histograms and box plots are used for numerical variables
Figure 3: Age Distribution (histogram; x-axis: Age in Month, bins 40 to 140 and "More"; y-axis: Number of Subjects, 0 to 16)
Mean 90.41666667
Standard Error 3.902649518
Median 84
Mode 84
Standard Deviation 30.22979318
Sample Variance 913.8403955
Kurtosis -1.183899591
Skewness 0.389872725
Range 95
Minimum 48
Maximum 143
Sum 5425
Count 60
![Page 36: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/36.jpg)
Data Preparation (Preprocessing)
• Data in the real world has many problems
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • Example: occupation=“”
  – noisy: containing errors or outliers
    • Example: Salary=“-10”, Age=“222”
  – inconsistent: containing discrepancies in codes or names
    • Example: Age=“42”, Birthday=“03/07/1997”
• Data need to be formatted for a given software tool
• Data need to be made adequate for a given method
  – Computational stability of the algorithms
1. http://paginas.fe.up.pt/~ec/files_0910/slides/aula_2_DataPreparation.pdf
2. Chapter 2: Han and Kamber, Data Mining Book
3. http://www.cs.unm.edu/~mueen/Teaching/CS591/Lectures/3_Data.pdf
![Page 37: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/37.jpg)
Major Tasks in Preprocessing
• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume but produces the same or similar analytical results
![Page 38: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/38.jpg)
Discretization of Continuous Variables
• Divide the range of a continuous attribute into intervals
  – Some methods require discrete values, e.g. most versions of Naïve Bayes
  – Reduce data size by discretization
  – Prepare for further analysis
• Useful for generating a summary of data
• Also called binning
  – Equal width binning
  – Equal height binning
  – Other methods: entropy based, Holte's 1R
![Page 39: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/39.jpg)
Binning
• Equal width binning
  – It divides the range into N intervals of equal size (range): uniform grid
  – If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B - A)/N
• Equal height binning
  – It divides the range into N intervals, each containing approximately the same number of samples
  – Generally preferred because it avoids clumping
  – In practice, “almost-equal” height binning is used to give more intuitive break points
  – Additional considerations:
    • don't split frequent values across bins
    • create separate bins for special values (e.g., 0)
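The two binning schemes can be sketched as follows. This is a minimal illustration with hypothetical function names: equal width uses W = (B - A)/N as above, and equal height assigns bins by rank. The rank-based version does not implement the additional considerations listed above, so it can split tied values across bins.

```python
def equal_width_bins(values, n_bins):
    """Return a bin index (0 .. n_bins - 1) for each value using width (B - A) / N."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    if width == 0:  # all values identical: everything falls in one bin
        return [0] * len(values)
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lowest) / width), n_bins - 1) for v in values]


def equal_height_bins(values, n_bins):
    """Return a bin index per value so each bin holds about len(values) / n_bins items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


scores = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]
width_bins = equal_width_bins(scores, 4)
height_bins = equal_height_bins(scores, 3)

print("equal width counts: ", [width_bins.count(b) for b in range(4)])
print("equal height counts:", [height_bins.count(b) for b in range(3)])
```

On this data the equal-width bins are uneven because most values cluster around 24, which is the clumping that equal-height binning avoids.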
![Page 40: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/40.jpg)
Data cleaning
• Fill in missing values (manual vs. automatic)
  – Ignore
  – Constant: “unknown”, a new class?!
  – Attribute mean (of entire set or subset)
  – Most probable value: inference-based
• Identify outliers and smooth out noisy data
  – Binning method
  – Clustering
  – Combined computer and human inspection
  – Regression
• Correct inconsistent data
• Resolve redundancy caused by data integration
![Page 41: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/41.jpg)
Outlier Detection in Univariate Data
• Compute the mean and standard deviation. If a value is two to three standard deviations away from the mean, it may be considered an outlier.
• An observation is declared an extreme outlier if it lies outside the interval (Q1 - 3×IQR, Q3 + 3×IQR), and a mild outlier if it lies outside the interval (Q1 - 1.5×IQR, Q3 + 1.5×IQR), where IQR is the interquartile range, IQR = Q3 - Q1.
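Both univariate rules can be sketched with the standard `statistics` module. The example assumes the default "exclusive" quartile convention of `statistics.quantiles` and uses three standard deviations as the cut-off; the function name `iqr_outlier_label` is hypothetical.

```python
import statistics


def iqr_outlier_label(x, q1, q3):
    """Classify x as 'extreme', 'mild' or 'none' using the IQR fences."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild"
    return "none"


data = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50, 517]

# Rule 1: interquartile-range fences.
q1, _, q3 = statistics.quantiles(data, n=4)
labels = {x: iqr_outlier_label(x, q1, q3) for x in data}

# Rule 2: more than three sample standard deviations from the mean.
mean, sd = statistics.mean(data), statistics.stdev(data)
far_from_mean = [x for x in data if abs(x - mean) > 3 * sd]

print("IQR rule:", {x: label for x, label in labels.items() if label != "none"})
print("3-sigma rule:", far_from_mean)
```

The mean and standard deviation are themselves inflated by the outlier, so the standard-deviation rule tends to be less sensitive than the IQR rule on small samples.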
![Page 42: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/42.jpg)
Outlier Detection in Multivariate Data
• Statistical methods
  – Mahalanobis distance
  – Outliers: multivariate data points with large distances
• Data mining methods
  – Distance based measures: an observation is defined as a distance based outlier if at least a fraction β of the observations in the dataset are further than r from it.
  – Clustering based methods consider clusters of small size, including clusters of a single observation, as clustered outliers.
http://www2.cs.uh.edu/~ceick/7362/T1-1.pdf
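The distance-based definition above translates almost literally into code. The sketch below assumes Euclidean distance and a brute-force scan over the dataset; the points and the values of r and β are hypothetical.

```python
import math


def is_distance_outlier(point, dataset, r, beta):
    """True if at least a fraction beta of the dataset lies further than r from point."""
    far = sum(1 for other in dataset if math.dist(point, other) > r)
    return far / len(dataset) >= beta


# Hypothetical two-dimensional observations: a tight group plus one distant point.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]

for p in points:
    print(p, is_distance_outlier(p, points, r=3.0, beta=0.7))
```

Each check scans the full dataset, so testing every observation costs a number of distance computations that grows with the square of the dataset size.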
![Page 43: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/43.jpg)
Handling missing values
• Ignore records (use only cases with all values)
  – Usually done when the class label is missing, as most prediction methods do not handle missing data well
  – Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
• Ignore attributes with missing values
  – Use only features (attributes) with all values (may leave out important features)
• Fill in the missing value manually
  – tedious + infeasible?
![Page 44: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/44.jpg)
Handling missing values
• Use a global constant to fill in the missing value
  – e.g., “unknown”. (May create a new class!)
• Use the attribute mean to fill in the missing value
  – It will do the least harm to the mean of existing data
• Use the attribute mean for all samples belonging to the same class to fill in the missing value
• Use the most probable value to fill in the missing value
  – Inference-based such as Bayesian formula or decision tree
  – Identify relationships among variables
    • Linear regression, multiple linear regression, nonlinear regression
  – Nearest-neighbour estimator
    • Find the k neighbours nearest to the point and fill in the most frequent value or the average value
    • Finding neighbours in a large dataset may be slow
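Two of the simpler strategies above, filling with the attribute mean and filling with the class-wise attribute mean, can be sketched in plain Python. Missing values are represented by `None`; the function names and the small data set are hypothetical, and the class-wise version assumes every class has at least one known value.

```python
from collections import defaultdict


def fill_with_mean(values):
    """Replace None with the mean of the known values of the attribute."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def fill_with_class_mean(values, classes):
    """Replace None with the mean of the known values in the same class."""
    by_class = defaultdict(list)
    for value, label in zip(values, classes):
        if value is not None:
            by_class[label].append(value)
    class_means = {label: sum(vs) / len(vs) for label, vs in by_class.items()}
    return [class_means[label] if value is None else value
            for value, label in zip(values, classes)]


# Hypothetical attribute with two missing entries and a class label per record.
ages = [20, None, 30, 40, None, 60]
labels = ["a", "a", "a", "b", "b", "b"]

print(fill_with_mean(ages))
print(fill_with_class_mean(ages, labels))
```

The overall mean leaves the attribute mean unchanged but ignores group differences; the class-wise mean keeps each class closer to its own typical value.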
![Page 45: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/45.jpg)
Data Integration
• Combines data from multiple sources into a coherent store
• Remove redundancies
• Schema integration
  – integrate metadata from different sources
  – Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  – for the same real world entity, attribute values from different sources are different
  – possible reasons: different representations, different scales, e.g., metric vs. British units
![Page 46: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/46.jpg)
Data transformation
• Smoothing: remove noise from data
  – binning, regression, clustering
• Aggregation
  – summarization, data cube construction
• Generalization: concept hierarchy climbing
• Attribute/feature construction
  – New attributes constructed from the given ones (e.g., add an attribute Area based on height and width)
• Normalization
  – Scale values to fall within a smaller specified range
![Page 47: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/47.jpg)
Data Normalization
• min-max normalization
• z-score normalization (standardization)
• normalization by decimal scaling
  – v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
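The three normalizations can be sketched as below, assuming the standard textbook definitions: min-max maps v to (v - min) / (max - min) rescaled to a target range, z-score maps v to (v - mean) / standard deviation, and decimal scaling divides by 10^j. The function names, the use of the population standard deviation, and the sample values are assumptions made for illustration.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the minimum maps to new_min and the maximum to new_max."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


values = [-986, 120, 917]
scaled, j = decimal_scaling(values)

print("min-max:", min_max(values))
print("z-score:", z_score(values))
print("decimal scaling (j = %d):" % j, scaled)
```

Min-max and decimal scaling are sensitive to extreme values because they depend on the minimum and maximum; z-score normalization is the usual choice when those bounds are unknown or dominated by outliers.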
![Page 48: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/48.jpg)
Data Reduction
• Data cube aggregation
  – Aggregation operations are applied to the data in the construction of a data cube.
• Attribute subset selection
  – Irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed
  – Sometimes according to your knowledge of the business
• Dimensionality reduction
  – Encoding mechanisms are used to reduce the data set size.
• Numerosity reduction
  – The data are replaced or estimated by alternative, smaller data representations
![Page 49: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/49.jpg)
Data cube aggregation
![Page 50: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/50.jpg)
Attribute subset selection
![Page 51: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/51.jpg)
Dimensionality reduction
• Matrix decomposition methods
  – Singular Value Decomposition
• Principal Component Analysis
  – Finding major directions
  – Computes k orthonormal vectors
• Signal processing techniques
  – Discrete Fourier Transform
  – Discrete Wavelet Transform
![Page 52: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/52.jpg)
Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
  – Parametric methods
    • Assume the data fits some model, estimate model parameters, use the estimated values instead of the actual data
    • Regression and log-linear models
  – Non-parametric methods
    • Do not assume models
    • Histograms, clustering, sampling
  – Discretization and concept hierarchy generation
    • Raw data values for attributes are replaced by ranges or higher conceptual levels
    • Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.
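The contrast between a parametric and a non-parametric reduction can be sketched on a small synthetic data set. The example assumes simple least-squares linear regression as the parametric model and simple random sampling without replacement as the non-parametric method; the data, the seed, and the function name `fit_line` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, noise-free data: 100 (x, y) pairs on the line y = 3x + 2.
xs = list(range(1, 101))
ys = [3 * x + 2 for x in xs]

# Parametric reduction: keep only the two model parameters.
slope, intercept = fit_line(xs, ys)

# Non-parametric reduction: keep a simple random sample of 10 records.
rng = random.Random(42)  # fixed seed so the sample is reproducible
sample = rng.sample(list(zip(xs, ys)), k=10)

print("model parameters:", slope, intercept)
print("sample of records:", sample)
```

The regression summary is far smaller but is only as good as the assumed model; the sample makes no model assumption but retains just a fraction of the records.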
![Page 53: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/53.jpg)
Parametric Methods -Regression
![Page 54: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/54.jpg)
Non-parametric methods
![Page 55: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/55.jpg)
Concept Hierarchy Generation and Discretization
![Page 56: Introduction to Web Data Analytics Using R and Python](https://reader034.vdocuments.us/reader034/viewer/2022042517/589d10901a28ab61128b4a4d/html5/thumbnails/56.jpg)
Sampling