big data analytics with r - uibk.ac.at data... · data processing with r using statistical...
TRANSCRIPT
Data Processing with RUsing statistical computing and graphics Software for Business Intelligence and Analytics on a Data Warehouse Environment
About this Course
This course is tailored towards IS Master students and focuses on the handling and processing of data in an Enterprise environment where data are often found in relational structures (e.g. SQL Server, Oracle or MySQL) and analysis is supported by (self-service) business intelligence tools (e.g. Microsoft Excel, PowerPivot or R).
This course takes the case of using the statistical computing and graphics software R on a relational Data Warehouse which is located on a Microsoft SQL Server. It particularly aims at (1) extracting analysis-oriented data from the storage, (2) processing data using statistical procedures and (3) generating graphical output material.
Course Objectives
• Practice data processing in an enterprise-like setting
• Use typical technology that is used in enterprises
• Reflect on the limits and potentials of data processing techniques
Value Chain of Data Analysis
Course Focus
D(Big) Data
PProcessing
OOutput
KKnowledge
AAction
IImpact
extract generate discover initiate realize
Course Assignment Part 1: ReaderPapers:Chaudhuri, S., Dayal, U., & Narasayya, V. (2011). An overview of business intelligence technology. Communications of the ACM, 54(8), 88–98.Smith, D. (2014). R Is Still Hot – and Getting Hotter. Revolution Analytics White Paper. Retrieved from http://www.revolutionanalytics.com/r-is-still-hotAli-Hassan, H. et al. (2015). Linking dimensions of social media use to job performance: The role of social capital. Journal of Strategic Information Systems 24 (2015) 65–89Benthaus, J. et al. (2016). Social media management strategies for organizational impression management and their effect on public perception. Journal of Strategic Information Systems. In Press.
Questions for papers 1 and 2:• How can Business Intelligence and Analytics
contribute to business impact?• How can technology help to assist Business
Intelligence and Analytics?• What limitations can you see in the application of
Business Intelligence and Analytics?• Give a short example of where Business
Intelligence and Analytics help management decisions.
Questions for papers 3 and 4:• What are the research questions and contributions
of the papers?• What methods of data collection were used?• What methods of data analysis were used?• How are the results visualized and discussed?• What are the limitations of the studies?• Give a summarizing evaluation about the overall
quality of the papers.
Course Assignment Part 2: Practical ExampleExercise:
Explore the data set InternetSalesAmount.csv with an Analysis Software of your choice (e.g. Microsoft Excel), create a visual summary and interpret your results.
Questions:• Explore the Distribution of SalesAmountsUSD over
CustomerAge and SalesTerritoryGroup.• Explore the development of SalesAmountUSD over
time.• Explore the distribution of SalesAmountUSD over
other Customer attributes of your choice.
Agenda
• Introduction to R• Getting Started with R Console and R Commander• Data types and Structures• Connecting to a Data Mart / Cube• Exploring and Visualizing Data• Programming Structures, Functions, and Data Relationships
Introduction to R• R is an integrated software suit for
• data manipulation• calculation• graphical display
• R was initially created in 1993• R is developed as a GNU project • R is supported by a global community• R is extendable through packages• R can be used in multiple way
• Windows CMSD• R GUI• R Studio
Windows CMD
R GUI
R Studio
(Ihaka et al. 1996, R Project 2016)
CODE EXAMPLE
R Packages
• Packages only have to be installed once
• From CRAN• From file
• Packages need to be loaded before they can be used
># 2 Options to install packages
>install.packages("RODBC")
>install.packages("rodbc.tar.gz")
>
># 2 Options to use packages
>library("RODBC")
>require("RODBC")
CODE EXAMPLE
Basic R Syntax• Working directory can be
set and retrieved• R supports all kinds of
mathematical operations• <- is the preferred
assignment operator (= is not recommended)
• Outputs are marked with a line-number in (e.g. [1])
• R chooses a suitable data type itself
>setwd("C:/Test/")
>getwd()
[1] "C:/Test/"
>a <- 4 + 1
>b <- 4
>a + b
[1] 9
>ls()
[1] "a" "b"
>typeof(a)
[1] "double"
CODE EXAMPLE
Get Help in R
• R Help is available as html page on your computer
• You can browse help yourself
• You can ask for help with a specific concept
• You search across all installed packages
># Open Help
>help.start()
>
># Search for specific content
>help(mean)
>?mean
>
># Search across all installed projects
>??mean
CODE EXAMPLE
Further Operations in R• R can work with sequences
and vectors as well• Vectors can be input for
further operations (e.g. multiplication with a scalar)
># Create a sequence of numbers
>digits <- 0:10
>digits
[1] 0 1 2 3 4 5 6 7 8 9 10
>sequence <- seq(from=0, to=10, by=1)
>sequence
[1] 0 1 2 3 4 5 6 7 8 9 10
># Create a vector
>vector <- c(1,3,5,7,9)
>vector [1] 1 3 5 7 9
>vector * 1.5
[1] 1.5 4.5 7.5 10.5 13.5
StagingETL
OLTP Database
OLTP Database
OLTP Database
Cube
… Relational Database
… OLAP Cube
Data Structure we are working with
Stage 1 Stage … Data Mart
Adventure Works Case Example
Adventure Works is a fictional company that is selling bicycles and biking accessories. Besides several OLTP databases, Adventure Works is using an OLAP Cube for data analysis which is tailored towards:• Finance Analysis• Product Inventory Analysis• Resales Analysis• Internet Sales AnalysisWe are going to use the Data Warehouse that backs these analysis scenarios.
CODE EXAMPLE
Example Query: Finance
SELECTfact.Amount,fact.Date,acct.AccountType,dat.CalendarQuarter,dat.CalendarSemester,dat.CalendarYear,dat.DayNumberOfYear,org.OrganizationName,org.PercentageOfOwnership
FROM [dbo].[FactFinance] AS factJOIN [dbo].[DimOrganization] AS orgON fact.OrganizationKey = org.OrganizationKeyJOIN [dbo].[DimDate] AS datON dat.DateKey = fact.DateKeyJOIN [dbo].[DimAccount] AS acctON acct.AccountKey = fact.AccountKey
http://tinyurl.com/zqau98u
CODE EXAMPLE
Example Query: Product Inventory
SELECTfact.[MovementDate],fact.[UnitCost],fact.[UnitsIn],fact.[UnitsOut],fact.[UnitsBalance],dat.CalendarQuarter,dat.CalendarSemester,dat.CalendarYear,dat.DayNumberOfYear,prod.DealerPrice AS ProductDealerPrice,prod.EnglishProductName AS ProductName,prod.EnglishDescription AS ProductDescription,prod.[Weight] AS ProductWitght,prod.Size AS ProductSize,prod.SafetyStockLevel AS ProductSafetyStockLevel
FROM [AdventureWorksDW2012].[dbo].[FactProductInventory] AS factJOIN [AdventureWorksDW2012].[dbo].[DimDate] AS datON fact.DateKey = dat.DateKeyJOIN [AdventureWorksDW2012].[dbo].[DimProduct] AS prodON fact.ProductKey = prod.ProductKey
http://tinyurl.com/z9kkmph
CODE EXAMPLE
Example Query: Resales
SELECTfact.[OrderQuantity],fact.[UnitPrice],fact.[ExtendedAmount],fact.[DiscountAmount],fact.[ProductStandardCost],fact.[TotalProductCost],fact.[SalesAmount] AS SalesAmountLCY,ROUND(crat.AverageRate*fact.SalesAmount,2) AS SalesAmountUSD,fact.[TaxAmt],fact.[Freight],fact.[OrderDate],cter.City AS ResellerCity,cter.EnglishCountryRegionName AS ResellerCountry,res.AnnualSales AS ResellerAnnualSale,res.AnnualRevenue AS ResellerAnnualRevenue
,res.NumberEmployees AS ResellerNumberOfEmployees,res.ResellerName,res.OrderFrequency AS ResellerOrderFrequency,res.YearOpened AS ResellerYearOpened,res.FirstOrderYear AS ResellerFirstOrderYear,res.LastOrderYear AS ResellerLastOrderYear
FROM [AdventureWorksDW2012].[dbo].[FactResellerSales] AS factJOIN [AdventureWorksDW2012].[dbo].[DimReseller] AS resON res.ResellerKey = fact.ResellerKeyJOIN [AdventureWorksDW2012].[dbo].DimSalesTerritory AS sterON ster.SalesTerritoryKey = fact.SalesTerritoryKeyJOIN [AdventureWorksDW2012].[dbo].DimGeography AS cterON res.GeographyKey = cter.GeographyKeyJOIN AdventureWorksDW2012.dbo.FactCurrencyRate AS cratON crat.DateKey = fact.OrderDateKeyAND crat.CurrencyKey = fact.CurrencyKey
http://tinyurl.com/hsqcort
CODE EXAMPLE
Example Query: Internet Sales
SELECTfact.[OrderQuantity],fact.[UnitPrice],fact.[ExtendedAmount],fact.[DiscountAmount],fact.[ProductStandardCost],fact.[TotalProductCost],fact.[SalesAmount] AS SalesAmountLCY,ROUND(crat.AverageRate*fact.SalesAmount,2) AS SalesAmountUSD,fact.[TaxAmt],fact.[Freight],fact.[OrderDate],YEAR(CURRENT_TIMESTAMP)-YEAR(cust.BirthDate) AS CustomerAge,cter.City AS CustomerCity,cter.EnglishCountryRegionName AS CustomerCountry,cust.Gender AS CustomerGender,cust.HouseOwnerFlag AS CustomerIsHouseOwner,cust.MaritalStatus AS CustomerIsMarried
,cust.TotalChildren AS CustomerNumberOfChildren,cust.YearlyIncome AS CustomerYearlyIncome,cust.CommuteDistance AS CustomerCommuteDistance,cust.NumberCarsOwned AS CustomerNumberOfCars,cust.NumberChildrenAtHome AS CustomerNumberOfChildrenAtHome
FROM [AdventureWorksDW2012].[dbo].[FactInternetSales] AS factJOIN [AdventureWorksDW2012].[dbo].DimCustomer AS custON cust.CustomerKey = fact.CustomerKeyJOIN [AdventureWorksDW2012].[dbo].DimSalesTerritory AS sterON ster.SalesTerritoryKey = fact.SalesTerritoryKeyJOIN [AdventureWorksDW2012].[dbo].DimGeography AS cterON cust.GeographyKey = cter.GeographyKeyJOIN AdventureWorksDW2012.dbo.FactCurrencyRate AS cratON crat.DateKey = fact.OrderDateKeyAND crat.CurrencyKey = fact.CurrencyKey
http://tinyurl.com/j4edktw
CODE EXAMPLE
Connecting to an ODBC Data Source
• ODBC = Open Database Connectivity
• Create a DSN• Control panel• Administrative tools• ODBC data sources• Add
• Use RODBC package• Use sqlFetch method
library(RODBC)
odcbChannel <- odbcConnect("Local Host")
strSQLQuery <- "SELECT * FROM ..."
datMyData <- sqlQuery(odcbChannel, strSQLQuery)
# This is where you can put your code
odbcClose(odcbChannel)
# Alternatively: Directly connect with connection string
odcbChannel <- odbcDriverConnect("driver={SQL Server};server=c437-sql-12;database=AdventureWorksDW2012;trusted_connection=true")
https://andersspur.wordpress.com/2013/11/26/connect-r-to-sql-server-2012-and-14/
ggplot Package
• Package for Visualization of various sorts of data
• CRAN Documentation: https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
Visualisations that might be helpful
• Bar Chart• One variable plot• One variable plot with category
• Scatter Plot• Two variables scatter• Three variables scatter
• Treemap Plot• Corrplot• Map Plot
CODE EXAMPLE
Scatterplot (2 Variables)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$carat, y=df$price)) + geom_point()
CODE EXAMPLE
Scatterplot (2 Variables, Transformed)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=log(df$carat), y=log(df$price))) +geom_point()
CODE EXAMPLE
Scatterplot (3 Variables)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut))+ geom_point()
CODE EXAMPLE
Scatterplot (4 Variables)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut, color=df$color)) + geom_point()
CODE EXAMPLE
Scatterplot (5 Variables)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut, color=df$color, size=df$depth)) + geom_point()
CODE EXAMPLE
Scatterplot (6 Variables)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut, color=df$color, size=df$depth)) + geom_point() +facet_grid(. ~ clarity)
# Attention: Do not use the dataframe prefix in facet_grid
CODE EXAMPLE
Scatterplot (7 Variables)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut, color=df$color, size=df$depth)) + geom_point() +facet_grid(clarity ~ clarity)
# Attention: Do not use the dataframe prefix in facet_grid
Summary: ggplot2 Aestetics Parameters
• x … Distance of a point on the X-Axis• y … Distance of a point on the Y-Axis• shape … Plot character that is used for a point• color … Plot colour that is used for a point• size … Plot size that is used for a point
CODE EXAMPLE
Smoothed Curve Plot
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=data$carat, y=data$price)) +geom_smooth()
# Info: Can also be used with additional aesparameters
CODE EXAMPLE
Smoothed Curve Plot with Points
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=data$carat, y=data$price)) +geom_point() + geom_smooth()
# Info: Can also be used with additional aesparameters
# Attention: Order of Geoms makes a difference
CODE EXAMPLE
Histogram Plot
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$carat)) + geom_histogram()
# Info: Can also be used with additional aesparameters
CODE EXAMPLE
Density Plot
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$carat)) + geom_density()
# Info: Can also be used with additional aesparameters
CODE EXAMPLE
Bar Chart (1 Variable)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$color)) + geom_bar()
CODE EXAMPLE
Bar Chart (2 Variables)
library(diamonds, package = "ggplot2")
df <- diamonds
ggplot(df, aes(x=df$color, color=df$cut)) + geom_bar()
CODE EXAMPLE
Line Plot
library(diamonds, package = "ggplot2")
df <- diamond
ggplot(df, aes(x=df$carat, y=df$price)) +geom_line(color="blue")
CODE EXAMPLE
Correlation Matrix (Corrplot)
library(diamonds, package = "ggplot2")
df <- diamond
# Correlate all numeric variables
corMatrix <- cor(df[,unlist(lapply(df, is.numeric))])
# Transform cross table
CorrList <- melt(corMatrix)
CorrList$value <- round(CorrList$value,digits=2)
# Plot Correlation Matrix
ggplot(CorrList, aes(X1, X2, fill=value)) + geom_tile() +geom_text(aes(label=CorrList$value), colour="white")
# Alternatively, you can use the corrplot package with a lot more additional features
ggmap Package
• Package for Visualization of spatial data
• CRAN Documentation: https://cran.r-project.org/web/packages/ggmap/ggmap.pdf
CODE EXAMPLE
Plot of Map Object
library(ggmap)
dfSpatial <-read.csv("http://www.innsbruck.gv.at/data.cfm?vpath=diverse/ogd/gis/universitaetcsv", sep=";", header = TRUE, dec =",")
map <- get_map("Innsbruck", zoom = 14, maptype ="roadmap")
ggmap(map)
CODE EXAMPLE
Spatial Plot of Locations
library(ggmap)
dfSpatial <-read.csv("http://www.innsbruck.gv.at/data.cfm?vpath=diverse/ogd/gis/universitaetcsv", sep=";", header = TRUE, dec =",")
map <- get_map("Innsbruck", zoom = 12, maptype="roadmap")
ggmap(map) + geom_point(data=dfSpatial, aes(x=Lon, y=Lat), size=5, shape=10)
CODE EXAMPLE
Spatial Plot with Size Variable
library(ggmap)
dfCity <-read.csv("https://dl.dropboxusercontent.com/u/6141049/Data%20Processing%20with%20R/CityLongLat.txt", sep=";", header=TRUE, stringsAsFactors=FALSE)
dfCitySmall <- subset(dfCity, ISO.2=="AT")
dfCitySmall$Size <-dfCitySmall$Population/max(dfCitySmall$Population)*10
map <- get_map("Austria", zoom = 6, maptype="roadmap“, color=“bw")
ggmap(map) + geom_point(data=dfCitySmall, aes(x=Longitude, y=Latitude), size = dfCitySmall$Size, color="red")
treemap Package
• Visualize multi-category data with quantities
• CRAN Documentation: https://cran.r-project.org/web/packages/treemap/treemap.pdf
CODE EXAMPLE
Treemap Plot
library(treemap)
dfCity <-read.csv("https://dl.dropboxusercontent.com/u/6141049/Data%20Processing%20with%20R/CityLongLat.txt", sep=";", header=TRUE, stringsAsFactors=FALSE)
dfCitySmall <- subset(dfCity, ISO.2=="AT")
treemap(dfCitySmall, index=c("Province"), vSize="Population", type="index")