Rattle Graphical Interface for R Language

INTRODUCTION TO R AND RATTLE 1 IAUSHIRAZ 05/14/2022

Upload: majid-abdollahi

Post on 16-Apr-2017





Page 1: Rattle Graphical Interface for R Language

05/02/2023 IAUSHIRAZ 1

INTRODUCTION TO R AND RATTLE

Page 2: Rattle Graphical Interface for R Language

What is R?

Statistical programming language
  Used among statisticians and data miners for developing statistical software and for data analysis.

Free and open source

Written in C, Fortran and R

Statistical features
  Linear and nonlinear modeling
  Statistical tests
  Classification, clustering

R objects can be manipulated from C, C++, Java, .NET or Python code.

Page 3: Rattle Graphical Interface for R Language

Source Example

> x <- c(1,2,3,4,5,6)   # Create ordered collection (vector)
> y <- x^2              # Square the elements of x
> print(y)              # Print (vector) y
[1]  1  4  9 16 25 36
> mean(y)               # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y)                # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x)     # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
                        # and store the results as lm_1
> print(lm_1)           # Print the model from the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
     -9.333        7.000

> summary(lm_1)         # Compute and print statistics for the fit
                        # of the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5       6
 3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3333     2.8441  -3.282 0.030453 *
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared:  0.9583, Adjusted R-squared:  0.9478
F-statistic: 91.88 on 1 and 4 DF,  p-value: 0.000662

> par(mfrow=c(2, 2))    # Request 2x2 plot layout
> plot(lm_1)            # Diagnostic plot of regression model

Page 4: Rattle Graphical Interface for R Language

Graphical front-ends

Architect – cross-platform open source IDE based on Eclipse and StatET.
DataJoy – online R editor focused on beginners to data science and on collaboration.
Deducer – GUI for menu-driven data analysis (similar to SPSS/JMP/Minitab).
Java GUI for R – cross-platform stand-alone R terminal and editor based on Java (also known as JGR).
Number Analytics – GUI for R-based business analytics (similar to SPSS), working in the cloud.
Rattle GUI – cross-platform GUI based on RGtk2 and specifically designed for data mining.
R Commander – cross-platform menu-driven GUI based on tcltk (several plug-ins to Rcmdr are also available).
Revolution R Productivity Environment (RPE) – Visual Studio-based IDE provided by Revolution Analytics, which has plans for a web-based point-and-click interface.
RGUI – comes with the pre-compiled version of R for Microsoft Windows.
RKWard – extensible GUI and IDE for R.
RStudio – cross-platform open source IDE (which can also be run on a remote Linux server).

Page 5: Rattle Graphical Interface for R Language

What is Rattle?

R graphical user interface package

Developed by Graham Williams of Togaware Pty Ltd.

Free and open source

Presents statistical and visual summaries of data

Tabs:
  Load Data
  Data Exploration
  Model
  Evaluation
  Test
  …

Page 6: Rattle Graphical Interface for R Language

Rattle Installation Process

Download and install R
  https://r-project.org
  About 60MB

Download the Rattle package
  About 300MB
  Follow the instructions (R is case-sensitive, so the function names must be lowercase):

  install.packages("rattle", dependencies=c("Depends", "Suggests"))
  library(rattle)
  rattle()

Page 7: Rattle Graphical Interface for R Language

Load Data

Dataset types:
  CSV File (CSV, TXT, Excel)
  ARFF (a CSV-like file which adds type information)
  ODBC (MySQL, SQLite, SQL Server, …)
    Set connections in: /etc/odbcinst.ini and /etc/odbc.ini
  R Dataset (existing datasets in the current R session)
  R Data File
  Library (pre-existing datasets)
  Corpus (collection of documents)
  Script (scripts for generating datasets)
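As a rough illustration of the CSV option, the sketch below writes a small made-up table to a temporary file and reads it back with base R's read.csv(). The file contents and column names are invented for the example, and Rattle's own loading code may differ.

```r
# Hypothetical example data; the column names are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID_customer,age,income,TARGET_buy",
             "1,34,52000,yes",
             "2,45,NA,no",
             "3,29,61000,yes"), csv_path)

# Read the file into a data frame; strings become factors so that
# categorical variables are recognised as such.
ds <- read.csv(csv_path, stringsAsFactors = TRUE)
str(ds)
```

The literal text NA in the file is read as a missing value, which is what the Transform and Explore steps later operate on.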

Page 8: Rattle Graphical Interface for R Language

Load Data

Variable types:
  Input (most variables are inputs)
    Used to predict the target variable
  Target (influenced by the input variables)
    Also known as the output
    Prefix: TARGET_
  Risk (a measure of the size of the target)
    Prefix: RISK_
  Identifier (any numeric variable that has a unique value for each observation; not normally used in modeling)
    Such as: ID, Date
    Prefix: ID_
  Ignore (excluded from modeling)
    Prefix: IGNORE_
  Weight (observation weights given by an R formula)
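The prefix convention above can be sketched in a few lines of base R. This is only an illustration of the naming rule: the variable names and the helper role_of() are hypothetical, and it is not Rattle's actual implementation.

```r
# Hypothetical variable names; only the prefixes come from the slide.
vars <- c("ID_customer", "age", "income", "RISK_amount",
          "IGNORE_notes", "TARGET_buy")

# Map each name to a role by its prefix; anything unmatched is an input.
role_of <- function(name) {
  if (startsWith(name, "TARGET_")) "target"
  else if (startsWith(name, "RISK_")) "risk"
  else if (startsWith(name, "ID_")) "ident"
  else if (startsWith(name, "IGNORE_")) "ignore"
  else "input"
}
roles <- vapply(vars, role_of, character(1))
print(roles)
```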

Page 9: Rattle Graphical Interface for R Language

Transform

Rescale
  Normalize: Recenter, Scale [0-1], Median/MAD, Natural Log, Log 10, Matrix
  Order: Rank, Interval (Number of Groups)
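The rescaling options can be approximated in base R as follows. This is a minimal sketch on made-up data; the exact formulas Rattle applies for each option (in particular Interval) are assumed here, not taken from the slides.

```r
x <- c(2, 4, 4, 6, 10, 40)                       # made-up numeric variable

recentered <- (x - mean(x)) / sd(x)              # Recenter: mean 0, standard deviation 1
scaled01   <- (x - min(x)) / (max(x) - min(x))   # Scale [0-1]
robust     <- (x - median(x)) / mad(x)           # Median/MAD: less sensitive to outliers
logged     <- log(x)                             # Natural Log (use log10(x) for Log 10)
ranked     <- rank(x)                            # Rank: ties receive the average rank

# Interval (assumed meaning): cut into a chosen number of equal-width groups.
groups     <- cut(x, breaks = 3, labels = FALSE)
```

Note how the single large value 40 dominates the [0-1] scaling, while the rank transform is unaffected by its size.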

Page 10: Rattle Graphical Interface for R Language

Transform

Impute (missing values)
  Zero
  Mean
  Median
  Mode
  Constant

Recode
  Quantiles
  K-Means
  Equal Width
  Indicator Variable / Join Categories
  As Categorical / As Numeric
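A small base R sketch of median imputation followed by two of the binning options and indicator variables. The data are made up, and the choice of three bins and quartile boundaries is an assumption for the example.

```r
age <- c(23, 35, NA, 41, 58, NA, 30)   # made-up data with missing values

# Impute: replace missing values with the median of the observed values.
age_imputed <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Recode (equal width): bin into 3 intervals of equal width.
age_equal_width <- cut(age_imputed, breaks = 3)

# Recode (quantiles): bin using quartile boundaries.
age_quantiles <- cut(age_imputed,
                     breaks = quantile(age_imputed, probs = seq(0, 1, 0.25)),
                     include.lowest = TRUE)

# Indicator variables: one 0/1 column per category.
indicators <- model.matrix(~ age_equal_width - 1)
```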

Page 11: Rattle Graphical Interface for R Language

Transform

Cleanup
  Delete Ignored
  Delete Selected
  Delete Missing
  Delete Observations with Missing
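The two missing-value cleanup options can be sketched in base R as below. The interpretation is an assumption: Delete Missing is taken to drop variables that contain missing values, and Delete Observations with Missing to drop rows that contain them. The data frame is made up.

```r
ds <- data.frame(a = c(1, 2, NA, 4),
                 b = c("x", NA, "z", "w"),
                 c = c(NA, NA, NA, NA),
                 d = c(10, 20, 30, 40))   # made-up data

# Delete Missing (as interpreted here): drop variables with any missing value.
no_missing_vars <- ds[, colSums(is.na(ds)) == 0, drop = FALSE]

# Delete Observations with Missing: drop rows with a missing value,
# after first removing the all-missing column c so that some rows survive.
ds2 <- ds[, colSums(is.na(ds)) < nrow(ds), drop = FALSE]
complete_rows <- ds2[complete.cases(ds2), ]
```

Dropping a fully missing column before dropping incomplete rows matters: otherwise every row would be removed.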

Page 12: Rattle Graphical Interface for R Language

Exploration

Summary
  Summary: min, max, mean and quartile values.
  Describe: missing, unique, sum, mean, lowest and highest values.
  Basics (numeric variables): measures of numeric data (missing, min, max, quartiles, mean, sum, skewness, kurtosis).
  Kurtosis (numeric variables): a larger value indicates a sharper peak; a lower value indicates a flatter peak.
  Skewness (numeric variables): a positive skew indicates that the right tail is longer; a negative skew indicates that the left tail is longer.
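To make the two shape measures concrete, here is a sketch using the plain moment-based definitions in base R. The data are made up, and packages may use slightly different conventions (for example, reporting excess kurtosis, which subtracts 3).

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 15)   # made-up, right-skewed data

m  <- mean(x)
s2 <- mean((x - m)^2)                   # population (divide-by-n) variance

skewness <- mean((x - m)^3) / s2^1.5    # > 0: longer right tail
kurtosis <- mean((x - m)^4) / s2^2      # equals 3 for a normal distribution

summary(x)                              # min, quartiles, mean, max
```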

Page 13: Rattle Graphical Interface for R Language

Exploration

Summary
  Show Missing: each row corresponds to a pattern of missing values, which can help in understanding why the data is missing. Rows and columns are sorted in ascending order of missing data.
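A missing-value pattern table of this kind can be approximated in base R. This is a simplified sketch on made-up data and is not the routine Rattle itself calls.

```r
ds <- data.frame(age    = c(23, NA, 41, 58, NA),
                 income = c(50, 60, NA, 70, NA),
                 city   = c("A", "B", "C", "D", "E"))   # made-up data

# 1 = observed, 0 = missing, one column per variable.
observed <- (!is.na(ds)) * 1

# Collapse each row into a pattern string and count identical patterns.
pattern <- apply(observed, 1, paste, collapse = "")
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)

# Number of missing values per variable, in ascending order.
missing_per_var <- sort(colSums(is.na(ds)))
print(missing_per_var)
```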

Page 14: Rattle Graphical Interface for R Language

Exploration

Distributions (review the distribution of each variable in the dataset)
  Annotate (include numeric values in plots)
  Group By
  Numeric outputs (for any number of continuous variables): Box Plot, Histogram, Cumulative, Benford, Pairs
  Categorical outputs: Bar Plot, Dot Plot, Mosaic, Pairs
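The Benford option compares the leading digits of a variable with Benford's law, under which digit d is expected with proportion log10(1 + 1/d). A base R sketch, using powers of 2 as made-up data that are known to follow the law closely:

```r
# Expected first-digit proportions under Benford's law.
digits   <- 1:9
expected <- log10(1 + 1 / digits)

# Made-up positive values.
values <- 2^(1:200)

# Leading digit: scale each value into [1, 10) and truncate.
first_digit <- floor(values / 10^floor(log10(values)))
observed <- as.numeric(table(factor(first_digit, levels = digits))) / length(values)

round(rbind(expected, observed), 3)
```

A variable whose observed proportions depart strongly from the expected row may deserve a closer look, although many legitimate variables do not follow the law.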

Page 15: Rattle Graphical Interface for R Language

Exploration

Correlations (Rattle currently computes correlations between numeric variables only)
  Ordered: order by strength of correlation
  Explore Missing: correlation between missing values
  Hierarchical
  Method: Pearson, Kendall, Spearman

Principal Components
  SVD (numeric variables only)
  Eigen
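The correlation methods and the two principal components variants map onto standard base R functions. The sketch below uses the built-in iris data; the distance used for the hierarchical view (1 minus absolute correlation) is an assumption for illustration.

```r
num <- iris[, 1:4]                       # numeric columns of a built-in dataset

# Correlation matrices under the three methods named on the slide.
cor_pearson  <- cor(num, method = "pearson")
cor_kendall  <- cor(num, method = "kendall")
cor_spearman <- cor(num, method = "spearman")

# Hierarchical view: cluster variables using 1 - |correlation| as distance.
var_tree <- hclust(as.dist(1 - abs(cor_pearson)))

# Principal components two ways: SVD (prcomp) and eigen decomposition (princomp).
pc_svd   <- prcomp(num, scale. = TRUE)
pc_eigen <- princomp(num, cor = TRUE)
```

On standardised data both approaches yield the same component standard deviations; the SVD route is generally preferred for numerical accuracy.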

Page 16: Rattle Graphical Interface for R Language

Model

Tree
  Traditional: trade-off between performance and simplicity of explanation
  Conditional

Forest (many decision trees built using random subsets of the data and variables)
  Number of Trees
  Number of Variables
  Impute (set the median numeric value for missing values)
  Sample Size (for balancing classes)
  Importance (variable importance)
  Rules (collection of random forest rules)
  ROC (ROC curve)
  Errors
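A traditional decision tree can be sketched with the rpart package, which ships with standard R installations. The dataset (iris) and the cp value are choices made for this example only.

```r
library(rpart)   # recommended package included with standard R installations

# Fit a classification tree predicting Species from the other iris columns.
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)

# The complexity parameter (cp) controls the performance/simplicity trade-off:
# a larger cp prunes more aggressively and yields a smaller tree.
small <- prune(fit, cp = 0.45)

# Accuracy on the training data (an optimistic estimate).
pred <- predict(fit, iris, type = "class")
train_accuracy <- mean(pred == iris$Species)
```

Accuracy measured on the training data overstates real performance, which is why a separate evaluation step on held-out data is needed.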

Page 17: Rattle Graphical Interface for R Language

Model

SVM
  Start with two parallel vectors

Linear (linear regression)
  For continuous values

All

Page 18: Rattle Graphical Interface for R Language

Cluster

K-Means
  The number of clusters (K) must be set first

EwKm
  K-Means with entropy weighting

Hierarchical
  The number of clusters does not need to be set first

BiCluster
  Finds suitable subsets of both the variables and the observations
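The difference between K-Means and hierarchical clustering can be seen in a short base R sketch. The dataset (iris), K = 3, the seed and the Ward linkage are all choices made for this example.

```r
num <- scale(iris[, 1:4])        # standardise the numeric columns

# K-Means: the number of clusters K must be chosen up front (here K = 3).
set.seed(42)                     # k-means starts from random centres
km <- kmeans(num, centers = 3, nstart = 10)
table(km$cluster)

# Hierarchical: build the full tree first, then cut it at any number of clusters.
hc <- hclust(dist(num), method = "ward.D2")
hc_clusters <- cutree(hc, k = 3)
table(hc_clusters)
```

Because the hierarchical tree is built once, cutree() can be called again with a different k without refitting, whereas kmeans() must be rerun for each K.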