UNIVERSITY OF CALIFORNIA, SAN DIEGO
SAN DIEGO SUPERCOMPUTER CENTER

Introduction Predictive Analytics Tools: Weka, R
Predictive Analytics Center of Excellence
San Diego Supercomputer Center, University of California, San Diego

Upload: ngotruc

Post on 18-Mar-2018


Available Data Mining Tools

COTS:
•  IBM Intelligent Miner
•  SAS Enterprise Miner
•  Oracle ODM
•  Microstrategy
•  Microsoft DBMiner
•  Pentaho
•  Matlab
•  Teradata

Open Source:
•  WEKA
•  KNIME
•  Orange
•  RapidMiner
•  NLTK
•  R
•  Rattle


Agenda
•  WEKA
   •  Intro and background
   •  Data Preparation
   •  Creating Models / Applying Algorithms
   •  Evaluating Results
•  R
   •  R Background
   •  R Basics
      •  Outline
      •  R-Studio Overview
•  Hands On (homework)


WEKA


Download and Install WEKA
•  Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html

5 7/1/14


What is WEKA?
•  Waikato Environment for Knowledge Analysis
   •  WEKA is a data mining/machine learning application developed by the Department of Computer Science, University of Waikato, New Zealand
   •  WEKA is open source software written in Java
   •  WEKA is a collection of machine learning algorithms and tools for data mining tasks
      •  data pre-processing, classification, regression, clustering, association, and visualization
   •  WEKA is well suited for developing new machine learning schemes
•  The weka is also a bird found only in New Zealand


Advantages of Weka
•  Free availability
   •  under the GNU General Public License
•  Portability
   •  fully implemented in the Java programming language and thus runs on almost any modern computing platform
   •  Windows, Mac OS X and Linux
•  Comprehensive collection of data preprocessing and modeling techniques
   •  Supports standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection
•  Easy to use GUI
•  Provides access to SQL databases
   •  uses Java Database Connectivity (JDBC) and can process the result returned by a database query


Disadvantages
•  Sequence modeling is not covered by the algorithms included in the Weka distribution
•  Not capable of multi-relational data mining


WEKA Walk Through: Main GUI
•  Three graphical user interfaces
   •  “The Explorer” (exploratory data analysis)
      •  pre-process data
      •  build “classifiers”
      •  cluster data
      •  find associations
      •  attribute selection
      •  data visualization
   •  “The Experimenter” (experimental environment)
      •  used to compare performance of different learning schemes
   •  “The KnowledgeFlow” (new process-model-inspired interface)
      •  Java-Beans-based interface for setting up and running machine learning experiments
•  Command line interface (“Simple CLI”)

More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html


WEKA: Explorer: Preprocess
•  Importing data
   •  Data format
      •  Uses flat text files to describe the data
   •  Data can be imported from a file in various formats:
      •  ARFF, CSV, C4.5, binary
   •  Data can also be read from a URL or from an SQL database (using JDBC)


WEKA: ARFF file format

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

A more thorough description is available here: http://www.cs.waikato.ac.nz/~ml/weka/arff.html
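To make the layout of the ARFF example concrete, here is a minimal base-R sketch that reads the rows shown on the slide into a data frame. It is not a full ARFF parser and is not how WEKA itself loads the file. The variable names (`arff_text`, `attr_names`, `heart`) are assumptions for illustration, and the elided "..." row is left out.

```r
# The ARFF text from the slide (the trailing "..." row is omitted)
arff_text <- c(
  "@relation heart-disease-simplified",
  "@attribute age numeric",
  "@attribute sex { female, male}",
  "@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}",
  "@attribute cholesterol numeric",
  "@attribute exercise_induced_angina { no, yes}",
  "@attribute class { present, not_present}",
  "@data",
  "63,male,typ_angina,233,no,not_present",
  "67,male,asympt,286,yes,present",
  "67,male,asympt,229,yes,present",
  "38,female,non_anginal,?,no,not_present"
)

# The attribute name is the second whitespace-separated token of each @attribute line
attr_lines <- grep("^@attribute", arff_text, value = TRUE)
attr_names <- sapply(strsplit(attr_lines, "[[:space:]]+"), `[`, 2)

# Everything after the @data marker is comma-separated rows; "?" marks a missing value
data_start <- which(arff_text == "@data") + 1
heart <- read.csv(text = arff_text[data_start:length(arff_text)],
                  header = FALSE, col.names = attr_names,
                  na.strings = "?", stringsAsFactors = TRUE)
str(heart)
```

The `?` in the last row becomes `NA`, which is how R represents a missing value.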


University of Waikato 7/1/14


WEKA: Explorer: Preprocess
•  Preprocessing data
   •  Visualization
   •  Filtering algorithms
      •  filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria
   •  Removing Noisy Data
   •  Adding Additional Attributes
   •  Remove Attributes
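The filtering ideas above (discretizing a numeric attribute, deleting instances by a criterion, removing an attribute) can be sketched in base R. This is only an analogy to what the WEKA filters do, not WEKA code. The `patients` data and the age break points are hypothetical.

```r
# Hypothetical toy data loosely modelled on the heart-disease example
patients <- data.frame(
  age = c(63, 67, 67, 38),
  cholesterol = c(233, 286, 229, NA),
  sex = c("male", "male", "male", "female")
)

# Discretization: turn the numeric age attribute into a nominal one
# (break points are arbitrary, chosen only for illustration)
patients$age_group <- cut(patients$age, breaks = c(0, 40, 65, Inf),
                          labels = c("young", "middle", "older"))

# Delete instances by a criterion: drop rows with a missing cholesterol value
complete <- patients[!is.na(patients$cholesterol), ]

# Remove an attribute: drop the original numeric age column
complete$age <- NULL
```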


WEKA: Explorer: Preprocess
•  Used to define filters to transform data
•  WEKA contains filters for:
   •  Discretization, normalization, resampling, attribute selection, transforming, combining attributes, etc.
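Two of the filter types named above, normalization and resampling, can be illustrated in a few lines of base R. This sketch assumes min-max scaling and bootstrap resampling; the WEKA filters have their own options and defaults. The values are made up.

```r
set.seed(42)  # make the random resample repeatable
cholesterol <- c(233, 286, 229, 250, 204)  # hypothetical values

# Min-max normalization: rescale values into the range [0, 1]
value_range <- max(cholesterol) - min(cholesterol)
normalized <- (cholesterol - min(cholesterol)) / value_range

# Resampling: draw a bootstrap sample (with replacement) of the same size
resampled <- sample(cholesterol, size = length(cholesterol), replace = TRUE)
```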


Explorer: Visualize
•  Visualization is very useful in practice
   •  helps determine the difficulty of the learning problem
•  WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
•  Color-coded class values
•  “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
•  “Zoom-in” function


Explorer: Attribute Selection
•  Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
•  Attribute selection methods contain two parts:
   •  A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
   •  An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
•  Very flexible: WEKA allows (almost) arbitrary combinations of these two
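As a rough illustration of pairing an evaluation method with a search method, the sketch below scores nominal attributes by information gain and then ranks them. It is a hand-written base-R example, not WEKA's implementation. The `toy` data and the helper names `entropy` and `info_gain` are assumptions.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a nominal attribute with respect to the class
info_gain <- function(attribute, class) {
  weights <- table(attribute) / length(attribute)
  conditional <- sapply(names(weights),
                        function(v) entropy(class[attribute == v]))
  entropy(class) - sum(weights * conditional)
}

# Hypothetical toy data
toy <- data.frame(
  angina = c("yes", "yes", "no", "no", "yes", "no"),
  sex    = c("male", "female", "male", "female", "male", "female"),
  class  = c("present", "present", "not_present",
             "not_present", "present", "not_present")
)

# "Ranking" search: score every attribute, then sort by score
scores <- sapply(c("angina", "sex"),
                 function(a) info_gain(toy[[a]], toy$class))
ranked <- sort(scores, decreasing = TRUE)
print(ranked)
```

In this toy data `angina` determines the class completely, so it ranks above `sex`.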


WEKA: Explorer: building “classifiers”
•  Classifiers in WEKA are models for predicting nominal or numeric quantities
•  Implemented learning schemes include:
   •  Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
•  “Meta”-classifiers include:
   •  Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …


WEKA: Explorer: building “Cluster”
•  WEKA contains “clusterers” for finding groups of similar instances in a dataset
•  Implemented schemes are:
   •  k-Means, EM, Cobweb, X-means, FarthestFirst
•  Clusters can be visualized and compared to “true” clusters (if given)
•  Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
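Comparing found clusters with "true" clusters can be sketched with the `kmeans()` function and the `iris` data that ship with R. This is an R analogy, not the WEKA clusterer. The choice of three centres and `nstart = 10` is an assumption made for the example.

```r
set.seed(1)  # k-means starts from random centres, so fix the seed

# Cluster on the four numeric measurements only
features <- iris[, 1:4]
fit <- kmeans(features, centers = 3, nstart = 10)

# Cross-tabulate cluster assignments against the known species
comparison <- table(cluster = fit$cluster, species = iris$Species)
print(comparison)
```

Each row of the table is a cluster and each column a known class, so a clean clustering shows one dominant count per row.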


Explorer: Finding associations
•  WEKA contains an implementation of the Apriori algorithm for learning association rules
   •  Works only with discrete data
•  Can identify statistical dependencies between groups of attributes:
   •  milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000)
•  Apriori can compute all rules that have a given minimum support and exceed a given confidence
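The two numbers attached to a rule, support and confidence, can be computed by hand for a single rule. The sketch below does this in base R for a hypothetical set of five baskets. It does not run Apriori itself, and support is counted as a number of baskets, as on the slide.

```r
# Hypothetical transactions: one row per basket, TRUE if the item is present
baskets <- data.frame(
  milk   = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  butter = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  bread  = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Rule under test: milk, butter => bread
antecedent <- baskets$milk & baskets$butter
both <- antecedent & baskets$bread

# Support: number of baskets that contain every item in the rule
support <- sum(both)

# Confidence: of the baskets matching the left-hand side, the share that also match the right
confidence <- sum(both) / sum(antecedent)
```

Here three baskets contain milk and butter and two of those also contain bread, so the support is 2 and the confidence is 2/3.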


References and Resources
•  References:
   •  WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
   •  WEKA Tutorial:
      •  Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.
      •  A presentation which explains how to use Weka for exploratory data mining.
   •  WEKA Data Mining Book:
      •  Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
   •  WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page


R Environment: R Studio


Downloading R / R Studio
•  http://www.r-project.org/
•  http://www.rstudio.com/ide/download/


What is R?
•  An Environment
   •  R is an integrated suite of software facilities for data manipulation, calculation, and graphical display
      •  Effective data handling and storage
      •  Suite of operators for calculations on arrays
      •  Large, coherent, integrated collection of intermediate tools for data analysis
      •  Programming language, run time environment
•  Based on the S language, which was developed at Bell Labs
•  GNU open source software
   •  Under the terms of the Free Software Foundation's GNU General Public License
•  Open source implementation of the S language (also used by S-Plus)
   •  Well-developed, simple and effective programming language
•  Highly extensible


R Features
•  Software package designed for data analysis and graphical representation
•  Interactive, but may also be used programmatically
•  Platform independence
   •  Compiles and runs on a wide variety of platforms: Unix-based systems, Windows and MacOS
•  Free, open source code
•  Engaged community
   •  over 4,200 user-contributed packages
•  Extendable
   •  User defined functions
   •  > 4000 packages available in the CRAN package repository
   •  Supports extensions / add-ons (e.g., rApache)
   •  Compatible with other languages (e.g., SQL, Perl, C)
   •  Data Import
      •  Pre-processing data from different sources
•  Scalability
   •  Parallel R packages


R packages for DM
•  Clustering
•  Classification
•  Association Rules
•  Sequential patterns
•  Time Series
•  Statistics
•  Graphics
•  Data manipulation


Data Mining
•  linear models (lm)
•  generalized linear models (glm)
•  generalized additive models (gam)
•  linear mixed effects models (lme)
•  quantile regression (rq)
•  vector generalized additive models (vgam)
•  lasso, ridge, and elastic net models (glmnet)
•  non-linear models (nlm)
•  non-linear mixed effects models (nlmer)
•  linear discriminant analysis (lda)
•  quadratic discriminant analysis (qda)
•  trees (tree)
•  random forests (randomForest)
•  support vector machines (svm)
•  neural networks (nnet)
•  k-nearest neighbors (knn)
•  kmeans
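The first two entries in the list, `lm` and `glm`, are part of the standard R installation, so they make a convenient first example. The sketch below fits both on the built-in `mtcars` data. The choice of variables and the 0.5 threshold are assumptions for illustration, and the accuracy is measured on the training data only.

```r
# Linear model: fuel economy as a function of weight (built-in mtcars data)
linear_fit <- lm(mpg ~ wt, data = mtcars)

# Generalized linear model: logistic regression for transmission type (am is 0/1)
logistic_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probabilities, thresholded at 0.5 to give a class label
probabilities <- predict(logistic_fit, type = "response")
predicted <- as.integer(probabilities > 0.5)

# Share of cars classified correctly (training-set accuracy only)
accuracy <- mean(predicted == mtcars$am)

summary(linear_fit)
```

Most of the other model functions in the list follow the same formula-and-data calling pattern, although they live in add-on packages that must be installed first.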


Big Data Options
•  lapply-based parallelism
   •  multicore library
   •  snow library
•  foreach-based parallelism
   •  doMC backend
   •  doSNOW backend
   •  doMPI backend
•  Map/Reduce- (Hadoop-) based parallelism
   •  Hadoop streaming with R mappers/reducers
   •  RHadoop (rmr, rhdfs, rhbase)
   •  RHIPE
•  Poor-man's Parallelism
   •  lots of Rs running
   •  lots of input files
•  Hands-off Parallelism
   •  OpenMP support compiled into R build
   •  Dangerous!
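A minimal sketch of lapply-based parallelism follows. It uses the `parallel` package bundled with current R releases rather than the multicore and snow libraries named on the slide, so treat it as a swapped-in equivalent. The worker count of 2 and the `slow_square` function are assumptions.

```r
library(parallel)  # bundled with R; used here in place of multicore/snow

# Stand-in for an expensive computation
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Sequential baseline
sequential <- lapply(1:4, slow_square)

# Same call shape, spread over two worker processes
cluster <- makeCluster(2)
parallel_result <- parLapply(cluster, 1:4, slow_square)
stopCluster(cluster)  # always release the workers

identical(sequential, parallel_result)
```

The parallel call keeps the shape of `lapply`, which is why this style is often the easiest first step from sequential code.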


R Considerations/Limitations
•  Command Line Interface
•  Performance
•  Memory Limits
   •  memory limits depend on the build (32-bit vs. 64-bit)
   •  32-bit build of R on Windows is dependent on the underlying OS version
•  Syntax “curiosities”
•  Learning curve


R-Studio Overview
•  http://www.rstudio.com/ide/download/
•  R-Studio is an integrated development environment to support R code.
•  R-Studio runs in two ways:
   •  Desktop version for Linux, Mac, Windows: single user, perfect for a laptop or desktop machine
   •  Server version for Linux: allows a number of remote users to run R-Studio within a web browser, and facilitates sharing of code and data among team members


•  General View of R-Studio
   •  Editor Window
   •  Console: run R commands
   •  Project Window: currently loaded workspace and history
   •  “Pop-up” multi-tab display: shows graphics, current directory and loaded packages


The Fundamentals
•  Launch R
•  Quit R
   •  q()
•  Getting Help
   •  help(package_name) or ?package_name or help.start()
   •  example(package_name)
   •  ??keyword
   •  library(help="package_name")


The Basics
•  R environmental commands
   •  list objects
      •  ls()
      •  objects()
   •  list files in current directory
      •  list.files()
   •  show current directory
      •  getwd()
   •  set working directory
      •  setwd()
   •  remove objects
      •  rm()
•  Workspace versus console
   •  Clear workspace
      •  rm(list=ls())
   •  Clear console
      •  Ctrl+L


The Basics (Naming Variables)
•  Requirements
   •  Case sensitive; names must start with a letter or '.'
   •  Only letters, numbers, underscores and '.'s
•  Special keywords
   •  break, else, FALSE, for, function, if, Inf, NA, NaN, next, repeat, return, TRUE, while
•  Names not limited in length


The Basics
•  All entities in R are called “objects”
   •  arrays, vectors, matrices, functions, lists, data frames, factors
•  Expressions vs. assignments
   •  10+10
   •  my.age <- 23
   •  my.age < - 23 (note the added space: this compares my.age with -23 instead of assigning)
   •  age <- c(my.age, 14, 59, 32)
   •  my.age == 40
•  Data Types
   •  Numeric, Integer, Complex, Logical, Character
•  Function call
   > mean(weight)
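The difference between an expression and an assignment, including the stray-space pitfall, is easiest to see by running the lines from the slide. The variable `is_less` is added here only to capture the comparison result, and `mean()` is applied to `age` because the slide's `weight` vector is never defined.

```r
10 + 10                      # an expression: evaluated and printed, nothing is stored

my.age <- 23                 # an assignment: stores 23 under the name my.age

is_less <- my.age < - 23     # with the space, this compares my.age to -23
print(is_less)               # my.age itself is unchanged

age <- c(my.age, 14, 59, 32) # combine values into a numeric vector
my.age == 40                 # equality test, returns a logical value

mean(age)                    # a function call on the vector
```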


Summary of Data Structures

                 Linear     Rectangular
Homogeneous      Vectors    Matrices
Heterogeneous    Lists      Data Frames

•  Vectors and Matrices must contain the same data type
•  Character type will trump numeric: values will be forced into characters
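A short sketch of the coercion rule and of how lists and data frames avoid it. The example values and the `people` data frame are hypothetical.

```r
# A vector is homogeneous: mixing in a string coerces the numbers to character
mixed_vector <- c(1, 2, "three")
class(mixed_vector)

# A list is heterogeneous: each element keeps its own type
mixed_list <- list(1, 2, "three")
element_classes <- sapply(mixed_list, class)

# A data frame is rectangular and heterogeneous: one type per column
people <- data.frame(name = c("Ann", "Bob"), age = c(23, 40))
sapply(people, class)
```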


The Basics (Functions)
•  Basic functions
   •  mean(age)
   •  sd(age)
   •  sqrt(var(age))
   •  TIP: to list all functions in the search path
      •  sapply(search(), ls, all.names = TRUE)
•  User-defined functions
   •  Score <- age * 10;
•  Using the correct functions for the given data type
   •  apply() family
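The slide's `Score <- age * 10` is an expression rather than a function definition, so the sketch below wraps it in an actual user-defined function and then shows two members of the apply family. The function name `score`, its `multiplier` argument, and the matrix `m` are assumptions.

```r
age <- c(23, 14, 59, 32)

# A user-defined function wrapping the slide's age * 10 expression
score <- function(x, multiplier = 10) {
  x * multiplier
}
score(age)

# sd() is the square root of var(), as the slide's two lines suggest
all.equal(sd(age), sqrt(var(age)))

# sapply: apply a function to each element of a vector
is_adult <- sapply(age, function(a) a >= 18)

# apply: apply a function over the rows (margin 1) of a matrix
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
```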


Function Components

writeLines(text = "text", con = stdout(), sep = "\n", useBytes = FALSE)
•  function name: writeLines("146.6", "popRate.txt", sep = "\n")
•  parentheses: writeLines("146.6", "popRate.txt", sep = "\n")
•  commas: writeLines("146.6", "popRate.txt", sep = "\n")
•  first argument: writeLines("146.6", "popRate.txt", sep = "\n")
•  second argument: writeLines("146.6", "popRate.txt", sep = "\n")
•  optional argument: writeLines("146.6", "popRate.txt", "\n")

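The call dissected above can be tried safely by writing to a temporary file instead of `popRate.txt`. This sketch shows that positional and named arguments reach the same parameters. The `out_file` path is an assumption introduced for the example.

```r
# Temporary path used in place of the slide's popRate.txt
out_file <- tempfile(fileext = ".txt")

# Positional text and con arguments, named optional sep argument
writeLines("146.6", out_file, sep = "\n")

# The same call with every argument named; sep falls back to its default "\n"
writeLines(text = "146.6", con = out_file)

# Read the file back to check what was written
readLines(out_file)
```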

Importing Data/Exporting Data
•  Flat Files
   •  Import:
      > AHW <- read.csv("AHW_1.csv", header=TRUE)
      > weatherdata <- read.table(file="C:/work/DM1/weather.csv", header=TRUE, sep=",")
      > USTemps=read.table(file=file.choose(),header=TRUE)   (opens an interactive file chooser)
   •  Export: write.csv() or write.table()
•  Databases
   •  Import
      > connection <- dbConnect(driver, user, password, host, dbname)
      > AHW <- dbSendQuery(connection, "SELECT * FROM AHW")
   •  Export
      > connection <- dbConnect(driver, user, password, host, dbname)
      > dbWriteTable(connection, "AHW", AHW)
•  R objects
   •  Import: > load('AHW.Rdata')
   •  Export: > save(AHW, file="New_AHW.Rdata")
•  Web
   •  connection <- url('http://pace.sdsc.edu/sites/default/bootcamp/images/AHW_1.csv')
   •  AHW <- read.csv(connection, header=TRUE)
•  Plots
   •  png(filename="C:/R/figure.png", height=295, width=300, bg="white")
   •  pdf(file="C:/R/figure.pdf", height=3.5, width=5)
   •  dev.off() #turn off device driver (to flush output to png/pdf)
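A self-contained round trip through three of the routes above (flat file, R object, plot device) might look like the sketch below. The small `AHW` data frame and its column names are hypothetical stand-ins for the course data set, and temporary files replace the `C:/` paths on the slide.

```r
# Hypothetical stand-in for the AHW data set
AHW <- data.frame(age = c(23, 40), height = c(170, 182))

# Flat file: export, then import again
csv_path <- tempfile(fileext = ".csv")
write.csv(AHW, csv_path, row.names = FALSE)
AHW_csv <- read.csv(csv_path, header = TRUE)

# R object: save, remove from the workspace, then load it back by name
rdata_path <- tempfile(fileext = ".Rdata")
save(AHW, file = rdata_path)
rm(AHW)
load(rdata_path)

# Plot: open a PDF device, draw, then close the device to flush the file
pdf_path <- tempfile(fileext = ".pdf")
pdf(file = pdf_path, height = 3.5, width = 5)
plot(AHW$age, AHW$height)
dev.off()
```

Forgetting `dev.off()` is a common reason for an empty or unreadable output file.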


•  Loading dataset to R-Studio (Simple text file)
   •  Name of data frame to be created with imported data
   •  Options for parsing the text data into fields and values
   •  How data frame will look once the data are imported


Extending R
•  http://cran.r-project.org/web/packages/
•  Install a package
   •  from command line
      > install.packages('name_of_package')
   •  from GUI
      •  Packages & Data > Package Installer
   •  Load Library (to use installed package)
      •  > library(name_of_package)
      •  Example
         > library(markdown)
•  Use Library Function
   •  > function_name(parameters)
   •  Example
      > markdownToHTML("example.md")

http://www.r-bloggers.com/dont-r-alone-a-guide-to-tools-for-collaboration-with-r/


More Information
•  The R Manuals
   •  http://www.stat.berkely.edu/~spector/R.pdf
•  An Introduction to R
   •  http://cran.r-project.org/doc/manuals/R-intro.html
   •  http://tryr.codeschool.com/
•  Books


Other Resources
•  IRC:
   /server irc.freenode.net
   /join #R


The end