big data big analytics


Upload: ajay-ohri

Post on 27-Jan-2015


Page 1: Big data Big Analytics

BIG DATA BIG ANALYTICS

A OHRI

Page 2: Big data Big Analytics

Pre-Agenda
-Presenter Introduction
-Audience Introduction
-Expectations
--------------------------------------------

Page 3: Big data Big Analytics

Presenter Introduction
www.linkedin.com/in/ajayohri
Working with Analytics since 2004
Educated at IIM Lucknow, DCE, U Tenn
Author (R for Business Analytics (Springer))
Blogger at www.decisionstats.com

Interviewed 100+ Analytics leaders

Page 4: Big data Big Analytics

Audience Introduction

● Affiliation - Academic / Govt / Private
● Years of working with Big Data -
● Specific Interest Area in Analytics -

Page 5: Big data Big Analytics

Great Expectations

From You
1. No mobile rings, no sleeping (discreet sleeping)
2. Please take notes using pencil, parchment, paper, pen, computer, tablet, stylus, mobile, etc.
3. Please ask questions at the END (from notes taken at Step 2)

From Me
1. Breadth of Case Studies (!)
2. Open Source focus (R mostly, Clojure, Python)
3. Actionable ideas are useful! i.e. "I spent 3 hours in X talk but I did learn to do Y", or "I am now interested in trying out Z"

Page 6: Big data Big Analytics

Agenda
-Presenter Introduction
-Audience Identification
-Expectations
---------------------------------------------
-Big Data
-Big Data Analytics using R
-Case Study 1 (Amazon AWS, SAP Hana DB)
-Big Data Analytics using other tools
-Case Study 2 (BigML.com, Picloud.com)
--------------------------------------------

Page 7: Big data Big Analytics

Big Data
What is Big Data?

"Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

IBM- http://www-01.ibm.com/software/data/bigdata/

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

Page 8: Big data Big Analytics

Big Data
What is Big Data?

Big Data Conferences
--O'Reilly's Strata
--Hadoop World
--Many, many conferences... including ours

Page 9: Big data Big Analytics

Thought for Today
Data that is classified as Big Data in 2012 will be classified as Little Data by 2018.

True ----------False?

Page 10: Big data Big Analytics

What is Cloud Computing?

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.

http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf

--National Institute of Standards and Technology

Page 11: Big data Big Analytics

Cloud Computing and Big Data Analytics

The cost of computing on Big Data would be prohibitive were it not for cloud computing.

Cloud runs on X OS predominantly, and needs customized solutions as of 2012

Open source solutions (OS- Analytics) are more easily customized

Page 12: Big data Big Analytics

Sources of Big Data

--Internet
------Server Logs, Clickstream, Analytics

--Social Media

--Governments and UN bodies

--Internal Data from customers

Page 13: Big data Big Analytics

Storing Big Data for R

--Lots of RAM (?!)

--RDBMS

--Documents (CouchDB, MongoDB)

--HDFS (Hadoop)

Page 14: Big data Big Analytics

Storing Big Data for R

--Documents (CouchDB, MongoDB)

Package RMongo provides an R interface to a Java client for MongoDB (http://en.wikipedia.org/wiki/MongoDB) databases, which are queried using JavaScript rather than SQL. Package rmongodb is another client, using MongoDB's C driver.

https://github.com/wactbprot/R4CouchDB
R talking to CouchDB using Couch's RESTful HTTP API: construct HTTP calls with RCurl, then move on to the R4CouchDB package for a higher-level interface.
http://digitheadslabnotebook.blogspot.in/2010/10/couchdb-and-r.html

Page 15: Big data Big Analytics

Big Data Packages in R - 1/2
http://cran.r-project.org/web/views/HighPerformanceComputing.html

● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality for data sets stored outside of R's main memory.

● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.

● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them. This permits transparent access from R without bumping against R's internal memory limits. Several R processes on the same computer can also share big memory objects.

● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming. It also facilitates operating on data in a streaming fashion, without Hadoop.
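The idea of "incremental computations" on data that does not fit in memory can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the chunk size of 1000 rows is arbitrary, and the sketch accumulates the normal equations rather than reproducing the algorithm biglm itself uses.

```r
# Hypothetical toy data; in practice each chunk would be read from disk.
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.1)
dat <- data.frame(x = x, y = y)

# Split the row indices into chunks of 1000 rows to mimic piecewise reading.
chunks <- split(seq_len(n), ceiling(seq_len(n) / 1000))

# Accumulate the cross-products X'X and X'y one chunk at a time.
xtx <- matrix(0, 2, 2)
xty <- matrix(0, 2, 1)
for (idx in chunks) {
  X <- cbind(1, dat$x[idx])              # design matrix with an intercept column
  xtx <- xtx + crossprod(X)              # adds t(X) %*% X for this chunk
  xty <- xty + crossprod(X, dat$y[idx])  # adds t(X) %*% y for this chunk
}

# Solve the normal equations from the small accumulated summaries.
beta_chunked <- drop(solve(xtx, xty))

# Reference fit on the full data, for comparison only.
beta_full <- unname(coef(lm(y ~ x, data = dat)))
```

Only one chunk and two small matrices are needed at any moment, which is the property that lets regression scale beyond available RAM.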

Page 16: Big data Big Analytics

Big Data Packages in R - 2/2

● http://cran.r-project.org/web/packages/biganalytics/
This package extends the bigmemory package with various analytics. Functions bigkmeans and binit may also be used with native R objects.

● http://cran.r-project.org/web/packages/bigtabulate/index.html
This package extends the bigmemory package with table- and split-like support for big.matrix objects. The functions may also be used with regular R matrices for improving speed and memory-efficiency.

● http://cran.at.r-project.org/web/packages/synchronicity/index.html
For mutex (locking) support for advanced shared-memory usage, see synchronicity.

● https://r-forge.r-project.org/R/?group_id=556 lists more projects. For linear algebra support, see bigalgebra.

Page 17: Big data Big Analytics

Big Data and Revolution Analytics

Primary - RevoScaleR package / XDF format

Also sponsored RHadoop
https://github.com/RevolutionAnalytics/RHadoop

Page 18: Big data Big Analytics

RHadoop - rhdfs package

rhdfs - https://github.com/decisionstats/RHadoop/wiki/rhdfs

Overview
This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS. The following functions are part of this package:

● File Manipulations: hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
● File Read/Write: hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
● Directory: hdfs.dircreate, hdfs.mkdir
● Utility: hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
● Initialization: hdfs.init, hdfs.defaults

http://hadoop.apache.org/hdfs/
Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

Page 19: Big data Big Analytics

RHadoop - rhbase package

rhbase - https://github.com/decisionstats/RHadoop/wiki/rhbase

Overview
This R package provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase. The following functions are part of this package:

● Table Manipulation: hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
● Read/Write: hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan
● Utility: hb.list.tables
● Initialization: hb.defaults, hb.init

http://hbase.apache.org/
HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.

Page 21: Big data Big Analytics

Big Data Social Network Analysis

Analyzing a big social network using R and distributed graph engines
http://thinkaurelius.com/2012/02/05/graph-degree-distributions-using-r-over-hadoop/

Page 22: Big data Big Analytics

Big Data Social Media Analysis

Can be used for customers (and also for latent influencers)
http://www.r-bloggers.com/an-example-of-social-network-analysis-with-r-using-package-igraph/

Page 23: Big data Big Analytics

Big Data Social Media Analysis

The R package twitteR (http://cran.r-project.org/web/packages/twitteR/index.html) can be used for prototyping, but Twitter's API is rate limited to 1500 per hour(?)/day, so we can use the Datasift API instead: http://datasift.com/pricing#costs

Page 24: Big data Big Analytics

Big Data Social Media Analysis

How does information propagate through a social network?
http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/

Page 25: Big data Big Analytics

Big Data Social Network Analysis

Can be used for terrorists (and also for potential protestors)
Drew Conway: http://riskecon.com/wp-content/uploads/2012/02/Conway-Socio_Terrorism.pdf

Primary focus is on three aspects of network analysis:
1. Identifying leadership and key actors
2. Revealing underlying structure and intra-network community structure
3. Evolution and decay of social networks

Page 26: Big data Big Analytics

Big Data and Revolution Analytics

Primary - RevoScaleR package / XDF format
Also sponsored RHadoop

● For a case study, UpStream Software (slide 16):
http://www.revolutionanalytics.com/news-events/free-webinars/2012/how-big-data-is-changing-retail-marketing-analytics/

● Big data GLMs (you might find the chart on this page useful):
http://blog.revolutionanalytics.com/2012/06/big-data-generalized-linear-models-with-revolution-r-enterprise.html

● Data distillation with Hadoop and R:
http://blog.revolutionanalytics.com/2012/06/data-distillation-with-hadoop-and-r.html

● Analysis of the million row movie data set (building recommendation engines):
http://blog.revolutionanalytics.com/2012/04/simple-tools-for-building-a-recommendation-engine.html

Page 27: Big data Big Analytics

Big Data and Revolution Analytics

Marketing analytics company UpStream Software used map-reduce to convert transactions from Omniture logs (web visits, emails clicked on, ads displayed) into customer behaviors: response to an offer, research into a product, purchases.
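The map-reduce pattern of turning raw log records into per-customer behaviors can be sketched on a single machine in base R. The log records, event names, and the event-to-behavior mapping below are all hypothetical; this is not UpStream's pipeline, only the shape of a map, shuffle, and reduce step.

```r
# Hypothetical log records; event names and the mapping are made up for illustration.
logs <- data.frame(
  customer = c("a", "a", "b", "b", "b", "c"),
  event    = c("email_click", "purchase", "web_visit",
               "web_visit", "purchase", "ad_display"),
  stringsAsFactors = FALSE
)
behavior_of <- c(email_click = "response", web_visit = "research",
                 ad_display = "exposure", purchase = "purchase")

# Map step: emit one (key = customer, value = behavior) pair per log record.
mapped <- Map(function(cust, ev) list(key = cust, value = behavior_of[[ev]]),
              logs$customer, logs$event)

# Shuffle step: group the emitted values by key.
keys   <- vapply(mapped, function(m) m$key, character(1))
values <- vapply(mapped, function(m) m$value, character(1))
grouped <- split(values, keys)

# Reduce step: count each behavior per customer.
reduced <- lapply(grouped, table)
```

On a cluster the map and reduce functions would run in parallel over partitions of the log; the grouping by key is what the framework does between the two steps.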

Page 28: Big data Big Analytics

More R and Hadoop Case Studies

A few examples where R and Hadoop are used for data distillation:
● Using robust regression on a series of raw voice-over-IP packets to calculate how long participants talk during a phone conversation.
● Using graph theory (and R's igraph package) to quantify the number of close friends of members of a social network.
● Orbitz uses R and Hadoop to extract flights and hotels that will be presented during a travel search, based on previous transactions.
● Using k-means clustering to extract similar "groups" of transactions, which are then aggregated and used as the record level for structured analysis.
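The last example, clustering transactions and then aggregating each cluster into one record, can be sketched with base R's kmeans(). The transaction data here are simulated, and the choice of two clusters is an assumption made for the illustration.

```r
set.seed(1)

# Hypothetical transactions: two well-separated groups of (amount, items).
tx <- rbind(
  cbind(amount = rnorm(50, mean = 10,  sd = 1), items = rnorm(50, mean = 2,  sd = 0.3)),
  cbind(amount = rnorm(50, mean = 100, sd = 5), items = rnorm(50, mean = 20, sd = 2))
)

# Cluster the transactions; nstart = 10 tries several random starts.
fit <- kmeans(tx, centers = 2, nstart = 10)

# Aggregate to one record per cluster, the level used for later analysis.
summary_by_cluster <- aggregate(as.data.frame(tx),
                                by = list(cluster = fit$cluster),
                                FUN = mean)
```

Each row of summary_by_cluster then stands in for a whole group of raw transactions, which is the data-reduction step the case study describes.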

Page 29: Big data Big Analytics

Using RDBMS (Big?) Data through R

--RDBMS - RODBC Package
http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases
RMySQL: http://cran.r-project.org/web/packages/RMySQL/index.html
ROracle: http://cran.r-project.org/web/packages/ROracle/index.html
RPostgreSQL: http://cran.r-project.org/web/packages/RPostgreSQL/index.html
DBI: http://cran.r-project.org/web/packages/DBI/index.html
RSQLite: http://cran.r-project.org/web/packages/RSQLite/index.html

Page 30: Big data Big Analytics

Using RDBMS (is it Big Data?) through R

--RDBMS - RODBC Package
http://cran.r-project.org/web/packages/RODBC/RODBC.pdf

> library(RODBC)
> odbcDataSources(type = c("all", "user", "system"))
                SQLServer              PostgreSQL30             PostgreSQL35W
             "SQL Server"    "PostgreSQL ANSI(x64)" "PostgreSQL Unicode(x64)"
                    MySQL
  "MySQL ODBC 5.1 Driver"

Page 31: Big data Big Analytics

Querying Big Data--RDBMS-SQL

--Hadoop-Pig (but many ways)

Page 32: Big data Big Analytics

Big Data Analytics- Challenges

---Traditional statistics theory grew up when data was constrained

--Traditional analytics programming was NOT parallel processing

--Shortage of trained people

Page 33: Big data Big Analytics

Big Data Analytics- Solutions

---Teaching more parallel programming and algorithms

--More focus on data reduction techniques such as clustering and segmentation than on hypothesis testing. Sampling, anyone?

--Training more data scientists

Page 34: Big data Big Analytics

Big Data Analytics- Tools used -Why R

-High Performance Computing

http://cran.r-project.org/web/views/HighPerformanceComputing.html

-Big Data Within R
http://www.slideshare.net/bytemining/r-hpc

Page 35: Big data Big Analytics

Using R (interfaces)

--Using R Studio for easier development

--Using Rattle GUI for straight off-the-shelf data mining, and using R Commander for extensions

--Using Revolution Analytics RPE
-----Example of Snippets

Page 36: Big data Big Analytics

Using R

--Using R for text mining
---Text Mining from Twitter Case Study
---Datasift Export to Amazon S3

--Using R for geo-coded analysis
---Hana DB

--Using R for Graphical Analysis of Big Data
---TablePlot
---3D using R Commander

--Using R for forecasting
---Using Plugin R Commander E-Pack

Page 37: Big data Big Analytics

Existing Big Data Case Studies

Departure of Aeroplanes - SAP Hana 200m
http://allthingsr.blogspot.in/#!/2012/04/big-data-r-and-hana-analyze-200-million.html

R using SAP Hana

http://www.decisionstats.com/interview-blag-sap-labs-montreal-using-sap-hana-with-rstats/

Page 41: Big data Big Analytics

Revolution Analytics RevoScaleR package

Using the RevoScaleR package to extract time series data from time-stamped logs (in this case, the "US Domestic Flights From 1990 to 2009" dataset on Infochimps):

"Analyzing time series data of all sorts is a fundamental business analytics task to which the R language is beautifully suited. In addition to the time series functions built into the base stats library, there are dozens of R packages devoted to time series... We have shown how to use the data manipulation functions of the RevoScaleR package to extract time-stamped data from a large data file, aggregate it, and form it into monthly time series that can easily be analyzed with standard R functions."

http://www.inside-r.org/howto/extracting-time-series-large-data-sets

http://blog.revolutionanalytics.com/2011/09/how-to-extract-time-series-from-large-timestamped-logs-with-r.html
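The extract, aggregate, and form-a-monthly-series workflow described above can be sketched in base R. This is a stand-in, not RevoScaleR code: the flight log is simulated, the column names are hypothetical, and everything is held in memory, which RevoScaleR's XDF-based functions are designed to avoid.

```r
# Hypothetical time-stamped log: one row per flight record in 2009.
set.seed(7)
dates <- as.Date("2009-01-01") + sample(0:364, 5000, replace = TRUE)
flights <- data.frame(date = dates, passengers = rpois(5000, lambda = 120))

# Reduce each timestamp to its month, then aggregate within months.
flights$month <- format(flights$date, "%Y-%m")
monthly <- aggregate(passengers ~ month, data = flights, FUN = sum)

# Form a monthly ts object that standard R time series functions accept.
monthly_ts <- ts(monthly$passengers, start = c(2009, 1), frequency = 12)
```

Once the data are in a ts object, functions such as decompose() or acf() from the stats package can be applied, provided the series is long enough for the method chosen.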

Page 42: Big data Big Analytics

Using R on Amazon -Case Study--Bioconductor in the Cloud

--Custom Amazon Instance

--Concerns for non- American users of Amazon

Page 43: Big data Big Analytics

Using BigML on cloud - Case Study

Classification using Clojure on Cloud
https://bigml.com/gallery/models/fraud_and_crime

--Concerns on depending on third party tools
--Example: Cloudnumbers.com

Page 44: Big data Big Analytics

Using Google APIs
https://code.google.com/apis/console/?pli=1

Google Storage API

Google Predictive Analysis API

Introduction to other APIS

----Concerns to users of Google APIs

Page 45: Big data Big Analytics

Using Google APIs case study

Google Storage API

Google Predictive Analysis API
http://code.google.com/p/google-prediction-api-r-client/

Page 46: Big data Big Analytics

Using Google APIs case study

Introduction to other Big Data Google APIs

----Concerns to users of Google APIs

Page 47: Big data Big Analytics

Using Python - PiCloud
http://www.picloud.com/

Page 48: Big data Big Analytics

Privacy hazards of big data analytics

Big Brother - 1984 --- 2012

They know where you are (mobiles)
They know what you are looking for (internet)
They know your past (financial history + social media)
They can use your medical history
Laws authorize them (Patriot Act?)

--example Emotional Analysis of Images http://www.affectiva.com/

Page 49: Big data Big Analytics

References and Acknowledgements

David Smith, Revolution Analytics
David Champagne, Revolution Analytics
All R Bloggers, Developers, Packagers
Blag - SAP Hana Analytics
Charlie Berger and Oracle R Team
Jim Kobielus - IBM Big Data Team

R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Page 50: Big data Big Analytics

Thanks

Page 51: Big data Big Analytics

Book - R for Business Analytics
http://www.springer.com/statistics/book/978-1-4614-4342-1