Hadoop meets Mature BI: Data Scientists

Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists

Michael Hiskey, Futurist + Product Evangelist; VP, Marketing & Business Development, Kognitio

@Kognitio @mphnyc #MPP_R #OANYC

Upload: kognitio

Post on 22-Nov-2014

Category: Technology


DESCRIPTION

The key challenge for Data Scientists is not the proliferation of their roles, but the ability to ‘graduate’ key Big Data projects from the ‘Data Science Lab’ and productionize them in their broader organizations. Over the next 18 months, "Big Data" will become just "Data"; this means everyone (even business users) will need a way to use it without reinventing the way they interact with their current reporting and analysis. Doing this requires interactive analysis with existing tools and massively parallel code execution, tightly integrated with Hadoop. Your Data Warehouse is dying; Hadoop will elicit a material shift away from price per TB in persistent data storage.

TRANSCRIPT

Page 1: Hadoop meets Mature BI: Data Scientists

Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists

Michael Hiskey

Futurist + Product Evangelist

VP, Marketing & Business Development, Kognitio

Page 2: Hadoop meets Mature BI: Data Scientists

The Data Scientist

Sexiest job of the 21st Century?

Page 3: Hadoop meets Mature BI: Data Scientists

Key Concept: Graduation

Projects will need to graduate from the Data Science Lab and become part of Business as Usual

Page 4: Hadoop meets Mature BI: Data Scientists

Demand for the Data Scientist

Organizational appetite for tens, not hundreds

© EMC Corporation and The Guardian UK™ http://www.guardian.co.uk/news/datablog/2012/mar/02/data-scientist#zoomed-picture

Page 5: Hadoop meets Mature BI: Data Scientists

Don’t be a Railroad Stoker!

Highly skilled engineering required … but the world innovated around them.

Page 6: Hadoop meets Mature BI: Data Scientists

Business Intelligence

Numbers, Tables, Charts, Indicators

Time
- History
- Lag

Access
- to view (portal)
- to data
- to depth
- Control/Secure

Consumption
- digestion

…with ease and simplicity

Straddle IT and Business

Faster

Lower latency

More granularity

Richer data model

Self service

Page 7: Hadoop meets Mature BI: Data Scientists

What has changed?

More connected-users?

More-connected users?

Page 8: Hadoop meets Mature BI: Data Scientists

According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005

In 2010 this was 1,200 exabytes
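As a rough check on the two estimates above, the implied growth can be worked out directly. The sketch below uses only the 150 and 1,200 exabyte figures from the slide; the variable names and the assumption of smooth compound growth between 2005 and 2010 are illustrative and not from the talk.

```python
# Figures quoted on the slide (exabytes of data created per year).
EB_2005 = 150
EB_2010 = 1200

years = 2010 - 2005

# Overall growth factor between the two estimates.
growth_factor = EB_2010 / EB_2005

# Assumption: growth compounds smoothly each year (not stated on the slide).
annual_rate = growth_factor ** (1 / years) - 1

print(f"Growth factor over {years} years: {growth_factor:.0f}x")
print(f"Implied compound annual growth: {annual_rate:.1%}")
```

Under that assumption the two figures amount to an eightfold increase in five years, or roughly 52% growth per year.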

Page 9: Hadoop meets Mature BI: Data Scientists

Data flow

Page 10: Hadoop meets Mature BI: Data Scientists

Data Variety

Page 11: Hadoop meets Mature BI: Data Scientists

Respondents were asked to choose up to two descriptions about how their organizations view big data from the choices above. Choices have been abbreviated, and selections have been normalized to equal 100%. n=1144

Source: IBM Institute for Business Value/Said Business School Survey 

What?  New value comes from your existing data

Page 12: Hadoop meets Mature BI: Data Scientists

© 20th Century Fox

Page 13: Hadoop meets Mature BI: Data Scientists

Hadoop ticks many but not all the boxes

Page 14: Hadoop meets Mature BI: Data Scientists

No need to pre-process

No need to align to schema

No need to triage 

Null storage concerns

Page 15: Hadoop meets Mature BI: Data Scientists

The drive for deeper understanding

[Chart. Axes: Technology/Automation and Analytical Complexity. Items shown: Reporting & BPM, Campaign Management, Fraud detection, Dynamic Interaction, Behaviour modelling, Clustering, Statistical Analysis, Dynamic Simulation, Machine learning algorithms]

Page 16: Hadoop meets Mature BI: Data Scientists

Hadoop just too slow for interactive BI!

…loss of train-of-thought

“while hadoop shines as a processing platform, it is painfully slow as a query tool”

Page 17: Hadoop meets Mature BI: Data Scientists

Analytics needs low latency, no I/O wait

High speed in‐memory processing

Page 18: Hadoop meets Mature BI: Data Scientists

Analytical Platform: Reference Architecture

Analytical Platform Layer

Near-line Storage (optional)

Application & Client Layer

All BI Tools, All OLAP Clients, Excel

Persistence Layer

Hadoop Clusters

Enterprise Data Warehouses

Legacy Systems

Reporting

Cloud Storage

Page 19: Hadoop meets Mature BI: Data Scientists

The Future

Big Data

Advanced Analytics

In-memory

Logical Data Warehouse

Predictive Analytics

Data Scientists

Page 20: Hadoop meets Mature BI: Data Scientists

connect

www.kognitio.com

twitter.com/kognitio
linkedin.com/companies/kognitio
tinyurl.com/kognitio
youtube.com/kognitio

NA: +1 855 KOGNITIO
EMEA: +44 1344 300 770

THESE SLIDES: www.slideshare.net/Kognitio

Page 21: Hadoop meets Mature BI: Data Scientists

Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists

• The key challenge for Data Scientists is not the proliferation of their roles, but the ability to ‘graduate’ key Big Data projects from the ‘Data Science Lab’ and productionize them in their broader organizations.

• Over the next 18 months, "Big Data" will become just "Data"; this means everyone (even business users) will need a way to use it without reinventing the way they interact with their current reporting and analysis.

• Doing this requires interactive analysis with existing tools and massively parallel code execution, tightly integrated with Hadoop. Your Data Warehouse is dying; Hadoop will elicit a material shift away from price per TB in persistent data storage.

Page 22: Hadoop meets Mature BI: Data Scientists

The new bounty hunters: Drill, Impala, Pivotal, Stinger

The No SQL Posse

WANTED DEAD OR ALIVE

SQL

Page 23: Hadoop meets Mature BI: Data Scientists

It’s all about getting work done

Tasks evolving:

- Used to be simple fetch of value

- Then was calc dynamic aggregate

- Now complex algorithms!
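The three stages of task evolution on this slide can be sketched end to end on toy data. This is a minimal illustration, not the speaker's implementation: it uses Python's built-in sqlite3 module as a stand-in for any SQL engine, the rows are invented, the `day` column is a hypothetical addition, and the "complex algorithm" is an ordinary least-squares line fit written in plain Python.

```python
import sqlite3

# Hypothetical toy data: table names echo the deck's SQL, rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sales_history (year int, month int, region int, day int, sales real);
    create table summary (year int, month int, region int, total_sales real);
""")
rows = [(2006, 5, 1, day, 100.0 + 10.0 * day) for day in range(1, 6)]
conn.executemany("insert into sales_history values (?, ?, ?, ?, ?)", rows)
conn.execute("insert into summary values (2006, 5, 1, 650.0)")

where = "where year = 2006 and month = 5 and region = 1"

# Stage 1: simple fetch of a precomputed value.
fetched = conn.execute(f"select total_sales from summary {where}").fetchone()[0]

# Stage 2: dynamic aggregate computed at query time from the detail rows.
aggregated = conn.execute(f"select sum(sales) from sales_history {where}").fetchone()[0]

# Stage 3: an algorithm over the detail rows (least-squares fit of sales on day).
points = conn.execute(f"select day, sales from sales_history {where} order by day").fetchall()
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
    (x - mean_x) ** 2 for x, _ in points
)
intercept = mean_y - slope * mean_x

print("fetched total:", fetched)
print("aggregated total:", aggregated)
print("fitted trend: sales =", intercept, "+", slope, "* day")
```

Each stage reads more detail rows and does more computation per answer than the one before it, which is the pressure on query latency the deck goes on to discuss.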

Page 24: Hadoop meets Mature BI: Data Scientists

create external script LM_PRODUCT_FORECAST environment rsint
receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER, PRODNO INTEGER, DAILYSALES …
partition by PRODNO order by PRODNO, ROW_ID
sends ( R_OUTPUT varchar )
isolate partitions
script S'endofr(
# Simple R script to run a linear fit on daily sales
prod1<-read.csv(file=file("stdin"), header=FALSE,row.names …
colnames(prod1)<-c("DOW","ID","PRODNO","DAILYSALES")
dim1<-dim(prod1)
daily1<-aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW), …
daily1[,2]<-daily1[,2]/sum(daily1[,2])
basesales<-array(0,c(dim1[1],2))
basesales[,1]<-prod1$ID
basesales[,2]<-(prod1$DAILYSALES/daily1[prod1$DOW+1,2])
colnames(basesales)<-c("ID","BASESALES")
fit1=lm(BASESALES ~ ID,as.data.frame(basesales))

select Trans_Year, Num_Trans,
count(distinct Account_ID) Num_Accts,
sum(count( distinct Account_ID)) over (partition by Trans_Year …
cast(sum(total_spend)/1000 as int) Total_Spend,
cast(sum(total_spend)/1000 as int) / count(distinct Account_ID …
rank() over (partition by Trans_Year order by count(distinct A …
rank() over (partition by Trans_Year order by sum(total_spend) …
from( select Account_ID,
Extract(Year from Effective_Date) Trans_Year,
count(Transaction_ID) Num_Trans,
sum(Transaction_Amount) Total_Spend,

select dept, sum(sales) from sales_fact
where period between date '01-05-2006' and date '31-05-2006'
group by dept
having sum(sales) > 50000;

select sum(sales) from sales_history
where year = 2006 and month = 5 and region=1;

select total_sales from summary
where year = 2006 and month = 5 and region=1;

Behind the numbers

Page 25: Hadoop meets Mature BI: Data Scientists

For once technology is on our side

First time we have full triumvirate of
- Excellent Computing power
- Unlimited storage
- Fast Networks

…now that RAM is cheap!

Page 26: Hadoop meets Mature BI: Data Scientists

Lots of these

Not so many of these

Hadoop is… 

Hadoop inherently disk oriented

Typically low ratio of CPU to Disk