Hadoop Meets Mature BI: Data Scientists
DESCRIPTION
The key challenge for Data Scientists is not the proliferation of their roles, but the ability to ‘graduate’ key Big Data projects from the ‘Data Science Lab’ and production-ize them into their broader organizations. Over the next 18 months, "Big Data" will become just "Data"; this means everyone (even business users) will need a way to use it - without reinventing the way they interact with their current reporting and analysis. Doing this requires interactive analysis with existing tools and massively parallel code execution, tightly integrated with Hadoop. Your Data Warehouse is dying; Hadoop will elicit a material shift away from price per TB in persistent data storage.
TRANSCRIPT
@Kognitio @mphnyc #MPP_R #OANYC
Hadoop meets Mature BI: Where the rubber meets the road for Data Scientists
Michael Hiskey
Futurist + Product Evangelist
VP, Marketing & Business Development, Kognitio
The Data Scientist: Sexiest job of the 21st Century?
Key Concept: Graduation
Projects will need to graduate from the Data Science Lab and become part of Business as Usual.
Demand for the Data Scientist
Organizational appetite is for tens, not hundreds.
© EMC Corporation and The Guardian UK™ http://www.guardian.co.uk/news/datablog/2012/mar/02/data‐scientist#zoomed‐picture
Don’t be a Railroad Stoker!
Highly skilled engineering was required … but the world innovated around them.
Business Intelligence
- Numbers, tables, charts, indicators
- Time: history, lag
- Access: to view (portal), to data, to depth, control/secure
- Consumption: digestion
…with ease and simplicity
Straddle IT and Business
- Faster, lower latency
- More granularity
- Richer data model
- Self service
What has changed?
More connected users?
More-connected users?
According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005.
In 2010 this was 1,200 exabytes
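The jump from 150 to 1,200 exabytes implies a striking compound growth rate. A quick sanity check on the figures quoted above (a sketch, not part of the original deck):

```python
# Implied annual growth of global data volume, using the figures
# quoted above: 150 EB in 2005 and 1,200 EB in 2010.
start_eb, end_eb, years = 150, 1200, 5

growth_factor = end_eb / start_eb               # 8x in five years
cagr = growth_factor ** (1 / years) - 1         # compound annual growth rate

print(f"Growth factor: {growth_factor:.0f}x over {years} years")
print(f"Implied compound annual growth: {cagr:.0%}")
```

That is roughly 50% more data created every year, which is the pressure behind everything that follows.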
Data flow
Data Variety
Respondents were asked to choose up to two descriptions about how their organizations view big data from the choices above. Choices have been abbreviated, and selections have been normalized to equal 100%. n=1144
Source: IBM Institute for Business Value/Said Business School Survey
What? New value comes from your existing data
© 20th Century Fox
Hadoop ticks many but not all the boxes
- No need to pre-process
- No need to align to schema
- No need to triage
- Null storage concerns
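These points describe the schema-on-read pattern: raw data lands as-is, and structure is applied only at query time. A minimal sketch of the idea (the log layout and field names here are hypothetical, not from the deck):

```python
import csv
import io

# Schema-on-read sketch: events are stored exactly as they arrived;
# a schema is imposed only when a query runs. Layout is hypothetical.
raw_log = """2013-06-01,web,purchase,19.99
2013-06-01,mobile,view,
2013-06-02,web,purchase,5.00"""

def query(raw, want_event):
    """Parse rows on the fly; null or unused fields are tolerated, not triaged."""
    total = 0.0
    for row in csv.reader(io.StringIO(raw)):
        date, channel, event, amount = row
        if event == want_event and amount:  # empty amounts cost nothing up front
            total += float(amount)
    return total

print(query(raw_log, "purchase"))
```

No load-time cleansing, no schema alignment, no decision about which rows to keep: the query pays only for the fields it touches.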
The drive for deeper understanding
[Chart plotting analytical complexity against technology/automation for reporting & BPM, campaign management, fraud detection, dynamic interaction, statistical analysis, clustering, behaviour modelling, dynamic simulation, and machine learning algorithms]
Hadoop is just too slow for interactive BI!
…loss of train-of-thought
“While Hadoop shines as a processing platform, it is painfully slow as a query tool.”
Analytics needs low latency and no I/O wait
High speed in‐memory processing
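The in-memory pattern behind that claim: pay the disk/Hadoop I/O cost once at load time, then serve every train-of-thought query from RAM with no I/O wait. A toy sketch (the data and query are illustrative, not Kognitio's actual API):

```python
# In-memory analytics sketch: one bulk load, then pure-RAM ad-hoc queries.

sales = []  # RAM-resident copy of the data, loaded once

def load(rows):
    """The only 'I/O' in the system: a one-time bulk load into memory."""
    sales.extend(rows)

def ad_hoc_query(region):
    """Each interactive query is a scan of memory; no disk is touched."""
    return sum(amount for r, amount in sales if r == region)

load([("east", 100), ("west", 250), ("east", 75)])
print(ad_hoc_query("east"))  # 175
```

The point is architectural rather than the toy code itself: interactive BI tolerates seconds, not the minutes a disk-bound MapReduce scan takes.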
Analytical Platform: Reference Architecture
- Application & Client Layer: all BI tools, all OLAP clients, Excel, reporting
- Analytical Platform Layer (with optional near-line storage)
- Persistence Layer: Hadoop clusters, enterprise data warehouses, legacy systems, cloud storage, …
The Future
- Big Data
- Advanced Analytics
- In-memory
- Logical Data Warehouse
- Predictive Analytics
- Data Scientists
connect
www.kognitio.com
twitter.com/kognitio
linkedin.com/companies/kognitio
tinyurl.com/kognitio
youtube.com/kognitio
NA: +1 855 KOGNITIO
EMEA: +44 1344 300 770
THESE SLIDES: www.slideshare.net/Kognitio
The new bounty hunters: Drill, Impala, Pivotal, Stinger
The NoSQL Posse
WANTED, DEAD OR ALIVE: SQL
It’s all about getting work done
Tasks are evolving:
- Used to be a simple fetch of a value
- Then was a calculated dynamic aggregate
- Now complex algorithms!
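Those three generations of analytic work can be shown on one toy dataset (values are illustrative, not from the deck):

```python
# Three generations of analytic tasks on one toy dataset of yearly sales.
sales = {2004: 100, 2005: 120, 2006: 150, 2007: 180}

# 1. Simple fetch of a value
v = sales[2006]

# 2. Dynamic aggregate calculated at query time
avg = sum(sales.values()) / len(sales)

# 3. Complex algorithm: an ordinary least-squares trend line,
#    run inside the platform rather than exported to a workstation
n = len(sales)
xs, ys = list(sales), list(sales.values())
x_bar, y_bar = sum(xs) / n, sum(ys) / n
num = sum((x - x_bar) * (y - y_bar) for x, y in sales.items())
den = sum((x - x_bar) ** 2 for x in xs)
slope = num / den  # units sold gained per year

print(v, avg, round(slope, 1))  # → 150 137.5 27.0
```

Each step up the ladder needs more compute per row, which is why the execution engine, not just the storage, has to scale.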
create external script LM_PRODUCT_FORECAST environment rsint
receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER, PRODNO INTEGER, DAILYSALES INTEGER )
partition by PRODNO order by PRODNO, ROW_ID
sends ( R_OUTPUT varchar )
isolate partitions
script S'endofr(
# Simple R script to run a linear fit on daily sales
prod1<-read.csv(file=file("stdin"), header=FALSE)
colnames(prod1)<-c("SALEDATE","DOW","ID","PRODNO","DAILYSALES")
dim1<-dim(prod1)
# day-of-week sales factors, normalised to sum to 1
daily1<-aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW), sum)
daily1[,2]<-daily1[,2]/sum(daily1[,2])
# de-seasonalised "base sales", then a linear fit against row id
basesales<-array(0,c(dim1[1],2))
basesales[,1]<-prod1$ID
basesales[,2]<-(prod1$DAILYSALES/daily1[prod1$DOW+1,2])
colnames(basesales)<-c("ID","BASESALES")
fit1=lm(BASESALES ~ ID,as.data.frame(basesales))
# emit the fitted coefficients as one varchar column
write(format(coef(fit1)), stdout())
)endofr';
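For readers who do not use Kognitio's external-script syntax, the per-partition logic above can be sketched in plain Python (an illustrative function with made-up data, not the platform's API):

```python
# Sketch of the external script's logic for a single PRODNO partition:
# remove the day-of-week effect, then fit a least-squares trend to the
# de-seasonalised daily sales.

def product_forecast(rows):
    """rows: list of (dow, row_id, daily_sales) tuples for one product."""
    # Day-of-week factors, normalised to sum to 1 (as in the R script)
    dow_totals = {}
    for dow, _, amount in rows:
        dow_totals[dow] = dow_totals.get(dow, 0) + amount
    total = sum(dow_totals.values())
    factor = {d: t / total for d, t in dow_totals.items()}

    # De-seasonalised "base sales", then an ordinary least-squares fit
    pts = [(rid, amount / factor[dow]) for dow, rid, amount in rows]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

The platform runs this once per product partition in parallel; the Python version makes that per-partition unit of work explicit.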
select Trans_Year, Num_Trans,
       count(distinct Account_ID) Num_Accts,
       sum(count(distinct Account_ID)) over (partition by Trans_Year) Total_Accts,
       cast(sum(total_spend)/1000 as int) Total_Spend,
       cast(sum(total_spend)/1000 as int) / count(distinct Account_ID) Spend_Per_Acct,
       rank() over (partition by Trans_Year order by count(distinct Account_ID) desc) Accts_Rank,
       rank() over (partition by Trans_Year order by sum(total_spend) desc) Spend_Rank
from ( select Account_ID,
              extract(Year from Effective_Date) Trans_Year,
              count(Transaction_ID) Num_Trans,
              sum(Transaction_Amount) Total_Spend,
       …
select dept, sum(sales)
from sales_fact
where period between date '01-05-2006' and date '31-05-2006'
group by dept
having sum(sales) > 50000;

select sum(sales)
from sales_history
where year = 2006 and month = 5 and region = 1;

select total_sales
from summary
where year = 2006 and month = 5 and region = 1;
Behind the numbers
For once technology is on our side
First time we have the full triumvirate of:
- Excellent computing power
- Unlimited storage
- Fast networks
…now that RAM is cheap!
Hadoop is…
- inherently disk oriented
- typically a low ratio of CPU to disk (lots of disks, not so many CPUs)