How To Run A Successful BI Project with Hadoop

Upload: mammoth-data

Post on 26-Jan-2017

TRANSCRIPT

Page 1: How To Run A Successful BI Project with Hadoop

How To Run A Successful BI Project with Hadoop

Page 2: How To Run A Successful BI Project with Hadoop

www.mammothdata.com | @mammothdataco

About Mammoth Data

Mammoth Data is a Big Data consulting and services company specializing in Hadoop and NoSQL databases. Basically, we turn unstructured information into business intelligence.

Founded January 2008 by Andrew C. Oliver (me)

Based in downtown Durham, NC

Partnered with Hortonworks, MongoDB, DataStax, Cloudera, Couchbase, Cloudbees & Neo Technology

Page 3: How To Run A Successful BI Project with Hadoop

Andrew C. Oliver @acoliver

● Programming since I was about 8
● Java since ~1997
● Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
  ○ Former member Jakarta PMC
  ○ Emeritus member of Apache Software Foundation
● Joined JBoss ~2002
● Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)
● Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
  ○ I make fanboys cry.

Page 4: How To Run A Successful BI Project with Hadoop

Agenda

● Let’s define BI
● Why are you doing this?
● Management & Support
● Goals
● Controlled Scope
● A BI Project is not about a tool or tool choice
● The right team is essential
● Interviews
● Data Integration
● Let the data change the organization
● Weed the analytics
● Apply Machine Learning Techniques
● Plan a continuous revision cycle
● But what about Hadoop???
● Implementation
● Conclusion

Page 5: How To Run A Successful BI Project with Hadoop

Let’s Define BI

Obligatory Wikipedia definition:

Business Intelligence (BI) is a broad category of computer software solutions that enables a company or organization to gain insight into its critical operations through reporting applications and analysis tools.

Page 6: How To Run A Successful BI Project with Hadoop

Why Are You Doing This?

● Fad/Charts are pretty?

● Competitive?

● Decrease costs?

Page 7: How To Run A Successful BI Project with Hadoop

Management & Support

● BI cannot be an IT-driven project. IT may be involved in providing the implementation, but business drivers need to push the project.

● BI & Data Driven isn’t a technology; it is a state

Page 8: How To Run A Successful BI Project with Hadoop

Goals

● Should have a financial basis
  ○ Decrease waste by 20%
  ○ Increase sales by 10% via greater customer intimacy
  ○ Reduce the time to analyze data and reduce manual spreadsheeting by 50%
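Goals like these are easier to track when each one is written down as a baseline plus a targeted relative change. The following is only a minimal sketch of that bookkeeping: the `Goal` class, its method names, and all of the numbers are hypothetical and chosen for illustration, not taken from any real project.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """A business goal expressed as a relative change from a measured baseline."""
    name: str
    baseline: float       # value measured before the project (made-up here)
    target_change: float  # -0.20 means "decrease by 20%", 0.10 means "increase by 10%"

    def target_value(self) -> float:
        """The value the metric must reach for the goal to be met."""
        return self.baseline * (1 + self.target_change)

    def progress(self, current: float) -> float:
        """Fraction of the targeted change achieved so far (1.0 means goal met)."""
        return (current - self.baseline) / (self.target_value() - self.baseline)


# Illustrative numbers only.
waste = Goal("Decrease waste", baseline=500_000.0, target_change=-0.20)
sales = Goal("Increase sales", baseline=2_000_000.0, target_change=0.10)

print(round(waste.target_value()))            # target annual waste
print(round(waste.progress(450_000.0), 2))    # halfway to the waste target
print(round(sales.progress(2_100_000.0), 2))  # halfway to the sales target
```

Stating the baseline explicitly matters: without it, "decrease waste by 20%" cannot be checked against anything.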

Page 9: How To Run A Successful BI Project with Hadoop

Controlled Scope

● BI projects sometimes spiral out of control or grow to encompass data integration and other projects.

● Break up the project into discrete parts
  ○ Infrastructure
  ○ Data Integration
  ○ Reports
  ○ ...

Page 10: How To Run A Successful BI Project with Hadoop

A BI Project is not About a Tool or Tool Choice

● Projects that begin with “Should we choose Domo, Pentaho or Tableau?” tend to fail (tool choice is the wrong goal, and you may need more than one tool)

Page 11: How To Run A Successful BI Project with Hadoop

The Right Team is Essential

● A proper BI project will involve a large cross section of the company.

● Representatives of the major consumers and producers of the data should be involved.

● Avoid meetings full of the uninvolved. Focus on tasks, goals, requirements.

Page 12: How To Run A Successful BI Project with Hadoop

Interviews

● The “meatcloud”

● Key consumers

● Key producers

● Business actors

● Get outside help for this!

Page 13: How To Run A Successful BI Project with Hadoop

Data Integration

● Often the data that is needed is distributed. Is someone pulling data from two sources into a spreadsheet?

● Ideally, data integration should run as a separate project that is informed by BI requirements.
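The "two sources pasted into a spreadsheet" pattern is usually just a join that nobody has automated. As a hedged illustration, here is a minimal plain-Python sketch of that join; the two CSV extracts, their column names, and the helper functions `load_rows` and `revenue_by_region` are all invented for this example.

```python
import csv
import io

# Two hypothetical exports that someone might otherwise combine by hand.
CRM_EXPORT = """customer_id,name,region
1,Acme,East
2,Globex,West
3,Initech,East
"""

BILLING_EXPORT = """customer_id,amount
1,1200.50
1,300.00
2,950.25
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def revenue_by_region(customers, invoices):
    """Join invoices to customers on customer_id and total the amounts per region."""
    region_of = {row["customer_id"]: row["region"] for row in customers}
    totals = {}
    for invoice in invoices:
        # Invoices with no matching customer are kept visible rather than dropped.
        region = region_of.get(invoice["customer_id"], "UNKNOWN")
        totals[region] = totals.get(region, 0.0) + float(invoice["amount"])
    return totals


totals = revenue_by_region(load_rows(CRM_EXPORT), load_rows(BILLING_EXPORT))
print(totals)
```

Spotting manual joins like this during interviews is a good way to find the data integration work that the BI requirements should inform.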

Page 14: How To Run A Successful BI Project with Hadoop

Let the data change the organization

● Some of the worst projects attempt to map systems to exactly the way things are done today, but gain few or no new efficiencies.

● The worst of the worst recreate exactly the same Excel reports, just no longer by hand.

● What are people doing with those reports?

Page 15: How To Run A Successful BI Project with Hadoop

Weed the Analytics

● I can tell you how much we sell on Thursdays if it rains and there is a traffic jam, but is this useful? Can we make better decisions?

Page 16: How To Run A Successful BI Project with Hadoop

Apply Machine Learning Techniques

● WTF is it?
  ○ Remember “AI”? That wasn’t a good brand, but now we use it every day.
  ○ Mathematical analysis of data using algorithms
● The promise
  ○ Patterns can tell you things you didn’t realize, and with today’s cheap computing and massive parallelization we can find out things you never thought possible.
  ○ Finance has been using this for years
● The difficulty
  ○ Which algorithm do you use?
  ○ How do you set the data up for it?
  ○ What does it mean?
  ○ The tools are still in their infancy (tools for a mathgeek’s mathgeek), but do they understand the business?
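To make the three questions under "The difficulty" concrete, here is a deliberately tiny sketch. It assumes ordinary least-squares linear regression as the algorithm purely because it fits in a few lines; the deck does not name a specific algorithm, and the `fit_line` helper and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "How do you set the data up?": paired observations, one list per variable.
# Made-up numbers: ad spend (in thousands) and units sold in the same period.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, units_sold)

# "What does it mean?": in this toy data, each extra thousand in ad spend
# goes with `slope` more units sold, on top of a base of `intercept` units.
print(slope, intercept)
```

Even here the business question (is the relationship causal, and is it worth acting on?) is not answered by the math, which is the point of the last bullet.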

Page 17: How To Run A Successful BI Project with Hadoop

Plan a Continuous Revision Cycle

● BI is never done.

● Review business practices, look for sources of automation, look for new questions, phase out useless analytics.

● Technology refresh (this stuff used to be stagnant but now moves quickly)

Page 18: How To Run A Successful BI Project with Hadoop

But what about Hadoop???

Page 19: How To Run A Successful BI Project with Hadoop

Key Techniques

● HDFS
  ○ Distributed filesystem for storage
● Hive
  ○ SQL to MapReduce (and soon Hive on Spark = SQL to DAG)
● HBase
  ○ Column-family database (ACID at the row level)
● Phoenix
  ○ Relational (SQL) layer over HBase
● Sqoop
  ○ Copies data to HDFS from an RDBMS (and back)
● Kafka
  ○ Messaging
● Spark
  ○ file = spark.textFile("hdfs://...")
  ○ file.flatMap(lambda line: line.split())
  ○ .map(lambda word: (word, 1))
  ○ .reduceByKey(lambda a, b: a+b)
  ○ Also provides a SQL interface
  ○ Claims to be 100x faster than MapReduce in memory and 10x faster on disk
● (there are others, e.g. Flume, Storm, etc.)
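The Spark lines above are the classic word count. For readers without a cluster handy, the following is a plain-Python sketch of what each step of that pipeline does on a single machine. `flat_map` and `reduce_by_key` are hypothetical helpers written for this illustration, not Spark APIs, and the input lines are made up; real Spark distributes the same steps across the cluster.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine all values that share a key using func, similar in spirit to reduceByKey."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


# Stand-in for the lines of a text file stored in HDFS.
lines = ["to be or not to be", "to do is to be"]

words = flat_map(lambda line: line.split(), lines)  # one entry per word
pairs = [(word, 1) for word in words]               # the .map step
counts = reduce_by_key(lambda a, b: a + b, pairs)   # sum the 1s per word

print(counts["to"], counts["be"])  # prints "4 3"
```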

Page 20: How To Run A Successful BI Project with Hadoop

Implementation

● Set up a Hadoop Cluster

● Run a data integration project (Data Lake, Data Warehousing, etc.)

● Integrate BI tools (Tableau, Pentaho, Domo, etc. all have connectors to Hive, and many can integrate directly with Spark)

● Code for machine learning algorithms

Page 21: How To Run A Successful BI Project with Hadoop

In The End...

● This isn’t a technology or IT project, but it uses tools and technologies to implement business goals.

● Interview all of the major stakeholders and look for opportunities; dashboards are not the beginning and end of this.

● Continuous project lifecycle

Page 22: How To Run A Successful BI Project with Hadoop

Questions?