Big Data Business Case


Page 1: Big data   business case

"We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.“-Amara’s Law.

"The best way to predict the future is to create it." – Peter Drucker.

Author – Karthik Padmanabhan, Deputy Manager, Research and Advanced Engineering,

Ford Motor Company.

Image source: isa.org

Page 2: Big data   business case

What is Big Data?

Data is not termed Big Data based on size alone; several factors contribute:

1. Volume – corporate data has grown to the petabyte level and is still growing.
2. Velocity – data changes every moment, and both its history and plans for its use need to be tracked.
3. Variety – data arrives from a wide variety of sources and in many different formats.

A fourth aspect, Veracity, is now often added, so the dimensionality of the term Big Data keeps increasing. Eventually the curse of dimensionality comes into play: conventional methods can no longer extract real insight from the data, and innovation is the key to handling such a beast.

Page 3: Big data   business case

Dimensions of Big Data

Image source: datasciencecentral.com

Page 4: Big data   business case

Image source: vmware.com

Page 5: Big data   business case
Page 6: Big data   business case

Life Cycle – From Start to End

Image source: doubleclix.wordpress.com

Page 7: Big data   business case

Need for Big Data

Brutal fact: 80% of data is unstructured and it is growing at roughly 15% annually; total data volumes are expected to double within the next two years.

Single View: Integrated analysis of the customer and transaction data.

E-commerce business: storing huge amounts of click-stream data; the entire digital footprint needs to be measured.

Text-processing applications: social media text mining. Here the entire landscape changes, since it involves a different set of metrics at higher dimensions, which increases the complexity of the application. Distributed processing is the way to go.

Real Time Actionability: Immediate feedback on product launch through analysis of social media comments instead of waiting for customer satisfaction survey.

Change in Consumer Psychology: Necessity for Instant Gratification.

Page 8: Big data   business case

The 2013 Gartner Hype Cycle Special Report evaluates the maturity of over 2,000 technologies and trends in 102 areas. New Hype Cycles this year feature content and social analytics, embedded software and systems, consumer market research, open banking, banking operations innovation, and ICT in Africa.

http://www.gartner.com/technology/research/hype-cycles/

Hype Cycles - Gartner

Page 9: Big data   business case

Big Data – Where it is used

• Workforce science
• Astronomy (Hubble telescope)
• Gene and DNA expression
• Detecting cancerous cells that cause disease
• Fraud detection
• Video and audio mining
• Automotive industry – use cases on the next slide
• Consumer-focused marketing
• Retail

Page 10: Big data   business case

Automotive Industry

“If the automobile had followed the same development cycle as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year, killing everyone inside.”– Robert X. Cringely

Page 11: Big data   business case

Use Cases – Automotive
• Vehicle insurance
• Personalized travel & shopping guidance systems
• Supply chain/logistics
• Auto repairs
• Vehicle engineering
• Vehicle warranty
• Customer sentiment
• Customer care call centers

Page 12: Big data   business case

Use Cases – Automotive
• Self-driving cars (perspective): sensor data is generated at 1 GB per second, and a person drives about 600 hours per year on average, i.e. 2,160,000 (600 × 60 × 60) seconds, which works out to roughly 2 petabytes of data per car per year. With the total number of cars in the world set to surpass 1 billion, you can do the math (a quick check of the arithmetic is sketched below).

• Smart parking using mesh networks – visuals on the next slide.
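The figures quoted above can be checked with a quick back-of-the-envelope script. This is only an illustrative sketch using the numbers from the slide; it is not from the original deck.

```python
# Back-of-the-envelope check of the sensor-data figures quoted above.
GB_PER_SECOND = 1                    # sensor data rate per car (from the slide)
HOURS_DRIVEN_PER_YEAR = 600          # average driving time per car per year
CARS_WORLDWIDE = 1_000_000_000       # ~1 billion cars

seconds_per_year = HOURS_DRIVEN_PER_YEAR * 60 * 60        # 2,160,000 s
gb_per_car_per_year = GB_PER_SECOND * seconds_per_year    # 2,160,000 GB
pb_per_car_per_year = gb_per_car_per_year / 1_000_000     # ~2.16 PB

print(f"Per car:     {pb_per_car_per_year:,.2f} PB/year")
print(f"World fleet: {pb_per_car_per_year * CARS_WORLDWIDE:,.0f} PB/year")
```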

Page 13: Big data   business case

Smart parking – Mesh Networks

Page 14: Big data   business case

Datacenter- Where Data Resides

• “eBay Keynote – 90M Users, 245M items on sale, 93B active database calls…that’s a busy DATACENTER” - Gartner DC

Page 15: Big data   business case

Google Datacenter – Over the years

Page 16: Big data   business case

Welcome to the Era of CLOUD

"More firms will adopt Amazon EC2, EMR, or Google App Engine platforms for data analytics. Put in a credit card, buy an hour's or a month's worth of compute and storage. Charge for what you use. No sign-up period or fee. Ability to fire up complex analytic systems. Can be a small or large player." – Ravi Kalakota's forecast

Page 17: Big data   business case

Big Data on the cloud

Image source: practicalanalytics.files.wordpress.com

Page 18: Big data   business case
Page 19: Big data   business case
Page 20: Big data   business case

What Intelligent Means

• “A pair of eyes attached to a human brain can quickly make sense of the content presented on a web page and decide whether it has the answer it’s looking for or not in ways that a computer can't. Until now.” ― David Amerland, Google Semantic Search

Page 21: Big data   business case

Intelligent Web

Page 22: Big data   business case

Humor Corner

Image source: thebigdatainsightsgroup.com

Page 23: Big data   business case

Big Data Technologies & Complexity

1. Hadoop framework – HDFS and MapReduce
2. Hadoop ecosystem
3. NoSQL databases

In big data, choosing algorithms with the least complexity in terms of processing time is the most important consideration. We usually use Big O notation for assessing such complexity. Big O notation describes the rate at which the performance of a system degrades as a function of the amount of data it is asked to handle.

For example, for a sorting operation we should prefer merge sort, with time complexity O(N log N), over insertion sort, which is O(N^2) (a quick empirical comparison is sketched below).

How do we find the Big O of a given polynomial? The steps are:
• Drop constants
• Drop coefficients
• Keep only the highest-order term; its exponent is the complexity.
For example, 3n^3 degrades cubically.
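As a rough illustration of the Big O argument above, the sketch below (not part of the original deck) times a plain-Python insertion sort against a merge sort on growing inputs; exact timings will vary by machine, but the growth rates should differ visibly.

```python
import random
import time

def insertion_sort(a):
    """O(n^2): each element is shifted left until it is in place."""
    a = list(a)
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

def merge_sort(a):
    """O(n log n): split in half, sort each half, merge."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

for n in (1_000, 2_000, 4_000):
    data = [random.random() for _ in range(n)]
    t0 = time.perf_counter(); insertion_sort(data); t1 = time.perf_counter()
    merge_sort(data);                                t2 = time.perf_counter()
    # Doubling n roughly quadruples insertion sort's time but only ~doubles merge sort's.
    print(f"n={n}: insertion {t1 - t0:.3f}s, merge {t2 - t1:.3f}s")
```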

Page 24: Big data   business case

Big Data Challenge

Challenge: CAP Theorem

CAP stands for Consistency, Availability, and Partition Tolerance. These are three important properties in the big data space; however, the theorem states that we can get only two of the three. So we are often forced to relax consistency, because availability and partition tolerance are critical in the big data world.

Availability: if you can talk to a node in the cluster, it can read and write data.
Partition tolerance: the cluster can survive communication breakages.

Page 25: Big data   business case

Cluster-Based Approach
Single large CPU (supercomputer) – failure is a possibility, more cost, vertical scaling.
Multiple nodes (commodity hardware) – fault tolerance through replication, less cost, horizontal scaling (sharding).

There are two variants of the cluster-based approach: parallel computing and distributed computing. Parallel computing has multiple CPUs with a shared memory, whereas distributed computing has multiple CPUs with one memory per node.

When choosing algorithms for concurrent processing, we consider the following factors:

• Granularity: the number of tasks into which the job is decomposed. It comes in two types: fine grained (a large number of small tasks) and coarse grained (a small number of large tasks).

• Degree of concurrency: the higher the average degree of concurrency, the better, because the cluster is utilized properly.

• Critical path length: the longest directed path in the task dependency graph (a small sketch of computing it follows below).
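As a small illustration of the critical-path idea in the last bullet, the sketch below computes the longest dependency chain of a hypothetical task graph; the task names are invented for the example.

```python
from functools import lru_cache

# A hypothetical task dependency graph: task -> tasks that must finish first.
dependencies = {
    "load": [],
    "clean": ["load"],
    "aggregate": ["clean"],
    "train": ["aggregate"],
    "report": ["clean"],
}

@lru_cache(maxsize=None)
def critical_path(task):
    """Longest chain of dependencies ending at `task`, counted in tasks."""
    preds = dependencies[task]
    return 1 + (max(critical_path(p) for p in preds) if preds else 0)

longest = max(critical_path(t) for t in dependencies)
print(f"Critical path length: {longest} tasks")  # load -> clean -> aggregate -> train = 4
```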

Page 26: Big data   business case

RDBMS vs NoSQL
An RDBMS suffers from impedance-mismatch problems and is typically used as an integration database. It is not designed to run efficiently on clusters, and it is normalized with a well-defined schema. This makes it less flexible in adapting to newer requirements of processing large volumes of data in less time.

The natural movement has been a shift from integration databases to application-oriented databases integrated through services.

NoSQL emerged with polyglot persistence, a schema-less design, and good suitability for clusters.

Now the database stack looks like this:

RDBMS, key-value, document, column-family stores, and graph databases. We choose among these based on our requirements.

RDBMS – data stored in tuples (a limited data structure).
NoSQL – aggregates (complex data structures); a small sketch of the two shapes follows below.
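To make the tuples-versus-aggregates contrast concrete, here is a minimal sketch of the same order shaped both ways; the field names and values are hypothetical.

```python
# The same order, shaped two ways.

# RDBMS style: normalized tuples spread across tables, joined by keys.
orders     = [(101, "2014-10-01", "C42")]                   # (order_id, date, customer_id)
line_items = [(101, "P7", 2, 19.99), (101, "P9", 1, 5.50)]  # (order_id, product_id, qty, price)

# NoSQL style: one self-contained aggregate, stored and retrieved as a unit.
order_aggregate = {
    "order_id": 101,
    "date": "2014-10-01",
    "customer_id": "C42",
    "line_items": [
        {"product_id": "P7", "quantity": 2, "price": 19.99},
        {"product_id": "P9", "quantity": 1, "price": 5.50},
    ],
}
```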

Page 27: Big data   business case

NoSQL Family

Key-value databases – Voldemort, Redis, Riak, to name a few.
The aggregate is opaque to the database: just a big blob of mostly meaningless bits. We access the aggregate by a lookup on some key.

Document databases – CouchDB and MongoDB.
The aggregate has some structure.

Column-family stores – HBase, Cassandra, Amazon SimpleDB.
A two-level aggregate structure of rows and columns, with columns organized into column families. The row key identifies the row; a column key and column value together form a column, and related columns are grouped into a column family. Examples of column families are a customer's profile or the orders placed by a customer. Each cell also carries a timestamp (a minimal sketch of this structure follows below).

Graph databases – Neo4j, FlockDB, InfiniteGraph.
Suitable for modelling complex relationships; not aggregate-oriented.
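A minimal sketch of the two-level column-family structure described above, modeled with plain Python dictionaries rather than a real HBase or Cassandra client; the row key, families, and columns are invented for the example.

```python
import time

# A column-family store modeled as nested dictionaries:
# row key -> column family -> column key -> (value, timestamp)
customers = {
    "cust:42": {
        "profile": {
            "name":  ("Alice", time.time()),
            "email": ("alice@example.com", time.time()),
        },
        "orders": {
            "order:101": ("2 x P7, 1 x P9", time.time()),
        },
    }
}

# A read addresses a cell by row key, column family, and column key.
value, ts = customers["cust:42"]["profile"]["name"]
print(value)
```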

Page 28: Big data   business case

Why Hadoop

Yahoo
– Before Hadoop: $1 million for 10 TB of storage
– With Hadoop: $1 million for 1 PB of storage

Other large company
– Before Hadoop: $5 million to store the data in Oracle
– With Hadoop: $240k to store the data in HDFS

Facebook
– Hadoop as unified storage

Case study: Netflix
Before Hadoop:
– Nightly processing of logs
– Imported into a database
– Analysis/BI
As data volume grew, it took more than 24 hours to process and load a day's worth of logs.
Today, an hourly Hadoop job processes the logs so the data is available more quickly for analysis/BI. Currently ingesting approx. 1 TB/day.

Page 29: Big data   business case

Hadoop Stack Diagram

Hardware – Commodity cluster
Software environment – MapReduce / HDFS
Application – Ecosystem | Custom applications

Page 30: Big data   business case

Core Components - HDFS

HDFS – Data files are split into blocks of 64 or 128 MB and distributed across multiple nodes in the cluster. Random writes are not allowed; HDFS is designed for large streaming reads and writes rather than random access. Each map task operates on one HDFS data block, and the mapper reads the data as key-value pairs. There are 5 daemons: NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker.

• DataNodes send heartbeats over TCP every 3 seconds
• Every 10th heartbeat is a block report
• The NameNode builds its metadata from block reports
• If the NameNode is down, HDFS is down

HDFS takes care of load balancing (a rough footprint calculation under these defaults is sketched below).
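As a rough illustration of the block size and replication figures mentioned above, here is a small sketch, assuming a 128 MB block size and 3x replication, of how a file translates into blocks and raw cluster storage.

```python
import math

def hdfs_footprint(file_size_gb, block_size_mb=128, replication=3):
    """Rough block count and raw storage for a file under HDFS-style replication."""
    blocks = math.ceil(file_size_gb * 1024 / block_size_mb)
    raw_storage_gb = file_size_gb * replication
    return blocks, raw_storage_gb

blocks, raw = hdfs_footprint(file_size_gb=10)
print(f"10 GB file -> {blocks} blocks of 128 MB, {raw} GB of raw cluster storage")
```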

Page 31: Big data   business case

Core Component – MapReduce
The MapReduce pattern allows computations to be parallelized over a cluster.

Example of MapReduce:

Map stage: suppose we have orders as aggregates, and the sales people want to see a product-level report with total revenue for the last week.

The order aggregate is the input to the map, which emits key-value pairs for the corresponding line items: the key is the product id and (quantity, price) are the values. A map operation works on only a single record at a time and hence can be parallelized.

Reduce stage: takes multiple map outputs with the same key and combines their values.

The number of mappers is decided by the block and data size; the number of reducers is decided by the programmer.

Widely used MapReduce computations can be stored as materialized views.
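The map and reduce stages described above can be imitated in plain Python. This is only a local simulation of the order/revenue example, not actual Hadoop code, and the order data is invented.

```python
from collections import defaultdict

# Order aggregates as they might arrive at the map stage (hypothetical data).
orders = [
    {"order_id": 101, "line_items": [("P7", 2, 19.99), ("P9", 1, 5.50)]},
    {"order_id": 102, "line_items": [("P7", 1, 19.99)]},
]

def map_order(order):
    """Map stage: emit (product_id, revenue) pairs for a single order aggregate."""
    for product_id, quantity, price in order["line_items"]:
        yield product_id, quantity * price

def reduce_revenue(mapped_pairs):
    """Reduce stage: combine all values that share the same key (product_id)."""
    totals = defaultdict(float)
    for product_id, revenue in mapped_pairs:
        totals[product_id] += revenue
    return dict(totals)

pairs = (pair for order in orders for pair in map_order(order))
print(reduce_revenue(pairs))  # {'P7': 59.97, 'P9': 5.5}
```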

Page 32: Big data   business case

MapReduce Limitations

• Computations that depend on previously computed values, e.g. the Fibonacci series.

• Algorithms that depend on shared global state, e.g. Monte Carlo simulation.

• Join algorithms for log processing: the MapReduce framework is cumbersome for joins.

Page 33: Big data   business case

Fault Tolerance and Speed

• One server may stay up 3 years (roughly 1,000 days). If you have 1,000 servers, expect to lose one per day.

• So we replicate the data. Because of the high volume and velocity it is impractical to move the data, so we bring the computation to the data and not vice versa; this also minimizes network congestion.

• Since nodes fail, we use a distributed file system: 64 MB blocks, 3x replication, storage on different racks.

• To increase processing speed – speculative execution.

Page 34: Big data   business case

Ecosystem 1 &2- Hive and PigHive:•An SQL-like interface to Hadoop•Abstraction to the Mapreduce as it is complex. This is just a view.•This just provides a tabular view and doesn’t create any tables.•Hive Query Language (HQL) is converted to a set of tokens which in turn gets converted to Map Reduce jobs internally before getting executed on the hadoop clusters.

Pig:•A dataflow language for transforming large data sets.•Pig Scripts resides on the user machine and the job executes on the cluster whereas Hive resides inside the hadoop cluster.

Easy syntax structure such as Load, Filter, Group By, For Each, Store etc.,

Page 35: Big data   business case

Ecosystem 3 - HBase

HBase is a column-family store layered on top of HDFS. It provides a column-oriented view of the data sitting in HDFS.
Can store massive amounts of data
– multiple terabytes, up to petabytes of data
High write throughput
– scales up to millions of writes per second
Copes well with sparse data
– tables can have many thousands of columns
– even if a given row has data in only a few of the columns

Use HBase if…
– you need random write, random read, or both (but not neither)
– you need to do many thousands of operations per second on multiple TB of data

Used at Twitter, Facebook, etc.

Page 36: Big data   business case

Ecosystem 4 – Mahout
A machine learning library on top of Hadoop. It is scalable and efficient, and is useful for predictive analytics: deriving meaningful insights from current and historical data, mainly large datasets.

The implementations include:

Recommendation systems – implementing recommendations using techniques such as collaborative filtering.

Classification – supervised learning using techniques such as decision trees, KNN, etc.

Clustering – a type of unsupervised learning using techniques such as k-means, along with distance-based metrics.

Frequent itemset mining – finding patterns in customer purchases and then measuring the correlation of purchases in terms of support, confidence, and lift (a toy calculation of these measures is sketched below).
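A toy illustration of support, confidence, and lift for a single made-up rule ("bread -> butter") over a handful of hypothetical transactions; the data is invented purely to show how the three measures are computed.

```python
# Toy transactions for the rule "bread -> butter".
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "milk"},
]

n = len(transactions)
count_bread        = sum("bread" in t for t in transactions)
count_butter       = sum("butter" in t for t in transactions)
count_bread_butter = sum({"bread", "butter"} <= t for t in transactions)

support    = count_bread_butter / n            # P(bread and butter together)
confidence = count_bread_butter / count_bread  # P(butter | bread)
lift       = confidence / (count_butter / n)   # confidence vs. baseline P(butter)

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.40 confidence=0.50 lift=0.83 -> lift < 1: no positive association in this toy data
```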

Page 37: Big data   business case

Other Ecosystems

Oozie
Specifies workflows when there is a complex MapReduce job.

ZooKeeper
Coordination among multiple servers, with machines sitting at multiple locations. Maintains configuration information, distributed synchronization, etc.

Flume
Streams log files from all the servers into cheap storage.

Chukwa
A scalable log collector that collects logs and dumps them to HDFS. No storage or processing is done.

Sqoop
Imports structured data into HDFS and also exports results back from HDFS to an RDBMS.

Whirr
A set of libraries for running cloud services. Today, configuration is specific to a provider; we can use Whirr if we need to port to a different service provider, for example from Amazon S3 to Rackspace.

Hama and BSP
An alternative to MapReduce, specifically for graph-processing applications. BSP is a parallel programming model.

Page 38: Big data   business case

Closing Note
Big Data is a reality now: firms are sitting on huge amounts of data waiting to be processed and mined so that meaningful, revenue-generating insights can be extracted for the business. This provides a competitive advantage to companies in serving customers through the entire life cycle (acquisition, retention, relationship enhancement, etc.), transforming prospects into customers and customers into brand ambassadors.

Innovation in many fields, such as computer science, statistics, machine learning, programming, management thinking, and psychology, has come together to address the current challenges of the Big Data industry.

To quote a few advances in the technology field:

reduction in computation cost, increase in computation power, decrease in the cost of storage, and the availability of commodity hardware.

New roles, such as data stewards and data scientists, are also emerging in the industry to address these challenges and come up with appropriate solutions.

This is an active field of research and will keep evolving over the coming years.