Big Data Overview 2013-2014


Upload: kms-technology

Post on 27-Jan-2015


DESCRIPTION

At the Technology Trends seminar, held with lecturers from the HCMC University of Polytechnics, KMS Technology's CTO gave a talk on Big Data, Cloud Computing, Mobile, Social Media and In-memory Computing.

TRANSCRIPT

Page 1: Big Data Overview 2013-2014


BIG DATA

Kaushal Amin, Chief Technology Officer, KMS Technology – Atlanta, GA, USA


AGENDA

• What is Big Data
• Why not RDBMS
• NoSQL
• NewSQL
• Performance Comparison
• Case Studies


WHAT IS BIG DATA


WHAT IS BIG DATA?

“Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” - Teradata Magazine article, 2011

“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2011

Volume and Variety of Data that is difficult to manage using traditional data management technology


WHAT IS GENERATING BIG DATA?

Homeland Security

Real Time Search

Social

eCommerce

User Tracking & Engagement

Financial Services


HOW MUCH DATA?

• 7 billion people
• Google processes 100 PB/day; 3 million servers
• Facebook has 300 PB + 500 TB/day; 35% of world’s photos
• YouTube 1000 PB video storage; 4 billion views/day
• Twitter processes 124 billion tweets/year
• SMS messages – 6.1T per year
• US Cell Calls – 2.2T minutes per year
• US Credit cards – 1.4B Cards; 20B transactions/year
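To put the yearly and daily totals above on a common footing, the following sketch converts some of them to average per-second rates. It assumes the load is spread evenly over a 365-day year and that 1 TB = 1000 GB; the variable names are illustrative, and real traffic is of course burstier than an average suggests.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def per_second(total, period_seconds):
    """Average rate per second for a total spread evenly over a period."""
    return total / period_seconds


# Figures taken from the slide; the even-spread assumption is ours.
tweets_per_s = per_second(124e9, SECONDS_PER_YEAR)
sms_per_s = per_second(6.1e12, SECONDS_PER_YEAR)
card_tx_per_s = per_second(20e9, SECONDS_PER_YEAR)
youtube_views_per_s = per_second(4e9, SECONDS_PER_DAY)
# 500 TB/day of new Facebook data, expressed in GB/s (1 TB = 1000 GB assumed)
facebook_gb_per_s = per_second(500 * 1000, SECONDS_PER_DAY)

for label, value in [
    ("Tweets", tweets_per_s),
    ("SMS messages", sms_per_s),
    ("US card transactions", card_tx_per_s),
    ("YouTube views", youtube_views_per_s),
]:
    print(f"{label}: about {value:,.0f} per second")
print(f"Facebook ingest: about {facebook_gb_per_s:.1f} GB per second")
```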


LOWER COST OF STORAGE

What can I buy for $100 (USD)? (not adjusted for inflation)

• Memory Capacity = 128 GB by 2020 (x1420 in 20 years)
• Disk Capacity = 10 TB by 2020 (x1000 in 20 years)
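The two multipliers are easier to compare as yearly growth rates. A minimal sketch, assuming the growth is a steady compound rate over the 20 years (the slide only gives the end-to-end factors, so the smooth curve is our assumption):

```python
import math


def annual_growth_factor(total_factor, years):
    """Constant yearly multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


def doubling_time_years(total_factor, years):
    """Years needed to double capacity at that constant compound rate."""
    return years * math.log(2) / math.log(total_factor)


for name, factor in [("memory", 1420), ("disk", 1000)]:
    growth = annual_growth_factor(factor, 20)
    doubling = doubling_time_years(factor, 20)
    print(
        f"{name}: about {100 * (growth - 1):.0f}% more capacity per dollar "
        f"each year, doubling roughly every {doubling:.1f} years"
    )
```

Under this assumption both memory and disk capacity per dollar double about every two years.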


HOW IS BIG DATA DIFFERENT?

• Automatically generated by a machine
  – e.g. a sensor embedded in an engine
• Typically an entirely new source of data
  – e.g. use of the internet
• Not designed to be friendly
  – e.g. text streams
• May not have much value
  – Need to focus on the important part


WHO UTILIZES IT?

• Companies and organizations that can leverage large-scale consumer-produced data
  – Marketing
  – Consumer Markets (retail, airlines, hotels, Amazon, Netflix)
  – Social Media (Facebook, Twitter, YouTube, LinkedIn)
  – Search Providers (Google, Yahoo, Microsoft)
  – People Data Aggregators (LexisNexis, Equifax, Acxiom)
• Other enterprises are slowly getting into it
  – Healthcare
  – Financial Institutions


WHY NOT RDBMS?


TYPE OF DATA

• Structured Data (Transactions)

• Text Data (Web Content)

• Semi-structured Data (XML)

• Unstructured Data
  – Social Network, SMS, Audio, Video

• Streaming Data
  – You can only scan the data once as it travels over the network
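The single-scan constraint on streaming data shapes the algorithms that can be used. As a sketch, the functions below compute summary statistics and draw a fixed-size random sample in one pass without storing the stream. The `sensor_readings` generator is a hypothetical stand-in for data arriving over a network; like a real stream, a Python generator can be consumed only once.

```python
import random


def running_stats(stream):
    """One pass over the stream: count, mean and maximum, nothing stored."""
    count = 0
    mean = 0.0
    maximum = None
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental mean update
        if maximum is None or x > maximum:
            maximum = x
    return count, mean, maximum


def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def sensor_readings(n):
    """Hypothetical stream: yields n synthetic readings, one at a time."""
    for i in range(n):
        yield (i * 7) % 10


count, mean, maximum = running_stats(sensor_readings(10))
sample = reservoir_sample(sensor_readings(1000), k=3)
print(count, mean, maximum, sample)
```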


WHAT TO DO WITH THESE DATA?

• Aggregation and Statistics
  – Data warehouse and OLAP
• Indexing, Searching, and Querying
  – Keyword based search
  – Pattern matching (XML/RDF)
• Knowledge discovery
  – Data Mining
  – Statistical Modeling
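Of the uses listed above, keyword-based search is the easiest to illustrate in a few lines. The sketch below builds a toy inverted index (word to document ids) and answers queries that require every word to be present. The sample documents and the whitespace tokenizer are assumptions made for the example; production search engines do far more (stemming, ranking, distribution).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents
docs = {
    1: "big data needs new tools",
    2: "relational tools for structured data",
    3: "streaming data arrives once",
}
index = build_inverted_index(docs)
print(search(index, "data tools"))
```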


RDBMS LIMITATIONS

• Very difficult to scale horizontally (more boxes); the usual way to scale is vertically, by using a bigger box
  – Physically limited by CPUs, disk storage, and memory
  – Large servers are too expensive and still can’t scale
• Requires a structure of tables with rows and columns
  – Does not deal well with unstructured data
• Relationships have to be pre-defined through a schema
  – Difficult to add newly discovered data quickly
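The schema point can be seen with Python's built-in `sqlite3` module. In this sketch a newly discovered attribute (a hypothetical `twitter_handle` column on a hypothetical `customers` table) is rejected until the schema is altered. SQLite is used only because it ships with Python; the same pattern applies to other relational databases, where altering a large table can be much more costly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ann')")


def try_insert_with_twitter_handle(connection):
    """Attempt to store an attribute that may not be in the schema yet."""
    try:
        connection.execute(
            "INSERT INTO customers (id, name, twitter_handle) "
            "VALUES (2, 'Bob', '@bob')"
        )
        return True
    except sqlite3.OperationalError:
        # Raised because the table has no such column
        return False


accepted_before = try_insert_with_twitter_handle(conn)

# The schema has to change before the new attribute can be stored.
conn.execute("ALTER TABLE customers ADD COLUMN twitter_handle TEXT")
accepted_after = try_insert_with_twitter_handle(conn)

rows = conn.execute(
    "SELECT id, name, twitter_handle FROM customers ORDER BY id"
).fetchall()
print(accepted_before, accepted_after, rows)
```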


NOSQL


NOSQL CHARACTERISTICS

• Cheap, easy to implement (open source)
  – Cluster of cheap commodity servers with cheap storage
• Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
  – Down nodes can easily be replaced while the cluster is operational
  – No single point of failure
• Easy to distribute
• Don't require a schema
• Massive scalability
• Relaxed data consistency requirement (CAP)
  – Less locking and resource contention
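Replication and fault tolerance can be sketched with an in-memory toy. Each key is written to a fixed number of nodes, and a read succeeds as long as one of those nodes is up. The `Node` and `ReplicatedStore` classes are hypothetical and do not reflect any specific product; in particular there is no re-synchronization of a node that missed writes while it was down, which is exactly where the relaxed consistency mentioned above would show up.

```python
import zlib


class Node:
    """One server in the cluster, holding its share of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True


class ReplicatedStore:
    def __init__(self, node_names, replicas=2):
        self.nodes = [Node(name) for name in node_names]
        self.replicas = replicas

    def owners(self, key):
        """Nodes responsible for a key: a hashed start position plus successors."""
        start = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [
            self.nodes[(start + i) % len(self.nodes)]
            for i in range(self.replicas)
        ]

    def put(self, key, value):
        """Write to every owner that is up; return the number of copies written."""
        written = 0
        for node in self.owners(key):
            if node.up:
                node.data[key] = value
                written += 1
        return written

    def get(self, key):
        """Read from the first owner that is up and has the key."""
        for node in self.owners(key):
            if node.up and key in node.data:
                return node.data[key]
        raise KeyError(key)


store = ReplicatedStore(["node-a", "node-b", "node-c"], replicas=2)
copies_written = store.put("user:42", "Ann")

# Simulate a failure of the first replica; the read is served by the second.
store.owners("user:42")[0].up = False
value_after_failure = store.get("user:42")
print(copies_written, value_after_failure)
```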


NOSQL – SEVERAL OPTIONS

• Currently 150 implementations and growing (http://nosql-database.org/)
• Multiple types based on storage architecture
  – Key-Value
  – Document
  – Column Family
  – Graph


KEY-VALUE STORE

• Values stored as key-value pairs in a hashmap
• Distributed across nodes based on key
• Simple operations: insert, fetch, update, and delete
• Best for storing high-volume datasets with low complexity (simple data model)
• Some of the market leaders:
  – Riak
  – Amazon Dynamo
  – Voldemort
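The four operations and the key-based distribution can be sketched as follows. The `PartitionedKeyValueStore` class is hypothetical: each "node" is just a Python dict, and the node is chosen by hashing the key modulo the node count. Systems such as Riak and Dynamo use consistent hashing rather than a plain modulo so that adding a node moves fewer keys; that refinement is left out here.

```python
import hashlib


class PartitionedKeyValueStore:
    """Toy key-value store whose keys are spread over several nodes."""

    def __init__(self, num_nodes):
        # Each node is modeled as an in-memory hashmap.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        """Pick the node from a stable hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def insert(self, key, value):
        node = self._node_for(key)
        if key in node:
            raise KeyError(f"{key!r} already exists")
        node[key] = value

    def fetch(self, key):
        return self._node_for(key)[key]

    def update(self, key, value):
        node = self._node_for(key)
        if key not in node:
            raise KeyError(key)
        node[key] = value

    def delete(self, key):
        del self._node_for(key)[key]


kv = PartitionedKeyValueStore(num_nodes=4)
for i in range(100):
    kv.insert(f"session:{i}", {"user": i})
kv.update("session:7", {"user": 7, "cart": ["book"]})
kv.delete("session:8")
print([len(node) for node in kv.nodes])  # how the keys spread over nodes
```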


COLUMN FAMILY STORE

• Stores family of columns
• Columns are stored as Key-Value pairs
• A super column is like a catalogue or a collection of other columns
• Columns within a family can be distributed across nodes
• Supports semi-structured data with high scalability
• Some of the market leaders:
  – HBase
  – Cassandra
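One way to picture the column family model is as nested maps: row key, then column family, then column name, then value. The sketch below is a single-process illustration of that shape only. `ColumnFamilyTable` and the sample families are hypothetical, and features of real systems such as HBase or Cassandra (timestamps and versions, distribution across nodes, super columns) are omitted.

```python
class ColumnFamilyTable:
    """Toy model: row key -> column family -> column name -> value."""

    def __init__(self, families):
        # Families are fixed up front; columns inside a family are not.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get_row(self, row_key, family):
        """All columns of one family for a row (empty dict if none stored)."""
        return dict(self.rows.get(row_key, {}).get(family, {}))


table = ColumnFamilyTable(["info", "stats"])
table.put("user1", "info", "name", "Ann")
table.put("user1", "stats", "logins", 3)
# Rows in the same family do not need the same columns.
table.put("user2", "info", "name", "Bob")
table.put("user2", "info", "city", "Atlanta")

print(table.get_row("user1", "info"))
print(table.get_row("user2", "info"))
```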


COLUMN FAMILY STORE (HBASE)


DOCUMENT STORE

• Supports more complex data model than Key-Value
• Collection of Documents
  – JSON, XML, other semi-structured formats
• A document is a key-value collection
• Multi-Index support
• Best for storing complex data models, but less scalable
• Some of the market leaders:
  – MongoDB
  – CouchDB
  – SimpleDB
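A document collection with more than one index can be sketched with plain dictionaries. The `DocumentCollection` class below is hypothetical and is not the API of MongoDB, CouchDB or SimpleDB. It assumes each document id is inserted once and that indexed field values are hashable (strings, numbers).

```python
class DocumentCollection:
    """Toy document store: JSON-like dicts plus per-field secondary indexes."""

    def __init__(self, indexed_fields):
        self.docs = {}
        # One index per field: field value -> set of document ids
        self.indexes = {field: {} for field in indexed_fields}

    def insert(self, doc_id, doc):
        self.docs[doc_id] = doc
        for field, index in self.indexes.items():
            if field in doc:  # documents need not share the same fields
                index.setdefault(doc[field], set()).add(doc_id)

    def find(self, field, value):
        """Use the index when the field has one, otherwise scan every document."""
        if field in self.indexes:
            ids = self.indexes[field].get(value, set())
        else:
            ids = {i for i, d in self.docs.items() if d.get(field) == value}
        return [self.docs[i] for i in sorted(ids)]


people = DocumentCollection(indexed_fields=["city", "employer"])
people.insert(1, {"name": "Ann", "city": "Atlanta", "employer": "KMS"})
people.insert(2, {"name": "Bob", "city": "Atlanta", "skills": ["SQL", "Java"]})
people.insert(3, {"name": "Chi", "city": "Ho Chi Minh City", "employer": "KMS"})

print([d["name"] for d in people.find("city", "Atlanta")])
print([d["name"] for d in people.find("employer", "KMS")])
```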


GRAPH DATABASE

• Social Graph with Relationship between Entities
• Great for Social Networks
  – Facebook friends network
  – LinkedIn connections network
• Some of the market leaders:
  – Neo4j
  – FlockDB
  – Pregel


GRAPH DATABASE - EXAMPLE


• Nodes represent entities such as people, businesses, accounts, or any other item you might want to keep track of.

• Properties are pertinent pieces of information that relate to nodes, such as name, age, DOB, or gender.

• Edges are the lines that connect nodes to nodes or nodes to properties and they represent the relationship between the two.
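The node, property and edge vocabulary above maps naturally onto a small adjacency structure. In this sketch properties are stored as a dictionary on each node (a simplification of the slide's description, where edges can also link nodes to properties), edges are undirected and labeled with a relationship name, and the sample people are invented. It shows the kind of "friends of friends" query social networks rely on.

```python
class Graph:
    """Toy property graph with labeled, undirected edges."""

    def __init__(self):
        self.nodes = {}  # node id -> properties dict
        self.edges = {}  # node id -> list of (relationship, other node id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_edge(self, a, relationship, b):
        self.edges[a].append((relationship, b))
        self.edges[b].append((relationship, a))

    def neighbors(self, node_id, relationship):
        return {other for rel, other in self.edges[node_id] if rel == relationship}

    def friends_of_friends(self, node_id):
        """People two FRIEND hops away who are not already direct friends."""
        friends = self.neighbors(node_id, "FRIEND")
        result = set()
        for friend in friends:
            result |= self.neighbors(friend, "FRIEND")
        return result - friends - {node_id}


g = Graph()
g.add_node("alice", age=34, gender="F")
g.add_node("bob", age=29, gender="M")
g.add_node("carol", age=41, gender="F")
g.add_node("dave", age=25, gender="M")
g.add_edge("alice", "FRIEND", "bob")
g.add_edge("bob", "FRIEND", "carol")
g.add_edge("carol", "FRIEND", "dave")

print(g.friends_of_friends("alice"))
```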


NEWSQL


• The argument is that the relational model itself is not what limits scalability; the limitations of its physical implementation are
• Development of new relational database products and services designed to bring the benefits of the relational model to distributed architectures
• Three approaches:
  – Optimized MySQL storage engines (ScaleDB, MemSQL, Akiban)
  – New SQL databases (Clustrix, VoltDB, NuoDB)
  – Sharding middleware to split an RDBMS across nodes (ScaleBase, ScaleArc, dbShards)
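The third approach, sharding middleware, can be sketched with several independent SQLite databases standing in for RDBMS nodes. The `ShardedUsers` class is hypothetical and is not how ScaleBase, ScaleArc or dbShards are implemented. It only shows the routing idea: the caller uses one interface, and the middleware picks the shard from the key and fans out queries that span shards.

```python
import sqlite3


class ShardedUsers:
    """Toy sharding layer: one logical `users` table over several databases."""

    def __init__(self, num_shards):
        # Each in-memory SQLite database stands in for a separate RDBMS node.
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for db in self.shards:
            db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    def _shard_for(self, user_id):
        """Route by key: a simple modulo on the integer id."""
        return self.shards[user_id % len(self.shards)]

    def insert(self, user_id, name):
        self._shard_for(user_id).execute(
            "INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name)
        )

    def get(self, user_id):
        row = self._shard_for(user_id).execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None

    def count_all(self):
        """A cross-shard query has to visit every shard and combine results."""
        return sum(
            db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
            for db in self.shards
        )


users = ShardedUsers(num_shards=3)
for user_id in range(1, 7):
    users.insert(user_id, f"user{user_id}")

print(users.get(4), users.count_all())
```

Single-key reads and writes touch one shard, which is what lets this design scale out; joins and aggregates across shards are the expensive part.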


PERFORMANCE COMPARISON


SOURCE AND APPROACH

• Independent testing done by Altoros Systems Inc.
• More details at http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html?page=1
• Using Amazon virtual machines to ensure verifiable results and research transparency (which also helped minimize errors due to hardware differences)
  – Riak, a key-value store
  – Cassandra, a column family store
  – HBase, a column family store
  – MongoDB, a document-oriented database
  – MySQL Cluster, a NewSQL
  – Sharded MySQL, a NewSQL


PERFORMANCE ON WRITE


PERFORMANCE ON READ


CASE STUDIES


EXAMPLE: HEALTHCARE

A health care consultancy has made the data coming out of medical practices the focus of its thriving business. The company collects billing and diagnostic code data from 10,000 doctors on a daily, weekly and monthly basis to create a virtual clinical integration model. The consulting company analyzes the data to help the groups understand how well they are meeting the FTC guidelines for negotiating with health plans and whether they qualify for enhanced reimbursement based on offering a more cost-effective standard of care.

It also sends the practices automated information to help them take better care of patients, for example through an automated outbound calling system for pediatric patients who are not up to date on their vaccinations.


EXAMPLE: RETAIL

Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.


EXAMPLE: UTILITY

With a smart meter, a utility company goes from collecting one data point a month per customer (using a meter reader in a truck or car) to receiving about 3,000 data points for each customer each month, because smart meters send usage information up to four times an hour.
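The figure of roughly 3,000 data points follows directly from the reporting interval, as this short calculation shows. The customer count at the end is a hypothetical number chosen only to illustrate the scale; the slide does not give one.

```python
READINGS_PER_HOUR = 4  # "up to four times an hour"
HOURS_PER_DAY = 24


def readings_per_month(days_in_month):
    """Data points per customer per month at the maximum reporting rate."""
    return READINGS_PER_HOUR * HOURS_PER_DAY * days_in_month


per_customer_30 = readings_per_month(30)
per_customer_31 = readings_per_month(31)

# Hypothetical utility with 100,000 customers (not a figure from the slide)
HYPOTHETICAL_CUSTOMERS = 100_000
total_30 = per_customer_30 * HYPOTHETICAL_CUSTOMERS

print(per_customer_30, per_customer_31, total_30)
```

Compared with one manual reading per month, that is close to a 3,000-fold increase per customer before any analysis is done.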

One small Midwestern utility is using smart meter data to structure conservation programs that analyze existing usage to forecast future use, price usage based on demand and share that information with customers who might decide to forestall doing that load of wash until they can pay for it at the nonpeak price.


GROWTH FORECAST


© 2013 KMS Technology

Q&A