
Post on 23-Dec-2015


Big Data and NoSQL
BUS 782

What is Big Data?

• https://www.youtube.com/watch?v=c4BwefH5Ve8
• Employee-generated data
• User-generated data
• Machine-generated data

• Big Data Analytics: 11 Case Histories and Success Stories
• https://www.youtube.com/watch?annotation_id=annotation_3535169775&feature=iv&src_vid=c4BwefH5Ve8&v=t4wtzIuoY0w

Big Data

• Data Size:
– Gigabyte
– Terabyte: Terabyte USB drives
– Petabyte: Wal-Mart handles more than 1 million customer transactions every hour, feeding databases of more than 2.5 petabytes
– Exabyte: the traffic flowing over the internet amounts to about 700 exabytes annually
– Zettabyte

Big Data: Some Facts

• World's information is doubling every two years
• The world generated 1.8 ZB of information in 2011
• Cisco predicts that by 2016 global IP traffic will reach 1.3 zettabytes
• There will be 19 billion networked devices by 2016
• 70% of this data is being generated by individuals, as opposed to enterprises and organizations

Big Data Sources

• Web sites
• Social media
• Machine-generated data
• RFID
• Image, video, and audio
• Etc.

Big Data Challenges

• Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.
• The "3Vs":
– Volume: size >= 30-50 TBs
– Velocity: processing speed
– Variety:
• Structured: able to fit in a database table
• Unstructured data

Do Companies care about Data?

• Not really. What they care about are Key Performance Indicators (KPIs).

• Some examples of KPIs are:
– Revenue
– Profit
– Revenue per customer/employee
– Customer attrition: the loss of clients or customers

• Big Data is only useful if it helps drive KPIs
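As a toy illustration of tying data to a KPI, a customer-attrition rate can be computed like this in Python (the customer names and period boundaries are invented for this sketch):

```python
# Hypothetical example: computing a customer-attrition KPI.
# The customer sets are illustrative, not real data.

def attrition_rate(customers_start, customers_end):
    """Fraction of starting customers lost during the period."""
    lost = customers_start - customers_end          # set difference
    return len(lost) / len(customers_start)

start = {"acme", "globex", "initech", "umbrella"}
end = {"acme", "initech", "stark"}                  # globex and umbrella left

print(attrition_rate(start, end))                   # 2 of 4 lost -> 0.5
```

A real pipeline would derive these sets from transaction logs; the KPI itself is the same simple ratio.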

Big Data to KPIs

Applications

• Text mining: deriving high-quality information from text
– text categorization, text clustering, concept/entity extraction, sentiment analysis, etc.
• Web mining:
– Web usage mining
– Web content mining
• Social media mining
– Salesforce Radian6 Social Marketing Cloud
• http://www.youtube.com/watch?v=EH1dcFh_-I4
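To make the sentiment-analysis task above concrete, here is a naive lexicon-based sketch in Python (the word lists are illustrative stand-ins, not a real sentiment lexicon; production systems use much richer models):

```python
# Minimal lexicon-based sentiment analysis: count positive vs. negative
# words and compare. The lexicons below are made up for illustration.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product"))   # positive
print(sentiment("terrible service"))            # negative
```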

Advantages of Relational Databases

• Well-defined database schema
• Flexible query language
• Maintain database consistency in business transactions:
– Concurrent database processing with multiple users
• Reading/updating
• Locking

Transaction ACID Properties

• Atomic
– Transaction cannot be subdivided
– All or nothing
• Consistent
– Constraints don't change from before the transaction to after the transaction
– A transaction transforms a database from one consistent state to another consistent state.
• Isolated
– Transactions execute independently of one another.
– Database changes are not revealed to users until after the transaction has completed.
• Durable
– Database changes are permanent and must not be lost.
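Atomicity can be demonstrated with Python's built-in sqlite3 module: the connection's context manager commits a transaction on success and rolls it back on an exception (the account data below is invented):

```python
# Sketch of atomicity: a money transfer either fully commits or is
# rolled back. Built entirely on Python's standard sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # the 'with' block is one atomic transaction
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")
        # the matching credit to bob would go here, but is never reached
except RuntimeError:
    pass  # the partial transaction was rolled back automatically

# Alice's debit was undone, so the database stayed consistent.
print(conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0])  # 100
```

Without the transaction, the simulated crash would leave 50 missing from the system, violating consistency.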

Problems with relational databases in managing Big Data

• High overhead in maintaining database consistency
• Do not support unstructured data search very well (i.e., Google-type searching)
• Do not handle data in unexpected formats well
• Don't scale well to very large databases:
– Expensive "scale up": adding processors, storage
– Slow query response time
– Data must move to the server
– Server failure

• Organizations such as Facebook, Yahoo, Google, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data that they were dealing with.

What is needed in a new approach

• Deal with data sizes never imagined before.
• Hardware failure should be expected.
• Data has gravity; compute has to move to the data.

What is Hadoop?

• Open source project by the Apache Foundation
• Based on papers published by Google:
– Google File System (Oct 2003)
– MapReduce (Dec 2004)
• Consists of two core components:
– Hadoop Distributed File System (storage)
– MapReduce (compute)

How Hadoop fits in the new approach

• Runs on clusters of low-cost commodity servers, so it can accommodate petabytes of data cost-effectively.
• Embraces partial failures
• Data locality (computation runs on the local node where the data resides)
• Scales horizontally
– Scale out
• A Hadoop file is:
– Distributed: a file is stored across many servers
– Replicated: a file is kept in many copies

Hadoop HDFS: Hadoop Distributed File System

• Based on GFS
• Designed to store very large amounts of data (TBs and PBs) and much larger file sizes
• Write-once, read-many-times access pattern
• Designed to run on clusters of commodity hardware; uses replication for reliability
• Allows data to be read and processed locally
• Supports limited operations on files: write, delete, append, and read, but no updates

MapReduce: a programming model for distributed processing of data

• Rather than take the conventional step of moving data over a network to be processed by software, MapReduce moves the processing software to the data.

• Each node does both store and compute, and does best to process local data.

• MapReduce has two main phases:– Map– Reduce

Example: Word Count
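The classic word-count job can be sketched in plain Python, simulating the map, shuffle, and reduce phases in a single process (the input lines are made up; a real Hadoop job would distribute each phase across the cluster):

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the
# shuffle groups pairs by word, and reduce sums each group's counts.
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Sum all the counts collected for one word."""
    return (word, sum(counts))

lines = ["big data big ideas", "data moves to compute no data moves to data"]

# Shuffle: group mapped pairs by word (Hadoop's framework does this step).
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result["data"])   # 'data' appears 4 times across both lines
```

Because each map call touches only one line and each reduce call only one word's counts, both phases parallelize naturally across nodes.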

Hadoop Ecosystem

• HBase: a column-oriented data store
• Hive: provides a SQL-like query capability
• Pig: a high-level language for creating MapReduce jobs
• HCatalog: takes Hive's metadata and makes it available across the Hadoop ecosystem
• Mahout: a library of algorithms for clustering, classification, and filtering
• Sqoop: accelerates bulk loads of data between Hadoop and RDBMSs
• Flume: streams large volumes of log data from multiple sources into Hadoop

NoSQL Database

• NoSQL ("Not Only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model.

• They are useful when working with a huge quantity of data when the data's nature does not require a relational model.

Types of NoSQL Databases

• Column-oriented database
– Example: Cassandra
• Document-oriented database
– Examples: MongoDB, CouchDB
– Data stored in JSON (JavaScript Object Notation) format

JSON, JavaScript Object Notation
http://www.w3schools.com/json/default.asp

JSON Example

{"employees":[ {"firstName":"John", "lastName":"Doe"}, {"firstName":"Anna", "lastName":"Smith"}, {"firstName":"Peter", "lastName":"Jones"} ]}

Cassandra is essentially a key-value store. This means that all data is stored in one "table", each row of which is uniquely identified by a key. A JSON representation:

https://blog.safaribooksonline.com/2012/12/11/modeling-data-in-cassandra/

{
  "user1": {
    "Bio": {"name": "Shaneeb Kamran", "age": 23}
  },
  "user2": {
    "Bio": {"name": "Salman ul Haq", "profession": "Developer"},
    "Education": {"bachelors": "NUST"}
  }
}

Column Data Model
http://www.sinbadsoft.com/blog/cassandra-data-model-cheat-sheet/
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/

• A column is a key-value pair consisting of three elements:
– 1. Unique name: used to reference the column
– 2. Value: the content of the column
– 3. Timestamp: used to determine the valid (most recent) content

• Column Family: A container for columns sorted by their names. Column Families are referenced and sorted by row keys.

• Super Column: A sorted associative array of columns– Example: Multi-value attribute

• Super column family: A container for super columns sorted by their names. Super Column Families are referenced and sorted by row keys.

• Keyspace: Top level element. Container for column families.
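The (name, value, timestamp) structure above can be mimicked with a toy Python sketch in which, for conflicting writes to the same column, the newest timestamp wins. This models Cassandra's conflict resolution rather than using Cassandra itself; the user data is invented:

```python
# Toy model of a Cassandra column family: row key -> {column name:
# (value, timestamp)}, where the latest timestamp wins on conflict.

def put(column_family, row_key, name, value, timestamp):
    """Write a column, keeping only the newest-timestamped value."""
    row = column_family.setdefault(row_key, {})
    existing = row.get(name)
    if existing is None or timestamp > existing[1]:
        row[name] = (value, timestamp)

users = {}  # the "column family"
put(users, "user1", "name", "Shaneeb Kamran", timestamp=1)
put(users, "user1", "age", "23", timestamp=1)
put(users, "user1", "age", "24", timestamp=5)   # later write supersedes
put(users, "user1", "age", "22", timestamp=3)   # stale write is ignored

print(users["user1"]["age"][0])   # '24'
```

Timestamp-based "last write wins" is what lets replicas on different servers converge without coordinating on every write.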

Column Family

Super column family

Migrate a Relational Database Structure into a NoSQL Cassandra Structure http://www.divconq.com/2010/migrate-a-relational-database-structure-into-a-nosql-cassandra-structure-part-i/

{ "biologicalfeatures": { "forests" : { "forest003" : { "name" : "Black Forest", "trees" : "two million", "bushes" : "three million“ }, "forest045" : { "name" : "100 Acre Woods", "trees" : "four thousand", "bushes" : "five thousand“ }, "forest127" : { "name" : "Lonely Grove", "trees" : "none", "bushes" : "one hundred“ } }, "famoustrees" : { "tree12345" : { "forestID" : "forest003", "name" : "Der Tree", "species" : "Red Oak“ }, "tree12399" : { "forestID" : "forest045", "name" : "Happy Hunny Tree", "species" : "Willow“ }, "tree32345" : { "forestID" : "forest003", "name" : "Das Ubertree", "species" : "Blue Spruce“ } } }}

Document database: MongoDB
http://docs.mongodb.org/manual/core/data-modeling-introduction/

• MongoDB stores business subjects in documents.
• A document is the basic unit of data in MongoDB. Documents are analogous to JSON objects but exist in the database in a more type-rich format known as BSON (Binary JSON), a binary-encoded serialization of JSON-like documents.
• The structure of MongoDB documents determines how the application represents relationships between data:
– references and embedded documents

Example using reference
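In the referenced (normalized) style, one document stores only the other document's id, and the application performs a second lookup to follow it. A hedged sketch using plain Python dicts standing in for two MongoDB collections (the collection contents are invented):

```python
# Referenced style: the post stores the author's _id, not the author.
users = {"u1": {"_id": "u1", "name": "Rusty"}}
posts = {"p1": {"_id": "p1", "author_id": "u1", "subject": "I like Plankton"}}

post = posts["p1"]
author = users[post["author_id"]]   # second lookup resolves the reference
print(author["name"])               # Rusty
```

References avoid duplicating author data across many posts, at the cost of an extra query per read.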

Embedded Data Models
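In the embedded style, related data lives inside one document, so a single read returns everything. A sketch with a plain Python dict standing in for a MongoDB document (the blog data is invented):

```python
# Embedded style: the author and tags are nested inside the post itself.
post = {
    "_id": "p1",
    "subject": "I like Plankton",
    "author": {"name": "Rusty"},        # embedded sub-document
    "tags": ["plankton", "baseball"],   # embedded array
}
print(post["author"]["name"])   # Rusty, with no second query needed
```

Embedding trades some duplication for one-round-trip reads, which suits data that is always fetched together.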

CouchDB

• A CouchDB document is a JSON object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document would be a blog post:

{

"Subject": "I like Plankton",

"Author": "Rusty",

"PostedDate": "5/23/2006",

"Tags": ["plankton", "baseball", "decisions"],

"Body": "I decided today that I don't like baseball. I like plankton."

}

Problems with NoSQL Databases

• Do not support transaction consistency the way relational database systems do.

• There is no standard query language for NoSQL databases

NewSQL Databases
http://en.wikipedia.org/wiki/NewSQL

• NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.

Approaches of NewSQL Systems

• 1. Distributed cluster of shared-nothing nodes:
– Each node owns a subset of the data. These databases include components such as distributed concurrency control and distributed query processing.
• 2. Transparent sharding:
– These systems provide a sharding middleware layer to automatically split databases across multiple nodes.
• 3. Highly optimized SQL engines
• 4. In-memory databases
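The transparent-sharding idea can be sketched as a tiny hash-based routing layer in Python: a hash of the key decides which node holds the row, so callers never see the split. Node names and data are invented, and real middleware also handles rebalancing, replication, and cross-shard queries:

```python
# Minimal hash-based sharding: route each key to one of several nodes.
import hashlib

NODES = ["node-a", "node-b", "node-c"]
shards = {node: {} for node in NODES}  # each node's local key-value store

def node_for(key):
    """Hash the key to pick a node deterministically."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def put(key, value):
    shards[node_for(key)][key] = value

def get(key):
    return shards[node_for(key)].get(key)

put("customer:42", {"name": "Acme"})
print(get("customer:42"))
```

Because the same hash always routes a key to the same node, reads and writes stay consistent without the client knowing how many nodes exist.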

In-Memory Database

• An in-memory database is a database management system that primarily relies on main memory for data storage, in contrast to systems that employ a disk storage mechanism.
• Main-memory databases are faster than disk-optimized databases.
• Good for Big Data analytics.
• Can use non-volatile main memory modules that retain data even when electrical power is removed.
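Python's built-in sqlite3 gives a small taste of the idea through its ":memory:" mode, where the whole database lives in RAM and queries involve no disk I/O (the sales figures below are invented):

```python
# An in-memory SQL database using only the Python standard library.
import sqlite3

db = sqlite3.connect(":memory:")   # database exists only in RAM
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("east", 100.0), ("west", 250.0), ("east", 50.0)])

total_east = db.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'east'").fetchone()[0]
print(total_east)   # 150.0
```

Unlike the non-volatile memory modules mentioned above, this toy database vanishes when the process exits; it only illustrates the speed-versus-durability trade-off.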

SAP HANA (High-Speed Analytical Appliance)

• SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP. HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform.
• SAP claims that HANA's in-memory access can be up to 10,000 times faster than reading from standard disks, allowing companies to analyze data in a matter of seconds instead of hours.
