Hadoop and NoSQL Basics: Big Data Demystified NYS Innovation Summit, 12/17/2013 Matt LeMay, @mattlemay


Hadoop and NoSQL Basics: Big Data Demystified

NYS Innovation Summit, 12/17/2013

Matt LeMay, @mattlemay


“When I want people to think I’m smart, I just say ‘HADOOP’ really loud.”


“Big Data!”

“Data Science!”

“Hadoop! There it is.”

“Algorithms!”


... why are we thinking about this at all?

ALL the data created until the year 2003 = ALL the data created every two days


Writes > 12 terabytes of data per day.


*the 451 group


... how did we get here?


HIERARCHICAL DATABASE MODEL

RELATIONAL DATABASE MODEL

DOCUMENT DATABASE MODEL

HIERARCHICAL DATABASE MODEL

• Used in early mainframe computing
• Stores data in one-to-many “trees”
• Not very flexible

Fruit
    Apple
        Granny Smith
        Honeycrisp
        Red Delicious
    Orange
    Grape
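As a rough sketch of the tree above, the hierarchy can be written as nested Python dictionaries in which every item has exactly one parent. The names `fruit_tree` and `path_to` are illustrative assumptions, not something from the original slides.

```python
# Hypothetical illustration: the slide's fruit tree as nested dictionaries.
# Each key is a node; its value holds that node's children.
fruit_tree = {
    "Fruit": {
        "Apple": {
            "Granny Smith": {},
            "Honeycrisp": {},
            "Red Delicious": {},
        },
        "Orange": {},
        "Grape": {},
    }
}


def path_to(tree, target, trail=()):
    """Return the single path from the root to `target`, or None if absent."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == target:
            return here
        found = path_to(children, target, here)
        if found is not None:
            return found
    return None


print(path_to(fruit_tree, "Honeycrisp"))
```

Because each node hangs off exactly one parent, there is only one route to any item, which is one way to read “not very flexible.”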

RELATIONAL DATABASE MODEL

• Invented in 1970 by Edgar F. Codd at IBM
• Stores data in “tuples” which resemble rows of a table
• Still the most widely used database model

Fruit_Variety    Fruit
Granny Smith     Apple
Honeycrisp       Apple
Red Delicious    Apple
Navel            Orange

RELATIONAL DATABASE MODEL

• ... can also store hierarchical data!
• Has rigid structure or “schema.”
• Uses unique “keys” for consistency across “tables”

Fruit_ID    Fruit_Name
1           Orange
2           Apple
3           Grape

Variety_ID    Variety_Name     Fruit_ID
1             Granny Smith     2
2             Honeycrisp       2
3             Red Delicious    2
4             Navel            1
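A minimal sketch of the two tables above, using Python's built-in `sqlite3` module. The table names `fruit_types` and `fruits` are assumptions (the slides leave the tables unnamed), and SQLite stands in for whatever relational database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves key enforcement off by default

# The schema is declared up front; rows that do not fit it are refused.
conn.execute(
    "CREATE TABLE fruit_types ("
    " Fruit_ID INTEGER PRIMARY KEY,"
    " Fruit_Name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE fruits ("
    " Variety_ID INTEGER PRIMARY KEY,"
    " Variety_Name TEXT NOT NULL,"
    " Fruit_ID INTEGER NOT NULL REFERENCES fruit_types(Fruit_ID))"
)
conn.executemany(
    "INSERT INTO fruit_types VALUES (?, ?)",
    [(1, "Orange"), (2, "Apple"), (3, "Grape")],
)
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?, ?)",
    [(1, "Granny Smith", 2), (2, "Honeycrisp", 2), (3, "Red Delicious", 2), (4, "Navel", 1)],
)

# The shared Fruit_ID key lets a join rebuild the hierarchy.
rows = conn.execute(
    "SELECT t.Fruit_Name, f.Variety_Name"
    " FROM fruits f JOIN fruit_types t ON t.Fruit_ID = f.Fruit_ID"
    " ORDER BY f.Variety_ID"
).fetchall()

# A variety that points at a fruit that does not exist is rejected.
try:
    conn.execute("INSERT INTO fruits VALUES (5, 'Mystery', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
```

The rejected insert is the “rigid” part in action: the key has to match a row in the other table, or the database says no.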

DOCUMENT DATABASE MODEL

Red Delicious Apple
Honeycrisp Apple
Granny Smith Apple
Navel Orange

• Doesn’t have a single structure or “schema” that each entry must follow
• Developed in 1995 for use with Lotus Notes
• SO TRENDY

DOCUMENT DATABASE MODEL

• CAN have structured elements, but structure doesn’t need to be consistent across entries

{
    "Fruits": [
        {
            "Type": "Apple",
            "Variety": "Red Delicious"
        },
        {
            "Name": "Granny Smith Apple"
        },
        "Navel Orange"
    ]
}
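To make the “inconsistent structure” point concrete, here is a small sketch that loads the document above and copes with each entry having a different shape. The `describe` helper is hypothetical; a real document database would supply its own query tools.

```python
import json

# The slide's example document, written with straight quotes so it parses as JSON.
raw = """
{
  "Fruits": [
    {"Type": "Apple", "Variety": "Red Delicious"},
    {"Name": "Granny Smith Apple"},
    "Navel Orange"
  ]
}
"""
document = json.loads(raw)


def describe(entry):
    """Hypothetical helper: build a label whatever shape the entry has."""
    if isinstance(entry, str):
        return entry
    if "Name" in entry:
        return entry["Name"]
    return f'{entry["Variety"]} {entry["Type"]}'


labels = [describe(entry) for entry in document["Fruits"]]
print(labels)
```

Notice where the work went: the database accepted all three shapes without complaint, so the code reading them has to know about every shape.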


HIERARCHICAL DATABASE MODEL

RELATIONAL DATABASE MODEL

DOCUMENT DATABASE MODEL

RIGID

FLEXIBLE


Relational Database is to Document Database

As Excel Spreadsheet is to Word Document

... as SQL is to NoSQL*

*... mostly / sorta. Stay tuned!

SQL, or “Structured Query Language,” is a language for getting data into and out of a relational database.

“SELECT Variety_Name FROM fruits WHERE fruit_id = 2”

Variety_Name
----------------------
Granny Smith
Honeycrisp
Red Delicious
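The query above can be tried out with Python's built-in `sqlite3` module. This sketch assumes that `fruits` is the table holding the `Variety_Name` and `Fruit_ID` columns, since the slides never name their tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: "fruits" is the table with one row per variety.
conn.execute(
    "CREATE TABLE fruits ("
    " Variety_ID INTEGER PRIMARY KEY,"
    " Variety_Name TEXT,"
    " Fruit_ID INTEGER)"
)
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?, ?)",
    [(1, "Granny Smith", 2), (2, "Honeycrisp", 2), (3, "Red Delicious", 2), (4, "Navel", 1)],
)

# The query from the slide, unchanged. SQLite treats column names
# case-insensitively, so fruit_id matches the Fruit_ID column.
query = "SELECT Variety_Name FROM fruits WHERE fruit_id = 2"
variety_names = [row[0] for row in conn.execute(query)]

for name in variety_names:
    print(name)
```

The query has no ORDER BY clause, so the database is free to return the three apple varieties in any order.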


Depending on who you ask, “NoSQL” means “NOT SQL” or “NOT ONLY SQL.”

(in fact, some characterize NoSQL as a “movement,” not a particular technology or set of technologies.)

“SQL Databases” are highly standardized.

“NoSQL Databases” are highly fragmented. Some are document model databases, some use a variation of a key-value store.

Document Databases

So, what are the characteristics of NoSQL databases* that make them so trendy and exciting?

* Generally

Relational databases have strict “schemas” dictating the structure of data.

NoSQL databases are generally “schemaless,” even when they use key-value stores.

• Can start entering data before deciding on how that data will be formatted
• Less structured, less consistent
• More flexible

Relational databases can scale up (on one computer) but not easily out (across many computers).

NoSQL databases are designed to scale out across many computers.

• Lots of machines == BIG data
• More complicated to set up
• Can scale quickly if needed
• No single point of failure
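One common way to spread data across machines is to hash each key and use the result to pick a machine. The sketch below shows that general idea only; it is not a description of how any particular NoSQL database does it, and the machine names and the `shard_for` function are made up for illustration.

```python
import hashlib


def shard_for(key: str, machines: list[str]) -> str:
    """Pick a machine for `key` by hashing it (simple hash partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(machines)
    return machines[index]


# Hypothetical machine names.
machines = ["comp-1", "comp-2", "comp-3"]
keys = ["Granny Smith", "Honeycrisp", "Red Delicious", "Navel"]

# The same key always lands on the same machine, so reads know where to look.
placement = {key: shard_for(key, machines) for key in keys}
for key, machine in placement.items():
    print(f"{key} -> {machine}")
```

This naive version reshuffles most keys whenever a machine is added or removed, which hints at why “more complicated to set up” is on the list.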

Relational databases read and write information directly to a disk drive.

NoSQL databases store information in memory, and/or include robust built-in caching in memory.

• Faster
• Memory more expensive than disk
• Potential reliability issues

Relational databases follow the “ACID” model: Atomicity, Consistency, Isolation, Durability.

NoSQL databases generally do not follow the “ACID” model.

• More freedom to handle requests in a way that honors the uniqueness of “things.”
• Much greater room for (potentially serious) errors.
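To see what one letter of ACID buys you, here is a small sketch of atomicity using Python's built-in `sqlite3` module. The `stock` table and the `move_crates` function are invented for this example; the point is only that a two-step change either happens completely or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock ("
    " fruit TEXT PRIMARY KEY,"
    " crates INTEGER NOT NULL CHECK (crates >= 0))"
)
conn.executemany("INSERT INTO stock VALUES (?, ?)", [("Apple", 5), ("Orange", 0)])
conn.commit()


def move_crates(conn, source, target, amount):
    """Move crates between rows; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE stock SET crates = crates + ? WHERE fruit = ?", (amount, target)
            )
            conn.execute(
                "UPDATE stock SET crates = crates - ? WHERE fruit = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = move_crates(conn, "Apple", "Orange", 3)       # fits: Apple goes from 5 to 2
failed = move_crates(conn, "Apple", "Orange", 10)  # would push Apple below zero
balances = dict(conn.execute("SELECT fruit, crates FROM stock"))
print(ok, failed, balances)
```

In the failed call the first update had already added crates to Orange, and the rollback takes it back. Without that guarantee, the application code has to notice and repair half-finished changes itself.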

Relational databases represent data as “rows” and “columns.”

NoSQL databases often represent data in formats such as JSON, which are native to many programming languages.

• Easier, faster for programmers
• Harder for non-programmers
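A quick sketch of what “native” means in practice: in Python, a JSON document turns into ordinary dictionaries and lists and back again with the standard `json` module, with no row-and-column mapping in between. The `entry` record is made up for illustration.

```python
import json

# An ordinary Python dictionary (hypothetical record).
entry = {"Type": "Apple", "Varieties": ["Granny Smith", "Honeycrisp", "Red Delicious"]}

stored = json.dumps(entry)   # the text that would be handed to the database
loaded = json.loads(stored)  # what comes back is a plain dictionary again

first_variety = loaded["Varieties"][0]
print(stored)
print(first_variety)
```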


SO WAIT, THOUGH, how the f*** do you find anything in a NoSQL database????

HADOOP is an open source framework for doing MapReduce.

MapReduce is one way to make sense of a document database.

(That’s how GOOGLE does it.)

MapReduce has two core steps: Map and Reduce.

... both are pretty much what they sound like.

This is what it actually looks like:

MAP:

function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

“For a given document, map each word, phrase, or item to the number of times that word, phrase, or item appears.”

REDUCE:

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)

“NOW, take all of those maps from every document, and reduce them to a single list of items and counts.”
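The pseudocode translates fairly directly into plain Python. This is a single-machine sketch with no Hadoop involved; the `shuffle` function stands in for the grouping that a MapReduce framework does between the two steps, and all function names here are assumptions made for the example.

```python
from collections import defaultdict


def map_document(name, document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted counts by word, as a framework would between the steps."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_word(word, partial_counts):
    """Reduce step: add up the partial counts for one word."""
    return word, sum(partial_counts)


documents = {
    "doc1": "Red Delicious Apple",
    "doc2": "Honeycrisp Apple",
    "doc3": "Granny Smith Apple",
    "doc4": "Navel Orange",
}

pairs = [pair for name, text in documents.items() for pair in map_document(name, text)]
counts = dict(reduce_word(word, ones) for word, ones in shuffle(pairs).items())
print(counts)
```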

Red Delicious Apple
Honeycrisp Apple
Granny Smith Apple
Navel Orange

MAP

(Red, 1) (Delicious, 1) (Apple, 1)
(Honeycrisp, 1) (Apple, 1)
(Granny, 1) (Smith, 1) (Apple, 1)
(Navel, 1) (Orange, 1)

REDUCE

(Red, 1) (Delicious, 1) (Apple, 3) (Honeycrisp, 1) (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1)


The hard work is distributed

The easy work is centralized

... but what if we’ve got our documents stored on multiple machines?

COMP 1

Red Delicious Apple
Honeycrisp Apple

MAP: (Red, 1) (Delicious, 1) (Apple, 1) (Honeycrisp, 1) (Apple, 1)
REDUCE: (Red, 1) (Delicious, 1) (Apple, 2) (Honeycrisp, 1)

COMP 2

Navel Orange
Granny Smith Apple

MAP: (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1) (Apple, 1)
REDUCE: (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1) (Apple, 1)

REDUCE (results from both machines combined)

(Red, 1) (Delicious, 1) (Apple, 3) (Honeycrisp, 1) (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1)
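The two-machine walkthrough can be sketched in plain Python, with a dictionary standing in for the machines. Nothing here is actually distributed; the `machines` mapping and the `local_count` function are assumptions for illustration. Each "machine" counts only its own documents, and the small partial results are then combined in one place.

```python
from collections import Counter
from functools import reduce

# Which documents live on which machine, as in the walkthrough above.
machines = {
    "COMP 1": ["Red Delicious Apple", "Honeycrisp Apple"],
    "COMP 2": ["Navel Orange", "Granny Smith Apple"],
}


def local_count(documents):
    """Map and reduce on one machine: count words in its own documents only."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    return counts


# The hard work: every machine processes its own documents.
partials = {name: local_count(docs) for name, docs in machines.items()}

# The easy work: add the small partial results together.
total = reduce(lambda left, right: left + right, partials.values(), Counter())
print(partials)
print(total)
```

Only the per-machine tallies travel between machines, never the documents themselves.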


Is this the easiest way to count apples?


NOT


* relational database

Tweet Text: “I am so happy!”
Tweet Location: “Albuquerque, NM”
User Home: “New York, NY”

Tweet Text: “#FML #FML #FML”
Tweet Location: “Palo Alto, CA”
User Home: “San Francisco, CA”

MAP (WITH MATH + SENTIMENT)

(Distance in Miles, Sentiment Score)
(1808, +.9)
(33, -.6)

REDUCE

(1808, +.9) (33, -.6)

RINSE AND REPEAT LIKE A MILLION TIMES
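Here is one way the tweet example could be sketched in Python. Everything specific in it is an assumption: the coordinates are approximate values typed in by hand, the sentiment lexicon is a toy invented so the example has something to score, and straight-line distance is only one possible definition of "distance", so the mileage may not match the figures above.

```python
import math

# Assumed approximate (latitude, longitude) pairs; a real job would use a geocoder.
COORDINATES = {
    "Albuquerque, NM": (35.08, -106.65),
    "New York, NY": (40.71, -74.01),
    "Palo Alto, CA": (37.44, -122.14),
    "San Francisco, CA": (37.77, -122.42),
}
# Toy sentiment lexicon, invented for this sketch.
LEXICON = {"happy": 0.9, "#fml": -0.2}
EARTH_RADIUS_MILES = 3958.8


def miles_between(place_a, place_b):
    """Great-circle (haversine) distance in miles between two known places."""
    lat1, lon1 = map(math.radians, COORDINATES[place_a])
    lat2, lon2 = map(math.radians, COORDINATES[place_b])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def sentiment(text):
    """Add up the lexicon scores of the words in the text."""
    words = (word.strip("!?.,").lower() for word in text.split())
    return round(sum(LEXICON.get(word, 0.0) for word in words), 1)


def map_tweet(tweet):
    """Map step: one tweet becomes a (distance in miles, sentiment score) pair."""
    distance = round(miles_between(tweet["location"], tweet["home"]))
    return distance, sentiment(tweet["text"])


def reduce_pairs(pairs):
    """Reduce step: gather the pairs into one list for later analysis."""
    return list(pairs)


tweets = [
    {"text": "I am so happy!", "location": "Albuquerque, NM", "home": "New York, NY"},
    {"text": "#FML #FML #FML", "location": "Palo Alto, CA", "home": "San Francisco, CA"},
]
results = reduce_pairs(map_tweet(tweet) for tweet in tweets)
print(results)
```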


... none of this is magic.


... in fact, the “magic” part is just a precursor to doing the actual hard work.

Danah Boyd’s Six Provocations for Big Data:

1. Automating Research Changes the Definition of Knowledge
2. Claims to Objectivity and Accuracy are Misleading
3. Bigger Data are Not Always Better Data
4. Not All Data Are Equivalent
5. Just Because it is Accessible Doesn’t Make it Ethical
6. Limited Access to Big Data Creates New Digital Divides


What about THE FUTURE?


HIERARCHICAL DATABASE MODEL

RELATIONAL DATABASE MODEL

DOCUMENT DATABASE MODEL

RIGID

FLEXIBLE

?


Further Reading:

Martin Fowler on NoSQL: http://martinfowler.com/nosql.html
Helpful Stack Overflow thread: http://stackoverflow.com/questions/11844603/technology-decision-sql-vs-nosql-vs-newsql
Finding Friends with MapReduce: http://stevekrenzel.com/finding-friends-with-mapreduce
Choosing a Database That’s Right for Your Business: http://slashdot.org/topic/bi/choosing-a-database-right-for-business-2/
Demystifying the Role of Big Data in Marketing: http://www.guardian.co.uk/media-network/media-network-blog/2013/mar/12/big-data-marketing-demystified
The NoSQL Movement: http://strata.oreilly.com/2012/02/nosql-non-relational-database.html
Big Data Tools Cost Too Much, Do Too Little: http://www.theregister.co.uk/2013/02/28/hadoop_no_sql_dont_believe_the_hype/
Is Big Data an Economic Big Dud?: http://www.nytimes.com/2013/08/18/sunday-review/is-big-data-an-economic-big-dud.html?hp&_r=1&
Six Provocations for Big Data: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431