Topic 1: Big Data and Warehouse-scale Computing

Cloud Computing Workshop 2013, ITU

Page 1: Topic 1: Big Data and Warehouse-scale Computing

1: Big Data and Warehouse-scale Computing

Zubair Nabi

April 17, 2013

Zubair Nabi 1: Big Data and Warehouse-scale Computing April 17, 2013 1 / 23

Outline

1 Introduction

2 Ecosystem

From the very beginning

From the dawn of civilization to the year 2003, we created 5EB of information

We now create the same amount of data every 2 days!

By 2012, we had spawned 2.7ZB of data

Following the same trend, we will have 8ZB by 2015
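As a rough check of the figures above, the following sketch converts "5EB every 2 days" into a yearly rate and derives the annual growth implied by going from 2.7ZB in 2012 to 8ZB in 2015. It assumes decimal units (1ZB = 1000EB), a 365-day year, and constant compound growth; none of these assumptions come from the slides, and the variable names are illustrative only.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumptions (not from the slides): decimal units, 365-day year,
# constant compound growth between 2012 and 2015.
EB_PER_ZB = 1000

eb_per_two_days = 5
eb_per_year = eb_per_two_days / 2 * 365      # EB created per year at that pace
zb_per_year = eb_per_year / EB_PER_ZB

zb_2012, zb_2015 = 2.7, 8.0
years = 2015 - 2012
annual_growth = (zb_2015 / zb_2012) ** (1 / years) - 1

print(f"{eb_per_year:.1f} EB/year = {zb_per_year:.2f} ZB/year")
print(f"Implied growth 2012-2015: {annual_growth:.0%} per year")
```

Under these assumptions the pace works out to a little under 1ZB per year, and the 2012 to 2015 projection corresponds to data volume growing by somewhat more than 40% annually.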

Big Data

Large datasets whose processing and storage requirements exceed all traditional paradigms and infrastructure

  - On the order of exabytes and beyond

Generated by web 2.0 applications, sensor networks, scientific applications, financial applications, etc.

Radically different tools needed to record, store, process, and visualize

Moving away from the desktop

Offloaded to the “cloud”

Example: Facebook’s “Haystack”

65 billion photos

4 images of different sizes stored for each photo

  - For a total of 260 billion images and 20PB of storage

1 billion new photos uploaded each week (increment of 60TB)

At peak traffic 1 million images served per second

An image request is like finding a needle in a haystack
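The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1PB = 10^15 bytes, 1TB = 10^12 bytes), which the slides do not specify, and the variable names are illustrative.

```python
# Sanity-check the Haystack figures quoted above.
# Assumption (not from the slides): decimal units, 1 PB = 10**15 bytes.
PB = 10**15
TB = 10**12
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

photos = 65 * 10**9
sizes_per_photo = 4
images = photos * sizes_per_photo             # 260 billion stored images

avg_image_bytes = 20 * PB / images            # average size of one stored image
weekly_uploads = 10**9
bytes_per_upload = 60 * TB / weekly_uploads   # storage added per new photo (all sizes)
uploads_per_second = weekly_uploads / SECONDS_PER_WEEK

print(f"{images:,} images, about {avg_image_bytes / 1000:.0f} KB each")
print(f"about {bytes_per_upload / 1000:.0f} KB per upload, "
      f"{uploads_per_second:,.0f} uploads/s on average")
```

Under these assumptions the average upload rate is in the low thousands of photos per second, far below the peak of 1 million images served per second, which suggests a workload dominated by reads.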

More examples

The LHC at CERN generates 22PB of data annually (after throwing away around 99% of readings)

The Square Kilometre Array (under construction) is expected to generate hundreds of PB each day

Farecast, a part of Bing, searches through 225 billion flight and price records to advise customers on their ticket purchases

The amount of annual traffic flowing over the Internet is around 700EB

Walmart handles in excess of 1 million transactions every hour (25PB in total)

400 million Tweets every day

Big data ecosystem

Presentation layer

Application layer: frameworks + storage

Operating system layer

Virtualization layer (optional)

Network layer (intra- and inter-data center)

Physical infrastructure

Can roughly be called the “cloud”

Presentation Layer

Acts as the user-facing end of the entire ecosystem

Forwards user queries to the backend (potentially the rest of the stack)

Can be both local and remote

For most web 2.0 applications, the presentation layer is a web portal

For instance, the Google search website is a presentation layer: it takes user queries, forwards them to a scatter-gather application, and presents the results to the user (within a time bound)

Made up of many technologies, such as HTTP, HTML, AJAX, etc.
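The scatter-gather pattern with a time bound, mentioned in the Google search example above, can be sketched as follows. This is a minimal single-machine illustration using threads, not a description of how any real search engine is built: the shard names, their contents, the simulated latencies, and the 0.3 second deadline are all hypothetical.

```python
import concurrent.futures as cf
import time

# Hypothetical shard contents; a real backend would query remote index servers.
SHARDS = {
    "shard-a": ["big data basics", "data center design"],
    "shard-b": ["warehouse-scale computing", "big data tools"],
    "shard-c": ["cooking recipes"],
}
# Simulated per-shard latency in seconds (shard-c is deliberately slow).
LATENCY = {"shard-a": 0.01, "shard-b": 0.02, "shard-c": 1.0}


def search_shard(name, query):
    """Return the documents on one shard that contain the query string."""
    time.sleep(LATENCY[name])
    return [doc for doc in SHARDS[name] if query in doc]


def scatter_gather(query, deadline=0.3):
    """Send the query to every shard and merge whatever returns in time."""
    pool = cf.ThreadPoolExecutor(max_workers=len(SHARDS))
    futures = [pool.submit(search_shard, name, query) for name in SHARDS]
    done, not_done = cf.wait(futures, timeout=deadline)
    results = sorted(doc for f in done for doc in f.result())
    pool.shutdown(wait=False)  # do not block on stragglers
    return results, len(not_done)


if __name__ == "__main__":
    hits, late = scatter_gather("big data")
    print(hits, "shards that missed the deadline:", late)
```

The design choice worth noting is that the caller returns partial results once the deadline passes instead of waiting for the slowest shard, trading completeness for a bounded response time.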

Application Layer

Serves as the back-end

Either computes a result for the user, or fetches a previously computed result or content from storage

The execution is predominantly distributed

The computation itself might entail cross-disciplinary (across sciences) technology

Computation

Can be a custom solution, such as a scatter-gather application

Might also be an existing data-intensive computation framework, such as MapReduce, Dryad, MPI, etc., or a stream processing system, such as Storm, S4, etc.

Analytics engines: R, Matlab, etc.
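To give a feel for the programming model behind a framework such as MapReduce, here is a minimal single-process word count. It only mimics the map, shuffle, and reduce steps; real frameworks distribute these steps across many machines and handle failures, and the function names below are illustrative, not the API of any framework named above.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters process big data"]))
```

Because each call to `map_phase` depends only on its own document, and each call to `reduce_phase` only on its own key, both steps can in principle run in parallel on different machines.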

Storage

1 Relational database management systems (RDBMS): MySQL, Oracle DB, IBM DB2, etc. (structured data)

2 NoSQL: Key-value stores, document stores, graphs, tables, etc. (semi-structured and unstructured data)

  - Document stores: MongoDB, CouchDB, etc.

  - Graphs: FlockDB, etc.

  - Key-value stores: Dynamo, Cassandra, Voldemort, etc.

  - Tables: BigTable, HBase, etc.

3 NewSQL: The best of both worlds: Spanner, VoltDB, etc.
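The difference between structured and semi-structured storage can be illustrated with a small sketch. Here Python's built-in `sqlite3` module stands in for an RDBMS and a plain dictionary stands in for a key-value store; neither is the API of any system listed above, and the table, keys, and values are made up for illustration.

```python
import json
import sqlite3

# Relational (structured): a fixed schema, queried declaratively with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, owner TEXT, bytes INTEGER)")
db.executemany("INSERT INTO photos VALUES (?, ?, ?)",
               [(1, "alice", 60000), (2, "bob", 75000), (3, "alice", 50000)])
alice_total = db.execute(
    "SELECT SUM(bytes) FROM photos WHERE owner = ?", ("alice",)
).fetchone()[0]

# Key-value (schemaless): opaque values looked up by key only.
# A plain dict stands in for a distributed store here.
kv_store = {}
kv_store["photo:1"] = json.dumps({"owner": "alice", "bytes": 60000})
kv_store["photo:2"] = json.dumps({"owner": "bob", "tags": ["beach"]})  # different fields
photo_2 = json.loads(kv_store["photo:2"])

print(alice_total, photo_2["tags"])
```

The relational side enforces one schema and answers aggregate queries directly, while the key-value side accepts records with differing fields but leaves any querying beyond key lookup to the application.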

Operating System Layer

Consists of the traditional operating system stack with the usual suspects: Windows, variants of *nix, etc.

Alternatives exist, though, specialized for the cloud or multicore systems

Virtualization Layer

Allows multiple operating systems to run on top of the same physical hardware

Enables infrastructure sharing, isolation, and optimized utilization

Different allocation strategies possible

Easier to dedicate CPU and memory but not the network

Allocation either in the form of VMs or containers

Examples: VMware, Xen, LXC, etc.
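One possible allocation strategy is first-fit placement, sketched below. The slides do not name a specific strategy, so this choice is an assumption, as are the `Host` class, the host sizes, and the guest requests. The sketch tracks only CPU and memory, in line with the point above that these are easier to dedicate than the network.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A physical machine with free CPU and memory capacity (hypothetical model)."""
    name: str
    cpus: int
    mem_gb: int
    guests: list = field(default_factory=list)

    def fits(self, cpus, mem_gb):
        return self.cpus >= cpus and self.mem_gb >= mem_gb

    def place(self, guest, cpus, mem_gb):
        # Dedicate CPU and memory to the guest by subtracting from free capacity.
        self.cpus -= cpus
        self.mem_gb -= mem_gb
        self.guests.append(guest)


def first_fit(hosts, requests):
    """Place each (guest, cpus, mem_gb) request on the first host with room."""
    placement = {}
    for guest, cpus, mem_gb in requests:
        target = next((h for h in hosts if h.fits(cpus, mem_gb)), None)
        if target is not None:
            target.place(guest, cpus, mem_gb)
        placement[guest] = target.name if target else None
    return placement


if __name__ == "__main__":
    hosts = [Host("host-1", cpus=8, mem_gb=32), Host("host-2", cpus=16, mem_gb=64)]
    requests = [("vm-a", 4, 16), ("vm-b", 6, 8), ("vm-c", 4, 16), ("vm-d", 16, 64)]
    print(first_fit(hosts, requests))
```

A request that fits on no host is mapped to `None`. Other strategies, such as best-fit or spreading guests evenly, would change only the line that selects `target`.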

Network Layer

Connects the entire ecosystem together

Consists of the entire protocol stack

Tenants assigned to Virtual LANs

Multiple protocols available across the stack

Physical Infrastructure Layer

The physical hardware itself

Servers and network elements

Mechanisms for power distribution, wiring, and cooling

Servers are connected in various topologies using different interconnects

Dubbed datacenters

“We must treat the datacenter itself as one massive warehouse-scale computer” – Luiz André Barroso and Urs Hölzle


Page 77: Topic 1: Big Data and Warehouse-scale Computing

Example: Google

All that infrastructure enables Google to:

Index 20 billion web pages a day

Handle in excess of 3 billion search queries daily

Provide email storage to 425 million Gmail users

Serve 3 billion YouTube videos a day
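
To put the daily figures above in perspective, the following sketch converts them into average per-second rates. It assumes load is spread evenly across the day, which real traffic is not, so these are averages and not peak rates; the variable and function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes as quoted on the slide.
daily_volumes = {
    "web pages indexed": 20_000_000_000,
    "search queries": 3_000_000_000,
    "YouTube videos served": 3_000_000_000,
}


def per_second(daily_count: int) -> float:
    """Average events per second, assuming a perfectly even daily load."""
    return daily_count / SECONDS_PER_DAY


rates = {name: per_second(count) for name, count in daily_volumes.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:,.0f} per second")
```

On these assumptions, 3 billion queries a day averages out to roughly 35 thousand queries every second, and 20 billion pages a day to roughly 230 thousand pages indexed every second.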


Page 81: Topic 1: Big Data and Warehouse-scale Computing

1 Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in Haystack: Facebook’s photo storage. In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI’10). USENIX Association, Berkeley, CA, USA.

2 Urs Hoelzle and Luiz Andre Barroso. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (1st ed.). Morgan and Claypool Publishers.
