map reduce (part 3) - graz university of...

Post on 28-Mar-2020

9 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Map Reduce (Part 3)Databases 2 (VU) (706.711 / 707.030)

Mark Kroll

Institute of Interactive Systems and Data Science,Graz University of Technology

Nov. 26, 2018

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 1 / 45

Outline

1 Problems Suited for MapReduce

2 MapReduce: ApplicationsMatrix-Vector MultiplicationInformation Retrieval

3 Hadoop EcosystemDesigning a Big Data SystemBig Data Storage Technologies

Slides are partially based onSlides “Mining Massive Datasets” by Jure Leskovec.

“Limitations and Challenges of HDFS and MapReduce” by Weets et al., 2015.

Tutorial on “’Data-Intensive Text Processing with MapReduce’ by Lin, 2009.

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 2 / 45

Not all problems are suited for MapReduce

problems at hand need to be formulated and implemented as MapReduce program, i.e.parallelizable according to the divide&conquer idea, for instance

I analyzing textual data (word count statistics)I matrix-vector multiplications (PageRank)I relational algebra operations (Database query primitives such as joins)

in addition, files need to be largeI due to task setup time and scheduling overhead

and should be rarely updated in placeI not suitable when managing online sales, for instance, the principal operations on Amazon

data involve responding to searches for products, recording sales, and so on; processes thatinvolve relatively li�le calculation and that change the database

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 3 / 45

When MapReduce is not a suitable choice

when your processing requires a lot of data to be shu�led over the networkI remember the idea is to bring the algorithm to the data

real-time processing - when fast responses are needed

for complex algorithmsI for example, machine learning algorithms such as SVMs

processing graphs→ use dedicated frameworks such as Giraph

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 4 / 45

When MapReduce is not a suitable choice

when lot of data needs to be sorted / shu�led, e.g. the map phase generates too many keysI shu�ling is very time-consuming o�en overloading the networkI reason is that the reducers fetch all intermediate data at once as soon as the last mapper

finishes→ led to strategies such as virtual shu�ling or predictive scheduling

when iterations are neededI for example clustering algorithms such as K-Means→ use frameworks such as Spark

handling streaming dataI MapReduce is best suited to batch process hugh amounts of data→ use frameworks such as Storm, Flink

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 5 / 45

MapReduce: Typical Applications

matrix operations such as matrix-matrix and matrix-vector multiplications which fit nicelyinto the MapReduce programming model

I for example, to calculate the PageRank

another important class of operations that can use MapReduce e�ectively arerelational-algebra operations

I many operations on data can be described easily in terms of the common database-queryprimitives such as selection, union, intersection, joins

I which can be used for social network analysis, e.g. finding paths of di�erent lengths, countingfriends, etc.

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 6 / 45

Application 1: Matrix-Vector Multiplication

What do we need to do?Suppose we have an n× n matrixM, whose element in row i and column j will be denoted mij .Suppose we also have a vector v of length n, whose jth element is vj . Then the matrix-vectorproduct is the vector x of length n, whose ith element is given by

xi =n∑

j=1

mijvj

Outline a Map-Reduce program that calculates the vector x.

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 7 / 45

Matrix-Vector Multiplication

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 8 / 45

Matrix-Vector Multiplication

let us first assume that the vector v is large, but it still can fit into the memory

the matrix M and the vector v will be each stored in a file of the DFSassume that the row-column coordinates of a matrix element (indices) can be discovered

I for example, each value is stored as a triple (i, j,mij)

similarly, the position of vj can be discovered analogously

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 9 / 45

Matrix-Vector Multiplication

So how does the Map Function look like?

a map function applies to one element of the matrixMthe vector v is first read in its entirety and is available for all Map tasks at that computenode

from each matrix element mij the map function produces the key-value pair (i,mijvj)

all terms of the sum that make up the component xi of the matrix-vector product will getthe same key i

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 10 / 45

Matrix-Vector Multiplication

What about the Reduce Function?reduce function sums all the values associated with a given key i

result is a pair (i, xi)

thereby calculating one entry in the output vector x

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 11 / 45

Matrix-Vector Multiplication: Divide&Conquer

however, it might be that the vector v does not fit into main memoryI it is not required that the vector v fits into the memory at a compute node, but if it does not

there will be a very large number of disk accesses as we move pieces of the vector into mainmemory

alternatively we can divide the matrixM into vertical stripes of equal width and divide thevector into an equal number of horizontal stripes of the same height

→ use enough stripes so that the portion of the vector in one stripe can fit into mainmemory at a compute node

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 12 / 45

Matrix-Vector Multiplication: Divide&Conquer

Figure: Divide matrixM and vector v into stripes

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 13 / 45

Matrix-Vector Multiplication: Divide&Conquer

the ith stripe of matrixM multiplies only components from the ith stripe of the vector

can divide matrixM into one file for each stripe, and do the same for the vector veach Map task is assigned a chunk from one of the stripes in the matrix and gets the entirecorresponding stripe of the vector

Map and Reduce tasks can then act exactly as before

need to sum up once more the results of the stripes multiplication

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 14 / 45

Application 2: Information Retrieval

Information Retrieval (aka Search) in a Nutshellis the activity of obtaining information relevant to an information need

I for instance a search query you submit to Google

from a collection of information resourcesI for instance databases of texts, images or sounds

Information Retrieval is so to saythe Science of Searching for Information

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 15 / 45

Information Retrieval Process

taken from Eissa Alshari Semantic Arabic Information Retrieval Framework

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 16 / 45

Search Process - Step 1: Indexing

taken from h�ps://developer.apple.com/

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 17 / 45

Search Process - Step 2: Retrieval

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 18 / 45

Are Indexing or Retrieval MapReducable?

The Indexing ProblemScalability is critical

must be relatively fast, but need not be real time

fundamentally a batch operation

→ sounds like a Map-Reduce problem

The Retrieval Problemmust have sub-second response time

for the web, only need relatively few results

→ rather not a Map-Reduce problem

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 19 / 45

Index Construction with MapReduce

original purpose behind the MapReduce programming conceptMap over all documents

I emit term as key, (docNr., termFrequency) as valueI emit other information as necessary (e.g., term position)

Sort/shu�le: group postings by term

ReduceI gather and sort the postings (e.g., by docNr. or tf)I write postings to disk

→MapReduce does all the heavy li�ing

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 20 / 45

Index Construction with MapReduce: An Overview

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 21 / 45

including information on term position:

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 22 / 45

Apache provides a nice so�ware stack

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 23 / 45

Parts of the Hadoop Eco System

Yarn (Yet Another Resource Negotiator)

is a cluster management system to run Big Data applications on a cluster data; not a dataprocessing platform itself but enables the platforms to run their code in a clusterenvironment

HBase

open source, non-relational, distributed database modeled a�er Google’s BigTable

Hive

data warehouse infrastructure built on top of Hadoop for providing data summarization,query, and analysis

Pig

high-level platform for creating programs that run on Apache Hadoop (language is calledPig Latin)

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 24 / 45

Data Processing: From Batch to Streaming . . .

Batch Processing involves operating over a large, static dataset and returning the result at a later time when thecomputation is complete.Stream Processing systems compute over data as it enters the system.The term Microbatch is frequently used to describe scenarios where batches are small and/or processed at small intervals.Even though processing may happen as o�en as once every few minutes, data is still processed a batch at a time.

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 25 / 45

Big Data Processing Frameworks (1)

Batch-only frameworks:I Apache Hadoop can be considered a processing framework with MapReduce as its default

processing engine. Hadoop was the first big data framework to gain significant traction in theopen-source community.

Stream-only frameworks:I Apache Storm focuses on extremely low latency and is perhaps the best option for workloads

that require near real-time processing. It can handle very large quantities of data with anddeliver results with less latency than other solutions. Its topologies are composed of (i)Streams, (ii) Spouts and (iii) Bolts

I Apache Samza is tightly tied to the Apache Kafka messaging system. While Kafka can beused by many stream processing systems, Samza is designed specifically to take advantage ofKafka’s unique architecture and guarantees. It uses Kafka to provide fault tolerance, bu�ering,and state storage

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 26 / 45

Big Data Processing Frameworks (2)

Hybrid frameworks:I Apache Spark is a next generation batch processing framework with stream processing

capabilities; Spark focuses primarily on speeding up batch processing workloads by o�eringfull in-memory computation and processing optimization.

I Apache Flink a stream processing framework that can also handle batch tasks. It considersbatches to simply be data streams with finite boundaries, and thus treats batch processing as asubset of stream processing. This stream-first approach to all processing has a number ofinteresting side e�ects. This stream-first approach has been called the Kappa architecture.

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 27 / 45

Java(-ish) is the Hadoop Language

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 28 / 45

The Good, the Bad and the Ugly wrt. the Ecosystem

The GoodI Easy to write parallel, highly scalable applicationsI Stream and Batch processingI Seamless integration with other systems (e.g. RDBMS)

The BadI Rapid development, hard to keep overviewI 150 projects in (or near) Hadoop Eco System

F see https://hadoopecosystemtable.github.io/

The UglyI Maven dependency hell, if integrated with other systemsI Spark depends on > 50 libraries with a specific version!

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 29 / 45

System Design: An Overview

usually boils down to picking the right components wrt. Input, Processing + OutputMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 30 / 45

System Design: An Overview

usually boils down to picking the right components wrt. Input, Processing + OutputMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 31 / 45

So let’s design a system that

1 receives 5 million sensor measurements per minuteI 100 bytes per record ∼ 8MB/sek

2 stores measurements for 3 yearsI ∼ 240 TB per year, 720 TB in total

3 creates daily reportsI for any given date

4 detects faulty sensors in real-timeI alert the maintainers via SMS

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 32 / 45

System Design: Processing & Analyzing Sensor Data

. . . receives 5 million sensor measurements per minute

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 33 / 45

System Design: Processing & Analyzing Sensor Data

. . . stores measurements for 3 years

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 34 / 45

System Design: Processing & Analyzing Sensor Data

. . . creates daily reportsMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 35 / 45

System Design: Processing & Analyzing Sensor Data

. . . detects faulty sensors in real-timeMark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 36 / 45

Big Data Storage Technologies

File-based: HDFSI distributed, permanent file storageI tuned for large filesI no indexing

Key-Value based: HBaseI distributed Key/Value storeI fast look-upsI based on HDFS

Message-based: KafkaI distributed Producer/Consumer messaging systemI data partitioned in topicsI producer groups / consumer groups

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 37 / 45

File-based: HDFS

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 38 / 45

File-based: HDFS

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 39 / 45

Key-Value based: HBase

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 40 / 45

Key-Value based: HBase

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 41 / 45

Message-based: Kafka

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 42 / 45

Message-based: Kafka

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 43 / 45

Map Reduce Lectures’ Recap:

Part 1:I Handling Big DataI Key Elements: MapReduce Framework & Distributed File System

Part 2:I Optimization: Maximizing ParallelismI Stragglers Problem & Input Data Skew

Part 3:I Suitability & ApplicationsI Hadoop Ecosystem & Storage Technologies

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 44 / 45

The EndNext Lecture: “Beyond Map-Reduce”

Mark Kroll (ISDS, TU Graz) MapReduce Nov. 26, 2018 45 / 45

top related