MapReduce introduction

Data Infrastructure Team

Jongyoul Lee

Friday, September 27, 13

MapReduce?

Data types, input/output format

Mapper, reducer

Combiner

Hadoop streaming

Next...


https://github.com/madeng/mrintro.git




MapReduce?


Structure Overview

[Diagram: one Client, one JobTracker, six TaskTrackers, and six DataNodes]


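A high-level (!!) architecture for distributed processing

You do not have to think about how the data flows

You only have to think in terms of keys and values

It cannot solve every problem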


Input

TextInputFormat
Mapper          (k1, v1) → (k2, v2)
Combiner        (k2, list(v2)) → (k2, v2')
Partitioner     (k2, v2', #reducer) → #partition
Shuffle/sort
Reducer         (k2, list(v2')) → (k3, v3)
TextOutputFormat

Output
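The flow above can be imitated in memory to see how the key/value shapes change from stage to stage. The following is only a sketch in plain Java, not Hadoop code: the class name FlowSketch and its methods are made up for illustration, word counting is an assumed example job, and the optional combiner and partitioner stages are left out.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the MapReduce data flow (hypothetical names, no Hadoop).
final class FlowSketch {
    // Mapper: (k1, v1) -> zero or more (k2, v2). k1 is the line position, v1 the line.
    static List<Map.Entry<String, Integer>> map(long k1, String v1) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : v1.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle/sort: collect every v2 under its k2, keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reducer: (k2, list(v2)) -> (k3, v3). Here k3 is k2 and v3 is the sum.
    static int reduce(String k2, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map, shuffle/sort, and reduce over the given input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1; // assumes one byte per char plus '\n'
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

Calling FlowSketch.run on the two lines "a b a" and "b c" groups the pairs by word before reducing, so each word ends up with a single total.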


org.apache.hadoop.mapred

org.apache.hadoop.mapreduce

mapreduce is the newer package

But the older package is still widely used

Cascading...


Data types


Class name        Data type
BooleanWritable   Boolean
ByteWritable      byte
DoubleWritable    Double
FloatWritable     Float
IntWritable       Integer
LongWritable      Long
Text              UTF-8 string
NullWritable      Used when no data value is needed


WritableComparable interface

void write(DataOutput out)

Serializes the object's data to the output

void readFields(DataInput in)

Deserializes the object's data from the input

int compareTo(WritableComparable w)
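A custom key type implements those three methods. The sketch below shows the shape with a hypothetical UserHour key. It is an assumption-laden stand-in: in a real job the class would declare that it implements Hadoop's WritableComparable, whereas here plain Comparable is used so the example compiles without Hadoop on the classpath. The roundTrip helper is also made up, just to show write and readFields mirroring each other.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical composite key: a user name plus an hour of the day.
final class UserHour implements Comparable<UserHour> {
    String user = "";
    int hour;

    UserHour() {} // a no-argument constructor lets a framework create the object, then fill it

    UserHour(String user, int hour) {
        this.user = user;
        this.hour = hour;
    }

    // Serialization: write the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(user);
        out.writeInt(hour);
    }

    // Deserialization: read the fields back in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        user = in.readUTF();
        hour = in.readInt();
    }

    // Sort order used when keys are sorted: by user first, then by hour.
    @Override
    public int compareTo(UserHour other) {
        int byUser = user.compareTo(other.user);
        return byUser != 0 ? byUser : Integer.compare(hour, other.hour);
    }

    // Helper for illustration only: serialize to bytes and read a copy back.
    static UserHour roundTrip(UserHour original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            UserHour copy = new UserHour();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```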


Input/output format


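A formatter that converts files into key/value pairs so that MR can read them

A formatter that helps write key/value pairs out to files

Besides files, input/output formats exist for many other kinds of sources and sinks

InputFormat does several things in its getSplits method

You can implement your own by extending InputFormat/OutputFormat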


createRecordReader(InputSplit split,...)

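Makes the input split readable

Hands a RecordReader to the mapper

InputFormat               Characteristics
TextInputFormat           Records are separated by line breaks; key: byte offset of the line within the file; value: the content of the line
SequenceFileInputFormat   Binary format; a structure that stores key/value pairs; supports compression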
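To make the TextInputFormat row concrete, here is a rough in-memory imitation of what a line-oriented record reader hands to the mapper. In Hadoop the key of each record is the byte offset at which the line starts. The class and method names are hypothetical, and the sketch assumes '\n' line breaks and one byte per character.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Imitates the (offset, line) records a mapper receives for text input.
final class LineRecordsSketch {
    static List<Map.Entry<Long, String>> records(String fileContent) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        int start = 0;
        while (start < fileContent.length()) {
            int end = fileContent.indexOf('\n', start);
            if (end < 0) {
                end = fileContent.length(); // last line without a trailing newline
            }
            // Key: where the line starts. Value: the line without its line break.
            out.add(new SimpleEntry<>(Long.valueOf(start), fileContent.substring(start, end)));
            start = end + 1;
        }
        return out;
    }
}
```

For the content "ab\ncde\nf" the keys come out as 0, 3, and 7, which is why the key is rarely useful as a line number.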


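Extend OutputFormat

Hands a RecordWriter to the reducer

OutputFormat        Characteristics
TextOutputFormat    Used to write keys and values as text; one record per line in the form "Key, Value\n" (a tab separates key and value by default)
LazyOutputFormat    Same output as TextOutputFormat; the output file is not created when there is nothing to write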


FileInputFormat.getSplits
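The core of the split computation is small enough to paraphrase. The sketch below is a simplified reading of how FileInputFormat sizes its splits: the split size is the block size clamped between a minimum and a maximum, and the final remainder is folded into the last split when it is within 10% of a full split. SplitSketch and its method shapes are hypothetical, and block locations, compressed or unsplittable files, and empty files are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified paraphrase of split sizing for a single file (hypothetical class).
final class SplitSketch {
    // The last split may be up to 10% larger than the others.
    static final double SPLIT_SLOP = 1.1;

    // Clamp the block size between the configured minimum and maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns one {offset, length} pair per split.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> out = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            out.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            out.add(new long[] {fileLength - remaining, remaining});
        }
        return out;
    }
}
```

With a split size of 128, a file of length 300 yields three splits (128, 128, 44), while a file of length 135 stays in one split because the extra 7 units are within the slop.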


Mapper


Turns text into key/value pairs

The map(key, value, ...) function is called for each record

Focus on processing one input record at a time

One input does not have to produce exactly one output (see DelayCountMapper.java)

(k1, v1) → (k2, v2)
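A mapper that sometimes emits nothing illustrates the point about inputs and outputs not being one-to-one. The sketch below is a hypothetical example in plain Java, not the DelayCountMapper.java from the repository: it assumes each input line looks like "carrier,delayMinutes" and emits (carrier, 1) only for delayed rows.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical filtering mapper: zero or one output per input record.
final class DelayMapperSketch {
    // k1 is the line position (unused here), v1 is the raw line.
    static List<Map.Entry<String, Integer>> map(long k1, String v1) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String[] cols = v1.split(",");
        if (cols.length != 2) {
            return out; // malformed row: emit nothing
        }
        int delay;
        try {
            delay = Integer.parseInt(cols[1].trim());
        } catch (NumberFormatException e) {
            return out; // header row or non-numeric delay: emit nothing
        }
        if (delay > 0) {
            out.add(new SimpleEntry<>(cols[0].trim(), 1)); // one delayed flight
        }
        return out;
    }
}
```

The function only ever looks at the one record it was given, which is what keeps it easy to run in parallel.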


Reducer


Processes the values grouped by key

The same key is always handled by the same reducer

The key point is to keep the data from piling up on a single key

(k2, list(v2')) → (k3, v3)
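Both points, same key to same reducer and the danger of a hot key, follow from how keys are assigned to reducers. Hadoop's default HashPartitioner takes the key's hash code, clears the sign bit, and takes the remainder by the number of reducers. The sketch below applies that formula to plain Java strings. PartitionSketch is a hypothetical name, and String.hashCode is an assumption here because Hadoop's Text type computes its own hash.

```java
import java.util.List;

// Hash partitioning on plain strings (hypothetical class, no Hadoop).
final class PartitionSketch {
    // Same key and same reducer count always give the same partition.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many records each reducer would receive.
    static int[] load(List<String> keys, int numReducers) {
        int[] perReducer = new int[numReducers];
        for (String key : keys) {
            perReducer[partitionFor(key, numReducers)]++;
        }
        return perReducer;
    }
}
```

If most records share one key, one entry of the load array dominates and that reducer becomes the bottleneck, no matter how many reducers are configured.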


Demo


WordCountMapOnly


WordCount


Isn't there a more refined way to check the results?


Combiner


(k2, list(v2)) → (k2, v2')

When the map output grows large...

The less data handed to the reducer, the better

Couldn't the mapper do part of the reducer's work?

One line is all it takes!

job.setCombinerClass(Reducer.class)

One thing to watch out for...

The map output types and the reduce input types: the combiner sits between the two, so its input and output key/value types must both match them


Mapper      (k1, v1) → (k2, v2)
Combiner    (k2, list(v2)) → (k2, v2')
Reducer     (k2, list(v2')) → (k3, v3)
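The effect of a combiner can be seen by counting the records that would cross the network with and without it. The sketch below is an in-memory illustration with hypothetical names, using word counting as the assumed job. The same summing function serves as combiner and reducer, which works because summing gives the same total whether or not partial sums are taken first, and because its output pairs have the same types as its input pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Word count with an optional local pre-aggregation step (hypothetical class).
final class CombinerSketch {
    // One map task: emits (word, 1) for every word in its input split.
    static List<Map.Entry<String, Integer>> mapTask(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : split.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // (k2, list(v2)) -> (k2, v2'): sums the values per key.
    // Input and output pairs share the same types, so it can run as a combiner
    // on one map task's output and again as the reducer on everything.
    static List<Map.Entry<String, Integer>> sumPerKey(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }
}
```

For two splits "a a b" and "a b b", the map tasks emit six pairs in total. Applying sumPerKey to each task's output first leaves four pairs to shuffle, and the final sums are unchanged.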


Hadoop streaming


Writing a tedious MR program every time is inconvenient...

Hive, Pig, Cascading

contrib/streaming/hadoop-streaming.jar
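With streaming, the mapper and reducer are ordinary executables: the mapper reads input lines on standard input and writes one "key<TAB>value" line per output record on standard output. Scripts are the usual choice, but to stay in one language the sketch below shows that contract in Java. StreamingMapperSketch is a hypothetical name, word counting is an assumed job, and the logic takes a Reader and a Writer so it can be tried without a cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Streaming-style mapper: text lines in, "key<TAB>value" lines out.
final class StreamingMapperSketch {
    static void run(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    writer.print(word + "\t1\n"); // key, tab, value, newline
                }
            }
        }
        writer.flush();
    }

    // Convenience wrapper for trying the mapper on an in-memory string.
    static String runOnString(String input) {
        StringWriter out = new StringWriter();
        try {
            run(new StringReader(input), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

To use something like this as a real streaming mapper, a main method would wire run to System.in and System.out. The reducer follows the same convention and receives its input lines sorted by key.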


Demo


Next...


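But what does a real analysis actually look like?

Read the data and cut it down to a specific time window

Filter again for a specific user or group of users

Combine or sum up each user's actions

Look for unique values as well

Carry out whatever analysis is needed...

Finally, sort by the criterion you care about and output the result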


Chaining!!


(textbook, p. 157)
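Chaining means feeding the output of one job into the next. The sketch below imitates a two-step chain in memory: the first step filters a log by time window and counts actions per user, and the second step sorts that output by count. Everything here is hypothetical, including the ChainSketch name and the "hour,user,action" record layout, and well-formed records are assumed. In Hadoop each step would be a separate job whose output directory is the next job's input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Two chained steps over an in-memory log (hypothetical class, no Hadoop).
final class ChainSketch {
    // Step 1: keep records with fromHour <= hour < toHour, then count per user.
    static Map<String, Integer> countActions(List<String> log, int fromHour, int toHour) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : log) {
            String[] cols = record.split(","); // hour,user,action
            int hour = Integer.parseInt(cols[0]);
            if (hour >= fromHour && hour < toHour) {
                counts.merge(cols[1], 1, Integer::sum);
            }
        }
        return counts;
    }

    // Step 2: takes step 1's output and sorts it by count, highest first.
    static List<String> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        return lines;
    }
}
```

The second step needs the complete output of the first before it can sort, which is why this kind of analysis usually takes more than one job.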


Finally...

Let's go back to Job and JobConf

There are a great many options

Each option has its own meaning


[email protected]

