ghislain fourny big data for engineers fall 2019 · ghislain fourny big data for engineers fall...
TRANSCRIPT
![Page 1: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/1.jpg)
Ghislain Fourny
Big Data for Engineers, Fall 2019
4. Distributed file systems
Kheng Ho Toh / 123RF Stock Photo
![Page 2: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/2.jpg)
2
So far...
We've reviewed relational databases
![Page 3: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/3.jpg)
3
So far...
We've looked into scaling out
![Page 4: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/4.jpg)
4
So far...
We've seen object storage
![Page 5: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/5.jpg)
5
So far...
We've looked into the Key-Value Model
![Page 6: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/6.jpg)
6
Poll
Go now to: https://eduapp-app1.ethz.ch/
or install EduApp 3.x
![Page 7: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/7.jpg)
There is Big Data and Big Data
Anna Liebiedieva / 123RF Stock Photo
Vadym Kurgak / 123RF Stock Photo
![Page 8: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/8.jpg)
8
Use cases
A huge amount of large files?
![Page 9: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/9.jpg)
9
Use cases
A huge amount of large files?
vs.
A large amount of huge files?
![Page 10: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/10.jpg)
10
Use cases
Object Storage: billions of TB files
vs.
File Storage: millions of PB files
![Page 11: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/11.jpg)
11
Where does the data come from?
Raw Data
Sensors
Measurements
Events
Logs
Oleg Dudko / 123RF Stock Photo
![Page 12: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/12.jpg)
12
Where does the data come from?
Raw Data
Sensors
Measurements
Events
Logs
Derived Data
Aggregated data
Intermediate data
Oleg Dudko / 123RF Stock Photo
Anton Starikov / 123RF Stock Photo
![Page 13: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/13.jpg)
13
Technologies and models
Key-Value Store File System
Object Storage Block Storage
Billions of <TB files
vs.
Millions of <PB files
![Page 14: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/14.jpg)
14
Technologies and models
Key-Value Store File System
Object Storage Block Storage
Billions of <TB files
vs.
Millions of <PB files
![Page 15: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/15.jpg)
15
Technologies and models
Key-Value Model File System
Object Storage Block Storage
Billions of <TB files
vs.
Millions of <PB files
![Page 16: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/16.jpg)
16
Technologies and models
Key-Value Model File System
Object Storage Block Storage
Billions of <TB files
vs.
Millions of <PB files
![Page 17: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/17.jpg)
17
Distributed file systems: inception
FS
![Page 18: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/18.jpg)
18
GFS genesis
Characteristics
![Page 19: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/19.jpg)
19
GFS genesis
Characteristics
Requirements
![Page 20: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/20.jpg)
20
GFS genesis
Characteristics
File System Design
Requirements
![Page 21: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/21.jpg)
21
Fault tolerance and robustness
Vitaly Korovin / 123RF Stock Photo
Local disk: it might fail
![Page 22: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/22.jpg)
22
Fault tolerance and robustness
Vitaly Korovin / 123RF Stock Photo
Local disk: it might fail
Cluster with 100s to 10,000s of machines: nodes will fail
Kheng Ho Toh / 123RF Stock Photo
![Page 23: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/23.jpg)
23
Fault tolerance and robustness
Monitoring
Kheng Ho Toh / 123RF Stock Photo
![Page 24: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/24.jpg)
24
Fault tolerance and robustness
Monitoring
Error detection
Kheng Ho Toh / 123RF Stock Photo
![Page 25: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/25.jpg)
25
Fault tolerance and robustness
Monitoring
Error detection
Automatic Recovery
Kheng Ho Toh / 123RF Stock Photo
![Page 26: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/26.jpg)
26
Fault tolerance and robustness
Fault tolerance
Monitoring
Error detection
Automatic Recovery
Kheng Ho Toh / 123RF Stock Photo
![Page 27: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/27.jpg)
27
File read model
Random access
vs.
Scan the file
![Page 28: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/28.jpg)
28
File update model
Random access
vs.
Upsert/append only
![Page 29: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/29.jpg)
29
File update model
immutable
Append
suitable for
Sensors
Logs
Intermediate data
![Page 30: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/30.jpg)
30
Appends
Append only
100s of clients
in parallel
atomic
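The append requirement above (many clients appending in parallel, each append atomic) can be illustrated with a toy in-memory file. This is only a sketch of the semantics, assuming a single process and a lock; the class `AppendOnlyFile` and the function `run_clients` are hypothetical names, and nothing here reflects how GFS or HDFS implement appends across machines.

```python
import threading


class AppendOnlyFile:
    """Toy in-memory file that supports only atomic appends and full scans."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def append(self, record: bytes) -> int:
        # The lock makes each append all-or-nothing: records never interleave.
        with self._lock:
            self._records.append(record)
            return len(self._records) - 1  # position of the appended record

    def scan(self):
        # Reading is a sequential scan over everything appended so far.
        with self._lock:
            return list(self._records)


def run_clients(n_clients=200, appends_per_client=50):
    """Start n_clients threads that all append to the same file."""
    f = AppendOnlyFile()

    def client(cid):
        for seq in range(appends_per_client):
            f.append(f"client={cid} seq={seq}".encode())

    threads = [threading.Thread(target=client, args=(c,)) for c in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return f.scan()


records = run_clients()
print(len(records), "records appended")
```

Every record arrives intact, and each client's own records keep their relative order, even though the interleaving between clients is arbitrary.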
![Page 31: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/31.jpg)
31
Performance requirements
Top priority:
Throughput
![Page 32: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/32.jpg)
32
Performance requirements
? !
Top priority:
Throughput
Secondary:
Latency
![Page 33: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/33.jpg)
33
The progress made (1956-2018): Logarithmic
Picture: Ash Waechter/123RF
Capacity (per unit of volume): 200,000,000,000x
Throughput: 10,000x
Latency: 8x
![Page 34: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/34.jpg)
34
The progress made (1956-2018): Logarithmic
Picture: Ash Waechter/123RF
Capacity (per unit of volume): 200,000,000,000x
Throughput: 10,000x
Latency: 8x
Parallelize!
![Page 35: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/35.jpg)
35
The progress made (1956-2018): Logarithmic
Picture: Ash Waechter/123RF
Capacity (per unit of volume): 200,000,000,000x
Throughput: 10,000x
Latency: 8x
Batch processing!
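The three growth factors on these slides can be combined into two ratios that suggest where the slogans "Parallelize!" and "Batch processing!" come from. The following is a rough worked example that uses only the factors quoted above; the reading of each ratio is an assumption made for illustration.

```python
# Growth factors quoted on the slides (1956-2018).
capacity_growth = 200_000_000_000  # per unit of volume
throughput_growth = 10_000
latency_growth = 8                 # latency improved (shrank) by this factor

# The time to scan a full device scales with capacity / throughput,
# so it grew by this factor.
scan_time_growth = capacity_growth // throughput_growth

# The amount of data that can be streamed during one access latency
# scales with throughput / latency improvement.
bytes_per_latency_growth = throughput_growth // latency_growth

print(f"Full-scan time grew about {scan_time_growth:,}x: read from many disks in parallel")
print(f"Data streamed per access latency grew about {bytes_per_latency_growth:,}x: batch the work")
```

Under these assumptions, one device takes far longer to read end to end than it used to, and each access is worth paying for only if it transfers a lot of data.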
![Page 36: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/36.jpg)
36
Hadoop
![Page 37: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/37.jpg)
37
Hadoop
Initiated in
2006
![Page 38: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/38.jpg)
38
Hadoop
Primarily:
• Distributed File System (HDFS)
• MapReduce
• Wide column store (HBase)
Covered in this lecture
![Page 39: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/39.jpg)
39
Hadoop
Inspired by Google's
• GFS (2003)
• MapReduce (2004)
• BigTable (2006)
Covered in this lecture
![Page 40: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/40.jpg)
40
Size timeline
| Date | Size reported by Yahoo |
|---|---|
| April 2006 | 188 |
| May 2006 | 300 |
| October 2006 | 600 |
| April 2007 | 1,000 |
| February 2008 | 10,000 (index generation) |
| March 2009 | 24,000 (17 clusters) |
| June 2011 | 42,000 (100+ PB) |
| November 2016 | 100,000? (600 PB) |
![Page 41: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/41.jpg)
41
Distributed file systems: the model
![Page 42: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/42.jpg)
42
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
File Systems (Logical Model)
Key-Value Storage
![Page 43: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/43.jpg)
43
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
File Systems (Logical Model)
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
Key-Value Model
vs.
File Hierarchy
![Page 44: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/44.jpg)
44
Block Storage (Physical Storage)
111010010110101…
Object Storage
![Page 45: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/45.jpg)
45
Block Storage (Physical Storage)
111010010110101…
1 2 3
4 5 6
7 8
Object Storage Block Storage
![Page 46: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/46.jpg)
46
Terminology
HDFS: Block
GFS: Chunk
![Page 47: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/47.jpg)
47
Files and blocks
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
![Page 48: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/48.jpg)
48
Files and blocks
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
1 2 3 4 5 6 7 8
![Page 49: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/49.jpg)
49
Files and blocks
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
1 2 3 4 5 6 7 8
1 2 3
![Page 50: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/50.jpg)
50
Files and blocks
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
1 2 3 4 5 6 7 8
1 2 3
1 2 3 4
![Page 51: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/51.jpg)
51
Why blocks?
![Page 52: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/52.jpg)
52
Why blocks?
1. Files bigger than a disk
PBs!
![Page 53: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/53.jpg)
53
Why blocks?
1. Files bigger than a disk
PBs!
2. Simpler level of abstraction
![Page 54: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/54.jpg)
54
Single machine vs. distributed
![Page 55: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/55.jpg)
55
The right block size
Simple file system
4 kB
![Page 56: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/56.jpg)
56
Poll
Go now to: https://eduapp-app1.ethz.ch/
or install EduApp 2.x
![Page 57: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/57.jpg)
57
The right block size
Simple file system Distributed file system
4 kB
64 MB – 128 MB
![Page 58: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/58.jpg)
58
The right block size
Relational Database Distributed file system
4 kB – 32 kB
64 MB – 128 MB
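To see why the block size matters, it helps to count how many blocks a single very large file produces at each size quoted above. A small worked example follows, assuming binary units (1 kB = 1024 bytes) and a hypothetical 1 PB file; the helper `blocks_needed` is an illustrative name.

```python
KB = 1024
MB = 1024 ** 2
PB = 1024 ** 5


def blocks_needed(file_size: int, block_size: int) -> int:
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


file_size = 1 * PB  # hypothetical file size, chosen for illustration
block_counts = {}
for label, block_size in [("4 kB", 4 * KB), ("64 MB", 64 * MB), ("128 MB", 128 * MB)]:
    block_counts[label] = blocks_needed(file_size, block_size)
    print(f"{label:>6} blocks: {block_counts[label]:,}")
```

With 4 kB blocks the system would have to track hundreds of billions of blocks for this one file; with 64 MB to 128 MB blocks the count drops to millions.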
![Page 59: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/59.jpg)
59
HDFS Architecture
![Page 60: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/60.jpg)
60
How do we connect the many machines?
![Page 61: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/61.jpg)
61
Peer-to-peer architecture
![Page 62: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/62.jpg)
62
Master-slave architecture
Slave
Master
Slave Slave Slave Slave Slave
![Page 63: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/63.jpg)
63
HDFS server architecture
![Page 64: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/64.jpg)
64
HDFS server architecture
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
![Page 65: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/65.jpg)
65
From the file perspective
Namenode
File...
![Page 66: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/66.jpg)
66
From the file perspective
File...
...divided into 128MB chunks...
Namenode
![Page 67: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/67.jpg)
67
From the file perspective
File...
...divided into 128MB chunks...
... replicated for fault tolerance
Namenode
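The two steps on this slide (divide the file into 128 MB pieces, then replicate each piece) can be sketched as follows. The replication factor of 3 and the round-robin placement are assumptions for illustration only; the slide states neither, and real placement policies are more involved. The names `split_into_blocks` and `place_replicas` are hypothetical.

```python
MB = 1024 ** 2
BLOCK_SIZE = 128 * MB


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (start, end) byte ranges, end exclusive; the last block may be shorter."""
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]


def place_replicas(num_blocks, datanodes, replication=3):
    """Toy round-robin placement: each block goes to `replication` distinct datanodes."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    return [[datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)]


blocks = split_into_blocks(300 * MB)                # hypothetical 300 MB file
nodes = [f"datanode{i}" for i in range(1, 7)]       # six datanodes, as drawn on the slide
placement = place_replicas(len(blocks), nodes)

for index, ((start, end), replicas) in enumerate(zip(blocks, placement), start=1):
    print(f"block {index}: bytes [{start}, {end}) on {replicas}")
```

Losing any single datanode in this sketch still leaves at least two copies of every block.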
![Page 68: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/68.jpg)
68
Concurrently accessed
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
![Page 69: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/69.jpg)
69
Hadoop implementation
(Packaged code)
![Page 70: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/70.jpg)
70
HDFS Architecture
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
![Page 71: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/71.jpg)
71
HDFS Architecture: NameNode
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
![Page 72: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/72.jpg)
72
NameNode: all system-wide activity
![Page 73: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/73.jpg)
73
NameNode: all system-wide activity
Memory
1 File namespace
(+Access Control)
![Page 74: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/74.jpg)
74
NameNode: all system-wide activity
Memory
1 File namespace (+Access Control)
/dir/file1
/dir/file2
/file3
2 File to block mapping
![Page 75: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/75.jpg)
75
NameNode: all system-wide activity
Memory
1 File namespace (+Access Control)
/dir/file1
/dir/file2
/file3
2 File to block mapping
3 Block locations
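The three things the NameNode keeps in memory can be pictured as three dictionaries. This is a minimal sketch under that assumption; the class `NameNodeMemory`, its method names, the owner field, and the small block IDs are all hypothetical and are not taken from the Hadoop code base.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeMemory:
    # 1. File namespace (+ access control): path -> toy metadata
    namespace: dict = field(default_factory=dict)
    # 2. File to block mapping: path -> ordered list of block IDs
    file_to_blocks: dict = field(default_factory=dict)
    # 3. Block locations: block ID -> set of datanode names
    block_locations: dict = field(default_factory=dict)

    def add_file(self, path, owner, block_ids):
        self.namespace[path] = {"owner": owner}
        self.file_to_blocks[path] = list(block_ids)

    def report_block(self, block_id, datanode):
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """For each block of the file, in order, list the datanodes holding it."""
        return [(block_id, sorted(self.block_locations.get(block_id, ())))
                for block_id in self.file_to_blocks[path]]


nn = NameNodeMemory()
nn.add_file("/dir/file1", owner="alice", block_ids=[101, 102])
nn.report_block(101, "datanode1")
nn.report_block(101, "datanode2")
nn.report_block(102, "datanode3")
print(nn.locate("/dir/file1"))
```

Note that the file contents never appear in these structures: the NameNode holds metadata only.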
![Page 76: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/76.jpg)
76
HDFS Architecture
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
![Page 77: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/77.jpg)
77
HDFS Architecture: DataNode
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
![Page 78: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/78.jpg)
78
DataNode
![Page 79: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/79.jpg)
79
DataNode
![Page 80: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/80.jpg)
80
DataNode
Blocks are stored on the local disk
![Page 81: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/81.jpg)
81
DataNode
Proximity to hardware facilitates disk failure detection
![Page 82: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/82.jpg)
82
Block IDs
64 bits
e.g., 7586700455251598184
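As a quick check, the example ID above does fit in 64 bits, and in fact in a signed 64-bit integer. The sketch below verifies this and shows the fixed 8-byte encoding; the big-endian signed layout is an assumption chosen for illustration, not a statement about the on-disk or wire format.

```python
import struct

example_id = 7586700455251598184  # the block ID shown on the slide

# A non-negative value fits in a signed 64-bit integer if it is below 2**63.
fits_signed_64 = 0 <= example_id < 2 ** 63

# Pack as a big-endian signed 64-bit integer: always exactly 8 bytes.
packed = struct.pack(">q", example_id)
(unpacked,) = struct.unpack(">q", packed)

print("bits needed:", example_id.bit_length())
print("fits in signed 64-bit:", fits_signed_64)
print("encoded length in bytes:", len(packed))
```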
![Page 83: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/83.jpg)
83
Subblock granularity: Byte Range
![Page 84: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/84.jpg)
84
Communication
Datanode
Namenode
Datanode
Client
![Page 85: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/85.jpg)
85
Summary
Datanode
Namenode
Datanode
Client
Client Protocol
DataTransfer Protocol
DataNode Protocol
![Page 86: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/86.jpg)
86
Communication
Datanode
Namenode
Datanode
Client
![Page 87: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/87.jpg)
87
Client Protocol (RPC)
Client
Metadata operations
Namenode
![Page 88: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/88.jpg)
88
Client Protocol (RPC)
Client
Metadata operations
DataNode location
Namenode
![Page 89: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/89.jpg)
89
Client Protocol (RPC)
Client
Metadata operations
DataNode location
Block IDs
Namenode
![Page 90: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/90.jpg)
90
Client Protocol (RPC)
Namenode
Client
Metadata operations
DataNode location
Block IDs
Java API available
![Page 91: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/91.jpg)
91
Communication
Datanode
Namenode
Datanode
Client
Control
![Page 92: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/92.jpg)
92
DataNode Protocol (RPC)
Datanode
Datanode always
initiates connection!
Namenode
![Page 93: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/93.jpg)
93
DataNode Protocol (RPC)
Datanode
Datanode always
initiates connection!
Registration
Namenode
![Page 94: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/94.jpg)
94
DataNode Protocol (RPC)
Datanode
Heartbeat: every 3s (customizable)
Datanode always initiates connection!
Registration
Namenode
![Page 95: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/95.jpg)
95
DataNode Protocol (RPC)
Datanode
Heartbeat: every 3s (customizable)
Block operations
Datanode always initiates connection!
Registration
Namenode
![Page 96: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/96.jpg)
96
DataNode Protocol (RPC)
Datanode
Heartbeat: every 3s (customizable)
BlockReport: every 6h (customizable)
Block operations
Datanode always initiates connection!
Registration
Namenode
![Page 97: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/97.jpg)
97
DataNode Protocol (RPC)
Datanode
Heartbeat: every 3s (customizable)
BlockReport: every 6h (customizable)
Block operations
Datanode always initiates connection!
Registration
BlockReceived
Namenode
![Page 98: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/98.jpg)
98
DataNode Protocol (RPC)
Datanode
Namenode
Heartbeat: every 3s (customizable)
BlockReport: every 6h (customizable)
Block operations
Java API available
Registration
BlockReceived
![Page 99: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/99.jpg)
99
DataNode Protocol
Datanode
Heartbeat: every 3s (customizable)
BlockReport: every 6h (customizable)
Block operations
Datanode always initiates connection!
Registration
BlockReceived
Namenode
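The DataNode Protocol messages above (Registration, Heartbeat, BlockReport, BlockReceived) can be pictured with a small in-memory sketch. This is not the Hadoop implementation: the class `NameNodeView`, its method names, and the `DEAD_AFTER_S` timeout are assumptions made up for illustration. Only the 3 s and 6 h intervals come from the slides.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 3             # default from the slides (customizable)
BLOCK_REPORT_INTERVAL_S = 6 * 3600   # default from the slides (customizable)
DEAD_AFTER_S = 30                    # hypothetical timeout, chosen for this sketch only


@dataclass
class NameNodeView:
    """What a NameNode might remember about its DataNodes (illustrative only)."""
    last_heartbeat: dict = field(default_factory=dict)  # datanode id -> time in seconds
    blocks: dict = field(default_factory=dict)          # datanode id -> set of block ids

    def register(self, dn, now):
        # Registration: the DataNode introduces itself; it always initiates.
        self.last_heartbeat[dn] = now
        self.blocks[dn] = set()

    def heartbeat(self, dn, now):
        # Heartbeat: "I am still alive".
        self.last_heartbeat[dn] = now

    def block_report(self, dn, block_ids):
        # BlockReport: the full list replaces what the NameNode believed before.
        self.blocks[dn] = set(block_ids)

    def block_received(self, dn, block_id):
        # BlockReceived: incremental notice that one new block was stored.
        self.blocks[dn].add(block_id)

    def live_datanodes(self, now):
        return sorted(dn for dn, t in self.last_heartbeat.items()
                      if now - t <= DEAD_AFTER_S)


nn = NameNodeView()
nn.register("dn1", now=0)
nn.register("dn2", now=0)
nn.block_received("dn1", "blk_1")
for t in range(HEARTBEAT_INTERVAL_S, 61, HEARTBEAT_INTERVAL_S):
    nn.heartbeat("dn1", now=t)       # dn2 stays silent in this scenario

heartbeats_per_block_report = BLOCK_REPORT_INTERVAL_S // HEARTBEAT_INTERVAL_S
print(nn.live_datanodes(now=60))     # ['dn1']
print(heartbeats_per_block_report)   # 7200
```

With the default intervals, a DataNode sends 7200 heartbeats for every full block report, which is why the cheap message is the frequent one.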
![Page 100: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/100.jpg)
100
Communication
Datanode
Namenode
Datanode
Client
Control
Control
![Page 101: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/101.jpg)
101
DataTransfer Protocol (Streaming)
Client <-> DataNode: data blocks
DataNode <-> DataNode
![Page 102: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/102.jpg)
102
DataTransfer Protocol (Streaming)
Client <-> DataNode: data blocks
DataNode <-> DataNode: replication pipelining (write only)
![Page 103: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/103.jpg)
103
DataTransfer Protocol (Streaming)
Client <-> DataNode: data blocks
DataNode <-> DataNode: replication pipelining (write only)
![Page 104: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/104.jpg)
104
Communication
Datanode
Namenode
Datanode
Client
Control
Control
Control
Data
Control
![Page 105: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/105.jpg)
105
Summary
Datanode
Namenode
Datanode
Client
Client Protocol
DataTransfer Protocol
DataNode Protocol
![Page 106: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/106.jpg)
106
Metadata functionality
Create directory
Delete directory
Write file
Append to file
Read file
Delete file
![Page 107: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/107.jpg)
107
Client reads a file
![Page 108: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/108.jpg)
108
Client reads a file
1. Asks for file
![Page 109: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/109.jpg)
109
Client reads a file
2. Get block locations
Multiple DataNodes for each block, sorted by distance
![Page 110: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/110.jpg)
110
Client reads a file
3. Read (input stream)
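The three read steps can be sketched as follows. The dictionaries stand in for NameNode metadata and DataNode storage; all names, block IDs, and distances are hypothetical and only illustrate the idea that the client gets, per block, a list of DataNodes sorted by distance and then streams from the closest one.

```python
# Hypothetical in-memory stand-ins for NameNode metadata and DataNode storage.
FILE_TO_BLOCKS = {"/dir/file1": ["blk_1", "blk_2"]}
BLOCK_LOCATIONS = {                  # block id -> {datanode: distance from the client}
    "blk_1": {"dn3": 4, "dn1": 0, "dn2": 2},
    "blk_2": {"dn2": 2, "dn4": 4},
}
DATANODE_STORAGE = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_1": b"hello "},
    "dn4": {"blk_2": b"world"},
}


def get_block_locations(path):
    """Steps 1-2: the NameNode answers with, per block, DataNodes sorted by distance."""
    return [
        (blk, sorted(BLOCK_LOCATIONS[blk], key=BLOCK_LOCATIONS[blk].get))
        for blk in FILE_TO_BLOCKS[path]
    ]


def read_file(path):
    """Step 3: read every block from the closest DataNode that holds it."""
    data = b""
    for blk, datanodes in get_block_locations(path):
        closest = datanodes[0]
        data += DATANODE_STORAGE[closest][blk]
    return data


print(get_block_locations("/dir/file1"))
print(read_file("/dir/file1"))       # b'hello world'
```

Note that the NameNode only hands out metadata here; the bytes themselves come from the DataNodes.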
![Page 111: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/111.jpg)
111
Client writes a file
![Page 112: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/112.jpg)
112
Client writes a file
1. Create
![Page 113: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/113.jpg)
113
Client writes a file
2. DataNodes for first block
![Page 114: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/114.jpg)
114
Client writes a file
3. Organizes pipeline
![Page 115: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/115.jpg)
115
Client writes a file
4. Sends data over
![Page 116: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/116.jpg)
116
Client writes a file
5. Ack
![Page 117: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/117.jpg)
117
Client writes a file
2. DataNodes for second block
![Page 118: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/118.jpg)
118
Client writes a file
3. Organizes pipeline
![Page 119: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/119.jpg)
119
Client writes a file
4. Sends data over
![Page 120: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/120.jpg)
120
Client writes a file
5. Ack
![Page 121: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/121.jpg)
121
This is all done simultaneously under DFSOutputStream (streaming through)
2. DataNodes for nth block
3. Pipeline
4. Sends data over
5. Ack
![Page 122: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/122.jpg)
122
Client writes a file
6. Close/release lock
![Page 123: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/123.jpg)
123
Client writes a file
7. Checking for minimal replication (DataNode protocol)
![Page 124: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/124.jpg)
124
Client writes a file
8. Ack
![Page 125: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/125.jpg)
125
Client writes a file
9. Replicates further asynchronously
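Steps 2 to 5 of the write path (get DataNodes for a block, organize the pipeline, send data, receive acks) can be sketched as below. This is a toy model under stated assumptions: the function names, the packet size, and the `choose_datanodes` callback that plays the NameNode are all hypothetical, and steps 6 to 9 (close, minimal replication check, final ack, asynchronous re-replication) are not modelled.

```python
def write_block(block_id, packets, pipeline, storage):
    """Steps 3-5: each packet travels down the pipeline, then one ack comes back."""
    acks = 0
    for packet in packets:
        for dn in pipeline:          # the first DataNode forwards to the second, and so on
            storage.setdefault(dn, {}).setdefault(block_id, bytearray()).extend(packet)
        acks += 1                    # ack propagated back to the client
    return acks


def write_file(data, block_size, packet_size, choose_datanodes):
    """Split data into blocks and push each block through its own pipeline."""
    storage = {}                     # datanode -> {block id -> bytes}
    blocks = []
    for i in range(0, len(data), block_size):
        block_id = f"blk_{i // block_size + 1}"
        chunk = data[i:i + block_size]
        pipeline = choose_datanodes(block_id)   # step 2: ask the NameNode
        packets = [chunk[j:j + packet_size] for j in range(0, len(chunk), packet_size)]
        write_block(block_id, packets, pipeline, storage)
        blocks.append(block_id)
    return blocks, storage


blocks, storage = write_file(
    b"abcdefghij", block_size=4, packet_size=2,
    choose_datanodes=lambda blk: ["dn1", "dn2", "dn3"],
)
print(blocks)                            # ['blk_1', 'blk_2', 'blk_3']
print(bytes(storage["dn3"]["blk_3"]))    # b'ij'
```

The client only ever talks to the first DataNode of each pipeline; replication to the others happens DataNode to DataNode.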
![Page 126: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/126.jpg)
126
Replicas
![Page 127: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/127.jpg)
127
Replicas
Number of replicas specified per file
Default: 3
![Page 128: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/128.jpg)
128
Replica placement: what to consider?
Reliability
Read/Write Bandwidth
Block distribution
![Page 129: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/129.jpg)
129
Replica placement: Reminder on topology
Cluster
Rack
Node
![Page 130: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/130.jpg)
130
Replica placement: Distance
A and B on the same rack: D(A,B) = 2
![Page 131: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/131.jpg)
131
Replica placement: Distance
A and B on different racks: D(A,B) = 4
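One way to read these two values: the distance is the number of edges between A and B in the cluster/rack/node tree. A minimal sketch, assuming nodes are identified by hypothetical paths such as `/cluster/rackA/node1`:

```python
def distance(a, b):
    """Edges from a up to the closest common ancestor, plus edges down to b."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


print(distance("/cluster/rackA/node1", "/cluster/rackA/node1"))  # 0, same node
print(distance("/cluster/rackA/node1", "/cluster/rackA/node2"))  # 2, same rack
print(distance("/cluster/rackA/node1", "/cluster/rackB/node3"))  # 4, different racks
```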
![Page 132: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/132.jpg)
132
Replica placement
Replica 1: same node as client (or random), rack A
![Page 133: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/133.jpg)
133
Replica placement
Replica 1: same node as client (or random), rack A
Replica 2: a node in a different rack B
![Page 134: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/134.jpg)
134
Replica placement
Replica 1: same node as client (or random), rack A
Replica 2: a node in a different rack B
Replica 3: a node in same rack B
![Page 135: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/135.jpg)
135
Replica placement
Replica 1: same node as client (or random), rack A
Replica 2: a node in a different rack B
Replica 3: a node in same rack B
Replica 4 and beyond: random, but if possible:
• at most one replica per node
• at most two replicas per rack
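The placement rules above can be sketched as a small function. It is an illustration, not Hadoop's placement code: the function name, the topology dictionary, and the node names are hypothetical, and the sketch assumes at least two racks with at least two nodes each and enough nodes for the requested number of replicas.

```python
import random
from collections import Counter


def place_replicas(topology, client_node, n_replicas=3, rng=random):
    """topology: dict mapping rack name -> list of node names."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # Replica 1: same node as the client if it is a DataNode, otherwise random (rack A).
    first = client_node if client_node in rack_of else rng.choice(sorted(rack_of))
    chosen = [first]

    # Replica 2: a node in a different rack B.
    rack_b = rng.choice(sorted(r for r in topology if r != rack_of[first]))
    second = rng.choice(topology[rack_b])
    chosen.append(second)

    # Replica 3: another node in the same rack B.
    chosen.append(rng.choice([n for n in topology[rack_b] if n != second]))

    # Replica 4 and beyond: random, preferring at most one replica per node
    # and at most two replicas per rack.
    while len(chosen) < n_replicas:
        per_rack = Counter(rack_of[n] for n in chosen)
        free = [n for n in sorted(rack_of) if n not in chosen]
        preferred = [n for n in free if per_rack[rack_of[n]] < 2]
        chosen.append(rng.choice(preferred or free))
    return chosen[:n_replicas]


topology = {
    "rackA": ["a1", "a2", "a3"],
    "rackB": ["b1", "b2", "b3"],
    "rackC": ["c1", "c2"],
}
replicas = place_replicas(topology, client_node="a1", n_replicas=4, rng=random.Random(0))
print(replicas)
```

The first replica is cheap to write (local), while replicas 2 and 3 share one cross-rack transfer and still survive the loss of rack A.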
![Page 136: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/136.jpg)
136
Replica placement
Client
![Page 137: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/137.jpg)
137
Why replicas 2+3 on other rack?
Client
![Page 138: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/138.jpg)
138
If replicas 1+2 were on same rack...
Block concentration on same rack (2/3)
![Page 139: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/139.jpg)
139
Performance and availability
![Page 140: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/140.jpg)
140
The NameNode is a single point of failure
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
![Page 141: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/141.jpg)
141
The namenode is a single point of failure...
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
What if it fails?
![Page 142: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/142.jpg)
142
NameNode: all system-wide activity
Memory
1. File namespace (+Access Control): /dir/file1, /dir/file2, /file3
2. File to block mapping
3. Block locations
![Page 143: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/143.jpg)
143
1. You want to persist
Memory
1. File namespace: /dir/file1
2. File to block mapping
3. Block locations: not persisted
![Page 144: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/144.jpg)
144
1. You want to persist
Namespace file
Persistent Storage
Memory
1. File namespace: /dir/file1
2. File to block mapping
3. Block locations: not persisted
![Page 145: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/145.jpg)
145
1. You want to persist
Namespace file
Persistent Storage
Memory
1. File namespace: /dir/file1, /dir/file2
2. File to block mapping
3. Block locations: not persisted
Edit log
![Page 146: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/146.jpg)
146
1. You want to persist
Namespace file
Persistent Storage
Memory
1. File namespace: /dir/file1, /dir/file2, /file3
2. File to block mapping
3. Block locations: not persisted
Edit log
![Page 147: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/147.jpg)
147
2. You want to backup
Namespace file
Edit log
Persistent Storage
Shared drive
Backup drives
Glacier
![Page 148: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/148.jpg)
148
The namenode is a single point of failure...
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
What if it fails?
![Page 149: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/149.jpg)
149
The namenode is a single point of failure...
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
We need to start
it up again!
![Page 150: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/150.jpg)
150
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Edit log
![Page 151: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/151.jpg)
151
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
Edit log
![Page 152: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/152.jpg)
152
Edit log
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
![Page 153: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/153.jpg)
153
Edit log
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
/dir/file1
/dir/file2
/file3
![Page 154: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/154.jpg)
154
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
/dir/file1
/dir/file2
/file3
Block locations
Edit log
![Page 155: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/155.jpg)
155
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
/dir/file1
/dir/file2
/file3
Block locations
Edit log
![Page 156: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/156.jpg)
156
Namenodes: Startup
Namespace
file
Persistent Storage
Memory
Filesystem
/dir/file1
/dir/file2
/file3
Block locations
Edit log
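The startup sequence shown across these slides (load the namespace file, replay the edit log, then learn block locations from the DataNodes) can be sketched as follows. The data layout, the operation names `create` and `delete`, and the function name are assumptions for illustration only.

```python
def start_namenode(namespace_file, edit_log, block_reports):
    """Rebuild the in-memory state of a NameNode (illustrative sketch)."""
    # 1. Load the last persisted namespace file (path -> list of block ids).
    namespace = {path: list(blocks) for path, blocks in namespace_file.items()}

    # 2. Replay the edit log: every change made since that snapshot.
    for op, path, blocks in edit_log:
        if op == "create":
            namespace[path] = list(blocks)
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")

    # 3. Block locations are not persisted: rebuild them from block reports.
    locations = {}
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            locations.setdefault(block_id, set()).add(datanode)
    return namespace, locations


namespace, locations = start_namenode(
    namespace_file={"/dir/file1": ["blk_1"]},
    edit_log=[
        ("create", "/dir/file2", ["blk_2"]),
        ("create", "/file3", ["blk_3"]),
    ],
    block_reports={"dn1": ["blk_1", "blk_2"], "dn2": ["blk_2", "blk_3"]},
)
print(sorted(namespace))             # ['/dir/file1', '/dir/file2', '/file3']
print(sorted(locations["blk_2"]))    # ['dn1', 'dn2']
```

The longer the edit log, the longer step 2 takes, which is one reason a restart is slow without periodic checkpoints.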
![Page 157: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/157.jpg)
157
Starting a namenode...
... takes
30 minutes!
![Page 158: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/158.jpg)
158
Starting a namenode...
Can we do
better?
![Page 159: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/159.jpg)
159
3. Checkpoints with Secondary NameNode
Old namespace file + Edit log -> New namespace file
![Page 160: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/160.jpg)
160
4. High Availability (HA): Backup NameNodes
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode
/dir/file1
/dir/file2
/file3
Namenode
/dir/file1
/dir/file2
/file3
Maintains mappings and locations
in memory like the namenode.
![Page 161: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/161.jpg)
161
5. High Availability (HA): Standby NameNodes
Active
Namenode
Standby
Namenode
Standby
Namenode
![Page 162: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/162.jpg)
162
6. Federated DFS
Datanode Datanode Datanode Datanode Datanode Datanode
Namenode /foo
/foo/file1
/foo/file2
Namenode /bar
/bar/file1
/bar/file2
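With federation, each NameNode owns one part of the namespace, so something has to decide which NameNode serves a given path. A minimal sketch of that routing by longest matching prefix, assuming a hypothetical mount table and made-up NameNode names:

```python
# Hypothetical mount table: namespace prefix -> responsible NameNode.
MOUNT_TABLE = {"/foo": "namenode-foo", "/bar": "namenode-bar"}


def namenode_for(path, mount_table=MOUNT_TABLE):
    """Return the NameNode whose prefix is the longest one matching the path."""
    matches = [
        prefix for prefix in mount_table
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        raise KeyError(f"no NameNode is responsible for {path}")
    return mount_table[max(matches, key=len)]


print(namenode_for("/foo/file1"))    # namenode-foo
print(namenode_for("/bar/file2"))    # namenode-bar
```

The DataNodes are shared by all NameNodes; only the metadata is partitioned.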
![Page 163: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/163.jpg)
163
Using HDFS
![Page 164: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/164.jpg)
164
HDFS Shell
$ hadoop fs <args>
![Page 165: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/165.jpg)
165
HDFS Shell
$ hadoop fs <args>
$ hdfs dfs <args>
![Page 166: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/166.jpg)
166
HDFS Shell
$ hadoop fs <args>
$ hdfs dfs <args>
local filesystem
![Page 167: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/167.jpg)
167
HDFS Shell: POSIX-like
$ hadoop fs -ls
$ hadoop fs -cat /dir/file
$ hadoop fs -rm /dir/file
$ hadoop fs -mkdir /dir
![Page 168: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/168.jpg)
168
HDFS Shell: upload and download
$ hadoop fs -copyToLocal /user/hadoop/file localfile
$ hadoop fs -copyFromLocal localfile1 localfile2 /user/hadoop/hadoopdir
![Page 169: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/169.jpg)
169
HDFS Shell: Configuration
core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://host:8020</value>
    <description>NameNode hostname</description>
  </property>
</configuration>
![Page 170: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/170.jpg)
170
HDFS Shell: Configuration
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Replication factor</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/grid/hadoop/hdfs/nn</value>
    <description>NameNode directory</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/grid/hadoop/hdfs/dn</value>
    <description>DataNode directory</description>
  </property>
</configuration>
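To check what such a file actually sets, the name/value pairs can be read with a few lines of Python. This is a sketch using only the standard library. The embedded XML is a shortened sample rather than a real cluster configuration, `read_properties` is a hypothetical helper name, and the `<configuration>` root is the element Hadoop uses in its site files.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the style of hdfs-site.xml (values are illustrative).
HDFS_SITE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/grid/hadoop/hdfs/nn</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Collect every <property> as a name -> value pair (values stay strings)."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


props = read_properties(HDFS_SITE)
print(props["dfs.replication"])          # 1
print(props["dfs.namenode.name.dir"])    # /grid/hadoop/hdfs/nn
```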
![Page 171: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/171.jpg)
171
Populating HDFS: Apache Flume
Collects, aggregates, moves log data (into HDFS)
![Page 172: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/172.jpg)
172
Populating HDFS: Apache Sqoop
Imports from a relational database
![Page 173: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/173.jpg)
173
GFS
![Page 174: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/174.jpg)
174
GFS vs. HDFS: Terminology
| HDFS | GFS |
| --- | --- |
| NameNode | Master |
| DataNode | Chunkserver |
| Block | Chunk |
| FS Image | Checkpoint image |
| Edit log | Operation log |
![Page 175: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/175.jpg)
175
HDFS vs. GFS: Block size
GFS / Apache HDFS (Hadoop 1.x default): 64 MB
Cloudera HDFS (and the Apache HDFS default since Hadoop 2.x): 128 MB
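The practical effect of the block size is easy to work out: a larger block size means fewer blocks per file, hence less metadata on the NameNode. A small arithmetic sketch, with a hypothetical helper name and example file sizes chosen for illustration:

```python
MB = 1024 * 1024


def block_layout(file_size, block_size):
    """Return (number of blocks, size of the last block) for one file."""
    if file_size == 0:
        return 0, 0
    n_blocks = -(-file_size // block_size)            # ceiling division
    last_block = file_size - (n_blocks - 1) * block_size
    return n_blocks, last_block


print(block_layout(1024 * MB, 64 * MB))    # (16, 67108864): 16 full blocks
print(block_layout(1024 * MB, 128 * MB))   # (8, 134217728): 8 full blocks
print(block_layout(150 * MB, 128 * MB))    # (2, 23068672): last block holds 22 MB
```

The last block of a file only occupies as much space as it contains, so the 150 MB file does not waste the remainder of its second block.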
![Page 176: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo](https://reader034.vdocuments.us/reader034/viewer/2022052320/5f05e89f7e708231d4155304/html5/thumbnails/176.jpg)
176
Pointers
Official documentation
http://hadoop.apache.org/docs/r3.2.1/
GFS Paper
On course website
Java API
https://hadoop.apache.org/docs/r3.2.1/api/index.html