Analyzing HDFS Files using Apache Spark and MapReduce FixedLengthInputFormat
[email protected] (source code is here)
MAPREDUCE-1176: FixedLengthInputFormat and FixedLengthRecordReader (fixed in 2.3.0), by Mariappan Asokan, BitsOfInfo
Addition of FixedLengthInputFormat and FixedLengthRecordReader in the org.apache.hadoop.mapreduce.lib.input package. These two classes can be used when you need to read data from files containing fixed-length (fixed-width) records. Such files have no CR/LF (or any combination thereof) and no delimiters; each record is a fixed length, and extra data is padded with spaces.
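As a minimal sketch, the input format can be driven from Spark through newAPIHadoopFile. The record width and path below are illustrative assumptions, not values from the deck:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
// Every record is exactly 20 bytes; no CR/LF, no delimiters (assumed width).
FixedLengthInputFormat.setRecordLength(conf, 20)

// Keys are byte offsets into the file; values are the fixed-width records.
val records = sc.newAPIHadoopFile(
  "hdfs:///user/leo/fixed-width.dat",   // hypothetical path
  classOf[FixedLengthInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable],
  conf)

This needs Hadoop 2.3.0 or later; each value arrives as a BytesWritable of exactly the configured length.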
One 2GB gigantic line within a file: the issue [Stack Overflow]
Considering the String class' length method returns an int, the maximum length that would be returned by the method would be Integer.MAX_VALUE, which is 2^31 - 1 (or approximately 2 billion).
In terms of lengths and indexing of arrays (such as char[], which is probably the way the internal data representation is implemented for Strings), ...
val rdd = sc.textFile("hdfs:///user/leo/test.txt/nolr2G-1.txt")
rdd.count
...
ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 236)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
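The failure is simple arithmetic: a String (and the char[] behind it) is indexed by int, so nothing longer than 2^31 - 1 chars can be materialized, and the reader's buffer doubling (the Arrays.copyOf in the trace) exhausts the heap even sooner. An illustrative REPL check:

scala> Int.MaxValue              // the bound on String.length and array indexing
res0: Int = 2147483647

scala> Int.MaxValue.toLong + 1   // chars in a 2GB line: one more than fits
res1: Long = 2147483648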
HDFS File
org.apache.hadoop.mapreduce.lib.input.TextInputFormat: An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
$ hdfs dfs -cat example.txt
King Henry the Fourth.
Henry, Prince of Wales, son to the King.
Prince John of Lancaster, son to the King.
Earl of Westmoreland.
Sir Walter Blunt.
Thomas Percy, Earl of Worcester.
Henry Percy, Earl of Northumberland.
Henry Percy, surnamed Hotspur, his son.
...
[Diagram: lines 1 ... n of a text file stored across HDFS blocks 1 ... n]
Split & Record❏ An input split is a chunk of the input that is processed by a single map. ❏ Each split is divided into records, and the map processes each record—a key-
value pair—in turn.❏ By default the split size is dfs.block.size.[1]
[Diagram: an HDFS text file; Split 1 (block 1) and Split 2 (block 2), divided into Record 1 (line 1) through Record 6 (line 6)]
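From the Spark side the splits are visible as partitions, since a HadoopRDD makes one partition per input split (the file path is borrowed from the validation data later in this deck):

scala> val rdd = sc.textFile("hdfs:///user/leo/test.2/all-bible-1.txt")
scala> rdd.partitions.length   // one partition per split, one split per HDFS block by default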
[Diagram: a normal text file's CR/LF-delimited records map cleanly onto splits, while a file with one 2GB gigantic line yields a single Record > 2GB that spans every split]
[1] Tom White, “Hadoop: The Definitive Guide, 3rd Edition”, p. 234, 2012
Check length of records with FixedLengthInputFormat (1)
① Read blocks of fixed length, except the last one
② Split blocks by \n and compute the length of each record
③ Combine the last part of the previous block with the first part of the current block
④ Group splits by each file
⑤ Sort splits by reading position
⑥ Combine the last part of the previous block with the first part of the next block
[Diagram: an HDFS file with CR/LF-delimited records across Split-1 and Split-2 is read as fixed-length blocks (blk); each block yields partial record lengths (len) that are combined across block and split boundaries into whole records]
Check length of records with FixedLengthInputFormat (2)
The same steps applied to the file with one gigantic line:
① Read blocks of fixed length, except the last one
② Split blocks by \n and compute the length of each record
③ Combine the last part of the previous block with the first part of the next block
④ Group splits by file
⑤ Sort splits by reading position
⑥ Combine the last part of the previous block with the first part of the next block
[Diagram: an HDFS file whose single record exceeds 2GB spans Split-1 and Split-2; every fixed-length block (blk) contributes a partial length (len), and the combined lengths give the full record length without ever materializing the line]
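Putting the six steps together: a minimal Spark sketch for a single file. The names, the 1 MB block width, and the driver-side stitching are illustrative assumptions, not the author's exact code (see the linked source for that):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

// ① read the file as fixed-length blocks instead of lines
val conf = new Configuration(sc.hadoopConfiguration)
FixedLengthInputFormat.setRecordLength(conf, 1 << 20)  // 1 MB blocks (assumed width)
// NB: FixedLengthInputFormat expects the file size to be a multiple of the
// record length; step ① ("except the last one") implies the trailing partial
// block is handled separately, which this sketch omits.

val blocks = sc.newAPIHadoopFile("hdfs:///user/leo/test/nolr2G-1.txt",
  classOf[FixedLengthInputFormat], classOf[LongWritable], classOf[BytesWritable], conf)

// ② split each block by \n and keep only the piece lengths; the first and
// last pieces may be fragments of records that cross block boundaries
// (char == byte assumed here, which holds for this ASCII test data)
val pieceLens = blocks.map { case (offset, bytes) =>
  (offset.get, new String(bytes.copyBytes, "UTF-8").split("\n", -1).map(_.length.toLong))
}

// ⑤ sort blocks by reading position, then ③/⑥ combine the last piece of each
// block with the first piece of the next one to rebuild whole-record lengths
val ordered = pieceLens.sortByKey().values.collect()  // small: one entry per block
val recordLens = ordered.reduceLeft { (acc, next) =>
  (acc.init :+ (acc.last + next.head)) ++ next.tail
}

recordLens now holds the length of every line, yet no single buffer ever held more than one 1 MB block, which is how a 2GB line gets measured without the OutOfMemoryError. Step ④ (group splits by each file) would key pieceLens by file path when reading a whole directory.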
Validation (1)
$ hdfs dfs -ls /user/leo/test/
-rw-r--r-- 1 leo leo 2147483669 2015-10-06 07:40 /user/leo/test/nolr2G-1.txt
-rw-r--r-- 1 leo leo 2147483669 2015-10-06 09:19 /user/leo/test/nolr2G-2.txt
-rw-r--r-- 1 leo leo 2147483669 2015-10-07 00:53 /user/leo/test/nolr2G-3.txt
$ hdfs dfs -cat /user/leo/test/nolr2G-1.txt
01234567890123456789...........01234567890123456789
scala> recordLenOfFile.map{ case (path, stat) => f"[${path}][${stat.toString()}]"}.collect().foreach(println)
...
INFO TaskSetManager: Finished task 47.0 in stage 9.0 (TID 223) in 16 ms on localhost (48/48)
...
[hdfs://sandbox.hortonworks.com:8020/user/leo/test/nolr2G-1.txt][stats: (count: 1, mean: 2147483648.000000, stdev: 0.000000, max: 2147483648.000000, min: 2147483648.000000), NaN: 0]
[hdfs://sandbox.hortonworks.com:8020/user/leo/test/nolr2G-2.txt][stats: (count: 1, mean: 2147483648.000000, stdev: 0.000000, max: 2147483648.000000, min: 2147483648.000000), NaN: 0]
[hdfs://sandbox.hortonworks.com:8020/user/leo/test/nolr2G-3.txt][stats: (count: 1, mean: 2147483648.000000, stdev: 0.000000, max: 2147483648.000000, min: 2147483648.000000), NaN: 0]
(Each test file is 2GB + 10 bytes + “\n” + 10 bytes, i.e. 2147483669 bytes.)
The output shows, for each file, how many lines it contains and statistics for the line lengths. Here we find one line of 2147483648 chars in each file.
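The per-file stats shown above can be produced with Spark's StatCounter. A hypothetical sketch, assuming a lengthsByFile RDD of (path, record length) pairs from the stitching step (the deck's actual wrapper evidently also tracks a NaN count):

import org.apache.spark.rdd.RDD
import org.apache.spark.util.StatCounter

// lengthsByFile: RDD[(String, Long)] of (file path, record length), assumed input
def statsPerFile(lengthsByFile: RDD[(String, Long)]): RDD[(String, StatCounter)] =
  lengthsByFile.aggregateByKey(new StatCounter())(
    (stat, len) => stat.merge(len.toDouble), // fold one record length into the stats
    (a, b) => a.merge(b))                    // merge partial stats across partitions

This is also where step ④ (group splits by each file) lands: aggregateByKey groups record lengths by file path before computing count/mean/stdev/max/min.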
Validation (2)
$ hdfs dfs -ls /user/leo/test.2/
-rw-r--r-- 1 leo leo 5258688 2015-10-07 06:56 test.2/all-bible-1.txt
-rw-r--r-- 1 leo leo 5258688 2015-10-07 06:56 test.2/all-bible-2.txt
-rwxr-xr-x 1 leo leo 5258688 2015-10-06 02:12 test.2/all-bible-3.txt
$ hdfs dfs -cat test.2/all-bible-1.txt | wc -l
117154
scala> recordLenOfFile.map{ case (path, stat) => f"[${path}][${stat.toString()}]"}.collect().foreach(println)
...
INFO TaskSetManager: Finished task 0.0 in stage 13.0 (TID 233) in 115 ms on localhost (3/3)
...
[hdfs://sandbox.hortonworks.com:8020/user/leo/test.2/all-bible-2.txt][stats: (count: 116854, mean: 43.866937, stdev: 30.647162, max: 88.000000, min: 1.000000), NaN: 0]
[hdfs://sandbox.hortonworks.com:8020/user/leo/test.2/all-bible-3.txt][stats: (count: 116854, mean: 43.866937, stdev: 30.647162, max: 88.000000, min: 1.000000), NaN: 0]
[hdfs://sandbox.hortonworks.com:8020/user/leo/test.2/all-bible-1.txt][stats: (count: 116854, mean: 43.866937, stdev: 30.647162, max: 88.000000, min: 1.000000), NaN: 0]