PDF data loading without MR using Pig
Unstructured data (PDF) conversion and loading into HDFS
Objective : We received PDF data from the client. We have to convert the PDF data to text format and load it into HDFS so that we can generate reports based on the client's data.
Data Sample :
Step 1 : Copy the PDF file to any folder on the Linux box.
File name : InputData.pdf
Run the commands below in the Linux environment from that folder :
hadoop@hadoop:~/Testing$ pdftotext -layout -nopgbrk InputData.pdf
hadoop@hadoop:~/Testing$ cat InputData.txt | tr -s " " > Input1.txt
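To see what the tr -s " " step does, here is a small sketch on one made-up line (the record is hypothetical, not taken from the client's PDF). pdftotext -layout pads columns with runs of spaces, and tr -s " " squeezes each run into a single space so that a single-space delimiter can split the fields later. The sed call is an optional extra, not part of the original steps: it trims the one leading or trailing space that can remain, which would otherwise likely show up as an empty first field.

```shell
# Hypothetical line as pdftotext -layout might emit it (columns padded with spaces).
raw='  101   Ravi     Equity   Pune   5000   7000   3  '

# Squeeze every run of spaces into a single space, as in Step 1.
squeezed=$(printf '%s\n' "$raw" | tr -s " ")

# Optional (assumption, not in the original steps): trim the remaining edge spaces.
trimmed=$(printf '%s\n' "$squeezed" | sed 's/^ //; s/ $//')

# Brackets make the leading/trailing spaces visible.
printf '[%s]\n' "$squeezed"
printf '[%s]\n' "$trimmed"
```

If the real Input1.txt has lines starting with a space, adding the same sed expression to the Step 1 pipeline may be worth trying before loading the file.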
Step 2 : Copy Input1.txt into HDFS
hadoop@hadoop:~/Testing$ hadoop fs -copyFromLocal Input1.txt /rajesh
Step 3 : To view the file in HDFS
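The command for this step is not present in the text. A likely equivalent, assuming /rajesh is an existing HDFS directory and the file landed at /rajesh/Input1.txt as Step 4 expects, is to list the directory and print the first few rows. The command -v guard is only there so the snippet exits cleanly on a machine without the Hadoop client.

```shell
# HDFS directory used in Step 2 (taken from the original commands).
HDFS_DIR=/rajesh

if command -v hadoop >/dev/null 2>&1; then
  # List the directory to confirm Input1.txt was uploaded.
  hadoop fs -ls "$HDFS_DIR"
  # Print the first few rows of the uploaded file.
  hadoop fs -cat "$HDFS_DIR/Input1.txt" | head -n 5
else
  echo "hadoop not found on PATH; skipping HDFS check"
fi
```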
Step 4 : Run Pig in HDFS mode
grunt> A = LOAD '/rajesh/Input1.txt' using PigStorage(' ') as (Sid:int,Sname:chararray,Ttrading:chararray,Sloc:chararray,OBal:int,CBal:int,Frate:int);
2016-05-23 18:38:22,096 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> disHM = DISTINCT A;
grunt> orHM = ORDER disHM by Sid;
grunt> STORE orHM INTO '/rajesh/pigoutput' using PigStorage(',');
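As a rough local sanity check of what the Pig statements should produce, the same three operations can be approximated with standard Unix tools on a small sample. This is only a sketch: the rows are hypothetical, and Pig may differ in details (for example, a field that fails the int cast becomes null in Pig, and the order of rows with equal Sid is not guaranteed to match).

```shell
# Hypothetical rows in the same 7-column, space-delimited shape as Input1.txt.
rows='102 Anil Forex Delhi 3000 3500 2
101 Ravi Equity Pune 5000 7000 3
102 Anil Forex Delhi 3000 3500 2'

# sort -u            roughly DISTINCT: drop exact duplicate rows
# sort -t ' ' -k1,1n roughly ORDER ... by Sid: numeric sort on the first field
# tr ' ' ','         roughly STORE ... using PigStorage(','): comma-delimited output
expected=$(printf '%s\n' "$rows" | sort -u | sort -t ' ' -k1,1n | tr ' ' ',')

printf '%s\n' "$expected"
```

For this sample the result is two comma-separated rows, Sid 101 first and Sid 102 second, with the duplicate 102 row removed.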
To view the output generated by Pig :
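The command for this step is not present in the text either. A plausible way to inspect the result, assuming the STORE statement wrote to /rajesh/pigoutput, is to list the output directory and print its part files. The exact part file names depend on the job, so the sketch uses a glob (which hadoop fs expands itself) rather than a fixed name.

```shell
# Output directory named in the STORE statement of Step 4.
OUT_DIR=/rajesh/pigoutput

if command -v hadoop >/dev/null 2>&1; then
  # List the files Pig wrote (part files, plus a _SUCCESS marker if the job finished).
  hadoop fs -ls "$OUT_DIR"
  # Print the comma-delimited rows; the quoted glob is expanded by hadoop, not the local shell.
  hadoop fs -cat "$OUT_DIR/part-*"
else
  echo "hadoop not found on PATH; skipping output check"
fi
```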