retrieving big data for the non developer
![Page 1: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/1.jpg)
Retrieving Big Data
For the non-developer
![Page 2: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/2.jpg)
Intended Audience
People who do not write code
But don’t want to wait for IT to bring them data
![Page 3: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/3.jpg)
Disclaimer
You will have to write code. Sorry...
![Page 4: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/4.jpg)
Worth Noting
A common objection, “But I’m not a developer”
Coding does not make you a developer any more than patching some drywall makes you a carpenter
![Page 5: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/5.jpg)
Agenda
● The minimum you need to know about Big Data (Hadoop)
  o Specifically, HBase and Pig
● How you can retrieve data in HBase with Pig
  o How to use Python with Pig to make querying easier
![Page 6: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/6.jpg)
One Big Caveat
● We are not talking about analysis
● Analysis is hard
● Learning code and trying to understand an analytical approach is really hard
● Following a straightforward Pig tutorial is better than a boring lecture
![Page 7: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/7.jpg)
Big Data in One Slide (oh boy)
● Today, Big Data == Hadoop
● Hadoop is both a distributed file system (HDFS) and an approach to messing with data on the file system (MapReduce)
  o HBase is a popular database that sits on top of HDFS
  o Pig is a high level language that makes messing with data on HDFS or in HBase easier
![Page 8: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/8.jpg)
HBase in one slide
● HBase = Hadoop Database, based on Google’s Bigtable
● Column-oriented database – basically one giant table
![Page 9: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/9.jpg)
Pig in one slide
● A data flow language we will use to write queries against HBase
● Pig is not the developer’s solution for retrieving data from HBase, but it works well enough for the BI analyst (and, of course, we aren’t developers)
![Page 10: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/10.jpg)
Pig is easier...Not Easy
● If you have no coding background, Pig will not be easy
● But it’s the best of a bad set of options right now
● Not hating on SQL-on-Hadoop providers, but with SQL you tell the computer what you want, which quickly gets complicated
![Page 11: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/11.jpg)
Here’s our HBase table
![Page 12: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/12.jpg)
Let’s dive in - Load
raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name', '-loadKey true -limit 1')
    AS (id:chararray, first_name:chararray, last_name:chararray);
You have to specify each field and its type in order to load it
![Page 13: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/13.jpg)
Response is as expected
'info:first_name info:last_name' AS (first_name:chararray, last_name:chararray);
Will return a first name and last name as separate fields, e.g., “Steve”, “Buscemi”
![Page 14: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/14.jpg)
If you can write a Vlookup()
=VLOOKUP(C34, Z17:AZ56, 17, FALSE)
You can write a load statement in Pig.
Both are equally esoteric.
![Page 15: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/15.jpg)
But what if we don’t know the fields?
● Suppose we have a column family of friends
● Each record will contain zero to many friends, e.g., friend_0: “John”, friend_1: “Paul”
![Page 16: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/16.jpg)
The number of friends is variable
● There could be thousands of friends per row
● And we cannot specify “friend_5” because there is no guarantee that each record has five friends
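To make the problem concrete, here is a small Python sketch (plain dictionaries, not HBase itself) of two rows with different numbers of friend columns. The row keys, the names, and the `get_column` helper are all made up for illustration.

```python
# Hypothetical rows: each row key maps to whatever columns that row happens to have.
rows = {
    "row1": {"first_name": "Steve", "friend_0": "John", "friend_1": "Paul"},
    "row2": {"first_name": "Willie"},  # this row has no friends stored
}

def get_column(row_key, column):
    """Return the value of a column, or None if this row does not have it."""
    return rows[row_key].get(column)

# Asking for a fixed column name only works for rows that happen to have it.
for row_key in sorted(rows):
    print(row_key, get_column(row_key, "friend_1"))
```

The second row simply has nothing under `friend_1`, which is why a fixed list of column names cannot describe this table.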
![Page 17: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/17.jpg)
This is common...
● NoSQL databases are known for flexible schemas and flat table structures
● Unfortunately, the way Pig handles this problem utterly sucks...
![Page 18: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/18.jpg)
Loading unknown friends
raw = LOAD 'hbase://SampleTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
Now we have info:friends_* that is represented as a “map”
![Page 19: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/19.jpg)
A map is just a collection of key-value pairs
● That look like this: friend_1# ‘Steve’, friend_2# ‘Willie’
● They are very similar to Python dictionaries...
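For readers who have not met a Python dictionary, a minimal sketch of the same two friends as a dict. The variable names are only for illustration.

```python
# The map friend_1#Steve, friend_2#Willie written as a Python dict.
friends = {"friend_1": "Steve", "friend_2": "Willie"}

# Looking up one value requires knowing its key in advance.
print(friends["friend_1"])

# Python can also walk every pair without knowing the keys.
pairs = sorted(friends.items())
for key, value in pairs:
    print(key, value)
```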
![Page 20: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/20.jpg)
Here’s why they suck
● We can’t iterate over them
● To access a value, in this case a friend’s name, I have to provide the specific key, e.g., friend_5, to receive the name of the fifth friend
![Page 21: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/21.jpg)
But I thought you said we didn’t know the number of friends?
● You are right – Pig expects us to provide the specific value of something unknown
● If only there were some way to iterate over a collection of key-value pairs…
![Page 22: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/22.jpg)
Enter Python
● Pig may not allow you to iterate over a map, but it does allow you to write User-Defined Functions (UDFs) in Python
● In a Python UDF we can read in a map as a Python dict and return key-value pairs
![Page 23: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/23.jpg)
Python UDF for Pig
@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    # Return every (key, value) pair in the map as a bag of tuples
    return map_dict.items()
We are passing in a map, e.g., “Friend_1#Steve, Friend_2#Willie”, and manipulating a Python dict, e.g., {‘Friend_1’: ‘Steve’, ‘Friend_2’: ‘Willie’}
Based on blog post by Chase Seibert
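If you want to try the function before running it through Pig, something like the sketch below should work in an ordinary Python session. Inside Pig the `outputSchema` decorator is supplied for you; the stand-in version here is an assumption added only so the file runs on its own.

```python
def outputSchema(schema):
    """Stand-in decorator (assumption): lets the UDF run outside Pig."""
    def wrap(func):
        func.schema = schema  # remember the schema string, do nothing else
        return func
    return wrap

@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    # list(...) gives the same kind of result in Python 2 and Python 3
    return list(map_dict.items())

result = bag_of_tuples({"Friend_1": "Steve", "Friend_2": "Willie"})
print(sorted(result))
```

Each tuple in the result is one key-value pair, which is the shape Pig needs in order to turn a map into rows.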
![Page 24: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/24.jpg)
We can add loops and logic too
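@outputSchema("status:chararray")
def get_steve(map_dict):
    # .items() yields (key, value) pairs; looping over the dict alone yields only keys
    for key, value in map_dict.items():
        # Note: both branches return, so only the first pair is examined
        if value == 'Steve':
            return "I hate that guy"
        else:
            return value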
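A function that returns from inside the loop stops at the first pair it sees. If the goal is to check every friend, one possible variant is sketched below. The name `has_friend_named` is hypothetical, and to use it from Pig it would also need an `@outputSchema` decorator like the ones above.

```python
def has_friend_named(map_dict, name):
    """Return True if any value in the map equals name (hypothetical helper)."""
    for key, value in map_dict.items():
        if value == name:
            return True
    # Only reached after every pair has been checked
    return False

friends = {"friend_1": "Willie", "friend_2": "Steve"}
found_steve = has_friend_named(friends, "Steve")
found_john = has_friend_named(friends, "John")
print(found_steve, found_john)
```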
![Page 25: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/25.jpg)
Or if you just want the data in Excel
register 'sample_udf.py' using jython as my_udf;
raw = LOAD 'hbase://peeps'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
    AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
clean_table = FOREACH raw GENERATE id, FLATTEN(my_udf.bag_of_tuples(friends));
dump clean_table;
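To picture what the FOREACH with FLATTEN produces, here is a rough Python imitation, not Pig itself. The sample rows and the `flatten_rows` helper are invented for illustration, and the sketch assumes that a row with an empty map yields no output rows.

```python
# Hypothetical rows as (id, friends-map) pairs, standing in for the loaded data.
raw = [
    ("1", {"friends_0": "John", "friends_1": "Paul"}),
    ("2", {"friends_0": "Steve"}),
    ("3", {}),  # a row with no friends
]

def flatten_rows(rows):
    """Emit one (id, key, value) tuple per entry in each row's map."""
    flat = []
    for row_id, friends in rows:
        for key, value in sorted(friends.items()):
            flat.append((row_id, key, value))
    return flat

clean_table = flatten_rows(raw)
for record in clean_table:
    print("\t".join(record))
```

The result is one row per friend, with the id repeated, which is a shape Excel handles comfortably.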
![Page 26: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/26.jpg)
Final Thought
Make Your Big Data Small
● Prototype your Pig scripts on your local file system
  o Download some data to your local machine
  o Start your Pig shell from the command line: pig -x local
  o Load - Transform - Dump
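If you have no data downloaded yet, one option is to make a tiny file by hand. The sketch below writes a tab-separated sample file, which is the delimiter Pig's default loader expects. The file name and the rows are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up sample rows: id, first_name, last_name
sample_rows = [
    ("1", "Steve", "Buscemi"),
    ("2", "Willie", "Nelson"),
]

# Hypothetical file name, written to the system temp directory
path = os.path.join(tempfile.gettempdir(), "peeps_sample.tsv")

with open(path, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(sample_rows)

# Read the file back to confirm what was written
with open(path) as handle:
    lines = handle.read().splitlines()

print(path)
# In the local Pig shell you could then try, with the printed path filled in:
#   raw = LOAD '<path>' AS (id:chararray, first_name:chararray, last_name:chararray);
#   dump raw;
```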
![Page 27: Retrieving big data for the non developer](https://reader036.vdocuments.us/reader036/viewer/2022062515/55c7f907bb61eb0b648b4707/html5/thumbnails/27.jpg)
Notes
Pig Tutorials
● Excellent video on Pig
● Mortar Data introduction to Pig
● Flatten HBase column with Python
Me
● codingcharlatan.com
● @GusCavanaugh