MapR Data Analyst Learning Guide
Selvaraaju Murugesan
November 18, 2017
Selvaraaju Murugesan MapR Data Analyst Learning Guide
Pig
Scripts are written in Pig Latin
Primarily used for automating ETL workflows for unstructured
data
Uses HCatalog to get schema from Hive Metastore
bytearray is the default data type in Pig
Datetime data has to be loaded as chararray and converted
to the datetime type with ToDate()
Pig Entities
Field -> single piece of data
Map -> set of key value pairs
['name'#'abc','age'#99]
Tuple -> ordered collection of fields
('abc',123,-1.2)
Bag
unordered collection of tuples
Relation
An outer bag
May contain bags / tuples
Primary unit/entity in Pig
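As a rough analogy only (Python, not Pig Latin), the entities above can be modeled with built-in Python types. The variable names and the helper `count_nested` are made up for illustration, and Pig's own types differ in detail.

```python
# Illustrative Python analogy for Pig's data model (this is not Pig Latin).
field = "abc"                            # field: a single piece of data
pig_map = {"name": "abc", "age": 99}     # map: set of key/value pairs
pig_tuple = ("abc", 123, -1.2)           # tuple: ordered collection of fields

# bag: unordered collection of tuples (a list here; its order carries no meaning)
bag = [("abc", 123, -1.2), ("def", 456, 3.4)]

# relation: an outer bag whose tuples may themselves contain bags or tuples
relation = [
    ("group_a", bag),                    # a tuple holding a nested bag
    ("group_b", [("xyz", 7, 0.5)]),
]


def count_nested(rel):
    """Return {key: number of tuples in the nested bag} for each outer tuple."""
    return {key: len(inner) for key, inner in rel}


print(count_nested(relation))
```

The nesting in `relation` is the same shape that grouping operators produce: one outer tuple per key, with a bag of the matching tuples inside it.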
Pig Storage
Default storage is PigStorage
Other types: JSON storage, HBase storage, etc.
Default file format is tab-delimited
Pig ETL
Load data using LOAD operator
LOAD 'filepath' USING <storage> AS <schema>
DUMP <Relation>
Triggers a MapReduce job
Similar to select * from table
LIMIT
tmp = LIMIT <Relation> 10;
DUMP tmp;
DESCRIBE
prints schema of relation
does not trigger MapReduce job
ILLUSTRATE
displays schema of relation and a sample of data
triggers a MapReduce job
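To make the LOAD / LIMIT / DUMP sequence above concrete, here is a small Python sketch of the same idea: read tab-delimited text, apply a schema, keep a few tuples, and print them. The functions `load` and `limit` and the sample data are hypothetical stand-ins, not Pig APIs, and no MapReduce job is involved.

```python
import csv
import io


def load(text, schema, delimiter="\t"):
    """Stand-in for LOAD ... AS <schema>: split delimited lines into typed tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [tuple(cast(value) for cast, value in zip(schema, row)) for row in reader]


def limit(relation, n):
    """Stand-in for LIMIT: keep at most n tuples."""
    return relation[:n]


# Hypothetical tab-delimited input, matching the default delimiter noted above.
sample = "abc\t99\t1.5\ndef\t42\t-1.2\nghi\t7\t0.0\n"

people = load(sample, schema=(str, int, float))
tmp = limit(people, 2)
print(tmp)  # plays the role of DUMP tmp
```

Without a schema every value would stay an untyped string, which loosely mirrors Pig falling back to its default data type when no schema is given.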
Pig ETL
EXPLAIN <query>
advanced debugging tool that explains the data flow, including MapReduce jobs
FOREACH <relationname> GENERATE <columnname>
similar to selecting few columns from a table
Pig relations are not persistent; use STORE to save a relation
into Pig storage
Community-contributed UDFs are stored in Piggy Bank
UDFs can be written in Python, Java, Ruby, etc.
Pig Data Manipulation
DISTINCT
unique rows from a relation
FILTER
filters a relation based on a condition
SAMPLE
returns a random sample
UNION
concatenates two or more relations
JOIN
combines two or more relations
GROUP
builds a nested structure from a single relation
COGROUP
builds a nested structure from multiple relations
ORDER, RANK, FLATTEN
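The difference between GROUP and COGROUP is easiest to see on tiny data. The Python sketch below imitates their output shape only; `group`, `cogroup`, and the sample relations are illustrative names, not Pig functions, and the results are sorted by key purely to make them easy to read.

```python
from collections import defaultdict


def group(relation, key_index):
    """Sketch of GROUP: one (key, bag) tuple per distinct key of a single relation."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    return sorted(bags.items())


def cogroup(left, right, left_key, right_key):
    """Sketch of COGROUP: one (key, left_bag, right_bag) tuple per key in either relation."""
    left_bags, right_bags = defaultdict(list), defaultdict(list)
    for row in left:
        left_bags[row[left_key]].append(row)
    for row in right:
        right_bags[row[right_key]].append(row)
    keys = sorted(set(left_bags) | set(right_bags))
    return [(k, left_bags.get(k, []), right_bags.get(k, [])) for k in keys]


owners = [("alice", "cat"), ("bob", "dog"), ("alice", "fish")]
friends = [("alice", "carol"), ("dave", "erin")]

grouped = group(owners, 0)
cogrouped = cogroup(owners, friends, 0, 0)
print(grouped)
print(cogrouped)
```

A key that appears in only one input still shows up in the cogrouped result, paired with an empty bag for the other input.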
Hive
First SQL on Hadoop
Uses HQL (Hive Query Language)
Uses MapReduce under the hood to execute HQL queries
Very reliable
Hive is not ANSI SQL compliant
Hive can skip reads and works with different file formats
Data types
supports major data types available in SQL
does not support the numeric and money data types
Hive Internal Database
Database
collection of tables
default database is used if not specified
Syntax: CREATE DATABASE <database name>
DROP DATABASE <database name> drops the database only if it is empty
Use DROP DATABASE <database name> CASCADE to drop a database that is not empty
USE <database name> sets the current database for subsequent HQL
Hive External Databases and Tables
External database
Syntax: CREATE DATABASE <databasename> LOCATION <path>
For tables: CREATE EXTERNAL TABLE <tablename> (<columns>) LOCATION <path>
Hive Metastore indexes the content and all table data are physically stored in the cluster
DROP TABLE deletes the schema from the Hive Metastore; for a managed
table the data is moved to trash if trash is enabled in HDFS
DROP PARTITION on an external table removes only the metadata; the data still physically exists on disk
Use external databases and tables, as access control can be
done at the "data layer" (cluster)
Hive uses log4j for logging
logs are stored at /tmp/<userid>/hive.log
Hive Commands
Hive version
hive --version
Run HQL from a file
hive -f 'file.hql'
Execute HQL query
hive -e 'hive query'
hive -e 'select * from table1'
Hive functions
static functions, UDF, UDAF, etc
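When these commands are driven from a script, it helps to build the argument list explicitly instead of concatenating strings. The sketch below covers only the `-e` and `-f` options listed above; `hive_command` is a hypothetical helper, and nothing is executed here. Passing the list to something like `subprocess.run` would assume a `hive` binary on the PATH.

```python
import shlex


def hive_command(query=None, script=None):
    """Build the argument list for `hive -e <query>` or `hive -f <script>`."""
    if (query is None) == (script is None):
        raise ValueError("pass exactly one of query or script")
    if query is not None:
        return ["hive", "-e", query]
    return ["hive", "-f", script]


argv = hive_command(query="select * from table1")
print(shlex.join(argv))  # shell-quoted form of the command, for logging
```

Keeping the query as a single list element avoids shell-quoting problems when the HQL itself contains quotes or wildcards.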
Hive Query Optimization
Supports star schema join
Supports MAP join
enabled by default
small tables loaded into memory
table size limit is 25MB
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
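The idea behind a map join can be sketched in a few lines of Python: the small table is loaded into an in-memory hash map, and the large table is streamed past it, so no shuffle of the large table is needed. This is a conceptual illustration only, not Hive's implementation; `map_join` and the sample tables are made-up names.

```python
def map_join(small_table, large_table, small_key, large_key):
    """Inner join that keeps the small table in memory and streams the large one."""
    lookup = {}
    for row in small_table:                      # build phase: hash the small table
        lookup.setdefault(row[small_key], []).append(row)
    joined = []
    for row in large_table:                      # probe phase: one pass over the large table
        for match in lookup.get(row[large_key], []):
            joined.append(row + match)
    return joined


countries = [("AU", "Australia"), ("IN", "India")]    # small dimension table
orders = [(1, "AU"), (2, "IN"), (3, "AU"), (4, "NZ")]  # larger fact table

result = map_join(countries, orders, small_key=0, large_key=1)
print(result)
```

The size limit on the small table exists because the whole lookup structure has to fit in memory on each task.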
Hive-site.xml
contains configuration variables
e.g. hive.partition.pruning, hive.exec.plan
Location
/opt/mapr/hive/<hive version>/conf
Reference: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration#AdminManualConfiguration-hive-site.xmlandhive-default.xml.template