
MapR Data Analyst Learning Guide

Selvaraaju Murugesan

November 18, 2017


Pig

Scripts are written in Pig Latin

Primarily used for automating ETL workflows for unstructured data

Uses HCatalog to get schema from Hive Metastore

bytearray is the default data type in Pig

Datetime data have to be loaded as chararray and then converted to datetime format with ToDate()
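The load-as-text-then-convert pattern can be illustrated outside Pig. The sketch below is a Python analogy, not Pig Latin: the raw field arrives as a string and is converted explicitly with a format pattern. The sample value and the helper name `to_date` are hypothetical.

```python
from datetime import datetime

def to_date(raw: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> datetime:
    """Convert a string field to a datetime, analogous in spirit to Pig's ToDate()."""
    return datetime.strptime(raw, fmt)

raw_field = "2017-11-18 09:30:00"   # hypothetical value, loaded as plain text
parsed = to_date(raw_field)
print(parsed.year, parsed.month, parsed.day)
```

In Pig itself the conversion takes a Java-style pattern instead, for example ToDate(ts, 'yyyy-MM-dd HH:mm:ss'), where ts is an assumed chararray field name.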


Pig Entities

Field -> single piece of data

Map -> set of key-value pairs

['name'#'abc','age'#99]

Tuple -> ordered collection of fields

('abc',123,-1.2)

Bag

Unordered collection of tuples

Relation

An outer bag

May contain bags / tuples

Primary unit/entity in Pig
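The nesting of these entities may be easier to see with familiar containers. The following is a rough Python analogy, not Pig syntax; all sample values beyond those on the slide are hypothetical.

```python
# Python stand-ins for Pig's data model (an analogy, not Pig syntax).
field = "abc"                                  # field: a single piece of data
record = ("abc", 123, -1.2)                    # tuple: ordered collection of fields
attrs = {"name": "abc", "age": 99}             # map: set of key/value pairs
bag = [("abc", 123, -1.2), ("xyz", 7, 0.5)]    # bag: collection of tuples, order not guaranteed
# relation: an outer bag; its tuples may themselves hold nested bags
relation = [("group1", bag), ("group2", [("pqr", 1, 2.0)])]

for key, inner_bag in relation:
    print(key, len(inner_bag))
```

A Python list is ordered, so it only approximates a bag; Pig makes no promise about the order of tuples inside one.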


Pig Storage

Default storage function is PigStorage

Other types: JsonStorage, HBaseStorage, etc.

Default file format is tab-delimited


Pig ETL

Load data using LOAD operator

LOAD 'filepath' USING <storage> AS <schema>

DUMP <Relation>

Triggers a MapReduce job

Similar to select * from table

LIMIT

tmp = LIMIT <Relation> 10;

DUMP tmp;

DESCRIBE

Prints the schema of a relation

Does not trigger a MapReduce job

ILLUSTRATE

Displays the schema of a relation and a sample of the data

Triggers a MapReduce job
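As a rough illustration of what LOAD with a schema, DESCRIBE and LIMIT do, here is a small Python sketch. It is an analogy only: the input text, the schema and the function names are hypothetical, and nothing here runs on a cluster.

```python
import csv
import io

# Hypothetical tab-delimited input, the default layout PigStorage expects.
RAW = "alice\t30\nbob\t25\ncarol\t41\n"

# Hypothetical schema: (field name, converter) pairs, like AS (name:chararray, age:int).
SCHEMA = [("name", str), ("age", int)]

def load(text, schema):
    """Parse tab-delimited text into a list of tuples typed by the schema."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [tuple(conv(value) for (_, conv), value in zip(schema, row)) for row in reader]

def describe(schema):
    """Return the schema as text without touching the data, like DESCRIBE."""
    return ", ".join(f"{name}:{conv.__name__}" for name, conv in schema)

def limit(relation, n):
    """Keep at most n tuples, like LIMIT."""
    return relation[:n]

people = load(RAW, SCHEMA)
print(describe(SCHEMA))
print(limit(people, 2))
```

Note that `describe` never reads the data, which mirrors why DESCRIBE needs no MapReduce job while DUMP does.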


Pig ETL

EXPLAIN <Relation>

Advanced debugging tool that explains the dataflow, including MapReduce jobs

FOREACH <relationname> GENERATE <columnname>

Similar to selecting a few columns from a table

Pig relations are not persistent; use STORE to save a relation into storage using PigStorage

Shared UDFs are stored in Piggy Bank

UDFs can be written in Python, Java, Ruby, etc.


Pig Data Manipulation

DISTINCT

unique rows from a relation

FILTER

filters a relation based on a condition

SAMPLE

returns a random sample

UNION

concatenates two or more relations

JOIN

combines two or more relations

GROUP

builds a nested structure from a single relation

COGROUP

builds a nested structure from multiple relations

ORDER, RANK, FLATTEN
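The difference between GROUP and COGROUP is the number of input relations that get nested under each key. A minimal Python sketch of that idea follows; it is an analogy, not Pig Latin, and the relations `pets` and `homes` are hypothetical.

```python
from collections import defaultdict

# Hypothetical relations: (owner, pet) and (owner, city).
pets = [("ann", "cat"), ("bob", "dog"), ("ann", "fish")]
homes = [("ann", "Oslo"), ("cy", "Rome")]

def group(relation, key_index=0):
    """Nest the tuples of one relation under their key, like GROUP ... BY."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)

def cogroup(left, right, key_index=0):
    """Nest the tuples of two relations under a shared key, like COGROUP."""
    left_groups, right_groups = group(left, key_index), group(right, key_index)
    keys = sorted(set(left_groups) | set(right_groups))
    # Each key maps to one bag per input relation; a bag may be empty.
    return {k: (left_groups.get(k, []), right_groups.get(k, [])) for k in keys}

print(group(pets))
print(cogroup(pets, homes))
```

Unlike a JOIN, the cogrouped result keeps keys that appear in only one input and leaves the other bag empty.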


Hive

First SQL on Hadoop

Uses HQL (Hive Query Language)

Uses MapReduce under the hood to execute HQL queries

Very reliable

Hive is not ANSI SQL Compliant

Hive can skip reads and works with different file formats

Datatypes

Supports major data types available in SQL

Does not support numeric and money datatypes


Hive Internal Database

Database

Collection of tables

The default database is used if none is specified

Syntax : CREATE DATABASE <database name>

DROP DATABASE <database name> drops the database only if it is empty

DROP DATABASE <database name> CASCADE drops a database that is not empty

USE <database name> sets the current database for subsequent HQL statements


Hive External Databases and Tables

External database

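Syntax : CREATE DATABASE <databasename> LOCATION <path>

External tables are declared with CREATE EXTERNAL TABLE <tablename> (<columns>) LOCATION <path>

The Hive Metastore indexes the content, and all table data is physically stored in the cluster

Dropping a table deletes its schema from the Hive Metastore; for a managed (internal) table the data is also moved to trash if trash is enabled in HDFS

Dropping a partition of an external table removes only the metadata; the data still physically exists on disk

Use external databases and tables, as access control can then be done at the "data layer" (cluster)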

Hive uses log4j for logging

logs are stored at /tmp/<userid>/hive.log


Hive Commands

Hive version

hive --version

Run HQL from a file

hive -f 'file.hql'

Execute HQL query

hive -e 'hive query'

hive -e 'select * from table1'

Hive functions

static functions, UDF, UDAF, etc
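The command-line forms above are often wrapped in a script. The following Python sketch only builds the argument lists; the helper name `hive_command` is hypothetical, and actually running the command assumes a `hive` executable on the PATH.

```python
import shlex

def hive_command(query=None, script=None, version=False):
    """Build the argument list for one of the hive CLI forms listed above."""
    if version:
        return ["hive", "--version"]
    if script is not None:
        return ["hive", "-f", script]
    if query is not None:
        return ["hive", "-e", query]
    raise ValueError("provide a query, a script path, or version=True")

# table1 and file.hql are the placeholder names used above.
cmd = hive_command(query="select * from table1")
print(shlex.join(cmd))
# To actually run it (requires hive on PATH): subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the query itself contains quotes.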


Hive Query Optimization

Supports star schema join

Supports MAP join

Enabled by default

Small tables are loaded into memory

Table size limit is 25MB

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
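The idea behind a map join can be sketched in a few lines: the small table is held in an in-memory hash table and the large table is streamed past it once. The Python below is a conceptual model under that assumption, not Hive's implementation; the table contents and function names are hypothetical.

```python
SMALL_TABLE_LIMIT_BYTES = 25 * 1000 * 1000  # the 25MB limit noted above, as an assumed constant

def can_map_join(small_table_bytes, limit=SMALL_TABLE_LIMIT_BYTES):
    """Decide whether the smaller table is small enough to hold in memory."""
    return small_table_bytes <= limit

def map_join(big_rows, small_rows, big_key=0, small_key=0):
    """Join by hashing the small table in memory and streaming the big one."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    for row in big_rows:                      # single pass, no shuffle or sort
        for match in lookup.get(row[big_key], []):
            yield row + match

# Hypothetical star-schema data: a fact table and a small dimension table.
sales = [(1, 9.99), (2, 5.00), (1, 3.50), (3, 1.25)]
products = [(1, "pen"), (2, "ink")]

print(list(map_join(sales, products)))
```

Because the lookup happens on the map side, no reduce phase is needed for the join, which is why the small table must fit in memory.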


Hive-site.xml

Contains configuration variables

e.g. hive.partition.pruning, hive.exec.plan

Location

/opt/mapr/hive/<hive version>/conf

Reference: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration#AdminManualConfiguration-hive-site.xmlandhive-default.xml.template
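hive-site.xml follows the usual Hadoop layout of property elements with a name and a value, so its settings can be inspected with a few lines of Python. This is a minimal sketch: the property names are the ones listed above, while the values and the helper name `read_properties` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content; the property names are the ones listed above,
# the values are made up for illustration.
SAMPLE = """<configuration>
  <property>
    <name>hive.partition.pruning</name>
    <value>strict</value>
  </property>
  <property>
    <name>hive.exec.plan</name>
    <value></value>
  </property>
</configuration>"""

def read_properties(xml_text):
    """Return a {name: value} dict from hive-site.xml style content."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): (prop.findtext("value") or "")
        for prop in root.iter("property")
    }

settings = read_properties(SAMPLE)
print(settings)
```

For a real installation the same function could be fed the text of the file found under /opt/mapr/hive/<hive version>/conf.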

