Apache Bigtop
DESCRIPTION
Apache Bigtop. Week 9: Integration Testing, M/R Coding.
Administration
• Yahoo field trip: how Hadoop components are used in a production environment. Attendees must be registered as working group members/B/C members.
• MSFT Azure talk; volunteers needed as tech leads to port Bigtop to Azure.
• Roman’s Yahoo HUG presentation next week
• Move to the ground floor next week?
• Machine Learning Solution Architect, 2/16
• List
Review from last time
• Hive/Pig/HBase data layer for integration tests
• HBase upgrade to 0.92
• JAVA_LIBRARY_PATH for the JVM to point to the .so native libs for Hadoop
• hadoop classpath debug to print out the classpath
• HBase 0.92 guesses where Hadoop is using HADOOP_HOME
• /etc/hostname screwed up on EC2
Bigtop Data Integration Layer: Hive, Pig, HBase
• Hive:
  – Create a separate Java project
  – Install Hive locally; verify you can run the command line: hive> show tables;
Hive Notes
• Hive has two configurations: embedded and server.
• To start the server:
  – Set HADOOP_HEAPSIZE to 1024 by copying hive-env.sh.template to hive-env.sh and uncommenting the HADOOP_HEAPSIZE setting.
  – source ~/hive-0.8.1/conf/hive-env.sh
  – Verify: echo $HADOOP_HEAPSIZE
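The copy-and-uncomment steps above can be sketched end to end. This is a minimal sketch in a throwaway directory: the template contents and paths here are stand-ins (the real file lives under ~/hive-0.8.1/conf/), but the mechanics are the same.

```shell
# Minimal sketch (illustrative paths): reproduce the copy-and-uncomment
# steps from the slide in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the shipped hive-env.sh.template, with the heap
# setting commented out as in the real template.
cat > hive-env.sh.template <<'EOF'
# export HADOOP_HEAPSIZE=1024
EOF

# Copy the template and uncomment HADOOP_HEAPSIZE in one pass.
sed 's/^# *export HADOOP_HEAPSIZE/export HADOOP_HEAPSIZE/' hive-env.sh.template > hive-env.sh

# Source it and verify the setting took effect.
. ./hive-env.sh
echo "$HADOOP_HEAPSIZE"
```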
Hive Run JDBC Commands
• Like connecting to a MySQL/Oracle/MSFT db
• Create a Connection, PreparedStatement, ResultSet
  Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
• The driver is in the Hive jar
Hive JDBC Prepared Statement
• The create table statement is different:
  Statement stmt = con.createStatement();
  String tableName = "testHiveDriverTable";
  stmt.executeQuery("drop table " + tableName);
  ResultSet res = stmt.executeQuery("create table " + tableName
      + " (key int, value string) ROW FORMAT delimited fields terminated by '\t'");
Pig, Using the Pig Util Class
• Util is not in pig-xxx.jar, only in the test package
• Local mode only; distributed mode not debugged
  Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse"));
  PigServer ps = new PigServer(ExecType.LOCAL);
  ps.setBatchOn();
Pig Example
String first = " nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); ";
String second = "B = foreach nyse generate symbol, dividends;";
String third = " store B into 'nyse'; ";
Pig Example
Util.registerMultiLineQuery(ps, first + second + third);
ps.executeBatch();
ps.shutdown();
Pig Example Output
12/02/11 14:07:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///
12/02/11 14:07:59 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
12/02/11 14:08:00 INFO rules.ColumnPruneVisitor: Columns pruned for nyse: $0, $2
12/02/11 14:08:01 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
12/02/11 14:08:01 INFO pigstats.ScriptState: Pig script settings are added to the job
12/02/11 14:08:01 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/02/11 14:08:02 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/02/11 14:08:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/02/11 14:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/02/11 14:08:02 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/02/11 14:08:02 INFO input.FileInputFormat: Total input paths to process : 1
12/02/11 14:08:02 INFO util.MapRedUtil: Total input paths to process : 1
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 0% complete
12/02/11 14:08:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/02/11 14:08:04 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/02/11 14:08:04 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0001
12/02/11 14:08:05 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/02/11 14:08:05 INFO mapred.LocalJobRunner:
12/02/11 14:08:05 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/02/11 14:08:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/dc/pig-0.9.2/nyse
12/02/11 14:08:07 INFO mapred.LocalJobRunner:
12/02/11 14:08:07 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/02/11 14:08:09 WARN pigstats.PigStatsUtil: Failed to get RunningJob for job job_local_0001
12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Detected Local mode. Stats reported below may be incomplete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Script Statistics:
Pig Example Output
HadoopVersion   PigVersion   UserId   StartedAt             FinishedAt            Features
0.20.205.0      0.9.2        dc       2012-02-11 14:08:01   2012-02-11 14:08:09   UNKNOWN
Success!
Job Stats (time in seconds):
JobId            Alias    Feature    Outputs
job_local_0001   B,nyse   MAP_ONLY   file:///Users/dc/pig-0.9.2/nyse,
Input(s):
Successfully read records from: "/Users/dc/programmingpig/data/NYSE_dividends"

Output(s):
Successfully stored records in: "file:///Users/dc/pig-0.9.2/nyse"
Pig Example Output
Job DAG:
job_local_0001
12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: Success!
M/R Pattern Design Review
• Why? A correctly designed M/R program that runs faster on a cluster than on an individual machine exercises all the components of a M/R cluster.
• Cluster experience with big data on AWS
• Important when migrating to production processes
• Design patterns in the sample HadoopExamples.jar:
  – WordCount
  – WordCount aggregation
  – MultiFileWordCount
From Lin & Dyer, Data-Intensive Text Processing with MapReduce
M/R Adding Array to Mapper Output
• WordCount design process
  – Mapper (contents of file, tokenize, output): <Object, Text, Text, IntWritable>. Object = file descriptor, Text = file line, Text = word, IntWritable = 1.
  – 2 steps to mapper design: 1) split up the input, then 2) output K,V to the reducer.
  – First step: copy mapper output K,V to the reducer. Reducer (collect mapper output): <Text, IntWritable, Text, IntWritable>. Second step: final output form.
• Replace the IntWritable with an ArrayList
• Why?
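To make the mapper/reducer contract concrete, here is a plain-Java sketch (no Hadoop dependencies; the class and method names are my own) that mimics WordCount: the "map" step tokenizes each line and emits (word, 1) pairs, and the "reduce" step sums the values per key, which is what the real Mapper and Reducer do across the cluster.

```java
import java.util.*;

// Plain-Java sketch of the WordCount mapper/reducer contract (no Hadoop).
public class WordCountSketch {

    // "Map" step: for each input line, tokenize and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum the values for each key. (The framework shuffles
    // and groups pairs by key between map and reduce; a TreeMap plays
    // that role here.)
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[]{"a b a", "b c"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs));  // {a=2, b=2, c=1}
    }
}
```

Replacing the IntWritable value with a list would let the mapper carry more than a bare count per key, which is what the slide's "Why?" is pointing at.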
Hadoop Avg Coding Demo
• Create an AvgPair object that implements Writable
• Create ivars: sum, count, key
• Auto-generate methods for the ivars
• Implement the write and readFields methods
• Put the ctors outside map()
• Run using the M/R plugin
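The steps above can be sketched without Hadoop on the classpath: Hadoop's Writable interface is defined in terms of java.io.DataOutput/DataInput, so a plain class with the same write/readFields pair shows the shape of the demo. The field names follow the slide's ivars; everything else (field types, the average() helper) is illustrative.

```java
import java.io.*;

// Sketch of the AvgPair demo: sum/count/key ivars plus the
// write/readFields pair that Hadoop's Writable interface requires.
// Plain java.io is used so this runs without Hadoop on the classpath.
public class AvgPair {
    private double sum;
    private long count;
    private String key;

    public AvgPair() { }  // no-arg ctor, needed for deserialization
    public AvgPair(String key, double sum, long count) {
        this.key = key; this.sum = sum; this.count = count;
    }

    // Auto-generated-style accessors for the ivars.
    public double getSum()   { return sum; }
    public long   getCount() { return count; }
    public String getKey()   { return key; }
    public double average()  { return count == 0 ? 0.0 : sum / count; }

    // Writable-style serialization: write the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeDouble(sum);
        out.writeLong(count);
    }

    // ...and read them back in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        sum = in.readDouble();
        count = in.readLong();
    }

    public static void main(String[] args) throws IOException {
        // Round trip through a byte buffer to check that write and
        // readFields agree on field order.
        AvgPair out = new AvgPair("ibm", 10.0, 4);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        out.write(new DataOutputStream(buf));

        AvgPair in = new AvgPair();
        in.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(in.getKey() + " avg=" + in.average());  // ibm avg=2.5
    }
}
```

Keeping the ctor call outside map() matters in the real demo because map() runs once per record; reusing one object avoids allocating per record.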
NXServer/NXClient
• Remote desktop to EC2
• 2 options:
  – 1) Use the prepared AMI by Eric Hammond
  – 2) Install NXServer
Prepared AMI
• http://aws.amazon.com/amis/Europe/1950
• US East AMI ID: ami-caf615a3
• Ubuntu 9.04 Jaunty Desktop with NX Server Free Edition
• Update the repos to newer versions of Ubuntu
• Create a new user
Clone the instance store if you can't get the NXServer to work
• The problem is that the EasyNXServer method uses an instance store. How do we clone it to an EBS volume?
• Create a blank volume; the default attach point is /dev/sdf
  sudo mkfs.ext3 /dev/sdf
  mkdir /newvolume
  sudo mount /dev/sdf /newvolume