Apache Bigtop
DESCRIPTION
Apache Bigtop. Week 9: Integration Testing, M/R Coding.
Administration
• Yahoo field trip: how Hadoop components are used in a production environment. Attendees must be registered as working group members/B/C members.
• MSFT Azure talk; volunteers needed as tech leads to port Bigtop to Azure.
• Roman’s Yahoo HUG presentation next week
• Move to the ground floor next week?
• Machine Learning Solution Architect, 2/16
• List
Review from last time
• Hive/Pig/HBase data layer for integration tests
• HBase upgrade to 0.92
• JAVA_LIBRARY_PATH for the JVM to point to the .so native libs for Hadoop
• hadoop classpath debug to print out the classpath
• HBase 0.92 guesses where Hadoop is using HADOOP_HOME
• /etc/hostname screwed up on EC2
Bigtop Data Integration Layer: Hive, Pig, HBase
• Hive:
  – Create a separate Java project
  – Install Hive locally; verify you can run the command line: hive> show tables;
Hive Notes
• Hive has two configurations: embedded and server.
• To start the server:
  – Set HADOOP_HEAPSIZE to 1024 by copying hive-env.sh.template to hive-env.sh and uncommenting the HADOOP_HEAPSIZE setting.
  – source ~/hive-0.8.1/conf/hive-env.sh
  – Verify: echo $HADOOP_HEAPSIZE
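The copy-and-uncomment steps above can be sketched end to end. This is a minimal sketch in a throwaway directory: the template contents and paths here are stand-ins (the real file lives under ~/hive-0.8.1/conf/), but the mechanics are the same.

```shell
# Minimal sketch (illustrative paths): reproduce the copy-and-uncomment
# steps from the slide in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the shipped hive-env.sh.template, with the heap
# setting commented out as in the real template.
cat > hive-env.sh.template <<'EOF'
# export HADOOP_HEAPSIZE=1024
EOF

# Copy the template and uncomment HADOOP_HEAPSIZE in one pass.
sed 's/^# *export HADOOP_HEAPSIZE/export HADOOP_HEAPSIZE/' hive-env.sh.template > hive-env.sh

# Source it and verify the setting took effect.
. ./hive-env.sh
echo "$HADOOP_HEAPSIZE"
```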
Hive Run JDBC Commands
• Like connecting to a MySQL/Oracle/MSFT db
• Create a Connection, PreparedStatement, ResultSet
  Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
• The driver is in the Hive jar
Hive JDBC Prepared Statement
• The create table statement is different:
  Statement stmt = con.createStatement();
  String tableName = "testHiveDriverTable";
  stmt.executeQuery("drop table " + tableName);
  ResultSet res = stmt.executeQuery("create table " + tableName
      + " (key int, value string) ROW FORMAT delimited fields terminated by '\t'");
Pig, Using the Pig Util Class
• Util is not in pig-xxx.jar, only in the test package
• Local mode only; distributed mode not debugged
  Util.deleteDirectory(new File("/Users/dc/pig-0.9.2/nyse"));
  PigServer ps = new PigServer(ExecType.LOCAL);
  ps.setBatchOn();
Pig Example
String first = " nyse = load '/Users/dc/programmingpig/data/NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float); ";
String second = "B = foreach nyse generate symbol, dividends;";
String third = " store B into 'nyse'; ";
Pig Example
Util.registerMultiLineQuery(ps, first + second + third);
ps.executeBatch();
ps.shutdown();
Pig Example Output
12/02/11 14:07:57 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: file:///
12/02/11 14:07:59 INFO pigstats.ScriptState: Pig features used in the script: UNKNOWN
12/02/11 14:08:00 INFO rules.ColumnPruneVisitor: Columns pruned for nyse: $0, $2
12/02/11 14:08:01 INFO mapReduceLayer.MRCompiler: File concatenation threshold: 100 optimistic? false
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
12/02/11 14:08:01 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
12/02/11 14:08:01 INFO pigstats.ScriptState: Pig script settings are added to the job
12/02/11 14:08:01 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/02/11 14:08:02 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/02/11 14:08:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/02/11 14:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/02/11 14:08:02 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/02/11 14:08:02 INFO input.FileInputFormat: Total input paths to process : 1
12/02/11 14:08:02 INFO util.MapRedUtil: Total input paths to process : 1
12/02/11 14:08:02 INFO mapReduceLayer.MapReduceLauncher: 0% complete
12/02/11 14:08:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/02/11 14:08:04 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/02/11 14:08:04 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0001
12/02/11 14:08:05 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/02/11 14:08:05 INFO mapred.LocalJobRunner:
12/02/11 14:08:05 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/02/11 14:08:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/dc/pig-0.9.2/nyse
12/02/11 14:08:07 INFO mapred.LocalJobRunner:
12/02/11 14:08:07 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/02/11 14:08:09 WARN pigstats.PigStatsUtil: Failed to get RunningJob for job job_local_0001
12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Detected Local mode. Stats reported below may be incomplete
12/02/11 14:08:09 INFO pigstats.SimplePigStats: Script Statistics:
Pig Example Output
HadoopVersion   PigVersion   UserId   StartedAt             FinishedAt            Features
0.20.205.0      0.9.2        dc       2012-02-11 14:08:01   2012-02-11 14:08:09   UNKNOWN
Success!
Job Stats (time in seconds):
JobId            Alias    Feature    Outputs
job_local_0001   B,nyse   MAP_ONLY   file:///Users/dc/pig-0.9.2/nyse,
Input(s):
Successfully read records from: "/Users/dc/programmingpig/data/NYSE_dividends"

Output(s):
Successfully stored records in: "file:///Users/dc/pig-0.9.2/nyse"
Pig Example Output
Job DAG:
job_local_0001
12/02/11 14:08:09 INFO mapReduceLayer.MapReduceLauncher: Success!
M/R Pattern Design Review
• Why? A correctly designed M/R program that runs faster on a cluster than on an individual machine exercises all the components of a M/R cluster.
• Cluster experience with big data on AWS
• Important when migrating to production processes
• Design patterns in the sample HadoopExamples.jar:
  – WordCount
  – WordCount aggregation
  – MultiFileWordCount
From Lin & Dyer, Data-Intensive Text Processing with MapReduce
M/R Adding Array to Mapper Output
• WordCount design process
  – Mapper (contents of file, tokenize, output): <Object, Text, Text, IntWritable>. Object = file descriptor, Text = file line, Text = word, IntWritable = 1.
  – 2 steps to mapper design: 1) split up the input, then 2) output K,V to the reducer.
  – First step: copy mapper output K,V to the reducer. Reducer (collect mapper output): <Text, IntWritable, Text, IntWritable>. Second step: final output form.
• Replace the IntWritable with an ArrayList
• Why?
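To make the mapper/reducer contract concrete, here is a plain-Java sketch (no Hadoop dependencies; the class and method names are my own) that mimics WordCount: the "map" step tokenizes each line and emits (word, 1) pairs, and the "reduce" step sums the values per key, which is what the real Mapper and Reducer do across the cluster.

```java
import java.util.*;

// Plain-Java sketch of the WordCount mapper/reducer contract (no Hadoop).
public class WordCountSketch {

    // "Map" step: for each input line, tokenize and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum the values for each key. (The framework shuffles
    // and groups pairs by key between map and reduce; a TreeMap plays
    // that role here.)
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[]{"a b a", "b c"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs));  // {a=2, b=2, c=1}
    }
}
```

Replacing the IntWritable value with a list would let the mapper carry more than a bare count per key, which is what the slide's "Why?" is pointing at.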
Hadoop Avg Coding Demo
• Create an AvgPair object that implements Writable
• Create ivars: sum, count, key
• Auto-generate methods for the ivars
• Implement the write and readFields methods
• Put the ctors outside map()
• Run using the M/R plugin
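The steps above can be sketched without Hadoop on the classpath: Hadoop's Writable interface is defined in terms of java.io.DataOutput/DataInput, so a plain class with the same write/readFields pair shows the shape of the demo. The field names follow the slide's ivars; everything else (field types, the average() helper) is illustrative.

```java
import java.io.*;

// Sketch of the AvgPair demo: sum/count/key ivars plus the
// write/readFields pair that Hadoop's Writable interface requires.
// Plain java.io is used so this runs without Hadoop on the classpath.
public class AvgPair {
    private double sum;
    private long count;
    private String key;

    public AvgPair() { }  // no-arg ctor, needed for deserialization
    public AvgPair(String key, double sum, long count) {
        this.key = key; this.sum = sum; this.count = count;
    }

    // Auto-generated-style accessors for the ivars.
    public double getSum()   { return sum; }
    public long   getCount() { return count; }
    public String getKey()   { return key; }
    public double average()  { return count == 0 ? 0.0 : sum / count; }

    // Writable-style serialization: write the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeDouble(sum);
        out.writeLong(count);
    }

    // ...and read them back in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        sum = in.readDouble();
        count = in.readLong();
    }

    public static void main(String[] args) throws IOException {
        // Round trip through a byte buffer to check that write and
        // readFields agree on field order.
        AvgPair out = new AvgPair("ibm", 10.0, 4);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        out.write(new DataOutputStream(buf));

        AvgPair in = new AvgPair();
        in.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(in.getKey() + " avg=" + in.average());  // ibm avg=2.5
    }
}
```

Keeping the ctor call outside map() matters in the real demo because map() runs once per record; reusing one object avoids allocating per record.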
NXServer/NXClient
• Remote desktop to EC2
• 2 options:
  – 1) Use the prepared AMI by Eric Hammond
  – 2) Install NXServer
Prepared AMI
• http://aws.amazon.com/amis/Europe/1950
• US East AMI ID: ami-caf615a3
• Ubuntu 9.04 Jaunty Desktop with NX Server Free Edition
• Update the repos to newer versions of Ubuntu
• Create a new user
Clone the instance store if you can't get the NXServer to work
• The problem is that the EasyNXServer method uses an instance store. How do we clone it to an EBS volume?
• Create a blank volume; the default attach point is /dev/sdf
  sudo mkfs.ext3 /dev/sdf
  mkdir /newvolume
  sudo mount /dev/sdf /newvolume