hadoop 20111117

hadoop Tech share


Page 1: Hadoop 20111117

hadoop

Tech share

Page 2: Hadoop 20111117

• Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.

• Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.

Page 3: Hadoop 20111117

ZooKeeper

• ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.

Page 4: Hadoop 20111117

JobTracker

• JobTracker: The JobTracker provides command and control for job management. It supplies the primary user interface to a MapReduce cluster. It also handles the distribution and management of tasks. There is one instance of this server running on a cluster. The machine running the JobTracker server is the MapReduce master.

Page 5: Hadoop 20111117

TaskTracker

• TaskTracker: The TaskTracker provides execution services for the submitted jobs. Each TaskTracker manages the execution of tasks on an individual compute node in the MapReduce cluster. The JobTracker manages all of the TaskTracker processes. There is one instance of this server per compute node.

Page 6: Hadoop 20111117

NameNode

• NameNode: The NameNode provides metadata storage for the shared file system. The NameNode supplies the primary user interface to the HDFS. It also manages all of the metadata for the HDFS. There is one instance of this server running on a cluster. The metadata includes such critical information as the file directory structure and which DataNodes have copies of the data blocks that contain each file’s data. The machine running the NameNode server process is the HDFS master.

Page 7: Hadoop 20111117

Secondary NameNode

• Secondary NameNode: The secondary NameNode provides both file system metadata backup and metadata compaction. It supplies near real-time backup of the metadata for the NameNode. There is at least one instance of this server running on a cluster, ideally on a separate physical machine from the one running the NameNode. The secondary NameNode also merges the metadata change history, the edit log, into the NameNode’s file system image.

Page 8: Hadoop 20111117

Design of HDFS

• Design of HDFS:
– Very large files
– Streaming data access
– Commodity hardware

• Not a good fit:
– Low-latency data access
– Lots of small files
– Multiple writers, arbitrary file modifications

Page 9: Hadoop 20111117

blocks

• Disk blocks: normally 512 bytes
• HDFS blocks: 64 MB by default
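A minimal sketch of reading a file's block size back through the FileSystem API; the class name and path are illustrative (the path reuses the HDFS address that appears later in these slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path("hdfs://192.168.126.133:9000/t1/a1.txt"); // illustrative path
        FileSystem fs = FileSystem.get(p.toUri(), conf);
        FileStatus status = fs.getFileStatus(p);
        // Prints 67108864 (64 MB) unless dfs.block.size was overridden
        System.out.println("block size: " + status.getBlockSize());
    }
}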

Page 10: Hadoop 20111117

HDFS file read
• Memory

Page 11: Hadoop 20111117

HDFS file write

Page 12: Hadoop 20111117

HDFS file write
• OutputStream.write()
• OutputStream.flush() flushes the stream, but the data only becomes visible to readers once more than a block has been written.
• OutputStream.sync() forces synchronization.
• OutputStream.close() includes a sync().
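A minimal sketch of that write sequence with the FileSystem API; the class name and path are illustrative, and sync() is the 0.20-era force-sync call (renamed hflush()/hsync() in later releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHDFS {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("hdfs://192.168.126.133:9000/t1/out.txt"); // illustrative path
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        FSDataOutputStream out = fs.create(path);
        out.write("hello hdfs".getBytes("UTF-8")); // buffered on the client
        out.flush();                               // flushed, but not yet guaranteed visible to readers
        out.sync();                                // force the buffered data out to the datanodes
        out.close();                               // close() also performs a sync()
    }
}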

Page 13: Hadoop 20111117

DistCp distributed copy
• hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar
• hadoop distcp -update …… : only copies files that have been modified
• hadoop distcp -overwrite …… : overwrites the destination
• hadoop distcp -m 100 …… : splits the copy job into N map tasks

Page 14: Hadoop 20111117

Hadoop file archiving
• HAR files
• hadoop archive -archiveName file.har /myfiles /outpath
• hadoop fs -ls /outpath/file.har
• hadoop fs -lsr har:///outpath/file.har

Page 15: Hadoop 20111117

File operations
• hadoop fs -rm hdfs://192.168.126.133:9000/xxx
• Available subcommands: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, count, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz

Page 16: Hadoop 20111117

Distributed deployment
• Master & slave: 192.168.0.10
• Slave: 192.168.0.20
• Edit conf/masters:
– 192.168.0.10
• Edit conf/slaves:
– 192.168.0.10
– 192.168.0.20

Page 17: Hadoop 20111117

Installing Hadoop

• ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

• cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

• Disable the firewall: sudo ufw disable

Page 18: Hadoop 20111117

Distributed deployment: core-site.xml (identical on master & slaves)

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/tony/tmp/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.0.10:9000</value>
  </property>
</configuration>
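With fs.default.name set as above and the conf directory on the classpath, a client needs nothing more than a plain Configuration to reach the cluster; a minimal sketch (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowDefaultFs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // loads core-site.xml etc. from the classpath
        System.out.println(conf.get("fs.default.name")); // hdfs://192.168.0.10:9000
        FileSystem fs = FileSystem.get(conf);            // client for the configured default filesystem
        System.out.println(fs.getUri());
    }
}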

Page 19: Hadoop 20111117

Distributed deployment: hdfs-site.xml (master & slaves)

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/tony/tmp/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/tony/tmp/data</value>
  </property>
</configuration>
• Make sure these directories exist on the machine.

Page 20: Hadoop 20111117

Distributed deployment: mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.0.10:9001</value>
  </property>
</configuration>
• All machines are configured with the master's address.

Page 21: Hadoop 20111117

Run

• hadoop namenode -format
– Before every format, run stop-all.sh first and clear out all directories under tmp.
• start-all.sh
• Check cluster status:
– http://192.168.0.20:50070/dfshealth.jsp
– or hadoop dfsadmin -report

Page 22: Hadoop 20111117
Page 23: Hadoop 20111117
Page 24: Hadoop 20111117

could only be replicated

• java.io.IOException: could only be replicated to 0 nodes, instead of 1.

• Solution:
– The XML configuration is incorrect; make sure the addresses in the slaves' mapred-site.xml and core-site.xml match the master's.

Page 25: Hadoop 20111117

Incompatible namespaceIDs

• java.io.IOException: Incompatible namespaceIDs in /home/hadoop/data: namenode namespaceID = 1214734841; datanode namespaceID = 1600742075

• Cause:
– tmp was not cleared before formatting, so the namespace IDs no longer match.

• Solution:
– Edit the namenode's /home/hadoop/name/current/VERSION so the two namespaceIDs match again.

Page 26: Hadoop 20111117

UnknownHostException

• # hostname
• vi /etc/hostname to change the hostname
• vi /etc/hosts to add the IP mapping for the hostname

Page 27: Hadoop 20111117

error in shuffle in fetcher

• org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher

• Solution:
– The problem is in the hosts file configuration: add the hostnames and IP mappings of the other nodes to /etc/hosts on every node.

Page 28: Hadoop 20111117
Page 29: Hadoop 20111117

Auto sync

Page 30: Hadoop 20111117

Dynamically adding a DataNode

• In the master's conf/slaves, add the new node's address.
• Start the daemons on the new node:
– bin/hadoop-daemon.sh start datanode
– bin/hadoop-daemon.sh start tasktracker
• Once started, Hadoop recognizes the new node automatically.

Page 31: Hadoop 20111117

screenshot

Page 32: Hadoop 20111117

Fault tolerance
• If a node is unresponsive for a long time, it is removed from the cluster, and the other nodes make up the lost replication.

Page 33: Hadoop 20111117
Page 34: Hadoop 20111117

Running MapReduce

• hadoop jar a.jar com.Map1 hdfs://192.168.126.133:9000/hadoopconf/ hdfs://192.168.126.133:9000/output2/
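The slides do not show the contents of a.jar; as a hedged sketch, a driver passed to hadoop jar usually looks like the word-count example below, written against the org.apache.hadoop.mapreduce API (all class names here are illustrative, not the actual com.Map1):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // add up all counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input path, args[1] = output path, as in the command above
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}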

Page 35: Hadoop 20111117

Read From Hadoop URL

// execute: hadoop ReadFromHDFS
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHDFS {
    static {
        // Register the hdfs:// URL scheme handler (can only be done once per JVM)
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }
    public static void main(String[] args) {
        try {
            URL uri = new URL("hdfs://192.168.126.133:9000/t1/a1.txt");
            IOUtils.copyBytes(uri.openStream(), System.out, 4096, false);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Page 36: Hadoop 20111117

Read By FileSystem API

// execute: hadoop ReadByFileSystemAPI
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadByFileSystemAPI {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://192.168.126.133:9000/t1/a2.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Page 37: Hadoop 20111117

FileSystem API

// Create or delete a directory, depending on whether it already exists
Path path = new Path(URI.create("hdfs://192.168.126.133:9000/t1/tt/"));
if (fs.exists(path)) {
    fs.delete(path, true);
    System.out.println("deleted-----------");
} else {
    fs.mkdirs(path);
    System.out.println("created=====");
}

/**
 * List files
 */
FileStatus[] fileStatuses = fs.listStatus(new Path(URI.create("hdfs://192.168.126.133:9000/")));
for (FileStatus fileStatus : fileStatuses) {
    System.out.println("" + fileStatus.getPath().toUri().toString() + " dir:" + fileStatus.isDirectory());
}

// A PathFilter can be passed to listStatus()/globStatus() to filter results
PathFilter pathFilter = new PathFilter() {
    @Override
    public boolean accept(Path path) {
        return true;
    }
};

Page 38: Hadoop 20111117

File write coherency model
• After a file is created, it is visible in the filesystem namespace:

    Path p = new Path("p");
    fs.create(p);
    assertThat(fs.exists(p), is(true));

• However, content written to the file is not guaranteed to be visible, even after the stream has been flushed, so the file length shows as 0:

    Path p = new Path("p");
    OutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.flush();
    assertThat(fs.getFileStatus(p).getLen(), is(0L));

• Once more than a block's worth of data has been written, the first block becomes visible to new readers, and likewise for subsequent blocks. In short, it is always the block currently being written that other readers cannot see.
• out.sync() forces synchronization; close() calls sync() automatically.

Page 39: Hadoop 20111117

Cluster copy and archiving
• hadoop distcp -update hdfs://n1/foo hdfs://n2/bar/foo
• Archiving:
– hadoop archive -archiveName files.har /my/files /my
• Using an archive:
– hadoop fs -lsr har:///my/files.har
– hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/di
• Archive drawback: modifying, adding, or deleting files all require re-archiving.

Page 40: Hadoop 20111117

SequenceFile Reader & Writer

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
SequenceFile.Writer writer = null;
try {
    System.out.println("start....................");
    FileSystem fileSystem = FileSystem.newInstance(conf);
    IntWritable key = new IntWritable(1);
    Text value = new Text("");
    Path path = new Path("hdfs://192.168.126.133:9000/t1/seq");
    if (!fileSystem.exists(path)) {
        // Write: createWriter creates the file, then nine key/value records are appended
        writer = SequenceFile.createWriter(fileSystem, conf, path, key.getClass(), value.getClass());
        for (int i = 1; i < 10; i++) {
            writer.append(new IntWritable(i), new Text("value" + i));
        }
        writer.close();
    } else {
        // Read: iterate over the records, printing each key, value and file position
        SequenceFile.Reader reader = new SequenceFile.Reader(fileSystem, path, conf);
        System.out.println("now while segment");
        while (reader.next(key, value)) {
            System.out.println("key:" + key.get() + " value:" + value + " position" + reader.getPosition());
        }
        reader.close();
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    IOUtils.closeStream(writer);
}

Page 41: Hadoop 20111117

SequenceFile

• 1 value1
• 2 value2
• 3 value3
• 4 value4
• 5 value5
• 6 value6
• 7 value7
• 8 value8
• 9 value9
• Each record consists of a key and a value.
• The file can be displayed with hadoop fs -text hdfs://………

Page 42: Hadoop 20111117

SequenceMap

• Rebuild the index: MapFile.fix(fileSystem, path, key.getClass(), value.getClass(), true, conf);

• MapFile.Writer writer = new MapFile.Writer(conf, fileSystem, path.toString(), key.getClass(), value.getClass());

• MapFile.Reader reader = new MapFile.Reader(fileSystem,path.toString(),conf);
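Putting the two constructors above together, a minimal sketch of writing and then looking up a MapFile (the path and values are illustrative; keys must be appended in sorted order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        Path path = new Path("hdfs://192.168.126.133:9000/t1/map"); // illustrative path

        // Write: a MapFile is a directory holding a sorted "data" file plus an "index" file
        MapFile.Writer writer = new MapFile.Writer(conf, fileSystem, path.toString(),
                IntWritable.class, Text.class);
        for (int i = 1; i < 10; i++) {
            writer.append(new IntWritable(i), new Text("value" + i)); // keys appended in order
        }
        writer.close();

        // Read: get() looks the key up through the in-memory index
        MapFile.Reader reader = new MapFile.Reader(fileSystem, path.toString(), conf);
        Text value = new Text();
        reader.get(new IntWritable(5), value);
        System.out.println("key 5 -> " + value);
        reader.close();
    }
}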