Network Traffic Analysis using Hadoop Architecture, Shan Zeng, HEPiX, Beijing, 17 Oct 2012


Network Traffic Analysis using HADOOP Architecture

Shan Zeng

HEPiX, Beijing

17 Oct 2012

ZENG SHAN/CC/IHEP

Outline

• Introduction to Hadoop

• Traffic Information Capture

• Traffic Information Resolution

• Traffic Information Storage

• Traffic Information Analysis

• Traffic Information Display

• Conclusion

Introduction to Hadoop


What can Hadoop do?

• Hadoop is an open-source software framework.

• It was originally developed to support distribution for the Nutch search engine project.

• It supports data-intensive distributed applications.

• It is licensed under the Apache v2 license.

• It enables applications to work with thousands of computationally independent computers and petabytes of data.


Traffic Information Capture


What is a flow?

• A network flow is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain.

• A network flow measures sequences of IP packets sharing a common feature as they pass between devices.

• Flow formats:

  • NetFlow (Cisco)

  • J-Flow (Juniper)

  • sFlow (HP)

  • …
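The idea of packets "sharing a common feature" can be illustrated with a minimal Python sketch that groups packets by their 5-tuple (source address, destination address, source port, destination port, protocol). The `Packet` type, its field names, and the sample packets are assumptions made for illustration; real exporters may use additional key fields and expire flows on timeouts.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """A simplified packet header; the field names are illustrative assumptions."""
    timestamp: float
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str
    size: int  # bytes


def packets_to_flows(packets):
    """Group packets that share the same 5-tuple into one flow summary each."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for p in packets:
        key = (p.src_addr, p.dst_addr, p.src_port, p.dst_port, p.protocol)
        flow = flows[key]
        flow["packets"] += 1
        flow["bytes"] += p.size
        # Track the first-seen and last-seen timestamps of the flow.
        flow["first"] = p.timestamp if flow["first"] is None else min(flow["first"], p.timestamp)
        flow["last"] = p.timestamp if flow["last"] is None else max(flow["last"], p.timestamp)
    return dict(flows)


# Hypothetical packets (addresses come from the documentation ranges).
sample_packets = [
    Packet(0.0, "192.0.2.1", "198.51.100.7", 40000, 80, "TCP", 1500),
    Packet(0.5, "192.0.2.1", "198.51.100.7", 40000, 80, "TCP", 500),
    Packet(0.7, "192.0.2.2", "198.51.100.7", 40001, 53, "UDP", 80),
]
flows = packets_to_flows(sample_packets)
```

Here the first two packets collapse into a single flow, while the third forms a flow of its own.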


What is nProbe?

• nProbe is an open-source tool.

• It captures packets flowing on an Ethernet segment, computes flows, and exports them to the specified collectors.

• Features:

  • Ability to keep up with Gbit speeds on Ethernet networks, handling thousands of packets per second without packet sampling on commodity hardware.

  • Support for major OSes including Unix, Windows and Mac OS X.

  • It is designed for environments with limited resources.


Traffic Information Resolution


nfcapd

• nfcapd is the NetFlow capture daemon; it reads NetFlow data from the network and stores it in files.

• The output file is automatically rotated and renamed every n minutes, typically 5 min, e.g. nfcapd.201205030900 contains the data from May 3rd 2012 09:00 onward.

• Usage: /usr/local/bin/nfcapd -p 2055 -t 300 -l /home/zengshan/netflow/nfcapd_file/IHEP &

-p portnum Specifies the port number to listen on. Default port is 9995.

-t interval Specifies the time interval in seconds to rotate files.

-l base_directory Specifies the base directory to store the output files.
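The file naming scheme makes it easy to recover the time slot a file covers. The following is a minimal sketch in Python; the helper name `nfcapd_slot` is hypothetical, and it assumes the `nfcapd.YYYYMMDDhhmm` naming shown above together with the 300-second rotation interval from the usage line.

```python
from datetime import datetime, timedelta


def nfcapd_slot(filename, interval_seconds=300):
    """Return the (start, end) datetimes covered by an nfcapd output file."""
    prefix, _, stamp = filename.partition(".")
    if prefix != "nfcapd" or len(stamp) != 12 or not stamp.isdigit():
        raise ValueError(f"unexpected nfcapd file name: {filename!r}")
    # The suffix encodes year, month, day, hour and minute of the slot start.
    start = datetime.strptime(stamp, "%Y%m%d%H%M")
    return start, start + timedelta(seconds=interval_seconds)


start, end = nfcapd_slot("nfcapd.201205030900")
```

With the default interval, `start` is 3 May 2012 09:00 and `end` is five minutes later.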


nfdump

• nfdump reads the NetFlow data from the files stored by nfcapd and then dumps it to text.


nfdump output text format

Tag    Description                   Tag    Description
%ts    Start Time - first seen       %in    Input Interface num
%te    End Time - last seen          %out   Output Interface num
%td    Duration                      %pkt   Packet count in this flow
%pr    Protocol                      %byt   Byte count in this flow
%sa    Source Address                %fl    Number of flows
%da    Destination Address           %flg   TCP Flags
%sap   Source Address:Port           %tos   Type of Service
%dap   Destination Address:Port      %bps   Bits per second
%sp    Source Port                   %pps   Packets per second
%dp    Destination Port              %bpp   Bytes per packet
%sas   Source AS                     %das   Destination AS
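As an illustration of how these tags might be used, the sketch below builds an nfdump command line with a custom `fmt:` output format and parses one line of the resulting text. The choice of fields, the "|" separator, the helper names, and the sample line are all assumptions; the command is only constructed, not executed, and header or summary lines in real output would need to be skipped separately.

```python
# Tags taken from the table above; the "|" separator is an assumption.
FIELDS = ["ts", "te", "pr", "sa", "da", "sp", "dp", "pkt", "byt"]
NFDUMP_FORMAT = "fmt:" + "|".join("%" + name for name in FIELDS)


def nfdump_command(nfcapd_file):
    """Build (but do not run) an nfdump command line for one nfcapd file."""
    return ["nfdump", "-r", nfcapd_file, "-o", NFDUMP_FORMAT]


def parse_line(line):
    """Split one "|"-separated output line into a dict keyed by tag name."""
    values = [value.strip() for value in line.split("|")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in ("sp", "dp", "pkt", "byt"):
        record[name] = int(record[name])  # fails on scaled values such as "1.2 M"
    return record


# Hypothetical output line, not captured from a real run.
sample = ("2012-05-03 09:00:01.120|2012-05-03 09:00:02.480|TCP  |"
          "    192.0.2.10|  198.51.100.7| 40000|   80|      12|    9000")
record = parse_line(sample)
```

Using an explicit separator avoids ambiguity with the space inside the timestamp fields.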


Traffic Information Storage


HDFS

• HDFS is short for Hadoop Distributed File System.

• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.

• Differences from other distributed file systems:

  • It is highly fault-tolerant.

  • It is designed to be deployed on low-cost hardware.

• Portability across heterogeneous hardware and software platforms.

• Applications that run on HDFS need streaming access to their data sets.

• Moving computation is cheaper than moving data.


HDFS master/slave architecture

• NameNode

  • Manages the name space of the file system

  • Regulates access to files by clients

  • Determines the mapping of blocks to DataNodes

• DataNodes

  • Responsible for serving read and write requests from the clients

  • Perform block creation, deletion and replication upon instructions from the NameNode


Data Replication in HDFS

• To ensure fault tolerance in HDFS, the blocks of a file are replicated; the number of replicas of a block can be specified by the application.
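To make the effect of the replication factor concrete, here is a toy Python model of a NameNode-style block map. The round-robin placement, the function names, and the DataNode names are illustrative assumptions; this is not HDFS's actual placement policy, which is rack-aware.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Toy NameNode: map each block index to `replication` distinct DataNodes.

    Round-robin placement is an illustrative simplification, not HDFS's
    rack-aware placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    block_map = {}
    for block in range(num_blocks):
        block_map[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(replication)
        ]
    return block_map


def readable_after_failure(block_map, failed_node):
    """A file stays readable if every block still has one live replica."""
    return all(
        any(node != failed_node for node in nodes) for nodes in block_map.values()
    )


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
block_map = place_blocks(file_size=200, block_size=64, datanodes=nodes, replication=2)
```

With two replicas per block the file survives the loss of any single DataNode in this model, whereas with a replication factor of 1 the loss of one node makes some block unreadable.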


Traffic Information Analysis


Map/Reduce

• MapReduce is a programming model for processing large data sets.

• MapReduce is typically used to do distributed computing on clusters of computers.

• MapReduce can take advantage of data locality, processing data on or near the storage assets to decrease the transmission of data.

• The model is inspired by the map and reduce functions:

  • "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its smaller problem and passes the answer back to the master node.

  • "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the final output.
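The two steps can be sketched in a single Python process. This is an illustration of the programming model only, not the Hadoop job used in this work: the record layout and function names are assumptions, and the grouping step that Hadoop performs between map and reduce is simulated with a dictionary. The example counts flows, packets and bytes per 5-minute slot.

```python
from collections import defaultdict
from functools import reduce


def map_flow(record):
    """Map step: emit (5-minute slot, (flows, packets, bytes)) for one flow record."""
    slot = record["start"] - record["start"] % 300
    return slot, (1, record["packets"], record["bytes"])


def reduce_counts(left, right):
    """Reduce step: add two (flows, packets, bytes) tuples element-wise."""
    return tuple(a + b for a, b in zip(left, right))


def run_job(records):
    """Run map, group by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for key, value in map(map_flow, records):
        groups[key].append(value)
    return {key: reduce(reduce_counts, values) for key, values in groups.items()}


# Hypothetical flow records; "start" is seconds since the epoch.
records = [
    {"start": 600, "packets": 10, "bytes": 4000},
    {"start": 750, "packets": 2, "bytes": 120},
    {"start": 905, "packets": 7, "bytes": 900},
]
totals = run_job(records)
```

Because the reduce function is associative, partial sums computed on different nodes can be combined in any order, which is what lets the work be distributed.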


Traffic Information Display


Drawing tools

• RRDtool

  • RRDtool is an acronym for round-robin database tool.

  • The data are stored in a round-robin database (circular buffer).

  • It also includes tools to extract RRD data in a graphical format.

  • It is used for drawing the flow trend graph in three dimensions: flow count, packet count, and traffic count.

• Highstock

  • Highstock lets you create stock or general timeline charts in pure JavaScript, including sophisticated navigation options like a small navigator series, preset date ranges, date picker, scrolling and panning.

  • We just need to write an API between HDFS and Highstock.
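One possible shape for the last step of such an API is sketched below in Python: it converts per-slot totals into the `[timestamp in milliseconds, value]` pairs that a Highstock series takes as its data. The function name and the sample counts are hypothetical, and reading the totals from HDFS is left out.

```python
import json


def to_highstock_series(name, totals_by_slot):
    """Turn {epoch_seconds: value} into one Highstock series definition."""
    # Highstock expects points sorted by time, with timestamps in milliseconds.
    data = [[slot * 1000, value] for slot, value in sorted(totals_by_slot.items())]
    return {"name": name, "data": data}


# Hypothetical per-slot flow counts.
flow_counts = {1336035900: 1520, 1336035600: 1480}
series = to_highstock_series("Flow count", flow_counts)
payload = json.dumps([series])
```

The resulting JSON string could be served to the browser and passed to the chart as its series list.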


Architecture


Conclusion

• Network flow trend chart from IHEP every 5 minutes in three dimensions: flow count, packet count, and traffic count

• Detailed traffic information chart

  • select a single timeslot and get the detailed traffic information

  • select a time window and get the detailed traffic information

  • on hovering over the chart, a tooltip with the traffic information for each point and series is displayed

  • the tooltip follows as the user moves the mouse over the graph

• Traffic information related to the HEP experiments

  • available once the IP addresses of the machines involved in the data transfer of a HEP experiment are specified

  • we already have DYB/YBJ/CMS/ATLAS
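Attributing flows to an experiment from a list of IP addresses could look like the following Python sketch. The address ranges are placeholders from the documentation ranges, not the real IHEP assignments, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical address ranges; the real lists come from the experiments.
EXPERIMENT_NETWORKS = {
    "DYB": [ipaddress.ip_network("192.0.2.0/25")],
    "CMS": [ipaddress.ip_network("198.51.100.0/24")],
}


def experiment_for(address):
    """Return the experiment whose networks contain `address`, or None."""
    ip = ipaddress.ip_address(address)
    for experiment, networks in EXPERIMENT_NETWORKS.items():
        if any(ip in network for network in networks):
            return experiment
    return None


def tag_flow(src_addr, dst_addr):
    """Tag a flow with the experiment of its source or, failing that, its destination."""
    return experiment_for(src_addr) or experiment_for(dst_addr)
```

For example, `tag_flow("203.0.113.5", "198.51.100.9")` falls through to the destination address and returns the experiment that owns it in this placeholder table.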


Thank You

Questions?