Open Source Big Data Platform - Flamingo Project

Open Cloud Engine

Introduction and Case Study of the Open Source Project Flamingo, the Big Data Platform

Open Cloud Engine Flamingo Project Leader: Edward Kim ([email protected])

2014.03.01 v0.8

What is a Big Data Platform?

The Roles of the Big Data Platform

•  What are the main tasks that can be done on the Big Data platform?

–  Data mining, statistical analysis, log handling (collecting, pre-processing)

•  Who does what on the platform?

–  It varies by user.

–  Most operators have a development background, so they focus on system management and log handling.

–  Analysts focus on establishing a better environment for analyzing data.

•  How many users are using the Big Data platform?

–  Many users → the functionality of the platform and the accessibility of the infrastructure are important.

–  The Big Data platform handles data → it is vulnerable. Hadoop is insecure.

•  What am I? Operator? Architect? Developer? Data Scientist?

–  Depending on the role, the functions of the platform can be defined differently.

What Big Data Platform Must Provide

•  SOFTWARE STACK

•  INFRA MANAGEMENT MONITORING

•  WORKFLOW

•  ANALYSIS AND VISUALIZATION

•  DASHBOARD

•  SECURITY

–  ACCESS, AUTHENTICATION, AUTHORIZATION, ENCRYPTION, AUDITING, POLICY

What Big Data Platform Must Provide

•  Batch job management and monitoring

•  MapReduce (MR) based parallel analysis programs

•  User activities monitoring

•  Policy on accessing resources and systems.

•  Various functions to improve accessibility to infrastructures.

Flamingo Project In Open Cloud Engine

•  Web technology makes big data infrastructure and data convenient to use.

•  Users can handle data easily.

•  Provides the functionality to do various jobs in one workspace.

•  Analysis and processing MapReduce jobs can be reused.

•  Open source oriented, and all systems are ready to go.

•  Designed to be operator friendly.

•  Supports the Hadoop ecosystem.

[Figure: Big Data Analysis and Service Platform, steps 1 to 7. Labels: Browser (Designer, Search); Data Scientist, Service Planner, Data Analyst; Design a workflow; Collect; Log; Validation; Log Data; MapReduce Analysis Module; Morphology Analysis; Analyze Graph; User Evaluation; Elect a leader; Opinion Leader Score Board; Open API; Data Visualization Charts; Data users; Systems; Mobile Devices; Request Service. Reuse analyzed results: analyzed results are exposed through an Open API.]

Browser: Information Catalogue Search

    Information            Security   Batch          Type
    User Similarity        1          Daily, 4 PM    XML
    Item Recommendation    2          Daily, 2 AM    JSON
    Purchase Preference    3          Daily, 8 PM    XML/JSON
    Opinion Leader         2          Daily, 7 AM    XML/JSON

Future of Big Data Platform

Flamingo Project

•  Functionality matters the most in a Hadoop based Big Data environment.

•  Integrated open source projects are difficult to manage, and there are not enough UIs to handle them.

Flamingo Workbench

•  Users can move freely within a workspace while conducting various jobs.

•  Each window is dedicated to its own functionality.

•  To minimize coding, reusable parts are componentized.

•  The system is kept simple, and well-known frameworks are used so that new features are easy to add.

•  The development method is standardized (tools, procedures, manuals, environments, …).

Flamingo Architecture

File System Browser

•  Managing files is an integral part of Hadoop.

•  A familiar Windows file explorer style UI gives users a better UX.

File System Browser

Converts directories into Hive DBs or tables

Hive DBs and tables are marked with different icons in the browser.

FLAMINGO HAS OPTIMIZED FREQUENTLY NEEDED FUNCTIONS

File System Browser Enhancement

•  Previewing files and their locations

•  Restricting unauthorized users from viewing directories and files (doesn’t come with Hadoop)

–  E.g. the /tmp directory is not visible to common users.

•  Setting permission on directories and files

•  A home directory for each user (doesn’t come with Hadoop)

•  Setting a quota on directories

•  Regularly dumping file system size info (for monitoring)
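The viewing restriction and per-user home directory listed above could be sketched as a filter applied to a directory listing before it is shown. This is only an illustration under assumptions: the hidden-prefix policy, the `/user/<name>` home layout, and the names `PathVisibility`, `isVisible`, `filter`, and `homeOf` are hypothetical and not taken from Flamingo.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names come from Flamingo itself.
final class PathVisibility {
    // Assumed policy: paths at or under these directories are hidden from common users.
    private static final List<String> HIDDEN_PREFIXES = List.of("/tmp");
    // Assumed layout for per-user home directories.
    private static final String HOME_ROOT = "/user/";

    static String homeOf(String user) {
        return HOME_ROOT + user;
    }

    // True when path is dir itself or lies below it ("/tmpdata" is not under "/tmp").
    static boolean isUnder(String path, String dir) {
        return path.equals(dir) || path.startsWith(dir + "/");
    }

    static boolean isVisible(String path, boolean admin) {
        if (admin) {
            return true; // administrators see everything
        }
        for (String hidden : HIDDEN_PREFIXES) {
            if (isUnder(path, hidden)) {
                return false;
            }
        }
        return true;
    }

    // Filters a listing before it is handed to the browser UI.
    static List<String> filter(List<String> paths, boolean admin) {
        return paths.stream()
                .filter(p -> isVisible(p, admin))
                .collect(Collectors.toList());
    }
}
```

Matching on a directory boundary rather than a raw string prefix keeps unrelated paths such as `/tmpdata` visible.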

Audit Log

•  Search all recorded HDFS logs.

Workflow Designer

•  Mounts various analytic modules (e.g. Mahout).

•  Drag and drop the provided modules onto the canvas.

•  Analytic and statistical modules are currently mounted, Mahout and Giraph are being mounted, and ETL MapReduce jobs will be mounted soon.

Big Workflow Case

Supports a workflow composed of multiple nodes.

Apache Access Log To CSV

Parameters to MapReduce

•  Delimiter

•  An option to print logs that do not match the pattern

Location of Apache Access Log and an output path of a CSV file.

MapReduce JAR file and a driver name
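The two parameters named above (a delimiter and an option to print non-matching lines) suggest a per-line conversion like the following. This is a minimal sketch, not Flamingo's module: it assumes the Apache Common Log Format, and the class name `AccessLogToCsv` and method `convert` are hypothetical. In a MapReduce job this logic would sit in the body of the map function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the per-line conversion; not the actual Flamingo module.
final class AccessLogToCsv {
    // Common Log Format: host ident user [time] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    private final String delimiter;
    private final boolean printNonMatching;

    AccessLogToCsv(String delimiter, boolean printNonMatching) {
        this.delimiter = delimiter;
        this.printNonMatching = printNonMatching;
    }

    // Returns the delimited record, the raw line when it does not match and
    // printNonMatching is set, or null when the line should be dropped.
    String convert(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.find()) {
            return printNonMatching ? line : null;
        }
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            fields.add(m.group(i));
        }
        // No quoting is applied: a field that contains the delimiter is not escaped.
        return String.join(delimiter, fields);
    }

    public static void main(String[] args) {
        AccessLogToCsv converter = new AccessLogToCsv(",", false);
        System.out.println(converter.convert(
                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326"));
    }
}
```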

Workflow Designer

•  A complex workflow is needed to see a final output.

•  Processing files with MapReduce jobs usually takes several steps, which makes creating a workflow difficult.

•  Engineers prefer Apache Hive’s SQL-like query language to writing MapReduce jobs, so Workflow Designer comes in handy.

•  When handling various types of log files, Workflow Designer and MapReduce are essential.

Workflow Monitoring

•  Monitors workflows submitted from Workflow Designer. Accurate logs can be checked.

Workflow Monitoring

    root@n02:~/flamingo_data/tmp/2014/03/31/90/JOB_20140331_172000_90_157566920/26385942 $> ls -lsa
    total 40
     4 drwxr-xr-x  2 root root  4096 2014-03-31 17:23 .
     4 drwxr-xr-x 20 root root  4096 2014-03-31 17:23 ..
    16 -rw-r--r--  1 root root 12731 2014-03-31 17:23 action.log                     → execution log
     4 -rwxrwxrwx  1 root root  1259 2014-03-31 17:23 core-site.xml
     0 -rw-r--r--  1 root root     0 2014-03-31 17:23 hadoop.job_201403300831_0471   → MapReduce Job ID
     4 -rwxrwxrwx  1 root root   852 2014-03-31 17:23 script.sh

NODES IN A WORKFLOW CONTAIN SEVERAL MAPREDUCE JOBS, SO THEY MUST BE ABLE TO BE TRACKED
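In the directory listing above, an empty file named `hadoop.job_201403300831_0471` records the MapReduce job ID that belongs to a workflow node. One way to recover the job IDs for a node from such marker files might look like this. It is a sketch only: the `hadoop.` prefix is read off that single listing, and the class name `JobMarkers` and method `jobIds` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derives MapReduce job IDs from marker file names.
final class JobMarkers {
    // Assumption based on the listing: marker files are named "hadoop." + job ID.
    private static final String MARKER_PREFIX = "hadoop.";

    static List<String> jobIds(List<String> fileNames) {
        List<String> ids = new ArrayList<>();
        for (String name : fileNames) {
            if (name.startsWith(MARKER_PREFIX + "job_")) {
                // Strip the prefix to get an ID such as job_201403300831_0471.
                ids.add(name.substring(MARKER_PREFIX.length()));
            }
        }
        return ids;
    }
}
```

Because a node may launch several MapReduce jobs, the method returns a list rather than a single ID.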

What users view in the MapReduce execution history

Hadoop Job Monitoring

Must be able to be tracked in Hadoop Job Monitoring.

Expression Language (EL)

•  Dynamically substitutes values into variables.

–  E.g. today’s date: dateFormat('yyyyMMdd'), dateFormat('yyyy-MM-dd')

•  For example, variables can be replaced with specific dates.

–  E.g. in a daily batch, record yesterday’s date in a workflow executed today.

•  Supported expressions

–  dateFormat('DATE FORMAT') → dateFormat('yyyyMMddHHmmss')

–  hostname, escapeString

–  yesterday, tomorrow

–  month, day, hour, minute, … → day('yyyyMMdd', -1) :: yesterday’s date (20131111)

–  trim, concat, urlEncode, firstNotNull

Expression Language (EL)

The ${EL} format is dynamically replaced with real values.
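The substitution of `${...}` expressions can be illustrated with a very small evaluator. This is a sketch under assumptions, not Flamingo's EL engine: it handles only `dateFormat(pattern)` and `day(pattern, offset)`, uses `java.time` for formatting, and takes the current time as a parameter so the result is reproducible. The class name `MiniEl` and method `evaluate` are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of ${function(args)} substitution; not Flamingo's EL engine.
final class MiniEl {
    // Matches ${name(arg1, arg2)} and captures the name and the raw argument list.
    private static final Pattern CALL = Pattern.compile("\\$\\{(\\w+)\\(([^)]*)\\)\\}");

    static String evaluate(String text, LocalDateTime now) {
        Matcher m = CALL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = call(m.group(1), m.group(2).split(","), now);
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String call(String name, String[] args, LocalDateTime now) {
        switch (name) {
            case "dateFormat": // current date and time in the given pattern
                return format(now, args[0]);
            case "day":        // date shifted by a number of days, e.g. -1 for yesterday
                return format(now.plusDays(Integer.parseInt(args[1].trim())), args[0]);
            default:
                throw new IllegalArgumentException("Unsupported function: " + name);
        }
    }

    private static String format(LocalDateTime time, String quotedPattern) {
        // Strip the surrounding single quotes from the argument.
        String pattern = quotedPattern.trim().replaceAll("^'|'$", "");
        return DateTimeFormatter.ofPattern(pattern).format(time);
    }
}
```

With the clock fixed at 2013-11-12, `MiniEl.evaluate("${day('yyyyMMdd', -1)}", now)` yields `20131111`, which matches the "yesterday's date" example on the slide.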

Hadoop Job Tracker Monitoring

•  Displays Hadoop’s job tracker info on graphs.

•  Remote monitoring and tracking of Hadoop jobs is available.

Hive Editor & Hive Metastore Browser

•  Search, browse, and download using SQL.

•  Hive Metastore is integrated. Easy to manage databases and tables.

Hive Editor Use Case

•  Case 1: Search user access logs with Hive

–  If the log is semi-structured or unstructured, it’s problematic.

–  If a column contains an array of map, it’s also problematic.

•  Below is an example of a semi-structured log

    TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"
    TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"


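Hive Editor Use Case

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.hive.serde2.SerDe;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
    import org.apache.hadoop.io.Writable;

    public class MasSerde implements SerDe {
        private StructTypeInfo rowTypeInfo;
        private ObjectInspector rowOI;
        private List<String> colNames;
        private List<Object> row = new ArrayList<Object>();

        // Captures every double-quoted value in the log line, in order.
        private final Pattern p = Pattern.compile("\"(.*?)\"");

        @Override
        public Object deserialize(Writable blob) throws SerDeException {
            row.clear();
            Matcher m = p.matcher(blob.toString());
            List<String> values = new ArrayList<String>();
            while (m.find()) {
                values.add(m.group(1));
            }
            // Map the i-th quoted value to the i-th table column.
            int i = 0;
            for (String fieldName : rowTypeInfo.getAllStructFieldNames()) {
                TypeInfo fieldTypeInfo = rowTypeInfo.getStructFieldTypeInfo(fieldName);
                // A line with fewer values than columns yields null for the missing columns.
                row.add(i < values.size() ? parseField(values.get(i), fieldTypeInfo) : null);
                i++;
            }
            return row;
        }

        // ... omitted
    }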

WHEN A LOG FILE IS LOADED, IT’S DESERIALIZED.
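The SerDe above maps quoted values to columns by position. As a standalone illustration of the same regular-expression idea, the sketch below keeps the keys as well, so a field can be looked up by name. It is not part of the Flamingo code; the class name `KeyValueLogParser` and method `parse` are hypothetical, and values are assumed never to contain a double quote.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: parses KEY="value" pairs from one semi-structured log line.
final class KeyValueLogParser {
    // Values may contain spaces and brackets, but not a double quote.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    static Map<String, String> parse(String line) {
        // LinkedHashMap keeps the fields in the order they appear in the log.
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "TYPE=\"IPINSIDE\" TIME=\"2014-03-20 17:40:37\" "
                + "MESG=\"mesg..... time[1395284830]\"";
        System.out.println(parse(line).get("TIME"));
    }
}
```

Keyed lookup tolerates a missing or reordered field, whereas the positional mapping depends on every line carrying the same fields in the same order.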

Pig Script Editor

•  Edits and saves Pig Latin scripts.

•  Executes and manages Pig Latin scripts to expedite data processing.

Dashboard

•  Displays batch job history.

Job Management

•  Schedules, monitors, and executes batch jobs.

•  Cron expressions are fully supported.

Project Details

•  Download

–  http://www.sourceforge.net/projects/hadoop-manager

•  Wiki (manuals and tech notes)

–  http://wiki.opencloudengine.org/pages/viewpage.action?pageId=819205

•  Issues (bugs and new features)

–  http://jira.opencloudengine.org

•  Build Server

–  http://build.opencloudengine.org

•  Google Groups: [email protected]

•  Subscription : [email protected]

The Future of Flamingo Project

•  Big Data on Cloud

•  Netra (OpenStack based Hadoop Provisioning) + Flamingo (Hadoop based Workspace)

•  Open Source based Big Data Platform

•  Apache Hadoop EcoSystem

•  Big Data Management Using Flamingo

•  Apache Hadoop PaaS (Platform as a Service)

•  Big Data All In One Package

Workflow Designer

•  MapReduce developers use different parameters.

•  How can these various MapReduce jobs be standardized?

Workflow Designer

•  Most UI parts are reusable and provided as components.

•  MapReduce modules and UI controls are standardized and offered as a framework.

Reuse components

UI Layout

Workflow Designer

•  Module icons are defined through metadata to minimize coding.

•  The framework takes care of most of the work, and users only handle metadata.

Participate and Share with Us!!

www.opencloudengine.org
