OpenSource Big Data Platform - Flamingo Project
DESCRIPTION
Flamingo is an open-source Big Data platform that combines an Ajax rich web interface, a workflow engine, a workflow designer, MapReduce, a Hive editor, and a Pig editor.
Movies: http://wiki.opencloudengine.org/pages/viewpage.action?pageId=2064714
Screen Shots: http://wiki.opencloudengine.org/pages/viewpage.action?pageId=2065069
Download: http://sourceforge.net/projects/hadoop-manager/files
Wiki: http://wiki.opencloudengine.org/pages/viewpage.action?pageId=819212
![Page 1: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/1.jpg)
Open Cloud Engine
Introduction and Case Study of Open Source Project Flamingo, the Big Data Platform
Open Cloud Engine Flamingo Project Leader Edward Kim ([email protected])
2014.03.01 v0.8
![Page 2: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/2.jpg)
What is a Big Data Platform?
![Page 3: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/3.jpg)
The roles of the Big Data Platform
• What are the main tasks that can be done on the Big Data platform?
  • Data mining, statistical analysis, log handling (collecting, pre-processing)
• Who does what on the platform?
  • It varies with the user.
  • Most operators have a development background, so they focus on system management and log handling.
  • Analysts focus on establishing a better environment for analyzing data.
• How many users are using the Big Data platform?
  • Many users → the platform's functionality and the accessibility of the infrastructure are important.
  • The Big Data platform handles data → it is vulnerable. Hadoop is insecure.
• What am I? Operator? Architect? Developer? Data Scientist?
  • The functions of the platform can be defined differently depending on the role.
![Page 4: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/4.jpg)
What Big Data Platform Must Provide SOFTWARE STACK
![Page 5: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/5.jpg)
What Big Data Platform Must Provide
INFRA MANAGEMENT MONITORING
![Page 6: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/6.jpg)
What Big Data Platform Must Provide
WORKFLOW
![Page 7: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/7.jpg)
What Big Data Platform Must Provide
ANALYSIS AND VISUALIZATION
![Page 8: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/8.jpg)
What Big Data Platform Must Provide
DASHBOARD
![Page 9: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/9.jpg)
What Big Data Platform Must Provide
SECURITY
• ACCESS • AUTHENTICATION • AUTHORIZATION • ENCRYPTION • AUDITING • POLICY
![Page 10: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/10.jpg)
What Big Data Platform Must Provide
• Batch job management and monitoring
• MapReduce-based parallel analysis programs
• User activity monitoring
• Policies on accessing resources and systems
• Various functions that improve accessibility to the infrastructure
![Page 11: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/11.jpg)
Flamingo Project In Open Cloud Engine
• Web technology makes big data infrastructure and data convenient to use.
• Users can handle data easily.
• Provides the functionality to do various jobs in one workspace.
• MapReduce jobs for analysis and processing can be reused.
• Open source oriented, and all systems are ready to go.
• Designed to be operator friendly.
• Supports the Hadoop ecosystem.
![Page 12: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/12.jpg)
Future of Big Data Platform
[Figure: Big Data Analysis and Service Platform, with numbered steps 1 to 7]
Figure labels: Browser; Designer; Search; Morphology Analysis; Analyze Graph; User Evaluation; Elect a leader; Log; Collect; Validation; Log Data; MapReduce Analysis Module; Design a workflow; Data Scientist; Service Planner; Data Analyst; Data users; Systems; Mobile Devices; Request Service; Opinion Leader Score Board; Open API; Data Visualization Charts.
Reuse analyzed results: analyzed results are exposed through an Open API.
Information Catalogue Search
| Information | Security | Batch | Type |
|---|---|---|---|
| User Similarity | 1 | Daily, 4 PM | XML |
| Item Recommendation | 2 | Daily, 2 AM | JSON |
| Purchase Preference | 3 | Daily, 8 PM | XML/JSON |
| Opinion Leader | 2 | Daily, 7 AM | XML/JSON |
![Page 13: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/13.jpg)
Flamingo Project
• Functionality matters the most in a Hadoop-based Big Data environment.
• Integrated open source projects are difficult to manage, and not enough UIs exist to handle them.
![Page 14: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/14.jpg)
Flamingo Workbench
• Users can move freely within a workspace while carrying out various jobs.
• Each window is dedicated to its own function.
• To minimize coding, reusable parts are componentized.
• The system is kept simple, and well-known frameworks are used so that new features are easy to add.
• The development method is standardized (tools, procedures, manuals, environments…).
![Page 15: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/15.jpg)
Flamingo Architecture
![Page 16: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/16.jpg)
File System Browser
• Managing files is an integral part of Hadoop.
• A familiar Windows file explorer style UI provides a better UX to users.
![Page 17: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/17.jpg)
File System Browser
Converts directories into Hive DBs or tables.
Hive DBs and tables are marked with different icons in the browser.
FLAMINGO HAS OPTIMIZED FREQUENTLY NEEDED FUNCTIONS
![Page 18: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/18.jpg)
File System Browser Enhancement
• Previewing files and their locations
• Hiding directories and files from unauthorized users (doesn’t come with Hadoop)
  • E.g. the /tmp directory is not visible to common users.
• Setting permissions on directories and files
• A home directory for each user (doesn’t come with Hadoop)
• Setting a quota on directories
• Regularly dumping file system size info (for monitoring)
![Page 19: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/19.jpg)
Audit Log
• Search all recorded HDFS logs.
![Page 20: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/20.jpg)
Workflow Designer
• Mounts various analytic modules (e.g. Mahout).
• Drag and drop the provided modules onto the canvas.
• Analytic and statistical modules are currently mounted, Mahout and Giraph are being mounted, and ETL MapReduce jobs will be mounted soon.
![Page 21: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/21.jpg)
Big Workflow Case
Supports a workflow composed of multiple nodes.
![Page 22: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/22.jpg)
Apache Access Log To CSV
![Page 23: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/23.jpg)
Apache Access Log To CSV
Parameters to MapReduce
• Delimiter
• An option to print log lines that do not match the pattern
Location of the Apache access log and the output path of the CSV file.
MapReduce JAR file and driver name.
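The slides do not show the conversion logic itself. As a rough sketch only, the per-line work of such a job could look like the following, assuming the input is in Apache Common Log Format. The class name `AccessLogToCsv`, the method `toCsv`, and the regular expression are illustrative assumptions, not Flamingo's actual module. A non-matching line yields `null`, so the caller can decide whether to print it, which mirrors the option listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flamingo: converts one access log line to a delimited record.
public class AccessLogToCsv {
    // Assumed input format (Common Log Format): host ident user [time] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    /** Returns the captured fields joined by the delimiter, or null when the line does not match. */
    public static String toCsv(String line, String delimiter) {
        Matcher m = CLF.matcher(line);
        if (!m.find()) {
            return null; // caller decides whether to print non-matching lines
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            if (i > 1) {
                sb.append(delimiter);
            }
            sb.append(m.group(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        // prints 127.0.0.1,-,frank,10/Oct/2000:13:55:36 -0700,GET /apache_pb.gif HTTP/1.0,200,2326
        System.out.println(toCsv(line, ","));
        // prints null
        System.out.println(toCsv("not an access log line", ","));
    }
}
```

In a real MapReduce job this method would be called from the mapper, with the delimiter and the non-matching option read from the job configuration.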
![Page 24: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/24.jpg)
Workflow Designer
• A complex workflow is often needed to reach the final output.
• Processing files with MapReduce usually takes several jobs in sequence, which makes creating a workflow by hand difficult.
• Engineers prefer Apache Hive’s SQL-like query language over writing MapReduce jobs, so Workflow Designer comes in handy.
• When handling various types of log files, Workflow Designer and MapReduce are essential.
![Page 25: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/25.jpg)
Workflow Monitoring
• Monitors workflows submitted from Workflow Designer. Detailed logs can be checked.
![Page 26: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/26.jpg)
Workflow Monitoring
root@n02:~/flamingo_data/tmp/2014/03/31/90/JOB_20140331_172000_90_157566920/26385942 $> ls -lsa
total 40
 4 drwxr-xr-x  2 root root  4096 2014-03-31 17:23 .
 4 drwxr-xr-x 20 root root  4096 2014-03-31 17:23 ..
16 -rw-r--r--  1 root root 12731 2014-03-31 17:23 action.log → execution log
 4 -rwxrwxrwx  1 root root  1259 2014-03-31 17:23 core-site.xml
 0 -rw-r--r--  1 root root     0 2014-03-31 17:23 hadoop.job_201403300831_0471 → MapReduce Job ID
 4 -rwxrwxrwx  1 root root   852 2014-03-31 17:23 script.sh
root@n02:~/flamingo_data/tmp/2014/03/31/90/JOB_20140331_172000_90_157566920/26385942 $>
NODES IN A WORKFLOW CONTAIN SEVERAL MAPREDUCE JOBS, SO THEY MUST BE TRACKABLE.
What users view in the MapReduce execution history
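The directory listing on this slide suggests that each MapReduce job started by a workflow node leaves an empty marker file named `hadoop.<job id>`. Assuming that naming convention holds, a minimal sketch of collecting the job IDs for one node could look like this. The class `JobIdMarkers` and the method `jobIds` are hypothetical names, not Flamingo's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: recovers MapReduce job IDs from marker file names in a node's working directory.
public class JobIdMarkers {
    // Assumed convention, taken from the listing: marker files are named "hadoop." + job ID
    private static final String PREFIX = "hadoop.";

    /** Returns the job IDs encoded in file names of the form hadoop.job_..., in input order. */
    public static List<String> jobIds(List<String> fileNames) {
        List<String> ids = new ArrayList<>();
        for (String name : fileNames) {
            if (name.startsWith(PREFIX + "job_")) {
                ids.add(name.substring(PREFIX.length()));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "action.log", "core-site.xml", "hadoop.job_201403300831_0471", "script.sh");
        // prints [job_201403300831_0471]
        System.out.println(jobIds(files));
    }
}
```

Because a node may run several jobs, the method returns a list rather than a single ID.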
![Page 27: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/27.jpg)
Hadoop Job Monitoring
Must be able to be tracked in Hadoop Job Monitoring.
![Page 28: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/28.jpg)
Expression Language (EL)
• Dynamically substitutes values into variables.
  • E.g. today’s date: dateFormat('yyyyMMdd'), dateFormat('yyyy-MM-dd')
• For example, replace variables with specific dates.
  • E.g. a daily batch: record yesterday’s date in a workflow executed today.
• Supported expressions
  • dateFormat('DATE FORMAT') → dateFormat('yyyyMMddHHmmss')
  • hostname, escapeString
  • yesterday, tomorrow
  • month, day, hour, minute, … → day('yyyyMMdd', -1) :: yesterday’s date (20131111)
  • trim, concat, urlEncode, firstNotNull
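To make the substitution idea concrete, here is a small sketch that handles only two of the listed expressions, `dateFormat` and `day`, written in the `${...}` form. It is not Flamingo's EL engine: the class `ElSketch`, the regular expression, the use of straight quotes, and the choice of `java.time` patterns are all assumptions made for illustration.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of EL substitution; supports only dateFormat('P') and day('P', OFFSET).
public class ElSketch {
    // Matches ${dateFormat('PATTERN')} and ${day('PATTERN', OFFSET)}; the offset is optional.
    private static final Pattern EL = Pattern.compile(
        "\\$\\{(dateFormat|day)\\('([^']*)'(?:\\s*,\\s*(-?\\d+))?\\)\\}");

    /** Replaces each supported expression in text with a date computed relative to now. */
    public static String evaluate(String text, LocalDateTime now) {
        Matcher m = EL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // A missing offset means "today"; -1 means yesterday, 1 means tomorrow.
            long offsetDays = m.group(3) == null ? 0 : Long.parseLong(m.group(3));
            String value = now.plusDays(offsetDays)
                .format(DateTimeFormatter.ofPattern(m.group(2)));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2013, 11, 12, 9, 30, 0);
        // prints /logs/20131111/input
        System.out.println(evaluate("/logs/${day('yyyyMMdd', -1)}/input", now));
        // prints run-2013-11-12
        System.out.println(evaluate("run-${dateFormat('yyyy-MM-dd')}", now));
    }
}
```

Passing the current time in as a parameter keeps the evaluation deterministic, which is convenient when a daily batch is re-run for a past date.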
![Page 29: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/29.jpg)
Expression Language (EL)
The ${EL} format is dynamically replaced with real values.
![Page 30: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/30.jpg)
Hadoop Job Tracker Monitoring
• Displays Hadoop’s job tracker info on graphs.
![Page 31: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/31.jpg)
Hadoop Job Tracker Monitoring
• Remote monitoring and tracking of Hadoop jobs is available.
![Page 32: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/32.jpg)
Hive Editor & Hive Metastore Browser
• Search, browse, and download using SQL.
• Hive Metastore is integrated, so databases and tables are easy to manage.
![Page 33: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/33.jpg)
Hive Editor Use Case
• Case 1: Search the user access log with Hive
– If the log is semi-structured or unstructured, it’s problematic.
– If a column contains an array or a map, it’s also problematic.
• Below is an example of a semi-structured log
TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"
TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"
![Page 34: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/34.jpg)
Hive Editor Use Case
TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"
![Page 35: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/35.jpg)
Hive Editor Use Case
![Page 36: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/36.jpg)
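Hive Editor Use Case
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.Writable;

public class MasSerde implements SerDe {
    private StructTypeInfo rowTypeInfo;
    private ObjectInspector rowOI;
    private List<String> colNames;
    private List<Object> row = new ArrayList<Object>();

    // Captures every double-quoted value in a log line, in order of appearance.
    Pattern p = Pattern.compile("\"(.*?)\"");

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        row.clear();
        Matcher m = p.matcher(blob.toString());
        List<String> list = new ArrayList<String>();
        while (m.find()) {
            list.add(m.group(1));
        }
        String[] split = list.toArray(new String[list.size()]);
        // Values are assigned to columns by position, so the column order must match the log.
        int i = 0;
        for (String fieldName : rowTypeInfo.getAllStructFieldNames()) {
            TypeInfo fieldTypeInfo = rowTypeInfo.getStructFieldTypeInfo(fieldName);
            row.add(parseField(split[i], fieldTypeInfo));
            i++;
        }
        return row;
    }
    ... (omitted)
}
WHEN A LOG FILE IS LOADED, IT’S DESERIALIZED.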
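The core of the SerDe above is the regular expression that pulls out every double-quoted value. The standalone sketch below applies the same pattern to a shortened sample line, without any Hive dependencies, so the behavior can be checked in isolation. The class name `QuotedValueExtractor` is an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration of the quoted-value extraction step; not part of Flamingo or Hive.
public class QuotedValueExtractor {
    // Same pattern as in the slide: a reluctant match of everything between two double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    /** Returns the quoted values of a KEY="VALUE" log line, in order of appearance. */
    public static List<String> extract(String line) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "TYPE=\"IPINSIDE\" TIME=\"2014-03-20 17:40:37\" ID=\"guest0899349\"";
        // prints [IPINSIDE, 2014-03-20 17:40:37, guest0899349]
        System.out.println(extract(line));
    }
}
```

Note that the keys are discarded and values are matched to columns purely by position, so this approach only works when every log line lists its fields in the same order.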
![Page 37: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/37.jpg)
Hive Editor Use Case
![Page 38: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/38.jpg)
Pig Script Editor
• Edits and saves Pig Latin scripts.
• Executes and manages Pig Latin scripts to expedite data processing.
![Page 39: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/39.jpg)
Dashboard
• Displays batch job history.
![Page 40: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/40.jpg)
Job Management
• Schedules, executes, and monitors batch jobs.
![Page 41: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/41.jpg)
Job Management
• Cron expressions are fully supported.
![Page 42: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/42.jpg)
Project Details
• Download
– http://www.sourceforge.net/projects/hadoop-manager
• Wiki (manuals and tech notes)
– http://wiki.opencloudengine.org/pages/viewpage.action?pageId=819205
• Issues (bugs and new features)
– http://jira.opencloudengine.org
• Build Server
– http://build.opencloudengine.org
• Google Groups: [email protected]
• Subscription: [email protected]
![Page 43: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/43.jpg)
The Future of Flamingo Project
• Big Data on Cloud
• Netra (OpenStack based Hadoop Provisioning)
+ Flamingo (Hadoop based Workspace)
• Open Source based Big Data Platform
• Apache Hadoop EcoSystem
• Big Data Management Using Flamingo
• Apache Hadoop PaaS (Platform as a Service)
• Big Data All In One Package
![Page 44: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/44.jpg)
Workflow Designer
• Each MapReduce developer uses different parameters.
• How can these varied MapReduce jobs be standardized?
![Page 45: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/45.jpg)
Workflow Designer
• Most UI parts are reusable and provided as components.
• The MapReduce module and UI controls are standardized and offered as a framework.
Reuse components
UI Layout
![Page 46: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/46.jpg)
Workflow Designer
• Define module icons through metadata and minimize coding.
• The framework takes care of most of the work, and users only handle metadata.
![Page 47: OpenSource Big Data Platform - Flamingo Project](https://reader034.vdocuments.us/reader034/viewer/2022052621/557e971dd8b42a1d048b4b1a/html5/thumbnails/47.jpg)
Participate and Share with Us!!
www.opencloudengine.org