SoftServe's Hadoop Demo Lab
DESCRIPTION
SoftServe's Hadoop Demo Lab is a project that aggregates log files from 300 Apache HTTPD web servers and loads them into a Hadoop/Elasticsearch cluster for later analysis with MicroStrategy and Kibana. The entire deployment is fully automated with Vagrant and Puppet.
TRANSCRIPT
SOFTSERVE’S HADOOP DEMO LAB
Aug, 2014
Agenda
1) Who we are & What we do
2) Why we started this project
3) High-Level Task Overview
4) Task Analysis
5) Solution Architecture
6) Trade-off Analysis
7) Development Aspects
WHO WE ARE & WHAT WE DO
▪ Leading global Product and Application Development partner founded in 1993
▪ 3,300+ employees across North America, Ukraine and Western Europe
▪ Thousands of successful outsourcing projects!
SaaS/Cloud Solutions, Mobility Solutions, UX/UI, BI/Analytics/Big Data, Software Architecture, Security
Clients include:
Why SoftServe
• Dedicated Architecture Group (including BI and BigData, 40+ architects)
• Demo Hadoop Environment
• Reference architecture library
• 10+ successful BI/BigData projects
• Certified Big Data engineers (Hadoop, MongoDB)
• Partnership with major RDBMS, BI and BigData vendors
What we do: Services
1) Design & Assessment
2) Optimization & Modernization
3) POC & Prototyping
4) Development and Quality Control
5) Production and Non-Production Support
Technology Stack
Data Warehouse
Data Integration
Big Data and NoSQL
BI Platforms
Big Data Analytics Expertise
WHY WE STARTED THIS PROJECT
Demo Lab: Why?
1) Increase Internal Experience
2) Create Reference Solution w/o NDA Limitations
3) Get Playground for Tests
4) Provide Demo Environment for Customers (using their data)
5) Decrease time to Deliver (by introducing automation)
HADOOP DEMO LAB: HIGH-LEVEL TASK OVERVIEW
Demo Lab: Input Data
Data Volume
• 270-300 Web Servers (Apache HTTPD)
• 447,392 events per minute
• 644,245,094 events per day
• ~100-250 bytes per event
• 150 GB of data per day
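The volume figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a rough sanity check, not the lab's own sizing calculator; the function name and the use of decimal gigabytes (10^9 bytes) are assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def daily_volume_gb(events_per_day: int, bytes_per_event: int) -> float:
    """Return the raw daily log volume in decimal gigabytes (10**9 bytes)."""
    return events_per_day * bytes_per_event / 1e9


events_per_day = 644_245_094  # figure taken from the slide
events_per_minute = events_per_day / MINUTES_PER_DAY

# Bounds for the stated event size range of 100-250 bytes.
low_gb = daily_volume_gb(events_per_day, 100)
high_gb = daily_volume_gb(events_per_day, 250)

print(f"{events_per_minute:,.0f} events/min")
print(f"{low_gb:.0f}-{high_gb:.0f} GB/day")
```

Under these assumptions the stated 150 GB per day sits near the upper end of the range, which corresponds to an average event size of roughly 233 bytes.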
Log Types
1) Apache HTTPD access log
2) Apache HTTPD error log
3) Service log (CPU, RAM, Disk I/O, Disk Space)
4) Application server servlet log
Retention
• Last 30 days: raw data
• Last 24 hours: per-minute aggregation
• Whole period: per-hour aggregation
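The per-minute and per-hour aggregation levels can be illustrated by truncating event timestamps to a bucket and counting. This is a minimal in-memory sketch; the function name `rollup` is hypothetical, and the lab itself presumably performs this step inside the Hadoop/Elasticsearch pipeline rather than in plain Python.

```python
from collections import Counter
from datetime import datetime


def rollup(timestamps, granularity="minute"):
    """Count events per minute or per hour by truncating each timestamp."""
    counts = Counter()
    for ts in timestamps:
        if granularity == "minute":
            bucket = ts.replace(second=0, microsecond=0)
        elif granularity == "hour":
            bucket = ts.replace(minute=0, second=0, microsecond=0)
        else:
            raise ValueError(f"unsupported granularity: {granularity}")
        counts[bucket] += 1
    return counts


events = [
    datetime(2014, 8, 1, 13, 55, 36),
    datetime(2014, 8, 1, 13, 55, 59),
    datetime(2014, 8, 1, 13, 56, 2),
]
per_minute = rollup(events, "minute")
per_hour = rollup(events, "hour")
```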
Demo Lab: Log Data Examples
Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome

Vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 305416 260688  29160 2356920    2    2     4     1    0    0  6  1 92  2  0

iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.68    0.00    0.52    2.03    0.00   91.76
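To show what structured fields the access log example carries, here is a small parser for the Apache Common Log Format. It is a sketch under assumptions: the regular expression and the function name `parse_access_line` are illustrative, and the deck does not say how the lab actually parses its logs.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Parse one access log line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Apache writes "-" when no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_access_line(sample)
```

Note that `%b` month parsing depends on the process locale, so this assumes an English or C locale.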
TASK ANALYSIS
Architecture Drivers: Use Cases
Architecture Drivers: Quality Attributes (1/3)
Architecture Drivers: Quality Attributes (2/3)
Architecture Drivers: Quality Attributes (3/3)
Architecture Drivers: Limitations
Demo Lab: Marketecture
SOLUTION ARCHITECTURE
Reference Architecture Selection
Options:
• Traditional Relational
• Extended Relational
• Non-Relational
• Lambda Architecture (Hybrid)
• Data Refinery (Hybrid)
Lambda Architecture:
• Simultaneous access to real-time and historical data
• Isolated Design and Development
• Increased Fault Tolerance
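In a Lambda Architecture, simultaneous access to real-time and historical data usually means that a query combines a precomputed batch view with a speed-layer view covering events that arrived after the last batch run. The sketch below illustrates that idea only; the class and method names are hypothetical and do not describe the lab's implementation.

```python
from collections import Counter


class ServingLayer:
    """Toy serving layer that merges a batch view with a real-time view."""

    def __init__(self):
        self.batch_view = Counter()   # recomputed periodically from raw data
        self.speed_view = Counter()   # updated as each new event arrives

    def on_event(self, key):
        """Speed layer: count an event that the batch layer has not seen yet."""
        self.speed_view[key] += 1

    def on_batch_complete(self, recomputed_view):
        """Batch layer: replace the batch view and discard the events it now covers."""
        self.batch_view = Counter(recomputed_view)
        self.speed_view.clear()

    def query(self, key):
        """Answer from both views so recent and historical data are visible together."""
        return self.batch_view[key] + self.speed_view[key]


layer = ServingLayer()
layer.on_batch_complete({"status_200": 1000, "status_404": 12})
layer.on_event("status_404")
layer.on_event("status_404")
```

This toy version assumes the batch run covers every event seen so far; a real system would only discard speed-layer data older than the batch cutoff.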
Lambda Architecture
Data Flow
Infrastructure View
TRADE-OFF ANALYSIS
Distribution Selection
Hive Stinger vs Impala
Compression Ratio
Access Speed
BI Tool Selection
Options:
• Tableau
• JasperSoft
• MicroStrategy
• QlikView
MicroStrategy:
• Powerful and feature-rich BI tool
• 31-day trial period without a trial key
• Well integrated with Hadoop (and Impala)
• Easy to install in silent mode (command line)
DEVELOPMENT ASPECTS
Automation
Local Development (80%): Development and Debugging
Cloud Development (20%): F&P Testing, Demo
Log Generator
• 4,200 events / second (File source)
• 5,500 events / second (TCP source)
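These generator rates can be compared against the target load of 447,392 events per minute given on the Input Data slide. The sketch below estimates how many generator instances would be needed per source type; it assumes throughput scales linearly with the number of instances, which the deck does not state, and the function name is hypothetical.

```python
import math

REQUIRED_PER_MINUTE = 447_392                   # target load from the Input Data slide
GENERATOR_RATES = {"file": 4200, "tcp": 5500}   # events per second, per instance


def instances_needed(required_per_second, rate_per_instance):
    """Smallest number of generator instances that meets the required rate."""
    return math.ceil(required_per_second / rate_per_instance)


required_per_second = REQUIRED_PER_MINUTE / 60

for source, rate in GENERATOR_RATES.items():
    count = instances_needed(required_per_second, rate)
    print(f"{source}: {count} instance(s)")
```

The target works out to roughly 7,457 events per second, so a single generator instance of either type falls short of the full load.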
Accurate Sizing
20k/min, 50k/min, 100k/min, 200k/min
Calculator!
Reference
1) Install Hadoop (CDH4) on 5 nodes with VMWare, CDH4, Cloudera Manager 4
https://www.youtube.com/watch?v=CobVqNMiqww
2) Puppet & Vagrant Tutorial
http://puppetlabs.com/blog/puppet-and-vagrant-tutorial
3) Hardware for Hadoop
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_cluster-planning-guide/content/hardware-selection-for-hbase.html
4) How to Refine and Visualize Server Log Data
http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-server-log-data/
5) Hadoop Cluster Sizing
http://hortonworks.com/wp-content/uploads/downloads/2013/06/Hortonworks.ClusterConfigGuide.1.0.pdf
Thank You!
SoftServe US Office
One Congress Plaza, 111 Congress Avenue, Suite 2700
Austin, TX 78701
Tel: 512.516.8880
Contacts
Valentyn [email protected]
Tel: 866.687.3588 x4341