visual mapping of clickstream data
TRANSCRIPT
![Page 1: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/1.jpg)
Visual Mapping of Clickstream Data: Introduction and Demonstration
Cedric Carbone, Ciaran Dynes, Talend
![Page 2: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/2.jpg)
2© Talend 2014
Visual mapping of Clickstream data: introduction and demonstration
Ciaran Dynes VP Products
Cedric Carbone CTO
![Page 3: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/3.jpg)
Agenda
• Clickstream live demo
• Moving from hand-code to code generation
• Performance benchmark
• Optimization of code generation
![Page 4: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/4.jpg)
Hortonworks Clickstream demo
http://hortonworks.com/hadoop-tutorial/how-to-visualize-website-clickstream-data/
![Page 5: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/5.jpg)
Trying to get from this…
![Page 6: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/6.jpg)
to this…
Big Data – “pure Hadoop”
Visual design in Map/Reduce and optimize before deploying on Hadoop
![Page 7: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/7.jpg)
Demo overview
• Demo flow overview:
1. Load raw Omniture web log files to HDFS
• Can discuss the ‘schema on read’ principle: it allows any data type to be easily loaded to a ‘data lake’, where it is then available for analytical processing
• http://ibmdatamag.com/2013/05/why-is-schema-on-read-so-useful/
2. Define a Map/Reduce process to transform the data
• Identical skills to any graphical ETL tool
• Lookup customer and product data to enrich the results
• Results written back to HDFS
3. Federate the results to a visualisation tool of your choice
• Excel
• Analytics tool such as Tableau, QlikView, etc.
• Google Charts
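Step 2 of the demo flow can be illustrated with a minimal, in-memory sketch of a map step that enriches each click through a lookup, followed by a reduce step that aggregates. This is not the code Talend generates. The tab-separated record layout, the `PRODUCT_CATEGORY` table, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for the product data (not from the slides).
PRODUCT_CATEGORY = {
    "/products/shoes": "footwear",
    "/products/hats": "accessories",
}


def map_click(line):
    """Map step: parse one log line (assumed 'user_id<TAB>url') and enrich it."""
    user_id, url = line.rstrip("\n").split("\t")
    category = PRODUCT_CATEGORY.get(url, "unknown")
    yield category, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_job(lines):
    """Run map then reduce over an iterable of raw log lines."""
    pairs = (pair for line in lines for pair in map_click(line))
    return reduce_counts(pairs)
```

On a real cluster the lookup data would be distributed to the mappers and the result written back to HDFS; here everything stays in memory to keep the sketch runnable.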
![Page 8: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/8.jpg)
Big Data Clickstream Analysis
[Diagram: Web logs → TALEND Load to HDFS → HADOOP (HDFS, Map/Reduce, Hive), driven by TALEND BIG DATA (Integration) → TALEND Federate to analytics → Clickstream Dashboard]
![Page 9: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/9.jpg)
Native Map/Reduce Jobs
• Create classic ETL patterns using native Map/Reduce
- Only data management solution on the market to generate native Map/Reduce code
• No need for expensive big data coding skills
• Zero pre-installation on the Hadoop cluster
• Hadoop is the “engine” for data processing #dataos
![Page 10: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/10.jpg)
SHOW ME
![Page 11: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/11.jpg)
PERFORMANCE OF CODE GENERATION
![Page 12: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/12.jpg)
MapReduce 2.0, YARN, Storm, Spark
• YARN: Ensures predictable performance & QoS for all apps
• Enables apps to run “IN” Hadoop rather than “ON”
• In Labs: Streaming with Apache Storm
• In Labs: mini-batch and in-memory with Apache Spark
[Diagram: Applications Run Natively IN Hadoop]
Application types: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase), OTHER (Search)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
Source: Hortonworks
![Page 13: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/13.jpg)
Talend: Tap – Transform – Deliver
[Diagram: the Hadoop stack (HDFS2, YARN, and the application types BATCH, INTERACTIVE, STREAMING, GRAPH, NoSQL, EVENTS, ONLINE, OTHER) surrounded by Talend capabilities]
TAP (Ingestion): SQOOP, FLUME, HDFS API, HBase API, HIVE, 800+
TRANSFORM (Data Refinement): PROFILE, PARSE, MAP, CDC, CLEANSE, STANDARDIZE, MACHINE LEARNING, MATCH
DELIVER (as an API): ActiveMQ, Karaf, Camel, CXF, Kafka, Storm
Meta, Security, MDM, iPaaS, Govern, HA
![Page 14: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/14.jpg)
Talend Labs Benchmark Environment
© Talend 2013
• Context: 9-node cluster, replication factor 3
- DELL R210-II, 1 Xeon® E3 1230 v2, 4 cores, 16 GB RAM
- Map slots: 2 slots / node
- Reduce slots: 2 slots / node
• Total processing capabilities:
- 9*2 map slots: 18 maps
- 9*2 reduce slots: 18 reduces
• Data volume: 1, 10, 100 GB
![Page 15: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/15.jpg)
TPCH Benchmark
• Pig and Hive Apache communities are using TPCH benchmarks
- https://issues.apache.org/jira/browse/PIG-2397
- https://issues.apache.org/jira/browse/HIVE-600
• We are currently running the same tests in our labs
- Pig Hand Coded script vs. Talend Pig generated code
- Pig Hand Coded script vs. Talend Map/Reduce generated code
- Hive QL produced by community vs. Hive ELT capabilities
• Partial results already available for Pig
- Very good results
![Page 16: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/16.jpg)
Optimizing job configuration?
• By default, Talend follows Hadoop recommendations regarding the number of reducers usable for the job execution.
• The rule is that 99% of the total reducers available can be used
- http://wiki.apache.org/hadoop/HowManyMapsAndReduces
- For the Talend benchmark, the default max reducers is:
• 3 nodes: 5 (3*2 = 6 slots; 6 * 99% = 5.94, rounded down to 5)
• 6 nodes: 11 (6*2 = 12 slots; 12 * 99% = 11.88, rounded down to 11)
• 9 nodes: 17 (9*2 = 18 slots; 18 * 99% = 17.82, rounded down to 17)
- Another customer benchmark, default max reducers:
• 700 * 99% = 693 nodes (assumption with half Dell and half HP servers)
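The arithmetic on this slide can be reproduced with a small sketch. The function name and parameters are hypothetical, and the 99% factor is taken from the slide as stated. Integer arithmetic is used so that the rounding down is exact.

```python
def default_max_reducers(nodes, reduce_slots_per_node, percent=99):
    """Return the default reducer cap: percent of all reduce slots, rounded down."""
    total_slots = nodes * reduce_slots_per_node
    # Integer division avoids floating-point surprises such as 6 * 0.99.
    return (total_slots * percent) // 100


# Cluster sizes from the Talend benchmark, 2 reduce slots per node.
benchmark_caps = {nodes: default_max_reducers(nodes, 2) for nodes in (3, 6, 9)}
```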
![Page 17: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/17.jpg)
TPCH Results: Pig Hand Coded vs Pig generated
• 19 tests with results similar to or better than Pig Hand Coded scripts
![Page 18: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/18.jpg)
TPCH Results: Pig Hand Coded vs Pig generated
• 19 tests with results similar to or better than Pig Hand Coded scripts
• Code is already optimized and automatically applied
Talend code is faster
![Page 19: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/19.jpg)
PERFORMANCE IMPROVEMENTS
![Page 20: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/20.jpg)
TPCH Results: Pig Hand Coded vs Pig generated
• 19 tests with results similar to or better than Pig Hand Coded scripts
• 3 tests will benefit from a new COGROUP feature
Requires CoGroup
![Page 21: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/21.jpg)
Example: How Sort works for Hadoop
Talend has implemented the TeraSort algorithm for Hadoop
1. A first Map/Reduce job is generated to analyze the data ranges
- Each mapper reads its data and analyzes its bucket critical values
- The reducer produces quartile files for all the data to sort
2. A second Map/Reduce job is started
- Each map simply sends the key to sort to the reducer
- A custom partitioner is created to send the data to the best bucket depending on the quartile file previously created
- Each reducer outputs the data sorted by bucket
• Research: tSort: GraySort, MinuteSort
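The two-job idea above can be sketched in a single process: sample the keys to derive bucket boundaries, route each key to its bucket with a range partitioner, then sort each bucket independently. This is a simplified illustration of TeraSort-style range partitioning, not Talend's generated code. The function names, the sample size, and the use of evenly spaced sample values as boundaries are assumptions.

```python
import bisect
import random


def sample_cut_points(data, num_buckets, sample_size=100, seed=0):
    """Job 1 (sketch): sample the keys and derive num_buckets - 1 boundaries."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    if not sample:
        return []
    # Evenly spaced values from the sorted sample serve as bucket boundaries.
    return [sample[(i * len(sample)) // num_buckets] for i in range(1, num_buckets)]


def partition(key, cut_points):
    """Custom partitioner: index of the bucket (reducer) that owns this key."""
    return bisect.bisect_right(cut_points, key)


def range_partitioned_sort(data, num_buckets):
    """Job 2 (sketch): route keys to buckets, sort each bucket, concatenate."""
    cut_points = sample_cut_points(data, num_buckets)
    buckets = [[] for _ in range(num_buckets)]
    for key in data:
        buckets[partition(key, cut_points)].append(key)
    # Buckets cover increasing key ranges, so concatenating the sorted
    # buckets yields a globally sorted result.
    return [key for bucket in buckets for key in sorted(bucket)]
```

Because each bucket covers a disjoint key range, the per-bucket sorts are independent and could run on separate reducers; the quality of the sample only affects how evenly the work is balanced, not correctness.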
![Page 22: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/22.jpg)
How to Get the Sandbox!
• Videos on the Jumpstart
- How to Launch http://youtu.be/J3Ppr9Cs9wA
- Clickstream video http://youtu.be/OBYYFLmdCXg
• To get the Sandbox
- http://www.talend.com/contact
![Page 23: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/23.jpg)
Step-by-Step Directions
• Completely Self-contained Demo VM Sandbox
• Key Scenarios like Clickstream Analysis
![Page 24: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/24.jpg)
Come try the Sandbox
Hortonworks Dev Café & Talend
![Page 25: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/25.jpg)
Talend Platform for Big Data v5.4
[Diagram: Talend Platform for Big Data]
TALEND UNIFIED PLATFORM: Studio, Repository, Deployment, Execution, Monitoring
DATA INTEGRATION: Data Access, ETL / ELT, Version Control, Business Rules, Change Data Capture, Scheduler, Parallel Processing, High Availability
BIG DATA QUALITY: Hive Data Profiling, Drill-down to Values, DQ Portal, Monitoring, Data Stewardship, Report Design, Address Validation, Custom Analysis, M/R Parsing, Matching
BIG DATA: Hadoop 2.0, MapReduce ETL/ELT, HCatalog/meta-data, Pig, Sqoop, Hive, Hadoop Job Scheduler, Google BigQuery, NoSQL Support, HDFS
RUNTIME PLATFORM (JAVA, Hadoop, SQL, etc.)
![Page 26: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/26.jpg)
![Page 27: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/27.jpg)
NonStop HBase – Making HBase Continuously Available for Enterprise Deployment
Dr. Konstantin Boudnik, WANdisco
![Page 28: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/28.jpg)
Non-Stop HBase
Making HBase Continuously Available for Enterprise Deployment
Konstantin Boudnik – Director, Advanced Technologies, WANdisco
Brett Rudenstein – Senior Product Manager, WANdisco
![Page 29: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/29.jpg)
WANdisco: continuous availability company
WANdisco := Wide Area Network Distributed Computing
We solve availability problems for enterprises. If you can’t afford 99.999%, we’ll help
Publicly traded on the London Stock Exchange since mid-2012 (LSE:WAND)
Apache Software Foundation sponsor; actively contributing to Hadoop, SVN, and others
US patented active-active replication technology
Located on three continents
Enterprise ready, high availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
Subversion, Git, Hadoop HDFS, HBase at 200+ customer sites
![Page 30: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/30.jpg)
What are we solving?
![Page 31: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/31.jpg)
Traditionally everybody relies on backups
![Page 32: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/32.jpg)
HA is (mostly) a glorified backup
Redundancy of critical elements
- Standby servers
- Backup network links
- Off-site copies of critical data
- RAID mirroring
Baseline:
- Create and synchronize replicas
- Clients switch over in case of failure
- Extra hardware lying idle, spinning “just in case”
![Page 33: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/33.jpg)
A Typical Architecture (HDFS HA)
![Page 34: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/34.jpg)
Backups can fail
![Page 35: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/35.jpg)
WANdisco Active-Active Architecture
100% Uptime with WANdisco’s patented replication technology
- Zero downtime / zero data loss
- Enables maintenance without downtime
Automatic recovery of failed servers; automatic rebalancing as workload increases
[Diagram label: HDFS Data]
![Page 36: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/36.jpg)
Multi-threaded Server Software: Multiple threads processing client requests in a loop
[Diagram: Server Process with threads 1, 2 and 3. Each thread loops: get client request (e.g. hbase put), make change to state (db), send return value to client. Labels: acquire lock, release lock. Operations (OP) from the three threads interleave.]
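The request loop on this slide can be sketched with a few worker threads sharing one state object. This is a minimal illustration, not HBase code: the `Server` class, the counter-style "put", and the placement of the lock around the state change are assumptions.

```python
import threading


class Server:
    """Toy server process: shared state guarded by a single lock."""

    def __init__(self):
        self.state = {}  # stands in for the "db"
        self.lock = threading.Lock()

    def handle_put(self, key, value):
        """Handle one client request: lock, change state, unlock, reply."""
        with self.lock:  # acquire lock ... release lock
            self.state[key] = self.state.get(key, 0) + value
        return "ok"  # return value sent to the client


def worker(server, requests):
    """One server thread processing its client requests in a loop."""
    for key, value in requests:
        server.handle_put(key, value)


def run(num_threads=3, puts_per_thread=1000):
    """Start the threads, wait for them, and return the final state."""
    server = Server()
    threads = [
        threading.Thread(target=worker, args=(server, [("counter", 1)] * puts_per_thread))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return server.state
```

The order in which the threads acquire the lock is not deterministic, which is why replaying the same client requests on another server does not by itself guarantee the same sequence of state changes.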
![Page 37: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/37.jpg)
Ways to achieve single server redundancy
![Page 38: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/38.jpg)
Using a TCP Connection to send data to three replicated servers (Load Balancer)
[Diagram: Client sends operations (OP) through a Load Balancer to the Server Process on server1, server2 and server3; server1 and server2 each show four OPs, server3 shows two]
![Page 39: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/39.jpg)
HBase WAL replication
State Machine (HRegion contents, HMaster metadata, etc.) is modified first
Modification Log (HBase WAL) is sent to a Highly Available shared storage
Standby Server(s) read edits log and serve as warm standby servers, ready to take over should the active server fail
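The three points above can be sketched as a toy active/standby pair sharing an append-only log. This is an illustration of the pattern only, not HBase's WAL implementation: the class names, the in-memory list standing in for shared storage, and the `take_over` method are assumptions.

```python
class SharedLog:
    """Stands in for the highly available shared storage holding log entries."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)

    def read_from(self, offset):
        """Return a copy of all entries at or after the given offset."""
        return self.entries[offset:]


class ActiveServer:
    """Single active server: modifies its state first, then ships the edit."""

    def __init__(self, log):
        self.state = {}
        self.log = log

    def put(self, key, value):
        self.state[key] = value        # state machine is modified first
        self.log.append((key, value))  # modification log goes to shared storage


class StandbyServer:
    """Warm standby: replays the edits log to keep up with the active server."""

    def __init__(self, log):
        self.state = {}
        self.log = log
        self.applied = 0  # number of log entries already replayed

    def catch_up(self):
        for key, value in self.log.read_from(self.applied):
            self.state[key] = value
            self.applied += 1

    def take_over(self):
        """On failover, replay any remaining edits before serving clients."""
        self.catch_up()
        return self.state
```

The standby lags the active server by whatever has not been replayed yet, so failover has to finish the replay before it can serve requests.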
![Page 40: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/40.jpg)
HBase WAL replication
[Diagram: server1 (Server Process, single active server) applies operations (OP) and writes WAL entries to Shared Storage; server2 (Server Process, standby server) reads them]
![Page 41: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/41.jpg)
HBase WAL tailing, WAL Snapshots etc.
Only one active region server is possible
Failover takes time
Failover is error prone
RegionServer failover isn’t seamless for clients
![Page 42: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/42.jpg)
Implementing multiple active masters with Paxos coordination (not about leader election)
![Page 43: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/43.jpg)
Three replicated servers
[Diagram: several Clients send operations (OP) to server1, server2 and server3; each server runs a Server Process with a Distributed Coordination Engine, the engines coordinate via Paxos (DConE), and each Server Process shows the same four OPs]
![Page 44: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/44.jpg)
HBase Continuous Availability (multiple active masters)
![Page 45: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/45.jpg)
HBase Single Points of Failure
Single HBase Master
- Service interruption after Master failure
HBase client
- Client session doesn’t fail over after a RegionServer failure
HBase Region Server: downtime
- 30 secs ≤ MTTR ≤ 200 secs
Region major compaction (not a failure, but…)
- (un)scheduled downtime of a region for compaction
![Page 46: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/46.jpg)
HBase Region Server & Master Replication
![Page 47: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/47.jpg)
![Page 48: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/48.jpg)
![Page 49: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/49.jpg)
![Page 50: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/50.jpg)
![Page 51: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/51.jpg)
NonStopRegionServer:
[Diagram: HBase Client connected to NonStopRegionServer 1 and NonStopRegionServer 2; each contains Client Service (e.g. multi), DConE and HRegionServer]
1. Client calls HRegionServer multi
2. NonStopRegionServer intercepts
3. NonStopRegionServer makes Paxos proposal using DConE library
4. Proposal comes back as agreement on all NonStopRegionServers
5. NonStopRegionServer calls super.multi on all nodes. State changes are recorded
6. NonStopRegionServer 1 alone sends response back to client
HMaster is similar
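The six steps can be sketched as a toy replicated state machine. The `Coordinator` below is a single-process stand-in that simply fixes one global order of operations; it is not Paxos and not the DConE API, whose interfaces the slides do not show. All class and method names are hypothetical.

```python
class Coordinator:
    """Toy stand-in for the coordination engine: fixes one global order."""

    def __init__(self):
        self.replicas = []
        self.next_seq = 0

    def register(self, replica):
        self.replicas.append(replica)

    def propose(self, operation):
        """Assign the next sequence number and deliver to every replica."""
        seq = self.next_seq
        self.next_seq += 1
        # "Agreement": all replicas see the same operation at the same position.
        for replica in self.replicas:
            replica.apply_agreed(seq, operation)
        return seq


class ReplicatedServer:
    """Hypothetical wrapper that intercepts client calls before applying them."""

    def __init__(self, name, coordinator):
        self.name = name
        self.state = {}
        self.applied = []  # sequence numbers in the order they were applied
        self.coordinator = coordinator
        coordinator.register(self)

    def client_put(self, key, value):
        """Steps 1-3 and 6: intercept the call, propose it, reply from here only."""
        seq = self.coordinator.propose(("put", key, value))
        return f"{self.name} applied put #{seq}"

    def apply_agreed(self, seq, operation):
        """Steps 4-5: apply an agreed operation to the local state."""
        _, key, value = operation
        self.state[key] = value
        self.applied.append(seq)
```

Because every replica applies the same operations in the same order, any of them can accept client writes, and only the replica that received the request answers the client.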
![Page 52: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/52.jpg)
HBase RegionServer replication using WANdisco DConE
Shared nothing architecture
HFiles, WALs etc. are not shared
Replica count is tuned
Snapshots of HFiles do not need to be created
Messy details of WAL tailing are not necessary:
- WAL might not be needed at all (!)
Not an eventual consistency model
Does not serve up stale data
![Page 53: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/53.jpg)
![Page 54: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/54.jpg)
DEMO
![Page 55: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/55.jpg)
![Page 56: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/56.jpg)
![Page 57: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/57.jpg)
![Page 58: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/58.jpg)
Q & A
![Page 60: Visual Mapping of Clickstream Data](https://reader038.vdocuments.us/reader038/viewer/2022103016/554a3136b4c90520578b51c1/html5/thumbnails/60.jpg)