Couchdoop: Using Couchbase with Hadoop

Couchdoop: Connecting Hadoop with Couchbase. Călin-Andrei Burloiu, Avira


Posted on 17-Dec-2014





DESCRIPTION

Find the presentation video on YouTube: http://youtu.be/sN1xoWbyO0M

This year at Avira I created Couchdoop, an open-source Hadoop connector for Couchbase, a fast document-based NoSQL database. Couchdoop can be used for large-scale real-time applications which need to process large amounts of data. These applications are typically split into two tiers: a real-time tier, which provides fast responses to many user requests, and a batch-processing tier, which runs long-running algorithms nightly. Couchdoop bridges the two tiers, ensuring fast and reliable communication between Couchbase, a database tuned for large-scale real-time applications, and Hadoop, a framework for processing large amounts of data.

TRANSCRIPT

1. Couchdoop: Connecting Hadoop with Couchbase
   Călin-Andrei Burloiu, Avira

2. Outline
   • Two-tier Big Data applications
   • Real-time application tier: Couchbase
   • Batch application tier: Hadoop
   • Data bridge: Couchdoop
   • Performance

3. About Me
   • Călin-Andrei Burloiu, MSc. Computer Science
   • Big Data Engineer at Avira
   • A year and a half of experience in Big Data: Hadoop, HBase, Couchbase, Hive, Spark
   • Specialties: distributed systems, network protocols
   • Willing to learn more data science
   • Twitter: @CalinBurloiu

4. Big Data Applications
   • Most B2C Big Data applications have two tiers
   • Batch tier: extracts relevant information, feeds the real-time component
   • Real-time tier: low latency, exposed to users

5. Google Search Example
   • Batch tier: crawling the Web, indexing Web pages
   • Real-time tier: searching

6. Two-tier Architecture
   [diagram: real-time data processing <-> data bridge (Couchdoop) <-> batch data processing]

7. Avira Offers
   • Feature of Avira Browser Safety
   • Helps users find better deals for products while shopping online
   • Recommends products to users

8. Avira Offers in Action
   [screenshot: Avira Offers giving a better deal for a guitar from Amazon]

9. Example User
   • Meet Rudy: guitar player, plays in a heavy metal band
   • Interested in sound engineering
   • Likes to travel
   • Likes to buy clothes in his spare time

10. Example Use Case
   • Based on Rudy's online shopping history, Avira Offers recommends guitar accessories, studio equipment, trips and clothes
   • Based on very recent actions, Avira Offers detects user intent and gives relevant recommendations and deals based on that intent

11. Recommendations Based on User Intent
   • Shopping for music equipment -> deals for a new guitar, recommend guitar accessories
   • Searches for hotels in Paris -> may plan a concert -> voucher for a hotel, deal for a cheap flight
   • Searches for snorkeling equipment -> may plan a holiday -> better deal for snorkeling equipment, recommend swimming suits

12. Two-tier Architecture
   • Real-time tier (Couchbase): detects user intent, gives the next best recommendation or deal
   • Data bridge (Couchdoop)
   • Batch tier (Hadoop): recommends products
13. Couchbase: Real-Time Database
   • Document-based key-value store; stores JSON documents
   • Fast: keeps most documents in memory, keeps the index in memory
   • Scalable: runs on a computer cluster
   • Fault tolerant: persistence on disk, replication
   • Has secondary indexes (MapReduce views)
   • An evolution of Memcached, similar to MongoDB

14. Couchbase vs. Competitors
   [comparison chart]

15. Real-time Tier Data
   • Tracking data (user events: product views, clicks and purchases):
       {"user": "Rudy", "action": "view", "product": "Fender Guitar"}
       {"user": "Rudy", "action": "click", "product": "Guitar Case"}
   • Recommendations:
       {"user": "Rudy", "recommendations": [["Ibanez Acoustic Guitar", 450], ["Guitar Tuner", 120], ["Sound Mixer", 30]]}

16. Real-time Algorithms
   1. Detect user intent, based on very recent actions: views, clicks, purchases
   2. Choose the next best recommendation or deal, based on the detected user intent

17. Hadoop: The Batch Tier
   1. Stores the whole user history / actions: HDFS, HBase database
   2. Joins real-time data with data from other sources: data from Facebook or Twitter, payments
   3. Runs a recommendation engine: Apache Mahout, collaborative filtering (machine learning)

18. Hadoop: A Data Hub
   [diagram]

19. Couchdoop: The Bridge Between Tiers
   • Imports documents from Couchbase to Hadoop: feeds the recommendation engine with user events
   • Exports documents from Hadoop to Couchbase: makes computed recommendations available to the real-time tier
   • Updates Couchbase documents using data from both Couchbase and Hadoop: updates user profile documents, updates item scores
   • Both a command-line tool and a Hadoop library

20. Couchdoop vs. Sqoop Connector
   • Couchbase provides a Hadoop connector based on Sqoop
   • Designed to work with SQL, so it is limited: it can only take full dumps and it can't update

21. Importing Data
   [diagram: user event documents, e.g. {"user": "Emma", "action": "buy", "product": "Blue Skirt"}, are imported through Couchdoop into Hadoop HDFS, where machine learning produces recommendations]
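The deck does not spell out how the real-time tier turns recent views, clicks and purchases into a detected intent. As a toy illustration of the idea only (the weights, the `category` field and the function name are invented for this sketch, not taken from Avira's system), one could score a user's recent events by action type and pick the dominant category:

```python
from collections import Counter

# Toy intent detector: weight recent events by action type (a purchase
# signals intent more strongly than a view) and return the dominant
# product category. Weights and the "category" field are illustrative.
WEIGHTS = {"view": 1, "click": 2, "buy": 4}

def detect_intent(events):
    scores = Counter()
    for e in events:
        scores[e["category"]] += WEIGHTS.get(e["action"], 0)
    return scores.most_common(1)[0][0] if scores else None

events = [
    {"user": "Rudy", "action": "view", "category": "guitars"},
    {"user": "Rudy", "action": "click", "category": "guitars"},
    {"user": "Rudy", "action": "view", "category": "travel"},
]
print(detect_intent(events))  # -> guitars
```

In a real deployment this decision would also decay older events and run against the documents Couchbase keeps in memory, which is what makes the real-time tier's low latency possible.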
22. Exporting Data
   [diagram: recommendation documents, e.g. {"user": "Rudy", "recommendations": [["Ibanez Acoustic Guitar", 450], ["Guitar Tuner", 120], ["Sound Mixer", 30]]}, flow from Hadoop HDFS through Couchdoop into Couchbase]

23. Updating Data
   [diagram: the same recommendation document is updated in place through Couchdoop, e.g. a score changed from 450 to 600]

24. Couchbase Views
   • Couchbase is a key-value store: we retrieve documents by keys
   • In our use case we don't know the keys; we want to retrieve lots of documents by a secondary index
   • We can use Couchbase views: JavaScript MapReduce functions

25. Couchbase Views
   • Document with key "Rudy::click":
       {"date": "2014-09-16", "product": "Fender Guitar"}
   • Index documents by date with a map function:
       function (doc, meta) {
         emit(["click", doc.date]);
       }
   • We can retrieve all documents from September 16 with the view key ["click", "2014-09-16"]

26. Importing in Hadoop
   • Pass a set of view keys to Couchdoop:
       hadoop jar target/couchdoop-${VERSION}-job.jar import \
         --couchbase-urls http://couchbase.example.com:8091/pools \
         --couchbase-bucket my_bucket \
         --couchbase-designdoc-name tracking \
         --couchbase-view-name clicks \
         --couchbase-view-keys '["click","2014-09-16"];["click","2014-09-17"]' \
         --output /user/rudy/output
   • View keys are distributed evenly across map tasks
   • Each view key retrieves multiple documents
   • The number of map tasks is configurable

27. Importing in Hadoop
   • CouchbaseViewInputFormat partitions data into InputSplits; each one contains one or more view keys
   • It creates a RecordReader for each InputSplit, which:
       1. Connects to Couchbase
       2. Queries the view for the view keys
       3. Converts Couchbase key-values to Hadoop Mapper key-values
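Writing the semicolon-separated `--couchbase-view-keys` argument by hand gets tedious for long date ranges. A minimal sketch of a helper that generates it (the helper name and date range are my own; only the `["<action>","<date>"];...` format comes from the deck's import command):

```python
import json
from datetime import date, timedelta

def view_keys(action, start, end):
    """Build the semicolon-separated list of JSON view keys in the
    shape Couchdoop's import command shows, one ["<action>", "<date>"]
    key per day in the inclusive range [start, end]."""
    keys = []
    d = start
    while d <= end:
        # Compact separators match the deck's example: no spaces.
        keys.append(json.dumps([action, d.isoformat()],
                               separators=(",", ":")))
        d += timedelta(days=1)
    return ";".join(keys)

print(view_keys("click", date(2014, 9, 16), date(2014, 9, 17)))
# -> ["click","2014-09-16"];["click","2014-09-17"]
```

The resulting string can be passed directly as the quoted value of `--couchbase-view-keys`.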
28. Exporting from Hadoop
   • Command:
       hadoop jar couchdoop-${VERSION}-job.jar export \
         --couchbase-urls http://couchbase.example.com:8091/pools \
         --couchbase-bucket my_bucket \
         --couchbase-password 'secret' \
         --input /user/rudy/documents.csv
   • Mapping: Hadoop output keys -> Couchbase keys; Hadoop output values -> Couchbase operation + Couchbase document
   • Couchbase operations use exponential back-off

29. Exporting from Hadoop
   • CouchbaseOutputFormat's RecordWriter:
       1. Connects to Couchbase
       2. Converts Hadoop output key-values to Couchbase operations
       3. Writes the operations to Couchbase
   • No OutputCommitter is required: we are not writing files

30. Experimental Infrastructure
   • We used Bigstep infrastructure: bare-metal instances
   • 7 Hadoop worker nodes; 2 and 3 Couchbase nodes
   • Node configuration:
       CPU: 2x Intel E5-2690, 2.9 GHz, 16 cores with hyper-threading
       RAM: 192 GiB, DIMM 1600 MHz
       Local HDDs
       Ethernet: 10 Gb/s

31. Experiments
   • What did we test? Importing from Couchbase to Hadoop; exporting to Couchbase from Hadoop
   • What did we vary between experiments?
       Parallelism (number of Hadoop tasks): 1, 2, 4, 8, 16, 32, 64, 128
       Number of Couchbase nodes: 2, 3
   • Each experiment was repeated 10 times

32. Experiments Data
   • Auto-generated fake JSON documents; average document size: 2.2 KiB
   • Importing: 7 GiB imported in each experiment
   • Exporting: the amount of data exported grows linearly with parallelism (parallelism 1: 55 MiB; parallelism 2: 110 MiB)

33. EXPORTING PERFORMANCE

34–39. Exporting With 1, 2, 4, 8, 16 and 32 Mappers
   [charts of operations per second over time; y-axis up to 20,000 ops/s for 1–8 mappers and up to 300,000 ops/s for 16 and 32 mappers]
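Slide 28 notes that Couchdoop's Couchbase operations use exponential back-off. As a generic illustration of that retry pattern (not Couchdoop's actual code; the function names, the exception type and the delay constants are invented for this sketch):

```python
import time

def with_backoff(operation, max_retries=5, base_delay=0.1):
    """Retry `operation` on failure with exponentially growing delays
    between attempts: base_delay, 2*base_delay, 4*base_delay, ...
    Re-raises the last error once max_retries is exhausted."""
    for attempt in range(max_retries):
        try:
            return operation()
        except IOError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Example: a store operation that fails twice before succeeding.
calls = {"n": 0}
def flaky_set():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary failure")
    return "stored"

print(with_backoff(flaky_set, base_delay=0.01))  # -> stored
```

Back-off like this matters during an export because many mappers hammer the Couchbase cluster at once; spacing out retries lets a temporarily overloaded node recover instead of being retried in a tight loop.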
40–41. Exporting With 64 and 128 Mappers
   [charts of operations per second over time; y-axis up to 300,000 ops/s]

42–43. Exporting With 256 and 384 Mappers
   [charts of operations per second over time; y-axis up to 400,000 ops/s]

44. All Exports
   [chart overlaying ops/s over time for parallelism 1, 2, 4, 8, 16, 32, 64, 128, 256 and 384; y-axis up to 400,000 ops/s]

45. Export Performance Dependence on Parallelism
   [chart of mean and max ops/s against parallelism from 0 to 400]

46–50. Exporting With 1, 16, 32, 64 and 128 Mappers: 2 vs. 3 Couchbase Nodes
   [charts comparing ops/s over time for 2-node and 3-node Couchbase clusters; y-axes up to roughly 4,500 / 40,000 / 70,000 / 160,000 / 350,000 ops/s respectively]

51. IMPORTING PERFORMANCE

52–55. Importing With 1, 2, 4 and 8 Mappers
   [charts of operations per second over time; y-axis up to 60,000 ops/s]
56–59. Importing With 16, 32, 64 and 128 Mappers
   [charts of operations per second over time; y-axis up to 200,000 ops/s]

60. All Imports
   [chart overlaying ops/s over time for parallelism 1, 2, 4, 8, 16, 32, 64 and 128; y-axis up to 200,000 ops/s]

61. Import Performance Dependence on Parallelism
   [chart of mean and max ops/s against parallelism from 0 to 120]

62–66. Importing With 2, 4, 8, 16 and 64 Mappers: 2 vs. 3 Couchbase Nodes
   [charts comparing ops/s over time for 2-node and 3-node Couchbase clusters; y-axes up to roughly 20,000 / 35,000 / 70,000 / 120,000 / 200,000 ops/s respectively]

67. Conclusion
   • B2C Big Data applications have two tiers: real-time and batch
   • Couchdoop provides a bridge between them when using Couchbase and Hadoop: high performance, easy API
   • Exploiting Hadoop parallelism improves communication with Couchbase
   • Avira Offers is a two-tiered application which uses Couchdoop

68. Thank You! Any Questions?
   • Check out Couchdoop on GitHub: https://github.com/Avira/couchdoop
   • Follow me on Twitter: @CalinBurloiu
   • Key words: Avira Offers, Avira Browser Safety, Big Data, real-time, batch, data bridge, Hadoop, Couchbase, Couchdoop, Bigstep