HBase at Mendeley

DESCRIPTION
The details behind how and why we use HBase in the data mining team at Mendeley.

TRANSCRIPT
Overview
➔ What is Mendeley
➔ Why we chose HBase
➔ How we're using HBase
➔ Challenges
Mendeley helps researchers work smarter
➔ Install Mendeley Desktop
➔ Mendeley extracts research data..
➔ ..and aggregates research data in the cloud
Mendeley in numbers
➔ 600,000+ users
➔ 50+ million user documents
➔ Since January 2009
➔ 30 million unique documents
➔ De-duplicated from user and other imports
➔ 5TB of papers
Data Mining Team
➔ Catalogue
➔ Importing
➔ Web Crawling
➔ De-duplication
➔ Statistics
➔ Related and recommended research
➔ Search
Starting off
➔ User data in MySQL
➔ Normalised document tables
➔ Quite a few joins..
➔ Stuck with MySQL for data mining
➔ Clustering and de-duplication
➔ Got us to launch the article pages
But..
➔ Re-process everything often
➔ Algorithms with global counts
➔ Modifying algorithms affects everything
➔ Iterating over tables was slow
➔ Could not easily scale processing
➔ Needed to shard for more documents
➔ Daily stats took > 24h to process...
What we needed
➔ Scale to 100s of millions of documents
➔ ~80 million papers
➔ ~120 million books
➔ ~2-3 billion references
➔ More projects using data and processing
➔ Update the data more often
➔ Rapidly prototype and develop
➔ Cost effective
So much choice..
But most of them lack good, scalable processing.
And many more...
HBase and Hadoop
➔ Scalable storage
➔ Scalable processing
➔ Designed to work with MapReduce
➔ Fast scans
➔ Incremental updates
➔ Flexible schema
Where HBase fits in
How we store data
➔ Mostly documents
➔ Column families for different data
➔ Metadata / raw PDF files
➔ More efficient scans
➔ Protocol Buffers for metadata
➔ Easy to manage 100+ fields
➔ Faster serialisation
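The transcript doesn't show the message definition itself; as a rough sketch, a Protocol Buffers schema along these lines is what makes 100+ fields manageable, since fields are tagged by number and new optional fields can be appended without breaking existing readers. All field names below are hypothetical, not Mendeley's actual schema:

```proto
// Hypothetical sketch -- field names are illustrative only.
message Document {
  optional string title    = 1;
  optional int32  year     = 2;
  repeated string authors  = 3;
  repeated string keywords = 4;
  optional string doi      = 5;
  // New optional fields get fresh tag numbers and are silently
  // skipped by readers compiled against the older schema.
}
```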
Example Schema
Row key: sha1_hash

Column family   Qualifiers
metadata        document, date_added, date_modified, source
content         pdf, full_text, entity_extraction
canonical_id    version_live
● All data for documents in one table
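As a toy illustration of why one table works well here: HBase keeps cells sorted by (row key, column family:qualifier), so all of a document's data sits contiguously and a single row scan reads it all. This is plain Java standing in for the storage layout, not the HBase client API; the row key and qualifiers follow the example schema above, while the separator and cell values are made up.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of the "one table" layout: a sorted map keyed by
// "row/family:qualifier". NOT the HBase client API.
public class OneTableSketch {

    // Return every cell of one row: a contiguous key range in the sorted map.
    static SortedMap<String, String> scanRow(SortedMap<String, String> table, String row) {
        return table.subMap(row + "/", row + "/\uffff");
    }

    public static void main(String[] args) {
        SortedMap<String, String> table = new TreeMap<>();
        String row = "3f786850e387...";  // sha1_hash row key (truncated example)

        table.put(row + "/metadata:document",   "<serialised protobuf>");
        table.put(row + "/metadata:date_added", "2009-01-15");
        table.put(row + "/metadata:source",     "user_import");
        table.put(row + "/content:pdf",         "<raw pdf bytes>");
        table.put(row + "/content:full_text",   "<extracted text>");

        // "Scanning" the row touches only its own contiguous cells.
        scanRow(table, row).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```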
How we process data
➔ Java MapReduce
➔ More control over data flows
➔ Allows us to do more complex work
➔ Pig
➔ Don't have to think in map reduce
➔ Twitter's Elephant Bird decodes protocol buffers
➔ Enables rapid prototyping
➔ Less efficient than Java MapReduce
➔ Quick example...
Example
➔ Trending keywords over time
➔ For a given keyword, how many documents per year?
➔ Multiple map/reduce tasks
➔ 100s of lines of Java...
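The core of those map/reduce tasks can be sketched in-memory: the "map" step emits a (keyword, year) pair per keyword of each document, and the "reduce" step counts occurrences per pair. Plain Java collections stand in for Hadoop here, and the sample documents are made up:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the trending-keywords job (not actual Hadoop code).
public class TrendingKeywords {
    record Doc(int year, List<String> keywords) {}

    static Map<String, Integer> countByKeywordYear(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc d : docs) {                            // "map": emit (keyword, year)
            for (String kw : d.keywords()) {
                String key = kw + "\t" + d.year();      // shuffle key
                counts.merge(key, 1, Integer::sum);     // "reduce": count per key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc(2009, List.of("hbase", "nosql")),
            new Doc(2010, List.of("hbase")),
            new Doc(2010, List.of("hbase", "pig")));
        // "hbase" appears in 1 document in 2009 and 2 documents in 2010.
        System.out.println(countByKeywordYear(docs));
    }
}
```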
Pig Example

-- Load the document bag
rawDocs = LOAD 'hbase://canonical_documents'
          USING HbaseLoader('metadata:document')
          AS (protodoc);

-- De-serialise protocol buffer
docs = FOREACH rawDocs GENERATE
       DocumentProtobufBytesToTuple(protodoc) AS doc;

-- Get (keyword, year) tuples
tagYear = FOREACH docs GENERATE
          FLATTEN(doc.keywords_bag) AS keyword,
          doc.year AS year;

-- Group unique (keyword, year) tuples
yearTag = GROUP tagYear BY (keyword, year);

-- Create (keyword, year, count) tuples
yearTagCount = FOREACH yearTag GENERATE
               FLATTEN(group) AS (keyword, year),
               COUNT(tagYear) AS count;

-- Group the counts by keyword
tagYearCounts = GROUP yearTagCount BY keyword;

-- Collect the (year, count) pairs for each keyword
tagYearCounts = FOREACH tagYearCounts GENERATE
                group AS keyword,
                yearTagCount.(year, count) AS years;

STORE tagYearCounts INTO 'tag_year_counts';
Challenges
➔ MySQL hard to export from
➔ Many joins slow things down
➔ Don't normalise if you don't have to!
➔ HBase needs memory
➔ Stability issues if you give it too little
Challenges: Hardware
➔ Knowing where to start is hard...
➔ 2x quad-core Intel CPUs
➔ 4x 1TB disks
➔ Memory
➔ Started with 8GB, then 16GB
➔ Upgrading to 24GB soon
➔ Currently 15 nodes
www.mendeley.com