HBase at Mendeley

DESCRIPTION
The details behind how and why we use HBase in the data mining team at Mendeley.

TRANSCRIPT
Overview
➔ What is Mendeley
➔ Why we chose HBase
➔ How we're using HBase
➔ Challenges
Mendeley helps researchers work smarter
➔ Install Mendeley Desktop
➔ Mendeley extracts research data..
➔ ..and aggregates research data in the cloud
Mendeley in numbers
➔ 600,000+ users
➔ 50+ million user documents
➔ Since January 2009
➔ 30 million unique documents
➔ De-duplicated from user and other imports
➔ 5TB of papers
Data Mining Team
➔ Catalogue
➔ Importing
➔ Web Crawling
➔ De-duplication
➔ Statistics
➔ Related and recommended research
➔ Search
Starting off
➔ User data in MySQL
➔ Normalised document tables
➔ Quite a few joins..
➔ Stuck with MySQL for data mining
➔ Clustering and de-duplication
➔ Got us to launch the article pages
But..
➔ Re-process everything often
➔ Algorithms with global counts
➔ Modifying algorithms affects everything
➔ Iterating over tables was slow
➔ Could not easily scale processing
➔ Needed to shard for more documents
➔ Daily stats took > 24h to process...
What we needed
➔ Scale to 100s of millions of documents
➔ ~80 million papers
➔ ~120 million books
➔ ~2-3 billion references
➔ More projects using data and processing
➔ Update the data more often
➔ Rapidly prototype and develop
➔ Cost effective
So much choice..
But most of them lack good, scalable processing.
And many more...
HBase and Hadoop
➔ Scalable storage
➔ Scalable processing
➔ Designed to work with MapReduce
➔ Fast scans
➔ Incremental updates
➔ Flexible schema
Where HBase fits in
How we store data
➔ Mostly documents
➔ Column families for different data
➔ Metadata / raw PDF files
➔ More efficient scans
➔ Protocol Buffers for metadata
➔ Easy to manage 100+ fields
➔ Faster serialisation
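The transcript doesn't show the message definition itself; as a rough sketch, a Protocol Buffers schema along these lines is what makes 100+ fields manageable, since fields are tagged by number and new optional fields can be appended without breaking existing readers. All field names below are hypothetical, not Mendeley's actual schema:

```proto
// Hypothetical sketch -- field names are illustrative only.
message Document {
  optional string title    = 1;
  optional int32  year     = 2;
  repeated string authors  = 3;
  repeated string keywords = 4;
  optional string doi      = 5;
  // New optional fields get fresh tag numbers and are silently
  // skipped by readers compiled against the older schema.
}
```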
Example Schema
Row key: sha1_hash

Column family   Qualifiers
metadata        document, date_added, date_modified, source
content         pdf, full_text, entity_extraction
canonical_id    version_live
● All data for documents in one table
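As a toy illustration of why one table works well here: HBase keeps cells sorted by (row key, column family:qualifier), so all of a document's data sits contiguously and a single row scan reads it all. This is plain Java standing in for the storage layout, not the HBase client API; the row key and qualifiers follow the example schema above, while the separator and cell values are made up.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of the "one table" layout: a sorted map keyed by
// "row/family:qualifier". NOT the HBase client API.
public class OneTableSketch {

    // Return every cell of one row: a contiguous key range in the sorted map.
    static SortedMap<String, String> scanRow(SortedMap<String, String> table, String row) {
        return table.subMap(row + "/", row + "/\uffff");
    }

    public static void main(String[] args) {
        SortedMap<String, String> table = new TreeMap<>();
        String row = "3f786850e387...";  // sha1_hash row key (truncated example)

        table.put(row + "/metadata:document",   "<serialised protobuf>");
        table.put(row + "/metadata:date_added", "2009-01-15");
        table.put(row + "/metadata:source",     "user_import");
        table.put(row + "/content:pdf",         "<raw pdf bytes>");
        table.put(row + "/content:full_text",   "<extracted text>");

        // "Scanning" the row touches only its own contiguous cells.
        scanRow(table, row).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```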
How we process data
➔ Java MapReduce
➔ More control over data flows
➔ Allows us to do more complex work
➔ Pig
➔ Don't have to think in map reduce
➔ Twitter's Elephant Bird decodes protocol buffers
➔ Enables rapid prototyping
➔ Less efficient than Java MapReduce
➔ Quick example...
Example
➔ Trending keywords over time
➔ For a given keyword, how many documents per year?
➔ Multiple map/reduce tasks
➔ 100s of lines of Java...
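The core of those map/reduce tasks can be sketched in-memory: the "map" step emits a (keyword, year) pair per keyword of each document, and the "reduce" step counts occurrences per pair. Plain Java collections stand in for Hadoop here, and the sample documents are made up:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the trending-keywords job (not actual Hadoop code).
public class TrendingKeywords {
    record Doc(int year, List<String> keywords) {}

    static Map<String, Integer> countByKeywordYear(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc d : docs) {                            // "map": emit (keyword, year)
            for (String kw : d.keywords()) {
                String key = kw + "\t" + d.year();      // shuffle key
                counts.merge(key, 1, Integer::sum);     // "reduce": count per key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc(2009, List.of("hbase", "nosql")),
            new Doc(2010, List.of("hbase")),
            new Doc(2010, List.of("hbase", "pig")));
        // "hbase" appears in 1 document in 2009 and 2 documents in 2010.
        System.out.println(countByKeywordYear(docs));
    }
}
```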
Pig Example

-- Load the document bag
rawDocs = LOAD 'hbase://canonical_documents'
          USING HbaseLoader('metadata:document')
          AS (protodoc);

-- De-serialise protocol buffer
docs = FOREACH rawDocs GENERATE
       DocumentProtobufBytesToTuple(protodoc) AS doc;

-- Get (keyword, year) tuples
tagYear = FOREACH docs GENERATE
          FLATTEN(doc.keywords_bag) AS keyword,
          doc.year AS year;

-- Group unique (keyword, year) tuples
yearTag = GROUP tagYear BY (keyword, year);

-- Create (keyword, year, count) tuples
yearTagCount = FOREACH yearTag GENERATE
               FLATTEN(group) AS (keyword, year),
               COUNT(tagYear) AS count;

-- Group the counts by keyword
tagYearCounts = GROUP yearTagCount BY keyword;

-- Collect the (year, count) pairs for each keyword
tagYearCounts = FOREACH tagYearCounts GENERATE
                group AS keyword,
                yearTagCount.(year, count) AS years;

STORE tagYearCounts INTO 'tag_year_counts';
Challenges
➔ MySQL hard to export from
➔ Many joins slow things down
➔ Don't normalise if you don't have to!
➔ HBase needs memory
➔ Stability issues if you give it too little
Challenges: Hardware
➔ Knowing where to start is hard...
➔ 2x quad-core Intel CPUs
➔ 4x 1TB disks
➔ Memory
➔ Started with 8GB, then 16GB
➔ Upgrading to 24GB soon
➔ Currently 15 nodes
www.mendeley.com