Where’s My Data?
Using MetriDoc to manage data integration headaches

Joe Zucca – zucca@pobox.upenn.edu
Tommy Barker


The Problem

• The request seems simple but the solution is complex
• Generally asked “who did / used x?” which leads to other questions
   • Where’s the data?
   • What’s the grain of the answer?
• So how do we answer these questions?
   • If lucky, run script / query against a database and generate report
   • If not lucky, build an application to answer the question
   • This is what MetriDoc is built for

Current Solution - Datafarm

Datafarm = Crontab + Perl + CGI = Spaghetti

[Diagram: Datafarm shown with its data sources (Voyager, Blackboard, COUNTER, DLA logs, Gate Count, Ezproxy, Penn Community, Borrow Direct) and the applications built on it (App 1, App 2, App 3, … App n)]

Datafarm Shortcomings

• Maintainability issues
• Not shareable
• Not reusable

MetriDoc = Datafarm 2.0

• As our system grew, we began creating MetriDoc to address Datafarm’s problems
   • Needed a scheduler that was more sophisticated than cron
   • Needed languages that were more maintainable than Perl
   • Needed integration tools to simplify data gathering across disparate systems
• We built prototypes and services to help us evaluate technologies
• Received a grant from IMLS to speed up development
• Hired another programmer

MetriDoc Philosophy

• Keep it simple
• Sometimes a script is all you need
• Ease of use is more important than performance
• Don’t reinvent the wheel
• 100% open source
• Shareable data

MetriDoc – How it Works

• MetriDoc’s core is built around database schemas
• A MetriDoc implementation consists of loading tables and normalized tables
   • Loading tables prime the repository
      • The user is responsible for populating these tables
   • Normalized tables are built from the data in the loading tables
      • MetriDoc takes care of this
• Conforming to similar schemas provides interesting possibilities
   • Sharing data is easy
   • Sharing a single repository is easy (think Amazon Web Services)
   • Easier to collaborate
• From a user’s perspective
   • MetriDoc has tools to get your stuff in the loading tables
      • But ultimately you just need to get it in there, so you can use whatever tools you like
   • Use the MetriDoc tools to manage your integration needs
      • Useful for getting, transforming / resolving, moving and loading data
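The split between loading tables and normalized tables can be illustrated with a small sketch. This is not MetriDoc's schema or code: the table names (ezproxy_loading, resource, ezproxy_usage), the column choices, and the use of Python with SQLite are all assumptions made for illustration. The user fills the loading table however they like, and a separate step builds the normalized tables from it.

```python
import sqlite3
from urllib.parse import urlsplit


def create_schema(conn):
    """Create one loading table and two normalized tables (hypothetical names)."""
    conn.executescript(
        """
        CREATE TABLE ezproxy_loading (patron_id TEXT, patron_ip TEXT, url TEXT);
        CREATE TABLE resource (
            resource_id INTEGER PRIMARY KEY,
            host TEXT UNIQUE
        );
        CREATE TABLE ezproxy_usage (
            patron_id TEXT,
            resource_id INTEGER REFERENCES resource (resource_id)
        );
        """
    )


def fill_loading_table(conn, rows):
    """The user's job: get raw rows into the loading table by any means."""
    conn.executemany("INSERT INTO ezproxy_loading VALUES (?, ?, ?)", rows)


def normalize(conn):
    """The repository's job: derive normalized rows from the loading table."""
    loaded = conn.execute("SELECT patron_id, url FROM ezproxy_loading").fetchall()
    for patron_id, url in loaded:
        host = urlsplit(url).hostname
        # Each distinct host becomes one row in the resource table.
        conn.execute("INSERT OR IGNORE INTO resource (host) VALUES (?)", (host,))
        (resource_id,) = conn.execute(
            "SELECT resource_id FROM resource WHERE host = ?", (host,)
        ).fetchone()
        conn.execute(
            "INSERT INTO ezproxy_usage VALUES (?, ?)", (patron_id, resource_id)
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
fill_loading_table(
    conn,
    [
        ("jsmith", "00.000.000.000", "http://www.sciencedirect.com/science"),
        ("jsmith", "00.000.000.000", "http://www.sciencedirect.com/other"),
        ("jdoe", "00.000.000.001", "http://elinks.library.upenn.edu/sfx_local"),
    ],
)
normalize(conn)
```

Because the loading step is just an INSERT, any script or tool can feed the repository, which matches the "you can use whatever" point above.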

MetriDoc – Core Technologies

• JVM
   • Java is used for infrastructure
   • Groovy is the primary language
• Master Scheduler
   • Essentially the brains of MetriDoc
   • Using Hudson for now (http://hudson-ci.org/)
• Integration Tooling
   • Tooling built on top of Apache Camel (http://camel.apache.org/)
   • Helps move data from one place to another
   • Really helpful for batch processing
• Resolutions / Transformation Tools
   • Patron anonymization, text normalization, resource id to title resolutions, etc.
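The three kinds of resolution / transformation named above could look roughly like the following. This is a sketch in Python, not MetriDoc's Groovy tooling: the function names, the keyed-hash approach to anonymization, and the lookup table are assumptions. The one ISSN-to-title pair is taken from the sample referrer URL in this deck (issn=0264410X, title=Vaccine).

```python
import hashlib
import hmac
import unicodedata

# Hypothetical lookup table; the single entry comes from the deck's sample log line.
ISSN_TITLES = {"0264410X": "Vaccine"}


def anonymize_patron(patron_id, secret_key):
    """Replace a patron id with a keyed hash (an assumed approach, not MetriDoc's).

    The same id and key always give the same token, so usage can still be
    grouped per patron without storing the real id.
    """
    digest = hmac.new(secret_key, patron_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def normalize_text(text):
    """Strip combining accents, fold case, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.casefold().split())


def resolve_title(issn, titles=ISSN_TITLES):
    """Resolve a resource id (here an ISSN) to a title, or None if unknown."""
    return titles.get(issn.replace("-", "").upper())


token = anonymize_patron("jsmith", b"replace-with-a-real-secret")
title = resolve_title("0264-410x")
cleaned = normalize_text("  Café   Review ")
```

A keyed hash is used instead of a plain hash so that tokens cannot be reproduced by anyone who lacks the key; whether that fits a given privacy policy is a local decision.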

The MetriDoc Solution

MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana

Step 1 – Fill the loading tables

[Diagram: Hudson jobs (Load Ezproxy, Load Patron Info, Load Counter) take data from Voyager, Ezproxy and COUNTER and fill the loading tables]

Loading Tables

00.000.000.000||Philadelphia||PA||United States||Default+datasets+documents+pwp+vanwert||jsmith||[19/Jan/2011:00:01:44 -0500]||GET||https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=SFX&_method=citationSearch&_volkey=0264410X%2329%23266%232&_version=1&md5=8e47306a7f3a7da8a6fe7b521a7a149b||302||0||http://elinks.library.upenn.edu/sfx_local?genre=article&issn=0264410X&title=Vaccine&volume=29&issue=2&date=20101216&atitle=An+adjuvanted+pandemic+influenza+H1N1+vaccine+provides+early+and+long+term+protection+in+health+care+workers.&spage=266&sid=EBSCO:aph&pid=Madhun%2c+Abdullah+S.%3bAkselsen%2c+Per+Espen%3bSjursen%2c+Haakon%3bPedersen%2c+Gabriel%3bSvindland%2c+Signe%3bN%c3%b8stbakken%2c+Jane+Kristin%3bNilsen%2c+Mona%3bMohn%2c+Kristin%3bJul-Larsen%2c+%c3%85sne%3bSmith%2c+Ingrid%3bMajor%2c+Diane%3bWood%2c+John%3bCox%2c+Rebecca+J.5550217620101216aph||Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)]||Re07OuEIyQo8X6w||UPennLibrary=AAAAAUkQ36AAAFTaAwO7Ag==; __utma=10244330.1344196133.1295210953.1295404568.1295411821.9; __utmc=10244330; __utmz=10244330.1295411821.9.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn; WRUID=0; __utmv=10244330.|1=User-Type=Current%20Students=1,; __utma=94565761.447912360.1295320755.1295404584.1295411882.4; __utmc=94565761; __utmz=94565761.1295320755.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn%20blackboard; hp=/vanpelt/; __utma=261680716.1522407254.1295392237.1295404624.1295412044.3; __utmc=261680716; __utmz=261680716.1295412044.3.3.utmcsr=library.upenn.edu|utmccn=(referral)|utmcmd=referral|utmcct=/biomed/; proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w; ARPT=MWPYIPS108CWYL; EHost2=sid=49d81d33-5139-4dbd-b94f-5d76b01ffbdc@sessionmgr13&k2=dGJyMPGtr0iyqbVIrOPfgeyk44Dt6fIA&k3=dGJyMOPY8Xvt&k4=ehost&k6=en&k7=live&k8=DS:live; __utmb=10244330.4.10.1295411821; __utmb=94565761.6.9.1295413021459; __utmb=261680716.1.10.1295412044; ASPSESSIONIDCCAQQCRC=AHJAGJMDDPNIIMLMHBCPCHBL

Patron_id  Patron_ip       url          Ref_url         Proxy_id  Ezproxy_id
jsmith     00.000.000.000  http://www…  http://elinks…  18175547  Re07OuEIyQo8X6w
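Going from the raw log line to the loading-table row above might be sketched as follows. This is an illustration in Python, not MetriDoc's loader. The field positions are inferred from the single sample line on this slide, not from an EZproxy log specification. Reading url as the target after "url=" and Proxy_id as the proxySessionID cookie are also assumptions, chosen because they reproduce the row shown. The sample in the code is a shortened, hypothetical version of the slide's line.

```python
import re

FIELD_SEPARATOR = "||"

# Positions inferred from the sample line on the slide (assumption).
IDX_IP = 0
IDX_USER = 5
IDX_REQUEST_URL = 8
IDX_REFERRER = 11
IDX_EZPROXY_ID = 13
IDX_COOKIES = 14
EXPECTED_FIELDS = 15


def parse_ezproxy_line(line):
    """Turn one '||'-delimited log line into a loading-table row (a dict)."""
    fields = line.rstrip("\n").split(FIELD_SEPARATOR)
    if len(fields) < EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")

    # Keep the target after 'url=' when present, else the full request URL.
    request_url = fields[IDX_REQUEST_URL]
    _, found, target = request_url.partition("url=")

    # The proxy session id is carried in the cookie string.
    session = re.search(r"proxySessionID=(\d+)", fields[IDX_COOKIES])

    return {
        "patron_id": fields[IDX_USER],
        "patron_ip": fields[IDX_IP],
        "url": target if found else request_url,
        "ref_url": fields[IDX_REFERRER],
        "proxy_id": session.group(1) if session else None,
        "ezproxy_id": fields[IDX_EZPROXY_ID],
    }


# Shortened, hypothetical version of the slide's sample line.
sample_line = FIELD_SEPARATOR.join(
    [
        "00.000.000.000", "Philadelphia", "PA", "United States",
        "Default+datasets", "jsmith", "[19/Jan/2011:00:01:44 -0500]", "GET",
        "https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science",
        "302", "0",
        "http://elinks.library.upenn.edu/sfx_local?genre=article",
        "Mozilla/5.0", "Re07OuEIyQo8X6w",
        "hp=/vanpelt/; proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w",
    ]
)
row = parse_ezproxy_line(sample_line)
```

Splitting on the two-character separator matters here: the cookie field contains single "|" characters, which a split on "|" would break apart.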

The MetriDoc Solution

MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana

Step 2 – Populate the normalized tables

[Diagram: Hudson jobs (Normalize Ezproxy, Normalize Patron Info, Normalize Counter) read the loading tables and populate the repository]

Jenkins – Death to Cron

• Generally used for building software, but a fantastic cron replacement
• Can run arbitrary scripts locally and remotely
• Supports master / slave distribution model seamlessly
• Can be managed entirely via REST
• Extensible
• Helps with job dependencies
• It is simple and free
• Active community with a huge collection of plugins
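What "helps with job dependencies" buys over plain cron can be shown with a small ordering sketch. This is not Hudson or Jenkins code and does not use their API: it is plain Python (3.9 or later, for graphlib), and the job names and dependency map are hypothetical, modeled on the load and normalize steps in this deck. Each normalize job runs only after the load job it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
JOB_DEPENDENCIES = {
    "load_ezproxy": set(),
    "load_patron_info": set(),
    "load_counter": set(),
    "normalize_ezproxy": {"load_ezproxy"},
    "normalize_patron_info": {"load_patron_info"},
    "normalize_counter": {"load_counter"},
}


def run_order(dependencies):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def run_all(dependencies, actions):
    """Run each job's action in dependency order; return the completed jobs."""
    completed = []
    for job in run_order(dependencies):
        actions[job]()
        completed.append(job)
    return completed


# Stand-in actions that only record that they ran.
log = []
actions = {job: (lambda name=job: log.append(name)) for job in JOB_DEPENDENCIES}
completed = run_all(JOB_DEPENDENCIES, actions)
```

With cron, the same guarantee has to be approximated by spacing start times apart; a scheduler that knows the dependency graph can start a job as soon as its prerequisites finish.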

A Little Groovy

The MetriDoc Job Framework

Metrics on the Cheap

Where we are…