Where’s My Data?
Using MetriDoc to manage data integration headaches

Joe Zucca – [email protected]
Tommy Barker – [email protected]
Sponsored by
The Problem
• The request seems simple but the solution is complex
• Generally asked “who did / used x?” which leads to other questions
  • Where’s the data?
  • What’s the grain of the answer?
• So how do we answer these questions?
  • If lucky, run a script / query against a database and generate a report
  • If not lucky, build an application to answer the question
  • This is what MetriDoc is built for
Current Solution - Datafarm
Datafarm = Crontab + Perl + CGI = Spaghetti
[Diagram: Datafarm, with data sources Voyager, Blackboard, COUNTER, DLA logs, Gate Count, Ezproxy, Penn Community and Borrow Direct, and applications App 1, App 2, App 3, ... App n]
Datafarm Shortcomings
• Maintainability issues
• Not shareable
• Not reusable
MetriDoc = Datafarm 2.0
• As our system grew, we began creating MetriDoc to address Datafarm’s problems
  • Needed a scheduler that was more sophisticated than cron
  • Needed languages that were more maintainable than Perl
  • Needed integration tools to simplify data gathering across disparate systems
• We built prototypes and services to help us evaluate technologies
• Received a grant from IMLS to speed up development
• Hired another programmer
MetriDoc Philosophy
• Keep it simple
• Sometimes a script is all you need
• Ease of use is more important than performance
• Don’t reinvent the wheel
• 100% open source
• Shareable data
MetriDoc – How it Works
• MetriDoc’s core is built around database schemas
  • A MetriDoc implementation consists of loading tables and normalized tables
• Loading tables prime the repository
  • The user is responsible for populating these tables
• Normalized tables are built from the data in the loading tables
  • MetriDoc takes care of this
• Conforming to similar schemas provides interesting possibilities
  • Sharing data is easy
  • Sharing a single repository is easy (think Amazon Web Services)
  • Easier to collaborate
• From a user’s perspective
  • MetriDoc has tools to get your data into the loading tables
    • But ultimately you just need to get it in there, so you can use whatever tool you like
  • Use the MetriDoc tools to manage your integration needs
    • Useful for getting, transforming / resolving, moving and loading data
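The split between loading tables and normalized tables can be illustrated with a small sketch. It uses Python and an in-memory SQLite database purely for brevity (MetriDoc itself runs on the JVM), and every table and column name here is a hypothetical stand-in, not MetriDoc's actual schema.

```python
import sqlite3

# Hypothetical schema: these names are illustrative, not MetriDoc's own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ezproxy_loading (patron_id TEXT, patron_ip TEXT, url TEXT);
    CREATE TABLE patron (id INTEGER PRIMARY KEY, patron_id TEXT UNIQUE);
    CREATE TABLE ezproxy_hit (
        patron_key INTEGER REFERENCES patron(id),
        url TEXT
    );
""")

# Step 1: the user primes the repository by filling the loading table.
raw_rows = [
    ("jsmith", "00.000.000.000", "http://www.example.org/a"),
    ("jsmith", "00.000.000.000", "http://www.example.org/b"),
    ("mjones", "00.000.000.001", "http://www.example.org/a"),
]
conn.executemany("INSERT INTO ezproxy_loading VALUES (?, ?, ?)", raw_rows)

# Step 2: the normalized tables are derived from the loading table.
# Each distinct patron is stored once; each hit points at its patron row.
conn.execute(
    "INSERT OR IGNORE INTO patron (patron_id) "
    "SELECT DISTINCT patron_id FROM ezproxy_loading"
)
conn.execute(
    "INSERT INTO ezproxy_hit (patron_key, url) "
    "SELECT p.id, l.url FROM ezproxy_loading l "
    "JOIN patron p ON p.patron_id = l.patron_id"
)

patron_count = conn.execute("SELECT COUNT(*) FROM patron").fetchone()[0]
hit_count = conn.execute("SELECT COUNT(*) FROM ezproxy_hit").fetchone()[0]
hits_per_patron = dict(conn.execute(
    "SELECT p.patron_id, COUNT(*) FROM ezproxy_hit h "
    "JOIN patron p ON p.id = h.patron_key GROUP BY p.patron_id"
))
```

Because the second step reads only from the loading table, it does not matter which tool filled that table in the first place.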
MetriDoc – Core Technologies
• JVM
  • Java is used for infrastructure
  • Groovy is the primary language
• Master Scheduler
  • Essentially the brains of MetriDoc
  • Using Hudson for now (http://hudson-ci.org/)
• Integration Tooling
  • Tooling built on top of Apache Camel (http://camel.apache.org/)
  • Helps move data from one place to another
  • Really helpful for batch processing
• Resolutions / Transformation Tools
  • Patron anonymization, text normalization, resource id to title resolutions, etc.
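The three kinds of resolution / transformation tools listed above might look roughly like the following. This is a Python sketch under stated assumptions, not MetriDoc's Groovy tooling: the function names, the keyed-hash approach to anonymization, and the secret key are all hypothetical. The one ISSN-to-title pair is taken from the sample referrer URL shown later in the slides.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical site secret; a real deployment would keep this out of source.
SECRET_KEY = b"replace-with-a-site-secret"

# Illustrative lookup table; the single entry comes from the sample log record.
TITLES_BY_ISSN = {"0264410X": "Vaccine"}


def anonymize_patron(patron_id: str, key: bytes = SECRET_KEY) -> str:
    """Replace a patron id with a stable keyed hash (one possible approach)."""
    digest = hmac.new(key, patron_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def normalize_text(text: str) -> str:
    """Strip accents, fold case and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    return " ".join(without_marks.casefold().split())


def resolve_title(issn: str):
    """Map a resource id (here an ISSN) to a title, or None if unknown."""
    return TITLES_BY_ISSN.get(issn.replace("-", "").upper())
```

The same patron id always maps to the same token, so usage can still be counted per patron without storing the id itself.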
The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 1 – Fill the loading tables
[Diagram: Hudson jobs "Load Ezproxy", "Load Patron Info" and "Load Counter"; sources Voyager, Ezproxy and COUNTER; Loading Tables]
Loading Tables
00.000.000.000||Philadelphia||PA||United States||Default+datasets+documents+pwp+vanwert||jsmith||[19/Jan/2011:00:01:44 -0500]||GET||https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=SFX&_method=citationSearch&_volkey=0264410X%2329%23266%232&_version=1&md5=8e47306a7f3a7da8a6fe7b521a7a149b||302||0||http://elinks.library.upenn.edu/sfx_local?genre=article&issn=0264410X&title=Vaccine&volume=29&issue=2&date=20101216&atitle=An+adjuvanted+pandemic+influenza+H1N1+vaccine+provides+early+and+long+term+protection+in+health+care+workers.&spage=266&sid=EBSCO:aph&pid=Madhun%2c+Abdullah+S.%3bAkselsen%2c+Per+Espen%3bSjursen%2c+Haakon%3bPedersen%2c+Gabriel%3bSvindland%2c+Signe%3bN%c3%b8stbakken%2c+Jane+Kristin%3bNilsen%2c+Mona%3bMohn%2c+Kristin%3bJul-Larsen%2c+%c3%85sne%3bSmith%2c+Ingrid%3bMajor%2c+Diane%3bWood%2c+John%3bCox%2c+Rebecca+J.5550217620101216aph||Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)]||Re07OuEIyQo8X6w||UPennLibrary=AAAAAUkQ36AAAFTaAwO7Ag==; __utma=10244330.1344196133.1295210953.1295404568.1295411821.9; __utmc=10244330; __utmz=10244330.1295411821.9.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn; WRUID=0; __utmv=10244330.|1=User-Type=Current%20Students=1,; __utma=94565761.447912360.1295320755.1295404584.1295411882.4; __utmc=94565761; __utmz=94565761.1295320755.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn%20blackboard; hp=/vanpelt/; __utma=261680716.1522407254.1295392237.1295404624.1295412044.3; __utmc=261680716; __utmz=261680716.1295412044.3.3.utmcsr=library.upenn.edu|utmccn=(referral)|utmcmd=referral|utmcct=/biomed/; proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w; ARPT=MWPYIPS108CWYL; EHost2=sid=49d81d33-5139-4dbd-b94f-5d76b01ffbdc@sessionmgr13&k2=dGJyMPGtr0iyqbVIrOPfgeyk44Dt6fIA&k3=dGJyMOPY8Xvt&k4=ehost&k6=en&k7=live&k8=DS:live; __utmb=10244330.4.10.1295411821; __utmb=94565761.6.9.1295413021459; __utmb=261680716.1.10.1295412044; ASPSESSIONIDCCAQQCRC=AHJAGJMDDPNIIMLMHBCPCHBL
Patron_id | Patron_ip      | url         | Ref_url        | Proxy_id | Ezproxy_id
jsmith    | 00.000.000.000 | http://www… | http://elinks… | 18175547 | Re07OuEIyQo8X6w
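Getting from the raw "||"-delimited Ezproxy record to the loading-table row above could be sketched as follows. This is Python for illustration only, not MetriDoc's loader. The field positions are an assumption inferred from the single sample record, the function name is hypothetical, and the sample string is a shortened version of that record.

```python
import re

# Field positions inferred from the sample record (an assumption):
# 0 = client IP, 5 = user, 8 = requested URL, 11 = referring URL,
# 13 = Ezproxy session id, 14 = cookie string.
IP, USER, URL, REFERRER, EZPROXY_ID, COOKIES = 0, 5, 8, 11, 13, 14


def parse_ezproxy_record(line: str) -> dict:
    """Turn one '||'-delimited log record into a loading-table row."""
    # Limit the split so any '||' inside the trailing cookie field survives.
    fields = line.rstrip("\n").split("||", COOKIES)
    if len(fields) <= COOKIES:
        raise ValueError(
            f"expected {COOKIES + 1} fields, got {len(fields)}"
        )
    # The proxy id is carried in the proxySessionID cookie.
    match = re.search(r"proxySessionID=(\d+)", fields[COOKIES])
    return {
        "patron_id": fields[USER],
        "patron_ip": fields[IP],
        "url": fields[URL],
        "ref_url": fields[REFERRER],
        "proxy_id": match.group(1) if match else None,
        "ezproxy_id": fields[EZPROXY_ID],
    }


# Shortened version of the sample record, same field order.
sample = "||".join([
    "00.000.000.000", "Philadelphia", "PA", "United States",
    "Default+datasets+documents+pwp+vanwert", "jsmith",
    "[19/Jan/2011:00:01:44 -0500]", "GET",
    "https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science",
    "302", "0",
    "http://elinks.library.upenn.edu/sfx_local?genre=article&issn=0264410X",
    "Mozilla/5.0", "Re07OuEIyQo8X6w",
    "hp=/vanpelt/; proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w",
])
record = parse_ezproxy_record(sample)
```

Raising on short records, rather than guessing at missing fields, keeps malformed lines out of the loading table.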
The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 2 – Populate the normalized tables
[Diagram: Hudson jobs "Normalize Ezproxy", "Normalize Patron Info" and "Normalize Counter"; Loading Tables; Repository]
Jenkins – Death to Cron
• Generally used for building software, but a fantastic cron replacement
• Can run arbitrary scripts locally and remotely
• Supports master / slave distribution model seamlessly
• Can be managed entirely via REST
• Extensible
• Helps with job dependencies
• It is simple and free
• Active community with a huge collection of plugins
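Job dependencies are the main thing plain cron cannot express: a normalize job should run only after the load jobs it reads from. The sketch below shows that ordering idea in Python using the standard library's graphlib (Python 3.9+). It does not use Jenkins or its API. The job names echo the load / normalize steps from the earlier slides, and the specific dependency edges are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each job lists the jobs it must wait for.
# Which normalize job needs which load job is assumed for illustration.
dependencies = {
    "normalize_ezproxy": {"load_ezproxy", "load_patron_info"},
    "normalize_patron_info": {"load_patron_info"},
    "normalize_counter": {"load_counter"},
}


def run_order(deps: dict) -> list:
    """Return one valid execution order in which prerequisites run first."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
```

With cron, this ordering has to be approximated by staggering start times; a scheduler that knows the dependency graph can start each job as soon as its prerequisites finish.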
A Little Groovy
The Metridoc Job Framework
Metrics on the Cheap
Where we are…