Building a distributed data-platform - A perspective on current trends in computing
Data, dev-ops, and cloud services: Building a distributed data-platform
A lecture given to Computer Science students at the University of Warwick, February 2012.
![Page 1: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/1.jpg)
Data, dev-ops, and cloud services
Building a distributed data-platform
Charles Care
Engineering Team
Kasabi / Talis
![Page 2: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/2.jpg)
Talk overview
● About me...
● What Kasabi is
● what we are trying to do
● how we are working to achieve that
● a quick walk-through
● Discussion of the Kasabi platform team
● Our technology / architecture
● Our engineering culture
● Lessons learnt
![Page 3: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/3.jpg)
Views are mine...
…and not necessarily those of my (current/past) employers
![Page 4: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/4.jpg)
About me...
![Page 5: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/5.jpg)
About me...
● 2001-2004 – BSc Computer Science (Warwick)
● 2004-2008 – PhD Computer Science (Warwick)
● 2007-2011 – BT Plc
● Technical risk analyst – BT Global MPLS Network
● Software Engineer – Infrastructure for Financial Markets
● Senior Software Engineer – Central software standards and tools
● 2011-Present – Talis/Kasabi
● Software Engineer – Semantic web platform
![Page 6: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/6.jpg)
About Kasabi
![Page 7: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/7.jpg)
About Kasabi
● Data market place
● Bringing together data...
● owners
● consumers
● Lowering the barrier for data-driven apps to enter the market
● Enabling new opportunities for aggregating and mixing data
![Page 8: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/8.jpg)
Data licensing today
Data Owners Data Consumers
Bespoke, expensive contracts
![Page 9: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/9.jpg)
Kasabi as a data platform
Data Owners
Third-party services
Application Developers
Data enthusiasts
Data engineers
API developers
![Page 10: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/10.jpg)
About Kasabi
● Publish datasets using standard APIs
● Access data using standard APIs
● Query a dataset using SPARQL
● Search a dataset using a simple full-text search
● Define, contribute, and share your own APIs
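As a rough illustration of querying a dataset with SPARQL over HTTP, the sketch below builds a GET request URL that carries the query in the `query` parameter defined by the standard SPARQL protocol. The endpoint URL, the `apikey` parameter name, and the helper `build_sparql_url` are hypothetical placeholders for illustration, not Kasabi's documented API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the SPARQL endpoint of a real dataset.
ENDPOINT = "http://api.example.org/dataset/my-dataset/apis/sparql"

# A generic query that lists ten arbitrary triples from the dataset.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_url(endpoint: str, query: str, api_key: str) -> str:
    """Return a GET URL carrying the query in the standard `query` parameter."""
    # `apikey` is an assumed parameter name, used here only for illustration.
    params = {"query": query.strip(), "apikey": api_key}
    return endpoint + "?" + urlencode(params)


url = build_sparql_url(ENDPOINT, QUERY, "YOUR-KEY")
```

The resulting URL could be fetched with any HTTP client; the response format would depend on the endpoint and the `Accept` header sent.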
![Page 12: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/12.jpg)
A dataset
![Page 13: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/13.jpg)
Access data using standard APIs
![Page 14: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/14.jpg)
Contribute custom APIs
![Page 15: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/15.jpg)
Example – contributed APIs
![Page 16: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/16.jpg)
Current organisation
● Product development
● Data engineering
● Customer operations
● Platform development
![Page 17: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/17.jpg)
![Page 18: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/18.jpg)
Platform architecture
![Page 19: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/19.jpg)
Data Platform
[Diagram layers: load balancing and routing; update, search, and query services; datasets]
● Need to store and update datasets
● Access data via various services
● Must scale with load and increasing data
● Must be tolerant to failure
● Extensible
● Should be easy to add new services over time
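One way to picture the routing requirement is a small registry that spreads requests across the instances of each service and lets new services register over time. This is a minimal sketch under assumptions: the `Router` class, the service names, and the addresses are all invented for illustration and are not taken from the Kasabi platform.

```python
from collections import defaultdict


class Router:
    """Round-robin routing over registered service instances (illustrative)."""

    def __init__(self) -> None:
        self._instances = defaultdict(list)  # service name -> list of addresses
        self._next = defaultdict(int)        # service name -> rotation counter

    def register(self, service: str, address: str) -> None:
        # New services, or extra instances of existing ones, can be added at any time.
        self._instances[service].append(address)

    def route(self, service: str) -> str:
        instances = self._instances.get(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        index = self._next[service] % len(instances)
        self._next[service] += 1
        return instances[index]


router = Router()
router.register("sparql", "10.0.0.1:8080")
router.register("sparql", "10.0.0.2:8080")
router.register("search", "10.0.0.3:8983")
```

Scaling a service then amounts to registering another address; the callers of `route` do not change.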
![Page 20: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/20.jpg)
To distribute...
...or not to distribute
![Page 21: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/21.jpg)
Distributed Platform
[Diagram components: dynamic gossip network; routing layer; multiple update, search, and SPARQL service instances; a possible new service; sequence service; storage service; monitoring services]
![Page 22: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/22.jpg)
Distributed Platform – updates
[Diagram components as on the previous slide]
- Updates are sequenced
- Data stored in distributed storage
![Page 23: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/23.jpg)
Distributed Platform – updates
[Diagram components as on the previous slide]
- Updates are gossiped around network
- Here a SPARQL node realises that it should apply the update
![Page 24: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/24.jpg)
Distributed Platform – query
[Diagram components as on the previous slide]
SPARQL queries will now reflect the update that was submitted
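The update flow across these slides (sequence the update, gossip it around the network, let interested nodes apply it) can be sketched as a toy in-process simulation. Everything here is an assumption made for illustration: the class names, the single global sequence counter, and the simplification that each node forwards to all of its peers rather than to a random subset, as a real gossip protocol would.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Update:
    seq: int       # position assigned by the sequence service
    dataset: str   # dataset the update belongs to
    payload: str   # stand-in for the actual change


class SequenceService:
    """Stamps every update with the next global sequence number."""

    def __init__(self) -> None:
        self._counter = count(1)

    def stamp(self, dataset: str, payload: str) -> Update:
        return Update(next(self._counter), dataset, payload)


@dataclass
class Node:
    name: str
    datasets: frozenset                           # datasets this node serves
    peers: list = field(default_factory=list)
    seen: set = field(default_factory=set)        # sequence numbers already heard
    pending: dict = field(default_factory=dict)   # updates waiting for their turn
    applied: list = field(default_factory=list)   # payloads applied, in order
    next_expected: int = 1

    def receive(self, update: Update) -> None:
        if update.seq in self.seen:
            return  # already heard this one, so stop forwarding it
        self.seen.add(update.seq)
        self.pending[update.seq] = update
        # Work through updates strictly in sequence order, applying only
        # those for datasets this node serves and skipping the rest.
        while self.next_expected in self.pending:
            ready = self.pending.pop(self.next_expected)
            if ready.dataset in self.datasets:
                self.applied.append(ready.payload)
            self.next_expected += 1
        for peer in self.peers:  # simplified gossip: forward to every peer
            peer.receive(update)


sequencer = SequenceService()
sparql_films = Node("sparql-1", frozenset({"films"}))
search_films = Node("search-1", frozenset({"films"}))
sparql_music = Node("sparql-2", frozenset({"music"}))
nodes = [sparql_films, search_films, sparql_music]
for node in nodes:
    node.peers = [peer for peer in nodes if peer is not node]

first = sequencer.stamp("films", "add triple A")
second = sequencer.stamp("music", "add triple B")
third = sequencer.stamp("films", "add triple C")

# Inject the updates out of order and at different nodes.
sparql_music.receive(third)
search_films.receive(first)
sparql_films.receive(second)
```

Although the third update arrives first, each node holds it back until the earlier sequence numbers have been seen, so every node serving a dataset converges on the same order. The window in which nodes disagree is the eventual consistency the platform has to manage.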
![Page 25: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/25.jpg)
Monolithic vs distributed
● Monolithic
● Easy to synchronise events and data
● Consistent views and queries
● Less inter-process communication / less network overhead
● Easier to optimise for high throughput
● Single code-base
● Fewer processes to monitor
● Distributed
● Service-oriented - separate concerns run in isolated processes (and can be scaled independently)
● Development is component-based
– Changes are more focussed / helps avoid scope-creep
● Deployment can be localised to avoid downtime
● Failure is more likely – so you need to plan for it
● Easier to integrate out-of-the-box software – e.g. using standard Apache Solr
![Page 26: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/26.jpg)
Distributed data platform
● Separate services for each API
● Communication via Gossip messages
● Have to manage eventual consistency
● Highly scalable
● Easy to add new services
● Use standard protocols and open-source components
● HTTP libraries / REST / ZeroMQ / Apache Thrift
● RDF and SPARQL using Apache Jena
● Search using Apache Solr
● Avoid modification and forks
● Deploy into Amazon EC2 (also using: S3, EMR, and ELB)
![Page 27: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/27.jpg)
Benefits of using cloud services
![Page 28: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/28.jpg)
Consider a start-up in 2002
● Have an idea...
● Get funding (development, op-ex, cap-ex)
● Acquire servers
● Set up your servers
– mail, web, source code repo, build systems
– development, staging, live
● Some 'cloud' services
– …, SourceForge, shared servers, etc
● Build and go to market
● Probably embedding open-source components
● Delivery based on full-stack, monolithic, architectures
![Page 29: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/29.jpg)
Consider a start-up in 2012
● Have an idea...
● Get funding (development capital, op-ex)
● you will probably not get cap-ex
● Use cloud services... rent rather than buy
● SaaS – Software as a Service
– Why would you run your own (chat/email etc)?
– Host your code in GitHub/BitBucket etc
● PaaS – Platform as a Service
– Do you need to control the full stack?
– Could you leverage platforms like Heroku, Joyent, AppEngine etc?
– Amazon RDS
● IaaS – Infrastructure as a Service
– Cloud services to provide 'bare metal'
● Build and go to market quickly
● Scale elastically over time
![Page 30: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/30.jpg)
But what about the enterprise?
● Benefits of cloud services are already transforming the enterprise
● Private clouds
● Virtual appliances
● Cloud bursting
● Independent scaling
● Separation of concerns
● SOA architecture
● And in future...
● Appetite for IaaS is growing
● PaaS and SaaS will follow.
● Perimeter security will be replaced by localised security boundaries
![Page 31: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/31.jpg)
So how do we build this stuff...?
![Page 32: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/32.jpg)
How it all happens
● Constantly iterating through...
● Requirements
● Development (Test-driven)
● Testing/Review
● Deployment
● Operation
● We're an Agile, dev-ops team...
so all the above is a shared responsibility
![Page 33: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/33.jpg)
Being a dev-ops team...
● Removing barriers between development and operations
● Shared responsibilities rather than distrust
● Everyone has root access
● Developers are responsible for operating systems they build
● Everyone is free to make changes
...and responsible for managing the roll-out of those changes
● Ops/Deployment/Monitoring are automated
● Everyone should have full-stack awareness
● Read more...
● http://dev2ops.org/blog/2010/2/22/what-is-devops.html
● http://www.jedi.be/blog/
● http://en.wikipedia.org/wiki/Devops
● http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
![Page 34: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/34.jpg)
Life-cycle of a change
![Page 35: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/35.jpg)
Requirements and Planning
● Identification of requirement
● Planning
● Break down big changes into smaller tasks
– Can the change be deployed in small steps?
– Can the change be dark-deployed?
● Understand the wider impact
● Find middle ground between generic and specific
● Team is self-organising
● People pull work from the prioritised, planned stories
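Dark deployment can take several forms. One common interpretation, sketched below, is to ship the new code path to production and run it in the shadow of the existing one, so that users only ever see the old result while any differences are recorded for the team. The `dark_launch` helper and the two word-counting functions are hypothetical examples, not code from the platform.

```python
def dark_launch(current, candidate, report):
    """Wrap `current` so that `candidate` runs in its shadow on every call."""

    def wrapper(*args, **kwargs):
        result = current(*args, **kwargs)
        try:
            shadow = candidate(*args, **kwargs)
        except Exception as exc:
            # A failing candidate must never affect what the caller sees.
            report.append(("error", args, repr(exc)))
        else:
            if shadow != result:
                report.append(("mismatch", args, (result, shadow)))
        return result  # callers always get the existing behaviour

    return wrapper


def old_count(text):
    # Existing behaviour: repeated spaces produce empty "words".
    return len(text.split(" "))


def new_count(text):
    # Candidate behaviour: runs of whitespace are collapsed.
    return len(text.split())


report = []
count_words = dark_launch(old_count, new_count, report)
```

Once the report stays empty, or shows only the differences that were intended, the candidate can be switched on for real in a separate small change.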
![Page 36: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/36.jpg)
Branch based development
● One branch per change, squash before merge
![Page 37: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/37.jpg)
Writing the code
● Work on a branch
● don't know if/when you'll merge
● Test-driven
● Unit tests first
● Do acceptance tests need to change?
● What technology? Which tool-sets?
● Smoke testing
● How do you know it works?
● What's different in production?
● What are the risks of failure?
● Feature flags?
Tests run: 110, Failures: 0, Errors: 0, Skipped: 2
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 39 seconds
[INFO] Finished at: Sat Feb 18 15:20:36 GMT 2012
[INFO] Final Memory: 33M/240M
[INFO] ------------------------------------------------------------------------
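To make the "unit tests first" and "feature flags?" prompts concrete, here is a small sketch in which a new code path sits behind a flag and both sides of the flag are covered by unit tests. The flag naming scheme (`FEATURE_<NAME>` environment variables), the `search` function, and its return values are assumptions made up for this example. The build output above comes from a Java/Maven project; Python is used here only for brevity.

```python
import os
import unittest


def flag_enabled(name, environ=None):
    """Return True if the named feature flag is switched on."""
    env = os.environ if environ is None else environ
    value = env.get(f"FEATURE_{name.upper()}", "off")
    return value.lower() in {"1", "on", "true"}


def search(query, environ=None):
    # The new behaviour is merged and deployed, but stays dormant until
    # the flag is enabled in a given environment.
    if flag_enabled("fuzzy_search", environ):
        return f"fuzzy:{query}"
    return f"exact:{query}"


class SearchFlagTest(unittest.TestCase):
    def test_flag_off_keeps_existing_behaviour(self):
        self.assertEqual(search("solr", environ={}), "exact:solr")

    def test_flag_on_uses_new_behaviour(self):
        env = {"FEATURE_FUZZY_SEARCH": "on"}
        self.assertEqual(search("solr", environ=env), "fuzzy:solr")
```

Saved as a module, the tests can be run with `python -m unittest`. Passing the environment in explicitly keeps the tests independent of whatever flags are set on the build agent.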
![Page 38: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/38.jpg)
Writing the code
● Avoid unnecessary scope-creep
● “I'll just fix this...”
● “It would be much cleaner if I re-factored this...”
● “It would be neat if I also added this...”
● …however, these observations can be written as new stories
● …and sometimes it's good to fix things before they cause pain
● …if extra changes are really necessary, can they be implemented separately?
● …team should be empowered to fix technical debt
● ...managing scope-creep is a shared responsibility
● Be prepared to abandon a change if it's taking too long; maybe it needs more planning?
● Should you be pairing?
● Should you demo your work?
![Page 39: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/39.jpg)
Code review
● Code review possible with tools for distributed teams (e.g. Gerrit or ReviewBoard)
● If you're not following a strict pairing policy, code-review is vital
● Useful to make others aware of changes
● Gerrit
● Build agent automatically builds your change and runs tests – verify +/- 1
● Invite others to review your code, they can give it a score between -2 and +2.
● Can only deploy code once at least one person has given a +2
● Work-flow is customisable
● Self-organising... anyone can review
$> git commit
$> git review
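The scoring rules on this slide can be written down as a small predicate. This is a sketch of the policy as described, not Gerrit's own implementation: the function name is invented, and the rule that a single -2 review or a -1 from the build agent blocks the change is an assumption modelled on Gerrit's usual default workflow, which (as the slide notes) is customisable.

```python
def can_submit(verified, code_review):
    """Decide whether a change may be merged and deployed.

    verified    -- scores from the build agent, each -1, 0, or +1
    code_review -- scores from human reviewers, each between -2 and +2
    """
    if any(not -1 <= score <= 1 for score in verified):
        raise ValueError("verified scores must be between -1 and +1")
    if any(not -2 <= score <= 2 for score in code_review):
        raise ValueError("code review scores must be between -2 and +2")

    # The build must have passed, and no build may have reported a failure.
    build_ok = 1 in verified and -1 not in verified
    # At least one reviewer gave +2; a -2 is treated as a veto (assumption).
    review_ok = 2 in code_review and -2 not in code_review
    return build_ok and review_ok
```

For example, `can_submit([1], [1, 2])` is true, while `can_submit([1], [1, 1])` is false because two +1 scores do not add up to a +2.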
![Page 40: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/40.jpg)
Code review (2)
![Page 41: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/41.jpg)
Code review (3)
![Page 42: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/42.jpg)
Merge / Deployment
● Merge & Deployment
● One-click deployment
● Developer should press the button
● Code is merged into the master/release branch
● Build server automatically checks out the code and builds, tags, and uploads the release to an artefact repository
● Package is automatically deployed on all servers
– Extra orchestration for external-facing services to avoid “thundering-herd” problems
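The slides do not say what the extra orchestration consists of. One common way to avoid a thundering herd during roll-out is to deploy in small batches with a pause in between, so that restarted services do not all hit shared resources at the same moment. The sketch below shows that idea; the function names, batch size, and pause length are illustrative assumptions.

```python
import time


def batches(hosts, size):
    """Yield consecutive slices of `hosts`, each at most `size` long."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def rolling_deploy(hosts, deploy, batch_size=1, pause_seconds=30.0, sleep=time.sleep):
    """Deploy to `hosts` one batch at a time, pausing between batches.

    `deploy` is whatever callable installs the package on a single host;
    `sleep` is injectable so the pause can be faked in tests.
    """
    groups = list(batches(hosts, batch_size))
    for index, group in enumerate(groups):
        for host in group:
            deploy(host)
        if index < len(groups) - 1:  # no need to wait after the final batch
            sleep(pause_seconds)
```

A real roll-out would probably also check each batch's health before moving on; that step is omitted here to keep the sketch short.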
![Page 43: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/43.jpg)
Managing infrastructure
● Puppet or Chef
● Build packages (e.g. DEB or RPM)
● Centralise configuration management
● Utilising cloud compute infrastructure
● Amazon EC2
● Amazon S3
● Elastic load balancers
● Elastic Map-Reduce
● Application monitoring
● Metrics
● Log analysis
● Internal monitoring
● External checks
![Page 44: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/44.jpg)
Lessons learnt
(again, my views!)
![Page 45: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/45.jpg)
Technical lessons learnt
● Use distributed SOA-based services to reduce tight-coupling
● Monitor everything...
● Leverage cloud offerings
● wrap them with well-defined interfaces to avoid lock-in
● Design systems to scale
● Use open and unmodified components where possible
● Standard components fronting external APIs
● E.g. Jena, Solr, HAProxy, Apache
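The advice to wrap cloud offerings behind well-defined interfaces might look like the following sketch: application code depends on a small storage interface, and only one adapter class knows about any particular provider. The `BlobStore` interface, the in-memory implementation, and the `archive_dataset` helper are hypothetical; an adapter for a hosted service such as Amazon S3 would implement the same two methods.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """The only storage interface the rest of the application sees."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store `data` under `key`, replacing any existing value."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the bytes stored under `key`; raise KeyError if absent."""


class InMemoryBlobStore(BlobStore):
    """Stand-in used for tests and local development."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_dataset(store: BlobStore, name: str, lines) -> str:
    """Application code: written against BlobStore, not a vendor SDK."""
    key = f"datasets/{name}.nt"
    store.put(key, "\n".join(lines).encode("utf-8"))
    return key
```

Swapping providers then means writing one new adapter rather than touching every caller, which is the lock-in protection the slide refers to.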
![Page 46: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/46.jpg)
Practices that have helped us
● Dev-ops culture
● Pragmatic approach to agile development
● Task allocation should be 'pull', rather than 'push'
● Teams should be self-organising
● Pairing when working on new problems
● Test-Driven-Development (TDD)
● Continuous integration
● Peer-review of code
● Continuous deployment
![Page 47: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/47.jpg)
…so, in summary...
![Page 48: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/48.jpg)
Conclusion
● Isolate your design into components
● Empower your team to release small changes frequently
● Leverage hosted/cloud offerings
![Page 49: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/49.jpg)
Thanks for listening!
![Page 50: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/50.jpg)
Credits
● Thanks for the invite to speak
● Thanks to Kasabi / Talis Systems Ltd
● Sign up at http://www.kasabi.com
Graphics from http://www.iconarchive.com/, http://www.oxygen-icons.org and http://www.icons-land.com
![Page 51: Building a distributed data-platform - A perspective on current trends in computing](https://reader034.vdocuments.us/reader034/viewer/2022051514/54bc9bc84a7959f6568b4609/html5/thumbnails/51.jpg)
Questions?