linked data infrastructure soma - the insight centre for data … · 2016-02-23 · distributed...
TRANSCRIPT
Soma: Linked Data Infrastructure
What is Soma?
It’s Big Data Candy for the Cloud.
The Soma platform helps data scientists collaborate to discover and share new facts from large datasets hosted on shared infrastructure.
All this while lowering development and operations costs.
Meet our Customers
Expert - See themselves as “experts” or an authority on a subject. Want the big picture, like easy-to-use specialised applications with great visualisation.
Creative - People who see themselves as Data “artists”. Need to explain the meaning of the data. Good generalists, can code, with a flair for the visual or data narrative.
Engineer - See themselves as “engineers”. Focused on the technical problem of managing data: how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.
Researcher - See themselves as “scientists”. People with deep academic background in maths, machine learning & modeling complex processes. Reluctant coders.
Customers we support now
Creative - Need to explain the meaning of the data. Good generalists, can code, with a flair for the visual or data narrative.
Engineer - Focused on the technical problem of managing data. Normally strong software developers.
Researcher - People with deep academic background in science, maths, machine learning. Reluctant coders.
What we deliver to customers
Creative
Now:
● Gitlab integration
● from gitlab
● Web facing applications
Researcher
Now:
● Discovery early adopters
Early September:
● Discovery platform rollout
Engineer
Now:
● Big Data Cluster
● Container Management
November:
● Storage frameworks
Features
Fully operational big data station
Right Now
Mesos based Cloud O/S
● Cluster of 88 CPUs, 295 GB of memory
● Distributed Application Scheduling
● Resource Scheduling
Container Management
DNS service discovery
Deployment
● Gitlab
● Mesos Cluster
● Zookeeper Cluster
● HDFS Cluster
● Integrated DNS
● CI servers
● Docker Registry
Deeper Dive
Gitlab
● All applications MUST be in gitlab
Mesos Cluster and Container Manager
● Let’s have a look at what is running right now:
“can mix both batch and real-time processing”
“process at batch and real-time Velocity”
Lambda architecture
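The two quotes above describe the core idea of a Lambda architecture: a batch layer periodically recomputes views over the full dataset, a speed layer keeps incremental results for events the batch has not covered yet, and queries merge the two. A minimal sketch of that idea follows. The class and method names are hypothetical and are not part of Soma; a real deployment would use distributed storage and processing rather than in-memory counters.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts events per key (illustrative only)."""

    def __init__(self):
        self.master_log = []         # append-only record of every raw event
        self.batch_view = Counter()  # recomputed from the full log by run_batch()
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        # Each event is appended to the master log and counted by the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: rebuild the view from scratch, then reset the speed layer,
        # because the batch view now covers everything ingested so far.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[key] + self.speed_view[key]
```

Events ingested after a batch run are still visible to `query`, because the speed layer covers the gap until the next batch run completes.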
Data sources
Features
Source Control Management
Continuous Deployment
Service Monitoring
Always available key datasets
● DBPedia
● SemanticWeb Dogfood
Continuous Deployment
1. Have a gitlab account
2. Ask Research Ops to add the Soma Role to your project
3. If you are accepted you will be guided through “dockerizing” your gitlab project
4. Once accepted, every push to your master branch will be deployed and accessible online through Soma.
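Step 4 can be pictured as a small hook that reacts to push events. The sketch below is only an illustration, not Soma's actual pipeline: it assumes the payload fields found in GitLab push webhooks (`object_kind`, `ref`, `after`, `project.path_with_namespace`), which should be checked against the GitLab version in use, and the registry host and image naming scheme are hypothetical.

```python
def should_deploy(payload, branch="master"):
    """Return True when a push event targets the deploy branch."""
    return (
        payload.get("object_kind") == "push"
        and payload.get("ref") == f"refs/heads/{branch}"
    )


def image_tag(payload, registry="registry.example.org"):
    """Build a Docker image tag from the project path and short commit SHA.

    The registry host and the naming scheme are assumptions for illustration.
    """
    project = payload["project"]["path_with_namespace"]
    short_sha = payload["after"][:8]
    return f"{registry}/{project}:{short_sha}"
```

A CI server would call `should_deploy` on each incoming event and, when it returns True, build and push the image named by `image_tag` before asking the container manager to run it.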
Features
Integrated Discovery platform
SOMA Discover - hosted discovery tool based on the Smarter Data project, allowing exploration of data and sharing of results.
Other internal tools such as Sig.ma, Social Lens, and other projects to follow.
Goals for Research Ops
Nurture a Data Engineering community at Insight with supportive experts, shared tools & best practices
Provide a Shared analytics platform for Data Scientists at Insight (Soma)
Encourage new research and engagements with the wider big data analytics research community
Nurture
● Provide a structured approach to managing and releasing all Engineering IP (Code and Data) at Insight
○ Source control (Git)
○ Release management
○ Assist in IP management
● Provide Quality Circles for Engineering practices
○ 2 Groups - Data Visualisation & Big Data. Workshops to commence this month.
Provide
● Build big data infrastructure for Insight
○ Soma platform
● Support Hadoop ongoing development
○ Hadoop clusters, Dataspace support
● Support Ad Hoc projects requiring scale
○ Cancer atlas
● Provide “Big Data” Expertise to the Linked Data group
○ Hadoop, Yarn, Mesos, Spark, Dataspace, Mongo and Virtuoso
Problems being met
● High cost in research when data scales to “Big Data” [P1]
○ Ad Hoc maintenance of big data sets is expensive [P2]
○ Development complexity of valuable Big Data jobs is prohibitive [P3]
● High cost of operating Big Data infrastructure [P4]
○ Scarcity of hardware and lack of funds for new hardware [P5]
○ Inability to maintain a core operations team [P7]
● Missed opportunities for researchers to collaborate [P6]
Soma serving our customers
Soma Create - Serves data fresh from the source. Has queryable large datasets that are both highly available & up-to-date. Has a service to mash these up.
Soma Engineer - Provides a Lambda architecture consuming, cleaning, processing and loading the data to the data layer.
Soma Discover - Useful blocks of processing that can be connected together using a nice GUI; works with many datastores.
Soma Expert - vertical applications solving a real world problem, these apps are built by Insight’s Data Researchers and Data Creatives.
The 4 kinds of Data Scientist
Expert - See themselves as “experts” or an authority on a subject. Want the big picture, like easy-to-use specialised applications with great visualisation.
Creative - People who see themselves as Data “artists”. Need to explain the meaning of the data. Good generalists, can code, with a flair for the visual or data narrative.
Engineer - See themselves as “engineers”. Focused on the technical problem of managing data: how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.
Researcher - See themselves as “scientists”. People with deep academic background in maths, machine learning & modeling complex processes. Reluctant coders.
Goals
Soma to be a complete ecosystem to help researchers deliver “Big Data” distributed applications
● Showcase Insight expertise
● Standardize best practices for linked data at big data scales
● Deliver targeted applications & tools to build complex analytics apps & job management
Distributed O/S (Better than cloud)
● We use Mesos based infrastructure to provide
○ Scheduling Process Execution of Jobs/Applications across the cluster
○ Resource scheduling of the needed CPU/Memory/Storage for these applications
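To make the resource scheduling point concrete, here is a deliberately simplified first-fit placement of tasks onto nodes by CPU and memory. It is a sketch under stated assumptions, not how Mesos itself decides placements: Mesos works through resource offers that frameworks accept or decline. The `Node`, `Task`, and `first_fit` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpus: float
    mem_gb: float


@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float


def first_fit(tasks, nodes):
    """Place each task on the first node with enough free CPU and memory.

    Returns a dict mapping task name to node name, or None when no node fits.
    Node capacities are reduced in place as tasks are placed.
    """
    placement = {}
    for task in tasks:
        placement[task.name] = None
        for node in nodes:
            if node.cpus >= task.cpus and node.mem_gb >= task.mem_gb:
                node.cpus -= task.cpus
                node.mem_gb -= task.mem_gb
                placement[task.name] = node.name
                break
    return placement
```

A task that fits nowhere is reported as unplaced rather than raising, which mirrors a job waiting in a queue until capacity frees up.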
SOMA Discover (Data)
Where we are now
What we have
Soma Engineer - Standard Mesos platform - Provides a Lambda architecture consuming, cleaning, processing and loading the data to the data layer.
Soma Discover - Smarter Data - an interactive expressive query tool that creates data blocks & visualisations
What we need help on
Soma Expert - Pivoty - a medical index built from standard HCLS datasets that uses a Pivot Browser
Soma Create - The Insight Standard Dataset - a shared queryable standard set of big-data sources