Service and Support for Science IT
Scientific Cloud Experiences
Dr. Peter Kunszt
Director S3IT
Outline
• Introduction
  – What is Science IT
  – How are we organized
• UZH ScienceCloud Infrastructure and Implementation
• Science Data and Security/Privacy
Challenge: Scale Up
• High Throughput Instruments
  – Much larger data volumes
  – Increased data complexity
• Large Collaborations
  – More people
  – More experiments and measurements
  – More coverage
BIG DATA
Fire and forget...
• Scientists do not want to be bothered with infrastructure details
• IT JUST NEEDS TO WORK!
Widening Complexity Gap: IT-Research
[Diagram labels: Local IT Resources; Research Labs; Core Facilities; Miracle; SCIENCE IT]
What is Science IT?
FILL THE GAP
Dedicated Support Center for Science IT
• SPEED: faster time to solution
• ACCESS: to infrastructure, software, expertise
• ENABLE: use IT technology and software for new ideas
[Diagram labels: Speed; Access; Enablement]
Supporting Science
• Be a partner to research projects for Science IT
• Provide services to individual researchers, groups and consortia
  – Consultancy for advanced usage of IT in Science
  – Research software development and support
  – Access to competitive IT infrastructure
  – Access to a library of tools and software
  – Project management and collaboration support
  – Training and education on the usage of infrastructure and software
• Collaborate internally, nationally and internationally with partners, suppliers and other Science IT units
• Maintain high level of internal expertise on topics relevant to Science IT
• Advise UZH Governance on evolution of needs, assist in prioritization
Organization Structures are Changing
Old world: Hierarchical (diagram: Org A to Org H)
New world: Federated (diagram: Org A to Org D)
http://www.fedsm.eu/
S3IT Organization
[Organization chart: Core Team, Site Teams, and several Embedded Experts (EE)]
EE = Embedded Expert: working directly in projects or on-site in groups on specific tasks
Site Teams: joint teams with other units providing local support and some global services
Core Team: directorate, office, core services, central infrastructure and consultancy, project management
Partner Interactions
[Diagram, linked by Services and Agreements:]
Partners / Clients: Core Facilities; Research Groups; Projects; Faculties, Institutes, Departments
Partners / Suppliers (internal and external): Central IT; CSCS; Vendors
S3IT Core Business: Project Support
• Infrastructure is important but 'just' a means to an end
• Science IT Support: Applications, access, integration
• Data analysis
• Simulations
• Data Integration
• Application scaling, making use of big infrastructures
• Workflows, automation
• Visualization
• Software design and usage advice, Code Clinic
• Training and education
• ...
Understand the science..
.. to map Science IT services!
Mapping Security and Privacy
• Most science follows 3 stages
  – Conception, preparation, proposition stage: private
  – Project stage (3-5 years): share in group
  – Publication of results: open to all
• Some have additional constraints (regulations)
  – Medicine: patient data records need consent (different per country)
  – Law and business: confidentiality in projects
  – Engineering, pharmacology, etc.: patents
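The stage-to-visibility mapping above can be written down as a small lookup. This is only a minimal sketch: the names `Stage`, `DEFAULT_AUDIENCE`, `CONSTRAINTS` and `access_policy` are hypothetical and not part of any S3IT tooling; the values simply restate the bullets.

```python
from enum import Enum


class Stage(Enum):
    CONCEPTION = "conception"    # conception, preparation, proposition
    PROJECT = "project"          # project stage (3-5 years)
    PUBLICATION = "publication"  # publication of results


# Default audience per stage, as listed on the slide.
DEFAULT_AUDIENCE = {
    Stage.CONCEPTION: "private",
    Stage.PROJECT: "group",
    Stage.PUBLICATION: "public",
}

# Additional regulatory constraints per domain (hypothetical domain keys).
CONSTRAINTS = {
    "medicine": "patient consent required (rules differ per country)",
    "law_business": "project confidentiality",
    "engineering_pharma": "patents",
}


def access_policy(stage, domain=None):
    """Return the default audience and any domain constraint for a stage."""
    return {
        "audience": DEFAULT_AUDIENCE[stage],
        "constraint": CONSTRAINTS.get(domain),  # None if no extra constraint
    }
```

For example, `access_policy(Stage.PROJECT)` yields the audience `"group"` with no extra constraint, while passing `domain="medicine"` attaches the consent requirement at any stage.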
Infrastructure
• Supercomputing
  – Used as a scientific instrument by
    • theoretical physics, astrophysics, mathematics, computational chemistry, biochemistry, quantum chemistry
    • Continuous usage
• Cluster computing
  – Used as a workhorse by many groups
    • Life science, biochem, geoscience, medicine, digital humanities, banking and finance, art history, ...
    • Data analysis, statistical analysis, parameter studies, etc.
    • Non-continuous usage
• Server computing
  – Used as interactive computers by many groups
    • All groups. Interactive processing, visualization, steering of computation. Commercial and open-source tools.
    • Daily usage, non-continuous.
Storage Classes
• Large, cheap data store for projects O(xPB)
  – No need to be backed up: easy to regenerate but time-consuming
• Reliable project data store O(1PB)
  – With secondary copy
  – Only addition, no changes
• Working storage O(x100TB)
  – Active data, databases, server-side processes
• Fast storage for streaming analysis O(100TB)
  – Fast changing data, immediate analysis, rare!
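As an illustration, the four storage classes above can be read as a decision rule over a few properties of a dataset. The sketch below assumes three boolean properties and an order of checks that the slide does not prescribe; `Dataset` and `storage_class` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    regenerable: bool  # can be recomputed, even if that takes time
    append_only: bool  # data is only added, never changed
    streaming: bool    # fast-changing data that is analysed immediately


def storage_class(ds: Dataset) -> str:
    """Pick one of the four storage classes for a dataset (assumed ordering)."""
    if ds.streaming:
        return "fast"      # fast storage for streaming analysis (rare)
    if ds.regenerable:
        return "bulk"      # large, cheap store without backup
    if ds.append_only:
        return "reliable"  # project store with a secondary copy
    return "working"       # active data, databases, server-side processes
```

Under these assumptions a raw simulation output that can be regenerated lands on the cheap store, while a database that changes in place falls through to working storage.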
Datacenter Consolidation
[Diagram: the datacenters of OCI – S3IT, ZMB, BIOC, MATH, PHYS and IMLS / Neuro are consolidated into a NEW Central Datacenter]
Aim: Scale and Secure!
UZH ScienceCloud Implementation
• OpenStack – based on Canonical
• Deployment using Ansible
• Vagrant-like system for configuration: Elasticluster (developed at UZH)
• Flexible submission and workflow framework for job control: GC3pie (developed at UZH)
• Database management framework openBIS for data lifecycle management (developed at ETH/SystemsX.ch)
Business Model
• Supercomputing
  – Investment every 4 years into the system
  – Research groups to find 3rd party funding
• Commodity Cloud and Storage
  – Subscription / year: Cores, TB
  – Per use fee
  – Subsidized, not TCO: covering operations
• Servers / Pets
  – Yearly or monthly fee
  – Size matters
• Yearly acquisition / rollover
  – Easy to plan
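The yearly subscription for cores and TB can be sketched as simple arithmetic. The function name, the rate parameters and the idea of expressing the subsidy as a fraction are all assumptions made for illustration; the slides give no actual prices.

```python
def yearly_subscription(cores, terabytes, rate_per_core, rate_per_tb, subsidy=0.0):
    """Yearly fee for a cores + TB subscription.

    All rates are hypothetical placeholders. `subsidy` is the assumed
    fraction of the gross price that is not charged to the group.
    """
    if not 0.0 <= subsidy <= 1.0:
        raise ValueError("subsidy must be a fraction between 0 and 1")
    gross = cores * rate_per_core + terabytes * rate_per_tb
    return gross * (1.0 - subsidy)
```

With made-up rates of 100 per core and 20 per TB, a group subscribing to 10 cores and 5 TB at a 50% subsidy would pay `yearly_subscription(10, 5, 100.0, 20.0, 0.5)`, that is 550.0 per year.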
Experience so far:
• Supercomputing needed only by few groups
  – Can be completely outsourced to national center, done as of 2015
• Cloud is suitable for most Science Workloads
  – User support scales well
  – Can cover very many use cases
  – Build dedicated boxes for exceptions, don't be driven by them
  – Flexibility is key
• Must use local infrastructure for secure, data intensive and memory intensive workloads
  – Data locality needed for COST and (rarely) policy reasons; exception: medical data
  – Hybrid cloud: burst available for CPU intensive jobs
  – Deal with heterogeneity
Future Cloud Strategy: HYBRID
• Run sizeable local cloud infrastructure for internal workloads
• Burst peak loads to public cloud providers
  – For selected workloads coherent with policy and cost
Advantages
• Plannable local infrastructure (plan for full usage)
• Flexibility in scaling, quick provisioning of needed capacity
Open Questions
• Policies. What workloads can be burst to public clouds? Under what conditions?
  – Calculations, simulations usually OK
  – Data analysis: depends on data (network issues being resolved)
  – Check compliance of cloud providers. ISO, HIPAA, etc.
  – Adherence to Swiss cantonal data protection regulations
• Cost. How to buy public cloud services?
  – Public procurement of agreements?
  – How not to be bound to a single provider?
  – Is this necessary at all?
• How do I charge my users?
  – For internal and for external use?
  – Aim: consolidate their workload into our cloud. No TCO!
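The policy question above (which workloads may burst to a public cloud) can be sketched as a rule check. This is a hedged illustration only: `Workload`, `may_burst`, the workload kinds and the `max_transfer_tb` threshold are hypothetical, and a real policy would be set by the institution and its regulators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    kind: str             # "calculation", "simulation" or "data_analysis"
    data_sensitive: bool  # e.g. patient data under data protection rules
    data_tb: float        # volume of data that would have to be moved


def may_burst(w: Workload, provider_compliant: bool, max_transfer_tb: float = 1.0) -> bool:
    """Decide whether a workload may run on a public cloud (assumed rules)."""
    if not provider_compliant:
        return False  # provider fails the compliance check (ISO, HIPAA, ...)
    if w.data_sensitive:
        return False  # sensitive data stays on local infrastructure
    if w.kind in ("calculation", "simulation"):
        return True   # usually OK
    if w.kind == "data_analysis":
        # Depends on the data: here only the transfer volume is considered.
        return w.data_tb <= max_transfer_tb
    return False      # unknown kinds stay local by default
```

Defaulting to "stay local" for anything not explicitly allowed is a design choice of this sketch, matching the emphasis on local infrastructure for secure and data-intensive workloads.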
Comments on Security in academia
• Users in academia are super smart. They remove barriers faster than you can erect them.
• Do risk assessment and risk analysis instead of prevention.
• Don't do anything 'for security reasons', always qualify with real risk numbers
• Public Clouds are MUCH MORE secure than our own
  – Amazon, Microsoft, IBM etc. have whole teams of security experts; they hired our best students for this
• It is a question of TRUST
  – Regulations by countries
  – Do we trust the US not to do industrial and academic espionage, forcing their own companies to give out our data?
Scientific Requirements
• Know your workload: Data, Privacy, Science, Sharing aspects are tightly connected
• Lots of hidden complexity and contradicting requirements
1. What Data?
• Different kinds of 'BIG' data
• Volume, Variety, Velocity, Veracity
• Understanding is Knowledge is Science
  – Data vs. Information and Knowledge
  – What are the right questions?
  – What should be protected, till when?
  – How to navigate, explore, evolve
WHO OWNS THE DATA?
For science, proprietary data is a hindrance
2. Data Reuse
• Currently a wealth of data is not reused for new discovery
• Lots of potential! Regulators need to be told..
• Data repositories with computing and search capability – perfect for Cloud Model
• Do the computation where the data is – Private, public, hybrid Cloud
IP on TOOLS, ease of data USE, not DATA itself.
3. Motivate to annotate
• Scientists publish what is necessary and prescribed by the journals, not more: mandate better annotation
• Provide more recognition for producing 'good' datasets: Data Citation
• Check data quality: bad-quality data or data without annotation has no value
Creation of well annotated, sustained public resources
4. Standard Formats
• Too many 'standards', or standards not used
  – Instrument vendors often at fault
• Protection of data by proprietary formats
  – Data is lost to research
• Do not pay for data in nonstandard formats
  – Data value is zero if unusable
Mandate standard formats for domain data
5. Data Sharing/Publishing
• Share in collaborative mode
• Avoid Data Loss
• Motivate and enable data publication
• Establish business model for data publication (reward/career benefit)
• Journals adapt, see Scientific Data http://www.nature.com/scientificdata
New role for Archives and Libraries
6. Patient Data Records
• Legal issues of data privacy
• People are not in control of their own data
• Difficult to get consent
• NSA effect: trust
Put citizens back in control
Patient Data Records
• TRUST
  – Swiss Cooperative: citizen owned
• NEUTRALITY
  – A simple e-Banking system for any personal health data. Same level of security
• TRACTION
  – Volume: it is free, it's rewarded
• IMPACT
  – Request data directly, avoid legal issues
• It is a cooperative, not a business
• Funding by running campaigns to ask people to participate in research & surveys
• Participants are REWARDED for sharing their data or providing new data
• Build tools on top
• Currently seeking funding
  – H2020, foundations
  – Projects with hospitals, clinics
Approach at S3IT
• Early involvement with Research Groups
  – Proposal writing, partnership
  – Advice on Data Management, infrastructure, standards
• Strong cooperation with Libraries
  – Early involvement with publishers, archives
  – Joint information to research groups on data management plans, data citations
• Seeking contact with funding bodies and decision makers
  – Communicate business plan for Science IT 'project consumables'
  – Evaluation of projects based on technology cost and feasibility
  – Usage of public and each other's cloud resources for cash
Links
• www.s3it.uzh.ch - Science IT at UZH
• www.sybit.net - Systems Biology IT, SystemsX.ch
• www.erasysapp.eu - Systems Biology, DMMCore project
• www.healthbank.ch - Public Cooperative being set up for patient-owned data. Seeking funding (H2020, pending, and other sources)