services for sensitive data and ebiobanks at university of oslo · 2015-05-06 · services for...
TRANSCRIPT
Services for Sensitive Data and eBiobanks at University of Oslo. Gard Thomassen, PhD, Head of Research Support Services, Group Leader of the TSD project, University Center for Information Technology (USIT), University of Oslo
Outline • TSD promo :) • What is sensitive data • Laws and regulations • TSD overview • TSD nice-to-know • TSD services • TSD opportunities and • Risk
Gard Thomassen,TSD 2.0
Computerworld 16/5-14
Norsk KreftGenom Konsortium (NCGC): Compared with the hardware we used until the move to TSD, which can probably be characterized as a moderately usable server with 64 cores, we can achieve a theoretical speedup of 30x with TSD. In addition, we have optimized our analysis pipeline by parallelizing several steps. Previously, a sequencing analysis of 48 tumor/normal pairs would have resulted in a run time of at least two to three months. This week we ran the same analysis on TSD in two days and a few hours. To put it mildly, a dramatic improvement. Prof. Eivind Hovig, NCGC
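As a rough check on the figures in the quote, the observed speedup can be computed from the two run times. This is only a sketch: a month is taken as 30 days and "two days and a few hours" as 2.2 days, both of which are assumptions, and the function name is hypothetical.

```python
def observed_speedup(old_days: float, new_days: float) -> float:
    """Ratio of the old run time to the new run time."""
    return old_days / new_days


# "Two to three months" before TSD, taken as 60 to 90 days (assumption).
# "Two days and a few hours" on TSD, taken as 2.2 days (assumption).
low = observed_speedup(60, 2.2)
high = observed_speedup(90, 2.2)
print(f"observed speedup: roughly {low:.0f}x to {high:.0f}x")
```

Under these assumptions the observed range brackets the quoted theoretical 30x; the upper end also reflects the pipeline parallelization mentioned in the quote, not the hardware alone.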
Teknisk ukeblad & e24, 5/5-14
Uniforum
What is sensitive data?
• Personal Data Act §2, point 8 – race/ethnic data, political opinion, philosophical and religious beliefs, the fact that a person has been suspected of, charged with, indicted for or convicted of a criminal act, health, sex life and trade-union membership
• Biotechnology Act • Health Registry Act • And so on..
System requirements • Security, isolation and access control as given by law • Large storage capacity • Multi tenant (multiple users) • High performance computing (HPC) resource • High bandwidth • Easy to maintain and operate • Easy to use and “practical” (also for audio and video) • Some freedom within confined user space • Accessible from anywhere through proper mechanisms • A variety of software and public data-sources must be available • Windows and Linux support (server/host-side) • Data collection services • Data sharing services
Tough requirements, tough project
Services for Sensitive Data – TSD (Norwegian: Tjenester for Sensitive Data)
Started initial work with a pilot in 2009. Full-fledged services in production in spring 2014.
System outline
[Diagram. Components: Internet, Gateway, VM-server, HPC (Colossus), Storage, and a secure encrypted network to special high-volume data production sites. Cardinality labels: 1 (project), 1 (storage area), n, 1.]
Using TSD
[Diagram. Labels: User1 Study1, User2 Study1, GW, VM U1 S1, VM U2 S1, S1, TSD disk, TSD S1 DB, Colossus, Front end Colossus, Colossus disk.]
Data import and export using TSD
[Diagram. Labels: “Sluice-server”, virtual “sluice-server”, virtual project-server, “Sluice HD”, Project HD, TSD, NFS mount; steps numbered 1 to 4.]
Step annotation: data is copied here by ssh+scp or web-drive (2-factor authentication), encrypted if sensitive.
Data collection using TSD
[Diagram. Labels: “Nettskjema-minID”, Nettskjema home page, minID, import mechanism, encrypted XML (PGP), TSD, project VM, project disk.]
What TSD offers at present
• Secure storage • Secure data analysis • Linux or Windows hosts • Secure import and export • Web-based data harvesting • HPC cluster • Postgres DBs
HPC resource – Colossus • At present about 1500 cores (~30 TFLOPS) • No project users are to log in on any nodes • One global job daemon to control data integrity (to ensure project data separation) • $SCRATCH will be on a per-project basis and cleaned after each job finishes • As similar as possible to Abel (the non-sensitive HPC resource in Oslo) • Separate disk system for parallel file system • Huge-mem nodes and InfiniBand interconnect
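The per-project $SCRATCH behaviour described above (one scratch area per project, cleaned when each job finishes) can be illustrated with a small sketch. This is not how Colossus implements it; the slides give no implementation details. The function name, the directory layout, and the project and job identifiers are all hypothetical.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def project_scratch(scratch_root: Path, project: str, job_id: str):
    """Yield a scratch directory for one job of one project, then delete it.

    Hypothetical layout: <scratch_root>/<project>/<job_id>-XXXXXX
    """
    project_dir = scratch_root / project
    project_dir.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a fresh directory readable only by the creating user.
    job_dir = Path(tempfile.mkdtemp(prefix=f"{job_id}-", dir=project_dir))
    try:
        yield job_dir
    finally:
        # Clean up even if the job raised an exception.
        shutil.rmtree(job_dir, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with project_scratch(Path(root), "p01", "job42") as scratch:
            (scratch / "intermediate.tmp").write_text("temporary data")
            print("job scratch:", scratch)
        print("still exists after job:", scratch.exists())
```

Keeping each job's files under its own project directory means cleanup never touches another project's data, which matches the separation goal stated in the bullets.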
Practical things to remember
• How to get onboard • Login • Where is my data • What is backed up • What needs to be encrypted • Where can I access TSD from • How to get HPC access • What does it cost • How to use Nettskjema • Where do I send my questions
Technical details • KVM for virtualization (Red Hat Linux) • Cerebrum for provisioning (a USIT application) • AD system administration guided by the provisioning system (duplicated) • FreeBSD firewall and gateway (duplicated) • Integration with ID-porten (Norwegian governmental eID system) for www-enquiries and applications • Storage with separation between projects (Hitachi disk system and encrypted backup to tape) • IPv6 on the inside (… and private IPv4) • FreeRADIUS for 2-factor auth • Separate console server (physical)
Security details
• OATH TOTP 2-factor authentication – Smart phones or programmable hardware tokens
• Import/export is under strict control • No open connection to the internet • Strong separation between projects (VLAN) • Hardened FreeBSD gateway and firewall • Encrypted backup, one key per project • Sys-admins are single users (traceability) • Sys-admins have to use the same authentication process • Hardware is physically separated from other UiO hardware
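The OATH TOTP scheme named in the first bullet is specified in RFC 6238, which builds on HOTP (RFC 4226). The sketch below shows how a one-time code is derived and checked. The slides do not state TSD's parameters, so SHA-1, 30-second steps, 6 digits, and a tolerance of one step either side are assumptions, and the function names are hypothetical.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over an 8-byte big-endian counter."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation: low 4 bits of last byte
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with the counter set to the current time step."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)


def verify(secret: bytes, candidate: str, now: Optional[float] = None,
           step: int = 30, window: int = 1) -> bool:
    """Accept a code from the current step or `window` steps either side."""
    if now is None:
        now = time.time()
    counter = int(now // step)
    for c in range(max(0, counter - window), counter + window + 1):
        expected = hotp(secret, c, len(candidate))
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(expected, candidate):
            return True
    return False


if __name__ == "__main__":
    demo_secret = b"12345678901234567890"  # test secret from the RFCs
    print(totp(demo_secret, now=59, digits=8))
```

The small acceptance window tolerates clock drift between the phone or hardware token and the server, at the cost of a slightly longer validity period for each code.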
Future of TSD - main topics • How to handle video and sound – harvesting – management – metadata – analysis • Journal system for psychologists (University of Umeå collaboration) • Biobanks • VMware and VDI infrastructure (Blast or ThinLinc for Linux, PCoIP for Windows) • Galaxy inside TSD in full scale • Elixir helpdesk connected to TSD • Running Docker containers • Hosting of user-defined VMs -> no! At least not now
Risk-analysis
• System has been discussed with Datatilsynet (the Norwegian Data Protection Authority) – no major worries • Risk analysis has been performed by USIT and no serious issues detected as of February 2015. • OUS and AHUS and VVHF and several others are on board as users • We have an advisory board for all changes • Backup has been
Main collaborators on TSD
Collaborators • Norwegian Storage Infrastructure (NorStore) • Norwegian Genetics Analysis Platform (GenAP) • Norwegian Dietary Registry (Medical Faculty) • Institute of Psychology (Faculty of Social Sciences) • Norwegian Cancer Sequencing Consortium (NCGC)
Reference group: Oslo University Hospital, NorStore, Regional Ethical Committee, National Institute of Public Health, Norwegian Cancer Registry, Research Network at OUS, Elixir Norway, NCGC, GenAP, Institute of Psychology
Capabilities enabled by TSD
• Large scale NGS research on human genomes • Large scale medical imaging studies • Large scale population studies with web-based data collection • Off-site analysis of sensitive data • Secure storage for verification of published research • eBiobank hosting • Electronic consent
Nordic collaboration opportunities • Laws are fairly similar (Norway very strict) • Difficult to exchange sensitive data for research • One should learn from each other, as these systems demand very special IT knowledge • Services development and system-administration know-how is non-sensitive and may be shared • Building TSD addressed many novel security questions in a university setting that others can learn from • Large DBs/registries of health data may enable very interesting research in the future • TSD is involved in the NeIC-based Tryggve project • We are happy to collaborate!
People involved
• tsd-core@usit • virt-core@usit • storage-core@usit • postgres-core@usit • network-core@usit • hpc-core@usit • windows-core@usit • unix-core@usit • IT-security@usit
Project group / developers • IT-dir Lars Oftedal • Hans A. Eide • Märtha Felton
Administration / associated