
www.virtualopensystems.com

EC H2020 dRedBox: Seminar School at Polytech'Clermont-Ferrand

Kevin CHAPPUIS 2017-11-6


Virtual Open Systems Confidential & Proprietary 2

Virtual Open Systems

Virtual Open Systems Proprietary

Part 1: Virtual Open Systems Company Overview


Part 2: Data-Centers Disaggregation in dRedBox


H2020 dRedBox project description

➢ dRedBox (disaggregated recursive data-center in a box)

➢ Project duration: January 2016 - December 2018

➢ Total Cost: EUR 6 451 500

➢ Objective: to rethink datacentre architecture, shifting from monolithic clusters of machines to disaggregated pools of components

➢ The dReDBox proposition aims to deliver significantly improved levels of utilization, scalability, reliability and power efficiency, in both conventional cloud and edge datacentres.


Towards data-centers disaggregation (1/2)

Current data-centers design

• Physical servers composed of CPUs, memory, accelerators, storage

• Impose a fixed resource assignment ratio

– Low resource utilisation

– Energy waste (unused HW still powered on)

– Higher price


Towards data-centers disaggregation (2/2)

Disaggregated data-centers design

• Memory and accelerators separate from CPU brick

• Flexible resource assignment

– High resource utilisation

– Energy optimization (Power off unused resources)

– Lower TCO (Total Cost of Ownership)


Virtualization: Memory disaggregation

Host with minimal local RAM (hypervisor, services)

Memory for a guest obtained from a disaggregated pool

Guest VM uses disaggregated resources exclusively

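QEMU is the virtualizer for the guest; each QEMU/VM instance is just a process from the hypervisor's point of view. QEMU takes host virtual addresses (HVA) and exposes them to the guest as guest physical addresses (GPA), hotplugged into the guest.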

Physical RAM

– Local only for the hypervisor

– QEMU process (VMs) uses remote memory only

– More remote memory attached on demand by orchestrator

How to balance it to limit physical reconfiguration? Memory ballooning


Virtualization: Memory Ballooning

Guest is launched with specific RAM size

Balloon driver operates within the guest RAM capacity

– Inflate – reserving the VM's pages (making them unusable by the guest)

– Deflate – releasing pages

Reserved pages are reported to the hypervisor and may be reused

When the balloon is empty, it is possible to hotplug new memory to the guest and pass it to the balloon.
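The inflate / deflate / hotplug accounting above can be sketched with a small user-space model. This is only an illustration: the struct and the balloon_* function names are hypothetical and do not come from any balloon driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a guest with a balloon (not a real driver API). */
struct balloon_guest {
    size_t total_pages;   /* RAM the guest was launched with, plus hotplugged pages */
    size_t balloon_pages; /* pages currently reserved by the balloon */
};

/* Pages the guest OS can still use. */
static size_t guest_usable_pages(const struct balloon_guest *g)
{
    assert(g != NULL);
    return g->total_pages - g->balloon_pages;
}

/* Inflate: reserve up to n guest pages; returns how many were reserved
 * (these are the pages the hypervisor may reuse). */
static size_t balloon_inflate(struct balloon_guest *g, size_t n)
{
    size_t avail = guest_usable_pages(g);
    if (n > avail)
        n = avail;
    g->balloon_pages += n;
    return n;
}

/* Deflate: give up to n pages back to the guest; returns how many were released. */
static size_t balloon_deflate(struct balloon_guest *g, size_t n)
{
    assert(g != NULL);
    if (n > g->balloon_pages)
        n = g->balloon_pages;
    g->balloon_pages -= n;
    return n;
}

/* Hotplug: new memory is added to the guest and passed to the balloon,
 * so that a later deflate makes it usable. */
static void balloon_hotplug(struct balloon_guest *g, size_t n)
{
    assert(g != NULL);
    g->total_pages += n;
    g->balloon_pages += n;
}
```

In this model a deflate request larger than the balloon simply empties it, which is the point where hotplugging more memory becomes necessary.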


Secure Computing Bricks: Multi-OSs consolidation on ARMv8

[Figure: ARMv8-A hardware with four cores (CPU1 to CPU4) running VOSYSmonitor. Normal world: Linux KVM with non-secure virtual machines. Secure world: Secure Virtual Machine 1 with a secure RTOS (monitoring, secure gateway, etc.) and Secure Virtual Machine 2 with a TEE (secure services).]

➢ Provide spatial and temporal isolation through TrustZone

➢ Support legacy RTOS for monitoring applications

➢ Virtualization features (KVM) remain intact for the GPOS

➢ Flexibility for static allocation or overcommitment of hardware resources


Secure Computing bricks: VM deployment

[Figure: Computing Node 1 and Computing Node 2, each with four cores (CPU 1 to CPU 4) running VOSYSmonitor and hosting Linux, Secure RTOS and TEE instances. The nodes are connected through disaggregated memory with a shared memory area and through IP stack communication.]

➢ CPU disaggregation – Secure computing brick: possibility to deploy secure execution environments to remote cores through a proper communication link between computation bricks.


Part 3: Introduction to Virtualization Concepts

http://www.virtualopensystems.com/en/solutions/guides/kvm-on-armv8/


Part 4: ARMv8 Architecture Introduction


ARM Architecture evolution

[Figure: ARM architecture timeline. ARMv7-A cores: Cortex-A15, Cortex-A9 ... ARMv8-A cores: Cortex-A72, Cortex-A57, Cortex-A53]


ARMv8-A overall description

Architecture profiles: A – application / R – real-time / M – microcontroller

ARMv8-A – AArch64 execution state:

• 31 general-purpose (GP) registers: 64-bit X0-X30 (32-bit access through W0-W30)

• No banking of GP registers

• Stack pointer (SP) is a specific register (one per Exception Level)

• Program counter is not a GP register

• Support for floating point and Advanced SIMD (32 registers of 128 bits)

• PSTATE register (e.g., ALU flags, exception masks)

• System register access

– MRS x2, sp_el3


ARMv8-A instruction set

mov x16, #0x10 => Write the immediate value 0x10 into register x16

ldr x4, [x21] => Read the memory location pointed to by x21 and put the value in x4

str x5, [x11] => Write the value contained in x5 to the memory location pointed to by x11

cmp x0, #0x20 => Compare the value contained in x0 with 0x20

beq _label => If the compared values were equal, branch to _label

bl function => Branch to a function, saving the return address in the link register (x30)

lsl x18, x4, #2 => Shift the value contained in x4 left by 2 bits and put the result in x18

and x6, x2, x4 => Do a bitwise AND between x2 and x4 and put the result in x6

orr x0, x1, x2 => Do a bitwise OR between x1 and x2 and put the result in x0


ARMv8-A exception level

Exception level changes happen through specific instructions: SVC, HVC, SMC and ERET.

The Secure world is completely isolated (memory, devices, etc.) from the Normal world by the ARM TrustZone security extensions. Since TrustZone is implemented in hardware, it reduces the security vulnerabilities. The Secure world can be used to run a secure OS that provides secure services to the OS running in the Normal world.

ARM Virtualization extensions address the needs of devices for the partitioning and management of complex software environments into virtual machines.

The Normal world can concurrently run another OS (e.g., Linux) without impacting the secure OS.

The monitor layer is the highest privilege level; it provides a bridge between the two worlds to allow some interactions.


ARMv8-A features: ARM TrustZone

[Figure: TrustZone software stack on shared hardware. Normal world: Rich OS applications on a Rich OS, with normal HW resources and peripherals. Secure world: secure applications on a Safety/Secure OS, with secure HW resources and peripherals. Secure monitor firmware and shared memory sit between the two worlds.]

TrustZone splits the core into two compartments (Normal world / Secure world)

Secure monitor firmware (EL3) is needed to support context switching between worlds

Each compartment has access to its own MMU, allowing the isolation of Secure and Normal translation tables.

The cache has tag bits to distinguish content cached by the Secure world from content cached by the Normal world.

Security information is propagated on the AXI/AHB bus

Memory and peripherals can also be made secure

Provides secure interrupts


ARMv8-A features: virtualization extension

[Figure: virtual machines running on top of a hypervisor (EL2)]

The ARMv8-A architecture includes hardware virtualization extensions and the Large Physical Address Extension (LPAE) to support the efficient implementation of virtual machine hypervisors:

• Dedicated exception level (EL2) for the hypervisor.

• Full virtualization: an OS can run in a virtual machine without any modification.

• Combination of hardware features to minimize the need for hypervisor intervention.

Some hypervisors compliant with the ARM architecture:

• Linux-KVM

• XEN


ARMv8-A features: Memory Management

[Figure: ARM core connected to the MMU (TLBs, page tables), the caches and the memory]

The MMU handles translation of virtual addresses to physical addresses.

The address translation is performed through the TLB (Translation Look-aside Buffer) or a table walk.

[Figure: virtual address space. TTBR0 covers user space, TTBR1 covers kernel space, and the range in between is not mapped (MMU fault).]

• AArch64 supports up to 48 bits of virtual address

• All ELs have independent MMU configuration

• The page table supports different translation granules

• Each page table entry carries attributes:

– Access permissions (Read/Write - User/Privileged modes)

– Memory types (Caching/Buffering rules, Shareable, etc.)
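To make the TTBR0 / TTBR1 split and the table walk more concrete, here is a minimal sketch that decodes a virtual address. It assumes a 48-bit address space on both halves and the 4 KB translation granule (four levels of 9-bit indexes above a 12-bit page offset); other granules and address sizes use different bit fields. All names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

enum va_region { VA_TTBR0_USER, VA_TTBR1_KERNEL, VA_NOT_MAPPED };

/* With 48-bit virtual addresses, the top 16 bits select the translation
 * table base: all zeros -> TTBR0, all ones -> TTBR1, anything else faults. */
static enum va_region va_region_of(uint64_t va)
{
    uint64_t top = va >> 48;
    if (top == 0x0000)
        return VA_TTBR0_USER;
    if (top == 0xFFFF)
        return VA_TTBR1_KERNEL;
    return VA_NOT_MAPPED;
}

/* Index into the translation table at a given level (0..3), 4 KB granule:
 * level 0 uses bits [47:39], level 3 uses bits [20:12]. */
static unsigned va_table_index(uint64_t va, unsigned level)
{
    assert(level <= 3);
    unsigned shift = 12 + 9 * (3 - level);
    return (unsigned)((va >> shift) & 0x1FF);
}

/* Offset inside the 4 KB page, bits [11:0]. */
static unsigned va_page_offset(uint64_t va)
{
    return (unsigned)(va & 0xFFF);
}
```

A table walk reads one entry per level using these indexes; a TLB hit skips the walk entirely.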


ARMv8-A features: Cache memory

[Figure: main memory words mapped into a 2-way cache (Cache Memory way 0 and Cache Memory way 1), each way with Index 0 to Index 3]

Main memory contents shown in the figure:

Address   Value
0x00      0xDEADBEFF
0x04      0xDEB0CAD0
0x08      0xBABA0000
0x0C      0xFEFEFEFE
0x10      0x00000000
0x14      0x01234567
0x18      0xDADAD1D1

Cortex-A53

L1 cache

• Instruction and data caches are separate

• Instruction: 2 ways / Data: 4 ways

• Size 8KB to 64KB, cache line length 64 bytes

• L1 cache access => ~1 cycle

L2 cache

• 16-way set associative

• Size 128KB to 2MB

• Cache line length 64 bytes

• L2 cache access => ~10 cycles
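As a rough illustration of how a set-associative cache locates a line, the sketch below derives the number of sets, the set index and the tag from a cache geometry. The struct and function names are made up for this example, and the 32KB / 4-way / 64-byte geometry in the usage note is just one configuration within the ranges listed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a set-associative cache. */
struct cache_geom {
    uint32_t size_bytes; /* total cache size */
    uint32_t ways;       /* associativity */
    uint32_t line_bytes; /* cache line length */
};

/* Number of sets = size / (ways * line length). */
static uint32_t cache_num_sets(struct cache_geom c)
{
    assert(c.ways > 0 && c.line_bytes > 0);
    return c.size_bytes / (c.ways * c.line_bytes);
}

/* The set index comes from the address bits just above the line offset. */
static uint32_t cache_set_index(struct cache_geom c, uint64_t addr)
{
    return (uint32_t)((addr / c.line_bytes) % cache_num_sets(c));
}

/* The remaining upper bits form the tag compared in each way of the set. */
static uint64_t cache_tag(struct cache_geom c, uint64_t addr)
{
    return addr / ((uint64_t)c.line_bytes * cache_num_sets(c));
}
```

For a 32KB, 4-way cache with 64-byte lines this gives 128 sets, so address 0x12345 falls in set 13 with tag 9; the line may then sit in any of the 4 ways of that set.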


ARMv8-A features: Interrupt management

[Figure: external sources feed the interrupt distributor of the interrupt controller; one CPU interface per core delivers the IRQ and FIQ signals to CPU 0 and CPU 1]

ARM provides a Generic Interrupt Controller (GIC) which supports routing of software generated, private peripheral and shared peripheral interrupts between cores. It is composed of:

• Distributor: all interrupt sources are connected to it. It controls the type, priority and state of each interrupt, and the core targeted through the CPU interface.

• CPU interface: the path through which a core receives an interrupt. The CPU interface provides the ability to mask, identify and control the state of interrupts.

ARM processors include two types of interrupts:

– Fast Interrupt Request (FIQ) has the highest priority. Some banked registers are allocated to the FIQ handler (in AArch32 state). FIQ could be used for secure applications.

– General Interrupt Request (IRQ)
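The three interrupt kinds routed by the GIC are told apart by their interrupt ID. The following sketch classifies an ID using the standard ranges (SGI 0-15, PPI 16-31, SPI 32-1019, with 1020-1023 reserved for special purposes); it ignores the extra ranges added by later GIC versions, and the enum and function names are only illustrative.

```c
#include <assert.h>

enum gic_int_class {
    GIC_SGI,     /* software generated interrupt, IDs 0-15 */
    GIC_PPI,     /* private peripheral interrupt, IDs 16-31 */
    GIC_SPI,     /* shared peripheral interrupt, IDs 32-1019 */
    GIC_SPECIAL  /* special-purpose IDs 1020-1023 (e.g., spurious) */
};

/* Classify a GIC interrupt ID in the range 0..1023. */
static enum gic_int_class gic_classify(unsigned id)
{
    assert(id <= 1023);
    if (id <= 15)
        return GIC_SGI;
    if (id <= 31)
        return GIC_PPI;
    if (id <= 1019)
        return GIC_SPI;
    return GIC_SPECIAL;
}
```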


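ARMv8-A Vector Table (cntd)

ARMv8 vector table

Offset   Exception type      Exception source
0x780    SError / vSError    Lower EL using AArch32
0x700    FIQ / vFIQ          Lower EL using AArch32
0x680    IRQ / vIRQ          Lower EL using AArch32
0x600    Synchronous         Lower EL using AArch32
0x580    SError / vSError    Lower EL using AArch64
0x500    FIQ / vFIQ          Lower EL using AArch64
0x480    IRQ / vIRQ          Lower EL using AArch64
0x400    Synchronous         Lower EL using AArch64
0x380    SError / vSError    Current EL with SPx
0x300    FIQ / vFIQ          Current EL with SPx
0x280    IRQ / vIRQ          Current EL with SPx
0x200    Synchronous         Current EL with SPx
0x180    SError / vSError    Current EL with SP0
0x100    FIQ / vFIQ          Current EL with SP0
0x080    IRQ / vIRQ          Current EL with SP0
0x000    Synchronous         Current EL with SP0

Lower EL using AArch32: exception generated in a lower EL running in AArch32 and routed to a higher EL

Lower EL using AArch64: exception generated in a lower EL running in AArch64 and routed to a higher EL

Current EL with SPx: exception directly caught in the current EL with SP_ELx

Current EL with SP0: exception directly caught in the current EL with SP_EL0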
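The table layout is regular: each exception source owns a 0x200-byte group and each exception type a 0x80-byte slot inside it. A small sketch of that arithmetic follows; the enum and function names are illustrative, and the alignment check reflects the 2 KB alignment required of the vector base address.

```c
#include <assert.h>
#include <stdint.h>

/* Exception source groups, in table order (each group is 0x200 bytes). */
enum vec_source {
    CUR_EL_SP0   = 0, /* current EL with SP_EL0 */
    CUR_EL_SPX   = 1, /* current EL with SP_ELx */
    LOWER_EL_A64 = 2, /* lower EL using AArch64 */
    LOWER_EL_A32 = 3  /* lower EL using AArch32 */
};

/* Exception types inside a group (each slot is 0x80 bytes). */
enum vec_type { VEC_SYNC = 0, VEC_IRQ = 1, VEC_FIQ = 2, VEC_SERROR = 3 };

/* Offset of a vector from the table base. */
static uint64_t vector_offset(enum vec_source src, enum vec_type type)
{
    return (uint64_t)src * 0x200 + (uint64_t)type * 0x80;
}

/* Address the core branches to, given the value of VBAR_ELn. */
static uint64_t vector_address(uint64_t vbar, enum vec_source src, enum vec_type type)
{
    assert((vbar & 0x7FF) == 0); /* the table base is 2 KB aligned */
    return vbar + vector_offset(src, type);
}
```

For example, an IRQ taken from a lower EL running in AArch64 lands at offset 0x480, matching the table.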


ARMv8-A Vector Table

Separate vector tables for each exception level. The location is defined in the VBAR_ELn register.

Synchronous exception

• Aborts from MMU

• SP & PC alignment fault

• Undefined instruction

• Service calls: SVC, SMC, HVC

SError => Asynchronous data abort (e.g., abort triggered by writeback of a dirty cache line)

[Figure: ARMv8 vector table, offsets 0x000 to 0x780, with four entries (Synchronous, IRQ / vIRQ, FIQ / vFIQ, SError / vSError) for each of: Current EL with SP0, Current EL with SPx, Lower EL using AArch64, Lower EL using AArch32]

Information registers for exceptions:

ESR_ELx => Includes information about the reason for the exception

FAR_ELx => Holds the faulting address

ELR_ELx => Holds the address of the instruction which caused the data abort.
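As an illustration of what ESR_ELx contains, the sketch below splits a syndrome value into its main fields: the exception class (EC, bits [31:26]), the instruction length bit (IL, bit 25) and the instruction-specific syndrome (ISS, bits [24:0]). Only a handful of exception classes matching the synchronous causes listed above are named; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Exception class, bits [31:26]. */
static unsigned esr_ec(uint64_t esr)  { return (unsigned)((esr >> 26) & 0x3F); }
/* Instruction length bit, bit 25 (1 = 32-bit instruction). */
static unsigned esr_il(uint64_t esr)  { return (unsigned)((esr >> 25) & 0x1); }
/* Instruction-specific syndrome, bits [24:0]. */
static uint32_t esr_iss(uint64_t esr) { return (uint32_t)(esr & 0x1FFFFFF); }

/* Human-readable name for a few exception classes. */
static const char *esr_ec_name(unsigned ec)
{
    assert(ec <= 0x3F);
    switch (ec) {
    case 0x00: return "Unknown reason";
    case 0x15: return "SVC from AArch64";
    case 0x16: return "HVC from AArch64";
    case 0x17: return "SMC from AArch64";
    case 0x20: return "Instruction abort, lower EL";
    case 0x21: return "Instruction abort, same EL";
    case 0x22: return "PC alignment fault";
    case 0x24: return "Data abort, lower EL";
    case 0x25: return "Data abort, same EL";
    case 0x26: return "SP alignment fault";
    case 0x2F: return "SError";
    default:   return "Other";
    }
}
```

A handler would typically read ESR_ELx first, then consult FAR_ELx only for the abort and alignment classes where a faulting address is meaningful.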


Part 5: Linux kernel Introduction


Interaction Linux – User

Hundreds of kernel modules (and device drivers) are included in the upstream version of Linux

Thousands are used in everyday life (embedded in all sorts of devices)

Linux uses the file system to allow interaction between a Linux user (process) and the module (kernel)

Kernel and users are different; interaction between the two has to be kept to a minimum

– Obvious security problems arise if a user process can freely access the kernel structures

– User activity shouldn't affect kernel execution (crashing the kernel means halting all the processes running on the machine)


Interaction Linux – User

Two main types of interaction:

– Read/Write to the device node in /dev/ - This is commonly the case for modules that expose kernel information to be retrieved from the shell

Example: cat /dev/kmsg

or for some file-system oriented uses:

Example: dd if=/dev/zero of=./zeroed_file bs=1MB count=4

– IOCTLs – A module can export some functions that can be called from a user application implemented in C/C++, or in any programming language that supports system calls.

Example: ioctl(fd, KVM_CREATE_VCPU, (void *)vcpu_id);
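The read/write style of interaction can also be driven from a C program rather than the shell. The sketch below does roughly what the dd example does: it reads fixed-size blocks from one node and writes them to a file. The function name copy_blocks is made up for this example and error handling is kept minimal.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy up to `count` blocks of `bs` bytes from `src` to `dst`, much like
 * `dd if=src of=dst bs=... count=...`. Returns bytes copied, or -1 on error. */
static long copy_blocks(const char *src, const char *dst, size_t bs, size_t count)
{
    assert(bs > 0);
    char *buf = malloc(bs);
    if (buf == NULL)
        return -1;

    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    long total = -1;

    if (in >= 0 && out >= 0) {
        total = 0;
        for (size_t i = 0; i < count; i++) {
            ssize_t n = read(in, buf, bs);       /* served by the driver's read() */
            if (n <= 0)
                break;
            if (write(out, buf, (size_t)n) != n) { /* short write treated as error */
                total = -1;
                break;
            }
            total += n;
        }
    }
    if (in >= 0)
        close(in);
    if (out >= 0)
        close(out);
    free(buf);
    return total;
}
```

Calling copy_blocks("/dev/zero", "./zeroed_file", 1000000, 4) would mirror the dd command line shown above.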


File operations

Both kinds of interaction use the device node as the entry point

These files are defined in the kernel module. When the kernel module is loaded, the file appears in /dev/

Every file is associated with a set of methods (operations) that are all defined by the structure struct file_operations. This structure defines the prototypes of methods like:

– int (*open) (struct inode *, struct file *);

– ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);

– ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);

– long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
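The idea behind struct file_operations is a table of function pointers that the kernel calls on behalf of the process. The following is a user-space analogy only, so that it can be compiled and run as a normal program: every demo_* name is hypothetical, and the signatures are simplified compared with the kernel prototypes listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "device" backing the demo operations. */
struct demo_file {
    char data[64];
    size_t len;
    int opened;
};

/* Simplified analogue of struct file_operations: one pointer per operation. */
struct demo_file_ops {
    int  (*open)(struct demo_file *f);
    long (*read)(struct demo_file *f, char *buf, size_t len);
    long (*write)(struct demo_file *f, const char *buf, size_t len);
};

static int demo_open(struct demo_file *f)
{
    f->opened = 1;
    f->len = 0;
    return 0;
}

/* Store the caller's bytes in the device buffer (truncated to its size). */
static long demo_write(struct demo_file *f, const char *buf, size_t len)
{
    assert(f->opened);
    if (len > sizeof(f->data))
        len = sizeof(f->data);
    memcpy(f->data, buf, len);
    f->len = len;
    return (long)len;
}

/* Copy the stored bytes into the caller's buffer. */
static long demo_read(struct demo_file *f, char *buf, size_t len)
{
    assert(f->opened);
    if (len > f->len)
        len = f->len;
    memcpy(buf, f->data, len);
    return (long)len;
}

/* The "driver" publishes its operations through this table. */
static const struct demo_file_ops demo_ops = {
    .open  = demo_open,
    .read  = demo_read,
    .write = demo_write,
};
```

In a real module the table is a struct file_operations instance, and the read and write handlers must copy data across the user/kernel boundary instead of using memcpy directly.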


File operations

open(): Called when a process attempts to open the file. This function is always called before any other operation is done on the file

read(): Called when a process attempts to read from the file. The kernel passes the buffer into which the data has to be written as an argument

write(): Called when a process attempts to write to the file. The kernel passes the pointer to the buffer containing the data to be written as an argument

unlocked_ioctl(): Called whenever a process issues an ioctl() call on the device file descriptor. For instance:

int fd = open("/dev/slm", O_RDWR);

ioctl(fd, ID_OF_IOCTL, args ...);
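Since /dev/slm and ID_OF_IOCTL belong to a specific module, a runnable illustration needs a request that exists everywhere. The sketch below uses the standard FIONREAD request, which asks the kernel how many bytes are waiting to be read on a descriptor; the helper name bytes_pending is made up for this example.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask the kernel, through ioctl(), how many bytes can be read from fd
 * right now. Returns the byte count, or -1 if the request fails. */
static int bytes_pending(int fd)
{
    assert(fd >= 0);
    int n = 0;
    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

The call has the same shape as the module example: a file descriptor, a request identifier, and an argument whose meaning depends on the request.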


Char device

There are several types of devices in Linux: block devices, network devices, char devices, etc.

Usually, all the modules that are not bound to a specific physical device are char devices

– These devices are the most convenient way to exchange information between the kernel and the user space

– They are called character devices because the file in /dev/ belonging to the device is used to write and read characters

– In these devices, the read(), write(), seek(), etc. operations are not handled by any file system (e.g., EXT4) because they just read from and write to a buffer


Anatomy of a Linux module

The entry point of a Linux module is set with module_init(init_fn). init_fn() takes care of:

• Initialization of buffers

• Definition of the device ID (for a char device, Major number and Minor number)

– It can be done automatically using alloc_chrdev_region()

• Creation and initialization of the device node in /dev/ with class_create() and device_create()

– Alternatively, this can be done in user space with the mknod command

• Creation of the actual device, since the kernel has to know which subsystem is associated with a given device

– The cdev_* functions (e.g., cdev_init(), cdev_add()) implement everything necessary for this task
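The Major and Minor numbers chosen at init time are visible from user space on the device node itself. As a small sketch, the helper below (a hypothetical name, not part of any API) reads them back with stat() and the major() / minor() macros; on Linux, /dev/zero is conventionally the char device with Major 1 and Minor 5.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* If `path` is a character device node, store its Major and Minor numbers
 * and return 1; otherwise return 0. */
static int char_dev_numbers(const char *path, unsigned *mjr, unsigned *mnr)
{
    assert(path != NULL && mjr != NULL && mnr != NULL);
    struct stat st;
    if (stat(path, &st) != 0 || !S_ISCHR(st.st_mode))
        return 0;
    *mjr = major(st.st_rdev); /* identifies the driver */
    *mnr = minor(st.st_rdev); /* identifies the device instance */
    return 1;
}
```

The same pair is what the mknod command expects when the node is created by hand instead of through device_create().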


Anatomy of a Linux module

The exit point of a Linux module is set with module_exit(exit_fn). exit_fn() takes care of:

• Freeing of buffers

• Device and class removal

– It can be done using device_destroy() and class_destroy()

• Release of the char device Major and Minor numbers (in case of a char device)

– unregister_chrdev_region() serves this purpose

• Deletion of the device

– In case of a char device: cdev_del()
