Post on 22-Dec-2015
ExtraVirt: Detecting and recovering from transient processor faults
Dominic Lucchetti, Steve Reinhardt, Peter Chen
University of Michigan

Flips Happen
Similar die area + decreasing transition energy = increasing risk of transient failure

Multi-Processors & Virtual Machine
Multi-Processor
  - Ensure error independence
  - Enable fault detection
  - Efficient resource sharing
Virtual Machine
  - No changes to OS or applications
  - VM replay
    - Synchronize replicas
    - Recover correct state
[Diagram labels: Replica 1, Replica 2, Hypervisor, Device Drivers, Replication Management Layer (RML)]
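
As a rough illustration of how replay could keep replicas synchronized and how comparing them exposes a fault, the sketch below feeds one logged input sequence to two toy replicas and compares a digest of their state. Everything here (`Replica`, `run_epoch`, `flip_bit`, the single `acc` register) is a hypothetical stand-in, not ExtraVirt's actual design.

```python
import hashlib


class Replica:
    """Toy stand-in for a VM replica; its whole state is one register (assumption)."""

    def __init__(self):
        self.state = {"acc": 0}

    def step(self, value):
        # Deterministic transition: the same inputs always produce the same state.
        self.state["acc"] = (self.state["acc"] * 31 + value) % 1_000_003

    def digest(self):
        # Fingerprint of the state, used to compare replicas cheaply.
        return hashlib.sha256(repr(sorted(self.state.items())).encode()).hexdigest()


def run_epoch(replica, input_log):
    """Replay the logged inputs into a replica and return its state digest."""
    for value in input_log:
        replica.step(value)
    return replica.digest()


def flip_bit(replica, bit):
    """Simulate a transient fault by flipping one bit of the register."""
    replica.state["acc"] ^= 1 << bit


input_log = [3, 1, 4]          # inputs are logged once and replayed to both replicas
r1, r2 = Replica(), Replica()
in_sync = run_epoch(r1, input_log) == run_epoch(r2, input_log)

flip_bit(r2, 2)                # inject a fault into one replica only
fault_detected = r1.digest() != r2.digest()
```

Because both replicas see identical inputs, any digest mismatch points to a fault in one of them rather than to legitimate divergence.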

Example: Memory
Copy on write
  - Reduces overhead
  - Protects checkpoints
Merge on checkpoint
  - Verify correctness
  - Re-execute on deviation
Memory Fault Protection
  - ECC against RAM faults
  - MMU against CPU faults
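[Figure: memory pages A, B, C, E shown for the Checkpoint, Replica 1 and Replica 2. Page columns in the figure read "A B CD E", "A B CX E" and "A B C E"; after a Verify step, Replica 3 is shown with "A B CD E".]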
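
The copy-on-write, verify and re-execute steps above can be sketched roughly as follows. This is a toy model under stated assumptions: pages are dictionary entries, a "fault" is an injected bad write, and recovery runs a third replica and takes the result two replicas agree on. The names `CowMemory`, `run_replica` and `epoch` are hypothetical, and the majority step is one reading of the figure, not a confirmed description of ExtraVirt.

```python
class CowMemory:
    """Copy-on-write view over a shared checkpoint (hypothetical toy model)."""

    def __init__(self, checkpoint):
        self.checkpoint = checkpoint   # shared pages, never written by a replica
        self.dirty = {}                # private copies of pages this replica wrote

    def read(self, page):
        return self.dirty.get(page, self.checkpoint[page])

    def write(self, page, value):
        self.dirty[page] = value       # the checkpoint page itself stays intact


def run_replica(checkpoint, work, fault=None):
    """Run `work` on a fresh copy-on-write view and return the dirty pages."""
    mem = CowMemory(checkpoint)
    work(mem)
    if fault is not None:              # simulate a transient fault corrupting a page
        page, bad_value = fault
        mem.write(page, bad_value)
    return mem.dirty


def epoch(checkpoint, work, fault_in_replica_2=None):
    """Run two replicas, verify their dirty pages, and merge into a new checkpoint."""
    d1 = run_replica(checkpoint, work)
    d2 = run_replica(checkpoint, work, fault_in_replica_2)
    agreed = d1
    if d1 != d2:
        # Deviation: re-execute from the intact checkpoint (assumed majority vote).
        d3 = run_replica(checkpoint, work)
        agreed = d1 if d1 == d3 else d2 if d2 == d3 else None
        if agreed is None:
            raise RuntimeError("no two replicas agree")
    new_checkpoint = dict(checkpoint)
    new_checkpoint.update(agreed)      # merge only the verified dirty pages
    return new_checkpoint


def work(mem):
    mem.write("C", "D")                # the workload rewrites page C


checkpoint = {"A": "a", "B": "b", "C": "c", "E": "e"}
merged = epoch(checkpoint, work, fault_in_replica_2=("C", "X"))
```

Only dirty pages are copied and compared, which is where the reduced overhead comes from, and the old checkpoint is never modified in place, so it remains available as the restart point for re-execution.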

Status
Present
  - VM Replay
  - Beginnings of Replication Management Layer (RML)
  - Still much to do…
Future
  - Replicate the un-replicated
  - Handle faults in device drivers
  - Expanded fault model
[Diagram labels: Replica 1, Replica 2, Hypervisor/RML, Device Drivers]

Questions?