TRANSCRIPT
-
September 29, 2020 Sam Siewert
CEC 450 Real-Time Systems
Lecture 6 – Accounting for I/O Latency
-
Upcoming Events
Quiz #2 – Part of your preparation for Exam #1
Exercise #3 – final exercise before Exam #1
Study for and take Exam #1 on October 15th, anytime between 8am and 11:59pm
Sam Siewert 2
-
Exam #1
Practice with Quiz #2, October 8th to October 13th, before class
Thursday, October 15th, anytime that day – start during class time, with access to NUC systems or remotely with your own system, and finish up anytime later that day before the day ends
Exam is available from 11:30 am until 11:59 pm
Should take 1.25 hours, and at most 2.5 hours, to complete; however, you have the full day
-
Code and Timing Analysis Walkthroughs
POSIX RT Threads – Simple Thread, Improved Thread
POSIX RT Examples – Clock, Linux demo, Message Queues
RT Clock
Better, More Generic Sequencer [sequencer_generic]
Synch
Videos on Code and TA - cec450/video/Coursera-rough-cuts/
http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/rt_simplethread/
http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/rt_thread_improved/
http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/POSIX-Examples/
http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/RT-Clock/
http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/sequencer_generic/
http://mercury.pr.erau.edu/%7Esiewerts/cec450/code/example-sync-updated-2/pthread3.c
http://mercury.pr.erau.edu/%7Esiewerts/cec450/video/Coursera-rough-cuts/
-
A Service Release and Response
Ci = WCET, plus Input/Output Latency and Interference Time
[Figure: timeline from the sensed event through interrupt, dispatch, preemption, and re-dispatch – showing input latency, dispatch latency, execution, interference, completion (I/O queued), output latency, and actuation (I/O completion).]
Response Time = Time_Actuation – Time_Sensed (from release to response)
-
Service Response Timeline (With Intermediate Blocking)
[Figure: two stacked timelines – the baseline runs from the sensed event through interrupt, dispatch, execution, preemption/interference, completion (I/O queued), output latency, and actuation (I/O completion); the second adds a BLOCKING interval in which the service yields mid-execution and must be dispatched again before completing.]
Response Time = Time_Actuation – Time_Sensed (from event to response)
-
Service Response Timeline – WCET (With Intermediate I/O – Impacts BCET to Cause WCET)
[Figure: the same response timeline, but with MMIO stalls during execution – intermediate I/O stretches the best-case execution time toward the worst case.]
Response Time = Time_Actuation – Time_Sensed (from event to response)
-
Services and I/O
Initial Input I/O from Sensors
– Low-Rate Input: Byte or Word Input from Memory-Mapped Registers or I/O Ports
– High-Rate Input: Block Input from DMA Channels and Block Reads from FIFOs
Final Response I/O to Actuators
– Low-Rate Output: Byte or Word Writes Posted to Output FIFO or Bus Interface Queue
– High-Rate Output: Block Output on DMA Channels and Block Writes
Intermediate I/O
– During Service Execution, Memory-Mapped Register I/O
– External Memory I/O
– Cache Loads and Write-Backs
– In-Memory Input/Output from Service to Service
-
Service Integration Concepts
Initial Sensor Input and Final Actuator Response Output
– Device Interface Drivers
A Simple Service Has No Intermediate I/O Latency and No Synchronization
[Figure: closed-loop block diagram – the environment is observed by sensor electronics through a sensor input interface into a digital control law service, which drives an actuator output interface and actuator electronics acting on effectors in the environment.]
-
Complex Multi-Service Systems
Multiple Services
Synchronization Between Services
Communication Between Services
Multiple Sensor Input and Actuator Output Interfaces
Intermediate I/O, Shared Memory, Messaging
-
Multi-Service Pipelines
[Figure: real-time video pipeline spanning the HW/SW boundary. Data plane: a Bt878 PCI NTSC decoder (30 Hz) raises a frameRdy interrupt to release tBtvid (30 Hz), which fills an RGB32 buffer; tFrmDisp (10 Hz) produces a grayscale buffer; tFrmLnk (2 Hz) and tNet (2 Hz) send frames as TCP/IP packets over Ethernet to a remote display. Control plane with event releases / service control: tOpnav (15 Hz), tTlmLnk (5 Hz), tThrustCtl (15 Hz), and tCamCtl (3 Hz) at priorities P0–P3; Serial A/B and ISA Ethernet interfaces for RT management.]
-
Pipelined Architecture Review
Recall that a pipeline yields a CPI of 1 or less – an instruction is completed each CPU clock, unless the pipeline stalls!
[Figure: six overlapped instructions, each advancing through IF, ID, Execute, and Write-Back stages one clock apart.]
-
Service Execution
Efficiency of Execution
Memory Access for Inter-Service Communication
Bounded Intermediate I/O
Ideally Lock Data and Code into Cache or Use Tightly Coupled Memory [Scratch Pad, Zero Wait State]
WCET = [(CPI_best-case × Longest_Path_Inst_Count) + Stall_Cycles] × Clk_Period
-
Service Efficiency
Path Length for a Release
– Instruction Count
– Longest Path Given Algorithm: Branches, Loop Iterations, Data Driven?
Path Execution
– Number of Cycles to Complete Path
– Clocks Per Instruction
– Number of Stall Cycles for Pipeline: Data Dependencies (Intermediate I/O), Cache Misses (Intermediate Memory Access), Branch Mis-predictions (Small Penalty)
-
Hiding Intermediate I/O Latency (Overlapping CPU and Bus I/O)
ICT = Instruction Count Time
– (CPI=1) × Inst_Cnt × CPU_Clk_Period
– Time to execute a block of instructions with no stalls, hence CPI=1
– CPU Clk Cycles × CPU Clock Period
IOT = Bus Interface I/O Time
– Bus I/O Cycles × Bus Clock Period
– Bus Clock Period / CPU_Clk_Period = Wait Clocks
OR = Overlap Required
– Percentage of CPU cycles that must be concurrent with I/O cycles
NOA = Non-Overlap Allowable for Si to meet Di
– Percentage of CPU cycles that can be in addition to I/O cycle time without missing the service deadline
Di = Deadline for Service Si relative to release
– an interrupt or system call initiates the Si request and Si execution
CPI = Clocks Per Instruction for a block of instructions
– IPC (Instructions Per Clock) is also used; just the inverse (superscalar, pipelined)
-
Processing and I/O Time Overlap
If Di < IOT, Si is I/O-bound
If Di < ICT, Si is CPU-bound
If Di < (IOT + ICT) where (Di < IOT AND Di < ICT), deadline Di can't be met regardless of overlap – both I/O-bound and CPU-bound, so hardware re-design is REQUIRED
Di >= (IOT + ICT) requires no overlap of IOT with ICT – no code I/O optimization required
If Di < (IOT + ICT) where (Di >= IOT and Di >= ICT), overlap of IOT with ICT is required – code I/O optimization required
CPI_ICT = 1 by definition for ICT alone (no stalls, best-case efficiency)
CPI_actual = f_efficiency × CPI_ICT
-
Service Execution Efficiency – Waiting on I/O from MMIO Bus and Memory
CPI_worst-case = (ICT + IOT) / ICT
CPI_best-case = max(ICT, IOT) / ICT
CPI_required = Di / ICT
OR = 1 – [(Di – IOT) / ICT]
CPI_required = [ICT(1 – OR) + IOT] / ICT
NOA = (Di – IOT) / ICT; OR + NOA = 1 (by definition)
[Figure: bar diagrams of ICT and IOT – laid end-to-end with no overlap (worst case) versus overlapped, with the OR fraction of ICT concurrent with IOT (best case).]
-
CPI Units
CPI_required = Di / ICT = Time/Time?
– Di = Clocks × Clock_T = Time [N × X nanoseconds]
– ICT = (CPI=1) × Num_instructions × Clock_T = Time [N × X nanoseconds]
Di / ICT = (Clocks × Clock_T) / (1 × Num_instructions × Clock_T)
– Clocks / Num_instructions = CPI_required over deadline Di
-
Synchronization and Message Passing
Message Queues
– Provide Communication
– Provide Synchronization
Traditional Message Queue
– Message Size
– Internal Fragmentation
Heap Message Queue
– Messages Are Pointers
– Message Data in Heap
[Figure: service S1 sends a message to S2 – in the traditional queue the message itself is copied through the queue; in the heap queue S1 allocates the message from a heap buffer pool, passes only a pointer, and S2 de-allocates the message back to the heap.]
ENEA OSE
• Message Passing RTOS
• Advertised Advantages
http://www.enea.com/solutions/rtos/ose/
http://www.enea.com/Documents/Resources/Whitepapers/Advantages%20of%20Enea%20OSE%20-%20The%20Architectural%20Advantages%20of%20%20021061.pdf
-
What About Storage I/O
Much Higher Latency than Off-Chip Memory or Bus MMIO
Milliseconds of Latency Compared to GHz Core Clock Rates – 1 million to 1!
Flash Is Better, But Still Slower than SRAM or DRAM
Option #1 – Avoid Disk Drives and Flash I/O
Option #2 – Pre-Buffer Storage I/O (Read-Ahead Cache), Post Write-backs (Write-Back Cache), MFU/MRU (Most Frequently Used / Most Recently Used) Kept in Read Cache
-
Parallel Processing Speed-up (Makes I/O Latency Worse?)
Grid Data Processing Speed-up
1. Multi-Core, Multi-threaded, Macro-blocks/Frames
2. SIMD, Vector Instructions Operating over Large Words (Many Times Instruction Set Size)
3. Co-Processor Operates in Parallel to CPU(s)
SPMD – GPU or GP-GPU Co-Processor
– PCI-Express Bus Interfaces
– Transfer Program and Data to Co-Processor
– Threads and Blocks to Transform Data Concurrently
Image Data Processing – Few Data Dependencies
– Good Speed-up by Amdahl's Law
– P = Parallel Portion
– (1-P) = Sequential Portion
– S = # of Cores (Concurrency)
– Overhead for Co-Processor
– I/O for Co-Processing
Speed_Up_Multicore = 1 / [(1 – P) + (P / S)]
Max_Speed_Up = 1 / [(1 – P) + 0] (S is infinite here, so P/S → 0)
-
Amdahl's Law – Infinite Cores
Maximum Speed-up Driven by Sequential and Parallel Portions of Program
– P = Parallel Portion
– (1-P) = Sequential Portion
– Speed-up for a Given Multi-core Architecture Is a Function of # of Cores (Speed-up in Parallel Portions)
[Figure: Amdahl's Law maximum speed-up (any number of processor cores) plotted against the sequential portion of computation – no parallel portion means all sequential (no speed-up); all code parallel gives infinite speed-up; 95% parallel gives 20x speed-up.]
-
Multi-Core Speed-Up
[Figure: Amdahl's Law speed-up versus sequential portion of the algorithm for 2, 4, 8, 12, and 32 cores alongside the infinite-core maximum; a 95% parallel program is marked.]
-
Hiding Storage I/O Latency – Overlapping with Processing
Simple Design – Each Thread Has READ, PROCESS, WRITE-BACK Execution
Frame rate is READ+PROCESS+WRITE latency – e.g. 10 fps for 100 milliseconds
– If READ is 70 msec, PROCESS is 10 msec, and WRITE-BACK is 20 msec, the predominant time is I/O time, not processing
– A disk drive with a 100 MB/sec READ rate can only read 16 fps, with 62.5 msec READ latency
[Timeline: READ F(1) → Process F(1) → Write-back F(1) → READ F(2) → …]
-
Hiding Storage I/O Latency
Schedule Multiple Overlapping Threads?
Requires N_threads = N_stages × N_cores
1.5 to 2x the Number of Threads for SMT (Hyper-threading)
For I/O Stage Duration Similar to Processing Time
More Threads if I/O Time (Read+WB+Read) >> 3 × Processing Time
[Figure: per-core schedules – on each core, three threads stagger READ, Process, and Write-back across frames (F1, F2, F3, then F4, F5, F6, …) so that after a start-up transient each core is processing continuously.]
-
Hiding Latency – Dedicated I/O
Schedule Reads Ahead of Processing
Requires N_threads = 2 + N_cores
Synchronize Frame Ready/Write-backs
Balance Stage Read/Write-Back Latency to Processing
1.5 to 2x Threads for SMT (Hyper-threading)
[Figure: a dedicated reader thread reads F1…F8 ahead; two processing threads alternate frames (F1, F3, F5, … and F2, F4, F6); a dedicated write-back thread follows with WB F1…WB F6 – dual-core concurrent processing after start-up.]
-
Processing Latency Alone
Write Code with Memory-Resident Frames
– Load Frames in Advance
– Process In-Memory Frames Over and Over
– Do No I/O During Processing
– Provides a Baseline Measurement of Processing Latency per Frame Alone
– Provides a Method of Optimizing Processing Without I/O Latency
-
I/O Latency Alone
Comment Out Frame Transformation Code or Call a Stubbed NULL Function
– Provides a Measurement of I/O Frame Rate Alone
– Essentially a Zero-Latency Transform
– No Change Between Input Frames and Output Frames
– Allows for Tuning of the I/O Scheduler and Threading
-
Tips for I/O Scheduling
blockdev --getra /dev/sda
– Should return 256 (sectors), meaning reads read ahead up to 128K
– Function calls read, fread should request as much as possible
– Check "actual bytes read", re-read as needed in a loop
blockdev --setra 16384 /dev/sda (sets 8 MB read-ahead)
Switch CFQ to Deadline
– Use "lsscsi" to verify your disk is /dev/sda … substitute the block driver interface used for the file system if not sda
– cat /sys/block/sda/queue/scheduler
– echo deadline > /sys/block/sda/queue/scheduler
– Options are noop, cfq, deadline
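Putting the tips together as one root-shell sequence; a sketch under the slide's assumptions – your disk may not be /dev/sda (check with lsscsi first), and on newer kernels using blk-mq the scheduler names differ (e.g. mq-deadline instead of deadline):

```shell
# Inspect and raise the block-device read-ahead (units are 512-byte sectors)
blockdev --getra /dev/sda                        # default 256 -> 128K read-ahead
blockdev --setra 16384 /dev/sda                  # 16384 sectors -> 8 MB read-ahead

# Switch the I/O scheduler from CFQ to deadline
cat /sys/block/sda/queue/scheduler               # shows e.g. "noop [cfq] deadline"
echo deadline > /sys/block/sda/queue/scheduler   # active scheduler is now deadline
```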