
If the CPU is so fast, why are the programs running so slowly?

CS 614 Lecture – Fall 2007 – Thursday September 20, 2007

By Jonathan Winter


Thurs. Sept. 20, 2007 – CS 614 – Online Profiling and Optimization

Introduction

Both papers discuss online profiling and optimization.

Main Goals:
• Gather data about the users’ actual experience with the system and software
• Improve application behavior without user involvement
• Identify performance bottlenecks in the real world
• Direct program optimization to alleviate these slowdowns

Challenges:
• A continuously running profiler must have low overhead
• Difficult to extract detailed information at runtime
• Lack of application-specific information in an online setting


Outline

1. Application Performance Basics

2. Studying Performance

3. Online Profiling

4. Program Optimization

5. Related Work and Background

6. The Digital Continuous Profiling Infrastructure (DCPI)

7. The Morph System

8. Comparison

9. Comments and Critique

10. Conclusions


Application Performance Basics

CPU Time = Instruction Count x CPI x Clock Cycle Time

Instruction Count = number of instructions in the program
• Reduced through compilation techniques or ISA changes

CPI = Cycles Per Instruction
• Improved through micro-architectural changes
• Affected by system-level factors such as I/O and memory accesses

Clock Cycle Time
• Frequency dependent on micro-architecture
• Driven by circuit design and electron device technology

CPI is primary focus of online profiling and optimization
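
The CPU time formula on this slide can be tried out with a short sketch. The workload numbers below (one billion instructions, a 333 MHz clock, CPI values of 1.0 and 2.5) are hypothetical and chosen only for illustration; they are not taken from either paper.

```python
def cpu_time_seconds(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """CPU Time = Instruction Count x CPI x Clock Cycle Time."""
    clock_cycle_time = 1.0 / clock_hz  # seconds per cycle
    return instruction_count * cpi * clock_cycle_time


# Hypothetical workload: same instruction count and clock, different CPI.
ideal = cpu_time_seconds(1_000_000_000, 1.0, 333e6)
stalled = cpu_time_seconds(1_000_000_000, 2.5, 333e6)
slowdown = stalled / ideal  # the run time grows in proportion to CPI
```

With instruction count and clock held fixed, any stall that raises CPI shows up directly as run time, which is why CPI is the quantity the profilers try to explain.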


Architectural View of Performance

Key tasks: get instructions, get data, and provide resources

Improve performance by:
• Avoiding control, data, and structural hazards
  o Control: branch prediction, prefetching, instruction caches, trace caches
  o Data: prefetching, data caches, load value prediction, load-store forwarding
  o Structural: more resources, result value forwarding
• Increased parallelism
  o instruction, thread, and memory level
• Reducing cycle time
  o pipelining, shorten stage length


Analyzing Performance – When?

Analysis can be done at different stages of development
Trade-off between ability to adapt and accuracy
Trade-off between application-specific and runtime knowledge


Analyzing Performance – How?

A number of mechanisms can be used.
• Static program analysis
• Simulation - full system or CPU cycle accurate
• Binary instrumentation
• Performance counters
• Operating system involvement

Major factors are:
• Accuracy vs. Speed vs. Coverage
• Overhead and behavior perturbation
• Ease of implementation


Online Profiling

Requires hardware and software support
• Processor must monitor and track hardware events
  o Performance counters have become the dominant method
• Operating system or application must access counters
  o Use special-purpose registers/memory space
  o Typically microprocessor vendors provide special libraries

Challenges:
• Poor portability across hardware platforms and OSes
• Continuous profiling requires low overhead
  o Gathering, moving, and processing data can have high cost
• Source code and application information not available
  o Makes analyzing performance bottlenecks difficult
• Profiling must be transparent to system users


Performance Optimization

Range of options
• Compiler level
• Binary rewriting
• Binary instrumentation
• Online optimization
• Hardware techniques

Benefits of Online Optimization
• Customize program to specific hardware, OS, and system
• Adaptive to user usage pattern and dynamic variation
• Optimize for common case
• Does not require user or application developer involvement


Related Work

DCPI and Morph claim to be the first online low-overhead profiling and optimizing tools

Most prior tools were not online and had high overhead.
• E.g. Pixie, jprof, gprof, ATOM, MTOOL, SimOS, quartz
• Relied on intrusive techniques
  o recompilation, binary instrumentation, simulation
• Required significant user intervention

Some used performance counters but lacked detail
• E.g. VTune sampler, iprobe, and Speedshop
• Memory demands prevented use for continuous profiling

Some used statistical sampling, e.g. prof and Speedshop


Profiling Systems Summary


Hardware Performance Counters

Most common counters track basic information
• cycle count, instructions executed, and program counter

More detailed counters track occurrences of the 3 hazard types
• E.g. branch mispredictions, cache misses, ALU contention

DEC Alpha 21164 has numerous hazard counters
• Can also track information about instruction types
• Pipeline stalls, # instructions issued, multiprocessor events

Major problem with counters: they are microarchitecture specific

2 research efforts provide cross-platform support
• Performance Counter Library (PCL)
• Performance Application Programming Interface (PAPI)


Digital Continuous Profiling Infrastructure

Objectives
• Achieve lower overhead than previous systems
• Deliver a very high sampling rate
• Provide more detailed and accurate cycle-level analysis

Three key tools included
• dcpiprof – identify distribution of cycles among procedures
• dcpicalc – instruction execution details and stall causes
• dcpistats – analyze variation in profile data

Key contributions
• Novel data structures for gathering counter information
• Innovative analysis of counters to determine cause of stalls


Procedure-Level Bottlenecks

Identify dominant procedures to focus on for optimization
Obtain low-level details, such as instruction cache miss rates
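
A minimal sketch of the procedure-level view: sampled program counters are attributed to procedures and ranked by their share of samples. The symbol table, addresses, and function names below are hypothetical; this illustrates the idea only and is not the dcpiprof implementation.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start address, procedure name), sorted by address.
SYMBOLS = [(0x1000, "parse"), (0x1400, "compute"), (0x2000, "emit")]
STARTS = [start for start, _ in SYMBOLS]


def procedure_for(pc: int) -> str:
    """Map a sampled PC to the procedure whose start address precedes it."""
    index = bisect_right(STARTS, pc) - 1
    return SYMBOLS[index][1] if index >= 0 else "(unknown)"


def cycles_by_procedure(pc_samples):
    """Return (procedure, share of samples) pairs, largest share first."""
    counts = Counter(procedure_for(pc) for pc in pc_samples)
    total = sum(counts.values())
    return [(name, n / total) for name, n in counts.most_common()]


# Hypothetical CYCLES samples; most of them land in "compute".
samples = [0x1404, 0x1410, 0x1008, 0x1404, 0x2004, 0x17FC, 0x1404, 0x1410]
ranking = cycles_by_procedure(samples)
```

The procedure at the head of `ranking` is the first candidate for optimization, since it accounts for the largest share of sampled cycles.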


Instruction-Level Bottlenecks

Static analysis can identify structural hazards.

• This provides best-case

DCPI identifies all possible stall causes (conservatively)
Different executions of code may suffer from different stalls


Analysis of Variance Across Executions

Variance analysis is useful to characterize system effects
Important to evaluate applicability of optimizations
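
As a rough illustration of variance analysis across executions, the sketch below compares per-procedure sample counts from several runs. The run data and the choice of statistic (standard deviation divided by mean) are assumptions made for this example; the statistics dcpistats actually reports may differ.

```python
from statistics import mean, stdev


def variation_report(runs):
    """runs: list of {procedure: sample count} dicts, one per profiled run.

    Returns {procedure: (mean, stddev / mean)}. A high ratio flags a
    procedure whose cost changes a lot from run to run.
    """
    procedures = sorted(set().union(*runs))
    report = {}
    for proc in procedures:
        counts = [run.get(proc, 0) for run in runs]
        avg = mean(counts)
        spread = stdev(counts) / avg if avg else 0.0
        report[proc] = (avg, spread)
    return report


# Hypothetical sample counts from three runs of the same program.
runs = [
    {"compute": 100, "io_wait": 10},
    {"compute": 102, "io_wait": 40},
    {"compute": 98, "io_wait": 25},
]
report = variation_report(runs)
```

Here `compute` is stable across runs while `io_wait` varies widely, which suggests its cost depends on system effects rather than on the code itself.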


DCPI: System Overview

[Figure: DCPI system overview. Labels recoverable from the diagram: Hardware (cpu 1 … cpu n, counter 1 … counter m); Kernel device driver (per-cpu data: hash table, overflow buffer); User space (daemon, buffered samples, loadmap info, modified dynamic loader, exec log); Profiles; Load files; Optional source code; Analysis tools: system-, load-file-, procedure-, and instruction-level]


DCPI: Hardware Support

Performance counters generate interrupts on overflow
• Interrupt delivers PID, program counter, and event type

DCPI monitors CYCLES and IMISS events by default
• Intelligent analysis obtains all desired execution details
• Other events can be monitored, but must be multiplexed

Sampling period is configurable (between 4K and 64K)
• Period is randomized to minimize systematic correlations

Six-cycle latency between counter overflow and the sampled PC
• Does not affect sampling accuracy for CYCLES and IMISS

Blind spots exist during execution of PALcode and highest level interrupts
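
Why randomize the sampling period? The sketch below simulates a hypothetical loop whose iteration takes exactly 64 cycles. With a fixed period that is a multiple of the loop length, every sample lands on the same position in the loop; a randomized period spreads samples across the loop body. The loop length, the 60K to 64K range, and the seed are assumptions for illustration only.

```python
import random
from collections import Counter

LOOP_CYCLES = 64  # hypothetical loop whose iteration takes exactly 64 cycles


def sample_positions(periods):
    """Return how often each cycle position within the loop is sampled."""
    cycle, hits = 0, Counter()
    for period in periods:
        cycle += period  # the counter overflows after `period` cycles
        hits[cycle % LOOP_CYCLES] += 1
    return hits


rng = random.Random(614)  # fixed seed so the sketch is repeatable

# Fixed period: 65536 is a multiple of 64, so sampling is in lockstep with the loop.
fixed = sample_positions([65536] * 1000)

# Randomized period, drawn uniformly from an assumed 60K..64K range.
randomized = sample_positions([rng.randint(61440, 65536) for _ in range(1000)])
```

`fixed` contains a single loop position, so the profile would blame one instruction for all the time, while `randomized` covers many positions.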


DCPI: Kernel Device Driver

DCPI has a high interrupt rate: 5200 per second at 333 MHz
Fast interrupt handler is critical.
• Taking 1000 cycles would consume 1.5% of CPU
• Tagged TLB avoids most TLB flushes
• Need to reduce cache misses to memory (~100 cycles)
• Transfer of data from kernel to user space is bottleneck

Smart data structures reduce overhead
• Hash table reduces accessed cache lines
• Entry data (PID, PC, and event) packed into 16 bytes
• Counter events are aggregated in driver memory
• Overflow buffers handle evictions and data transfer
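
The aggregation idea can be sketched as a small table keyed by (PID, PC, event): repeated samples bump a count in place, and a colliding entry is evicted to an overflow buffer for the user-mode daemon to collect. This is a simplified, direct-mapped sketch with hypothetical class and method names; the real driver's table organization and entry layout are not reproduced here.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "pid pc event count")


class SampleTable:
    """Direct-mapped table that aggregates (pid, pc, event) samples."""

    def __init__(self, slots=16384, overflow_capacity=8192):
        self.slots = [None] * slots
        self.overflow = []  # evicted entries awaiting the daemon
        self.overflow_capacity = overflow_capacity

    def record(self, pid, pc, event):
        index = hash((pid, pc, event)) % len(self.slots)
        entry = self.slots[index]
        if entry and (entry.pid, entry.pc, entry.event) == (pid, pc, event):
            # Hit: bump the count in place, no new entry is written.
            self.slots[index] = entry._replace(count=entry.count + 1)
            return
        if entry:
            # Collision: evict the old entry to the overflow buffer.
            self.overflow.append(entry)
        self.slots[index] = Entry(pid, pc, event, 1)

    def overflow_full(self):
        """True when the daemon should drain the overflow buffer."""
        return len(self.overflow) >= self.overflow_capacity


# Three samples of the same instruction collapse into one entry.
table = SampleTable(slots=4, overflow_capacity=2)
for _ in range(3):
    table.record(pid=7, pc=0x1404, event="CYCLES")
```

Because hot instructions are sampled repeatedly, most interrupts take the hit path and touch a single entry, which keeps the handler short.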


DCPI: User-Mode Daemon

When an overflow buffer fills, its data is moved to user space
PID and PC identify the program, and EVENT data is merged with accumulated profile information

Program image data obtained from:
• Modified loader
• Recognizer routines invoked by kernel exec
• Mach-based system calls

User space data merged with disk database periodically
• Disk usage minimized by compact format
• Small fraction of program image is actually executed


DCPI: Uniprocessor Workloads


DCPI: Multiprocessor Workloads


DCPI: Workload Slowdowns


DCPI: Time Overhead Breakdown

Interrupt handler setup and teardown took an additional 214 cycles
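
The cost of a per-interrupt cycle budget can be checked with simple arithmetic. The sketch below uses the interrupt rate and clock speed quoted on the device driver slide (5200 interrupts per second at 333 MHz); the function name is hypothetical.

```python
def interrupt_overhead(interrupts_per_sec, cycles_per_interrupt, clock_hz):
    """Fraction of CPU cycles spent servicing profiling interrupts."""
    return interrupts_per_sec * cycles_per_interrupt / clock_hz


# A 1000-cycle handler at 5200 interrupts/s on a 333 MHz CPU.
handler_1000 = interrupt_overhead(5200, 1000, 333e6)

# Setup and teardown alone, at 214 cycles per interrupt.
setup_teardown = interrupt_overhead(5200, 214, 333e6)
```

`handler_1000` comes out at roughly 1.5% of the CPU, matching the figure on the device driver slide, and the 214 cycles of setup and teardown alone account for roughly a third of a percent.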


DCPI: Space Overhead Breakdown

Device driver has two 8K entry overflow buffers and a 16K entry hash table, totaling 512KB of kernel memory.
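
The 512KB total can be reproduced with a quick calculation, assuming every overflow buffer and hash table entry uses the 16-byte packed format mentioned on the device driver slide. The constant names are hypothetical.

```python
ENTRY_BYTES = 16                 # packed (PID, PC, event) entry, assumed for all entries
OVERFLOW_ENTRIES = 2 * 8 * 1024  # two 8K-entry overflow buffers
HASH_ENTRIES = 16 * 1024         # one 16K-entry hash table

total_bytes = (OVERFLOW_ENTRIES + HASH_ENTRIES) * ENTRY_BYTES
total_kb = total_bytes // 1024
```

Under that assumption `total_kb` is 512, consistent with the slide.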


DCPI: Analyzing Profile Data

CYCLES profile data indicates approximate time each instruction spent at the head of the issue queue

High values could indicate
• Instruction executed frequently
• Instruction spent much time stalling

Objective to determine
• Execution frequency and CPI (phase 1)
• Set of culprits causing stalls (phase 2)


Phase 1: Estimating Frequency and CPI

Frequency and CPI must be determined only from sample counts and static procedure control flow analysis

Sample Count = Frequency x CPI

Procedure
• Build control flow graph from basic block analysis
• Group basic blocks and edges into equivalence classes
• Statically determine minimum time at head of queue
• Assume lowest sample counts indicate minimum CPI
• Propagate frequency estimates around CFG
• Derive confidence estimates using heuristics
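
The core of the estimate can be sketched for a single equivalence class, where every instruction executes the same number of times. Since Sample Count = Frequency x CPI, an instruction that never stalled dynamically has a sample count equal to the frequency times its static minimum head-of-queue time. The sketch below simply takes the smallest ratio; the sample counts and static minimums are hypothetical, and the actual analysis uses more elaborate heuristics and confidence estimates.

```python
def estimate_frequency(samples, min_cycles):
    """Estimate the execution frequency of one equivalence class.

    samples[i] is the CYCLES sample count of instruction i and min_cycles[i]
    its statically computed minimum head-of-queue time. The smallest ratio is
    assumed to come from an instruction that never stalled dynamically.
    """
    ratios = [s / m for s, m in zip(samples, min_cycles) if m > 0]
    return min(ratios)


def estimate_cpi(samples, frequency):
    """Per-instruction CPI implied by Sample Count = Frequency x CPI."""
    return [s / frequency for s in samples]


samples = [200, 100, 900, 100]  # hypothetical sample counts
min_cycles = [2, 1, 1, 1]       # hypothetical static minimums
freq = estimate_frequency(samples, min_cycles)
cpis = estimate_cpi(samples, freq)
```

In this example the third instruction has an estimated CPI of 9 against a static minimum of 1, so it is the one whose stalls need explaining in phase 2. The frequency is expressed in sample units and would be scaled by the sampling period to obtain execution counts.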


Evaluation of Phase 1 Analysis

Evaluation used “base” SPECfp and “peak” SPECint workloads
dcpix, a profiling tool, is used to gather execution counts
73% of instructions within 5% of count, 58% of edges within 10%

[Figures: Instruction Frequency; Edge Frequency]


Phase 2: Identifying Stall Culprits

Analysis uses only binary executable and sample counts
Static stalls determined by accurate processor modeling
Dynamic culprits isolated by process of elimination
• Technique specific to each stall cause
• Less than 10% of stalls remain unexplained

Ex. Instruction cache misses
• Rule out miss when in same cache line as instruction before
• Determine when this occurs by basic block analysis
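
The instruction cache rule above can be sketched as a check over an instruction's control-flow predecessors: if every predecessor sits in the same cache line, the line was already fetched and a miss is ruled out. The 32-byte line size and the function name are assumptions for this sketch.

```python
CACHE_LINE_BYTES = 32  # assumed instruction cache line size for this sketch


def icache_miss_possible(pc, predecessor_pcs):
    """Return False only when an I-cache miss can be ruled out for pc.

    predecessor_pcs holds the addresses of all instructions that can execute
    immediately before pc, as found by basic block analysis.
    """
    if not predecessor_pcs:
        return True  # nothing known about how we got here; cannot rule it out
    line = pc // CACHE_LINE_BYTES
    return any(pred // CACHE_LINE_BYTES != line for pred in predecessor_pcs)


same_line = icache_miss_possible(0x1004, [0x1000])           # same line: ruled out
next_line = icache_miss_possible(0x1020, [0x101C])           # crosses a line boundary
branch_in = icache_miss_possible(0x1008, [0x1004, 0x2000])   # reachable from far away
```

The check is conservative: it only removes the culprit when no path into the instruction could have required a new line fetch.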

Accuracy can be determined by comparing against event sampling of stall causes


Evaluation of Phase 2 Analysis


The Morph System

Objectives
• Provide user and machine specific optimization capability
• Optimizations should not require source code
• Profiling and optimization process should be transparent

Key Components
• Morph Monitor – online gathering of counter information
• Morph Manager – process and prepare data for optimization
• Morph Editor – conducts optimizations on intermediate form

Contributions
• Develops full system with code layout optimizations as case study


Morph: System Overview

Two other components

Morph Back-end provides executable with intermediate form annotations to support online optimization

PostMorph can infer annotations from static and dynamic analysis to improve legacy applications


The Morph Monitor

Program activity gauged by low-cost statistical sampling
Modified clock interrupt routine collects samples
• Interrupt rate of 1024 Hz producing 8-byte samples
• Claim that synchronization with clock is not detrimental

Monitor requires 256KB of kernel memory
• Transfer of data to Morph Manager occurs every 30 seconds

Small modifications to OS required
• exec() and mmap() changed to provide address space data
• exit() modified to log process termination events
• Context switch information must also be logged


The Morph Manager

Manager must compile sample data from multiple sample sets and execution modules

During program updates, sample data must be ignored
Program counter samples must be interpreted
• Intermediate representation contains CFG information
• PC samples are scaled for basic block size
• Aggregate basic block execution profile is created

Morph does not compensate for CPI
• Authors argue that time-based approach is not detrimental

Profiles from multiple inputs must be combined
• Morph combines information weighted by execution length
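
Two of the Manager's steps can be sketched in a few lines. A large basic block collects more PC samples per execution than a small one, so sample counts are divided by block size to approximate execution frequency. Profiles from several runs are then merged; the sketch uses a plain average weighted by execution length, which is one reading of the slide and not necessarily Morph's exact formula. All names and numbers are hypothetical.

```python
def block_frequencies(samples_per_block, block_sizes):
    """Scale PC sample counts by basic block size (in instructions)."""
    return {b: samples_per_block.get(b, 0) / block_sizes[b] for b in block_sizes}


def combine_profiles(profiles):
    """profiles: list of (execution_length, {block: frequency}) pairs.

    Returns the average of the profiles weighted by execution length.
    """
    total = sum(length for length, _ in profiles)
    blocks = set().union(*(p for _, p in profiles))
    return {
        b: sum(length * p.get(b, 0.0) for length, p in profiles) / total
        for b in blocks
    }


sizes = {"B0": 4, "B1": 2}                                # instructions per block
run_a = block_frequencies({"B0": 40, "B1": 10}, sizes)    # longer run
run_b = block_frequencies({"B0": 8}, sizes)               # shorter run, B1 never sampled
merged = combine_profiles([(30, run_a), (10, run_b)])
```

As on the slide, no correction for CPI is applied: a block that stalls often will still look hotter than its execution count alone would justify.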


The Morph Editor

Implemented as a composition of SUIF compiler passes
Intermediate representation is modified low-level SUIF
Three code layout optimizations performed:
• Branch alignment
• Fluff removal
• Procedure layout

Optimizations require basic block execution counts and CFG edge frequencies (calculated by Morph Editor)
Profile information used to optimize for common case
Optimizations reduce control hazards such as branch mispredictions and misfetches, and improve cache locality


Morph: Workload Descriptions and Inputs

I am not clear on the necessity or desirability of the two-stage experiment with test and train workload inputs for this study


Morph: Overhead in Online Monitor

Non-determinism of bin-hopping policy for virtual to physical page mapping caused problems

DU is the baseline Digital Unix using page coloring for mapping

Larger benchmarks have higher overhead due to cache conflicts

Strawman tests conducted to quantify the relationship between working set and profiling overhead

Monitor adds 72 instructions to clock interrupt


Morph: Overhead in Offline Manager

At 1024 Hz with 8-byte samples, the Monitor generates 8KB of data per second

Adding logged events, Manager must copy 110KB to disk / 10 sec

Profiles made 640KB per minute

Manager can process 60 MB per minute (up to 900 MB per day)

Data typically much less

Long term storage augments intermediate representation and is very compact


Morph: Optimization Results

Profiled samples are captured from train input sets.

Execution time improvement is measured on test input sets

Results compared to conventional optimization techniques utilizing complete profile information instead of sampling


DCPI and Morph Comparison - Similarities

Both target DEC Alpha processors
• Same available hardware and OS support (Digital Unix)

First two works proposing low-overhead online profiling
Both employ statistical sampling of processor activity
• Program counter samples provide bulk of insight

Common infrastructure design and division of labor
• Light-weight kernel process for counter collection
  o Acts like device driver for performance counters
• Slower user-mode daemon for processing data

Comparable performance overhead
• 1-3% for DCPI (5x faster sampling) and 0.3% for Morph


DCPI and Morph Comparison - Differences

Significant focus of Morph on optimization side
• Optimization tool tightly integrated

DCPI leaves optimization task to others
• Authors’ goal was to develop a tool for broad use

Morph developed more as a “proof of concept”
• Develops more integrated profiling and optimization suite

DCPI has heavier instruction-level analysis focus
• Stall culprit analysis allows for more extensive optimizations
• Morph’s profile data limits optimization to code layout

DCPI provides multiprocessor support
Morph targets single-user workstations


Comments and Critique

Proposed methodology lacks portability
• Profiling infrastructure tied to DEC Alpha and Digital Unix
• Common infrastructure (PCL & PAPI) seems more promising

Ability to infer stall causes from PC counts limited to in-order processors
• Out-of-order execution poses serious problem

Papers focus on processor core and memory hierarchy
• Interconnect performance and I/O critical in multi-core

Would have liked to see more detail on optimization side
• How is the profile and optimization cycle automated?


Conclusions

Systems research must be reconciled with performance profiling

Low-level architectural events are responsible for significant performance losses

Critical to consider low-level impact of OS/system design
• OS-level changes could affect pipeline stalls
• Perceived gains or losses could be an accidental side effect

Are high-level performance measurements of virtualization or μKernel overhead meaningful?

Performance results must be taken with a grain of salt
• Lots of salt, of many different origins