TRANSCRIPT
Regression Testing &
Quality Assurance
Lukas van Ginneken
August 1, 2019
Outline
• The release process and regression testing
• Roles in testing
• The release process
• Branching and merging
• Other types of testing
• Regression testing
• QOR regression testing
• Stability and Repeatability
Testing in commercial EDA
• Alpha testing
– By developers (R&D)
• for bug fixes
• for new features and enhancements
– By application engineers (AE)
• usually on designs supplied by customers
– Regression testing by the QA department
• Beta testing
– Competitive benchmarks
– Acceptance testing by the client EDA department
– Bug reporting
Commercial EDA release flow
[Flow diagram: R&D developers feed QA, which runs regressions; releases go to Applications Engineering, then to the client EDA department and on to designers, with training accompanying the client hand-offs.]
QA
• Runs regression tests
• Only QA is authorized to check in code to the master branch
• Determines the order of merges to the master branch
• Executes the release schedule
• Generally does not develop regression tests
Developers
• Develop unit tests
• Install bug test cases as regression tests
• Run regression tests before submitting a branch
Applications engineering
• Supports customers
• Interface between customers and developers
• Files bug reports
• Verifies or creates test cases for bug reports
• Creates end-to-end and QOR regression tests
• Does not run regression tests
Customers
• Supply designs, libraries, and floor plans
• Provide test cases
• Develop their own acceptance tests
Release train
• Releases run on a schedule
• Don't miss the train
Gates
Check points for merging and releasing:
1. Submit contribution to main branch (Developer)
• Verify the bug is fixed
• Add a new regression test for a new feature
• Run some tests to make sure it didn't break anything else
2. Check-in on main branch (QA)
• Smoke tests to verify the build is not DOA
• Unit tests to verify module by module
• A select set of QOR tests
3. Minor release (critical bug fixes or expedited requests) (QA)
• Feature- or bug-specific tests
• Customer-specific tests
4. Major release (new functionality) (QA)
• Comprehensive tests
Version management and QA
[Diagram: master branch commits 1-5 in parallel with a developer branch (commits 1-3); merging the stale developer branch back into master fails.]
Version management and QA
[Diagram: the developer branch periodically pulls master's commits (4, 6); conflicts are resolved on the developer branch ("resolve conflicts here"), so the final merge into master has fewer problems ("fewer problems here").]
Code coverage testing
• Function coverage
• Statement or line coverage
• Branch coverage
• Loop coverage
• Variable coverage
• Tools: Covtool, gcov, PureCoverage, many others
• A surprisingly large amount of code is not covered
• Code that is not covered is guaranteed to be untested
• Code that is covered is not guaranteed to be bug-free
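The last point can be made concrete with a tiny sketch (a hypothetical `clamp` function, not from the lecture): a test suite can execute every line and still miss a buggy input combination, so full line coverage proves execution, not correctness.

```python
def clamp(x, lo, hi):
    """Clamp x into [lo, hi] -- silently misbehaves when lo > hi."""
    if x < lo:
        x = lo
    if x > hi:
        x = hi
    return x

# These two calls execute every line and both branches...
assert clamp(-5, 0, 10) == 0
assert clamp(15, 0, 10) == 10

# ...yet the untested input combination lo > hi returns a value
# outside any sensible range: clamp(5, 8, 2) == 2.
```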
Memory tests
• Use memory debugger software
• Detect memory errors such as
– Reading uninitialized memory
– Accessing freed memory
– Exceeding array size
– Access outside a malloced block
– Memory leaks (forget to free memory)
• Flags potential problems and “unclean” code
• May not be an actual problem (a bit like lint)
• Examples: dmalloc, PurifyPlus, Electric Fence
Other types of testing
• Performance testing - e.g. gprof, Quantify
• Cache misses
• Static verification - Coverity
• Lint
Input sanity checks
Design sanity checks:
• combinational loops
• disconnected pins
• excessive levels of logic
• latches
• asynchronous logic
• data signals feeding clock inputs
• unconstrained IOs
• bad clock latency
Library sanity checks:
• unroutable pins
• off-grid pins
• use of M2
• timing sense does not match the logic function
• negative slope of NLDM tables
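A check like "combinational loops" is, at heart, cycle detection on the netlist graph. A minimal sketch, with a hypothetical dict-based netlist representation (registers are assumed to be excluded before the graph is built, since paths through them are fine):

```python
def has_combinational_loop(fanout):
    """Detect a cycle in a combinational netlist.

    fanout: dict mapping each gate to the gates it drives
    (hypothetical representation for this sketch).
    """
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on DFS stack / done
    color = {g: WHITE for g in fanout}

    def dfs(g):
        color[g] = GRAY
        for nxt in fanout.get(g, ()):
            if color.get(nxt, WHITE) == GRAY:      # back edge => loop
                return True
            if color.get(nxt, WHITE) == WHITE and nxt in fanout and dfs(nxt):
                return True
        color[g] = BLACK
        return False

    return any(color[g] == WHITE and dfs(g) for g in fanout)
```

For example, `{"a": ["b"], "b": ["c"], "c": ["a"]}` is flagged as a loop, while `{"a": ["b", "c"], "b": ["c"], "c": []}` is not.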
Regression test types
• Smoke tests
– Test that basic functionality is not totally broken
– E.g. can it read a design or execute a command
– Read DEF test, read LEF test, bring up the GUI
• Unit tests
– Test a single unit, e.g. global placer, timer, detailed router
• Bug fix tests
– Test cases that were originally filed with bug reports
– Tedious and time consuming
• Feature tests
– Features that span the system, e.g. dont_touch
• QOR tests
– QOR of separate algorithms
– QOR of whole designs
• Release tests
– Full suite of all tests
– Expensive to run in compute time
– Requires a server farm with a batch job scheduler
Regression tests
• Pass/fail
• Uniform calling method
– Executable
– Directory
– Script (always run.tcl)
• exit 0 = success
• exit N = failure
• Written in TCL
• Direct measurement of performance metrics
• Adjustable error messages (info, warning, error, fatal)
• Grepping log files from TCL
• Diff reports are appropriate for the timer, not for QOR tests
• Stack trace for crashes
• Check regressions into the same repository as the code
– Keeps track of versions
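The calling convention above (exit 0 = success, exit N = failure, plus grepping the log) can be sketched as one harness helper. The function name `classify` and the `Error:`/`Fatal:` severity pattern are assumptions for illustration, not the lecture's actual scripts:

```python
import re

def classify(exit_code, log_text):
    """Pass/fail verdict for one regression run.

    exit 0 = success, exit N = failure (the slides' convention);
    even on exit 0, an "Error:" or "Fatal:" line in the log fails
    the test. The severity prefixes are assumed for this sketch.
    """
    if exit_code != 0:
        return "FAIL"
    if re.search(r"^(Error|Fatal):", log_text, re.MULTILINE):
        return "FAIL"
    return "PASS"
```

A harness would call this after running each test's run.tcl, e.g. `classify(0, "Info: done\n")` passes while `classify(0, "Error: short found\n")` fails.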
Test harness
• Batch processing
• Cluster computing
• e.g. LSF, Slurm, LoadLeveler
QOR Metrics
• Primary metrics
– Timing violations
• WNS worst negative slack
• TNS total negative slack
• Hold time slack
• Max slew / Min slew
• Clock skew
– DRC violations
• shorts, spacing, open, via, min area, antenna
– User constraint violations
• dont_touch
• dont_use
• region violations
• blockage violations
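WNS and TNS follow directly from the endpoint slacks; a small sketch of the standard definitions (helper names are mine, not the lecture's):

```python
def wns(slacks):
    """Worst negative slack: the single worst endpoint slack,
    conventionally reported as 0 when nothing is violating."""
    return min(min(slacks), 0.0)

def tns(slacks):
    """Total negative slack: sum of all violating endpoint slacks."""
    return sum(s for s in slacks if s < 0)

slacks = [0.20, -0.10, -0.30, 0.05]   # example endpoint slacks in ns
# wns(slacks) == -0.30; tns(slacks) is about -0.40
```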
QOR Metrics
• Secondary metrics
– Power (dynamic/leakage)
• Dynamic power consumption is often difficult to determine accurately
– IR drop, EM metrics
• These depend on dynamic power consumption
– Utilization
• Not important, except as an indicator
– Congestion
• Total #overflows
• Worst gcell overflow
– Wirelength
• Half perimeter bounding box
• Steiner tree
• Global routing
– #vias
– Testability
– Run time
– Memory usage
Interpretation of constraints
• Is the clock frequency a hard constraint or merely a suggestion?
• Should the system violate timing to meet an area constraint?
• Should the system violate min/max slew to meet timing or power?
• Setup violations can usually be fixed by slowing down the clock
• Hold violations are baked into the silicon (they can't be fixed by slowing the clock)
• “I gave you impossible constraints but I still want a reasonable result”
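The setup-vs-hold asymmetry in the last two bullets falls out of the simplified single-cycle slack formulas (a textbook simplification that ignores clock skew and uncertainty):

```python
def setup_slack(t_clk, path_delay_max, t_setup):
    """Setup slack grows with the clock period: data must arrive
    t_setup before the *next* clock edge."""
    return t_clk - path_delay_max - t_setup

def hold_slack(path_delay_min, t_hold):
    """Hold slack does not involve the clock period at all: data must
    stay stable t_hold after the *same* clock edge."""
    return path_delay_min - t_hold

# Slowing the clock from 1.0 ns to 2.0 ns fixes a setup violation:
# setup_slack(1.0, 1.1, 0.05) < 0 but setup_slack(2.0, 1.1, 0.05) > 0.
# A hold violation is untouched: hold_slack(0.02, 0.05) < 0 either way.
```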
Challenges in QOR testing
• Stability – Small changes in the input can cause big differences in the output
– Butterfly effect
• Repeatability
• Interpretation of constraints
• Size of the solution space
• Correlation of metrics
– utilization to congestion
– global routing congestion to detailed routing DRCs
– synthesis slack vs post opt slack
Repeatability testing
• Multiple runs with seemingly identical inputs can result in different outputs
• The inputs are not really identical: the differences can be in unexpected and unintended places, such as:
– User name, working directory, date and time
– Host name, host configuration, OS version
– Environment variables
– Search paths
• A common path for divergence is something like:
– char *pwd = strdup(getenv("PWD"));
– This allocates a variable amount of memory, depending on the path name of the current directory
– This causes subsequent malloc() calls to return different addresses
– This causes many objects to have different addresses
– An algorithm uses address-dependent logic, such as sorting or hashing by address
– This changes net order, cell order, or pin order
– This causes heuristics to make different optimizations
– Normal instability amplifies the divergence
• Tracking down the source of the divergence isn't always easy
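The divergence chain above hinges on address-dependent ordering. The same trap is easy to sketch in CPython, where id() is essentially an object's address, so sorting by it yields an order that can differ between runs even though the "inputs" look identical (a deliberately fragile sketch, not production code):

```python
class Cell:
    def __init__(self, name):
        self.name = name

cells = [Cell(n) for n in ("u1", "u2", "u3", "u4")]

# FRAGILE: order depends on where the allocator happened to place each
# object, which can vary run to run (address-space randomization, or
# earlier allocations like the strdup(getenv("PWD")) example above).
by_address = sorted(cells, key=id)

# STABLE: order depends only on the design itself.
by_name = sorted(cells, key=lambda c: c.name)
```

Using a design-derived key (name, index) instead of an address is the usual repeatability fix.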
The stability problem
• Small changes in the input can have big consequences
• Butterfly effect
– Inherent to deterministic nonlinear systems
– Examples: the weather, the stock market, fluid dynamics, discrete optimization heuristics
• "Meaningless" changes, such as:
– Module, file, net, cell, or pin order
– Net, cell, or pin names, or line numbers
• Actual small changes, such as inverting a single pin or moving a pad
• Divergence tends to grow with more optimization
• Discrete decisions are the main causes, e.g.:
– automatic macro and pad placement
– buffer insertion
– routing order
• Smooth algorithms are less unstable, e.g. global placement
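How a "meaningless" ordering change flips a discrete heuristic's result can be shown with first-fit packing — a stand-in for order-sensitive placement and routing decisions, not an actual EDA algorithm:

```python
def first_fit(items, capacity):
    """Greedy first-fit: put each item into the first bin it fits in,
    opening a new bin when none fits."""
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:
            bins.append([item])
    return bins

# The same multiset of items, in two "meaningless" orderings:
# first_fit([6, 5, 4, 3, 2], 10) uses 2 bins,
# first_fit([2, 3, 4, 5, 6], 10) uses 3 bins.
```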
The “uses” of instability
• Instability is sometimes used as an optimization technique
– Seemingly arbitrary parameter variations cause different results
– Run a number of nearly identical runs in parallel
– Pick the best one at the end
• Seemingly arbitrary parameter or script changes can cause a regression test to pass or fail
• Or they can "fix" bugs
• Can lead to organizational "thrashing": making frequent code changes that appear to be improvements but are actually arbitrary
QOR testing
• Instability means it's difficult to prove a parameter setting is better without doing extensive, well-controlled experiments
• Similar to problems in the medical field: it requires a "study"
• Many data points required
• Each data point takes hours of run time
• Many parameters to tweak: a huge search space
• Hard to separate signal from noise
• The underlying system keeps changing
– which nullifies learned lessons
So what do people actually do?
• Make a tweak and test it.
• Due to instability, this actually has a decent chance of making things better
• Call it good and report the problem solved.
• If it didn't work, try a different tweak.
• Some of these are pure luck
• Some of them are genuine improvements
Tendency to tack on additional fixing steps at the end of the scenario:
• One last sizing
• One last buffering
• One last legalization
• One last detailed route
• Avoids disturbing the process before the point where the problem becomes obvious
QOR regressions
• Create unit tests
• Simple global metric
– Global placement -> HPWL < limit
– Sizing -> TNS == 0
– CTS -> clock skew
– Global routing -> congestion
• Don't diff entire log files or entire designs
• exit 0 if successful
• exit 1 if unsuccessful
• Keep blocks small enough to have reasonable run time
• Keep blocks large enough to be meaningful
• Maybe about 5,000 - 100,000 cells
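A QOR unit test of the "global placement -> HPWL < limit" form reduces to extracting one number from the run log and comparing it to a stored limit. A sketch with an assumed log line format ("HPWL: <value>"), since real tools' log formats vary:

```python
import re

def hpwl_check(log_text, limit):
    """Return 0 (pass) if the reported HPWL is under the limit, else 1,
    matching the slides' exit-code convention.

    Assumes the placer log contains a line like "HPWL: 123456.7";
    the exact format is an assumption for this sketch.
    """
    m = re.search(r"^HPWL:\s*([0-9.]+)", log_text, re.MULTILINE)
    if m is None:
        return 1                      # a missing metric counts as failure
    return 0 if float(m.group(1)) < limit else 1
```

A run.tcl wrapper would feed this the log and `exit` with the returned code, keeping the test pass/fail without diffing the whole log.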
QOR Integration tests
• Must be somewhat challenging
• Usually a benchmark design
• High maintenance
• Use for release testing, not the overnight build