Gil Tene, CTO & co-founder, Azul Systems, June 26, 2010

Azul Tech Talk: Yandex, June 2010

©2010 Azul Systems, Inc. Azul Systems Confidential and Proprietary

Agenda

• Background

• Memory stagnation

• Java & Virtualization

• [Concurrent] Garbage Collection deep-dive

• Some [more] Azul Zing Platform details

• Talk-Talk / Q & A


Memory Stagnation


2GB: the new 640K


Memory.

How many of you use heap sizes:

Larger than ½ GB?

Larger than 1 GB?

Larger than 2 GB?

Larger than 4 GB?

Larger than 10 GB?

Larger than 20 GB?

Larger than 100 GB?


Problem statement: Memory footprint has stagnated

• Individual instances with ~1-2GB of heap were commonplace in 2001

─ Some were still smaller

─ Very few were larger

• Individual instances with ~1-2GB dominate in 2010

─ Very few are smaller

─ Relatively few are larger

• Could it really be that all applications have the same memory size needs?

• The practical size of an individual Java heap has not moved in ~9 years.


Why ~2GB? – It’s all about GC (and only GC)

• Seems to be the practical limit for responsive applications

• A 100GB heap won’t crash. It just periodically “pauses” for many minutes at a time.

• [Virtually] All current commercial JVMs will exhibit a periodic multi-second pause on a normally utilized 2GB heap.

─ It’s a question of “When”, not “If”.

─ GC Tuning only moves the “when” and the “how often” around

• “Compaction is done with the application paused. However, it is a necessary evil, because without it, the heap will be useless…” (JRockit RT tuning guide).


Maybe 2GB is simply enough?

• We hope not

• Plenty of evidence to support pent-up need for more heap

• Common use of lateral scale across machines

• Common use of “lateral scale” within machines

• Use of “external” memory with growing data sets

─ Databases certainly keep growing

─ External data caches (memcached, JCache, JavaSpaces)

• Continuous work on the never-ending distribution problem

─ More and more reinvention of NUMA

─ Bring data to compute, bring compute to data


Will distributed solutions “solve” the problem?

• Distributed data solutions are [only] used to solve problems that can’t be solved without distribution

─ Extra complexity & loss of efficiency only justified by necessity

• It’s always because something doesn’t fit in one simple symmetric UMA process model

─ When we need more compute power than one node has

─ When we need more memory state than one node can hold

• Distributed solutions are not used to solve tiny problems

• “Tiny” is not a matter of opinion, it’s a matter of time

─ “Tiny” gets ~100x bigger every decade

“Tiny” application history

• 1980: 100KB apps on a ¼ to ½ MB server

• 1990: 10MB apps on a 32 – 64 MB server

• 2000: 1GB apps on a 2 – 4 GB server

• 2010: ??? GB apps on a 256 GB server (figure annotation: 100GB, Zing VM)

Moore’s Law: transistor counts grow

• 2x every 18 months

• ~100x every 10 years
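
The “2x every 18 months” and “~100x every 10 years” figures are the same claim stated two ways. A minimal sketch of the arithmetic, with a hypothetical class and method name chosen only for illustration:

```java
// Illustrative arithmetic only; the class and method names are hypothetical.
final class DoublingGrowth {
    /** Growth factor after `months`, if capacity doubles every `doublingMonths`. */
    static double growthFactor(double months, double doublingMonths) {
        return Math.pow(2.0, months / doublingMonths);
    }

    public static void main(String[] args) {
        // 120 months at one doubling per 18 months is 2^(120/18), roughly 100x.
        System.out.printf("10-year growth: %.1fx%n", growthFactor(120, 18));
    }
}
```

Under this assumption a decade gives a bit more than 6.6 doublings, which is where the ~100x per decade in the “Tiny” history comes from.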


2GB is the new 640K

This is getting embarrassing…

• We do strange things within servers now…

• Java runtimes “misbehave” above ~2GB of memory

─ Most people won’t tolerate 20 second pauses

• It takes 50 JVM instances to fill up a ~$10K, ~100GB server

─ A 256GB server can now be bought for ~$18K (mid 2010)

• Using distributed SW solutions within a single commodity server

─ Similar to the EMS tricks Windows 3.1 used to deal with the 640KB cap

─ Looks a lot like yak shaving

• The problem is in the software stack

─ Artificial constraints on memory per instance

─ GC pause time is the only limiting factor for instance size

─ Can’t just “tune it away”

• Solve GC, and you’ve solved the problem


The Zing Platform: Virtualization++ for Java


Java & Virtualization

How many of you use virtualization? (i.e. KVM, VMware, Xen, or desktop virtualization such as Fusion, Parallels, VirtualBox, etc.)

How many of you use it for production applications?

How many of you think that virtualization will make your application run faster?


The Virtualization Tax

Virtualization is universally considered a “tax”

Typical focus is on measuring and reducing overhead

Everyone hopes to get to “virtually the same as non-virtualized” performance characteristics

But we can do so much better….

What if virtualization made applications better?


Virtualization with Application benefits

If you want to:

• Improve response times

• Increase Transaction rates

• Increase Concurrent users

• Forget about GC pauses

• Eliminate daily restarts

• Elastically grow during peaks

• Elastically shrink when idle

• Gain production visibility

Use Zing & Virtualization


Java Virtualization: New foundation for elastic, scalable Java deployments

Virtualize the Java runtime: Liberate Java from the OS, optimizing runtime execution

Allow highly effective use of available resources: 100x better scale, throughput and responsiveness, improving user experience

Elastically scale applications: Smoothly scale resources up/down based on real-time demand, improving scalability, efficiency and resiliency

Simplify deployment configuration: Reduce instance count, improve management and visibility

Azul Java Virtualization: Liberating Java from the rigidities of the OS

[Diagram labels: Java App, Java Virtualization, OS Layer, x86 Server]


Zing™ Platform Components

• Zing Virtual Appliances: Elastic, Scalable Capacity

• Zing Virtual Machine: Transparent Virtualization

• Zing Resource Controller: Management and Monitoring

• Zing Vision: Built-in App Profiling


Zing Elastic Deployments: Virtualizing Java workloads in the Cloud

[Diagram labels: Hypervisors; Zing Resource Controller Plug-In; Linux and Windows guests running Apps; Zing Virtual Appliances (ZVAs)]

Resources are utilized by the Zing virtual appliances, not the OSs

Virtues of the Zing Elastic Platform: Making virtualization the best environment for Java

• Optimized Runtime Platform

─ More effective use of resources (dozens of cores, 100s of GBs)

─ Scales smoothly over a wide range (from 1 GB to 1 TB)

─ Greater stability, resiliency and operating range

• Record-breaking Scalability

─ Completely eliminates GC related barriers

─ Practical support for 100x larger heaps (e.g. 200-500+ GBs)

─ Sustain 100x higher throughput and allocation rates

• Simplified Java App Deployments

─ Better app stability with fewer, more robust JVMs

─ Zero-overhead runtime visibility

─ Application-aware resource control


Building on a Heritage of Elastic Runtimes: Proven, vertically integrated execution stack

Azul Vega™ 3 Compute Appliance

• Up to 864 cores

• Heaps up to 640 GB

[Diagram labels: Hardware, Java Runtime, OS Kernel, Virtualization]


Benefits of the Zing Elastic Platform

Business Implications

• Consistently fast Java application response times

• Improved customer experience and loyalty

• Faster time to market

• Greater application availability, even during peaks

• Room for growth, delivered in a robust and cost-effective manner

• Lower costs through simplified deployments and virtualization and cloud enablement

IT Implications

• 100x improvements in key response time and throughput metrics

• Accelerate virtualization and cloud adoption

• Increased resiliency and efficiency for all Java applications through dynamic resource sharing

• Simplified deployments through instance consolidation

• Unmatched production-time visibility and management

• Fast ROI


Now you can:

Improve response times: with Zing & virtualization!

Increase Transaction rates: with Zing & virtualization!

Increase Concurrent users: with Zing & virtualization!

Forget about GC pauses: with Zing & virtualization!

Eliminate daily restarts: with Zing & virtualization!

Elastically grow during peaks: with Zing & virtualization!

Elastically shrink when idle: with Zing & virtualization!

Gain production visibility: with Zing & virtualization!

Thank You

Gil Tene, CTO, Azul Systems

Performance Considerations in Concurrent Garbage Collected Systems


1. Understand why concurrent garbage collection is a necessity

2. Gain an understanding of performance considerations specific to concurrent garbage collection

3. Understand what concurrent collectors are sensitive to, and what can cause them to fail

4. Learn how [not] to measure GC in a lab


What is a concurrent collector?

A Concurrent Collector performs garbage collection work concurrently with the application’s own execution

A Parallel Collector uses multiple CPUs to perform garbage collection


About the speaker: Gil Tene (CTO), Azul Systems

• We deal with concurrent GC on a daily basis

• Azul makes Java scalable through virtualization

─ We make physical (Vega™) and virtual (Zing™) appliances

─ Our appliances power JVMs on Linux, Solaris, AIX, HP-UX, …

─ Production installations ranging from 1GB to 300GB+ of heap

─ Zing VM instances smoothly scale to 100s of GB, 10s of cores

• Concurrent GC has always been a must in our space

─ It’s now a must in everyone’s space: you can’t scale without it

• Focused on concurrent GC for the past 8 years

─ Azul’s GPGC designed for robustness, low sensitivity


Why use a concurrent collector? Why not stop-the-world?

• Because pause times break your SLAs

• Because you need to grow your heap size

• Because your application needs to scale

• Because you can’t predict everything exactly

• Because you live in the real world…


Agenda

• Background

• Failure & Sensitivity

• Terminology & Metrics

• Detail and inter-relations of key metrics

• Collector mechanism examples

• Recommendations for measurements

• Q & A


Why we really need concurrent collectors: Software is unable to fill up hardware effectively

• 2000:

─ A 512MB-1GB heap was “large”

─ A 1-2GB commodity server was “large”

─ A 2 core commodity server was “large”

• 2010:

─ A 2GB heap is “large”

─ A 128-256GB commodity server is “medium”

─ A 24-48 core commodity server is “medium”

• The gap started opening in the late 1990s

• The root cause is Garbage Collection Pauses


What constitutes “failure” for a collector? It’s not just about correctness any more

• A Stop-The-World collector fails if it gets it wrong…

• A concurrent collector [also] fails if it stops the application for longer than requirements permit

─ “Occasional pauses” longer than SLA allows are real failures

─ Even if the Application Instance or JVM didn’t crash

─ Otherwise, you would have used a STW collector to begin with

• Simple example: Clustering

─ Node failover must occur in X seconds or less

─ A GC pause longer than X will trigger failover. It’s a fault.

─ ( If you don’t think so, ask the guy whose pager just went off… )
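
The clustering example can be made concrete with a small sketch. It assumes a hypothetical heartbeat model in which a node is declared failed when the gap between two heartbeats exceeds the failover timeout X. The class and method names are illustrative and not taken from any cluster product:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the heartbeat model and all names are assumptions.
final class FailoverCheck {
    /**
     * Returns the gaps (in ms) between consecutive heartbeats that exceed the
     * failover timeout. A stop-the-world pause longer than the timeout shows
     * up as such a gap, and the cluster treats the node as failed.
     */
    static List<Long> gapsExceeding(long[] heartbeatTimesMs, long timeoutMs) {
        List<Long> faults = new ArrayList<>();
        for (int i = 1; i < heartbeatTimesMs.length; i++) {
            long gap = heartbeatTimesMs[i] - heartbeatTimesMs[i - 1];
            if (gap > timeoutMs) {
                faults.add(gap);
            }
        }
        return faults;
    }

    public static void main(String[] args) {
        // Heartbeats every second, except for one 4.5 second silence (e.g. a GC pause).
        long[] beats = {0, 1000, 2000, 6500, 7500};
        System.out.println(gapsExceeding(beats, 3000)); // one gap of 4500 ms
    }
}
```

From the cluster's point of view the node never crashed, yet the single 4.5 second silence is indistinguishable from a fault.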


Concurrent collectors can be sensitive: Go out of the smooth operating range, and you’ll pause

• Correctness now includes response time

• Just because it didn’t pause under load X, doesn’t mean it won’t pause under load Y

• Outside of the smooth operating range:

─ More state (with no additional load) can cause a pause

─ More load (with no additional state) can cause a pause

─ Different use patterns can cause a pause

• Understand/Characterize your smooth operating range


Terminology: Useful terms for discussing concurrent collection

• Mutator

─ Your program…

• Parallel

─ Can use multiple CPUs

• Concurrent

─ Runs concurrently with program

• Pause time

─ Time during which mutator is not running any code

• Generational

─ Collects young objects and long lived objects separately.

• Promotion

─ Allocation into old generation

• Marking

─ Finding all live objects

• Sweeping

─ Locating the dead objects

• Compaction

─ Defragments heap

─ Moves objects in memory

─ Remaps all affected references

─ Frees contiguous memory regions


Metrics: Useful metrics for discussing concurrent collection

• Heap population (aka Live set)

─ How much of your heap is alive

• Allocation rate

─ How fast you allocate

• Mutation rate

─ How fast your program updates references in memory

• Heap Shape

─ The shape of the live object graph

─ * Hard to quantify as a metric...

• Object Lifetime

─ How long objects live

• Cycle time

─ How long it takes the collector to free up memory

• Marking time

─ How long it takes the collector to find all live objects

• Sweep time

─ How long it takes to locate dead objects

─ * Relevant for Mark-Sweep

• Compaction time

─ How long it takes to free up memory by relocating objects

─ * Relevant for Mark-Compact


Cycle Time: How long until we can have some more free memory?

• Heap Population (Live Set) matters

─ The more objects there are to paint, the longer it takes

• Heap Shape matters

─ Affects how well a parallel marker will do

─ One long linked list is the worst case for most markers

• How many passes matters

─ A multi-pass marker revisits references modified in each pass

─ Marking time can therefore vary significantly with load
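
Why marking time varies with load can be seen in a toy model of a multi-pass marker. This is a sketch under simplifying assumptions (constant marking and mutation rates, every modified reference revisited exactly once in the next pass), not a description of any real collector. All names and numbers are hypothetical:

```java
// A toy model, not any real collector's algorithm. All names and rates are hypothetical.
final class MultiPassMarkModel {
    /**
     * Counts concurrent marking passes until the refs left to revisit drop to
     * `stopThreshold` or below. Returns -1 if that does not happen within maxPasses.
     */
    static int passesUntilCaughtUp(double liveRefs, double markRate,
                                   double mutationRate, double stopThreshold,
                                   int maxPasses) {
        double work = liveRefs; // refs to visit in the current pass
        for (int pass = 1; pass <= maxPasses; pass++) {
            double passSeconds = work / markRate;
            work = mutationRate * passSeconds; // refs modified meanwhile: next pass's work
            if (work <= stopThreshold) {
                return pass;
            }
        }
        return -1; // the marker cannot keep up with the mutator
    }

    public static void main(String[] args) {
        // 1e9 live refs, marker visits 1e8 refs/s, stop once 5e5 or fewer refs remain.
        System.out.println(passesUntilCaughtUp(1e9, 1e8, 1e7, 5e5, 50)); // light load
        System.out.println(passesUntilCaughtUp(1e9, 1e8, 5e7, 5e5, 50)); // heavier load
        System.out.println(passesUntilCaughtUp(1e9, 1e8, 2e8, 5e5, 50)); // mutator outruns marker
    }
}
```

In this model each pass shrinks the remaining work by the ratio of mutation rate to marking rate, so the pass count grows as load rises, and the marker never converges once the mutator modifies references faster than the marker can revisit them.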


Heap Population (Live Set): It’s not as simple as you might think…

• In a Stop-The-World situation, this is simple

─ Start with the “roots” and paint the world

─ Only things you have actual references to are alive

• When mutator runs concurrently with GC:

─ Not a “snapshot” of a single program state

─ Objects allocated during GC cycle are considered “live”

─ Objects that die after GC starts may be considered “live”

─ Weak references “strengthened” during GC…

• So assume:

─ Live_Set >= STW_live_set + (Allocation_Rate * Cycle_time)
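
A worked instance of the bound above, as a minimal sketch. The input numbers are made up for illustration and the class name is hypothetical:

```java
// Worked example of the slide's bound; the inputs are illustrative, not measurements.
final class ConcurrentLiveSet {
    /** Lower bound: Live_Set >= STW_live_set + (Allocation_Rate * Cycle_time). */
    static double minLiveSetGb(double stwLiveSetGb, double allocRateGbPerSec,
                               double cycleSeconds) {
        return stwLiveSetGb + allocRateGbPerSec * cycleSeconds;
    }

    public static void main(String[] args) {
        // 10 GB live at a snapshot, 0.5 GB/s allocation, 20 s GC cycle.
        System.out.println(minLiveSetGb(10.0, 0.5, 20.0) + " GB");
    }
}
```

With these inputs, the collector has to treat at least 20 GB as live during the cycle, twice the 10 GB a stop-the-world snapshot would report.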


Mutation rate: Does your program do any real work?

• Mutation rate is generally proportional to work performed

─ The higher the load, the higher the mutation rate

• A multi-pass marker can be sensitive to mutation:

─ Revisits references modified in each pass

─ A higher mutation rate means longer cycle times

─ Can reach a point where marker cannot keep up with mutator

─ e.g. one marking thread vs. 15 mutator threads

• Some common use patterns have high mutation rates

─ e.g. LRU cache
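
To see why an LRU cache mutates references even under a read-only workload, here is a minimal sketch built on the JDK's `LinkedHashMap` in access-order mode. The `LruCache` class name is hypothetical; the `LinkedHashMap` constructor and `removeEldestEntry` hook are standard JDK APIs:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example class; LinkedHashMap's access-order mode is standard JDK behavior.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: every get() relinks the entry
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // a pure read, yet it rewrites linked-list references
        cache.put("c", 3); // evicts "b", the least recently used key
        System.out.println(cache.keySet());
    }
}
```

Every `get` moves the entry to the tail of an internal linked list, so a large, long-lived cache keeps updating references between old objects at a rate that tracks the request rate.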


Object lifetime: Objects are active in the Old Generation

• Most allocated objects do die young

─ So generational collection is an effective filter

• However, most live objects are old

─ You’re not just making all those objects up every cycle…

• Large heaps tend to see real churn & real mutation

─ e.g. caching is a very common use pattern for large memory

• OldGen is under constant pressure in the real world

─ Unlike some/most benchmarks (e.g. SPECjbb)


Generational Assumption: When you are forced to collect live young things

• A lot of state dies when transactions are completed

─ Transactions typically take some minimum amount of time

• Load usually shows up as concurrently active state

─ More concurrent users & transactions mean more state

• Higher load always generates garbage faster

─ NewGen collections happen more often as load grows

• At some load point NewGen becomes very expensive

─ When NewGen GC cycles faster than transactions complete

─ Pauses are significantly longer if it uses a STW mechanism
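
The load point at which NewGen becomes expensive can be estimated with simple arithmetic. This sketch assumes NewGen is collected each time it fills, and that state belonging to an in-flight transaction is still live when that happens. The names and numbers are illustrative:

```java
// Back-of-the-envelope model; names and numbers are assumptions for illustration.
final class NewGenPressure {
    /** Seconds between NewGen collections: how long it takes to fill NewGen. */
    static double secondsBetweenCollections(double newGenMb, double allocRateMbPerSec) {
        return newGenMb / allocRateMbPerSec;
    }

    /** True when NewGen fills faster than a typical transaction completes. */
    static boolean collectsLiveTransactions(double newGenMb, double allocRateMbPerSec,
                                            double transactionSeconds) {
        return secondsBetweenCollections(newGenMb, allocRateMbPerSec) < transactionSeconds;
    }

    public static void main(String[] args) {
        // 512 MB NewGen, 2 second transactions, at two different load levels.
        System.out.println(collectsLiveTransactions(512, 100, 2.0)); // false: 5.12 s per cycle
        System.out.println(collectsLiveTransactions(512, 400, 2.0)); // true: 1.28 s per cycle
    }
}
```

Once the second case holds, most transactional state is still live at collection time, so the collector copies it instead of reclaiming it and the generational filter stops paying off.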


Major things that happen in a pause: The non-concurrent parts of “mostly concurrent”

• If collector does Reference processing in a pause

─ Weak, Soft, Final ref traversal

─ Pause length depends on # of refs.

─ Sensitive to common use cases of weak refs

─ e.g. LRU & multi-index cache patterns

• If the collector marks mutated refs in a pause

─ Pause length depends on mutation rate

─ Sensitive to load

• If the collector performs compaction in a pause…


Fragmentation & Compaction: You can’t delay it forever

• Fragmentation *will* happen

─ Compaction can be delayed, but not avoided

─ “Compaction is done with the application paused. However, it is a necessary evil, because without it, the heap will be useless…” (JRockit RT tuning guide).

• If Compaction is done as a stop-the-world pause

─ It will generally be your worst case pause

─ It is a likely failure of concurrent collection

• Measurements without compaction are meaningless

─ Unless you can prove that compaction won’t happen (Good luck with that)


More things that may happen in a pause: More “mostly concurrent” secrets

• When collector does Code & Class things in a pause

─ Class unloading, Code cache cleaning, System Dictionary, etc.

─ Can depend on class and code churn rates

─ Becomes a real problem if full collection is required (PermGen)

• GC/Mutator Synchronization, Safe Points

─ Can depend on time-to-safepoint affecting runtime artifacts:

─ Long running no-safepoint loops (some optimizers do this).

─ Huge object cloning, allocation (some runtimes won’t break it up).

• Stack scanning (look for refs in mutator stacks)

─ Can depend on # of threads and stack depths


HotSpot CMS: Collector mechanism examples

• Stop-the-world compacting new gen (ParNew)

• Mostly Concurrent, non-compacting old gen (CMS)

─ Mostly Concurrent marking

    ─ Mark concurrently while mutator is running

    ─ Track mutations in card marks

    ─ Revisit mutated cards (repeat as needed)

    ─ Stop-the-world to catch up on mutations, ref processing, etc.

─ Concurrent Sweeping

─ Does not Compact (maintains free list, does not move objects)

• Fallback to Full Collection (Stop the world, serial).

─ Used for Compaction, etc.
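
The “track mutations in card marks” step can be sketched as a tiny model. This is a simplified illustration of the general card-marking technique, not HotSpot's implementation. The class name, the method names and the 512-byte card size are assumptions made for the example:

```java
import java.util.BitSet;

// Simplified model of a card table; names and the 512-byte card size are assumptions.
final class CardTable {
    static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards
    private final BitSet dirty;

    CardTable(long heapBytes) {
        this.dirty = new BitSet((int) ((heapBytes + (1L << CARD_SHIFT) - 1) >>> CARD_SHIFT));
    }

    /** Write barrier: called when the mutator stores a reference at this heap offset. */
    void onReferenceStore(long heapOffset) {
        dirty.set((int) (heapOffset >>> CARD_SHIFT));
    }

    /** Marker: takes the set of dirty cards to revisit and clears them for the next pass. */
    BitSet takeDirtyCards() {
        BitSet snapshot = (BitSet) dirty.clone();
        dirty.clear();
        return snapshot;
    }

    public static void main(String[] args) {
        CardTable table = new CardTable(1 << 20);
        table.onReferenceStore(0);    // card 0
        table.onReferenceStore(511);  // still card 0
        table.onReferenceStore(512);  // card 1
        table.onReferenceStore(5000); // card 9
        System.out.println(table.takeDirtyCards()); // the cards the marker must revisit
    }
}
```

The marker only rescans memory covered by dirty cards, which is why the amount of revisiting work, and the length of the final stop-the-world catch-up, follows the mutation rate.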


Azul GPGC: Collector mechanism examples

• Concurrent, compacting new generation

• Concurrent, compacting old generation

• Concurrent guaranteed-single-pass marker

─ Oblivious to mutation rate

─ Concurrent ref (weak, soft, final) processing

• Concurrent Compactor

─ Objects moved without stopping mutator

─ Can relocate entire generation (New, Old) in every GC cycle

• No Stop-the-world fallback

─ Always compacts, and does so concurrently


Measurement Recommendations: When you are actually interested in the results…

• Measure application – not synthetic tests

─ Garbage in, Garbage out

• Avoid the urge to tune GC out of the testing window

─ You’re only fooling yourself

─ Your application needs to run for more than 20 minutes, right?

─ Most industry benchmarks are tuned to avoid GC during test

• Rule of Thumb:

─ You should see 5+ of the “bad” GCs during test period

─ Otherwise, you simply did not test real behavior

─ Test until you can show it’s stable (e.g. What if it trends up?)

─ Believe your application, not -verbosegc
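
One way to check the rule of thumb is to count collections per collector before and after the test window. The sketch below uses the standard `java.lang.management` MXBeans; the `GcActivity` class is a hypothetical helper. Which collector name corresponds to the “bad” (old generation or full) collections depends on the JVM and its configuration:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the MXBean calls are standard java.lang.management APIs.
final class GcActivity {
    /** Collection count per collector name, as reported by the JVM so far. */
    static Map<String, Long> collectionCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() returns -1 when the collector does not report a count.
            counts.put(gc.getName(), Math.max(0L, gc.getCollectionCount()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> before = collectionCounts();
        // ... run the stable-load test here ...
        Map<String, Long> after = collectionCounts();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            System.out.println(e.getKey() + ": " + delta + " collections during test");
        }
    }
}
```

These counters only show that collections happened. In keeping with the last bullet, response times should still be measured from the application's side.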


Don’t ignore “bad” GC: Compaction? What Compaction?


Measurement Techniques: Make reality happen

• Aim for 20-30 minute “stable load” tests

─ If test is longer, you won’t do it enough times to get good data

─ Don’t “ramp” load during test period – it will defeat the purpose

─ We want to see several days’ worth of GC in 20-30 minutes

• Add low-load noise to trigger “real” GC behavior

─ Don’t go overboard

─ A moderately churning large LRU cache can often do the trick

─ A gentle heap fragmentation inducer is a sure bet

─ Can easily be added orthogonally to application activity

─ See Azul’s “Fragger” example (http://e2e.azulsystems.com)
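
A gentle fragmentation inducer can be very small. The sketch below is not Azul's Fragger and makes no claim to match it. It is a hypothetical, minimal illustration of the general idea: retain objects of mixed sizes, let them die in random order, and replace them with slightly larger ones so the holes left behind tend not to fit:

```java
import java.util.Random;

// Not Azul's Fragger. A hypothetical, minimal sketch of the general idea only.
final class FragmentationNoise {
    private final byte[][] slots; // retained objects of mixed sizes
    private final Random random;
    private final int minSize;
    private final int maxSize;
    private int nextSize;

    FragmentationNoise(int slotCount, int minSize, int maxSize, long seed) {
        this.slots = new byte[slotCount][];
        this.random = new Random(seed);
        this.minSize = minSize;
        this.maxSize = maxSize;
        this.nextSize = minSize;
    }

    /** Overwrites `count` random slots; the old arrays die and leave scattered holes. */
    void churn(int count) {
        for (int i = 0; i < count; i++) {
            slots[random.nextInt(slots.length)] = new byte[nextSize];
            // Sizes creep upward so that new arrays tend not to fit older, smaller holes.
            nextSize = (nextSize + 8 > maxSize) ? minSize : nextSize + 8;
        }
    }

    /** Total bytes currently retained by the noise generator. */
    long retainedBytes() {
        long total = 0;
        for (byte[] slot : slots) {
            if (slot != null) {
                total += slot.length;
            }
        }
        return total;
    }
}
```

In a test harness this would typically run on a background thread with a large slot count and a low churn rate, so it adds old-generation pressure without dominating the application's own allocation.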


Establish smooth operating range: Know where it works, and know where it doesn’t…

• Test main metrics for sensitivity

• Stress Heap population, allocation, mutation, etc.

• Add artificial load-linear stress if needed

─ E.g. Increase allocation and mutation per transaction

─ E.g. Increase state per session, increase static state

─ E.g. Increase session length in time

─ Drive load with artificially enhanced GC stress

─ Keep increasing until you find out where GC breaks SLA in test

─ Then back off and test for stability
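
The “keep increasing, then back off” procedure can be sketched as a simple search. The measurement function is supplied by the caller and stands in for a full stable-load test run at a given stress level. All names, and the idea of a fixed back-off margin, are assumptions for illustration:

```java
import java.util.function.IntToLongFunction;

// Hypothetical sketch; the measurement function stands in for a real load test run.
final class OperatingRange {
    /**
     * Raises the stress level step by step until the measured worst pause breaks
     * the SLA, then returns the last level that still met it (or -1 if none did).
     */
    static int highestPassingLevel(IntToLongFunction worstPauseMsAtLevel,
                                   long slaMs, int maxLevel) {
        int lastGood = -1;
        for (int level = 1; level <= maxLevel; level++) {
            if (worstPauseMsAtLevel.applyAsLong(level) > slaMs) {
                break;
            }
            lastGood = level;
        }
        return lastGood;
    }

    /** Backs off from the breaking point by a safety margin, e.g. 0.2 for 20%. */
    static int backedOffLevel(int highestPassing, double margin) {
        return (int) Math.floor(highestPassing * (1.0 - margin));
    }

    public static void main(String[] args) {
        // Fake measurements: pauses stay at 50 ms up to level 7, then jump to 4 s.
        IntToLongFunction fake = level -> level <= 7 ? 50L : 4000L;
        int edge = highestPassingLevel(fake, 200L, 20);
        System.out.println("edge=" + edge + ", run at=" + backedOffLevel(edge, 0.2));
    }
}
```

The fake measurement mimics the cliff-like behavior described above: nothing degrades gradually before the level where the SLA breaks, which is why the edge has to be found by testing and not by extrapolation.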


Summary: Know where the cliff is, then stay away from the edge…

• Sensitivity is key

─ If it fails, it will be without warning

• Know where you stand on key measurable metrics

─ Application driven: Live Set, Allocation rate, Heap size

─ GC driven: Cycle times, Compaction Time, Pause times

• Deal with robustness first, and only then with efficiency

─ Efficient and 2% away from failure is not a good thing

• Establish your envelope

─ Only then will you know how safe (or unsafe) you are

http://e2e.azulsystems.com


Q & A

Remember:

Zing Announcement

JavaOne Ticket drawing

13:45 @“Double” Conf. room


Thank You

Performance Considerations in Concurrent Garbage Collected Systems
