Java Memory Model
JMM: Java Memory Model
Łukasz Koniecki, 24/10/2016
About me
Java Universe
Spring
MyFaces
JSF
Play
Spark
GWT
Vaadin
Tapestry
Wicket
Spring MVC
Struts
Grails
REST API
JPA
GC
JVM
Java EE
Tomcat
Goal
• Familiarize with the JMM,
• How does a processor work?
• Recall how the Java compiler and the JVM work,
• JIT in action,
• Explain what a data race and a correctly synchronized program are,
• Talk about synchronization and atomicity,
• Based on examples...
• Next-gen JMM...
§17.4 Memory Model
John von Neumann
Dummy program
public class Example {
    int i, j;
    public void myDummyMethod() {
        i += 1;
        j += 1;
        i += 1;
        ...
    }
}
The Java Memory Model for Practitioners: http://bit.ly/2cMXklJ
Program execution (diagram: RAM and cache connected by the system bus)
Step  RAM            Cache
1     i = 0, j = 0   (empty)
2     i = 0, j = 0   i = 0
3     i = 0, j = 0   i = 1
4     i = 1, j = 0   (empty)
5     i = 1, j = 0   j = 0
6     i = 1, j = 0   j = 1
7     i = 1, j = 1   j = 1
8     i = 1, j = 1   (empty)
Sequentially consistent execution
Processor technology
• ...
• 22 nm – 2012
• 14 nm – 2014
• 10 nm – 2017
• 7 nm – ~2019
• 5 nm – ~2021
Wikipedia: http://bit.ly/2cMWoNg
Processor vs. Memory Performance
How L1 and L2 CPU caches work, and why they’re an essential part of modern chips: http://bit.ly/2cpHu1x
Important latency numbers
Core i7 Xeon 5500 Series Data Source Latency (approximate)
local L1 cache hit: ~4 cycles (2.1 - 1.2 ns)
local L2 cache hit: ~10 cycles (5.3 - 3.0 ns)
local L3 cache hit, line unshared: ~40 cycles (21.4 - 12.0 ns)
local L3 cache hit, shared line in another core: ~65 cycles (34.8 - 19.5 ns)
local L3 cache hit, modified in another core: ~75 cycles (40.2 - 22.5 ns)
remote L3 cache (Ref: Fig.1 [Pg. 5]): ~100-300 cycles (160.7 - 30.0 ns)
local DRAM: ~60 ns
remote DRAM: ~100 ns
Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors: http://intel.ly/2cV1ZFZ
Cache latency
Weak vs. Strong hardware Memory Models
Weak vs. Strong Memory Models: http://bit.ly/2cC4avk
x86/x64 processor memory model
Reordering of an earlier and a later operation (R = read, W = write): R-R, R-W, W-R, W-W
Intel® 64 and IA-32 Architectures Software Developer’s Manual: http://intel.ly/2csMyB2
Processor P can read B before its write to A is seen by all processors (a processor can move its own reads in front of its own earlier writes to different locations).
How does the Java compiler work?
Source code → javac → Bytecode
JVM: Class loader, Bytecode verifier, JIT
JIT → Native code → OS
JIT
• Profile-guided,
• Speculatively optimizing,
• Backup strategies,
• Optimizes code for us,
• We don’t have to care so much about cache-related optimizations ourselves
Tiered compilation
(diagram: throughput over time, with the phases startup, interpreted, C1 (sampling) and C2 (full speed), and arrows labelled "deoptimize" and "bail to interpreter")
Interpreter
• extremely slow,
• not profiling
C1
• client,
• fast but dumb,
• does the profiling,
• e.g.: branches, type checks,
C2
• server,
• slow but clever,
• aggressively optimizing,
• based on profile,
• e.g.: loop optimizations (unswitching, unrolling), implicit null checking
Why do we need a JMM?
• Different platform memory models (none of them matches the JMM!),
• Many JVM implementations,
• People don’t know how to program concurrently,
• Programmers: write reliable multithreaded code,
• Compiler writers: implement optimizations that are legal according to the JLS,
• Compiler: produce fast and optimal native code,
JMM
• Actions: reads and writes of variables, locks and unlocks of monitors, starting and joining threads,
• Happens-before partial order,
• For a thread executing action B to be guaranteed to see the results of action A (in any thread), there must be a happens-before relationship between A and B,
• Otherwise the JVM is free to reorder them,
Happens-before orderings
• Unlock of a monitor / lock of that monitor,
• Write to a volatile variable / read of that variable,
• Call to start() / any action in the started thread,
• All actions in a thread / any other thread successfully returns from join() on that thread,
• Setting default values for variables, setting value to a final field in the constructor / constructor finish,
• Write to an Atomic variable / read from that variable,
• Many java.util.concurrent methods,
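As an illustration of the start() and join() orderings listed above, here is a minimal sketch. The class name StartJoinVisibility and its fields are made up for this example; they are not part of the original slides.

```java
public class StartJoinVisibility {
    // Plain (non-volatile) fields: visibility relies only on start() and join().
    static int input;
    static int output;

    static int run() {
        input = 20;                         // written before start()
        Thread worker = new Thread(() -> output = input + 1);
        worker.start();                     // start() happens-before every action in worker
        try {
            worker.join();                  // all worker actions happen-before join() returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return output;                      // guaranteed to observe 21
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

No volatile or synchronized is needed here, because the two thread-lifecycle edges already order the write of input before the read in the worker, and the write of output before the read in the caller.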
JMM
• A promise for programmers: sequential consistency must be sacrificed to allow optimizations, but it will still hold for data race free programs. This is the data race free (DRF) guarantee.
• A promise for security: even for programs with data races, values should not appear “out of thin air”, preventing unintended information leakage.
• A promise for compilers: common hardware and software optimizations should be allowed as far as possible without violating the first two requirements.
Java Memory Model Examples: Good, Bad and Ugly: http://bit.ly/2cZfF1I
Example
@NotThreadSafe
class DataRace {
    int a, b;
    int x, y;
    void thread1() {
        y = a;
        b = 1;
    }
    void thread2() {
        x = b;
        a = 2;
    }
}
y == 2, x == 1 ???
How can this happen?
• The processor can reorder instructions (out-of-order execution, Hyper-Threading),
• Lazy synchronization between caches and main memory,
• The compiler can reorder statements (or keep values in registers),
• Aggressive optimizations in the JIT,
Example (same DataRace class, reordering step by step)
1. Program order:
   Thread 1: y = a; b = 1;
   Thread 2: x = b; a = 2;
2. Thread 1 reordered:
   Thread 1: b = 1; y = a;
   Thread 2: x = b; a = 2;
3. Thread 2 reordered as well:
   Thread 1: b = 1; y = a;
   Thread 2: a = 2; x = b;
4. One possible interleaving over time:
   Thread 1: b = 1;
   Thread 2: a = 2;
   Thread 2: x = b;
   Thread 1: y = a;
   Result: y == 2, x == 1
Example of x86/x64 test results
Test using jcstress
@JCStressTest
@Description("Data race")
@Outcome(id = {"0, 0", "0, 1", "2, 0"}, expect = ACCEPTABLE,
         desc = "Trivial under sequential consistency")
@Outcome(id = {"2, 1"}, expect = ACCEPTABLE, desc = "Racy read of x")
@State
public class DataRace {
    int a, b;
    int x, y;
    @Actor
    void thread1(IntResult2 r) {
        y = a;
        b = 1;
        r.r1 = y;
    }
    @Actor
    void thread2(IntResult2 r) {
        x = b;
        a = 2;
        r.r2 = x;
    }
}
jcstress: http://bit.ly/2daSL5Q
Example of x86/x64 test results
R-R R-W
W-R W-W
Test results interpretation
Sequentially consistent interleavings (in time order) and the results they produce:
• y==0, x==0: both reads run before either write, e.g. y = a; x = b; b = 1; a = 2;
• y==0, x==1: Thread 1 runs entirely before Thread 2: y = a; b = 1; x = b; a = 2;
• y==2, x==0: Thread 2 runs entirely before Thread 1: x = b; a = 2; y = a; b = 1;
Visibility between threads
@ThreadSafe
public class DataRace {
    int a, b;
    int x, y;
    void thread1() {
        synchronized (this) {
            y = a;
            b = 1;
        }
    }
    void thread2() {
        synchronized (this) {
            x = b;
            a = 2;
        }
    }
}
Visibility between threads
(diagram: Thread 1 and Thread 2 over time; Thread 2 starts after Thread 1)
Program order within each thread, synchronization order between the threads.
Every operation that happens before an unlock (release) is visible to an operation that happens after a later lock (acquire): this is the happens-before order.
Thread 1: ... <enter this> y = a; b = 1; <exit this>
Thread 2: <enter this> x = b; a = 2; <exit this> ...
Possible results: y == 0, x == 1 or y == 2, x == 0
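The two possible results can be checked with a small harness. This is only a sketch: the class name SynchronizedOutcome and the helper methods runOnce and onlyAllowedOutcomes are hypothetical names introduced here, and a plain loop like this is far weaker than a real jcstress run.

```java
public class SynchronizedOutcome {
    int a, b;
    int x, y;

    void thread1() { synchronized (this) { y = a; b = 1; } }
    void thread2() { synchronized (this) { x = b; a = 2; } }

    // Runs both methods concurrently once and returns the result as "y, x".
    static String runOnce() {
        SynchronizedOutcome s = new SynchronizedOutcome();
        Thread t1 = new Thread(s::thread1);
        Thread t2 = new Thread(s::thread2);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.y + ", " + s.x;            // join() makes the final values visible here
    }

    // True if every run produced one of the two results allowed by the locking.
    static boolean onlyAllowedOutcomes(int runs) {
        for (int n = 0; n < runs; n++) {
            String r = runOnce();
            if (!r.equals("0, 1") && !r.equals("2, 0")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(onlyAllowedOutcomes(1000));
    }
}
```

Because each method body runs entirely inside the same monitor, one thread's block always completes before the other's begins, so "0, 0" and "2, 1" cannot occur.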
Synchronization
High level
• java.util.concurrent
Low level
• synchronized blocks and methods,
• java.util.concurrent.locks
Low level primitives
• volatile variables
• java.util.concurrent.atomic
Volatile
@ThreadUnsafe
public class Looper {
    static boolean done;
    public static void main(String[] args) throws InterruptedException {
        new Thread(new Runnable() {
            @Override
            public void run() {
                int count = 0;
                while (!done) {
                    count++;
                }
                System.out.println("Ending this task");
            }
        }).start();
        Thread.sleep(1000);
        System.out.println("Waiting done");
        done = true;
    }
}
Volatile
@ThreadSafe
public class Looper {
    volatile static boolean done;
    public static void main(String[] args) throws InterruptedException {
        new Thread(new Runnable() {
            @Override
            public void run() {
                int count = 0;
                while (!done) {
                    count++;
                }
                System.out.println("Ending this task");
            }
        }).start();
        Thread.sleep(1000);
        System.out.println("Waiting done");
        done = true;
    }
}
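(diagram: time runs top to bottom)
Thread 1: ... done = true; (program order)
Thread 2: while (!done) ... (program order)
Synchronization order: the volatile write done = true precedes the volatile read in while (!done)
Together these give the happens-before order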
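The same happens-before edge also publishes ordinary data written before the volatile write. A minimal sketch, assuming a hypothetical class VolatilePublish with a plain field data and a volatile flag ready (neither appears in the original slides):

```java
public class VolatilePublish {
    static int data;                     // plain field
    static volatile boolean ready;       // volatile flag used to publish data

    static int readAfterReady() {
        data = 0;
        ready = false;
        int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) {
                // spin until the volatile write becomes visible
            }
            seen[0] = data;              // ordered after the volatile read of ready
        });
        reader.start();
        data = 42;                       // (1) plain write
        ready = true;                    // (2) volatile write, publishes (1)
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];                  // 42: (1) happens-before the read of data
    }

    public static void main(String[] args) {
        System.out.println(readAfterReady());
    }
}
```

If ready were not volatile, the reader could legally spin forever or observe a stale data value.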
More about volatile
• Volatile reads are very cheap (no locking, unlike synchronized)
• Volatile increment is not atomic (!!!)
• Elements of a volatile array or collection are not volatile (e.g. volatile int[] makes only the array reference volatile)
• Consider using java.util.concurrent
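To make the non-atomic increment concrete, the sketch below increments a volatile int and an AtomicInteger side by side. CounterDemo and its count method are hypothetical names for this illustration; how many updates the volatile counter loses, if any, varies from run to run.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {
    static volatile int volatileCounter;                      // ++ is read, add, write: not atomic
    static final AtomicInteger atomicCounter = new AtomicInteger();

    // Returns { final volatile count, final atomic count }.
    static int[] count(int threads, int perThread) {
        volatileCounter = 0;
        atomicCounter.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int n = 0; n < perThread; n++) {
                    volatileCounter++;                        // updates may be lost
                    atomicCounter.incrementAndGet();          // atomic read-modify-write
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { volatileCounter, atomicCounter.get() };
    }

    public static void main(String[] args) {
        int[] r = count(4, 100_000);
        System.out.println("volatile: " + r[0] + ", atomic: " + r[1]);
    }
}
```

The atomic counter always ends at threads * perThread; the volatile counter can end at or below that value.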
What operations in Java are atomic?
• Reads/writes of variables of primitive types (except long and double, whose 64-bit values may be read or written as two 32-bit halves, JLS §17.7),
• Reads/writes of volatile variables of primitive types (including long and double),
• All reads/writes of references are always atomic (http://bit.ly/2c8kn8i),
• All operations on java.util.concurrent.atomic types,
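As one example of building a compound atomic operation from java.util.concurrent.atomic, the sketch below keeps a running maximum with a compareAndSet retry loop. AtomicMax and its methods are hypothetical names chosen for this illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

public class AtomicMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Atomically raises the stored maximum to candidate if candidate is larger.
    void offer(long candidate) {
        long current = max.get();
        while (candidate > current && !max.compareAndSet(current, candidate)) {
            current = max.get();         // another thread won the race: re-read and retry
        }
    }

    long get() {
        return max.get();
    }

    // Each thread offers a disjoint range of values; returns the overall maximum.
    static long maxOf(int threads, int perThread) {
        AtomicMax m = new AtomicMax();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int n = 0; n < perThread; n++) {
                    m.offer(base + n);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return m.get();
    }

    public static void main(String[] args) {
        System.out.println(maxOf(4, 1000));
    }
}
```

The largest value offered is threads * perThread - 1, and the retry loop ensures no larger value is ever overwritten by a smaller one.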
Examples
Be careful what you’re doing...
Double-checked locking
@ThreadSafe
public class DoubleCheckedLocking {
    private volatile Helper helper = null;
    public Helper getHelper() {
        if (helper == null) {
            synchronized (this) {
                if (helper == null)
                    helper = new Helper();
            }
        }
        return helper;
    }
}
The "Double-Checked Locking is Broken" Declaration: http://bit.ly/2cIDBnA
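A small harness can exercise the volatile variant from several threads. This is a sketch under assumptions: DclDemo, its nested stand-in Helper class and allThreadsSeeSameInstance are hypothetical, and the local variable is an optional refinement (one volatile read on the fast path), not something the slide shows.

```java
public class DclDemo {
    static class Helper { }              // hypothetical stand-in for the real Helper

    private volatile Helper helper;

    Helper getHelper() {
        Helper local = helper;           // single volatile read on the fast path
        if (local == null) {
            synchronized (this) {
                local = helper;          // re-check under the lock
                if (local == null) {
                    helper = local = new Helper();
                }
            }
        }
        return local;
    }

    // Calls getHelper() from several threads and checks they all got one instance.
    static boolean allThreadsSeeSameInstance(int threads) {
        DclDemo holder = new DclDemo();
        Helper[] seen = new Helper[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> seen[slot] = holder.getHelper());
            workers[i].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        for (Helper h : seen) {
            if (h == null || h != seen[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allThreadsSeeSameInstance(8));
    }
}
```

Without volatile on the field, a thread could observe a non-null reference to a Helper whose fields are not yet visible, which is the breakage the linked declaration describes.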
Final
@ThreadUnsafe
class UnsafePublication {
    private int a;
    private static UnsafePublication instance;
    private UnsafePublication() {
        a = 1;
    }
    void thread1() throws InterruptedException {
        instance = new UnsafePublication();
    }
    void thread2() {
        if (instance != null) {
            System.out.println(instance.a);
        }
    }
}
What state can thread 2 see?
instance == null, a == 0, or a == 1
Final
@ThreadSafe
class SafePublication {
    private final int a;
    private static SafePublication instance;
    private SafePublication() {
        a = 1;
    }
    void thread1() throws InterruptedException {
        instance = new SafePublication();
    }
    void thread2() {
        if (instance != null) {
            System.out.println(instance.a);
        }
    }
}
Next-JMM
• JEP 188,
• Improve formalization,
• JVM coverage,
• Extend scope,
• Testing support,
• Tool support,
• Enh: atomic r/w for long and double,
To sum up...
• Concurrent programming isn’t easy,
• Design your code for concurrency (make it right before you make it fast),
• Do not code against the implementation. Code against the specification,
• Use high level synchronization wherever possible,
• Watch out for useless synchronization,
• Use Thread Safe Immutable objects,
Further reading
• Aleksey Shipilëv: One Stop Page (http://bit.ly/2cqBt4x),
• Rafael Winterhalter: The Java Memory Model for Practitioners (http://bit.ly/2cMXklJ),
• Brian Goetz: Java Concurrency in Practice (http://amzn.to/2cloe76)
Thank you!