java memory model

JMM: Java Memory Model

Łukasz Koniecki, 24/10/2016

About me

Java universe: Spring, MyFaces, JSF, Play, Spark, GWT, Vaadin, Tapestry, Wicket, Spring MVC, Struts, Grails, REST API, JPA, GC, JVM, Java EE, Tomcat

Goal

• Get familiar with the JMM,

• How does a processor work?

• Recall how the Java compiler and the JVM work,

• JIT in action,

• Explain what a data race and a correctly synchronized program are,

• Talk about synchronization and atomicity,

• Based on examples...

• Next-gen JMM...

JLS §17.4 Memory Model

John von Neumann

Wikipedia: http://bit.ly/2cMU0GB

Von Neumann Architecture


Dummy program

public class Example {

    int i, j;

    public void myDummyMethod() {
        i += 1;
        j += 1;
        i += 1;
        ...
    }
}

Program execution

Diagram: RAM, cache and system bus while myDummyMethod() runs. Frame by frame:

1. RAM: i = 0, j = 0; cache: empty
2. RAM: i = 0, j = 0; cache: i = 0
3. RAM: i = 0, j = 0; cache: i = 1
4. RAM: i = 1, j = 0; cache: empty
5. RAM: i = 1, j = 0; cache: j = 0
6. RAM: i = 1, j = 0; cache: j = 1
7. RAM: i = 1, j = 1; cache: j = 1
8. RAM: i = 1, j = 1; cache: empty

Sequentially consistent execution

The Java Memory Model for Practitioners: http://bit.ly/2cMXklJ

Haswell-E processor

PC World: http://bit.ly/2cE9f7q

Moore’s Law

Our World in Data: http://bit.ly/1NLxNcH

Moore’s Law (chart annotation: 2006)

Processor technology

• ...

• 22 nm – 2012

• 14 nm – 2014

• 10 nm – 2017

• 7 nm – ~2019

• 5 nm – ~2021

Wikipedia: http://bit.ly/2cMWoNg


Processor vs. Memory Performance

How L1 and L2 CPU caches work, and why they’re an essential part of modern chips: http://bit.ly/2cpHu1x


Wikipedia: http://bit.ly/2cm33me

Cache hierarchy in a modern processor


Important latency numbers

Cache latency: Core i7 / Xeon 5500 series, data source latency (approximate)

local L1 cache hit: ~4 cycles (2.1 - 1.2 ns)
local L2 cache hit: ~10 cycles (5.3 - 3.0 ns)
local L3 cache hit, line unshared: ~40 cycles (21.4 - 12.0 ns)
local L3 cache hit, shared line in another core: ~65 cycles (34.8 - 19.5 ns)
local L3 cache hit, modified in another core: ~75 cycles (40.2 - 22.5 ns)
remote L3 cache (Ref: Fig.1 [Pg. 5]): ~100-300 cycles (160.7 - 30.0 ns)
local DRAM: ~60 ns
remote DRAM: ~100 ns

Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors: http://intel.ly/2cV1ZFZ

Weak vs. Strong hardware Memory Models

Weak vs. Strong Memory Models: http://bit.ly/2cC4avk


x86/x64 processor memory model

R-R R-W

W-R W-W

Intel® 64 and IA-32 Architectures Software Developer’s Manual: http://intel.ly/2csMyB2

Processor P can read B before its write to A is seen by all processors (a processor can move its own reads in front of its own writes).

How does the Java compiler work?

Diagram labels: source code, javac, bytecode, bytecode verifier, class loader, JIT, JVM, native code, OS.

JIT

• Profile guided,

• Speculatively optimizing,

• Backup strategies,

• Optimizes code for us,

• We don’t have to care so much about cache-wise operations

Tiered compilation

Diagram: throughput over time, from startup through interpreted code, C1 and C2; sampling, then full speed; deoptimize (bail to interpreter).

Interpreter

• extremely slow,

• not profiling

C1 compiler

• client,

• fast but dumb,

• does the profiling,

• e.g.: branches, type checks

C2 compiler

• server,

• slow but clever,

• aggressively optimizing,

• based on profile,

• e.g.: loop optimizations (unswitching, unrolling), implicit null checking

Why do we need a JMM?

• Different platform memory models (none of them matches the JMM!),

• Many JVM implementations,

• People don’t know how to program concurrently,

• Programmers: write reliable, multithreaded code,

• Compiler writers: implement optimizations that are legal according to the JLS,

• Compiler: produce fast and optimal native code,

JMM

• Action: read or write of a variable, lock or unlock of a monitor, starting or joining a thread,

• Happens-before: a partial order over actions,

• For a thread executing action B to be guaranteed to see the results of action A (performed by any thread), there must be a happens-before relationship between A and B,

• Otherwise the JVM is free to reorder them,

Happens-before orderings

• Unlock of a monitor / lock of that monitor,

• Write to a volatile variable / read of that variable,

• Call to start() / any action in the started thread,

• All actions in a thread / another thread successfully returning from join() on that thread,

• Setting default values for variables, setting value to a final field in the constructor / constructor finish,

• Write to an Atomic variable / read from that variable,

• Many java.util.concurrent methods,
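The start() and join() orderings from the list above can be sketched as follows. This is a minimal illustration, not code from the talk: the class name StartJoinVisibility and the helper doubleInWorker are hypothetical. The fields are deliberately plain (neither volatile nor guarded by a lock), so visibility rests only on the two happens-before edges.

```java
class StartJoinVisibility {

    // Plain (non-volatile) fields: visibility relies only on start() and join().
    private int input;
    private int result;

    // Hypothetical helper: doubles a value in a worker thread.
    static int doubleInWorker(int value) {
        StartJoinVisibility shared = new StartJoinVisibility();
        shared.input = value;                     // written before start()
        Thread worker = new Thread(() -> shared.result = shared.input * 2);
        worker.start();                           // start() happens-before every action in the worker
        try {
            worker.join();                        // the worker's actions happen-before join() returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the worker", e);
        }
        return shared.result;                     // guaranteed to observe the worker's write
    }

    public static void main(String[] args) {
        System.out.println(doubleInWorker(21));   // prints 42
    }
}
```

Without the join() call, the final read would race with the worker's write and could legally return 0.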

JMM

• A promise for programmers: sequential consistency must be sacrificed to allow optimizations, but it will still hold for data race free programs. This is the data race free (DRF) guarantee.

• A promise for security: even for programs with data races, values should not appear “out of thin air”, preventing unintended information leakage.

• A promise for compilers: common hardware and software optimizations should be allowed as far as possible without violating the first two requirements.

Java Memory Model Examples: Good, Bad and Ugly: http://bit.ly/2cZfF1I


Example

@NotThreadSafe
class DataRace {

    int a, b;
    int x, y;

    void thread1() {
        y = a;
        b = 1;
    }

    void thread2() {
        x = b;
        a = 2;
    }
}

y == 2, x == 1 ???

How can this happen?

• The processor can reorder instructions (out-of-order execution, hyper-threading),

• Lazy synchronization between caches and main memory,

• The compiler can reorder statements (or keep values in registers),

• Aggressive optimizations in the JIT,

Example: how y == 2, x == 1 can arise

Timeline frames (Thread 1 runs thread1(), Thread 2 runs thread2()):

1. Program order: Thread 1: y = a; b = 1; Thread 2: x = b; a = 2;
2. Thread 1 reordered: Thread 1: b = 1; y = a; Thread 2: x = b; a = 2;
3. Thread 2 reordered as well: Thread 1: b = 1; y = a; Thread 2: a = 2; x = b;
4. Interleaved over time: b = 1; a = 2; x = b; y = a;

y == 2, x == 1

Example of x86/x64 test results

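Test using jcstress

// Imports assume a jcstress version that still ships IntResult2.
import org.openjdk.jcstress.annotations.*;
import org.openjdk.jcstress.infra.results.IntResult2;
import static org.openjdk.jcstress.annotations.Expect.ACCEPTABLE;

@JCStressTest
@Description("Data race")
@Outcome(id = {"0, 0", "0, 1", "2, 0"}, expect = ACCEPTABLE,
         desc = "Trivial under sequential consistency")
@Outcome(id = {"2, 1"}, expect = ACCEPTABLE, desc = "Racy read of x")
@State
public class DataRace {

    int a, b;
    int x, y;

    @Actor
    void thread1(IntResult2 r) {
        y = a;
        b = 1;
        r.r1 = y;   // first result: y
    }

    @Actor
    void thread2(IntResult2 r) {
        x = b;
        a = 2;
        r.r2 = x;   // second result: x
    }
}

jcstress: http://bit.ly/2daSL5Q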

Example of x86/x64 test results

R-R R-W

W-R W-W

Test results interpretation

Outcomes: y == 0, x == 0; y == 0, x == 1; y == 2, x == 0

Timeline: Thread 1 runs y = a; b = 1; Thread 2 runs x = b; a = 2;

Visibility between threads

@ThreadSafe
public class DataRace {

    int a, b;
    int x, y;

    void thread1() {
        synchronized (this) {
            y = a;
            b = 1;
        }
    }

    void thread2() {
        synchronized (this) {
            x = b;
            a = 2;
        }
    }
}

Visibility between threads

Diagram (Thread 1 and Thread 2 over time; Th2 starts after Th1): program order within each thread, synchronization order between the unlock and the later lock, together giving the happens-before order.

Every operation that happens before an unlock (release) is visible to an operation that happens after a later lock (acquire).

Thread 1: <enter this> y = a; b = 1; <exit this>

Thread 2: <enter this> x = b; a = 2; <exit this>

Possible results: y == 0, x == 1 or y == 2, x == 0
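The two possible results can be checked with a small harness. This is only a sketch: the class name SynchronizedOutcome and the helper runOnce are hypothetical, and a plain loop like this is far weaker than jcstress at provoking interesting interleavings. It does show that with both bodies inside synchronized (this), only "0, 1" and "2, 0" (printed as "y, x") can occur.

```java
import java.util.Set;
import java.util.TreeSet;

class SynchronizedOutcome {

    int a, b;
    int x, y;

    void thread1() {
        synchronized (this) { y = a; b = 1; }
    }

    void thread2() {
        synchronized (this) { x = b; a = 2; }
    }

    // Hypothetical helper: runs both methods concurrently once, returns "y, x".
    static String runOnce() {
        SynchronizedOutcome state = new SynchronizedOutcome();
        Thread t1 = new Thread(state::thread1);
        Thread t2 = new Thread(state::thread2);
        t1.start();
        t2.start();
        try {
            t1.join();   // join() makes both threads' writes visible to the caller
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return state.y + ", " + state.x;
    }

    public static void main(String[] args) {
        Set<String> seen = new TreeSet<>();
        for (int i = 0; i < 10_000; i++) {
            seen.add(runOnce());
        }
        System.out.println(seen);   // a subset of [0, 1, 2, 0] pairs: "0, 1" and/or "2, 0"
    }
}
```

Whichever thread takes the monitor first runs its whole block before the other one starts, so the racy outcomes "0, 0" and "2, 1" are excluded.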

Synchronization

High level

• java.util.concurrent

Low level

• synchronized blocks and methods,

• java.util.concurrent.locks

Low level primitives

• volatile variables,

• java.util.concurrent.atomic
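The three levels can be put side by side in one sketch. The class name SynchronizationLevels, the key "page" and the helper run are assumptions made for illustration. The same hit counter is kept three ways: in a ConcurrentHashMap (high level), under a ReentrantLock (low level) and in an AtomicInteger (low level primitive).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class SynchronizationLevels {

    // High level: the concurrent collection handles synchronization internally.
    private final ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>();

    // Low level: an explicit lock guards a plain field.
    private final ReentrantLock lock = new ReentrantLock();
    private int lockedCounter;

    // Low level primitive: a lock-free atomic variable.
    private final AtomicInteger atomicCounter = new AtomicInteger();

    void recordHit(String key) {
        hits.merge(key, 1, Integer::sum);   // atomic read-modify-write per key
        lock.lock();
        try {
            lockedCounter++;
        } finally {
            lock.unlock();                  // always release, even on exceptions
        }
        atomicCounter.incrementAndGet();
    }

    // Hypothetical helper: returns {map count, locked count, atomic count}.
    static int[] run(int threads, int hitsPerThread) {
        SynchronizationLevels levels = new SynchronizationLevels();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < hitsPerThread; i++) {
                    levels.recordHit("page");
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return new int[] {
            levels.hits.getOrDefault("page", 0), levels.lockedCounter, levels.atomicCounter.get()
        };
    }

    public static void main(String[] args) {
        int[] counts = run(4, 10_000);
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);   // prints 40000 40000 40000
    }
}
```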

Volatile

@NotThreadSafe
public class Looper {

    static boolean done;

    public static void main(String[] args) throws InterruptedException {
        new Thread(new Runnable() {
            @Override
            public void run() {
                int count = 0;
                while (!done) {   // plain read: may never observe the write below
                    count++;
                }
                System.out.println("Ending this task");
            }
        }).start();

        Thread.sleep(1000);
        System.out.println("Waiting done");
        done = true;
    }
}

Volatile

@ThreadSafe
public class Looper {

    volatile static boolean done;

    public static void main(String[] args) throws InterruptedException {
        new Thread(new Runnable() {
            @Override
            public void run() {
                int count = 0;
                while (!done) {
                    count++;
                }
                System.out.println("Ending this task");
            }
        }).start();

        Thread.sleep(1000);
        System.out.println("Waiting done");
        done = true;
    }
}

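Diagram (Thread 1 and Thread 2 over time): Thread 1 writes done = true; Thread 2 reads it in while (!done). Program order within each thread plus the synchronization order between the volatile write and the volatile read gives the happens-before order.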

More about volatile

• Volatile reads are very cheap (no locks, unlike synchronized),

• Volatile increment is not atomic (!!!),

• Elements of a volatile array or collection are not volatile (e.g. volatile int[] makes only the reference volatile),

• Consider using java.util.concurrent
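The point about volatile increments can be made concrete with a sketch that runs the same number of increments on a volatile int and on an AtomicInteger. The class name VolatileIncrement and the helper run are hypothetical. The volatile counter may lose updates because ++ is a separate read, add and write, while the atomic counter always reaches the exact total.

```java
import java.util.concurrent.atomic.AtomicInteger;

class VolatileIncrement {

    private volatile int volatileCounter;                    // visible, but ++ is not atomic
    private final AtomicInteger atomicCounter = new AtomicInteger();

    // Hypothetical helper: returns {volatile total, atomic total}.
    static int[] run(int threads, int incrementsPerThread) {
        VolatileIncrement counters = new VolatileIncrement();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counters.volatileCounter++;               // racy read-modify-write
                    counters.atomicCounter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return new int[] { counters.volatileCounter, counters.atomicCounter.get() };
    }

    public static void main(String[] args) {
        int[] totals = run(4, 100_000);
        System.out.println("volatile: " + totals[0] + " (at most 400000)");
        System.out.println("atomic:   " + totals[1]);
    }
}
```

How many updates the volatile counter loses depends on the machine and the run, so only the atomic total is predictable.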

What operations in Java are atomic?

• Read/write on variables of primitive types (except non-volatile long and double, which may be written as two separate 32-bit halves, see JLS §17.7),

• Read/write on volatile variables of primitive type (including long and double),

• All read/writes to references are always atomic (http://bit.ly/2c8kn8i),

• All operations on java.util.concurrent.atomic types,
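Compound updates built on the atomic types usually take the form of a compare-and-set retry loop. The sketch below tracks a running maximum in an AtomicLong. The class name AtomicMax and the helper concurrentMax are hypothetical, and the loop is just one common way to write such an update.

```java
import java.util.concurrent.atomic.AtomicLong;

class AtomicMax {

    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Raises the stored maximum to candidate if candidate is larger.
    void offer(long candidate) {
        long current = max.get();
        // Retry until the candidate is no longer larger, or our CAS wins.
        while (candidate > current && !max.compareAndSet(current, candidate)) {
            current = max.get();
        }
    }

    long get() {
        return max.get();
    }

    // Hypothetical helper: thread t offers the values t*n .. t*n + n - 1.
    static long concurrentMax(int threads, int valuesPerThread) {
        AtomicMax tracker = new AtomicMax();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final long base = (long) t * valuesPerThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < valuesPerThread; i++) {
                    tracker.offer(base + i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return tracker.get();
    }

    public static void main(String[] args) {
        System.out.println(concurrentMax(4, 1_000));   // prints 3999
    }
}
```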


Examples: be careful what you’re doing...

Double-checked locking

@ThreadSafepublic class DoubleCheckedLocking {

private volatile Helper helper = null;

public Helper getHelper() {

if (helper == null) {

synchronized (this) {

if (helper == null)

helper = new Helper();

}

}

return helper;

}

}

The "Double-Checked Locking is Broken" Declaration: http://bit.ly/2cIDBnA
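For static singletons, a commonly used alternative to double-checked locking is the initialization-on-demand holder idiom. It is a different technique from the one on the slide and does not apply to per-instance lazy fields. In this sketch the nested Helper class is a hypothetical stand-in for the slide's Helper, and sameInstanceAcrossThreads is a hypothetical check.

```java
class HolderIdiom {

    // Hypothetical stand-in for the Helper class used on the slide.
    static final class Helper {
        final long createdAt = System.nanoTime();
    }

    // The JVM initializes Holder lazily, on first access, and class
    // initialization is thread-safe, so no volatile or synchronized is needed.
    private static final class Holder {
        static final Helper INSTANCE = new Helper();
    }

    static Helper getHelper() {
        return Holder.INSTANCE;
    }

    // Hypothetical check: every thread must observe the same instance (threads >= 1).
    static boolean sameInstanceAcrossThreads(int threads) {
        Helper[] seen = new Helper[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t;
            workers[t] = new Thread(() -> seen[slot] = getHelper());
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        for (Helper helper : seen) {
            if (helper == null || helper != seen[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameInstanceAcrossThreads(8));   // prints true
    }
}
```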


Final

@NotThreadSafe
class UnsafePublication {

private int a;

private static UnsafePublication instance;

private UnsafePublication() {

a = 1;

}

void thread1() throws InterruptedException {

instance = new UnsafePublication();

}

void thread2() {

if (instance != null) {

System.out.println(instance.a);

}

}

}

What state can thread 2 see?

null, 0, 1

Final

@ThreadSafe
class SafePublication {

private final int a;

private static SafePublication instance;

private SafePublication() {

a = 1;

}

void thread1() throws InterruptedException {

instance = new SafePublication();

}

void thread2() {

if (instance != null) {

System.out.println(instance.a);

}

}

}

Next-JMM

• JEP 188,

• Improve formalization,

• JVM coverage,

• Extend scope,

• Testing support,

• Tool support,

• Enh: atomic r/w for long and double,

To sum up...

• Concurrent programming isn’t easy,

• Design your code for concurrency (make it right before you make it fast),

• Do not code against the implementation. Code against the specification,

• Use high level synchronization wherever possible,

• Watch out for useless synchronization,

• Use Thread Safe Immutable objects,
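The last recommendation can be illustrated with a small immutable value class. ImmutablePoint and its withX method are hypothetical names chosen for this sketch. All fields are final and set once in the constructor, and "modification" returns a new object, so instances can be shared between threads without locking.

```java
final class ImmutablePoint {

    private final int x;   // final fields: safely visible once the constructor finishes
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int x() {
        return x;
    }

    int y() {
        return y;
    }

    // Instead of mutating, return a modified copy; the original is untouched.
    ImmutablePoint withX(int newX) {
        return new ImmutablePoint(newX, y);
    }

    public static void main(String[] args) {
        ImmutablePoint origin = new ImmutablePoint(0, 0);
        ImmutablePoint moved = origin.withX(5);
        System.out.println(origin.x() + " " + moved.x());   // prints 0 5
    }
}
```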

Further reading

• Aleksey Shipilëv: One Stop Page (http://bit.ly/2cqBt4x),

• Rafael Winterhalter: The Java Memory Model for Practitioners (http://bit.ly/2cMXklJ),

• Brian Goetz: Java Concurrency in Practice (http://amzn.to/2cloe76)

Thank you!
