
© 2017 Arm Limited

SFO17-314 Optimizing Golang for High Performance with ARM64 Assembly

Wei Xiao

Staff Software Engineer


September 27, 2017

Linaro Connect SFO17


Agenda

• Introduction

• Differences from GNU Assembly

• Integrate assembly into Golang

• Optimize CRC32 for arm64

• Optimize SHA256 for arm64

• Optimize IndexByte for arm64

• Work Summary and Next steps


Introduction

• Assembly optimization benefits

• Take advantage of ARMv8 capabilities

– Hardware-specific instructions (such as SVC, AES, SHA, etc.)

– Vector (Single Instruction Multiple Data) Instructions

• Others

– No need for CGo dependency

– Avoid runtime context switching overhead

– Optimized code (vs Go compiler)

– Faster compilation



Assembly Optimization Current Status

• Go Standard packages with assembly optimization

crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5

crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512

hash/crc32 math math/big reflect

runtime runtime/cgo runtime/internal/atomic runtime/internal/sys

strings sync/atomic syscall …

red – arm64 optimization ongoing

black – no arm64 optimization


Assembly Terminology

• Mnemonic

• CALL, MOVW, MOVD, …

• Register

• R1, F0, V3, …

• Immediate

• $1, $0x100, …

• Memory

• (R1), 8(R3), …

Registers in AArch64


Instruction Differences from GNU Assembly

• Semi-abstract instruction set (Plan 9 from Bell Labs)

• Architecture independent mnemonics like MOVD

• Some architecture aspects shine through

• Assembler may insert prologues, remove ‘unreachable’ instructions

• Instructions may be expanded by the assembler

• Not all instructions available

• BYTE/WORD/LONG directives to lay down opcodes into instruction stream directly

    // func Add(a, b int) int
    TEXT ·Add(SB),$0-24
        MOVD a+0(FP), R0     // load first argument
        MOVD b+8(FP), R1     // load second argument
        ADD  R1, R0, R0      // R0 = R0 + R1
        MOVD R0, ret+16(FP)  // store the result
        RET


Operand Differences from GNU Assembly

• Data flow from left to right

• ADD R1, R2 → R2 += R1

• SUBW R12<<29, R7, R8 → R8 = R7 – (R12<<29)

• Memory operands: base + offset

• MOVH (R1), R2 → R2 = *R1

• MOVBU 8(R3), R4 → R4 = *(8 + R3)

• MOVD mypackage·myvar(SB), R8 → R8 = myvar (the value stored at the symbol)

• Addresses

• MOVD $8(R1), R3 → R3 = R1 + 8

• MOVD $·myvar(SB), R4 → R4 = &myvar

    package mypackage
    var myvar int64

The "·" in symbol names is the middle dot, Unicode U+00B7.


Go Assembly Extension for arm64

• Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd

• Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T>

• Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd

• Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>]

• Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>]

• Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go

• Full details

• https://go-review.googlesource.com/c/go/+/41654



Assembly Build Rule

• Toolchain will select appropriate assembly files according to GOOS+GOARCH

• Using file name suffixes (_GOOS_GOARCH.s), e.g.

• sys_linux_arm64.s

• sys_darwin_arm64.s

• Example: assembly files for: hash/crc32

• crc32_amd64p32.s

• crc32_amd64.s

• crc32_arm64.s

• crc32_ppc64le.s crc32_table_ppc64le.s

• crc32_s390x.s
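The suffix rule above can be modeled in a few lines of Go. This is a simplified sketch, not the toolchain's actual logic: the function name matchesTarget and the small knownOS/knownArch tables are assumptions made for illustration, and build tags inside the file are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative subsets only; the real toolchain recognizes many more values.
var knownOS = map[string]bool{"linux": true, "darwin": true, "windows": true}
var knownArch = map[string]bool{"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true}

// matchesTarget (hypothetical helper) reports whether an assembly file would
// be selected for goos/goarch based on its name suffix alone.
func matchesTarget(name, goos, goarch string) bool {
	parts := strings.Split(strings.TrimSuffix(name, ".s"), "_")
	n := len(parts)
	switch {
	case n >= 3 && knownOS[parts[n-2]] && knownArch[parts[n-1]]:
		// name_GOOS_GOARCH.s
		return parts[n-2] == goos && parts[n-1] == goarch
	case n >= 2 && knownArch[parts[n-1]]:
		// name_GOARCH.s
		return parts[n-1] == goarch
	case n >= 2 && knownOS[parts[n-1]]:
		// name_GOOS.s
		return parts[n-1] == goos
	}
	return true // no recognized suffix: not restricted by name
}

func main() {
	files := []string{"sys_linux_arm64.s", "sys_darwin_arm64.s", "crc32_arm64.s", "crc32_amd64.s"}
	for _, f := range files {
		fmt.Printf("%-20s selected for linux/arm64: %v\n", f, matchesTarget(f, "linux", "arm64"))
	}
}
```

Under this model only sys_linux_arm64.s and crc32_arm64.s are selected when building for linux/arm64.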


Prototype

• Function call is the bridge between Go and assembly

• Function declaration

• src/runtime/timestub.go

• func walltime() (sec int64, nsec int32)

• Function assembly implementation

• runtime/sys_linux_arm64.s

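[Figure: annotated TEXT directive of the form TEXT package·function(SB), flag, $framesize-argsize. Labeled parts: package (optional), middle dot, function name, flag (optional), stack frame size, arguments size (optional)]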


Pseudo-registers

• FP: Frame Pointer

• Points to the bottom of the argument list

• Offsets are positive

• Offsets must include a name, e.g. arg+0(FP)

• SP: Stack Pointer

• Points to the top of the space allocated for local variables

• Offsets are negative

• Offsets must include a name, e.g. ptr-8(SP)

• SB: Static Base

• Named offsets from a global base

[Figure: two stack diagrams for the FP and SP pseudo-registers, each drawn from low address to high address]


Calling Convention

• All arguments are passed on the stack

• Offsets from FP

• Return arguments follow input arguments

• Start of return arguments aligned to pointer size

• All registers are caller saved, except:

• Stack pointer register (RSP)

• G context pointer register (R28)

• Frame pointer (R29)
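The stack-based argument layout described above can be worked through as a small calculation. This is a sketch under stated assumptions: the field type and the layout helper are hypothetical names, pointer size is fixed at 8 bytes for arm64, and only the rules listed on this slide are modeled.

```go
package main

import "fmt"

// field (hypothetical type) describes one argument or result slot.
type field struct {
	name  string
	size  int // bytes
	align int // bytes
}

const ptrSize = 8 // arm64

func alignUp(off, a int) int { return (off + a - 1) / a * a }

// layout (hypothetical helper) assigns FP offsets to the arguments, then to
// the results, and returns the offsets together with the total size.
func layout(args, results []field) (map[string]int, int) {
	offs := make(map[string]int)
	off := 0
	for _, f := range args {
		off = alignUp(off, f.align)
		offs[f.name] = off
		off += f.size
	}
	if len(results) > 0 {
		off = alignUp(off, ptrSize) // results start at a pointer-aligned offset
	}
	for _, f := range results {
		off = alignUp(off, f.align)
		offs[f.name] = off
		off += f.size
	}
	return offs, off
}

func main() {
	word := func(n string) field { return field{n, 8, 8} }
	u32 := func(n string) field { return field{n, 4, 4} }

	// func Add(a, b int) int
	_, size := layout([]field{word("a"), word("b")}, []field{word("ret")})
	fmt.Println("Add argsize:", size)

	// func castagnoliUpdate(crc uint32, p []byte) uint32
	// A slice occupies three words: base pointer, length, capacity.
	offs, size := layout(
		[]field{u32("crc"), word("p_base"), word("p_len"), word("p_cap")},
		[]field{u32("ret")})
	fmt.Println("castagnoliUpdate ret offset:", offs["ret"], "argsize:", size)
}
```

The computed sizes, 24 and 36, agree with the $0-24 and $0-36 frame specifications used by the Add and castagnoliUpdate examples in this deck.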


arm64 Stack Frame

[Figure: arm64 stack frame layout without a frame pointer and with a frame pointer, drawn from low address to high address]


Optimize CRC32 for arm64 – Before

• Pure Go table-driven implementation

src/hash/crc32/crc32_generic.go

    func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
        crc = ^crc
        for _, v := range p {
            crc = tab[byte(crc)^v] ^ (crc >> 8)
        }
        return ^crc
    }


Optimize CRC32 for arm64 – After

• Assembly for arm64

src/hash/crc32/crc32_arm64.s

    // func castagnoliUpdate(crc uint32, p []byte) uint32
    TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
        MOVWU crc+0(FP), R9      // CRC value
        MOVD  p+8(FP), R13       // data pointer
        MOVD  p_len+16(FP), R11  // len(p)

        CMP $8, R11
        BLT less_than_8

    update:
        MOVD.P  8(R13), R10
        CRC32CX R10, R9
        SUB     $8, R11

        CMP $8, R11
        BLT less_than_8

        JMP update

        …

    done:
        MOVWU R9, ret+32(FP)
        RET

Argument frame layout (offsets from FP):

    crc      0(FP)
    p.base   8(FP)
    p.len   16(FP)
    p.cap   24(FP)
    ret     32(FP)


Optimize CRC32 for arm64 – Result

• Optimization with assembly

• 2X-7X speedup


Optimize SHA256 for arm64

• SHA256 introduction

SHA-256: 512-bit block, 64 rounds, 32-bit words, 32-bit round constants K, 256-bit hash


Optimize SHA256 for arm64 – Message schedule

src/crypto/sha256/sha256block.go

    for i := 0; i < 16; i++ {
        j := i * 4
        w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
    }
    for i := 16; i < 64; i++ {
        v1 := w[i-2]
        t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
        v2 := w[i-15]
        t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
        w[i] = t1 + w[i-7] + t2 + w[i-16]
    }

With arm64 instructions (four schedule words per iteration):

    for i := 16; i < 64; i+=4 {
        SHA256SU0 Vn.S4, Vd.S4
        SHA256SU1 Vm.S4, Vn.S4, Vd.S4
    }
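The message schedule can be tried in isolation. The following is a minimal pure-Go sketch, not the package's code: the names sigma0, sigma1, and schedule are chosen for this example, and the shift pairs from the slide are written as rotations with math/bits.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// sigma0 and sigma1 are the two mixing functions of the message schedule.
// RotateLeft32 with a negative count rotates right.
func sigma0(x uint32) uint32 {
	return bits.RotateLeft32(x, -7) ^ bits.RotateLeft32(x, -18) ^ (x >> 3)
}

func sigma1(x uint32) uint32 {
	return bits.RotateLeft32(x, -17) ^ bits.RotateLeft32(x, -19) ^ (x >> 10)
}

// schedule expands one 64-byte block into the 64-word message schedule.
func schedule(block *[64]byte) [64]uint32 {
	var w [64]uint32
	for i := 0; i < 16; i++ {
		w[i] = binary.BigEndian.Uint32(block[i*4:])
	}
	for i := 16; i < 64; i++ {
		w[i] = sigma1(w[i-2]) + w[i-7] + sigma0(w[i-15]) + w[i-16]
	}
	return w
}

func main() {
	// Single padded block for the three-byte message "abc".
	var block [64]byte
	copy(block[:], "abc")
	block[3] = 0x80 // padding marker
	block[63] = 24  // message length in bits
	w := schedule(&block)
	fmt.Printf("w[16]=%08x w[17]=%08x\n", w[16], w[17])
}
```

Each w[i] depends on four earlier words, which is why the hardware instructions can produce four new words per SHA256SU0/SHA256SU1 pair.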


Optimize SHA256 for arm64 – Hash Computation

src/crypto/sha256/sha256block.go

    for i := 0; i < 64; i++ {
        t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]

        t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))

        h = g
        g = f
        f = e
        e = d + t1
        d = c
        c = b
        b = a
        a = t1 + t2
    }

With arm64 instructions (four rounds per iteration):

    for i := 0; i < 64; i+=4 {
        SHA256H Vm, Vn, Vd.4S
        SHA256H2 Vm, Vn, Vd.4S
    }
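A single round of the loop above can be isolated as a pure function. This is an illustrative sketch: the state type and the round and rotr helpers are names assumed for this example, and the caller supplies the round constant and schedule word instead of reading _K[i] and w[i].

```go
package main

import (
	"fmt"
	"math/bits"
)

// state holds the eight working variables a, b, c, d, e, f, g, h.
type state [8]uint32

func rotr(x uint32, k int) uint32 { return bits.RotateLeft32(x, -k) }

// round applies one SHA-256 round with round constant k and schedule word w.
func round(s state, k, w uint32) state {
	a, b, c, d, e, f, g, h := s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7]
	t1 := h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (^e & g)) + k + w
	t2 := (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))
	// Every variable shifts down one position; only a and e receive new values.
	return state{t1 + t2, a, b, c, d + t1, e, f, g}
}

func main() {
	// SHA-256 initial hash value, followed by round 0 for the message "abc":
	// K[0] = 0x428a2f98 and w[0] = 0x61626380 ("abc" plus the 0x80 padding byte).
	s := state{0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19}
	s = round(s, 0x428a2f98, 0x61626380)
	fmt.Printf("a=%08x e=%08x\n", s[0], s[4])
}
```

Only two of the eight variables are recomputed per round, so four consecutive rounds can be fused into the SHA256H/SHA256H2 pair operating on 128-bit vectors.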


Optimize SHA256 for arm64 – Implementation

src/crypto/sha256/sha256block_arm64.s


Optimize SHA256 for arm64 – Result

• Optimization with assembly

• 2X-16X speedup


Optimize IndexByte for arm64 – Before

[Figure: byte-by-byte scan over the input "H E L L O W O R L D …" for the byte D, using registers R0, R1, and R2]

src/runtime/asm_arm64.s


Optimize IndexByte for arm64 – After

• Assembly implementation with SIMD

• SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16

Compare 16 bytes in parallel

More details:

• Input slice shorter than 16

• Input slice address not 16-byte aligned

• Input slice size not 16-byte aligned

• Count trailing zeros (not leading zeros)

• Implementation:

• https://go-review.googlesource.com/c/go/+/41654


Optimize IndexByte for arm64 – Result

• Optimization with SIMD

• 1.5X-8X speedup


Work Summary

Disassembler (arm64):

• https://go-review.googlesource.com/c/arch/+/43651
• https://go-review.googlesource.com/c/arch/+/56810
• https://go-review.googlesource.com/c/go/+/58930
• https://go-review.googlesource.com/c/go/+/56331
• https://go-review.googlesource.com/c/go/+/49530

Assembler (arm64):

• https://go-review.googlesource.com/c/go/+/33594
• https://go-review.googlesource.com/c/go/+/33595
• https://go-review.googlesource.com/c/go/+/41511
• https://go-review.googlesource.com/c/go/+/41654
• https://go-review.googlesource.com/c/go/+/45850
• https://go-review.googlesource.com/c/go/+/54951
• https://go-review.googlesource.com/c/go/+/54990
• https://go-review.googlesource.com/c/go/+/57852
• https://go-review.googlesource.com/c/go/+/58350
• https://go-review.googlesource.com/c/go/+/56030
• https://go-review.googlesource.com/c/go/+/46438
• https://go-review.googlesource.com/c/go/+/41653

Optimizations:

• https://go-review.googlesource.com/c/go/+/40074
• https://go-review.googlesource.com/c/go/+/61550
• https://go-review.googlesource.com/c/go/+/61570
• https://go-review.googlesource.com/c/go/+/33597
• https://go-review.googlesource.com/c/go/+/64490
• https://go-review.googlesource.com/c/go/+/55610

Others:

• https://go-review.googlesource.com/c/go/+/61511
• https://go-review.googlesource.com/c/go/+/62850
• https://go-review.googlesource.com/c/go/+/45112
• https://go-review.googlesource.com/c/go/+/44390
• https://go-review.googlesource.com/c/go/+/42971
• https://go-review.googlesource.com/c/go/+/40511
• https://go-review.googlesource.com/c/arch/+/37172


Next Steps

• Crypto optimizations:

• aes, elliptic, …

• SIMD optimizations:

• strings, bytes, runtime, reflect, …

• Compiler SSA arm64 back-end optimizations

• Others

• Internal arm64 linker

• Tool for arm64: race detector, memory sanitizer, …

• New architecture features

• ...


CGo

[Figure: CGo sits between the Go ABI and the C ABI]

    package print

    // #include <stdio.h>
    // #include <stdlib.h>
    import "C"
    import "unsafe"

    func Print(s string) {
        cs := C.CString(s)
        C.fputs(cs, (*C.FILE)(C.stdout))
        C.free(unsafe.Pointer(cs))
    }





Branch Difference from GNU Assembly

• On arm64: B is an alias for JMP, BL is an alias for CALL

Jump to labels

        JMP L1
        NOP
    L1:
        NOP

    L2:
        NOP
        NOP
        B L2

Call and Indirect Jump

    BL $p.foo
    MOV $p·foo, R3
    CALL (R3)

    B (R3)
    MOV 0(R26), R4
    JMP (R4)

Jump relative to PC (useful in macros!)

    JMP 2(PC)
    NOP
    NOP

    NOP
    NOP
    JMP -2(PC)
