the risc-v vector isa · 2020. 8. 13. · the risc-v vector isa krste asanovic, [email protected],...

TheRISC-VVectorISA

KrsteAsanovic,[email protected],VectorWGChairRogerEspasa,[email protected],VectorWGCo-ChairVectorExtensionWorkingGroup

17thRISC-VWorkshop,Nov'17

WhyaVectorExtension?

2

VectorISAGoodness

• Reducedinstructionbandwidth• Reducedmemorybandwidth• Lowerenergy• ExposesDLP• Maskedexecution• Gather/Scatter• FromsmalltolargeVPU

RISC-VVectorExtension

• Small• Naturalmemoryordering• Masksfoldedintovregs(*)• Scalar,Vector&Matrix(*)• Typedregisters• Reconfigurable• Mixed-typeinstructions• CommonVector/SIMDprogrammingmodel

• Fixed-pointsupport• EasilyExtensible• BestvectorISAeverJ

Domains

• MachineLearning• Graphics• DSP• Crypto• Structuralanalysis• Climatemodeling• Weatherprediction• Drugdesign• Andmore…

(*)ChangedsincelastWorkshopPresentation7thRISC-VWorkshop,Nov'17

TheVectorISAinanutshell• 32vectorregisters(v0…v31)

• Eachregistercanholdeitherascalar,avector oramatrix (shape)• Eachvectorregisterhasanassociatedtype (polymorphicencoding)• Variable numberofregisters(dynamicallychangeable)

• Vectorinstructionsemantics• AllinstructionscontrolledbyVectorLength(VL)register• Allinstructionscanbeexecutedundermask• Intuitivememoryorderingmodel• Preciseexceptionssupported

• Vectorinstructionset:• AllinstructionspresentinbaselineISAarepresentinthevectorISA• Vectormemoryinstructionssupportinglinear,strided&gather/scatteraccesspatterns• OptionalFixed-Pointset• OptionalTranscendentalset


NewArchitecturalStateMVL=8

32b

v0v1v2v3

e7 e6 e5 e4 e3 e2 e1 e0 type16b

vl (XLEN)vxrm (3b)vxsat (1b)

Note:Floatingpointflagsusetheexistingscalarflags4

vdcfg(512b)


e7 e6 e5 e4 e3 e2 e1 e0e7 e6 e5 e4 e3 e2 e1 e0

type

type

e7 e6 e5 e4 e3 e2 e1 e0 type

e7 e6 e5 e4 e3 e2 e1 e0 typee7 e6 e5 e4 e3 e2 e1 e0 typee7 e6 e5 e4 e3 e2 e1 e0 typee7 e6 e5 e4 e3 e2 e1 e0 type

v28v29v30v31

CompleteVectorInstructionList

5

VOP VMEMvmadd vadd vmerge vsll vclass vround vld vamoswapvnmadd vaddi vmin vslli vpopc vclip vst vamoaddvmsub vand vmul vsra vsgnj vextract vlds vamoandvnmsub vandi vmulh vsrai vsgnjn vmv vsts vamoor

vdiv vsne vsrl vsgnjx vldx vamoxorvseq vor vsrli vsqrt vstx vamomaxvsge vori vsub vcvt vamominvslt vrem vxorvmax vselect vxori7thRISC-VWorkshop,Nov'17

Addingtwovectorregisters


vadd v1, v2 à v0

• WhenVLiszero,destregisterisfullycleared• Operationspast‘vl’shallnotraiseexceptions• Destinationcanbesameassource

32b 32b 32b 32b 32b 32b 32b

h g f e d c b a

32b

v1v2

v0

p o n m l k j i

0 0 0 e+m d+l c+k b+j a+i

(MVL=8,VL=5,F32)

76543210

+ + + + +

for (i = 0; i < vl; i++ ){

v0[i] = v1[i] +F32 v2[i]}for (i = vl; i < MVL; i++ ){

v0[i] = 0}


Howisthisexecuted?SIMD?Vector?Uptoyou!

VRF

+F32 +F32

1st clock: a+i,b+j2nd clock: c+k,d+l3rd clock: e+m,04th clock: uptoyou

2-laneimplementation



+F32 +F32

1st clock: a+i,b+j,c+k,d+l2nd clock: e+m,0,0,0

4-laneimplementation

+F32 +F32

VRF



8-laneimplementation(a.k.a.SIMD)

+F32

1st clock: a+i,b+j,c+k,d+l,e+m,0,0,0

+F32 +F32 +F32 +F32 +F32 +F32 +F32

VRF

NumberoflanesistransparenttoprogrammerSamecoderunsindependentof#oflanes


Addingavectorandascalar


ScalarvaluesintheVectorRegisterFile

• ThedatainsideaVREGcanhave3possibleshapes:• Asinglescalarvalue• Avector (i.e.,whatyou’dexpect)• Amatrix(optional,notinthebasespec)

• Thecurrentshapeisheldintheper-vregtypefield• ShapechangescauseaVRFreset(discussedlater)

• Avectorregisterwithshapescalar• Onlyholdsonevalue• Implementationchoice:whereexactlythisonevalueisstoredwithinthevectorisnotdefinedbythespec.Whetherthevalueisreplicatedtoeverylaneisalsoimplementationdependent.


vadd v1, v2.s à v0

• Implementationsarefreetoreplicatethescalarvalueacrossallelementsinthevectorregister• AssemblynotationforindicatingscalaroperandsstillT.B.D

0 0 0 e+i d+i c+i b+i a+i

32b 32b 32b 32b 32b 32b 32b

h g f e d c b a

32b

v1v2s

v0

? ? ? ? ? ? ? i

(MVL=8,VL=5,F32)

76543210

+ + + + +

for (i = 0; i < vl; i++ ){

v0[i] = v1[i] +F32 v2[0]}for (i = vl; i < MVL; i++ ){

v0[i] = 0}


Maskedexecution


Maskedexecution

• Masksarestoredinregularvectorregisters• TheLSBofeachelementisusedasaboolean“0”or“1”value• Otherbitsignored

• Masksarecomputedwithcompareoperations(vseq,vsne,vslt,vsge)• veqv6,v7à v1• Comparisonresultsareinteger“0”or“1”(can’tbeassignedtofloattypes)• Encodedwithasmanybitsasthedestinationregisterelementsize

• Instructionsuse2bitsofencodingtoselectmaskedexecution• 00:Nomasking(==assumemaskingis0xFFFF…FFFF)• 01:unused(usedforotherencodings)• 10:Usev1’selementslsbasthemask• 11:Use~v1’selementslsbasthemask


vadd v3, v4, v1.t à v5

• Remember:v1istheonlyregisterusedasmasksource• Masked-outoperationsshallnotraiseanyexceptions• AssemblynotationstillTBD

lsb(v1)

32b 32b 32b 32b 32b 32b 32b

h g f e d c b a

32b

v3v4

v5

p o n m l k j i

0 0 0 0 d+l c+k b+j 0

(MVL=8,VL=5,F32)

76543210

+ + + + +

1 0 1 0 1 1 1 0

for (i = 0; i < vl; i++ ){

v5[i] = lsb(v1[i]) ? v3[i] +F32 v4[i] : 0;}for (i = vl; i < MVL; i++ ){

v5[i] = 0}


VectorLoad(unitstride)


vld 80(x3)à v5

• Unalignedaddressesarelegal,likelyveryslow 18

abcdefghijk

v5 0 0 0 e d c b a76543210

@100@104@108@112@116@120@124@128@132@136@140

sz = sizeof_type(v5); // 4tmp = x3 + 80; // x3 = 20for (i = 0; i < vl; i++ ){

v5[i] = read_mem(tmp, sz);tmp = tmp + sz;

}for (i = vl; i < MVL; i++ ){

v0[i] = 0}


StridedVectorLoad


vlds 80(x3,x9) à v5abcdefghijk

v5 0 0 0 h g e c a76543210

@100@104@108@112@116@120@124@128@132@136@140

• Stride0islegal• Stridesthatresultinunalignedaccessesarelegal

• likelyveryslow


v5[i] = read_mem(tmp, sz);tmp = tmp + x9; // x9 = 8 = stride in bytes

}for (i = vl; i < MVL; i++ ){

v0[i] = 0}


Gather(indexedvectorload)


vldx 80(x3,v2) à v5

• Repeatedaddressesarelegal• Unalignedaddressesarelegal,likelyveryslow

abcd

efghi

v5

0 0 0 d d a i c76543210

v2 0 0 0 12 12 0 32 8

@100@104@108@112@116@120@124@128@132@136@140

sz = sizeof_type(v5); // 4tmp = x3 + 80 // 100for (i = 0; i < vl; i++ ){

addr = tmp + sext(v2[i]);v5[i] = read_mem(addr, sz);

}for (i = vl; i < MVL; i++ ){

v0[i] = 0}


VectorStore(unitstride)


vst v5 à 80(x3)

abcdefghijk

v5 0 0 0 e d c b a76543210

@100@104@108@112@116@120@124@128@132@136@140


write_mem(tmp, sz, v5[i]);tmp = tmp + sz;

}

24• Unalignedaddressesarelegal,likelyveryslow7thRISC-VWorkshop,Nov'17

StridedVectorStore


vsts v5 à 80(x3,x9)

• Stride0islegal• Stridesthatresultinunalignedaccessesarelegal

• likelyveryslow

abcdefghijk

v5 0 0 0 h g e c a76543210

@100@104@108@112@116@120@124@128@132@136@140

// x9 = stride in bytessz = sizeof_type(v5); // 4tmp = x3 + 80; // x3 = 20for (i = 0; i < vl; i++ ){

write_mem(tmp, sz, v5[i]);tmp = tmp + x9; // x9 = 8 = stride in bytes

}


Scatter(indexedvectorstore)


vstx v5 à 80(x3,v2)

• Repeatedaddressesarelegal• Provisionforbothorderedandunorderedscatter

• Unalignedaddressesarelegal• likelyveryslow

abcdefghijk

v5 0 0 0 d d a i c

v2 0 0 0 12 12 0 32 8

@100@104@108@112@116@120@124@128@132@136@140

sz = sizeof_type(v5); // 4tmp = x3 + 80; // 100for (i = 0; i < vl; i++ ){

addr = tmp + sext(v2[i]);write_mem(addr, sz, v5[i]);

}


Ordering

• FromthepointofviewofagivenHART• Vectorloads&storesinstructionshappeninorder• Youdon’tneedanyfencestoseeyourownstores

• FromthepointofviewofotherHART’s• Otherhartsseethevectormemoryaccessesasifdonebyascalarloop• So,theycanbeseenout-of-orderbyotherharts


TypedVectorRegisters


TypedVectorRegisters

• Eachvectorregisterhasanassociatedtype• Yes,differentregisterscanhavedifferenttypes(i.e.,v2canhavetypeF16andv3havetypeF32)• Typescanbemixedinaninstructionundercertainrules

• Hardwarewillautomaticallypromotesometypestoothers(seenextslide)• Typescanbedynamicallychangedbythevcvtinstruction

• Ifthetypechangedoesnotrequiredmorebitsperelementthanincurrentconfiguration• Rationalefortypedregisters

• Registertypesenablea“polymorphic”encodingforallvectorinstructions• Saveslargespaceofconvertfrom“typeA”to“typeB”• Morescalableintothefuture:Supportscustomtypeswithoutadditionalencodings

• SupportedtypesdependonthebaselineISAyourimplementationsupports• RV32I à I8,U8,I16,U16,I32,U32• RV64I à I8,U8,I16,U16,I32,U32,I64,U64• RV128I à I8,U8,I16,U16,I32,U32,I64,U64,X128,X128U• F àF16,F32• FD à F16,F32,F64• FDQ à F16,F32,F64,F128• Provisionforcustomtypeextensions 317thRISC-VWorkshop,Nov'17

Type&dataconversions:vcvt

• Toconvertdataintoadifferentformat• Usevcvtbetweenregistersoftheappropriatetype• vcvt v1F32 à v0F16• vcvt v1u8 à v0F32• vcvt v1F32 à v0I32

• Additionalfeature:changingthedestregistertypewithvcvt• vcvt v1F32 à v0F32, I32• Ignoresthecurrentdest type,andsetsittothetyperequestedinimmediate• Legalifrequestedtypesizeisnotbiggerthancurrentconfiguredelementwidth


MixingTypes:promotingsmallintolarge

• Whenanysourceissmallerthandest,thatsourceis“promoted”todestsize• Ifallowedbypromotiontable.Otherwise,instructionshalltrap

• Promotionexamples• vadd v1I8, v2I8 à v0I16• vadd v1I8, v2I64 à v0I64• vadd v1F16, v2F32 à v0F32• vmadd v1F16, v2F16, v3F32 à v3F32

• Tableontherightdefinesvalidpromotions• Zeroextend• Signextend• Re-biasexponentandpadmantissawith0’s

33

se=signextendze =zeroextendp=passthroughrb =re-biast=trap


ReconfigurableVectorRegisterFile


Reconfigurable,variable-lengthVectorRF• Thevectorunitisconfiguredwithacsrrw x1, vdcfg à x2

• x1containsthenewconfigurationindicating• Numberoflogicalregisters(from2to32)• Typeforeachvectorregister,usinganincrementalscheme

• Hardwareresetsallvectorstatetozero• HardwarecomputesMaximumVectorLength(MVL)

• basedonx1andavailablevectorregisterfilestorage• MVLreturnedinx2• Canbedoneinusermode• Expectedtobefast

• Thevectorunitisunconfiguredwritinga0tovdcfg• Verygoodtosavekernelsave&restore!• Usefulforlowpowerstate

• Implementationchoices• AlwaysreturnthesameMVL,regardlessofconfig• Splitstorageacrosslogicalregisters,maybelosingsomespace• Packlogicalregistersastightlyaspossible

35IMPORTANT:ALLvectorregistersALWAYShavethesameNUMBEROFELEMENTS(MVL)7thRISC-VWorkshop,Nov'17

V0V1V2V3……V28V29V30v31

32b

+F32

V0V1V2V3……V28V29V30v31

32b

+F32

V0V1V2V3……V28V29V30v31

32b

+F32

V0V1V2V3……V28V29V30v31

32b

+F32

Usersasksfor32F32registers• Hardwarehas32rx4ex4B=512B• Need• 4bytesperv0element• 4bytesperv1element• …• 4bytesperv31element

• Therefore• MVL=512B/(32*4)=4

• HowistheVRForganized?• Manypossibleways• Showingonepossibleorganization


V0V1V0V1……V0V1V0V1

32b

+F32

32b

+F32

32b

+F32

32b

+F32

Usersasksforonly2F32registers

• Hardwarehas32rx4ex4B=512B• Need• 4bytesperv0element• 4bytesperv1element

• Therefore• MVL=512B/(4+4)=64

• HowistheVRForganized?• Manypossibleways• ShowinganINTERLEAVEDorganization





V0V1

32b

+F32

V0V1

32b

+F32

V0V1

32b

+F32

V0V1

32b

+F32

Usersasksforonly2F32registers(alsolegal!)• Hardwarehas32rx4ex4B=512B• Need

• 4bytesperv0element• 4bytesperv1element

• Therefore• MVL=512B/(4+4)=64

• Andyet,implementation…• …answerswithMVL=4• Absolutelylegal!



V0,V0

V0,V0

V0,V0

V0,V0

V1,V1

V1,V1

V1,V1

V1,V1

V2

V2

…

V2

V3

V3

…

V3

Unused

Unused

Unused

Unused

Usersasksfor2F16regs&2F32regs• Hardwarehas32rx4ex4B=512B• Need

• 2bytesperv0element• 2bytesperv1element• 4bytesperv2element• 4bytesperv3element• 4‘unusedbytes’tonearestpowerof2

• Therefore• MVL=512B/(12B+4B)=32


39

4

4

8

8

8

V0,V0

V0,V0

V0,V0

V0,V0

V1,V1

V1,V1

V1,V1

V1,V1

V2

V2

…

V2

V3

V3

…

V3

Unused

Unused

Unused

Unused

V0,V0

V0,V0

V0,V0

V0,V0

V1,V1

V1,V1

V1,V1

V1,V1

V2

V2

…

V2

V3

V3

…

V3

Unused

Unused

Unused

Unused

V0,V0

V0,V0

V0,V0

V0,V0

V1,V1

V1,V1

V1,V1

V1,V1

V2

V2

…

V2

V3

V3

…

V3

Unused

Unused

Unused

Unused


MVListransparenttosoftware!

• Codecanbeportableacross• Differentnumberoflanes• DifferentvaluesofMVL• Ifusingsetvlinstruction

• SETVLrs1,rd• vl=rs1>MVL?MVL:rs1• Encodedascsrrw


EncodingSummary


Notcoveredtoday– askoffline

• Exceptions• Kernelsave&restore• Customtypes• CryptoWGhasagoodlistofextendedtypesthatfitwithin16bencoding• GFXhasadditionaltypes

• Matrixshapes(comingsoon)• Usingthesamevregs,don’tpanic!• Vadd“matrix”,“matrix”à “matrix”• Vmul“matrix”,“matrix”à “matrix”


Status&Plans

• BestVectorISAever!J• Goalistohavespecreadytoberatifiedbynextworkshop• WeekofMay7th,2018inBarcelona

• Software• ExpectLLVMtosupportit• ExpectGCCauto-vectorizertosupportit

• Pleasejointhevectorworkinggrouptoparticipate• Meetingevery2nd Friday8amPST• Warning:Githubspecisout-of-date:WIPtoupdatetothispresentation


BACKUPSLIDES


Reductions


vadd v1 à v0.s

tmp = 0;for (i = 0; i < vl; i++ ){

tmp = tmp + v1[i]}v0[0] = tmp;

• Implementationsarefreetoreplicatethefinal“sum”acrossallelementsinthedestvectorregister ? ? ? ? ? ? ? sum

32b 32b 32b 32b 32b 32b 32b

h g f e d c b a

32b

v1

v0s

(MVL=8,VL=5,F32)

76543210

+

+

+

+


PromotionTable(largefont)SourceTypepromotion

I64 I32 I16 I8 U64 U32 U16 U8 F64 F32 F16

DestType

I64 p se se se t ze ze ze t t tI32 t p se se t t ze ze t t tI16 t t p se t t t ze t t tI8 t t t p t t t t t t tU64 t t t t p ze ze ze t t tU32 t t t t t p ze ze t t tU16 t t t t t t p ze t t tU8 t t t t t t t p t t tF64 t t t t t t t t p rb rbF32 t t t t t t t t t p rbF16 t t t t t t t t t t p


the risc-v vector isa · 2020. 8. 13. · the risc-v vector isa krste asanovic, [email protected],...

Documents