the risc-v vector isa · 2020. 8. 13. · the risc-v vector isa krste asanovic, [email protected],...

47
The RISC-V Vector ISA Krste Asanovic, [email protected], Vector WG Chair Roger Espasa, [email protected], Vector WG Co-Chair Vector Extension Working Group 1 7th RISC-V Workshop, Nov'17

Upload: others

Post on 29-Jan-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

  • TheRISC-VVectorISA

    KrsteAsanovic,[email protected],VectorWGChairRogerEspasa,[email protected],VectorWGCo-ChairVectorExtensionWorkingGroup

    17thRISC-VWorkshop,Nov'17

  • WhyaVectorExtension?

    2

    VectorISAGoodness

    • Reducedinstructionbandwidth• Reducedmemorybandwidth• Lowerenergy• ExposesDLP• Maskedexecution• Gather/Scatter• FromsmalltolargeVPU

    RISC-VVectorExtension

    • Small• Naturalmemoryordering• Masksfoldedintovregs(*)• Scalar,Vector&Matrix(*)• Typedregisters• Reconfigurable• Mixed-typeinstructions• CommonVector/SIMDprogrammingmodel

    • Fixed-pointsupport• EasilyExtensible• BestvectorISAeverJ

    Domains

    • MachineLearning• Graphics• DSP• Crypto• Structuralanalysis• Climatemodeling• Weatherprediction• Drugdesign• Andmore…

    (*)ChangedsincelastWorkshopPresentation7thRISC-VWorkshop,Nov'17

  • TheVectorISAinanutshell• 32vectorregisters(v0…v31)

    • Eachregistercanholdeitherascalar,avector oramatrix (shape)• Eachvectorregisterhasanassociatedtype (polymorphicencoding)• Variable numberofregisters(dynamicallychangeable)

    • Vectorinstructionsemantics• AllinstructionscontrolledbyVectorLength(VL)register• Allinstructionscanbeexecutedundermask• Intuitivememoryorderingmodel• Preciseexceptionssupported

    • Vectorinstructionset:• AllinstructionspresentinbaselineISAarepresentinthevectorISA• Vectormemoryinstructionssupportinglinear,strided&gather/scatteraccesspatterns• OptionalFixed-Pointset• OptionalTranscendentalset

    37thRISC-VWorkshop,Nov'17

  • NewArchitecturalStateMVL=8

    32b

    v0v1v2v3

    e7 e6 e5 e4 e3 e2 e1 e0 type16b

    vl (XLEN)vxrm (3b)vxsat (1b)

    Note:Floatingpointflagsusetheexistingscalarflags4

    vdcfg(512b)

    7thRISC-VWorkshop,Nov'17

    e7 e6 e5 e4 e3 e2 e1 e0e7 e6 e5 e4 e3 e2 e1 e0

    type

    type

    e7 e6 e5 e4 e3 e2 e1 e0 type

    e7 e6 e5 e4 e3 e2 e1 e0 typee7 e6 e5 e4 e3 e2 e1 e0 typee7 e6 e5 e4 e3 e2 e1 e0 typee7 e6 e5 e4 e3 e2 e1 e0 type

    v28v29v30v31

  • CompleteVectorInstructionList

    5

    VOP VMEMvmadd vadd vmerge vsll vclass vround vld vamoswapvnmadd vaddi vmin vslli vpopc vclip vst vamoaddvmsub vand vmul vsra vsgnj vextract vlds vamoandvnmsub vandi vmulh vsrai vsgnjn vmv vsts vamoor

    vdiv vsne vsrl vsgnjx vldx vamoxorvseq vor vsrli vsqrt vstx vamomaxvsge vori vsub vcvt vamominvslt vrem vxorvmax vselect vxori7thRISC-VWorkshop,Nov'17

  • Addingtwovectorregisters

    67thRISC-VWorkshop,Nov'17

  • vadd v1, v2 à v0

    • WhenVLiszero,destregisterisfullycleared• Operationspast‘vl’shallnotraiseexceptions• Destinationcanbesameassource

    32b 32b 32b 32b 32b 32b 32b

    h g f e d c b a

    32b

    v1v2

    v0

    p o n m l k j i

    0 0 0 e+m d+l c+k b+j a+i

    (MVL=8,VL=5,F32)

    76543210

    + + + + +

    for (i = 0; i < vl; i++ ){

    v0[i] = v1[i] +F32 v2[i]}for (i = vl; i < MVL; i++ ){

    v0[i] = 0}

    77thRISC-VWorkshop,Nov'17

  • Howisthisexecuted?SIMD?Vector?Uptoyou!

    VRF

    +F32 +F32

    1st clock: a+i,b+j2nd clock: c+k,d+l3rd clock: e+m,04th clock: uptoyou

    2-laneimplementation

    87thRISC-VWorkshop,Nov'17

  • Howisthisexecuted?SIMD?Vector?Uptoyou!

    +F32 +F32

    1st clock: a+i,b+j,c+k,d+l2nd clock: e+m,0,0,0

    4-laneimplementation

    +F32 +F32

    VRF

    97thRISC-VWorkshop,Nov'17

  • Howisthisexecuted?SIMD?Vector?Uptoyou!

    8-laneimplementation(a.k.a.SIMD)

    +F32

    1st clock: a+i,b+j,c+k,d+l,e+m,0,0,0

    +F32 +F32 +F32 +F32 +F32 +F32 +F32

    VRF

    NumberoflanesistransparenttoprogrammerSamecoderunsindependentof#oflanes

    107thRISC-VWorkshop,Nov'17

  • Addingavectorandascalar

    117thRISC-VWorkshop,Nov'17

  • ScalarvaluesintheVectorRegisterFile

    • ThedatainsideaVREGcanhave3possibleshapes:• Asinglescalarvalue• Avector (i.e.,whatyou’dexpect)• Amatrix(optional,notinthebasespec)

    • Thecurrentshapeisheldintheper-vregtypefield• ShapechangescauseaVRFreset(discussedlater)

    • Avectorregisterwithshapescalar• Onlyholdsonevalue• Implementationchoice:whereexactlythisonevalueisstoredwithinthevectorisnotdefinedbythespec.Whetherthevalueisreplicatedtoeverylaneisalsoimplementationdependent.

    127thRISC-VWorkshop,Nov'17

  • vadd v1, v2.s à v0

    • Implementationsarefreetoreplicatethescalarvalueacrossallelementsinthevectorregister• AssemblynotationforindicatingscalaroperandsstillT.B.D

    0 0 0 e+i d+i c+i b+i a+i

    32b 32b 32b 32b 32b 32b 32b

    h g f e d c b a

    32b

    v1v2s

    v0

    ? ? ? ? ? ? ? i

    (MVL=8,VL=5,F32)

    76543210

    + + + + +

    for (i = 0; i < vl; i++ ){

    v0[i] = v1[i] +F32 v2[0]}for (i = vl; i < MVL; i++ ){

    v0[i] = 0}

    137thRISC-VWorkshop,Nov'17

  • Maskedexecution

    147thRISC-VWorkshop,Nov'17

  • Maskedexecution

    • Masksarestoredinregularvectorregisters• TheLSBofeachelementisusedasaboolean“0”or“1”value• Otherbitsignored

    • Masksarecomputedwithcompareoperations(vseq,vsne,vslt,vsge)• veqv6,v7à v1• Comparisonresultsareinteger“0”or“1”(can’tbeassignedtofloattypes)• Encodedwithasmanybitsasthedestinationregisterelementsize

    • Instructionsuse2bitsofencodingtoselectmaskedexecution• 00:Nomasking(==assumemaskingis0xFFFF…FFFF)• 01:unused(usedforotherencodings)• 10:Usev1’selementslsbasthemask• 11:Use~v1’selementslsbasthemask

    157thRISC-VWorkshop,Nov'17

  • vadd v3, v4, v1.t à v5

    • Remember:v1istheonlyregisterusedasmasksource• Masked-outoperationsshallnotraiseanyexceptions• AssemblynotationstillTBD

    lsb(v1)

    32b 32b 32b 32b 32b 32b 32b

    h g f e d c b a

    32b

    v3v4

    v5

    p o n m l k j i

    0 0 0 0 d+l c+k b+j 0

    (MVL=8,VL=5,F32)

    76543210

    + + + + +

    1 0 1 0 1 1 1 0

    for (i = 0; i < vl; i++ ){

    v5[i] = lsb(v1[i]) ? v3[i] +F32 v4[i] : 0;}for (i = vl; i < MVL; i++ ){

    v5[i] = 0}

    167thRISC-VWorkshop,Nov'17

  • VectorLoad(unitstride)

    177thRISC-VWorkshop,Nov'17

  • vld 80(x3)à v5

    • Unalignedaddressesarelegal,likelyveryslow 18

    abcdefghijk

    v5 0 0 0 e d c b a76543210

    @100@104@108@112@116@120@124@128@132@136@140

    sz = sizeof_type(v5); // 4tmp = x3 + 80; // x3 = 20for (i = 0; i < vl; i++ ){

    v5[i] = read_mem(tmp, sz);tmp = tmp + sz;

    }for (i = vl; i < MVL; i++ ){

    v0[i] = 0}

    7thRISC-VWorkshop,Nov'17

  • StridedVectorLoad

    197thRISC-VWorkshop,Nov'17

  • vlds 80(x3,x9) à v5abcdefghijk

    v5 0 0 0 h g e c a76543210

    @100@104@108@112@116@120@124@128@132@136@140

    • Stride0islegal• Stridesthatresultinunalignedaccessesarelegal

    • likelyveryslow

    sz = sizeof_type(v5); // 4tmp = x3 + 80; // x3 = 20for (i = 0; i < vl; i++ ){

    v5[i] = read_mem(tmp, sz);tmp = tmp + x9; // x9 = 8 = stride in bytes

    }for (i = vl; i < MVL; i++ ){

    v0[i] = 0}

    207thRISC-VWorkshop,Nov'17

  • Gather(indexedvectorload)

    217thRISC-VWorkshop,Nov'17

  • vldx 80(x3,v2) à v5

    • Repeatedaddressesarelegal• Unalignedaddressesarelegal,likelyveryslow

    abcd

    efghi

    v5

    0 0 0 d d a i c76543210

    v2 0 0 0 12 12 0 32 8

    @100@104@108@112@116@120@124@128@132@136@140

    sz = sizeof_type(v5); // 4tmp = x3 + 80 // 100for (i = 0; i < vl; i++ ){

    addr = tmp + sext(v2[i]);v5[i] = read_mem(addr, sz);

    }for (i = vl; i < MVL; i++ ){

    v0[i] = 0}

    227thRISC-VWorkshop,Nov'17

  • VectorStore(unitstride)

    237thRISC-VWorkshop,Nov'17

  • vst v5 à 80(x3)

    abcdefghijk

    v5 0 0 0 e d c b a76543210

    @100@104@108@112@116@120@124@128@132@136@140

    sz = sizeof_type(v5); // 4tmp = x3 + 80; // x3 = 20for (i = 0; i < vl; i++ ){

    write_mem(tmp, sz, v5[i]);tmp = tmp + sz;

    }

    24• Unalignedaddressesarelegal,likelyveryslow7thRISC-VWorkshop,Nov'17

  • StridedVectorStore

    257thRISC-VWorkshop,Nov'17

  • vsts v5 à 80(x3,x9)

    • Stride0islegal• Stridesthatresultinunalignedaccessesarelegal

    • likelyveryslow

    abcdefghijk

    v5 0 0 0 h g e c a76543210

    @100@104@108@112@116@120@124@128@132@136@140

    // x9 = stride in bytessz = sizeof_type(v5); // 4tmp = x3 + 80; // x3 = 20for (i = 0; i < vl; i++ ){

    write_mem(tmp, sz, v5[i]);tmp = tmp + x9; // x9 = 8 = stride in bytes

    }

    267thRISC-VWorkshop,Nov'17

  • Scatter(indexedvectorstore)

    277thRISC-VWorkshop,Nov'17

  • vstx v5 à 80(x3,v2)

    • Repeatedaddressesarelegal• Provisionforbothorderedandunorderedscatter

    • Unalignedaddressesarelegal• likelyveryslow

    abcdefghijk

    v5 0 0 0 d d a i c

    v2 0 0 0 12 12 0 32 8

    @100@104@108@112@116@120@124@128@132@136@140

    sz = sizeof_type(v5); // 4tmp = x3 + 80; // 100for (i = 0; i < vl; i++ ){

    addr = tmp + sext(v2[i]);write_mem(addr, sz, v5[i]);

    }

    287thRISC-VWorkshop,Nov'17

  • Ordering

    • FromthepointofviewofagivenHART• Vectorloads&storesinstructionshappeninorder• Youdon’tneedanyfencestoseeyourownstores

    • FromthepointofviewofotherHART’s• Otherhartsseethevectormemoryaccessesasifdonebyascalarloop• So,theycanbeseenout-of-orderbyotherharts

    297thRISC-VWorkshop,Nov'17

  • TypedVectorRegisters

    307thRISC-VWorkshop,Nov'17

  • TypedVectorRegisters

    • Eachvectorregisterhasanassociatedtype• Yes,differentregisterscanhavedifferenttypes(i.e.,v2canhavetypeF16andv3havetypeF32)• Typescanbemixedinaninstructionundercertainrules

    • Hardwarewillautomaticallypromotesometypestoothers(seenextslide)• Typescanbedynamicallychangedbythevcvtinstruction

    • Ifthetypechangedoesnotrequiredmorebitsperelementthanincurrentconfiguration• Rationalefortypedregisters

    • Registertypesenablea“polymorphic”encodingforallvectorinstructions• Saveslargespaceofconvertfrom“typeA”to“typeB”• Morescalableintothefuture:Supportscustomtypeswithoutadditionalencodings

    • SupportedtypesdependonthebaselineISAyourimplementationsupports• RV32I à I8,U8,I16,U16,I32,U32• RV64I à I8,U8,I16,U16,I32,U32,I64,U64• RV128I à I8,U8,I16,U16,I32,U32,I64,U64,X128,X128U• F àF16,F32• FD à F16,F32,F64• FDQ à F16,F32,F64,F128• Provisionforcustomtypeextensions 317thRISC-VWorkshop,Nov'17

  • Type&dataconversions:vcvt

    • Toconvertdataintoadifferentformat• Usevcvtbetweenregistersoftheappropriatetype• vcvt v1F32 à v0F16• vcvt v1u8 à v0F32• vcvt v1F32 à v0I32

    • Additionalfeature:changingthedestregistertypewithvcvt• vcvt v1F32 à v0F32, I32• Ignoresthecurrentdest type,andsetsittothetyperequestedinimmediate• Legalifrequestedtypesizeisnotbiggerthancurrentconfiguredelementwidth

    327thRISC-VWorkshop,Nov'17

  • MixingTypes:promotingsmallintolarge

    • Whenanysourceissmallerthandest,thatsourceis“promoted”todestsize• Ifallowedbypromotiontable.Otherwise,instructionshalltrap

    • Promotionexamples• vadd v1I8, v2I8 à v0I16• vadd v1I8, v2I64 à v0I64• vadd v1F16, v2F32 à v0F32• vmadd v1F16, v2F16, v3F32 à v3F32

    • Tableontherightdefinesvalidpromotions• Zeroextend• Signextend• Re-biasexponentandpadmantissawith0’s

    33

    se=signextendze =zeroextendp=passthroughrb =re-biast=trap

    7thRISC-VWorkshop,Nov'17

  • ReconfigurableVectorRegisterFile

    347thRISC-VWorkshop,Nov'17

  • Reconfigurable,variable-lengthVectorRF• Thevectorunitisconfiguredwithacsrrw x1, vdcfg à x2

    • x1containsthenewconfigurationindicating• Numberoflogicalregisters(from2to32)• Typeforeachvectorregister,usinganincrementalscheme

    • Hardwareresetsallvectorstatetozero• HardwarecomputesMaximumVectorLength(MVL)

    • basedonx1andavailablevectorregisterfilestorage• MVLreturnedinx2• Canbedoneinusermode• Expectedtobefast

    • Thevectorunitisunconfiguredwritinga0tovdcfg• Verygoodtosavekernelsave&restore!• Usefulforlowpowerstate

    • Implementationchoices• AlwaysreturnthesameMVL,regardlessofconfig• Splitstorageacrosslogicalregisters,maybelosingsomespace• Packlogicalregistersastightlyaspossible

    35IMPORTANT:ALLvectorregistersALWAYShavethesameNUMBEROFELEMENTS(MVL)7thRISC-VWorkshop,Nov'17

  • V0V1V2V3……V28V29V30v31

    32b

    +F32

    V0V1V2V3……V28V29V30v31

    32b

    +F32

    V0V1V2V3……V28V29V30v31

    32b

    +F32

    V0V1V2V3……V28V29V30v31

    32b

    +F32

    Usersasksfor32F32registers• Hardwarehas32rx4ex4B=512B• Need• 4bytesperv0element• 4bytesperv1element• …• 4bytesperv31element

    • Therefore• MVL=512B/(32*4)=4

    • HowistheVRForganized?• Manypossibleways• Showingonepossibleorganization

    367thRISC-VWorkshop,Nov'17

  • V0V1V0V1……V0V1V0V1

    32b

    +F32

    32b

    +F32

    32b

    +F32

    32b

    +F32

    Usersasksforonly2F32registers

    • Hardwarehas32rx4ex4B=512B• Need• 4bytesperv0element• 4bytesperv1element

    • Therefore• MVL=512B/(4+4)=64

    • HowistheVRForganized?• Manypossibleways• ShowinganINTERLEAVEDorganization

    V0V1V0V1……V0V1V0V1

    V0V1V0V1……V0V1V0V1

    V0V1V0V1……V0V1V0V1

    377thRISC-VWorkshop,Nov'17

  • V0V1

    32b

    +F32

    V0V1

    32b

    +F32

    V0V1

    32b

    +F32

    V0V1

    32b

    +F32

    Usersasksforonly2F32registers(alsolegal!)• Hardwarehas32rx4ex4B=512B• Need

    • 4bytesperv0element• 4bytesperv1element

    • Therefore• MVL=512B/(4+4)=64

    • Andyet,implementation…• …answerswithMVL=4• Absolutelylegal!

    • HowistheVRForganized?• Manypossibleways• Showingonepossibleorganization

    387thRISC-VWorkshop,Nov'17

  • V0,V0

    V0,V0

    V0,V0

    V0,V0

    V1,V1

    V1,V1

    V1,V1

    V1,V1

    V2

    V2

    V2

    V3

    V3

    V3

    Unused

    Unused

    Unused

    Unused

    Usersasksfor2F16regs&2F32regs• Hardwarehas32rx4ex4B=512B• Need

    • 2bytesperv0element• 2bytesperv1element• 4bytesperv2element• 4bytesperv3element• 4‘unusedbytes’tonearestpowerof2

    • Therefore• MVL=512B/(12B+4B)=32

    • HowistheVRForganized?• Manypossibleways• Showingonepossibleorganization

    39

    4

    4

    8

    8

    8

    V0,V0

    V0,V0

    V0,V0

    V0,V0

    V1,V1

    V1,V1

    V1,V1

    V1,V1

    V2

    V2

    V2

    V3

    V3

    V3

    Unused

    Unused

    Unused

    Unused

    V0,V0

    V0,V0

    V0,V0

    V0,V0

    V1,V1

    V1,V1

    V1,V1

    V1,V1

    V2

    V2

    V2

    V3

    V3

    V3

    Unused

    Unused

    Unused

    Unused

    V0,V0

    V0,V0

    V0,V0

    V0,V0

    V1,V1

    V1,V1

    V1,V1

    V1,V1

    V2

    V2

    V2

    V3

    V3

    V3

    Unused

    Unused

    Unused

    Unused

    7thRISC-VWorkshop,Nov'17

  • MVListransparenttosoftware!

    • Codecanbeportableacross• Differentnumberoflanes• DifferentvaluesofMVL• Ifusingsetvlinstruction

    • SETVLrs1,rd• vl=rs1>MVL?MVL:rs1• Encodedascsrrw

    407thRISC-VWorkshop,Nov'17

  • EncodingSummary

    417thRISC-VWorkshop,Nov'17

  • Notcoveredtoday– askoffline

    • Exceptions• Kernelsave&restore• Customtypes• CryptoWGhasagoodlistofextendedtypesthatfitwithin16bencoding• GFXhasadditionaltypes

    • Matrixshapes(comingsoon)• Usingthesamevregs,don’tpanic!• Vadd“matrix”,“matrix”à “matrix”• Vmul“matrix”,“matrix”à “matrix”

    427thRISC-VWorkshop,Nov'17

  • Status&Plans

    • BestVectorISAever!J• Goalistohavespecreadytoberatifiedbynextworkshop• WeekofMay7th,2018inBarcelona

    • Software• ExpectLLVMtosupportit• ExpectGCCauto-vectorizertosupportit

    • Pleasejointhevectorworkinggrouptoparticipate• Meetingevery2nd Friday8amPST• Warning:Githubspecisout-of-date:WIPtoupdatetothispresentation

    437thRISC-VWorkshop,Nov'17

  • BACKUPSLIDES

    447thRISC-VWorkshop,Nov'17

  • Reductions

    457thRISC-VWorkshop,Nov'17

  • vadd v1 à v0.s

    tmp = 0;for (i = 0; i < vl; i++ ){

    tmp = tmp + v1[i]}v0[0] = tmp;

    • Implementationsarefreetoreplicatethefinal“sum”acrossallelementsinthedestvectorregister ? ? ? ? ? ? ? sum

    32b 32b 32b 32b 32b 32b 32b

    h g f e d c b a

    32b

    v1

    v0s

    (MVL=8,VL=5,F32)

    76543210

    +

    +

    +

    +

    467thRISC-VWorkshop,Nov'17

  • PromotionTable(largefont)SourceTypepromotion

    I64 I32 I16 I8 U64 U32 U16 U8 F64 F32 F16

    DestType

    I64 p se se se t ze ze ze t t tI32 t p se se t t ze ze t t tI16 t t p se t t t ze t t tI8 t t t p t t t t t t tU64 t t t t p ze ze ze t t tU32 t t t t t p ze ze t t tU16 t t t t t t p ze t t tU8 t t t t t t t p t t tF64 t t t t t t t t p rb rbF32 t t t t t t t t t p rbF16 t t t t t t t t t t p

    477thRISC-VWorkshop,Nov'17