eecs 487: interactive computer graphicssugih/courses/eecs487/... · 2015-10-30 · computer...
EECS 487: Interactive Computer Graphics
Lecture 21:
• Overview of Low-level Graphics APIs: Metal, Direct3D 12, Vulkan
Console Games
Why do games look and perform so much better on consoles than on PCs with equivalent specs?
• consoles are closed platforms with long shelf lives, so programmers can program the hardware directly
• doing away with the serialized call-preparation bottleneck allows for better utilization of multiple CPU cores
Motivations for low-level graphics APIs:
• faster graphics from reduced API overhead
• “close-to-metal,” direct control of the GPU
[Anandtech: Smith]
Low-Overhead, Low-Level API
Whence the high overhead of graphics APIs?
• hardware abstractions hide underlying platform diversity, providing programming convenience and flexibility: graphics vs. system programming
• “newbie-friendly” safety nets of error checking and state validation
Code gurus (the ones writing game engines and renderers) would rather have performance than hand-holding
[Anandtech: Smith]
Low-Overhead, Low-Level API
Why is this an issue now?
• GPU performance is far outstripping CPU performance due to the massively parallel nature of graphics rendering: API overhead at the CPU is throttling GPU performance
• serialized command assembly prior to issuing draw calls restricts utilization of multi-core CPUs
• instancing and batching objects into a smaller number of draw calls can only help so much
Another advantage: easier porting of console games to PCs?
[Anandtech: Smith]
How to Improve Performance?
1. Command buffer
   a. reduced draw call overhead
   b. better command submission multi-threading
2. Baked-in states
   a. pipeline state objects
   b. resource binding
3. Pre-compiled shaders
Biggest Source of CPU Overhead
Assembly of the command stream prior to issuing a draw call, e.g., the gathering together of
• line mode
• polygon mode
• flat or smooth shading
• texture objects to use
• which vertex array objects
• which vertex buffer objects
• setting vertex attribute pointers
• arguments to draw calls
Done by the driver
[Anandtech: Smith]
Single-threaded Job Assembly
[Figure: single-threaded job assembly. Labels: CPU thread (×2), driver, cmd, command buffer, queue, GPU front-end; utilizes at most 2 cores]
Problem: single-threaded job assembly by the CPU is often not fast enough to keep the GPU busy
[nvidia: Foley]
Command Buffer/List
Developers self-assemble the command stream into a command buffer (Vulkan) or command list (D3D12)
Each command buffer is self-contained, so multiple buffers can be assembled in parallel, each on its own thread/core, without extra concurrency work
Final submission of the command buffers via the command queue is still serial, but is highly efficient
[Microsoft: Sandy]
Metal                D3D12                  Vulkan
MTLCommandBuffer()   ID3D12CommandList()    VkCmdBuffer()
MTLCommandQueue()    ID3D12CommandQueue()   VkCmdQueue()
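The record-in-parallel, submit-serially pattern can be sketched in a few lines. This is a toy model, not any real API: `CommandBuffer`, `CommandQueue`, `record`, and `build_frame` are hypothetical names, and because of Python's global interpreter lock the threads here only illustrate the structure, not an actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


class CommandBuffer:
    """Hypothetical stand-in for a Vulkan command buffer / D3D12 command list."""

    def __init__(self, name):
        self.name = name
        self.commands = []

    def draw(self, obj):
        # Recording only appends to this buffer's private list: no shared state,
        # so no locks are needed while several buffers are recorded at once.
        self.commands.append(("draw", obj))


def record(chunk_id, objects):
    """Record one self-contained command buffer on a worker thread."""
    cb = CommandBuffer(f"cb{chunk_id}")
    for obj in objects:
        cb.draw(obj)
    return cb


class CommandQueue:
    """Hypothetical queue: submission is serial and keeps buffer order."""

    def __init__(self):
        self.submitted = []

    def submit(self, buffers):
        for cb in buffers:
            self.submitted.extend(cb.commands)


def build_frame(objects, workers=4):
    # Split the scene into one chunk per worker thread.
    chunks = [objects[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in chunk order, so submission is deterministic.
        buffers = list(pool.map(record, range(workers), chunks))
    queue = CommandQueue()
    queue.submit(buffers)  # the only serial step
    return queue


queue = build_frame([f"mesh{i}" for i in range(8)], workers=4)
```

Each worker touches only its own buffer, which is why the slide can say "without extra concurrency work"; the single `submit` call is the serial part.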
Multi-threaded Job Assembly
[Figure: multi-threaded job assembly. Labels: multiple CPU threads, each with its own cmd sequence; queue; GPU front-end]
[nvidia: Foley]
Command Buffer Re-Use
In Vulkan a command buffer can be re-used
• a “top-level” command buffer can “call” second-level command buffers
In D3D12 a command list “recorded” as a bundle can be submitted once to the GPU but executed multiple times, with different resources, e.g., different textures (much like OpenGL’s retained-mode display list)
Metal currently doesn’t support command buffer re-use
[nvidia: Foley; Microsoft: Sandy]
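A rough sketch of the bundle idea: record once, then replay several times against different resources. The `Bundle` class, the slot name, and the texture file names are hypothetical and only model the behaviour described above; they are not the D3D12 or Vulkan interfaces.

```python
class Bundle:
    """Hypothetical pre-recorded command list; resources are resolved at execution time."""

    def __init__(self):
        self.commands = []
        self.record_count = 0

    def record(self, commands):
        # The costly recording step, done once.
        self.commands = list(commands)
        self.record_count += 1

    def execute(self, bindings):
        # Replay the stored commands, substituting the resources bound for this run.
        return [(op, bindings.get(arg, arg)) for op, arg in self.commands]


bundle = Bundle()
bundle.record([("bind_texture", "slot0"), ("draw", "quad")])

frame = []
for texture in ("brick.png", "grass.png", "water.png"):
    frame.extend(bundle.execute({"slot0": texture}))
```

Three quads are drawn with three textures, but `record` ran only once.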
3DMark – Multi-thread Scaling and 50% Better CPU Utilization
[Figure: per-thread CPU time, including app logic, under Direct3D 11 vs. Direct3D 12]
• user-mode driver workload distributed across multiple threads/cores
• single-threaded bottleneck reduced
• Windows kernel time reduced
• Direct3D driver time reduced
• kernel-mode driver time totally removed
[Microsoft: Sandy; Anandtech: Smith]
Imagination’s Gnome Horde
No instancing
Re-use command buffers for each tile: 300 tiles, 13,500 draws/frame, 30 fps, light CPU usage
Over 400,000 draw calls/sec (13,500 draws/frame × 30 fps = 405,000), each with a different transformation, with many different materials, textures, blend modes, and shaders
[Imagination: Smith]
Fast Moving Camera
[Figure: plots of app CPU usage and system CPU usage]
[Imagination: Smith]
Command buffers need to be regenerated very frequently
• single-threaded CPU bottleneck cannot feed the GPU fast enough ⇒ low FPS
• with command buffer assembly distributed to multiple cores, the CPU could run at a lower frequency
Power Efficiency
When the CPU and GPU have to share a power and thermal budget
• lower CPU usage allows more of the power and thermal budget to go to the GPU
• spreading the workload across more CPU cores allows each to run at a lower clock speed, further reducing power usage compared to running a single thread at a high frequency (to feed the GPU)
[Intel: Lauritzen; Anandtech: Barrett & Smith]
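Why several slow cores can use less power than one fast core follows from the usual CMOS dynamic-power model, P = C·V²·f per core, together with the fact that a lower clock typically permits a lower supply voltage. The sketch below assumes that model; the frequencies and voltages are made-up illustrative numbers, not measurements from the cited sources, and it assumes the workload splits perfectly across cores.

```python
def dynamic_power(cores, freq_ghz, volts, capacitance=1.0):
    """Total dynamic power under the model P = C * V^2 * f per core (arbitrary units)."""
    return cores * capacitance * volts ** 2 * freq_ghz


# Same total cycles per second in both cases: 1 x 3.0 GHz vs. 4 x 0.75 GHz.
single = dynamic_power(cores=1, freq_ghz=3.0, volts=1.2)   # one fast core
spread = dynamic_power(cores=4, freq_ghz=0.75, volts=0.9)  # four slow cores, lower voltage

ratio = spread / single
```

With these assumed numbers the four-core configuration draws about 56% of the single-core power for the same aggregate clock throughput; the saving comes entirely from the voltage term.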
50,000 Asteroids (draws/frame)
D3D11: CPU uses as much power as the GPU
D3D12: CPU uses 1/3 the power of the GPU, allowing more GPU processing; rendering sped up by over 70%
D3D12: or maintain the same rendering performance at 50% power usage
[Intel: Lauritzen]
Pipeline State Objects (PSOs)
Problem: draw-time validation of shader states delays hardware setup and reduces the number of draw calls per frame
Solution: bake (compile and validate) pipeline states into PSOs that are finalized on creation; switching PSOs has lower overhead than computing hardware state on the fly
[Microsoft: Sandy]
Pipeline State Objects
A PSO contains all static state for the entire 3D pipeline
• shaders, vertex attribute formats, rasterization, color blend, depth stencil, etc.
Created outside of the performance-critical paths
A PSO can be cached for re-use, even saved to disk/cloud for re-use across app runs
[nvidia: Daniell; AMD]
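The create-once, switch-cheaply idea can be sketched as a cache keyed on an immutable state description. `PipelineDesc`, `PipelineCache`, and the shader file names are hypothetical; the "compile" step is only a counter standing in for the expensive validation and compilation a real driver performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineDesc:
    """Static pipeline state (illustrative fields); frozen so it can be a dict key."""
    vertex_shader: str
    fragment_shader: str
    blend: str = "opaque"
    depth_test: bool = True


class PipelineCache:
    def __init__(self):
        self._cache = {}
        self.compiles = 0

    def get(self, desc):
        pso = self._cache.get(desc)
        if pso is None:
            # Expensive step: validate and "compile" once, at creation time.
            self.compiles += 1
            pso = ("PSO", self.compiles, desc)
            self._cache[desc] = pso
        return pso


cache = PipelineCache()
opaque = PipelineDesc("basic.vert", "lit.frag")
glass = PipelineDesc("basic.vert", "lit.frag", blend="alpha")

# Load time: create every PSO the frame will need.
for desc in (opaque, glass):
    cache.get(desc)

# Draw time: switching PSOs is a lookup, with no validation or recompilation.
draws = [cache.get(d) for d in (opaque, glass, opaque, opaque, glass)]
```

Five draws switch pipelines four times, yet only two compilations ever happen, both outside the draw loop.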
What Doesn’t Go into a PSO?
Resource bindings
• the actual vertex, index, constant buffers
• textures, samplers, etc.
Fixed-function states that do not cause shader recompilation: viewport, color blend constants, polygon offset, scissor, stencil masks and refs, etc.
[nvidia: Foley]
[Figure: GPU state vector. Labels: pipeline state object; binding tables for textures, buffers, samplers]
Descriptor Tables and Pool/Heap
Problem: to use different resources, e.g., textures, an app must bind and rebind them to fixed and limited bind slots (descriptors) and issue multiple draw calls
[Microsoft: Sandy; nvidia: Foley]
[Figure: GPU state vector. Labels: pipeline state object; binding tables for textures, buffers, samplers]
Descriptor Table and Heap/Pool
Solution: pre-write multiple sets of descriptors to a descriptor heap; changing resources simply switches between descriptor sets already resident in GPU memory
[nvidia: Foley; Microsoft: Sandy]
[Figure: GPU state vector. Labels: pipeline state object; root table; descriptor tables for textures, buffers, samplers]
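A minimal model of pre-written descriptor sets: every set is written into the heap once, and a draw merely selects one by index. `DescriptorHeap`, `write_set`, and the resource names are hypothetical and do not mirror the D3D12 or Vulkan calls.

```python
class DescriptorHeap:
    """Hypothetical heap holding pre-written descriptor sets."""

    def __init__(self):
        self.sets = []
        self.writes = 0

    def write_set(self, **resources):
        # Writing descriptors is the costly part, so it is done up front.
        self.sets.append(dict(resources))
        self.writes += 1
        return len(self.sets) - 1  # the index acts as the handle


def draw(heap, set_index, mesh):
    # Changing resources is just choosing a different, already-resident set.
    return (mesh, heap.sets[set_index]["albedo"])


heap = DescriptorHeap()
brick = heap.write_set(albedo="brick_albedo", normal="brick_normal")
metal = heap.write_set(albedo="metal_albedo", normal="metal_normal")

frame = [
    draw(heap, brick, "wall"),
    draw(heap, metal, "door"),
    draw(heap, brick, "floor"),
]
```

The frame switches materials twice without any further descriptor writes.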
GPU Memory Management
With a high-level API, to pass data from the app to the GPU, the app first allocates a driver-managed buffer and copies the data into it before passing the data to the shader ⇒ CPU overhead
With a low-level API, a developer simply maps the GPU memory address and writes to that memory location directly, without the extra driver-managed copy
[Imagination: Smith]
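The difference between the two upload paths can be mimicked with an ordinary byte buffer. Here `gpu_memory` is just a `bytearray` standing in for a GPU-visible allocation, and both function names are hypothetical; the point is only the number of copies on each path.

```python
import struct

gpu_memory = bytearray(16)  # stand-in for a GPU-visible allocation holding 4 floats


def upload_with_driver_copy(values):
    """High-level path: pack into a staging buffer, then copy that into GPU memory."""
    staging = struct.pack("4f", *values)  # copy 1: app data -> staging buffer
    gpu_memory[:] = staging               # copy 2: staging buffer -> GPU memory
    return 2                              # copies made


def upload_mapped(values):
    """Low-level path: write straight into the mapped allocation."""
    struct.pack_into("4f", gpu_memory, 0, *values)  # single write, no staging buffer
    return 1


copies_high = upload_with_driver_copy((1.0, 0.0, 0.0, 1.0))
copies_low = upload_mapped((0.0, 1.0, 0.0, 1.0))
result = struct.unpack_from("4f", gpu_memory)
```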
Pre-Compiled Shader
Vulkan:
• pre-compiles shaders into a common intermediate representation
• provides some IP protection: developers can distribute shaders in a compiled intermediate representation instead of in source
• pre-compiled shaders also speed up draw calls
Metal also pre-compiles shaders
[Anandtech: Smith]
Vulkan Shader Programming
[Figure: Labels: game engines and others; other languages, e.g., C++ Shading Language; other IR, e.g., LLVM (open-source translator); GLSL-to-SPIR-V compiler (open source); Standard Portable Intermediate Representation (SPIR-V): core in Vulkan]
[Khronos]
More Predictable Performance
Previously: app submits a draw call, maps a buffer, etc.
Driver might (GPU dependent):
• compile shaders
• insert synchronization fences into the GPU schedule
• flush caches
• allocate memory
With a low-level API all of the above must be done by the app itself, but driver performance across vendors becomes more predictable
[nvidia: Foley]
Why Vulkan is Not for Beginners
Must handle multi-threading and concurrency/synchronization
Must manage memory allocation and usage
These are optional in OpenGL, but mandatory in Vulkan
Summary of Features
[nvidia: Foley; Anandtech: Smith]
Tech                     Metal   Direct3D 12   Vulkan
command buffer           ✔       ✔             ✔
pipeline state objects   ✔       ✔             ✔
descriptor table         ✗       ✔             ✔
tile-based render pass   ✔       ✗             ✔
multi-adapter            ✗       ✔             ✔?
Vulkan and D3D12:
• both similar to Mantle to start with
• Mantle supports multi-GPU
• not as low level, to be cross-vendor and cross-platform
Tile-based Architectures
“Mobile GPU” usually means “tile-based GPU”
• most Android and all iOS devices use tile-based rendering
• Vulkan and Metal have support for tile-based architectures, but Direct3D 12 does not
• tiling reduces use of expensive off-chip memory bandwidth
[Google: Hall; Imagination: Sommefeldt]
Immediate-Mode Rendering
Fragment shading, including texture sampling, is performed even on fragments that will eventually fail the depth test
• requires accessing off-chip memory
• inefficient use of off-chip bandwidth
[Imagination: Sommefeldt; Merry]
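The wasted work can be counted for a single pixel. The sketch assumes, as the slide describes, that the depth test runs after shading; `immediate_mode` and the fragment labels are illustrative only.

```python
def immediate_mode(fragments):
    """fragments: (depth, color) pairs for one pixel, in submission order.

    Returns the final color and how many fragments were shaded.
    """
    shaded = 0
    depth, color = float("inf"), None
    for frag_depth, frag_color in fragments:
        shaded += 1                 # shading (incl. texture sampling) happens first
        if frag_depth < depth:      # the depth test only runs afterwards
            depth, color = frag_depth, frag_color
    return color, shaded


pixel = [(0.9, "sky"), (0.5, "wall"), (0.2, "gnome"), (0.7, "tree")]
color, shaded = immediate_mode(pixel)
wasted = shaded - 1  # only one fragment ends up visible
```

Four fragments are shaded but only the nearest one contributes to the image, so three shading operations and their memory traffic are wasted.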
Tile-based Architectures
Tile-based rendering splits the framebuffer up into tiles (e.g., 16×16 or 32×32 pixels) and sorts all triangles on a tile using on-chip storage before fragment shading
[Merry]
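The step of assigning triangles to tiles can be sketched with a conservative bounding-box test. This is a simplified illustration, not how any particular GPU implements binning: `bin_triangles` is a hypothetical helper, and a bounding box may list a triangle in a tile it does not actually cover.

```python
import math
from collections import defaultdict

TILE = 16  # tile edge in pixels; the text gives 16x16 or 32x32 as examples


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box touches it."""
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the framebuffer, then convert to tile coordinates.
        x0 = max(0, math.floor(min(xs))) // tile
        x1 = min(width - 1, math.floor(max(xs))) // tile
        y0 = max(0, math.floor(min(ys))) // tile
        y1 = min(height - 1, math.floor(max(ys))) // tile
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(index)
    return bins


triangles = [
    [(2, 2), (10, 3), (5, 12)],       # fits inside tile (0, 0)
    [(10, 10), (40, 12), (20, 30)],   # spans tiles 0..2 in x and 0..1 in y
]
bins = bin_triangles(triangles, width=64, height=32)
```

Each tile can then be shaded on its own, touching only the triangles in its list and keeping its color and depth data on-chip.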
Multi-Adapter Support
PCs can contain multiple graphics cards
Apps can enumerate graphics cards
• can create a device abstraction for each
Some graphics cards have multiple GPUs
• each with its own engines and memory
Apps should be able to assign work to any GPU on any graphics card
• create queues on any engine and submit command buffers
• allocate resources in memory associated with any GPU
[Microsoft: Boyd]
Multi-Adapter Support
Options:
• Alternate-frame rendering (AFR): frame pacing becomes an issue if the GPUs are of different performance
• Split-frame rendering
• Work sharing of individual frames
D3D12 Explicit Multi-Adapter (EMA) mode allows exchange of multiple data types between GPUs, beyond just finished, rendered images
But transferring data over the PCIe bus is slow and has high latency!
[Anandtech: Smith]
Feature Sets/Levels
Hardware feature scoping
• can be defined for different platforms or versions of the API
• all features listed in a set must be supported
• developers can develop against Feature Sets
• features enabled at device creation time
[Microsoft: Sandy; Anandtech: Smith]
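The all-or-nothing rule for feature sets amounts to a subset check. The set names, feature names, and helper functions below are invented for illustration and do not correspond to actual Metal feature sets or Direct3D feature levels.

```python
FEATURE_SETS = {  # hypothetical names and contents
    "level_a": {"tessellation", "compute"},
    "level_b": {"tessellation", "compute", "sparse_textures"},
}


def supported_sets(device_features):
    """A set is supported only if every feature listed in it is present on the device."""
    return [name for name, required in FEATURE_SETS.items()
            if required <= device_features]


def create_device(device_features, requested_set):
    if requested_set not in supported_sets(device_features):
        raise ValueError(f"device does not support {requested_set}")
    # The enabled features are fixed when the device is created.
    return {"enabled": frozenset(FEATURE_SETS[requested_set])}


gpu = {"tessellation", "compute", "geometry_shader"}
available = supported_sets(gpu)
device = create_device(gpu, "level_a")
```

The example GPU lacks `sparse_textures`, so it qualifies for `level_a` only, even though it offers a feature (`geometry_shader`) that neither set requires.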
[Figure labels: Apple; Apple? Imagination? Khronos? Locked out?]
References
Smith, R., “Microsoft Announces DirectX 12,” Anandtech, Mar. 24, 2014
Smith, R., “Understanding AMD Mantle,” Anandtech, Sep. 26, 2013
Smith, R., “Some Thoughts on Apple’s Metal API,” Anandtech, Jun. 3, 2014
Smith, R., “Khronos Announces Next Generation OpenGL Initiative,” Anandtech, Aug. 11, 2014
Chester, B., “Comparing OpenGL ES to Metal on iOS,” Anandtech, Jun. 15, 2015
Sandy, M., “DirectX 12,” DirectX Developer Blog, Mar. 20, 2014
Lauritzen, A., “DirectX 12 on Intel,” Aug. 11, 2014
Yeung, A., “DirectX 12 – Looking back at GDC 2015,” Mar. 9, 2015
Langley, B., “Windows 10 and DirectX 12 Released!,” DirectX Developer Blog, Jul. 29, 2015
Smith, R., “Microsoft Details Direct3D 11.3 & 12 New Rendering Features,” Anandtech, Sep. 18, 2014
Smith, R., “The DirectX 12 Performance Preview,” Anandtech, Feb. 6, 2015
Smith, R. and Cutress, I., “Exploring DirectX 12: 3DMark API Overhead Feature Test,” Anandtech, Mar. 27, 2015
Smith, R., “Next Generation OpenGL Becomes Vulkan,” Anandtech, Mar. 3, 2015
Smith, A., “Trying out the new Vulkan graphics API on PowerVR GPUs,” Imagination PowerVR Graphics Blog, Mar. 3, 2015
Smith, A., “Gnomes per second in Vulkan and OpenGL ES,” Imagination PowerVR Graphics Blog, Aug. 10, 2015
Foley, T., “Next-Generation Graphics APIs: Similarities and Differences,” ACM SIGGRAPH 2015
Sellers, G., “A Whirlwind Tour of Vulkan,” ACM SIGGRAPH 2015
Hall, J., “Vulkan on Android,” ACM SIGGRAPH 2015
Hall, J., “Using Next-Generation APIs on Mobile GPUs,” ACM SIGGRAPH 2015
Daniell, P., “Vulkan on NVIDIA GPUs,” ACM SIGGRAPH 2015
Boyd, C., “Direct3D 12,” ACM SIGGRAPH 2015
Yeung, A., “DirectX 12 Multiadapter,” DirectX Developer Blog, Apr. 30, 2015
Smith, R., “GeForce + Radeon: Previewing DirectX 12 Multi-Adapter,” Anandtech, Oct. 26, 2015
Merry, B., “Performance Tuning for Tile-Based Architecture,” OpenGL Insights, eds. Cozzi, P. and Riccio, C., 2012