A generic parallel computational framework of lifting wavelet transform for online engineering surface filtration

Yuanping Xu a, Chaolong Zhang a,b,⁎, Zhijie Xu b, Jiliu Zhou c, Kaiwei Wang d, Jian Huang a

a School of Software Engineering, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China
b School of Computing & Engineering, University of Huddersfield, Queensgate HD1 3DH, Huddersfield, UK
c School of Computer Science, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China
d School of Optical Science and Engineering, Zhejiang University, Hangzhou 310027, People's Republic of China
⁎ Corresponding author at: School of Software Engineering, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China.

Abstract

Nowadays, complex geometrical surface texture measurement and evaluation require advanced filtration techniques. The discrete wavelet transform (DWT), especially the second-generation wavelet (lifting wavelet transform, LWT), is the most widely adopted owing to its unified and rich capabilities in measured data processing, geometrical feature extraction, manufacturing process planning, and production monitoring. However, when dealing with varied complex functional surfaces, the computational payload of performing DWT in real time often becomes a core bottleneck in the context of massive measured data and limited computational capacity. The problem is even more prominent for areal surface texture filtration using 2D DWT. To address this issue, this paper presents a generic parallel computational framework for the lifting wavelet transform (GPCF-LWT) based on Graphics Processing Unit (GPU) clusters and the Compute Unified Device Architecture (CUDA). Owing to its cost-effective hardware design and powerful parallel computing capacity, the proposed framework can support online (or near real-time) engineering surface filtration for micro- and nano-scale surface metrology through a novel parallel method named the LBB model, improved lifting scheme algorithms, and three implementation optimizations on heterogeneous multi-GPU systems. The innovative approach enables optimizations on individual GPU nodes through an overarching framework that is capable of data-oriented dynamic load balancing (DLB) driven by a fuzzy neural network (FNN). The paper concludes with a case study on filtering and extracting manufactured surface topographical characteristics from real surfaces. The experimental results demonstrate substantial improvements of the GPCF-LWT implementation in terms of computational efficiency, operational robustness, and task generalization.

Keywords: Lifting wavelet; LBB; Surface texture measurement; Multi-GPU; CUDA; FNN

1 Introduction

Engineering workpieces are designed, manufactured and measured to meet tolerance specifications of various geometrical characteristics (e.g., surface texture, size, angle, radius). After a workpiece has been manufactured, it is necessary to verify the derived characterization (tolerance) parameters to ensure it performs its designed functions. Though the designed functions of a workpiece are the comprehensive effect of several characterization parameters, surface texture is often the dominant one. Roughness, waviness and form are the three main characteristics of functional surfaces (e.g., roughness may affect the lifetime, efficiency and fuel consumption of a part) [1,2]. Obtaining the roughness, waviness and form frequency bands relevant to a surface is normally accomplished by using filtration techniques to extract a surface texture signal. In principle, filtration must be carried out before
suitable numerical surface texture parameters can be identified. This means that geometrical characteristic information relating closely to functional surfaces can be extracted precisely from measured datasets by using appropriate filters, which ensures the potential of establishing functional correlations between tolerance specification and verification [1–3].
The rapid progress of filtration techniques has presented metrologists with a set of advanced tools, e.g. Gaussian filters, morphological filters and wavelet filters [4–7]. Gaussian filtering is commonly used in surface texture measurements and has been recommended by the ISO 16610 and ASME B46 standards for establishing a reference surface [8–10]. However, using a Gaussian filter to extract surface topographical characteristics rests on the presupposition that a raw surface texture signal is a combination of a series of harmonic components [10]. Unfortunately, a real engineering surface texture measurement contains not only regular signals but also nonperiodic and random signals, e.g. deep valleys, high peaks and scratches generated during various manufacturing processes, so the extraction, filtration and analysis of these signals are indispensable for monitoring how process information (e.g. surface texture parameters) relates to the variability of manufacturing processes [2]. Thus, straightforward applications of Gaussian filters struggle to satisfy the challenging requirements of surface texture measurement and analysis. In contrast, the discrete wavelet transform (DWT) has shown great potential in handling these complex issues efficiently. DWT offers a great diversity of mother wavelets (e.g. Haar, Daubechies, CDF biorthogonal, Coiflet) that can be applied according to different application purposes. Moreover, DWT has advantages in multi-resolution analysis and lower computational complexity.
Recently, surface texture measurement has entered the era of miniaturization, e.g. various explorations in micro- or nano-machining and micro-electromechanical systems (MEMS) manufacturing [1]. Hence the increasing demands for efficient handling of big (measurement) data, processing accuracy, and real-time performance have become paramount. As the status quo, the computational performance of DWT is a core bottleneck for handling high throughput in an online (or near real-time) manner [11], especially for two-dimensional DWT (2D DWT) in areal surface texture measurement [12].
To overcome this bottleneck, pure algorithmic acceleration solutions have been intensively investigated in the last decade. Sweldens proposed a new DWT model, known as the lifting wavelet transform (LWT) or second-generation wavelet, that improved computational performance and reduced memory consumption compared to the first-generation wavelet (the Mallat algorithm) [13]. Although promising, the lifting scheme still delivers limited performance in accelerating DWT computations for online applications facing large-scale data. Consequently, hardware acceleration has become a research focus. Graphics Processing Units (GPUs) enable such solutions and have become the standard accessory for computation-intensive applications. Unfortunately, the current lifting scheme on a single GPU struggles to meet the demands of online processing, especially when dealing with the large-scale and dynamic datasets that define the high precision requirements of surface texture metrology.
To tackle these challenges, this paper presents a generic parallel computational framework for LWT (GPCF-LWT) based on off-the-shelf consumer-grade CUDA multi-GPU systems. The proposed GPCF-LWT can accelerate computation-intensive wavelets so as to support online engineering surface filtration for high precision surface metrology, through an innovative computational model (named the LBB model) that improves the usage of GPU shared memory and an improved lifting scheme that reduces the inherent synchronizations of the classic lifting scheme. GPCF-LWT also contains three implementation optimizations based on the hardware advancements of individual GPU nodes and a novel dynamic load balancing (DLB) model using an improved fuzzy neural network (FNN) for heterogeneous multi-GPUs. Therefore, it supports intelligent wavelet transform selection in accordance with different surface characterization requirements and measurement conditions.
The rest of this paper is organized as follows. Section 2 gives a brief review of previous related works, which further clarifies the contributions of this research. An in-depth analysis of the feasibility of algorithm improvements for adapting parallel computation is discussed in Section 3, following a brief introduction to the lifting scheme based second-generation wavelet transform. Section 4 presents the GPCF-LWT framework and its three key components. The LBB model for GPCF-LWT is explored based on a detailed evaluation of the three classic parallel computation models for lifting scheme implementations, to solve the problem of GPU shared memory shortage during LWT computation. To increase the parallelizability of LWT, an improved lifting scheme that reduces the inherent synchronizations is defined, and to further accelerate LWT computation, three complementary implementation optimizations in accordance with the latest hardware advancements of GPU cards are discussed. Moreover, to maximize the parallelization of GPCF-LWT on a heterogeneous multi-GPU platform, Section 5 introduces a novel dynamic load balancing (DLB) model, in which a more effective FNN is designed based on the authors' previous work. A case study is demonstrated and evaluated in Section 6 to test and verify the research ideas. Finally, Section 7 concludes this paper and summarizes future work.
2 Previous related works

Modern GPUs are not only powerful graphics engines but also highly parallel, programmable arithmetic processors. A GPU is a SIMD (Single Instruction Multiple Data) parallel device containing several streaming multiprocessors (SMs), and each SM has multiple GPU cores (also named CUDA cores), i.e., stream processors (SPs), which are the computational cores for processing multiple threads simultaneously (each GPU core manages a thread). Besides visualization and rendering, more general-purpose computation on GPUs (so-called GPGPU) has also prospered since 2002, when consumer-grade graphics cards became truly "programmable", ranging from numeric computational operations [14–16] to computer aided design, tolerancing and measurement. More specifically, GPGPU has been applied to numerous compute-intensive problems that cannot be solved using the CPU-based serial processing mode. For example, McCool introduced traditional GPU hardware and GPU programming models for signal processing and summarized some classic GPU-based signal processing applications [17]. Su et al. proposed a GPGPU framework to accelerate 2D Gaussian filtering, achieving about 4.8 times speedup without reducing the filtering quality [18]. However, traditional GPGPU programming had to be coded with graphics programming interfaces such as OpenGL and DirectX, so heavy manual work was involved in the whole procedure of developing GPGPU programs, e.g., texture mappings, manual and static threading, memory management, and shading program execution. Thus, this obsolete GPGPU programming procedure was very complicated, and it also limited the parallel computing ability of GPU cards.
To alleviate these application difficulties, in 2007 NVIDIA introduced CUDA, a programming framework for general-purpose computation that provides a scalable and integrated programming model for allocating and organizing processing threads and mapping them onto SMs and CUDA cores, with dynamic adaptation to all mainstream GPU architectures [19]. In addition, CUDA maintains a sophisticated memory hierarchy (e.g. registers, local memory, shared memory, global memory and constant memory) in which different memories can be applied according to particular applications and optimization circumstances, and CUDA GPU cards have been optimized with higher memory bandwidth and fast on-chip performance [19–21]. Better yet, it embeds a series of APIs for programming directly on the GPU instead of manually transferring application code to various obscure graphics APIs and shading languages. In CUDA programs, any function running on a GPU card is called a "kernel", and launching a kernel generates multiple threads running on the GPU [19]. CUDA threads are organized into a hierarchical structure, i.e., multiple threads compose a block, and multiple blocks compose a grid such as a 2D matrix (see Fig. 11) [22]. When a kernel runs on a GPU, SMs are in charge of creating, managing, scheduling and executing CUDA threads in groups (called warps), and each group handles 32 parallel threads. However, the memory capacity and parallel computing ability of a single GPU are still too limited to integrally process a dataset tens of gigabytes (GB) in size, which is the normal size required for current large-scale scientific applications [23]. For example, a measured dataset extracted from the micro-scale surface area of a ceramic femoral head articulating surface is 49,152 × 49,152 in size (i.e. 12 GB in total with single precision), whereas a single modern GPU such as the GeForce GTX 1080 selected in this study has only 8 GB of GPU memory, so this 12 GB dataset cannot be loaded and processed at one time. Although more expensive high-end GPUs such as the NVIDIA TITAN Xp (12 GB memory) and NVIDIA TITAN RTX (24 GB memory) can store larger datasets, they still struggle with real-life measurement datasets often coming in at hundreds of GB. Moreover, GPU cards have a fixed number of CUDA cores (e.g., 2560, 3840 and 4608 for the GTX 1080, TITAN Xp and TITAN RTX, respectively), and each core can execute one hardware thread, so the number of threads available to support concurrent computations is also fixed. Consequently, a single GPU must divide a dataset and process massive numbers of smaller chunks iteratively, which may greatly increase the latency of intensive computations. In conclusion, scientific computation with big data, especially for applications of online and high precision surface texture measurement, demands distributed computation on multi-GPUs. With complementary optimization strategies, multi-GPUs can seamlessly work together to achieve impressive computational performance, but the multi-GPU architecture is more complicated than a single GPU, and it brings time-consuming and error-prone processes for workload partitioning and allocation, the corresponding synchronization and communication of divided data chunks, recombination of the processed data chunks, and data transfers across PCI-E (Peripheral Component Interconnect Express) buses. It is worth mentioning that the recently introduced NVLink connections have approximately 10 times higher bandwidth than PCI-E [24]. However, NVLink supports only high-end GPUs (e.g., TITAN RTX and TESLA P100), and it is normally equipped only in supercomputers.
Historically, although plentiful and concrete research on wavelet-based filtering theories arose in recent decades, their practical implementations on computers achieved limited success due to the key challenge of the computational cost of wavelet transforms. In general, the intensive computation of DWT, through its inherent multi-level data decomposition and reconstruction, drastically reduces performance for online applications facing large data sizes. With the rapid development of GPU parallelism, several single-GPU acceleration solutions have been explored in response to this challenge. For example, Hopf et al. accelerated 2D DWT on a single GPU card using OpenGL APIs in 2002, which marked the beginning of GPU-based DWT acceleration [25]. This early practice adopted the traditional "one-step" method rather than the separable two steps of 1D DWT, so the programming model was neither flexible nor generalizable, i.e., each kind of wavelet transform required a custom implementation. In 2007, Wong et al. optimized Hopf's work using a shading language (named "JasPer"); JasPer used shaders to handle multiple data streams in pipelines and was integrated into the JPEG2000 codec [26]. Notably, JasPer decomposes 2D DWT into two steps of 1D DWT, such that it supports dozens of wavelet types and boundary extension methods in the convolution stage. It also unified the calculation methods of forward and inverse 2D DWT. In Wong's work, due to the limitations of GPU programmability and hardware structure at the time, the convolution, downsampling and upsampling operations were executed in sequence on the fragment processors of an early GPU, and each texture coordinate in texture mapping had to be pre-defined in separate CPU programs. These solutions required manual mapping operations onto GPU hardware, and task partitioning, deciding which part of the work should be conducted on a CPU and which part fits a GPU, was indispensable in these practices prior to the emergence of unified GPU languages such as CUDA.
Recently, there has been a growing interest in CUDA-based GPGPU computation to parallelize the DWT for scientific and big data applications. Franco et al. parallelized a 2D DWT on the CUDA-enabled NVIDIA Tesla C870 for processing images and achieved a satisfactory speedup of 26.81 times over an OpenMP method with the fastest CPU setting at the time [27]. Song et al. accelerated the CDF(9,7) based SPIHT (set partitioning in hierarchical trees) algorithm for decoding real-time satellite images on the NVIDIA Tesla C1060, achieving a speedup of about 83 times compared to a single-thread CPU implementation [28]. Later, Zhu et al. improved 2D DWT computation on a GPU and verified it with a case study of ring artifact removal, achieving a significant performance increase [29]. Compared with previous CUDA-enabled GPU implementations of DWT that only parallelized a very specific type of wavelet under a specific circumstance, Zhu's work supports different types of wavelets to adapt to different application conditions. In addition, the latest constant memory technique was applied in Zhu's work to optimize overall GPU memory access. However, shared memory access and the CUDA-streams-based latency hiding mechanism were not exploited for further performance optimization in this research [19]. More importantly, the first-generation wavelet theory and its corresponding algorithmic operations applied in the convolution scheme of these studies increased the computational complexity and imposed a great deal of memory and computational time consumption, e.g., researchers needed to consider the troublesome issue of boundary distortion inherent in the Fourier transform [10]. The second-generation wavelet theory adopts the lifting scheme for wavelet decomposition and reconstruction, which preserves all application characteristics of the first-generation wavelet transform while bringing higher performance and lower memory demand due to its in-place algorithm property [13]. These previous CUDA acceleration studies chose the convolution scheme rather than the lifting scheme because the execution order of the lifting scheme has heavy data dependencies, such that it cannot be fully parallelized. To explore CUDA-based LWT implementation, in 2011 Wladimir et al. accelerated the LWT algorithm with three-level decompositions and reconstructions of a 1920 × 1080 image using CUDA version 2.1 on an NVIDIA GeForce 8800 GTX 768 MB, and this practice achieved a speedup of 10 times compared with the corresponding CPU (AMD Athlon 64 X2 Dual Core Processor 5200+) implementation [4]. Later, Castelar et al. optimized the (5,3) biorthogonal wavelet transform using CUDA for the parallel decompression of seismic datasets [5]. Unfortunately, further parallel optimization of LWT was not considered in these studies.
Previous works focused on parallelizing data processing on a single GPU and witnessed moderate performance gains across the board. However, due to the limitations of data storage format and memory space, as well as the fixed number of CUDA cores available on a single GPU die, previous works still struggle to fulfill the online data processing requirements of many large-scale computational applications, which is especially problematic for processing large measured surface texture data engaging complex filtrations with various tolerance parameters [1,30]. In comparison, multi-GPU based acceleration solutions can be flexible and extensible, achieving higher performance with relatively low hardware costs. Numerous computation-intensive problems that cannot be resolved using a single GPU have been making steady progress in the context of multi-GPUs. For instance, FFT (Fast Fourier Transform) based multi-GPU solutions have become the norm in current online large-scale data applications [31,32]. In the meantime, several multi-GPU programming libraries (e.g. MGPU) [33] and multi-GPU based MapReduce libraries (e.g. GPMR and HadoopCL) [34,35] have been developed by researchers. Moreover, deep learning, very popular in the current decade, has also benefited from the fast development of multi-GPU techniques [36,37]. These pilot multi-GPU studies are based on the assumption that all GPU nodes equipped in a multi-GPU platform have equal computational capacity. In addition, the task-based load balancing schedulers that these approaches rely upon fall short in supporting applications with huge data throughput but limited processing function(s), since there are very few "tasks" to schedule. To maximize data parallelism in a heterogeneous multi-GPU system, this study has also devised an innovative data-oriented dynamic load balancing (DLB) model based on an improved FNN for rapid measured data division and allocation on heterogeneous GPU nodes [38].
In summary, combined acceleration optimizations that seamlessly merge the parallelized architecture of multi-GPUs with improvements from the aspects of LWT and software routines (e.g., the algorithmic optimization of the lifting scheme, comprehensive memory usage optimization based on the advanced hierarchical architecture of CUDA memory, and the latency hiding mechanism based on CUDA multi-streams) have been largely ignored. To address these shortcomings, this study explores and implements a generic parallel computational framework for the lifting wavelet transform (GPCF-LWT) based on a heterogeneous CUDA multi-GPU system. It further optimizes the acceleration of 2D LWT through the improved processing model of the lifting scheme (LBB and the improved lifting algorithm), three devised implementation optimization strategies, and the improved DLB model on a heterogeneous multi-GPU system. Moreover, GPCF-LWT can support all kinds of wavelet transforms, thereby supporting intelligent wavelet transform selection in accordance with different surface texture characterization requirements and measurement strategies.
3 The second-generation wavelet

The fundamental idea behind wavelet analysis is to simplify complex frequency domain analyses into simpler scalar operations. The first-generation wavelet applies the fast wavelet transform to enable decomposition (forward DWT) and reconstruction (inverse DWT) in the form of two-channel filter banks, the so-called Mallat convolution scheme [39]. Mallat wavelet decomposition and reconstruction convolve the input and output data with the corresponding decomposition and reconstruction filter banks respectively, so it has become a popular choice for many engineering applications in recent years. However, the Mallat algorithm has a certain degree of computational complexity (e.g. it has inherent boundary distortion when executing FFT). Thus, Sweldens proposed a new implementation of DWT, i.e. the lifting scheme or second-generation wavelet.

The second-generation wavelet, which uses the lifting scheme for wavelet decomposition and reconstruction, preserves all properties of the first-generation wavelet but brings higher computational performance and lower memory demand because of its in-place algorithm attribute [13]. The lifting scheme based 1D forward (i.e. decomposition) DWT (abbreviated as forward LWT or FLWT) contains four operation steps: split, predict, update and scale [4,13,40]:
1) Split: This step splits the raw signal x into two coefficient subsets, even and odd; the former corresponds to the even index values while the latter holds the odd index values. The split method is expressed in Eq. (1), and it is also called the "lazy wavelet":

even_j = x_{2j},  odd_j = x_{2j+1}    (1)

2) Predict: Given the even and odd coefficients, the odd coefficients can be predicted from the even ones by using the prediction operator PO, and the old odd values are then replaced by the prediction residual as the next new odd coefficients, recursively; this step can be expressed as Eq. (2). The odd is the same as the detail coefficients D in the Mallat algorithm.

odd_j ← odd_j − PO(even)_j    (2)

3) Update: Similarly, the even coefficients can be updated by using the update operator UO, and the old even values are then replaced by the updated result as the next new even coefficients, recursively; this step can be expressed as Eq. (3). The even is the same as A in the Mallat algorithm.

even_j ← even_j + UO(odd)_j    (3)

4) Scale: Normalize the even and odd coefficients with the factor K respectively by using Eq. (4) to get the final approximation coefficients (even, App) and the detail coefficients (odd, Det).

App_j = K · even_j,  Det_j = odd_j / K    (4)
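The four steps can be made concrete with a minimal sketch, assuming the CDF(5,3) wavelet, whose prediction and update operators are two-tap neighbour averages, and periodic boundary handling; both choices are illustrative rather than the paper's implementation:

```python
def flwt_53(x):
    """Single-level 1D forward LWT via the four lifting steps, using the
    CDF(5,3) wavelet (K = 1) and periodic boundaries for illustration."""
    even = [float(v) for v in x[0::2]]  # 1) split: the "lazy wavelet"
    odd = [float(v) for v in x[1::2]]
    n = len(even)
    # 2) predict: odd <- odd - PO(even); odd becomes the detail coefficients
    odd = [odd[j] - 0.5 * (even[j] + even[(j + 1) % n]) for j in range(n)]
    # 3) update: even <- even + UO(odd); even becomes the approximation
    even = [even[j] + 0.25 * (odd[j] + odd[(j - 1) % n]) for j in range(n)]
    # 4) scale: App = K*even, Det = odd/K is a no-op here since K = 1
    return even, odd

app, det = flwt_53(range(8))
```

On the linear ramp 0…7, all interior details vanish because the two-tap predictor is exact for straight lines; only the periodic wrap-around at the boundary leaves a nonzero detail.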
All operators are completed in the in-place computational mode based on these four steps. Fig. 1 illustrates the split operator in in-place mode; in this case, the raw data is split into even and odd subsets which are organized and stored in the first half and the second half of the memory space respectively, the same space used for raw data storage. Similarly, the prediction, update and scale results are stored in their original memory spaces rather than in newly allocated memory, such that the lifting scheme requires less memory than the classic Mallat algorithm. Furthermore, the 1D inverse lifting scheme (i.e. reconstruction) DWT (abbreviated as inverse LWT or ILWT) simply inverts the computational sequence of the FLWT operation steps and switches the corresponding plus and minus operators:
Scale:

even_j = App_j / K,  odd_j = K · Det_j    (5)

Update:

even_j ← even_j − UO(odd)_j    (6)

Predict:

odd_j ← odd_j + PO(even)_j    (7)

Merge: This step is also called the "inverse lazy wavelet", and it can be expressed by:

x_{2j} = even_j,  x_{2j+1} = odd_j    (8)
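A sketch of the inverse pass under the same illustrative CDF(5,3), periodic-boundary assumptions as before, confirming that reversing the step order and switching the signs gives perfect reconstruction:

```python
def forward(x):
    """Forward pass: split, predict, update (CDF(5,3), K = 1, periodic)."""
    even = [float(v) for v in x[0::2]]
    odd = [float(v) for v in x[1::2]]
    n = len(even)
    odd = [odd[j] - 0.5 * (even[j] + even[(j + 1) % n]) for j in range(n)]   # predict
    even = [even[j] + 0.25 * (odd[j] + odd[(j - 1) % n]) for j in range(n)]  # update
    return even, odd

def inverse(even, odd):
    """Inverse pass: same steps in reverse order with switched signs."""
    n = len(even)
    even = [even[j] - 0.25 * (odd[j] + odd[(j - 1) % n]) for j in range(n)]  # undo update
    odd = [odd[j] + 0.5 * (even[j] + even[(j + 1) % n]) for j in range(n)]   # undo predict
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = even, odd                                             # merge: inverse lazy wavelet
    return x

x = [3, 1, 4, 1, 5, 9, 2, 6]
roundtrip = inverse(*forward(x))
```

Because the lifting coefficients here are dyadic (0.5, 0.25), the round trip is bit-exact, illustrating the perfect-reconstruction property of the lifting scheme.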
The lifting scheme can dynamically update its configuration parameters (i.e. the iteration number and the wavelet filter coefficients) in each computational step of DWT based on factorizations of the polyphase matrices of the wavelet filter coefficients, such that the generalization of this framework can be guaranteed, realizing parallel computation for all kinds of wavelet transforms dynamically. Generally, the factorization of the polyphase matrix P(z) for all wavelets can be described as follows (matrices written row by row, rows separated by semicolons):

P(z) = ∏_{i=1}^{m} [1, u_i(z); 0, 1] · [1, 0; p_i(z), 1] · [K, 0; 0, 1/K]    (9)

where the coefficients of u_i(z) are the update coefficients in the update step, the coefficients of p_i(z) are the prediction coefficients in the prediction step in Figs. 2 and 3, and K is the scale factor. In conclusion, the computational results
Fig. 1 The computation of the split operation in in-place mode.
of both forward and inverse LWT for an arbitrary wavelet can be obtained by applying several steps of prediction and update operations and imposing the final normalization with K on the outputs of the factorization of P(z). Figs. 2 and 3 illustrate the main computational processes of single-level 1D forward and inverse LWT respectively.
4 CUDA based GPCF-LWT implementation

Based on the fundamental LWT concepts summarized in Section 3, this study proposes GPCF-LWT using CUDA-enabled GPUs for online and high precision surface measurement applications. This study assumes that factorizations of the polyphase matrices of the wavelet filter coefficients have been prepared before applying GPCF-LWT, so that we have a series of update coefficients U = {u0, u1, …, un}, a series of prediction coefficients P = {p0, p1, …, pn}, and the scale factor K.
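Under this assumption, a generic lifting driver can be parameterized by P, U and K alone. The sketch below is illustrative: it simplifies each p_i(z) and u_i(z) to a symmetric two-tap operator and assumes periodic boundaries, whereas real factorizations yield general Laurent polynomials:

```python
def flwt_generic(x, P, U, K):
    """Generic single-level forward LWT driven by prediction coefficients P,
    update coefficients U and scale factor K (illustrative simplification:
    each lifting pair is a symmetric two-tap operator, periodic boundaries)."""
    n = len(x) // 2
    s = [float(v) for v in x[0::2]]   # even samples
    d = [float(v) for v in x[1::2]]   # odd samples
    for p, u in zip(P, U):
        d = [d[j] + p * (s[j] + s[(j + 1) % n]) for j in range(n)]  # predict pass i
        s = [s[j] + u * (d[j] + d[(j - 1) % n]) for j in range(n)]  # update pass i
    return [v * K for v in s], [v / K for v in d]  # App = K*even, Det = odd/K

# CDF(5,3)-style parameters: P = {-1/2}, U = {1/4}, K = 1 (illustrative values)
app, det = flwt_generic(range(8), [-0.5], [0.25], 1.0)
```

Swapping in a different (P, U, K) triple drives a different wavelet through the same kernel, which is the mechanism by which GPCF-LWT stays generic.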
Traditional parallel computational platforms (e.g. OpenMP, MPI and FPGA) have empowered three classic LWT parallel computational models, namely the RC (Row-Column), LB (Line-Based) and BB (Block-Based) models [41]. Generally speaking, the LB model has the best instruction throughput, followed by the BB and RC models [4]. Unfortunately, except for the shortest wavelets (e.g., the Haar wavelet), both the LB and BB models fail to transplant onto a single CUDA GPU due to the space limitation of current shared memory, and the generalization of LWT (i.e., supporting all kinds of wavelet transforms) also cannot be satisfied. Although the RC model is the only feasible CUDA GPU transplantation that supports the generalization of LWT, the processing time of the CUDA-based RC model is approximately 4 times slower than the Mallat algorithm solution running on the same GPU card [42]. Thus, based on an in-depth study of these three models, this study devised a generic LWT parallel computational model for the CUDA-enabled GPU architecture.
4.1 The improved parallel computational model

• The LB model. On traditional parallel computational platforms (e.g., OpenMP, MPI and FPGA), the overall framework of LB based LWT is depicted in Fig. 4. The LB model divides a raw dataset (e.g., 1024 × 1024 in size) into N (N equals the number of computational nodes) smaller subsets (e.g., if N is 4, each subset has 256 rows and 1024 columns), and each subset is allocated to a computational node where LB performs the following operations:

1) Launching m threads, each performing horizontal 1D LWT on one row of the target subset concurrently (i.e., each thread processes a row of the target subset), after which LB caches the temporary results in local memory (e.g., RAM, random access memory);

2) Once the m rows are completely processed, the m threads perform vertical 1D LWT on m columns of the temporary results concurrently and repeatedly until all columns of the cached temporary results have been processed. In this step, the number of iterations is (width + m − 1)/m, and each iteration processes m columns.

3) Steps 1 and 2 form an integrated iteration that is repeated until all rows and columns of the target subset are processed.
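The LB schedule above can be sketched serially in Python; a Haar-style 1D pass stands in for an arbitrary wavelet, and the function names are illustrative:

```python
def lwt_1d(v):
    """Stand-in single-level 1D lifting pass (Haar-style; scale omitted).
    Returns the approximation half followed by the detail half."""
    even, odd = list(v[0::2]), list(v[1::2])
    odd = [o - e for e, o in zip(even, odd)]         # predict
    even = [e + 0.5 * o for e, o in zip(even, odd)]  # update
    return even + odd

def lb_node(subset, m=4):
    """LB schedule on one node: a horizontal pass per row (step 1), then
    vertical passes over the cached columns in batches of m (step 2)."""
    cached = [lwt_1d(row) for row in subset]         # step 1: one 'thread' per row
    width = len(cached[0])
    iterations = (width + m - 1) // m                # step 2 iteration count
    out = [row[:] for row in cached]
    for it in range(iterations):
        for c in range(it * m, min((it + 1) * m, width)):
            col = lwt_1d([out[r][c] for r in range(len(out))])
            for r in range(len(out)):
                out[r][c] = col[r]
    return out
```

Because the column batches are disjoint, the batch size m changes only the scheduling order, not the result, which is why LB can pick m to fit the wavelet's memory footprint.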
Fig. 2 The main computational procedure of single-level 1D forward LWT.

Fig. 3 The main computational procedure of single-level 1D inverse LWT.
The last step of LB based LWT is that one node (usually called a master node) merges the results of the N subsets and obtains the final result. The minimum value of m is determined by the particular type of wavelet, e.g., 3 for Haar and 6 for Deslauriers-Dubuc (13,7). Generally, m can be unified as a fixed value (e.g., 10 or 20) larger than all wavelet types need. Taking m = 10 for a float dataset having 1024 columns as an example, each computational node requires 40 KB of local memory space, so the LB model can easily be implemented on OpenMP, MPI and FPGA since they have large RAM spaces. However, a CUDA block has only 16 KB of local (shared) memory space, which cannot satisfy the memory demand of the LB model except for the shortest wavelets (e.g., the Haar wavelet requires only 12 KB of shared memory space). In addition, columns of a dataset cannot be divided by LB, so it is a coarse-grained division approach, which cannot operate well with the SIMD model of CUDA.
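The memory arithmetic in this paragraph is easy to check, at 4 bytes per single-precision sample:

```python
def lb_shared_bytes(m_rows, width, bytes_per_sample=4):
    """Local (shared) memory needed to cache m rows of single-precision
    samples during the LB vertical pass."""
    return m_rows * width * bytes_per_sample

generic = lb_shared_bytes(10, 1024)  # unified m = 10 over a 1024-column row
haar = lb_shared_bytes(3, 1024)      # Haar minimum m = 3
```

The generic case (40 KB) overflows a 16 KB CUDA shared memory block, while only the Haar case (12 KB) fits, matching the limitation stated above.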
• The BB model. Unlike LB, the BB model divides a raw dataset into N subsets (data blocks) along both the row and column directions, see Fig. 5. Each subset is allocated to a computational node where an RC routine performs horizontal 1D LWT on all rows concurrently at first, and then vertical 1D LWT on all columns is computed concurrently. As in the LB model, the master node merges all subset results in the last step of the BB model. The BB model is also a coarse-grained division approach, and it also requires a great deal of local memory space, e.g., if a raw float dataset of size 1024 × 1024 is divided into four subsets (each subset has 512 rows and 512 columns), each computational node requires 1024 KB of local memory space, which considerably exceeds the size of the shared memory space of a CUDA GPU, such that classic RC model based CUDA solutions must use GPU global memory space as the temporary cache, and this has become the key bottleneck for data access.
• The innovative LBB model for CUDA LWT. To solve the problem of shared memory shortage on CUDA GPUs, this study devised an innovative parallel computational model for 2D LWT, LBB, by combining both LB and BB. The overall framework of the LBB model is shown in Fig. 6. The LBB model divides a raw dataset into several subsets along the column direction, and each subset can be assigned to a CUDA thread block (e.g., 8 rows and 32 columns in size) where the in-built threads perform the following operations:

1) Each group of concurrent row threads loads a row of the dataset into the shared memory of a GPU in parallel. Accesses to row data can be coalesced since all threads in a block access contiguous relative addresses [19], so loading a dataset into shared memory is very efficient.

2) Each group of concurrent row threads in a block performs horizontal 1D LWT on a data row in parallel, and then the GPU caches the temporary results in its shared memory space.

3) m groups of row threads obtain m rows of temporary results in parallel by executing steps 1 and 2 (m is a preset value; it is 8 here).

4) Each group of concurrent column threads in a block performs vertical 1D LWT on a data column of the cached temporary results in parallel, and width/n groups of column threads process width/n columns of cached temporary results in parallel (width/n is the width of a subset).

5) The corresponding column thread groups save the results into GPU global memory space and delete the m rows of temporary results in shared memory obtained in step 3.

6) Steps 1–5 form a complete iteration that is repeated until all rows and columns of the dataset are processed.
The two differences between the data division methods of BB and LBB are: 1) in order to adapt to the advanced architecture of CUDA threads, the height of a subset in LBB is larger than in BB whereas the width in LBB is smaller than in BB, and the number of subsets in LBB is far larger than in BB, as the number of thread blocks is far greater than the number of computational nodes in traditional parallel platforms; 2) steps 1–6 of LBB also integrate the LB idea with one major update: instead of providing one thread per row, LBB constructs the thread block structure from numerous CUDA threads (each thread block organizes m groups of concurrent row threads, and each group of row threads has width/n CUDA threads, e.g. 32 CUDA threads per row), such that a thread block can perform both horizontal 1D LWT on multiple rows using groups of row threads and vertical 1D LWT on multiple columns using groups of column threads in parallel.
Although LBB divides a large dataset into several smaller subsets, the problem of shared memory shortage is still notable, e.g., 64 KB of shared memory space is required if a subset has size 128 × 128. Thus, a sliding window structure has been designed in LBB, see Fig. 6. In LBB, the size of a sliding window is equal to that of a constructed thread block (e.g. 8 × 32 for a 128 × 128 subset), i.e., a sliding window is built on a thread block, and each subset has a sliding window. Once a sliding window is full (e.g., there are 8 rows of temporary results cached in shared memory), column-based vertical 1D LWT is carried out (i.e. steps 4 and 5), and then LBB moves the sliding window down the target subset to compute and cache other data points (or pixels) of other thread blocks concurrently, moving the thread block on to process the next 8 rows (starting from step 1).
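A serial sketch of the LBB sliding-window schedule, using a Haar-style pair lifting so that the window-local vertical pass matches the full-column result; longer wavelets would need halo rows across window boundaries, which this sketch omits:

```python
def lift_pairs(vec):
    """Haar-style in-place lifting on consecutive (even, odd) pairs:
    detail stays at the odd slot, approximation at the even slot."""
    out = list(vec)
    for j in range(0, len(out) - 1, 2):
        out[j + 1] -= out[j]        # predict: detail
        out[j] += 0.5 * out[j + 1]  # update: approximation
    return out

def lbb_subset(subset, win_rows=8):
    """LBB schedule on one thread-block subset: fill an 8-row sliding window
    (the 'shared memory'), do the horizontal pass, then the vertical pass on
    the cached window, flush to 'global memory', and slide down."""
    height, width = len(subset), len(subset[0])
    out = []
    for top in range(0, height, win_rows):  # slide the window (step 6)
        shared = [lift_pairs(subset[r]) for r in range(top, top + win_rows)]  # steps 1-3
        for c in range(width):              # step 4: vertical pass in-window
            col = lift_pairs([shared[r][c] for r in range(win_rows)])
            for r in range(win_rows):
                shared[r][c] = col[r]
        out.extend(shared)                  # step 5: flush cached results
    return out
```

The in-place pair lifting mirrors the lifting scheme's in-place property; for Haar, pairs never straddle an 8-row window boundary, so each window can be processed independently.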
• Performance analysis. Take the CDF(9,7) wavelet as an example to evaluate the CGMA (Compute to Global Memory Access) ratio of LBB (i.e., an index that estimates the ratio of instruction throughput to global memory accesses for performance analysis [22]). Performing 2D LWT on a data point requires 18 computational operations, including 16 multiplications and 2 additions, together with 2 global memory accesses (one read and one write), assuming that the size of a sliding window is 8 × 32 and each thread processes one data point in parallel, such that the CGMA is:

CGMA = 18 / 2 = 9    (10)

Compared with the RC model (whose CGMA is 0.9), LBB improves the CGMA by a factor of 10.
4.2 The improved lifting scheme
In a GPU routine, the LWT requires many synchronizations, since each procedure of the lifting scheme consists of different functions computed in sequence, which decreases the parallelizability of LWT computations. To solve this problem, this study devised an improved approach that decreases the inherent synchronizations of the lifting scheme. The factorization of the polyphase matrix P(z) (see Eq. (9)) can be converted into the following form:
where n = m if u0(z) = 0 and n = m − 1 if u0(z) ≠ 0. According to Eq. (11), the optimized operation steps of the lifting-scheme-based 1D forward LWT are as follows:
Fig. 4 The overall framework of the LB model.
Fig. 5 The overall framework of the BB model.
Fig. 6 The overall framework of the LBB model.
(11)
(1) Split:
(2) Preprocessing:
(3) Lifting (predict and update): for i = 1, …, n:
(4) Scale:
Step 3 is performed iteratively. However, synchronization operations still exist in step 3, since the computation of si can only begin after di has been computed, because si relies on di. Thus, Eq. (14) should be modified as follows:
As can be seen from Eq. (16), the computation of si no longer relies on di; si and di rely only on si-1 and di-1 respectively, so the corresponding computational operations can be performed concurrently, halving the synchronizations required in the lifting scheme. The main computational procedure of the single-level 1D forward LWT, optimized from Fig. 2, is demonstrated in Fig. 7. The inverse LWT has a similar optimization approach.
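The dependency-removal idea can be illustrated with the Haar wavelet in a minimal Python sketch (the Haar lifting convention d = odd − even, s = even + d/2 is an assumption used for illustration): substituting the predict result into the update step makes both outputs depend only on the raw inputs, so on a GPU they could be computed concurrently.

```python
def haar_lift_serial(even, odd):
    d = [o - e for e, o in zip(even, odd)]      # predict: must finish first
    s = [e + di / 2 for e, di in zip(even, d)]  # update: reads d (serial dependency)
    return s, d

def haar_lift_parallel(even, odd):
    # Substituting d into the update removes the dependency: both outputs
    # now read only the raw inputs, so the two passes could run concurrently.
    d = [o - e for e, o in zip(even, odd)]
    s = [(e + o) / 2 for e, o in zip(even, odd)]
    return s, d

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
even, odd = x[0::2], x[1::2]
assert haar_lift_serial(even, odd) == haar_lift_parallel(even, odd)
```

The rewritten form produces identical coefficients; only the data dependency between the predict and update passes is removed.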
4.3 The construction of the single GPU-based GPCF-LWT
Before realizing parallel computations of the single-level 1D LWT with GPCF-LWT, a series of prediction coefficients P = {p0, p1, …, pn} and update coefficients U = {u0, u1, …, un} must be generated according to the given wavelet name, in order to ensure the generality of GPCF-LWT. P and U are 2D arrays with the same length n, and P[i] and U[i] hold the prediction and update coefficients for the ith prediction and update operations respectively. Within GPCF-LWT, P and U serve as input parameters reflecting different wavelet types, such that GPCF-LWT can support the whole family of LWTs. Based on LBB and the improved lifting scheme, Fig. 8 illustrates the main computational procedure of the multi-level 2D forward LWT proposed in this study. It first generates the series of P (a 2D array) and U (a 2D array) according to the given wavelet name, and then copies these prediction and update coefficients, together with the raw 2D signal data, from a CPU memory space to a GPU global memory space; a specifically designed CUDA stream overlapping strategy for increasing the concurrency of this step
(12)
(13)
(14)
(15)
(16)
Fig. 7 The improved lifting scheme for the single-level 1D forward LWT.
is presented in Section 4.4. In Fig. 8, the output results cA, cH, cV, and cD correspond to the approximation coefficients and the detail coefficients along the horizontal, vertical, and diagonal orientations respectively. The multi-level 2D LWT applies the approximation coefficients cA as the 2D raw input signal for the next computational level, and then a new 2D forward LWT iteration with the same computational steps and parameters as the previous level is started on this level. In this way, the procedure is executed recursively until the expected level is reached, and the final results are the cA of the last level together with all detail coefficients cH, cV, and cD collected from each computational level. The computational framework of the LBB-based inverse LWT can be constructed in a similar way to Fig. 8.
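The recursive feeding of cA described above can be sketched in plain Python (Haar lifting is used as a stand-in wavelet, and subband naming conventions vary; this is an illustration, not the paper's CUDA implementation):

```python
def lift1d(x):
    # single-level 1D Haar lifting: approximation s and detail d
    even, odd = x[0::2], x[1::2]
    d = [o - e for e, o in zip(even, odd)]
    s = [(e + o) / 2 for e, o in zip(even, odd)]
    return s, d

def lwt2d(a):
    # separable 2D transform: rows first, then columns
    rows = [lift1d(r) for r in a]
    lo = [s for s, _ in rows]
    hi = [d for _, d in rows]
    def cols(m):
        t = [list(c) for c in zip(*m)]           # transpose to iterate columns
        out = [lift1d(c) for c in t]
        A = [list(r) for r in zip(*[s for s, _ in out])]
        D = [list(r) for r in zip(*[d for _, d in out])]
        return A, D
    cA, cV = cols(lo)
    cH, cD = cols(hi)
    return cA, cH, cV, cD

def multilevel(a, levels):
    # each level consumes the previous level's cA and keeps its details
    details = []
    for _ in range(levels):
        a, cH, cV, cD = lwt2d(a)
        details.append((cH, cV, cD))
    return a, details
```

A flat 4 × 4 input, for instance, yields a single approximation value after two levels, with all detail subbands zero.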
In this case, a raw surface measurement dataset is divided into several small data chunks, which are further parallelized by applying the pipeline overlapping strategy (see Section 4.4). A host function (i.e., the forward and inverse scheduler) executing on a CPU is responsible for launching the Predict() and Update() kernels on a GPU iteratively, according to the series of prediction and update coefficients. Fig. 9 shows that a 2D LWT in GPCF-LWT is realized by separating it into two 1D LWTs along the row and column orientations respectively, and the multi-level LWT is implemented by applying the single-level LWT recursively. Once the final results are obtained, the inverse LWT can be performed to reconstruct the filtered engineering surfaces (e.g., roughness and waviness).
GPCF-LWT supports the whole family of wavelet transforms based on LBB and the improved lifting scheme. This paper presents two examples (the Haar and CDF(9,7) wavelets) to show the realization of the LWT with any specific
Fig. 8 The main computational procedure of the improved lifting scheme with LBB for the multi-level 2D forward LWT.
Fig. 9 An implementation overview of GPCF-LWT on a single GPU card.
wavelet by using GPCF-LWT generically. The factorizations of the polyphase matrices of the wavelet filter coefficients conforming to the improved lifting scheme can be written as Eqs. (17) and (18) for Haar and CDF(9,7) respectively.
According to Eqs. (17) and (18), P, U and K for Haar and CDF(9,7) can be acquired from Eqs. (19) and (20) respectively.
The coefficient u0 of both Haar and CDF(9,7) is zero, so the optimized computational procedure of the single-level 1D forward LWT can be devised as in Fig. 10, where a host function is responsible for scheduling the executions of the predict and update kernels and for providing different configuration and computational parameters for different types of wavelets, while the corresponding predict and update kernels are executed on a GPU card in parallel. The left side of Fig. 10 illustrates the computational procedure of the Haar wavelet transform, and the CDF(9,7) wavelet is illustrated on the right side. The inverse LWT implementations of Haar and CDF(9,7) are similar to the forward implementations; they simply reverse the sequence of operation steps of the forward computations and swap the corresponding addition and subtraction operators.
(17)
(18)
(19)
(20)
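The coefficient-driven scheduling loop can be sketched in plain Python (this stands in for the host scheduler's kernel-launch loop; the two-neighbour lifting form and the symmetric boundary handling are simplifying assumptions, not the paper's exact kernels):

```python
def generic_lift(x, P, U, K=(1.0, 1.0)):
    """Apply len(P) prediction/update passes from coefficient tables P and U,
    then scale by K, mirroring the wavelet-agnostic driver described above."""
    even, odd = list(x[0::2]), list(x[1::2])
    n = len(even)
    for p, u in zip(P, U):
        # predict pass: odd[i] += p * (even[i] + even[i+1]), symmetric edge
        for i in range(n):
            right = even[i + 1] if i + 1 < n else even[i]
            odd[i] += p * (even[i] + right)
        # update pass: even[i] += u * (odd[i-1] + odd[i]), symmetric edge
        for i in range(n):
            left = odd[i - 1] if i > 0 else odd[i]
            even[i] += u * (left + odd[i])
    return [K[0] * e for e in even], [K[1] * o for o in odd]
```

Supplying a different wavelet then amounts to passing different P, U and K tables to the same driver, which is the generalization property GPCF-LWT relies on.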
Fig. 10 The scheduling algorithm of the single-level 1D forward LWT.
4.4 Three optimization strategies for the GPCF-LWT implementation on a single GPU card
GPCF-LWT further explores three implementation optimizations on a single GPU by seamlessly integrating the implementation with the latest CUDA GPU hardware advancements, so as to maximize the parallelization of generic 2D LWT computations:
1) GPCF-LWT provides an optimization strategy that enables dynamic configuration of kernel execution parameters based on the CUDA thread hierarchy. In CUDA programs, a thread block must be loaded into an SM inseparably, and the threads in a block are further encapsulated into several warps. An SM executes all its warps (e.g., 4 warps in Fig. 11) according to the SIMD (Single Instruction, Multiple Data) model, such that all threads in a warp share the same program instructions but process different data at the same time. To hide memory latency, an inactive warp is activated when an active warp in the same SM is pending (e.g., a warp pends when it needs to access data from global memory, which generally takes 400–600 cycles) [20,21]. Thus, it is vitally important to configure the number of CUDA threads in a block and organize them in an appropriate hierarchy (i.e., to set applicable block and grid sizes) for each kernel, such that SMs have just enough warps to schedule while limiting the complexity of thread management. Ideally, the number of threads in a block should be an integral multiple of 32, matching the fact that each warp has 32 threads for the 32 CUDA cores in CUDA GPUs, so each SM can be assigned just enough warps. In practice, an SM may reduce the number of resident blocks at runtime when too many CUDA threads are organized in a block and each thread occupies too many registers or too large a shared memory space, which often leads to the critical problem of there not being enough warps to schedule. With these considerations, this study tested and evaluated the relationship between the number of threads per block and the hardware consumption of GPCF-LWT, and found that the occupancy of CUDA cores in an SM reaches 100% when 256 threads per block are configured in GPCF-LWT. Thus, the block number (grid size) of the LWT is bn = (sn + 256 − 1)/256, where sn is the length of the input data, such that bn blocks of 256 threads each keep all SMs of a GPU card busy all the time and hide the delay caused by data transfers. To test the validity and practicality of this implementation, this study compared the effects of different numbers of threads per block to find the correlation between the configured thread count and the processing time. Fig. 12 lists the processing times for different numbers of threads per block for the 4-level 2D LWT with a data size of 4096 × 4096 running on a single GTX 1080 GPU card (the hardware specification is given in Table 3). Fig. 12 shows that GPCF-LWT attains the minimum processing time when the number of threads per block is set to 256.
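The grid-size rule above, bn = (sn + 256 − 1)/256, is simply a ceiling division; a minimal sketch:

```python
BLOCK = 256  # threads per block, the value found optimal in the experiment

def grid_size(sn):
    # ceiling division: enough blocks to cover sn elements at 256 threads each
    return (sn + BLOCK - 1) // BLOCK
```

For example, a 4096-element row needs exactly 16 blocks, while 4097 elements need 17, since the last partial block must still be launched.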
2) GPCF-LWT has developed a hierarchical memory usage scheme for generic LWT computations based on the CUDA GPU memory hierarchy. CUDA adopts an SPMD (Single Program, Multiple Data) programming model and provides a sophisticated memory hierarchy (registers, local memory, shared memory, global memory, texture memory, constant memory, etc.). Hence, a GPU can achieve high data-access throughput via elaborately designed CUDA code that uses the different memories efficiently according to the respective data features, including access mode, size and format. GPCF-LWT calls the CUDA API function cudaHostAlloc() rather than malloc() to allocate CPU host memory in page-locked (pinned) memory mode, which supports DMA (Direct Memory Access) without requiring additional buffers for caching data. Compared with the page-able memory mode, the page-locked mode transfers data directly from a CPU host memory space to a GPU global memory space, so its write speed is 1.4 times and its read speed 1.8 times faster than the page-able mode [21]. However, pinned memory should not be overused, since its scarcity leads to reduced performance. With this in mind, this study pre-allocates a fixed-size pinned memory in the initialization stage and reuses it during the whole computational life-cycle, thereby maintaining high memory performance throughout. Moreover, GPCF-LWT uses the shared memory of a GPU for executing kernels: instead of direct data accesses from GPU global memory, a single data point is loaded into the shared memory space of an SM once and only once from global memory, after which any loaded data point can be accessed by several CUDA threads simultaneously at very high speed (generally only 1–32 cycles). In comparison, the access speed of global memory drops dramatically when several threads access the same data point in global memory repeatedly; registers were not applied here for a similar reason [20]. Another major reason for using shared memory is that row addresses in global memory are aligned, such that the loads of 32 CUDA threads in a warp reading data contiguously from global memory can be coalesced, i.e., loading 32 data points into a warp needs only one global memory access rather than 32 accesses in sequence; this coalescing enhances the data transfer bandwidth. Another important issue is where to store and access P and U: constant memory or global memory? Although constant memory and global memory reside in the same physical location in the GPU hardware architecture, accessing a data point from constant memory can be much faster than from global memory due to its inherent cache and broadcast mechanisms. For instance, a data point can be loaded from the constant-memory cache when it is a cache hit, which generally costs just one cycle. Broadcast is another very useful mechanism to reduce data access delay: a data point loaded by one thread is broadcast to all the other threads of its warp. Thus, accessing a data point from constant memory can be as fast as a register when all threads of a warp access the same data point [21,43]. In this study, all CUDA threads use the same values of P and U, which are constant and do not change while performing PO and UO, so fast data access is achieved by storing and accessing the P and U sets in a constant memory space. All in all, the memory hierarchy of GPCF-LWT for generic LWT computations is organized as in Fig. 13.
To test the validity and practicality of this hierarchical memory usage scheme, this study analyzed the performance factors influencing the 2D LWT computation on different memory spaces. Firstly, the experiment compared the bandwidths of the pinned and page-able memory modes by transferring 1 GB of measured data in both the H2D (CPU host memory to GPU global memory) and D2H (GPU global memory to CPU host memory) directions. In Fig. 14, the bandwidth of the pinned memory mode gains approximately 55% improvement over the page-able memory mode. The experiment then compared the computational performance of the proposed hierarchical memory usage scheme against pure global memory usage. In the pure global memory solution, all data (raw measured datasets, P, U, intermediate results and the filtration output results) are stored in global memory. In contrast, the proposed scheme stores different datasets in different memory spaces: it applies the global memory space to store the raw measured datasets and the filtration results, the shared memory space to cache subsets and temporary results, and the constant memory space to hold the P and U coefficient instances. Fig. 15 illustrates the processing times of these two solutions; the experimental results show that the proposed scheme gains about a 50% computational performance enhancement.
3) GPCF-LWT enables a pipeline overlapping strategy using CUDA streams. CUDA GPUs provide a non-blocking asynchronous execution mode for GPU kernels, such that host code on a CPU and CUDA code on a GPU can be executed concurrently. Based on this consideration, this study derives a CUDA stream overlapping strategy that effectively alleviates the delay imposed by scheduling the sequential execution order of the lifting scheme. Each pipeline manages a CUDA stream, and a pipeline can organize n kernels running on it in different time slices, i.e., a pipeline activates and runs another kernel automatically when a kernel is pending [19]. Pure CUDA programs based on the non-overlapping model are variants of the following three fundamental stages: (1) transferring a raw dataset from a CPU host memory space to a GPU global memory space through the PCI-E buses (H2D data transfers); (2) processing the raw dataset via a specific algorithm, i.e., GPU kernel executions (e.g., 2D LWT); (3) transferring the output results from the GPU global memory space back to the CPU host memory space (D2H data transfers). Obviously, both the data transfer stages (H2D and D2H) and the kernel executions for data processing are time-consuming. More significantly, these three time-consuming stages are performed in sequence in the non-overlapping mode; the sequential timelines of the three pipeline stages are illustrated in Fig. 16. The data transfers in Fig. 16 are blocking transfers, and the host function running on a CPU can only launch GPU kernels after completion of the corresponding data transfers, so both the PCI-E buses and the GPU waste a lot of time in the idle state during the entire workflow, which sharply decreases the processing performance of CUDA programs. To reduce these idle time slices, GPCF-LWT overlaps the three major time-consuming stages using CUDA streams in the pipelines. In accordance with the hardware development that both the GPU and PCI-E support bidirectional data transfers, each pipeline in GPCF-LWT handles a repeated process of transferring a small data chunk to GPU memory first and then launching LWT kernels to process this data chunk, until the whole dataset is processed. Fig. 17 shows the timelines of the two data transfer stages and the LWT kernel execution stage in the overlapping model, in which LWT kernels can be performed during the data transfer stages (i.e., the data transfers are hidden), such that the GPU stays busy during the entire data processing workflow and gains higher performance than the non-overlapping model. This strategy can be gracefully extended to hide data transfer delays in multi-GPU systems. The CUDA API cudaMemcpyAsync() is applied in this study to asynchronously transfer data from CPU host memory to GPU global memory in order to realize the overlapping between data transfers and kernel executions. The pseudocode of the devised overlapping strategy is listed in Algorithm 1, and the LBB-based 2D LWT kernel is listed in Algorithm 2. Moreover, this aggregated strategy still applies on systems using NVLink technology, with two features: (1) both H2D and D2H data transfers are efficient, so stages 1 and 3 consume far less time; (2) these two stages can transfer a relatively large data chunk each time, such that stage 2 maintains high concurrency. With these new features, GPCF-LWT can achieve better performance without extra effort.
Fig. 11 The hierarchy structure of CUDA threads.
Fig. 12 Processing times of the 2D LWT in GPCF-LWT for different numbers of threads in a block.
Fig. 13 The hierarchical memory usage of GPCF-LWT.
Fig. 14 The bandwidths of the pinned and page-able memory modes.
Fig. 15 The performance comparison of the two solutions.
Algorithm 1 Overlapping of the three stages: two data transfers and the kernel executions.
Inputs: a 2D raw dataset Data[][], row number rows, and column number cols.
1 cudaStreamCreate(&stream[i]) // create nStreams streams
2 a_h = &Data[0][0] // get the start address of Data[][]
3 size = rows*cols/nStreams // calculate the data size for each stream
4 for i = 0 to nStreams - 1 do
5  offset = i*size // calculate the address offset for a stream pipeline
6  cudaMemcpyAsync(a_d + offset, a_h + offset, size, H2D, stream[i]) // transfer a data chunk to the GPU in the ith stream
7  LBB_LWT<<<…, stream[i]>>>(a_d + offset) // launch a wavelet transform kernel in the ith stream
8  cudaMemcpyAsync(a_h + offset, a_d + offset, size, D2H, stream[i]) // transfer the results back to the CPU in the ith stream
Algorithm 2 The LBB-based 2D LWT kernel.
Inputs: a 2D raw dataset Data[][], row number rows, and column number cols.
Outputs: cA, cH, cV, and cD.
1 // for each sliding window
2 for (row = 0; row < rows; row += SLIDE_HEIGHT) do
3  __shared__ float sData <- load raw data from the GPU global memory space into the GPU shared memory space
4 // each group of row threads performs horizontal 1D LWT in parallel to process a subset and caches the temporary results
Fig. 16 The timelines of the data transfers and LWT kernel executions in the non-overlapping model.
Fig. 17 The timelines for the overlapping of the three pipelines.
5  even, odd <- split(sData);
6  Preprocessing(even, odd);
7  __syncthreads(); // synchronization operation
8  predict(even, odd);
9  update(even, odd);
10 __syncthreads();
11 // each group of column threads performs vertical 1D LWT in parallel to process the temporary results
12 even, odd <- split(sData);
13 Preprocessing(even, odd);
14 __syncthreads();
15 predict(even, odd);
16 update(even, odd);
17 __syncthreads();
18 save(even, odd, output); // save the output results back to the GPU global memory space
To test the validity and practicality of this pipeline overlapping strategy for LWT computations, this study compared the performance of the overlapping and non-overlapping implementations on a single GTX 1080 GPU card, performing the 4-level 2D LWT with the D4 wavelet (data size 4096 × 4096). The experiment recorded the processing time of each step, i.e., H2D, data processing on the GPU card (CUDA kernel executions), and D2H. From the experimental results listed in Table 1, it can be seen that the overall computational time of the non-overlapping implementation is the sum of the time consumptions of these three steps, since the steps are performed in sequential order. By contrast, the overall computational times of the overlapping implementations are steadily less than the sum of these three steps, as they are performed concurrently. Thus, the overlapping optimization strategy reduces the overall computational time (Table 1).
Table 1 The performance comparison between the overlapping and non-overlapping implementations (ms).
Program routines Strategies H2D Kernel executions D2H Overall processing time
FLWT Non-overlapping 35 193 32 260
  Overlapping 33 197 32 213
ILWT Non-overlapping 33 215 38 286
  Overlapping 34 206 35 225
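A quick arithmetic check of Table 1 confirms the claim in the text: the non-overlapping totals equal the sum of the three stages, while the overlapping totals come in below that sum because the stages run concurrently.

```python
# Table 1 rows as (H2D, kernel, D2H, overall), in ms
non_overlap = {"FLWT": (35, 193, 32, 260), "ILWT": (33, 215, 38, 286)}
overlap     = {"FLWT": (33, 197, 32, 213), "ILWT": (34, 206, 35, 225)}

for h2d, k, d2h, total in non_overlap.values():
    assert h2d + k + d2h == total      # sequential: the stages add up exactly

for h2d, k, d2h, total in overlap.values():
    assert total < h2d + k + d2h       # concurrent: overlap hides part of the time
```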
5 The multi-GPU solution of GPCF-LWT
The powerful computing capability of a single GPU satisfies the computational demands of many applications. However, the limited memory space and CUDA cores of a single GPU card make it impossible to process a complete dataset in online surface texture measurement applications. Thus, dealing with large-scale datasets requires a multi-GPU processing mode. At present, there are two categories of commonly used multi-GPU systems: standalone computers (a single CPU node with multiple GPU cards) and cluster distributed systems (multiple CPU nodes, each accompanied by one or more GPU cards). Both can greatly accelerate data processing and achieve better computational performance. However, limited by the low-speed PCI-E buses and network connections, big data transfers between a CPU host memory space and a GPU global memory space become the key bottleneck of multi-GPU systems. Based on this consideration, reducing or hiding data transfer delays is the most important optimization strategy for obtaining high performance. More importantly, since the major cost of the LWT lies in multiplication, addition and subtraction, and it shares only neighbouring data (i.e., GPCF-LWT does not need to share datasets across the network), the communication requirements among distributed computer nodes in LWT applications are very low. From this point of view, a standalone heterogeneous multi-GPU system has been adopted for GPCF-LWT. This study tries to reduce the idle time slices of the PCI-E buses and GPU kernel executions, and improves the computational load balancing over the multiple GPU cards in the multi-GPU GPCF-LWT implementation. To reduce idle time slices, this study made two improvements to GPCF-LWT: (1) maximizing the utilization rate of the PCI-E bus bandwidth; (2) hiding data transfer delays by keeping every GPU card busy during the whole data processing workflow. For the first point, GPCF-LWT applies bidirectional data transfers on a PCI-E bus, whose bandwidth supports H2D and D2H at the same time. For the second point, the pipeline overlapping strategy presented in Section 4.4 is applied on every GPU card in the heterogeneous multi-GPU system. Furthermore, in GPCF-LWT, the data transfers between a CPU and a GPU only need to read once, and no intermediate computational results are transferred between a CPU and a GPU or among different GPUs. Thus, the multi-GPU based GPCF-LWT does not require any extra communication and has no data transfer delay during the whole computational procedure. Nevertheless, traditional dataset division approaches are likely to cause a load unbalancing problem and a low rate of GPU hardware utilization.
5.1 The load unbalancing problem
Basically, due to the separable property of the 2D LWT, a simple load balancing model based on a pure dataset division method can be derived [23]. This solution is simple and useful, but it may lead to a load unbalancing problem when a multi-GPU system contains heterogeneous GPU cards with unequal computing capability. As a result, the overall performance of such a multi-GPU system depends on the GPU node with the lowest performance. Moreover, 2D LWT applications are data-oriented computations, i.e., huge data processing but a limited number of CUDA kernels. Thus, this study devised a data-oriented dynamic load balancing (DLB) model for rapid and dynamic dataset division and allocation on heterogeneous GPU nodes by using a fuzzy neural network (FNN) framework [38]. The data-oriented DLB model minimizes the overall processing time by dynamically adjusting the number of data subsets in the group for each GPU node according to runtime feedback. Here, two important improvements to the FNN of the authors' previous work [38] that enhance the effectiveness of the DLB model are presented.
5.2 The improved FNN model
The overall framework of the FNN-based DLB model devised in the previous work is illustrated in Fig. 18 [38]. In this model, five state feedback parameters (fuzzy sets) relating closely to the overall computational performance are defined: floating-point operation performance (F), global memory size (M), parallel ability (P), the occupancy rate of the computational resources of a GPU (UF), and the occupancy rate of the global memory space of a GPU (UM). The predictor in the FNN is devised to predict the relative computational performance under different workload scenarios, and the scheduler can reorganize the data groups for each GPU card according to the dynamic feedback of the predictor. In the beginning, a raw dataset is divided into several equal-sized data chunks (the chunk size is relatively small, and the number of data chunks should be far greater than the number of GPU nodes), and these chunks are organized into n groups (n equals the number of GPU cards in the multi-GPU system) by the scheduler. At runtime, the number of data chunks in the group assigned to each GPU card differs and is determined dynamically by the real-time feedback (i.e., the five state feedback parameters) of each GPU card. Thus, the DLB model minimizes the overall processing time by dynamically adjusting the number of data chunks in the group for each GPU at runtime.
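The scheduler's allocation step can be sketched as a proportional split (the FNN predictor itself is out of scope here; the function name and the simple proportional rule are illustrative assumptions, not the paper's exact scheduler):

```python
def allocate(n_chunks, abilities):
    """Assign n_chunks data chunks to GPU groups in proportion to each
    node's predicted relative computational ability."""
    total = sum(abilities)
    counts = [int(n_chunks * a / total) for a in abilities]
    # give any rounding remainder to the most capable node
    counts[abilities.index(max(abilities))] += n_chunks - sum(counts)
    return counts
```

At each feedback round the scheduler would recompute the abilities from the predictor and re-run this split, so a node that has become busy receives fewer chunks in the next group.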
Based on these earlier outputs, this study further explored two improvements to the FNN to increase the efficiency and prediction accuracy of the devised DLB model:
1) This study constructed three additional state feedback parameters to improve the accuracy of predicting the nth relative computational ability CPni for the ith GPU node: memory speed (MS), boost clock (BC) and the last relative computational ability (CPn-1i, or LCPni). Both memory speed and boost clock are core hardware parameters affecting the computing capability of CUDA GPU cards. For example, the memory speed of the GeForce GTX 1080 GPU card is 10 Gbps and its boost clock is 1607 MHz, while the GeForce 750 Ti GPU card offers 1.02 Gbps and 1085 MHz, respectively.
Fig. 18 The overall framework of the FNN-based DLB model.
There is a strong correlation between CPn-1i and CPni: when the current CPn-1i is very high, more subsets will be organized in the data group of GPUi, which keeps GPUi busy with high UM and UF values and in turn leads to a low CPni. Thus, the balance between CPn-1i and CPni needs to be considered in predictions. The improved DLB model now has eight feedback parameters, all fuzzified into "high" and "low" fuzzy subsets; the three new fuzzy sets and their subsets are listed in Table 2.
2) This study improves the fuzzy subset descriptions by providing a comprehensive membership function adaptation mechanism. In the former DLB model, all fuzzy sets are fuzzified by one membership function: the sigmoid. Here, the static parameters (i.e., F, M, P, MS and BC) still use the sigmoid membership function, whereas the dynamic feedback parameters (i.e., UF, UM, CP and LCP) adopt an innovative membership function. This membership function is based on the softplus, processed by shift and scale operations to satisfy the fundamental properties of a fuzzy membership function. The softplus has become a mainstream activation function in deep learning networks; it can be formulated as Eq. (21), and its curve is illustrated in Fig. 19 [44]. It grows very slowly when x is relatively small, and increases rapidly once x exceeds a threshold value (e.g., 1 in Fig. 19). This phenomenon models the change of computational performance of a GPU node: when UF and UM are relatively low, the computing capability is almost unchanged, but it drops sharply when UF and UM exceed their threshold values. However, the softplus cannot be used as a membership function directly, since Eq. (21) fails to satisfy f(x) ∈ [0, 1] for x ∈ [0, 1].
Table 2 The three new fuzzy sets and their subsets.
Sets Descriptions Fuzzy subsets Descriptions of fuzzy subsets
MS Memory speed MSL Low
  MSH High
BC Boost clock BCL Low
  BCH High
LCP The last relative computational ability LCPL Low
  LCPH High
Table 3 The specification of the heterogeneous multi-GPU system.
Property Description
CPU Intel Core i7-4790, 3.6 GHz
Memory 16 GB
GPU GPU1: NVIDIA GeForce GTX 750 Ti, 2 GB, 5 × SM, 128 SP/SM, 1.02 Gbps memory speed, 1085 MHz boost clock
  3 × GPU2: NVIDIA GeForce GTX 1080, 8 GB, 20 × SM, 128 SP/SM, 10 Gbps memory speed, 1607 MHz boost clock
OS Windows 10, 64-bit
CUDA Version 8.0
(21)
To remedy this defect, this study performs shift and scale operations on the softplus function, so it can be extended as follows:
where a is a scale factor and b is a shift factor, and Eq. (22) must satisfy:
By adding a normalization factor, Eq. (22) can be redefined as Eq. (24), where a and b take the values 5 and -3 respectively. Fig. 20 shows the curve of the improved softplus.
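One plausible form of the normalized, shifted and scaled softplus is sketched below (a reconstruction consistent with the stated constraints a = 5, b = -3 and f mapping [0, 1] into [0, 1]; the paper's exact Eq. (24) is not reproduced in this extract):

```python
import math

def softplus(x):
    # the plain softplus of Eq. (21)
    return math.log(1.0 + math.exp(x))

def membership(x, a=5.0, b=-3.0):
    # shift by b, scale by a, then normalize so that f(1) = 1;
    # the resulting curve stays near 0 for small x and rises sharply
    # once a*x + b crosses zero, matching the behaviour described above
    return softplus(a * x + b) / softplus(a + b)
```

Under this form, membership(0.0) is close to 0 and membership(1.0) is exactly 1, giving a valid fuzzy membership function over the unit interval.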
Based on the three new state feedback parameters and the improved softplus, this study constructed a more effective DLB model that integrates fuzzy theory, artificial neural networks (ANN) and the backpropagation algorithm. The comparison between the improved DLB model and the conventional DLB model of [38] is presented in Section 6.2 (the fifth experiment).
Fig. 19 The curve of the softplus function.
(22)
(23)
(24)
Fig. 20 The curve of the improved softplus function.
6 Test and performance evaluation
6.1 Hardware and testbed
Table 3 specifies the heterogeneous multi-GPU system constructed for testing and evaluating GPCF-LWT in the experiments and the applicable case study. It contains four GPU cards of two types: a middle-to-low-range GPU (NVIDIA GeForce GTX 750 Ti) and three high-end GPUs (NVIDIA GeForce GTX 1080). The GTX 750 Ti and GTX 1080 contain 640 and 2560 CUDA cores, respectively.
6.2 Benchmarking experiments
This section tests and evaluates the computational performance and practicability of GPCF-LWT under both a heterogeneous multi-GPU setting and a high-end single GPU setting in five encompassing experiments:
1) The first experiment compares the computational performance of LBB, RC and RC&LB [4] based on the classic lifting scheme in the single GPU setting (a GTX 1080 GPU card). LBB, RC and RC&LB were tested on the 4-level CDF(9,7) forward and inverse 2D LWT, with data sizes ranging from 1024 × 1024 to 5120 × 5120. Fig. 21 shows that, with increasing data size, the performance of RC plunges quickly due to its intrinsic synchronization operations. RC&LB shows high computational performance when the data sizes are relatively small (equal to or smaller than 2048 × 2048), but its processing time increases drastically once the data size grows, owing to extra synchronization calls. In contrast, the devised LBB model maintains high performance over the larger data range. This experiment demonstrated that the LBB model in the GPCF-LWT framework is superior, gaining up to a 14 times speedup in comparison with RC, and up to 7 times compared with RC&LB, under a single GPU configuration.
2) The second experiment tests the effectiveness of the improved lifting scheme in the single GPU setting, again performing the 4-level CDF(9,7) forward and inverse 2D LWT. In Fig. 22, the improved lifting scheme gains higher computational performance, since it removes half of the synchronizations required in the lifting step (step 3) (see Section 4.2), substantially increasing the parallelizability of LWT computations.
3) To evaluate the generalization feature of GPCF-LWT, the third experiment tests the computational performance of six types of wavelets, namely Haar, D4, D8, D20, CDF(7,5) and CDF(9,7), using the classic lifting scheme with RC, the classic lifting scheme with RC&LB, and the improved lifting scheme with LBB. A single GTX 1080 GPU card is used for the 4-level 2D LWT with a data size of 4096 × 4096. In this experiment, the different wavelet transforms are performed in a generic manner with different U, P and K for each wavelet. The experimental results highlight that the computational performance of GPCF-LWT is significantly higher than that of the corresponding RC and RC&LB based implementations; see Fig. 23.
4) The fourth experiment evaluates the effectiveness of the three optimization strategies equipped in GPCF-LWT, again on a single GTX 1080 GPU card. Since each strategy has been tested separately in Section 4.4, this experiment focuses on the comparison between LBB with the hybrid strategy (the combination of the three optimizations) and LBB without any optimization. Fig. 24 shows that the hybrid approach gains noticeable improvements compared with pure LBB.
5) The fifth experiment further examined the computational performance of GPCF-LWT on two single GPU settings and three multi-GPU settings, using the heterogeneous GPU system listed in Table 3. This experiment adopted the CDF(9,7) wavelet to perform the 4-level 2D forward and inverse LWT on large data sizes ranging from 10,240 × 10,240 to 16,384 × 16,384. Fig. 25 lists the processing times of the five test settings, i.e., the single GTX 750 Ti GPU setting (setting 1), the single GTX 1080 GPU setting (setting 2), the pure dataset division based multi-GPU setting (setting 3), the original DLB based multi-GPU setting (setting 4) [38], and the improved DLB based multi-GPU setting (setting 5). It can be seen from Fig. 25 that the setting 1 implementation has the lowest computational performance, followed by setting 2, since the GTX 1080 GPU contains more CUDA cores (2560), a larger global memory space (8 GB) and a higher data transfer speed (10 Gbps) than the GTX 750 Ti GPU (640, 2 GB and 5.4 Gbps, respectively). However, the computational performance improvement of setting 2 is very limited, which shows that simply switching to a better GPU card does not yield a linear performance improvement for large-scale computational applications with extremely large input volumes and highly repetitive simple operational procedures or iterations, such as the 2D LWT. The setting 3 implementation gains a somewhat higher computational performance than the two single GPU implementations, by about 2 and 3 times respectively. In contrast, the DLB-driven GPCF-LWT multi-GPU implementations (settings 4 and 5) achieve significant computational performance enhancements, since the FNN-based DLB can allocate data tasks dynamically at runtime according to the relative computational performance across all GPU nodes. Moreover, the improved DLB (setting 5) gains better performance than the original version (setting 4) for all data sizes, thanks to the softplus-based membership function, which is better suited to adapting to the dynamic change of computational performance of heterogeneous GPU nodes. The peak performance gain of setting 5 (the speedup at the data size of 16,384 × 16,384) reaches approximately 10 times that of setting 3; compared with the two single GPU implementations, the speedup of setting 5 reaches 15 and 20 times over settings 1 and 2 respectively.
Fig. 21 The performance comparison among the LBB, RC and RC&LB models.
Fig. 22 The performance comparison between the classic and improved lifting schemes.
Fig. 23 The generalization feature evaluation of GPCF-LWT.
6.3 A case study on online engineering surface filtration
This section presents a case study on online engineering surface filtration to test the usability and efficiency of the devised multi-GPU GPCF-LWT in comparison with other state-of-the-art multi-GPU approaches. In the field of areal surface texture characterization, geometrical characteristic signals that relate closely to functional requirements can be extracted precisely from wavelet coefficients by using different transmission bands. The 2D forward LWT is performed on the raw 2D surface measurement datasets to obtain the detail coefficients Di at all levels and the approximation coefficients cA from the last level; the cAi of each level can then be obtained while performing the 2D inverse LWT. Based on the Di and cAi, the geometrical characteristics (i.e., roughness, waviness and form) of a surface texture can be obtained by calculating the inverse LWT defined in Eq. (25) [10,40].
In order to extract the roughness η(x,y) and waviness η′(x,y) from a surface texture profile, the cAi must be set to zero. Similarly, the Di must be set to zero to extract the form characteristic.
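This zero-and-invert extraction can be sketched with a minimal lifting implementation. The snippet below uses the simple Haar wavelet's predict/update lifting steps on a 1D profile for brevity (the case study itself uses the D4 and CDF(9,7) wavelets on 2D data), and the helper names are illustrative rather than the paper's API:

```python
import numpy as np

def haar_lift_forward(x):
    # one Haar lifting step: split into even/odd, predict, update
    s, d = x[0::2].astype(float), x[1::2].astype(float)
    d = d - s          # predict: detail = odd - even
    s = s + d / 2      # update: approximation keeps the local mean
    return s, d

def haar_lift_inverse(s, d):
    # undo the lifting steps in reverse order
    s = s - d / 2
    x = np.empty(s.size + d.size)
    x[0::2], x[1::2] = s, d + s
    return x

def decompose(x, levels):
    details = []
    for _ in range(levels):
        x, d = haar_lift_forward(x)
        details.append(d)
    return x, details           # cA of the last level, and D1..Dn

def reconstruct(cA, details):
    for d in reversed(details):
        cA = haar_lift_inverse(cA, d)
    return cA

profile = np.random.default_rng(0).normal(size=256)
cA, Ds = decompose(profile, levels=4)

# roughness-like component: zero cA and invert; form: zero every Di and invert
rough = reconstruct(np.zeros_like(cA), Ds)
form = reconstruct(cA, [np.zeros_like(d) for d in Ds])

# by linearity of the inverse LWT, the two components sum to the original
assert np.allclose(rough + form, profile)
```

The same principle carries over to the 2D case: zeroing selected coefficient bands before the inverse transform isolates the corresponding frequency band of the surface.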
Fig. 24 The effectiveness experiment of the three optimization strategies in GPCF-LWT.
Fig. 25 The comparison of processing times of the setting 1 to setting 5 implementations.
S(x, y) = cA_n(x, y) + Σ_{i=1}^{n} D_i(x, y)  (25)
This study selected two kinds of raw engineering surfaces (surfaces on ceramic femoral heads and on milled metallic workpieces) to test and evaluate the GPCF-LWT. The GPCF-LWT integrates the devised LBB, the improved lifting scheme, the three implementation optimizations, and the improved data-oriented DLB. The analysis compares it against SURFSTAND, a surface characterization system employing a modern CPU core [10], and two other multi-GPU implementations: RC&LB and the method proposed by Ikuzawa et al. [45]. It should be noted that the pure dataset-division method was applied for the two benchmark multi-GPU implementations. The heterogeneous multi-GPU system listed in Table 3 is used for all three multi-GPU implementations.
Figs. 26a and 27a show raw measured datasets from the functional surfaces of a ceramic femoral head and a milled metallic workpiece respectively, each with a sampling size of 4096 × 4096. In Fig. 26a, the measured surface texture has two different types of scratches produced by the grinding process: one consists of regular, shallow scratches, while the other is a random, deeper scratch. The six-level D4 wavelet has been used in this case to perform both the forward and inverse 2D LWT. Fig. 26b–d demonstrates the roughness, waviness and form characteristics extracted from the femoral head surface by using the inverse LWT defined in Eq. (25). In the case of the milled metallic surface (see Fig. 27a), the six-level CDF (9,7) wavelet has been used to perform the 2D LWT, and Fig. 27b–d demonstrates the roughness, waviness and form characteristics extracted from the milled metallic surface by using the inverse LWT defined in Eq. (25).
It is worth mentioning that CUDA supports both single-precision (32-bit, abbreviated as FP32) and half-precision (16-bit, FP16) floating-point formats [19,46]. This experiment tested both precisions. It was found that the float precision setting does not affect the computational performance of the 2D LWT on the CPU, because the Intel Core i7-4790 CPU uses 64-bit instructions: both FP16 and FP32 operations execute in 64-bit mode with negligible difference. In contrast, the FP16-based multi-GPU implementations show higher computational performance than the FP32 implementations. However, the lower precision of FP16 fails to fulfil the high-precision requirements of micro- and nano-surface texture metrology, so this case study applies only FP32 for evaluation. The results show that the multi-GPU GPCF-LWT and pure CPU (Intel Core i7-4790, 3.6 GHz) implementations produce the same filtration results in terms of both precision and accuracy. Table 4 shows that the multi-GPU implementation gains over 20 times speedup over the CPU implementation.
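The FP16 precision limit can be illustrated with a short sketch. The two height samples below are illustrative values chosen to sit at the nanometre scale that micro- and nano-surface metrology must resolve, not values from the measured datasets:

```python
import numpy as np

# two height samples (in millimetres) differing by roughly 1.2e-6,
# i.e. a nanometre-scale step on a millimetre-scale surface height
z32 = np.float32([1.2345678, 1.2345690])
z16 = z32.astype(np.float16)

# FP32 (~7 significant decimal digits) still resolves the step,
# while FP16 (~3 significant decimal digits) rounds both samples to
# the same representable value and the step collapses to zero
step32 = float(z32[1] - z32[0])
step16 = float(z16[1] - z16[0])
assert step32 > 0.0
assert step16 == 0.0
```

This is why the FP16 throughput advantage cannot be exploited here despite its faster GPU execution.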
Fig. 26 The functional surface texture of a ceramic femoral head (the form, waviness and roughness surfaces).
Fig. 27 The functional surface texture of a milled metallic workpiece (the form surface).
Table 4 The performance comparisons between a CPU and the multi-GPU implementation (ms).

Dataset                                      LWT (CPU / GPCF-LWT)   ILWT (CPU / GPCF-LWT)
Ceramic femoral head (4096 × 4096)           10,165 / 415           12,590 / 450
Ceramic femoral head (12,288 × 12,288)       28,569 / 756           34,681 / 832
Milled metallic workpiece (4096 × 4096)      12,850 / 510           12,685 / 511
Milled metallic workpiece (12,288 × 12,288)  30,256 / 748           35,647 / 786
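As a quick sanity check, the forward-LWT timings in Table 4 can be converted into per-dataset speedup ratios:

```python
# timings in ms from Table 4, forward-LWT column: (CPU, GPCF-LWT)
timings = {
    "ceramic femoral head 4096x4096": (10165, 415),
    "ceramic femoral head 12288x12288": (28569, 756),
    "milled workpiece 4096x4096": (12850, 510),
    "milled workpiece 12288x12288": (30256, 748),
}
speedups = {name: cpu / gpu for name, (cpu, gpu) in timings.items()}

# every dataset exceeds the 20x figure quoted in the text
assert all(s > 20 for s in speedups.values())
```

The ratios also show that the speedup grows with dataset size, consistent with the larger datasets amortizing fixed transfer and launch overheads better.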
Fig. 28 shows the performance comparison among GPCF-LWT, RC&LB and Ikuzawa's solution on the heterogeneous multi-GPU system detailed in Table 3. In this case, the testing dataset has been expanded to a size of 12,288 × 12,288 to evaluate whether GPCF-LWT can process large-scale datasets in an online manner. The experimental results show that RC&LB has the lowest computational performance due to its heavy synchronization calls. Ikuzawa's solution gains better performance owing to its more efficient memory usage, though both still suffer from the heavy synchronization, pipeline cluttering and data overload problems. In contrast, the GPCF-LWT with the advanced optimizations gains up to 12 times in processing speed over the other two multi-GPU approaches.
In summary, the multi-GPU based GPCF-LWT has gained a significant improvement in computational performance for the 2D LWT. It can process a very large dataset (e.g., 12,288 × 12,288) in less than one second, so the performance requirements of online (or near real-time), large-scale, data-intensive surface texture filtration applications can be satisfied gracefully.
Fig. 28 The performance comparisons of the three multi-GPU implementations (ms) on the ceramic femoral head and milled metallic workpiece datasets.
7 Conclusions and future work
To fulfil the online and big-data requirements of micro- and nano-surface metrology, the parallel computational power of modern GPUs has been explored. This research devised a generic parallel computational framework for lifting wavelet transform (GPCF-LWT) acceleration based on a heterogeneous multi-GPU system. One of the outcomes of this innovative infrastructure is a new parallel computational model named LBB, which alleviates the critical problem of shared memory shortage on a single GPU card while preserving the generality of the LWT within the CUDA memory hierarchy. To increase the parallelizability of the LWT, an improved lifting scheme has also been developed. Three optimization strategies for the single-GPU configuration were validated and then integrated. To further improve the computational performance of the GPCF-LWT, it has been extended to compute on a heterogeneous multi-GPU system with a fuzzy-theory based load balancing model. This study further explored two improvements to the previously devised FNN DLB model to increase its efficiency and prediction accuracy. Experiments show that the proposed GPCF-LWT achieves superior and substantial computational performance gains compared with other state-of-the-art techniques: over 20 times speedup has been observed over the pure CPU solutions, and up to 12 times over other benchmark multi-GPU solutions on large-scale surface measurement datasets. The innovative framework and its corresponding techniques address the key challenges of 2D LWT applications that are vital for big surface dataset processing.
One avenue opened up during the study for future exploration is to introduce deep learning across the GPU and CPU boundary. Especially when facing Cell CPUs or CPU clusters, a hybrid and efficient data distribution scheme, as well as an online surface texture analysis platform with smart recommendation functions for filter selection, will be of great value for practitioners.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the NSFC (61203172), the Sichuan Science and Technology Programs (2018JY0146 and 2019YFH0187), and grant "598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP" from the European Commission.
References
[1] X.J. Jiang and D.J. Whitehouse, Technological shifts in surface metrology, CIRP Ann. 61 (2), 2012, 815–836, https://doi.org/10.1016/j.cirp.2012.05.009.
[2] D.A. Axinte, N. Gindy, et al., Process monitoring to assist the workpiece surface quality in machining, Int. J. Mach. Tools Manuf. 44 (10), 2004, 1091–1108, https://doi.org/10.1016/j.ijmachtools.2004.02.020.
[3] X. Jiang, P.J. Scott, et al., Paradigm shifts in surface metrology. Part I. Historical philosophy, Proc. R. Soc. A Math. Phys. Eng. Sci. 463 (2085), 2007, 2049–2070, https://doi.org/10.1098/rspa.2007.1874.
[4] W.J. van der Laan, A.C. Jalba, et al., Accelerating wavelet lifting on graphics hardware using CUDA, IEEE Trans. Parallel Distrib. Syst. 22 (1), 2011, 132–146, https://doi.org/10.1109/TPDS.2010.143.
[5] J.A. Castelar, C.A. Angulo, et al., Parallel decompression of seismic data on GPU using a lifting wavelet algorithm, Signal Process. Images Comput. Vis., 2015, IEEE.
[6] S. Lou, W.H.H. Zeng, et al., Robust filtration techniques in geometrical metrology and their comparison, Int. J. Autom. Comput. 10 (1), 2013, 1–8, https://doi.org/10.1007/s11633-013-0690-4.
[7] H.S. Abdul-Rahman, S. Lou, et al., Freeform texture representation and characterisation based on triangular mesh projection techniques, Measurement 92, 2016, 172–182, https://doi.org/10.1016/j.measurement.2016.06.005.
[8] ISO 16610-61:2015, Geometrical product specification (GPS) – filtration – part 61: linear areal filters – Gaussian filters, an ISO/TC 213 dimensional and geometrical product specifications and verification standard, 2015, 1–18, https://www.iso.org/standard/60813.html.
[9] ASME B46.1-2009, Surface texture: surface roughness, waviness, and lay, an American national standard, 2009, 1–120, https://www.asme.org/products/codes-standards/b461-2009-surface-texture-surface-roughness.
[10] P. Nejedly, F. Plesinger, et al., CudaFilters: a SignalPlant library for GPU-accelerated FFT and FIR filtering, Softw. Pract. Exp. 48 (1), 2018, 3–9, https://doi.org/10.1002/spe.2507.
[11] X. Jiang, P.J. Scott, et al., Paradigm shifts in surface metrology. Part II. The current shift, Proc. R. Soc. A Math. Phys. Eng. Sci. 463 (2085), 2007, 2071–2099, https://doi.org/10.1098/rspa.2007.1873.
[12] W. Sweldens, The lifting scheme: a construction of second generation wavelets, SIAM J. Math. Anal. 29 (2), 1998, 511–546.
[13] S. Xu, W. Xue, et al., Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform, J. Supercomput. 63 (3), 2013, 710–721, https://doi.org/10.1007/s11227-011-0626-0.
[14] H. Wang, H. Peng, et al., A survey of GPU-based acceleration techniques in MRI reconstructions, Quant. Imaging Med. Surg. 8 (2), 2018, 196–208, https://doi.org/10.21037/qims.2018.03.07.
[15] S. Mittal and J.S. Vetter, A survey of CPU-GPU heterogeneous computing techniques, ACM Comput. Surv. 47 (4), 2015, 1–35, https://doi.org/10.1145/2788396.
[16] M. McCool, Signal processing and general-purpose computing and GPUs, IEEE Signal Process. Mag. 24 (3), 2007, 109–114, https://doi.org/10.1109/MSP.2007.361608.
[17] Y. Su, Z. Xu, et al., GPGPU-based Gaussian filtering for surface metrological data processing, In: 2008 12th Int. Conf. Inf. Vis., 2008, IEEE, 94–99, https://doi.org/10.1109/IV.2008.14.
[18] NVIDIA, CUDA C Programming Guide v9.0, http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[19] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, 2012, Newnes.
[20] NVIDIA, CUDA C Best Practices Guide v9.0, http://docs.nvidia.com/cuda/cuda-c-best-practices-guide.
[21] D.B. Kirk and W.H. Wen-Mei, Programming Massively Parallel Processors: A Hands-on Approach, 2012, Morgan Kaufmann.
[22] R. Couturier, Designing Scientific Applications on GPUs, 2013, CRC Press.
[23] D. Foley and J. Danskin, Ultra-performance Pascal GPU and NVLink interconnect, IEEE Micro 37 (2), 2017, 7–17, https://doi.org/10.1109/MM.2017.37.
[24] M. Hopf and T. Ertl, Hardware accelerated wavelet transformations, Data Vis., 2000, Springer, 93–103, https://doi.org/10.1007/978-3-7091-6783-0_10.
[25] T.-T. Wong, C.-S. Leung, et al., Discrete wavelet transform on consumer-level graphics hardware, IEEE Trans. Multimed. 9 (3), 2007, 668–673, https://doi.org/10.1109/TMM.2006.887994.
[26] J. Franco, G. Bernabé, et al., The 2D wavelet transform on emerging architectures: GPUs and multicores, J. Real-Time Image Process. 7 (3), 2012, 145–152, https://doi.org/10.1007/s11554-011-0224-7.
[27] C. Song, Y. Li, et al., A GPU-accelerated wavelet decompression system with SPIHT and Reed-Solomon decoding for satellite images, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 4 (3), 2011, 683–690, https://doi.org/10.1109/JSTARS.2011.2159962.
[28] L. Zhu, Y. Zhou, et al., Parallel multi-level 2D-DWT on CUDA GPUs and its application in ring artifact removal, Concurr. Comput. Pract. Exp. 27 (17), 2015, 5188–5202, https://doi.org/10.1002/cpe.3559.
[29] L. Blunt and X. Jiang, Advanced Techniques for Assessment Surface Topography: Development of a Basis for 3D Surface Texture Standards "Surfstand", 2003, Elsevier.
[30] J. Wang, W. Lu, et al., High-speed parallel wavelet algorithm based on CUDA and its application in three-dimensional surface texture analysis, In: Int. Conf. Electr. Inf. Control Eng., 2011, IEEE, 2249–2252, https://doi.org/10.1109/ICEICE.2011.5778225.
[31] Y. Chen, X. Cui, et al., Large-scale FFT on GPU clusters, In: Proc. 24th ACM Int. Conf. Supercomput., New York, USA, 2010, ACM Press, https://doi.org/10.1145/1810085.1810128.
[32] A. Nukada, Y. Maruyama, et al., High performance 3-D FFT using multiple CUDA GPUs, In: Proc. 5th Annu. Workshop Gen. Purp. Process. with Graph. Process. Units (GPGPU), New York, USA, 2012, ACM Press, 57–63, https://doi.org/10.1145/2159430.2159437.
[33] S. Schaetz and M. Uecker, A multi-GPU programming library for real-time applications, In: Int. Conf. Algorithms Archit. Parallel Process., 2012, Springer, 114–128.
[34] J.A. Stuart and J.D. Owens, Multi-GPU MapReduce on GPU clusters, In: 2011 IEEE Int. Parallel Distrib. Process. Symp., 2011, IEEE, 1068–1079, https://doi.org/10.1109/IPDPS.2011.102.
[35] M. Grossman, M. Breternitz, et al., HadoopCL: MapReduce on distributed heterogeneous platforms through seamless integration of Hadoop and OpenCL, In: 2013 IEEE Int. Symp. Parallel Distrib. Process., 2013, IEEE, 1918–1927, https://doi.org/10.1109/IPDPSW.2013.246.
[36] M. Abadi, A. Agarwal, et al., TensorFlow: large-scale machine learning on heterogeneous distributed systems, white paper, 2016, 1–19.
[37] S. Chintala, An overview of deep learning frameworks and an introduction to PyTorch, 2017, https://smartech.gatech.edu/handle/1853/58786 (accessed October 5, 2017).
[38] C. Zhang, Y. Xu, et al., A fuzzy neural network based dynamic data allocation model on heterogeneous multi-GPUs for large-scale computations, Int. J. Autom. Comput. 15 (2), 2018, 181–193, https://doi.org/10.1007/s11633-018-1120-4.
[39] K.K. Shukla and A.K. Tiwari, Filter banks and DWT, SpringerBriefs Comput. Sci., 2013, Springer, 21–36, https://doi.org/10.1007/978-1-4471-4941-5_2.
[40] X.Q. Jiang, L. Blunt, et al., Development of a lifting wavelet representation for surface characterization, Proc. R. Soc. A Math. Phys. Eng. Sci. 456 (2001), 2000, 2283–2313, https://doi.org/10.1098/rspa.2000.0613.
[41] M.E. Angelopoulou, K. Masselos, et al., Implementation and comparison of the 5/3 lifting 2D discrete wavelet transform computation schedules on FPGAs, J. Signal Process. Syst. 51 (1), 2008, 3–21, https://doi.org/10.1007/s11265-007-0139-5.
[42] C. Tenllado, J. Setoain, et al., Parallel implementation of the 2D discrete wavelet transform on graphics processing units: filter bank versus lifting, IEEE Trans. Parallel Distrib. Syst. 19 (3), 2008, 299–310, https://doi.org/10.1109/TPDS.2007.70716.
[43] N. Wilt, The CUDA Handbook: A Comprehensive Guide to GPU Programming, 2013, Addison-Wesley.
[44] Y. LeCun, Y. Bengio, et al., Deep learning, Nature 521 (7553), 2015, 436–444, https://doi.org/10.1038/nature14539.
[45] T. Ikuzawa, F. Ino, et al., Reducing memory usage by the lifting-based discrete wavelet transform with a unified buffer on a GPU, J. Parallel Distrib. Comput. 93-94, 2016, 44–55, https://doi.org/10.1016/j.jpdc.2016.03.010.
[46] NVIDIA, NVIDIA GeForce GTX 750 Ti, white paper, NVIDIA Corporation, 2014.
Highlights
• A new parallel computation model for LWT named LBB was devised to tackle the limitation of the shared memory bottleneck on modern GPUs.
• The improved lifting scheme reduces by approximately half the synchronizations needed when computing the lifting wavelets in a single CUDA thread.
• Three optimization strategies were benchmarked for the GPCF-LWT implementation.
• The multi-GPU based GPCF-LWT supports online large-scale measured data filtration through an improved FNN based dynamic load balancing (DLB) model.