A generic parallel computational framework of lifting wavelet transform for online engineering surface filtration

Yuanping Xu a, Chaolong Zhang a,b,⁎, Zhijie Xu b, Jiliu Zhou c, Kaiwei Wang d, Jian Huang a

a School of Software Engineering, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China
b School of Computing & Engineering, University of Huddersfield, Queensgate HD1 3DH, Huddersfield, UK
c School of Computer Science, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China
d School of Optical Science and Engineering, Zhejiang University, Hangzhou 310027, People's Republic of China
⁎ Corresponding author at: School of Software Engineering, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, Chengdu 610225, People's Republic of China.

Abstract

Nowadays, complex geometrical surface texture measurement and evaluation require advanced filtration techniques. The discrete wavelet transform (DWT), especially the second-generation wavelet (lifting wavelet transform, LWT), is the most widely adopted owing to its unified and rich capabilities in measured data processing, geometrical feature extraction, manufacturing process planning, and production monitoring. However, when dealing with varied complex functional surfaces, the computational payload of performing DWT in real time often becomes a core bottleneck in the context of massive measured data and limited computational capacity. The problem is even more prominent for areal surface texture filtration using 2D DWT. To address this issue, this paper presents a generic parallel computational framework for the lifting wavelet transform (GPCF-LWT) based on Graphics Processing Unit (GPU) clusters and the Compute Unified Device Architecture (CUDA). Owing to its cost-effective hardware design and powerful parallel computing capacity, the proposed framework can support online (or near real-time) engineering surface filtration for micro- and nano-scale surface metrology through a novel parallel method named the LBB model, improved lifting scheme algorithms, and three implementation optimizations on heterogeneous multi-GPU systems. The innovative approach enables optimizations on individual GPU nodes through an overarching framework that is capable of data-oriented dynamic load balancing (DLB) driven by a fuzzy neural network (FNN). The paper concludes with a case study on filtering and extracting manufactured surface topographical characteristics from real surfaces. The experimental results demonstrate substantial improvements of the GPCF-LWT implementation in terms of computational efficiency, operational robustness, and task generalization.

Keywords: Lifting wavelet; LBB; Surface texture measurement; Multi-GPU; CUDA; FNN

1 Introduction

Engineering workpieces are designed, manufactured and measured to meet tolerance specifications of various geometrical characteristics (e.g., surface texture, size, angle, radius). After a workpiece has been manufactured, it is necessary to verify the derived characterization (tolerance) parameters to ensure it performs its designed functions. Though the designed functions of a workpiece are the comprehensive effect of several characterization parameters, surface texture is often the dominant one. Roughness, waviness and form are the three main characteristics of functional surfaces (e.g., roughness may affect the lifetime, efficiency and fuel consumption of a part) [1,2]. Obtaining the roughness, waviness and form frequency bands relevant to a surface is normally accomplished by using filtration techniques to extract a surface texture signal. In principle, filtration must be carried out before
suitable numerical surface texture parameters can be identified. This means that geometrical characteristic information relating closely to functional surfaces can be extracted precisely from measured datasets by using appropriate filters, which ensures the potential of establishing functional correlations between tolerance specification and verification [1–3].
The rapid progress of filtration techniques has presented metrologists with a set of advanced tools, e.g. Gaussian filters, morphological filters and wavelet filters [4–7]. Gaussian filtering is commonly used in surface texture measurements and has been recommended by the ISO 16610 and ASME B46 standards for establishing a reference surface [8–10]. However, using a Gaussian filter to extract surface topographical characteristics rests on the presupposition that a raw surface texture signal is a combination of a series of harmonic components [10]. Unfortunately, a real engineering surface texture measurement contains not only regular signals but also nonperiodic and random signals, e.g. deep valleys, high peaks and scratches generated during various manufacturing processes, so the extraction, filtration and analysis of these signals are indispensable for monitoring how process information (e.g. surface texture parameters) relates to the variability of manufacturing processes [2]. Thus, straightforward applications of Gaussian filters struggle to satisfy the challenging requirements of surface texture measurement and analysis. In contrast, the discrete wavelet transform (DWT) has shown great potential in handling these complex issues efficiently. DWT offers a great diversity of mother wavelets (e.g. Haar, Daubechies, CDF biorthogonal, Coiflet) that can be applied according to different application purposes. Moreover, DWT has advantages in multi-resolution analysis and lower computational complexity.
Recently, surface texture measurement has entered the era of miniaturization, e.g. various explorations in micro- or nano-machining and micro-electromechanical systems (MEMS) manufacturing [1]. Hence the increasing demands for efficient handling of big (measurement) data, processing accuracy, and real-time performance have become paramount. As the status quo, the computational performance of DWT is a core bottleneck for handling high throughput in an online (or near real-time) manner [11], especially for two-dimensional DWT (2D DWT) in areal surface texture measurement [12].
To overcome this bottleneck, pure algorithmic acceleration solutions have been intensively investigated in the last decade. Sweldens proposed a new DWT model, known as the lifting wavelet transform (LWT) or second-generation wavelet, that improved computational performance and reduced memory consumption compared to the first-generation wavelet (the Mallat algorithm) [13]. Although promising, the lifting scheme still delivers limited performance in accelerating DWT computations for online applications facing large-scale data. Consequently, hardware acceleration has become a research focus. Graphics Processing Units (GPUs) enable such solutions and have become the standard accessory for computation-intensive applications. Unfortunately, the current lifting scheme on a single GPU struggles to meet the demands of online processing, especially when dealing with the large-scale and dynamic datasets that define the high precision requirements of surface texture metrology.
To tackle these challenges, this paper presents a generic parallel computational framework for LWT (GPCF-LWT) based on off-the-shelf consumer-grade CUDA multi-GPU systems. The proposed GPCF-LWT can accelerate computation-intensive wavelets so as to support online engineering surface filtration for high precision surface metrology, through an innovative computational model (named the LBB model) that improves the usage of GPU shared memory and an improved lifting scheme that reduces the inherent synchronizations of the classic lifting scheme. GPCF-LWT also contains three implementation optimizations based on the hardware advancements of individual GPU nodes and a novel dynamic load balancing (DLB) model using an improved fuzzy neural network (FNN) for heterogeneous multi-GPUs. Therefore, it supports intelligent wavelet transform selection in accordance with different surface characterization requirements and measurement conditions.
The rest of this paper is organized as follows. Section 2 gives a brief review of previous related works, which further clarifies the contributions of this research. An in-depth analysis of the feasibility of algorithm improvements for adapting parallel computation is discussed in Section 3, following a brief introduction to the lifting scheme based second-generation wavelet transform. Section 4 presents the GPCF-LWT framework and its three key components. The LBB model for GPCF-LWT is explored based on a detailed evaluation of the three classic parallel computation models for lifting scheme implementations, to solve the problem of GPU shared memory shortage during LWT computation. To increase the parallelizability of LWT, an improved lifting scheme that reduces the inherent synchronizations is defined, and to further accelerate LWT computation, three complementary implementation optimizations in accordance with the latest hardware advancements of GPU cards are discussed. Moreover, to maximize the parallelization of GPCF-LWT on a heterogeneous multi-GPU platform, Section 5 introduces a novel dynamic load balancing (DLB) model, in which a more effective FNN is designed based on the authors' previous work. A case study is demonstrated and evaluated in Section 6 to test and verify the research ideas. Finally, Section 7 concludes this paper and summarizes future work.
2 Previous related works

Modern GPUs are not only powerful graphics engines but also highly parallel, programmable arithmetic processors. A GPU is a SIMD (Single Instruction Multiple Data) parallel device containing several streaming multiprocessors (SMs), and each SM has multiple GPU cores (also named CUDA cores), i.e., stream processors (SPs), which are the computational cores for processing multiple threads simultaneously (each GPU core manages a thread). Besides visualization and rendering, more general-purpose computation on GPUs (so-called GPGPU) has also prospered since 2002, when consumer-grade graphics cards became truly "programmable", ranging from numeric computational operations [14–16] to computer aided design, tolerancing and measurement. More specifically, GPGPU has been applied to numerous compute-intensive problems that cannot be solved using the CPU-based serial processing mode. For example, McCool introduced traditional GPU hardware and GPU programming models for signal processing and summarized some classic GPU-based signal processing applications [17]. Su et al. proposed a GPGPU framework to accelerate 2D Gaussian filtering, achieving about 4.8 times speedup without reducing the filtering quality [18]. However, traditional GPGPU programming had to be coded with graphics programming interfaces such as OpenGL and DirectX, so heavy manual work was involved in the whole procedure of developing GPGPU programs, e.g., texture mappings, manual and static threading, memory management, and shading program execution. Thus, this obsolete GPGPU programming procedure was very complicated, and it also limited the parallel computing ability of GPU cards.
To alleviate these application difficulties, in 2007 NVIDIA introduced CUDA, a programming framework for general-purpose computation that provides a scalable and integrated programming model for allocating and organizing processing threads and mapping them onto SMs and CUDA cores, with dynamic adaptation to all mainstream GPU architectures [19]. In addition, CUDA maintains a sophisticated memory hierarchy (e.g. registers, local memory, shared memory, global memory and constant memory) in which different memories can be applied according to particular applications and optimization circumstances, and CUDA GPU cards have been optimized with higher memory bandwidth and fast on-chip performance [19–21]. Better yet, it embeds a series of APIs for programming directly on the GPU instead of manually transferring application code to various obscure graphics APIs and shading languages. In CUDA programs, any function running on a GPU card is called a "kernel", and launching a kernel generates multiple threads running on the GPU [19]. CUDA threads are organized into a hierarchical structure, i.e., multiple threads compose a block, and multiple blocks compose a grid such as a 2D matrix (see Fig. 11) [22]. When a kernel runs on a GPU, SMs are in charge of creating, managing, scheduling and executing CUDA threads in groups (called warps), and each group handles 32 parallel threads. However, the memory capacity and parallel computing ability of a single GPU are still too limited to integrally process a dataset tens of gigabytes (GB) in size, which is the normal size required for current large-scale scientific applications [23]. For example, a measured dataset extracted from the micro-scale surface area of a ceramic femoral head articulating surface is 49,152 × 49,152 in size (i.e. 12 GB in total with single precision), whereas a single modern GPU such as the GeForce GTX 1080 selected in this study has only 8 GB of GPU memory, so this 12 GB dataset cannot be loaded and processed at one time. Although more expensive high-end GPUs such as the NVIDIA TITAN Xp (12 GB memory) and NVIDIA TITAN RTX (24 GB memory) can store larger datasets, they still struggle with real-life measurement datasets often coming in at hundreds of GB. Moreover, GPU cards have a fixed number of CUDA cores (e.g., 2560, 3840 and 4608 for the GTX 1080, TITAN Xp and TITAN RTX, respectively), and each core can execute one hardware thread, so the number of threads available to support concurrent computations is also fixed. Consequently, a single GPU must divide a dataset and process massive numbers of smaller chunks iteratively, which may greatly increase the latency of intensive computations. In conclusion, scientific computation with big data, especially for applications of online and high precision surface texture measurement, demands distributed computation on multi-GPUs. With complementary optimization strategies, multi-GPUs can seamlessly work together to achieve impressive computational performance, but the multi-GPU architecture is more complicated than a single GPU, and it brings time-consuming and error-prone processes for workload partitioning and allocation, the corresponding synchronization and communication of divided data chunks, recombination of the processed data chunks, and data transfers across PCI-E (Peripheral Component Interconnect Express) buses. It is worth mentioning that the recently introduced NVLink connections have approximately 10 times higher bandwidth than PCI-E [24]. However, NVLink supports only high-end GPUs (e.g., TITAN RTX and TESLA P100), and it is normally equipped only in supercomputers.
Historically, although plentiful and concrete research on wavelet-based filtering theories arose in recent decades, their practical implementations on computers achieved limited success due to the key challenge of the computational cost of wavelet transforms. In general, the intensive computation of DWT, through its inherent multi-level data decomposition and reconstruction, drastically reduces performance for online applications facing large data sizes. With the rapid development of GPU parallelism, several single-GPU acceleration solutions have been explored in response to this challenge. For example, Hopf et al. accelerated 2D DWT on a single GPU card using OpenGL APIs in 2002, which marked the beginning of GPU-based DWT acceleration [25]. This early practice adopted the traditional "one-step" method rather than the separable two steps of 1D DWT, so the programming model was neither flexible nor generalizable, i.e., each kind of wavelet transform required a custom implementation. In 2007, Wong et al. optimized Hopf's work using a shading language (named "JasPer"); JasPer used shaders to handle multiple data streams in pipelines and was integrated into the JPEG2000 codec [26]. Notably, JasPer decomposes 2D DWT into two steps of 1D DWT, such that it supports dozens of wavelet types and boundary extension methods in the convolution stage. It also unified the calculation methods of forward and inverse 2D DWT. In Wong's work, due to the limitations of GPU programmability and hardware structure at the time, the convolution, downsampling and upsampling operations were executed in sequence on the fragment processors of an early GPU, and each texture coordinate in texture mapping had to be pre-defined in separate CPU programs. These solutions required manual mapping operations onto GPU hardware, and task partitioning, deciding which part of the work should be conducted on a CPU and which part fits a GPU, was indispensable in these practices prior to the emergence of unified GPU languages such as CUDA.
Recently, there has been a growing interest in CUDA-based GPGPU computation to parallelize the DWT for scientific and big data applications. Franco et al. parallelized a 2D DWT on the CUDA-enabled NVIDIA Tesla C870 for processing images and achieved a satisfactory speedup of 26.81 times over an OpenMP method with the fastest CPU setting at the time [27]. Song et al. accelerated the CDF(9,7) based SPIHT (set partitioning in hierarchical trees) algorithm for decoding real-time satellite images on the NVIDIA Tesla C1060, achieving a speedup of about 83 times compared to a single-thread CPU implementation [28]. Later, Zhu et al. improved 2D DWT computation on a GPU and verified it with a case study of ring artifact removal, achieving a significant performance increase [29]. Compared with previous CUDA-enabled GPU implementations of DWT that only parallelized a very specific type of wavelet under a specific circumstance, Zhu's work supports different types of wavelets to adapt to different application conditions. In addition, the latest constant memory technique was applied in Zhu's work to optimize overall GPU memory access. However, shared memory access and the CUDA-streams-based latency hiding mechanism were not exploited for further performance optimization in this research [19]. More importantly, the first-generation wavelet theory and its corresponding algorithmic operations applied in the convolution scheme of these studies increased the computational complexity and imposed a great deal of memory and computational time consumption, e.g., researchers needed to consider the troublesome issue of boundary distortion inherent in the Fourier transform [10]. The second-generation wavelet theory adopts the lifting scheme for wavelet decomposition and reconstruction, which preserves all application characteristics of the first-generation wavelet transform while bringing higher performance and lower memory demand due to its in-place algorithm property [13]. These previous CUDA acceleration studies chose the convolution scheme rather than the lifting scheme because the execution order of the lifting scheme has heavy data dependencies, such that it cannot be fully parallelized. To explore CUDA-based LWT implementation, in 2011 Wladimir et al. accelerated the LWT algorithm with three-level decompositions and reconstructions of a 1920 × 1080 image using CUDA version 2.1 on an NVIDIA GeForce 8800 GTX 768 MB, and this practice achieved a speedup of 10 times compared with the corresponding CPU (AMD Athlon 64 X2 Dual Core Processor 5200+) implementation [4]. Later, Castelar et al. optimized the (5,3) biorthogonal wavelet transform using CUDA for the parallel decompression of seismic datasets [5]. Unfortunately, further parallel optimization of LWT was not considered in these studies.
Previous works focused on parallelizing data processing on a single GPU and witnessed moderate performance gains across the board. However, due to the limitations of data storage format and memory space, as well as the fixed number of CUDA cores available on a single GPU die, previous works still struggle to fulfill the online data processing requirements of many large-scale computational applications, which is especially problematic for processing large measured surface texture data engaging complex filtrations with various tolerance parameters [1,30]. In comparison, multi-GPU based acceleration solutions can be flexible and extensible, achieving higher performance with relatively low hardware costs. Numerous computation-intensive problems that cannot be resolved using a single GPU have been making steady progress in the context of multi-GPUs. For instance, FFT (Fast Fourier Transform) based multi-GPU solutions have become the norm in current online large-scale data applications [31,32]. In the meantime, several multi-GPU programming libraries (e.g. MGPU) [33] and multi-GPU based MapReduce libraries (e.g. GPMR and HadoopCL) [34,35] have been developed by researchers. Moreover, deep learning, very popular in the current decade, has also benefited from the fast development of multi-GPU techniques [36,37]. These pilot multi-GPU studies are based on the assumption that all GPU nodes equipped in a multi-GPU platform have equal computational capacity. In addition, the task-based load balancing schedulers that these approaches rely upon fall short in supporting applications with huge data throughput but limited processing function(s), since there are very few "tasks" to schedule. To maximize data parallelism in a heterogeneous multi-GPU system, this study has also devised an innovative data-oriented dynamic load balancing (DLB) model based on an improved FNN for rapid measured data division and allocation on heterogeneous GPU nodes [38].
In summary, combined acceleration optimizations that seamlessly merge the parallelized architecture of multi-GPUs with improvements from the aspects of LWT and software routines (e.g., the algorithmic optimization of the lifting scheme, comprehensive memory usage optimization based on the advanced hierarchical architecture of CUDA memory, and the latency hiding mechanism based on CUDA multi-streams) have been largely ignored. To address these shortcomings, this study explores and implements a generic parallel computational framework for the lifting wavelet transform (GPCF-LWT) based on a heterogeneous CUDA multi-GPU system. It further optimizes the acceleration of 2D LWT through the improved processing model of the lifting scheme (LBB and the improved lifting algorithm), three devised implementation optimization strategies, and the improved DLB model on a heterogeneous multi-GPU system. Moreover, GPCF-LWT can support all kinds of wavelet transforms, thereby supporting intelligent wavelet transform selection in accordance with different surface texture characterization requirements and measurement strategies.
3 The second-generation wavelet

The fundamental idea behind wavelet analysis is to simplify complex frequency domain analyses into simpler scalar operations. The first-generation wavelet applies the fast wavelet transform to enable decomposition (forward DWT) and reconstruction (inverse DWT) in the form of two-channel filter banks, the so-called Mallat convolution scheme [39]. Mallat wavelet decomposition and reconstruction convolve the input and output data with the corresponding decomposition and reconstruction filter banks respectively, so it has become a popular choice for many engineering applications in recent years. However, the Mallat algorithm has a certain degree of computational complexity (e.g. it has inherent boundary distortion when executing FFT). Thus, Sweldens proposed a new implementation of DWT, i.e. the lifting scheme or second-generation wavelet.

The second-generation wavelet, which uses the lifting scheme for wavelet decomposition and reconstruction, preserves all properties of the first-generation wavelet but brings higher computational performance and lower memory demand because of its in-place algorithm attribute [13]. The lifting scheme based 1D forward (i.e. decomposition) DWT (abbreviated as forward LWT or FLWT) contains four operation steps: split, predict, update and scale [4,13,40]:
1) Split: This step splits the raw signal x into two coefficient subsets, even and odd; the former corresponds to the even index values while the latter holds the odd index values. The split method is expressed in Eq. (1), and it is also called the "lazy wavelet":

even_j = x_{2j},  odd_j = x_{2j+1}    (1)

2) Predict: Given the even and odd coefficients, the odd coefficients can be predicted from the even ones by using the prediction operator PO, and the old odd values are then replaced by the prediction residual as the next new odd coefficients, recursively; this step can be expressed as Eq. (2). The odd is the same as the detail coefficients D in the Mallat algorithm.

odd_j ← odd_j − PO(even)_j    (2)

3) Update: Similarly, the even coefficients can be updated by using the update operator UO, and the old even values are then replaced by the updated result as the next new even coefficients, recursively; this step can be expressed as Eq. (3). The even is the same as A in the Mallat algorithm.

even_j ← even_j + UO(odd)_j    (3)

4) Scale: Normalize the even and odd coefficients with the factor K respectively by using Eq. (4) to get the final approximation coefficients (even, App) and the detail coefficients (odd, Det).

App_j = K · even_j,  Det_j = odd_j / K    (4)
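The four steps can be made concrete with a minimal sketch, assuming the CDF(5,3) wavelet, whose prediction and update operators are two-tap neighbour averages, and periodic boundary handling; both choices are illustrative rather than the paper's implementation:

```python
def flwt_53(x):
    """Single-level 1D forward LWT via the four lifting steps, using the
    CDF(5,3) wavelet (K = 1) and periodic boundaries for illustration."""
    even = [float(v) for v in x[0::2]]  # 1) split: the "lazy wavelet"
    odd = [float(v) for v in x[1::2]]
    n = len(even)
    # 2) predict: odd <- odd - PO(even); odd becomes the detail coefficients
    odd = [odd[j] - 0.5 * (even[j] + even[(j + 1) % n]) for j in range(n)]
    # 3) update: even <- even + UO(odd); even becomes the approximation
    even = [even[j] + 0.25 * (odd[j] + odd[(j - 1) % n]) for j in range(n)]
    # 4) scale: App = K*even, Det = odd/K is a no-op here since K = 1
    return even, odd

app, det = flwt_53(range(8))
```

On the linear ramp 0…7, all interior details vanish because the two-tap predictor is exact for straight lines; only the periodic wrap-around at the boundary leaves a nonzero detail.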
All operators are completed in the in-place computational mode based on these four steps. Fig. 1 illustrates the split operator in in-place mode; in this case, the raw data is split into even and odd subsets which are organized and stored in the first half and the second half of the memory space respectively, the same space used for raw data storage. Similarly, the prediction, update and scale results are stored in their original memory spaces rather than in newly allocated memory, such that the lifting scheme requires less memory than the classic Mallat algorithm. Furthermore, the 1D inverse lifting scheme (i.e. reconstruction) DWT (abbreviated as inverse LWT or ILWT) simply inverts the computational sequence of the FLWT operation steps and switches the corresponding plus and minus operators:
Scale:

even_j = App_j / K,  odd_j = K · Det_j    (5)

Update:

even_j ← even_j − UO(odd)_j    (6)

Predict:

odd_j ← odd_j + PO(even)_j    (7)

Merge: This step is also called the "inverse lazy wavelet", and it can be expressed by:

x_{2j} = even_j,  x_{2j+1} = odd_j    (8)
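A sketch of the inverse pass under the same illustrative CDF(5,3), periodic-boundary assumptions as before, confirming that reversing the step order and switching the signs gives perfect reconstruction:

```python
def forward(x):
    """Forward pass: split, predict, update (CDF(5,3), K = 1, periodic)."""
    even = [float(v) for v in x[0::2]]
    odd = [float(v) for v in x[1::2]]
    n = len(even)
    odd = [odd[j] - 0.5 * (even[j] + even[(j + 1) % n]) for j in range(n)]   # predict
    even = [even[j] + 0.25 * (odd[j] + odd[(j - 1) % n]) for j in range(n)]  # update
    return even, odd

def inverse(even, odd):
    """Inverse pass: same steps in reverse order with switched signs."""
    n = len(even)
    even = [even[j] - 0.25 * (odd[j] + odd[(j - 1) % n]) for j in range(n)]  # undo update
    odd = [odd[j] + 0.5 * (even[j] + even[(j + 1) % n]) for j in range(n)]   # undo predict
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = even, odd                                             # merge: inverse lazy wavelet
    return x

x = [3, 1, 4, 1, 5, 9, 2, 6]
roundtrip = inverse(*forward(x))
```

Because the lifting coefficients here are dyadic (0.5, 0.25), the round trip is bit-exact, illustrating the perfect-reconstruction property of the lifting scheme.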
The lifting scheme can dynamically update its configuration parameters (i.e. the iteration number and the wavelet filter coefficients) in each computational step of DWT based on factorizations of the polyphase matrices of the wavelet filter coefficients, such that the generalization of this framework can be guaranteed, realizing parallel computation for all kinds of wavelet transforms dynamically. Generally, the factorization of the polyphase matrix P(z) for all wavelets can be described as follows (matrices written row by row, rows separated by semicolons):

P(z) = ∏_{i=1}^{m} [1, u_i(z); 0, 1] · [1, 0; p_i(z), 1] · [K, 0; 0, 1/K]    (9)

where the coefficients of u_i(z) are the update coefficients in the update step, the coefficients of p_i(z) are the prediction coefficients in the prediction step in Figs. 2 and 3, and K is the scale factor. In conclusion, the computational results
Fig. 1 The computation of the split operation in in-place mode.
of both forward and inverse LWT for an arbitrary wavelet can be obtained by applying several steps of prediction and update operations and imposing the final normalization with K on the outputs of the factorization of P(z). Figs. 2 and 3 illustrate the main computational processes of single-level 1D forward and inverse LWT respectively.
4 CUDA based GPCF-LWT implementation

Based on the fundamental LWT concepts summarized in Section 3, this study proposes GPCF-LWT using CUDA-enabled GPUs for online and high precision surface measurement applications. This study assumes that factorizations of the polyphase matrices of the wavelet filter coefficients have been prepared before applying GPCF-LWT, so that we have a series of update coefficients U = {u0, u1, …, un}, a series of prediction coefficients P = {p0, p1, …, pn}, and the scale factor K.
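Under this assumption, a generic lifting driver can be parameterized by P, U and K alone. The sketch below is illustrative: it simplifies each p_i(z) and u_i(z) to a symmetric two-tap operator and assumes periodic boundaries, whereas real factorizations yield general Laurent polynomials:

```python
def flwt_generic(x, P, U, K):
    """Generic single-level forward LWT driven by prediction coefficients P,
    update coefficients U and scale factor K (illustrative simplification:
    each lifting pair is a symmetric two-tap operator, periodic boundaries)."""
    n = len(x) // 2
    s = [float(v) for v in x[0::2]]   # even samples
    d = [float(v) for v in x[1::2]]   # odd samples
    for p, u in zip(P, U):
        d = [d[j] + p * (s[j] + s[(j + 1) % n]) for j in range(n)]  # predict pass i
        s = [s[j] + u * (d[j] + d[(j - 1) % n]) for j in range(n)]  # update pass i
    return [v * K for v in s], [v / K for v in d]  # App = K*even, Det = odd/K

# CDF(5,3)-style parameters: P = {-1/2}, U = {1/4}, K = 1 (illustrative values)
app, det = flwt_generic(range(8), [-0.5], [0.25], 1.0)
```

Swapping in a different (P, U, K) triple drives a different wavelet through the same kernel, which is the mechanism by which GPCF-LWT stays generic.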
Traditional parallel computational platforms (e.g. OpenMP, MPI and FPGA) have empowered three classic LWT parallel computational models, namely the RC (Row-Column), LB (Line-Based) and BB (Block-Based) models [41]. Generally speaking, the LB model has the best instruction throughput, followed by the BB and RC models [4]. Unfortunately, except for the shortest wavelets (e.g., the Haar wavelet), both the LB and BB models fail to transplant onto a single CUDA GPU due to the space limitation of current shared memory, and the generalization of LWT (i.e., supporting all kinds of wavelet transforms) also cannot be satisfied. Although the RC model is the only feasible CUDA GPU transplantation that supports the generalization of LWT, the processing time of the CUDA-based RC model is approximately 4 times slower than the Mallat algorithm solution running on the same GPU card [42]. Thus, based on an in-depth study of these three models, this study devised a generic LWT parallel computational model for the CUDA-enabled GPU architecture.
4.1 The improved parallel computational model

• The LB model. On traditional parallel computational platforms (e.g., OpenMP, MPI and FPGA), the overall framework of LB based LWT is depicted in Fig. 4. The LB model divides a raw dataset (e.g., 1024 × 1024 in size) into N (N equals the number of computational nodes) smaller subsets (e.g., if N is 4, each subset has 256 rows and 1024 columns), and each subset is allocated to a computational node where LB performs the following operations:

1) Launching m threads, each performing horizontal 1D LWT on one row of the target subset concurrently (i.e., each thread processes a row of the target subset), after which LB caches the temporary results in local memory (e.g., RAM, random access memory);

2) Once the m rows are completely processed, the m threads perform vertical 1D LWT on m columns of the temporary results concurrently and repeatedly until all columns of the cached temporary results have been processed. In this step, the number of iterations is (width + m − 1)/m, and each iteration processes m columns.

3) Steps 1 and 2 form an integrated iteration that is repeated until all rows and columns of the target subset are processed.
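The LB schedule above can be sketched serially in Python; a Haar-style 1D pass stands in for an arbitrary wavelet, and the function names are illustrative:

```python
def lwt_1d(v):
    """Stand-in single-level 1D lifting pass (Haar-style; scale omitted).
    Returns the approximation half followed by the detail half."""
    even, odd = list(v[0::2]), list(v[1::2])
    odd = [o - e for e, o in zip(even, odd)]         # predict
    even = [e + 0.5 * o for e, o in zip(even, odd)]  # update
    return even + odd

def lb_node(subset, m=4):
    """LB schedule on one node: a horizontal pass per row (step 1), then
    vertical passes over the cached columns in batches of m (step 2)."""
    cached = [lwt_1d(row) for row in subset]         # step 1: one 'thread' per row
    width = len(cached[0])
    iterations = (width + m - 1) // m                # step 2 iteration count
    out = [row[:] for row in cached]
    for it in range(iterations):
        for c in range(it * m, min((it + 1) * m, width)):
            col = lwt_1d([out[r][c] for r in range(len(out))])
            for r in range(len(out)):
                out[r][c] = col[r]
    return out
```

Because the column batches are disjoint, the batch size m changes only the scheduling order, not the result, which is why LB can pick m to fit the wavelet's memory footprint.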
Fig. 2 The main computational procedure of single-level 1D forward LWT.

Fig. 3 The main computational procedure of single-level 1D inverse LWT.
The last step of LB based LWT is that one node (usually called a master node) merges the results of the N subsets and obtains the final result. The minimum value of m is determined by the particular type of wavelet, e.g., 3 for Haar and 6 for Deslauriers-Dubuc (13,7). Generally, m can be unified as a fixed value (e.g., 10 or 20) larger than all wavelet types need. Taking m = 10 for a float dataset having 1024 columns as an example, each computational node requires 40 KB of local memory space, so the LB model can easily be implemented on OpenMP, MPI and FPGA since they have large RAM spaces. However, a CUDA block has only 16 KB of local (shared) memory space, which cannot satisfy the memory demand of the LB model except for the shortest wavelets (e.g., the Haar wavelet requires only 12 KB of shared memory space). In addition, columns of a dataset cannot be divided by LB, so it is a coarse-grained division approach, which cannot operate well with the SIMD model of CUDA.
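The memory arithmetic in this paragraph is easy to check, at 4 bytes per single-precision sample:

```python
def lb_shared_bytes(m_rows, width, bytes_per_sample=4):
    """Local (shared) memory needed to cache m rows of single-precision
    samples during the LB vertical pass."""
    return m_rows * width * bytes_per_sample

generic = lb_shared_bytes(10, 1024)  # unified m = 10 over a 1024-column row
haar = lb_shared_bytes(3, 1024)      # Haar minimum m = 3
```

The generic case (40 KB) overflows a 16 KB CUDA shared memory block, while only the Haar case (12 KB) fits, matching the limitation stated above.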
• The BB model. Unlike LB, the BB model divides a raw dataset into N subsets (data blocks) along both the row and column directions, see Fig. 5. Each subset is allocated to a computational node where an RC routine performs horizontal 1D LWT on all rows concurrently at first, and then vertical 1D LWT on all columns is computed concurrently. As in the LB model, the master node merges all subset results in the last step of the BB model. The BB model is also a coarse-grained division approach, and it also requires a great deal of local memory space, e.g., if a raw float dataset of size 1024 × 1024 is divided into four subsets (each subset has 512 rows and 512 columns), each computational node requires 1024 KB of local memory space, which considerably exceeds the size of the shared memory space of a CUDA GPU, such that classic RC model based CUDA solutions must use GPU global memory space as the temporary cache, and this has become the key bottleneck for data access.
• The innovative LBB model for CUDA LWT. To solve the problem of shared memory shortage on CUDA GPUs, this study devised an innovative parallel computational model for 2D LWT, LBB, by combining both LB and BB. The overall framework of the LBB model is shown in Fig. 6. The LBB model divides a raw dataset into several subsets along the column direction, and each subset can be assigned to a CUDA thread block (e.g., 8 rows and 32 columns in size) where the in-built threads perform the following operations:

1) Each group of concurrent row threads loads a row of the dataset into the shared memory of a GPU in parallel. Accesses to row data can be coalesced since all threads in a block access contiguous relative addresses [19], so loading a dataset into shared memory is very efficient.

2) Each group of concurrent row threads in a block performs horizontal 1D LWT on a data row in parallel, and then the GPU caches the temporary results in its shared memory space.

3) m groups of row threads obtain m rows of temporary results in parallel by executing steps 1 and 2 (m is a preset value; it is 8 here).

4) Each group of concurrent column threads in a block performs vertical 1D LWT on a data column of the cached temporary results in parallel, and width/n groups of column threads process width/n columns of cached temporary results in parallel (width/n is the width of a subset).

5) The corresponding column thread groups save the results into GPU global memory space and delete the m rows of temporary results in shared memory obtained in step 3.

6) Steps 1–5 form a complete iteration that is repeated until all rows and columns of the dataset are processed.
The two differences between the data division methods of BB and LBB are: 1) in order to adapt to the advanced architecture of CUDA threads, the height of a subset in LBB is larger than in BB whereas the width in LBB is smaller than in BB, and the number of subsets in LBB is far larger than in BB, as the number of thread blocks is far greater than the number of computational nodes in traditional parallel platforms; 2) steps 1–6 of LBB also integrate the LB idea with one major update: instead of providing one thread per row, LBB constructs the thread block structure from numerous CUDA threads (each thread block organizes m groups of concurrent row threads, and each group of row threads has width/n CUDA threads, e.g. 32 CUDA threads per row), such that a thread block can perform both horizontal 1D LWT on multiple rows using groups of row threads and vertical 1D LWT on multiple columns using groups of column threads in parallel.
Although LBB divides a large dataset into several smaller subsets, the problem of shared memory shortage is still notable, e.g., 64 KB of shared memory space is required if a subset has size 128 × 128. Thus, a sliding window structure has been designed in LBB, see Fig. 6. In LBB, the size of a sliding window is equal to that of a constructed thread block (e.g. 8 × 32 for a 128 × 128 subset), i.e., a sliding window is built on a thread block, and each subset has a sliding window. Once a sliding window is full (e.g., there are 8 rows of temporary results cached in shared memory), column-based vertical 1D LWT is carried out (i.e. steps 4 and 5), and then LBB moves the sliding window down the target subset to compute and cache other data points (or pixels) of other thread blocks concurrently, moving the thread block on to process the next 8 rows (starting from step 1).
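A serial sketch of the LBB sliding-window schedule, using a Haar-style pair lifting so that the window-local vertical pass matches the full-column result; longer wavelets would need halo rows across window boundaries, which this sketch omits:

```python
def lift_pairs(vec):
    """Haar-style in-place lifting on consecutive (even, odd) pairs:
    detail stays at the odd slot, approximation at the even slot."""
    out = list(vec)
    for j in range(0, len(out) - 1, 2):
        out[j + 1] -= out[j]        # predict: detail
        out[j] += 0.5 * out[j + 1]  # update: approximation
    return out

def lbb_subset(subset, win_rows=8):
    """LBB schedule on one thread-block subset: fill an 8-row sliding window
    (the 'shared memory'), do the horizontal pass, then the vertical pass on
    the cached window, flush to 'global memory', and slide down."""
    height, width = len(subset), len(subset[0])
    out = []
    for top in range(0, height, win_rows):  # slide the window (step 6)
        shared = [lift_pairs(subset[r]) for r in range(top, top + win_rows)]  # steps 1-3
        for c in range(width):              # step 4: vertical pass in-window
            col = lift_pairs([shared[r][c] for r in range(win_rows)])
            for r in range(win_rows):
                shared[r][c] = col[r]
        out.extend(shared)                  # step 5: flush cached results
    return out
```

The in-place pair lifting mirrors the lifting scheme's in-place property; for Haar, pairs never straddle an 8-row window boundary, so each window can be processed independently.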
• Performance analysis. Take the CDF(9,7) wavelet as an example to evaluate the CGMA (Compute to Global Memory Access) ratio of LBB (i.e., an index that estimates the ratio of instruction throughput to global memory accesses for performance analysis [22]). Performing 2D LWT on a data point requires 18 computational operations, including 16 multiplications and 2 additions, together with 2 global memory accesses (one read and one write), assuming that the size of a sliding window is 8 × 32 and each thread processes one data point in parallel, such that the CGMA is:

CGMA = 18 / 2 = 9    (10)

Compared with the RC model (whose CGMA is 0.9), LBB improves the CGMA by a factor of 10.
4.2 The improved lifting scheme
In a GPU routine, the LWT requires many synchronizations, since each procedure of the lifting scheme consists of different functions computed in sequence, which decreases the parallelizability of LWT computations. To solve this problem, this study devised an improved approach that decreases the inherent synchronizations of the lifting scheme. The factorization of the polyphase matrix P(z) (see Eq. (9)) can be converted into the following form:
where n = m if u0(z) = 0 and n = m − 1 if u0(z) ≠ 0. According to Eq. (11), the optimized operation steps of the lifting-scheme-based 1D forward LWT are as follows:
Fig. 4 The overall framework of the LB model.
Fig. 5 The overall framework of the BB model.
Fig. 6 The overall framework of the LBB model.
(11)
(1) Split:
(2) Preprocessing:
(3) Lifting (predict and update): for i = 1, …, n:
(4) Scale:
Step 3 is performed iteratively. However, synchronization operations still exist in step 3, since the computation of si can only begin after di has been computed, because si relies on di. Thus, Eq. (14) should be modified as follows:
As can be seen from Eq. (16), the computation of si no longer relies on di; si and di rely only on si-1 and di-1 respectively, so the corresponding computational operations can be performed concurrently, halving the synchronizations required in the lifting scheme. The main computational procedure of the single-level 1D forward LWT, optimized from Fig. 2, is demonstrated in Fig. 7. The inverse LWT has a similar optimization approach.
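The dependency-removal idea can be illustrated with the Haar wavelet in a minimal Python sketch (the Haar lifting convention d = odd − even, s = even + d/2 is an assumption used for illustration): substituting the predict result into the update step makes both outputs depend only on the raw inputs, so on a GPU they could be computed concurrently.

```python
def haar_lift_serial(even, odd):
    d = [o - e for e, o in zip(even, odd)]      # predict: must finish first
    s = [e + di / 2 for e, di in zip(even, d)]  # update: reads d (serial dependency)
    return s, d

def haar_lift_parallel(even, odd):
    # Substituting d into the update removes the dependency: both outputs
    # now read only the raw inputs, so the two passes could run concurrently.
    d = [o - e for e, o in zip(even, odd)]
    s = [(e + o) / 2 for e, o in zip(even, odd)]
    return s, d

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
even, odd = x[0::2], x[1::2]
assert haar_lift_serial(even, odd) == haar_lift_parallel(even, odd)
```

The rewritten form produces identical coefficients; only the data dependency between the predict and update passes is removed.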
4.3 The construction of the single GPU-based GPCF-LWT
Before realizing parallel computations of the single-level 1D LWT with GPCF-LWT, a series of prediction coefficients P = {p0, p1, …, pn} and update coefficients U = {u0, u1, …, un} must be generated according to the given wavelet name, in order to ensure the generality of GPCF-LWT. P and U are 2D arrays with the same length n, and P[i] and U[i] hold the prediction and update coefficients for the ith prediction and update operations respectively. Within GPCF-LWT, P and U serve as input parameters reflecting different wavelet types, such that GPCF-LWT can support the whole family of LWTs. Based on LBB and the improved lifting scheme, Fig. 8 illustrates the main computational procedure of the multi-level 2D forward LWT proposed in this study. It first generates the series of P (a 2D array) and U (a 2D array) according to the given wavelet name, and then copies these prediction and update coefficients, together with the raw 2D signal data, from a CPU memory space to a GPU global memory space; a specifically designed CUDA stream overlapping strategy for increasing the concurrency of this step
(12)
(13)
(14)
(15)
(16)
Fig. 7 The improved lifting scheme for the single-level 1D forward LWT.
is presented in Section 4.4. In Fig. 8, the output results cA, cH, cV, and cD correspond to the approximation coefficients and the detail coefficients along the horizontal, vertical, and diagonal orientations respectively. The multi-level 2D LWT applies the approximation coefficients cA as the 2D raw input signal for the next computational level, and then a new 2D forward LWT iteration with the same computational steps and parameters as the previous level is started on this level. In this way, the procedure is executed recursively until the expected level is reached, and the final results are the cA of the last level together with all detail coefficients cH, cV, and cD collected from each computational level. The computational framework of the LBB-based inverse LWT can be constructed in a similar way to Fig. 8.
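The recursive feeding of cA described above can be sketched in plain Python (Haar lifting is used as a stand-in wavelet, and subband naming conventions vary; this is an illustration, not the paper's CUDA implementation):

```python
def lift1d(x):
    # single-level 1D Haar lifting: approximation s and detail d
    even, odd = x[0::2], x[1::2]
    d = [o - e for e, o in zip(even, odd)]
    s = [(e + o) / 2 for e, o in zip(even, odd)]
    return s, d

def lwt2d(a):
    # separable 2D transform: rows first, then columns
    rows = [lift1d(r) for r in a]
    lo = [s for s, _ in rows]
    hi = [d for _, d in rows]
    def cols(m):
        t = [list(c) for c in zip(*m)]           # transpose to iterate columns
        out = [lift1d(c) for c in t]
        A = [list(r) for r in zip(*[s for s, _ in out])]
        D = [list(r) for r in zip(*[d for _, d in out])]
        return A, D
    cA, cV = cols(lo)
    cH, cD = cols(hi)
    return cA, cH, cV, cD

def multilevel(a, levels):
    # each level consumes the previous level's cA and keeps its details
    details = []
    for _ in range(levels):
        a, cH, cV, cD = lwt2d(a)
        details.append((cH, cV, cD))
    return a, details
```

A flat 4 × 4 input, for instance, yields a single approximation value after two levels, with all detail subbands zero.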
In this case, a raw surface measurement dataset is divided into several small data chunks, which are further parallelized by applying the pipeline overlapping strategy (see Section 4.4). A host function (i.e., the forward and inverse scheduler) executing on a CPU is responsible for launching the Predict() and Update() kernels on a GPU iteratively, according to the series of prediction and update coefficients. Fig. 9 shows that a 2D LWT in GPCF-LWT is realized by separating it into two 1D LWTs along the row and column orientations respectively, and the multi-level LWT is implemented by applying the single-level LWT recursively. Once the final results are obtained, the inverse LWT can be performed to reconstruct the filtered engineering surfaces (e.g., roughness and waviness).
GPCF-LWT supports the whole family of wavelet transforms based on LBB and the improved lifting scheme. This paper presents two examples (the Haar and CDF(9,7) wavelets) to show the realization of the LWT with any specific
Fig. 8 The main computational procedure of the improved lifting scheme with LBB for the multi-level 2D forward LWT.
Fig. 9 An implementation overview of GPCF-LWT on a single GPU card.
wavelet by using GPCF-LWT generically. The factorizations of the polyphase matrices of the wavelet filter coefficients conforming to the improved lifting scheme can be written as Eqs. (17) and (18) for Haar and CDF(9,7) respectively.
According to Eqs. (17) and (18), P, U and K for Haar and CDF(9,7) can be acquired from Eqs. (19) and (20) respectively.
The coefficient u0 of both Haar and CDF(9,7) is zero, so the optimized computational procedure of the single-level 1D forward LWT can be devised as in Fig. 10, where a host function is responsible for scheduling the executions of the predict and update kernels and for providing different configuration and computational parameters for different types of wavelets, while the corresponding predict and update kernels are executed on a GPU card in parallel. The left side of Fig. 10 illustrates the computational procedure of the Haar wavelet transform, and the CDF(9,7) wavelet is illustrated on the right side. The inverse LWT implementations of Haar and CDF(9,7) are similar to the forward implementations; they simply reverse the sequence of operation steps of the forward computations and swap the corresponding addition and subtraction operators.
(17)
(18)
(19)
(20)
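The coefficient-driven scheduling loop can be sketched in plain Python (this stands in for the host scheduler's kernel-launch loop; the two-neighbour lifting form and the symmetric boundary handling are simplifying assumptions, not the paper's exact kernels):

```python
def generic_lift(x, P, U, K=(1.0, 1.0)):
    """Apply len(P) prediction/update passes from coefficient tables P and U,
    then scale by K, mirroring the wavelet-agnostic driver described above."""
    even, odd = list(x[0::2]), list(x[1::2])
    n = len(even)
    for p, u in zip(P, U):
        # predict pass: odd[i] += p * (even[i] + even[i+1]), symmetric edge
        for i in range(n):
            right = even[i + 1] if i + 1 < n else even[i]
            odd[i] += p * (even[i] + right)
        # update pass: even[i] += u * (odd[i-1] + odd[i]), symmetric edge
        for i in range(n):
            left = odd[i - 1] if i > 0 else odd[i]
            even[i] += u * (left + odd[i])
    return [K[0] * e for e in even], [K[1] * o for o in odd]
```

Supplying a different wavelet then amounts to passing different P, U and K tables to the same driver, which is the generalization property GPCF-LWT relies on.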
Fig. 10 The scheduling algorithm of the single-level 1D forward LWT.
4.4 Three optimization strategies for the GPCF-LWT implementation on a single GPU card
GPCF-LWT further explores three implementation optimizations on a single GPU by seamlessly integrating the implementation with the latest CUDA GPU hardware advancements, so as to maximize the parallelization of generic 2D LWT computations:
1) GPCF-LWT provides an optimization strategy that enables dynamic configuration of kernel execution parameters based on the CUDA thread hierarchy. In CUDA programs, a thread block must be loaded into an SM inseparably, and the threads in a block are further encapsulated into several warps. An SM executes all its warps (e.g., 4 warps in Fig. 11) according to the SIMD (Single Instruction, Multiple Data) model, such that all threads in a warp share the same program instructions but process different data at the same time. To hide memory latency, an inactive warp is activated when an active warp in the same SM is pending (e.g., a warp pends when it needs to access data from global memory, which generally takes 400–600 cycles) [20,21]. Thus, it is vitally important to configure the number of CUDA threads in a block and organize them in an appropriate hierarchy (i.e., to set applicable block and grid sizes) for each kernel, such that SMs have just enough warps to schedule while limiting the complexity of thread management. Ideally, the number of threads in a block should be an integral multiple of 32, matching the fact that each warp has 32 threads for the 32 CUDA cores in CUDA GPUs, so each SM can be assigned just enough warps. In practice, an SM may reduce the number of resident blocks at runtime when too many CUDA threads are organized in a block and each thread occupies too many registers or too large a shared memory space, which often leads to the critical problem of there not being enough warps to schedule. With these considerations, this study tested and evaluated the relationship between the number of threads per block and the hardware consumption of GPCF-LWT, and found that the occupancy of CUDA cores in an SM reaches 100% when 256 threads per block are configured in GPCF-LWT. Thus, the block number (grid size) of the LWT is bn = (sn + 256 − 1)/256, where sn is the length of the input data, such that bn blocks of 256 threads each keep all SMs of a GPU card busy all the time and hide the delay caused by data transfers. To test the validity and practicality of this implementation, this study compared the effects of different numbers of threads per block to find the correlation between the configured thread count and the processing time. Fig. 12 lists the processing times for different numbers of threads per block for the 4-level 2D LWT with a data size of 4096 × 4096 running on a single GTX 1080 GPU card (the hardware specification is given in Table 3). Fig. 12 shows that GPCF-LWT attains the minimum processing time when the number of threads per block is set to 256.
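The grid-size rule above, bn = (sn + 256 − 1)/256, is simply a ceiling division; a minimal sketch:

```python
BLOCK = 256  # threads per block, the value found optimal in the experiment

def grid_size(sn):
    # ceiling division: enough blocks to cover sn elements at 256 threads each
    return (sn + BLOCK - 1) // BLOCK
```

For example, a 4096-element row needs exactly 16 blocks, while 4097 elements need 17, since the last partial block must still be launched.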
2) GPCF-LWT has developed a hierarchical memory usage scheme for generic LWT computations based on the CUDA GPU memory hierarchy. CUDA adopts an SPMD (Single Program, Multiple Data) programming model and provides a sophisticated memory hierarchy (registers, local memory, shared memory, global memory, texture memory, constant memory, etc.). Hence, a GPU can achieve high data-access throughput via elaborately designed CUDA code that uses the different memories efficiently according to the respective data features, including access mode, size and format. GPCF-LWT calls the CUDA API function cudaHostAlloc() rather than malloc() to allocate CPU host memory in page-locked (pinned) memory mode, which supports DMA (Direct Memory Access) without requiring additional buffers for caching data. Compared with the page-able memory mode, the page-locked mode transfers data directly from a CPU host memory space to a GPU global memory space, so its write speed is 1.4 times and its read speed 1.8 times faster than the page-able mode [21]. However, pinned memory should not be overused, since its scarcity leads to reduced performance. With this in mind, this study pre-allocates a fixed-size pinned memory in the initialization stage and reuses it during the whole computational life-cycle, thereby maintaining high memory performance throughout. Moreover, GPCF-LWT uses the shared memory of a GPU for executing kernels: instead of direct data accesses from GPU global memory, a single data point is loaded into the shared memory space of an SM once and only once from global memory, after which any loaded data point can be accessed by several CUDA threads simultaneously at very high speed (generally only 1–32 cycles). In comparison, the access speed of global memory drops dramatically when several threads access the same data point in global memory repeatedly; registers were not applied here for a similar reason [20]. Another major reason for using shared memory is that row addresses in global memory are aligned, such that the loads of 32 CUDA threads in a warp reading data contiguously from global memory can be coalesced, i.e., loading 32 data points into a warp needs only one global memory access rather than 32 accesses in sequence; this coalescing enhances the data transfer bandwidth. Another important issue is where to store and access P and U: constant memory or global memory? Although constant memory and global memory reside in the same physical location in the GPU hardware architecture, accessing a data point from constant memory can be much faster than from global memory due to its inherent cache and broadcast mechanisms. For instance, a data point can be loaded from the constant-memory cache when it is a cache hit, which generally costs just one cycle. Broadcast is another very useful mechanism to reduce data access delay: a data point loaded by one thread is broadcast to all the other threads of its warp. Thus, accessing a data point from constant memory can be as fast as a register when all threads of a warp access the same data point [21,43]. In this study, all CUDA threads use the same values of P and U, which are constant and do not change while performing PO and UO, so fast data access is achieved by storing and accessing the P and U sets in a constant memory space. All in all, the memory hierarchy of GPCF-LWT for generic LWT computations is organized as in Fig. 13.
To test the validity and practicality of this hierarchical memory usage scheme, this study analyzed the performance factors influencing the 2D LWT computation on different memory spaces. Firstly, the experiment compared the bandwidths of the pinned and page-able memory modes by transferring 1 GB of measured data in both the H2D (CPU host memory to GPU global memory) and D2H (GPU global memory to CPU host memory) directions. In Fig. 14, the bandwidth of the pinned memory mode gains approximately 55% improvement over the page-able memory mode. The experiment then compared the computational performance of the proposed hierarchical memory usage scheme against pure global memory usage. In the pure global memory solution, all data (raw measured datasets, P, U, intermediate results and the filtration output results) are stored in global memory. In contrast, the proposed scheme stores different datasets in different memory spaces: it applies the global memory space to store the raw measured datasets and the filtration results, the shared memory space to cache subsets and temporary results, and the constant memory space to hold the P and U coefficient instances. Fig. 15 illustrates the processing times of these two solutions; the experimental results show that the proposed scheme gains about a 50% computational performance enhancement.
3) GPCF-LWT enables a pipeline overlapping strategy using CUDA streams. CUDA GPUs provide a non-blocking asynchronous execution mode for GPU kernels, such that host code on a CPU and CUDA code on a GPU can be executed concurrently. Based on this consideration, this study derives a CUDA stream overlapping strategy that effectively alleviates the delay imposed by scheduling the sequential execution order of the lifting scheme. Each pipeline manages a CUDA stream, and a pipeline can organize n kernels running on it in different time slices, i.e., a pipeline activates and runs another kernel automatically when a kernel is pending [19]. Pure CUDA programs based on the non-overlapping model are variants of the following three fundamental stages: (1) transferring a raw dataset from a CPU host memory space to a GPU global memory space through the PCI-E buses (H2D data transfers); (2) processing the raw dataset via a specific algorithm, i.e., GPU kernel executions (e.g., 2D LWT); (3) transferring the output results from the GPU global memory space back to the CPU host memory space (D2H data transfers). Obviously, both the data transfer stages (H2D and D2H) and the kernel executions for data processing are time-consuming. More significantly, these three time-consuming stages are performed in sequence in the non-overlapping mode; the sequential timelines of the three pipeline stages are illustrated in Fig. 16. The data transfers in Fig. 16 are blocking transfers, and the host function running on a CPU can only launch GPU kernels after completion of the corresponding data transfers, so both the PCI-E buses and the GPU waste a lot of time in the idle state during the entire workflow, which sharply decreases the processing performance of CUDA programs. To reduce these idle time slices, GPCF-LWT overlaps the three major time-consuming stages using CUDA streams in the pipelines. In accordance with the hardware development that both the GPU and PCI-E support bidirectional data transfers, each pipeline in GPCF-LWT handles a repeated process of transferring a small data chunk to GPU memory first and then launching LWT kernels to process this data chunk, until the whole dataset is processed. Fig. 17 shows the timelines of the two data transfer stages and the LWT kernel execution stage in the overlapping model, in which LWT kernels can be performed during the data transfer stages (i.e., the data transfers are hidden), such that the GPU stays busy during the entire data processing workflow and gains higher performance than the non-overlapping model. This strategy can be gracefully extended to hide data transfer delays in multi-GPU systems. The CUDA API cudaMemcpyAsync() is applied in this study to asynchronously transfer data from CPU host memory to GPU global memory in order to realize the overlapping between data transfers and kernel executions. The pseudocode of the devised overlapping strategy is listed in Algorithm 1, and the LBB-based 2D LWT kernel is listed in Algorithm 2. Moreover, this aggregated strategy still applies on systems using NVLink technology, with two features: (1) both H2D and D2H data transfers are efficient, so stages 1 and 3 consume far less time; (2) these two stages can transfer a relatively large data chunk each time, such that stage 2 maintains high concurrency. With these new features, GPCF-LWT can achieve better performance without extra effort.
Fig. 11 The hierarchy structure of CUDA threads.
Fig. 12 Processing times of the 2D LWT in GPCF-LWT for different numbers of threads in a block.
Fig. 13 The hierarchical memory usage of GPCF-LWT.
Fig. 14 The bandwidths of the pinned and page-able memory modes.
Fig. 15 The performance comparison of the two solutions.
Algorithm 1 Overlapping of the three stages: two data transfers and the kernel executions.
Inputs: a 2D raw dataset Data[][], row number rows, and column number cols.
1 cudaStreamCreate(&stream[i]) // create nStreams streams
2 a_h = &Data[0][0] // get the start address of Data[][]
3 size = rows*cols/nStreams // calculate the data size for each stream
4 for i = 0 to nStreams - 1 do
5  offset = i*size // calculate the address offset for a stream pipeline
6  cudaMemcpyAsync(a_d + offset, a_h + offset, size, H2D, stream[i]) // transfer a data chunk to the GPU in the ith stream
7  LBB_LWT<<<…, stream[i]>>>(a_d + offset) // launch a wavelet transform kernel in the ith stream
8  cudaMemcpyAsync(a_h + offset, a_d + offset, size, D2H, stream[i]) // transfer the results back to the CPU in the ith stream
Algorithm 2 The LBB-based 2D LWT kernel.
Inputs: a 2D raw dataset Data[][], row number rows, and column number cols.
Outputs: cA, cH, cV, and cD.
1 // for each sliding window
2 for (row = 0; row < rows; row += SLIDE_HEIGHT) do
3  __shared__ float sData <- load raw data from the GPU global memory space into the GPU shared memory space
4 // each group of row threads performs horizontal 1D LWT in parallel to process a subset and caches the temporary results
Fig. 16 The timelines of the data transfers and LWT kernel executions in the non-overlapping model.
Fig. 17 The timelines for the overlapping of the three pipelines.
5  even, odd <- split(sData);
6  Preprocessing(even, odd);
7  __syncthreads(); // synchronization operation
8  predict(even, odd);
9  update(even, odd);
10 __syncthreads();
11 // each group of column threads performs vertical 1D LWT in parallel to process the temporary results
12 even, odd <- split(sData);
13 Preprocessing(even, odd);
14 __syncthreads();
15 predict(even, odd);
16 update(even, odd);
17 __syncthreads();
18 save(even, odd, output); // save the output results back to the GPU global memory space
To test the validity and practicality of this pipeline overlapping strategy for LWT computations, this study compared the performance of the overlapping and non-overlapping implementations on a single GTX 1080 GPU card, performing the 4-level 2D LWT with the D4 wavelet (data size 4096 × 4096). The experiment recorded the processing time of each step, i.e., H2D, data processing on the GPU card (CUDA kernel executions), and D2H. From the experimental results listed in Table 1, it can be seen that the overall computational time of the non-overlapping implementation is the sum of the time consumptions of these three steps, since the steps are performed in sequential order. By contrast, the overall computational times of the overlapping implementations are steadily less than the sum of these three steps, as they are performed concurrently. Thus, the overlapping optimization strategy reduces the overall computational time (Table 1).
Table 1 The performance comparison between the overlapping and non-overlapping implementations (ms).
Program routines Strategies H2D Kernel executions D2H Overall processing time
FLWT Non-overlapping 35 193 32 260
  Overlapping 33 197 32 213
ILWT Non-overlapping 33 215 38 286
  Overlapping 34 206 35 225
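A quick arithmetic check of Table 1 confirms the claim in the text: the non-overlapping totals equal the sum of the three stages, while the overlapping totals come in below that sum because the stages run concurrently.

```python
# Table 1 rows as (H2D, kernel, D2H, overall), in ms
non_overlap = {"FLWT": (35, 193, 32, 260), "ILWT": (33, 215, 38, 286)}
overlap     = {"FLWT": (33, 197, 32, 213), "ILWT": (34, 206, 35, 225)}

for h2d, k, d2h, total in non_overlap.values():
    assert h2d + k + d2h == total      # sequential: the stages add up exactly

for h2d, k, d2h, total in overlap.values():
    assert total < h2d + k + d2h       # concurrent: overlap hides part of the time
```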
5 The multi-GPU solution of GPCF-LWT
The powerful computing capability of a single GPU satisfies the computational demands of many applications. However, the limited memory space and CUDA cores of a single GPU card make it impossible to process a complete dataset in online surface texture measurement applications. Thus, dealing with large-scale datasets requires a multi-GPU processing mode. At present, there are two categories of commonly used multi-GPU systems: standalone computers (a single CPU node with multiple GPU cards) and cluster distributed systems (multiple CPU nodes, each accompanied by one or more GPU cards). Both can greatly accelerate data processing and achieve better computational performance. However, limited by the low-speed PCI-E buses and network connections, big data transfers between a CPU host memory space and a GPU global memory space become the key bottleneck of multi-GPU systems. Based on this consideration, reducing or hiding data transfer delays is the most important optimization strategy for obtaining high performance. More importantly, since the major cost of the LWT lies in multiplication, addition and subtraction, and it shares only neighbouring data (i.e., GPCF-LWT does not need to share datasets across the network), the communication requirements among distributed computer nodes in LWT applications are very low. From this point of view, a standalone heterogeneous multi-GPU system has been adopted for GPCF-LWT. This study tries to reduce the idle time slices of the PCI-E buses and GPU kernel executions, and improves the computational load balancing over the multiple GPU cards in the multi-GPU GPCF-LWT implementation. To reduce idle time slices, this study made two improvements to GPCF-LWT: (1) maximizing the utilization rate of the PCI-E bus bandwidth; (2) hiding data transfer delays by keeping every GPU card busy during the whole data processing workflow. For the first point, GPCF-LWT applies bidirectional data transfers on a PCI-E bus, whose bandwidth supports H2D and D2H at the same time. For the second point, the pipeline overlapping strategy presented in Section 4.4 is applied on every GPU card in the heterogeneous multi-GPU system. Furthermore, in GPCF-LWT, the data transfers between a CPU and a GPU only need to read once, and no intermediate computational results are transferred between a CPU and a GPU or among different GPUs. Thus, the multi-GPU based GPCF-LWT does not require any extra communication and has no data transfer delay during the whole computational procedure. Nevertheless, traditional dataset division approaches are likely to cause a load unbalancing problem and a low rate of GPU hardware utilization.
5.1 The load unbalancing problem
Basically, due to the separable property of the 2D LWT, a simple load balancing model based on a pure dataset division method can be derived [23]. This solution is simple and useful, but it may lead to a load unbalancing problem when a multi-GPU system contains heterogeneous GPU cards with unequal computing capability. As a result, the overall performance of such a multi-GPU system depends on the GPU node with the lowest performance. Moreover, 2D LWT applications are data-oriented computations, i.e., huge data processing but a limited number of CUDA kernels. Thus, this study devised a data-oriented dynamic load balancing (DLB) model for rapid and dynamic dataset division and allocation on heterogeneous GPU nodes by using a fuzzy neural network (FNN) framework [38]. The data-oriented DLB model minimizes the overall processing time by dynamically adjusting the number of data subsets in the group for each GPU node according to runtime feedback. Here, two important improvements to the FNN of the authors' previous work [38] that enhance the effectiveness of the DLB model are presented.
5.2 The improved FNN model
The overall framework of the FNN-based DLB model devised in the previous work is illustrated in Fig. 18 [38]. In this model, five state feedback parameters (fuzzy sets) relating closely to the overall computational performance are defined: floating-point operation performance (F), global memory size (M), parallel ability (P), the occupancy rate of the computational resources of a GPU (UF), and the occupancy rate of the global memory space of a GPU (UM). The predictor in the FNN is devised to predict the relative computational performance under different workload scenarios, and the scheduler can reorganize the data groups for each GPU card according to the dynamic feedback of the predictor. In the beginning, a raw dataset is divided into several equal-sized data chunks (the chunk size is relatively small, and the number of data chunks should be far greater than the number of GPU nodes), and these chunks are organized into n groups (n equals the number of GPU cards in the multi-GPU system) by the scheduler. At runtime, the number of data chunks in the group assigned to each GPU card differs and is determined dynamically by the real-time feedback (i.e., the five state feedback parameters) of each GPU card. Thus, the DLB model minimizes the overall processing time by dynamically adjusting the number of data chunks in the group for each GPU at runtime.
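The scheduler's allocation step can be sketched as a proportional split (the FNN predictor itself is out of scope here; the function name and the simple proportional rule are illustrative assumptions, not the paper's exact scheduler):

```python
def allocate(n_chunks, abilities):
    """Assign n_chunks data chunks to GPU groups in proportion to each
    node's predicted relative computational ability."""
    total = sum(abilities)
    counts = [int(n_chunks * a / total) for a in abilities]
    # give any rounding remainder to the most capable node
    counts[abilities.index(max(abilities))] += n_chunks - sum(counts)
    return counts
```

At each feedback round the scheduler would recompute the abilities from the predictor and re-run this split, so a node that has become busy receives fewer chunks in the next group.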
Based on these earlier outputs, this study further explored two improvements to the FNN to increase the efficiency and prediction accuracy of the devised DLB model:
1) This study constructed three additional state feedback parameters to improve the accuracy of predicting the nth relative computational ability CPni for the ith GPU node: memory speed (MS), boost clock (BC) and the last relative computational ability (CPn-1i, or LCPni). Both memory speed and boost clock are core hardware parameters affecting the computing capability of CUDA GPU cards. For example, the memory speed of the GeForce GTX 1080 GPU card is 10 Gbps and its boost clock is 1607 MHz, while the GeForce 750 Ti GPU card offers 1.02 Gbps and 1085 MHz, respectively.
Fig. 18 The overall framework of the FNN-based DLB model.
There is a strong correlation between CPn-1i and CPni: when the current CPn-1i is very high, more subsets will be organized in the data group of GPUi, which keeps GPUi busy with high UM and UF values and in turn leads to a low CPni. Thus, the balance between CPn-1i and CPni needs to be considered in predictions. The improved DLB model now has eight feedback parameters, all fuzzified into "high" and "low" fuzzy subsets; the three new fuzzy sets and their subsets are listed in Table 2.
2) This study improves the fuzzy subset descriptions by providing a comprehensive membership function adaptation mechanism. In the former DLB model, all fuzzy sets are fuzzified by one membership function: the sigmoid. Here, the static parameters (i.e., F, M, P, MS and BC) still use the sigmoid membership function, whereas the dynamic feedback parameters (i.e., UF, UM, CP and LCP) adopt an innovative membership function. This membership function is based on the softplus, processed by shift and scale operations to satisfy the fundamental properties of a fuzzy membership function. The softplus has become a mainstream activation function in deep learning networks; it can be formulated as Eq. (21), and its curve is illustrated in Fig. 19 [44]. It grows very slowly when x is relatively small, and increases rapidly once x exceeds a threshold value (e.g., 1 in Fig. 19). This phenomenon models the change of computational performance of a GPU node: when UF and UM are relatively low, the computing capability is almost unchanged, but it drops sharply when UF and UM exceed their threshold values. However, the softplus cannot be used as a membership function directly, since Eq. (21) fails to satisfy f(x) ∈ [0, 1] for x ∈ [0, 1].
Table 2 The three new fuzzy sets and their subsets.
Sets Descriptions Fuzzy subsets Descriptions of fuzzy subsets
MS Memory speed MSL Low
  MSH High
BC Boost clock BCL Low
  BCH High
LCP The last relative computational ability LCPL Low
  LCPH High
Table 3 The specification of the heterogeneous multi-GPU system.
Property Description
CPU Intel Core i7-4790, 3.6 GHz
Memory 16 GB
GPU GPU1: NVIDIA GeForce GTX 750 Ti, 2 GB, 5 × SM, 128 SP/SM, 1.02 Gbps memory speed, 1085 MHz boost clock
  3 × GPU2: NVIDIA GeForce GTX 1080, 8 GB, 20 × SM, 128 SP/SM, 10 Gbps memory speed, 1607 MHz boost clock
OS Windows 10, 64-bit
CUDA Version 8.0
(21)
To remedy this defect, this study performs shift and scale operations on the softplus function, so it can be extended as follows:
where a is a scale factor and b is a shift factor, and Eq. (22) must satisfy:
By adding a normalization factor, Eq. (22) can be redefined as Eq. (24), where a and b take the values 5 and -3 respectively. Fig. 20 shows the curve of the improved softplus.
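One plausible form of the normalized, shifted and scaled softplus is sketched below (a reconstruction consistent with the stated constraints a = 5, b = -3 and f mapping [0, 1] into [0, 1]; the paper's exact Eq. (24) is not reproduced in this extract):

```python
import math

def softplus(x):
    # the plain softplus of Eq. (21)
    return math.log(1.0 + math.exp(x))

def membership(x, a=5.0, b=-3.0):
    # shift by b, scale by a, then normalize so that f(1) = 1;
    # the resulting curve stays near 0 for small x and rises sharply
    # once a*x + b crosses zero, matching the behaviour described above
    return softplus(a * x + b) / softplus(a + b)
```

Under this form, membership(0.0) is close to 0 and membership(1.0) is exactly 1, giving a valid fuzzy membership function over the unit interval.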
Based on the three new state feedback parameters and the improved softplus, this study constructed a more effective DLB model that integrates fuzzy theory, artificial neural networks (ANN) and the backpropagation algorithm. The comparison between the improved DLB model and the conventional DLB model of [38] is presented in Section 6.2 (the fifth experiment).
Fig. 19 The curve of the softplus function.
(22)
(23)
(24)
Fig. 20 The curve of the improved softplus function.
6 Test and performance evaluation
6.1 Hardware and testbed
Table 3 specifies the heterogeneous multi-GPU system constructed for testing and evaluating GPCF-LWT in the experiments and the applicable case study. It contains four GPU cards of two types: a middle-to-low-range GPU (NVIDIA GeForce GTX 750 Ti) and three high-end GPUs (NVIDIA GeForce GTX 1080). The GTX 750 Ti and GTX 1080 contain 640 and 2560 CUDA cores, respectively.
6.2 Benchmarking experiments
This section tests and evaluates the computational performance and practicability of GPCF-LWT under both a heterogeneous multi-GPU setting and a high-end single GPU setting in five encompassing experiments:
1) The first experiment compares the computational performance of LBB, RC and RC&LB [4] based on the classic lifting scheme in the single GPU setting (a GTX 1080 GPU card). LBB, RC and RC&LB were tested on the 4-level CDF(9,7) forward and inverse 2D LWT, with data sizes ranging from 1024 × 1024 to 5120 × 5120. Fig. 21 shows that, with increasing data size, the performance of RC plunges quickly due to its intrinsic synchronization operations. RC&LB shows high computational performance when the data sizes are relatively small (equal to or smaller than 2048 × 2048), but its processing time increases drastically once the data size grows, owing to extra synchronization calls. In contrast, the devised LBB model maintains high performance over the larger data range. This experiment demonstrated that the LBB model in the GPCF-LWT framework is superior, gaining up to a 14 times speedup in comparison with RC, and up to 7 times compared with RC&LB, under a single GPU configuration.
2) The second experiment tests the effectiveness of the improved lifting scheme in the single GPU setting, again performing the 4-level CDF(9,7) forward and inverse 2D LWT. In Fig. 22, the improved lifting scheme gains higher computational performance, since it removes half of the synchronizations required in the lifting step (step 3) (see Section 4.2), substantially increasing the parallelizability of LWT computations.
3) To evaluate the generalization feature of GPCF-LWT, the third experiment tests the computational performance of six types of wavelets, namely Haar, D4, D8, D20, CDF(7,5) and CDF(9,7), using the classic lifting scheme with RC, the classic lifting scheme with RC&LB, and the improved lifting scheme with LBB. A single GTX 1080 GPU card is used for the 4-level 2D LWT with a data size of 4096 × 4096. In this experiment, the different wavelet transforms are performed in a generic manner with different U, P and K for each wavelet. The experimental results highlight that the computational performance of GPCF-LWT is significantly higher than that of the corresponding RC and RC&LB based implementations; see Fig. 23.
4) The fourth experiment evaluates the effectiveness of the three optimization strategies equipped in GPCF-LWT, again on a single GTX 1080 GPU card. Since each strategy has been tested separately in Section 4.4, this experiment focuses on the comparison between LBB with the hybrid strategy (the combination of the three optimizations) and LBB without any optimization. Fig. 24 shows that the hybrid approach gains noticeable improvements compared with pure LBB.
5) The fifth experiment further examined the computational performance of GPCF-LWT on two single GPU settings and three multi-GPU settings, using the heterogeneous GPU system listed in Table 3. This experiment adopted the CDF(9,7) wavelet to perform the 4-level 2D forward and inverse LWT on large data sizes ranging from 10,240 × 10,240 to 16,384 × 16,384. Fig. 25 lists the processing times of the five test settings, i.e., the single GTX 750 Ti GPU setting (setting 1), the single GTX 1080 GPU setting (setting 2), the pure dataset division based multi-GPU setting (setting 3), the original DLB based multi-GPU setting (setting 4) [38], and the improved DLB based multi-GPU setting (setting 5). It can be seen from Fig. 25 that the setting 1 implementation has the lowest computational performance, followed by setting 2, since the GTX 1080 GPU contains more CUDA cores (2560), a larger global memory space (8 GB) and a higher data transfer speed (10 Gbps) than the GTX 750 Ti GPU (640, 2 GB and 5.4 Gbps, respectively). However, the computational performance improvement of setting 2 is very limited, which shows that simply switching to a better GPU card does not yield a linear performance improvement for large-scale computational applications with extremely large input volumes and highly repetitive simple operational procedures or iterations, such as the 2D LWT. The setting 3 implementation gains a somewhat higher computational performance than the two single GPU implementations, by about 2 and 3 times respectively. In contrast, the DLB-driven GPCF-LWT multi-GPU implementations (settings 4 and 5) achieve significant computational performance enhancements, since the FNN-based DLB can allocate data tasks dynamically at runtime according to the relative computational performance across all GPU nodes. Moreover, the improved DLB (setting 5) gains better performance than the original version (setting 4) for all data sizes, thanks to the softplus-based membership function, which is better suited to adapting to the dynamic change of computational performance of heterogeneous GPU nodes. The peak performance gain of setting 5 (the speedup at the data size of 16,384 × 16,384) reaches approximately 10 times that of setting 3; compared with the two single GPU implementations, the speedup of setting 5 reaches 15 and 20 times over settings 1 and 2 respectively.
Fig. 21 The performance comparison among the LBB, RC and RC&LB models.
Fig. 22 The performance comparison between the classic and improved lifting schemes.
Fig. 23 The generalization feature evaluation of GPCF-LWT.
6.3 A case study on online engineering surface filtration
This section presents a case study on online engineering surface filtration to test the usability and efficiency of the devised multi-GPU GPCF-LWT in comparison with other state-of-the-art multi-GPU approaches. In the field of areal surface texture characterization, geometrical characteristic signals that relate closely to functional requirements can be extracted precisely from wavelet coefficients by using different transmission bands. The 2D forward LWT is performed on the raw 2D surface measurement datasets to obtain the detail coefficients Di at all levels and the approximation coefficients cA from the last level; the cAi of each level can then be obtained while performing the 2D inverse LWT. Based on the Di and cAi, the geometrical characteristics (i.e., roughness, waviness and form) of a surface texture can be obtained by calculating the inverse LWT defined in Eq. (25) [10,40].
In order to extract the roughness η(x,y) and waviness η′(x,y) from a surface texture profile, the cAi must be set to zero. Similarly, the Di must be set to zero to extract the form characteristic.
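This zero-and-invert extraction can be sketched with a minimal lifting implementation. The snippet below uses the simple Haar wavelet's predict/update lifting steps on a 1D profile for brevity (the case study itself uses the D4 and CDF(9,7) wavelets on 2D data), and the helper names are illustrative rather than the paper's API:

```python
import numpy as np

def haar_lift_forward(x):
    # one Haar lifting step: split into even/odd, predict, update
    s, d = x[0::2].astype(float), x[1::2].astype(float)
    d = d - s          # predict: detail = odd - even
    s = s + d / 2      # update: approximation keeps the local mean
    return s, d

def haar_lift_inverse(s, d):
    # undo the lifting steps in reverse order
    s = s - d / 2
    x = np.empty(s.size + d.size)
    x[0::2], x[1::2] = s, d + s
    return x

def decompose(x, levels):
    details = []
    for _ in range(levels):
        x, d = haar_lift_forward(x)
        details.append(d)
    return x, details           # cA of the last level, and D1..Dn

def reconstruct(cA, details):
    for d in reversed(details):
        cA = haar_lift_inverse(cA, d)
    return cA

profile = np.random.default_rng(0).normal(size=256)
cA, Ds = decompose(profile, levels=4)

# roughness-like component: zero cA and invert; form: zero every Di and invert
rough = reconstruct(np.zeros_like(cA), Ds)
form = reconstruct(cA, [np.zeros_like(d) for d in Ds])

# by linearity of the inverse LWT, the two components sum to the original
assert np.allclose(rough + form, profile)
```

The same principle carries over to the 2D case: zeroing selected coefficient bands before the inverse transform isolates the corresponding frequency band of the surface.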
Fig. 24 The effectiveness experiment of the three optimization strategies in GPCF-LWT.
Fig. 25 The comparison of processing times of the setting 1 to setting 5 implementations.
S(x, y) = cA_n(x, y) + Σ_{i=1}^{n} D_i(x, y)  (25)
This study selected two kinds of raw engineering surfaces (surfaces on ceramic femoral heads and on milled metallic workpieces) to test and evaluate the GPCF-LWT. The GPCF-LWT integrates the devised LBB, the improved lifting scheme, the three implementation optimizations, and the improved data-oriented DLB. The analysis compares it against SURFSTAND, a surface characterization system employing a modern CPU core [10], and two other multi-GPU implementations: RC&LB and the method proposed by Ikuzawa et al. [45]. It should be noted that the pure dataset-division method was applied for the two benchmark multi-GPU implementations. The heterogeneous multi-GPU system listed in Table 3 is used for all three multi-GPU implementations.
Figs. 26a and 27a show raw measured datasets from the functional surfaces of a ceramic femoral head and a milled metallic workpiece respectively, each with a sampling size of 4096 × 4096. In Fig. 26a, the measured surface texture has two different types of scratches produced by the grinding process: one consists of regular, shallow scratches, while the other is a random, deeper scratch. The six-level D4 wavelet has been used in this case to perform both the forward and inverse 2D LWT. Fig. 26b–d demonstrates the roughness, waviness and form characteristics extracted from the femoral head surface by using the inverse LWT defined in Eq. (25). In the case of the milled metallic surface (see Fig. 27a), the six-level CDF (9,7) wavelet has been used to perform the 2D LWT, and Fig. 27b–d demonstrates the roughness, waviness and form characteristics extracted from the milled metallic surface by using the inverse LWT defined in Eq. (25).
It is worth mentioning that CUDA supports both single-precision (32-bit, abbreviated as FP32) and half-precision (16-bit, FP16) floating-point formats [19,46]. This experiment tested both precisions. It was found that the float precision setting does not affect the computational performance of the 2D LWT on the CPU, because the Intel Core i7-4790 CPU uses 64-bit instructions: both FP16 and FP32 operations execute in 64-bit mode with negligible difference. In contrast, the FP16-based multi-GPU implementations show higher computational performance than the FP32 implementations. However, the lower precision of FP16 fails to fulfil the high-precision requirements of micro- and nano-surface texture metrology, so this case study applies only FP32 for evaluation. The results show that the multi-GPU GPCF-LWT and pure CPU (Intel Core i7-4790, 3.6 GHz) implementations produce the same filtration results in terms of both precision and accuracy. Table 4 shows that the multi-GPU implementation gains over 20 times speedup over the CPU implementation.
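The FP16 precision limit can be illustrated with a short sketch. The two height samples below are illustrative values chosen to sit at the nanometre scale that micro- and nano-surface metrology must resolve, not values from the measured datasets:

```python
import numpy as np

# two height samples (in millimetres) differing by roughly 1.2e-6,
# i.e. a nanometre-scale step on a millimetre-scale surface height
z32 = np.float32([1.2345678, 1.2345690])
z16 = z32.astype(np.float16)

# FP32 (~7 significant decimal digits) still resolves the step,
# while FP16 (~3 significant decimal digits) rounds both samples to
# the same representable value and the step collapses to zero
step32 = float(z32[1] - z32[0])
step16 = float(z16[1] - z16[0])
assert step32 > 0.0
assert step16 == 0.0
```

This is why the FP16 throughput advantage cannot be exploited here despite its faster GPU execution.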
Fig. 26 The functional surface texture of a ceramic femoral head (the form, waviness and roughness surfaces).
Fig. 27 The functional surface texture of a milled metallic workpiece (the form surface).
Table 4 The performance comparisons between a CPU and the multi-GPU implementation (ms).

Dataset                                      LWT (CPU / GPCF-LWT)   ILWT (CPU / GPCF-LWT)
Ceramic femoral head (4096 × 4096)           10,165 / 415           12,590 / 450
Ceramic femoral head (12,288 × 12,288)       28,569 / 756           34,681 / 832
Milled metallic workpiece (4096 × 4096)      12,850 / 510           12,685 / 511
Milled metallic workpiece (12,288 × 12,288)  30,256 / 748           35,647 / 786
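As a quick sanity check, the forward-LWT timings in Table 4 can be converted into per-dataset speedup ratios:

```python
# timings in ms from Table 4, forward-LWT column: (CPU, GPCF-LWT)
timings = {
    "ceramic femoral head 4096x4096": (10165, 415),
    "ceramic femoral head 12288x12288": (28569, 756),
    "milled workpiece 4096x4096": (12850, 510),
    "milled workpiece 12288x12288": (30256, 748),
}
speedups = {name: cpu / gpu for name, (cpu, gpu) in timings.items()}

# every dataset exceeds the 20x figure quoted in the text
assert all(s > 20 for s in speedups.values())
```

The ratios also show that the speedup grows with dataset size, consistent with the larger datasets amortizing fixed transfer and launch overheads better.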
Fig. 28 shows the performance comparison among GPCF-LWT, RC&LB and Ikuzawa's solution on the heterogeneous multi-GPU system detailed in Table 3. In this case, the testing dataset has been expanded to a size of 12,288 × 12,288 to evaluate whether GPCF-LWT can process large-scale datasets in an online manner. The experimental results show that RC&LB has the lowest computational performance due to its heavy synchronization calls. Ikuzawa's solution gains better performance owing to its more efficient memory usage, though both still suffer from the heavy synchronization, pipeline cluttering and data overload problems. In contrast, the GPCF-LWT with the advanced optimizations gains up to 12 times in processing speed over the other two multi-GPU approaches.
In summary, the multi-GPU based GPCF-LWT has gained a significant improvement in computational performance for the 2D LWT. It can process a very large dataset (e.g., 12,288 × 12,288) in less than one second, so the performance requirements of online (or near real-time), large-scale, data-intensive surface texture filtration applications can be satisfied gracefully.
Fig. 28 The performance comparisons of the three multi-GPU implementations (ms) on the ceramic femoral head and milled metallic workpiece datasets.
7 Conclusions and future work
To fulfil the online and big-data requirements of micro- and nano-surface metrology, the parallel computational power of modern GPUs has been explored. This research devised a generic parallel computational framework for lifting wavelet transform (GPCF-LWT) acceleration based on a heterogeneous multi-GPU system. One of the outcomes of this innovative infrastructure is a new parallel computational model named LBB, which alleviates the critical problem of shared memory shortage on a single GPU card while preserving the generality of the LWT within the CUDA memory hierarchy. To increase the parallelizability of the LWT, an improved lifting scheme has also been developed. Three optimization strategies for the single-GPU configuration were validated and then integrated. To further improve the computational performance of the GPCF-LWT, it has been extended to compute on a heterogeneous multi-GPU system with a fuzzy-theory based load balancing model. This study further explored two improvements to the previously devised FNN DLB model to increase its efficiency and prediction accuracy. Experiments show that the proposed GPCF-LWT achieves superior and substantial computational performance gains compared with other state-of-the-art techniques: over 20 times speedup has been observed over the pure CPU solutions, and up to 12 times over other benchmark multi-GPU solutions on large-scale surface measurement datasets. The innovative framework and its corresponding techniques address the key challenges of 2D LWT applications that are vital for big surface dataset processing.
One avenue opened up during the study for future exploration is to introduce deep learning across the GPU and CPU boundary. Especially when facing Cell CPUs or CPU clusters, a hybrid and efficient data distribution scheme, as well as an online surface texture analysis platform with smart recommendation functions for filter selection, will be of great value for practitioners.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the NSFC (61203172), the Sichuan Science and Technology Programs (2018JY0146 and 2019YFH0187), and grant "598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP" from the European Commission.
References
[1] X.J. Jiang and D.J. Whitehouse, Technological shifts in surface metrology, CIRP Ann. 61 (2), 2012, 815–836, https://doi.org/10.1016/j.cirp.2012.05.009.
[2] D.A. Axinte, N. Gindy, et al., Process monitoring to assist the workpiece surface quality in machining, Int. J. Mach. Tools Manuf. 44 (10), 2004, 1091–1108, https://doi.org/10.1016/j.ijmachtools.2004.02.020.
[3] X. Jiang, P.J. Scott, et al., Paradigm shifts in surface metrology. Part I. Historical philosophy, Proc. R. Soc. A Math. Phys. Eng. Sci. 463 (2085), 2007, 2049–2070, https://doi.org/10.1098/rspa.2007.1874.
[4] W.J. van der Laan, A.C. Jalba, et al., Accelerating wavelet lifting on graphics hardware using CUDA, IEEE Trans. Parallel Distrib. Syst. 22 (1), 2011, 132–146, https://doi.org/10.1109/TPDS.2010.143.
[5] J.A. Castelar, C.A. Angulo, et al., Parallel decompression of seismic data on GPU using a lifting wavelet algorithm, Signal Process. Images Comput. Vis., 2015, IEEE.
[6] S. Lou, W.H.H. Zeng, et al., Robust filtration techniques in geometrical metrology and their comparison, Int. J. Autom. Comput. 10 (1), 2013, 1–8, https://doi.org/10.1007/s11633-013-0690-4.
[7] H.S. Abdul-Rahman, S. Lou, et al., Freeform texture representation and characterisation based on triangular mesh projection techniques, Measurement 92, 2016, 172–182, https://doi.org/10.1016/j.measurement.2016.06.005.
[8] ISO 16610-61:2015, Geometrical product specification (GPS) – filtration – part 61: linear areal filters – Gaussian filters, an ISO/TC 213 dimensional and geometrical product specifications and verification standard, 2015, 1–18, https://www.iso.org/standard/60813.html.
[9] ASME B46.1-2009, Surface texture: surface roughness, waviness, and lay, an American national standard, 2009, 1–120, https://www.asme.org/products/codes-standards/b461-2009-surface-texture-surface-roughness.
[10] P. Nejedly, F. Plesinger, et al., CudaFilters: a SignalPlant library for GPU-accelerated FFT and FIR filtering, Softw. Pract. Exp. 48 (1), 2018, 3–9, https://doi.org/10.1002/spe.2507.
[11] X. Jiang, P.J. Scott, et al., Paradigm shifts in surface metrology. Part II. The current shift, Proc. R. Soc. A Math. Phys. Eng. Sci. 463 (2085), 2007, 2071–2099, https://doi.org/10.1098/rspa.2007.1873.
[12] W. Sweldens, The lifting scheme: a construction of second generation wavelets, SIAM J. Math. Anal. 29 (2), 1998, 511–546.
[13] S. Xu, W. Xue, et al., Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform, J. Supercomput. 63 (3), 2013, 710–721, https://doi.org/10.1007/s11227-011-0626-0.
[14] H. Wang, H. Peng, et al., A survey of GPU-based acceleration techniques in MRI reconstructions, Quant. Imaging Med. Surg. 8 (2), 2018, 196–208, https://doi.org/10.21037/qims.2018.03.07.
[15] S. Mittal and J.S. Vetter, A survey of CPU-GPU heterogeneous computing techniques, ACM Comput. Surv. 47 (4), 2015, 1–35, https://doi.org/10.1145/2788396.
[16] M. McCool, Signal processing and general-purpose computing and GPUs, IEEE Signal Process. Mag. 24 (3), 2007, 109–114, https://doi.org/10.1109/MSP.2007.361608.
[17] Y. Su, Z. Xu, et al., GPGPU-based Gaussian filtering for surface metrological data processing, In: 2008 12th Int. Conf. Inf. Vis., 2008, IEEE, 94–99, https://doi.org/10.1109/IV.2008.14.
[18] NVIDIA, CUDA C Programming Guide v9.0, http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[19] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, 2012, Newnes.
[20] NVIDIA, CUDA C Best Practices Guide v9.0, http://docs.nvidia.com/cuda/cuda-c-best-practices-guide.
[21] D.B. Kirk and W.H. Wen-Mei, Programming Massively Parallel Processors: A Hands-on Approach, 2012, Morgan Kaufmann.
[22] R. Couturier, Designing Scientific Applications on GPUs, 2013, CRC Press.
[23] D. Foley and J. Danskin, Ultra-performance Pascal GPU and NVLink interconnect, IEEE Micro 37 (2), 2017, 7–17, https://doi.org/10.1109/MM.2017.37.
[24] M. Hopf and T. Ertl, Hardware accelerated wavelet transformations, Data Vis., 2000, Springer, 93–103, https://doi.org/10.1007/978-3-7091-6783-0_10.
[25] T.-T. Wong, C.-S. Leung, et al., Discrete wavelet transform on consumer-level graphics hardware, IEEE Trans. Multimed. 9 (3), 2007, 668–673, https://doi.org/10.1109/TMM.2006.887994.
[26] J. Franco, G. Bernabé, et al., The 2D wavelet transform on emerging architectures: GPUs and multicores, J. Real-Time Image Process. 7 (3), 2012, 145–152, https://doi.org/10.1007/s11554-011-0224-7.
[27] C. Song, Y. Li, et al., A GPU-accelerated wavelet decompression system with SPIHT and Reed-Solomon decoding for satellite images, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 4 (3), 2011, 683–690, https://doi.org/10.1109/JSTARS.2011.2159962.
[28] L. Zhu, Y. Zhou, et al., Parallel multi-level 2D-DWT on CUDA GPUs and its application in ring artifact removal, Concurr. Comput. Pract. Exp. 27 (17), 2015, 5188–5202, https://doi.org/10.1002/cpe.3559.
[29] L. Blunt and X. Jiang, Advanced Techniques for Assessment Surface Topography: Development of a Basis for 3D Surface Texture Standards "Surfstand", 2003, Elsevier.
[30] J. Wang, W. Lu, et al., High-speed parallel wavelet algorithm based on CUDA and its application in three-dimensional surface texture analysis, In: Int. Conf. Electr. Inf. Control Eng., 2011, IEEE, 2249–2252, https://doi.org/10.1109/ICEICE.2011.5778225.
[31] Y. Chen, X. Cui, et al., Large-scale FFT on GPU clusters, In: Proc. 24th ACM Int. Conf. Supercomput., New York, USA, 2010, ACM Press, https://doi.org/10.1145/1810085.1810128.
[32] A. Nukada, Y. Maruyama, et al., High performance 3-D FFT using multiple CUDA GPUs, In: Proc. 5th Annu. Workshop Gen. Purp. Process. with Graph. Process. Units (GPGPU), New York, USA, 2012, ACM Press, 57–63, https://doi.org/10.1145/2159430.2159437.
[33] S. Schaetz and M. Uecker, A multi-GPU programming library for real-time applications, In: Int. Conf. Algorithms Archit. Parallel Process., 2012, Springer, 114–128.
[34] J.A. Stuart and J.D. Owens, Multi-GPU MapReduce on GPU clusters, In: 2011 IEEE Int. Parallel Distrib. Process. Symp., 2011, IEEE, 1068–1079, https://doi.org/10.1109/IPDPS.2011.102.
[35] M. Grossman, M. Breternitz, et al., HadoopCL: MapReduce on distributed heterogeneous platforms through seamless integration of Hadoop and OpenCL, In: 2013 IEEE Int. Symp. Parallel Distrib. Process., 2013, IEEE, 1918–1927, https://doi.org/10.1109/IPDPSW.2013.246.
[36] M. Abadi, A. Agarwal, et al., TensorFlow: large-scale machine learning on heterogeneous distributed systems, white paper, 2016, 1–19.
[37] S. Chintala, An overview of deep learning frameworks and an introduction to PyTorch, 2017, https://smartech.gatech.edu/handle/1853/58786 (accessed October 5, 2017).
[38] C. Zhang, Y. Xu, et al., A fuzzy neural network based dynamic data allocation model on heterogeneous multi-GPUs for large-scale computations, Int. J. Autom. Comput. 15 (2), 2018, 181–193, https://doi.org/10.1007/s11633-018-1120-4.
[39] K.K. Shukla and A.K. Tiwari, Filter banks and DWT, SpringerBriefs Comput. Sci., 2013, Springer, 21–36, https://doi.org/10.1007/978-1-4471-4941-5_2.
[40] X.Q. Jiang, L. Blunt, et al., Development of a lifting wavelet representation for surface characterization, Proc. R. Soc. A Math. Phys. Eng. Sci. 456 (2001), 2000, 2283–2313, https://doi.org/10.1098/rspa.2000.0613.
[41] M.E. Angelopoulou, K. Masselos, et al., Implementation and comparison of the 5/3 lifting 2D discrete wavelet transform computation schedules on FPGAs, J. Signal Process. Syst. 51 (1), 2008, 3–21, https://doi.org/10.1007/s11265-007-0139-5.
[42] C. Tenllado, J. Setoain, et al., Parallel implementation of the 2D discrete wavelet transform on graphics processing units: filter bank versus lifting, IEEE Trans. Parallel Distrib. Syst. 19 (3), 2008, 299–310, https://doi.org/10.1109/TPDS.2007.70716.
[43] N. Wilt, The CUDA Handbook: A Comprehensive Guide to GPU Programming, 2013, Addison-Wesley.
[44] Y. LeCun, Y. Bengio, et al., Deep learning, Nature 521 (7553), 2015, 436–444, https://doi.org/10.1038/nature14539.
[45] T. Ikuzawa, F. Ino, et al., Reducing memory usage by the lifting-based discrete wavelet transform with a unified buffer on a GPU, J. Parallel Distrib. Comput. 93-94, 2016, 44–55, https://doi.org/10.1016/j.jpdc.2016.03.010.
[46] NVIDIA, NVIDIA GeForce GTX 750 Ti, white paper, NVIDIA Corporation, 2014.
Highlights
• A new parallel computation model for LWT named LBB was devised to tackle the limitation of the shared memory bottleneck on modern GPUs.
• The improved lifting scheme reduces by approximately half the synchronizations needed when computing the lifting wavelets in a single CUDA thread.
• Three optimization strategies were benchmarked for the GPCF-LWT implementation.
• The multi-GPU based GPCF-LWT supports online large-scale measured data filtration through an improved FNN based dynamic load balancing (DLB) model.