edwards bos06


Upload: khalid

Post on 08-Oct-2015


DESCRIPTION

Multi-Threading on Multi-Core Processors

TRANSCRIPT

  • Database for Data-Analysis
    Developer: Ying Chen (JLab)
    Computing 3(or N)-pt functions
      Many correlation functions (quantum numbers), at many momenta for a fixed configuration
      Data analysis requires a single quantum number over many configurations (called an Ensemble quantity)
      Can be 10K to over 100K quantum numbers
    Inversion problem:
      Time to retrieve 1 quantum number can be long
      Analysis jobs can take hours (or days) to run. Once cached, time can be considerably reduced
    Development:
      Require better storage technique and better analysis code drivers


  • Database
    Requirements:
      For each config worth of data, will pay a one-time insertion cost
      Config data may be inserted out of order
      Need to insert or delete
    Solution:
      Requirements basically imply a balanced tree
      Try DB using Berkeley DB (Sleepycat)
    Preliminary Tests:
      300 directories of binary files holding correlators (~7K files in each dir.)
      A single key of quantum number + config number hashed to a string
      About 9 GB DB; retrieval on local disk takes about 1 sec, over NFS about 4 sec.
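
    The requirements above (one-time insertion cost, out-of-order inserts, deletes, keyed retrieval) can be illustrated with a minimal in-memory sketch. It is not the Berkeley DB setup from the slide: std::map stands in for the on-disk balanced tree, and the names CorrelatorDB, make_key, insert_config, erase_config and read_ensemble, as well as the "label|config" key layout, are assumptions made only for this example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the on-disk database.
// std::map is an ordered, balanced search tree, so insert, erase and
// lookup cost O(log n) no matter in which order configs arrive.
using Correlator = std::vector<double>;
using CorrelatorDB = std::map<std::string, Correlator>;

// Combine a quantum-number label and a config number into one string key.
// The "label|config" layout is an assumption of this sketch.
std::string make_key(const std::string& quantum_number, int config) {
    assert(config >= 0);
    return quantum_number + "|" + std::to_string(config);
}

// Insert (or overwrite) the data of one config for one quantum number.
void insert_config(CorrelatorDB& db, const std::string& quantum_number,
                   int config, const Correlator& data) {
    db[make_key(quantum_number, config)] = data;
}

// Delete one entry; returns true if something was actually removed.
bool erase_config(CorrelatorDB& db, const std::string& quantum_number,
                  int config) {
    return db.erase(make_key(quantum_number, config)) > 0;
}

// Gather a single quantum number over many configs (the "ensemble" view).
// Configs that are missing from the database are skipped.
std::vector<Correlator> read_ensemble(const CorrelatorDB& db,
                                      const std::string& quantum_number,
                                      const std::vector<int>& configs) {
    std::vector<Correlator> out;
    for (int c : configs) {
        auto it = db.find(make_key(quantum_number, c));
        if (it != db.end()) out.push_back(it->second);
    }
    return out;
}
```

    Calling insert_config for config 30 before config 10 and then read_ensemble with {10, 20, 30} returns the two stored entries in the requested order, which is the access pattern the analysis needs.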

  • Database and Interface
    Database key:
      String = source_sink_pfx_pfy_pfz_qx_qy_qz_Gamma_linkpath
      Not intending (at the moment) any relational capabilities among sub-keys
    Interface function:
      Array< Array > read_correlator(const string& key);

    Analysis code interface (wrapper):
      struct Arg {Array p_i; Array p_f; int gamma;};
      Getter: Ensemble operator[](const Arg&); or Array operator[](const Arg&);
      Here, ensemble objects have jackknife support, namely operator*(Ensemble, Ensemble);
      CVS package: adat
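
    As an illustration of the key layout source_sink_pfx_pfy_pfz_qx_qy_qz_Gamma_linkpath, the sketch below assembles such a string from its sub-keys. The KeyFields struct, its field types and the make_db_key function are hypothetical and are not taken from the adat package.

```cpp
#include <array>
#include <sstream>
#include <string>

// Hypothetical container for the sub-keys named on the slide.
// Field types are assumptions of this sketch.
struct KeyFields {
    std::string source;        // source operator label
    std::string sink;          // sink operator label
    std::array<int, 3> pf;     // pfx, pfy, pfz
    std::array<int, 3> q;      // qx, qy, qz
    int gamma;                 // Gamma index
    std::string linkpath;      // link-path label
};

// Join the sub-keys with underscores in the order given on the slide.
std::string make_db_key(const KeyFields& k) {
    std::ostringstream os;
    os << k.source << '_' << k.sink;
    for (int c : k.pf) os << '_' << c;
    for (int c : k.q) os << '_' << c;
    os << '_' << k.gamma << '_' << k.linkpath;
    return os.str();
}
```

    Because the key is a flat string with no relational structure among sub-keys, every lookup has to supply all of the fields, which matches the statement above that relational capabilities are not intended at the moment.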

  • (Clover) Temporal Preconditioning
    Consider Dirac op: det(D) = det(Dt + Ds/)
    Temporal precondition: det(D) = det(Dt) det(1 + Dt^-1 Ds/)
    Strategy:
      Temporal preconditioning
      3D even-odd preconditioning
    Expectations:
      Improvement can increase with increasing
      According to Mike Peardon, typically factors of 3 improvement in CG iterations
      Improving condition number lowers fermionic force
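
    The factorization on this slide follows from pulling the temporal operator out of the sum, assuming Dt is invertible. In the sketch below, S is a label introduced here only for brevity; it denotes the spatial term exactly as it appears on the slide.

```latex
\det(D) \;=\; \det(D_t + S)
        \;=\; \det\!\bigl(D_t\,(1 + D_t^{-1} S)\bigr)
        \;=\; \det(D_t)\,\det\!\bigl(1 + D_t^{-1} S\bigr)
```

    The last step uses only the multiplicativity of the determinant, so the identity is exact and not an approximation.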

  • Multi-Threading on Multi-Core Processors
    Jie Chen, Ying Chen, Balint Joo and Chip Watson
    Scientific Computing Group
    IT Division
    Jefferson Lab

  • Motivation
    Next LQCD Cluster
      What type of machines are going to be used for the cluster?
      Intel Dual Core or AMD Dual Core?

    Software Performance Improvement
      Multi-threading

  • Test Environment
    Two Dual Core Intel 5150 Xeons (Woodcrest)
      2.66 GHz
      4 GB memory (FB-DDR2 667 MHz)
    Two Dual Core AMD Opteron 2220 SE (Socket F)
      2.8 GHz
      4 GB memory (DDR2 667 MHz)
    2.6.15-smp kernel (Fedora Core 5)
      i386
      x86_64
    Intel c/c++ compiler (9.1), gcc 4.1

  • Multi-Core Architecture
    [Block diagrams. Intel Woodcrest (Intel Xeon 5100): Core 1, Core 2, Memory Controller, ESB2 I/O, PCI Express, FB DDR2. AMD Opterons (Socket F): Core 1, Core 2, PCI-E Bridge, PCI-E Expansion HUB, PCI-X Bridge, DDR2.]

  • Multi-Core Architecture
    Intel Woodcrest Xeon
      L1 Cache
        32 KB Data, 32 KB Instruction
        8-way associativity
      L2 Cache
        4 MB shared among 2 cores
        16-way associativity
        256 bit width
        10.6 GB/s bandwidth to cores
      FB-DDR2
        Increased latency
        Memory disambiguation allows loads ahead of store instructions
      Execution
        Pipeline length 14; 24 bytes fetch width; 96 reorder buffers
        Max decoding rate 4 + 1; max 4 FP/cycle
        3 128-bit SSE units; one SSE instruction/cycle
    AMD Opteron
      L1 Cache
        64 KB Data, 64 KB Instruction
        2-way associativity
      L2 Cache
        1 MB dedicated
        16-way associativity
        128 bit width
        6.4 GB/s bandwidth to cores
      NUMA (DDR2)
        Increased latency to access the other memory
        Memory affinity is important
      Execution
        Pipeline length 12; 16 bytes fetch width; 72 reorder buffers
        Max decoding rate 3; max 3 FP/cycle
        2 128-bit SSE units; one SSE instruction = two 64-bit instructions

  • Memory System Performance

  • Memory System PerformanceMemory Access Latency in nanoseconds

  • Performance of ApplicationsNPB-3.2 (gcc-4.1 x86-64)

  • LQCD Application (DWF) Performance

  • Parallel Programming
    [Diagram: Machine 1 and Machine 2 exchange messages; each machine runs OpenMP/Pthread internally]
    Performance Improvement on Multi-Core/SMP machines
      All threads share address space
      Efficient inter-thread communication (no memory copies)

  • Multi-Threads Provide Higher Memory Bandwidth to a Process

  • Different Machines Provide Different Scalability for Threaded Applications

  • OpenMP
    Portable, Shared Memory Multi-Processing API
      Compiler Directives and Runtime Library
      C/C++, Fortran 77/90
      Unix/Linux, Windows
      Intel c/c++, gcc-4.x
      Implementation on top of native threads
    Fork-join Parallel Programming Model

    [Diagram: the master thread forks into parallel threads, which later join; the horizontal axis is time]

  • OpenMP
    Compiler Directives (C/C++)
      #pragma omp parallel
      {
          thread_exec (); /* all threads execute the code */
      } /* all threads join master thread */
      #pragma omp critical
      #pragma omp section
      #pragma omp barrier
      #pragma omp parallel reduction(+:result)
    Run time library
      omp_set_num_threads, omp_get_thread_num
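
    To make the reduction directive concrete, here is a minimal sketch of a dot product that uses the combined parallel for form with reduction(+:result). The function name dot and its interface are assumptions made for this example and do not come from the slides.

```cpp
#include <cassert>
#include <vector>

// Dot product with an OpenMP sum reduction.
// Each thread accumulates into a private copy of `result`; the copies
// are added together when the threads join the master thread.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double result = 0.0;
    #pragma omp parallel for reduction(+:result)
    for (long i = 0; i < n; ++i) {
        result += a[i] * b[i];
    }
    return result;
}
```

    With GCC the pragma takes effect when compiling with -fopenmp; without that flag the pragma is ignored and the loop runs serially with the same result up to floating-point summation order.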

  • Posix Thread
    IEEE POSIX 1003.1c standard (1995)
    NPTL (Native Posix Thread Library) available on Linux since kernel 2.6.x
    Fine grain parallel algorithms
      Barrier, Pipeline, Master-slave, Reduction

    Complex
      Not for general public

  • QCD Multi-Threading (QMT)
    Provides simple APIs for the fork-join parallel paradigm
      typedef void (*qmt_userfunc_t)(void * arg);
      qmt_pexec (qmt_userfunc_t func, void* arg);
      The user func will be executed on multiple threads.
    Offers efficient mutex lock, barrier and reduction
      qmt_sync (int tid); qmt_spin_lock(&lock);
    Performs better than OpenMP generated code?
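
    The fork-join pattern behind qmt_pexec can be sketched with standard C++ threads. This is not the QMT implementation: pexec, user_func_t, SumArg, partial_sum and threaded_sum are hypothetical names, and passing the thread id and thread count as explicit arguments is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical user-function type: thread id and thread count are
// passed explicitly here, which is an assumption of this sketch.
using user_func_t = void (*)(int tid, int nthreads, void* arg);

// Fork-join: start nthreads-1 workers, run the function on the calling
// (master) thread as tid 0, then wait for every worker to finish.
void pexec(user_func_t func, void* arg, int nthreads) {
    assert(nthreads >= 1);
    std::vector<std::thread> workers;
    for (int tid = 1; tid < nthreads; ++tid) {
        workers.emplace_back(func, tid, nthreads, arg);  // fork
    }
    func(0, nthreads, arg);                              // master does work too
    for (std::thread& w : workers) w.join();             // join
}

// Shared argument block: threads share the address space, so they
// exchange pointers instead of copying data.
struct SumArg {
    const std::vector<double>* data;
    std::vector<double>* partial;  // one slot per thread
};

// Each thread sums its own contiguous slice and writes one partial result.
void partial_sum(int tid, int nthreads, void* arg) {
    SumArg* a = static_cast<SumArg*>(arg);
    const std::size_t n = a->data->size();
    const std::size_t lo = n * static_cast<std::size_t>(tid) / nthreads;
    const std::size_t hi = n * static_cast<std::size_t>(tid + 1) / nthreads;
    double s = 0.0;
    for (std::size_t i = lo; i < hi; ++i) s += (*a->data)[i];
    (*a->partial)[tid] = s;
}

// Reduction: combine the per-thread partial sums after the join.
double threaded_sum(const std::vector<double>& data, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    SumArg arg{&data, &partial};
    pexec(partial_sum, &arg, nthreads);
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

    The join doubles as the only synchronization point in this sketch; a library such as QMT adds barriers and spin locks so that threads can synchronize without being torn down and recreated. On Linux the example typically needs to be linked with -pthread.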

  • OpenMP Performance from Different Compilers (i386)

  • Synchronization Overhead for OMP and QMT on Intel Platform (i386)

  • Synchronization Overhead for OMP and QMT on AMD Platform (i386)

  • QMT Performance on Intel and AMD (x86_64 and gcc 4.1)

  • Conclusions
    Intel Woodcrest beats AMD Opterons at this stage of the game.
      Intel has better dual-core micro-architecture
      AMD has better system architecture

    Handwritten QMT library can beat OMP compiler-generated code.