edwards bos06

Multi-Threading on Multi-Core Processors
-
Database for Data-Analysis
Developer: Ying Chen (JLab)
Computing 3 (or N)-pt functions:
- Many correlation functions (quantum numbers), at many momenta, for a fixed configuration
- Data analysis requires a single quantum number over many configurations (called an Ensemble quantity)
- Can be 10K to over 100K quantum numbers
Inversion problem:
- Time to retrieve 1 quantum number can be long
- Analysis jobs can take hours (or days) to run; once cached, time can be considerably reduced
Development:
- Require a better storage technique and better analysis code drivers
-
Database
Requirements:
- For each config's worth of data, will pay a one-time insertion cost
- Config data may insert out of order
- Need to insert or delete
Solution:
- The requirements basically imply a balanced tree
- Try a DB using Berkeley DB (Sleepycat)
Preliminary tests:
- 300 directories of binary files holding correlators (~7K files per directory)
- A single key of quantum number + config number, hashed to a string
- About 9 GB DB; retrieval takes about 1 sec on local disk, about 4 sec over NFS
-
Database and Interface
Database key:
- String = source_sink_pfx_pfy_pfz_qx_qy_qz_Gamma_linkpath
- Not intending (at the moment) any relational capabilities among sub-keys
Interface function:
- Array< Array > read_correlator(const string& key);
Analysis code interface (wrapper):
- struct Arg { Array p_i; Array p_f; int gamma; };
- Getter: Ensemble operator[](const Arg&); or Array operator[](const Arg&);
- Here, ensemble objects have jackknife support, namely operator*(Ensemble, Ensemble)
- CVS package: adat
-
(Clover) Temporal Preconditioning
Consider the Dirac operator: det(D) = det(D_t + D_s/ξ)
Temporal preconditioning: det(D) = det(D_t) det(1 + D_t^{-1} D_s/ξ)
Strategy:
- Temporal preconditioning
- 3D even-odd preconditioning
Expectations:
- Improvement can increase with increasing anisotropy ξ
- According to Mike Peardon, typically factors of 3 improvement in CG iterations
- Improving the condition number lowers the fermionic force
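Writing ξ for the anisotropy (the Greek symbol was dropped from the slide), the temporal-preconditioning factorization is just det(AB) = det(A) det(B) with the temporal part pulled out, assuming D_t is invertible:

```latex
\det(D) = \det\!\Big(D_t + \tfrac{1}{\xi}\,D_s\Big)
        = \det\!\Big(D_t\,\big(1 + \tfrac{1}{\xi}\,D_t^{-1} D_s\big)\Big)
        = \det(D_t)\,\det\!\Big(1 + \tfrac{1}{\xi}\,D_t^{-1} D_s\Big)
```

The cheap factor det(D_t) is handled exactly, and the solver only sees the remaining operator 1 + D_t^{-1} D_s/ξ, whose spatial part is suppressed by 1/ξ; this is why the expected gain grows with the anisotropy.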
-
Multi-Threading on Multi-Core Processors
Jie Chen, Ying Chen, Balint Joo and Chip Watson
Scientific Computing Group, IT Division, Jefferson Lab
-
Motivation
Next LQCD cluster:
- What type of machine is going to be used for the cluster?
- Intel dual-core or AMD dual-core?
Software performance improvement:
- Multi-threading
-
Test Environment
Intel:
- Two dual-core Intel Xeon 5150 (Woodcrest), 2.66 GHz
- 4 GB memory (FB-DDR2, 667 MHz)
AMD:
- Two dual-core AMD Opteron 2220 SE (Socket F), 2.8 GHz
- 4 GB memory (DDR2, 667 MHz)
Software:
- 2.6.15-smp kernel (Fedora Core 5), i386 and x86_64
- Intel C/C++ compiler (9.1), gcc 4.1
-
Multi-Core Architecture
[Block diagrams: Intel Woodcrest (Xeon 5100) — two cores sharing a memory controller, ESB2 I/O, PCI Express, FB-DDR2; AMD Opteron (Socket F) — two cores with on-die DDR2 controller, PCI-E bridges, expansion HUB, and PCI-X bridge]
-
Multi-Core Architecture
Intel Woodcrest Xeon:
- L1 cache: 32 KB data, 32 KB instruction; 8-way associative
- L2 cache: 4 MB shared between the 2 cores; 16-way associative; 256-bit wide; 10.6 GB/s bandwidth to the cores
- FB-DDR2: increased latency; memory disambiguation allows loads ahead of store instructions
- Execution: pipeline length 14; 24-byte fetch width; 96 reorder buffers; max decode rate 4 + 1; max 4 FP/cycle; three 128-bit SSE units, one SSE instruction/cycle
AMD Opteron:
- L1 cache: 64 KB data, 64 KB instruction; 2-way associative
- L2 cache: 1 MB dedicated per core; 16-way associative; 128-bit wide; 6.4 GB/s bandwidth to the cores
- NUMA (DDR2): increased latency to the other socket's memory; memory affinity is important
- Execution: pipeline length 12; 16-byte fetch width; 72 reorder buffers; max decode rate 3; max 3 FP/cycle; two 128-bit SSE units, one SSE instruction = two 64-bit instructions
-
Memory System Performance
-
Memory System Performance
[Figure: memory access latency in nanoseconds]
-
Performance of Applications
[Figure: NPB-3.2 benchmarks, gcc 4.1, x86_64]
-
LQCD Application (DWF) Performance
-
Parallel Programming
[Diagram: two machines exchanging messages, with OpenMP/Pthreads inside each machine]
- Performance improvement on multi-core/SMP machines
- All threads share the address space
- Efficient inter-thread communication (no memory copies)
-
Multi-Threads Provide Higher Memory Bandwidth to a Process
-
Different Machines Provide Different Scalability for Threaded Applications
-
OpenMP
- Portable shared-memory multi-processing API
- Compiler directives and runtime library
- C/C++, Fortran 77/90
- Unix/Linux, Windows
- Intel C/C++ compiler, gcc 4.x
- Implemented on top of native threads
- Fork-join parallel programming model
[Diagram: master thread forks a team of threads, which execute in parallel and then join the master, over time]
-
OpenMP
Compiler directives (C/C++):
  #pragma omp parallel
  {
    thread_exec (); /* all threads execute the code */
  } /* all threads join master thread */
  #pragma omp critical
  #pragma omp section
  #pragma omp barrier
  #pragma omp parallel reduction(+:result)
Runtime library:
  omp_set_num_threads, omp_get_thread_num
-
Posix Threads
- IEEE POSIX 1003.1c standard (1995)
- NPTL (Native POSIX Thread Library) available on Linux since kernel 2.6.x
- Fine-grained parallel algorithms: barrier, pipeline, master-slave, reduction
- Complex; not for the general public
-
QCD Multi-Threading (QMT)
Provides simple APIs for the fork-join parallel paradigm:
  typedef void (*qmt_user_func_t)(void *arg);
  qmt_pexec (qmt_user_func_t func, void *arg);
- The user function will be executed on multiple threads
- Offers efficient mutex lock, barrier and reduction:
  qmt_sync (int tid); qmt_spin_lock (&lock);
- Performs better than OpenMP-generated code?
-
OpenMP Performance from Different Compilers (i386)
-
Synchronization Overhead for OMP and QMT on Intel Platform (i386)
-
Synchronization Overhead for OMP and QMT on AMD Platform (i386)
-
QMT Performance on Intel and AMD (x86_64 and gcc 4.1)
-
Conclusions
- Intel Woodcrest beats the AMD Opterons at this stage of the game:
  - Intel has the better dual-core micro-architecture
  - AMD has the better system architecture
- A hand-written QMT library can beat OMP compiler-generated code