L1 Event Reconstruction in the STS
I. Kisel, GSI / KIP
CBM Collaboration Meeting, Dubna, October 16, 2008
16 October 2008, Dubna    Ivan Kisel, GSI    2/15
Many-core HPC
• High performance computing (HPC)
• Highest clock rate is reached
• Performance/power optimization
• Heterogeneous systems of many (>8) cores
• Similar programming languages (Ct and CUDA)
• We need a uniform approach to all CPU/GPU families

• On-line event selection
• Mathematical and computational optimization
• SIMDization of the algorithm (from scalars to vectors)
• MIMDization (multi-threads, multi-cores)
• Optimize the STS geometry (strips, sector navigation)
• Smooth magnetic field
• Gaming: STI Cell
• GP GPU: Nvidia Tesla
• GP CPU: Intel Larrabee
• CPU/GPU: AMD Fusion
• ?: ?
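The step "from scalars to vectors" can be illustrated with a minimal sketch: the arithmetic is written once as a template and then instantiated either with a plain float or with a small pack type that processes several tracks at a time. The `Vec4` type and the `extrapolate` function below are hypothetical names chosen for illustration, not part of the L1 code; a real build would presumably map the pack onto SSE registers rather than a plain array.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4-wide pack of floats standing in for one SIMD register.
struct Vec4 {
    std::array<float, 4> v;
};

// Element-wise addition of two packs.
inline Vec4 operator+(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Element-wise multiplication of two packs.
inline Vec4 operator*(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Straight-line extrapolation x' = x + tx * dz, written once for any type T.
// With T = float it handles one track, with T = Vec4 four tracks at once.
template <typename T>
T extrapolate(const T& x, const T& tx, const T& dz) {
    return x + tx * dz;
}
```

Calling `extrapolate(1.f, 0.5f, 2.f)` gives the scalar result, while passing three `Vec4` values applies the same formula to four tracks in one call; the algorithm itself does not change.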
NVIDIA GeForce GTX 280
NVIDIA GT200, GeForce GTX 280, 1024 MB.
933 GFlops single precision (240 FPUs).
Double precision is finally supported, but only at ~90 GFlops (an 8-core Xeon reaches ~80 GFlops).
Currently under investigation:
Tracking
Linpack
Image Processing
Sebastian Kalcher
CUDA (Compute Unified Device Architecture)
Intel Larrabee: 32 Cores
L. Seiler et al., Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 Series and the Radeon 4000 series, in three major ways:
• use the x86 instruction set with Larrabee-specific extensions;
• feature cache coherency across all its cores;
• include very little specialized graphics hardware.
The x86 processor cores in Larrabee will be different in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
• LRB's x86 cores will be based on the much simpler Pentium design;
• each core contains a 512-bit vector processing unit, able to process 16 single precision floating point numbers at a time;
• LRB includes one fixed-function graphics hardware unit;
• LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
• LRB includes explicit cache control instructions;
• each core supports 4-way simultaneous multithreading, with 4 copies of each processor register.
Intel Ct Language
Ct: Throughput Programming in C++. Tutorial. Intel.
• Ct adds new data types (parallel vectors) & operators to C++: library-like interface, fully ANSI/ISO-compliant
• Ct abstracts away architectural details: vector ISA width / core count / memory model / cache sizes
• Ct forward-scales software written today: the Ct platform-level API, Virtual Intel Platform (VIP), is designed to be dynamically retargetable to SSE, SSEx, LRB, etc.
• Ct is fully deterministic: no data races
• Nested data parallelism and deterministic task parallelism differentiate Ct on parallelizing irregular data and algorithms
Extend C++ for Throughput-Oriented Computing
Dot Product Using C Loops
    for (i = 0; i < n; i++) {
        dst += src1[i] * src2[i];
    }
Dot Product Using Ct
    TVEC<F64> Dst, Src1(src1, n), Src2(src2, n);
    Dst = addReduce(Src1*Src2);
(1) Vector operations subsume the loop
(2) Element-wise multiply: Src1*Src2
(3) Reduction (a global sum): addReduce
The basic type in Ct is a TVEC
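For readers without access to Ct, the same two-step pattern (element-wise multiply, then a global sum) can be sketched in standard C++ with `std::valarray`. This is only an analogy to the TVEC example: it says nothing about how Ct is implemented, and the function name `dot` is chosen here for illustration.

```cpp
#include <cstddef>
#include <valarray>

// Dot product without an explicit loop, mirroring the two Ct steps.
double dot(const double* src1, const double* src2, std::size_t n) {
    if (n == 0) return 0.0;                      // sum() of an empty valarray is undefined
    std::valarray<double> a(src1, n), b(src2, n);  // wrap the raw arrays
    std::valarray<double> prod = a * b;            // element-wise multiply
    return prod.sum();                             // reduction (a global sum)
}
```

As in the Ct version, the loop is subsumed by whole-array operations, which leaves the library free to choose how the work is vectorized.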
Ct vs. CUDA
Matthias Bach
Multi/Many-Core Investigations
• CA: Game of Life
• L1/HLT CA Track Finder
• SIMD KF Track Fitter
• LINPACK
• MIMDization (multi-threads, multi-cores)
GSI, KIP, CERN, Intel
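The Game of Life is the simplest cellular automaton (CA) in this list of test cases: every cell is updated from its eight neighbours only, so all cells can be processed independently. A minimal serial sketch of one update step follows; the `Grid` alias and the `lifeStep` name are hypothetical, and cells outside the grid are assumed dead.

```cpp
#include <vector>

using Grid = std::vector<std::vector<int>>;  // 1 = alive, 0 = dead

// One synchronous update of the whole grid; cells outside the grid count as dead.
Grid lifeStep(const Grid& g) {
    const int rows = static_cast<int>(g.size());
    const int cols = rows ? static_cast<int>(g[0].size()) : 0;
    Grid next(rows, std::vector<int>(cols, 0));
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            int n = 0;  // live neighbours among the 8 adjacent cells
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    if (dr == 0 && dc == 0) continue;
                    const int rr = r + dr, cc = c + dc;
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) n += g[rr][cc];
                }
            }
            // Birth with exactly 3 neighbours, survival with 2 or 3.
            next[r][c] = (n == 3 || (n == 2 && g[r][c] == 1)) ? 1 : 0;
        }
    }
    return next;
}
```

Because `next` depends only on the previous grid, the two outer loops are natural candidates for SIMD and multi-threaded execution.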
SIMD KF Track Fit on Multicore Systems: Scalability
Using Intel Threading Building Blocks: linear scaling on multiple cores
[Figure: Track fitter scalability. Real fit time per track (s) versus number of threads (1, 2, 4, 8, 16, 24, 32), logarithmic time axis from 1 to 10. Curves: Clovertown (8 cores), dev20, perfect scaling.]
Håvard Bjerke
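The "perfect scaling" reference in such a plot is the single-thread time divided by the number of threads. A minimal sketch of that reference line, and of the parallel efficiency of a measured point relative to it, is given below; the function names are hypothetical and no measured values from the plot are used.

```cpp
#include <cassert>

// Time per track expected with ideal (linear) scaling on nThreads threads.
double idealTimePerTrack(double timeOneThread, int nThreads) {
    assert(nThreads > 0);
    return timeOneThread / nThreads;
}

// Parallel efficiency: 1.0 means the measured point lies on the perfect-scaling line.
double parallelEfficiency(double timeOneThread, double timeMeasured, int nThreads) {
    assert(nThreads > 0 && timeMeasured > 0.0);
    return idealTimePerTrack(timeOneThread, nThreads) / timeMeasured;
}
```

For example, a fit that takes 8 time units on one thread and 2 on eight threads would have an efficiency of 0.5 under this definition.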
Parallelization of the L1 CA Track Finder
1. Create tracklets
2. Collect tracks
GSI, KIP, CERN, Intel, ITEP, Uni-Kiev
L1 Standalone Package for Event Selection
Igor Kulakov
KFParticle: Primary Vertex Finder
Ruben Moor
The algorithm is implemented and has passed first tests.
L1 Standalone Package for Event Selection
Igor Kulakov, Iouri Vassiliev
Track category      Efficiency / rate
Reference set       97.1%
All set             91.9%
Extra set           81.9%
Clone               3.5%
Ghost               3.2%
Tracks/event        691

Efficiency of D+ selection: 48.9%
Magnetic Field: Smooth in the Acceptance
1. Approximate with a polynomial in the plane of each station
2. Approximate with a parabolic function between each 3 stations
We need a smooth magnetic field in the acceptance
CA on the STS Geometry with Overlapping Sensors
UrQMD MC, central Au+Au, 25 AGeV
Efficiency and fraction of killed tracks are OK up to ∆Z = Zhit - Zstation < ~0.2 cm
Irina Rostovtseva
Summary and Plans
• Learn Ct (Intel) and CUDA (Nvidia) programming languages
• Develop the L1 standalone package for event selection
• Parallelize the CA track finder
• Investigate large multi-core systems (CPU and GPU)
• Parallel hardware -> parallel languages -> parallel algorithms