l1 event reconstruction in the sts i. kisel gsi / kip cbm collaboration meeting dubna, october 16,...

L1 Event Reconstruction in the STS

I. Kisel, GSI / KIP

CBM Collaboration Meeting, Dubna, October 16, 2008

16 October 2008, Dubna. Ivan Kisel, GSI

Many-core HPC

• High performance computing (HPC)
• Highest clock rate is reached
• Performance/power optimization
• Heterogeneous systems of many (>8) cores
• Similar programming languages (Ct and CUDA)
• We need a uniform approach to all CPU/GPU families

• On-line event selection
• Mathematical and computational optimization
• SIMDization of the algorithm (from scalars to vectors)
• MIMDization (multi-threads, multi-cores)
• Optimize the STS geometry (strips, sector navigation)
• Smooth magnetic field
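The step "from scalars to vectors" can be illustrated with a small sketch. It is not taken from the slides: `Vec4` is a hypothetical four-lane type standing in for a wrapper around a SIMD register, and `extrapolate` is an illustrative straight-line propagation. The point is only that arithmetic written once as a template can run on one value or on four packed values.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4-lane vector type (illustration only). The element-wise
// loops are simple enough for a compiler to map onto SIMD instructions.
struct Vec4 {
    std::array<float, 4> v;
};

inline Vec4 operator+(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

inline Vec4 operator*(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Straight-line extrapolation x(z + dz) = x + tx * dz, written once for
// any type T that provides + and *: float (scalar) or Vec4 (four tracks).
template <typename T>
T extrapolate(const T& x, const T& tx, const T& dz) {
    return x + tx * dz;
}
```

Calling `extrapolate` with `float` arguments handles one track, while calling it with `Vec4` arguments handles four tracks in the same number of source-level operations.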

• Gaming: STI Cell
• GP GPU: Nvidia Tesla
• GP CPU: Intel Larrabee
• CPU/GPU: AMD Fusion
• ??


NVIDIA GeForce GTX 280

NVIDIA GT200: GeForce GTX 280, 1024 MB.

933 GFlops single precision (240 FPUs).

Finally, double-precision support, but only ~90 GFlops (8-core Xeon: ~80 GFlops).

Currently under investigation:

• Tracking
• Linpack
• Image Processing

Sebastian Kalcher

CUDA (Compute Unified Device Architecture)


Intel Larrabee: 32 Cores

L. Seiler et al., Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 Series and the Radeon 4000 series, in three major ways:
• use the x86 instruction set with Larrabee-specific extensions;
• feature cache coherency across all its cores;
• include very little specialized graphics hardware.

The x86 processor cores in Larrabee will be different in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
• LRB's x86 cores will be based on the much simpler Pentium design;
• each core contains a 512-bit vector processing unit, able to process 16 single precision floating point numbers at a time;
• LRB includes one fixed-function graphics hardware unit;
• LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
• LRB includes explicit cache control instructions;
• each core supports 4-way simultaneous multithreading, with 4 copies of each processor register.


Intel Ct Language

Ct: Throughput Programming in C++. Tutorial. Intel.

Ct adds new data types (parallel vectors) and operators to C++
• Library-like interface, fully ANSI/ISO-compliant

Ct abstracts away architectural details
• Vector ISA width / core count / memory model / cache sizes

Ct forward-scales software written today
• The Ct platform-level API, the Virtual Intel Platform (VIP), is designed to be dynamically retargetable to SSE, SSEx, LRB, etc.

Ct is fully deterministic
• No data races

Nested data parallelism and deterministic task parallelism differentiate Ct when parallelizing irregular data and algorithms

Extend C++ for Throughput-Oriented Computing

Dot Product Using C Loops

    for (i = 0; i < n; i++) {
        dst += src1[i] * src2[i];
    }

Dot Product Using Ct

    TVEC<F64> Dst, Src1(src1, n), Src2(src2, n);
    Dst = addReduce(Src1*Src2);

1. Vector operations subsume the loop
2. Element-wise multiply
3. Reduction (a global sum)

The basic type in Ct is a TVEC
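For readers without access to Ct, the same two-step pattern (element-wise multiply followed by a global sum) can be sketched in standard C++. This is an analogue, not Ct code: `dot_loop` and `dot_reduce` are illustrative names, and `std::inner_product` stands in for the combination of `Src1*Src2` and `addReduce`.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Scalar version: the explicit loop, with the accumulator initialised.
double dot_loop(const double* src1, const double* src2, std::size_t n) {
    double dst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        dst += src1[i] * src2[i];
    }
    return dst;
}

// Whole-vector version: a single library call performs the element-wise
// multiply and the reduction, so no loop appears in user code.
double dot_reduce(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());  // both inputs must have the same length
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}
```

Both functions return the same value for the same input; the second leaves the iteration order to the library, which is the property that lets a system such as Ct choose a vectorised or multi-threaded implementation.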


Ct vs. CUDA

Matthias Bach


Multi/Many-Core Investigations

• CA: Game of Life
• L1/HLT CA Track Finder
• SIMD KF Track Fitter
• LINPACK
• MIMDization (multi-threads, multi-cores)

GSI, KIP, CERN, Intel


[Figure: Track fitter scalability. Legend: Clovertown, 8 cores; dev20; perfect scaling. x-axis: # threads (1, 2, 4, 8, 16, 24, 32); y-axis: real time per track, logarithmic scale from 1 to 10.]

SIMD KF Track Fit on Multicore Systems: Scalability

Using Intel Threading Building Blocks – linear scaling on multiple cores
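The reason a track fit can scale linearly is that each track is fitted independently of the others. The sketch below shows that structure with `std::thread` from the standard library as a stand-in for Intel Threading Building Blocks; `fit_track` is a hypothetical placeholder for the real per-track fit, and `fit_all` is an illustrative name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the per-track fit (assumption): it touches only its own
// track, so different tracks can be processed concurrently without locks.
inline void fit_track(double& track) { track = track * 2.0 + 1.0; }

// Split the tracks into contiguous chunks and fit each chunk in its own thread.
void fit_all(std::vector<double>& tracks, unsigned n_threads) {
    assert(n_threads > 0);
    const std::size_t n = tracks.size();
    const std::size_t chunk = (n + n_threads - 1) / n_threads;  // ceiling division
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no tracks left for further threads
        pool.emplace_back([&tracks, begin, end] {
            for (std::size_t i = begin; i < end; ++i) fit_track(tracks[i]);
        });
    }
    for (std::thread& th : pool) th.join();  // wait for all chunks to finish
}
```

A task-based library such as TBB would replace the manual chunking with a parallel loop and balance the load dynamically; the static split here is only the simplest form of the same idea.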

[Figure: x-axis: #threads; y-axis: real fit time/track (s).]

Håvard Bjerke


Parallelization of the L1 CA Track Finder

1. Create tracklets
2. Collect tracks

GSI, KIP, CERN, Intel, ITEP, Uni-Kiev


L1 Standalone Package for Event Selection

Igor Kulakov


KFParticle: Primary Vertex Finder

Ruben Moor

The algorithm is implemented and has passed first tests.


L1 Standalone Package for Event Selection

Igor Kulakov, Iouri Vassiliev

Efficiency

Reference set    97.1%
All set          91.9%
Extra set        81.9%
Clone            3.5%
Ghost            3.2%
Tracks/event     691

Efficiency of D+ selection: 48.9%


Magnetic Field: Smooth in the Acceptance

1. Approximate with a polynomial in the plane of each station
2. Approximate with a parabolic function between each 3 stations

We need a smooth magnetic field in the acceptance
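The second step, a parabolic function between three stations, can be sketched as the unique parabola through three samples of a field component. This is a minimal illustration under stated assumptions: the `Parabola` struct, its member names, and the use of the Lagrange form are not from the slides, and the per-station polynomial of the first step is not covered.

```cpp
#include <cassert>

// One field component b[i] sampled at three station positions z[i]
// along the beam axis (positions assumed pairwise distinct).
struct Parabola {
    double z[3];
    double b[3];

    // Evaluate the parabola through the three samples at position zq,
    // using the Lagrange interpolation form.
    double at(double zq) const {
        assert(z[0] != z[1] && z[1] != z[2] && z[0] != z[2]);
        const double l0 = (zq - z[1]) * (zq - z[2]) / ((z[0] - z[1]) * (z[0] - z[2]));
        const double l1 = (zq - z[0]) * (zq - z[2]) / ((z[1] - z[0]) * (z[1] - z[2]));
        const double l2 = (zq - z[0]) * (zq - z[1]) / ((z[2] - z[0]) * (z[2] - z[1]));
        return b[0] * l0 + b[1] * l1 + b[2] * l2;
    }
};
```

The interpolant reproduces the sampled values at the three stations and varies smoothly in between, which is the property a track fit needs when it propagates a track from one station to the next.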


CA on the STS Geometry with Overlapping Sensors

UrQMD MC, central Au+Au at 25 AGeV

Efficiency and fraction of killed tracks are OK up to ∆Z = Zhit - Zstation < ~0.2 cm

Irina Rostovtseva


Summary and Plans

• Learn Ct (Intel) and CUDA (Nvidia) programming languages
• Develop the L1 standalone package for event selection
• Parallelize the CA track finder
• Investigate large multi-core systems (CPU and GPU)
• Parallel hardware -> parallel languages -> parallel algorithms
