Comparison of Next Generation Gaming Architectures
Presented By Dela Tsiagbe

Upload: buck-jonas-hoover

Post on 24-Dec-2015


TRANSCRIPT

Page 1: Comparison of Next Generation Gaming Architectures Presented By Dela Tsiagbe

Comparison of Next Generation Gaming Architectures

Presented By Dela Tsiagbe

Page 2

Introduction

* Brief history of gaming platforms
* Difference between consoles and personal computers
* A look at the actual architecture
* Comparison of vendors
* Summary

Page 3

History of Gaming

* Video gaming itself dates back to the 1960s and 1970s.
* Consoles such as the Magnavox Odyssey, Atari, and ColecoVision made gaming popular.
* NES (Nintendo Entertainment System)
* Storytelling

Page 4

Difference between Consoles and PCs

* In the past, a PC had far more computing power than a console.
* Consoles today demand much more computing power.
* Most of the time, a console delivers more computing power for the price: you get more for your money when you buy a gaming console than when you buy a PC at the same price.

Page 5

Difference between Consoles and PCs (continued)

Xbox 360 Stats

Custom IBM PowerPC-based CPU
* 3 symmetrical cores running at 3.2 GHz each
* 2 hardware threads per core; 6 hardware threads total
* 1 VMX-128 vector unit per core; 3 total
* 128 VMX-128 registers per hardware thread
* 1 MB L2 cache

CPU Game Math Performance
* 9 billion dot product operations per second

Custom ATI Graphics Processor
* 500 MHz
* 10 MB embedded DRAM
* 48-way parallel floating-point dynamically scheduled shader pipelines
* Unified shader architecture
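As a rough check on the "9 billion dot products per second" figure, the sketch below multiplies cores by clock rate. It assumes, beyond what the slide states, that each core's VMX-128 unit completes one dot product per cycle; the constant and function names are illustrative only.

```python
# Back-of-envelope peak estimate for the Xbox 360 CPU's dot-product rate.
# ASSUMPTION (not stated on the slide): each core's VMX-128 unit completes
# one 4-wide dot product per clock cycle.

CORES = 3                   # slide: 3 symmetrical cores
CLOCK_HZ = 3.2e9            # slide: 3.2 GHz each
THREADS_PER_CORE = 2        # slide: 2 hardware threads per core
DOT_PRODUCTS_PER_CYCLE = 1  # hypothetical per-core rate (assumption)


def hardware_threads(cores: int, threads_per_core: int) -> int:
    """Total hardware threads visible to software."""
    return cores * threads_per_core


def peak_dot_products_per_sec(cores: int, clock_hz: float, per_cycle: int) -> float:
    """Upper bound: every core retires `per_cycle` dot products every cycle."""
    return cores * clock_hz * per_cycle


print(hardware_threads(CORES, THREADS_PER_CORE))  # 6
peak = peak_dot_products_per_sec(CORES, CLOCK_HZ, DOT_PRODUCTS_PER_CYCLE)
print(f"{peak / 1e9:.1f} billion dot products/sec")  # 9.6
```

Under that assumption the peak is 9.6 billion per second, close to the slide's figure of about 9 billion; the slide does not say whether its number is a theoretical peak or a sustained rate.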

Page 6

Difference between Consoles and PCs (continued)

PS3 Stats

* PowerPC-based core @ 3.2 GHz
* 1 VMX vector unit per core
* 512 KB L2 cache
* 7 x SPE @ 3.2 GHz
* 7 x 128 SIMD GPRs, each 128 bits wide (128 registers per SPE)
* 7 x 256 KB SRAM for SPE
* 1 of 8 SPEs reserved for redundancy

Total floating point performance: 218 GFLOPS
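The sketch below estimates how much of the 218 GFLOPS total could come from the seven active SPEs. It assumes, beyond what the slide states, that each SPE completes one 4-wide single-precision fused multiply-add per cycle (2 FLOPs per lane); the helper names are hypothetical.

```python
# Rough single-precision peak for the PS3's seven active SPEs.
# ASSUMPTION (not on the slide): each SPE issues one 4-wide single-precision
# fused multiply-add per cycle, i.e. 2 FLOPs per lane per cycle.

ACTIVE_SPES = 7        # slide: 7 x SPE (1 of 8 reserved)
CLOCK_HZ = 3.2e9       # slide: SPEs run at 3.2 GHz
SIMD_LANES = 4         # 128-bit register / 32-bit float
FLOPS_PER_LANE = 2     # multiply + add from one fused instruction (assumption)
LOCAL_STORE_KB = 256   # slide: 256 KB SRAM per SPE


def spe_peak_gflops(spes: int, clock_hz: float, lanes: int, flops_per_lane: int) -> float:
    """Peak GFLOPS if every SPE retires one full-width FMA every cycle."""
    return spes * clock_hz * lanes * flops_per_lane / 1e9


def total_local_store_kb(spes: int, kb_per_spe: int) -> int:
    """Combined SPE local store across all active SPEs."""
    return spes * kb_per_spe


per_spe = spe_peak_gflops(1, CLOCK_HZ, SIMD_LANES, FLOPS_PER_LANE)
all_spes = spe_peak_gflops(ACTIVE_SPES, CLOCK_HZ, SIMD_LANES, FLOPS_PER_LANE)
print(f"{per_spe:.1f} GFLOPS per SPE, {all_spes:.1f} GFLOPS for 7 SPEs")
print(total_local_store_kb(ACTIVE_SPES, LOCAL_STORE_KB), "KB of local store")
```

Under that assumption each SPE peaks at 25.6 GFLOPS and the seven together at about 179 GFLOPS, so most of the quoted total would come from the SPEs rather than the PowerPC core; the slide itself does not break the 218 GFLOPS figure down.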

Page 7

Difference between Consoles and PCs (continued)

Things to consider:
* Although there is less memory, only a minimal OS runs in the background.
* Hardware compatibility is never a problem, because every console ships with the same hardware.
* There is very little overhead from the system itself.

Page 8

Types of Processors

* Xbox 360: Xenon
* PS3: PowerPC-based Cell

Page 9

PS3 Schematics

Page 10

Xbox 360 Schematics

Page 11

PowerPC Instruction Set

* li REG, VALUE
  – loads register REG with the number VALUE
* add REGA, REGB, REGC
  – adds REGB and REGC and stores the result in REGA
* addi REGA, REGB, VALUE
  – adds the number VALUE to REGB and stores the result in REGA
* mr REGA, REGB
  – copies the value in REGB into REGA
* or REGA, REGB, REGC
  – performs a logical "or" between REGB and REGC and stores the result in REGA
* ori REGA, REGB, VALUE
  – performs a logical "or" between REGB and VALUE and stores the result in REGA
* and, andi., xor, xori, nand, and nor
  – follow the same pattern as "or" and "ori" for the other logical operations (only and, or, and xor have immediate forms; the immediate "and" is written andi., with a trailing period, because it always updates condition register field CR0)
* ld REGA, 0(REGB)
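To make the semantics above concrete, here is a toy Python interpreter for the register-only instructions on this page. It is a teaching sketch, not a PowerPC emulator: registers are written as plain names such as r3, values are masked to 64 bits, andi. (which also updates the condition register) is left out, and the special treatment of r0 as a literal zero in addi/li is ignored.

```python
from typing import Dict, Optional

MASK64 = (1 << 64) - 1  # 64-bit general-purpose registers


def run(program: str, regs: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Execute straight-line code; registers never written read as 0."""
    regs = dict(regs or {})

    def r(name: str) -> int:
        return regs.get(name, 0)

    for line in program.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        op, _, rest = line.partition(" ")
        args = [a.strip() for a in rest.split(",")]
        if op == "li":                         # li REG, VALUE
            regs[args[0]] = int(args[1], 0) & MASK64
        elif op == "mr":                       # mr REGA, REGB
            regs[args[0]] = r(args[1])
        elif op in ("addi", "ori", "xori"):    # OP REGA, REGB, VALUE
            b, imm = r(args[1]), int(args[2], 0)
            result = {"addi": b + imm, "ori": b | imm, "xori": b ^ imm}[op]
            regs[args[0]] = result & MASK64
        elif op in ("add", "or", "and", "xor", "nand", "nor"):  # OP REGA, REGB, REGC
            b, c = r(args[1]), r(args[2])
            result = {
                "add": b + c,
                "or": b | c,
                "and": b & c,
                "xor": b ^ c,
                "nand": ~(b & c),
                "nor": ~(b | c),
            }[op]
            regs[args[0]] = result & MASK64    # wrap overflow and ~ to 64 bits
        else:
            raise ValueError(f"unsupported instruction: {line!r}")
    return regs


regs = run("""
li r3, 5
li r4, 7
add r5, r3, r4
addi r6, r5, 10
ori r7, r6, 0x1
mr r8, r7
nor r9, r3, r3
""")
print(regs["r5"], regs["r6"], regs["r7"], regs["r8"])  # 12 22 23 23
print(hex(regs["r9"]))  # nor of a register with itself is bitwise NOT: 0xfffffffffffffffa
```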

Page 12

PowerPC Instruction Set

  – ld, continued: uses the contents of REGB as the memory address of the value to load into REGA
* lbz, lhz, and lwz
  – follow the same format, but operate on bytes, halfwords, and words, respectively (the "z" indicates that they also zero out the rest of the register)
* b ADDRESS
  – jumps (branches) to the instruction at address ADDRESS
* bl ADDRESS
  – subroutine call to address ADDRESS (the return address is saved in the link register)
* cmpd REGA, REGB
  – compares the contents of REGA and REGB and sets the bits of the condition register (CR0) accordingly
* beq ADDRESS
  – branches to ADDRESS if the previously compared register contents were equal
* bne, blt, bgt, ble, and bge
  – follow the same form, but check for inequality, less than, greater than, less than or equal to, and greater than or equal to, respectively
* std REGA, 0(REGB)
  – uses the contents of REGB as the memory address at which to store the value of REGA
* stb, sth, and stw
  – follow the same format, but store bytes, halfwords, and words, respectively
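The loads, stores, compares, and branches above can be sketched the same way. The hypothetical interpreter below keeps memory in a big-endian bytearray (the PowerPC cores in both consoles run big-endian), models only the lt/gt/eq bits of CR0, and uses labels instead of numeric ADDRESS values. The example program sums four doublewords in a loop and stores the total.

```python
from typing import Dict, Tuple

MASK64 = (1 << 64) - 1


def to_signed(value: int) -> int:
    """Interpret a 64-bit register value as two's-complement signed (cmpd is signed)."""
    return value - (1 << 64) if value & (1 << 63) else value


def parse_mem(operand: str) -> Tuple[int, str]:
    """Split 'D(REG)' into (displacement, register name)."""
    disp, base = operand.rstrip(")").split("(")
    return int(disp, 0), base.strip()


def run(program: str, memory: bytearray, max_steps: int = 10_000) -> Dict[str, int]:
    lines = [ln.strip() for ln in program.strip().splitlines() if ln.strip()]
    labels = {ln[:-1]: i for i, ln in enumerate(lines) if ln.endswith(":")}
    regs: Dict[str, int] = {}
    cr = {"lt": False, "gt": False, "eq": False}  # simplified CR0 bits
    conds = {
        "beq": lambda: cr["eq"], "bne": lambda: not cr["eq"],
        "blt": lambda: cr["lt"], "bgt": lambda: cr["gt"],
        "ble": lambda: not cr["gt"], "bge": lambda: not cr["lt"],
    }
    pc = 0
    for _ in range(max_steps):
        if pc >= len(lines):
            return regs
        line = lines[pc]
        pc += 1
        if line.endswith(":"):
            continue                                   # label, not an instruction
        op, _, rest = line.partition(" ")
        args = [a.strip() for a in rest.split(",")]
        if op == "li":
            regs[args[0]] = int(args[1], 0) & MASK64
        elif op == "addi":
            regs[args[0]] = (regs.get(args[1], 0) + int(args[2], 0)) & MASK64
        elif op == "add":
            regs[args[0]] = (regs.get(args[1], 0) + regs.get(args[2], 0)) & MASK64
        elif op in ("ld", "lbz", "std", "stb"):
            disp, base = parse_mem(args[1])
            addr = regs.get(base, 0) + disp
            if op == "ld":                             # 8-byte big-endian load
                regs[args[0]] = int.from_bytes(memory[addr:addr + 8], "big")
            elif op == "lbz":                          # byte load, upper bits zeroed
                regs[args[0]] = memory[addr]
            elif op == "std":                          # 8-byte big-endian store
                memory[addr:addr + 8] = regs.get(args[0], 0).to_bytes(8, "big")
            else:                                      # stb: store low byte only
                memory[addr] = regs.get(args[0], 0) & 0xFF
        elif op == "cmpd":
            a, b = (to_signed(regs.get(x, 0)) for x in args)
            cr.update(lt=a < b, gt=a > b, eq=a == b)
        elif op == "b":
            pc = labels[args[0]]
        elif op in conds:
            if conds[op]():
                pc = labels[args[0]]
        else:
            raise ValueError(f"unsupported instruction: {line!r}")
    raise RuntimeError("step limit reached")


mem = bytearray(64)
for i, v in enumerate([10, 20, 30, 40]):
    mem[i * 8:(i + 1) * 8] = v.to_bytes(8, "big")

regs = run("""
li r3, 0
li r4, 0
li r5, 0
li r6, 4
loop:
ld r7, 0(r3)
add r5, r5, r7
addi r3, r3, 8
addi r4, r4, 1
cmpd r4, r6
blt loop
li r8, 32
std r5, 0(r8)
""", mem)
print(regs["r5"], int.from_bytes(mem[32:40], "big"))  # 100 100
```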

Page 13

CPU Specs

* Three 3.2 GHz PowerPC cores
* Shared 1 MB L2 cache, 8-way set associative
* Per-core features
  – 2-issue per cycle, in-order, decoupled vector/scalar issue queue
  – 2 symmetric fine-grain hardware threads
  – L1 caches: 32 KB 2-way instruction cache / 32 KB 4-way data cache
  – Execution pipelines: Branch Unit, Integer Unit, Load/Store Unit; VMX128 units (Floating Point Unit, Permute Unit, Simple Unit); scalar FPU
* VMX128 enhanced for game and graphics workloads
  – All execution units 4-way SIMD
  – 128 128-bit vector registers per thread
  – Custom dot-product instruction
  – Native D3D compressed data formats
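The cache figures above determine how addresses map onto cache sets. The sketch below applies the textbook formula sets = capacity / (associativity x line size), using the 128-byte line size quoted on the CPU Data Streams slide; the address-splitting helper assumes conventional power-of-two indexing and is not taken from Xenon documentation.

```python
LINE_BYTES = 128  # "128B cache line size (all caches)"


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Sets = capacity / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)


def split_address(addr: int, size_bytes: int, ways: int,
                  line_bytes: int = LINE_BYTES) -> tuple:
    """Return (tag, set_index, byte_offset) for a byte address (textbook model)."""
    sets = num_sets(size_bytes, ways, line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


print(num_sets(1 << 20, 8))    # 1024 sets in the shared 1 MB 8-way L2
print(num_sets(32 * 1024, 4))  # 64 sets in the 32 KB 4-way L1 data cache
print(num_sets(32 * 1024, 2))  # 128 sets in the 32 KB 2-way L1 instruction cache
print(split_address(0x12345, 1 << 20, 8))  # (0, 582, 69)
```

In this model, addresses 128 KB apart (1024 sets x 128 bytes) land in the same L2 set, which is the kind of conflict that can lead to the cache thrashing the data-streaming features aim to minimize.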

Page 14

CPU Data Streams

* High-bandwidth data streaming support with minimal cache thrashing
  – 128 B cache line size (all caches)
  – Flexible set locking in L2
  – Write streaming:
    – L1s are write-through; writes do not allocate in L1
    – 4 uncacheable write-gathering buffers per core
    – 8 cacheable, non-sequential write-gathering buffers per core
  – Read streaming:
    – xDCBT data prefetch around L2, directly into L1
    – 8 outstanding loads/prefetches per core
* Tight GPU data streaming integration (XPS)
  – XPS: "Xbox Procedural Synthesis"
  – GPU 128 B read from L2
  – GPU low-latency cacheable writebacks to CPU
  – GPU shares D3D compressed data formats with CPU => at least 2x effective bus bandwidth for typical graphics data
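Write gathering merges small stores to the same 128-byte line so they reach memory as one transaction. The model below is a deliberately simplified sketch: a single buffer, no timing, and stores that never straddle a line. The real Xenon has several buffers per core (4 uncacheable and 8 cacheable), and their exact behavior is not modeled here; the class name is hypothetical.

```python
from typing import Optional, Set

LINE_BYTES = 128  # cache line size from the slide


class WriteGatherBuffer:
    """One gathering buffer that merges stores hitting the same line."""

    def __init__(self, line_bytes: int = LINE_BYTES) -> None:
        self.line_bytes = line_bytes
        self.current_line: Optional[int] = None
        self.filled: Set[int] = set()  # byte addresses written in the current line
        self.transactions = 0          # memory writes issued so far

    def store(self, addr: int, size: int) -> None:
        """Record a store; assumes it does not straddle a line boundary."""
        line = addr // self.line_bytes
        if line != self.current_line:
            self.flush()               # a new line forces out the partial one
            self.current_line = line
        self.filled.update(range(addr, addr + size))
        if len(self.filled) == self.line_bytes:
            self.flush()               # full line: one write covers it

    def flush(self) -> None:
        if self.filled:
            self.transactions += 1
        self.filled = set()
        self.current_line = None


seq = WriteGatherBuffer()
for addr in range(0, 4096, 4):         # 1024 sequential 4-byte stores
    seq.store(addr, 4)
seq.flush()
print(seq.transactions)                # 32 line writes instead of 1024

scattered = WriteGatherBuffer()
for addr in (0, 640, 8):               # line 0, line 5, back to line 0
    scattered.store(addr, 4)
scattered.flush()
print(scattered.transactions)          # 3: each switch flushes a partial line
```

The scattered case suggests why several non-sequential gathering buffers help: with only one buffer, every jump between lines forces out a partially filled line.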

Page 15

GPU

* 500 MHz graphics processor
  – 48 parallel shader cores (ALUs); dynamically scheduled; 32-bit IEEE floating point
  – 24 billion shader instructions per second
  – Superscalar design: vector, scalar and texture ops per instruction
  – Pixel fill rate: 4 billion pixels/sec (8 per cycle); 2x for depth/stencil only
  – AA: 16 billion samples/sec; 2x for depth/stencil only
  – Geometry rate: 500 million triangles/sec
  – Texture rate: 8 billion bilinear filtered samples/sec
* 10 MB EDRAM, 256 GB/s fill
* Direct3D 9.0-compatible
  – High-Level Shader Language (HLSL) 3.0+ support
* Custom features
  – Memory export: particle physics, subdivision surfaces
  – Tiling acceleration: full-resolution Hi-Z, predicated primitives
  – XPS: CPU cores can be slaved to GPU processing; GPU reads geometry data directly from L2
  – Hardware scaling for display resolution matching
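Most of the GPU rates follow from the 500 MHz clock multiplied by the work done per cycle. The sketch below checks that arithmetic. Only the 8 pixels per cycle and the 2x depth/stencil factor come from the slide; the per-cycle values for shader instructions, AA samples (4 per pixel), triangles, and texture samples are assumptions back-computed from the quoted totals.

```python
CLOCK_HZ = 500e6  # 500 MHz GPU clock, from the slide

# Work completed per clock. Only "pixels" and the depth/stencil 2x factor are
# stated on the slide; the rest are assumptions derived from the quoted rates.
PER_CYCLE = {
    "shader instructions": 48,          # 48 ALUs, one instruction each (assumed)
    "pixels": 8,                        # slide: 8 per cycle
    "depth/stencil-only pixels": 8 * 2, # slide: 2x for depth/stencil only
    "AA samples": 8 * 4,                # assumed 4 samples per pixel
    "triangles": 1,                     # assumed 1 triangle per cycle
    "bilinear texture samples": 16,     # assumed 16 filtered fetches per cycle
}


def per_second(per_cycle: float, clock_hz: float = CLOCK_HZ) -> float:
    """Peak rate = work per cycle x cycles per second."""
    return per_cycle * clock_hz


for name, n in PER_CYCLE.items():
    print(f"{name}: {per_second(n) / 1e9:g} billion/sec")
```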

Page 16

GPU Block Diagram

Page 17

Software

* SMP/SMT
  – Mainstream techniques
  – Everything is simplified by being symmetric
* UMA
  – No partitioning headaches
* OS
  – All 3 cores available for game developers
* Standard APIs
  – Win32, OpenMP
  – Direct3D, HLSL
  – Assembly (CPU & shader) supported: direct hardware access
* Standard tools
  – XNA: PIX, XACT
  – Visual C++, works with multiple threads ...
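Because the three cores and six hardware threads are symmetric, a common way to use them is to split a loop into equal contiguous chunks, one per hardware thread, much like a statically scheduled OpenMP parallel loop. The Python sketch below only illustrates that partitioning; it uses no Xbox 360 API, the function names are hypothetical, and CPython's global interpreter lock means the chunks will not actually execute in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

HARDWARE_THREADS = 3 * 2  # from the slides: 3 cores x 2 hardware threads each


def chunk_bounds(n_items: int, n_workers: int) -> List[Tuple[int, int]]:
    """Split range(n_items) into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_workers)
    bounds, start = [], 0
    for i in range(n_workers):
        end = start + base + (1 if i < extra else 0)  # first `extra` chunks get one more
        bounds.append((start, end))
        start = end
    return bounds


def parallel_sum_of_squares(values: List[float], workers: int = HARDWARE_THREADS) -> float:
    """Each worker handles one chunk; partial sums are combined at the end."""
    def work(bounds: Tuple[int, int]) -> float:
        lo, hi = bounds
        return sum(v * v for v in values[lo:hi])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, chunk_bounds(len(values), workers)))


print(chunk_bounds(10, HARDWARE_THREADS))
print(parallel_sum_of_squares(list(range(1, 11))))  # 385
```

Because every hardware thread is identical, the same equal split works for any of them; no chunk needs to be sized for a different kind of core.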