CUDA Parallel Programming Tutorial

Richard Membarth
richard.membarth@cs.fau.de
Hardware-Software-Co-Design
University of Erlangen-Nuremberg
19.03.2009

Friedrich-Alexander University of Erlangen-Nuremberg

Outline
◮ Tasks for CUDA
◮ CUDA programming model
◮ Getting started
◮ Example codes

Tasks for CUDA
◮ Provide ability to run code on GPU
◮ Manage resources
◮ Partition data to fit on cores
◮ Schedule blocks to cores

Data Partitioning
◮ Partition data in smaller blocks that can be processed by one core
◮ Up to 512 threads in one block
◮ All blocks define the grid
◮ All blocks execute same program (kernel)
◮ Independent blocks
◮ Only ONE kernel at a time

Memory Hierarchy
Memory types (fastest memory first):
◮ Registers
◮ Shared memory
◮ Device memory (texture, constant, local, global)

Tesla Architecture
◮ 30 cores, 240 ALUs (1 mul-add)
◮ (1 mul-add + 1 mul): 240 * (2+1) * 1.3 GHz = 936 GFLOPS
◮ 4.0 GB GDDR3, 102 GB/s Mem BW, 4 GB/s PCIe BW to CPU

CUDA: Extended C
◮ Function qualifiers
◮ Variable qualifiers
◮ Built-in keywords
◮ Intrinsics
◮ Function calls
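
The partitioning rules on the Data Partitioning slide (up to 512 threads per block, all blocks together forming the grid) and the peak-throughput figure on the Tesla Architecture slide both come down to simple arithmetic. The following is a minimal sketch in plain C that runs without a GPU, assuming a one-dimensional data layout. The helper names `blocks_in_grid`, `global_index` and `peak_gflops` are hypothetical and not part of the tutorial or of the CUDA API. In a CUDA kernel, the same global index would typically be computed from the built-in variables as `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <assert.h>
#include <stddef.h>

/* Per-block thread limit stated on the Data Partitioning slide. */
#define MAX_THREADS_PER_BLOCK 512

/* Number of blocks needed so that each of n elements gets its own thread.
 * This is a ceiling division; the last block may be only partly used.
 * Hypothetical helper, for illustration only. */
size_t blocks_in_grid(size_t n, size_t threads_per_block)
{
    assert(threads_per_block > 0 && threads_per_block <= MAX_THREADS_PER_BLOCK);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Global element index handled by thread `thread` of block `block`,
 * assuming a 1D grid of 1D blocks. Hypothetical helper. */
size_t global_index(size_t block, size_t threads_per_block, size_t thread)
{
    assert(thread < threads_per_block);
    return block * threads_per_block + thread;
}

/* Peak throughput in GFLOPS:
 * number of ALUs * floating-point operations per ALU per cycle * clock in GHz.
 * Hypothetical helper. */
double peak_gflops(int alus, int flops_per_cycle, double clock_ghz)
{
    return alus * flops_per_cycle * clock_ghz;
}
```

For 1000 elements and 512 threads per block, `blocks_in_grid(1000, 512)` yields 2 blocks, that is 1024 threads, so the last 24 threads have no element to process; a kernel would typically skip them with a bounds check. Calling `peak_gflops(240, 3, 1.3)` reproduces the slide's figure of about 936 GFLOPS, where 3 counts a mul-add as two operations plus one additional mul.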