For the first task, we are going to be using the following concepts:

- `__global__` - this keyword is used to tell the CUDA compiler that the function is to be compiled for the GPU, and is callable from both the host and the GPU itself. For CUDA C/C++, the nvcc compiler will handle compiling this code.
- `blockIdx.x` - this is a read-only variable that is defined for you. It is used within a GPU kernel to determine the ID of the block which is currently executing code. Since there will be many blocks running in parallel, we need this ID to help determine which chunk of data that particular block will work on.
- `threadIdx.x` - this is a read-only variable that is defined for you. It is used within a GPU kernel to determine the ID of the thread which is currently executing code in the active block.
- `blockDim.x` - this is a read-only variable that is defined for you. It simply returns a value indicating the number of threads there are per block.
- `myKernel<<< number_of_blocks, threads_per_block >>>( ... )` - this is the syntax used to launch a kernel on the GPU. Inside the triple-angle brackets we set two values. The first is the total number of blocks we want to run on the GPU, and the second is the number of threads there are per block. It's possible, and in fact recommended, to schedule more blocks than the GPU can actively run in parallel. In this case, the system will just continue executing blocks until they have all run.

Remember that all the blocks scheduled to execute on the GPU are identical, except for the blockIdx.x value. A sketch putting these pieces together follows this list.
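To make the concepts above concrete, here is a minimal sketch, not part of the lab's task files, that uses `__global__`, `blockIdx.x`, `blockDim.x`, `threadIdx.x`, and the triple-angle-bracket launch syntax. The kernel name, array name, and sizes are illustrative assumptions:

```cpp
#include <cstdio>

// Hypothetical kernel: each thread doubles one element of the array.
__global__ void myKernel(int *data, int n)
{
    // blockIdx.x * blockDim.x is the offset of this block's chunk of data;
    // threadIdx.x picks out this thread's element within that chunk.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard in case more threads run than elements exist
        data[i] *= 2;
}

int main()
{
    const int n = 256;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int)); // managed memory, see Task #2
    for (int i = 0; i < n; ++i)
        data[i] = i;

    // Launch 2 blocks of 128 threads each: 2 * 128 = 256 threads, one per element.
    myKernel<<<2, 128>>>(data, n);
    cudaDeviceSynchronize(); // wait for the GPU to finish before reading results

    printf("data[10] = %d\n", data[10]); // prints 20
    cudaFree(data);
    return 0;
}
```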
Let's explore the above concepts by doing a simple "Hello Parallelism" example. There is nothing you need to do to the code to get this example to work. Before touching the code at all, select the next cell down and hit Ctrl-Enter (or the play button in the toolbar) to compile it using the nvcc compiler from NVIDIA and run it.

The nvcc compiler does the following basic steps:

1. From the .cu source file, separate the code which should be compiled for the GPU from the code which should be compiled for the CPU.
2. Compile the GPU code.
3. Give the host compiler, in our case gcc, the CPU code to compile.
4. Link the compiled code from #2 and #3 and create the executable.

If everything is working, you should see the following: `Hello from Thread 0 in block 0`
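The lab provides the "Hello Parallelism" source for you, so nothing below needs to be typed in. Purely for illustration, a program that produces output like the line above might look like this (the exact file contents are an assumption, not the lab's actual code):

```cpp
#include <cstdio>

// Every thread prints its own thread ID and its block's ID.
__global__ void hello()
{
    printf("Hello from Thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

int main()
{
    hello<<<1, 1>>>();       // 1 block of 1 thread: exactly one line of output
    cudaDeviceSynchronize(); // wait so the device-side printf is flushed before exit
    return 0;
}
```

Outside the notebook, the compilation steps listed above would all be driven by a single command such as `nvcc -o hello hello.cu`.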
Task #2 - Writing and Launching GPU Kernels

Now that you have had some experience launching a function on the GPU with different numbers of threads, it's time to write your first GPU kernel yourself. You're going to be accelerating the ever-popular SAXPY (Single-precision A times X Plus Y) function on the GPU using CUDA C/C++.
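As a reference point, SAXPY computes y[i] = a * x[i] + y[i] over two float arrays. A minimal sketch of one way to write it, using hypothetical names rather than the ones task2.cu actually uses, might be:

```cpp
#include <cstdio>

// SAXPY: y[i] = a * x[i] + y[i], computed with one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)                                     // ignore threads past the end
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Enough 256-thread blocks to cover all n elements, rounding up.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]); // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```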
Using the concepts introduced in Task #1, modify the following code to run on the GPU. The #FIXME text in the code will help you focus on the appropriate areas that need modification.

You'll probably notice two new API calls in the code below: cudaMallocManaged and cudaFree. These two functions work with managed memory using CUDA's Unified Memory system. We'll explore this in the last task of this lab. For the moment, you just need to know that they replace malloc and free, respectively.
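For context, here is a minimal sketch (not the lab's code; the variable name and size are assumptions) of how cudaMallocManaged and cudaFree stand in for malloc and free. The resulting pointer is usable from both CPU code and GPU kernels:

```cpp
#include <cstdio>

int main()
{
    const int N = 1024;
    float *x;

    // Instead of: float *x = (float *)malloc(N * sizeof(float));
    cudaMallocManaged(&x, N * sizeof(float)); // one pointer, usable on host and device

    x[0] = 3.0f; // the CPU writes managed memory directly;
                 // a kernel passed x would read the same allocation
    printf("x[0] = %f\n", x[0]);

    // Instead of: free(x);
    cudaFree(x);
    return 0;
}
```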
In the text editor below, open the task2.cu file and begin working. If you get stuck, or just want to check your answer, feel free to look at the task2_solution.cu file.
The next function we will accelerate is a basic matrix multiplication function. In this simplified example, we'll assume our matrices are all square - they have the same number of rows and columns. Your goal is to modify the matrixMulGPU function with CUDA so it will run on the GPU. However, there is a new twist! Instead of just using one dimension of threads and blocks, we'll be using two dimensions: x and y. So, in addition to using blockIdx.x, blockDim.x, and threadIdx.x, you'll also need to use blockIdx.y, blockDim.y, and threadIdx.y. In addition to using two-dimensional threads and blocks in the matrixMulGPU function, you will need to finish initializing the number_of_blocks variable in the main function to launch the appropriate number of thread blocks; a sketch of the 2D indexing pattern follows below. Please make use of the hints provided if you get stuck, and you can always check the task3_solution.cu file to see the answer. Note: do not modify the CPU version matrixMulCPU. This is used to verify the results of the GPU version.
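As a hint for the two-dimensional indexing, here is a sketch of one way such a kernel and launch could look. The element type, matrix size, and block shape below are illustrative assumptions, not necessarily what task3.cu uses; check task3_solution.cu for the lab's actual answer:

```cpp
#include <cstdio>

#define N 64

// One thread per element of the N x N result: c = a * b.
__global__ void matrixMulGPU(const int *a, const int *b, int *c)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y; // y dimension -> row
    int col = blockIdx.x * blockDim.x + threadIdx.x; // x dimension -> column

    if (row < N && col < N)
    {
        int val = 0;
        for (int k = 0; k < N; ++k)
            val += a[row * N + k] * b[k * N + col];
        c[row * N + col] = val;
    }
}

int main()
{
    int *a, *b, *c;
    size_t size = N * N * sizeof(int);
    cudaMallocManaged(&a, size);
    cudaMallocManaged(&b, size);
    cudaMallocManaged(&c, size);
    for (int i = 0; i < N * N; ++i) { a[i] = 1; b[i] = 2; }

    // dim3 gives block and grid sizes in both x and y;
    // (N + 15) / 16 rounds up so every row and column is covered.
    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((N + 15) / 16, (N + 15) / 16);
    matrixMulGPU<<<number_of_blocks, threads_per_block>>>(a, b, c);
    cudaDeviceSynchronize();

    printf("c[0] = %d\n", c[0]); // each element sums 1*2 over N terms = 128
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```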