Parallel Programming - 박낙훈 (Kyungpook National University), a KOCW online course
http://www.kocw.net/home/search/kemView.do?kemId=1322170
- Focuses on building parallel programming capability.
- Also focuses on CUDA programming capability.
- Programming skills will improve through the course projects.
Lectures by Session
Session 1. Video cuda-prelude: 1. Course Objectives 2. Lecturer 3. Textbooks 4. Course Organization 5. Course Contents 6. Examinations 7. Evaluation
Video trends: 1. FLOPS 2. Super Computer Trends 3. Personal Super-Computing 4. Energy Efficiency 5. Architectures 6. Raspberry Pi
Video intro: 1. CPU and GPU 2. Moore's Law 3. Massively Parallel Processing 4. Design Philosophy 5. CPU: latency-oriented design 6. GPU: throughput-oriented design 7. GPU evolution
Session 2. Video history: 1. 3D Graphics Pipeline 2. Hardware Development 3. Early GPGPU 4. OpenGL 5. GPU Shading Languages 6. CUDA 7. Other Parallel Architectures
Video cuda-install: 1. Before Installation 2. CUDA-capable GPU 3. CUDA Installation Steps 4. Installation Results 5. After Installation 6. Sample Execution 7. CUDA Compilation 8. hello.cu (see the hello.cu sketch after this list)
Video memcpy: 1. Simple CUDA Model - host, device 2. CUDA Memory Copy Scenario 3. CPU Memory Copy Functions 4. CUDA Memory Copy Functions 5. Host-Device Memory Copy Example
Session 3. Video errorCheck: 1. Host-Device Memory Copy Example 2. CUDA Function Rules 3. cudaError_t 4. cudaGetErrorName, cudaGetErrorString 5. CUDA Error Check Macros 6. Advanced CUDA Error Check Macros (see the memory-copy/error-check sketch after this list)
Video CPU-kernel: 1. CUDA Programming Model 2. CUDA Function Declarations 3. Vector Addition Example 4. CPU Implementation - for loop 5. Kernel Concept - function as loop body 6. CPU Kernel Implementation
Video CUDA-kernel: 1. CUDA Programming Model 2. CUDA Function Declarations 3. Vector Addition Example 4. CUDA Implementation - multiple thread launch 5. Kernel Launch - example source code (see the vector-addition sketch after this list)
Session 4. Video kernel-launch: 1. Process and Thread 2. CUDA Programming Model 3. Kernel Launch 4. Pre-defined Variables 5. CUDA Architecture
Video matrix-add: 1. CUDA Thread Hierarchy 2. Matrix Addition Example 3. Host Implementation - double for-loop - kernel version 4. CUDA Implementation - 2D kernel
Video matrix-mult: 1. Matrix Multiplication Example 2. Host Implementation - triple for-loop - kernel version 3. CUDA Implementation - 2D kernel - the kernel has a for-loop!
Session 5. Video thread-and-gpu: 1. CUDA Hardware - G80 architecture - Pascal architecture 2. Transparent Scalability 3. Thread and Warp 4. CUDA Thread Scheduling - SM warp scheduling 5. Overall Execution Model
Video tiled-matrix-mult: 1. Review of the Problem 2. Thread Layout - single block 3. Thread Layout - multiple blocks 4. Index Values - local and global 5. Kernel Function
Video device-query: 1. Review of the Kernel Function 2. Detailed Source Code 3. Block Size Considerations - what is the best block size?
Session 6. Video time-funcs: 1. Getting the Current Time 2. The sleep() Function 3. The QueryPerformanceCounter() Function 4. C++ Chrono Features
Video memHier1: 1. von Neumann Architecture - instruction cycles 2. History of Parallelism 3. CUDA Memory Hierarchy - registers - shared memory - global memory - constant memory - local memory
Video memHier2: 1. CUDA Memory Hierarchy - registers - shared memory - global memory - constant memory - local memory 2. Race Conditions 3. Barrier Synchronization - the __syncthreads() function 4. Example: Adjacent Difference
Session 7. Video adjDiff1: 1. Example: Adjacent Difference 2. Host Version 3. CUDA Global Memory Version 4. CUDA Shared Memory Version (see the adjacent-difference sketch after this list)
Video adjDiff2: 1. Example: Adjacent Difference 2. Shared Memory Use Strategy - CUDA shared memory version - with dynamic memory allocation - with overuse of __syncthreads() 3. Pointers in the Kernel
Video sharedMatMult1: 1. Example: Matrix Multiplication 2. Tile Load to Shared Memory 3. Index Equations 4. Tiled Matrix Multiplication Kernel (see the tiled matrix multiplication sketch after this list)
Session 8. Video sharedMatMult2: 1. Example: Matrix Multiplication 2. Tile Load to Shared Memory 3. Tiled Matrix Multiplication Kernel 4. Real Source Code 5. Comparison to the Naive Kernel
Video float: 1. IEEE 754 Floating Point Standard 2. Normalized Representation 3. Exponent Representation 4. Flush to Zero 5. Denormalization 6. Rounding and Error
Video dram: 1. Dynamic RAM (random access memory) 2. DRAM Organization 3. Dual Channel Architecture 4. DRAM Bursting 5. Memory Access Patterns
Session 9. Video memCoalescing: 1. Memory Alignment 2. Memory Access Examples 3. Stride Considerations 4. Array of Structures 5. Structure of Arrays
Video bankConflicts: 1. 2D and 3D Array Cases 2. Shared Memory Handling - banked memory 3. Constant Memory Handling
Video occupancy: 1. Thread Scheduling 2. Occupancy 3. Resource Limits 4. Occupancy Calculator 5. Kernel Launch Overhead
Session 10. Video control: 1. A Streaming Multiprocessor Has Only One Control Logic 2. Control Flow 3. Divergence: if-statements 4. Divergence: for-loop iterations 5. Example: Shuffling Cases
Video atomic1: 1. Race Conditions 2. Atomics 3. The Atomic CAS Operation: compare and swap 4. Example: Number Counting - non-atomic version - atomic version - shared-memory atomic version
Video atomic2: 1. Atomics 2. Histogram Example - host version - device version - shared memory version (see the histogram sketch after this list)
Session 11. Video reduction1: 1. The Reduction Problem 2. Sequential Reduction Algorithms 3. Parallel Reduction Algorithms 4. A Sum Reduction Example
Video reduction2: 1. The Reduction Problem 2. A Sum Reduction Example 3. Host Version 4. An Enhanced Sum Reduction Example
Video reduction3: 1. The Reduction Problem Again 2. Parallel Reduction 3. Sequential Addition 4. Atomic Addition 5. Shared-Memory Atomic Addition 6. Interleaved Addressing - Tournament Addition
Session 12. Video reduction4: 1. The Reduction Problem Again 2. Parallel Reduction 3. Sequential Addressing - Tournament Addition 4. First Add during Load 5. Unrolling the Last Warp 6. Complete Unrolling (see the sum-reduction sketch after this list)
Video streaming: 1. Streaming Operation - definition - host version - device version 2. CUDA Event API - elapsed time check again
Video asyncCopy: 1. Synchronous Streaming Case 2. Asynchronous Streaming Case 3. CUDA Streams - asynchronous version - multiple streams 4. Explicit Synchronization (see the streams sketch after this list)
Session 13. Video sorting: 1. Sorting Algorithms 2. Sorting Networks - comparators 3. Parallel Sorting Algorithms 4. Sorting Implementations - C++ STL version - bubble sort with shared memory
Video bitonicSort: 1. Bitonic Sort 2. The Bitonic Sort Network 3. Implementations - host version - recursive version - iterative version - shared memory version
Video search: 1. Search in an Unsorted List - Sequential Search 2. Search in a Sorted List - Binary Search - Binary Search with early cutoff
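
Code sketches (not from the lectures)
The short CUDA sketches below are my own minimal reconstructions of a few topics named in the lecture list above. Variable names, sizes, and launch configurations are assumptions, not the actual course materials.

For the hello.cu item of the cuda-install lecture, a minimal first CUDA program might look like this (assuming device-side printf and compilation with nvcc):

```cuda
// hello.cu -- a minimal first CUDA program (an assumed equivalent of the
// lecture's hello.cu; compile with: nvcc hello.cu -o hello)
#include <cstdio>

__global__ void helloKernel(void) {
    // printf is supported inside kernels on compute capability 2.0+
    printf("hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main(void) {
    helloKernel<<<2, 4>>>();     // launch 2 blocks of 4 threads
    cudaDeviceSynchronize();     // wait for the kernel (and its printf) to finish
    return 0;
}
```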
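
For the memcpy and errorCheck lectures, a sketch of a host-device copy round trip wrapped in an error-check macro. The macro name CUDA_CHECK and the buffer sizes are my own choices; the lecture's macros may be written differently.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simple error-check macro: print file/line plus the CUDA error name and
// description, then abort.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s (%s)\n", __FILE__, __LINE__,       \
                    cudaGetErrorName(err), cudaGetErrorString(err));      \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main(void) {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float *h_src = (float *)malloc(bytes);   // host buffers
    float *h_dst = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_src[i] = (float)i;

    float *d_buf = nullptr;                  // device buffer
    CUDA_CHECK(cudaMalloc(&d_buf, bytes));

    // host -> device, then device -> host round trip
    CUDA_CHECK(cudaMemcpy(d_buf, h_src, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost));

    printf("h_dst[10] = %f\n", h_dst[10]);   // expect 10.0
    CUDA_CHECK(cudaFree(d_buf));
    free(h_src); free(h_dst);
    return 0;
}
```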
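
For the CPU-kernel / CUDA-kernel vector addition example, a sketch of the usual one-thread-per-element kernel: the kernel body is the CPU for-loop body, with the loop index replaced by the global thread index. The block size of 256 is an assumption.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];              // guard against extra threads
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up so every element is covered
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("h_c[0] = %f\n", h_c[0]);            // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```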
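
For the adjDiff lectures, a sketch of a shared-memory adjacent-difference kernel using __syncthreads(). The block size and the boundary handling at block edges are my own choices and may differ from the lecture version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Adjacent difference: out[i] = in[i] - in[i-1] (out[0] = in[0]).
// Each block stages its tile in shared memory once, so most threads read
// their left neighbor from shared memory instead of global memory.
__global__ void adjDiffShared(const int *in, int *out, int n) {
    __shared__ int tile[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                         // barrier: tile fully loaded

    if (i < n) {
        if (threadIdx.x > 0)
            out[i] = tile[threadIdx.x] - tile[threadIdx.x - 1];
        else if (i > 0)
            out[i] = tile[0] - in[i - 1];    // left neighbor lives in the previous block
        else
            out[0] = in[0];
    }
}

int main(void) {
    const int n = 1024;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i * i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    adjDiffShared<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("out[10] = %d\n", h_out[10]);     // expect 100 - 81 = 19
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```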
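
For the sharedMatMult lectures, a sketch of the standard tiled matrix multiplication kernel. The tile size of 16 and the assumption that n is a multiple of the tile size are mine, to keep the sketch short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16

// Tiled matrix multiplication C = A * B for square n x n matrices.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // each thread loads one element of the A tile and one of the B tile
        sA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                    // both tiles are complete

        for (int k = 0; k < TILE; ++k)
            sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();                    // done with this pair of tiles
    }
    C[row * n + col] = sum;
}

int main(void) {
    const int n = 256;                      // multiple of TILE
    size_t bytes = n * n * sizeof(float);
    float *h_A = (float *)malloc(bytes), *h_B = (float *)malloc(bytes), *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matMulTiled<<<grid, block>>>(d_A, d_B, d_C, n);
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %f (expected %f)\n", h_C[0], 2.0f * n);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```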
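
For the atomic2 histogram example, a sketch of the shared-memory version: each block accumulates into fast shared-memory bins with atomicAdd and then merges its partial histogram into the global one. The bin count and launch configuration are assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_BINS 256

__global__ void histShared(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int local[NUM_BINS];

    // cooperatively clear the shared bins
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // grid-stride loop over the input, counting into shared memory
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // merge this block's bins into the global histogram
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}

int main(void) {
    const int n = 1 << 20;
    unsigned char *h_data = (unsigned char *)malloc(n);
    for (int i = 0; i < n; ++i) h_data[i] = (unsigned char)(rand() % NUM_BINS);

    unsigned char *d_data; unsigned int *d_bins;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histShared<<<64, 256>>>(d_data, n, d_bins);

    unsigned int h_bins[NUM_BINS];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    printf("bin[0] = %u\n", h_bins[0]);
    cudaFree(d_data); cudaFree(d_bins);
    free(h_data);
    return 0;
}
```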
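
For the reduction lectures, a sketch of a sequential-addressing shared-memory sum reduction. Combining the per-block partial sums with a single atomicAdd is just one of the options the lectures list; the block size is my assumption.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK 256

// Each block reduces BLOCK elements in shared memory; thread 0 then adds
// the block's partial sum to the global result.
__global__ void sumReduce(const float *in, float *result, int n) {
    __shared__ float sdata[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // tournament-style tree: stride halves each step, active threads stay packed
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);        // per-block partial sum -> global total
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_result;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));

    sumReduce<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_result, n);

    float h_result = 0.0f;
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f (expected %d)\n", h_result, n);
    cudaFree(d_in); cudaFree(d_result);
    free(h_in);
    return 0;
}
```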
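
For the streaming / asyncCopy lectures, a sketch that splits one array across several CUDA streams with cudaMemcpyAsync and measures elapsed time with CUDA events. The chunk count and sizes are assumptions; pinned host memory (cudaMallocHost) is required for the copies to actually run asynchronously.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 22, chunks = 4, chunk = n / chunks;
    size_t bytes = n * sizeof(float);

    float *h_x, *d_x;
    cudaMallocHost((void **)&h_x, bytes);   // pinned host memory for real async copies
    cudaMalloc(&d_x, bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    cudaStream_t streams[chunks];
    for (int s = 0; s < chunks; ++s) cudaStreamCreate(&streams[s]);

    cudaEvent_t start, stop;                // CUDA event API for elapsed time
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);

    // split the work into chunks so copy-in, kernel, and copy-out of
    // different chunks can overlap across streams
    for (int s = 0; s < chunks; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_x + off, h_x + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_x + off, chunk);
        cudaMemcpyAsync(h_x + off, d_x + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // with default nvcc settings the legacy default stream waits for the
    // other streams, so this event marks the end of all queued work
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);             // explicit synchronization point
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("h_x[0] = %f, elapsed = %.3f ms\n", h_x[0], ms);

    for (int s = 0; s < chunks; ++s) cudaStreamDestroy(streams[s]);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFreeHost(h_x); cudaFree(d_x);
    return 0;
}
```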