cd parallel-distributed
git pull
cd 21mnist
make

A few lines have been changed in mnist_util.h and mnist.cc so that the log file can be parsed by the submit tool.
Please change

int D = (argc > i ? atoll(argv[i]) : 1); i++;

into

// int D = (argc > i ? atoll(argv[i]) : 1); i++;

as it is inconsistent with the command line arguments given in the section.

7. Shared memory
./cuda_sched_rec ${N} 1 100 1000 ${S} > cs_N_1_S.dat
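The command above can be swept over several S values with a small shell loop. This is only a sketch: the value of N and the set of S values below are assumptions, so adjust them to the experiment you actually run. The loop echoes the commands first so you can sanity-check them; remove the echo (and the quotes around the redirection) to actually run them.

```shell
# Dry-run sweep over shared-memory sizes S (N and the S values are assumptions).
N=1000
for S in 1 2 4 8 16 32; do
  echo "./cuda_sched_rec ${N} 1 100 1000 ${S} > cs_${N}_1_${S}.dat"
done
```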
Both are to be submitted through ITC-LMS.
I will provide a dataset and a baseline serial program (C++). You parallelize/optimize it and analyze its performance.
NEW Jan 5 2023
I finally made it available at
https://github.com/taura/parallel-distributed/
Read
21mnist/README.md for detailed instruction.
My big apologies for the huge delay.
I will deliver a video lecture explaining more details later,
but please start looking at it
to decide what to work on for your final report,
and start working on it without waiting for the video.
I have made three videos. They are all available in the usual channel. The link is in ITC-LMS.
Details are to be announced, but FYI, in the last
incarnation of this lecture (2020), it was
a neural network for image recognition
(VGG,
published at ICLR 2015)
on the Cifar-10
dataset. The baseline code was a C++ equivalent of
the VGG code distributed with Chainer.
You optimize the baseline either on CPU, GPU or both.
You are allowed to make a team of up to two.
A team of two must work on both CPU and GPU.
The most time-consuming operations are matrix multiplications
and convolutions.
This year, too, it is likely to be a neural network.
Details are to be announced.
It is indeed a neural network (simpler and easier to work with than the last time).
A neural network is a great example for applying what you have learned (or will learn) in this class to make a program faster (multicore, SIMD, instruction-level parallelism, GPU, etc.)
My user name on taulec is uNNNNN. I put my code at ~/option-N/...
ssh uNNNNN@taulec.zapto.org
mkdir option-1
cp -r wherever_it_is/parallel-distributed option-1/

Otherwise, copy your working directory with scp.
I've heard from several of you that your vectorized (SIMDized) code receives a segmentation fault. There are so many ways a program can go wrong that I cannot tell you here why yours fails, but the following problem seems common among you, so I want to draw your attention to it.
typedef real realv __attribute__((aligned(64)));

and defined the following auxiliary function to access 16 consecutive elements starting from a given element, like
realv& V(idx_t i0, idx_t i1=0, idx_t i2=0, idx_t i3=0) { return *((realv*)&w[i0][i1][i2][i3]); }
With this definition, the compiler uses the vmovaps instruction when dereferencing a realv pointer (i.e., realv*). The problem is that vmovaps requires the address to be aligned to 64 bytes (a multiple of 64); accessing a non-aligned address causes the segmentation fault.
In order to confirm whether this is indeed what is happening to your code, run the program under GDB until it gets the segmentation fault, and issue the disassemble command of GDB to find the exact instruction that causes it ("disassemble" is the real command name, which can be abbreviated to "disas").
If it is indeed a vmovaps instruction, e.g., vmovaps (%rax),%zmm1, then check the contents of the %rax register with the print $rax command of GDB (abbreviated to p $rax). If it is not a multiple of 64, then this is indeed what is happening.
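For reference, such a GDB session might look like the following; the program name, addresses, and register value below are made up for illustration only.

```
$ gdb ./mnist
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
(gdb) disas
...
=> 0x0000000000401234 <+52>:  vmovaps (%rax),%zmm1
...
(gdb) p $rax
$1 = 4198444
(gdb) p $rax % 64
$2 = 44
```

Since 4198444 is not a multiple of 64 (the remainder is 44), the faulting vmovaps was given an unaligned address.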
This is what was (only lightly) mentioned on p. 32/57 of this slide deck.
typedef real realv __attribute__((vector_size(64), __may_alias__, aligned(sizeof(float))));
By defining realv this way, the compiler assumes it is aligned only to sizeof(float), which is four, and thus uses vmovups. So I thought ... However, it turned out that clang++ still insists on using vmovaps, while g++ uses vmovups as we wished. I don't know whether clang++'s behavior is legitimate or not. It may be a compiler glitch.
Given this situation, one resort is to use g++ instead of clang++.
Another way is to define the auxiliary function this way.
realv* V(idx_t i0, idx_t i1=0, idx_t i2=0, idx_t i3=0) { return ((realv*)&w[i0][i1][i2][i3]); }

That is, it returns a pointer to the given location. The client code using this function then has to dereference the returned pointer, like this

v += ... * (*x.V(i0,i1,i2,i3));

or

*y.V(i0,i1,i2,i3) = ...

I confirmed that with this, clang++ uses vmovups. Again, I don't know if there is anything in the spec or the clang++ documentation that explains the difference. Also, I have not confirmed that this does not have any adverse effect on performance.
A separate but related issue. Since the original loop trip count is often not a multiple of 16, you must deal with remainder iterations.
The above V function loads/stores a full-length vector (16 elements), which you want to avoid for the remainder. I recommend using a predicated instruction; you will have to use an intrinsic function anyway. It is left as an exercise.
Just to be sure: I do not want to receive reports from somebody who has (almost) never been engaged in this class at all. I believe it practically never happens that someone who has never participated in the class finishes all the assignments and the final report at the last minute. But just in case you are reading this page for the first time in February, looking for a class in which attendance was not part of the evaluation and which allows late submissions, so that you might still have a chance to get a credit: this class is not for you.
update: 21 Jan, 2023. I received a question about which environment you are supposed to do your work in. I've also heard from several of you who seem to be trying to do it on your laptops (some Macs and Windows). Let me explain what I expected and what the rules are.
Well, at least my initial assumption was that it went without saying that you would be working on the server environment (taulec/tauleg000, which I collectively call taulec below) we have been using for the class. This is what I expected and anticipated would happen.
Then, through exchanges with several of you, I came to realize that many of you are actually trying to do it on your personal machines (I don't actually know how many, as those questions are basically about how to make the baseline code compile/run on your laptops, an issue you never encounter if you are doing it on taulec).
So, let me make the rules and recommendations clear on this occasion (sorry that I didn't make them clear earlier, partly because, as I said above, I just anticipated everybody would do it on taulec).
git fetch upstream
git merge upstream/master

in your repository; see the supplementary help for working on the IST cluster for more details.
$ ssh username@login000.cluster.i.u-tokyo.ac.jp

(Replace the username part with your user name on the IST cluster, which you should have received as the comment to your second assignment on ITC-LMS, "2. Tell me when your IST cluster account is ready". It is different from the one on the Jupyter environment. Confusing, sorry!)
omp_parallel_master.c: In function ...
omp_parallel_master.c:xx:xx: error: expected `#pragma omp' clause before `master'
 #pragma omp parallel master
                      ^~~~~~

Please fix the line containing

#pragma omp parallel master

into two lines:

#pragma omp parallel
#pragma omp master

This error happened because the compiler does not support the combined "parallel master" construct, which was introduced only in OpenMP 5.0.
static int coo_elem_cmp(const void * a_, const void * b_) {
  coo_elem_t * a = (coo_elem_t *)a_;
  coo_elem_t * b = (coo_elem_t *)b_;
  if (a->i < b->i) return -1;
  if (a->i > b->i) return 1;
  if (a->j < b->j) return -1;
  if (a->j > b->j) return 1;
  if (a->a < b->a) return -1;
  if (a->a > b->a) return 1;
  return 0;
}

Before the fix, the last comparison was this orz:

if (a->a > b->a) return -1; // it should be 1

For obvious reasons, I cannot predict how much this affected your experiments. The safe bet is to redo your experiments m(_ _)m
Both are to be submitted through ITC-LMS.
The abstract can be just text (detailed instructions will be announced later), a paragraph or two describing your plan for the final report.
The final report must be a logical, consistent, and sufficiently self-contained document, in a PDF file. The topic of the final report can be chosen from the following.
I will provide a dataset and a baseline serial program (C++) of a neural network for image recognition (VGG, published at ICLR 2015) on the Cifar-10 dataset. The baseline code is a C++ translation of the VGG code distributed with Chainer. You optimize the baseline on CPU, GPU, or both. You are allowed to form a team of up to two; a team of two must work on both CPU and GPU. The most time-consuming operations are matrix multiplications.
I will provide a tool to submit records and display them in real time through a page similar to this page, which I made last year. You are asked to keep it updated with your records, so that we can collectively know how fast they are on each machine.
I will give the toolkit (a baseline code, dataset, submission tool and more detailed spec of the problem). I will try to get them ready by the next lecture (Dec 26th).