cd parallel-distributed
git pull
cd 21mnist
make

A few lines have been changed in mnist_util.h and mnist.cc so that the log file can be parsed by the submit tool.
Please change

int D = (argc > i ? atoll(argv[i]) : 1); i++;

into

// int D = (argc > i ? atoll(argv[i]) : 1); i++;

as it is inconsistent with the command line arguments given in the section.

7. Shared memory
./cuda_sched_rec ${N} 1 100 1000 ${S} > cs_N_1_S.dat
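The command above can be swept over several S values with a small shell loop. This is only a sketch: the value of N and the set of S values below are assumptions, so adjust them to the experiment you actually run. The loop echoes the commands first so you can sanity-check them; remove the echo (and the quotes around the redirection) to actually run them.

```shell
# Dry-run sweep over shared-memory sizes S (N and the S values are assumptions).
N=1000
for S in 1 2 4 8 16 32; do
  echo "./cuda_sched_rec ${N} 1 100 1000 ${S} > cs_${N}_1_${S}.dat"
done
```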
Both are to be submitted through ITC-LMS.
I will provide a dataset and a baseline serial program (C++). You parallelize/optimize it and analyze its performance.
NEW Jan 5 2023
I finally made it available at
https://github.com/taura/parallel-distributed/
Read
21mnist/README.md for detailed instruction.
My big apologies for the huge delay.
I will deliver a video lecture explaining more details later,
but please start looking at it
to decide what to work on for your final report,
and start working on it without waiting for the video.
I have made three videos. They are all available in the usual channel. The link is in ITC-LMS.
Details are to be announced, but FYI, in the last
incarnation of this lecture (2020), it was
a neural network for image recognition
(VGG,
published at ICLR 2015)
on the Cifar-10
dataset. The baseline code was a C++ equivalent of
the VGG code distributed with Chainer.
You optimize the baseline either on CPU, GPU or both.
You are allowed to make a team of up to two.
A team of two must work on both CPU and GPU.
The most time-consuming operations are matrix multiplications
and convolutions.
This year, too, it is likely to be a neural network.
Details are to be announced.
It is indeed a neural network (simpler and easier to work with than the last time).
A neural network is a great example for applying what you have learned (or will learn) in this class to make a program faster (multicore, SIMD, instruction-level parallelism, GPU, etc.)
My user name on taulec is uNNNNN. I put my code at ~/option-N/...
ssh uNNNNN@taulec.zapto.org
mkdir option-1
cp -r wherever_it_is/parallel-distributed option-1/

Otherwise, copy your working directory with scp.
I've heard from several of you that your vectorized (SIMDized) code receives a segmentation fault. There are so many ways a program can go wrong that I cannot tell you here why yours fails, but the following problem seems common among you, so I want to draw your attention to it.
typedef real realv __attribute__((aligned(64)));

and defined the following auxiliary function to access 16 consecutive elements starting from a given element, like
realv& V(idx_t i0, idx_t i1=0, idx_t i2=0, idx_t i3=0) { return *((realv*)&w[i0][i1][i2][i3]); }
With this definition, the compiler uses the vmovaps instruction when dereferencing a realv pointer (i.e., realv*). The problem is that vmovaps requires the address to be aligned to 64 bytes (a multiple of 64); accessing a non-aligned address causes the segmentation fault.
In order to confirm whether this is indeed what is happening to your code, run the program under GDB until it gets the segmentation fault, and issue the disassemble command of GDB to find the exact instruction that causes it ("disassemble" is the real command name, which can be abbreviated to "disas").
If it is indeed a vmovaps instruction, e.g., vmovaps (%rax),%zmm1, then check the contents of the %rax register with the print $rax command of GDB (abbreviated to p $rax). If it is not a multiple of 64, then this is indeed what is happening.
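For reference, such a GDB session might look like the following; the program name, addresses, and register value below are made up for illustration only.

```
$ gdb ./mnist
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
(gdb) disas
...
=> 0x0000000000401234 <+52>:  vmovaps (%rax),%zmm1
...
(gdb) p $rax
$1 = 4198444
(gdb) p $rax % 64
$2 = 44
```

Since 4198444 is not a multiple of 64 (the remainder is 44), the faulting vmovaps was given an unaligned address.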
This is what was (only lightly) mentioned on p. 32/57 of this slide deck.
typedef real realv __attribute__((vector_size(64), __may_alias__, aligned(sizeof(float))));
By defining realv this way, the compiler assumes it is aligned only to sizeof(float), which is four, and thus uses vmovups. So I thought ... However, it turned out that clang++ still insists on using vmovaps, while g++ uses vmovups as we wished. I don't know whether clang++'s behavior is legitimate or not. It may be a compiler glitch.
Given this situation, one resort is to use g++ instead of clang++.
Another way is to define the auxiliary function this way.
realv* V(idx_t i0, idx_t i1=0, idx_t i2=0, idx_t i3=0) { return ((realv*)&w[i0][i1][i2][i3]); }

That is, it returns a pointer to the given location. The client code using this function then has to dereference the returned pointer, like this

v += ... * (*x.V(i0,i1,i2,i3));

or

*y.V(i0,i1,i2,i3) = ...

I confirmed that with this, clang++ uses vmovups. Again, I don't know if there is anything in the spec or the clang++ documentation that explains the difference. Also, I have not confirmed that this does not have any adverse effect on performance.
A separate but related issue. Since the original loop trip count is often not a multiple of 16, you must deal with remainder iterations.
The above V function loads/stores a full-length vector (16 elements), which you want to avoid for the remainder. I recommend using a predicated instruction; you will have to use an intrinsic function anyway. It is left as an exercise.
Just to be sure: I do not want to receive reports from somebody who has (almost) never been engaged in this class at all. I believe it practically never happens that someone who has never participated in the class finishes all the assignments and the final report at the last minute. But just in case you are reading this page for the first time in February, looking for a class in which attendance was not part of the evaluation and which allows late submissions, so that you might still have a chance to get a credit: this class is not for you.
update: 21 Jan, 2023. I received a question about which environment you are supposed to do your work in. I've also heard from several of you who seem to be trying to do it on your laptops (some Macs and Windows). Let me explain what I expected and what the rules are.
Well, at least my initial assumption was that it went without saying that you would be working on the server environment (taulec/tauleg000, which I collectively call taulec below) we have been using for the class. This is what I expected and anticipated would happen.
Then, through exchanges with several of you, I came to realize that many of you are actually trying to do it on your personal machines (I don't actually know how many, as those questions are basically about how to make the baseline code compile/run on your laptops, an issue you never encounter if you are doing it on taulec).
So, let me make the rules and recommendations clear on this occasion (sorry that I didn't make them clear earlier, partly because, as I said above, I just anticipated everybody would do it on taulec).
git fetch upstream
git merge upstream/master

in your repository; see the supplementary help for working on the IST cluster for more details.
$ ssh username@login000.cluster.i.u-tokyo.ac.jp

(Replace the username part with your user name on the IST cluster, which you should have received as the comment to your second assignment on ITC-LMS, "2. Tell me when your IST cluster account is ready". It is different from the one on the Jupyter environment. Confusing, sorry!)
omp_parallel_master.c: In function ...
omp_parallel_master.c:xx:xx: error: expected `#pragma omp' clause before `master'
 #pragma omp parallel master
                      ^~~~~~

Please fix the line containing

#pragma omp parallel master

into two lines:

#pragma omp parallel
#pragma omp master

This error happened because the compiler does not support the combined "parallel master" construct, which was introduced only in OpenMP 5.0.
static int coo_elem_cmp(const void * a_, const void * b_) {
  coo_elem_t * a = (coo_elem_t *)a_;
  coo_elem_t * b = (coo_elem_t *)b_;
  if (a->i < b->i) return -1;
  if (a->i > b->i) return 1;
  if (a->j < b->j) return -1;
  if (a->j > b->j) return 1;
  if (a->a < b->a) return -1;
  if (a->a > b->a) return 1;
  return 0;
}

Before the fix, the last comparison was this orz:

if (a->a > b->a) return -1; // it should be 1

For obvious reasons, I cannot predict how much this affected your experiments. The safe bet is to redo your experiments m(_ _)m
Both are to be submitted through ITC-LMS.
The abstract can be just text (detailed instructions will be announced later), a paragraph or two describing your plan for the final report.
The final report must be a logical, consistent, and sufficiently self-contained document, in a PDF file. The topic of the final report can be chosen from the following.
I will provide a dataset and a baseline serial program (C++) of a neural network for image recognition (VGG, published at ICLR 2015) on the Cifar-10 dataset. The baseline code is a C++ translation of the VGG code distributed with Chainer. You optimize the baseline on CPU, GPU, or both. You are allowed to form a team of up to two; a team of two must work on both CPU and GPU. The most time-consuming operations are matrix multiplications.
I will provide a tool to submit records and display them in real time through a page similar to this page, which I made last year. You are asked to keep it updated with your records, so that we can collectively know how fast they are on each machine.
I will give the toolkit (a baseline code, dataset, submission tool and more detailed spec of the problem). I will try to get them ready by the next lecture (Dec 26th).