OpenMP Programming Tutorial and Hands-on¶
Enter your name and student ID.
- Name:
- Student ID:
2. Compilers¶
- We use NVIDIA HPC SDK ver. 24.9 (nvc and nvc++) and LLVM ver. 18.1.8 (clang and clang++) as C/C++ compilers, as they support OpenMP GPU offloading
2-1. Set up NVIDIA CUDA and HPC SDK¶
Execute this before you use NVIDIA HPC SDK
export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin:$PATH
export PATH=/usr/local/cuda/bin:$PATH
- Check if it works (check that the full paths of nvc/nvc++ are shown)
  - make sure which nvc and which nvc++ show /opt/nvidia/... and which nvcc shows /usr/local/...
which nvc
which nvc++
which nvcc
2-2. Set up LLVM¶
Execute this before you use LLVM
export PATH=/home/share/llvm/bin:$PATH
export LD_LIBRARY_PATH=/home/share/llvm/lib:/home/share/llvm/lib/x86_64-unknown-linux-gnu:$LD_LIBRARY_PATH
Check if it works (check if full paths of clang/clang++ are shown)
which clang
which clang++
3. Compiling and running OpenMP programs¶
- Summary
  - clang / clang++ : give the -fopenmp option
  - nvc / nvc++ : give the -mp option
  - Set the OMP_NUM_THREADS environment variable when running the executable
%%writefile omp_hello.c
#include <stdio.h>
int main() {
printf("hello\n");
#pragma omp parallel
printf("world\n");
printf("good bye\n");
return 0;
}
- Compiling with clang
  - Add the -fopenmp option to compile OpenMP programs
  - Other generally useful options
    - -Wall warns about much suspicious code
    - -O3 maximally optimizes code for performance
clang -fopenmp omp_hello.c -o omp_hello_clang
- Compiling with nvc
  - Add the -mp option to compile OpenMP programs
  - Other generally useful options
    - -Wall warns about much suspicious code
    - -O4 maximally optimizes code for performance
nvc -mp omp_hello.c -o omp_hello_nvc
Running
Set the environment variable OMP_NUM_THREADS to the number of threads created by #pragma omp parallel
OMP_NUM_THREADS=3 ./omp_hello_clang
OMP_NUM_THREADS=3 ./omp_hello_nvc
Problem 1 : Change the number of threads¶
- Execute them with various numbers of threads and see what happens
Answer omitted, as the exercise is trivial
OMP_NUM_THREADS=3 ./omp_hello_clang
OMP_NUM_THREADS=3 ./omp_hello_nvc
4. #pragma omp parallel¶
- #pragma omp parallel creates a team of threads, each of which executes the statement below
- Note that only the statement that is right below the pragma is executed by the team of threads
- Of course, the statement can be a compound statement and/or include a function call, so each thread can actually execute an arbitrary number of statements
- See Determining the Number of Threads for a parallel Region for more details on the number of threads created by #pragma omp parallel
Problem 2 : Executing multiple statements by threads¶
- Change the following program so that both "world" and "good bye" are printed as many times as the number of threads
%%writefile omp_hello.c
#include <stdio.h>
int main() {
printf("hello\n");
#pragma omp parallel
printf("world\n");
printf("good bye\n");
return 0;
}
Example answer:
%%writefile omp_hello_ans.c
#include <stdio.h>
int main() {
printf("hello\n");
#pragma omp parallel
{
printf("world\n");
printf("good bye\n");
}
return 0;
}
clang -fopenmp omp_hello_ans.c -o omp_hello_ans
# nvc -mp omp_hello_ans.c -o omp_hello_ans
OMP_NUM_THREADS=3 ./omp_hello_ans
- Below, choose clang or nvc depending on your taste by commenting out the other one
- Below, I chose clang by commenting out nvc
clang -fopenmp omp_hello.c -o omp_hello
# nvc -mp omp_hello.c -o omp_hello
OMP_NUM_THREADS=3 ./omp_hello
5. omp_get_num_threads() and omp_get_thread_num()¶
When threads are executing a statement with #pragma omp parallel,
- they are said to be in a parallel region
- they are called a team of threads
While a thread is executing a parallel region,
- omp_get_num_threads() returns the number of threads in the team
- omp_get_thread_num() returns the unique id of the calling thread within the team (0, 1, ..., the number of threads in the team - 1)
You need #include <omp.h> to use these functions, or any OpenMP API functions for that matter
Problem 3 : Using omp_get_num_threads() and omp_get_thread_num()¶
- Change the following program so that each thread prints its id within the team and the number of threads in the team, like this. The exact order of lines may differ. Strictly speaking, even characters from two lines can be interleaved within a single line.
hello
0/5 world
4/5 world
1/5 world
3/5 world
2/5 world
good bye
%%writefile omp_hello_id.c
#include <stdio.h>
int main() {
printf("hello\n");
#pragma omp parallel
{
printf("world\n");
printf("good bye\n");
}
return 0;
}
clang -fopenmp omp_hello_id.c -o omp_hello_id
# nvc -mp omp_hello_id.c -o omp_hello_id
OMP_NUM_THREADS=3 ./omp_hello_id
Example answer:
%%writefile omp_hello_id_ans.c
#include <stdio.h>
#include <omp.h>
int main() {
printf("hello\n");
#pragma omp parallel
{
int omp_nthreads = omp_get_num_threads();
int omp_rank = omp_get_thread_num();
printf("world %d/%d\n", omp_rank, omp_nthreads);
}
printf("good bye\n");
return 0;
}
clang -fopenmp omp_hello_id_ans.c -o omp_hello_id_ans
# nvc -mp omp_hello_id_ans.c -o omp_hello_id_ans
OMP_NUM_THREADS=3 ./omp_hello_id_ans
6. #pragma omp for¶
- #pragma omp parallel merely creates a team of threads executing the same statement
- In this sense, #pragma omp parallel alone cannot make a program run faster with multiple cores
- A program can be made faster only when you divide the work among threads (work-sharing)
- #pragma omp for lets you divide iterations of a loop among the threads created by #pragma omp parallel
Problem 4 : How does #pragma omp for divide iterations among threads?¶
- Execute the following cell and observe which iteration is executed by which thread
- Based on the observation, change the number of iterations and threads and predict the mapping between iterations and threads
%%writefile omp_for.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
#pragma omp parallel
{
printf("I am thread %d in a team of %d threads\n",
omp_get_thread_num(), omp_get_num_threads());
#pragma omp for
for (int i = 0; i < 24; i++) {
usleep(100 * 1000 * i);
printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
fflush(stdout);
}
}
return 0;
}
clang -fopenmp omp_for.c -o omp_for
# nvc -mp omp_for.c -o omp_for
OMP_NUM_THREADS=4 ./omp_for
Example answer:
Observation: the default scheduling policy appears to be static; when executing with $P$ threads, the first $1/P$ of all iterations are executed by thread 0, the next $1/P$ by thread 1, and so on.
6-1. for loops allowed by #pragma omp for¶
- There is a severe syntax restriction on the kind of for loops #pragma omp for can apply to
- See Canonical Loop Form for the spec
- In short, the loop should look like for (var = init; var < limit; var += inc) where init, limit, and inc are all loop-invariant (they do not change throughout the loop); see the sketch below
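To make the restriction concrete, here is a minimal sketch (not part of the original material; the file name omp_canonical.c is just illustrative). The first loop is in canonical form and can be work-shared; the second loop, kept as a comment, is not, because its limit changes inside the body.
%%writefile omp_canonical.c
#include <stdio.h>
#include <omp.h>
int main() {
  int n = 10;
  /* canonical form: init (0), limit (n), and increment (2) are loop-invariant */
  #pragma omp parallel for
  for (int i = 0; i < n; i += 2) {
    printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
  }
  /* NOT in canonical form: the limit n is modified in the body,
     so #pragma omp for could not be applied to this loop
  for (int i = 0; i < n; i++) {
    if (i == 5) n = 7;
  }
  */
  return 0;
}
clang -fopenmp omp_canonical.c -o omp_canonical
# nvc -mp omp_canonical.c -o omp_canonical
OMP_NUM_THREADS=3 ./omp_canonical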
6-2. Combined pragma (parallel + for)¶
- #pragma omp parallel and #pragma omp for are often used together
- If #pragma omp for immediately follows #pragma omp parallel, they can be combined into the single pragma #pragma omp parallel for
%%writefile omp_parallel_for.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
double t0 = omp_get_wtime();
#pragma omp parallel for
for (int i = 0; i < 24; i++) {
usleep(100 * 1000 * i); /* sleep 100 x i milliseconds */
printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
fflush(stdout);
}
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_parallel_for.c -o omp_parallel_for
# nvc -mp omp_parallel_for.c -o omp_parallel_for
OMP_NUM_THREADS=4 ./omp_parallel_for
7. Scheduling a work-sharing for loop¶
- As you witnessed, the default scheduling policy in our environment (it may be implementation dependent) seems to be static scheduling (assign roughly the same number of contiguous iterations to each thread)
- Is that enough? Clearly, it does not do a good job when iterations take different amounts of time
- You can change the policy with the schedule clause, as sketched below
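As a minimal sketch of the syntax (not from the original material; the file name omp_schedule_demo.c is illustrative), a schedule clause is written directly on the work-sharing pragma; the optional number after the comma is the chunk size.
%%writefile omp_schedule_demo.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
  /* dynamic scheduling with a chunk size of 2: an idle thread grabs
     the next two unassigned iterations whenever it finishes its current ones */
  #pragma omp parallel for schedule(dynamic, 2)
  for (int i = 0; i < 24; i++) {
    usleep(100 * 1000 * i); /* iterations take different amounts of time */
    printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    fflush(stdout);
  }
  return 0;
}
clang -fopenmp omp_schedule_demo.c -o omp_schedule_demo
# nvc -mp omp_schedule_demo.c -o omp_schedule_demo
OMP_NUM_THREADS=4 ./omp_schedule_demo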
7-1. Visualizing scheduling¶
- The program below executes the function iter_fun in a parallel for loop:
#pragma omp parallel for
for (long i = 0; i < L; i++) {
iter_fun(a, b, i, M, N, R, T);
}
- iter_fun(a, b, i, M, N, R, T) repeats x = a x + b many (M * N) times and records the current time every N iterations
%%writefile omp_sched_rec.c
#include <err.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <omp.h>
long cur_time_ns() {
struct timespec ts[1];
if (clock_gettime(CLOCK_REALTIME, ts) == -1) err(1, "clock_gettime");
return ts->tv_sec * 1000000000L + ts->tv_nsec;
}
typedef struct {
double x;
int thread[2];
int cpu[2];
} record_t;
/* the function for an iteration:
   perform x = a x + b (M * N) times,
   record the current time every N iterations to T,
   and record the thread and cpu to R. */
void iter_fun(double a, double b, long i, long M, long N,
record_t * R, long * T) {
// initial value (not important)
double x = i;
// record in T[i * M] ... T[(i+1) * M - 1]
T = &T[i * M];
// record starting thread/cpu
R[i].thread[0] = omp_get_thread_num();
R[i].cpu[0] = sched_getcpu();
// repeat a x + b many times.
// record time every N iterations
for (long j = 0; j < M; j++) {
T[j] = cur_time_ns();
for (long k = 0; k < N; k++) {
x = a * x + b;
}
}
// record ending thread/cpu (the thread must equal thread[0])
R[i].thread[1] = omp_get_thread_num();
R[i].cpu[1] = sched_getcpu();
// record result, just so that the computation is not
// eliminated by the compiler
R[i].x = x;
}
void dump(record_t * R, long * T, long L, long M, long t0) {
long k = 0;
for (long i = 0; i < L; i++) {
printf("i=%ld x=%f thread0=%d cpu0=%d thread1=%d cpu1=%d",
i, R[i].x, R[i].thread[0], R[i].cpu[0], R[i].thread[1], R[i].cpu[1]);
for (long j = 0; j < M; j++) {
printf(" %ld", T[k] - t0);
k++;
}
printf("\n");
}
}
int main(int argc, char ** argv) {
int idx = 1;
long L = (idx < argc ? atol(argv[idx]) : 100); idx++;
long M = (idx < argc ? atol(argv[idx]) : 100); idx++;
long N = (idx < argc ? atol(argv[idx]) : 100); idx++;
double a = (idx < argc ? atof(argv[idx]) : 0.99); idx++;
double b = (idx < argc ? atof(argv[idx]) : 1.00); idx++;
record_t * R = (record_t *)calloc(L, sizeof(record_t));
long * T = (long *)calloc(L * M, sizeof(long));
long t0 = cur_time_ns();
#pragma omp parallel for
for (long i = 0; i < L; i++) {
iter_fun(a, b, i, M, N, R, T);
}
long t1 = cur_time_ns();
printf("%ld nsec\n", t1 - t0);
dump(R, T, L, M, t0);
return 0;
}
clang -fopenmp -D_GNU_SOURCE omp_sched_rec.c -o omp_sched_rec
# nvc -mp omp_sched_rec.c -o omp_sched_rec
OMP_NUM_THREADS=4 ./omp_sched_rec > a.dat
- Execute the following cell to visualize it
- In the graph,
  - the horizontal axis is the time from the start in nanoseconds
  - the vertical axis is the iteration number
  - the color represents the thread that executed the iteration
import sched_vis
sched_vis.sched_plt(["a.dat"])
# sched_vis.sched_plt(["a.dat"], start_t=1.5e7, end_t=2.0e7)
Problem 5 : Understanding scheduling by visualization¶
- Add a schedule clause to the program (schedule(runtime) allows you to set the schedule on the command line); a sketch follows this list
- Change the number of threads and the schedule and observe how iterations are executed
- Set the number of threads very large (higher than the physical number of cores) and see what happens
  - Hint : you can get the number of cores with the nproc command
- In the above program, each iteration performs exactly the same amount of computation (i.e., x = a x + b (M * N) times) and thus takes almost exactly the same time
- See what happens if this is not the case
  - Specifically, make iteration i repeat x = a x + b (M * (i * N)) times (i.e., change the inner loop in iter_fun to for (long k = 0; k < i * N; k++) { ... })
- The sched_plt function below takes optional parameters start_t and end_t specifying the horizontal range to display
- If you zoom very closely into a particular point, you can see individual points and the intervals between them, from which you can deduce how long it takes to perform x = a x + b once
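A minimal sketch of the two modifications suggested above, assuming you edit omp_sched_rec.c shown earlier (these fragments are illustrative, not a complete program):
/* in main(): let the scheduling policy be chosen at run time */
#pragma omp parallel for schedule(runtime)
for (long i = 0; i < L; i++) {
  iter_fun(a, b, i, M, N, R, T);
}
/* in iter_fun(): make the work grow with i, so iterations are non-uniform */
for (long j = 0; j < M; j++) {
  T[j] = cur_time_ns();
  for (long k = 0; k < i * N; k++) { /* was: k < N */
    x = a * x + b;
  }
}
With schedule(runtime), the policy can then be selected per run, e.g. OMP_SCHEDULE="dynamic,1" OMP_NUM_THREADS=4 ./omp_sched_rec > a.dat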
Problem 6 : Specifying the scheduling policy by schedule clause¶
- In the following (artificial) loop, iteration i sleeps for roughly (100 x i) milliseconds, and this is almost exactly the time the iteration takes
- Predict the execution time of the parallel for loop with the default (static) scheduling policy
- Vary the scheduling policy and reason about the resulting execution times
%%writefile omp_schedule.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
double t0 = omp_get_wtime();
/* ----- add schedule clause below ----- */
#pragma omp parallel for
for (int i = 0; i < 12; i++) {
usleep(100 * 1000 * i); /* sleep 100 x i milliseconds */
printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
fflush(stdout);
}
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_schedule.c -o omp_schedule
# nvc -mp omp_schedule.c -o omp_schedule
OMP_NUM_THREADS=4 ./omp_schedule
Predict the execution time with the default (static) policy
Predict the execution time with the dynamic policy
Compare them with what you observed
Explain your reasoning below
YOUR ANSWER HERE
Example answer:
In this loop, later iterations take longer. Therefore,
in the static policy, the execution time of the entire parallel loop will be the execution time of thread 3, which is assigned iterations 9-11; the execution time will therefore be ((9 + 10 + 11) * 100) milliseconds = 3.0 seconds
in the dynamic policy, a crude (and optimistic) estimation, which assumes a perfect load balancing, is simply the serial execution time divided by the number of threads, which is ((0 + 1 + ... + 11) * 100) / 4 milliseconds = 1.65 seconds.
This crude approximation isn't quite accurate in this example, however. To be more precise in this particular case, the likely scenario is the following (each digit represents the last digit of an iteration number).
[0] 444488888888
[1] 155555999999999
[2] 226666660000000000
[3] 333777777711111111111
In this case, the execution time will be (3 + 7 + 11) * 100 milliseconds = 2.1 seconds
8. Collapse clause¶
- #pragma omp for can specify a collapse clause to apply work-sharing to a limited class of nested loops
- With #pragma omp for collapse(2), OpenMP considers the doubly-nested loop that comes after this pragma the subject of work-sharing (i.e., it distributes iterations of the doubly-nested loop to threads); you must have a perfectly-nested, rectangular, doubly-nested loop after this pragma
- A perfectly-nested loop is a nested loop whose outer loops (all loops except for the innermost one) do not have any statement other than the inner loop, e.g.
for (i = 0; i < 100; i++) {
for (j = 0; j < 100; j++) {
S(i,j);
}
}
is perfectly nested whereas
for (i = 0; i < 100; i++) {
S;
for (j = 0; j < 100; j++) {
T;
}
}
is not.
- A perfectly-nested loop is conceptually equivalent to a flat loop, via a mechanical transformation.
for (ij = 0; ij < 100 * 100; ij++) {
i = ij / 100;
j = ij % 100;
S(i,j);
}
- A rectangular loop is a nested loop in which the iteration counts of inner loops never depend on outer loop variables. For example,
for (i = 0; i < 100; i++) {
for (j = 0; j < i; j++) {
S(i,j);
}
}
is not a rectangular loop.
- Generally speaking, OpenMP #pragma omp parallel + #pragma omp for cannot handle nested parallelism very well, but the collapse clause alleviates the problem to some extent
- Consider using tasks (below) for more general forms of nested parallelism
Problem 7 : Apply collapse and schedule¶
- Apply collapse and schedule to the following loop
- Trick: write schedule(runtime) and you can change the scheduling policy at execution time by setting the environment variable OMP_SCHEDULE= on the command line (see the usage sketch below and the OMP_SCHEDULE environment variable for details)
- Reason about the execution time of various schedule policies, with and without collapse
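A minimal usage sketch (assuming the executable omp_collapse built from the cell below, with schedule(runtime) applied): the policy and an optional chunk size are chosen per run via OMP_SCHEDULE.
OMP_SCHEDULE="static" OMP_NUM_THREADS=3 ./omp_collapse
OMP_SCHEDULE="dynamic,1" OMP_NUM_THREADS=3 ./omp_collapse
OMP_SCHEDULE="guided" OMP_NUM_THREADS=3 ./omp_collapse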
%%writefile omp_collapse.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
double t0 = omp_get_wtime();
/* apply collapse and schedule */
#pragma omp parallel for
for (int i = 0; i < 5; i++) {
for (int j = 0; j < 5; j++) {
usleep(100 * 1000 * (i + j));
printf("iteration (%d, %d) executed by thread %d\n", i, j, omp_get_thread_num());
fflush(stdout);
}
}
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_collapse.c -o omp_collapse
# nvc -mp omp_collapse.c -o omp_collapse
OMP_NUM_THREADS=3 ./omp_collapse
Example answer:
%%writefile omp_collapse_ans.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
double t0 = omp_get_wtime();
/* apply collapse and schedule */
#pragma omp parallel for collapse(2) schedule(runtime)
for (int i = 0; i < 5; i++) {
for (int j = 0; j < 5; j++) {
usleep(100 * 1000 * (i + j));
printf("iteration (%d, %d) executed by thread %d\n", i, j, omp_get_thread_num());
fflush(stdout);
}
}
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_collapse_ans.c -o omp_collapse_ans
# nvc -mp omp_collapse_ans.c -o omp_collapse_ans
OMP_NUM_THREADS=3 ./omp_collapse_ans
9. Task parallelism¶
A task is a more general mechanism to extract parallelism and distribute computation (called a task) dynamically to threads in a team created by #pragma omp parallel
A thread can create a task at any point in the execution of a parallel region, and tasks are dispatched to available threads at runtime
As a thread can create a task at any point, a task can create another task. That is, parallelism can be arbitrarily nested and the number of tasks can be difficult to predict (unlike the number of iterations of a for loop)
A common pattern:
- enter a parallel region with #pragma omp parallel
- ensure the statement is executed by only a single (root) thread with #pragma omp master
- create tasks at any point with #pragma omp task
- a task waits for the tasks it created to finish with #pragma omp taskwait
Let's see the effect of #pragma omp master first (without creating any task)
%%writefile omp_master.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
double t0 = omp_get_wtime();
#pragma omp parallel
{
#pragma omp master
printf("inside the master pragma: I am thread %d of a team of %d threads\n",
omp_get_thread_num(), omp_get_num_threads());
printf("out of the master pragma: I am thread %d of a team of %d threads\n",
omp_get_thread_num(), omp_get_num_threads());
}
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_master.c -o omp_master
# nvc -mp omp_master.c -o omp_master
OMP_NUM_THREADS=3 ./omp_master
- Since this is a common idiom, the two can be combined into one pragma (#pragma omp parallel master)
  - This feature is not supported by the NVIDIA compiler, however
- The program below creates a parallel region that is executed in its entirety only by the master thread; it does not serve any useful purpose and is merely a demonstration of the feature
%%writefile omp_parallel_master.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
double t0 = omp_get_wtime();
#pragma omp parallel master
printf("I am thread %d of a team of %d threads\n",
omp_get_thread_num(), omp_get_num_threads());
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_parallel_master.c -o omp_parallel_master
# NVIDIA compiler does not support this program
# nvc -mp omp_parallel_master.c -o omp_parallel_master
OMP_NUM_THREADS=3 ./omp_parallel_master
- Let's create a few tasks now
%%writefile omp_task.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
double t0 = omp_get_wtime();
#pragma omp parallel
#pragma omp master
{
printf("I am thread %d of a team of %d threads\n",
omp_get_thread_num(), omp_get_num_threads());
#pragma omp task
{
printf("task A executed by %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
usleep(500 * 1000);
}
#pragma omp task
{
printf("task B executed by %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
usleep(1000 * 1000);
}
#pragma omp taskwait
printf("two tasks done, executed by %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_task.c -o omp_task
# nvc -mp omp_task.c -o omp_task
OMP_NUM_THREADS=3 ./omp_task
- Tasks are particularly good at parallel recursions, as the following program demonstrates
- This is a common pattern that appears in many algorithms, particularly divide-and-conquer algorithms
%%writefile omp_rec_task.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
void recursive_tasks(int n, int tid) {
printf("task %d by %d of %d\n",
tid, omp_get_thread_num(), omp_get_num_threads());
fflush(stdout);
if (n == 0) {
usleep(300 * 1000);
} else {
#pragma omp task
recursive_tasks(n - 1, 2 * tid + 1);
#pragma omp task
recursive_tasks(n - 1, 2 * tid + 2);
#pragma omp taskwait
}
}
int main() {
double t0 = omp_get_wtime();
#pragma omp parallel
#pragma omp master
{
recursive_tasks(5, 0);
}
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_rec_task.c -o omp_rec_task
# nvc -mp omp_rec_task.c -o omp_rec_task
OMP_NUM_THREADS=10 ./omp_rec_task
Problem 8 : A quiz about recursive tasks¶
- Answer the following questions
  - How many tasks are created by recursive_tasks(n, 0)? Include the caller of recursive_tasks(n, 0) as a task, i.e., consider that recursive_tasks(0, 0) creates one task
  - How many of them are leaf tasks?
  - Express them in terms of $n$
YOUR ANSWER HERE
Example answer:
Let $T(n)$ be the number of tasks created (i.e., the number of times #pragma omp task is encountered) during the execution of recursive_tasks(n, 0). Then,
- $T(0) = 0$
- $T(n) = 2 + 2T(n - 1)$
Therefore, $T(n) = 2^{n+1} - 2$. Counting the caller of recursive_tasks(n, 0) as a task, the answer is $2^{n+1} - 1$.
Let $L(n)$ be the number of leaf tasks created by recursive_tasks(n, 0). Then,
- $L(0) = 1$
- $L(n) = 2 L(n - 1)$
Therefore, the number of leaf tasks $= L(n) = 2^n$
- Approximately what is the ideal execution time of recursive_tasks(5, 0) when using 10 threads?
- Compare it with what you observed
YOUR ANSWER HERE
Example answer:
Since a leaf task takes 300 milliseconds, the time spent in non-leaf tasks is negligible. The $2^5 = 32$ leaf tasks will be distributed among 10 threads, so assuming good load balancing, two of the ten threads will execute four leaf tasks and the other eight threads three leaf tasks each. The execution time will therefore be about $300 \times 4$ milliseconds = 1.2 seconds.
- Name an algorithm or two for which recursive tasks would be useful for parallelization, and explain why you think so
YOUR ANSWER HERE
Example answer:
Any algorithm that is naturally expressed with recursive calls that can run in parallel will benefit from task parallelism. To name a few,
- quicksort
- mergesort
- fast Fourier transform
- matrix matrix multiplication (formulated with divide-and-conquer for locality)
- sparse matrix-vector multiplication (formulated with divide-and-conquer for locality)
- $N$-body simulation
- KD-tree construction
10. Taskloop¶
- As you can easily imagine, tasks can handle general nested loops, given that they can handle recursion
- Recent OpenMP actually has a construct just for that: #pragma omp taskloop
  - This feature is not supported by the NVIDIA compiler
- Here is a demonstration showing it can handle non-perfectly-nested loops
%%writefile omp_taskloop.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
double t0 = omp_get_wtime();
#pragma omp parallel
#pragma omp master
#pragma omp taskloop
for (int i = 0; i < 5; i++) {
printf("i = %d starts\n", i);
fflush(stdout);
#pragma omp taskloop
for (int j = 0; j < 5; j++) {
usleep(100 * 1000 * (i + j));
printf("iteration (%d, %d) executed by thread %d\n", i, j, omp_get_thread_num());
fflush(stdout);
}
}
double t1 = omp_get_wtime();
printf("%f sec\n", t1 - t0);
return 0;
}
clang -fopenmp omp_taskloop.c -o omp_taskloop
# nvc -mp omp_taskloop.c -o omp_taskloop
OMP_NUM_THREADS=3 ./omp_taskloop
NOTE:
- Implementing tasks requires a more general mechanism than the work-sharing for construct; the former must be able to distribute tasks generated in the course of execution, whereas the latter merely needs to distribute iterations that are easily identifiable at the point of entering #pragma omp for, thanks to the "canonical form" restriction
- Task scheduling is always dynamic, whereas work-sharing for (particularly with static scheduling) gives you more control and predictability over which thread executes which iteration
- This is one reason why two seemingly redundant mechanisms exist, besides the historical reason that OpenMP initially had no tasking construct
11. Data sharing¶
- OpenMP is a shared memory programming model, which means threads see each other's updates
- That is, when a thread updates a variable x that is then read by another thread, the reader will see the updated value
- This is the default behavior of OpenMP (see Data Environment in the OpenMP spec)
- It is not always convenient, however
- #pragma omp parallel can thus specify whether local variables in the enclosing scope (i.e., defined outside the statement) are privatized (i.e., made private to each thread)
Problem 9 : Observe the effect of privatization¶
- Execute the following and make sense of the output
- Add the private(x) clause and observe the difference
%%writefile omp_private.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
/* add private(x) clause below and see the difference */
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("thread %d : x = %d\n", id, x);
}
printf("after : x = %d\n", x);
return 0;
}
clang -fopenmp omp_private.c -o omp_private
# nvc -mp omp_private.c -o omp_private
OMP_NUM_THREADS=10 ./omp_private
Example answer:
%%writefile omp_private_ans.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
/* add private(x) clause below and see the difference */
#pragma omp parallel private(x)
{
int id = omp_get_thread_num();
printf("thread %d : x = %d\n", id, x);
}
printf("after : x = %d\n", x);
return 0;
}
- private(x) essentially ignores the original variable x defined outside the parallel region and behaves as if each thread defined a new variable of the same name
- firstprivate(x) is like private(x), except that each thread's x is initialized with the value of x just before entering the parallel region
Problem 10 : Observe the effect of private and firstprivate¶
- Execute the following and observe the output
- Add the private(x) clause and execute it
- Add the firstprivate(x) clause and execute it
- Make sense of the differences
%%writefile omp_firstprivate.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
/* add private(x)/firstprivate(x) clause and see the difference */
#pragma omp parallel
{
int id = omp_get_thread_num();
x++;
printf("thread %d : x = %d\n", id, x);
}
printf("after : x = %d\n", x);
return 0;
}
clang -fopenmp omp_firstprivate.c -o omp_firstprivate
# nvc -mp omp_firstprivate.c -o omp_firstprivate
OMP_NUM_THREADS=3 ./omp_firstprivate
Example answer:
%%writefile omp_firstprivate_ans.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
/* add private(x)/firstprivate(x) clause and see the difference */
#pragma omp parallel private(x)
{
int id = omp_get_thread_num();
x++;
printf("thread %d : x = %d\n", id, x);
}
printf("after : x = %d\n", x);
return 0;
}
%%writefile omp_firstprivate_ans.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
/* add private(x)/firstprivate(x) clause and see the difference */
#pragma omp parallel firstprivate(x)
{
int id = omp_get_thread_num();
x++;
printf("thread %d : x = %d\n", id, x);
}
printf("after : x = %d\n", x);
return 0;
}
12. Race condition¶
- Execute the above code without private or firstprivate many times
- Observe that the value of x after the parallel region is not always 123 + the number of threads executing the region, and may even differ across runs
- For example, with two threads, the following execution order may cause such a behavior
  - thread A reads 123
  - thread B reads 123
  - thread A writes 124
  - thread B writes 124
- A similar case occurs whenever a thread's read-followed-by-write is intervened by another thread's update
- More generally, the following situation is called a "race condition"; if there is a race condition in your program, it almost always means your program is broken
  - two or more threads concurrently access the same variable, and
  - at least one of them writes to it
- Here, "concurrently access" means the accesses are not guaranteed to be separated in time by a synchronization primitive
- In all but trivial parallel programs, threads need to communicate with each other to accomplish a task
- Threads communicate by having one thread write to a variable and having another read it
- If we simply do this without any mechanism guaranteeing that the accesses are separated in time, it is a race
- Below, we describe three ways to safely communicate among threads without creating race conditions
  - #pragma omp critical
  - #pragma omp atomic
  - reduction
13. #pragma omp critical¶
- #pragma omp critical guarantees that executions of the statement following the pragma do not overlap in time across threads
Problem 11 : Apply #pragma omp critical¶
- Execute the following program a few times and observe that the result is nondeterministic and often not what we want (i.e., 123 + the number of threads)
- Then, add #pragma omp critical to the statement x++ and see the result
%%writefile omp_critical.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
#pragma omp parallel
{
int id = omp_get_thread_num();
x++;
}
printf("after : x = %d\n", x);
return 0;
}
clang -fopenmp omp_critical.c -o omp_critical
# nvc -mp omp_critical.c -o omp_critical
OMP_NUM_THREADS=100 ./omp_critical
Example answer:
%%writefile omp_critical_ans.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
#pragma omp parallel
{
int id = omp_get_thread_num();
#pragma omp critical
x++;
}
printf("after : x = %d\n", x);
return 0;
}
clang -fopenmp omp_critical_ans.c -o omp_critical_ans
# nvc -mp omp_critical_ans.c -o omp_critical_ans
OMP_NUM_THREADS=100 ./omp_critical_ans
14. #pragma omp atomic¶
- #pragma omp atomic is similar to #pragma omp critical, but its effect is slightly different and its applicability is limited (see below)
Problem 12 : Apply #pragma omp atomic¶
- Add #pragma omp atomic to the statement x++ and see the result
%%writefile omp_atomic.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
#pragma omp parallel
{
int id = omp_get_thread_num();
x++;
}
printf("after : x = %d\n", x);
return 0;
}
clang -fopenmp omp_atomic.c -o omp_atomic
# nvc -mp omp_atomic.c -o omp_atomic
OMP_NUM_THREADS=100 ./omp_atomic
Example answer:
%%writefile omp_atomic_ans.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
#pragma omp parallel
{
int id = omp_get_thread_num();
#pragma omp atomic
x++;
}
printf("after : x = %d\n", x);
return 0;
}
clang -fopenmp omp_atomic_ans.c -o omp_atomic_ans
# nvc -mp omp_atomic_ans.c -o omp_atomic_ans
OMP_NUM_THREADS=100 ./omp_atomic_ans
- The statement that follows this pragma cannot be an arbitrary statement
- See the atomic Construct for the spec
- Typically, it is an update to a variable, such as x += expr;
- What is guaranteed by #pragma omp atomic is different from what #pragma omp critical guarantees
  - #pragma omp atomic applied to x += expr; guarantees that the read and write of x are never intervened by another update labeled #pragma omp atomic
  - whereas #pragma omp critical applied to x += expr; guarantees that the entire statement x += expr does not overlap with another statement labeled critical
- When applicable, #pragma omp atomic is more efficient than #pragma omp critical because the evaluation of expr can overlap (a sketch contrasting the two follows)
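To illustrate that last point, here is a minimal sketch (not from the original notebook; expensive_term() and the file name are hypothetical stand-ins): with atomic, only the update of x is protected and the expression can be evaluated by all threads concurrently, whereas with critical the whole statement, including the expression evaluation, is serialized.
%%writefile omp_atomic_vs_critical.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
/* hypothetical costly expression; the sleep makes its cost visible */
int expensive_term(int i) {
  usleep(100 * 1000);
  return i;
}
int main() {
  int x = 0, y = 0;
  double t0 = omp_get_wtime();
  #pragma omp parallel for
  for (int i = 0; i < 8; i++) {
    /* only the read-modify-write of x is protected;
       expensive_term(i) may run concurrently on all threads */
    #pragma omp atomic
    x += expensive_term(i);
  }
  double t1 = omp_get_wtime();
  #pragma omp parallel for
  for (int i = 0; i < 8; i++) {
    /* the entire statement, including expensive_term(i), is serialized */
    #pragma omp critical
    y += expensive_term(i);
  }
  double t2 = omp_get_wtime();
  printf("x = %d (atomic, %f sec), y = %d (critical, %f sec)\n", x, t1 - t0, y, t2 - t1);
  return 0;
}
clang -fopenmp omp_atomic_vs_critical.c -o omp_atomic_vs_critical
# nvc -mp omp_atomic_vs_critical.c -o omp_atomic_vs_critical
OMP_NUM_THREADS=8 ./omp_atomic_vs_critical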
15. Reduction clause¶
- Reduction is the best way to resolve race conditions where applicable, and it often is applicable
- It is applicable when the threads together calculate $v = v_0 \oplus v_1 \oplus \cdots \oplus v_{n-1}$, where each $v_i$ can be computed independently and $\oplus$ is an associative operator (such as +)
- In a serial loop, this could be written as
v = initial value;
for (i = 0; i < n; i++) {
v_i = ...
v = v + v_i;
}
- If we parallelize the above loop, updating $v$ results in a race condition
- This can be safely parallelized by introducing reduction(+ : v), as sketched below
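A minimal sketch of the parallelized pattern (illustrative only, not the answer to the problem below; here $v_i$ is simply $i^2$ and $\oplus$ is +, and the file name is hypothetical):
%%writefile omp_reduction_sketch.c
#include <stdio.h>
#include <omp.h>
int main() {
  long n = 1000;
  long v = 0; /* initial value */
  /* each thread accumulates into its own private copy of v;
     the partial sums are combined once at the end */
  #pragma omp parallel for reduction(+ : v)
  for (long i = 0; i < n; i++) {
    long v_i = i * i; /* v_i can be computed independently */
    v = v + v_i;
  }
  printf("v = %ld\n", v); /* always 332833500, regardless of the thread count */
  return 0;
}
clang -fopenmp omp_reduction_sketch.c -o omp_reduction_sketch
# nvc -mp omp_reduction_sketch.c -o omp_reduction_sketch
OMP_NUM_THREADS=4 ./omp_reduction_sketch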
Problem 13 : Apply reduction¶
- Add a reduction clause to #pragma omp parallel below and observe the result
%%writefile omp_reduction.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
#pragma omp parallel
{
int id = omp_get_thread_num();
x++;
}
printf("after : x = %d\n", x);
return 0;
}
clang -fopenmp omp_reduction.c -o omp_reduction
# nvc -mp omp_reduction.c -o omp_reduction
OMP_NUM_THREADS=100 ./omp_reduction
Example answer:
%%writefile omp_reduction_ans.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main() {
int x = 123;
printf("before : x = %d\n", x);
#pragma omp parallel reduction(+:x)
{
int id = omp_get_thread_num();
x++;
}
printf("after : x = %d\n", x);
return 0;
}
clang -fopenmp omp_reduction_ans.c -o omp_reduction_ans
# nvc -mp omp_reduction_ans.c -o omp_reduction_ans
OMP_NUM_THREADS=10 ./omp_reduction_ans
16. How reduction clause works and why it is preferable when applicable¶
- Where applicable, reduction is generally much faster than using #pragma omp atomic or #pragma omp critical
- This is because, internally, each thread computes its partial result in a private variable, and the partial results are combined only once at the end
- That is, each update, e.g., x += expr, updates a thread's private version of the variable x instead of updating the shared variable
- Omitting details, you can think of the reduction clause as converting something like
int x = G;
#pragma omp parallel reduction(+ : x)
{
... x += expr; ...
}
into something like
int x = G;
#pragma omp parallel
{
int x_priv = 0; // (I) initialize a private version of x
{
... x_priv += expr; ...
}
#pragma omp atomic
x += x_priv; // (C) combine the partial results in the private variable into the global variable
}
- This is valid because of the associativity of the operation
17. User-defined reduction¶
- Reduction is a general concept for efficiently executing many computations of the $v_i$'s in parallel, when the final outcome we wish to compute is $v = v_0 \oplus v_1 \oplus \cdots \oplus v_{n-1}$
- It is applicable whenever the order of combining partial results via $\oplus$ does not affect the final outcome (e.g., +)
- Yet the builtin reduction clause of OpenMP can only specify a few builtin operations on a few builtin types (e.g., int, float, etc.)
- You sometimes want to apply the efficient execution mechanism of reduction to more general types (perhaps types you defined)
- User-defined reduction exists exactly for that
- You need to define an expression to
- (C) combine two partial results into one (more specifically, combine a partial result assumed to be in a variable omp_in into another variable omp_out)
- (I) initialize a thread-private version of the variable to which reduction is applied, named omp_priv
- For example, in the case of the builtin + operator (the general form is sketched after this list),
- (C) would be omp_out += omp_in
- (I) would be omp_priv = 0
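A minimal sketch of the general form (the reduction identifier my_add and the file name are illustrative, not from the original notebook): the combiner expression is written in terms of omp_out and omp_in, and the initializer clause initializes omp_priv.
%%writefile omp_declare_reduction_sketch.c
#include <stdio.h>
#include <omp.h>
/* declare a reduction named my_add over long:
   (C) combiner:    omp_out += omp_in
   (I) initializer: omp_priv = 0 */
#pragma omp declare reduction \
    (my_add : long : omp_out += omp_in) \
    initializer(omp_priv = 0)
int main() {
  long s = 0;
  #pragma omp parallel for reduction(my_add : s)
  for (long i = 0; i < 1000; i++) {
    s += i;
  }
  printf("s = %ld\n", s); /* always 499500 */
  return 0;
}
clang -fopenmp omp_declare_reduction_sketch.c -o omp_declare_reduction_sketch
# nvc -mp omp_declare_reduction_sketch.c -o omp_declare_reduction_sketch
OMP_NUM_THREADS=4 ./omp_declare_reduction_sketch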
17-1. Apply a user-defined reduction¶
- Here is a simple (broken) parallel for loop that is meant to do a reduction on a 3-element vector
- Define a reduction with #pragma omp declare reduction and apply it to the parallel loop
%%writefile omp_ud_reduction.c
#include <stdio.h>
#include <unistd.h>
#include <math.h>
#include <omp.h>
/* 3-element vector */
typedef struct {
double a[3];
} vec_t;
/* x += y */
void vec_add(vec_t * x, vec_t * y) {
for (int i = 0; i < 3; i++) {
x->a[i] += y->a[i];
}
}
/* x = {0,0,0} */
void vec_init(vec_t * x) {
for (int i = 0; i < 3; i++) {
x->a[i] = 0;
}
}
/* add an appropriate #pragma omp declare reduction ... here */
int main() {
vec_t v;
vec_init(&v);
double t0 = omp_get_wtime();
/* add an appropriate reduction clause, so that
the result is always {10000,10000,10000} */
#pragma omp parallel for
for (int i = 0; i < 30000; i++) {
v.a[i % 3]++;
}
double t1 = omp_get_wtime();
printf("ans = {%.1f, %.1f, %.1f} in %f sec\n", v.a[0], v.a[1], v.a[2], t1 - t0);
return 0;
}
clang -fopenmp omp_ud_reduction.c -o omp_ud_reduction
# nvc -mp omp_ud_reduction.c -o omp_ud_reduction
OMP_NUM_THREADS=10 ./omp_ud_reduction
%%writefile omp_ud_reduction_ans.c
#include <stdio.h>
#include <unistd.h>
#include <math.h>
#include <omp.h>
/* 3-element vector */
typedef struct {
double a[3];
} vec_t;
/* x += y */
void vec_add(vec_t * x, vec_t * y) {
for (int i = 0; i < 3; i++) {
x->a[i] += y->a[i];
}
}
/* x = {0,0,0} */
void vec_init(vec_t * x) {
for (int i = 0; i < 3; i++) {
x->a[i] = 0;
}
}
#pragma omp declare reduction \
(vp : vec_t : vec_add(&omp_out,&omp_in)) \
initializer(vec_init(&omp_priv))
/* add an appropriate #pragma omp declare reduction ... here */
int main() {
vec_t v;
vec_init(&v);
double t0 = omp_get_wtime();
/* add an appropriate reduction clause, so that
the result is always {10000,10000,10000} */
#pragma omp parallel for reduction(vp:v)
for (int i = 0; i < 30000; i++) {
v.a[i % 3]++;
}
double t1 = omp_get_wtime();
printf("ans = {%.1f, %.1f, %.1f} in %f sec\n", v.a[0], v.a[1], v.a[2], t1 - t0);
return 0;
}
clang -fopenmp omp_ud_reduction_ans.c -o omp_ud_reduction_ans
# nvc -mp omp_ud_reduction_ans.c -o omp_ud_reduction_ans
OMP_NUM_THREADS=10 ./omp_ud_reduction_ans
Problem 14 : Putting them together: calculating an integral¶
Write an OpenMP program that calculates
$$ \int \int_D \sqrt{1 - x^2 - y^2}\,dx\,dy $$
where
$$ D = \{\;(x, y)\;|\;0\leq x \leq 1, 0\leq y \leq 1, x^2 + y^2 \leq 1 \}$$
- Note: an alternative way to put it is to calculate
$$ \int_0^1 \int_0^1 f(x, y)\,dx\,dy $$
where
$$ f(x, y) = \left\{\begin{array}{ll}\sqrt{1 - x^2 - y^2} & (x^2 + y^2 \leq 1) \\ 0 & (\mbox{otherwise}) \end{array}\right. $$
Use a nested loop to calculate the double integral
Try work-sharing for, taskloop, and recursive tasks to parallelize it
The result should be close to $\pi/6 = 0.52359...$ (1/8 of the volume of the unit ball)
Play with the number of infinitesimal intervals for integration and the number of threads so that you can observe a speedup
As you are using a shared cloud environment, you do not have to be serious about the speedup (a nearly perfect speedup is unlikely when other students are simultaneously using the same machine and/or the cloud is doing much other work, e.g., serving the page you are looking at right now)
If you want to work with an editor you are accustomed to rather than the web browser, see this page
%%writefile omp_integral.c
Example answer 1: parallelize the outer loop only
%%writefile omp_integral_ans.c
#include <err.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
long cur_time() {
struct timespec ts[1];
clock_gettime(CLOCK_REALTIME, ts);
return ts->tv_sec * 1000000000L + ts->tv_nsec;
}
typedef struct {
long t;
int thread;
} record_t;
enum { A = 32 };
/* parallel for outerloop */
double int_sqrt_one_minus_x2_y2(long n, record_t R[(n+A-1)/A][(n+A-1)/A]) {
(void)R;
double h = 1.0 / n;
double s = 0.0;
#pragma omp parallel for reduction(+:s) schedule(runtime)
for (long i = 0; i < n; i++) {
for (long j = 0; j < n; j++) {
double x = i * h ;
double y = j * h;
double z = 1 - x * x - y * y;
if (z > 0.0) {
s += sqrt(z);
} else {
break;
}
}
}
return s * h * h;
}
int main(int argc, char ** argv) {
int i = 1;
long n = (argc > i ? atof(argv[i]) : 30L * 1000L); i++;
printf("n = %ld (%ld points to evaluate integrand on)\n", n, n * n);
record_t (*R)[] = (record_t (*)[])0;
long t0 = cur_time();
double s = int_sqrt_one_minus_x2_y2(n, R);
long t1 = cur_time();
long dt = t1 - t0;
printf("%.3f sec\n", dt * 1.0e-9);
printf("s = %.9f (err = %e)\n", s, fabs(s - M_PI/6));
return 0;
}
Example answer 2: parallelize the nested loop with collapse
%%writefile omp_integral_ans.c
#include <err.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
long cur_time() {
struct timespec ts[1];
clock_gettime(CLOCK_REALTIME, ts);
return ts->tv_sec * 1000000000L + ts->tv_nsec;
}
typedef struct {
long t;
int thread;
} record_t;
enum { A = 32 };
/* parallel for both loops */
double int_sqrt_one_minus_x2_y2(long n, record_t R[(n+A-1)/A][(n+A-1)/A]) {
(void)R;
double h = 1.0 / n;
double s = 0.0;
#pragma omp parallel for collapse(2) reduction(+:s) schedule(runtime)
for (long i = 0; i < n; i++) {
for (long j = 0; j < n; j++) {
double x = i * h ;
double y = j * h;
double z = 1 - x * x - y * y;
if (z > 0.0) {
s += sqrt(z);
}
}
}
return s * h * h;
}
int main(int argc, char ** argv) {
int i = 1;
long n = (argc > i ? atof(argv[i]) : 30L * 1000L); i++;
printf("n = %ld (%ld points to evaluate integrand on)\n", n, n * n);
record_t (*R)[] = (record_t (*)[])0;
long t0 = cur_time();
double s = int_sqrt_one_minus_x2_y2(n, R);
long t1 = cur_time();
long dt = t1 - t0;
printf("%.3f sec\n", dt * 1.0e-9);
printf("s = %.9f (err = %e)\n", s, fabs(s - M_PI/6));
return 0;
}
Example answer 3: use task loop
%%writefile omp_integral_ans.c
#include <err.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
long cur_time() {
struct timespec ts[1];
clock_gettime(CLOCK_REALTIME, ts);
return ts->tv_sec * 1000000000L + ts->tv_nsec;
}
typedef struct {
long t;
int thread;
} record_t;
enum { A = 32 };
/* taskloops */
double int_sqrt_one_minus_x2_y2(long n, record_t R[(n+A-1)/A][(n+A-1)/A]) {
(void)R;
double h = 1.0 / n;
double s = 0.0;
#pragma omp parallel
#pragma omp master
#pragma omp taskloop collapse(2) reduction(+:s)
for (long i = 0; i < n; i++) {
for (long j = 0; j < n; j++) {
double x = i * h ;
double y = j * h;
double z = 1 - x * x - y * y;
if (z > 0.0) {
s += sqrt(z);
}
}
}
return s * h * h;
}
int main(int argc, char ** argv) {
int i = 1;
long n = (argc > i ? atof(argv[i]) : 30L * 1000L); i++;
printf("n = %ld (%ld points to evaluate integrand on)\n", n, n * n);
record_t (*R)[] = (record_t (*)[])0;
long t0 = cur_time();
double s = int_sqrt_one_minus_x2_y2(n, R);
long t1 = cur_time();
long dt = t1 - t0;
printf("%.3f sec\n", dt * 1.0e-9);
printf("s = %.9f (err = %e)\n", s, fabs(s - M_PI/6));
return 0;
}
Example answer 4: use tasks
%%writefile omp_integral_ans.c
#include <err.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
long cur_time() {
struct timespec ts[1];
clock_gettime(CLOCK_REALTIME, ts);
return ts->tv_sec * 1000000000L + ts->tv_nsec;
}
typedef struct {
long t;
int thread;
} record_t;
enum { A = 32 };
/* task */
typedef struct {
long x0;
long y0;
long dx;
long dy;
} reg_t;
enum { threshold = 10000 };
double int_sqrt_one_minus_x2_y2_rec(long n, reg_t r,
record_t R[(n+A-1)/A][(n+A-1)/A]) {
(void)R;
if (r.dx * r.dy < threshold) {
double h = 1.0 / n;
double s = 0.0;
for (long i = r.x0; i < r.x0 + r.dx; i++) {
for (long j = r.y0; j < r.y0 + r.dy; j++) {
double x = i * h ;
double y = j * h;
double z = 1 - x * x - y * y;
if (z > 0.0) {
s += sqrt(z);
}
}
}
return s * h * h;
} else if (r.dy < r.dx) {
long dx = r.dx;
reg_t r0 = { r.x0, r.y0, dx / 2, r.dy };
reg_t r1 = { r.x0 + dx / 2, r.y0, r.dx - dx / 2, r.dy };
double s0 = 0.0;
double s1 = 0.0;
#pragma omp task shared(s0)
s0 = int_sqrt_one_minus_x2_y2_rec(n, r0, R);
#pragma omp task shared(s1)
s1 = int_sqrt_one_minus_x2_y2_rec(n, r1, R);
#pragma omp taskwait
return s0 + s1;
} else {
long dy = r.dy;
reg_t r0 = { r.x0, r.y0, r.dx, dy / 2 };
reg_t r1 = { r.x0, r.y0 + dy / 2, r.dx, r.dy - dy / 2 };
double s0 = 0.0;
double s1 = 0.0;
#pragma omp task shared(s0)
s0 = int_sqrt_one_minus_x2_y2_rec(n, r0, R);
#pragma omp task shared(s1)
s1 = int_sqrt_one_minus_x2_y2_rec(n, r1, R);
#pragma omp taskwait
return s0 + s1;
}
}
double int_sqrt_one_minus_x2_y2(long n, record_t R[(n+A-1)/A][(n+A-1)/A]) {
reg_t r = { 0, 0, n, n };
double s = 0.0;
#pragma omp parallel
#pragma omp master
s = int_sqrt_one_minus_x2_y2_rec(n, r, R);
return s;
}
int main(int argc, char ** argv) {
int i = 1;
long n = (argc > i ? atof(argv[i]) : 30L * 1000L); i++;
printf("n = %ld (%ld points to evaluate integrand on)\n", n, n * n);
record_t (*R)[] = (record_t (*)[])0;
long t0 = cur_time();
double s = int_sqrt_one_minus_x2_y2(n, R);
long t1 = cur_time();
long dt = t1 - t0;
printf("%.3f sec\n", dt * 1.0e-9);
printf("s = %.9f (err = %e)\n", s, fabs(s - M_PI/6));
return 0;
}
Remarks:
- The first question is whether or not you use collapse to parallelize the inner loop
- If you do, using the break statement in the inner loop becomes invalid, so you cannot break out of the inner loop even when $(1 - x^2 - y^2)$ becomes $< 0$
- If you do not, it is safe to do so, and in fact you want to break as soon as $(1 - x^2 - y^2)$ becomes $< 0$, to avoid wasting time calculating $(1 - x^2 - y^2)$ in later iterations when you know they are all negative
- If the outer loop has enough iterations ($\gg$ the number of threads), then the latter is much more efficient
- In either case, the right scheduling strategy will be dynamic, as the work per iteration is not uniform. If you collapse, the work per (inner) iteration differs depending on whether you perform sqrt or not. If you do not collapse, the work per (outer) iteration differs more significantly depending on where you break the loop.
- If you collapse, the pitfall is that the work per iteration becomes so small that using the default chunk size (1) makes the dynamic load-balancing overhead too large to enjoy any benefit from parallelization. Make the chunk size larger to amortize the overhead (see the run sketch at the end of this section). If you do not collapse, a single iteration executes an entire inner loop, so the overhead of dynamic load balancing is hardly an issue in practice.
- Compile it
clang -O3 -fopenmp omp_integral_ans.c -o omp_integral_ans -lm
# nvc -O4 -mp omp_integral_ans.c -o omp_integral_ans
- and run it
OMP_NUM_THREADS=8 ./omp_integral_ans
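For the variants that use schedule(runtime) (example answers 1 and 2), a minimal run sketch with illustrative values: OMP_SCHEDULE selects the policy and chunk size per run, e.g. a large chunk to amortize the dynamic-scheduling overhead in the collapsed version.
OMP_SCHEDULE="dynamic,1000" OMP_NUM_THREADS=8 ./omp_integral_ans
OMP_SCHEDULE="static" OMP_NUM_THREADS=8 ./omp_integral_ans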