
Parallelizing Pi Approximation with OpenMP


OpenMP vs POSIX threads

Both are APIs for shared-memory programming. In C, OpenMP is a fairly high-level extension, providing compiler directives that make many programs easy to parallelize, although some low-level thread interactions can be complex to program. Pthreads provides low-level coordination constructs that are unavailable in OpenMP, making some programs easier to parallelize, while requiring explicit specification of the behavior of each thread, which can be virtually anything imaginable.

Pthreads programs can be used with any C compiler, provided the system has a Pthreads library. OpenMP requires compiler support for some operations, which many compilers provide.
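
As a rough illustration of the difference in style (a minimal sketch, not from the original post; the thread count and messages are arbitrary), here is the same "hello from every thread" program written with both APIs:

/* Sketch: Pthreads vs OpenMP for the same task.
   Compile with: gcc hello_both.c -pthread -fopenmp */
#include <stdio.h>
#include <pthread.h>
#include <omp.h>

#define NUM_THREADS 4

/* Pthreads: thread behavior is specified explicitly as a start routine. */
void *hello(void *arg) {
  long rank = (long) arg;
  printf("Pthreads: hello from thread %ld\n", rank);
  return NULL;
}

int main() {
  pthread_t threads[NUM_THREADS];

  /* Pthreads: create and join each thread by hand. */
  for (long t = 0; t < NUM_THREADS; t++)
    pthread_create(&threads[t], NULL, hello, (void *) t);
  for (long t = 0; t < NUM_THREADS; t++)
    pthread_join(threads[t], NULL);

  /* OpenMP: a single directive spawns the threads and runs the statement in each. */
  #pragma omp parallel num_threads(NUM_THREADS)
  printf("OpenMP: hello from thread %d\n", omp_get_thread_num());

  return 0;
}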

Area Under the Curve

The program we are about to write is often referred to as the "Hello world!" of parallel programming. We will be using the following formula for arctan(1) to approximate the value of π.

π / 4 = arctan(1) = ∫₀¹ 1 / (1 + x²) dx

Just as we learned a long time ago in kindergarten, the value of the integral here is the area between the graph of 1 / (1 + x²) and the x-axis as x goes from 0 to 1.

Going through the serial version of the program in C, the value of the expression above is calculated for every integer in the range [1, N] and summed up in a loop. The resulting sum is then divided by N to get an estimate of π / 4, before printing out the calculated approximation of π.
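
In other words, the loop in the code evaluates the integrand at the midpoints (i - 0.5) / N of N equal subintervals, which is a midpoint-rule approximation of the integral above:

π / 4 ≈ (1 / N) · Σ_{i = 1..N} 1 / (1 + ((i - 0.5) / N)²)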

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main() {
  unsigned int N=999999999;
  long double sum=0, quarter_of_pi=0;

  clock_t tstart = clock();

  for (unsigned int i=1; i <= N; i++) {
    sum += 1.0 / ( 1 + (long double) pow(( i - 0.5 ) / N, 2) );
  }
  quarter_of_pi = sum / N;
  double tcalc = (double) (clock() - tstart) / CLOCKS_PER_SEC;

  printf("Time taken: %.3lf seconds\n", tcalc);
  printf("Quarter of Pi ~ %.15Lf\n", quarter_of_pi);
  printf("Pi ~ %.15Lf\n", quarter_of_pi * 4.0);

  return 0;
}

A parallel version of the above code is shown below. Note the compiler directives starting with #pragma, and the inclusion of the omp.h header.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main() {
  unsigned int N=999999999;
  long double sum=0, quarter_of_pi=0;

  double tstart, tcalc, tstop;
  tstart = omp_get_wtime();
  sum = 0;

  omp_set_num_threads(24);

  #pragma omp parallel
  {
    long double sum_thread=0; // local to each thread
    #pragma omp for
    for (unsigned int i=1; i <= N; i++) {
      sum_thread += 1.0 / ( 1 + (long double) pow(( i - 0.5 ) / N, 2) );
    }

    #pragma omp atomic
    sum += sum_thread;
  }

  quarter_of_pi = sum / N;
  tstop = omp_get_wtime();
  tcalc = tstop - tstart;

  printf("Time taken: %.3lf seconds\n", tcalc);
  printf("Quarter of Pi ~ %.15Lf\n", quarter_of_pi);
  printf("Pi ~ %.15Lf\n", quarter_of_pi * 4.0);

  return 0;
}

To compile the above with gcc, use the following.

gcc -o <output_name> <source_file>.c -fopenmp -lm -Wall -g

Improving the Parallelized Code

Before changing anything, it pays to understand what we want to change in the first place. So, let us walk through the above code. The function omp_get_wtime(), defined in omp.h, returns the number of seconds elapsed since some fixed point in the past (included here for timing the execution). Setting the number of threads to spawn when running parallel code (code inside a parallel scope, that is) is done with omp_set_num_threads(). Changing set to get and calling omp_get_num_threads() from within a parallel section returns the number of threads actually in execution (a get counterpart is available for most functions with set). Note that, due to resource constraints on the machine, we may end up with fewer threads than requested.
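
For example, a minimal sketch (not part of the original program; the thread count 24 simply mirrors the code above) that requests threads, checks how many were granted, and times a region:

#include <stdio.h>
#include <omp.h>

int main() {
  double tstart = omp_get_wtime();   /* seconds since some fixed point in the past */

  omp_set_num_threads(24);           /* request 24 threads for subsequent parallel regions */

  #pragma omp parallel
  {
    /* Only meaningful inside a parallel region; the runtime may grant
       fewer threads than requested. */
    if (omp_get_thread_num() == 0)
      printf("Granted %d threads\n", omp_get_num_threads());
  }

  printf("Region took %.6f seconds\n", omp_get_wtime() - tstart);
  return 0;
}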

The directive #pragma omp parallel is what parallelizes the code; it is good to remember that if it is not present somewhere, the program will not be parallelized by OpenMP at all. The scope of code to be executed in parallel can be defined the same way we do scoping in C, with {}, or without, in which case only the immediately following statement is considered in scope. This is where new threads are spawned and start running alongside the master thread, branching the execution.
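
To make the scoping rule concrete, here is a small sketch (not from the original post) showing both forms:

#include <stdio.h>
#include <omp.h>

int main() {
  /* Braced form: every statement in the block runs on every thread. */
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    printf("braced form, thread %d\n", id);
  }

  /* Unbraced form: only the immediately following statement runs in parallel. */
  #pragma omp parallel
  printf("single-statement form, thread %d\n", omp_get_thread_num());

  printf("back to serial execution\n");   /* executed once by the master thread */
  return 0;
}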

Any variables created within the parallel scope are private to each thread (e.g. sum_thread), and therefore cannot contribute to race conditions. The #pragma omp for directive splits the iterations of the following loop among the spawned threads. Note that the way threads divide the iterations among themselves can be controlled to a degree with explicit clauses; the default behavior is accepted here by omitting them.
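
For instance, the division of iterations can be made explicit with a schedule clause. A minimal sketch of the same loop, assuming a dynamic schedule with an arbitrary chunk size of 100000:

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main() {
  unsigned int N = 999999999;
  long double sum = 0;

  #pragma omp parallel
  {
    long double sum_thread = 0;   /* private to each thread */

    /* schedule(dynamic, 100000): threads grab chunks of 100000 iterations as
       they finish, instead of the implementation-defined default division. */
    #pragma omp for schedule(dynamic, 100000)
    for (unsigned int i = 1; i <= N; i++)
      sum_thread += 1.0 / (1 + (long double) pow((i - 0.5) / N, 2));

    #pragma omp atomic
    sum += sum_thread;
  }

  printf("Pi ~ %.15Lf\n", sum / N * 4.0);
  return 0;
}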

Critical sections can be scoped within #pragma omp atomic or #pragma omp critical. The former allows low-level atomicity of execution (with better performance) when available, and only accepts a single assignment statement (of an allowed form) within its scope, whereas the latter can handle multiple statements. Both effectively serialize execution. If we end up with most of the code inside the scope of either, our code may simply be a poor fit for parallelization with OpenMP.
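
A small sketch of the difference, using hypothetical counters that are not part of the original code:

#include <stdio.h>
#include <omp.h>

int main() {
  long hits = 0;
  long max_id = 0;

  #pragma omp parallel num_threads(4)
  {
    int id = omp_get_thread_num();

    /* atomic: exactly one simple update statement, usually mapped to a
       hardware atomic instruction. */
    #pragma omp atomic
    hits += 1;

    /* critical: an arbitrary block of statements, executed by one thread
       at a time under a lock. */
    #pragma omp critical
    {
      if (id > max_id)
        max_id = id;
      printf("thread %d passed through the critical section\n", id);
    }
  }

  printf("hits = %ld, max thread id seen = %ld\n", hits, max_id);
  return 0;
}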

Outside the #pragma omp parallel scope, the previously spawned threads join the master thread, which continues executing just like a serial program. It is worth pointing out that we can have as many parallel/serial sections as we please.

As should be obvious from the above, OpenMP is designed to be easy to use and can parallelize existing serial code with minimal changes. What if we could simplify the above code even further?

Reducing the Parallel Code

Updating the sum variable at the end of the for loop in the above code can be simplified with a reduction operation provided by OpenMP. That is, we can tell OpenMP to safely add the local sum calculated by each thread (within the loop) to the shared variable sum, as shown in the parallel version below.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main() {
  unsigned int N=999999999;
  long double sum=0, quarter_of_pi=0;

  double tstart, tcalc, tstop;
  tstart = omp_get_wtime();
  sum = 0;

  omp_set_num_threads(24);

  #pragma omp parallel for reduction(+:sum) shared(N)
    for (unsigned int i=1; i <= N; i++)
      sum += 1.0 / ( 1 + (long double) pow(( i - 0.5 ) / N, 2) );

  quarter_of_pi = sum / N;
  tstop = omp_get_wtime();
  tcalc = tstop - tstart;

  printf("Time taken: %.3lf seconds\n", tcalc);
  printf("Quarter of Pi ~ %.15Lf\n", quarter_of_pi);
  printf("Pi ~ %.15Lf\n", quarter_of_pi * 4.0);

  return 0;
}

Here we have a single directive, #pragma omp parallel for, which says that we want the immediately following for loop parallelized; it is equivalent to having #pragma omp for immediately before a for loop within the scope of #pragma omp parallel. The reduction(+:sum) clause says that we want the sum variable to hold the total of the local sums calculated in each iteration, which would otherwise only be available locally to the thread that runs it. OpenMP does this by creating a private sum variable for each thread and initializing it to the identity of the operation (e.g. 0 for +, 1 for *); this is how the reduction operator works in OpenMP. The shared(N) clause says that the N variable is to be shared among all threads, which is fine since the threads only read it, so there are no potential data races. Finally, the approximated values along with the time taken for the calculation are printed.
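
As an aside, the same mechanism works for other operators, each with its own identity element. A minimal sketch with made-up values (not from the original post):

#include <stdio.h>
#include <omp.h>

int main() {
  int n = 16;
  long long factorial = 1;   /* identity element for * */
  long long total = 0;       /* identity element for + */

  /* Each thread gets private copies initialized to 1 and 0 respectively;
     the partial results are combined into the shared variables at the end. */
  #pragma omp parallel for reduction(*:factorial) reduction(+:total)
  for (int i = 1; i <= n; i++) {
    factorial *= i;
    total += i;
  }

  printf("%d! = %lld, sum 1..%d = %lld\n", n, factorial, n, total);
  return 0;
}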

Now that we have gone through the code, note that omp_set_num_threads is not required when using OpenMP; by default the program spawns as many threads as the machine makes available at execution time. Timing the serial and parallel (24-thread) versions resulted in ~9.75 and ~0.76 seconds respectively, a speedup of ~13.
