Grand Central Dispatch
About a decade late, but I decided to give Grand Central Dispatch (GCD) a go in R. GCD is fairly similar to OpenMP as it provides a simplified interface for pthreads. Since its release for Mac OS X 10.6 in 2009, GCD has been ported to a number of operating systems through libdispatch. As of this writing, libdispatch is primarily available on Linux and Apple's operating systems using the Clang compiler. GCC may work, but I have not had any luck.
The most notable aspect of GCD is that it provides a native multithreading environment in Apple's operating systems. This includes MacOS and iOS. Not true for OpenMP since the removal of GCC from Mac OS X Snow Leopard. So if you are trying to get the most performance for all of your customers, GCD may provide some benefit for a lot of #ifdefs.
If you have used OpenCL GCD should be familiar. This is due to OpenCL and GCD both using queues and code blocks. However, you aren't forced to use code blocks, as GCD provides _f variants of the dispatch commands to allow the use of standard C functions.
Queues are provided in two main flavors, serial and concurrent. Serial queues are exactly what they sound like, they provide a FIFO (First In First Out) queue. Therefore, tasks are executed and completed in sequence. Concurrent queues are non-blocking queues in that they execute in order, but do not wait for completion.
A simple concurrent queue can be created through dispatch_get_global_queue
, this function has two inputs. The first is the Quality of Service (QoS) for the queue, the second is reserved for future use and can be set to 0 or NULL.
dispatch_queue_global_t dispatch_get_global_queue(intptr_t qos, uintptr_t flags)
The quality of service will allow flexibility in your R package. You could have one high priority queue to draw a real-time plot, and have a lower priority queue for pulling data from websites in the background.
The level of the queue is set by one of three enumerated values:
DISPATCH_QUEUE_PRIORITY_HIGH = 2,
DISPATCH_QUEUE_PRIORITY_DEFAULT = 0,
DISPATCH_QUEUE_PRIORITY_LOW = -2
dispatch_queue_global_t dispatch_get_global_queue
will only create a concurrent queue, but you can use dispatch_queue_create
to create either queue type:
dispatch_queue_t dispatch_queue_create(const char *label, dispatch_queue_attr_t qt)
The first input provides a label for debugging etc, it can be anything including NULL, the second identifies queue type from one of the enumerated values below:
DISPATCH_QUEUE_SERIAL or NULL
DISPATCH_QUEUE_CONCURRENT
Once you have a queue, you probably want to stick somethign in it. The dispatch commands will get you there. dipatch_apply and likewise dispatch_apply_f will solve this problem for you, they will enqueue your request and run it a fixed number of times. The former function will be used for code blocks while the later will be used for standard functions.
To give you a bit of flavor behind GCD, a parallel for-loop can be created through the following function.
// inner kernel using GCD
void R_parallel_inner_kernel_gcd( double * x, double * y, int * n ) {
// concurrent queue
dispatch_queue_t my_queue = dispatch_get_global_queue(0, 0);
// for like loop
dispatch_apply (*n, my_queue, ^(size_t idx){
size_t j;
double tmp = 0;
for( j = 0; j < *n; j++) {
tmp += (x[idx] - y[j]) * (x[idx] - y[j]);
}
x[idx] = tmp;
});
}
Here you can see the code block is being applied to the dispatch and running n times using the iterator idx. The code block is just a rolled up function that allows the creation of private variables such as tmp and j and shares the data from y and x.
This should be enough for you to get playing with GCD, further documentation exists at the Apple GitHub page.
For comparison against OpenMP, I re-wrote the code below for some performance testing. As you can see, the code is fairly similar, but unlike above, failure for having OpenMP available will allow the code to run as designed unlike the GCD case.
// inner kernel using OpenMP
void R_parallel_inner_kernel_omp( double * x, double * y, int * n ) {
size_t idx,j;
double tmp;
#pragma omp parallel
{
#pragma omp for
for( idx = 0; idx < *n; idx++) {
tmp = 0;
for( j = 0; j < *n; j++) {
tmp += (x[idx] - y[j]) * (x[idx] - y[j]);
}
x[idx] = tmp;
}
}
return;
}
Running the code on an older 12-core, 24-thread (3.3GHz) 2012 MacPro running Mac OS Catalina, I received the following timings for GCC 9 (average 100 runs).
With OpenMP: 0.46 sec elapsed
Without GCD or OpenMP: 9.05 sec elapsed
GCD wasn't tested since I couldn't get it working with GCC in this case.
Using homebrew's LLVM (Version 10), I received the following timings on the same system (average 100 runs).
With GCD: 0.53 sec elapsed
With OpenMP: 0.54 sec elapsed
Without GCD or OpenMP: 9.08 sec elapsed
Switching to my laptop running the same operationg system, a 2017 13" MacBook Pro with an i5 (2.3GHz, dual-core 4-thread) CPU, I had results that actually went in the other direction using homebrew LLVM (Version 10) (average 100 runs).
With GCD: 2.92 sec elapsed
With OpenMP: 3.00 sec elapsed
Without GCD or OpenMP: 11.51 sec elapsed
Using GCC 9, I also repeated the same compilation again using the same flags (average 100 runs).
With OpenMP: 3.21 sec elapsed
Without GCD or OpenMP: 11.61 sec elapsed
We can see that the performance measures differ quite a bit between OpenMP and GCD, even when running a fairly simple program over different hardware. Further testing would be required to better understand the conditions that cause OpenMP to be more performant than GCD. One idea could be the number of cores, or optimzation for specific processors. Perhapse another blog post for that!
For those interested in getting their feet wet a bit more, I have created a reference template for GCD with make files to choose the best parallel implementation for your code (rgcd-template). Included are timing tests and the code samples shown.