Here are the first results for:

Sorting 50,000,000 (50 million) random elements.

Implementation            Time (sec.)
Java 7                    55.687
Objective C (Heap)        12.238
Scala 2.9 (Concurrent)     3.681
OpenCL (Bitonic)           0.372

For OpenCL, the “copy to memory” time is included in the result; for Java and Scala, the launch time is excluded.
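For readers who have not met the bitonic sort named in the results, here is a minimal sequential sketch in plain C. It is not the kernel used for the measurement above; the function name `bitonic_sort` is made up for illustration, and the sketch assumes the element count is a power of two. The point is that every compare-and-swap in the inner loop is independent of the others in the same pass, which is why the algorithm maps well to OpenCL work-items.

```c
#include <stddef.h>

/* Illustrative bitonic sort (hypothetical helper, not the benchmarked kernel).
 * Sorts a[0..n-1] in ascending order. Assumes n is a power of two. */
void bitonic_sort(int *a, size_t n)
{
    /* k: length of the bitonic sequences merged in this stage */
    for (size_t k = 2; k <= n; k <<= 1) {
        /* j: distance between the two elements compared in this pass */
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            /* On a GPU, each value of i could be handled by one work-item. */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i) {
                    /* Bit k of i selects the sort direction of this block. */
                    int ascending = (i & k) == 0;
                    if ((ascending && a[i] > a[partner]) ||
                        (!ascending && a[i] < a[partner])) {
                        int tmp = a[i];
                        a[i] = a[partner];
                        a[partner] = tmp;
                    }
                }
            }
        }
    }
}
```

Calling `bitonic_sort(values, 8)` on an array of eight integers leaves it sorted in ascending order. For input sizes that are not a power of two, the array would have to be padded first.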

OpenCL programming is restricted in several ways. The local memory of a kernel is limited to 32MB. OpenCL isn't available on all devices (hosts), and the driver quality differs between vendors.

Therefore I will initially limit the availability of my library to OS X systems.

I also abandoned the idea of embedding OpenCL inside a guest language. I have worked with frameworks for embedded DDL/SQL, JavaScript, XML and many more. These solutions tend to become too extensive, and performance is often lost during the “translation process”.

However, if you are interested in embedded solutions, take a look at:

As a consequence, my library will consist of OpenCL include files. Programs have to be written in the “OpenCL Programming Language”. OpenCL housekeeping such as context creation and program compilation is managed by the framework.

What are the base functions?

With OpenGL and scientific applications in mind I started with:

  • Sorting
  • Searching
  • Scheduling
  • Packing
  • LS solving

and

  • N-trees

This library will complete my toolset for developing modern desktop and tablet applications.

Since most modern algorithms and data structures for computer graphics are designed for fast parallel processing, I began to play with OpenCL.
Examples are algorithms for physically correct lighting and spatial data structures.

For an easy start I took a simple algorithm and ported it to a kernel:

A perfect number is a positive integer that is equal to the sum of its proper positive divisors. (Wikipedia)

To test whether a number is perfect, its proper divisors have to be found and summed. A very simple way to do this is a for loop:

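__kernel void perfectNumber(__global int * Number, __global int *FACSum) 
{
    // One work-item per candidate number.
    int i = get_global_id(0);

    int testNumber = Number[i];
    int result = 0;

    // Start at 1: a modulo by zero is undefined.
    // No proper divisor is larger than testNumber / 2.
    for(int f = 1; f <= testNumber / 2; f++) {
        result += testNumber % f == 0 ? f : 0;
    }

    // The number is perfect if FACSum[i] equals Number[i].
    FACSum[i] = result;
}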

The program (kernel) is uploaded to the GPU. The remaining task for the CPU is to feed the GPU with new values.
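The host side of this loop can be sketched in plain C. This is a minimal sketch, not the code from my demo: the helper names `fill_batch`, `run_batch_on_cpu` and `collect_perfect` are hypothetical, and `run_batch_on_cpu` is a CPU stand-in for the kernel launch, so the same code also serves as a reference for checking the GPU output.

```c
#include <stddef.h>

/* CPU stand-in for the kernel body: sum of the proper divisors of n. */
int sum_proper_divisors(int n)
{
    int sum = 0;
    for (int f = 1; f <= n / 2; f++) {
        if (n % f == 0) {
            sum += f;
        }
    }
    return sum;
}

/* Fill one batch with consecutive candidates start, start + 1, ... */
void fill_batch(int start, int *numbers, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        numbers[i] = start + (int)i;
    }
}

/* Hypothetical replacement for the kernel launch: one loop iteration
 * does the work of one work-item. */
void run_batch_on_cpu(const int *numbers, int *fac_sum, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        fac_sum[i] = sum_proper_divisors(numbers[i]);
    }
}

/* Scan the results: a number is perfect if it is positive and equals the
 * sum of its proper divisors. Stores at most out_cap hits in out and
 * returns how many were stored. */
size_t collect_perfect(const int *numbers, const int *fac_sum, size_t count,
                       int *out, size_t out_cap)
{
    size_t found = 0;
    for (size_t i = 0; i < count && found < out_cap; i++) {
        if (numbers[i] > 0 && fac_sum[i] == numbers[i]) {
            out[found++] = numbers[i];
        }
    }
    return found;
}
```

With the real kernel, `run_batch_on_cpu` would be replaced by writing the `numbers` buffer to the device, enqueueing the kernel and reading `fac_sum` back. The filling and scanning steps stay on the CPU.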

To ensure that only the GPU is used, it is necessary to filter the desired device-type:

clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);

In my demo, I stopped the calculation after the fifth result: