From macOS To Linux:
Port Obstacles And Benchmarks

During the port of my macOS applications and libraries to Linux over the last few years, I was several times surprised by unexpected results and different behaviors between the two operation systems.

Retinal Pigment Epithelium and a Dynamic compression Filter
Example algorithm that was ported (optimized) for Linux: Applying a Retinal Pigment Epithelium-Filter combined with a Light Value Compression-Filter.

The first surprise was that Objective-Cs messaging (dynamic method calls) is fast and provides no notable disadvantage against C++.

In the following article I will show my experiences of porting and optimizing my image manipulating library used for Color Essence to Linux. I will do this by benchmarking one of my monochrome filters. The framework for the Color Essence library consists primarily of a histogram and modification step:

The read and write loops are executed in parallel while the inner block is executed in sequence. All images are processed in HDR(float). LDR images are converted. For benchmarks the image Char_(4088888924).jpg (Resolution: 2,372 × 3,160) (Link: commons.wikimedia) by DARREN ST0NE was used.

Algorithm

The RHD monochrome algorithm compacts unnecessary (near black) parts of the light value spectrum and also enhances it further by adding an unobtrusive saturation. This saturation imitates the normal blood circulation in the retina and is almost unnoticeable without a direct comparison to a grayscale image. The image contrast is further improved by a premultiplied noise. The Algorithm has a \(\mathcal{O}(2*n)\) runtime. In other words it should not be significantly slower than a copy operation. Because of a missing overview of monochrome filters, I cannot say that this algorithm is unique or new.

Benchmark results during Linux optimizations.
Benchmark results during Linux optimizations.

Building the original Source Code

Objective-C can be directly compiled and used on Linux. Basic data structures and algorithms are provided by the GnuStep framework. GnuStep is the Open Source branch of OpenStep OS which is also one of the direct ancestors of OS X. Although the development separated decades ago, it is in most cases still simple to cross compile a OS X application for Linux. What is missing are mostly utility functions and features that are back-ported from IOS like Storyboards, or the graphics API Mantle.

The port is simplified by using the same compiler (clang) for both operation systems. Moreover, recent efforts of Apple to distribute the programming language Swift on other operation systems helps. With the Linux Version of swift comes the library called libdispatch which is also necessary for Objective-C multi-threading.

Swift

Until recently I used Swift solely for the development of User Interfaces, but all to common language and API updates made me rethink this decision. Furthermore the Linux Swift ABI seems to be incompatible with GnuStep.

The Setup

For ArchLinux I’m using the default package:
clang version 3.9.0 (tags/RELEASE_390/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

The GnuStep libraries are directly build from sources (Link: github). With this setup and only few sourcecode modifications, a usable Linux build should be possible. Obviously the port suffers a lot of performance problems compared to the OS X reference. (bar 1 and 8)

The Math

Some time ago my Linux compiler started complaining that terms like If(x == x) always returns true. Up until then this was my only check for NAN results. I replaced this clause by If(isnormal(x)).

Compiler Flags

In contrast to a consumer version of an application, which has to work on any device type, it is now possible to create an optimized build for your server or Linux desktop. So instead of the default release flags:
OPT_RELEASE=-Os -fPIC -ffast-math
It is possible to use additional compiler parameters like:
OPT_RELEASE=-O3 -fPIC -ffast-math -march=native -fslp-vectorize-aggressive
Interestingly the resulting binary (bar 2) was considerably slower than the original build. Statistics collected with perf -stat indicates that the clangs SIMD optimization doesn’t work as expected (stalled backend cycles at 40%).

RANDOM

Further profiling revealed that premultiplied noise, created by a random generator was a major bottleneck. My original algorithm uses the FreeBSD variant arc4random() (libBSD). Replacing the function with the rand() from stdlib gives the first performance boost (bar 3).

RANDOM Array

The Linux rand() function is thread safe, but depends on a single source and is therefore time consuming. A small pre-computed random array can avoid this bottleneck (bar 4).

Disabling libDispatch

The simple framework described above allows a lock free read and write access on images. To utilize this feature it was necessary to replace the previous used OpenEXR library with a self-development. The new image API is optimized for vectorization and allows arbitrary channel and channel block selections. In theory this should result in a linear speedup of the calculation for each CPU core available. I therefore replaced the outer row loop with a dispatch queue. Similar to OpenMP, libDispatch creates a thread pool and divides the work between a workgroup. However, in this case the overhead of the thread creation exceeded the actual calculation time (Something that did not become apparent on macOS.). By disabling libdispatch, the Linux port first time surpasses the original algorithm performance (bar5).

OpenMP / OpenACC

Still not convinced that parallelism is out of scope I tried OpenMP (GCC, CLANG) and OpenACC (only GCC) with several options. The Dispatch:

was replaced by:

respectively with:

Finally the inner loop was reduced to:

But still the algorithm could not benefit from parallelism on Linux systems. It was not before the code was completely rewritten in C++ and OpenMP was fine-tuned for the actual input image that a minimal speed advantage could be measured:

Every worker processes 1580 lines per call. This value obviously depends on the image width and height and is used to reduce the work on two thread calls.

GCC/C++ Rewrite

GCC(g++) is the default C++ compiler on any Linux distribution. It is therefore obvious but also costly to port Objective-C code to C++. GnuStep dependencies have to be replaced by STL/Boost methods. Despite the effort the benchmark results are almost identical to the original OS X application(bar 7).

GPU Computing

Optimization with Cuda
Optimization with Cuda.

Alternatives for GPU Computing are Apples Metal Compute Shading, Nvidia Cuda, OpenCL, OpenGL Compute Shading and Vulkan Compute Shading. To save time and effort I used Nvidias Cuda for this port. Since Cuda compiles C/C++11 source code, the greatest challenge remaining was the correct selection of the datagrid and block size. (These values can be copied from one of the many image processing demos from the Cuda SDK.) The processing time of 0.357479 seconds still seems high but it includes the image copy to and from the GPU. The average kernel execution time calculated with the following code is 0.147714 seconds:

Recap

Depending on the used APIs a Linux build of a macOS application is a matter of hours. It opens the possibility for further optimizations and the use on Servers. MacOS multithreading is still better suited (faster) for desktop applications. With the rewrite of Objective-C to C++ no direct speed gain can be achieved but it allows a painless transition to HPC with Nvidias Cuda API.

Leave a Reply

Your email address will not be published.