From macOS To Linux:
Port Obstacles And Benchmarks

While porting my macOS applications and libraries to Linux over the last few years, I was surprised several times by unexpected results and different behaviors between the two operating systems.

Retinal Pigment Epithelium and Dynamic Compression Filter
Example algorithm ported (and optimized) for Linux: a Retinal Pigment Epithelium filter combined with a light-value compression filter.

The first surprise was that Objective-C's messaging (dynamic method calls) is fast and carries no notable disadvantage against C++.

In the following article I will share my experience of porting and optimizing my image-manipulation library, used for Color Essence, to Linux. I will do this by benchmarking one of my monochrome filters. The framework of the Color Essence library consists primarily of a histogram step and a modification step:
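Reduced to its core, the framework looks roughly like this (a C++ sketch; Image, Histogram, and the modify callback are simplified stand-ins for the real library types):

#include <cstddef>
#include <vector>

// Simplified stand-ins for the real library types.
struct Image {
    std::size_t width = 0, height = 0;
    std::vector<float> px;                          // HDR: one float per value
    float get(std::size_t x, std::size_t y) const { return px[y * width + x]; }
    void  set(std::size_t x, std::size_t y, float v) { px[y * width + x] = v; }
};

struct Histogram {
    float maxValue = 0.0f;
    void add(float v) { if (v > maxValue) maxValue = v; }
};

// Step 1 builds the histogram, step 2 modifies the pixels.
// The outer row loops are the ones that run in parallel;
// a single row is always processed in sequence.
void applyFilter(Image &img, float (*modify)(float, const Histogram &)) {
    Histogram hist;
    for (std::size_t y = 0; y < img.height; ++y)    // read loop
        for (std::size_t x = 0; x < img.width; ++x)
            hist.add(img.get(x, y));

    for (std::size_t y = 0; y < img.height; ++y)    // write loop
        for (std::size_t x = 0; x < img.width; ++x)
            img.set(x, y, modify(img.get(x, y), hist));
}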

The read and write loops are executed in parallel, while the inner block runs sequentially. All images are processed in HDR (float); LDR images are converted. For the benchmarks, the image Char_(4088888924).jpg (resolution: 2,372 × 3,160) (Link: commons.wikimedia) by DARREN ST0NE was used.

Algorithm

The RHD monochrome algorithm compacts unnecessary (near-black) parts of the light-value spectrum and enhances it further by adding an unobtrusive saturation. This saturation imitates the normal blood circulation in the retina and is almost unnoticeable without a direct comparison to a grayscale image. The image contrast is further improved by premultiplied noise. The algorithm has an \(\mathcal{O}(2n)\) runtime; in other words, it should not be significantly slower than a copy operation. For lack of a comprehensive survey of monochrome filters, I cannot claim that this algorithm is unique or new.

Benchmark results during the Linux optimizations.

Building the original Source Code

Objective-C can be compiled and used directly on Linux. Basic data structures and algorithms are provided by the GNUstep framework. GNUstep is the open-source branch of OpenStep, which is also one of the direct ancestors of OS X. Although the development separated decades ago, it is in most cases still simple to cross-compile an OS X application for Linux. What is missing are mostly utility functions and features that were back-ported from iOS, like storyboards, or the graphics API Metal.

The port is simplified by using the same compiler (clang) on both operating systems. Moreover, Apple's recent efforts to distribute the programming language Swift on other operating systems help. The Linux version of Swift ships with the library libdispatch, which is also necessary for Objective-C multithreading.

Swift

Until recently I used Swift solely for the development of user interfaces, but the all too common language and API updates made me rethink this decision. Furthermore, the Linux Swift ABI seems to be incompatible with GNUstep.

The Setup

For Arch Linux I'm using the default package:
clang version 3.9.0 (tags/RELEASE_390/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

The GNUstep libraries are built directly from source (Link: github). With this setup and only a few source-code modifications, a usable Linux build should be possible. Obviously, the port suffers from considerable performance problems compared to the OS X reference (bars 1 and 8).

The Math

Some time ago my Linux compiler started complaining that terms like if (x == x) always return true. Up until then this had been my only check for NaN results. I replaced this clause with if (isnormal(x)).
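As a minimal illustration (the helper function is made up; note that isnormal() also rejects infinities, zero, and denormals, which is harmless here since the fallback is zero anyway):

#include <math.h>

float checkedValue(float x) {
    /* old check, folded away under -ffast-math: if (x == x) */
    if (isnormal(x))   /* true only for normal, finite, non-zero values */
        return x;
    return 0.0f;       /* fallback for NaN, Inf, zero and denormals */
}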

Compiler Flags

In contrast to a consumer version of an application, which has to work on any device type, it is now possible to create an optimized build for your server or Linux desktop. So instead of the default release flags:
OPT_RELEASE=-Os -fPIC -ffast-math
It is possible to use additional compiler parameters like:
OPT_RELEASE=-O3 -fPIC -ffast-math -march=native -fslp-vectorize-aggressive
Interestingly, the resulting binary (bar 2) was considerably slower than the original build. Statistics collected with perf stat indicate that clang's SIMD optimization doesn't work as expected (stalled backend cycles at 40%).

RANDOM

Further profiling revealed that the premultiplied noise, created by a random generator, was a major bottleneck. My original algorithm uses the FreeBSD variant arc4random() (libbsd). Replacing this function with rand() from the standard library gives the first performance boost (bar 3).

RANDOM Array

The Linux rand() function is thread-safe, but it depends on a single, lock-protected state and is therefore time-consuming. A small pre-computed random array avoids this bottleneck (bar 4).
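A minimal sketch of such a table (the size and the per-thread cursor are assumptions, not the original code):

#include <stdlib.h>

#define NOISE_SIZE 8192                 /* assumed table size */

static float noiseTable[NOISE_SIZE];

/* Fill the table once, before the parallel loops start. */
void initNoise(void) {
    for (int i = 0; i < NOISE_SIZE; ++i)
        noiseTable[i] = rand() / (float)RAND_MAX;
}

/* Each thread keeps its own cursor, so no lock is needed. */
static inline float nextNoise(unsigned *cursor) {
    return noiseTable[(*cursor)++ % NOISE_SIZE];
}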

Disabling libDispatch

The simple framework described above allows lock-free read and write access to images. To utilize this feature it was necessary to replace the previously used OpenEXR library with a self-developed image API. The new API is optimized for vectorization and allows arbitrary channel and channel-block selections. In theory this should result in a linear speedup for each available CPU core. I therefore replaced the outer row loop with a dispatch queue. Similar to OpenMP, libdispatch creates a thread pool and divides the work between a workgroup. However, in this case the overhead of the thread creation exceeded the actual calculation time (something that did not become apparent on macOS). By disabling libdispatch, the Linux port surpasses the original algorithm's performance for the first time (bar 5).

OpenMP / OpenACC

Still not convinced that parallelism is out of scope, I tried OpenMP (GCC, Clang) and OpenACC (GCC only) with several options. The dispatch:
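(reduced to a sketch; processRow() is a hypothetical per-row worker, and the block syntax needs clang's -fblocks)

#include <dispatch/dispatch.h>
#include <stddef.h>

extern void processRow(size_t y);   /* hypothetical per-row worker */

void runRows(size_t height) {
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(height, queue, ^(size_t y) {   /* one block call per row */
        processRow(y);
    });
}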

was replaced by:
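(again a sketch, using the hypothetical processRow() from above)

#pragma omp parallel for
for (size_t y = 0; y < height; ++y)
    processRow(y);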

respectively with:
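(the OpenACC counterpart of the same sketch)

#pragma acc parallel loop
for (size_t y = 0; y < height; ++y)
    processRow(y);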

Finally the inner loop was reduced to:
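(a guess at the shape, not the original code: a plain, branch-free pixel loop)

for (size_t x = 0; x < width; ++x)
    dst[x] = modify(src[x]);   /* branch-free body the compiler can vectorize */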

Still, the algorithm could not benefit from parallelism on Linux. Only after the code was completely rewritten in C++ and OpenMP was fine-tuned for the actual input image could a minimal speed advantage be measured:
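(a sketch of the tuned pragma; the chunk size is explained below)

/* 3,160 rows / 1,580 = two chunks, i.e. two thread calls */
#pragma omp parallel for schedule(static, 1580)
for (size_t y = 0; y < height; ++y)
    processRow(y);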

Every worker processes 1,580 lines per call. This value obviously depends on the image height (3,160 rows for the benchmark image) and divides the work between just two thread calls, keeping the scheduling overhead low.

GCC/C++ Rewrite

GCC (g++) is the default C++ compiler on most Linux distributions. It is therefore obvious, but also costly, to port the Objective-C code to C++: GNUstep dependencies have to be replaced by STL/Boost equivalents. Despite the effort, the benchmark results are almost identical to the original OS X application (bar 7).

GPU Computing

Optimization with CUDA.

Alternatives for GPU computing are Apple's Metal compute shaders, Nvidia CUDA, OpenCL, OpenGL compute shaders, and Vulkan compute shaders. To save time and effort I used Nvidia's CUDA for this port. Since CUDA compiles C/C++11 source code, the greatest remaining challenge was the correct selection of the grid and block size. (These values can be copied from one of the many image-processing demos in the CUDA SDK.) The processing time of 0.357479 seconds still seems high, but it includes the image copy to and from the GPU. The average kernel execution time, calculated with the following code, is 0.147714 seconds:
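What follows is the usual CUDA event pattern; the kernel name, grid, and block here are placeholders rather than my actual filter:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the real monochrome filter is not shown here.
__global__ void monochromeKernel(float *img, int w, int h) { /* ... */ }

float timeKernelSeconds(float *devImage, int w, int h, dim3 grid, dim3 block) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    monochromeKernel<<<grid, block>>>(devImage, w, h);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);              // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 1000.0f;                     // kernel only, no copies
}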

Recap

Depending on the APIs used, a Linux build of a macOS application is a matter of hours. It opens the possibility of further optimizations and of use on servers. macOS multithreading is still better suited (faster) for desktop applications. Rewriting Objective-C in C++ achieves no direct speed gain, but it allows a painless transition to HPC with Nvidia's CUDA API.

Deploying Linux OpenGL – Applications with AppImage


The setup of AppImageKit was straightforward and, following the instructions from a short README, took about 10 minutes. During tests it became clear that the AppImage binary contains an ISO image which is in turn mounted as a hidden read-only directory in /tmp. It should be taken into account that the directory name contains a random part, so any references to resources (audio/video) have to be updated dynamically on each start. Similar to macOS (sandboxes) or iOS, I had to disable several patch/update features.

Collecting the necessary libraries

AppImage does not collect the necessary library dependencies on its own; it is necessary to copy these by hand. I wrote a small shell script to simplify the process (→GIST).

Experimental Build

An experimental image build can be downloaded from here: (→itchio). It can be executed via the command line. For lack of time, extensive testing wasn't possible.

Development Updates

Back-end prototype
YAL Example

I'm currently reworking my libraries, removing parts that became unnecessary or replacing them where better alternatives are available. I also worked on the following topics:

  • An ANTLR-based AST translator to convert Objective-C to C++14/17. Objective-C ARC is thereby replaced by smart pointers. The conversion is only worthwhile for dependency-free code.
  • Mingw-w64 support (Win 32 cross compilation).
  • Experimental GTK support.
  • Support of the Vulkan API.
  • WSI support: if present, additional window toolkits like GLUT or GLFW become unnecessary.
  • An experimental SPIR-V port of my math libraries with additional support for neural networks.

Improved Wavelength Approximation

Based on the following image I created a new wavelength approximation for simulated refractions. I used some cloud-free moments this morning to take the picture:

Fixing Spectral Equations
Magnified photo of a refracted light beam on paper.

The color values were aggregated and cleared of any background noise. The resulting function differs substantially from my previously used method and creates more realistic results:

Demo of the new wavelength approximation.

This function and its inverse will be used in upcoming versions of my applications:


If you are interested in the implementation you can contact me: mail@yousry.de.

Update 6.6.16 Playing the incompatible game

I made some minor layout updates to make my website Chrome/Chromium-friendly.

Normal Based Edge Detection

Edge Detection with normal maps.

The Canny algorithm is a standard technique for identifying edges. The detection process is divided into the following steps:

  • Denoise/Blur
  • Grey Map/Sobel Filter
  • Non Max Suppression
  • Hysteresis


I identified that most of the convolution kernels (e.g. gradient, rotation, slope) can be replaced by a normal map:


A simple normal map can be created as

\( n(x,y) = \left( v_{x,y} - v_{x+1,y},\ v_{x,y} - v_{x,y+1},\ 1 \right) \)

where \(v_{x,y}\) is the gray value of the image pixel at position \((x,y)\). The axis angle and slope can easily be extracted from this vector.

An edge is detected by the difference of two adjoining normals. The result is denoised by the angle between the normal and the world z-axis, and can be normalized as usual.

Additionally, the result can be thinned: the values of two neighboring pixels are compared, and if their gradients are orthogonal, the lesser value is deleted.
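Put together, the detector fits in a few lines. A C++ sketch (the denoise threshold is an assumption):

#include <cmath>
#include <cstddef>
#include <vector>

// Edges from normal maps. v holds gray values in [0, 1], row-major, size w * h.
std::vector<float> edgesFromNormals(const std::vector<float> &v,
                                    std::size_t w, std::size_t h) {
    auto gray = [&](std::size_t x, std::size_t y) { return v[y * w + x]; };

    // n(x,y) = (v(x,y) - v(x+1,y), v(x,y) - v(x,y+1), 1), normalized
    auto normal = [&](std::size_t x, std::size_t y, float n[3]) {
        n[0] = gray(x, y) - gray(x + 1, y);
        n[1] = gray(x, y) - gray(x, y + 1);
        n[2] = 1.0f;
        float len = std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
        n[0] /= len; n[1] /= len; n[2] /= len;
    };

    std::vector<float> edge(w * h, 0.0f);
    for (std::size_t y = 0; y + 1 < h; ++y) {
        for (std::size_t x = 0; x + 2 < w; ++x) {
            float a[3], b[3];
            normal(x, y, a);
            normal(x + 1, y, b);                  // the adjoining normal
            float dot = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
            float e = 1.0f - dot;                 // difference of the normals
            if (a[2] > 0.999f && b[2] > 0.999f)   // denoise: both normals
                e = 0.0f;                         // barely leave the z-axis
            edge[y * w + x] = e;
        }
    }
    return edge;   // normalize and thin as described above
}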

In my tests the normal-map approach is about three times faster than Canny edge detection, and I consider the results of both algorithms on par. If you still use the original C implementation of Canny edge detection, I would advise you to check your application for possible buffer overruns (for example with Valgrind).

Color Essence: Creating a Reduced Color Palette


Color Essence is a little application that I recently submitted to Apple's App Store. It takes the idea from my post Caramelized Artwork and transforms it into a tool for artists: the color distribution of an image is used to create a reduced color palette. In contrast to the original tone-mapping technique, the color spectrum is compressed even further. This palette can then be used for digital or analogue paintings.

The resulting palette consists of:

  • A middle ring of the most frequent colors.
  • An overemphasized outer ring for the case of faded colors in the original, and
  • A gradient for drawing programs without color-blending options.

Here is an example of a generated palette:


It was created from a painting by Theo van Doesburg:


Creation Steps

The following steps are applied to create the palette:

Average Blur

To avoid the influence of artifacts from lossy compression or other flaws of the original image, the input can optionally be blurred.

Histogram

A histogram over the HSV color space is generated. The advantage of HSV over other models like CMY is that vector operations can easily be performed on it (the necessary algebra can be applied directly).

Frequency Distribution

To identify relevant subdomains, a frequency distribution over the 3D space [360, 1, 1] is calculated. The pixels of the image are used as events.

Each subdomain represents a color space with an additional weight parameter. The hue spectrum is then divided according to the subdomain weights.
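A rough sketch of the counting step (the bin resolution of one degree per hue value and the weighting scheme are assumptions):

#include <array>
#include <cstddef>
#include <vector>

// HSV pixel: h in [0, 360), s and v in [0, 1].
struct HSV { float h, s, v; };

// Frequency distribution over hue: every pixel is one event,
// and each bin weight is the share of pixels falling into it.
std::array<float, 360> hueWeights(const std::vector<HSV> &pixels) {
    std::array<float, 360> bins{};
    for (const HSV &p : pixels)
        bins[static_cast<std::size_t>(p.h) % 360] += 1.0f;
    for (float &b : bins)
        b /= static_cast<float>(pixels.size());     // subdomain weight
    return bins;
}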

Optionally a white and a black color filter can be applied to filter highlights and the black level minimum:


Analogue Paint

The calculated colors of the palette can be mapped to paint (or rather to the paint's name). If the color deviation exceeds the 1.0‰ margin, a color mixture is calculated:


Purpose

The colors of an image are as important as its geometry. With respect to realism, the best approach is to copy them from an existing photo. Otherwise, without visible color mixing, a lot of an artistic style can be lost. I created this application to keep this individuality of color perception and creation alive.

A better parrot algorithm for improved scene-graph interactions (with examples)



A parrot application records a voice sample and plays it back with some delay. Often a pitch shift is applied to mimic an animal or a fantasy figure, and a silence detector is used as a trigger to enable or disable the recording. These applications are often great fun for infants.

I decided to pick up this idea and experiment with some modifications.

Generating a probabilistic state machine for voiceprints


The use of fast Fourier transforms is typical for signal processing. With this method, short signals can be analyzed, modified, and replicated. However, if the goal is to recombine different fragments of sentences, this technique cannot be applied.

To realize a randomized selection, I record an audio stream directly into a probabilistic state machine, where nodes are divided or merged on demand (the graph starts with a single node, the "root").

Besides their value (the amplitude of a sinusoid), the nodes of the graph contain a weight. The context of a node is given by its predecessors. After each traverse the weight of a node is increased. If the weight of a node becomes insignificant, it can be merged with its nearest neighbor. If the next recorded audio frame would end in a dead end, a new node can be created, or (depending on the available memory) the recording is reverted to the root node (the root node is context-free).

The distance between two nodes can thereby be represented as a two-dimensional vector of their values (the gradient).

After the recording, the “learning” pass of the machine is finished.

If the stream re-runs on the same machine, it can be processed optimally; any other stream should produce errors.
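In code, a node and one traversal step of such a machine could look like this (a sketch; the names, the tolerance, and the dead-end handling are assumptions):

#include <cmath>
#include <vector>

// One node of the probabilistic state machine.
struct Node {
    float value  = 0.0f;           // the amplitude of a sinusoid
    float weight = 0.0f;           // increased on every traverse
    std::vector<Node *> next;      // successors; the context is the path here
};

// One traversal step: follow the successor closest to the incoming
// frame, or fall back to the context-free root on a dead end
// (where the real machine may instead create a new node).
Node *step(Node *current, Node *root, float frame, float tolerance) {
    Node *best = nullptr;
    float bestDiff = tolerance;
    for (Node *n : current->next) {
        float d = std::fabs(frame - n->value);   // distance via the gradient
        if (d < bestDiff) { bestDiff = d; best = n; }
    }
    if (best == nullptr)
        return root;               // dead end
    best->weight += 1.0f;          // reinforce the taken path
    return best;
}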

Example: A Voice Drum-Machine

Instead of passing a stream to detect congruence, it is also possible to predict or create a new audio stream. To do so, Gaussian-distributed inputs are applied to the machine. As a result, different phonemes should be played according to their commonness.

Here is the result of such an experiment (all audio sources are in the public domain):

Example: The improved parrot

For the improved parrot it is additionally necessary to detect phases of silence (a flat gradient). The silence can then be used as an entry point for the playback. In this example the parrot is pitched up only slightly, by one cent:

Example: The Schizo(phrenia) mode

If the state-machine output is applied to the audio output several times, for example by the state machine called shizo, unintelligible gibberish is created.

In contrast to other implementations, this state machine creates loop-free recordings. The result can be further improved by replacing the random distributions with low-discrepancy sequences to avoid repetitions:

Legal Notice (Original work)

All audio sources used are in the public domain:

https://librivox.org/les-chansons-de-bilitis-by-pierre-louys/
https://librivox.org/sammlung-deutscher-gedichte-007-by-various/
https://librivox.org/multilingual-poetry-collection-020-by-various/

My derived work:

drum.mp3
parrot.mp3
shizo.mp3

is licensed under Attribution-NonCommercial 4.0 International

Short Guide: Linux to Windows (32/64 bit) cross compilation in less than five minutes


It is advised to use a VM or nonessential system for this task!

Install additional packages

On Debian this could be done with:
apt-get install mingw-w64
apt-get install wine1.7

Or on Arch-Linux with:
pacman -S mingw-w64
pacman -S wine

Mingw-w64 is a free Win32/64 runtime and toolchain, and Wine is a Windows compatibility layer.

It may also be necessary to add additional repositories to the package manager.

Test
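As a test candidate, the classic minimal C program is sufficient:

#include <stdio.h>

int main(void) {
    printf("Hello World\n");
    return 0;
}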

Compile it as usual for the host system:

clang helloWorld.c -o hello

or

gcc helloWorld.c -o hello

For the guest build you can use:

x86_64-w64-mingw32-gcc helloWorld.c -o hello.exe

If you prefer clang you can compile the code with:

clang --target=x86_64-w64-mingw32 helloWorld.c -o helloClang.exe

For 32-bit systems the build target can be replaced with i686-w64-mingw32:
clang --target=i686-w64-mingw32 helloWorld.c -o helloClang.exe

The result should look like this:

or on the guest system:

Libraries

Most necessary libraries are already included with Mingw-w64. For others, appropriate build scripts to create static or dynamic libraries are often available.

Example: GLFW

The following image shows the concurrent execution of an X11, a Wayland, and a Windows build (left to right).

Floating Point Arithmetic

Another “dreaded problem” occurred this week, which I'm still struggling with. In contrast to Java's simplified math types, C offers many more possibilities to represent floats. This is problematic in particular when you have to map between the different formats.

For Java I have following rule of thumb:

  • Int(eger) for numbers in \(\mathbb{Z}\).
  • Double for floating-point arithmetic (on the Dalvik VM I prefer float).
  • Avoid mixing int with double.
  • Map to Number for DB access.

(Honestly, I can't remember ever reading something like for (char i = 0; i < 5; i++) in Java source code, and since no pointer arithmetic is necessary, long values appear rarely, mostly for hashes and DB operations.)

A Java double is represented by 64 bits; in YAOgl I'm using the following resolutions:

  • For OpenGL 32 Bit (24 Bit Mobile)
  • For 3D-Models 32 Bit
  • For Positioning, Timing and Animation 32/64 Bit

And notably, my

  • CPU-realtime-optimized version of the Bullet physics engine uses 16-bit floating-point arithmetic.

This version uses a SIMD (multimedia) instruction set for the corresponding target platform (the machine code handles multiple calculations at once). Without this optimization, a simulation would take too much computation time.

The 16-bit limitation also narrows the range in which a usable simulation can take place. Even worse, bijective mapping between floating-point formats of different resolutions produces nonsense! (Objects start to jitter or slowly drift away.)

If you ever heard of a successful implementation, I would doubt the source. If you combine the different resolutions into a set \(R\), then

the set of numbers \(R\) doesn't form a ring: its operations are not associative, and consequently the distributive property does not hold either.

As a result it is necessary to align the important/active parts of a scene graph to the usable simulation scale and range.

Since I have to handle very small and very large dimensions in B.I.S (the living room, the play board, and the tiny shop on Irata), I'm currently working on a fast rescale function. This includes texture mapping and sound.

Update 12/12/13

It is now possible to rescale the scene graph at runtime with [YAWorld setScale:(float)scale]. Models, textures, lights, and the viewport are then updated for the next frame. The 3D audio is mapped according to the actual camera position (or better, your ears), so no changes are necessary there.

What else was done this week

Capsule collision.

Here is a short video (sorry for the silly references, but the opt-out function seems to be disabled):