I wanted to take a break from compiling kernel modules and making hardware work, so I decided to compile tRayce, a raytracer that I wrote recently, on the Pi and see exactly how much slower it was.
The first problem arose at the compilation stage: it was slow. tRayce has 8 source files and it took the Pi about 5 minutes to compile the first one. That wasn’t terribly inspiring, since I did plan to do some programming on it and didn’t have enough swords or chairs to help me tolerate long compilation times.
So I built a cross-compilation toolchain with crosstool-ng so that I could create binaries for the Pi on my laptop, which would be tremendously faster.
I used the instructions here, which were quite straightforward: the only effort required from me was waiting two hours for the toolchain to build. I did mess up a bit by forgetting to enable the C++ compiler and so had to rebuild everything from scratch. I did mess up twice, actually, as the AUR already has the required packages. Oh well.
I compiled tRayce for the ARM architecture on my laptop, transferred the executable onto the Pi using scp and launched it.
It took about 2 minutes for the scene to render (the same scene took 2.5 seconds on my laptop) and there was another problem: the scene was dim. This was probably due to the compiler messing up some floating-point instructions. I changed all float variables to double in my code (which isn’t supposed to change the performance much) and recompiled.
It was better this time: my scene was ready in 30 seconds. However, it still didn’t match the reference one I rendered on my laptop: the specular highlights on the spheres were bigger than on the reference. So there were still some floating-point issues or incorrect compiler flags.
The next day, I decided to wait through the compilation on the Pi to see if the image was still different. To my surprise, the compilation went much quicker (~30s for the whole project) for no apparent reason. The scene, on the other hand, took about 3 minutes to render, which was definitely too much and probably meant that my system only supported the soft-float ABI (floating-point operations are emulated in software with integer instructions instead of being handed to the FPU, which is quite a waste). However, the scene was now identical to the reference.
According to the Raspbian FAQ, I could compile my program so that it would do the floating-point operations on the FPU and pass function parameters around in FP registers (hard ABI). The bad news was that I couldn’t do that on a soft-ABI system: all libraries on my system would have to support hard ABI, and I did confirm that the Arch ARM build that I was using wasn’t built with hard ABI support.
I could start using Raspbian, but this could lead me down the route where I would have to compile most of the needed packages myself, as the precompiled binaries wouldn’t be supported by my system. I also could theoretically rebuild all libraries and the whole toolchain to create a version that would only be used for tRayce (according to this StackOverflow comment), but there was an easier solution.
Luckily, GCC supports the flag -mfloat-abi=softfp that, according to this page, “allows use of VFP while remaining compatible with soft-float code”. The exact mechanism of compatibility, apparently, is that the resulting program uses the FPU, but all parameters are passed around using integer registers, which does have some overhead. Nevertheless, I compiled tRayce with these flags:
CFLAGS := -g -Wall -march=armv6 -mfpu=vfp -mfloat-abi=softfp -O2
and launched it, which brought the rendering time back down to ~40s and gave the same scene as the reference.
(In unrelated news, I printed out and glued together Punnet, a printable and together-glueable case, which probably should make the Pi hotter (in both senses)).
And this is a full-sized (1280×800) scene with 4x antialiased edges and soft shadows that was rendered on the Pi in 256 seconds (nice number), which is ~12 times slower than on my laptop (strange, since the light version of this scene was rendered ~20 times slower on the Pi).