Reading 30MiB of Output

Continuing my efforts to obtain decent performance as I integrate the Argonaut Stack to form the Haphaestus web browser, I’ve now tackled my Harfbuzz language bindings, cutting down on the time spent copying data into & out of the C library.

I removed a memory copy involved in constructing fonts, yielding a surprisingly large performance improvement. I also started using unsafeDupablePerformIO over unsafePerformIO wherever that was safe & didn’t risk duplicating excessive work. But mostly I focused on rapidly converting the output back from C.
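As a sketch of that tradeoff (a toy example, not Harfbuzz-Pure’s actual code — squareViaBuffer stands in for a real foreign call):

```haskell
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafeDupablePerformIO)

-- unsafeDupablePerformIO omits the cross-thread synchronization barrier
-- that unsafePerformIO inserts, so if 2 threads race the computation may
-- simply run twice. That's fine when the work is cheap, idempotent &
-- doesn't leak resources.
squareViaBuffer :: Int -> Int
squareViaBuffer n = unsafeDupablePerformIO $ alloca $ \ptr -> do
  poke ptr (n * n)  -- pretend this is the foreign call writing its result
  peek ptr
```

The rule of thumb: reach for the Dupable variant only when re-running the wrapped action is harmless.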

Benchmarking

Haskell’s an unusual language, performance-wise! The whole concept is that without mutation or side-effects it doesn’t matter in which order your code is run. By default Haskell procrastinates work until it knows it needs doing, which I find tends to be a good default.

The challenge when that isn’t optimal (say, because you’re calling into C) is: When does Haskell think the work actually needs doing? What effect are your optimizations actually having? If you have ADHD I think you can relate!

To see what you’re doing, writing benchmarks is especially vital in Haskell! I’m using a tool called Criterion, feeding an entire novel (Dracula by Bram Stoker, 873KB & a centrepiece of vampire lore) through Harfbuzz.

Criterion is a nice commandline interface integrating:

  * flags for selecting & configuring which benchmarks to run,
  * statistical analysis (bootstrapped means & standard deviations, outlier detection) to separate signal from measurement noise, &
  * report output, including interactive HTML charts.

It’s a pretty nice tool!

Megabytes of Output

Playing with my benchmarks, it quickly became apparent that my issue was almost entirely in decoding the output arrays. My previous efforts to cut down on copying text data into Harfbuzz were effective, even with the necessary UTF-8 decoding overhead.

Thankfully all the fields of the structures in both output arrays are very conveniently 32bit aligned! Our optimizations can think in terms of arrays of 32bit (4byte) words, with negligible postprocessing overhead!

My first attempt was to switch to peekArray over the reimplementation (namely forM [0..l - 1] $ peekElemOff ptr) I had for whatever reason written. peekArray iterates over the array back-to-front, pushing onto a pure-Haskell list. This does allocate plenty of memory, but allocating into Haskell’s heap is practically free (unless, it appears, you’re dealing in megabytes) thanks to Arena Allocation. This yielded some benefit, but for the megabytes of output I was dealing with I needed something much better…
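For reference, here’s the naive loop alongside a paraphrase of how base implements peekArray — only naivePeek below is my code; the commented sketch approximates Foreign.Marshal.Array:

```haskell
import Control.Monad (forM)
import Data.Word (Word32)
import Foreign.Marshal.Array (peekArray, withArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff)

-- The reimplementation I had been using: one sequenced peekElemOff per
-- index, building the list front-to-back.
naivePeek :: Int -> Ptr Word32 -> IO [Word32]
naivePeek l ptr = forM [0 .. l - 1] $ peekElemOff ptr

-- By contrast, base's peekArray (roughly) loops back-to-front, consing
-- each element onto the result so no list reversal is ever needed:
--   peekArray size ptr = f (size - 1) []
--     where f 0 acc = do { e <- peekElemOff ptr 0; return (e : acc) }
--           f n acc = do { e <- peekElemOff ptr n; f (n - 1) (e : acc) }

-- Both decode the same list, e.g.:
--   withArray [1, 2, 3, 4 :: Word32] (naivePeek 4)
--   withArray [1, 2, 3, 4 :: Word32] (peekArray 4)
```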

After assuring myself that it was performant to copy the data somewhere not about to be freed, I next tried a few variations reimplementing peekArray to lazily dereference each word. This yielded huge performance improvements! Not much worse than calling Harfbuzz from C.
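A lazy variation might look roughly like this (a sketch of the idea, not the exact code I shipped; note the soundness caveat in the comment):

```haskell
import Data.Word (Word32)
import Foreign.Marshal.Array (withArray)  -- used in the example below
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, peekElemOff)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Each element is only read from C memory when the consumer actually
-- demands it. This is only sound if the buffer outlives every unforced
-- thunk, e.g. because the data was first memcpy'd somewhere that won't
-- be freed out from under us.
lazyPeekArray :: Storable a => Int -> Ptr a -> IO [a]
lazyPeekArray l ptr = go 0
  where
    go i | i >= l    = pure []
         | otherwise = unsafeInterleaveIO $ do
             x  <- peekElemOff ptr i
             xs <- go (i + 1)
             pure (x : xs)

-- Example: withArray [10, 20, 30 :: Word32] (lazyPeekArray 3)
-- dereferences nothing until the resulting list is consumed.
```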

Though I did still find more optimizations to implement, such as bypassing the safety checks Haskell adds around dereferencing C pointers (I copied ByteString’s code for this, complete with its Eldritch warnings).
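The trick in question is bytestring’s internal accursedUnutterablePerformIO, reproduced here as a sketch so the hazard stays visible:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (realWorld#)
import GHC.IO (IO (IO))

-- Copied (in spirit) from bytestring's internals, where it ships with
-- dire warnings: run an IO action with no sequencing, no synchronization
-- & no duplication protection whatsoever. Only defensible for plain
-- reads of memory that nothing will mutate or free.
inlinePerformIO :: IO a -> a
inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r
```

Unlike unsafePerformIO, this doesn’t even pretend to order the read against other effects, which is exactly why it’s cheap.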

But then I remembered: I need to handle small quantities of output as well as large ones (though in real life, never an entire novel in a single call). Typically I’d be shaping a handful of words, if not an entire paragraph! Thankfully an eager-dereferencing loop combines very nicely with my lazy-dereferencing loop.

Both eager & lazy evaluation are needed due to how CPU caching works. For quantities of data which fit in the CPU cache, eager evaluation performs well. But if it’s more than a few kilobytes then lazy performs better. Combining the 2 approaches yields optimal performance for both large & small amounts of data, though no better than code optimized for each case.
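Combining the 2 loops might look roughly like this sketch (chunkedPeekArray & the chunk size are illustrative, not my exact shipped code):

```haskell
import Data.Word (Word32)
import Foreign.Marshal.Array (advancePtr, peekArray, withArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Decode a cache-sized chunk eagerly with peekArray, then lazily defer
-- the rest, so multi-megabyte outputs are only materialized as they're
-- consumed while small outputs stay on the fast eager path.
chunkedPeekArray :: Storable a => Int -> Int -> Ptr a -> IO [a]
chunkedPeekArray chunk l ptr
  | l <= chunk = peekArray l ptr  -- small outputs: fully eager & cache-hot
  | otherwise  = do
      front <- peekArray chunk ptr
      rest  <- unsafeInterleaveIO $
        chunkedPeekArray chunk (l - chunk) (ptr `advancePtr` chunk)
      pure (front ++ rest)

-- Example: withArray [1 .. 10 :: Word32] (chunkedPeekArray 4 10)
```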

Finally I considered that using roughly 40x as much memory (depending on the alphabet used) for a large portion of many callers’ (including browser engines’) data is probably not optimal, so I applied GHC’s oneShot primitive to this computation. This yielded performance improvements in benchmarks, with hints that real usage might benefit more. I also reconsidered whether chunking the data helps, & it turns out that chunking does still help. Just not as much.
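oneShot lives in GHC.Exts; as a minimal illustration of what it does (consume is a made-up example, not my actual code):

```haskell
import GHC.Exts (oneShot)

-- Semantically oneShot is the identity on functions, but it promises GHC
-- the lambda will be applied at most once, so the optimizer needn't float
-- work out of it & retain the result for sharing — exactly what you want
-- when a huge decoded list should be consumed once & discarded.
consume :: ([Int] -> Int) -> [Int] -> Int
consume f = oneShot (\decoded -> f decoded)
```

Being a compiler hint, it changes allocation behaviour rather than results, which is why its benefit only shows up in measurements.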

This is where I chose to stop, & publish a new version of Harfbuzz Pure. The overhead beyond the C library (diagnosed by commenting out the core function call) is awfully small now. So now every Haskeller can benefit from faster Harfbuzz calls!

Conclusions

Typically the sort of microoptimizations I tackled here have little real value. There are usually drastically bigger wins to be had elsewhere; maybe that’s still my situation (spoiler warning: it was!). Besides, most popular compilers do most of this automatically, so you might be wasting your time on things which have zero impact. But Harfbuzz is a bottleneck for me, & short of contributing upstream, microoptimizations were all that was available to me.

Optimizing software requires a clear theory of what’s happening under the hood (opensource helps with this!), & it’s vital to take measurements to validate your theories. Especially in a language like Haskell which uses lazy evaluation!

Here the entire issue was the sheer amount of data I was benchmarking upon (arguably unrealistic for real usage), which grows roughly 40x in the output, depending on the alphabet being used. So my benchmarks instructed me to focus there.

The theory behind my optimizations is that:

  1. CPUs have a limited amount of “cache” memory which is reasonably performant, & Harfbuzz’s computations won’t always fit in it. Small & large quantities of data require different optimizations.
  2. Haskell’s peekArray is optimized very well for small quantities.
  3. memcpy takes advantage of various CPU features to have negligible overhead.
  4. There’s a paradigm clash between C & Haskell which is the primary source of performance problems, bridging between the 2 has costs.

If other Haskellers more knowledgeable than me want to tackle this problem further, please contribute! If this didn’t fix my performance issues, I’ll benchmark CatTrap next…

PostScript

While drafting this blogpost, I successfully diagnosed & fixed the motivating performance issues!

No, none of my work trying to improve CatTrap’s & Harfbuzz-Pure’s speed amounted to much, besides perhaps some theoretical energy savings. The issue was with how long it takes to download a webpage: while my test page does download quickly, it does not do so within a single rendering frame.

While I’m extremely pleased that Haskell’s laziness optimally intermingled compute & I/O without me needing to do anything, this did confuse me when diagnosing the problem!

The issue was solved by combining I/O threads, MVars, & DeepSeq to download & process the entire webpage outside the mainloop. Now I’ve got a segfault to address, which I’ve narrowed down to my FontConfig language bindings.
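The shape of that fix might be sketched like so (all names here are illustrative, not Haphaestus’s actual code; deepseq is the GHC boot library of the same name):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.DeepSeq (deepseq)

-- A worker thread downloads & fully evaluates the page — deepseq ensures
-- no lazy-I/O thunks survive to be forced mid-frame — then hands the
-- finished value to the mainloop through an MVar.
fetchOffMainloop :: IO String -> IO (IO String)
fetchOffMainloop download = do
  box <- newEmptyMVar
  _ <- forkIO $ do
    page <- download
    page `deepseq` putMVar box page  -- force everything before handover
  pure (takeMVar box)  -- the mainloop only ever blocks here, briefly
```

The key design point is that all the expensive evaluation happens on the worker thread; by the time the mainloop takes from the MVar, the value is already in normal form.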