As I strive to assemble “The Argonaut Stack” together into the “Haphaestus” web browser, I am facing serious performance problems. One thing that made a big difference (even if its not enough on its own) is to parallelize layout calculation accross multiple CPU cores! Because Haskell made this unusually trivial!
This blogpost explains how this magic works!
Haskell is a “pure-functional” programming language. This means (minor caveats for IO
) data is never mutated, all a function does is create new output data for its caller derived from the provided arguments. For the same inputs a function always returns the same output. This means the order in which Haskell code executes is (aside from performance concerns) irrelevent!
Haskell defaults to “lazy evaluation” where it procrastinates doing work until it knows that work actually needs doing, then caches results to be reused later. Its unknown whether this is actually an effective performance optimization, but I find it helpful & existing Haskell code is optimized based on this behaviour.
But you can also instruct Haskell to run code eagerly, or to parallelize the computation of 2 expressions. Since all data is immutable (they might internally change representation, but not meaning) there can be no data races. The question in Haskell isn’t what can be parallelized, the question is what should be.
Calling into C code (which I am doing) can cause issues here, but both LibICU & Harfbuzz promise to be “thread-safe”. And I have tested that my Harfbuzz language bindings are likewise.
The computation I choose to parallelize is the processing of each of an element’s children within each of the back & forth negotiations involved in laying out a webpage. This was achieved by calling parMap rseq
instead of map
to obtain the list of processed children. Where map
applies a given function to every item in a list & returns a new list with the results.
parMap
wraps map
adding an extra traversal (involving plenty of Glasgow Haskell Compiler specifics) over each item to spawn a task to compute it, including a callback specifying how much of that data should be computed. rseq
wraps seq
to have the compiler ensure the data gets computed.
Tasks are spawned by calling a special “primitive” function the Glasgow Haskell Compiler (GHC) compiles specially to defer to its “Run Time System” (RTS), a small C library embeded within every Haskell program. To do so this’ll push the task, after a couple quick checks, onto an atomic deque in a fixed-sized ringbuffer. The bottom of these atomic deques is more carefully synchronized than the top since it may be mutated concurrently with other cores.
So that another thread can repeatedly pop (or instead dequeue, since benchmarks reportedly give unexpected results) a valid task to run, or dequeue off the deque for another CPU core until all the “sparks” have been run. This thread runs in a seperate I/O scheduler, and as such is implemented as a Haskell loop wrapping another primitive. The garbage collector includes a pass to remove irrelevant sparks.
This code strongly resembles a technique pioneered by Mozilla’s Servo engine, & later adopted by both Gecko (Firefox) & Blink (Chrome).
And yet I’d have this infrastructure in my executables whether I want it or not, so I might as well use it! With Haskell’s immutability (given some care in writing language bindings) it is always safe to invoke it!
While this optimization made a huge difference in making CatTrap performant enough to layout realworld webpages, I’m still not achieving realtime speed. Writing a benchmark feeding Dracula by Bram Stoker (public domain) through my Harfbuzz language bindings indicates I have some serious performance issues there. I believe I have some memory-copies I can cut down on…