Pagination in text layout

Now that Balkón implements line breaking, the next step is to implement page breaking.

All modern web browsers support page breaking to some extent, but they only enable it for print mode. By default, these browsers treat a webpage as a continuous, potentially very long document, to be seen through a viewport that can be smoothly moved by scrolling. Content can be cut off at the edge of the viewport, which is not considered a problem because the user can simply scroll to gradually reveal more content.

However, this approach is not well suited for media with slow response times, such as e-ink displays, or TV screens controlled by an infrared remote.

Unlike current mainstream browsers, Haphaestus will use paginated mode by default, rather than scrolling, in order to accommodate use with TV remotes. This makes page breaking a priority feature.

Terminology

Adrian occasionally refers to the pages as screenfuls, reflecting the fact that they will fit a screen.

CSS covers pagination with a more generic concept called fragmentation, which describes different ways to partition a flow of content, such as pages, columns, and regions.

Since its first milestone, Balkón has used the CSS term text fragment, shortened to just fragment, for a piece of text that fits together on one line after line breaking.

To keep that meaning unambiguous, and for simplicity, Balkón will use the terms page breaking or pagination for breaking a paragraph along the block axis into discrete pages. Balkón only requires pages to have equal width, so the same pagination mechanism can be used for specific cases of CSS columns or regions as well.

Basic operation

The most basic goal of pagination is to place as much content as possible on one page without overflowing it or breaking it in inconvenient positions.

For a simple paragraph of text, this just means counting how many whole lines can fit on one page.

After Balkón computes the layout of a paragraph, it knows the height of each of its lines. When given the height of a page (or a portion of it, if the paragraph begins somewhere in the middle), Balkón can split this layout using a simple greedy algorithm, analogous to placing words on a line of a given width.
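The greedy step can be sketched in a few lines. This is only an illustration in Python, not Balkón's actual code; the function name lines_that_fit and the representation of a layout as a plain list of line heights are assumptions made for the example.

```python
from typing import Sequence


def lines_that_fit(line_heights: Sequence[float], available: float) -> int:
    """Count how many leading lines fit into the available vertical space.

    Hypothetical helper: a paragraph layout is reduced to its line heights.
    """
    used = 0.0
    count = 0
    for height in line_heights:
        # Stop at the first line that would overflow the page.
        if used + height > available:
            break
        used += height
        count += 1
    return count
```

For example, with four lines of height 10 and 35 units of space, the sketch keeps three lines on the page and leaves one for the next page.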

An edge case occurs when the available vertical space is not enough to fit a single line. This needs to be resolved in one of two ways:

  1. By adding a page break immediately before the paragraph, so that it can start on a new page. Usually, this happens when a paragraph would otherwise have to begin near the bottom of a page.

  2. By having a line overflow the page. This is undesirable, but it may be the only way when a single line is taller than any page, for example when the font size is greater than the page height.

In order to know how to resolve this edge case, Balkón needs to know the height of a (potential) next page, so that it can see if it can fit at least one whole line of text.
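The decision between the two resolutions can be sketched as follows. The names FirstLinePlacement and place_first_line are hypothetical, chosen for this illustration; the sketch also leaves open on which page an overflowing line ends up, since that is not specified above.

```python
from enum import Enum


class FirstLinePlacement(Enum):
    FITS_HERE = "the first line fits in the remaining space"
    BREAK_BEFORE = "start the paragraph on the next page"
    OVERFLOW = "no page is tall enough, so the line must overflow"


def place_first_line(
    first_line_height: float,
    current_space: float,
    next_page_height: float,
) -> FirstLinePlacement:
    """Decide what to do with the first line of a paragraph (hypothetical helper)."""
    if first_line_height <= current_space:
        return FirstLinePlacement.FITS_HERE
    if first_line_height <= next_page_height:
        # Moving to the next page helps: the line fits there.
        return FirstLinePlacement.BREAK_BEFORE
    # Not even a fresh page can hold the line.
    return FirstLinePlacement.OVERFLOW
```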

Adding constraints

In typography, there is a balance to be found between filling a page with as much content as possible, and keeping relevant content close together. It may be desirable to prevent paragraphs from beginning too close to the end of a page, or ending too close to the beginning of a page.

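CSS provides two properties to control this, borrowing some (unfortunately morbid) typography terms: orphans sets the minimum number of lines of a paragraph that must remain at the bottom of a page before a break, and widows sets the minimum number of lines that must be carried over to the top of the next page.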

In advanced typographic systems such as LaTeX, orphan lines and widow lines can be controlled by making adjustments to line breaking, for example using the Knuth-Plass algorithm. This can make lines more tightly packed or more loosely packed, so that the same text fits on fewer lines or more lines, respectively. This can usually eliminate single orphan lines or widow lines while allowing each page to have the same number of lines, which looks very nice in books.

While there are some efforts to implement the Knuth-Plass algorithm in web browsers, this seems to be a difficult task that no one has yet completed. Instead of adjusting line breaks, Chromium simply breaks the page earlier, moving lines to the next page as necessary to satisfy the orphan and widow constraints. Firefox does not support these CSS properties at all.

Balkón implements the same simpler approach, moving the page break rather than adjusting line breaks:

  1. Provisionally insert a page break as far as possible without overflowing the current page.
  2. If the number of lines after the page break is more than zero but less than the widows value, move the page break backwards until this violation is resolved.
  3. If the number of lines before the page break is more than zero but less than the orphans value, move the page break to the beginning of the paragraph (so that the paragraph starts on a new page).

In the edge case where no page has enough space to fit the required number of lines together, the orphans and widows values are ignored (effectively set to 1), as specified by CSS.
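The three steps can be sketched as below. This is an illustrative Python rendering under stated assumptions, not Balkón's implementation: the function name choose_break is hypothetical, a paragraph is reduced to a list of line heights, and the fallback for pages too small to satisfy the constraints is deliberately left out to keep the sketch short.

```python
from typing import Sequence


def choose_break(
    line_heights: Sequence[float],
    current_space: float,
    orphans: int = 2,
    widows: int = 2,
) -> int:
    """Return how many lines stay on the current page (0 = break before the paragraph)."""
    total = len(line_heights)

    # Step 1: greedy break, as far as possible without overflowing.
    used = 0.0
    before = 0
    for height in line_heights:
        if used + height > current_space:
            break
        used += height
        before += 1

    # Step 2: too few lines after the break, so pull the break backwards.
    after = total - before
    if 0 < after < widows:
        before = max(0, total - widows)

    # Step 3: too few lines before the break, so start on a new page.
    if 0 < before < orphans:
        before = 0

    return before
```

With four lines of height 10, 35 units of space, and both values set to 2, the greedy break after three lines would leave a single widow, so the sketch moves the break back and keeps two lines on each page.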

Tying it all together

Balkón now exposes an additional API for pagination.

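The API takes a laid-out paragraph together with the pagination parameters discussed above (the space remaining on the current page, the height of the next page, and the orphans and widows values), and splits the given paragraph into a page’s worth of content, plus an unpaginated remainder.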

The API is designed to be called repeatedly, each time with the unpaginated result returned from the previous call, until all pages are returned and there is nothing left.

The API takes special care to ensure that repeated calls won’t result in an infinite loop: as long as the input is non-empty, it will produce a non-empty page, and moreover, any unpaginated remainder will always be smaller than the input. If the paragraph needs to begin on a new page, this is indicated using a PageContinuity flag, rather than empty output.
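The calling pattern and its termination guarantee can be sketched as follows. Everything here is a hypothetical stand-in written in Python for illustration: paginate_once, paginate_all and the Continuity enum are not Balkón's real names or signatures, a paragraph is reduced to a list of line heights, orphans and widows are ignored, and the choice to place an overflowing line on the new page is an assumption of the sketch.

```python
from enum import Enum
from typing import List, Sequence, Tuple


class Continuity(Enum):
    CONTINUE = "content goes on the current page"
    BREAK = "content starts on a new page"


def _fit(line_heights: Sequence[float], space: float) -> int:
    """Greedy count of leading lines that fit into the given space."""
    used, count = 0.0, 0
    for height in line_heights:
        if used + height > space:
            break
        used += height
        count += 1
    return count


def paginate_once(
    lines: Sequence[float], current_space: float, next_page_height: float
) -> Tuple[Continuity, List[float], List[float]]:
    """Split off one page of content; the page is never empty for non-empty input."""
    fit = _fit(lines, current_space)
    if fit > 0:
        return Continuity.CONTINUE, list(lines[:fit]), list(lines[fit:])
    # Nothing fits here: signal a break instead of returning an empty page.
    # If even a fresh page is too small, overflow with a single line (assumption).
    fit = max(_fit(lines, next_page_height), 1)
    return Continuity.BREAK, list(lines[:fit]), list(lines[fit:])


def paginate_all(
    lines: Sequence[float], first_page_space: float, page_height: float
) -> List[Tuple[Continuity, List[float]]]:
    """Call paginate_once repeatedly until nothing is left."""
    pages = []
    remainder = list(lines)
    space = first_page_space
    while remainder:
        # Each call consumes at least one line, so the loop terminates.
        continuity, page, remainder = paginate_once(remainder, space, page_height)
        pages.append((continuity, page))
        space = page_height  # every later call starts on a fresh page
    return pages
```

Because each call returns at least one line and a strictly shorter remainder, the loop runs at most once per line of input, even when a line is taller than every page.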

The next feature to add to Balkón will be bidirectional rich text.