The sad state of automated layouting solutions

Aug 13, 2016

Back in 2007, I watched this talk by Håkon Wium Lie, co-inventor of CSS, about the proprietary Prince XML formatter. His question, basically: why should you have to go around clicking on pages and menus in a layout program when you could write down the rules for the layout once, then automate PDF generation by feeding it only new content whenever you need something printed? He is talking, of course, about using CSS to lay out structured text like HTML.

Since then, I’ve been waiting for the future to arrive in the open source lands, where those of us live who don’t have a ton of cash and who cringe at the mere thought of vendor lock-in.

InDesign

You would be forgiven for thinking, for a moment, about running Adobe InDesign in a headless mode (on a server, without a GUI) to lay out documents in an automated fashion. But then you are quickly reminded that InDesign in general, and InDesign scripting in particular, is basically a mess.

CSS-based solutions

Prince

  • awesome
  • proprietary and rather expensive license

DocRaptor

  • Prince as a cloud service, with pricing per document conversion (see their blog post).

See also print-css.rocks for a discussion of further commercial HTML/CSS-to-PDF engines and some tutorials.

wkhtmltopdf
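wkhtmltopdf drives a Qt WebKit engine from the command line. A minimal invocation might look like this (the file names are placeholders, and the margin values are just an illustration):

```shell
# Convert a local HTML file to PDF via Qt WebKit.
# JavaScript in the page runs before the snapshot is taken.
wkhtmltopdf \
  --page-size A4 \
  --margin-top 20mm --margin-bottom 20mm \
  input.html output.pdf
```

Since the rendering happens in a browser engine rather than a dedicated paged-media engine, support for @page rules and other print-specific CSS is limited.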

WeasyPrint

  • CSS layout engine written in Python
  • supports CSS paged media (@page etc.)
  • currently doesn’t support some of the more esoteric CSS properties that are supported in WebKit.
  • doesn’t support JavaScript
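To make the “CSS paged media” point concrete, here is a minimal sketch of the kind of stylesheet such an engine consumes; the selectors and values are illustrative, not taken from any particular project:

```css
/* One rule set drives every generated PDF page. */
@page {
  size: A4;
  margin: 20mm 15mm;
  @bottom-center {
    /* running page numbers in the footer margin box */
    content: counter(page) " / " counter(pages);
  }
}

/* Start every chapter on a fresh page. */
h1 {
  page-break-before: always;
}
```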

Chrome headless

iText

  • Java library for PDF generation; it has no built-in HTML parser, so HTML input requires a separate add-on
  • apparently not quite as high-quality as Prince

TeX

From Why is TeX still used?

The reason TeX is still used is because it is open source and beautiful, because it is the best at handling mathematical notation, and because of its inertial dominance in math and the hard sciences in Academia. The reason TeX has so, so many fixable problems after 30 years is because there is no financial incentive for anyone to fix it.

The core algorithms are pretty creaky as well. Knuth did a brilliant job coming up with efficient algorithms that could do a fairly good job at breaking lines/paragraphs/pages and could handle book-length documents even on 70s-era hardware, but they've had minimal improvement since then.

All true. Also, the way indexes, tables of contents, bibliographies, and cross-references work in LaTeX is an unholy, fragile hack. Footnotes and endnotes could use some TLC. And on 2011-era hardware we really should have some approximation of globally optimal page breaking.

Never mind all the other TeX variants. ConTeXt is the least bad, but it’s still not CSS-based, and CSS is, for better or worse, what everyone is familiar with.

P.S. This post received some updates on September 10th, 2017 and March 6th 2018.