On Typesetting Engines: A Programmer's Perspective

Prologue

Typesetting is “architecture in two dimensions.”

If text and its fonts are the materials of the building, then typesetting is the drawings of the building.

Typesetting is a big topic. It is both an art and an engineering technique, and it has evolved significantly with the advent of digital technology. Obviously I cannot cover it in one post; even a book would not be enough.

Among many typesetting concepts, the typesetting engine is one of the core concepts. Basically, a typesetting engine is a piece of software that decides how the glyphs, graphics, tables, etc. are laid out for printing or digital display.

When PPResume was launched, some people asked me why I chose LaTeX as the default typesetting engine for PPResume. Hmmm, this is a big topic.

In this post, I would like to explore the pros and cons of some popular typesetting engines (HTML/CSS, LaTeX, LaTeX.js, Typst and react-pdf) and explain why PPResume chose LaTeX as the default typesetting engine.

But before we start, let us agree on some terms that will be used throughout the whole post. Yes, this is a long post and it takes time and energy to read. Don’t complain to me later. I warned you here!

Glossary:

The Assessment Criteria

Each typesetting engine has its strengths and weaknesses, catering to different needs and preferences. Web based typesetting with HTML/CSS is extremely flexible and responsive, ideal for SEO and interactive content. LaTeX.js provides a bridge between the web and LaTeX, while LaTeX itself is the gold standard for academic and high-precision typesetting. Typst is considered a modern, improved alternative to LaTeX. React-pdf allows dynamic PDF generation with React. The choice of typesetting engine depends very much on the specific requirements of the project.

I am not a designer so I cannot talk too much about typesetting from the perspective of art. Instead, I want to discuss some technical things about typesetting engines from a programmer’s perspective. Meanwhile, this post is not an academic benchmarking report, so I won’t evaluate every aspect of typesetting engines. Instead, I will give some assessment criteria based on PPResume’s requirements.

When I wrote the first line of code for PPResume, I set two goals:

To produce top notch, high quality PDFs, the typesetting engine must have a top tier line breaking algorithm. To provide native support for multiple languages, it must also support languages with a huge character set, such as Chinese, Japanese and Korean (aka CJK). Let us evaluate these two criteria before we dive into specific typesetting engines.

Wait a minute, I almost forgot: to produce a PDF, the typesetting engine must support pagination. You may ask: is there any typesetting engine that does not support pagination? The answer is neither a yes nor a no; it depends on whether you consider HTML & CSS to be a typesetting engine. We will talk more about this later when we talk about HTML & CSS.

Finally, it would be better if PPResume could offer an excellent user experience. Of all possible features, I believe instant preview is the most wanted one.

In a nutshell, I will judge a typesetting engine by checking whether it meets the following assessment criteria:

  1. Knuth Plass line breaking algorithm
  2. CJK typesetting
  3. Pagination
  4. Instant Preview

The Sacred Line Breaking Algorithm

Line breaking algorithms are one of the core techniques used in typesetting engines. They play a crucial role in determining how text is arranged on a page or screen.

The primary purpose of a line breaking algorithm is to determine the optimal points at which to break lines of text in a paragraph. Line breaking algorithms are essential to digital typesetting and form a core component of any system that needs to present text in a visually appealing and readable format.

There are 3 key metrics that are used to assess the quality of a line breaking algorithm:

  1. Justification: line breaking algorithms work in conjunction with justification techniques to create evenly spaced lines of text.
  2. Hyphenation: many advanced algorithms incorporate hyphenation to improve line breaks, especially for languages with long words.
  3. Optimization: the algorithm typically tries to minimize unsightly gaps or overly tight spacing between words across an entire paragraph.

There are two categories of line breaking algorithms:

  1. Minimum number of lines: a greedy algorithm that puts as many words on a line as possible, then moves on to the next line and does the same until there are no more words left to place. This method is used by many modern word processors, such as LibreOffice Writer and Microsoft Word.
  2. Minimum raggedness: a dynamic programming algorithm, first used in TeX, that minimizes the sum of the squares of the lengths of the spaces at the end of lines. This produces a more aesthetically pleasing result than the greedy algorithm, which does not always minimize squared space.
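
To make the difference between the two categories concrete, here is a minimal Python sketch of both. It is a toy model, not TeX's actual algorithm: it assumes monospaced text where widths are counted in characters, a single space between words, no hyphenation, and no stretchable or shrinkable spaces. The function names greedy_break, min_raggedness_break and raggedness are made up for illustration.

```python
from typing import List


def raggedness(lines: List[str], width: int) -> int:
    """Sum of squared trailing spaces over all lines (lower is better)."""
    return sum(max(width - len(line), 0) ** 2 for line in lines)


def greedy_break(words: List[str], width: int) -> List[str]:
    """Minimum number of lines: fill each line with as many words as fit."""
    lines: List[str] = []
    current = ""
    for word in words:
        candidate = word if not current else current + " " + word
        # An overlong word on an empty line is accepted as-is.
        if not current or len(candidate) <= width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def min_raggedness_break(words: List[str], width: int) -> List[str]:
    """Minimum raggedness: dynamic programming over all feasible breaks."""
    n = len(words)
    # best[i] is the lowest cost of laying out words[i:];
    # split[i] is the index where the first line of words[i:] ends.
    best = [float("inf")] * (n + 1)
    split = [n] * (n + 1)
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space added before the first word
        for j in range(i + 1, n + 1):
            length += 1 + len(words[j - 1])
            if length > width and j > i + 1:
                break  # words[i:j] no longer fits on one line
            slack = max(width - length, 0)
            cost = slack * slack + best[j]
            if cost < best[i]:
                best[i] = cost
                split[i] = j
    lines: List[str] = []
    i = 0
    while i < n:
        lines.append(" ".join(words[i:split[i]]))
        i = split[i]
    return lines
```

For the words "aaa bb cc ddddd" and a line width of 6, the greedy version fills the first line completely and leaves "cc" alone on the second line, while the dynamic programming version moves "bb" down to sit with "cc", which lowers the total squared slack from 17 to 11.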

Technically speaking, the minimum number of lines algorithm is faster, while the minimum raggedness algorithm produces a more visually pleasing result. Let me show you an example. In the following image, the top half is a LibreOffice document using the “minimum number of lines” approach, while the bottom half is a PDF document generated by TeX using the “minimum raggedness” approach. You can easily see that the bottom half looks less ragged on the right margin and more visually appealing, simply because the line breaking is more balanced and justified.

Knuth Plass Line Breaking Algorithm

Among all line breaking algorithms, the Knuth Plass line breaking algorithm is the gold standard for the minimum raggedness approach. It is widely adopted by various typesetting engines such as TeX, SILE and Typst.

Back to PPResume’s case: one of the design goals for PPResume is to produce top notch, high quality PDFs, so the chosen typesetting engine must have a more visually appealing line breaking algorithm. In other words, it must adopt the Knuth Plass line breaking algorithm.

CJK Typesetting is Complicated

Typesetting for CJK (Chinese, Japanese, and Korean) languages is generally considered to be more complicated than for Latin script languages, and there are several reasons for this. Here is a classic discussion from the koreader project.

TL;DR: if you don’t want to delve into the details, you can check out the following W3C draft notes to get an intuitive sense of the complexity of typesetting requirements for CJK:

CJK Character Set is Huge

The root cause of this complexity is that the character set for CJK languages is much larger than that of Latin script languages. According to the CJK Unified Ideographs, as of Unicode 16.0, Unicode defines a total of 97,680 CJK ideographs. This is insanely huge. In contrast, the Latin alphabet only has a few hundred characters. Hmmmm, nearly 100k characters: even creating a font that covers all of them is a huge amount of work, labor-intensive and very expensive.

CJK Characters

Taking PPResume as an example, we met two issues (1, 2) where the fonts recommended by CTeX are missing some characters. Unlike Latin script languages, there are very few fonts that have full coverage of the entire CJK character set, and most of them are commercial. Noto is one of the few exceptions that both has good coverage of CJK characters and is free to use.

Cultural Nuances

Each CJK language has its own set of typographic conventions that must be followed, and these can vary greatly from culture to culture and context to context. For example, punctuation placement and spacing rules differ between Chinese, Japanese, and Korean texts. Even the quotation mark is used with completely different conventions across CJK:

In Japan, corner brackets are used.

In South Korea, corner brackets and English-style quotes are used.

In North Korea, angle quotes are used.

In mainland China, English-style quotes (full width “ ”) are official and prevalent; corner brackets are rare today. The Unicode code points used are the English quotes (rendered as fullwidth by the font), not the fullwidth forms.

In Taiwan, Hong Kong and Macau, where traditional characters are used, corner brackets are prevalent, although English-style quotes are also used.

In the Chinese language, double angle brackets are placed around titles of books, documents, movies, pieces of art or music, magazines, newspapers, laws, etc. When nested, single angle brackets are used inside double angle brackets. With some exceptions, this usage parallels the usage of italics in English:

「你看過《三國演義》嗎?」他問我。

“Have you read Romance of the Three Kingdoms?”, he asked me.
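
The remark about code points is easy to check with Python's standard unicodedata module. This small sketch only looks up the official Unicode names of the marks mentioned above; the helper name describe and the labels in QUOTE_MARKS are made up for illustration.

```python
import unicodedata
from typing import List

# Marks discussed above, written as escapes so the code points are explicit.
QUOTE_MARKS = {
    "English-style double quotes": "\u201c\u201d",
    "corner brackets": "\u300c\u300d",
    "double angle brackets (titles)": "\u300a\u300b",
    "fullwidth quotation mark (not what mainland China uses)": "\uff02",
}


def describe(chars: str) -> List[str]:
    """Return 'U+XXXX NAME' for every character in chars."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in chars]


if __name__ == "__main__":
    for label, chars in QUOTE_MARKS.items():
        print(label)
        for line in describe(chars):
            print("  " + line)
```

The English-style quotes live in the General Punctuation area (U+201C and U+201D), so whether they appear full width in Chinese text is decided by the font, not by the code point.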

Font Pairing

When mixing CJK with Latin script languages, things become more complicated.

Firstly, punctuation differs. For example, the comma has different forms in Chinese and English:

English uses the comma (,) as a separator between parts of a sentence and between items in a list, while Chinese uses a Chinese comma (，) to separate parts of a sentence, and a dedicated enumeration comma (顿号, 、) to separate items in a list.

Multi Languages Support

Meanwhile, a Latin font may cover only about a thousand glyphs, whereas a CJK font must cover many thousands of glyphs, as mentioned above.

Effective typesetting often requires CJK fonts to be paired with Latin fonts to maintain visual consistency. This can be challenging as it requires combined fonts that intelligently switch between character sets.

So Chinese, Japanese and Korean fonts tend to be developed by Asian designers, with an understandable emphasis on the elegance of the Asian characters. Unfortunately this can be at the expense of the design of the Latin letters, which may in some cases be really quite ugly.

The solution? Use an attractive Latin script font for any Latin letters and numbers, and an Asian font for the Chinese, Japanese or Korean characters. Rather than making the poor typesetter manually change the font each time a Latin letter or number appears, applications such as InDesign allow Combined Fonts to be set within a document which intelligently switch the font according to the nature of each letter or character.

Typesetting conventions and best practices for CJK (Chinese, Japanese, Korean)

Not all typesetting engines have built-in support for font pairing, but this is essential for PPResume to provide native support for multiple languages.
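
To give a feeling for what “intelligently switching fonts” involves, here is a minimal Python sketch that splits a string into runs and assigns a font to each run. It is only an illustration under loud assumptions: the is_cjk check covers just a few common Unicode blocks, the font names are placeholders, and this is not how InDesign or any particular engine implements combined fonts.

```python
from typing import List, Tuple


def is_cjk(ch: str) -> bool:
    """Rough check covering a few common blocks only (not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF     # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF  # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3  # Hangul Syllables
        or 0x3000 <= cp <= 0x303F  # CJK Symbols and Punctuation
    )


def split_font_runs(
    text: str,
    latin_font: str = "Latin Font",  # placeholder font name
    cjk_font: str = "CJK Font",      # placeholder font name
) -> List[Tuple[str, str]]:
    """Group consecutive characters that should share the same font."""
    runs: List[Tuple[str, str]] = []
    for ch in text:
        font = cjk_font if is_cjk(ch) else latin_font
        if runs and runs[-1][0] == font:
            # Same font as the previous character: extend the current run.
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs
```

For example, split_font_runs("PPResume支持CJK") yields three runs: "PPResume" in the Latin font, "支持" in the CJK font, and "CJK" in the Latin font again.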

In summary, the insanely huge size of CJK character sets, cultural nuances and technical challenges contribute to the greater complexity of typesetting CJK languages compared to Latin script languages.

HTML & CSS

Technically speaking, HTML (Hypertext Markup Language) is not a typesetting engine, but a markup language used to create the structure and content of web pages. It’s designed to define the structure of a document, such as headings, paragraphs, lists, links, and so on.

While HTML can indirectly influence how text appears on a page (e.g. by using the obsolete font tags), it cannot handle the complex tasks of typesetting, such as:

HTML itself cannot function as a typesetting engine. However, HTML & CSS (Cascading Style Sheets) together can be considered a rudimentary typesetting engine.

Although not as sophisticated as dedicated typesetting engines such as LaTeX or InDesign, HTML & CSS provide a flexible way to control the layout and appearance of text on web pages.

By combining HTML & CSS, you can achieve a wide range of text formatting and layout effects. However, for more advanced typesetting tasks, such as complex mathematical equations or precise control over typography, dedicated typesetting engines may be more appropriate.

There are many resume builders on the market that use HTML & CSS as their typesetting engine. Most are commercial, with only a few being free or open source:

| Website | Technique | Type |
| --- | --- | --- |
| https://resume.io | HTML Canvas | Commercial |
| https://flowcv.com/ | HTML & CSS | Commercial |
| https://www.visualcv.com/ | HTML & CSS | Commercial |
| https://standardresume.co/ | HTML & CSS | Commercial |
| https://zety.com/ | HTML & CSS | Commercial |
| https://rxresu.me | HTML & CSS | Free & open source |

On the one hand, from a business perspective, given the market is so crowded, it is not wise for me to create another resume builder that uses HTML & CSS as the typesetting engine.

On the other hand, from an engineering perspective, HTML & CSS does not implement the Knuth Plass line breaking algorithm, so it cannot meet PPResume’s needs.

Line Breaking

In fact, standard CSS does provide some options for adjusting text justification:

Firefox even provides a text-justify option to set what type of justification should be applied to text when text-align: justify; is set on an element. However, this option is only available in Firefox.

However, none of them apply proper hyphenation, so they cannot produce the same visually appealing result as a real Knuth Plass line breaking algorithm. Hacker News has a valuable discussion about why modern browsers are too lazy to implement the Knuth Plass line breaking algorithm.

There are also a few JavaScript implementations of the Knuth-Plass line breaking algorithm, but none of them seem to be production ready:

CJK

HTML & CSS, or rather the browser, supports CJK, that’s for sure; otherwise the browser couldn’t be the most widely adopted information platform in the world. However, this doesn’t mean that every page containing CJK follows typesetting best practices.

For example, it is highly recommended to put some space between CJK and Western characters. Plain HTML & CSS cannot do this automatically; it needs the help of JavaScript.
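
The idea itself is simple enough to sketch. In a browser this would be written in JavaScript, but the logic is the same in Python: insert a space at every boundary between a CJK character and a Latin letter or digit. The character ranges below are a rough assumption covering only common blocks, and the function name add_inter_script_space is hypothetical; real tools handle many more cases, such as punctuation.

```python
import re

# Rough character classes (assumption: common blocks only).
CJK = r"\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7a3"
LATIN = r"A-Za-z0-9"

_CJK_THEN_LATIN = re.compile(f"([{CJK}])([{LATIN}])")
_LATIN_THEN_CJK = re.compile(f"([{LATIN}])([{CJK}])")


def add_inter_script_space(text: str) -> str:
    """Insert one space wherever CJK text directly touches Latin text."""
    text = _CJK_THEN_LATIN.sub(r"\1 \2", text)
    return _LATIN_THEN_CJK.sub(r"\1 \2", text)
```

Text that is already spaced is left unchanged, so the function can safely be applied more than once.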

In general, it takes extra effort in order to follow best practices for CJK typesetting in the browser. As mentioned above, Requirements for Chinese Text Layout 中文排版需求 is a pretty good and authoritative reference, and one of the authors, Chen Yijun, has published an open source project called Han which provides a pretty nice implementation if you want to typeset CJK with best practices.

Han.css

Pagination

HTML & CSS is not designed for paginated documents, though with the help of JavaScript it can simulate them (oh-my-cv provides a good reference implementation). HTML documents are essentially responsive: they flow like water and adapt to viewports of any size.

Instant Preview

HTML & CSS can have instant preview if the resume generation process happens only on the client side. Otherwise, if it happens on the server side, there is a round trip from request to response and hence no instant preview.

Conclusion

Before we conclude, I couldn’t resist showing you an excellent example of how HTML & CSS typesetting can be pushed to its limit. It uses text-align: justify and hyphens: auto to get an optimal, aligned layout for paragraphs. This is almost the best that HTML & CSS can do. If you ever want to do some typesetting with HTML & CSS, this would be a very good reference.

In summary, while it is theoretically possible to get top notch typesetting out of HTML & CSS, on par with dedicated typesetting engines, the effort would be enormous and there may also be browser compatibility issues. So, for the time being at least, if top notch typesetting is required, it is still recommended to use a dedicated typesetting engine instead of tuning HTML & CSS by hand.

LaTeX

TeX is a typesetting system created by Donald Knuth in the late 1970s. It is designed for the creation of high quality typeset documents, particularly those containing complex mathematical and scientific notation. TeX is a low-level system that requires the user to write commands in a specific language to format documents. It has its own set of rules and macros for formatting text, and it is highly customizable and extensible.

LaTeX, on the other hand, is a document preparation system that is built on top of TeX. It was created by Leslie Lamport in the early 1980s to simplify the document preparation process. LaTeX provides a set of higher-level macros on top of TeX’s lower-level programming language, making it easier and more intuitive to use.

One of the most frequently asked questions is: why use LaTeX instead of a word processor like Microsoft Word? The TL;DR answer is: “for beauty”. Dario wrote an excellent post, The Beauty of LaTeX, with dozens of examples showing the nitty-gritty typesetting differences between Microsoft Word and LaTeX. No need for me to repeat them here.

In summary, for professional typesetting, LaTeX excels in the following features:

Line Breaking

TeX has the golden line breaking algorithm—the Knuth Plass line breaking algorithm. After all Knuth is the author of TeX, right?

As mentioned above, the Knuth Plass line breaking algorithm does its best to produce a more aesthetically pleasing result by reducing the raggedness to a minimum.

Under the hood, the Knuth Plass line breaking algorithm takes a “total-fit” approach, in contrast to the “first-fit” approach used by many other systems. This means:

This allows TeX to produce more visually appealing and balanced paragraphs overall.

Meanwhile, unlike many systems that treat hyphenation separately, TeX’s line breaking algorithm integrates hyphenation decisions directly. This allows for more optimal placement of hyphens in the context of the entire paragraph.

Overall, TeX’s line breaking algorithm is considered one of the most sophisticated and effective approaches to typesetting, and its core principles continue to influence modern typesetting systems and remain at the forefront of high-quality digital typography.
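
For readers who want to see what “total-fit” actually optimizes, here is a rough sketch of the cost model as I understand it from The TeXbook. The symbol names are my own shorthand, and TeX adds a few extra terms (for example for consecutive hyphenated lines) that I leave out here, so treat this as an outline rather than a specification.

```latex
% r_j: adjustment ratio of line j (how much its spaces stretch or shrink)
% b_j: badness of line j
% p_j: penalty at the breakpoint ending line j (e.g. for a hyphen)
% l:   the line penalty constant
\[
  b_j \approx 100\,|r_j|^3
\]
\[
  d_j =
  \begin{cases}
    (l + b_j)^2 + p_j^2 & \text{if } 0 \le p_j < 10000 \\
    (l + b_j)^2 - p_j^2 & \text{if } -10000 < p_j < 0 \\
    (l + b_j)^2         & \text{if } p_j \le -10000
  \end{cases}
\]
\[
  \text{choose the breakpoints that minimize } \sum_j d_j
  \text{ over the whole paragraph}
\]
```

Because the sum runs over every line of the paragraph, a slightly worse break early on can be accepted if it makes later lines much better, which a first-fit algorithm can never do.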

CJK

Regarding CJK typesetting, LaTeX has pretty good support for CJK with the help of some new engines and some packages:

For example, the xeCJK package provides the following commands to set fonts for CJK:

xeCJK also provides options for specifying punctuation styles for CJK, spacing between CJK and non-CJK characters, etc.

Overall, LaTeX’s CJK support is now quite mature, although it may take some time to set up in different environments. Here’s a manual page from The XeTeX Companion: TeX meets OpenType and Unicode, where you can get a glimpse of XeTeX’s CJK typesetting abilities.

XeTeX for CJK

Pagination

LaTeX is designed from the ground up for typesetting paginated documents, so yes, it has excellent support for pagination. You can easily adjust paper size, orientation, margins, etc.

Check the geometry package for details.

Instant Preview

LaTeX by default runs on the server side so there would be a round trip time from the request to generate the PDF to the response for the generated PDF.

Using LaTeX as the typesetting engine means that we lose the ability to preview instantly. However, there are ways to mitigate this. The magic is WebAssembly.

There have been some efforts to compile LaTeX to WebAssembly (aka wasm) so that it can run purely in a browser:

Although none of the above are actively maintained, it is theoretically possible to run LaTeX purely in a browser. This would remove the round trip from browser to server, and we could get instant previews then.

Conclusion

Before concluding, I would like to share a bit of off-topic information. There are very few LaTeX based resume builders on the market:

From a business perspective, this is a niche market and not too crowded, so it might be worthwhile for me to create another LaTeX based resume builder.

OK time to conclude LaTeX.

LaTeX.js

LaTeX.js is a LaTeX to HTML5 translator that aims to render LaTeX documents directly in the browser without the need for server-side processing.

It provides a very impressive playground: you enter some LaTeX code on the left, and it renders that code into a pretty nice HTML document on the right.

LaTeX.js Playground

Line Breaking

LaTeX.js does not use Knuth Plass line breaking but instead uses text-align: justify to minimize the raggedness for paragraphs.

Meanwhile, it also inserts soft hyphens (&shy;) and relies on hyphens: manual for better line breaking.

Although these techniques produce a much better visual result than normal HTML, it is still not true Knuth Plass line breaking.
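
As a rough illustration of the soft hyphen technique (this is not LaTeX.js’s actual code), the sketch below inserts the soft hyphen character U+00AD, which is what the HTML entity &shy; stands for, at given positions in a word. The break positions are assumed to come from a hyphenation dictionary; here they are simply passed in by hand, and the function name add_soft_hyphens is made up.

```python
from typing import List

SOFT_HYPHEN = "\u00ad"  # the character behind the HTML entity &shy;


def add_soft_hyphens(word: str, break_points: List[int]) -> str:
    """Insert a soft hyphen before each index listed in break_points."""
    pieces: List[str] = []
    start = 0
    for point in sorted(break_points):
        pieces.append(word[start:point])
        start = point
    pieces.append(word[start:])
    return SOFT_HYPHEN.join(pieces)
```

With hyphens: manual, the browser may only break a word at these marked positions, and the hyphen becomes visible only where a break actually happens.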

CJK

LaTeX.js supports CJK because it is just a transpiler on top of HTML & CSS. However, just like HTML & CSS, it doesn’t follow CJK best practices, and tuning it to follow them is even harder and requires more work.

Pagination

Looks like we can have a LaTeX in a browser? No, no, no, if things were really that easy, the world would be a better place. LaTeX.js comes with lots of limitations, some of which are fatal for a production-ready LaTeX replacement in a browser:

Instant Preview

LaTeX.js provides instant preview because it is a client side library and runs in a browser.

Conclusion

LaTeX.js provides only limited parsing capabilities for TeX/LaTeX. In other words, many LaTeX packages cannot be used in LaTeX.js.

This is a PEG parser, which means it interprets LaTeX as a context-free language. However, TeX (and therefore LaTeX) is Turing complete, so TeX can only really be parsed by a complete Turing machine. It is not possible to parse the full TeX language with a static parser. See here for some interesting examples.

When I started PPResume in December 2022, I also tried LaTeX.js for a while, but after discovering its fatal limitations, I quickly dropped it in favour of server-side LaTeX. As far as I can tell, LaTeX.js is a good demo idea but far from being a production-ready LaTeX replacement.

Typst

Typst is a modern typesetting system designed to be an intuitive and efficient alternative to LaTeX. It uses a syntax that is heavily inspired by Markdown, making it more accessible to users who may find LaTeX’s syntax complex. Typst allows users to compose documents in a text file, similar to LaTeX, but with a focus on speed, simplicity, and error handling.

Typst App

Line Breaking

Typst provides two options for line breaks:

Line breaking in Typst works better when the linebreaks option and the hyphenate option are used together.

CJK

Because Typst is very young, its CJK support is not as mature as LaTeX’s. As a result, there are lots of open issues in the Typst community. Here are some typical ones:

Basically these issues can be categorised as follows:

I am 100% sure that Typst will be able to improve and solve these issues, but it will take time. It is very likely that there will be some breaking changes in the future.

Pagination

Typst supports pagination out of the box, fair enough as a dedicated typesetting engine.

Instant Preview

This part is a bit complicated.

Basically, Typst is an open source project. It can run as a CLI tool: you just type in a command typst compile path/to/source.typ path/to/output.pdf and get a PDF in your local folder.

Typst also provides a typst watch command. Combined with incremental compilation, it can update the PDF in milliseconds. There are also some extensions such as tinymist which allow instant preview in editors.
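
If you want to drive the CLI from a program, for example on a server, a thin wrapper is enough. The following Python sketch assumes the typst executable is installed and on your PATH; the function names are made up, and the only Typst interface it relies on is the typst compile command shown above.

```python
import subprocess
from typing import List


def typst_compile_command(source: str, output: str) -> List[str]:
    """Build the argument list for: typst compile <source> <output>."""
    return ["typst", "compile", source, output]


def compile_with_typst(source: str, output: str) -> None:
    """Run the Typst CLI; raises CalledProcessError if compilation fails."""
    subprocess.run(typst_compile_command(source, output), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.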

It can also run purely in a browser, as the project is written in Rust and designed to be compiled to WebAssembly. In fact, the official Typst web app runs in the browser via WebAssembly. However, this part is not open sourced:

Typst can be compiled to WASM, but no JS glue is available, you’d have to write that yourself. It’s not as simple as compile(string) because you also need to provide fonts, and if you want a multi-file setup of course also files.

In other words, if you want instant preview for Typst in a browser, you are mostly on your own to write a WebAssembly binding to Typst.

Conclusion

In my opinion, Typst is a very promising alternative to LaTeX, but it is still very young and lacks some key capabilities to handle complicated typesetting scenarios.

React-pdf

React-pdf is a React renderer for creating PDF files in the browser and on the server.

Line Breaking

React-pdf internally implements the Knuth and Plass line breaking algorithm. By default it’s set to hyphenate English words.

This is one page from the example document in the react-pdf playground. Note the layout of the paragraph: the text overall looks balanced and justified, much better than normal paragraphs in normal HTML & CSS.

React-pdf document

CJK

React-pdf with default settings does not render CJK characters; you need to register a font and reference it in styles.

Pagination

Needless to say, react-pdf supports pagination because it is a library to generate PDF. It also provides options to specify page sizes, DPI, styles, etc.

Instant Preview

React-pdf can be used on both client side and server side.

If used on client side, then yes we have instant preview, again, you can check the playground for a live demo. Otherwise, if used on server side with Node.js, then no instant preview due to the round trip time from request to response.

Conclusion

It seems that react-pdf would be a perfect choice as the typesetting engine for a resume builder.

However, react-pdf is not a dedicated typesetting engine. It lacks many features that are only available or work well with a dedicated typesetting engine. For example, it has no built-in list items. Most importantly, even though it already implements the Knuth-Plass line-breaking algorithm, typesetting is not just about breaking paragraphs into lines, is it? You still need to tune the spacing between paragraphs, adjust font size/styles, respect CJK best typesetting practices, etc. All this tuning requires a huge amount of work that LaTeX already provides out of the box.

In fact, there is an open source resume builder called open-resume which uses this library to generate and update the resume PDF in real time. You can check the output PDF yourself and compare it to the PDF generated by LaTeX.

OK conclusion:

Summary

The goal of PPResume is to be a professional resume builder that offers top notch typesetting quality, with native support for multiple languages.

As mentioned above, in order to meet PPResume’s requirements, the typesetting engine must:

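| Typesetting Engine | Knuth Plass line breaking | CJK | Pagination | Instant Preview |
| --- | --- | --- | --- | --- |
| HTML & CSS | No | Yes | Partial | Yes |
| LaTeX | Yes | Yes | Yes | No |
| LaTeX.js | No | Yes | No | Yes |
| Typst | Yes | Partial | Yes | Partial |
| React-pdf | Yes | No | Yes | Yes |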

Neither HTML & CSS nor LaTeX.js supports Knuth Plass line breaking, and the CJK support in react-pdf and Typst is not production ready. Hence LaTeX is our only option.
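
The elimination can be written down as a tiny filter. The sketch below copies the ratings from the table into a Python dictionary and keeps only the engines that rate “Yes” on the three hard requirements. Treating instant preview as a nice-to-have rather than a hard requirement is my reading of the criteria above, and the names ENGINES, HARD_REQUIREMENTS and qualifying_engines are made up for illustration.

```python
from typing import Dict, List, Tuple

# Ratings copied from the summary table.
ENGINES: Dict[str, Dict[str, str]] = {
    "HTML & CSS": {"knuth_plass": "No", "cjk": "Yes", "pagination": "Partial", "instant_preview": "Yes"},
    "LaTeX": {"knuth_plass": "Yes", "cjk": "Yes", "pagination": "Yes", "instant_preview": "No"},
    "LaTeX.js": {"knuth_plass": "No", "cjk": "Yes", "pagination": "No", "instant_preview": "Yes"},
    "Typst": {"knuth_plass": "Yes", "cjk": "Partial", "pagination": "Yes", "instant_preview": "Partial"},
    "React-pdf": {"knuth_plass": "Yes", "cjk": "No", "pagination": "Yes", "instant_preview": "Yes"},
}

# Assumption: instant preview is desirable but not mandatory.
HARD_REQUIREMENTS: Tuple[str, ...] = ("knuth_plass", "cjk", "pagination")


def qualifying_engines(
    engines: Dict[str, Dict[str, str]] = ENGINES,
    required: Tuple[str, ...] = HARD_REQUIREMENTS,
) -> List[str]:
    """Return engines rated 'Yes' on every hard requirement."""
    return [
        name
        for name, ratings in engines.items()
        if all(ratings[criterion] == "Yes" for criterion in required)
    ]
```

With these ratings, qualifying_engines() returns only LaTeX. If instant preview were made a hard requirement as well, the list would be empty, which is why it is treated as a bonus here.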

In the long run, if better choices appear, it is possible for PPResume to add support for other typesetting engines.

Last but not least, have fun with polytype, a Rosetta Stone for typesetting engines.

Thanks for reading!

Revisions

Nov 18, 2024

This post was featured on Hacker News.

Here I respond to some of the comments.

Indo-European languages

sundarurfriend pointed out that the usage of Indo-European languages is inappropriate, and he is right.

I am not a linguist and the two languages that I know well are Chinese and English. So I have coined a new glossary term “Latin script languages” and use that throughout the post instead.

Choice of typesetting engines

Some people asked me why I did not mention or evaluate xxx, yyy typesetting engines. As I mentioned above, I did the evaluation based on PPResume’s requirements. I chose the above 5 typesetting engines because each represents a different type:

text-autospace

Chrome has developed a new text-autospace CSS property that can insert inter-script spacing by default. However, this property appears to be available only in Chrome and is currently behind a feature flag.

Nov 8, 2024

Nov 2, 2024

Nov 1, 2024