On Typesetting Engines: A Programmer's Perspective

Prologue

Typesetting is "architecture in two dimensions."

If text and its fonts are the materials of the building, then typesetting is the drawings of the building.

Typesetting is a big topic. It is both an art and an engineering discipline, and it has evolved significantly with the advent of digital technology. I obviously cannot cover it in one post; even a book would not be enough.

Among the many concepts in typesetting, the typesetting engine is one of the most central. Basically, a typesetting engine is a piece of software that decides how glyphs, graphics, tables, etc. are laid out for printing or digital display.

When PPResume was launched, some people asked me why I chose LaTeX as the default typesetting engine for PPResume. Hmmm, this is a big topic.

In this post, I would like to explore the pros and cons of some popular typesetting engines (HTML/CSS, LaTeX.js, LaTeX, Typst and react-pdf) and explain why PPResume chose LaTeX as the default typesetting engine.

But before we start, let us agree on some terms that will be used throughout the whole post. Yes, this is a long post and it takes time and energy to read. Don't complain to me later. I warned you here!

Glossary:

The Assessment Criteria

Each typesetting engine has its strengths and weaknesses, catering to different needs and preferences. Web-based typesetting with HTML/CSS is extremely flexible and responsive, ideal for interactive and SEO-optimized content. LaTeX.js provides a bridge between the web and LaTeX, while LaTeX itself is the gold standard for academic and high-precision typesetting. Typst is considered a modern, improved alternative to LaTeX. React-pdf allows dynamic PDF generation with React. The choice of typesetting engine depends very much on the specific requirements of the project.

I am not a designer so I cannot talk too much about typesetting from the perspective of art. Instead, I want to discuss some technical things about typesetting engines from a programmer's perspective. Meanwhile, this post is not an academic benchmarking report, so I won't evaluate every aspect of typesetting engines. Instead, I will give some assessment criteria based on PPResume's requirements.

When I wrote the first line of code for PPResume, I set two goals: top-notch PDF quality and native support for multiple languages.

To produce top-notch, high-quality PDFs, the typesetting engine must have a top-tier line breaking algorithm. To provide native support for multiple languages, the typesetting engine must support languages with a huge character set (such as Chinese, Japanese and Korean, aka CJK). Let us evaluate these two criteria before we dive into specific typesetting engines.

Wait a minute, I almost forgot: to produce a PDF, the typesetting engine must support pagination. You may ask: is there any typesetting engine that does not support pagination? The answer is neither a clear yes nor a clear no; it depends on whether you consider HTML & CSS to be a typesetting engine. We will come back to this in the HTML & CSS section.

Finally, it would be better if PPResume could offer an excellent user experience. Of all possible features, I believe instant preview is the most wanted one.

In a nutshell, I will judge a typesetting engine by checking whether it meets the following assessment criteria:

  1. Knuth-Plass line breaking algorithm
  2. CJK typesetting
  3. Pagination
  4. Instant Preview

The Sacred Line Breaking Algorithm

Line breaking algorithms are one of the core techniques used in typesetting engines. They play a crucial role in determining how text is arranged on a page or screen.

The primary purpose of a line breaking algorithm is to determine the optimal points at which to break lines of text in a paragraph. Line breaking algorithms are essential to digital typesetting and form a core component of any system that needs to present text in a visually appealing and readable format.

There are 3 key aspects that are used to assess the quality of a line breaking algorithm:

  1. Justification: Line breaking algorithms work in conjunction with justification techniques to create evenly spaced lines of text.
  2. Hyphenation: Many advanced algorithms incorporate hyphenation to improve line breaks, especially for languages with long words.
  3. Optimization: The algorithm typically tries to minimize unsightly gaps or overly tight spacing between words across an entire paragraph.

There are two categories of line breaking algorithms:

  1. Minimum number of lines: a greedy algorithm that puts as many words on a line as possible, then moves on to the next line and does the same until there are no more words left to place. This method is used by many modern word processors, such as LibreOffice Writer and Microsoft Word.
  2. Minimum raggedness: a dynamic programming algorithm, first used in TeX, that minimizes the sum of the squares of the lengths of the spaces at the end of lines. This produces a more aesthetically pleasing result than the greedy algorithm, which does not always minimize squared space.

Technically speaking, the minimum number of lines algorithm is faster, while the minimum raggedness algorithm produces a more visually pleasing result. Let me show you an example here. In the following image, the top half is a LibreOffice document, using the "minimum number of lines" approach, while the bottom half is a PDF document generated by TeX using the "minimum raggedness" approach. You can easily see that the bottom half looks less ragged on the right margin and more visually appealing, simply because the line breaking is more balanced and justified.
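To make the difference between the two categories concrete, here is a minimal Python sketch. It is my own simplification, not the code of any real word processor or of TeX: widths are counted in characters, every word is assumed to fit within the line width, and there is no hyphenation and no stretchable spacing. The function names are made up for this illustration.

```python
from typing import List


def greedy_breaks(words: List[str], width: int) -> List[List[str]]:
    """Minimum number of lines: fill each line with as many words as fit."""
    lines, current, length = [], [], 0
    for word in words:
        # A word needs one extra column for the space when the line is not empty.
        needed = len(word) + (1 if current else 0)
        if current and length + needed > width:
            lines.append(current)
            current, length = [word], len(word)
        else:
            current.append(word)
            length += needed
    if current:
        lines.append(current)
    return lines


def min_raggedness_breaks(words: List[str], width: int) -> List[List[str]]:
    """Minimum raggedness: dynamic programming over all feasible break points.

    Assumes every single word fits within `width`.
    """
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of laying out words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: where the line starting at word i ends
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1
        for j in range(i + 1, n + 1):
            length += len(words[j - 1]) + 1
            if length > width:
                break
            # The last line is free; other lines pay the squared trailing space.
            cost = 0 if j == n else (width - length) ** 2
            if cost + best[j] < best[i]:
                best[i], nxt[i] = cost + best[j], j
    lines, i = [], 0
    while i < n:
        lines.append(words[i:nxt[i]])
        i = nxt[i]
    return lines


def raggedness(lines: List[List[str]], width: int) -> int:
    """Sum of squared trailing space, ignoring the last line."""
    return sum((width - len(" ".join(line))) ** 2 for line in lines[:-1])


if __name__ == "__main__":
    words, width = "aaa bb cc ddddd".split(), 6
    for breaker in (greedy_breaks, min_raggedness_breaks):
        lines = breaker(words, width)
        print(breaker.__name__, lines, raggedness(lines, width))
```

For the four words above and a width of 6, the greedy version breaks after "aaa bb" and leaves "cc" alone on a line (raggedness 16), while the dynamic programming version breaks after "aaa" and keeps "bb cc" together (raggedness 10). Both use three lines, but the second spreads the empty space more evenly.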

Knuth-Plass Line Breaking Algorithm

Among all line breaking algorithms, the Knuth-Plass line breaking algorithm is the gold standard for the minimum raggedness approach. It is widely adopted by typesetting engines such as TeX, SILE and Typst.

Back to PPResume's case: one of the design goals for PPResume is to produce top-notch, high-quality PDFs, so the chosen typesetting engine must have a visually appealing line breaking algorithm. In other words, the typesetting engine must adopt the Knuth-Plass line breaking algorithm.

CJK Typesetting is Complicated

Typesetting for CJK (Chinese, Japanese, and Korean) languages is generally considered to be more complicated than for Indo-European languages. Here is a classic discussion from the koreader project. There are several reasons for this.

TL;DR: if you don't want to delve into the details, you can check out the following W3C notes to get an intuitive sense of the complexity of typesetting requirements for CJK:

CJK Character Set is Huge

The root cause of this complexity is that the character set for CJK languages is much larger than that of Indo-European languages. According to CJK Unified Ideographs, as of Unicode 16.0, Unicode defines a total of 97,680 CJK unified ideographs. This is insanely huge. In contrast, Indo-European languages typically use the Latin alphabet, which has a few hundred characters, far fewer than CJK. Hmmmm, almost 100k characters: even creating a font that covers all of them is a huge amount of work, labor-intensive and very expensive.

CJK Characters

Taking PPResume as an example, we have two issues (1, 2) where the fonts recommended by CTeX are missing characters. Unlike fonts for Indo-European languages, very few fonts have full coverage of the entire CJK character set, and most of them are commercial. Noto is one of the few exceptions that both has good coverage of CJK characters and is free to use.

Cultural Nuances

Each CJK language has its own set of typographic conventions that must be followed, and these can vary greatly from culture to culture and context to context. For example, punctuation placement and spacing rules differ between Chinese, Japanese, and Korean texts. Even the quotation mark is used with completely different conventions across CJK regions:

In Japan, corner brackets are used.

In South Korea, corner brackets and English-style quotes are used.

In North Korea, angle quotes are used.

In mainland China, English-style quotes (full width “ ”) are official and prevalent; corner brackets are rare today. The Unicode code points used are the English quotes (rendered as fullwidth by the font), not the fullwidth forms.

In Taiwan, Hong Kong and Macau, where traditional characters are used, corner brackets are prevalent, although English-style quotes are also used.
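If you want to check which code points are actually involved in the conventions above, a few lines of Python with the standard unicodedata module are enough. This is just a small inspection sketch of mine; the list of marks is my own selection, and the describe helper is a made-up name.

```python
import unicodedata

# My own selection of marks for this sketch: English-style curly quotes,
# corner brackets, and the fullwidth quotation mark for comparison.
MARKS = ["\u201c", "\u201d", "\u300c", "\u300d", "\uff02"]


def describe(ch: str) -> str:
    """Return the code point, the Unicode name and the East Asian Width class."""
    return (
        f"U+{ord(ch):04X} {unicodedata.name(ch)} "
        f"(East Asian Width: {unicodedata.east_asian_width(ch)})"
    )


if __name__ == "__main__":
    for mark in MARKS:
        print(mark, describe(mark))
```

The curly quotes report the width class A (ambiguous), which matches the remark above: the same code points serve English and Chinese text, and it is the font that decides whether they are drawn narrow or fullwidth. The corner brackets are W (wide), and U+FF02 is F (fullwidth).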

In the Chinese language, double angle brackets are placed around titles of books, documents, movies, pieces of art or music, magazines, newspapers, laws, etc. When nested, single angle brackets are used inside double angle brackets. With some exceptions, this usage parallels the usage of italics in English:

「你看過《三國演義》嗎?」他問我。

"Have you read Romance of the Three Kingdoms?", he asked me.

Font Pairing

When mixing CJK with Indo-European languages, things become more complicated.

Firstly, punctuation differs. For example, the comma has different forms in Chinese and English:

English uses the comma , as a separator to separate parts of a sentence and items in a list, while Chinese uses a Chinese comma (,) to separate sentences, and a dedicated enumeration comma (顿号, 、) to separate items in a list (e.g. keyword > list).

Multi Languages Support

Meanwhile, a Latin font for Indo-European languages may cover around one thousand glyphs, whereas a CJK font must cover many thousands of glyphs, as mentioned above.

Effective typesetting often requires CJK fonts to be paired with Latin fonts to maintain visual consistency. This can be challenging as it requires combined fonts that intelligently switch between character sets.

So Chinese, Japanese and Korean fonts tend to be developed by Asian designers, with an understandable emphasis on the elegance of the Asian characters. Unfortunately this can be at the expense of the design of the Latin letters, which may in some cases be really quite ugly.

The solution? Use an attractive Latin-script font for any Latin letters and numbers, and an Asian font for the Chinese, Japanese or Korean characters. Rather than making the poor typesetter manually change the font each time a Latin letter or number appears, applications such as InDesign allow Combined Fonts to be set within a document which intelligently switch the font according to the nature of each letter or character.

Typesetting conventions and best practices for CJK (Chinese, Japanese, Korean)

Not all typesetting engines have built-in support for font pairing, but it is essential for PPResume to provide native support for multiple languages.
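The core of font pairing, switching the font according to the nature of each character, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the code point ranges are a rough, non-exhaustive approximation of "CJK", and the font names "Latin Serif" and "CJK Serif" are placeholders rather than real fonts.

```python
from itertools import groupby
from typing import List, Tuple

# Rough, non-exhaustive code point ranges treated as CJK in this sketch.
CJK_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]


def is_cjk(ch: str) -> bool:
    """Return True when the character falls into one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def font_runs(
    text: str,
    latin_font: str = "Latin Serif",  # placeholder font name
    cjk_font: str = "CJK Serif",      # placeholder font name
) -> List[Tuple[str, str]]:
    """Split text into consecutive (font, run) pairs, one font per script run."""
    return [
        (cjk_font if cjk else latin_font, "".join(chars))
        for cjk, chars in groupby(text, key=is_cjk)
    ]


if __name__ == "__main__":
    for font, run in font_runs("PPResume 支持 LaTeX"):
        print(f"{font!r}: {run!r}")
```

A real engine does much more than this (font fallback chains, per-language punctuation handling, and so on), but the sketch shows why the typesetter should not have to switch fonts by hand for every Latin word.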

In summary, the nuances of character sets, cultural conventions and technical challenges contribute to the greater complexity of typesetting CJK languages compared to Indo-European languages.

HTML & CSS

Technically speaking, HTML (Hypertext Markup Language) is not a typesetting engine, but a markup language used to create the structure and content of web pages. It is designed to define the structure of a document, such as headings, paragraphs, lists and links.

While HTML can indirectly influence how text appears on a page (e.g. by using the obsolete font tags), it cannot handle the complex tasks of typesetting, such as:

HTML by itself cannot function as a typesetting engine. However, HTML & CSS (Cascading Style Sheets) together can be considered a rudimentary typesetting engine.

Although not as sophisticated as dedicated typesetting engines such as LaTeX or InDesign, HTML & CSS provide a flexible way to control the layout and appearance of text on web pages.

By combining HTML & CSS, you can achieve a wide range of text formatting and layout effects. However, for more advanced typesetting tasks, such as complex mathematical equations or precise control over typography, dedicated typesetting engines may be more appropriate.

There are many resume builders on the market that use HTML & CSS as their typesetting engine. Most are commercial, with only a few being free or open source:

Website                     | Technique   | Type
https://resume.io           | HTML Canvas | Commercial
https://flowcv.com/         | HTML & CSS  | Commercial
https://www.visualcv.com/   | HTML & CSS  | Commercial
https://standardresume.co/  | HTML & CSS  | Commercial
https://zety.com/           | HTML & CSS  | Commercial
https://rxresu.me           | HTML & CSS  | Free & open source

On the one hand, from a business perspective, given the market is so crowded, it is not wise for me to create another resume builder that uses HTML & CSS as the typesetting engine.

On the other hand, from an engineering perspective, HTML & CSS does not implement the Knuth-Plass line breaking algorithm, so it cannot meet PPResume's needs.

Line Breaking

In fact, standard CSS does provide some options for adjusting text justification:

Firefox even provides a text-justify option to set what type of justification should be applied to text when text-align: justify; is set on an element. However, this option is only available in Firefox.

However, none of them apply proper hyphenation, so they cannot produce the same visually appealing result as a real Knuth-Plass line breaking algorithm. Hacker News has a valuable discussion about why modern browsers are too lazy to implement the Knuth-Plass line breaking algorithm.

There are also a few JavaScript implementations of the Knuth-Plass line breaking algorithm, but none of them seem to be production ready:

CJK

HTML & CSS (or rather, the browser) supports CJK, that's for sure; otherwise the browser couldn't be the world's most widely adopted information platform. However, this doesn't mean that every page containing CJK follows typesetting best practices.

For example, it is highly recommended to put some space between CJK and Western characters. Plain HTML & CSS cannot do this automatically; it needs the help of JavaScript.
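The spacing rule itself is a small text transformation. Here is a minimal sketch of it in Python rather than JavaScript, to keep one language for the examples in this post. It is my own simplification: the character classes are rough approximations, the function name is made up, and it inserts a plain space, whereas a real library may prefer to adjust spacing through styling instead of changing the text.

```python
import re

# Rough character classes for this sketch; real implementations cover more ranges.
CJK = r"\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af"
LATIN = r"A-Za-z0-9"

_CJK_THEN_LATIN = re.compile(f"([{CJK}])([{LATIN}])")
_LATIN_THEN_CJK = re.compile(f"([{LATIN}])([{CJK}])")


def add_cjk_latin_spacing(text: str) -> str:
    """Insert one space wherever a CJK character touches a Latin letter or digit."""
    text = _CJK_THEN_LATIN.sub(r"\1 \2", text)
    return _LATIN_THEN_CJK.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(add_cjk_latin_spacing("使用LaTeX排版"))
```

Running the function twice gives the same result as running it once, because text that is already spaced no longer has a CJK character directly next to a Latin one.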

In general, it takes extra effort to follow best practices for CJK typesetting in the browser. As mentioned above, Requirements for Chinese Text Layout 中文排版需求 is a pretty good and authoritative reference, and one of the authors, Chen Yijun, has published an open source project called Han which provides a pretty nice implementation if you want to typeset CJK with best practices.

Han.css

Pagination

HTML & CSS is not designed for paginated documents, though with the help of JavaScript, it can simulate them (here is a good implementation from oh-my-cv). HTML documents are essentially responsive: they flow like water and can adapt to viewports of any size.

Instant Preview

HTML & CSS can have instant preview if the resume generation process happens only on the client side. Otherwise, if it happens on the server side, there would be a round trip from request to response and hence no instant preview.

Conclusion

Before we conclude, I couldn't resist showing you an excellent example of how HTML & CSS typesetting can be pushed to its limit. It uses text-align: justify and hyphens: auto to get an optimal, aligned layout for paragraphs. This is almost the best that HTML & CSS can do. If you ever want to do some typesetting with HTML & CSS, this would be a very good reference.

In summary, while it is theoretically possible to achieve top-quality typesetting with HTML & CSS, just as with dedicated typesetting engines, the effort would be enormous and there may also be browser compatibility issues. So, for the time being at least, if top-notch typesetting is required, it is still recommended to use a dedicated typesetting engine instead of tuning HTML & CSS by hand.

LaTeX

TeX is a typesetting system created by Donald Knuth in the late 1970s. It is designed for the creation of high quality typeset documents, particularly those containing complex mathematical and scientific notation. TeX is a low-level system that requires the user to write commands in a specific language to format documents. It has its own set of rules and macros for formatting text, and it is highly customizable and extensible.

LaTeX, on the other hand, is a document preparation system that is built on top of TeX. It was created by Leslie Lamport in the early 1980s to simplify the document preparation process. LaTeX provides a set of higher-level macros on top of TeX's lower-level programming language, making it easier and more intuitive to use.

One of the most frequently asked questions is: why use LaTeX instead of a word processor like Microsoft Word? The TL;DR answer is: "for beauty". Dario wrote an excellent post, The Beauty of LaTeX, with dozens of examples comparing the nitty-gritty typesetting details of Microsoft Word and LaTeX. No need for me to repeat them here.

In summary, for professional typesetting, LaTeX excels in the following features:

Line Breaking

TeX has the gold-standard line breaking algorithm: the Knuth-Plass line breaking algorithm. After all, Knuth is the author of TeX, right?

As mentioned above, the Knuth-Plass line breaking algorithm does its best to produce a more aesthetically pleasing result by reducing raggedness to a minimum.

Under the hood, the Knuth-Plass line breaking algorithm uses a "total-fit" approach, in contrast to the "first-fit" approach used by many other systems. This means:

This allows TeX to produce more visually appealing and balanced paragraphs overall.

Meanwhile, unlike many systems that treat hyphenation separately, TeX's line breaking algorithm integrates hyphenation decisions directly. This allows for more optimal placement of hyphens in the context of the entire paragraph.

Overall, TeX's line breaking algorithm is considered one of the most sophisticated and effective approaches to typesetting, and its core principles continue to influence modern typesetting systems and remain at the forefront of high-quality digital typography.

CJK

Regarding CJK typesetting, LaTeX has pretty good support with the help of some newer engines and packages:

For example, the xeCJK package provides the following commands to set fonts for CJK:

xeCJK also provides options for specifying punctuation styles for CJK, spacing between CJK and non-CJK characters, etc.

Overall, LaTeX's CJK support is now quite mature, although it may take some time to set up in different environments. Here's a manual page from The XeTeX Companion: TeX meets OpenType and Unicode, which gives a glance at XeTeX's abilities for CJK typesetting.

XeTeX for CJK

Pagination

LaTeX is designed from the ground up for typesetting paginated documents, so yes, it has excellent support for pagination. You can easily adjust paper size, orientation, margins, etc.

Check the geometry package for details.

Instant Preview

For a web app, LaTeX by default runs on the server side, so there is a round trip between the request to generate the PDF and the response carrying the generated PDF.

Using LaTeX as the typesetting engine means that we lose instant preview. However, there are ways to mitigate this. The magic is WebAssembly.

There has been some effort to compile LaTeX to WebAssembly (aka wasm) so that it can run purely in a browser:

Although none of the above is actively maintained, it is theoretically possible to run LaTeX purely in a browser. This would drastically reduce the round-trip time from browser to server, and we could get instant previews then.

Conclusion

Before concluding, I would like to share a bit of off-topic information. There are very few LaTeX-based resume builders on the market:

From a business perspective, this is a niche market and not too crowded, so it might be worthwhile for me to create another LaTeX based resume builder.

OK time to conclude LaTeX.

LaTeX.js

LaTeX.js is a LaTeX to HTML5 translator that aims to render LaTeX documents directly in the browser without the need for server-side processing.

It provides a very impressive playground: you enter some LaTeX code on the left, and it renders that code into a pretty nice HTML document on the right.

LaTeX.js Playground

Line Breaking

LaTeX.js does not use Knuth-Plass line breaking but instead uses text-align: justify to minimize the raggedness of paragraphs.

Meanwhile, it also uses the soft hyphen &shy; together with hyphens: manual for better line breaking.

Although these techniques produce a much better visual result than normal HTML, it is still not true Knuth-Plass line breaking.

CJK

LaTeX.js supports CJK because it is just a wrapper on top of HTML & CSS. However, just like HTML & CSS, it doesn't follow CJK best practices out of the box, and tuning it to follow them is even harder and requires more work.

Pagination

Looks like we can have LaTeX in a browser? No, no, no. If things were really that easy, the world would be a better place. LaTeX.js comes with lots of limitations, some of which are fatal for a production-ready LaTeX replacement in a browser:

Instant Preview

LaTeX.js provides instant preview because it is a client side library and runs in a browser.

Conclusion

LaTeX.js provides only limited parsing capabilities for TeX/LaTeX. In other words, many LaTeX packages cannot be used in LaTeX.js.

This is a PEG parser, which means it interprets LaTeX as a context-free language. However, TeX (and therefore LaTeX) is Turing complete, so TeX can only really be parsed by a complete Turing machine. It is not possible to parse the full TeX language with a static parser. See here for some interesting examples.

When I started PPResume in December 2022, I also tried LaTeX.js for a while, but after discovering its fatal limitations, I quickly dropped it in favour of server-side LaTeX. As far as I can tell, LaTeX.js is a good demo idea but far from being a production-ready LaTeX replacement.

Typst

Typst is a modern typesetting system designed to be an intuitive and efficient alternative to LaTeX. It uses a syntax that is heavily inspired by Markdown, making it more accessible to users who may find LaTeX's syntax complex. Typst allows users to compose documents in a text file, similar to LaTeX, but with a focus on speed, simplicity, and error handling.

Typst App

Line Breaking

Typst provides two options for line breaking:

Line breaking in Typst works better when the linebreaks option and the hyphenate option are used together.

CJK

Because Typst is very young, its CJK support is not as mature as LaTeX's. As a result, there are lots of open issues in the Typst community. Here are some typical ones:

Basically these issues can be categorised as follows:

I am 100% sure that Typst will be able to improve and solve these issues, but it will take time. It is very likely that there will be some breaking changes in the future.

Pagination

Typst supports pagination out of the box, as one would expect from a dedicated typesetting engine.

Instant Preview

This part is a bit complicated.

Basically, Typst is an open source project. It can run as a CLI tool: just type the command typst compile path/to/source.typ path/to/output.pdf and you get a PDF in your local folder.

It can also run purely in a browser, as the project is written in Rust and designed so that it can be compiled to WebAssembly. In fact, the official Typst web app runs Typst in the browser via WebAssembly. However, this part is not open sourced:

Typst can be compiled to WASM, but no JS glue is available, you'd have to write that yourself. It's not as simple as compile(string) because you also need to provide fonts, and if you want a multi-file setup of course also files.

That being said, if you want instant preview for Typst in a browser, you are mostly on your own to write a WebAssembly binding for Typst.

Conclusion

In my opinion, Typst is a very promising alternative to LaTeX, but it is still very young and lacks some key capabilities to handle complicated typesetting scenarios.

React-pdf

React-pdf is a React renderer for creating PDF files in the browser and on the server.

Line Breaking

React-pdf internally implements the Knuth-Plass line breaking algorithm. By default it is set to hyphenate English words.

This is one page from the example document in the react-pdf playground. Note the layout of the paragraph: the text overall looks balanced and justified, much better than normal paragraphs in plain HTML & CSS.

React-pdf document

CJK

With default settings, react-pdf does not render CJK characters; you need to register a font and reference it in styles.

Pagination

Needless to say, react-pdf supports pagination because it is a library for generating PDFs. It also provides options to specify page sizes, dpi, styles, etc.

Instant Preview

React-pdf can be used on both the client side and the server side.

If used on the client side, then yes, we have instant preview; again, you can check the playground for a live demo. Otherwise, if used on the server side with Node.js, there is no instant preview due to the round trip from request to response.

Conclusion

It seems that react-pdf would be a perfect choice as the typesetting engine for a resume builder.

However, react-pdf is not a dedicated typesetting engine. It lacks many features that are only available or work well with a dedicated typesetting engine. For example, it has no built-in list items. Most importantly, even though it already implements the Knuth-Plass line breaking algorithm, typesetting is not just about breaking paragraphs into lines, is it? You still need to tune the spacing between paragraphs, adjust font size/styles, respect CJK typesetting best practices, etc. All this tuning is a huge amount of work, which LaTeX already does for you out of the box.

In fact, there is an open source resume builder called open-resume which uses this library to generate and update the resume PDF in real time. You can check the output PDF yourself and compare it to the PDF generated by LaTeX.

OK conclusion:

Summary

The goal of PPResume is to be a professional resume builder that offers top-notch typesetting quality, with native support for multiple languages.

As mentioned above, in order to meet PPResume's requirements, the typesetting engine must:

Typesetting Engine | Knuth-Plass line breaking | CJK     | Pagination | Instant Preview
HTML & CSS         | No                        | Yes     | Partial    | Yes
LaTeX              | Yes                       | Yes     | Yes        | No
LaTeX.js           | No                        | Yes     | No         | Yes
Typst              | Yes                       | Partial | Yes        | Partial
React-pdf          | Yes                       | No      | Yes        | Yes

Neither HTML & CSS nor LaTeX.js supports Knuth-Plass line breaking, and the CJK support of react-pdf and Typst is not production ready; hence LaTeX is our only option.
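For the programmers among you, the same reasoning can be written as a tiny filter over the table. This is only a sketch of how I read my own criteria: I assume that line breaking, CJK and pagination are hard requirements, that "Partial" does not count as meeting a requirement, and that instant preview is a nice-to-have. The dictionary keys and the function name are made up for this illustration.

```python
from typing import Dict, List, Tuple

# The summary table from this post, one entry per typesetting engine.
ENGINES: Dict[str, Dict[str, str]] = {
    "HTML & CSS": {"knuth_plass": "No", "cjk": "Yes", "pagination": "Partial", "instant_preview": "Yes"},
    "LaTeX": {"knuth_plass": "Yes", "cjk": "Yes", "pagination": "Yes", "instant_preview": "No"},
    "LaTeX.js": {"knuth_plass": "No", "cjk": "Yes", "pagination": "No", "instant_preview": "Yes"},
    "Typst": {"knuth_plass": "Yes", "cjk": "Partial", "pagination": "Yes", "instant_preview": "Partial"},
    "React-pdf": {"knuth_plass": "Yes", "cjk": "No", "pagination": "Yes", "instant_preview": "Yes"},
}

# Assumption of this sketch: these three are hard requirements,
# while instant preview is only a nice-to-have.
REQUIRED: Tuple[str, ...] = ("knuth_plass", "cjk", "pagination")


def qualifying_engines(
    engines: Dict[str, Dict[str, str]],
    required: Tuple[str, ...] = REQUIRED,
) -> List[str]:
    """Return the engines that fully meet every hard requirement."""
    return [
        name
        for name, support in engines.items()
        if all(support[criterion] == "Yes" for criterion in required)
    ]


if __name__ == "__main__":
    print(qualifying_engines(ENGINES))
```

With the table as it stands, only LaTeX passes all three hard requirements. If instant preview were a hard requirement as well, nothing would pass, which is why I treat it as something to mitigate rather than a blocker.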

In the long run, if better choices emerge, it is possible for PPResume to add support for other typesetting engines.

Last but not least, have fun with polytype, a Rosetta stone for typesetting engines.

Thanks for reading!