Shannon's Hatchet

something, but almost nothing

Wednesday, February 25, 2026

Presenting PLAYA-PDF (and PAVÉS)

If you need to delve into the murky depths of a PDF and return, if not with spices and silk, then with metadata, images, and yes, even text, I have some excellent Free Software for you: PLAYA-PDF and PAVÉS. If you’d like to know how this came to be, then continue reading. And if you need a consultant for document intelligence tasks, large and small, I’m currently available for contracts of all sorts!

A young girl swoons surrounded by a storm of papers with the PDF logo

“You’re nothing but a pack of indirect objects!”

As you may or may not know, I’m a computational linguist by trade and training. In 2021, something else happened: I was elected to the town council in a municipality of the greater St-Jérôme area, which shall remain unnamed. Shortly thereafter, I quit my job as Principal Research Scientist at a company (which shall also remain unnamed, and is now a division of Microsoft) because it was clear that I couldn’t continue to work full-time in Montréal while being a responsive and effective public servant. I also found municipal politics to be a lot more interesting and relevant than the slow, incremental improvement of machine learning models for natural language understanding which I was working on at the time.

And in the meantime, well, some other things happened…

One of the unintended consequences of this possibly ill-advised career move was that I ended up becoming an expert of sorts on parsing and manipulating PDF files. Of course, like any programmer, I did this in the usual way, by starting a Free Software project.

Why PDF?

Once you get into the details of document management and archiving in a municipality or other similar organization, it quickly becomes obvious that, despite all the best efforts of decades of work on ODF, OOXML, HTML, and various other purportedly universal document formats, at the end of the day, the only thing you can count on is that a document will always be available as a PDF. This is the unfortunate result of Microsoft’s domination of the office software market: not only is Office gratuitously incompatible in subtle ways with every other alternative (free and proprietary), but Office isn’t even compatible with itself a lot of the time.

Why another PDF library?

Free Software, fundamentally, is about choice, and I had certain criteria for the tool that I wanted to use, which were not fulfilled by the available choices:

  1. Permissive open-source license (BSD or MIT).
  2. Written in Python and portable to various platforms.
  3. Programmer-friendly interface.
  4. Direct access to internal PDF data structures, not just text extraction but also access to graphics state, images and metadata.
  5. Fast and efficient, to the extent possible given #2.

The closest thing that I found at the time was pdfplumber, which is still a very nice library which definitely satisfies 1, 2 and 3 above! I even contributed support for logical structure trees to it at some point. Unfortunately, pdfplumber, like its underlying library pdfminer.six and other popular projects, is not very efficient, in particular because it needs to parse the entirety of each page and construct all the data structures before returning any useful information.

Enter PLAYA: LazY and Parallel Analyzer

This is the main reason for PLAYA-PDF’s existence: it is designed from the ground up to be “lazy”, and only processes the bare minimum of data needed to get the information you want. On the other hand, if you are lazy, it also has an “eager” interface which can convert PDF metadata to JSON quite efficiently:

import json
import playa
with playa.open(path) as pdf:  # path: any PDF file on disk
    print(json.dumps(playa.asobj(pdf)))
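To make the "lazy" versus "eager" distinction concrete, here is a small sketch that does not use PLAYA at all. The names parse_page, eager_pages and lazy_pages are hypothetical stand-ins, and the "parsing" is faked so that the amount of work done can be counted. It only illustrates the general idea, not how PLAYA-PDF is implemented internally.

```python
from typing import Iterator, List

# Hypothetical stand-in for an expensive per-page parse. It records which
# pages were actually processed so the difference in work is visible.
parsed_pages: List[int] = []


def parse_page(number: int) -> str:
    parsed_pages.append(number)
    return f"contents of page {number}"


def eager_pages(page_count: int) -> List[str]:
    """Parse every page up front, whether or not the caller needs it."""
    return [parse_page(n) for n in range(1, page_count + 1)]


def lazy_pages(page_count: int) -> Iterator[str]:
    """Parse each page only when the caller asks for the next item."""
    for n in range(1, page_count + 1):
        yield parse_page(n)


# Ask for just the first page of a 100-page document, lazily...
first = next(lazy_pages(100))
lazy_cost = len(parsed_pages)

# ...and then eagerly.
parsed_pages.clear()
first_eager = eager_pages(100)[0]
eager_cost = len(parsed_pages)

print(lazy_cost, eager_cost)
```

Both approaches return the same first page, but the lazy one parses a single page where the eager one parses all hundred.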

The other important aspect of PLAYA-PDF is built-in support for parallel processing of PDFs, with a simple and easy-to-use interface:

with playa.open(path, max_workers=4) as pdf:
    texts = list(pdf.pages.map(playa.Page.extract_text))

Par-dessus la PLAYA, les PAVÉS! (“Over the PLAYA, the PAVÉS!”)

Since the guiding principles of PLAYA-PDF are efficiency and the absence of external dependencies, it doesn’t do any high-level extraction tasks which require image processing, heuristics or machine learning models.

For this reason I’ve also created PAVÉS which will gradually support more and more ways to do:

  • Structural and textual analysis of PDFs, such as table detection and extraction, as well as extraction of rich text and logical structure.
  • Visualisation of objects in a PDF as well as rendering of pages to images.

This second library is under construction but is already good enough to do the analysis and extraction used in my projects like ZONALDA and SÈRAFIM.

Conclusion

If you are one of the tiny minority of people who this might possibly interest, then by all means feel free to give it a try! Take a look at the documentation and the sample Jupyter notebooks to get an idea of what it can do.

Of course you can also contribute to its development on GitHub! (I may soon move development to Codeberg or another independent provider outside the USA, but the GitHub mirror will always remain).

Monday, February 27, 2023

TypeScript modules With Emscripten and CMake, part 5

When I set out to create an NPM package for SoundSwallower, I was unable to find much relevant information in the Emscripten documentation or elsewhere on the Web, so I have written this guide, which walks through the process of compiling a library to WebAssembly and packaging it as a CommonJS or ES6 module.

This is part of a series of posts. Start here to read from the beginning.

Optimizing for Size

Really, “for Size” is redundant here. It’s the Web. People are loading your page/app/whatever on mobile phones over metered connections. There is no other optimization you should reasonably care about. So, let’s see how we’re doing:

$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  64K fév 27 12:09 kissfft.cjs.js
-rwxrwxr-x  1 dhd dhd 197K fév 27 12:09 kissfft.cjs.wasm

Not good! Well, we did configure this with CMAKE_BUILD_TYPE=Debug, and we didn’t bother adding any optimization flags at compilation or link time. When optimizing for size, however, we should not immediately switch to a Release build, as the minification that it does makes it impossible to understand what parts of the output are wasting space.

Let’s set up CMake to build everything with maximum optimization with the -Oz option, which should be passed at both compile and link time. (Note: I will not discuss -flto here, because it is only useful when dealing with the eldritch horrors of C++). While we’re at it we’ll also disable support for the longjmp function which we know our library doesn’t use:

target_compile_options(kissfft PRIVATE -Oz -sSUPPORT_LONGJMP=0 -sSTRICT=1)
target_link_options(kissfft.cjs PRIVATE
  -sMODULARIZE=1 -Oz -sSUPPORT_LONGJMP=0
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)
target_link_options(kissfft.esm PRIVATE
  -sMODULARIZE=1 -sEXPORT_ES6=1 -Oz -sSUPPORT_LONGJMP=0
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)

Now let’s see where we’re at:

$ cmake --build jsbuild
$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  40K fév 27 12:11 kissfft.cjs.js
-rwxrwxr-x  1 dhd dhd 139K fév 27 12:11 kissfft.cjs.wasm

Already much better! Now, if you have WABT installed, you can use wasm-objdump to quickly see which functions are taking the most space:

$ wasm-objdump -x jsbuild/kissfft.cjs.wasm  | grep size
 - func[13] size=5194 <dlmalloc>
 - func[14] size=1498 <dlfree>
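If the list of functions is long, it can help to sort it. Here is a minimal sketch in Python that assumes the output lines look exactly like the two shown above. The function name largest_functions is my own invention, and the sample text is just those two lines in reverse order.

```python
import re
from typing import List, Tuple

# Matches lines such as " - func[13] size=5194 <dlmalloc>"
FUNC_LINE = re.compile(r"func\[(\d+)\]\s+size=(\d+)\s+<([^>]+)>")


def largest_functions(dump: str, top: int = 10) -> List[Tuple[str, int]]:
    """Return (name, size in bytes) pairs, largest first."""
    sizes = [(m.group(3), int(m.group(2))) for m in FUNC_LINE.finditer(dump)]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


sample = """\
 - func[14] size=1498 <dlfree>
 - func[13] size=5194 <dlmalloc>
"""

for name, size in largest_functions(sample):
    print(f"{size:>8}  {name}")
```

In practice you would feed it the text piped from wasm-objdump (for instance by reading sys.stdin) instead of the hard-coded sample.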

Now, Emscripten has an option to use a smaller, less full-featured memory allocator, which, since we know that our library is quite simple and doesn’t do a lot of allocation, is a good idea. Let’s change the link flags again:

target_link_options(kissfft.cjs PRIVATE
  -sMODULARIZE=1 -Oz -sSUPPORT_LONGJMP=0 -sMALLOC=emmalloc
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)
target_link_options(kissfft.esm PRIVATE
  -sMODULARIZE=1 -sEXPORT_ES6=1 -Oz -sSUPPORT_LONGJMP=0 -sMALLOC=emmalloc
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)

This saves another 20K (unminified and uncompressed):

$ cmake --build jsbuild
$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  40K fév 27 12:14 kissfft.cjs.js
-rwxrwxr-x  1 dhd dhd 119K fév 27 12:14 kissfft.cjs.wasm

If we look again at wasm-objdump we can see that there isn’t much else we can do, as what’s left consists of runtime support, including some stubs to allow debug printf() to work.

What about the .js file? Here’s where it gets a bit more complicated. First, let’s rebuild in Release mode and see where we’re at after minification. It’s also important to look at the compressed size of the .js and .wasm files, as a good webserver should be configured to serve them with gzip compression:

$ emcmake cmake -S. -B jsbuild -DCMAKE_BUILD_TYPE=Release
$ cmake --build jsbuild
$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  12K fév 27 12:32 kissfft.cjs.js
-rwxrwxr-x  1 dhd dhd  12K fév 27 12:32 kissfft.cjs.wasm
$ gzip -c jsbuild/kissfft.cjs.js | wc -c
3599
$ gzip -c jsbuild/kissfft.cjs.wasm | wc -c
7307

So, our total payload size is about 11K. This is quite acceptable in most circumstances, so you may wish to skip to the next section at this point.
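If you would rather track this number from a script than by eyeballing wc output, something like the following sketch would do. The helper names gzipped_size and payload_size are hypothetical. The result will differ by a few bytes from gzip -c, since the command-line tool stores the file name in the header and Python's gzip module does not.

```python
import gzip
from pathlib import Path
from typing import Iterable


def gzipped_size(data: bytes, level: int = 6) -> int:
    """Size in bytes of data after gzip compression.

    Level 6 is the default of the gzip command-line tool; Python's own
    default for gzip.compress is 9, so it is set explicitly here.
    """
    return len(gzip.compress(data, compresslevel=level))


def payload_size(paths: Iterable[Path]) -> int:
    """Total compressed size of the files a browser would have to download."""
    return sum(gzipped_size(Path(p).read_bytes()) for p in paths)


# The two figures measured above add up to the "about 11K" total:
total = 3599 + 7307
print(total)  # 10906 bytes
```

Calling payload_size on the paths of the built .js and .wasm files gives the combined figure in one step.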

Now, Emscripten also has an option -sMINIMAL_RUNTIME=1 (or 2) which can shrink this a bit more, but the problem is that it doesn’t actually produce a working CommonJS or ES6 module with -sMODULARIZE=1 and -sEXPORT_ES6=1, and worse yet, it cannot produce working code for the Web or ES6 modules, because it loads the WebAssembly like this:

if (ENVIRONMENT_IS_NODE) {
  var fs = require('fs');
  Module['wasm'] = fs.readFileSync(__dirname + '/kissfft.cjs.wasm');
}

Basically your only option if you use -sMINIMAL_RUNTIME is to postprocess the generated JavaScript to work properly in the target environment, because even if you enable streaming compilation, it will still include the offending snippet above, among other things. Doing this is quite complex and beyond the scope of this guide, but you can look at the build.js script used by wasm-audio-encoders, for example.

The other option, if your module is not too big and you don’t mind that it all gets loaded at once by the browser, is to do a single-file build:

Single-File Builds (WASM or JS-only)

In many cases it is a super huge pain to get the separate .wasm file packaged and loaded correctly when you are using a “framework” or even just a run-of-the-mill JavaScript bundler like Webpack or esbuild. This is, for instance, the case if you use Angular, which requires a custom Webpack configuration in order for it to work at all with modules that use WebAssembly. (Note that by default, Webpack works just fine, as long as you have a pretty recent version.)

We will go into the details of this in the next installment, but suffice it to say that, if you have a small enough library, you can save yourself a lot of trouble by simply making a single-file build, which you can do by adding -sSINGLE_FILE=1 to the linker options. This gives a quite acceptable size, which is only slightly larger due to the fact that the WebAssembly gets encoded as base64:

$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  29K fév 27 16:07 kissfft.cjs.js
$ gzip -c jsbuild/kissfft.cjs.js | wc -c
13321

Note, however, that in this case if you load the resulting JavaScript before your page contents, your users will have to wait until it downloads to see anything, whereas with a separate .wasm file, the downloading can be done asynchronously.
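The base64 overhead mentioned above is easy to estimate: every 3 bytes of binary become 4 characters of text. Here is a back-of-the-envelope sketch, assuming the rounded 12K figures from the earlier Release listing for both files. The function name base64_size is hypothetical.

```python
import base64


def base64_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes bytes (4 chars per 3 bytes)."""
    return 4 * ((n_bytes + 2) // 3)


# Check the formula against the standard library on an arbitrary 12K blob.
blob = bytes(range(256)) * 48  # 12288 bytes
assert len(base64.b64encode(blob)) == base64_size(len(blob))

wasm_bytes = 12 * 1024  # approximate size of the separate .wasm file
js_bytes = 12 * 1024    # approximate size of the separate .js file
single_file_estimate = js_bytes + base64_size(wasm_bytes)
print(single_file_estimate)  # 28672 bytes, i.e. 28K
```

That estimate is close to the 29K that ls reports for the single-file build. Gzip then recovers part of the base64 expansion, which is why the compressed size grows less than the raw size does.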

Alternately, if you want to support, say, Safari 13, iOS 12, or anything else that predates the final WebAssembly spec, you can simply disable WebAssembly entirely and compile to JavaScript with -sWASM=0. Sadly, at the moment, this is also incompatible with -sEXPORT_ES6=1.

In the next episode, stay tuned for how to actually use this module in a simple test application!

Friday, February 24, 2023

TypeScript modules With Emscripten and CMake, part 4

Building with CMake

Just by screwing around on the command line, we were previously able to produce a more or less useful CommonJS module wrapping the real-valued FFT function from the Kiss FFT library (though not as useful as the existing one on npmjs.org). Now let’s look at how we can build a module with CMake as part of the library’s build system.

As a reminder, we configured CMake to build the library with:

emcmake cmake -S . -B jsbuild -DCMAKE_BUILD_TYPE=Debug \
    -DKISSFFT_TOOLS=OFF -DKISSFFT_STATIC=ON -DKISSFFT_TEST=OFF

When configuring using emcmake, the EMSCRIPTEN variable is defined, so if we want to make all of those flags the defaults, we can add this to CMakeLists.txt after the option definitions (line 54 in the current source):

if(EMSCRIPTEN)
  set(KISSFFT_TOOLS OFF)
  set(KISSFFT_STATIC ON)
  set(KISSFFT_TEST OFF)
endif()

Now let’s add a target to build our module. This is a bit “special” for two reasons:

  • The CMake functions for Emscripten treat any output (even a module) as an “executable”, so we have to make believe we’re linking a program.
  • Even though all of the C code is already in the libkissfft-float.a library, which CMake references with the kissfft target, it still expects to have at least one source file to link into our “executable”.

To satisfy CMake, we will first simply create an empty C file:

touch api.c

We may at some point want to add helper functions for our API, so this isn’t entirely useless - see the corresponding file in SoundSwallower for an example.

Now we will add the necessary CMake configuration to the end of CMakeLists.txt:

if(EMSCRIPTEN)
  add_executable(kissfft.cjs api.c)
  target_link_libraries(kissfft.cjs kissfft)
  target_link_options(kissfft.cjs PRIVATE
    -sMODULARIZE=1
    -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)
  em_link_post_js(kissfft.cjs api.js)
endif()

A few things to note here:

  • em_link_post_js is not documented, but should be.
  • We have to add ${CMAKE_CURRENT_SOURCE_DIR} to the path to exported_functions.txt so that CMake can find it, since we are building in a separate directory.
  • We can’t use kissfft as the target name since that is already taken by the C library.

Emscripten will automatically append .js and .wasm to the target name, so, after adding this, if you run:

emcmake cmake -S . -B jsbuild -DCMAKE_BUILD_TYPE=Debug
cmake --build jsbuild

You should find the files kissfft.cjs.js and kissfft.cjs.wasm in the jsbuild directory.

Building an ES6 module

Up to this point we have built a CommonJS module, since they are simpler to use in Node.js, but in reality, all the cool kids are now using ES6 modules, and they are particularly preferred when using a bundler for the Web like Webpack or esbuild. The latest versions of Emscripten do have built-in, if occasionally buggy, support for producing ES6 modules. So, we can add an extra target inside the if(EMSCRIPTEN) block at the end of CMakeLists.txt:

add_executable(kissfft.esm api.c)
target_link_libraries(kissfft.esm kissfft)
target_link_options(kissfft.esm PRIVATE
  -sMODULARIZE=1 -sEXPORT_ES6=1
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)
em_link_post_js(kissfft.esm api.js)

Sadly, there is no way in the Emscripten CMake support to choose a different file extension for a specific target, so we can’t call this kissfft.esm.mjs. In addition, the boilerplate loader code that Emscripten gives us won’t allow us to share the WebAssembly (which is identical) between targets. For the moment we will end up with kissfft.esm.js and kissfft.esm.wasm in the jsbuild directory, and this is a Problem, as we will see soon.

Packaging with NPM

Now that everything is built, it is actually quite simple to package this as an NPM package. No other action is required on your part… well, not quite. First, let’s create a package.json file, which will have one big problem, that we’ll get to later:

{
  "name": "kissfft-example",
  "version": "0.0.1",
  "description": "A very simple example of packaging WebAssembly",
  "types": "./index.d.ts",
  "main": "./jsbuild/kissfft.cjs.js",
  "exports": {
    ".": {
      "types": "./index.d.ts",
      "require": "./jsbuild/kissfft.cjs.js",
      "import": "./jsbuild/kissfft.esm.js",
      "default": "./jsbuild/kissfft.esm.js"
    }
  },
  "author": "David Huggins-Daines <dhd@ecolingui.ca>",
  "homepage": "https://ecolingui.ca/en/blog/emguide",
  "license": "MIT",
  "scripts": {
    "test": "npx tsc test_realfft.ts && node test_realfft.js"
  },
  "files": [
    "index.d.ts",
    "jsbuild/kissfft.*.js",
    "jsbuild/kissfft.*.wasm"
  ],
  "devDependencies": {
    "@types/node": "^18.14.1",
    "typescript": "^4.9.5"
  },
  "dependencies": {
    "@types/emscripten": "^1.39.6"
  }
}

Of note above:

  • We use the exports field to supply different entry points for import and require (but note that this won’t actually work… more below).
  • We just package the stuff we built in place, by including only the files we need with the files field.
  • We point to the type definition file with the types field in two places, for good luck.
  • Although the node we get with emsdk includes @types/emscripten by default, others will not, so it is a package (and not dev) dependency.
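To see how the exports field is meant to pick an entry point, here is a deliberately simplified model, written in Python for brevity. It walks the keys in order and takes the first one whose condition is active, with "default" always matching. The function resolve_condition is hypothetical, and the real resolvers in Node and TypeScript handle many more cases (subpaths, arrays, patterns), so treat this as a sketch of the ordering rule only.

```python
from typing import Dict, Optional, Set, Union

Target = Union[str, Dict[str, "Target"]]


def resolve_condition(target: Target, conditions: Set[str]) -> Optional[str]:
    """Pick the first entry, in key order, whose condition is active."""
    if isinstance(target, str):
        return target
    for condition, value in target.items():
        if condition == "default" or condition in conditions:
            resolved = resolve_condition(value, conditions)
            if resolved is not None:
                return resolved
    return None


# The "exports" field from the package.json above.
exports = {
    ".": {
        "types": "./index.d.ts",
        "require": "./jsbuild/kissfft.cjs.js",
        "import": "./jsbuild/kissfft.esm.js",
        "default": "./jsbuild/kissfft.esm.js",
    }
}

print(resolve_condition(exports["."], {"node", "require"}))  # ./jsbuild/kissfft.cjs.js
print(resolve_condition(exports["."], {"node", "import"}))   # ./jsbuild/kissfft.esm.js
print(resolve_condition(exports["."], {"types", "import"}))  # ./index.d.ts
```

Because the first match wins, "types" goes first so that a type checker finds it before anything else, and "default" goes last as the fallback.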

Now, assuming you have previously created test_realfft.ts (if not, download it here), you should be able to run:

npm install
npm test

And you should see the same output we saw previously. But, did we say there was a problem? Yes. The nifty ES6 module built above won’t actually work in Node, because the Node developers somehow can’t agree to not depend on file extensions to select module systems. Since our package contains both CommonJS (loaded with require) and ES6 (loaded with import) modules, we have to change the file extension on at least one of them to satisfy Node’s simplistic view of the world.
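Roughly speaking, the rule Node applied at the time of writing can be modelled as below. This is a simplified sketch in Python, not Node's actual code; the function name node_module_format is hypothetical, and newer Node versions may add further heuristics on top of this.

```python
from pathlib import PurePosixPath


def node_module_format(filename: str, package_type: str = "commonjs") -> str:
    """Simplified model of how Node picks a module system for a JavaScript file."""
    suffix = PurePosixPath(filename).suffix
    if suffix == ".mjs":
        return "module"
    if suffix == ".cjs":
        return "commonjs"
    if suffix == ".js":
        # Falls back to the "type" field of the nearest package.json,
        # which is treated as "commonjs" when absent.
        return package_type
    raise ValueError(f"not a JavaScript module: {filename}")


print(node_module_format("jsbuild/kissfft.esm.js"))            # commonjs
print(node_module_format("jsbuild/kissfft.cjs.js", "module"))  # module
print(node_module_format("jsbuild/kissfft.esm.mjs"))           # module
```

With no "type" field, kissfft.esm.js is treated as CommonJS, and adding "type": "module" would simply flip the problem over to kissfft.cjs.js. Only an explicit .mjs or .cjs extension is unambiguous.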

The path of least resistance to fix this and still stay CMakically correct is to add a custom command that copies the built .js file for the ES6 module to a .mjs file:

add_custom_command(
  OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/kissfft.esm.mjs
  DEPENDS kissfft.esm
  COMMAND ${CMAKE_COMMAND} -E copy
  ${CMAKE_CURRENT_BINARY_DIR}/kissfft.esm.js
  ${CMAKE_CURRENT_BINARY_DIR}/kissfft.esm.mjs)
add_custom_target(copy-mjs-bork-bork-bork ALL
  DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/kissfft.esm.mjs)

Now we will modify package.json by changing kissfft.esm.js to kissfft.esm.mjs everywhere, and modifying files to specifically only include the files we need:

  "files": [
    "index.d.ts",
    "jsbuild/kissfft.esm.mjs",
    "jsbuild/kissfft.cjs.js",
    "jsbuild/kissfft.*.wasm"
  ],

You can download the updated version here. And now we can test that both import types work by creating a directory called kissfft-test alongside kissfft, creating the files index.mjs (download here) and index.cjs (download here) in it, then running:

npm link ../kissfft
node index.mjs
node index.cjs

Congratulations! You now have a WebAssembly module that will work as both ES6 and CommonJS, and can also be uploaded to NPM (but please don’t do that). To see what would be packaged, you can run:

npm publish --dry-run

In the next installment, we will see what we can do to make the module as small as possible.