Shannon's Hatchet

something, but almost nothing

Wednesday, February 25, 2026

Presenting PLAYA-PDF (and PAVÉS)

If you need to delve into the murky depths of a PDF, not to return with spices and silk but to extract metadata, images, and yes, even text, I have some excellent Free Software for you: PLAYA-PDF and PAVÉS. If you’d like to know how this came to be, then continue reading. And if you need a consultant for document intelligence tasks, large and small, I’m currently available for contracts of all sorts!

A young girl swoons surrounded by a storm of papers with the PDF logo

“You’re nothing but a pack of indirect objects!”

As you may or may not know, I’m a computational linguist by trade and training. In 2021, something else happened: I was elected to the town council in a municipality of the greater St-Jérôme area, which shall remain unnamed. Shortly thereafter, I quit my job as Principal Research Scientist at a company (which shall also remain unnamed, and is now a division of Microsoft) because it was clear that I couldn’t continue to work full-time in Montréal while being a responsive and effective public servant. I also found municipal politics to be a lot more interesting and relevant than the slow, incremental improvement of machine learning models for natural language understanding which I was working on at the time.

And in the meantime, well, some other things happened…

One of the unintended consequences of this possibly ill-advised career move was that I ended up becoming an expert of sorts on parsing and manipulating PDF files. Of course, like any programmer, I did this in the usual way, by starting a Free Software project.

Why PDF?

Once you get into the details of document management and archiving in a municipality or other similar organization, it quickly becomes obvious that, despite decades of work on ODF, OOXML, HTML, and various other purportedly universal document formats, at the end of the day, the only thing you can count on is that a document will always be available as a PDF. This is the unfortunate result of Microsoft’s domination of the office software market: not only is Office gratuitously incompatible in subtle ways with every other alternative (free and proprietary), but Office isn’t even compatible with itself a lot of the time.

Why another PDF library?

Free Software, fundamentally, is about choice, and I had certain criteria for the tool that I wanted to use, which were not fulfilled by the available choices:

  1. Permissive open-source license (BSD or MIT).
  2. Written in Python and portable to various platforms.
  3. Programmer-friendly interface.
  4. Direct access to internal PDF data structures, not just text extraction but also access to graphics state, images and metadata.
  5. Fast and efficient, to the extent possible given #2.

The closest thing that I found at the time was pdfplumber, which is still a very nice library which definitely satisfies 1, 2 and 3 above! I even contributed support for logical structure trees to it at some point. Unfortunately, pdfplumber, like its underlying library pdfminer.six and other popular projects, is not very efficient, in particular because it needs to parse the entirety of each page and construct all the data structures before returning any useful information.

Enter PLAYA: LazY and Parallel Analyzer

This is the main reason for PLAYA-PDF’s existence: it is designed from the ground up to be “lazy”, and only processes the bare minimum of data needed to get the information you want. On the other hand, if you are lazy, it also has an “eager” interface which can convert PDF metadata to JSON quite efficiently:

import json
import playa
with playa.open(path) as pdf:
    print(json.dumps(playa.asobj(pdf)))
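To make the lazy versus eager distinction concrete, here is a minimal sketch in plain Python. It is not how PLAYA-PDF is actually implemented: `CountingParser`, `FAKE_PAGE`, `eager_objects` and `lazy_objects` are all hypothetical names invented for this illustration, and the "page" is just a list of strings standing in for a content stream. The point is only that a generator does work when the caller asks for an item, whereas a list has to be built completely first.

```python
from typing import Iterator, List

# Hypothetical stand-in for a page's content stream: a list of raw tokens.
FAKE_PAGE = ["BT", "(Hello)", "Tj", "ET", "q", "100 0 0 100 0 0 cm", "/Im1", "Do", "Q"]


class CountingParser:
    """Pretends to do expensive parsing and counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def parse(self, raw: str) -> str:
        self.calls += 1
        return raw.strip("()")


def eager_objects(parser: CountingParser, tokens: List[str]) -> List[str]:
    """Eager: parse every token and build the whole list before returning."""
    return [parser.parse(t) for t in tokens]


def lazy_objects(parser: CountingParser, tokens: List[str]) -> Iterator[str]:
    """Lazy: parse one token at a time, only when the caller asks for it."""
    for t in tokens:
        yield parser.parse(t)


# Eager: every token is parsed even though only the first one is used.
eager_parser = CountingParser()
first_eager = eager_objects(eager_parser, FAKE_PAGE)[0]

# Lazy: parsing stops as soon as the first item has been produced.
lazy_parser = CountingParser()
first_lazy = next(lazy_objects(lazy_parser, FAKE_PAGE))

print(first_eager, eager_parser.calls)
print(first_lazy, lazy_parser.calls)
```

Both approaches return the same first item, but the eager parser's call count equals the number of tokens on the page, while the lazy one has been called exactly once.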

The other important aspect of PLAYA-PDF is built-in support for parallel processing of PDFs, with a simple and easy-to-use interface:

import playa
with playa.open(path, max_workers=4) as pdf:
    texts = list(pdf.pages.map(playa.Page.extract_text))
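The general pattern behind an interface like this, mapping a function over pages on a pool of workers and getting the results back in page order, can be sketched with the standard library alone. This is an illustration under assumptions, not PLAYA-PDF's code: `map_pages` and `count_words` are hypothetical names, the "pages" are plain strings, and threads are used only to keep the example self-contained. For CPU-bound work such as PDF parsing one would normally reach for `ProcessPoolExecutor`, which offers the same `map` method.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def map_pages(func: Callable[[P], R], pages: Iterable[P], max_workers: int = 4) -> List[R]:
    """Apply func to each page on a pool of workers, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whichever worker finishes first.
        return list(pool.map(func, pages))


def count_words(page_text: str) -> int:
    """Hypothetical per-page task: count whitespace-separated words."""
    return len(page_text.split())


# Hypothetical stand-ins for the text of three pages.
pages = ["first page of text", "second page", "a third and final page"]
word_counts = map_pages(count_words, pages, max_workers=2)
print(word_counts)
```

Because results come back in input order, the caller never has to think about which worker handled which page.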

Par-dessus la PLAYA, les PAVÉS! (On top of the PLAYA, the paving stones!)

Since the guiding principles of PLAYA-PDF are efficiency and the absence of external dependencies, it doesn’t do any high-level extraction tasks which require image processing, heuristics or machine learning models.

For this reason I’ve also created PAVÉS which will gradually support more and more ways to do:

  • Structural and textual analysis of PDFs, such as table detection and extraction, as well as extraction of rich text and logical structure.
  • Visualisation of objects in a PDF as well as rendering of pages to images.

This second library is under construction but is already good enough to do the analysis and extraction used in my projects like ZONALDA and SÈRAFIM.

Conclusion

If you are one of the tiny minority of people whom this might possibly interest, then by all means feel free to give it a try! Take a look at the documentation and the sample Jupyter notebooks to get an idea of what it can do.

Of course you can also contribute to its development on GitHub! (I may soon move development to Codeberg or another independent provider outside the USA, but the GitHub mirror will always remain.)