If you need to delve into the murky depths of a PDF to extract
metadata, images, and yes, even text, I
have some excellent Free Software for you:
PLAYA-PDF and
PAVÉS. If you’d like to know how
this came to be, then continue reading. And if you need a consultant
for document intelligence tasks, large and small, I’m currently
available for contracts of all sorts!
“You’re nothing but a pack of indirect objects!”
As you may or may not know, I’m a computational
linguist by
trade and
training. In
2021, something else happened: I was elected to the town council in a
municipality of the greater St-Jérôme area, which shall remain
unnamed. Shortly thereafter, I quit my job as Principal Research
Scientist at a company (which shall also remain unnamed, and is now a
division of Microsoft) because it was clear that I couldn’t continue
to work full-time in Montréal while being a responsive and effective
public servant. I also found municipal politics to be a lot more
interesting and relevant than the slow, incremental improvement of
machine learning models for natural language understanding which I was
working on at the time.
And in the meantime, well, some other things happened…
One of the unintended consequences of this possibly ill-advised career
move was that I ended up becoming an expert of sorts on parsing and
manipulating PDF files. Of course, like any programmer, I did this in
the usual way, by starting a Free Software project.
Why PDF?
Once you get into the details of document management and archiving in
a municipality or other similar organization, it quickly becomes
obvious that, despite all the best efforts of decades of work on ODF,
OOXML, HTML, and various other purportedly universal document formats,
at the end of the day, the only thing you can count on is that a
document will always be available as a PDF. This is the unfortunate
result of Microsoft’s domination of the office software market: not
only is Office gratuitously incompatible in subtle ways with every
other alternative (free and
proprietary), but Office
isn’t even compatible with itself a lot of the time.
Why another PDF library?
Free Software, fundamentally, is about choice, and I had certain
criteria for the tool that I wanted to use, which were not fulfilled
by the available choices:
1. Permissive open-source license (BSD or MIT).
2. Written in Python and portable to various platforms.
3. Programmer-friendly interface.
4. Direct access to internal PDF data structures, not just text
   extraction but also access to graphics state, images and metadata.
5. Fast and efficient, to the extent possible given #2.
The closest thing that I found at the time was
pdfplumber, which is still a
very nice library which definitely satisfies 1, 2 and 3 above! I even
contributed support for logical structure trees to it at some point.
Unfortunately, pdfplumber, like its underlying library
pdfminer.six and other
popular projects, is not very efficient, in particular because it
needs to parse the entirety of each page and construct all the data
structures before returning any useful information.
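The cost of parsing everything up front can be illustrated with a toy sketch in plain Python. This is not code from pdfplumber, pdfminer.six or PLAYA-PDF: `parse_page`, `eager_pages`, `lazy_pages` and the fake page strings are all hypothetical stand-ins, and the counter only shows how much work each style does before the first result comes back.

```python
from typing import Iterator, List

parse_count = 0  # how many pages have actually been parsed


def parse_page(raw: str) -> dict:
    """Hypothetical stand-in for an expensive page parser."""
    global parse_count
    parse_count += 1
    return {"text": raw.upper()}


def eager_pages(raw_pages: List[str]) -> List[dict]:
    """Eager style: parse every page before returning anything."""
    return [parse_page(raw) for raw in raw_pages]


def lazy_pages(raw_pages: List[str]) -> Iterator[dict]:
    """Lazy style: parse a page only when the caller asks for it."""
    for raw in raw_pages:
        yield parse_page(raw)


raw = ["page one", "page two", "page three"]

first_eager = eager_pages(raw)[0]
eager_cost = parse_count  # every page was parsed to get the first one

parse_count = 0
first_lazy = next(lazy_pages(raw))
lazy_cost = parse_count  # only the requested page was parsed
```

Both calls return the same first page, but the eager version has paid for the whole document by then.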
Enter PLAYA: LazY and Parallel Analyzer
This is the main reason for PLAYA-PDF’s existence: it is designed from
the ground up to be
“lazy”,
and only processes the bare minimum of data needed to get the
information you want. On the other hand, if you are lazy, it also
has an “eager” interface which can convert PDF metadata to
JSON quite efficiently:
    import json
    import playa
    with playa.open(path) as pdf:
        json.dumps(playa.asobj(pdf))
The other important aspect of PLAYA-PDF is built-in support for
parallel processing of PDFs, with a simple and easy-to-use interface:
    import playa
    with playa.open(path, max_workers=4) as pdf:
        texts = list(pdf.pages.map(playa.Page.extract_text))
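For readers who have not met this "map a function over the pages with a pool of workers" pattern, here is a standard-library analogy. It is only an analogy and not how PLAYA-PDF is implemented: `extract_text` and the `pages` list are hypothetical stand-ins, and the sketch uses threads purely to stay short and self-contained (real CPU-bound parsing in Python would normally want processes).

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(page: str) -> str:
    """Hypothetical stand-in for per-page work such as text extraction."""
    return page.strip().upper()


pages = ["  first page ", " second page", "third page  "]

# max_workers plays the same role as in playa.open(path, max_workers=4).
with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map yields results in input order, whichever worker finishes first.
    texts = list(pool.map(extract_text, pages))
```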
Par-dessus la PLAYA, les PAVÉS!
Since the guiding principles of PLAYA-PDF are efficiency and the
absence of external dependencies, it doesn’t do any high-level
extraction tasks which require image processing, heuristics or machine
learning models.
For this reason I’ve also created
PAVÉS which will gradually support
more and more ways to do:
Structural and textual analysis of PDFs, such as table detection and
extraction, as well as extraction of rich text and logical structure.
Visualisation of objects in a PDF as well as rendering of pages to images.
This second library is under construction but is already good enough
to do the analysis and extraction used in my projects like
ZONALDA and
SÈRAFIM.
Conclusion
If you are one of the tiny minority of people who this might possibly
interest, then by all means feel free to give it a try! Take a look
at the documentation and the sample Jupyter notebooks to
get an idea of what it can do.
Of course you can also contribute to its development on
GitHub! (I may soon move
development to Codeberg or another independent
provider outside the USA, but the GitHub mirror will always remain).
When I set out to create an NPM package for
SoundSwallower, I was
unable to find much relevant information in the Emscripten
documentation or elsewhere on the Web, so I have written this guide,
which walks through the process of compiling a library to WebAssembly
and packaging it as a CommonJS or ES6 module.
This is part of a series of posts. Start here to read from the beginning.
Optimizing for Size
Really, “for Size” is redundant here. It’s the Web. People are
loading your page/app/whatever on mobile phones over metered
connections. There is no other optimization you should reasonably
care about. So, let’s see how we’re doing:
Not good! Well, we did configure this with
CMAKE_BUILD_TYPE=Debug, and we didn’t bother adding any optimization
flags at compilation or link time. When optimizing for size, however,
we should not immediately switch to a Release build, as the
minimization that it does makes it impossible to understand what parts
of the output are wasting space.
Let’s set up CMake to build everything with maximum optimization with
the -Oz option, which should be passed at both compile and link
time. (Note: I will not discuss -flto here, because it is only
useful when dealing with the eldritch horrors of C++). While we’re at
it we’ll also disable support for the longjmp function which we know
our library doesn’t use:
Now, Emscripten has an
option
to use a smaller, less full-featured memory allocator, which, since we
know that our library is quite simple and doesn’t do a lot of
allocation, is a good idea. Let’s change the link flags again:
If we look again at wasm-objdump we can see that there isn’t much
else we can do, as what’s left consists of runtime support, including
some stubs to allow debug printf() to work.
What about the .js file? Here’s where it gets a bit more
complicated. First, let’s rebuild in Release mode and see where
we’re at after minimization. It’s also important to look at the
compressed size of the .js and .wasm files, as a good webserver
should be configured to serve them with gzip compression:
So, our total payload size is about 11K. This is quite acceptable in
most circumstances, so you may wish to skip to the next
section at this point.
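If you would rather script the compressed-size check than eyeball it, the following is a minimal sketch using Python's standard `gzip` module. The helper names `gzipped_size` and `report` are made up for this example, and the paths assume the `jsbuild` directory and the `kissfft.cjs.*` file names used elsewhere in this series. A real web server may use a different compression level, so treat the numbers as an estimate.

```python
import gzip
from pathlib import Path


def gzipped_size(data: bytes) -> int:
    """Return the size in bytes of data after gzip compression."""
    return len(gzip.compress(data, compresslevel=9))


def report(paths):
    """Print raw and gzipped sizes per file and return the gzipped total."""
    total = 0
    for path in map(Path, paths):
        if not path.exists():
            print(f"{path}: not found, skipping")
            continue
        data = path.read_bytes()
        size = gzipped_size(data)
        total += size
        print(f"{path}: {len(data)} bytes raw, {size} bytes gzipped")
    print(f"total gzipped payload: {total} bytes")
    return total


if __name__ == "__main__":
    # Assumed build output location; adjust to your own build directory.
    report(["jsbuild/kissfft.cjs.js", "jsbuild/kissfft.cjs.wasm"])
```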
Now, Emscripten also has an option -sMINIMAL_RUNTIME=1 (or 2)
which can shrink this a bit more, but the problem is that it doesn’t
actually produce a working CommonJS or ES6 module with
-sMODULARIZE=1 and -sEXPORT_ES6=1, and worse yet, it cannot
produce working code for the Web or ES6 modules, because it loads the
WebAssembly like this:
    if (ENVIRONMENT_IS_NODE) {
        var fs = require('fs');
        Module['wasm'] = fs.readFileSync(__dirname + '/kissfft.cjs.wasm');
    }
Basically your only option if you use -sMINIMAL_RUNTIME is to
postprocess the generated JavaScript to work properly in the target
environment, because even if you enable streaming compilation, it will
still include the offending snippet above, among other things. Doing
this is quite complex and beyond the scope of this guide, but you can
look at the build.js script used by
wasm-audio-encoders,
for example.
The other option, if your module is not too big and you don’t mind
that it all gets loaded at once by the browser, is to do a single-file
build:
Single-File Builds (WASM or JS-only)
In many cases it is a super huge
pain to get the
separate .wasm file packaged and loaded correctly when you are using
a “framework” or even just a run-of-the-mill JavaScript bundler like
Webpack or
ESBuild. This is, for instance, the case
if you use
Angular, which
requires a custom Webpack configuration in order for it to work at
all with modules that use WebAssembly. (Note that by default, Webpack
works just fine, as long as you have a pretty recent version.)
We will go into the details of this in the next installment, but
suffice it to say that, if you have a small enough library, you can
save yourself a lot of trouble by simply making a single-file build,
which you can do by adding -sSINGLE_FILE=1 to the linker
options. This gives a quite acceptable size, which is only slightly
large due to the fact that the WebAssembly gets encoded as base64:
Note, however, that in this case if you load the resulting JavaScript
before your page contents, your users will have to wait until it
downloads to see anything, whereas with a separate .wasm file, the
downloading can be done asynchronously.
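As for the base64 overhead mentioned above, it is easy to put a number on it: base64 emits 4 output characters for every 3 input bytes, so the embedded WebAssembly is about a third larger than the binary before any compression. The sketch below checks that arithmetic with a random 9000-byte stand-in for a `.wasm` file; the helper name `base64_size` is made up for this example.

```python
import base64
import os


def base64_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


payload = os.urandom(9000)  # stand-in for the contents of a .wasm file
encoded = base64.b64encode(payload)

# Ratio of encoded size to original size, roughly 4/3.
overhead = len(encoded) / len(payload)
print(f"{len(payload)} bytes -> {len(encoded)} base64 characters ({overhead:.2f}x)")
```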
Alternately, if you want to support, say, Safari 13, iOS 12, or
anything else that predates the final WebAssembly spec, you can simply
disable WebAssembly entirely and compile to JavaScript with
-sWASM=0. Sadly, at the moment, this is also incompatible with
-sEXPORT_ES6=1.
In the next episode, stay tuned for how to actually use this module
in a simple test application!
Building with CMake
Just by screwing around on the command line, we were
previously able to produce a more or less
useful CommonJS module wrapping the real-valued FFT function from the
Kiss FFT library (though not
as useful as the existing one on
npmjs.org). Now let’s
look at how we can build a module with CMake as part of the library’s
build system.
As a reminder, we configured CMake to build the library with:
When configuring using emcmake, the EMSCRIPTEN variable is
defined, so if we want to make all of those flags the defaults, we can
add this to CMakeLists.txt after the option definitions (line 54
in the current source):
Now let’s add a target to build our module. This is a bit “special”
for two reasons:
The CMake functions for Emscripten treat any output (even a module)
as an “executable”, so we have to make believe we’re linking a program.
Even though all of the C code is already in the libkissfft-float.a
library, which CMake references with the kissfft target, it still
expects to have at least one source file to link into our “executable”.
To satisfy CMake, we will first simply create an empty C file:
We have to add ${CMAKE_CURRENT_SOURCE_DIR} to the path to
exported_functions.txt so that CMake can find it, since we are
building in a separate directory.
We can’t use kissfft as the target name since that is
already taken by the C library.
Emscripten will automatically append .js and .wasm to the target
name, so, after adding this, if you run:
You should find the files kissfft.cjs.js and kissfft.cjs.wasm in
the jsbuild directory.
Building an ES6 module
Up to this point we have built a CommonJS module, since they are
simpler to use in Node.js, but in reality, all the cool kids are now
using ES6 modules, and they are particularly preferred when using a
bundler for the Web like Webpack or
Esbuild. The latest versions of
Emscripten do have built-in, if occasionally buggy, support for
producing ES6 modules. So, we can add an extra target inside the
if(EMSCRIPTEN) block at the end of CMakeLists.txt:
Sadly, there is no way in the Emscripten CMake support to choose a
different file extension for a specific target, so we can’t call this
kissfft.esm.mjs. In addition, the boilerplate loader code that
Emscripten gives us won’t allow us to share the WebAssembly (which is
identical) between targets. For the moment we will end up with
kissfft.esm.js and kissfft.esm.wasm in the jsbuild directory,
and this is a Problem, as we will see soon.
Packaging with NPM
Now that everything is built, it is actually quite simple to package
this as an NPM package. No other action is required on your
part… well, not quite. First, let’s create a package.json file,
which will have one big problem, that we’ll get to later:
And you should see the same output we saw previously. But, did we say
there was a problem? Yes. The nifty ES6 module built
above won’t actually work in Node, because
the Node developers somehow can’t agree to not depend on file
extensions to select module
systems. Since our
package contains both CommonJS (loaded with require) and ES6 (loaded
with import) modules, we have to change the file extension on at
least one of them to satisfy Node’s simplistic view of the world.
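To make the rule concrete, here is a small Python model of how Node picks a module system from a file name. It is a simplification written for this guide, not Node's actual resolver: the function name `node_module_format` is made up, `package_type` stands for the `"type"` field of the nearest package.json, and the syntax sniffing that recent Node releases can apply to ambiguous `.js` files is deliberately ignored.

```python
from pathlib import PurePosixPath
from typing import Optional


def node_module_format(filename: str, package_type: Optional[str] = None) -> str:
    """Simplified model of Node's extension-based module system choice."""
    suffix = PurePosixPath(filename).suffix
    if suffix == ".mjs":
        return "module"  # always treated as an ES6 module
    if suffix == ".cjs":
        return "commonjs"  # always treated as CommonJS
    if suffix == ".js":
        # Decided by the "type" field of package.json, CommonJS if absent.
        return "module" if package_type == "module" else "commonjs"
    return "unknown"


for name in ["kissfft.cjs.js", "kissfft.esm.js", "kissfft.esm.mjs"]:
    print(name, "->", node_module_format(name))
```

With no `"type"` field, both `.js` files come out as CommonJS, which is exactly why the ES6 build needs a different extension.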
The path of least resistance to fix this and still stay CMakically
correct is to add a custom command that copies the built .js file
for the ES6 module to a .mjs file:
Now we will modify package.json by changing kissfft.esm.js to
kissfft.esm.mjs everywhere, and modifying files to specifically
only include the files we need:
Congratulations! You now have a WebAssembly module that will work as
both ES6 and CommonJS, and can also be uploaded to NPM (but please
don’t do that). To see what would be packaged, you can run:
npm publish --dry-run
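For orientation, a manifest along these lines could be generated as sketched below. This is not the package.json from this guide: the package name, version and overall layout are assumptions made for illustration, and only the file names under `jsbuild` come from the text. It relies on the standard `exports` conditions `require` and `import` to point each module system at its own entry point, and on `files` to limit what gets published.

```python
import json

# Hypothetical manifest: name, version and layout are assumptions.
package = {
    "name": "kissfft-example",
    "version": "0.0.1",
    "main": "jsbuild/kissfft.cjs.js",
    "exports": {
        "require": "./jsbuild/kissfft.cjs.js",  # CommonJS entry point
        "import": "./jsbuild/kissfft.esm.mjs",  # ES6 entry point
    },
    # Only ship the built modules and their WebAssembly.
    "files": [
        "jsbuild/kissfft.cjs.js",
        "jsbuild/kissfft.cjs.wasm",
        "jsbuild/kissfft.esm.mjs",
        "jsbuild/kissfft.esm.wasm",
    ],
}

manifest = json.dumps(package, indent=2)
print(manifest)
```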
In the next installment, we
will see what we can do to make the module as small as possible.