Shannon's Hatchet

something, but almost nothing

Tuesday, April 21, 2026

What can you do, what can you do, with a cheap low-power GPU?

Say you have a corporate surplus small-form-factor PC from the 2010s. Say, an HP/Compaq (yes, “Compaq”) Elite 8100 or a Lenovo ThinkCentre M82. Oh, okay, maybe you have three or four of them (when your significant other moves in, you will have fewer). You have no desire to converse with the fake machine god that will supposedly kill us all, but at the same time, you’ve heard that recent VLMs are remarkably good at OCR and other actually useful information extraction tasks.

What do you need in order to run them on your computer?

  • A low-profile graphics card with a low-profile bracket
  • With reasonable power consumption (single-slot under 75W TDP)
  • Supported by current drivers and compute libraries

Can you do this? Why yes, you can!

All of the information that follows assumes that you are running Debian 13, but should apply to various Ubuntus. I do not know what a “NixOS” or an “Arch” or an “Omarchy” is so please don’t ask.

Cheapo GPUs as of April 2026

As of this writing, approximate prices (via eBay) and specifications (via TechPowerUp) of GPUs that meet these criteria, ordered by price in loonies and toonies:

Card             Price (CAD)   TFLOPS (FP16)   TFLOPS (FP32)   Memory (GiB)   TDP (W)
P620             $50           0.022           1.4             2              40
P1000            $80           0.022           1.5             4              47
GT1030           $80           0.017           1.2             2              40
Radeon RX 640    $90           1.7             1.7             4              50
Tesla P4         $120          0.09            5.7             8              75
GTX1050Ti        $130          0.03            2               4              75
T400             $150          2.2             1.1             4              30
Arc A310         $180          5.4             2.7             4              30
T600             $200          3.4             1.7             4              40
GTX1650          $200          6               3               4              75
Radeon RX 6400   $250          7.1             3.5             4              53
RTX A400         $275          2.7             2.7             4              50
T1000            $280          5               2.5             4              50
Arc Pro A50      $325          9.6             4.8             6              75
T1000 8GB        $350          5               2.5             8              50
RTX A2000        $500          8               8               6              70
A2 Tensor Core   $550          4.5             4.5             16             60
Arc Pro B50      $600          21              10              16             70
RTX A1000        $600          7               7               8              50
RTX A2000 12GB   $800          8               8               12             70
Tesla T4         $950          65              8               16             70

Recommendations

(we’ll talk about the Tesla P4 later on…)

The sweet spot for performance and compatibility (since unfortunately, CUDA Rules Everything Around Me) is the GTX1650, though its fans are quite loud. Be careful to get a genuinely low-profile version, not simply a compact one: the GTX1650 also comes in an “Aero-ITX” form factor, which is unlikely to fit in a low-profile case.

The GT1030 is still the best fanless GPU after all these years, and can actually be coaxed into running some fairly useful models, but that’s about all it has going for it. Also, its heatsink is massive so it won’t fit in really tiny PCs. So, if you don’t mind a bit of noise, the P620 and P1000 are a great deal.

In theory the Arc A310 should be an excellent choice, but software support for it is quite uncertain. And while the RX640 is a GCN4 architecture card and is thus in theory supported by llama.cpp, it is extremely unclear exactly which version of ROCm you need to install to make this work. But ROCm is open source, so in theory… you could do this (I haven’t yet, stay tuned).

In case you’re worried about the absolutely horrible FP16 performance of some cards in the table… don’t worry about it unless you want to do some form of fine-tuning, which you probably don’t have enough VRAM to do anyway. You’re going to have to use quantized models, so the actual computation will get done in single precision.
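If you want to sanity-check whether a quantized model will fit in VRAM, the arithmetic is simple. Here is a rough sketch; the bits-per-weight figures follow from the on-disk block layouts of two common ggml quantization formats, and the 1.7B parameter count is just an illustrative figure, not any particular model:

```python
# Back-of-envelope VRAM estimate for quantized GGUF models.
# Q8_0 stores 32 int8 weights plus one fp16 scale (34 bytes per block);
# Q4_0 stores 32 4-bit weights plus one fp16 scale (18 bytes per block).
# Treat the result as a lower bound: the KV cache and activation
# buffers need room too.

def gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate model weight size in GiB for a given quantization."""
    return n_params * bits_per_weight / 8 / 2**30

# bits per weight = bytes_per_block * 8 / weights_per_block
Q8_0 = 34 * 8 / 32   # 8.5 bits per weight
Q4_0 = 18 * 8 / 32   # 4.5 bits per weight

for name, bpw in [("Q8_0", Q8_0), ("Q4_0", Q4_0)]:
    size = gguf_size_gib(1.7e9, bpw)  # a hypothetical ~1.7B-param model
    print(f"{name}: {size:.2f} GiB")
```

As you can see, even a smallish model at Q8_0 is a tight squeeze on a 2 GiB card, which is why the unified-memory trick described below matters.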

Software setup

Our test case here is simple: extract a table as HTML (no, not Markdown, because shut up, clanker, that’s why) using dots.mocr. We’ll use the weakest card in the bunch above, the GT1030, just because we can.

First we’ll install a few gigabytes of CUDA nonsense. This will require you to add the evil incantation contrib non-free non-free-firmware to the end of all the deb lines in /etc/apt/sources.list. Now you can:

sudo apt update
sudo apt install linux-headers-amd64 nvidia-driver nvidia-cuda-toolkit \
                 cmake ninja-build

You’ll probably have to reboot. Make sure your card is detected:

$ nvidia-smi
Tue Apr 21 14:19:24 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GT 1030         Off |   00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8             N/A /   30W |       2MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Success! Next, we will build a copy of llama.cpp (maybe a pre-compiled one will work, but it probably requires Just Exactly The Right Version Of Ten Thousand Libraries, so we won’t even try).

In the past, you had to convince llama.cpp that you really wanted to use some ancient GPU architecture from 2016, but recent versions of llama.cpp are able to detect your GPU capability, so -DCMAKE_CUDA_ARCHITECTURES is no longer necessary if you build on the same machine you’ll be running on. You can simply:

git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -S. -Bbuild -GNinja -DGGML_CUDA=ON
cmake --build build

This might take a while if your computer is old like mine.

Note that on the GTX 1650, you may want to follow the helpful suggestions that llama.cpp prints when you run it, and instead configure your build with:

cmake -DGGML_CUDA=ON -S. -Bbuild -GNinja \
    '-DCMAKE_CUDA_ARCHITECTURES=61-virtual;80-virtual' \
    -DGGML_CUDA_FORCE_MMQ=ON

Running a model!

Since HuggingFace “acquired” (meaning “acquihired”, I guess) llama.cpp, it now works super well with their model repositories. Let’s try it out! First, we’ll get ourselves a table of some sort. Because I’ll be using it in another blog post shortly, we can try the permitting statistics for the last year of my term as a town councillor, which I obtained with a freedom of information request (and you can too!).

While llama.cpp now has a fancy web interface that can use pdf.js to convert PDFs to images, we’ll do this the old-fashioned way from the command line. First, convert the pages to PNGs:

pdftocairo -png -r 100 analyse_permis.pdf analyse_permis

Now we can simply run llama-cli with an appropriately quantized model that fits on our GPU…

No! We can’t do that. The reason is that by default llama.cpp will try to put not only the model, but also the input and output data buffers, on the GPU. So even if the model fits just fine, it will most likely segfault, because llama.cpp is written in C++, so segmentation faults are just its normal form of error handling.

This is very annoying! Luckily there’s a way around it, which is to use the top-secret GGML_CUDA_ENABLE_UNIFIED_MEMORY environment variable, which basically tells llama.cpp to load and unload things automatically between CPU and GPU memory:

export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

You may think that this is going to be slower than explicitly loading everything on the GPU, but in reality it isn’t, unless your model doesn’t fit on the GPU. You’ll know if that happens 😉

Wow! Look at it go:

$ ./build/bin/llama-cli -hf lodrick-the-lafted/dots.mocr-gguf:Q8_0 \
    --image analyse_permis-1.png -st -p "Convert table to HTML"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 1992 MiB):
  Device 0: NVIDIA GeForce GT 1030, compute capability 6.1, VMM: yes, VRAM: 1992 MiB

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b1-98d2d28
model      : lodrick-the-lafted/dots.mocr-gguf:Q8_0
modalities : text, vision

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern
  /image <file>       add an image file

Loaded media from 'analyse_permis-1.png'

> Convert table to HTML

<html><body><table><thead><tr><td rowspan="2"></td><td colspan="2">Logements ...

[ Prompt: 39,3 t/s | Generation: 18,1 t/s ]

But is the output any good? Amazingly enough, llama-cli cannot simply write the model’s output to a file or to standard output without printing its stupid logo everywhere (yes, really). So, if you don’t want to chat with your imaginary machine god but actually want to write code that gets the computer to do useful work, you have to either:

  1. copy and paste from the terminal
  2. run llama-server instead and talk to it over HTTP
  3. use the llama.cpp Python wrapper (which is not super well maintained)

For simplicity’s sake here, we’ll just copy and paste, and you can verify that indeed, it has quite faithfully rendered the table structure and content!
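If you do want to script it, option 2 is the least painful. Here is a minimal sketch that talks to llama-server over HTTP. It assumes the server’s OpenAI-compatible /v1/chat/completions endpoint, the default port 8080, and that your build accepts images as base64 data URIs; check the documentation for your llama-server version, and note that extract_table is hypothetical glue code, not part of llama.cpp:

```python
# Sketch: send an image plus a prompt to a local llama-server instance
# via its OpenAI-compatible chat completions endpoint. The payload
# shape (base64 data URI under "image_url") is an assumption that may
# vary between llama-server versions.
import base64
import json
import urllib.request

def build_payload(prompt: str, png_bytes: bytes) -> dict:
    """Build an OpenAI-style multimodal chat request body."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

def extract_table(image_path: str,
                  server: str = "http://localhost:8080") -> str:
    """Hypothetical helper: ask the server to convert a table image."""
    with open(image_path, "rb") as f:
        payload = build_payload("Convert table to HTML", f.read())
    req = urllib.request.Request(
        f"{server}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

No copying and pasting, and no logo.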

A quick note about llama-server

You should know that if you use llama-server, it has various default settings which are exceedingly suboptimal for small GPUs, and will probably cause it to segfault or loop endlessly after a couple of prompts. Probably Ollama works better, if you like installing software with curl | sudo bash (do not do this).

Basically you just need to disable any kind of multi-user or caching capability, which you can do with the options --parallel 0 --cache-ram 0 --no-cache-prompt.

A full example of how to use it will show up in the near future on this blog, but in French so, I guess, you might want to take a language course.

What about the NVidia Tesla P4?

But, you say, what about Tesla P4s, which are spectacularly cheap and plentiful? Yes, I tried this, so now you don’t have to. There are a few important things to know about this card:

  1. It is “fanless” but relies on active cooling from outside. It will quickly heat up to 91 degrees Celsius and either:

    1. throttle its performance down to nothing
    2. burn a hole in your computer
    3. all of the above

    People have tried various things to cool it with varying degrees of success. Some of these things are even for sale on eBay! Many sellers will even sell you a P4 with a fan installed! Crucially, however, all these cooling fans take up extra space in your case so may not fit in your average $30 small form factor corporate surplus PC. Make sure to measure!

  2. It requires certain PCIe and/or BIOS features that older computers (1st generation Intel Core processors) don’t support. If your computer doesn’t allow you to use the onboard video when a discrete GPU is plugged in, it likely won’t be recognized at all and there is nothing you can do about this. Notably it does not work on a Compaq Elite 8100, but it does work in a Lenovo ThinkCentre M82, and you can also fit a cooler in there if you remove the hard drive cage.

  3. NVidia (and thus the Debian nvidia-detect utility) claims it is only supported by the 470 series drivers, but this is actually a mistake in the documentation, as it works absolutely fine with the 550 series on Debian 13. It’s a Pascal-architecture GPU with compute capability 6.1, just like the P1000, but twice as fast, with twice the memory (if you can cool it).

So, while it may seem too good to be true, it may actually work well for you with some fiddling. What I found in practice is that, if you can keep it under 80C, the P4 will run quite a bit faster than the GTX1650 (75 tokens per second with dots.mocr), so it’s definitely worth the trouble.

The same caveats for the P4 apply also to the A2 and T4, which are similarly passive-cooled cards, but also exceedingly expensive, so, in summary, life is a symphony of contrasts. Note that the same cooling fans that work for the P4 generally also work for the T4. But you should just buy an Intel Arc Pro B50 instead if you’re going to spend that much.

Wednesday, February 25, 2026

Presenting PLAYA-PDF (and PAVÉS)

If you need to delve into the murky depths of a PDF to return with spices and silk (or, more prosaically, to extract metadata, images, and yes, even text), I have some excellent Free Software for you: PLAYA-PDF and PAVÉS. If you’d like to know how this came to be, then continue reading. And if you need a consultant for document intelligence tasks, large and small, I’m currently available for contracts of all sorts!

[Image: a young girl swoons, surrounded by a storm of papers bearing the PDF logo]

“You’re nothing but a pack of indirect objects!”

As you may or may not know, I’m a computational linguist by trade and training. In 2021, something else happened: I was elected to the town council in a municipality of the greater St-Jérôme area, which shall remain unnamed. Shortly thereafter, I quit my job as Principal Research Scientist at a company (which shall also remain unnamed, and is now a division of Microsoft) because it was clear that I couldn’t continue to work full-time in Montréal while being a responsive and effective public servant. I also found municipal politics to be a lot more interesting and relevant than the slow, incremental improvement of machine learning models for natural language understanding which I was working on at the time.

And in the meantime, well, some other things happened…

One of the unintended consequences of this possibly ill-advised career move was that I ended up becoming an expert of sorts on parsing and manipulating PDF files. Of course, like any programmer, I did this in the usual way, by starting a Free Software project.

Why PDF?

Once you get into the details of document management and archiving in a municipality or other similar organization, it quickly becomes obvious that, despite all the best efforts of decades of work on ODP, OOXML, HTML, and various other purportedly universal document formats, at the end of the day, the only thing you can count on is that a document will always be available as a PDF. This is the unfortunate result of Microsoft’s domination of the office software market: not only is Office gratuitously incompatible in subtle ways with every other alternative (free and proprietary), but Office isn’t even compatible with itself a lot of the time.

Why another PDF library?

Free Software, fundamentally, is about choice, and I had certain criteria for the tool that I wanted to use, which were not fulfilled by the available choices:

  1. Permissive open-source license (BSD or MIT).
  2. Written in Python and portable to various platforms.
  3. Programmer-friendly interface.
  4. Direct access to internal PDF data structures, not just text extraction but also access to graphics state, images and metadata.
  5. Fast and efficient, to the extent possible given #2.

The closest thing that I found at the time was pdfplumber, which is still a very nice library and definitely satisfies criteria 1, 2 and 3 above! I even contributed support for logical structure trees to it at some point. Unfortunately, pdfplumber, like its underlying library pdfminer.six and other popular projects, is not very efficient, in particular because it needs to parse the entirety of each page and construct all the data structures before returning any useful information.

Enter PLAYA: LazY and Parallel Analyzer

This is the main reason for PLAYA-PDF’s existence: it is designed from the ground up to be “lazy”, and only processes the bare minimum of data needed to get the information you want. On the other hand, if you are lazy, it also has an “eager” interface which can convert PDF metadata to JSON quite efficiently:

with playa.open(path) as pdf:
    json.dumps(playa.asobj(pdf))
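To see why laziness matters, here is a toy illustration in plain Python of the difference between eager and lazy parsing. This is not PLAYA’s actual implementation, just the general idea:

```python
# Toy illustration of the lazy-parsing idea (not PLAYA's real code):
# an eager parser pays the full parsing cost up front, while a lazy
# (generator-based) one only parses pages as they are requested.

def parse_page(raw: bytes) -> str:
    # stand-in for the expensive part: decoding content streams, etc.
    return raw.decode("latin-1").upper()

def eager_parse(pages: list[bytes]) -> list[str]:
    return [parse_page(p) for p in pages]   # all pages, always

def lazy_parse(pages: list[bytes]):
    for p in pages:
        yield parse_page(p)                 # one page at a time

pages = [b"page one", b"page two", b"page three"]
first = next(lazy_parse(pages))             # only page one gets parsed
print(first)                                # prints "PAGE ONE"
```

With the generator version, asking for the first page costs one page’s worth of work, which is the whole point when you only need a table from page 3 of a 400-page PDF.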

The other important aspect of PLAYA-PDF is built-in support for parallel processing of PDFs, with a simple and easy-to-use interface:

with playa.open(path, max_workers=4) as pdf:
    texts = list(pdf.pages.map(playa.Page.extract_text))

Par-dessus la PLAYA, les PAVÉS!

Since the guiding principles of PLAYA-PDF are efficiency and the absence of external dependencies, it doesn’t do any high-level extraction tasks which require image processing, heuristics or machine learning models.

For this reason I’ve also created PAVÉS which will gradually support more and more ways to do:

  • Structural and textual analysis of PDFs, such as table detection and extraction, as well as extraction of rich text and logical structure.
  • Visualisation of objects in a PDF as well as rendering of pages to images.

This second library is under construction but is already good enough to do the analysis and extraction used in my projects like ZONALDA and SÈRAFIM.

Conclusion

If you are one of the tiny minority of people whom this might possibly interest, then by all means feel free to give it a try! Take a look at the documentation and the sample Jupyter notebooks to get an idea of what it can do.

Of course you can also contribute to its development on GitHub! (I may soon move development to Codeberg or another independent provider outside the USA, but the GitHub mirror will always remain).

Monday, February 27, 2023

TypeScript modules with Emscripten and CMake, part 5

When I set out to create an NPM package for SoundSwallower, I was unable to find much relevant information in the Emscripten documentation or elsewhere on the Web, so I have written this guide, which walks through the process of compiling a library to WebAssembly and packaging it as a CommonJS or ES6 module.

This is part of a series of posts. Start here to read from the beginning.

Optimizing for Size

Really, “for Size” is redundant here. It’s the Web. People are loading your page/app/whatever on mobile phones over metered connections. There is no other optimization you should reasonably care about. So, let’s see how we’re doing:

$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  64K fév 27 12:09 kissfft.cjs.js
-rwxrwxr-x  1 dhd dhd 197K fév 27 12:09 kissfft.cjs.wasm

Not good! Well, we did configure this with CMAKE_BUILD_TYPE=Debug, and we didn’t bother adding any optimization flags at compilation or link time. When optimizing for size, however, we should not immediately switch to Release build, as the minimization that it does makes it impossible to understand what parts of the output are wasting space.

Let’s set up CMake to build everything with maximum size optimization using the -Oz option, which should be passed at both compile and link time. (Note: I will not discuss -flto here, because it is only useful when dealing with the eldritch horrors of C++.) While we’re at it, we’ll also disable support for the longjmp function, which we know our library doesn’t use:

target_compile_options(kissfft PRIVATE -Oz -sSUPPORT_LONGJMP=0 -sSTRICT=1)
target_link_options(kissfft.cjs PRIVATE
  -sMODULARIZE=1 -Oz -sSUPPORT_LONGJMP=0
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)
target_link_options(kissfft.esm PRIVATE
  -sMODULARIZE=1 -sEXPORT_ES6=1 -Oz -sSUPPORT_LONGJMP=0
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)

Now let’s see where we’re at:

$ cmake --build jsbuild
$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  40K fév 27 12:11 kissfft.cjs.js
-rwxrwxr-x  1 dhd dhd 139K fév 27 12:11 kissfft.cjs.wasm

Already much better! Now, if you have WABT installed, you can use wasm-objdump to quickly see which functions are taking the most space:

$ wasm-objdump -x jsbuild/kissfft.cjs.wasm  | grep size
 - func[13] size=5194 <dlmalloc>
 - func[14] size=1498 <dlfree>

Now, Emscripten has an option to use a smaller, less full-featured memory allocator, which, since we know that our library is quite simple and doesn’t do a lot of allocation, is a good idea. Let’s change the link flags again:

target_link_options(kissfft.cjs PRIVATE
  -sMODULARIZE=1 -Oz -sSUPPORT_LONGJMP=0 -sMALLOC=emmalloc
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)
target_link_options(kissfft.esm PRIVATE
  -sMODULARIZE=1 -sEXPORT_ES6=1 -Oz -sSUPPORT_LONGJMP=0 -sMALLOC=emmalloc
  -sEXPORTED_FUNCTIONS=@${CMAKE_CURRENT_SOURCE_DIR}/exported_functions.txt)

This saves another 20K (unminimized and uncompressed):

$ cmake --build jsbuild
$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  40K fév 27 12:14 kissfft.cjs.js
-rwxrwxr-x  1 dhd dhd 119K fév 27 12:14 kissfft.cjs.wasm

If we look again at wasm-objdump we can see that there isn’t much else we can do, as what’s left consists of runtime support, including some stubs to allow debug printf() to work.

What about the .js file? Here’s where it gets a bit more complicated. First, let’s rebuild in Release mode and see where we’re at after minimization. It’s also important to look at the compressed size of the .js and .wasm files, as a good webserver should be configured to serve them with gzip compression:

$ emcmake cmake -S. -B jsbuild -DCMAKE_BUILD_TYPE=Release
$ cmake --build jsbuild
$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  12K fév 27 12:32 kissfft.cjs.js
-rwxrwxr-x  1 dhd dhd  12K fév 27 12:32 kissfft.cjs.wasm
$ gzip -c jsbuild/kissfft.cjs.js | wc -c
3599
$ gzip -c jsbuild/kissfft.cjs.wasm | wc -c
7307

So, our total payload size is about 11K. This is quite acceptable in most circumstances, so you may wish to skip to the next section at this point.

Now, Emscripten also has an option -sMINIMAL_RUNTIME=1 (or 2) which can shrink this a bit more, but the problem is that it doesn’t actually produce a working CommonJS or ES6 module with -sMODULARIZE=1 and -sEXPORT_ES6=1, and worse yet, it cannot produce working code for the Web or ES6 modules, because it loads the WebAssembly like this:

if (ENVIRONMENT_IS_NODE) {
  var fs = require('fs');
  Module['wasm'] = fs.readFileSync(__dirname + '/kissfft.cjs.wasm');
}

Basically your only option if you use -sMINIMAL_RUNTIME is to postprocess the generated JavaScript to work properly in the target environment, because even if you enable streaming compilation, it will still include the offending snippet above, among other things. Doing this is quite complex and beyond the scope of this guide, but you can look at the build.js script used by wasm-audio-encoders, for example.
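To give a flavour of what that postprocessing involves, here is a sketch in Python that swaps the Node-only loader shown above for a fetch()-based one. This is only the idea, not a robust tool: the exact text of the snippet varies between Emscripten versions, whether the runtime accepts what you substitute is version-dependent, and a real script (like build.js) has to be much more careful:

```python
# Sketch: patch Emscripten output so the WASM is fetched over HTTP
# instead of read from disk with Node's fs module. The regex below
# matches the loader snippet as shown in this post; other Emscripten
# versions emit different text, so treat this as illustrative only.
import re

NODE_LOADER = re.compile(
    r"if \(ENVIRONMENT_IS_NODE\) \{.*?readFileSync\(.*?\);\s*\}",
    re.DOTALL,
)

# Hypothetical web-friendly replacement; whether Module['wasm'] may be
# a Promise depends on the runtime you build with.
WEB_LOADER = (
    "Module['wasm'] = fetch(new URL('kissfft.cjs.wasm', import.meta.url))"
    ".then(r => r.arrayBuffer());"
)

def patch(js_source: str) -> str:
    """Replace the Node-only WASM loader with a fetch()-based one."""
    return NODE_LOADER.sub(WEB_LOADER, js_source)
```

Again, this is the sort of thing you only want to maintain if the size savings really matter to you.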

The other option, if your module is not too big and you don’t mind that it all gets loaded at once by the browser, is to do a single-file build:

Single-File Builds (WASM or JS-only)

In many cases it is a super huge pain to get the separate .wasm file packaged and loaded correctly when you are using a “framework” or even just a run-of-the-mill JavaScript bundler like Webpack or ESBuild. This is, for instance, the case if you use Angular, which requires a custom Webpack configuration in order to work at all with modules that use WebAssembly. (Note that by default, Webpack itself works just fine, as long as you have a fairly recent version.)

We will go into the details of this in the next installment, but suffice it to say that, if you have a small enough library, you can save yourself a lot of trouble by simply making a single-file build, which you can do by adding -sSINGLE_FILE=1 to the linker options. This gives a quite acceptable size, which is only slightly large due to the fact that the WebAssembly gets encoded as base64:

$ ls -lh jsbuild
-rw-rw-r--  1 dhd dhd  29K fév 27 16:07 kissfft.cjs.js
$ gzip -c jsbuild/kissfft.cjs.js | wc -c
13321

Note, however, that in this case if you load the resulting JavaScript before your page contents, your users will have to wait until it downloads to see anything, whereas with a separate .wasm file, the downloading can be done asynchronously.

Alternately, if you want to support, say, Safari 13, iOS 12, or anything else that predates the final WebAssembly spec, you can simply disable WebAssembly entirely and compile to JavaScript with -sWASM=0. Sadly, at the moment, this is also incompatible with -sEXPORT_ES6=1.

In the next episode, stay tuned for how to actually use this module in a simple test application!