What can you do, what can you do, with a cheap low-power GPU?
Say you have a corporate surplus small-form-factor PC from the 2010s. Say, an HP/Compaq (yes, “Compaq”) Elite 8100 or a Lenovo ThinkCentre M82. Oh, okay, maybe you have three or four of them (when your significant other moves in, you will have fewer). You have no desire to converse with the fake machine god that will supposedly kill us all, but at the same time, you’ve heard that recent VLMs are remarkably good at OCR and other actually useful information extraction tasks.
What do you need in order to run them on your computer?
- A low-profile graphics card with a low-profile bracket
- With reasonable power consumption (single-slot under 75W TDP)
- Supported by current drivers and compute libraries
Can you do this? Why yes, you can!
All of the information that follows assumes that you are running Debian 13, but should apply to various Ubuntus. I do not know what a “NixOS” or an “Arch” or an “Omarchy” is so please don’t ask.
Cheapo GPUs as of April 2026
As of this writing, approximate prices (via eBay) and specifications (via TechPowerUp) of GPUs that meet these criteria, ordered by price in loonies and toonies:
| Card | Price (CAD) | TFLOPS (FP16) | TFLOPS (FP32) | Memory (GiB) | TDP (W) |
|---|---|---|---|---|---|
| P620 | $50 | 0.022 | 1.4 | 2 | 40 |
| P1000 | $80 | 0.022 | 1.5 | 4 | 47 |
| GT1030 | $80 | 0.017 | 1.2 | 2 | 40 |
| Radeon RX 640 | $90 | 1.7 | 1.7 | 4 | 50 |
| Tesla P4 | $120 | 0.09 | 5.7 | 8 | 75 |
| GTX1050Ti | $130 | 0.03 | 2 | 4 | 75 |
| T400 | $150 | 2.2 | 1.1 | 4 | 30 |
| Arc A310 | $180 | 5.4 | 2.7 | 4 | 30 |
| T600 | $200 | 3.4 | 1.7 | 4 | 40 |
| GTX1650 | $200 | 6 | 3 | 4 | 75 |
| Radeon RX 6400 | $250 | 7.1 | 3.5 | 4 | 53 |
| RTX A400 | $275 | 2.7 | 2.7 | 4 | 50 |
| T1000 | $280 | 5 | 2.5 | 4 | 50 |
| Arc Pro A50 | $325 | 9.6 | 4.8 | 6 | 75 |
| T1000 8GB | $350 | 5 | 2.5 | 8 | 50 |
| RTX A2000 | $500 | 8 | 8 | 6 | 70 |
| A2 Tensor Core | $550 | 4.5 | 4.5 | 16 | 60 |
| Arc Pro B50 | $600 | 21 | 10 | 16 | 70 |
| RTX A1000 | $600 | 7 | 7 | 8 | 50 |
| RTX A2000 12GB | $800 | 8 | 8 | 12 | 70 |
| Tesla T4 | $950 | 65 | 8 | 16 | 70 |
Recommendations
(we’ll talk about the Tesla P4 later on…)
The sweet spot for performance and compatibility (since, unfortunately, CUDA Rules Everything Around Me) is the GTX1650, though its fans are quite loud. Be careful to get a genuinely low-profile version and not simply a small-form-factor one: the card also comes in an “Aero-ITX” variant which is unlikely to fit in a low-profile case.
The GT1030 is still the best fanless GPU after all these years, and can actually be coaxed into running some fairly useful models, but that’s about all it has going for it. Also, its heatsink is massive so it won’t fit in really tiny PCs. So, if you don’t mind a bit of noise, the P620 and P1000 are a great deal.
In theory the Arc A310 should be an excellent choice, but software support for it is quite uncertain. And while the RX640 is a GCN4 architecture card and is thus in theory supported by llama.cpp, it is extremely unclear exactly which version of ROCm you need to install to make this work. But ROCm is open source, so in theory… you could do this (I haven’t yet, stay tuned).
In case you’re worried about the absolutely horrible FP16 performance of some cards in the table… don’t worry about it unless you want to do some form of fine-tuning, which you probably don’t have enough VRAM to do anyway. You’re going to have to use quantized models, so the actual computation will get done in single precision.
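If you want a back-of-envelope check before buying, a quantized model’s weights take roughly parameter count × bits per weight ÷ 8 bytes of VRAM, plus overhead for the KV cache and compute buffers. A quick sketch (the 3B figure is just a hypothetical example, not any particular model):

```
# Rough VRAM needed for model weights alone (ignores KV cache and
# compute buffers, which add several hundred MiB on top).
params=3000000000   # a hypothetical 3B-parameter model
bits=8              # Q8_0 is roughly 8 bits per weight
echo "$(( params * bits / 8 / 1024 / 1024 )) MiB"
# → 2861 MiB: not going to fit on a 2 GiB GT 1030
```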
Software setup
Our test case here is simple: extract a table as HTML (no, not Markdown, because shut up, clanker, that’s why) using dots.mocr. We’ll use the weakest card in the bunch above, the GT1030, just because we can.
First we’ll install a few gigabytes of CUDA nonsense. This will require you to add the evil incantation `contrib non-free` to the end of all the `deb` lines in `/etc/apt/sources.list`. Now you can:

```
sudo apt update
sudo apt install linux-headers-amd64 nvidia-driver nvidia-cuda-toolkit \
    cmake ninja-build
```
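If you’d rather not hand-edit the file, a `sed` one-liner can do it. A sketch, assuming the classic one-line `sources.list` format (not the newer deb822 `.sources` files), demonstrated on a scratch copy:

```
# A scratch copy standing in for /etc/apt/sources.list; once the output
# looks right, run the same sed with sudo against the real file.
cat > /tmp/sources.list <<'EOF'
deb http://deb.debian.org/debian trixie main
deb http://deb.debian.org/debian-security trixie-security main
EOF
# Append the components to every deb line that doesn't already have them:
sed -i -E '/^deb /{/non-free/!s/$/ contrib non-free/}' /tmp/sources.list
cat /tmp/sources.list
```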
You’ll probably have to reboot. Make sure your card is detected:
```
$ nvidia-smi
Tue Apr 21 14:19:24 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GT 1030         Off |   00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8              N/A /  30W |       2MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```
Success! Next, we will build a copy of llama.cpp (maybe a pre-compiled one will work, but it probably requires Just Exactly The Right Version Of Ten Thousand Libraries, so we won’t even try).
In the past, you had to convince llama.cpp that you really wanted to
use some ancient GPU architecture from 2016, but recent versions of
llama.cpp are able to detect your GPU capability, so
-DCMAKE_CUDA_ARCHITECTURES is no longer necessary if you build on the
same machine you’ll be running on. You can simply:
```
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -S. -Bbuild -GNinja -DGGML_CUDA=ON
cmake --build build
```
This might take a while if your computer is old like mine.
Note that on the GTX 1650, you may want to follow the helpful suggestions that llama.cpp prints when you run it, and instead configure your build with:

```
cmake -DGGML_CUDA=ON -S. -Bbuild -GNinja \
    '-DCMAKE_CUDA_ARCHITECTURES=61-virtual;80-virtual' \
    -DGGML_CUDA_FORCE_MMQ=ON
```
Running a model!
Since HuggingFace “acquired” (meaning “acquihired”, I guess) llama.cpp, it now works super well with their model repositories. Let’s try it out! First, we’ll get ourselves a table of some sort. Because I’ll be using it in another blog post shortly, we can try the permitting statistics for the last year of my term as a town councillor, which I obtained with a freedom of information request (and you can too!).
While llama.cpp now has a fancy web interface that can use pdf.js to convert PDFs to images, we’ll do this the old fashioned way from the command line. First, convert the pages to PNGs:
```
pdftocairo -png -r 100 analyse_permis.pdf analyse_permis
```
Now we can simply run `llama-cli` with an appropriately quantized model that fits on our GPU…
No! We can’t do that. The reason is that by default llama.cpp will try to put not only the model, but also the input and output data buffers, on the GPU. So even if the model fits just fine, it will most likely segfault, because llama.cpp is written in C++, so segmentation faults are just its normal form of error handling.
This is very annoying! Luckily there’s a way around it, which is to use the top-secret `GGML_CUDA_ENABLE_UNIFIED_MEMORY` environment variable, which basically tells llama.cpp to load and unload things automatically between CPU and GPU memory:

```
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
```
You may think that this is going to be slower than explicitly loading everything onto the GPU, but in reality it isn’t, unless your model doesn’t fit on the GPU. You’ll know if that happens 😉
Wow! Look at it go:
```
$ ./build/bin/llama-cli -hf lodrick-the-lafted/dots.mocr-gguf:Q8_0 \
    --image analyse_permis-1.png -st -p "Convert table to HTML"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 1992 MiB):
  Device 0: NVIDIA GeForce GT 1030, compute capability 6.1, VMM: yes, VRAM: 1992 MiB
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build      : b1-98d2d28
model      : lodrick-the-lafted/dots.mocr-gguf:Q8_0
modalities : text, vision

available commands:
  /exit or Ctrl+C    stop or exit
  /regen             regenerate the last response
  /clear             clear the chat history
  /read <file>       add a text file
  /glob <pattern>    add text files using globbing pattern
  /image <file>      add an image file

Loaded media from 'analyse_permis-1.png'

> Convert table to HTML
<html><body><table><thead><tr><td rowspan="2"></td><td colspan="2">Logements ...

[ Prompt: 39,3 t/s | Generation: 18,1 t/s ]
```
But is the output any good? Amazingly enough, llama-cli cannot
simply write the model’s output to a file or standard output without
printing its stupid logo everywhere (yes, really), so again, if you
don’t want to chat with your imaginary machine god but actually want
to write code that gets the computer to do useful work, you have to
either:
- copy and paste from the terminal
- run `llama-server` instead and talk to it over HTTP
- use the llama.cpp Python wrapper (which is not super well maintained)
For simplicity’s sake here, we’ll just copy and paste, and you can verify that indeed, it has quite faithfully rendered the table structure and content!
A quick note about llama-server
You should know that if you use llama-server, it has various default settings which are exceedingly suboptimal for small GPUs, and will probably cause it to segfault or loop endlessly after a couple of prompts. Ollama probably works better, if you like installing software with `curl | sudo bash` (do not do this). Basically, you just need to disable any kind of multi-user or caching capability, which you can do with the options `--parallel 0 --cache-ram 0 --no-cache-prompt`.
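Putting the pieces together, here is a sketch of what talking to llama-server over HTTP might look like. The launch line just combines the flags discussed in this post; the request targets llama-server’s OpenAI-compatible chat endpoint (default port 8080), with the page image inlined as a base64 data URL:

```
# In one terminal (flags and model as discussed above):
#   GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server \
#     -hf lodrick-the-lafted/dots.mocr-gguf:Q8_0 \
#     --parallel 0 --cache-ram 0 --no-cache-prompt
#
# Build the request body. (Placeholder bytes stand in for the real
# analyse_permis-1.png so this snippet runs anywhere.)
printf 'placeholder' > analyse_permis-1.png
B64=$(base64 -w0 analyse_permis-1.png)
cat > request.json <<EOF
{"messages": [{"role": "user", "content": [
  {"type": "text", "text": "Convert table to HTML"},
  {"type": "image_url", "image_url": {"url": "data:image/png;base64,$B64"}}]}]}
EOF
# Then POST it and pull out just the HTML, logo-free:
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d @request.json \
#   | jq -r '.choices[0].message.content' > table.html
```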
A full example of how to use it will show up in the near future on this blog, but in French so, I guess, you might want to take a language course.
What about the NVidia Tesla P4?
But you say, what about Tesla P4s, which are spectacularly cheap and plentiful? Yes, I tried this, so now you don’t have to. There are a few important things to know about this card:
- It is “fanless” but relies on active cooling from outside. It will quickly heat up to 91 degrees Celsius and either:
  - throttle its performance down to nothing
  - burn a hole in your computer
  - all of the above
  People have tried various things to cool it, with varying degrees of success. Some of these things are even for sale on eBay! Many sellers will even sell you a P4 with a fan installed! Crucially, however, all these cooling fans take up extra space in your case, so they may not fit in your average $30 small-form-factor corporate surplus PC. Make sure to measure!
- It requires certain PCIe and/or BIOS features that older computers (1st-generation Intel Core processors) don’t support. If your computer doesn’t allow you to use the onboard video when a discrete GPU is plugged in, the card likely won’t be recognized at all, and there is nothing you can do about this. Notably, it does not work in a Compaq Elite 8100, but it does work in a Lenovo ThinkCentre M82, and you can also fit a cooler in there if you remove the hard drive cage.
- NVidia (and thus the Debian `nvidia-detect` utility) claims it is only supported by the 470-series drivers, but this is actually a ~~lie~~ mistake in the documentation, as it works absolutely fine with the 550 series on Debian 13. It’s a Pascal-architecture GPU with compute capability 6.1, just like the P1000, but twice as fast and with twice as much memory (if you can cool it).
So, while it may seem too good to be true, it may actually work well for you with some fiddling. What I found in practice is that, if you can keep it under 80C, the P4 will run quite a bit faster than the GTX1650 (75 tokens per second with dots.mocr), so it’s definitely worth the trouble.
The same caveats for the P4 apply also to the A2 and T4, which are similarly passive-cooled cards, but also exceedingly expensive, so, in summary, life is a symphony of contrasts. Note that the same cooling fans that work for the P4 generally also work for the T4. But you should just buy an Intel Arc Pro B50 instead if you’re going to spend that much.
