installing cuda
This site gives a good explanation, though it is for version 12 :
https://www.server-world.info/en/note?os=Debian_13&p=nvidia&f=4
However, if we want the latest version (13), download it here:
https://developer.nvidia.com/cuda-downloads
cuda 13
Here you can choose which version you want, and after that the page will display what to do :
wget https://developer.download.nvidia.com/compute/cuda/13.1.0/local_installers/cuda-repo-debian13-13-1-local_13.1.0-590.44.01-1_amd64.deb
sudo dpkg -i cuda-repo-debian13-13-1-local_13.1.0-590.44.01-1_amd64.deb
sudo cp /var/cuda-repo-debian13-13-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-1
If you installed nvcc 12 already, uninstall it with
And add the new path (usually /usr/local/cuda/...) to your PATH.
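A minimal sketch of the PATH step, suitable for ~/.bashrc. The helper name path_prepend and the directory /usr/local/cuda/bin are assumptions; check where your install actually put nvcc.

```shell
# Prepend a directory to PATH unless it is already listed there.
# path_prepend is a made-up helper name, not part of CUDA.
path_prepend() {
    case ":${PATH}:" in
        *":$1:"*) ;;                  # already present, leave PATH alone
        *) PATH="$1:${PATH}" ;;
    esac
}

# Assumed install location of the nvcc binary.
path_prepend /usr/local/cuda/bin
export PATH
```

The guard keeps PATH from growing each time a new shell sources the file.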
Now it should report :
bash # new shell
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Nov__7_07:23:37_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.80
Build cuda_13.1.r13.1/compiler.36836380_0
cuda 12
Note that this is the standard version for trixie. It installs fine, but we get errors like this one:
And indeed that seems to only work from 12.8 upwards. Go back to cuda 13 to avoid this !!!
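To check whether an installed toolkit meets that 12.8 minimum, something like the following sketch could be used. The helper names nvcc_release and version_ge are made up for this example, and the comparison relies on GNU sort -V.

```shell
# Extract "13.1" from the "release 13.1, V13.1.80" line of `nvcc --version`.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeeds when version $1 >= version $2 (GNU sort -V does the ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Only runs the check when nvcc is on the PATH.
if command -v nvcc >/dev/null 2>&1; then
    release=$(nvcc --version | nvcc_release)
    if version_ge "$release" 12.8; then
        echo "nvcc $release is new enough"
    else
        echo "nvcc $release is older than 12.8"
    fi
fi
```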
Otherwise follow this guide to install 12.
For installing the driver (https://www.server-world.info/en/note?os=Debian_13&p=nvidia&f=1) it boils down to this: make a file called :
Containing this :
deb http://deb.debian.org/debian/ trixie non-free-firmware contrib non-free
deb http://security.debian.org/debian-security trixie-security non-free-firmware contrib non-free
deb http://deb.debian.org/debian/ trixie-updates non-free-firmware contrib non-free
deb http://deb.debian.org/debian/ trixie-backports non-free-firmware contrib non-free
Then install with
Now the NVIDIA driver is installed, and the open source driver (nouveau) is blacklisted by default.
Now test if it is recognized :
nvidia-smi
Sat Jan 3 10:11:26 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   42C    P0             17W /  115W |       9MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1445      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
Now install CUDA :
Now it is finally found:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
Now test with a sample program
git clone https://github.com/NVIDIA/cuda-samples.git
cd ./cuda-samples/Samples/1_Utilities/deviceQuery
cmake .
make
./deviceQuery
This should print out something like
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 4070 Laptop GPU"
CUDA Driver Version / Runtime Version 12.4 / 12.4
CUDA Capability Major/Minor version number: 8.9
...
Meaning: yes, my laptop also has a CUDA-capable card, which I was still in doubt about.
Note that if you do this on a VM or a machine without an NVIDIA card, this last test won't work, but we should still be able to install nvcc and compile the code.
We need to detect the card from within the worker process; see if you can get a good test from this deviceQuery example.
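Until the worker has its own check, a shell wrapper could do a first test before starting it. This is only a sketch: the function name gpu_available is made up, and it assumes that nvidia-smi -L prints one "GPU n: ..." line per device.

```shell
# Succeeds when nvidia-smi exists and lists at least one GPU.
# gpu_available is a hypothetical helper, not part of the worker code.
gpu_available() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    nvidia-smi -L 2>/dev/null | grep -q '^GPU [0-9]'
}

if gpu_available; then
    echo "GPU detected"
else
    echo "no GPU detected, skipping GPU tests"
fi
```

On a VM without a card this takes the second branch, so nvcc compilation can still be tested there.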
Now we still need cuopt to run, and the worker code to compile :
cuopt
Note that for now you cannot use version 26.02 of cuopt on Debian, though it looks like a nice option for the future (multi-depot ..): https://forums.developer.nvidia.com/t/its-here-nvidia-cuopt-22-06-download-now/218062
You get all kinds of problems with rmm@26.02, so revert to version 25.12.
Or use a tarball/zip from :
The conda installer seems to run perfectly, so I presume it takes care of all dependencies :
Go to miniforge : https://conda-forge.org/download/ to download the installer, and run it. It installs in ~/Install, so add this to ~/.bashrc :
Then start a new shell and run the command in the README.md file :
previous attempt
You might still need to install cudss by hand (check that ?)
https://developer.nvidia.com/cudss-downloads?target_os=Linux&target_arch=x86_64&Distribution=Agnostic&cuda_version=13
cd ~/Install
wget https://developer.download.nvidia.com/compute/cudss/redist/libcudss/linux-x86_64/libcudss-linux-x86_64-0.7.1.4_cuda13-archive.tar.xz
tar xvf libcudss-linux-x86_64-0.7.1.4_cuda13-archive.tar.xz
Then add it to your path in ~/.bashrc
export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
export CUDSS_DIR=/home/kees/Install/libcudss-linux-x86_64-0.7.1.4_cuda13-archive
export LD_LIBRARY_PATH=${CUDSS_DIR}/lib:${LD_LIBRARY_PATH}
export cudss_DIR=${CUDSS_DIR}/lib/cmake/cudss
export CMAKE_PREFIX_PATH=${CUDSS_DIR}
This will take a VERY LONG TIME, at least on a laptop (64G memory).
Then build the solver code :
Note that during these steps libboost 1.74 was probably installed instead of 1.81. Sadly that makes osrm-backend fail, because it needs 1.81. You can just follow the boost 1.81 reinstallation section later in this chapter, and cuda will keep working!
TODO : see if you can install cuda with boost 1.81 right away. It does mean uninstalling first. After that try one of the other installers (deb(network) or runfile(local))
compiling code
Also from that site.
The g++ compiler seems to be discarded by nvcc, but it reports an alternative :
/usr/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o deviceQuery.o -c deviceQuery.cpp
ERROR: No supported gcc/g++ host compiler found, but clang-14 is available.
Use 'nvcc -ccbin clang-14' to use that instead.
make: *** [Makefile:229: deviceQuery.o] Error 1
So edit the makefile in three places :
Now it compiles and you can view the card info
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 3060"
CUDA Driver Version / Runtime Version 12.3 / 11.8
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 11834 MBytes (12408979456 bytes)
MapSMtoCores for SM 8.6 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 8.6 is undefined. Default to use 128 Cores/SM
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1792 MHz (1.79 GHz)
Memory Clock rate: 7501 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 2359296 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 7 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 11.8, NumDevs = 1, Device0 = NVIDIA GeForce RTX 3060
Result = PASS
gcc -fcommon
This really is a strange bug, encountered while compiling code that had compiled perfectly before. But with this latest compilation (gcc-12) we get lots of multiple definition errors.
/usr/bin/ld: OBJ/Penalty_M1_PDTSP.o:/home/kees/projects/cuda/NeuroLKH-main/SRC/INCLUDE/LKH.h:360: multiple definition of `SavedD'; OBJ/Activate.o:/home/kees/projects/cuda/NeuroLKH-main/SRC/INCLUDE/LKH.h:360: first defined here
Installing an older compiler changed nothing, but then again we only had gcc-11 to go back to. This code was probably last compiled with gcc-9 or older.
It turns out that those older compilers had -fcommon as the default, while newer versions (gcc-10 and up) default to -fno-common.
-fcommon
In C code, this option controls the placement of global variables defined without an initializer, known as tentative definitions in the C standard. Tentative definitions are distinct from declarations of a variable with the extern keyword, which do not allocate storage.
The default is -fno-common, which specifies that the compiler places uninitialized global variables in the BSS section of the object file. This inhibits the merging of tentative definitions by the linker so you get a multiple-definition error if the same variable is accidentally defined in more than one compilation unit.
The -fcommon places uninitialized global variables in a common block. This allows the linker to resolve all tentative definitions of the same variable in different compilation units to the same object, or to a non-tentative definition. This behavior is inconsistent with C++, and on many targets implies a speed and code size penalty on global variable references. It is mainly useful to enable legacy code to link without errors.
Note that the code for LKH-3.0.6 was just wrong.
The definition of SavedD is inside LKH.h, which is included in multiple source files. It should have been declared extern in LKH.h and defined in exactly one of the .c files that use it.
solution
Instead of changing all the references, add -fcommon to the CFLAGS
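The difference between the two flags, and the extern fix described above, can be reproduced in a scratch directory with a sketch like this. All file names and the function name fcommon_demo are made up; -fno-common is passed explicitly so the result does not depend on the compiler's default.

```shell
# Rebuild the LKH.h situation in miniature; runs in a subshell so the
# caller's working directory is untouched.
fcommon_demo() (
    demo=$(mktemp -d) || exit 1
    cd "$demo" || exit 1

    # A tentative definition in a header that two source files include.
    printf 'int SavedD;\n' > bad.h
    printf '#include "bad.h"\nint main(void) { return SavedD; }\n' > a.c
    printf '#include "bad.h"\nint use_saved(void) { return SavedD; }\n' > b.c

    # -fno-common: each object file gets its own SavedD, so the link fails.
    gcc -fno-common a.c b.c -o no_common 2>/dev/null \
        || echo "no-common: multiple definition"

    # -fcommon: the linker merges the tentative definitions.
    gcc -fcommon a.c b.c -o common && echo "fcommon: links"

    # Proper fix: extern in the header, one real definition in a .c file.
    printf 'extern int SavedD;\n' > bad.h
    printf 'int SavedD;\n' > def.c
    gcc -fno-common a.c b.c def.c -o fixed && echo "extern: links"

    cd / && rm -rf "$demo"
)

# Skip the demo on machines without gcc.
if command -v gcc >/dev/null 2>&1; then
    fcommon_demo
fi
```

For a large code base like LKH the flag is the cheaper route; the extern variant is what the header should have looked like.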
concorde python module
libboost-81 vs 74
The second problem is that when using cuda, Debian insists on reverting to libboost 1.74. This in turn makes osrm fail!
To definitively switch to version 1.81 (to make osrm work again):
Remove boost completely
sudo apt-get update
sudo apt-get -y --purge remove libboost-all-dev libboost-doc libboost-dev
sudo rm -f /usr/lib/libboost_*
Now it is probably also possible to reinstall with :
Because the osrm-build tries to install the correct boost libraries.
Or else install from tar file :
sudo apt-get -y install build-essential g++ python-dev-is-python3 autotools-dev libicu-dev libbz2-dev
wget http://downloads.sourceforge.net/project/boost/boost/1.81.0/boost_1_81_0.tar.gz
tar zxvf boost_1_81_0.tar.gz
cd boost_1_81_0
cpuCores=`cat /proc/cpuinfo | grep "cpu cores" | uniq | awk '{print $NF}'`
echo "Available CPU cores: $cpuCores"
./bootstrap.sh # this will generate ./b2
sudo ./b2 --with=all -j $cpuCores install
Note that we have 16 cores and 32 processors (hardware threads). This way the build uses 16 jobs and not 32. Would -j 32 be faster?
No, it is not: -j 16 took 49 seconds, -j 32 took 1m2s.
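As an alternative to parsing /proc/cpuinfo, the two counts can be read with nproc and lscpu. This is a sketch that assumes both tools are installed (coreutils and util-linux) and falls back to safe values when they are not.

```shell
# Logical processors (hardware threads included), as reported by nproc.
logical=$(nproc 2>/dev/null || echo 1)

# Physical cores: count the unique (core, socket) pairs that lscpu prints.
physical=$(lscpu -p=CORE,SOCKET 2>/dev/null | grep -v '^#' | sort -u | wc -l)

# Fall back to the logical count when lscpu is missing or prints nothing.
[ "$physical" -ge 1 ] || physical=$logical

echo "logical processors: $logical, physical cores: $physical"
echo "suggested build command: sudo ./b2 --with=all -j $physical install"
```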
TSP GPU
We found 2 implementations of TSP for GPU, one actually based on LKH!
Read this intro :
And note that the sources are in ~/projects/cuda/ (root directory)
Version 23 is probably the newest, so we should compile that code with nvcc:
nvcc -O3 -arch=sm_35 -use_fast_math TSP_GPU23.cu -o TSP_GPU23
# probably gives you the gcc warning so this works better:
nvcc -ccbin clang -O3 -arch=sm_35 -use_fast_math TSP_GPU23.cu -o TSP_GPU23
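Note that sm_35 is a Kepler target, and Kepler support was removed in CUDA 12, so the -arch=sm_35 commands above only apply to an older toolkit such as 11.8. One option is to derive the flag from the compute capability that deviceQuery reports (8.6 for the RTX 3060, 8.9 for the RTX 4070 Laptop GPU). The helper name cc_to_arch is made up for this sketch, which only prints the command.

```shell
# Turn a compute capability such as "8.9" into the nvcc value "sm_89".
# cc_to_arch is a hypothetical helper, not an nvcc feature.
cc_to_arch() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# 8.6 is the value deviceQuery printed for the RTX 3060.
arch=$(cc_to_arch 8.6)
echo "nvcc -O3 -arch=$arch -use_fast_math TSP_GPU23.cu -o TSP_GPU23"
```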
NeuroLKH-main
These sources are in the subdirectory ~/projects/cuda/NeuroLKH-main.
However, these don't compile so easily. The multiple definition problem can be solved with the -fcommon flag (see above), but then we need concorde.
git clone https://github.com/jvkersch/pyconcorde
cd pyconcorde/
pip3 install -e . --break-system-packages
Seems to work ok?
The next command however requires torch, which seems to be available as a Debian package, but that one fails with: "Torch not compiled with CUDA enabled".
But this site seems to have the answer : visit
Get the CUDA version with
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Then browse to the torch site : https://pytorch.org/get-started/locally/
And choose your versions in the "Start locally" chooser. My choices resulted in this command, which will take a while to finish :
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --break-system-packages
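After the install finishes, a quick check like the following sketch shows whether this torch build can see the GPU. The function name torch_cuda_check is made up; torch.cuda.is_available() is the regular PyTorch call.

```shell
# Print the torch version and whether it can use CUDA, or a hint when
# torch cannot be imported at all.
torch_cuda_check() {
    if python3 -c 'import torch' 2>/dev/null; then
        python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
    else
        echo "torch is not importable from python3"
    fi
}

torch_cuda_check
```

If the second value printed is False, the "Torch not compiled with CUDA enabled" error is likely to come back.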
Next attempt