installing cuda

This site gives a good explanation, though it is for version 12 :

https://www.server-world.info/en/note?os=Debian_13&p=nvidia&f=4

However, if we want the latest version (13), download it here :

https://developer.nvidia.com/cuda-downloads

cuda 13

Here you can choose which version you want, and after that the page will display what to do :

wget https://developer.download.nvidia.com/compute/cuda/13.1.0/local_installers/cuda-repo-debian13-13-1-local_13.1.0-590.44.01-1_amd64.deb
sudo dpkg -i cuda-repo-debian13-13-1-local_13.1.0-590.44.01-1_amd64.deb
sudo cp /var/cuda-repo-debian13-13-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-1

If you installed nvcc 12 already, uninstall it with

sudo apt-get remove -y nvidia-cuda-toolkit

And add the new path (usually /usr/local/cuda/...) to your PATH.

export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:${PATH}

Now it should report :

bash # new shell
nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Nov__7_07:23:37_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.80
Build cuda_13.1.r13.1/compiler.36836380_0

cuda 12

Note that CUDA 12 (12.4) is the standard version for trixie. It installs fine, but we get errors like this one :

nvcc fatal   : Unknown option '-static-global-template-stub=false'

And indeed that option seems to be supported only from CUDA 12.8 upwards. Go back to cuda 13 to avoid this !!!
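A build script can guard against this by checking the nvcc release before compiling. The sketch below is only an illustration: the helper names nvcc_release and version_ge are made up for this example, and the 12.8 threshold is the one observed above, not a documented limit.

```shell
# Sketch: refuse to build when nvcc is older than 12.8.
# nvcc_release and version_ge are hypothetical helper names.

# Extract "MAJOR.MINOR" from `nvcc --version` output read on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed when version $1 >= version $2 (both in "MAJOR.MINOR" form).
version_ge() {
    _a_major=${1%%.*}; _a_minor=${1#*.}
    _b_major=${2%%.*}; _b_minor=${2#*.}
    if [ "$_a_major" -ne "$_b_major" ]; then
        [ "$_a_major" -gt "$_b_major" ]
    else
        [ "$_a_minor" -ge "$_b_minor" ]
    fi
}

if command -v nvcc >/dev/null 2>&1; then
    found=$(nvcc --version | nvcc_release)
    if [ -n "$found" ] && version_ge "$found" 12.8; then
        echo "nvcc $found: ok"
    else
        echo "nvcc $found: too old, need 12.8 or newer"
    fi
else
    echo "nvcc not found on PATH"
fi
```

The comparison is numeric per component, so a release like 12.10 would correctly count as newer than 12.8.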

Otherwise follow the server-world guide linked at the top of this section to install 12.

It boils down to this. To install the driver (https://www.server-world.info/en/note?os=Debian_13&p=nvidia&f=1), create a file called :

vi /etc/apt/sources.list.d/nvidia.list

Containing this :

deb http://deb.debian.org/debian/ trixie non-free-firmware  contrib non-free
deb http://security.debian.org/debian-security trixie-security non-free-firmware  contrib non-free
deb http://deb.debian.org/debian/ trixie-updates non-free-firmware  contrib non-free
deb http://deb.debian.org/debian/ trixie-backports non-free-firmware  contrib non-free

Then install with

apt update
apt -y install nvidia-driver firmware-misc-nonfree linux-headers-$(uname -r) dkms

This installs the proprietary NVIDIA driver. The open source driver (nouveau) is blacklisted by default.

rm /etc/modprobe.d/nvidia-blacklist-nouveau.conf
update-initramfs -u

reboot # we have a new driver installed

Now test if it is recognized :

nvidia-smi
Sat Jan  3 10:11:26 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   42C    P0             17W /  115W |       9MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1445      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

Now install CUDA :

sudo apt -y install nvidia-cuda-toolkit nvidia-cuda-dev git cmake
nvcc --version

Now it is finally found:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Now test with a sample program

git clone https://github.com/NVIDIA/cuda-samples.git
cd ./cuda-samples/Samples/1_Utilities/deviceQuery
cmake .
make
./deviceQuery

This should print out something like

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4070 Laptop GPU"
  CUDA Driver Version / Runtime Version          12.4 / 12.4
  CUDA Capability Major/Minor version number:    8.9
...

Meaning : yes, my laptop also has a CUDA card, which I was still in doubt about.

Note that on a VM or a machine without an NVIDIA card this last test won't work, but we should still be able to install nvcc and compile the code.

We need to detect the card from within the worker process; see if a good test can be derived from this deviceQuery example.
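Until the worker has its own check, a wrapper script could test for a card before starting it. This is a rough sketch, not the worker's actual logic: count_gpus and have_gpu are hypothetical names, and it relies on nvidia-smi -L printing one "GPU n: ..." line per device. Inside the worker itself the equivalent would be the CUDA runtime call cudaGetDeviceCount, which is what deviceQuery uses.

```shell
# Sketch: detect an NVIDIA GPU from a wrapper script.
# count_gpus and have_gpu are hypothetical helper names.

# Count the "GPU n: ..." lines in `nvidia-smi -L` output read on stdin.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Succeed when nvidia-smi exists and lists at least one device.
have_gpu() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    [ "$(nvidia-smi -L 2>/dev/null | count_gpus)" -gt 0 ]
}

if have_gpu; then
    echo "GPU detected"
else
    echo "no GPU detected"
fi
```

On a VM without a card this takes the second branch, so the worker can be skipped or started in a CPU-only mode.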

Now we still need cuopt to run, and the worker code to compile :

cuopt

Note that for now version 26.02 of cuopt cannot be used on Debian, though it looks like a nice option for the future (multi-depot ..) : https://forums.developer.nvidia.com/t/its-here-nvidia-cuopt-22-06-download-now/218062

rmm@26.02 causes all kinds of problems, so revert to version 25.12 :

git clone https://github.com/NVIDIA/cuopt
cd cuopt
git checkout v25.12.00

Or use a tarball/zip from :

wget https://github.com/NVIDIA/cuopt/archive/refs/tags/v25.12.00.zip

The conda installer seems to run perfectly, so I presume it takes care of all dependencies.

Go to miniforge : https://conda-forge.org/download/ , download the installer and run it. It installs in ~/Install, so add this to ~/.bashrc :

export CONDA_HOME=/home/kees/miniforge3/condabin
export PATH=${CONDA_HOME}:${PATH}

Start a new shell and run the command from the README.md file :

previous attempt

You might still need to install cudss by hand (check that ?)

https://developer.nvidia.com/cudss-downloads?target_os=Linux&target_arch=x86_64&Distribution=Agnostic&cuda_version=13

cd ~/Install
wget https://developer.download.nvidia.com/compute/cudss/redist/libcudss/linux-x86_64/libcudss-linux-x86_64-0.7.1.4_cuda13-archive.tar.xz
tar xvf libcudss-linux-x86_64-0.7.1.4_cuda13-archive.tar.xz

Then add it to your path in ~/.bashrc

export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH

export CUDSS_DIR=/home/kees/Install/libcudss-linux-x86_64-0.7.1.4_cuda13-archive
export LD_LIBRARY_PATH=${CUDSS_DIR}/lib:${LD_LIBRARY_PATH}
export cudss_DIR=${CUDSS_DIR}/lib/cmake/cudss
export CMAKE_PREFIX_PATH=${CUDSS_DIR}

git clone https://github.com/NVIDIA/cuopt
cd cuopt/cpp
cmake .
make

This will take a VERY LONG TIME, at least on a laptop (64G memory).

Then build the solver code :

git clone git@gitlab.com:klopt/solver
cd solver
make prepare-trixie
make

Note that these steps probably installed libboost 1.74 instead of 1.81. Sadly that makes osrm-backend fail, because it needs 1.81. Just follow the section on reinstalling boost 1.81 later in this chapter; cuda will keep working !
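To see which boost headers are actually present, the version can be read from boost/version.hpp. A small sketch, assuming the headers live in /usr/include (Debian package) or /usr/local/include (default prefix of a source install); boost_version is a made-up helper name.

```shell
# Sketch: report which Boost headers are installed.
# boost_version is a hypothetical helper name.

# Print "MAJOR.MINOR" from a boost/version.hpp read on stdin.
boost_version() {
    sed -n 's/^#define BOOST_LIB_VERSION "\([0-9]*\)_\([0-9]*\).*/\1.\2/p' | head -n 1
}

# Assumed locations: Debian package first, then a source install.
for hdr in /usr/include/boost/version.hpp /usr/local/include/boost/version.hpp; do
    if [ -f "$hdr" ]; then
        echo "$hdr: boost $(boost_version < "$hdr")"
    fi
done
```

If both locations report a version, the compiler's include order decides which one a build picks up, so it is worth removing the one that is not wanted.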

TODO : see if you can install cuda with boost 1.81 right away. It does mean uninstalling first. After that try one of the other installers (deb(network) or runfile(local))

compiling code

Also from that site.

git clone https://github.com/zchee/cuda-sample.git
cd cuda-sample/1_Utilities/deviceQuery
make

nvcc seems to reject the g++ compiler, but it reports an alternative :

/usr/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o deviceQuery.o -c deviceQuery.cpp
ERROR: No supported gcc/g++ host compiler found, but clang-14 is available.
       Use 'nvcc -ccbin clang-14' to use that instead.
make: *** [Makefile:229: deviceQuery.o] Error 1

So edit the makefile in three places :

...

CUDA_PATH?=/usr
...

SMS ?= 70

...

HOST_COMPILER ?= clang++

Now it compiles and you can view the card info

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 3060"
  CUDA Driver Version / Runtime Version          12.3 / 11.8
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 11834 MBytes (12408979456 bytes)
MapSMtoCores for SM 8.6 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.6 is undefined.  Default to use 128 Cores/SM
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1792 MHz (1.79 GHz)
  Memory Clock rate:                             7501 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 2359296 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 7 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 11.8, NumDevs = 1, Device0 = NVIDIA GeForce RTX 3060
Result = PASS

gcc -fcommon

This really is a strange bug, encountered while compiling code that had compiled perfectly before. With the latest compiler (gcc-12) we get lots of multiple definition errors.

/usr/bin/ld: OBJ/Penalty_M1_PDTSP.o:/home/kees/projects/cuda/NeuroLKH-main/SRC/INCLUDE/LKH.h:360: multiple definition of `SavedD'; OBJ/Activate.o:/home/kees/projects/cuda/NeuroLKH-main/SRC/INCLUDE/LKH.h:360: first defined here

Installing an older compiler changed nothing, but then again we only had gcc-11 to go back to. This code was probably last compiled with gcc-9 or older.

It turns out that those older compilers had -fcommon as a default option, while newer versions (from GCC 10 on) default to -fno-common.

-fcommon
In C code, this option controls the placement of global variables defined without an initializer, known as tentative definitions in the C standard. Tentative definitions are distinct from declarations of a variable with the extern keyword, which do not allocate storage.

The default is -fno-common, which specifies that the compiler places uninitialized global variables in the BSS section of the object file. This inhibits the merging of tentative definitions by the linker so you get a multiple-definition error if the same variable is accidentally defined in more than one compilation unit.

The -fcommon places uninitialized global variables in a common block. This allows the linker to resolve all tentative definitions of the same variable in different compilation units to the same object, or to a non-tentative definition. This behavior is inconsistent with C++, and on many targets implies a speed and code size penalty on global variable references. It is mainly useful to enable legacy code to link without errors.

Note that the code for LKH-3.0.6 was just wrong.

int *SavedD;

    SavedD = malloc(Dimension * Dimension * sizeof(int));

            AddCandidate(From, To, SavedD[From->Id * Dimension - Dimension + To->Id - 1], Alpha);

The definition of SavedD is inside LKH.h, which is included in multiple source files. It should have been declared extern in LKH.h and defined in exactly one of the .c files that use it.
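The extern pattern can be tried out in isolation. The script below is only a sketch: it writes three throwaway files (lkh_demo.h, saved.c and main.c are invented names, not part of LKH) into a temporary directory and, if gcc is available, links them with -fno-common to show that the multiple definition error does not occur.

```shell
# Sketch: declare in the header, define in exactly one .c file.
# All file names below are hypothetical, not taken from LKH.
demo=$(mktemp -d)

# Header: declaration only, no storage is allocated here.
cat > "$demo/lkh_demo.h" <<'EOF'
extern int *SavedD;
EOF

# Exactly one .c file holds the definition.
cat > "$demo/saved.c" <<'EOF'
#include "lkh_demo.h"
int *SavedD;
EOF

# Any other file just includes the header and uses the variable.
cat > "$demo/main.c" <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include "lkh_demo.h"

int main(void) {
    SavedD = malloc(4 * sizeof(int));
    if (SavedD == NULL) return 1;
    SavedD[0] = 42;
    printf("%d\n", SavedD[0]);
    free(SavedD);
    return 0;
}
EOF

# Link with the modern default made explicit; skipped when gcc is missing.
if command -v gcc >/dev/null 2>&1; then
    gcc -fno-common -o "$demo/demo" "$demo/saved.c" "$demo/main.c" && "$demo/demo"
fi
```

Removing the extern keyword from the header brings the multiple definition error back under -fno-common, which is the situation LKH.h is in.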

solution

Instead of changing all the references, add -fcommon to the CFLAGS

CFLAGS = -fcommon -O3 -Wall -I$(IDIR) -D$(TREE_TYPE) -g

concorde python module

libboost-81 vs 74

The second problem is that when using cuda, Debian insists on reverting to libboost 1.74. This in turn makes osrm fail !!

To switch to version 1.81 for good (to make osrm work again) :

Remove boost completely

apt-get update
apt-get -y --purge remove libboost-all-dev libboost-doc libboost-dev
sudo rm -f /usr/lib/libboost_*

Now it is probably also possible to reinstall with :

cd /work/projects/debian/bookworm
make prepare

Because the osrm-build tries to install the correct boost libraries.

Or else install from tar file :

sudo apt-get -y install build-essential g++ python-dev-is-python3 autotools-dev libicu-dev libbz2-dev
wget http://downloads.sourceforge.net/project/boost/boost/1.81.0/boost_1_81_0.tar.gz
tar zxvf boost_1_81_0.tar.gz
cd boost_1_81_0
cpuCores=$(grep "cpu cores" /proc/cpuinfo | uniq | awk '{print $NF}')
echo "Available CPU cores: $cpuCores"
./bootstrap.sh  # this will generate ./b2
sudo ./b2 --with=all -j $cpuCores install

Note that this machine has 16 cores and 32 processors (hardware threads), so this uses 16 jobs and not 32. Would -j 32 be faster ?

No, it is not : -j 16 took 49 seconds, -j 32 took 1m2.
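The difference between cores and processors can also be read from lscpu instead of /proc/cpuinfo. A minimal sketch, assuming util-linux lscpu and coreutils nproc are installed; physical_cores is a made-up helper name. It counts the unique core,socket pairs, so it also works on machines with more than one socket.

```shell
# Sketch: physical cores versus logical processors.
# physical_cores is a hypothetical helper name.

# Count unique "core,socket" pairs in `lscpu -p=CORE,SOCKET` output on stdin.
physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

if command -v lscpu >/dev/null 2>&1; then
    echo "physical cores: $(lscpu -p=CORE,SOCKET | physical_cores)"
fi
if command -v nproc >/dev/null 2>&1; then
    echo "logical processors: $(nproc)"
fi
```

Given the timings above, the physical core count is the better value for -j on this machine.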

TSP GPU

We found 2 TSP implementations for GPU, one of them actually based on LKH !!

Read this intro :

visit

And note that the sources are in ~/projects/cuda/ (root directory)

TSP_GPU23 is probably the newest, so we compile that one with nvcc :

nvcc -O3 -arch=sm_35 -use_fast_math TSP_GPU23.cu -o TSP_GPU23
# probably gives you the unsupported gcc host compiler error, so this works better:
nvcc -ccbin clang -O3 -arch=sm_35 -use_fast_math TSP_GPU23.cu -o TSP_GPU23

NeuroLKH-main

These sources are in the subdirectory ~/projects/cuda/NeuroLKH-main.

However, these don't compile so easily. The multiple definition problem can be solved with the -fcommon flag (see above), but then we need concorde.

git clone https://github.com/jvkersch/pyconcorde
cd pyconcorde/
pip3 install -e . --break-system-packages

Seems to work ok?

cd ../NeuroLKH
python data_generate.py -test

The next command however requires torch. There seems to be a Debian package for it, but that fails with : "Torch not compiled with CUDA enabled".

But this site seems to have the answer : visit

Get the cuda version with :

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Then browse to the torch site : https://pytorch.org/get-started/locally/

And choose your versions in the "Start locally" chooser. My choices resulted in this command, which will take a while to finish :

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --break-system-packages
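The index URL in that command follows the CUDA release: 11.8 becomes cu118. The sketch below derives the URL from a release string and then checks whether the installed torch can see the card. torch_index_url is a made-up helper name, and not every CUDA release necessarily has a matching wheel index, so treat the chooser on the torch site as the authority.

```shell
# Sketch: derive the wheel index URL and verify the torch install.
# torch_index_url is a hypothetical helper name.

# Map a CUDA release such as "11.8" to an index URL of the form used above.
torch_index_url() {
    echo "https://download.pytorch.org/whl/cu$(echo "$1" | tr -d '.')"
}

torch_index_url 11.8

# Only runs when torch is importable; prints True when CUDA is usable.
if command -v python3 >/dev/null 2>&1 && python3 -c 'import torch' 2>/dev/null; then
    python3 -c 'import torch; print(torch.cuda.is_available())'
fi
```

If the last command prints False, torch is still a CPU-only build and the "Torch not compiled with CUDA enabled" error will come back.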

Next attempt

# now this works, pretraining step, ends up in the file test/100.pkl
python test.py --dataset test/100.pkl --model_path pretrained/neurolkh.pt --n_samples 1000 --lkh_trials 1000 --neurolkh_trials 1000