installing cuda

The nvidia site gets you stuck, but it probably points correctly to the packages and repositories. Fix that, and if you still get errors go on to this site :

https://www.server-world.info/en/note?os=Debian_12&p=nvidia&f=4

It boils down to :

su
apt -y install nvidia-cuda-toolkit nvidia-cuda-dev
nvcc --version

Now nvcc is finally found :

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Note that these steps probably installed libboost 1.74 instead of 1.81. Sadly that makes osrm-backend fail, because it needs 1.81. You can just follow the boost 1.81 reinstallation section later in this chapter and cuda will just keep working !
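To see which boost version actually got installed, a quick check (assuming the Debian packaging; the libboost-dev package name is a guess for your setup) :

```shell
# Show the installed libboost development package version, if any.
# Falls back to a message when the package is not present.
dpkg -s libboost-dev 2>/dev/null | grep '^Version' \
  || echo "libboost-dev not installed"
```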

TODO : see if you can install cuda with boost 1.81 right away. That does mean uninstalling first. After that, try one of the other installers (deb (network) or runfile (local)).

compiling code

Also from that site.

git clone https://github.com/zchee/cuda-sample.git
cd cuda-sample/1_Utilities/deviceQuery
make

The g++ compiler is rejected by nvcc, but the error message suggests an alternative :

/usr/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o deviceQuery.o -c deviceQuery.cpp
ERROR: No supported gcc/g++ host compiler found, but clang-14 is available.
       Use 'nvcc -ccbin clang-14' to use that instead.
make: *** [Makefile:229: deviceQuery.o] Error 1

So edit the Makefile in three places :

...

CUDA_PATH?=/usr
...

SMS ?= 70

...

HOST_COMPILER ?= clang++ 
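Those three edits can also be scripted. A sketch with sed, assuming the variables appear as assignments at the start of a line in the stock sample Makefile (a backup is kept as Makefile.bak) :

```shell
# Patch the deviceQuery Makefile in place: point CUDA_PATH at the Debian
# install prefix, pin the SM architecture, and switch the host compiler.
sed -i.bak \
  -e 's|^CUDA_PATH.*|CUDA_PATH ?= /usr|' \
  -e 's|^SMS.*|SMS ?= 70|' \
  -e 's|^HOST_COMPILER.*|HOST_COMPILER ?= clang++|' \
  Makefile
```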

Now it compiles and you can view the card info :

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 3060"
  CUDA Driver Version / Runtime Version          12.3 / 11.8
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 11834 MBytes (12408979456 bytes)
MapSMtoCores for SM 8.6 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.6 is undefined.  Default to use 128 Cores/SM
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1792 MHz (1.79 GHz)
  Memory Clock rate:                             7501 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 2359296 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 7 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 11.8, NumDevs = 1, Device0 = NVIDIA GeForce RTX 3060
Result = PASS

gcc -fcommon

This really is a strange bug, encountered while compiling code that had compiled perfectly before. But with this latest compilation (gcc-12) we get lots of multiple definitions.

/usr/bin/ld: OBJ/Penalty_M1_PDTSP.o:/home/kees/projects/cuda/NeuroLKH-main/SRC/INCLUDE/LKH.h:360: multiple definition of `SavedD'; OBJ/Activate.o:/home/kees/projects/cuda/NeuroLKH-main/SRC/INCLUDE/LKH.h:360: first defined here

Installing an older compiler changed nothing, but then again we only had gcc-11 to go back to. This code was probably compiled with gcc-10 or gcc-9.

It turns out that those older compilers had -fcommon set by default, while newer versions default to -fno-common. From the gcc documentation :

-fcommon
In C code, this option controls the placement of global variables defined without an initializer, known as tentative definitions in the C standard. Tentative definitions are distinct from declarations of a variable with the extern keyword, which do not allocate storage.

The default is -fno-common, which specifies that the compiler places uninitialized global variables in the BSS section of the object file. This inhibits the merging of tentative definitions by the linker so you get a multiple-definition error if the same variable is accidentally defined in more than one compilation unit.

The -fcommon places uninitialized global variables in a common block. This allows the linker to resolve all tentative definitions of the same variable in different compilation units to the same object, or to a non-tentative definition. This behavior is inconsistent with C++, and on many targets implies a speed and code size penalty on global variable references. It is mainly useful to enable legacy code to link without errors.

Note that the code for LKH-3.0.6 was just wrong.

int *SavedD;
    SavedD = malloc(Dimension * Dimension * sizeof(int));
            AddCandidate(From, To, SavedD[From->Id * Dimension - Dimension + To->Id - 1], Alpha);

The definition of SavedD is inside LKH.h, which is included in multiple source files. It should have been declared extern in LKH.h and defined in exactly one of the .c files that use it.

solution

Instead of changing all the references, add -fcommon to the CFLAGS :

CFLAGS = -fcommon -O3 -Wall -I$(IDIR) -D$(TREE_TYPE) -g

concorde python module

libboost 1.81 vs 1.74

The second problem : when installing cuda, debian insists on reverting to libboost 1.74. This in turn makes osrm fail !

To definitively switch to version 1.81 (and make osrm work again) :

Remove boost completely

apt-get update
apt-get -y --purge remove libboost-all-dev libboost-doc libboost-dev
sudo rm -f /usr/lib/libboost_*
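To verify the libraries are really gone, you can ask the dynamic linker cache (assuming ldconfig is in your path as root) :

```shell
# List any boost shared libraries the dynamic linker still knows about.
ldconfig -p | grep -i boost || echo "no boost libraries found"
```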

Now it is probably also possible to reinstall with :

cd /work/projects/debian/bookworm
make prepare

Because the osrm-build tries to install the correct boost libraries.

Or else install from tar file :

sudo apt-get -y install build-essential g++ python-dev-is-python3 autotools-dev libicu-dev libbz2-dev
wget http://downloads.sourceforge.net/project/boost/boost/1.81.0/boost_1_81_0.tar.gz
tar zxvf boost_1_81_0.tar.gz
cd boost_1_81_0
cpuCores=`cat /proc/cpuinfo | grep "cpu cores" | uniq | awk '{print $NF}'`
echo "Available CPU cores: "$cpuCores
./bootstrap.sh  # this will generate ./b2
sudo ./b2 --with=all -j $cpuCores install

Note that we have 16 cores and 32 logical processors, so this uses 16 jobs and not 32. Try -j 32 to see if that is faster ?

No it is not : -j 16 took 49 seconds, -j 32 took 1m02.
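Afterwards you can confirm which boost ended up on the system by reading the version header (the source build above installs under /usr/local; both paths below are assumptions for your layout) :

```shell
# BOOST_LIB_VERSION is e.g. "1_81" for boost 1.81.
grep 'define BOOST_LIB_VERSION' /usr/local/include/boost/version.hpp 2>/dev/null \
  || grep 'define BOOST_LIB_VERSION' /usr/include/boost/version.hpp 2>/dev/null \
  || echo "boost headers not found"
```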

TSP GPU

We found 2 GPU implementations of TSP, one actually based on LKH !

Read this intro :

visit

And note that the sources are in ~/projects/cuda/ (root directory)

TSP_GPU23 is probably the newest, so we compile that with nvcc. Note that deviceQuery reported compute capability 8.6, so -arch=sm_86 may be a better match for the RTX 3060 than sm_35 :

nvcc -O3 -arch=sm_35 -use_fast_math TSP_GPU23.cu -o TSP_GPU23
# probably gives you the gcc warning so this works better:
nvcc -ccbin clang -O3 -arch=sm_35 -use_fast_math TSP_GPU23.cu -o TSP_GPU23

NeuroLKH-main

These sources are in the subdirectory ~/projects/cuda/NeuroLKH-main.

However these don't compile so easily. The multiple definition problem can be solved with the -fcommon flag see above. But then we need concorde.

git clone https://github.com/jvkersch/pyconcorde
cd pyconcorde/
pip3 install -e . --break-system-packages

Seems to work ok?

cd ../NeuroLKH
python data_generate.py -test

The next command however requires torch. That does exist as a debian package, but it will fail with : "Torch not compiled with CUDA enabled".

But this site seems to have the answer : visit

Get the cuda version with :

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Then browse to the torch site : https://pytorch.org/get-started/locally/

And choose your versions in the "Start locally" chooser. My choices resulted in this command, which will take a while to finish :

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --break-system-packages
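Once the install finishes, a quick sanity check that torch now sees the GPU (this sketch prints a message instead of crashing when torch is absent) :

```shell
python3 - <<'EOF'
# Report whether torch is importable and whether it was built with CUDA support.
try:
    import torch
    print("torch", torch.__version__, "cuda available:", torch.cuda.is_available())
except ImportError:
    print("torch not installed")
EOF
```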

Next attempt

# now this works, pretraining step, ends up in the file test/100.pkl
python test.py --dataset test/100.pkl --model_path pretrained/neurolkh.pt --n_samples 1000 --lkh_trials 1000 --neurolkh_trials 1000