4.8. rigidmol Accelerated by GPU

4.8.1. Enable GPU Acceleration

Since ABCluster 3.2, rigidmol can be accelerated by one or more GPU cards! This is done with the program rigidmol-gpu, which is currently available only for Linux.

Before proceeding, we again emphasize that you should have NVIDIA GPU cards and CUDA toolkit well configured before running GPU accelerated ABCluster.

Hardware requirements: CUDA version >= 11.0; compute capability >= 7.0.
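You can verify both values directly from the CUDA runtime. Below is a minimal sketch (our own example, not part of ABCluster; the file name check_gpu.cu is arbitrary) that prints the runtime version and the compute capability of every visible card. Compile it with nvcc check_gpu.cu -o check_gpu:

check_gpu.cu
// A minimal sketch (not part of ABCluster): query the CUDA runtime for its
// version and the compute capability of each visible GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int runtimeVersion = 0, deviceCount = 0;
    cudaRuntimeGetVersion(&runtimeVersion);   // e.g. 11060 means CUDA 11.6
    cudaGetDeviceCount(&deviceCount);
    std::printf("CUDA runtime version: %d\n", runtimeVersion);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor form the compute capability, e.g. 8.0
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}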

Tip

Different compute capabilities require different ABCluster builds. First, look up your compute capability at https://developer.nvidia.com/cuda-gpus. For example, if you have an NVIDIA A30 Tensor Core GPU, you will find that its compute capability is 8.0, so you should download the -Linux-GPU80 version. A mismatched version of ABCluster may raise errors like:

Error occurs: Fail to call the CUDA kernel function. Reason: no kernel image is available for execution on the device.

Error occurs: Fail to call XX. Reason: the provided PTX was compiled with an unsupported toolchain.

Using GPU acceleration is very easy! Keep in mind that any standard rigidmol input file can be used with rigidmol-gpu; without GPU enabled, it simply performs a standard global optimization task on the CPU. Say, for the example in Example: (H2O)6, the input file is:

h2o6.inp
h2o6.cluster # cluster file name
20           # population size
20           # maximal generations
3            # scout limit
4.0          # amplitude
h2o6         # save optimized configuration
30           # number of LMs to be saved

So, you can just run the following command to do a standard CPU calculation:

$ rigidmol-gpu h2o6.inp > h2o6.out

To use the GPU, just add the argument -gpu at the end of the command line:

$ rigidmol-gpu h2o6.inp -gpu > h2o6.out

Now, ABCluster will try to use one or more GPU cards. GPU acceleration is successfully enabled!

4.8.2. Single GPU Performance

Tip

The sample input and output files can be found in testfiles/rigidmol/6-gpuperf.

In the last example, the GPU calculation was probably slower than the CPU one. The reason is simply that the system is too small.

In the GPU implementation, data transfer between GPU and host memory is very expensive. Thus, only for large systems, where the cost of data transfer is small compared to that of numerical computation, can the GPU outperform the CPU.
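To see this trade-off concretely, here is an illustrative sketch (our own example, not ABCluster code) that uses CUDA events to time a host-to-device copy against a trivial kernel on the same data; even for a million doubles, the copy can take longer than the computation it feeds:

transfer_vs_compute.cu
// An illustrative sketch (not ABCluster code): compare host-to-device
// transfer time with the runtime of a trivial kernel.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;   // trivial per-element work
}

int main() {
    const int n = 1 << 20;    // ~1 million doubles; try other sizes
    double* h = (double*)std::malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) h[i] = 1.0;
    double* d = nullptr;
    cudaMalloc(&d, n * sizeof(double));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);  // transfer
    cudaEventRecord(t1);
    scale<<<(n + 255) / 256, 256>>>(d, n);                         // computation
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msCopy = 0.0f, msKernel = 0.0f;
    cudaEventElapsedTime(&msCopy, t0, t1);
    cudaEventElapsedTime(&msKernel, t1, t2);
    std::printf("copy: %.3f ms, kernel: %.3f ms\n", msCopy, msKernel);

    cudaFree(d); std::free(h);
    return 0;
}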

An example can be found in testfiles/rigidmol/6-gpuperf, where a system of \((\mathrm{CH}_3\mathrm{CN})_{1500}\) (9000 atoms) is considered. The input file mol.inp is:

mol.inp
mol.cluster     # cluster file name
1               # population size
1               # maximal generations
3               # scout limit
10.00000000     # amplitude
mol             # save optimized configuration
30              # number of LMs to be saved

We use a population size of 1 and a generation number of 1 since we only want to do a single energy calculation of mol.cluster, which provides an optimized structure. Performing a CPU and a GPU calculation on the same computer gives the following results:

Hardware                                                 Time
A single Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz core  56 seconds
A single Quadro RTX 4000 card                            3 seconds

The GPU acceleration is amazing: a nearly 19-fold speedup!

So, what is the critical size at which the GPU outperforms the CPU? This depends on the CPU, the GPU, and other hardware conditions. Usually, for a cluster containing more than 10000 atoms, GPU cards should be used.

4.8.3. Multiple GPUs Performance

Tip

The sample input and output files can be found in testfiles/rigidmol/7-multigpus.

rigidmol-gpu will automatically detect the number of GPU cards and use all of them to accelerate calculations. You do not need to do anything extra.
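If you want to restrict a run to particular cards (for example, on a shared node), the standard CUDA environment variable CUDA_VISIBLE_DEVICES should work, since it is honored by the CUDA runtime itself rather than being an ABCluster-specific option:

$ CUDA_VISIBLE_DEVICES=0,1 rigidmol-gpu mol.inp -gpu > mol.out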

Tip

rigidmol-gpu optimizes each cluster on only one GPU. So, to use multiple GPUs, there must be more than one individual in the population.
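Conceptually, the dispatch can be pictured as in the sketch below (our own illustration of the idea, not ABCluster's actual source; the round-robin assignment is an assumption): each individual is sent to one card, so at most min(population size, number of GPUs) cards can be busy at the same time.

multi_gpu_dispatch.cu
// A conceptual sketch (NOT ABCluster's actual code; the round-robin scheme
// is only an assumption): one individual is optimized per GPU card.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int nGpu = 0;
    cudaGetDeviceCount(&nGpu);      // detect all visible GPU cards
    if (nGpu == 0) return 1;        // no CUDA device found
    const int population = 32;      // hypothetical population size
    for (int i = 0; i < population; ++i) {
        int dev = i % nGpu;         // individual i -> GPU (i mod nGpu)
        cudaSetDevice(dev);         // later kernels would run on this card
        // ... the local optimization of individual i would be launched here ...
        std::printf("individual %2d -> GPU %d\n", i, dev);
    }
    return 0;
}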

In the last example, we can enlarge the population so that more GPUs are used:

mol.inp
mol.cluster     # cluster file name
32              # population size
2               # maximal generations
3               # scout limit
10.00000000     # amplitude
mol             # save optimized configuration
30              # number of LMs to be saved

For example, if you have four A100 cards, just run the following to do the global optimization on all of them:

$ rigidmol-gpu mol.inp -gpu > mol-gpu4.out

In the output, you can find this:

mol-gpu4.out
CUDA driver version: 11040; runtime version: 11060
4 GPU device is available:
  0: NVIDIA A100-SXM4-80GB
     Computational ability: 8.0
     Global memory:       81251 MB
     Block-shared memory: 48 KB = 6144 double
     Constant memory:     64 KB = 8192 double
     Maximum threads per block: 1024
     Maximum thread dimension:  1024, 1024, 64
     Maximum grid dimension:    2147483647, 65535, 65535
  1: NVIDIA A100-SXM4-80GB
     Computational ability: 8.0
     Global memory:       81251 MB
     Block-shared memory: 48 KB = 6144 double
     Constant memory:     64 KB = 8192 double
     Maximum threads per block: 1024
     Maximum thread dimension:  1024, 1024, 64
     Maximum grid dimension:    2147483647, 65535, 65535
  2: NVIDIA A100-SXM4-80GB
     Computational ability: 8.0
     Global memory:       81251 MB
     Block-shared memory: 48 KB = 6144 double
     Constant memory:     64 KB = 8192 double
     Maximum threads per block: 1024
     Maximum thread dimension:  1024, 1024, 64
     Maximum grid dimension:    2147483647, 65535, 65535
  3: NVIDIA A100-SXM4-80GB
     Computational ability: 8.0
     Global memory:       81251 MB
     Block-shared memory: 48 KB = 6144 double
     Constant memory:     64 KB = 8192 double
     Maximum threads per block: 1024
     Maximum thread dimension:  1024, 1024, 64
     Maximum grid dimension:    2147483647, 65535, 65535

This means rigidmol-gpu has detected 4 GPUs and will use all of them. This calculation takes only 20 minutes, while on a CPU it would probably need 20 hours: a roughly 60-fold speedup!