4.8. rigidmol Accelerated by GPU

4.8.1. Enable GPU Acceleration

Since ABCluster 3.2, rigidmol can be accelerated by one or more GPU cards! This is done with the program rigidmol-gpu, which is currently available only for Linux.

Before proceeding, we again emphasize that you should have NVIDIA GPU cards and CUDA toolkit well configured before running GPU accelerated ABCluster.

Hardware requirements: CUDA version >= 11.0; compute capability >= 7.0.
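You can verify both values directly from the CUDA runtime. Below is a minimal sketch (our own example, not part of ABCluster; the file name check_gpu.cu is arbitrary) that prints the runtime version and the compute capability of every visible card. Compile it with nvcc check_gpu.cu -o check_gpu:

check_gpu.cu
// A minimal sketch (not part of ABCluster): query the CUDA runtime for its
// version and the compute capability of each visible GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int runtimeVersion = 0, deviceCount = 0;
    cudaRuntimeGetVersion(&runtimeVersion);   // e.g. 11060 means CUDA 11.6
    cudaGetDeviceCount(&deviceCount);
    std::printf("CUDA runtime version: %d\n", runtimeVersion);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor form the compute capability, e.g. 8.0
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}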

Tip

Different compute capabilities require different ABCluster builds. First, look up your compute capability at https://developer.nvidia.com/cuda-gpus. For example, if you have an NVIDIA A30 Tensor Core GPU, you will find that its compute capability is 8.0, so you should download the -Linux-GPU80 version. A mismatched version of ABCluster may raise errors like:

Error occurs: Fail to call the CUDA kernel function. Reason: no kernel image is available for execution on the device.

Error occurs: Fail to call XX. Reason: the provided PTX was compiled with an unsupported toolchain.

Using GPU acceleration is very easy! Keep in mind that any standard rigidmol input file can be used with rigidmol-gpu; without GPU enabled, it simply performs a standard global optimization task on the CPU. Say, for the example in Example: (H2O)6, the input file is:

h2o6.inp
h2o6.cluster # cluster file name
20           # population size
20           # maximal generations
3            # scout limit
4.0          # amplitude
h2o6         # save optimized configuration
30           # number of LMs to be saved

So, you can just run the following command to do a standard CPU calculation:

$ rigidmol-gpu h2o6.inp > h2o6.out

To use the GPU, just add the argument -gpu at the end of the command line:

$ rigidmol-gpu h2o6.inp -gpu > h2o6.out

Now, ABCluster will try to use one or more GPU cards. GPU acceleration is successfully enabled!

4.8.2. Single GPU Performance

Tip

The sample input and output files can be found in testfiles/rigidmol/6-gpuperf.

In the last example, the GPU calculation was probably slower than the CPU one. The reason is simply that the system is too small.

In the GPU implementation, data transfer between GPU and host memory is very expensive. Thus, only for large systems, where the cost of data transfer is small compared to that of numerical computation, can the GPU outperform the CPU.
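To see this trade-off concretely, here is an illustrative sketch (our own example, not ABCluster code) that uses CUDA events to time a host-to-device copy against a trivial kernel on the same data; even for a million doubles, the copy can take longer than the computation it feeds:

transfer_vs_compute.cu
// An illustrative sketch (not ABCluster code): compare host-to-device
// transfer time with the runtime of a trivial kernel.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;   // trivial per-element work
}

int main() {
    const int n = 1 << 20;    // ~1 million doubles; try other sizes
    double* h = (double*)std::malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) h[i] = 1.0;
    double* d = nullptr;
    cudaMalloc(&d, n * sizeof(double));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);  // transfer
    cudaEventRecord(t1);
    scale<<<(n + 255) / 256, 256>>>(d, n);                         // computation
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msCopy = 0.0f, msKernel = 0.0f;
    cudaEventElapsedTime(&msCopy, t0, t1);
    cudaEventElapsedTime(&msKernel, t1, t2);
    std::printf("copy: %.3f ms, kernel: %.3f ms\n", msCopy, msKernel);

    cudaFree(d); std::free(h);
    return 0;
}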

An example can be found in testfiles/rigidmol/6-gpuperf, where a system of \((\mathrm{CH}_3\mathrm{CN})_{1500}\) (9000 atoms) is considered. The input file mol.inp is:

mol.inp
mol.cluster     # cluster file name
1               # population size
1               # maximal generations
3               # scout limit
10.00000000     # amplitude
mol             # save optimized configuration
30              # number of LMs to be saved

We use a population size of 1 and a generation number of 1 since we only want to do a single energy calculation of mol.cluster, which provides an optimized structure. Performing a CPU and a GPU calculation on the same computer gives the following results:

Hardware                                                 Time
A single Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz core  56 seconds
A single Quadro RTX 4000 card                            3 seconds

The GPU acceleration is amazing: a nearly 19-fold speedup!

So, what is the critical size at which the GPU outperforms the CPU? This depends on the CPU, the GPU, and other hardware conditions. Usually, for a cluster containing more than 10000 atoms, GPU cards should be used.

4.8.3. Multiple GPUs Performance

Tip

The sample input and output files can be found in testfiles/rigidmol/7-multigpus.

rigidmol-gpu will automatically detect the number of GPU cards and use all of them to accelerate calculations. You do not need to do anything extra.
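If you want to restrict a run to particular cards (for example, on a shared node), the standard CUDA environment variable CUDA_VISIBLE_DEVICES should work, since it is honored by the CUDA runtime itself rather than being an ABCluster-specific option:

$ CUDA_VISIBLE_DEVICES=0,1 rigidmol-gpu mol.inp -gpu > mol.out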

Tip

rigidmol-gpu optimizes each cluster on only one GPU. So, to use multiple GPUs, there must be more than one individual in the population.
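Conceptually, the dispatch can be pictured as in the sketch below (our own illustration of the idea, not ABCluster's actual source; the round-robin assignment is an assumption): each individual is sent to one card, so at most min(population size, number of GPUs) cards can be busy at the same time.

multi_gpu_dispatch.cu
// A conceptual sketch (NOT ABCluster's actual code; the round-robin scheme
// is only an assumption): one individual is optimized per GPU card.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int nGpu = 0;
    cudaGetDeviceCount(&nGpu);      // detect all visible GPU cards
    if (nGpu == 0) return 1;        // no CUDA device found
    const int population = 32;      // hypothetical population size
    for (int i = 0; i < population; ++i) {
        int dev = i % nGpu;         // individual i -> GPU (i mod nGpu)
        cudaSetDevice(dev);         // later kernels would run on this card
        // ... the local optimization of individual i would be launched here ...
        std::printf("individual %2d -> GPU %d\n", i, dev);
    }
    return 0;
}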

In the last example, we can enlarge the population so that more GPUs are used:

mol.inp
mol.cluster     # cluster file name
32              # population size
2               # maximal generations
3               # scout limit
10.00000000     # amplitude
mol             # save optimized configuration
30              # number of LMs to be saved

For example, if you have four A100 cards, just run the following to do the global optimization on all of them:

$ rigidmol-gpu mol.inp -gpu > mol-gpu4.out

In the output, you can find this:

mol-gpu4.out
CUDA driver version: 11040; runtime version: 11060
4 GPU device is available:
  0: NVIDIA A100-SXM4-80GB
     Computational ability: 8.0
     Global memory:       81251 MB
     Block-shared memory: 48 KB = 6144 double
     Constant memory:     64 KB = 8192 double
     Maximum threads per block: 1024
     Maximum thread dimension:  1024, 1024, 64
     Maximum grid dimension:    2147483647, 65535, 65535
  1: NVIDIA A100-SXM4-80GB
     Computational ability: 8.0
     Global memory:       81251 MB
     Block-shared memory: 48 KB = 6144 double
     Constant memory:     64 KB = 8192 double
     Maximum threads per block: 1024
     Maximum thread dimension:  1024, 1024, 64
     Maximum grid dimension:    2147483647, 65535, 65535
  2: NVIDIA A100-SXM4-80GB
     Computational ability: 8.0
     Global memory:       81251 MB
     Block-shared memory: 48 KB = 6144 double
     Constant memory:     64 KB = 8192 double
     Maximum threads per block: 1024
     Maximum thread dimension:  1024, 1024, 64
     Maximum grid dimension:    2147483647, 65535, 65535
  3: NVIDIA A100-SXM4-80GB
     Computational ability: 8.0
     Global memory:       81251 MB
     Block-shared memory: 48 KB = 6144 double
     Constant memory:     64 KB = 8192 double
     Maximum threads per block: 1024
     Maximum thread dimension:  1024, 1024, 64
     Maximum grid dimension:    2147483647, 65535, 65535

This means rigidmol-gpu has detected 4 GPUs and will use all of them. This calculation takes only 20 minutes, while on a CPU it would probably need 20 hours: a roughly 60-fold speedup!