4.8. rigidmol Accelerated by GPU
4.8.1. Enable GPU Acceleration
Since ABCluster 3.2, rigidmol can be accelerated by single or multiple GPU cards! This is done with the program rigidmol-gpu, which is currently available only for Linux.
Before proceeding, we again emphasize that you must have NVIDIA GPU cards and the CUDA toolkit properly configured before running GPU-accelerated ABCluster.
Hardware requirements: CUDA version >= 11.0; compute capability >= 7.0.
Tip
Different compute capabilities require different ABCluster downloads. First, look up your GPU's compute capability at https://developer.nvidia.com/cuda-gpus. For example, an NVIDIA A30 Tensor Core GPU has compute capability 8.0, so you should download the -Linux-GPU80 version. A mismatched version of ABCluster may raise errors like:
Error occurs: Fail to call the CUDA kernel function. Reason: no kernel image is available for execution on the device.
Error occurs: Fail to call XX. Reason: the provided PTX was compiled with an unsupported toolchain.
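If the driver and toolkit are already installed, you can usually check both requirements directly from the command line. These are standard NVIDIA tools, not part of ABCluster; note that the compute_cap query field requires a reasonably recent driver:
$ nvcc --version                                        # CUDA toolkit version (needs >= 11.0)
$ nvidia-smi --query-gpu=name,compute_cap --format=csv  # compute capability (needs >= 7.0)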
Using GPU acceleration is very easy! Keep in mind that any standard rigidmol input file can be used with rigidmol-gpu, which by default performs a standard global optimization on the CPU. Say, for the example in Example: (H2O)6, the input file is:
h2o6.cluster # cluster file name
20 # population size
20 # maximal generations
3 # scout limit
4.0 # amplitude
h2o6 # save optimized configuration
30 # number of LMs to be saved
So, you can just run the following command to do a standard CPU calculation:
$ rigidmol-gpu h2o6.inp > h2o6.out
To use GPUs, just add the argument -gpu at the end of the command line:
$ rigidmol-gpu h2o6.inp -gpu > h2o6.out
Now, ABCluster will try to use one or more GPU cards. GPU acceleration is successfully enabled!
4.8.2. Single GPU Performance
Tip
The sample input and output files can be found in testfiles/rigidmol/6-gpuperf.
In the last example, the GPU calculation was probably slower than the CPU one. The reason is simply that the system is too small: in the GPU implementation, data transfer between GPU and host memory is very expensive, so only for large systems, where the cost of data transfer is small compared with that of the numerical computation, can the GPU outperform the CPU.
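A back-of-the-envelope model makes this concrete (this is our own illustrative estimate, not a description of the program's internals). Assume evaluating the pairwise interactions of a cluster with \(N\) atoms costs \(\mathcal{O}(N^2)\) work, while host-device transfer moves only \(\mathcal{O}(N)\) data. Then
\[
T_{\mathrm{CPU}} \approx c_{\mathrm{cpu}} N^2, \qquad
T_{\mathrm{GPU}} \approx t_{\mathrm{xfer}} N + c_{\mathrm{gpu}} N^2, \qquad
c_{\mathrm{gpu}} \ll c_{\mathrm{cpu}},
\]
so the GPU wins once \(N > t_{\mathrm{xfer}} / (c_{\mathrm{cpu}} - c_{\mathrm{gpu}})\): the linear transfer cost is eventually dwarfed by the quadratic computation.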
An example can be found in testfiles/rigidmol/6-gpuperf, where a system of \((\mathrm{CH}_3\mathrm{CN})_{1500}\) is considered. The input file mol.inp is:
mol.cluster # cluster file name
1 # population size
1 # maximal generations
3 # scout limit
10.00000000 # amplitude
mol # save optimized configuration
30 # number of LMs to be saved
We use a population size of 1 and a maximal generation number of 1, since we only want to perform a single energy calculation on mol.cluster, which already contains an optimized structure. Performing the CPU and GPU calculations on the same computer, we get the following results:
Hardware                                                | Time
A single Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz Core | 56 seconds
A single Quadro RTX 4000 Card                           | 3 seconds
The GPU acceleration is amazing: a speedup of almost 19 times!
So, what is the critical size at which the GPU outperforms the CPU? This depends on the CPU, the GPU, and other hardware conditions. Usually, for a cluster containing more than 10000 atoms, GPU cards should be used.
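To locate the break-even point on your own machine, you can simply time the same input both ways; the time command is a standard shell utility, and the file names below are just placeholders:
$ time rigidmol-gpu mol.inp > mol-cpu.out        # CPU run
$ time rigidmol-gpu mol.inp -gpu > mol-gpu.out   # GPU run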
4.8.3. Multiple GPUs Performance
Tip
The sample input and output files can be found in testfiles/rigidmol/7-multigpus.
rigidmol-gpu will automatically detect the number of GPU cards and use all of them to accelerate calculations. You do not need to do anything extra.
Tip
rigidmol-gpu optimizes each cluster with only one GPU. So, to make use of multiple GPUs, there must be more than 1 individual in the population.
Continuing the last example, we can enlarge the population so that more GPUs can be used:
mol.cluster # cluster file name
32 # population size
2 # maximal generations
3 # scout limit
10.00000000 # amplitude
mol # save optimized configuration
30 # number of LMs to be saved
For example, if you have 4 A100 cards, just run the following command to do the global optimization on all of them (each card then handles roughly 8 of the 32 local optimizations per generation, assuming the individuals are distributed evenly):
$ rigidmol-gpu mol.inp -gpu > mol-gpu4.out
In the output, you can find this:
CUDA driver version: 11040; runtime version: 11060
4 GPU device is available:
 0: NVIDIA A100-SXM4-80GB
    Computational ability: 8.0
    Global memory: 81251 MB
    Block-shared memory: 48 KB = 6144 double
    Constant memory: 64 KB = 8192 double
    Maximum threads per block: 1024
    Maximum thread dimension: 1024, 1024, 64
    Maximum grid dimension: 2147483647, 65535, 65535
 1: NVIDIA A100-SXM4-80GB
    Computational ability: 8.0
    Global memory: 81251 MB
    Block-shared memory: 48 KB = 6144 double
    Constant memory: 64 KB = 8192 double
    Maximum threads per block: 1024
    Maximum thread dimension: 1024, 1024, 64
    Maximum grid dimension: 2147483647, 65535, 65535
 2: NVIDIA A100-SXM4-80GB
    Computational ability: 8.0
    Global memory: 81251 MB
    Block-shared memory: 48 KB = 6144 double
    Constant memory: 64 KB = 8192 double
    Maximum threads per block: 1024
    Maximum thread dimension: 1024, 1024, 64
    Maximum grid dimension: 2147483647, 65535, 65535
 3: NVIDIA A100-SXM4-80GB
    Computational ability: 8.0
    Global memory: 81251 MB
    Block-shared memory: 48 KB = 6144 double
    Constant memory: 64 KB = 8192 double
    Maximum threads per block: 1024
    Maximum thread dimension: 1024, 1024, 64
    Maximum grid dimension: 2147483647, 65535, 65535
This means that rigidmol-gpu has detected 4 GPUs and will use all of them. This calculation costs only 20 minutes; done on a CPU, it would probably take 20 hours!
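ABCluster itself needs no extra options for multiple GPUs. Still, since rigidmol-gpu is a CUDA program, you can most likely control which cards it sees with the standard CUDA_VISIBLE_DEVICES environment variable, and monitor their utilization with nvidia-smi. Both are generic NVIDIA/CUDA facilities rather than documented ABCluster options, so treat this as an assumption:
$ CUDA_VISIBLE_DEVICES=0,1 rigidmol-gpu mol.inp -gpu > mol-gpu2.out   # expose only cards 0 and 1
$ nvidia-smi                                                          # run in another terminal to watch GPU utilization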