NVIDIA Monitoring Tools
When evaluating computing performance, we look at various KPIs: memory consumption, utilisation of compute power, occupation of hardware accelerators and, more recently, energy consumption and energy efficiency [1,2]. For popular NVIDIA cards, these can be measured with the help of the NVIDIA Management Library (NVML), which allows developers to query details of the device state [3].
The library is easier to use through the Python bindings available as pyNVML [4]. Note that Python overhead may be problematic if higher-frequency querying is needed, and the API likely comes with overhead of its own, so the readings should be understood as estimates.
Here is a simple script, which can be adjusted to query more details, if needed:
    # see the NVIDIA docs: https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries
    # to monitor GPU-1 and dump to a log file, run: python gpu_trace.py 1 log.csv
    import sys
    import time
    import pynvml

    pynvml.nvmlInit()

    if __name__ == "__main__":
        gpu_index = int(sys.argv[1])  # device
        fname = sys.argv[2]  # log file
        with open(fname, 'w') as f:
            # select device
            device_handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
            # prepare headers
            f.write('Timestamp;Temperature [C];Power [% max];GPU Util [% time];Mem Util [% time];Mem Cons [% max];Energy [kJ]\n')
            # get some metadata (power limit in mW, energy counter in mJ)
            power_max = pynvml.nvmlDeviceGetPowerManagementLimit(device_handle)
            energy_start = pynvml.nvmlDeviceGetTotalEnergyConsumption(device_handle)
            while True:  # stop with Ctrl-C
                # timestamp
                timestamp = time.time()
                # temperature of the GPU die sensor
                temp = pynvml.nvmlDeviceGetTemperature(device_handle, pynvml.NVML_TEMPERATURE_GPU)
                # power [% of max]
                power = pynvml.nvmlDeviceGetPowerUsage(device_handle) / power_max * 100.0
                # memory and gpu utilisation [%]
                util = pynvml.nvmlDeviceGetUtilizationRates(device_handle)
                # memory consumption [%]
                mem_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
                mem_cons = mem_info.used / mem_info.total * 100.0
                # energy delta in kJ (the API reports mJ)
                energy = (pynvml.nvmlDeviceGetTotalEnergyConsumption(device_handle) - energy_start) / 10**6
                # output result
                result = (timestamp, temp, power, util.gpu, util.memory, mem_cons, energy)
                f.write(';'.join(map(str, result)) + '\n')
                time.sleep(0.1)
And here is how to post-process and present results:
    import pandas as pd
    import matplotlib.pyplot as plt

    trace_df = pd.read_csv('/home/log.csv', sep=';', header=0)
    # convert Unix timestamps (seconds) to datetimes
    trace_df['Timestamp'] = pd.to_datetime(trace_df['Timestamp'], unit='s')
    trace_df.set_index('Timestamp', inplace=True)

    fig, ax1 = plt.subplots(1, 1, figsize=(12, 6))

    # percentage metrics on the left axis
    cols = ['Power [% max]', 'GPU Util [% time]', 'Mem Util [% time]', 'Mem Cons [% max]']
    trace_df[cols].plot(ax=ax1)
    ax1.set_ylabel('%')
    ax1.legend(loc='upper left')

    # cumulative energy on the right axis
    cols = ['Energy [kJ]']
    ax2 = ax1.twinx()
    trace_df[cols].plot(ax=ax2, linestyle='dashed', color='black')
    ax2.set_ylabel('kJ')
    ax2.legend(loc='upper right')

    fig.tight_layout()
    plt.show()
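Since the logger sleeps for 0.1 s between queries and every NVML call adds some latency, the effective sampling interval will be somewhat longer than the nominal one. The following is a minimal sketch of how this could be checked from the logged timestamps. The helper name `sampling_stats` and the sample values are hypothetical, and only the standard library is used.

```python
import statistics

def sampling_stats(timestamps, nominal_interval=0.1):
    """Summarise the gaps (in seconds) between consecutive timestamps."""
    gaps = [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]
    if not gaps:
        raise ValueError("need at least two timestamps")
    mean_gap = statistics.mean(gaps)
    return {
        "mean_gap": mean_gap,
        "max_gap": max(gaps),
        # average time spent per iteration on top of the nominal sleep
        "mean_overhead": mean_gap - nominal_interval,
    }

# hypothetical values standing in for the 'Timestamp' column of the log
example = [0.00, 0.11, 0.22, 0.36, 0.47]
stats = sampling_stats(example)
print(stats)
```

A large `max_gap` relative to `mean_gap` would suggest occasional stalls, in which case short events on the device may be missed by the trace.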
Case Study 1: Profiling ETL
The example shown below comes from an ETL process that uses a GPU.
Note that, in this case, monitoring identified likely bottlenecks: the GPU goes idle periodically (likely due to device-to-host transfers) and is underutilised overall. The energy estimate is a useful feature, as energy would be hard to measure accurately from power traces (due to high variation and subsampling).
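To illustrate why subsampled power traces can give poor energy estimates, here is a small sketch that integrates power samples with the trapezoidal rule. Everything in it is an assumption made for illustration: the helper names, the synthetic bursty load, and the use of power in watts (the log above stores power as a percentage of the limit, so it would first have to be multiplied by the power limit).

```python
def energy_from_power(timestamps, power_watts):
    """Trapezoidal estimate of energy [J] from power samples [W] at timestamps [s]."""
    if len(timestamps) != len(power_watts):
        raise ValueError("timestamps and power samples must have equal length")
    energy = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        energy += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return energy

def burst_power(t_ms):
    """Hypothetical load: 250 W for the first 100 ms of every 400 ms, else 50 W."""
    return 250.0 if t_ms % 400 < 100 else 50.0

def sampled_energy(step_ms, duration_ms=4000):
    """Sample the synthetic load every step_ms and integrate the samples."""
    ts_ms = list(range(0, duration_ms + 1, step_ms))
    ts = [t / 1000.0 for t in ts_ms]
    ps = [burst_power(t) for t in ts_ms]
    return energy_from_power(ts, ps)

# the load averages 100 W, so the true energy over 4 s is 400 J
fine = sampled_energy(1)      # dense sampling: close to 400 J
coarse = sampled_energy(400)  # every sample hits a burst: overestimates
print(round(fine, 3), round(coarse, 3))
```

In this contrived case the coarse trace sees only the bursts and overestimates the energy by a factor of 2.5, whereas a hardware energy counter does not depend on when it is read.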
Note that utilisation should be understood as time-occupation, for both memory and compute. From the documentation:
unsigned int gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
unsigned int memory: Percent of time over the past sample period during which global (device) memory was being read or written.
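Given this definition, a GPU utilisation reading near zero means that almost no kernel was running during that sample window. The sketch below shows one possible way to locate such idle windows in a trace. The function name `idle_periods`, the 5% threshold and the sample values are assumptions chosen for illustration.

```python
def idle_periods(timestamps, gpu_util, threshold=5.0):
    """Return (start, end) timestamps of runs of samples with utilisation below threshold."""
    periods = []
    start = None
    last = None
    for t, u in zip(timestamps, gpu_util):
        if u < threshold:
            if start is None:
                start = t  # a new idle run begins
            last = t
        elif start is not None:
            periods.append((start, last))  # the idle run ended at the previous sample
            start = None
    if start is not None:
        periods.append((start, last))  # trace ended while idle
    return periods

# hypothetical samples standing in for 'Timestamp' and 'GPU Util [% time]'
ts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
util = [90, 85, 0, 0, 88, 2, 91]
periods = idle_periods(ts, util)
idle_fraction = sum(u < 5.0 for u in util) / len(util)
print(periods, idle_fraction)
```

Regularly spaced idle periods, as in the ETL trace, hint at a recurring stall such as a data transfer or a host-side processing step.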
Case Study 2: Scientific Computing and Power Management
The example below shows a trace from a matrix computation task (see the script below):

    import torch

    x = torch.randn((256, 270725)).float().cuda('cuda:0')
    MATRIX_BATCH = 1000

    @torch.compile(mode='reduce-overhead', backend='inductor')
    def similarity_op(x, y):
        # pairwise differences: shape (features, len_x, len_y)
        xy = x[:, :, None] - y[:, None, :]
        xy = xy.abs() < 1
        # a pair matches if all features differ by less than 1
        xy = xy.all(dim=0)
        return xy

    # warm-up call on small CPU tensors
    _ = similarity_op(torch.randn(1, MATRIX_BATCH), torch.randn(1, MATRIX_BATCH))

    def similarity(x):
        x_slices = torch.split(x, MATRIX_BATCH, -1)
        result = []
        for x_i in x_slices:
            result_i = []
            for x_j in x_slices:
                result_i.append(similarity_op(x_i, x_j))
            result_i = torch.cat(result_i, -1)
            result_i = result_i.to(device='cpu', non_blocking=True)
            result.append(result_i)
        # wait for the asynchronous device-to-host copies before using the CPU tensors
        torch.cuda.synchronize()
        result = torch.cat(result, -2)
        return result

    # start profiling here
    _ = similarity(x)
In this example, we see different power management strategies on two similar devices:
Case Study 3: Energy Efficiency of Deep Learning
Here we reproduce some results from Tang et al. [1] to illustrate how adjusting the frequency can be used to minimise the energy spent per computational task (in their case: image prediction). Higher performance comes at the price of excessive energy use, so that the energy curve assumes a typical parabolic shape. Note that, in general, the energy-efficient configuration may be optimised over both clock and memory frequencies [5].
And here is the code to reproduce:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # source: Fig 4d, data for resnet-b32 from "The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study"
    freq = [544, 683, 810, 936, 1063, 1202, 1328]  # MHz
    power = [57, 62, 65, 70, 78, 88, 115]  # W = J/s
    requests = [60, 75, 85, 95, 105, 115, 120]  # requests/s
    data = pd.DataFrame(data=zip(freq, power, requests), columns=['Frequency', 'Power', 'Performance'])
    data['Energy'] = data['Power'] / data['Performance']  # [J/s] / [Images/s] = [J/Image]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

    # left panel: performance and power against frequency
    sns.lineplot(data=data, x='Frequency', y='Performance', ax=ax1, color='orange', label='Performance', marker='o')
    ax1.set_xticks(data['Frequency'])
    ax1.set_ylabel('Image / s')
    ax1.set_xlabel('Frequency [MHz]')
    ax1.legend(loc=0)
    ax12 = ax1.twinx()
    sns.lineplot(data=data, x='Frequency', y='Power', ax=ax12, color='steelblue', label='Power', marker='D')
    ax12.set_ylabel('W')
    ax12.legend(loc=1)

    # right panel: energy per image against frequency
    sns.lineplot(data=data, x='Frequency', y='Energy', ax=ax2, label='Energy')
    ax2.set_xticks(data['Frequency'])
    ax2.set_ylabel('J / Image')
    ax2.set_xlabel('Frequency [MHz]')
    ax2.legend(loc=0)

    # one title for the whole figure rather than for a single panel
    fig.suptitle('Performance, power, and energy for training of resnet-b32 network on P100.\n Reproduced from: "The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study"')
    plt.tight_layout()
    plt.show()
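The energy-optimal setting can also be read off numerically. The short sketch below reuses the values listed in the script above (read from the figure in Tang et al., so approximate) and picks the frequency with the lowest energy per image. The variable names are only illustrative.

```python
freq = [544, 683, 810, 936, 1063, 1202, 1328]  # MHz
power = [57, 62, 65, 70, 78, 88, 115]          # W
requests = [60, 75, 85, 95, 105, 115, 120]     # images/s

# energy per image [J] = power [J/s] / throughput [images/s]
energy_per_image = [p / r for p, r in zip(power, requests)]

# index of the setting with the lowest energy per image
best_idx = min(range(len(freq)), key=energy_per_image.__getitem__)
best_freq = freq[best_idx]

# relative change compared with the highest frequency
energy_saving = 1 - energy_per_image[best_idx] / energy_per_image[-1]
throughput_loss = 1 - requests[best_idx] / requests[-1]

print(best_freq, round(energy_per_image[best_idx], 3))
print(round(energy_saving, 3), round(throughput_loss, 3))
```

With these values the minimum falls at 936 MHz, where each image costs about 23% less energy than at the highest frequency, in exchange for about 21% lower throughput.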
- 1. Tang Z, Wang Y, Wang Q, Chu X. The Impact of GPU DVFS on the Energy and Performance of Deep Learning. Proceedings of the Tenth ACM International Conference on Future Energy Systems. Published online June 15, 2019. doi:10.1145/3307772.3328315
- 2. Tang K, He X, Gupta S, Vazhkudai SS, Tiwari D. Exploring the Optimal Platform Configuration for Power-Constrained HPC Workflows. 2018 27th International Conference on Computer Communication and Networks (ICCCN). Published online July 2018. doi:10.1109/icccn.2018.8487322
- 3. NVIDIA. NVIDIA Management Library Documentation. NVML-API. Accessed August 2023. https://docs.nvidia.com/deploy/nvml-api/index.html
- 4. Hirschfeld A. Python bindings to the NVIDIA Management Library. pyNVML. Accessed August 2023. https://pypi.org/project/nvidia-ml-py/#description
- 5. Fan K, Cosenza B, Juurlink B. Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels. Computation. Published online April 27, 2020:37. doi:10.3390/computation8020037