QQ plots, despite their importance, are not well-supported in Python, as of the moment of writing this note. The package statsmodels offers plots of rather poor quality and non-interactive, and even Plotly doesn’t come with its own recipe. In this note, I am sharing how to build an interactive QQ plot and apply it to improve linear regression.
I will use the state-of-the art Vega-Lite language dedicated to interactive visualization, available in Python through the package Altair API.
Let’s use the famous “Prestige” dataset1, and look how prestige is explained by income, education, and job type:
\( prestige \sim income + education:type \)
The model residuals do not follow the normal distribution (as rejected by the Jarque-Bera test). It turns out that the normality of residuals is broken by an anomaly visible on QQ plots. The anomaly in the top-right corner comes from the minister job (hover to inspect!), which comes with high prestige and little money 😉
After removing the political anomaly, the model behaves properly 😉
OLS Regression Results
Dep. Variable:
prestige
R-squared:
0.922
Model:
OLS
Adj. R-squared:
0.914
Method:
Least Squares
F-statistic:
115.8
Date:
Sat, 06 Apr 2024
Prob (F-statistic):
4.31e-21
Time:
18:03:24
Log-Likelihood:
-157.21
No. Observations:
44
AIC:
324.4
Df Residuals:
39
BIC:
333.3
Df Model:
4
Covariance Type:
nonrobust
Omnibus:
3.775
Durbin-Watson:
1.960
Prob(Omnibus):
0.151
Jarque-Bera (JB):
2.734
Skew:
0.583
Prob(JB):
0.255
Kurtosis:
3.365
Cond. No.
261.
Diagnosing and fixing regression by identifying anomalies with QQ-is a common technique. With interactive QQ plots, it becomes more powerful and visually appealing. Below, I attach the full code in Python:
import numpy as npfrom scipy import statsimport pandas as pdimport altair as altimport statsmodels.api as smfrom statsmodels.formula.api import ols# fit a model data = sm.datasets.get_rdataset("Duncan","carData",cache=True).datamodel =ols("prestige ~ income + education:C(type)",data=data).fit()# plot residuals against normal quantilesempirical_quantiles = model.resid.sort_values()dist = stats.norm# for ploting positions, sample extremes are 1/(n+1) and 1-1/(n+1) percentilesnobs =len(empirical_quantiles)pos = np.arange(1.,nobs+1)/(nobs +1)theoretical_quantiles = dist.ppf(pos)df = pd.DataFrame(data={'res':empirical_quantiles.values,dist.name:theoretical_quantiles,'label':empirical_quantiles.index})qqplot = alt.Chart(df).mark_point().encode(y='res',x=dist.name,tooltip='label').interactive()# the line is fit robustly based on 1st and 3rd quartileq1,q3 = stats.mstats.mquantiles(empirical_quantiles,[0.25,0.75])tq1,tq3 = dist.ppf([0.25,0.75])linex = theoretical_quantilesliney =(q3-q1)/(tq3-tq1)*(linex-tq1)+ q1qqline = alt.Chart(pd.DataFrame({'res':liney,dist.name:linex})).mark_line().encode(y='res',x=dist.name,)# combine two plotsqqplot + qqline# fit a fixed model data = sm.datasets.get_rdataset("Duncan","carData",cache=True).datamask =~data.index.isin(["minister"])model =ols("prestige ~ income + education:C(type)",data=data[mask]).fit()model.summary()
References
1.
Duncan OD. A Socioeconomic Index for All Occupations. Occupations and Social Status. 1961;(7).
Normality of data is important in statistics, so it is not a surprise that transforming data to look closer to a normal distribution can be beneficial in many ways. There are ways of finding the right transform, such as the popular Box-Cox method 1,2. In this post, I will share an elegant example on how logarithmic transforms can improve normality and consequently make statistical testing significant, effectively avoiding missing an outcome (false-negative)!
More precisely, by looking into the classical heart disease dataset 3 we will find that the cholesterol level is log-normally distributed, as suggested in the medical literature4, and using this we will prove that cholesterol level is associated with heart disease.
Here is how the data looks like:
We see that under the log-transform statistical tests no longer reject normality, and moreover that the difference between groups becomes significant!
Transform
KS Test p-val
T Test p-val
Healthy
Sick
none
0.003235
0.990000
0.05
log
0.246064
0.453206
0.03
Normality and mean-difference tests for both original and transformed data.
For completeness, here is the code snippet to reproduce:
#%pip install openmlimport matplotlib.pyplot as pltimport seaborn as snsimport openmlimport numpy as npfrom statsmodels.stats import diagnostic, weightstatsimport pandas as pdfrom IPython.display import display## get the datadataset = openml.datasets.get_dataset("Heart-Disease-Prediction")X, y, _, _ = dataset.get_data(dataset_format="dataframe")X["Heart_Disease"]= X["Heart_Disease"].apply({"Absence":False,"Presence":True}.get)sns.histplot(data=X,x="Cholesterol",hue="Heart_Disease")x0 = X['Cholesterol'][X["Heart_Disease"]==False]x1 = X['Cholesterol'][X["Heart_Disease"]==True]## investigate normalityouts =[]for fn_name, fn inzip(["none","log"],[lambdax:x,lambdax:np.log(x)]): p_norm0 = diagnostic.kstest_normal(fn(x0),dist='norm',pvalmethod='table')[1] p_norm1 = diagnostic.kstest_normal(fn(x1),dist='norm',pvalmethod='table')[1] p_ttest = weightstats.ttest_ind(fn(x0),fn(x1),usevar='unequal')[1].round(2) outs.append((fn_name, p_norm0, p_norm1, p_ttest ))outs = pd.DataFrame(data=outs,columns=[("Transform",),('KS Test p-val',"Healthy"),('KS Test p-val',"Sick"),('T Test',)])outs.columns = pd.MultiIndex.from_tuples( outs.columns )display(outs)
References
1.
Daimon T. Box–Cox Transformation. International Encyclopedia of Statistical Science. Published online 2011:176-178. doi:10.1007/978-3-642-04898-2_152
2.
Box GEP, Cox DR. An Analysis of Transformations. Journal of the Royal Statistical Society: Series B (Methodological). Published online July 1964:211-243. doi:10.1111/j.2517-6161.1964.tb00553.x
3.
Andras Janosi WS. Heart Disease. Published online 1989. doi:10.24432/C52P4X
The risk matrix, a popular risk assessment method 1, presents risks in the form of a likelihood / impact heatmap. In this post, I share a convenient and elegant Python snippet for constructing risk matrices, used to generate this hypothetical project risk assessment:
Below are the code details:
import pandas as pdimport matplotlib as mplimport matplotlib.pyplot as pltimport numpy as nplikelihod_cols =["Rare","Unlikely","Moderate","Likely","Almost certain"]impact_cols =["Insignificant","Minor","Significant","Major","Severe"]score_vals = np.arange(len(impact_cols)).reshape(-1,1)+ np.arange(len(likelihod_cols)).reshape(1,-1)cmap = mpl.colormaps["RdYlGn"].reversed().resampled(9)norm = mpl.colors.BoundaryNorm(range(10),10)fig, ax = plt.subplots(figsize=(8,8))ax.imshow(score_vals,cmap=cmap,norm=norm)annotations ={(2,3):["competition"],(3,2):["too ambitiuous"],(3,3):["hardware expertise"],(2,2):["time crunch"],(2,1):["stretched resources"],}annots_style ={'fontweight':'bold'}for(i,j),text in annotations.items(): text ='\n'.join( text ) _ = ax.text(i, j, text,ha="center",va="center",color="b",**annots_style)ax.set_xlabel('Impact',fontweight='bold')ax.set_xticks(range(len(impact_cols)),labels=impact_cols)ax.set_ylabel('Likelihood',fontweight='bold')ax.set_yticks(range(len(likelihod_cols)),labels=likelihod_cols)ax.invert_yaxis()plt.tight_layout()plt.show()fig.savefig("risk_heatmap.svg")
1.
Peace C. The risk matrix: uncertain results? Policy and Practice in Health and Safety. Published online July 3, 2017:131-144. doi:10.1080/14773996.2017.1348571
This distribution arises as passage times of random walks 1. For a survey of its basic properties, see 2. In this note we explore in detail its approximate normality, which holds when \( \lambda/\mu \to \infty\).
Consider the random variable \(X\sim \mathsf{IG}(\mu,\lambda)\) and its standardised version $$Y = \frac{X-\mathbf{E}[X]}{\sqrt{\mathbf{Var}[X]}}.$$ Below we establish normal approximation in various metrics.
Convergence in Distribution
Since the characteristic function of \(X\) is \(\phi_X(t) = e^{\frac{\lambda \left(1 – \sqrt{1 – \frac{2 i \mu^{2} t}{\lambda}}\right)}{\mu}}\), and \(\mathbf{E}[X] = \mu, \mathbf{Var}[X]=\mu^3/\lambda\), the characteristic function of the standardized distribution equals
$$ \phi_Y(t) = e^{k^{2} \cdot \left(1 – \sqrt{1 – \frac{2 i t}{k}}\right)} e^{- i k t}, $$
where \( k \triangleq \sqrt{\lambda/\mu}\). The limit under \(k\to\infty\) equals \(e^{-\frac{t^2}{2}}\), the characteristic function of the standard normal distribution, which proves convergence in distribution:
import sympy as smx, y, t, lmbd, mu, k = sm.symbols('x y t lambda mu k',positive=True)pdf = sm.Mul( sm.sqrt(lmbd/(2*sm.pi*x**3),evaluate=False), sm.exp(-lmbd*(x-mu)**2/(2*mu**2*x)),evaluate=False)mean = muvar = mu**3/lmbdnorm_pdf = sm.sqrt(var)* pdf.subs({x:y*sm.sqrt(var)+mu})norm_pdf.subs({lmbd:k**2*mu}).simplify()
and compare it to that of the standard normal distribution. Namely, we have
that is the convergence of probability densities to the standard normal distribution. By Scheffe’s Theorem 3 this proves the convergence in total variation (stronger than convergence in distribution)
We now attempt to derive concrete numerical estimates. The ratio of the characteristic function \(\phi_k\) of \(Y\) and \(\phi_{\infty}\) of the standard normal equals
$$\phi_k(t)/\phi_{\infty}(t) = e^{ k^2 h(t /k ) }, \quad h(x) \triangleq 1-\sqrt{1-2 i x}- ix + \frac{x^2}{2}.$$
Observe that \(h(x)\) expands into a convergent power series around \(x=0\) when \(|x|<\frac{1}{2}\). Thus, under the condition that \(|t| \leq \sqrt{k}\) and \(k\geq 4\), we obtain that \(|k^2 h(t /k )|\leq O(1)\), and by the bound \(|e^{z}-1| = \Theta(z)\) for \(|z|\leq O(1)\) that
We know handle the regime \(\sqrt{k} < |t| \leq k/2\). Observe that \(\Re h(x) =\Theta(1) x^4\) by the power series expansion. It follows that \(|e^{k^2 h(t /k )}| = e^{\Theta(1)}\) and therefore
By power series expansions we have, uniformly in \(k\) $$\frac{f_Y(y)}{f_{\infty}(y)} = 1-\Theta(y/k),\quad |y|\leq k^{1/3},$$ and because the expression increases in \(y\) and decreases in \(k\) when \(|y| > k^{1/3}, k>8\) (we have \(\frac{\partial}{\partial y} \frac{f_Y(y)}{f_{\infty}(y)} = \frac{k^{\frac{3}{2}} \left(- 3 k + y^{2} \cdot \left(3 k + 2 y\right) – 3 y\right) e^{\frac{y^{3}}{2 \left(k + y\right)}}}{2 \left(k + y\right)^{\frac{7}{2}}}\) and \(\frac{\partial}{\partial k} \frac{f_Y(y)}{f_{\infty}(y)} = – \frac{\sqrt{k} y \left(k y^{2} – 3 k – 3 y\right) e^{\frac{y^{3}}{2 k + 2 y}}}{2 \left(k + y\right)^{\frac{7}{2}}}\)) also $$ \frac{f_Y(y)}{f_{\infty}(y)} > 1,\quad y \geq k^{1/3},k\geq 8$$ and $$ \frac{f_Y(y)}{f_{\infty}(y)} < 1,\quad y \leq -k^{1/3},k\geq 8.$$ Therefore, we obtain $$ \int \left| f_Y(y)-f_{\infty}(y) \right| \mbox{d}y \leq O(1/k)\int_{|y|<k^{1/3}} |y| f_{\infty}(y)\mbox{d} y + \int_{t<-k^{1/3}} 2 f_{\infty}(y) = O(1/k). $$
Which proves that the convergence rate in total variation is \(O(1/k)\):
$$d_{TV}(Y;\mathsf{N}(0,1)) \leq O(1/k) = O(\sqrt{\mu/\lambda})$$ (in fact, here we obtain the matching lower bound too).
Numerical Evaluation
Numerical results match the theoretical analysis, as shown below:
For completeness, here is the script:
import numpy as npimport matplotlib.pyplot as pltfrom scipy.stats import invgauss, normfig, axs = plt.subplots(1,2,figsize=(12,4))# Part 1: evaluate total variationmus = np.linspace(0.5,0.00001,100)out =[]for mu in mus:# construct Inverse Gaussian Z =invgauss(mu=mu) mean = Z.mean() std = Z.std() k =(mean/std)**2# standardise and compare with normal Z_std =invgauss(mu=mu,loc=-mean/std,scale=1/std) Norm =norm(0,1) xs = np.linspace(-10,10,1000) metric = np.abs(Z_std.pdf(xs)- Norm.pdf(xs)).mean()*20 result =(k,metric) out.append(result)ax = axs[1]ax.plot(*zip(*out))ax.set_xscale('log')ax.set_xlabel('$\lambda / \mu$')ax.set_yscale('log')ax.set_ylabel('total-variation to normal')# Part 2: plot pdfax = axs[0]for mu in[0.2,0.1,0.05,0.01,0.0001]:# construct Inverse Gaussian Z =invgauss(mu=mu) mean = Z.mean() std = Z.std() k =(mean/std)**2# standardise and plot Z_std =invgauss(mu=mu,loc=-mean/std,scale=1/std) xs = np.linspace(-5,5,10000) ax.plot(xs,Z_std.pdf(xs),label=f'$\lambda/\mu$={k:.2f}')ax.legend(title="standardised distribution")ax.set_xlabel('$x$')ax.set_ylabel('pdf')plt.tight_layout()plt.show()
Folks JL, Chhikara RS. The Inverse Gaussian Distribution and its Statistical Application—A Review. Journal of the Royal Statistical Society: Series B (Methodological). Published online July 1978:263-275. doi:10.1111/j.2517-6161.1978.tb01039.x
3.
Scheffe H. A Useful Convergence Theorem for Probability Distributions. Ann Math Statist. Published online September 1947:434-438. doi:10.1214/aoms/1177730390
4.
Bobkov SG. Proximity of probability distributions in terms of Fourier-Stieltjes transforms. Russ Math Surv. Published online December 31, 2016:1021-1079. doi:10.1070/rm9749
Docker is a modern technology for containerizing software. While packaging to a working version is relatively easy, creating a lightweight Docker image presents a challenge. In this note, I will briefly discuss an example from my university class on software engineering.
The task was to containerize a card game web application, written by students as a group project. The application was created with React.js and used to be run in a development mode (which supports tracing, debugging and so on) via npm start. The page can be then optimized for deployment via npm build. The job of a Docker container is to serve this optimized version with a simple http server.
Of course, the easiest way was to both page build and deployment within the Create React App framework. But the optimal way is to stay minimalistic and leverage multi-stage Docker builds with task-tailored sub-images: one to build and another to serve. The image size indeed gets reduced from 500Mb to about 20MB! To build and serve we used, respectively, a Node.js image and a Static Web Server image. The build product of the first step gets copied to the second image which provides a lightweight server, leaving unessential stuff (cached files, development tools ) behind. The implementation is shown below:
# docker build -t brydz:latest .# docker run -it -p 3000:3000 brydz:latest sh# stage 1: install dependencies and build the appFROM node:18-alpine AS pre_buildWORKDIR /brydzCOPY client/package.json ./client/package.jsonCOPY client/package-lock.json ./client/package-lock.jsonRUN npm install --prefix clientCOPY client/ ./client/ENV PUBLIC_URL=“.” RUN npm run build --prefix client# stage 2: move to a clean prod environmentFROM joseluisq/static-web-server:2-alpine AS finalWORKDIR /brydzCOPY --from=pre_build /brydz/client/build ./buildEXPOSE 3000CMD ["static-web-server", "--port=3000", "--root=build", "--log-level=trace"]
Two-player games can be solved by following a very intuitive algorithm called Regret Matching1. Players modify their action probabilities according to the so-called regrets or advantages, which can be thought as consequences of alternative choices. For a good overview of the topic, see the friendly yet detailed introduction by Neller and Lanctot 2.
The following code snippet demonstrates an efficient implementation for two-player zero-sum finite games in PyTorch, and tests on the game called Rock-Paper-Scissors with Crash. In this variant, the scissors crashes against rock, and the loss is doubled. This shifts the equilibrium choices in favour of “paper”, at the exact proportion of 25%/50%/25%. Remarkably, regrets can be used to estimate the approximation accuracy, which approaches the perfect solution under mild conditions 3. This is also demonstrated in the snippet.
import torch## Game: Rock-Paper-Scissors with Crash, the equilibrium achieved with 25%/50%/25%N_ACTIONS =3# 0: rock, 1: paper, 2: scissorsPAYOFF = torch.tensor([[0,-1,2],[1,0,-1],[-2,1,0]],dtype=torch.float).cuda()# gain of row player / loss of col player## UtilsdefgetStrategy(cfr):"""Return a strategy corresponding to observed regrets Args: cfr (array): counter-factual regret, shape (N_ACTIONS,) Returns: weights (array): strategy, shape (N_ACTIONS,)""" weights = torch.clip(cfr,0,torch.inf) weights = weights/torch.sum(weights) N_ACTIONS = weights.shape[0] weights = torch.nan_to_num(weights,nan=1/N_ACTIONS)return weights#@torch.jit.script@torch.compile(mode='max-autotune')defgetEquilibrium(PAYOFF,N_ITER:int=500,stochastic_advantage:bool=False):# auxiliary variables N_ACTIONS = PAYOFF.shape[0] cumCFR1 = torch.zeros(N_ACTIONS,).cuda() cumCFR2 = torch.zeros(N_ACTIONS,).cuda() cumStrategy1 = torch.zeros(N_ACTIONS,).cuda() cumStrategy2 = torch.zeros(N_ACTIONS,).cuda() strategy1 = torch.ones(N_ACTIONS,).cuda()/N_ACTIONS strategy2 = torch.ones(N_ACTIONS,).cuda()/N_ACTIONS# training loopfor _ inrange(N_ITER):# sample actions and observe regretsif stochastic_advantage:# a) stochastic variant, often implemented in tutorials action1 = torch.multinomial(strategy1,num_samples=1).squeeze() action2 = torch.multinomial(strategy2,num_samples=1).squeeze() cfr1 = PAYOFF[:,action2]-PAYOFF[action1,action2] cfr2 =-(PAYOFF[action1,:]-PAYOFF[action1,action2])else:# b) averaged variant PAYOFF_avg = strategy1.view(1,-1).mm(PAYOFF).mm(strategy2.view(-1,1)) cfr1 =(PAYOFF.mm(strategy2.view(-1,1))-PAYOFF_avg).squeeze() cfr2 =(strategy1.view(1,-1).mm(PAYOFF)-PAYOFF_avg).squeeze()*(-1)# update strategies proportionally to regrets strategy1 =getStrategy(cumCFR1) strategy2 =getStrategy(cumCFR2)# track cumulated regrets and strategies cumCFR1 += cfr1 cumCFR2 += cfr2 cumStrategy1 += strategy1 cumStrategy2 += strategy2# averaged strategies converge to Nash Equilibrium avgStrategy1 = cumStrategy1/cumStrategy1.sum() avgStrategy2 = cumStrategy2/cumStrategy2.sum()# estimate approximation error (upper bound) eps =2*torch.max(cumCFR1.max(),cumCFR2.max())/N_ITER torch.cuda.synchronize()return(avgStrategy1,avgStrategy2,eps)getEquilibrium(PAYOFF)# eps < 0.03
Regrets are expensive to compute in large games, but can be approximated using modern machine-learning techniques. This approach has recently found many applications, including solvers for Poker and even larger card games 4,5 .
1.
Hart S, Mas-Colell A. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica. Published online September 2000:1127-1150. doi:10.1111/1468-0262.00153
Waugh K. Abstraction in Large Extensive Games. University of Alberta Libraries. Published online 2009. doi:10.7939/R3CM74
4.
Brown N, Lerer A, Gross S, Sandholm T. Deep Counterfactual Regret Minimization. In: ; 2019.
5.
Adams D. The Feasibility of Deep Counterfactual Regret Minimisation for Trading Card Games. AI 2022: Advances in Artificial Intelligence. Published online 2022:145-160. doi:10.1007/978-3-031-22695-3_11
When evaluating computing performance we look at various KPIs: memory consumption, utilisation of compute power, occupation of hardware accelerators, and – more recently – at the energy consumption and energy efficiency 1,2. For popular NVIDIA cards this can be solved with the help of the NVIDIA Management Library, which allows developer to query details of the device state3.
The library is easier to use through Python bindings available as pyNVML4. Note that Python overheads may be problematic if higher-frequency querying is needed, plus the API likely comes with its own overheads. So the readings should be understood as estimations.
Here is a simple script, which can be adjusted to query more details, if needed:
# see the NVIDIA docs: https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries# to monitor GPU-1 and dump to a log file, run: python gpu_trace.py 1 log.csvimport sysimport timeimport pynvmlpynvml.nvmlInit()if __name__ =="__main__": gpu_index =int(sys.argv[1])# device fname = sys.argv[2]# log filewithopen(fname,'w')as f:# select device device_handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)# prepare headers f.write('Timestamp;Temperature [C];Power [% max];GPU Util [% time];Mem Util [% time];Mem Cons [% max];Energy [kJ]\n')# get some metadata power_max = pynvml.nvmlDeviceGetPowerManagementLimit(device_handle) energy_start = pynvml.nvmlDeviceGetTotalEnergyConsumption(device_handle)whileTrue:# timestamp timestamp = time.time()# temperature temp = pynvml.nvmlDeviceGetTemperature(device_handle,0)# TODO: set sensor if many?# power [% of max] power = pynvml.nvmlDeviceGetPowerUsage(device_handle)/ power_max *100.0# memory and gpu utilisation [%] util = pynvml.nvmlDeviceGetUtilizationRates(device_handle)# memory consumption [%] mem_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle) mem_cons = mem_info.used / mem_info.total *100.0# eneregy delta in kJ (API uses in mJ) eneregy =(pynvml.nvmlDeviceGetTotalEnergyConsumption(device_handle)-energy_start)/10**6# output result result =(timestamp,temp,power,util.gpu,util.memory,mem_cons,eneregy) f.write(';'.join(map(str, result))+'\n') time.sleep(0.1)
And here is how to post-process and present results:
The example shown below comes from an ETL processes which utilizes a GPU.
Note that, in this case, monitoring identified likely bottlenecks: the GPU gets idle on a periodic basis (likely, device-to-host transfers) plus is overall underutilised. Estimation of energy consumed is a nice feature, as it would be hard to measure it accurately from power traces (due to high variation and subsampling).
Note that utilisation should be understood as time-occupation, in case of both memory and computing. From the documentation:
unsigned int gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU. unsigned int memory : Percent of time over the past sample period during which global (device) memory was being read or written.
In this example, we see different power management strategies on two similar devices:
Case Study 3: Energy Efficiency of Deep Learning
Here we reproduce some results from Tang et al.1 to illustrate how adjusting frequency can be used to minimise energy spent per computational task (in their case: image prediction). Higher performance comes at a price of excessive energy used, so that energy curves assumes a typical parabolic shape. Note that, in general, the energy-efficient configuration may be optimised over both clock and memory frequencies5.
And here is the code to reproduce:
import pandas as pdimport seaborn as snsimport numpy as np# source: Fig 4d, data for resnet-b32 from "The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study" freq =[544,683,810,936,1063,1202,1328]power =[57,62,65,70,78,88,115]# W = J/srequests =[60,75,85,95,105,115,120]# requests/sdata = pd.DataFrame(data=zip(freq,power,requests),columns=['Frequency','Power','Performance'])data['Energy']= data['Power']/ data['Performance']# [J/s] / [Images/s] = [J/Image]import matplotlib.pyplot as pltfig,(ax1,ax2)= plt.subplots(1,2,figsize=(12,6))sns.lineplot(data=data,x='Frequency',y='Performance',ax=ax1,color='orange',label='Performance',marker='o')ax1.set_xticks(data['Frequency'])ax1.set_ylabel('Image / s')ax1.set_xlabel('Frequency [MHz]')ax1.legend(loc=0)ax12 = ax1.twinx()sns.lineplot(data=data,x='Frequency',y='Power',ax=ax12,color='steelblue',label='Power',marker='D')ax12.set_ylabel('W')ax12.legend(loc=1)sns.lineplot(data,x='Frequency',y='Energy',ax=ax2,label='Energy')ax2.set_xticks(data['Frequency'])ax2.set_ylabel('J / Image')ax2.set_xlabel('Frequency [MHz]')ax2.legend(loc=0)plt.title('Performance, power, and energy for training of resnet-b32 network on P100.\n Reproduced from: "The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study"')plt.tight_layout()plt.show()
References
1.
Tang Z, Wang Y, Wang Q, Chu X. The Impact of GPU DVFS on the Energy and Performance of Deep Learning. Proceedings of the Tenth ACM International Conference on Future Energy Systems. Published online June 15, 2019. doi:10.1145/3307772.3328315
2.
Tang K, He X, Gupta S, Vazhkudai SS, Tiwari D. Exploring the Optimal Platform Configuration for Power-Constrained HPC Workflows. 2018 27th International Conference on Computer Communication and Networks (ICCCN). Published online July 2018. doi:10.1109/icccn.2018.8487322
Fan K, Cosenza B, Juurlink B. Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels. Computation. Published online April 27, 2020:37. doi:10.3390/computation8020037
Staring from the version 2.x PyTorch, a popular deep-learning framework, introduces a JIT compiler torch.compile. In this post, I am sharing a non-trivial example demonstrating how this tool can reduce memory footprint on GPU. The point of departure is a sub-routine which computes similarity, similar to covariance but not as friendly to compute.
For two tensors of shape \( (n_{samples},n_{dim})\) it produces a similarity tensor of shape \( (n_{dim},n_{dim})\). However, the logic uses broadcasting when constructing and reducing an intermediate tensor of shape \( (n_{samples},n_{dim},n_{dim})\). Thus, the naive implementation takes \( O(n_{samples}\cdot n_{dim}^2)\) of memory which is seen from the profiler. After compilation, this bottleneck is removed 💪
This is how the profiling code looks like:
import torchfrom torch.profiler import profile, record_function, ProfilerActivityx = torch.randn((256,2000)).float().cuda()torch.cuda.synchronize()#@torch.compile(mode='max-autotune') # compare the effect with and without !defsimiliarity(x,y): xy = x[:,:,None]-y[:,None,:] xy = xy.abs().lt(1).sum(axis=0) xy = xy.to('cpu')return xywithprofile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],profile_memory=True)as prof:withrecord_function("memory profile"):similiarity(x,x) torch.cuda.synchronize()profiler_summary = prof.key_averages().table(sort_by="self_cuda_memory_usage",row_limit=10)
And this is a useful utility to convert the profiling output to table
# torch profiler output to pandasimport pandas as pdimport ioimport retotal_width = re.search('\n',profiler_summary).start()widths =[t.end()-t.start()+2for t in re.finditer('-{1,}',profiler_summary[:total_width])]df = pd.read_fwf(io.StringIO(profiler_summary),widths=widths)df.columns = df.loc[0]df.drop([0,1],axis=0,inplace=True)df.set_index(df.columns[0],inplace=True)df.head(10)
This is the output without compiler, note huge memory excess in tensor operations while broadcasting:
Location on maps are often provided in a Coordinate Reference System (CRS). In computer vision projects, however, we need to translate them to pixels. Plotting and transformations can be accomplished with the rasterio Python module.
Below I share a useful snippet showing how to convert CRS locations to pixel coordinates:
import matplotlib.pyplot as pltimport geopandasimport rasterio as rsfrom rasterio.plot import showimport pandas as pdfrom shapely.geometry import Pointfrom PIL import Image, ImageDrawfig, axs = plt.subplots(1,2)# open the image and its annotationsimg_path ="trees_counting_mount/polish_ortophoto/1_2000/images/66579_623608_8.135.12.19_cropped_2000.tif"label_path ="trees_counting_mount/polish_ortophoto/1_2000/annotations/66579_623608_8.135.12.19_cropped_2000.csv"raster = rs.open(img_path,crs="EPSG:2180")label_df = pd.read_csv(label_path)# extract selected pointsgeometry =list(map(Point, label_df[["x","y"]].values))idxs =[0,100,200]Ps =[geometry[idx]for idx in idxs]# plot points in two alternative ways!ax = axs[0]show(raster,ax=ax)# variant 1: plot geo-coordinates with geopandas.GeoDataFramegeo_df = geopandas.GeoDataFrame(None,geometry=Ps,crs="EPSG:2180")geo_df.plot(ax=ax)ax = axs[1]# variant 2: convert geo-coordinates to pixel locationsys, xs = rs.transform.rowcol(raster.transform,[P.x for P in Ps],[P.y for P in Ps])img = Image.open(img_path)img_draw = ImageDraw.Draw(img)for x, y inzip(xs, ys): img_draw.ellipse((x-150,y-150,x+150,y+150),fill="yellow")plt.imshow(img)plt.title("Points in geo-coordinates (left) and image pixel coordinates (right).")plt.tight_layout()plt.show()
CUDA is a computing platform for graphical processing units (GPUs) developed by NVIDIA, widely used to accelerate machine-learning. Existing frameworks, such as Tensorflow or PyTorch, utilize it under the hood not asking user for any specific coding. However, it is still necessary to set its dependencies, particularly the compiler nvcc, properly to benefit of acceleration. In this short note, I share an interesting use-case that occurred when prototyping on Kaggle Docker image and NVIDIA Docker image.
Compatibility of CUDA tools and targeted libraries
It turns out that one of Kaggle images was released with incompatible CUDA dependencies: compilation tools were not aligned with PyTorch, as revealed when attempting to compile detectron2, an object detection library by Facebook.
(base) maciej.skorski@shared-notebooks:~$ docker imagesREPOSITORYTAGIMAGEIDCREATEDSIZEgcr.io/kaggle-gpu-images/pythonlatest87983e20c2904weeksago48.1GBnvidia/cuda11.6.2-devel-ubuntu20.04e1687ea9fbf27weeksago5.75GBgcr.io/kaggle-gpu-images/python<none>2b12fe42f3722monthsago50.2GB(base) maciej.skorski@shared-notebooks:~$ docker run -d \-it\--namekaggle-test\--runtime=nvidia\--mounttype=bind,source=/home/maciej.skorski,target=/home\2b12fe42f372(base) maciej.skorski@shared-notebooks:~$ docker exec -it kaggle-test python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'...RuntimeError:ThedetectedCUDAversion (12.1) mismatches the version that was used to compilePyTorch (11.8). Please make sure to use the same CUDA versions.
In order to compile detectron2, it was necessary to align the CUDA toolkit version. Rather than trying to install it manually – which is known to be an error-prone task – a working solution was to change the Kaggle image. It turns out that the gap was bridged in a subsequent release:
(base) maciej.skorski@shared-notebooks:~$ docker run 87983e20c290 nvcc --versionnvcc:NVIDIA (R) Cuda compiler driverCopyright (c) 2005-2022 NVIDIA CorporationBuiltonWed_Sep_21_10:33:58_PDT_2022Cudacompilationtools,release11.8,V11.8.89Buildcuda_11.8.r11.8/compiler.31833905_0(base) maciej.skorski@shared-notebooks:~$ docker run 2b12fe42f372 nvcc --versionnvcc:NVIDIA (R) Cuda compiler driverCopyright (c) 2005-2023 NVIDIA CorporationBuiltonMon_Apr__3_17:16:06_PDT_2023Cudacompilationtools,release12.1,V12.1.105Buildcuda_12.1.r12.1/compiler.32688072_0
And indeed, the Facebook library installed smoothly under the new image 👍
Consider the simple CUDA script querying the GPU device properties:
// query_GPU.cu#include<stdio.h>intmain(){int nDevices;cudaGetDeviceCount(&nDevices);for(int i =0; i < nDevices; i++){ cudaDeviceProp prop;cudaGetDeviceProperties(&prop, i);printf("Device Number: %d\n", i);printf(" Name: %s\n",prop.name);printf(" Integrated: %d\n",prop.integrated);printf(" Compute capability: %d.%d\n",prop.major,prop.minor);printf(" Peak Memory Bandwidth (GB/s): %f\n\n",2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);printf(" Total global mem: %ld\n",prop.totalGlobalMem);printf(" Multiprocessor count: %d\n",prop.multiProcessorCount);}}
This code compiles and presents GPU properties only under the image equipped with the matching major compiler version (select the appropriate image here):
As the last example, consider the recent cuZK project which implements some state-of-the-art cryptographic protocols on GPU. The original code was missing dependencies and compilation instructions, therefore I shared a working fork version.
To work with the code, let’s use the NVIDIA Docker image with the appropriate version, here I selected the tag 11.6.2-devel-ubuntu20.04. Checkout the code and start a container mounting the working directory with the GitHub code, like below: