On Subgaussian Concentration of Missing Mass

The problem of missing mass goes back to the cryptographic work of Good and Turing during WWII, but has been also studied in the context of linguistic, ecology, and by probability theoreticians. The missing mass is defined as the weight of elements not observed in a sample, due to pure chance:

True and empirical frequency based on 100 samples, for Poiss(10).
The value of 12 is missing, and the corresponding bin is empty.

Such an event and the total mass of all unseen elements are exponentially small in the sample size.

Quite curiously, proving it rigorously has been quite hard. The first proof appeared at COLT’00, but has since then been reworked many times, in attempts to simplify and numerically improve. The arguments were based on a thermodynamic framework, logarithmic Sobolev inequalities and information theory. Formally, for some constant $K>0$ we want

$$\Pr[\pm(M-\boldsymbol{E}M)>\epsilon]\leq \mathrm{e}^{-K\cdot n \epsilon^2}$$

where $n$ is the sample size and $M$ is the missing mass for the i.i.d. sample of size $n$.

The problem of proving it „standard” concentration inequalities has been open so far. In reponse to this challenge, in my recent note I have proved it with Bernstein’s inequality, century old. So, no complicated approaches and refinements were necessary!

Approximating Tails of Beta Distribution

Beta distribution is ubiquitous in statistics, but particularly popular in real-world modeling. The beta-binomial model is perhaps the most known example, given the recent interest in Bayesian inference. But it was in use nearly 50 years ago, for example in toxicology.

Unfortunately, computing probabilities from the density depends on intractable incomplete beta integrals. This creates a demand for closed-form approximations, particularly for probability cfs/tails. The goal is to obtain an exponential concentration inequality in of Bernstein-type

$$\Pr[|X-\mathbf{E}[X]|>\epsilon]\leq \mathrm{e}^{-\frac{\epsilon^2}{2v^2+2c\epsilon}}.$$

Such bounds have been studied few times, the last one being the sub-gaussian approximation, that is when \(c=0\). Recently, I have further improved to optimal \(v\) (most important) and some good value of \(c\) (less important, but possibly worth further improvement). This gives a more accurate approximation when the distribution is very skewed (this happens when we model rare events, like conversion). For example with Beta(2,998) we get this:

The trick is to obtain a recursion scheme on central moments, and bound their growth by a geometric progression. The details are in my paper, and the code is shared in this notebook.

Improving State-of-Art on Sparse Random Projections

Random projections are widely used to reduce data dimension in various analyses. Provable guarantees were developed first in the important result of Johnson and Lindenstrauss on Lipschitz maps, but more recently there has been a lot of follow-up work in the context of machine-learning.

Particularly attractive are sparse random projections, which share similar guarantees as the original proposal, but are much faster to compute. In my recent paper I improve upon previous results of Meena Jagadeesan from NIPS’19. The key idea is to consider entropy of input data, a more fine-grained approach than in prior works. In the mathematical analysis I show this gives superior statistical guarantees. Besides that, the novel analysis seems to be cleaner and simpler.

The paper has been already reviewed and graded high at STACS’21, yet the committee members somehow didn’t want this machine-learning inspired topic in their program:

The PC members agreed that this is a non-trivial and interesting contribution. However, we had a very tough competition, and in the end, the PC members felt that the results were maybe too specialised for STACS audience

I am resubmitting it to a venue which will better appreciate the combination of ML and theory. I am also going to revisit this note and give a friendly overview of the work 🙂

Performance drawbacks of Tensorflow Datasets

Tensorflow, the popular framework for machine-learning, recommends its new dataset API for preprocessing and serving data. It supports useful tricks, such as caching data in memory, prefetching in parallel threads and others described in tutorials. Still, Tensorflow has issues with slow data slicing, so the dataset API may actually do harm in setups where computations are relatively fast; this includes linear models, word2vec and other shallow networks. In this note I illustrate this issue on training a classifier on MNIST dataset (handwritten images). As shown in the benchmark below, sometimes it makes sense to serve your data by custom generators not the recommended api.

The model is Logistic Regression. The loss function is implemented via logsumexp trick, for numerical stability and computational efficiency.

## model: Logistic Regression

w = tf.Variable(tf.random.normal(shape=(28*28,10),stddev=0.1),trainable=True)
optimizer = tf.optimizers.SGD(0.01)

def train_step(x, y):
  with tf.GradientTape() as tape:
    all_logits = tf.matmul(x,w) # (n_batch,n_class)
    y_logits = tf.gather(all_logits,y,batch_dims=1) # (n_batch,)
    logp = y_logits - tf.reduce_logsumexp(all_logits,axis=1)
    loss = -logp
  gradients = tape.gradient(loss,[w])

Now we compare two ways of serving the data: by custom generators and by Dataset API (with caching and prefetching for better performance).

## serving data: a) via custom generator b) via TF Dataset API

def gen_batch(X,n_window):

  def gen():
    for i in range(0,len(X),n_window):
      yield X[i:i+n_window]
  return gen

def gen_data():
  n_window = 32
  return zip(gen_batch(x_train,n_window)(),gen_batch(y_train,n_window)())

tf_data = tf.data.Dataset.from_tensor_slices((x_train,y_train))
tf_data = tf_data.batch(32,drop_remainder=True)
tf_data = tf_data.cache()
tf_data = tf_data.prefetch(1)

Below running the comparison, the graph and dataset are warmed by one full-pass (then the graph gets built and the api pipeline is cached). Both approaches fit the same classifier, but the code with custom generator runs 50% faster. For the full code, see the jupyter notebook.