Such an event and the total mass of all unseen elements are exponentially small in the sample size.

Curiously, proving this rigorously has been quite hard. The first proof appeared at COLT’00 and has since been reworked many times, in attempts to simplify it and improve it numerically. These arguments were based on a thermodynamic framework, logarithmic Sobolev inequalities, and information theory. Formally, for some constant $K>0$ we want

$$\Pr[\pm(M-\boldsymbol{E}M)>\epsilon]\leq \mathrm{e}^{-K\cdot n \epsilon^2}$$

where $n$ is the sample size and $M$ is the missing mass for the i.i.d. sample of size $n$.
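To make the quantity $M$ concrete, here is a minimal simulation sketch. The Zipf-like toy distribution, the helper names `missing_mass` and `simulate_missing_mass`, and the chosen sizes are assumptions for illustration only; they do not come from the note itself.

```python
import random

def missing_mass(probs, sample):
    """Total probability of the symbols that never occur in `sample`."""
    seen = set(sample)
    return sum(p for symbol, p in enumerate(probs) if symbol not in seen)

def simulate_missing_mass(probs, n, trials, seed=0):
    """Draw `trials` i.i.d. samples of size n; return the missing mass of each."""
    rng = random.Random(seed)
    symbols = range(len(probs))
    return [
        missing_mass(probs, rng.choices(symbols, weights=probs, k=n))
        for _ in range(trials)
    ]

# Toy distribution (an assumption): Zipf-like weights on 1000 symbols.
weights = [1.0 / (i + 1) for i in range(1000)]
total = sum(weights)
probs = [w / total for w in weights]

values = simulate_missing_mass(probs, n=500, trials=200)
mean = sum(values) / len(values)
spread = max(abs(v - mean) for v in values)
print(f"mean missing mass: {mean:.4f}, largest deviation: {spread:.4f}")
```

Rerunning with larger `n` gives a feel for how tightly $M$ clusters around its mean, which is the behaviour the inequality above quantifies.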

The problem of proving it with “standard” concentration inequalities has remained open so far. In response to this challenge, in my recent note I have **proved it with Bernstein’s inequality**, which is a century old. So no complicated approaches or refinements were necessary!

Unfortunately, computing probabilities from the density depends on intractable incomplete beta integrals. This creates a demand for closed-form approximations, particularly for CDFs and tails. The goal is to obtain an exponential concentration inequality of Bernstein type:

$$\Pr[|X-\mathbf{E}[X]|>\epsilon]\leq \mathrm{e}^{-\frac{\epsilon^2}{2v^2+2c\epsilon}}.$$
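The right-hand side of this bound is easy to evaluate once $v$ and $c$ are fixed. The sketch below only illustrates the shape of the bound: setting $v^2$ to the variance of the Beta distribution and using `c = 1e-3` are placeholder assumptions, not the constants derived in the paper, and `bernstein_tail` is a hypothetical helper name.

```python
import math

def bernstein_tail(eps, v, c):
    """Right-hand side exp(-eps^2 / (2 v^2 + 2 c eps)) of the Bernstein-type bound."""
    return math.exp(-eps ** 2 / (2 * v ** 2 + 2 * c * eps))

# Illustrative parameters only (assumptions, not the constants from the paper):
# v^2 is set to the variance of Beta(a, b), and c is an arbitrary placeholder.
a, b = 2, 998
v = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
c = 1e-3

# Compare the sub-gaussian shape (c = 0) with the Bernstein shape (c > 0).
for eps in (0.001, 0.003, 0.01):
    print(eps, bernstein_tail(eps, v, 0.0), bernstein_tail(eps, v, c))
```

For small $\epsilon$ the two columns nearly agree, since the term $2v^2$ dominates the denominator; for larger $\epsilon$ the term $2c\epsilon$ takes over and the decay becomes exponential in $\epsilon$ rather than in $\epsilon^2$.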

Such bounds have been studied a few times, the most recent result being the sub-gaussian approximation, that is, the case \(c=0\). Recently, I have improved this further, obtaining the optimal \(v\) (most important) and a good value of \(c\) (less important, but possibly worth further improvement). This gives a more accurate approximation when the distribution is very skewed, which happens when we model rare events such as conversions. For example, with Beta(2,998) we get this:

The trick is to obtain a recursion scheme on central moments, and bound their growth by a geometric progression. The details are in my paper, and the code is shared in this notebook.

Particularly attractive are *sparse random projections*, which offer guarantees similar to those of the original proposal but are much faster to compute. In my recent paper I improve upon previous results of Meena Jagadeesan from NIPS’19. The key idea is to consider the **entropy of the input data**, a more fine-grained approach than in prior works. In the mathematical analysis I show that this gives superior statistical guarantees. Besides that, the novel analysis seems to be cleaner and simpler.
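For readers unfamiliar with the object, the following is a minimal sketch of one common sparse construction: every input coordinate is mapped to `s` distinct output rows with random signs, scaled by $1/\sqrt{s}$. The function names, the dimensions, and the choice of this particular construction are assumptions for illustration; the paper's analysis is not reproduced here.

```python
import math
import random

def sparse_projection(d, m, s, seed=0):
    """Column-sparse sign matrix, stored as (row, value) pairs per input coordinate.

    Each of the d coordinates hits s distinct rows (out of m) with random
    signs, and every nonzero entry has magnitude 1/sqrt(s).
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(s)
    return [
        [(row, rng.choice((-scale, scale))) for row in rng.sample(range(m), s)]
        for _ in range(d)
    ]

def project(columns, x, m):
    """Multiply the sparse matrix by x, touching only s entries per nonzero x_i."""
    y = [0.0] * m
    for xi, column in zip(x, columns):
        if xi != 0.0:
            for row, value in column:
                y[row] += value * xi
    return y

def norm(v):
    return math.sqrt(sum(t * t for t in v))

# Hypothetical sizes: project 2000 dimensions down to 200 with 8 nonzeros per column.
d, m, s = 2000, 200, 8
columns = sparse_projection(d, m, s)
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
print("norm ratio after projection:", norm(project(columns, x, m)) / norm(x))
```

The cost of a projection is proportional to `s` times the number of nonzero inputs instead of `m` times `d`, which is where the speed-up over a dense projection comes from.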

The paper has already been **reviewed and rated highly** at STACS’21, yet the committee members somehow didn’t want this machine-learning-inspired topic in their program:

> The PC members agreed that this is a non-trivial and interesting contribution. However, we had a very tough competition, and in the end, the PC members felt that the results were maybe too specialised for STACS audience.

I am resubmitting it to a venue that will better appreciate the combination of ML and theory. I am also going to revisit this note and give a friendly overview of the work.

The model is Logistic Regression. The loss function is implemented via the logsumexp trick, for numerical stability and computational efficiency.

```
import tensorflow as tf

## model: Logistic Regression
# weights for flattened 28x28 inputs and 10 classes
w = tf.Variable(tf.random.normal(shape=(28*28, 10), stddev=0.1), trainable=True)
optimizer = tf.optimizers.SGD(0.01)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        all_logits = tf.matmul(x, w)  # (n_batch, n_class)
        y_logits = tf.gather(all_logits, y, batch_dims=1)  # (n_batch,)
        # log-probability of the true class via the logsumexp trick
        logp = y_logits - tf.reduce_logsumexp(all_logits, axis=1)
        loss = -logp  # per-example loss; tape.gradient sums over the batch
    gradients = tape.gradient(loss, [w])
    optimizer.apply_gradients(zip(gradients, [w]))
```
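The logsumexp trick used in the loss can be illustrated in plain Python, without TensorFlow. This is a minimal sketch; the helper names `logsumexp` and `log_prob` are chosen here for illustration.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))): subtract the maximum before exponentiating."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_prob(logits, label):
    """Log-probability of `label` under softmax(logits)."""
    return logits[label] - logsumexp(logits)

# Large logits: a direct math.exp(1000.0) would raise OverflowError,
# while the shifted version only ever exponentiates non-positive numbers.
logits = [1000.0, 999.0, 998.0]
print(log_prob(logits, 0))
```

Subtracting the maximum leaves the result unchanged mathematically, because the shift cancels between the exponent and the added term `m`.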

Now we compare two ways of serving the data: with a custom generator and with the Dataset API (with caching and prefetching for better performance).

```
## serving data: a) via custom generator b) via TF Dataset API
# x_train and y_train are assumed to be already loaded arrays

# a) custom generator yielding consecutive windows of n_window rows
def gen_batch(X, n_window):
    def gen():
        for i in range(0, len(X), n_window):
            yield X[i:i+n_window]
    return gen

def gen_data():
    n_window = 32
    return zip(gen_batch(x_train, n_window)(), gen_batch(y_train, n_window)())

# b) Dataset API with batching, caching and prefetching
tf_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
tf_data = tf_data.batch(32, drop_remainder=True)
tf_data = tf_data.cache()
tf_data = tf_data.prefetch(1)
```

Before running the comparison, the graph and the dataset are warmed up by one full pass (so that the graph gets built and the API pipeline is cached). Both approaches fit the same classifier, but **the code with the custom generator runs 50% faster.** For the full code, see the Jupyter notebook.
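A comparison of this kind can be timed with a small harness like the one below. It is a sketch under assumptions, not the notebook's actual benchmarking code: `time_epochs` is a hypothetical helper, and the stand-in batches and step at the bottom exist only to keep the example runnable without TensorFlow.

```python
import time

def time_epochs(make_batches, step, epochs=3, warmup=1):
    """Run `warmup` untimed passes, then return wall-clock seconds per timed pass."""
    def one_pass():
        for x, y in make_batches():
            step(x, y)

    for _ in range(warmup):
        one_pass()  # lets graphs get traced and caches get filled

    timings = []
    for _ in range(epochs):
        start = time.perf_counter()
        one_pass()
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in data source and training step, for demonstration only.
def dummy_batches():
    return ((i, i + 1) for i in range(1000))

def dummy_step(x, y):
    return x * y

print(time_epochs(dummy_batches, dummy_step))
```

With the definitions from this post, the two variants would be timed as `time_epochs(gen_data, train_step)` and `time_epochs(lambda: tf_data, train_step)`.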