Unbiased variance estimation

and my first steps with SymPy

Posted by Lense Swaenen on June 16, 2023 · 42 mins read

In this blog post I dip my toes into the world of statistics, and to make it a bit more hip: Bayesian modeling. Disclaimer: I am not a statistician, so whoever has the solution to my problem (spoiler alert: I do not reach a fully satisfactory solution in the end…), please enlighten me! This is work coming out of a real-world problem. However, as I do not reach a fully satisfactory solution, the developments will likely not make it into production. So, I thought, let me at least consolidate things on my blog (and below is only toy model stuff anyways). I did learn a lot of new things along the way, hence the looong post! We even get almost philosophical in our discussion of bias, touching on frequentism vs Bayesianism.

It has also been my first experience with SymPy, which I quite liked. I’m usually more of the handwavy/gutfeel/numerical experiment mathematics kind of guy, but I see myself using SymPy a lot more in the future!

The problem

We consider two random variables/processes $X \sim \mathcal{N}(0, \sigma^2_x) = \mathcal{N}(0, v_x)$ and $Y \sim \mathcal{N}(0, \sigma^2_y) = \mathcal{N}(0, v_y)$ which we observe through four observations:

\[m_1 = X_1 + Y_1\] \[m_2 = X_1 + Y_2\] \[m_3 = X_2 + Y_3\] \[m_4 = X_2 + Y_4\]

The subscripts on $X_1$ and $Y_2$ indicate realizations of the random variables, so observations $m_1$ and $m_2$ share the same realization of $X$. The challenge now is to infer the unknown standard deviations $\sigma_x$ and $\sigma_y$ (or the corresponding variances $v_x = \sigma^2_x$ and $v_y = \sigma^2_y$). Preferably in an unbiased way.

This toy model is (almost) the simplest one I could think of to work with in this post. The real application behind it is a lot more complicated and hush-hush. To give some physical intuition: $X$ and $Y$ may be two random processes that have significantly different time scales. $X$ varies slowly and $Y$ varies quickly. And $m_1$ and $m_2$ are observed on day 1, while $m_3$ and $m_4$ are observed on day 2. Notice that we assume to know that the means of $X$ and $Y$ are zero. This is not very common: Usually people try to infer the mean, or infer the mean and standard deviation, but rarely only the standard deviation. We make this choice as it simplifies some of the math.
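To make the structure of the observations concrete, here is a minimal sketch of how such data could be generated (with hypothetical true values $\sigma_x = 3$ and $\sigma_y = 2$, the same ones used in the numerical experiments further down); note how $X_1$ is shared between $m_1$ and $m_2$:

import numpy as np

sigma_x, sigma_y = 3.0, 2.0                     # true values, unknown to the estimator
X1, X2 = sigma_x * np.random.randn(2)           # one X realization per 'day'
Y1, Y2, Y3, Y4 = sigma_y * np.random.randn(4)   # a fresh Y realization per observation
m1, m2 = X1 + Y1, X1 + Y2                       # day 1: shared X1
m3, m4 = X2 + Y3, X2 + Y4                       # day 2: shared X2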

In this blog post we consider two solution strategies to this problem. The first one is currently being used in the real application. This solution is now preferred because it satisfies the property of unbiasedness. However, the method also allows for negative estimates of the variances, which hopefully strikes you as very undesirable too! So afterwards, I try to solve this, to no avail, but with a lot of lessons learned.

Unbiasedness

I’ve put the term unbiasedness in bold twice above now, so before we go into the solution strategies, let’s expand on this topic. What is unbiasedness? Why should we want it?

To start, I often find bias a very loaded and vague term, for example as used in the bias-variance tradeoff. Luckily, in this post we are actually using the precisely defined concept of bias coming from frequentist statistics. Below is a nothing-less-than-fantastic video by Jake VanderPlas explaining the difference between frequentism and Bayesianism. This was always something I had heard about but never really understood. Jake’s video is one I frequently come back to (pun intended :)).

The essence of it is:

  • For frequentists, probability is about the frequency of repeated (random) events.
  • For Bayesians, probability is about uncertainty of our own beliefs.

And so, the frequentist definition of unbiasedness that we will be working with, is about repeated events: What happens to statistical estimates when you repeat experiments. But before we talk about unbiasedness, it’s also good to talk about consistency. Consistency and unbiasedness are two desirable properties for estimators. Which brings us to the very basic question: What is an estimator?

Estimators

An estimator for a parameter is a procedure that takes as inputs the data, and outputs a value, the estimate. I’m using ‘procedure’ rather than ‘function’ as the number of data points (i.e. repeated experiments) can vary. In a normal distribution with parameters $\mu$ and $\sigma$, like $X \sim \mathcal{N}(\mu, \sigma^2)$, estimators for $\mu$ and $\sigma$ are, for example, the population mean and the population standard deviation. We could use the (not perfect) notation

\[p^e_N(X_1, X_2, \ldots, X_N) \equiv p^e_N(X_{1..N})\]

for an estimator with name, type or version $e$ for parameter $p$. The most famous estimator for the parameter $\mu$ of the normal distribution, would be the population mean (or sample mean):

\[\mu^{PM}_N(X_{1..N}) = \frac{1}{N} \sum_{i=1}^N X_i\]

where we use the superscript $PM$ to represent “population mean”.

Consistency

Before we talk about unbiasedness, we first consider an even more desirable property for an estimator: consistency. An estimator is consistent if its value converges (“in probability”) to the parameter we are trying to estimate, when the number of repeated experiments/data points $N$ that are put into the estimator grows to infinity. Or in equation form:

\[\lim_{N\to\infty} p^{e}_N(X_{1..N}) = p\]

The population mean is a consistent estimator of the parameter $\mu$. But the median over the data $X_i$ is also a consistent estimator of $\mu$. Enough with the theory, time for some code and some numerical experiments. The population mean found above is implemented by the NumPy routine np.mean:

%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
Ns = np.array([10, 100, 1000, 10000, 100000])
k = 10
mus = np.zeros((k, Ns.size))

for i, N in enumerate(Ns):
    data = np.random.randn(k, N)
    mus[:, i] = np.mean(data, axis=1)

plt.figure()
plt.semilogx(Ns, mus.T, '.', color='C0')
plt.plot(Ns, 0*Ns, '--', color='gray')
plt.xlabel('N')
plt.ylabel('Population mean')

np.mean(np.random.randn(10000000))
0.00045228916981572776

The snippet above generates $N$ standard normally distributed random variables and takes the population mean. This is done for a range of different $N$ and done $k=10$ times per $N$. We see the spread on the points decreasing around zero. The speed of decrease is the famous $\frac{1}{\sqrt{N}}$ behavior. Also numerically, you see we get really close to the true $\mu = 0$.

The mean is not so interesting. In the toy model we even disregarded it altogether. This post is all about estimating variances and standard deviations, or in terms of the normal distribution, parameter $v \ (= \sigma^2)$. An estimator for the variance could be

\[v^\star_N(X_i) = \frac{1}{N} \sum_i^N \left(X_i - \mu^{PM}_N(X_i)\right)^2\]

with

\[\mu^{PM}_N(X_i) = \frac{1}{N} \sum_i^N X_i\]

Something along the lines of the mean quadratic deviation from the (ordinary) mean… Note that the formula above does not assume that the value of $\mu$ is known to be zero; instead, it is also estimated. And note that I’m not calling it the population or sample variance, nor giving it a “population variance” superscript but a generic $\star$… And the keen observer may be wondering why I’m not estimating $\sigma$ but $v$. All of this, we’ll return to later. The formula above is implemented in NumPy by the function np.var.

Ns = np.array([10, 100, 1000, 10000, 100000])
k = 50
mus = np.zeros((k, Ns.size))

for i, N in enumerate(Ns):
    data = np.random.randn(k, N)
    mus[:, i] = np.var(data, axis=1)

plt.figure()
plt.semilogx(Ns, mus.T, '.', color='C0')
plt.plot(Ns, np.ones_like(Ns), '--', color='gray')
plt.xlabel('N')
plt.ylabel('np.var')

np.var(np.random.randn(10000000))
0.9997322618817084

And so numerically, we see that the np.var function gives a consistent estimator for $v$. As $N\to \infty$, $v^\star_N \to v \ (= 1)$ here.

The careful observer perhaps does notice some asymmetry around $v=1$ for the $N=10$ data points… This makes a nice bridge from consistency to bias.

Bias

The bias concept arises when we obtain estimates from finite population sizes, and then repeat that as a whole, which is something different from consistency. An estimator is unbiased if the mean of the finite-size estimates over repeated realizations converges to the true parameter value. In equations: estimator $p^e$ for parameter $p$ is unbiased if:

\[\lim_{k \to \infty} \frac{1}{k} \sum_{j=1}^k p^e_N(X^{(j)}_{1..N}) = p\]

This is like a meta-consistency for the mean over finite-size estimates. A consistent estimator can be biased, and this bias will usually be most evident for very small sample sizes (as for large sample sizes, the consistency implies closeness to the true value of $p$).

The population mean is not only consistent, but also unbiased. We take a very small sample size of $N = 3$ and do a numerical experiment:

N = 3
ks = np.array([10, 100, 1000, 10000, 100000, 1000000])

mus = np.zeros_like(ks, dtype=np.float64)

for i, k in enumerate(ks):
    data = np.random.randn(k, N)
    mus[i] = np.mean(np.mean(data, axis=1), axis=0)

plt.figure()
plt.semilogx(ks, mus, 'o', color='C0')
plt.plot(ks, np.zeros_like(ks), '--', color='gray')
plt.xlabel('k')
plt.ylabel('np.mean')

The plot looks similar to the previous one, but there we looped over the sample size. Now, this is fixed to $N=3$ and we loop over the number of repeats $k$ that we then average over. We observe convergence, and convergence to the true value of $\mu = 0$. And one can prove this unbiasedness and understand it intuitively: our formula for $\mu^{PM}$ has an outer $\sum$ operator, and that one plays nicely with the $\sum$ operator that is embedded in the definition of ‘unbiasedness’, which talks about a mean converging…

Finally, we get to something a little more interesting and non-trivial: Unbiased estimation of variance. While the estimator \(v^\star_N(X_i) = \frac{1}{N} \sum_i^N \left(X_i - \mu^{PM}_N(X_i)\right)^2\) is consistent, it is not unbiased. The figure below illustrates that:

N = 3
ks = np.array([10, 100, 1000, 10000, 100000, 1000000])

mus = np.zeros_like(ks, dtype=np.float64)

for i, k in enumerate(ks):
    data = np.random.randn(k, N)
    mus[i] = np.mean(np.var(data, axis=1), axis=0)

plt.figure()
plt.semilogx(ks, mus, 'o', color='C0')
plt.plot(ks, np.ones_like(ks), '--', color='gray')
plt.xlabel('k')
plt.ylabel('np.var')

The output of the np.var function, which implements the $v^\star$ formula above converges to some value close to 0.7, but that is not the true value of $v$ (which is the one we used for generating the data, and which is $1$). We have a bias!

This is of course the famous case of every introductory statistics class, where we learn to take a population variance not by dividing by $N$ but by $N-1$:

\[v^{PV}_N(X_i) = \frac{1}{N-1} \sum_i^N \left(X_i - \mu^{PM}_N(X_i)\right)^2\]

This is also known as Bessel’s correction. Because we are calculating a mean squared deviation to a value (the population mean) that itself is already coming from the data, we tend to underestimate the spread, because the population mean is more centered for the data than the real mean actually is. This is even kind of consistent when having only one data point. The population mean would equal the data point, and the numerator of the variance formula would be $0$. If we divide by $N=1$ we have a grossly underestimated variance estimate of $0$ which should be $1$. If we divide by $N-1 = 0$, we get $\frac{0}{0}$ which I’d argue is ‘more correct’.

How could the NumPy implementation be so wrong? Well actually, the np.var function has a flag called ddof, which, when set to 1, gives us the population variance formula from above. The explanation is something about degrees of freedom already taken up by the population mean etc… We will not dive into that.

N = 3
ks = np.array([10, 100, 1000, 10000, 100000, 1000000])

mus = np.zeros_like(ks, dtype=np.float64)

for i, k in enumerate(ks):
    data = np.random.randn(k, N)
    mus[i] = np.mean(np.var(data, axis=1, ddof=1), axis=0)

plt.figure()
plt.semilogx(ks, mus, 'o', color='C0')
plt.plot(ks, np.ones_like(ks), '--', color='gray')
plt.xlabel('k')
plt.ylabel('np.var(..., ddof=1)')

Repeating the experiment, we now see convergence to $1$. We have found an unbiased estimator for the $v$ parameter of our Gaussian distribution! And through the correction factor $\frac{N}{N-1} = \frac{3}{2}$ for $N=3$, we can now also understand that the biased variant was actually converging to $\frac{2}{3} = 0.66\ldots$ rather than $0.7$.

And so you may think this is the end of our journey, and everyone lived happily ever after…

You’d be wrong! We are just getting started. And we do this by picking up that detail we glanced over before: estimating the variance-like parameter $v$ versus estimating a standard deviation-like parameter $\sigma$. Their relationship is simple enough: $v = \sigma^2$, so surely an unbiased estimator for $v$ can be converted into an unbiased estimator for $\sigma$. Unfortunately, this is not the case.

\[\sigma^{PS}_N(X_i) = \sqrt{\frac{1}{N-1} \sum_i^N \left(X_i - \mu^{PM}_N(X_i)\right)^2}\]

is a consistent, but not an unbiased estimator for the standard deviation. We can once again illustrate experimentally:

N = 3
ks = np.array([10, 100, 1000, 10000, 100000, 1000000])

mus = np.zeros_like(ks, dtype=np.float64)

for i, k in enumerate(ks):
    data = np.random.randn(k, N)
    mus[i] = np.mean(np.std(data, axis=1, ddof=1), axis=0)

plt.figure()
plt.semilogx(ks, mus, 'o', color='C0')
plt.plot(ks, np.ones_like(ks), '--', color='gray')
plt.xlabel('k')
plt.ylabel('np.std(..., ddof=1)')

The problem is in the square root operator which creeps in between the sum over the population inside the estimator, and the sum over repetitions in the bias definition, like $\sum \sqrt{ \sum (\ldots)}$. Unbiased estimation of the (population) standard deviation requires a correction factor which is no longer as simple as $\frac{N}{N-1}$, and is even distribution dependent. There is an entire Wikipedia page dedicated to this topic, which also has a table of correction factor values for the normal distribution. For $N = 3$, the above plot was actually converging to the odd value of $\frac{\sqrt{\pi}}{2} = 0.886…$. Come to think of it, this would be a cool way to estimate $\pi$ (a bit similar to how Buffon’s needle problem can be used). We only need something that generates a normal distribution? This is already the second post on my blog that allows for $\pi$ estimation. Time to bother Matt Parker again!
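Just for fun, here is a minimal sketch of that $\pi$ estimation idea (my own little aside, not needed for the rest of the post): average many $N=3$ standard deviation estimates, which should converge to $\frac{\sqrt{\pi}}{2}$, and solve for $\pi$:

k = 1_000_000
s = np.std(np.random.randn(k, 3), axis=1, ddof=1)  # many N=3 standard deviation estimates
pi_est = (2 * np.mean(s))**2                       # since E[s] ~ sqrt(pi)/2 for sigma = 1
pi_est                                             # should land near 3.14, up to Monte Carlo noise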

The Wikipedia page on the unbiased estimation of the standard deviation also states that

… [this] provides an example where imposing the requirement for unbiased estimation might be seen as just adding inconvenience, with no real benefit.

Which brings us to the more philosophical part: Why would you want unbiasedness in a practical application? My imagination currently has settled on: If I make some estimates on part of the data (let’s call them partial estimates), then I want to be able to combine the partial estimates later on into a more accurate aggregated estimate. If I have classical unbiasedness, I can even use the simple (arithmetic) mean calculation to do that aggregation. For the standard deviation, the mean doesn’t work, but could something else? Well, in fact, we don’t have to look far: If we square things, we get into the domain of unbiased variance estimation. So for the standard deviation, a root-mean-square aggregation operator is able to combine partial estimates into a more accurate aggregated estimate. And if we keep aggregating, we will converge to the true value again.
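A small numerical sketch of that aggregation argument (my own illustration, reusing the $N=3$ setting from above): arithmetic-mean aggregation of standard deviation estimates keeps the $\frac{\sqrt{\pi}}{2}$ bias, while root-mean-square aggregation recovers the true $\sigma = 1$:

partial_stds = np.std(np.random.randn(100_000, 3), axis=1, ddof=1)
np.mean(partial_stds)               # ~0.886: arithmetic-mean aggregation stays biased
np.sqrt(np.mean(partial_stds**2))   # ~1.0: RMS aggregation of the partial estimates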

And so, coming back to our toy model with two random variables, and the baseline approach (that we will detail out shortly) that gives unbiased variance estimates, which can be negative unfortunately… If we relax our unbiasedness requirement into a more ‘generalized unbiasedness’ requirement corresponding to some alternative aggregation operator, could we then solve the negativity problem? Aside from arithmetic mean, there’s plenty of other types of means, like the geometric mean, the harmonic mean, the root-mean-square… What is so special about the arithmetic mean anyway?

One could argue that doing this aggregation is perhaps something done by managers in Excel sheets, so we really shouldn’t have an overly complex aggregation operator (or we should at least deliver it in a user-friendly manner). But then I would argue: Would those managers even be aware that, for the sake of unbiasedness, they shouldn’t ever be taking averages over standard deviations, but always over variances?

One way to define a ‘trivial’ generalized-unbiased aggregation operator is to concatenate all the raw data from the different repetitions, and just apply the original estimation procedure to it. If the original estimator is consistent, then we leverage that consistency to get our new concept of generalized unbiasedness.
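A sketch of this ‘trivial’ aggregation (again my own illustration): keep the raw data of many small experiments, concatenate it, and re-apply the estimator, relying on its consistency:

chunks = [np.random.randn(3) for _ in range(10_000)]   # raw data of many small experiments
np.std(np.concatenate(chunks), ddof=1)                 # re-estimate on everything: ~1.0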

There is one desirable property that classical unbiasedness brings with it, that this last procedure does not necessarily have: An aspect of data compression or memorylessness. With an unbiased estimator like the population mean, we can throw away the original data once we have taken the partial estimates. This is not so for our ‘trivial’ method described above. We need to remember all the data and concatenate it. Any partial estimates are effectively not used…

Baseline strategy: A frequentist approach

After a long detour into bias, we finally return back to our toy model: $X \sim \mathcal{N}(0, v_x)$ and $Y \sim \mathcal{N}(0, v_y)$ and observations:

\[m_1 = X_1 + Y_1\] \[m_2 = X_1 + Y_2\] \[m_3 = X_2 + Y_3\] \[m_4 = X_2 + Y_4\]

Note how we assume known means of $0$ for both $X$ and $Y$ so we actually do not have to bother with Bessel’s correction.

We do some arithmetic juggling on these equations: taking squares, linear combinations and applying the expected value operator:

\[E(m_.^2) = E(X_.^2 + Y_.^2 + 2X_.Y_.) = E(X_.^2) + E(Y_.^2) = v_x + v_y\]

where the crossterm between $X$ and $Y$ drops out because they are independent. The dot subscript is to indicate that this is valid for any subscript. The variant we will use further is:

\[Q_1 = \frac{1}{4}(m_1^2 + m_2^2 + m_3^2 + m_4^2)\]

such that

\[E(Q_1) = v_x + v_y\]

If we can get a second equation like this with our observations on the left hand side, and our variances linearly on the right hand side (and then linearly independent from the equation above), we can solve a $2\times 2$ linear system.

For that second equation, we take the difference of $m_1$ and $m_2$ and rely on them having the same realization of $X$:

\[E\left((m_1 - m_2)^2\right) = E\left((Y_1 - Y_2)^2\right) = 2 E(Y^2) = 2 v_y\]

Again there is a crossterm (between $Y_1$ and $Y_2$) which is zero as the subsequent realizations are independent. So if we define a quantity $Q_2$ calculated from the observations like:

\[Q_2 = \frac{1}{4}\left((m_1 - m_2)^2 + (m_3 - m_4)^2\right)\]

then we have

\[E(Q_2) = v_y\]

and we have our second linear equation.

If we reshuffle into a more standard form, and plug in the quantities $Q$ rather than their true expected values (which we don’t have…), we get:

\[\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} v_x \\ v_y \end{pmatrix} = \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix}\]

If we invert this linear system, we get

\[\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix}\]

And the big problem now is the $-1$ entry in this matrix, which means that our $v_x$ estimate becomes negative if $Q_2 > Q_1$. $m_1 = 1, m_2 = -1, m_3 = m_4 = 0$ is a combination that makes this happen.
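A quick check of that combination in code (a small sketch of my own):

m = np.array([1.0, -1.0, 0.0, 0.0])                # m1, m2, m3, m4
Q1 = np.mean(m**2)                                 # 0.5
Q2 = ((m[0] - m[1])**2 + (m[2] - m[3])**2) / 4     # 1.0
Q1 - Q2                                            # v_x estimate: -0.5, negative indeed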

Still, this method gives an unbiased estimator, which can be understood intuitively through our use of mostly linear combinations and expressions, but which we’ll also show numerically now. In our example we use $v_x = 9$ and $v_y = 4$.

sigx = 3
sigy = 2
varx = sigx**2
vary = sigy**2
def generate_data(N=1):
    X = sigx * np.random.randn(2, 1, N)
    Y = sigy * np.random.randn(2, 2, N)
    
    M = X + Y
    return M
N = 100000
data = generate_data(N=N)
plt.figure()
_ = plt.hist(np.reshape(data, (4, -1)).T, bins=20)

All 4 measurements have the same distribution. What is not visible in this picture is that $m_1$ and $m_2$ are correlated, as are $m_3$ and $m_4$, through their joint realizations of $X$.

def estimate(data):
    Q1 = 1/4 * np.sum(np.sum(data**2, axis=1), axis=0)
    Q2 = 1/4 * sum((data[i, 0, :] - data[i, 1, :])**2 for i in range(2))
    Q = np.vstack((Q1, Q2))
    
    M = np.array([[1, 1], [0, 1]])
    Minv = np.linalg.inv(M)
    
    return Minv @ Q
ests = estimate(data)
plt.figure()
_ = plt.hist(ests[0, :], bins=100, alpha=0.5, density=True)
_ = plt.hist(ests[1, :], bins=100, alpha=0.5, density=True)
plt.xlim([-20, 100])
plt.legend(['$v_x$', '$v_y$'])

Plotted above are the distributions of the estimates for $v_x$ and $v_y$ for 100,000 repetitions. We observe that the estimates on $v_x$ indeed get negative, and in fact this happens about 15% of the time.

np.sum(ests[0, :] < 0)/N
0.15277
np.mean(ests, axis=1)
array([9.02955883, 3.99183563])

Taking the mean, we also observe that the estimates are indeed unbiased. However, it is also worth noting that the spread (variance) of our estimates is very big, due to the low number of observations used for a single estimate. A little bit of bias might go unnoticed for a long time…

Before we move on to the Bayesian approach to try and fix the negativity, we simplify the toy model even further (as the equations in the Bayesian approach get a little long and hard for SymPy to handle…)

If we only consider observations $m_1$ and $m_2$ in our toy model: $X \sim \mathcal{N}(0, v_x)$ and $Y \sim \mathcal{N}(0, v_y)$ :

\[m_1 = X_1 + Y_1\] \[m_2 = X_1 + Y_2\]

our analysis and approach are still valid. The reason we started with 4 observations is that it allows us to give a more intuitive application with the measurements on 2 different days. Moreover, one may think that estimating 2 variances from 2 observations (or 1 variance from 1 observation) is not possible. In the meantime, however, we have learned that this is only true if we don’t assume the mean to be known and have to estimate it as well.

With only observations $m_1$ and $m_2$, the variables $Q_1$ and $Q_2$ simply become: \(Q_1 = \frac{1}{2}(m_1^2 + m_2^2)\) and \(Q_2 = \frac{1}{2}(m_1 - m_2)^2\) We update the numerical experiment:

def generate_data(N=1):
    X = sigx * np.random.randn(1, N)
    Y = sigy * np.random.randn(2, N)
    
    M = X + Y
    return M
N = 100000
data = generate_data(N=N)
def estimate(data):
    Q1 = 1/2 * np.sum(data**2, axis=0)
    Q2 = 1/2 * (data[0, :] - data[1, :])**2
    Q = np.vstack((Q1, Q2))
    
    M = np.array([[1, 1], [0, 1]])
    Minv = np.linalg.inv(M)
    
    return Minv @ Q
M = np.array([[1, 1], [0, 1]])
Minv = np.linalg.inv(M)
Minv
array([[ 1., -1.],
       [ 0.,  1.]])
ests = estimate(data)
plt.figure()
_ = plt.hist(ests[0, :], bins=100, alpha=0.5, density=True)
_ = plt.hist(ests[1, :], bins=100, alpha=0.5, density=True)
plt.xlim([-20, 100])
plt.legend(['$v_x$', '$v_y$'])

np.sum(ests[0, :] < 0)/N
0.25301
np.mean(ests, axis=1)
array([9.09047521, 3.98240542])

We still get unbiased estimates, but the spread on the estimates has gotten even wider (because of less data in our partial estimates), causing 25% of the $v_x$ values to be negative! Let’s try to fix this!

Alternative strategies from Bayesian statistics

Bayesian statistics is all about calculating posterior distributions. We start from a model of how the data gets generated from realizations of our random variables (a “generative” model), and then turn it inside out: Given observed data, what are now the probability distributions over the parameters in our model. Central to Bayesian statistics is Bayes’ rule:

\[P(A\mid B) = \frac{P(B \mid A) P(A)}{P(B)}\]

which holds for abstract random variables $A$ and $B$, but in our estimation application can better be read as:

\[P(\text{model} \mid \text{data}) = \frac{P(\text{data} \mid \text{model}) P(\text{model})}{P(\text{data})}\]

in which the four factors are also referred to as

\[\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\]

The evidence factor will mostly act as a normalization factor (as probabilities need to sum/integrate to 1) and we’ll largely omit it.

Even though our toy model is not super complex, we illustrate this principle on an even simpler case: Inferring the variance parameter $v_x$ from a directly observed random variable $X \sim \mathcal{N}(0, v_x)$.

Bayesian inference of a variance parameter of a univariate Gaussian

What does Bayesian inference allow us to infer about variance from a single observation $m$? Let’s just work out the equations:

\[P(v_x | m ) \propto P(m | v_x) \times P(v_x)\]

with

\[P(m | v_x) = \frac{1}{\sqrt{2\pi v_x}}e^{- \frac{m^2}{2v_x}}\]

Let’s for now take $P(v_x) = 1$ but we’ll comment on it later.

So

\[P(v_x | m ) \propto \frac{1}{\sqrt{2\pi v_x}}e^{- \frac{m^2}{2v_x}}\]

looks a lot like our old familiar Gaussian distribution, but the odd thing now is that $m$ is given/fixed, and $v_x$ is the variable over which the posterior distribution lives. Let’s plot this function for $m = 2$. From here on we’ll also start using SymPy which allows us to do some analytical calculations with the posterior distributions.

import sympy as sp
def normalv(x, mean, var):
    z = (x - mean)
    return (1 / (sp.sqrt(2 * sp.pi * var))) * sp.exp(-(z * z) / (2 * var))
v = sp.Symbol('v', positive=True)
m = 2
posterior = normalv(m, 0, v)
sp.plot(posterior, (v, 0, 20))

We only plotted for $v_x \geq 0$ as negative $v_x$ would conflict with the square root in the denominator anyway. The posterior ‘distribution’ is unimodal with a peak at $v_x = m^2$ (which is $v_x = 4$ for $m=2$). It goes to zero very quickly for small $v_x$ due to the exponential in the formula. It does have quite a heavy tail for large $v_x$, due to the $\frac{1}{\sqrt{v_x}}$ factor.

This can be understood intuitively as well. With an observation of $m=2$, it is very unlikely that the variance was actually much smaller than 4, say 0.01, because the Gaussian has tails that drop off so rapidly. However, a distribution with a very large variance, like 100, still has a reasonable probability of yielding an $m=2$ observation.

The tail here in fact is even so heavy that we don’t have a proper probability distribution (which is why I put ‘distribution’ in quotes a few sentences back).

sp.integrate(posterior, (v, 0, sp.oo))

$\displaystyle \infty$

This heavy tail becomes integrable when we consider the posterior from repeated observations $m_1, m_2, \ldots, m_k$. For independent observations, we just have to multiply the likelihoods:

\[P(v_x | m_1, m_2, \ldots, m_k ) \propto P(m_1, m_2, \ldots, m_k | v_x) = \prod_{i=1}^k P(m_i | v_x) \propto \frac{1}{v_x^{k/2}} \exp\left( - \frac{1}{2v_x}\sum_{i=1}^k m_i^2\right)\]

The tail only becomes integrable for $k \geq 3$. An illustration with $m_i = [2, 3, 4]$:

vs = np.linspace(0.1, 100, 1001)
def normalv_np(x, mean, var):
    z = (x - mean)
    return (1 / (np.sqrt(2 * np.pi * var))) * np.exp(-(z * z) / (2 * var))
plt.figure()
y1 = normalv_np(2, 0, vs)
y1 = y1/np.max(y1)
plt.plot(vs, y1, '--', label='m_1')

y2 = normalv_np(3, 0, vs)
y2 = y2/np.max(y2)
plt.plot(vs, y2, '--', label='m_2')

y3 = normalv_np(4, 0, vs)
y3 = y3/np.max(y3)
plt.plot(vs, y3, '--', label='m_3')

y = y1*y2*y3
plt.plot(vs, y, label='m_1, m_2 and m_3')
plt.legend()

We used some normalization to make the visualization nicer. We observe that the posterior gets narrower with more data, and in conjunction, the tail gets less heavy. The latter we observe more explicitly in a loglog plot:

plt.figure()
plt.loglog(vs, y1, '--', label='m_1')
plt.loglog(vs, y2, '--', label='m_2')
plt.loglog(vs, y3, '--', label='m_3')

y = y1*y2*y3
plt.loglog(vs, y, label='m_1, m_2 and m_3')
plt.xlim(0.3, 1e2)
plt.ylim(1e-2, 3)
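Coming back to the integrability claim for $k \geq 3$: a quick symbolic sanity check (a sketch, reusing normalv and v from above) shows that with three observations the posterior indeed integrates to a finite value, unlike the single-observation case:

posterior3 = normalv(2, 0, v)*normalv(3, 0, v)*normalv(4, 0, v)
sp.integrate(posterior3, (v, 0, sp.oo))   # now a finite expression, no longer oo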

For now we have been talking about posterior distributions, while the baseline strategy was working with concrete numbers… How do we compare those two? The answer is that we will derive point estimates from the posterior distribution. Two popular ones are the posterior mean and the posterior mode, better known as the maximum likelihood (ML) / maximum-a-posteriori (MAP) point estimate. ML and MAP are equivalent for flat priors. We could come up with others too, like the posterior median. And in fact, this way of constructing estimators leads to naturally consistent estimators, as long as we are working with exact posteriors (which in practice is difficult). With more and more observations, the posterior distribution will get more and more localized and narrow around the true value. This is even independent of the prior that is used, which is kind of the whole point of Bayesian modeling. When data is scarce, a prior can help steer the model towards reasonable results, but that prior automatically has less influence the more data is available.

However, this blog post was all about unbiasedness, which manifests itself exactly when data is sparse and when the effect of a Bayesian prior is significant. In general, using non-flat/informative priors cannot be expected to yield unbiased estimates, as the two are kind of fundamentally contradictory… So in our pursuit of unbiasedness through Bayesian posterior-derived estimates, we will use flat/non-informative priors.

The only prior information we may like to impose, is that our variance parameter $v_x$ is positive. That sounds reasonable, right? Rather than a flat prior, we would have a Heaviside type prior. Which I guess we have already kind of been using when we assumed $P(v_x) = 1$ but then ignored the negative $v_x$ in our plots.

Posterior mean

So, how do we feel about the posterior mean as a point estimate for our variance parameter $v_x$?

The posterior mean can be calculated as

\[\int v_x P(v_x \mid m_i)\, dv_x\]

Well, in fact we can be pretty short on it: We already needed $k \geq 3$ observations for the posterior to be integrable. However, the tail for $k = 3$ is still too heavy for the mean to be finite (as the multiplication with $v_x$ makes the tail under the integral heavier once again). So in fact we’ll be needing at least $k=5$ observations to get a finite posterior mean (or expected value) for the variance parameter. Note that our model had a known mean, so with the classical formulas we could get a finite estimate with $k=1$ already…

Let’s illustrate with code and observations $m = [2, 3, 4, 5, 6]$. We rewrite the formula above a little for non-normalized posteriors:

\[\int v_x P(v_x \mid m_i)\, dv_x = \frac{\int v_x P(m_i \mid v_x)\, dv_x}{\int P(m_i \mid v_x)\, dv_x}\]
posterior4 = normalv(2, 0, v)*normalv(3, 0, v)*normalv(4, 0, v)*normalv(5, 0, v)
sp.integrate(posterior4, (v, 0, sp.oo))

$\displaystyle \frac{1}{108 \pi^{2}}$

With 4 observations, the denominator of this fraction is finite (which means the posterior can be normalized).

sp.integrate(v*posterior4, (v, 0, sp.oo))

$\displaystyle \infty$

However, the numerator is not finite… We need the 5th observation for that:

posterior5 = normalv(2, 0, v)*normalv(3, 0, v)*normalv(4, 0, v)*normalv(5, 0, v)*normalv(6, 0, v)
denom = sp.integrate(posterior5, (v, 0, sp.oo))

sp.integrate(v*posterior5, (v, 0, sp.oo))

numer = sp.integrate(v*posterior5, (v, 0, sp.oo))
numer / denom

$\displaystyle 90$

We get a finite value, but the population variance (with known mean of 0) for that data set is…

np.mean(np.array([2, 3, 4, 5, 6])**2)
18.0

Even though integrable now, the tail is still so heavy that it introduces a significant overestimate.

Conclusion: Posterior mean estimates for the variance have a significant bias towards overestimating the variance, and moreover too much data is needed to get any finite answer.

Things get a little better if we repeat the same approach to estimate the standard deviation rather than the variance, as the posterior has a tail like $\frac{1}{\sigma^k}$ rather than $\frac{1}{v^{k/2}}$.
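A rough sketch of that (my own continuation, assuming SymPy manages these integrals): redo the posterior-mean computation with $\sigma$ as the parameter, for the three observations $[2, 3, 4]$; the $\frac{1}{\sigma^k}$ tail already gives a finite posterior mean for $k = 3$:

sig = sp.Symbol('sigma', positive=True)
posterior_sig = normalv(2, 0, sig**2)*normalv(3, 0, sig**2)*normalv(4, 0, sig**2)
denom = sp.integrate(posterior_sig, (sig, 0, sp.oo))
numer = sp.integrate(sig*posterior_sig, (sig, 0, sp.oo))
(numer/denom).simplify()   # finite, though still an overestimate of the classical value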

Maximum likelihood estimation

With a maximum likelihood point estimate, we look for the point in the posterior which is the ‘highest’. For a single observation we have already observed a unimodal posterior, and I have good faith that the posterior will still be unimodal for multiple observations. Then we don’t have to deal with multiple local maxima, and we can just look for the single point where the derivative of the function is zero (and of course ignore the limit $v_x \to 0$, where the derivative also vanishes but which is a minimum). For the posterior mean we were integrating; now we will be differentiating and optimizing. It is commonly said that differentiation/optimization is easier than integration (in high dimensions), which is why maximum likelihood is so popular.

Note that the point of maximum likelihood even looked well defined for the single observation “posterior” which was not normalizable. Let’s give that one a go with some symbolic calculations. Instead of plugging in $m=2$ we can also leave $m$ a symbolic variable:

m = sp.Symbol('m')
posterior = normalv(m, 0, v)
sp.diff(posterior, v)

$\displaystyle \frac{\sqrt{2} m^{2} e^{- \frac{m^{2}}{2 v}}}{4 \sqrt{\pi} v^{\frac{5}{2}}} - \frac{\sqrt{2} e^{- \frac{m^{2}}{2 v}}}{4 \sqrt{\pi} v^{\frac{3}{2}}}$

sp.solve(sp.diff(posterior, v), v)
[m**2]

With a single observation, the maximum likelihood estimate for $v_x$ is $m^2$. Let’s try 4: $m_1, m_2, m_3, m_4$:

m1, m2, m3, m4 = sp.symbols(['m_1', 'm_2', 'm_3', 'm_4'])
posterior = normalv(m1, 0, v)*normalv(m2, 0, v)*normalv(m3, 0, v)*normalv(m4, 0, v)
posterior.simplify()

$\displaystyle \frac{e^{- \frac{m_{1}^{2} + m_{2}^{2} + m_{3}^{2} + m_{4}^{2}}{2 v}}}{4 \pi^{2} v^{2}}$

sp.diff(posterior, v).simplify()

$\displaystyle \frac{\left(m_{1}^{2} + m_{2}^{2} + m_{3}^{2} + m_{4}^{2} - 4 v\right) e^{- \frac{m_{1}^{2} + m_{2}^{2} + m_{3}^{2} + m_{4}^{2}}{2 v}}}{8 \pi^{2} v^{4}}$

sp.solve(sp.diff(posterior, v), v)
[m_1**2/4 + m_2**2/4 + m_3**2/4 + m_4**2/4]

Could it be?! The maximum likelihood estimate is identical to the classical formula: the mean of the squares (again, for this model with a known mean of zero). Well, the data only appears in summed form in the exponent of the posterior, so we only need to show that a tail $\frac{1}{v^{k/2}}$ governed by a power $k$ yields this factor $\frac{1}{k}$ for the ML estimate:

k, s = sp.symbols(['k', '\Sigma'])
p = 1/v**(k/2)*sp.exp(-s/(2*v))
p

$\displaystyle v^{- \frac{k}{2}} e^{- \frac{\Sigma}{2 v}}$

sp.diff(p, v).simplify()

$\displaystyle \frac{v^{- \frac{k}{2} - 2} \left(\Sigma - k v\right) e^{- \frac{\Sigma}{2 v}}}{2}$

sp.solve(sp.diff(p, v).simplify(), v)[0]

$\displaystyle \frac{\Sigma}{k}$

Tadaah!

So the ML estimate seems like a decent candidate to work out for the toy model with $X \sim \mathcal{N}(0, v_x)$ and $Y \sim \mathcal{N}(0, v_y)$.

Toy model with single m = X + Y observation

However, our gradual buildup is still not quite there yet… (thanks for sticking around by the way! We are close, really…) In the previous example, application of Bayes’ rule was easy as we only had one random variable. And when we had multiple observations, they were independent. Neither of those things will be true anymore. Therefore we take the intermediate step of how to do such a calculation for a model where we have two random variables $X \sim \mathcal{N}(0, v_x)$ and $Y \sim \mathcal{N}(0, v_y)$, but only one observation $m = X + Y$. Common sense would say that this problem is underdetermined, but with Bayesian modelling you can still get a long way.

We start off with Bayes’ rule, to get a bivariate posterior in $v_x$ and $v_y$.

\[P(v_x, v_y | m ) \propto P(m | v_x, v_y) P(v_x, v_y)\]

We’ll assume the prior is flat or Heaviside. But can we rewrite $P(m \mid v_x, v_y)$ into expressions that depend on only $v_x$ and/or only $v_y$? To do so, we introduce helper variables $X$ and $Y$ over which we integrate, like

\[P(m | v_x, v_y) = \iint P(m | X, Y)\ P(X | v_x)\ P(Y | v_y)\ dX dY\]

The final piece of the puzzle is now to realize that the equality $m = X + Y$ can be interpreted as a joint probability density function through the use of a Dirac delta:

\[P(m | X, Y) = \delta(m - X - Y)\]

Bringing it all together:

\[P(v_x, v_y | m ) \propto \iint \delta(m - X - Y)\ P(X | v_x)\ P(Y | v_y)\ dX dY\]

I think it is very cool that SymPy allows us to compute with Dirac deltas.

vx, vy = sp.symbols('v_x, v_y', positive=True)
x, y = sp.symbols('x, y')
posterior = sp.integrate(sp.integrate(sp.DiracDelta(m - x - y)*normalv(x, 0, vx)*normalv(y, 0, vy), (x, -sp.oo, sp.oo)), (y, -sp.oo, sp.oo))
posterior

$\displaystyle - \frac{\sqrt{v_{x}} \left(- \frac{\pi m \sqrt{v_{y}} e^{\frac{m^{2} v_{y}}{4 v_{x} \left(\frac{v_{x}}{2} + \frac{v_{y}}{2}\right)}} \operatorname{erf}{\left(\frac{m \sqrt{v_{y}}}{2 \sqrt{v_{x}} \sqrt{\frac{v_{x}}{2} + \frac{v_{y}}{2}}} \right)}}{2 \sqrt{v_{x}} \sqrt{\frac{v_{x}}{2} + \frac{v_{y}}{2}}} - \frac{\pi m \sqrt{v_{y}} e^{\frac{m^{2} v_{y}}{4 v_{x} \left(\frac{v_{x}}{2} + \frac{v_{y}}{2}\right)}}}{2 \sqrt{v_{x}} \sqrt{\frac{v_{x}}{2} + \frac{v_{y}}{2}}}\right) e^{- \frac{m^{2}}{2 v_{x}}}}{2 \pi^{\frac{3}{2}} m \sqrt{v_{y}}} + \frac{\sqrt{v_{x}} \left(- \frac{\pi m \sqrt{v_{y}} e^{\frac{m^{2} v_{y}}{4 v_{x} \left(\frac{v_{x}}{2} + \frac{v_{y}}{2}\right)}} \operatorname{erf}{\left(\frac{m \sqrt{v_{y}}}{2 \sqrt{v_{x}} \sqrt{\frac{v_{x}}{2} + \frac{v_{y}}{2}}} \right)}}{2 \sqrt{v_{x}} \sqrt{\frac{v_{x}}{2} + \frac{v_{y}}{2}}} + \frac{\pi m \sqrt{v_{y}} e^{\frac{m^{2} v_{y}}{4 v_{x} \left(\frac{v_{x}}{2} + \frac{v_{y}}{2}\right)}}}{2 \sqrt{v_{x}} \sqrt{\frac{v_{x}}{2} + \frac{v_{y}}{2}}}\right) e^{- \frac{m^{2}}{2 v_{x}}}}{2 \pi^{\frac{3}{2}} m \sqrt{v_{y}}}$

That looks quite unwieldy! Let’s see if it can be simplified.

posterior.simplify()

$\displaystyle \frac{\sqrt{2} e^{- \frac{m^{2}}{2 \left(v_{x} + v_{y}\right)}}}{2 \sqrt{\pi} \sqrt{v_{x} + v_{y}}}$

So using this Dirac delta trick, we can get something like a joint posterior (though not really an integrable one). Its maximum occurs along the line $v_x + v_y = m^2$; beyond that, the contributions of $v_x$ and $v_y$ indeed cannot be separated (unless we use an informative prior, for example).

Maximum likelihood variance estimation on the toy model

Finally, I think we have all the ingredients to give the toy model a go with the maximum likelihood variance estimation. Having had a sneak preview into the unwieldiness of SymPy expressions, I hope you can now appreciate why we aim for an absolute minimal example, like $X \sim \mathcal{N}(0, v_x)$ and $Y \sim \mathcal{N}(0, v_y)$ with observations

\[m_1 = X_1 + Y_1\] \[m_2 = X_1 + Y_2\]

Note that a maximum likelihood approach could in practice definitely be used through numerical rather than symbolic computing, but the symbolic approach gives more insights.

Back at it with Bayes’ rule:

\[P(v_x, v_y | m_1, m_2 ) \propto P(m_1 | v_x, v_y) P(m_2 | v_x, v_y) P(v_x, v_y)\]

Again we take the flat prior. And we rewrite using helper variables and Dirac delta functions. The crux now is that there is a shared helper variable $X_1$ between both $P(m_1 \mid v_x, v_y)$ and $P(m_2 \mid v_x, v_y)$:

\[P(v_x, v_y | m_1, m_2) \propto \iiint \delta(m_1 - X_1 - Y_1)\ \delta(m_2 - X_1 - Y_2)\ P(X_1 | v_x)\ P(Y_1 | v_y)\ P(Y_2 | v_y)\ dX_1 dY_1 dY_2\]

I wouldn’t want to work out that integral by hand!

vx, vy = sp.symbols('v_x, v_y', positive=True)
x1, y1, y2 = sp.symbols('x_1, y_1, y_2')
d = sp.DiracDelta
box = ((x1, -sp.oo, sp.oo), (y1, -sp.oo, sp.oo), (y2, -sp.oo, sp.oo))
posterior = sp.integrate(d(m1 - x1 - y1)*d(m2 - x1 - y2)*normalv(x1, 0, vx)*normalv(y1, 0, vy)*normalv(y2, 0, vy), *box)
posterior = posterior.simplify()
posterior

$\displaystyle \frac{e^{\frac{m_{1}^{2} v_{x}}{2 v_{y} \left(2 v_{x} + v_{y}\right)} - \frac{m_{1}^{2}}{2 v_{y}} - \frac{m_{1} m_{2} v_{x}}{v_{y} \left(2 v_{x} + v_{y}\right)} - \frac{m_{1} m_{2}}{2 v_{x} + v_{y}} + \frac{m_{1} m_{2}}{v_{y}} + \frac{m_{2}^{2} v_{x}}{2 v_{y} \left(2 v_{x} + v_{y}\right)} + \frac{m_{2}^{2}}{2 v_{x} + v_{y}} - \frac{m_{2}^{2}}{2 v_{y}} + \frac{m_{2}^{2} v_{y}}{2 v_{x} \left(2 v_{x} + v_{y}\right)} - \frac{m_{2}^{2}}{2 v_{x}}}}{2 \pi \sqrt{v_{y}} \sqrt{2 v_{x} + v_{y}}}$

The maximum of this 2D posterior is where both components of the gradient are zero:

sol = sp.solve([posterior.diff(vx), posterior.diff(vy)], [vx, vy], dict=True)
sol
[{v_x: m_1*m_2, v_y: m_1**2/2 - m_1*m_2 + m_2**2/2}]

We can rewrite the $v_y$ solution a bit:

sol[0][vy].factor()

$\displaystyle \frac{\left(m_{1} - m_{2}\right)^{2}}{2}$

You may recognize this as the $Q_2$ quantity introduced way up in the baseline strategy.

\[Q_1 = \frac{1}{2}(m_1^2 + m_2^2)\]

and

\[Q_2 = \frac{1}{2}(m_1 - m_2)^2\]

And likewise, you may recognize $v_x = Q_1 - Q_2$. The maximum likelihood estimate for the variance is identical to the ‘juggling’ with quantities done in the baseline approach. One advantage of maximum likelihood is of course that it is generic and flexible and can be naturally extended, for example to let go of the assumption that the mean of $X$ and $Y$ is known. And no cleverness was needed to define quantities $Q$ for which all of this worked out. Now it all follows from a naturally built-up approach. It was not obvious though that we would get such nice and clean analytical expressions out like we did.
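As a tiny sanity check (a one-line sketch), $Q_1 - Q_2$ indeed simplifies to the $m_1 m_2$ that SymPy found for $v_x$:

sp.simplify((m1**2 + m2**2)/2 - (m1 - m2)**2/2)   # -> m_1*m_2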

However, we aren’t done quite yet. One of the benefits of the ML approach is that we can easily incorporate the constraints $v_x \geq 0$ and $v_y \geq 0$ into the optimization problem that is the ML estimate. This is equivalent to imposing a Heaviside-like prior on the variances: they can’t be negative, but the prior is flat elsewhere. We first illustrate graphically that this constrained optimization problem yields a more nuanced solution than simply clipping $v_x$ and $v_y$ to be non-negative. Below are two contour plots of the posterior distribution, first for observation pair $(m_1 = 2, m_2 = 4)$, and after that for $(m_1 = -2, m_2 = 4)$.

posterior_np = sp.lambdify([vx, vy, m1, m2], posterior)
vxs = np.linspace(-20, 40, 1001)
vys = np.linspace(-4, 40, 1003)
vxs.shape = (vxs.size, 1)
vys.shape = (1, vys.size)
VX, VY = np.meshgrid(vxs, vys, indexing='ij')
plt.figure()
plt.contourf(VX, VY, posterior_np(vxs, vys, 2, 4), levels=10)
plt.xlabel('$v_x$')
plt.ylabel('$v_y$')

We observe a few things. First, the maximum of this landscape (the ML estimate) is in the upper right quadrant ($v_x, v_y \geq 0$). We concluded way up that negative variance estimates can only occur when $m_1$ and $m_2$ have opposite signs, but it’s even easier to see in the SymPy solution for $v_x$:

sol[0][vx]

$\displaystyle m_{1} m_{2}$

Furthermore, we see that the landscape does not exist everywhere. This is due to the denominator of the posterior

posterior

$\displaystyle \frac{e^{\frac{m_{1}^{2} v_{x}}{2 v_{y} \left(2 v_{x} + v_{y}\right)} - \frac{m_{1}^{2}}{2 v_{y}} - \frac{m_{1} m_{2} v_{x}}{v_{y} \left(2 v_{x} + v_{y}\right)} - \frac{m_{1} m_{2}}{2 v_{x} + v_{y}} + \frac{m_{1} m_{2}}{v_{y}} + \frac{m_{2}^{2} v_{x}}{2 v_{y} \left(2 v_{x} + v_{y}\right)} + \frac{m_{2}^{2}}{2 v_{x} + v_{y}} - \frac{m_{2}^{2}}{2 v_{y}} + \frac{m_{2}^{2} v_{y}}{2 v_{x} \left(2 v_{x} + v_{y}\right)} - \frac{m_{2}^{2}}{2 v_{x}}}}{2 \pi \sqrt{v_{y}} \sqrt{2 v_{x} + v_{y}}}$

Those square roots only exist for $v_y \geq 0$ and $2v_x + v_y \geq 0$. This is another way to understand why this method can only yield a negative value for $v_x$ but not for $v_y$.

The second pair of observations has opposite signs and moves the unconstrained maximum into the upper left quadrant:

plt.figure()
plt.contourf(VX, VY, posterior_np(vxs, vys, -2, 4), levels=10)
plt.plot(0*vys.T, vys.T, 'r-.')
plt.plot(vxs, 0*vxs, 'r-.')
plt.xlabel('$v_x$')
plt.ylabel('$v_y$')

Where one of the contours touches the vertical red line $v_x = 0$ is where the constrained maximum is. That is at a different value of $v_y$ than the unconstrained maximum, so indeed a different solution than plain clipping. We can even use SymPy to find an expression:

posterior.subs(vx, 0)

$\displaystyle \frac{e^{- \frac{m_{1}^{2}}{2 v_{y}} + 2 \tilde{\infty} m_{2}^{2} + \frac{m_{2}^{2}}{2 v_{y}}}}{2 \pi v_{y}}$

Substitution gives this weird infinity symbol; probably some $0/0$ going on. Luckily SymPy has more tricks up its sleeve: taking limits:

sp.limit(posterior, vx, 0)

$\displaystyle \frac{e^{- \frac{m_{1}^{2}}{2 v_{y}}} e^{- \frac{m_{2}^{2}}{2 v_{y}}}}{2 \pi v_{y}}$

sp.solve(sp.diff(sp.limit(posterior, vx, 0), vy), vy)[0]

$\displaystyle \frac{m_{1}^{2}}{2} + \frac{m_{2}^{2}}{2}$

Compare this to the unconstrained estimate for $v_y$:

sol[0][vy]

$\displaystyle \frac{m_{1}^{2}}{2} - m_{1} m_{2} + \frac{m_{2}^{2}}{2}$

It’s as if we clipped $v_x = m_1m_2$ and then plugged that back in here.

So what about unbiasedness? Well, the unconstrained ML estimate was unbiased, so I think we lost that property when we started doing the constrained optimization… I am confident the ML approach can be used to get unbiased estimates through the previously outlined approach of redoing the whole optimization for aggregated data and relying on the consistency of the estimator. However, I’m currently not optimistic that, due to the ‘clipping’, we can have this ‘data compression’ feature where we can recover aggregate estimates from partial estimates rather than raw data. When we clip (the proper way) with constrained ML when $m_1m_2 < 0$, such that $v_x = 0$ and $v_y = \frac{m_1^2}{2} + \frac{m_2^2}{2}$, there is no way to recover $m_1$ and $m_2$ individually…

Discussion

So this is kind of where my journey stands at the moment, without complete satisfaction. We started with a baseline/frequentist approach to variance estimation that is unbiased but suffers from negative estimates. My ambition was to relax the ‘unbiasedness for the arithmetic mean’ requirement to a more ‘generalized unbiasedness’ with perhaps a different operation than the arithmetic mean. A property to be preserved would be the ability to get aggregate estimates from partial estimates without having to remember all the data. Through Bayesian theory, we derived that the maximum likelihood estimator for the variance parameters is very closely related to the baseline approach. The ML approach has some pros in its flexibility to handle the non-negativity. The approach does not satisfy all the desired properties of consistency, generalized unbiasedness and aggregation of partial estimates at the moment. Perhaps the latter should be ditched (data storage is cheap, definitely as we are dealing with little data here where biases are relatively big). Or perhaps the notion of unbiasedness should be ditched, as we lose it anyway when we use Bayesian statistics with informative priors…

Also fitting to the discussion is this entertaining (near R-rated) blog post that argues that the whole “DIVIDE BY N-1 OR THE BIAS MONSTER IS GONNA GET YA” is bullshit. That the bias, $O(1/N)$, is usually small compared to the variance of the estimator, $O(1/\sqrt{N})$. That “the old guys who went wild about bias are now mostly dead” and that we should rather “teach them how to bootstrap and teach the damn thing properly”.

Just before wrapping up this post, that blog post inspired the following outlook / follow-up:

  • Get a proper understanding of bootstrapping
  • Quantifying the magnitude of the bias induced by clipping / constrained ML and comparing that to the estimator variance (with the hypothesized conclusion that the bias is negligible)
  • See if we can make a quantitative argument that the constrained ML estimate is really better than plain clipping.

Finally, I’d like to end / pause with this interesting quote from the “Numerical Recipes” book:

“…if the difference between n and n−1 ever matters to you, then you are probably up to no good anyway - e.g., trying to substantiate a questionable hypothesis with marginal data.”



Want to leave a comment?

Very interested in your comments but still figuring out the most suited approach to this. For now, feel free to send me an email.