Exponential generalization
When dealing with a learning system we often call all its parameters $W$, whether they form a single matrix of parameters (e.g. linear regression), several matrices (e.g. neural networks) or something even more complex. Yet, we can argue that there exists a $W$ that is the collection of all parameters. Next, we will name $W_i$ a specific instance of $W$, in which each parameter has a specific value.
As an example, if we are running a simple Artificial Neural Network (ANN), we could say that $W$ is a collection of two matrices. Say we have two layers, and the parameter (weight) matrices have sizes $4 \times 3$ and $3 \times 2$. In this case, two possible values of $W$ are:
$$ W_1 = \left\{ \: \left[ \begin{matrix} 3.2 & 1.2 & 4 \\ -2 & 7 & 2 \\ -2.1 & -3 & 0.3 \\ 6 & 2.1 & 1 \\ \end{matrix} \right], \left[ \begin{matrix} 2.2 & 5.2 \\ -1 & 3 \\ -3 & 1 \\ \end{matrix} \right] \: \right\} \\ W_7 = \left\{ \: \left[ \begin{matrix} 2 & 1.2 & 1 \\ -1 & 7 & 5 \\ -1 & -3 & 3 \\ 3 & 1.3 & 6 \\ \end{matrix} \right], \left[ \begin{matrix} -2 & 2 \\ 1.1 & 4 \\ 1 & 4 \\ \end{matrix} \right] \: \right\} $$
And so on.
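For concreteness, here is a minimal sketch of this bookkeeping in Python with `numpy` (the shapes come from the example above; the values of `W_7` are drawn at random purely for illustration):

```python
import numpy as np

# One state W_1 of the example ANN: a 4x3 weight matrix for the
# first layer and a 3x2 weight matrix for the second layer.
W_1 = [
    np.array([[ 3.2,  1.2, 4.0],
              [-2.0,  7.0, 2.0],
              [-2.1, -3.0, 0.3],
              [ 6.0,  2.1, 1.0]]),
    np.array([[ 2.2, 5.2],
              [-1.0, 3.0],
              [-3.0, 1.0]]),
]

# Any other state W_i has the same shapes but different values;
# here we simply draw one at random.
rng = np.random.default_rng(7)
W_7 = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
```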
Note that the numbering of the $W_i$'s is arbitrary; we only require that each value of $W$ is different from all other values. In other words, every value of $W$ is unique:
$$ W_i = W_j \iff i = j $$
Also let's define a generic $W_i$:
$$ W_i = \left\{ \: \left[ \begin{matrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \\ w_{10} & w_{11} & w_{12} \\ \end{matrix} \right], \left[ \begin{matrix} w_{13} & w_{14} \\ w_{15} & w_{16} \\ w_{17} & w_{18} \\ \end{matrix} \right] \: \right\} \\ $$
where, in our ANN example, each $w_i$ is a weight on a connection between neurons. This way, any $W_i$ is a state of the full model.
We can also say that there is a finite number of different values for $W$. This can be contested, since each weight can in principle assume any value; yet, in a computer, the weights are represented by floating point numbers, which have a limited number of representable values. Below we will work under the assumption that $W$ has a finite number of values, yet one can change all sums into integrals (since they all behave like Riemann sums) and work under no assumption of finite possibilities.
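To make the finiteness concrete, suppose (purely as an illustrative assumption) that each of the $18$ weights in our example is stored as a 32-bit float. Each weight can then take at most $2^{32}$ distinct values, so:

$$ N \leq \left( 2^{32} \right)^{18} = 2^{576} $$

An astronomically large, but finite, number of possible states.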
In this context we argue that training a learning model results in a specific value $W_i$ for $W$. Given a dataset over which an algorithm is trained, certain values of $W$ give a better fit than others, inducing a probability distribution $p(W)$. Finally, we can argue that, over a specific learning problem (a specific dataset), $W$ is a random variable with some distribution $p(W)$. In other words, whenever we train a learning system on a problem, we reach a specific value of $W$. If the training procedure is stochastic, then over several training runs on the same problem we reach different (although likely similar) values of $W$.
Note that we are not interested in non-stochastic training procedures; in such a case we would have a $p(W)$ with the value $1$ at one specific location and $0$ everywhere else (or a Dirac delta function if we are not assuming finite possibilities). Stochastic training procedures, including regularization (see below), give us a more interesting, and less studied, probability distribution $p(W)$.
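One way to see $p(W)$ empirically is to repeat the same stochastic training procedure many times and look at the spread of the final weights. Below is a minimal sketch with a one-weight least-squares model and plain SGD; the model, data and hyperparameters are illustrative choices, not something prescribed by the argument:

```python
import numpy as np

# A toy regression problem with a known best weight of 2.0.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

def train_once(seed, lr=0.05, epochs=50):
    """One stochastic training run; returns the final weight."""
    r = np.random.default_rng(seed)
    w = r.normal()  # random initialization: one source of stochasticity
    for _ in range(epochs):
        i = r.integers(len(x))               # random sampling: another source
        grad = 2 * (w * x[i] - y[i]) * x[i]  # gradient of (w*x - y)^2
        w -= lr * grad
    return w

# Each run yields one draw from p(W); the histogram of final weights
# over many runs approximates the distribution itself.
samples = np.array([train_once(s) for s in range(1000)])
print(samples.mean(), samples.std())
```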
The definition of a learning system above is analogous to the definition of a physical system in statistical mechanics: all possible states of the system can be encoded as values of the multidimensional random variable $W$, i.e. as points in weight-space, just like all states of a physical thermodynamic system can be encoded as points in phase space. This analogy allows us to carry over properties of probability distributions over phase space to our distribution $p(W)$, as long as we make sure that the constraints obeyed by the distributions over phase space are also obeyed by $p(W)$.
The prime analogy between a thermodynamic system and $p(W)$ is that both are formed of several parts. The thermodynamic system is made of several particles, each with its own position and momentum, whilst $W$ is formed of several weights, each with its own value. For now we will assume that the particles' positions and momenta are independent of each other, and likewise that the values of the weights are independent of each other. This assumption is incorrect, but we will explore fixing it later.
Next we will attempt to link the concept of entropy to $p(W)$. We said that a stochastic process performs the learning, therefore we can say that $0 \leq p < 1, \forall p$. This means that the logarithm of $p$ is negative, and the smaller $p$ is, the larger the magnitude of its logarithm. If we do not count values of $p = 0$ we can say that:
$$ \begin{equation} \langle (W - \mu_W)^2 \rangle \propto - \sum_{p_i \neq 0} \ln p(W_i) \end{equation} $$
In other words, the more spread out the distribution is, the smaller the individual probabilities.
Equation $(1)$ is almost entropy, but not quite. The entropy of a thermodynamic system is defined only for a unimodal distribution (a distribution with a single peak), and we can imagine a non-unimodal distribution for $p(W)$. Entropy is limited to unimodal distributions because a thermodynamic system is constrained by energy, and high energy states are unlikely. On the other hand, any learning system that is regularized (be it $L1$, $L2$ or something else) will have a unimodal distribution for $p(W)$. This is because the weight values are pushed downwards, so high weight values become unlikely.
With the analogy in mind we can now say:
$$ \begin{equation} \langle (W - \mu_W)^2 \rangle \propto - \sum_{p_i \neq 0} \ln p(W_i) \propto - \sum_{i=1}^N p(W_i) \ln p(W_i) \end{equation} $$
Therefore the variance of $p(W)$ is proportional to its entropy. That is not a surprising result, yet it will be important when we need to fix the independence assumption. Also note that we introduced $N$, the number of all possible values of $W$.
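As a toy numerical check of the direction of the proportionalities in equations $(1)$ and $(2)$, consider two distributions that are uniform over supports of different sizes (the intuition above is cleanest in this roughly-uniform case; the specific distributions are arbitrary illustrations):

```python
import numpy as np

def stats(p, values):
    """Variance, the log-sum of equation (1), and the entropy of equation (2)."""
    mu = (p * values).sum()
    var = (p * (values - mu) ** 2).sum()
    nz = p[p > 0]                        # skip p = 0, as in equation (1)
    log_sum = -np.log(nz).sum()
    entropy = -(nz * np.log(nz)).sum()
    return var, log_sum, entropy

values = np.arange(11, dtype=float)
narrow = np.where((values >= 4) & (values <= 6), 1 / 3, 0.0)  # 3 states
broad = np.where((values >= 1) & (values <= 9), 1 / 9, 0.0)   # 9 states

print(stats(narrow, values))  # approx (0.67,  3.30, 1.10)
print(stats(broad, values))   # approx (6.67, 19.78, 2.20)
```

All three quantities grow together as the distribution spreads.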
(Here we have references to papers that distinguish additivity and extensivity)
The concepts of additivity and extensivity are separate, yet in physical phenomena they always appear alongside each other, related by the logarithm. In other words, in the vast majority of physical phenomena the logarithm of an extensive quantity is additive, and vice-versa. We will explore the additivity of the logarithm in equations $(1)$ and $(2)$.
We can now say that:
$$ \begin{equation} \sum_{i=1}^N \ln p(W_i) = \ln p(W) = 0 \end{equation} $$
The right-hand side should be obvious: the probability of $W$ being in one of its possible states is $1$, and the logarithm of $1$ is $0$. The left-hand side is less obvious; it is only true if the probability is extensive.
Another way to say this is to argue that we can separate our system $W$ into two systems and the probabilities of one system do not influence the other. We can say that $\forall M, 0 < M < N$ we have:
$$ \begin{equation} \sum_{i=1}^N \ln p(W_i) = \sum_{i=1}^M \ln p(W_i) + \sum_{i=M+1}^N \ln p(W_i) \end{equation} $$
That is to say that the probability of $W$ is just a product of the probabilities of each possible value:
(this is wrong, there's no correlation between states but between the particles)
$$ \begin{equation} p(W) = p(W_1) p(W_2) p(W_3) \dots \end{equation} $$
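Taking the logarithm of equation $(5)$ makes the link to equations $(3)$ and $(4)$ explicit: if the probabilities factor, their logarithms add.

$$ \ln p(W) = \ln \prod_{i=1}^N p(W_i) = \sum_{i=1}^N \ln p(W_i) $$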
Now, if equation $(4)$ is not true, then equation $(3)$ is not true either. And we have little reason to believe that equation $(4)$ is true; on the contrary.
We now introduce a learning algorithm. Since we have a learning system, we expect that there exists an algorithm that walks from some not-so-good solution at $W_i$ to a better solution at $W_j$ by some means. For the purpose of an example, let's say that we are using Stochastic Gradient Descent (SGD), or any stochastic variation thereof. At any $W_i$ the algorithm will have a high probability of choosing several $W_j$ states as the next state, and a very low probability of choosing some other $W_k$ states. Part of the choice is due to the algorithm's stochasticity, yet the biggest part of the choice is due to the current state $W_i$.
We can extend this further. If the algorithm has some memory, for example if it uses momentum, then not only the current state but also previous states influence the probability of the final state of the learning system. Therefore, for any stochastic learning algorithm, equation $(5)$ is in reality:
$$ \begin{equation} p(W) = p(W_1 | W_2, W_3, \dots) p(W_2 | W_1, W_3, \dots) p(W_3 | W_1, W_2, \dots) \dots \end{equation} $$
The only stochastic algorithm for which equation $(5)$ would be true is a random search with a single step. This means that the properties of the states of $W$ are not additive, i.e. $\ln p(W)$ is not additive. But not all is lost: in the physical world we also face the issue of non-additivity, and solutions have been proposed (Refs). Yet, first we need to be confident that we can use solutions to physical problems as solutions for our learning system.
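To see the memory concretely, consider SGD with momentum: the update depends not only on the current state but on a running velocity that summarizes all past states. A minimal sketch on a noisy one-dimensional quadratic loss (the loss and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def momentum_step(w, v, lr=0.1, beta=0.9):
    """One SGD-with-momentum step on a noisy quadratic loss w**2."""
    grad = 2 * w + rng.normal(scale=0.5)  # stochastic gradient estimate
    v = beta * v + grad                   # v accumulates past gradients...
    w = w - lr * v                        # ...so the walk has memory
    return w, v

w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, v)

# The distribution of the next w depends on v, i.e. on the whole
# history of previous states, not only on the current w.
print(w)
```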
Until now we have been vaguely alluding to the idea that a learning system can be described by entropy in a similar fashion to a physical thermodynamic system. Now comes the time to deepen the analogy.
In statistical mechanics, entropy describes the spread of the probability distribution over the energy states, $E_i$, of a thermodynamic system. Each energy state is a combination of the positions and momenta, say $q_i$ and $p_i$, of every particle in the system. The probability distribution is unimodal, peaked at the average energy.
The analogy to our learning system should start to appear.
$$ \begin{align} E &\sim W_{best} \\ E_i &\sim W_i \\ p(E_i) &\sim p(W_i) \\ p(E_i | \dots) &\sim p(W_i | \dots) \\ p_i, q_i &\sim w_i \\ \end{align} $$
The first equation requires more explanation. The average energy ($E$) is what we can measure in the physical system and can be found in the middle of the probability distribution of the $E_i$. For our learning system the middle of the probability distribution is the most probable final state of $W$, the state in which the system has learned the problem. Note that this is a stochastic solution and we are dealing with a probability distribution, therefore the exact value of $W_{best}$ lies somewhere in the peak of the distribution.
The probabilities in the physical thermodynamic problem are often independent of each other, which is what makes their logarithms additive. This is because of the assumption that in an ideal gas the space between molecules is very big compared to the size of the molecules, and also that there are no long-range potential forces between the molecules.
Yet, several problems out there do not behave like an ideal gas. In other words, we are faced with physical thermodynamic problems in which the probabilities of the energy levels ($E_i$) are not independent of each other, just like the probabilities of the $W_i$ in our learning system. The Boltzmann-Gibbs-Shannon entropy does not give a good description of those physical systems, yet other functional forms of entropy exist.
The reason why the BGS entropy does not work in these cases is relevant. We have non-negligible interactions between molecules, which then translate into non-negligible dependencies between the energy states. The momentum of a particle taken at random follows a distribution of possible momentum values; for a specific momentum to have the value $k$ we now have a conditional distribution:
$$ p(p_i = k) = p(p_i = k \mid p_1, p_2, \dots, q_1, q_2, \dots) $$
Finally, we say that the probability of a specific average energy is also a product of joint probabilities. This is a hypothesis, based on the microscopic behaviour of the molecules described above.
$$ \begin{equation} p(E) = p(E_1 | E_2, E_3, \dots) p(E_2 | E_1, E_3, \dots) p(E_3 | E_1, E_2, \dots) \dots \end{equation} $$
We will return to this hypothesis. But for now let's see how to make it work for the physical problem.
As proposed by Tsallis (Ref) we have the non-additive entropy:
$$ \frac{1 - \sum_i p(W_i)^q}{q - 1} $$
where for a specific problem there may exist a special value of $q$ for which this entropy is extensive, despite the fact that the Boltzmann-Gibbs-Shannon entropy is not extensive on the problem.
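A small sketch of this entropy in code, checking that it recovers the Boltzmann-Gibbs-Shannon form as $q \to 1$ (the distribution is an arbitrary example):

```python
import numpy as np

def tsallis_entropy(p, q):
    """Non-additive (Tsallis) entropy: (1 - sum_i p_i**q) / (q - 1)."""
    p = p[p > 0]
    return (1.0 - (p ** q).sum()) / (q - 1.0)

def bgs_entropy(p):
    """Boltzmann-Gibbs-Shannon entropy: -sum_i p_i * ln(p_i)."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p = np.array([0.5, 0.25, 0.125, 0.125])
print(bgs_entropy(p))              # 1.2130...
print(tsallis_entropy(p, 1.0001))  # approaches the BGS value as q -> 1
print(tsallis_entropy(p, 2.0))     # 0.65625, a genuinely different value
```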
Note that the above entropy is an extension of the BGS entropy over the probability distribution of the $E_i$, written here in terms of $W_i$ through our analogy. Just as in the physical system, we assume that:
$$ \mathrm{Cov}(W_i, W_j) \sim \sum \mathrm{Cov}(w_i, w_j) \hspace{6em} w_i \in W_i, w_j \in W_j $$
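A sketch of how the right-hand side could be estimated empirically: repeat a stochastic training run under different seeds, stack the flattened final weight vectors, and compute the pairwise covariances between weights (a two-weight linear model and its hyperparameters are, again, illustrative choices):

```python
import numpy as np

# A toy two-weight regression problem.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = x @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=200)

def train_once(seed, lr=0.05, epochs=200):
    """One stochastic run; returns the final two-weight vector."""
    r = np.random.default_rng(seed)
    w = r.normal(size=2)
    for _ in range(epochs):
        i = r.integers(len(x))
        grad = 2 * (x[i] @ w - y[i]) * x[i]
        w -= lr * grad
    return w

# Rows: one flattened weight vector per run, i.e. one sample of W.
samples = np.stack([train_once(s) for s in range(500)])
print(np.cov(samples, rowvar=False))  # empirical Cov(w_i, w_j) across runs
```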