$\def\iset{\textrm{I}\!\mathbb{I}\!\textrm{I}}$
Let's first define a generalized way of thinking about a predictive system. We will say that a predictive system is an algorithm and a set of parameters $W_i$ which, for an input $I_j$, predicts the most likely output $O_k$. In a more generic way we can say: there exists a process which we do not know. Yet, we can measure a finite number of features which form a single input to the process. We can then repeat the unknown process several times, measuring its input and output each time, in order to obtain pairs of $I_j$ and $O_j$. A predictive system is then an algorithm which, given the best ("best" to be defined later) set of parameters $W_i$, performs the operation:
$$ f(W_i, I_j) = O_j $$
For the majority of the $I_j$, $O_j$ pairs.
Training such a predictive system is the task of finding the best $W_i$ from the set of all possible parameter values $\mathbb{W}$. The training procedure takes many forms, yet all of them share common objectives. The most common objective during training is to achieve generalization, i.e. to find the best $W_i$ means to find the $W_i$ which provides the best generalization.
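To make this abstraction concrete, below is a minimal sketch of training as a search over a toy, fully enumerable parameter set. The names (`predict`, `candidate_parameters`, `agreement`) and the tiny threshold model are our own illustrative choices, not part of the framework itself.

```python
# A minimal sketch of "training as a search over W": enumerate every candidate
# parameter value W_i and keep the one that reproduces the most observed
# (I_j, O_j) pairs. Real training procedures search the same space far more
# efficiently; this only illustrates the abstraction.

def predict(w, i):
    # Hypothetical predictive system: a one-parameter threshold on one feature.
    return 1 if w * i > 0.5 else 0

# Pairs of input and output measured by repeating the unknown process.
pairs = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

# A tiny, fully enumerable stand-in for the set of all parameters W.
candidate_parameters = [0.1 * k for k in range(1, 21)]

def agreement(w):
    # How many measured pairs does the system reproduce with parameters w?
    return sum(1 for i, o in pairs if predict(w, i) == o)

# "Training": pick the W_i with the best agreement over the observed pairs.
best_w = max(candidate_parameters, key=agreement)
print(best_w, agreement(best_w))
```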
Before attempting to define generalization in our abstraction, let's define a generalized process. A process that can be predicted by a predictive system consists of:
A set of all possible inputs $\iset$. Each of its members, $I_i$, is a measurement of a finite set of features. The number of features in a single $I_i$ depends on the specific process.
A set of all possible outputs $\mathbb{O}$. Each of its members, $O_i$, is a measurement of a finite set of outputs. Most processes have a single measured output, in which case the set $\mathbb{O}$ is simply the set of all possible values for that single output. We will deal with the single-output case here, although processes with more outputs can be handled in the same fashion.
A set of all possible parameters $\mathbb{W}$. Each of its members, $W_i$, is an assignment of values to a finite set of parameters.
If we take an analogy to the typical problem of classification, we can say that $\iset$ and $\mathbb{W}$ can be large, whilst $\mathbb{O}$ will be a small set. In the case of regression, all three sets are large.
Let's now quickly argue that all three sets, $\iset$, $\mathbb{O}$ and $\mathbb{W}$, are finite. Both $I_i$ and $O_i$ can only be measured, and there are limits on how precisely we can measure. Moreover, the predictive system will run on a computer system, and computer systems have limited storage, therefore there is a limited (although large) set of possible values for every measure. The last note means that there is a finite number of possible $W_i$, $I_i$ and $O_i$, therefore $\iset$, $\mathbb{O}$ and $\mathbb{W}$ are all finite. Note that one could redo the work here proposed with infinite sets by changing the (Riemann) sums into integrals, yet we keep the work simple by operating on sums.
We can now say that a trained predictive system performs a prediction of what $O_j$ should be given $I_j$. The predicted $O_j$ is often called $\hat{O}$ and the predictive system is written as a function so that $\hat{O} = f(W_i, I_j)$. One formalism for this function, using probability theory, is to argue that $f$ - for a given $I_j$ - estimates all conditional probabilities of the form:
$$ P(O_k | I_j) $$
And the final output, $\hat{O}$, is the $O_k$ for which the conditional probability is highest. Not all predictive systems use the concept of conditional probabilities in order to perform the prediction, yet one can calibrate probabilities onto the predictive system or even estimate the conditional probabilities by observing the predictive system. Hence, we can argue that a trained predictive system is able to define conditional probabilities for all combinations between the sets $\iset$ and $\mathbb{O}$.
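As a hedged illustration of that last point - estimating the conditional probabilities purely by observing the predictive system from the outside - the sketch below counts how often each output follows each input over many observed predictions; the helper name and the toy observations are hypothetical.

```python
# A sketch of estimating P(O_k | I_j) purely by observing a (trained)
# predictive system: count how often each output follows each input.
from collections import Counter, defaultdict

def estimate_conditionals(observations):
    """observations: iterable of (I_j, O_k) pairs produced by the system."""
    counts = defaultdict(Counter)
    for i, o in observations:
        counts[i][o] += 1
    return {
        i: {o: c / sum(outs.values()) for o, c in outs.items()}
        for i, outs in counts.items()
    }

# Hypothetical observations of a noisy binary process.
observations = [("a", 0), ("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 1)]
cond = estimate_conditionals(observations)
print(cond)                                # e.g. P(O=0 | I="a") ~ 0.67
print(max(cond["a"], key=cond["a"].get))   # the predicted O-hat for input "a"
```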
We can now define generalization in this framework. We said that during training one finds the best $W_i$ by finding the $W_i$ for which the predictive system best generalizes. If we know beforehand the entire set $\iset$, the entire set $\mathbb{O}$, and also know the output $O_j$ of the process being predicted for every $I_j$ in the set, then we define a predictive system that generalizes perfectly as the predictive system that predicts the same $O_j$ as above for every $I_j$. This should not be surprising. If we are omniscient and can know everything - with perfect accuracy and for every single possible input - about a process, we can build a mapping which will then operate as that process.
In the real world achieving this generalization is much harder because we can never know everything about a process. In the real-world case, the measure of generalization we use is an estimate of this perfect generalization, whilst accounting for the fact that we do not know everything there is to know about the process we are trying to predict. How this generalization estimate is performed is beyond the scope of this work, and there is vast literature on several techniques, notably cross-validation and hold-out data points. Instead we assume that the generalization achieved during training is a good enough estimate of the actual generalization.
A similar framework - and the base for the one we just defined - has been proposed by (Bialek et al, 2001). Their work builds the framework for time series only. Yet, in the framework above the only difference between a time-series predictive system and a non-time-series predictive system is the arrangement of the members of the set $\iset$. A fixed-window time series can always be transformed into a set of non-time-series inputs.
Moreover, in the same work by (Bialek et al, 2001) we can see the proof that, for a time series to be predictable at all, it must be possible to establish a window size over which predictions are made. In summary, if one observes a time series for long enough - in the limit of time going to infinity - the amount of predictive information (to be defined shortly) converges to a constant.
Still following from (Bialek et al, 2001), we will argue that the amount of predictive information from a prediction for a single combination of $I_j$ and $O_i$ is:
$$ \frac{P(O_i)}{P(O_i | I_j)} = \frac{P(O_i)P(I_j)}{P(O_i, I_j)} $$
For every combination of members of $\iset$ and $\mathbb{O}$ we will have a value, and we said that one or both sets may be large, therefore the number of combinations is large as well. We will have a probability distribution of the form above. One thing that we can say about this distribution is that for every $I_j$ there is an $O_i$ which is most likely, since our predictive system has been trained. Therefore the distribution will be peaked in at least one direction.
A common way to summarize a probability distribution is an entropy. An entropy is the average of the logarithm of the inverse probability. We will call this specific entropic form, $S_p$, the average predictive information. In the case above we say:
$$ S_p = - \left\langle \log \left( \frac{P(O_i)P(I_j)}{P(O_i, I_j)} \right) \right\rangle = - \sum_i \sum_j P(O_i, I_j) \log \left( \frac{P(O_i)P(I_j)}{P(O_i, I_j)} \right) $$
Let's simplify the notation and expand the logarithm to obtain:
$$ S_p = \left\langle - \log \frac{P_O P_I}{P_{O,I}} \right\rangle = \langle - \log P_O \rangle + \langle - \log P_I \rangle - \langle - \log P_{O,I} \rangle = S(P_O) + S(P_I) - S(P_{O,I}) $$
This is the result (Bialek et al, 2001) obtained, with the only difference that we are using the framework above to obtain it. More interestingly, if we assume that entropy is always positive then we can see that the entropy of a predictive system is always subextensive. In other words, the final term $S(P_{O, I})$ is positive and it is subtracted from the total, therefore the total entropy of the entire predictive system is less than the sum of the parts involved - inputs and outputs of the predictive system.
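This decomposition can be computed directly from an empirical joint distribution over $(O_i, I_j)$ pairs. Below is a minimal numpy sketch - the helper names are our own - that evaluates $S_p = S(P_O) + S(P_I) - S(P_{O,I})$ from a table of joint probabilities.

```python
# A sketch of computing the average predictive information
#   S_p = S(P_O) + S(P_I) - S(P_{O,I})
# from a table of joint probabilities P(O_i, I_j).
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def average_predictive_information(joint):
    """joint[i, j] = P(O_i, I_j); rows index outputs, columns index inputs."""
    p_o = joint.sum(axis=1)   # marginal over outputs, P(O_i)
    p_i = joint.sum(axis=0)   # marginal over inputs, P(I_j)
    return entropy(p_o) + entropy(p_i) - entropy(joint.flatten())

# Hypothetical joint distribution of a partially predictable process.
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
print(average_predictive_information(joint))   # > 0: some predictive information
```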
Let's think about the above equation in terms of some edge cases. First, imagine a process that cannot be predicted at all: a process that, given an input, returns a random output independent of that input. For this process we have:
$$ P(O, I) = P(O)P(I) $$
And therefore we have
$$ S_p = S(P_O) + S(P_I) - S(P_O) - S(P_I) = 0 $$
Which should make sense. For a completely random process we can never predict anything: whatever we predict right is by chance alone and should be balanced out by what we predict wrong. The average predictive information should be minimal.
Note that the average predictive information moves in the opposite direction to the entropy of $P(O | I)$. The specific edge case discussed above is the case in which $P(O | I)$ has maximum entropy. This inverse relationship should make sense since $P(O | I)$ is in the denominator of our original equation.
A perfectly predictable process is a mapping: for every input there exists exactly one output. If, for simplicity, we assume that the mapping is one-to-one - distinct inputs map to distinct outputs - we have:
$$ P(O, I) = P(O) = P(I) $$
And therefore:
$$ S_p = S(P_O) + S(P_O) - S(P_O) = S(P_O) = S(P_I) $$
Since the mapping can be perfectly predicted we get a high average predictive information. Note as well that as the sets $\iset$ and $\mathbb{O}$ gain more members - as there are more possible inputs and outputs - the value of $S_p$ increases.
The inverse relationship holds here as well: the entropy of $P(O | I)$ for a mapping is at its minimum.
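Both edge cases can be checked numerically. The sketch below (self-contained, with our own helper names) builds an independent joint distribution and a one-to-one mapping and evaluates $S_p$ for each; the first comes out as zero and the second as the entropy of the outputs.

```python
# Numerical check of the two edge cases discussed above.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def s_p(joint):
    # S_p = S(P_O) + S(P_I) - S(P_{O,I}) for a joint table P(O_i, I_j).
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.flatten())

# Edge case 1: a completely random process, P(O, I) = P(O)P(I)  ->  S_p = 0.
independent = np.outer(np.array([0.5, 0.5]), np.array([0.25, 0.75]))
print(s_p(independent))                    # 0.0 (up to floating-point error)

# Edge case 2: a one-to-one mapping, P(O, I) = P(O) = P(I)  ->  S_p = S(P_O).
mapping = np.diag(np.array([0.25, 0.75]))
print(s_p(mapping))                        # equals...
print(entropy(np.array([0.25, 0.75])))     # ...the entropy of the outputs
```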
A real predictive system sits somewhere in between. In both edge cases the subextensive entropy became extensive because the last term, $S(P_{O, I})$, could be eliminated. In a more complex case the last term may be quite hard to estimate, so we will need to work with the entropic form in its subextensive version. Luckily, there are nonextensive forms of entropy that we can refer to. One rather popular form is Tsallis entropy (Tsallis, 1998):
$$ S_q = \frac{1 - \sum_i p(W_i)^q}{q - 1} $$
Which is parametrized by the value $q$.
One interesting property of Tsallis' entropy is its pseudo-additivity. Given a system of more than one part - here $A$ and $B$ - the total entropy of the system can be found by:
$$ S_q(A + B) = S_q(A) + S_q(B) + (1 - q)S_q(A)S_q(B) $$
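As a quick numerical sanity check - a sketch with our own helper names, using the discrete Tsallis form above - the pseudo-additivity property can be verified for two independent subsystems:

```python
# A sketch checking the pseudo-additivity of the discrete Tsallis entropy:
# for independent subsystems A and B,
#   S_q(A + B) = S_q(A) + S_q(B) + (1 - q) S_q(A) S_q(B)
import numpy as np

def tsallis_entropy(p, q):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if q == 1.0:
        return float(-np.sum(p * np.log(p)))   # Shannon limit as q -> 1
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

q = 1.5
p_a = np.array([0.2, 0.8])
p_b = np.array([0.5, 0.3, 0.2])

# Joint distribution of the composite system A + B (independent subsystems).
p_ab = np.outer(p_a, p_b).flatten()

left = tsallis_entropy(p_ab, q)
right = (tsallis_entropy(p_a, q) + tsallis_entropy(p_b, q)
         + (1 - q) * tsallis_entropy(p_a, q) * tsallis_entropy(p_b, q))
print(left, right)   # the two sides agree
```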
Now if we compare the equations from our extension of Bialek's work with the pseudo-additivity property of Tsallis' entropy we can argue that, for a predictive system:
$$ - S(P_{O, I}) \approx (1 - q)\, S(P_O)\, S(P_I) $$
We cannot say much about the scales of the values. Yet, since we are assuming that entropies are always positive, we can argue that both sides should be negative. This should make sense since a value of $q > 1$ in Tsallis' entropy means a system that is subextensive.
We argue that for a system to have predictive properties its entropic form - the average predictive information - must be subextensive. This also means that, if we use Tsallis' entropy as the entropic form, the value of its parametrization must be $q > 1$.
This behavior can be seen in the experimental results of training predictive systems using Tsallis' entropy as the entropic form (Anastasiadis, 2004), (Anastasiadis, 2003), (Tsallis, 1998), (Seba, 2012). In other words, we attempt to explain the experimental results found in those previous works using the theory shown here.
Based on the framework here presented we can further argue that a predictive system can be introspected by the value of $q$ which best fits its entropic form.
A predictive system that predicts randomly will have $q = 1$.
A predictive system that predicts less badly will have $q$ slightly higher than $1$.
A predictive system that predicts reasonably will have $q$ markedly higher than $1$.
A predictive system that predicts well will have $q$ slightly higher than $1$.
A predictive system that predicts perfectly will have $q = 1$.
How far $q$ will deviate from $1$ is likely dependent on the predicted process. The approach of $q \rightarrow 1$ may tell us how much information about the process we have within the pairs of $(I_i, O_i)$. In other words, assuming that the system is improving, we can argue that our generalization is approaching perfect generalization as $q$ approaches $1$.
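If we take the correspondence with Tsallis' pseudo-additivity argued above at face value, then $-S(P_{O,I}) \approx (1 - q)\, S(P_O)\, S(P_I)$ can be solved for the index, giving $q \approx 1 + S(P_{O,I}) / (S(P_O)\, S(P_I))$. The sketch below is our own reading of that argument - not an established estimator - and computes such a $q$ from a joint probability table; tracking it across training checkpoints would be one way to perform the introspection described above.

```python
# A sketch of estimating the entropic index q from the correspondence
#   -S(P_{O,I}) ~ (1 - q) S(P_O) S(P_I)   =>   q ~ 1 + S(P_{O,I}) / (S(P_O) S(P_I))
# This is our own reading of the argument above, not an established estimator.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def estimate_q(joint):
    """joint[i, j] = P(O_i, I_j), e.g. estimated from held-out predictions."""
    s_o = entropy(joint.sum(axis=1))
    s_i = entropy(joint.sum(axis=0))
    s_oi = entropy(joint.flatten())
    return 1.0 + s_oi / (s_o * s_i)

# Hypothetical joint distribution of a partially predictable process.
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
print(estimate_q(joint))   # > 1, i.e. subextensive under this reading
```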
Here we outline future work to be performed in order to prove, disprove, or improve the prospects of the average predictive information.
Although the number of experimental works which found $q > 1$ is small, the existence of a predictive system that behaves in an entropic form with $q < 1$ would be quite problematic. Since $S_p$ is an entropy it cannot be negative, and since all three of its terms are entropies as well they cannot be negative either. But for $q < 1$ the last term would need to be negative in order for the predictive information to follow the forms above.
If one can find a predictive system which behaves with $q < 1$ we will need to argue either that we have a negative entropy, and in this case negative information (which is a rather difficult argument to make); or that the entropic form for $S(P(O, I))$ would not follow a probability but instead a correlation, i.e. $S(Corr(O, I))$.
As (Tsallis, 2004) defined and (Burlaga, 2005) proved experimentally, a physical system outside of equilibrium will behave under a non-extensive entropy. Such a system will be characterized by a q-triplet: three values of $q$, where different characteristics of the system are formalized under a Tsallis entropy.
The three values of $q$ are defined as: $q_{statistical}$, which describes the probability distribution of the stationary state; $q_{sensitivity}$, which describes the sensitivity to initial conditions; and $q_{relaxation}$, which describes the relaxation of the system towards equilibrium.
An equilibrium state means $\{q_{statistical}, q_{sensitivity}, q_{relaxation}\} = \{1, 1, 1\}$; and out of equilibrium means $\{q_{statistical}, q_{sensitivity}, q_{relaxation}\} \neq \{1, 1, 1\}$. Moreover, how far each quantity deviates from $1$ indicates a stronger tendency towards the characteristic of that $q$. A physical system out of equilibrium is expected to behave with $q_{statistical} > 1$, $q_{sensitivity} < 1$ and $q_{relaxation} > 1$.
The question is which $q$ is the one at play in the average predictive information. As long as we assume the answer to the previous question to be negative, i.e. that there cannot be a predictive system with $q < 1$, we can say that we are not after $q_{sensitivity}$. This should make sense: a predictive system is trained several times, cross-validated, and its input comes in shuffled order. Since the order of the data during training should not matter towards the resulting trained system, training algorithms are expected to remove initial condition sensitivity.
Whether the $q$ we are after for predictive systems is $q_{statistical}$ or $q_{relaxation}$ is a harder question. This leads to the next, and more practical, question.
We are thinking about the entropy of the predictive system once trained. This leads to the idea of adding an entropic term to the error function optimized during training, akin to the work by (Anastasiadis, 2003). For some value of $q$, at any point during training, the value of the average predictive information will be about the size of the variance between the current error and its global minimum. This is akin to the work in (Tsallis, 1992). The assumption in that work is that one could start by adding infinite noise and then reduce from there, which is not possible in practice. Yet, in practice we can start from a high enough noise.
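As a very rough sketch of what adding an entropic term to the error function could look like - our own illustration, not the specific construction of (Anastasiadis, 2003) or (Tsallis, 1992) - one could penalize a classifier's error with a weighted Tsallis entropy of its output distribution, reducing the weight (the "noise") during training:

```python
# A rough sketch of an error function augmented with a Tsallis entropic term.
# Illustrative only; this is not the specific scheme of (Anastasiadis, 2003).
import numpy as np

def tsallis_entropy(p, q):
    p = p[p > 0]
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def augmented_error(y_true, y_prob, q=1.5, lam=0.1):
    """Cross-entropy error plus a weighted Tsallis entropy of the predictions.

    y_true: one-hot targets, shape (n, k); y_prob: predicted probabilities.
    lam plays the role of the "noise" weight, to be reduced during training.
    """
    eps = 1e-12
    cross_entropy = -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))
    entropic_term = np.mean([tsallis_entropy(row, q) for row in y_prob])
    return cross_entropy + lam * entropic_term

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])
# Start from a relatively high lam ("high enough noise") and reduce it over epochs.
for lam in (0.5, 0.1, 0.01):
    print(lam, augmented_error(y_true, y_prob, q=1.5, lam=lam))
```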
Perhaps placing the entropic form into the error function is not the only possible addition to the training algorithm. As in (Anastasiadis, 2003), who placed an entropic form into Rprop training directly, other parametrisations of current training algorithms are possible. If more than one parametrisation is possible, then it could be that we do not need to choose between $q_{statistical}$ and $q_{relaxation}$. It may be that each $q$ parameter corresponds to a distinct $q$ of the physical system.
We have been working under the assumption that our target perfect predictive system has no noise in the input or output measures. We then kept this assumption when thinking about real-world predictive systems. In the real world, data will be noisy. Whether an improved predictive system with an entropic term will react (train) better or worse under noisy data is a topic for another discussion. Currently there is no experimental work that attempts to test the effects of noisy data on the behavior of predictive systems using non-extensive entropic forms.