What Does Entropy Mean?
Based on the statistical analysis of sources and noise, and usually relying on i.i.d. assumptions, information theory establishes conclusive results on the limitations of information processing tasks. These bounds often take the form of rates, for example a maximum communication rate or a maximum compression rate. Such limits are mathematically expressed as entropic functions of probability measures and are by construction associated with a processing task; they therefore carry an operational meaning.
In quantum mechanics the notion of operational definitions is quite common and is closely related to the notion of observables, which can be readily associated with a measurement or a preparation procedure, that is, a task. Since the advent of quantum information, this good practice has been carried over, and the operational meaning of quantum entropic functions has become a topic of growing interest in connection with the fundamental limitations of quantum information processing.
Often these classical and quantum tasks are related. For example, in asymptotic state discrimination, in order to derive fundamental limits on how well two quantum states can be told apart, it suffices to solve an appropriately constructed associated classical discrimination problem. While the convergence of averages taken over large samples suffices in the i.i.d. case, classical limit laws extend poorly to the general frameworks required by this classical-to-quantum mapping, and moreover they say little about the rate at which convergence takes place. For a number of information-theoretic problems, this rate is precisely what matters.
Large deviations theory, in turn, deals specifically with computing the asymptotic rate at which such convergence takes place. Moreover, large deviations results often go beyond the i.i.d. assumption; in some cases they set the structure of the stochastic process aside entirely and deal with the general framework of sequences of probability measures. By dealing chiefly with rates of convergence, large deviations theory has important connections with information theory. When applied to information processing tasks, large deviations theorems commonly yield rates of convergence that are entropic functions, for example the Kullback-Leibler divergence, the Chernoff distance, and the Rényi relative entropy, to cite a few.
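As a minimal numerical sketch of two of the entropic functions mentioned above, the following computes the Kullback-Leibler divergence and the Rényi relative entropy of order α for discrete distributions. The function names and the example distributions are illustrative choices, not taken from the text.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) in nats, for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def renyi_divergence(p, q, alpha):
    """Renyi relative entropy of order alpha (alpha > 0, alpha != 1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

# Two illustrative biased coins.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))
print(renyi_divergence(p, q, 0.5))
```

As α → 1 the Rényi relative entropy recovers the Kullback-Leibler divergence, which can be checked numerically by taking α close to 1.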
Hence, entropic functions arise from first principles under the large deviations machinery and are moreover already associated with their respective information-theoretic bounds. It is therefore of profound interest to investigate further the connection between the large deviations principle (LDP) and the operational meaning of entropic functions in general. Such an analysis can in turn be naturally extended to quantum entropic functions, as the LDP provides the generality needed to treat sequences of probability measures with minimal structural assumptions.
How Fast Can We Learn?
In statistical machine learning, most of the theory is dedicated to statistical inference on samples of independent and identically distributed (i.i.d.) observations. This is known as passive learning, as the learner plays no role in the collection of data. Apart from cases where all the data is already available, it is usually possible to combine data analysis and collection, with the learner sequentially deciding on sampling strategies. Active learning presents significant advantages over standard data collection, sometimes with exponential speedups. Moreover, it may be the only feasible solution in a changing environment, for example when the system being observed must also be controlled. Active learning systems are by nature feedback systems.
On the other side, information-theoretic analyses are often used to derive limiting bounds on passive data collection and learning schemes. Information theory is itself also a passive (unidirectional) theory of communication, and for years after Shannon's fundamental contribution it remained mostly as such. When information flows in two directions, establishing fundamental limits on information exchange becomes more challenging. In recent years, however, more attention has been given to the field of interactive information theory, in which one is interested in deriving such bounds on communication and information exchange in a dynamic setting.
It is hence natural to expect that the methods of interactive information theory will be of use in efficiently ranking the performance of active learning schemes, in which information flows in both directions and feedback loops allow for memory and correlation in the composed system of data plus learner. This should be of great value in developing an information-theoretic framework for comparing dynamic learning schemes, and ultimately for optimizing such methods.
How Good is an Optimal Decision?
In statistical decision theory, the task in a hypothesis testing problem is to decide, based on acquired data, which statistical hypothesis best explains the measured data. In a binary setting, one discriminates between only two statistical hypotheses, which are usually probabilistic models. Attached to every observation of the system followed by a decision there is an error probability: that of rejecting the true hypothesis, or that of accepting the false one.
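The binary setting can be sketched with a likelihood-ratio test between two hypothetical coin models, estimating both error probabilities by Monte Carlo. The specific parameter values, sample sizes, and names below are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two simple hypotheses for a biased coin (illustrative values).
p0, p1 = 0.4, 0.6   # P(heads) under H0 and under H1
n = 50              # observations per trial

def decide(heads):
    """Likelihood-ratio test with unit threshold: pick the more likely model."""
    ll0 = heads * np.log(p0) + (n - heads) * np.log(1 - p0)
    ll1 = heads * np.log(p1) + (n - heads) * np.log(1 - p1)
    return 1 if ll1 > ll0 else 0

trials = 2000
# Type I error: deciding H1 when H0 is true.
alpha = np.mean([decide(rng.binomial(n, p0)) == 1 for _ in range(trials)])
# Type II error: deciding H0 when H1 is true.
beta = np.mean([decide(rng.binomial(n, p1)) == 0 for _ in range(trials)])
print(alpha, beta)
```

With these symmetric hypotheses the test reduces to comparing the head count against n/2, and both error probabilities are small but nonzero for finite n.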
In a symmetric framework both errors are treated equally; this is known as the Bayesian setting. If, furthermore, the observation process can be repeated indefinitely and, upon every observation, the data is identically distributed and successive observations are uncorrelated, it is known that for optimal decision tests the error probability vanishes exponentially fast in the number of trials. Since the pioneering work of Chernoff, it is moreover known that the asymptotic decay exponent is uniquely determined by the two statistical hypotheses being tested. This result is known as the Chernoff bound.
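The decay exponent of the Chernoff bound can be computed numerically for discrete distributions as the Chernoff information, minimizing over the order parameter s on a grid. This is a minimal sketch under the classical i.i.d. setting described above; the function name and example distributions are illustrative.

```python
import numpy as np

def chernoff_information(p, q, grid=1001):
    """Chernoff information C(p, q) = -min_{0<=s<=1} log sum_x p(x)^s q(x)^(1-s),
    the optimal asymptotic error exponent in Bayesian binary hypothesis testing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    s = np.linspace(0.0, 1.0, grid)[:, None]          # grid over the order parameter
    log_moments = np.log(np.sum(p**s * q**(1.0 - s), axis=1))
    return float(-np.min(log_moments))

# Illustrative pair of biased coins.
p = [0.4, 0.6]
q = [0.6, 0.4]
C = chernoff_information(p, q)
# The optimal Bayesian error probability decays as exp(-n * C)
# for n i.i.d. observations.
print(C)
```

For this symmetric pair the minimizing s is 1/2, so C reduces to the negative log of the Bhattacharyya coefficient between the two distributions.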
Although enlightening, this framework is far from reality, as independence and identical distribution are two very strong assumptions which most models fail to fulfill. Real-world processes that present some form of complexity, memory, and/or information processing are correlated. In turn, those are precisely the processes for which statistical analysis is needed, as a deterministic model is nonexistent or hard to derive.
This project aims to extend Chernoff's work and treat the setting in which the data from consecutive observations may be correlated. Under this assumption, generalized versions of the original hypothesis testing problem are formalized and treated. The results obtained from this analysis show how optimal decision tests perform in the asymptotic regime under correlation.
From this project we were able to state and prove a generalized Chernoff bound, which gives a sharp bound on the asymptotic performance of optimal decision rules.
How Do Machines Think?
Soft computing methods consist of a set of algorithms capable of efficiently treating computationally hard problems by providing approximate or inexact solutions. Such problems are often NP-complete and hence lack efficient algorithms that provide an exact solution in polynomial time. The human brain, for example, is a known machine believed to apply approximate solution methods in order to solve complex problems, and it is therefore able to present partial solutions in a time-efficient manner. This approximative power and the imprecise modus operandi of human reasoning in solving NP-complete problems inspire most of the algorithms encompassed by soft computing. A non-exhaustive list of such algorithms includes: fuzzy logic, evolutionary computation, machine learning, and Bayesian networks.
Most methods of soft computing share a characterizing trait: all of them employ an explicit or implicit learning policy. In some cases this procedure takes the form of the guided convergence of an optimization algorithm, as in evolutionary computation. In other cases it is explicitly employed in the design phase of the algorithm, as in fuzzy inference systems. This learning policy may take place either online or offline. Machine learning methods such as neural networks or support vector machines have an offline learning procedure, which takes place before the algorithm runs and whose goal is to fit the output of the algorithm to the one provided in the data. Another possible variant is to realize this procedure online; such methods are known as adaptive methods.
Although a number of new methods are proposed every year that improve on the performance of older ones, soft computing still lacks, in general, a solid analysis of its algorithms through the magnifying lens of information theory. Such an analysis would help replace the case-based comparison of algorithms and, most importantly, would help clarify how information is processed in these usually bio-inspired systems. It may therefore be a valuable tool for improving soft computing systems through theoretical analysis rather than the trial-and-error approach which has lately been dominant in the field.