Feature selection is a ubiquitous problem across data mining,
bioinformatics, and pattern recognition, known variously as variable
selection, dimensionality reduction, and others. Methods based on
information theory have been tremendously popular over the past decade,
with dozens of 'novel' algorithms and hundreds of applications published
in domains across the spectrum of science and engineering. In this work, we
asked the question 'what are the implicit underlying statistical
assumptions of feature selection methods based on mutual information?'
The main result I will present is a unifying probabilistic framework for
information-theoretic feature selection, bringing almost two decades of
research on heuristic methods under a single theoretical interpretation.