Learning Active Learning from Data

Ksenia Konyushkova, Raphael Sznitman, Pascal Fua, “Learning active learning from data”, NeurIPS, 2017. [1]

What problem does this paper try to solve, i.e., its motivation

The main problems this paper tries to solve can be summarized as follows:

  • Prior active learning (AL) methods rely on combinations of hand-crafted heuristics and are therefore restricted to those existing techniques.
  • Assessing the classification performance of previous AL methods is unreliable because it depends heavily on the scarce annotated data.
  • The paper also highlights a problem with uncertainty sampling (US), the most popular sampling criterion, on unbalanced datasets: the more imbalanced the classes are, the further US's choice is from the optimal one. Although query selection procedures can account for statistical properties of datasets and classifiers, in complex scenarios with many factors such as label noise, outliers and the shape of the distribution, there is no easy way to take all possible factors into account.

How does it solve the problem?

This paper attempts to solve the above problems using two features.

  • They look at a continuum of AL strategies instead of combinations of pre-specified heuristics.
  • They avoid the need for performance evaluation of the classification quality for application-specific data as their approach can learn from previous tasks and easily transfer strategies to new domains.

To achieve this, the authors propose a data-driven approach and formulate Learning Active Learning (LAL) as a regression problem. Specifically, given a trained classifier and its output for an unlabeled sample, they predict the reduction in generalization error obtained by annotating that data point and adding it to the training set. They show that this regression function can be trained on synthetic data using simple features, such as the variance of the classifier output or the predicted probability distribution over labels for a data point. Since these features are not domain-specific, a regressor trained on synthetic data can be applied directly to other classification tasks.
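
As a rough sketch of this selection rule (the feature choice and function names here are illustrative, not the paper's exact ones), a regressor maps a learning-state vector built from classifier and data-point properties to a predicted error reduction, and the query is the unlabeled point maximizing that prediction:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def learning_state(clf, X_unlabeled, n_labeled):
    """One state vector per candidate point (the feature choice is illustrative)."""
    proba = clf.predict_proba(X_unlabeled)               # predicted class probabilities
    # Variance of per-tree predictions, a simple uncertainty feature for an RF.
    tree_preds = np.stack([t.predict(X_unlabeled) for t in clf.estimators_])
    var = tree_preds.var(axis=0)
    size = np.full(len(X_unlabeled), n_labeled)          # size of the labeled set
    return np.column_stack([proba, var, size])

def lal_select(regressor, clf, X_unlabeled, n_labeled):
    """Greedily query the sample with the highest predicted error reduction."""
    xi = learning_state(clf, X_unlabeled, n_labeled)
    return int(np.argmax(regressor.predict(xi)))
```

Because none of these features refer to the input domain, the same trained regressor can score candidates for any binary classification task.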

The authors use uncertainty sampling (US) as it is the most popular and widely applicable sampling criterion. To motivate the need for data-driven approaches that improve AL strategies and handle scenarios where US fails, the authors present two toy examples with balanced and unbalanced classes. For balanced datasets, US is the best greedy approach. However, they show that US becomes sub-optimal by design in the imbalanced case: when one class has much more data than the other, the data point corresponding to the largest expected error reduction is no longer the one with predicted probability 0.5.
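
To see why US ignores class balance, here is a minimal sketch (the candidate probabilities are made up): US always queries the point whose predicted probability is closest to 0.5, however imbalanced the classes are, whereas the paper shows the error-optimal query shifts away from 0.5 under imbalance.

```python
import numpy as np

# Five hypothetical candidates with predicted p(y=1|x) values.
probs = np.array([0.05, 0.2, 0.48, 0.7, 0.95])
# US score: higher = more uncertain; maximized where p is nearest 0.5.
us_scores = 1 - np.maximum(probs, 1 - probs)
picked = int(np.argmax(us_scores))
print(picked, probs[picked])  # → 2 0.48  (the candidate nearest 0.5)
```

The score depends only on the predicted probabilities, so US makes the same choice whether the classes are split 50/50 or 99/1.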

The paper treats AL strategy learning as a data-driven Monte Carlo procedure and proposes two approaches to constructing the datasets for training the regressor. The resulting method is quite fast during the online AL steps.

  • Independent LAL strategy incorporates unused labels at random, recording how test performance changes as a function of the classifier's state and the properties of the new data point. Specifically, the classifier is characterized by \(K\) parameters and any new randomly selected data point by \(R\) parameters. Together these form the characterization vector \(\xi\) of the learning state for the classifier-data point pair. The difference between the losses (\(\delta_x = l_\tau - l_x\)) of the classifiers \(f_\tau\) and \(f_x\), trained without and with the new randomly selected data point, is recorded for each learning state. This is repeated for \(Q\) different initializations and \(T\) different labeled subset sizes, each with \(M\) different data points. Thus, a dataset \(\Xi \in \mathbb{R}^{(QMT) \times (K+R)}\) is created and a regressor can be trained to learn the mapping from a learning state \(\xi \in \mathbb{R}^{K+R}\) to the expected error reduction \(\delta\). The LALIndependent method then greedily selects, at each step of the AL process, the sample with the highest predicted error reduction.
  • Iterative LAL accounts for the selection bias of AL by simulating the AL procedure: in each iteration, the most promising sample is selected using the strategy learnt from the samples collected in previous iterations. The final strategy therefore captures the sampling bias present in the data.
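
The LALIndependent dataset construction above can be sketched as follows; the helper name, the zero-one loss, and the two-dimensional learning state are illustrative simplifications, not the paper's exact choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import zero_one_loss

def build_lal_dataset(X, y, X_test, y_test, Q=3, T=3, M=5, seed=0):
    """Collect (learning state, error reduction) pairs for the LAL regressor."""
    rng = np.random.RandomState(seed)
    states, deltas = [], []
    for _ in range(Q):                                  # Q random initializations
        for t in range(2, 2 + T):                       # T labeled-subset sizes
            lab = rng.choice(len(X), size=t, replace=False)
            f_tau = RandomForestClassifier(n_estimators=10, random_state=seed)
            f_tau.fit(X[lab], y[lab])
            l_tau = zero_one_loss(y_test, f_tau.predict(X_test))
            rest = np.setdiff1d(np.arange(len(X)), lab)
            for c in rng.choice(rest, size=M, replace=False):  # M random new points
                idx = np.append(lab, c)
                f_x = RandomForestClassifier(n_estimators=10, random_state=seed)
                f_x.fit(X[idx], y[idx])
                l_x = zero_one_loss(y_test, f_x.predict(X_test))
                # Toy learning state: labeled-set size + f_tau's confidence on x.
                p = f_tau.predict_proba(X[c:c + 1])[0].max()
                states.append([t, p])
                deltas.append(l_tau - l_x)              # delta_x = l_tau - l_x
    return np.array(states), np.array(deltas)
```

A regressor such as `RandomForestRegressor` can then be fit on the \(QMT\) collected pairs to predict \(\delta\) for new learning states.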

Experiments are conducted for LALIndependent and LALIterative with (a) a cold start with one sample from each class, and (b) a warm start with a larger dataset. The synthetic datasets used are (a) 2D Gaussian point clouds and (b) XOR-like data. Once the regressors are trained on these synthetic datasets, they are tested on real-world data. The authors use a Gaussian Process classifier (GPC) and a Random Forest (RF). With a cold start on 2D synthetic data, the two LAL methods outperform the baselines of random sampling (RS) and uncertainty sampling (US), as well as the previous AL methods Kapoor [2] and ALBE [3], in both speed and accuracy. Subsequent experiments use only RF due to computational cost. The proposed methods trained on synthetic datasets perform remarkably well on real data such as the Striatum, MRI and Credit card datasets, beating RS, US and ALBE. LALIndependent with a warm start is applied to the Splice and Higgs datasets and outperforms RS, US and ALBE. The paper reports a variety of metrics showing the method's robustness to the choice of loss function used to measure error reduction.

A list of novelties / contributions

The following are the main contributions of the paper:

  • The paper empirically shows that uncertainty sampling (US), the most popular and widely used exploitative sampling method, selects sub-optimal samples on imbalanced datasets. The greater the class imbalance, the further US's choice is from the optimal one.
  • The paper formulates learning an AL strategy as a regression problem and shows that the error reduction from adding a new labeled data point can be predicted by a regressor, given the properties of the classifier and that data point.
  • The classifier can be characterized by simple features such as kernel parameters for kernel-based classifiers, average tree depth for tree-based classifiers, or prediction variability for ensembles. Data points can likewise be characterized by simple features such as the predicted probability for a class, the distance to the closest point in the dataset, and the distance to the closest labeled point.
  • The authors show that the AL regressor generalizes from synthetic to real data: it can be trained on 2D synthetic datasets and then applied, with a warm start and further tuning on real-world datasets, to both binary classification and segmentation with remarkable performance.
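
As a hedged illustration of how cheap such features are to compute for an RF with scikit-learn (this is not the paper's exact feature list), one might extract an average tree depth, a per-point prediction variance across trees, and a distance to the nearest labeled point:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def simple_features(clf, x, X_labeled):
    """Domain-agnostic features for one candidate point x (illustrative choice)."""
    avg_depth = np.mean([t.get_depth() for t in clf.estimators_])     # classifier feature
    tree_preds = np.array([t.predict(x.reshape(1, -1))[0] for t in clf.estimators_])
    pred_var = tree_preds.var()                                       # ensemble disagreement
    d_nearest = np.linalg.norm(X_labeled - x, axis=1).min()           # data-point feature
    return np.array([avg_depth, pred_var, d_nearest])
```

None of these quantities depend on what the inputs represent, which is what lets a regressor trained on 2D synthetic data transfer to real tasks.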

What do you think are the downsides of the work?

Although the paper highlights key challenges in previous AL strategies and sampling methods and formulates a data-driven regression problem to tackle them, the proposed method has a few drawbacks.

  • Task and dataset - The experiments are performed only on binary classification and binary segmentation datasets. It is unclear if the proposed method would generalize to multi-class datasets. Also, the chosen datasets are mostly image or numerical data and further experiments may be needed to show efficacy on text, video or time-series datasets.
  • Hand-designed features for classifier and data point - The hand-crafted features used to characterize the classifier and the data point under consideration require careful design decisions and are time-consuming to devise. In Sec. 5, the paper lists the six features used. These features constitute the learning state and can heavily influence the training of the regressor, which partly undercuts the earlier claim of avoiding hand-crafted heuristics.
  • RF Classifier - The paper mostly reports results with an RF classifier as the task model and an RF regressor for learning AL; only the first experiment also uses GPC, citing computational cost. This leaves it ambiguous whether the choice of classifier influences the regressor's performance. It is also not obvious how the approach generalizes to more complex image segmentation or text models.

References

[1] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[2] Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Active learning with Gaussian processes for object categorization. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.

[3] Wei-Ning Hsu and Hsuan-Tien Lin. Active learning by learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.