
8 min read

RALF: A Reinforced Active Learning Formulation for Object Class Recognition

Sandra Ebert, Mario Fritz, Bernt Schiele, “RALF: A Reinforced Active Learning Formulation for Object Class Recognition”, CVPR, 2012. [1]

What problem does this paper try to solve, i.e., its motivation

This paper addresses the challenge of balancing the selection of informative samples in pool-based active learning, where the entire dataset is treated as a pool from which samples are selected, according to given criteria, for labeling by an oracle. The paper highlights the “exploitation-exploration dilemma” that arises when only a single criterion is used to select examples from the pool. The main issues with purely exploitative or purely explorative criteria can be summarized as follows:

  • Pure exploitative criteria - These are often implemented using uncertainty-based metrics. Such metrics prioritize examples the model is uncertain about, i.e., those that are difficult to learn are ranked much higher. A purely exploitative strategy therefore often leads to sampling bias and insufficient coverage of the data space, since the model focuses on outliers. This becomes more pronounced in multi-class scenarios, where the dataset can be imbalanced and labels for some classes are requested far more often while other classes are ignored. The problem is also prominent on more challenging datasets where examples of the same class are split into several dense regions of the feature space.

  • Pure explorative criteria - These are often implemented using density-based metrics that leverage the underlying unlabeled data distribution, e.g., via clustering or linear reconstruction, to find the most representative samples. Although this approach samples from the entire data space, it ignores feedback gathered during the labeling process and therefore needs many iterations, i.e., too many labeling requests, before a good decision boundary is found. Prior methods have attempted to resolve this dilemma by combining different criteria. The main problems the authors point out with these approaches are as follows:

    • These typically combine only two criteria and are difficult to balance effectively in practice.
    • The combination of criteria is usually fixed rather than time-varying, so they lack the flexibility to adapt the trade-off over time.
    • These methods do not generalize well to multi-class scenarios.
    • They mostly overlook feedback from the classifier, which could be used to learn from previous AL rounds and improve subsequent sample selections and label requests.
    • They offer limited flexibility for accommodating additional criteria, which restricts the range of strategies that can be realized.
    • Many of these methods face computational challenges.
    • The authors show that no single, pre-determined scheme works well across all datasets with different properties.

How does it solve the problems?

The paper addresses the issues listed above by proposing a reinforced active learning formulation (RALF) that treats the entire active learning procedure as a meta-learning process in which the strategy is learned “on the fly” from feedback. In the first part, the authors motivate the problem by discussing two exploitation criteria, namely Entropy and Margin, and three exploration criteria, namely Node potential, Kernel farthest first, and Graph density. Graph density is introduced by the authors as a novel sampling criterion that uses a \(k\)-nearest-neighbour graph to identify highly connected nodes. This is further discussed in the next section [A list of novelties / contributions].
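To make the exploitation criteria concrete, here is a minimal NumPy sketch of the Entropy and Margin scores computed from a classifier's per-class posteriors. This is my own illustration rather than the paper's code, and the function names are placeholders.

```python
import numpy as np

def entropy_scores(probs):
    """Entropy criterion: Shannon entropy of the predicted class
    posteriors; higher means more uncertain."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def margin_scores(probs):
    """Margin criterion: negated gap between the two most probable
    classes; higher means a smaller margin, i.e. more uncertain."""
    top_two = np.sort(probs, axis=1)[:, -2:]
    return -(top_two[:, 1] - top_two[:, 0])

# probs: (n_unlabeled, n_classes) posteriors from the current classifier
probs = np.array([[0.60, 0.30, 0.10],
                  [0.34, 0.33, 0.33]])
print(entropy_scores(probs))  # the second sample scores higher on both criteria
print(margin_scores(probs))
```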

Next, the paper describes the datasets and the classifiers used in the experiments. In order of increasing number of object classes, the datasets are ETH-80 (ETH) with 8 classes, Cropped PASCAL (C-PASCAL) with 20 classes, and Caltech 101 with 102 classes. The paper uses two classifiers, one semi-supervised and one supervised: Label Propagation (LP) on a \(k\)-NN graph and a Support Vector Machine (SVM) with an RBF kernel.
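As a point of reference, comparable classifiers are readily available off the shelf; a minimal scikit-learn sketch (my own choice of library, with placeholder hyperparameters rather than the paper's settings) would be:

```python
from sklearn.semi_supervised import LabelPropagation
from sklearn.svm import SVC

# Semi-supervised: label propagation over a k-NN graph
# (scikit-learn marks unlabeled samples with the label -1).
lp = LabelPropagation(kernel="knn", n_neighbors=10)

# Supervised: SVM with an RBF kernel; probability=True exposes the
# class posteriors needed by the uncertainty-based criteria above.
svm = SVC(kernel="rbf", probability=True)
```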

Multiple experiments are performed to analyze each single criterion and classifier pair, with random sampling from a uniform distribution as the baseline. The paper also highlights that some datasets might need more exploration at the beginning and more exploitation at the end (or vice versa), while others might need a constant trade-off. Thus, numerous experiments are also performed to analyze fixed and time-varying combinations of exploration and exploitation criteria, with Eq. 10 in the paper defining the AL framework. Different choices of \(\beta(t)\), such as \(0.5\), \(0\), \(1\), \(\log(t)\), and \(t\), denote a fixed mixture, pure exploration, pure exploitation, and time-varying strategies, respectively. These experiments lead to the following major observations:

  • LP is always better than SVM, so subsequent experiments are done only on LP.
  • Exploitation criteria work better than exploration criteria due to local feedback after each iteration.
  • The novel criterion, Graph density, works best for all datasets in combination with the Entropy or Margin criterion.
  • No single, pre-determined scheme, whether a single criterion or a time-varying combination, works well across all datasets with different properties.

Motivated by these experimental results, the paper models the progress of the learned classifier and uses it to control the trade-off between criteria. This is the main difference from earlier works, which overlooked feedback from the classifier. The active learning process is formulated as a feedback-driven Markov decision process (MDP) solved with Q-learning, which learns a Q-table whose states are the different combinations of two criteria and whose \(n\) actions represent \(n\) fixed trade-offs. The authors argue the strategy is still time-varying since it can switch between actions over time. Q-learning is chosen because it is model-free and computationally fast. The initialization problem, where the method starts with an empty \(Q\)-table, is solved with a guided initialization phase that uses a prediction-probability-weighted entropy, aggregated over the labels \(j\) of each sample \(x_i\), to select the next action.
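The selection rule and the reinforcement signal can be sketched roughly as below. This is a simplified illustration rather than the paper's exact formulation: the state encoding, reward rescaling, and guided initialization are omitted, the reward is simply the drop in mean predictive entropy between rounds, and all names and parameter values are my own.

```python
import numpy as np

# n fixed trade-offs; following the convention above, beta = 0 is pure
# exploration and beta = 1 is pure exploitation.
BETAS = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
ALPHA, GAMMA, EPSILON = 0.1, 0.5, 0.1   # placeholder Q-learning parameters

# Q-table; here the state is just the previously chosen action
# (the paper's state definition is richer; this is only a sketch).
Q = np.zeros((len(BETAS), len(BETAS)))

def select_action(state, rng):
    """Epsilon-greedy choice of a trade-off beta."""
    if rng.random() < EPSILON:
        return int(rng.integers(len(BETAS)))
    return int(np.argmax(Q[state]))

def combined_scores(explore, exploit, beta):
    """Eq.-10-style mixture of one exploration and one exploitation criterion."""
    return (1.0 - beta) * explore + beta * exploit

def q_update(state, action, reward, next_state):
    """Standard one-step Q-learning update."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])

# One active-learning round, schematically: score the pool with both
# criteria, pick beta via the Q-table, query the top-ranked sample,
# retrain, and reward the action by the decrease in mean entropy.
rng = np.random.default_rng(0)
state = 0
explore, exploit = rng.random(100), rng.random(100)   # stand-in criterion scores
action = select_action(state, rng)
query_idx = int(np.argmax(combined_scores(explore, exploit, BETAS[action])))
entropy_before, entropy_after = 1.20, 1.05             # stand-in mean entropies
reward = entropy_before - entropy_after
q_update(state, action, reward, next_state=action)
```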

A list of novelties / contributions

The following are the main contributions of the paper:

  • The paper proposes a novel exploration-based sampling criterion called Graph density, which builds a \(k\)-NN graph with Manhattan distance (and \(k = 10\)), weights the edges with a Gaussian kernel, normalizes by the number of edges, and uses the result to rank data points as class representatives (see the sketch after this list). The main intuition is that each class representative is highly embedded in the graph and is thus well connected, with many edges. The normalization and the down-weighting of direct neighbors avoid over-sampling dense regions or outliers. The criterion is more robust than Node potential due to the underlying \(k\)-NN graph structure.
  • The paper empirically shows that no single criterion works best across all datasets.
  • The paper integrates the probability of switching between exploration and exploitation into the already parameterized time-varying AL framework so that there is always a mix of two criteria.
  • Since previous reward rescaling functions do not generalize well to new datasets, the paper proposes a new rescaling function that maps all previously observed rewards into the interval \([-1, 1]\).
  • The authors also propose a new reward function using the difference in the overall entropies between the two time steps. This is closer to the learning progress of the classifier instead of just a change in the prediction.
  • The main AL process is formulated as a Markov decision process which is learnt using the Q-learning method. This is a model-free and computationally efficient RL algorithm. There are only two parameters for Q-learning and no dataset-specific tuning is needed.
  • To tackle the challenge of initializing an empty Q-table, and to avoid the computationally expensive approach of visiting every state-action pair, the authors propose a guided initialization phase.
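To make the Graph density criterion concrete, here is a minimal sketch based on the description above (a \(k\)-NN graph with Manhattan distance, Gaussian edge weights, normalization by the number of edges). The exact weighting and the down-weighting of direct neighbors in the paper may differ; this is my own approximation, not the authors' code.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def graph_density(X, k=10, sigma=1.0):
    """Rank samples by how strongly they are embedded in a k-NN graph;
    well-connected points in dense regions score highest."""
    # Symmetric k-NN graph with Manhattan (L1) distances on the edges.
    dist = kneighbors_graph(X, n_neighbors=k, mode="distance", metric="manhattan")
    dist = dist.maximum(dist.T).tocsr()              # make the graph undirected
    # Gaussian weighting of the edge distances.
    weights = dist.copy()
    weights.data = np.exp(-weights.data**2 / (2 * sigma**2))
    # Sum of edge weights, normalized by the number of edges per node,
    # so that dense hubs do not dominate purely through their degree.
    edges_per_node = np.diff(weights.indptr)         # CSR row lengths
    return np.asarray(weights.sum(axis=1)).ravel() / np.maximum(edges_per_node, 1)

X = np.random.default_rng(0).random((50, 16))        # toy feature matrix
scores = graph_density(X)
most_representative = np.argsort(-scores)[:5]        # candidates to query first
```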

What do you think are the downsides of the work?

The authors run several experiments to show that their novel sampling criterion works best compared to other explorative criteria and that time-varying strategies are better than fixed ones. Their proposed MDP-based method, RALF, outperforms the compared methods and works well in the multi-class scenario for three different datasets. However, there are a few downsides to this work.

  • Kernel and classifier - The authors choose a \(k\)-NN-based Label Propagation (LP) classifier and an RBF-kernel SVM, show that LP consistently outperforms SVM, and run all subsequent experiments only on LP. It is not clear from the experiments whether the choice of kernel affects the classification performance and the AL process; an additional RBF-based LP could have resolved this ambiguity. The results of the subsequent experiments might also differ with a different classifier.

  • Task and dataset - The authors choose three multi-class datasets of varying difficulty. However, all experiments demonstrate the method's advantage only on a single task, multi-class object classification. It is not obvious whether the proposed method would generalize to other tasks and domains, whether in computer vision, NLP, or time-series data.

  • Limited sampling criteria - The parameterization of the AL framework to balance between explorative and exploitative sampling limits the strategies to only a mixture of two criteria, one from each category.

  • Fixed trade-offs - The authors use \(n\) fixed trade-offs in \([0, 1]\) and argue that the strategy is still time-varying since it can switch between different actions. However, the range of possible time-varying strategies remains limited.

  • RL algorithm - The authors choose Q-learning for learning the strategies. As the authors point out, Q-learning is computationally efficient only when the number of states and actions is kept small to speed up initialization, which undermines the intended flexibility by restricting the range of strategies.

Minor point - \(U\) is overloaded: in Sec. 3.1, \(U\) denotes the set of all unlabeled examples, while in Sec. 6 it denotes the set of all uncertainty-based criteria.

References

[1] Sandra Ebert, Mario Fritz, and Bernt Schiele. RALF: A reinforced active learning formulation for object class recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3626–3633. IEEE, 2012.