William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber, “Active preference learning for large language models”, ICML, 2024. [1]
What problem does this paper try to solve, i.e., its motivation
Fine-tuning techniques for aligning large language models (LLMs), such as reinforcement learning from human feedback (RLHF), require careful and effective use of human labelling resources. Prior work has also used feedback from AI models (RLAIF) to align smaller language models. However, these methods are quite complex and unstable. Direct Preference Optimization (DPO) is a much simpler and more stable technique for aligning LLMs. Active learning has also been explored to improve the fine-tuning of LLMs. The main problems with these previous approaches to fine-tuning and aligning LLMs with active learning are as follows:
- Relying only on human feedback becomes increasingly unviable as language models grow extremely large and fine-tuning requires large amounts of preference data.
- Methods relying on an AI model as an oracle for ranking responses tend to consult it for every data point, which is computationally inefficient [2].
- Prior methods that explore active learning rely on RLHF/RLAIF, which are complex and unstable.
- Many active learning (AL) sampling criteria are not straightforward to apply and may require modifications to the model architecture and the fine-tuning process itself.
How does it solve the problem?
This paper aims to solve the above problems in the following manner:
- The paper leverages the DPO technique within the active learning loop, making fine-tuning much simpler and more stable. DPO eliminates the need for a separate reward model and the multi-stage pipeline required to adapt RLHF to autoregressive LLMs.
- The authors propose several acquisition functions. First, predictive entropy (PE) is a widely used measure of uncertainty that has previously been shown to be well calibrated for LLMs. Second, preference certainty under the Bradley-Terry model captures the oracle’s preferences better than PE; this score is maximized when the difference between the implicit rewards of the two generations is large, and vice versa. Finally, the two measures are complementary and can be combined into a hybrid approach: a relatively large batch of prompts is ranked by entropy, only the top subset is used to generate prompt/completion pairs, and the pairs are then scored and ranked by preference certainty. This filtering step minimizes the number of oracle consultations required (see the sketches after this list).
- The paper shows that GPT-4 is much more self-consistent than GPT-3.5 and therefore selects GPT-4 as the oracle despite its higher cost and latency.
- The main LLMs fine-tuned using active learning are GPT-2 and Pythia (GPT-3-like) models. The pre-trained versions of these models were obtained from Hugging Face.
- In the fine-tuning step, the authors use a straightforward implementation of re-initialization, uniform sampling from the set of all previously acquired data, and fine-tuning to convergence (outlined in the second sketch below). The authors show that the proposed acquisition functions perform much better than the random sampling baseline.
- The histogram analysis of the results shows that the preference certainty and hybrid approaches are much better differentiated from the random sampling baseline.
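The two acquisition scores can be illustrated with a minimal sketch. The code below assumes Hugging Face causal LMs (e.g., the GPT-2 checkpoints mentioned in the review) and standard PyTorch; the function names, the Monte-Carlo estimate of predictive entropy, and the |p - 0.5| form of the certainty score are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def sequence_log_prob(model, tokenizer, prompt, completion):
    """Sum of token log-probabilities of `completion` given `prompt`.

    Assumes the prompt tokenizes identically on its own and inside the
    concatenation (usually the case for GPT-2-style BPE tokenizers).
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict token t + 1, so shift by one.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the completion tokens.
    return token_log_probs[:, prompt_len - 1:].sum().item()


def predictive_entropy(model, tokenizer, prompt, completions):
    """Monte-Carlo estimate of predictive entropy from sampled completions."""
    scores = [sequence_log_prob(model, tokenizer, prompt, c) for c in completions]
    return -sum(scores) / len(scores)


def preference_certainty(policy, reference, tokenizer, prompt, y1, y2, beta=0.1):
    """Certainty of the Bradley-Terry preference p(y1 > y2), built from the DPO
    implicit reward r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    r1 = beta * (sequence_log_prob(policy, tokenizer, prompt, y1)
                 - sequence_log_prob(reference, tokenizer, prompt, y1))
    r2 = beta * (sequence_log_prob(policy, tokenizer, prompt, y2)
                 - sequence_log_prob(reference, tokenizer, prompt, y2))
    p = torch.sigmoid(torch.tensor(r1 - r2)).item()
    # Large implicit-reward gap -> p far from 0.5 -> high certainty, and vice versa.
    return abs(p - 0.5)
```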
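Similarly, a rough sketch of one acquisition round of the hybrid scheme (entropy filtering, certainty ranking, oracle labelling, re-initialization, and DPO fine-tuning) is shown below. The helpers `sample_completions`, `query_oracle`, `reinitialize`, and `dpo_finetune`, the batch sizes, and the selection direction (highest certainty first) are hypothetical placeholders rather than details from the paper.

```python
def active_learning_round(policy, reference, tokenizer, prompt_pool,
                          labelled_pairs, entropy_batch=256, acquire_batch=32):
    """One acquisition round of the hybrid (PE + preference certainty) scheme."""
    # 1) Rank prompts by predictive entropy and keep a relatively large batch.
    by_entropy = sorted(
        prompt_pool,
        key=lambda x: predictive_entropy(
            policy, tokenizer, x, sample_completions(policy, tokenizer, x, n=4)),
        reverse=True,
    )[:entropy_batch]

    # 2) Generate a completion pair only for this top subset, then score and
    #    rank the pairs by preference certainty under the Bradley-Terry model.
    candidates = []
    for prompt in by_entropy:
        y1, y2 = sample_completions(policy, tokenizer, prompt, n=2)
        score = preference_certainty(policy, reference, tokenizer, prompt, y1, y2)
        candidates.append((score, prompt, y1, y2))
    candidates.sort(key=lambda c: c[0], reverse=True)

    # 3) Consult the oracle only for the selected pairs.
    for _, prompt, y1, y2 in candidates[:acquire_batch]:
        chosen, rejected = query_oracle(prompt, y1, y2)
        labelled_pairs.append((prompt, chosen, rejected))

    # 4) Re-initialize the policy from the pre-trained checkpoint and fine-tune
    #    it with DPO to convergence, sampling uniformly from all data acquired
    #    so far (both steps abstracted into the placeholder helpers).
    policy = reinitialize(reference)
    policy = dpo_finetune(policy, reference, labelled_pairs)
    return policy, labelled_pairs
```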
A list of novelties / contributions
The main contributions of the paper are as follows:
- The paper introduces the combination of DPO and active learning to simplify and stabilize the complex and data-hungry process of fine-tuning autoregressive LLMs. This eliminates the need for a separate reward model, as required by RLHF/RLAIF-based techniques.
- The authors propose a hybrid acquisition function combining predictive entropy and preference certainty. The entropy-based filtering step reduces the number of requests to the oracle, while preference certainty captures the oracle’s preferences much better.
What do you think are the downsides of the work?
Although the paper highlights the pressing challenge of fine-tuning LLMs with less data and proposes an AL method leveraging DPO to tackle it, the proposed method has a few drawbacks, as follows.
- Dataset - The paper focuses only on text generation tasks over generic text datasets: the first is a text completion task on movie reviews and the second is summarization of Reddit posts. Further analysis is required to show that the method works for domain-specific datasets and tasks.
- Model - The paper only runs experiments on GPT-2 and Pythia (GPT-3-like) models, both of which are decoder-only architectures. For completeness, it would be interesting to experiment with encoder-decoder models as well.
- Closed model as oracle - Since the paper uses only GPT-4 as the oracle, the effect of the oracle choice cannot be studied very well. Closed models also come at a much higher cost, which may make running multiple iterations of active learning uneconomical. Moreover, the choice of oracle is justified only by measuring the average self-consistency of GPT-3.5 and GPT-4; a comparison that includes open models would have been better.
- Prompts - The authors use only two prompts, one per dataset, so the effect of prompt engineering is not well studied.
- Comparative analysis - The paper mainly compares the proposed acquisition functions (predictive entropy, preference certainty, and the hybrid PE + PC) with a random sampling baseline. No comparison with other widely used sampling strategies is shown. The paper points out that Reward rAnked FineTuning (RAFT) [2] consults the oracle on every data point before filtering, yet no comparative study with previous AL methods for fine-tuning LLMs has been conducted.
Despite these drawbacks, the paper proposes a novel method that leverages Direct Preference Optimization (DPO) for active preference learning in the fine-tuning of LLMs.
References
[1] William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. In Forty-first International Conference on Machine Learning, 2024.
[2] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward rAnked FineTuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.