CriticGPT

OpenAI has trained a model, CriticGPT, to catch bugs in GPT-4's code, and is starting to integrate such models into its RLHF alignment pipeline to help humans supervise AI on difficult tasks.

CriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF.

On this website we explore and evaluate this model.

What is CriticGPT?

A model based on GPT-4, named CriticGPT, has been developed to identify errors in ChatGPT's code output. Studies have shown that individuals using CriticGPT to review ChatGPT code outperform those without such assistance 60% of the time. Work is underway to integrate models like CriticGPT into the RLHF labeling pipeline, providing trainers with explicit AI support. This integration aims to improve the evaluation of outputs from advanced AI systems, which can be challenging to assess without better tools.

The GPT-4 series models that power ChatGPT are trained to be helpful and interactive through "Reinforcement Learning from Human Feedback" (RLHF). A crucial part of RLHF is collecting comparisons in which AI trainers rate different ChatGPT responses against each other.
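
The post does not spell out the training objective, but reward models for RLHF are commonly fit to such comparisons with a pairwise (Bradley-Terry) loss. The following is a minimal sketch of that standard loss, not OpenAI's code:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_preferred: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the
    trainer-preferred response above that of the rejected one."""
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# Toy usage with stand-in scalar rewards for a batch of two comparisons.
loss = pairwise_preference_loss(torch.tensor([1.2, 0.3]),
                                torch.tensor([0.9, 0.5]))
print(loss)  # lower loss means the reward model agrees with the trainers
```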

As reasoning and model behavior advance, ChatGPT becomes more accurate and its errors more subtle. This makes it difficult for AI trainers to spot inaccuracies, complicating the comparison task essential to RLHF. It also points to a fundamental limitation of RLHF: models may gradually become more knowledgeable than any person who could provide feedback.

To address this challenge, CriticGPT has been trained to write critiques that highlight inaccuracies in ChatGPT’s answers.

CriticGPT in action

CriticGPT’s suggestions are not always accurate, but they significantly assist trainers in identifying many more issues with model-generated answers than they would without AI support. 

Furthermore, when individuals use CriticGPT, the AI enhances their capabilities, leading to more thorough critiques compared to when they work independently, and it produces fewer hallucinated bugs than when the model operates alone. In experiments, a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted individual more than 60% of the time.

Methods

CriticGPT was trained using RLHF, similar to ChatGPT. However, unlike ChatGPT, it was exposed to numerous inputs containing deliberate mistakes, which it then had to critique. AI trainers manually inserted these mistakes into code written by ChatGPT and then wrote example feedback as if they had identified the bug they had just inserted. The same individual then compared multiple critiques of the modified code, making it easy to tell whether a critique identified the inserted bug.

Experiments assessed CriticGPT's ability to catch both inserted bugs and naturally occurring ChatGPT bugs previously identified by a trainer. Trainers preferred CriticGPT's critiques over ChatGPT's in 63% of cases involving naturally occurring bugs, partly because CriticGPT produced fewer "nitpicks" and hallucinated problems less frequently.
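
To make the tampering-and-comparison step concrete, here is a hypothetical sketch of a training example and the preference judgment built from it. The field names and the keyword check are illustrative stand-ins, not OpenAI's actual pipeline:

```python
# Illustrative sketch of the tampering pipeline described above; all
# names are hypothetical and the bug-matching check is a toy heuristic,
# not OpenAI's actual labeling procedure.
from dataclasses import dataclass

@dataclass
class TamperedExample:
    question: str          # original coding task given to ChatGPT
    tampered_code: str     # ChatGPT's answer with a trainer-inserted bug
    bug_description: str   # trainer's own write-up of the inserted bug

def prefer_critique(example: TamperedExample,
                    critique_a: str, critique_b: str) -> str:
    """Comparison step: the trainer who inserted the bug can easily check
    which critique actually flags it. Here that check is approximated
    with a simple substring match on the bug description."""
    hit_a = example.bug_description.lower() in critique_a.lower()
    hit_b = example.bug_description.lower() in critique_b.lower()
    if hit_a and not hit_b:
        return "a"
    if hit_b and not hit_a:
        return "b"
    return "tie"  # both or neither critique caught the inserted bug

example = TamperedExample(
    question="Return the largest element of a list.",
    tampered_code="def largest(xs):\n    return min(xs)",
    bug_description="uses min instead of max",
)
print(prefer_critique(example,
                      "The code uses min instead of max.",
                      "Looks correct to me."))  # -> "a"
```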

Additionally, longer and more comprehensive critiques were generated by running extra test-time search against the critique reward model. This search procedure allows the aggressiveness with which the model hunts for code problems to be tuned, configuring a precision-recall trade-off between hallucinations and detected bugs and yielding critiques that are highly useful for RLHF. Further details can be found in the research paper.
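
The paper's actual search procedure (Force Sampling Beam Search) is more involved, but the general idea can be illustrated by reranking sampled critiques with a reward score plus a tunable bonus per flagged problem. All names and the scoring functions below are hypothetical stand-ins, not OpenAI's implementation:

```python
# Minimal sketch of reward-model-guided test-time search: pick the
# candidate critique maximizing reward(c) + aggressiveness * flags(c).
# Higher aggressiveness favors critiques that raise more issues (better
# recall, more risk of nitpicks and hallucinations); lower values favor
# conservative critiques (better precision).
from typing import Callable, List

def select_critique(candidates: List[str],
                    reward: Callable[[str], float],
                    count_flags: Callable[[str], int],
                    aggressiveness: float = 0.0) -> str:
    return max(candidates,
               key=lambda c: reward(c) + aggressiveness * count_flags(c))

# Toy usage with dummy stand-ins for the reward model and flag counter.
candidates = ["Critique A: off-by-one in loop.",
              "Critique B: off-by-one in loop; also unused import."]
pick = select_critique(candidates,
                       reward=lambda c: -0.01 * len(c),      # dummy reward
                       count_flags=lambda c: c.count(";") + 1,
                       aggressiveness=0.5)
print(pick)  # the higher aggressiveness setting favors critique B
```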

Limitations of CriticGPT

CriticGPT was trained on relatively short ChatGPT answers. Supervising future agents will require methods that aid trainers in understanding long and complex tasks. Models still hallucinate, and trainers sometimes make labeling mistakes influenced by these hallucinations. 

Real-world mistakes can often be dispersed across multiple parts of an answer, while the current work focuses on errors that can be identified in a single location. In the future, tackling dispersed errors will be necessary. CriticGPT’s assistance is limited; if a task or response is extremely complex, even an expert with model assistance may struggle to evaluate it correctly.

Next Steps

To align increasingly complex AI systems, better tools are essential. The CriticGPT research suggests that applying RLHF to GPT-4 itself holds promise for helping humans produce better RLHF data for GPT-4. Plans are underway to scale this work further and put it into practice.