Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains AI models to produce outputs preferred by humans by learning from human-provided rankings or ratings rather than purely from raw data.
Understanding Reinforcement Learning from Human Feedback (RLHF)
RLHF was instrumental in turning raw large language models into the helpful, harmless, and honest assistants seen in products like ChatGPT and Claude. The process typically involves three stages: supervised fine-tuning on high-quality demonstrations; training a reward model from human preference data, where humans rank multiple model outputs from best to worst; and fine-tuning the original model with reinforcement learning, commonly Proximal Policy Optimization (PPO), to maximize the learned reward signal.
The key insight behind RLHF is that it is easier for humans to compare outputs ("A is better than B") than to specify exactly what a good output looks like. This comparative preference signal can be aggregated into a reward model that generalizes beyond the rated examples.
RLHF significantly improves the helpfulness and safety of deployed models, but it is not without limitations. Models can learn to "reward hack", producing outputs that score highly on the reward model without genuinely being better. The quality of RLHF is also bounded by the quality of human raters, whose preferences may be inconsistent or biased. Alternatives and extensions include Direct Preference Optimization (DPO), which achieves similar alignment without a separate reward model, and Constitutional AI (CAI), which uses AI feedback rather than human feedback.
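The reward-model stage described above can be sketched in miniature. This is a toy illustration, not a real implementation: it assumes each model output is already summarized as a small hand-made feature vector and uses a linear reward model, trained by gradient descent on the standard Bradley-Terry preference loss, -log sigmoid(r_chosen - r_rejected).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    # Linear reward model: score = w . features
    return sum(wi * fi for wi, fi in zip(w, features))

# Hypothetical human preference data: each pair is
# (features of the chosen output, features of the rejected output).
pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.3, 0.7]),
    ([0.9, 0.3], [0.2, 0.8]),
]

w = [0.0, 0.0]  # reward-model parameters
lr = 0.5
for _ in range(200):
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        # Gradient of -log sigmoid(margin) w.r.t. the margin:
        grad_scale = sigmoid(margin) - 1.0
        for i in range(len(w)):
            w[i] -= lr * grad_scale * (chosen[i] - rejected[i])

# After training, the reward model scores every chosen output
# above its rejected counterpart.
margins = [reward(w, c) - reward(w, r) for c, r in pairs]
```

In a real RLHF pipeline the reward model is itself a large neural network and the "features" are the model's internal representation of a full prompt and response, but the comparative loss has the same shape.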
How GAIA uses Reinforcement Learning from Human Feedback (RLHF)
GAIA's underlying language models are trained with RLHF to produce helpful, accurate, and safe responses. The alignment instilled through RLHF is what allows GAIA to handle sensitive personal data — emails, calendar events, tasks — and make reasonable judgments about what requires user attention versus what can be handled autonomously. GAIA benefits from RLHF without exposing users to the raw, unaligned model behavior.
Related concepts
Constitutional AI
Constitutional AI (CAI) is a training methodology developed by Anthropic that aligns AI models with human values by having the AI evaluate and revise its own outputs against a written set of principles — a 'constitution' — rather than relying exclusively on human-labeled preference data.
Fine-tuning
Fine-tuning is the process of continuing the training of a pre-trained AI model on a smaller, task-specific dataset in order to adapt its behavior to a particular domain or application.
Large Language Model (LLM)
A Large Language Model (LLM) is a deep learning model trained on vast collections of text, capable of understanding, generating, and reasoning about human language across a wide variety of tasks.
Human in the loop
Human in the loop (HITL) is a design pattern in which an AI system includes human oversight and validation at key decision points, ensuring that sensitive or high-impact actions require human confirmation before execution.
Prompt engineering
Prompt engineering is the practice of designing and refining the instructions given to AI language models to reliably obtain desired outputs, shaping their behavior without modifying their internal parameters.