Apprentissage par renforcement avec rétroaction humaine - Revision history

Pitpitt on 18 April 2026 at 15:09

2026-04-18T15:09:26Z

← Previous revision		Revision as of 18 April 2026 at 11:09
Line 38:		Line 38:

	'''reinforcement learning from human preferences'''		'''reinforcement learning from human preferences'''
	~~<!-- To understand RLHF, we first need to understand the process of training a model like ChatGPT and where RLHF fits in, which is the focus of the first section of this post. The following 3 sections cover the 3 phases of ChatGPT development. For each phase, I’ll discuss the goal for that phase, the intuition for why this phase is needed, and the corresponding mathematical formulation for those who want to see more technical detail.~~
			In reinforcement learning, the algorithm learns a behavior from repeated experiments, so as to optimize the rewards received over time. Like unsupervised learning, reinforcement learning does not require labeled data.
	~~Currently, RLHF is not yet widely used in the industry except for a few big key players – OpenAI, DeepMind, and Anthropic. However, I’ve seen many work-in-progress efforts using RLHF, so I wouldn’t be surprised to see RLHF used more in the future.~~		Typically, an intelligent agent, immersed in an environment, makes a decision or performs an action based on its current state and observation of its environment.
	~~---~~
	~~Learning from instructions and human feedback are thought to be at the core of recent advances in instruction following large language models (LLMs). While recent efforts such as Open Assistant, Vicuna, and Alpaca have advanced our understanding of instruction fine-tuning, the same cannot be said for RLHF-style algorithms that learn directly from human feedback. AlpacaFarm aims to address this gap by enabling fast, low-cost research and development on methods that learn from human feedback. We identify three main difficulties with studying RLHF-style algorithms: the high cost of human preference data, the lack of trustworthy evaluation, and the absence of reference implementations. -->~~		In return for the agent's action, the environment provides the agent with a reward or punishment.

			Reinforcement learning can be seen as a game of trial and error, the aim of which is to determine which actions will maximize an intelligent agent's gains. In this way, it will develop optimal behavior, known as strategy or policy.


	==Español==		==Español==

Pitpitt: Text replacement: "==Español==" with "==Español== Catégorie:es"

2025-09-24T00:56:39Z

Text replacement: "==Español==" with "==Español== Catégorie:es"

← Previous revision		Revision as of 23 September 2025 at 20:56
Line 45:		Line 45:

	==Español==		==Español==
			[[Catégorie:es]]

	'''''aprendizaje por refuerzo a partir de la retroalimentación humana'''''		'''''aprendizaje por refuerzo a partir de la retroalimentación humana'''''

Pitpitt on 20 August 2025 at 23:51

2025-08-20T23:51:34Z

← Previous revision		Revision as of 20 August 2025 at 19:51
Line 72:		Line 72:
	[[Catégorie:GRAND LEXIQUE FRANÇAIS]]		[[Catégorie:GRAND LEXIQUE FRANÇAIS]]
	[[Catégorie:101]]		[[Catégorie:101]]
	~~[[Catégorie:Publication]]~~

Claude COULOMBE on 19 August 2025 at 20:52

2025-08-19T20:52:04Z

← Previous revision		Revision as of 19 August 2025 at 16:52
Line 43:		Line 43:
	---		---
	Learning from instructions and human feedback are thought to be at the core of recent advances in instruction following large language models (LLMs). While recent efforts such as Open Assistant, Vicuna, and Alpaca have advanced our understanding of instruction fine-tuning, the same cannot be said for RLHF-style algorithms that learn directly from human feedback. AlpacaFarm aims to address this gap by enabling fast, low-cost research and development on methods that learn from human feedback. We identify three main difficulties with studying RLHF-style algorithms: the high cost of human preference data, the lack of trustworthy evaluation, and the absence of reference implementations. -->		Learning from instructions and human feedback are thought to be at the core of recent advances in instruction following large language models (LLMs). While recent efforts such as Open Assistant, Vicuna, and Alpaca have advanced our understanding of instruction fine-tuning, the same cannot be said for RLHF-style algorithms that learn directly from human feedback. AlpacaFarm aims to address this gap by enabling fast, low-cost research and development on methods that learn from human feedback. We identify three main difficulties with studying RLHF-style algorithms: the high cost of human preference data, the lack of trustworthy evaluation, and the absence of reference implementations. -->


	==Español==		==Español==

Claude COULOMBE on 19 August 2025 at 20:31

2025-08-19T20:31:55Z

← Previous revision		Revision as of 19 August 2025 at 16:31
Line 5:		Line 5:

	==Compléments==		==Compléments==
	This type of learning is used in game-playing programs such as [[AlphaGo]] and in ~~text generators~~ based on [[grand modèle de langues\|large language models]].		This type of learning is used in game-playing programs such as [[AlphaGo]] and in [[Robot conversationnel génératif\|generative chatbots]] or [[générateurs automatique de textes\|automatic text generators]] based on [[grand modèle de langues\|large language models]].
	<hr/>		<hr/>
	The [[Modèle de récompense\|reward model]] is pre-trained to predict whether an output is good (high reward) or bad (low reward or penalty), so that the policy can then be optimized against it.		The [[Modèle de récompense\|reward model]] is pre-trained to predict whether an output is good (high reward) or bad (low reward or penalty), so that the policy can then be optimized against it.

Patrickdrouin on 19 August 2025 at 19:33

2025-08-19T19:33:46Z

← Previous revision		Revision as of 19 August 2025 at 15:33
Line 57:		Line 57:
	==Sources==		==Sources==

	[https://fr.wikipedia.org/wiki ~~Wikipedia -~~ Apprentissage_par_renforcement_%C3%A0_partir_de_r%C3%A9troaction_humaine]		[https://fr.wikipedia.org/wiki/Apprentissage_par_renforcement_%C3%A0_partir_de_r%C3%A9troaction_humaine Wikipedia - apprentissage par renforcement à partir de rétroaction humaine]

	[https://www.journaldunet.com/solutions/dsi/1518637-chatgpt-l-intelligence-artificielle-peut-elle-tenir-ses-promesses/ Journal du Net]		[https://www.journaldunet.com/solutions/dsi/1518637-chatgpt-l-intelligence-artificielle-peut-elle-tenir-ses-promesses/ Journal du Net]

Patrickdrouin on 19 August 2025 at 19:32

2025-08-19T19:32:09Z

← Previous revision		Revision as of 19 August 2025 at 15:32
Line 12:		Line 12:

	'''apprentissage par renforcement avec rétroaction humaine'''		'''apprentissage par renforcement avec rétroaction humaine'''

			'''apprentissage par renforcement à partir de rétroaction humaine'''

	'''apprentissage par renforcement à partir de retours humains'''		'''apprentissage par renforcement à partir de retours humains'''
Line 55:		Line 57:
	==Sources==		==Sources==

	~~[https://huyenchip.com/2023/05/02/rlhf.html Source : huyenchip]~~		[https://fr.wikipedia.org/wiki Wikipedia - Apprentissage_par_renforcement_%C3%A0_partir_de_r%C3%A9troaction_humaine]

	~~[https://crfm.stanford.edu/2023/05/22/alpaca-farm.html Source : stanford]~~		[https://www.journaldunet.com/solutions/dsi/1518637-chatgpt-l-intelligence-artificielle-peut-elle-tenir-ses-promesses/ Journal du Net]

	~~[https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback Source: Wikipedia]~~		[https://www.obvia.ca/sites/obvia.ca/files/ressources/202501-OBV-Out-Glossaire_Obvia.pdf Glossaire de l'Obvia - apprentissage par renforcement à partir de retours humains]

	~~[https://www.journaldunet.com/solutions/dsi/1518637-chatgpt-l-intelligence-artificielle-peut-elle-tenir-ses-promesses/ Source : Journal du Net]~~		[https://huyenchip.com/2023/05/02/rlhf.html huyenchip]

	~~[https://www.obvia.ca/sites/obvia.ca/files/ressources/202501-OBV-Out-Glossaire_Obvia.pdf Glossaire de l'Obvia - apprentissage par renforcement à partir de retours humains]~~		[https://crfm.stanford.edu/2023/05/22/alpaca-farm.html Stanford]

			[https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback Wikipedia - reinforcement learning from human feedback]

	{{Modèle:101}}		{{Modèle:101}}

Patrickdrouin on 19 August 2025 at 19:28

2025-08-19T19:28:08Z

← Previous revision		Revision as of 19 August 2025 at 15:28
Line 12:		Line 12:

	'''apprentissage par renforcement avec rétroaction humaine'''		'''apprentissage par renforcement avec rétroaction humaine'''

			'''apprentissage par renforcement à partir de retours humains'''

	'''apprentissage par renforcement avec retour humain'''		'''apprentissage par renforcement avec retour humain'''
Line 60:		Line 62:

	[https://www.journaldunet.com/solutions/dsi/1518637-chatgpt-l-intelligence-artificielle-peut-elle-tenir-ses-promesses/ Source : Journal du Net]		[https://www.journaldunet.com/solutions/dsi/1518637-chatgpt-l-intelligence-artificielle-peut-elle-tenir-ses-promesses/ Source : Journal du Net]

			[https://www.obvia.ca/sites/obvia.ca/files/ressources/202501-OBV-Out-Glossaire_Obvia.pdf Glossaire de l'Obvia - apprentissage par renforcement à partir de retours humains]


	{{Modèle:101}}		{{Modèle:101}}
	[[Catégorie:Intelligence artificielle]]		[[Catégorie:Intelligence artificielle]]
	[[Catégorie:GRAND LEXIQUE FRANÇAIS]]		[[Catégorie:GRAND LEXIQUE FRANÇAIS]]
	[[Catégorie:101]]		[[Catégorie:101]]
			[[Catégorie:Publication]]

Jean-Sébastien Zavalone on 21 July 2025 at 19:06

2025-07-21T19:06:08Z

← Previous revision		Revision as of 21 July 2025 at 15:06
Line 60:		Line 60:

	[https://www.journaldunet.com/solutions/dsi/1518637-chatgpt-l-intelligence-artificielle-peut-elle-tenir-ses-promesses/ Source : Journal du Net]		[https://www.journaldunet.com/solutions/dsi/1518637-chatgpt-l-intelligence-artificielle-peut-elle-tenir-ses-promesses/ Source : Journal du Net]
			{{Modèle:101}}
	[[Catégorie:Intelligence artificielle]]		[[Catégorie:Intelligence artificielle]]
	[[Catégorie:GRAND LEXIQUE FRANÇAIS]]		[[Catégorie:GRAND LEXIQUE FRANÇAIS]]
	[[Catégorie:101]]		[[Catégorie:101]]

Jean-Sébastien Zavalone on 21 July 2025 at 19:05

2025-07-21T19:05:51Z

← Previous revision		Revision as of 21 July 2025 at 15:05
Line 34:		Line 34:

	'''reinforcement learning from human preferences'''		'''reinforcement learning from human preferences'''

	<!-- To understand RLHF, we first need to understand the process of training a model like ChatGPT and where RLHF fits in, which is the focus of the first section of this post. The following 3 sections cover the 3 phases of ChatGPT development. For each phase, I’ll discuss the goal for that phase, the intuition for why this phase is needed, and the corresponding mathematical formulation for those who want to see more technical detail.		<!-- To understand RLHF, we first need to understand the process of training a model like ChatGPT and where RLHF fits in, which is the focus of the first section of this post. The following 3 sections cover the 3 phases of ChatGPT development. For each phase, I’ll discuss the goal for that phase, the intuition for why this phase is needed, and the corresponding mathematical formulation for those who want to see more technical detail.

Line 40:		Line 39:
	---		---
	Learning from instructions and human feedback are thought to be at the core of recent advances in instruction following large language models (LLMs). While recent efforts such as Open Assistant, Vicuna, and Alpaca have advanced our understanding of instruction fine-tuning, the same cannot be said for RLHF-style algorithms that learn directly from human feedback. AlpacaFarm aims to address this gap by enabling fast, low-cost research and development on methods that learn from human feedback. We identify three main difficulties with studying RLHF-style algorithms: the high cost of human preference data, the lack of trustworthy evaluation, and the absence of reference implementations. -->		Learning from instructions and human feedback are thought to be at the core of recent advances in instruction following large language models (LLMs). While recent efforts such as Open Assistant, Vicuna, and Alpaca have advanced our understanding of instruction fine-tuning, the same cannot be said for RLHF-style algorithms that learn directly from human feedback. AlpacaFarm aims to address this gap by enabling fast, low-cost research and development on methods that learn from human feedback. We identify three main difficulties with studying RLHF-style algorithms: the high cost of human preference data, the lack of trustworthy evaluation, and the absence of reference implementations. -->


			==Español==

			'''''aprendizaje por refuerzo a partir de la retroalimentación humana'''''

			''En el ámbito del aprendizaje automático, el aprendizaje por refuerzo de la retroalimentación humana es una técnica para mejorar el rendimiento de un agente utilizando la retroalimentación humana.''

			''Se empieza por entrenar un modelo de recompensa a partir de resultados anotados con comentarios humanos. A continuación, este modelo se utiliza como función de recompensa para mejorar la política de un agente mediante el aprendizaje por refuerzo con un algoritmo de optimización.''



	==Sources==		==Sources==