At a high level, the training of ChatGPT consists of three steps:
- The Supervised Fine-Tuning (SFT) model: In the first step, the model is trained using supervised learning, a type of machine learning in which the model learns to recognize patterns from labeled examples; in other words, the model is given both the input and the output it should learn to produce. In our case, human annotators wrote appropriate responses to a dataset of user prompts (see the first sketch after this list).
- The Reward Model: The previously trained model generated multiple responses for different user prompts, and human annotators ranked these responses from least to most helpful. Using this data, the Reward Model was trained to predict how useful a response is to a given prompt (see the second sketch after this list).
- The Reinforcement Learning Process: Finally, reinforcement learning is used to further train the Supervised Fine-Tuning model. Here, the SFT model acts as an agent that maximizes the reward given by the Reward Model (see the third sketch after this list).
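To make the SFT step concrete, here is a minimal sketch in PyTorch, assuming a small Hugging Face causal language model (GPT-2 is used purely as a stand-in) and a toy list of prompt-response pairs; the dataset, hyperparameters, and variable names are illustrative, not the actual ChatGPT setup.

```python
# Minimal sketch of the supervised fine-tuning (SFT) step.
# GPT-2 and the toy dataset below are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Labeled examples: user prompts paired with responses written by human annotators.
prompt_response_pairs = [
    ("Explain photosynthesis briefly.",
     "Photosynthesis is the process by which plants convert sunlight into chemical energy."),
    # ... more annotator-written examples
]

model.train()
for prompt, response in prompt_response_pairs:
    # Concatenate prompt and target response into one sequence;
    # the model learns to predict the next token of the full text.
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token (cross-entropy) loss over the sequence.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```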
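For the Reward Model step, a common formulation (assumed here, since the description above only says the model learns to predict usefulness) is a pairwise ranking loss: for the same prompt, the model should score the response annotators ranked higher above the one ranked lower. The tiny scoring network and random feature vectors below are placeholders for a language-model-based scorer.

```python
# Minimal sketch of reward-model training with a pairwise ranking loss.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (prompt, response) feature vector to a scalar reward."""
    def __init__(self, dim=128):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features):
        return self.scorer(features).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Each training pair: features of the response ranked higher ("chosen") and of
# one ranked lower ("rejected") for the same prompt. In practice these features
# would come from a language-model encoding of prompt + response.
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)

# Pairwise ranking loss: push the chosen score above the rejected score.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
```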
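For the reinforcement learning step, ChatGPT is reported to use PPO; the sketch below substitutes a simpler REINFORCE-style policy-gradient update to keep the core idea visible: the policy (the SFT model, GPT-2 as a stand-in here) samples a response, the reward model scores it, and the update increases the log-probability of high-reward responses. `reward_fn` is a placeholder for the trained Reward Model.

```python
# Simplified sketch of the RL step (policy-gradient stand-in for PPO).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the SFT model
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def reward_fn(prompt, response):
    # Placeholder for the trained reward model from the previous step.
    return torch.tensor(1.0)

prompt = "Explain photosynthesis briefly."
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# The policy (agent) samples a response to the prompt.
generated = policy.generate(**inputs, do_sample=True, max_new_tokens=40,
                            pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(generated[0, prompt_len:], skip_special_tokens=True)

# Score the response and compute log-probabilities of the sampled tokens.
reward = reward_fn(prompt, response)
logits = policy(generated).logits[:, :-1, :]
log_probs = torch.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)
response_log_prob = token_log_probs[:, prompt_len - 1:].sum()

# REINFORCE update: increase the likelihood of high-reward responses.
loss = -reward * response_log_prob
loss.backward()
optimizer.step()
```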
Now, steps 2 and 3 can be repeated multiple times: using the newly trained model from step 3, a new reward model can be trained by repeating step 2, which is then fed again into step 3, and so on.
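The outline below sketches this iteration; the helper functions are placeholders standing in for the three steps, not a real API.

```python
# Illustrative outline of repeating steps 2 and 3; all helpers are placeholders.
def collect_human_rankings(policy, prompts):
    ...  # sample several responses per prompt, have annotators rank them

def train_reward_model(rankings):
    ...  # pairwise ranking loss, as in the reward-model sketch

def rl_finetune(policy, reward_model, prompts):
    ...  # reward-maximizing updates, as in the RL sketch
    return policy

prompts = ["Explain photosynthesis briefly.", "..."]
policy = "sft-model"  # stands in for the model produced by step 1
for _ in range(3):    # steps 2 and 3 repeated for several rounds
    rankings = collect_human_rankings(policy, prompts)   # fresh comparison data
    reward_model = train_reward_model(rankings)          # retrain the reward model
    policy = rl_finetune(policy, reward_model, prompts)  # continue RL training
```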
After the three-step training process, ChatGPT’s responses became more sophisticated and effective in real-world scenarios.