GPT-4 is the latest milestone in OpenAI’s effort to scale up deep learning. GPT-4 is a large multimodal model (accepting image and text input and producing text output) that, while less capable than humans in many real-world scenarios, performs at human level on various professional and academic benchmarks.
Over the past two years, OpenAI rebuilt its entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for this workload.
A year ago, they trained GPT-3.5 as the first “test run” of the system. They found and fixed some bugs and improved their theoretical foundations.
As a result, the GPT-4 training run was unprecedentedly stable and became the first large model whose training performance could be accurately predicted in advance.
In casual conversation, the difference between GPT-3.5 and GPT-4 can be subtle. The difference emerges once the complexity of the task reaches a sufficient threshold – GPT-4 is more reliable, more creative, and able to handle much more nuanced instructions than GPT-3.5.
To understand the difference between the two models, OpenAI evaluated them on a variety of benchmarks, including simulating exams originally designed for humans.
They used the latest publicly available tests (in the case of the Olympiads and AP free response questions) or purchased the 2022-2023 editions of the practice exams.
The model did not undergo any special training for these exams. A minority of the problems in the exams were seen by the model during training, but they believe the results are representative.
OpenAI is releasing GPT-4’s text input capability via ChatGPT and the API (with a waitlist). To prepare the image input capability for wider availability, they are initially working closely with a single partner.
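As a rough illustration of what text-only API access looks like, the sketch below builds a minimal chat-style request body. The endpoint URL, model identifier, and field names follow the public chat-completions convention but are assumptions here; check the current API reference before relying on them, and note that actually sending the request requires an API key.

```python
import json

# Assumed endpoint for chat-style completions; verify against the API docs.
API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(prompt: str, model: str = "gpt-4") -> str:
    """Serialize a minimal chat-completions request body to JSON."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return json.dumps(payload)

# The resulting JSON string is what would be POSTed to API_URL with an
# Authorization header carrying the API key (omitted here).
body = build_request("Summarize GPT-4's benchmark results.")
```

This only constructs the payload; the transport layer (HTTP POST plus authentication) is left out to keep the sketch self-contained.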
They are also making available OpenAI Evals, their framework for automatically evaluating the performance of AI models, to allow anyone to report shortcomings in their models and contribute to further improvements.
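For a sense of what contributing an eval might involve, the snippet below sketches test samples in a JSON Lines layout, where each record pairs an input conversation with an ideal answer, and a trivial exact-match grader. The field names ("input", "ideal") mirror OpenAI Evals' basic match-style examples but should be treated as illustrative; consult the Evals repository for the actual schema.

```python
import json

# Illustrative eval samples: each record pairs an "input" chat with an
# "ideal" answer. Field names are assumed from Evals' basic examples.
samples = [
    {"input": [{"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"},
    {"input": [{"role": "user", "content": "Capital of France?"}], "ideal": "Paris"},
]

# One JSON object per line, as a samples.jsonl file would contain.
jsonl = "\n".join(json.dumps(s) for s in samples)

def exact_match(candidate: str, record: dict) -> bool:
    """Grade a model answer by exact string match against the ideal."""
    return candidate.strip() == record["ideal"]
```

Real evals typically plug such samples into a registered eval class rather than a hand-rolled grader, but the data shape is the part a contributor supplies.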