We are starting the first issue of the AI Frontiers newsletter this week. This is a perfect time to look back at 2023 and see how far we have come. (See resources and announcements at the end of this article.)
The year 2023 is a watershed year for AI. For the first time, AI has entered the public realm, touching every aspect of our lives. It is starting to replace search engines, becoming our go-to place to ask questions. AI is poised to disrupt many industries: from education to marketing, IT support, and medicine. Here, I want to summarize the progress of AI in seven major areas. They are definitely not exhaustive. I've chosen these areas for their importance and the potential to disrupt the future. Feel free to drop a comment here if you are observing other interesting developments.
1. The breakthrough in AI capabilities
Today's AI systems can easily pass the Turing test, and we no longer debate whether AI is feasible. If AI was perceived as a toddler before 2023, it has matured into a teenager in 2023, though it is not yet an adult. An adult AI system should be capable of thinking and reasoning like a human adult, which means passing college exams or completing similarly difficult tasks. In 2023, both GPT-4 and Google's Gemini have made significant progress toward that goal.
Both GPT-4 and Gemini are very large, because an LLM tends to become more capable as its size grows. GPT-4 is estimated [1] to have approximately 1.8 trillion parameters, with around 120 layers and a Mixture of Experts design within the model. Google has not released the size of Gemini, but it is likely significantly larger than PaLM 2, which had 340 billion parameters [2]. According to the Gemini report [3], training Gemini requires significantly more resources than PaLM 2, likely three times as much. Scaling PaLM 2's 340 billion parameters by that factor places Gemini in the range of 1 trillion parameters. The architecture of Gemini is likely to be similar to that of GPT-4: a decoder-only transformer model with a mixture of experts.
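To make the mixture-of-experts idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not the actual design of GPT-4 or Gemini, which has not been published; the gate weights, the four toy experts, and the names `moe_layer` and `top_k` are assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, gate_weights, experts, top_k=2):
    """Route input vector x to the top_k experts and mix their outputs."""
    # One gate score per expert: dot product of x with that expert's gate row.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the top_k highest-scoring experts; the rest are never run.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalize the gate over the chosen experts only.
    weights = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen

# Toy experts (hypothetical): each is just a simple function of the input.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
    lambda v: [0.0 for _ in v],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, used = moe_layer([3.0, 1.0], gate_weights, experts, top_k=2)
```

The point of the design is that only `top_k` of the experts run for each input, so the total parameter count can grow much faster than the compute spent per token.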
Today's large language models (LLMs) demonstrate remarkable intelligence, as evidenced by their performance on a range of challenging datasets [4].
In commonsense reasoning (HellaSwag), GPT-4 has achieved a 95% accuracy rate, equivalent to human performance. In grade school mathematics (GSM8K), both LLMs achieved around 95% accuracy. In college exams covering 57 subjects (MMLU), both LLMs achieved over 90% accuracy, surpassing human performance (89%). For the problems that used to cause LLMs to stumble (BIG-Bench-Hard), GPT-4 achieved an 89% accuracy rate, while Gemini reached 83%. It appears that LLMs are overcoming their shortcomings. In coding problems (HumanEval), GPT-4 reached an 88% success rate. For reading comprehension with numerical reasoning (DROP), both LLMs achieved around 83% accuracy. The only area in which these LLMs performed poorly was mathematical competition questions (MATH). In summary, our large foundation models outperformed humans in 3 out of 7 tasks, came close to human performance in 3 other tasks, and performed poorly in only one out of 7 tasks. AI is approaching human adult intelligence.
What we can anticipate for 2024 is a continuous improvement in the performance of large foundation models. By the end of 2024, I expect that the best LLMs will surpass humans in almost all datasets. By then, we may declare that AI has reached adulthood, with the capability of reasoning and understanding equivalent to an adult human.
Open-source foundation models
The large foundation models are all closed-source and owned by a couple of companies. Many companies are concerned about their dependence on these models because there is no visibility into their inner workings. This concern has led to the emergence of many open-source models.
Meta released Llama in February, and LIMA followed in May. However, most of these early open-source models did not deliver satisfactory performance compared to the state-of-the-art OpenAI model (GPT-3.5 at that time).
Meta's Llama 2 and Mistral's Mixtral 8x7B model are among the best-performing ones. They have generated excitement because they approached the GPT-3.5 level. But they are still far behind GPT-4. Here is the newest performance chart:
On average, the open-source models are 20% below the best GPT-4 model. This raises questions about deploying open-source models, because commercial products demand high accuracy. Therefore, most companies would stick with OpenAI or Google for foundation models, mainly because of their more accurate results. For this reason, we will see a continuing rise of OpenAI this year, with more companies using the OpenAI API to build products on GPT-4. Google will also be an active player in this game: with its existing GCP and the high-performing Gemini, it could become an AI provider to enterprises.
We have not solved the hallucination problem. In fact, hallucination may be an inherent property of large language models, as some research has suggested. Remedies for hallucination include limiting answers to existing documents and using external search to check the validity of the answer. Another is requiring chain-of-thought reasoning in the response, which researchers found significantly reduced wrong answers. Since hallucination is a big problem in many practical applications, we will see more research on solving it in 2024.
2. Multi-modal AI
Another significant advancement is the maturation of multi-modal LLMs. Bard allowed image uploading in July, enabling users to ask questions based on images. OpenAI released GPT-4V in September 2023, which is capable of understanding text, images, and speech. Google released Gemini in December 2023, which can process text, images, audio, and video simultaneously. We now have fully multi-modal LLMs, which are also called LMMs (Large Multi-modal Models).
The emerging trend of 2023 is the integration of all these modalities into a single model. Such a model uses a transformer as its core architecture and transforms every type of input into tokens that can be processed by the transformer. Not only can we process different modalities, but we can also generate different modalities from such a model.
The achievement of multimodal capabilities is the result of the widespread adoption of transformers in all AI fields, allowing for a unified architecture to handle text, images, audio, and video. Vision transformers and video transformers have proven to be superior to CNN models, and speech transformer models outperformed CNN-based speech recognition models. Today, we only need a single transformer model to process these input formats, with the only extra work being the generation of image tokens or speech tokens.
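As a rough illustration of turning different inputs into one token sequence, here is a small sketch in plain Python. It assumes a tiny hand-made vocabulary and an image stored as a grid of numbers whose sides are divisible by the patch size; the names `patchify` and `build_sequence` are hypothetical. Real systems go further and project each patch into an embedding or quantize it into a discrete code, which this sketch does not attempt.

```python
def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened patch x patch blocks.

    Assumes H and W are both divisible by `patch`.
    """
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            block = [image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
            patches.append(block)
    return patches

def build_sequence(text, image, vocab, patch=2):
    """Return one sequence of (modality, payload) items for a transformer to consume."""
    seq = [("text", vocab[word]) for word in text.split()]
    seq += [("image", p) for p in patchify(image, patch)]
    return seq

# Toy inputs (hypothetical): a three-word prompt and a 4 x 4 "image".
vocab = {"describe": 0, "this": 1, "picture": 2}
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
sequence = build_sequence("describe this picture", image, vocab, patch=2)
```

Here the three words and the four image patches end up in a single sequence of seven items, which is the form a single transformer can process.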
The newest example, VideoPoet [5], demonstrates such a model for multimodal processing and multimodal generation. VideoPoet uses a decoder-only transformer that processes multimodal inputs, including images, videos, text, and audio.
VideoPoet achieved state-of-the-art zero-shot video generation and can generate high-fidelity video.
3. The explosion of Generative AI
Using AI to generate images, music, and videos became the biggest advancement in 2023. Text-to-image generation achieved remarkable fidelity in terms of image quality and realism. Here is a summary of the major generative models in 2023.
For image generation, Meta released the Segment Anything Model (SAM) in April, capable of zero-shot segmentation on any picture. In October, OpenAI released Dall-E 3. It has the best image generation quality with deep language understanding.
In text-to-video generation, Meta released Emu Video in November [6]. This model factorized video generation into two steps, first generating an image from the text and then generating a 4-second video conditioned on the text and that image. In human evaluation, Emu Video outperformed all previous models, including MAV, Google's Imagen, AYL, PYOCO, R&D, Cog, Gen2, and Pika, being preferred over each of them over 90% of the time.
The most exciting achievements of 2023 occurred at the end of the year. Audiobox [7] was released in December, enabling AI to generate any sound based on text. This followed Lyria [8], which can generate music in the style of an artist based on a text prompt.
VideoPoet was also released in December, ushering in a new paradigm of video generation without the diffusion model and integrating it into LLM.
AlphaCode 2 was announced on the same day as Gemini. It uses Gemini as a foundation model and performed better than an estimated 85% of human participants in coding competitions. Magicoder was also released, and it is the best open-source code generator.
The year 2023 marked the triumph of the diffusion model, as many image generation systems were based on it, including Emu Video. However, alternatives to the diffusion model have emerged. OpenAI's Dall-E 3 employs a consistency model [9] that does not rely on the diffusion model. Google's VideoPoet uses transformers directly, also avoiding the use of the diffusion model in its image generation. In other words, the two largest AI companies are shifting away from the use of the diffusion model for image generation. My prediction is that the diffusion model will decline in 2024. The drive to move away from the diffusion model is the pursuit of using a single transformer model for all tasks. We expect to see more research results in transformer-generated images in 2024.
4. The rise of AI agents
In 2023, we started to see the "agent," an AI system that can take action on our behalf. Such actions can include sending an email, calling a restaurant, retrieving information from a database, or generating a chart. Once actions are introduced, the AI assistant can become more powerful. This action model is seamlessly integrated into the LLM; therefore, it is learnable and tunable.
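A minimal sketch of how an agent turns model output into an action is shown here, under stated assumptions: the model is replaced by a hypothetical stub (`fake_model`) that always replies with a JSON tool call, and the two tools are toy functions. Real agent frameworks differ in their message formats, so this only illustrates the dispatch step.

```python
import json

def send_email(to, subject):
    """Toy tool: pretend to send an email and report what was sent."""
    return f"email to {to}: {subject}"

def lookup_order(order_id):
    """Toy tool: pretend to read an order record from a database."""
    return {"order_id": order_id, "status": "shipped"}

# Registry of actions the agent is allowed to take.
TOOLS = {"send_email": send_email, "lookup_order": lookup_order}

def fake_model(user_request):
    """Hypothetical stand-in for an LLM that replies with a JSON tool call."""
    return json.dumps({"tool": "lookup_order", "args": {"order_id": 42}})

def run_agent(user_request):
    """Ask the model which tool to use, then execute that tool."""
    call = json.loads(fake_model(user_request))
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise KeyError(f"unknown tool: {call['tool']}")
    return tool(**call["args"])

result = run_agent("Where is order 42?")
```

Keeping the tools in an explicit registry means the model can only trigger actions that the developer has deliberately exposed.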
One application of agents is in data analytics. In the future, analyzing data will no longer be a human job but will be delegated to AI. If an executive is interested in customer trends, they can simply ask a question in natural language, and the answer and chart will generate automatically. There is no need for data scientists to write elaborate SQL code to retrieve data. This suggests that text-to-SQL and chart generation will be significant applications in 2024. There are also other applications for accessing a database to serve customer needs.
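The text-to-SQL flow can be sketched with Python's built-in `sqlite3` module. The `fake_text_to_sql` function is a hypothetical stand-in for an LLM call, and the `orders` table and its rows are invented for the example; the SELECT-only check is one simple precaution I am assuming, not a complete safety measure.

```python
import sqlite3

def fake_text_to_sql(question):
    """Hypothetical stand-in for an LLM call that turns a question into SQL."""
    return (
        "SELECT region, SUM(amount) AS total "
        "FROM orders GROUP BY region ORDER BY total DESC"
    )

def answer(question, conn):
    """Generate SQL for the question, run it, and return the rows."""
    sql = fake_text_to_sql(question)
    # Only run read-only queries produced by the model.
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()

# Invented sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("West", 120.0), ("East", 80.0), ("West", 30.0)],
)
rows = answer("Which region buys the most?", conn)
```

The returned rows could then be handed to a charting step, which is the second half of the workflow described above.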
OpenAI is supporting the AI agent paradigm by offering the Assistants API. It links your code to external tools, making it potentially powerful. However, an assistant requires a lot of context, and that context is appended to the token count on every call, which makes it very expensive. Additionally, it is not easy to integrate the Assistant with other tools. In 2024, the AI assistant remains an open field for competition. A flexible assistant API and a low-cost solution can be attractive. LangChain has gained a lot of traction, but it is not perfect. AutoGen seems much easier to use. AutoGPT was a good try but falls short in many key functions. We may see new companies that deploy good agent solutions. This is where startup innovation can happen.
Even though OpenAI and Google lead in foundation models, good prompt engineering and agent actions could generate many interesting applications. We expect to see some specialized agents, such as a travel assistant, research assistant, price negotiation agent, and so on. Each of these assistants can leverage specialized tools and deliver value to the customers.
5. Better ways to finetune LLMs
The success of ChatGPT brought a lot of attention to the method called RLHF (Reinforcement Learning from Human Feedback). This method gave a significant boost to the original GPT-3 model and led to the successful deployment of GPT-3.5, which powered ChatGPT. RLHF is also used to enhance the performance of GPT-4, Google’s PaLM 2, and Meta’s Llama 2 model. Thus, it is the most widely used fine-tuning method for LLMs today.
Since RLHF has been so successful and is used with all foundation models, people are attempting to find ways to improve it. This is achieved by simplifying the RLHF steps. RLHF involves three steps: 1. Supervised fine-tuning: Use human-created data to train the current model. 2. Training a reward model. In this step, user preferences for AI-generated outputs are collected, and each output is given a score. Then, a scoring model or reward model is trained. 3. Applying reinforcement learning and the reward model to train the large language model.
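For step 2, a common way to train a reward model from preference pairs is a pairwise loss that pushes the score of the preferred output above the score of the rejected one. The sketch below assumes that formulation, the negative log-sigmoid of the score difference; the function name and the example scores are hypothetical.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    diff = score_chosen - score_rejected
    # softplus(-diff) equals -log(sigmoid(diff)); this form avoids overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# The loss is small when the reward model already ranks the pair correctly
# and large when it ranks the pair the wrong way round.
loss_correct = pairwise_reward_loss(2.0, 0.0)
loss_wrong = pairwise_reward_loss(0.0, 2.0)
```

Minimizing this loss over many human-labeled pairs yields the scoring model that step 3 then uses as its reward signal.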
(1) DPO
One improvement on RLHF is replacing the reinforcement learning step. Researchers from Stanford University proposed a method called DPO (Direct Preference Optimization) [10]. Instead of training a reward model and then running reinforcement learning against it, DPO simply uses the preference data directly to train the LLM. Therefore, DPO reduces two steps (reward model learning and RL) to a single step.
The authors show that DPO outperforms the reinforcement learning approach. Today, DPO has gained traction among practitioners for fine-tuning their models. This trend will continue in 2024.
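The DPO objective for a single preference pair can be written in a few lines. This sketch assumes the sequence log-probabilities under the trained policy and under a frozen reference model have already been computed; the argument names and the default `beta=0.1` are illustrative choices, not values taken from this article.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written as a stable softplus of -margin.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the policy equals the reference, the margin is zero.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After the policy shifts probability toward the chosen response, the loss drops.
loss_after = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

No separate reward model appears anywhere in the computation, which is what lets DPO collapse the reward-learning and RL steps into one.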
(2) RLAIF
Another way to improve RLHF is by removing the bottleneck of data gathering. One of the key steps in RLHF is gathering human feedback data, which is expensive to obtain because it requires hiring people to provide answers. The human data gathering process is also time-consuming. Instead of relying on humans, we can use an LLM such as GPT-4 to provide us with feedback. RLAIF (Reinforcement Learning from AI Feedback) [11] employs an off-the-shelf LLM to generate preference data, and the authors demonstrate that RLAIF has a similar effect to RLHF in boosting a model. By utilizing AI for feedback, we eliminate the bottleneck associated with collecting data from humans.
It appears that we are moving toward the use of AI for generating evaluation data, not only for preference data but also for other supervised training tasks.
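The data-gathering loop behind AI feedback can be sketched as follows. The judge here is a hypothetical stub that prefers whichever response shares more words with the prompt; in practice that function would be a call to an LLM, and nothing about this stub reflects how any particular paper prompts its labeler.

```python
def fake_ai_judge(prompt, response_a, response_b):
    """Hypothetical stand-in for an LLM judge; returns "a" or "b"."""
    words = set(prompt.lower().split())
    overlap_a = len(words & set(response_a.lower().split()))
    overlap_b = len(words & set(response_b.lower().split()))
    return "a" if overlap_a >= overlap_b else "b"

def label_pairs(samples, judge):
    """Turn (prompt, response_a, response_b) triples into preference records."""
    data = []
    for prompt, a, b in samples:
        winner = judge(prompt, a, b)
        chosen, rejected = (a, b) if winner == "a" else (b, a)
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data

# Invented examples for illustration.
samples = [
    ("capital of France", "Paris is the capital of France", "I like cheese"),
    ("largest planet", "No idea", "Jupiter is the largest planet"),
]
preferences = label_pairs(samples, fake_ai_judge)
```

The resulting records have the same shape as human preference data, so they can feed a reward model or a direct preference method without other changes.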
(3) Weak-to-Strong Alignment
A third important development is investigating whether RLHF will continue to be useful in the future. There is an implicit assumption that RLHF will always improve a model's performance because humans know better. However, this assumption may not be true anymore. Within the next year or two, we will observe AI growing into superhuman intelligence. This means it will beat humans in almost all tasks, from writing a good email to solving a math problem. When we force an LLM to conform to a human's way of writing or speaking, we could degrade the LLM’s performance in doing other tasks. In other words, training with RLHF could make an LLM less capable. This is very different from classical supervised training, where humans are always smarter. This situation is shown in the center picture of the following figure, where a person is trying to teach a superhuman AI.
Researchers from OpenAI have investigated this problem and have made the first attempt to simulate it [12]. They used a weak LLM (GPT-2) to teach a strong LLM (GPT-4) and confirmed that the performance of GPT-4 indeed degraded. This suggests that RLHF may not work well in the future. OpenAI researchers have proposed a remedy by adding an auxiliary confidence loss. This allows the fine-tuned GPT-4's performance to rise to the GPT-3.5 level, though it still remains below the original GPT-4 level. This paper represents the first attempt to understand the effect of using a weak model to train a strong model. They dubbed this setting weak-to-strong generalization, and we expect to see more results on this from OpenAI in 2024.
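One way to picture an auxiliary confidence loss is as a blend of two cross-entropy terms: one pulling the strong model toward the weak supervisor's label, and one pulling it toward its own hardened prediction. The sketch below is my simplified reading for a binary task; the fixed `alpha` and `threshold` values are assumptions, and the paper's exact formulation and schedules may differ.

```python
import math

def cross_entropy(target, p, eps=1e-12):
    """Binary cross-entropy between a target in [0, 1] and a predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def confidence_loss(strong_prob, weak_label, alpha=0.5, threshold=0.5):
    """Blend the weak-label loss with a loss toward the strong model's own hard guess."""
    # Harden the strong model's prediction into a 0/1 pseudo-label.
    hardened = 1.0 if strong_prob > threshold else 0.0
    weak_term = cross_entropy(weak_label, strong_prob)
    self_term = cross_entropy(hardened, strong_prob)
    return (1.0 - alpha) * weak_term + alpha * self_term

# The strong model is confident (0.9) but the weak supervisor says 0.
plain = confidence_loss(0.9, 0.0, alpha=0.0)    # pure imitation of the weak label
blended = confidence_loss(0.9, 0.0, alpha=0.5)  # partly trusts its own prediction
```

With `alpha` above zero, the strong model is penalized less for confidently disagreeing with the weak supervisor, which is the intuition behind letting it keep its own capabilities.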
6. The exciting development of robotics
As LLMs continue to mature and become more powerful, the frontier of AI has shifted from building digital models to physical ones. The next stage of AI development will be in the field of robotics.
The progress of robotics in 2023 is exciting, though not as rapid as that of LLMs. This is primarily due to the inherent challenges in building and testing physical components. An exciting achievement in this domain is Tesla Optimus 2, capable of delicately picking up and placing an egg without breaking it. Such precise handling marks a significant breakthrough for robots entering households.
Another noteworthy breakthrough is the transformer-based robotic architecture RT-2 [13]. It introduced a vision-language-action model, which encodes robot actions as tokens to be processed by a transformer. The transformer can generate such action tokens, and the robot acts on them accordingly. The architecture looks like this:
The transformer model can accept text and images as inputs and then generate corresponding actions. This architecture will enable today’s robots to use an LLM as their core model. Such a robot can have listening, seeing, and speaking capabilities in addition to moving and grasping.
In October, Google researchers released the Open X-Embodiment dataset [14]. Collected from 22 different robots through a collaboration between 21 institutions, it contains 527 skills. This dataset can help robots jump-start their learning and leverage the "pre-training" in other skills to boost their performance. As a result, it will accelerate robotics development.
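Encoding actions as tokens can be sketched by discretizing each continuous action dimension into a fixed number of bins. The 256 bins, the action bounds of -1 to 1, and the function names below are assumptions chosen for illustration; the real system's action space and vocabulary mapping are more involved.

```python
def action_to_tokens(action, low, high, bins=256):
    """Map each continuous action value to an integer bin index in [0, bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)
        frac = (clipped - lo) / (hi - lo)
        tokens.append(min(int(frac * bins), bins - 1))
    return tokens

def tokens_to_action(tokens, low, high, bins=256):
    """Map bin indices back to the center value of each bin."""
    return [lo + (t + 0.5) * (hi - lo) / bins for t, lo, hi in zip(tokens, low, high)]

# Hypothetical 3-dimensional action with each dimension bounded in [-1, 1].
low, high = [-1.0] * 3, [1.0] * 3
tokens = action_to_tokens([0.0, -1.0, 1.0], low, high)
recovered = tokens_to_action(tokens, low, high)
```

Because the actions become ordinary integer tokens, the same transformer that emits words can emit them, and the robot controller only needs the reverse mapping.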
7. Detecting brain activities
When we measure a person's brain signals, can we actually detect what the person is hearing or seeing? Another astounding achievement in 2023 involves real-time image reconstruction based on brain signals recorded by MEG [15]. The level of accuracy it achieves is truly astonishing.
It appears that we can recover not only the correct shape and color but also very specific details from the brain signals. This work, conducted by Meta researchers, builds upon the earlier work of detecting speech from brain signals and image reconstruction from fMRI recordings.
In the near future, we may be able to apply these techniques to a person while they are sleeping and monitor their dreams. Could it be possible that one day we can project a person's dream onto a big screen like a movie? Research in image recovery is expected to continue in 2024, likely yielding much better performance.
Resources:
I gave a talk on this subject at the AI Frontiers meetup. Please check out the talk video and slides.
Announcements:
Join our next meetup on understanding VideoPoet on Friday, January 19, 2024
References
Research Review of 2023:
Microsoft research (Dec 22, 2023), “Research at Microsoft 2023: A year of groundbreaking AI advances and discoveries”.
Jeff Dean et al (Dec 22, 2023), “2023: A Year of Groundbreaking Advances in AI and Computing”. Google Deepmind blog
Datasets:
MATH: https://huggingface.co/datasets/hendrycks/competition_math
Big-Bench-Hard: https://github.com/suzgunmirac/BIG-Bench-Hard
Agents:
Langchain: https://github.com/langchain-ai/langchain
AutoGen: https://github.com/microsoft/autogen
[1] The decoder, Jul 11, 2023, "GPT-4 architecture, datasets, costs and more leaked"
[2] Elias, Jennifer (16 May 2023). "Google's newest A.I. model uses nearly five times more text data for training than its predecessor". CNBC.
[3] Team, Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut et al. "Gemini: a family of highly capable multimodal models." arXiv preprint arXiv:2312.11805 (Dec 19, 2023).
[4] When it was released, Gemini was shown to surpass GPT-4 in every dataset except a commonsense reasoning task (HellaSwag). However, this achievement was quickly surpassed just one week later, according to a new study from Microsoft. Researchers discovered that GPT-4 outperforms Gemini Ultra in all datasets when using a novel prompting method called Promptbase.
[5] Kondratyuk, Dan, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam et al. "VideoPoet: A Large Language Model for Zero-Shot Video Generation." arXiv preprint arXiv:2312.14125 (Dec 21, 2023).
[6] Girdhar, Rohit, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. "Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning." arXiv preprint arXiv:2311.10709 (Nov 17, 2023). Website: https://emu-video.metademolab.com/
[7] Vyas, Apoorv, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang et al. "Audiobox: Unified Audio Generation with Natural Language Prompts." arXiv preprint arXiv:2312.15821 (Dec 25, 2023). Website: https://audiobox.metademolab.com/
[8] Google Deepmind Blog (Nov 16, 2023), “Transforming the future of music”
[9] Song, Yang, and Prafulla Dhariwal. "Improved Techniques for Training Consistency Models." arXiv preprint arXiv:2310.14189 (Oct 22, 2023).
[10] Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. "Direct preference optimization: Your language model is secretly a reward model." arXiv preprint arXiv:2305.18290 (May 29, 2023).
[11] Lee, Harrison, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." arXiv preprint arXiv:2309.00267 (Sept 1, 2023).
[12] Burns, Collin, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen et al. "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision." arXiv preprint arXiv:2312.09390 (Dec 14, 2023).
[13] Brohan, Anthony, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding et al. "RT-2: Vision-language-action models transfer web knowledge to robotic control." arXiv preprint arXiv:2307.15818 (July 28, 2023).
[14] Deepmind blog, October 3, 2023, “Scaling up learning across many different robot types”. Website: https://robotics-transformer-x.github.io/
[15] Benchetrit, Yohann, Hubert Banville, and Jean-Rémi King. "Brain decoding: toward real-time reconstruction of visual perception." arXiv preprint arXiv:2310.19812 (Oct 18, 2023).