Foundation Models: AI’s Exciting New Frontier

Image credit: Depositphotos enhanced by CogWorld

Source: Irving Wladawsky-Berger, CogWorld Think Tank member

Over the past decade, powerful AI systems have matched or surpassed human levels of performance in a number of specific tasks such as image and speech recognition, skin cancer classification and breast cancer detection, and highly complex games like Go. These AI breakthroughs have been based on deep learning (DL), the technique that now dominates the field, loosely inspired by the network structure of neurons in the human brain. DL systems acquire knowledge by being trained on millions to billions of texts, images and other data instances instead of being explicitly programmed.

These task-specific DL systems have generally relied on supervised learning, a training method where the data must be carefully labeled (e.g., cat, not-cat), thus requiring a big investment of time and money to produce a model that’s narrowly focused on a specific task and can’t be easily repurposed. The rising costs of training ever-larger, narrowly focused DL systems have prompted concerns that the technique was running out of steam.
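To make the labeling requirement concrete, here is a minimal sketch of supervised training (using scikit-learn purely as an illustration; the feature values and labels are made up), in which every example must be paired with a human-provided label:

```python
# A toy supervised-learning setup: every training example needs an explicit label.
# The "features" and labels below are hypothetical stand-ins for real image data.
from sklearn.linear_model import LogisticRegression

# Each row is one example; each must be paired with a label (1 = cat, 0 = not-cat).
X_train = [
    [0.9, 0.1, 0.8],   # features extracted from a cat photo
    [0.2, 0.7, 0.1],   # features extracted from a dog photo
    [0.8, 0.2, 0.9],   # another cat
    [0.1, 0.9, 0.2],   # another non-cat
]
y_train = [1, 0, 1, 0]  # the hand-assigned labels that supervised learning depends on

model = LogisticRegression().fit(X_train, y_train)
print(model.predict([[0.85, 0.15, 0.75]]))  # e.g., [1], i.e., "cat"
```

The point is not the model itself but that none of the labels come for free; producing them at the scale modern DL requires is what drives up the cost.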

Foundation models promise to get around these DL concerns by bringing to the world of AI the reusability and extensibility that have been so successful in IT software systems, from operating systems like iOS and Android to the growing number and variety of internet-based platforms.

“AI is undergoing a paradigm shift with the rise of models that are trained on broad data at scale and are adaptable to a wide range of downstream tasks,” said On the Opportunities and Risks of Foundation Models, a recent report by the Center for Research on Foundation Models, an interdisciplinary initiative in the Stanford Institute for Human-Centered Artificial Intelligence (HAI) that was founded in 2021 to make fundamental advances in the study, development, and deployment of foundation models. Foundation models aim to replace the task-specific models that have dominated AI over the past decade with models that are trained with huge amounts of unlabeled data and can then be adapted to many different tasks with minimal fine-tuning. Current examples of foundation models include large language models like GPT-3 and BERT.

Shortly after GPT-3 went online in 2020, its creators at the AI research company OpenAI discovered that not only could GPT-3 generate whole sentences and paragraphs in English in a variety of styles, but it had also developed surprising skills at writing computer software, even though its training data was focused on the English language, not on examples of computer code. As it turned out, the vast number of Web pages used in its training included many examples of computer programming accompanied by descriptions of what the code was designed to do, thus enabling GPT-3 to teach itself how to program. GPT-3 can also generate legal documents, like licensing agreements or leases, as well as documents in a variety of other fields.

“At the same time, existing foundation models have the potential to accentuate harms, and their characteristics are in general poorly understood,” warns the Stanford report. A major finding of the 2022 AI Index Report was that while large language models like GPT-3 are setting new records on technical benchmarks, they’re also more prone to reflect the biases that may have been included in their training data, including racist, sexist, extremist and other harmful language as well as overtly abusive language patterns and harmful ideologies.

While foundation models are based on DL technologies, they’ve been enabled by two more recent advances: transfer learning and scale. Unlike the task-specific training of earlier DL systems, transfer learning takes the knowledge learned from training on one task and applies it to different but related tasks, such as using the training in object recognition in images and applying it to activity recognition in videos, or using the knowledge gained from learning to recognize cars and applying it to recognizing trucks. With transfer learning, “a model is trained on a surrogate task (often just as a means to an end) and then adapted to the downstream task of interest via fine-tuning.”
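The pretrain-then-fine-tune pattern can be sketched in a few lines. The example below assumes PyTorch and a recent torchvision, and the downstream “truck vs. not-truck” task is hypothetical; the point is simply that the pretrained backbone is reused and only a small new head is trained:

```python
# A minimal sketch of transfer learning: reuse an ImageNet-pretrained backbone
# for a new, hypothetical downstream classification task.
import torch
import torch.nn as nn
from torchvision import models

# 1. Start from a model pretrained on a surrogate task (ImageNet classification).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 2. Freeze the pretrained weights so only the new head is updated during fine-tuning.
for param in backbone.parameters():
    param.requires_grad = False

# 3. Replace the final classification layer for the downstream task (2 classes here).
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

# 4. Fine-tune only the new head on the (hypothetical) downstream dataset.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One gradient step on a batch from the downstream task."""
    logits = backbone(images)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```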

“Transfer learning is what makes foundation models possible, but scale is what makes them powerful,” adds the report. Scale is enabled by three recent AI advances:

  • improvements in computer hardware: according to the 2022 AI Index Report, “Since 2018, the cost to train an image classification system has decreased by 63.6%, while training times have improved by 94.4%”;

  • huge amounts of training data: according to a recent article in The Economist, GPT-2 (GPT-3’s predecessor) was trained with 40 gigabytes of data, while GPT-3 was trained with 570 gigabytes of data, including a big chunk of the internet, all of Wikipedia, and many digital books; and

  • highly parallel architectures: transformer architectures enable the much larger deep learning networks in foundation models to take advantage of the inherent parallelism of the hardware (a minimal sketch of the attention computation behind this parallelism follows below).
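To make that last point concrete, here is a minimal sketch of scaled dot-product attention, the core transformer operation, written in PyTorch purely as an illustration (it is not code from the report). Every position attends to every other position through a few batched matrix multiplications, which is exactly the kind of work that parallel hardware handles well:

```python
# Scaled dot-product attention: the heart of the transformer architecture.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (batch, seq_len, d_model)."""
    d_model = q.size(-1)
    # All pairwise position-to-position scores are computed in one batched matmul.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the values: again a single parallel matmul across all positions.
    return weights @ v

# Hypothetical usage: a batch of 8 sequences, 128 tokens each, 64-dimensional embeddings.
q = k = v = torch.randn(8, 128, 64)
out = scaled_dot_product_attention(q, k, v)   # shape: (8, 128, 64)
```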

“The significance of foundation models can be summarized with two words: emergence and homogenization,” notes the report.

“Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities.” Emergence occurs when a very large system exhibits behaviors that could not have been predicted by the behaviors of its individual components and only emerge as a result of their highly complex interactions. “Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unanticipated consequences.”

“For example, GPT-3, with 175 billion parameters compared to GPT-2’s 1.5 billion, permits in-context learning, in which the language model can be adapted to a downstream task simply by providing it with a prompt (a natural language description of the task), an emergent property that was neither specifically trained for nor anticipated to arise.” This is why GPT-3’s creators were caught by surprise when they discovered that it had taught itself to program and to generate legal documents without being explicitly trained to do so.
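What such a prompt looks like can be sketched without any model at all. The task and examples below are hypothetical; the assembled string would be sent to a large language model such as GPT-3, which completes the pattern without any additional training:

```python
# A minimal sketch of in-context learning: the task is specified entirely in the
# prompt (a description plus a few worked examples), not via fine-tuning.
# The reviews and labels below are hypothetical.
few_shot_prompt = "\n".join([
    "Classify the sentiment of each review as Positive or Negative.",
    "",
    "Review: The battery lasts all day and the screen is gorgeous.",
    "Sentiment: Positive",
    "",
    "Review: It broke after a week and support never answered.",
    "Sentiment: Negative",
    "",
    "Review: Setup was painless and it just works.",
    "Sentiment:",   # the model is expected to complete this line, e.g., " Positive"
])

print(few_shot_prompt)
```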

The effectiveness of foundation models has also led to an unprecedented level of homogenization. For example, almost all state-of-the-art NLP models are now adapted from one of a few foundation models, e.g., BERT or GPT-3. “While this homogenization produces extremely high leverage (any improvements in the foundation models can lead to immediate benefits across all of NLP), it is also a liability; all AI systems might inherit the same problematic biases of a few foundation models.”
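As a rough illustration of that homogenization, here is a minimal sketch assuming the Hugging Face transformers library (the binary-classification task is hypothetical): countless downstream NLP systems start from the same shared checkpoint and simply attach a small task-specific head, so any defect in that checkpoint is inherited by all of them:

```python
# Adapting a shared foundation-model checkpoint ("bert-base-uncased") to a
# hypothetical downstream task; most of the model's behavior comes from the
# shared pretrained weights, for better and for worse.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # the shared foundation model
    num_labels=2,         # hypothetical downstream task: binary classification
)

# A hypothetical downstream input; the pretrained weights do nearly all the work.
inputs = tokenizer("The report is a comprehensive overview.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # torch.Size([1, 2])
```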

The impending widespread deployment of foundation models demands caution, warns the report. Along with its powerful leverage, homogenization also means that the defects of a foundation model are inherited by all the adapted downstream models. And due to their emergent properties, we currently lack a clear understanding of how foundation models work, what they’re capable of, and when and how they fail. “To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.”

Running to more than 200 pages, with over 100 authors, the Stanford report is a comprehensive overview of the state of foundation models, highlighting their exciting raw potential while reminding us that they should be viewed as a research technology in its very early years. The report’s 26 different sections are grouped into four interrelated areas: capabilities, applications, technology, and society, noting that “the technologies and capabilities are developed in a way that is sensitive to real societal concerns, while being inspired by and grounded out in applications.”

“There are tremendous economic incentives to push the capabilities and scale of foundation models, so we anticipate steady technological progress over the coming years,” says the report in conclusion. “But the suitability of a technology relying largely on emergent behavior for widespread deployment to people is unclear. What is clear is that we need to be cautious, and that now is the time to establish the professional norms that will enable the responsible research and deployment of foundation models. Academia and industry need to collaborate on this: industry ultimately makes concrete decisions about how foundation models will be deployed, but we should also lean on academia, with its disciplinary diversity and non-commercial incentives around knowledge production and social benefit, to provide distinctive guidance on the development and deployment of foundation models that is both technically and ethically grounded.”


Irving Wladawsky-Berger is a Research Affiliate at MIT's Sloan School of Management and at Cybersecurity at MIT Sloan (CAMS) and Fellow of the Initiative on the Digital Economy, of MIT Connection Science, and of the Stanford Digital Economy Lab.