Google announced a breakthrough technology called CALM that accelerates large language models (like GPT-3 and LaMDA) without sacrificing performance.
Larger training data is better, but at a cost
Large Language Models (LLMs) train on large amounts of data.
Training language models on larger amounts of data causes them to learn new skills that are not always intended.
For example, adding more training data to a language model might unexpectedly give it the ability to translate between different languages even though it wasn’t trained to do so.
These new skills are called emergent abilities: skills that were not necessarily planned.
Another research paper (PDF) on emergent abilities states:
“Although there are dozens of examples of emergent capabilities, there are currently few compelling explanations as to why such capabilities emerge in this way.”
In other words, researchers cannot yet explain why these different skills emerge.
But it is known that scaling up the data used to train a model makes it more capable.
The downside of scaling up the training data is that it takes more computing power to produce an output, making the AI slower at generating text (the stage known as "inference time").
So the trade-off of making an AI smarter with more data is that the AI also becomes slower at inference time.
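As a rough back-of-the-envelope illustration (the layer counts are hypothetical, not from the paper): autoregressive decoding runs every decoder layer once per generated token, so total compute grows with both model depth and output length.

```python
# Back-of-the-envelope illustration (hypothetical layer counts, not from
# the paper): autoregressive decoding runs every decoder layer once per
# generated token, so total layer passes scale as layers x tokens.
def decode_cost(num_layers: int, num_tokens: int) -> int:
    """Total decoder-layer passes needed to generate an output."""
    return num_layers * num_tokens

small_model, large_model = 24, 96  # hypothetical decoder depths
output_length = 200                # tokens in the generated text

print(decode_cost(small_model, output_length))  # 4800 layer passes
print(decode_cost(large_model, output_length))  # 19200 -- 4x the work
```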
The new research paper from Google (Confident Adaptive Language Modeling, PDF) describes the problem as follows:
“Recent advances in Transformer-based Large Language Models (LLMs) have resulted in significant performance improvements on many tasks.
These gains are accompanied by a drastic increase in the size of the models, potentially resulting in slow and costly deployment at the time of inference.”
Confident Adaptive Language Modeling (CALM)
Researchers at Google came across an interesting solution to speed up the language models while maintaining high performance.
The solution, by way of analogy, is something like the difference between answering an easy question and a hard one.
An easy question, like what color the sky is, can be answered with little thought.
But a hard question requires stopping to think a little more before answering.
Computationally, large language models do not distinguish between a difficult part of a text generation task and an easy part.
They generate text for both the easy and difficult parts, using their full computational power at inference time.
Google’s solution is called Confident Adaptive Language Modeling (CALM).
What this new framework does is dedicate fewer resources to trivial parts of a text generation task and devote full power to more difficult parts.
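A minimal sketch of the idea in Python (the layer count, difficulty score, and confidence rule here are illustrative stand-ins, not Google's implementation): an adaptive decoder runs layers one at a time and stops as soon as its prediction looks confident enough, so easy tokens cost only a few layers.

```python
NUM_LAYERS = 12   # hypothetical decoder depth
THRESHOLD = 0.9   # hypothetical confidence needed to stop early

def layer_confidence(difficulty: float, layer: int) -> float:
    """Toy stand-in for a per-layer confidence score: easy tokens
    (low difficulty) become confident after only a few layers."""
    return min(1.0, (layer + 1) / (NUM_LAYERS * difficulty))

def layers_used(difficulty: float) -> int:
    """Run decoder layers until confidence crosses the threshold,
    then exit early; hard tokens fall through to the full stack."""
    for layer in range(NUM_LAYERS):
        if layer_confidence(difficulty, layer) >= THRESHOLD:
            return layer + 1
    return NUM_LAYERS

print(layers_used(0.2))  # an easy token: exits after 3 of 12 layers
print(layers_used(1.5))  # a hard token: uses all 12 layers
```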
The research paper on CALM describes the problem and solution as follows:
“Recent advances in Transformer-based Large Language Models (LLMs) have resulted in significant performance improvements on many tasks.
These gains are accompanied by a drastic increase in the size of the models, potentially resulting in slow and costly deployment at the time of inference.
In practice, however, the series of generations produced by LLMs is composed of varying levels of difficulty.
While certain predictions really benefit from the full capacity of the models, other continuations are more trivial and can be solved with reduced computational power.
…While large models generally perform better, each input may not require the same amount of computation to achieve similar performance (e.g., depending on whether the input is easy or difficult).”
What is Google CALM and does it work?
CALM works by dynamically allocating resources depending on the complexity of each part of the task, using an algorithm to predict whether something will require full or partial resources.
The research paper shares that they tested the new system on various natural language processing tasks (text summarization, machine translation, and question answering) and found that they were able to speed up inference by about a factor of three (300%).
The figure below shows how well the CALM system works.
The few areas in red indicate where the model had to use its full capacity for that portion of the task.
In the green areas, the model used less than half of its capacity.
Red = full capacity / Green = less than half capacity
Here is what the research paper says about the figure above:
“CALM speeds up generation by exiting early when possible, selectively using the full decoder capacity for only a few tokens, as demonstrated here on a CNN/DM example with a softmax-based confidence measure. Y (1) early and Y (2) early use different confidence thresholds for early exiting.
Below (sic) the text, we report the measured textual and risk consistency of each of the two outputs, along with efficiency gains.
The colors represent the number of decoding layers used for each token – shades of light green indicate less than half of the total layers.
Only a select few tokens use the full capacity of the model (colored in red), while for most tokens the model exits after one or a few layers of decoding (colored in green).”
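The caption mentions a “softmax-based confidence measure.” One natural form of such a measure (a sketch of the general technique, not necessarily the paper's exact formulation) is the gap between the two highest softmax probabilities at an intermediate layer: when the gap clears a calibrated threshold, the prediction is treated as settled and decoding exits early.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_gap(logits):
    """Confidence as the gap between the two most likely tokens:
    a large gap means the intermediate prediction is already stable."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]

LAMBDA = 0.8  # hypothetical exit threshold, calibrated offline

# A peaked distribution exits early; a flat one keeps computing.
print(top2_gap([9.0, 2.0, 1.0]) >= LAMBDA)  # True  -> exit early
print(top2_gap([2.0, 1.9, 1.8]) >= LAMBDA)  # False -> keep decoding
```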
The researchers concluded the paper by stating that implementing CALM requires minimal changes to adapt a large language model to make it faster.
This research is important because it opens the door to creating more complex AI models that are trained on significantly larger datasets without slowing down, while maintaining high performance.
However, it is possible that this method will also benefit large language models that are trained with less data.
For example, InstructGPT models, of which ChatGPT is a sibling model, are trained on approximately 1.3 billion parameters but are still able to outperform models trained on far more parameters.
The researchers noted in the conclusion:
“Overall, our full adaptive compute framework for LMs requires minimal changes to the underlying model, enabling efficiencies while meeting stringent output quality guarantees.”
News of this research was published on December 16, 2022, on Google’s AI Blog. The research paper itself is dated October 25, 2022.
It will be interesting to see if this technology finds its way into large language models in the near future.
Read Google’s blog post:
Accelerate text generation with Confident Adaptive Language Modeling (CALM)
Read the research paper:
Confident Adaptive Language Modeling (PDF)
Featured image from Shutterstock/Master1305