Series “AI Lab”: Epoch AI - Can AI Scaling Continue Through 2030? #5

Is it technically feasible for the current rapid pace of AI training scaling to continue through 2030?

Oct 22, 2024

Welcome back to the AI Lab Series, where we delve into the latest advancements shaping the world of artificial intelligence. This week, we're unpacking Epoch AI's excellent report on AI scalability. Can the current rapid pace of AI training scaling, approximately 4x per year, continue through 2030?

Introduction

Recent advancements in AI capabilities have been significantly driven by scaling up computational resources used in training AI models. Our research indicates that this increase in compute accounts for a substantial portion of performance improvements. The consistent gains from scaling have led AI labs to expand training compute at an approximate rate of 4x per year.
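To get a feel for what 4x per year implies, the compounding can be sketched in a few lines. This is a minimal illustration rather than anything taken from the report: the function names and the six-year horizon (2024 through 2030) are assumptions made here.

```python
import math

GROWTH_PER_YEAR = 4.0  # approximate annual growth in training compute (from the text)


def scale_factor(years: float, growth: float = GROWTH_PER_YEAR) -> float:
    """Total multiplier on training compute after `years` of compounding."""
    return growth ** years


def years_to_reach(multiplier: float, growth: float = GROWTH_PER_YEAR) -> float:
    """Years of compounding needed to reach a given total multiplier."""
    return math.log(multiplier) / math.log(growth)


if __name__ == "__main__":
    # Assumed horizon: six years, e.g. 2024 through 2030.
    print(f"6 years at 4x/year: {scale_factor(6):,.0f}x")
    print(f"Years to grow 1,000x: {years_to_reach(1_000):.1f}")
```

Six years at this rate multiplies training compute by 4^6 = 4,096, which is why the feasibility question turns on hard physical inputs such as power and chips.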

This rapid growth rate surpasses other technological expansions, such as mobile phone adoption and solar energy capacity installation. In this article, we explore whether this pace of AI training scaling can continue through 2030 by examining four key potential constraints:

Power Availability
Chip Manufacturing Capacity
Data Scarcity
The Latency Wall

We incorporate industry projections, including semiconductor manufacturing plans and electricity providers' capacity growth forecasts, to assess these constraints.

Our main finding is that training runs of up to $2 \times 10^{29}$ FLOP are likely feasible by 2030. This scale surpasses GPT-4 to the same extent that GPT-4 exceeded GPT-2. Such advancements could lead to AI capabilities as transformative as those seen between 2019 and 2023.

Figure 1: Estimates of the scale constraints imposed by the most important bottlenecks to scale. Each estimate is based on historical projections. The dark shaded box corresponds to an interquartile range and light shaded region to an 80% confidence interval.

What Constrains AI Scaling This Decade

Power Constraints

The Current Trend of AI Power Demand

AI training currently consumes a small but rapidly growing portion of data center power usage. The state-of-the-art GPU, Nvidia's H100, has a peak power demand of approximately 1,700W per GPU when accounting for supporting hardware and overhead.

Recent models like Llama 3.1 405B used clusters requiring around 27 MW of power. If training compute continues to grow at 4x per year, training runs in 2030 would require about 200 times more power than that, or around 6 GW. Power demand grows more slowly than compute because expected gains in energy efficiency and longer training durations absorb part of the increase.

This power demand is significant compared to the total power consumption of all U.S. data centers today (~20 GW) but small relative to the U.S.'s total installed power capacity (~1,200 GW).
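The power figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this section; the function and constant names are illustrative assumptions.

```python
H100_PEAK_W = 1_700          # per-GPU peak power incl. supporting hardware (from the text)
LLAMA_CLUSTER_MW = 27        # approximate Llama 3.1 405B cluster power (from the text)
POWER_MULTIPLIER_2030 = 200  # projected growth in power demand by 2030 (from the text)
US_DATACENTER_GW = 20        # approximate U.S. data center consumption today (from the text)
US_INSTALLED_GW = 1_200      # approximate U.S. installed power capacity (from the text)


def implied_gpu_count(cluster_mw: float, watts_per_gpu: float = H100_PEAK_W) -> float:
    """Number of GPUs a cluster of the given power could host at peak draw."""
    return cluster_mw * 1e6 / watts_per_gpu


def projected_power_gw(cluster_mw: float = LLAMA_CLUSTER_MW,
                       multiplier: float = POWER_MULTIPLIER_2030) -> float:
    """Projected 2030 training-run power demand in GW."""
    return cluster_mw * multiplier / 1_000


if __name__ == "__main__":
    demand = projected_power_gw()
    print(f"Implied GPUs in a 27 MW cluster: {implied_gpu_count(LLAMA_CLUSTER_MW):,.0f}")
    print(f"Projected 2030 demand: {demand:.1f} GW")
    print(f"Share of U.S. data center power today: {demand / US_DATACENTER_GW:.0%}")
    print(f"Share of U.S. installed capacity: {demand / US_INSTALLED_GW:.2%}")
```

Multiplying 27 MW by 200 gives 5.4 GW, in line with the rounded figure of around 6 GW. That is roughly a quarter of today's U.S. data center consumption but well under 1% of installed capacity.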

Power Constraints for Geographically Localized Training Runs

Single data center campuses of 1 to 5 GW are likely possible by 2030. Companies like Amazon and Microsoft are already planning or considering such large-scale facilities. A 5 GW data center could support training runs of up to $2 \times 10^{29}$ FLOP, accounting for expected advances in energy efficiency and increased training duration.
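How a power budget translates into training compute can be sketched as follows. Only the 5 GW budget and the 1,700 W per-GPU figure come from this article. The per-GPU throughput, utilization, run length and efficiency gain are round-number assumptions chosen for illustration, not values from the report.

```python
SECONDS_PER_DAY = 86_400


def training_flop(power_w: float,
                  watts_per_gpu: float,
                  flops_per_gpu: float,
                  utilization: float,
                  days: float,
                  efficiency_gain: float) -> float:
    """Rough total FLOP for a training run limited by a fixed power budget."""
    gpus = power_w / watts_per_gpu                 # H100-equivalent GPUs the site can power
    sustained = gpus * flops_per_gpu * utilization  # FLOP/s actually achieved
    return sustained * efficiency_gain * days * SECONDS_PER_DAY


estimate = training_flop(
    power_w=5e9,          # 5 GW campus (from the text)
    watts_per_gpu=1_700,  # peak power per H100 incl. overhead (from the text)
    flops_per_gpu=1e15,   # ASSUMPTION: round 16-bit peak throughput per GPU
    utilization=0.4,      # ASSUMPTION: fraction of peak achieved in training
    days=270,             # ASSUMPTION: training duration
    efficiency_gain=8.0,  # ASSUMPTION: FLOP-per-watt improvement by 2030
)

if __name__ == "__main__":
    print(f"Estimated training compute: {estimate:.1e} FLOP")
```

With these inputs the estimate lands in the $10^{29}$ FLOP range, the same order of magnitude as the figure quoted above. Changing any assumed input shifts the result proportionally.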

Power Constraints for Geographically Distributed Training

Distributed training across multiple data centers can tap into larger power resources. Projections suggest that U.S. data center capacity could grow to 90 GW by 2030, with a significant portion allocated to AI. Assuming a company can utilize about 26% of AI data center capacity, an 8 GW distributed training run is plausible.
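The relationship between the 26% share and the 8 GW run can be inverted to see what AI data center capacity it implies. This is a back-of-envelope sketch: the article does not state the AI share of the 90 GW, so the implied capacity below is a derived quantity, and the function names are assumptions.

```python
US_CAPACITY_2030_GW = 90   # projected U.S. data center capacity (from the text)
COMPANY_SHARE = 0.26       # share of AI capacity one company might use (from the text)
DISTRIBUTED_RUN_GW = 8     # plausible distributed training run (from the text)


def implied_ai_capacity_gw(run_gw: float, company_share: float) -> float:
    """AI data center capacity implied by a run size and a company's share."""
    return run_gw / company_share


if __name__ == "__main__":
    ai_capacity = implied_ai_capacity_gw(DISTRIBUTED_RUN_GW, COMPANY_SHARE)
    print(f"Implied AI data center capacity: {ai_capacity:.1f} GW")
    print(f"As a share of projected U.S. capacity: {ai_capacity / US_CAPACITY_2030_GW:.0%}")
```

Under this reading, an 8 GW run at a 26% share corresponds to roughly 31 GW of AI capacity, about a third of the projected 90 GW.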

Figure 2: Reported and planned total installed IT capacity of North America data centers via SemiAnalysis’ data center industry model. **Important note:** to find total capacity, we must multiply these figures by PUE, which ranges from 1.2x for AI datacenters to 1.5x for other datacenters.
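The PUE adjustment described in the note is a simple multiplication. A minimal sketch, where the PUE values come from the caption and the example IT capacities are hypothetical placeholders rather than SemiAnalysis data:

```python
PUE_AI = 1.2     # power usage effectiveness for AI data centers (from the caption)
PUE_OTHER = 1.5  # power usage effectiveness for other data centers (from the caption)


def total_facility_power_gw(it_ai_gw: float, it_other_gw: float) -> float:
    """Convert installed IT capacity to total facility power using PUE."""
    return it_ai_gw * PUE_AI + it_other_gw * PUE_OTHER


if __name__ == "__main__":
    # Hypothetical split: 10 GW of AI IT load and 20 GW of other IT load.
    print(f"Total facility power: {total_facility_power_gw(10, 20):.1f} GW")
```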

Feasibility of Geographically Distributed Training

Geographically distributed training is technically feasible and already in practice. Latency and bandwidth are critical considerations but are not likely to be the binding constraints. Latency across U.S. data centers is manageable, and advances in data center switch technology suggest that sufficient bandwidth can be achieved to support large-scale distributed training.
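To see why cross-country latency is manageable, it helps to estimate the propagation delay over optical fiber. The sketch below is illustrative only: the 4,000 km distance and the refractive index of 1.47 are assumptions, and real routes add switching delay and are longer than straight-line paths.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47   # ASSUMPTION: typical value for silica fiber


def one_way_latency_ms(distance_km: float,
                       refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """Propagation delay in milliseconds over a fiber link of the given length."""
    speed_in_fiber = C_VACUUM_M_PER_S / refractive_index
    return distance_km * 1_000 / speed_in_fiber * 1_000


if __name__ == "__main__":
    distance_km = 4_000  # ASSUMPTION: rough coast-to-coast fiber distance
    one_way = one_way_latency_ms(distance_km)
    print(f"One-way: {one_way:.1f} ms, round trip: {2 * one_way:.1f} ms")
```

A delay of a few tens of milliseconds per synchronization is small when each training step takes seconds, which is consistent with latency not being the binding constraint.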

Modeling Energy Bottlenecks

Combining these analyses, we conclude that training runs between $1 \times 10^{28}$ and $2 \times 10^{30}$ FLOP are feasible by 2030, considering both localized and distributed power constraints.

Chip Manufacturing Capacity

Current Production and Projections

AI chip production, particularly GPUs like Nvidia's H100, is crucial for scaling AI. Current production is constrained by advanced packaging processes and high-bandwidth memory (HBM) production. TSMC, Nvidia's primary chip fab, is rapidly expanding its packaging capacity, aiming for significant growth in the coming years.

Modeling GPU Production and Compute Availability

Projected growth rates for GPU production range from 30% to 100% per year. By 2030, this could result in enough GPUs to support training runs up to $9 \times 10^{29}$ FLOP, even after accounting for distribution among multiple labs and usage for model serving. However, uncertainties remain due to potential bottlenecks in packaging and memory production.
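The gap between the low and high growth scenarios widens quickly under compounding. A minimal sketch, assuming a six-year horizon (2024 through 2030); the function name is illustrative:

```python
def production_multiplier(annual_growth: float, years: int) -> float:
    """Cumulative growth in annual GPU production after `years` of compounding."""
    return (1.0 + annual_growth) ** years


if __name__ == "__main__":
    years = 6  # ASSUMPTION: 2024 through 2030
    for rate in (0.30, 1.00):  # low and high growth rates from the text
        print(f"{rate:.0%}/year for {years} years: {production_multiplier(rate, years):.1f}x")
```

Over six years, 30% annual growth yields just under a 5x increase in production while 100% yields 64x, which is why the projected compute available in 2030 spans such a wide range.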

Figure 3: Distribution of H100-equivalent GPUs and FLOP available for the largest AI training run in 2030 under different scenarios. “Projected TSMC capacity” estimates TSMC’s capacity for GPU production based on historical trends and projections, while “Full TSMC capacity” is a hypothetical where 100% of TSMC’s leading-edge wafer capacity goes to GPU production.

Data Scarcity

The availability of high-quality data is essential for training large AI models.

Multimodality

To mitigate data scarcity, AI labs can leverage multimodal data, including images, videos and audio. Current models already incorporate a significant portion of image data. By efficiently encoding multimodal data, the effective stock of training data could be increased substantially, allowing for training runs up to $2 \times 10^{32}$ FLOP.

Synthetic Data

Generating synthetic data using existing AI models is another avenue to expand training data. While promising, this approach faces challenges such as potential model collapse and computational overhead. Nonetheless, synthetic data could play a role in overcoming data bottlenecks, although our analysis conservatively excludes it due to current uncertainties.

Latency Wall

The latency wall refers to the fundamental speed limits in AI computations due to sequential processing requirements.

Latency Wall Given Intranode Latencies

Within a single node, latencies are minimal but increase with model size. By considering batch sizes and model parameters, we estimate that training runs up to $6 \times 10^{31}$ FLOP are feasible before hitting latency constraints.

Latency Wall Given Latencies Between Nodes

When considering internode communication, latencies increase due to factors like network topology. However, improvements in network design and communication protocols can mitigate these latencies. Scaling beyond $1 \times 10^{32}$ FLOP may require innovative solutions, such as new network topologies or adjustments in batch size scaling.

How Can These Latencies Be Reduced?

Potential strategies to reduce latency include adopting mesh topologies, increasing GPUs per node and developing more efficient communication protocols. Adjusting batch sizes and the number of layers in models may also help push back the latency wall.
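The intuition behind the latency wall can be captured in a simplified model: each optimizer step must pass through every layer in sequence, forward and then backward, so the minimum training time is the number of steps times the sequential latency per step. This is a rough sketch and not the report's exact model; every numeric input below is a hypothetical placeholder.

```python
SECONDS_PER_DAY = 86_400


def min_training_days(total_tokens: float,
                      batch_tokens: float,
                      layers: int,
                      latency_per_layer_s: float) -> float:
    """Lower bound on training time imposed by sequential per-layer latency."""
    steps = total_tokens / batch_tokens                    # sequential optimizer steps
    seconds_per_step = 2 * layers * latency_per_layer_s    # forward and backward pass
    return steps * seconds_per_step / SECONDS_PER_DAY


if __name__ == "__main__":
    # All values are hypothetical, chosen only to show how the levers interact.
    base = min_training_days(1e15, 6e7, 200, 1e-5)
    bigger_batch = min_training_days(1e15, 1.2e8, 200, 1e-5)
    print(f"Baseline lower bound: {base:.2f} days")
    print(f"With 2x batch size:   {bigger_batch:.2f} days")
```

In this model, doubling the batch size or halving the per-layer latency each halve the lower bound, and reducing the number of layers helps proportionally. Those are the levers listed above.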

Figure 4: Largest training runs permitted by latency and corresponding latencies per layer.

What Constraint Is the Most Limiting?

Among the constraints analyzed, power availability and chip manufacturing capacity are the most immediate bottlenecks. Power may be more malleable due to the potential for expanding energy infrastructure, while chip production faces significant challenges in scaling advanced packaging and memory capacity.

Data scarcity presents substantial uncertainty, especially regarding the utility of multimodal data and the feasibility of synthetic data. The latency wall is a more distant constraint but will become significant as models continue to scale.
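One way to read the comparison is that the feasible training run is capped by the smallest of the four limits. The sketch below uses the headline figures quoted in this article, which are single numbers rather than the report's full ranges, so it is only a rough illustration. The dictionary keys and function name are assumptions.

```python
# Headline FLOP limits as quoted in this article (point figures, not distributions).
CONSTRAINT_FLOP = {
    "power": 2e29,    # 5 GW single-campus training run
    "chips": 9e29,    # projected GPU production
    "data": 2e32,     # with multimodal data
    "latency": 6e31,  # intranode latency wall
}


def binding_constraint(limits: dict[str, float]) -> tuple[str, float]:
    """Return the name and value of the smallest (most limiting) constraint."""
    name = min(limits, key=limits.get)
    return name, limits[name]


if __name__ == "__main__":
    name, flop = binding_constraint(CONSTRAINT_FLOP)
    print(f"Most limiting constraint: {name} at {flop:.0e} FLOP")
```

With these figures power is the tightest limit, followed by chips, while data and latency sit orders of magnitude higher.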

Figure 5: Conservative estimate of the largest possible training run allowed by each of the four constraints we consider. Also plotted: point estimate of the largest frontier run expected by 2030, assuming a 4x/year growth rate since GPT-4’s release.

Will Labs Attempt to Scale to These New Heights?

Whether AI labs will pursue scaling to the projected levels depends on economic factors and the potential return on investment. The continuous improvement in AI capabilities with increased scaling suggests that labs may find it worthwhile. Reports of significant investments, such as Microsoft and OpenAI's rumored $100 billion data center project, indicate industry momentum toward larger-scale training runs.

The potential economic payoff from automating substantial portions of economic tasks is enormous. Investing trillions in AI development and infrastructure could be economically justified if AI can effectively substitute for human labor, potentially driving unprecedented economic growth.

Conclusion

Our analysis suggests that, based on current trends, training runs of up to 2e29 FLOP are feasible by 2030. This scaling is likely to continue unless constrained by significant bottlenecks in power supply, chip manufacturing, data availability or latency.

Power availability and chip manufacturing capacity are the most pressing constraints that need to be addressed to sustain scaling. Overcoming these challenges could lead to AI models with transformative capabilities, attracting massive investments and potentially becoming one of humanity's largest technological endeavors.

For the full report: https://epochai.org/blog/can-ai-scaling-continue-through-2030

Stay tuned to the AI Lab Series for more deep dives into the evolving world of artificial intelligence. Together, we'll explore the trends, challenges and innovations shaping our future.

Artificial Intelligence in Monaco
