will scaling work?

dwarkesh patel, 2023-12-26


An exploration of some of today's biggest questions in AI: data bottlenecks, generalization benchmarks, primate evolution, intelligence as compression, world modelers... "(+)" / "(-)" denote arguments for/against.

We will run out of high quality language data by 2025

  • (-) Scaling curves imply we need ~1e35 FLOPs (a rough estimate for human-level thought) for an AI reliable/smart enough to write scientific papers.
    • ⤷ 5 OOMs more data than we have (≈100,000× more data) ∴ ∄ enough data to keep up with the exponential increase in compute demanded by scaling laws (even with generous assumptions about improved techniques, multimodal training, recycling tokens over multiple epochs, and curriculum learning).
  • (compute = parameters × data) → more data ⇒ more compute
    • in total, we need ≈9 OOMs more compute (rough arithmetic sketched after this list)
  • (+) we can produce GPT4 with 0.03% of MSFT's yearly revenue + a big chunk of the internet.
    • ⤷ optimism: "if the internet were bigger, scaling up a few-hundred-line Python model could have produced a human-level mind"
    • humans developed language as a genetic/cultural coevolution
      • is this ≈ to synthetic data/self play loops for LLMs? (Where models get smarter to better make sense of complex symbolic outputs of similar copies.)
  • (-) if self-play/synthetic data doesn't work, LLMs are fucked: new architectures are very unlikely to help. A jump in sample efficiency bigger than LSTMs→Transformers is needed.
    • If a model can't approach human level performance with data a human would see in 20,000 years, maybe 2bn years worth of data will also be insufficient.
      "there's no amount of jet fuel you can add to an airplane to make it reach the moon." –François Chollet.
  • (-) Novel reasoning doesn't have a concrete win condition ∴ LLMs are incapable of correcting their own reasoning.
    • Self-play worked with AlphaGo since the model judged itself on a concrete condition: "did I win this game?"
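
A quick back-of-envelope sketch of the arithmetic above, assuming Chinchilla-style compute-optimal scaling (compute ≈ 6 × params × tokens, tokens ≈ 20 × params) and round-number guesses for today's data stock and largest training runs. Every constant below is my assumption, not a figure from the post; it lands near the ~5 OOM data gap and ~9-10 OOM compute gap cited above.

```python
import math

# Back-of-envelope data/compute-gap arithmetic. Assumptions (not from the post):
#   - Chinchilla-style compute-optimal scaling: C ≈ 6 * N * D, with D ≈ 20 * N
#   - ~1e13 high-quality text tokens available today
#   - ~2e25 FLOPs for a current frontier training run
TARGET_FLOPS     = 1e35   # rough "reliable scientist" compute target from the notes
AVAILABLE_TOKENS = 1e13
CURRENT_FLOPS    = 2e25

# Solve C = 6 * N * (20 * N) for N, then get the compute-optimal token count D.
n_params = math.sqrt(TARGET_FLOPS / 120)
n_tokens = 20 * n_params

data_gap    = math.log10(n_tokens / AVAILABLE_TOKENS)
compute_gap = math.log10(TARGET_FLOPS / CURRENT_FLOPS)
print(f"params needed  : {n_params:.1e}")
print(f"tokens needed  : {n_tokens:.1e}  (~{data_gap:.1f} OOMs more data than we have)")
print(f"compute needed : {TARGET_FLOPS:.0e}  (~{compute_gap:.1f} OOMs more than today)")
```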

has scaling even actually worked?

  • (+) Performance on benchmarks has scaled consistently over the last 8 OOMs of compute; a trend that has held so consistently for the last 8 OOMs should be reliable for the next 8.
    • GPT4's technical report stated they predicted its performance from "models trained using the same methodology but using at most 10,000× less compute than GPT4" (a toy version of that extrapolation is sketched after this list).
    • Performance would also likely compound and help speed up AI research.
  • (-) Do scaling curves on next-token prediction actually correspond to true progress towards generality? (major question in my opinion)
    • (+) as models scale, their performance consistently and reliably improves on a broad range of tasks, measured by many benchmarks.
    • (-) these benchmarks test for memorization not intelligence. Why is it impressive that models trained on the internet happen to have many random facts memorized? How is this indicative of intelligence or creativity?
      • MMLU, BigBench, HumanEval: memorization, recall, interpolation.
      • SWE-bench, ARC: ability to problem-solve across long time horizons or difficult abstractions.
  • GPT3 → GPT4 was roughly a 100× scale-up; the cost of a further 10,000× scale-up would be <1% of world GDP.
    • Pretraining compute-efficiency gains (mixture-of-experts, flash attention) + new post-training methods (RLAIF, fine-tuning on chain of thought, self-play...) + hardware improvements should make it possible to turn ~1% of world GDP into a GPT8-level model (rough cost arithmetic sketched after this list).
    • ⤷ How much are societies willing to spend on general-purpose technologies?
      • At its peak in 1847, British railway investment was ~7% of GDP.
      • Just after the Telecommunications Act of 1996, telecom companies invested >$500bn (~$1T adj. for inflation) into laying fiber-optic cable and building wireless networks.
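
Two quick sketches for the arguments above. First, the "predict the big run from small runs" trick mentioned in GPT4's technical report: fit a power law to a few small-compute runs, then extrapolate several OOMs up. The runs, losses, and fitted constants below are invented for illustration; only the method is the point.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy scaling-law extrapolation: fit loss(C) = c + a * C^(-b) on small runs,
# then predict the loss of a run 10,000x larger. All numbers are made up.
def scaling_law(compute, a, b, c):
    return c + a * compute ** (-b)

compute = np.array([1e19, 1e20, 1e21, 1e22])     # FLOPs of hypothetical small runs
loss    = np.array([3.21, 2.90, 2.65, 2.46])     # their (invented) final losses

params, _ = curve_fit(scaling_law, compute, loss, p0=[100.0, 0.1, 2.0], maxfev=20000)
a, b, c = params
print(f"fit: loss ≈ {c:.2f} + {a:.0f} * C^(-{b:.2f})")
print(f"predicted loss at 1e26 FLOPs: {scaling_law(1e26, *params):.2f}")
```

Second, the cost-vs-GDP arithmetic: assuming a GPT4-scale run costs on the order of $100M (my assumption, not a figure from the post), another 10,000× of raw scale lands around $1T, roughly 1% of gross world product, before any efficiency gains.

```python
# Rough cost-of-further-scaling arithmetic; every input is an assumed round number.
GPT4_TRAINING_COST = 1e8    # ~$100M for a GPT4-scale run (assumption)
FURTHER_SCALEUP    = 1e4    # another 10,000x of raw compute
EFFICIENCY_GAINS   = 1e2    # assumed 100x from algorithmic + hardware progress
WORLD_GDP          = 1e14   # ~$100T gross world product

naive_cost     = GPT4_TRAINING_COST * FURTHER_SCALEUP
effective_cost = naive_cost / EFFICIENCY_GAINS
print(f"naive cost            : ${naive_cost:.0e}  ({naive_cost / WORLD_GDP:.1%} of world GDP)")
print(f"with efficiency gains : ${effective_cost:.0e}  ({effective_cost / WORLD_GDP:.2%} of world GDP)")
```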

do models understand the world?

  • (+) Training LLMs on code makes them better at language reasoning. This implies:
    • there is some shared logical structure between code and language.
    • unsupervised gradient descent can extract that structure and use it to improve reasoning.
      • Gradient descent finds the most efficient compression of the data → the most efficient compression is also the deepest and most powerful one. (this is cool; the prediction-as-compression link is sketched after this list)
    • A deeply internalized understanding of the underlying scientific explanations lets a model predict how an incomplete argument from a book is likely to proceed.
  • (-) ability to compress ∈ intelligence, but compression ≠ intelligence.
    • Can't really argue that Plato is an idiot compared to me + my (compressed) knowledge.
    • if LLMs are compressions made by other processes, we still don't know anything about their own ability to make compressions. (?)
  • (+) The usual pattern in the history of technology is that Invention precedes Theory. We should expect the same of intelligence. (very very interesting i think)
    • We developed a full understanding of thermodynamics 100 years after the steam engine was invented.
    • No need to explain why scaling must keep working for scaling to keep working.
    • No law of physics states that Moore's Law must continue. You can do mental gymnastics about practical hurdles, compute bottlenecks, and the brittleness of benchmarks, or you can just look at the fucking line.
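
A small sketch of the prediction-as-compression link mentioned above: with an arithmetic coder, a model whose cross-entropy loss is L nats per token can compress text to about L/ln 2 bits per token. The loss and bytes-per-token numbers below are assumed round figures, not measurements.

```python
import math

# Prediction = compression: a model with cross-entropy loss L (nats/token) can,
# paired with an arithmetic coder, compress text to ~L / ln(2) bits per token.
LOSS_NATS_PER_TOKEN = 1.8   # assumed cross-entropy of a hypothetical model
BYTES_PER_TOKEN     = 4.0   # assumed average bytes of text per BPE token

bits_per_token    = LOSS_NATS_PER_TOKEN / math.log(2)
bits_per_byte     = bits_per_token / BYTES_PER_TOKEN
compression_ratio = 8.0 / bits_per_byte          # raw text is 8 bits per byte

print(f"bits per token   : {bits_per_token:.2f}")
print(f"bits per byte    : {bits_per_byte:.2f}")
print(f"compression ratio: {compression_ratio:.1f}x vs raw UTF-8")
```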

Conclusion

  • Many theoretically possible things have turned out to be intrinsically difficult to build for one reason or another (fusion power, flying cars, nanotech).
  • (-) The theoretical reasons to expect scaling to keep going are murky + the generality of the benchmarks where scaling is evident is debatable.
    • If self-play/synthetic data doesn't work, the models look fucked.

will models get insight-based learning?

  • (+) Grokking seems similar to human learning.
    • We have mental models that change themselves over time with new observations + intuitions about how to categorize new information.
    • Gradient descent over such a large, diverse set of data will select the most general and extrapolative circuits → grokking will lead to insight-based learning. (a minimal grokking setup is sketched after this list)
  • (-) Teaching a kid the sun is at the center of the solar system immediately changes how he understands the night sky.
    • You can't really feed Copernicus' writing to a model untrained on astronomy and expect it to immediately incorporate those insights into all relevant future outputs.
    • Models need to hear information many times in many contexts to grok underlying concepts (bizarre).
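
A minimal sketch of the standard grokking setup (modular addition, a small network, heavy weight decay, long full-batch training). All hyperparameters are assumptions, and any single run may or may not reproduce the characteristic late jump in test accuracy; this shows the shape of the experiment, not a guaranteed result.

```python
import torch
import torch.nn as nn

# Grokking-style setup: train on half of all (a + b) mod P pairs, then watch
# whether test accuracy jumps long after train accuracy saturates.
P = 97
pairs  = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm   = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]

def one_hot(idx):
    # concatenate one-hot encodings of a and b
    x = pairs[idx]
    return torch.cat([nn.functional.one_hot(x[:, 0], P),
                      nn.functional.one_hot(x[:, 1], P)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt   = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(1, 20_001):            # long full-batch training
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(one_hot(train_idx)), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr = (model(one_hot(train_idx)).argmax(1) == labels[train_idx]).float().mean().item()
            te = (model(one_hot(test_idx)).argmax(1) == labels[test_idx]).float().mean().item()
        print(f"step {step:6d}  train acc {tr:.2f}  test acc {te:.2f}")
```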
