will scaling work?

dwarkesh patel, 2023-12-26


An exploration of some of today's biggest questions in AI: data bottlenecks, generalization benchmarks, primate evolution, intelligence as compression, world modelers... "(+)" / "(-)" denote arguments for/against.

We will run out of high quality language data by 2025

  • (-) Scaling curves imply we need ~1e35 FLOPs (a rough estimate for human-level thought) for an AI reliable/smart enough to write scientific papers.
    • ⤷ 5 OOMs more data than we have (≈100,000× more data) ∴ ∄ enough data to keep up with the exponential increase in compute demanded by scaling laws (even with generous assumptions about improved techniques, multimodal training, recycling tokens over multiple epochs, and curriculum learning).
  • (compute = parameters × data) → more data ⇒ more compute
    • in total, we need ≈9 OOMs more compute (rough arithmetic sketched after this list)
  • (+) we can produce GPT4 with 0.03% of MSFT's yearly revenue + a big chunk of the internet.
    • ⤷ optimism: "if the internet were bigger, scaling up a few-hundred-line Python model could have produced a human-level mind"
    • humans developed language as a genetic/cultural coevolution
      • is this ≈ to synthetic data/self play loops for LLMs? (Where models get smarter to better make sense of complex symbolic outputs of similar copies.)
  • (-) if self-play/synthetic data doesn't work, LLMs are fucked: new architectures are very unlikely to help. A jump in sample efficiency bigger than LSTMs→Transformers is needed.
    • If a model can't approach human level performance with data a human would see in 20,000 years, maybe 2bn years worth of data will also be insufficient.
      "there's no amount of jet fuel you can add to an airplane to make it reach the moon." –François Chollet.
  • (-) Novel reasoning doesn't have a concrete win condition ∴ LLMs are incapable of correcting their own reasoning.
    • Self-play worked with AlphaGo since the model judged itself on a concrete condition: "did I win this game?"
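
A quick back-of-envelope sketch of the arithmetic above, assuming Chinchilla-style compute-optimal scaling (compute ≈ 6 × params × tokens, tokens ≈ 20 × params) and round-number guesses for today's data stock and largest training runs. Every constant below is my assumption, not a figure from the post; it lands near the ~5 OOM data gap and ~9-10 OOM compute gap cited above.

```python
import math

# Back-of-envelope data/compute-gap arithmetic. Assumptions (not from the post):
#   - Chinchilla-style compute-optimal scaling: C ≈ 6 * N * D, with D ≈ 20 * N
#   - ~1e13 high-quality text tokens available today
#   - ~2e25 FLOPs for a current frontier training run
TARGET_FLOPS     = 1e35   # rough "reliable scientist" compute target from the notes
AVAILABLE_TOKENS = 1e13
CURRENT_FLOPS    = 2e25

# Solve C = 6 * N * (20 * N) for N, then get the compute-optimal token count D.
n_params = math.sqrt(TARGET_FLOPS / 120)
n_tokens = 20 * n_params

data_gap    = math.log10(n_tokens / AVAILABLE_TOKENS)
compute_gap = math.log10(TARGET_FLOPS / CURRENT_FLOPS)
print(f"params needed  : {n_params:.1e}")
print(f"tokens needed  : {n_tokens:.1e}  (~{data_gap:.1f} OOMs more data than we have)")
print(f"compute needed : {TARGET_FLOPS:.0e}  (~{compute_gap:.1f} OOMs more than today)")
```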

has scaling even actually worked?

  • (+) Performance on benchmarks has scaled consistently over the last 8 OOMs of compute; a trend that has held so consistently for the last 8 OOMs should be reliable for the next 8.
    • GPT4's technical report stated they predicted its performance from "models trained using the same methodology but using at most 10,000× less compute than GPT4" (a toy version of that extrapolation is sketched after this list).
    • Performance would also likely compound and help speed up AI research.
  • (-) Do scaling curves on next-token prediction actually correspond to true progress towards generality? (major question in my opinion)
    • (+) as models scale, their performance consistently and reliably improves on a broad range of tasks, measured by many benchmarks.
    • (-) these benchmarks test for memorization not intelligence. Why is it impressive that models trained on the internet happen to have many random facts memorized? How is this indicative of intelligence or creativity?
      • MMLU, BigBench, HumanEval: memorization, recall, interpolation.
      • SWE-bench, ARC: ability to problem-solve across long time horizons or difficult abstractions.
  • GPT3 → GPT4 was roughly a 100× scale-up; the cost of a further 10,000× scale-up would be <1% of world GDP.
    • Pretraining compute-efficiency gains (mixture-of-experts, flash attention) + new post-training methods (RLAIF, fine-tuning on chain of thought, self-play...) + hardware improvements should make it possible to turn ~1% of world GDP into a GPT8-level model (rough cost arithmetic sketched after this list).
    • ⤷ How much are societies willing to spend on general-purpose technologies?
      • At its peak in 1847, British railway investment was ~7% of GDP.
      • Just after the Telecommunications Act of 1996, telecom companies invested >$500bn (~$1T adj. for inflation) into laying fiber-optic cable and building wireless networks.
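
Two quick sketches for the arguments above. First, the "predict the big run from small runs" trick mentioned in GPT4's technical report: fit a power law to a few small-compute runs, then extrapolate several OOMs up. The runs, losses, and fitted constants below are invented for illustration; only the method is the point.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy scaling-law extrapolation: fit loss(C) = c + a * C^(-b) on small runs,
# then predict the loss of a run 10,000x larger. All numbers are made up.
def scaling_law(compute, a, b, c):
    return c + a * compute ** (-b)

compute = np.array([1e19, 1e20, 1e21, 1e22])     # FLOPs of hypothetical small runs
loss    = np.array([3.21, 2.90, 2.65, 2.46])     # their (invented) final losses

params, _ = curve_fit(scaling_law, compute, loss, p0=[100.0, 0.1, 2.0], maxfev=20000)
a, b, c = params
print(f"fit: loss ≈ {c:.2f} + {a:.0f} * C^(-{b:.2f})")
print(f"predicted loss at 1e26 FLOPs: {scaling_law(1e26, *params):.2f}")
```

Second, the cost-vs-GDP arithmetic: assuming a GPT4-scale run costs on the order of $100M (my assumption, not a figure from the post), another 10,000× of raw scale lands around $1T, roughly 1% of gross world product, before any efficiency gains.

```python
# Rough cost-of-further-scaling arithmetic; every input is an assumed round number.
GPT4_TRAINING_COST = 1e8    # ~$100M for a GPT4-scale run (assumption)
FURTHER_SCALEUP    = 1e4    # another 10,000x of raw compute
EFFICIENCY_GAINS   = 1e2    # assumed 100x from algorithmic + hardware progress
WORLD_GDP          = 1e14   # ~$100T gross world product

naive_cost     = GPT4_TRAINING_COST * FURTHER_SCALEUP
effective_cost = naive_cost / EFFICIENCY_GAINS
print(f"naive cost            : ${naive_cost:.0e}  ({naive_cost / WORLD_GDP:.1%} of world GDP)")
print(f"with efficiency gains : ${effective_cost:.0e}  ({effective_cost / WORLD_GDP:.2%} of world GDP)")
```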

do models understand the world?

  • (+) Training LLMs on code makes them better at language reasoning. This implies:
    • there is some shared logical structure between code and language.
    • unsupervised gradient descent can extract that structure and use it to improve reasoning.
      • Gradient descent finds the most efficient compression of the data → the most efficient compression is also the deepest and most powerful one. (this is cool; the prediction-as-compression link is sketched after this list)
    • A deeply internalized understanding of the underlying scientific explanations lets a model predict how an incomplete argument from a book is likely to proceed.
  • (-) ability to compress ∈ intelligence, but compression ≠ intelligence.
    • Can't really argue that Plato is an idiot compared to me + my (compressed) knowledge.
    • if LLMs are compressions made by other processes, we still don't know anything about their own ability to make compressions. (?)
  • (+) The usual pattern in the history of technology is that Invention precedes Theory. We should expect the same of intelligence. (very very interesting i think)
    • We developed a full understanding of thermodynamics 100 years after the steam engine was invented.
    • No need to explain why scaling must keep working for scaling to keep working.
    • No law of physics states that Moore's Law must continue. You can do mental gymnastics about practical hurdles, compute bottlenecks, and the brittleness of benchmarks, or you can just look at the fucking line.
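
A small sketch of the prediction-as-compression link mentioned above: with an arithmetic coder, a model whose cross-entropy loss is L nats per token can compress text to about L/ln 2 bits per token. The loss and bytes-per-token numbers below are assumed round figures, not measurements.

```python
import math

# Prediction = compression: a model with cross-entropy loss L (nats/token) can,
# paired with an arithmetic coder, compress text to ~L / ln(2) bits per token.
LOSS_NATS_PER_TOKEN = 1.8   # assumed cross-entropy of a hypothetical model
BYTES_PER_TOKEN     = 4.0   # assumed average bytes of text per BPE token

bits_per_token    = LOSS_NATS_PER_TOKEN / math.log(2)
bits_per_byte     = bits_per_token / BYTES_PER_TOKEN
compression_ratio = 8.0 / bits_per_byte          # raw text is 8 bits per byte

print(f"bits per token   : {bits_per_token:.2f}")
print(f"bits per byte    : {bits_per_byte:.2f}")
print(f"compression ratio: {compression_ratio:.1f}x vs raw UTF-8")
```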

Conclusion

  • Many theoretically possible things have turned out to be intrinsically difficult to build for one reason or another (fusion power, flying cars, nanotech).
  • (-) The theoretical reasons to expect scaling to keep going are murky + the generality of the benchmarks where scaling is evident is debatable.
    • If self-play/synthetic data doesn't work, the models look fucked.

will models get insight-based learning?

  • (+) Grokking seems similar to human learning.
    • We have mental models that change themselves over time with new observations + intuitions about how to categorize new information.
    • Gradient descent over such a large, diverse set of data will select the most general and extrapolative circuits → grokking will lead to insight-based learning. (a minimal grokking setup is sketched after this list)
  • (-) Teaching a kid the sun is at the center of the solar system immediately changes how he understands the night sky.
    • You can't really feed Copernicus' writing to a model untrained on astronomy and expect it to immediately incorporate those insights into all relevant future outputs.
    • Models need to hear information many times in many contexts to grok underlying concepts (bizarre).
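
A minimal sketch of the standard grokking setup (modular addition, a small network, heavy weight decay, long full-batch training). All hyperparameters are assumptions, and any single run may or may not reproduce the characteristic late jump in test accuracy; this shows the shape of the experiment, not a guaranteed result.

```python
import torch
import torch.nn as nn

# Grokking-style setup: train on half of all (a + b) mod P pairs, then watch
# whether test accuracy jumps long after train accuracy saturates.
P = 97
pairs  = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm   = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]

def one_hot(idx):
    # concatenate one-hot encodings of a and b
    x = pairs[idx]
    return torch.cat([nn.functional.one_hot(x[:, 0], P),
                      nn.functional.one_hot(x[:, 1], P)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt   = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(1, 20_001):            # long full-batch training
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(one_hot(train_idx)), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr = (model(one_hot(train_idx)).argmax(1) == labels[train_idx]).float().mean().item()
            te = (model(one_hot(test_idx)).argmax(1) == labels[test_idx]).float().mean().item()
        print(f"step {step:6d}  train acc {tr:.2f}  test acc {te:.2f}")
```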
