
> Contra Chollet, I think that current LLMs are well described as doing at least some useful learning when doing in-context learning.

I agree that Chollet appears to imply that in-context learning doesn't count as learning when he states:

> "Most of the time when you're using an LLM, it's just doing static inference. The model is frozen. You're just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand."

(This seems misguided as we have evidence of models tracking and updating state in activation space)

However later on in the Dwarkesh interview, he says:

> "Discrete program search is very deep recombination with a very small set of primitive programs. The LLM approach is the same but on the complete opposite end of that spectrum. You scale up the memorization by a massive factor and you're doing very shallow search. They are the same thing, just different ends of the spectrum."

My steelman of Chollet's position is that he thinks the _depth_ of search you can perform via ICL in current LLMs is too shallow, which means they rely much more on learned mechanisms that require comparatively less runtime search/computation but inherently limit generalization.

I think the directional claim "you can easily overestimate LLMs' generalization abilities by observing their performance on common tasks" is correct - LLMs are able to learn very many shallow heuristics and memorize much more information than humans, which allows them to get away with doing less in-context learning. However, it is also true that this may not limit their ability to automate many tasks, especially with the correct scaffolding, or stop them from being dangerous in various ways.


I agree with basically all of this.


Have you published the actual prompts you used anywhere? Thanks!


Is it possible that the issues with vision are because the size of the cells in the input is different from GPT-4o's patch size?


I did a variety of tests on image size. I found that a 40-pixel-wide cell worked well and that the results weren't hugely sensitive to cell size.
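For reference, here's a minimal sketch of the kind of rendering being discussed, blowing each grid cell up to a fixed pixel size. The palette and helper names are illustrative assumptions, not the post's actual code:

```python
from PIL import Image

# Assumed RGB palette for ARC's ten colors; purely illustrative.
ARC_PALETTE = [
    (0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
    (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37),
]

def render_grid(grid: list[list[int]], cell_px: int = 40) -> Image.Image:
    """Render each grid cell as a cell_px x cell_px block of solid color."""
    height, width = len(grid), len(grid[0])
    img = Image.new("RGB", (width * cell_px, height * cell_px))
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            box = (x * cell_px, y * cell_px, (x + 1) * cell_px, (y + 1) * cell_px)
            img.paste(ARC_PALETTE[value], box)
    return img
```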


There is a real issue here. These images are tiny.

As the post mentions, vision models don't see these images very well, because they're not used to seeing well-defined objects which are exactly 3 pixels or exactly 5 pixels in height.

And humans aren't either. If we viewed the images at their real size, we'd be using a magnifying glass to try to spot patterns.

This links to issues of spectral energy, image scale, and adversarial perturbations in typical computer vision models.


Fantastic, thanks for sharing! Regarding non-IID splits: the hidden test score is actually likely to be lower, as there are quite a few very similar examples within the validation set, but supposedly the test set won't have such leakage; see this comment by Chollet: https://www.kaggle.com/competitions/abstraction-and-reasoning-challenge/discussion/130710#764733

Not sure what the difference is in terms of the errors: a few errors in the evaluation set examples were fixed and a few still remain. Humans can normally detect them and still work out the correct solution, but this can be more of an issue for automated methods.


Given that the previous contender https://lab42.global/community-interview-jack-cole/ was using dynamic evaluation (https://gwern.net/doc/ai/nn/dynamic-evaluation/index), it would probably work very well to apply dynamic evaluation to a program+image-generating LLM prompt. Since dynamic evaluation ~ finetuning ~ ICL on a big enough context window, this should also be doable with context windows in the millions.

Specifically: the simplest way is to either dynamically evaluate on, or add to the context, the submitted Q&A. As it goes, it builds up a history of answers it can draw on. This is in the spirit of Chollet's quest for DL with neuroplasticity - language models have always had neuroplasticity, it just comes in the form of dynamic evaluation and hidden states/context windows, and for some historical reason, the former has been forgotten.

Given the mention of the fact that a lot of the programs are invalid and the LLM doesn't seem to understand what they generate, one could go further and do more test-time dynamic evaluation on triplets: a hypothetical grid, a program which is supposed to generate it, and the actual output of the program. If they don't match, include the matching pair. (And one could try to do further generation to fix that error, although that probably is going too far in making it a general vector image generator: https://gwern.net/idea#vector-generation .) Add a few dozen of these to the LLM, and it will probably get much better at generating reasonable programs and so the program search will be a lot more efficient: the context/evaluation may seem expensive, but if it saves you from sampling literally millions of possible erroneous programs, it will probably be a big win.
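A minimal sketch of how those triplets might be collected before being fed back into the context or used for dynamic evaluation. The function names and data shapes here are assumptions for illustration, not the post's actual pipeline:

```python
# Sketch: collect (hypothetical grid, candidate program, actual output) triplets.
# Mismatching triplets get added to the prompt or finetuning data so the model can
# see what its programs actually do. Executing untrusted code should be sandboxed.

def run_candidate(program_src: str, input_grid: list[list[int]]):
    """Execute a candidate program defining transform(grid) and return its output."""
    namespace: dict = {}
    try:
        exec(program_src, namespace)
        return namespace["transform"](input_grid)
    except Exception:
        return None  # invalid program

def collect_triplets(candidates, input_grid, hypothetical_grid):
    """Keep the triplets where the program's real output disagrees with the target."""
    triplets = []
    for src in candidates:
        actual = run_candidate(src, input_grid)
        if actual != hypothetical_grid:
            triplets.append(
                {"target": hypothetical_grid, "program": src, "actual": actual}
            )
    return triplets
```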

I doubt OA cares enough to try to enable dynamic evaluation on GPT-4o, but the Cole results I linked are with a very small LLM, and so probably you could do all that with an existing VLM.


Fuck, I was looking into Triadic Memory https://github.com/PeterOvermann/TriadicMemory


I'm still skeptical - generating 5k programs per solution is less "reasoning" and more luck. I don't argue that the LM isn't doing anything - getting a decent solution within 5k means it has *some* idea at least. But still.

What would convince me to update:

1. LM needs to search far fewer programs (<500)

2. The performance on private LB is competitive with humans. I highly doubt this is the case here - I'm more inclined to believe somehow training priors got leaked (say, through ppl uploading ARC solvers to GitHub with such training set priors that won't hold up in test set)

That said, I'm surprised 4o held up. I'm expecting much better results for GPT4 and Opus.


With 128 samples we get:

- 26% on test

- 47% on train


But that's the "held-out" training set used as test right - not private LB?


Yep, not private leaderboard. I'm skeptical dataset leakage will be a serious issue.


> The performance on private LB is competitive with humans. I highly doubt this is the case here - I'm more inclined to believe somehow training priors got leaked (say, through ppl uploading ARC solvers to GitHub with such training set priors that won't hold up in test set)

There don't appear to be serious leakage issues based on the new semi-private evaluation set.

See https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o?commentId=sfs8gH2qhac53zwqC for context.


That is good to know!


With an updated version of the method, I get 36% on test with about 256 samples per problem.

(Note this is with 2 submissions per problem rather than 3.)

And, as I note in the other comment, it doesn't seem like data contamination is a serious issue.

I think 36% is probably notably worse than typical people on MTurk, but not wildly worse. (Edit: Perhaps typical people on MTurk get around 70%? And probably there is a broad human range.)

Curious if this makes you update.


It does - though not to a significant degree. I think better LLMs (Claude 3.5 Sonnet next? 😉 I'm sure they'd love to give you compute credits) would obviously help. However, what I would like to see is a feedback loop - where the LLM can iterate on its solution using the result of some "evaluation" function which measures how far the generated program is from fitting the few-shot examples. This implies using the LLM as a mesa-optimizer - which would be much more sample-efficient and generally effective.
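A rough sketch of the kind of evaluation function such a loop could use, scoring a candidate program by how closely its outputs fit the few-shot examples. The names and scoring choice are illustrative assumptions, not an existing implementation:

```python
def score_program(transform, examples) -> float:
    """Score a candidate `transform(grid) -> grid` function against few-shot examples.

    Returns the average fraction of correctly predicted cells (1.0 = perfect fit),
    which the outer loop could feed back to the LLM alongside the failing cases.
    """
    per_example_scores = []
    for input_grid, expected in examples:
        try:
            predicted = transform(input_grid)
        except Exception:
            per_example_scores.append(0.0)
            continue
        if not predicted or len(predicted) != len(expected):
            per_example_scores.append(0.0)
            continue
        total_cells = sum(len(row) for row in expected)
        correct_cells = sum(
            p == e
            for pred_row, exp_row in zip(predicted, expected)
            for p, e in zip(pred_row, exp_row)
        )
        per_example_scores.append(correct_cells / total_cells)
    return sum(per_example_scores) / len(per_example_scores)
```

The lowest-scoring examples (plus the program's actual outputs on them) could then be fed back into the next prompt for revision.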

If this approach works, then I can very easily imagine LLMs dominating SOTA at least - if not fully solving ARC-AGI.

BTW, thanks for all the time, energy and money you've poured into this :)


Have you tried providing the image as JSON?

An array of arrays of color names, or ascii letters etc?

I've had anecdotally good results with this approach when trying to solve some pixel-based problems in one of my projects.

I provided icons as images and as ascii-art-like representations, and didn't have very good results getting it to do what I wanted (island detection, offsetting, border thickening, etc).

Then I moved to 2D JSON and got significantly better (though far from perfect) solutions.

The images were 16x16 binary pixels (so in my case the values were booleans, not strings).

I'd be very curious if doing this would result in any improvements in your case (it should be fairly easy to implement, though I expect it's not cheap to try it out...).
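For concreteness, here is a small sketch of the two encodings being compared; the color-name mapping is an assumption for illustration, not the post's actual representation:

```python
import json

# Assumed mapping from ARC's integer colors to names; the ordering is illustrative.
COLOR_NAMES = ["black", "blue", "red", "green", "yellow",
               "grey", "magenta", "orange", "cyan", "brown"]

def grid_to_json(grid: list[list[int]]) -> str:
    """Encode a grid as a JSON array of arrays of color names."""
    return json.dumps([[COLOR_NAMES[cell] for cell in row] for row in grid])

def grid_to_ascii(grid: list[list[int]]) -> str:
    """Encode the same grid as rows of digit characters."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

example = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(grid_to_json(example))   # [["black", "black", "blue"], ...]
print(grid_to_ascii(example))  # 001 / 010 / 100
```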

Amazing work anyway.


I use a carefully engineered ascii representation.

This ascii representation encodes the colors/symbols as numbers; it would plausibly be better to use the names of the colors.


I did find a non-trivial improvement when using JSON-formatted data instead of just lists of values. I'll try to see if I can get those scripts running again and get some data for you by comparing the two formats.


It seems kind of odd to attribute the intelligence to GPT4o instead of your own cleverness in reshaping the problem to be easier? You built a larger system with the LLM as an (admittedly important) component of it. It's quite a useful component, but the LLM doesn't build systems.


Certainly some credit goes to me and some to GPT4o.

The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).

It's worth noting that a high fraction of my time went into writing prompts and optimizing the representation. (Which is perhaps better described as teaching gpt4o and making it easier for it to see the problem.)

There are different analogies here which might be illuminating:

- Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.

- If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see), this would greatly affect my ability to do ARC-AGI puzzles.

- You can build systems around people which remove most of the interesting intelligence from various tasks.

I think what is going on here is analogous to all of these.

Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133


The "child in the woods" is little too anthropomorphic for my tastes. Here's another approach:

When I was playing around with Code Interpreter last year, it felt like I was doing a software interview with a well-read junior programmer who wasn't all that bright. I had to do a lot of hand-holding. I treated it as a game: how can I make it build what I want it to build, just by asking the right questions?

Although this might feel like teaching and a human might learn a lot from it, there's a sense in which it isn't. The LLM itself is immutable. When you start over, it will have forgotten everything. The illusion of learning comes from generating the chat transcript, which serves as guidance. So an alternate view is that you're generating educational material that can guide the LLM to a solution.

The LLM and the educational material together is a larger system that could be thought of as learning something, as you add more material. This works quite well, since it's easy to add more material. And I expect these larger systems to improve quite a bit as software engineers tinker with them.

And as you suggest, if you include *people* in that larger system, it gets pretty interesting.

Also, there is a longer-term loop where LLMs themselves are improved - they absorb information from educational material. But it's not real-time.


Precisely! And he did things like using different prompts for different tasks. Which means that he's helping it.


Technically, in the two prior ARC challenges, people have been building heuristic program synthesis with predefined operations on a DSL, so yeah, this does raise the participation rate of any AI system on this from 0% to maybe 20%?


Is the ability to integrate/merge thousands of potential programs that map an input to an output the same as seeing 1-2 inputs, generalizing from only those, and then "knowing" the correct output?

I’d argue the former isn’t a form of induction, but deduction.


You have to have a vague idea of what the solution is for any of the thousands of programs to ever work.


True, but thousands of programs is several OOMs larger than humans' search space. This seems like fancier brute-forcing, or at least pseudo-MCTS (where the LM is like a value and prediction head combined).


It is very, very far off actual brute force. Sampling randomly from all possible programs wouldn't work this well. Humans are somewhat more "intelligent" here (at least consciously, we sample fewer candidate solutions), but I think this is a difference in degree rather than in kind.


I'm on the fence here. Sampling 5-10k programs would not work as mere brute force, yet since the LLM is definitely capable of ICL, and the output is actually pretty constrained (it has to be a 2D array of ints), maybe the valid search space isn't that big to start with?

So, inside my head I have two different opinions on this:

1. The 10k sampling is effectively a pseudo brute force, with most of the learning happening at training time, where the LLM learns some heuristics of generic programming. This can be supported by looking at DeepMind's AlphaCode 1, which achieved Codeforces-level performance on coding puzzles with 1 million samples. And if we extrapolate the correct-solution rate, we only get a mere 74% success rate on the ARC training set. And obviously there's no way that Codeforces problems are as hard as ARC. So this indicates that ARC's visual representation and resistance to memorization form a very big challenge for AI. If we extrapolate this characteristic to an extreme, where we have an AI that still takes 10k samples to solve ARC, and takes, say, 10 million to solve the Riemann hypothesis, then even if we admit such an AI has general learning capability, we must say such capability does not match the normal human conception of it.

2. The vast sampling is just how a parallel stochastic system would work most effectively. Humans loathe such an approach simply because our hardware, or rather wetware, is not capable of parallelism. It would be logically unwise not to use such a capability if it is available. And maybe for an AI with parallelism and stochasticity, the smart way is to just roll the dice as much as possible and pick the best branch as some sort of learning.


I personally think that difference in degree is very informative about the differences in how human brains and LLMs operate.


Sure, but isn't that a negative? If the LLM already has the abstractions in abstraction space, then it hasn't learned the abstraction at all. It implies that this is a deductive problem where it is given a sparse abstraction and must continually guess at refinements. But arguably no generalization has occurred.


The original purpose of the ARC challenge was to evaluate an AI's capacity for abstraction and analogy. Solving the problem with search (using thousands of trials as you did) would significantly downplay the challenge from that perspective. What's your opinion on this?


To elaborate a bit on why using search might downplay the challenge: Using search allows you to form simple heuristics using the external feedback (the more trials you do, the better heuristics you'll get), and I think we probably need a better metric that incorporates the number of trials. Say in the extreme case, we have a random code generator that generates infinite programs over the grid (we can have some very simple constraints over it such as both the inputs and outputs should be a grid layout), certainly it will hit some correct solutions (i.e., the infinite monkey theorem), would you consider such a random generator "intelligent"?


Thanks for the great read.

Would you say that the biggest bottleneck to LLMs succeeding here is that LLMs are often not able to identify the patterns needed to solve the puzzle, or that they are not able to execute and build the solution?

Also, do you think sampling Python programs is the best approach for solving the puzzles? Or are there better programming languages/DSLs that would make LLMs make fewer mistakes?


I think failing to identify and reason through the pattern is the main bottleneck.

Better languages/DSLs are possible, though models are very familiar with Python and that makes it easier.


I like the less dangerous part of this.

It's kind of funny looking back at some of the novels I read as a kid - it feels more like we're collectively performing a very intricate high-tech summoning ritual to summon the Machine God.

That is of course said jokingly but only half. To some extent we may be too inspired to be afraid.


Just did some back-of-the-napkin calculation: so far this method uses 30k tokens per request * 8000 samples = 240 million tokens per problem? And to increase the success rate by 3%, the token count needs to double?


Your calculation neglects prefix caching - we use the same prefix for multiple completions.

The bulk of the cost is in generating the output tokens. Responses average around 600 output tokens, for 8000 * 600 = 4,800,000 output tokens per problem.
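A rough sketch of that accounting under assumed numbers; the per-request batching figure is hypothetical, and the point is just that output tokens dominate once the prompt prefix is shared across completions:

```python
samples = 8_000                # completions sampled per problem
prompt_tokens = 30_000         # shared prompt prefix
output_per_sample = 600        # average output tokens per completion
completions_per_request = 100  # hypothetical batching; not the actual setup

requests = samples // completions_per_request       # 80 requests
total_prompt_tokens = requests * prompt_tokens      # 2.4M prompt tokens
total_output_tokens = samples * output_per_sample   # 4.8M output tokens
print(total_prompt_tokens, total_output_tokens)
```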


Excellent work.

The 'update' comment triggered me a bit. It falls into the category of stereotypical behavior that members of philosophical cults with whom I do not share beliefs or ideals would use in their language. I am glad that at least you did not say 'prior' or 'Bayesian'. That was very relieving.

The second thing that triggered me a little: 'I'm ambivalent about advancing AI progress overall, especially in a broadly proliferated fashion. If I thought my work on ARC-AGI would likely substantially advance AI progress, I would not have published this blog post'. This also seems to fall into the category of stereotypical behavior that members of philosophical cults with whom I do not share beliefs or ideals would express, in the form of an implicit belief that what you are doing may actually be harmful to humanity (while there is no basis to show that this is true, and it feels like an ego complex or unjustified fear situation that many members of such philosophical cults deal with). It would be beneficial to leave philosophy and subjective feelings or beliefs about AI safety aside and work on improving AI safety while improving AI capabilities. The post does not benefit from "virtue signaling", and it may actually be off-putting because it makes the post somehow "elitist" and "projective" to members of such philosophical cults, when everyone else in the public may not agree with the basis of the statement, or may disagree with it in principle.

Other than those two things, nothing else triggered me, and the work is extremely interesting and clever. For this type of LLM work, I really appreciate what has been done, and the meticulous way in which you have optimized your overall method (which augments GPT-4o for the task at hand), which seems to closely follow a good phenomenological application of the scientific method. I like the fact that you are honest about your work in making the model perform well under the circumstances, caveats, and exploitation of resources, the latter of which seems paradigmatic for achieving so-called "SOTA" performance with LLMs.


Very good read on the ARC challenge, would you mind me sharing it in my AI newsletter?


Sure, but please note the caveats here https://x.com/RyanPGreenblatt/status/1803194126420328502


How much did your experiments cost?
