
> Contra Chollet, I think that current LLMs are well described as doing at least some useful learning when doing in-context learning.

I agree that Chollet appears to imply that in-context learning doesn't count as learning when he states:

> "Most of the time when you're using an LLM, it's just doing static inference. The model is frozen. You're just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand."

(This seems misguided as we have evidence of models tracking and updating state in activation space)

However later on in the Dwarkesh interview, he says:

> "Discrete program search is very deep recombination with a very small set of primitive programs. The LLM approach is the same but on the complete opposite end of that spectrum. You scale up the memorization by a massive factor and you're doing very shallow search. They are the same thing, just different ends of the spectrum."

My steelman of Chollet's position is that he thinks the _depth_ of search you can perform via ICL in current LLMs is too shallow, which means they rely much more on learned mechanisms that require comparatively less runtime search/computation but inherently limit generalization.

I think the directional claim "you can easily overestimate LLMs' generalization abilities by observing their performance on common tasks" is correct - LLMs are able to learn very many shallow heuristics and memorize much more information than humans, which allows them to get away with doing less in-context learning. However, it is also true that this may not limit their ability to automate many tasks, especially with the correct scaffolding, or stop them from being dangerous in various ways.


I agree with basically all of this.


Have you published the actual prompts you used anywhere? Thanks!


Is it possible that the issues with vision are because the size of the cells in the input is different from GPT-4o's patch size?


I did a variety of tests on image size. I found that a 40-pixel-wide cell worked well and that the results weren't hugely sensitive to cell size.
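For reference, here's a minimal sketch of the kind of rendering being discussed, blowing each grid cell up to a fixed pixel size. The palette and helper names are illustrative assumptions, not the post's actual code:

```python
from PIL import Image

# Assumed RGB palette for ARC's ten colors; purely illustrative.
ARC_PALETTE = [
    (0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
    (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37),
]

def render_grid(grid: list[list[int]], cell_px: int = 40) -> Image.Image:
    """Render each grid cell as a cell_px x cell_px block of solid color."""
    height, width = len(grid), len(grid[0])
    img = Image.new("RGB", (width * cell_px, height * cell_px))
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            box = (x * cell_px, y * cell_px, (x + 1) * cell_px, (y + 1) * cell_px)
            img.paste(ARC_PALETTE[value], box)
    return img
```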


There is a real issue here. These images are tiny.

As the post mentions, vision models don't see these images very well, because they're not used to seeing well-defined objects which are exactly 3 pixels or exactly 5 pixels in height.

And humans aren't either. If we viewed the images at their real size, we'd be using a magnifying glass to try to spot patterns.

This links to issues of spectral energy, image scale, and adversarial perturbations in typical computer vision models.


Fantastic, thanks for sharing! Regarding non-IID splits: the hidden test score is actually likely to be lower, as there are quite a few very similar examples within the validation set, but supposedly the test set won't have such leakage; see this comment by Chollet: https://www.kaggle.com/competitions/abstraction-and-reasoning-challenge/discussion/130710#764733

Not sure what the difference is in terms of the errors: a few errors in the evaluation set examples were fixed and a few still remain. Humans can normally detect them and still work out the correct solution, but this can be more of an issue for automated methods.


Given that the previous contender https://lab42.global/community-interview-jack-cole/ was using dynamic evaluation (https://gwern.net/doc/ai/nn/dynamic-evaluation/index), it would probably work very well to apply dynamic evaluation to a program+image-generating LLM prompt. Since dynamic evaluation ~ finetuning ~ ICL on a big enough context window, this should also be doable with context windows in the millions.

Specifically: the simplest way is to either dynamically evaluate on, or add to the context, the submitted Q&A. As it goes, it builds up a history of answers it can draw on. This is in the spirit of Chollet's quest for DL with neuroplasticity - language models have always had neuroplasticity, it just comes in the form of dynamic evaluation and hidden states/context windows, and for some historical reason, the former has been forgotten.

Given the mention of the fact that a lot of the programs are invalid and the LLM doesn't seem to understand what they generate, one could go further and do more test-time dynamic evaluation on triplets: a hypothetical grid, a program which is supposed to generate it, and the actual output of the program. If they don't match, include the matching pair. (And one could try to do further generation to fix that error, although that probably is going too far in making it a general vector image generator: https://gwern.net/idea#vector-generation .) Add a few dozen of these to the LLM, and it will probably get much better at generating reasonable programs and so the program search will be a lot more efficient: the context/evaluation may seem expensive, but if it saves you from sampling literally millions of possible erroneous programs, it will probably be a big win.
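A minimal sketch of how those triplets might be collected before being fed back into the context or used for dynamic evaluation. The function names and data shapes here are assumptions for illustration, not the post's actual pipeline:

```python
# Sketch: collect (hypothetical grid, candidate program, actual output) triplets.
# Mismatching triplets get added to the prompt or finetuning data so the model can
# see what its programs actually do. Executing untrusted code should be sandboxed.

def run_candidate(program_src: str, input_grid: list[list[int]]):
    """Execute a candidate program defining transform(grid) and return its output."""
    namespace: dict = {}
    try:
        exec(program_src, namespace)
        return namespace["transform"](input_grid)
    except Exception:
        return None  # invalid program

def collect_triplets(candidates, input_grid, hypothetical_grid):
    """Keep the triplets where the program's real output disagrees with the target."""
    triplets = []
    for src in candidates:
        actual = run_candidate(src, input_grid)
        if actual != hypothetical_grid:
            triplets.append(
                {"target": hypothetical_grid, "program": src, "actual": actual}
            )
    return triplets
```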

I doubt OA cares enough to try to enable dynamic evaluation on GPT-4o, but the Cole results I linked are with a very small LLM, and so probably you could do all that with an existing VLM.


Fuck, I was looking into Triadic Memory https://github.com/PeterOvermann/TriadicMemory


I'm still skeptical - generating 5k programs per solution is less "reasoning" and more luck. I don't argue that the LM isn't doing anything - getting a decent solution within 5k means it has *some* idea at least. But still.

What would convince me to update:

1. LM needs to search far fewer programs (<500)

2. The performance on private LB is competitive with humans. I highly doubt this is the case here - I'm more inclined to believe somehow training priors got leaked (say, through ppl uploading ARC solvers to GitHub with such training set priors that won't hold up in test set)

That said, I'm surprised 4o held up. I'm expecting much better results for GPT4 and Opus.


With 128 samples we get:

- 26% on test

- 47% on train


But that's the "held-out" training set used as test right - not private LB?


Yep, not private leaderboard. I'm skeptical dataset leakage will be a serious issue.


> The performance on private LB is competitive with humans. I highly doubt this is the case here - I'm more inclined to believe somehow training priors got leaked (say, through ppl uploading ARC solvers to GitHub with such training set priors that won't hold up in test set)

There don't appear to be serious leakage issues based on the new semi-private evaluation set.

See https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o?commentId=sfs8gH2qhac53zwqC for context.


That is good to know!


With an updated version of the method, I get 36% on test with about 256 samples per problem.

(Note this is with 2 submissions per problem rather than 3.)

And, as I note in the other comment, it doesn't seem like data contamination is a serious issue.

I think 36% is probably notably worse than typical people on MTurk, but not wildly worse. (Edit: Perhaps typical people on MTurk get around 70%? And probably there is a broad human range.)

Curious if this makes you update.


It does - though not to a significant degree. I think better LLMs (Claude 3.5 Sonnet next? 😉 I'm sure they'd love to give you compute credits) would obviously help. However, what I would like to see is a feedback loop - where the LLM can iterate on its solution using the result of some "evaluation" function which measures how far the generated program is from fitting the few-shot examples. This implies using the LLM as a mesa-optimizer - which would be much more sample-efficient and generally effective.
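A rough sketch of the kind of evaluation function such a loop could use, scoring a candidate program by how closely its outputs fit the few-shot examples. The names and scoring choice are illustrative assumptions, not an existing implementation:

```python
def score_program(transform, examples) -> float:
    """Score a candidate `transform(grid) -> grid` function against few-shot examples.

    Returns the average fraction of correctly predicted cells (1.0 = perfect fit),
    which the outer loop could feed back to the LLM alongside the failing cases.
    """
    per_example_scores = []
    for input_grid, expected in examples:
        try:
            predicted = transform(input_grid)
        except Exception:
            per_example_scores.append(0.0)
            continue
        if not predicted or len(predicted) != len(expected):
            per_example_scores.append(0.0)
            continue
        total_cells = sum(len(row) for row in expected)
        correct_cells = sum(
            p == e
            for pred_row, exp_row in zip(predicted, expected)
            for p, e in zip(pred_row, exp_row)
        )
        per_example_scores.append(correct_cells / total_cells)
    return sum(per_example_scores) / len(per_example_scores)
```

The lowest-scoring examples (plus the program's actual outputs on them) could then be fed back into the next prompt for revision.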

If this approach works, then I can very easily imagine LLMs dominating SOTA at least - if not fully solving ARC-AGI.

BTW, thanks for all the time, energy and money you've poured into this :)


Have you tried providing the image as JSON?

An array of arrays of color names, or ascii letters etc?

I've had anecdotally good results with this approach when trying to solve some pixel-based problems in one of my projects.

I provided icons as images and as ascii-art-like representations, and didn't have very good results getting it to do what I wanted (island detection, offsetting, border thickening, etc).

Then I moved to 2D JSON and got significantly better (though far from perfect) solutions.

The images were 16x16 binary pixels (so in my case the values were booleans, not strings).

I'd be very curious if doing this would result in any improvements in your case (it should be fairly easy to implement, though I expect it's not cheap to try it out...).
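For concreteness, here is a small sketch of the two encodings being compared; the color-name mapping is an assumption for illustration, not the post's actual representation:

```python
import json

# Assumed mapping from ARC's integer colors to names; the ordering is illustrative.
COLOR_NAMES = ["black", "blue", "red", "green", "yellow",
               "grey", "magenta", "orange", "cyan", "brown"]

def grid_to_json(grid: list[list[int]]) -> str:
    """Encode a grid as a JSON array of arrays of color names."""
    return json.dumps([[COLOR_NAMES[cell] for cell in row] for row in grid])

def grid_to_ascii(grid: list[list[int]]) -> str:
    """Encode the same grid as rows of digit characters."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

example = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(grid_to_json(example))   # [["black", "black", "blue"], ...]
print(grid_to_ascii(example))  # 001 / 010 / 100
```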

Amazing work anyway.


I use a carefully engineered ascii representation.

This ascii representation encodes the colors/symbols as numbers; it would plausibly be better to use the names of the colors.


I did find a non-trivial improvement when using JSON-formatted data instead of just lists of values. I'll try to see if I can get those scripts running again and get some data for you by comparing the two formats.


It seems kind of odd to attribute the intelligence to GPT4o instead of your own cleverness in reshaping the problem to be easier? You built a larger system with the LLM as an (admittedly important) component of it. It's quite a useful component, but the LLM doesn't build systems.


Certainly some credit goes to me and some to GPT4o.

The solution would be much worse without careful optimization and wouldn't work at all without gpt4o (or another llm with similar performance).

It's worth noting that a high fraction of my time went into writing prompts and optimizing the representation. (Which is perhaps better described as teaching gpt4o and making it easier for it to see the problem.)

There are different analogies here which might be illuminating:

- Suppose that you strand a child out in the woods and never teach them anything. I expect they would be much worse at programming. So, some credit for their abilities goes to society and some to their brain.

- If you remove my ability to see (or conversely, use fancy tools to make it easier for a blind person to see), this would greatly affect my ability to do ARC-AGI puzzles.

- You can build systems around people which remove most of the interesting intelligence from various tasks.

I think what is going on here is analogous to all of these.

Separately, this tweet is relevant: https://x.com/MaxNadeau_/status/1802774696192246133


The "child in the woods" is little too anthropomorphic for my tastes. Here's another approach:

When I was playing around with Code Interpreter last year, it felt like I was doing a software interview with a well-read junior programmer who wasn't all that bright. I had to do a lot of hand-holding. I treated it as a game: how can I make it build what I want it to build, just by asking the right questions?

Although this might feel like teaching and a human might learn a lot from it, there's a sense in which it isn't. The LLM itself is immutable. When you start over, it will have forgotten everything. The illusion of learning comes from generating the chat transcript, which serves as guidance. So an alternate view is that you're generating educational material that can guide the LLM to a solution.

The LLM and the educational material together is a larger system that could be thought of as learning something, as you add more material. This works quite well, since it's easy to add more material. And I expect these larger systems to improve quite a bit as software engineers tinker with them.

And as you suggest, if you include *people* in that larger system, it gets pretty interesting.

Also, there is a longer-term loop where LLMs themselves are improved - they absorb information from educational material. But it's not real-time.


Precisely! And he did things like using different prompts for different tasks. Which means that he's helping it.


Technically, in the two prior ARC challenges, people have been building heuristic program synthesis with predefined operations on a DSL, so yeah, this does raise the participation rate of any AI system on this from 0% to maybe 20%?


Is the ability to integrate/merge thousands of potential programs that map an input to an output the same as seeing 1-2 inputs, generalizing from only those, and then "knowing" the correct output?

I’d argue the former isn’t a form of induction, but deduction.


You have to have a vague idea of what the solution is for any of the thousands of programs to ever work.


True, but thousands of programs is several OOMs larger than humans' search space. This seems like fancier brute-forcing, or at least pseudo-MCTS (where the LM is like a value and prediction head combined).


It is very, very far off actual brute force. Sampling randomly from all possible programs wouldn't work this well. Humans are somewhat more "intelligent" here (at least consciously, we sample fewer candidate solutions), but I think this is a difference in degree rather than in kind.


I'm on the fence here. Sampling 5-10k programs would not work as mere brute force, yet since the LLM is definitely capable of ICL, and the output is actually pretty constrained (it has to be a 2D array of ints), maybe the valid search space isn't that big to start with?

So, inside my head I have two different opinions on this:

1. The 10k sampling is effectively a pseudo brute force, with most of the learning happening at training time, where the LLM learns some heuristics of generic programming. This can be supported by looking at DeepMind's AlphaCode 1, which achieved Codeforces-level performance on coding puzzles with 1 million samples. And if we extrapolate the correct-solution rate, we only get a mere 74% success rate on the ARC training set. And obviously there's no way that Codeforces problems are as hard as ARC. So this indicates that ARC's visual representation and resistance to memorization form a very big challenge for AI. If we extrapolate this characteristic to an extreme, where we have an AI that still takes 10k samples to solve ARC, and takes, say, 10 million to solve the Riemann hypothesis, then even if we admit such an AI has general learning capability, we must say such capability does not match the normal human conception of it.

2. The vast sampling is just how a parallel stochastic system would work most effectively. Humans loathe such an approach simply because our hardware, or rather wetware, is not capable of parallelism. It would be logically unwise not to use such a capability if it is available. And maybe for an AI with parallelism and stochasticity, the smart way is to just roll the dice as much as possible and pick the best branch as some sort of learning.


I personally think that difference in degree is very informative about the differences in how human brains and LLMs operate.


Sure, but isn't that a negative? If the LLM already has the abstractions in abstraction space, then it hasn't learned the abstraction at all. It implies that this is a deductive problem where it is given a sparse abstraction and must continually guess at refinements. But arguably no generalization has occurred.


The original purpose of the ARC challenge was to evaluate an AI's capacity for abstraction and analogy. Solving the problem with search (using thousands of trials as you did) would significantly downplay the challenge from that perspective. What's your opinion on this?


To elaborate a bit on why using search might downplay the challenge: Using search allows you to form simple heuristics using the external feedback (the more trials you do, the better heuristics you'll get), and I think we probably need a better metric that incorporates the number of trials. Say in the extreme case, we have a random code generator that generates infinite programs over the grid (we can have some very simple constraints over it such as both the inputs and outputs should be a grid layout), certainly it will hit some correct solutions (i.e., the infinite monkey theorem), would you consider such a random generator "intelligent"?


Thanks for the great read.

Would you say that the biggest bottleneck to LLMs succeeding here is that LLMs are often not able to identify the patterns needed to solve the puzzle, or that they are not able to execute and build the solution?

Also, do you think sampling Python programs is the best approach for solving the puzzles? Or are there better programming languages/DSLs that would make LLMs make fewer mistakes?


I think failing to identify and reason through the pattern is the main bottleneck.

Better languages/DSLs are possible, though models are very familiar with Python and that makes it easier.


I like the less dangerous part of this.

It's kind of funny looking back at some of the novels I read as a kid - it feels more like we're collectively performing a very intricate high-tech summoning ritual to summon the Machine God.

That is of course said jokingly but only half. To some extent we may be too inspired to be afraid.


Just did some back-of-the-napkin calculation: so far this method uses 30k tokens per request * 8000 samples = 240 million tokens per problem? And to increase the success rate by 3%, the token count needs to double?


Your calculation neglects prefix caching - we use the same prefix for multiple completions.

The bulk of the cost is in generating the output tokens. Responses average around 600 output tokens, for 8000 * 600 = 4,800,000 output tokens per problem.
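A rough sketch of that accounting under assumed numbers; the per-request batching figure is hypothetical, and the point is just that output tokens dominate once the prompt prefix is shared across completions:

```python
samples = 8_000                # completions sampled per problem
prompt_tokens = 30_000         # shared prompt prefix
output_per_sample = 600        # average output tokens per completion
completions_per_request = 100  # hypothetical batching; not the actual setup

requests = samples // completions_per_request       # 80 requests
total_prompt_tokens = requests * prompt_tokens      # 2.4M prompt tokens
total_output_tokens = samples * output_per_sample   # 4.8M output tokens
print(total_prompt_tokens, total_output_tokens)
```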


Excellent work.

The 'update' comment triggered me a bit. It falls into the category of stereotypical behavior that members of philosophical cults with whom I do not share beliefs or ideals would use in their language. I am glad that at least you did not say 'prior' or 'Bayesian'. That was very relieving.

The second thing that triggered me a little: 'I'm ambivalent about advancing AI progress overall, especially in a broadly proliferated fashion. If I thought my work on ARC-AGI would likely substantially advance AI progress, I would not have published this blog post'. This also seems to fall into the category of stereotypical behavior that members of philosophical cults with whom I do not share beliefs or ideals would express, in the form of an implicit belief that what you are doing may actually be harmful to humanity (while there is no basis to show that this is true, and it feels like an ego complex or unjustified fear situation that many members of such philosophical cults deal with). It would be beneficial to leave philosophy and subjective feelings or beliefs about AI safety aside and work on improving AI safety while improving AI capabilities. The post does not benefit from "virtue signaling", and it may actually be off-putting because it makes the post somehow "elitist" and "projective" to members of such philosophical cults, when everyone else in the public may not agree with the basis of the statement, or may disagree with it in principle.

Other than those two things, nothing else triggered me, and the work is extremely interesting and clever. For this type of LLM work, I really appreciate what has been done, and the meticulous way in which you have optimized your overall method (which augments GPT-4o for the task at hand), which seems to closely follow a good phenomenological application of the scientific method. I like the fact that you are honest about your work in making the model perform well under the circumstances, caveats, and exploitation of resources, the latter of which seems paradigmatic for achieving so-called "SOTA" performance with LLMs.


Very good read on the ARC challenge, would you mind me sharing it in my AI newsletter?


Sure, but please note the caveats here https://x.com/RyanPGreenblatt/status/1803194126420328502


How much did your experiments cost?
