97 comments

  • Smaug123 44 days ago
    So I am extremely hyped about this, but it's not clear to me how much heavy lifting this sentence is doing:

    > First, the problems were manually translated into formal mathematical language for our systems to understand.

    The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?

    • ocfnash 44 days ago
      The computer did find the answers itself. I.e., it found "even integers" for P1, "{1,1}" for P2, and "2" for P6. It then also provided a Lean proof in each case.
      • freehorse 43 days ago
        It would make a lot of sense to provide the Lean formalisations of the problems that the researchers fed to the AI. Not assuming bad intent in not providing them, but it would help in understanding the results better.
      • nnarek 44 days ago
        The formal definition of the first theorem already contains the answer to the problem: "{α : ℝ | ∃ k : ℤ, Even k ∧ α = k}" (which means the set of reals that are even integers). If they say that they translated the first problem into a formal definition, then it is very interesting how they initially formalized the problem without including the answer in it.
        • Smaug123 44 days ago
          (You're talking to one of the people who was part of the project, which is why I took @ocfnash's answer as authoritative: they did not cheat.)
          • Xelynega 44 days ago
            If they're talking to the people who are part of the project I'd hope the answer would contain detail and not expect to be taken as authoritative.
            • refulgentis 44 days ago
              [flagged]
              • pishpash 44 days ago
                [flagged]
                • refulgentis 44 days ago
                  That wasn't very nice. Are you curious about anything? Happy to help. I'd proactively do it, but I don't want to guess at what's in your mind. My initial guess is you think I think that engaging with the public is an infinite loop. I don't!
        • golol 44 days ago
          I would expect that in the data they train AlphaProof on, they have some concept of a "vague problem", which could just look like

          {Formal description of the set in question} = ?

          And then AlphaProof has to find candidate descriptions of this set and prove a theorem that they are equal to the above.
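
          A hedged Lean sketch of that shape (the names `answerSet` and `badSet` are hypothetical, not DeepMind's actual training format):

            import Mathlib

            -- The published statement has the answer substituted in; `answerSet`
            -- stands for whatever closed-form set the system proposes:
            def answerSet : Set ℝ := {α : ℝ | ∃ k : ℤ, Even k ∧ α = k}

            -- Note that a naive existential rendering of "determine all X" is
            -- degenerate: it is provable by handing back the defining condition
            -- itself, so the candidate answer has to be synthesized separately.
            example (badSet : Set ℝ) : ∃ S : Set ℝ, badSet = S := ⟨badSet, rfl⟩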

          I doubt they would claim to solve the problem if they provided half of the answer.

          • puttycat 44 days ago
            > I doubt they would claim to solve the problem if they provided half of the answer.

            Stranger things have happened

          • chx 44 days ago
            > I doubt they would claim to solve the problem if they provided half of the answer.

            This falls under "extraordinary claims require extraordinary proof", and we have seen nothing of the sort.

          • sebzim4500 44 days ago
            To be fair, that isn't half the answer, it's like 99% of the answer.

            They clarified above that it provided the full answer though.

          • JyB 44 days ago
            The DeepMind team has a history of being misleading. The great StarCraft 2 strategist bot still comes to mind.
            • totoglazer 43 days ago
              What’s the story with that bot? Always thought it was cool. Was that all smoke and mirrors?
              • topato 43 days ago
                I think maybe the parent comment is referring to it essentially just employing a zerg rush, but with the speed and reaction time of an AI? Not 100% sure... Unrelated, iirc the StarCraft functionality was an early example of generalizing a pretrained NN, AlphaGo, and showing that it could adapt to learn and defeat games across strategic domains, especially after it learned so much strategy from the most difficult, widely played, and most strategically varied physical game available.
        • pishpash 44 days ago
          Exactly, a problem and its answer are just different ways of describing the same object. Every step of a proof is a transformation/translation of the same object. It would be disingenuous to say that some heavy lifting isn't done in formalizing a problem, but it seems that step is also performed by a machine:

          "We established a bridge between these two complementary spheres by fine-tuning a Gemini model to automatically translate natural language problem statements into formal statements, creating a large library of formal problems of varying difficulty."

          I'm confused, is the formalization by Gemini or "manually"? Which is it?

        • cygaril 44 days ago
          Come up with many possible answers, formalize them all, and then try to prove or disprove each of them.
          • Davidzheng 44 days ago
            This is probably partially what they did idk why it's downvoted lol
        • riku_iki 44 days ago
          It's not clear if the theorem is the actual input formal definition, or if the formal definition was in a different form.
      • Davidzheng 44 days ago
        Can you elaborate on how it makes guesses like this? Does it do experiments before? Is it raw LLM? Is it feedback loop based on partial progress?
        • Sharlin 44 days ago
          "AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."
          • taneq 44 days ago
            Huh, so MCTS to find the ‘best’ token using a (relatively) small, quick language model? Sounds like an interesting approach to small model text generation too…
          • JKCalhoun 44 days ago
            Yeah, I am not clear on the degree to which this system and LLMs are related. Are they related? Or is AlphaProof a complete tangent to ChatGPT and its ilk?
            • gowld 44 days ago
              It's not an English LLM (Large Language Model).

              It's a math Language Model. Not even sure it's a Large Language Model. (Maybe shares a foundational model with an English LLM; I don't know)

              It learns mathematical statements, and generates new mathematical statements, then uses search techniques to continue. Similar to Alpha Go's neural network, what makes it new and interesting is how the NN/LLM part makes smart guesses that drastically prune the search tree, before the brute-force search part.

              (This is also what humans do to solve math problems. But humans are really, really slow at brute-force search, so we rely almost entirely on the NN pattern-matching, analogy-making part.)
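
              As a rough sketch of the shape of that loop (every interface here is hypothetical; the real system couples its network with AlphaZero-style MCTS, and this is a simpler best-first stand-in):

                import heapq

                def guided_proof_search(goal, policy, apply_tactic, max_nodes=10_000):
                    """Best-first search over proof states. `policy(state)` is assumed
                    to yield (tactic, prior) pairs; `apply_tactic(state, tactic)` is
                    assumed to run the proof assistant and return the resulting state,
                    or None if the tactic fails."""
                    frontier = [(0.0, 0, goal)]  # (cost, tie-break id, proof state)
                    next_id = 1
                    while frontier and next_id < max_nodes:
                        cost, _, state = heapq.heappop(frontier)
                        if state.is_solved():  # the checker, not the model, decides success
                            return state
                        # The learned model prunes the tree: only its top-ranked
                        # tactics are ever handed to the (expensive) checker.
                        for tactic, prior in policy(state):
                            new_state = apply_tactic(state, tactic)
                            if new_state is not None:
                                heapq.heappush(frontier, (cost - prior, next_id, new_state))
                                next_id += 1
                    return None  # search budget exhausted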

              • nextos 44 days ago
                These kinds of LLMs are also very interesting for software engineering. It's just a matter of replacing Lean with something that is more oriented towards proving software properties.

                For example, write a formal specification of a function in Dafny or Liquid Haskell and get the LLM to produce code that is formally guaranteed to be correct. Logic-based + probability-based ML.

                All GOFAI ideas are still very useful.
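
                A minimal sketch of the idea, staying in Lean for familiarity (the names `isSorted` and `SortSpec` are made up; Dafny and Liquid Haskell express the same thing with postconditions and refinement types):

                  import Mathlib

                  -- A property the generated code must satisfy:
                  def isSorted : List Nat → Prop
                    | [] => True
                    | [_] => True
                    | a :: b :: rest => a ≤ b ∧ isSorted (b :: rest)

                  -- The specification: `f` returns a sorted permutation of its input.
                  -- An LLM-produced implementation plus a machine-checked proof of
                  -- `SortSpec f` is code that is formally guaranteed to be correct.
                  def SortSpec (f : List Nat → List Nat) : Prop :=
                    ∀ l : List Nat, isSorted (f l) ∧ (f l).Perm l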

              • sebzim4500 44 days ago
                My reading of it is that it uses the same architecture as one of the Gemini models but does not share any weights with it. (i.e. it's not just a finetune)
              • _heimdall 43 days ago
                This is really interesting. I would have expected the understanding to be that humans make a guess, test it, and learn from what did or did not work. The lessons learned from the prior tests would impact future guesses.

                Do you know if a system like the OP is learning from failed tests to guide future tests, or is it truly a brute-force search, as if it were trying to mine bitcoin?

                • Thorrez 43 days ago
                  This quote from the article sounds like it learns from failed tests:

                  >We trained AlphaProof for the IMO by proving or disproving millions of problems, covering a wide range of difficulties and mathematical topic areas over a period of weeks leading up to the competition. The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.

                  • _heimdall 43 days ago
                    Reading between the lines a bit, that does answer the question I had, though I don't think I clarified it very well.

                    I read that to say the model's token weights are adjusted as it goes, so in an LLM sense it is kind of learning. It isn't reasoning through an answer in the way a human does though. Meaning, the model is still just statistically predicting what an answer may be and checking if it worked.

                    I wouldn't chalk that up to learning at all. An AI solving complex math doesn't even seem too impressive to me with the predictive loop approach. Computers are well adept at math; throwing enough compute hardware at it to brute-force an answer isn't surprising. I'd be really impressed if it could reliably get there with a similar number of failed attempts as a human. That could indicate that it really learned and reasoned, rather than rammed through a mountain of failed guesses.

                    • Thorrez 43 days ago
                      >with a similar number of failed attempts as a human

                      It'd be hard to know how many failed attempts the human made. Humans are constantly thinking of ideas and eliminating them quickly. Possibly too fast to count.

                      • _heimdall 43 days ago
                        I've never competed in math competitions at this level, but I would have expected it to be pretty clear to the human when they tested a different solution. As complex as the proofs are, is it really feasible that they are testing out a full proof in their head without realizing it?
                        • Thorrez 42 days ago
                          Hmm, I think it comes down to the definitions of "testing" and "attempt". A human will generate many ideas, and eliminate them without creating full proofs, by just seeing that the idea is going in the wrong direction.

                          It sounds like AlphaProof will doggedly create full proofs for each idea.

                          Is what the human is doing testing attempts?

                    • sdenton4 43 days ago
                      Computers are good at arithmetic, not math...

                      There's definitely an aspect of this that is 'airplanes, not birds.' Just because the wings don't flap doesn't mean it can't fly, though.

                      • _heimdall 43 days ago
                        That's totally fair, though wouldn't the algorithm here have to reduce the math proofs to arithmetic that can be computed in silico?
              • bbor 43 days ago
                yeah but it doesn't understand the exact syntax on an absolute level, does it...? I understood this to be the same as any language model applied to programming languages (aka Formal Languages). Is that mistaken?
                • Zondartul 43 days ago
                  As far as I understand, and I may be wrong here, the system is composed of two networks: Gemini and AlphaZero. Gemini, being an ordinary LLM with some fine-tunes, only does translation from natural to formal language. Then, AlphaZero solves the problem. AlphaZero, unburdened with natural language and only dealing with "playing a game in the proof space" (where the "moves" are commands to the Lean theorem prover), does not hallucinate in the same way an LLM does because it is nothing like an LLM.
                • danielheath 43 days ago
                  Yes, but the problem space means that invalid outputs can be quickly identified - whereas general programming isn’t necessarily amenable to rapid checks.
                  • bbor 43 days ago
                    I mean, aren't you just describing formal language syntax? It seems like a fundamentally similar situation: the computer can automatically flag any syntax errors in a millisecond by checking them against the generating grammar for that language. That's what makes a formal language in the first place, I think!

                    I do think this language is considerably more robust than the typical programming language, which means a sound program is more likely to end up being valid/“correct”. But still, that’s a difference of degree, not kind, IMO

                    • danielheath 42 days ago
                      I don’t mean syntax errors - I mean the difficulty of validating code that contains side effects (like http requests, database access etc).

                      Validating a math proof either terminates in a reasonable time (in which case it’s useful for training), or does not (in which case the AI should be discouraged from using that approach).

    • summerlight 44 days ago
      To speak generally, that translation part is much easier than the proof part. The problem with automated translation is that the result might be incorrect. This happens a lot even when people try formal methods by hand, so I guess the researchers concluded that they'd have to audit every single translation regardless of using an LLM or whatever other tools.
      • thomasahle 44 days ago
        You'd think that, but Timothy Gowers (the famous mathematician they worked with) wrote (https://x.com/wtgowers/status/1816509817382735986)

        > However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

        So they didn't actually solve autoformalization, which is why they still needed humans to translate the input IMO 2024 problems.

        The reason why formalization is harder than you think is that there is no way to know if you got it right. You can use Reinforcement Learning with proofs and have a clear signal from the proof checker. We don't have a way to verify formalizations the same way.

        • thrdbndndn 44 days ago
          > However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

          A small detail wasn't clear to me: for these incorrectly formalized problems, how do they get the correct answer as ground truth for training? Have a human manually solve them?

          (In contrast, for problems actually from "a huge database of IMO-type problems", they do already have answers.)

          • summerlight 44 days ago
            > A small detail wasn't clear to me: for these incorrectly formalized problems, how do they get the correct answer as ground truth for training? Have a human to manually solve them?

            Formal proofs can be mechanically checked for correctness; we just don't know the answer in advance. Think of it as an extremely rigorous type system that typically requires really long type annotations, where the annotation itself is a complex program. So if AlphaProof happens to generate a proof that passes this checker, we know that it's correct.
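
            A toy illustration of "mechanically checked" (standard Lean, nothing AlphaProof-specific):

              -- Lean accepts this proof term, so the statement is certainly true:
              example : 2 + 2 = 4 := rfl

              -- A wrong "proof" is rejected at check time, no human judgment needed:
              -- example : 2 + 2 = 5 := rfl   -- fails to type-check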

            • thrdbndndn 43 days ago
              Ah, thanks. That makes a lot of sense now.
              • thomasahle 43 days ago
                One more trick: They look for both proofs and disproofs. So even if they failed the formalization and created a "wrong" theorem, it's just another task.
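
                A mis-formalized statement that happens to be false still yields a perfectly checkable task (toy example in standard Lean):

                  example : ¬ (2 + 2 = 5) := by decide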
          • adrianN 44 days ago
            You write proofs in a formal language that can be machine checked. If the checker is happy, the proof is correct (unless there is a bug in the checker, but that is unlikely).
          • raincole 44 days ago
            They said the incorrectly formalized ones are usually easier, so I assume they just hire humans to solve them in the old way until the AI is smart enough to solve these easier problems.
            • czl 43 days ago
              > I assume they just hire humans to solve…

              An incorrectly formalized problem is simply a different problem, and a solution to any formalized problem is still useful for AI training, because such solutions can be mechanically checked for correctness without hiring humans. What requires humans is the initial formalization step, since that is more of a language-translation task: it requires nuance and judgment and is difficult to mechanically verify.

        • llwu 44 days ago
          > We don't have a way to verify formalizations the same way.

          While there is no perfect method, it is possible to use the agent to determine if the statement is false, has contradictory hypotheses, or a suspiciously short proof.

      • ajross 44 days ago
        > To speak generally, that translation part is much easier than the proof part.

        To you or me, sure. But I think the proof that it isn't easier for this AI system is that they didn't have it do the translation. Asking a modern LLM to "translate" something is a pretty solved problem, after all. That argues strongly that what was happening here is not a "translation" but something else, like a semantic distillation.

        If you ask an AI (or a person) to prove the halting problem, they can't. If you "translate" the question into a specific example that does halt, they can run it and find out.

        I'm suspicious, basically.

    • dooglius 44 days ago
      The linked page says

      > While the problem statements were formalized into Lean by hand, the answers within the problem statements were generated and formalized by the agent.

      However, it's unclear what initial format was given to the agents that allowed this step

      • Smaug123 44 days ago
        FWIW, GPT-4o transcribed a screenshot of problem 1 perfectly into LaTeX, so I don't think "munge the problem into machine-readable form" is per se a difficult part of it these days even if they did somehow take shortcuts (which it sounds like they didn't).
        • pclmulqdq 44 days ago
          Comparing "turn photo into LaTeX" to "translate theorems into Lean" is like comparing a child's watercolor drawing to the Mona Lisa.
          • Smaug123 44 days ago
            … no? After the LaTeX output, I told stock GPT-4o that the answer was "all even integers", and asked for the statement in Lean. I had to make two changes to its output (both of which were compile-time errors, not misformalisations), and it gave me the formalisation of the difficult direction of the problem.

            Both changes were trivial: it had one incorrect (but unnecessary) import, and it used the syntax from Lean 3 instead of Lean 4 in one lambda definition. A system that was trained harder on Lean would not make those mistakes.
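
            For anyone who hasn't hit it, the lambda slip is just the Lean 3 comma versus the Lean 4 arrow (a fact about the two language versions, nothing specific to this problem; `f` is a made-up name):

              -- Lean 3:  def f := λ x, x + 1
              -- Lean 4:
              def f : Nat → Nat := fun x => x + 1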

            The one actual error it made was in not proposing that the other direction of the "if and only if" is required. Again, I am quite confident that this formalisation failure mode is not hard to solve in a system that is, like, actually trained to do this.

            Obviously formalising problems that a working mathematician solves is dramatically harder than formalising IMO problems, and is presumably way ahead of the state of the art.

            • lmm 44 days ago
              > I am quite confident that this formalisation failure mode is not hard to solve in a system that is, like, actually trained to do this.

              Why?

              • Smaug123 43 days ago
                This is really not the kind of problem LLMs are bad at! But since you insist: given the LaTeX, Claude 3.5 Sonnet correctly stated the theorem in full while inventing notation for the floor operation (it did correctly note, unprompted, what the right function was and how to obtain it from mathlib, but it incorrectly attempted to define syntax sugar for it).
          • sebzim4500 44 days ago
            I think that's exaggerating a bit. If you are familiar with both Lean and LaTeX, then I think transcribing these problems to Lean only takes about twice as long as transcribing them to LaTeX.
      • pclmulqdq 44 days ago
        So if Lean was used to find the answers, where exactly is the AI? A thin wrapper around Lean?
        • sdenton4 44 days ago
          Think of problems in NP - you can check the answer efficiently, but finding the answer to check is the hard part... This is basically what we're looking at here: The proof checker can quickly evaluate correctness, but we need something to produce the proof, and that's the hard part.
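
          A toy rendering of that asymmetry in Python (a hypothetical SAT-style example, nothing to do with AlphaProof's internals):

            from itertools import product

            # Checking a proposed solution is cheap (polynomial time)...
            def check(assignment, clauses):
                # each clause is a list of (variable, wanted_value) pairs;
                # one matching literal satisfies the clause
                return all(any(assignment[var] == val for var, val in clause)
                           for clause in clauses)

            # ...but finding one may mean searching all 2^n assignments.
            def brute_force(variables, clauses):
                for values in product([False, True], repeat=len(variables)):
                    assignment = dict(zip(variables, values))
                    if check(assignment, clauses):
                        return assignment
                return None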
        • dooglius 44 days ago
          Lean checks that the proof is valid, it didn't find the proof.
        • pishpash 44 days ago
          The AI is the "solver network", which is the (directed) search over solutions generated by Lean. The AI part is in doing an efficient search, I suppose.

          I'm also waiting for my answer on the role of the Gemini formalizer, but reading between the lines, it looks like it was only used while training the "solver network", not in solving the IMO problems. If so, then the hyping is greatly premature, as the hybrid formalizer/solver is the whole novelty of this, but it's not good enough to use end-to-end?

          You cannot say AlphaProof learned enough to solve problems if formalization made them easier to solve in the first place! You can say that the "solver network" part learned enough to solve formalized problems better than prior training methods.

        • xrisk 44 days ago
          Lean is just the language, presumably used to drive the AI towards the program ("the proof").
    • zerocrates 44 days ago
      Interesting that they have a formalizer (used to create the training data) but didn't use it here. Not reliable enough?
    • golol 44 days ago
      > When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean.

      To me, this sounds like AlphaProof receives a "problem", whatever that is (how do you formalize "determine all X such that..."? One is asked to show that an abstract set is actually some easily understandable set...). Then it generates candidate theorems, presumably in Lean, i.e. the set is {n : P(n)} for some formula or something. Then it searches for proofs.

      I think if AlphaProof did not find {foo} but was given it, then it would be very outrageous to claim that it solved the problem.

      I am also very hyped.

    • sebzim4500 44 days ago
    As someone with a maths degree who hasn't done this kind of thing for half a decade, I was able to immediately guess the solution to (1), but actually proving it is much harder.
    • gowld 44 days ago
      The article says

      > AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct.

    • kurthr 44 days ago
      As is often the case, creating a well-formed problem statement takes as much knowledge (if not work) as finding the solution.

      But seriously, if you can't ask the LLM to solve the right question, you can't really expect it to give you the right answer unless you're really lucky. "I'm sorry, but I think you meant to ask a different question. You might want to check the homework set again to be sure, but here's what I think you really want."

    • rldjbpin 43 days ago
    as a noob, i feel that formalizing is a major part of solving the problem by yourself. my assessment is that once you identify certain patterns, you can solve problems by memorizing some patterns. but people like me can struggle with the first stage and solve the wrong problem.

      still good progress nonetheless. won't call the system sufficient by itself tho.

      • SonOfLilit 43 days ago
        My mathematician friend said problem 5 (I think? With the monsters) seems hard to formulate, so I spent 15 minutes formulating it in pseudo-Haskell.

        Then he gave me a huge hint to the solution, after which it only took me a couple of hours to solve.

        (Formalizing the solution is of course the hardest part, and might serve as a good masters dissertation I think)

        • rldjbpin 42 days ago
          please pardon my ignorance, but to me a tower of hanoi question as a middle-schooler was the hardest thing to comprehend. but after learning about it, it is no longer quite as challenging to tackle.

          i understand that there are very hard questions for the olympiad. but it might be possible to learn about some recurring types of them by looking at past instances. it may not be the meta for IMO but has been for other kinds of exams.

    • hyfgfh 43 days ago
      > First, the problems were manually translated into formal mathematical language for our systems to understand.

      Some people call this programming

    • allxnb 43 days ago
      Presenting this just as "translating into formal language" omits important information.

      Lean isn't just a formal language, it is also a theorem prover. Could the IMO participants use the nlinarith tactic? Could they use other tactics?

      Of course not, they had to show their work!

      Could they have mathematicians translate the problem statements into the formal language for them?

      Of course not, they had to do it themselves. In "How to solve it" Polya stresses multiple times that formalizing the initial question is an important part of the process.

      Then, the actual computational resources expressed in time are meaningless if one has a massive compute cloud.

      I'm a bit dissatisfied with the presentation, same as with the AlphaZero comparison to an obsolete Stockfish version that has been debunked multiple times.

  • necovek 44 days ago
    This is certainly impressive, but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.

    That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.

    But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.

    It really sounds like the AI solved only a single problem in the 9h students get, so it certainly would not be even close to the medals. What's the need to taint the impressive result with an apples-to-oranges comparison?

    Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?

    • SonOfLilit 43 days ago
      I have met IMO competitors. They are insanely smart. I wouldn't have imagined it was possible to be that smart before I started hanging out in these circles. So more like 25% of 0.01% of high school students.

      Time is not a very interesting dimension here, because humans don't use the same CPU as huge GPU clusters. The binary "is it able to reach a solution given enough resources?" is more interesting (for GPT/Claude the answer is a clear negative).

      • necovek 42 days ago
        You probably misinterpreted the above: 25% of the participants in IMO, who are generally high school students. Never claimed that 25% of all high schoolers would do better.

        Now, you may say that time is not a useful dimension here, but really, this is where we are seeing a lot of these advances come from: general researchers today do get access to huge compute capability, allowing them to quickly iterate on different ideas. In a sense, they can be less smart about their use of resources and simply try things out: this does drive the innovation (compared to waiting for their turn on a supercomputer nearby).

        And finally, time is essential even for humans: given a couple hundred years, they will find the proof for Fermat's last theorem, but they might not do it in 4.5h. Since we are comparing AI capabilities to humans in the article, it's very possible that increased compute will never allow AI to find novel proofs we have not come up with either. That's where the AI bit comes in: we know that brute searching through the entire possible space of proofs is still too expensive for our compute capabilities, so we need AI to emulate our "intuition" when looking for the direction to narrow down the search.

        So there are really two reasons time matters: 1. getting enough compute might still be far away (heck, prime factorization and elliptic curves are still the basis of most of the world's cryptography for that reason) and 2. maybe it's not even enough to increase compute capability to make huge jumps in problem-solving capability (iow, maybe we are reaching the maximum of where the approach can take us).

      • astromaniak 42 days ago
        I've seen chess players, they are smart too. So what? A specialized model on commodity hardware beats most of them, if not all. This doesn't mean the model is smarter. The same could be the case here. Is it possible that solutions are one step away from the model's database, or training set? Humans don't remember that much and have to do many more operations in their heads.

        In other words, a model which can solve any/most school or college exam problems isn't necessarily smart. It can be just a database, or, in fact, a lookup table. A smarter version can be lookup + a 1-step test. Not saying it's bad, but it doesn't scale to less formalized domains. BTW, in this case formalization was done by humans.

      • lupire 43 days ago
        But many more of those students (including many not in the contest) could solve those problems given more time than the 4.5hr sessions.

        If these problems were important to solve, redundantly by thousands or millions of people (like the real work that most people do), far more people would put in the effort to learn how to solve these problems.

        It's just a weird comparison. Contests are very artificially structured in ways that don't make sense for comparing to computers.

        • badrunaway 43 days ago
          I think what we need to look at is that AI systems were not able to do this before, and now they are. Soon these systems with millions of dollars of compute will just scale the algorithms and beat humans at this as well, just like they did for chess, Go, shogi, etc.
    • NiloCK 43 days ago
      > medals are awarded to 50% of the participants (high school students)

      In case this confuses anyone: the high school students in question are not a standard sample of high school students. AFAIK, they are teams of the ~6 strongest competitive problem-solving high school students from each country.

    • Davidzheng 44 days ago
      In my opinion (not Google's) the only reason they didn't get gold this year (apart from being unlucky with problem selection) is that they didn't want to try for any partial credit on P3 and P5. They are so close to the cutoff, and usually contestants with a little bit of progress can get 1 point. But I guess they didn't want to get a gold on a technicality; it would be bad press. So they settled on an indisputable silver.
      • lozenge 44 days ago
        The AI took a day on one of the problems, so it must have generated and discarded a lot of proofs that didn't work. How could it choose which one to submit as the answer, except by the objective fact of the proof passing in Lean?
        • snewman 44 days ago
          When tackling IMO problems, the hard part is coming up with a good approach to the proof. Verifying your proof (and rejecting your false attempts) is much easier. You'll know which one to submit.

          (Source: I am a two-time IMO silver medalist.)

          • lupire 43 days ago
            You are a human, not an AI. You know whether your idea seems related to the solution. The AI has thousands of ideas and doesn't know which are better. Graders shouldn't accept a thousand guesses grubbing for 1 point.

            If verifying a good idea is easy, then the evidence shows that the AI didn't have good ideas for the other 2 problems.

            • badrunaway 43 days ago
              We are talking about Lean proofs. Given a formal statement and a proof, Lean can verify whether it's correct or not. It's like generating computer programs to solve a problem: the difficulty lies in generating useful solutions/sub-solutions so that the search is effective. They achieve this by using Gemini as a Lean proof generator, i.e. a world-model LLM fine-tuned to generate Lean proofs more effectively.

              Humans are even better at this, as you mention, but effectively the approach is similar: come up with a lot of ideas and see what proves it.

          • saagarjha 43 days ago
            I don't see why we should take your word for it, as opposed to just asking AlphaProof to comment instead.
            • topato 43 days ago
              Well, he does have twice the number of silver medals... And can speak the English language... Although, an AI attempting to speak with the human race entirely through an esoteric math-proofing language would be an interesting take on the whole "humans make ET contact, interact through universal language... pause for half of the movie, military hothead wanting to blow it out of the sky, until pretty lady mathematician runs into the Oval Office waving a sheet of paper... OF MATH!" trope... But now, it's a race between you and me, to see who can write the screenplay first!
        • Agingcoder 43 days ago
            To some extent, what they do is stronger than what the other contestants do, who as I understand it don't formally prove their answers.
        • Davidzheng 44 days ago
            I think since it proves lemmas, it could submit all the lemmas it proved lol
      • llwu 44 days ago
        Partial credit is quite challenging to earn, per: https://www.imo-official.org/year_statistics.aspx?year=2024
        • Davidzheng 44 days ago
          Link isn't working for me, can you summarize? What I heard ten years ago was that 1 pt is quite common for significant progress.
          • llwu 44 days ago
            Yeah, the progress has to be quite significant, no points are awarded for trivial observations. Thus scores are usually bimodal around 0 and 7. In the linked stats you can see that 1 point for P3/P5 was less common than full score on other problems.
            • gus_massa 43 days ago
              Problems 2 and 5 have a lot of 1s. Sometimes there is an interesting advance that is easier than the full solution. Also, in problem 6 the only hope for most was to get that 1 point.
      • dooglius 44 days ago
        I don't believe anything was graded by the IMO; Google is just giving itself 7 for anything proved in Lean (which is reasonable, IMO), so they can't really try for partial credit so much as choose not to report a higher self-graded number.
        • utopcell 44 days ago
          From the article: "Our solutions were scored according to the IMO’s point-awarding rules by prominent mathematicians Prof Sir Timothy Gowers, an IMO gold medalist and Fields Medal winner, and Dr Joseph Myers, a two-time IMO gold medalist and Chair of the IMO 2024 Problem Selection Committee."
          • dooglius 44 days ago
            Fair enough. Although I think the question is whether P3/P5 were given zero points vs not evaluated vs evaluated but not published. I don't think it is surprising that Lean-verified proofs get a 7.
            • utopcell 43 days ago
              It is an interesting question: how this new system quantifies progress, and whether it can detect that a chain of reasoning is close to the solution. Although it was just 2 points shy of a gold medal (1 point of partial credit per unsolved problem?), I'd suspect that the team would want a clean victory, given the magnitude of the claim. I bet next year's results will be 42/42. DeepMind is unstoppable!
    • tardygrade 44 days ago
      One key difference between giving humans more time and giving computer programs more time is that historically we have had more success making the latter run faster than the former.
    • muglug 44 days ago
      > What's the need to taint the impressive result with apples-to-oranges comparison?

      Most of DeepMind’s research is a cost-centre for the company. These press releases help justify the continued investment both to investors and to the wider public.

      • utopcell 44 days ago
        > Most of DeepMind’s research is a cost-centre for the company.

        The effect of establishing oneself as the thought leader in a field is enormous.

        For example, IBM's stock went up 15% the month after they beat Kasparov.

      • mensetmanusman 43 days ago
        Cost centers are profit centers when R&D is successful.
    • acchow 44 days ago
      But computers get faster each year, so even with zero progress in actual AI, this will reach human-student speeds in a few years (a ~40x speed-up is needed: 60 hours down to the ~1.5 hours per problem that contestants average).
      • jimkoen 44 days ago
        Could you explain where the 40x speedup comes from, given that literally the biggest problem in semiconductors right now is shrinking the node size?
        • cma 44 days ago
          Just bigger chip area at a lower clock could do it too, if there is no threshold on cost anyway and they were limited by power. Google would hit enough accumulated production of AI chip area and cluster build-out even if Moore's law stopped. There will likely be big algorithmic improvements too.
        • lupire 43 days ago
          Improved parallelization. More cost in less time.

          This is Nvidia's main area of research now.

          • knotthebest 42 days ago
            Still, very marginal improvements. There is also Amdahl’s Law which limits the speed gains of parallelisation.
    • asah 43 days ago
      The complexity of truth doesn't fit in a headline, let alone attract enough clicks to win the competition for eyeballs and virality. Even if someone somehow succeeded in telling the whole truth in a headline, that article would lose the race for eyeballs to the clickbait-headline version, which means your eyeballs would (statistically) never see it, only the clickbait version.

      I'd say "welcome to the web" but this was true in 1800s newspapers as well.

    • xkcd1963 43 days ago
      Because that wouldn't be hype enough.
  • golol 44 days ago
    This is the real deal. AlphaGeometry solved a very limited set of problems with a lot of brute-force search. This is a much broader method that I believe will have a great impact on the way we do mathematics. They are really implementing a self-feeding pipeline from natural-language mathematics to formalized mathematics, where they can train both formalization and proving. In principle this pipeline can also learn basic theory building, like creating auxiliary definitions and lemmas. I really think this is the holy grail of proof assistance and will allow us to formalize most mathematics that we create very naturally. Humans will work post-rigorously and let the machine assist with filling in the details.
    • fmap 44 days ago
      Agreed, this is a big step forward. Geometry problems are in a different class, since you can translate them into systems of polynomial equations and use well known computer algebra algorithms to solve them.

      By contrast, this kind of open ended formalization is something where progress used to be extremely slow and incremental. I worked in an adjacent field 5 years ago and I cannot stress enough that these results are simply out of reach for traditional automated reasoning techniques.

      Real automatic theorem proving is also useful for a lot more than pure mathematics. For example, it's simple to write out an axiomatic semantics for a small programming language in Lean and pose a question of the form "show that there exists a program which satisfies this specification".

      If this approach scales it'll be far more important than any other ML application that has come out in the last few years.

      • cubefox 44 days ago
        > Agreed, this is a big step forward. Geometry problems are in a different class, since you can translate them into systems of polynomial equations and use well known computer algebra algorithms to solve them.

        The blog post indicates the opposite. The geometry problem in the IMO problem set was solved by AlphaGeometry 2, which is an LLM based on Google's Gemini. LLMs are considered relatively general systems. But the other three solved problems were proved by AlphaProof, which is a narrow RL system that is literally based on AlphaZero, the Go and Chess AI. Only its initial (bootstrapping) human training data (proofs) were formalized and augmented by an LLM (Gemini).

        • dash2 43 days ago
          AlphaZero is more general than a Go and Chess AI, right? Isn't it a general self-play algorithm?
          • sapiogram 43 days ago
            Only slightly more general. It only works for games that are zero-sum, deterministic, have no hidden information, and discrete game state and moves. Other examples include connect-4.
            • cubefox 43 days ago
              So finding Lean proofs can be conceptualized as a zero-sum game?

              Another basic requirement is that valid moves / inference steps and the winning condition can be efficiently verified using some non-AI algorithm. Otherwise there would not be a reward signal for the reinforcement learning algorithm. This is different from answering most natural language questions, where the answer can't be checked trivially.

              • fmap 43 days ago
                Theorem proving can be formulated as a game, see e.g., https://plato.stanford.edu/entries/logic-games/ and interactive theorem provers can verify that a proof is correct (and related sub problems, such as that a lemma application is valid).

                Conceptually, if you're trying to show a conjunction, then it's the other player's turn and they ask you for a proof of a particular case. If you're trying to show a disjunction then it's your turn and you're picking a case. "Forall" is a potentially infinite conjunction, "exists" is a potentially infinite disjunction.

                In classical logic this collapses somewhat, but the point is that this is still a search problem of the same kind. If you want to feel this for yourself, try out some proofs in lean or coq. :)
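
                  A toy version of that correspondence in Lean (standard constructs, nothing AlphaProof-specific):

                    -- Disjunction: the prover's move is to pick a branch...
                    example : 1 + 1 = 2 ∨ 1 + 1 = 3 := Or.inl rfl

                    -- ...conjunction: the "opponent" may demand either case,
                    -- so both must be won.
                    example : 1 + 1 = 2 ∧ 2 + 2 = 4 := ⟨rfl, rfl⟩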

              • sapiogram 43 days ago
                I don't think AlphaZero is related to this work, apart from both being NN-based. AlphaZero and its training pipeline fundamentally only works for "chess-like" two-player games, where the agent can play against itself and slowly improve through MCTS.
                • adroniser 43 days ago
                  "AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."
          • redman25 43 days ago
            I think that it can only play “games” for which it has perfect information about the state of the “game”.
        • lacker 44 days ago
          AlphaGeometry is not an LLM
          • cubefox 43 days ago
            It is an LLM combined with a symbolic deduction engine.
    • visarga 44 days ago
      > a lot of brute force search

      Don't dismiss search: it might be brute force, but it goes beyond human level in Go and reaches silver at the IMO. Search is also what powers evolution, which created us, also by a lot of brute forcing, and it is at the core of the scientific method: (re)search.

      • thomasahle 44 days ago
        Also AlphaProof had to search for 60 hours for one of the IMO problems it solved.
        • renonce 44 days ago
          It's going to be significantly faster very soon; we have seen how AlphaGo evolved into KataGo, which is orders of magnitude more compute-efficient.
          • thomasahle 43 days ago
            The main difficulty in scaling AlphaProof is finding theorems to train it with. AlphaGo didn't have that problem because it could generate its own data.
        • randcraw 44 days ago
          And I understand the upper time limit for each question was 4.5 hours. So it solved one question almost immediately, two well over the allotted time (60 hrs), and two not at all. No medal for you, Grasshopper.
          • snewman 44 days ago
            Contestants get 4.5 hours for each of the two days of competition. They have to solve three problems in that time, so on average you can spend 1.5 hours per problem (if you're aiming to finish all three).

            That said, the gap from "can't do it at all" to "can do it in 60 hours" is probably quite a bit larger than the gap from 60 hours to 1.5 hours.

            • moffkalast 43 days ago
              Timing something that can be run faster by throwing better hardware at it honestly feels conceptually irrelevant, as long as the complexity is actually tractable.
      • gowld 44 days ago
        What makes solving IMO problems hard is usually the limits of human memory, pattern-matching, and search, not creativity. After all, these are problems that are already solved, and it is expected that many people can solve the problems in about 1 hour's time.

        That makes it, in principle, similar to or even easier than a championship-level chess move, which often takes more than 1 hour for a professional human (with more training than an IMO high school student) to solve.

        Another interesting concern is that when posing a problem to humans, it's fine to pose an "easy" brute-forceable problem, because humans, being slow brute-searchers, need to find more clever solutions. But if you give such a problem to a computer, it can trivialize it. So to test a computer, you need to pose non-easily-brute-forceable problems, which are harder for the computer than the others, but equally difficult for the humans as the other problems are.

        • sebzim4500 44 days ago
          Ok but if you read the actual solutions they aren't a bizarre mess of brute force.

          They look like what a human would write if they were trying to come up with a formal proof (albeit it does some steps in a weird order).

          • knotthebest 42 days ago
            They are, though. I spoke to an author just yesterday. They did mostly use brute-force.
          • lupire 43 days ago
            The solutions aren't a bizarre mess of brute force. The search for the solutions is.
        • snewman 43 days ago
          Why do you say these are problems that are already solved? Sure, they're often variations on existing themes, but the same is true for chess positions and, honestly, almost everything else in any field of human endeavor.

          Agreed that the absolute upper tier of chess players have trained longer and harder than most or all IMO contestants. Though I do wonder which (top-tier chess or the IMO) draws on a larger talent pool. To my understanding, a significant fraction of all high school students on Earth take some form of qualifying exam which can channel them into an IMO training program.

          And as far as the being amenable to brute force (relative difficulty for humans vs. computers): it seems that chess was comparatively easier for computers, IMO problems are comparatively easier for humans, and the game of Go is somewhere in between.

          • okintheory 43 days ago
            These problems are literally already solved? Of course, the IMO problem designers make sure the problems have solutions before they use them. That's very different from math research, where it's not known in advance what the answer is, or even whether there is a good answer.
            • snewman 43 days ago
              I'm saying they weren't solved until the problem composer (created and) solved them. They're not, in general, problems for which solutions have been lying around. So "these are problems that are already solved" isn't introducing anything interesting or useful into the discussion. The post I was replying to was trying to draw a contrast with chess moves, presumably on the grounds that (after the opening) each position in a chess game is novel, but IMO problems are equally novel.

              It's true that IMO problems are vetted as being solvable, but that still doesn't really shed any information on how the difficulty of an IMO problem compares to the difficulty of chess play.

      • Eridrus 44 days ago
        Search is great, search works, but there was not a tonne to learn from the AlphaGeometry paper unless you were specifically interested in solving geometry problems.
      • kypro 44 days ago
        My old AI professor used to say that every problem is a search problem.

        The issue is that to find solutions for useful problems you're often searching through highly complex and often infinite solution spaces.

        • visarga 44 days ago
          For some problems validation is expensive, like the particle collider or space telescope, or testing the COVID vaccine. It's actually validation that is the bottleneck in search, not ideation.
        • shkkmo 44 days ago
          I would argue that no actually searchable solution space is really infinite (if only because infinite Turing machines can't exist). Finite solution spaces can get more than large enough to be intractable.
          • smokel 44 days ago
            What about ℕ? Seems pretty infinite to me, unless with "actually" you mean finite in time and space, which would make your argument a tautology. Or am I missing something?
            • lupire 43 days ago
              Almost every "number" "in" N doesn't actually exist. In the search for numbers that exist, we will most likely only ever find a finite set of numbers before the Universe or humanity dies.

              ("Scare quotes")

            • shkkmo 43 days ago
              Searches happen in finite time and space and, more importantly, systems performing those searches have practical finite limits on the parameters that determine the size of the space within which that search can take place (such as available memory).

              Even within fairly modest finite limits, you can produce a solution space that cannot be significantly searched with the available finite matter and time available in the observable universe.

              Thus, the problem with using search isn't that solution spaces can be infinite, but that finite solution spaces can be unimaginably large.

        • pishpash 44 days ago
          There's no problem with search. The goal is to search most efficiently.
          • deely3 44 days ago
            You mean that by improving search we can solve any problem? What if the solution field is infinite? Even if we make the search algo 10x100 more performant, the solution field will still be infinite, no?
            • pishpash 44 days ago
              Gradient descent is a search. Where does it say the search space has to be small?
      • Davidzheng 44 days ago
        Yes and there's a lot of search here too. That's a key to the approach
    • sideeffffect 43 days ago
    • geysersam 44 days ago
      I imagine a system like this to be vastly more useful outside the realm of mathematics research.

      You don't need to be able to prove very hard problems to do useful work. Proving just simple things is often enough. If I ask a language model to complete a task, organize some entries in a certain way, schedule this or that, or write code that accomplishes X, the result is typically not directly trustworthy. But if the system is able to translate parts of the problem to logic and find a solution, that might make the system much more reliable.

      • creata 44 days ago
        But for it to be 100% trustworthy, you'd have to express correctness criteria for those simple tasks as formal statements.
        • humansareok1 43 days ago
          There are a lot of automated proof checkers out there. Presumably you would just run any solution from an AI through those.
        • geysersam 43 days ago
          My intuition is that a regular LLM is better at coming up with a correct task description from a fuzzy description than it is at actually solving tasks.
        • lanstin 44 days ago
          And most applied maths doesn't seem to worry about proofs much. Practitioners have techniques that either work pretty well or blow up.
          • optimalsolver 43 days ago
            Bridge collapses are a form of proof validation.
    • EugeneOZ 44 days ago
      No. It's like you are allowed to use search engines to find a solution, nothing more than that.
      • humansareok1 43 days ago
        Through a search space often large enough to be completely intractable even with a galaxy-wide computer.
    • benreesman 43 days ago
      As resident strident AI skeptic, yeah, this is real.

      But MCTS was always promising when married to large NNs and DeepMind/Brain were always in front on it.

      I don’t know who fucked up on Gemini and it’s concerning for Alphabet shareholders that no one’s head is on a spike. In this context “too big to fail” is probably Pichai.

      But only very foolish people think that Google is lying down on this. It’s Dean and Hassabis. People should have some respect.

  • Ericson2314 44 days ago
    The lede is a bit buried: they're using Lean!

    This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.

    Hopefully more humans will write types in Lean and similar systems as a way of writing prompts.

    • Smaug123 44 days ago
      And while AlphaProof is clearly extremely impressive, it does give the computer an advantage that a human doesn't have in the IMO: nobody's going to be constructing Gröbner bases in their head, but `polyrith` is just eight characters away. I saw AlphaProof use `nlinarith`.
      • empath75 44 days ago
        Don't think of Lean as a tool that the AI is using (i.e., cheating); think of Lean plus AlphaProof as a single system. There's no reason to draw artificial boundaries around where the AI is and where the tools that the AI is using are. Lean itself is a traditional symbolic artificial intelligence system.

        People always want to knock generative AIs for not being able to reason, yet we've had automated systems that reason perfectly well for decades, and for some reason that doesn't count as AI to people.

      • Davidzheng 44 days ago
        Good. I want my AI to use all the advantages it has to reinvent the landscape of mathematics
        • YeGoblynQueenne 43 days ago
          Sure but note that's not "your AI". It's a closed-source, proprietary system by DeepMind who typically publish a result to reap the hype and then bury the system forever (see AlphaGo).
        • wslh 43 days ago
          > I want my AI to use all the advantages it has to reinvent the landscape of mathematics

          The interesting thing about math, or science and art in general, compared to games like chess or Go, is that science gives you the freedom to continue to excel as a human, while in games we have lost the game and/or the league.

          Science and art are infinite and no AI can produce infinity.

      • Ericson2314 44 days ago
        Hehe, well, we'll need to have a tool-assisted international math Olympiad then.
        • ComplexSystems 44 days ago
          This is the greatest idea ever. Why doesn't this exist?
          • panagathon 44 days ago
            This does sound like a lot of fun. Since this AI only reaches silver level, presumably such a contest could be quite competitive, with it not yet being clear cut whether a human or a tool-assisted human would come out on top, for any particular problem.
        • sebzim4500 44 days ago
          If the tools are the same as the ones AlphaProof gets (i.e. a Lean compiler) then no one would use them.
          • Ericson2314 43 days ago
            Well, hopefully that changes soon :)
      • xrisk 44 days ago
        Can you give some context on how using Lean benefits?

        In my understanding, proofs are usually harder to transcribe into Lean, which is why nobody _writes_ proofs using Lean.

        What is `nlinarith`?

      • sebzim4500 44 days ago
        The uses of `nlinarith` are very straightforward manipulations of inequalities; they would be one or two steps for a human too.
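        For illustration, here's a minimal sketch (assuming Lean 4 with Mathlib; not taken from AlphaProof's output) of the kind of inequality step a single `nlinarith` call can discharge:

          import Mathlib

          -- (x - y)^2 ≥ 0 rearranges to 2xy ≤ x^2 + y^2; given the squared
          -- term as a hint, nlinarith finds the linear combination itself.
          example (x y : ℝ) : 2*x*y ≤ x^2 + y^2 := by
            nlinarith [sq_nonneg (x - y)]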
      • gowld 43 days ago
        That's not necessarily an edge. https://www.ma.imperial.ac.uk/~buzzard/xena/formalising-math...

        has the example:

          0 ≤ x^2 if x : ℝ
        
        which humans simply use without proof. The IMO doesn't challenge participants to prove everything, only the main ideas.
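        Though to be fair, in Lean that particular fact is also just one library lemma away -- a quick sketch assuming Mathlib:

          import Mathlib

          -- cite the lemma directly...
          example (x : ℝ) : 0 ≤ x^2 := sq_nonneg x

          -- ...or let the `positivity` tactic find it
          example (x : ℝ) : 0 ≤ x^2 := by positivity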
    • queuebert 44 days ago
      That's amazing. I was just about to comment that hooking this up to Lean [1] would be killer. This must be the way forward for higher math, as proofs are getting so complicated that almost no one understands all pieces of major proofs.

      1. https://lean-lang.org/

    • Ericson2314 44 days ago
      They're def gonna go after the Riemann hypothesis with this, hehe.
      • nwoli 44 days ago
        Guessing the context here is that the RH was recently translated into Lean. Would be very cool if they threw their compute on that
        • Smaug123 44 days ago
          I think you might be thinking of the recent project to start Fermat's Last Theorem? The Riemann hypothesis has been easy to state (given what's in Mathlib) for years.
          • Davidzheng 44 days ago
            Yeah lol I don't think either is hard to formalize in Lean
            • raincole 44 days ago
              They're not just formalizing Fermat's Last Theorem's statement itself. They're formalizing the proof.
  • signa11 43 days ago
    > ... but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.

    yes, it is true, but getting to the country-specific team is itself an arduous journey, and involves brutal winnowing every step of the way, e.g. regional math olympiad, then national math olympiad, etc.

    this is then followed by further trainings specifically meant for this elite bunch, and maybe further eliminations etc.

    suffice it to say that qualifying to be in a country-specific team is imho a big deal. getting a gold/silver from amongst them is just plain awesome!

    • nb_quant 43 days ago
      Some countries pull these kids out of school for an entire year to focus on training for it, while guaranteeing them entry into their nation's top university.

      Source: a friend who got silver on the IMO

  • michael_nielsen 44 days ago
    A good brief overview here from Tim Gowers (a Fields Medallist, who participated in the effort), explaining and contextualizing some of the main caveats: https://x.com/wtgowers/status/1816509803407040909
  • fancyfredbot 44 days ago
    I'm seriously jealous of the people getting paid to work on this. Sounds great fun and must be incredibly satisfying to move the state of the art forward like that.
    • GuB-42 44 days ago
      I don't know about that. A lot of the work that should have been very satisfying turned out to be boring as hell, if not toxic, while at the same time, some apparently mundane stuff turned out to be really exciting.

      I found the work environment to be more important than the subject when it comes to work satisfaction. If you are working on a world changing subject with a team of assholes, you are going to have a bad time, some people really have a skill for sucking the fun out of everything, and office politics are everywhere, especially on world changing subjects.

      On the other hand, you can have a most boring subject, say pushing customer data to a database, and have the time of your life: friendly team, well designed architecture, time for experimentation and sharing of knowledge, etc... I have come to appreciate the beauty of a simple thing that just works. It is so rare, maybe even more rare than scientific breakthroughs.

      Now, you can also have an awesome work environment and an awesome subject, it is like hitting the jackpot... and a good reason to be envious.

      • phillypham 44 days ago
        Awesome work environment for one person can be not ideal for another.

        Pretty much all the top AI labs are both intensely competitive and collaborative. They consist of many former IMO and IOI medalists. They don't believe in remote work, either. Even if you work at Google DeepMind, you really need to be in London for this project.

        • optimalsolver 43 days ago
          The open-source software projects these companies critically depend on are developed by collaborators who have never met in person, and yet these companies still believe you can only do great work in the office.
    • lonesword 43 days ago
      I work in this space (pretraining LLMs). It looks fancier than it really is. It does involve wrangling huge YAMLs and writing regular expressions at scale (ok, I am oversimplifying a bit). I should be excited (and grateful) that I get to work on these things but shoddy tooling takes the joy out of work.
    • onemoresoop 44 days ago
      You probably mean envious not jealous.
      • yalok 44 days ago
        I'm learning something new today. In some other languages these two are usually the same word.
        • Vinnl 44 days ago
          Huh. So I tried to look it up just now and I'm not sure if I understand the difference. (To the extent that there is one - apparently one can mean the other, but I imagine they're usually used as follows.)

          It looks like "jealous" is more being afraid of losing something you have (most commonly e.g. a spouse's affection) to someone else, whereas "envious" is wanting what someone else has?

        • mr_toad 43 days ago
          Most English speakers use them interchangeably. It’s usually only Simpsons fans who know the difference.

          https://www.youtube.com/watch?v=HG8Yn9FX40c

        • onemoresoop 44 days ago
          No worries. I learned this too a while ago (was also using jealous instead of envious and vice versa myself). From my understanding the use of jealous is when you have something but that is threatened by some external factor, eg a partner, a friend having more fun with somebody else. Envious is when you covet something that you do not have currently but wish to, which is in this case playing with exciting tech.
    • Mithriil 44 days ago
      Best we can do then is keep ourselves up to date and give our support!
    • bearjaws 44 days ago
      C'mon, you're meant to be re-configuring 3,292,329 lines of YAML for K8s.

      (/s)

      • psbp 44 days ago
        It's funny that if I could describe my entire career, it would probably be something similar to software janitor/maintenance worker.

        I guess I should have pursued a PhD when I was younger.

        • geodel 43 days ago
          In another universe, this comment would be "With low pay and few academic jobs going for PhD was the worst decision of my life"
  • mik09 41 days ago
    how long before it solves the last two problems?
  • cynicalpeace 44 days ago
    Machines have been better than humans at chess for decades.

    Yet no one cares. Everyone's busy watching Magnus Carlsen.

    We are human. This means we care about what other humans do. We only care about machines insofar as it serves us.

    This principle is broadly extensible to work and art. Humans will always have a place in these realms as long as humans are around.

    • ertgbnm 44 days ago
      I'm sure humans will always enjoy chess and art regardless of how much better AI is at it. In the same way, there will probably always be mathematics hobbyists who study math for fun. But I seriously doubt that in the near future there will be mathematicians publishing new advancements that aren't mostly or entirely discovered by AI. A human might get credit for a proof for asking the initial question, but there is pretty much no world where a computer can easily solve a meaningful mathematical problem but we insist on a human solving it more slowly and expensively instead.
      • cynicalpeace 44 days ago
        My point was where does the concept of "meaningful" come from?

        The proof will only have value if it's meaningful to us.

    • camjw 44 days ago
      Sure but if an AI can prove e.g the Goldbach conjecture then that is a bfd.
      • hyperbovine 44 days ago
        What if the proof were incomprehensible to humans?
        • fanatic2pope 44 days ago
          If it cannot explain how it was proven, was it actually proven?
          • cynicalpeace 44 days ago
            No. Funny how these discussions too often devolve into semantics lol.
            • auggierose 44 days ago
              Funny how people don't understand basic logic. If it is a proof in a logic, and the machine checked that proof, it is a proof, no matter that no human actually understands it.

              A human doesn't need to understand the proof, they just have to understand why the proof is a proof.

              • cweld510 43 days ago
                The useful thing about proofs is that they are written in English (or another language), not formal logic. In general they can be mapped to formal logic, though. This means that people can digest them on an intuitive level. The actual goal of a proof is to create new knowledge (via the proof) which can be distributed across the mathematical community. If proofs exist but are not easily comprehensible, then they don’t accomplish this goal.
              • moffkalast 43 days ago
                Well... assuming a human made no mistakes setting up that logic.
                • auggierose 43 days ago
                  Of course. That falls under "understanding why the proof is a proof".
                  • throwaway290 43 days ago
                    Now we only need to find that human that never makes mistakes and we're golden...
                    • auggierose 43 days ago
                      Luckily, that is not necessary. You can make many mistakes, until you arrive at a logic you are happy with. Then you talk with other humans about it, and eventually you will all agree, that to the best of your knowledge, there is no mistake in the logic. If you pick first-order logic, that has already been done for you.

                      Then you need to implement that logic in software, and again, you can and will make mistakes here. You will use the first version of that software, or another logic software, to verify that your informal argument for why your logic implementation is correct can be formalised and checked. You will find mistakes, and fix them, and check that your correctness proof still goes through. It is very unlikely that it won't, but if it doesn't, you fix your correctness proof. If you can indeed fix it, you are done, no mistakes remain. If you cannot, something must be wrong with your implementation, so rinse and repeat.

                      At the end of this, you have a logic, and a logic implementation, which doesn't contain any mistakes. Guaranteed.

              • cynicalpeace 43 days ago
                We just have different definitions of what a proof is. Hence, semantics.
                • auggierose 42 days ago
                  I would rather say that I actually have a definition of what a proof is, and you don't. But feel free to prove me wrong (pun intended), and tell me yours.
        • humansareok1 43 days ago
          If we can formally verify the proof then it doesn't matter. Often the implications for other problems are substantial just from knowing the proof exists.
        • camjw 44 days ago
          I think that is unlikely to be the case - the classic example of a proof that humans "can't understand" is the Four Colour Theorem, but that's because the proof is a reduction to like 100000 special cases which are checked by computer.

          To what extent is the proof of Fermat's Last Theorem "incomprehensible to humans" because only like a dozen people on the planet could truly understand it - I don't know.

          The point of new proofs is really to learn new things about mathematics, and I'm sure we would learn something from a proof of Goldbach's conjecture.

          Finally, if it's not peer reviewed then it's not a real proof, eh.

          • hyperbovine 43 days ago
            I think the 4-color theorem is rather different though. It reduced to a large number of cases that can in principle each be verified by a human, if they were so inclined (indeed a few intrepid mathematicians have done so over the years, at least partially.) The point of using a computer was to reduce drudgery, not to prove highly non-obvious things.

            Thinking back to Wiles' proof of FLT, it took the community several years of intense work just to verify/converge on the correct result. And that proof is ~130 pages.

            So, what if the computer produced a provably correct, 4000-page proof of the Goldbach conjecture?

        • steego 44 days ago
          It shouldn’t count. We need to require it be able to ELI5 the proof to Goldbach’s conjecture to an actual class of graduating kindergartners.
      • cynicalpeace 44 days ago
        How is that a counterpoint?
    • christianqchung 44 days ago
      > This principle is broadly extensible to work and art

      Nah, as a consumer it makes no difference to me if a meat packing factory or Amazon warehouse employs 5000 or 5 people. To art, this principle is totally real, but for work, it only applies to some/most of it.

      • cynicalpeace 44 days ago
        Read the next sentence: "We only care about machines insofar as it serves us."

        Imagine a machine doing "work" that only serves itself and other machines that does no service to humanity. It would have no economic value. In fact the whole concept of "work" only makes sense if it is assigned economic value by humans.

        • christianqchung 44 days ago
          Then we agree, but your chess example made it sound like if a machine could automatically pack meat with little human intervention, people wouldn't want it. That's also not the next sentence. How is it broadly applicable to work?
    • zone411 44 days ago
      I don't think this principle extends to math proofs. It's much, much easier to verify a proof than to create it, and a second proof will just be a footnote. Not many mathematicians will want to work on that. That said, there is a lot of distance between IMO and the frontiers of research math.
      • cynicalpeace 44 days ago
        Why does it not extend to math proofs?
    • auggierose 44 days ago
      There are people that believe that mathematics is actually useful, in ways that chess or art are not. I know, most mathematicians don't think so. But let us just entertain this crazy thought for a moment. Then a proof is just a tool that tells us, oh, we have applied this piece of mathematics right. No understanding of the proof is actually required for that, and no one cares if some mathematician somewhere actually fully understands the proof. It will be OK, even expected, that the machine is better than us at finding and checking proofs.
    • smokel 44 days ago
      > Everyone's busy watching Magnus Carlsen.

      Actually, I was looking up Elo ratings of the top computer chess players, and learned that it is not that trivial to compare these, due to differences in hardware requirements and whatnot.

      • cynicalpeace 44 days ago
        Are you arguing computer chess players are as popular as human chess players?
        • smokel 43 days ago
          Not at all. I just tripped over your black-and-white presentation of things, and thought it might be helpful to provide some counterbalance.
    • awahab92 44 days ago
      Magnus Carlsen basically quit because computers ruined chess. As did Kasparov.

      Fischer was probably the last great player who was unassisted by tools.

      • karmakurtisaani 44 days ago
        Carlsen still plays at the highest level. He just didn't want to do the world championship anymore; it wasn't worth the effort after winning it so many times.
      • hyperbovine 44 days ago
        ?? Carlsen is very much active -- look up the YouTube channel EpicChess for an extremely entertaining recap of what he's up to recently.
    • bongodongobob 43 days ago
      Eh, people definitely care. AI has completely changed chess. Non viable lines have been proven viable and vice versa. All the pros study and develop new lines with AI.
  • thrance 44 days ago
    Theorem proving is a single-player game with an insanely big search space; I always thought it would be solved long before AGI.

    IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took the daunting task of formalizing the entirety of mathematics to themselves.

    This lack of formalizing in math papers was what killed any attempt at automation, because AI researchers had to wrestle with the human element of figuring out the author's own notations, implicit knowledge, skipped proof steps...

    • camjw 44 days ago
      > Theorem proving is a single-player game with an insanely big search space; I always thought it would be solved long before AGI.

      This seems so weird to me - AGI is undefined as a term imo, but why would you expect "producing something generally intelligent" (i.e. median human level intelligence) to be significantly harder than "this thing is better than Terence Tao at maths"?

      • thrance 44 days ago
        My intuition tells me we humans are generally very bad at math. Proving a theorem, in an ideal way, mostly involves going from point A to point B in the space of all proofs, using previous results as stepping stones. This isn't particularly a "hard" problem for computers which are able to navigate search spaces for various games much more efficiently than us (chess, go...).

        On the other hand, navigating the real world mostly consists in employing a ton of heuristics we are still kind of clueless about.

        At the end of the day, we won't know before we get there, but I think my reasons are compelling enough to think what I think.

        • tim-kt 44 days ago
          I don't think that computers have an advantage because they can navigate search spaces efficiently. The search space for difficult theorems is gigantic. Proving them often relies on a combination of experience with rigorous mathematics and very good intuition [1] as well as many, many steps. One example is the classification of all finite simple groups [2], which took about 200 years and a lot of small steps. I guess maybe brute forcing for 200 years with the technology available today might work. But I'm sceptical and sort of hope that I won't be out of a job in 10 years. I'm certainly curious about the current development.

          [1] https://terrytao.wordpress.com/career-advice/theres-more-to-...

          [2] https://en.m.wikipedia.org/wiki/Classification_of_finite_sim...

          • thrance 44 days ago
            Oh for sure, when I say "soon" it's only relative to AGI.

            What I meant to convey, is that theorem proving at least is a well-defined problem, and computers have had some successes in similar-ish search problems before.

            Also I don't think pure brute-force was ever used to solve any kind of interesting problem.

            Chess engines make use of alpha-beta pruning plus some empirical heuristics people came up with over the years. Go engines use Monte-Carlo Tree Search with deep learning models for node evaluation. Theorem proving, when it is solved, will certainly use some kind of neural network.

        • Davidzheng 44 days ago
          I also think humans are bad at math. And that we are probably better IRL, but maybe IRL has more data anyway
      • pedrosorio 43 days ago
        Replace "this thing is better than Terrence Tao at maths" with "this thing is better than Garry Kasparov at chess" and your statement would sound equally reasonable in the early 90s.
      • HarHarVeryFunny 43 days ago
        Because being good/competent at EVERYTHING is harder than being great at any one thing for a computer, since generality requires fluid intelligence but a single specialist skill can be achieved through brute force search, as we've seen over and over again from DeepBlue 30 years ago to the present time.

        In the meantime, while DeepBlue beat the world chess champion, Kasparov, at chess, our best efforts at generalism - LLMs that many (not me!) think are the path to AGI - struggle to play tic tac toe.

      • raincole 44 days ago
        Because "Generally Intelligent" is a very broad and vague term.

        "Better than Terrence Tao at solving certain formalized problems" (not necessarily equal to "Better that Terrence Tao at maths) isn't.

        • RealityVoid 43 days ago
          Not to join the hate train, but it probably isn't better than Terence Tao at solving the problems either, since Tao would probably have been able to solve the problems himself and would probably have taken less time. There were participants in this contest who were better than the AI.

          Still, an impressive result for sure.

        • camjw 44 days ago
          Seems fair - I strongly think that "certain formalized problems" is very very far away from doing actual maths! And that actual maths is sometimes much less about solving known conjectures and much more about developing new theories.
    • Davidzheng 44 days ago
      They didn't formalize the entirety of math. Good thing the IMO doesn't need the entirety. But they didn't even formalize enough for the IMO -- this is probably why the combinatorics problems weren't solved
  • nopinsight 43 days ago
    Once Gemini, the LLM, integrates with AlphaProof and AlphaGeometry 2, it might be able to reliably perform logical reasoning. If that's the case, software development might be revolutionized.

    "... We'll be bringing all the goodness of AlphaProof and AlphaGeometry 2 to our mainstream #Gemini models very soon. Watch this space!" -- Demis Hassabis, CEO of Google DeepMind. https://x.com/demishassabis/status/1816499055880437909

  • Jun8 44 days ago
    Tangentially: I found it fascinating to follow along the solution to Problem 6: https://youtu.be/7h3gJfWnDoc (aquaesulian is a nod to the ancient name of Bath). There's no advanced math and each step is quite simple, I'd guess on a medium 8th grader level.

    Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.

    I don't know what Lean is or how to see the AI's proofs, but an AI system that can explain such a question on par with the YouTuber above would be fantastic!

  • adverbly 44 days ago
    > First, the problems were manually translated into formal mathematical language for our systems to understand. In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.

    Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.

    • ZenMikey 44 days ago
      I haven't read TFA as I'm at work, but I would be very interested to know what the system was doing in those three days. Were there failed branches it explored? Was it just fumbling its way around until it guessed correctly? What did the feedback loop look like?
      • qsort 44 days ago
        I can't find a link to an actual paper, that just seems to be a blog post. But from what I gather the problems were manually translated to Lean 4, and then the program is doing some kind of tree search. I'm assuming they are leveraging the proof checker to provide feedback to the model.
      • visarga 44 days ago
        > just fumbling its way around until it guessed correctly

        As opposed to 0.999999% of the human population who can't do it even if their life depends on it?

        • dsign 44 days ago
          I was going to come here to say that. I remember being a teenager and giving up in frustration at IMO problems. And I was competing at IPhO.
          • kevinventullo 44 days ago
            Yeah, as a former research mathematician, I think “fumbling around blindly” is not an entirely unfair description of the research process.

            I believe even Wiles in a documentary described his search for the proof of Fermat’s last theorem as groping around in a pitch black room, but once the proof was discovered it was like someone turned the lights on.

        • logicchains 44 days ago
          I guess you mean 99.9999%?
      • thomasahle 44 days ago
        They just write "it's like AlphaZero". So presumably they used a version of MCTS where each terminal node is scored by Lean as either correct or incorrect.

        Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
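        As a rough, hypothetical illustration of that loop -- random stubs stand in for the two networks and for Lean, and none of this is DeepMind's actual code:

          import math, random

          # Toy stand-ins: in the real system these would be neural networks
          # and the Lean proof checker; here they are stubs so the loop runs.
          def policy(state):   # propose next steps with prior probabilities
              return [(state + (t,), 1/3) for t in ("tac_a", "tac_b", "tac_c")]

          def value(state):    # score an unfinished proof state in [0, 1]
              return random.random()

          def lean_check(state):
              if len(state) < 4:
                  return None              # proof attempt not finished yet
              return state[-1] == "tac_a"  # binary verdict on a finished attempt

          class Node:
              def __init__(self, state, prior):
                  self.state, self.prior = state, prior
                  self.children, self.visits, self.total = [], 0, 0.0

              def ucb(self, parent_visits, c=1.4):
                  q = self.total / self.visits if self.visits else 0.0
                  return q + c * self.prior * math.sqrt(parent_visits) / (1 + self.visits)

          def search(root, iterations=1000):
              for _ in range(iterations):
                  node, path = root, [root]
                  while node.children:         # 1. select by UCB down to a leaf
                      pv = node.visits
                      node = max(node.children, key=lambda n: n.ucb(pv))
                      path.append(node)
                  verdict = lean_check(node.state)
                  if verdict is None:          # 2. expand; score with the value net
                      node.children = [Node(s, p) for s, p in policy(node.state)]
                      reward = value(node.state)
                  else:                        # terminal: the checker's verdict
                      reward = 1.0 if verdict else 0.0
                  for n in path:               # 3. backpropagate
                      n.visits += 1
                      n.total += reward

          search(Node((), prior=1.0))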

      • tsoj 44 days ago
        This is NOT the paper, but probably a very similar solution: https://arxiv.org/abs/2009.03393
      • lacker 44 days ago
        The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.

        So they had three days to keep training the model, on synthetic variations of each IMO problem.

      • utopcell 44 days ago
        I'm at work and reading this article is the first thing I did this morning. What's your point?
    • 10100110 44 days ago
      Don't confuse interpolation with extrapolation. Curing cancer will require new ideas. IMO requires skill proficiency in tasks where the methods of solving are known.
      • trotro 44 days ago
        The methods are known, but the solutions to the IMO problems weren't. So the AI did extrapolate a solution.

        Also, there's no reason to affirm that an eventual cure for cancer requires fundamentally new methods. Maybe the current methods are sufficient, it's just that nobody has been "smart" enough to put the pieces together. (disclaimer: not an expert at all)

        • markusde 44 days ago
          Unlike curing cancer, the IMO problems were specifically designed to be solvable
        • dsign 44 days ago
          I think you are correct though. We don't need new physics to cure cancer. But we may need information-handling, reasoning and simulation systems which are orders of magnitude bigger and more complex than anything we have this year. We also need to stop pussy-footing and diddling with ideologies and start working on the root cause of cancer and almost every other disease, which is aging.
      • golol 44 days ago
        Mathematicians spend most of their time interpolating between known ideas and it would be extremely helpful to have computer assistance with that.
      • im3w1l 44 days ago
        I think this is kinda false actually on the cancer side. We have reached a point where we have known approaches that work. It's "just" a matter of putting them into practice, which will of course require solving many little details, which is very important and time-consuming work, but it doesn't require super-human genius-level lateral thinking, just a few million man-years of grinding away at it.
      • visarga 44 days ago
        Search is extrapolation. Learning is interpolation. Search+Learn is the formula used by AZ. Don't forget AZ taught us humans a thing or two about a game we had a 2000-year head start in, starting from scratch, not from human supervision.
        • xdavidliu 44 days ago
          no, search is not extrapolation. Extrapolation means taking some data and projecting out beyond the limits of that data. For example, if my bank account had $10 today and $20 tomorrow, then I can extrapolate and say it might have $30 the day after tomorrow. Interpolation means taking some data and inferring the gaps of that data. For example, if I had $10 today and $30 the day after tomorrow, I can interpolate and say I probably had $20 tomorrow.

          Search is different from either of those things, it's when you have a target and a collection of other things, and are trying to find the target in that collection.

          • visarga 44 days ago
            Search can go from a random init model to beating humans at Go. That is not interpolation.

            - Search allows exploration of the game tree, potentially finding novel strategies.

            - Learning compresses the insights gained from search into a more efficient policy.

            - This compressed policy then guides future searches more effectively.

            Evolution is also a form of search, and it is open-ended. AlphaProof solved IMO problems, which are chosen to be out of distribution; simple imitation can't solve them. Scientists do (re)search, finding novel insights nobody else discovered before. What I want to say is that search is on a whole different level than what neural nets do: they can only interpolate their training data, while search pushes outside the known data distribution.

            It's actually the combo of search+learning that is necessary; learning is just the little brother of search, compressing novel insights into the model. You can think of training a neural net as search too - for the best parameters that fit the training set.

      • xdavidliu 44 days ago
        new doesn't necessarily mean "an extremal point that's not the average of two existing points". The set of existing knowledge is not necessarily continuous; the midpoint between two known points may be unknown, and thus would be a "new" point that could be obtained by interpolation.
      • trueismywork 44 days ago
        They are the same things
    • nnarek 44 days ago
      "three days" does not say anything about how much computational power is used to solve problems, maybe they have used 10% of all GCP :)
      • falcor84 44 days ago
        And say they did use 10% of all GCP? Would it be less impressive? This is a result that was considered by experts to be far beyond the state of the art; it's absolutely ok if it's not very efficient yet.

        Also, for what it's worth, I'm pretty sure that I wouldn't have been able to solve it myself in three days, even if I had access to all of GCP, Azure and AWS (except if I could mine crypto to then pay actual IMO-level mathematicians to solve it for me).

        • data_maan 44 days ago
          Which experts said that?

          I don't think that's the case at all. The writing was already on the wall.

          • SonOfLilit 43 days ago
            The writing was on the wall for the last year and a half (in fact I lost a bet to an IMO medalist about AI getting IMO gold by 8/2023) but three years ago this was unimaginable.
        • nnarek 44 days ago
          yes it is very impressive, especially autoformalization of problems written in natural language and also proof search of theorems
      • vlovich123 44 days ago
        The thing is though, once we have a benchmark that we pass, it's pretty typical to be able to bring down the time required in short order through performance improvements and iterating on ideas. So if you knew you had AGI but it took 100% of all GCP for 3 years to give a result, within the next 5 years that would come down significantly (not least because you'd build HW dedicated to accelerating the slow parts).
        • tsimionescu 44 days ago
          That's patently false for many classes of problems. We know exactly how to solve the traveling salesman problem, and have for decades, but we're nowhere close to solving a random 1000 city case (note: there are approximate methods that can find good, but not optimal, results on millions of cities). Edit: I should say 1,000,000 city problem, as there are some solutions for 30-60k cities from the 2000s.

          And there are good reasons to believe that theorem finding and proof generation are at least NP-hard problems.
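            (For concreteness, the textbook exact algorithm here, Held-Karp, is an O(n^2 * 2^n) dynamic program -- sketch below -- and that table is what blows up long before 1000 cities.)

              from itertools import combinations

              # Held-Karp exact TSP: O(n^2 * 2^n) time, O(n * 2^n) memory.
              # Fine for n ~ 20; at n = 1000 the table is astronomically large.
              def held_karp(dist):
                  n = len(dist)
                  # dp[(S, j)] = cheapest path from city 0 through set S, ending at j
                  dp = {(1 << j, j): dist[0][j] for j in range(1, n)}
                  for size in range(2, n):
                      for subset in combinations(range(1, n), size):
                          bits = sum(1 << j for j in subset)
                          for j in subset:
                              prev = bits & ~(1 << j)
                              dp[(bits, j)] = min(dp[(prev, k)] + dist[k][j]
                                                  for k in subset if k != j)
                  full = sum(1 << j for j in range(1, n))
                  return min(dp[(full, j)] + dist[j][0] for j in range(1, n))

              # e.g. held_karp([[0, 2, 9], [2, 0, 6], [9, 6, 0]]) == 17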

          • vlovich123 44 days ago
            We're not talking about mathematical optimality here, either for the solution found or for the time taken. The point is whether this finds results more cheaply than a human can, and right now it's better on some problems while on others it's worse. Clearly if a human can do it, there is a way to solve it more cheaply, and it would be flawed reasoning to think the current runtime is already asymptotically optimal.

            While I agree that not all problems show this kind of acceleration in performance, that's typically only true if you've already spent so much time trying to solve it that you've asymptoted to the optimal solution. Right now we're nowhere near the asymptote for AI improvements. Additionally, there are so many research dollars flowing into AI precisely because the potential upside here is nowhere near realized and there are lots of research lines still left to be explored. Geoffrey Hinton ended the AI winter.

            • visarga 44 days ago
              > The point is whether this finds results more cheaply than a human can

              If you need to solve 1000 problems in 3 days you wouldn't find the humans that can do it. So it would not be cheaper if it's not possible.

              • vlovich123 44 days ago
                Well if it takes 10% of all of Google’s servers 3 days to solve, you may find it difficult to scale out to solving 1000 problems in 3 days as well.

                As for humans, 100 countries send 6 students to solve these problems. It also doesn't mean that these problems aren't solvable by anyone else. These are just the "best 6", where best = can solve and solve most quickly. Given a three-day budget, 1000 problems could reasonably be solvable and you know exactly who to tap to try to solve them. Also, while the IMO is difficult and winners tend to win other awards like Fields Medals, there are many professional mathematicians who never even bother because that type of competition isn't interesting to them. It's not unreasonable to expect that professional mathematicians are able to solve these problems as well if they wanted to spend 3 days on it.

                But in terms of energy per solve, humans are definitely cheaper. As you note, the harder part is scaling it out, but so far the AI isn't solving problems that are impossible for humans; it just managed to perform the same task given enough time. That's a very promising result, but supremacy is still a ways off (this AI can't win the competition for now).

          • rowanG077 44 days ago
            The person said typical, not always. Just because there are obviously cases where it didn't happen doesn't mean it's not typically the case.
    • wongarsu 44 days ago
      The problem solved "within minutes" is also interesting. I'd interpret that as somewhere between 2 and 59 minutes. Given the vagueness probably on the higher end, otherwise they'd celebrate it more. The students had 6 tasks in 9 hours, so on average 1.5h per task. If you add the time a student would take to (correctly!) translate the problems to their input format, their best-case runtime is probably about as fast as a silver-medalist would take to solve the problem on their own.

      But even if they aren't as fast as humans yet this is very valuable. Both as a stepping stone, and because at a certain scale compute is much easier to scale than skilled mathematicians.

      • gjm11 44 days ago
        They say "our systems" (presumably meaning AlphaProof and AlphaGeometry 2) solved one problem "within minutes", and later on the page they say that the geometry question (#4) was solved by AlphaGeometry in 19 seconds.

        So either (1) "within minutes" was underselling the abilities of the system, or (2) what they actually meant was that the geometry problem was solved in 19 seconds, one of the others "within minutes" (I'd guess #1 which is definitely easier than the other two they solved), and the others in unspecified times of which the longer was ~3 days.

        I'd guess it's the first of those.

        (Euclidean geometry has been a kinda-solved domain for some time; it's not super-surprising that they were able to solve that problem quickly.)

        As for the long solve times, I would guess they're related to this fascinating remark:

        > The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.

        • visarga 44 days ago
          Euclidean geometry still requires constructions to solve, and those are based on intuition.
          • gjm11 44 days ago
            There are known algorithms that can solve _all_ problems in Euclidean (ruler-and-compass) geometry, no intuition required. The most effective algorithms of this type are quite inefficient, though, and (at least according to DeepMind) don't do as well as AlphaGeometry does at e.g. IMO geometry problems.
    • ComplexSystems 44 days ago
      Or the simultaneous discovery of thousands of cryptographic exploits...
      • poincaredisk 44 days ago
        Still waiting for the first one. I'm not holding my breath - just like fuzzing found a lot of vulnerabilities in low-level software, I expect novel automated analysis approaches will yield some vulnerabilities - but that won't be a catastrophic event just like fuzzing wasn't.
        • ComplexSystems 44 days ago
          Why don't you think that AI models will, perhaps rather soon, surpass human capabilities in finding security vulnerabilities? Because an AI that's even equally competent would be a fairly catastrophic event.
        • criddell 44 days ago
          It's rumored that the NSA has 600 mathematicians working for them. If they are the ones finding the exploits you will probably never hear about them until they are independently discovered by someone who can publish.
        • throwaway240403 44 days ago
          Hope that's true. Really mucks up the world a bit if not.
        • sqeaky 44 days ago
          I hope it doesn't find a new class of bug. Finding another thing like Spectre could be problematic.

          EDIT - I hope if that new class of bug exists that it is found. I hope that new class of bug doesn't exist.

    • lolinder 44 days ago
      It feels pretty disingenuous to claim silver-medal status when your machine played by significantly different rules. The article is light on details, but it says they wired it up to a theorem prover, presumably with feedback sent back to the AI model for re-evaluation.

      How many cycles of guess-and-check did it take over the course of three days to get the right answer?

      If the IMO contestants were allowed to use theorem provers and were given 3 days (even factoring in sleep) would AlphaProof still have gotten silver?

      > let's be real I'd be okay waiting a month for the cure to cancer.

      I don't think these results suggest that we're on the brink of knowledge coming at a substantially faster rate than before. Humans have been using theorem provers to advance our understanding for decades. Now an LLM has been wired up to one too, but it still took 8x as long to solve the problems as our best humans did without any computer assistance.

      • golol 44 days ago
        I believe you are misreading this.

        First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.

        Secondly, this is not some hacky trick where brute force and some theorem-prover magic are massaged to solve a select few problems and then you'll never hear about it again. They are building a general pipeline which turns informal natural language mathematics (of which we have ungodly amounts available) into formalized mathematics, and in addition trains a model to prove such kinds of mathematics. This can also work for theory building. This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique, etc.

        • lolinder 44 days ago
          > First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.

          If this were the case then the headline would be "AI solves 4/6 IMO 2024 problems", it wouldn't be claiming "silver-medal standard". Medals are generally awarded by comparison to other contestants, not to the challenges overcome.

          > This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.

          This is great, and I'm not complaining about what the team is working on, I'm complaining about how it's being sold. Headlines like these from lab press releases will feed the AI hype in counterproductive ways. The NYT literally has a headline right now: "Move Over Mathematicians, Here Comes AlphaProof".

          • golol 44 days ago
            At the IMO "silver medal" afaik is define as some tange of points, which more or less equals some range of problems solved. For me it is fair to say that "silver-medal performance" is IMO langauge for about 4/6 problems solved. And what's the problem if some clickbait websites totally spin the result? They would've done it anyways even with a different title, and I also don't see the harm. Let people be wrong.
            • mathnmusic 44 days ago
              No, "silver medal" is defined as a range of points to be earned in the allotted time (4.5 hours for both papers of 3 problems each).
              • lolinder 44 days ago
                And the cutoffs are chosen after the results are in, not in advance.
        • riku_iki 44 days ago
          > They are building a general pipeline which turns informal natural lamguage mathematics

          but this part currently sucks, because they didn't trust it and formalized the problems manually.

          • golol 44 days ago
            Yea that's fair, but I don't think it will keep sucking forever as formalization is in principle just a translation process.
            • riku_iki 44 days ago
              and we don't have 100% accuracy in translation of ambiguous texts, because the system often needs some domain knowledge, context, etc. And math has 0% tolerance for mistakes.

              I also expect that math formalized by machine will be readable by machine and hardly understandable by humans.

      • ToucanLoucan 44 days ago
        I am so exhausted of the AI hype nonsense. LLMs are not fucking curing cancer. Not now, not in five years, not in a hundred years. That's not what they do.

        LLM/ML is fascinating tech that has a lot of legitimate applications, but it is not fucking intelligent, artificial or otherwise, and I am sick to death of people treating it like it is.

        • apsec112 44 days ago
          What observation, if you saw it, do you think would falsify that hypothesis?
          • necovek 44 days ago
            It seems unlikely people will employ only ML models, especially LLMs, to achieve great results: they will combine them with human insights (through direction and concrete algorithms).

            It's obvious that's happening with LLMs even today to ensure they don't spew out too much bullshit or harmful content. So let's get to a point where we can trust AI as-is first, and let's talk about what's needed to achieve the next milestone after and if we get there.

            And I love asking every new iteration of ChatGPT/Gemini something along the lines of "What day was yesterday if yesterday was a Thursday?" It just makes me giggle.

            • borggambit 42 days ago
              Thanks for that "what day was yesterday" prompt. I have run across these situations before but never quite like that.

              What is great about that Thursday prompt is how naked the LLM is to the reality that it knows absolutely nothing in the way we think of "to know". The bubble we are in is just awesome to behold.

        • jercos 44 days ago
          A significant part of the problem is that a majority of people are unaware of just how simple what they consider "intelligence" really is. You don't need actual intelligence to replace the public-facing social role of a politician, or a talking head, or a reactive-only middle manager. You just need words strung together that fit a problem.
        • 29athrowaway 44 days ago
          It is not artificial? So it is natural, then?
      • regularfry 44 days ago
        I'm not sure it matters that it had access to a theorem prover. The fact that it's possible to build a black box that solves hard proofs on its own at all is the fascinating bit.

        > it still took 8x as long to solve the problems as our best humans did without any computer assistance.

        Give it a year and that ratio will be reversed. At least. But also it matters less how long it takes if doubling the number of things reasoning at a best-human level is pronounced "ctrl-c, ctrl-v".

  • SJC_Hacker 43 days ago
    The kicker with some of these math competition problems is that they reduce to finding all natural numbers for which some statement is true. These are almost always small numbers, less than 100 in most circumstances.

    Which means these problems are trivial to solve if you have a computer - you can simply check all possibilities. And is precisely the reason why calculators aren't allowed.

    But exhaustive searches are not feasible by hand in the time span the problems are supposed to be solved - roughly 30 minutes per problem. You are not supposed to use brute force, but recognize a key insight which simplifies the problem. And I believe even if you did do an exhaustive search, simply giving the answer is not enough for full points. You would have to give adequate justification.
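    As a toy illustration of what the calculator ban guards against (the predicate below is made up, not from any real competition problem):

      # Hypothetical "find all n such that P(n)" problem; with a computer,
      # conjecturing the answer set is a one-liner.
      def p(n):
          return n % 7 == 2 and str(n) == str(n)[::-1]  # made-up predicate

      print([n for n in range(1, 1000) if p(n)])
      # Read off the pattern and conjecture the answer, then write the
      # justification -- the part the contest actually awards points for.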

  • nitrobeast 43 days ago
    Reading into the details, the system is more impressive than the title. 100% of the algebra and geometry problems were solved. The remaining problems are of combinatorial type, which ironically more closely resembles software engineering work.
  • robinhouston 44 days ago
    Some more context is provided by Tim Gowers on Twitter [1].

    Since I think you need an account to read threads now, here's a transcript:

    Google DeepMind have produced a program that in a certain sense has achieved a silver-medal performance at this year's International Mathematical Olympiad.

    It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.

    However, that statement needs a bit of qualifying.

    The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.

    If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.

    Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.

    Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.

    As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.

    However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

    It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.

    So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.

    That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.

    Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.

    It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.

    But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.

    The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".

    However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.

    1. https://x.com/wtgowers/status/1816509803407040909?s=46

    • visarga 44 days ago
      > If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.

      Or if AlphaProof used more compute they could have slashed that time to 1/10 or less. It's arbitrary as long as we don't define what compute the AI should be entitled to use here.

  • petters 44 days ago
    The problems were first converted into a formal language. So they were only partly solved by the AI
    • golol 44 days ago
      Formalization is in principle just a translation process and should be a much simpler problem than the actual IMO problem. Besides, they also trained a Gemini model which formalizes natural language problems, and this is how they generated training data for AlphaProof. I would therefore expect that they could have also formalized the IMO problems with that model, and that they did it manually just because the point is not to demonstrate formalization but proof capabilities.
      • pishpash 44 days ago
        Yet the facts at hand are the opposite of what you say. A reliable formalizer was a more difficult problem than solving formalized IMO problems, because they have not produced one.
        • fngjdflmdflg 44 days ago
          That does not necessarily follow from the facts at hand. For example they may have prioritized work on the proof solver itself as they may feel that that is the more important result. Alternatively if their goal is to build a proof solver then building the formalizer would be useless if they could not build the actual proof solver.
          • pishpash 44 days ago
            A proof solver existed. They were improving the proof solver explicitly by making the formalizer a part of the training. Formalizer reliability is the key novelty. It turns out it was only reliable enough for training. So unless they made the problem statement at the outset that "we'll only make the formalizer strong enough to train but not use", I disagree with that assessment.
            • fngjdflmdflg 44 days ago
              That's a good point. The formalizer was used to created the training data for the proof solver, so they likely worked on it more than if they just used it as a preprocessing step during inference. It is still possible that they worked on the formalizer until they got good results from it enough to create good training data, and then began training as soon as possible and did not spend too much time trying to improve the formalizer. Depending on how long the training was expected to take, perhaps that is a reasonable assumption. Although I think I agree more with your view now.
      • riku_iki 44 days ago
        > Formalization is in principle just a translation process and should be a much simpler problem than the actual IMO problem

        maybe not, because you need to connect many complicated topics/terms/definitions together, and you don't have a way to reliably verify if the formalized statement is correct.

        They built an automatic formalization network in this case, but didn't trust it and formalized the problems manually.

      • petters 43 days ago
        If they could have solved it, they would have. But I agree that language models will be able to do it.
    • jeremyjh 44 days ago
      Yes, and it is difficult for me to believe that this translation doesn't involve useful human analysis and understanding that the AI would be helpless without. But that I suppose is a problem that could be tackled with a different model...
      • adrianN 44 days ago
        Even so, having a human formalize the problems and an AI to find machine checkable proofs could be very useful for mathematicians.
      • sebzim4500 44 days ago
        It is vastly easier to do the formalization than to actually solve the problem. Any undergraduate with some lean familiarity could do it in minutes.
        • Davidzheng 44 days ago
          Disagree! Some problems are much harder than others. If you don't believe me please go formalize P5 in this year imo.
          • SonOfLilit 43 days ago
            I formalized it last night, to a level that an IMO trainer agreed was adequate. Took maybe 15 minutes.

            Find n such that p(n) and not p(n-1).

              p(n):
                exists(f: state -> move) such that solves(f, n)
              
              state: solved | illegal | (k, is_first_move in {T,F}, px in (1,2023), py in (1,2024+1), mapping from x,y to {T,F, ?})
              
              initial_state(n): (n, T, 1, 1, {(x,y) -> ?})
              
              move: U|D|L|R|(x in (1,2023))
              
              power: ((a -> a), integer) -> (a->a)
              power(h, 0)(x) = h(x)
      power(h, k)(x) = h(power(h, k-1)(x))
              
              solves(f, n): exists l such that for every board, power(make_move(board, f), l)(initial_state(n)) = solved
              
              board: permutation of (1, 2, 3, ..., 2022+1)
              
              make_move: (board, (state -> move)) -> (state -> state)
              make_move(board, f)(solved): solved
              make_move(board, f)(illegal): illegal
              make_move(board, f)(s = (k, is_first_move, px, py, m))
                 if is_first_move = T:
                    if k = 0:
                        illegal
                    else if f(s) is a number:
                        (k, F, f(s), 1, m)
                    else:
                        illegal
                else:
                    if f(s) is a number:
                        illegal
                    else if py = 2025:
                        solved
                    else if board(px) = py and py != 1:
                        (k - 1, T, px, py, m + {((px, py), T)})
                    else:
                        dy = {U:-1,D:1,L:0,R:0}(f(s))
                        dx = {U:0,D:0,L:-1,R:1}(f(s))
                        px' = px+dx
                        py' = py+dy
                        if px' < 1 or px' > 2023 or py' < 1 or py' > 2025:
                            illegal
                        (k, F, px', py', m + {((px, py), F)})
            • Davidzheng 43 days ago
              I find it incredibly impressive you did this in 15 minutes! You should really help out in formalizing math (completely serious).

              Personally in the past I tried a few times to formalize some statements and sometimes I found that the mathlib libraries were pretty lacking in these more open-ended problems (I wanted to reason about lists and stuff). But it seems that I am just very bad at formalization lol.

              • SonOfLilit 42 days ago
                Formalization is mechanical work, let's leave it for computers to do :)
          • sebzim4500 44 days ago
            Yeah, I was just referring to the problems that it actually did.
    • clbrmbr 44 days ago
      IIUC, a Gemini-based system could translate the natural language questions into Lean, but in the blog post they don’t really commit to whether this was done just to generate training data or was used in the competition.
      • cygaril 43 days ago
        Formalizations for the competition were done by hand.
    • trotro 44 days ago
      But formalization is the easy part for humans. I'm sure every mathematician would be happy if the only thing required to prove a result was to formalize it in Lean and feed it to the AI to find the proof.
      • Davidzheng 44 days ago
        Not sure every mathematician would be happy to do this... it sounds much less pleasant than thinking. It's like saying mathematicians would rather be programmers lol. It's a significantly difficult problem which I believe should be left completely to AI. Human formalization should become dead.
    • rpois 44 days ago
      Does this formalization process include giving it the answer it should try to prove?
  • amarant 43 days ago
    This is quite cool! I've found logical reasoning to be one of the biggest weak points of LLMs, so it's nice to see that an alternative approach works better. I've tried to enlist GPT to help me play an Android game called 4=10, where you solve simple math problems, and GPT was hilariously terrible at it. It would both break the rules I described and make math mistakes, such as claiming 6*5-5+8=10.
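
    Ironically, the game itself falls to a dumb exhaustive search. A quick sketch (plain Python, a toy solver of my own, nothing from the article) that brute-forces digit orders, operator triples, and two bracket shapes:

      from itertools import permutations, product

      OPS = ['+', '-', '*', '/']

      def solve_4_equals_10(digits):
          # Exhaust digit orders, operator triples, and two bracket shapes
          # (not all five bracketings, but enough to show the idea).
          for a, b, c, d in permutations(digits):
              for p, q, r in product(OPS, repeat=3):
                  for expr in (f"(({a}{p}{b}){q}{c}){r}{d}",
                               f"({a}{p}{b}){q}({c}{r}{d})"):
                      try:
                          if abs(eval(expr) - 10) < 1e-9:
                              return expr
                      except ZeroDivisionError:
                          pass
          return None

      print(solve_4_equals_10([6, 5, 5, 8]))  # prints a solution, e.g. (6*5)/(8-5)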

    I wonder if this new model could be integrated with an LLM somehow? I get the feeling that combining those two powers would result in a fairly capable programmer.

    Also, perhaps an LLM could do the translation step that is currently manual?

  • _heimdall 43 days ago
    I'm still unclear whether the system used here is actually reasoning through the process of solving the problem, or brute-forcing solutions, with reasoning only coming in during the mathematical proof of each potential solution.

    Is it clear whether the algorithm is actually learning from why previously attempted solutions failed to prove out, or is it statistically generating potential answers similar to an LLM and then trying to apply reasoning to prove out the potential solution?

  • StefanBatory 44 days ago
    Wow, that's absolutely impressive to hear!

    Also it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done by AI. Perhaps people going into trades had a point.

    • visarga 44 days ago
      Everything that allows for cheap validation is going that way. Math, code, or things we can simulate precisely. LLM ideation + Validation is a powerful combination.
      • machiaweliczny 43 days ago
        This, I've said it many years ago.

        Math => Code => Simulation => Robots => GG

  • gallerdude 44 days ago
    Sometimes I wonder if in 100 years, it's going to be surprising to people that computers had a use before AI...
    • necovek 44 days ago
      AI is simply another form of what we've been doing since the dawn of computers: expressing real world problems in the form of computations.

      While there are certainly some huge jumps in compute power, theory of data transformation and availability of data to transform, it would surprise me if computers in a 100 years do not still rely on a combination of well-defined and well-understood algorithms and AI-inspired tools that do the same thing but on a much bigger scale.

      If not for any other reason, then because there are so many things where you can easily produce a great, always correct result simply by doing very precise, obvious and simple computation.

      We've had computers and digital devices for a long while now, yet we still rely heavily on mechanical contraptions. Sure, we improve them with computers (eg. think brushless motors), but I don't think anyone today is surprised that these same devices (hair dryers, lawn mowers, internal combustion engines...) were designed before computers.

    • onemoresoop 44 days ago
      If AI stays in computer form, though...
  • majikaja 44 days ago
    It would be nice if on the page they included detailed descriptions of the proofs it came up with, more information about the capabilities of the system and insights into the training process...

    If the data is synthetic and covers a limited class of problems I would imagine what it's doing mostly reduces to some basic search pattern heuristics which would be of more value to understand than just being told it can solve a few problems in three days.

    • cygaril 44 days ago
      • majikaja 44 days ago
        I found those, I just would have appreciated if the content of the mathematics wasn't sidelined to a separate download as if it's not important. I felt the explanation on the page was shallow, as if they just want people to accept it's a black box.

        All I've learnt from this is that they used an unstated amount of computational resources just to basically brute force what a human already is capable of doing in far less time.

        • Davidzheng 44 days ago
          Very few humans can do this, even after years of training. Please don't trivialize.
          • necovek 44 days ago
            Very few humans go after this type of the training. In my "math talent" school (most of the Serbian/Yugoslavian medal winners came from it), at most a dozen students "trained" for this over 4 high school generations (500 students).

            Problems are certainly not trivial, but humans are not really putting all their effort into it either, and the few that do train for it, on average medal 50% of the time and get a silver or better 25% of the time (by design) with much less time available to do the problems.

            • Davidzheng 44 days ago
              This is disingenuous. People who train are already self-selected people who are talented in math. And even among the people who train, not everyone gets to this level. Sadly, I speak from personal experience.
              • necovek 43 days ago
                This school is full of people talented at math — you can't get in if you don't pass a special math exam (looking at the list, out of Serbia's 16 gold medals, I can see 14 went to students of this school, and numerous silvers and bronzes too — Serbia has participated as an independent country since 2006, with a population of roughly 7M, if you want to compare it with other countries on the IMO medal table). So in general, out of this small pool (10 talented and motivated people out of 4 generations), Serbia could get a gold medal winner on average almost once every year. I am sure there were other equally talented mathematicians among the 490 students who did not train for the competition (and some have achieved more academic success later on).

                Most students were simply not interested. And certainly, not everybody is equally talented, but the motivation to achieve competition success is needed too — perhaps you had the latter but not enough of the former. I also believe competitive maths is entirely different from research maths (time pressure, even luck is involved for a good idea to come up quickly, etc). Since you said you were a potential bronze medal winner, it might not even be a talent issue but maybe you just had great competition and someone had the better luck in one or two tests to rank above you (better luck as in the right idea/approach came to them quicker, or the type of problem that appeared on the test suited them more). And if you are from a large country like USA, China or Russia (topping the medal table), it's going to be freaking hard to get into a team since you'll have so many worthy students (and the fact they are not always scoring only golds for their teams out of such large pools tells me that the performance is not deterministic).

                As a mathematician, I am sure you'd agree you'd want to run a lot more than a dozen tests to establish statistical significance for any ranking between two people at IMO-style competitive maths, especially if they are close in the first few. As an anecdote, many at my school participated in national-level maths and informatics competitions (they start at school level, go on to county/city level to nation level) — other than the few "trained" competitors staying at the top, the rest of the group mostly rotated in the other spots below them regardless of the level (school/county/nation). We've actually joked amongst ourselves about who had the better intuition "this time around" for a problem or two, while still beating the rest of the country handily (we've obviously had a better base level of education + decently high base talent), but not coming close to the "competitors".

                I, for instance, never enjoyed working through math problems and math competitions (after winning a couple of early age local ones): I've finished the equivalent of math + CS MSc while skipping classes by only learning theory (reading through axioms, theorems and proofs that seemed non-obvious) and using that to solve problems in exams. I've mostly enjoyed building things with the acquired knowledge (including my own proofs on the spot, but mostly programming), even though I understood that you build up speed with more practice (I was also lazy :)).

                So, let's not trivialize solving IMO-style problems, but let's not put them on a pedestal either. Out of a very small pool of people who train for it, many score higher than AI did here, and they don't predict future theoretical math performance either. Competition performance mostly predicts competition performance, but even that with large error bars.

          • majikaja 43 days ago
            To mathematicians the problems are basically easy (at least after a few weeks of extra training) and after having seen all the other AI advances lately I don't think it's surprising that with huge amounts of computing resources one can 'search' for a solution.
            • Davidzheng 43 days ago
              Sorry, that's wrong. I have a math PhD and I trained for Olympiads in high school. These problems are not easy for me at all. Maybe for top mathematicians who used to compete.
  • 0xd1r 43 days ago
    > As part of our IMO work, we also experimented with a natural language reasoning system, built upon Gemini and our latest research to enable advanced problem-solving skills. This system doesn’t require the problems to be translated into a formal language and could be combined with other AI systems. We also tested this approach on this year’s IMO problems and the results showed great promise.

    Wonder what "great promise" entails. Because it's hard to imagine Gemini and other transformer-based models solving these problems with reasonable accuracy, as there is no elimination of hallucination. At least in the generally available products.

    • azeirah 43 days ago
      I don't think that's what they mean.

      They explicitly stated that to achieve the current results, they had to manually translate the problem statements into formal mathematical statements:

      > First, the problems were manually translated into formal mathematical language for our systems to understand.

      How I understand what they're saying is that they used Gemini to translate the problem statement into formal mathematical language and let AlphaProof do its magic after that initial step.

  • dan_mctree 44 days ago
    I'm curious if we'll see a world where computers could solve math problems so easily, that we'll be overwhelmed by all the results and stop caring. The role of humans might change to asking the computer interesting questions that we care about.
    • klysm 44 days ago
      I'm not sure what stop caring really means - like stop caring about the result, or the implications?
    • Davidzheng 44 days ago
      I think mathematicians will still care
    • mr_toad 43 days ago
      The next step will be having an AI come up with the problems.
  • stonethrowaway 44 days ago
    It’s like bringing a rocket launcher to a fist fight but I’d like to use these math language models to find gaps in logic when people are making online arguments. It would be an excellent way to verify who has done their homework.
  • skywhopper 44 days ago
    Except it didn’t. The problem statements were hand-encoded into a formal language by human experts, and even then only one problem was actually solved within the time limit. So, claiming the work was “silver medal” quality is outright fraudulent.
    • noud 44 days ago
      I had exactly the same feeling when reading this blog. Sure, the techniques used to find the solutions are really interesting. But they claim more than they achieved. The problem statements are not available in Lean, and the time limit is 2 x 4.5 hours, not 3 days.

      The article claims they have another model that can work without formal languages, and that it looks very promising. But they don't mention how well that model performed. Would that model also perform at silver medal level?

      Also note that if the problems are provided in a formal language, you can always find the solution in a finite amount of time (provided the solution exists). You can brute-force over all possible solutions until you find one that proves the statement, as sketched below. This may take a very long time, but it will find the solutions eventually. You would always solve all the problems and win the IMO at gold medal level. AlphaProof seems to do something similar, but takes smarter decisions about which possible solutions to try and which ones to skip. What would be the reason they don't achieve gold?
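
      Concretely, the naive enumeration would look something like this (a toy sketch; check stands in for a hypothetical call to a proof checker such as Lean's kernel, and tactics is some finite vocabulary of proof steps):

        from itertools import count, product

        def enumerate_proofs(statement, tactics, check):
            # Dovetail over all finite tactic sequences, shortest first.
            # Complete in principle: if a machine-checkable proof exists,
            # it is eventually reached. Hopeless in practice: candidates
            # grow exponentially with proof length. Loops forever if no
            # proof exists, matching the caveat above.
            for length in count(1):
                for candidate in product(tactics, repeat=length):
                    if check(statement, list(candidate)):
                        return list(candidate)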

  • PaulHoule 44 days ago
    See https://en.wikipedia.org/wiki/Automated_Mathematician for an early system that seems similar in some way.
    • golol 44 days ago
      This Wikipedia page makes AM come across as a nonsense project whose outputs no one (besides the author) bothered to decipher.
  • seydor 43 days ago
    We need to up the ante: getting human-like performance on any task is not impressive in itself; what matters is superhuman, orders of magnitude above. These comparisons with humans, made to create impressive-sounding titles, disguise the fact that we are still in the stone age of intelligence.
  • nybsjytm 44 days ago
    To what extent is the training and structure of AlphaProof tailored specifically to IMO-type problems, which typically have short solutions using combinations of a small handful of specific techniques?

    (It's not my main point, but it's always worth remembering - even aside from any AI context - that many top mathematicians can't do IMO-type problems, and many top IMO medalists turn out to be unable to solve actual problems in research mathematics. IMO problems are generally regarded as somewhat niche.)

    • Davidzheng 44 days ago
      The last statement is largely correct (though regarding the IMO medalists who are unable to solve actual problems: most mathematicians can't solve most open problems). But I kind of disagree with the assessment of IMO problems--the search space is huge; if it were as you say, it would be easy to search.
      • nybsjytm 44 days ago
        No, I don't mean that the search space is small. I just mean that there are special techniques which are highly relevant for IMO-type problems. It'd be interesting to know how important that knowledge was for the design and training of AlphaProof.

        In other words, how does AlphaProof fare on mathematical problems which aren't in the IMO style? (As such exceptions comprise most mathematical problems)

        • Davidzheng 44 days ago
          Probably less well? They rely heavily on the dataset of existing problems
  • arnabgho 44 days ago
  • osti 44 days ago
    So they weren't able to solve the combinatorics problem. I'm not super well versed in competition math, but combinatorics always seem to be the most interesting problems to me.
    • sigbottle 43 days ago
      I mean, IMO algebra problems can require very clever insights as well, and number theory especially has some really nice proof arguments you can make. It's easier to make a bad problem of this category though because it's much easier to hide the difficulty in a bunch of computations / rote deduction, and not creative insights.

      Combinatorics problems are usually simple enough that anyone can understand and try tackling it though, and the solutions in IMO are usually designed to be elegant. I don't think I've ever seen a bad combo problem before.

      • osti 43 days ago
        Oh I'm sure the other topics all do have interesting problems, but I don't have the background necessary to even tackle them.

        Your second paragraph conveyed exactly how I feel about combinatorics. Elegant and clever, on top of being understandable to even non math people.

  • HL33tibCe7 44 days ago
    This is kind of an ideal use-case for AI, because we can say with absolute certainty whether their solution is correct, completely eliminating the problem of hallucination.
  • pnjunction 43 days ago
    Brilliant and so encouraging!

    >because of limitations in reasoning skills and training data

    One would assume that mathematical literature and training data would be abundant. Is there a simple example that could help appreciate the Gemini bridge layer mentioned in the blog which produces the input for RL in Lean?

  • lo_fye 44 days ago
    Remember when people thought computers would never be able to beat a human Grand Master at chess? Ohhh, pre-2000 life, how I miss thee.
    • utopcell 44 days ago
      not to be pedantic, but Deep Blue beat Kasparov in 1997.
  • zhiQ 44 days ago
    Coincidentally, I just posted about how well LLMs handle adding long strings of numbers: https://userfriendly.substack.com/p/discover-how-mistral-lar...
  • 11101010001100 43 days ago
    Can anyone comment on how different the AI generated proofs are when compared to those of humans? Recent chess engines have had some 'different' ideas.
  • 1024core 43 days ago
    > The system was allowed unlimited time; for some problems it took up to three days. The students were allotted only 4.5 hours per exam.

    I know speed is just a matter of engineering, but looks like we still have a ways to go. Hold the gong...

  • jerb 44 days ago
    Is the score of 28 comparable to the score of 29 here? https://www.kaggle.com/competitions/ai-mathematical-olympiad...
    • Davidzheng 44 days ago
      No. I would say it is more impressive than 50/50 there. (Source: I used to do math comps back in the day; sorry, it's not a great source.)
    • gus_massa 44 days ago
      IIUC the American Math Olympiad has 3 rounds. Winning the last one almost guarantees a gold medal.

      The link you posted has problems with a difficulty between the first and second rounds, which are much easier.

      I took a quick look at the recent list of problems in the first and second rounds. I expect this new AI to get a solid 50/50 points on this test.

  • brap 44 days ago
    Are all of these specialized models available for use? Like, does it have an API?

    I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them

    • creata 44 days ago
      > it’s hard to imagine why more than a handful of researchers would use them

      If you could automatically prove that your concurrency protocol is safe, or that your C program has no memory management mistakes, or that your algorithm always produces the same results as a simpler, more obviously correct but less optimized algorithm, I think that would be a huge benefit for many programmers.

  • imranhou 44 days ago
    If the system took 3 days to solve a problem, how different is this approach from a brute-force attempt at the problem with educated guesses? That's not reasoning in my mind.
    • sigbottle 43 days ago
      Because with AlphaGeometry it literally was just a feedback loop brute forcing over a known database of geometry axioms with an LLM to guide the guesses.

      Here, from what I understand, it's instead a theorem prover + LLM backing it. General proofs have a much larger search space than the 2d geometry problems you see on IMO; many former competitors disparage geometry for that reason.

    • JohnPrine 43 days ago
      it wouldn't surprise me if what we think of as intelligence is nothing more than brute force attempts at prediction with educated guesses
  • sssummer 43 days ago
    Why can frontier models achieve a silver medal at the Math Olympiad but also fail to answer "which number is bigger, 9.11 or 9.9"?
    • utopcell 43 days ago
      ...because not all systems are of the same quality.
  • quantum_state 43 days ago
    It’s as impressive as if not more than AI beating a chess master. But are we or should we be really impressed?
  • gowld 44 days ago
    Why is it so hard to make an AI that can translate an informally specified math problem (and Geometry isn't even so informal) into a formal representation?
  • quirino 44 days ago
    I honestly expected the IOI (International Olympiad in Informatics) to be "beaten" much earlier than the IMO. There's AlphaCode, of course, but as of the latest update I don't think it was quite at "silver medal" level. And available LLMs are probably not even at "honourable mention" level.

    I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).

    Given how much of a dent LLMs are already making on beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.

    [1] https://info.atcoder.jp/entry/llm-abc-rules-en

    • oXman038 44 days ago
      IOI problems are closer to IMO combinatorics problems than to other IMO problem types. That might be the reason for the delay. I personally liked only the combinatorics problems in the IMO; that's why I dropped the math track and went to the IOI instead.

      I feel the reason combinatorics is harder for AI models is the same reason LLMs are not great at reasoning about anything out of distribution. LLMs are good pattern recognizers, and fascinating at this point. But simple tasks like counting intersections in Venn diagrams require more strategy and less pattern recognition. Pure NN-based models seem like they won't be enough to solve these problems. AI agents and RL are promising.

      I don't know anything about Lean, but I am curious whether proofs of combinatorial problems can be represented as well as number theory or algebra. If combinatorial problem solutions are always closer to natural language, the failure of LLMs is expected; or at least we can assume it might take more time to get better. I am assuming here that solutions to combinatorial IMO problems are more natural-language oriented and rely more on common sense/informal logic compared to geometry or number theory problems.

      • Davidzheng 44 days ago
        Are you convinced there's a "reason" AI today is worse at combo? Like, I don't see enough evidence that it's not an accident.
  • djaouen 43 days ago
    Is it really such a smart thing to train a non-human "entity" to beat humans at math?
  • __0x01 43 days ago
    Please could someone explain, very simply, what the training data was composed of?
  • fovc 44 days ago
    6 months ago I predicted Algebra would be next after geometry. Nice to see that was right. I thought number theory would come before combinatorics, but this seems to have solved one of those. Excited to dig into how it was done

    https://news.ycombinator.com/item?id=39037512

  • c0l0 44 days ago
    That's great, but does that particular model also know if/when/that it does not know?
    • ibash 44 days ago
      Yes

      > AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. … Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified for correctness.

    • diffeomorphism 44 days ago
      While that was probably meant to be rhetorical, the answer surprisingly seems to be an extremely strong "Yes, it does". Exciting times.
    • foota 44 days ago
      Never?

      Edit: To defend my response, the model definitely knows when it hasn't yet found a correct response, but this is categorically different from knowing that it does not know (and of course monkeys and typewriters etc., can always find a proof eventually if one exists).

  • amelius 43 days ago
    How long until this tech is integrated into compilers?
  • atum47 44 days ago
    Oh, the title was changed to international math Olympiad. I was reading IMO as in my opinion, haha
  • lumb63 44 days ago
    Can someone explain why proving and math problem solving is not a far easier problem for computers? Why does it require any “artificial intelligence” at all?

    For example, suppose a computer is asked to prove the sum of two even numbers is an even number. It could pull up its list of “things it knows about even numbers”, namely that an even number modulo 2 is 0. Assuming the first number is “a” and the second is “b”, then it knows a=2x and b=2y for some x and y. It then knows via the distributive property that the sum is 2(x+y), which satisfies the definition of an even number.

    What am I missing that makes this problem so much harder than applying a finite and known set of axioms and manipulations?
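
    For concreteness, here is roughly what that argument looks like machine-checked, in Lean 4 with mathlib (a sketch; mathlib defines `Even a` as `∃ r, a = r + r`):

      import Mathlib

      -- The "even + even is even" argument, formalized: unpack
      -- a = x + x and b = y + y, then exhibit x + y as the witness.
      theorem even_add_even (a b : ℤ) (ha : Even a) (hb : Even b) :
          Even (a + b) := by
        obtain ⟨x, hx⟩ := ha  -- a = x + x
        obtain ⟨y, hy⟩ := hb  -- b = y + y
        exact ⟨x + y, by rw [hx, hy]; ring⟩

    Checking a chain of steps like this is easy; presumably the hard part is finding the right chain in the first place.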

    • psb217 44 days ago
      In a sense, the model _is_ simply applying a finite and known set of axioms and manipulations. What makes this hard in practice is that the number of possible ways in which to perform multiple steps of this sort of axiomatic reasoning grows exponentially with the length of the shortest possible solution for a given problem. This is similar to the way in which the tree of possible futures in games like go/chess grows exponentially as one tries to plan further into the future.

      This makes it natural to address these problems using similar techniques, which is what this research team did. The "magic" in their solution is the use of neural nets to make good guesses about which branches of these massive search trees to explore, and good guesses about how promising any particular branch is even before they reach its end. These tricks let them (massively) reduce the effective branching factor and depth of the search trees required to produce solutions to math problems or win board games.
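
      To make that concrete, here is a toy version of neural-net-guided best-first search (plain Python; policy, state.apply, and is_proved are hypothetical stand-ins, not AlphaProof's actual interfaces):

        import heapq
        from itertools import count

        def guided_search(root, policy, is_proved, budget=10_000):
            # Best-first search over proof states. A learned value net ranks
            # the open states, and a learned policy proposes only a few
            # candidate tactics per state instead of every legal one; both
            # shrink the effective branching factor, which is the whole trick.
            tie = count()  # tie-breaker so heapq never compares states directly
            frontier = [(-policy.value(root), next(tie), root)]
            while frontier and budget > 0:
                _, _, state = heapq.heappop(frontier)
                if is_proved(state):
                    return state
                budget -= 1
                for tactic in policy.propose(state, k=8):  # hypothetical hook
                    child = state.apply(tactic)            # None if tactic fails
                    if child is not None:
                        heapq.heappush(frontier,
                                       (-policy.value(child), next(tie), child))
            return None  # budget exhausted without finding a proof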

    • zone411 44 days ago
      The problems in question require much, much more complex proofs. Try example IMO problems yourself and see if they don't require much intelligence: https://artofproblemsolving.com/wiki/index.php/IMO_Problems_.... And then keep in mind that research math is orders of magnitude more complex still.
    • ComplexSystems 44 days ago
      What you're missing is that this kind of thing has arbitrarily been declared "artificial intelligence" territory. Once the ability of computers to do it has been established, it will no longer be artificial intelligence territory; at that point it'll just be another algorithm.
    • booleandilemma 44 days ago
      Proofs require a certain ingenuity that computers just don't have, imo. A computer would never be able to come up with something like Cantor's diagonalization proof on its own.
      • Davidzheng 44 days ago
        Are you sure AlphaProof can't?
    • runeblaze 44 days ago
      Another answer is that 3SAT and co can be seen as distilled variants of proving statements. Well, 3SAT is famously hard.
  • mathinaly 44 days ago
    How do they know their formalization of the informal problems into formal ones was correct?
  • m3kw9 43 days ago
    Is it one of those slowly slowly then suddenly things? I hope so
  • mupuff1234 44 days ago
    Can it / did it solve problems that weren't solved yet?
    • raincole 44 days ago
      Technically yes. And it's easy. You can probably do it with your PC's computational power.

      The thing is that most math "problems" are not solved, not because they're hard, but because they're not interesting enough to even be discovered by humans.

      • mupuff1234 44 days ago
        Yeah, I mean "interesting" problems (perhaps not fields medal interesting, but interesting enough)
  • gerdesj 44 days ago
    Why on earth did the "beastie" need the questions translated?

    So it failed at the first step (comprehension) and hence I think we can request a better effort next time.

  • badrunaway 43 days ago
    This will in a few months change everything forever. Exponential growth incoming soon from Deepmind systems.
  • refulgentis 44 days ago
    Goalposts at the moon, FUD at "but what if it's obviously fake?".

    Real, exact, quotes from the top comments at 1 PM EST.

    "I want to believe that the computer found it, but I can't find anything to confirm."

    "Curing cancer will require new ideas"

    "Maybe they used 10% of all of GCP [Google compute]"

  • szundi 43 days ago
    Like it understands any of it
    • johnfn 43 days ago
      Do you understand any of it?
      • utopcell 43 days ago
        :-)

        Let us support the fellow human who seems to be just at the first of the five stages of grief: denial.

  • gyudin 44 days ago
    Haha, what a dumb tincan (c) somebody on Twitter right now :D
  • lolinder 44 days ago
    This is a fun result for AI, but a very disingenuous way to market it.

    IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].

    This model, meanwhile, was wired up to a theorem prover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.

    If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?

    Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.

    [0] 5.1 and 5.4 in the regulations: https://www.imo-official.org/documents/RegulationsIMO.pdf

    • gjm11 44 days ago
      Working mathematicians mostly don't use theorem provers in their work, and find that when they do they go significantly more slowly (with of course the compensating advantage of guaranteeing no mistakes in the final result).

      A theorem prover is probably more useful for the typical IMO problem than for the typical real research problem, but even so I'd guess that even with a reasonable amount of training most IMO contestants would not do much better for having access to a theorem prover.

      Having three days would be a bigger deal, for sure. (But from "computers can't do this" to "computers can do this, but it takes days" is generally a much bigger step than from "computers can do this, but it takes days" to "computers can do this in seconds".)

    • Davidzheng 44 days ago
      I can tell you that as someone who could have gotten bronze (I was too weak for the team) and is now a math PhD--I would most likely not have scored as well as AlphaProof in three days. For most problems, either you find an idea soon or it can take much, much longer. It's just not a matter of steady work and constant progress.
      • utopcell 44 days ago
        Noting the difference in humility between someone who has made the cut to participate in the IMO (6 people per country) and the multitude of hindsight prophets who trivialize the DeepMind achievement.
      • knotthebest 44 days ago
        Skill issue tbh
    • golol 44 days ago
      The point is not to compare AI and humans, it is to compare AI and IMO-level math problems. It's not for sport.
      • lolinder 44 days ago
        They're literally comparing AI to human IMO contestants. "AlphaProof solves 4/6 IMO problems correctly" would be the non-comparison version of this press release and would give a better sense of how it's actually doing.
        • golol 44 days ago
          "Solving IMO problems at Silver-Medal level" is pretty much equivalent to solving something like 4/6 problems. It is only a disingenious comparison if you want to read it as a comparison. I mean yea, many people will, but I don't care anout them. People who are technically interested in this know that the point is not to have a competition of AI with humans.
          • lolinder 44 days ago
            > but I don't care about them

            It's great that you feel safe being so aloof, but I believe we have a responsibility in tech to turn down the AI hype valve.

            The NYT is currently running a piece with the headline "Move Over, Mathematicians, Here Comes AlphaProof". People see that, and people react, and we in tech are not helping matters by carelessly making false comparisons.

            • visarga 44 days ago
              I think search-based AI is on a different level than imitation models like GPT. This is not a hallucinating model, and it could potentially discover things that are not written in any books.

              Search is amazing. Protein folding searches for the lowest energy configuration. Evolution searches for ecological fit. Culture searches for progress, and science for understanding. Even placing a foot on the ground searches for dynamic equilibrium. Training an ML model searches for the best parameters to fit a dataset. Search is universal. Supervised learning is just imitative; search is extrapolative.

            • golol 44 days ago
              Why? Why is hype bad? What actual harm does it cause?

              Also the headline is fair, as I do believe that AlphaProof demonstrates an approach to mathematics that will indeed invade mathematicians' workspaces. And I say that as a mathematician.

              • Davidzheng 44 days ago
                For sure. I feel like mathematicians are not paying attention (maybe deliberately); we have a real shot at solving some incredibly hard open problem in the next few years.
      • kenjackson 44 days ago
        Exactly. The point is whether we can eventually get AI to solve problems which we as humans can't. Not whether we can win the IMO with a computer.
    • blackbear_ 44 days ago
      And why aren't you complaining that human participants could train and study for thousands of hours before attempting the problems? And that the training materials they used were themselves created and perfected by hundreds of other people, after those people had spent countless hours studying?
      • lolinder 44 days ago
        Because so did the AI?
        • blackbear_ 44 days ago
          That's exactly my point? Both had to learn?
  • rowanG077 43 days ago
    Is this just google blowing up their own asses or is this actually useable with some sane license?
  • hendler 43 days ago
  • thoiwer23423 43 days ago
    And yet it thinks 3.11 is greater than 3.9

    (probably confused by version numbers)

  • machiaweliczny 44 days ago
    Good, now use DiffusER on algebra somehow please
  • pmcf 44 days ago
    I read this as “In My Opinion” and really thought this about AI dealing with opinionated people. Nope. HN is still safe. For now…
  • Sparkyte 43 days ago
    In other news today calculator solves math problem.
  • cs702 44 days ago
    In 1997, machines defeated a World Chess Champion for the first time, using brute-force "dumb search." Critics noted that while "dumb search" worked for chess, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]

    In 2016, machines defeated a World Go Champion for the first time, using a clever form of "dumb search" that leverages compute, DNNs, reinforcement learning (RL), and self-play. Critics noted that while this fancy form of "dumb search" worked for Go, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]

    In 2024, machines solved insanely hard math problems at the Silver Medal level in an International Math Olympiad for the first time, using a clever form of "dumb search" that leverages compute, DNNs, RL, and a formal language. Perhaps "dumb search" over cleverly pruned spaces isn't as dumb as the critics would like it to be?

    ---

    [a] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

    • tux3 44 days ago
      The success in Go was very much not dumb search. Where dumb search had failed to achieve the level of even a very weak player, the neural net's "intuition" about good moves without any search was already very strong. Only the combination was superhuman, and that was anything but the dumb search that had been tried before.

      Today's announcement is also not about proving Lean theorems by "dumb search". The success is about search + neural networks.

      You're attacking critics for criticizing the solution that has failed, while confusing it for the solution that works to this day.

      • cs702 44 days ago
        That's in fact precisely my point:

        Clever forms of "dumb search" that leverage compute, DNNs, RL, self-play, and/or formal languages are not dumb at all.

        I put the words "dumb search" in quotes precisely because I think critics who dismiss AI progress as such are missing the point.

        We're not in disagreement :-)

        • tux3 44 days ago
          >Clever forms of "dumb search" that leverage compute, DNNs, RL, self-play, and formal languages are not dumb at all

          Right. My point is that you're attacking a position that no real critic holds. Of course we're in agreement then!

          "Clever forms of dumb search are not dumb" feels a little like kicking down open doors. We were always going to agree.

          • cs702 44 days ago
            > My point is that you're attacking a position that no real critic holds

            I'm not so sure. I wrote my comment after seeing quite a few comments on other threads here that read like fancy versions of "this is just brute-force search over cleverly pruned trees." Search the other threads here, and you'll see what I mean.

      • utopcell 44 days ago
        I think that was the op's point.
    • klyrs 44 days ago
      A brute force search has perfect knowledge. Calling it "dumb" encourages bad analogies -- it's "dumb" because it doesn't require advanced reasoning. It's also "genius" because it always gets the right answer eventually. It's hugely expensive to run.

      And you keep shifting the goalpost on what's called "dumb" here.

      • jmalicki 44 days ago
        None of the above are brute force search.
      • adrianN 44 days ago
        You might have missed the point.
        • bdjsiqoocwk 44 days ago
          Tell us what the point was, I think a lot of us are missing it.
          • BiteCode_dev 44 days ago
            Moving the goal post of dumb search.
    • Davidzheng 44 days ago
      By their nature, hard math problems do not have fast algorithmic solutions--you can only solve them by search or clever search. Mathematicians have heuristics that help us decide intuitively what's going to work and what can make progress. But in the end--there is only search.
    • fspeech 44 days ago
      By "dumb" I assume you meant brute-force. Search, as opposed to extrapolation, is what actually produces suprise or creative results, whether it happens in a person's head or on a computer. The issue is to produce the heuristics that can let one push back the combinatorial explosion.
    • netcan 44 days ago
      Well written.

      It kinda had to be this way, I think. There's a something from nothing problem. Douglas Adams brilliantly starts at this point.

      We don't understand something from nothing. We don't even have the language to describe it. Concepts like "complexity" are frustratingly resistant to formalization.

      "There is no free will." Has recently resurged as a philosophical position... like it did in response to Newton's mechanics.

      Matter from void. Life from minerals. Consciousness from electrons. Free will from deterministic components. Smart reasoning & intelligent rationalisation from dumb search, computation, DNNs and such.

      I don't think this debate was supposed to be ended by anything short of empirical demonstration.

      Endnote: Deep Blue's victory over Garry was a bunch of preprogrammed bulls--t. Rematch!

    • creer 44 days ago
      If one of the lessons of LLMs was that much of "common sense" is covered in the written language corpus - that is, perhaps, basic human intelligence is covered by language.

      Then, should we expect less with mathematics where written language is the normal way knowledge is propagated, and where formal proofs are wanted? An important distinction here is the coupling of search (not LLM for this one), a formal math language, and theorem proving. Math intelligence may not be merely the math written corpus, but adding the formal language and theorem proving sounds pretty powerful.

      All this still lacks self-directed goals. An intention. For now that's taken care of by the human asking questions.

      • imtringued 44 days ago
        >that is, perhaps, basic human intelligence is covered by language.

        It isn't. That's why LLMs are still terrible at basic reasoning tasks. The dataset doesn't contain the human intelligence, just the end results. Nobody is training their LLMs on brain scans.

    • mikeknoop 44 days ago
      High-efficiency "search" is necessary to reach AGI. For example, humans don't search millions of potential answers to beat ARC Prize puzzles. Instead, humans use our core experience to shrink the search space "intuitively" and deterministically check only a handful of ideas. I think deep-learning-guided search is an incredibly promising research direction.
    • sobellian 44 days ago
      From whence are you quoting the phrase "dumb search?"
    • TheDudeMan 44 days ago
      So all forms of search are dumb in your view?
      • outlore 44 days ago
        I think OP is arguing the opposite. In other words, critics of AI innovations call them “dumb search”. The goal post keeps moving.
        • cs702 44 days ago
          Exactly. They are not dumb search!
    • Probiotic6081 44 days ago
      [dead]
  • nsjoz 42 days ago
    [dead]
  • ezugwuernesttoc 44 days ago
    [dead]
  • hulitu 44 days ago
    > AI solves International Math Olympiad problems at silver medal level

    > In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.

    Why not compare with students who are given 3 days to submit an answer ? /s

  • bigbacaloa 43 days ago
    [dead]
  • lngnmn2 43 days ago
    [dead]
  • ezugwuernesttoc 44 days ago
    [dead]
  • khana 43 days ago
    [dead]
  • t3estabc 44 days ago
    [dead]
  • teresasovinski 44 days ago
    [dead]
  • kerbaupapua 43 days ago
    [flagged]
  • ksoajl 44 days ago
    [flagged]
  • highcountess 44 days ago
    [flagged]
    • dpbriggs 44 days ago
      This is more analogous to programmers working with Copilot. There's an exciting possibility here of mathematicians feeding these systems subproblems to assist in proving larger theorems.
      • highcountess 42 days ago
        It was not meant to be a serious comment even though it seems it may have touched a nerve.
    • ykonstant 44 days ago
      Programming in Lean?
    • Davidzheng 44 days ago
      lol
      • Davidzheng 44 days ago
        Of course I agree they will be better--I'm happy to be like chess players and just admire the machines as they entertain humans.
  • Davidzheng 44 days ago
    HOLY SHIT. It's amazing
  • data_maan 44 days ago
    It's bullshit. AlphaGeometry can't even solve the Pythagorean theorem.

    They're not open-sourcing anything.

    This is a dead end on which no further research can be built.

    It violates pretty much every principle of incremental improvement on which science is based. It's here just for hype, and the 300+ comments prove it.

  • NoblePublius 44 days ago
    A lot of words to say second place
  • iamronaldo 44 days ago
    No benchmarks of any kind?
  • dmitrygr 44 days ago
    > First, the problems were manually translated into formal mathematical language

    That is more than half the work of solving them. The headline should read "AI solves the simple part of each IMO problem at silver medal level".

  • myspeed 43 days ago
    This means we may need to remove or replace the Olympiad... It has no practical significance... Winners never contributed to any major scientific breakthroughs.
    • anon2345252 43 days ago
      "A number of IMO participants have gone on to become notable mathematicians. The following IMO participants have either received a Fields Medal, an Abel Prize, a Wolf Prize or a Clay Research Award, awards which recognise groundbreaking research in mathematics; a European Mathematical Society Prize, an award which recognizes young researchers; or one of the American Mathematical Society's awards (a Blumenthal Award in Pure Mathematics, Bôcher Memorial Prize in Analysis, Cole Prize in Algebra, Cole Prize in Number Theory, Fulkerson Prize in Discrete Mathematics, Steele Prize in Mathematics, or Veblen Prize in Geometry and Topology) recognizing research in specific mathematical fields. Grigori Perelman proved the Poincaré conjecture (one of the seven Millennium Prize Problems), and Yuri Matiyasevich gave a negative solution of Hilbert's tenth problem."

      [...]

      "IMO medalists have also gone on to become notable computer scientists. The following IMO medalists have received a Nevanlinna Prize, a Knuth Prize, or a Gödel Prize; these awards recognise research in theoretical computer science."

      https://en.wikipedia.org/wiki/List_of_International_Mathemat...

    • nb_quant 43 days ago
      A lot of them become Fields medallists. From [1] "The conditional probability that an IMO gold medalist will become a Fields medalist is fifty times larger than the corresponding probability for a PhD graduate from a top 10 mathematics program."

      [1]: https://www.aeaweb.org/articles?id=10.1257/aeri.20190457

    • hnfong 43 days ago
      (And with this comment, the 2024 Olympics commences.)

      There are so many competitions that don't have any obvious practical significance. And people are still enjoying competitions where AI completely pwns humans.

      Also, this is probably a good time to ask whether you won the Putnam... https://news.ycombinator.com/item?id=35079

  • xyst 44 days ago
    Billions of dollars spent building this, GWh of energy used to train it. And the best it could do is "silver"?

    Got to be kidding me. We are fucked

  • AyyEye 44 days ago
    Parlor tricks. Wake me up when AI can reliably identify which number is circled at the level of my two-year-old.
  • dinobones 44 days ago
    I see DeepMind is still playing around with RL + search algorithms, except now it looks like they're using an LLM to generate state candidates.

    I don't really find this that impressive. With enough compute you could just do n-of-10,000 LLM generations to "brute force" a difficult problem, and you'll get there eventually.

    • richard___ 44 days ago
      Sigh. Just wrong
      • dinobones 44 days ago
        I'm hoping someone takes my bait and explains why MCTS + LLM is more impressive than what it sounds like.
        • utopcell 44 days ago
          good luck with that.
  • piombisallow 44 days ago
    IMO problems aren't fundamentally different from chess or other games, in that the answer is already known.
    • energy123 44 days ago
      IMO and chess are the same in the most important respect: you can use Lean or a simulated chess game to create unlimited quality training labels, as sketched below. Any problem of this category should be solvable with enough compute and clever architecture/metacognition design. The more intractable problems are those where data is hard to find or synthesize.
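
      A toy sketch of what "unlimited quality training labels" means (plain Python; all five game hooks are hypothetical):

        import random

        def self_play_labels(new_game, legal_moves, play, is_over, score,
                             n_games=1000):
            # Random self-play where the rules engine itself is the labeler,
            # so labels are cheap, unlimited, and never wrong -- the same
            # property Lean's kernel gives you for candidate proofs.
            data = []
            for _ in range(n_games):
                state, seen = new_game(), []
                while not is_over(state):
                    state = play(state, random.choice(legal_moves(state)))
                    seen.append(state)
                data.extend((s, score(state)) for s in seen)  # label each position
            return data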
    • Smaug123 44 days ago
      I really don't understand what you mean by this. 1) it's not known whether chess is a win for White or not. 2) IMO problems, such as 2024 problem 1 which the system solved, are often phrased as "Determine all X such that…".
      • diffeomorphism 44 days ago
        You are attacking a straw man and the point made is pretty good.

        Competition problems are designed to be actually solvable by contestants. In particular, the problems should be solvable using a reasonable collection of techniques, and many "prep courses" will teach you techniques, tools, and algorithms; a good starting point is to throw that stuff at any given problem. So, just like with chess openings, putting in lots of legwork will give you some good results for that part. You might very well lose in the mid and late game, just like this AI might struggle with "actual problems".

        It is of course still very impressive, but that is an important point.

        • Smaug123 44 days ago
          I'm attacking nobody! I literally couldn't understand the point, so I said so: as stated, its premises are simply clearly false!

          Your point, however, is certainly a good one: IMO problems are an extremely narrow subset of the space of mathematical problems, which is itself not necessarily even 50% of the space of the work of a mathematician.

    • wufufufu 44 days ago
      Kinda? Chess isn't solved. Complex problems can have better solutions discovered in the future.
      • jeremyjh 44 days ago
        It isn't solved but the evaluation (which side is better, by how much, and which moves are best) of a strong engine is - for all practical purposes - an answer to every chess position you can pose to it. This makes it easy to gauge improvement and benchmark against other systems compared to some other problems.
    • osti 44 days ago
      But the answer is probably not known by you, in particular.
      • piombisallow 44 days ago
        Yes, sure, but this doesn't mean that this generalizes to open math research problems, which would be the useful real-world application of this. Otherwise this is just playing known games with known rules, granted better/faster than humans.
    • almostgotcaught 44 days ago
      > in that the answer is already known.

      You realize this holds true for all of math, right? Outside of Gödel incompleteness potholes, every proof/theorem is a permutation of ZFC. And you can fix the potholes by just filling them in with more Cs.

      • diffeomorphism 44 days ago
        That seems shallow and not really useful. Like, sorting is easy: just take all permutations and look at each one to see whether it is sorted... it just might take you longer than the heat death of the universe to actually do that.
      • lanstin 44 days ago
        There are elementary Diophantine equations that are independent of ZFC. What is your approach for those?
    • Davidzheng 44 days ago
      lol
  • gowld 44 days ago
    This shows a major gap in AI.

    The proofs of these problems aren't interesting. They were already known before the AI started work.

    What's interesting is how the AI found the proof. The only answer we have is "slurped data into a neural network, matched patterns, and did some brute search".

    What were the ideas it brainstormed? What were the dead-end paths? What were the "activations" where the problem seemed similar to a certain piece of input, which led to a guess of a step in the solution?

  • rich_sasha 44 days ago
    I'm actually not that surprised. Maths Olympiads IME have always been 80% preparation, 20% skill - if not more heavily weighted toward preparation. It was all about solving as many problems as possible ahead of the papers, and having a good short-term memory. Since Olympiads are for kids, the amount of fundamental mathematical theory required is actually not that great.

    Sounds perfect for a GPT model, with lots of input training data (problem books and solutions).

  • balls187 44 days ago
    What was the total energy consumption required to achieve this result (both training and running)?

    And how much CO2 was released into Earth's atmosphere?

    • ozten 44 days ago
      Compared to all of the humans who compete at this level and their inputs and outputs for the trailing 5 years.
      • balls187 44 days ago
        And?

        The result is (likely) net energy consumption, resulting in (likely) net CO2 emissions.

        So, what did it cost us for this achievement in AI?

        EDIT TO ADD: It's fair to think that such a presser should not include answers to my questions. But, it's also fair to want that level of transparency given we are dealing with climate change.

        • regularfry 44 days ago
          You're not wrong and in general the conclusion is that AI emits less CO2 than a human performing the same task. Whether that's true for this specific task is worth asking, as is the question of how efficient such a process can be made.
    • amelius 44 days ago
      There's no energy limit in the IMO rules.
      • balls187 44 days ago
        The point isn't IMO rules.

        It's that we are living in a period of time where there are very real consequences of nearly a century of unchecked CO2 due to human industry.

        And AI (like crypto before it) requires considerable energy consumption. Because of this, I believe we (people who believe in AI) need to hold companies accountable by very transparently disclosing those energy costs.

        • password54321 44 days ago
          > need to hold companies accountable by very transparently disclosing those energy costs.

          And if they do, then what? If it is "too high" do we delay research because we need to keep the world how it is for you? What about all the other problems others face that could be solved by doubling down on compute for AI research?

          • balls187 44 days ago
            > And if they do, then what? If it is "too high" do we delay research because we need to keep the world how it is for you?

            First, it's keeping the world how it is for all of us, not just me.

            Second, to answer you question, I think that is a decision for all of us to weigh in on, but before we can do that, we must be informed as best as we can.

            Do sacrifices have to be made for the greater good? Absolutely. Do for-profit mega corporations get to make those decisions without consent from the public? No.

            • password54321 43 days ago
              Many people don't want to live in the world how it is. They would rather see risks taken for accelerated progress. Stop trying to pretend your take is the humanitarian take.
              • balls187 43 days ago
                > Many people don't want to live in the world how it is.

                Got a source to support your assertion that many people are okay with the effects of climate change?

                https://www.scientificamerican.com/article/more-climate-laws...

                > Stop trying to pretend your take is the humanitarian take.

                That's a straw man. However, I cannot believe many humans are in support of an uninhabitable world.

        • xyzzy123 44 days ago
          They are 100% accountable for paying the bill from their electricity provider.
        • amelius 44 days ago
          What if at some point AI figures out a solution to climate change?
          • Sivart13 44 days ago
            Why would we be any more likely to implement it, relative to the solutions that humans have already figured out for climate change?
            • sensanaty 44 days ago
              Well we can be confident in the knowledge that techbros might finally take the issue seriously if an AI tells them to!
          • ks2048 44 days ago
            I know this is not an uncommon opinion in tech circles, but I believe it's an insane thing to hang humanity's hopes on. There's no reason to think AI will be omnipotent.
            • balls187 44 days ago
              There is not, but there is plenty of historical evidence that scientific and technological progress has routinely addressed humanity's crisis du jour.
          • staunton 44 days ago
            You mean new and incredibly effective ways to shape public opinion? I guess it might. But then someone would still have to use it for that purpose...
    • utopcell 44 days ago