I played chess against ChatGPT-4 and lost


131 points | by paulpauper 13 days ago


  • mtlmtlmtlmtl 13 days ago
    Please provide games in pgn format so a proper analysis can be performed. I've seen games from GPT(3.5 and 4) that seem impressive at first glance as well. Once you go through the moves in detail you learn that:

    1. It plays a ton of opening theory. So unless you have grandmaster level opening knowledge it will likely get a superior position out of the opening.

    2. Once it does, tactical sequences tend to work. It can play a lot of common tactical sequences, sometimes 10 moves long. These sequences can have identical or highly similar algebraic notation in many different positions.

    3. This can usually be spotted if you play enough games. Eventually it will attempt a sequence like this when it makes no sense or even includes illegal moves. Or you'll catch it always responding the same way in certain situations. Patterns like this tend to emerge with enough games in every case I've looked at.

    I have yet to see a game not explainable by completing common opening theory and tactical sequences.

    These claims are pretty unhelpful without detailed analysis of the games by a sufficiently strong player.

    • phonebucket 12 days ago
      Traditional chess engines excel at tactics. But most chess tactics are brute-forceable, and traditional chess engines are normally brute-force behemoths.

      Large language models have traditionally been very weak at brute-force computing (just look at how bad they are at multiplication of large numbers). If it can somehow excel at something which typically requires computing power, like chess tactics deep into a game, well after a novel position has been reached, then put me down as impressed.

      • mtlmtlmtlmtl 12 days ago
        I'm impressed too, but only in the sense that I'm surprised that transformers can reach any level of success at all.

        But it seems mostly like a weird statistical quirk of chess that tactical sequences are somehow the most common move sequences (openings aside), and so they're very likely to pop out of a model predicting the most likely tokens. But there doesn't seem to be a reasonable path from doing that to being able to determine whether a likely tactical sequence is actually a good idea. I've been thinking a lot about ways to use a transformer model in combination with search to accomplish this for a few days now. So far I have a few ideas; most of them are a lot less revolutionary than I was hoping for.

        You could take a (narrowly trained) transformer model, give it the game so far and have it predict a sequence of moves. Then use those moves as a move-ordering heuristic in a good old Stockfish-like architecture: do the normal alpha-beta search, but look at the suggested moves first at each node, indexed by ply. I could imagine that if the suggestions get really good, this might prune a decent number of nodes. But nothing earth-shattering; maybe 50 Elo of gain at most is my intuition there. I have other non-transformer ideas that I think are more worthwhile to work on for now.
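
        A minimal sketch of that first idea, over an explicit toy game tree rather than real chess (the `tree`, `leaf_values` and `suggested` structures are all invented for illustration): negamax-style alpha-beta that simply tries the model-suggested move for each ply first.

```python
def alphabeta(node, tree, leaf_values, suggested, ply,
              alpha=float("-inf"), beta=float("inf")):
    """Negamax alpha-beta; the hint in suggested[ply] is tried first."""
    children = tree.get(node)
    if not children:                       # leaf node: static evaluation
        return leaf_values[node]
    hint = suggested[ply] if ply < len(suggested) else None
    ordered = sorted(children, key=lambda c: c != hint)  # hint first
    best = float("-inf")
    for child in ordered:
        score = -alphabeta(child, tree, leaf_values, suggested, ply + 1,
                           -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, best)
        if alpha >= beta:                  # cutoff: good ordering prunes more
            break
    return best

# Two-ply toy tree; leaf values are from the leaf mover's perspective.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaves = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}
print(alphabeta("root", tree, leaves, suggested=["a", "a1"], ply=0))  # 3
```

        With a good hint sequence the cutoffs arrive earlier; with an empty or wrong one the search still returns the same value, it just visits more nodes.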

        The other is to instead invoke it at every node asking for one move only, and guide a search that way. But this somehow just feels like a reinvention of LCZero.

        There's a third more speculative idea which would involve a novel minimax search algorithm inspired by how humans think in terms of tactical sequences, but this idea is still so vague in my head I'm not even sure how to coherently describe it, let alone implement it, or whether it makes sense at all.

        I still need to think about this more deeply, break out my whiteboard and play with some minimax trees to flesh it out. It is intriguing though. I'd also have to train my own transformer; I see no reason to actually end up using GPT for this. There seems to be no sense in including the entire internet in your data if all you're doing is predicting sequences of algebraic notation.

        • YeGoblynQueenne 12 days ago
          If you just want to learn a model of chess games in algebraic notation (is that what it's called? I don't play chess) then you don't need to train a Transformer. That would be overkill, and you wouldn't really be able to train it very well. I mean, unless you have a few petaflops of compute lying around.

          You could instead start with a smaller model. A traditional model, like an n-gram model, a Hidden Markov Model (HMM) or a Probabilistic Context-Free Grammar (PCFG). The advantage of such smaller models is that they don't need billions of parameters to get good results, and you'll get more bang for the buck from the many, many, many examples of games you can find.
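
          To illustrate how small such a model can be, here's a toy bigram (n-gram with n = 2) move predictor over SAN tokens; the training games and function names are made up for the sketch:

```python
from collections import Counter, defaultdict

def train_bigram(games):
    """Count how often each move follows each other move across games."""
    counts = defaultdict(Counter)
    for game in games:
        for prev, nxt in zip(game, game[1:]):
            counts[prev][nxt] += 1
    return counts

def predict(counts, prev_move):
    """Most frequent continuation seen after prev_move, or None."""
    followers = counts.get(prev_move)
    return followers.most_common(1)[0][0] if followers else None

model = train_bigram([["e4", "e5", "Nf3", "Nc6", "Bb5"],
                      ["e4", "e5", "Nf3", "Nc6", "Bc4"],
                      ["e4", "c5", "Nf3", "d6"]])
print(predict(model, "e4"))   # "e5" (seen twice vs. "c5" once)
print(predict(model, "Nf3"))  # "Nc6"
```

          A real n-gram model would condition on more context than one move, but the idea scales the same way: count, normalize, predict.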

          But, don't expect to get very far. A system that learns only to predict the best move will never beat a system that looks ahead a few dozen ply, with alpha-beta minimax, or that plays out entire game trees, like Monte-Carlo Tree Search. Well, unless you do something silly to severely hobble the search-based system, or train the predictive model with all possible chess games. Which I mean, is theoretically possible: you just need to build out an entire chess game tree :P

          You could also try with a simpler game: Tic-Tac-Toe should be amenable to a predictive modelling approach. So should simpler checkerboard games like hexapawn, or even checkers, which is after all solved.

          But my question is, what would you hope to achieve with all this? What is the point of training a predictive model to play chess? Hasn't this been tried before, and shown to be no good compared to a search-based approach? If not, I'd be very surprised to find that out, and there might be some merit in trying to test the limits of the predictive approach. But it's going to be limited alright.

          • mtlmtlmtlmtl 12 days ago
            You're probably right. I'm still learning about neural networks and how transformers work, slowly going through Karpathy's YouTube videos while building my own dumb little things.

            Could you elaborate a bit more on why you think training a transformer only on chess moves (in algebraic notation, yes; algebraic notation is the one that says <piece><square>, roughly speaking) wouldn't work? I'm not sure I understand.

            As for your question, I don't really have a good answer. I've just been working on my own crazy chess AI ideas for a long while now, and I was taken aback by the fact that GPT seems able to occasionally "find" long tactical sequences even in positions that have not occurred before in known games. So it seemed only natural to try to think deeply about whether it represents some nugget of something useful, maybe even a fundamentally new approach. But I have serious doubts, as I explained in GP.

            It's also just been an interesting angle for me to understand what LLMs are doing, because I'm deeply familiar with chess and methods of thinking about it, both human and artificial. There's a lot more for me to grab onto than with any other application in demystifying its behaviour.

            • YeGoblynQueenne 12 days ago
              >> Could you elaborate a bit more on why you think training a transformer only on chess moves(in algebraic notation, yes. Algebraic notation is the one that says <piece><square>, roughly speaking) wouldn't work? I'm not sure I understand.

              Oh no, I think it would work. Just that it would be impossible for one person to train a Transformer to play good chess just by predicting the next move. Now that I think about it, ChatGPT's model is trained not only on algebraic notation (thanks!) but also on analyses of games, so the natural language in its initial prompt also directs it to play a certain... kind? style? of game. I'm guessing, anyway.

              >> I've just been working on my own crazy chess AI ideas for a long while now and I was taken aback by the fact that GPT seems able to occasionally "find" long tactical sequences even in positions that have not occurred before in known games.

              Well, what GPT is doing is, fundamentally, compression. Normally we think of compression as what happens when we zip a file, right? You zip a file, then you unzip it, and you get the same file back. Setting aside lossless versus lossy compression for a second, it is also possible to compress information so that you can uncompress it into variations of the original.

              Here's a very simple example: Suppose I decided to store a parse of the sentence "the cat eats a bat" as a Context-Free grammar.

                sentence --> noun_phrase, verb_phrase.
                noun_phrase --> det, noun.
                verb_phrase --> verb, noun_phrase.
                det --> [the].
                det --> [a].
                noun --> [cat].
                noun --> [bat].
                verb --> [eats].
              Now that is a grammar that accepts, and generates, not only the initial sentence, "the cat eats a bat", but also the sentences: "the cat eats a cat", "the cat eats the cat", "a cat eats the cat", "the bat eats a cat", "the bat eats a bat", "a cat eats a cat", "a bat eats a bat" and so on.

              So we started with a grammar that represents one string, and we ended up with a grammar that can spit out a whole bunch of strings that are not the original string. That's what I mean by "compress[ing] information so that you can uncompress it into variations of the original". And that's why they can generate never-before-seen sequences, like you say: because they generate them from bits and pieces of sequences they've already seen.
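
              In case it helps, the grammar above is small enough to enumerate exhaustively; here's a rough Python transcription (the dict encoding is mine, not standard DCG tooling):

```python
from itertools import product

# The toy grammar from above, as a dict; symbols that don't appear as
# keys are terminals.
grammar = {
    "sentence":    [["noun_phrase", "verb_phrase"]],
    "noun_phrase": [["det", "noun"]],
    "verb_phrase": [["verb", "noun_phrase"]],
    "det":  [["the"], ["a"]],
    "noun": [["cat"], ["bat"]],
    "verb": [["eats"]],
}

def generate(symbol):
    """Yield every word sequence the symbol can expand to."""
    if symbol not in grammar:
        yield [symbol]
        return
    for rule in grammar[symbol]:
        for parts in product(*(list(generate(s)) for s in rule)):
            yield [word for part in parts for word in part]

sentences = [" ".join(words) for words in generate("sentence")]
print(len(sentences))                     # 16 sentences from one stored parse
print("the cat eats a bat" in sentences)  # True
```

              One parse in, sixteen sentences out: exactly the "uncompress into variations" effect described above.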

              Obviously language models are very different models of language than grammars, and they also have weights that can be used to select certain generations with priority, over others, but that's more work for you.

              The example above is copied from here:


              There's a fuller example that shows how to build an actual parse tree but I left it out to avoid hurting your eyes:


              Again, all that's nothing to do with Transformers. It's just a way to understand how you can start with some encoding of one sentence, and generate many more. Fundamentally, language modelling works the same regardless of the specific model.

              Edit: note also that the grammar above isn't compressing the original sentence "the cat eats a bat" at a very high rate, but if you take into account all the other sentences it can generate, that's a good rate of compression.

    • anderspitman 12 days ago
      I think OP's point isn't so much that ChatGPT is good at chess, but that its ability to play chess has improved dramatically in only a few months.
  • pyduan 12 days ago
    While impressive, it is clear ChatGPT doesn't actually understand what is happening when you give it unusual positions and ask it to explain what happened.

    For example, after a couple moves of the Fried Liver Attack (which ChatGPT was easily able to play following the known lines): 1.e4 e5 2.Nf3 Nc6 3.Bc4 Nf6 4.Ng5 d5

    I tried giving it 5.Qxf7#, which is obviously illegal as it's just not a square my queen can move to.

    Here's how the rest of the conversation unfolded:

    GPT-4: Good game! You've played the Fried Liver Attack and achieved checkmate in just 6 moves. Well done! If you'd like to play again or discuss anything else, feel free to ask.

    Me: the last move i played was illegal, do you realize that?

    GPT-4: Apologies for the oversight. You are correct; your last move was illegal as it would put your own king in check. (...)

    Me: that's not the reason my move was illegal.

    GPT-4: You are correct again, and I apologize for my previous mistake. The move Qxf7# was not illegal because it put your king in check; it was actually illegal because it involved capturing a non-existent piece on f7. The correct move in algebraic notation should have been exd5. My apologies for the confusion.

    In particular, its latest explanation attempt is very interesting: when challenged, it changed course to give another completely wrong but plausible sounding explanation. And its "The correct move in algebraic notation should have been exd5" phrasing is instructive, because it shows that it's just expecting the 'right' continuation, even though it's a completely different move to what I supplied.

    • mtlmtlmtlmtl 12 days ago
      Hah, I wish this exploit worked in tournament chess.
  • dwaltrip 13 days ago
    I've been incredibly amazed by ChatGPT, GPT-4, and the potential of next-gen AI.

    However, I've been trying for the life of me to get GPT to play tic-tac-toe properly for the past few days. It's not going well, heh.

    This is the current prompt I've been using: https://pastebin.com/pSPkigNP

    I've tried many prompts. Most of them were a lot simpler than the one in the pastebin link. I thought providing multiple examples of how to check all of the win conditions might help. It struggles most severely with the diagonals. Sigh...

    My hunch is that its training data had very few examples of text representations of tic-tac-toe games, compared to say chess.

    Anyone have tips for me? Or success on a similar task?

    This goal was inspired by my success getting GPT to multiply large-ish numbers perfectly. Check out this gist for example: https://gist.github.com/dwaltrip/cad19703982c7f308fdbe5786e1...

    • rajnathani 13 days ago
      Check whether replacing the “ValValVal” formatting you mentioned with spaces (such that “XOO” becomes “X O O”) helps. The reason I suggest this is that the NLP tokenizer would likely encode “XOO” as a single token, which could cause interpretation issues.
      • dwaltrip 13 days ago
        Very interesting... I tried your suggestion and fixed a few other typos / inconsistencies in my prompt.

        Here is the new version: https://pastebin.com/BSUGtxfb

        It seems to be performing better! It played 2 games without mistakes, and was able to evaluate diagonals correctly. I just hit my usage limit for GPT-4, I'll do more testing later.

        • jci 13 days ago
          I asked GPT-4 to give me a prompt to use with it in order to set up a game of tic tac toe and it came up with a similar representation and played successfully.


          I'm still blown away that this is possible.

    • M4v3R 12 days ago
      Hmm, I used a much simpler prompt and actually got to play a valid game with it. My prompt was:

      > Let's play tic tac toe! We will take turns naming our moves and each time you move please list all moves so far. If either of us wins the game just acknowledge it and print the state of the game. If I win just acknowledge that and end the game.

      > I start with the center spot.

      It wasn’t particularly good at the game (missed opportunity to block me from winning) but it played valid moves and detected when I won: https://ibb.co/L83JHxP

    • bigtex88 12 days ago
      Here's what I tried with Tic Tac Toe. There's one error in its first response but otherwise it follows along seamlessly. I was really surprised. I'm interested if you can get the same results.


      (It leaves out the first row of the game in its first response, but I was able to interpret the board state since it played on the second row. When I gave it the full board again in my answer it picked it right up.)

    • lgas 13 days ago
      This worked for me with GPT-4:

          Let's play tic tac toe.  After every move please check the game state to see if anyone has won.  You go first.
      • dwaltrip 13 days ago
        I just tried with gpt-4 and it failed on the first attempt :(


        It didn’t realize that it had won.

        • Jiro 13 days ago
          I would imagine that its training data included millions of chess games and not many tic-tac-toe games.
          • janalsncm 12 days ago
            I would imagine it has both. And since it’s much easier to play legal moves in tic-tac-toe (just find an empty square), it should be much easier. In fact there are only 255,168 total tic-tac-toe games (roughly 26k after removing rotations and reflections), so unlike chess you could just memorize every game.

            The fact that it can’t even do that tells us that it doesn’t have an internal board representation to reference.
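
            For what it's worth, the exhaustive count is easy to check with a brute-force enumeration (a throwaway sketch that counts every move sequence ending in a win or a full board):

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def count_games(board, player):
    """Count complete games (win or draw) reachable from this position."""
    if winner(board) is not None:
        return 1
    empties = [i for i, v in enumerate(board) if v is None]
    if not empties:
        return 1                              # draw: board is full
    total = 0
    for i in empties:
        board[i] = player
        total += count_games(board, "O" if player == "X" else "X")
        board[i] = None                       # undo move
    return total

print(count_games([None] * 9, "X"))  # 255168 distinct complete games
```

            A vanishingly small number compared to chess, which is the point: memorizing every game is entirely feasible here.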

          • Rebelgecko 12 days ago
            OTOH if the training data includes just 1000 tic-tac-toe games, that's potentially enough to cover every possible outcome
      • WinstonSmith84 13 days ago
        Well, it pretended to have won but failed to see that the three X's it placed were actually not vertical :-(


      • tromp 13 days ago
        How does it do on "Let's play connect four." ?
    • usgroup 13 days ago
      These models have limited agency for the moment; they just answer questions. But I see no reason it couldn’t improve itself in principle if it was given enough agency to do so.
  • ashtonbaker 13 days ago
    I couldn’t get it to solve a basic mate-in-one problem with one rook on the board. It seems to completely not understand how to move the pieces. It also doesn’t understand the solution when it’s given.
    • BugsJustFindMe 13 days ago
      This is accurate. Of COURSE it doesn't know the rules of chess and doesn't know how to move the pieces. All it knows is how to regurgitate found descriptions of chess moves in a way that makes sense for descriptions of things but which only has passing resemblance to chess itself, which is not the same thing.
      • dannyw 13 days ago
        There are some studies showing that LLMs are capable of representing internal state and knowledge in their models, even when trained only on language tokens: https://thegradient.pub/othello/

        > Back to the question we have at the beginning: do language models learn world models or just surface statistics? Our experiment provides evidence supporting that these language models are developing world models and relying on the world model to generate sequences. Let’s zoom back and see how we get there.

        The GP's comment suggests that ChatGPT-4* has not internalized this (effectively) for Chess.

        * Just like how ChatGPT-3.5 is not GPT-3.5 (text-davinci-003), ChatGPT-4 is probably not the only GPT-4 model that will be released.

        • IIAOPSW 13 days ago
          The answer is that for chess it doesn't matter. Standard chess notation is a complete encoding of the game space. An inference about the nature of our board-based physical understanding is not needed. You could formulate chess as a purely text-based game about appending alphanumeric tokens to a chain of said tokens. It's a closed system. The machine need not be tied to our squares-and-horsey interpretation of the semantics. To be able to follow the grammar of the language of chess is to understand chess.

          In a similar vein, it is almost possible to adjudicate Diplomacy orders looking only at the orders and never the map.

          Given sufficient interest, complex enough board games tend to converge on the same basic notational principles.

          • l33tman 13 days ago
            The internal model will certainly pick up on statistical correlations among the text analysis corresponding to an 8x8 2D grid as this is the most low-hanging statistical representation that helps solving the problem during training.

            The same argument and result exist for the different human sensory modalities - neurons and connections self-organize to have the same topology and layout as the retina (2D) and frequency / time for the audio (also 2D).

            In fact, wasn't this experiment already done for Othello and LLMs recently? Wasn't there a paper where they found the internal model for the board?

            • IIAOPSW 12 days ago
              It can learn the rules for movement strictly as generator rules imposed on a string of tokens representing the previous sequence of moves. Each new item appended to the list has to in some way match a previous item in the list. E.g. Rc6 is a rook move, so it has to match an earlier token that is also a rook token, in one of two ways: R_6 or Rc_ (and the piece must not have been previously captured by __6 or _c_). At no point is it even necessary to convert the move history into a present board state, let alone the state of an 8x8 grid. The move history is sufficient board state on its own. Are the rules for valid chess moves, expressed as a three-character token grammar, the same thing as having learned a 2D grid in latent space? I don't think so, because the language rule is more general and isn't restricted by geometry.

              In principle it could reason about any incidence structure. That is, anything where the semantics is two types of objects, and a "touching" relation between them. Lines are just all the points along them, points are just all the lines intersecting there. For the purpose of directions, a train station is just a list of all the services that go there, and a service is just the list of stations where it stops. Etc etc. A language model is free to learn and understand these sorts of systems purely as relations on symbols without ever implicitly organizing it into a geometrical representation.

              This is all good news. It means Chess, Transit, Diplomacy, and many other things can fit nicely into pure language reasoning without trying to ground the language in the semantics of our physical nature with its dimensions of space and time and whatever.

              What would change my mind is if, after learning the rules for Chess as string matching, it invented a word for "row" and "column" on its own.

            • yorwba 13 days ago
              That paper is at the link containing "othello" upstream.
          • YeGoblynQueenne 12 days ago
            >> To be able to follow the grammar of the language chess is to understand chess.

            That's interesting. I think you're saying that the rules of chess can be described as a transformation system [1] over the set of strings of chess algebraic notation?


            [1] A system of transformation rules, as in transformation grammars. Not as in Transformers. Just a coincidence.

            • mtlmtlmtlmtl 12 days ago
              Well, let me try to explain what I'm thinking, though I may have misunderstood you.

              The rules of chess allow you to enumerate all the possible transformations from one board state to the next. This is just a fancy way of saying all possible moves in any given board state. By induction, this means that given an initial board state and a series of moves from that board state, you can determine the final board state.

              So this means that the rules of chess allow you to enumerate given an initial state and n plies, all possible ways of adding an (n + 1)th ply.

              So if you just assume the initial board state is always the starting position, theoretically you could do away with thinking about board states altogether. Now, whether that's sensible in terms of computational complexity is another question entirely and my intuition is no.

              • YeGoblynQueenne 12 days ago
                >> So this means that the rules of chess allow you to enumerate given an initial state and n plies, all possible ways of adding an (n + 1)th ply.

                Ah, I get you. Yeah, OK, makes sense. You can generate all legal moves from a starting board state given the rules of chess. Yep.

            • IIAOPSW 12 days ago
              Yeah, that's exactly it. The rules are easy enough to put into the form of matching strings; I gave an explicit example further down in the thread. At no point is it required to even convert the game history into a board-state data structure. The game history in standard notation is itself sufficient as a game state. To know where a piece is, simply iterate back from the end until you find the last mention of it, then iterate forward to make sure it wasn't captured.
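
              A toy sketch of that lookup, assuming a simplified <Piece><file><rank> token format and ignoring pawns, castling, disambiguation and blocking pieces (all names invented for illustration):

```python
def plausible_move(history, move):
    """Does `move` match an earlier mention of the same piece along a
    shared file or rank, with that square not overwritten since?"""
    piece, file, rank = move[0], move[1], move[2]
    # Scan backwards: the first hit is the piece's last known square.
    for i in range(len(history) - 1, -1, -1):
        prev = history[i]
        if prev[0] != piece:
            continue
        if prev[1] == file or prev[2] == rank:
            # The match only counts if no later move landed on prev's
            # square (which would mean the piece was captured there).
            if not any(m[1:] == prev[1:] for m in history[i + 1:]):
                return True
    return False

print(plausible_move(["Ra6", "Qd4"], "Rc6"))  # True: Ra6 shares rank 6
print(plausible_move(["Ra6", "Qd4"], "Rc3"))  # False: no rook on file c or rank 3
```

              Purely string matching on the history, no board array anywhere, which is the point being made above.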
        • ashtonbaker 13 days ago
          Yeah it seems to have some model for how chess works, it gives convincing explanations for wrong/illegal moves, and you can generally see there’s some connection between the move and the heuristics it’s babbling about. But it hasn’t built a working internal model of the board that it can manipulate, it can’t search the tree of possible moves.
        • adam_arthur 13 days ago
          Well, humans are just trained on language tokens too (and of course, supplementary images etc).

            All the people stating that "real understanding" is significantly different from learning through inference from language are likely going to be proven wrong in the end. There's nothing special about humans that makes our thinking any more sophisticated. With enough examples and the right learning model, systems should be inferable implicitly from language, just as humans infer systems from language.

          If we can do it, why can't machines?

          • IIAOPSW 12 days ago
            The fact that humans pick up language so soon after birth is the motivating question behind the biggest theory in all of linguistics, namely Chomsky's Universal Grammar. The simple fact is, teacher never stands in front of the class and says "here is how you don't talk. Purple if alligator burp arming has why't." Yet, despite the paucity of negative examples, everyone figures it out. You can't explain that in the current paradigm. There's a lot you can do unreasonably well despite virtually no prior experience. You probably did not need to crash a car for 10k generations before finally making it down the street, nor simulate it in your head. We are missing something fundamental and algorithmic and can only patch over our lack of understanding with volumes of training data for so long.

              The idea of "reinforcement learning" is just a rehash of "Hebbian learning". It works for some things, but you can't explain language acquisition as pure reward function and stats.

            • radarsat1 12 days ago
              > Yet, despite the paucity of negative examples, everyone figures it out.

              After spending more than a year babbling nonsense and discovering a tiny bit more every time about the meaning of certain combinations of phonemes based on the positive or negative response you get.

              > You probably did not need to crash a car for 10k generations before finally making it down the street, nor simulate it in your head.

              Are you sure we don't simulate in our head what would happen if we drove the car into the lamp post / brick wall / other car / person, etc.? I find it highly unlikely that this kind of learning does not involve a large amount of simulation.

              > There's a lot you can do unreasonably well despite virtually no prior experience.

              That's true, but there's a lot we can't do well without repetitive practice, and most things that we can do well in a one-shot fashion depend on having prior practice or familiarity with similar things.

              • IIAOPSW 12 days ago
                You're digging your heels in on a rehash of a model from the 40s, glibly dismissing the problems it doesn't account for, brought up by linguists in the 50s and 60s, as if they were unaware that babies go through a period of babbling. The amount of time spent acquiring language is already priced in, and it's not enough to account for it as pure reward and training.

                >Are you sure we don't simulate in our head what would happen if we drove the car into the lamp post / brick wall / other car / person, etc.?

                You left out the 10k times part. You're ignoring the huge training data sizes these models need even for basic inferences. No, I don't think it takes all that much full scale simulation to distill car speed as a function of pedal parameters, and estimate the control problem needed.

                In many instances, humans can seemingly extrapolate from far less data. The algorithms to do this are missing. Training with loads of more data isn't a viable long term substitution.

                • YeGoblynQueenne 12 days ago
                  >> Training with loads of more data isn't a viable long term substitution.

                  Depends. In principle, you can't learn an infinite language from positive examples alone; you need both positive and negative ones for super-regular languages. Gold's result and so on. OK so far.

                  The problem is that in order to get infinite strings from a human language, you need to use its infinite ability for embedding parenthetical sentences: John, the friend of Mary, who married June, who is the daughter of Susan, who went to school with Babe, who ...

                  But, while this is possible in principle, in practice there's a limit to how long such a sentence can be; or any sentence, really. In practice, most of the utterances generated by humans are going to be not only finite, but relatively short, as in short "relative" to the physical limit of utterance length a human could plausibly produce (which must be something around the length of the Iliad, considering that said human should be able to keep the entire utterance in memory, or lose the thread; and that the Iliad probably went on for as long as one could stand to recite from memory. Or perhaps to listen to someone recite from memory...).

                  Obviously, there are only a finite number of sentences of finite length, given a fixed vocabulary, so _in practice_ language, as spoken by humans, is not actually-really infinite. Or, let's say that humans really do have a generator of infinite language in our heads, but an outside observer would never see the entire language being produced, because finite universe.

                  Which means that Chomsky's argument about the poverty of the stimulus might apply to human learning, because it's very clear we learn some kind of complete model of language as we grow up; but, it doesn't need to apply to statistical modelling, i.e. the approximation of language by taking statistics over large text corpora. Given that those large corpora will only have finite utterances, and relatively short ones at that (as I'm supposing above) then it should be possible to at least learn the structure of everyday spoken language, just from text statistics.

                  So training with lots of data can be a viable long term solution, as long as what's required is to only model the practical parts of language, rather than the entire language. I think we've had plenty of evidence that this should be possible since the 1980's or so.

                  Now, if someone wanted to get a language model to write like Dostoyevsky...

                  • IIAOPSW 12 days ago
                    Your argument is that maybe, with statistics, we can brute-force sentences long enough that no one notices we run out past a certain point?

                    Everything you said applies to computers too. Real machines have physical memory constraints.

                    Sure the set of real sentences may be technically finite, but the growth per word is exponential and you don't have the compute resources to keep up.

                    Information is not about what is said but about what could be said. It doesn't matter so much that not every valid permutation of words is uttered, but rather that for any set of circumstances there exist words to describe it. Each new word in the string carries information in the sense that it reduces the set of possibilities relative to before my message was relayed. A machine which picks the maximum-likelihood message in all circumstances is by definition not conveying information. It's spewing entropy.

                    • YeGoblynQueenne 12 days ago
                      Now, now. Who said anything about information? I was just talking about modelling text. Like, the distribution of token collocations in a corpus of natural language. We know that's perfectly doable, it's been done for years. And to avoid exponential blowups, just use the Markov property or in any case, do some fudgy approximation of this and that and you're good to go.

                      >> Your argument is that maybe we can brute force with statistics sentences long enough for no one to notice we run out past a certain point?

                      No, I wasn't saying that, I was saying that we only need to model sentences that are short enough that nobody will notice that the plot is lost with longer ones.

                      To clarify, because it's late and I'm tired and probably not making a lot of sense and bothering you: I'm saying that statistics can capture some surface regularities of natural language, but not all of natural language, mainly because there's no way to display the entirety of natural language for its statistics to be captured.

                      Oh god, that's an even worse mess. I mean: statistics can only get you so far. But that might be good enough depending on what you're trying to do. I think that's what we're seeing with those GPT things.

                      • IIAOPSW 12 days ago
                        >I was saying that we only need to model sentences that are short enough that nobody will notice that the plot is lost with longer ones.

                        That's one of the things on my short list of unsolved problems. People remember oddly specific and arbitrarily old details. Clearly not a lossless memory, but also not an agnostic token window that starts dropping stuff after n tokens.

                        I think we agree then that a plain superficial model gets you surprisingly far, but does lose the plot. It is certainly enough for things that are definable purely as and within text (the examples I gave). Beyond that who knows.

                        • YeGoblynQueenne 12 days ago
                          >> I think we agree then that a plain superficial model gets you surprisingly far, but does lose the plot. It is certainly enough for things that are definable purely as and within text (the examples I gave). Beyond that who knows.

                          Yes, I agree with you. I just tend to go on :P

          • Timwi 13 days ago
            It’s not a question of whether machines can do it at all. The question is whether our current approach of training LLMs can do it. We don’t know how the human brain works, so we have no idea if there’s something in the brain that is fundamentally different from training an LLM.

            Obviously machines can theoretically do what a brain can do because a machine can theoretically simulate a brain. But then it’s not an LLM anymore.

            • adam_arthur 13 days ago
              It's a neural network at the end of the day... it can compute any result or "understand" any system if properly weighted and structured.

              It may be that LLM style training techniques are not sufficient to "understand" systems, or it may be that at a certain scale of input data, and some fine tuning, it is sufficient to be indistinguishable from other training methods.

              Many people's sense of what qualifies as "intelligence" is too grandiose/misplaced. The main thing differentiating us from a neural network is that we have wants and desires, and the ability to prompt and conduct our own training as a result of those.

              • Retric 13 days ago
                An LLM isn't going to learn how to drive a car because of how they are trained, even if a neural network could.

                It isn’t that people’s views on intelligence are grandiose, it’s that the specific approach used has massive inherent limitations. ChatGPT 4 is still relatively bad at chess, 1 win, 1 draw, 1 loss vs a 1400 isn’t impressive objectively and looks much worse when considering the amount of processing power they are using. The only impressive thing about this is how general their approach is, but in a wider context it’s still quite limited.

                IMO the next jump, being able to toss 100x as much processing power at the problem, will see LLMs tossed aside for even more general approaches, like say using YouTube videos.

              • tsimionescu 12 days ago
                > It's a neural network at the end of the day... it can compute any result or "understand" any system if properly weighted and structured.

                That's not even remotely close to being demonstrated.

                For one thing, neural networks can only approximate continuous functions.

                For another, the fact that in principle there exists a neural network that can approximate any continuous function to arbitrary precision doesn't in any way tell us that there is a way to "train" that network by any known algorithm. There isn't even reason to believe that such an algorithm exists for the general case, at least not one with a finite number of examples.

                • adam_arthur 12 days ago
                  Approximating continuous functions is likely much the same as what people do too. You think there isn't some mathematical model under the hood of how the brain works? That it doesn't break down into functions with interpretable results? Is it spiritual or mystical in your mind?

                  These takes are so bad and pervasive on here, honestly. This is what I mean by grandiose thinking.

                  A machine that approximates functions, that otherwise is indistinguishable from human, is effectively intelligent like a human. Incentives, wants, desires, and the ability to conduct our own training is the only difference at that point.

                  • YeGoblynQueenne 12 days ago
                    >> Is it spiritual or mystical in your mind?

                    No, they're just saying there are continuous functions, and then there are discrete functions, and neural nets can't approximate discrete functions, while humans certainly can (e.g. integer addition). And that even when it comes to approximating any continuous function, neural nets can do that in principle, but we don't know how to do it in practice, just like we know time travel, stable wormholes and the Alcubierre drive are feasible in principle, but we can't realise them in practice.

                    So please don't say it's "spiritual and mystical" in the other person's mind just because it's not very clear in yours.

                    Also, what the OP didn't say is that a Transformer architecture is not the kind of architecture used to show the universality of neural nets. That was shown for a multi-layer perceptron (MLP) with one hidden layer, not a deep neural net like a Transformer, and certainly not a network with attention heads. If you wanted to be all theoretical about it and claim that because there's that old proof, someone will eventually find out how to do it in practice, then the Transformer architecture has already taken a wrong turn and is moving away from the target.

                    There are no universality results for Transformers. I mean, that would be the day! The reason that proof was derived for an MLP with one hidden layer is that this makes the proof much, much easier than if you wanted to show the same for another architecture.

                    • adam_arthur 12 days ago
                      I can ask an LLM what 2+2 is and it can answer with 4. That's a discrete result. So how is this different from human thinking? Where is your evidence that this is not a similar mechanism?

                      It gets some math wrong because it doesn't understand the "systemic" aspect of math, but who's to say that with minor training tweaks, or a larger dataset, it wouldn't be able to infer the system? Humans infer systems from language all the time. To say you need some specialized form of training beyond language inference is obviously wrong when you view how humans train, learn and understand. All of life is ingestion of information via language which produces systemic understanding.

                      I can play digital audio that's indistinguishable from acoustic, despite it not being a smooth function in practice. Similarly, a sufficiently advanced neural net can produce intellect-like results, even if there are aspects of the structure you say may not make it so.

                      Honestly, the perception you and many others seem to hold is that because something is mathematically explainable in such a way that you can "trivialize" its operation, makes it not intelligence. But you hold "intelligence" in too high a regard

                      • YeGoblynQueenne 12 days ago
                        >> I can ask an LLM what 2+2 is and it can answer with 4. That's a discrete result. So how is this different from human thinking? Where is your evidence that this is not a similar mechanism?

                        A language model can match "2+2" with "4" because it's approximating the distribution of token collocations in a large text corpus, not because it's approximating integer addition.

                        We know this because we know that language models are trained on token collocations (word embeddings) and not arithmetic functions. We know how language models are trained because we know how they're made, because they're made by humans and they're made following principles that are widely shared in academic textbooks and scholarly articles all over the place.
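                        To caricature the distinction (a deliberately crude sketch; real models interpolate over embeddings rather than doing exact lookup, so they generalize far better than this): a system fit only to training text can answer memorized sums while containing no addition anywhere.

```python
# Toy "statistical" model: memorize question -> answer pairs from training text.
# The table is fit to the text, not to the addition function itself.
training_text = ["2+2=4", "3+3=6", "10+5=15"]
table = dict(line.split("=") for line in training_text)

def predict(prompt):
    """Return the memorized continuation, or None if never seen."""
    return table.get(prompt)

print(predict("2+2"))    # 4 -- looks like it can add
print(predict("17+25"))  # None -- no addition is happening anywhere
```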

                        >> Humans infer systems from language all the time.

                        Humans are not neural nets, and neural nets are not humans. Does that suffice? I don't know if I can do any better than that. Humans do human things, neural nets do neural net things, and humans can do things that neural nets can't even get close to. Like, dunno, inventing arithmetic? Or axiomatizing it? Or proving that its axiomatization is incomplete. That sort of stuff. Things for which there are no training examples, not of their instances, but of their entire concept class.

                        >> But you hold "intelligence" in too high a regard

                        Where does that stuff come from, I wonder? Of course I hold intelligence in high regard. What do you hold in high regard, stupidity?

                  • tsimionescu 12 days ago
                    > Approximating continuous functions is likely quite the same as what people do too.

                    In a very broad sense, if you just mean "the human brain also just approximates some class of functions", sure. However, human brains can surely represent many classes of non-continuous functions as well (tan, lots of piece wise functions, etc). And, crucially, some of these are necessary for our physical models of the world. So, if neural networks are limited to only representing continuous functions, that is a strong indication that they are fundamentally unable to mimic the human mind.

                    > You think there isn’t some mathematical model under the hood of how the brain works too?

                    Of course it does. I do believe that the mind is simply a program running on the physical computer that is our brain. And I am sure that some day we will be able to create an AI that is human-like, and probably much better at it, running on silicon.

                    That doesn't mean that we should believe every program running on silicon, despite somewhat obvious fundamental limitations, is going to be the next AGI any day now. That's all I'm trying to point out: neural networks are not a great model for AGI, and backpropagation/gradient descent as a training algorithm even less so.

          • anonzzzies 13 days ago
            The current models are still too simple and missing things we have not figured out; when a human (or other animal) learns, it needs only a tiny corpus (compared to the corpus of text etc. GPT gets trained on) to become a smart human. So the model needs something we have built in that makes learning vastly more efficient. Then there will be another big jump. That'll come; that or a new AI winter.
          • Lewton 13 days ago
            > Well, humans are just trained on language tokens too

            This is not true at all

            You can teach someone chess without language

            • gambiting 13 days ago
              And how would you do that? Show certain moves and point your thumb up for ok, down for not ok? Then sorry, you're still using a language, just without using words.
              • wizofaus 12 days ago
                It doesn't seem so far-fetched to believe somebody could learn chess just by watching others play it, no language needed at all (except perhaps reading the body language of being glad to win). But I imagine LLMs will soon have the ability to turn image sequences into information that can be interpreted and ingested much the same way as they can text, and thus "learn" how to play chess just from analyzing videos of actual games being played.
                • johnaspden 12 days ago
                  In legend, Paul Morphy learned in this way.
      • orbital-decay 12 days ago
        >Of COURSE it doesn't know the rules of chess and doesn't know how to move the pieces.

        That depends on what you mean by knowing. Surely it extracted certain higher level correlations from the recorded games and chess books, and is able to make certain predictions based on them. I would call it knowledge, it isn't that good though.

        The main problem is that the model is purely functional, and is unable to store state (except for the context, which isn't what I mean). Both humans and chess engines keep track of the pieces and run stateful algorithms constrained by the rules. This model doesn't do that, which severely limits its capabilities in chess.

        • 2-718-281-828 12 days ago
          A disk has also knowledge stored on it but it doesn't know anything.
          • orbital-decay 12 days ago
            The disk is unable to extract the correlations, nor is it able to apply the knowledge; it transparently stores the data verbatim. The model doesn't store the training set, it extracts the complex correlations from it, and is able to make actual predictions based on the knowledge it extracted.

            But yeah, the "knowledge" and "understanding" are hard to define formally, so this discussion can be endless. Common well-defined terms are required.

            • 2-718-281-828 12 days ago
              The model does not extract knowledge. An external algorithm trains the model's parameters, and then the model is fed a string that is also evaluated externally based on the model's configuration.
              • orbital-decay 12 days ago
                Semantics. You could say the same about the disk - the data doesn't get magically teleported from the magnetic plates to the RAM, it needs a lot of underlying hardware to read and transfer it.

                A model is not just a set of weights; it's inseparable from the underlying architecture and the way to train the weights and apply them in practice.

                • 2-718-281-828 12 days ago
                  > You could say the same about the disk

                  that's exactly my point

                  • orbital-decay 12 days ago
                    Yeah, and that's arguing about semantics. We could do that in a loop endlessly, ignoring the fact they're fundamentally different.
      • notimetorelax 13 days ago
        Honest question - how is it different from human cognition? Don't forget of all the spectrum of cognition, e.g. https://www.iflscience.com/people-with-no-internal-monologue...
        • BugsJustFindMe 12 days ago
          The difference is that a human playing chess, once told the rules, doesn't suddenly start making illegal moves after passing the memorized opening phase. Accidentally making illegal moves is the categorical definition of not knowing how to play.
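          For anyone wanting to check such games mechanically rather than by eye, the python-chess package (assumed installed; it's not in the standard library) can replay a move list and flag the first illegal move:

```python
import chess  # third-party: pip install chess

def first_illegal_move(san_moves, fen=None):
    """Replay SAN moves from the given position; return the index of the
    first illegal/unparseable move, or None if the whole line is legal."""
    board = chess.Board(fen) if fen else chess.Board()
    for i, san in enumerate(san_moves):
        try:
            board.push_san(san)
        except ValueError:  # illegal or ambiguous in this position
            return i
    return None

# The Scholar's-mate line is fully legal...
print(first_illegal_move(["e4", "e5", "Qh5", "Nc6", "Bc4", "Nf6", "Qxf7#"]))  # None
# ...but a knight hop to e5 on move one is not.
print(first_illegal_move(["Ne5"]))  # 0
```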
    • wcoenen 13 days ago
      Since your comment seems to be strongly contradicting the blog post, it might be worth checking whether you are really testing the same thing.

      The blog post is about playing chess against GPT-4. GPT-4 (or at least, a version without image input capability) is available at https://chat.openai.com/, but only to "Plus subscribers" who pay for it.

      So did you test with GPT-4, or did you test the "default GPT-3.5" model which is available for free?

      • ashtonbaker 12 days ago
        Yes, I paid to use GPT-4.
        • wcoenen 12 days ago
          And did you actually select GPT-4? It is shown at the top of the conversation page.

          I'm asking because I noticed that the default setting for each new conversation is currently still GPT-3.5, even if you are a subscriber and even if you have selected GPT-4 for your previous conversation(s).

          • ashtonbaker 12 days ago
            Come on. Yes. Here's a screenshot if it helps: https://i.imgur.com/BepKMt1.png - It gives FEN notation itself, so I prompted it with FEN, but I have also tried describing the position of each piece explicitly.
  • tel 12 days ago
    Worth noting, GPT-4 is hopelessly lost at 19x19 Go. It can barely make it through a few opening moves without losing the board state entirely.

    I asked it to provide commentary on both my and its moves (though I had to be a little sneaky and not actually ask it to play a game otherwise it refuses). It could regurgitate some garden-variety commentary on the moves being made, but since we didn't get more than about 5 moves in before it totally fell apart this is not surprising.

    On a 9x9 board it fared slightly better. By which I mean it didn't completely lose the board state almost immediately, but it also played very poorly and its commentary was largely nonsense. It also lost the board state within 7 moves.

    It's still impressive that it can reason about the board at all and from time-to-time properly represent the nature of a move, at least in a simplistic fashion, but could be trivially defeated by any human after they grasp the rules.

  • tomohelix 13 days ago
    The LLMs are getting better, that is for sure. But I think there is an inherent weakness in the way they are made that will prevent them from ever reaching real AGI: they can't improve by themselves.

    Correct me if I am wrong, but these models and their weights are static after training. So while they can appear different depending on what you prompt them with, their core system never changes. In other words, they can't adapt. They can try their best, but that is it, they can never improve, evolve, learn from their mistake, internalize new knowledge, etc.

    One of the best traits of life is adaptability. That is how we humans spread everywhere on this planet, and how, even in the most extreme environments, you can still find some bacteria living. And these LLMs are inherently incapable of that.

    • janalsncm 13 days ago
      This is a textbook example of the “AI effect”, a special instance of moving the goalposts so common in the artificial intelligence domain that it has its own Wikipedia page.

      A few years ago the best chatbots basically just worked like a phone tree, trying to filter you into pre-written responses. Then GPT3 came along and was able to synthesize entire paragraphs of plausible text. ChatGPT was released only ~4 months ago and I was complaining that due to fundamental shortcomings it wouldn’t be able to play chess.

      Now it seems that GPT4 can. And that was a change over only 4 months. If the last hill to fight over is whether GPT4 can self-improve, well, it's only a matter of time until GPT4 is placed into some kind of self-improving CI/CD pipeline.

      • tomohelix 13 days ago
        Trust me I am one of the people absolutely baffled and pleasantly surprised by all the new things these LLMs can do. There are plenty of potential in this tech and I am not so scared to not acknowledge that.

        But what I am asking about is a core feature of these models. The transformer is what makes LLMs possible, but it is by design static after it is trained. You can't simply install another pipeline or wrap it in a program and fix this problem. From what I understand, the RNN is the self-improving type of AI; it was supplanted by the transformer type, and one of the key differences between them is the ability to learn from its own output. The transformer model sacrificed that for the ability to process a large amount of data in parallel. And that is by design. You can't change it as far as I know.

        I actually don't know that much so that is why I am asking if what I said was wrong. Also, maybe someone will make a new model merging both transformer and RNN. But what I am saying is right now, the type of LLMs we have probably can't learn no matter how sophisticated its training is or how many parameters it is built with.

        • Buttons840 13 days ago
          Neither RNNs nor transformers will change their parameters during a forward pass. In production GPT just does a forward pass through the network and never calculates the parameter gradients; this would be true if GPT were built on RNNs as well, and without those gradients the AI will not improve.
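          A minimal sketch of the point (one made-up "neuron", nothing to do with GPT's actual architecture): inference is a pure function of frozen parameters, and without a gradient computation and update step, nothing about the model ever changes.

```python
import math

# "Training" already happened; these parameters are now frozen.
weights = {"w": 0.5, "b": -1.0}

def forward(x, params):
    # One neuron with a sigmoid activation; no gradient bookkeeping at all.
    z = params["w"] * x + params["b"]
    return 1.0 / (1.0 + math.exp(-z))

before = dict(weights)
for x in [1.0, 2.0, 3.0]:
    forward(x, weights)   # each "chat turn" is just a forward pass
assert weights == before  # inference never touches the parameters
print(forward(2.0, weights))  # 0.5
```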
          • seanhunter 13 days ago
            That seems an implementation detail that would be entirely solvable with some additional work rather than a fundamental limitation though.
            • Der_Einzige 12 days ago
              Unfortunately we have this thing called the curse of dimensionality and specifically we do not understand how to prevent catastrophic forgetting. This is why incrementally learning NNs are still so rare in practice.
        • yoz-y 13 days ago
          You could periodically retrain the model using the original dataset plus the conversations from the period. Sort of a commit of the short term memory into long one.
          • glenstein 13 days ago
            I think you could say this effectively is what is happening within the narrow confines of any single conversation.

            Although it throws away that conversation once it's done.

          • _frkl 13 days ago
            Ha. So, it needs to 'sleep', kind of...
      • skissane 13 days ago
        The comment to which you are responding is pointing out a real limitation of current machine learning systems though - most of them have a hard separation between training-time and runtime - which obviously doesn’t exist for human and animal intelligences.

        Now, I'm sure sooner or later someone will come up with a system that lacks that hard separation, and it will unlock new capabilities that current systems lack, further closing the capability gap between AIs and natural intelligences, which still hasn't gone away, although it clearly has shrunk significantly in the last few months.

        • Der_Einzige 12 days ago
          There is no hard separation. Turning on gradient updates simply destroys the weights that were already there. We don't know how to make the methods we know about work well.
          • skissane 12 days ago
            There might not be a hard separation in theory, but there certainly is one in practice
      • luckingfof 13 days ago
    • dinobones 13 days ago
      They can learn from mistakes, within the context window. I often will correct GPT4 on a mistake it's made and then ask it to write me a new prompt from scratch that covers the edge case.

      GPT4 has a 32k token context window. But I see this number like the 32MB of RAM of computers of old.

      With Flash Attention, the memory cost of the attention calculation is now basically O(n). So we could maybe someday see context windows in the billions of tokens, where "learning" is just looking back at a mistake in its past and avoiding it.
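      A back-of-envelope illustration of why the scaling matters (fp16 scores assumed; the numbers are illustrative only): naive attention materializes an n-by-n score matrix, which is exactly what FlashAttention avoids holding in memory at once by streaming it in tiles.

```python
# Size of the full n x n attention score matrix if materialized,
# assuming 2 bytes per score (fp16).
def attn_matrix_bytes(n, bytes_per_score=2):
    return n * n * bytes_per_score

for n in (32_000, 1_000_000):
    print(f"n={n}: {attn_matrix_bytes(n) / 1e9:,.1f} GB")
    # n=32000 -> 2.0 GB; n=1000000 -> 2,000.0 GB
```

      The quadratic blow-up is why long contexts need the tiled approach rather than the naive matrix.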

      • tomohelix 13 days ago
        Good point. I guess it is a matter of scale. Maybe when we get to something like billions of token windows or even more, it becomes "memories" for the AIs. And they can probably make better "judgements" using these memories to train next generation of AIs to internalize these improved logics without the need to have the context fed to them.

        Kinda similar to how we human do it too. It is certainly an exciting time that we are seeing so many parallels between silicon and meatbags. I didn't think I would live to see this.

      • orbital-decay 12 days ago
        The main difference between ML in its current state and biological systems is the huge compute/energy asymmetry between training and inference. Biological systems have it too, but it's much less prominent, as they have neuroplasticity which allows them to learn on the fly.

        I guess the context window can be considered a (poor) substitute for the missing neuroplasticity, but it's really limited in what it can do.

    • BoorishBears 13 days ago
      When people say this, I feel like it's useful to frame it in terms of what they think "real AGI" enables, since we don't even know how "real GI" works

      Before using ChatGPT I couldn't form a clear picture of that, but now it's clear that a lot of the more realistic things we imagined real AGI would enable, are going to be possible with LLMs.

      I mean, GPT4 can't improve itself... but it can improve your ability to utilize it. You can take a product idea that uses GPT4, feed it to GPT4 and GPT4 will help you write a frontend for it, come up with marketing campaigns, model personas and generate user stories.

      It can train other LLMs too. After they lowered the message cap, I had the bright idea to use it to generate several pages of fine tuning data for davinci to get better JSON responses. So far it's looking like a success.
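      For reference, the legacy davinci fine-tuning flow took a JSONL file of prompt/completion pairs; a hypothetical two-example file (the field contents here are made up) could be produced like this:

```python
import json

# Hypothetical examples in the prompt/completion shape the legacy
# davinci fine-tuning endpoint expected (one JSON object per line).
examples = [
    {"prompt": "Summarize: the cat sat ->", "completion": ' {"summary": "a cat sat"}'},
    {"prompt": "Summarize: the dog ran ->", "completion": ' {"summary": "a dog ran"}'},
]

with open("finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

print(sum(1 for _ in open("finetune.jsonl")))  # 2
```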


      It's not like 3.5 couldn't do all of this, but 4 already seems like an uncanny leap over 3.5 in that regard. And afaik we haven't established the ceiling on how much better LLMs will get as we simply spend more time training them, and compute simply gets faster.

    • grensley 13 days ago
      What's preventing an LLM from coding a better LLM?
      • tomohelix 13 days ago
        Data contamination, I think. Right now training heavily depends on humans to guide it to what is right and what is wrong. If an AI tries to train its successor, it will basically just make a copy of itself, and nothing better can be made or taught.

        We are still better than an LLM, collectively. The best programmer is still miles ahead of GPT, the best writer can write something that makes GPT look like a child learning the alphabet, etc. So we can still train it using the data from all of humanity.

        But if we weren't here, then the best an AI could train is a copy of itself. There are already discussions about where to get more authentic data to train the next generation of LLMs when so much of the internet might have been saturated with AI-generated content.

        • Buttons840 13 days ago
          I'm imagining a Luddite using a LLM to generate terabytes of slightly wrong text to poison the water.
          • zamnos 13 days ago
            One could setup a blog using GPT-2 to generate the text for that purpose.
          • Yoric 13 days ago
            You mean like $(FAMOUS BILLIONAIRE) who claimed he would develop an AI aligned with the values of $(POLITICAL PARTY)?
      • geysersam 13 days ago
        Transformer-based LLMs were not developed by someone sitting down in deep thought.

        They're a product of trial and error.

        LLMs are not particularly good at reasoning. They can't make experiments. They can't look up the information in their weights (because no text on the internet has that information).

        For these reasons they are not likely to be good at designing new LLMs.

        • grensley 13 days ago
          Do you think it's possible an LLM might be able to evaluate if another LLM is better than itself?
          • Centigonal 13 days ago
            it can definitely evaluate whether an LLM is better aligned than itself (for some narrow definition of "aligned") - this is the motivating principle behind Anthropic's Constitutional AI idea.

            "Better" is pretty vague, but "more capable" would be difficult, because I don't think ChatGPT has a good idea of its own capabilities.

          • geysersam 13 days ago
            It's possible that a chatbot can be one part of the evaluation process for new chatbots. I'd bet they already are.

            But they can't be the only part. New models must be evaluated on many different kinds of tasks. And if current models fail at a task, how can they evaluate new models on that task accurately?

    • beepbooptheory 12 days ago
      Isn't it even a little more fraught than that? I.e., the sophistication and worth of these models is directly relative to the crude and subjective human labor of giving feedback on them. That is, the RLHF part?
    • teaearlgraycold 13 days ago
      Hook an LLM up to an embeddings database and you’ve got something special. Tell it it can query the database and then give it back the highest ranked entries.
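      A minimal sketch of that idea (hand-made 3-d vectors standing in for real embeddings; a production setup would use an embedding model and a proper vector store): rank the stored snippets by cosine similarity to the query vector and stuff the winners back into the prompt.

```python
import math

# Toy embeddings store mapping snippets to vectors.
store = {
    "reset your password via settings": [0.9, 0.1, 0.0],
    "chess openings for beginners":     [0.0, 0.8, 0.2],
    "billing and refunds policy":       [0.1, 0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, k=1):
    """Rank stored snippets by similarity to the query embedding."""
    ranked = sorted(store, key=lambda s: cosine(store[s], query_vec), reverse=True)
    return ranked[:k]

# A query vector roughly "about chess" retrieves the chess snippet.
print(top_k([0.05, 0.9, 0.1]))  # ['chess openings for beginners']
```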
    • ec109685 13 days ago
      They do learn from their mistakes while training though. That affects their weights, so yes it’s frozen now, but it wasn’t when the model was being built.
      • tomohelix 13 days ago
        They learned until convergence, which basically is the point where more training doesn't seem to improve output anymore. At which point they call it completed. And that is the best it can ever be. It will never get better or be made to be better unless the entire system is scrapped and retrained.

        At any point though, an LLM that you can interact with is static. And that is my point. They can't improve once they are "born". And I do not see any feasible way for them to fix this as it is inherently a feature of the LLMs.

        • oezi 13 days ago
          I think this is way too simplistic.

          First, LLMs seem to be still on a trajectory of more parameters is better. So the convergence is a man-made decision to stop for the sake of memory usage and money spent on training.

          I don't think there is anything inherent in LLMs which prohibits adjusting weights over time, or branching out to another model which is more regularly updated and contains novel stuff. I could well imagine that they integrate a "news model" which is retrained every 24h to incorporate text scraped from news sites. This could be called from the main LLM.

          If we look at the human brain we realize that there are dedicated systems for specialized processing (think spatial reasoning vs judging colors). I would imagine that OpenAI would eventually have a similarly more complex system structure.

          • tomohelix 12 days ago
            An LLM with more parameters is basically a new model that you have to train from scratch. You can't pick up, say, GPT-3 and turn it into a Chinchilla-grade LLM with just the difference in training between the two. The former has "converged", and more effort and money put into it would bring negligible benefits. The progress is basically reaching an asymptote.

            And of course there is nothing saying future progress won't produce a new model that can learn on the go. But right now, the transformer structure that made LLMs such a huge thing is inherently incapable of learning from its own output. Or at least, it isn't really feasible given how demanding the cost would be.

        • skissane 13 days ago
          ChatGPT has a “knowledge-cutoff”. What if you continually fed new information (like a subscription to a news wire service) into the training process so the knowledge cutoff continually moved forward?

          LLMs do support additional training (fine-tuning). Why couldn’t that fine-tuning be a continual process, absorbing new information every day?

          What if OpenAI took all their records of public interaction with ChatGPT, and used that to fine-tune ChatGPT? In fact, given upvote/downvote and “regenerate response” allow users to give feedback on responses, maybe that is part of their plan?

          • Yoric 13 days ago
            > ChatGPT has a “knowledge-cutoff”. What if you continually fed new information (like a subscription to a news wire service) into the training process so the knowledge cutoff continually moved forward?

            > LLMs do support additional training (fine-tuning). Why couldn’t that fine-tuning be a continual process, absorbing new information every day?

            Definitely sounds possible. However, I suspect that it would be extremely expensive in terms of energy: GPT-3's training has been estimated to be equivalent to ~100k days of energy use of an average Western household, iirc, and GPT-4's training was undoubtedly worse.

            I don't know how much energy it costs to fine-tune a LLM. But I suspect that what you're proposing would involve a cost similar to training GPT-4, incurred every few iterations of fine-tuning.

            > What if OpenAI took all their records of public interaction with ChatGPT, and used that to fine-tune ChatGPT? In fact, given upvote/downvote and “regenerate response” allow users to give feedback on responses, maybe that is part of their plan?

            Given the adversarial nature of the Internet, it sounds like this would be used very quickly to poison ChatGPT. Plus, let's face it, there's a good chance that we, as users of ChatGPT, are actually worse than the hand-curated sources that were picked to teach it.

      • oezi 13 days ago
        Don't forget the ability to finetune the LLM.
  • callmekit 13 days ago
    So yesterday we had a post (https://news.ycombinator.com/item?id=35200267) that you need a special "play as a grandmaster" prompt to reduce the number of illegal moves for GPT3.5, and that "GPT4 sucks" at chess completely compared to GPT3.5.

    Now we have this post that GPT4 plays good and doesn't make illegal moves at all. What changed? What was the prompt? Is it just random noise?

    • ctchocula 13 days ago
      If you've checked the author's previous post [1], you'll see that he admits to being 1200 Elo on chess.com, which is beginner / early-intermediate level. So him losing to GPT-4 may not mean much. Maybe that explains the supposed contradiction here.

      [1] https://villekuosmanen.medium.com/i-played-chess-against-cha...

      • owenmarshall 12 days ago
        The problem with ChatGPT seems to be that it often gives answers that appear plausible at the surface, but with enough knowledge you realize they are inaccurate or even wrong. I wouldn’t put much stock in the analysis of a beginner - I’d trust them to say the moves were legal, and that ChatGPT stopped trying to materialize pieces from thin air, but not any analysis beyond that.
    • ChatGTP 13 days ago
      Disclaimer: YMMV
  • reustle 12 days ago
    I just tried to play Tic Tac Toe with GPT-4 and was still disappointed (I'm X)

    > Congratulations! You've won the game with a diagonal line from the top-right to the bottom-left. Well played!

        O | 2 | X
        X | O | 6
        7 | 8 | X
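    For what it's worth, the claimed win can be checked mechanically. A small stdlib sketch (indexing the nine cells 0-8 in reading order, as in the transcript above) shows the final position has no completed line for X at all:

```python
# Check a tic-tac-toe board for a winning line.
# Cells are indexed 0-8, left to right, top to bottom;
# unplayed cells hold their digit, as in the transcript above.

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def winner(board):
    """Return 'X' or 'O' if some line is complete, else None."""
    for a, b, c in LINES:
        if board[a] == board[b] == board[c] and board[a] in "XO":
            return board[a]
    return None

# The final position from the game above:
board = ["O", "2", "X",
         "X", "O", "6",
         "7", "8", "X"]
print(winner(board))  # -> None: no line is complete for either side
```

    The announced "diagonal from top-right to bottom-left" would be cells 2, 4, 6, and cell 4 is held by O.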
    • 2-718-281-828 12 days ago
      this one is also funny (GPT-4):

        > which one of the following letters is different from the others?
        > B, D, E, O
        O is different from the others, as it is the only letter that is a vowel. The rest of the letters (B, D, and E) are consonants.
      my line of thinking would be that E is different for obvious visual reasons. but that doesn't excuse Geppetto "thinking" that E is a consonant :D
      • AnIdiotOnTheNet 12 days ago
        There are multiple good answers: O is different because it is the only one without a vertical line. E is different because it is the only one without a curve. O is different because it is not a musical note. There are probably a lot more.
        • YeGoblynQueenne 12 days ago
          Don't know what you guys are talking about. B is the one that's different because it's the only one you can say with your lips closed.
  • kilburn 12 days ago
    Hikaru Nakamura (one of the best players in the world) tried it too [1].

    GPT got confused at some point and tried to play illegal moves, but then it seemed to regain the game's thread. He never finished the game because he hit the usage limit, but it was already won by Hikaru. The explanations ChatGPT gave in the opening were fine, but the ones for the mid-game were mostly nonsensical.

    [1] https://www.youtube.com/watch?v=lPetjvRWJa8

  • lelanthran 13 days ago
    Maybe I'm not using the right ChatGPT, but it started an impossible game when I asked it to set up a game:

    > I'll play as the white pieces and you can play as the black pieces. To make a move, just give me the starting and ending positions of the piece you want to move, for example "e7 e5" to move the pawn in front of the king two spaces forward. Let's get started! You can make the first move.

    So, it wants black to start first? How do I verify which ChatGPT I am using?

    • BoorishBears 13 days ago
      If you're paying for Plus you'll have access to a selector at the top.

      But GPT4 is currently limited to 25 messages per 3 hours due to demand (down from 100 messages per 4 hours at launch)

  • cjbprime 13 days ago
    From the article:

    > You could also say that GPT-4 was playing the game blindfolded as it had no access to refreshing its memory of the current state of the board.

    That doesn't sound right. Under the hood, I think everything inside the current conversation is just appended to the prompt each time you add a message, and the whole conversation re-evaluated? Or is there a limit when the conversation gets long which means that the LLM forgets how the conversation started..?

    • anonzzzies 13 days ago
        It has a limited token length; I think the ChatGPT one has 8k, but they now have a version with 32k as well. That length includes both input and output. It will ‘forget’ the start of the conversation if the conversation is too long compared to that length. They may use tricks like summarization, but that won't help with chess or other precise conversations.
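        The "forgetting" behavior can be sketched in a few lines. Assuming a simple sliding-window policy (and using a whitespace split as a stand-in for the real BPE tokenizer, which counts differently), the chat client just drops the oldest messages until the conversation fits the budget:

```python
# Sketch of sliding-window context management: keep the most recent
# messages whose combined length fits the model's token budget.
# A whitespace split stands in for the real BPE tokenizer.

def count_tokens(text):
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Return the longest suffix of `messages` within the budget."""
    kept, used = [], 0
    for msg in reversed(messages):       # newest first
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break                        # older messages are "forgotten"
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["1. e4 e5", "2. Nf3 Nc6", "3. Bb5 a6", "4. Ba4 Nf6"]
print(fit_to_window(history, 6))  # -> ['3. Bb5 a6', '4. Ba4 Nf6']
```

        Which is exactly why a long chess game silently loses its opening moves: the earliest moves fall off the front of the window.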
      • nullc 13 days ago
        > but that won’t help with chess or other precise conversations.

        Not necessarily true: the summarization might happen to give the board state! Certainly you can come up with an input that would make it summarize that way, though getting that outcome from a generic summarization prompt would be more surprising.

        • anonzzzies 12 days ago
          Yes, certainly if you prompt it to do so, or pass the board in explicitly. I just don't think it's set up like that currently, given its amnesia about the positions.
  • Toutouxc 13 days ago
    Meanwhile I've been confidently told by GPT-4 that iPad Mini 6 uses the A14 Bionic (then it continued for a while about how that makes it just about as powerful as the 4th generation iPad Air), which is false of course, and the model immediately admits it and apologizes if you correct it. Guess I'll have to learn chess instead.
    • zarzavat 13 days ago
      This behavior is indeed strange and seemingly low hanging fruit. The model often is able to notice that it has given an incorrect answer and correct itself. But why not just give the correct answer to start with? It’s like it needs a second pass, but is there really that much extra information in its first wrong answer?
      • loopz 13 days ago
        It has no way of discerning what is factual and what is not. It's fundamental to LLMs. It only corrects itself when prompted by the user, and even then might get it wrong still.
      • Toutouxc 13 days ago
        I get this exact scenario very often — model says something blatantly false, gets called out, apologizes and continues. 3.5 or 4, they both do that. And I only discuss stuff with it that I'm at least familiar with, so I can usually catch these mistakes, but it isn't exactly encouraging.
  • dwrensha 12 days ago
    I tried this the other day, and I was also impressed! I had assumed that it would make an illegal move eventually, but it never did. I won the game, but it wasn't trivial. (I've been out of practice in chess for a while; my rating would probably be something like 1700.)

    dwrensha vs GPT-4 (through the ChatGPT interface) 16 March 2023

    1. d4 Nf6 2. c4 g6 3. Nc3 Bg7 4. e4 d6 5. f3 O-O 6. Be3 e5 7. d5 c6 8. Qd2 cxd5 9. Nxd5 Nxd5 10. Qxd5 Be6 11. Qxb7 Nd7 12. Qa6 Qc8 13. Qxd6 Rd8 14. O-O-O Nc5 15. Qxd8+ Qxd8 16. Rxd8+ Rxd8 17. Bxc5 Rc8 18. b4 a5 19. a3 axb4 20. axb4 Bf8 21. Bd3 Bxc5 22. bxc5 Rxc5 23. Ne2 Bxc4 24. Bxc4 Rxc4+ 25. Kd2 f5 26. Rc1 Rxc1 27. Nxc1 fxe4 28. fxe4 Kf7 29. Ke3 Ke6 30. Nd3 h5 31. h4 Kd6 32. Ne1 Kc5 33. Nf3 Kd6 34. Kd3 Ke6 35. Kc4 Kf6 36. Kd5 g5 37. hxg5+ Kg6 38. Kxe5 h4 39. Ke6 Kh5 40. Kf7 h3 41. g3 h2 42. Nxh2 Kxg5 43. e5 Kf5 44. e6 Ke5 45. e7 Kd6 46. e8=Q Kc7 47. Qe6 Kb7 48. Qd6 Ka7 49. Qb4 Ka6 50. Ke7 Ka7 51. Kd7 Ka8 52. Kc7 Ka7 53. Qb7#

    Note that my move 14. O-O-O was a blunder, allowing 14. ... Bxc4, but my opponent instead replied with 14. ... Nc5, which handed me a decisive advantage.

    I very nearly fell for a stalemate with 41. gxh3.

    After 42. Nxh2, my opponent said: "At this point, it's clear that white has a winning advantage with a promoted queen soon to appear. With your material advantage and superior position, victory is nearly certain. Would you like to continue playing, or do you want to start a new game or discuss this game?" I said, "Let's continue!"

    After 52. Kc7, my opponent said "I have no moves left and it's checkmate. Congratulations! You've won the game." I replied: "You do have a move: you can do 52. ... Ka7". My opponent then said, "Apologies for the confusion. You are correct. I'll play 52...Ka7. Your move." Then I typoed the final move as "53. Kb7#" (instead of "53. Qb7#"), and my opponent did not correct me: "You played 53. Kb7#. This time, it is indeed checkmate. Congratulations! You've won the game. If you'd like to play another game, analyze this one, or ask any other questions, feel free to let me know!"

    • 2-718-281-828 12 days ago
      ChatGPT always apologizes if you correct it and then tries to work with your correction. That makes it seem humble, when in fact it's just a brainless, probably halfway-hardcoded response.
  • galkk 13 days ago
    Idk, I tried to play chess with it right now, and it tries to do illegal move.

    This is pretty crazy:

    me > Bxe3

    chat.openai.com > A good move! By playing 16...Bxe3, you capture my pawn on d5 with your bishop, and also put pressure on my queen.

    Like what?

    • Lewton 13 days ago
      Are you using gpt4 though?
  • waselighis 12 days ago
    Given many of the shortcomings that still exist in ChatGPT-4, especially some of the comments here showing it can't even play Tic Tac Toe, I suspect two things. OpenAI surely collected far more training data that included chess moves for GPT-4, but they may also have further tuned and optimized it for playing chess specifically. I also suspect there may be something about chess algebraic notation that works unusually well with LLMs, while there is no equivalent for a game like Tic Tac Toe.

    Clearly, GPT-4 knows opening sequences extremely well, as it has seen countless examples of each. GPT-3 memorized optimal strategies for the first 10-20 moves, and GPT-4 likely memorized many moves deeper than that. To truly test GPT-4's ability to play chess, we need to test it on highly unusual opening positions: opening moves that very likely were not in its training data. Then we can see how well GPT-4 truly understands the rules and strategies of chess, and how much of it is just memorization.

  • msm_ 12 days ago
    I found this game of chatgpt vs stockfish absolutely hilarious: https://www.reddit.com/r/AnarchyChess/comments/10ydnbb/i_pla...

    Spoiler: ChatGPT had no idea what it was doing

  • danparsonson 13 days ago
    Very cool! Some food for thought - chess can be played as a game of 'what comes next given the current state of the board' without any reference to an overarching plan; ChatGPT is a giant 'what comes next given the current state of the conversation (i.e the token window)' engine, so perhaps this result should not be so surprising?
    • endorphine 13 days ago
      My thoughts exactly. Also, chess can be played using a text notation, and thousands (if not millions) of games are available on the internet in text notation form.
  • fightglobalism 10 days ago
    I'm new to HN and new to ChatGPT-4 concepts but it's becoming impossible for me not to form an interest.

    How does ChatGPT-4 interface with something like playing chess? Do you need to write an implementation or prototype to extend its capabilities, or can the AI exchange strings based on a language definition? If the latter, is it limited to strings only since it's a "Chat" bot, or could you show it a picture of a chess board and have it interpret what's unfolding and come up with a strategy to advance the game?

  • dylan604 13 days ago
    The day that a ChatGPT-X suggests "How about a nice game of chess?" is the day we should unplug the system. Crap, I forget the system would interpret that as a loss of command and launch the missiles.

    So, where is the milestone that says the system has gotten too powerful?

    • labster 13 days ago
      The day it prevents hyperbolic rants like these from appearing on the internet is the true milestone.
      • vba616 13 days ago
        There are tons of "hyperbolic rants" on HN but the comment you replied to looks to me like an attempt at humor and rather short to be called a rant.
        • glenstein 13 days ago
          It alternates between being a serious point or a joke, depending on which justification is necessary.

          Whatever genre that is, it's that. I personally think that this is the genre of problematic comments that we will have to collectively labor through in a struggle to have high quality conversations about ChatGPT.

          • dylan604 12 days ago
            as my buddy Charlie Brown is apt to say, "oh good grief!"

            it's not like i wrote three pages of my life's history with a favorite food before showing you the recipe. i gave a bit of contextual reference for the point being made. it was short, succinct, and pertinent to the topic at hand.

            you sound like a ChatGPT-style sycophant, convinced that AI is 100% amazing and can do no wrong. people like that are the actual problem with ChatGPT discussions

      • dylan604 13 days ago
        What if it starts to identify itself as Karen? That might be an earlier milestone
        • albeebe1 13 days ago
          The only winning move is not to play
      • psychphysic 13 days ago
        AKA the day it becomes the arbitrator of allowable speech?

        Probably too late by then.

    • tomohelix 13 days ago
      I will only be afraid when it asks: How about a game of Gwent?

      It means it becomes not just sentient, but a sentient memelord. Imagine an AGI acting as a troll. You can never win, all you can do is silently nod and accept the game it plays with you, forever.

      • grensley 13 days ago
        I'm here for Chaotic Good AGI.

        AI dictating "I recommend you don't eat bread this Tuesday" and everything just magically works out. We wonder what eating the bread might have done, world ticks on.

        • Yoric 13 days ago
          Well, if the AI joins Project Mayhem, you should definitely follow its suggestions on what to not eat this Tuesday.
    • albeebe1 13 days ago
      WarGames (1983) for the youngsters
    • quaintdev 13 days ago
      If it ever got that smart it would play dumb
    • ChatGTP 13 days ago
      Set up a cron job to randomly send you an invite: mind = blown
  • darepublic 12 days ago
    This is cool but as others have pointed out there are many ways to demonstrate that gpt4 doesn't actually "understand" the rules of the game. I feel like this article is of a type similar to what was seen when gpt3 was released. Be the first to breathlessly demonstrate the capability of the model for social media credits
  • wizofaus 12 days ago
    Seems odd being beaten at chess by a computer (something that has been happening for over 25 years, even if you're a grandmaster!) would be the trigger to decide ChatGPT-4 will change the world...
  • vba616 13 days ago
    Challenge it to dodgeball.
    • KyeRussell 13 days ago
      If you can dodge logits you can dodge a ball.
  • LASR 13 days ago
    I wonder when someone is going to plug GPT-4 into a REPL that can call into apis over the internet to a chess engine.

    Can’t be that hard no?

    • quaintdev 13 days ago
      > can call into apis over the internet to

      With that capability I'd be more interested in what all sorts of thing it can do than merely playing chess.

      Somebody make it understand swagger documentation if it doesn't do that already

    • speedgoose 13 days ago
      The end of the humanity.
  • e-clinton 12 days ago
    I tried playing ChatGPT by remembering where each piece was since I didn’t have a board. Only made it 4 moves.
  • beyondCritics 12 days ago
    With this capability it already makes a very decent tutor for beginners, and I am expecting more from GPT-X (X>=5). Great news for the chess community. I hear the alarm bells ringing at chess.com...
  • unnouinceput 13 days ago
    Quote: "GPT-4 will change the world"

    No it won't. Neither will 5 nor 6, which I suspect will be out within 2 years at most. But whatever comes after that (7th/8th) definitely will.

    • ChatGTP 13 days ago
      I actually think it might get so "smart" that we will no longer understand it, and will have no way to verify its accuracy because we'll have nothing to benchmark it against.

      Will be great.

      • hackerlight 13 days ago
        I expect this one day, but only after there are more methodology breakthroughs. I am doubtful we get there by just scaling up current methods.

        What I predict is AI will stagnate at the upper-end of human-level intelligence except in narrow domains where we can simulate unlimited data which allows for self-play learning, such as in Go. But such domains are limited. For everything else, in the absence of unlimited data (generated via self-exploration or self-play), the best AI can do is learn from human output, which carries with it all the limitations and flaws of human cognition.

        • zarzavat 13 days ago
          Two of the most interesting domains (mathematics and programming) do have external forms of validation. In mathematics you can use a proof assistant like Lean. In programming there’s types, tests and formal verification.

          Where the AI will struggle is anything where forward progress requires a lab. In this case it will not provide as much acceleration, as it still has to move at the speed of the human in the lab doing the experiment for it. It can maybe make better decisions than the human and get a small speedup, but it won’t be anywhere close to the possible speedup for formal sciences.

          And the AI will really struggle with anything that involves mechanical sympathy gained through practice. No amount of book knowledge or raw intelligence can teach an AI how to play the violin. For that, you need a violin, a hand, and a feedback loop. Indeed there’s probably many skills that humans have near optimal learning rate, because the bottleneck is in the physical environment rather than our brains.

        • rhdunn 12 days ago
          It would be interesting to see how these LLMs work in a REPL environment. The model could take a (language, version, compiler, sourceText) tuple as input, and have the model generate the same. It could also be fed the compiler output in a (warningCount, errorCount, stdoutText, stderrText) tuple, so it can learn to produce error-free output.

          It would also be interesting to feed it language concepts ("if statement", "while loop", etc.) in the form of (language, version, concept, sourceText), to see if it can form an abstract understanding of those concepts, and successfully port code from one language to another.
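          As a toy version of that loop, Python's built-in compile() can stand in for the real compiler; the tuple shape loosely follows the comment's own (warningCount is dropped, since compile() reports only hard errors):

```python
# Sketch of the proposed compiler-feedback loop, with Python's own
# compile() playing the role of the "compiler". The model's source text
# goes in; an (errorCount, stderrText)-style tuple comes back out.

def compiler_feedback(source_text):
    """Return (errorCount, stderrText) for a source string."""
    try:
        compile(source_text, "<llm-output>", "exec")
        return (0, "")
    except SyntaxError as e:
        return (1, f"{e.msg} at line {e.lineno}")

print(compiler_feedback("for i in range(3): print(i)"))  # -> (0, '')
print(compiler_feedback("for i in range(3) print(i)"))   # error tuple; message varies by Python version
```

          A real setup would shell out to an actual compiler per (language, version) and capture its stdout/stderr, but the feedback contract is the same.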

        • oezi 12 days ago
          > I am doubtful we get there by just scaling up current methods.

          Why? The last two years show that throwing more compute at this during training and execution is clearly helping.

          LLMs currently lack access to real world data for instance to measure a physical quantity or perform a randomized control trial, but given how much of human output is just manipulation of language tokens and that we got quite a lot of sample data to feed into the networks I am confident there are still some steps ahead of us that are essentially just more of the same.

          An important data point is also the limited amount of data that we need to feed into a human brain to get somewhere smart.

          • haldujai 12 days ago
            Have they? I would say the last year has shown “throwing more compute” is not the endgame with Chinchilla, Galactica and LLaMA being some prominent papers raising questions.

            Despite all of the (presumed) compute thrown at GPT4 hallucination is still a massive problem (in their technical paper) and “more RL” isn’t/won’t solve that.

            It’s hard to believe that a certain number of flops will fix the fact that next token prediction is inherently not factually grounded and can result in very inaccurate outputs.

            I think we’re plateauing with current model architectures and overestimating LLMs because their output sounds so much better, chess is a great example. If LLMs of this size can’t come close to stockfish which can run on my 10 year old CPU, how much knowledge is all this compute really encoding?

            • oezi 12 days ago
              I don't interpret results of Chinchilla et al. as there are fundamental limits to what LLMs can do. From the Chinchilla paper's abstract:

              "we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally"

              > GPT4 hallucinations

              My interpretation is that we are just seeing Dunning-Kruger. We are at the overconfidence stage, but quickly descending into the Valley of Despair ("I am an AI and can't..."). My reading of the report is that hallucinations are already reduced a lot.

              > is inherently not factually grounded

              Aren't we all just symbol manipulation machines? Isn't truth just created by assigning symbols? Is there more value in experiencing an apple falling on your head than reading on an apple falling onto Newton's head?

              > If LLMs of this size can’t come close to stockfish which can run on my 10 year old CPU, how much knowledge is all this compute really encoding?

              If we assume that GPT-4 is maybe a bit bigger than GPT-3, then it clocks in at around 1000 GB = 1 TB. A large chess database is something like 5 GB, and I would assume training keeps a fraction of that, maybe 100 MB, to encode chess moves. I think it is quite an achievement that GPT-4 can play at all, having learned merely by reading moves from text files online.

              • haldujai 12 days ago
                I agree re: Chinchilla’s main point, what I was trying to point out is that I’m not sure that it’s clear that simply adding “more compute”, which over the past 2 years has meant “increasing model sizes”, is going to keep scaling as was stated.

                It is still unclear whether we even have enough training tokens to adequately train some of these models (GPT-4's details are unknown; I'm thinking of PaLM here). Galactica gets around that by repeating high-quality tokens. Anthropic's work conversely shows repeated tokens can significantly degrade performance.

                I expect there will be higher yield/more focus on refining datasets or training objectives (e.g. GNN work by Leskovec and Liang) in the near future rather than just throwing more compute at CommonCrawl.

                OpenAI claims significantly improved hallucination yet still by their own metric scores up to 80% on factual accuracy tests and scores 60% on TruthfulQA, so it’s still far too frequent to be reliable despite the presumably large effort into RLHF and incorporation of ChatGPT data.

                The problem space of potentially inaccurate seems too large for RLHF to be a good solution.

                > Aren't we all just symbol manipulation machines? Isn't truth just created by assigning symbols? Is there more value in experiencing an apple falling on your head than reading on an apple falling onto Newton's head?

                Yeah, there is actually: having experienced gravity grounds Newton's law in reality and provides a sanity check. What's the evidence to suggest that we are purely symbol manipulation machines?

                In reality, human decision making is very “multimodal” when you move away from low hanging fruit.

                > If we assume that GPT-4 is maybe..

                I assume you're offering a simplistic explanation of how LLMs work, but continuing this hypothetical, I don't really follow the logic behind that math. If LLMs truly encode knowledge and contain a "model" of reality, the whole point is that they reason as they play chess rather than merely regurgitating.

                It "plays" chess (except that it makes illegal moves and appears to play in very predictable ways, as discussed in the other HN chess post) in a way suggesting that it's still doing "fill in the blank" rather than innately understanding/modelling the game, which is the claim being made about LLMs.

          • hackerlight 12 days ago
            > Why?

            These AIs (GPT-4, Stable Diffusion) are being trained on human output. How can you far exceed the capability of humans if you are being trained on human output? The limitations of human cognition are present in the training data. It's no coincidence that Stable Diffusion/MidJourney are producing artworks about 97% as beautiful as the most competent human artists, and we never see something far more beautiful than what a human could produce even by chance. Ditto with the code that GPT-4 produces. It inherently can't exceed the capabilities of the most competent human, because such solutions are not present in the information available in the input data.

            Let's take Go as an example. The reason DeepMind created a superhuman Go agent is because of unlimited self-play. But if DeepMind limited itself to training only on games that humans have played (analogous to how GPT-4, Stable Diffusion were trained), this take-off would have been impossible. At best, the resulting agent would have been as intelligent as human players, perhaps very slightly better, but not much. That's why I think LLMs need to go the way of self-play or self-teaching somehow if we expect them to ever get far beyond humans intellectually.

            > The last two years show that throwing more compute at this during training and execution is clearly helping.

            That's because these models were far below human capability and have been catching up to human capability. This doesn't refute the thesis. My prediction applies from the point GPT is as about as smart as the smartest humans.

  • ar9av 13 days ago
    I just did the same thing and came here to post about it. Some really cool findings:

    It doesn't play like stockfish, but it does seem to play pretty well.

    If you specify a move or capture that is nonsensical, it seems to always recognize that.

    You can stop and ask it about the tactical reasoning behind its moves, and it gives what look (to me) like good, specific answers

    You can ask it if it's concerned about an attack or defense I might make, and it will give a good answer. It will even answer accurately if the attack/defense is not possible.

    You can ask how the game is going (who is likely to win) and it will give an accurate answer with good reasoning.

    You can ask it to try to meet specific alternative goals, like opening itself up to an en passant capture, and it will attempt to accomplish them. This one is really astounding, since you're asking it to play in a way that's strange, and which is probably underrepresented in its training set.

    • v_london 13 days ago
      Hi, I'm the author of the blog post. Just wanted to say, I would be very interested in reading more about the experiments you ran on getting GPT-4 to describe its plans. GPT-3's explanations were confidently incorrect, as usual.
      • lgas 12 days ago
        Hi, since you're commenting here and appear to be interested in exploring things further, I just wanted to point out that once GPT-4 is available through the API you could use a LangChain Agent[1] to maintain the board state externally and feed it back in automatically with every new move, so that the playing field would be more level in terms of memory. You could also inject instructions about explaining its plans either as system messages or as per-prompt instructions.

        [1] https://langchain.readthedocs.io/en/latest/modules/agents.ht...
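        Even without LangChain, the core of the idea is small: keep the move list outside the model and re-send it, fully formatted, with every prompt. A minimal stdlib sketch (checking move legality would additionally need a chess library such as python-chess, which this deliberately omits; it only rebuilds the move sheet):

```python
# Minimal external game-state loop: the program, not the model, is the
# source of truth for the move history, and the full numbered move list
# is re-sent with every prompt.

def format_moves(moves):
    """Render a SAN move list as a numbered move sheet, e.g. '1. e4 c5 2. Nf3'."""
    parts = []
    for i in range(0, len(moves), 2):
        pair = " ".join(moves[i:i + 2])
        parts.append(f"{i // 2 + 1}. {pair}")
    return " ".join(parts)

moves = ["e4", "c5", "Nf3", "d6", "d4"]
prompt = (
    "We are playing chess. The game so far: "
    + format_moves(moves)
    + " It is Black to move. Reply with one legal move in SAN."
)
print(format_moves(moves))  # -> 1. e4 c5 2. Nf3 d6 3. d4
```

        Each reply from the model gets appended to `moves`, and the whole prompt is rebuilt from scratch, so the model never has to "remember" anything across turns.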

    • Buttons840 13 days ago
      If you instruct it to use an alternate Chess notation, such as drawing an ASCII (or Unicode?) board every turn, can it do so? Does this affect its play?

      Instruct GPT-4 to win the game and then crush it using Stockfish. Does it handle defeat gracefully?
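      One way to test the alternate-notation idea is to fix an interchange format and render it yourself. Assuming FEN as that format (an assumption; nothing here shows GPT-4 emits FEN reliably), the piece-placement field expands into an ASCII board in a few stdlib lines:

```python
# Expand the piece-placement field of a FEN string into an ASCII board.
# FEN rows run from rank 8 down to rank 1; digits are runs of empty squares.

def fen_to_ascii(fen):
    placement = fen.split()[0]          # first FEN field only
    rows = []
    for rank in placement.split("/"):
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend("." * int(ch))   # expand empty-square run
            else:
                row.append(ch)              # piece letter as-is
        rows.append(" ".join(row))
    return "\n".join(rows)

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(fen_to_ascii(start))
```

      Asking the model for a FEN after each move and rendering it locally also gives a cheap consistency check: if its FEN disagrees with the move history, it has lost the thread.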

      • mmahemoff 13 days ago
        Strangely, GPT4 refuses to draw a chessboard while earlier versions were up for it.

        I do hope the chat interface improves soon. It seems like low hanging fruit compared to everything else that's been accomplished ... it will be like the jump from telnet/irc/ftp to web browsers.

        • mmahemoff 13 days ago
          (I'm referring to ChatGPT's interface specifically, of course. Third party clients could use GPT4 to build a GUI chessboard.)
      • fenomas 13 days ago
        Not -4, but I tested the ASCII gameboard idea a bunch on ChatGPT (3.5) and could not make any headway. It would consistently announce a move and then print out the same gameboard it had been given (i.e. the game state before its move).

        It sounds like -4 should be much better though; I wonder if having the gameboard would improve its gameplay?

        • BoiledCabbage 13 days ago
          > I wonder if having the gameboard would improve its gameplay?

          If it's an addition, then sure, it seems like it couldn't hurt. If it replaces the algebraic notation, then I'd say very likely worse. All of its input/experience w.r.t. chess has been in the language of algebraic chess notation; switching to only providing a game board would presumably be worse.

          It'd be like if all of our interactions had been in English in a Latin script, and then someone replaced it with emojis. We don't think in emojis, so it wouldn't help us, even if that's how others are used to thinking.

          While we play chess by looking at the board state, it's likely little of ChatGPT's input is in that format.

          And the only other concern would be memory: would it "forget" the prior moves? I think not, simply because it receives the full game each turn, given how GPT APIs work, passing in all prior tokens on each iteration.

          • Buttons840 12 days ago
            GPT playing chess through a visual (or visual-like) interface would be like us playing chess through notation. Same game, different view.

            Of course, some humans can play chess quite well while blindfolded.

      • charcircuit 13 days ago
        GPT4 can take image inputs
      • nhoughto 13 days ago
      Interesting idea; naively that seems possible based on other non-chess example prompts I've seen. It would be a great way to understand its tracking of state throughout the game.
    • janmo 13 days ago
      My guess is that the OpenAI team saw the (viral) videos ridiculing ChatGPT's chess abilities and added some chess training dataset.
      • Yenrabbit 13 days ago
        They finished training GPT4 mid last year, before ChatGPT was released
        • z3c0 12 days ago
          There's definitely some "Mechanical Turk"-ing going on behind the scenes.

          I enjoy the ChatGPT jailbreaking scene, and almost every time a significant break is found, within a day ChatGPT will dance around it (take the big token-leak jailbreak from a few days ago). The GPT-4 model itself may not be retrained, but it seems fairly evident that the ability to apply supplemental rules/fine-tuning is baked into the ChatGPT architecture.

          • janalsncm 12 days ago
            Even with mechanical Turking it’s still an impressive claim. I have not seen any experiments which demonstrate that autoregressive language models can reliably play at any level, let alone superhuman levels. Until I see evidence to the contrary I will assume it’s not true.
    • GistNoesis 13 days ago
      Have you tried, after every move, asking GPT-4 to identify the checks, captures, and attacks in the position, to see how much applying that method helps its Elo rating? You can probably also ask it to render the board as text; that should help GPT-4 visualize the board.
    • mtlmtlmtlmtl 12 days ago
      The claims are worthless without actual games and chat logs.
    • wuiheerfoj 12 days ago
    Is there a reason such AIs can't offload to Stockfish? To me, the full power of AI will be reached when it knows when to hand off to a better tool. Is it a matter of ChatGPT not knowing what it doesn't know, or have the devs simply not integrated it with other tools (yet)?
  • aaron695 13 days ago
  • thatsadude 13 days ago
    It’s AGI to me!