Like low, medium, high, xhigh and so on.
But are they different models underneath? Or the same model with a different parameter?
The reason I ask is because, if I change the effort param mid-conversation in Claude Code, I get a warning suggesting I’m breaking the cache.
I don’t think this happens in Codex because when I change the effort, the responses are still quick.
There aren't other comments discussing this possibility at the moment, but you don't have to take the token predicted as most likely (greedy decoding). Most decoding strategies do something else which is where settings like temperature come in. So if you want the model to "think harder" you can track whether the current tokens are thinking or answer - in OpenAI's system that's called a channel - and then if you're in a thinking block you might get a model output whose top three predictions are:
Greedy decoding would stop thinking at this point and start answering, but you want the model to keep thinking so you skip that token and select the next most likely which is "Wait, ". The reasoning levels can map to the probability of skipping the channel change tokens.
1. take highest probability
2. based on some light weight code that tracks some state - like number of tokens or some sampling distribution
3. higher level is using a smaller llm to decide which token to sample (just a thought)
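To make the idea of skipping the channel-change token concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the END_THINK marker, the SKIP_PROB table and the pick_token helper are hypothetical names, and no vendor has confirmed that effort levels work this way.

```python
import random

# Hypothetical marker for "stop thinking, start answering". Real systems use
# their own special tokens or channel markers.
END_THINK = "<end_think>"

# Assumed mapping from effort level to the chance of skipping END_THINK.
SKIP_PROB = {"low": 0.0, "medium": 0.5, "high": 0.9}


def pick_token(ranked, effort, u):
    """Pick from candidates sorted by descending probability.

    Greedy by default, but when the top candidate is END_THINK, skip it
    with a probability that depends on the effort level. `u` is a uniform
    random draw in [0, 1), passed in so the choice is easy to reproduce.
    """
    top = ranked[0]
    if top == END_THINK and len(ranked) > 1 and u < SKIP_PROB[effort]:
        return ranked[1]  # keep thinking: take the runner-up
    return top


if __name__ == "__main__":
    # The third candidate is made up for the example.
    candidates = [END_THINK, "Wait, ", "So"]
    print(pick_token(candidates, "high", random.random()))
```

With "low" the sketch is plain greedy decoding. A real system would presumably still need a hard cap, since a high skip probability alone never guarantees the thinking block ends.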
If Anthropic's models work the same way, then changing reasoning effort would break the cache because the API has to modify the system prompt given at the very start of the context and rerun the whole thing through the inference server.
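A toy model of why editing the start of the context would invalidate a prefix cache, assuming the cache is keyed on chained hashes of token blocks. The block size, the token strings and both helper names are made up for illustration; this is a sketch of the general technique, not any provider's actual implementation.

```python
import hashlib


def block_hashes(tokens, block_size=4):
    """Chain-hash a token sequence in fixed-size blocks.

    Each block's hash covers the previous block's hash, so a block is only
    reusable when everything before it is identical.
    """
    hashes, prev = [], ""
    for i in range(0, len(tokens), block_size):
        block = " ".join(tokens[i:i + block_size])
        prev = hashlib.sha256((prev + "|" + block).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def reusable_blocks(cached, new):
    """Count the leading blocks whose hashes match the cached ones."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


# Same conversation history, different (hypothetical) effort marker up front.
history = ["a", "b", "c", "d", "e", "f"]
low = ["system:", "effort=low"] + history
high = ["system:", "effort=high"] + history
```

Appending a turn to `low` keeps every earlier block reusable, while switching to `high` changes the first block and therefore every hash after it, so the whole context has to be recomputed.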
This kind of limitation is one reason Opus 4.8's mid-conversation system messages[2] are actually a pretty big deal (if they actually work).
[1] https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/ma...
[2] https://platform.claude.com/docs/en/build-with-claude/mid-co...
Didn't they start injecting system messages telling Claude to calm his tits in overly long and emotional (iirc it triggered on some keywords) chat contexts last year?
Is it harder to post-train in such a way?
Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>). For that, take a look at the vLLM reasoning docs.
https://docs.vllm.ai/en/latest/features/reasoning_outputs/#a...
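A rough sketch of what a separate reasoning-token counter could look like in a decode loop. This is not vLLM's API: the `<think>` and `</think>` markers, the `max_think` and `max_total` limits, and the `chatty` stand-in model are all assumptions for illustration.

```python
def generate(next_token, max_think=4, max_total=16):
    """Toy decode loop with a separate reasoning-token budget.

    `next_token(tokens)` stands in for the model. Reasoning tokens are
    counted apart from the overall length limit, and the think block is
    force-closed once its budget is spent.
    """
    out = ["<think>"]
    thinking, think_count = True, 0
    while len(out) < max_total:
        if thinking and think_count >= max_think:
            tok = "</think>"  # budget spent: force the block closed
        else:
            tok = next_token(out)
        out.append(tok)
        if tok == "</think>":
            thinking = False
        elif tok == "<eos>":
            break
        elif thinking:
            think_count += 1
    return out


def chatty(tokens):
    """Stub model that would think forever if nothing stopped it."""
    if "</think>" not in tokens:
        return "hmm"
    return "<eos>" if tokens[-1] == "answer" else "answer"
```

Calling `generate(chatty, max_think=3)` lets the stub emit three thinking tokens, then the loop inserts `</think>` itself and the stub goes on to answer and stop.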
https://developers.openai.com/api/docs/guides/reasoning
Maybe like: add a secret suffix to your chat in the conversation to think more like
I might be very very wrong though and LLMs disagree with me, insisting that cache is preserved and the system message doesn't have to change (even though it often contains effort level in context) if effort level changes across turns, and that all you have to do is tell the inference lib that parses think tags to early-close think tags that are too long.
Usually it’s done in post-training to enforce behavior based on the prompt, i.e. a system prompt with thinking:max or low or whatever.
Enforcement then goes via constrained decoding, checking for think token start and end with max lengths, or other variations
The "amount of thinking" is how long this internal conversation is allowed to progress. The longer it goes on the more it costs. It's all part of the token budget but, because this internal dialogue is hidden, it's not obvious to the end user.
The model that summarizes what is inside the CoT/|thinking| tags is just an LLM, and it's just as jailbreakable/susceptible to prompt injection as any other LLM: https://x.com/lefthanddraft/status/1991076879877460322 (for those without X: that's Wyatt Walls demonstrating getting the Gemini summarizer to print the raw CoT, as well as do random calculations, dump its system prompt, etc.)
See https://developers.openai.com/cookbook/articles/openai-harmo... and src/openai/types/shared/reasoning_effort.py
Some stacks also tie it to orchestration layers or system/prompt signals, which is why it can look inconsistent across products.
OpenAI describes reasoning.effort as controlling how many reasoning tokens get used before the answer. Anthropic’s docs are even more explicit that effort trades off thoroughness vs token efficiency “with a single model”.
So I wouldn’t read the Claude Code cache warning as proof that a different model is being used. It may just mean the thinking/effort setting is part of the cache key.