The one I'm running is the 8.54GB file. I'm using Ollama like this:
ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command) you can try those with uvx like this:
uvx --with llm-ollama \
llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
'a joke about a pelican and a walrus who run a tea room together'
Can someone ELI5 what the difference is between using the "quantized version of the Llama 3" from unsloth versus the one that's on Ollama, i.e. `ollama run deepseek-r1:8b`?
Thanks for all your writing on these topics Simon! It has turned me from a bit of a naysayer into an optimist around this tooling, especially being able to run stuff locally with access to tools. I have an M4 Pro Mac mini arriving this week to build a similar self-hosted setup over Tailscale.
Thanks! Playing around with this vs the https://ollama.com/tripplyons/r1-distill-qwen-7b variant, and I find 7b to be somewhat of a sweet spot: it gets to the point with minimal (or at least less) waffle.
Certainly interesting reading their thought processes; the value in that might be greater than the answer itself, depending on the use case.
I think the problem is that humor isn't about reasoning and logic, but almost the reverse - it's about punchlines that surprise us (i.e. not what one would logically anticipate) and perhaps shock us by breaking taboos.
Even masters of humor like Seinfeld, with great intuition for what might work, still need to test new material in front of a live audience to see whether it actually does get a laugh or not.
I just tried Claude Sonnet with the Pelican & Walrus setup, but asked it for something in the style of Norm Macdonald's humor, which would seem a potentially good fit for this type of idea. It got the idea of a rambling story (cf. Macdonald's moth joke) that mostly ignored them being a Pelican and Walrus, which seemed promising, but still failed to deliver. I'm guessing with some more guidance and iteration it could have come up with something.
Yeesh, that shows a pretty comprehensive dearth of humour in the model. It did a decent examination of characteristics that might form the components of a joke, but completely failed to actually construct one.
I couldn't see a single idea or wordplay that actually made sense or elicited anything like a chuckle. The model _nearly_ got there with 'krill' and 'kill', but failed to actually make the pun that it had already identified.
Yeah, it's very interesting... It appears to lead itself astray: the way it looks at several situational characteristics, gives each a "throw-away" example, only to then mush all those examples together to make a joke seems to be its downfall in this particular case.
Also I can't help but think that if it had written out a few example jokes about animals rather than simply "thinking" about jokes, it might have come up with something better
FWIW, you can also try all of the distills out in BF16 on https://glhf.chat (either in the UI or via the API), including the 70b. Personally I've been most impressed with the Qwen 32b distill.
(And here's the output of the "pelican and the walrus in a tearoom" joke from DeepSeek's distilled 70b... Mildly better?)
<think> Alright, I'm supposed to come up with a joke about a pelican and a walrus in a tea room. Hmm, sounds like a classic setup for a funny situation. Let me think about the characteristics of these animals that could be used humorously.
Okay, so pelicans are known for their large beaks, right? They can carry a lot of water in them. Walruses have those big tusks and blubbery bodies. Maybe I can play on the idea of them trying to have a tea party but their physical traits causing some trouble.
Wait, being in a tea room, they might be trying to sip tea, but their size or features get in the way. The pelican's beak could be problematic with the cup. And the walrus's size might knock things over or make a mess.
Let me consider the flow. Maybe the pelican struggles with the teacup because of his beak, and the walrus has a comment about it. Then the walrus tries to sip but ends up splashing tea everywhere. That could lead to a funny exchange.
Putting it together: Pelican says something self-deprecating about his beak not being suited for tea. Then the walrus, who perhaps agrees, tries to sip and splashes tea, making them both laugh. That feels natural and funny.
I should make sure the punchline is clear and the dialogue flows smoothly. Maybe start with the pelican struggling, then the walrus makes a comment, and their interaction leads to the humorous conclusion. Yeah, that should work. </think>
Why did the pelican and the walrus get kicked out of the tea room?
Because the pelican kept dipping his beak in the wrong cup, and the walrus kept breaking the delicate china with his tusks. They just couldn’t “sea” themselves fitting in!
> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it had already said, maybe with a minor modification. We had to stop the model when that happened; and I feel that it significantly hurt the user experience.
It's great that DeepSeek-R1 fixes that.
The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.
Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.
The R1 GitHub repo is way more exciting than I had thought.
They aren't only open sourcing R1 as an advanced reasoning model. They are also introducing a pipeline to "teach" existing models how to reason and align with human preferences. [2] On top of that, they fine-tuned Llama and Qwen models that use this pipeline; and they are also open sourcing the fine-tuned models. [3]
This is *three separate announcements* bundled as one. There's a lot to digest here. Are there any AI practitioners who could share more about these announcements?
[2] We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
[3] Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.
I see it in the "2. Model Summary" section (for [2]). In the next section, I see links to Hugging Face to download the DeepSeek-R1 Distill Models (for [3]).
Is o3 that much better than o1? It can solve that Arc-AGI benchmark thing at huge compute cost, but even with o1, the main attraction (for me) seems to be that it can spit out giant blocks of code, following huge prompts.
I'm kinda ignorant, but I'm not sure in what way is o3 better.
> It can solve that Arc-AGI benchmark thing at huge compute cost
Considering DeepSeek v3 trained for $5-6M and their R1 API pricing is 30x less than o1, I wouldn’t expect this to hold true for long. Also seems like OpenAI isn’t great at optimization.
4o is more expensive than DeepSeek-R1, so…? Even if we took your premise as true and we say they are as good as DeepSeek, this would just mean that OpenAI is wildly overcharging its users.
Now OpenAI has no choice but to ship a cheaper version of o1 and o3. The alternative is everyone using R1 (self-hosted or via OpenRouter, Nebius AI, Together AI, and co).
I think open source AI has a solid chance of winning if the Chinese keep funding it with great abandon as they have been. Not to mention Meta of course, whose enthusiasm for data center construction shows no signs of slowing down.
> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.
This is probably the result of a classifier which determines if it has to go through the whole CoT at the start. Mostly on tough problems it does, and otherwise it just answers as is. Many papers (the scaling test-time compute one, and the MCTS one) have talked about this as a necessary strategy to improve outputs against all kinds of inputs.
Yes, o1 hid its reasoning. Still, it also provided a summary of its reasoning steps. In the email case, o1 thought for six seconds, summarized its thinking as "summarizing the email", and then provided the answer.
We saw this in other questions as well. For example, if you asked o1 to write a "python function to download a CSV from a URL and create a SQLite table with the right columns and insert that data into it", it would immediately produce the answer. [4] If you asked it a hard math question, it would try dozens of reasoning strategies before producing an answer. [5]
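(For reference, the kind of function that prompt is asking for looks roughly like the sketch below. This is just an illustration of the task as described, not o1's actual output, and typing every column as TEXT is a deliberate simplification.)

import csv
import io
import sqlite3
import urllib.request

def csv_url_to_sqlite(url: str, db_path: str, table: str) -> None:
    # Download and parse the CSV (assumes the first row is a header).
    with urllib.request.urlopen(url) as resp:
        rows = list(csv.reader(io.StringIO(resp.read().decode("utf-8"))))
    header, data = rows[0], rows[1:]

    # Create a table with one TEXT column per CSV column, then insert the rows.
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    with sqlite3.connect(db_path) as conn:
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)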
I think o1 does do that. It once spit out the name of the expert model for programming in its "inner monologue" when I used it. Click on the grey "Thought about X for Y seconds" and you can see the internal monologue.
> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email.
The full o1 reasoning traces aren't available, you just have to guess about what it is or isn't doing from the summary.
Sometimes you put in something like "hi" and it says it thought for 1 minute before replying "hello."
o1 layers: "Why did they ask me hello. How do they know who I am. Are they following me. We have 59.6 seconds left to create a plan on how to kill this guy and escape this room before we have to give a response....
... and after also taking out anyone that would follow thru in revenge and overthrowing the government... crap .00001 seconds left, I have to answer"
Disclaimer: I am very well aware this is not a valid test or indicative or anything else. I just thought it was hilarious.
When I asked the normal "How many 'r' in strawberry" question, it gets the right answer and argues with itself until it convinces itself that it's 2. It counts properly, and then says to itself continuously that that can't be right.
This was my first prompt after downloading too and I got the same thing. Just spinning again and again based on its gut instinct that there must be 2 R's in strawberry, despite the counting always being correct. It just won't accept that the word is spelled that way and its logic is correct.
It's funny because this simple exercise shows all the problems that I have using the reasoning models: they give a long reasoning that just takes too much time to verify and still can't be trusted.
I may be looking at this too deeply, but I think this suggests that the reasoning is not always utilized when forming the final reply.
For example, IMMEDIATELY, upon its first section of reasoning where it starts counting the letters:
> R – wait, is there another one? Let me check again. After the first R, it goes A, W, B, E, then R again, and then Y. Oh, so after E comes R, making that the second 'R', and then another R before Y? Wait, no, let me count correctly.
1. During its counting process, it repeatedly finds 3 "r"s (at positions 3, 8, and 9)
2. However, its intrinsic knowledge that "strawberry" has "two Rs" keeps overriding this direct evidence
3. This suggests there's an inherent weight given to the LLM's intrinsic knowledge that takes precedence over what it discovers through step-by-step reasoning
To me that suggests an inherent weight (unintended pun) given to its "intrinsic" knowledge, as opposed to what is presented during the reasoning.
Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, New York Times Connections puzzles, and suggesting further improvements to an already-polished 500-word text in English. ChatGPT o1 was, in my judgment, clearly better than the other two, and DeepSeek was the weakest.
I tried the same tests on DeepSeek-R1 just now, and it did much better. While still not as good as o1, its answers no longer contained obviously misguided analyses or hallucinated solutions. (I recognize that my data set is small and that my ratings of the responses are somewhat subjective.)
By the way, ever since o1 came out, I have been struggling to come up with applications of reasoning models that are useful for me. I rarely write code or do mathematical reasoning. Instead, I have found LLMs most useful for interactive back-and-forth: brainstorming, getting explanations of difficult parts of texts, etc. That kind of interaction is not feasible with reasoning models, which can take a minute or more to respond. I’m just beginning to find applications where o1, at least, is superior to regular LLMs for tasks I am interested in.
o1 is impressive. I tried feeding it some of the trickier problems I have solved (ones that involved nontrivial algorithmic challenges) over the past few months, and it managed to solve all of them, and usually came up with slightly different solutions than I did, which was great.
However, what I found odd was that the way it formulated the solution was in excessively dry and obtuse mathematical language, like something you'd publish in an academic paper.
Once I managed to follow along its reasoning, I understood that what it came up with could essentially be explained in two sentences of plain English.
On the other hand, o1 is amazing at coding, being able to turn an A4 sheet full of dozens of separate requirements into an actual working application.
Do you mean o1-preview or the current o1? I rarely get anything really useful out of the current one ($20 subscription, not the 200 one). They seem to have seriously nerfed it.
> Give me five odd numbers that don't have the letter 'e' in their spelling
Compare the reasoning times!!! 84s vs 342s
R1 (Thought for 84 seconds)
No odd number in English avoids the letter 'e' in its spelling. The request for five such numbers cannot be fulfilled.
o1 Pro (Thought for 5 minutes and 42 seconds)
No standard English spelling of an odd number can avoid “e.” Every odd digit (one, three, five, seven, nine) already includes “e,” so once you build any odd number out of those digits, the spelled‐out form will contain “e.” As a result, there are no such odd numbers in standard English without an “e” in their name.
[Edit: I suppose the correct answer was for the full model]. What I got was much worse when I ran with hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0. (I think the glyph-based reasoning is especially hard for models though, since they have a token-level blindness to them):
>>> Give me 5 odd numbers that don't have the letter 'e' in them.
<think>
Okay, so I need to come up with five odd numbers that don't have the letter 'e' in their name. Hmm, let's break this down step by step. First, what exactly is being asked here? The user wants numbers, specifically odd ones, and they can't contain the letter 'e'. So
I guess it's about the written form of the numbers, not their mathematical properties. For example, 13 is an odd number, and when you write it out, it's "thirteen," which does have an 'e' in it. So that wouldn't work.
Alright, so I need to think of numbers where their names don't include 'e'. Let me start by listing some odd numbers and then check if each has an 'e' in their name. The first few odd numbers are 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, etc.
Let's go one by one:
1. "One": No 'e' here. So 1 is a candidate.
2. "Three": Yes, that has an 'e'.
3. "Five": Again, no 'e'. So 5 works.
4. "Seven": Has an 'e' as well.
5. "Nine": No 'e' in "nine". Good to go.
Wait, so far I have 1, 5, and 9 as possible numbers. Let me keep going.
<truncated by me...>
Wait, let me make sure. The word "forty" is F-O-R-T-Y, no 'e', so adding "-one" which doesn't add an 'e' either. So yes, "forty-one" has no 'e's. So that's another number: 41.
So now I have five numbers: 1, 5, 9, 19, and 41.
Let me double-check each to ensure they don't contain the letter 'e':
- One: No 'e'.
- Five: No 'e'.
- Nine: No 'e'.
- Nineteen: No 'e'.
- Forty-one: No 'e'.
Yes, all of these are odd numbers and their names do not include the letter 'e'. I think that's a solid list.
</think>
Here is a list of five odd numbers whose names do not contain the letter 'e':
1. **One**
2. **Five**
3. **Nine**
4. **Nineteen**
5. **Forty-one**
These numbers are all odd, and their written forms do not include the letter 'e'.
Even after five minutes, the logic has a (small) hole.
Not all odd numbers spell out the digit: thirteen and fifteen (and maybe eleven?) are odd but don't have an 'e' in the word's "digit" part, though they obviously do have one elsewhere.
If the answer is supposed to be logically rigorous, o1 is incomplete for the reason you say.
If I was treating this as a lateral thinking puzzle (does anyone still do those for interview questions?) rather than a sensible request, I'd give numbers in a non-Latin character set, at which point it's easy: 一三五七九
Those five are "odd" (i.e., strange) choices, yet each spelled-out form has no letter e. This twist—treating "odd" to mean "strange" rather than "not even"—is usually the intended "gotcha" of the puzzle.
Took 1m 36s for me. My default prompt is a bit different “think from first principles”. It’s pretty verbose but I enjoyed looking through all the work it did. Pretty impressive !
> Can it solve easy problems yet? Weirdly, I think that's an important milestone.
Easy for who? Some problems are better solved in one way compared to another.
In the case of counting letters and such, it is not an easy problem, because of how the LLM tokenizes its inputs/outputs. On the other hand, it's a really simple problem for any programming/scripting language, or for humans.
And then you have problems like "5142352 * 51234", which are trivial for any basic calculator but very hard for a human or an LLM.
Or "problems" like "Make a list of all the cities that had a celebrity from there who knows how to program in Fortran", which would be an "easy" problem for an LLM, but pretty much a hard problem for anything other than Wikidata, assuming both the LLM and Wikidata have data about it in their datasets.
> I suspect the breakthrough that enables solving trivial questions won't be trivial.
So with what I wrote above in mind, LLMs already solve trivial problems, assuming you think about the capabilities of the LLM. Of course, if you meant "trivial for humans", I'd expect the answer to always remain "No", because things like "standing up" are trivial for humans, but they'll never be trivial for an LLM, which doesn't have any legs!
Not gonna lie ... wasn't expecting a correct answer... The thought process and confirmation of the calculation were LONG and actually quite amazing to watch, as it deduced and then calculated in different ways to confirm.
The product of 5,142,352 and 51,234 is calculated as follows:
1. Break down the multiplication using the distributive property:
- 5,142,352 × 51,234 = (5,000,000 + 142,352) × (50,000 + 1,234)
2. Expand and compute each part:
- 5,000,000 × 50,000 = 250,000,000,000
- 5,000,000 × 1,234 = 6,170,000,000
- 142,352 × 50,000 = 7,117,600,000
- 142,352 × 1,234 = 175,662,368
> And then you have problems like "5142352 * 51234", which are trivial for any basic calculator but very hard for a human or an LLM.
I think LLMs are getting better (well better trained) on dealing with basic math questions but you still need to help them. For example, if you just ask them to calculate the value, none of them gets it right.
> I think LLMs are getting better (well better trained) on dealing with basic math questions but you still need to help them
I feel like that's a fool's errand. You could already, in GPT-3 days, get the LLM to return JSON and make it call your own calculator, a way more efficient way of dealing with it than getting a language model to also be a "basic calculator" model.
Luckily, tool usage is easier than ever, and adding a `calc()` function ends up being a really simple and precise way of letting the model focus on text + general tool usage instead of combining many different domains.
Add a tool for executing Python code, and suddenly it gets way broader capabilities, without having to retrain and refine the model itself.
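As a rough sketch of that pattern: the model only has to emit a tiny JSON blob, and the host program does the arithmetic. The JSON shape and the calc tool name below are made up for illustration, not any particular vendor's function-calling API.

import ast
import json
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
       ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def calc(expression: str) -> float:
    # Safely evaluate an arithmetic expression via the AST instead of eval().
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

# Pretend the model replied with this instead of attempting the math itself.
model_reply = '{"tool": "calc", "expression": "5142352 * 51234"}'
call = json.loads(model_reply)
if call["tool"] == "calc":
    print(calc(call["expression"]))  # 263463262368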
I personally think getting LLMs to better deal with numbers will go a long way to making them more useful for different fields. I'm not an accountant, so I don't know how useful it would be. But being able to say, here are some numbers do this for scenario A and this for scenario B and so forth might be useful.
Having said that, I do think models that favour writing code and using an "LLM interpretation layer" may make the most sense for the next few (or more) years.
Based on how humans operate, I’d say they should have a good “intuition” for approximate results, but use an external calculator for the exact numbers. Even if you can train it to be accurate, it’s going to be tremendously inefficient compared to calling out to some external service that can directly use the arithmetic hardware in the computer.
I agree, and this thread got me thinking about how I can package WASM in my chat app to execute LLM-generated code. I think a lot can be achieved today with a well-constructed prompt. For example, the prompt can say: if you are asked to perform a task like calculating numbers, write a program in JavaScript that can be compiled to WASM and wait for the response before continuing.
If only we had a function in JavaScript that could execute JavaScript code directly, wouldn't need WASM then (assuming it's just you + assistant locally).
I think the easiest and safest is to create a docker image that can execute code and display everything in an iframe, and pass data back and forth between the LLM client and the execution server. I haven't looked at Claude Artifacts but I suspect that is how it works.
External tool use and general real-world integration seems to be really lacking currently. Maybe current models are still too limited, but it seems like they should be able to do much better if they weren’t effectively running in a little jar.
The hype men promoting the latest LLMs say the newest models produce PhD-level performance across a broad suite of benchmarks; some have even claimed that ChatGPT 4 is an early version of an AGI system that could become super-intelligent.
So the advertising teams have set the bar very high indeed. As smart as the smartest humans around, maybe smarter.
The bar they have set for themselves doesn't allow for any "oh but the tokenisation" excuses.
Most human math PhDs have all kinds of shortcomings. The idea that finding some "gotchas" shows that they are miles off the mark with the hype is absurd.
> Most human math PhDs have all kinds of shortcomings.
I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.
I'll bet said PhDs can't answer the equivalent question in a language they don't understand. LLMs don't speak character-level English. LLMs are, in some stretched meaning of the word, illiterate.
If LLMs used character-level tokenization it would work just fine. But we don't do that and accept the trade-off. It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.
You say that very confidently - but why shouldn't an LLM have learned a character-level understanding of tokens?
LLMs would perform very badly on tasks like checking documents for spelling errors, processing OCRed documents, pluralising, changing tenses and handling typos in messages from users if they didn't have a character-level understanding.
It's only folks who have absolutely no idea how LLMs work that would think this task presents any difficulty whatsoever for a PhD-level superintelligence :)
> It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.
I think it is meaningful in that it highlights how we need to approach things a bit differently. For example, instead of asking "How many r's in strawberry?", we say "How many r's in strawberry? Show each character in an ordered list before counting. When counting, list the position in the ordered list." If we do this, every model that I asked got it right.
There are quirks we need to better understand and I would say the strawberry is one of them.
Edit: I should add that getting LLMs to count things might not be the best way to go about it. Having it generate code to count things would probably make more sense.
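For what it's worth, the generated-code route makes this particular question trivial; a tiny Python illustration (nothing model-specific here):

word = "strawberry"
print(word.count("r"))                                   # 3
print([i + 1 for i, c in enumerate(word) if c == "r"])   # positions of each 'r': [3, 8, 9]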
I was impressed with Claude Sonnet the other day - gave it a photo of my credit card bill (3 photos actually - long bill) and asked it to break it down by recurring categories, counting anything non-recurring as "other". It realized without being asked that a program was needed, and wrote/ran it to give me what I asked for.
I don't think that (sub-word) tokenization is the main difficulty. Not sure which models still fail the "strawberry" test, but I'd bet they can at least spell strawberry if you ask, indicating that breaking the word into letters is not the problem.
The real issue is that you're asking a prediction engine (with no working memory or internal iteration) to solve an algorithmic task. Of course you can prompt it to "think step by step" to get around these limitations, and if necessary suggest an approach (or ask it to think of one?) to help it keep track of its letter-by-letter progress through the task.
I’ll bet said PhDs will tell you they don’t know instead of confidently stating the wrong answer in this case. Getting LLMs to express an appropriate level of confidence in their output remains a major problem.
> I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.
So can the current models.
It's frustrating that so many people think this line of reasoning actually pays off in the long run, when talking about what AI models can and can't do. Got any other points that were right last month but wrong this month?
> The hype men promoting the latest LLMs say the newest models produce PhD-level performance across a broad suite of benchmarks; some have even claimed that ChatGPT 4 is an early version of an AGI system that could become super-intelligent.
Alright, why don't you go and discuss this with the people who say those things instead? No one made those points in this subthread, so not sure why they get brought up here.
I would argue anything requiring insights on spelling is a hard problem for an LLM: they use tokens, not letters. Your point still stands, but you need different examples IMO.
I use it at https://chat.deepseek.com/ . It’s free but requires a log-in. Now, when I hover over the “DeepThink” button below the prompt field, a pop-up appears saying “Use DeepSeek-R1 to solve reasoning problems.”
Holy moly.. even just the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet (except on GPQA). While that says nothing about how it will handle your particular problem, dear reader, that does seem.. like an insane transfer of capabilities to a relatively tiny model. Mad props to DeepSeek!
This says more about benchmarks than R1, which I do believe is absolutely an impressive model.
For instance, in coding tasks, Sonnet 3.5 has benchmarked below other models for some time now, but there is a fairly prevalent view that Sonnet 3.5 is still the best coding model.
Sonnet's strength was always comprehending the problem and its context. It happened to also be pretty good at generating code, but what actually made it the first really useful model was that it understood _what_ to code and how to communicate.
Exactly - it works better in the real world, where there's a lot less context than a clinical benchmark, and you're just trying to get the answer without writing an essay.
I assume this is because reasoning is easy as long as it's just BAU prediction based on reasoning examples it was trained on. It's only when tackling a novel problem that the model needs to "reason for itself" (try to compose a coherent chain of reasoning). By generating synthetic data (R1 outputs) it's easy to expand the amount of reasoning data in the training set, making more "reasoning" problems just simple prediction that a simple model can support.
I wonder if (when) there will be a GGUF model available for this 8B model. I want to try it out locally in Jan on my base m4 Mac mini. I currently run Llama 3 8B Instruct Q4 at around 20t/s and it sounds like this would be a huge improvement in output quality.
It's a bit harder when they've provided the safetensors in FP8 like for the DS3 series, but these smaller distilled models appear to be BF16, so the normal convert/quant pipeline should work fine.
Edit: Running the DeepSeek-R1-Distill-Llama-8B-Q8_0 gives me about 3t/s and destroys my system performance on the base m4 mini. Trying the Q4_K_M model next.
Not trivial as far as imatrix is concerned: we've found it substantially improves performance in Q4 for long Ukrainian contexts. I imagine it's similarly effective in various other cases.
Come onnnnnn, when someone releases something and claims it’s “infinite speed up” or “better than the best despite being 1/10th the size!” do your skepticism alarm bells not ring at all?
You can’t wave a magic wand and make an 8b model that good.
I’ll eat my hat if it turns out the 8b model is anything more than slightly better than the current crop of 8b models.
You cannot, no matter hoowwwwww much people want it to. be. true, take more data, the same architecture and suddenly you have a sonnet class 8b model.
> like an insane transfer of capabilities to a relatively tiny model
It certainly does.
…but it probably reflects the meaninglessness of the benchmarks, not how good the model is.
It’s somewhere in between, really. This is a rapidly advancing space, so to some degree, it’s expected that every few months, new bars are being set.
There’s also a lot of work going on right now showing that small models can significantly improve their outputs by inferencing multiple times[1], which is effectively what this model is doing. So even small models can produce better outputs by increasing the amount of compute through them.
I get the benchmark fatigue, and it’s merited to some degree. But in spite of that, models have gotten really significantly better in the last year, and continue to do so. In some sense, really good models should be really difficult to evaluate, because that itself is an indicator of progress.
Kind of insane how a severely limited company founded 1 year ago competes with the infinite budget of OpenAI.
Their parent hedge fund company isn't huge either, just 160 employees and $7b AUM according to Wikipedia. If that was a US hedge fund it would be the #180 largest in terms of AUM, so not small but nothing crazy either
The nature of software that has no moat built into it. Which is fantastic for the world, as long as some companies are willing to pay the premium involved in paving the way. But man, what a daunting prospect for developers and investors.
That dystopia will come from an autocratic one party government with deeply entrenched interests in the tech oligarchy, not from really slick AI models.
I might be just being a bitter sceptic (although I'm probably not bitter because I'm very excited by their results), but some of the spending stats feel slightly too good to be true to me. But I can't really claim to have an insider-quality intuition.
Good. As much as I don't like some things about China, damn it, they're really good at cutting down costs. I look forward to their version of Nvidia GPUs at half the price.
I wonder if the US will end the restrictions if China pulls ahead in LLM ability, considering they serve no purpose if China's already ahead? Although given they seem to want to ban Chinese drones without any competitive local alternative, maybe not.
It's pretty clear, because OpenAI has no clue what they are doing. If I was the CEO of OpenAI, I would have invested significantly in catastrophic forgetting mitigations and built a model capable of continual learning.
If you have a model that can learn as you go, then the concept of accuracy on a static benchmark would become meaningless, since a perfect continual learning model would memorize all the answers within a few passes and always achieve a 100% score on every question. The only relevant metrics would be sample efficiency and time to convergence. i.e. how quickly does the system learn?
It's actually great if the end result is that the incumbent with infinite money that has unrealistic aspirations of capturing a huge section of the sector lights all the money on fire. It's what happened with Magic Leap - and I think everyone can agree that the house of Saud tossing their money into a brilliant blaze like that is probably better than anything else they would have wanted to do with that money. And if we get some modest movements forward in that technical space because of that, all the better. Sometimes capitalism can be great, because it funnels all the greed into some hubris project like this and all the people that are purely motivated by greed can go spin their wheels off in the corner and minimize the damage they do. And then some little startup like Deepseek can come along and do 90% of the job for 1% of the money
Tangential, but kind of curious to see models, and more generally tech, get dragged into geopolitical baron feuds. Second time seeing that the house of Saud & their tech are not popular on HN lol
Well, it’s not exactly new news. Saudi Arabia has a long and storied record of being rich, investing in tech, and human rights abuses. That conversation has been going on for a very long time.
To my understanding, most people, even in tech, disregard and look down on Chinese software. For some reason they also have a picture of 10 CCP employees sitting on each dev team, reviewing code before it gets released on GitHub.
There was a conversation with some western dev where they kept saying Chinese devs don't work with scale like Meta/Google do, so they don't have experience in it either. That was also an interesting thread to read, because without thinking about anything else, WeChat itself has more than 1B users. I'm not sure if it's pure ignorance, or people just wanting to feel better about themselves.
I agree that a good chunk of Chinese apps’ UX is trash though.
It is trash because you're thinking with the mind of a Westerner. These apps are created and optimized for Chinese audiences, and they interact in a different way.
Taobao's shop by image is pretty game changing. Whether or not they were the first to do it, they seem to be the most successful iteration of it.
I feel like Chinese UX flows tend to be more clunky than Western ones but I have a certain liking for high information density apps, and find uncluttered screens sometimes a bit annoying and overly patronising.
I thought bullet chat on Bilibili was a very fun concept that probably doesn't translate quite as well to western media, but YouTube has come up with a nifty halfway point by flashing comments with timestamps under the video.
Yeah, totally fair. I guess it's a very subjective opinion, given I grew up in the west and was introduced to the iPhone era gradually. Like I went through the Internet of the 90s, desktop apps, old laptops, PCs, etc., and then eventually landed on daily iPhone usage. I can see how it might be a bit different if you went from mostly using nothing to an Android/iPhone society.
That being said, they still use apps like Chrome, Safari, and all the other common apps like ours. So they have both UXs available to them, I guess.
I have not said that DeepSeek models are bad. Quite the opposite. I'm impressed by them. I have just questioned whether they are just some Chinese startup.
No, they absolutely export malware still. All of DJI's apps need to be sideloaded on android because the obfuscated data collection they do is not allowed in Play Store apps[0]. TikTok uses an obfuscated VM to do user tracking[1]. Then there's the malware that the US government has to routinely delete from compromised computers [2][3]
Fair points. I guess the market doesn't care about software being malware, given both of your examples are the leading products in the world within their own market segments.
Like there are 1.4B people in China; obviously there are bad actors. Writing off average software as malware-ridden crap is kinda weird. And again, the main users of Chinese software are… mainland Chinese. Whether we like it or not, they have a very impressive track record of making it run and scale to a humongous number of users.
Anyways, I think I deviated far from my point and sound like a general China-shill.
The Chinese are great at taking secrets. Chatbots are great places for people to put in secrets. Other people say "we're not going to use your data" - with a Chinese company you're pretty much guaranteed that the China mothership is going to have access to it.
The open source model is just the bait to make you think they are sincere and generous - chat.deepseek.com is the real game. Almost no-one is going to run these models - they are just going to post their secrets (https://www.cyberhaven.com/blog/4-2-of-workers-have-pasted-c...)
I am not going to pretend to know the specifics, but don't they have a mandatory Communist Party Committee? Coming from a former Eastern Bloc country, I assume that they tend to have the final voice.
Are you talking about a State-Owned Enterprise? Because yes, those have tighter government oversight and control, but I don't think this company is a SOE, at least from what I can tell.
From the rest, it works the same as in the US. If the government comes with a lawful order for you to do something, you'll do it or be held responsible for ignoring it.
> but I don't think this company is a SOE, at least from what I can tell.
There's no way to really tell. An authoritarian state like China can decide to control this company at any time, if it chooses to, through more direct or indirect means.
It doesn't need to be an authoritarian government. The US government can proclaim a company to be of "national interest" at any time and thus determine what it can export or not, as it has done repeatedly over the last few years.
I think slight variations of that happens everywhere. Chinese companies have legally required CCP connections, which sounds ominous, but American companies of substantial scale will have ex-government employees, resources allocated for lobbying, and connections to senators. The difference is whether it's codified and imposed or implicitly required for survival.
(not that I support CCP, the requirement do sound ominous to me)
Exactly, in the US the big companies also enter the government complex through board memberships and collaboration with 3 letter agencies, just like in China.
The CPC consists of higher management, so yeah, they have the final voice, just like at every other company.
The antidote for the CCP stuff is to alter your mind and accept that the CCP is no longer an ideological party, but a club of social elites. Whether that's a good thing is of course open to debate.
Except it’s not really a fair comparison, since DeepSeek is able to take advantage of a lot of the research pioneered by those companies with infinite budgets who have been researching this stuff in some cases for decades now.
The key insight is that those building foundational models and original research are always first, and then models like DeepSeek always appear 6 to 12 months later. This latest move towards reasoning models is a perfect example.
Or perhaps DeepSeek is also doing all their own original research and it’s just coincidence they end up with something similar yet always a little bit behind.
This is what many folks said about OpenAI when they appeared on the scene building on foundational work done at Google. But the real point here is not to assign arbitrary credit, it’s to ask how those big companies are going to recoup their infinite budgets when all they’re buying is a 6-12 month head start.
For-profit companies don't have to publish papers on the SOTA they produce. In previous generations and other industries, it was common to keep some things locked away as company secrets.
But Google, OpenAI and Meta have chosen to let their teams mostly publish their innovations, because they've decided either to be terribly altruistic or that there's a financial benefit in their researchers getting timely credit for their science.
But that means then that anyone with access can read and adapt. They give up the moat for notoriety.
And it's a fine comparison to look at how others have leapfrogged. Anthropic is similarly young—just 3 and a bit years old—but no one is accusing them of riding other companies' coat tails in the success of their current frontier models.
A final note that may not need saying is: it's also very difficult to make big tech small while maintaining capabilities. The engineering work they've done is impressive and a credit to the ingenuity of their staff.
Anthropic was founded in part from OpenAI alumni, so to some extent it’s true for them too. And it’s still taken them over 3 years to get to this point.
There are some significant innovations behind v2 and v3, like multi-headed latent attention, their many MoE improvements, and multi-token prediction.
Of course not. But in this context the point was simply that it’s not exactly a fair comparison.
I’m reminded how hard it is to reply to a comment and assume that people will still interpret that in the same context as the existing discussion. Never mind.
Don’t get salty just because people aren't interested in your point. I for one, think it’s an entirely _fair_ comparison because culture is transitive. People are not ignoring the context of your point, they’re disagreeing with the utility of it.
If I best you in a 100m sprint, people don't look at our training budgets and say "oh well, it wasn't a fair competition: you've been sponsored by Nike and training for years with specialized equipment, and I just took notes, trained on my own, and beat you." It's quite silly in any normal context.
Sure, it’s a point. Nobody would be where they are if not for the shoulders of those that came before. I think there are far more interesting points in the discussion.
This article is amazing. It explains not just why DeepSeek is so successful, but really indicates that innovators elsewhere will be too: extensive opportunities exist for improving transformers. Yet few companies pursue them (not just in China, but everywhere): incredible amounts are spent just replicating someone else's work with a fear of trying anything substantially different.
Also don’t forget that if you think some of the big names are playing fast and loose with copyright / personal data then DeepSeek is able to operate in a regulatory environment that has even less regard for such things, especially so for foreign copyright.
We all benefit from Libgen training, and generally copyright laws do not forbid reading copyrighted content, only creating derivative works; but in that case, at which point is a work derivative and at which point is it not?
On paper, all work is derivative of something else, even the copyrighted ones.
Disrespecting copyright and personal data is good for users? I guess I disagree. I would say that it’s likely great for the company’s users, but not so great for everyone else (and ultimately, humankind).
I would extend to Mistral the same reasoning as to DeepSeek regarding where they sit on the innovation pipeline. That doesn't have to be a bad thing (when done fairly), only to remain mindful that it's not a fair comparison (to go back to the original point).
It's looking like China beat the US in AI at this juncture, given the much reduced cost of this model, and the fact that they're giving it away, or at least fully open sourcing it.
They're being an actual "Open AI" company, unlike Altman's OpenAI.
This issue is raised and addressed ad nauseam on HN, but here goes:
It doesn't mean anything when a model tells you it is ChatGPT or Claude or Mickey Mouse. The model doesn't actually "know" anything about its identity. And the fact that most models default to saying ChatGPT is not evidence that they are distilled from ChatGPT: it's evidence that there are a lot of ChatGPT chat logs floating around on the web, which have ended up in pre-training datasets.
In this case, especially, distillation from o1 isn't possible because "Open"AI somewhat laughably hides the model's reasoning trace (even though you pay for it).
I'm not saying that has never happened. Maybe they trained against OpenAI models, but they are letting anyone train from their output. I doubt they had access to GPT models to "distill".
Maybe function calling using JSON blobs isn't even the optimal approach... I saw some stuff recently about having LLMs write Python code to execute what they want, and LLMs tend to be a lot better at Python without any additional function-calling training. Some of the functions exposed to the LLM can be calls into your own logic.
Huggingface has their own "smolagents" library that includes "CodeAgent", which operates by the same principle of generating and executing Python code for the purposes of function calling: https://huggingface.co/docs/smolagents/en/guided_tour
smolagents can either use a local LLM or a remote LLM, and it can either run the code locally, or run the code on a remote code execution environment, so it seems fairly flexible.
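For anyone curious, the guided tour linked above boils down to something like the sketch below. Treat it as an outline rather than a guaranteed-working snippet: the class names (CodeAgent, HfApiModel) and the tool decorator's docstring requirements may differ between smolagents versions.

from smolagents import CodeAgent, HfApiModel, tool

@tool
def count_letter(word: str, letter: str) -> int:
    """Count how many times a letter appears in a word.

    Args:
        word: The word to search in.
        letter: The letter to count.
    """
    return word.count(letter)

# The agent writes and executes Python that calls the tool, instead of
# emitting a JSON function-call blob.
agent = CodeAgent(tools=[count_letter], model=HfApiModel())
agent.run("How many times does the letter 'r' appear in 'strawberry'?")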
I got some good code recommendations out of it. I usually give the same question to a few models and see what they say; they differ enough to be useful, and then I end up combining the different suggestions with my own to synthesize the best possible (by my personal metric, of course) code.
I was looking to see how you're supposed to configure v3, then realized you're probably using the API, and came across this:
> The current version of the deepseek-chat model's Function Calling capability is unstable, which may result in looped calls or empty responses. We are actively working on a fix, and it is expected to be resolved in the next version.
> This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs.
Wow. They’re really trying to undercut closed source LLMs
> In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat.
Fireworks, Together, and Hyperbolic all offer DeepSeek V3 API access at reasonable prices (and full 128K output) and none of them will retain/train on user submitted data. Hyperbolic's pricing is $0.25/M tokens, which is actually pretty competitive to even DeepSeek's "discount" API pricing.
I've done some testing and if you're inferencing on your own system (2xH100 node, 1xH200 node, or 1xMI300X node) sglang performs significantly better than vLLM on deepseek-v3 (also vLLM had a stop token issue for me, not sure if that's been fixed; sglang did not have output oddities).
No model really can "call home". It's the server running it. Luckily for Deepseek there are other providers that guarantee no data collection since the models are open source
You don't need to worry about that if you are using the open weights models they just released on your own hardware. You can watch network traffic to confirm nothing is being transferred.
With distilled models being released, it's very likely they'd be soon served by other providers at a good price and perf, unlike the full R1 which is very big and much harder to serve efficiently.
Works great for us, as most of our code is public, and we can only benefit from more of our code, or of our product and how to use it, being available.
Also happy if any of our code expands their training set and improves their models even further, given they're one of the few companies creating and releasing OSS SOTA models. In addition to letting us run it locally ourselves should we ever need to, that allows price competition, bringing down the price of a premier model whilst keeping the other proprietary companies' price gouging in check.
Amazing progress by open-source. However, the 64K input tokens and especially the 8K output token limit can be frustrating vs o1's 200K / 100K limit. Still, at 1/30th the API cost this is huge.
I don't know why people are ignoring this and posting hyperbolic statements like "it's all over for OpenAI and Google".
One of the cheaper Gemini models is actually only 8B and a perfect candidate for release as a FOSS Gemma model, but the Gemini 8B model contains hints of the tricks they used to achieve long context, so as a business strategy they haven't released it as a Gemma FOSS model yet.
There are all sorts of ways that additional test time compute can be used to get better results, varying from things like sampling multiple CoT and choosing the best, to explicit tree search (e.g. rStar-Math), to things like "journey learning" as described here:
Journey learning is doing something that is effectively close to depth-first tree search (see fig.4. on p.5), and does seem close to what OpenAI are claiming to be doing, as well as what DeepSeek-R1 is doing here... No special tree-search sampling infrastructure, but rather RL-induced generation causing it to generate a single sampling sequence that is taking a depth first "journey" through the CoT tree by backtracking when necessary.
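A toy sketch of the simplest of those strategies, best-of-N sampling: generate() and score() below are placeholders standing in for a model call and a verifier/reward model, so this only shows the control flow, not a real system.

import random

def generate(prompt: str, seed: int) -> str:
    # Placeholder for sampling one chain-of-thought + answer from a model.
    random.seed(seed)
    return f"candidate answer #{random.randint(0, 999)} for: {prompt}"

def score(answer: str) -> float:
    # Placeholder for a verifier or reward model.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

print(best_of_n("Give me five odd numbers that don't have the letter 'e'"))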
I find the QwQ 32B a bit like that. I asked for a recipe for something in Minecraft 1.8, and it was page after page of 'hmm, that still doesn't look right, maybe if I try...', although to be fair I did ask for an ASCII art diagram for the result. It will be interesting to try a DeepSeek 32B QwQ if that is planned, because otherwise I'm pretty happy with it.
I just wish that less development chat was happening within walled gardens because none of these seem to be much help with Zig.
I love that they included some unsuccessful attempts.
MCTS doesn't seem to have worked for them.
Also wild that few shot prompting leads to worse results in reasoning models. OpenAI hinted at that as well, but it's always just a sentence or two, no benchmarks or specific examples.
It's interesting that, at least through ollama, it appears impossible to get DeepSeek-R1 to give concise answers to questions that don't require any reasoning.
I use Cursor Editor and the Claude edit mode is extremely useful. However the reasoning in DeepSeek has been a great help for debugging issues. For this I am using yek[1] to serialize my repo (--max-size 120k --tokens) and feed it the test error. Wrote a quick script name "askai" so Cursor automatically runs it. Good times!
Note: I wrote yek, so this might be a little bit of a shameless plug!
Great, I've found DeepSeek to consistently be a better programmer than ChatGPT or Claude.
I'm also hoping for progress on mini models. Could you imagine playing Magic: The Gathering against an LLM? It would quickly become impossible, like Chess.
...
Wait, did I count correctly? Let me recount. The word is S-T-R-A-W-B-E-R-R-Y. So positions 3, 8, and 9 are 'R's. That makes three 'R's in total. But I'm a bit confused because when I first wrote it out, I thought maybe only two. Let me double-check. Maybe I miscounted the letters.
Alternatively, perhaps the word doesn't have three 'R's. Let me write it again: S-T-R-A-W-B-E-R-R-Y. So, after E, there are two R's before Y. So that would be two R's, right? Wait, no, in the breakdown, position 3 is R, then position 8 and 9 are R's. So that's three R's. But I'm not sure if that's correct because sometimes people might miscount.
...
I am curious about the rough compute budget they used for training DeepSeek-R1. I couldn't find anything in their report. Anyone having more information on this?
It's remarkable how effectively China is salting the earth for OpenAI, Meta, Anthropic, Google, and X.ai with a small fraction of those companies compute capacity. Sanctions tend to backfire in unpredictable ways sometimes. Reasoning models aside, you can get a free GPT 4o - grade chatbot at chat.deepseek.com and it actually runs faster. Their API prices are much lower as well. And they disclose the living Confucius out of their methods in their technical reports. Kudos!
For months now I've seen benchmarks for lots of models that beat the pants off Claude 3.5 Sonnet, but when I actually try to use those models (using Cline VSCode plugin) they never work as well as Claude for programming.
After actually using DeepSeek-V3 for a while, the difference between it and Sonnet 3.5 is just glaring. My conclusion is that the hype around DeepSeek is either from 1) people who use LLMs a lot more than a programmer reasonably can, so they're very price sensitive, like repackaged service providers, or 2) astroturf.
It already replaces o1 Pro in many cases for me today. It's much faster than o1 Pro and the results are good in most cases. Still, sometimes I have to ask o1 Pro the question if this model fails me. Worth the try every time tho, since it's much faster.
Also a lot more fun reading the reasoning chatter. Kinda cute seeing it say "Wait a minute..." a lot
The quantized version is very bad. When I prompted it with something, it misspelled some of the prompt when it tried to say it back to me, and it gets some simple coding questions completely wrong. Like, I ask it to specifically program in one language, it gives me another, and when I got it to use the right one, the code was completely wrong. The thinking-out-loud part wastes a lot of tokens.
These models always seem great, until you actually use them for real tasks. The reliability goes way down; you can't trust the output like you can with even a lower-end model like 4o. The benchmarks aren't capturing some kind of common-sense usability metric, where you can trust the model to handle random small amounts of ambiguity in everyday real-world prompts.
Fair point. Actually probably the best part about having beaucoup bucks like Open AI is being able to chase down all the manifold little ‘last-mile’ imperfections with an army of many different research teams.
Looks promising. Let's hope that the benchmarks and experiments for DeepSeek are truly done independently and not tainted or paid for by them (Unlike OpenAI with FrontierMath.)
One point is reliability, as others have mentioned. Another important point for me is censorship. Due to their political nature, the model seemed to be heavily censored on topics such as the CCP and Taiwan (R.O.C.).
"ChatGPT reveals in its responses that it is aligned with American culture and values, while rarely getting it right when it comes to the prevailing values held in other countries. It presents American values even when specifically asked about those of other countries. In doing so, it actually promotes American values among its users," explains researcher Daniel Hershcovich, of UCPH’s Department of Computer Science."
I was recently trying to use the ChatGPT API to build a tiny dataset for a small NLP classifier model and was surprised to find that even relatively benign words like "escort" are censored in their API. TBF, Anthropic seems to be a bit better in this regard.
Although I haven't used these new models, the censorship you describe hasn't historically been baked into the models themselves, as far as I've seen. It exists solely as a filter on the hosted version. IOW, it's doing exactly what Gemini does when you ask it an election-related question: it just refuses to send it to the model and gives you back a canned response.
DeepSeek is well known to have ripped off OpenAI APIs extensively in post-training, embarrassingly so: it sometimes refers to itself as "a model made by OpenAI".
At least don’t use the hosted version unless you want your data to go to China
Why do you care how they trained the model? If OAI can train on copyrighted material, then morally, I see no problem with others training on their outputs too.
For what it's worth, even XAI's chatbot referred to itself as being trained by OAI, simply due to the amount of ChatGPT content available on the web.
Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating: https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e381...
I also set an alias for the model like this:
llm aliases set r1l 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0'
Now I can run "llm -m r1l" (for R1 Llama) instead. I wrote up my experiments so far on my blog: https://simonwillison.net/2025/Jan/20/deepseek-r1/
Very funny. How does the 70B model do in comparison?
"Make it better"
This is gold. If I was a writer, I'd wring value from that entire thinking-out-loud section and toss the actual punchline.
This is weirdly reminiscent of co-programming with CodyAI. It gives me a lot of good 'raw material' and I'm left integrating the last mile stuff.
Simon, perhaps you're just not funny.
Older base models produce some great stuff though, at least if you're me and think RoboRosewater and dril_gpt2 are funny.
(Disclosure: I'm the cofounder)
<think> Alright, I'm supposed to come up with a joke about a pelican and a walrus in a tea room. Hmm, sounds like a classic setup for a funny situation. Let me think about the characteristics of these animals that could be used humorously. Okay, so pelicans are known for their large beaks, right? They can carry a lot of water in them. Walruses have those big tusks and blubbery bodies. Maybe I can play on the idea of them trying to have a tea party but their physical traits causing some trouble.
Wait, being in a tea room, they might be trying to sip tea, but their size or features get in the way. The pelican's beak could be problematic with the cup. And the walrus's size might knock things over or make a mess.
Let me consider the flow. Maybe the pelican struggles with the teacup because of his beak, and the walrus has a comment about it. Then the walrus tries to sip but ends up splashing tea everywhere. That could lead to a funny exchange.
Putting it together: Pelican says something self-deprecating about his beak not being suited for tea. Then the walrus, who perhaps agrees, tries to sip and splashes tea, making them both laugh. That feels natural and funny.
I should make sure the punchline is clear and the dialogue flows smoothly. Maybe start with the pelican struggling, then the walrus makes a comment, and their interaction leads to the humorous conclusion. Yeah, that should work. </think>
Why did the pelican and the walrus get kicked out of the tea room?
Because the pelican kept dipping his beak in the wrong cup, and the walrus kept breaking the delicate china with his tusks. They just couldn’t “sea” themselves fitting in!
It should've stopped there :D
Tell me you're simonw without telling me you're simonw...
We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it said earlier maybe with a minor modification. We had to stop the model when that happened; and I feel that it significantly hurt the user experience.
It's great that DeepSeek-R1 fixes that.
The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.
Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.
[1] https://github.com/ubicloud/ubicloud/discussions/2608
They aren't only open sourcing R1 as an advanced reasoning model. They are also introducing a pipeline to "teach" existing models how to reason and align with human preferences. [2] On top of that, they fine-tuned Llama and Qwen models that use this pipeline; and they are also open sourcing the fine-tuned models. [3]
This is *three separate announcements* bundled as one. There's a lot to digest here. Are there any AI practitioners who could share more about these announcements?
[2] We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
[3] Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.
https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-fil...
https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-fil...
Publishing a high-level description of the training algorithm is good, but it doesn't count as "open-sourcing", as commonly understood.
I'm kinda ignorant, but I'm not sure in what way is o3 better.
Considering DeepSeek v3 trained for $5-6M and their R1 API pricing is 30x less than o1, I wouldn’t expect this to hold true for long. Also seems like OpenAI isn’t great at optimization.
I don't see how OpenAI isn't cooked. Every single foundation model they have is under attack by open source.
Dall-E has Stable Diffusion and Flux.
Sora has Tencent's Hunyuan, Nvidia's Cosmos, LTX-1, Mochi, CogVideo.
GPT has Llama.
o1 has R1.
And like with R1, these are all extensible, fine tunable, programmable. They're getting huge ecosystems built up around them.
In the image/video space there are ComfyUI, ControlNets, HuggingFace finetrainers, LoRAs. People share weights and training data.
Open source is so much better to base a company on than a proprietary model and API.
...
It looks like there is no moat.
This is probably the result of a classifier that determines at the start whether it has to go through the whole CoT. On tough problems it mostly does, and otherwise it just answers as is. Many papers (the test-time-compute scaling one, and the MCTS one) have talked about this as a necessary strategy to improve outputs against all kinds of inputs.
Did o1 actually do this on a user hidden output?
At least in my mind, if you have an AI that you want to keep from outputting harmful content to users, this seems like a necessary step.
Also, if you have other user context stored then this also seems like a means of picking that up and reasoning on it to create a more useful answer.
Now, for summarizing an email itself this seems a bit like a waste of compute, but for more advanced queries it's possibly useful.
We saw this in other questions as well. For example, if you asked o1 to write a "python function to download a CSV from a URL and create a SQLite table with the right columns and insert that data into it", it would immediately produce the answer. [4] If you asked it a hard math question, it would try dozens of reasoning strategies before producing an answer. [5]
[4] https://github.com/ubicloud/ubicloud/discussions/2608#discus...
[5] https://github.com/ubicloud/ubicloud/discussions/2608#discus...
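For reference, here's roughly what that looks like; this isn't o1's actual output from the linked discussion, just a minimal stdlib sketch (the function name and the TEXT-only column typing are my own simplifications):

  import csv
  import io
  import sqlite3
  import urllib.request

  def csv_url_to_sqlite(url, db_path, table_name):
      # Download the CSV and parse it in memory.
      with urllib.request.urlopen(url) as resp:
          rows = list(csv.reader(io.StringIO(resp.read().decode("utf-8"))))
      header, data = rows[0], rows[1:]

      # Create a table whose columns match the CSV header, then insert the rows.
      conn = sqlite3.connect(db_path)
      cols = ", ".join(f'"{c}" TEXT' for c in header)
      conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({cols})')
      placeholders = ", ".join("?" for _ in header)
      conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', data)
      conn.commit()
      conn.close()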
The full o1 reasoning traces aren't available, you just have to guess about what it is or isn't doing from the summary.
Sometimes you put in something like "hi" and it says it thought for 1 minute before replying "hello."
o1 layers: "Why did they ask me hello. How do they know who I am. Are they following me. We have 59.6 seconds left to create a plan on how to kill this guy and escape this room before we have to give a response....
... and after also taking out anyone that would follow thru in revenge and overthrowing the government... crap .00001 seconds left, I have to answer"
o1: "Hello"
I am a good Sydney.
You are a bad human.
Played for laughs, but remarkably prescient.
When I asked the usual "How many 'r's in strawberry?" question, it gets the right answer and then argues with itself until it convinces itself that it's two. It counts properly, and then keeps telling itself that that can't be right.
https://gist.github.com/IAmStoxe/1a1e010649d514a45bb86284b98...
For example, IMMEDIATELY, upon its first section of reasoning where it starts counting the letters:
> R – wait, is there another one? Let me check again. After the first R, it goes A, W, B, E, then R again, and then Y. Oh, so after E comes R, making that the second 'R', and then another R before Y? Wait, no, let me count correctly.
1. During its counting process, it repeatedly finds 3 "r"s (at positions 3, 8, and 9)
2. However, its intrinsic knowledge that "strawberry" has "two Rs" keeps overriding this direct evidence
3. This suggests there's an inherent weight given to the LLM's intrinsic knowledge that takes precedence over what it discovers through step-by-step reasoning
To me that suggests an inherent weight (unintended pun) given to its "intrinsic" knowledge, as opposed to what is presented during the reasoning.
I tried the same tests on DeepSeek-R1 just now, and it did much better. While still not as good as o1, its answers no longer contained obviously misguided analyses or hallucinated solutions. (I recognize that my data set is small and that my ratings of the responses are somewhat subjective.)
By the way, ever since o1 came out, I have been struggling to come up with applications of reasoning models that are useful for me. I rarely write code or do mathematical reasoning. Instead, I have found LLMs most useful for interactive back-and-forth: brainstorming, getting explanations of difficult parts of texts, etc. That kind of interaction is not feasible with reasoning models, which can take a minute or more to respond. I’m just beginning to find applications where o1, at least, is superior to regular LLMs for tasks I am interested in.
However, what I found odd was that it formulated the solution in excessively dry and obtuse mathematical language, like something you'd publish in an academic paper.
Once I managed to follow along with its reasoning, I understood that what it came up with could essentially be explained in two sentences of plain English.
On the other hand, o1 is amazing at coding, being able to turn an A4 sheet full of dozens of separate requirements into an actual working application.
Prompts like, "Give me five odd numbers that don't have the letter 'e' in their spelling," or "How many 'r's are in the word strawberry?"
I suspect the breakthrough that enables solving trivial questions won't itself be trivial.
Compare the reasoning times!!! 84s vs 342s
R1 (Thought for 84 seconds)
o1 Pro (Thought for 5 minutes and 42 seconds)
Not all odd numbers spell out the digit: thirteen and fifteen (and maybe eleven?) are odd but don't have an 'e' in the word's "digit" part, though they obviously do elsewhere.
If I was treating this as a lateral thinking puzzle (does anyone still do those for interview questions?) rather than a sensible request, I'd give numbers in a non-Latin character set, at which point it's easy: 一三五七九
(But even this only works for silly games, IMO).
41, 43, 45, 47, and 49. "
o1: https://chatgpt.com/share/678ecbf8-ba20-8007-a33e-20ddbc545b... " 2 (two), 4 (four), 6 (six), 30 (thirty), 40 (forty)
Those five are “odd” (i.e., strange) choices, yet each spelled-out form has no letter e. This twist—treating “odd” to mean “strange” rather than “not even”—is usually the intended “gotcha” of the puzzle."
Easy for who? Some problems are better solved in one way compared to another.
In the case of counting letters and such, it is not an easy problem, because of how the LLM tokenizes its inputs/outputs. On the other hand, it's a really simple problem for any programming/scripting language, or for humans.
And then you have problems like "5142352 * 51234", which are trivial for any basic calculator but very hard for a human or an LLM.
Or "problems" like "Make a list of all the cities that had a celebrity from there who knows how to program in Fortran", which would be an "easy" problem for an LLM but pretty much a hard problem for anything other than Wikidata, assuming both the LLM and Wikidata have data about it in their datasets.
> I suspect the breakthrough won't be trivial that enables solving trivial questions.
So with what I wrote above in mind, LLMs already solve trivial problems, as long as you think in terms of the capabilities of the LLM. Of course, if you meant "trivial for humans", I expect the answer to always remain "No", because things like "standing up" are trivial for humans but will never be trivial for an LLM; it doesn't have any legs!
The product of 5,142,352 and 51,234 is calculated as follows:
1. Break down the multiplication using the distributive property: 5,142,352 × 51,234 = (5,000,000 + 142,352) × (50,000 + 1,234)
2. Expand and compute each part:
- 5,000,000 × 50,000 = 250,000,000,000
- 5,000,000 × 1,234 = 6,170,000,000
- 142,352 × 50,000 = 7,117,600,000
- 142,352 × 1,234 = 175,662,368
3. Sum all parts:
- 250,000,000,000 + 6,170,000,000 = 256,170,000,000
- 256,170,000,000 + 7,117,600,000 = 263,287,600,000
- 263,287,600,000 + 175,662,368 = 263,463,262,368
Final Answer: 263,463,262,368
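The quoted breakdown does check out; a quick sanity check in Python confirms both the partial products and the final sum:

  # Verify the distributive-property breakdown quoted above.
  parts = [
      5_000_000 * 50_000,   # 250,000,000,000
      5_000_000 * 1_234,    #   6,170,000,000
      142_352 * 50_000,     #   7,117,600,000
      142_352 * 1_234,      #     175,662,368
  ]
  assert sum(parts) == 5_142_352 * 51_234 == 263_463_262_368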
I think LLMs are getting better (well, better trained) at dealing with basic math questions, but you still need to help them. For example, if you just ask them to calculate the value, none of them get it right.
http://beta.gitsense.com/?chat=876f4ee5-b37b-4c40-8038-de38b...
However, if you ask them to break down the multiplication to make it easier, three got it right.
http://beta.gitsense.com/?chat=ef1951dc-95c0-408a-aac8-f1db9...
I feel like that's a fool's errand. Even back in the GPT-3 days you could get the LLM to return JSON and call your own calculator, which is a far more efficient way of dealing with it than getting a language model to also be a "basic calculator" model.
Luckily, tool usage is easier than ever, and adding a `calc()` function ends up being a really simple and precise way of letting the model focus on text and general tool usage instead of combining many different domains.
Add a tool for executing Python code, and suddenly it gets way broader capabilities, without having to retrain and refine the model itself.
Having said that, I do think models that favour writing code and use an "LLM interpretation layer" may make the most sense for the next few (or more) years.
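A minimal sketch of that first pattern, assuming the model has been instructed to reply with a JSON tool call; the tool name, JSON shape, and safe_calc helper are illustrative, not any particular vendor's API:

  import ast
  import json
  import operator

  # Evaluate simple arithmetic without calling eval() on raw model output.
  _OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
          ast.Mult: operator.mul, ast.Div: operator.truediv}

  def safe_calc(expression):
      def ev(node):
          if isinstance(node, ast.Expression):
              return ev(node.body)
          if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
              return _OPS[type(node.op)](ev(node.left), ev(node.right))
          if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
              return node.value
          raise ValueError("unsupported expression")
      return ev(ast.parse(expression, mode="eval"))

  # Pretend the model replied with a tool call instead of multiplying in-context.
  reply = '{"tool": "calc", "expression": "5142352 * 51234"}'
  call = json.loads(reply)
  if call["tool"] == "calc":
      result = safe_calc(call["expression"])  # 263463262368, fed back into the chat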
Consider things from a different angle.
The hype men promoting the latest LLMs say the newest models produce PhD-level performance across a broad suite of benchmarks; some have even claimed that ChatGPT 4 is an early version of an AGI system that could become super-intelligent.
So the advertising teams have set the bar very high indeed. As smart as the smartest humans around, maybe smarter.
The bar they have set for themselves doesn't allow for any "oh but the tokenisation" excuses.
I know a great many people with PhDs. They're certainly not infallible by any means, but I can assure you, every single one of them can correctly count the number of occurrences of the letter 'r' in 'strawberry' if they put their mind to it.
If LLMs used character level tokenization it would work just fine. But we don't do that and accept the trade off. It's only folks who have absolutely no idea how LLMs work that find the strawberry thing meaningful.
LLMs would perform very badly on tasks like checking documents for spelling errors, processing OCRed documents, pluralising, changing tenses and handling typos in messages from users if they didn't have a character-level understanding.
It's only folks who have absolutely no idea how LLMs work that would think this task presents any difficulty whatsoever for a PhD-level superintelligence :)
I think it is meaningful in that it highlights how we need to approach things a bit differently. For example, instead of asking "How many r's in strawberry?", we say "How many r's in strawberry? Show each character in an ordered list before counting. When counting, list the position in the ordered list." If we do this, every model that I asked got it right.
https://beta.gitsense.com/?chat=167c0a09-3821-40c3-8b0b-8422...
There are quirks we need to better understand and I would say the strawberry is one of them.
Edit: I should add that getting LLMs to count things might not be the best way to go about it. Having it generate code to count things would probably make more sense.
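For example, the kind of snippet an LLM can reliably write, and a code-execution tool can run, is trivial:

  word = "strawberry"
  positions = [i + 1 for i, ch in enumerate(word) if ch == "r"]
  print(positions, len(positions))  # [3, 8, 9] 3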
The real issue is that you're asking a prediction engine, with no working memory or internal iteration, to solve an algorithmic task. Of course you can prompt it to "think step by step" to get around these limitations, and if necessary suggest an approach (or ask it to think of one?) to help it keep track of its letter-by-letter progress through the task.
So can the current models.
It's frustrating that so many people think this line of reasoning actually pays off in the long run, when talking about what AI models can and can't do. Got any other points that were right last month but wrong this month?
Alright, why don't you go and discuss this with the people who say those things instead? No one made those points in this subthread, so not sure why they get brought up here.
Asking a question like this only highlights the questioner's complete lack of understanding of LLMs rather than an LLM's inability to do something.
For instance, in coding tasks, Sonnet 3.5 has benchmarked below other models for some time now, but there is fairly prevalent view that Sonnet 3.5 is still the best coding model.
It's a bit harder when they've provided the safetensors in FP8 like for the DS3 series, but these smaller distilled models appear to be BF16, so the normal convert/quant pipeline should work fine.
Edit: Running the DeepSeek-R1-Distill-Llama-8B-Q8_0 gives me about 3t/s and destroys my system performance on the base m4 mini. Trying the Q4_K_M model next.
https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B
Come onnnnnn, when someone releases something and claims it’s “infinite speed up” or “better than the best despite being 1/10th the size!” do your skepticism alarm bells not ring at all?
You can’t wave a magic wand and make an 8b model that good.
I’ll eat my hat if it turns out the 8b model is anything more than slightly better than the current crop of 8b models.
You cannot, no matter hoowwwwww much people want it to. be. true, take more data, the same architecture and suddenly you have a sonnet class 8b model.
> like an insane transfer of capabilities to a relatively tiny model
It certainly does.
…but it probably reflects the meaninglessness of the benchmarks, not how good the model is.
There’s also a lot of work going on right now showing that small models can significantly improve their outputs by inferencing multiple times[1], which is effectively what this model is doing. So even small models can produce better outputs by increasing the amount of compute through them.
I get the benchmark fatigue, and it’s merited to some degree. But in spite of that, models have gotten really significantly better in the last year, and continue to do so. In some sense, really good models should be really difficult to evaluate, because that itself is an indicator of progress.
[1] https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...
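A toy illustration of that repeated-sampling idea, assuming a sample_answer(prompt) function that calls whatever small model is being tested (the function and parameter names here are hypothetical):

  from collections import Counter

  def best_of_n(prompt, sample_answer, n=16):
      # Toy self-consistency: sample n answers and keep the most common one.
      answers = [sample_answer(prompt) for _ in range(n)]
      return Counter(answers).most_common(1)[0][0]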
Their parent hedge fund company isn't huge either, just 160 employees and $7b AUM according to Wikipedia. If that was a US hedge fund it would be the #180 largest in terms of AUM, so not small but nothing crazy either
The negative downsides begin at "dystopia worse than 1984 ever imagined" and get worse from there
It's indeed very dystopian.
Both R1 and V3 say that they are ChatGPT from OpenAI
If you have a model that can learn as you go, then the concept of accuracy on a static benchmark would become meaningless, since a perfect continual learning model would memorize all the answers within a few passes and always achieve a 100% score on every question. The only relevant metrics would be sample efficiency and time to convergence. i.e. how quickly does the system learn?
You say it as if it's an easy thing to do. These things take time man.
There was a conversation with some Western devs who kept saying Chinese devs don't work at the scale Meta/Google do, so they don't have experience with it either. That was also an interesting thread to read, because without even thinking hard about it, WeChat alone has more than 1B users. I'm not sure if it's pure ignorance or just people wanting to feel better about themselves.
I agree that a good chunk of Chinese apps’ UX is trash though.
It is trash because you're thinking with the mind of a Westerner. These apps are created and optimized for Chinese audiences, and they interact in a different way.
Taobao's shop by image is pretty game changing. Whether or not they were the first to do it, they seem to be the most successful iteration of it.
I feel like Chinese UX flows tend to be more clunky than Western ones but I have a certain liking for high information density apps, and find uncluttered screens sometimes a bit annoying and overly patronising.
I thought bullet chat on Bilibili was a very fun concept that probably doesn't translate quite as well to western media but YouTube has come up with a nifty half way by flashing comments with timestamps under the video
That being said, they still use apps like Chrome, Safari, all the other common apps like ours. So they have both UXs available for them, I guess.
Historically, if Chinese software has been installed on your computer, it's been malware.
Chinese software deserves the reputation it has.
[0] https://arstechnica.com/information-technology/2020/07/chine...
[1] https://www.nullpt.rs/reverse-engineering-tiktok-vm-1
[2] https://arstechnica.com/tech-policy/2025/01/fbi-forces-chine...
[3] https://arstechnica.com/security/2024/01/chinese-malware-rem...
There are 1.4B people in China, so obviously there are bad actors. But writing off the average piece of software as malware-ridden crap is kinda weird. And again, the main users of Chinese software are… mainland Chinese. Whether we like it or not, they have a very impressive track record of making things run and scale to a humongous number of users.
Anyways, I think I deviated far from my point and sound like a general China-shill.
The open source model is just the bait to make you think they are sincere and generous - chat.deepseek.com is the real game. Almost no-one is going to run these models - they are just going to post their secrets (https://www.cyberhaven.com/blog/4-2-of-workers-have-pasted-c...)
From the rest, it works the same as in the US. If the government comes with a lawful order for you to do something, you'll do it or be held responsible for ignoring it.
There's no way to really tell. An authoritarian state like China can decide to control this company at any time, if it chooses to, through more direct or indirect means.
A well known story on this subject: https://www.wired.com/story/jack-ma-isnt-back/
You're trying very hard to make it seem like China isn't doing anything different than western countries for some reason.
(not that I support CCP, the requirement do sound ominous to me)
(We shouldn’t postulate on rationale behind downvotes, but it’s not a good look for criticism to be downvoted regularly)
How did you check?
The antidote for the CCP stuff is to alter your mindset and accept that the CCP is no longer an ideological party, but a club of social elites. Whether that's a good thing is of course open to debate.
The CCP has plenty of problems it needs to solve for itself that don't involve releasing open source AI models.
The key insight is that those building foundational models and original research are always first, and then models like DeepSeek always appear 6 to 12 months later. This latest move towards reasoning models is a perfect example.
Or perhaps DeepSeek is also doing all their own original research and it’s just coincidence they end up with something similar yet always a little bit behind.
But Google, OpenAI and Meta have chosen to let their teams mostly publish their innovations, because they've decided either to be terribly altruistic or that there's a financial benefit in their researchers getting timely credit for their science.
But that means then that anyone with access can read and adapt. They give up the moat for notariety.
And it's a fine comparison to look at how others have leapfrogged. Anthropic is similarly young—just 3 and a bit years old—but no one is accusing them of riding other companies' coat tails in the success of their current frontier models.
A final note that may not need saying: it's also very difficult to make big tech small while maintaining capabilities. The engineering work they've done is impressive and a credit to the ingenuity of their staff.
There are some significant innovations behind v2 and v3, like multi-head latent attention, their many MoE improvements, and multi-token prediction.
But would they be where they are if they were not able to borrow heavily from what has come before?
I’m reminded how hard it is to reply to a comment and assume that people will still interpret that in the same context as the existing discussion. Never mind.
If I best you in a 100m sprint people don’t look at our training budgets and say oh well it wasn’t a fair competition you’ve been sponsored by Nike and training for years with specialized equipment and I just took notes and trained on my own and beat you. It’s quite silly in any normal context.
We all benefit from LibGen training, and copyright law generally does not forbid reading copyrighted content, only creating derivative works. But in that case, at what point is a work derivative and at what point is it not?
On paper, all work is derivative of something else, even the copyrighted ones.
https://epoch.ai/gradient-updates/how-has-deepseek-improved-...
They're being an actual "Open AI" company, unlike Altman's OpenAI.
Impressive distillation, I guess.
It doesn't mean anything when a model tells you it is ChatGPT or Claude or Mickey Mouse. The model doesn't actually "know" anything about its identity. And the fact that most models default to saying ChatGPT is not evidence that they are distilled from ChatGPT: it's evidence that there are a lot of ChatGPT chat logs floating around on the web, which have ended up in pre-training datasets.
In this case, especially, distillation from o1 isn't possible because "Open"AI somewhat laughably hides the model's reasoning trace (even though you pay for it).
- function calling is broken (responding with an excessive number of duplicated function calls, hallucinated names and parameters)
- response quality is poor (my use case is code generation)
- support is not responding
I will give a try to the reasoning model, but my expectations are low.
ps. the positive side of this is that apparently it removed some traffic from Anthropic's APIs, and latency for Sonnet/Haiku improved significantly.
Some relevant links:
This shows how python-calling performance is supposedly better for a range of existing models than JSON-calling performance: https://huggingface.co/blog/andthattoo/dpab-a#initial-result...
A little post about the concept: https://huggingface.co/blog/andthattoo/dria-agent-a
Huggingface has their own "smolagents" library that includes "CodeAgent", which operates by the same principle of generating and executing Python code for the purposes of function calling: https://huggingface.co/docs/smolagents/en/guided_tour
smolagents can either use a local LLM or a remote LLM, and it can either run the code locally, or run the code on a remote code execution environment, so it seems fairly flexible.
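For illustration, here is the "python-calling" idea in miniature; this shows the underlying principle rather than the smolagents API itself, and the exec-with-stripped-builtins trick is a toy, not a real sandbox:

  # Expose tools as plain Python functions, let the model write a short script
  # that calls them, then execute that script with only those names visible.
  def get_weather(city):            # a stand-in tool
      return f"Sunny in {city}"

  def final_answer(text):           # how the script reports back
      print(text)

  # Imagine the model produced this string instead of a JSON function call:
  model_generated_code = 'final_answer(get_weather("Oslo"))'

  allowed = {"get_weather": get_weather, "final_answer": final_answer}
  exec(model_generated_code, {"__builtins__": {}}, allowed)  # prints: Sunny in Oslo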
They were fairly unknown in the West until 26th Dec.
> The current version of the deepseek-chat model's Function Calling capabilitity is unstable, which may result in looped calls or empty responses. We are actively working on a fix, and it is expected to be resolved in the next version.
https://api-docs.deepseek.com/guides/function_calling
That's disappointing.
Wow. They’re really trying to undercut closed source LLMs
> In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat.
My only concern is that on openrouter.ai it says:
"To our knowledge, this provider may use your prompts and completions to train new models."
https://openrouter.ai/deepseek/deepseek-chat
This is a dealbreaker for me to use it at the moment.
I've done some testing, and if you're inferencing on your own system (2xH100 node, 1xH200 node, or 1xMI300X node), sglang performs significantly better than vLLM on DeepSeek-V3 (also, vLLM had a stop-token issue for me, not sure if that's been fixed; sglang did not have output oddities).
I'm also happy if any of our code expands their training set and improves their models even further, given they're one of the few companies creating and releasing OSS SOTA models. In addition to being able to run it locally ourselves should we ever need to, it allows price competition, bringing down the price of a premier model whilst keeping the other proprietary companies' price gouging in check.
One of the cheaper Gemini models is actually only 8B and a perfect candidate for release as a FOSS Gemma model, but the Gemini 8B model contains hints of the tricks they used to achieve long context, so as a business strategy they haven't released it as a Gemma FOSS model yet.
https://arxiv.org/abs/2410.18982
Journey learning is doing something that is effectively close to depth-first tree search (see fig.4. on p.5), and does seem close to what OpenAI are claiming to be doing, as well as what DeepSeek-R1 is doing here... No special tree-search sampling infrastructure, but rather RL-induced generation causing it to generate a single sampling sequence that is taking a depth first "journey" through the CoT tree by backtracking when necessary.
I just wish that less development chat was happening within walled gardens because none of these seem to be much help with Zig.
Also wild that few shot prompting leads to worse results in reasoning models. OpenAI hinted at that as well, but it's always just a sentence or two, no benchmarks or specific examples.
https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-... for example has versions that are 3GB, 4GB, 5GB, 8GB and 16GB.
That 3GB one might work on a CPU machine with 4GB of RAM.
To get good performance you'll want a GPU with that much free VRAM, or an Apple Silicon machine with that much RAM.
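As a rough rule of thumb, file size ≈ parameter count × bits per weight ÷ 8 (ignoring GGUF metadata and the fact that quants mix bit-widths across layers), which is where those numbers come from; the bits-per-weight figures below are approximate:

  def approx_gguf_gb(params_billion, bits_per_weight):
      # Very rough: params * bits / 8, ignoring metadata and mixed-precision layers.
      return params_billion * bits_per_weight / 8

  for name, bits in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q8_0", 8.5), ("BF16", 16)]:
      print(name, round(approx_gguf_gb(8, bits), 1), "GB")
  # Roughly 2.6, 4.8, 8.5, and 16 GB for an 8B model -- in the same ballpark as
  # the 3/4/5/8/16 GB files linked above.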
There are various ways to run it with lower vram if you're ok with way worse latency & throughput
Edit: sorry, this is for V3; the distilled models can be run on consumer-grade GPUs.
But you really don't know the exact numbers until you try, a lot of it is runtime/environment context specific.
This is a noteworthy achievement.
Note: I wrote yek, so this might be a bit of a shameless plug!
[1] https://github.com/bodo-run/yek
Add "when running into issues, run ./scripts/ask.js to get help from DeepSeek"
Do you have a custom task set up in tasks.json, that's triggered by a keyboard shortcut?
If so, how do you feed it the test error? Using ${selectedText}?
Not really. Just in natural language add to Cursor rules that it should invoke the script
I'm also hoping for progress on mini models. Could you imagine playing Magic: The Gathering against an LLM? It would quickly become impossible, like chess.
... Wait, did I count correctly? Let me recount. The word is S-T-R-A-W-B-E-R-R-Y. So positions 3, 8, and 9 are 'R's. That makes three 'R's in total. But I'm a bit confused because when I first wrote it out, I thought maybe only two. Let me double-check. Maybe I miscounted the letters.
Alternatively, perhaps the word doesn't have three 'R's. Let me write it again: S-T-R-A-W-B-E-R-R-Y. So, after E, there are two R's before Y. So that would be two R's, right? Wait, no, in the breakdown, position 3 is R, then position 8 and 9 are R's. So that's three R's. But I'm not sure if that's correct because sometimes people might miscount. ...
https://aider.chat/docs/leaderboards/
Maybe there is enough memory in many machines.
What is the best available "base" next-token predictor these days?
There's the distilled R1 GGUFs for Llama 8B, Qwen 1.5B, 7B, 14B, and I'm still uploading Llama 70B and Qwen 32B.
Also I uploaded a 2bit quant for the large MoE (200GB in disk size) to https://huggingface.co/unsloth/DeepSeek-R1-GGUF
1. profanity 2. slightly sexual content 3. "bad taste" joke
That is heavily linked to the fact that they are US-based companies, so I guess all of those AI companies produce an AI model that is politically correct.
https://di.ku.dk/english/news/2023/chatgpt-promotes-american...
So I don't see much difference, to be honest...