Open Deep Research

(github.com)

395 points | by transpute 263 days ago

12 comments

aubanel 263 days ago
Hi all! Aymeric (m-ric) here, maintainer of smolagents and part of the team who built this. Happy to see this interesting people here!
Few points:
- open Deep Research is not a production app, but it could easily be productionized (would need to be faster + good UX).
- As the GAIA score of 55% (not 54%, that would be lame) says, it's not far from the Deep Research score of 67%. It's also not there yet: I think the main point of progress is to improve web browsing. We're working on integrating vision models (for now we've used a text browser developed by the Microsofit autogen team, congrats to them) because it's probably the best way to really interact with webpages.
- Open Deep Research is built on smolagents, a library that we're building, for which the core is having agents that write their actions (tool calls) in code snippets instead of the unpractical JSON blobs + parsing that everyone incl OpenAI and Anthropic use for their agentic/tool-calling APIs. Don't hesitate to go try out the lib and drop issues/PRs!
- smolagents does code execution, which means "danger for your machine" if ran locally. We've railguardeed that a bit with our custom python interpreter, but it will never be 100% safe, so we're enabling remote execution with E2B and soon Docker.
[-]
- transpute 263 days ago
  > smolagents does code execution, which means "danger for your machine" if ran locally. We've railguardeed that a bit with our custom python interpreter, but it will never be 100% safe, so we're enabling remote execution with E2B and soon Docker.
  Those remote interfaces may also work with local VMs for isolation.
  [-]
  - Paul-Craft 263 days ago
    Yeah, that's what I was thinking: just throw the whole lot inside a Docker container and call it a day. Unless you're dealing with potentially malicious code that could break out of a container, that should isolate the rest of your machine sufficiently.
    Alternatively, PyPy is actually fully sandboxable.
    On Linux, you can also use `seccomp.` See, for instance, https://healeycodes.com/running-untrusted-python-code
- davidsojevic 263 days ago
  Great work on this, Aymeric and team! In terms of improving browsing and/or data sources, do you think it might be worth integrating things like Google Scholar search capability to increase the depth of some of the research that can be done?
  It's something I'd be happy to explore a bit if it's of interest.
  [-]
  - aubanel 263 days ago
    That's a good idea! Could be a very nice tool to add to the lib!
- swyx 263 days ago
  > for now we've used a text browser developed by the Microsofit autogen team, congrats to them
  oh super cool! i've usually heard it the other way - people develop LLM-friendly web scrapers. i wrote one for myself, and for others there's firecrawl and expand.ai. a full "text browser" (i guess with rendering?) run locally seems like a better solution for local agents.
  [-]
  - ComputerGuru 263 days ago
    It’s basically a cli controlling selenium/webdriver driving Chrome + a few functions.
- jsemrau 263 days ago
  I think using vision models for browsing is the wrong approach. It is the same as using OCR for scanning PDFs. The underlying text is already in digital form. So it would make more sense to establish a standard similar to meta-tags that enable the agentic web.
  [-]
  - webmaven 263 days ago
    If you're working from the markup rather than the appearance of the page, you're probably increasing the incentives for metacrap, "invisible text spam" and similar tactics.
  - taneq 263 days ago
    PDFs are more akin to SVG than to a Word document, and the text is often very far from “available”. OCR can be the only way to reconstruct the document as it appears on screen.
  - jejeyyy77 262 days ago
    no, websites/pdfs were designed and laid out visually by humans for humans.
    if you are just parsing the text you’ve lost a ton of information encoded in the layout/formatting.
    that doesn’t even yet consider actual visual assets like graphs/images, etc
transpute 263 days ago
https://techcrunch.com/2025/02/04/hugging-face-researchers-a...
> On GAIA, a benchmark for general AI assistants, Open Deep Research achieves a score of 54%. That’s compared with OpenAI deep research’s score of 67.36%..Worth noting is that there are a number of OpenAI deep research “reproductions” on the web, some of which rely on open models and tooling. The crucial component they — and Open Deep Research — lack is o3, the model underpinning deep research.
Blog post, https://huggingface.co/blog/open-deep-research
[-]
- swyx 263 days ago
  theres always a lot of openTHING clones of THING after THING is announced. they all usually (not always[1]!) disappoint/dont get traction. i think the causes are
  1. running things in production/self hosting is more annoying than just paying like 20-200/month
  2. openTHING makers often overhype their superficial repros ("I cloned Perplexity in a weekend! haha! these VCs are clowns!") and trivializing the last mile, most particularly in this case...
  3. long horizon planning trained with RL in a tight loop that is not available in the open (yes, even with deepseek). the thing that makes OAI work as a product+research company is that products are never launched without first establishing a "prompted baseline" and then finetuning the model from there (we covered this process in https://latent.space/p/karina recently) - which becomes an evals/dataset suite that eventually gets merged in once performance impacts stabilize
  4. that said, smolagents and HF are awesome and I like that they are always this on the ball. how does this make money for HF?
  ---
  [1]: i think opendevin/allhands is a pretty decent competitor to devin now
  [-]
  - littlestymaar 263 days ago
    Except that in this particular case (like in many others as far as AI goes, actually), the open version came before, by three full months: https://www.reddit.com/r/LocalLLaMA/comments/1gvlzug/i_creat...
    OpenAI pretty much never acknowledge prior art in their marketing material because they want you to believe they are the true innovators, but you should not take their marketing claims for granted.
    [-]
    - transpute 263 days ago
      Thanks for the pointer, https://github.com/TheBlewish/Automated-AI-Web-Researcher-Ol...
  - janalsncm 263 days ago
    > RL in a tight loop that is not available in the open
    Completely agree that a real RL pipeline is needed here, not just some clever prompting in a loop.
    That being said, it wouldn’t be impossible to create a “gym” for this task. You are essentially creating a simulated internet. And hiding a needle is a lot easier than finding it.
    [-]
    - pixl97 263 days ago
      I think models will have to some kind of internal training to teach them they are agents that can come back and work on things.
      Working on complex problems tends to explode in to a web of things that needs done. You need to be able to separate these in to subtasks and work on them semi-independently. In addition when a subtask gets stuck in a loop, you need to work on another task or line of thought, and then come back and 're-run' your thinking to see if anything changed.
      [-]
      - janalsncm 263 days ago
        The idea of reinforcement learning is that for some things it is hard to give an explicit plan for how to do something. For example, many games. Recently, DeepSeek showed that it worked for certain reasoning problems too, like leetcode problems.
        Instead, RL just rewards the model when it accomplishes some measurable goal (like winning the game). This works for certain types of problems but it’s pretty inefficient because the model wastes a lot of time doing stuff that doesn’t work.
  - kolinko 263 days ago
    As for them not getting traction - many projects just don’t advertise which tech they use.
    Also, having an open source alternative - even if slightly worse - gives projects an alternative if OpenAI decides to cut them off or sth.
    What I tell startups is to start off with whatever OpenAI/Anthropic have to offer, if they can, and consider switching into Open Source only if it allows for something they can’t do (often in gen graphics), or when they reach product market fit and they can finetune/train a smaller model that handles their specific case better and cheaper.
    [-]
    - rfoo 263 days ago
      > if it allows for something they can’t do (often in gen graphics)
      What makes gen graphics stand out?
      [-]
      - edude03 263 days ago
        Porn.
        When I first looked into stable diffusion I wanted to see what people were making with it and there was a few sites that showed stuff people had generated and it was 70% porn 29% high fantasy wallpapers (numbers are illustrative). Recently I've been looking at different text generation inference platforms like tgi and vllm on reddit, and 1 in 5 posts in the localllm subreddits are "What's the best model for erotic role play?"
        [-]
        rfoo 262 days ago
        Definitely.
        My question was, porn (or other censored thing) could be in text, too. I don't understand why those who want censored content are primarily interested in graphics.
      - alumic 263 days ago
        I’d venture to say it’s the range:
        What style? Is it a texture? If so, I’ll need a model that can generate a large tiled image. Is it a logo? What kind? Is it badged, vintage, corporate, etc
  - shekhargulati 263 days ago
    > running things in production/self hosting is more annoying than just paying like 20-200/month
    This is an important point. As people rely more on AI/LLM tools, reliability will become even more critical.
    In the last two weeks, I've heavily used Claude and DeepSeek Chat. ChatGPT is much more reliable compared to both.
    Claude struggles with long-context chats and often shifts to concise responses. DeepSeek often has its "fail whale" moment.
    [-]
    - GTP 263 days ago
      > In the last two weeks, I've heavily used Claude and DeepSeek Chat. ChatGPT is much more reliable compared to both
      Which reliability problems did you face? I heard about connection issues due to too much traffic with Deepseek, but those would go away if you self-host the model.
      [-]
      - edude03 263 days ago
        Obviously the reliability problems would go away if you self-host but the "point" is most people rely on external providers because they can't locally run models of similar quality. So what do you do if deepseek cuts you off? For most getting 12? H100s (for the 671b) is essentially impossible.
        [-]
        creamyhorror 262 days ago
        You use a Deepseek model not hosted by Deepseek, but another provider (e.g. DeepInfra currently). Hopefully a robust provider market will emerge and thrive, even if open models start thinning out.
  - tshepom 263 days ago
    Maybe these open projects start to get more attention when we have a distribution system/App Store for AI projects. I know YC is looking to fund this https://www.ycombinator.com/rfs
    [-]
    - transpute 263 days ago
      Is "AI appstore" envisioned on Linux edge inference hardware, e.g. PC+NPU, PC+GPU, Nvidia Project Digits? Or only in the cloud?
      Apple probably wouldn't accept a 3rd-party AI app store on MacOS and iOS, except possibly in the EU.
      If antitrust regulation leads to Android becoming a standalone company, that could support AI competition.
- transpute 263 days ago
  Another project, https://github.com/jina-ai/node-DeepResearch
```
  gemini for llm
  brave/duckduckgo for search
  jina reader for reading a webpage
```
flaviuspopan 263 days ago
I signed up for Gemini Advanced to get access to Deep Research on Feb 1st and felt on top of the world. So cool, so advanced.
Then OpenAI announced theirs on the 2nd: https://openai.com/index/introducing-deep-research/
Ethan Mollick called Google's undergraduate level and OpenAI's graduate level on the 3rd: https://www.oneusefulthing.org/p/the-end-of-search-the-begin...
And now this. I can't stop thinking about The Onion Movie's "Bates 4000" clip:
https://www.youtube.com/watch?v=9JCOBPMIgAA
[-]
- brcmthrowaway 263 days ago
  My X feed is 90% AI Influencers like this saying "I'm blown away! Check this out.."
  Is this a new cottage industry? Are they making money?
  [-]
  - lysecret 263 days ago
    Yea x and YouTube. Also the YouTube meta specifically seems to have moved to these sensationalist over exaggeration and over simplification. Really would be nice to have a sensibility filter somehow. It’s gonna come with ai.
    [-]
    - gusmally 263 days ago
      I like AI Explained on Youtube for non-sensationalist and thoughtful takes: https://www.youtube.com/@aiexplained-official
  - egillie 263 days ago
    I’ve heard it’s all the crypto influencers jumping into the AI hype
- gmerc 263 days ago
  Mollik is the great sense-maker. Whatever he says is usually wrong
tkellogg 263 days ago
it's just an example, but it's great to see smolagents in practice. I wonder how well the import whitelist approach works for code interpreter security.
[-]
- tptacek 263 days ago
  I know some of the point of this is running things locally, but for agent workflows like this some of this seems like a solved problem: just run it on a throwaway VM. There's lots of ways to do that quickly.
  [-]
  - ATechGuy 263 days ago
    VM is not the right abstraction because of performance and resource requirements. VMs are used because nothing exists that provides same or better isolation. Using a throwaway VM for each AI agent would be highly inefficient (think wasted compute and other resources, which is the opposite of what DeepSeek exemplified).
    [-]
    - tptacek 263 days ago
      To which performance and resource requirements are you referring? A cloud VM runs as long as the agent runs, then stops running.
      [-]
      - ATechGuy 263 days ago
        I mean performance overheads of an OS process running in a VM to (vs no VM) and additional resource requirements for running a VM, including memory and additional kernel. You can pull relevant numbers from academic papers.
        [-]
        transpute 263 days ago
        A linear bar graph comparing compute/memory requirements?
        - OS process - virtual machine - LLM inference
        Could have longevity as PC master race meme template.
        tptacek 263 days ago
        OK. Thanks for clarifying. I think you're pretty wrong on this one, for what it's worth.
    - vineyardmike 263 days ago
      Is “DeepSeek” going to be the new trendy way to say to not be wasteful? I don’t think DS is a good example here. Mostly because it’s a trendy thing, and the company still has $1B in capex spend to get there.
      Firecracker has changed the nature of “VMs” into something cheap and easy to spin up and throw away while maintaining isolation. There’s no reason not to use it (besides complexity, I guess).
      Besides, the entire rest of this is a python notebook. With headless browsers. Using LLMs. This is entirely setting silicon on fire. The overhead from a VM the least of the compute efficiency problems. Just hit a quick cloud API and run your python or browser automation in isolation and move on.
      [-]
      - ATechGuy 263 days ago
        I think you are assuming that inference happens on the same machine/VM that executes code generated by an AI agent.
      - tptacek 263 days ago
        I'm not even talking about Firecracker; for the duration of time things like these run, you could get a satisfactory UX with basic EC2.
  - cma 263 days ago
    The rise of captchas on regular content, no longer just for posting content, could ruin this. Cloudflare and other companies have set things up to go through a few hand selected scrapers and only they will be able to offer AI browsing and research services.
    [-]
    - tptacek 263 days ago
      I think the opposite problem is going to occur with captchas for whatever it's worth: LLMs are going to obsolete them. It's an arms race where the defender has a huge constraint the attacker doesn't (pissing off real users); in that way, it's kind of like the opposite dynamics that password hashes exploit.
      [-]
      - anon373839 263 days ago
        I’m not sure about that. There’s a lot of runway left for obstacles that are easy for humans and hard/impossible for AI, such as direct manipulation puzzles. (AI models have latency that would be impossible to mask.) On the other hand, a11y needs do limit what can be lawfully deployed…
        [-]
        jgrahamc 263 days ago
        There’s a lot of runway left for obstacles that are easy for humans and hard/impossible for AI, such as direct manipulation puzzles.
        That's irrelevant. Humans totally hate CAPTCHAs and they are an accessibility and cultural nightmare. Just forget about them. Forget about making better ones, forget about what AI can and can't do. We moved on from CAPTCHAs for all those reasons. Everyone else needs to.
        [-]
        nerdralph 263 days ago
        Agreed. When I open a link and get a Cloudflare CAPTCHA I just close the tab.
        [-]
        jgrahamc 263 days ago
        We eliminated all CAPTCHA use at Cloudflare in September 2023: https://blog.cloudflare.com/turnstile-ga/
        [-]
        nerdralph 262 days ago
        OK, what you now call turnstyle. If I get one of those screens I just close the tab rather than wait several seconds for the algorithm to run and give me a green checkbox to proceed.
        pixl97 263 days ago
        >AI models have latency
        So do humans, or can my friend with cerebral palsy not use the internet any longer?
        [-]
        anon373839 263 days ago
        Totally different type of latency. A person with a motor disability dragging a puzzle piece with their finger will look very different from an AI model being called frame by frame.
        [-]
        pixl97 263 days ago
        Round 2: Begin
        https://www.youtube.com/watch?v=WqnXp6Saa8Y
        [-]
        anon373839 262 days ago
        That's a great video. To be clear, I'm not defending Captchas - I just don't know if I believe they're dead yet.
      - cma 263 days ago
        Cloudflare is more than captchas, it's centralized monitoring of them too: what do you think happens when your research assistant solves 50 captchas in 5 min from your home IP? It has to slow down to human research speeds.
    - AznHisoka 263 days ago
      What about Cloudflare itself? It might constitute an abuse of sorts of their leadership position, but couldn’t they dominate the AI research/agent market if they wanted? (Or maybe that’s what you were implying too)
rvz 263 days ago
Of course. The first of many open source versions of 'Deep Research' projects are now appearing as predicted [0] but in less than a month. Faster than expected.
Open source is already at the finish line.
[0] https://news.ycombinator.com/item?id=42913379
[-]
- littlestymaar 263 days ago
  > The first of many open source versions of 'Deep Research' projects are now appearing as predicted [0] but in less than a month. Faster than expected.
  Well, in that particular case the open source version was actually here first three month ago[1].
  [1]: https://www.reddit.com/r/LocalLLaMA/comments/1gvlzug/i_creat...
- transpute 263 days ago
  > Nothing that Perplexity + DeepSeek-R1 can already do
  Any public comparisons of OAI Deep Research report quality with Perplexity + DeepSeek-R1, on the same query?
  How do cost and query limits compare?
  [-]
  - lhl 263 days ago
    I've been using GenSpark.ai for the past month to do research (its agents usually does ~20 minutes, but I've seen it go up to almost 2 hours on a task) - it uses a Mixture of Agents approach using GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro and searches for hundreds of sources.
    I reran some of these searches and I've so far found OpenAI Deep Research to be superior for technical tasks. Here's one example:
    https://chatgpt.com/share/67a10f6d-28cc-8012-bf98-05dcdb705c... vs https://www.genspark.ai/agents?id=c896d5bc-321b-46ca-9aaa-62...
    I've been giving Deep Research a good workout, although I'm still mystified if switching between the different base model matters, besides o1 pro always seeming to fail to execute the Deep Research tool.
    [-]
    - Terretta 263 days ago
      > still mystified if switching between the different base model matters, besides o1 pro always seeming to fail to execute the Deep Research tool.
      You mean when it says it's going to research and get back to you and then ... just doesn't?
      [-]
      - lhl 263 days ago
        Yeah, it seems to not be able to execute the tool calling properly. Maybe it's a bad interaction w/ it's own async calling ability or something else (eg, how search and code interpreter can't seem to run at the same time for 4o)
taurknaut 263 days ago
Somehow "deep" has acquired the connotation of "bullshit". It's a shame—I'm very bullish on LLM technology, but capital seems determined to diminish every angle of potential by blatantly lying about its capabilities.
Granted, this isn't an entrepreneurial venture, so maybe it has some value. but this still stinks to high heaven of saving money rather than producing value. One day we'll see AI produce stuff of value that humans can't already do better (aside from playing board games), but that day is still a long way off.
lasermike026 263 days ago
I think I'm in love.
ai-christianson 263 days ago
[flagged]
[-]
- benatkin 263 days ago
  This is just an AI summary.
  [-]
  - transpute 263 days ago
    and/or an advertisement
bossyTeacher 263 days ago
So basically, Altman announced Deep Research less than a month ago and open-source alternatives are already out? Investors are not going to be happy unless OpenAI outperforms them all by an order of magnitude
[-]
- mrtesthah 263 days ago
  Wasn’t this technique previewed by Google Gemini 2.0 first?
  [-]
  - jcims 263 days ago
    Hilarious how few folks reference this. I’ve found it to be pretty good!
    [-]
    - vietvu 258 days ago
      Because no one use Gemini Advanced.
- energy123 263 days ago
  You're being slippery with language. The open source version doesn't get 26% on Humanity's Last Exam. It's not "already out".
  [-]
  - taurknaut 263 days ago
    > Humanity's Last Exam
    Good grief, AI researchers need to learn basic humility if they want to market their tech successfully. Unless they're toppling our states and liberating us from capital (extremely difficult to imagine) I have a difficult time imagining any value or threat AI could provide that necessitates this level of drama.
    [-]
    - sReinwald 263 days ago
      Constantly warning about how your product is or will eventually be so powerful that it will become a danger to humanity or society is the successful marketing.
      And for the big AI companies, like OpenAI, it has the very beneficial side-effect of establishing the narrative that lets them influence politics into regulating their potential competitors out of the market. Because they are, of course, the only reasonable and responsible builders of self-described doomsday devices.
yobid20 263 days ago
We need an OpenOpenAI to open source OpenAI who should actually be called ClosedAI, since there's nothing open about them other than their banks to take all your money.