Nuxt HN | Launch HN: Bloop (YC S21) – Code Search with GPT-4

Launch HN: Bloop (YC S21) – Code Search with GPT-4

Hey HN, we’re the cofounders of bloop (https://bloop.ai/), a code search engine which combines semantic search with GPT-4 to answer questions. We let you search your private codebases either the traditional way (regex or literal) or via semantic search, ask questions in natural language thanks to GPT-4, and jump between refs/defs with precise code navigation. Here’s a quick demo: https://www.loom.com/share/8e9d59b88dd2409482ec02cdda5b9185

Traditional code search tools match the terms in your query against the codebase, but often you don’t know the right terms to start with, e.g. ‘Which library do we use for model inference?’ (These types of questions are particularly common when you’re learning a new codebase.) bloop uses a combination of neural semantic code search (comparing the meaning - encoded in vector representations - of queries and code snippets) and chained LLM calls to retrieve and reason about abstract queries.

Ideally, a LLM could answer questions about your code directly, but there is significant overhead (and expense) in fine-tuning the largest LLMs on private data. And although they’re increasing, prompt sizes are still a long way off being able to fit a whole organisation’s codebase.

We get around these limitations with a two-step process. First, we use GPT-4 to generate a keyword query which is passed to a semantic search engine. This embeds the query and compares it to chunks of code in vector space (we use Qdrant as our vector DB). We’ve found that using a semantic search engine for retrieval improves recall, allowing the LLM to retrieve code that doesn’t have any textual overlap with the query but is still relevant. Second, the retrieved code snippets are ranked and inserted into a final LLM prompt. We pass this to GPT-4 and its phenomenal understanding of code does the rest.

Let’s work through an example. You start off by asking ‘Where is the query parsing logic?’ and then want to find out ‘Which library does it use?’. We use GPT-4 to generate the standalone keyword query: ‘query parser library’, which we then pass to a semantic search engine that returns a snippet demonstrating the parser in action: ‘let pair = PestParser::parse(Rule::query, query);’. We insert this snippet into a prompt to GPT-4, which is able to work out that pest is the library doing the legwork here, generating the answer ‘The query parser uses the pest library’.

You can also filter your search by repo or language - What’s the readiness delay repo:myApp lang:yaml. GPT-4 will generate an answer constrained to the respective repo and language.

We also know that LLMs are not always (at least not yet) the best tool for the job. Sometimes you know exactly what you’re looking for. For this, we’ve built a fast, trigram index based regex search engine based on Tantivy. Because of this, bloop is fast at traditional search too. For code navigation, we’ve built a precise go-to-ref/def engine based on scope resolution that uses Tree-sitter.

bloop is fully open-source. Semantic search, LLM prompts, regex search and code navigation are all contained in one repo: https://github.com/bloopAI/bloop.

Our software is standalone and doesn’t run in your IDE. We were originally IDE-based but moved away from this due to constraints on how we could display code to the user.

bloop runs as a free desktop app on Mac, Windows and Linux: https://github.com/bloopAI/bloop/releases. On desktop, your code is indexed with a MiniLM embedding model and stored locally, meaning at index time your codebase stays private. Indexing is fast, except on the very largest repos (GPU indexing coming soon). ‘Private’ here means that no code is shared with us or OpenAI at index time, and when a search is made only relevant code snippets are shared to generate the response. (This is more or less the same data usage as Copilot).

We also have a paid cloud offering for teams ($12 per user per month). Members of the same organisation can search a shared index hosted by us.

We’d love to hear your thoughts about the product and where you think we should take it next, and your thoughts on code search in general. We look forward to your comments!

264 points | by louiskw 784 days ago

To create a response to a user's natural language query, the code follows these steps: Parse the input query using parse_query(), which internally calls parser::parse_nl() to convert the natural language query into a structured format. Rephrase the query if necessary, or proceed with the initial query by extracting keywords using get_keywords(). Perform a search for relevant code snippets using search_snippets(), which calls the semantic.search() method to retrieve matching snippets. If needed, grow the relevant snippet using grow_snippet(), which extends the snippet's surrounding lines while maintaining its token count within a predefined limit. Construct an AnswerResponse object with the user ID, session ID, query ID, and the snippets obtained. Convert the AnswerResponse into a webserver::Response and return it to the user. Please note that this is a simplified explanation of the process. The actual code may have additional checks and logic.

26 comments

tkiolp4 784 days ago
I don’t understand these new saas that rely on GPT. It surely is my lack of understanding, but since the interface to GPT is mainly text: why any company would pay for N different GPT saas (e.g., one for searching for code, another for searching for documentation on confluence, another for allowing your business analysts with their queries, etc.)?
Wouldn’t it make more sense for any company to have only ONE interface to use GPT that is wired with some (or all) parts of the company digital assets? Text is already flexible enough to allow such a universal interface.
[-]
- tedsanders 784 days ago
  To me, your question seems to boil down to the economics of vertical integration.
  Two thoughts:
  First, GPT is all marginal cost, no fixed cost. So the marginal economics of 1 super app vs 10 specialized mini apps is roughly the same.
  Second, imagine the same argument applied to a utility like electricity. "Why would any company pay for lightbulbs from one company, HVAC from a second company, and appliances from a third company?"
  Early on, electricity was complicated and you might actually purchase all of these from the Edison Illuminating Company. But there are returns to specialization, and today the winning formula is different companies that specialize in each. The company best at manufacturing lightbulbs is not necessarily the company that is best at manufacturing washing machines.
  Similarly, you could ask: "Why would any company pay for a laptop from one company, an operating system from a second company, and office apps from a third company?"
  In some cases it may make sense to vertically integrate (e.g., Apple is happy to sell you a combined laptop + OS + Pages/Numbers), but in many cases the specialized players still do fine (e.g., you might buy a Dell laptop, a Microsoft OS, and Notion/Sheets).
  So I think it's very much an open question as to whether the winning approach will be a Swiss army knife (single product) or a toolkit (multiple products).
  [-]
  - rcme 784 days ago
    > Second, imagine the same argument applied to a utility like electricity.
    Isn’t that how electricity works? You purchase electricity directly from the producer, and buy additional plugins (literally) that work with the single source of electricity you’re purchasing.
    [-]
    - 8n4vidtmkvmk 784 days ago
      That would be akin to giving each of the 'plugins' your GPT API key and pay the usage bill directly to OpenAI. The plugin developers would still want $$$ for the app even if the GPT usage was excluded though.
      [-]
      - personjerry 783 days ago
        Wouldn't it be a one-time purchase then? "Buy my prompt for $10 that enables you to search your codebase"
        And we're back to Prompt Engineers :)
        [-]
        8n4vidtmkvmk 783 days ago
        I think an IDE plugin or standalone app takes slightly more work than writing a prompt, but sure
        elecush 782 days ago
        Yes, prompt engineering is the future.
  - ralfd 782 days ago
    Slight nitpicking:
    Edison used lots of different companies for their own business cases and I think the Edison Illuminating Company only built electric generators. So indeed in the beginning of electricity, when the field was most untested/starting up, you did shop at special companies for a generator or lamps or whatever instead of using an integrated whole-in-one solution of a mega corp! Wikipedia:
    > Edison Lamp Company, a lamp manufacturer in East Newark, New Jersey; Edison Machine Works, a manufacturer of dynamos and large electric motors in Schenectady, New York; Bergmann & Company, a manufacturer of electric lighting fixtures, sockets, and other electric lighting devices; and Edison Electric Light Company, the patent-holding company and financial arm for Edison's lighting experiments
    These were only later consolidated into General Electric.
  - danbala 784 days ago
    pretty quickly llvm/ai support is going to be more on the level of electricity itself. like a utility.
  - bko 784 days ago
    I think op means that a more general tool would cover all use cases. If there was a way to easily upload my code to the LLM and query it. At that point the service is providing a way to get your code to the LLM and a prompt, neither of which will be especially difficult in the future.
  - DalasNoin 784 days ago
    Have you tried just asking gpt4 to write a script that then implements the SAAS as a script that then calls gpt4? You should be careful about analogies, intelligence is special.
- hn_throwaway_99 784 days ago
  The point is that the interface to Bloop (or whatever 3rd party SaaS product) is just plain text. In programming parlance I would call this a facade [1] on top of GPT-4. The benefit is that you, the end user, can just continue typing things like "Where do we create our database pool?" in a search box, but under the covers Bloop can use GPT-4, but then eventually upgrade to GPT-5, GPT-6, or some other much better currently-super secret competitor version.
  1. https://en.wikipedia.org/wiki/Facade_pattern
  [-]
  - eddsh1994 784 days ago
    Agreed that may be the case but all OpenAI needs to do is release a nice UI for GPT and they've suddenly pushed these SaaS out of the market. Why fund miners over the companies making pickaxes?
    [-]
    - nico 784 days ago
      That will probably happen. But it will take some time. Twitter did that. At first they were super open let everyone build apps, then they started blocking and copying the most popular ones, or acquiring them.
      However, the companies will probably soon be able to switch out the LLM they are using, so potentially they could use Google’s PaLM, or Facebook’s Llama, or some open source LLM, who knows.
      They can also fine-tune the models for just-your-data, so that the responses for you are better than just using a general LLM.
    - vanjajaja1 784 days ago
      I think the key is that OpenAI wants to focus on building AGI and not waste time appeasing users requests for UX shortcuts for GPT powered code search. Sure, they can hire an extra team to manage the user's requests for UX, but at that point they're moving away from their target goal.
    - kubota 784 days ago
      That takes time and money on OpenAI's behalf. Alternatively they can aquire Bloop after Bloop does all the leg work of integrating their niche with chat GPT. But I agree its a tenuous place to be in as an investor / employee / cofounder of a startup that is at OpenAI's mercy. This model of relying on another vendor's product to base your own saas product on has worked in the past, and it has also failed in the past.
    - spyder 784 days ago
      or a few iteration down ChatGPT will create that UI too. In the future it will not be a "boring" text chat but an interface that always morphs into the specific task you are asking it about.
  - btbuildem 784 days ago
    Except in the case of GPT-x, the facade builder must take extra pains to sanitize the inputs, so it doesn't become just a free GPT account for the user to interact with on unrelated topics. That's much more challenging due to the open nature of the prompts (the selling feature). One really lame solution is to ask the user to use their own openai API key.
    [-]
    - sebzim4500 783 days ago
      I don't think that is a lame solution, it would work pretty well if OpenAI had better key management, especially spending limits per key.
      [-]
      - btbuildem 783 days ago
        If you're forcing your user to get their own OpenAI account, you're pushing them away from your product, at least somewhat. IMO the integration needs to be seamless for the illusion of "we're providing value here, not just wrapping GPT" to hold.
- s1k3s 784 days ago
  Investors want AI. Do you see how crappy their landing pages look? This one is pretty decent compared to what we've seen showcased here in the past months. It's as if someone at YC told them to ship it asap, with no consideration for anything other than "slap AI on it".
  [-]
  - intalentive 784 days ago
    MVP, be embarrassed of your first release, etc.
- 101008 784 days ago
  I would like also to understand the amount of startups funded in 2021 (or around) that relies on GPT-{n} now. They pivoted or found out that they solution was much easier with GPT-{n}?
  Also, why as a customer would care if you use GPT or something else? It is because it is a buzzword?
  [-]
  - dwaltrip 784 days ago
    People care if you use GPT because it has capabilities that other AIs don’t have.
- jstummbillig 784 days ago
  > why any company would pay for N different GPT saas
  Simple: You do it because you pay for what SaaS does, not what GPT does. A user is not interested in postgres or nginx, even though they are the tools used to build the tool they care about and pay for. If what the new tool uniquely does is not adding enough value for enough users, it's going to fail.
- jacobr1 784 days ago
  Some of these app are too niche, but maybe not all of them. There have been many attempts at a "unified enterprise search app" for example, but they all have generally failed. Maybe customer indexing + GPT + form fact (slackbot, gui, etc ..) is enough for a company. Vanilla ChatGPT probably isn't going do that.
  Or consider codex low-code type use cases. Yes you could generate generic code, but integrated into some kind of platform that knows what APIs your company uses (internal cataglog) and some kind of IAM/Auth platform, PAAS to host the results integrated into a single tool might make sense, especially if the prompts are already injecting the boilerplate about how to interact with the ecosystem.
  Or consider where just adding some kind of generative feature is just a _feature_ of broader product.
- karmasimida 784 days ago
  You are not wrong on this one.
  The moat they are going to build is going to incredibly ... low.
  The main intelligence is going to be provided by GPT models.
  [-]
  - sebzim4500 783 days ago
    Agree. The most useful moat you build as an AI-as-a-service company is detailed data on how good each of your past outputs were. E.g. Github presumably keeps track of which copilot completions end up being used and which don't. If you are just using the OpenAI API, you have basically no way of making use of that data for finetuning. You could use the data in order to finetune your own model in the future though, maybe that's the plan.
  - DriverDaily 784 days ago
    Well, one week after the API launch everyone's moat will be low.
    It will be interesting to see how fast these tools grow, and how they can add value beyond a purposeful UI.
- boplicity 783 days ago
  There's a lot of work you could theoretically put into automating many of the queries you would want to do with a GPT chat interface.
  For example, I'm no a medical diet and need a meal plan. This diet has quite a few restrictions.
  I could set up automated queries that produce a weekly meal plan, and grocery list, and also checks all of the ingredients against the allowable / not allowable food lists. This could include multiple queries, one to get an initial response, another to validate it, etc. Everything could be organized nicely, so I wouldn't have to input any text to get what I need -- I'd just click buttons. There's lots of ways this could be further customized to be more useful, such as saving recipes, generating new recipes based on old favorites, generating recipes based on foods already at home, etc. (Heck, you could wire it up to a camera in the fridge.)
  Anyways -- text input to get text output is the most basic form of this technology. There's a lot you can do with such text to make it easier to interact with for specific use cases.
- louiskw 783 days ago
  Most of bloop's interface is search results, which are driven by the LLM and have highlighted line ranges and go-to-ref/def. Down the line, many other LLM powered tasks will need specialised interfaces too, like showing a diff or a git history tree.
  This is quite different from a GPT tool for most other jobs, and I think having granular control of the interface layer certainly helps us ship a better product.
  That's not to say everything needs to be an app. If the output is just conversational text, it can and probably will be some kind of 'Alexa skill' like plugin.
- teaearlgraycold 784 days ago
  I think you could say similar things about cloud computing, but we pay for different apps that are all ultimately startups re-selling AWS resources.
- coffeebeqn 784 days ago
  If you can give it proprietary training that takes nontrivial work then I don’t see why it couldn’t be worthwhile. Collecting, curating, cleaning up, manipulating data to make the best model of X would easily be worthwhile to make and to then purchase for hundreds of companies.
  If it’s just appending a prompt to chatgpt then it’s certainly useless
- menzoic 783 days ago
  It seems you're doing the opposite of "missing the forest for the trees". You see the forest (the app) but you're not seeing the individual trees.
  The answer to why companies don't have one interface with GPT is that they would still need to invest time and money into building the application that uses GPT. Checkout the bloop source code to get an idea of what's needed https://github.com/bloopAI/bloop there are also many complex things to deal with. You can't just shove an entire codebase into GPT-4, it has limited ability to track context. The codebase needs to be indexed and stored in a way that sematic search can be done on it very quickly. Every app has their own unique challenges towards getting the data into GPT and making it fast.
  TL;DR companies are buying apps that use GPT-4 API, not just access to GPT-4. They'd rather buy the apps instead of building and maintaining them.
- elecush 782 days ago
  I think it's because variations are in themselves valuable new stand-alone applications.
- siva7 784 days ago
  What you're asking is like "Why should i pay for N different internet saas" because they share a common characteristic (e.g.they use a database) paired with the naivety where you don't understand the underlying complexity of a business application and think you can do this all by yourself
- wahnfrieden 784 days ago
  you're advocating for byok. saas choose not to use byok because they want to charge rent on your usage instead of making money in other ways.
li4ick 783 days ago
Ok, I'll be the bad guy and say that the linked demo on loom is just not impressive at all. It's just 10x slower than what a simple regex query would've answered. Again, I'll be the bad guy in this comment section, but ML-based search methods for code are just not that useful. They do sound nice for non-coders, that's about it. I also feel like I'm qualified to say this because I have first-hand experience building code search tooling at scale.
[-]
- boplicity 783 days ago
  > nice for non-coders, that's about it.
  I'm not a coder but I code; not sure if that makes sense. I sort of know a bit of regex, but find it utterly painful and not worth the time.
  I'm an amateur, effectively, with little time. I like AI tools for coding, because I can input a request, and get sample code. I know enough to be able to read most of the code that I get back, but not enough to be able to easily write such code on my own.
  These types of tools, in my opinion, have the potential to make people like me, and even people who are much less knowledgable than me, productive programmers. This could be transformative.
  Sure, we won't be as good or productive as real programmers, but that's beside the point.
  In other words, what you're referring to with "that's about it" could still transform quite a few lives.
  [-]
  - euroderf 782 days ago
    > I'm an amateur, effectively, with little time. I like AI tools for coding, because I can input a request, and get sample code.
    This sounds kinda like contemporary Chinese text entry. Nobody remembers exactly which strokes in what order, but can feed some parameters into an app and get back candidate characters.
    Of course the characters are a simple enumeration, while what you get from a coding-helper app is still a form of electronic hairball and requires a second look, and a third.
intalentive 784 days ago
Where to go next? A Toolformer that calls an AST parser. In VS Code I am running Copilot and IntelliSense side by side. Two competing tab completions -- ridiculous. They should be integrated. I want a Copilot that listens to the interpreter / compiler, so that it only makes correct suggestions.
[-]
- cjonas 784 days ago
  I can't stand how co-pilot hijacks the standard code completion, often overriding the typescript completion I actually want or generating something distracting.
  There should be a way to only run completions when prompted.
  Please upvote this issue if you run into the same problems: https://github.com/community/community/discussions/9817
  [-]
  - jatins 784 days ago
    I had the same issue of Copilot hijacking suggestions with Intellij family of IDEs. Apparently there is a setting, disabled by default, to show both the suggestions (copilot's and Intellij's). Not on my laptop but it's somewhere in
    Settings > languages and frameworks> Copilot
  - johnfn 784 days ago
    If copilot interrupts what typescript was going to complete, hit ctrl-space to trigger the typescript code completion.
- nico 784 days ago
  > I want a Copilot that listens to the interpreter / compiler, so that it only makes correct suggestions.
  And that builds a test first. Then iterates on the new code against the test until it passes.
  ChatTDD?
renewiltord 784 days ago
Very interesting. I want you to know that
> ‘Private’ here means that no code is shared with us or OpenAI at index time, and when a search is made only relevant code snippets are shared to generate the response. (This is more or less the same data usage as Copilot).
is the reason I went from "ha, cool tool" to "okay, let me go download it".
Quite surprised that you don't actually have a login wall to download. Missing an opportunity to track downloads, upsell to paid plan, etc. etc. imho.
[-]
- evrydayhustling 784 days ago
  I'm interested why this made a difference to you. If you are concerned about sharing code with Bloop/OpenAI, isn't sharing code during use as bad as sharing it at index time? The more you use it, the more of your code will be exposed...
  [-]
  - renewiltord 784 days ago
    Only code in the search space is. If it were an "index in the cloud" product, one mistake and I lose the codebase. If it is a "query exposes a portion", a mistake in where I use it exposes a piece.
    If I can contain the blast radius of my errors my error budget is more available.
    [-]
    - evrydayhustling 784 days ago
      Makes sense. But since you don't know what's in the search radius, you might still get surprised.
      [-]
      - renewiltord 784 days ago
        Ah, I haven't tried it yet, but I have repos I don't care about and repos I care about. And I'm hoping to hook it up to the repos I don't care about.
- purplecats 784 days ago
  at the cost of fewer downloads
dazbradbury 784 days ago
"Precise code navigation in 10+ languages helps you move quickly through refs and defs"
I can't find any mention of which languages are supported - can anyone point me in the right direction?
[-]
- bebrws 784 days ago
  I tried to implement this using OpenAI's embeddings then using cosign similarity between produces vectors instead of the vector DB ( as shown in OpenAI's example code here ). I do the same thing where I take the highest ranking code snippets (vectors) and include them in a prompt to ChatGPT with an original prompt. My code is a mess since I just hacked it together after a long day of work but its a small file. Like 100 lines.
  I used TreeSitter which I thought was pretty awesome though because it allows for parsing a TON of different languages. I had to parse the languages to create the different code snippet strings. I don't want to create a code snippet of half a function for example..
  So TreeSitter parses the code into an AST and I send each different AST node to OpenAI to get the vector (I optimized this so multiple nodes of the same AST type are combined). Send the prompt to OpenAI to get a vector. Find the most similar code snippets to the prompt and include them at the top of a prompt to ChatGPT.
  This is the same idea right? If anyones interested it can be found here: https://bbarrows.com/posts/using-embeddings-ada-and-chatgpt-...
  https://github.com/bebrws/openai-search-codebase-and-chat-ab...
  [-]
  - visarga 784 days ago
    I implemented this idea as well, but for PDF articles from arXiv. It's one application everyone is doing.
    Next level is to select your prompt demonstrations based on the user request. Demonstrations too can be chosen by cosine similarity. The more specific they are, the better. You can "train" such a model by adding more demonstrations, especially adding failing cases (corrected) as demos.
- ggordonhall 784 days ago
  We currently support Go, JS/TS, JSX/TSX, Python, Rust, C#, Java, C/C++
  [-]
  - teaearlgraycold 784 days ago
    This is a very competent set of supported languages for a startup. I think it would be stronger marketing to just list them out, and to omit the repetition of JSX/JS, TSX/TS.
    If I see "10+ languages" but it's actually 10, no wait, actually 8, then I'm just getting progressively let down.
    [-]
  - slackware_linux 784 days ago
    It would be nice to have this displayed somewhere prominently. I tried adding a lisp project and it sort of just spun around indefinitely until I clicked back.
    [-]
    - ggordonhall 784 days ago
      Yes I agree! I'll update the README. We'll be adding some more information to our website about how we implement code navigation soon.
dabei 784 days ago
This is awesome and has huge potential to improve developer productivity! The only thing that it gives me pause is that it requires a very broad Github authorization. Do you really need to "be able to read and write all public and private repository data"?
[-]
- x-complexity 784 days ago
  > This is awesome and has huge potential to improve developer productivity! The only thing that it gives me pause is that it requires a very broad Github authorization. Do you really need to "be able to read and write all public and private repository data"?
  Compared to a regular search engine, the permissions required are pretty much the same. Both this & regular search engines need to go through a repo's codebase to be even able to give results in the first place.
  Privacy-wise, they could probably make it better by requiring each repo to be approved before they can be searched, but that would make for a more friction-laden developer UX. The broad permissions are likely just a consequence of not wanting to ask the user every time a new repo is to be searched through.
  [-]
  - 8n4vidtmkvmk 784 days ago
    Why does it need "write" permission?
    [-]
    - x-complexity 782 days ago
      Truthfully, it doesn't need to. Only read permissions are required.
      It's possibly just a permission request mistake.
- louiskw 783 days ago
  For the desktop version we implemented GitHub OAuth using the device flow, so you can hold credentials on your local device. The tradeoff is there's no granular control of permissions, it has whatever access your account has.
  On bloop cloud we use the GitHub App permission system which is more granular and only request read access.

bjtitus 784 days ago

This looks great! Is there documentation in the repository that goes into greater detail about the query process? It might be nice to add some diagrams or other explanations about the architecture there.

[-]

knexer 784 days ago

Query flow looks to be roughly along the lines of:

- llm rephrases the last conversation entry into a standalone, context-free search query

- rephrased query is embedded, top-k results retrieved from the vector db

- llm selects a top-1 winner from the top-k results

- llm answers the question given conversational context and the top-1 code search result

(from https://github.com/BloopAI/bloop/blob/8905a36388ce7b9dadaedf...)

louiskw 783 days ago

That's correct. A good one to to get bloop to answer, here's what I got: https://imgur.com/a/nlx4YRq

Looking at the response and comparing, the biggest improvement would be crossing the file boundary in the explanation step.

duped 783 days ago
How do you manage "hallucinations?" One of the things I've noticed about using ChatGPT for programming tasks is that it will happily give you garbage.
[-]
- Alifatisk 782 days ago
  ChatGPT to my understanding is only meant for one purpose, to be extremely good at conversing. That’s it! It should be so good at it that it almost convinces you.
  OpenAi’s codex and the newer models should be used for actual coding related prompts.
elanzini 784 days ago
Congrats on the launch!
If multiple pieces of code from different files are being referenced in the response, it would be nice to have clickable refs that take you to that piece of code in the repo.
[-]
- ggordonhall 784 days ago
  Yes that's a great point. At the moment we in the 'Results' section we display the code snippets returned by the semantic search engine. These don't always line up with the code that GPT is referring to because we include as much of the file that the snippet is from in the prompt.
  We're working on a new interface to make this clearer, where we'll display the lines of code that GPT has referenced in its answer.
btbuildem 784 days ago
Tangential note to the person doing the video voiceover: pls process the audio a bit (high-pass filter at least?), or use higher a quality headset.
[-]
- louiskw 783 days ago
  Apologies, was very hastily recorded in between about a million other pre-launch tasks.
  I thought my Sennheiser Momentum 4's would do a better job, but even they were no match for a glass call booth.
- avinassh 784 days ago
  what is a high-pass filter?
  [-]
  - tb84 783 days ago
    What a coincidence, just learned about those in my EE class lol
    It filters out low frequencies (e.g. motorcycle) and only lets higher frequencies (like a human voice) be expressed
    Edit: mixed em up, rookie mistake :p
    [-]
    - Alifatisk 782 days ago
      What’s EE? Electrical Engineering?
  - btbuildem 783 days ago
    Sorry I meant a low-pass filter - a filter that cuts out high-frequency sounds, like hisses and hard reverbs/echos
swyx 784 days ago
congrats on launching!
i honestly feel like a bad user for this but i have yet to adopt a semantic search engine for code for some reason, respite codeium and sourcegraph also offering more advanced code search thingies. any ideas on how to break force of habit?
[-]
- sqs 784 days ago
  The burden is on those products to compel you, not you to break a force of habit. :)
  Code search historically has been adopted by <10% of devs, although it's usually binary within each company, with equilibriums at both ~1% adoption or >80%+ adoption. My model of LLMs applied to code search is that they make it so even a first-time user can use (easily, via the LLM) features that were previously only accessible to power users: regexps, precise code navigation, large-scale changes/refactors, diff search, other advanced search filters, and all kinds of other code metadata (ownership, dep graph, observability, deployment, runtime perf, etc.) that the code search engine knows. The code search engine itself is just used under the hood by the LLM to answer questions and write great code that uses your own codebase's conventions.
  I bet you (and the ~90% of devs who don't use code search today) will be compelled when you see and use /that/.
- louiskw 784 days ago
  That's the million $ question. Autocomplete (Copilot) is install and forget. We don't have that luxury, search has to give you that 'aha' moment the first time you use it.
  For one, having the engine respond in natural language makes a big difference. The last generation of semantic code search (we built one) used transformers to retrieve code, but as a user you'd still have to read through an entire code chunk to find the answer.
  Also, LLMs (probably starting with GPT-4) can now reason properly. The capability to make a search, read the results and execute an entirely new search based on its reasoning skills, and do this iteratively, until it finds the answer, is a huge jump from just using semantic search on its own.
amcaskill 784 days ago
Very excited to give this a spin. Congratulations!
Any plans to enable the tool to edit code for me? For example if I ask for a suggestion on how to fix something, could I just “accept” the suggestion and have the changes made to the file?
mehlmao 784 days ago
Do Bloop searches also include the license of the code it returns? For example, if a search returns code copied from a GPL 3 project, will it let users know they need to comply with the GPL?
[-]
- samwillis 784 days ago
  Unless I'm mistaken Bloop is about searching your own codebase, not public code. It would never bring back code that isn't already in your codebase.
  [-]
  - sitkack 784 days ago
    Being license aware is a good thing, lots of codebases uses many different licenses.
    I could also see this being useful for local code repos of open source code. I haven't looked, but it could be a combination of LXR (linux cross reference) project and an ML powered code understanding engine.
    [-]
    - louiskw 784 days ago
      Tracing the source of a result is one of the key advantages of code search vs generation.
      I think an easy way to read licenses from the snippet view is a good idea.
avinassh 784 days ago
Instead of asking for read permissions to repos, it is asking me to login to Github, via Bloop. Why such setup was done? Isn't just read access enough?
Freddie111 784 days ago
Congrats on the launch! How do you plan to solve distribution if you're not integrating with IDEs anymore?
meken 784 days ago
This is awesome. I wanted this yesterday
ccQpein 784 days ago
Very cool stuff, I was wondering if there is any tool like this days ago. Cannot wait to test it.
sorokod 783 days ago
Would be interesting to see precision and recall numbers - do you have any?
[-]
- louiskw 783 days ago
  To evaluate semantic search we generated synthetic questions on our own repos with davinci-003. We can probably generate tougher and more realistic queries with GPT-4, would like to re-create this and open source.
  We don't have precision/recall numbers for CodeSearchNet which is probably the biggest eval in this area.
andre-z 784 days ago
Congrats on the launch! And thanks for using Qdrant ;)
[-]
- ggordonhall 784 days ago
  Thanks! Qdrant is great :)
shashanoid 784 days ago
Cool, kinda like buildt.ai I used before. That cannot inject code. Can you do that? You're in YC I'd expect your products to better than what's out there.
[-]
- yieldcrv 784 days ago
  > You're in YC I'd expect your products to better than what's out there.
  That's .... not what YC means. It just means the current mission for this collection of shares fit in the direction of their portfolio, with a heavy weighting to the "culture fit" of the founding team. Just like all private equity.
- ggordonhall 784 days ago
  We currently support code search which you can access through a conversational interface (think Bing Chat for your code). We don't currently support editing code but are looking into a mechanism to create pull requests from bloop.
Invictus0 784 days ago
Does this have a future as a editor plugin?
tipsytoad 784 days ago
Nitpick: get started and pricing buttons don't collapse the burger menu
Also I assume I need an openai API key for this? I can't find that anywhere
[-]
- louiskw 784 days ago
  If you use the prebuilt desktop version or cloud, it uses our OpenAI key.
  And thanks, will update the site!
  [-]
  - awfulneutral 784 days ago
    I'm curious why you're using your key? I want to try this out on my work codebase, but for security reasons it would be better to use my own key, since I don't know if OpenAI might decide to share API search history with whoever owns the key.
    [-]
    - oefrha 784 days ago
      It wouldn’t be a free version if you have to bring your own key. I agree an option would be a good idea.
      [-]
      - crazysim 784 days ago
        I would love the option to be able to bring your own key. It won't be free, but I share the same concern but I am happy to pay for that.
aunch 784 days ago
From the Codeium (https://www.codeium.com) team here!
Exciting to see more teams take on the code search problem - modern pure regex definitely is a suboptimal solution.
Just to objectively highlight some of the similarities and differences between Bloop and Codeium Search (Codeium also provides Autocomplete) to a user:
- Both are free for individuals
- Both use a mix of AI and regex
- Bloop has both free local and paid cloud offerings, Codeium only works locally (but therefore doesn't work as well on super large codebases)
- Codeium uses custom proprietary embedding models so no dependency on OpenAI and no code is sent to a third party AI
[-]
- phphphphp 784 days ago
  Posting about your company when relevant is one thing, advertising it another’s launch thread is another, and it’s pretty gauche… especially when, in this very thread, one of your testimonials is saying he doesn’t actually use your product: https://news.ycombinator.com/item?id=35236557
  [-]
  - aunch 784 days ago
    sorry! on a previous post, someone encouraged me to just post when Codeium was relevant, didn't realize the norms around launch threads.
    and wrt the other comment, it is totally accurate - changing developer behavior for these search products is tough! we welcome any ideas :)
  - edanm 783 days ago
    I disagree, I am very happy that parent posted this. I want to know about other tools in this space, and the comment contains a nice summary of pros and cons, some of which are extremely relevant to me!
    Bloop looks awesome too, don't get me wrong, and I'll check it out.
  - replwoacause 783 days ago
    I don't see the issue with this. It happens on most of these Launch threads and is a common occurrence. I don't think there are rules that prohibit it either.
    [-]
    - dang 782 days ago
      There's no formal rule against it but it's often done in bad taste. When people overpromote their thing in someone else's launch thread, we tend to scold them. Someone else's launch is not a great place for competitive promotion—each product or project deserves its day in the sun. On the other hand, users like to discuss alternatives and comparables, and that seems healthy.
      I'd say the sweet spot is somewhere between just leaving the competitor's launch thread alone, or, if you must, then (1) mention it once and stop there; and (2) when you have any relationship with the alternative thing, disclose it.
      That's the sort of thinking we apply in practice but wouldn't make a formal rule out of, partly because it's always evolving, but mainly because we don't want the list of rules to be too long. If we tried to codify all such things we'd end up with a bureaucratic list of hundreds of rules—ugh!
- jackblemming 784 days ago
  Why do I care if they have custom proprietary embedding model or not? Why is OpenAI any less trust worthy than Codeium? I imagine people care about quality of results more, and I'm doubtful Codeium's proprietary embedding is better than GPT4.
  Also most of your comment history is self promotion for Codeium which is not a great look.
  [-]
  - aunch 784 days ago
    that's fair - the point was less about the model and more about the fact that your code is not sent anywhere because it all happens locally
    ill start commenting on other things - i usually just upvote on things that i find interesting but dont know enough to comment on
- awfulneutral 784 days ago
  Just to be clear, since you say it's local, when you say no code is sent to a third party AI that includes Codeium? For search and indexing, code is not sent over the network at all?
  [-]
  - aunch 784 days ago
    yes. indexing and search happens on your machine, nothing sent over network
ushakov 784 days ago
Will I have to send all of my code to OpenAI?
[-]
- sintezcs 784 days ago
  Did you read the post? ;)
brap 784 days ago
How does YC spit out so many GPT based startups recently, so soon after the LLM mainstream hype started (including ones that were founded years prior)?
Is Sam Altman still involved in YC?
[-]
- akavi 784 days ago
  I don't know if you need to invoke ulterior motives to explain startups jumping on the biggest change to the tech landscape since the iPhone.
- temp_account_32 784 days ago
  It's funny to see how half of HN frontpage in the past weeks has been Show HN thin wrappers around an API call to ChatGPT disguised as full startups.
  All while HN posters were warning us about a flood of low quality AI generated content, well, here it is in one of its forms.
- barefeg 784 days ago
  I see it from the other side. LLMs are so commoditized that it’s super easy to add GPT to anything you can think of.
- moneywoes 784 days ago
  Well looks like they are from last year so they must’ve modified their business
  [-]
  - louiskw 784 days ago
    We've been working on semantic code search since the start of 2021. You can see details our first iteration here https://www.producthunt.com/products/bloop-2#bloop-3
    The main shift was moving from the IDE to our own standalone app. We found the IDE sidebar to be too constrained visually for many code search related workflows.
- KRAKRISMOTT 784 days ago
  They probably pivoted halfway
- shashanoid 784 days ago
  Dude ofcourse what do you even mean lmao. Sama in the bones of YC
woeirua 784 days ago
I have to ask the obvious question, with GPT4 able to write code to solve the problem itself, why would I bother searching for existing code?
[-]
- awfulneutral 784 days ago
  GPT doesn't have access to your whole codebase so it can't help much in real world use cases.
- renewiltord 784 days ago
  Presumably you want to see how things are being used presently, otherwise you risk breaking it.
- motoxpro 784 days ago
  I’m assuming you just keep adding functions that you forgot you added already:
  numToString() numberToString() parseNumberToString()