Indexing Code at Scale with Glean

(engineering.fb.com)

129 points | by GavCo 3 days ago

9 comments

  • jtokoph 3 days ago
    I was really confused and surprised that Meta was using a commercial product for indexing instead of building in-house...until I realized that they weren't talking about the AI search indexing tool at glean.com
    • fintler 3 days ago
      glean.com is pretty awesome. The responses it generates will have citations from our internal Jira, Wiki, Slack, Github, etc.

      It's also great for when I get pulled into a busy Slack channel and need a summary of what's been going on in there for the past week.

      • tomerbd 2 days ago
        I'm a little bit confused: is it the open-source one that also searches Jira, Wiki, Slack, etc.? https://github.com/facebookincubator/glean ?
        • dijit 2 days ago
          I'm also confused, the link you shared is more akin to a sourcegraph alternative; but the parent is talking as if it's an LLM.

          I'm going to guess that there are two completely unrelated products that share a name.

          glean.com and your link (glean.software).

      • scrollaway 2 days ago
        What's the pricing on it? Everything I see is "contact us".
        • staindk 2 days ago
          Glean.com? We had an intro meeting with them; the pricing only makes sense if you're in a first-world country and have 100+ or maybe 150+ employees.

          I recall pricing started at 50k USD per year but may be remembering incorrectly. Please take this with a grain of salt as they may have changed their pricing models or whatever - I just get really annoyed at the "contact us" stuff so thought I'd try to help out here.

    • iandanforth 2 days ago
      Yeah this naming is questionable. This definitely introduces confusion in the minds of consumers but I'm not sure if it's actionable. Any lawyers want to give some "I am not your lawyer" opinions?
      • loeg 2 days ago
        Meta's tool was started by at least August 2021. The Glean commercial product wasn't launched until September 2021.
      • cma 2 days ago
        It is a high burden to get a trademark on a five-letter common English word. One is usually only awarded after years of use and widespread recognition.
  • conqrr 2 days ago
    Glean: https://glean.software/ System for collecting, deriving and querying facts about source code
    • lenkite 1 day ago
      The article never mentions that Glean is written in Haskell
  • tomas789 3 days ago
    This is certainly a step in the right direction. With the proliferation of AI-based assistants, there will be a greater need for readily available information about the codebase; this could easily take those copilots up yet another level.

    For example, my workflow now with Cursor is to keep relevant code open in separate tabs even though I don't work on those files. I've found it makes the autocomplete better, as it seems that all the active tabs are fed to the model. That means less space for me and more distraction. Glean might help here.

  • PessimalDecimal 2 days ago
    Google's equivalent to this is Kythe (https://kythe.io/). Earlier today I noticed that Kythe ripped out its support for indexing Rust code and wondered what alternatives might exist, so it's interesting to see this right now! And it looks like it supports Rust (albeit via rust-indexer).
  • rockwotj 2 days ago
    Are there any UIs for this available openly? Or for Glass? I'm a former Googler and I know how awesome this kind of tooling is, and how hard it is to achieve with OSS. I would love open-source code search. This seems very close, but there is no UI layer (and it seems like Meta uses this for code review and for IDEs); a basic UI would be a good start.
    • pepeiborra 2 days ago
      Some people use the Glass command-line client to integrate with Emacs/Vim/VSCode. Internally we have an LSP server that queries Glass, but it's not open source and some work would be needed to extract it. The only non-trivial thing it does is position mapping to account for local changes.

      The integrations for code review and symbol search are both built for internal tools and not amenable to open sourcing.

      FWIW I agree that the lack of open-source integrations is the main barrier to external adoption.
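
      For the curious, the position-mapping piece might look roughly like the Python sketch below. This is my own guess at the approach, not the internal implementation: shift line numbers from the indexed revision through a diff against the locally modified file.

        import difflib

        def map_line(indexed_text: str, local_text: str, indexed_line: int):
            """Map a 1-based line number from the indexed revision of a file to the
            corresponding line in the locally modified copy, or None if that line
            was edited or deleted locally. A sketch, not the internal implementation."""
            old = indexed_text.splitlines()
            new = local_text.splitlines()
            matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
            i = indexed_line - 1  # 0-based index into the indexed revision
            for tag, a1, a2, b1, b2 in matcher.get_opcodes():
                if a1 <= i < a2:
                    if tag == "equal":
                        return b1 + (i - a1) + 1  # same offset within an unchanged block
                    return None  # the line was changed or removed locally
            return None

      With something like that, results from the server-side index (which only knows about committed code) can still point at the right place in a developer's working copy.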

      • rockwotj 1 day ago
        Yeah, understood that the internal Meta stuff would be too tied up with internal infra to be OSS; it's the same with a lot of Google's tooling here.

        I do wish there was a startup here. There is Sourcegraph, which has OK code search (GitHub has come a long way, but without indexing and understanding the build you can't do it justice). There are also cool code review startups like Graphite, but they don't work together. I remember how powerful it was to review a change and then use an xref to see how a function is used, even when it's untouched in the code review and so doesn't show in the diff; in OSS land that requires checking out the changes locally and context switching to leave comments.

  • YetAnotherNick 2 days ago
    There are already three popular products named Glean, with the domains .com, .ai, and .co. This is the glean with the .software domain.
  • jepler 2 days ago
    My mind just balks at the idea of having so much source that a 2020s computer could take hours to index it. ctags is nothing special (both in terms of optimization and the level of detail it gets to: just global function identifiers), and it looks like it runs at about 400 MB/s on a single core of an i5-1235U. But still, it looks like ctags could process about 100 TB in 4 hours across 16 threads on a workstation-class CPU...
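
    Spelling out that back-of-envelope estimate in Python (the throughput and scaling numbers are my own assumptions, not measurements):

      # Rough capacity estimate, assuming ~400 MB/s per core and linear
      # scaling across 16 threads for 4 hours.
      per_core_bytes_per_sec = 400e6
      threads = 16
      seconds = 4 * 3600
      total = per_core_bytes_per_sec * threads * seconds
      print(f"{total / 1e12:.0f} TB")  # ~92 TB, i.e. on the order of 100 TB
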
    • DylanSp 2 days ago
      It sounds like the indexing time/complexity is increased a lot by the amount of detailed data they're storing. They mention determining which `using` statement is used to resolve each symbol reference in C++ source, to enable dead code detection; that's going to require some sophisticated analysis.
      • menaerus 2 days ago
        Correct, you need to build an AST representation of the code that you want to index. Essentially it's a compiler frontend pass, which is why it takes so much longer than ctags' heuristics. Now think millions of lines of code, multiple build configurations, the amount of RAM you need, etc. Multiple branches, or even smaller revisions/commits, are also a big computation problem.

        That said, Glean seems to be reusing the indexer from LLVM/clang for C and C++.

        > The C++ indexer ("the clang indexer") is a wrapper over clang. The clang indexer is a drop in replacement for the C++ compiler that emits Glean facts instead of code. The wrapper is linked against libclang and libllvm.

        [1] https://glean.software/docs/indexer/cxx
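
        To illustrate the "compiler frontend that emits facts instead of code" idea, here's a minimal sketch using the Python libclang bindings (clang.cindex). It only records definitions and their locations, nothing like Glean's real indexer or schema:

          import sys
          import clang.cindex  # Python bindings for libclang

          def emit_facts(path, compile_args):
              """Parse one translation unit and emit simple 'facts' about the
              definitions it contains. Illustrative only; a real indexer also
              records references, using-declarations, templates, etc."""
              index = clang.cindex.Index.create()
              tu = index.parse(path, args=compile_args)
              facts = []
              for cursor in tu.cursor.walk_preorder():
                  if cursor.is_definition() and cursor.spelling:
                      facts.append({
                          "kind": str(cursor.kind),
                          "name": cursor.spelling,
                          "file": str(cursor.location.file) if cursor.location.file else None,
                          "line": cursor.location.line,
                      })
              return facts

          if __name__ == "__main__":
              # e.g. python emit_facts.py foo.cpp -std=c++17 -Iinclude
              for fact in emit_facts(sys.argv[1], sys.argv[2:]):
                  print(fact)

        The expensive part is that each parse is effectively a full compiler frontend run per translation unit, per build configuration, which is where the hours go.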

    • UltraSane 2 days ago
      The whole point of indexing data is to perform very expensive computation once and leverage the result many, many times, and it works really well.
    • phyrex 2 days ago
      It's a monorepo across a dozen languages (good luck with ctags) that tens of thousands of developers commit to every day. Even if you spent the hours indexing it locally, it would be out of date right away.
    • kllrnohj 2 days ago
      You kinda said it yourself already - ctags is fast because it's producing almost nothing of value. Being fast at doing nothing isn't impressive.

      Try doing the same with C++ and more indexing options enabled, such as with something like universal-ctags, and a larger code base; Android's repository ought to do it. Are you still getting 400 MB/s? Nope.

  • tonymet 2 days ago
    My favorite feature of code indexing at FB was how well integrated it was. Web search, CLI search, and IDE search all used the search index, but would reference your local context. This was useful for reference, call-stack, and dead-code search.

    E.g. search results from IDE search would link back to your local file, and CLI results would reference your local clone.

    A great example of a small feature resulting in great usability.
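
    If I had to guess at the mechanics (a guess, not how FB actually wires it up), the index returns repo-relative paths and each client simply re-roots them onto whatever clone it's running in, roughly like this sketch (paths are made up for illustration):

      from pathlib import Path

      def to_local(repo_relative_path, line, clone_root):
          """Turn a repo-relative search result into a local file location.
          Hypothetical helper; paths below are invented for illustration."""
          return f"{Path(clone_root) / repo_relative_path}:{line}"

      # to_local("lib/foo/Bar.php", 42, "/home/me/my-clone")
      #   -> "/home/me/my-clone/lib/foo/Bar.php:42"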

    • Nathanba 2 days ago
      By IDE search, do you mean that it was using Glean even in your local VSCode? Does Glean therefore work in combination with LSPs? SCIP says that code modifications are a non-goal, and now I wonder why somebody would create such a big tool only for local code to still just use LSPs and never use the server version of code navigation (SCIP or Glean).
      • tonymet 2 days ago
        It used the server version and references mapped back to the local clone when you clicked on a result.
  • archy_ 2 days ago
    When I read about these things, I can't help but wonder if anybody took a step back and thought, "maybe we just have too much code"?

    At some point, perhaps you're just doing too much

    • nthingtohide 2 days ago
      There isn't too much code till the point we have automated asteroid mining.
    • dboreham 2 days ago
      Career limiting thoughts.
      • sangnoir 2 days ago
        Product and profit limiting too, if you're deleting profitable code for aesthetic reasons.