Indexing Code at Scale with Glean

(engineering.fb.com)

129 points | by GavCo 3 days ago

9 comments

  • jtokoph 3 days ago
    I was really confused and surprised that Meta was using a commercial product for indexing instead of building in-house...until I realized that they weren't talking about the AI search indexing tool at glean.com
    • fintler 3 days ago
      glean.com is pretty awesome. The responses it generates will have citations from our internal Jira, Wiki, Slack, Github, etc.

      It's also great for when I get pulled into a busy Slack channel and need a summary of what's been going on in there for the past week.

      • tomerbd 2 days ago
        I'm a little bit confused: is it the open-source one that also searches Jira, Wiki, Slack, etc.? https://github.com/facebookincubator/glean ?
        • dijit 2 days ago
          I'm also confused, the link you shared is more akin to a sourcegraph alternative; but the parent is talking as if it's an LLM.

          I'm going to guess that there are two completely unrelated products that share a name.

          glean.com and your link (glean.software).

      • scrollaway 2 days ago
        What's the pricing on it? Everything I see is "contact us".
        • staindk 2 days ago
          Glean.com? We had an intro meeting with them; the pricing only makes sense if you're in a first-world country and have 100+ or maybe 150+ employees.

          I recall pricing started at 50k USD per year but may be remembering incorrectly. Please take this with a grain of salt as they may have changed their pricing models or whatever - I just get really annoyed at the "contact us" stuff so thought I'd try to help out here.

    • iandanforth 2 days ago
      Yeah this naming is questionable. This definitely introduces confusion in the minds of consumers but I'm not sure if it's actionable. Any lawyers want to give some "I am not your lawyer" opinions?
      • loeg 2 days ago
        Meta's tool was started by at least August 2021. The Glean commercial product wasn't launched until September 2021.
      • cma 2 days ago
        It is a high burden to get a trademark on a five-letter common English word. One is usually only awarded after years of use and widespread recognition.
  • conqrr 2 days ago
    Glean: https://glean.software/ System for collecting, deriving and querying facts about source code
    • lenkite 1 day ago
      The article never mentions that Glean is written in Haskell
  • tomas789 3 days ago
    This is certainly a step in the right direction. With the proliferation of AI-based assistants, there will be a greater need for readily available information about the codebase; this could easily take those copilots up yet another level.

    For example, my workflow now with Cursor is to keep relevant code open in separate tabs even though I don't work on those files. I've found it makes the autocomplete better, as it seems that all the active tabs are fed to the model. That means less space for me and more distraction. Glean might help here.

  • PessimalDecimal 2 days ago
    Google's equivalent to this is Kythe (https://kythe.io/). Earlier today I noticed that Kythe ripped out its support for indexing Rust code and wondered what alternatives might exist, so it's interesting to see this right now! And it looks like it supports Rust (albeit via rust-indexer).
  • rockwotj 2 days ago
    Are there any UIs for this available openly? Or for Glass? I'm a former Googler and I know how awesome this kind of tooling is, and how hard it is to achieve with OSS. I would love open-source code search. This seems very close, but there is no UI layer (and it seems like Meta uses this for code review and for IDEs); a basic UI would be a good start.
    • pepeiborra 2 days ago
      Some people use the Glass command-line client to integrate with Emacs/Vim/VSCode. Internally we have an LSP server that queries Glass, but it's not open source and some work would be needed to extract it. The only non-trivial thing it does is position mapping to account for local changes.

      The integrations for code review and symbol search are both built for internal tools and not amenable to open sourcing.

      FWIW I agree that the lack of open-source integrations is the main barrier to external adoption.
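
      For the curious, the position-mapping piece might look roughly like the Python sketch below. This is my own guess at the approach, not the internal implementation: shift line numbers from the indexed revision through a diff against the locally modified file.

        import difflib

        def map_line(indexed_text: str, local_text: str, indexed_line: int):
            """Map a 1-based line number from the indexed revision of a file to the
            corresponding line in the locally modified copy, or None if that line
            was edited or deleted locally. A sketch, not the internal implementation."""
            old = indexed_text.splitlines()
            new = local_text.splitlines()
            matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
            i = indexed_line - 1  # 0-based index into the indexed revision
            for tag, a1, a2, b1, b2 in matcher.get_opcodes():
                if a1 <= i < a2:
                    if tag == "equal":
                        return b1 + (i - a1) + 1  # same offset within an unchanged block
                    return None  # the line was changed or removed locally
            return None

      With something like that, results from the server-side index (which only knows about committed code) can still point at the right place in a developer's working copy.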

      • rockwotj 1 day ago
        Yeah, understood that the internal Meta stuff would be too tied up with internal infra to be OSS; it's the same with a lot of Google's tooling here.

        I do wish there was a startup here. There is Sourcegraph, which has OK code search (GitHub has come a long way, but without indexing and understanding the build you can't do it justice). There are also cool code review startups like Graphite, but they don't work together. I remember how powerful it was to review a change and then use an xref to see how a function is used, even when it's untouched in the code review and so doesn't show in the diff; in OSS land that requires checking out the changes locally and context switching to leave comments.

  • YetAnotherNick 2 days ago
    There are already three popular products named Glean, with the domains .com, .ai, and .co. This is the glean with the .software domain.
  • jepler 2 days ago
    My mind just balks at the idea of having so much source that a 2020s computer could take hours to index it. ctags is nothing special (both in terms of optimization and the level of detail it gets to: just global function identifiers), and it looks like it runs at about 400 MB/s on a single core of an i5-1235U. But still, it looks like ctags could process about 100 TB in 4 hours across 16 threads on a workstation-class CPU...
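
    Spelling out that back-of-envelope estimate in Python (the throughput and scaling numbers are my own assumptions, not measurements):

      # Rough capacity estimate, assuming ~400 MB/s per core and linear
      # scaling across 16 threads for 4 hours.
      per_core_bytes_per_sec = 400e6
      threads = 16
      seconds = 4 * 3600
      total = per_core_bytes_per_sec * threads * seconds
      print(f"{total / 1e12:.0f} TB")  # ~92 TB, i.e. on the order of 100 TB
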
    • DylanSp 2 days ago
      It sounds like the indexing time/complexity is increased a lot by the amount of detailed data they're storing. They mention determining which `using` statement is used to resolve each symbol reference in C++ source, to enable dead code detection; that's going to require some sophisticated analysis.
      • menaerus 2 days ago
        Correct, you need to build an AST representation of the code that you want to index. Essentially it's a compiler frontend pass, which is why it takes so much longer than ctags' heuristics. Now think millions of lines of code, multiple build configurations, the amount of RAM you need, etc. Multiple branches, or even smaller revisions/commits, are also a big computation problem.

        That said, Glean seems to be reusing the indexer from LLVM/clang for C and C++.

        > The C++ indexer ("the clang indexer") is a wrapper over clang. The clang indexer is a drop in replacement for the C++ compiler that emits Glean facts instead of code. The wrapper is linked against libclang and libllvm.

        [1] https://glean.software/docs/indexer/cxx
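
        To illustrate the "compiler frontend that emits facts instead of code" idea, here's a minimal sketch using the Python libclang bindings (clang.cindex). It only records definitions and their locations, nothing like Glean's real indexer or schema:

          import sys
          import clang.cindex  # Python bindings for libclang

          def emit_facts(path, compile_args):
              """Parse one translation unit and emit simple 'facts' about the
              definitions it contains. Illustrative only; a real indexer also
              records references, using-declarations, templates, etc."""
              index = clang.cindex.Index.create()
              tu = index.parse(path, args=compile_args)
              facts = []
              for cursor in tu.cursor.walk_preorder():
                  if cursor.is_definition() and cursor.spelling:
                      facts.append({
                          "kind": str(cursor.kind),
                          "name": cursor.spelling,
                          "file": str(cursor.location.file) if cursor.location.file else None,
                          "line": cursor.location.line,
                      })
              return facts

          if __name__ == "__main__":
              # e.g. python emit_facts.py foo.cpp -std=c++17 -Iinclude
              for fact in emit_facts(sys.argv[1], sys.argv[2:]):
                  print(fact)

        The expensive part is that each parse is effectively a full compiler frontend run per translation unit, per build configuration, which is where the hours go.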

    • UltraSane 2 days ago
      The whole point of indexing data is to perform very expensive computation once and leverage the result many, many times, and it works really well.
    • phyrex 2 days ago
      It's a monorepo across a dozen languages (good luck with ctags) that tens of thousands of developers commit to every day. Even if you spent the hours indexing it locally, it would be out of date right away.
    • kllrnohj 2 days ago
      You kinda said it yourself already - ctags is fast because it's producing almost nothing of value. Being fast at doing nothing isn't impressive.

      Try doing the same with C++ and more indexing options enabled, such as with something like universal-ctags, and a larger code base; Android's repository ought to do it. Are you still getting 400 MB/s? Nope.

  • tonymet 2 days ago
    My favorite feature of code indexing at FB was how well integrated it was. Web search, CLI search, and IDE search all used the search index, but would reference your local context. This was useful for reference, call-stack, and dead-code search.

    E.g. search results from IDE search would link back to your local file, and CLI results would reference your local clone.

    A great example of a small feature resulting in great usability.
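
    If I had to guess at the mechanics (a guess, not how FB actually wires it up), the index returns repo-relative paths and each client simply re-roots them onto whatever clone it's running in, roughly like this sketch (paths are made up for illustration):

      from pathlib import Path

      def to_local(repo_relative_path, line, clone_root):
          """Turn a repo-relative search result into a local file location.
          Hypothetical helper; paths below are invented for illustration."""
          return f"{Path(clone_root) / repo_relative_path}:{line}"

      # to_local("lib/foo/Bar.php", 42, "/home/me/my-clone")
      #   -> "/home/me/my-clone/lib/foo/Bar.php:42"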

    • Nathanba 2 days ago
      By IDE search, do you mean that it was using Glean even in your local VSCode? Does Glean therefore work in combination with LSPs? SCIP says that code modifications are a non-goal, and now I wonder why somebody would create such a big tool only for local code to still just use LSPs and never use the server version of code navigation (SCIP or Glean).
      • tonymet 2 days ago
        It used the server version and references mapped back to the local clone when you clicked on a result.
  • archy_ 2 days ago
    When I read about these things, I can't help but wonder if anybody took a step back and thought, "maybe we just have too much code"?

    At some point, perhaps you're just doing too much

    • nthingtohide 2 days ago
      There isn't too much code till the point we have automated asteroid mining.
    • dboreham 2 days ago
      Career limiting thoughts.
      • sangnoir 2 days ago
        Product and profit limiting too, if you're deleting profitable code for aesthetic reasons.