I was really confused and surprised that Meta was using a commercial product for indexing instead of building in-house...until I realized that they weren't talking about the AI search indexing tool at glean.com
Glean.com? We had an intro meeting with them, pricing only makes sense if you're in a first world country and have 100+ or maybe 150+ employees.
I recall pricing started at 50k USD per year but may be remembering incorrectly. Please take this with a grain of salt as they may have changed their pricing models or whatever - I just get really annoyed at the "contact us" stuff so thought I'd try to help out here.
Yeah this naming is questionable. This definitely introduces confusion in the minds of consumers but I'm not sure if it's actionable. Any lawyers want to give some "I am not your lawyer" opinions?
This is certainly a step in the right direction. With the proliferation of AI-based assistants, there will be a greater need for readily available information about the codebase. This could easily take those copilots up yet another level.
For example, my workflow now with Cursor is to keep relevant code in separate tabs even though I don’t work on those files. I found it makes the autocomplete better, as it seems to me that all the active tabs are fed to the model. That means less space for me and more distraction. Glean might help here.
Google's equivalent to this is Kythe (https://kythe.io/). Earlier today I had noticed that Kythe ripped out its support for indexing Rust code and wondered what alternatives might exist. So it's interesting to see this right now! And it looks like it supports Rust (albeit via rust-indexer).
Are there any UIs for this available openly? Or for Glass? I am a former Googler and I know how awesome this kind of tooling is, and it’s so hard to achieve with OSS. I would love open source code search. This seems very close, but there is no UI layer (it seems like Meta uses this for code review and for IDEs). A basic UI would be a good start.
Some people use the Glass command line client to integrate with Emacs/Vim/VSCode. Internally we have an LSP server that queries Glass, but it's not open source and some work would be needed to extract it. The only non trivial thing it does is position mapping to account for local changes.
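To illustrate what "position mapping to account for local changes" could look like, here is a minimal sketch. It is not the internal LSP server's code, which isn't public; it assumes line-level mapping is enough and uses Python's `difflib` as a stand-in. The helper name `map_line` and the sample files are made up for the example.

```python
import difflib

def map_line(indexed_lines, local_lines, indexed_line_no):
    """Map a 0-based line number in the indexed revision to the local file.

    Returns None when the line was modified or deleted locally, in which
    case the indexed position cannot be trusted.
    """
    matcher = difflib.SequenceMatcher(a=indexed_lines, b=local_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only unchanged ("equal") regions can be shifted safely.
        if tag == "equal" and i1 <= indexed_line_no < i2:
            return j1 + (indexed_line_no - i1)
    return None

# Hypothetical file contents: the local copy has two extra lines on top.
indexed = ["def f():", "    return 1", "", "def g():", "    return f()"]
local = ["import os", "import sys"] + indexed

print(map_line(indexed, local, 3))  # "def g():" moved down by two lines
```

A real implementation would presumably also map columns and cache the diff per file, but the idea is the same: index results come back in terms of the indexed revision and get shifted through the unchanged regions of a diff.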
The integrations for code review and symbol search are both built for internal tools and not amenable to open sourcing.
FWIW I agree that the lack of open source integrations is the main barrier for external adoption
Yeah understood that the internal meta stuff would be too tied up with internal infra to be OSS, it’s the same with a lot of Google’s tooling here.
I do wish there was a startup here. There is Sourcegraph, which has OK code search (GitHub has come a long way, but without indexing and understanding the build you can’t do it justice). There are also cool code review startups like Graphite, but they don’t work together. I remember how powerful it was to review a change and then use an xref to see how a function is used elsewhere, in code the change doesn't touch and that therefore doesn't show in the diff. In OSS land that requires checking out the changes locally and context switching to leave comments.
My mind just balks at the idea of having so much source that a 2020s computer could take hours to index it. ctags is nothing special (both in terms of optimization and in the level of detail it gets to: just global function identifiers) and looks like it runs at about 400MB/s on a single core of an i5-1235U. But still, it looks like ctags could process about 100TB in 4 hours across 16 threads on a workstation class CPU...
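The back-of-envelope number above can be checked with a few lines. This sketch assumes perfectly linear scaling across threads and that disk I/O keeps up, both of which are optimistic; the function name is made up for the example.

```python
def indexable_terabytes(per_core_mb_s, threads, hours):
    """Upper bound on data scanned, assuming linear scaling across threads."""
    bytes_per_second = per_core_mb_s * 1e6 * threads
    return bytes_per_second * hours * 3600 / 1e12

# 400 MB/s per core, 16 threads, 4 hours
print(f"{indexable_terabytes(400, 16, 4):.1f} TB")
```

That comes out to roughly 92 TB, so "about 100TB" is the right order of magnitude for a ctags-style scan.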
It sounds like the indexing time/complexity is increased a lot by the amount of detailed data they're storing. They mention determining which `using` statement is used to resolve each symbol reference in C++ source, to enable dead code detection; that's going to require some sophisticated analysis.
Correct, you need to build an AST representation of the code that you want to index. Essentially, it's a compiler frontend pass, which is why it takes so much longer than what ctags heuristics do. Now think millions of lines of code, multiple build configurations, the amount of RAM you need, etc. Multiple branches, or even smaller revisions/commits, are also a big computation problem.
That said, Glean seems to be reusing the indexer from LLVM/clang for C and C++.
> The C++ indexer ("the clang indexer") is a wrapper over clang. The clang indexer is a drop in replacement for the C++ compiler that emits Glean facts instead of code. The wrapper is linked against libclang and libllvm.
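To make the ctags-vs-frontend difference concrete, here is a toy comparison. It uses Python's own `ast` module as a stand-in for a compiler frontend; it is not how the clang indexer or Glean's schema works, and the sample source and function names are invented for the example.

```python
import ast
import re

# Hypothetical source file to index.
SOURCE = '''
def helper(x):
    return x + 1

def unused(x):
    return x - 1

def main():
    return helper(41)

main()
'''

def heuristic_definitions(source):
    """ctags-style pass: a regex that only finds definition names."""
    return re.findall(r"^def\s+(\w+)", source, flags=re.MULTILINE)

def ast_facts(source):
    """Frontend-style pass: parse, then record definitions and call references."""
    tree = ast.parse(source)
    defs = {}
    refs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            defs[node.name] = node.lineno
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            refs.append((node.func.id, node.lineno))
    return defs, refs

defs, refs = ast_facts(SOURCE)
called = {name for name, _ in refs}
never_called = sorted(set(defs) - called)

print(heuristic_definitions(SOURCE))  # definitions only
print(never_called)                   # dead code candidates
```

The regex pass is cheap but only knows that three functions exist. The parsed version also knows who references whom, which is what dead code detection and xrefs need. A real C++ frontend additionally has to run the preprocessor, resolve overloads and `using` declarations, and do that per build configuration, which is where the time goes.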
It's a mono repo across a dozen languages (good luck with ctags) that tens of thousands of developers commit to every day. Even if you'd spend the hours indexing it locally, it would be out of date right away.
You kinda said it yourself already - ctags is fast because it's producing almost nothing of value. Being fast at doing nothing isn't impressive.
Try doing the same with C++ and more indexing options enabled, such as with something like universal-ctags, and a larger code base; say, Android's repository ought to do it. Are you still getting 400MB/s? Nope.
My favorite feature of code indexing at FB was how well integrated it was. Web search, CLI search and IDE search all used the search index, but would reference your local context. This was useful for reference, call stack, and dead code search.
E.g. search results from IDE search would link back to your local file. CLI results would reference your local clone.
A great example of a small feature resulting in great usability.
By IDE search do you mean that it was using glean even in your local vscode? Does glean therefore work in combination with LSPs, because scip says that code modifications are a non-goal and now I wonder why somebody would create such a big tool only for local code to still just use LSPs and never use the server version of code navigation (scip or glean).
It's also great for when I get pulled into a busy Slack channel and need a summary of what's been going on in there for the past week.
I'm going to guess that there are two completely unrelated products that share a name.
glean.com and your link (glean.software).
[1] https://glean.software/docs/indexer/cxx
At some point, perhaps you're just doing too much