Compute and engineering overhead for "XML as RAM" setup?

Note: This can be filed away in the "rant-y" stream-of-consciousness category, as I have yet to run into issues on a real problem. There are other issues that I face for Drake, such as those mentioned in #103 (closed), which could be worked around with the pygccxml comment support.

I am concerned about the scalability of the CastXML + pygccxml setup for the following reasons:

Design redundancy. I understand a bit of the gccxml -> CastXML (gcc vs. clang) progression for legacy reasons. However, I can't shake the feeling that this is another way of "beating around the bush" to get a more digestible interface to Clang's parsing API. With pygccxml, it feels almost wholly redundant with clang.cindex, except that it fixes certain things and makes queries a bit easier.
Prototyping speed. In #92, I was complaining about not having information that is already provided via Clang's AST. While it's true CastXML provides stability, I am also fine with pinning the clang/llvm version for my project. Additionally, any new features must be plumbed through pygccxml (e.g. comment strings). This probably isn't that much work, but it still feels like engineering overhead.
Compute / disk overhead. All usages of pygccxml that I see are generally "ephemeral" - point it at some source, generate some CastXML output in a tempdir, then throw it away. Some explicit caching can be done (either via pygccxml API for writing/reading CastXML directly, or using the project reader shindig), and some optimizations can be performed (e.g. the init_optimization for lookups in pygccxml), but it still feels oddly... wasteful, given that all the information was available in memory at one point in time with clang (hence the "XML as RAM" title).

All of this may just boil down to my desire to "move fast and break things" and "shoot for the moon", whereas CastXML + pygccxml very much have the "get it done" mentality, with well-scoped problem sets and existing solutions (and thus some built-in inertia).

Action Items: 🤷 I'll still try to tinker more with pygccxml and get real use cases underway before complaining too much more in the abstract.

For now, I'm just doing some shallow performance and quality benchmarks of pygccxml vs. clang.cindex, identifying gaps between the two, and seeing what a dumb project like directly binding Clang's C++ API looks like using autopybind11, both for addressing this issue and for checking autopybind11's usability itself.

One such shallow benchmark:

https://github.com/CastXML/pygccxml/issues/129

One semi-crazy idea is to see if pygccxml can have an adapter to use an in-memory solution (clang.cindex or whatever else) rather than the RAM -> XML on disk -> RAM indirection. pygccxml's API is still baller.
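To make the adapter idea a bit more concrete, here is a minimal, stdlib-only sketch. Everything in it is hypothetical: `Decl`, `DeclSource`, `InMemorySource`, and `XmlFileSource` are names made up for illustration and are not part of pygccxml or clang.cindex, and the XML it writes is a toy stand-in for real CastXML output. It only shows the shape of the idea: query code written once against a small interface, with the on-disk XML round trip as one backend and an in-memory backend as another.

```python
import tempfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Decl:
    """Hypothetical, minimal declaration record (not a pygccxml type)."""
    kind: str
    name: str


class DeclSource(Protocol):
    """Hypothetical backend interface: anything that can yield declarations."""

    def declarations(self) -> Iterable[Decl]: ...


class InMemorySource:
    """Stands in for a backend that hands over declarations already in RAM."""

    def __init__(self, decls):
        self._decls = list(decls)

    def declarations(self):
        return list(self._decls)


class XmlFileSource:
    """Stands in for the current path: read declarations back from XML on disk."""

    def __init__(self, path):
        self._path = str(path)

    def declarations(self):
        root = ET.parse(self._path).getroot()
        return [Decl(kind=elem.tag, name=elem.get("name")) for elem in root]


def class_names(source: DeclSource):
    """Query written once against the interface, whatever the backend."""
    return sorted(d.name for d in source.declarations() if d.kind == "Class")


decls = [Decl("Class", "Foo"), Decl("Function", "bar"), Decl("Class", "Baz")]

# RAM -> XML on disk -> RAM: serialize to a temp dir, parse back, discard.
with tempfile.TemporaryDirectory() as tmp:
    xml_path = Path(tmp) / "decls.xml"
    root = ET.Element("CastXML")
    for d in decls:
        ET.SubElement(root, d.kind, name=d.name)
    ET.ElementTree(root).write(str(xml_path))
    names_via_xml = class_names(XmlFileSource(xml_path))

# RAM only: same query, no serialization step.
names_in_memory = class_names(InMemorySource(decls))

print(names_via_xml, names_in_memory)
```

Both paths return the same class names here; the open question is whether pygccxml's declaration tree could be populated from something other than the XML reader without losing what CastXML currently does for it.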

\cc @brad.king @jamiesnape

TTBOMK, CastXML effectively serves as an abstraction layer on top of Clang's AST API, but also as a (possibly necessary) information bottleneck: partly for information density (e.g. for a person reading the output), but also for efficiency, given that CastXML output must be stored on disk in a heavy text-based format, XML. CastXML does some legwork of forcing instantiations to take place via Clang's AST API, and does the proper set of passes to take the generated symbols, with explicit relations (via the id setup), and dump them in an API-stable way. CastXML also provides quite a few compatibility shims for parsing code (what looks like an MSVC compatibility mode, etc.).

pygccxml further processes this output: it reads the XML produced by CastXML and then re-establishes the correspondences, but generally with less granularity than Clang (due to the information bottleneck and the design of CastXML).
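As a rough illustration of what "re-establishing the correspondences" means, here is a small stdlib-only sketch. The XML fragment is hand-written and heavily simplified; it is modeled on the id / members / type / context attribute scheme, but real CastXML output carries many more attributes and element kinds. The helper names (`load_declarations`, `describe_members`) are made up for this example and are not pygccxml functions.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment (an assumption for illustration only).
SAMPLE_XML = """\
<CastXML>
  <Namespace id="_1" name="::" members="_2"/>
  <Struct id="_2" name="Point" context="_1" members="_3 _4"/>
  <Field id="_3" name="x" type="_5" context="_2"/>
  <Field id="_4" name="y" type="_5" context="_2"/>
  <FundamentalType id="_5" name="double"/>
</CastXML>
"""


def load_declarations(xml_text):
    """Index every top-level element by its `id` attribute."""
    root = ET.fromstring(xml_text)
    return {elem.get("id"): elem for elem in root}


def describe_members(by_id, class_id):
    """Resolve `members` and `type` id references back into names."""
    cls = by_id[class_id]
    result = []
    for member_id in cls.get("members", "").split():
        member = by_id[member_id]
        type_elem = by_id[member.get("type")]
        result.append((member.get("name"), type_elem.get("name")))
    return result


by_id = load_declarations(SAMPLE_XML)
fields = describe_members(by_id, "_2")
print(fields)  # [('x', 'double'), ('y', 'double')]
```

The relations that were pointers in Clang's in-memory AST become string ids in the XML, and the reader has to look them up again to rebuild the graph.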

Edited Aug 01, 2020 by Eric Cousineau