Compute and engineering overhead for "XML as RAM" setup?
Note: This can be filed away in the "rant-y" stream-of-consciousness category, as I have yet to run into issues on a real problem. There are other issues that I face for Drake, such as those mentioned in #103 (closed), which could be worked around with the pygccxml comment support.
I am concerned about the scalability of the CastXML + pygccxml combination for the following reasons:
- Design redundancy. I understand a bit of the gccxml -> CastXML (clang vs. gcc) progression for legacy reasons. However, I can't shake the feeling that this is another way of "beating around the bush" to get a more digestible interface to Clang's parsing API. With pygccxml, it feels almost wholly redundant to clang.cindex - except that it fixes certain things and makes queries a bit easier.
- Prototyping speed. In #92, I was complaining about not having information already provided via Clang's AST. While it's true CastXML provides stability, I am also fine with fixing my clang/llvm version for my project. Additionally, any new features must be plumbed through pygccxml (e.g. comment strings). This probably isn't that much overhead, but it still feels like extra engineering work.
- Compute / disk overhead. All usages of pygccxml that I see are generally "ephemeral" - point it at some source, generate some CastXML output in a tempdir, then throw it away. Some explicit caching can be done (either via the pygccxml API for writing/reading CastXML directly, or using the project reader shindig), and some optimizations can be performed (e.g. the init_optimization for lookups in pygccxml), but it still feels oddly... wasteful, given that all the information was available in memory at one point in time with clang (hence the "XML as RAM" title).
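
To make the "XML as RAM" round trip concrete, here is a minimal standard-library sketch. It does not call CastXML or pygccxml: the `Decl` record and the `dump_xml`, `load_xml`, and `round_trip` helpers are hypothetical stand-ins for "declarations the parser already holds in memory" and "the reader on the other side". It only illustrates that the same information gets serialized to a temporary file and parsed back before any query can run.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Decl:
    """Hypothetical stand-in for a declaration held in memory by a parser."""
    id: str
    kind: str
    name: str


def dump_xml(decls, path):
    """Serialize in-memory declarations to an XML file (RAM -> disk)."""
    root = ET.Element("Decls")
    for d in decls:
        ET.SubElement(root, d.kind, id=d.id, name=d.name)
    ET.ElementTree(root).write(path, encoding="utf-8")


def load_xml(path):
    """Parse the XML file back into declaration records (disk -> RAM)."""
    root = ET.parse(path).getroot()
    return [Decl(e.get("id"), e.tag, e.get("name")) for e in root]


def round_trip(decls):
    """Ephemeral workflow: write into a tempdir, read back, discard the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "decls.xml")
        dump_xml(decls, path)
        size = os.path.getsize(path)
        return load_xml(path), size


in_memory = [Decl("_1", "Namespace", "::"), Decl("_2", "Class", "Foo")]
restored, xml_bytes = round_trip(in_memory)
```

Here `restored` carries exactly what `in_memory` already had; the file of `xml_bytes` bytes exists only to move it between two in-process representations.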
Possibly, all this may just boil down to my desire to "move fast and break things" and "shoot for the moon", whereas CastXML + pygccxml very much have the "get it done" mentality, with well-scoped problem sets and existing solutions (and thus some built-in inertia).
Action Items: Use pygccxml and get real use cases underway before complaining too much more in the abstract.
For now, I'm just doing some shallow performance and quality benchmarks of pygccxml vs. clang.cindex, identifying gaps between the two, and seeing what a dumb project like directly binding clang's C++ API looks like using autopybind11 - both for addressing this issue, and for checking autopybind11's usability itself.
One such shallow benchmark:
One semi-crazy idea is to see if pygccxml can have an adapter to use an in-memory solution (clang.cindex or whatever else) rather than the RAM -> XML on disk -> RAM indirection. pygccxml's API is still baller.
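
A rough sketch of what such an adapter boundary could look like, using only the standard library. Every name here (`DeclSource`, `XmlSource`, `InMemorySource`, `sorted_class_names`) is hypothetical and is not part of pygccxml or clang.cindex; the in-memory backend is just a list of `(kind, name)` tuples standing in for whatever an AST-backed source would provide. The idea is that query code depends on a small interface, not on where the declarations came from.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Protocol, Tuple


class DeclSource(Protocol):
    """Hypothetical interface that the query layer would program against."""

    def class_names(self) -> Iterable[str]: ...


class XmlSource:
    """Backend that reads declarations from XML text (the current indirection)."""

    def __init__(self, xml_text: str) -> None:
        self._root = ET.fromstring(xml_text)

    def class_names(self) -> List[str]:
        return [e.get("name") for e in self._root.iter("Class")]


class InMemorySource:
    """Backend fed directly from in-memory nodes, with no XML in between."""

    def __init__(self, nodes: Iterable[Tuple[str, str]]) -> None:
        self._nodes = list(nodes)

    def class_names(self) -> List[str]:
        return [name for kind, name in self._nodes if kind == "Class"]


def sorted_class_names(source: DeclSource) -> List[str]:
    """Query code that is unaware of which backend supplies the declarations."""
    return sorted(source.class_names())


xml_backend = XmlSource('<Decls><Class name="Foo"/><Class name="Bar"/></Decls>')
mem_backend = InMemorySource([("Class", "Foo"), ("Function", "f"), ("Class", "Bar")])
from_xml = sorted_class_names(xml_backend)
from_memory = sorted_class_names(mem_backend)
```

Both backends answer the same query identically, which is the property an adapter inside pygccxml would need to preserve for its existing API to keep working.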
\cc @brad.king @jamiesnape
TTBOMK, CastXML effectively serves as an abstraction layer on top of Clang's AST API, but also as a (possibly necessary) information bottleneck -- partially for information density (e.g. for a person looking at the output), but also for efficiency, given that CastXML output must be stored on disk and uses a heavy text-based format, XML. CastXML does some legwork of forcing instantiations to take place via Clang's AST API, and does the proper set of passes to take the generated symbols, with explicit relations (via the id setup), and dump them in an API-stable way. CastXML also provides quite a few compatibility shims for parsing code (what looks like an MSVC compatibility mode, etc.).
pygccxml serves as a further processing step on this output: it reads the XML output from CastXML and then re-establishes the correspondences, but generally with less granularity than clang (due to the information bottleneck, and the design of CastXML).
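
As an illustration of what "re-establishing the correspondences" involves, the sketch below resolves id-based references in a small XML fragment. The fragment is hand-written and heavily simplified; it is only modeled loosely on CastXML-style output, and the attribute names used (`members`, `context`, `returns`) should be treated as assumptions of this example. The `build_index` and `resolve` helpers are hypothetical and are not pygccxml functions.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment: elements point at each other by id.
CASTXML_LIKE = """\
<CastXML format="1.1.0">
  <Namespace id="_1" name="::" members="_2 _3"/>
  <Class id="_2" name="Foo" context="_1"/>
  <Function id="_3" name="make_foo" returns="_2" context="_1"/>
</CastXML>
"""


def build_index(root):
    """Map each id to its element so references can be followed."""
    return {elem.get("id"): elem for elem in root if elem.get("id")}


def resolve(index, elem, attr):
    """Follow a whitespace-separated list of id references on an attribute."""
    return [index[ref] for ref in elem.get(attr, "").split()]


root = ET.fromstring(CASTXML_LIKE)
index = build_index(root)

# Namespace -> its members, by following the "members" id list.
global_ns = index["_1"]
member_names = [m.get("name") for m in resolve(index, global_ns, "members")]

# Function -> its return type, by following the "returns" id.
make_foo = index["_3"]
return_type = resolve(index, make_foo, "returns")[0].get("name")
```

Anything not encoded as an attribute or an id reference in the XML is simply unavailable at this stage, which is the bottleneck described above.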