Using libclang to Parse C++ (aka libclang 101)
Tue, Jan 3, 2017
In this post I’ll provide a quick tutorial for using libclang. I started playing around with libclang while implementing Reflang, an open source reflection framework for C++, and along the way I came to appreciate the amazing work done by its developers.
Please note that we will start with a program and will gradually add code. Scroll to the end of the post to view the complete solution.
libclang?
Clang, if you haven’t heard of it yet, is a wonderful compiler for C++ (and the other C-family languages). Well, not exactly a compiler, but a frontend to the LLVM compiler.
You see, compilers have a very tough problem to solve, and so most of them split it into 2 easier problems:
- Translating a programming language (C++ in our case) to some intermediate code – this is called the frontend, and is exactly what Clang does.
- Translating that intermediate code to machine code – this is called the backend. Clang uses LLVM for that.
The neat thing about Clang is that it is also designed to be used as a library. There are many types of applications that must truly understand code: IDEs, documentation generators, static-analysis tools, etc. Instead of each of them having to implement C++ parsing (which is an extremely difficult task!), they can use libclang to correctly handle all language features and edge cases.
libclang!
And it’s so darn easy. Really. Those Clang folks did awesome work. In the rest of this post we will use its C API to explore the following code:
// header.hpp
class MyClass
{
public:
    int field;
    virtual void method() const = 0;
    static const int static_field;
    static int static_method();
};
Basic example
Let’s look at the simplest of examples. The following program parses the above file and immediately exits:
#include <cstdlib>         // exit()
#include <iostream>
#include <clang-c/Index.h> // This is libclang.
using namespace std;

int main()
{
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit unit = clang_parseTranslationUnit(
        index,
        "header.hpp", nullptr, 0,
        nullptr, 0,
        CXTranslationUnit_None);
    if (unit == nullptr)
    {
        cerr << "Unable to parse translation unit. Quitting." << endl;
        exit(-1);
    }

    clang_disposeTranslationUnit(unit);
    clang_disposeIndex(index);
}
There are many 0s and nullptrs; these allow us to do some more advanced stuff (like passing argv & argc, using in-memory files, etc.). Let’s not get into these.
So what do we have after clang_parseTranslationUnit() has finished successfully? We have a parsed Abstract Syntax Tree (AST) which we can traverse and inspect. Which is exactly what we’ll do.
Cursors
Pointers to the AST are called Cursors in libclang lingo. A Cursor can have a parent and children. It can also have related cursors (like a default value for a parameter, an explicit value to an enum entry, etc).
The ‘entry point’ cursor we will use is the cursor representing the Translation Unit (TU), which is a C++ term meaning a single file including all #included code. To get the TU’s cursor we will use the very descriptive clang_getTranslationUnitCursor(). Now that we have a cursor we can investigate it or iterate using it.
Visit children
Any cursor has a kind, which represents the essence of the cursor. Kind can be one of many, many options, all listed in the CXCursorKind enum in Index.h. A few examples are:
/** \brief A C or C++ struct. */
CXCursor_StructDecl = 2,
/** \brief A C or C++ union. */
CXCursor_UnionDecl = 3,
/** \brief A C++ class. */
CXCursor_ClassDecl = 4,
/** \brief An enumeration. */
CXCursor_EnumDecl = 5,
We can get the kind from a cursor using clang_getCursorKind().
For now let’s visit all children of the TU:
CXCursor cursor = clang_getTranslationUnitCursor(unit);
clang_visitChildren(
    cursor,
    [](CXCursor c, CXCursor parent, CXClientData client_data)
    {
        cout << "Cursor kind: " << clang_getCursorKind(c) << endl;
        return CXChildVisit_Recurse;
    },
    nullptr);
The second-parameter lambda is a function called for every cursor visited. Inside we always return CXChildVisit_Recurse (although other options exist), because we want to explore everything in our file.
Output:
Cursor kind: 4
Cursor kind: 39
Cursor kind: 6
Cursor kind: 21
Cursor kind: 9
Cursor kind: 21
That’s a bit cryptic, and requires us to skip back and forth to Index.h. Fortunately, there’s a built-in function to convert cursor kind to a string, but first we need to discuss libclang’s strings.
CXString
CXString is libclang’s string type. To retrieve an actually useful string (const char * for example), one must call clang_getCString(), and then clang_disposeString() when done, which releases whatever storage the CXString owns. The pointer returned by clang_getCString() must not be used after the string has been disposed.
Since we’re going to do this a lot, let’s create a helper function:
// Prints a CXString and releases it, so each CXString can be streamed only once.
ostream& operator<<(ostream& stream, const CXString& str)
{
    stream << clang_getCString(str);
    clang_disposeString(str);
    return stream;
}
Print meaningful output
Now that we can extract strings, let’s modify our lambda to print something that is actually useful:
CXCursor cursor = clang_getTranslationUnitCursor(unit);
clang_visitChildren(
    cursor,
    [](CXCursor c, CXCursor parent, CXClientData client_data)
    {
        cout << "Cursor '" << clang_getCursorSpelling(c) << "' of kind '"
             << clang_getCursorKindSpelling(clang_getCursorKind(c)) << "'\n";
        return CXChildVisit_Recurse;
    },
    nullptr);
Output:
Cursor 'MyClass' of kind 'ClassDecl'
Cursor '' of kind 'CXXAccessSpecifier'
Cursor 'field' of kind 'FieldDecl'
Cursor 'method' of kind 'CXXMethod'
Cursor 'static_field' of kind 'VarDecl'
Cursor 'static_method' of kind 'CXXMethod'
Now, that’s friggin’ neat.
A more complicated example
I was very careful not to #include any header in header.hpp. Why? Well, by merely adding #include <string> to header.hpp the output size is 1.51MB. Ever got pissed at the compiler for taking so long? That’s why. It’s very educational to read such a file, but for everyone’s sake I won’t post it here.
Instead, let’s parse the following file:
enum class Cpp11Enum
{
    RED = 10,
    BLUE = 20
};

struct Wowza
{
    virtual ~Wowza() = default;
    virtual void foo(int i = 0) = 0;
};

struct Badabang : Wowza
{
    void foo(int) override;
    bool operator==(const Badabang& o) const;
};

template <typename T>
void bar(T&& t);
Same program’s output for this file:
Cursor 'Cpp11Enum' of kind 'EnumDecl'
Cursor 'RED' of kind 'EnumConstantDecl'
Cursor '' of kind 'IntegerLiteral'
Cursor 'BLUE' of kind 'EnumConstantDecl'
Cursor '' of kind 'IntegerLiteral'
Cursor 'Wowza' of kind 'StructDecl'
Cursor '~Wowza' of kind 'CXXDestructor'
Cursor 'foo' of kind 'CXXMethod'
Cursor 'i' of kind 'ParmDecl'
Cursor '' of kind 'IntegerLiteral'
Cursor 'Badabang' of kind 'StructDecl'
Cursor 'struct Wowza' of kind 'C++ base class specifier'
Cursor 'struct Wowza' of kind 'TypeRef'
Cursor 'foo' of kind 'CXXMethod'
Cursor '' of kind 'attribute(override)'
Cursor '' of kind 'ParmDecl'
Cursor 'operator==' of kind 'CXXMethod'
Cursor 'o' of kind 'ParmDecl'
Cursor 'struct Badabang' of kind 'TypeRef'
Cursor 'bar' of kind 'FunctionTemplate'
Cursor 'T' of kind 'TemplateTypeParameter'
Cursor 't' of kind 'ParmDecl'
Cursor 'T' of kind 'TypeRef'
Conclusion
libclang is awesome:
- It allows checking whether code has been expanded from a macro, and to jump there;
- It allows checking the location (file+line+column) for each cursor;
- It allows getting a function’s parameter names, types and return type;
- It understands templates, autos, lambdas, and, well, everything in C++.
I hope this short post made you curious, and that you’ll also try exploring what this amazing API provides. Please do write a comment below if you have anything you want to add or ask!
Complete Code
For your convenience, here’s the complete code we implemented today:
#include <cstdlib>         // exit()
#include <iostream>
#include <clang-c/Index.h>
using namespace std;

// Prints a CXString and releases it, so each CXString can be streamed only once.
ostream& operator<<(ostream& stream, const CXString& str)
{
    stream << clang_getCString(str);
    clang_disposeString(str);
    return stream;
}

int main()
{
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit unit = clang_parseTranslationUnit(
        index,
        "header.hpp", nullptr, 0,
        nullptr, 0,
        CXTranslationUnit_None);
    if (unit == nullptr)
    {
        cerr << "Unable to parse translation unit. Quitting." << endl;
        exit(-1);
    }

    CXCursor cursor = clang_getTranslationUnitCursor(unit);
    clang_visitChildren(
        cursor,
        [](CXCursor c, CXCursor parent, CXClientData client_data)
        {
            cout << "Cursor '" << clang_getCursorSpelling(c) << "' of kind '"
                 << clang_getCursorKindSpelling(clang_getCursorKind(c)) << "'\n";
            return CXChildVisit_Recurse;
        },
        nullptr);

    clang_disposeTranslationUnit(unit);
    clang_disposeIndex(index);
}