Using libclang to Parse C++ (aka libclang 101)


In this post I’ll provide a quick tutorial for using libclang. I started playing around with libclang while implementing Reflang – an open source reflection framework for C++. Then I came to appreciate the amazing work done by its developers.

Please note that we will start with a minimal program and gradually add code to it. Scroll to the end of the post to view the complete solution.

libclang?

Clang, if you haven’t heard of it yet, is a wonderful C++ (and other C language family) compiler. Well, not exactly a compiler, but a frontend to the LLVM compiler.

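You see, compilers have a very tough problem to solve, and so most of them split it into 2 easier problems: a frontend, which parses the source code into an intermediate representation, and a backend, which turns that representation into machine code.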

The neat thing about Clang is that it is designed to also be used as a library. There are many types of applications that must truly understand code – IDEs, documentation-generators, static-analysis tools, etc. Instead of each of them having to implement C++ parsing (which is an extremely difficult task!), libclang can be used to correctly handle all language features and edge-cases.

libclang!

And it’s so darn easy. Really. Those Clang folks really did awesome work. In the rest of this post we will use its C-API to explore the following code:

// header.hpp

class MyClass
{
public:
  int field;
  virtual void method() const = 0;

  static const int static_field;
  static int static_method();
};

Basic example

Let’s look at the simplest of examples. The following program parses the above file and immediately exits:

#include <cstdlib>   // exit()
#include <iostream>
#include <clang-c/Index.h>  // This is libclang.
using namespace std;

int main()
{
  CXIndex index = clang_createIndex(0, 0);
  CXTranslationUnit unit = clang_parseTranslationUnit(
    index,
    "header.hpp", nullptr, 0,
    nullptr, 0,
    CXTranslationUnit_None);
  if (unit == nullptr)
  {
    cerr << "Unable to parse translation unit. Quitting." << endl;
    exit(-1);
  }

  clang_disposeTranslationUnit(unit);
  clang_disposeIndex(index);
}

There are many 0s and nullptrs - these allow us to do some more advanced stuff (like pass argv & argc, use in-memory files, etc). Let’s not get into these.

So what do we have after clang_parseTranslationUnit() has finished successfully? We have a parsed Abstract Syntax Tree (AST) which we can traverse and inspect. Which is exactly what we’ll do.

Cursors

Pointers to the AST are called Cursors in libclang lingo. A Cursor can have a parent and children. It can also have related cursors (like a default value for a parameter, an explicit value to an enum entry, etc).

The ‘entry point’ cursor we will use is the cursor representing the Translation Unit (TU), which is a C++ term meaning a single file including all #included code. To get the TU’s cursor we will use the very descriptive clang_getTranslationUnitCursor(). Now that we have a cursor we can investigate it or iterate using it.

Visit children

Any cursor has a kind, which represents the essence of the cursor. Kind can be one of many, many options, as can be seen in the CXCursorKind enum in Index.h. A few examples are:

  /** \brief A C or C++ struct. */
  CXCursor_StructDecl                    = 2,
  /** \brief A C or C++ union. */
  CXCursor_UnionDecl                     = 3,
  /** \brief A C++ class. */
  CXCursor_ClassDecl                     = 4,
  /** \brief An enumeration. */
  CXCursor_EnumDecl                      = 5,

We can get the kind from a cursor using clang_getCursorKind().

For now let’s visit all children of the TU:

  CXCursor cursor = clang_getTranslationUnitCursor(unit);
  clang_visitChildren(
    cursor,
    [](CXCursor c, CXCursor parent, CXClientData client_data)
    {
      cout << "Cursor kind: " << clang_getCursorKind(c) << endl;
      return CXChildVisit_Recurse;
    },
    nullptr);

The lambda passed as the second parameter is called for every cursor visited. Inside we always return CXChildVisit_Recurse (although other options exist), because we want to explore everything in our file.

Output:

Cursor kind: 4
Cursor kind: 39
Cursor kind: 6
Cursor kind: 21
Cursor kind: 9
Cursor kind: 21

That’s a bit cryptic, and requires us to skip back and forth to Index.h. Fortunately, there’s a built-in function to convert cursor kind to a string, but first we need to discuss libclang’s strings.

CXString

CXString is libclang’s own string type. Functions such as clang_getCursorSpelling() return it instead of a plain const char *, because who owns the underlying characters can differ from one call to the next. To retrieve an actually useful string (const char * for example), one must call clang_getCString(), and then clang_disposeString() when done.

Since we’re going to do this a lot, let’s create a helper function:

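// Prints a CXString and disposes it; the string must not be used afterwards.
ostream& operator<<(ostream& stream, const CXString& str)
{
  const char* text = clang_getCString(str);
  if (text != nullptr)  // Streaming a null char pointer is undefined behaviour.
    stream << text;
  clang_disposeString(str);
  return stream;
}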

Now that we can extract strings, let’s modify our lambda to print something that is actually useful:

  CXCursor cursor = clang_getTranslationUnitCursor(unit);
  clang_visitChildren(
    cursor,
    [](CXCursor c, CXCursor parent, CXClientData client_data)
    {
      cout << "Cursor '" << clang_getCursorSpelling(c) << "' of kind '"
        << clang_getCursorKindSpelling(clang_getCursorKind(c)) << "'\n";
      return CXChildVisit_Recurse;
    },
    nullptr);

Output:

Cursor 'MyClass' of kind 'ClassDecl'
Cursor '' of kind 'CXXAccessSpecifier'
Cursor 'field' of kind 'FieldDecl'
Cursor 'method' of kind 'CXXMethod'
Cursor 'static_field' of kind 'VarDecl'
Cursor 'static_method' of kind 'CXXMethod'

Now, that’s friggin’ neat.

A more complicated example

I was very careful not to #include any header in header.hpp. Why? Well, by merely adding #include <string> to header.hpp the output grows to 1.51MB. Ever got pissed at the compiler for taking so long? That’s why. It’s very educational to read such a file, but for everyone’s sake I won’t post it here.

Instead, let’s parse the following file:

enum class Cpp11Enum
{
  RED = 10,
  BLUE = 20
};

struct Wowza
{
  virtual ~Wowza() = default;
  virtual void foo(int i = 0) = 0;
};

struct Badabang : Wowza
{
  void foo(int) override;

  bool operator==(const Badabang& o) const;
};

template <typename T>
void bar(T&& t);

Same program’s output for this file:

Cursor 'Cpp11Enum' of kind 'EnumDecl'
Cursor 'RED' of kind 'EnumConstantDecl'
Cursor '' of kind 'IntegerLiteral'
Cursor 'BLUE' of kind 'EnumConstantDecl'
Cursor '' of kind 'IntegerLiteral'
Cursor 'Wowza' of kind 'StructDecl'
Cursor '~Wowza' of kind 'CXXDestructor'
Cursor 'foo' of kind 'CXXMethod'
Cursor 'i' of kind 'ParmDecl'
Cursor '' of kind 'IntegerLiteral'
Cursor 'Badabang' of kind 'StructDecl'
Cursor 'struct Wowza' of kind 'C++ base class specifier'
Cursor 'struct Wowza' of kind 'TypeRef'
Cursor 'foo' of kind 'CXXMethod'
Cursor '' of kind 'attribute(override)'
Cursor '' of kind 'ParmDecl'
Cursor 'operator==' of kind 'CXXMethod'
Cursor 'o' of kind 'ParmDecl'
Cursor 'struct Badabang' of kind 'TypeRef'
Cursor 'bar' of kind 'FunctionTemplate'
Cursor 'T' of kind 'TemplateTypeParameter'
Cursor 't' of kind 'ParmDecl'
Cursor 'T' of kind 'TypeRef'

Conclusion

libclang is awesome:

I hope this short post made you curious, and that you’ll also try exploring what this amazing API provides. Please do write a comment below if you have anything you want to add or ask!

Complete Code

For your convenience, here’s the complete code we implemented today:

#include <cstdlib>   // exit()
#include <iostream>
#include <clang-c/Index.h>
using namespace std;

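// Prints a CXString and disposes it; the string must not be used afterwards.
ostream& operator<<(ostream& stream, const CXString& str)
{
  const char* text = clang_getCString(str);
  if (text != nullptr)  // Streaming a null char pointer is undefined behaviour.
    stream << text;
  clang_disposeString(str);
  return stream;
}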

int main()
{
  CXIndex index = clang_createIndex(0, 0);
  CXTranslationUnit unit = clang_parseTranslationUnit(
    index,
    "header.hpp", nullptr, 0,
    nullptr, 0,
    CXTranslationUnit_None);
  if (unit == nullptr)
  {
    cerr << "Unable to parse translation unit. Quitting." << endl;
    exit(-1);
  }

  CXCursor cursor = clang_getTranslationUnitCursor(unit);
  clang_visitChildren(
    cursor,
    [](CXCursor c, CXCursor parent, CXClientData client_data)
    {
      cout << "Cursor '" << clang_getCursorSpelling(c) << "' of kind '"
        << clang_getCursorKindSpelling(clang_getCursorKind(c)) << "'\n";
      return CXChildVisit_Recurse;
    },
    nullptr);

  clang_disposeTranslationUnit(unit);
  clang_disposeIndex(index);
}
