Explaining C++20 Modules

Jan 1, 2021 6:59:24 AM

C++ provides abstractions on the principle of "you pay only for what you use". Unfortunately, even zero-cost abstractions have a hidden cost: the power given by features like constexpr and meta-programming is paid for in longer compilation times.

Each class, template definition and template instantiation requires more effort from the compiler. It is easier to understand if you think like a compiler developer. This is a complex topic and beyond the scope of this article, but one requirement is simple to explain: a type name - such as std::array<int, 3> - needs to be mapped to something useful for the current compilation unit.

This holds true for types introduced by different template instantiations - std::array<void *, 3> and std::array<int *, 3> represent different types even though they have the same size, behaviour and base template. And this grows fast, since template instantiations usually introduce new dependent types: std::map is normally implemented with a form of tree, which requires nodes and so forth.
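The point about distinct instantiations can be checked directly. The following is a minimal sketch, not taken from any particular compiler's internals; the alias names and the same_size helper are hypothetical, and the size check assumes a mainstream platform where all object pointers have the same size.

```cpp
#include <array>
#include <cassert>
#include <type_traits>

// Two instantiations of the same template with same-sized elements.
using VoidPtrArray = std::array<void *, 3>;
using IntPtrArray = std::array<int *, 3>;

// The compiler still has to track them as unrelated types.
static_assert(!std::is_same<VoidPtrArray, IntPtrArray>::value,
              "different instantiations are different types");

// Hypothetical helper: true when both types occupy the same storage size.
template <typename A, typename B>
constexpr bool same_size() {
  return sizeof(A) == sizeof(B);
}

// Runtime check; assumes all object pointers share one size on this platform.
inline void check_same_size() {
  assert((same_size<VoidPtrArray, IntPtrArray>()));
}
```

Every translation unit that mentions both aliases repeats this bookkeeping, which is the cost the rest of the article is concerned with.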

As projects grow, modularisation is required for a maintainable code base. Up until C++20, compilation units were unable to share data, so each compiler invocation had to process everything from scratch. A heavy-weight header with lots of template instantiations may eventually add a prohibitive compilation cost to a large project - which is a common reason why large projects avoid the STL and, in extreme cases, some developers avoid templates altogether.

Before C++ modules, only one option was available: precompiled headers. They are not standard, so results vary depending on platform and compiler. But they share a common principle: one or more header files are parsed and compiled to an internal representation, which can be reused later. A special compiler flag then brings this precompiled context into the global state before the compilation unit is parsed.

C++20 standardises that process into a module system. It works like a precompiled header but uses a special compilation unit type instead of plain headers. That unit can export a list of types and symbols through a public interface, as well as define internal symbols and output binary code.

The public interface of a C++20 module can be exported to a file and imported by other compilation units - no matter if they are other modules or top-level units. Details of that process are not enforced by the C++ standard, but the usage is well-defined: the compiler provides a way to store enough information about the module compilation unit and to load that state in other units that depend on it.

This allows reusing compilation state between units. If a module exports a class which depends on std::vector<int>, then units using that module will not need to parse and instantiate the vector template again.

The syntax is straightforward and requires three keywords: import, export and module. import brings a module interface into the current unit. export and module are used to define the interface of a module.

As an example, given a module named M1 exporting a class C1, the C++ file can be defined as:

export module M1;
export class C1 {
};

M1 can be imported in a main compilation unit and C1 used as if it was #included:

import M1;
int main() {
  C1 c;
}

Note: it is valid to have a main function without arguments and without a return statement. But it is not valid to return void from main.

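Any symbol can be exported, as long as it does not have internal linkage. Static functions and symbols defined in an anonymous namespace cannot be exported, and class members cannot be exported separately from their class. But functions, classes, templates, namespaces and other modules are valid examples. An explicit template instantiation is not exported by itself, but it can sit in the interface next to its exported template. Imports, including re-exported ones, must come before any other declaration in the module unit:

export module M2;
export import M1;
export void F2() {}
export template<typename T> class C2 {};
template class C2<int>;
export namespace N2 {
  void N2F() {}
}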
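The linkage rule above does not depend on module syntax, so it can be sketched in an ordinary translation unit. All names below are hypothetical and only illustrate which kinds of declarations would or would not qualify for export.

```cpp
#include <cassert>

// Internal linkage: a named module could not export these two.
static int StaticHelper() { return 1; }
namespace {
int AnonymousHelper() { return 2; }
}  // namespace

// External linkage: valid candidates for export.
int PublicFunction() { return StaticHelper() + AnonymousHelper(); }

template <typename T>
T PublicTemplate(T value) {
  return value;
}

// Usage check within the same translation unit.
inline void check_linkage_sketch() {
  assert(PublicFunction() == 3);
  assert(PublicTemplate(4) == 4);
}
```

The helpers with internal linkage remain usable from the functions that could be exported, which is the same arrangement a module interface relies on.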

The export works as if part of a header inclusion. When M2 is imported, F2, C2 and N2 will become part of the global symbol table of that unit. Therefore, a C++ file like this will be possible:

import M2;
int main() {
  F2();
  C2<float> ca;
  C2<int> cb;
  N2::N2F();
  C1 c;
}

That unit is able to use all symbols exported from M2 as expected, minus the cost of parsing those symbols. F2 can be called and C2 can be instantiated and used. Additionally, C2<int> should not require any extra work, since M2 already instantiated it.
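For comparison, classic C++ has a header-era way of avoiding repeated instantiation of a template: an explicit instantiation declaration paired with a single explicit instantiation definition. The sketch below is not module code; it gives C2 a hypothetical constructor and get member purely so there is something to call.

```cpp
#include <cassert>

// Hypothetical, slightly richer version of the C2 template.
template <typename T>
class C2 {
 public:
  explicit C2(T value) : value_(value) {}
  T get() const { return value_; }

 private:
  T value_;
};

// What a header would carry: C2<int> is instantiated elsewhere.
extern template class C2<int>;

// What exactly one source file would carry: the instantiation itself.
template class C2<int>;

// C2<int> reuses the explicit instantiation; C2<double> is instantiated on demand.
inline void check_instantiations() {
  assert(C2<int>(7).get() == 7);
  assert(C2<double>(1.5).get() == 1.5);
}
```

A module interface achieves a similar effect without the manual pairing, because the instantiation travels with the compiled interface.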

Exporting an imported module brings in all of its symbols and exposes them as part of the exporting module's own public interface. In the aforementioned example, M2 takes C1 from M1 and exposes it. This mechanism allows the composition of complex modules from simpler ones.

Modules resemble namespaces, but the two concepts are orthogonal. That means N2 can be exported by M2 without any prefix, because modules do not introduce new symbols implicitly - the "M2" token is only usable in module and import declarations. Their naming scheme is also different - a module name is a sequence of identifiers separated by periods - and modules cannot be nested. Therefore, a module can be named "M1.A" without any relation to M1 or C1 - it is independent from them in all aspects.

Both M1 and M2 export all of their symbols, but that is not a rule. A module may also define private symbols which will not be available when the module is imported. Given a module M3 as:

export module M3;
void F3A() {}
export void F3B() {
  F3A();
}

Attempts to use F3A will result in an error - as if the symbol was never defined. But F3B can be used as usual, and its call to F3A does not affect the export.

import M3;
int main() {
  // F3A(); << This would yield an error
  F3B();
}

This behaviour is similar to a forward declaration in an included header, but allows more aggressive compile-time optimisations: a compiler is able to inline F3A in F3B when building the M3 interface, then inline F3B in main. That keeps the cost low - especially if M3 is used in many compilation units - while private symbols are kept out of reach.
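The header-based counterpart mentioned above can be sketched in a single file. This is only an illustration: the split into a hypothetical M3.h and M3.cpp is marked with comments, and the functions return int (unlike the void originals) so the result can be observed.

```cpp
#include <cassert>

// --- what a hypothetical M3.h would contain ---
int F3B();  // only the public entry point is declared

// --- what a hypothetical M3.cpp would contain ---
namespace {
// Internal linkage: other translation units cannot name F3A.
int F3A() { return 42; }
}  // namespace

int F3B() { return F3A(); }

// A client that included M3.h could only do this:
inline void check_public_entry_point() {
  assert(F3B() == 42);
}
```

In this arrangement a client only sees the declaration of F3B, so inlining across the boundary depends on link-time optimisation rather than on the compiled interface.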

The visibility concept can be applied to individual symbols inside namespaces. The M2 example exported the whole namespace, but it is possible to selectively define which symbols inside a namespace will be available:

export module M4;
namespace N4 {
  void N4A() {}
  export void N4B() {}
}

When imported, M4 will expose N4B and keep N4A hidden. Hidden symbols are a feature where modules differ from headers. Modules can define symbols following the same rules as normal compilation units, without facing the ODR-violation issues seen in header files. Therefore, code like this is possible:

export module M5;
namespace N5 {
  int N5A = 1;
  export int N5B = 2;
}

Usage of M5 is the same as previous examples:

import M5;
void print(int);
int main() {
  // print(N5A); << This would yield an error
  print(N5B);
}

Effectively, this is a compiler-enforced replacement for namespaces hidden by convention - like "details" or "impl".
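To see why the convention is weaker than module visibility, consider a header-style version of N5. This is a minimal sketch; the detail namespace and the peek_into_detail function are hypothetical names chosen for illustration.

```cpp
#include <cassert>

namespace N5 {
namespace detail {
// Hidden only by convention: nothing stops a client from naming it.
constexpr int N5A = 1;
}  // namespace detail
constexpr int N5B = 2;
}  // namespace N5

// A client that ignores the convention still compiles.
inline int peek_into_detail() {
  return N5::detail::N5A + N5::N5B;
}

inline void check_convention_is_not_enforced() {
  assert(peek_into_detail() == 3);
}
```

With M5, the equivalent access to N5A is rejected by the compiler instead of being left to code review.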

As a module grows, it may be required to split its implementation over multiple units. One way is to re-export symbols from another module. Assuming a unit like this:

export module M6A;
export void F6X() {}
export void F6Y() {}

Both functions can be re-exported like this:

export module M6B;
export import M6A;

M6B will also expose both F6X and F6Y to its clients, due to the "export import" statement. This can be fine tuned by exporting individual symbols:

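export module M7;
import M6A;
export using ::F6X;

The order matters: imports must come before any other declaration in a module unit. F6X is owned by M6A, so M7 cannot declare it again as its own; instead, an exported using-declaration makes the imported F6X part of the M7 interface, while F6Y stays hidden from clients of M7. This is easier to understand when compared to a hypothetical attempt to rewrite M6A and M7 with classical headers: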

#ifndef M7_H
#define M7_H
extern void F6X();
#include "M6A.h"
#endif

Both versions may resemble each other but they are semantically different: the header version leaks every declaration from M6A.h to its clients, while M7 exposes only F6X. The example just draws a parallel on some important concepts, like declaring a symbol that is defined elsewhere, but those units should not be considered equivalent.

Another way of splitting a module is to combine forward declarations in an interface module - following the syntax described so far - with implementation modules. The pair of an interface module and an implementation module works similarly to a header and a code file, except for the visibility of defined symbols.

As an example, given an interface module like this:

export module M8;
export void F8();

Its implementation can be given by this implementation module:

module M8;
void F8A() {
}
void F8() {
  F8A();
}

The syntax uses the same module keyword. The implementation module can be understood as a module that cannot exist or export symbols on its own. There is no standard way of exposing F8A, since the module is part of the symbol's ABI. Therefore, F8A is only available inside M8. Nonetheless, F8 would be available to any compilation unit importing M8.

Modules can be further split with partitions. A module partition behaves like a normal module, but can only be used inside its owning module. Partitions cannot exist inside other partitions, so it is impossible to have sub-partitions or to share partitions between modules.

The naming scheme for a partition is "module_name:partition_name" and it is referred to as ":partition_name" when imported. An example module would be:

export module M9;
export import :P9;

Then, the P9 partition:

export module M9:P9;
export void F9() {}

M9:P9 exposes F9 to its parent module. In turn, M9 re-exports :P9 public symbols - effectively making F9 available to M9 consumers. This shows how partitions are organisation units, meaning P9 is not noticeable to consumers of M9:

import M9;
// import M9:P9; << This does not work outside M9
int main() {
  F9();
}

Apart from visibility differences, partitions behave like normal modules inside their parent module. Symbols can be re-exported automatically or listed manually - along the lines of the earlier M7 example.

The naming rules for partitions follow the module rules, with one exception: a partition cannot be named "private". The ":private" name is reserved for the private module fragment, which can only appear as the last section of a primary module interface unit, and only when that unit is the only unit of its module.

Code defined inside the private module fragment cannot change the definition of the public interface of said module. Changes in the private section should not trigger a new compilation of other units using the module. Therefore, given this example:

export module M10;
export void F10A();

module :private;
void F10B() {}
void F10A() { F10B(); }

Changes in F10B or in the body of F10A will not affect the compilation of its consumers, as long as the compiler and the build system support it.

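As described so far, modules in C++20 deal with many aspects of code organisation for new projects. But even projects created from scratch need to integrate with old code. Preprocessor directives like #include and #define are best kept out of named modules and partitions: a header included after the module declaration has its declarations attached to that module, which is rarely what legacy code expects. Legacy headers belong in the global module.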

The global module differs from other modules in multiple aspects. It is not explicitly defined in a single compilation unit and it cannot be partitioned. There are two ways to introduce code in the global module: compilation units outside modules and the "module;" statement.

When a unit does not contain any module statement, it is part of the global module. This technically means all legacy C++ files have their symbols in that special module.

Otherwise, a module unit can open with "module;" to place legacy code in the global module, before the module declaration itself:

module;
#include <iostream>

export module M11;
export void F11() {
  std::cerr << "Hello" << std::endl;
}

Legacy headers and the macros they bring belong in the global section of a module unit, and it is not possible to use macros to modify module statements. But this should be enough to consume code from unmodifiable non-modular libraries.

Legacy headers also have a special feature. They can be implicitly converted into a module, as long as their behaviour does not depend on macros defined by the importing unit. Such headers can be imported as a module:

import <iostream>;
int main() {
  std::cout << "This is fine" << std::endl;
}

All valid symbols are imported when a header is used as a module. It is up to the compiler implementation to define where the binary form of system headers goes, but the end result is the same: once built, new usages of the header in module form will not need to recompile it.

Unfortunately, the standard does not cover all practical aspects needed to use modules in current projects. For instance, the binary interface and the module search are platform- and provider-dependent.

Those limits exist because the module standard was created with a focus on compilation reuse inside the same project development lifecycle. The binary form of a module is expected to work on the same machine compiling the same project. There is no guarantee a given binary form of a module will work with a different compiler, platform, host or project.

The lack of a binary standard is obnoxious, but justifiable. C++ is agnostic about the compilation results: it never defined an object file or an executable format. This was never an issue, since each platform has its own forms for storing, linking and executing code.

Modules would limit the development of compilers if the C++ standard required a specific format. That openness allows storage of anything in a module binary, which may include compiler-specific data, such as a raw copy of its AST. And, since the output does not need to cross a platform boundary, compiler developers may store platform-aware data, like register allocation.

Another known missing feature is how to search and catalog modules. There are no rules for mapping modules to files or where said files should exist. This is another domain that was never covered by the C++ standard. For instance, the difference between system header and local header is no more than a convention and file names and locations are not enforced.

In conclusion, modules in C++20 are a tool to improve the compilation time of large projects. They have little in common with module systems in other technologies; their usefulness is tied to how C++ already works.
