Token: Utilities for string tokenization
String tokenization is the process of assigning a unique integer token to an input string. This is useful in several situations:
- String comparison. Comparing two strings costs up to O(n) per call (where n is the length of the shorter string), which adds up when comparisons are repeated. Comparing two tokens is always O(1).
- String storage. If the same string is stored in many places (say with arrays of strings being processed), it can consume significant memory. Tokens are fixed-size references to strings, limiting the memory overhead.
- String interning. Libraries used in security-significant settings can refuse to provide a token for an "unsafe" string (or instead provide a token for a properly escaped counterpart). By forcing the API to refer to tokens rather than strings, some safety is provided. (In this situation, you must be careful to deal with collisions and with how strings are inserted into storage for retrieval from a token.)
- Enumerations. Traditional C++ enumerations are fixed at compile time. This can be problematic in libraries that wish to provide extensible enumerations. Tokens can be used in place of enumerants in many cases while allowing the set of interned strings to grow. One particular use case of interest is switch(…) { … } statements; from C++14 onwards, it is possible for case statements to refer to compile-time hashes of strings, making case labels clear.
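As a rough illustration of the switch use case, the sketch below hashes strings with a constexpr 32-bit FNV-1a function and uses the results as case labels. It does not use the token library itself: the sketch namespace, fnv1a, the _h literal suffix, and sideCount are hypothetical names chosen for this example, standing in for the library's own string-literal operators.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch
{

// 32-bit FNV-1a written as a C++14 constexpr loop.
constexpr std::uint32_t fnv1a(const char* data, std::size_t size)
{
  std::uint32_t hash = 0x811c9dc5u; // FNV offset basis
  for (std::size_t ii = 0; ii < size; ++ii)
  {
    hash ^= static_cast<unsigned char>(data[ii]);
    hash *= 0x01000193u; // FNV prime
  }
  return hash;
}

// Hypothetical literal operator (an assumption for this sketch, not the library's).
constexpr std::uint32_t operator""_h(const char* data, std::size_t size)
{
  return fnv1a(data, size);
}

// Map a shape name to its number of sides; -1 means "not recognized".
inline int sideCount(const std::string& shape)
{
  switch (fnv1a(shape.data(), shape.size()))
  {
    case "triangle"_h: return 3;
    case "square"_h:   return 4;
    case "hexagon"_h:  return 6;
    default:           return -1;
  }
}

// Small self-check showing the intended use.
inline void checkSideCount()
{
  assert(sideCount("square") == 4);
  assert(sideCount("circle") == -1);
}

} // namespace sketch
```

If two labels ever hashed to the same value, the compiler would reject the switch for having duplicate case labels. A runtime string that happens to collide with a label would still take that branch in this sketch, since nothing here checks for collisions.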
Tokenization
The token library provides:
- A token::Token class that can be constructed from either a string or an integer hash. Hashes may be computed at compile time, though
  - compile-time hashes cannot be checked for collisions and
  - compile-time hashes do not have their corresponding strings interned.
- A token::Manager class that holds interned strings. An instance of the manager is owned as a class-static member of token::Token so that all interned strings are collected by one object.
- String-literal operators for hashing strings:
  - The ""_token operator produces a token::Token instance from a string.
  - The ""_hash operator produces a token::Hash integer from a string.
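To make the difference between an interned string and a bare hash concrete, here is a minimal sketch of a manager-like class. It is not the library's token::Manager: the sketch namespace, the Manager interface shown (intern, contains, value), and the choice to throw on a collision are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

namespace sketch
{

using Hash = std::uint32_t;

// 32-bit FNV-1a over the bytes of a string.
inline Hash fnv1a(const std::string& text)
{
  Hash hash = 0x811c9dc5u;
  for (unsigned char c : text)
  {
    hash ^= c;
    hash *= 0x01000193u;
  }
  return hash;
}

// Hypothetical stand-in for a manager: owns one copy of each interned string.
class Manager
{
public:
  // Intern a string and return its hash; throw if a different string
  // already occupies the same hash (a collision).
  Hash intern(const std::string& text)
  {
    Hash hash = fnv1a(text);
    auto result = m_strings.emplace(hash, text);
    if (!result.second && result.first->second != text)
    {
      throw std::runtime_error("hash collision for \"" + text + "\"");
    }
    return hash;
  }

  // Return true when some string has been interned for this hash.
  bool contains(Hash hash) const { return m_strings.count(hash) != 0; }

  // Look up the string for a hash; throws std::out_of_range if absent.
  const std::string& value(Hash hash) const { return m_strings.at(hash); }

private:
  std::unordered_map<Hash, std::string> m_strings;
};

// Usage: a hash computed on its own has no string behind it.
inline void demo()
{
  Manager manager;
  Hash interned = manager.intern("abc"); // string is stored
  Hash bare = fnv1a("ab");               // hash only, nothing stored
  assert(manager.value(interned) == "abc");
  assert(!manager.contains(bare));
}

} // namespace sketch
```

Because the collision check happens inside intern, it can only cover strings that pass through the manager at run time. That is consistent with the two caveats listed for compile-time hashes.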
Tokenization (computing a hash of the string) is performed with the 32-bit FNV-1a algorithm.
A simple utility program, named tokenize, is provided with the
library. Given one or more strings on the command line, it computes
the integer hash of each and reports them. The hash number is
identical on all platforms:
% ./bin/tokenize "" a b c ab bc ac abc
0x811c9dc5 = ""
0xe40c292c = "a"
0xe70c2de5 = "b"
0xe60c2c52 = "c"
0x4d2505ca = "ab"
0x3e2ba9f2 = "bc"
0x4e25075d = "ac"
0x1a47e90b = "abc"
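The values listed above can be reproduced with a few lines of standard C++. The sketch below is not the source of the tokenize program; it is an independent 32-bit FNV-1a implementation, and the sketch namespace and the report function are hypothetical names that only mimic the output format shown.

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

namespace sketch
{

// 32-bit FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
inline std::uint32_t fnv1a(const std::string& text)
{
  std::uint32_t hash = 0x811c9dc5u; // offset basis, also the hash of ""
  for (unsigned char c : text)
  {
    hash ^= c;
    hash *= 0x01000193u;
  }
  return hash;
}

// Format one line in the style of the listing: 0x<8 hex digits> = "<text>".
inline std::string report(const std::string& text)
{
  std::ostringstream out;
  out << "0x" << std::hex << std::setw(8) << std::setfill('0') << fnv1a(text)
      << " = \"" << text << "\"";
  return out.str();
}

// Check two lines against the listing.
inline void demo()
{
  assert(report("") == "0x811c9dc5 = \"\"");
  assert(report("abc") == "0x1a47e90b = \"abc\"");
}

} // namespace sketch
```

The arithmetic uses only fixed-width unsigned integers, so the result does not depend on byte order or word size, which matches the claim that hashes are identical across platforms.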
Additional features
Besides string tokenization, the token library provides:
- Templated typeName<T>() and typeToken<T>() functions that return a string (respectively Token) holding the type-name of the template parameter T. In the future, these functions will be constexpr, but are not for now because older compilers do not allow constexpr functions with temporary variables.
- A TypeContainer class for registering and fetching singleton objects given their type as a template parameter. (A string token of each object's type-name is used as the key into an unordered map of object-wrappers.)
- A singleton API that provides a global instance of a TypeContainer for applications to store/retrieve singletons of any type. This functionality is provided since the token library must be dynamic (it has a global variable holding a token::Manager instance), so it may as well provide this service to others.
- Serialization/deserialization of tokens and their interned strings to/from JSON.
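The type-keyed container idea can be sketched with the standard library alone. This is a simplified illustration, not the library's TypeContainer: the member names emplace and get and the globalContainer function are hypothetical, the typeName stand-in uses typeid (whose names are implementation-defined), and the map is keyed by the name string directly rather than by a token.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <typeinfo>
#include <unordered_map>
#include <utility>

namespace sketch
{

// Stand-in for typeName<T>(): an implementation-defined (often mangled) name.
template<typename T>
std::string typeName()
{
  return typeid(T).name();
}

// Hypothetical container holding at most one object per type.
class TypeContainer
{
public:
  // Construct and store (or replace) the object registered for type T.
  template<typename T, typename... Args>
  T& emplace(Args&&... args)
  {
    auto object = std::make_shared<T>(std::forward<Args>(args)...);
    m_objects[typeName<T>()] = object;
    return *object;
  }

  // Fetch the object registered for type T, or nullptr if none exists.
  template<typename T>
  T* get() const
  {
    auto it = m_objects.find(typeName<T>());
    return it == m_objects.end() ? nullptr : static_cast<T*>(it->second.get());
  }

private:
  // shared_ptr<void> remembers the deleter of the original type.
  std::unordered_map<std::string, std::shared_ptr<void>> m_objects;
};

// Process-wide instance, loosely analogous to the singleton API described above.
inline TypeContainer& globalContainer()
{
  static TypeContainer instance;
  return instance;
}

// Usage: store one std::string, then look it up by type alone.
inline void demo()
{
  TypeContainer& container = globalContainer();
  container.emplace<std::string>("hello");
  assert(*container.get<std::string>() == "hello");
  assert(container.get<double>() == nullptr);
}

} // namespace sketch
```

Keying by a token instead of the full name string, as the library is described as doing, would keep the map keys at a fixed size.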
Building
The token library depends on the nlohmann_json library.
It has only one configuration option that may be set by consuming projects: token_NAMESPACE, which changes the namespace containing the library's classes from token to another valid identifier.
To build and test, simply run:
cmake -G Ninja /path/to/token/source
ninja install
ctest