Module ALib Characters is very foundational. Its main goal is not to provide algorithms and functionality, but rather clarity and consistency. Both are needed in two related areas where the C++ language itself turns out to be a little unclear and burdened by legacy compatibility issues.
The two areas are "characters" and "character arrays". The latter may also be called "character strings", however in the context of this module, the more general term "array" is preferred.
As discussed in the next chapters, both areas, partly for historical reasons and partly due to the abstract nature of C++, contain some pitfalls and difficulties with respect to creating compatible, platform-independent and secure software.
This module introduces its own type definitions and "type traits", aiming to overcome these difficulties while retaining the strictness and clarity that a programmer is used to from the C++ language.
And here, the scope of the module already ends! Further functionality is only found in separate modules like ALib Strings, ALib Boxing or ALib BaseCamp. We think that the combination of these modules forms an unrivalled team and that no other C++ library that we know of makes character and string handling as convenient, seamless, compatible, readable and flexible as these.
What is a "character" in computer science? What is a "character" for humans? Both questions do not have simple answers. Both idioms underwent some historical development and change.
The good news is: Today, the challenge of representing characters of human written languages in computer systems is at least well understood. But it is complex.
The right definition of the character type creates a lot of confusion and platform dependencies. This Programmer's Manual cannot elaborate on this topic in full depth. To make the best and most efficient use of module ALib Characters and those modules that depend on it, what is written in this and the following manual sections is probably enough.
To gain deeper knowledge and to understand the rationale behind the design decisions of this library, some links are provided here for the interested user:
The following simple observations already indicate that choosing "the right" character width, which then leads to an efficient implementation of C++ strings, is a non-trivial task:
- The GNU compiler defines wchar_t as a 4-byte integral type under GNU/Linux.
- The Microsoft compiler defines wchar_t as a 2-byte integral type.
- While we have never seen a compiler doing this, the C++ language standard officially even allows wchar_t to be just one byte wide.
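The width of wchar_t on a given platform can be checked at compile time. The following is a minimal sketch; the names kWideCharWidth and IsSupportedWideWidth are hypothetical and not part of ALib:

```cpp
#include <cstddef>

// Width of the built-in wide character type on this compiler/platform.
// Typically 4 with GCC under GNU/Linux and 2 with the Microsoft compiler.
constexpr std::size_t kWideCharWidth = sizeof(wchar_t);

// Hypothetical helper: true for the two widths discussed in this manual.
constexpr bool IsSupportedWideWidth(std::size_t width) {
    return width == 2 || width == 4;
}

// sizeof(char) is 1 by definition of the language.
static_assert(sizeof(char) == 1, "char is always one byte wide");

// Fails to compile on an (exotic) platform with a different wchar_t width.
static_assert(IsSupportedWideWidth(kWideCharWidth),
              "unexpected width of wchar_t");
```

Such a check makes the platform dependency visible at build time instead of at run time.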
In general, this ALib module distinguishes between two ways of choosing the underlying character width of a string: an explicit choice and a logical choice.
The explicit choice of the character width is the less frequent approach. It is mostly used internally in the library and when interacting with system and 3rd-party libraries. For this, type definitions nchar and wchar are provided. While nchar is always identical to C++ type char (hence a simple alias name), type wchar might resolve to any of the built-in types wchar_t, char16_t or char32_t.
Note that the "explicit choice of the character width" with types nchar and wchar is very similar to the built-in C++ choice of types char and wchar_t. The slight but crucial difference is that the final definition of what a "wide character" is (a 2- or 4-byte integral) is no longer made by the compiler/platform. Instead, the ALib library decides. This constitutes the first step towards platform independence.
While the two explicit character types nchar and wchar alias two of the four built-in types (char, wchar_t, char16_t or char32_t), the two types themselves are aliased once more by two further type definitions: character and complementChar.
Depending on the compiler, the platform defaults and the compiler symbol ALIB_CHARACTERS_WIDE, one of these type aliases equals nchar and the other wchar.
This logical naming means that character is the default type for characters used with ALib, and type complementChar is just its counterpart. It is recommended to simply use logical type character for all string operations of a normal software entity. Its use expresses "I don't care, don't bother me!", and ALib helps to ensure that even if string types based on other characters come in, this is hardly noticed in a user's code!
Altogether, this results in a three-level type/alias scheme. For example, the default on the GNU/Linux platform is:
Logical Type                 Explicit Type         C++ Type
--------------------------------------------------------------
alib::character       <==>   alib::nchar    <==>   char
alib::complementChar  <==>   alib::wchar    <==>   wchar_t
In contrast to this, under Windows OS using the Microsoft compiler, the scheme is:
alib::character       <==>   alib::wchar    <==>   wchar_t
alib::complementChar  <==>   alib::nchar    <==>   char
The attentive reader will notice that an alib::character under Windows OS is not equivalent to an alib::complementChar under GNU/Linux, although both alias C++ type wchar_t. As explained above, this type is platform-dependent and hence might be 2 or 4 bytes wide. (Note that in theory, it might even be only 1 byte wide, which is not supported by ALib.)
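The three-level scheme can be pictured with plain C++ alias declarations. The sketch below is an illustration only: namespace sketch and the switch SKETCH_CHARACTERS_WIDE are hypothetical stand-ins, the wide type simply follows wchar_t here, and none of this is ALib's actual implementation:

```cpp
#include <type_traits>

// Hypothetical configuration switch, standing in for a library compilation
// setting. It defaults to "wide" on Windows and to "narrow" elsewhere.
#ifndef SKETCH_CHARACTERS_WIDE
#   if defined(_WIN32)
#       define SKETCH_CHARACTERS_WIDE 1
#   else
#       define SKETCH_CHARACTERS_WIDE 0
#   endif
#endif

namespace sketch {

// Explicit types: the narrow type is always char. In this sketch, the wide
// type simply follows the platform's wchar_t.
using nchar = char;
using wchar = wchar_t;

// Logical types: one of them equals nchar, the other equals wchar.
using character =
    std::conditional<SKETCH_CHARACTERS_WIDE != 0, wchar, nchar>::type;
using complementChar =
    std::conditional<SKETCH_CHARACTERS_WIDE != 0, nchar, wchar>::type;

// The default type and its complement never coincide.
static_assert(!std::is_same<character, complementChar>::value,
              "character and complementChar must differ");

} // namespace sketch
```

User code written against sketch::character compiles unchanged under both settings, which is the point of the logical level.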
To write software that is able to cope with all possible character types, the two types "narrow" and "wide" obviously are not sufficient. Therefore, this ALib Module defines a third logical character type, strangeChar, which holds the third possible width.
This (strange!) name was chosen to express that this third type is compatible neither with ALib narrow characters nor with ALib wide characters. By default, it is also equivalent to neither C++ char nor wchar_t. However, this might be changed, as we will see later.
Now, just for completeness, this logical type has an explicit counterpart with alib::xchar. The underlying C++ type of logical type strangeChar and explicit type xchar are always the same. Both types are identical.
We have now covered character widths of 1, 2 and 4 bytes. With this, the default on GNU/Linux is:
Logical Type                 Explicit Type         C++ Type
-------------------------------------------------------------------------
alib::character       <==>   alib::nchar     ==    char      (1 byte)
alib::complementChar  <==>   alib::wchar    <==>   wchar_t   (4 bytes)
alib::strangeChar      ==    alib::xchar    <==>   char16_t  (2 bytes)
On Windows with the Microsoft compiler:
alib::character       <==>   alib::wchar    <==>   wchar_t   (2 bytes)
alib::complementChar  <==>   alib::nchar     ==    char      (1 byte)
alib::strangeChar      ==    alib::xchar    <==>   char32_t  (4 bytes)
Each table uses symbol '==' twice. This illustrates that the pairs of types nchar and char, and strangeChar and xchar, are always synonyms. For all other pairs of types in the tables above (where symbol '<==>' is used), this is never guaranteed and depends on the compiler, the platform and the library compilation settings.
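The default choice of the third type, "the one with the other width", can be expressed with a compile-time selection. This is a minimal sketch under the assumption that the wide type equals wchar_t; namespace sketch is hypothetical and the code is not taken from ALib:

```cpp
#include <type_traits>

namespace sketch {

// Assumption of this sketch: the wide type is the platform's wchar_t.
using wchar = wchar_t;

// The third type takes the wide width that wchar does not have:
// char32_t where wchar_t is 2 bytes wide, char16_t where it is 4 bytes wide.
using xchar = std::conditional<sizeof(wchar) == sizeof(char16_t),
                               char32_t,
                               char16_t>::type;

// The logical name is a plain synonym of the explicit one.
using strangeChar = xchar;

static_assert(sizeof(xchar) != sizeof(wchar),
              "xchar must differ in width from wchar");

} // namespace sketch
```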
Type QChar is equivalent to logical type strangeChar under GNU/Linux, while it is equivalent to type character on Windows OS. This is true because, on any platform, type QChar is defined to have a width of 2 bytes!
While it is obvious that the default character type alib::character is needed to write platform-independent code, it might not be so simple to imagine a use case for types complementChar and strangeChar. Mostly, those two are needed internally, but they are also invaluable when it comes to template meta-programming with the aim of creating functionality that supports any type of character or string. Samples of this will be seen in the Programmer's Manual of module ALib Strings.
Let us quickly summarize what was said in this chapter about character types:
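- ALib defines three explicit character types, nchar, wchar and xchar, and three logical character types, character, complementChar and strangeChar. Each of them aliases one of the built-in C++ types char, char16_t, char32_t or wchar_t.
- By default, logical type character is equivalent to char under GNU/Linux, and to wchar_t under Windows OS with the Microsoft compiler.
While these definitions might sound like introducing another level of complexity on top of the already confusing situation that C++ alone creates, this approach proves to simplify character handling a lot. Of course, this is only true if a programmer embraces these ALib definitions and starts using type character everywhere instead of type char or wchar_t.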
While the main goal of this module is to enable a programmer to mostly forget about the character width that she is using with her ALib-enabled software, and furthermore to transparently (that is, without adding special code) convert external strings to the preferred width, it may still be worth considering which character width is chosen for logical type character.
As explained in the previous section, the default for GNU/Linux and GCC is to use the 1-byte narrow char type, while the library's default choice with the Microsoft compiler under Windows OS is to use the built-in 2-byte wchar_t.
Now if, for example, a software mixes ALib with the QT Class Library, then to avoid a lot of (transparent) string conversions, it would be preferable to use the same 2-byte wide character type that QT uses for its strings, independent of the platform that the software is compiled on.
To achieve this, two compiler symbols may be passed:
- One symbol selects either char or any of the wide character types wchar_t, char16_t or char32_t as the default logical character type character.
- The other symbol determines the width of type wchar. By default, type wchar is equivalent to wchar_t. Type xchar consequently defaults to either char16_t or char32_t, just the one whose width differs from that of wchar_t. If a width is chosen that differs from that of wchar_t, then the assignment just changes: wchar becomes either char16_t or char32_t, while xchar becomes wchar_t.
General purpose user code should not be affected by how the library is compiled with respect to character width. In particular, all ALib Modules compile and run independently of the compiler symbols introduced in the previous section. Furthermore, module ALib Strings (which of course heavily relies on this module) provides various features to keep a user's code clean and transparent with respect to the character selection.
However, still in some situations, different code may be selected dependent on the library compilation settings. For this, a few corresponding symbols are provided:
- One symbol evaluates to false (0) when type character is equivalent to C++ type char and to true (1) if the type is a wide character.
- One symbol provides the width of type wchar, which may differ from the width of built-in type wchar_t.
- One symbol evaluates to true (1) when type wchar is equivalent to C++ type wchar_t and to false (0) if the type is either char16_t or char32_t and has a different width than wchar_t.
In C++, character and character string literals are enclosed in single and double quote characters, respectively. For character types of widths other than the single-byte type char, a corresponding prefix character 'L', 'u' or 'U' is needed in addition.
Let us look at some samples:
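A minimal sketch of such a sample follows. The variable name wc is taken from the surrounding text; the enclosing function CharacterSample is a hypothetical wrapper added only to make the snippet self-contained:

```cpp
// Returns the value that ends up in the wide character variable.
inline wchar_t CharacterSample() {
    char    c  = 'N';  // narrow literal assigned to a narrow character
    wchar_t wc = 'W';  // compiles: the narrow literal is converted implicitly
    (void) c;          // silence "unused variable" warnings
    return wc;
}

// The prefixed literal has type wchar_t right away, so no conversion occurs.
inline wchar_t PrefixedCharacterSample() {
    wchar_t wc = L'W';
    return wc;
}
```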
While this code compiles, the sample is incorrect insofar as an implicit character conversion is performed with the assignment of variable wc. This is possible because the compiler detects that wchar_t is a wider integral type than char.
Therefore, we switch to string literals, which create zero-terminated character arrays:
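A minimal sketch with string literals, reusing the variable names c and wc as an assumption:

```cpp
// Zero-terminated array of char: {'N', '\0'}
const char*    c  =  "N";

// Zero-terminated array of wchar_t: {L'W', L'\0'}
const wchar_t* wc = L"W";

// The following line would not compile, because an array of char
// cannot be converted to a pointer to wchar_t:
// const wchar_t* broken = "W";
```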
Here, the prefix 'L' must not be omitted to create a zero-terminated array of wchar_t, because arrays of wider element size cannot simply be cast. Now, taking all C++ native character types into account, we are forced to also use prefixes 'u' and 'U':
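A minimal sketch covering all four native character types; the variable names are chosen for illustration only:

```cpp
const char*     c   =  "1 byte";        // no prefix:  array of char
const wchar_t*  wc  = L"2 or 4 bytes";  // prefix 'L': array of wchar_t
const char16_t* c16 = u"2 bytes";       // prefix 'u': array of char16_t
const char32_t* c32 = U"4 bytes";       // prefix 'U': array of char32_t
```

Each prefix fixes the element type of the literal at compile time, so mixing them up results in a compilation error rather than a silent conversion.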
We recall that ALib introduces six alias character types: character, complementChar, strangeChar, nchar, wchar and xchar.
Apart from type nchar (which is always equivalent to built-in type char), we do not know which underlying types are chosen, which of course is the goal of the whole exercise. To define string literals, this module provides six corresponding preprocessor macros.
Dependent on the definitions of the corresponding character types, the macros simply prepend the right prefix character, or none in case of the narrow, 1-byte character.
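How such a macro can prepend a prefix is sketched below with the preprocessor's token-pasting operator. The names SKETCH_WIDE, SKETCH_CHAR and sketch_character are hypothetical and are not the macros or types that this module defines:

```cpp
// Hypothetical configuration switch: 0 selects narrow, 1 selects wide.
#ifndef SKETCH_WIDE
#   define SKETCH_WIDE 0
#endif

#if SKETCH_WIDE
    using sketch_character = wchar_t;
    // Paste prefix 'L' in front of the given character or string literal.
#   define SKETCH_CHAR(literal)  L ## literal
#else
    using sketch_character = char;
    // Narrow, 1-byte characters need no prefix.
#   define SKETCH_CHAR(literal)  literal
#endif

// The same source line compiles with either setting.
const sketch_character* greeting = SKETCH_CHAR("Hello");
```

Token pasting works for character literals as well, so SKETCH_CHAR('H') yields a literal of the matching type.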
Note that the macro for type nchar never adds a prefix, because nchar is always equivalent to char. Therefore, its use is a matter of taste. Other ALib Modules do not use it, because for a programmer reading a string literal without a prefix or macro, it should be evident that a single-byte string is created by the compiler.
With this in place, literals of the default character type can easily be defined like this:
To conclude this section, this last snippet shows the use of all six macros. Instead of constant character pointers, this sample uses the string classes introduced with separate module ALib Strings. These classes are templated with respect to the character type and consequently, six alias types are offered:
The previous chapter talked about challenges that are caused by the built-in character type definitions of the C++ language, be it for legacy/compatibility reasons or due to the language respecting platform defaults and providing this general openness and flexibility.
Unfortunately, this chapter about C++ character arrays, again talks about challenges...
While this module is about "characters" and arrays of those, the term "string" is used for character arrays likewise. And within the C++ language, this already imposes the first irritating inconsistency:
char arrayA[3] {'A','B','C'};
char arrayB[3]= "ABC";  // Error: initializing string too long
A C++ string literal always includes a terminating zero ('\0') character. Therefore, only a string with a length of 2 fits into the sampled array of length 3:
char arrayA[3] {'A','B','C'};
char arrayB[3]= "AB";
Now, with the second array, the information about its size exists twice. The information about whether one of the two arrays is zero-terminated is non-existent. Of course, we can convert both to a character pointer:
char *cpA= arrayA;
char *cpB= arrayB;
Now, the information about the array length is gone with the first array, with the second it is preserved due to zero termination.
Already with the following lines of code, we have undefined behavior:
std::cout << "array A: " << cpA << std::endl;  // undefined behavior: arrayA is not zero-terminated
std::cout << "array B: " << cpB << std::endl;  // fine: arrayB is zero-terminated
While C++ is considered a type-safe language, only a few lines of code (that only perform legal type conversions) may lead to program crashes. The whole reason for this is that C++ aims to remain compatible with older language versions and with the C language, which many decades ago introduced zero-terminated strings.
A next irritating observation is about assigning string literals to const and non-const character pointers:
const char* pointer2= "AB";   // OK
char*       pointer1= "AB";   // Warning: "ISO C++11 does not allow conversion from string literal to 'char *'"
char        array[3]= "AB";   // OK(!)
char*       pointer3= array;  // OK, this avoided the warning from above, without using an explicit cast!
Before we conclude, a last question: How do you detect the length of a C++ string literal provided in an external macro, like this:
#define STR_LIT "A String that may change with the next library version"
The approach of most programmers would probably be:
size_t length= strlen(STR_LIT);
Well, this is inefficient, because it does not use the compiler's knowledge about the length of the character array. There is a "constexpr", and hence zero-cost, solution available:
constexpr size_t length= std::extent<std::remove_reference<decltype(STR_LIT)>::type>::value - 1;
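The same compile-time knowledge can also be obtained by letting a function template deduce the array extent. This is a minimal sketch; the name LiteralLength is hypothetical and not part of ALib:

```cpp
#include <cstddef>

// A string literal is an array of N characters, including the terminating
// zero. Binding it to a reference to an array lets the compiler deduce N.
template<typename TChar, std::size_t N>
constexpr std::size_t LiteralLength(const TChar (&)[N]) {
    return N - 1;  // exclude the terminating zero
}

#define STR_LIT "A String that may change with the next library version"

// Evaluated at compile time; no call to strlen() at run time.
constexpr std::size_t length = LiteralLength(STR_LIT);
```

The helper works for literals of any character width, for example LiteralLength(L"ABC").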
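Let us summarize what types of "character arrays" are available in the C++ core language: fixed-length character arrays, constant character pointers, mutable character pointers, and string literals.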
Besides that, there are tons of libraries available that define their own string types.
The concept of "type traits" in C++ is used to annotate types with attributes that can be evaluated at compile time. To implement this, templated structs are used which are specialized for the types in question, and these specializations provide compile-time information for a type.
With this ALib Module, type traits for character arrays are introduced. As described in the previous section, the C++ language is very unclear about what a character array "looks like", and for this reason many class libraries use their own lightweight or heavy string classes. The goal is to be able to use different sorts of character arrays in a type-safe and transparent manner.
This module introduces struct T_CharArray<TString,TChar> which offers type traits for types that "implement" character arrays.
As seen, the name of the template parameter that denotes the C++ type to provide type traits for is "TString". This name indicates that usually character array type traits are provided for "string classes" (e.g., std::string or QString). Along these lines, this documentation often uses the verb "to implement" with respect to the relationship of type TString and character arrays. This terminology may be misleading. Instead, it could also be that objects of type TString "represent" a character array (e.g., C++17 class std::string_view), that types "contain" a character array, or even that they create and provide one only on request.
Besides template parameter TString that denotes the type to provide traits for, a second template parameter TChar needs to be given to denote the character type (width) of the character arrays that are "implemented" by type TString.
Specializations of T_CharArray need to define static constexpr field Access of enumeration AccessType. In the non-specialized version, value NONE is given, which usually indicates that a type is not an array-like type. Precisely, it means that character array data cannot be accessed from instances of the type.
Specializations usually specify one of the three other values, which allow implicit access, explicit access only, or access only from mutable objects (MutableOnly).
If one of the three values is given with a specialization, two static methods need to be defined which implement the type-specific access to the character array.
While in the documentation of the two methods, parameter src of the static access methods is of type const TString, in the case of using access flag MutableOnly, the methods have to be defined using a mutable reference to TString.
A second static constexpr member that a specialization of struct T_CharArray needs to define is field Construction. It determines whether and how an instance of type TString may be created from existing character array data.
The default value (the one given in the non-specialized struct) is NONE, which determines that objects of type TString cannot be created from arrays. Specializations may instead specify one of two values, which allow implicit construction or explicit construction only.
If either of the two values is set, static method T_CharArray::Construct has to be provided with the specialization of the struct. The implementation of this method needs to create a value of type TString from the character array provided with the method's arguments.
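The general shape of such a traits struct and one specialization can be sketched as follows. This is a simplified illustration and not ALib's declaration: namespace sketch, struct CharArrayTraits, the enumerator names Implicit and ExplicitOnly, and the access method names Buffer and Length are assumptions made for this example; only NONE, MutableOnly and Construct are named in the text above:

```cpp
#include <cstddef>
#include <string>

namespace sketch {

enum class AccessType       { NONE, Implicit, ExplicitOnly, MutableOnly };
enum class ConstructionType { NONE, Implicit, ExplicitOnly };

// Non-specialized version: the type does not provide character array data
// and cannot be constructed from such data.
template<typename TString, typename TChar>
struct CharArrayTraits {
    static constexpr AccessType       Access       = AccessType::NONE;
    static constexpr ConstructionType Construction = ConstructionType::NONE;
};

// Specialization for std::string with character type char.
template<>
struct CharArrayTraits<std::string, char> {
    // Reading the array is harmless, hence implicit access is allowed.
    static constexpr AccessType       Access       = AccessType::Implicit;
    // Creating a std::string allocates memory, hence explicit only.
    static constexpr ConstructionType Construction = ConstructionType::ExplicitOnly;

    // The two static access methods.
    static const char* Buffer(const std::string& src) { return src.data(); }
    static std::size_t Length(const std::string& src) { return src.size(); }

    // Creates a value of the string type from existing array data.
    static std::string Construct(const char* array, std::size_t length) {
        return std::string(array, length);
    }
};

} // namespace sketch
```

Generic code can then test CharArrayTraits<T, char>::Access at compile time and call Buffer and Length without knowing anything else about T.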
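The type traits template struct T_CharArray<TString,TChar> introduced in the previous section is used to answer questions like: Does type TString implement a character array of type TChar? May its array data be accessed, and if so, implicitly or only explicitly? May an object of type TString be constructed from a character array?
And if positive answers to such questions are given, static methods are to be provided along with the specialization to implement the array access and the object construction, respectively.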
With sibling struct T_ZTCharArray<TString,TChar>, the same compile-time information and method implementations are provided for zero-terminated character arrays. Apart from the prefix "ZT" in the TMP struct's name, which stands for "zero-terminated", all rules for specializations are the very same.
Of course, a type TString that represents a zero-terminated array type may (and should) specialize both templated traits structs. Hereby it might use different flags and implementations of the static methods for simple character arrays and for zero-terminated ones.
The following C++ preprocessor macros are defined by this module to support the correct specialization of template traits struct T_CharArray:
Again, an equal set of macros is defined to support the correct specialization of template traits struct T_ZTCharArray:
The fact that C++ does not provide a distinct "character" type implies that traits structs T_CharArray and T_ZTCharArray have the character type TChar as a second template parameter, which has to be named along with the main parameter TString.
As an example, for type std::string, which is an alias of std::basic_string<char>, the specialization of T_CharArray will be:
T_CharArray<std::string, char>
This sometimes imposes a little complication when code wants to compile selectively based on the information that a type implements just any sort of character array, rather than one of a distinct character type.
For this, helper-struct TT_CharArrayType is provided. Its inner type definition TChar provides the character type that a given type TString implements an array for. With that, for example to test the array access type of TString without knowing (or caring) about the character type of the array that is accessed, expression
T_CharArray<TString, typename TT_CharArrayType<TString>::TChar>::Access
can be used.
With sibling helper-struct TT_ZTCharArrayType, the same is provided for traits struct T_ZTCharArray.
As stated before, module ALib Characters has a very foundational nature and does not provide algorithms and functionality, but rather type definitions and type traits. The rationale for this is that several modules of ALib, especially ALib Strings and ALib Boxing, benefit from the foundation provided here independently of each other. Consequently, this module had to be independent of both.
However, in the context of a custom set of string classes, the meaning of the array type traits is much easier to understand. As this chapter provides information about which built-in specializations of the traits structs are provided (and why), we quickly want to anticipate what is found with separate module ALib Strings:
In addition to that, type CString represents a zero-terminated character string. The exact same rules apply to this class, but it uses the flags given in specializations of characters::T_ZTCharArray instead. Both string types, String and CString, are very lightweight. Neither manages a character string array, and neither allocates or deallocates memory. All they do is provide two data elements: the pointer to the array and the length of the string. As such, they are similar to class std::string_view introduced with C++17.
A third type, class AString, implements a "heavyweight" string, namely one that manages its own allocated character buffer. While the character arrays of this class are not zero-terminated by default, the class always reserves space for a zero-termination character. This allows interface method AString::Terminate to be defined as a constant operation. Like types String and CString, the class provides cast operators (which may terminate the internal buffer) and allows the concatenation of string-like objects, again using the type traits.
The combination of all of this provides the huge gain that comes with using character array traits: Objects of any string type, be it C++ string literals, std::string objects, ALib strings or any custom 3rd-party library string type like QString, become seamlessly interchangeable! For example, if implicit access is allowed for an external string type XYZString and a method expects a constant ALib string reference, such a method can be invoked with objects of type XYZString seamlessly, meaning without any explicit conversion. The other way round, if a third-party method expects a constant reference to its XYZString, ALib strings might be passed without explicit conversion.
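The mechanism behind this interchangeability can be sketched with a lightweight string class whose converting constructor is only enabled for types that announce implicit access. Everything below is hypothetical (ArrayTraits, LightString, CountChars, and the stand-in type XYZString) and heavily simplified; it is not the implementation found in ALib Strings:

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

namespace sketch {

// Simplified stand-in for character array traits: no access by default.
template<typename T>
struct ArrayTraits { static constexpr bool ImplicitAccess = false; };

// A hypothetical external string type.
struct XYZString { const char* text; std::size_t size; };

template<> struct ArrayTraits<XYZString> {
    static constexpr bool ImplicitAccess = true;
    static const char* Buffer(const XYZString& s) { return s.text; }
    static std::size_t Length(const XYZString& s) { return s.size; }
};

template<> struct ArrayTraits<std::string> {
    static constexpr bool ImplicitAccess = true;
    static const char* Buffer(const std::string& s) { return s.data(); }
    static std::size_t Length(const std::string& s) { return s.size(); }
};

// Lightweight string: just a pointer and a length, no memory management.
class LightString {
  public:
    // Participates in overload resolution only if implicit access is allowed.
    template<typename T,
             typename std::enable_if<ArrayTraits<T>::ImplicitAccess, int>::type = 0>
    LightString(const T& src)
    : buffer_(ArrayTraits<T>::Buffer(src)), length_(ArrayTraits<T>::Length(src)) {}

    const char* Buffer() const { return buffer_; }
    std::size_t Length() const { return length_; }

  private:
    const char* buffer_;
    std::size_t length_;
};

// A function that expects the library's own string type...
inline std::size_t CountChars(const LightString& s) { return s.Length(); }

} // namespace sketch
```

With this in place, CountChars accepts an XYZString or a std::string directly, while a type without a traits specialization is rejected at compile time.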
With this in mind, the following subchapters provide information about the built-in array specializations of this module.
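There are probably four different built-in C++ character array types: fixed-length character arrays, constant character pointers, mutable character pointers, and string literals.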
The reason why we separate constant from mutable character pointers is given in the previous chapter 4.1 The Challenges Of C++ Character Arrays. In short: String literals can be assigned to constant character pointers, but not to mutable ones.
The compiler treats string literals as either constant character pointers or fixed length arrays, dependent on the context. The advantage of the latter is that their length is available at compile-time. As they are not distinguishable from the other types, the number of built-in types is reduced to three.
The following table lists the access and construction traits for the three types:
Type | Character Arrays | Zero-Terminated Character Arrays | Notes |
---|---|---|---|
TChar[N] | Implicit Access, No Construction | Implicit Access, No Construction | The decision to allow implicit access even in the case of zero-terminated arrays (although fixed-length arrays are probably not zero-terminated) lies in the fact that C++ string literals are fixed-length arrays which are zero-terminated. Of course, this implies that some care has to be taken when using "real" C++ character arrays with ALib strings. This conflict is unavoidable due to the C++ language definition. |
const TChar* | Implicit Access, Explicit Construction | Implicit Access, Implicit Construction | Constant character pointers in C++ are "presumably" zero-terminated. This is at least this library's interpretation of the C++ language standard, and this is how most operating systems' API calls expect a string value. Therefore, the type traits allow implicit access. Note that for the determination of the array length, const TChar* arrays have to be zero-terminated! For the same reason, construction from (here: conversion of) non-zero-terminated arrays has to be performed explicitly, while construction from zero-terminated arrays is implicitly possible. |
TChar* | Explicit Access, Explicit Construction | Explicit Access, Explicit Construction | Mutable character pointers in C++ are "presumably" not zero-terminated. Therefore, the type traits demand explicit access, and code that explicitly uses a mutable character pointer with methods that use character array traits (like the corresponding explicit constructor of class String does) needs to ensure that the array is zero-terminated. Mutable character pointers should not be used in the context of character array processing. Therefore, conversions to this type are possible only in an explicit fashion. |
Unlike the default definitions for types of the standard C++ library and other 3rd-party types (as documented in the next chapters), the built-in definitions for these three types are not selected by the inclusion of an optional header file. Instead, these definitions are fixed and not customizable.
The definitions described in this section are selected by the inclusion of header file alib/compatibility/chararray_std.hpp.
The C++ standard library provides templated class std::basic_string<TChar>, which implements a heavyweight string (that is, a string type that allocates heap memory for the string data). The corresponding lightweight class, std::basic_string_view<TChar>, is only available since C++17. In addition, character arrays are implemented with class std::vector<TChar>.
The usual approach of this module is to allow implicit creation of lightweight string classes from character arrays, while heavyweight string types need to be created explicitly. However, this approach is not possible here: As soon as C++17 is active and class std::string_view thus becomes available, the standard library performs some TMP code that allows implicit creation of type std::string with internal conversions using the lightweight type! Hence, the implicit creation cannot be avoided if, in parallel, we want to allow type std::string_view to be implicitly created. Because of this, the definition is made as follows:
- With C++17, the construction of class std::string is defined explicit, because otherwise an ambiguity would occur. However, implicit creation is effectively allowed due to the TMP programming of the standard library.
- With prior C++ versions (when std::string_view is not in place), the creation of class std::string from character arrays is defined to be implicit. This is done so that compilations of the library behave compatibly, independent of the availability of type std::string_view.
- The construction of class std::vector<TChar> is defined to be explicit.
With respect to character access of non-zero-terminated arrays, all three classes allow implicit access. Access to a zero-terminated character array is still implicit with type std::string, as this class terminates its buffer anyhow when it is accessed with method std::string::data(). For classes std::string_view and std::vector, such access has to be made explicitly, as these types usually do not represent zero-terminated strings.
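The difference in zero-termination between std::string and std::vector can be illustrated with standard library calls only. This is a minimal sketch; the helper names IsZeroTerminated and MakeZeroTerminated are hypothetical:

```cpp
#include <string>
#include <vector>

// Since C++11, the buffer returned by std::string::data() is followed by a
// terminating zero, so reading the element at index size() is well-defined.
inline bool IsZeroTerminated(const std::string& s) {
    return s.data()[s.size()] == '\0';
}

// A std::vector<char> gives no such guarantee. If a zero-terminated array is
// needed, the terminator has to be added explicitly.
inline std::vector<char> MakeZeroTerminated(const std::vector<char>& src) {
    std::vector<char> result(src);
    result.push_back('\0');
    return result;
}
```

This mirrors why implicit access to a zero-terminated array is reasonable for std::string, while it needs an explicit step for the other types.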
The definitions for the three types are summarized in the following table:
Type | Character Arrays | Zero-Terminated Character Arrays |
---|---|---|
std::string_view | Implicit Access, Implicit Construction | Explicit Access, Implicit Construction |
std::string | Implicit Access, Explicit Construction (effectively becomes implicit due to the TMP implementation of the type itself) | Implicit Access, Explicit Construction (effectively becomes implicit due to the TMP implementation of the type itself) |
std::vector | Implicit Access, Explicit Construction | Explicit Access, Explicit Construction |
The definitions described in this section are selected by the inclusion of header file alib/compatibility/chararray_qt.hpp.
The following character array type traits are defined for string and character array types of the QT Class Library:
Type | Character Arrays | Zero-Terminated Character Arrays |
---|---|---|
QStringView | Implicit Access, Implicit Construction | Explicit Access, Implicit Construction |
QString | Implicit Access, Explicit Construction | Implicit Access, Explicit Construction |
QLatin1String | Implicit Access, Implicit Construction | Explicit Access, Implicit Construction |
QByteArray | Implicit Access, Explicit Construction | Explicit Access, Explicit Construction |
QVector<uint> | Implicit Access, Explicit Construction | Explicit Access, Explicit Construction |
If module ALib Strings is included in the ALib Distribution, in addition to these type traits, the inclusion of header file alib/compatibility/chararray_qt.hpp provides a specialization of T_Append for QT type QChar.
In addition to the type definitions, type traits structs and helper-struct TT_CharArrayType discussed in detail in the previous manual chapters, the types listed below are available with this module.
Please consult the reference documentation of these types for more information:
On the level of namespace alib::characters, a bunch of functions are implemented that provide common algorithms working on arrays of arbitrary character types. All functions have templated type TChar, which in most cases is deduced by the compiler and thus does not need to be given.
The functions are similar to what is found with traits struct std::char_traits. While some functions are just inline wrappers around specializations of this struct, versions that do not exist in the standard were added.
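What such a wrapper around std::char_traits may look like is sketched below. The function names Length and IndexOf and namespace sketch are assumptions made for this example and are not taken from the reference documentation of this module:

```cpp
#include <cstddef>
#include <string>  // std::char_traits

namespace sketch {

// Length of a zero-terminated array of any character type.
// TChar is deduced from the argument.
template<typename TChar>
std::size_t Length(const TChar* zeroTerminated) {
    return std::char_traits<TChar>::length(zeroTerminated);
}

// Index of the first occurrence of needle within the first 'length'
// characters of haystack, or -1 if it is not found.
template<typename TChar>
std::ptrdiff_t IndexOf(const TChar* haystack, std::size_t length, TChar needle) {
    const TChar* found = std::char_traits<TChar>::find(haystack, length, needle);
    return found != nullptr ? found - haystack : -1;
}

} // namespace sketch
```

Because the character type is deduced, the same call syntax works for narrow and wide arrays alike, for example Length("abc") and Length(L"abc").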
Finally this small but very important ALib Module introduces class AlignedCharArray. Please refer to its reference documentation for further information about it.