C++ developers need little convincing to use a 3rd-party string library, because the language itself does not offer powerful built-in types that allow convenient character string processing.
The situation is even a little worse, because in C++:
- std::string always allocates memory and copies assigned data.
- The "lightweight" string class std::string_view was only introduced with C++ 17, too late for today's libraries.
Because of this, every general purpose C++ library tends to invent its own character string type and while ALib is no exception, this constitutes a problem in itself. It is a true dilemma: C++ developers need to rely on some external string library, but each new string library increases the problem of adding complexity to this very basic and fundamental domain.
Yes, a C++ developer lives in a string hell! And therefore, a main design goal of module ALib Strings is to mitigate the problems.
The design goals of module ALib Strings are:
1. Mitigate the "C++ string problem":
2. Mitigate the "C++ character width problem":
3. Abandon the use of zero-terminated strings:
4. Use of unicode and UTF-encoding
5. Low and high level string features
The primary goals listed in the previous section are best reached with the use of "template meta programming". Within this C++ programming paradigm, it is possible to define and use information about C++ types, which generally is called "type traits". With such traits, templated code can be selectively compiled depending on the template types involved.
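As an illustration of the paradigm, the following minimal sketch defines a type trait in plain C++17 and uses it to compile code selectively. The names IsCharArray, Describe and TraitsExample are hypothetical and not part of ALib; the actual traits are introduced with module ALib Characters.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical trait: by default, a type is not known to hold character data.
template<typename T>
struct IsCharArray : std::false_type {};

// Specializations attach information to existing types without changing them.
template<> struct IsCharArray<std::string> : std::true_type {};
template<> struct IsCharArray<const char*> : std::true_type {};

// Templated code that is compiled selectively, depending on the trait.
template<typename T>
std::string Describe(const T&)
{
    if constexpr (IsCharArray<T>::value)
        return "character array";
    else
        return "something else";
}

// Usage: the branch is chosen at compile time for each argument type.
void TraitsExample()
{
    const char* text = "abc";
    assert(Describe(std::string("abc")) == "character array");
    assert(Describe(text)               == "character array");
    assert(Describe(42)                 == "something else");
}
```

Because the decision is made at compile time, the branch that is not taken produces no code for the given type.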
In earlier versions of this library, type traits that defined the use of built-in and 3rd-party string classes had been introduced along with ALib string types. However, it turned out that there is very good reason to "generalize" and extract the type traits into a separate module which is completely independent of string processing.
Instead of looking at character strings, the traits are rather about "character arrays". The difference lies in the angle of perspective: With ALib, character strings are a higher-level concept than character arrays. Strings may be constructed from character arrays, may export their data as character arrays and interpret or manipulate the array data. Hence, the arrays are seen as the foundational data structure that is used by strings.
With this conceptual distinction, it became possible to separate the definition of type traits to separate module ALib Characters. While this module ALib Strings builds on ALib Characters, there is no dependency in the other direction: module ALib Characters does not "know" about module ALib Strings.
For a thorough understanding of all aspects, reading the Programmer's Manual Of ALib Module Characters before the manual you are currently reading is of course helpful. But for a normal, straightforward use of the string classes, this is not needed. Therefore, the advice for the reader is to continue reading this manual about strings, and only start investigating module ALib Characters when noted in later chapters.
This brief summary of what module ALib Characters offers may suffice for the time being:
- Because wchar_t is compiler dependent, type wchar is introduced. This may be equivalent to wchar_t but may also be one of char16_t or char32_t. With type wchar, the responsibility for deciding what a wide character is gets removed from the compiler and given to ALib (and its platform defaults and compilation options).
- [...] wchar_t is aliased by type xchar.
- An alias for type char is given with nchar.
- [...] the logical type strangeChar is always equivalent to explicit type xchar.
Character Array Type Traits:
The traits, if given for a custom type T, answer the following questions:
In addition, similar traits are available that answer the very same questions in respect to zero-terminated character arrays.
The use of UTF encoding was named a "design goal" in section 1.1 Library Design Goals. In fact, this is much more: It is a mandatory constraint that the software process that invokes code of this ALib Module uses UTF in general. The module has to rely on that fact, because unfortunately most of today's operating systems and system class libraries (that ALib builds on) use a global (process wide) approach for setting configuration parameters of provided character conversion functions like wcsnrtombs or mbsnrtowcs. On GNU/Linux, such settings are made with function setlocale.
This should not be seen as a huge restriction, because there are no good reasons for any modern software to use any other character encoding than UTF. However, environment variables (or ALib variable LOCALE in a configuration source) have to be set to a UTF-8 encoding.
This module provides five different string classes: String, CString, AString, Substring and LocalString.
A string object's underlying character type is defined using a template parameter named TChar. The different string classes are located in namespace alib::strings and their type names include a prefix letter 'T'. As a result, the list of base classes is: TString<TChar>, TCString<TChar>, TAString<TChar>, TSubstring<TChar> and TLocalString<TChar,N>.
As it is described in the documentation of outer namespace alib, it is common practice for any ALib Module to define "alias types" of all important classes in that namespace. For each of the string classes, four alias types are defined which are using character types character, nchar, wchar and xchar. With the latter three explicit character types, the alias names replace the prefix letter 'T' by letters 'N', 'W' and 'X'.
As a result, the following table lists all alias names in namespace alib:
String Type/Character Type | character | nchar | wchar | xchar |
---|---|---|---|---|
TString<TChar> | String | NString | WString | XString |
TCString<TChar> | CString | NCString | WCString | XCString |
TAString<TChar> | AString | NAString | WAString | XAString |
TSubstring<TChar> | Substring | NSubstring | WSubstring | XSubstring |
TLocalString<TChar,N> | LocalString<N> | NLocalString<N> | WLocalString<N> | XLocalString<N> |
Within this manual, most of the time, the simple names like String, CString or AString are used, even when the corresponding templated class is meant. Likewise, if the names are linked, then the link target resolves the template type and not the simple alias. For example, this link: AString, links to class TAString.
The following subsections of this chapter introduce the main string types. This is done without going into the details of each type's functionality but rather by explaining the principal differences of the types.
The (only!) advantage of zero-terminated arrays is that all that is needed to determine a string is a pointer to the start of the array. Otherwise, along with that pointer, the length of the string has to be given.
These two values, the pointer to the first character and the length of the string, are the only two field members of class String. It could be said, that the main purpose of this class is to provide a pair of the two values, which comprise a non-zero-terminated string and hence the type should be considered a "lightweight pointer to constant string data".
The terms "lightweight", "pointer" and "constant data" imply that class String is a simple C++ POD type, with all benefits like having defaulted copy and move constructors, no destructor and of course no virtual functions.
For example, if a String instance is created and deleted on the stack like this:
{ String s= "Hello"; }
this is the same effort for the compiler (CPU) as creating a character pointer and a simple integral value:
{ const char* cp = "Hello"; integer length= 5; }
It is important to understand that creating, deleting and copying objects (values) of type String is equivalent to doing the same with objects of type std::pair<const char*, integer>. Likewise, string values can simply be overwritten:
String s= "Hello";
s = "World";
s = String( s.Buffer() + s.Length() / 4, s.Length() - s.Length() / 2 );
The last sample shows that reducing the length of the represented string by cutting portions from either the front or the end of the string would be allowed operations: the resulting object still represents valid string data. However, none of the interface functions of the class changes the pointer to the character array or the string's length. Such modifications are only implemented with derived types introduced later. This way, class String does not only represent a buffer of constant character data, but also the pointer to the buffer and the defined length are constant themselves.
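The properties described in this section can be sketched with a few lines of standard C++. Struct LightString below is a hypothetical stand-in, not the implementation of class String: it only assumes the two fields described above and uses std::ptrdiff_t in place of ALib type integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a lightweight string: a pointer plus a length.
struct LightString
{
    const char*    buffer = nullptr;
    std::ptrdiff_t length = 0;

    LightString() = default;
    LightString(const char* b, std::ptrdiff_t l) : buffer(b), length(l) {}
    LightString(const char* cstr)
    : buffer(cstr)
    , length(cstr ? static_cast<std::ptrdiff_t>(std::strlen(cstr)) : 0) {}
};

// Copying is as cheap as copying a pointer and an integer; no destructor runs.
static_assert(std::is_trivially_copyable_v<LightString>);
static_assert(std::is_trivially_destructible_v<LightString>);
static_assert(sizeof(LightString) == sizeof(const char*) + sizeof(std::ptrdiff_t));

void LightStringExample()
{
    LightString s = "Hello";            // points to the literal, copies nothing
    assert(s.length == 5);

    s = "World";                        // overwriting replaces pointer and length
    assert(s.buffer[0] == 'W');

    s = LightString(s.buffer + 1, 3);   // a new value for the middle portion "orl"
    assert(s.length == 3 && std::strncmp(s.buffer, "orl", 3) == 0);
}
```

The static assertions express the "lightweight" claim in a form the compiler can check.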
A first type derived from class String is class CString. The name of the class means "C language string": objects of this class represent zero-terminated character arrays.
In all other aspects, this class is the same as its parent.
With this class derived, it becomes obvious why parent class String must not allow operations that reduce the length of the string: The resulting shortened string would not be zero-terminated. In other words: If class String allowed operations that shortened the represented string, then class CString could not be derived from it.
Class AString, introduced in the next section, provides a similar rationale for why operations that cut portions from the front of a String are likewise not allowed.
A second type derived from class String is class AString. The prefix character "A" here simply stands for "ALib". The class implements a "heavy weight" string type, namely one that does not only "represent a string" but actively allocates memory for the string data and manages that resource internally. If a String object gets "assigned" to an object of this type, then the string data is entirely copied into the character array buffer that class AString manages.
Consequently, this class provides a huge set of interface functions that allow modifying the contents of the array. If content is inserted that exceeds the capacity of the internal buffer, a larger buffer is allocated, which then stores the concatenated string data.
Cutting data from the end of the string is performed in constant time ("O(1)") as only the value of inherited field Length needs to be decreased. Cutting data from the start of the string is "linear" effort ("O(N)"): The remaining portion of the string is copied to the start of the buffer and the string's length is adjusted.
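The different costs can be made visible with a small sketch. Class OwningString and its methods CutEnd and CutStart are hypothetical stand-ins for a string type that owns its buffer and keeps its data at the buffer start; they are not the interface of class AString.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

class OwningString
{
    std::vector<char> buffer_;   // owned memory, the data always starts at index 0
    std::size_t       length_;

  public:
    explicit OwningString(const std::string& src)
    : buffer_(src.begin(), src.end()), length_(src.size()) {}

    // O(1): only the length is decreased, no character is touched.
    void CutEnd(std::size_t n)
    {
        length_ -= (n < length_ ? n : length_);
    }

    // O(N): the remaining characters are moved to the start of the buffer.
    void CutStart(std::size_t n)
    {
        if (n > length_) n = length_;
        if (n == 0)      return;
        std::memmove(buffer_.data(), buffer_.data() + n, length_ - n);
        length_ -= n;
    }

    std::string ToStd() const
    {
        return std::string(buffer_.begin(), buffer_.begin() + length_);
    }
};

void CuttingExample()
{
    OwningString s("Hello World");
    s.CutEnd(6);                    // constant time
    assert(s.ToStd() == "Hello");
    s.CutStart(2);                  // moves the three remaining characters
    assert(s.ToStd() == "llo");
}
```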
Finally, a third type derived from class String is implemented with class Substring.
It has in all respects the same properties as its base class String. In particular, it has the same lightweight nature, it "represents" strings rather than "implementing" them, and the string data represented is constant.
The only difference is that this type allows shortening the represented string and such shortening can be done from both ends in constant time ("O(1)"): Same as with class AString, "removing" data from the end is just about changing inherited field Length. Removing data from the front also decreases the length and in parallel increases the pointer to the start of the array.
An important use case of this type is to "parse" data from a string. Here, an object of this type is created from an "input string" of just any other type and then the string is shortened in a loop. Inside the loop, alternating operations of either parsing numbers, tokens or other values or recognizing and removing delimiters and whitespaces are performed, until the string is empty. The majority of interface methods offered is therefore named with the prefix "Consume". This indicates that not only some "parsing" takes place, but also that the corresponding characters are removed from the substring.
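Such a parsing loop might look as follows. The sketch uses std::string_view from the C++ standard library in place of class Substring, and the functions ConsumeWhitespace, ConsumeNumber and ConsumeChar are hypothetical helpers that merely borrow the naming idea; the actual interface is found in the reference documentation of class Substring.

```cpp
#include <cassert>
#include <cctype>
#include <string_view>
#include <vector>

// Removes leading whitespace by advancing the start of the view; no data is copied.
void ConsumeWhitespace(std::string_view& sv)
{
    while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.front())))
        sv.remove_prefix(1);
}

// Parses an unsigned decimal number from the front and removes its digits.
// Returns false (and leaves the view untouched) if no digit is found.
bool ConsumeNumber(std::string_view& sv, int& result)
{
    if (sv.empty() || !std::isdigit(static_cast<unsigned char>(sv.front())))
        return false;
    result = 0;
    while (!sv.empty() && std::isdigit(static_cast<unsigned char>(sv.front())))
    {
        result = result * 10 + (sv.front() - '0');
        sv.remove_prefix(1);
    }
    return true;
}

// Removes the given character from the front, if present.
bool ConsumeChar(std::string_view& sv, char c)
{
    if (sv.empty() || sv.front() != c)
        return false;
    sv.remove_prefix(1);
    return true;
}

// Parses a list like " 1, 22 ,333 " by shortening the view step by step.
std::vector<int> ParseNumberList(std::string_view input)
{
    std::vector<int> numbers;
    for (;;)
    {
        ConsumeWhitespace(input);
        int value = 0;
        if (!ConsumeNumber(input, value)) break;
        numbers.push_back(value);
        ConsumeWhitespace(input);
        if (!ConsumeChar(input, ',')) break;
    }
    return numbers;
}

void ParsingExample()
{
    std::vector<int> numbers = ParseNumberList(" 1, 22 ,333 ");
    assert(numbers.size() == 3);
    assert(numbers[0] == 1 && numbers[1] == 22 && numbers[2] == 333);
}
```

The input data is never modified or copied; only the view passed by value to ParseNumberList shrinks.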
In the previous three sections, base class String and three derived types CString, AString and Substring have been introduced.
With that introduction, it was explained why the base class is limited in respect to changing the string: Simply spoken, derived type CString disallows cutting substrings from the back, because this would result in a non-zero-terminated string, and derived type AString disallows cutting substrings from the front, because its simple implementation of the memory management forces the class to copy the remaining string data to the front of the allocated buffer.
All these explanations have been given to make the design rationale of the family of ALib string classes completely transparent and understood. In other, higher level programming languages this all would be an unnecessary complication of things. Even in C++, a design that is simpler to understand and to use would be possible, for example by using abstract classes with virtual functions.
The design given here aims to leverage the speed and efficiency of the C++ language. Once the differences of the string classes are understood, choosing the right type becomes a very clear, unambiguous and straightforward task in any programming situation.
To recap:
- [...] std::string_view, which was introduced by the C++ standard library with version 17.
- [...] std::string, of the standard C++ library.
This module makes use of the "character array traits" defined with dependency module ALib Characters. For the use of the string classes, a developer does not need to know all details of these traits and it is sufficient to understand what is said in the introductory chapter 1.2 Module ALib Characters of this manual.
The following table lists the constructors of class String. All constructors are inline and mostly are compiled in the shortest code possible, which only copies the right values to fields buffer and length.
No | Parameter(s) | Description |
---|---|---|
1 | None | Default constructor, sets field buffer to nullptr and length to 0 . |
2 | nullptr (C++ keyword) | Sets field buffer to nullptr and length to 0 . |
3 | const TChar* , integer | Sets fields buffer and length to the given values. |
4 | const T T_CharArray<T>::Access == AccessType::Implicit | Sets field buffer to the result of T_CharArray<T>::Buffer(src) and field length to the result of T_CharArray<T>::Length(src) |
5 | const T T_CharArray<T>::Access == AccessType::ExplicitOnly | Same as 4), but defined using keyword explicit . |
6 | T T_CharArray<T>::Access == AccessType::MutableOnly | Same as 4) but using keyword explicit and a mutable parameter. |
Constructors 4, 5 and 6 are selected by the compiler in the case that an object of template type T is given and an according specialization of type trait struct T_CharArray exists. Each of these constructors implements one of the three elements of enumeration AccessType that classify the possible access of the character array data given with type T.
This set of constructors allows very intuitive and convenient construction of ALib strings from 3rd-party string types. Especially the case of implicit construction is interesting: If a method argument is declared as a constant reference type, the C++ compiler will perform one "implicit conversion", if a different type is passed for such argument.
As a sample, we have function foo defined as:
With this, an invocation passing just any string type (that allows implicit access) is possible:
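The mechanism behind such implicit construction can be sketched without ALib as follows. Trait ArrayTraits, struct View and the body of function foo are hypothetical stand-ins that only mirror the idea of constructor 4; the real trait is T_CharArray, whose exact declaration is given in the reference documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical trait: unspecialized types offer no character array access.
template<typename T>
struct ArrayTraits : std::false_type {};

template<> struct ArrayTraits<std::string> : std::true_type
{
    static const char* Buffer(const std::string& s) { return s.data(); }
    static std::size_t Length(const std::string& s) { return s.size(); }
};

template<> struct ArrayTraits<std::string_view> : std::true_type
{
    static const char* Buffer(std::string_view s) { return s.data(); }
    static std::size_t Length(std::string_view s) { return s.size(); }
};

template<> struct ArrayTraits<const char*> : std::true_type
{
    static const char* Buffer(const char* s) { return s; }
    static std::size_t Length(const char* s) { return s ? std::strlen(s) : 0; }
};

// Hypothetical lightweight string with an implicit, trait-guarded constructor.
struct View
{
    const char* buffer;
    std::size_t length;

    // std::decay_t<const T> maps a literal of type char[N] to const char*.
    template<typename T,
             typename TTraits = ArrayTraits<std::decay_t<const T>>,
             typename         = std::enable_if_t<TTraits::value>>
    View(const T& src)
    : buffer(TTraits::Buffer(src)), length(TTraits::Length(src)) {}
};

// Types without a trait specialization are rejected at compile time.
static_assert(!std::is_convertible_v<int, View>);

// A function that accepts "any" string type through one implicit conversion.
std::size_t foo(const View& text) { return text.length; }

void ImplicitConstructionExample()
{
    std::string stdString = "std::string";
    assert(foo(stdString)                == 11);
    assert(foo("literal")                == 7);
    assert(foo(std::string_view("view")) == 4);
}
```

Adding support for a further string type only requires one more specialization of the trait; neither View nor foo has to change.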
The exact same set of constructors that is listed in the table of the previous section for class String is implemented with class CString. The only difference is that constructors 4 to 6 are testing for a specialization of struct T_ZTCharArray instead of T_CharArray.
Therefore, all that was explained in respect to construction of type String from templated types that represent character arrays, is equivalently true for the construction of type CString from types that represent zero-terminated character arrays!
In contrast to String and CString, type AString does not allow implicit construction. Apart from the move-constructor, all constructors are explicit. This design decision was made because of the heavy-weight nature of the class.
Apart from the need to be explicit, construction of the class is even more flexible than the construction of the lightweight string types: Type traits functor T_Append allows creating string representations for objects of custom types. In addition to the character array types that base class String accepts, these types are accepted by a templated constructor of the class as well. All details about this template struct are given with chapter 5. String Assembly.
Class Substring simply inherits all constructors of its base class String and therefore, all that had been written in previous chapter 3.1.1 String Construction, is true for this class. This includes that the type const alib::Substring& may be used for method arguments to accept any type of string of fitting character size, without explicit conversion.
If class String did not play the role of being the base class for types CString and AString, these features could be implemented with class String itself and class Substring would not even be needed. In this respect, class Substring is not a specialization of String but more a "continuation". With that in mind, it makes a lot of sense that all parental constructors are exposed and usable.
The previous chapter talked about how the different ALib string types are constructed. This chapter now discusses the opposite: the string types implement C++ cast operators that allow constructing values of arbitrary string types from them.
Again, the cast is performed using the type traits defined with dependency module ALib Characters. This time, the value of field Construction of specializations of T_CharArray or T_ZTCharArray, respectively, is tested. Possible values are given with enumeration ConstructionType. With that, casting string types to a specific custom type is either not allowed, implicitly allowed or allowed only if explicitly performed.
Class String implements an implicit cast operator to values of template type T if a specialization of T_CharArray exists that defines field Construction to be ConstructionType::Implicit. Likewise, an explicit operator is available if ConstructionType::ExplicitOnly is given.
Of course, the construction of the casted object is performed by invoking T_CharArray::Construct, passing the string's fields buffer and length.
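A minimal sketch of such trait-controlled cast operators, written in plain C++17, is given below. The enumeration Construction, trait TargetTraits and struct View are assumptions made for illustration and only resemble ConstructionType and T_CharArray in spirit.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <type_traits>
#include <vector>

enum class Construction { None, Implicit, ExplicitOnly };

// Hypothetical trait: by default, a type cannot be constructed from a string.
template<typename T>
struct TargetTraits
{
    static constexpr Construction Mode = Construction::None;
};

// Lightweight target: creating it is cheap, hence implicit casts are allowed.
template<> struct TargetTraits<std::string_view>
{
    static constexpr Construction Mode = Construction::Implicit;
    static std::string_view Construct(const char* b, std::size_t l)
    { return std::string_view(b, l); }
};

// Heavy target: the data is copied, hence the cast has to be explicit.
template<> struct TargetTraits<std::vector<char>>
{
    static constexpr Construction Mode = Construction::ExplicitOnly;
    static std::vector<char> Construct(const char* b, std::size_t l)
    { return std::vector<char>(b, b + l); }
};

struct View
{
    const char* buffer;
    std::size_t length;

    template<typename T,
             std::enable_if_t<TargetTraits<T>::Mode == Construction::Implicit, int> = 0>
    operator T() const { return TargetTraits<T>::Construct(buffer, length); }

    template<typename T,
             std::enable_if_t<TargetTraits<T>::Mode == Construction::ExplicitOnly, int> = 0>
    explicit operator T() const { return TargetTraits<T>::Construct(buffer, length); }
};

static_assert( std::is_convertible_v<View, std::string_view>);
static_assert(!std::is_convertible_v<View, std::vector<char>>);
static_assert(!std::is_constructible_v<int, View>);

void CastExample()
{
    View v{"Hello", 5};
    std::string_view  sv    = v;                                  // implicit cast
    std::vector<char> chars = static_cast<std::vector<char>>(v);  // explicit cast
    assert(sv == "Hello");
    assert(chars.size() == 5 && chars[4] == 'o');
}
```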
With the same rationale as given in 3.1.4 Substring Construction, class Substring behaves 100% the same as parent class String in respect to casting options.
Class CString implements the very same cast operators as class String, with the only difference that TMP struct T_ZTCharArray is used instead of TMP struct T_CharArray.
With the specialization of T_ZTCharArray for type const char* that defines implicit casts, objects of type CString can be passed to "old school" interface methods that expect a zero-terminated character array as an argument, without an explicit cast.
Class AString implements each of the cast methods that are provided with class String and CString. This is due to the fact that the class always reserves space in the allocated buffer for a terminating character. This way, the preparation for casting to an arbitrary zero-terminated array type is performed in constant time, as no string data has to be moved to a newly allocated buffer.
The four cast methods make this class the most flexible of the ALib string types, in respect to implicitly or explicitly creating external character array types.
Casts, especially implicit ones, in some situations may impose ambiguities, which lead to compilation failures. To mitigate such, the implementations of the implicit casts of all three classes String, CString and AString are conditionally selected by the compiler using TMP struct T_SuppressAutoCast.
ALib specializes this struct to prevent the casting of AString objects to types String and CString, which the type traits T_CharArray and T_ZTCharArray would, of course, indicate to be allowed. This is ambiguous in respect to the implicit construction that is also allowed.
Custom specializations should only be needed in similar situations, where a custom string type allows auto-casts based on the type traits provided by ALib.
This module ALib Strings is not "responsible" to define the built-in conversion rules for C++ and 3rd-party types, because in fact these rules are defined already with the specializations of the TMP structs T_CharArray and T_ZTCharArray given in dependency module ALib Characters.
While these specializations are described in the corresponding Programmer's Manual section 4. Built-In Character Array Traits of that module, only a summary of the rules from the perspective of ALib string classes is given here.
- Fixed-length Character Arrays:
- const TChar*: [...] AString.
- TChar*: In general this library considers mutable character pointers a "dubious" type and unlike their constant counterparts, arrays pointed to by this type are not considered zero-terminated. Therefore all conversion functions are explicit.
- std::string_view:
- std::string:
- std::vector<TChar>:
- QStringView:
- QString:
- QLatin1String:
- QByteArray:
- QVector<uint>:
In the previous sections a quite remarkable and unique feature of this module, namely the possibility of (implicit) conversions of arbitrary C++ string types to and from ALib string types, has been described. These features contribute fundamentally to a major design goal of this module, by relieving a programmer from the burden to convert string types when mixing libraries that expect different strings.
With the previous descriptions it has been mentioned that the documentation of dependency module ALib Characters is not required to be read if ALib string types are to be just used.
To adopt custom string types to become "compatible" with ALib strings, all that has to be done is to specialize type-traits struct T_CharArray and, in the case that a type represents zero-terminated strings, also struct T_ZTCharArray. While this is done with only a few lines of code, it is still advised to read the Programmer's Manual of module ALib Characters first, if not from the beginning then at least chapter 4. Character Arrays. Together with the information provided in the previous sections of this manual, the complete picture should be given and the adoption of own types should be a straightforward task.
In addition header files
can be used as a good template to use for the adoption of own string types.
Implicit string construction as discussed in the previous chapter allows creating method interfaces that accept "arbitrary" custom string types. It was explained that type traits T_CharArray and T_ZTCharArray might be specialized for custom types and with that string classes String and CString might be created implicitly from objects of those.
With these two types given, it is not possible to create an API interface that clearly separates between custom types that are zero-terminated and those that are not. This problem is best explained with a sample.
Imagine a namespace function called IsDirectory that should accept a constant directory path string and should return true if the argument represents an existing directory in the filesystem and false if not. The function declaration would be like this:
bool IsDirectory(const String& path);
Now, many actual implementations of the function (for example on the GNU/Linux operating system), would need to pass a zero-terminated string to a corresponding operating system call. To create that, the accepted string argument needs to be copied to a buffer that can be terminated. This effort is redundant if a user invoked the function like this:
auto result= IsDirectory( "/usr/bin" );
because the string literal given is already zero-terminated. To avoid this, an overloaded function definition could fetch zero-terminated strings and pass those without the copy and termination overhead:
bool IsDirectory(const CString& path);
But with these two methods in place, the compiler complains about an ambiguity as soon as zero-terminated string types are passed. The reason for this is simply that the normal string type String can be implicitly constructed from zero-terminated string types as well.
As a way out of the ambiguity described in the previous section, class StringNZT is given with the library. The "NZT" suffix stands for "non-zero-terminated". The type extends class String and all it does is to deny implicit construction by objects of types that would likewise construct type CString.
With that, the two overloaded namespace functions:
bool IsDirectory( const StringNZT& path );
bool IsDirectory( const CString& path );
are not ambiguous. The first function's implementation would usually copy and terminate the given non-terminated string, for example by just creating an AString object from the given non-zero-terminated string. Then it would invoke the second method passing the AString, which becomes zero-terminated on the fly when converted to CString.
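The way such a pair of types removes the ambiguity can be sketched in standard C++17. Everything below is hypothetical: traits IsCharArray and IsZeroTerminated, types ZeroTerminated and NonZeroTerminated stand in for the ALib traits and for classes CString and StringNZT, and the body of IsDirectory only compares against a fixed path instead of calling the operating system.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <string_view>
#include <type_traits>

template<typename T> struct IsCharArray      : std::false_type {};
template<typename T> struct IsZeroTerminated : std::false_type {};

template<> struct IsCharArray<std::string_view> : std::true_type {};
template<> struct IsCharArray<std::string>      : std::true_type {};
template<> struct IsCharArray<const char*>      : std::true_type {};
template<> struct IsZeroTerminated<std::string> : std::true_type {};
template<> struct IsZeroTerminated<const char*> : std::true_type {};

// Maps an argument type to the type the traits are specialized for
// (for example, a string literal of type char[N] maps to const char*).
template<typename T> using Source = std::decay_t<const T>;

// Stand-in for a zero-terminated string: accepts zero-terminated sources only.
struct ZeroTerminated
{
    const char* buffer;

    template<typename T,
             std::enable_if_t<IsZeroTerminated<Source<T>>::value, int> = 0>
    ZeroTerminated(const T& src) : buffer(std::string_view(src).data()) {}
};

// Stand-in for the "NZT" type: denies every source that the zero-terminated
// type would accept as well.
struct NonZeroTerminated
{
    std::string_view data;

    template<typename T,
             std::enable_if_t<    IsCharArray<Source<T>>::value
                              && !IsZeroTerminated<Source<T>>::value, int> = 0>
    NonZeroTerminated(const T& src) : data(src) {}
};

int copiesMade = 0;   // counts how often a path had to be copied and terminated

bool IsDirectory(const ZeroTerminated& path)
{
    // Placeholder for an operating system call that expects a terminated string.
    return std::strcmp(path.buffer, "/usr/bin") == 0;
}

bool IsDirectory(const NonZeroTerminated& path)
{
    std::string terminated(path.data);   // copy and terminate
    ++copiesMade;
    return IsDirectory(terminated);      // std::string selects the first overload
}

void OverloadExample()
{
    bool fromLiteral = IsDirectory("/usr/bin");                          // no copy
    assert(fromLiteral && copiesMade == 0);

    bool fromView = IsDirectory(std::string_view("/usr/bin/ls", 8));    // one copy
    assert(fromView && copiesMade == 1);
}
```

Each argument type is accepted by exactly one of the two overloads, so the compiler never has to choose between them.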
The following bullets summarize and refine what was sampled in this chapter:
Finally it should be mentioned that the use of zero-terminated strings is not recommended. ALib itself does that only in very specific situations. An example is class Path. The class interfaces with the operating system that expects zero-terminated strings, like it was sampled in the previous section.
Often, software needs to assemble strings, be it for human-readable text, data serialization or the implementation of communication protocols. For that, a string type is needed that manages a data buffer and provides interface methods that allow the concatenation of data to existing strings. Furthermore, typical methods like searching and replacing substrings, letter case conversion, etc. have to be offered.
As already introduced, for this purpose class AString is provided with this module. Therefore, this chapter dedicated to the topic of string assembly is mostly a chapter about class AString.
In the previous chapters of this manual it was explained how the lightweight ALib string types String, CString and Substring are constructable using values of C++ types which are equipped with "character array traits". Those traits are nothing else but meta-information about these types which is provided by corresponding specializations of templated structs T_CharArray and T_ZTCharArray. The character array type traits are introduced with module ALib Characters.
Some high level object-oriented programming languages offer a root class which provides a common interface for just any derived type and such interface may contain a method that creates a string representation from an instance. For example, the Java language defines class Object which provides method toString() for such purpose.
The two concepts (ALib character array traits and the Object.toString() method of Java) are fundamentally different: Character array traits are meant to be given for types whose main purpose is to represent or implement character arrays, while the toString() method may be implemented for just any type.
Class AString, which is designed to support the assembly of strings, offers a feature that corresponds much more closely to the toString() concept. Again, type traits are used, this time not for accessing (existing) character array data, but for appending a string representation of any object to an AString.
Type traits "functor" T_Append<TAppendable,TChar,TAllocator> by default is empty. To allow the creation of a string representation of objects of a custom type TAppendable, a specialization of the struct has to be defined that implements method T_Append::operator()(TAString<TChar>&.
Besides specifying the type that is adopted with template type TAppendable, the character type TChar of the destination AString object may be given with a specialization. If omitted, it defaults to type character.
As the name of functor T_Append suggests, the implementation of the operator usually appends a string representation of the object given with parameter src to the AString given with parameter target. Nevertheless, an implementation is free to modify the given AString in any way. For example, built-in type Format::Escape searches and replaces "escape-characters" when "appended" to an AString!
Once type-traits functor T_Append<TAppendable,TChar,TAllocator> is specialized for a type TAppendable, objects of that type may be appended to objects of TAString<TChar,TAllocator>. This can be done using the following methods:
- Method Append.
- Operator '<<'.
- Method '_' (provided for compatibility with the Java and C# versions of ALib).
Methods Append and '_', as well as operator '<<', each return a reference to the AString that they were invoked on. This allows concatenated calls, like in:
AString aString;
aString << "The result is: " << 42;
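How a functor-based append mechanism supports such chained calls can be sketched as follows. Struct Assembly, functor Appender and type Point are hypothetical; the sketch wraps std::string and does not reproduce the declaration of T_Append, which takes further template parameters.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

struct Assembly;

// Hypothetical functor: empty by default, specialized per appendable type.
template<typename T>
struct Appender {};

// Hypothetical assembly string; a thin wrapper around std::string.
struct Assembly
{
    std::string text;

    template<typename T>
    Assembly& Append(const T& src)
    {
        // std::decay_t<const T> maps a literal of type char[N] to const char*.
        Appender<std::decay_t<const T>>()(*this, src);
        return *this;                       // enables concatenated calls
    }

    template<typename T>
    Assembly& operator<<(const T& src) { return Append(src); }
};

template<> struct Appender<const char*>
{
    void operator()(Assembly& target, const char* src) const { target.text += src; }
};

template<> struct Appender<int>
{
    void operator()(Assembly& target, int src) const { target.text += std::to_string(src); }
};

// A custom type becomes appendable by specializing the functor for it.
struct Point { int x; int y; };

template<> struct Appender<Point>
{
    void operator()(Assembly& target, const Point& src) const
    {
        target << "(" << src.x << ", " << src.y << ")";
    }
};

void AppendExample()
{
    Assembly assembly;
    assembly << "The result is: " << 42;
    assert(assembly.text == "The result is: 42");

    assembly << " at " << Point{1, 2};
    assert(assembly.text == "The result is: 42 at (1, 2)");
}
```

Appending an object of a type without a specialization fails to compile, because the empty default functor offers no call operator.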
The specializations of functor T_Append that come with the ALib library can be grouped into four areas:
1. Fundamental C++ Types:
Specializations for all fundamental C++ types like int, double, etc. are provided. No special header file has to be included for this. The specialization is available with the inclusion of header file alib/strings/astring.hpp.
2. Class Format And Its Inner Types:
Class Format is provided which allows formatting numbers. In addition, the class has a list of inner types that implement some specific simple format operations. These inner types are: Tab, Field, Escape, Bin, Hex and Oct.
Class Format as well as its inner types are "lightweight" and are supposed to be created locally with the invocation of the append-methods. As a quick example, the use of Format::Field should be showcased:
The code above produces the following output:
* Hello *
Class Format is included implicitly with the inclusion of header file alib/strings/astring.hpp.
3. Other ALib Types:
For various types found in other ALib Modules, specializations of T_Append are provided.
All elements of important enum types are appendable, as soon as
#include "alib/enums/serialization.hpp"
is stated in the compilation unit. For more information, see section 4.3.1 Serialization/Deserialization of the Programmer's Manual of module ALib Enums.
4. 3rd-Party Types:
In source folder alib/compatibility some special header files are provided that contain specializations of T_Append for types of the C++ standard library (namespace std) as well as for types of 3rd-party libraries.
The following code snippet demonstrates how to implement the specialization of functor T_Append for internal ALib class DateTime to print out a formatted date:
With this definition included, a code unit might now append DateTime objects to strings:
The output would be for example:
Execution Time: 2024-12-15 10:41
The following macros are provided to simplify the specialization of T_Append and make the code more readable:
Class AString hides all parent constructors and offers re-implementations that rather copy the data that is passed. Consequently, as this copying is not considered a lightweight operation, all constructors are explicit. By the same token, the assignment operator syntax cannot be used for initialization either.
The following code will not compile:
Instead, explicit construction has to be chosen, as shown here:
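The effect of explicit constructors can be reproduced with a small stand-in class. HeavyString below is hypothetical and wraps std::string; it is only assumed to behave like class AString in respect to explicit construction.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical heavy-weight string: construction copies data, hence it is explicit.
class HeavyString
{
    std::string data_;

  public:
    explicit HeavyString(std::string_view src) : data_(src) {}

    // Assigning to an already existing object is still possible.
    HeavyString& operator=(std::string_view src)
    {
        data_.assign(src);
        return *this;
    }

    const std::string& Data() const { return data_; }
};

// "HeavyString s = ...;" is copy-initialization and needs an implicit
// conversion, which the explicit constructor does not offer.
static_assert(!std::is_convertible_v<const char*, HeavyString>);
static_assert( std::is_constructible_v<HeavyString, const char*>);

void ExplicitConstructionExample()
{
    HeavyString s("Hello");     // explicit construction compiles
    assert(s.Data() == "Hello");

    s = "World";                // assignment to an existing object
    assert(s.Data() == "World");
}
```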
As already noted in chapter 5.1 Appending Custom Types, with templated constructor AString(const TAppendable&), class AString accepts any type of object that a specialization of functor T_Append exists for. This makes construction very flexible.
Copy constructor, move constructor and move assignment are well defined, which allows AString objects to be used (as efficiently as possible) as value types in containers of the standard library, for example as in std::vector<AString>.
As mentioned before, class AString provides logic to manage its own buffer. During the assembly of strings, the buffer "automatically" grows as needed. If a certain minimum size can be foreseen as a result of a string assembly, the necessary buffer size might be reserved before performing the assembly operations, by invoking method SetBuffer(integer). This avoids the automatic growth process, which may take place in several steps, where each step may involve copying the current buffer to a new memory location.
Once grown, the allocated buffer size is never reduced, unless method SetBuffer(integer) is explicitly invoked providing a smaller size than currently allocated.
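The benefit of reserving buffer space up front can be observed with std::string from the C++ standard library, whose method reserve plays a role comparable to SetBuffer(integer). The following sketch is an analogy only; function CountReallocations is hypothetical and says nothing about the growth strategy of class AString.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts how often the buffer of a std::string is re-allocated while
// 'count' characters are appended one by one.
std::size_t CountReallocations(std::size_t count, bool reserveFirst)
{
    std::string buffer;
    if (reserveFirst)
        buffer.reserve(count);

    std::size_t reallocations = 0;
    std::size_t capacity      = buffer.capacity();
    for (std::size_t i = 0; i < count; ++i)
    {
        buffer += 'x';
        if (buffer.capacity() != capacity)   // the buffer was replaced
        {
            capacity = buffer.capacity();
            ++reallocations;
        }
    }
    return reallocations;
}

void ReserveExample()
{
    assert(CountReallocations(10000, true)  == 0);  // one allocation up front
    assert(CountReallocations(10000, false) >  0);  // growth in several steps
}
```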
Besides this internal, automatic memory allocation, the class can also work on external buffers. For this, overloaded method TAString::SetBuffer allows providing such external memory. The life-cycle of an external buffer is not bound to the life-cycle of the AString object itself. At the moment that the size of an external buffer is not sufficient to allow a requested extension of the managed string, the class replaces the external buffer by a larger, self-managed one.
For details on using external buffers, see the reference documentation of overloaded method TAString::SetBuffer. Class LocalString, which is discussed in the next section, makes use of this feature and provides the possibility to have local (stack based) allocations of strings.
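The benefit of reserving buffer space before a string assembly can be observed with any growable string type. The sketch below uses std::string and its reserve method as an analogy only; it does not use or reproduce ALib's SetBuffer. The helper name countCapacityChanges is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Counts how often the capacity of a std::string changes while appending
// `count` characters, optionally reserving the final size up front.
inline int countCapacityChanges(std::size_t count, bool reserveFirst) {
    std::string s;
    if (reserveFirst)
        s.reserve(count);                     // single allocation before the assembly

    int changes = 0;
    std::size_t lastCapacity = s.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        s.push_back('x');
        if (s.capacity() != lastCapacity) {   // the buffer was replaced by a larger one
            ++changes;
            lastCapacity = s.capacity();
        }
    }
    return changes;
}
```

With the reservation in place, no capacity change occurs during the loop. Without it, the buffer is replaced several times, and the exact number of replacements depends on the growth strategy of the standard library implementation.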
Template class LocalString<TChar, TCapacity>, derived from class AString, uses an internal character array of a length specified by template parameter TCapacity to store the string data. During construction, the memory address of this character array member is passed to method TAString::SetBuffer. The huge benefit of using the class lies in performance: The performance impact of heap allocations is often underestimated by software developers. Therefore, for local string operations with foreseeable maximum string buffer sizes, class LocalString should be considered as a faster alternative to class AString.
Although the internal buffer size is fixed at compile-time and hence cannot be expanded, a user of the class must not fear 'buffer overflows'. If the internal buffer capacity is exceeded, a new buffer from the free memory (aka 'heap') will be allocated.
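The fallback from a local character array to heap memory can be sketched as follows. This is a simplified model under stated assumptions, not the implementation of LocalString: the type LocalText and its methods Append, UsesLocalBuffer, Length and Buffer are hypothetical, and copying is disabled to keep the sketch short.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical stand-in: stores characters in a member array of TCapacity
// characters and switches to a heap buffer once that capacity is exceeded.
template <std::size_t TCapacity>
class LocalText {
    static_assert(TCapacity > 0, "the local capacity must not be zero");

    char                    local_[TCapacity];
    std::unique_ptr<char[]> heap_;                  // stays empty while local_ is in use
    char*                   buffer_   = local_;
    std::size_t             capacity_ = TCapacity;
    std::size_t             length_   = 0;

public:
    LocalText() = default;
    LocalText(const LocalText&)            = delete;   // buffer_ may point into *this
    LocalText& operator=(const LocalText&) = delete;

    void Append(const char* src) {
        const std::size_t n = std::strlen(src);
        if (length_ + n > capacity_) {
            // Grow geometrically and copy the current content to heap memory.
            std::size_t newCapacity = capacity_ * 2;
            if (newCapacity < length_ + n)
                newCapacity = length_ + n;
            std::unique_ptr<char[]> grown(new char[newCapacity]);
            std::memcpy(grown.get(), buffer_, length_);
            heap_     = std::move(grown);
            buffer_   = heap_.get();
            capacity_ = newCapacity;
        }
        std::memcpy(buffer_ + length_, src, n);
        length_ += n;
    }

    bool        UsesLocalBuffer() const { return buffer_ == local_; }
    std::size_t Length()          const { return length_; }
    const char* Buffer()          const { return buffer_; }   // not zero-terminated
};
```

As long as the appended data fits into the member array, no heap allocation takes place. Once the capacity is exceeded, the content moves to the heap and the local array remains unused.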
With debug-builds of ALib, parent class AString provides a warning mechanism that allows the easy detection of such (probably unwanted) replacements of the local buffer. There are two scenarios how this mechanism might be used during development:
If the latter case applies, then the warning can be disabled using inherited method DbgDisableBufferReplacementWarning. This inline method is empty in release-compilations and this way optimized out by the compiler.
While class AString (as noted above) does not provide implicit construction, class LocalString re-implements the common constructors of AString and exposes them as implicit. The rationale here is that although the data is copied (which might not be a very lightweight task), still the performance impact is far less compared to constructing an AString that uses a heap-allocated buffer. The design decision behind that takes into account that a LocalString copies an argument to its local buffer without the explicit exposure of this operation.
The following method, as a sample, takes three different ALib string types as parameters:
The following code will not compile:
Class AString has to be explicitly created, the others don't:
In addition, besides having implicit construction, the default assign operator is defined as well with LocalString. This allows using objects of this type as class members that are initialized within the class declaration as shown here:
Such members are not allowed to be initialized in the declaration if their type is AString.
Class LocalString provides no move constructor and thus is very inefficient in scenarios where objects of the class could rather be moved than copied. Consequently, such situations are to be avoided. The use of LocalString should instead be deliberate, and objects of this type should not be subject to copy and move operations.
Within namespace alib, some convenient alias type definitions are available that define local strings of frequently used sizes:
An important aspect of the family of string types provided by this module and library is the concept of "nullable" strings. An object of base class String is nulled when default constructed:
An existing string can be set to nulled state by assigning keyword nullptr or another nulled object of character array type.
Precisely, a string is nulled when the internal pointer to the character array evaluates to nullptr.
The concept of nullable strings differs from the concept of having empty strings. The latter refers to string objects of zero length.
Nulled strings are always also empty (hence have a length of zero). The other way round, empty strings are not necessarily nulled. An empty string that is not nulled does not equal an empty string that is nulled.
Inline methods IsNull, IsNotNull, IsEmpty and IsNotEmpty of base class String test string objects for being nulled or empty.
The following code runs fine (with no assertion):
Especially the last line of this code is important to understand: a nulled string is different from an empty string.
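The distinction between nulled and empty strings can be modeled with a pointer and a length. The following sketch is an illustration under stated assumptions: TextView is a hypothetical stand-in (not class String), and only the method names IsNull, IsNotNull, IsEmpty and IsNotEmpty are taken from the text above.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a lightweight, nullable string view.
struct TextView {
    const char*    buffer = nullptr;    // nullptr marks the nulled state
    std::ptrdiff_t length = 0;

    TextView() = default;
    TextView(std::nullptr_t) {}
    TextView(const char* src)           // src must point to a zero-terminated array
        : buffer(src), length(static_cast<std::ptrdiff_t>(std::strlen(src))) {}

    bool IsNull()     const { return buffer == nullptr; }
    bool IsNotNull()  const { return buffer != nullptr; }
    bool IsEmpty()    const { return length == 0; }
    bool IsNotEmpty() const { return length != 0; }
};

// A nulled view equals only another nulled view; otherwise the contents decide.
inline bool operator==(const TextView& a, const TextView& b) {
    if (a.IsNull() || b.IsNull())
        return a.IsNull() == b.IsNull();
    return a.length == b.length
        && std::memcmp(a.buffer, b.buffer, static_cast<std::size_t>(a.length)) == 0;
}

inline bool operator!=(const TextView& a, const TextView& b) { return !(a == b); }
```

In this model, a default-constructed view is nulled and empty, a view constructed from "" is empty but not nulled, and the two compare as unequal.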
The concept of having nulled strings is equally available with derived string type AString: An object of type AString is nulled when no internal buffer is allocated and likewise no external buffer is set.
If default constructed, constructed with zero size, with keyword nullptr or any other nulled string, no buffer is created. Consequently, it makes a difference if an AString is constructed using AString() or AString("").
This is in contrast to std::string, which always allocates a buffer and thus does not support a nulled state.
The allocated buffer of a non-nulled AString can be disposed by invoking SetBuffer(0) or by invoking SetNull on the instance.
To make this more clear, note the following sample code which does not throw an assertion:
Apart from the assignment of nullptr to set the string to nulled state, class AString does not support any assignment operator other than the C++ copy assignment.
What was said in the previous two sections might not need any further explanation and experienced programmers might skip to the next chapter. However, because many string types of other libraries behave differently, some further notes should be given:
The fact that string objects can be nulled allows "transporting" a piece of information along with the string that can be used in APIs. For example, if a method should receive a string object according to a key-property, a nulled result may indicate that no data existed to the given key. This is in contrast to returning an empty string, which indicates that data was found, but that the result just is an empty string. If ALib strings types were not nullable and in this sample empty strings should be allowed as a valid answer, a second return value had to be defined for the API function that indicates if a string existed for a given key-property. Such API design paradigm is used frequently across various ALib Modules.
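The same API idea can be sketched with the C++17 standard library, because a default-constructed std::string_view holds a null data pointer while a view onto an existing std::string never does. The function findValue is a hypothetical example and not part of ALib.

```cpp
#include <map>
#include <string>
#include <string_view>

// Returns a view with data() == nullptr if the key is missing, and a
// non-null, possibly empty view if the key exists.
inline std::string_view findValue(const std::map<std::string, std::string>& table,
                                  const std::string& key) {
    const auto it = table.find(key);
    if (it == table.end())
        return std::string_view{};          // "nulled": no entry for the key
    return std::string_view(it->second);    // may be empty, but data() is not null
}
```

A caller distinguishes "no entry" from "entry with empty value" by testing data() against nullptr, so no second return value is needed. Note that std::string_view compares by content only, which means that in this analogy a nulled view and an empty view compare equal, unlike the ALib types described above.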
On the other hand, when string values are used as input data, some caution has to be taken to ensure that method invocations on a given input string are even allowed. Some methods may produce undefined behavior when invoked on nulled string objects.
To maximize code performance, explicit tests for nulled strings should be avoided if not necessary, which sometimes can be an obligation to the programmer that uses the string types. More on this topic is given in the next section.
Several of the methods found in the different string classes of ALib are templated with a template parameter named TCheck. This template parameter is defaulted with the tag-type CHK, which hides the whole concept in "normal" code. Consider the following snippet:
Two string methods are used in this code sample: TString::IndexOf and Substring::ConsumeChars. Both methods support templated parameter TCheck! The following code provides the parameter in its default value, and hence for the compiler is equivalent to the previous snippet:
The exact impact of the value of template parameter TCheck is documented with each function that supports it. In general, with CHK, the string object that a method is invoked on is checked, for example, for not being nulled. Furthermore the parameters given are checked, for example, to not being nulled, to be in valid ranges, and so on.
In the sampled case of method TString::IndexOf, the documentation tells us that needle must not be empty and that startIdx must be in the range of 0 and the string's length minus the needle's length.
The latter cannot be guaranteed for the sample's method argument line and therefore the check has to be performed. As a side effect, this check implicitly tests for a given nulled string, because in the case that the given string is shorter than the token "<start>", the method returns -1. This way, no user code for checking the input argument is needed in this sample code.
The implementation of method ConsumeChars by default checks if the string is long enough to cut the given number of characters from the front. In other words, it tests whether parameter regionLength is in the range of zero and the length of the string. Obviously, this check is redundant in this sample. The method is invoked only if method IndexOf had found the token "<start>" in the string!
To avoid the redundant check, for this invocation the non-checking version of method ConsumeChars may be used by providing tag-type NC for the template parameter:
The obvious goal of using non-checking method versions lies in avoiding redundant code, hence to reduce code size and improve execution performance. As a majority of string methods are inlined, the C++ compiler often is able to detect and remove redundant checks on its own. In these cases, the use of the non-checking version of a method has no effect in optimized release compilations.
However, there are many occasions where the compiler is lacking information on the state of variables that a programmer might know about and then, non-checking versions might have a huge impact when used in loops and other critical code sections. Also, in the sample above, it is very doubtful that any of today's C++ compilers "knows" what it needs to know to optimize the redundant checks out.
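The mechanism can be sketched with tag types and a compile-time branch (C++17). This is a simplified model under stated assumptions, not ALib's implementation: MiniSubstring and afterStartToken are hypothetical, the tag types CHK and NC are re-declared locally, and the return value of this ConsumeChars (the number of characters cut) is chosen for the sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <type_traits>

struct CHK : std::true_type  {};    // checking variant (the default)
struct NC  : std::false_type {};    // non-checking variant

// Hypothetical stand-in for a substring that can be shortened from the front.
struct MiniSubstring {
    std::string_view text;

    // Returns the index of needle, or -1 if it does not occur.
    std::ptrdiff_t IndexOf(std::string_view needle) const {
        const auto pos = text.find(needle);
        return pos == std::string_view::npos ? -1 : static_cast<std::ptrdiff_t>(pos);
    }

    // Cuts regionLength characters from the front and returns the number cut.
    template <typename TCheck = CHK>
    std::ptrdiff_t ConsumeChars(std::ptrdiff_t regionLength) {
        const auto length = static_cast<std::ptrdiff_t>(text.size());
        if constexpr (TCheck::value) {
            // Checked: clamp the argument to the valid range.
            if (regionLength < 0)      regionLength = 0;
            if (regionLength > length) regionLength = length;
        } else {
            // Unchecked: the caller guarantees the range; debug builds assert it.
            assert(regionLength >= 0 && regionLength <= length);
        }
        text.remove_prefix(static_cast<std::size_t>(regionLength));
        return regionLength;
    }
};

// Returns the text that follows the token "<start>", or an empty view.
inline std::string_view afterStartToken(std::string_view line) {
    MiniSubstring sub{line};
    const std::ptrdiff_t idx = sub.IndexOf("<start>");
    if (idx < 0)
        return {};
    // IndexOf succeeded, so at least idx + 7 characters are available.
    sub.ConsumeChars<NC>(idx + 7);
    return sub.text;
}
```

In the NC branch, the range test exists only as an assertion, which disappears when NDEBUG is defined. The caller therefore carries the responsibility for the range, which is acceptable here because the preceding search has already established it.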
So, what is that "something" that we phrased as "a programmer knows" and a "compiler does not know" above? In computer science, such information is referred to as "invariants". Usually, invariants are used to prove the correctness of algorithms. Invariants are expressions on variables that always evaluate true when program execution hits a specific line of code.
In the sample above, the relevant invariant that allows us to use the non-checking version of method ConsumeChars, could be phrased as:
The length of string "line" is at least as long as "idx" plus the length of token "<start>".
Now, by using the non-checking version and appending "<NC>" to the method invocation, not only do we help the compiler to create shorter and faster code, we also put information about the invariant into the code. And this is a benefit that should not be under-estimated! By just looking at this single code line:
myString.ConsumeChars<NC>( 5 );
a reader understands that string myString is at least 5 characters long. This is valuable information that a reader would otherwise find out only by inspecting the context of the code line, which sometimes may become a quite complex task. From here, one could easily conclude that after this code line, an invariant for variable myString would be:
myString may be empty but is not nulled
To conclude this chapter, it has to be mentioned that in debug-compilations of the library, the non-checking versions of the code still implement checks! Exactly those conditions that are documented to be checked in the regular method versions are checked. If the check fails, debug assertions are raised by the non-checking method versions. This approach and the concept of invariants go along very well: If an invariant is false, the algorithm is considered wrong, and the code asserts.
In release compilations, invoking non-checking method versions with a breach of a corresponding invariant leads to undefined behavior (probably a process crash).
"_NC"
With the inclusion of the header file alib/strings/string.hpp, the following constexpr variables are defined in namespace alib:
Each simply represents a nulled, respectively an empty string. The rationale for the provision of the nulled versions is purely to increase the readability of the source code. The following lines of code are equivalent in all respects:
String myString;
String myString= nullptr;
String myString= NULL_STRING;
With variable EMPTY_STRING and its siblings things are a little more complicated: Here the right C++ string literal has to be chosen. This is achieved with the template type TT_StringConstants and its specializations for character types nchar, wchar, and xchar. If a user of this library writes entities that are templated on the character type, then the use of this helper-struct is advised.
With the inclusion of the header file alib/strings/cstring.hpp, templated helper-struct TT_CStringConstants is defined, which provides static constexpr methods for a few frequently used string constants.
While the methods can be explicitly accessed by providing the templated character type, in addition, for each of the six character types a corresponding variable is given in namespace alib. For example, for member method TT_CStringConstants<TChar>::DefaultWhitespaces, corresponding variables are defined.
Same as with helper-struct TT_StringConstants introduced in the previous chapter, if a user of the library writes entities that are templated on the character type, the use of helper-struct TT_CStringConstants is advised.
In some situations additional debug checking is helpful when working with ALib strings. Among such situations are:
In these and similar situations, it may be helpful to define preprocessor symbol ALIB_DEBUG_STRINGS. This symbol enables internal consistency checks with almost any method invoked on string types. By default this feature is disabled, as it consumes quite a lot of run-time performance. When string debugging is enabled, macro ALIB_STRING_DBG_CHK can be used to check the consistency of ALib string classes.
With string debugging, the string buffer allocated by class AString is extended by 32 characters, 16 characters at the front and 16 characters at the end. A "magic" number is written in this padding memory, and accidental (illegal) write operations across the borders of the allocated space are detected.
Therefore, code that:
has to allocate the buffer passed accordingly. This means, the buffer has to be 32 characters larger than specified and the starting address of the heap allocation has to be 16 characters before what parameter extBuffer points to.
Such external buffer allocation should therefore be conditionally implemented using code selection symbol ALIB_DEBUG_STRINGS.
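The padding technique itself can be sketched independently of ALib. The following is a minimal model under stated assumptions: GuardedBuffer, kPadding and kMagic are hypothetical names, a single magic character stands in for the "magic" number, and the padding widths of 16 characters follow the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kPadding = 16;      // guard characters on each side
constexpr char        kMagic   = 0x5A;    // value written into the guards

// Hypothetical guarded buffer: a changed guard character reveals a write
// operation across the borders of the usable region.
class GuardedBuffer {
    std::vector<char> storage_;
    std::size_t       capacity_;

public:
    explicit GuardedBuffer(std::size_t capacity)
        : storage_(capacity + 2 * kPadding, kMagic), capacity_(capacity) {}

    // Start of the usable region, kPadding characters into the allocation.
    char*       Data()           { return storage_.data() + kPadding; }
    std::size_t Capacity() const { return capacity_; }

    // True while both guard regions still contain only the magic value.
    bool IsIntact() const {
        const auto isMagic = [](char c) { return c == kMagic; };
        return std::all_of(storage_.begin(), storage_.begin() + kPadding, isMagic)
            && std::all_of(storage_.end() - kPadding, storage_.end(), isMagic);
    }
};
```

The pointer returned by Data corresponds to what a user of the buffer sees, while the allocation itself starts 16 characters earlier and is 32 characters larger than the usable capacity. This mirrors the layout that externally provided buffers have to follow when string debugging is enabled.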
Further details of the built-in debug mechanisms are not documented. Please refer to the source code of the ALib string classes, especially by investigating the code locations that use selection symbol ALIB_DEBUG_STRINGS.
The string types introduced with this module are using type integer to store the string's length. This is a signed type - in contrast to what the C++ standard library suggests by using type size_t for the length of type std::string!
There are very good reasons to consider this as a wrong design decision. Negative string lengths are impossible, and thus the signed type imposes an artificial, unnecessary restriction: ALib strings cannot be longer than half of the virtually addressable memory (on standard hardware).
Honestly, the main argument for accepting this restriction is to avoid a lot of clutter code when it comes to subtraction of string length values. ALib compiles with almost all reasonable compiler warnings enabled. If the length type were unsigned, many static casts for converting signed and unsigned integral values would be needed to avoid warnings. This would not only be true in the library code itself, but with all code that uses the strings and that also uses a similarly restrictive warning policy with compilation.
However, besides this confession of a certain level of laziness, there is also a true benefit in this decision: Types derived from class String may use this unused sign bit, to encode a binary piece of information in it. As a sample, ALib class AString leverages this option already: The information if a currently used buffer is of external or internal allocation is determined by storing a positive or negative value in the likewise signed field capacity. This way, no additional boolean value is needed, which of course reduces the memory footprint of the class.
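Both arguments can be illustrated with a few lines of standard C++. The sketch below is not taken from ALib: the function names and the type CapacityField are hypothetical, and the choice that a negative value marks an external buffer is an assumption made for this example only.

```cpp
#include <cstddef>

// With unsigned lengths, "length - needleLength" wraps around to a huge
// value if the needle is longer than the string.
inline std::size_t lastStartUnsigned(std::size_t length, std::size_t needleLength) {
    return length - needleLength;
}

// With signed lengths, the same subtraction simply becomes negative.
inline std::ptrdiff_t lastStartSigned(std::ptrdiff_t length, std::ptrdiff_t needleLength) {
    return length - needleLength;
}

// Hypothetical capacity field that stores one bit of information in its sign.
// Assumption of this sketch: a negative value marks an external buffer.
// A capacity of zero cannot carry the flag in this simple model.
struct CapacityField {
    std::ptrdiff_t value;

    static CapacityField Internal(std::ptrdiff_t capacity) { return {capacity}; }
    static CapacityField External(std::ptrdiff_t capacity) { return {-capacity}; }

    bool           IsExternal() const { return value < 0; }
    std::ptrdiff_t Capacity()   const { return value < 0 ? -value : value; }
};
```

The signed variant allows range tests to be written as plain comparisons against zero, while the unsigned variant needs an additional comparison before the subtraction. The sign-encoded flag saves a separate boolean member.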
As elaborated in the introductory chapter 1.2 Module ALib Characters, in respect to character type definitions and character array traits, this module completely relies on module ALib Characters. While all string classes are templated, the character types that are used by the template instantiations are all defined in this underlying module.
The alias types for each string class (defined in namespace alib) enumerate all possible types by adding a prefix character or word, for example NString, WString or ComplementString. The aliases without any prefix, like String, Substring or AString, use the width of the generic and "agnostic" type character.
Now, when using string literals, the following code is not platform agnostic:
String myString= "Hello World";
While it might compile on some platforms or with the right compiler symbols for ALib in place, in the case that type character is a wide type, a compilation error is generated. Therefore, all non-narrow string literals need to be given by using a corresponding macro. The set of macros is also provided with the underlying module ALib Characters. The "agnostic" macro needed in the sample above is simply A_CHAR:
String myString= A_CHAR( "Hello World" );
As long as only strings of standard width are used, all that is needed to know is that each and every C++ string literal needs to be enclosed in this macro.
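How such a literal macro can work is sketched below with the usual token-pasting technique. All names are hypothetical and chosen to avoid any confusion with the library: SKETCH_WIDE stands in for a width selection symbol, sketch_char for type character, and SKETCH_CHAR for macro A_CHAR. The actual definitions in module ALib Characters may differ.

```cpp
#include <cstddef>

// Assumption of this sketch: SKETCH_WIDE selects the wide configuration.
#if defined(SKETCH_WIDE)
    using sketch_char = wchar_t;
    #define SKETCH_CHAR(literal) L##literal     // "abc" becomes L"abc"
#else
    using sketch_char = char;
    #define SKETCH_CHAR(literal) literal
#endif

// Compiles in both configurations, because the literal follows the alias type.
inline const sketch_char* greeting() { return SKETCH_CHAR("Hello World"); }

// Counts the characters up to the terminating zero, for either width.
inline std::size_t lengthOf(const sketch_char* s) {
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

Code written against the alias type and the macro does not change when the configuration symbol is switched, which is the property that the agnostic string types rely on.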
Further macros that define string literals of specific width are given with
Sometimes a code unit expects a string of a defined width and has to handle strings of logical types, or vice versa. This is the case, for example, if an interface method accepts the standard string type while narrow strings are used internally.
In such situations, the straightforward approach could be to use code selector symbol ALIB_CHARACTERS_WIDE and provide two different code versions.
To avoid this, the following macros are provided:
In principle, the macros define a new identifier which, in the case that a conversion is needed, uses a local string to which the source string is appended, while in the case that the character widths are equal, a simple reference to the given type is created. The latter will be optimized out by a C++ compiler and thus, no performance penalty occurs.
For details, consult the reference documentation of the macros.
This user manual concentrates on the general and fundamental aspects of the string types provided by this module.
There is a whole list of utility types available with this module that are not covered by this manual. Instead, for those types an adequate and complete introduction and description is provided with the reference documentation of each. The types for example implement token parsing, a wildcard and regular-expression matcher. To separate the fundamental string types from the utility classes, a dedicated inner namespace "util" is defined where these classes are grouped.
To investigate the functionality and tools offered in the area of string handling, please consult the class list provided in the reference documentation of inner namespace alib::strings::util.
Almost any standard library of modern programming languages provides functionality that allows formatting a list of variadic arguments along the lines of a format string that follows a certain "placeholder syntax". The most prominent sample is the good old printf function of the standard C library.
ALib offers mechanics to define and process variadic argument lists in a type-safe fashion with its module ALib Boxing. Now, to keep module ALib Strings independent of module ALib Boxing, formatting features as described above have been placed in a separate module, namely ALib BaseCamp. With that, a powerful implementation of formatting tools is provided. These even support different standards of a format string's placeholder syntax, namely printf and Java style as well as Python style.