Better String Library Documentation (1)
The Better String Library is an attempt to provide improved string processing functionality to the C and C++ languages. At the heart of the Better String Library (Bstrlib for short) is the management of "'''bstring'''"s which are a significant improvement over ''''\0'''' terminated char buffers.
The standard C string library has serious safety and performance problems:
- Its use of a ''''\0'''' terminator means knowing a string's length is O(n) when it could be O(1).
- It imposes an interpretation for the character value ''''\0''''.
- '''gets()''' always exposes the application to a buffer overflow.
- '''strtok()''' modifies the string it is parsing and thus may not be usable in programs which are re-entrant or multi-threaded.
- '''fgets()''' has the unusual semantic of ignoring ''''\0''''s that occur before ''''\n''''s are consumed.
- There is no memory management, and actions performed such as '''strcpy''', '''strcat''' and '''sprintf''' are common places for buffer overflows.
- '''strncpy()''' doesn't ''''\0'''' terminate the destination in some cases.
- Passing '''NULL''' to C library string functions causes an undefined NULL pointer access.
- Parameter aliasing (overlapping, or self-referencing parameters) within most C library functions has undefined behavior.
- Many C library string functions take integer parameters with restricted legal ranges. Parameters passed outside these ranges are undetected and cause undefined behavior.
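One of the problems listed above, '''strncpy()''' leaving the destination unterminated, is easy to observe directly. The following is a minimal sketch; the helper name `strncpy_terminated` is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Copies at most n bytes of src into dst with strncpy, then reports
 * whether a '\0' terminator ended up inside dst[0..n-1].
 * Returns 1 if a terminator is present, 0 otherwise. */
int strncpy_terminated(char *dst, size_t n, const char *src)
{
    assert(dst != NULL && src != NULL);
    strncpy(dst, src, n);
    return memchr(dst, '\0', n) != NULL;
}
```

When the source is at least as long as the destination, the helper returns 0: every byte of the destination holds payload, and any later '''strlen()''' or '''strcat()''' on it would read past the end of the buffer.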
Beyond addressing these problems, Bstrlib also aims to:
- Incorporate string functionality from other languages.
  - '''MID$()''' from BASIC
  - '''split()'''/'''join()''' from Python
  - '''string/char x n''' from Perl
- Implement analogs to functions that combine stream IO and char buffers without creating a dependency on stream IO functionality.
- Implement the basic text editor-style functions insert, delete, find, and replace.
- Implement reference based sub-string access (as a generalization of pointer arithmetic.)
- Implement run-time write protection for strings.
A '''bstring''' is basically a header which wraps a pointer to a char buffer. Let's start with the definition of '''struct tagbstring''':
struct tagbstring {
    int mlen;
    int slen;
    unsigned char * data;
};
This definition is considered exposed, not opaque (though it is neither necessary nor recommended that low-level maintenance of '''bstring'''s be performed whenever the abstract interfaces are sufficient). The '''mlen''' field (usually) describes a lower bound for the memory allocated for the data field. The '''slen''' field describes the exact length for the '''bstring'''. The data field is a single contiguous buffer of unsigned chars. Note that the existence of a ''''\0'''' character in the unsigned char buffer pointed to by the data field does not necessarily denote the end of the '''bstring'''.
To be a well formed modifiable '''bstring''' the '''mlen''' field must be at least the length of the '''slen''' field, and '''slen''' must be non-negative. Furthermore, the data field must point to a valid buffer in which access to the first '''mlen''' characters has been acquired. So the minimal check for correctness is:
'''(slen >= 0 && mlen >= slen && data != NULL)'''
'''bstring'''s returned by '''bstring''' functions can be assumed to be either '''NULL''' or satisfy the above property. (When '''bstrings''' are only readable, the '''mlen >= slen''' restriction is not required; this is discussed later in this section.) A '''bstring''' itself is just a pointer to a '''struct tagbstring''':
'''typedef struct tagbstring * bstring;'''
Note that use of the prefix "tag" in '''struct tagbstring''' is required to work around the inconsistency between C and C++'s struct namespace usage. This definition is also considered exposed.
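The correctness conditions above can be written as small predicates. This is a sketch only: the struct is re-declared locally so that the example compiles without '''bstrlib.h''', and the function names `is_well_formed_writable` and `is_well_formed_readable` are hypothetical, not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the header layout described above. */
struct tagbstring {
    int mlen;
    int slen;
    unsigned char *data;
};

/* Minimal check for a modifiable bstring:
 * (slen >= 0 && mlen >= slen && data != NULL). */
int is_well_formed_writable(const struct tagbstring *b)
{
    return b != NULL && b->slen >= 0 && b->mlen >= b->slen && b->data != NULL;
}

/* Relaxed check for a bstring that is only read:
 * the mlen >= slen restriction is dropped. */
int is_well_formed_readable(const struct tagbstring *b)
{
    return b != NULL && b->slen >= 0 && b->data != NULL;
}
```

Neither predicate can verify that the buffer really is as large as '''mlen''' claims; that part of the contract remains the responsibility of whoever constructed the header.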
Bstrlib basically manages '''bstring'''s allocated as a header and an associated data-buffer. Since the implementation is exposed, they can also be constructed manually. Functions which mutate '''bstring'''s assume that the header and data buffer have been malloced; the bstring library may perform '''free()''' or '''realloc()''' on both the header and data buffer of any '''bstring''' parameter. Functions which return '''bstring''''s create them. The string memory is freed by a '''bdestroy()''' call (or using the '''bstrFree''' macro).
The following related typedef is also provided:
'''typedef const struct tagbstring * const_bstring;'''
which is also considered exposed. These are directly '''bstring''' compatible (no casting required) but are just used for parameters which are meant to be non-mutable. So in general, '''bstring''' parameters which are read as input but not meant to be modified will be declared as '''const_bstring''', and '''bstring''' parameters which may be modified will be declared as '''bstring'''. This convention is recommended for user written functions as well.
Since '''bstring'''s maintain interoperability with C library char-buffer style strings, all functions which modify, update or create '''bstring'''s also write a ''''\0'''' character immediately after the last character of the string, i.e., into byte number '''slen + 1''' of the buffer (index '''slen''' when counting from zero). This trailing ''''\0'''' character is not required for '''bstring'''s input to the '''bstring''' functions; this is provided solely as a convenience for interoperability with standard C char-buffer functionality.
Analogs for the ANSI C string library functions have been created when they are necessary, but have also been left out when they are not. In particular, there are no functions analogous to '''fwrite''', or '''puts''' just for the purposes of '''bstring'''. The '''->data''' member of any string is exposed, and, therefore, can be used just as easily as char buffers for C functions which read strings.
For those that wish to hand construct '''bstring'''s, the following should be kept in mind:
- While bstrlib can accept constructed '''bstring'''s without terminating ''''\0'''' characters, the rest of the C language string library will not function properly on such non-terminated strings. This is obvious but must be kept in mind.
- If it is intended that a constructed '''bstring''' be written to by the '''bstring''' library functions then the data portion should be allocated by the '''malloc''' function and the '''slen''' and '''mlen''' fields should be entered properly. The '''struct tagbstring''' header is not reallocated and only freed by '''bdestroy'''.
- Writing arbitrary ''''\0'''' characters at various places in the string will not modify its length as perceived by the bstring library functions. In fact, ''''\0'''' is a legitimate non-terminating character for a '''bstring''' to contain.
- For read-only parameters, '''bstring''' functions do not check the '''mlen'''. I.e., the minimal correctness requirements are reduced to:
'''(slen >= 0 && data != NULL)'''
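To make the points about hand construction and embedded ''''\0'''' characters concrete, here is a small sketch. The struct is re-declared locally so the example is self-contained; `make_readonly` and `c_length` are hypothetical helpers, and setting '''mlen''' to -1 for a header that is never written to is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

struct tagbstring {
    int mlen;
    int slen;
    unsigned char *data;
};

/* Hand-builds a header over an existing block for read-only use.
 * Only slen and data matter to readers; mlen is set to -1 here
 * because this header is never meant to be written through. */
struct tagbstring make_readonly(const void *block, int len)
{
    struct tagbstring t;
    assert(block != NULL && len >= 0);
    t.data = (unsigned char *)block;
    t.slen = len;
    t.mlen = -1;
    return t;
}

/* Length as the C library sees it: stops at the first '\0'. */
size_t c_length(const struct tagbstring *b)
{
    return strlen((const char *)b->data);
}
```

For a block such as `"ab\0cd"` with a length of 5, '''slen''' reports all five bytes while `c_length` reports 2, which is exactly the mismatch to keep in mind when handing '''->data''' to ordinary C string functions.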
One built-in feature of ''''\0'''' terminated '''char *''' strings is that it's very easy and fast to obtain a reference to the tail of any string using pointer arithmetic. Bstrlib does one better by providing a way to get a reference to any substring of a '''bstring''' (or any other length delimited block of memory). So rather than just having pointer arithmetic, one essentially has segment arithmetic. This is achieved using the macro '''blk2tbstr()''' which builds a reference to a block of memory and the macro '''bmid2tbstr()''' which builds a reference to a segment of a '''bstring'''. Bstrlib also includes functions for direct consumption of memory blocks into '''bstring'''s, namely, '''bcatblk()''' and '''blk2bstr()'''.
One scenario where this can be extremely useful is when a string contains many substrings which one would like to pass as read-only reference parameters to some string consuming function without the need to allocate entire new containers for the string data. More concretely, imagine parsing a command line string whose parameters are space delimited. This can only be done for tails of the string with ''''\0'''' terminated '''char *''' strings.
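The command line scenario can be sketched without Bstrlib itself, to show why length-delimited references make it possible. The `segment` type and `split_on_spaces` function below are hypothetical stand-ins for reference headers of the kind the macros above build; no bytes are copied and the input is never modified.

```c
#include <assert.h>
#include <string.h>

/* A read-only reference to part of a larger buffer: pointer plus length. */
struct segment {
    const unsigned char *data;
    int len;
};

/* Splits line[0..len) on spaces, storing up to max references in out.
 * Returns the number of references stored. */
int split_on_spaces(const char *line, int len, struct segment *out, int max)
{
    int count = 0;
    int i = 0;

    assert(line != NULL && out != NULL);
    while (i < len && count < max) {
        while (i < len && line[i] == ' ') i++;   /* skip delimiters */
        if (i >= len) break;
        int start = i;
        while (i < len && line[i] != ' ') i++;   /* scan one token */
        out[count].data = (const unsigned char *)line + start;
        out[count].len = i - start;
        count++;
    }
    return count;
}
```

Each argument in the middle of the line gets its own reference even though it is not followed by a ''''\0''''; with plain '''char *''' strings only the final argument could be referenced this way without copying or overwriting the delimiters.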
Unless otherwise noted, if a '''NULL''' pointer is passed as a '''bstring''' or any other detectably illegal parameter, the called function will return with an error indicator (either '''NULL''' or '''BSTR_ERR''') rather than simply performing a '''NULL''' pointer access, or having undefined behavior.
To illustrate the value of this, consider the following example:
strcpy(p = malloc (13 * sizeof (char)), "Hello,");
strcat(p, " World");
This is not correct because '''malloc''' may return '''NULL''' (due to an out of memory condition), and the behavior of '''strcpy''' is undefined if either of its parameters are '''NULL'''. However:
bstrcat(p = bfromcstr ("Hello,"), q = bfromcstr (" World"));
bdestroy(q);
is well defined, because if either p or q are assigned '''NULL''' (indicating a failure to allocate memory) both '''bstrcat''' and '''bdestroy''' will recognize it and perform no detrimental action.
Note that it is not necessary to check any of the members of a returned '''bstring''' for internal correctness (in particular, the data member does not need to be checked against '''NULL''' when the header is non-'''NULL'''), since this is assured by the '''bstring''' library itself.
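The error-propagation style described above can be illustrated with a self-contained sketch. All names here (`sk_string`, `sk_from_cstr`, `sk_concat`, `sk_destroy`, `SK_ERR`) are hypothetical and only mimic the shape of the API; the sketch also ignores aliasing between distinct headers and integer overflow, which a real implementation would have to handle.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SK_OK  0
#define SK_ERR (-1)   /* hypothetical error indicator for this sketch */

struct sk_string { int mlen; int slen; unsigned char *data; };

/* Creates a string from a C string; returns NULL on any failure. */
struct sk_string *sk_from_cstr(const char *s)
{
    if (s == NULL) return NULL;
    size_t n = strlen(s);
    struct sk_string *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->data = malloc(n + 1);
    if (b->data == NULL) { free(b); return NULL; }
    memcpy(b->data, s, n + 1);
    b->slen = (int)n;
    b->mlen = (int)n + 1;
    return b;
}

/* Appends src to dst; a NULL argument is reported, not dereferenced. */
int sk_concat(struct sk_string *dst, const struct sk_string *src)
{
    if (dst == NULL || src == NULL || dst->data == NULL || src->data == NULL)
        return SK_ERR;
    int need = dst->slen + src->slen + 1;
    if (need > dst->mlen) {
        unsigned char *grown = realloc(dst->data, (size_t)need);
        if (grown == NULL) return SK_ERR;   /* dst is left unchanged */
        dst->data = grown;
        dst->mlen = need;
    }
    memcpy(dst->data + dst->slen, src->data, (size_t)src->slen);
    dst->slen += src->slen;
    dst->data[dst->slen] = '\0';
    return SK_OK;
}

/* Frees a string; destroying NULL is an error, not a crash. */
int sk_destroy(struct sk_string *b)
{
    if (b == NULL) return SK_ERR;
    free(b->data);
    free(b);
    return SK_OK;
}
```

Because every entry point checks its arguments first, a failed allocation early in a chain of calls simply turns the later calls into reported errors, so the caller can check once at the end instead of after every step.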
In addition to the '''bgets''' and '''bread''' functions, Bstrlib can abstract streams with a high-performance read-only stream called a '''bStream'''. In general, the idea is to open a core stream (with something like '''fopen''') then pass its handle as well as a '''bNread''' function pointer (like '''fread''') to the '''bsopen''' function which will return a handle to an open '''bStream'''. Then the functions '''bsread''', '''bsreadln''' or '''bsreadlns''' can be called to read portions of the stream. Finally, the '''bsclose''' function is called to close the '''bStream'''; it returns a handle to the original (core) stream. So '''bStreams''', essentially, wrap other streams.
The '''bStreams''' have two main advantages over the '''bgets''' and '''bread''' (as well as '''fgets'''/'''ungetc''') paradigms:
- Improved functionality via the '''bunread''' function which allows a stream to unread characters, giving the '''bStream''' stack-like functionality if so desired.
- A high-performance '''bsreadln''' function. The C library function '''fgets()''' (and the '''bgets''' function) can typically be written as a loop on top of '''fgetc()''', thus paying all of the overhead costs of calling '''fgetc''' on a per character basis. '''bsreadln''' will read blocks at a time, thus amortizing the overhead of '''fread''' calls over many characters at once.
The semantics of '''bStreams''' allows practical construction of layered data streams. What this means is that by writing a '''bNread''' compatible function on top of a '''bStream''', one can construct a new '''bStream''' on top of it. This can be useful for writing multi-pass parsers that don't actually read the entire input more than once and don't require the use of intermediate storage.
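The block-reading idea behind '''bsreadln''' can be sketched as follows. This is not the library's implementation: `sk_stream`, `sk_open`, `sk_readln`, `mem_src` and `mem_read` are hypothetical names, the callback type merely has the same shape as '''fread''', and the buffer is deliberately tiny so that refills are easy to trace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reader callback with the same shape as fread. */
typedef size_t (*read_fn)(void *buff, size_t elsize, size_t nelem, void *parm);

#define SK_BUFSZ 8

struct sk_stream {
    read_fn read;
    void *parm;
    unsigned char buf[SK_BUFSZ];
    size_t pos, len;   /* unread bytes are buf[pos..len) */
    int calls;         /* callback invocations, kept for illustration */
};

void sk_open(struct sk_stream *s, read_fn read, void *parm)
{
    s->read = read; s->parm = parm;
    s->pos = s->len = 0; s->calls = 0;
}

/* Reads up to and including term into out (at most cap-1 bytes, always
 * '\0' terminated). Returns the byte count, or 0 at end of input. */
size_t sk_readln(struct sk_stream *s, char *out, size_t cap, char term)
{
    size_t n = 0;
    if (cap == 0) return 0;
    while (n + 1 < cap) {
        if (s->pos == s->len) {            /* buffer drained: fetch a block */
            s->len = s->read(s->buf, 1, SK_BUFSZ, s->parm);
            s->pos = 0;
            s->calls++;
            if (s->len == 0) break;        /* end of input */
        }
        unsigned char *start = s->buf + s->pos;
        size_t avail = s->len - s->pos;
        size_t room = cap - 1 - n;
        if (avail > room) avail = room;
        unsigned char *hit = memchr(start, term, avail);
        size_t take = hit ? (size_t)(hit - start) + 1 : avail;
        memcpy(out + n, start, take);
        n += take;
        s->pos += take;
        if (hit) break;
    }
    out[n] = '\0';
    return n;
}

/* An in-memory "core stream" and its fread-like reader. */
struct mem_src { const char *data; size_t len, off; };

size_t mem_read(void *buff, size_t elsize, size_t nelem, void *parm)
{
    struct mem_src *m = parm;
    size_t left = m->len - m->off;
    assert(elsize == 1);                   /* the sketch reads bytes only */
    if (nelem > left) nelem = left;
    memcpy(buff, m->data + m->off, nelem);
    m->off += nelem;
    return nelem;
}
```

Each callback invocation fetches up to `SK_BUFSZ` bytes, so the per-call overhead is spread over a whole block instead of being paid once per character. Since `sk_readln` only needs an '''fread'''-shaped callback, a reader written on top of one `sk_stream` could in turn feed another, which is the layering idea described above.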
Aliasing occurs when a function is given two parameters which point to data structures which overlap in the memory they occupy. While this does not disturb read-only functions, for many libraries this can make functions that write to these memory locations malfunction. This is a common problem of the C standard library and especially the string functions in the C standard library.
The C standard string library is entirely char by char oriented (as is Bstrlib) which makes conforming implementations alias-safe for some scenarios. However no actual detection of aliasing is typically performed, so it is easy to find cases where the aliasing will cause anomalous or undesirable behavior (consider: '''strcat(p, p)'''.) The C99 standard includes the "restrict" pointer modifier which allows the compiler to document and assume a no-alias condition on usage. However, only the most trivial cases can be caught (if at all) by the compiler at compile time, and thus, there is no actual enforcement of non-aliasing.
Bstrlib, by contrast, permits aliasing and is completely aliasing safe, in the C99 sense of aliasing. That is to say, under the assumption that pointers of incompatible types from distinct objects can never alias, Bstrlib is completely aliasing safe. (In practice this means that the data buffer portion of any '''bstring''' and header of any '''bstring''' are assumed to never alias.) With the exception of the reference building macros, the library behaves as if all read-only parameters are first copied and replaced by temporary non-aliased parameters before any writing to any output '''bstring''' is performed (though actual copying is extremely rarely ever done.)
Besides being a useful safety feature, '''bstring''' searching/comparison functions can improve to O(1) execution when aliasing is detected.
Note that aliasing detection and handling code in Bstrlib is generally extremely cheap. There is almost never any appreciable performance penalty for using aliased parameters.
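To show why aliasing detection can be cheap, here is a sketch of an append operation that stays well defined when the source lies inside the destination's own buffer. The names `sk_buf` and `sk_append` are hypothetical, and this is one possible technique rather than a description of how Bstrlib does it; integer overflow checks are omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct sk_buf { int mlen; int slen; unsigned char *data; };

/* Appends n bytes at src to dst, even if src points into dst->data.
 * Returns 0 on success, -1 on error (dst is left unchanged). */
int sk_append(struct sk_buf *dst, const unsigned char *src, int n)
{
    if (dst == NULL || dst->data == NULL || src == NULL || n < 0) return -1;
    if (dst->slen < 0 || dst->mlen < dst->slen) return -1;

    /* Record where src sits relative to dst's buffer before any
     * reallocation can move that buffer. */
    uintptr_t base = (uintptr_t)dst->data;
    uintptr_t p = (uintptr_t)src;
    int aliased = p >= base && p < base + (uintptr_t)dst->mlen;
    size_t offset = aliased ? (size_t)(p - base) : 0;

    int need = dst->slen + n + 1;
    if (need > dst->mlen) {
        unsigned char *grown = realloc(dst->data, (size_t)need);
        if (grown == NULL) return -1;
        dst->data = grown;
        dst->mlen = need;
        if (aliased) src = grown + offset;   /* re-base the stale pointer */
    }
    memmove(dst->data + dst->slen, src, (size_t)n);
    dst->slen += n;
    dst->data[dst->slen] = '\0';
    return 0;
}
```

The extra work is two integer comparisons and one subtraction, which is why handling the aliased case costs almost nothing. Appending a buffer to itself, the analogue of '''strcat(p, p)''', simply doubles the string.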
Nearly every function in Bstrlib is a leaf function and is completely re-enterable with the exception of writing to common '''bstring'''s. The split functions which use a callback mechanism require only that the source string not be destroyed by the callback function unless the callback function returns with an error status (note that Bstrlib functions which return an error do not modify the string in any way.) The string can, in fact, be modified by the callback and the behavior is deterministic. See the documentation of the various split functions for more details.
One of the basic premises of Bstrlib is not to increase the propagation of undefined situations from parameters that are otherwise legal in and of themselves. In particular, except for extremely marginal cases, usages of '''bstring'''s that use the '''bstring''' library functions alone cannot lead to any undefined action. But due to C/C++ language and library limitations, there is no way to define a non-trivial library that is completely without undefined operations. All such possible undefined operations are described below:
- '''bstrings''' or '''struct tagbstrings''' that are not explicitly initialized cannot be passed as a parameter to any '''bstring''' function.
- The members of the '''NULL bstring''' cannot be accessed directly. (Though all APIs and macros detect the '''NULL bstring'''.)
- A '''bstring''' whose data member has not been obtained from a '''malloc''' or compatible call and which is write accessible passed as a writable parameter will lead to undefined results. (i.e., do not '''writeAllow''' any constructed '''bstring'''s unless the data portion has been obtained from the heap.)
- If the headers of two strings alias but are not identical (which can only happen via a defective manual construction), then passing them to a '''bstring''' function in which one is writable is not defined.
- If the '''mlen''' member is larger than the actual accessible length of the data member for a writable '''bstring''', or if the '''slen''' member is larger than the readable length of the data member for a readable '''bstring''', then the corresponding '''bstring''' operations are undefined.
- Any '''bstring''' definition whose header or accessible data portion has been assigned to inaccessible or otherwise illegal memory clearly cannot be acted upon by the '''bstring''' library in any way.
- Destroying the source of an incremental split from within the callback and not returning with a negative value (indicating that it should abort) will lead to undefined behavior. (Though modifying or adjusting the state of the source data, even if those modifications fail within the Bstrlib API, has well-defined behavior.)
- Modifying a '''bstring''' which is write protected by direct access has undefined behavior.
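The last item distinguishes between going through the API, which honors write protection, and writing to the fields directly, which does not. A minimal sketch of that distinction follows. Everything here is hypothetical: the `sk_` names are invented, and marking a protected string with a non-positive '''mlen''' is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SK_OK  0
#define SK_ERR (-1)

struct sk_str { int mlen; int slen; unsigned char *data; };

/* In this sketch a non-positive mlen marks the string as protected. */
int sk_is_protected(const struct sk_str *s) { return s->mlen <= 0; }

void sk_protect(struct sk_str *s)
{
    if (s->mlen >= 0) s->mlen = -1;
}

/* Restores a usable lower bound for mlen (at least 1 for an empty string). */
void sk_allow(struct sk_str *s)
{
    if (s->mlen == -1) s->mlen = s->slen + (s->slen == 0);
}

/* A mutator that goes through the check; it never reallocates. */
int sk_set_char(struct sk_str *s, int pos, unsigned char c)
{
    if (s == NULL || s->data == NULL || sk_is_protected(s)) return SK_ERR;
    if (pos < 0 || pos >= s->slen) return SK_ERR;
    s->data[pos] = c;
    return SK_OK;
}
```

A caller that writes `s.data[0] = 'x'` directly bypasses `sk_set_char` and therefore the protection, which is the situation the list above leaves undefined.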
A C++ wrapper has been created to enable '''bstring''' functionality for C++ in the most natural way possible. The mandate for the C++ wrapper is different from the base C '''bstring''' library. Since the C++ language has far more abstracting capabilities, the '''CBString''' structure is considered fully abstracted; i.e., hand generated '''CBStrings''' are not supported (though conversion from a '''struct tagbstring''' is allowed) and all detectable errors are manifested as thrown exceptions.
- The C++ class definitions are all under the namespace Bstrlib. '''bstrwrap.h''' enables this namespace (with a using namespace Bstrlib; directive at the end) unless the macro '''BSTRLIB_DONT_ASSUME_NAMESPACE''' has been defined before it is included.
- Erroneous accesses result in an exception being thrown. The exception parameter is of type '''struct CBStringException''' which is derived from '''std::exception''' if STL is used. A verbose description of the error message can be obtained from the '''what()''' method.
- '''CBString''' is a C++ structure derived from a '''struct tagbstring'''. An address of a '''CBString''' cast to a '''bstring''' must not be passed to '''bdestroy'''. The '''bstring''' C API has been made C++ safe and can be used directly in a C++ project.
- It includes constructors which can take a '''char''', ''''\0'''' terminated '''char''' buffer, '''tagbstring''', ('''char''', repeat-value), a length delimited buffer or a '''CBStringList''' to initialize it.
- Concatenation is performed with the '''+''' and '''+=''' operators. Comparisons are done with the '''==''', '''!=''', '''<''', '''>''', '''<=''' and '''>=''' operators. Note that '''==''' and '''!=''' use the '''biseq''' call, while '''<''', '''>''', '''<=''' and '''>=''' use '''bstrcmp'''.
- '''CBString''''s can be directly cast to '''const''' character buffers.
- '''CBString''''s can be directly cast to '''double''', '''float''', '''int''' or '''unsigned int''' so long as the '''CBString''' is a decimal representation of that type (otherwise, an exception will be thrown). Converting the other way should be done with the format(a) method(s).
- '''CBString''' contains the '''length''', '''character''' and '''[]''' accessor methods. The '''character''' and '''[]''' accessors are aliases of each other. If the bounds of the string are exceeded, an exception is thrown. To avoid the overhead of this check, first cast the '''CBString''' to a '''(const char *)''' and use '''[]''' to dereference the array as normal. Note that the '''character''' and '''[]''' accessor methods allow both reading and writing of individual characters.
- The methods: '''format''', '''formata''', '''find''', '''reversefind''', '''findcaseless''', '''reversefindcaseless''', '''midstr''', '''insert''', '''insertchrs''', '''replace''', '''findreplace''', '''findreplacecaseless''', '''remove''', '''findchr''', '''nfindchr''', '''alloc''', '''toupper''', '''tolower''', '''gets''', '''read''' are analogous to the functions that can be found in the C API.
- The '''caselessEqual''' and '''caselessCmp''' methods are analogous to '''biseqcaseless''' and '''bstricmp''' functions respectively.
- Note that just like the '''bformat''' function, the '''format''' and '''formata''' methods do not automatically cast '''CBStrings''' into '''char *''' strings for '''"%s"'''-type substitutions:
CBString w("world");
CBString h("Hello");
CBString hw;
/* The casts are necessary */
hw.format ("%s, %s", (const char *)h, (const char *)w);
- The methods trunc and repeat have been added instead of using pattern.
- '''ltrim''', '''rtrim''' and '''trim''' methods have been added. These remove characters from a given character string set (defaulting to the whitespace characters) from either the left, right or both ends of the '''CBString''', respectively.
- The method '''setsubstr''' is also analogous in functionality to '''bsetstr''', except that it cannot be passed '''NULL'''. Instead, the method '''fill''' and the fill-style constructor have been supplied to enable this functionality.
- The '''writeprotect()''', '''writeallow()''', and '''iswriteprotected()''' methods are analogous to the '''bwriteprotect()''', '''bwriteallow()''', and '''biswriteprotected()''' macros in the C API. Write protection semantics in '''CBString''' are stronger than with the C API in that indexed character assignment is checked for write protection. However, unlike with the C API, a write protected '''CBString''' can be destroyed by the destructor.
- '''CBStream''' is a C++ structure which wraps a struct '''bStream''' (it's not derived from it, since destruction is slightly different). It is constructed by passing in a '''bNread''' function pointer and a stream parameter cast to '''void *'''. This structure includes methods for detecting eof, setting the buffer length, reading the whole stream or reading entries line by line or block by block, an unread function, and a peek function.
- If STL is available, the '''CBStringList''' structure is derived from a vector of '''CBString''' with various split methods. The split method has been overloaded to accept either a character or '''CBString''' as the second parameter (when the split parameter is a '''CBString''' any character in that '''CBString''' is used as a separator). The '''splitstr''' method takes a '''CBString''' as a substring separator. Joins can be performed via a '''CBString''' constructor which takes a '''CBStringList''' as a parameter, or just using the '''CBString::join()''' method.
- If there is proper support for '''std::iostreams''', then the '''>>''' and '''<<''' operators and the '''getline()''' function have been added (with semantics the same as those for '''std::string''').
A mutable '''bstring''' is kind of analogous to a small (two entry) linked list allocated by '''malloc''', with all aliasing completely under programmer control. I.e., manipulation of one '''bstring''' will never affect any other distinct '''bstring''' unless explicitly constructed to do so by the programmer via hand construction or via building a reference. Bstrlib also does not use any static or global storage, so there are no hidden race conditions. '''bstring'''s are also not inherently thread local. So just like '''char *''''s, '''bstring'''s can be passed around from thread to thread and shared and so on, so long as modifications to a '''bstring''' correspond to some kind of exclusive access lock as should be expected (or if the '''bstring''' is read-only, which can be enforced by '''bstring''' write protection) for any sort of shared object in a multi-threaded environment.
For convenience, a '''bsafe''' module has been included. The idea is that if this module is included, inadvertent usage of the most dangerous C functions will be overridden and lead to an immediate run time abort. Of course, it should be emphasized that usage of this module is completely optional. The intention is to provide an option for creating project safety rules which can be enforced mechanically rather than socially. This is useful for larger, or open development projects where it's more difficult to enforce social rules or "coding conventions".
Bstrlib is written for the C and C++ languages, which have inherent weaknesses that cannot be easily solved:
- Memory leaks: Forgetting to call '''bdestroy''' on a '''bstring''' that is about to be unreferenced will leak memory, just as forgetting to call '''free''' on a heap buffer that is about to be unreferenced will. Bstrlib itself, however, is leak free.
- Read before write usage: In C, declaring an auto '''bstring''' does not automatically fill it with legal/valid contents. This problem has been somewhat mitigated in C++. (The '''bstrDeclare''' and '''bstrFree''' macros from '''bstraux''' can be used to help mitigate this problem.)
- Built-in mutex usage to automatically avoid all '''bstring''' internal race conditions in multitasking environments: The problem with trying to implement such things at this low a level is that it is typically more efficient to use locks in higher level primitives. There is also no platform independent way to implement locks or mutexes.
The Better String Library is not an application, it is a library. To compile it, you need to compile '''bstrlib.c''' to an object file that is linked to your application. A Makefile might contain entries such as the following to accomplish this:
BSTRDIR = $(CDIR)/bstrlib
INCLUDES = -I$(BSTRDIR)
BSTROBJS = $(ODIR)/bstrlib.o
DEFINES =
CFLAGS = -O3 -Wall -pedantic -ansi -s $(DEFINES)

application: $(ODIR)/main.o $(BSTROBJS)
	echo Linking: $@
	$(CC) $< $(BSTROBJS) -o $@

$(ODIR)/%.o : $(BSTRDIR)/%.c
	echo Compiling: $<
	$(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

$(ODIR)/%.o : %.c
	echo Compiling: $<
	$(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@
You can configure Bstrlib using the standard macro defines passed to the compiler. All configuration options are meant solely for the purpose of compiler compatibility. Configuration options are not meant to change the semantics or capabilities of the library, except where it is unavoidable.
Since some C++ compilers don't include the Standard Template Library and some have the option of disabling exception handling, a number of macros can be used to conditionally compile support for each of these:
BSTRLIB_CAN_USE_STL
- Defining this will enable the use of the Standard Template Library. Defining '''BSTRLIB_CAN_USE_STL''' overrides the '''BSTRLIB_CANNOT_USE_STL''' macro.
BSTRLIB_CANNOT_USE_STL
- Defining this will disable the use of the Standard Template Library. Defining '''BSTRLIB_CAN_USE_STL''' overrides the '''BSTRLIB_CANNOT_USE_STL''' macro.
BSTRLIB_CAN_USE_IOSTREAM
- Defining this will enable the use of streams from namespace '''std'''. Defining '''BSTRLIB_CAN_USE_IOSTREAM''' overrides the '''BSTRLIB_CANNOT_USE_IOSTREAM''' macro.
BSTRLIB_CANNOT_USE_IOSTREAM
- Defining this will disable the use of streams from namespace '''std'''. Defining '''BSTRLIB_CAN_USE_IOSTREAM''' overrides the '''BSTRLIB_CANNOT_USE_IOSTREAM''' macro.
BSTRLIB_THROWS_EXCEPTIONS
- Defining this will enable the exception handling within '''bstring'''. Defining '''BSTRLIB_THROWS_EXCEPTIONS''' overrides the '''BSTRLIB_DOESNT_THROW_EXCEPTIONS''' macro.
BSTRLIB_DOESNT_THROW_EXCEPTIONS
- Defining this will disable the exception handling within '''bstring'''. Defining '''BSTRLIB_THROWS_EXCEPTIONS''' overrides the '''BSTRLIB_DOESNT_THROW_EXCEPTIONS''' macro.
Some older C compilers do not support functions such as '''vsnprintf'''. This is handled by the following macro variables:
BSTRLIB_NOVSNP
- Defining this indicates that the compiler does not support '''vsnprintf'''. This will cause '''bformat''' and '''bformata''' to not be declared. Note that for some compilers, such as Turbo C, this is set automatically. Defining '''BSTRLIB_NOVSNP''' overrides the '''BSTRLIB_VSNP_OK''' macro.
BSTRLIB_VSNP_OK
- Defining this will disable the autodetection of compilers that do not support '''vsnprintf'''. Defining '''BSTRLIB_NOVSNP''' overrides the '''BSTRLIB_VSNP_OK''' macro.
Bstrlib comes with very few compilation options for changing the semantics of the library. These are described below.
BSTRLIB_DONT_ASSUME_NAMESPACE
- Defining this before including bstrwrap.h will disable the automatic enabling of the Bstrlib namespace for the C++ declarations.
BSTRLIB_DONT_USE_VIRTUAL_DESTRUCTOR
- Defining this will make the CBString destructor non-virtual.
BSTRLIB_MEMORY_DEBUG
- Defining this will cause the Bstrlib modules '''bstrlib.c''' and '''bstrwrap.cpp''' to invoke an '''#include "memdbg.h"'''. '''memdbg.h''' has to be supplied by the user.
Current release: v1.0.0
The version format v[Major].[Minor].[Update] is used to help developers manage backward compatibility in the core developer branch of the Better String Library. This is also reflected in the macro symbols BSTR_VER_MAJOR, BSTR_VER_MINOR and BSTR_VER_UPDATE in the bstrlib.h file. Differences in the Major version imply that there has been a change in the API, and that a recompile and usage source changes may be necessary. Differences in Minor version imply that there has been an expansion of the API, that backward compatibility should be preserved and that at most a recompile is necessary (unless there is a namespace collision). Differences in Update imply that no API change has occurred.
Although ordered, there is no implication of lexical sequencing. In particular, the Update number will not reset to 0 as the Major and Minor version numbers increment.
So simple bug fixes will usually be reflected in a change in the Update number. If new functions are available, the Minor value will increment. If any function changes its parameters, or if a function is removed, the Major value will increment.
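The versioning rules above amount to a small decision procedure. The sketch below encodes them; the type and function names (`sk_version`, `sk_compat`, `sk_compare`) are hypothetical, and in real code the three numbers would come from the BSTR_VER_MAJOR, BSTR_VER_MINOR and BSTR_VER_UPDATE macros.

```c
#include <assert.h>

struct sk_version { int maj, min, upd; };

/* What moving between two versions implies, per the scheme above. */
enum sk_compat {
    SK_SAME_API,      /* only Update differs: no API change */
    SK_API_EXPANDED,  /* Minor differs: at most a recompile is needed */
    SK_API_CHANGED    /* Major differs: source changes may be needed */
};

enum sk_compat sk_compare(struct sk_version built, struct sk_version current)
{
    if (built.maj != current.maj) return SK_API_CHANGED;
    if (built.min != current.min) return SK_API_EXPANDED;
    return SK_SAME_API;
}
```

Note that the Update field never influences the result, and because it does not reset when Major or Minor increment, it cannot be used on its own to order two releases.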
- Core C files (the only required files)
- C++ files (C++ API)
- Base Unicode support
- Optional extra utility functions
- Miscellaneous
Programs need only include '''bstrlib.h''' and compile/link '''bstrlib.c''' to use the basic '''bstring''' functions. C++ projects that wish to use '''CBString''' (which is more natural for C++) need to additionally include '''bstrwrap.h''' and compile/link '''bstrwrap.cpp'''. For both, there may be a need to make choices about feature configuration as described in the "Configurable compilation options" section above.
Other files that are included in this archive are:
- Documentation
The bstest module is just a unit test for the Bstrlib module. For correct implementations of Bstrlib, it should execute with 0 failures being reported. This test should be utilized if modifications/customizations to Bstrlib have been performed. It tests each core Bstrlib function with '''bstring'''s of every mode (read-only, '''NULL''', static and mutable) and ensures that the expected semantics are observed (including results that should indicate an error). It also tests for aliasing support. Passing bstest is a necessary but not a sufficient condition for ensuring the correctness of the '''bstrlib''' module.
The test module is just a unit test for the '''bstrwrap''' module. For correct implementations of '''bstrwrap''', it should execute with 0 failures being reported. This test should be utilized if modifications/customizations to '''bstrwrap''' have been performed. It tests each core '''bstrwrap''' function with '''CBString'''s write protected or not and ensures that the expected semantics are observed (including expected exceptions.) Note that exceptions cannot be disabled to run this test. Passing test is a necessary but not a sufficient condition for ensuring the correctness of the '''bstrwrap''' module.
First let us give a table of C library functions and the alternative '''bstring''' functions and '''CBString''' methods that should be used instead of them.
The top 9 C functions listed here are troublesome in that they impose memory management in the calling function. The '''bstring''' and '''CBString''' interfaces have built-in memory management, so there is far less code with far less potential for buffer overrun problems. '''strtok''' can only be reliably called as a "leaf" calculation, since it (quite bizarrely) maintains hidden internal state. And gets is well known to be broken no matter what. The Bstrlib alternatives do not suffer from those sorts of problems. The alternatives to '''strncat''' ('''bcatStatic''', '''bcatblk''', or just '''bconcat''') have much higher performance.
The alternatives to '''strspn''', '''strcspn''', '''strnset''', '''strrev''', '''printf''', '''puts''', '''fprintf''', '''fputs''', and '''memcmp''' are not implemented in the core, and it is recommended that they be used as is, if possible. Most '''bstring''' (and '''CBString''') functions will automatically append the ''''\0'''' character to the character data buffer. So by simply accessing the data buffer directly, ordinary C string library functions can be called directly on them. Note that '''bstrcmp''' is not the same as '''memcmp''' in exactly the same way that '''strcmp''' is not the same as '''memcmp'''.
If semantic equivalence is required, the '''fread''' and '''fgets''' functions should be used as is, but one should use the '''balloc()''' function to manage a '''bstring''' for the target buffer.
These are odd ones because of the exact sizing of the buffer required. The '''bstring''' and '''CBString''' alternatives require that the buffers be forced to hold at least the prescribed length, after which '''fread''' or '''fgets''' can be used directly. However, the automatic memory management of '''bstring''' and '''CBString''' will typically make the use of '''fgets''' and '''fread''' to read specifically sized strings unnecessary.
The '''bstring''' library has more overhead versus straight char buffers for most functions. This overhead is essentially just the memory management and string header allocation. This overhead usually only shows up for small string manipulations. The performance loss has to be considered in light of the following:
- What would be the performance loss of trying to write this management code in one's own application?
- Since the '''bstring''' library source code is given, a sufficiently powerful modern inlining globally optimizing compiler can remove function call overhead.
The algorithms used have performance advantages versus the analogous C library functions. For example:
- '''bfromcstr'''/'''blk2bstr'''/'''bstrcpy''' versus '''strcpy'''/'''strdup'''. By using '''memmove''' instead of '''strcpy''', the break condition of the copy loop is based on an independent counter (that should be allocated in a register) rather than having to check the results of the load. Modern out-of-order executing CPUs can parallelize the final branch mis-predict penalty with the loading of the source string. Some CPUs will also tend to have better built-in hardware support for counted memory moves than load-compare-store. (This is a minor, but non-zero gain.)
- '''biseq''' versus '''strcmp'''. If the strings are unequal in length, '''biseq''' will return in O(1) time. If the strings are aliased, or have aliased data buffers, '''biseq''' will return in O(1) time. '''strcmp''' will always be O(k), where k is the length of the common prefix or the whole string if they are identical.
- '''->slen''' versus '''strlen'''. '''->slen''' is obviously always O(1), while '''strlen''' is always O(n) where n is the length of the string.
- '''bconcat''' versus '''strcat'''. Both rely on precomputing the length of the destination string argument, which will favor the '''bstring''' library. On iterated concatenations the performance difference can be enormous.
- '''bsreadln''' versus '''fgets'''. The '''bsreadln''' function reads large blocks at a time from the given stream, then parses out lines from the buffers directly. Some C libraries will implement '''fgets''' as a loop over single '''fgetc''' calls. Testing indicates that the '''bsreadln''' approach can be several times faster for fast stream devices (such as a file that has been entirely cached.)
- '''bsplits'''/'''bsplitscb''' versus '''strspn'''. Accelerators for the set of match characters are generated only once.
- '''binstr''' versus '''strstr'''. The '''binstr''' implementation unrolls the loops to help reduce loop overhead. This will matter if the target string is long and source string is not found very early in the target string. With '''strstr''', while it is possible to unroll the source contents, it is not possible to do so with the destination contents in a way that is effective because every destination character must be tested against ''''\0'''' before proceeding to the next character.
- '''bReverse''' (from '''bstraux''') versus '''strrev'''. The C function must find the end of the string first before swapping character pairs.
- '''bstrrchr''' versus no comparable C function. It's not hard to write some C code to search for a character from the end going backwards. But there is no way to do this without computing the length of the string with '''strlen'''.
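The length-related points in the list above can be illustrated without Bstrlib itself. The following is a minimal sketch in plain C, assuming a hypothetical `struct lstr` that stores its length next to the buffer; the names `lstr_init`, `lstr_append` and `lstr_equal` are invented for illustration and are not part of the Bstrlib API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-tracked string; NOT Bstrlib's struct tagbstring. */
struct lstr {
    size_t len;   /* characters in use, excluding the '\0' */
    size_t cap;   /* allocated size of data in bytes */
    char *data;   /* always '\0' terminated */
};

/* Initialize an empty string; returns 0 on success, -1 on failure. */
int lstr_init(struct lstr *s) {
    s->data = malloc(16);
    if (s->data == NULL) return -1;
    s->len = 0;
    s->cap = 16;
    s->data[0] = '\0';
    return 0;
}

/* Append n bytes from src (which must not alias s->data). The write
   position is known from s->len, so the destination is never scanned;
   strcat has to rescan the destination on every call. */
int lstr_append(struct lstr *s, const char *src, size_t n) {
    if (s->len + n + 1 > s->cap) {
        size_t cap = s->cap;
        char *p;
        while (cap < s->len + n + 1) cap *= 2;
        p = realloc(s->data, cap);
        if (p == NULL) return -1;
        s->data = p;
        s->cap = cap;
    }
    memcpy(s->data + s->len, src, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

/* Equality test: O(1) when the lengths differ or the buffers are aliased. */
int lstr_equal(const struct lstr *a, const struct lstr *b) {
    if (a->len != b->len) return 0;
    if (a->data == b->data) return 1;
    return memcmp(a->data, b->data, a->len) == 0;
}
```

With iterated appends, each `lstr_append` call costs time proportional to the appended piece only, whereas a loop of `strcat` calls walks the ever-growing destination each time.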
Some of Bstrlib's extra functionality also leads to inevitable performance advantages over typical C solutions. For example, using the '''blk2tbstr''' macro, one can (in O(1) time) generate an internal sub-string by reference while not disturbing the original string. If disturbing the original string is not an option, typically, a comparable '''char *''' solution would have to make a copy of the sub-string to provide similar functionality. Another example is reverse character set scanning: the '''str'''('''c''')'''spn''' functions only scan in a forward direction, which can complicate some parsing algorithms.
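The idea of a sub-string by reference can be sketched in plain C as follows. This is only an illustration of the concept: `struct strref`, `strref_mid` and `strref_eq_cstr` are hypothetical names, and the structure is a simplified stand-in for a header of the kind '''blk2tbstr''' initializes, not Bstrlib's own type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical read-only reference to a run of characters. */
struct strref {
    const char *data;  /* points into someone else's buffer; not owned */
    size_t len;        /* number of characters referenced */
};

/* Build a reference to src[pos .. pos+len) in O(1). The range is clamped
   to the source; nothing is copied and src is not modified. */
struct strref strref_mid(const char *src, size_t srclen,
                         size_t pos, size_t len) {
    struct strref r;
    if (pos > srclen) pos = srclen;
    if (len > srclen - pos) len = srclen - pos;
    r.data = src + pos;
    r.len = len;
    return r;
}

/* Compare a reference against a C string. No '\0' is needed at the end
   of the referenced range because the length is carried explicitly. */
int strref_eq_cstr(struct strref r, const char *s) {
    return strlen(s) == r.len && memcmp(r.data, s, r.len) == 0;
}
```

A '\0' terminated solution would either have to write a terminator into the original buffer or copy the sub-string out before it could be handed to other string functions.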
Where high performance '''char *''' based algorithms are available, Bstrlib can still leverage them by accessing the '''->data''' field on '''bstring'''s. So realistically Bstrlib can never be significantly slower than any standard ''''\0'''' terminated '''char *''' based solutions.
The C++ interface has been designed with an emphasis on abstraction and safety first. However, since it is substantially a wrapper for the C '''bstring''' functions, for longer strings the performance comments described in the "Performance of the C interface" section above still apply. Note that the ('''CBString *''') type can be directly cast to a ('''bstring''') type, and passed as parameters to the C functions (though a '''CBString''' must never be passed to '''bdestroy'''.)
Probably the most controversial choice is performing full bounds checking on the '''[]''' operator. This decision was made because 1) the fast alternative of not bounds checking is still available by first casting the '''CBString''' to a ('''const char *''') buffer or to a ('''struct tagbstring''') then dereferencing '''.data''' and 2) because the lack of bounds checking is seen as one of the main weaknesses of C/C++ versus other languages. This check being done on every access leads to individual character extraction being actually slower than other languages in this one respect (other languages' compilers will normally dedicate more resources to hoisting or removing bounds checking as necessary) but otherwise brings C++ up to the level of other languages in terms of functionality.
It is common for other C++ libraries to leverage the abstractions provided by C++ to use reference counting and "copy on write" policies. While these techniques can speed up some scenarios, they impose a problem with respect to thread safety. '''bstring'''s and '''CBString'''s can be properly protected with "per-object" mutexes, meaning that two Bstrlib calls can be made and execute simultaneously, so long as the '''bstring'''s and '''CBString'''s are distinct. With a reference count and alias before copy on write policy, global mutexes are required that prevent multiple calls to the strings library to execute simultaneously regardless of whether or not the strings represent the same string.
One interesting trade off in '''CBString''' is that the default constructor is not trivial. I.e., it always prepares a ready to use memory buffer. The purpose is to ensure that there is a uniform internal composition for any functioning '''CBString''' that is compatible with '''bstring'''s. It also means that the other methods in the class are not forced to perform "late initialization" checks. In the end it means that construction of '''CBString'''s is slower than that of other comparable C++ string classes. Initial testing, however, indicates that '''CBString''' outperforms '''std::string''' and MFC's '''CString''', for example, in all other operations. So to work around this weakness it is recommended that '''CBString''' declarations be pushed outside of inner loops.
Practical testing indicates that with the exception of the caveats given above (constructors and safe index character manipulations) the C++ API for Bstrlib generally outperforms popular standard C++ string classes. Amongst the standard libraries and compilers, the quality of concatenation operations varies wildly and very little care has gone into search functions. Bstrlib dominates those performance benchmarks.
The '''bstring''' functions which write and modify '''bstring'''s will automatically reallocate the backing memory for the char buffer whenever it is required to grow. The algorithm for resizing chosen is to snap up to sizes that are a power of two which are sufficient to hold the intended new size. Memory reallocation is not performed when the required size of the buffer is decreased. This behavior can be relied on, and is necessary to make the behavior of '''balloc''' deterministic. This trades off additional memory usage for decreasing the frequency for required reallocations:
- For any '''bstring''' whose size never exceeds n, its buffer is not ever reallocated more than log2(n) times for its lifetime.
- For any '''bstring''' whose size never exceeds n, its buffer is never more than 2⋅(n+1) in length. (The extra characters beyond 2⋅n are to allow for the implicit ''''\0'''' which is always added by the '''bstring''' modifying functions.)
Property #2 needs emphasizing. Although the memory allocated is always a power of 2, for a '''bstring''' that grows linearly in size, its buffer memory also grows linearly, not exponentially. The reason is that the amount of extra space increases with each reallocation, which decreases the frequency of future reallocations.
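The two properties can be checked with a small simulation. This is a sketch under stated assumptions: `snap_up` and `count_reallocs` are hypothetical helpers written for this illustration, and Bstrlib's own rounding routine may differ in detail (for example in its minimum buffer size).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest power of two that is >= need.
   Assumes need is far below SIZE_MAX, so the doubling cannot overflow. */
size_t snap_up(size_t need) {
    size_t cap = 1;
    while (cap < need) cap *= 2;
    return cap;
}

/* Simulate a string growing one character at a time up to n characters
   and count how often the buffer has to be (re)allocated. The very
   first allocation is included in the count. */
unsigned count_reallocs(size_t n) {
    size_t cap = 0, len;
    unsigned reallocs = 0;
    for (len = 1; len <= n; len++) {
        size_t need = len + 1;      /* room for the implicit '\0' */
        if (need > cap) {
            cap = snap_up(need);
            reallocs++;
        }
    }
    return reallocs;
}
```

For a string that grows to 1000 characters this simulation performs 10 allocations (counting the first one) and ends with a 1024 byte buffer, which is below the 2⋅(n+1) = 2002 bound.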
Obviously, given that '''bstring''' writing functions may reallocate the data buffer backing the target '''bstring''', one should not attempt to cache the data buffer address and use it after such '''bstring''' functions have been called. This includes making reference '''struct tagbstring'''s which alias to a writable '''bstring'''.
'''balloc''' or '''bfromcstralloc''' can be used to preallocate the minimum amount of space used for a given '''bstring'''. This will reduce even further the number of times the data portion is reallocated. If the length of the string is never more than one less than the memory length then there will be no further reallocations.
Note that invoking the '''bwriteallow''' macro may increase the number of reallocs by one more than necessary for every call to '''bwriteallow''' interleaved with any '''bstring''' API which writes to this '''bstring'''.
The library does not use any mechanism for automatic clean up for the C API. Thus explicit clean up via calls to '''bdestroy()''' is required to avoid memory leaks.
Constant and static '''struct tagbstring'''s
A '''struct tagbstring''' can be write protected from any Bstrlib function using the '''bwriteprotect''' macro. A write protected struct tagbstring can then be reset to being writable via the '''bwriteallow''' macro. There is, of course, no protection from attempts to directly access the '''bstring''' members. Modifying a '''bstring''' which is write protected by direct access has undefined behavior.
Static '''struct tagbstring'''s can be declared via the '''bsStatic''' macro. They are considered permanently unwritable. Such '''struct tagbstring'''s are declared such that attempts to write to them are not well defined. Invoking either '''bwriteallow''' or '''bwriteprotect''' on static '''struct tagbstring'''s has no effect.
'''struct tagbstring'''s initialized via '''btfromcstr''' or '''blk2tbstr''' are protected by default but can be made writable via the '''bwriteallow''' macro. If '''bwriteallow''' is called on such '''struct tagbstring'''s, it is the programmer's responsibility to ensure that:
- The buffer supplied was allocated from the heap.
- '''bdestroy''' is not called on this '''struct tagbstring''' (unless the header itself has also been allocated from the heap.)
- '''free''' is called on the buffer to reclaim its memory.
The memory buffer is actually declared "'''unsigned char *'''" instead of "'''char *'''". The reason for this is to trigger compiler warnings whenever uncasted char buffers are assigned to the data portion of a '''bstring'''. This will draw more diligent programmers into taking a second look at the code where they have carelessly left off the typically required cast.
The '''bgets''', '''bread''' and '''bStream''' functions use function pointers to obtain strings from data streams. The function pointer declarations have been specifically chosen to be compatible with the '''fgetc''' and '''fread''' functions. While this may seem to be a convoluted way of implementing '''fgets''' and '''fread''' style functionality, it has been specifically designed this way to ensure that there is no dependency on a single narrowly defined set of device interfaces, such as just stream I/O. In the embedded world, it's quite possible to have environments where such interfaces may not exist in the standard C library form. Furthermore, the generalization that this opens up allows for more sophisticated uses for these functions (performing an '''fgets-'''like function on a socket, for example.) By using function pointers, it also allows such abstract stream interfaces to be created using the '''bstring''' library itself while not creating a circular dependency.
This is just a recognition that 16bit platforms with requirements for strings that are larger than 32K and 32bit+ platforms with requirements for strings that are larger than 2GB are pretty marginal. The main focus is on 32bit platforms and emerging 64bit platforms with reasonable < 2GB string requirements. Using '''int'''s allows for negative values, which have meaning internally to Bstrlib.
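To make the function pointer approach concrete, here is a minimal sketch in plain C of a line reader that pulls characters through an '''fgetc'''-like callback taking a '''void *''' parameter. The names `getc_fn`, `struct memsrc`, `memsrc_getc` and `read_line` are hypothetical and exist only for this illustration; unlike '''bgets''', the sketch writes into a fixed-size caller buffer rather than a growable string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* fgetc-like callback: returns the next character, or a negative value
   at end of input. The stream handle is an opaque void pointer. */
typedef int (*getc_fn)(void *parm);

/* Hypothetical in-memory "stream". */
struct memsrc {
    const char *data;
    size_t len;
    size_t pos;
};

/* Callback that reads from a block of memory instead of a FILE. */
int memsrc_getc(void *parm) {
    struct memsrc *m = parm;
    if (m->pos >= m->len) return -1;
    return (unsigned char) m->data[m->pos++];
}

/* Read up to and including the terminator into out (capacity cap,
   always '\0' terminated). Returns the number of characters stored. */
size_t read_line(getc_fn getc_cb, void *parm, char terminator,
                 char *out, size_t cap) {
    size_t n = 0;
    int c;
    if (cap == 0) return 0;
    while (n + 1 < cap && (c = getc_cb(parm)) >= 0) {
        out[n++] = (char) c;
        if ((char) c == terminator) break;
    }
    out[n] = '\0';
    return n;
}
```

Because `read_line` only sees the callback and an opaque pointer, the same routine would work over a socket or a generated sequence by supplying a different callback, with no dependency on stream I/O.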
Certain care needs to be taken when copying and aliasing '''bstring'''s. A '''bstring''' is essentially a pointer type which points to a multipart abstract data structure. Thus usage, and lifetime of '''bstring'''s have semantics that follow these considerations. For example:
bstring a, b;
struct tagbstring t;
a = bfromcstr("Hello"); /* Create bstring with "Hello" in it. */
b = a; /* Alias b to a. */
t = *a; /* Alias to current contents of a */
bconcat (a, b); /* Double a & b, t is now undefined. */
bdestroy (a); /* Destroy the contents of a & b. */
Variables of type bstring are really just references that point to real '''bstring''' objects. The equal operator ('''=''') creates aliases, and the asterisk dereference operator ('''*''') creates a kind of alias to the current instance (which is generally not useful for any purpose.) Using '''bstrcpy()''' is the correct way of creating duplicate instances. The ampersand operator ('''&''') is useful for creating aliases to '''struct tagbstring'''s (remembering that constructed '''struct tagbstring'''s are not writable by default.)
'''CBString'''s use complete copy semantics for the equal operator (=), and thus do not have these sorts of issues.
'''bstring'''s have a simple, exposed definition and construction, and the library itself is open source. So most debugging is going to be fairly straightforward. But the memory for '''bstring'''s comes from the heap, which can often be corrupted indirectly, and it might not be obvious what has happened even from direct examination of the contents in a debugger or a core dump. There are some tools such as Purify, Insure++ and Electric Fence which can help solve such problems; however, another common approach is to directly instrument the calls to '''malloc''', '''realloc''', '''calloc''', '''free''', '''memcpy''', '''memmove''' and/or other calls by overriding them with macro definitions.
Although the user could hack on the Bstrlib sources directly as necessary to perform such an instrumentation, Bstrlib comes with a built-in mechanism for doing this. Defining the macro '''BSTRLIB_MEMORY_DEBUG''' forces the core Bstrlib modules to attempt to include a file named '''memdbg.h''', which the user must provide. In such a file, macros can be defined which override Bstrlib's usage of the C standard library.
Rather than calling '''malloc''', '''realloc''', '''free''', '''memcpy''' or '''memmove''' directly, Bstrlib emits the macros '''bstr__alloc''', '''bstr__realloc''', '''bstr__free''', '''bstr__memcpy''' and '''bstr__memmove''' in their place respectively. By default these macros are simply assigned to be equivalent to their corresponding C standard library function call. However, if they are given earlier macro definitions (via the back door include file) they will not be given their default definition. In this way Bstrlib's interface to the standard library can be changed but without having to directly redefine or link standard library symbols (both of which are not strictly ANSI C compliant.)
An example definition might include:
'''#define bstr__alloc(sz) X_malloc ((sz), __LINE__, __FILE__)'''
which might help contextualize heap entries in a debugging environment.
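As an illustration, the contents of such a '''memdbg.h''' might look like the sketch below. `X_malloc` is the name used in the example definition above; its body, the `X_free` companion and the `x_alloc_calls` counter are assumptions made up for this sketch and are not supplied by Bstrlib.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counter of allocations made through the wrapper. */
unsigned long x_alloc_calls = 0;

/* Hypothetical debugging allocator: logs the call site, then defers to
   the standard library. */
void *X_malloc(size_t sz, int line, const char *file) {
    void *p = malloc(sz);
    x_alloc_calls++;
    fprintf(stderr, "%s:%d: malloc(%lu) = %p\n",
            file, line, (unsigned long) sz, p);
    return p;
}

/* Hypothetical matching wrapper for free. */
void X_free(void *p, int line, const char *file) {
    fprintf(stderr, "%s:%d: free(%p)\n", file, line, p);
    free(p);
}

/* Definitions of this kind would live in memdbg.h, so that they are seen
   before Bstrlib assigns its default definitions. */
#define bstr__alloc(sz) X_malloc((sz), __LINE__, __FILE__)
#define bstr__free(p)   X_free((p), __LINE__, __FILE__)
```

Note that `__LINE__` and `__FILE__` expand where the macro is used, so the log would report positions inside the Bstrlib sources rather than in the application code that called into Bstrlib.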
The '''NULL''' parameter and sanity checking of '''bstring'''s is part of the Bstrlib API, and thus Bstrlib itself does not present any different modes which would correspond to "Debug" or "Release" modes. Bstrlib always contains mechanisms which one might think of as debugging features but retains the performance and small memory footprint one would normally associate with release mode code.
Microsoft's Visual Studio debugger has a capability of customizable mouse float over data type descriptions. This is accomplished by editing the AUTOEXP.DAT file to include the following:
; new for CBString
tagbstring =slen=<slen> mlen=<mlen> <data,st>
Bstrlib::CBStringList =count=<size()>
In Visual C++ 6.0 this file is located in the directory:
C:\Program Files\Microsoft Visual Studio\Common\MSDev98\Bin
and in Visual Studio .NET 2003 it is located here:
C:\Program Files\Microsoft Visual Studio .NET 2003\Common7\Packages\Debugger
This will improve the ability to debug with Bstrlib under Visual Studio.
Bstrlib does not come with explicit security features outside of its fairly comprehensive error detection, coupled with its strict semantic support. That is to say that certain common security problems, such as buffer overrun, constant overwrite, arbitrary truncation etc, are far less likely to happen inadvertently. Where it does help, Bstrlib maximizes its advantage by providing developers a simple adoption path that lets them leave less secure string mechanisms behind. The library will not leave developers wanting, so they will be less likely to add new code using a less secure string library to add functionality that might be missing from Bstrlib.
That said there are a number of security ideas not addressed by Bstrlib:
- Race condition exploitation (i.e., verifying a string's contents, then raising the privilege level and executing it as a shell command as two non-atomic steps) is well beyond the scope of what Bstrlib can provide. It should be noted that MFC's built-in string mutex actually does not solve this problem either; it just removes immediate data corruption as a possible outcome of such exploit attempts (it can be argued that this is worse, since it will leave no trace of the exploitation). In general race conditions have to be dealt with by careful design and implementation; a string library cannot assist with them.
- Any kind of access control or security attributes to prevent usage in dangerous interfaces such as '''system()'''. Perl includes a "trust" attribute which can be endowed upon strings that are intended to be passed to such dangerous interfaces. However, Perl's solution reflects its own limitations – notably that it is not a strongly typed language. In the example code for Bstrlib, there is a module called '''taint.cpp'''. It demonstrates how to write a simple wrapper class for managing "untainted" or trusted strings using the type system to prevent questionable mixing of ordinary untrusted strings with untainted ones then passing them to dangerous interfaces. In this way the security correctness of the code reduces to auditing the direct usages of dangerous interfaces or promotions of tainted strings to untainted ones.
- Encryption of string contents is way beyond the scope of Bstrlib. Maintaining encrypted string contents in the futile hopes of thwarting things like using system-level debuggers to examine sensitive string data is likely to be a wasted effort (imagine a debugger that runs at a higher level than a virtual processor where the application runs). For more standard encryption usages, since the '''bstring''' contents are simply binary blocks of data, this should pose no problem for usage with other standard encryption libraries.
The Better String Library is known to compile and function correctly with the following compilers:
- Microsoft Visual C++
- Watcom C/C++
- Intel's C/C++ compiler (Windows)
- The GNU C/C++ compiler (cygwin, Linux on PPC64, and Mac OS X)
- Borland C
- Turbo C
1. The function pointer types '''bNgetc''' and '''bNread''' have prototypes which are very similar to, but not exactly the same as '''fgetc''' and '''fread''' respectively. Basically the '''FILE *''' parameter is replaced by '''void *'''. The purpose of this was to allow one to create other functions with '''fgetc-'''like and '''fread-'''like semantics without being tied to ANSI C's file streaming mechanism. I.e., one could very easily adapt it to sockets, or simply reading a block of memory, or procedurally generated strings (for fractal generation, for example.)
The problem is that invoking the functions '''(bNgetc)fgetc''' and '''(bNread)fread''' is not technically legal in ANSI C. The reason is that the compiler is only able to coerce the function pointers themselves into the target type, but is unable to perform any cast (implicit or otherwise) on the parameters passed once invoked. I.e., if internally '''void *''' and '''FILE *''' need some kind of mechanical coercion, the compiler will not properly perform this conversion, which leads to undefined behavior. However, this is not an issue for any known contemporary platforms.
To correctly work around this problem to the satisfaction of the ANSI limitations, one needs to create wrapper functions for '''fgetc''' and/or '''fread''' with the prototypes of '''bNgetc''' and/or '''bNread''' respectively, which perform no action other than to explicitly cast the '''void *''' parameter to a '''FILE *''' and pass the remaining parameters straight to the wrapped function.
The wrappers themselves are trivial:
size_t freadWrap (void * buff, size_t esz, size_t eqty,
void * parm) {
return fread (buff, esz, eqty, (FILE *) parm);
}
int fgetcWrap (void * parm) {
return fgetc ((FILE *) parm);
}
These have not been supplied in '''bstrlib''' or '''bstraux''' to prevent unnecessary linking with file I/O functions.
2. '''vsnprintf''' is not available on all compilers. Because of this, the '''bformat''' and '''bformata''' functions (and '''format''' and '''formata''' methods) are not guaranteed to work properly. For those compilers that don't have '''vsnprintf''', the '''BSTRLIB_NOVSNP''' macro should be set before compiling '''bstrlib''', and the format functions/method will be disabled.
The more recent ANSI C standards have specified the required inclusion of a '''vsnprintf''' function.
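The reason '''vsnprintf''' matters for functions of this kind is that it reports how much space a formatted result needs, so a buffer can be sized before anything is written. The following is a minimal sketch assuming C99 '''vsnprintf''' semantics (the return value is the number of characters that would have been written); `format_alloc` is a hypothetical helper and not how Bstrlib necessarily implements '''bformat'''.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Bstrlib: format into a freshly
   allocated buffer of exactly the right size. The caller frees it. */
char *format_alloc(const char *fmt, ...) {
    va_list ap;
    int n;
    char *buf;

    /* First pass: measure the output without writing anything. */
    va_start(ap, fmt);
    n = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (n < 0) return NULL;

    buf = malloc((size_t) n + 1);
    if (buf == NULL) return NULL;

    /* Second pass: write into the correctly sized buffer. */
    va_start(ap, fmt);
    vsnprintf(buf, (size_t) n + 1, fmt, ap);
    va_end(ap);
    return buf;
}
```

Without a bounded, size-reporting formatter, the only portable alternative is '''vsprintf''' into a buffer of guessed size, which reintroduces the overflow risk the library is meant to remove.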
3. The Bstrlib function names are not unique in the first 6 characters. This is only an issue for older C compiler environments which do not store more than 6 characters for function names.
4. The '''bsafe''' module defines macros and function names which are part of the C library. This simply overrides the definition as expected on all platforms tested, however, it is not sanctioned by the ANSI standard. This module is clearly optional and should be omitted on platforms which disallow its undefined semantics.
In practice, the real issue is that some compilers in some modes of operation can/will inline these standard library functions on a module by module basis as they appear in each. The linker will thus have no opportunity to override the implementation of these functions for those cases. This can lead to inconsistent behavior of the '''bsafe''' module on different platforms and compilers.
Although developed independently, '''CBString'''s have very similar functionality to Microsoft's '''CString''' class. However, Bstrlib has significant advantages over '''CString''':
- Bstrlib is a C-library as well as a C++ library.
- Thus it is compatible with more programming environments and available to a wider population of programmers.
- The internal structure of a '''bstring''' is considered exposed.
- A single contiguous block of data can be cut into read-only pieces by simply creating headers, without allocating additional memory to create reference copies of each of these sub-strings.
- In this way, using '''bstring'''s in a totally abstracted way becomes a choice rather than an imposition. Further, this choice can be made differently at different layers of applications that use it.
- Static declaration support precludes the need for constructor invocation.
- Allows for static declarations of constant strings without additional constructor overhead.
- Bstrlib is not attached to another library.
- Bstrlib is designed to be easily plugged into any other library collection, without dependencies on other libraries or paradigms (such as "MFC".)
- '''bsetstr'''
- '''bsplit'''
- '''bread'''
- '''breplace''' (this is different from '''CString::Replace()''')
- writable indexed characters (for example '''a[i]='x'''')
Bstrlib's international support is oriented around just handling UTF-8. '''CString''' essentially supports the UCS-2 version of Unicode via '''widechar_t''' as an application-wide compile time switch. This is platform specific, and basically not portable.
'''CString'''s also use built-in mechanisms for ensuring thread safety under all situations. While this makes writing thread-safe code that much easier, this built-in safety feature has a price: the inner loops of each '''CString''' method run in their own critical section (grabbing and releasing a light weight mutex on every operation.) The usual way to decrease the impact of a critical section performance penalty is to amortize more operations per critical section. But since the implementation of '''CString'''s is fixed as a one critical section per-operation cost, there is no way to leverage this common performance enhancing idea.
The search facilities in Bstrlib are comparable to those in MFC's '''CString''' class, though it is missing locale-specific collation. But because Bstrlib is interoperable with C's char buffers, it will allow programmers to write their own string searching mechanism (such as Boyer-Moore), or be able to choose from a variety of available existing string searching libraries (such as those for regular expressions) without difficulty.
Microsoft used a very non-ANSI conforming trick in its implementation to allow '''printf()''' to use the '''"%s"''' specifier to output a '''CString''' correctly. This can be convenient, but it is inherently not portable. '''CBString''' requires an explicit cast, while bstring requires the data member to be dereferenced. Microsoft's own documentation recommends casting, instead of relying on this feature.
This is the C++ language's standard STL based string class.
- There is no C implementation.
- The '''[]''' operator is not bounds checked.
- Missing a lot of useful functions like '''printf'''-like formatting.
- Limited by STL's '''std::iostream''' which in turn is limited by '''ifstream''' which can only take input from files. (Compare to '''CBStream''''s API which can take abstracted input.)
- Extremely uneven performance across implementations.
Following the ISO C99 standard, Microsoft has proposed a group of C library extensions which are supposedly "safer and more secure". This proposal was adopted by the C11 standard.
The proposal reveals itself to be very similar to Microsoft's "StrSafe" library. The functions are basically the same as other standard C library string functions except that destination parameters are paired with an additional length parameter of type '''rsize_t'''. '''rsize_t''' is the same as '''size_t'''; however, the range is checked to make sure it's between '''1''' and '''RSIZE_MAX'''. Like Bstrlib, the functions perform a "parameter check". Unlike Bstrlib, when a parameter check fails, rather than simply outputting accumulated error statuses, they call a user settable global error function handler, and upon return of control perform no (additional) detrimental action. The proposal covers basic string functions as well as a few non-reentrant functions ('''asctime''', '''ctime''', and '''strtok''').
- Still based solely on '''char *''' buffers (and therefore '''strlen()''' and '''strcat()''' are still O(n), and there are no faster '''streq()''' comparison functions.)
- No growable string semantics.
- Requires manual buffer length synchronization in the source code.
- No attempt to enhance functionality of the C library.
- Introduces a new error scenario (strings exceeding '''RSIZE_MAX''' length).
The error handler can discriminate between types of failures, but does not take into account any call site context. So the problem is that the error is going to be manifest in a piece of code, but there is no pointer to that code. It would seem that passing in the call site '''__FILE__''', '''__LINE__''' as parameters would be very useful, but the API clearly doesn't support such a thing (it would increase code bloat even more than the extra length parameter does, and would require macro tricks to implement).
The Bstrlib C API takes the position that error handling needs to be done at the call site, and just tries to make it as painless as possible. Furthermore, error modes are removed by supporting auto-growing strings and aliasing. For capturing errors in more central code fragments, Bstrlib's C++ API uses exception handling extensively, which is superior to the leaf-only error handler approach.
The main webpage for the managed string library: http://www.cert.org/secure-coding/managedstring.html
Robert Seacord at CERT has proposed a C string library that he calls the "Managed String Library" for C. Like Bstrlib, it introduces a new type which is called a managed string. The structure of a managed string ('''string_m''') is like a '''struct tagbstring''' but missing the length field. This internal structure is considered opaque. The length is, like the C standard library, always computed on the fly by searching for a terminating NUL on every operation that requires it. So it suffers from every performance problem that the C standard library suffers from. Inter-operating with C string APIs (like '''printf''', '''fopen''', or anything else that takes a string parameter) requires copying to additionally allocated buffers that have to be manually freed, which makes this library probably slower and more cumbersome than any other string library in existence.
The library gives a fully populated error status as the return value of every string function. The hope is to be able to diagnose all problems specifically from the return code alone. Comparing this to Bstrlib, which always returns one consistent error message, might make it seem that Bstrlib would be harder to debug; but this is not true. With Bstrlib, if an error occurs there is always enough information from just knowing there was an error and examining the parameters to deduce exactly what kind of error has happened. The managed string library thus gives up nested function calls while achieving little benefit, while Bstrlib does not.
One interesting feature that "managed strings" has is the idea of data sanitation via character set white-listing. That is to say, a globally definable filter that makes any attempt to put invalid characters into strings lead to an error and not modify the string. The author gives the following example:
/* create valid char set */
if (retValue = strcreate_m(&str1, "abc"))
fprintf(stderr, "Error %d from strcreate_m.\n", retValue);
if (retValue = setcharset(str1))
fprintf(stderr, "Error %d from setcharset().\n", retValue);
if (retValue = strcreate_m(&str1, "aabbccabc"))
fprintf(stderr, "Error %d from strcreate_m.\n", retValue);
/* create string with invalid char set */
if (retValue = strcreate_m(&str1, "abbccdabc"))
fprintf(stderr, "Error %d from strcreate_m.\n", retValue);
Which we can compare with a more Bstrlib way of doing things:
bstring bCreateWithFilter (const char * cstr,
                           const_bstring filter) {
    bstring b = bfromcstr (cstr);
    /* bninchr searches from position 0 for a character that is not in
       filter; BSTR_ERR means no such character was found. */
    if (NULL != b && BSTR_ERR != bninchr (b, 0, filter)) {
        fprintf (stderr, "Filter violation.\n");
        bdestroy (b);
        b = NULL;
    }
    return b;
}
struct tagbstring charFilter = bsStatic ("abc");
bstring str1 = bCreateWithFilter ("aabbccabc", &charFilter);
bstring str2 = bCreateWithFilter ("aabbccdabc", &charFilter);
The first thing we should notice is that with the Bstrlib approach you can have different filters for different strings if necessary. Furthermore, selecting a char-set filter in the Managed String Library is uni-contextual. That is to say, there can only be one such filter active for the entire program, which means its usage is not well defined for intermediate library usage (a library that uses it will interfere with user code that uses it, and vice versa.) It is also likely to be poorly defined in multithreading environments.
There is also a question as to whether the data sanitation filter is checked on every operation, or just on creation operations. Since the char-set can be set arbitrarily at run time, it might be set after some managed strings have been created. This would seem to imply that all functions should run this additional check every time if there is an attempt to enforce this. This would make things tremendously slow. On the other hand, if it is assumed that only creates and other operations that take '''char *''''s as input need be checked, because the char-set was only supposed to be set once, before any managed string was created, then one can see that it's easy to cover Bstrlib with equivalent functionality via a few wrapper calls such as the example given above.
And finally we have to question the value of sanitation in the first place. For example, for httpd servers, there is generally a requirement that the URLs parsed have some form that avoids undesirable translation to local file system filenames or resources. The problem is that, because of the way URLs can be encoded, a URL must be completely parsed and translated before one can know whether it uses certain invalid character combinations. That is to say, merely filtering each character one at a time is not necessarily the right way to ensure that a string has safe contents.
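The point about encoded URLs can be illustrated with a minimal sketch in plain standard C. Everything in it is an assumption made for illustration: the whitelist '''URL_CHARS''' and the helpers '''passes_char_filter''', '''percent_decode''' and '''has_dot_dot''' are hypothetical names, belong to neither library, and are not a complete URL parser.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical whitelist: characters a naive filter might allow in a URL path. */
static const char URL_CHARS[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/%._-";

/* Returns 1 if every character of s appears in allowed, 0 otherwise. */
static int passes_char_filter (const char * s, const char * allowed) {
    return s[strspn (s, allowed)] == '\0';
}

/* Value of a hexadecimal digit, or -1 if c is not one (assumes ASCII letters). */
static int hex_value (int c) {
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower (c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decodes %XX escapes from src into dst, which must hold strlen (src) + 1
   bytes. Returns the decoded length, or -1 on a malformed escape. */
static int percent_decode (char * dst, const char * src) {
    int n = 0;
    while (*src != '\0') {
        if (*src == '%') {
            int hi = hex_value ((unsigned char) src[1]);
            int lo = (hi < 0) ? -1 : hex_value ((unsigned char) src[2]);
            if (lo < 0) return -1;
            dst[n++] = (char) (hi * 16 + lo);
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}

/* Returns 1 if the text contains "..", 0 otherwise. */
static int has_dot_dot (const char * text) {
    return strstr (text, "..") != NULL;
}
```

With these helpers, the input "/docs/%2e%2e%2fetc/passwd" passes the character whitelist and contains no literal "..", yet it decodes to "/docs/../etc/passwd". The check that matters has to run on the decoded form, which a per-character filter applied at string creation never sees.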
In the article that describes this proposal, it is claimed that it fairly closely approximates the existing C API semantics. On this point we should compare this "closeness" with Bstrlib:
| | Bstrlib | Managed String Library |
| Pointer arithmetic | Segment arithmetic | N/A |
| Use with C std lib functions | '''->data''' or '''bdata'''/'''bdatae''' | '''getstr_m(x,*)''' ... '''free(x)''' |
| String literals | '''bsStatic'''/'''bsStaticBlk''' | '''strcreate_m()''' |
| Transparency | Complete | None |
It's pretty clear that the semantic mapping from C strings to Bstrlib is fairly straightforward, and that in general semantic capabilities are the same or superior in Bstrlib. On the other hand, the Managed String Library is either missing semantics or changes things fairly significantly.
This library is available at:
http://www.annexia.org/freeware/c2lib
- Still based solely on '''char *''' buffers (and therefore '''strlen()''' and '''strcat()''' are still O(n), and there are no faster '''streq()''' comparison functions.) Their suggestion that alternatives which wrap the string data type (such as '''bstring''' does) impose a difficulty in inter-operating with the C language's ordinary C string library is not founded.
- Introduction of memory (and vector?) abstractions imposes a learning curve, and some kind of memory usage policy that is outside of the strings themselves (and therefore must be maintained by the developer.)
- The API is massive, and filled with all sorts of trivial ('''pjoin''') and controversial ('''pmatch''': regular expressions are not sufficiently standardized, and there is a very large difference in performance between compiled and non-compiled REs) functions. Bstrlib takes a decidedly minimal approach; none of the functionality in c2lib is difficult or challenging to implement on top of Bstrlib (except the regex stuff, which is going to be difficult, and controversial no matter what.)
- Understanding why c2lib is the way it is pretty much requires a working knowledge of Perl. Bstrlib requires only knowledge of the C string library while providing just a very select few worthwhile extras.
- It is attached to a lot of cruft like a matrix math library (that doesn't include any functions for getting the determinant, eigenvectors, eigenvalues, the matrix inverse, test for singularity, test for orthogonality, a Gram-Schmidt orthogonalization, LU decomposition ... I mean why bother?)
More information about this library can be found here:
http://www.canonical.org/~kragen/stralloc.html or here:
http://cr.yp.to/lib/stralloc.html
- Library is very minimal. A little too minimal.
- Untargeted source parameters are not declared const.
- Slightly different expected emphasis (like the '''_cats''' function, which takes an ordinary C string char buffer as a parameter.) It's clear that the remainder of the C string library is still required to perform more useful string operations.
stralloc actually uses the interesting policy that a '''NULL''' data pointer indicates an empty string. In this way, non-static empty strings can be declared without construction. This advantage is minimal, since static empty '''bstring'''s can be declared inline without construction, and if the string needs to be written to it should be constructed from an empty string (or its first initializer) in any event.
This is the string class used in the wxWindows project. A description of wxString can be found here:
http://www.wxwindows.org/manuals/2.4.2/wx368.htm#wxstring
This C++ library is similar to '''CBString'''. However, it is littered with trivial functions ('''IsAscii''', '''UpperCase''', '''RemoveLast''' etc.)
- There is no C implementation.
- The memory management strategy is to allocate a bounded fixed amount of additional space on each resize, meaning that it does not have the log2(n) property that Bstrlib has, in which the number of reallocations grows only logarithmically with the string length (it will thrash very easily, cause massive fragmentation in common heap implementations, and can easily be a common source of performance problems).
- The library uses a "copy on write" strategy, meaning that it has to deal with multithreading problems.
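The difference between the two growth policies mentioned in the list above can be made concrete with a small counting model. This is only a sketch under simplifying assumptions: the function names are hypothetical, the string grows one byte at a time, and neither function reproduces the actual allocator of wxString or Bstrlib.

```c
#include <stddef.h>

/* Counts the reallocations needed to grow a buffer one byte at a time up to
   final_len when every resize adds a fixed number of bytes. */
static size_t reallocs_fixed_increment (size_t final_len, size_t increment) {
    size_t capacity = 0, count = 0, len;
    for (len = 1; len <= final_len; len++) {
        if (len > capacity) {
            capacity += increment;
            count++;
        }
    }
    return count;
}

/* Same count when every resize doubles the capacity, starting at initial. */
static size_t reallocs_doubling (size_t final_len, size_t initial) {
    size_t capacity = 0, count = 0, len;
    for (len = 1; len <= final_len; len++) {
        if (len > capacity) {
            capacity = (capacity == 0) ? initial : capacity * 2;
            count++;
        }
    }
    return count;
}
```

Growing to 1048576 bytes, the fixed 16-byte increment needs 65536 reallocations in this model, while doubling from 16 bytes needs 17. Each reallocation may copy the whole buffer, which is where the thrashing and fragmentation come from.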
This is a highly orthogonal C string library with an emphasis on networking/realtime programming. It can be found here:
- The convoluted internal structure does not contain a ''''\0'''' terminated '''char *''' compatible buffer, so interoperability with the C library is a non-starter.
- The API and implementation are very large (owing to its orthogonality) and can lead to difficulty in understanding its exact functionality.
- An obvious dependency on GNU tools (confusing make configure step)
- Uses a reference counting system, meaning that it is not likely to be thread safe.
The learning curve for Vstr is very steep, and it doesn't come with any obvious way to build for Windows or other platforms without GNU tools. At least one mechanism (the iterator) introduces a new undefined scenario (writing to a Vstr while iterating through it.) Vstr has a very large footprint, and is very ambitious in its total functionality. Vstr has no C++ API.
Vstr usage requires context initialization via '''vstr_init()''' which must be run in a thread-local context. Given the totally reference based architecture this means that sharing Vstrings across threads is not well-defined, or at least not safe from race conditions. This API is clearly geared to the older standard of '''fork()''' style multitasking in UNIX, and is not safely transportable to modern shared memory multithreading available in Linux and Windows. There is no portable external solution making the library thread safe (since it requires a mutex around each Vstr context – not each string.)
In the documentation for this library, a big deal is made of its self hosted '''s'''('''n''')'''printf'''-like function. This is an issue for older compilers that don't include vsnprintf(), but also an issue because Vstr has a slow conversion to ''''\0'''' terminated '''char *''' mechanism. That is to say, using '''"%s"''' to format data that originates from Vstr would be slow without some sort of native function to do so. Bstrlib sidesteps the issue by relying on what '''snprintf'''-like functionality does exist and having a high performance conversion to a '''char *''' compatible string so that '''"%s"''' can be used directly.
This is a fairly extensive string library that includes full Unicode support and is targeted at outperforming MFC and STL. The architecture, similarly to MFC's '''CString'''s, is a copy on write reference counting mechanism.
http://www.utilitycode.com/str/default.aspx
- Commercial.
- C++ only.
It should be pointed out that performance testing of Bstrlib has indicated that its relative performance advantage versus MFC's CString and STL's '''std::string''' is at least as high as that for the Str library.
A handful of functional extensions to the C library that add dynamic string functionality.
http://www.mibsoftware.com/libmib/astring/
This package basically references strings through '''char **''' pointers and assumes they are pointing to the top of an allocated heap entry (or NULL, in which case memory will be newly allocated from the heap.) So it's still up to the user to mix and match the older C string functions with these functions whenever pointer arithmetic is used (i.e., there is no leveraging of the type system to assert semantic differences between references and base strings as Bstrlib does, since no new types are introduced.) Unlike Bstrlib, exact string length meta data is not stored, thus requiring a '''strlen()''' call on every string writing operation. The library is very small, covering only a handful of C's functions.
While this is better than nothing, it is clearly slower than even the standard C library, less safe and less functional than Bstrlib.
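The cost of not storing the length can be sketched with two hypothetical append routines in plain standard C. Neither is libmib's '''astrcat''' nor Bstrlib's '''bcatcstr'''; '''append_rescan''', '''append_tracked''' and '''struct lenbuf''' are names invented for this illustration, and the '''scanned''' counter exists only to make the repeated work visible.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-carrying buffer, in the spirit of a wrapped string type. */
struct lenbuf {
    char * data;
    size_t len;
    size_t cap;
};

/* Appends piece to a heap string held through a char **. The current length
   is unknown, so the whole string is rescanned on every call; *scanned
   accumulates the number of bytes walked to find the end. */
static int append_rescan (char ** ps, const char * piece, size_t * scanned) {
    size_t old = (*ps != NULL) ? strlen (*ps) : 0;
    size_t add = strlen (piece);
    char * p = realloc (*ps, old + add + 1);
    if (p == NULL) return -1;
    memcpy (p + old, piece, add + 1);
    *ps = p;
    *scanned += old;
    return 0;
}

/* Appends piece to a buffer that stores its own length; only piece is scanned. */
static int append_tracked (struct lenbuf * b, const char * piece) {
    size_t add = strlen (piece);
    size_t need = b->len + add + 1;
    if (need > b->cap) {
        size_t ncap = (b->cap == 0) ? 16 : b->cap;
        char * p;
        while (ncap < need) ncap *= 2;
        p = realloc (b->data, ncap);
        if (p == NULL) return -1;
        b->data = p;
        b->cap = ncap;
    }
    memcpy (b->data + b->len, piece, add + 1);
    b->len += add;
    return 0;
}
```

Appending the two-character piece "ab" one hundred times makes '''append_rescan''' walk 9900 bytes of existing content in total, a cost that grows quadratically with the number of appends, while '''append_tracked''' never revisits what it has already written.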
To explain the advantage of using libmib, their website shows an example of how dangerous C code:
char buf[256];
char *pszExtraPath = ";/usr/local/bin";
strcpy(buf,getenv("PATH")); /* oops! could overrun! */
strcat(buf,pszExtraPath); /* Could overrun as well! */
printf("Checking...%s\n",buf); /* Some printfs overrun too! */
is avoided using libmib:
char *pasz = 0; /* Must initialize to 0 */
char *paszOut = 0;
char *pszExtraPath = ";/usr/local/bin";
if (!astrcpy(&pasz,getenv("PATH"))) /* malloc error */ exit(-1);
if (!astrcat(&pasz,pszExtraPath)) /* malloc error */ exit(-1);
/* Finally, a "limitless" printf! we can use */
asprintf(&paszOut,"Checking...%s\n",pasz);fputs(paszOut,stdout);
astrfree(&pasz); /* Can use free(pasz) also. */
astrfree(&paszOut);
However, compare this to Bstrlib:
bstring b, out;
bcatcstr (b = bfromcstr (getenv ("PATH")), ";/usr/local/bin");
out = bformat ("Checking...%s\n", bdatae (b, "<Out of memory>"));
/* if (out && b) */ fputs (bdatae (out, "<Out of memory>"), stdout);
bdestroy (b);
bdestroy (out);
Besides being shorter, we can see that error handling can be deferred right to the very end. Also, unlike the above two versions, if '''getenv()''' returns with NULL, the Bstrlib version will not exhibit undefined behavior. Initialization starts with the relevant content rather than an extra auto-initialization step.
An attempt to add to the standard C library with a number of common useful functions, including additional string functions.
http://libclc.sourceforge.net/
- Uses standard '''char *''' buffer, and adopts C 99's usage of "restrict" to pass the responsibility to guard against aliasing to the programmer.
- Adds no safety or memory management whatsoever.
- Most of the supplied string functions are completely trivial.
- Uses standard '''char *''' buffer, and adopts C 99's usage of "'''restrict'''" to pass the responsibility to guard against aliasing to the programmer.
- Mixes '''char *''' and length wrapped buffers ('''estr''') functions, doubling the API size, with safety limited to only half of the functions.
This library was written for the purpose of increasing safety and power to C's string handling capabilities.
http://www.zork.org/safestr/safestr.html
- While the '''safestr_'''* functions are safe in and of themselves, inter-operating with '''char *''' strings has dangerous unsafe modes of operation.
- The architecture of safestr causes the base pointer to change. Thus, it's not practical/safe to store a safestr in multiple locations if any single instance can be manipulated.
- Dependent on an additional error handling library.
- Uses reference counting, meaning that it is either not thread safe or slow and not portable.
Because of its automatic temporary clean up system, it cannot use '''const''' semantics on input arguments. Interesting anomalies such as:
safestr_t s, t;
s = safestr_replace (t = SAFESTR_TEMP ("This is a test"),
SAFESTR_TEMP (" "), SAFESTR_TEMP ("."));
/* t is now undefined. */
are possible. If one defines a function which takes a '''safestr_t''' as a parameter, then the function would not know whether or not the '''safestr_t''' is defined after it passes it to a safestr library function. The author's recommended method for working around this problem is to examine the attributes of the '''safestr_t''' within the function which is to modify any of its parameters and play games with its reference count. I think, therefore, that the whole '''SAFESTR_TEMP''' idea is also fatally broken.
The library implements immutability, optional non-resizability, and a "trust" flag. This trust flag is interesting, and suggests that applying any arbitrary sequence of '''safestr_'''* function calls on any set of trusted strings will result in a trusted string. It seems to me, however, that if one wanted to implement a trusted string semantic, one might do so by actually creating a different type and only implement the subset of string functions that are deemed safe (i.e., user input would be excluded, for example.) This, in essence, would allow the compiler to enforce trust propagation at compile time rather than run time. Non-resizability is also interesting, however, it seems marginal (i.e., to want a string that cannot be resized, yet can be modified and yet where a fixed sized buffer is undesirable.)
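The idea of enforcing trust through the type system rather than a run-time flag can be sketched as follows. This is an assumption-laden illustration, not part of safestr or Bstrlib: '''struct trusted_str''', '''TRUSTED_LITERAL''', '''trusted_equal''' and '''trusted_length''' are hypothetical names, and because the struct is not opaque a determined caller could still build one by hand.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper type: the intent is that values are only ever
   produced by the functions in this small, audited set. */
struct trusted_str {
    const char * text;
};

/* Internal constructor; call it only through TRUSTED_LITERAL. */
static struct trusted_str trusted_from_literal (const char * lit) {
    struct trusted_str t;
    t.text = lit;
    return t;
}

/* The "" lit "" concatenation only compiles when lit is a string literal,
   so run-time data such as user input cannot be passed here. */
#define TRUSTED_LITERAL(lit) trusted_from_literal ("" lit "")

/* Operations deemed safe accept (and would return) only trusted strings. */
static int trusted_equal (struct trusted_str a, struct trusted_str b) {
    return strcmp (a.text, b.text) == 0;
}

static size_t trusted_length (struct trusted_str s) {
    return strlen (s.text);
}
```

Passing a plain '''char *''' to '''trusted_length''' or '''trusted_equal''' is rejected by the compiler, so trust propagation is checked at compile time and no flag needs to be inspected at run time.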
This is a length based string library based on a slightly different strategy. The string contents are appended to the end of the header directly so strings only require a single allocation. However, whenever a reallocation occurs, the header is replicated and the base pointer for the string is changed. That means a cached reference to a string is only valid so long as the string is not resized afterwards. The internal structure maintains a lot of state used to accelerate Unicode manipulation. This makes sustainable usage of the library essentially opaque. This also creates a bottleneck for whatever extensions to the library one desires (write all extensions on top of the base library, put in a request to the author, or dedicate an expert to learn the internals of the library). The library is committed to Unicode representation of its string data, and therefore cannot be used as a generic buffer library.
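The single-allocation layout and its moving base pointer can be sketched in standard C99 with a flexible array member. This is a generic illustration, not Libsrt's actual structure: '''struct hstr''', '''hstr_new''' and '''hstr_append''' are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical header-plus-contents string: one allocation holds both. */
struct hstr {
    size_t len;   /* characters in use, excluding the terminator */
    size_t cap;   /* bytes available in data */
    char data[];  /* C99 flexible array member */
};

/* Creates a string from s; returns NULL on allocation failure. */
static struct hstr * hstr_new (const char * s) {
    size_t len = strlen (s);
    struct hstr * h = malloc (sizeof (struct hstr) + len + 1);
    if (h == NULL) return NULL;
    h->len = len;
    h->cap = len + 1;
    memcpy (h->data, s, len + 1);
    return h;
}

/* Appends s and returns the string, which may have moved. On failure
   returns NULL and leaves the original h valid and unchanged. */
static struct hstr * hstr_append (struct hstr * h, const char * s) {
    size_t add = strlen (s);
    size_t need = h->len + add + 1;
    if (need > h->cap) {
        size_t ncap = need * 2;
        struct hstr * moved = realloc (h, sizeof (struct hstr) + ncap);
        if (moved == NULL) return NULL;
        h = moved;
        h->cap = ncap;
    }
    memcpy (h->data + h->len, s, add + 1);
    h->len += add;
    return h;
}
```

A caller has to write something like '''h = hstr_append (h, "more")''' (after checking for NULL), and any other copy of the old pointer is dangling once the block has moved. A design that keeps the header at a fixed address and reallocates only the data buffer avoids this, at the cost of a second allocation.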
Sds uses a strategy very similar to Libsrt. However, it uses some dynamic headers to decrease the overhead for very small strings. This requires an extra switch statement for access to each string attribute. The source code appears to use gcc/clang extensions, and thus it is not portable.
Dumping a line numbered file:
FILE * fp;
int i, ret;
struct bstrList * lines;
struct tagbstring prefix = bsStatic ("-> ");
if (NULL != (fp = fopen ("bstrlib.txt", "rb"))) {
    bstring b = bread ((bNread) fread, fp);
    fclose (fp);
    if (NULL != (lines = bsplit (b, '\n'))) {
        for (i=0; i < lines->qty; i++) {
            binsert (lines->entry[i], 0, &prefix, '?');
            printf ("%04d: %s\n", i, bdatae (lines->entry[i], "NULL"));
        }
        bstrListDestroy (lines);
    }
    bdestroy (b);
}
For numerous other examples, see '''bstraux.c''', '''bstraux.h''' and the example archive.
The Better String Library is available under either the BSD license (see the accompanying '''license.txt''') or the GNU General Public License version 2 (see the accompanying '''gpl.txt''') at the option of the user.
The following individuals have made significant contributions to the design and testing of the Better String Library:
Bjorn Augestad, Clint Olsen, Darryl Bleau, Fabian Cenedese, Graham Wideman, Ignacio Burgueno, International Business Machines Corporation, Ira Mica, John Kortink, Manuel Woelker, Marcel van Kervinck, Michael Hsieh, Mike Steinert, Richard A. Smith, Simon Ekstrom, Wayne Scott, Zed A. Shaw