Programming, Software and Code

GSoC Idea: Immutable Strings

Immutable strings are an interesting idea. Fundamentally, in a system with immutable strings the actual string storage is considered immutable at the lowest level. This means that when we modify strings, we create copies instead of modifying the strings in place. Immutable strings carry some benefits. First and foremost, we never need to resize strings in place. This means that memory can be allocated more linearly, and can be kept track of more simply. Instead of a string header needing to contain both the size of the currently allocated buffer and the size of the string that buffer contains, we only need to store one size value.

When we look at a code example like this:

$S0 = foo($S0)

It will look to the programmer as if the string register $S0 has been modified to become whatever the output of the foo function is. However, this is only true of the header being modified. Internally, the header will be pointing to a new memory buffer entirely.

Parrot currently uses a system called "Copy on Write", or COW. In a COW system making copies are cheap because we make incomplete copies. When we do this in Parrot:

$S1 = $S0

We aren't copying the string, we are only copying the header. Both headers will point to the same underlying memory buffer but now a COW flag will be set. If an attempt is made to modify either string, a complete copy is made first and then the modification happens. In theory this is a great optimization: We create copies lazily, and don't have to actually copy memory buffers (which can be expensive) until a modification is made somewhere.

The problem with the COW system is that Parrot has to make a lot of copies of strings, especially for strings which are constant or are being passed as parameters to subroutines. Consider this code:

$P0 = new ["foo"]
$S0 = typeof $P0
$S0 = $S0 . "bar"

Here $S0 is orignally the typename of the $P0 object, "foo". Obviously the name of the class should be constant, if we are allowed to just rename the class like this, we could cause all sorts of problems in the code. So the typeof operator returns a COW copy of the string, so that any modifications happen to a lazy copy.

The problem is that most accesses of class type names are not like above, but are instead like shown below:

$S0 = typeof $P0
if $S0 == "foo" goto ...


$S0 = typeof $P0
$P1 = new [$S0]

In cases like these, $S0 is treated as a constant string in the code to key changes in program flow. Most of the time the output of the typeof operator is treated as a constant so a copy doesn't need to be made, lazily or otherwise. In fact, having to make COW headers every time can become extremely wasteful. A similar situation happens in the PCC system, where strings passed as parameters to subroutines are passed as COW copied, not as straight references. This likewise can become extremely expensive when a COW header isn't needed.

I've rambled enough, I can talk about the theory behind COW and Immutable strings another time. For now, let's get to the GSoC project.

For GSoC, myself and several other developers (chromatic especially) would like to see Parrot converted to use immutable strings instead of COW. This could be done in several steps:
  1. Conversion of all string functions to take a arguments and return a string pointer result, instead of modifying argument pointers directly.
  2. Modification of the string allocator to use immutable strings instead of COW (involves changes to a particularly messy portion of the GC)
  3. Deprecating and eventually deleting all the functions that deal specifically with COW
  4. Creation of a new PMC type, which would be a piece-wise string builder type.
  5. Optimizations and improvements throughout the codebase to rely on the behavior of immutable strings to prevent unwanted changes to constant strings.
  6. Benchmarks of the new system over the old system to compare performance changes
So that's the GSoC idea involving immutable strings. It's slightly ambitious, but it is pretty easy to break it up into smaller, easier steps. Interested, motivated, and talented students are definitely encouraged to apply.


This is a good idea, and it would also particularly benefit any languages that are more functional in design, such as mine, where users assume that values are always immutable, and also parallel computation is made easier.

Just like Java ;) I think Python has imutable Strings as well.

They are so great all over the code place so you don't have to worry about multiple threads accessing the same object or use it very confidently inside a HashMap or HashSet with knowing it will never change.

Plus you can have some String handling optimizations. For example calling substring() does not create a new character array behind the scene, it will refer to the same as the original String, but will have different start and end indexes to cover this case.

This entry was originally posted on Blogger and was automatically converted. There may be some broken links and other errors due to the conversion. Please let me know about any serious problems.