Whiteknight: Reading a Line of Text

In terms of usage, there aren’t many IO-related features in Parrot’s user interface more straightforward than the readline method. It does exactly what you tell it to do: read a line of text from the given handle and return that line of text as a Parrot string. Easy.

Tonight I was looking at some of the old code to get an idea about expected semantics for some tests that need fixing. Let’s look at some code:

.sub read_a_line
    .param string type
    $P0 = new [type]
    $S0 = $P0.'readline'()
    .return($S0)
.end

.sub test_readline
    $S0 = 'read_a_line'('FileHandle')
    say $S0
    $S0 = 'read_a_line'('Socket')
    say $S0
    $S0 = 'read_a_line'('StringHandle')
    say $S0
.end

The valid types for this are, as usual, "FileHandle", "Socket" and "StringHandle". Notice that we’re reading a line from the object of the given type before we’ve opened, connected or initialized it. Pretend, in order to save myself some typing, that I’ve set up exception handlers and the like above. So, what happens?

For FileHandle we throw an exception. You can’t read from a closed handle.
For StringHandle, we throw an exception for the same reason.
For Socket we return null because…whatever. (In the test suite we check that the result is 0.0 when converted to a floating-point number. Again, whatever.)

It’s a little bit weird that Socket does something different from the other two, but fundamentally it’s a pretty different type, so I suppose some differences can be allowed.

Now, let’s try something slightly different:

.sub read_a_line
    .param string type
    $P0 = new [type]
    $P0.'open'("foo.txt", "r")
    $P0.'print'("This is \n test text")
    $P0.'close'()
    $S0 = $P0.'readline'()
    .return($S0)
.end

With this example we can only operate on FileHandle and StringHandle because Socket doesn’t have an .open() method like those two do. What does this do for those two types?

For FileHandle we throw the same exception: you still can’t read from a closed handle.
For StringHandle you can read like normal without any indication that the handle is closed!

So it’s weird, to say the least, that StringHandle has two different behaviors: a handle that was never opened throws, but one that was opened and then closed reads like normal. Socket has yet another problem, in a slightly different way. The method Socket.readline() returns null when not open, but if you pass a Socket to the Parrot_io_readline routine, it always throws an exception because apparently readline on a Socket isn’t supported! And because readline on a Socket uses a completely different code path from FileHandle, the two types use completely different buffering mechanisms with subtly different semantics. (StringHandle, because it uses the in-memory string buffer, does it in a third way.)

To recap: What is conceptually a simple operation, read in some text until we find a delimiter, is done in three completely different ways by three different types, each with different error-handling semantics depending on history, state, and the interface used. If anybody was wondering why I wanted to rewrite this subsystem, here’s part of the reason.

Actually, I kind of lied. It’s really not a simple operation, which is all the more reason we should share common code. It’s a clear case of an algorithm where the hard parts should be encapsulated inside a clean interface so that different types don’t need to reimplement it over and over again (with differences, bugs and complications). That’s the way it really should be, but some of the complications in the code are a little hard to live with. Here’s the general algorithm for readline on a FileHandle, as it’s implemented in Parrot master:

1. The filehandle requires a buffer for this, so create (and fill) a buffer if one isn’t configured.
2. Create a new, empty STRING header.
3. Treating the buffer like an encoded STRING, scan the buffer looking for the end of the delimiter or the end of the buffer, whichever comes first.
4. Allocate/reallocate enough space in the STRING header to hold all the data we’ve found in the buffer.
5. Append all the characters we’ve found to the STRING.
6. If we’ve found the delimiter, we’re done. Return it to the user.
7. Otherwise, check if we are at the end of file for the input. If so, go to 8. If not end of file, go to 9.
8. Check that the last codepoint is complete and has all its bytes. If so, return the STRING to the user. If not, throw an exception about a malformed string.
9. Check that the last codepoint is complete and has all its bytes. If so, go to 10. Otherwise, go to 11.
10. Refill the buffer and go to 3.
11. Determine how many more bytes we need to read to complete the last codepoint.
12. Refill the buffer, and check that we have at least that many bytes available to read. If so, go to 13. Otherwise, throw an exception about a malformed string input.
13. Read in the necessary number of bytes (1, 2 or 3 at most) from the buffer and go to 3.

If you’re reading an ASCII or fixed8 string, the logic obviously collapses down to something a little bit more manageable. Also, this same logic, almost line for line, is repeated in the routine to read a given number of characters from the handle, where a character in a non-fixed-width encoding (like utf8) may need multiple reads if not all of its bytes land in the buffer in a single go. Notice that the versions provided by StringHandle and Socket are both much simpler and are not safe for multi-byte encodings like utf8 or utf16.
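To make the multi-byte hazard concrete, here is a small Python sketch (not Parrot code, and not how StringHandle or Socket are actually written). The function name `naive_read_text` and the chunk sizes are made up for illustration. It decodes each chunk on its own, which is fine until a refill boundary lands in the middle of a codepoint:

```python
import io

def naive_read_text(stream, chunk_size):
    """Decode every chunk independently, ignoring codepoint boundaries."""
    pieces = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Raises UnicodeDecodeError if the chunk ends partway through a codepoint.
        pieces.append(chunk.decode("utf-8"))
    return "".join(pieces)

# 'ï' encodes to two bytes (0xC3 0xAF), so this is 7 bytes in total.
data = "naïve\n".encode("utf-8")

# A chunk size of 7 reads everything in one go and decodes cleanly.
whole = naive_read_text(io.BytesIO(data), 7)

# A chunk size of 3 splits 'ï' across two reads: the first chunk is b"na\xc3".
try:
    naive_read_text(io.BytesIO(data), 3)
    split_failed = False
except UnicodeDecodeError:
    split_failed = True
```

Whether the read works depends entirely on where the chunk boundaries happen to fall, which is exactly the kind of bug that only shows up on somebody else's data.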

In my io_cleanup1 branch, the logic has been simplified substantially, and a single codepath is now used for all three of the major types:

1. Make sure the handle has a read buffer set up and filled.
2. Create a new, empty STRING header.
3. Ask the buffer to find the given end-of-line character. The buffer will return a number of bytes to read in order to get a whole number of codepoints, and a flag that says whether we’ve found the delimiter or not.
4. Append those bytes to the string header.
5. If the delimiter is found or if we are at EOF, return the string.
6. Fill the buffer and go to #3.
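The steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the C code in the branch: the names `ReadBuffer`, `complete_prefix_len` and `readline` are hypothetical, the encoding is assumed to be utf8, and the delimiter is assumed to be a single byte such as "\n".

```python
import io

def complete_prefix_len(buf):
    """Number of leading bytes of buf that form whole UTF-8 codepoints."""
    n = len(buf)
    for back in range(1, min(4, n) + 1):
        byte = buf[n - back]
        if (byte & 0xC0) != 0x80:  # not a continuation byte: start of last codepoint
            need = 4 if byte >= 0xF0 else 3 if byte >= 0xE0 else 2 if byte >= 0xC0 else 1
            return n if back >= need else n - back
    return n  # only continuation bytes seen; leave it to the decoder to complain

class ReadBuffer:
    """Hypothetical read buffer that never hands out a partial codepoint."""
    def __init__(self, stream, chunk_size=4):
        self.stream, self.chunk_size = stream, chunk_size
        self.data, self.eof = b"", False

    def fill(self):
        chunk = self.stream.read(self.chunk_size)
        if chunk:
            self.data += chunk
        else:
            self.eof = True

    def find(self, delim):
        """Return (nbytes, found). Assumes a single-byte delimiter."""
        pos = self.data.find(delim)
        if pos >= 0:
            return pos + len(delim), True
        return complete_prefix_len(self.data), False

    def take(self, nbytes):
        out, self.data = self.data[:nbytes], self.data[nbytes:]
        return out

def readline(buf, delim=b"\n"):
    if not buf.data and not buf.eof:               # step 1: buffer set up and filled
        buf.fill()
    line = ""                                      # step 2: new, empty string
    while True:
        nbytes, found = buf.find(delim)            # step 3: ask the buffer
        line += buf.take(nbytes).decode("utf-8")   # step 4: append whole codepoints
        if found or buf.eof:                       # step 5: delimiter or EOF
            return line
        buf.fill()                                 # step 6: refill and loop

# The tiny chunk size forces multi-byte characters to straddle refills.
buf = ReadBuffer(io.BytesIO("héllo wörld\nsecond ✓ line\nlast".encode("utf-8")))
lines = [readline(buf), readline(buf), readline(buf)]
```

All the codepoint bookkeeping lives in `find`, so the `readline` loop itself never has to think about encodings. Any bytes of an unfinished codepoint simply stay in the buffer until the next fill completes them.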

By simply coding the buffer logic to refuse to return incomplete codepoints in response to a STRING read request, the whole algorithm becomes hugely simplified. The readline routine in master takes up 185 lines of C code. In my new branch, the same routine takes up only 47 lines. Of course, this isn’t comparing apples to apples, because I did break up some of the repeated logic into helper routines, and the buffers in my system are obviously a little bit smarter about STRINGs and codepoints, but that’s not exactly the point. The real point is that three large, complicated, hard-to-read functions in master are now a single, much smaller, easier-to-read routine that relies on clear abstraction boundaries to do a difficult job in a much more conceptually simple way.

I’ve also updated the STRING read routine (now called Parrot_io_read_s) to use a similar algorithm and actually share some of the new helper methods. That sharing itself also helps to decrease total lines of code and has other benefits as well.

Notice that there is one small behavioral difference between these two algorithms, which may or may not need to be worked around if it causes problems: the new one never reads an incomplete codepoint out of the buffer. If we have an incomplete one at the end of the file, the first algorithm will read it in and throw an exception about a malformed string. The second algorithm will ignore those final bytes and successfully return all the rest of the valid-looking data from the buffer instead. With the first algorithm it is impossible to read the partial data out and make a best effort, while with the second you can easily get to the data, even if the last codepoint is corrupted and cannot be read. I’d really love to hear what people think about this change, and whether it’s worth keeping or needs to change. I suspect it is better this way, but only the users can really say for sure.
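For anybody who wants to play with the two behaviors, Python’s standard incremental UTF-8 decoder happens to offer both, which makes for a quick comparison. This is an analogy only, not Parrot code; the function names `read_tail_strict` and `read_tail_lenient` are mine.

```python
import codecs

# "café" with the final byte chopped off: b"caf\xc3" ends in half a codepoint.
truncated = "café".encode("utf-8")[:-1]

def read_tail_strict(data):
    """Like the first algorithm: a partial codepoint at EOF is an error."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    return decoder.decode(data, final=True)

def read_tail_lenient(data):
    """Like the second algorithm: return the valid text, leave the rest buffered."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = decoder.decode(data, final=False)
    pending, _ = decoder.getstate()  # bytes still waiting for their codepoint
    return text, pending

try:
    read_tail_strict(truncated)
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

text, pending = read_tail_lenient(truncated)
```

The strict version loses the three good characters along with the bad byte. The lenient one hands back the good text and leaves the stray byte where the caller can inspect it.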
