I had made a round of fixes with regards to encodings in the
whiteknight/io_cleanup1
branch a few days ago. Rakudo hacker Moritz was
able to take a look at Rakudo’s spectests and verify that more tests were
indeed passing because of it. The remaining test failures represent the
changing semantics for the read
method and what appear to be two genuine
regressions or bugs.
Hopefully I will be able to get all these things sorted out this week before I go away on a mini vacation next weekend. Otherwise I can’t imagine this branch gets merged before the 4.6 release this month.
A few days ago I wrote a post about readline
and some of the intricacies involved in that, and some of the weird semantics
that I was attempting to unify. It turns out that some of these semantics are
a major cause in one of the last bugs in the branch. Let’s look at some code
in master to see where the hangup is. First, readline
on a Socket:
METHOD readline(STRING *delimiter :optional,
INTVAL has_delimiter :opt_flag) {
INTVAL idx;
STRING *result;
STRING *buf;
GET_ATTR_buf(INTERP, SELF, buf);
if (!has_delimiter)
delimiter = CONST_STRING(INTERP, "\n");
if (Parrot_io_socket_is_closed(INTERP, SELF))
RETURN(STRING * STRINGNULL);
if (buf == STRINGNULL)
buf = Parrot_io_reads(INTERP, SELF, CHUNK_SIZE);
while ((idx = Parrot_str_find_index(INTERP, buf, delimiter, 0)) < 0) {
STRING * const more = Parrot_io_reads(INTERP, SELF, CHUNK_SIZE);
if (Parrot_str_length(INTERP, more) == 0) {
SET_ATTR_buf(INTERP, SELF, STRINGNULL);
RETURN(STRING *buf);
}
buf = Parrot_str_concat(INTERP, buf, more);
}
idx += Parrot_str_length(INTERP, delimiter);
result = Parrot_str_substr(INTERP, buf, 0, idx);
buf = Parrot_str_substr(INTERP, buf, idx, Parrot_str_length(INTERP, buf) - idx);
SET_ATTR_buf(INTERP, SELF, buf);
RETURN(STRING *result);
}
We can ignore the fact that this implementation of readline
doesn’t call
Parrot_io_readline
like every other PMC does. Or that if we did call that
function the program would throw an exception because Parrot_io_readline
doesn’t support sockets anyway. Whatever. Moving on…
For comparison, let’s look at the version from the Handle PMC (which is inherited by FileHandle):
METHOD readline() {
STRING * const string_result = Parrot_io_readline(INTERP, SELF);
RETURN(STRING *string_result);
}
The Socket version takes a delimiter
parameter which is a STRING. When doing
readline on a Socket, you can pass in any arbitrary string which is used as
the token for end of line. With FileHandle, you don’t seem to have that.
However, you can definitely use custom delimiters with FileHandle. However,
we clearly don’t take a delimiter here and we aren’t passing one in as an
argument to Parrot_io_readline
like we do in the branch. Let’s see
how it’s done instead. Here’s a snippet from Handle PMC:
ATTR INTVAL record_separator; /* Record separator (only single char supported) */
We don’t need to look at any other code. This is the smoking gun.
Socket.readline()
can take any arbitrary STRING to use as a record
separator, but FileHandle.readline()
can only use a single codepoint, which
it doesn’t take as an argument.
So that’s the problem right there. When I standardized the readline mechanics between types, I picked the FileHandle semantics. This was probably the wrong decision, because not only could Sockets use a more general mechanism but Rakudo relies on that behavior in its spectests. This does raise a question about why nobody ever expected this same behavior from FileHandle, or why the difference was not considered some kind of bug. It really goes to show how immature our IO system has been for all these years, and how we had all just grown accustomed to the arbitrary, inconsistent, nonsensical behaviors. It just works for some basic usages, so nobody ever complains about it. That time is, thankfully, coming quickly to an end.
Fixing this issue is actually going to take some serious work. Several function signatures are going to need updating to take a STRING delimiter instead of an INTVAL codepoint, and a major chunk of buffering logic is going to need to be rewritten to work on substrings instead of on individual codepoints. This, in turn, is going to require a heck of a lot more testing.
Last night I started putting in some of the changes necessary to use a substring terminator instead of a single codepoint. Most of what I’ve already done has been modifying function signatures. The real changes need to occur deep within the buffering logic and will require a little bit more time.
I’m looking forward to getting this branch fixed up and merged back to master so I can get to work on my next project. I think 6model is going to be the next thing I dig into, before I find something else that annoys me enough to put in a huge amount of effort to rewrite it. I’ll post more updates about my future projects and plans as I go.