Tuesday, March 1, 2016

Unboxing Packages: string_scanner

Parsing is tricky. There’s the high-level question of what parsing system you use—a parser generator? A handwritten recursive descent? Parser combinators?—but there’s also the issue of how you move through text and track the information you need to track. That’s what the string_scanner package is for.

The Core API

The package exposes one main class, StringScanner, as well as a few subclasses we’ll get to later. This scanner moves through a string, consuming text that matches patterns and giving the user information about it. Here are the most important parts of the API:

class StringScanner {
  Match get lastMatch;
  int position;

  StringScanner(String string);

  bool scan(Pattern pattern);
  bool matches(Pattern pattern);
  void expect(Pattern pattern, {String name});
}

Let’s walk through this, starting with position. This returns the scanners zero-based character position in the source string. The scanner always matches immediately after this position, and it updates the position as it consumes text. You can also set the position if you need to jump back (or forward) while parsing.

The most important method is scan(); after all, it’s right there in the name of the class. It takes a Pattern, which in practice means a String or a RegExp, and tries to consume text that matches that pattern. If the text immediately after current position matches, it returns true; otherwise, it returns false.

The scanner is for parsing, though, not just validating that a string is in the right format. You need to be able to get information out of it. That’s what the lastMatch getter is for. If the last call to scan() (or matches() or expect()) was successful, it contains the data for that match. This, along with the current position, are the only state that the scanner maintains.

The matches() method is just like scan(), except that it doesn’t consume any text. It just returns whether the pattern matches without changing the position at all. This is sometimes used to detect boundaries between different constructs.

Finally, expect() works like scan() but requires that the pattern matches. If it doesn’t, it throws a StringScannerException which indicates that parsing failed. By default this exception includes the pattern that was being used, but the name parameter can be used to provide a more human-readable name for the missing token.

Here’s a simple number parser using these methods:

import 'dart:math' as math;

import 'package:string_scanner/string_scanner.dart';

// It's often a good idea to store regular expressions in variables so they only
// need to be compiled once.
final _digits = new RegExp("[0-9]+");

num parseNum(String text) {
  var scanner = new StringScanner(text);

  // Don't require a whole component so that ".123" works.
  var whole = 0;
  if (scanner.scan(_digits)) {
    whole = int.parse(scanner.lastMatch[0]);
  }

  // If there's no dot, exit immediately.
  if (!scanner.scan(".")) {
    // I'll get to this later on.
    scanner.expectDone();
    return whole;
  }

  // If there is a dot, there must be trailing digits.
  scanner.expect(_digits, name: "0 through 9");
  var decimal = scanner.lastMatch[0];
  var result = whole + int.parse(decimal) * math.pow(10, -decimal.length);

  scanner.expectDone();
  return result;
}

Character-Based Parsing

Some things you want to parse don’t work all that well with strings or regular expressions. You might need to make decisions based on every character individually. To make this possible, the scanner also supports going character-by-character. Like Dart’s strings themselves, it works in terms of UTF-16 code units.1

The readChar() method consumes and returns a single character. It moves the current position forward by one. It returns the characters as integers, which is generally more efficient than single-character strings. I recommend using the constants defined in the charcode package when dealing with these integers.

The peekChar() function is also useful when doing character-based parsing. peekChar() is to readChar() as matches() is to scan(). It returns the same information—the next character—but it doesn’t move the position at all. In addition, peekChar() takes an optional offset parameter that allows you to peek at a character after (or even before) the current position.

These two functions do different things if the scanner is at the end of the text. readChar() will throw an error, complaining that it expected more input. peekChar(), on the other hand, will just return null.

Here’s a character-based integer parser:

import 'package:charcode/ascii.dart';

import 'package:string_scanner/string_scanner.dart';

num parseInt(String text) {
  var scanner = new StringScanner(text);
  var value = 0;

  // The isDone getter indicates whether the scanner is at the end of the text.
  while (!scanner.isDone) {
    var char = scanner.readChar();

    // The $0 and $9 constants are defined in the charcode package.
    if (char < $0 || char > $9) {
      // I'll talk about error() in the next section.
      scanner.error("Invalid character.", position: scanner.position - 1);
    }

    value *= 10;
    value += char - $0;
  }

  return value;
}

Emitting Errors

I’ve already talked about expect(), which is the most common way to emit errors when using a scanner. But it’s not the only way. What if your character-based parser runs into an error? What if you only discover that a token is invalid after you’ve already parsed it? What if the source has extra junk at the end after you’ve finished parsing?

The most flexible error-emitting method is error(). It can emit an error associated with any chunk of the text at all—all you need to do is pass in position and length parameters. Most of the time you just need the match parameter, which takes a Match object that was returned by lastMatch earlier on. Or you can choose not to pass in any location information at all, and it will default to associating the error with lastMatch (or the current position if lastMatch is null).

If you’re wondering what it means for an error to be “associated with” a match, I encourage you to read my article about source_span. StringScannerException inherits from SourceSpanFormatException, so it uses a source span to indicate where the error occurred in the text.

Source spans can have sourceUrls associated with them, which help the user know where the errors were caused. If you want your parse errors to have file information, you can pass in an optional sourceUrl parameter (which can be a String or a Uri) to new StringScanner().

There’s one other error-emitting function, called expectDone(). It’s pretty single-purpose, but it’s useful all the same. If the scanner isn’t at the end of the text, it emits an error; otherwise, it does nothing. This is really useful when you’re parsing a single value and you want to ensure there’s no extra characters in the string after it.

Tracking Line Information

Some scanners, like the ones in the examples above, are just used for parsing small chunks of text without any larger context. But others involve parsing an entire file, in which case line and column numbers are often very important. The default StringScanner doesn’t track this information, but it has a LineScanner subclass that does.

LineScanner defines line and column getters that return the (zero-based) line and column at the current position. These fields are kept up-to-date as the scanner consumes additional text, so that accessing them is always fast.

There’s also a special way of changing position with a LineScanner. You can use the position setter like you would with a plain StringScanner, but it’s less efficient since it needs to calculate the new line and column. If you know in advance that you might need to jump back to the current position, you should use the state property instead.

This property returns a LineScannerState object, which stores the line, column, and position information for the scanner. Once you have this object, you can pass it back to state= to restore that exact state in the line scanner much more efficiently than setting the position.

Scanning With Source Spans

There’s one more subclass I want to talk about, and it’s the one that provides the most value. It’s SpanScanner, which inherits from LineScanner and provides access to source spans (specifically FileSpans) for the text being parsed.

The most commonly-used addition in SpanScanner is the lastSpan getter. It returns a span covering the same chunk of text as lastMatch, and it’s updated or set to null under the same circumstances.

The emptySpan getter returns a span that covers no text. An empty span is useful for pointing to a specific locations in the source. This one always refers to the scanner’s current position.

Finally, there’s spanFrom, the most powerful SpanScanner addition. It returns a span between two arbitrary points in the text, represented as LineScannerStates (which, if you recall, are returned by the state getter). The second state defaults to the current position, which makes it easy to get a span that covers a given chunk of scanning code.

Lazy or Eager?

For the advanced user, there’s a choice to be made when using a SpanScanner. The choice involves the implementation of the line and column getters: they can either be derived lazily from the current position, or computed eagerly as the scanner consumes text.

Internally, a SpanScanner uses a SourceFile to generate its spans. By default, it also uses this file to get the line and column information for the current position, which means that—as I discussed in my last article—accessing these is not as efficient as possible. The trade-off is that scanning is faster because it doesn’t need to keep this information up-to-date.

If that’s not the trade-off you want to make for your code, though, you can use new SpanScanner.eager() instead of the default constructor. This returns a scanner that works like LineScanner: it updates the line and column as it goes, so accessing them is very fast.

Which implementation is faster very much depends on the specifics of your use-case. Most of the time it won’t matter at all, but if parsing is a bottleneck for you, you should use Dart’s excellent profiler to see whether eager parsing makes a difference.

Parse the World

I love writing parsers. I’ve written parsers for existing formats, and I’ve written parsers for entirely new ones. The ability to write robust parsers with good APIs is an important tool in a programmer’s belt, and the string_scanner package makes it easy to get started. It makes it easy to write small parsers for small formats like the examples I’ve included here, and it makes it easy to expand those to large-scale parsers of complex formats like YAML. Next time you need to parse some text, you’ll know what to use.

Join me next week when I talk about how to bring structure to your asynchronous code.


  1. Note that some Unicode characters appear as two code points, called surrogate pairs. This is an issue inherited from Dart’s string implementation, which in turn inherited it from JavaScript.