Wednesday, June 15, 2016

Unboxing Packages: path

I want to do something a little different with my blog post this week. When I’ve written about packages in the past, I’ve mostly done a high-level overview of their APIs and how they fit into the Dart ecosystem as a whole. But path is one of the very oldest packages in the ecosystem, and any Dart user who’s written any server-side or command-line apps is probably already familiar with the API.

So instead of a high-level overview, I want to do a deep dive. I want to talk about why we made the design decisions we made when writing path, and how we implemented our design effectively and efficiently. This post will be as much about how the package was constructed as it is about what the final product looks like.

Initial Design

It first became clear that Dart needed a solid solution for path manipulation when Bob Nystrom and I first started working on pub. Paths may seem simple on their face, but there’s a lot of hidden complexity when you need to make them work with all the edge case formats that can crop up across all the operating systems we support.

This became our first design constraint: make something that handles all the edge-cases. This is less obvious than it sounds: a lot of times, good design involves sacrificing some edge-case behavior to make the common case better, or even just simpler to implement. But we knew that path would be widely-used across the ecosystem, and we wanted users to be totally confident in it. If an application had to sanitize its paths before handing them off to us, we weren’t doing our job.

Another important early decision was to make the core API use top-level methods. We often look at other languages’ APIs for inspiration, but they were split on this point. Node uses top-level functions, whereas Java uses instance methods on a Path class. Ruby uses static methods for simple manipulation and a Pathname class for more complex ones. This didn’t provide clear guidance.

We decided to rely on a rule of thumb: only create a class when it’s the canonical representation of its data type¹. There were already a bunch of APIs, both in the core and in external code, that logically took paths and accepted only strings, not our hypothetical Path objects. Certainly everywhere the end user supplied a path, that path would be made available to the program as a string.

So we decided to go with the flow of the existing APIs and continue representing paths as strings. All the path manipulation APIs now take strings and return strings, and the world is simpler for it.

We chose functions for the package based on a combination of our own needs and APIs that were common among other languages’ path manipulation suites. Some of them, like join() and relative(), were pretty obvious. Others like rootPrefix() only became apparent because they filled holes in actual code. And a few, like prettyUri(), only got added well after the package was released.

The Quest for Correctness

We wanted to make our users confident in the correctness of path’s logic, which meant we had to be confident ourselves first. To do this, we wrote tests. Lots and lots of tests. Today, the package has 2.5 times more lines of test code than implementation code, and that’s how we like it.

Writing tests isn’t trivial, though. We had to be careful to include all the cases that came up in practice. This meant that, for every function where they were relevant, we tested combinations of:

  • Directory paths that did or did not end in separators.
  • Paths with zero, one, or two extensions.
  • Paths with multiple separators in a row.
  • Paths containing, or entirely composed of, the directory traversal operators . and ...
  • Absolute and relative paths.
  • Different formats of the current working directory.

We wrote those tests first for the Posix style of path, which are used by OS X and Linux. Then we ported them over to Windows paths², and added even more cases:

  • Windows supports both / and \ as separators, so we tested both and their combinations.
  • Not only does Windows support C:\-style path roots, it supports \\server\share\-style UNC paths as well.
  • You can also start a path with \ in Windows to indicate that it’s relative to the current working directory’s roots.

Determining the proper behavior for all of these involved looking up specifications online, manually testing path behavior on the command line, and a healthy amount of discussion about exactly the right way to handle edge-cases. These discussions led to our next round of design decisions.

Not all paths are valid. Sometimes they run afoul of an operating system’s rules for valid characters, and sometimes they just don’t make sense at all—consider the path /.., for example, or just an empty string. I initially advocated for throwing errors in these cases since in general failing fast is good, but we discussed options and Bob convinced me that path operations should never fail³.

While failing fast can make errors easier to track down, it also means that a defensive programmer has to be aware of the potential for failure anywhere it could occur. Path operations are frequently used in small utility methods that aren’t expected to fail, and most of the time their output is ultimately passed to IO operations which already need error handling.

So instead of throwing an error, the path operations just do the best they can on meaningless input. For most operations, /.. is considered the same as / and the empty path is considered the same as ., but we don’t work too hard to adhere to these definitions if it would get in the way of efficiently processing valid paths.

We also had to figure out what to do with paths that contained irrelevant characters, like foo//bar or foo/./bar, both of which are semantically identical to foo/bar. We ended up deciding to preserve the existing format as much as possible. The user would be able to explicitly call normalize() if they wanted clean paths, but otherwise they’d get something like what they passed in.

This decision made it easier to interoperate with other software that did, for whatever reason, care about the exact format of a path. For example, code using less-robust path manipulation logic might not be able to tell that foo/baz/qux was within foo/bar/../baz, so it’s useful for p.join("foo/bar/../baz", "qux") to return "foo/bar/../baz/qux".

Platforms and Customization

Paths are unusual in that their semantics are deeply platform-specific, but following those semantics mostly doesn’t actually require running the code on the platform in question. We wanted to take advantage of this to allow users to do path manipulations for platforms they weren’t using, but we also wanted to make the easy default use the current platform. This called for more design.

We came up with the idea of a Context object, which would take a style of path (Posix, Windows, or eventually URI) and the only OS-specific piece of data path manipulation used—the current directory path. Context had a set of methods that exactly mirrored the top-level functions in path. In fact, path’s functions just forward to a context!

We used contexts heavily in our own tests. They allowed us to run Windows path tests on Linux, for example, and to test operations like relative() without having to make any assumptions about the current directory.

While adding contexts, we also made a design decision that turned out to be a mistake in retrospect. We’d defined a Style enum for determining which platform style a context should use, which would have been fine if we hadn’t decided to make public the methods Context called on the style.

We had a vague notion that this would allow third-party packages to define custom styles, but no one ever did. Even if they’d wanted to, different path styles are so idiosynctatic that they probably couldn’t have encoded all the custom logic in the methods we provided. So instead we had a bunch of public API surface that was tightly coupled to the internal implementation of path manipulation.

Eventually the implementation needed tweaking in a way that affected the Style methods. We couldn’t change those methods, so instead we deprecated them and added an internal implementation of Style where we could add new methods privately. The lesson here is sometimes maximal extensibility isn’t worth the pain.

Making it Fast

When we first implemented path, we were primarily concerned with correctness and not speed. Our philosophy was (and is) to avoid optimizing packages until we have a clear idea of what parts are slowest and used most heavily. If the package started out correct and well-tested, we could be sure that any performance improvements later on preserved the necessary behavior.

But eventually the time came to make those changes. Users were doing path manipulations in performance-critical loops, and that meant it had to be fast. We set up a benchmark so we could track our progress, and used Observatory to see exactly what parts of our code were taking the most time. Then we called in Anders Johnsen, one of our resident performance experts who’s since moved on from Google, to see what he could do.

It turned out he could do a lot! Not only did our code get faster, we learned quite a bit about strategies for keeping it fast.

The first change was to avoid parsing the whole path. Our original code heavily used an internal ParsedPath class that eagerly parsed the entire path and exposed its components as fields. We still use this class for particularly complex functions, but for anything simple and performance-critical, we now deal with the string directly. This removes a lot of extra unnecessary work and allocations.

The second change was to stop using regular expressions. At the time, Dart’s regular expression engine was very slow. It’s since been dramatically improved, but explicit string operations still tend to involve a lot less overhead. We had been using regexps for very simple operations anyway, so switching away from them ended up being pretty straightforward.

Finally, we had to short-circuit early when possible. A lot of path operations were very complex in the worst case—they required a lot of logic and maybe even iteration over the whole path. But the worst case didn’t actually come up all that often, and it turned out to be pretty easy to detect when it didn’t. For example, Windows paths can have a lot of different roots, which makes finding the root difficult. But if the path starts with /, then it’s guaranteed to be a root-relative path, so the root is "/". These sorts of checks may seem nitty, but they helped a lot.

Coming Up For Air

I tried something new today, and I’m curious what you thought. Leave me feedback in the comments if you’d like to see more deep dives, or if you prefer my previous articles that gave a high-level overview of package APIs. For my part, I like writing both of them, so I’d be happy to continue doing a mix in the future.

Join me again in two weeks when I cover a very generalized package that’s only used in one place so far.

Natalie is still doing the writing here. I’m just hitting publish. – Kevin

1: If it were up to me, the File class’s methods would be top-level functions as well.

2: We also support url paths, but these were added later on.

3: There’s one exception to this rule: path calls Uri.base to figure out the working directory for operations like relative(), which is an IO operation and so could theoretically fail. In pratice, though, it basically never does.