Speeding up a custom-built Nix package installer

last thread: Actually building Nix packages on Glitch

Project URL: Glitch :・゚✧

Prologue

s1

For a number of nights now, work has been proceeding in order to bring efficiency to the crudely conceived idea of a Nix package installer that would not only supply binary cached archives to the Nix store for use in newer projects but would also be capable of automatically running without any dependencies not already in the project container.

s2

Such a package installer is NarFlinger (provisional name).

s3

Now basically the only new principle involved is that instead of archives being downloaded by the execution of an already-installed Nix package, it is retrieved by the concurrent interaction of Node.js on stream pipelines.

s4

The original program had an import of the .nar parser from my nix-cache-view project adapted to write to disk in such a way that functions for generating the archived files’ contents were in a direct pipeline with a Writable stream.

s5

The latter consisted simply of a number of require('fs').createWriteStreams so fitted to the segment of the decompressing .nar that whole-package-buffering was effectively prevented.

s6

The main Readable stream was of the normal HTTP IncomingMessage type placed in a pipeline with the stdin of a child process running xz -d,

s7

every exit event being connected by a .once() to a Promise’s reject callback awaited at the tail end of the async generators.

s8

NarFlinger has now reached a high level of development and it’s being successfully used in the installation of Emacs 28.2.

s9

Moreover, whenever an additional package installation is required, it can also be employed in conjunction with a previously installed Nix store directory to reduce shared dependency re-downloading.


No actual content about optimization today. Or rather, this thread isn’t really going to be about optimization; it’s more of a sad story about Node.js.

In the next post, I’ll be describing the components of the full installer and how I condensed it into a simpler script to help focus on why it was running so slowly.

7 Likes

Wow, a @wh0 face reveal after all these years!

7 Likes

Components of the task

Alright, here’s an actual description of the components that make up the whole installation process, where it’s not just me messing around with the turbo encabulator script.

1. Download and process .narinfo files. These are simple text-based files containing, critically, the URL (relative, usually) of where to get the actual .nar file and the dependencies. You can download the first .narinfo file you need by constructing a URL given a binary cache server and the “hash” part of a Nix store path that the user wants to install (read more). And then you recursively get more .narinfo files for those dependencies. (There’s a rough code sketch of this step after the list.)

2. Download the .nar files. You get the URLs for those from the .narinfo files. These can be large. And in the binary cache that I’m using to store my packages built for Glitch, (almost) everything is hosted on Glitch assets (read more).

3. Decompress the .nar files. Slight correction on the previous point: you don’t really download the .nar files directly from the binary cache server. The binary cache server keeps compressed .nar files, and the .narinfo tells you how they’re compressed. Currently the default is xz, and all the packages I’ve built are xz-compressed. You get those .nar.xz files and decompress them.

4. Read through the .nar and create files in the Nix store. At that point, there’s a way to parse the .nar file and get the various directories and files (read more). As we read through, we apply these actual changes, “to disk.”
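To make step 1 a bit more concrete, here’s a rough sketch of fetching .narinfo files and walking their References. It’s in Python (where this project eventually ends up), the cache URL is a placeholder, and it uses urllib purely for brevity; it’s not the installer’s actual code.

import urllib.request

CACHE_URL = 'https://example-cache.glitch.me'  # placeholder binary cache URL

def fetch_narinfo(store_hash):
    # A .narinfo is plain text: one "Key: value" pair per line, including
    # URL (where to get the .nar), Compression, and References (dependencies).
    with urllib.request.urlopen(CACHE_URL + '/' + store_hash + '.narinfo') as res:
        text = res.read().decode('utf-8')
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(': ')
        info[key] = value
    return info

def closure(root_hash):
    # Recursively collect .narinfo data for the requested package and
    # everything it references.
    seen = {}
    pending = [root_hash]
    while pending:
        h = pending.pop()
        if h in seen:
            continue
        info = seen[h] = fetch_narinfo(h)
        for ref in info.get('References', '').split():
            # References entries look like "<hash>-<name>"; we want the hash part.
            pending.append(ref.split('-', 1)[0])
    return seen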

Aside: error handling around streams in Node.js

It’s a pain. Remember that remark earlier about me wanting to do this without needing any dependencies not already included in the project container? That means nothing nice from npm. Just vanilla Node.js. And that means no node-fetch. Get ready for some built-in https module action :grimacing:.

I probably spent a whole week trying to figure out a reasonable way to make sure that something using the ‘response’ stream would be able to know if something went wrong with the HTTP(S) request.

It ended up something like this:

const https = require('https');
// Swallow eventual rejections so an error that fires before we await `done`
// doesn't become an unhandled rejection.
function deferredReject(p) {
  p.catch((e) => { });
  return p;
}
async function* getOk(url) {
  const req = https.get(url);
  // Resolve with the response, or reject if the request itself fails first.
  const res = await new Promise((resolve, reject) => {
    req.on('error', reject);
    req.on('response', resolve);
  });
  if (!(res.statusCode >= 200 && res.statusCode < 300)) {
    req.abort();
    throw new Error(`get ${url} status ${res.statusCode} not ok`);
  }
  // Keep watching for request errors while the body is being consumed.
  const done = deferredReject(new Promise((resolve, reject) => {
    req.on('error', reject);
    req.on('close', resolve);
  }));
  yield* res;
  await done;
}

Anyone have a better way to do it?

I think Node.js 16 has better handling, where request errors are emitted on the response too, but I feel like Node.js 10 has the advantage of a warm startup.

The slowness of it all

So eventually I got it working, and I tried it out on installing Emacs 28.2. It was slow. It took… let me pull up my notes here… about 2m32s to install that. And at that point I didn’t know if it was, like, slow considering what all it’s doing or anything. It was just slow compared to how soon I wanted to open Emacs.


In the next post, I’ll break down how long the different components ought to take, if you were to use normal tools. That way, we can start to judge if it’s slow because it just has a lot of work to do, or it’s slow because my Node.js code sucks, or it’s slow because Node.js is a bad framework, or whatever other reason.


aaah!

2 Likes

Task breakdown, quantitative

What actually takes time when installing software? Downloading the packages over the Internet? Writing files to disk? On Glitch, it’s neither of these (for this particular setup). The thing that takes the longest is decompressing.

Summary


Reference times for the various parts of the overall task of installing a package and its dependencies. Measured using existing tools that do equivalent or similar things, except for the .narinfo row, which uses my own custom implementation.

Downloading and processing .narinfo files is fast. Under a second (0.887s +/- 0.069s) (experiments have very small sample size). Let’s ignore it.

Downloading the (compressed) .nar files goes at around 5.6 MB/s (193MB in ~34s). There seems to be no big difference when downloading multiple in parallel (tested parallelism of 1, 2, 4, and 98). The files are served from Glitch’s asset CDN (cdn.glitch.me due to packages potentially being larger than 20 MiB). Tested with this command, which uses well-known tools that we can expect not to be wildly inefficient:

time wget $(cat /tmp/urls.txt)

Decompressing the .nar files is the big one. This takes 1m18.186s +/- 1.171s wall time, 18.600s +/- 0.196s user CPU time (unboosted project in “full stack” mode, i.e. with a one-core, 25% cap on CPU use, which I anticipate as a typical scenario where this installer would run). Tested this by running xz -d on everything (yes, concatenating them is valid and supported in the version of xz we have):

time cat /tmp/files/*.nar.xz | xz -d >/dev/null

Extra detail for curious readers: I’m reading the compressed .nar files from disk in these miniature experiments, and reading from disk is relatively fast, taking 0.138s +/- 0.047s wall time, 0.000s +/- 0.000s user CPU time. So it shouldn’t affect the CPU bottleneck in the decompression; let’s ignore it.

Creating files in the Nix store is also relatively fast, taking 7.700s +/- 3.588s wall time and 0.142s +/- 0.031s user CPU time. Tested by doing a recursive copy:

time cp -r /tmp/dst /tmp/dst2

It’s not a trivial number of files or amount of data:

$ find /tmp/dst | wc -l
22369
$ du -sh /tmp/dst
907M    /tmp/dst

Extra argument for readers interested in operating systems: it’s not important whether the new files in the Nix store get written back to disk. The goal of installing these packages is only for them to be accessible to the project. Everything’s in /tmp anyway, and we expect it to be lost if the project container goes down.

Alternative file-creation measurement: I also tested extracting the same contents from a .tar file, to verify that parsing through a reasonable archive format shouldn’t add a significant amount of time. This took 6.942s +/- 1.258s wall time and 0.197s +/- 0.038s user CPU time. Indeed, the additional user CPU time is small. Tested with tar:

mkdir /tmp/dst2 && time tar -xf /tmp/native.tar -C /tmp/dst2

Combined decompressing+creating: I tested extracting an equivalent .tar.xz file to see how decompressing and writing to the filesystem fit together in a well-known compression+archival implementation. This took 1m29.527s +/- 0.626s wall time, 18.988s +/- 0.079s user CPU time. That’s about how much we’d get by adding together the decompress time and the read+create time, and overall it’s not much longer than the decompress time. Tested with tar -xJ (actually with -J omitted because it’s evident from the archive name):

mkdir /tmp/dst2 && time tar -xf /tmp/native.tar.xz -C /tmp/dst2

In the next post, I’ll show how long my custom installer takes. (A long time, just so you know.)

2 Likes

How long my custom installer takes

~2m32s wall time, ~28s user CPU time.

What baseline should we compare that to? We’d add up some of the timings from the measurements in the last post. There’s the ~1s to gather .narinfo files. Then the next three components for downloading/decompressing/writing the .nar files run in parallel, so we’ll take the slowest one, ~1m18s from decompressing. That comes to a total of ~1m19s.

Thus the installer I had written took about 1.9 times as long as the baseline (~152s vs. ~79s) :sloth:.


A special debugging tool visualizes the installer in action.

Reduced script for timing and optimization

Oh yeah, I had mentioned earlier that I made a simplified script and worked on further optimizing that. I started by listing the components I shouldn’t focus on:

  • The part that downloads and processes .narinfo files is already really fast. It takes like a second.
  • Downloading the compressed .nar files is done in parallel with everything else, and it’s not the bottleneck. It uses the network, which no other concurrently running component contends for. It would also waste CDN bandwidth to download these files repeatedly for experiments.

So I made a reduced copy of the installer script that would skip the whole .narinfo step and read the compressed .nar files from disk instead of over the Internet. And with that, it would be appropriate to compare against the “decompress+create” timings we had.


Coming up, I’ll post about how long the reduced script took to run, the steps I took to analyze it, and some optimizations I made. Not sure if it’ll be soon or all in one post though. Because I need to make another one of those charts with error bars.

4 Likes

Diversion: redoing timing chart in free software

In seaborn.

image

(Disclosure: error bars in this new graph show the 95% confidence interval instead of the standard deviation.)

2 Likes

Optimization summary

image

Measurements on simplified installer scripts and a comparable baseline. Implementations on Node.js additionally report “Self CPU” time, which excludes CPU time used by a subprocess that does the xz decompression.

Basically, what it took was to rewrite the program for a different interpreter.

Initial version

I started with a reduced version of the installer script, which I had written for Node.js. I targeted a relatively old version of Node.js, version 10, so that it would be able to run with the same interpreter that Glitch’s built-in programs use.

With the .narinfo processing and downloading removed, it was down to, on average, 2m12s wall time. And that was still taking way longer than the “decompress+create” baseline from above (here labeled tar -xJ). That’s bad, but in a way, that’s good, because it means that the simplified script successfully reproduces the slowdown.

I added some functionality to read the “utime” from /proc/self/stat (see proc(5)), which is similar to the “user CPU” time I’ve been getting from the time utility (the user row). The utime is for only a single process, whereas the user CPU measurement includes subprocesses. In this case, most of the CPU usage comes from an xz subprocess that does the decompression.
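As a rough illustration of what reading that looks like (the real script does it from Node.js; this is just the same idea sketched in Python):

import os

def self_utime_seconds():
    # /proc/self/stat is a single line; field 14 (1-based) is utime, in clock
    # ticks. Field 2 (the command name) is wrapped in parentheses and can
    # contain spaces, so split on the closing ')' before counting fields.
    with open('/proc/self/stat') as f:
        rest = f.read().rsplit(')', 1)[1].split()
    utime_ticks = int(rest[11])  # field 14 overall = index 11 after the split
    return utime_ticks / os.sysconf('SC_CLK_TCK')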

What remains in the utime stat should be purely overhead. The Node.js process running my install script doesn’t have to do any decompression on its own. It only moves data from the filesystem (the pre-downloaded compressed .nar files) to the xz process and from the xz process back to the filesystem in creating new files. And we know from previous experiments that reasonable baselines for creating files take only up to tenths of seconds of CPU time.

Profiling

Node.js supports running a program with profiling, which creates a report about what takes so darned long. In that report, there was only one entry that accounted for anything more than 3% of the overall samples (referred to as “ticks,” supposedly 1 ms each):

 [C++]:
   ticks  total  nonlib   name
  16186   66.9%   68.9%  epoll_pwait

I had a lot of questions. Actually, was this even something you could consider “CPU” time? I found some postings on the Internet explaining that epoll_pwait is a blocking syscall. However, (1) I had written the JS code to use async stuff, and (2) as we saw from earlier baseline measurements, reading the compressed .nar files from disk was fast, so we shouldn’t be blocking for long either. Maybe it’s blocking while trying to read from the xz pipe, I’m not sure. And how much user CPU time did all of these epoll_pwait calls contribute anyway?

I didn’t get answers to all these questions.


In the next post, I’ll make some more progress in telling the story that led to the other rows in the chart.

2 Likes

Trying a different interpreter

So I didn’t really know what was going on in there with epoll_pwait. It was just vaguely something about how Node.js, possibly version 10 in particular, was running the event loop.

Just in case it was something Node.js improved in a later version, I tried running it in the latest version available on Glitch, version 16.

It took about the same amount of time. And still, it was spent largely in epoll_pwait.

 [C++]:
   ticks  total  nonlib   name
  16837   65.1%   86.4%  epoll_pwait

An even different-er interpreter

After that, I tried rewriting the simplified installer in Python. No event loop, no streams, no complex waiting. Good old-fashioned blocking I/O.
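The core of it has roughly this shape (a sketch of the general idea, not the actual script; the chunk size is arbitrary):

import lzma

CHUNK = 64 * 1024

def decompressed_chunks(path):
    # Plain blocking loop: read compressed bytes from disk, feed them to the
    # decompressor, yield decompressed bytes for the .nar parser to consume.
    d = lzma.LZMADecompressor()
    with open(path, 'rb') as f:
        while not d.eof:
            chunk = f.read1(CHUNK)
            if not chunk and d.needs_input:
                raise EOFError('truncated .nar.xz')
            yield d.decompress(chunk)

The .nar parser then pulls from a generator like that and creates directories, files, and symlinks as it goes.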

It was indeed faster. It was about as fast as the baseline.

Python had a profiler too. And it showed most of the time being spent in the necessary places.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   823311   81.194    0.000   81.194    0.000 {method 'decompress' of '_lzma.LZMADecompressor' objects}
    67360    4.431    0.000    4.431    0.000 {built-in method posix.write}
    49276    3.696    0.000    3.696    0.000 {method 'read1' of '_io.BufferedReader' objects}
    18288    1.884    0.000    1.884    0.000 {built-in method posix.open}
   823311    1.763    0.000   86.654    0.000 timing3.py:113(read1)
   516783    1.448    0.000    4.911    0.000 timing3.py:8(reader_read_exact)
   823311    1.184    0.000   87.838    0.000 timing3.py:5(reader_read_limit)
   250618    1.175    0.000    9.556    0.000 timing3.py:35(nar_read_bytes)
   239084    0.688    0.000    2.247    0.000 timing3.py:20(reader_skip_exact)
   266165    0.655    0.000    2.901    0.000 timing3.py:30(nar_skip_padding)

timing3.py being my own code, it looked like any inefficiencies with how I wrote things were small compared to the unavoidable decompression and I/O.
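For context, the low-level primitives showing up in that profile deal with the .nar framing: each field is an 8-byte little-endian length, then the bytes themselves, then zero padding up to the next 8-byte boundary. A minimal version of that primitive (again a sketch, not the real timing3.py) looks like this, where f is any binary file-like object:

import struct

def nar_read_bytes(f):
    # 8-byte little-endian length, the data, then zero padding so the total
    # is a multiple of 8.
    (length,) = struct.unpack('<Q', f.read(8))
    data = f.read(length)
    pad = -length % 8
    if pad:
        f.read(pad)  # discard the padding
    return data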

Here we are for now

Now I have this partial program written in Python. It can’t fully be used as an installer, because it’s missing all the .narinfo processing and .nar downloading. But it’s pretty fast, not much slower than native tar.

But I haven’t decided how to continue this project. I could continue building out the rest of the functionality in Python. It would be fast. Or maybe I could try rewriting all the Node.js code to be synchronous. It would probably be easier to use on Glitch.

:sleeping_bed:

2 Likes

Epilogue

I finished up the Python version, with the .narinfo processing and the downloading.

Weird thing encountered on the way: the built-in urllib module creates a separate connection for every HTTP(S) request. And when you’re doing that three-way handshake plus TLS key exchange on 0.25 CPU cores for about 200 connections, that actually ends up taking several seconds. So I switched over to the lower-level http.client module.
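In http.client terms, the switch looks roughly like this (the host and paths here are placeholders): one TCP+TLS handshake, and then every request reuses the same connection.

import http.client

conn = http.client.HTTPSConnection('cdn.glitch.me')  # placeholder host
for path in ['/first.narinfo', '/second.narinfo']:    # placeholder paths
    conn.request('GET', path)
    res = conn.getresponse()
    body = res.read()  # must read to the end before the next getresponse()
conn.close()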

Here’s the benchmarks of that fully featured Python implementation against the Node.js implementation from the beginning. I’ve also added the timing for Nix itself (version 2.11.1, the static build from Nixpkgs).

image

Measurements on relatively complete installers. All programs at least resolve dependencies, download packages, and unpack packages.

Note that Nix does a lot of additional good things, such as verifying the integrity and authenticity of the packages, which my installer does not. That’s why Nix, a native program, takes more CPU time and wall time than this interpreted Python version.

More excuses for not releasing this

Using the http.client module is a little too low-level, and I’m doing something wrong with the connections. The connection (now a single one that gets reused) sometimes gets into a bad state and crashes the program. It looks like some unused bytes from an earlier response (an xz-compressed .nar file) are left in the TCP connection, even though the decompressed .nar data was all done. I have to investigate that.

Excuses follow-up, if it’s legal to bump a thread this many times

Turns out yeah, my code would have left unread bytes from an HTTP response in certain conditions.

My package installer decompresses the .nar file using a streaming API: it reads partial decompressed data while providing the compressed input. And when it has read the complete decompressed data (the .nar format makes it evident when you’ve read everything), I figured it would also have finished reading the compressed input. Not so, though.

You can get all the decompressed data even without the whole compressed file. That is, if you’re missing a few bytes off the end, you might still have enough to recover everything. I ran an experiment to confirm this.

xz-chop

Graph showing how much data is missing in the decompressed output, from how much data you omit from the compressed input.

I tested with a particular compressed file that was part of the installation I used in the optimization experiments. In this file, it turns out you can leave off up to 35 bytes of input without losing a single byte of output.
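That kind of experiment is easy to reproduce with a sketch like this (file name is a placeholder): chop n bytes off the end of the compressed input and count how many decompressed bytes still come out.

import lzma

data = open('example.nar.xz', 'rb').read()  # placeholder file name
full = len(lzma.LZMADecompressor().decompress(data))
for n in range(64):
    # Slow but simple: decompress the whole truncated input each time.
    got = len(lzma.LZMADecompressor().decompress(data[:len(data) - n]))
    print(n, 'bytes chopped ->', full - got, 'bytes of output missing')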

Thus if you happen to read a chunk of bytes ending within those last 35 bytes, you’ll be able to decompress the whole compressed file. And the HTTPS connection that delivers the body is a byte stream that gives no guarantees about how you might read chunks of data, so that can happen. It sometimes does, in fact. And when it does happen, my installer would leave those remaining bytes unread, which is a violation of the http.client module’s rules.

Leaving those bytes unread causes the next call to getresponse() to see that leftover xz data when it goes looking for an HTTP response message, which it then (correctly) flags as invalid HTTP syntax, crashing the installer.

So I went and added some more logic to make sure we get all those final bytes :face_with_spiral_eyes: . And it shouldn’t crash in this way anymore.
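The fix amounts to draining the response before issuing the next request, something along these lines (res being the http.client response from the earlier sketch):

# After the .nar parser reports that it has everything, read whatever trailing
# bytes of the response body are still unread, so the connection is in a clean
# state for the next getresponse().
while res.read(8192):
    pass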


There, no more excuses not to release this. No technical reasons, at least. Now I can feel free not to release anything only because I’m a slacker.
