Debugging a dynamic linking bug in a Nix project

2020-07-17

Trying out the Nix development experience

The other day, while building a scientific project to which I'm a contributor, I ran into a nasty version conflict between two system libraries. In a fit of pique, I decided to learn enough about Nix to be able to set up a reproducible, tightly controlled local build. It's done now, and overall I'm very happy with the tooling and setup. I'm using direnv to tightly integrate my normal shell with Nix's nix-shell feature, and for the most part everything feels seamless. It is extremely refreshing to see cmake report that it has found a plethora of binaries and libraries, content-hashed and installed in neat little rows under /nix/store.

I'm using Nix to manage my development environment, but not to build the project itself. Nix ensures that the project dependencies are installed and discoverable by the compiler and linker. The build itself is done with CMake, which only has to find the Nix-installed libraries. Nix achieves this by wrapping the C compiler with its own shell script and injecting the paths to libraries and binaries via environment variables. There's very little to do to make cmake just work, beyond declaring that the packages you want are buildInputs. The first version of my shell.nix file looked like this:

# file shell.nix
{ pkgs ? import <nixpkgs> {} }:

pkgs.mkShell {
  buildInputs = with pkgs; [
    cmake
    (callPackage nix/petsc.nix {})
    metis
    hdf5
    openmpi
    (python38.withPackages (packages: [ packages.numpy ]))
  ];
}

Using this setup, I had very little trouble getting the project to build. I had to override the default PETSc derivation to compile with METIS and OpenMPI support, which was not too hard:

# file nix/petsc.nix
{ petsc , blas , gfortran , lapack , python , metis , openmpi }:

petsc.overrideAttrs (oldAttrs: rec {
    nativeBuildInputs = [ blas gfortran gfortran.cc.lib lapack python openmpi metis ];
    preConfigure = ''
        export FC="${gfortran}/bin/gfortran" F77="${gfortran}/bin/gfortran"
        patchShebangs .
        configureFlagsArray=(
        "''${configureFlagsArray[@]}"
        "--with-mpi-dir=${openmpi}"
        "--with-metis=${metis}"
        "--with-blas-lib=[${blas}/lib/libblas.so,${gfortran.cc.lib}/lib/libgfortran.a]"
        "--with-lapack-lib=[${lapack}/lib/liblapack.so,${gfortran.cc.lib}/lib/libgfortran.a]"
        )
    '';
})

This Nix file returns a function, which is invoked in shell.nix using the callPackage function. petsc.overrideAttrs is a neat way to override the attributes of a derivation created with stdenv.mkDerivation. Building PETSc with MPI and METIS support is as simple as passing a different set of arguments to the configure script.
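For readers who don't know Nix, here is a rough Python analogy for what callPackage does with a file like nix/petsc.nix: it inspects the function's parameter names and fills each one from the package set, unless an override is supplied. This is only a sketch of the idea, not how Nix implements it; call_package, the pkgs dictionary, and petsc_with_mpi are hypothetical stand-ins.

```python
import inspect


def call_package(fn, package_set, overrides=None):
    """Call fn, filling each named parameter from package_set unless overridden."""
    overrides = overrides or {}
    kwargs = {}
    for name in inspect.signature(fn).parameters:
        if name in overrides:
            kwargs[name] = overrides[name]
        else:
            kwargs[name] = package_set[name]
    return fn(**kwargs)


# Hypothetical stand-in for the nixpkgs package set; the values are just labels.
pkgs = {
    "petsc": "petsc-3.13.2",
    "metis": "metis",
    "openmpi": "openmpi",
    "hdf5": "hdf5",
}


def petsc_with_mpi(petsc, metis, openmpi):
    # Plays the role of nix/petsc.nix: a function of its dependencies.
    return f"{petsc} configured with {metis} and {openmpi}"


result = call_package(petsc_with_mpi, pkgs)
print(result)
```

The empty attribute set passed in shell.nix, (callPackage nix/petsc.nix {}), corresponds to leaving overrides empty here.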

Figuring out how to do all of this was fun. I mostly referred to the Nix "Pills", which are a great progression through the Nix tool and language.

With these Nix files, I was able to execute cmake .. && make successfully. Getting the project to run was another story. The final binary failed immediately with a dynamic loading error:

➜ bin/warpxm
dyld: Library not loaded: /private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib
  Referenced from: /Users/jack/src/warpxm/build/bin/warpxm
  Reason: image not found

The binary was trying to load a dynamic lib from one of the temporary directories that Nix created in the process of building PETSc. Of course this failed: by the time I invoked bin/warpxm, that directory had been cleaned up. Instead of a file under /private/tmp, the binary should have linked to the result of the petsc derivation in the Nix store, under /nix/store. At some point, it seemed, an environment variable was incorrectly set to this intermediate directory. To figure out where, I would have to learn a lot more about linking on OS X than I ever expected.

Whither the linker?

First I checked the compiler and linker flags that are inserted by Nix's compiler wrapper. These come in via NIX_CFLAGS_COMPILE and NIX_LDFLAGS. When you're working with nix-shell and direnv, all of the environment variables from your derivations are injected into your shell. It's a simple matter of echoing them out:

➜ echo $NIX_CFLAGS_COMPILE
... -isystem /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/include ...
➜ echo $NIX_LDFLAGS
... -L/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib ...

These look fine! Invoking cmake and make in this shell ought to pull in the correct library.
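Eyeballing a long flag string is error-prone, so a small script can do the checking. The sketch below pulls every -L directory out of NIX_LDFLAGS and reports any that are not under /nix/store. The helper names library_dirs and outside_store are made up for this example, and it assumes the flags are plain shell-style tokens.

```python
import os
import shlex


def library_dirs(ldflags):
    """Return the directories named by -L flags, in the order they appear."""
    dirs = []
    tokens = shlex.split(ldflags)
    for i, token in enumerate(tokens):
        if token == "-L" and i + 1 < len(tokens):
            # Form "-L /some/dir": the directory is the next token.
            dirs.append(tokens[i + 1])
        elif token.startswith("-L") and len(token) > 2:
            # Form "-L/some/dir": the directory is attached to the flag.
            dirs.append(token[2:])
    return dirs


def outside_store(dirs, store="/nix/store/"):
    """Return the directories that do not live in the Nix store."""
    return [d for d in dirs if not d.startswith(store)]


flags = os.environ.get("NIX_LDFLAGS", "")
for directory in outside_store(library_dirs(flags)):
    print("suspicious library dir:", directory)
```

In my case this would have printed nothing, which matches what the echo output suggested: the flags from Nix were not the problem.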

Then I remembered that this project uses pkg-config to find and pull together the linked libraries. Frankly, I don't understand pkg-config very well, but I do know that in this project it is invoked from inside of cmake. It searches for libraries according to its own rules, and it runs after Nix has done its job setting everything up. Therefore, it circumvents the compiler and linker flags that we just checked.

I happened to have pkg-config installed from before setting up this Nix environment. Therefore, cmake was able to invoke the system pkg-config from my user PATH. Perhaps the system version of pkg-config was somehow finding the wrong library? Indeed, echo $PKG_CONFIG_PATH confirmed that it was searching a directory under my $HOME. I thought it possible that some wires got crossed while I was adding dependencies to my Nix derivation one at a time: configuring pkg-config appropriately might help.
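To see which directory on PKG_CONFIG_PATH would actually supply a module, it is enough to look for the first matching .pc file. Here is a minimal sketch, assuming the module file is named PETSc.pc (the name this project passes to pkg_check_modules). The function find_pc_file is hypothetical, and it ignores the built-in default directories that pkg-config also searches after PKG_CONFIG_PATH.

```python
import os


def find_pc_file(module, search_path):
    """Return the first <module>.pc found along a PKG_CONFIG_PATH-style string, or None."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, module + ".pc")
        if os.path.isfile(candidate):
            return candidate
    return None


print(find_pc_file("PETSc", os.environ.get("PKG_CONFIG_PATH", "")))
```

If the printed path is not under /nix/store, pkg-config is picking up a library that Nix did not provide.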

I referred once again to the Nix wiki page on C projects, which also has a section on using pkg-config. It seems that including the pkg-config derivation as a nativeBuildInput will let packages like petsc append their output paths to the PKG_CONFIG_PATH environment variable. I did so:

pkgs.mkShell {
  buildInputs = with pkgs; [
    ...
  ];
  nativeBuildInputs = with pkgs; [
    pkg-config
  ];
}

but it didn't fix the problem. I would have to go deeper and track down where the bad library was being pulled in.

Digging into the cmake documentation and the project's .cmake files led me to insert a trio of print statements:

find_package(PkgConfig REQUIRED)
pkg_check_modules(PETSC PETSc REQUIRED)

link_directories(${PETSC_LIBRARY_DIRS})
+ message("petsc libraries: ${PETSC_LIBRARIES}")
+ message("petsc library dirs: ${PETSC_LIBRARY_DIRS}")
+ message("petsc link libraries: ${PETSC_LINK_LIBRARIES}")
list(APPEND WARPXM_LINK_TARGETS ${PETSC_LIBRARIES})

These printed out three lines in my cmake output:

petsc libraries: petsc
petsc library dirs: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib
petsc link libraries: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib

The last two look good. But the first, just the library name petsc, was a little too implicit for comfort. It was precisely this variable that was being appended to the link targets list. At link time, it would be up to the linker to find the library petsc, and I wasn't sure where it would look. Safer to use the absolute path to the .dylib, like so:

- list(APPEND WARPXM_LINK_TARGETS ${PETSC_LIBRARIES})
+ list(APPEND WARPXM_LINK_TARGETS ${PETSC_LINK_LIBRARIES})

Changing the link target to the absolute path eased my mind only for the duration of the next cmake .. && make cycle. Surely there was no way the linker could screw up now. No arcane library search involved, just an absolute path, which couldn't possibly be misinterpreted...

➜ bin/warpxm
dyld: Library not loaded: /private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib
  Referenced from: /Users/jack/src/warpxm/build/bin/warpxm
  Reason: image not found

Damn it!

install_name and other depravities

At this point I was absolutely flummoxed. With every fix I attempted, I grepped vainly for the offending /private/tmp path in my build directory, and came up empty-handed. I tracked down the final, irrevocable link options passed to the compiler, tucked away in a link.txt file in the build tree. They showed incontrovertibly that my binary was being linked to the correct library:

➜ cat build/src/CMakeFiles/warpxm.dir/link.txt
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++     -O3 -DNDEBUG -isysroot ... -L/nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib ...

I had proved nearly to my satisfaction that CMake was doing the right thing with this library, and I was completely out of ideas. Finally, a very lucky Google search led me to the section of the Nix manual describing issues specific to the Darwin (macOS) platform. It states:

On Darwin, libraries are linked using absolute paths, libraries are resolved by their install_name at link time. Sometimes packages won't set this correctly causing the library lookups to fail at runtime. This can be fixed by adding extra linker flags or by running install_name_tool -id during the fixupPhase.

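This is a very matter-of-fact way of stating something that, when I understood it, flabbergasted me. To the best of my understanding, here's what happens on macOS: every dynamic library carries an install_name, a path that is recorded inside the library file itself when it is built. When a binary is linked against the library, the linker copies that install_name into the binary, regardless of where the library file actually sits at link time. At run time, dyld tries to load the library from the recorded path.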

I've certainly gotten some aspect of this wrong, so I would definitely appreciate hearing from someone who understands it better than me!
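The behavior described in the Nix manual can be captured in a toy model. The sketch below is not how ld or dyld are implemented (it ignores @rpath, fallback search paths, and everything else); the classes Dylib and Binary and the functions link and load are invented purely to illustrate that the binary records the library's install_name rather than the path it was linked from.

```python
from dataclasses import dataclass, field


@dataclass
class Dylib:
    file_path: str     # where the library file sits on disk
    install_name: str  # path recorded inside the library when it was built


@dataclass
class Binary:
    load_commands: list = field(default_factory=list)


def link(binary, dylib):
    # The linker records the library's install_name, not the path it was found at.
    binary.load_commands.append(dylib.install_name)


def load(binary, files_on_disk):
    # The dynamic loader looks for each recorded path at run time.
    for path in binary.load_commands:
        if path not in files_on_disk:
            raise FileNotFoundError(f"Library not loaded: {path}")
    return list(binary.load_commands)


store = "/nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib/libpetsc.dylib"
tmp = "/private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib"

petsc = Dylib(file_path=store, install_name=tmp)
warpxm = Binary()
link(warpxm, petsc)

try:
    load(warpxm, files_on_disk={store})
except FileNotFoundError as err:
    print(err)
```

In this model, linking against the file in the store still leaves the binary pointing at the temporary build directory, which is exactly the symptom I was seeing.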

In any case, this find pointed me to the concept of the install_name, so I had something to go on. More searching led to a helpful blog post describing exactly the issue that I was facing. It also described how to check the install_name of the library:

➜ otool -D /nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib/libpetsc.dylib
/nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib/libpetsc.dylib:
/private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib

Gotcha.
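This check is easy to script. The sketch below parses text in the two-line format that otool -D printed above (the file path followed by a colon, then the install_name) and flags any install_name that does not point into the Nix store. The functions parse_install_name and is_stale are hypothetical helpers, and the sample text is copied from the output above.

```python
def parse_install_name(otool_output):
    """Return the install_name from `otool -D` output, or None if there is none."""
    lines = otool_output.strip().splitlines()
    if len(lines) < 2:
        return None
    # The first line is "<file>:", the second is the install_name itself.
    return lines[1].strip()


def is_stale(install_name, store="/nix/store/"):
    """True when the install_name is missing or points outside the Nix store."""
    return install_name is None or not install_name.startswith(store)


sample = (
    "/nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib/libpetsc.dylib:\n"
    "/private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/"
    "arch-darwin-c-debug/lib/libpetsc.3.13.dylib\n"
)

name = parse_install_name(sample)
print(name, "stale" if is_stale(name) else "ok")
```

On a Mac, the sample string could be replaced by the captured output of running otool -D on each .dylib in a derivation's lib directory.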

The Nix manual states that "some packages won't set this correctly", and points to the fix, which is to use install_name_tool to change the install_name of the built library. Is the PETSc derivation on nixpkgs doing this correctly? I saw that it was doing something with install_name_tool:

  prePatch = ''
    substituteInPlace configure \
      --replace /bin/sh /usr/bin/python
  '' + stdenv.lib.optionalString stdenv.isDarwin ''
    substituteInPlace config/install.py \
      --replace /usr/bin/install_name_tool install_name_tool
  '';

This directive replaces the appearances of the string /usr/bin/install_name_tool with just install_name_tool. The reason that Nix packages do this is to ensure that builds rely on the Nix-built tools, which are provided in the build shell's PATH, and not on binaries in system directories like /usr/bin.

The PR that introduced this substitution indicates that it fixed a build on Darwin, so there must be some invocation of /usr/bin/install_name_tool in PETSc. Searching for that in the PETSc repo leads to this line, which is doing exactly what the Mark's Logs post on install_name instructed: it changes the install_name to the absolute path of the library in its installation directory, using install_name_tool -id.

if os.path.splitext(dst)[1] == '.dylib' and os.path.isfile('/usr/bin/install_name_tool'):
    [output,err,flg] = self.executeShellCommand("otool -D "+src)
    oldname = output[output.find("\n")+1:]
    installName = oldname.replace(os.path.realpath(self.archDir), self.installDir)
    self.executeShellCommand('/usr/bin/install_name_tool -id ' + installName + ' ' + dst)

According to this, the install_name of the library should have been repaired by PETSc when the library was installed! Except... notice something? The second condition in the if statement. After the PETSc derivation runs its prePatch step, that condition will become os.path.isfile('install_name_tool'). That will certainly fail: os.path.isfile does not search the PATH, and install_name_tool is not going to be a file in the directory where the install script is running! The patched config/install.py will silently skip this step, leaving the install_name of the library pointing at the temporary directory where it was built!
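The failure is easy to reproduce outside of PETSc. The sketch below first applies the same textual substitution to the condition, then compares how os.path.isfile and shutil.which treat a bare program name. The helper lookup is hypothetical, and sh is used as the example tool only because it exists on most Unix systems, where install_name_tool may not.

```python
import os
import shutil

# The substitution from prePatch, applied to the condition as plain text.
original_condition = "os.path.isfile('/usr/bin/install_name_tool')"
patched_condition = original_condition.replace(
    "/usr/bin/install_name_tool", "install_name_tool"
)
print(patched_condition)


def lookup(name):
    """Return (is a file relative to the current directory, is found on PATH)."""
    # os.path.isfile resolves a bare name against the working directory only.
    in_cwd = os.path.isfile(name)
    # shutil.which searches the directories on PATH, as a shell would.
    on_path = shutil.which(name) is not None
    return in_cwd, on_path


print(lookup("sh"))
```

Unless the working directory happens to contain a file called sh, the first element of the tuple is False even though the tool is on the PATH. The same thing happens to install_name_tool inside the patched script.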

Luckily, the solution to this problem is not too hard. Instead of the name of a program on the PATH, we should pass the absolute path to the program we want to run. This can be done by overriding the prePatch step like so:

    prePatch = ''
        substituteInPlace configure \
        --replace /bin/sh /usr/bin/python
    '' + stdenv.lib.optionalString stdenv.isDarwin ''
        substituteInPlace config/install.py \
        --replace /usr/bin/install_name_tool ${darwin.cctools}/bin/install_name_tool
    '';

The Nix variable ${darwin.cctools} will expand to the full path of the built darwin.cctools derivation, which is a directory under /nix/store. So the patched if statement inside of PETSc's config/install.py becomes

if (os.path.splitext(dst)[1] == '.dylib' and
        os.path.isfile('/nix/store/1dgdim74d05ypll85vslm8i7kgzq78vw-cctools-port/bin/install_name_tool')):
    # use install_name_tool

and the install_name of the resulting library will be correct. We can check that with otool -D again:

➜ otool -D /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib
/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib:
/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.3.13.dylib

Looking much better! And since the install_name is only copied into the binary at link time, relinking against the rebuilt library is all it takes to check that it's working:

➜ DYLD_PRINT_LIBRARIES=1 bin/warpxm
dyld: loaded: /Users/jack/src/warpxm/build/bin/warpxm
dyld: loaded: /nix/store/ni26aaiira47ak60vks1qv4apbkwbg1d-hdf5-1.10.6/lib/libhdf5.103.dylib
dyld: loaded: /nix/store/acsjaw04hrf4rv8gizai7gx1ibq92ksa-zlib-1.2.11/lib/libz.dylib
dyld: loaded: /nix/store/z4f1bq363m0ydmbyncfi2srij8vlsx32-Libsystem-osx-10.12.6/lib/libSystem.B.dylib
dyld: loaded: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.3.13.dylib
...

That's more like it.
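With a long list of loaded libraries, a short filter makes stragglers easier to spot. This sketch assumes the "dyld: loaded: <path>" line format shown above, which may differ on other macOS versions. The helper names loaded_libraries and from_build_sandbox are made up, and the sample lines are copied from the output above.

```python
def loaded_libraries(dyld_output):
    """Extract the paths from lines of the form 'dyld: loaded: <path>'."""
    prefix = "dyld: loaded: "
    return [
        line[len(prefix):].strip()
        for line in dyld_output.splitlines()
        if line.startswith(prefix)
    ]


def from_build_sandbox(paths, markers=("/private/tmp/", "/tmp/")):
    """Return the paths that still point into a temporary build directory."""
    return [p for p in paths if p.startswith(markers)]


sample = """\
dyld: loaded: /Users/jack/src/warpxm/build/bin/warpxm
dyld: loaded: /nix/store/ni26aaiira47ak60vks1qv4apbkwbg1d-hdf5-1.10.6/lib/libhdf5.103.dylib
dyld: loaded: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.3.13.dylib
"""

paths = loaded_libraries(sample)
print(from_build_sandbox(paths))
```

An empty list means nothing is being loaded from a leftover build directory.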

Epilogue

I spent most of my time debugging this problem without a working understanding of the different build phases. It should have been clear to me that neither the CMake setup nor the pkg-config setup could be the cause, because by the time I was invoking cmake, the offending /private/tmp directory had long since vanished. If I had focused exclusively on the PETSc derivation provided by Nix, I might have homed in on the install_name_tool patch a little sooner. As it went, I was lucky to find the note in the Nix manual about Darwin-specific linker problems.

As for Nix, I will absolutely be using more of it. What's remarkable is how little impact it can have. I am able to use it to manage my environment for this project without impacting the way the other developers manage their environments. Of course, if they asked, I would advocate that they try out Nix, but it's nice for everyone to be able to do it on their own time.

I'm also looking forward to having my first contribution to Nix!