Debugging a dynamic linking bug in a Nix project
2020-07-17
Trying out the Nix development experience
The other day, while building a scientific project to which I'm a contributor, I
ran into a nasty version conflict between two system libraries. In a fit of
pique, I decided to learn enough about Nix to be able to set up a reproducible,
tightly controlled local build. It's done now, and overall I'm very happy with
the tooling and setup. I'm using direnv to tightly integrate my normal shell
with Nix's nix-shell feature, and for the most part everything feels seamless.
It is extremely refreshing to see cmake report that it has found a plethora of
binaries and libraries, content-hashed and installed in neat little rows under
/nix/store.
I'm using Nix to manage my development environment, but not to build the
project itself. Nix ensures that the project dependencies are installed and
discoverable by the compiler and linker. The project itself is built with
CMake, so cmake has to be able to find the Nix-installed libraries. Nix achieves
this by wrapping the C compiler with its own shell script and injecting the
paths to libraries and binaries via environment variables. There's very little
to do to make cmake just work, beyond declaring that the packages you want are
buildInputs. The first version of my shell.nix file looked like this:
# file shell.nix
{ pkgs ? import <nixpkgs> {} }:
pkgs.mkShell {
  buildInputs = with pkgs; [
    cmake
    (callPackage nix/petsc.nix {})
    metis
    hdf5
    openmpi
    (python38.withPackages (packages: [ packages.numpy ]))
  ];
}
Using this setup, I had very little trouble getting the project to build. I had to override the default PETSc derivation to compile with METIS and OpenMPI support, which was not too hard:
# file nix/petsc.nix
{ petsc, blas, gfortran, lapack, python, metis, openmpi }:
petsc.overrideAttrs (oldAttrs: rec {
  nativeBuildInputs = [ blas gfortran gfortran.cc.lib lapack python openmpi metis ];
  preConfigure = ''
    export FC="${gfortran}/bin/gfortran" F77="${gfortran}/bin/gfortran"
    patchShebangs .
    configureFlagsArray=(
      $configureFlagsArray
      "--with-mpi-dir=${openmpi}"
      "--with-metis=${metis}"
      "--with-blas-lib=[${blas}/lib/libblas.so,${gfortran.cc.lib}/lib/libgfortran.a]"
      "--with-lapack-lib=[${lapack}/lib/liblapack.so,${gfortran.cc.lib}/lib/libgfortran.a]"
    )
  '';
})
This Nix file returns a function, which is invoked in shell.nix using the
callPackage function. petsc.overrideAttrs is a neat way to override the
attributes of a derivation created with stdenv.mkDerivation. Building PETSc
with MPI and METIS support is as simple as passing in a different set of
arguments to the configure script.
Figuring out how to do all of this was fun. I mostly referred to the Nix "Pills", which are a great progression through the Nix tool and language.
With these Nix files, I was able to execute cmake .. && make successfully.
Getting the project to run was another story. The final binary failed
immediately with a dynamic loading error:
➜ bin/warpxm
dyld: Library not loaded: /private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib
Referenced from: /Users/jack/src/warpxm/build/bin/warpxm
Reason: image not found
The binary was trying to load a dynamic lib from one of the temporary directories
that Nix created in the process of building PETSc. Of course this failed: by the
time I invoked bin/warpxm, that directory had been cleaned up. Instead of a
file under /private/tmp, the binary should have linked to the result of the
petsc derivation in the Nix store, under /nix/store. At some point, it
seemed, an environment variable was incorrectly set to this intermediate
directory. To figure out where, I would have to learn a lot more about linking
on OS X than I ever expected.
Whither the linker?
First I checked the compiler and linker flags that are inserted by Nix's
compiler wrapper. These come in via NIX_CFLAGS_COMPILE and NIX_LDFLAGS. When
you're working with nix-shell and direnv, all of the environment variables
from your derivations are injected into your shell. It's a simple matter of echoing
them out:
➜ echo $NIX_CFLAGS_COMPILE
... -isystem /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/include ...
➜ echo $NIX_LDFLAGS
... -L/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib ...
These look fine! Invoking cmake and make in this shell ought to pull in the
correct library.
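The same check can be scripted instead of eyeballed. What follows is a minimal Python sketch, not something from my actual setup: it pulls the -L directories out of a flag string and reports any that fall outside /nix/store. The function names library_dirs and outside_store are made up for illustration, and the sketch assumes the flags contain nothing more exotic than simple shell quoting.

```python
import os
import shlex

NIX_STORE = "/nix/store"


def library_dirs(ldflags):
    """Return the directories named by -L flags, in the order given."""
    tokens = shlex.split(ldflags)
    dirs = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "-L" and i + 1 < len(tokens):
            dirs.append(tokens[i + 1])  # "-L dir" form
            i += 2
            continue
        if token.startswith("-L") and len(token) > 2:
            dirs.append(token[2:])  # "-Ldir" form
        i += 1
    return dirs


def outside_store(dirs, store=NIX_STORE):
    """Return the directories that are not inside the Nix store."""
    return [d for d in dirs if not d.startswith(store + "/")]


if __name__ == "__main__":
    # Inside nix-shell, NIX_LDFLAGS is set by the compiler wrapper setup.
    found = library_dirs(os.environ.get("NIX_LDFLAGS", ""))
    for directory in outside_store(found):
        print("not in the Nix store:", directory)
```

Run inside the Nix shell, this prints nothing when every -L directory lives in the store, which is what the echoed flags above suggest.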
Then I remembered that this project uses pkg-config to find and pull together
the linked libraries. Frankly, I don't understand pkg-config very well, but I
do know that in this project it is invoked from inside of cmake. It searches for
libraries according to its own rules, and it runs after Nix has done its
job setting everything up. Therefore, it circumvents the compiler and linker
flags that we just checked.
I happened to have pkg-config installed from before setting up this Nix
environment. Therefore, cmake was able to invoke the system pkg-config from
my user PATH. Perhaps the system version of pkg-config was somehow finding
the wrong library? Indeed, echo $PKG_CONFIG_PATH confirmed that it was
searching a directory under my $HOME. I thought it possible that some wires
got crossed while I was adding dependencies to my Nix derivation one at a time:
configuring pkg-config appropriately might help.
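To make the worry concrete: pkg-config looks for a file named after the module (here PETSc.pc) in each directory of PKG_CONFIG_PATH, in order, so an earlier entry under $HOME would shadow a later one in the Nix store. Here is a rough Python sketch of that lookup. The function name find_pc_file is hypothetical, and the sketch models only PKG_CONFIG_PATH, not the built-in default directories that the real tool also searches.

```python
import os


def find_pc_file(module, pkg_config_path):
    """Return the first <module>.pc found along pkg_config_path, or None."""
    for directory in pkg_config_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, module + ".pc")
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Shows which PETSc.pc would win in the current environment, if any.
    print(find_pc_file("PETSc", os.environ.get("PKG_CONFIG_PATH", "")))
```

If the printed path is not under /nix/store, then pkg-config is being steered to a library that Nix did not provide.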
I referred once again to the Nix wiki page on C projects, which also has a
section on using pkg-config. It seems that including the pkg-config
derivation as a nativeBuildInput will let packages like petsc append their
output paths to the PKG_CONFIG_PATH environment variable. I did so:
pkgs.mkShell {
  buildInputs = with pkgs; [
    ...
  ];
  nativeBuildInputs = with pkgs; [
    pkg-config
  ];
}
but it didn't fix the problem. I would have to go deeper and track down where the bad library was being pulled in.
Digging into the cmake documentation and the project's .cmake files led me
to insert a trio of print statements:
find_package(PkgConfig REQUIRED)
pkg_check_modules(PETSC PETSc REQUIRED)
link_directories(${PETSC_LIBRARY_DIRS})
+ message("petsc libraries: ${PETSC_LIBRARIES}")
+ message("petsc library dirs: ${PETSC_LIBRARY_DIRS}")
+ message("petsc link libraries: ${PETSC_LINK_LIBRARIES}")
list(APPEND WARPXM_LINK_TARGETS ${PETSC_LIBRARIES})
These printed out three lines in my cmake output:
petsc libraries: petsc
petsc library dirs: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib
petsc link libraries: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib
The last two look good. But the first, just the library name petsc, was a little too
implicit for comfort. It was precisely this variable that was being appended
to the link targets list. At compile time, it would be up to the linker to find
the library petsc, and I wasn't sure where it would look. Safer to use the
absolute path to the .dylib, like so:
- list(APPEND WARPXM_LINK_TARGETS ${PETSC_LIBRARIES})
+ list(APPEND WARPXM_LINK_TARGETS ${PETSC_LINK_LIBRARIES})
My thinking here was wrong. We can be sure where the linker will look at
compile time: in the paths listed in NIX_LDFLAGS! I wasn't thinking clearly
about the flow of data in the compilation process.
Changing the link target to the absolute path eased my mind only for the duration of
the next cmake .. && make cycle. Surely there was no way the linker could
screw up now. No arcane library search involved, just an absolute path, which
couldn't possibly be misinterpreted...
➜ bin/warpxm
dyld: Library not loaded: /private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib
Referenced from: /Users/jack/src/warpxm/build/bin/warpxm
Reason: image not found
Damn it!
install_name and other depravities
At this point I was absolutely flummoxed. With every fix I attempted, I
grepped vainly for the offending /private/tmp path in my build directory, and came up
empty-handed. I tracked down the final, irrevocable link options passed to the
compiler, tucked away in a link.txt file in the build tree. They showed
incontrovertibly that my binary was being linked to the correct library:
➜ cat build/src/CMakeFiles/warpxm.dir/link.txt
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -O3 -DNDEBUG -isysroot ... -L/nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib ...
I had proved nearly to my satisfaction that CMake was doing the right thing with this library, and I was completely out of ideas. Finally, a very lucky Google search led me to the section of the Nix manual describing issues specific to the Darwin (macOS) platform. It states:
On Darwin, libraries are linked using absolute paths, libraries are resolved by their install_name at link time. Sometimes packages won't set this correctly causing the library lookups to fail at runtime. This can be fixed by adding extra linker flags or by running install_name_tool -id during the fixupPhase.
This is a very matter-of-fact way of stating something that, when I understood it, flabbergasted me. To the best of my understanding, here's what happens on macOS:
- My source code has an include directive, #include <petsc.h> or something like that, which creates a binary interface to be satisfied by the linker.
- At link time, we pass the list of absolute paths to libraries, and the linker finds the one that matches the interface.
- The linker then saves the install_name of the library it found in the binary's load commands.
- At run time, the binary (actually, the macOS dyld system) loads the library. The install_name is all it has, so it looks there.
I've certainly gotten some aspect of this wrong, so I would definitely appreciate hearing from someone who understands it better than me!
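A toy model helped me keep the two phases straight. The Python sketch below is purely illustrative: Dylib, Executable, link and load are invented names, and it deliberately ignores everything else dyld can do (@rpath, fallback paths, environment overrides). It only captures the one fact that matters here: the link step copies whatever name the library reports for itself, and the load step trusts that name.

```python
from dataclasses import dataclass, field


@dataclass
class Dylib:
    file_path: str     # where the library file actually sits on disk
    install_name: str  # the name the library records for itself


@dataclass
class Executable:
    load_commands: list = field(default_factory=list)


def link(executable, dylib):
    """Link step: copy the library's install_name into the executable."""
    executable.load_commands.append(dylib.install_name)


def load(executable, files_on_disk):
    """Run-time step: every recorded name must exist on disk."""
    for name in executable.load_commands:
        if name not in files_on_disk:
            raise FileNotFoundError("Library not loaded: " + name)
    return list(executable.load_commands)


if __name__ == "__main__":
    store_path = "/nix/store/example-petsc/lib/libpetsc.dylib"  # made-up path
    stale = Dylib(store_path, "/private/tmp/build/lib/libpetsc.3.13.dylib")
    binary = Executable()
    link(binary, stale)  # linking against the store path succeeds
    try:
        load(binary, {store_path})
    except FileNotFoundError as error:
        print(error)  # the recorded name points at the vanished build dir
```

In this model the link succeeds against the file in the store, yet the load fails because the name baked into the binary is the stale one, which is the shape of the failure I was seeing.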
In any case, this find pointed me to the concept of the install_name, so I had something to go on. More searching led to a helpful blog post describing exactly the issue that I was facing. It also described how to check the install_name of the library:
➜ otool -D /nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib/libpetsc.dylib
/nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib/libpetsc.dylib:
/private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib
Gotcha.
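For checking more than one library, the otool output is easy to pick apart in a script. This is a small Python sketch under a couple of assumptions: that otool -D prints the file path on the first line and the install_name on the second, as in the output above, and that the file has a single architecture. The function names are my own invention.

```python
import subprocess


def parse_install_name(otool_output):
    """Extract the install_name from the text printed by `otool -D`."""
    lines = otool_output.strip().splitlines()
    if len(lines) < 2:
        return None  # no install_name line was printed
    return lines[1].strip()


def points_outside_store(install_name, store="/nix/store"):
    """True if the install_name does not live under the Nix store."""
    return install_name is not None and not install_name.startswith(store + "/")


def install_name_of(library_path):
    """Run `otool -D` on a library (macOS only) and return its install_name."""
    result = subprocess.run(
        ["otool", "-D", library_path],
        capture_output=True, text=True, check=True,
    )
    return parse_install_name(result.stdout)
```

Calling points_outside_store(install_name_of(path)) on the PETSc library from the store would have flagged the /private/tmp name immediately.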
The Nix manual states that "some packages won't set this correctly", and points
to the fix, which is to use install_name_tool to change the install_name of
the built library. Is the PETSc derivation on nixpkgs doing this correctly? I
saw that it was doing something with install_name_tool:
prePatch = ''
  substituteInPlace configure \
    --replace /bin/sh /usr/bin/python
'' + stdenv.lib.optionalString stdenv.isDarwin ''
  substituteInPlace config/install.py \
    --replace /usr/bin/install_name_tool install_name_tool
'';
This directive replaces the appearances of the string
/usr/bin/install_name_tool with just install_name_tool. The reason that Nix
packages do this is to ensure that builds rely on the Nix-built tools, which are
provided in the build shell's PATH, and not on binaries in system directories
like /usr/bin.
The PR that introduced this substitution indicates that it fixed a build on
Darwin, so there must be some invocation of /usr/bin/install_name_tool in
PETSc. Searching for that in the PETSc repo leads to this line, which is doing
exactly what the Mark's Logs post on install_name instructed: it changes the
install_name to the absolute path of the library in its installation directory,
using install_name_tool -id.
if os.path.splitext(dst)[1] == '.dylib' and os.path.isfile('/usr/bin/install_name_tool'):
    [output,err,flg] = self.executeShellCommand("otool -D "+src)
    oldname = output[output.find("\n")+1:]
    installName = oldname.replace(os.path.realpath(self.archDir), self.installDir)
    self.executeShellCommand('/usr/bin/install_name_tool -id ' + installName + ' ' + dst)
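To see what the replace call is meant to produce, here is the string manipulation on its own, as a Python sketch. The two directory values are my guesses at what archDir and installDir would have been in this build, inferred from the paths that appear elsewhere in this post; they are not taken from PETSc's logs. The function name is hypothetical.

```python
def rewritten_install_name(old_name, arch_dir, install_dir):
    """Swap the build-tree prefix of an install_name for the install prefix."""
    return old_name.replace(arch_dir, install_dir)


# Assumed values, reconstructed from paths shown in this post.
OLD_NAME = (
    "/private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/"
    "arch-darwin-c-debug/lib/libpetsc.3.13.dylib"
)
ARCH_DIR = "/private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug"
INSTALL_DIR = "/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2"

if __name__ == "__main__":
    print(rewritten_install_name(OLD_NAME, ARCH_DIR, INSTALL_DIR))
```

The os.path.realpath call in the original is presumably there because /tmp on macOS is a symlink to /private/tmp, so the recorded name uses the resolved form of the build directory.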
According to this, the install_name of the library should have been repaired by
PETSc when the library was built! Except... notice something? The second
condition in the if statement. After the PETSc derivation runs its prePatch
step, that condition becomes os.path.isfile('install_name_tool'). That
will certainly fail: os.path.isfile does not search PATH, and install_name_tool
is not going to be a file in the directory where the install script is running!
The patched config/install.py will silently skip this step, leaving the
install_name of the library pointing at the temporary directory where it was
built!
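The difference between "is a file at this path" and "is a program on PATH" is easy to demonstrate in isolation. The Python sketch below is only an illustration of that distinction, not a proposed patch to PETSc: os.path.isfile treats a bare name as a path relative to the current directory, while shutil.which performs the PATH search that a shell would. The two wrapper function names are made up.

```python
import os
import shutil


def visible_to_isfile(program):
    """True only if `program` is a path (relative to cwd, or absolute) to a file."""
    return os.path.isfile(program)


def visible_on_path(program, search_path=None):
    """True if `program` is found as an executable on search_path (default: $PATH)."""
    return shutil.which(program, path=search_path) is not None


if __name__ == "__main__":
    for program in ("install_name_tool", "/usr/bin/install_name_tool"):
        print(program, visible_to_isfile(program), visible_on_path(program))
```

For a bare program name, the first check passes only if a file of that name happens to sit in the current directory, which is why the substituted condition never fires during the PETSc install.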
Luckily, the solution to this problem is not too hard. Instead of the name of a
program on the PATH, we should pass the absolute path to the program we want
to run. This can be done by overriding the prePatch step like so:
prePatch = ''
  substituteInPlace configure \
    --replace /bin/sh /usr/bin/python
'' + stdenv.lib.optionalString stdenv.isDarwin ''
  substituteInPlace config/install.py \
    --replace /usr/bin/install_name_tool ${darwin.cctools}/bin/install_name_tool
'';
The Nix variable ${darwin.cctools} will expand to the full path of the
built darwin.cctools derivation, which is a directory under /nix/store. So
the patched if statement inside of PETSc's config/install.py becomes
if os.path.splitext(dst)[1] == '.dylib' and os.path.isfile('/nix/store/1dgdim74d05ypll85vslm8i7kgzq78vw-cctools-port/bin/install_name_tool'):
    # use install_name_tool
and the install_name of the resulting library will be correct. We can check that
with otool -D again:
➜ otool -D /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib
/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib:
/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.3.13.dylib
Looking much better! And since the error was in a dynamically loaded library, we don't even have to recompile to check that it's working:
➜ build git:(master) ✗ DYLD_PRINT_LIBRARIES=1 bin/warpxm
dyld: loaded: /Users/jack/src/warpxm/build/bin/warpxm
dyld: loaded: /nix/store/ni26aaiira47ak60vks1qv4apbkwbg1d-hdf5-1.10.6/lib/libhdf5.103.dylib
dyld: loaded: /nix/store/acsjaw04hrf4rv8gizai7gx1ibq92ksa-zlib-1.2.11/lib/libz.dylib
dyld: loaded: /nix/store/z4f1bq363m0ydmbyncfi2srij8vlsx32-Libsystem-osx-10.12.6/lib/libSystem.B.dylib
dyld: loaded: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.3.13.dylib
...
That's more like it.
Epilogue
I spent most of my time debugging this problem without a working
understanding of the different build phases. It should have been clear
to me that neither the CMake nor the pkg-config setups could be the cause,
because at the time that I was invoking cmake, the offending /private/tmp
directory had long vanished. If I had focused exclusively on the PETSc
derivation provided by Nix, I might have homed in on the install_name_tool
patch a little sooner. As it went, I was lucky to find the note in the Nix
manual about Darwin-specific linker problems.
As for Nix, I will absolutely be using more of it. What's remarkable is how little impact it can have. I am able to use it to manage my environment for this project without impacting the way the other developers manage their environments. Of course, if they asked, I would advocate that they try out Nix, but it's nice for everyone to be able to do it on their own time.
I'm also looking forward to having my first contribution to Nix!