Skip to content

armv7-linux-androideabi cross test "attempt to subtract with overflow" in miniz_oxide 0.8.5 #1894

Open
@EliahKagan

Description

@EliahKagan

Current behavior 😯

Background

cf7f34d (#1882) bumped prodash from 29.0.0 to 29.0.1 in Cargo.toml, as well as various resolved dependency versions in Cargo.lock, which included bumping miniz_oxide from 0.8.0 to 0.8.5.

Around that time and starting somewhat earlier, I was experimenting with running tests via cross in the hope of finding a way to reproduce some s390x-unknown-linux-gnu failures without s390x hardware. As noted in #1890, I was unable to reproduce those failing s390x tests that way, and I still have not managed to do so.

However, I also experimented with another target in cross, with the intention that it would help me understand what failures were target-specific. I didn't get failures in s390x, but I expected I would and that the other target could serve as a kind of control. It turned out that s390x was the control and the other target eventually had unexpected failures.

Specifically, for the other target, I chose armv7-linux-androideabi. I chose this because it had been partially tested via cross on CI before the test was replaced with a more comprehensive non-Android arm32v7 containerized (but non-cross) test on ARM hardware in #1777.

In its past testing on CI, armv7-linux-androideabi had problems using the external git command. Upgrading git in the cross container and patching the runner script to avoid preloading an unusable shared library overcame this, and I was able to run all tests with cross using that target.

All tests all passed--aside from one fuzzing-related performance test that had to be skipped due to slowdown from QEMU emulation of an unrelated architecture without hardware acceleration, and a couple of cases where tests were marked with the wrong #[cfg(...)], which was easily fixed. But...

Then prodash, and various locked dependencies, were upgraded

Starting after a rebase that included cf7f34d in the history, several test cases failed in cross runs on the armv7-linux-androideabi target. In the recent run presented here, this is eight tests. Running git revert --no-commit cf7f34d and rerunning the tests confirmed that this change is what triggered the failures.

I did a full gix clean -xde between the runs to ensure they did not contaminate each other. I also ran it a number of times with the intention of verify that the effect is consistent. In initially thought the effect was completely consistent based on some very similar looking runs, but actually the tests that fail differ. I suspect this is related to threading. I sometimes forced the Docker image to be rebuilt in between as well, to ensure contamination in the Docker image itself was not the cause (though I was not aware of a way that could happen), which it was not.

Full test run output of the failures, and of how they go away without the changes from cf7f34d, are presented in this gist. Although not shown there, I have also tried running the tests with --locked. This does not make a difference, which makes me think that upgrade to prodash itself is relevant, though the way it is relevant could still in principle be through a dependency.

The failures are all specific to armv7-linux-androideabi with cross, or at least they do not happen in any of the other ways I have been tested or that are tested on CI. In particular, this does not happen on the same system when running the tests natively, nor when running them on that system with cross with the s390x-unknown-linux-gnu target. In particular, they do not fail on either of the 32-bit targets tested (without cross) on CI, one of which is a non-Android ARMv7 target.

Indirect dependencies that may be relevant

The changes in prodash 29.0.1 are listed as replacing humantime with jiff and bumping jiff to version 0.2, which the diff confirms. Assuming the latest patch versions were used in both 0.1 and 0.2 in practice, there are a number of relevant differences, but none seem like they would affect miniz_oxide.

The reason I was looking for an effect on miniz_oxide is that the test failures are panics that happen there. gitoxide uses miniz_oxide through flate2. If there is any connection to prodash or jiff, I don't know what it is. But if there isn't, then I don't know why passing --locked made no difference: with or without --locked, the failures happened with the changes from cf7f34d, and did not happen without them.

But if we look at Cargo.lock changes in cf7f34d, the flate2 and miniz_oxide crates were among those upgraded:

 [[package]]
 name = "flate2"
-version = "1.0.35"
+version = "1.1.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "c936bfdafb507ebbf50b8074c54fa31c5be9a1e7e5f467dd659697041407d07c"
+checksum = "11faaf5a5236997af9848be0bef4db95824b1d534ebc64d0f0c6cf3e67bd38dc"
 [[package]]
 name = "miniz_oxide"
-version = "0.8.0"
+version = "0.8.5"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "e2d80299ef12ff69b16a84bb182e3b9df68b5a91574d3d4fa6e41b65deec4df1"
+checksum = "8e3e04debbb59698c15bacbb6d93584a8c0ca9cc3213cb423d31f760d8843ce5"

Furthermore the versions of miniz_oxide listed in Cargo.lock seem to be used even when I do not pass --locked or --frozen. The compiled version is 0.8.5 when the tests fail, and the compiled version is 0.8.0 when the tests pass with the revert, as though locked versions are being used. Perhaps there is something I don't understand about when Cargo.lock versions are guaranteed to be used?

Which test modules have failures

Because I am using cross, I am using cargo test (via cross test) rather than cargo nextest. The output is accordingly much harder (at least for me) to read. Here's the broad summary at the end of the whole run:

error: 3 targets failed:
    `-p gix --test gix`
    `-p gix-odb-tests --test integrate`
    `-p gix-pack-tests --test pack`
error: Recipe `cross-test` failed on line 234 with exit code 101

There are always failures in these three modules and no others. Details about all the specific failures in the linked gist are as follows.

Summary of gix failures

The failures to which -p gix --test gix pertains were:

failures:
    remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled
    remote::fetch::blocking_and_async_io::fetch_with_multi_round_negotiation
    repository::object::commit_as::specify_committer_and_author

test result: FAILED. 319 passed; 3 failed; 0 ignored; 0 measured; 0 filtered out; finished in 40.12s

error: test failed, to rerun pass `-p gix --test gix`

Here are links to their full details in the gist:

Summary of gix-odb-tests failures

The failures to which -p gix-odb-tests --test integrate pertains were:

failures:
    odb::store::dynamic::write
    odb::store::loose::write::collisions_do_not_cause_failure
    odb::store::loose::write::it_writes_objects_with_similar_permissions
    odb::store::loose::write::read_and_write

test result: FAILED. 55 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 7.30s

error: test failed, to rerun pass `-p gix-odb-tests --test integrate`

Here are links to their full details in the gist:

Summary of gix-pack-tests failures

The failure to which -p gix-pack-tests --test pack pertains was:

failures:
    pack::data::output::count_and_entries::traversals

test result: FAILED. 45 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 6.25s

error: test failed, to rerun pass `-p gix-pack-tests --test pack`

Here is a link to its full details in the gist:

Analysis

I don't know what's going on, but all of the failures are "attempt to subtract with overflow" errors in miniz_oxide. This is the case both in the run shown in the gist and in other runs where the tests that fail are sometimes slightly different. All the failures seem to resemble each other pretty closely. Here's the first failure output with its backtrace, to illustrate (the others can be viewed in the gist through the specific links, lest this issue become too long):

---- remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled stdout ----

thread 'remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled' panicked at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:1990:38:
attempt to subtract with overflow
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:692:5
   1: core::panicking::panic_fmt
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:75:14
   2: core::panicking::panic_const::panic_const_sub_overflow
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:178:21
   3: miniz_oxide::deflate::core::compress_fast
             at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:1990:38
   4: miniz_oxide::deflate::core::compress_inner
             at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:2244:9
   5: miniz_oxide::deflate::core::compress
             at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:2177:5
   6: miniz_oxide::deflate::stream::deflate
             at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/stream.rs:54:23
   7: <flate2::ffi::rust::Deflate as flate2::ffi::DeflateBackend>::compress
             at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/flate2-1.1.0/src/ffi/rust.rs:169:19
   8: flate2::mem::Compress::compress
             at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/flate2-1.1.0/src/mem.rs:334:9
   9: gix_features::zlib::stream::deflate::impls::<impl gix_features::zlib::stream::deflate::Write<W>>::write_inner
             at /project/gix-features/src/zlib/stream/deflate/mod.rs:69:30
  10: gix_features::zlib::stream::deflate::impls::<impl std::io::Write for gix_features::zlib::stream::deflate::Write<W>>::flush
             at /project/gix-features/src/zlib/stream/deflate/mod.rs:107:13
  11: <gix_features::hash::write::Write<T> as std::io::Write>::flush
             at /project/gix-features/src/hash.rs:182:13
  12: gix_odb::store_impls::loose::write::<impl gix_object::traits::Write for gix_odb::store_impls::loose::Store>::write_stream
             at /project/gix-odb/src/store_impls/loose/write.rs:91:9
  13: gix_odb::store_impls::dynamic::write::<impl gix_object::traits::Write for gix_odb::store_impls::dynamic::Handle<S>>::write_stream
             at /project/gix-odb/src/store_impls/dynamic/write.rs:34:26
  14: gix_odb::cache::impls::<impl gix_object::traits::Write for gix_odb::Cache<S>>::write_stream
             at /project/gix-odb/src/cache.rs:154:13
  15: <gix_odb::memory::Proxy<T> as gix_object::traits::Write>::write_stream
             at /project/gix-odb/src/memory.rs:216:20
  16: gix_object::traits::Write::write_buf
             at /project/gix-object/src/traits/mod.rs:16:9
  17: gix::repository::object::<impl gix::types::Repository>::write_object_inner
             at ./src/repository/object.rs:180:9
  18: gix::repository::object::<impl gix::types::Repository>::write_object
             at ./src/repository/object.rs:171:9
  19: gix::repository::object::<impl gix::types::Repository>::commit_as_inner
             at ./src/repository/object.rs:309:25
  20: gix::repository::object::<impl gix::types::Repository>::commit_as
             at ./src/repository/object.rs:273:9
  21: gix::repository::object::<impl gix::types::Repository>::commit
             at ./src/repository/object.rs:373:9
  22: gix::remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled::create_empty_commit
             at ./tests/gix/remote/fetch.rs:97:13
  23: gix::remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled
             at ./tests/gix/remote/fetch.rs:113:13
  24: gix::remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled::{{closure}}
             at ./tests/gix/remote/fetch.rs:90:50
  25: core::ops::function::FnOnce::call_once
             at /rust/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
  26: core::ops::function::FnOnce::call_once
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Looking at this step in the backtrace:

   3: miniz_oxide::deflate::core::compress_fast
             at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:1990:38

That is a reference to this line in miniz_oxide 0.8.5:

let mut cur_match_dist = (lookahead_pos - probe_pos) as u16;

The blame shows the most recent change to that line as a removal of the as usize cast on probe_pos. This is from commit Frommi/miniz_oxide@0f50464 where several as usize casts are removed. But I think this cannot be the cause of the new failure, since this change is present in all versions of miniz_oxide since 0.7.2.

Inspecting a broader diff, there are substantial changes, including two deflate/core.rs from 0.8.0 to 0.8.5, but I don't know which would cause this. Maybe something in the changelog would give a hint.

Assuming they are able to be resolved without causing problems with other dependencies, I might manage to bisect the miniz_oxide versions, or even on miniz_oxide commits, on one of the failing gitoxide tests. But perhaps you will have a more specific idea of what is going on.

Expected behavior 🤔

Ordinarily I would say I expect that all tests pass. But cross images are strange environments in which to run gitoxide tests, because they are x86-64 containers in which code of another architecture is built and then, through QEMU, run. I generally expect a cross image to require significant customization to be usable to run gitoxide tests, since gitoxide interacts with various parts of the system, even in a max-pure build as used here.

If I had always observed these failures, then I would have assumed the problem was just with the container as I had set it up, and not assumed a bug. But it is strange that it was working fine (once I got git to a new enough version and kept ld.so message from polluting stderr, as described below), but then cf7f34d broke it. That suggests to me that there is a bug somewhere.
 
Maybe there is a long-standing bug in gitoxide that this surfaces. Or maybe there is something wrong with the tools in the cross image I am using (either in the original image, or from my modifications, detailed below). Or maybe there is a bug in miniz_oxide, or in a crate that uses it such as flate2.

Git behavior

In Git itself

In the sense of git behavior that corresponds to the affected behavior of gitoxide, this is not applicable as far as I know.

In how we use it

I also suspect that the way the image is customized to be able to use git for tests that run it, either directly or through .gitignored fixtures, is not relevant here. But just in case it is, the most git-relevant changes in the container are:

  1. Enable the git-core PPA.

  2. Upgrade git. This is still the x86-64 git. (cross images are x86-64 images with cross toolchains installed. It is possible to install git in a cross image for some cross targets, and but probably not this one.)

  3. Patch the runner script cross uses, to keep it from setting LD_PRELOAD to include a .so file implementing the C++ standard library. This could be useful in a C++ Android program, but when ld.so attempts to preload it in an x86-64 executable, it writes this text to standard error:

    ERROR: ld.so: object '/android-ndk/sysroot/usr/lib/arm-linux-androideabi/libc++_shared.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
    

    Then the program actually runs. But when we capture standard error and parse or make assertions about it, we get failures. So I kept LD_PRELOAD from being set to that path, to avoid this.

Full details of how I customized the container, including those not related to making git usable, are included below.

Steps to reproduce 🕹

Customizing the container

I'll open a pull request with the changes. That pull request will not fix this issue, but it will be possible to look at the changes in it, at least initially. However, I intend also to improve that pull request, both to be able to show these failures better--there are other problems on CI related to configuring binfmt_misc to use QEMU that obscure them currently--and to include further changes, both that I have already done and that I plan to do, for testing s390x. So the information about how the observations reported here were produced seems valuable to present here.

I made this etc/docker/test-cross.toml file:

# `cross` configuration for running tests. Treated like `Cross.toml` if enabled
# with `CROSS_CONFIG=etc/docker/test-cross.toml`. This avoids affecting other
# `cross` usage, e.g. in `release.yml`. See `cross-test` recipes in `justfile`.

[build.env]
passthrough = [
    "CI",
    "GITHUB_ACTIONS",
    "GIX_CREDENTIALS_HELPER_STDERR",
    "GIX_EXTERNAL_COMMAND_STDERR",
    "GIX_OBJECT_CACHE_MEMORY",
    "GIX_PACK_CACHE_MEMORY",
    "GIX_TEST_CREATE_ARCHIVES_EVEN_ON_CI",
    "GIX_TEST_EXPECT_REDUCED_TRUST",
    "GIX_TEST_IGNORE_ARCHIVES",
    "GIX_VERSION",
    "NO_PRELOAD_CXX",
    "RUST_BACKTRACE",
    "RUST_LIB_BACKTRACE",
]

[target.armv7-linux-androideabi]
image = "cross-rs-gitoxide:armv7-linux-androideabi"

[target.s390x-unknown-linux-gnu]
image = "cross-rs-gitoxide:s390x-unknown-linux-gnu"

And this etc/docker/Dockerfile.test-cross file:

ARG TARGET

FROM ghcr.io/cross-rs/${TARGET}:latest

RUN apt-get update && \
    apt-get install --no-install-recommends -y apt-transport-https gnupg && \
    release="$(sed -n 's/^VERSION_CODENAME=//p' /etc/os-release)" && \
    echo "deb https://ppa.launchpadcontent.net/git-core/ppa/ubuntu $release main" \
        >/etc/apt/sources.list.d/git-core-ubuntu-ppa.list && \
    apt-key adv --keyserver keyserver.ubuntu.com \
        --recv-keys F911AB184317630C59970973E363C90F8F1B6217 && \
    apt-get update && \
    apt-get install --no-install-recommends -y git jq patch && \
    rm -rf /var/lib/apt/lists/* && \
    to_patch=/android-runner && \
    patcher='s/^export LD_PRELOAD=/test "${NO_PRELOAD_CXX:-0}" != 0 || &/' && \
    if test -f "$to_patch"; then sed -i.orig "$patcher" -- "$to_patch"; fi && \
    git config --system gitoxide.imaginary.arbitraryVariable arbitraryValue

And added these recipes to the justfile:

# Build a customized `cross` container image for testing
cross-image target:
    docker build --build-arg "TARGET={{ target }}" -t "cross-rs-gitoxide:{{ target }}" \
        - <etc/docker/Dockerfile.test-cross

# Test another platform with `cross`
cross-test target: (cross-image target)
    CROSS_CONFIG=etc/docker/test-cross.toml NO_PRELOAD_CXX=1 \
        cross test --workspace --no-fail-fast --target "{{ target }}" \
        --no-default-features --features max-pure \
        -- --skip realpath::fuzzed_timeout

# Test s390x with `cross`
cross-test-s390x: (cross-test 's390x-unknown-linux-gnu')

# Test Android with `cross`
cross-test-android: (cross-test 'armv7-linux-androideabi')

Running the tests

Then, as shown in the gist, I ran various commands, the most important of which were this command, producing the errors:

RUST_BACKTRACE=1 just cross-test armv7-linux-androideabi

And these commands, cleaning the build directory and running the tests with the changes of cf7f34d undone. (To do this recently without having to manually resolve a conflict--but not originally, as shown in this older gist--I had to also revert 00d1a00, but that is not the cause.)

git revert --no-commit 00d1a00 cf7f34d
gix clean -xde
RUST_BACKTRACE=1 just cross-test armv7-linux-androideabi

I have done this quite a few times; the failures appear deterministic. Enabling the backtrace is not the cause. (That would be strange, but who knows--this is already strange.) I have also run the tests multiple times without RUST_BACKTRACE=1.

The images are not stale

I have also tried forcing all Docker images to be reacquired or rebuilt, even before the second run, by running the following commands, where the third command is the one that actually deletes images, and the purpose of the others is to verify that there are no containers running or stopped, that there are images before I delete them, and that there are no images after I delete them. Please note that you might not want to run these commands, since they remove any Docker images that are not use by containers.

docker container list -a
docker image list -a
docker image prune -a
docker image list -a

I would not expect that to have made a difference, and it never did. The main value of it was to let me make sure the first run in the gist would show the whole process of customizing the Docker image, in case somehow that turns out to be relevant.

Minor patches to the test suite

Finally, though I think this is the least important for this issue, the changes to allow tests to pass instead of failing, or to run and pass instead of being skipped, were:

diff --git a/gix-config-value/tests/value/path.rs b/gix-config-value/tests/value/path.rs
index 10970296d..72585522a 100644
--- a/gix-config-value/tests/value/path.rs
+++ b/gix-config-value/tests/value/path.rs
@@ -89,16 +89,16 @@ mod interpolate {
         Ok(())
     }

-    #[cfg(windows)]
+    #[cfg(any(target_os = "windows", target_os = "android"))]
     #[test]
-    fn tilde_with_given_user_is_unsupported_on_windows() {
+    fn tilde_with_given_user_is_unsupported_on_windows_and_android() {
         assert!(matches!(
             interpolate_without_context("~baz/foo/bar"),
             Err(gix_config_value::path::interpolate::Error::UserInterpolationUnsupported)
         ));
     }

-    #[cfg(not(windows))]
+    #[cfg(not(any(target_os = "windows", target_os = "android")))]
     #[test]
     fn tilde_with_given_user() -> crate::Result {
         let home = std::env::current_dir()?;
diff --git a/tests/tools/tests/umask.rs b/tests/tools/tests/umask.rs
index 0a4cc0425..5463b57ad 100644
--- a/tests/tools/tests/umask.rs
+++ b/tests/tools/tests/umask.rs
@@ -1,6 +1,9 @@
 #[test]
 #[cfg(unix)]
-#[cfg_attr(not(target_os = "linux"), ignore = "The test itself uses /proc")]
+#[cfg_attr(
+    not(any(target_os = "linux", target_os = "android")),
+    ignore = "The test itself uses /proc"
+)]
 fn umask() {
     use std::fs::File;
     use std::io::{BufRead, BufReader};

There are a few other changes on the feature branch, but you are unlikely to want to see them in this issue. They do not affect code exercised in any of the tests, and they shall be included in the forthcoming PR (which, to be clear, does not hope to fix this issue).

Edits: The pull request

That PR is #1895. The commit at this point the point described above is 5d4eed7. Later commits may include modifications to some of the files shown here. (The history to that commit can be seen here, though it's possible that could eventually be GC'd following a rebase.)

As of e207b27 ("Configure binfmt_misc for QEMU on CI"), CI in #1895 shows the same bug as here, without any unrelated failures. So that may now be a good way to examine the bug. (Comparison of the failures there to those in the gist may also illuminate something about what changes and what stays the same across different runs.) For convenience, here's a failing run, and:

The possible role of cargo test (vs. cargo nextest) seems small

On this separate branch (currently at this commit), I ran an experiment where other tests, especially the test-32bit jobs, are changed to be more like these, including by making them use cargo test. This does not ever seem to cause them to fail.

So while the nondeterminism in exactly which test cases fail suggests a possible role of cargo test using multithreading, it does not seem to a significant contributing factor to the effect:

  • The effect only occurs since dependencies were updated.
  • Even after that, it does not occur in any tests except those being experimented with in armv7-linux-androideabi.
  • Not even if other tests are changed to use cargo test so they are more similar to how armv7-linux-androideabi is tested.

Metadata

Metadata

Assignees

No one assigned

    Labels

    acknowledgedan issue is accepted as shortcoming to be fixedhelp wantedExtra attention is needed

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions