Description
### Current behavior 😯
#### Background
cf7f34d (#1882) bumped `prodash` from 29.0.0 to 29.0.1 in `Cargo.toml`, as well as various resolved dependency versions in `Cargo.lock`, which included bumping `miniz_oxide` from 0.8.0 to 0.8.5.
Around that time, and starting somewhat earlier, I was experimenting with running tests via `cross` in the hope of finding a way to reproduce some `s390x-unknown-linux-gnu` failures without s390x hardware. As noted in #1890, I was unable to reproduce those failing s390x tests that way, and I still have not managed to do so.
However, I also experimented with another target in `cross`, with the intention that it would help me understand what failures were target-specific. I didn't get failures in s390x, but I expected I would, and that the other target could serve as a kind of control. It turned out that s390x was the control and the other target eventually had unexpected failures.
Specifically, for the other target, I chose `armv7-linux-androideabi`. I chose this because it had been partially tested via `cross` on CI before the test was replaced with a more comprehensive non-Android arm32v7 containerized (but non-`cross`) test on ARM hardware in #1777.
In its past testing on CI, `armv7-linux-androideabi` had problems using the external `git` command. Upgrading `git` in the `cross` container and patching the runner script to avoid preloading an unusable shared library overcame this, and I was able to run all tests with `cross` using that target.
All tests passed--aside from one fuzzing-related performance test that had to be skipped due to slowdown from QEMU emulation of an unrelated architecture without hardware acceleration, and a couple of cases where tests were marked with the wrong `#[cfg(...)]`, which was easily fixed. But...
#### Then `prodash`, and various locked dependencies, were upgraded
Starting after a rebase that included cf7f34d in the history, several test cases failed in `cross` runs on the `armv7-linux-androideabi` target. In the recent run presented here, this is eight tests. Running `git revert --no-commit cf7f34d` and rerunning the tests confirmed that this change is what triggered the failures.
I did a full `gix clean -xde` between the runs to ensure they did not contaminate each other. I also ran it a number of times with the intention of verifying that the effect is consistent. I initially thought the effect was completely consistent based on some very similar-looking runs, but actually the tests that fail differ; I suspect this is related to threading. I sometimes forced the Docker image to be rebuilt in between as well, to ensure contamination in the Docker image itself was not the cause (though I was not aware of a way that could happen), which it was not.
Full test run output of the failures, and of how they go away without the changes from cf7f34d, is presented in this gist. Although not shown there, I have also tried running the tests with `--locked`. This does not make a difference, which makes me think that the upgrade to `prodash` itself is relevant, though the way it is relevant could still in principle be through a dependency.
The failures are all specific to `armv7-linux-androideabi` with `cross`, or at least they do not happen in any of the other ways I have tested or that are tested on CI. In particular, this does not happen on the same system when running the tests natively, nor when running them on that system with `cross` and the `s390x-unknown-linux-gnu` target. Likewise, they do not fail on either of the 32-bit targets tested (without `cross`) on CI, one of which is a non-Android ARMv7 target.
#### Indirect dependencies that may be relevant
The changes in prodash 29.0.1 are listed as replacing `humantime` with `jiff` and bumping `jiff` to version 0.2, which the diff confirms. Assuming the latest patch versions were used in both 0.1 and 0.2 in practice, there are a number of relevant differences, but none seem like they would affect `miniz_oxide`.
The reason I was looking for an effect on `miniz_oxide` is that the test failures are panics that happen there. gitoxide uses `miniz_oxide` through `flate2`. If there is any connection to `prodash` or `jiff`, I don't know what it is. But if there isn't, then I don't know why passing `--locked` made no difference: with or without `--locked`, the failures happened with the changes from cf7f34d, and did not happen without them.
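For context on that relationship, here is a minimal, hypothetical sketch (not taken from gitoxide; the input and names are made up) of the code path involved: `flate2` built with its default Rust backend drives `miniz_oxide`'s deflate implementation, so something this small, compiled for the same target in a debug build, might be enough to check whether the panic reproduces outside of gitoxide:

```rust
// Hypothetical standalone reproduction attempt (assumes flate2 with its
// default Rust backend, i.e. miniz_oxide, as in a max-pure build).
use std::io::Write;

use flate2::{write::ZlibEncoder, Compression};

fn main() -> std::io::Result<()> {
    // Small input vaguely resembling a loose git object; the exact bytes are invented.
    let input = b"blob 12\0hello, world";
    // A low compression level is more likely to take the compress_fast path
    // seen in the backtraces below.
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::fast());
    encoder.write_all(input)?;
    let compressed = encoder.finish()?;
    println!("compressed {} -> {} bytes", input.len(), compressed.len());
    Ok(())
}
```

Whether such a tiny, synthetic input would ever trip the overflow is unclear; the failing tests compress real loose-object and pack data, so feeding comparable data would be a closer match.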
But if we look at the `Cargo.lock` changes in cf7f34d, the `flate2` and `miniz_oxide` crates were among those upgraded:
```diff
 [[package]]
 name = "flate2"
-version = "1.0.35"
+version = "1.1.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "c936bfdafb507ebbf50b8074c54fa31c5be9a1e7e5f467dd659697041407d07c"
+checksum = "11faaf5a5236997af9848be0bef4db95824b1d534ebc64d0f0c6cf3e67bd38dc"

 [[package]]
 name = "miniz_oxide"
-version = "0.8.0"
+version = "0.8.5"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "e2d80299ef12ff69b16a84bb182e3b9df68b5a91574d3d4fa6e41b65deec4df1"
+checksum = "8e3e04debbb59698c15bacbb6d93584a8c0ca9cc3213cb423d31f760d8843ce5"
```
Furthermore, the versions of `miniz_oxide` listed in `Cargo.lock` seem to be used even when I do not pass `--locked` or `--frozen`. The compiled version is 0.8.5 when the tests fail, and the compiled version is 0.8.0 when the tests pass with the revert, as though locked versions are being used. Perhaps there is something I don't understand about when `Cargo.lock` versions are guaranteed to be used?
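(For what it's worth, my understanding is that Cargo normally prefers whatever is recorded in `Cargo.lock` as long as the lockfile is present and compatible with `Cargo.toml`; `--locked` only turns any needed lockfile update into a hard error.) One way to double-check which version is actually selected for a given build is a standard `cargo tree` query, sketched here:

```sh
# Show which miniz_oxide version resolves for the Android target, and what
# depends on it (run with and without the revert of cf7f34d to compare).
cargo tree --invert miniz_oxide --target armv7-linux-androideabi
```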
#### Which test modules have failures
Because I am using `cross`, I am using `cargo test` (via `cross test`) rather than `cargo nextest`. The output is accordingly much harder (at least for me) to read. Here's the broad summary at the end of the whole run:
```text
error: 3 targets failed:
    `-p gix --test gix`
    `-p gix-odb-tests --test integrate`
    `-p gix-pack-tests --test pack`
error: Recipe `cross-test` failed on line 234 with exit code 101
```
There are always failures in these three modules and no others. Details about all the specific failures in the linked gist are as follows.
##### Summary of `gix` failures
The failures to which `-p gix --test gix` pertains were:
```text
failures:
    remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled
    remote::fetch::blocking_and_async_io::fetch_with_multi_round_negotiation
    repository::object::commit_as::specify_committer_and_author

test result: FAILED. 319 passed; 3 failed; 0 ignored; 0 measured; 0 filtered out; finished in 40.12s

error: test failed, to rerun pass `-p gix --test gix`
```
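To re-run just one of these under the same configuration, a name filter can be passed through to the test harness. This is only a sketch: it copies the `cross-test` recipe's invocation shown further below and appends a standard `cargo test` name filter.

```sh
CROSS_CONFIG=etc/docker/test-cross.toml NO_PRELOAD_CXX=1 \
    cross test --workspace --no-fail-fast --target armv7-linux-androideabi \
    --no-default-features --features max-pure \
    -- specify_committer_and_author --skip realpath::fuzzed_timeout
```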
Here are links to their full details in the gist:
- `remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled`
- `remote::fetch::blocking_and_async_io::fetch_with_multi_round_negotiation`
- `repository::object::commit_as::specify_committer_and_author`
##### Summary of `gix-odb-tests` failures
The failures to which `-p gix-odb-tests --test integrate` pertains were:
```text
failures:
    odb::store::dynamic::write
    odb::store::loose::write::collisions_do_not_cause_failure
    odb::store::loose::write::it_writes_objects_with_similar_permissions
    odb::store::loose::write::read_and_write

test result: FAILED. 55 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 7.30s

error: test failed, to rerun pass `-p gix-odb-tests --test integrate`
```
Here are links to their full details in the gist:
- `odb::store::dynamic::write`
- `odb::store::loose::write::collisions_do_not_cause_failure`
- `odb::store::loose::write::it_writes_objects_with_similar_permissions`
- `odb::store::loose::write::read_and_write`
##### Summary of `gix-pack-tests` failures
The failure to which `-p gix-pack-tests --test pack` pertains was:
```text
failures:
    pack::data::output::count_and_entries::traversals

test result: FAILED. 45 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 6.25s

error: test failed, to rerun pass `-p gix-pack-tests --test pack`
```
Here is a link to its full details in the gist:

- `pack::data::output::count_and_entries::traversals`
#### Analysis
I don't know what's going on, but all of the failures are "attempt to subtract with overflow" errors in `miniz_oxide`. This is the case both in the run shown in the gist and in other runs where the tests that fail are sometimes slightly different. All the failures seem to resemble each other pretty closely. Here's the first failure output with its backtrace, to illustrate (the others can be viewed in the gist through the specific links, lest this issue become too long):
```text
---- remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled stdout ----
thread 'remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled' panicked at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:1990:38:
attempt to subtract with overflow
stack backtrace:
0: rust_begin_unwind
at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:692:5
1: core::panicking::panic_fmt
at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:75:14
2: core::panicking::panic_const::panic_const_sub_overflow
at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:178:21
3: miniz_oxide::deflate::core::compress_fast
at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:1990:38
4: miniz_oxide::deflate::core::compress_inner
at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:2244:9
5: miniz_oxide::deflate::core::compress
at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:2177:5
6: miniz_oxide::deflate::stream::deflate
at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/stream.rs:54:23
7: <flate2::ffi::rust::Deflate as flate2::ffi::DeflateBackend>::compress
at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/flate2-1.1.0/src/ffi/rust.rs:169:19
8: flate2::mem::Compress::compress
at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/flate2-1.1.0/src/mem.rs:334:9
9: gix_features::zlib::stream::deflate::impls::<impl gix_features::zlib::stream::deflate::Write<W>>::write_inner
at /project/gix-features/src/zlib/stream/deflate/mod.rs:69:30
10: gix_features::zlib::stream::deflate::impls::<impl std::io::Write for gix_features::zlib::stream::deflate::Write<W>>::flush
at /project/gix-features/src/zlib/stream/deflate/mod.rs:107:13
11: <gix_features::hash::write::Write<T> as std::io::Write>::flush
at /project/gix-features/src/hash.rs:182:13
12: gix_odb::store_impls::loose::write::<impl gix_object::traits::Write for gix_odb::store_impls::loose::Store>::write_stream
at /project/gix-odb/src/store_impls/loose/write.rs:91:9
13: gix_odb::store_impls::dynamic::write::<impl gix_object::traits::Write for gix_odb::store_impls::dynamic::Handle<S>>::write_stream
at /project/gix-odb/src/store_impls/dynamic/write.rs:34:26
14: gix_odb::cache::impls::<impl gix_object::traits::Write for gix_odb::Cache<S>>::write_stream
at /project/gix-odb/src/cache.rs:154:13
15: <gix_odb::memory::Proxy<T> as gix_object::traits::Write>::write_stream
at /project/gix-odb/src/memory.rs:216:20
16: gix_object::traits::Write::write_buf
at /project/gix-object/src/traits/mod.rs:16:9
17: gix::repository::object::<impl gix::types::Repository>::write_object_inner
at ./src/repository/object.rs:180:9
18: gix::repository::object::<impl gix::types::Repository>::write_object
at ./src/repository/object.rs:171:9
19: gix::repository::object::<impl gix::types::Repository>::commit_as_inner
at ./src/repository/object.rs:309:25
20: gix::repository::object::<impl gix::types::Repository>::commit_as
at ./src/repository/object.rs:273:9
21: gix::repository::object::<impl gix::types::Repository>::commit
at ./src/repository/object.rs:373:9
22: gix::remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled::create_empty_commit
at ./tests/gix/remote/fetch.rs:97:13
23: gix::remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled
at ./tests/gix/remote/fetch.rs:113:13
24: gix::remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled::{{closure}}
at ./tests/gix/remote/fetch.rs:90:50
25: core::ops::function::FnOnce::call_once
at /rust/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
26: core::ops::function::FnOnce::call_once
at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
Looking at this step in the backtrace:
```text
   3: miniz_oxide::deflate::core::compress_fast
             at /cargo/registry/src/index.crates.io-1949cf8c6b5b557f/miniz_oxide-0.8.5/src/deflate/core.rs:1990:38
```
That is a reference to this line in `miniz_oxide` 0.8.5:
```rust
let mut cur_match_dist = (lookahead_pos - probe_pos) as u16;
```
The blame shows the most recent change to that line as a removal of the `as usize` cast on `probe_pos`. This is from commit Frommi/miniz_oxide@0f50464, where several `as usize` casts are removed. But I think this cannot be the cause of the new failure, since this change is present in all versions of `miniz_oxide` since 0.7.2.
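As an aside on why this surfaces as a panic at all: it is just the usual behavior of integer subtraction with overflow checks enabled, which is the default in the dev/test profiles used by `cargo test`. A made-up illustration (the values are invented, not taken from `miniz_oxide`):

```rust
// Illustration only: overflow checks turn an out-of-range subtraction into a
// panic, while a release build without checks would wrap silently instead.
fn main() {
    let lookahead_pos: usize = 10;
    let probe_pos: usize = 20; // if this ever exceeds lookahead_pos...

    // ...this panics with "attempt to subtract with overflow", as in the
    // backtrace above; without checks it would wrap and then be truncated
    // by the `as u16` cast.
    let cur_match_dist = (lookahead_pos - probe_pos) as u16;
    println!("{cur_match_dist}");
}
```

So the interesting question is presumably why `probe_pos` can exceed `lookahead_pos` on this target at all, not the subtraction itself.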
Inspecting a broader diff, there are substantial changes, including to `deflate/core.rs`, from 0.8.0 to 0.8.5, but I don't know which of them would cause this. Maybe something in the changelog would give a hint.
Assuming the intermediate versions can be resolved without causing problems with other dependencies, I might manage to bisect across `miniz_oxide` versions, or even across `miniz_oxide` commits, using one of the failing gitoxide tests. But perhaps you will have a more specific idea of what is going on.
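If those versions do resolve, pinning them one at a time in `Cargo.lock` might be the quickest way to bisect without editing any manifest. A sketch (whether each `--precise` version is accepted depends on the version requirements of `flate2` and the other dependents):

```sh
# Try a miniz_oxide release between the known-good 0.8.0 and the known-bad
# 0.8.5, then clean and rerun the failing tests; repeat for other versions.
cargo update -p miniz_oxide --precise 0.8.2
gix clean -xde
RUST_BACKTRACE=1 just cross-test armv7-linux-androideabi
```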
### Expected behavior 🤔
Ordinarily I would say I expect that all tests pass. But `cross` images are strange environments in which to run gitoxide tests, because they are x86-64 containers in which code of another architecture is built and then, through QEMU, run. I generally expect a `cross` image to require significant customization to be usable to run gitoxide tests, since gitoxide interacts with various parts of the system, even in a `max-pure` build as used here.
If I had always observed these failures, then I would have assumed the problem was just with the container as I had set it up, and not assumed a bug. But it is strange that it was working fine (once I got `git` to a new enough version and kept the `ld.so` message from polluting `stderr`, as described below), but then cf7f34d broke it. That suggests to me that there is a bug somewhere.
Maybe there is a long-standing bug in gitoxide that this surfaces. Or maybe there is something wrong with the tools in the `cross` image I am using (either in the original image, or from my modifications, detailed below). Or maybe there is a bug in `miniz_oxide`, or in a crate that uses it such as `flate2`.
### Git behavior
#### In Git itself
In the sense of `git` behavior that corresponds to the affected behavior of gitoxide, this is not applicable as far as I know.
#### In how we use it
I also suspect that the way the image is customized to be able to use `git` for tests that run it, either directly or through `.gitignore`d fixtures, is not relevant here. But just in case it is, the most `git`-relevant changes in the container are:
- Enable the `git-core` PPA.
- Upgrade `git`. This is still the x86-64 `git`. (`cross` images are x86-64 images with cross toolchains installed. It is possible to install `git` in a `cross` image for some `cross` targets, but probably not this one.)
- Patch the runner script `cross` uses, to keep it from setting `LD_PRELOAD` to include a `.so` file implementing the C++ standard library. This could be useful in a C++ Android program, but when `ld.so` attempts to preload it in an x86-64 executable, it writes this text to standard error:

  ```text
  ERROR: ld.so: object '/android-ndk/sysroot/usr/lib/arm-linux-androideabi/libc++_shared.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
  ```

  Then the program actually runs. But when we capture standard error and parse or make assertions about it, we get failures. So I kept `LD_PRELOAD` from being set to that path, to avoid this (the effect of the patch is sketched just after this list).
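Concretely, the `sed` expression shown in the Dockerfile below only prefixes the existing assignment with a guard. The effect is roughly this (the right-hand side is a placeholder, not the actual value in the runner script):

```sh
# Before patching (placeholder value):
export LD_PRELOAD=/path/to/libc++_shared.so

# After patching: the export only runs when NO_PRELOAD_CXX is unset or 0.
test "${NO_PRELOAD_CXX:-0}" != 0 || export LD_PRELOAD=/path/to/libc++_shared.so
```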
Full details of how I customized the container, including those not related to making `git` usable, are included below.
### Steps to reproduce 🕹
#### Customizing the container
I'll open a pull request with the changes. That pull request will not fix this issue, but it will be possible to look at the changes in it, at least initially. However, I also intend to improve that pull request, both to show these failures better--there are other problems on CI, related to configuring `binfmt_misc` to use QEMU, that currently obscure them--and to include further changes, both already done and planned, for testing s390x. So the information about how the observations reported here were produced seems valuable to present here.
I made this `etc/docker/test-cross.toml` file:
```toml
# `cross` configuration for running tests. Treated like `Cross.toml` if enabled
# with `CROSS_CONFIG=etc/docker/test-cross.toml`. This avoids affecting other
# `cross` usage, e.g. in `release.yml`. See `cross-test` recipes in `justfile`.

[build.env]
passthrough = [
    "CI",
    "GITHUB_ACTIONS",
    "GIX_CREDENTIALS_HELPER_STDERR",
    "GIX_EXTERNAL_COMMAND_STDERR",
    "GIX_OBJECT_CACHE_MEMORY",
    "GIX_PACK_CACHE_MEMORY",
    "GIX_TEST_CREATE_ARCHIVES_EVEN_ON_CI",
    "GIX_TEST_EXPECT_REDUCED_TRUST",
    "GIX_TEST_IGNORE_ARCHIVES",
    "GIX_VERSION",
    "NO_PRELOAD_CXX",
    "RUST_BACKTRACE",
    "RUST_LIB_BACKTRACE",
]

[target.armv7-linux-androideabi]
image = "cross-rs-gitoxide:armv7-linux-androideabi"

[target.s390x-unknown-linux-gnu]
image = "cross-rs-gitoxide:s390x-unknown-linux-gnu"
```
And this `etc/docker/Dockerfile.test-cross` file:
```dockerfile
ARG TARGET
FROM ghcr.io/cross-rs/${TARGET}:latest

RUN apt-get update && \
    apt-get install --no-install-recommends -y apt-transport-https gnupg && \
    release="$(sed -n 's/^VERSION_CODENAME=//p' /etc/os-release)" && \
    echo "deb https://ppa.launchpadcontent.net/git-core/ppa/ubuntu $release main" \
        >/etc/apt/sources.list.d/git-core-ubuntu-ppa.list && \
    apt-key adv --keyserver keyserver.ubuntu.com \
        --recv-keys F911AB184317630C59970973E363C90F8F1B6217 && \
    apt-get update && \
    apt-get install --no-install-recommends -y git jq patch && \
    rm -rf /var/lib/apt/lists/* && \
    to_patch=/android-runner && \
    patcher='s/^export LD_PRELOAD=/test "${NO_PRELOAD_CXX:-0}" != 0 || &/' && \
    if test -f "$to_patch"; then sed -i.orig "$patcher" -- "$to_patch"; fi && \
    git config --system gitoxide.imaginary.arbitraryVariable arbitraryValue
```
And added these recipes to the `justfile`:
```just
# Build a customized `cross` container image for testing
cross-image target:
    docker build --build-arg "TARGET={{ target }}" -t "cross-rs-gitoxide:{{ target }}" \
        - <etc/docker/Dockerfile.test-cross

# Test another platform with `cross`
cross-test target: (cross-image target)
    CROSS_CONFIG=etc/docker/test-cross.toml NO_PRELOAD_CXX=1 \
        cross test --workspace --no-fail-fast --target "{{ target }}" \
        --no-default-features --features max-pure \
        -- --skip realpath::fuzzed_timeout

# Test s390x with `cross`
cross-test-s390x: (cross-test 's390x-unknown-linux-gnu')

# Test Android with `cross`
cross-test-android: (cross-test 'armv7-linux-androideabi')
```
#### Running the tests
Then, as shown in the gist, I ran various commands, the most important of which were this command, producing the errors:
```sh
RUST_BACKTRACE=1 just cross-test armv7-linux-androideabi
```
And these commands, cleaning the build directory and running the tests with the changes of cf7f34d undone. (To do this recently without having to manually resolve a conflict--but not originally, as shown in this older gist--I had to also revert 00d1a00, but that is not the cause.)
```sh
git revert --no-commit 00d1a00 cf7f34d
gix clean -xde
RUST_BACKTRACE=1 just cross-test armv7-linux-androideabi
```
I have done this quite a few times; the presence of failures is consistent, even if the exact set of failing tests varies. Enabling the backtrace is not the cause. (That would be strange, but who knows--this is already strange.) I have also run the tests multiple times without `RUST_BACKTRACE=1`.
#### The images are not stale
I have also tried forcing all Docker images to be reacquired or rebuilt, even before the second run, by running the following commands. The third command is the one that actually deletes images; the purpose of the others is to verify that there are no containers running or stopped, that there are images before I delete them, and that there are no images after I delete them. Please note that you might not want to run these commands, since they remove any Docker images that are not in use by containers.
```sh
docker container list -a
docker image list -a
docker image prune -a
docker image list -a
```
I would not expect that to have made a difference, and it never did. The main value of it was to let me make sure the first run in the gist would show the whole process of customizing the Docker image, in case somehow that turns out to be relevant.
#### Minor patches to the test suite
Finally, though I think this is the least important for this issue, the changes to allow tests to pass instead of failing, or to run and pass instead of being skipped, were:
```diff
diff --git a/gix-config-value/tests/value/path.rs b/gix-config-value/tests/value/path.rs
index 10970296d..72585522a 100644
--- a/gix-config-value/tests/value/path.rs
+++ b/gix-config-value/tests/value/path.rs
@@ -89,16 +89,16 @@ mod interpolate {
         Ok(())
     }
 
-    #[cfg(windows)]
+    #[cfg(any(target_os = "windows", target_os = "android"))]
     #[test]
-    fn tilde_with_given_user_is_unsupported_on_windows() {
+    fn tilde_with_given_user_is_unsupported_on_windows_and_android() {
         assert!(matches!(
             interpolate_without_context("~baz/foo/bar"),
             Err(gix_config_value::path::interpolate::Error::UserInterpolationUnsupported)
         ));
     }
 
-    #[cfg(not(windows))]
+    #[cfg(not(any(target_os = "windows", target_os = "android")))]
     #[test]
     fn tilde_with_given_user() -> crate::Result {
         let home = std::env::current_dir()?;
diff --git a/tests/tools/tests/umask.rs b/tests/tools/tests/umask.rs
index 0a4cc0425..5463b57ad 100644
--- a/tests/tools/tests/umask.rs
+++ b/tests/tools/tests/umask.rs
@@ -1,6 +1,9 @@
 #[test]
 #[cfg(unix)]
-#[cfg_attr(not(target_os = "linux"), ignore = "The test itself uses /proc")]
+#[cfg_attr(
+    not(any(target_os = "linux", target_os = "android")),
+    ignore = "The test itself uses /proc"
+)]
 fn umask() {
     use std::fs::File;
     use std::io::{BufRead, BufReader};
```
There are a few other changes on the feature branch, but you are unlikely to want to see them in this issue. They do not affect code exercised in any of the tests, and they will be included in the forthcoming PR (which, to be clear, does not attempt to fix this issue).
#### Edits: The pull request
That PR is #1895. The commit at the point described above is 5d4eed7. Later commits may include modifications to some of the files shown here. (The history to that commit can be seen here, though it's possible that could eventually be GC'd following a rebase.)
As of e207b27 ("Configure binfmt_misc for QEMU on CI"), CI in #1895 shows the same bug as here, without any unrelated failures. So that may now be a good way to examine the bug. (Comparison of the failures there to those in the gist may also illuminate something about what changes and what stays the same across different runs.) For convenience, here's a failing run, and:

- here are its `-p gix --test gix` failures
- here are its `-p gix-odb-tests --test integrate` failures
- here are its `-p gix-pack-tests --test pack` failures
#### The possible role of `cargo test` (vs. `cargo nextest`) seems small
On this separate branch (currently at this commit), I ran an experiment where other tests, especially the `test-32bit` jobs, are changed to be more like these, including by making them use `cargo test`. This does not ever seem to cause them to fail.
So while the nondeterminism in exactly which test cases fail suggests a possible role of `cargo test` using multithreading, it does not seem to be a significant contributing factor to the effect (see also the sketch after this list):

- The effect only occurs since dependencies were updated.
- Even after that, it does not occur in any tests except those being experimented with on `armv7-linux-androideabi`.
- Not even if other tests are changed to use `cargo test` so they are more similar to how `armv7-linux-androideabi` is tested.
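If threading does play a role, forcing the harness to run single-threaded might confirm or rule it out. A sketch (this simply copies the `cross-test` recipe's invocation and adds the standard libtest `--test-threads` flag):

```sh
CROSS_CONFIG=etc/docker/test-cross.toml NO_PRELOAD_CXX=1 \
    cross test --workspace --no-fail-fast --target armv7-linux-androideabi \
    --no-default-features --features max-pure \
    -- --test-threads=1 --skip realpath::fuzzed_timeout
```

I have not run this, so whether the failures (or their variability) change under a single thread remains to be seen.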