Description
I'm running Kerchunk at scale across terabytes of NetCDF data and have noticed a consistent issue in my validation process, where certain comparisons between the NetCDF and Kerchunk data fail.
I've found that when comparing the max, min, and mean of arrays of arbitrary sizes, occasionally two of the three tests fail, typically the mean comparison plus one of the other two. I believe this is explained by the Kerchunk array containing more NaN/near-zero values: a max or min comparison may still pass, but the mean comparison will fail. I think the cause is chunk requests being refused or failing, which leaves that part of the array empty (near-zero). A sketch of the comparison I'm running is below.
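For reference, this is roughly the validation I'm doing, assuming both sides are opened with xarray; the function, variable, and path names here are hypothetical, not part of any library API:

```python
# Minimal sketch of the max/min/mean validation, assuming xarray + the
# standard "reference://" pattern for reading Kerchunk reference files.
import numpy as np
import xarray as xr

def compare_stats(native_path: str, kerchunk_ref: str, var: str, rtol: float = 1e-5):
    """Compare max/min/mean of one variable between native NetCDF and Kerchunk."""
    native = xr.open_dataset(native_path)[var]
    virtual = xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": kerchunk_ref},
        },
    )[var]

    results = {}
    for name, fn in (("max", np.nanmax), ("min", np.nanmin), ("mean", np.nanmean)):
        a, b = float(fn(native.values)), float(fn(virtual.values))
        results[name] = bool(np.isclose(a, b, rtol=rtol))
    # A bad chunk typically produces e.g. {'max': True, 'min': False, 'mean': False}
    return results
```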
I know the array chunks come out as near-zero when a chunk request goes wrong because these values can cause overflow errors in certain cases, such as when decoding time values, since the values are so small (~10^-300). Over a set of 1000 datasets (100+ NetCDF files in each), I'm finding this problem affects ~10% of the datasets, with no dependence on the type of dataset. The issue comes up at random and is only detectable by direct comparison with the native data, or when it causes a decoding error.
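For illustration, this is roughly how the corruption can be flagged without the native data; the threshold is an assumption based on the ~10^-300 values I've observed, not a Kerchunk or fsspec constant:

```python
# Hedged sketch: flag arrays containing denormal-scale values that real
# geophysical data is very unlikely to produce. The `tiny` cutoff is an
# assumption, chosen to sit well above the ~1e-300 garbage values.
import numpy as np

def has_suspect_values(arr: np.ndarray, tiny: float = 1e-250) -> bool:
    finite = arr[np.isfinite(arr)]
    # Nonzero values absurdly close to zero point at an uninitialised or
    # partially-filled chunk buffer rather than genuine measurements.
    return bool(np.any((finite != 0) & (np.abs(finite) < tiny)))
```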
This could be a major issue for using Kerchunk at such a wide scale, and there really needs to be some kind of check within fsspec at the point of making a request, to ensure the correct data was received rather than silently filling the chunk with garbage.
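To make the suggestion concrete, here is a sketch of the kind of check I mean, assuming the usual Kerchunk `[url, offset, length]` reference triples; this is illustrative, not a proposed fsspec patch:

```python
# Sketch of a verified chunk fetch: compare the byte count actually
# returned against the length recorded in the Kerchunk reference, and
# fail loudly instead of decoding a short/empty response.
import fsspec

def fetch_chunk_verified(url: str, offset: int, length: int) -> bytes:
    fs, path = fsspec.core.url_to_fs(url)
    data = fs.cat_file(path, start=offset, end=offset + length)
    if len(data) != length:
        raise IOError(
            f"chunk fetch for {path} returned {len(data)} bytes, expected {length}"
        )
    return data
```

Something along these lines inside fsspec's range-request path would turn these silent near-zero chunks into explicit, retryable errors.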