Charset detection returns inconsistent results

Running the same string through `charset.DetectEncoding()` can return different results from one call to the next. This is what has been making our CI tests fail randomly at `TestToUTF8WithFallback` and `TestToUTF8`.

The underlying library for encoding detection is [github.com/gogits/chardet](https://github.com/gogits/chardet), which runs the given string through many "detectors" that each return a confidence level.

The problem is that these detectors are run in goroutines and report their results through a channel. Many of them return the same level of confidence, and the first to report ***wins***.

https://github.com/gogs/chardet/blob/2404f777256163ea3eadb273dada5dcb037993c0/detector.go#L95-L111
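
To illustrate the effect in isolation, here is a minimal sketch of a "first to report wins" tie. It is not the library's code: the `result` type, the `detectFirstWins` function and the confidence values are all made up for the example.

```go
package main

import "fmt"

// result is a hypothetical stand-in for what a single detector reports.
type result struct {
	Charset    string
	Confidence int
}

// detectFirstWins reports each candidate from its own goroutine and keeps
// the highest-confidence result. On a tie the earlier arrival is kept, so
// the winner depends on goroutine scheduling.
func detectFirstWins(candidates []result) result {
	out := make(chan result)
	for _, c := range candidates {
		go func(c result) { out <- c }(c)
	}
	var best result
	for range candidates {
		r := <-out
		if r.Confidence > best.Confidence { // strict '>': ties keep the first arrival
			best = r
		}
	}
	return best
}

func main() {
	// Illustrative values only: two detectors tie at the top.
	candidates := []result{
		{"ISO-8859-1", 30},
		{"ISO-8859-2", 30},
		{"UTF-8", 10},
	}
	seen := map[string]int{}
	for i := 0; i < 1000; i++ {
		seen[detectFirstWins(candidates).Charset]++
	}
	// The split between the two ISO-8859 variants may differ on every run.
	fmt.Println(seen)
}
```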

So, the strings that were failing in the `charset` tests are detected as `ISO-8859-1` most of the time, but from time to time they're detected as `ISO-8859-2`, which produces a different string when converted to UTF-8, thus making the test fail.
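
To see why the guess matters, here is a small sketch of the same bytes decoded under both encodings. The input is invented for the example (it is not one of the failing test strings), and the ISO-8859-2 table is a hand-picked excerpt covering only the bytes used here, not a full code page.

```go
package main

import "fmt"

// latin2High is a tiny excerpt of ISO-8859-2: only the high bytes needed
// for this example. A real decoder would need the full table.
var latin2High = map[byte]rune{
	0xA3: 'Ł', // U+0141
	0xB3: 'ł', // U+0142
}

// decodeLatin1 converts ISO-8859-1 bytes to a UTF-8 string. In ISO-8859-1
// every byte value equals its Unicode code point.
func decodeLatin1(b []byte) string {
	out := make([]rune, 0, len(b))
	for _, c := range b {
		out = append(out, rune(c))
	}
	return string(out)
}

// decodeLatin2 converts ISO-8859-2 bytes to a UTF-8 string. It is only
// correct for ASCII bytes and the bytes listed in latin2High.
func decodeLatin2(b []byte) string {
	out := make([]rune, 0, len(b))
	for _, c := range b {
		if r, ok := latin2High[c]; ok {
			out = append(out, r)
		} else {
			out = append(out, rune(c))
		}
	}
	return string(out)
}

func main() {
	raw := []byte{'z', 0xB3, 'o', 't', 'y'}
	fmt.Println(decodeLatin1(raw)) // prints "z³oty"
	fmt.Println(decodeLatin2(raw)) // prints "złoty"
}
```

Both results are valid UTF-8, so nothing downstream can tell that the wrong table was used.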

From the library's point of view this is not strictly a bug, since all it does is _guess_ the character set. However, it's not unreasonable to expect reproducibility in the results.
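
One possible way to get reproducible results would be to break confidence ties with a fixed rule instead of arrival order. The sketch below is only an assumption about what such a fix could look like: the `result` type and `sortDeterministic` are hypothetical names, and the alphabetical tie-break is an arbitrary choice (a real fix might prefer an explicit priority list).

```go
package main

import (
	"fmt"
	"sort"
)

// result is a hypothetical stand-in for what a single detector reports.
type result struct {
	Charset    string
	Confidence int
}

// sortDeterministic orders results by descending confidence and breaks
// ties by charset name, so the order no longer depends on which detector
// reported first.
func sortDeterministic(results []result) {
	sort.Slice(results, func(i, j int) bool {
		if results[i].Confidence != results[j].Confidence {
			return results[i].Confidence > results[j].Confidence
		}
		return results[i].Charset < results[j].Charset
	})
}

func main() {
	// The same three results in two different arrival orders.
	a := []result{{"ISO-8859-2", 30}, {"UTF-8", 10}, {"ISO-8859-1", 30}}
	b := []result{{"ISO-8859-1", 30}, {"ISO-8859-2", 30}, {"UTF-8", 10}}
	sortDeterministic(a)
	sortDeterministic(b)
	fmt.Println(a[0].Charset, b[0].Charset) // prints "ISO-8859-1 ISO-8859-1"
}
```

This does not make the guess any more correct, it only makes it the same guess every time.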

This situation is probably causing Gitea to parse strings inconsistently from time to time.

So, the test in `charset` is "easy to fix": we could just delete it or reduce its expectations. Obviously this defeats the purpose of having the test. Fixing the library will probably be much harder.

What saddens me is that *I knew about it* and totally forgot:

https://github.com/go-gitea/gitea/blob/5e759b60cca3cd8484a6235fcc9120d18e8cd455/modules/charset/charset_test.go#L228-L231

	// due to a race condition in `chardet` library, it could either detect
	// "ISO-8859-1" or "ISO-8859-2" here. Technically either is correct, so
	// we accept either.
	assert.Contains(t, encoding, "ISO-8859")

Charset detection returns inconsistent results #8474

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Charset detection returns inconsistent results #8474

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions