-
-
Notifications
You must be signed in to change notification settings - Fork 5.8k
Don't treat BOM escape sequence as hidden character. #18909
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Gusted
commented
Feb 26, 2022
- BOM sequence is a common non-harmfull escape sequence, it shouldn't be shown as hidden character.
- Follows GitHub's behavior.
- Resolves BOMs are treated as illegal escape sequences in at least the diff view /commit/* #18837
- BOM sequence is a common non-harmfull escape sequence, it shouldn't be shown as hidden character. - Follows GitHub's behavior. - Resolves go-gitea#18837
Could we add some test? |
Might need to to the first two (three if we care about little endian) of this table: https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding |
Not sure if it's correctly to add it, we uses the |
They might decode the same, not sure. Also see: If you add a test, it should tell you thought whether that works :) |
UTF-16 BOM is certainly the more widespread one in use, a lot of files created by Microsoft software will add it, for example CSV exports from Excel. |
Seems like a wider issue to me, every UTF-16 codepoint would be shown as broken codepoint, as they won't be decoded correctly by the current code. |
It's generally not an issue because stuff like Office documents are at least partially binary data, so they don't render as text. CSV is an exception thought, but we actually have a separate renderer for it too. |
Should it be fixed within this PR or a separate? Adding the logic for utf16 sounds a bit out-of-scope for this PR. |
I'd like to see at least both UTF-8 and UTF-16 BOM to be identified here, if it's not that hard. UTF-16 BOM is very commonplace. UTF-8 does not recommend BOM usage, so I think it's not in much use generally. |
Including UTF-16 BOM requires "fixing" the logic to not error utf-16 codepoints as broken codepoints. As they are checked before any of this logic is being carried out. |
https://go.dev/play/p/dEzAtdBP3oq Seems like utf16 doesn't pick up the BOM codepoints correctly. Might be using it the wrong way. |
UTF-16 and UTF-8 are totally different, the relation of them is just like one is PNG and the other is WebP, there is nothing in common. And UTF-16 has big-endian and little-endian variants. For a UTF-8 code/processor, the UTF-16 data (including BOM) is totally invalid binary data. |
All i know is that CSVs exported by Excel have the UTF-16 BOM but the data otherwise is plain ASCII. Basically, UTF-16 BOM should not trigger any warnings, and should ideally just be removed before display. |
I think it's clear that this UTF-16 is wider issue and shouldn't be included into this PR, unless there's a clear testcase that shows the BOM is shown as hidden character(not as broken codepoint) it should be included. |
Are you sure your UTF-16 files can be rendered in Gitea correctly? If not, it's a unrelated problem. |
Here's an example file: https://try.gitea.io/silverwind/symlink-test/src/branch/master/utf16bom.txt gitea warns on "hidden" characters and strangely enough renders some japanese symbols and other garbage. Still, I think this sequence should not trigger a warning. |
Yeah the rendering of that file is a separate issue. Actually, IIRC correctly, one has to put the UTF-16 BOM into UTF-8 encoded documents for Excel to correctly identify UTF-8 content (it otherwise interprets as ASCII), so it is certainly a non-standard edge case. We should not need to support UTF-16, but the sample file should at least not warn on that BOM. But it can be fixed in a separate issue along with the rendering to make rendering match GitHub. |
From what I can see in the code, that's the issue to be solved first, before we can properly support this UTF-16 BOM. As the bytes that are given to the
|
Hmm ..... Sorry I made a mistake. It seems that currently Gitea does support UTF-16 BOM: https://try.gitea.io/wxiaoguang/test/src/branch/master/test-utf-16be-bom.txt Well, we need some more work ............ |
See my examples
There are |
Tested the file, the current PR covers this file correctly and doesn't show and hidden characters. |
Could we just check the first Rune? Generally LGTM. And add the UTF-16 file for the tests. (I just tried to add a commit about only checking the first Rune, if it's wrong, please revert ....) |
62184d2
to
81d2ced
Compare
Yeah noticed that as well while trying to add it, apparently HTML text can be some of the first runes 😅 , Seems like Git decided it was a nice time to force push... |
7d6bc60
to
cec21e6
Compare
Sorry I pushed into this PR by mistake (and reverted by a force-push 😂) My proposed solution is: Gusted#2 It can pass the unit tests, and only check the first Rune for BOM. Details behind the problems:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI says that the unit-tests PASS ~~
* giteaofficial/main: Fix page and missing return on unadopted repos API (go-gitea#18848) [skip ci] Updated licenses and gitignores Allow adminstrator teams members to see other teams (go-gitea#18918) Update nginx reverse proxy docs (go-gitea#18922) Don't treat BOM escape sequence as hidden character. (go-gitea#18909) Remove CodeMirror dependencies (go-gitea#18911) Uncapitalize errors (go-gitea#18915) Disable service worker by default (go-gitea#18914) Set is_empty in fixtures (go-gitea#18869) Don't update email for organisation (go-gitea#18905) Correctly link URLs to users/repos with dashes, dots or underscores (go-gitea#18890) Set is_private in fixtures. (go-gitea#18868) Fix team management UI (go-gitea#18886) Update JS dependencies (go-gitea#18898) Fix migration v210 (go-gitea#18892) migrations: add test for importing pull requests in gitea uploader (go-gitea#18752)
* Don't treat BOM escape sequence as hidden character. - BOM sequence is a common non-harmfull escape sequence, it shouldn't be shown as hidden character. - Follows GitHub's behavior. - Resolves go-gitea#18837 Co-authored-by: wxiaoguang <[email protected]>