Skip to content

[WIP] Proposal: mime type shortcuts #10172

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

guillep2k
Copy link
Member

Related #5750, #2619 (comment), #7266 (comment)

Since users sometimes have a hard time configuring MIME types (e.g. different notations from dropzone, browsers, Gitea), I thought we could "pre-cook" a list of common cases. These can be specified in ALLOWED_TYPES as well as the normal MIME types:

So:

ALLOWED_TYPES = application/gzip, application/x-gzip, application/x-gtar, application/x-tgz

Could be written as:

ALLOWED_TYPES = compressed

And:

ALLOWED_TYPES = image/jpeg, image/png, image/apng, image/bmp, image/gif

Could be written as:

ALLOWED_TYPES = image

The idea is that these notations can be combined, so the following notation is valid:

ALLOWED_TYPES = compressed, image, application/java-archive, application/vnd.google-earth.kml+xml, application/vnd.google-earth.kmz

The above example will expand to the list of pre-defined compressed and image types, and the three additional specific (formal) MIME types: application/java-archive, application/vnd.google-earth.kml+xml and application/vnd.google-earth.kmz.

This exploits the fact that MIME types are always formatted as "name/name" (or "name/", "/*"), so we could take advantage and use simple names for aliases.

This PR can be considered a work in progress because the small list of aliases I made should have some discussion. I've found that it was easier submitting a PR to showcase my idea.

@codecov-io
Copy link

codecov-io commented Feb 7, 2020

Codecov Report

Merging #10172 into master will increase coverage by 0.22%.
The diff coverage is 30.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10172      +/-   ##
==========================================
+ Coverage   43.42%   43.65%   +0.22%     
==========================================
  Files         577      577              
  Lines       79666    79740      +74     
==========================================
+ Hits        34594    34807     +213     
+ Misses      40795    40636     -159     
- Partials     4277     4297      +20
Impacted Files Coverage Δ
routers/api/v1/org/member.go 16.01% <0%> (ø) ⬆️
modules/git/repo_commit.go 53.39% <0%> (ø) ⬆️
modules/repository/create.go 57.44% <0%> (ø) ⬆️
routers/repo/issue_label.go 42.18% <0%> (ø) ⬆️
routers/user/profile.go 50.69% <0%> (-0.97%) ⬇️
routers/routes/routes.go 86.47% <100%> (-0.05%) ⬇️
services/pull/update.go 52.27% <100%> (+1.1%) ⬆️
routers/api/v1/repo/label.go 84% <100%> (+84%) ⬆️
services/pull/merge.go 50.44% <14.58%> (-0.92%) ⬇️
models/pull.go 41.82% <15.78%> (-0.85%) ⬇️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 218f494...6810938. Read the comment docs.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 7, 2020
@zeripath
Copy link
Contributor

zeripath commented Feb 9, 2020

I think this is a good idea, however, realistically dropzone should be doing this sort of matching as it's there that the mimetype variation is really being generated.

I've not looked but it's dropzone another one of our out of date dead upstream projects that we have picked up?

PS I couldn't see what the problem was causing your code to fail to build earlier!!

@guillep2k
Copy link
Member Author

I've not looked but it's dropzone another one of our out of date dead upstream projects that we have picked up?

But we need a MIME type list on the server side anyway, no matter what version of dropzone we use. This method would simplify building the server side list by providing aliases.

PS I couldn't see what the problem was causing your code to fail to build earlier!!

Just sloppiness from my part. 😅

@silverwind
Copy link
Member

I think the approach using MIME types is flawed. Users generally do not know mime types (which are ambiguous in many cases and different browsers send different mime types for the same filename) but file extensions (which are never ambiguous).

So instead of building on a flawed concept, I'd propose we just support both like browsers already do.

Dropzone is still maintained at https://github.com/enyo/dropzone but our version is out of date. I think it's acceptedFiles option basically functions the same as the accepts attribute which delegates matching to the browser.

Another change I'd like to see is to set repository.upload.ALLOWED_TYPES and attachment.ALLOWED_TYPES to default to a empty value (which means everything is accepted and generally is what users want). repository.upload.ALLOWED_TYPES already supports empty value to match all but the attachment option only supports this weird */*.

@silverwind
Copy link
Member

silverwind commented Feb 17, 2020

But we need a MIME type list on the server side anyway

Do we really need it? The server does see the filename as part of the multipart header, so the extension could just be extracted from there so we can match on both extension or MIME type:

------Boundary
Content-Disposition: form-data; name="file"; filename="gitea.png"
Content-Type: image/png

@guillep2k
Copy link
Member Author

@silverwind Sadly, the problem of accepting any kind of files is JavaScript injection.

@techknowlogick techknowlogick added the pr/wip This PR is not ready for review label Feb 17, 2020
@silverwind
Copy link
Member

Yeah, I see those checks for */* in both backend and frontend. I still think we can make it work with empty value so both options would take the same values (maybe still accept */*for compat). So what needs to be done imho:

  • Get the backend parser to accept .ext format and other stuff supported by browsers like image/*.
  • Make both options accept the same formats (share the parser).
  • Decide on a value that allows all (either empty or */*).
  • Update default value for both options to allow all.

@jolheiser
Copy link
Member

zip files alone can be any of the following

  • application/zip
  • application/octet-stream
  • application/x-zip-compressed
  • multipart/x-zip

I'm not sure under what conditions any of the above are true, but it seems like MIME checking is somewhat fragile. 🙃
Or maybe I just don't understand it well enough. 😅

@silverwind
Copy link
Member

silverwind commented Mar 10, 2020

I think mime type lookup may in some cases even be OS or distribution-specific in the case of Linux. It really depends what Mime type database is used. IIRC, there is a official one from IANA, but it's so vastly lacking in data, everyone just uses their own databases. It's a failed "standard".

@guillep2k
Copy link
Member Author

That's why I think we should provide the option for a "curated" list. Bad as it is, MIME types are the only option ATM.

@silverwind
Copy link
Member

MIME types are the only option ATM

Why? I see no issue supporting both extensions and mime (and in a later release, drop support for mime).

@stale
Copy link

stale bot commented May 10, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

@stale stale bot added the issue/stale label May 10, 2020
@guillep2k
Copy link
Member Author

From the author of dropzone, just for context: https://stackoverflow.com/a/17275873/225979

@stale stale bot removed the issue/stale label May 21, 2020
@silverwind
Copy link
Member

I guess ideally we'd want to match Dropzone's implementation exactly on the backend.

@guillep2k
Copy link
Member Author

I guess ideally we'd want to match Dropzone's implementation exactly on the backend.

Sounds like a plan.

I've been browsing the Dropzone docs and code and I have the impression that the actual MIME type is reported by the browser itself (not calculated by the library), which leaves us again the problem of validating whether the file matches the reported type. I mean, we could believe the UA about it, but not much else. Personally I could live with that; we're already serving files that say to be pictures and such and we're not really validating it's actual content.

@mrsdizzie
Copy link
Member

@guillep2k A user could send a POST that has whatever mime type header they choose, so you can't rely on the HTTP header mime type value for security.

The reason for having a mime check in dropzone (or any upload script) is as mostly as a slight convenience for both the user and the server that it doesn't let normal users go through the process of uploading files that it already knows the server will then reject after validation.

Malicious users can still send whatever mime type they want in a request so a check should still be done on the server and not just relying on what the user sent (and then ideally you'd want both checks to have the same results on legitimate files).

https://github.com/gabriel-vasile/mimetype seems like it could be much nicer than what we use now on the backed too which is https://golang.org/pkg/net/http/#DetectContentType and seems limited in comparison.

@guillep2k
Copy link
Member Author

The pedantic exposition

The problem comes from mixing file extensions with MIME types. There are multiple file types that share the same file extension, and most of them specialized can only be matched to application/octet‑stream. We want to match the file type for more than one reason:

  1. Security: the user should not be able to upload files of a forbidden type, either because:
    1. It can be a blatantly malicious file (e.g. an executable with bad code).
    2. It can be a hacking attempt; a file might be masked as a different type than it actually is (e.g. executable as image).
    3. The administrator wants to prevent users from sharing files that might expose e.g. company secrets or are susceptible of containing viruses (e.g. Word or Excel documents).
  2. Consistency: we want to match what comes in with what comes out.
    1. We want to provide the correct Content-Type header when serving the file.
    2. We want to show a matching icon for the file (e.g. 📔 with .pub files).
    3. We might not want to rely on the UA to recognize a file properly. e.g. we may have better knowledge of the kind of files that can come in, as in a list of CAD files the browser might mistake for something else.
    4. The opposite, we may want to rely on the UA as it might have better knowledge than us about the files it's sending. e.g. a script uploading files with curl for CI purposes might send a specific Content-Type that better suits a content, like a jar file that we might naively identify as application/zip.

There are two additional difficulties to consider:

  1. Many file types genuinely share the same extension: .a as ADA source file, .a as POSIX library.
  2. *nix systems don't event require extensions in their file names (although fortunately they're used in most cases).

These problems are related but no the same, and require different, even conflicting approaches.

The poll

We should decide on what kind of service we want to give, so we can take a better decision on the approach to take. So I propose a vote on the IMHO most important aspects (please note that some choices are easy to implement and other very difficult):

  1. Provided that we accept a given file: should we honor the provided Content-Type in any cases? e.g. if we're being sent a myface.jpg file that turns out to be a text document... should we serve it as plain/text or as image/jpeg? Vote 👎 for plain/text and 👍 for image/jpeg.
  2. If by configuration we are told to accept files by file extension (e.g. *.ipt, *.doc), should we just trust the file name and disregard the file type? Vote 😄 for removing any validation besides the file name, 😕 for performing some validation (?) and 👀 for using a library to match magic numbers and only support files that are supported by it.
  3. Should we use X-Content-Type-Options: nosniff when we provide a Content-Type other than a picture? Vote ❤️ for yes and 🎉 for no.

@CirnoT
Copy link
Contributor

CirnoT commented May 22, 2020

We want to show a matching icon for the file (e.g. 📔 with .pub files).

This will always be error prone as mentioned already; many types share same extension. The only proper solution here is to show generic icon, otherwise whether we depend on Content-Type or extension, there is always potential to exploit it by having Gitea mislead user with icon.

It can be a blatantly malicious file (e.g. an executable with bad code).

An executable file is desired on Release; ZIP files are no less dangerous if properly crafted.

Additionally we should never rely on Content-Type sent by browser, a file should be inspected after upload by us, to determine its Content-Type.

@stale stale bot added the issue/stale label Jul 25, 2020
@lunny lunny added the issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented label Jul 26, 2020
@stale stale bot removed the issue/stale label Jul 26, 2020
@silverwind
Copy link
Member

#12465 should make this obsolete. It will allow combinations of mime type and file extensions like

.zip,text/plain,image/*
.doc,.docx,application/msword,application/vnd.openxmlformats-officedocument.wordprocessingml.document

That seecond example is straight from MDN and by supporing the same format as browsers do we can efficiently delegate the job of identifying mime types to the user without having tediously maintain shortcuts like proposed in this PR.

silverwind added a commit to silverwind/gitea that referenced this pull request Oct 1, 2020
- Add support for file extensions, matching the `accept` attribute of `<input type="file">`
- Add support for type wildcard mime types, e.g. `image/*`
- Create repository.release.ALLOWED_TYPES setting (default unrestricted)
- Change default for attachment.ALLOWED_TYPES to a list of extensions
- Split out POST /attachments into two endpoints for issue/pr and
  releases to prevent circumvention of allowed types check

Fixes: go-gitea#10172
Fixes: go-gitea#7266
Fixes: go-gitea#12460
Ref: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/file#Unique_file_type_specifiers
@codecov-commenter
Copy link

Codecov Report

Merging #10172 into master will increase coverage by 0.01%.
The diff coverage is 52.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10172      +/-   ##
==========================================
+ Coverage   43.63%   43.65%   +0.01%     
==========================================
  Files         576      577       +1     
  Lines       79716    79740      +24     
==========================================
+ Hits        34783    34807      +24     
+ Misses      40638    40636       -2     
- Partials     4295     4297       +2     
Impacted Files Coverage Δ
modules/setting/common_mime.go 45.45% <45.45%> (ø)
modules/setting/repository.go 54.05% <100.00%> (+2.62%) ⬆️
modules/setting/setting.go 44.07% <100.00%> (ø)
modules/log/file.go 75.52% <0.00%> (-2.10%) ⬇️
modules/log/event.go 64.61% <0.00%> (-1.03%) ⬇️
models/gpg_key.go 54.81% <0.00%> (-0.56%) ⬇️
routers/repo/view.go 40.00% <0.00%> (+0.86%) ⬆️
modules/queue/unique_queue_disk_channel.go 54.48% <0.00%> (+1.92%) ⬆️
services/pull/check.go 35.36% <0.00%> (+2.43%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b325592...6810938. Read the comment docs.

techknowlogick pushed a commit that referenced this pull request Oct 5, 2020
)

* Attachments: Add extension support, allow all types for releases

- Add support for file extensions, matching the `accept` attribute of `<input type="file">`
- Add support for type wildcard mime types, e.g. `image/*`
- Create repository.release.ALLOWED_TYPES setting (default unrestricted)
- Change default for attachment.ALLOWED_TYPES to a list of extensions
- Split out POST /attachments into two endpoints for issue/pr and
  releases to prevent circumvention of allowed types check

Fixes: #10172
Fixes: #7266
Fixes: #12460
Ref: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/file#Unique_file_type_specifiers

* rename function

* extract GET routes out of RepoMustNotBeArchived

Co-authored-by: Lauris BH <[email protected]>
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. pr/wip This PR is not ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.