
Increase the default media chunksize to 100MB. #482


Merged: craigcitro merged 1 commit into googleapis:master on Mar 1, 2018


craigcitro
Contributor

The current DEFAULT_CHUNK_SIZE of 512 KB leads to unnecessarily slow
transfers, as in googlecolab/backend-container#1. Increasing this simply
lowers the overhead for large transfers.
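
To make the overhead concrete, here is a minimal sketch (not code from this PR) that counts the chunk requests a transfer would need and the round-trip time they cost in total. The helper names, the 1 GiB file size, and the 100 ms round trip are illustrative assumptions, and the model assumes one blocking round trip per chunk with no pipelining.

```python
KIB = 1024
MIB = 1024 * KIB


def chunk_requests(file_size: int, chunk_size: int) -> int:
    """Return how many chunk requests are needed to send file_size bytes."""
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    # Integer ceiling division; an empty file still needs one request.
    return max(1, -(-file_size // chunk_size))


def round_trip_overhead_seconds(
    file_size: int, chunk_size: int, round_trip_seconds: float
) -> float:
    """Total time spent waiting on round trips, one per chunk (assumed model)."""
    return chunk_requests(file_size, chunk_size) * round_trip_seconds


if __name__ == "__main__":
    file_size = 1024 * MIB  # hypothetical 1 GiB upload
    for label, chunk in (("512 KiB", 512 * KIB), ("100 MiB", 100 * MIB)):
        requests = chunk_requests(file_size, chunk)  # 2048 and 11 respectively
        overhead = round_trip_overhead_seconds(file_size, chunk, 0.1)
        print(f"{label}: {requests} requests, {overhead:.1f} s of round trips")
```

Under these assumptions the smaller chunk size needs 2048 requests for the same file that the larger one sends in 11, which is the per-request overhead the change is meant to reduce.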

Fixes #481.

PTAL @jonparrott -- I couldn't think of a meaningful test that didn't involve basically repeating the constant in another file, which feels silly. 😀

@googlebot added the "cla: yes" label Feb 28, 2018
@theacodes
Contributor

theacodes commented Mar 1, 2018

100 MB is a very big increase.

resumable-media uses 256 * 1024: https://github.com/GoogleCloudPlatform/google-resumable-media-python/blob/master/google/resumable_media/common.py#L25

@craigcitro
Contributor Author

So I forgot to mention the provenance of that constant -- it did not just come out of my hat:
https://github.com/GoogleCloudPlatform/gsutil/blob/5fb1a8f9888f8a7a0c7f14ffe6453ab60ea4e712/gslib/util.py#L839

I have vague memories of @thobrla doing some amount of experimentation to come up with that constant, so I was willing to just trust it.

Do we have any benchmarks or experiments we can use to get a sense for a good vs. bad chunksize? Is there something you're worried about with a large chunksize? (Note that most requests are already retried, so I think "oh internet connections can be wonky" isn't too compelling.)

@craigcitro
Contributor Author

Oh @dhermes did you run any experiments when picking that chunksize for resumable-media?

@dhermes
Contributor

dhermes commented Mar 1, 2018

@craigcitro Nope, that was just copy-pasta from google-apitools

@craigcitro
Contributor Author

yeah, p. sure wikipedia should add a citation to this bug for "poetic justice". 😁

Proposal: let's switch to the gsutil default everywhere? I'm happy to follow this up with a PR to resumable media.

@theacodes
Contributor

yeah, p. sure wikipedia should add a citation to this bug for "poetic justice".

Legitimately laughing out loud.

My concern was making a large behavioral change in a maintenance-mode library. If this is plucked from gsutil and you've confirmed it works, then I'll allow it.


@craigcitro merged commit cb2ce4f into googleapis:master Mar 1, 2018
@craigcitro deleted the chunksize branch March 1, 2018 17:28
@thobrla

thobrla commented Mar 1, 2018

FYI: I didn't do any deep performance testing to come up with the constant. Instead, I was trying to find a reasonable ceiling for this formula:

Resumable chunk size / round-trip write latency to GCS = Maximum single-stream throughput

Plugging in some numbers:
256 KB / 20 ms ≈ 100 Mbit/s. A.K.A. slow even with very nice, low write latency.
100 MB / 300 ms ≈ 2,667 Mbit/s. A.K.A. fast even with high write latency.
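
The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming one chunk per round trip and decimal megabits (1 Mbit = 10^6 bits); the function name is hypothetical, and whether "KB" and "MB" mean powers of 1000 or 1024 shifts the result by a few percent.

```python
def max_throughput_mbit_per_s(chunk_bytes: int, latency_seconds: float) -> float:
    """Ceiling on single-stream throughput: one chunk per round trip."""
    if chunk_bytes <= 0 or latency_seconds <= 0:
        raise ValueError("chunk_bytes and latency_seconds must be positive")
    bits_per_second = chunk_bytes * 8 / latency_seconds
    return bits_per_second / 1e6  # decimal megabits per second


if __name__ == "__main__":
    # 256 KB (taken as 256 * 1024 bytes) at 20 ms write latency.
    print(f"{max_throughput_mbit_per_s(256 * 1024, 0.020):.1f} Mbit/s")
    # 100 MB (taken as 100 * 10**6 bytes) at 300 ms write latency.
    print(f"{max_throughput_mbit_per_s(100 * 10**6, 0.300):.1f} Mbit/s")
```

With 100 MB taken as 100 * 1024 * 1024 bytes instead, the second figure comes out near 2,800 Mbit/s, so the conclusion does not depend on the unit convention.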

craigcitro added a commit to craigcitro/google-resumable-media-python that referenced this pull request Mar 1, 2018
The goal here is to lower latency for transfers, as well as having some
consistency between client libraries.

For more discussion, see
googleapis/google-api-python-client#482.
@Ni-Chen

Ni-Chen commented Sep 29, 2019

Hi guys,

Do you know how to "Update the default UPLOAD_CHUNK_SIZE to 100MB" when I use Colab to mount Google Drive?

Thanks a lot.

Ni

@busunkim96
Contributor

@Ni-Chen Would you mind opening a new issue with your question? We're more likely to miss activity on issues that are already closed.

Thanks!

@craigcitro
Contributor Author

@Ni-Chen if you're using drive.mount() in colab, then this library isn't involved at all.

@Axel-Jacobsen

Howdy. Either the docs need to be updated, or the DEFAULT_CHUNK_SIZE needs to be changed, no?

From MediaIoBaseUpload,

"""
...
    Google App Engine has a 5MB limit on request size, so you should never set
    your chunksize larger than 5MB, or to -1.
"""

Either Google App Engine has increased this limit, or something else fishy is going on?
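
One way to live with both the larger default and a per-request limit is to pass an explicit, smaller chunksize in constrained environments. The sketch below is a hypothetical helper, not part of googleapiclient: it assumes the caller knows the request cap (5 MB in the docstring quoted above) and rounds down to a multiple of 256 KiB, which the resumable upload protocol expects for every chunk except the last.

```python
from typing import Optional

KIB = 1024
MIB = 1024 * KIB

# Resumable upload chunks are expected to be multiples of 256 KiB
# (the final chunk of a transfer is exempt).
CHUNK_GRANULARITY = 256 * KIB


def capped_chunk_size(preferred: int, request_limit: Optional[int] = None) -> int:
    """Pick a chunk size no larger than request_limit, aligned to 256 KiB.

    Hypothetical helper for illustration; request_limit is an assumption
    supplied by the caller, for example 5 * MIB on a platform with a 5 MB cap.
    """
    size = preferred if request_limit is None else min(preferred, request_limit)
    size -= size % CHUNK_GRANULARITY  # round down to the nearest multiple
    if size <= 0:
        raise ValueError("chunk size must be at least 256 KiB after capping")
    return size


if __name__ == "__main__":
    print(capped_chunk_size(100 * MIB))           # no cap: keeps 100 MiB
    print(capped_chunk_size(100 * MIB, 5 * MIB))  # capped to 5 MiB
```

The result could then be passed as the chunksize argument when constructing a MediaFileUpload or MediaIoBaseUpload, rather than relying on the module default.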

Labels: cla: yes
Linked issue: googleapiclient.http.DEFAULT_CHUNK_SIZE should be larger by default.
8 participants