Skip to content

ENH: Always fsync pickle files #3083

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

effigies
Copy link
Member

Summary

Fully qualifying results file names does not seem to have resolved the issue with finding the files, making it more likely that this is a filesystem synchronization issue. This PR adds an fsync call for Pickle files, specifically.

This required a slight refactor for handling gzip (pklz) files.

Hopefully:

List of changes proposed in this PR (pull-request)

Acknowledgment

  • (Mandatory) I acknowledge that this contribution will be available under the Apache 2 license.

@satra
Copy link
Member

satra commented Oct 26, 2019

@effigies - in a conversation with @oesteban we discussed that the SoftFileLock is not useful at all for save/load pickle. in nipype, two nodes would never write to the same file, and for reading, a lock shouldn't be required. so we can simplify the code here.

@oesteban mentioned that the sync issue appears to arise in networked file systems, so we also discussed copying over the timeout code from SGELikePluginBase to load pickle.

@effigies
Copy link
Member Author

I thought the lock was for reading and writing...

@satra
Copy link
Member

satra commented Oct 26, 2019

in nipype, reading never happens before writing is done. the only issue that could take place is that the filesystem has not synced yet, but even in that case the reading node should be able to assume that writing has been completed. this is because a job doesn't get released till it's dependent jobs are finished (which means result writing has been done).

@effigies effigies closed this Oct 26, 2019
@oesteban
Copy link
Contributor

I think the flush can be useful. And we might want to have a paranoia mode where the os.fsync is executed. With all that above said, I think this PR should be reconsidered (and maybe remove the lock for reading, at the same time).

@effigies effigies reopened this Oct 28, 2019
@effigies
Copy link
Member Author

effigies commented Oct 28, 2019

Okay. Reopened.

And we might want to have a paranoia mode where the os.fsync is executed.

I personally find working with the config file to be a pain, especially inside containers. If we make syncing the default behavior, people can run with eatmydata to get whatever performance boosts are available by being less paranoid. (I suspect they'll be minimal, as the time spent in the control thread is pretty small relative to most workflow tasks.)

In any event, should I add a sync parameter to the function, then, rather than running it unconditionally? This should make it easier to enable/disable paranoid mode from higher level functions, if we decide to go that route.

@oesteban
Copy link
Contributor

having the sync defaulting to true looks like a good idea

@effigies
Copy link
Member Author

This seems to reliably produce failures to find result_*.pklz files, so perhaps fsync isn't a good idea, regardless.

@effigies effigies closed this Oct 29, 2019
@effigies effigies deleted the fix/sync_resultfile branch October 29, 2019 12:37
@satra
Copy link
Member

satra commented Oct 29, 2019

@effigies - that's really weird. why would this change exacerbate the problem. might be good to follow up.

on one of the stackoverflow threads i read about the following:

  1. write to a temporary file
  2. rename the temporary file to the target file (this is an atomic operation)

@oesteban
Copy link
Contributor

@effigies you've got the closest to reliably replicate the problem. We can't be sure this is the same problem, but I agree with @satra this is the best we have to investigate it this far :) - good catch!.

@effigies
Copy link
Member Author

Yeah, I don't know it's the same problem. But I think we can say that if fsync causes a delay that means the file can't be found, then the strict ordering of tasks doesn't hold.

@effigies effigies restored the fix/sync_resultfile branch October 30, 2019 20:47
@effigies
Copy link
Member Author

Okay. Restored the branch if you have thoughts on where to go with this.

@satra
Copy link
Member

satra commented Oct 30, 2019

@effigies: can we add the timeout check to load_pkl? a version of this is here:

https://github.com/nipy/nipype/blob/master/nipype/pipeline/plugins/base.py#L476

@satra satra mentioned this pull request Nov 5, 2019
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants