
[FIX] load_results in nodes.py now uses node output directory as wd #2694


Closed
achetverikov wants to merge 4 commits

Conversation

achetverikov
Contributor

Summary

Fixes #2689. Ensures that when results are loaded, the current working directory is the node's output directory, so that autogenerated filenames resolve to the correct paths.

List of changes proposed in this PR (pull-request)

  • add indirectory to Node._load_results (see the sketch below)
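
For reference, here is a minimal sketch of what an indirectory-style context manager does. nipype ships its own implementation (in nipype.utils.filemanip, if I recall correctly), so this is only an illustration of the idea, not the code this PR adds:

import os
from contextlib import contextmanager

@contextmanager
def indirectory(path):
    # Temporarily change into `path`, restoring the original working
    # directory even if the body of the with-block raises.
    orig = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(orig)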

Acknowledgment

  • (Mandatory) I acknowledge that this contribution will be available under the Apache 2 license.

@satra
Member

satra commented Sep 5, 2018

@achetverikov - would it be possible to add a simple test where the original code fails and this mod succeeds?


@effigies effigies left a comment


Thanks for this. I have a couple of ideas for narrowing this change to only the sections that might actually raise errors.

if not isinstance(self, MapNode):
    self._copyfiles_to_wd(linksonly=True)
    aggouts = self._interface.aggregate_outputs(
        needed_outputs=self.needed_outputs)

Looking at this now, I think we want to try narrowing the "critical section" where we modify the working directory. I think it might actually just be here:

with indirectory(cwd):
    aggouts = self._interface.aggregate_outputs(
        needed_outputs=self.needed_outputs)

Almost all other function calls use the cwd directly.

By keeping it narrow, we effectively mark places where we're working-directory dependent, which may be useful in future refactors.
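
Putting that together with the hunk quoted above, the narrowed section might read roughly like this (a sketch based on the snippets in this review, not the actual diff):

if not isinstance(self, MapNode):
    self._copyfiles_to_wd(linksonly=True)
    with indirectory(cwd):
        # only aggregate_outputs runs with the node's output directory
        # as the working directory
        aggouts = self._interface.aggregate_outputs(
            needed_outputs=self.needed_outputs)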

    _save_resultfile(result, cwd, self.name)
else:
    logger.debug('aggregating mapnode results')
    result = self._run_interface()

I think this is another potential place we might want to wrap in indirectory(cwd). I would test without it first.
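
If it does turn out to be needed, the wrapping would presumably look something like this (a sketch based on the context above, not a tested change):

else:
    logger.debug('aggregating mapnode results')
    with indirectory(cwd):
        result = self._run_interface()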

@effigies
Member

effigies commented Sep 5, 2018

I had difficulty reducing a failing case to a test. But now that we've identified the likely section of code (and I have a couple of ideas about the relevant lines), perhaps it will be easier to come up with a failing test.

@effigies effigies added this to the 1.1.3 milestone Sep 5, 2018
@codecov-io

codecov-io commented Sep 5, 2018

Codecov Report

Merging #2694 into master will decrease coverage by <.01%.
The diff coverage is 0%.


@@            Coverage Diff            @@
##           master   #2694      +/-   ##
=========================================
- Coverage   67.61%   67.6%   -0.01%     
=========================================
  Files         340     340              
  Lines       43143   43144       +1     
  Branches     5349    5349              
=========================================
- Hits        29170   29168       -2     
- Misses      13271   13274       +3     
  Partials      702     702
Flag           Coverage Δ
#smoketests    50.57% <0%> (-0.01%) ⬇️
#unittests     65.1% <0%> (+0.01%) ⬆️

Impacted Files                    Coverage Δ
nipype/pipeline/engine/nodes.py   84.16% <0%> (-0.15%) ⬇️
nipype/utils/profiler.py          43.1% <0%> (-1.15%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37f3781...e4894a9.

@achetverikov
Contributor Author

I have moved indirectory as suggested by @effigies, but I have no experience with writing tests, so I'm not sure I can do this. I guess the test should involve loading a result with an autogenerated filename from an interface that produced an error before, like fsl.utils.MathCommand.

@effigies
Member

effigies commented Sep 6, 2018

I can't figure out why we sometimes end up in the wrong directory. I believe it has something to do with _list_outputs calling os.path.abspath() or os.getcwd() when loading results, but I cannot reproduce the condition either within a workflow or outside of one. @satra Any ideas?
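
To illustrate why the working directory matters here: an interface whose _list_outputs() builds paths with os.path.abspath() will report different filenames depending on where it is called from. A standalone example (the temporary directories stand in for nipype paths; nothing here is nipype code):

import os
import tempfile

a = tempfile.mkdtemp()  # stands in for the submission directory
b = tempfile.mkdtemp()  # stands in for the node's output directory

os.chdir(a)
print(os.path.abspath('out'))  # -> <a>/out

os.chdir(b)
print(os.path.abspath('out'))  # -> <b>/out: same relative name, different path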

@achetverikov Can you verify that this resolves your error, at least?

@satra
Member

satra commented Sep 6, 2018

@effigies - that's why i was hoping that we could create a test? do the mapnodes that crash have function nodes as interfaces that change directory? (i think we protect against that, but i'm just throwing out ideas).

is it always the same mapnode that crashes? and does the problem happen with all plugins, or just the multiproc plugin?

@effigies
Member

effigies commented Sep 6, 2018

I spent a while trying to reproduce the condition, but couldn't. Here's the test I wrote:

import os

import nipype.pipeline.engine as pe
from nipype.interfaces.base import BaseInterface, TraitedSpec, File, traits


class TrivialInterface(BaseInterface):
    class input_spec(TraitedSpec):
        inputs = traits.Any()

    class output_spec(TraitedSpec):
        out = File(exists=True)

    def _run_interface(self, runtime):
        open(self._list_outputs()['out'], 'w').close()
        return runtime

    def _list_outputs(self):
        outputs = {}
        outputs['out'] = os.path.abspath('out')
        return outputs


def test_mapnode_aggregation(tmpdir):
    wf = pe.Workflow(name='wf', base_dir=str(tmpdir))
    trivial = pe.MapNode(TrivialInterface(), name='trivial',
                         iterfield=['inputs'])
    trivial.inputs.inputs = list(range(5))
    consume = pe.Node(TrivialInterface(), name='consume')
    wf.connect(trivial, 'out', consume, 'inputs')

    res = wf.run(plugin='MultiProc')
    assert list(res.nodes)[0].result.outputs.out == [
        os.path.join(tmpdir, 'wf', 'trivial', 'mapflow',
                     '_trivial%d' % i, 'out')
        for i in range(5)]

I've tried in and out of workflows, with MultiProc and without.

@achetverikov
Contributor Author

achetverikov commented Sep 6, 2018

I don't think it's a MultiProc problem: I got the same with PBS, and I also tried setting run_without_submitting for the problem nodes; it did not matter. For me, two nodes were giving errors, one with fsl.maths.MeanImage and another with fsl.MCFLIRT, both subnodes in a MapNode. I can confirm that with the modified code I have not yet seen this problem, although I had to delete the previous cache before testing the new code.

Both of the nodes that were producing errors generate absolute paths for the expected outputs, and that was a problem when _list_outputs was run outside of the node folder. I think one difference from the test above is that the problem occurred when the nodes' results were loaded from cache and then aggregated.

But doesn't the error make sense when you look at the _load_results code? I'm not very familiar with nipype's inner workings, but the line cwd = self.output_dir() suggests that it is not intended to be executed inside the node's output directory. Then, when the filenames are generated, they won't have the same paths as when the node runs. Sorry if this does not make much sense; as I said, I'm not very familiar with the nipype architecture.

@effigies
Member

effigies commented Sep 6, 2018

I think one difference from the test above is that the problem occurred when the nodes' results were loaded from cache and then aggregated.

  1. I believe MapNodes do that even on the first run.
  2. Adding a second wf.run() does not change anything. The test always passes.

But doesn't the error make sense when you look at the _load_results code?

I agree it makes sense. I just can't figure out how to get into that code path from another working directory. (I can call _load_results() directly from another path, but within a workflow I can't see how.)

@achetverikov
Contributor Author

I think I can try to rerun the old code to see if I can reproduce the problem, and then maybe share the Python script that is created for each node for batch jobs, along with the pickled results, to help figure out what exactly is going on.

@effigies
Member

effigies commented Sep 6, 2018

If you could, that would be great. And if you could try removing pieces, one at a time, until you get the smallest possible workflow that still exhibits the error, that's really what we need to test.

@achetverikov
Contributor Author

Sorry for the delay in further testing; lots of stuff on my plate this week. I'll get back to it, hopefully before the weekend.

@effigies
Member

Hi @achetverikov, this would be great to have in for 1.1.3, which we plan to release next Monday. Please let me know if you need any help here.

@achetverikov
Contributor Author

Sorry, it seems that I am now unable to replicate the error myself. I've tried several times to rerun the whole preprocessing workflow without luck (in the sense that there were no errors). Possible reasons why it appeared in the first place: an old cache remaining from before I last upgraded nipype, or some weird combination of circumstances on the cluster. If I see it again, I'll try to track it down.

@effigies
Member

Yeah, it may be a problem that arises from an old cache... I can try having another go at a test, at some point, but definitely not this week.

If this is a problem that can be resolved with a new cache, I'm okay pushing it off to the next release to make sure we can get a test and verify it's fixed. (If it goes in now, it'll be a pain to verify.)

@achetverikov
Contributor Author

I think it can be safely postponed; maybe somebody else will also encounter similar problems.

@effigies effigies modified the milestones: 1.1.3, 1.1.4 Sep 19, 2018
@effigies
Member

Hi @achetverikov, have you had any more chance to look at this? I'll admit that I haven't.

@achetverikov
Contributor Author

achetverikov commented Oct 24, 2018 via email

@effigies
Member

Thanks for the update. If I remember correctly, we may be able to test this as follows:

  1. Create a workflow where _list_outputs() will be run from the CWD without this patch, and run it.
  2. Recreate the exact same workflow from scratch, which will prevent any cached results objects from coming into play, and run that.

That said, I haven't looked seriously at this again, so I may be misremembering my thought process before. If you have some time, could you give that a shot? Otherwise I will try to find a chance.
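
A rough pytest-style sketch of those two steps might look like this; build_workflow here is a hypothetical helper that constructs whatever minimal workflow exercises the bug, so this is an outline of the idea rather than a working test:

import os


def test_reload_from_other_cwd(tmpdir):
    # 1. First run populates the per-node result caches on disk.
    wf = build_workflow(base_dir=str(tmpdir))  # build_workflow is hypothetical
    wf.run()

    # 2. Rebuild the identical workflow from scratch (no in-memory results),
    #    move to an unrelated working directory, and run again so that the
    #    cached results are loaded and aggregated from there.
    os.chdir(str(tmpdir.mkdir('elsewhere')))
    wf2 = build_workflow(base_dir=str(tmpdir))
    wf2.run()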

@effigies effigies mentioned this pull request Oct 26, 2018
@achetverikov
Contributor Author

Sorry, I don't have much time now, unfortunately.

@effigies
Member

Thanks. I'll have a look if I have time; otherwise I'll shift it to next month.

@effigies
Member

As an update: The suggestion doesn't work. There's a lot of weird machinery around aggregating outputs, and it's not really clear how to enter any given code path at will. That entire block is not covered by tests.

I'm pushing this off to future, as I don't see any real prospect of reproducing the issue.

@effigies
Member

I think this has been superseded by recent changes. Closing for now. If #2689 re-emerges, we can look at this again.

@effigies effigies closed this Feb 24, 2020
Successfully merging this pull request may close these issues:

  • Nipype looks for node results in wrong folders