-
Notifications
You must be signed in to change notification settings - Fork 314
Handle smooth_frames=False consistently for single and multi-frequency records #313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@bemoody this appears to be working from my testing. I do think we could improve some of the descriptions about how this works though. It took a fair bit of digging to find a good definition for how In addition to describing that In anticipation of merging this update I've created a new issue to reflect that we need to update the wfdb-python help documentation since it currently states that |
Another issue is that the current implementation of (For example, on my machine I'm not sure why this is. |
wfdb/io/_signal.py
Outdated
returned if the signals have uniform samples/frame or if | ||
`smooth_frames` is True. Otherwise a list of 1d numpy arrays | ||
is returned. | ||
returned if `smooth_frames` is True. Otherwise a list of 1d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good idea for the caller to know the return type/dimensions in advance. But this changes the meaning of the 'smooth_frames' phrase which is supposed to specify the behavior only for signals with multiple samples per frame. It also hides the context of what happens and when, in order to get the Nx2 array vs the N 1D arrays, by directly only focusing on the output type.
I would add another function argument, rename it, and or rework the documentation, to make the behavior more clear.
Thanks for your thoughtful review!
Definitely calling it 'smooth_frames' is not as clear as it could be.
This was my effort to maintain maximal API compatibility. There are
certainly other approaches we could take at the API level, for
example:
- 'rdrecord(..., smooth_frames=False)' could be replaced with
'rdrecord(..., expanded=True)'; i.e., the old keyword argument could
be deprecated and a new one added that is more intuitive
- rdrecord could *always* set e_d_signal/e_p_signal, and
d_signal/p_signal could be computed lazily on request; i.e., the
smooth_frames argument could be deprecated and ignored completely
Just to be clear, though, are you saying this should be changed in the
public API (rdrecord)? Or only the internal API of wfdb.io._signal?
|
I agree with @cx1111 that it would be better if the caller knew the return type. However, since it isn't feasible to only return one type, perhaps we could instead rename the |
@briangow, if I understand correctly, you're saying we should change
the name of the argument for _rd_segment and for _rd_dat_signals, since
in those two cases the type of the return value is dependent on the
argument. I agree that would be clearer and can certainly make that
change.
(I disagree, though, that it isn't feasible for these functions to
always return the same type. In fact, we could change both functions
to *always* return a list of arrays, and post-process the result in
rdrecord. However, as this would be a significant change code-wise,
and one with no user-visible effects, I would propose to do this in a
later pull request.)
The question I'm still unsure about, which wasn't completely clear from
Chen's message, is whether we also need to make a change in the *public
API* (rdrecord). For rdrecord, 'smooth_frames' doesn't change the type
of the result (it's always a Record object); it only affects whether the
'd_signal' or 'e_d_signal' attribute is set.
|
While I think it would be nice to have a clearer name for this parameter in the public API, it might cause more of a mess than it's worth. I think it would be more effective to focus on improving the documentation and tutorials/demo packages. |
@cx1111 please could you respond to this point?:
|
I agree that the current API which requires the caller to know the frequency of the signals, is poorly suited for an application writer trying to process large volumes of data. I think when I designed this, that I only thought about users trying to analyze a small homogeneous datasets they are familiar with.
Lazy evaluation sounds interesting and scary :P This would probably require a lot of further thought on the design and tradeoffs. Perhaps for the future.
Internally, we should definitely add more params to control the behavior of Perhaps we could add another function parameter such as
We could also add this new param to This way, the user can specify both the desired output data type and the behavior for multi-frequency records. the The changes could then be made in 2 steps:
Thoughts? |
This function can return signal data in either of two formats: - "smooth" (a two-dimensional array, where x[t,s] is sample t of signal s) - "non-smooth" (a list of one-dimensional arrays, where x[s][t] is sample t of signal s) Previously, _rd_dat_signals would use "smooth" format if 'sum(samps_per_frame) == n_sig', regardless of 'smooth_frames'. This makes little sense, since the caller needs to know what type of return value to expect. Instead, the format should be determined solely by the 'smooth_frames' parameter. (In this case, the only caller of _rd_dat_signals is _rd_segment, and the only caller of _rd_segment is wfdb.io.record.rdrecord.)
This function can return signal data in either of two formats: - "smooth" (a two-dimensional array, where x[t,s] is sample t of signal s) - "non-smooth" (a list of one-dimensional arrays, where x[s][t] is sample t of signal s) Previously, _rd_segment would use "smooth" format if 'sum(samps_per_frame) == n_sig', regardless of 'smooth_frames'. This makes little sense, since the caller needs to know what type of return value to expect. Instead, the format should be determined solely by the 'smooth_frames' parameter. (In this case, the only caller of _rd_segment is wfdb.io.record.rdrecord.)
rdrecord can return signal data in any of four formats: - by resampling all signals to a uniform rate ("smoothing"), and setting d_signal to a two-dimensional array, where d_signal[t,s] is sample t of signal s - by resampling to a uniform rate, and setting p_signal to a two-dimensional array, where p_signal[t,s] is sample t of signal s converted into physical units - by setting e_d_signal to a list of one-dimensional arrays, where e_d_signal[s][t] is sample t of signal s - by setting e_p_signal to a list of one-dimensional arrays, where e_p_signal[s][t] is sample t of signal s converted into physical units If the selected signals contain multiple samples per frame, the behavior of rdrecord is consistent: - If smooth_frames is True, the selected signals are resampled to the frame rate and stored as d_signal or p_signal. - If smooth_frames is False, the selected signals are stored in their original form as e_d_signal or e_p_signal. However, if each of the selected signals contains only one sample per frame, rdrecord would previously behave inconsistently: - If all signals in the record contain only one sample per frame, and smooth_frames is False and physical is True, the selected signals would be stored as p_signal (not e_p_signal as the caller would expect.) - If all signals in the record contain only one sample per frame, and smooth_frames is False and physical is False, rdrecord would crash with a TypeError in convert_dtype. - If some signals in the record contain multiple samples per frame (but the selected signals don't), rdrecord would crash with an AttributeError in _arrange_fields. Change this behavior so that if smooth_frames is false, rdrecord will always store the signals as e_d_signal or e_p_signal, regardless of the underlying number of samples per frame.
calc_checksum is called in order to calculate checksums of the signal data (d_signal or e_d_signal) in a record. In particular, if rdrecord is used to read only part of a record, it will call calc_checksum to determine the checksums of that part of the record. However, at that time, self.n_sig is still equal to the total number of signals in the input record (not the number of signals stored in d_signal or e_d_signal, which might be different if a subset of channels are selected). Thus, if expanded is true and self.n_sig > len(self.e_d_signal), this would crash. For simplicity, and consistency with the expanded=False case, ignore n_sig and simply calculate the checksums of e_d_signal.
The Record.__eq__ function is currently used in the test suite to check whether two Record objects have the same contents. (This function doesn't actually conform to the standard Python API, but it works for this purpose.) In the case of "expanded"-format records, the e_p_signal and/or e_d_signal attributes are lists of numpy.ndarray objects, not numpy.ndarray objects themselves. In order to compare these for equality we must compare each element separately.
When smooth_frames is False, rdrecord should store the signal data as a list of 1D arrays (each signal at its original sampling frequency) rather than as a "smooth" 2D array. Previously, this would fail if the input record contained only one sample per frame. Modify the existing test case to check that this works.
When comparing the attributes of two records, if a value is an instance of a class derived from 'np.ndarray' or a class derived from 'list', it should be treated the same way as an 'np.ndarray' or a 'list'. Note that there currently are no such classes defined or used by this package. Note also that in any case, the respective types of the two attributes must be equal in order for the values to be considered equal.
To simplify the implementation of _rd_segment and _rd_dat_signals, we want to eliminate the smooth_frames argument, so that the return values of these two functions will always have the same type (a list of numpy arrays.) Therefore, if the application requested frame smoothing, then instead of calling _rd_segment with smooth_frames=True, we will call _rd_segment with smooth_frames=False, and post-process the result by calling Record.smooth_frames. Record.smooth_frames (SignalMixin.smooth_frames) will give a result equivalent to what _rd_segment gives with smooth_frames=True, but there are likely differences in performance: - Record.smooth_frames performs the computation by slicing along the "long" axis and storing the intermediate results in an int64 numpy array. _rd_dat_signals slices along the "short" axis and stores the intermediate results in a Python list. Record.smooth_frames should therefore be faster for large inputs. - Record.smooth_frames only operates on the channels present in e_d_signal, whereas _rd_dat_signals smooths all of the signals in the input file. Record.smooth_frames therefore saves memory and time when reading a subset of channels. - Record.smooth_frames always returns an int64 array, whereas _rd_dat_signals returns an array of the same type as the original data. Record.smooth_frames therefore uses more memory in many cases. (Note that rdrecord will post-process the result in any case, making this change invisible to applications; the issue of increased temporary memory usage can be addressed separately.) - If there are multiple channels in a signal file, then calling _rd_dat_signals with smooth_frames=False requires making an extra copy of each signal that has multiple samples per frame (because of the "reshape(-1)".) (This could be addressed in the future by allowing _rd_segment, or at least _rd_dat_signals, to return a list of *two-dimensional* arrays instead.) In order for this to work correctly, Record.smooth_frames must be called after setting both e_d_signal and samps_per_frame. In particular, it must be done after Record._arrange_fields "rearranges" samps_per_frame according to channels. On the other hand, _arrange_fields is expected to set checksum and init_value in different ways depending on whether the result is to be smoothed. (This use of checksum and init_value is somewhat dubious.) Therefore, smooth_frames is now invoked as part of _arrange_fields, after setting channel-specific metadata and before setting checksum and init_value. _arrange_fields should never be invoked other than by rdrecord; it doesn't make any sense to call this function at other times. Change the signature of this function to reflect the fact that it actively transforms the signal array, and make all arguments mandatory.
Now, _rd_segment is only ever called with smooth_frames=False (the function previously fulfilled by smooth_frames=True is handled by invoking Record.smooth_frames in Record._arrange_fields.) Accordingly, remove the logic for smoothing frames within _rd_dat_signals and for handling "smoothed" data in _rd_segment.
The smooth_frames parameter has been deprecated, and these functions always return a list of arrays; remove the parameter. Note that _rd_segment is only ever called by rdrecord, and _rd_dat_signals is only ever called by _rd_segment.
Thanks! To address your "step 1", I think the most straightforward solution is to remove the "smoothing" functionality from Aside from reducing the amount of duplicated logic, the implementation of Note that 9d77075 is 16fa5ed rebased onto 0d41603. The new commits here are ca20b1a ("rdrecord: smooth frames by invoking smooth_frames()"), b2f3995 ("_rd_segment, _rd_dat_signals: remove support for smooth_frames=True"), and d02c478 ("_rd_segment, _rd_dat_signals: remove smooth_frames parameter"). |
One more thing I'll say is that this restructuring should also make it much easier to add support for other signal file formats (notably WFDB-FLAC, but also to properly handle EDF/BDF, and maybe other things in the future.) |
Will try to review by EOW. Ping me if I forget. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Nice code deletion. Perhaps change the title of the PR and edit the OP description to include a summary of the changes?
Also from now on, repository owners should merge their own PRs once they have approval. |
Sorry to make this more complicated, but this change now breaks
I should have tested this. I think the best thing to do here is to change |
The logic for physical and digital cases is exactly the same except for the choice of input data and the resulting data type.
Ensure that all of the input signals have the expected lengths before trying to do anything else. (If any of the signal lengths were wrong, this would typically have raised a ValueError later on - though not always, since numpy will try to be clever if one of the arguments is an array of length 1.)
Instead of always returning the result as an int64 or float64 array, select the output type based on the types of the input arrays. The output type should be the smallest type that has the correct "kind" and is able to represent all input values. For example, in digital mode, if the input includes some int8 arrays and some int16 arrays, the result should be an int16 array. In physical mode, if the inputs are all float32, then the result will be float32; otherwise the result will be float64. However, although the output type should generally match the input type, intermediate results may need to be stored as a different type. For example, if the input and output are both int16, and one or more signals have spf > 1 and use the entire 16-bit range, then the sum of N samples will overflow an int16. Previously, it was fine simply to store the intermediate results in the output array itself, because the output array was 64-bit, and no WFDB format has more than 32-bit precision, and spf is (in practice) limited to at most 2**31-1. For simplicity, continue using int64 or float64 as the intermediate type, regardless of the actual input types and spf. At the same time, we can also optimize the calculation slightly by reshaping the input array and using np.sum, avoiding another Python loop.
The variable tspf is not used and can be removed. sig_len can be calculated using // rather than / and truncating to an integer (the latter could be incorrect for records longer than 2**53.) The output array can be allocated using np.empty instead of np.zeros since it will be fully initialized by the subsequent loop. Since each frame is independent of the others, the loop can be broken up into blocks (here, arbitrarily chosen as 2**16 frames) to reduce temporary memory usage while still spending most of our CPU time in numpy operations.
Hopefully the last set of changes in this pull request. :) Changing Speed comparison: (master - original logic of
(d02c478 - original logic of
(non-smooth -
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, and nice job with the improvements!
What exactly is responsible for most of the speedup? The smaller memory allocation chunks, or something else?
Well, I don't know exactly what the bottlenecks are. But the big difference, I think, is that the original approach calculates every sample as a Python expression (notice that np.average is invoked once for every output sample of a high-res channel.) The new approach doesn't; every step of the process is done "in bulk", using numpy arrays to operate across thousands of samples at once. Of course numpy operations also take time proportional to the size of the input, but because you're performing a homogeneous operation across a homogeneous array, the constant factor is a lot smaller! |
The
smooth_frames
argument tordrecord
is meant to determine whether the input signals are stored as{d|p}_signal
(a homogeneous two-dimensional array, requiring all signals to be resampled to the same frequency) ore_{d|p}_signal
(a list of one-dimensional arrays, keeping each signal at its actual sampling frequency).However, this only works at present if you are trying to read signals with multiple samples per frame. (The assumption, I guess, being that if the record only has one sample per frame, then smooth format doesn't discard any information, so there is no reason for the application to want non-smooth format.)
IMNSHO, this is not a good idea: the application should not need to know in advance whether the record contains multiple samples per frame, and should not need to implement two different logic branches, simply to be able to read input signals without resampling. If the application asks for a 2D array it should get a 2D array, and if the application asks for a list of 1D arrays it should get a list of 1D arrays.
Fixing this could break applications, because if an existing application only ever reads single-frequency records, and it specifies
smooth_frames=False
andphysical=True
, it would receivep_signal
data just as if it had specifiedsmooth_frames=True
. It would be silly for an application to do that, and sillier to rely on this behavior, but who knows.This fixes issue #312 and also fixes the handling of
smooth_frames=False
together withsampfrom
/sampto
.