Handle smooth_frames=False consistently for single and multi-frequency records #313

bemoody · 2021-07-09T16:35:07Z

The smooth_frames argument to rdrecord is meant to determine whether the input signals are stored as {d|p}_signal (a homogeneous two-dimensional array, requiring all signals to be resampled to the same frequency) or e_{d|p}_signal (a list of one-dimensional arrays, keeping each signal at its actual sampling frequency).

However, this only works at present if you are trying to read signals with multiple samples per frame. (The assumption, I guess, being that if the record only has one sample per frame, then smooth format doesn't discard any information, so there is no reason for the application to want non-smooth format.)

IMNSHO, this is not a good idea: the application should not need to know in advance whether the record contains multiple samples per frame, and should not need to implement two different logic branches, simply to be able to read input signals without resampling. If the application asks for a 2D array it should get a 2D array, and if the application asks for a list of 1D arrays it should get a list of 1D arrays.

Fixing this could break applications, because if an existing application only ever reads single-frequency records, and it specifies smooth_frames=False and physical=True, it would receive p_signal data just as if it had specified smooth_frames=True. It would be silly for an application to do that, and sillier to rely on this behavior, but who knows.

This fixes issue #312 and also fixes the handling of smooth_frames=False together with sampfrom/sampto.

briangow · 2021-08-03T14:39:40Z

@bemoody this appears to be working from my testing. I do think we could improve some of the descriptions about how this works though. It took a fair bit of digging to find a good definition for how smooth_frames works and to find a database which uses a variable number of samps_per_frame (drivedb is one such database).

In addition to describing that smooth_frames = True will return a uniformly sampled array and smooth_frames = False will return a list of signals at their original sample rates, we could be clearer about how the "smoothing" works. Perhaps by stating that smooth_frames = True will decimate the signal by its samps_per_frame in the parameter descriptions.

In anticipation of merging this update I've created a new issue to reflect that we need to update the wfdb-python help documentation since it currently states that smooth_frames is only used when signals have multiple samples per frame, see #316 .

bemoody · 2021-08-09T20:43:12Z

Another issue is that the current implementation of smooth_frames=False is very slow, much slower than smooth_frames=True.

(For example, on my machine wfdb.rdrecord("sample-data/100") is nearly instantaneous, while wfdb.rdrecord("sample-data/100", smooth_frames=False) takes around 8 seconds.)

I'm not sure why this is.

cx1111 · 2021-10-22T21:13:08Z

wfdb/io/_signal.py

-        returned if the signals have uniform samples/frame or if
-        `smooth_frames` is True. Otherwise a list of 1d numpy arrays
-        is returned.
+        returned if `smooth_frames` is True. Otherwise a list of 1d


I think it's a good idea for the caller to know the return type/dimensions in advance. But this changes the meaning of the 'smooth_frames' phrase which is supposed to specify the behavior only for signals with multiple samples per frame. It also hides the context of what happens and when, in order to get the Nx2 array vs the N 1D arrays, by directly only focusing on the output type.

I would add another function argument, rename it, and or rework the documentation, to make the behavior more clear.

wfdb/io/record.py

bemoody · 2021-10-26T15:05:44Z

Thanks for your thoughtful review! Definitely calling it 'smooth_frames' is not as clear as it could be. This was my effort to maintain maximal API compatibility. There are certainly other approaches we could take at the API level, for example: - 'rdrecord(..., smooth_frames=False)' could be replaced with 'rdrecord(..., expanded=True)'; i.e., the old keyword argument could be deprecated and a new one added that is more intuitive - rdrecord could *always* set e_d_signal/e_p_signal, and d_signal/p_signal could be computed lazily on request; i.e., the smooth_frames argument could be deprecated and ignored completely Just to be clear, though, are you saying this should be changed in the public API (rdrecord)? Or only the internal API of wfdb.io._signal?

briangow · 2021-11-10T15:02:50Z

I agree with @cx1111 that it would be better if the caller knew the return type. However, since it isn't feasible to only return one type, perhaps we could instead rename the smooth_frames parameter to return_smoothed_frames. This at least provides an indication (without having to read the docstring) that we are changing the return based on the setting of this variable.

bemoody · 2021-11-16T18:40:16Z

@briangow, if I understand correctly, you're saying we should change the name of the argument for _rd_segment and for _rd_dat_signals, since in those two cases the type of the return value is dependent on the argument. I agree that would be clearer and can certainly make that change. (I disagree, though, that it isn't feasible for these functions to always return the same type. In fact, we could change both functions to *always* return a list of arrays, and post-process the result in rdrecord. However, as this would be a significant change code-wise, and one with no user-visible effects, I would propose to do this in a later pull request.) The question I'm still unsure about, which wasn't completely clear from Chen's message, is whether we also need to make a change in the *public API* (rdrecord). For rdrecord, 'smooth_frames' doesn't change the type of the result (it's always a Record object); it only affects whether the 'd_signal' or 'e_d_signal' attribute is set.

briangow · 2021-11-18T16:30:48Z

While I think it would be nice to have a clearer name for this parameter in the public API, it might cause more of a mess than it's worth. I think it would be more effective to focus on improving the documentation and tutorials/demo packages.

tompollard · 2022-03-03T17:23:22Z

@cx1111 please could you respond to this point?:

The question I'm still unsure about, which wasn't completely clear from
Chen's message, is whether we also need to make a change in the public
API (rdrecord). For rdrecord, 'smooth_frames' doesn't change the type
of the result (it's always a Record object); it only affects whether the
'd_signal' or 'e_d_signal' attribute is set.

cx1111 · 2022-03-03T18:36:50Z

I agree that the current API which requires the caller to know the frequency of the signals, is poorly suited for an application writer trying to process large volumes of data. I think when I designed this, that I only thought about users trying to analyze a small homogeneous datasets they are familiar with.

rdrecord could always set e_d_signal/e_p_signal, and d_signal/p_signal could be computed lazily on request; i.e., the smooth_frames argument could be deprecated and ignored completely

Lazy evaluation sounds interesting and scary :P This would probably require a lot of further thought on the design and tradeoffs. Perhaps for the future.

Just to be clear, though, are you saying this should be changed in the public API (rdrecord)? Or only the internal API of wfdb.io._signal?

Internally, we should definitely add more params to control the behavior of _rd_dat_signals. In particular, this if else statement is rather nasty.

Perhaps we could add another function parameter such as return_dims which specifies the dimensions of the output array(s), which explicitly causes the function to return either the list of 1d arrays (val=1), or the 2d array (val=2). Then we can have the following behavior matrix with return_dims and smooth_frames:

(1, True): List of 1d arrays. Guaranteed to be same length.
(1, False): List of 1d arrays. Not guaranteed to be same length.
(2, True): Single 2d array.
(2, False): Illegal combination.

We could also add this new param to rdrecord and use the same behavior matrix.

This way, the user can specify both the desired output data type and the behavior for multi-frequency records. the smooth_frames param could be safely ignored for single-frequency records.

The changes could then be made in 2 steps:

Change the internal signal functions.
Change the public facing API of rdrecord.

Thoughts?

This function can return signal data in either of two formats: - "smooth" (a two-dimensional array, where x[t,s] is sample t of signal s) - "non-smooth" (a list of one-dimensional arrays, where x[s][t] is sample t of signal s) Previously, _rd_dat_signals would use "smooth" format if 'sum(samps_per_frame) == n_sig', regardless of 'smooth_frames'. This makes little sense, since the caller needs to know what type of return value to expect. Instead, the format should be determined solely by the 'smooth_frames' parameter. (In this case, the only caller of _rd_dat_signals is _rd_segment, and the only caller of _rd_segment is wfdb.io.record.rdrecord.)

This function can return signal data in either of two formats: - "smooth" (a two-dimensional array, where x[t,s] is sample t of signal s) - "non-smooth" (a list of one-dimensional arrays, where x[s][t] is sample t of signal s) Previously, _rd_segment would use "smooth" format if 'sum(samps_per_frame) == n_sig', regardless of 'smooth_frames'. This makes little sense, since the caller needs to know what type of return value to expect. Instead, the format should be determined solely by the 'smooth_frames' parameter. (In this case, the only caller of _rd_segment is wfdb.io.record.rdrecord.)

rdrecord can return signal data in any of four formats: - by resampling all signals to a uniform rate ("smoothing"), and setting d_signal to a two-dimensional array, where d_signal[t,s] is sample t of signal s - by resampling to a uniform rate, and setting p_signal to a two-dimensional array, where p_signal[t,s] is sample t of signal s converted into physical units - by setting e_d_signal to a list of one-dimensional arrays, where e_d_signal[s][t] is sample t of signal s - by setting e_p_signal to a list of one-dimensional arrays, where e_p_signal[s][t] is sample t of signal s converted into physical units If the selected signals contain multiple samples per frame, the behavior of rdrecord is consistent: - If smooth_frames is True, the selected signals are resampled to the frame rate and stored as d_signal or p_signal. - If smooth_frames is False, the selected signals are stored in their original form as e_d_signal or e_p_signal. However, if each of the selected signals contains only one sample per frame, rdrecord would previously behave inconsistently: - If all signals in the record contain only one sample per frame, and smooth_frames is False and physical is True, the selected signals would be stored as p_signal (not e_p_signal as the caller would expect.) - If all signals in the record contain only one sample per frame, and smooth_frames is False and physical is False, rdrecord would crash with a TypeError in convert_dtype. - If some signals in the record contain multiple samples per frame (but the selected signals don't), rdrecord would crash with an AttributeError in _arrange_fields. Change this behavior so that if smooth_frames is false, rdrecord will always store the signals as e_d_signal or e_p_signal, regardless of the underlying number of samples per frame.

calc_checksum is called in order to calculate checksums of the signal data (d_signal or e_d_signal) in a record. In particular, if rdrecord is used to read only part of a record, it will call calc_checksum to determine the checksums of that part of the record. However, at that time, self.n_sig is still equal to the total number of signals in the input record (not the number of signals stored in d_signal or e_d_signal, which might be different if a subset of channels are selected). Thus, if expanded is true and self.n_sig > len(self.e_d_signal), this would crash. For simplicity, and consistency with the expanded=False case, ignore n_sig and simply calculate the checksums of e_d_signal.

The Record.__eq__ function is currently used in the test suite to check whether two Record objects have the same contents. (This function doesn't actually conform to the standard Python API, but it works for this purpose.) In the case of "expanded"-format records, the e_p_signal and/or e_d_signal attributes are lists of numpy.ndarray objects, not numpy.ndarray objects themselves. In order to compare these for equality we must compare each element separately.

When smooth_frames is False, rdrecord should store the signal data as a list of 1D arrays (each signal at its original sampling frequency) rather than as a "smooth" 2D array. Previously, this would fail if the input record contained only one sample per frame. Modify the existing test case to check that this works.

When comparing the attributes of two records, if a value is an instance of a class derived from 'np.ndarray' or a class derived from 'list', it should be treated the same way as an 'np.ndarray' or a 'list'. Note that there currently are no such classes defined or used by this package. Note also that in any case, the respective types of the two attributes must be equal in order for the values to be considered equal.

To simplify the implementation of _rd_segment and _rd_dat_signals, we want to eliminate the smooth_frames argument, so that the return values of these two functions will always have the same type (a list of numpy arrays.) Therefore, if the application requested frame smoothing, then instead of calling _rd_segment with smooth_frames=True, we will call _rd_segment with smooth_frames=False, and post-process the result by calling Record.smooth_frames. Record.smooth_frames (SignalMixin.smooth_frames) will give a result equivalent to what _rd_segment gives with smooth_frames=True, but there are likely differences in performance: - Record.smooth_frames performs the computation by slicing along the "long" axis and storing the intermediate results in an int64 numpy array. _rd_dat_signals slices along the "short" axis and stores the intermediate results in a Python list. Record.smooth_frames should therefore be faster for large inputs. - Record.smooth_frames only operates on the channels present in e_d_signal, whereas _rd_dat_signals smooths all of the signals in the input file. Record.smooth_frames therefore saves memory and time when reading a subset of channels. - Record.smooth_frames always returns an int64 array, whereas _rd_dat_signals returns an array of the same type as the original data. Record.smooth_frames therefore uses more memory in many cases. (Note that rdrecord will post-process the result in any case, making this change invisible to applications; the issue of increased temporary memory usage can be addressed separately.) - If there are multiple channels in a signal file, then calling _rd_dat_signals with smooth_frames=False requires making an extra copy of each signal that has multiple samples per frame (because of the "reshape(-1)".) (This could be addressed in the future by allowing _rd_segment, or at least _rd_dat_signals, to return a list of *two-dimensional* arrays instead.) In order for this to work correctly, Record.smooth_frames must be called after setting both e_d_signal and samps_per_frame. In particular, it must be done after Record._arrange_fields "rearranges" samps_per_frame according to channels. On the other hand, _arrange_fields is expected to set checksum and init_value in different ways depending on whether the result is to be smoothed. (This use of checksum and init_value is somewhat dubious.) Therefore, smooth_frames is now invoked as part of _arrange_fields, after setting channel-specific metadata and before setting checksum and init_value. _arrange_fields should never be invoked other than by rdrecord; it doesn't make any sense to call this function at other times. Change the signature of this function to reflect the fact that it actively transforms the signal array, and make all arguments mandatory.

Now, _rd_segment is only ever called with smooth_frames=False (the function previously fulfilled by smooth_frames=True is handled by invoking Record.smooth_frames in Record._arrange_fields.) Accordingly, remove the logic for smoothing frames within _rd_dat_signals and for handling "smoothed" data in _rd_segment.

The smooth_frames parameter has been deprecated, and these functions always return a list of arrays; remove the parameter. Note that _rd_segment is only ever called by rdrecord, and _rd_dat_signals is only ever called by _rd_segment.

bemoody · 2022-03-15T17:24:47Z

Thanks!

To address your "step 1", I think the most straightforward solution is to remove the "smoothing" functionality from _rd_segment and _rd_dat_signals entirely. Perhaps that sounds like a huge, dramatic change, but it isn't really. :) The function Record.smooth_frames already exists, and it should give numerically identical results to what _rd_dat_signals does at present.

Aside from reducing the amount of duplicated logic, the implementation of Record.smooth_frames is also better designed (IMHO) and probably more efficient overall. (There are some slight drawbacks to "unpacking" and later "repacking" the array, but I think this is a minor issue and outweighed by the benefit of simplifying the codebase.) The performance can be further improved in later pull requests.

Note that 9d77075 is 16fa5ed rebased onto 0d41603. The new commits here are ca20b1a ("rdrecord: smooth frames by invoking smooth_frames()"), b2f3995 ("_rd_segment, _rd_dat_signals: remove support for smooth_frames=True"), and d02c478 ("_rd_segment, _rd_dat_signals: remove smooth_frames parameter").

bemoody · 2022-03-16T16:14:10Z

One more thing I'll say is that this restructuring should also make it much easier to add support for other signal file formats (notably WFDB-FLAC, but also to properly handle EDF/BDF, and maybe other things in the future.)

cx1111 · 2022-03-16T23:29:39Z

Will try to review by EOW. Ping me if I forget.

cx1111

Looks good. Nice code deletion. Perhaps change the title of the PR and edit the OP description to include a summary of the changes?

cx1111 · 2022-03-18T01:47:32Z

Also from now on, repository owners should merge their own PRs once they have approval.

bemoody · 2022-03-21T15:47:37Z

Sorry to make this more complicated, but this change now breaks return_res:

>>> wfdb.rdrecord('sample-data/100', physical=False, return_res=16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/work/wfdb-python/wfdb/io/record.py", line 3620, in rdrecord
    record.convert_dtype(physical, return_res, smooth_frames)
  File "/home/user/work/wfdb-python/wfdb/io/_signal.py", line 723, in convert_dtype
    raise Exception('Cannot convert digital samples to lower dtype. Risk of overflow/underflow.')
Exception: Cannot convert digital samples to lower dtype. Risk of overflow/underflow.

I should have tested this.

I think the best thing to do here is to change smooth_frames() to store the result as the minimal data type (which I was intending to do in a separate pull request.) That behavior could be made optional but I really don't expect anyone to complain.

The logic for physical and digital cases is exactly the same except for the choice of input data and the resulting data type.

Ensure that all of the input signals have the expected lengths before trying to do anything else. (If any of the signal lengths were wrong, this would typically have raised a ValueError later on - though not always, since numpy will try to be clever if one of the arguments is an array of length 1.)

Instead of always returning the result as an int64 or float64 array, select the output type based on the types of the input arrays. The output type should be the smallest type that has the correct "kind" and is able to represent all input values. For example, in digital mode, if the input includes some int8 arrays and some int16 arrays, the result should be an int16 array. In physical mode, if the inputs are all float32, then the result will be float32; otherwise the result will be float64. However, although the output type should generally match the input type, intermediate results may need to be stored as a different type. For example, if the input and output are both int16, and one or more signals have spf > 1 and use the entire 16-bit range, then the sum of N samples will overflow an int16. Previously, it was fine simply to store the intermediate results in the output array itself, because the output array was 64-bit, and no WFDB format has more than 32-bit precision, and spf is (in practice) limited to at most 2**31-1. For simplicity, continue using int64 or float64 as the intermediate type, regardless of the actual input types and spf. At the same time, we can also optimize the calculation slightly by reshaping the input array and using np.sum, avoiding another Python loop.

The variable tspf is not used and can be removed. sig_len can be calculated using // rather than / and truncating to an integer (the latter could be incorrect for records longer than 2**53.) The output array can be allocated using np.empty instead of np.zeros since it will be fully initialized by the subsequent loop. Since each frame is independent of the others, the loop can be broken up into blocks (here, arbitrarily chosen as 2**16 frames) to reduce temporary memory usage while still spending most of our CPU time in numpy operations.

bemoody · 2022-03-21T20:12:56Z

Hopefully the last set of changes in this pull request. :)

Changing smooth_frames() to return the minimal data type requires changing the logic a little (we can't just change the type of the array because the intermediate sums can easily exceed the range of the original data type.) At the same time, throw in a few optimizations.

Speed comparison:

(master - original logic of _rd_dat_signals)

>>> import timeit
>>> timeit.timeit('wfdb.rdrecord("sample-data/03700181")', 'import wfdb', number=100)
54.588751977000356

(d02c478 - original logic of smooth_frames)

>>> import timeit
>>> timeit.timeit('wfdb.rdrecord("sample-data/03700181")', 'import wfdb', number=100)
1.1534926320018712

(non-smooth - smooth_frames with further optimizations)

>>> import timeit
>>> timeit.timeit('wfdb.rdrecord("sample-data/03700181")', 'import wfdb', number=100)
1.0475175969986594

cx1111

Good catch, and nice job with the improvements!

What exactly is responsible for most of the speedup? The smaller memory allocation chunks, or something else?

bemoody · 2022-03-28T21:14:30Z

Well, I don't know exactly what the bottlenecks are. But the big difference, I think, is that the original approach calculates every sample as a Python expression (notice that np.average is invoked once for every output sample of a high-res channel.) The new approach doesn't; every step of the process is done "in bulk", using numpy arrays to operate across thousands of samples at once.

Of course numpy operations also take time proportional to the size of the input, but because you're performing a homogeneous operation across a homogeneous array, the constant factor is a lot smaller!

briangow mentioned this pull request Aug 3, 2021

Update help documentation to reflect that smooth_frames isn't only applied for signals with multiple samps_per_frame #316

Open

bemoody mentioned this pull request Aug 10, 2021

Optimize smooth_frames=False #318

Merged

bemoody mentioned this pull request Oct 5, 2021

Multi-frequency input records MIT-LCP/waveform-annotation#49

Open

bemoody mentioned this pull request Oct 13, 2021

Read multi-frequency multi-segment records #331

Merged

cx1111 requested changes Oct 22, 2021

View reviewed changes

bemoody mentioned this pull request Oct 28, 2021

wrsamp incorrectly saves samps_per_frame when expanded=False #333

Open

Benjamin Moody added 10 commits March 11, 2022 13:17

bemoody force-pushed the non-smooth branch from 16fa5ed to d02c478 Compare March 15, 2022 17:24

cx1111 approved these changes Mar 18, 2022

View reviewed changes

bemoody changed the title ~~Handle smooth_frames for single-frequency records~~ Handle smooth_frames=False consistently for single and multi-frequency records Mar 21, 2022

Benjamin Moody added 5 commits March 21, 2022 15:39

SignalMixin.smooth_frames: consolidate duplicated logic.

13a8948

The logic for physical and digital cases is exactly the same except for the choice of input data and the resulting data type.

TestRecord: test rdrecord with the return_res parameter.

4cd1d7b

cx1111 approved these changes Mar 26, 2022

View reviewed changes

bemoody merged commit 6cf52fb into master Mar 28, 2022

bemoody deleted the non-smooth branch March 28, 2022 21:23

cx1111 mentioned this pull request May 23, 2022

Better API for specifying return data of signals #376

Open

Handle smooth_frames=False consistently for single and multi-frequency records #313

Handle smooth_frames=False consistently for single and multi-frequency records #313

Uh oh!

Conversation

bemoody commented Jul 9, 2021

Uh oh!

briangow commented Aug 3, 2021

Uh oh!

bemoody commented Aug 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cx1111 Oct 22, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bemoody commented Oct 26, 2021 via email

Uh oh!

briangow commented Nov 10, 2021

Uh oh!

bemoody commented Nov 16, 2021 via email

Uh oh!

briangow commented Nov 18, 2021

Uh oh!

tompollard commented Mar 3, 2022

Uh oh!

cx1111 commented Mar 3, 2022

Uh oh!

bemoody commented Mar 15, 2022

Uh oh!

bemoody commented Mar 16, 2022

Uh oh!

cx1111 commented Mar 16, 2022

Uh oh!

cx1111 left a comment

Choose a reason for hiding this comment

Uh oh!

cx1111 commented Mar 18, 2022

Uh oh!

bemoody commented Mar 21, 2022

Uh oh!

bemoody commented Mar 21, 2022

Uh oh!

cx1111 left a comment

Choose a reason for hiding this comment

Uh oh!

bemoody commented Mar 28, 2022

Uh oh!

Uh oh!

bemoody commented Aug 9, 2021 •

edited

Loading