Skip to content

Handle smooth_frames=False consistently for single and multi-frequency records #313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 28, 2022

Conversation

bemoody
Copy link
Collaborator

@bemoody bemoody commented Jul 9, 2021

The smooth_frames argument to rdrecord is meant to determine whether the input signals are stored as {d|p}_signal (a homogeneous two-dimensional array, requiring all signals to be resampled to the same frequency) or e_{d|p}_signal (a list of one-dimensional arrays, keeping each signal at its actual sampling frequency).

However, this only works at present if you are trying to read signals with multiple samples per frame. (The assumption, I guess, being that if the record only has one sample per frame, then smooth format doesn't discard any information, so there is no reason for the application to want non-smooth format.)

IMNSHO, this is not a good idea: the application should not need to know in advance whether the record contains multiple samples per frame, and should not need to implement two different logic branches, simply to be able to read input signals without resampling. If the application asks for a 2D array it should get a 2D array, and if the application asks for a list of 1D arrays it should get a list of 1D arrays.

Fixing this could break applications, because if an existing application only ever reads single-frequency records, and it specifies smooth_frames=False and physical=True, it would receive p_signal data just as if it had specified smooth_frames=True. It would be silly for an application to do that, and sillier to rely on this behavior, but who knows.

This fixes issue #312 and also fixes the handling of smooth_frames=False together with sampfrom/sampto.

@briangow
Copy link
Contributor

briangow commented Aug 3, 2021

@bemoody this appears to be working from my testing. I do think we could improve some of the descriptions about how this works though. It took a fair bit of digging to find a good definition for how smooth_frames works and to find a database which uses a variable number of samps_per_frame (drivedb is one such database).

In addition to describing that smooth_frames = True will return a uniformly sampled array and smooth_frames = False will return a list of signals at their original sample rates, we could be clearer about how the "smoothing" works. Perhaps by stating that smooth_frames = True will decimate the signal by its samps_per_frame in the parameter descriptions.

In anticipation of merging this update I've created a new issue to reflect that we need to update the wfdb-python help documentation since it currently states that smooth_frames is only used when signals have multiple samples per frame, see #316 .

@bemoody
Copy link
Collaborator Author

bemoody commented Aug 9, 2021

Another issue is that the current implementation of smooth_frames=False is very slow, much slower than smooth_frames=True.

(For example, on my machine wfdb.rdrecord("sample-data/100") is nearly instantaneous, while wfdb.rdrecord("sample-data/100", smooth_frames=False) takes around 8 seconds.)

I'm not sure why this is.

returned if the signals have uniform samples/frame or if
`smooth_frames` is True. Otherwise a list of 1d numpy arrays
is returned.
returned if `smooth_frames` is True. Otherwise a list of 1d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea for the caller to know the return type/dimensions in advance. But this changes the meaning of the 'smooth_frames' phrase which is supposed to specify the behavior only for signals with multiple samples per frame. It also hides the context of what happens and when, in order to get the Nx2 array vs the N 1D arrays, by directly only focusing on the output type.

I would add another function argument, rename it, and or rework the documentation, to make the behavior more clear.

@bemoody
Copy link
Collaborator Author

bemoody commented Oct 26, 2021 via email

@briangow
Copy link
Contributor

I agree with @cx1111 that it would be better if the caller knew the return type. However, since it isn't feasible to only return one type, perhaps we could instead rename the smooth_frames parameter to return_smoothed_frames. This at least provides an indication (without having to read the docstring) that we are changing the return based on the setting of this variable.

@bemoody
Copy link
Collaborator Author

bemoody commented Nov 16, 2021 via email

@briangow
Copy link
Contributor

While I think it would be nice to have a clearer name for this parameter in the public API, it might cause more of a mess than it's worth. I think it would be more effective to focus on improving the documentation and tutorials/demo packages.

@tompollard
Copy link
Member

@cx1111 please could you respond to this point?:

The question I'm still unsure about, which wasn't completely clear from
Chen's message, is whether we also need to make a change in the public
API
(rdrecord). For rdrecord, 'smooth_frames' doesn't change the type
of the result (it's always a Record object); it only affects whether the
'd_signal' or 'e_d_signal' attribute is set.

@cx1111
Copy link
Member

cx1111 commented Mar 3, 2022

I agree that the current API which requires the caller to know the frequency of the signals, is poorly suited for an application writer trying to process large volumes of data. I think when I designed this, that I only thought about users trying to analyze a small homogeneous datasets they are familiar with.

rdrecord could always set e_d_signal/e_p_signal, and d_signal/p_signal could be computed lazily on request; i.e., the smooth_frames argument could be deprecated and ignored completely

Lazy evaluation sounds interesting and scary :P This would probably require a lot of further thought on the design and tradeoffs. Perhaps for the future.

Just to be clear, though, are you saying this should be changed in the public API (rdrecord)? Or only the internal API of wfdb.io._signal?

Internally, we should definitely add more params to control the behavior of _rd_dat_signals. In particular, this if else statement is rather nasty.

Perhaps we could add another function parameter such as return_dims which specifies the dimensions of the output array(s), which explicitly causes the function to return either the list of 1d arrays (val=1), or the 2d array (val=2). Then we can have the following behavior matrix with return_dims and smooth_frames:

  • (1, True): List of 1d arrays. Guaranteed to be same length.
  • (1, False): List of 1d arrays. Not guaranteed to be same length.
  • (2, True): Single 2d array.
  • (2, False): Illegal combination.

We could also add this new param to rdrecord and use the same behavior matrix.

This way, the user can specify both the desired output data type and the behavior for multi-frequency records. the smooth_frames param could be safely ignored for single-frequency records.

The changes could then be made in 2 steps:

  1. Change the internal signal functions.
  2. Change the public facing API of rdrecord.

Thoughts?

Benjamin Moody added 10 commits March 11, 2022 13:17
This function can return signal data in either of two formats:

- "smooth" (a two-dimensional array, where x[t,s] is sample t of
  signal s)

- "non-smooth" (a list of one-dimensional arrays, where x[s][t] is
  sample t of signal s)

Previously, _rd_dat_signals would use "smooth" format if
'sum(samps_per_frame) == n_sig', regardless of 'smooth_frames'.  This
makes little sense, since the caller needs to know what type of return
value to expect.  Instead, the format should be determined solely by
the 'smooth_frames' parameter.

(In this case, the only caller of _rd_dat_signals is _rd_segment, and
the only caller of _rd_segment is wfdb.io.record.rdrecord.)
This function can return signal data in either of two formats:

- "smooth" (a two-dimensional array, where x[t,s] is sample t of
  signal s)

- "non-smooth" (a list of one-dimensional arrays, where x[s][t] is
  sample t of signal s)

Previously, _rd_segment would use "smooth" format if
'sum(samps_per_frame) == n_sig', regardless of 'smooth_frames'.  This
makes little sense, since the caller needs to know what type of return
value to expect.  Instead, the format should be determined solely by
the 'smooth_frames' parameter.

(In this case, the only caller of _rd_segment is
wfdb.io.record.rdrecord.)
rdrecord can return signal data in any of four formats:

- by resampling all signals to a uniform rate ("smoothing"), and
  setting d_signal to a two-dimensional array, where d_signal[t,s] is
  sample t of signal s

- by resampling to a uniform rate, and setting p_signal to a
  two-dimensional array, where p_signal[t,s] is sample t of signal s
  converted into physical units

- by setting e_d_signal to a list of one-dimensional arrays, where
  e_d_signal[s][t] is sample t of signal s

- by setting e_p_signal to a list of one-dimensional arrays, where
  e_p_signal[s][t] is sample t of signal s converted into physical
  units

If the selected signals contain multiple samples per frame, the
behavior of rdrecord is consistent:

- If smooth_frames is True, the selected signals are resampled to the
  frame rate and stored as d_signal or p_signal.

- If smooth_frames is False, the selected signals are stored in their
  original form as e_d_signal or e_p_signal.

However, if each of the selected signals contains only one sample per
frame, rdrecord would previously behave inconsistently:

- If all signals in the record contain only one sample per frame, and
  smooth_frames is False and physical is True, the selected signals
  would be stored as p_signal (not e_p_signal as the caller would
  expect.)

- If all signals in the record contain only one sample per frame, and
  smooth_frames is False and physical is False, rdrecord would crash
  with a TypeError in convert_dtype.

- If some signals in the record contain multiple samples per frame
  (but the selected signals don't), rdrecord would crash with an
  AttributeError in _arrange_fields.

Change this behavior so that if smooth_frames is false, rdrecord will
always store the signals as e_d_signal or e_p_signal, regardless of
the underlying number of samples per frame.
calc_checksum is called in order to calculate checksums of the signal
data (d_signal or e_d_signal) in a record.  In particular, if rdrecord
is used to read only part of a record, it will call calc_checksum to
determine the checksums of that part of the record.  However, at that
time, self.n_sig is still equal to the total number of signals in the
input record (not the number of signals stored in d_signal or
e_d_signal, which might be different if a subset of channels are
selected).  Thus, if expanded is true and self.n_sig >
len(self.e_d_signal), this would crash.

For simplicity, and consistency with the expanded=False case, ignore
n_sig and simply calculate the checksums of e_d_signal.
The Record.__eq__ function is currently used in the test suite to
check whether two Record objects have the same contents.  (This
function doesn't actually conform to the standard Python API, but it
works for this purpose.)

In the case of "expanded"-format records, the e_p_signal and/or
e_d_signal attributes are lists of numpy.ndarray objects, not
numpy.ndarray objects themselves.  In order to compare these for
equality we must compare each element separately.
When smooth_frames is False, rdrecord should store the signal data as
a list of 1D arrays (each signal at its original sampling frequency)
rather than as a "smooth" 2D array.  Previously, this would fail if
the input record contained only one sample per frame.  Modify the
existing test case to check that this works.
When comparing the attributes of two records, if a value is an
instance of a class derived from 'np.ndarray' or a class derived from
'list', it should be treated the same way as an 'np.ndarray' or a
'list'.  Note that there currently are no such classes defined or used
by this package.

Note also that in any case, the respective types of the two attributes
must be equal in order for the values to be considered equal.
To simplify the implementation of _rd_segment and _rd_dat_signals, we
want to eliminate the smooth_frames argument, so that the return
values of these two functions will always have the same type (a list
of numpy arrays.)

Therefore, if the application requested frame smoothing, then instead
of calling _rd_segment with smooth_frames=True, we will call
_rd_segment with smooth_frames=False, and post-process the result by
calling Record.smooth_frames.

Record.smooth_frames (SignalMixin.smooth_frames) will give a result
equivalent to what _rd_segment gives with smooth_frames=True, but
there are likely differences in performance:

 - Record.smooth_frames performs the computation by slicing along the
   "long" axis and storing the intermediate results in an int64 numpy
   array.  _rd_dat_signals slices along the "short" axis and stores
   the intermediate results in a Python list.  Record.smooth_frames
   should therefore be faster for large inputs.

 - Record.smooth_frames only operates on the channels present in
   e_d_signal, whereas _rd_dat_signals smooths all of the signals in
   the input file.  Record.smooth_frames therefore saves memory and
   time when reading a subset of channels.

 - Record.smooth_frames always returns an int64 array, whereas
   _rd_dat_signals returns an array of the same type as the original
   data.  Record.smooth_frames therefore uses more memory in many
   cases.  (Note that rdrecord will post-process the result in any
   case, making this change invisible to applications; the issue of
   increased temporary memory usage can be addressed separately.)

 - If there are multiple channels in a signal file, then calling
   _rd_dat_signals with smooth_frames=False requires making an extra
   copy of each signal that has multiple samples per frame (because of
   the "reshape(-1)".)  (This could be addressed in the future by
   allowing _rd_segment, or at least _rd_dat_signals, to return a list
   of *two-dimensional* arrays instead.)

In order for this to work correctly, Record.smooth_frames must be
called after setting both e_d_signal and samps_per_frame.  In
particular, it must be done after Record._arrange_fields "rearranges"
samps_per_frame according to channels.

On the other hand, _arrange_fields is expected to set checksum and
init_value in different ways depending on whether the result is to be
smoothed.  (This use of checksum and init_value is somewhat dubious.)
Therefore, smooth_frames is now invoked as part of _arrange_fields,
after setting channel-specific metadata and before setting checksum
and init_value.

_arrange_fields should never be invoked other than by rdrecord; it
doesn't make any sense to call this function at other times.  Change
the signature of this function to reflect the fact that it actively
transforms the signal array, and make all arguments mandatory.
Now, _rd_segment is only ever called with smooth_frames=False (the
function previously fulfilled by smooth_frames=True is handled by
invoking Record.smooth_frames in Record._arrange_fields.)

Accordingly, remove the logic for smoothing frames within
_rd_dat_signals and for handling "smoothed" data in _rd_segment.
The smooth_frames parameter has been deprecated, and these functions
always return a list of arrays; remove the parameter.  Note that
_rd_segment is only ever called by rdrecord, and _rd_dat_signals is
only ever called by _rd_segment.
@bemoody
Copy link
Collaborator Author

bemoody commented Mar 15, 2022

Thanks!

To address your "step 1", I think the most straightforward solution is to remove the "smoothing" functionality from _rd_segment and _rd_dat_signals entirely. Perhaps that sounds like a huge, dramatic change, but it isn't really. :) The function Record.smooth_frames already exists, and it should give numerically identical results to what _rd_dat_signals does at present.

Aside from reducing the amount of duplicated logic, the implementation of Record.smooth_frames is also better designed (IMHO) and probably more efficient overall. (There are some slight drawbacks to "unpacking" and later "repacking" the array, but I think this is a minor issue and outweighed by the benefit of simplifying the codebase.) The performance can be further improved in later pull requests.

Note that 9d77075 is 16fa5ed rebased onto 0d41603. The new commits here are ca20b1a ("rdrecord: smooth frames by invoking smooth_frames()"), b2f3995 ("_rd_segment, _rd_dat_signals: remove support for smooth_frames=True"), and d02c478 ("_rd_segment, _rd_dat_signals: remove smooth_frames parameter").

@bemoody
Copy link
Collaborator Author

bemoody commented Mar 16, 2022

One more thing I'll say is that this restructuring should also make it much easier to add support for other signal file formats (notably WFDB-FLAC, but also to properly handle EDF/BDF, and maybe other things in the future.)

@cx1111
Copy link
Member

cx1111 commented Mar 16, 2022

Will try to review by EOW. Ping me if I forget.

Copy link
Member

@cx1111 cx1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Nice code deletion. Perhaps change the title of the PR and edit the OP description to include a summary of the changes?

@cx1111
Copy link
Member

cx1111 commented Mar 18, 2022

Also from now on, repository owners should merge their own PRs once they have approval.

@bemoody bemoody changed the title Handle smooth_frames for single-frequency records Handle smooth_frames=False consistently for single and multi-frequency records Mar 21, 2022
@bemoody
Copy link
Collaborator Author

bemoody commented Mar 21, 2022

Sorry to make this more complicated, but this change now breaks return_res:

>>> wfdb.rdrecord('sample-data/100', physical=False, return_res=16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/work/wfdb-python/wfdb/io/record.py", line 3620, in rdrecord
    record.convert_dtype(physical, return_res, smooth_frames)
  File "/home/user/work/wfdb-python/wfdb/io/_signal.py", line 723, in convert_dtype
    raise Exception('Cannot convert digital samples to lower dtype. Risk of overflow/underflow.')
Exception: Cannot convert digital samples to lower dtype. Risk of overflow/underflow.

I should have tested this.

I think the best thing to do here is to change smooth_frames() to store the result as the minimal data type (which I was intending to do in a separate pull request.) That behavior could be made optional but I really don't expect anyone to complain.

Benjamin Moody added 5 commits March 21, 2022 15:39
The logic for physical and digital cases is exactly the same except
for the choice of input data and the resulting data type.
Ensure that all of the input signals have the expected lengths before
trying to do anything else.

(If any of the signal lengths were wrong, this would typically have
raised a ValueError later on - though not always, since numpy will try
to be clever if one of the arguments is an array of length 1.)
Instead of always returning the result as an int64 or float64 array,
select the output type based on the types of the input arrays.

The output type should be the smallest type that has the correct
"kind" and is able to represent all input values.  For example, in
digital mode, if the input includes some int8 arrays and some int16
arrays, the result should be an int16 array.  In physical mode, if the
inputs are all float32, then the result will be float32; otherwise the
result will be float64.

However, although the output type should generally match the input
type, intermediate results may need to be stored as a different type.
For example, if the input and output are both int16, and one or more
signals have spf > 1 and use the entire 16-bit range, then the sum of
N samples will overflow an int16.  Previously, it was fine simply to
store the intermediate results in the output array itself, because the
output array was 64-bit, and no WFDB format has more than 32-bit
precision, and spf is (in practice) limited to at most 2**31-1.

For simplicity, continue using int64 or float64 as the intermediate
type, regardless of the actual input types and spf.

At the same time, we can also optimize the calculation slightly by
reshaping the input array and using np.sum, avoiding another Python
loop.
The variable tspf is not used and can be removed.

sig_len can be calculated using // rather than / and truncating to an
integer (the latter could be incorrect for records longer than 2**53.)

The output array can be allocated using np.empty instead of np.zeros
since it will be fully initialized by the subsequent loop.

Since each frame is independent of the others, the loop can be broken
up into blocks (here, arbitrarily chosen as 2**16 frames) to reduce
temporary memory usage while still spending most of our CPU time in
numpy operations.
@bemoody
Copy link
Collaborator Author

bemoody commented Mar 21, 2022

Hopefully the last set of changes in this pull request. :)

Changing smooth_frames() to return the minimal data type requires changing the logic a little (we can't just change the type of the array because the intermediate sums can easily exceed the range of the original data type.) At the same time, throw in a few optimizations.

Speed comparison:

(master - original logic of _rd_dat_signals)

>>> import timeit
>>> timeit.timeit('wfdb.rdrecord("sample-data/03700181")', 'import wfdb', number=100)
54.588751977000356

(d02c478 - original logic of smooth_frames)

>>> import timeit
>>> timeit.timeit('wfdb.rdrecord("sample-data/03700181")', 'import wfdb', number=100)
1.1534926320018712

(non-smooth - smooth_frames with further optimizations)

>>> import timeit
>>> timeit.timeit('wfdb.rdrecord("sample-data/03700181")', 'import wfdb', number=100)
1.0475175969986594

Copy link
Member

@cx1111 cx1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, and nice job with the improvements!

What exactly is responsible for most of the speedup? The smaller memory allocation chunks, or something else?

@bemoody
Copy link
Collaborator Author

bemoody commented Mar 28, 2022

Well, I don't know exactly what the bottlenecks are. But the big difference, I think, is that the original approach calculates every sample as a Python expression (notice that np.average is invoked once for every output sample of a high-res channel.) The new approach doesn't; every step of the process is done "in bulk", using numpy arrays to operate across thousands of samples at once.

Of course numpy operations also take time proportional to the size of the input, but because you're performing a homogeneous operation across a homogeneous array, the constant factor is a lot smaller!

@bemoody bemoody merged commit 6cf52fb into master Mar 28, 2022
@bemoody bemoody deleted the non-smooth branch March 28, 2022 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants