Skip to content

Replace univocity-parsers with FastCSV #4606

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

vdmitrienko
Copy link
Contributor

@vdmitrienko vdmitrienko commented Jun 1, 2025

Overview

#4339


I hereby agree to the terms of the JUnit Contributor License Agreement.


Definition of Done

Copy link
Member

@marcphilipp marcphilipp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very promising! 👍

Comment on lines +113 to +115
* The `CsvFileSource.lineSeparator()` parameter is deprecated because line separators
are now detected automatically during CSV parsing. This setting is no longer required
and will be ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does auto-detection work in all cases? What happens if \n is used in a cell like in the following example with 4 columns?

a;b;\n
c;d\r\n
e;f;g;h\r\n

(assuming \n and \r are replaced with the corresponding character)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does auto-detection work in all cases?

The auto-detection treats each of \r, \n, and \r\n as a line separator. For example, given the following input:

a;b\r
c;d\n
e;f\r\n
g;h

The result is:

[["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]

In contrast, univocity-parsers (when configured with \n as the line separator) produces different results:

["a", "b\rc", "d"], ["e", "f"], ["g", "h"]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if \n is used in a cell like in the following example with 4 columns?

In this case, the results from FastCSV and univocity-parsers are mostly similar.

FastCSV:

[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]

univocity-parsers:

// .lineSeparator("\n") - same as FastCSV
[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]

// .lineSeparator("\r\n") - same as FastCSV
[["a", "b", null], ["c", "d"], ["e", "f", "g", "h"]]

// .lineSeparator("\r") - differs from FastCSV
[["a", "b", "c", "d"], ["e", "f", "g", "h"], [null]]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m afraid this breaks compatibility if someone uses a character sequence as a line delimiter that is not a newline.

Copy link
Contributor Author

@vdmitrienko vdmitrienko Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, considering 3 possible scenarios, all of them imply a breaking change 😞

  1. User explicitly relies on \r\n as the line separator:
    \r - causes an unexpected line break;
    \n - causes an unexpected line break;

  2. User explicitly relies on \r as the line separator:
    \n - causes an unexpected line break;
    \r\n - no change, since \r is already interpreted as a line break;

  3. User explicitly relies on \n as the line separator:
    \r - causes an unexpected line break;
    \r\n - no change, since \n is already interpreted as a line break;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@osiegmar, would it be possible to add support for a lineSeparator() parameter in FastCSV?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, yes. Of course, this wouldn’t be a valid CSV file at all. Is this really a desired feature or just lack of specification/documentation and a good chance to change that with the new major version of JUnit?

Is there a (good) reason, someone separates text records by anything that is not a newline sequence?

Is there known usage of this?

Copy link
Member

@marcphilipp marcphilipp Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dropping this in the new major version makes sense. I'm not aware of any cases other than using the same line separator on different operating systems. IIRC I initially introduced it because univocity-parsers would use the system line separator (and only that) be default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dropping this in the new major version makes sense.

Great 👍
I'll make sure to clarify that in the release notes. Adding a few tests wouldn't hurt either.

}
return String.join("\n", csvSource.value());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does FastCSV provide an API for line-by-line reading so we don't have to create a string first? It's probably not a big deal since it comes from literals in an annotation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With osiegmar/FastCSV@1077389 there is one now. @vdmitrienko You may want to give it a try if it simplify things for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @osiegmar. This works well with individual strings, but it doesn't support headers. I think adding an overload that accepts an array (or varargs) of strings could simplify this use case:

build(final CsvCallbackHandler<T> callbackHandler, final String... data)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the validation of empty records, having a setting for that could be quite handy. It would also be great if the exception message included the index of the empty record. That said, we could also handle this on our side 🙂

@vdmitrienko vdmitrienko requested a review from osiegmar June 5, 2025 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants