Skip to content

DuckDB to import columns as strings as a fallback #341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jan 13, 2023

Conversation

libbey-observable
Copy link
Contributor

@libbey-observable libbey-observable commented Jan 12, 2023

Partially resolves https://github.com/observablehq/observablehq/issues/9857

The two cases it resolves are:

  1. @Fil's case with thousands of rows of "F", which DuckDB interpreted as boolean, then threw an error when it encountered an "M."
  2. Allison's case of ejecting to SQL and it silently failing due to a type mismatch.

Case 2 (and likely case 1) occurred when the mismatch was found in a row > 10240, as that's the (default?) sample size DuckDB checks when inferring types.

Now, if insertCSVFromPath fails, we catch it, check whether it failed due to a conversion error, and if so, try again with all columns as strings. Only CSV and TSV files are affected by this change.

The error for case 1 (before):
Screen Shot 2023-01-11 at 4 21 40 PM
Case 1 fixed (after):
Screen Shot 2023-01-11 at 4 21 55 PM

Video showing before and after for case 2:

fallback_to_string_type.mov

Copy link
Contributor

@Fil Fil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yoohoo!

Regarding the check for "Could not convert", if it's not complicated I'd rather have it, not so much for performance but to clarify the code path. This way, if we stumble on a different issue later with csv ingestion, it might be easier to let duckdb throw the error on the first read. But definitely not a huge concern.

Copy link
Contributor

@annie annie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it! what a nice and clean solution 👌

@libbey-observable
Copy link
Contributor Author

Regarding the check for "Could not convert", if it's not complicated I'd rather have it, not so much for performance but to clarify the code path.

I agree, and added a check for "Could not convert." It does feel a bit brittle to add a check for a hard-coded string, but the clarity it adds seems worth it.

Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces public API: a caller can now say DuckDB.of(sources, {untyped: true}) to opt-in to the ALL_VARCHAR (but solely for CSV). In addition, this change mixes Observable’s options (currently just the untyped option) into DuckDB’s config object; we have to hope that our options don’t conflict with DuckDB’s options either now or in the future.

We don’t need to introduce public API in order to support this change, and certainly that’s not our primary goal since our intent is that DuckDB.of will automatically retry when it encounters an error inferring types. So I think that suggests for the sake of prudence/parsimony that we should avoid introducing new public API to support this (and as a secondary benefit, we don’t have to worry about our public API overlapping with DuckDB’s).

The simplest idea I have on how to fix this is to use a symbol instead of the string name "untyped" for the option. I.e.,

const untyped = Symbol("untyped"); // defined privately in this file

And when needed internally we can say:

DuckDB.of(sources, {[untyped]: true});

If the untyped symbol isn’t exposed externally, then we can mix it into the config object without exposing a public API and without potentially conflicting with DuckDB.

src/duckdb.js Outdated
return await connection.insertCSVFromPath(file.name, {
name,
schema: "main",
...options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know for example it’s possible to pass in the schema here when you already know the types in the CSV file. I wonder if there’s not an equivalent varchar option we could pass here (but the new insertion method is fine too). Maybe just a note for the future if we want to have more control over what types DuckDB uses so that we can enforce consistency between data table and SQL cells.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before adding the new insertion method, I tried finding an equivalent varchar option to pass here, but didn't have success.

Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for making the changes! Only one small fix now for the circular import.

@libbey-observable
Copy link
Contributor Author

@mbostock It's much simpler now, no need for any config/options anywhere. Thanks for your feedback!

Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, that’s much better. Wish I’d thought if that!

(Not sure how the discussion re. defining untyped in duckdb.js and importing it into table.js triggered this, or why that was an issue since table.js already imports duckdb.js, but the point is moot since this solution is better in any case!)

@libbey-observable libbey-observable merged commit a6fa6f0 into main Jan 13, 2023
@libbey-observable libbey-observable deleted the libbey/duckdb-fallback-to-string branch January 13, 2023 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants