DuckDB to import columns as strings as a fallback #341
Conversation
yoohoo!
Regarding the check for "Could not convert", if it's not complicated I'd rather have it, not so much for performance but to clarify the code path. This way, if we stumble on a different issue later with CSV ingestion, it might be easier to let DuckDB throw the error on the first read. But definitely not a huge concern.
love it! what a nice and clean solution 👌
I agree, and added a check for "Could not convert." It does feel a bit brittle to add a check for a hard-coded string, but the clarity it adds seems worth it.
This change introduces new public API: a caller can now say `DuckDB.of(sources, {untyped: true})` to opt in to ALL_VARCHAR (but solely for CSV). In addition, this change mixes Observable’s options (currently just the untyped option) into DuckDB’s config object; we have to hope that our options don’t conflict with DuckDB’s options either now or in the future.
We don’t need to introduce public API in order to support this change, and certainly that’s not our primary goal since our intent is that DuckDB.of will automatically retry when it encounters an error inferring types. So I think that suggests for the sake of prudence/parsimony that we should avoid introducing new public API to support this (and as a secondary benefit, we don’t have to worry about our public API overlapping with DuckDB’s).
The simplest idea I have on how to fix this is to use a symbol instead of the string name "untyped" for the option. I.e.,

`const untyped = Symbol("untyped"); // defined privately in this file`

And when needed internally we can say:

`DuckDB.of(sources, {[untyped]: true});`

If the `untyped` symbol isn’t exposed externally, then we can mix it into the config object without exposing a public API and without potentially conflicting with DuckDB.
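(For context, since this suggestion was ultimately superseded: a minimal sketch of how the symbol-keyed option could work. The surrounding helper and its shape are assumptions for illustration, not the actual duckdb.js code.)

```js
// Sketch only — not the actual duckdb.js implementation.
const untyped = Symbol("untyped"); // defined privately in this file, never exported

async function insertCSV(connection, file, name, options = {}) {
  // A symbol-keyed flag cannot collide with DuckDB's own string-keyed options,
  // and external callers cannot set it without access to the symbol itself.
  if (options[untyped]) {
    // …take the all-VARCHAR code path here (details elided)…
  }
  return await connection.insertCSVFromPath(file.name, {name, schema: "main", ...options});
}

// Internal call sites would opt in like this:
// insertCSV(connection, file, name, {[untyped]: true});
```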
src/duckdb.js (outdated)

return await connection.insertCSVFromPath(file.name, {
  name,
  schema: "main",
  ...options
I know for example it’s possible to pass in the `schema` here when you already know the types in the CSV file. I wonder if there’s not an equivalent varchar option we could pass here (but the new insertion method is fine too). Maybe just a note for the future if we want to have more control over what types DuckDB uses, so that we can enforce consistency between data table and SQL cells.
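(For future reference, a rough sketch of what declaring types up front might look like, assuming the installed duckdb-wasm version supports the `detect`/`columns` fields of its CSV insert options; the column names here are made up.)

```js
import * as arrow from "apache-arrow";

// Hypothetical helper: declare every column as a string (Utf8) up front
// instead of letting DuckDB infer types from a sample of rows.
async function insertCSVAsStrings(connection, file, name) {
  return await connection.insertCSVFromPath(file.name, {
    name,
    schema: "main",
    detect: false, // skip type inference; use the declared columns below
    header: true,
    columns: {
      // Column names must be known in advance — invented here for illustration.
      city: new arrow.Utf8(),
      population: new arrow.Utf8()
    }
  });
}
```

The catch is that this path requires knowing the column names in advance, so it doesn’t provide a blanket "treat everything as VARCHAR" switch.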
Before adding the new insertion method, I tried finding an equivalent varchar option to pass here, but didn't have success.
Great, thanks for making the changes! Only one small fix now for the circular import.
@mbostock It's much simpler now, no need for any config/options anywhere. Thanks for your feedback!
Oh yeah, that’s much better. Wish I’d thought of that!
(Not sure how the discussion re. defining untyped in duckdb.js and importing it into table.js triggered this, or why that was an issue since table.js already imports duckdb.js, but the point is moot since this solution is better in any case!)
Partially resolves https://github.com/observablehq/observablehq/issues/9857
The two cases it resolves are:
Case 2 (and likely case 1) occurred when the mismatch was found in a row > 10240, as that's the (default?) sample size DuckDB checks when inferring types.
Now, if `insertCSVFromPath` fails, we catch the error, check whether it failed due to a conversion error, and if so, try again with all columns imported as strings (see the sketch at the end of this description). Only CSV and TSV files are affected by this change.

The error for case 1 (before):


Case 1 fixed (after):
Video showing before and after for case 2:
fallback_to_string_type.mov
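For completeness, a minimal sketch of the retry flow described above, assuming a duckdb-wasm AsyncDuckDBConnection. The `insertUntypedCSV` helper and the exact way the PR forces VARCHAR columns are illustrative assumptions, not the actual duckdb.js code.

```js
// Sketch of the fallback flow, not the actual implementation.
async function insertCSV(connection, file, name) {
  try {
    // First attempt: let DuckDB infer column types from its row sample.
    return await connection.insertCSVFromPath(file.name, {name, schema: "main"});
  } catch (error) {
    // Only fall back when type inference failed; rethrow anything else so
    // unrelated CSV-ingestion problems still surface on the first read.
    if (!String(error.message).includes("Could not convert")) throw error;
    return await insertUntypedCSV(connection, file, name);
  }
}

// Hypothetical fallback: re-read the file with every column as VARCHAR.
// ALL_VARCHAR is a real DuckDB read_csv_auto option; whether the PR invokes
// it this way is an assumption.
async function insertUntypedCSV(connection, file, name) {
  return await connection.query(
    `CREATE TABLE "${name}" AS SELECT * FROM read_csv_auto('${file.name}', ALL_VARCHAR=TRUE)`
  );
}
```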