DuckDB to import columns as strings as a fallback #341

libbey-observable · 2023-01-12T01:01:44Z

Partially resolves https://github.com/observablehq/observablehq/issues/9857

The two cases it resolves are:

1. @Fil's case with thousands of rows of "F", which DuckDB interpreted as boolean, then threw an error when it encountered an "M".
2. Allison's case of ejecting to SQL and it silently failing due to a type mismatch.

Case 2 (and likely case 1) occurred when the first mismatched value appeared after row 10240, since that's the (default?) number of rows DuckDB samples when inferring types.
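
To illustrate why a late mismatch slips past inference, here is a toy model of sample-based type sniffing. It is not DuckDB's algorithm: `inferType`, `loadColumn`, and the rule that only "T" and "F" count as booleans are simplifications invented for this sketch, and the sample size of 10240 is the unconfirmed figure mentioned above.

```javascript
// Toy model: decide a column type from only the first `sampleSize` values.
function inferType(values, sampleSize = 10240) {
  const sample = values.slice(0, sampleSize);
  const looksBoolean = sample.every((v) => v === "T" || v === "F");
  return looksBoolean ? "BOOLEAN" : "VARCHAR";
}

// Convert every value using the inferred type; fail on the first mismatch.
function loadColumn(values, sampleSize = 10240) {
  const type = inferType(values, sampleSize);
  if (type === "VARCHAR") return values.slice();
  return values.map((v, i) => {
    if (v === "T") return true;
    if (v === "F") return false;
    throw new Error(`Could not convert "${v}" to ${type} at row ${i + 1}`);
  });
}

// 10240 rows of "F" followed by a single "M": the sample sees only "F".
const column = [...Array(10240).fill("F"), "M"];
```

In this model, `inferType(column)` returns `"BOOLEAN"` because the "M" sits just outside the sample, so `loadColumn(column)` throws once it reaches the last row. Treating the column as strings from the start avoids the failure.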

Now, if insertCSVFromPath fails, we catch it, check whether it failed due to a conversion error, and if so, try again with all columns as strings. Only CSV and TSV files are affected by this change.
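
The retry described above could be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `insertCSVWithFallback` is a hypothetical name, `insertUntypedCSV` borrows its name from the first commit in this PR but its body here is a guess, and `connection` is assumed to expose `insertCSVFromPath` and `prepare` in the style of a duckdb-wasm connection. `ALL_VARCHAR` is the DuckDB CSV reader option that comes up later in the review.

```javascript
// Hypothetical helper: load the CSV with every column typed as VARCHAR.
// Assumes `connection.prepare` resolves to a statement with a `send` method.
async function insertUntypedCSV(connection, file, name) {
  const statement = await connection.prepare(
    `CREATE TABLE "${name}" AS SELECT * FROM read_csv_auto(?, ALL_VARCHAR=TRUE)`
  );
  return await statement.send(file.name);
}

// Hypothetical wrapper: try typed insertion first, and fall back to strings
// only when the failure looks like a type conversion error.
async function insertCSVWithFallback(connection, file, name, options = {}) {
  try {
    return await connection.insertCSVFromPath(file.name, {
      name,
      schema: "main",
      ...options
    });
  } catch (error) {
    if (String(error).includes("Could not convert")) {
      return await insertUntypedCSV(connection, file, name);
    }
    throw error; // any other failure is surfaced unchanged
  }
}
```

Matching on the message text is brittle, but it keeps unrelated ingestion errors (a missing file, a malformed row) from being silently retried.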

The error for case 1 (before):

Case 1 fixed (after):

Video showing before and after for case 2:

fallback_to_string_type.mov

Fil

yoohoo!

Regarding the check for "Could not convert", if it's not complicated I'd rather have it, not so much for performance but to clarify the code path. This way, if we stumble on a different issue later with csv ingestion, it might be easier to let duckdb throw the error on the first read. But definitely not a huge concern.

annie

love it! what a nice and clean solution 👌

libbey-observable · 2023-01-12T16:50:00Z

> Regarding the check for "Could not convert", if it's not complicated I'd rather have it, not so much for performance but to clarify the code path.

I agree, and added a check for "Could not convert." It does feel a bit brittle to add a check for a hard-coded string, but the clarity it adds seems worth it.

mbostock

This change introduces public API: a caller can now say DuckDB.of(sources, {untyped: true}) to opt in to ALL_VARCHAR (but only for CSV). In addition, this change mixes Observable’s options (currently just the untyped option) into DuckDB’s config object; we have to hope that our options don’t conflict with DuckDB’s options either now or in the future.

We don’t need to introduce public API in order to support this change, and certainly that’s not our primary goal since our intent is that DuckDB.of will automatically retry when it encounters an error inferring types. So I think that suggests for the sake of prudence/parsimony that we should avoid introducing new public API to support this (and as a secondary benefit, we don’t have to worry about our public API overlapping with DuckDB’s).

The simplest idea I have on how to fix this is to use a symbol instead of the string name "untyped" for the option. I.e.,

const untyped = Symbol("untyped"); // defined privately in this file

And when needed internally we can say:

DuckDB.of(sources, {[untyped]: true});

If the untyped symbol isn’t exposed externally, then we can mix it into the config object without exposing a public API and without potentially conflicting with DuckDB.
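
As a small, self-contained illustration of why a private symbol avoids both concerns: the symbol key survives object spread, but a caller who passes the string key "untyped" cannot reach it, and it does not show up among the string keys. `mergeConfig` is a hypothetical stand-in for whatever code merges caller options into the config; it is not part of DuckDB or of this PR.

```javascript
// Private to this module: code outside cannot construct the same key.
const untyped = Symbol("untyped");

// Hypothetical stand-in for code that mixes caller options into a config.
function mergeConfig(config = {}) {
  return {schema: "main", ...config};
}

const internal = mergeConfig({[untyped]: true}); // internal caller uses the symbol
const external = mergeConfig({untyped: true}); // outside caller only has the string

console.log(internal[untyped]); // true: object spread copies symbol keys
console.log(external[untyped]); // undefined: the string key is a different property
console.log(Object.keys(internal)); // string keys only; the symbol is not listed
```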

src/duckdb.js

mbostock · 2023-01-12T16:03:54Z

src/duckdb.js

+          return await connection.insertCSVFromPath(file.name, {
+            name,
+            schema: "main",
+            ...options


I know for example it’s possible to pass in the schema here when you already know the types in the CSV file. I wonder if there’s not an equivalent varchar option we could pass here (but the new insertion method is fine too). Maybe just a note for the future if we want to have more control over what types DuckDB uses so that we can enforce consistency between data table and SQL cells.

libbey-observable

Before adding the new insertion method, I tried finding an equivalent varchar option to pass here, but didn't have success.

mbostock

Great, thanks for making the changes! Only one small fix now for the circular import.

src/duckdb.js

libbey-observable · 2023-01-12T22:56:13Z

@mbostock It's much simpler now, no need for any config/options anywhere. Thanks for your feedback!

mbostock

Oh yeah, that’s much better. Wish I’d thought of that!

(Not sure how the discussion re. defining untyped in duckdb.js and importing it into table.js triggered this, or why that was an issue since table.js already imports duckdb.js, but the point is moot since this solution is better in any case!)

libbey-observable added 3 commits January 11, 2023 16:32

Add insertUntypedCSV function to duckdb

1bdf086

Try untyped insertion as a fallback when loading DuckDBClient

5cf6f4f

simplify

1650d9b

libbey-observable requested review from mbostock, mkfreeman and annie January 12, 2023 01:02

Fil approved these changes Jan 12, 2023

annie approved these changes Jan 12, 2023

Check error message

501fbc8

mbostock requested changes Jan 12, 2023

Use symbol and existing options

8465454

mbostock requested changes Jan 12, 2023

libbey-observable added 3 commits January 12, 2023 14:49

Undo changes to table.js

b5a4e7f

Catch earlier in process, no need for config

c27516a

Remove unnecessary variable

1a60e21

Update comment

175f5da

mbostock approved these changes Jan 12, 2023

libbey-observable merged commit a6fa6f0 into main Jan 13, 2023

libbey-observable deleted the libbey/duckdb-fallback-to-string branch January 13, 2023 00:02
