Infer schema for relevant data sources #344


Closed
wants to merge 7 commits

Conversation

libbey-observable
Contributor

In this approach, we take a sample (currently the first 100 rows), and for each column, count how many times we encounter each possible data type. Then we take the most frequently encountered type as the column's type. We could also add some random sampling.

Note that in the screenshots, the data in the columns has not been coerced, this is an in-between point, where we've inferred types, but not yet applied them.

From a CSV file:
[Screenshot: Screen Shot 2023-01-19 at 3.51.20 PM]

From a JSON file:
[Screenshot: Screen Shot 2023-01-19 at 3.36.49 PM]
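
To make the approach concrete, here is a minimal sketch of the counting strategy described above; the `inferType` helper, the type names, and the sample handling are illustrative, not the code in this PR:

```js
// Sketch only: infer one type per column by counting candidate types over a sample.
function inferType(value) {
  if (value == null || value === "") return "other"; // missing values (see review below)
  if (typeof value === "boolean" || value === "true" || value === "false") return "boolean";
  if (typeof value === "number" || (typeof value === "string" && value.trim() !== "" && !isNaN(+value))) return "number";
  if (typeof value === "string" && !isNaN(Date.parse(value))) return "date";
  return "string";
}

function inferSchema(rows, sampleSize = 100) {
  const typeCounts = {}; // column name -> {type name -> count}
  for (const row of rows.slice(0, sampleSize)) {
    for (const key of Object.keys(row)) {
      const counts = (typeCounts[key] ??= {});
      const type = inferType(typeof row[key] === "string" ? row[key].trim() : row[key]);
      counts[type] = (counts[type] ?? 0) + 1;
    }
  }
  // The column's type is the most frequently encountered candidate type.
  return Object.fromEntries(
    Object.entries(typeCounts).map(([key, counts]) => [
      key,
      Object.entries(counts).sort(([, a], [, b]) => b - a)[0][0]
    ])
  );
}
```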

src/table.js Outdated
Comment on lines 712 to 720
value.match(
/^([-+]\d{2})?\d{4}(-\d{2}(-\d{2})?)?(T\d{2}:\d{2}(:\d{2}(\.\d{3})?)?(Z|[-+]\d{2}:\d{2})?)?$/
)
)
typeCounts[key]["date"]++;
else if (value.match(/(\d{1,2})\/(\d{1,2})\/(\d{2,4}) (\d{2}):(\d{2})/))
typeCounts[key]["date"]++;
else if (value.match(/(\d{4})-(\d{1,2})-(\d{1,2})/))
typeCounts[key]["date"]++;
Contributor Author

Copied the lonnng regex from d3's autoType, and added a few more common date formats, but maybe we don't want to be that flexible?

Contributor

Are the types you added appropriately created as dates using new Date()? If so, I think that's fine (but maybe arbitrary....?).

Contributor Author

Yes, they do get created as the expected dates with new Date().
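For context, the checks in a console look like this (note that parsing of non-ISO strings with new Date() is implementation-defined, so behavior can vary across engines):

```js
new Date("2023-01-19T15:36:49Z"); // ISO 8601 (the d3.autoType pattern) — spec-defined, parsed as UTC
new Date("1/19/2023 15:36");      // M/D/YYYY HH:MM — non-ISO, implementation-defined but widely accepted
new Date("2023-1-19");            // YYYY-M-D — non-ISO, implementation-defined but widely accepted
```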

Contributor

In that case, I'd say keep it!

@mkfreeman
Contributor

This is looking great! For reference, here's how we do the random sampling for getting string lengths - it includes the first 20 rows (because they are what the user sees -- perhaps not necessary here), and randomly samples 100 values (using a seed so the random values are always the same). https://github.com/observablehq/observablehq/blob/main/notebook-next/src/worker/computeSummaries.js#L86
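
The linked file isn't reproduced here, but a sketch of that idea (using d3.randomLcg for a seeded generator; the constants and names below are illustrative, not the code at that link) might look like:

```js
import {randomLcg} from "d3-random";

const FIXED = 20;    // always include the first rows, since they're what the user sees
const RANDOM = 100;  // number of additional randomly sampled rows
const SEED = 0.4487; // any fixed seed in [0, 1) keeps the sample stable across runs

// Sketch: take the first FIXED rows, then RANDOM more drawn with a seeded PRNG
// (sampled with replacement here, purely for brevity).
function sampleRows(rows) {
  const sample = rows.slice(0, FIXED);
  const random = randomLcg(SEED);
  for (let i = 0; i < RANDOM && rows.length > FIXED; ++i) {
    sample.push(rows[FIXED + Math.floor(random() * (rows.length - FIXED))]);
  }
  return sample;
}
```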

Member

@mbostock left a comment

Some code review.

Let’s find a time to chat through this in real time. I think we only want to do this in conjunction with type coercion; otherwise we are advertising types that won’t match the values.
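
For illustration only, the kind of coercion being referred to (applying the inferred schema so the stored values actually match the advertised types) could look like this sketch; the function and its shape are hypothetical:

```js
// Sketch only: coerce each row to the types in an inferred schema.
function coerceRow(row, schema) {
  const out = {};
  for (const [key, type] of Object.entries(schema)) {
    const value = row[key];
    if (value == null || value === "") out[key] = null;
    else if (type === "number") out[key] = +value;
    else if (type === "boolean") out[key] = value === true || value === "true";
    else if (type === "date") out[key] = new Date(value);
    else out[key] = value;
  }
  return out;
}
```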

let {schema, columns} = source;
if (!schema || !isValidSchema(schema)) source.schema = inferSchema(source);
Member

This mutates source.schema, which will be visible externally; we should avoid that, since mutating inputs is an unexpected side effect of calling this function.

If necessary we can use a WeakMap to instead associate sources with valid schemas if we don’t want to re-infer them repeatedly. (It might also make sense to move this schema inference earlier, say in loadDataSource? Not sure though.)
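
A sketch of the WeakMap alternative, reusing the isValidSchema and inferSchema helpers from the snippet above (the getSchema wrapper is hypothetical):

```js
// Cache inferred schemas per source object instead of mutating source.schema.
const schemaCache = new WeakMap();

function getSchema(source) {
  const {schema} = source;
  if (schema && isValidSchema(schema)) return schema;
  let inferred = schemaCache.get(source);
  if (!inferred) schemaCache.set(source, (inferred = inferSchema(source)));
  return inferred;
}
```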

src/table.js Outdated
const type = typeof d[key];
const value = type === "string" ? d[key]?.trim() : d[key];
if (value === null || value === undefined || value.length === 0)
typeCounts[key]["other"]++;
Member

Not sure it’s appropriate to consider null/undefined/empty to be “other”. I would consider a column with 80% null/undefined and 20% string to be type string. In other words instead of counting these empty/missing values as a type, we should ignore them and only count types of present values. Otherwise sparsely populated columns will be considered “other” which is likely undesirable.

(Perhaps this is an alternative to the special treatment of “other” below, but I think for values that are truly “other” then we wouldn’t want that special treatment; the special treatment is only needed because we are considering nullish/empty values as “other” here.)
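
A sketch of that suggestion, as a fragment of the counting loop (sample, typeCounts, and inferType as in the sketch near the top of this thread):

```js
// Sketch of the suggestion: count only present values, so a sparsely populated
// column (e.g. 80% null, 20% string) is still typed "string".
for (const row of sample) {
  for (const key of Object.keys(row)) {
    const value = typeof row[key] === "string" ? row[key].trim() : row[key];
    if (value == null || value === "") continue; // skip missing values rather than counting "other"
    const counts = (typeCounts[key] ??= {});
    const type = inferType(value);
    counts[type] = (counts[type] ?? 0) + 1;
  }
}
```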

)
)
typeCounts[key]["date"]++;
else if (value.match(/(\d{1,2})\/(\d{1,2})\/(\d{2,4}) (\d{2}):(\d{2})/))
Member

It would be nice to combine these into a single regex if possible.
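
One way to do that (purely illustrative):

```js
// Illustrative: the three date patterns above, joined with alternation. The
// ^ and $ anchors bind only to the first (ISO 8601) alternative, as before.
const DATE_TEST =
  /^([-+]\d{2})?\d{4}(-\d{2}(-\d{2})?)?(T\d{2}:\d{2}(:\d{2}(\.\d{3})?)?(Z|[-+]\d{2}:\d{2})?)?$|\d{1,2}\/\d{1,2}\/\d{2,4} \d{2}:\d{2}|\d{4}-\d{1,2}-\d{1,2}/;

// The three else-if branches then collapse to one:
// else if (DATE_TEST.test(value)) typeCounts[key]["date"]++;
```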

const columns = Object.keys(typeCounts);
for (const col of columns) {
// sort descending so most commonly encountered type is first
const typesSorted = Object.keys(typeCounts[col]).sort(function (a, b) {
Member

Nit: If we remove the special treatment of “other” here, we could probably use d3.greatest to get the most common type rather than needing the more expensive sort. Though it’s probably not noticeable since the set of possible types is small.
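
A sketch of that (assuming missing values are no longer counted as "other", and typeCounts as above):

```js
import {greatest} from "d3-array";

// Pick the most common type per column without sorting each count object.
const schema = {};
for (const col in typeCounts) {
  const [type] = greatest(Object.entries(typeCounts[col]), ([, count]) => count);
  schema[col] = type;
}
```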

@libbey-observable
Contributor Author

@mbostock Thanks again for the valuable feedback – the issues you mentioned have been addressed in #346. Closing this PR in favor of that.
