Infer schema for relevant data sources #344


Closed
wants to merge 7 commits

Conversation

libbey-observable
Contributor

In this approach, we take a sample (currently the first 100 rows), and for each column, count how many times we encounter each possible data type. Then we take the most frequently encountered type as the column's type. We could also add some random sampling.

Note that in the screenshots, the data in the columns has not been coerced, this is an in-between point, where we've inferred types, but not yet applied them.

From a CSV file:
[Screenshot: Screen Shot 2023-01-19 at 3.51.20 PM]

From a JSON file:
[Screenshot: Screen Shot 2023-01-19 at 3.36.49 PM]
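
To make the approach concrete, here is a minimal sketch of the counting strategy described above; the `inferType` helper, the type names, and the sample handling are illustrative, not the code in this PR:

```js
// Sketch only: infer one type per column by counting candidate types over a sample.
function inferType(value) {
  if (value == null || value === "") return "other"; // missing values (see review below)
  if (typeof value === "boolean" || value === "true" || value === "false") return "boolean";
  if (typeof value === "number" || (typeof value === "string" && value.trim() !== "" && !isNaN(+value))) return "number";
  if (typeof value === "string" && !isNaN(Date.parse(value))) return "date";
  return "string";
}

function inferSchema(rows, sampleSize = 100) {
  const typeCounts = {}; // column name -> {type name -> count}
  for (const row of rows.slice(0, sampleSize)) {
    for (const key of Object.keys(row)) {
      const counts = (typeCounts[key] ??= {});
      const type = inferType(typeof row[key] === "string" ? row[key].trim() : row[key]);
      counts[type] = (counts[type] ?? 0) + 1;
    }
  }
  // The column's type is the most frequently encountered candidate type.
  return Object.fromEntries(
    Object.entries(typeCounts).map(([key, counts]) => [
      key,
      Object.entries(counts).sort(([, a], [, b]) => b - a)[0][0]
    ])
  );
}
```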

src/table.js Outdated
Comment on lines 712 to 720
value.match(
/^([-+]\d{2})?\d{4}(-\d{2}(-\d{2})?)?(T\d{2}:\d{2}(:\d{2}(\.\d{3})?)?(Z|[-+]\d{2}:\d{2})?)?$/
)
)
typeCounts[key]["date"]++;
else if (value.match(/(\d{1,2})\/(\d{1,2})\/(\d{2,4}) (\d{2}):(\d{2})/))
typeCounts[key]["date"]++;
else if (value.match(/(\d{4})-(\d{1,2})-(\d{1,2})/))
typeCounts[key]["date"]++;
Contributor Author

Copied the lonnng regex from d3's autoType, and added a few more common date formats, but maybe we don't want to be that flexible?

Contributor

Are the types you added appropriately created as dates using new Date()? If so, I think that's fine (but maybe arbitrary....?).

Contributor Author

Yes, they do get created as the expected dates with new Date().
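For context, the checks in a console look like this (note that parsing of non-ISO strings with new Date() is implementation-defined, so behavior can vary across engines):

```js
new Date("2023-01-19T15:36:49Z"); // ISO 8601 (the d3.autoType pattern) — spec-defined, parsed as UTC
new Date("1/19/2023 15:36");      // M/D/YYYY HH:MM — non-ISO, implementation-defined but widely accepted
new Date("2023-1-19");            // YYYY-M-D — non-ISO, implementation-defined but widely accepted
```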

Contributor

In that case, I'd say keep it!

@mkfreeman
Contributor

This is looking great! For reference, here's how we do the random sampling for getting string lengths - it includes the first 20 rows (because they are what the user sees -- perhaps not necessary here), and randomly samples 100 values (using a seed so the random values are always the same). https://github.com/observablehq/observablehq/blob/main/notebook-next/src/worker/computeSummaries.js#L86
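
The linked file isn't reproduced here, but a sketch of that idea (using d3.randomLcg for a seeded generator; the constants and names below are illustrative, not the code at that link) might look like:

```js
import {randomLcg} from "d3-random";

const FIXED = 20;    // always include the first rows, since they're what the user sees
const RANDOM = 100;  // number of additional randomly sampled rows
const SEED = 0.4487; // any fixed seed in [0, 1) keeps the sample stable across runs

// Sketch: take the first FIXED rows, then RANDOM more drawn with a seeded PRNG
// (sampled with replacement here, purely for brevity).
function sampleRows(rows) {
  const sample = rows.slice(0, FIXED);
  const random = randomLcg(SEED);
  for (let i = 0; i < RANDOM && rows.length > FIXED; ++i) {
    sample.push(rows[FIXED + Math.floor(random() * (rows.length - FIXED))]);
  }
  return sample;
}
```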

Member

@mbostock left a comment

Some code review.

Let’s find a time to chat through this in real time. I think we only want to do this in conjunction with type coercion; otherwise we are advertising types that won’t match the values.
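
For illustration only, the kind of coercion being referred to (applying the inferred schema so the stored values actually match the advertised types) could look like this sketch; the function and its shape are hypothetical:

```js
// Sketch only: coerce each row to the types in an inferred schema.
function coerceRow(row, schema) {
  const out = {};
  for (const [key, type] of Object.entries(schema)) {
    const value = row[key];
    if (value == null || value === "") out[key] = null;
    else if (type === "number") out[key] = +value;
    else if (type === "boolean") out[key] = value === true || value === "true";
    else if (type === "date") out[key] = new Date(value);
    else out[key] = value;
  }
  return out;
}
```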

let {schema, columns} = source;
if (!schema || !isValidSchema(schema)) source.schema = inferSchema(source);
Member

This mutates source.schema, which will be visible externally; we should avoid that, since mutating inputs is an unexpected side effect of calling this function.

If necessary we can use a WeakMap to instead associate sources with valid schemas if we don’t want to re-infer them repeatedly. (It might also make sense to move this schema inference earlier, say in loadDataSource? Not sure though.)
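
A sketch of the WeakMap alternative, reusing the isValidSchema and inferSchema helpers from the snippet above (the getSchema wrapper is hypothetical):

```js
// Cache inferred schemas per source object instead of mutating source.schema.
const schemaCache = new WeakMap();

function getSchema(source) {
  const {schema} = source;
  if (schema && isValidSchema(schema)) return schema;
  let inferred = schemaCache.get(source);
  if (!inferred) schemaCache.set(source, (inferred = inferSchema(source)));
  return inferred;
}
```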

src/table.js Outdated
const type = typeof d[key];
const value = type === "string" ? d[key]?.trim() : d[key];
if (value === null || value === undefined || value.length === 0)
typeCounts[key]["other"]++;
Member

Not sure it’s appropriate to consider null/undefined/empty to be “other”. I would consider a column with 80% null/undefined and 20% string to be type string. In other words instead of counting these empty/missing values as a type, we should ignore them and only count types of present values. Otherwise sparsely populated columns will be considered “other” which is likely undesirable.

(Perhaps this is an alternative to the special treatment of “other” below, but I think for values that are truly “other” then we wouldn’t want that special treatment; the special treatment is only needed because we are considering nullish/empty values as “other” here.)
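
A sketch of that suggestion, as a fragment of the counting loop (sample, typeCounts, and inferType as in the sketch near the top of this thread):

```js
// Sketch of the suggestion: count only present values, so a sparsely populated
// column (e.g. 80% null, 20% string) is still typed "string".
for (const row of sample) {
  for (const key of Object.keys(row)) {
    const value = typeof row[key] === "string" ? row[key].trim() : row[key];
    if (value == null || value === "") continue; // skip missing values rather than counting "other"
    const counts = (typeCounts[key] ??= {});
    const type = inferType(value);
    counts[type] = (counts[type] ?? 0) + 1;
  }
}
```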

)
)
typeCounts[key]["date"]++;
else if (value.match(/(\d{1,2})\/(\d{1,2})\/(\d{2,4}) (\d{2}):(\d{2})/))
Member

It would be nice to combine these into a single regex if possible.
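
One way to do that (purely illustrative):

```js
// Illustrative: the three date patterns above, joined with alternation. The
// ^ and $ anchors bind only to the first (ISO 8601) alternative, as before.
const DATE_TEST =
  /^([-+]\d{2})?\d{4}(-\d{2}(-\d{2})?)?(T\d{2}:\d{2}(:\d{2}(\.\d{3})?)?(Z|[-+]\d{2}:\d{2})?)?$|\d{1,2}\/\d{1,2}\/\d{2,4} \d{2}:\d{2}|\d{4}-\d{1,2}-\d{1,2}/;

// The three else-if branches then collapse to one:
// else if (DATE_TEST.test(value)) typeCounts[key]["date"]++;
```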

const columns = Object.keys(typeCounts);
for (const col of columns) {
// sort descending so most commonly encountered type is first
const typesSorted = Object.keys(typeCounts[col]).sort(function (a, b) {
Member

Nit: If we remove the special treatment of “other” here, we could probably use d3.greatest to get the most common type rather than needing the more expensive sort. Though it’s probably not noticeable since the set of possible types is small.
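
A sketch of that (assuming missing values are no longer counted as "other", and typeCounts as above):

```js
import {greatest} from "d3-array";

// Pick the most common type per column without sorting each count object.
const schema = {};
for (const col in typeCounts) {
  const [type] = greatest(Object.entries(typeCounts[col]), ([, count]) => count);
  schema[col] = type;
}
```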

@libbey-observable
Contributor Author

@mbostock Thanks again for the valuable feedback – the issues you mentioned have been addressed in #346. Closing this PR in favor of that.
