Infer schema for relevant data sources #344
Conversation
src/table.js (Outdated)
value.match(
  /^([-+]\d{2})?\d{4}(-\d{2}(-\d{2})?)?(T\d{2}:\d{2}(:\d{2}(\.\d{3})?)?(Z|[-+]\d{2}:\d{2})?)?$/
)
)
  typeCounts[key]["date"]++;
else if (value.match(/(\d{1,2})\/(\d{1,2})\/(\d{2,4}) (\d{2}):(\d{2})/))
  typeCounts[key]["date"]++;
else if (value.match(/(\d{4})-(\d{1,2})-(\d{1,2})/))
  typeCounts[key]["date"]++;
Copied the lonnng regex from d3's autoType, and added a few more common date formats, but maybe we don't want to be that flexible?
Are the types you added appropriately created as dates using `new Date()`? If so, I think that's fine (but maybe arbitrary....?).
Yes, they do get created as the expected dates with `new Date()`.
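For example, in Node/Chrome (with the caveat that parsing of non-ISO strings is engine-dependent rather than guaranteed by the spec):

```js
// Quick illustrative check of the formats matched by the two added patterns:
new Date("3/4/2021 10:30"); // matched by the M/D/YYYY HH:MM pattern
new Date("2021-3-4");       // matched by the YYYY-M-D pattern
// Both produce valid Date objects (not Invalid Date) in V8.
```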
In that case, I'd say keep it!
This is looking great! For reference, here's how we do the random sampling for getting string lengths - it includes the first 20 rows (because they are what the user sees -- perhaps not necessary here), and randomly samples 100 values (using a seed so the random values are always the same). https://github.com/observablehq/observablehq/blob/main/notebook-next/src/worker/computeSummaries.js#L86
Some code review.
Let’s find a time to chat through this in real time. I think we only want to do this in conjunction with type coercion; otherwise we are advertising types that won’t match the values.
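Roughly the shape I have in mind (a sketch only; `coerceValue`/`coerceRows` are hypothetical names, and the schema is assumed to be an array of `{name, type}` entries):

```js
// Hypothetical sketch: apply the inferred schema to the rows so the advertised
// types actually match the values.
function coerceValue(value, type) {
  if (value == null || value === "") return null;
  switch (type) {
    case "number": return Number(value);
    case "boolean": return value === true || value === "true";
    case "date": return new Date(value);
    default: return value; // strings and unknown types pass through unchanged
  }
}

function coerceRows(rows, schema) {
  return rows.map((row) =>
    Object.fromEntries(schema.map(({name, type}) => [name, coerceValue(row[name], type)]))
  );
}
```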
let {schema, columns} = source;
if (!schema || !isValidSchema(schema)) source.schema = inferSchema(source);
This mutates source.schema, which will be visible externally; we should avoid that because mutating inputs is an unexpected side effect of calling this function.
If necessary we can use a WeakMap to instead associate sources with valid schemas if we don’t want to re-infer them repeatedly. (It might also make sense to move this schema inference earlier, say in loadDataSource? Not sure though.)
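A minimal sketch of the WeakMap option (isValidSchema and inferSchema are the helpers from this PR; getSchema is a hypothetical wrapper):

```js
const schemaCache = new WeakMap(); // source object → previously inferred schema

function getSchema(source) {
  const {schema} = source;
  if (schema && isValidSchema(schema)) return schema;
  let inferred = schemaCache.get(source);
  if (inferred === undefined) {
    inferred = inferSchema(source);
    schemaCache.set(source, inferred);
  }
  return inferred; // no mutation of source
}
```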
src/table.js (Outdated)
const type = typeof d[key];
const value = type === "string" ? d[key]?.trim() : d[key];
if (value === null || value === undefined || value.length === 0)
  typeCounts[key]["other"]++;
Not sure it’s appropriate to consider null/undefined/empty to be “other”. I would consider a column with 80% null/undefined and 20% string to be type string. In other words, instead of counting these empty/missing values as a type, we should ignore them and only count the types of present values. Otherwise sparsely populated columns will be considered “other”, which is likely undesirable.
(Perhaps this is an alternative to the special treatment of “other” below, but I think for values that are truly “other” then we wouldn’t want that special treatment; the special treatment is only needed because we are considering nullish/empty values as “other” here.)
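A sketch of what that could look like (classify stands in for the number/boolean/date/string checks in this PR; the typeCounts shape follows the code above):

```js
function countTypes(rows, classify) {
  const typeCounts = {};
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      const raw = row[key];
      const value = typeof raw === "string" ? raw.trim() : raw;
      if (value === null || value === undefined || value === "") continue; // missing values don't vote
      const counts = (typeCounts[key] ??= {});
      const type = classify(value); // number/boolean/date/string checks
      counts[type] = (counts[type] ?? 0) + 1;
    }
  }
  return typeCounts;
}
```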
)
)
  typeCounts[key]["date"]++;
else if (value.match(/(\d{1,2})\/(\d{1,2})\/(\d{2,4}) (\d{2}):(\d{2})/))
It would be nice to combine these into a single regex if possible.
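One possible combined pattern (an untested sketch; note it anchors all three forms with ^…$, whereas the two extra patterns above only required a substring match, so this is slightly stricter than the current code):

```js
const DATE_PATTERN = new RegExp(
  "^(?:([-+]\\d{2})?\\d{4}(-\\d{2}(-\\d{2})?)?(T\\d{2}:\\d{2}(:\\d{2}(\\.\\d{3})?)?(Z|[-+]\\d{2}:\\d{2})?)?" + // ISO-ish (from d3.autoType)
    "|\\d{1,2}/\\d{1,2}/\\d{2,4} \\d{2}:\\d{2}" + // M/D/YYYY HH:MM
    "|\\d{4}-\\d{1,2}-\\d{1,2})$" // YYYY-M-D
);

// ...then the three date branches collapse to a single test:
if (DATE_PATTERN.test(value)) typeCounts[key]["date"]++;
```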
const columns = Object.keys(typeCounts);
for (const col of columns) {
  // sort descending so most commonly encountered type is first
  const typesSorted = Object.keys(typeCounts[col]).sort(function (a, b) {
Nit: If we remove the special treatment of “other” here, we could probably use d3.greatest to get the most common type rather than needing the more expensive sort. Though it’s probably not noticeable since the set of possible types is small.
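i.e. something like this (sketch):

```js
import {greatest} from "d3-array";

// Pick the most common type for one column without sorting all the counts.
function mostCommonType(counts) {
  const [type] = greatest(Object.entries(counts), ([, count]) => count);
  return type;
}

// mostCommonType({string: 3, date: 90, number: 7}) === "date"
```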
In this approach, we take a sample (currently the first 100 rows), and for each column, count how many times we encounter each possible data type. Then we take the most frequently encountered type as the column's type. We could also add some random sampling.
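For what it's worth, a rough sketch of what seeded random sampling could look like (the sizes and the use of d3.randomLcg are assumptions here, loosely following the computeSummaries.js approach linked above):

```js
import {randomLcg} from "d3-random";

// Take the first `head` rows plus `extra` seeded-random picks from the rest,
// so inference is deterministic across runs. (This samples with replacement,
// which is probably fine for counting types.)
function sampleRows(rows, {head = 20, extra = 100, seed = 0.42} = {}) {
  const sample = rows.slice(0, head);
  if (rows.length > head) {
    const random = randomLcg(seed);
    for (let i = 0; i < extra; ++i) {
      sample.push(rows[head + Math.floor(random() * (rows.length - head))]);
    }
  }
  return sample;
}
```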
Note that in the screenshots the data in the columns has not been coerced; this is an in-between point where we've inferred types but not yet applied them.
From a CSV file:

From a JSON file:
