Skip to content

Infer schemas and coerce data for table cells #346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 91 commits into from
Feb 3, 2023

Conversation

libbey-observable
Copy link
Contributor

@libbey-observable libbey-observable commented Jan 25, 2023

Resolves https://github.com/observablehq/observablehq/issues/9862 and https://github.com/observablehq/observablehq/issues/9673

In this inference approach, we take a sample (currently the first 100 rows), and for each column, count how many times we encounter each possible data type. Then, if > 90% of the sampled values conform to that type, we take the most frequently encountered type as the column's type. If not, we use type "other."

These changes include some support for upcoming column type assertions, but for the purposes of this PR, I'm testing against main.

inference_and_coercion.mov

Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there! Just a few more things, and we’re done, I swear. 😅

src/table.js Outdated
case "boolean":
if (typeof value === "string") {
const trimValue = value.trim();
return trimValue === "true" ? true : trimValue === "false" ? false : null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to note that I was surprised that our boolean coercion returned false for a column of string values, but true for other types, such as a column of empty objects. Maybe the default return value for strings should be Boolean(value), but then that ruins detecting the strings "true" and "false".
Screen Shot 2023-02-02 at 11 45 14 AM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returned null for string values (rather than false). I'm not sure what the right response is when a Date, object, or other types are coerced to boolean. TRUE seems surprising to me, but if that's what we want to do, then it does seem strange to have type string alone not conform to that. Any thoughts, @mbostock?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to do this because string is special: it’s the “untyped” format we use to serialize other types (namely in CSV).

Co-authored-by: Mike Bostock <[email protected]>
Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posted a few trivial suggestions which you can take or leave, but here is what I think (hope!) is the last logical issue! 🙏

@libbey-observable libbey-observable merged commit fa1b356 into main Feb 3, 2023
@libbey-observable libbey-observable deleted the libbey/infer-and-coerce branch February 3, 2023 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants