Infer schemas and coerce data for table cells #346

Merged 91 commits on Feb 3, 2023

ea741fe
Stop using {typed: true} for csv and tsv
libbey-observable Jan 19, 2023
558a6a6
Infer schema if none exists
libbey-observable Jan 19, 2023
ba09d45
Add schema validity check to address #9673
libbey-observable Jan 20, 2023
3a3f5a1
Update tests
libbey-observable Jan 20, 2023
e151efd
Handle sources that are arrays of primitives
libbey-observable Jan 20, 2023
ec38311
Formatting
libbey-observable Jan 21, 2023
6e9d64e
Quick updates based on feedback
libbey-observable Jan 23, 2023
799398f
With Mike F's coercion
libbey-observable Jan 23, 2023
5f30887
Remove new validity check fn and use existing
libbey-observable Jan 24, 2023
eb7008a
Don't mutate source
libbey-observable Jan 24, 2023
3786837
Don't mutate row
libbey-observable Jan 24, 2023
81018d4
Add exported fn to index.js
libbey-observable Jan 24, 2023
3ec692b
Apply user-selected types and update schema
libbey-observable Jan 24, 2023
2aaaad0
Combine into one regex
libbey-observable Jan 24, 2023
4bd58bf
Update handling of "other" and use d3.greatest
libbey-observable Jan 24, 2023
bf97713
Fix tests
libbey-observable Jan 24, 2023
afe5241
Small fixes
libbey-observable Jan 24, 2023
f5c648b
More coercion
libbey-observable Jan 25, 2023
740d860
Fix test
libbey-observable Jan 25, 2023
9ddd352
Try supporting number coercion into dates
libbey-observable Jan 25, 2023
bfd138c
Try with value.toString
libbey-observable Jan 25, 2023
1e820ec
Fixes and allowing for soft coercion
libbey-observable Jan 25, 2023
28a6aaf
Fix test
libbey-observable Jan 25, 2023
0c2a3ca
Formatting
libbey-observable Jan 25, 2023
8884734
Remove export
libbey-observable Jan 26, 2023
a95a1bb
Update number and date coercion
libbey-observable Jan 26, 2023
c7583f7
Infer integers even if type is number
libbey-observable Jan 26, 2023
b9aceae
Coercion improvements
libbey-observable Jan 26, 2023
41941c5
Add unit tests
libbey-observable Jan 26, 2023
7ee0456
Formatting
libbey-observable Jan 26, 2023
fa60ecc
Fix bug
libbey-observable Jan 26, 2023
1382c0b
Move coercion outside of loop
libbey-observable Jan 30, 2023
ac1219e
Perform intended check
libbey-observable Jan 30, 2023
9d2c39f
Update BigInt coercion and tests
libbey-observable Jan 30, 2023
c22c781
Update handling of whitespace
libbey-observable Jan 30, 2023
f44e3ef
Improve handling of ints, BigInts, and numbers
libbey-observable Jan 30, 2023
060f21d
Update coercion to arrays and objects
libbey-observable Jan 30, 2023
ac1d365
Infer bigints from strings
libbey-observable Jan 30, 2023
fc2b128
Check percentage of values conforming to inferred type
libbey-observable Jan 30, 2023
9e53e61
Remove soft coercion option
libbey-observable Jan 30, 2023
43b073b
Work with all keys present in data source
libbey-observable Jan 31, 2023
7444123
Add inferred property to schema elements
libbey-observable Jan 31, 2023
99e985e
Support raw type
libbey-observable Jan 31, 2023
31048d6
Remove stray options
libbey-observable Jan 31, 2023
6acab1d
Don't mutate schema
libbey-observable Jan 31, 2023
bbe7cf3
Rename variable
libbey-observable Jan 31, 2023
19e3eef
Remove unnecessary check
libbey-observable Jan 31, 2023
72f681b
Use schema rather than object keys
libbey-observable Jan 31, 2023
fc40237
Updates based on feedback
libbey-observable Jan 31, 2023
c5e070c
Don't getAllKeys if we have columns
libbey-observable Jan 31, 2023
a4443e3
Don't export for now
libbey-observable Jan 31, 2023
834aa71
Don't mutate columns
libbey-observable Jan 31, 2023
9b56371
Formatting
libbey-observable Jan 31, 2023
9dec278
Remove unnecessary inferFromPrimitive function
libbey-observable Feb 1, 2023
c2a20db
Update string coercion
libbey-observable Feb 1, 2023
2956af9
Update boolean coercion
libbey-observable Feb 1, 2023
e92247b
Remove stringValue
libbey-observable Feb 1, 2023
e45a130
Remove coercion for some types
libbey-observable Feb 1, 2023
35579dd
Move promotion of arrays of primitives into loadTableDataSource
libbey-observable Feb 1, 2023
6590f3b
Merge branch 'main' of https://github.com/observablehq/stdlib into li…
libbey-observable Feb 1, 2023
c3384d3
Fix names test
libbey-observable Feb 1, 2023
b5e10c7
Update coercion of numbers and dates
libbey-observable Feb 1, 2023
6bee09b
Don't coerce when type is array, object, buffer, or other
libbey-observable Feb 1, 2023
caba589
Add isDataArray check before arrayIsPrimitive
libbey-observable Feb 1, 2023
f8a2544
Handle whitespace-only strings as well
libbey-observable Feb 1, 2023
b771cc0
Tighten up date regex and use test instead of match
libbey-observable Feb 1, 2023
3055ebb
Repeat date regex when coercing
libbey-observable Feb 1, 2023
750ef2a
Move bulk of inference to new inferType function
libbey-observable Feb 2, 2023
3131fdb
Only use defined values in the denominator for 90% check
libbey-observable Feb 2, 2023
b77df67
Default to "other" rather than getting undefined as key on typeCounts
libbey-observable Feb 2, 2023
9c32bd3
Add value check back and tighten up date regex a bit
libbey-observable Feb 2, 2023
329465e
Update coercion to BigInt
libbey-observable Feb 2, 2023
4600d6e
Fix date regex
libbey-observable Feb 2, 2023
8eccbd8
Move date regex to constant
libbey-observable Feb 2, 2023
154e20e
Move trim to inferType function
libbey-observable Feb 2, 2023
d82b2f8
Update coercion of dates
libbey-observable Feb 2, 2023
2f2ee5e
Don't have inferType fall back to "other"
libbey-observable Feb 2, 2023
6eae9d3
Coerce empty strings to null when type is "date"
libbey-observable Feb 2, 2023
30ba6e5
Case-insensitive boolean inference/coercion
libbey-observable Feb 2, 2023
7d0f114
Allow multiple types to be counted during inference
libbey-observable Feb 2, 2023
9bff937
Update src/table.js
libbey-observable Feb 2, 2023
5c4bc45
Update src/table.js
libbey-observable Feb 2, 2023
8344ef6
Clean up trim and lower casing
libbey-observable Feb 2, 2023
0b21fac
Use trimmed string in filter
libbey-observable Feb 2, 2023
9f9a3f2
checkpoint
mbostock Feb 2, 2023
543d55b
tweaks to inferSchema
mbostock Feb 2, 2023
63ea079
combine loops!
mbostock Feb 2, 2023
89e62c6
whitespace, bigint fixes
mbostock Feb 2, 2023
f3a4ad8
prEtTieR
mbostock Feb 2, 2023
9a56e67
stricter string coercion
mbostock Feb 2, 2023
326f542
Handle column of nulls
libbey-observable Feb 2, 2023
204 changes: 196 additions & 8 deletions src/table.js
@@ -1,4 +1,4 @@
-import {reverse} from "d3-array";
+import {greatest, reverse} from "d3-array";
import {FileAttachment} from "./fileAttachment.js";
import {isArqueroTable} from "./arquero.js";
import {isArrowTable, loadArrow} from "./arrow.js";
@@ -66,13 +66,20 @@ function objectHasEnumerableKeys(value) {
}

function isQueryResultSetSchema(schemas) {
-return (Array.isArray(schemas) && schemas.every((s) => s && typeof s.name === "string"));
+return (
+Array.isArray(schemas) &&
+schemas.every(isColumnSchema)
+);
}

function isQueryResultSetColumns(columns) {
return (Array.isArray(columns) && columns.every((name) => typeof name === "string"));
}

function isColumnSchema(schema) {
return schema && typeof schema.name === "string" && typeof schema.type === "string";
}
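
A minimal sketch of what the stricter check changes, assuming the two functions behave as written in this diff. The `complete` and `missingType` inputs are hypothetical and only serve as illustration: a schema whose entries carry a name but no type no longer counts as valid, so `__table` would infer a schema for it instead.

```javascript
// Restated from the diff so the snippet runs on its own.
function isColumnSchema(schema) {
  return schema && typeof schema.name === "string" && typeof schema.type === "string";
}

function isQueryResultSetSchema(schemas) {
  return Array.isArray(schemas) && schemas.every(isColumnSchema);
}

// Hypothetical inputs for illustration.
const complete = [{name: "id", type: "integer"}, {name: "label", type: "string"}];
const missingType = [{name: "id"}, {name: "label", type: "string"}];

console.log(isQueryResultSetSchema(complete)); // true
console.log(isQueryResultSetSchema(missingType)); // false
console.log(isQueryResultSetSchema(undefined)); // false
```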

// Returns true if the value represents an array of primitives (i.e., a
// single-column table). This should only be passed values for which
// isDataArray returns true.
@@ -191,15 +198,17 @@ function sourceCache(loadSource) {
const loadTableDataSource = sourceCache(async (source, name) => {
if (source instanceof FileAttachment) {
switch (source.mimeType) {
-case "text/csv": return source.csv({typed: true});
-case "text/tab-separated-values": return source.tsv({typed: true});
+case "text/csv": return source.csv();
+case "text/tab-separated-values": return source.tsv();
case "application/json": return source.json();
case "application/x-sqlite3": return source.sqlite();
}
if (/\.(arrow|parquet)$/i.test(source.name)) return loadDuckDBClient(source, name);
throw new Error(`unsupported file type: ${source.mimeType}`);
}
if (isArrowTable(source) || isArqueroTable(source)) return loadDuckDBClient(source, name);
if (isDataArray(source) && arrayIsPrimitive(source))
return Array.from(source, (value) => ({value}));
return source;
});
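
For reference, a small sketch of the promotion step that now happens at load time. It only shows the `Array.from` mapping from this hunk applied to a hypothetical `source` array; going by the commit message "Move promotion of arrays of primitives into loadTableDataSource", the corresponding wrap and unwrap inside `__table` appear to have been dropped.

```javascript
// Hypothetical single-column input: an array of primitives.
const source = [3, 1, 2];

// Same mapping as in loadTableDataSource: each primitive becomes a row object
// with a single "value" column. The original array is left untouched.
const rows = Array.from(source, (value) => ({value}));

console.log(rows); // [{value: 3}, {value: 1}, {value: 2}]
console.log(source); // [3, 1, 2]
```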

@@ -542,15 +551,84 @@ export function getTypeValidator(colType) {
}
}

// Accepts dates in the form of ISOString and LocaleDateString, with or without time
const DATE_TEST = /^(([-+]\d{2})?\d{4}(-\d{2}(-\d{2}))|(\d{1,2})\/(\d{1,2})\/(\d{2,4}))([T ]\d{2}:\d{2}(:\d{2}(\.\d{3})?)?(Z|[-+]\d{2}:\d{2})?)?$/;
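
To make the accepted shapes concrete, here is a sketch that runs the regex from this hunk against a few hypothetical sample strings. Note that the pattern checks shape only: it requires year, month, and day for the ISO form, and it does not validate ranges, so an out-of-range string can pass the test and still yield an invalid `Date` later.

```javascript
// Copied from the diff.
const DATE_TEST = /^(([-+]\d{2})?\d{4}(-\d{2}(-\d{2}))|(\d{1,2})\/(\d{1,2})\/(\d{2,4}))([T ]\d{2}:\d{2}(:\d{2}(\.\d{3})?)?(Z|[-+]\d{2}:\d{2})?)?$/;

// Hypothetical sample strings, chosen for illustration.
const samples = [
  "2023-01-19", // ISO date
  "2023-01-19T12:30:00.000Z", // ISO date and time
  "2023-01-19 12:30", // space instead of "T"
  "1/19/2023", // slash-separated
  "99/99/9999", // right shape, impossible date
  "2023", // year only: month and day are required
  "2023-01", // year and month only
  "January 19, 2023" // long-form text
];

const results = Object.fromEntries(samples.map((s) => [s, DATE_TEST.test(s)]));
console.log(results);
// The first five samples match; the last three do not.
```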

export function coerceToType(value, type) {
switch (type) {
case "string":
return typeof value === "string" || value == null ? value : String(value);
case "boolean":
if (typeof value === "string") {
const trimValue = value.trim().toLowerCase();
return trimValue === "true"
? true
: trimValue === "false"
? false
: null;
}
return typeof value === "boolean" || value == null
? value
: Boolean(value);
case "bigint":
return typeof value === "bigint" || value == null
? value
: Number.isInteger(typeof value === "string" && !value.trim() ? NaN : +value)
? BigInt(value) // eslint-disable-line no-undef
: undefined;
case "integer": // not a target type for coercion, but can be inferred
case "number": {
return typeof value === "number"
? value
: value == null || (typeof value === "string" && !value.trim())
? NaN
: Number(value);
}
case "date": {
if (value instanceof Date || value == null) return value;
if (typeof value === "number") return new Date(value);
const trimValue = String(value).trim();
if (typeof value === "string" && !trimValue) return null;
return new Date(DATE_TEST.test(trimValue) ? trimValue : NaN);
}
case "array":
case "object":
case "buffer":
case "other":
return value;
default:
throw new Error(`Unable to coerce to type: ${type}`);
}
}
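
A short sketch of how the string handling plays out in practice. `coerceSketch` is a hypothetical, cut-down restatement of only the "boolean" and "number" branches of `coerceToType` above (the other types are omitted), so treat it as an illustration rather than the function itself.

```javascript
// Hypothetical reduced copy of two branches of coerceToType from the diff.
function coerceSketch(value, type) {
  switch (type) {
    case "boolean":
      if (typeof value === "string") {
        const trimValue = value.trim().toLowerCase();
        return trimValue === "true" ? true : trimValue === "false" ? false : null;
      }
      return typeof value === "boolean" || value == null ? value : Boolean(value);
    case "number":
      return typeof value === "number"
        ? value
        : value == null || (typeof value === "string" && !value.trim())
        ? NaN
        : Number(value);
    default:
      throw new Error(`Unable to coerce to type: ${type}`);
  }
}

console.log(coerceSketch(" TRUE ", "boolean")); // true (trimmed, case-insensitive)
console.log(coerceSketch("yes", "boolean")); // null (only "true"/"false" are recognized)
console.log(coerceSketch(0, "boolean")); // false (non-strings go through Boolean)
console.log(coerceSketch(" 42 ", "number")); // 42
console.log(coerceSketch("", "number")); // NaN (empty strings are not treated as 0)
console.log(coerceSketch(null, "number")); // NaN
```

The empty-string case is the one worth noticing: `Number("")` alone would give `0`, which is why the branch checks for blank strings first.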

// This function applies table cell operations to an in-memory table (array of
// objects); it should be equivalent to the corresponding SQL query. TODO Use
// DuckDBClient for data arrays, too, and then we wouldn’t need our own __table
// function to do table operations on in-memory data?
export function __table(source, operations) {
const input = source;
let {schema, columns} = source;
-let primitive = arrayIsPrimitive(source);
-if (primitive) source = Array.from(source, (value) => ({value}));
let inferredSchema = false;
if (!isQueryResultSetSchema(schema)) {
schema = inferSchema(source, columns);
inferredSchema = true;
}
// Combine column types from schema with user-selected types in operations
const types = new Map(schema.map(({name, type}) => [name, type]));
if (operations.type) {
for (const {name, type} of operations.type) {
types.set(name, type);
// update schema with user-selected type
if (schema === input.schema) schema = schema.slice(); // copy on write
const colIndex = schema.findIndex((col) => col.name === name);
if (colIndex > -1) schema[colIndex] = {...schema[colIndex], type};
}
source = source.map(d => coerceRow(d, types, schema));
} else if (inferredSchema) {
// Coerce data according to new schema, unless that happened due to
// operations.type, above.
source = source.map(d => coerceRow(d, types, schema));
}
for (const {type, operands} of operations.filter) {
const [{value: column}] = operands;
const values = operands.slice(1).map(({value}) => value);
@@ -663,7 +741,7 @@ export function __table(source, operations) {
Object.fromEntries(operations.select.columns.map((c) => [c, d[c]]))
);
}
-if (!primitive && operations.names) {
+if (operations.names) {
const overridesByName = new Map(operations.names.map((n) => [n.column, n]));
if (schema) {
schema = schema.map((s) => {
@@ -684,10 +762,120 @@ export function __table(source, operations) {
}))
);
}
-if (primitive) source = source.map((d) => d.value);
if (source !== input) {
if (schema) source.schema = schema;
if (columns) source.columns = columns;
}
return source;
}

function coerceRow(object, types, schema) {
const coerced = {};
for (const col of schema) {
const type = types.get(col.name);
const value = object[col.name];
coerced[col.name] = type === "raw" ? value : coerceToType(value, type);
}
return coerced;
}
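
A sketch of `coerceRow` on one hypothetical row. The `coerceToType` below is a stand-in that covers only the two types used here (an assumption made to keep the example short); `coerceRow` itself is restated from the diff. Two things to observe: the output contains only the columns named in the schema, and a column typed "raw" is passed through untouched.

```javascript
// Stand-in for coerceToType, limited to "number" and "string" (assumption).
function coerceToType(value, type) {
  if (type === "number") {
    return typeof value === "number"
      ? value
      : value == null || (typeof value === "string" && !value.trim())
      ? NaN
      : Number(value);
  }
  if (type === "string") {
    return typeof value === "string" || value == null ? value : String(value);
  }
  throw new Error(`Unable to coerce to type: ${type}`);
}

// Restated from the diff.
function coerceRow(object, types, schema) {
  const coerced = {};
  for (const col of schema) {
    const type = types.get(col.name);
    const value = object[col.name];
    coerced[col.name] = type === "raw" ? value : coerceToType(value, type);
  }
  return coerced;
}

// Hypothetical schema and row.
const schema = [
  {name: "id", type: "number"},
  {name: "label", type: "string"},
  {name: "note", type: "raw"}
];
const types = new Map(schema.map(({name, type}) => [name, type]));
const row = {id: "7", label: 12, note: 5, extra: "not in schema"};

const coerced = coerceRow(row, types, schema);
console.log(coerced); // {id: 7, label: "12", note: 5}
console.log(row.id); // "7" (the input row is not mutated)
```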

function createTypeCount() {
return {
boolean: 0,
integer: 0,
number: 0,
date: 0,
string: 0,
array: 0,
object: 0,
bigint: 0,
buffer: 0,
defined: 0
};
}

// Caution: the order below matters! 🌶️ When several types tie for the greatest
// count at or above the ≥90% threshold, the first one listed is the one we
// choose, so these types should be listed from most specific to least specific.
const types = [
"boolean",
"integer",
"number",
"date",
"bigint",
"array",
"object",
"buffer"
// Note: "other" and "string" are intentionally omitted; see below!
];

// We need to show *all* keys present in the array of Objects
function getAllKeys(rows) {
const keys = new Set();
for (const row of rows) {
// avoid crash if row is null or undefined
if (row) {
// only enumerable properties
for (const key in row) {
// only own properties
if (Object.prototype.hasOwnProperty.call(row, key)) {
// unique properties, in the order they appear
keys.add(key);
}
}
}
}
return Array.from(keys);
}
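
A quick sketch of why this helper exists: with ragged rows, looking only at the first row would miss columns. The function is restated from the diff (comments condensed) and `ragged` is a hypothetical input.

```javascript
// Restated from the diff so the snippet runs on its own.
function getAllKeys(rows) {
  const keys = new Set();
  for (const row of rows) {
    // skip null or undefined rows
    if (row) {
      for (const key in row) {
        // own enumerable properties only, in first-seen order
        if (Object.prototype.hasOwnProperty.call(row, key)) keys.add(key);
      }
    }
  }
  return Array.from(keys);
}

// Hypothetical ragged input, including a null row.
const ragged = [{a: 1}, {b: 2, a: 3}, null, {c: 4}];

const keys = getAllKeys(ragged);
console.log(keys); // ["a", "b", "c"]
console.log(Object.keys(ragged[0])); // ["a"] (first row alone would miss "b" and "c")
```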

export function inferSchema(source, columns = getAllKeys(source)) {
const schema = [];
const sampleSize = 100;
const sample = source.slice(0, sampleSize);
const typeCounts = {};
for (const col of columns) {
const colCount = typeCounts[col] = createTypeCount();
for (const d of sample) {
let value = d[col];
if (value == null) continue;
const type = typeof value;
if (type !== "string") {
++colCount.defined;
if (Array.isArray(value)) ++colCount.array;
else if (value instanceof Date) ++colCount.date;
else if (value instanceof ArrayBuffer) ++colCount.buffer;
else if (type === "number") {
++colCount.number;
if (Number.isInteger(value)) ++colCount.integer;
}
// bigint, boolean, or object
else if (type in colCount) ++colCount[type];
} else {
value = value.trim();
if (!value) continue;
++colCount.defined;
++colCount.string;
if (/^(true|false)$/i.test(value)) {
++colCount.boolean;
} else if (value && !isNaN(value)) {
++colCount.number;
if (Number.isInteger(+value)) ++colCount.integer;
} else if (DATE_TEST.test(value)) ++colCount.date;
}
}
// Choose the non-string, non-other type with the greatest count that is also
// ≥90%; or if no such type meets that criterion, fall back to string if
// ≥90%; and lastly fall back to other.
const minCount = Math.max(1, colCount.defined * 0.9);
const type =
greatest(types, (type) =>
colCount[type] >= minCount ? colCount[type] : NaN
) ?? (colCount.string >= minCount ? "string" : "other");
schema.push({
name: col,
type: type,
inferred: type
});
}
return schema;
}
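
The selection rule at the end of `inferSchema` can be sketched on hand-written counts. `pickType` and `counts` are hypothetical helpers; the plain loop stands in for d3's `greatest`, on the understanding that `greatest` skips `NaN` accessor values and keeps the first element on ties. The count objects are made up to mirror what the counting loop would produce for four kinds of column.

```javascript
// Same order as the `types` array in the diff.
const types = ["boolean", "integer", "number", "date", "bigint", "array", "object", "buffer"];

// Hypothetical helper: fill in zeroes for any count not given.
function counts(overrides) {
  return {
    boolean: 0, integer: 0, number: 0, date: 0, string: 0,
    array: 0, object: 0, bigint: 0, buffer: 0, defined: 0,
    ...overrides
  };
}

// Hypothetical stand-in for the greatest(...) ?? ... expression in inferSchema.
function pickType(colCount) {
  const minCount = Math.max(1, colCount.defined * 0.9);
  let best;
  let bestCount = -Infinity;
  for (const type of types) {
    const count = colCount[type];
    // Strict ">" means the earlier (more specific) type wins a tie.
    if (count >= minCount && count > bestCount) {
      best = type;
      bestCount = count;
    }
  }
  return best ?? (colCount.string >= minCount ? "string" : "other");
}

// Nine integer-like strings plus one "n/a": integer and number tie at 9.
const mostlyIntegers = pickType(counts({defined: 10, string: 10, number: 9, integer: 9}));
// Eight numeric strings plus two words: no specific type reaches 90%.
const mixedStrings = pickType(counts({defined: 10, string: 10, number: 8, integer: 8}));
// Column of nulls: nothing is defined, so nothing reaches the minimum of 1.
const allNulls = pickType(counts({}));
// Half numbers, half objects (non-strings): no type reaches 90%.
const halfAndHalf = pickType(counts({defined: 10, number: 5, object: 5}));

console.log(mostlyIntegers, mixedStrings, allNulls, halfAndHalf);
// integer string other other
```

The `Math.max(1, ...)` floor is what keeps an all-null column from matching every type with a count of zero.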