Skip to content

DuckDBClient #313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 8, 2022
Merged

DuckDBClient #313

merged 5 commits into from
Nov 8, 2022

Conversation

mbostock
Copy link
Member

@mbostock mbostock commented Nov 2, 2022

This…

  • Mostly rewrites CMU’s DuckDBClient (with a proper implementation of queryStream).
  • Minimizes the exposed interface for DuckDBClient to reduce maintenance burden and discourage mutation.
  • Preserves stdlib’s Arrow as apache-arrow@4 for backwards compatibility.
  • Adds apache-arrow@9 (not 8) as a new dependency for DuckDB, per its package.json.
  • Uses jsDelivr’s esm.run to load @duckdb/duckdb-wasm and apache-arrow as an ES module (rather than require).
  • Minimizes the changes to recommended library versions in src/dependencies.mjs.
  • Adds CMU DIG’s BSD license to the derived code for DuckDBClient.

Demo:

Screen Shot 2022-11-02 at 3 31 07 PM

TODO Document and test the DuckDBClient implementation.

@mbostock mbostock requested a review from mkfreeman November 2, 2022 22:31
@@ -2,7 +2,7 @@
"extends": "eslint:recommended",
"parserOptions": {
"sourceType": "module",
"ecmaVersion": 2018
"ecmaVersion": 2020
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed for dynamic import. However, this probably means that this should be released as a major version bump since it means older JavaScript environments may not be able to support the new syntax. We could avoid this by reverting to the AMD (require) bundle for duckdb-wasm and apache-arrow, but it might be time to move forward. I still need to think about this a bit.

@mbostock
Copy link
Member Author

mbostock commented Nov 3, 2022

/cc @domoritz if you want to review!

@domoritz
Copy link
Contributor

domoritz commented Nov 3, 2022

Oh cool. I'll take a look.

@kimmolinna
Copy link

@mbostock this sounds really good. 👍

@mbostock mbostock mentioned this pull request Nov 5, 2022
1 task
async *readRows() {
try {
while (!batch.done) {
yield batch.value.toArray();
Copy link
Member Author

@mbostock mbostock Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as part of this PR, but we should have an option to return Apache Arrow Table instances directly from the database client rather than converting to an array of objects (rows) here. Probably this is an option (arrow = true) when you run a query, and perhaps you can specify it as a default option when constructing the DuckDBClient. In conjunction with #315 it would mean that we could entirely avoid materializing the array of objects when displaying the results of a DuckDBClient query in a table.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment DuckDBClient.query returns Apache Arrow Table so this sounds good once again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And It would be really great if you could use Apache Arrow Table also with Plot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kimmolinna Agreed! Tracking that here: observablehq/plot#1103

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet. I think native arrow support throughout would be awesome since we reduce materializations.

this._db = _db;
this._counter = 0;
async queryStream(query, params) {
const connection = await this._db.connect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What type of object is this._db that has the .connect method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A database client? Not seeing a .connect method in the specification

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is an AsyncDuckDB instance created here:

const db = await createDuckDB();

The connect method is documented here:

https://shell.duckdb.org/docs/classes/index.AsyncDuckDB.html#connect

} else {
result = await conn.query(query);
async query(query, params) {
const result = await this.queryStream(query, params);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have an implementation of .query that doesn't rely on queryStream for some back-compatibility with existing DuckDB notebooks (e.g., this cell)? I imagine it's fine, as those imported versions of DuckDBClient will overwrite the standard library version, but just wanted to surface this as a possible concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(leaving some notes as we're pairing on the review, thinking out loud a bit) - does this mean that all DuckDBClients support streaming (as this uses the queryStream method)?

Copy link
Member Author

@mbostock mbostock Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything in the standard library can be overridden, so any notebook that defines its own DuckDBClient won’t see the implementation here.

As far as this implementation goes, the fact that it implements db.query on top of db.queryStream should be totally invisible to the user. It implements the db.query method as specified in the database client specification and it only calls db.queryStream under the hood to reduce code duplication. It does not somehow mean that db.query or db.queryRow supports streaming; those methods still just return a promise to the results. But yes, this DuckDBClient does implement db.queryStream and hence it supports streaming.

I should also note that I didn’t implement the signal option for query methods in the DuckDBClient so there isn’t currently a way to cancel queries. But in practice that probably shouldn’t matter much because it’s an in-memory client and therefore most queries should be relatively fast. We could add it in the future.

src/duckdb.mjs Outdated
return Inputs.table(result);
}
async queryRow(query, params) {
const results = await this.query(query, params);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this.query calls this.queryStream as the first line, should we just call queryStream directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, but then we’d have to duplicate the rest of db.query, too (to call result.readRows e.g.). But I guess that’s worth it since we don’t need to read multiple batches just to get the first row.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimized in 6bf7aea.

});
await conn.close();
async sql(strings, ...args) {
return await this.query(strings.join("?"), args);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above (should we just use queryStream?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, then we’d need to duplicate the rest of db.query. db.queryStream doesn’t return a promise to an array of results; it returns a query stream response.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the clarification!

Copy link
Contributor

@libbey-observable libbey-observable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some basic testing locally after pairing on the review with @mkfreeman. Looks good!

Copy link
Contributor

@domoritz domoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. Just added some comments for more flexible imports.

}
{
const package = await resolve("apache-arrow@9");
console.log(`export const arrow9 = dependency("${package.name}", "${package.version}", "+esm");`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow 10 is out so you can update to that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

async *readRows() {
try {
while (!batch.done) {
yield batch.value.toArray();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet. I think native arrow support throughout would be awesome since we reduce materializations.

const table = arrow.tableFromJSON(array);
const buffer = arrow.tableToIPC(table);
const connection = await database.connect();
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to pull out a method for inserting an arrow Table. This method would be used to insert data that is already in columnar form.

Copy link
Member Author

@mbostock mbostock Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 77ccac0.

src/duckdb.mjs Outdated
await connection.query(
`CREATE VIEW '${name}' AS SELECT * FROM parquet_scan('${file.name}')`
);
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a case for file.name.endsWith(".arrow")

Copy link
Member Author

@mbostock mbostock Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 77ccac0.

src/duckdb.mjs Outdated
type
}) => ({
return {
schema: schema.fields.map(({name, type}) => ({
name,
type: getType(String(type)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maybe more brittle than directly checking against Arrow types but probably okay for now. Can improve later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can use the Apache Arrow type identifiers when looking at the schema for an Apache Arrow Table, but the result of a DESCRIBE doesn’t give me the numeric identifiers — I only get the string column_type. I want to use the same code for describeColumns, too. It’d be nice if DuckDB returned the numeric type identifier in addition to the string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 990c240. And I was able to simplify the switch for DuckDB types since now we only need to handle the canonical types returned by DESCRIBE rather than the aliases. (At least, I hope that’s the case… it appears to be in my local testing.)

@mbostock mbostock merged commit 77ccac0 into mkfreeman/duckdb Nov 8, 2022
@mbostock mbostock deleted the mbostock/duckdb branch November 8, 2022 04:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants