-
Notifications
You must be signed in to change notification settings - Fork 83
DuckDBClient #313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DuckDBClient #313
Conversation
7cb30c9
to
198cfbb
Compare
@@ -2,7 +2,7 @@ | |||
"extends": "eslint:recommended", | |||
"parserOptions": { | |||
"sourceType": "module", | |||
"ecmaVersion": 2018 | |||
"ecmaVersion": 2020 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed for dynamic import. However, this probably means that this should be released as a major version bump since it means older JavaScript environments may not be able to support the new syntax. We could avoid this by reverting to the AMD (require) bundle for duckdb-wasm and apache-arrow, but it might be time to move forward. I still need to think about this a bit.
/cc @domoritz if you want to review! |
Oh cool. I'll take a look. |
@mbostock this sounds really good. 👍 |
async *readRows() { | ||
try { | ||
while (!batch.done) { | ||
yield batch.value.toArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as part of this PR, but we should have an option to return Apache Arrow Table instances directly from the database client rather than converting to an array of objects (rows) here. Probably this is an option (arrow = true) when you run a query, and perhaps you can specify it as a default option when constructing the DuckDBClient. In conjunction with #315 it would mean that we could entirely avoid materializing the array of objects when displaying the results of a DuckDBClient query in a table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment DuckDBClient.query
returns Apache Arrow Table so this sounds good once again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And It would be really great if you could use Apache Arrow Table also with Plot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kimmolinna Agreed! Tracking that here: observablehq/plot#1103
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet. I think native arrow support throughout would be awesome since we reduce materializations.
this._db = _db; | ||
this._counter = 0; | ||
async queryStream(query, params) { | ||
const connection = await this._db.connect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What type of object is this._db
that has the .connect
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A database client? Not seeing a .connect
method in the specification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is an AsyncDuckDB instance created here:
Line 120 in a0fef0c
const db = await createDuckDB(); |
The connect method is documented here:
https://shell.duckdb.org/docs/classes/index.AsyncDuckDB.html#connect
} else { | ||
result = await conn.query(query); | ||
async query(query, params) { | ||
const result = await this.queryStream(query, params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have an implementation of .query
that doesn't rely on queryStream
for some back-compatibility with existing DuckDB notebooks (e.g., this cell)? I imagine it's fine, as those imported versions of DuckDBClient
will overwrite the standard library version, but just wanted to surface this as a possible concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(leaving some notes as we're pairing on the review, thinking out loud a bit) - does this mean that all DuckDBClients support streaming (as this uses the queryStream
method)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything in the standard library can be overridden, so any notebook that defines its own DuckDBClient won’t see the implementation here.
As far as this implementation goes, the fact that it implements db.query on top of db.queryStream should be totally invisible to the user. It implements the db.query method as specified in the database client specification and it only calls db.queryStream under the hood to reduce code duplication. It does not somehow mean that db.query or db.queryRow supports streaming; those methods still just return a promise to the results. But yes, this DuckDBClient does implement db.queryStream and hence it supports streaming.
I should also note that I didn’t implement the signal option for query methods in the DuckDBClient so there isn’t currently a way to cancel queries. But in practice that probably shouldn’t matter much because it’s an in-memory client and therefore most queries should be relatively fast. We could add it in the future.
src/duckdb.mjs
Outdated
return Inputs.table(result); | ||
} | ||
async queryRow(query, params) { | ||
const results = await this.query(query, params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this.query
calls this.queryStream
as the first line, should we just call queryStream
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that, but then we’d have to duplicate the rest of db.query, too (to call result.readRows e.g.). But I guess that’s worth it since we don’t need to read multiple batches just to get the first row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimized in 6bf7aea.
}); | ||
await conn.close(); | ||
async sql(strings, ...args) { | ||
return await this.query(strings.join("?"), args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above (should we just use queryStream
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, then we’d need to duplicate the rest of db.query. db.queryStream doesn’t return a promise to an array of results; it returns a query stream response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for the clarification!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some basic testing locally after pairing on the review with @mkfreeman. Looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. Just added some comments for more flexible imports.
} | ||
{ | ||
const package = await resolve("apache-arrow@9"); | ||
console.log(`export const arrow9 = dependency("${package.name}", "${package.version}", "+esm");`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow 10 is out so you can update to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to stick with Arrow 9 until duckdb-wasm upgrades…
async *readRows() { | ||
try { | ||
while (!batch.done) { | ||
yield batch.value.toArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet. I think native arrow support throughout would be awesome since we reduce materializations.
const table = arrow.tableFromJSON(array); | ||
const buffer = arrow.tableToIPC(table); | ||
const connection = await database.connect(); | ||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to pull out a method for inserting an arrow Table
. This method would be used to insert data that is already in columnar form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 77ccac0.
src/duckdb.mjs
Outdated
await connection.query( | ||
`CREATE VIEW '${name}' AS SELECT * FROM parquet_scan('${file.name}')` | ||
); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a case for file.name.endsWith(".arrow")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 77ccac0.
src/duckdb.mjs
Outdated
type | ||
}) => ({ | ||
return { | ||
schema: schema.fields.map(({name, type}) => ({ | ||
name, | ||
type: getType(String(type)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is maybe more brittle than directly checking against Arrow types but probably okay for now. Can improve later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can use the Apache Arrow type identifiers when looking at the schema for an Apache Arrow Table, but the result of a DESCRIBE doesn’t give me the numeric identifiers — I only get the string column_type. I want to use the same code for describeColumns, too. It’d be nice if DuckDB returned the numeric type identifier in addition to the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 990c240. And I was able to simplify the switch for DuckDB types since now we only need to handle the canonical types returned by DESCRIBE rather than the aliases. (At least, I hope that’s the case… it appears to be in my local testing.)
71fbda3
to
a849d1b
Compare
a849d1b
to
77ccac0
Compare
This…
Arrow
as apache-arrow@4 for backwards compatibility.Demo:
TODO Document and test the DuckDBClient implementation.