-
-
Notifications
You must be signed in to change notification settings - Fork 239
New 'UNLIST' function #8418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New 'UNLIST' function #8418
Conversation
What about interaction with the IN predicate? In the initial discussion, the idea was put forward that it would be nice to be able to perform queries like this SELECT * FROM table WHERE id IN UNLIST('1,2,3' RETURNING INT) That is, without additional wrapping of the UNLIST results in a derived table. |
I didn't do anything specifically for IN.
|
In the discussion of #8005 it was suggested to use the shortened syntax for One more question.
What type of input parameter will be returned to the client after preparing the query?
For the output parameter, unless otherwise specified, it will be |
|
||
G) | ||
CREATE DOMAIN D1 AS INT; | ||
SELECT TEST_DOMAIN FROM UNLIST('1,2,3,4' RETURNING D1) AS A(TEST_DOMAIN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it TYPE OF [COLUMN]
allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check. Granted. Added example in readme
src/dsql/ExprNodes.cpp
Outdated
@@ -6446,9 +6458,28 @@ dsql_fld* FieldNode::resolveContext(DsqlCompilerScratch* dsqlScratch, const Meta | |||
return nullptr; | |||
} | |||
|
|||
const TEXT* dsqlName = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULL
-> nullptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad.
src/dsql/parse.y
Outdated
table_value_function | ||
: table_value_function_clause table_value_function_correlation_name table_value_function_columns_name | ||
{ | ||
auto *node = nodeAs<TableValueFunctionSourceNode>($1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto *node = nodeAs<TableValueFunctionSourceNode>($1); | |
auto node = nodeAs<TableValueFunctionSourceNode>($1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Сorrected.
src/dsql/parse.y
Outdated
|
||
|
||
%type <metaNameArray> table_value_function_columns_name | ||
table_value_function_columns_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use derived_column_list
instead of create duplicate specific rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently I didn't notice this rule.
src/jrd/RecordSourceNodes.cpp
Outdated
@@ -3651,6 +3657,331 @@ RseNode* SelectExprNode::dsqlPass(DsqlCompilerScratch* dsqlScratch) | |||
return PASS1_derived_table(dsqlScratch, this, NULL); | |||
} | |||
|
|||
TableValueFunctionSourceNode* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a Firebird code convention. Please start the method name in the same line as the return type and break later as necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the clang-format utility to do the formatting. That's what he decided. I'll be more careful
const auto valueDesc = EVL_expr(tdbb, request, valueItem); | ||
if (valueDesc == nullptr) | ||
{ | ||
rpb->rpb_number.setValid(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why true
?
|
||
const auto textType = toDesc->getTextType(); | ||
|
||
auto setStringToRecord = [&] (string str, USHORT length = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whould not be a string reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd even say it should be a const string reference.
if (length == 0) | ||
continue; | ||
|
||
string valueStr(buffer, length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not copy the buffer to another string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you telling me to read the data directly into a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to use something lightweight like string_view here, so as not to allocate memory and not do unnecessary copying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to try an experiment with string_view
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked for string_view. tests pass
auto end = AbstractString::npos; | ||
do | ||
{ | ||
auto size = end = valueStr.find(separatorStr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this code compatible with multi-character separator and the list string split in different blob segments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's not. To be improved, I guess. While LIST
would never split the separator, the input may be provided from the different source as well and we cannot predict the layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asfernandes, do we generally consider OK to read/process the blob as a whole? If so, this would be the easiest solution (and IIRC it worked this way in our original implementation) and we do the same in some built-in functions. Or should it be considered a bad practice to be revisited/fixed and thus the new code should be able to process blobs in chunks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least, processing BLOBs in parts will save a lot on reading huge BLOBs if not all results are needed UNLIST
SELECT *
FROM UNLIST(?, ',')
ROWS 10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It requires to move the processing from open()
to getRecord()
. Currently implemented differently, but it could be worth the effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it depend on OPTIMIZE FOR {FIRST | ALL} ROWS
? I.e. process/return records in chunks for ALL ROWS
(not necessarily read/cache the whole blob, maybe just some sliding buffer) but process/return records one by one for FIRST ROWS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it depend on
OPTIMIZE FOR {FIRST | ALL} ROWS
? I.e. process/return records in chunks forALL ROWS
(not necessarily read/cache the whole blob, maybe just some sliding buffer) but process/return records one by one forFIRST ROWS
?
I do not see a clear relation of OPTIMIZE FOR
and UNNEST
. OPTIMIZE FOR
is top level query feature, while UNNEST
may be in inner queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the mechanics, not the clause itself. For example, a subquery with a FIRST
clause implies FOR FIRST ROWS
when compiling/optimizing its own RSE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation has two drawbacks:
- Increased memory consumption due to result buffering
- Slowing down queries with the FIRST ROWS strategy
In general, it is not very clear why TableValueFunction is a buffered data source. It seems to be related to primary, and all our primary sources are are pipelined. If you need to buffer the result, then the optimizer of the higher level does it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These issues are being addressed.
return; | ||
} | ||
|
||
if (valueDesc->isBlob()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Others Firebird functions compare strings using collation rules.
Here this is disregard. Should it be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question to be discussed. Theoretically, one may ask to split the text into chunks by e.g. both 'A' and 'a' used as a case-insensitive separator 'a'. But honestly, I cannot imagine that being used in practice. And collation-aware compares would add some extra overhead. So maybe it's worth comparing byte-wise (converting both value and separator to the common data type, if necessary). Or maybe we should consider two code branches depending on whether byte-wise compare is possible or not.
Good question to be resolved together. Something like |
In #8145 I already pass user-provided buffers to looper. One step further and functions can accept user input directly as descriptor avoiding coercion and thus overflow. |
On 1/30/25 13:01, Dimitry Sibiryakov wrote:
it would lead to a runtime error if overflowed by the user
(coercion will fail).
In #8145 <#8145> I already
pass user-provided buffers to looper. One step further and functions
can accept user input directly as descriptor avoiding coercion and
thus overflow.
This looks like solution.
|
{ | ||
rpb->rpb_number.setValid(false); | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rpb->rpb_number.setValid(true);
The processing has been moved to the internalGetRecord method. If “RecordBuffer” is empty, it is tried to be filled in the “nextBuffer” method. Filling will be done in portions. A small refactoring.
continue; | ||
} | ||
return true; | ||
} while (1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} while (1); | |
} while (true); |
|
||
bool TableValueFunctionScan::nextBuffer(thread_db* /*tdbb*/) const | ||
{ | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should not TableValueFunctionScan be an abstract class instead?
src/jrd/recsrc/RecordSource.h
Outdated
struct Impure : public RecordSource::Impure | ||
{ | ||
RecordBuffer* m_recordBuffer; | ||
blb* m_blob; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't you mixing specifics of UnlistFunctionScan's impure in the base TableValueFunctionScan?
…ted remarks on abstract class.
Output will be:
Expected output (will be also if we comment line marked as " [ ! ] "):
|
PS. |
This is not surprising considering that the default return type is This is not surprising considering that the default return type is |
IMO, varchar(32) is too small. Why can't we use 8190 ? |
Resulting type should be VARCHAR(32) of the input character set, but it appears it's described as 32 bytes, not characters. To be fixed, AFAIU. As for the length, maybe 32 is a short limit, but this function is to be commonly used with numbers and this limit is enough for that usage case. If you want to deal with strings, better specify the length yourself (nobody knows how many characters gonna fit). |
QA note: deferred waiting for fix of resulting type (expect 32 characters rather than bytes) |
0xFF.
-- ? |
We have no other table-valued functions yet, so your example is syntactically incorrect. It should look something like: , e as (
select * from d, unlist(d.blob_fld, d.separator returning blob character set utf8) as u (x)
) |
Yes, it works in such way. Thanks! |
It seems that ASCII_CHAR(0) can not be used as separator (at least currently):
Checked on Windows. |
ascii_char(0) must not be used anywhere outside of OCTETS. Perhaps this function should explicitly prohibit zero value. But the infinite loop is definitely a bug. |
One more example:
Output will be:
|
This is a separate ISQL problem, not UNLIST one. |
maybe chr(0) must also be prohibited in LIST ():
|
"This function" I meant |
Why to decline the whole function usage ? IMO, it is enough to restrict only some values that it returns, at least chr(0). |
Yes, this is exactly what I wrote: |
|
I found the problem with the infinite loop, I will fix it. |
Perhaps you can answer an other question: is |
It seems that semicolon character ( ' Queries look like this:
Then run trace with config:
(its log also is in attached .zip) For 1st, 4rd and 4th statements trace will show similar parts which executes almost instantly:
But for SECOND statement (when Appropriate lines are marked as [ 1 ] and [ 2 ], their stamps: 2025-04-10T17:26:23.0550 and 2025-04-10T17:26:31.1140:
|
That thing with |
Yes, it's expected to work. |
I have a Q about how string that looks like floating-point number is interpreted in UNLIST.
-- then output will be:
So, minimal value that is distinguish from zero is first of above mentioned.
Output:
Why ? PS. |
It's not about UNLIST at all, you're comparing different cases. See: SQL> select cast('2.225073858507202e-308' as double precision) from rdb$database;
CAST
=======================
Statement failed, SQLSTATE = 22003
arithmetic exception, numeric overflow, or string truncation
-numeric value is out of range
SQL>
|
0xFF. Is this a bug or no ? (i mean: cast('2.225073858507202e-308' as double precision) ==> numeric value is out of range) |
IMO: if the numeric literal fits the double precision range, CAST from string should work. But maybe the underlying conversion is trickier and may cause undesired side effects. |
QA note: deferred because of |
… based on reported QA issues
Please retest with the new snapshot. |
Checked on 6.0.0.744-e883a89: all fine now. New test for UNLIST will be added soon. |
::: QA note ::: |
The function parses the input string using the specified delimiter (comma "," is implied by default) and returns the identified substrings as discrete records containing a single field. Additionally, the desired type of the returned field can be specified. If the specified data type conversion is impossible, an error is raised at runtime.
See also: #8005
)Arguments:
<input>
: any value expression that returns a string/blob of characters (or may be converted to a string), including string literals, table columns, constants, variables, expressions, etc. This parameter is mandatory.<separator>
: optional value expression that returns a string which is used as a delimiter (i.e. it separates one value from another inside the input string). It may also be a BLOB TEXT value, limited to 32KB. If an empty string is specified, the output will be just one record containing the input string. If omitted, the comma character is used as a delimiter.<data type>
: target data type to convert the output values into. Alternatively, a domain can be specified as the returned type. If omitted, VARCHAR(32) is implied. Feel free to suggest any better alternative default.<correlation name>
: alias of the record set returned by the UNLIST function. It is a mandatory parameter (per SQL standard).<derived column name>
: optional alias of the column returned by the UNLIST function. If omitted, UNLIST is used as an alias.