Description
I would like to add functionality to pandas so that DataFrames can be queried like database tables, much as they can be in Spark SQL, and I think it should work in a similar fashion.
A table would be registered using register_temp_table(dataframe, table_name).
Then, using pandas.query("select * from table_name"),
you could query that DataFrame, or any other registered one, with standard SQL syntax.
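As a rough sketch of the intended usage (the names and signatures here just follow the proposal above and are of course up for discussion):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
pd.register_temp_table(df, "my_table")                      # proposed top-level registration
result = pd.query("select a, b from my_table where a > 1")  # query the registered table with SQL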
I've already implemented the entire thing, but I was told to open an issue for it.
Also, I'm aware that there is a package called pandasql, but that package just loads the DataFrame into a SQLite database, as opposed to what I'm proposing here: querying the DataFrame directly by translating the SQL into pandas operations that are then applied to it.
Motivation:
The motivation for this enhancement is to make pandas more accessible to a crowd of users that may not be as technical, and also to ease the transition of legacy code in systems like SAS that already have SQL embedded in their programs. I'll supply a context-free grammar in my documentation to show exactly what this system can handle, but it can basically handle any traditional SQL select statement, including subqueries, joins, where clauses, group by clauses, any aggregate function already supported by pandas, limit, and order by clauses. It also has support for rank and dense_rank window functions. It can't do things that SQL wouldn't normally do, like crosstab, and you can't use a user-defined function in it, although I think that could be a good add-on.
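To give a concrete feel for the kind of statement this covers, here is an illustrative query (the table and column names are made up, and the grammar in the documentation will pin down the exact forms the parser accepts):

pd.query("""
select store, sum(sales) as total_sales,
       rank() over (order by sum(sales) desc) as sales_rank
from sales_table
where region = 'west'
group by store
order by total_sales desc
limit 10
""")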
Datatypes:
The interface supports all pandas datatypes, so casts use the pandas type names directly; the syntax would currently be cast(some_number as int64) or cast(some_int as object). I've played around with the idea of varchar, char, bigint and smallint, but I think those would be misleading, since they aren't datatypes that pandas currently supports.
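As an illustration (again assuming a table has been registered under the name some_table):

pd.query("select cast(some_number as int64) as n, cast(some_int as object) as s from some_table")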
Errors:
Currently, the exceptions thrown by this API come solely from trying to select from an unregistered table or from submitting an improperly written SQL query, neither of which you would want to silence, so there's really only one error mode.
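For example (the exact exception types aren't settled yet, so treat these as placeholders):

pd.query("select * from never_registered")   # raises because the table was never registered
pd.query("selct * from my_table")            # raises because the sql does not parse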
API Choices:
The reason I made the register_temp_table part of the API top level was to avoid attaching a method to DataFrame, although if others think it would be better as a method, I would change it to that form (DataFrame.register_temp_table(table_name)). The reason pandas.query is a top-level method is that it's relational in structure: you can select from multiple tables and join them, so it wouldn't make sense for it to live on a single DataFrame. Its only similarity to the DataFrame.query method is the name. DataFrame.query is just an alternate way of expressing things like DataFrame[some_condition], whereas my .query encompasses a large portion of the pandas API.
Built In:
I have two reasons why I think this would be better built in. The first is that the target audience is less technical pandas users. Part of making this API easier to use is lessening the burden of researching code and learning how Python works, so expecting them to go looking for an external package may be hard to begin with, and they would also need to know to look for one in the first place.
My second reason is that, from using what I've built, I've found pandas a lot easier to use just as a developer.
Suppose we have a DataFrame, registered as some_table, with a single column called a. This feature lets you go from this code:
dataframe['name_1'] = dataframe['a'] - 1
dataframe['name_2'] = dataframe['a'] + 1
dataframe = dataframe[dataframe['name_1'] == dataframe['name_2']]
dataframe.drop(columns=['a'], inplace=True)
To this code:
pd.query("select a - 1 as name_1, a + 1 as name_2 from some_table where name_1 = name_2")
Also, although I did implement register_temp_table as a top-level function, it might serve best as a method on DataFrame, so that's another thing to consider.
I can't really provide much justification for the Lark part, other than that it seemed like the best tool for what I was making.
I apologize for the style and such; I'll be fixing all that before I'm done. I implemented this outside of pandas first, which is why there are so many style and documentation discrepancies.