-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
REF: dataframe formatters/outputs #36510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Move all methods to DataFrameFormatter, inherit relevant classes from DataFrameFormatter.
Replace it with get_result method, which is going to become abstract method for the parent class.
New module suggested: pandas/io/formats/string.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few comments, looks good. nice reorg and cleanup. it has been a long time / never since this was done; and we have lots of built up code here.
pandas/io/formats/csvs.py
Outdated
self.decimal = decimal | ||
self.header = header | ||
self.index = index | ||
self.na_rep = self.fmt.na_rep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if these are just readable now (which i think they are in this class as they are now set in DataFrameRenderer), then maybe make these properties? (or is my assumption wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to properties.
pandas/io/formats/csvs.py
Outdated
# Incompatible types in assignment | ||
# (expression has type "Index", | ||
# variable has type "Optional[Sequence[Optional[Hashable]]]") [assignment] | ||
cols = self.obj.columns # type: ignore[assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are getting this because you are overwriting cols. don't do that.
pandas/io/formats/format.py
Outdated
|
||
return None | ||
|
||
def _get_result( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these can probably be non-private. This entire class is private, so that if these are called by other modules (from pandas) then we want to make the public. If something is entirely private to this class then this is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out that _get_result
and _get_buffer
should not belong to the class: they do not use the object state. So I decided to move the methods to the module level.
else: | ||
raise TypeError("buf is not a file name and it has no write method") | ||
|
||
|
||
# ---------------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would not be averse to splitting this file up if ppossible (e.g. maybe make array.py) for all of the formatters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
certainly better as a followon
@@ -1198,6 +899,209 @@ def _get_column_name_list(self) -> List[str]: | |||
return names | |||
|
|||
|
|||
class DataFrameRenderer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string here and explain that these to_* are being called in core/frame/to_* ultimately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
lwidth = self.line_width | ||
adjoin_width = 1 | ||
if self.fmt.index: | ||
idx = strcols.pop(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT the original _join_multiline didn't mutate the input.
often better to type inputs as Iterable instead of List for non-mutating functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you are right! It did not mutate strcols, but now it does. I will fix it.
What I don't like about Iterable type is the following:
- String is Iterable, so it can be confused with List/Tuple.
- Cannot find len(iterable) - mypy throws
Argument 1 to "len" has incompatible type "Iterable[str]"; expected "Sized" [arg-type]
So, now I changed the type of input parameter in _join_multiline
to Iterable[List[str]]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- String is Iterable, so it can be confused with List/Tuple.
yeah that can be a problem, but we normally add type parameters, so only Iterable[str] becomes an issue
- Cannot find len(iterable) - mypy throws
can always use Sequence if needed as Sequence is also not mutable.
There are some guidelines for stubs at https://github.com/python/typeshed/blob/master/CONTRIBUTING.md#stub-file-coding-style
avoid invariant collection types (List, Dict) in argument positions, in favor of covariant types like Mapping or Sequence
at some point, the guidelines there relevant to the codebase and not just stubs should probably be added to our code style docs #33851
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. @jbrockmendel @simonjayhawkins if any comments.
don't wait for me one this. prioritising reviewing 1.1.x issues and PRs |
Internal error in Windows py38_np18 check.
```
____________ test_filepath_or_buffer_arg[pathlike-foo-abc-to_latex] ____________
[gw0] linux -- Python 3.7.8 /home/travis/miniconda3/envs/pandas-dev/bin/python method = 'to_latex' filepath_or_buffer = PosixPath('/tmp/pytest-of-travis/pytest-0/popen-gw0/test_filepath_or_buffer_arg_pa11/foo') assert_filepath_or_buffer_equals = <function assert_filepath_or_buffer_equals.._assert_filepath_or_buffer_equals at 0x7f368ef02170> encoding = 'foo', data = 'abc', filepath_or_buffer_id = 'pathlike'
pandas/tests/io/formats/test_format.py:3389: self = <contextlib._GeneratorContextManager object at 0x7f368ef0fcd0> type = None, value = None, traceback = None
E AssertionError: Caused unexpected warning(s): [('ResourceWarning', ResourceWarning("unclosed <ssl.SSLSocket fd=13, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.20.0.43', 49154), raddr=('52.217.93.220', 443)>"), '/home/travis/build/pandas-dev/pandas/pandas/core/generic.py', 5404)]
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ivanovmg generally lgtm.
The removal of the setters in csvs is a great improvement, for readability and static typing.
I'm not sure about removing the TableFormatter class though. This provided a common base for LatexFormatter, HTMLFormatter and what is now StringFormatter, for code that does not belong in DataFrameFormatter.
I suggest that future refactors are more atomic. I see several themes here that probably should have been discussed/reviewed in isolation. The primary one being the adoption of DataFrameFormatter for use in csv formatting. I'm also not sure that DataFrameRenderer class is even necessary.
pandas/io/formats/csvs.py
Outdated
@@ -29,19 +29,16 @@ | |||
from pandas.core.indexes.api import Index | |||
|
|||
from pandas.io.common import get_filepath_or_buffer, get_handle | |||
from pandas.io.formats.format import DataFrameFormatter, FloatFormatType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the imports are just for type checking, would prefer in a if TYPE_CHECKING:
block.
Saying that, FloatFormatType
is an alias so may want to move to pandas._typing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented as you suggested.
General question: is it OK to have imports from pandas._typing
on top, not inside if TYPE_CHECKING
block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. that's how we do it elsewhere. all imports in pandas._typing are inside a TYPE_CHECKING block so should be safe.
I've restarted the jobs. probably unrelated. |
The main target here was to extract StringFormatter from DataFrameFormatter. Regarding My concern with TableFormatter was that some properties (attributes) are not related to some ancestors. For example, Suggestion on the atomic PRs is very well taken :) But it was really difficult to make it short when extracting StringFormatter functionality. Sorry for that. |
agreed this is an important step for further refactoring. nice work.
yep. follow-ons ok.
It was a bit difficult to get one's head around the Inheritance and Composition used together, and there was not a clear separation. I think that I anticipated, that over time the shared functionality in DataFrameFormatter would move to TableFormatter, DataFrameFormatter would evolve to be what is now StringFormatter and TableFormatter would evolve to be what is now DataFrameFormatter. So the latex, html and string formatters would have all inherited from a base class. But as you have noted, composition is also an option. |
In fact my first guess was to use inheritance. But then I realized that the ancestors will have only something in common with the parent TableFormatter. Thus, I figured out that the composition would be a better option. |
There are now failures related to parquet and datetime. Looks like these are persistent across multiple PRs today. |
sgtm |
thanks @ivanovmg |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Partially addresses #36407
This is a continuation of #36434.
StringFormatter
(or betterConsoleFormatter
?) fromDataFrameFormatter
. Placed it into the new module (subject to discussion).DataFrameFormatter
inHTMLFormatter
,LatexFormatter
,CSVFormatter
andStringFormatter
. It turned out that the inheritance of each of the formatters from the baseDataFrameFormatter
was too complicated to comprehend. The composition seems to suit here better.DataFrameRenderer
to keep methods for outputs in each of the formats.This is not the ultimate refactoring, but just one more step.