`generate_statistics_from_csv` is very slow for large datasets on a single server #98

Hi. Following the TFX examples, I pass `pipeline_options` to `generate_statistics_from_csv` with `--direct_num_workers=16` set, like this:

```python
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers=16'])
```

It seems that this option does not speed up the call. With `direct_num_workers=1`, the run takes about as long as with 16 workers, and CPU usage stays at roughly one core in both cases:

```bash
# direct_num_workers=1
python prep.py  99.27s user 5.84s system 99% cpu 1:45.67 total

# direct_num_workers=16
python prep.py  101.92s user 5.22s system 98% cpu 1:48.44 total
```
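For reference, here is a minimal sketch of how the two runs could be reproduced from one script. It is not the actual contents of `prep.py`: the helper names `beam_args` and `time_stats` and the path `data.csv` are placeholders, and it assumes `tensorflow_data_validation` and `apache_beam` are installed.

```python
import time


def beam_args(num_workers):
    """Build the Beam flag list passed to PipelineOptions (hypothetical helper)."""
    return [f"--direct_num_workers={num_workers}"]


def time_stats(csv_path, num_workers):
    """Run generate_statistics_from_csv once and return the elapsed seconds."""
    # Imported lazily so beam_args can be used without TFDV installed.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(beam_args(num_workers))
    start = time.perf_counter()
    tfdv.generate_statistics_from_csv(
        data_location=csv_path,
        pipeline_options=options,
    )
    return time.perf_counter() - start


# Example usage (needs TFDV and a real CSV file; 'data.csv' is a placeholder):
# for n in (1, 16):
#     print(f"direct_num_workers={n}: {time_stats('data.csv', n):.1f}s")
```

Timing inside the script measures only the statistics call and leaves out interpreter startup, so the two worker settings can be compared more directly than with the shell `time` output above.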

Could someone help me?
