generate_statistics_from_csv is very slow for a large dataset on a single server #98

Open
@yajunwong

Hi. Following the TFX examples, I pass pipeline_options to generate_statistics_from_csv with --direct_num_workers=16, like this:

pipeline_options = PipelineOptions(['--direct_num_workers=16'])

It seems that this option does not speed up the API: with direct_num_workers=1 the run takes about the same time as with 16 workers:

# direct_num_workers=1
python prep.py  99.27s user 5.84s system 99% cpu 1:45.67 total

# direct_num_workers=16
python prep.py  101.92s user 5.22s system 98% cpu 1:48.44 total
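For anyone trying to reproduce this, here is a minimal sketch of what prep.py might look like. The script itself is not shown in this issue, so the helper names `build_direct_runner_flags` and `run_stats` and the `csv_path` argument are hypothetical. The optional `--direct_running_mode` flag is a separate Beam DirectRunner option that the runs above did not use; it is left off by default here, and whether it changes the timings is not something this issue establishes.

```python
from typing import List, Optional


def build_direct_runner_flags(num_workers: int,
                              running_mode: Optional[str] = None) -> List[str]:
    """Build the Beam DirectRunner flags used in this report.

    running_mode is optional and was NOT part of the original runs; Beam
    accepts 'in_memory', 'multi_threading' or 'multi_processing' for it.
    """
    flags = [f"--direct_num_workers={num_workers}"]
    if running_mode is not None:
        flags.append(f"--direct_running_mode={running_mode}")
    return flags


def run_stats(csv_path: str, num_workers: int,
              running_mode: Optional[str] = None):
    """Hypothetical stand-in for prep.py: compute statistics for one CSV."""
    # Imported lazily so the flag helper above works without TFDV installed.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    pipeline_options = PipelineOptions(
        build_direct_runner_flags(num_workers, running_mode))
    return tfdv.generate_statistics_from_csv(
        data_location=csv_path,
        pipeline_options=pipeline_options)
```

Calling `run_stats("data.csv", 16)` (with a placeholder path) corresponds to the direct_num_workers=16 run above, and `run_stats("data.csv", 1)` to the single-worker run.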

Could someone help me?
