Description
I am using TFDV 1.2.0 and workers consistently OOM on Dataflow, even with very large instance types (e.g. n2-highmem-16 and n2-highmem-32). I've tried decreasing --number_of_worker_harness_threads to half the available number of CPUs in the hope of reducing memory usage, to no avail. The OOMs typically happen at the very end of the job, when it has finished loading the data but is still combining the stats.
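For reference, the worker settings described above look roughly like the sketch below. The helper name `dataflow_worker_args` is hypothetical and only illustrates how the thread count is derived; the resulting flags are what would be passed to the Beam pipeline options, and the rest of the job setup is omitted.

```python
from typing import List


def dataflow_worker_args(machine_type: str, vcpus: int) -> List[str]:
    """Build Dataflow worker flags with harness threads set to half the vCPUs.

    Hypothetical helper for illustration; only the flag names come from Beam.
    """
    threads = max(1, vcpus // 2)  # never go below one thread
    return [
        "--runner=DataflowRunner",
        f"--machine_type={machine_type}",
        f"--number_of_worker_harness_threads={threads}",
    ]


# n2-highmem-16 has 16 vCPUs, so this requests 8 harness threads.
args = dataflow_worker_args("n2-highmem-16", vcpus=16)
print(args)
```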
The dataset itself is quite large: 4000-5000 features across ~1e9 rows, and some of the variable-length features are quite long. Do you have any tips for how to decrease memory usage on workers? I suspect that the issue may come from some high-cardinality (~1e8 unique values) string features we have, as computing the frequency counts for these features is probably very memory-intensive.
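To make the suspicion about high-cardinality features a bit more concrete, here is a rough back-of-the-envelope sketch. It is not how TFDV stores its counts internally: it only gives a naive lower bound for holding one (value, count) pair per unique string. The average value length of 32 bytes is an assumption for illustration, not a measured number, and both function names are made up.

```python
def naive_count_table_bytes(unique_values: int,
                            avg_value_bytes: int,
                            count_bytes: int = 8) -> int:
    """Lower bound on the memory needed for one (value, count) pair per unique value.

    Ignores all container and object overhead, so real usage would be higher.
    """
    return unique_values * (avg_value_bytes + count_bytes)


def memory_per_thread_gib(worker_memory_gib: float, threads: int) -> float:
    """Memory share per harness thread if it were split evenly across threads."""
    return worker_memory_gib / threads


# Assumption: 32-byte average string length (illustrative only).
per_feature = naive_count_table_bytes(unique_values=10**8, avg_value_bytes=32)
print(f"naive bytes per high-cardinality feature: {per_feature:,}")

# n2-highmem-16 has 128 GB of memory; 8 threads is half of its 16 vCPUs.
print(f"GiB per thread: {memory_per_thread_gib(128, 8)}")
```

Even under these optimistic assumptions, a single such feature takes a sizeable share of one thread's memory, and we have several of them.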