Description
ICU4X has a concept of "baked data": a way of "baking" locale data into the source of a program in the form of `const`s. This has significant performance benefits: loading data from the binary is essentially free and involves no deserialization.
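A minimal sketch of the idea, using hypothetical simplified types rather than the actual ICU4X API:

```rust
// Hypothetical simplified data struct; real ICU4X providers are more involved.
pub struct DecimalSymbols<'a> {
    pub decimal_separator: &'a str,
    pub grouping_separator: &'a str,
}

// "Baked" by a code generator: the data lives directly in the binary as a const,
// so "loading" it is just taking a reference, with no deserialization step.
pub const DECIMAL_SYMBOLS_EN: DecimalSymbols<'static> = DecimalSymbols {
    decimal_separator: ".",
    grouping_separator: ",",
};

fn main() {
    let data = &DECIMAL_SYMBOLS_EN;
    assert_eq!(data.decimal_separator, ".");
    assert_eq!(data.grouping_separator, ",");
}
```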
However, we have been facing issues with cases where a single crate contains a lot of data.
I have a minimal testcase here: https://github.com/Manishearth/icu4x_compile_sample. It removes most of the cruft whilst still having an interesting-enough AST in the const data. `cargo build` in the `demo` folder takes 51s and uses almost a gigabyte of RAM. Removing the macro does improve things slightly, but compilation is still quite slow.
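The rough shape of the problem is a macro wrapping very many `const` items whose initializers are nested struct expressions. A hypothetical sketch (not the actual testcase contents) of what such a generated file looks like:

```rust
// Hypothetical generated data file: one macro invocation per data entry,
// each expanding to a const with a struct-expression initializer. The real
// files contain thousands of entries and megabytes of such tokens.
macro_rules! baked_entry {
    ($name:ident, $bytes:expr) => {
        pub const $name: Entry = Entry { payload: $bytes };
    };
}

pub struct Entry {
    pub payload: &'static [u8],
}

baked_entry!(ENTRY_0, b"\0\x01 acre");
baked_entry!(ENTRY_1, b"\0\x01 metre");

fn main() {
    assert_eq!(ENTRY_0.payload, b"\0\x01 acre");
    assert_eq!(ENTRY_1.payload, b"\0\x01 metre");
}
```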
Some interesting snippets of `time-passes`:
```
...
time:   1.194; rss:   52MB ->  595MB ( +543MB)	expand_crate
time:   1.194; rss:   52MB ->  595MB ( +543MB)	macro_expand_crate
...
time:   3.720; rss:  682MB ->  837MB ( +155MB)	type_check_crate
...
time:  55.505; rss:  837MB -> 1058MB ( +221MB)	MIR_borrow_checking
...
time:   0.124; rss: 1080MB ->  624MB ( -456MB)	free_global_ctxt
```
Full `time-passes`:

```
time:   0.001; rss:   47MB ->   49MB (   +1MB)	parse_crate
time:   0.001; rss:   50MB ->   50MB (   +0MB)	incr_comp_prepare_session_directory
time:   0.000; rss:   50MB ->   51MB (   +1MB)	setup_global_ctxt
time:   0.000; rss:   52MB ->   52MB (   +0MB)	crate_injection
time:   1.194; rss:   52MB ->  595MB ( +543MB)	expand_crate
time:   1.194; rss:   52MB ->  595MB ( +543MB)	macro_expand_crate
time:   0.013; rss:  595MB ->  595MB (   +0MB)	AST_validation
time:   0.008; rss:  595MB ->  597MB (   +1MB)	finalize_macro_resolutions
time:   0.285; rss:  597MB ->  642MB (  +45MB)	late_resolve_crate
time:   0.012; rss:  642MB ->  642MB (   +0MB)	resolve_check_unused
time:   0.020; rss:  642MB ->  642MB (   +0MB)	resolve_postprocess
time:   0.326; rss:  595MB ->  642MB (  +46MB)	resolve_crate
time:   0.011; rss:  610MB ->  610MB (   +0MB)	write_dep_info
time:   0.011; rss:  610MB ->  611MB (   +0MB)	complete_gated_feature_checking
time:   0.058; rss:  765MB ->  729MB (  -35MB)	drop_ast
time:   1.213; rss:  610MB ->  681MB (  +71MB)	looking_for_derive_registrar
time:   1.421; rss:  610MB ->  682MB (  +72MB)	misc_checking_1
time:   0.086; rss:  682MB ->  690MB (   +8MB)	coherence_checking
time:   3.720; rss:  682MB ->  837MB ( +155MB)	type_check_crate
time:   0.000; rss:  837MB ->  837MB (   +0MB)	MIR_coroutine_by_move_body
time:  55.505; rss:  837MB -> 1058MB ( +221MB)	MIR_borrow_checking
time:   1.571; rss: 1058MB -> 1068MB (  +10MB)	MIR_effect_checking
time:   0.217; rss: 1068MB -> 1067MB (   -1MB)	module_lints
time:   0.217; rss: 1068MB -> 1067MB (   -1MB)	lint_checking
time:   0.311; rss: 1067MB -> 1068MB (   +0MB)	privacy_checking_modules
time:   0.607; rss: 1068MB -> 1068MB (   +0MB)	misc_checking_3
time:   0.000; rss: 1136MB -> 1137MB (   +1MB)	monomorphization_collector_graph_walk
time:   0.778; rss: 1068MB -> 1064MB (   -4MB)	generate_crate_metadata
time:   0.005; rss: 1064MB -> 1085MB (  +22MB)	codegen_to_LLVM_IR
time:   0.007; rss: 1076MB -> 1085MB (  +10MB)	LLVM_passes
time:   0.014; rss: 1064MB -> 1085MB (  +22MB)	codegen_crate
time:   0.257; rss: 1084MB -> 1080MB (   -4MB)	encode_query_results
time:   0.270; rss: 1084MB -> 1080MB (   -4MB)	incr_comp_serialize_result_cache
time:   0.270; rss: 1084MB -> 1080MB (   -4MB)	incr_comp_persist_result_cache
time:   0.271; rss: 1084MB -> 1080MB (   -4MB)	serialize_dep_graph
time:   0.124; rss: 1080MB ->  624MB ( -456MB)	free_global_ctxt
time:   0.000; rss:  624MB ->  624MB (   +0MB)	finish_ongoing_codegen
time:   0.127; rss:  624MB ->  653MB (  +29MB)	link_rlib
time:   0.135; rss:  624MB ->  653MB (  +29MB)	link_binary
time:   0.138; rss:  624MB ->  618MB (   -6MB)	link_crate
time:   0.139; rss:  624MB ->  618MB (   -6MB)	link
time:  65.803; rss:   32MB ->  187MB ( +155MB)	total
```
Even without the intermediate macro, `expand_crate` still increases RAM significantly, though the increase is halved:

```
time:   0.715; rss:   52MB ->  254MB ( +201MB)	expand_crate
time:   0.715; rss:   52MB ->  254MB ( +201MB)	macro_expand_crate
```
I understand that to some extent we are simply feeding Rust a file that is megabytes in size, and we cannot expect that to be fast. Still, it's interesting that MIR borrow checking is slowed down so much by this: there is relatively little to actually borrow check, so I suspect MIR construction is happening under this pass as well. The RAM usage approaching a gigabyte is also concerning: the problematic source file is 7MB, yet compiling it takes roughly a gigabyte of RAM. Paired with the fact that we have many such data files per crate (some of which are large), we end up hitting CI limits.
With the actual problem we were facing (unicode-org/icu4x#5230 (comment)), our time-passes numbers were:
```
...
time:   1.013; rss:   51MB -> 1182MB (+1130MB)	expand_crate
time:   1.013; rss:   51MB -> 1182MB (+1131MB)	macro_expand_crate
...
time:   6.609; rss: 1308MB -> 1437MB ( +128MB)	type_check_crate
time:  36.802; rss: 1437MB -> 2248MB ( +811MB)	MIR_borrow_checking
time:   2.214; rss: 2248MB -> 2270MB (  +22MB)	MIR_effect_checking
...
```
I'm hoping there is at least some low-hanging fruit that can be improved here, or advice on how to avoid this problem. So far we've managed to stay within CI limits by reducing the number of tokens, converting things like

```rust
icu::experimental::dimension::provider::units::UnitsDisplayNameV1 { patterns: icu::experimental::relativetime::provider::PluralPatterns { strings: icu::plurals::provider::PluralElementsPackedCow { elements: alloc::borrow::Cow::Borrowed(unsafe { icu::plurals::provider::PluralElementsPackedULE::from_byte_slice_unchecked(b"\0\x01 acre") }) }, _phantom: core::marker::PhantomData } },
```

into

```rust
icu::experimental::dimension::provider::units::UnitsDisplayNameV1::new_baked(b"\0\x01 acre")
```

This works to some extent, but the problems remain in the same order of magnitude and can recur as we add more data.
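The token-reduction trick above can be sketched as follows, with hypothetical simplified types standing in for the real ICU4X providers: a small `const fn` constructor rebuilds the nested struct expression from a single byte slice, so each data entry is one short call instead of a deeply nested literal.

```rust
// Hypothetical simplified stand-ins for the ICU4X provider structs.
pub struct PluralElementsPackedCow<'a> {
    pub elements: &'a [u8],
}

pub struct UnitsDisplayNameV1<'a> {
    pub patterns: PluralElementsPackedCow<'a>,
}

impl<'a> UnitsDisplayNameV1<'a> {
    // The nesting lives once, here, instead of being repeated at every
    // one of the thousands of baked data entries.
    pub const fn new_baked(bytes: &'a [u8]) -> Self {
        Self {
            patterns: PluralElementsPackedCow { elements: bytes },
        }
    }
}

// One short call per entry: far fewer tokens for the compiler to expand.
pub const ACRE: UnitsDisplayNameV1<'static> = UnitsDisplayNameV1::new_baked(b"\0\x01 acre");

fn main() {
    assert_eq!(ACRE.patterns.elements, b"\0\x01 acre");
}
```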