Enum field align cause performance degradation about 10x #119247

Open
@PureWhiteWu

Hi, I'm the author of the FastStr crate, and I recently found a weird problem: cloning a FastStr is really expensive. For example, cloning an empty FastStr costs about 40ns on amd64, compared to about 4ns for a normal String.

FastStr itself is a newtype around the inner Repr, which previously had the following layout:

#[derive(Clone)]
pub struct FastStr(Repr);

#[cfg(all(test, target_pointer_width = "64"))]
mod size_asserts {
    static_assertions::assert_eq_size!(super::FastStr, [u8; 40]); // 40 bytes
}

const INLINE_CAP: usize = 38;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

Playground link for old version

After some investigation, I found that the Repr::Inline variant has a very large effect on performance. After I added padding to the Repr::Inline variant (by changing the type of len from u8 to usize), cloning a Repr::Empty (and every other variant) became about 9x faster, going from about 40ns to about 4ns. But the root cause is still not clear:

const INLINE_CAP: usize = 24; // This is because I don't want to enlarge the size of FastStr

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: usize, buf: [u8; INLINE_CAP] },
}

Playground link for new version
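To make the difference between the two Inline payloads concrete, here is a minimal sketch that prints their size and alignment in isolation. The struct names InlineOld and InlineNew are hypothetical and exist only for this illustration. It shows the payload alignment going from 1 to that of usize; it does not show where rustc places the payload inside the enum, and it is not meant as a statement about the root cause.

```rust
use std::mem::{align_of, size_of};

const OLD_INLINE_CAP: usize = 38;
const NEW_INLINE_CAP: usize = 24;

// Hypothetical standalone copy of the old `Inline` payload (illustration only).
#[allow(dead_code)]
struct InlineOld {
    len: u8,
    buf: [u8; OLD_INLINE_CAP],
}

// Hypothetical standalone copy of the new `Inline` payload (illustration only).
#[allow(dead_code)]
struct InlineNew {
    len: usize,
    buf: [u8; NEW_INLINE_CAP],
}

fn main() {
    // Only byte-sized fields: the old payload needs no alignment beyond 1.
    println!(
        "old payload: size = {}, align = {}",
        size_of::<InlineOld>(),
        align_of::<InlineOld>()
    );
    // The `usize` field raises the alignment of the whole payload.
    println!(
        "new payload: size = {}, align = {}",
        size_of::<InlineNew>(),
        align_of::<InlineNew>()
    );
}
```

The full Repr enum still has to store a discriminant next to these payloads, and that part of the layout is decided by the compiler, so the numbers printed here describe only the variants themselves.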

A simple criterion benchmark for the old version:

use bytes::Bytes;
use std::sync::Arc;
use criterion::{black_box, criterion_group, criterion_main, Criterion};

const INLINE_CAP: usize = 38;

#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

fn criterion_benchmark(c: &mut Criterion) {
    let s = Repr::Empty;
    c.bench_function("empty repr", |b| b.iter(|| black_box(s.clone())));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

For a full benchmark, you may refer to: https://github.com/volo-rs/faststr/blob/main/benches/faststr.rs
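For a quick check without criterion or the bytes crate, a rough std-only timing loop could look like the sketch below. FakeBytes is a hypothetical four-word stand-in for bytes::Bytes (an assumption about its size, not its real definition), and time_clone is a helper made up for this example. The numbers it prints are much noisier than criterion's and should only be read as a sanity check, ideally from a release build.

```rust
use std::hint::black_box;
use std::sync::Arc;
use std::time::Instant;

const INLINE_CAP: usize = 38;

// Hypothetical stand-in for `bytes::Bytes`: four pointer-sized words.
// This is an assumption made to keep the example dependency-free.
#[allow(dead_code)]
#[derive(Clone)]
struct FakeBytes {
    ptr: *const u8,
    len: usize,
    data: *mut (),
    vtable: &'static (),
}

#[allow(dead_code)]
#[derive(Clone)]
enum Repr {
    Empty,
    Bytes(FakeBytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

// Hypothetical helper: average nanoseconds per clone over `iters` iterations.
fn time_clone<T: Clone>(value: &T, iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box on both the input and the result keeps the optimizer
        // from hoisting or removing the clone.
        black_box(black_box(value).clone());
    }
    start.elapsed().as_secs_f64() * 1e9 / f64::from(iters)
}

fn main() {
    let s = Repr::Empty;
    time_clone(&s, 100_000); // warm-up run, result discarded
    let ns = time_clone(&s, 1_000_000);
    println!("empty repr clone: {ns:.2} ns/iter");
}
```

Changing the Inline variant to `Inline { len: usize, buf: [u8; 24] }` and rerunning gives the second data point for the comparison.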

Related PR: volo-rs/faststr#6
And commit: volo-rs/faststr@342bdc9

Furthermore, I've tried the following changes, but none of them helps:

  1. Only change INLINE_CAP to 24.
  2. Change INLINE_CAP to 22 and add padding to the Inline variant: Inline { _pad: u64, len: u8, buf: [u8; INLINE_CAP] }.
  3. Change INLINE_CAP to 22 and add a new struct Inline without the _pad field.

INLINE_CAP is lowered to 22 only so that the extra padding does not increase the size of FastStr itself, so the capacity itself should have nothing to do with the performance.
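For reference, the three attempts above can be written out as types and their layouts printed, as in the sketch below. The names Attempt1, Attempt2, Attempt3, FakeBytes and layout_of are hypothetical, FakeBytes is an assumed four-word stand-in for bytes::Bytes, and the shape of attempt 3 (a tuple variant wrapping a separate Inline struct) is only one reading of the description. The sketch prints sizes and alignments as the compiler chooses them and makes no claim about which layout is fast.

```rust
use std::mem::{align_of, size_of};
use std::sync::Arc;

// Hypothetical stand-in for `bytes::Bytes` (assumed to be four words).
#[allow(dead_code)]
#[derive(Clone)]
struct FakeBytes {
    ptr: *const u8,
    len: usize,
    data: *mut (),
    vtable: &'static (),
}

// Attempt 1: only INLINE_CAP changes to 24, `len` stays a u8.
#[allow(dead_code)]
#[derive(Clone)]
enum Attempt1 {
    Empty,
    Bytes(FakeBytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; 24] },
}

// Attempt 2: INLINE_CAP is 22 and an explicit u64 padding field is added.
#[allow(dead_code)]
#[derive(Clone)]
enum Attempt2 {
    Empty,
    Bytes(FakeBytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { _pad: u64, len: u8, buf: [u8; 22] },
}

// Attempt 3 (one possible reading): a separate struct without `_pad`.
#[allow(dead_code)]
#[derive(Clone)]
struct Inline {
    len: u8,
    buf: [u8; 22],
}

#[allow(dead_code)]
#[derive(Clone)]
enum Attempt3 {
    Empty,
    Bytes(FakeBytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline(Inline),
}

// Hypothetical helper returning (size, alignment) of a type in bytes.
fn layout_of<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}

fn main() {
    println!("attempt 1 (size, align): {:?}", layout_of::<Attempt1>());
    println!("attempt 2 (size, align): {:?}", layout_of::<Attempt2>());
    println!("attempt 3 (size, align): {:?}", layout_of::<Attempt3>());
}
```

Since the default Rust representation leaves enum layout to the compiler, the printed values may differ between compiler versions and targets.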

Edit: related discussions on users.rust-lang.org and reddit.

Labels: A-layout (Area: Memory layout of types), C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such), I-slow (Issue: Problems and improvements with respect to performance of generated code), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)