Skip to content

Optimize Vec when cloning from a slice #15471

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 9, 2014
Merged

Conversation

erickt
Copy link
Contributor

@erickt erickt commented Jul 6, 2014

llvm is currently not able to conver Vec::extend into a memcpy for Copy types, which results in methods like Vec::push_all to run twice as slow as it should be running. This patch takes the unsafe Vec::clone optimization to speed up all the operations that are cloning a slice into a Vec.

before:

test vec::tests::bench_clone_from_0000_0000                ... bench:        12 ns/iter (+/- 2)
test vec::tests::bench_clone_from_0000_0010                ... bench:       125 ns/iter (+/- 4) = 80 MB/s
test vec::tests::bench_clone_from_0000_0100                ... bench:       360 ns/iter (+/- 33) = 277 MB/s
test vec::tests::bench_clone_from_0000_1000                ... bench:      2601 ns/iter (+/- 175) = 384 MB/s
test vec::tests::bench_clone_from_0010_0000                ... bench:        12 ns/iter (+/- 2)
test vec::tests::bench_clone_from_0010_0010                ... bench:       125 ns/iter (+/- 10) = 80 MB/s
test vec::tests::bench_clone_from_0010_0100                ... bench:       361 ns/iter (+/- 28) = 277 MB/s
test vec::tests::bench_clone_from_0100_0010                ... bench:       131 ns/iter (+/- 13) = 76 MB/s
test vec::tests::bench_clone_from_0100_0100                ... bench:       360 ns/iter (+/- 9) = 277 MB/s
test vec::tests::bench_clone_from_0100_1000                ... bench:      2575 ns/iter (+/- 168) = 388 MB/s
test vec::tests::bench_clone_from_1000_0100                ... bench:       356 ns/iter (+/- 20) = 280 MB/s
test vec::tests::bench_clone_from_1000_1000                ... bench:      2605 ns/iter (+/- 167) = 383 MB/s
test vec::tests::bench_from_slice_0000                     ... bench:        11 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010                     ... bench:       115 ns/iter (+/- 5) = 86 MB/s
test vec::tests::bench_from_slice_0100                     ... bench:       309 ns/iter (+/- 170) = 323 MB/s
test vec::tests::bench_from_slice_1000                     ... bench:      2065 ns/iter (+/- 198) = 484 MB/s
test vec::tests::bench_push_all_0000_0000                  ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_push_all_0000_0010                  ... bench:        79 ns/iter (+/- 7) = 126 MB/s
test vec::tests::bench_push_all_0000_0100                  ... bench:       342 ns/iter (+/- 18) = 292 MB/s
test vec::tests::bench_push_all_0000_1000                  ... bench:      2873 ns/iter (+/- 75) = 348 MB/s
test vec::tests::bench_push_all_0010_0010                  ... bench:       154 ns/iter (+/- 8) = 64 MB/s
test vec::tests::bench_push_all_0100_0100                  ... bench:       518 ns/iter (+/- 18) = 193 MB/s
test vec::tests::bench_push_all_1000_1000                  ... bench:      4490 ns/iter (+/- 223) = 222 MB/s

after:

test vec::tests::bench_clone_from_0000_0000                ... bench:        12 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0000_0010                ... bench:       123 ns/iter (+/- 5) = 81 MB/s
test vec::tests::bench_clone_from_0000_0100                ... bench:       367 ns/iter (+/- 23) = 272 MB/s
test vec::tests::bench_clone_from_0000_1000                ... bench:      2618 ns/iter (+/- 252) = 381 MB/s
test vec::tests::bench_clone_from_0010_0000                ... bench:        12 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0010_0010                ... bench:       124 ns/iter (+/- 7) = 80 MB/s
test vec::tests::bench_clone_from_0010_0100                ... bench:       369 ns/iter (+/- 34) = 271 MB/s
test vec::tests::bench_clone_from_0100_0010                ... bench:       123 ns/iter (+/- 6) = 81 MB/s
test vec::tests::bench_clone_from_0100_0100                ... bench:       371 ns/iter (+/- 25) = 269 MB/s
test vec::tests::bench_clone_from_0100_1000                ... bench:      2713 ns/iter (+/- 532) = 368 MB/s
test vec::tests::bench_clone_from_1000_0100                ... bench:       369 ns/iter (+/- 14) = 271 MB/s
test vec::tests::bench_clone_from_1000_1000                ... bench:      2611 ns/iter (+/- 194) = 382 MB/s
test vec::tests::bench_from_slice_0000                     ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010                     ... bench:       108 ns/iter (+/- 4) = 92 MB/s
test vec::tests::bench_from_slice_0100                     ... bench:       235 ns/iter (+/- 24) = 425 MB/s
test vec::tests::bench_from_slice_1000                     ... bench:      1318 ns/iter (+/- 96) = 758 MB/s
test vec::tests::bench_push_all_0000_0000                  ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_push_all_0000_0010                  ... bench:        70 ns/iter (+/- 4) = 142 MB/s
test vec::tests::bench_push_all_0000_0100                  ... bench:       176 ns/iter (+/- 16) = 568 MB/s
test vec::tests::bench_push_all_0000_1000                  ... bench:      1125 ns/iter (+/- 94) = 888 MB/s
test vec::tests::bench_push_all_0010_0010                  ... bench:       159 ns/iter (+/- 15) = 62 MB/s
test vec::tests::bench_push_all_0100_0100                  ... bench:       363 ns/iter (+/- 12) = 275 MB/s
test vec::tests::bench_push_all_1000_1000                  ... bench:      2860 ns/iter (+/- 415) = 349 MB/s

This also includes extra benchmarks for Vec and MemWriter.

erickt added 3 commits July 5, 2014 23:07
llvm is currently not able to conver `Vec::extend` into a memcpy
for `Copy` types, which results in methods like `Vec::push_all`
to run twice as slow as it should be running. This patch takes
the unsafe `Vec::clone` optimization to speed up all the operations
that are cloning a slice into a `Vec`.

before:

test vec::tests::bench_clone_from_0000_0000                ... bench:        12 ns/iter (+/- 2)
test vec::tests::bench_clone_from_0000_0010                ... bench:       125 ns/iter (+/- 4) = 80 MB/s
test vec::tests::bench_clone_from_0000_0100                ... bench:       360 ns/iter (+/- 33) = 277 MB/s
test vec::tests::bench_clone_from_0000_1000                ... bench:      2601 ns/iter (+/- 175) = 384 MB/s
test vec::tests::bench_clone_from_0010_0000                ... bench:        12 ns/iter (+/- 2)
test vec::tests::bench_clone_from_0010_0010                ... bench:       125 ns/iter (+/- 10) = 80 MB/s
test vec::tests::bench_clone_from_0010_0100                ... bench:       361 ns/iter (+/- 28) = 277 MB/s
test vec::tests::bench_clone_from_0100_0010                ... bench:       131 ns/iter (+/- 13) = 76 MB/s
test vec::tests::bench_clone_from_0100_0100                ... bench:       360 ns/iter (+/- 9) = 277 MB/s
test vec::tests::bench_clone_from_0100_1000                ... bench:      2575 ns/iter (+/- 168) = 388 MB/s
test vec::tests::bench_clone_from_1000_0100                ... bench:       356 ns/iter (+/- 20) = 280 MB/s
test vec::tests::bench_clone_from_1000_1000                ... bench:      2605 ns/iter (+/- 167) = 383 MB/s
test vec::tests::bench_from_slice_0000                     ... bench:        11 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010                     ... bench:       115 ns/iter (+/- 5) = 86 MB/s
test vec::tests::bench_from_slice_0100                     ... bench:       309 ns/iter (+/- 170) = 323 MB/s
test vec::tests::bench_from_slice_1000                     ... bench:      2065 ns/iter (+/- 198) = 484 MB/s
test vec::tests::bench_push_all_0000_0000                  ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_push_all_0000_0010                  ... bench:        79 ns/iter (+/- 7) = 126 MB/s
test vec::tests::bench_push_all_0000_0100                  ... bench:       342 ns/iter (+/- 18) = 292 MB/s
test vec::tests::bench_push_all_0000_1000                  ... bench:      2873 ns/iter (+/- 75) = 348 MB/s
test vec::tests::bench_push_all_0010_0010                  ... bench:       154 ns/iter (+/- 8) = 64 MB/s
test vec::tests::bench_push_all_0100_0100                  ... bench:       518 ns/iter (+/- 18) = 193 MB/s
test vec::tests::bench_push_all_1000_1000                  ... bench:      4490 ns/iter (+/- 223) = 222 MB/s

after:

test vec::tests::bench_clone_from_0000_0000                ... bench:        12 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0000_0010                ... bench:       123 ns/iter (+/- 5) = 81 MB/s
test vec::tests::bench_clone_from_0000_0100                ... bench:       367 ns/iter (+/- 23) = 272 MB/s
test vec::tests::bench_clone_from_0000_1000                ... bench:      2618 ns/iter (+/- 252) = 381 MB/s
test vec::tests::bench_clone_from_0010_0000                ... bench:        12 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0010_0010                ... bench:       124 ns/iter (+/- 7) = 80 MB/s
test vec::tests::bench_clone_from_0010_0100                ... bench:       369 ns/iter (+/- 34) = 271 MB/s
test vec::tests::bench_clone_from_0100_0010                ... bench:       123 ns/iter (+/- 6) = 81 MB/s
test vec::tests::bench_clone_from_0100_0100                ... bench:       371 ns/iter (+/- 25) = 269 MB/s
test vec::tests::bench_clone_from_0100_1000                ... bench:      2713 ns/iter (+/- 532) = 368 MB/s
test vec::tests::bench_clone_from_1000_0100                ... bench:       369 ns/iter (+/- 14) = 271 MB/s
test vec::tests::bench_clone_from_1000_1000                ... bench:      2611 ns/iter (+/- 194) = 382 MB/s
test vec::tests::bench_from_slice_0000                     ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010                     ... bench:       108 ns/iter (+/- 4) = 92 MB/s
test vec::tests::bench_from_slice_0100                     ... bench:       235 ns/iter (+/- 24) = 425 MB/s
test vec::tests::bench_from_slice_1000                     ... bench:      1318 ns/iter (+/- 96) = 758 MB/s
test vec::tests::bench_push_all_0000_0000                  ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_push_all_0000_0010                  ... bench:        70 ns/iter (+/- 4) = 142 MB/s
test vec::tests::bench_push_all_0000_0100                  ... bench:       176 ns/iter (+/- 16) = 568 MB/s
test vec::tests::bench_push_all_0000_1000                  ... bench:      1125 ns/iter (+/- 94) = 888 MB/s
test vec::tests::bench_push_all_0010_0010                  ... bench:       159 ns/iter (+/- 15) = 62 MB/s
test vec::tests::bench_push_all_0100_0100                  ... bench:       363 ns/iter (+/- 12) = 275 MB/s
test vec::tests::bench_push_all_1000_1000                  ... bench:      2860 ns/iter (+/- 415) = 349 MB/s
dst_len += 1;
dst.set_len(dst_len);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of taking src: &[T], does this optimize as well with src: Iterator<T>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just prototyped this out, and while for most functions it performed the same, for some reason this version of Vec::push_all is half way between the original implementation, and my initial &[T] implementation. Sometimes I just don't understand microbenchmarks. I'll investigate it more tomorrow.

test vec::tests::bench_clone_from_0000_0000                ... bench:         9 ns/iter (+/- 0)
test vec::tests::bench_clone_from_0000_0010                ... bench:       128 ns/iter (+/- 12) = 78 MB/s
test vec::tests::bench_clone_from_0000_0100                ... bench:       379 ns/iter (+/- 14) = 263 MB/s
test vec::tests::bench_clone_from_0000_1000                ... bench:      2624 ns/iter (+/- 187) = 381 MB/s
test vec::tests::bench_clone_from_0010_0000                ... bench:         9 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0010_0010                ... bench:       127 ns/iter (+/- 9) = 78 MB/s
test vec::tests::bench_clone_from_0010_0100                ... bench:       386 ns/iter (+/- 26) = 259 MB/s
test vec::tests::bench_clone_from_0100_0010                ... bench:       129 ns/iter (+/- 6) = 77 MB/s
test vec::tests::bench_clone_from_0100_0100                ... bench:       374 ns/iter (+/- 19) = 267 MB/s
test vec::tests::bench_clone_from_0100_1000                ... bench:      2696 ns/iter (+/- 164) = 370 MB/s
test vec::tests::bench_clone_from_1000_0100                ... bench:       384 ns/iter (+/- 27) = 260 MB/s
test vec::tests::bench_clone_from_1000_1000                ... bench:      2638 ns/iter (+/- 75) = 379 MB/s
test vec::tests::bench_extend_0000_0000                    ... bench:         9 ns/iter (+/- 0)
test vec::tests::bench_extend_0000_0010                    ... bench:       129 ns/iter (+/- 6) = 77 MB/s
test vec::tests::bench_extend_0000_0100                    ... bench:       418 ns/iter (+/- 23) = 239 MB/s
test vec::tests::bench_extend_0000_1000                    ... bench:      3108 ns/iter (+/- 296) = 321 MB/s
test vec::tests::bench_extend_0010_0010                    ... bench:       209 ns/iter (+/- 9) = 47 MB/s
test vec::tests::bench_extend_0100_0100                    ... bench:       616 ns/iter (+/- 69) = 162 MB/s
test vec::tests::bench_extend_1000_1000                    ... bench:      4891 ns/iter (+/- 243) = 204 MB/s
test vec::tests::bench_from_slice_0000                     ... bench:         4 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010                     ... bench:       109 ns/iter (+/- 9) = 91 MB/s
test vec::tests::bench_from_slice_0100                     ... bench:       233 ns/iter (+/- 11) = 429 MB/s
test vec::tests::bench_from_slice_1000                     ... bench:      1331 ns/iter (+/- 53) = 751 MB/s
test vec::tests::bench_push_all_0000_0000                  ... bench:         4 ns/iter (+/- 1)
test vec::tests::bench_push_all_0000_0010                  ... bench:        64 ns/iter (+/- 4) = 156 MB/s
test vec::tests::bench_push_all_0000_0100                  ... bench:       230 ns/iter (+/- 21) = 434 MB/s
test vec::tests::bench_push_all_0000_1000                  ... bench:      1777 ns/iter (+/- 248) = 562 MB/s
test vec::tests::bench_push_all_0010_0010                  ... bench:       140 ns/iter (+/- 3) = 71 MB/s
test vec::tests::bench_push_all_0100_0100                  ... bench:       407 ns/iter (+/- 18) = 245 MB/s
test vec::tests::bench_push_all_1000_1000                  ... bench:      3773 ns/iter (+/- 661) = 265 MB/s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I figured out an internal loop that lets llvm optimize an interator down into my fast version, but unfortunately it's not drop safe:

fn do_clone_hack_external_iter2<'a, Iter: Iterator<&'a u8>>(dst: &mut Vec<u8>, mut src: Iter) {
    let mut dst_len = dst.len();
    for x in src {
        let x = x.clone();
        unsafe {
            ptr::write(
                dst.as_mut_slice().unsafe_mut_ref(dst_len),
                x);
            dst_len += 1;
        }
    }
    unsafe {
        dst.set_len(dst_len);
    }
}

Setting the vec length from inside that hot loop prevents this optimization from happening. I've put my experiments in this gist if anyone else wants to play with it: https://gist.github.com/erickt/7a3fd8689159e69ed74c

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is unfortunate! It's a shame that this optimization can't apply to patterns such as extend or collect, but I suppose we'll just have to wait to teach LLVM a little more about aliasing!

erickt added 3 commits July 7, 2014 09:16
bors added a commit that referenced this pull request Jul 9, 2014
llvm is currently not able to conver `Vec::extend` into a memcpy for `Copy` types, which results in methods like `Vec::push_all` to run twice as slow as it should be running. This patch takes the unsafe `Vec::clone` optimization to speed up all the operations that are cloning a slice into a `Vec`.

before:

```
test vec::tests::bench_clone_from_0000_0000                ... bench:        12 ns/iter (+/- 2)
test vec::tests::bench_clone_from_0000_0010                ... bench:       125 ns/iter (+/- 4) = 80 MB/s
test vec::tests::bench_clone_from_0000_0100                ... bench:       360 ns/iter (+/- 33) = 277 MB/s
test vec::tests::bench_clone_from_0000_1000                ... bench:      2601 ns/iter (+/- 175) = 384 MB/s
test vec::tests::bench_clone_from_0010_0000                ... bench:        12 ns/iter (+/- 2)
test vec::tests::bench_clone_from_0010_0010                ... bench:       125 ns/iter (+/- 10) = 80 MB/s
test vec::tests::bench_clone_from_0010_0100                ... bench:       361 ns/iter (+/- 28) = 277 MB/s
test vec::tests::bench_clone_from_0100_0010                ... bench:       131 ns/iter (+/- 13) = 76 MB/s
test vec::tests::bench_clone_from_0100_0100                ... bench:       360 ns/iter (+/- 9) = 277 MB/s
test vec::tests::bench_clone_from_0100_1000                ... bench:      2575 ns/iter (+/- 168) = 388 MB/s
test vec::tests::bench_clone_from_1000_0100                ... bench:       356 ns/iter (+/- 20) = 280 MB/s
test vec::tests::bench_clone_from_1000_1000                ... bench:      2605 ns/iter (+/- 167) = 383 MB/s
test vec::tests::bench_from_slice_0000                     ... bench:        11 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010                     ... bench:       115 ns/iter (+/- 5) = 86 MB/s
test vec::tests::bench_from_slice_0100                     ... bench:       309 ns/iter (+/- 170) = 323 MB/s
test vec::tests::bench_from_slice_1000                     ... bench:      2065 ns/iter (+/- 198) = 484 MB/s
test vec::tests::bench_push_all_0000_0000                  ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_push_all_0000_0010                  ... bench:        79 ns/iter (+/- 7) = 126 MB/s
test vec::tests::bench_push_all_0000_0100                  ... bench:       342 ns/iter (+/- 18) = 292 MB/s
test vec::tests::bench_push_all_0000_1000                  ... bench:      2873 ns/iter (+/- 75) = 348 MB/s
test vec::tests::bench_push_all_0010_0010                  ... bench:       154 ns/iter (+/- 8) = 64 MB/s
test vec::tests::bench_push_all_0100_0100                  ... bench:       518 ns/iter (+/- 18) = 193 MB/s
test vec::tests::bench_push_all_1000_1000                  ... bench:      4490 ns/iter (+/- 223) = 222 MB/s
```

after:

```
test vec::tests::bench_clone_from_0000_0000                ... bench:        12 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0000_0010                ... bench:       123 ns/iter (+/- 5) = 81 MB/s
test vec::tests::bench_clone_from_0000_0100                ... bench:       367 ns/iter (+/- 23) = 272 MB/s
test vec::tests::bench_clone_from_0000_1000                ... bench:      2618 ns/iter (+/- 252) = 381 MB/s
test vec::tests::bench_clone_from_0010_0000                ... bench:        12 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0010_0010                ... bench:       124 ns/iter (+/- 7) = 80 MB/s
test vec::tests::bench_clone_from_0010_0100                ... bench:       369 ns/iter (+/- 34) = 271 MB/s
test vec::tests::bench_clone_from_0100_0010                ... bench:       123 ns/iter (+/- 6) = 81 MB/s
test vec::tests::bench_clone_from_0100_0100                ... bench:       371 ns/iter (+/- 25) = 269 MB/s
test vec::tests::bench_clone_from_0100_1000                ... bench:      2713 ns/iter (+/- 532) = 368 MB/s
test vec::tests::bench_clone_from_1000_0100                ... bench:       369 ns/iter (+/- 14) = 271 MB/s
test vec::tests::bench_clone_from_1000_1000                ... bench:      2611 ns/iter (+/- 194) = 382 MB/s
test vec::tests::bench_from_slice_0000                     ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010                     ... bench:       108 ns/iter (+/- 4) = 92 MB/s
test vec::tests::bench_from_slice_0100                     ... bench:       235 ns/iter (+/- 24) = 425 MB/s
test vec::tests::bench_from_slice_1000                     ... bench:      1318 ns/iter (+/- 96) = 758 MB/s
test vec::tests::bench_push_all_0000_0000                  ... bench:         7 ns/iter (+/- 0)
test vec::tests::bench_push_all_0000_0010                  ... bench:        70 ns/iter (+/- 4) = 142 MB/s
test vec::tests::bench_push_all_0000_0100                  ... bench:       176 ns/iter (+/- 16) = 568 MB/s
test vec::tests::bench_push_all_0000_1000                  ... bench:      1125 ns/iter (+/- 94) = 888 MB/s
test vec::tests::bench_push_all_0010_0010                  ... bench:       159 ns/iter (+/- 15) = 62 MB/s
test vec::tests::bench_push_all_0100_0100                  ... bench:       363 ns/iter (+/- 12) = 275 MB/s
test vec::tests::bench_push_all_1000_1000                  ... bench:      2860 ns/iter (+/- 415) = 349 MB/s
```

This also includes extra benchmarks for `Vec` and `MemWriter`.
@bors bors closed this Jul 9, 2014
@bors bors merged commit b4984a4 into rust-lang:master Jul 9, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants