Skip to content

[bug#12506] Type-test-override IndexedSeqOps#foreach #9820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

NthPortal
Copy link
Contributor

Effectively override IndexedSeqOps#foreach to loop
over indices instead of using an Iterator by adding a
type test to IterableOnceOps#foreach.

Fixes scala/bug#12506

Effectively override `IndexedSeqOps#foreach` to loop
over indices instead of using an `Iterator` by adding a
type test to `IterableOnceOps#foreach`.
@NthPortal NthPortal added performance the need for speed. usually compiler performance, sometimes runtime performance. library:collections PRs involving changes to the standard collection library labels Dec 5, 2021
@NthPortal NthPortal requested a review from a team December 5, 2021 01:28
@scala-jenkins scala-jenkins added this to the 2.13.8 milestone Dec 5, 2021
Copy link
Contributor

@julienrf julienrf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you April!

Would you have some time to perform some execution time measurements?

@NthPortal
Copy link
Contributor Author

I'll try to get to benchmarking. unfortunately, my server is hosting increasingly many things, making it increasingly unsuited to benchmarking, but hey—it's probably still fine

@lrytz
Copy link
Member

lrytz commented Dec 6, 2021

I think we have other such "overrides" in place where a real override would be binary incompatible. At least in 2.12 we accumulated a bunch of them over time.

@som-snytt
Copy link
Contributor

som-snytt commented Dec 7, 2021

The linked commit from strawman overrides all the similar methods such as forall. Edit: probably too much trouble.

I think testing is important because it's easy to accidentally deoptimize on the jvm.

}
case _ =>
val it = iterator
while(it.hasNext) f(it.next())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the inconsistent whitespace was inherited. Dr Odersky just told someone "no spaceless if(" [indirect quote] on Scala 3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix that, and also make the "override" its own method that gets called, for clarity

@julienrf
Copy link
Contributor

I just ran the following benchmark on my machine:

package scala.collection

import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

import scala.collection.immutable.ArraySeq

@BenchmarkMode(Array(Mode.AverageTime))
@Fork(1)
@Threads(1)
@Warmup(iterations = 5)
@Measurement(iterations = 8)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
class ForEachBenchmark {

  @Param(Array("0", "1", "10", "1000", "10000", "100000"))
  var size: Int = _
  var array: Array[String] = _
  var vector: Vector[String] = _
  var arraySeq: ArraySeq[String] = _
  var arrayBuffer: mutable.ArrayBuffer[String] = _

  @Setup(Level.Trial) def initNumbers: Unit = {
    array = Array.fill(size)("foo")
    vector = Vector.fill(size)("foo")
    arraySeq = ArraySeq.fill(size)("foo")
    arrayBuffer = mutable.ArrayBuffer.fill(size)("foo")
  }

  @Benchmark def array_foreach(bh: Blackhole): Unit =
    bh.consume(array.foreach(bh.consume))

  @Benchmark def vector_foreach(bh: Blackhole): Unit =
    bh.consume(vector.foreach(bh.consume))

  @Benchmark def arraySeq_foreach(bh: Blackhole): Unit =
    bh.consume(arraySeq.foreach(bh.consume))

  @Benchmark def arrayBuffer_foreach(bh: Blackhole): Unit =
    bh.consume(arrayBuffer.foreach(bh.consume))

}

Here are the results:

Before:

[info] ForEachBenchmark.arrayBuffer_foreach       0  avgt    8       3.607 ±     0.111  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach       1  avgt    8       8.205 ±     0.217  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach      10  avgt    8      52.256 ±     2.819  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach    1000  avgt    8    4652.913 ±   104.411  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach   10000  avgt    8   47493.288 ±  1164.611  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach  100000  avgt    8  497097.001 ± 21642.157  ns/op

After:

[info] ForEachBenchmark.arrayBuffer_foreach       0  avgt    8       3.342 ±      0.039  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach       1  avgt    8       7.650 ±      0.593  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach      10  avgt    8      48.327 ±      1.151  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach    1000  avgt    8    4598.888 ±    626.385  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach   10000  avgt    8   48920.076 ±   3978.785  ns/op
[info] ForEachBenchmark.arrayBuffer_foreach  100000  avgt    8  494307.976 ±  51521.829  ns/op

Before:

[info] ForEachBenchmark.arraySeq_foreach          0  avgt    8       3.909 ±     0.096  ns/op
[info] ForEachBenchmark.arraySeq_foreach          1  avgt    8       7.309 ±     0.168  ns/op
[info] ForEachBenchmark.arraySeq_foreach         10  avgt    8      45.904 ±     0.638  ns/op
[info] ForEachBenchmark.arraySeq_foreach       1000  avgt    8    3455.772 ±   112.868  ns/op
[info] ForEachBenchmark.arraySeq_foreach      10000  avgt    8   34451.301 ±   254.424  ns/op
[info] ForEachBenchmark.arraySeq_foreach     100000  avgt    8  344377.487 ±  3330.170  ns/op

After:

[info] ForEachBenchmark.arraySeq_foreach          0  avgt    8       4.248 ±      0.652  ns/op
[info] ForEachBenchmark.arraySeq_foreach          1  avgt    8       8.714 ±      0.354  ns/op
[info] ForEachBenchmark.arraySeq_foreach         10  avgt    8      52.733 ±      4.276  ns/op
[info] ForEachBenchmark.arraySeq_foreach       1000  avgt    8    4399.159 ±    868.108  ns/op
[info] ForEachBenchmark.arraySeq_foreach      10000  avgt    8   44252.763 ±   7538.913  ns/op
[info] ForEachBenchmark.arraySeq_foreach     100000  avgt    8  526270.677 ± 611607.069  ns/op

Before:

[info] ForEachBenchmark.array_foreach             0  avgt    8       3.355 ±     0.051  ns/op
[info] ForEachBenchmark.array_foreach             1  avgt    8       6.762 ±     0.169  ns/op
[info] ForEachBenchmark.array_foreach            10  avgt    8      70.455 ±     0.758  ns/op
[info] ForEachBenchmark.array_foreach          1000  avgt    8    3878.833 ±  3844.431  ns/op
[info] ForEachBenchmark.array_foreach         10000  avgt    8   31033.782 ±   840.745  ns/op
[info] ForEachBenchmark.array_foreach        100000  avgt    8  314350.427 ± 23601.784  ns/op

After (Arrays should not be affected by the PR):

[info] ForEachBenchmark.array_foreach             0  avgt    8       3.505 ±      0.334  ns/op
[info] ForEachBenchmark.array_foreach             1  avgt    8       7.555 ±      0.797  ns/op
[info] ForEachBenchmark.array_foreach            10  avgt    8      38.023 ±      5.219  ns/op
[info] ForEachBenchmark.array_foreach          1000  avgt    8    3358.272 ±    408.329  ns/op
[info] ForEachBenchmark.array_foreach         10000  avgt    8   35310.950 ±   6418.211  ns/op
[info] ForEachBenchmark.array_foreach        100000  avgt    8  367379.432 ±  45226.684  ns/op

Before:

[info] ForEachBenchmark.vector_foreach            0  avgt    8       3.366 ±     0.127  ns/op
[info] ForEachBenchmark.vector_foreach            1  avgt    8       7.075 ±     0.180  ns/op
[info] ForEachBenchmark.vector_foreach           10  avgt    8      35.302 ±     0.917  ns/op
[info] ForEachBenchmark.vector_foreach         1000  avgt    8    3763.463 ±   244.501  ns/op
[info] ForEachBenchmark.vector_foreach        10000  avgt    8   37467.608 ±   569.759  ns/op
[info] ForEachBenchmark.vector_foreach       100000  avgt    8  384386.185 ± 29821.773  ns/op

After:

[info] ForEachBenchmark.vector_foreach            0  avgt    8       3.668 ±      0.558  ns/op
[info] ForEachBenchmark.vector_foreach            1  avgt    8       7.121 ±      0.155  ns/op
[info] ForEachBenchmark.vector_foreach           10  avgt    8      36.102 ±      3.690  ns/op
[info] ForEachBenchmark.vector_foreach         1000  avgt    8    3977.236 ±    561.623  ns/op
[info] ForEachBenchmark.vector_foreach        10000  avgt    8   37354.477 ±   1814.811  ns/op
[info] ForEachBenchmark.vector_foreach       100000  avgt    8  375880.352 ±  17324.349  ns/op

Not sure if I did something wrong. But there seems to be no impact on the performance.

@Ichoran
Copy link
Contributor

Ichoran commented Dec 13, 2021

If there's no difference either way for anything, is it actually worth adding to the code complexity? Nominally it avoids an object creation, which one would think was good, but it looks to me like the object creation is getting optimized out or you'd be able to see the initializer cost.

Can you check GC churn before/after to verify?

I'd imagine in cases with less optimization, you'd get increased memory usage and initializer costs, but then you'd get a Vector slowdown because it's not very good at indexing.

@SethTisue
Copy link
Member

@WojciechMazur you were the original reporter on Discord, I believe. Could you test this PR with your code, or supply your code in benchmark form?

@lrytz
Copy link
Member

lrytz commented Dec 14, 2021

One potential issue with the benchmark is that it's monomorphic in the collection type.

  @Benchmark def arraySeq_foreach(bh: Blackhole): Unit =
    bh.consume(arraySeq.foreach(bh.consume))

In the JVM running the above, the IterableOnce.foreach method is only ever called on an ArraySeq receiver. So the JVM can inline through the ArraySeq iterator in val it = iterator; while(it.hasNext) f(it.next()), which uses an ArrayIterator.

That doesn't explain why the while (i < len) { f(seq(i)); i += 1 } version ends up running slower on an ArraySeq. But in a real world application, it's not likely that IterableOnce.foreach is going to be monomorphic in the receiver across the JVM lifetime.

I assume it should be possible to pollute the profile (run foreach on various collection types) before starting to measure an individual collection.

@WojciechMazur
Copy link

WojciechMazur commented Dec 14, 2021

Hey, the original issue was discovered in one of Scala Native old benchmarks, which are pretty deprecated and in some cases not very idiomatic. The gist with benchmark adopted to be run with Jmh can be found in this link.

Unfortunately, when testing locally it looks this change would introduce regression, here are my results, the last one refers to a situation when changed change is moved to IndexSeqOps as an overridden method

Benchmark Mode Cnt Score Error Units
before KmeansBenchmark.run avgt 20 44323720.686 636910.669 ns/op
after KmeansBenchmark.run avgt 20 94648934.033 897606.542 ns/op
as-override KmeansBenchmark.run avgt 20 69258123.109 1885916.721 ns/op

I have no idea why these results are so different from the ones I was seeing before. Maybe I have some interference due to some background processes I'm not aware of.
Even though it maybe not be directly linked, still in Scala Native there are two benchmarks for which performance regression is clearly visible between Scala 2.11 and 2.13 (~2.4x longer execution time here's link to comparison). When tracking it down in the assembly I've seen that most of the execution time was spent while iterating over collections. Additionally, after applying overrides in SN (via the mechanism of Scala stdlib overrides when compiling it) I was observing speedup.

@dwijnand dwijnand modified the milestones: 2.13.8, 2.13.9 Dec 15, 2021
@SethTisue SethTisue modified the milestones: 2.13.8, 2.13.9 Dec 15, 2021
@SethTisue SethTisue marked this pull request as draft January 19, 2022 03:56
@SethTisue
Copy link
Member

Marking as draft since the prospects of this progressing appear uncertain. @WojciechMazur are you interested in pursuing this further? @NthPortal, you?

@NthPortal
Copy link
Contributor Author

@SethTisue I don't particularly intend to pursue this further. if justified by benchmarks, I'd be happy to tweak the code to be more in line with other type-test-overrides, but unless someone else justifies it, it's not clear to me that it's worth working on. happy for someone else to reopen or copy if they think it's worth it

@NthPortal NthPortal closed this Jan 30, 2022
@SethTisue SethTisue removed this from the 2.13.9 milestone Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
library:collections PRs involving changes to the standard collection library performance the need for speed. usually compiler performance, sometimes runtime performance.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2.13 IndexedSeqOps does not override foreach in IterableOps
9 participants