By: Jörn Engel (joern.delete@this.purestorage.com), September 19, 2021 8:46 pm

Room: Moderated Discussions

Anon (no.delete@this.spam.com) on September 19, 2021 5:32 pm wrote:

> Michael S (already5chosen.delete@this.yahoo.com) on September 19, 2021 4:46 pm wrote:

> > > It's a night here now.

> >

> > So, I measured the time, in microseconds, to sum 8,000,000 L1D-resident 64-bit numbers

> > (16,000 B buffer, summation repeated 4,000 times) at different alignments and using different

> > access/arithmetic width. CPU - Skylake Client (Xeon E-2176G) downclocked to 4.25 GHz.

> >

> > Here are results:

> > 8-byte (64b) accesses:

> > 0 1064

> > 1 1170

> >

> > 16-byte (128b) accesses:

> > 0 483

> > 1 701

> >

> > 32-byte (256b) accesses:

> > 0 256

> > 1 468

> >

> > Misalignment penalty (of streaming add):

> > 8-byte - 1.10x

> > 16-byte - 1.45x

> > 32-byte - 1.83x

Thank you!

> I think you should point out whether the accesses cross a cache line or not.

Interesting idea. With 8-byte accesses at a 1-byte offset, roughly 12.5% (1 in 8) of accesses should cross a 64-byte cacheline. The ratio goes up with access size, as does the misalignment penalty. The numbers don't quite match up, but a lot of the measurements could be explained if performance were limited by the number of cachelines read.

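1064 µs (8-byte, offset 0) - 1.77 cachelines / cycle

1170 µs (8-byte, offset 1) - 1.81 cachelines / cycle

483 µs (16-byte, offset 0) - 1.95 cachelines / cycle

701 µs (16-byte, offset 1) - 1.67 cachelines / cycle

256 µs (32-byte, offset 0) - 1.83 cachelines / cycle

468 µs (32-byte, offset 1) - 1.51 cachelines / cycle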
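
For anyone who wants to check the arithmetic, here is a rough sketch of how figures like the ones above can be derived from the quoted timings. It is not the benchmark code. It assumes 64-byte cachelines, reads the "0" and "1" rows as byte offsets from an aligned start, and counts an access that straddles a line boundary as two cacheline reads. The function names are made up for illustration. Under those assumptions it lands within about 0.01 of each posted number.

```python
CLOCK_HZ = 4.25e9             # stated clock: 4.25 GHz
TOTAL_BYTES = 8_000_000 * 8   # 8,000,000 64-bit numbers summed in total
LINE = 64                     # assumption: 64-byte cachelines

# (access width in bytes, byte offset, measured microseconds), from the quoted results
MEASUREMENTS = [
    (8, 0, 1064), (8, 1, 1170),
    (16, 0, 483), (16, 1, 701),
    (32, 0, 256), (32, 1, 468),
]


def crossing_fraction(width, offset, line=LINE):
    """Fraction of back-to-back `width`-byte accesses that straddle a line boundary."""
    # The pattern repeats every `line` bytes, so one line's worth of accesses is enough.
    starts = range(offset, offset + line, width)
    crossing = sum(1 for s in starts if s // line != (s + width - 1) // line)
    return crossing / len(starts)


def lines_per_cycle(width, offset, micros):
    """Cacheline reads per clock cycle, counting a straddling access as two reads."""
    accesses = TOTAL_BYTES // width
    line_reads = accesses * (1 + crossing_fraction(width, offset))
    cycles = micros * 1e-6 * CLOCK_HZ
    return line_reads / cycles


if __name__ == "__main__":
    for width, offset, micros in MEASUREMENTS:
        rate = lines_per_cycle(width, offset, micros)
        print(f"{width:2d}-byte, offset {offset}: {micros:4d} us -> {rate:.2f} cachelines / cycle")
```

The crossing fractions come out as 1/8, 1/4 and 1/2 for the misaligned 8-, 16- and 32-byte cases. Edge effects at the ends of the 16,000 B buffer are ignored.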

Not sure. I'll have to play around with the code a bit.
