General • Surprising performance disparity

Hello,

I've been hacking a bit on Miroslav Nemecek's PicoQVGA (https://www.breatharian.eu/hw/picoqvga/index_en.html).

While adding double-buffering support, I came across a huge performance pitfall from a seemingly trivial code change.

Essentially, the difference is accessing a framebuffer array directly, vs accessing it through a "current framebuffer" pointer.

The frame buffers are declared like so:

Code:

ALIGN4 u8 frame_buff0[FB_SIZE];
ALIGN4 u8 frame_buff1[FB_SIZE];

These frame buffers are abstracted behind pointers as the "draw" buffer and the "display" buffer. The idea is that you draw to the draw buffer while DMA / PIO sends the display buffer to the screen. The pointers then flip at VSYNC, but only if a vga_flip() call has been made, which sets a "draw buffer is ready" flag.

Code:

u8* draw_buff;
u8* display_buff;

In order to access the frame buffers directly, there are also two indexes which are updated on VSYNC:

Code:

u8 draw_idx;
u8 display_idx;
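To make the flip concrete, here is a minimal sketch of how the pointers and indexes could be swapped at VSYNC. It is not the PicoQVGA code: `on_vsync()`, the `draw_ready` flag name, the `ALIGN4` definition, the 320x240 size and the choice of which buffer starts as the draw buffer are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t u8;

/* Assumed definitions; the real project supplies its own. */
#define ALIGN4    __attribute__((aligned(4)))
#define FB_WIDTH  320
#define FB_HEIGHT 240
#define FB_SIZE   (FB_WIDTH * FB_HEIGHT)

ALIGN4 u8 frame_buff0[FB_SIZE];
ALIGN4 u8 frame_buff1[FB_SIZE];

/* Assumed initial state: draw into buffer 0, display buffer 1. */
u8* draw_buff = frame_buff0;
u8* display_buff = frame_buff1;
u8 draw_idx = 0;
u8 display_idx = 1;

/* "Draw buffer is ready" flag (hypothetical name). */
static volatile bool draw_ready = false;

/* Called by the drawing code when a frame is complete. */
void vga_flip(void) {
    draw_ready = true;
}

/* Hypothetical VSYNC hook: swap only if a flip was requested. */
void on_vsync(void) {
    if (!draw_ready) {
        return;
    }
    u8* tmp_buff = draw_buff;
    draw_buff = display_buff;
    display_buff = tmp_buff;

    u8 tmp_idx = draw_idx;
    draw_idx = display_idx;
    display_idx = tmp_idx;

    /* The two roles must never point at the same buffer. */
    assert(draw_buff != display_buff);
    draw_ready = false;
}
```

Keeping the pointer and index swaps in the same function means the two views of "current buffer" cannot drift apart.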

Accordingly, in a loop which fills the frame buffer, setting a pixel can be done in two ways: direct array access or access via the "current" pointer:

Code:

// direct framebuffer access: 306us.
if (draw_idx == 1) {
    frame_buff1[pixel] = rgb;
} else {
    frame_buff0[pixel] = rgb;
}

Code:

// pointer to framebuffer: 5467us.
draw_buff[pixel] = rgb;

I'm a bit shocked that one is about 18x faster than the other.

Just to confirm what I was seeing, I encapsulated these two approaches as functions, then toggled between them in a loop, printing out the elapsed time of each in the serial console. Sure enough, the behavior is repeatable:

Code:

main: foo: 306us
main: foo: 5979us
main: foo: 306us
main: foo: 5978us
main: foo: 306us
main: foo: 5978us
main: foo: 306us
main: foo: 5979us
main: foo: 306us
main: foo: 5978us
main: foo: 306us
main: foo: 5978us
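A toggling harness of this kind could look roughly like the sketch below. It is an illustration, not the code that produced the numbers above: `demo_fill_direct`, `demo_fill_pointer`, `time_call_us`, `run_toggle` and `host_now_us` are hypothetical names, the buffer is a small stand-in, and the clock is passed in as a function pointer so the same code builds on a host. On a Pico, pico-sdk's `time_us_32` could be passed instead; the post does not say which timer was used.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint8_t u8;
typedef uint32_t (*now_us_fn)(void);

/* Small stand-in buffer; the real frame buffers are much larger. */
#define DEMO_SIZE 1024
u8 demo_buff[DEMO_SIZE];
u8* demo_ptr = demo_buff;

/* Hypothetical stand-ins for the two fill routines being compared. */
void demo_fill_direct(u8 rgb) {
    for (int i = 0; i < DEMO_SIZE; i++) {
        demo_buff[i] = rgb;
    }
}

void demo_fill_pointer(u8 rgb) {
    for (int i = 0; i < DEMO_SIZE; i++) {
        demo_ptr[i] = rgb;
    }
}

/* Host-side microsecond clock built on clock() (CPU time). */
uint32_t host_now_us(void) {
    return (uint32_t)((uint64_t)clock() * 1000000u / CLOCKS_PER_SEC);
}

/* Time one call; unsigned subtraction tolerates counter wrap-around. */
uint32_t time_call_us(void (*fn)(u8), u8 rgb, now_us_fn now_us) {
    uint32_t start = now_us();
    fn(rgb);
    return now_us() - start;
}

/* Alternate between the two routines, printing one line per call. */
void run_toggle(int rounds, now_us_fn now_us) {
    for (int i = 0; i < rounds; i++) {
        void (*fn)(u8) = (i % 2 == 0) ? demo_fill_direct : demo_fill_pointer;
        uint32_t elapsed = time_call_us(fn, (u8)i, now_us);
        printf("main: foo: %" PRIu32 "us\n", elapsed);
    }
}
```

Calling `run_toggle(12, host_now_us)` prints twelve lines in the same format as the log above; the values depend entirely on the machine and compiler flags.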

I then marked them as NOINLINE and looked at the disassembly, but I can't seem to spot what would make one so much slower than the other.

direct array access:

Code:

NOINLINE void foo_fill1(u8 rgb) {
    for (int y=0; y<FB_HEIGHT; y++) {
        for (int x=0; x<FB_WIDTH; x++) {
            int pixel = ((FB_WIDTH * y) + x);
            // direct framebuffer access: 306us.
            if (draw_idx == 1) {
                frame_buff1[pixel] = rgb;
            } else {
                frame_buff0[pixel] = rgb;
            }
        }
    }
}

Code:

10000850 <foo_fill1>:
10000850:  b510       push   {r4, lr}
10000852:  4b08       ldr    r3, [pc, #32]   @ (10000874 <foo_fill1+0x24>)
10000854:  0001       movs   r1, r0
10000856:  781b       ldrb   r3, [r3, #0]
10000858:  2b01       cmp    r3, #1
1000085a:  d005       beq.n  10000868 <foo_fill1+0x18>
1000085c:  2296       movs   r2, #150   @ 0x96
1000085e:  4806       ldr    r0, [pc, #24]   @ (10000878 <foo_fill1+0x28>)
10000860:  0252       lsls   r2, r2, #9
10000862:  f004 fc83  bl     1000516c <__wrap_memset>
10000866:  bd10       pop    {r4, pc}
10000868:  2296       movs   r2, #150   @ 0x96
1000086a:  4804       ldr    r0, [pc, #16]   @ (1000087c <foo_fill1+0x2c>)
1000086c:  0252       lsls   r2, r2, #9
1000086e:  f004 fc7d  bl     1000516c <__wrap_memset>
10000872:  e7f8       b.n    10000866 <foo_fill1+0x16>
10000874:  20000f64   .word  0x20000f64
10000878:  20001a14   .word  0x20001a14
1000087c:  20014614   .word  0x20014614

pointer access:

Code:

NOINLINE void foo_fill2(u8 rgb) {
    for (int y=0; y<FB_HEIGHT; y++) {
        for (int x=0; x<FB_WIDTH; x++) {
            int pixel = ((FB_WIDTH * y) + x);
            // pointer to framebuffer: 5467us.
            draw_buff[pixel] = rgb;
        }
    }
}

Code:

10000880 <foo_fill2>:
10000880:  21a0       movs   r1, #160   @ 0xa0
10000882:  b570       push   {r4, r5, r6, lr}
10000884:  2500       movs   r5, #0
10000886:  4c08       ldr    r4, [pc, #32]   @ (100008a8 <foo_fill2+0x28>)
10000888:  4e08       ldr    r6, [pc, #32]   @ (100008ac <foo_fill2+0x2c>)
1000088a:  0049       lsls   r1, r1, #1
1000088c:  002b       movs   r3, r5
1000088e:  6822       ldr    r2, [r4, #0]
10000890:  54d0       strb   r0, [r2, r3]
10000892:  3301       adds   r3, #1
10000894:  428b       cmp    r3, r1
10000896:  d1fa       bne.n  1000088e <foo_fill2+0xe>
10000898:  3341       adds   r3, #65   @ 0x41
1000089a:  3541       adds   r5, #65   @ 0x41
1000089c:  33ff       adds   r3, #255   @ 0xff
1000089e:  0019       movs   r1, r3
100008a0:  35ff       adds   r5, #255   @ 0xff
100008a2:  42b3       cmp    r3, r6
100008a4:  d1f2       bne.n  1000088c <foo_fill2+0xc>
100008a6:  bd70       pop    {r4, r5, r6, pc}
100008a8:  20000f6c   .word  0x20000f6c
100008ac:  00012d40   .word  0x00012d40
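Read side by side, the two listings have different shapes: foo_fill1 is reduced to a single call to __wrap_memset with a length of 150 << 9 = 76800 bytes, while foo_fill2 keeps a byte-at-a-time loop that reloads draw_buff (the ldr r2, [r4, #0] at 1000088e) before every strb. One variant that may be worth timing copies the global pointer into a local before the loop, so the compiler no longer has to assume that a byte store could modify draw_buff itself. This is only a sketch and has not been measured here; foo_fill3 is a hypothetical name, the sizes are assumed, and whether the compiler then turns the loop into a memset call depends on the toolchain and optimisation flags.

```c
#include <stdint.h>

typedef uint8_t u8;

/* Assumed sizes, matching the 76800-byte memset length in the listing. */
#define FB_WIDTH  320
#define FB_HEIGHT 240
#define FB_SIZE   (FB_WIDTH * FB_HEIGHT)

u8 frame_buff0[FB_SIZE];
u8* draw_buff = frame_buff0;

/* Hypothetical variant of foo_fill2: read the global pointer once. */
void foo_fill3(u8 rgb) {
    u8* dst = draw_buff;  /* local copy; its address is never taken */
    for (int y = 0; y < FB_HEIGHT; y++) {
        for (int x = 0; x < FB_WIDTH; x++) {
            int pixel = ((FB_WIDTH * y) + x);
            dst[pixel] = rgb;
        }
    }
}
```

The trade-off is that the function fills whichever buffer draw_buff pointed to on entry, which matches the double-buffering design as long as the flip never happens in the middle of a fill.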

Very curious to hear if anyone has any insight!

Statistics: Posted by cellularmitosis — Sun Jun 30, 2024 5:16 am — Replies 2 — Views 71
