Hello,
I've been hacking a bit on Miroslav Nemecek's PicoQVGA (https://www.breatharian.eu/hw/picoqvga/index_en.html).
While adding double-buffering support, I came across a huge performance pitfall from a seemingly trivial code change.
Essentially, the difference is accessing a framebuffer array directly, vs accessing it through a "current framebuffer" pointer.
The frame buffers are declared like so:
These frame buffers are abstracted behind pointers as the "draw" buffer and the "display" buffer. The idea is that you draw to the draw buffer while DMA / PIO is sending the display buffer to the screen, then the pointers flip at VSYNC (but only if a vga_flip() call has been made, which sets a "draw buffer is ready" flag).
In order to access the frame buffers directly, there are also two indexes which are updated on VSYNC:
Accordingly, in a loop which fills the frame buffer, setting a pixel can be done in two ways: direct array access or access via the "current" pointer:
I'm a bit shocked that the direct array access (306us) is about 18x faster than the pointer access (5467us).
Just to confirm what I was seeing, I encapsulated these two approaches as functions, then toggled between them in a loop, printing out the elapsed time of each in the serial console. Sure enough, the behavior is repeatable:
I then marked them as NOINLINE and looked at the disassembly, but I can't seem to spot what would make one so much slower than the other.
direct array access:
pointer access:
Very curious to hear if anyone has any insight!
The frame buffers are declared like so:
Code:
ALIGN4 u8 frame_buff0[FB_SIZE];
ALIGN4 u8 frame_buff1[FB_SIZE];
Code:
u8* draw_buff;
u8* display_buff;
Code:
u8 draw_idx;
u8 display_idx;
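To make the flip concrete, here is a rough sketch of how the pointers and indexes could be swapped at VSYNC. This is not the actual PicoQVGA code: the `on_vsync()` handler name, the `draw_ready` flag, the body of `vga_flip()`, the `ALIGN4` definition, and the 320x240 dimensions are all assumptions made for illustration, and a real interrupt-driven version would need more care around shared state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t u8;

/* Assumed values: QVGA at one byte per pixel. */
#define FB_WIDTH  320
#define FB_HEIGHT 240
#define FB_SIZE   (FB_WIDTH * FB_HEIGHT)
/* Assumed definition of the alignment macro. */
#define ALIGN4 __attribute__((aligned(4)))

static ALIGN4 u8 frame_buff0[FB_SIZE];
static ALIGN4 u8 frame_buff1[FB_SIZE];

/* Draw into buffer 0 first while buffer 1 is being displayed. */
static u8 *draw_buff = frame_buff0;
static u8 *display_buff = frame_buff1;
static u8 draw_idx = 0;
static u8 display_idx = 1;

/* Hypothetical "draw buffer is ready" flag. */
static volatile bool draw_ready = false;

/* Request a flip; the swap itself is deferred to the next VSYNC. */
void vga_flip(void) {
    draw_ready = true;
}

/* Hypothetical VSYNC hook. Returns true if the buffers were swapped. */
bool on_vsync(void) {
    if (!draw_ready) {
        return false;
    }

    /* Swap the pointers. */
    u8 *tmp_buff = draw_buff;
    draw_buff = display_buff;
    display_buff = tmp_buff;

    /* Swap the indexes so they stay in step with the pointers. */
    u8 tmp_idx = draw_idx;
    draw_idx = display_idx;
    display_idx = tmp_idx;

    draw_ready = false;
    assert(draw_buff != display_buff);
    return true;
}
```

The point of keeping both the pointers and the indexes is that either one can be used to find the current draw buffer, which is exactly the choice being compared in this post.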
Code:
// direct framebuffer access: 306us.
if (draw_idx == 1) {
    frame_buff1[pixel] = rgb;
} else {
    frame_buff0[pixel] = rgb;
}
Code:
// pointer to framebuffer: 5467us.
draw_buff[pixel] = rgb;
Just to confirm what I was seeing, I encapsulated these two approaches as functions, then toggled between them in a loop, printing out the elapsed time of each in the serial console. Sure enough, the behavior is repeatable:
Code:
main: foo: 306us
main: foo: 5979us
main: foo: 306us
main: foo: 5978us
main: foo: 306us
main: foo: 5978us
main: foo: 306us
main: foo: 5979us
main: foo: 306us
main: foo: 5978us
main: foo: 306us
main: foo: 5978us
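For reference, a timing loop along these lines can be sketched as follows. This is a host-side approximation rather than the code that produced the numbers above: `now_us()`, `time_fill_us()`, `toggle_and_time()`, and the two small stand-in fill functions are hypothetical names, and `clock()` stands in for a microsecond timer (on the Pico, `time_us_32()` from the SDK would be the natural choice).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

typedef uint8_t u8;
typedef void (*fill_fn)(u8 rgb);

/* Tiny stand-in buffer and call counters, only for this sketch. */
static u8 demo_buff[64];
static int calls_a;
static int calls_b;

/* Stand-in for the direct-access fill. */
void fill_a(u8 rgb) {
    memset(demo_buff, rgb, sizeof demo_buff);
    calls_a++;
}

/* Stand-in for the pointer-access fill. */
void fill_b(u8 rgb) {
    for (size_t i = 0; i < sizeof demo_buff; i++) {
        demo_buff[i] = rgb;
    }
    calls_b++;
}

/* Host stand-in for a microsecond timer, based on CPU time. */
static uint32_t now_us(void) {
    return (uint32_t)((uint64_t)clock() * 1000000u / CLOCKS_PER_SEC);
}

/* Time a single call of fn and return the elapsed microseconds. */
uint32_t time_fill_us(fill_fn fn, u8 rgb) {
    uint32_t start = now_us();
    fn(rgb);
    return now_us() - start;
}

/* Alternate between two fill functions, printing each elapsed time. */
void toggle_and_time(fill_fn a, fill_fn b, int rounds) {
    for (int i = 0; i < rounds; i++) {
        fill_fn fn = (i % 2 == 0) ? a : b;
        uint32_t elapsed = time_fill_us(fn, (u8)i);
        printf("main: foo: %uus\n", (unsigned)elapsed);
    }
}
```

Calling `toggle_and_time(fill_a, fill_b, 12)` prints twelve lines in the same format as the log above. The values themselves depend on the machine and timer resolution.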
direct array access:
Code:
NOINLINE void foo_fill1(u8 rgb) {
    for (int y=0; y<FB_HEIGHT; y++) {
        for (int x=0; x<FB_WIDTH; x++) {
            int pixel = ((FB_WIDTH * y) + x);
            // direct framebuffer access: 306us.
            if (draw_idx == 1) {
                frame_buff1[pixel] = rgb;
            } else {
                frame_buff0[pixel] = rgb;
            }
        }
    }
}
Code:
10000850 <foo_fill1>:
10000850:  b510       push    {r4, lr}
10000852:  4b08       ldr     r3, [pc, #32]   @ (10000874 <foo_fill1+0x24>)
10000854:  0001       movs    r1, r0
10000856:  781b       ldrb    r3, [r3, #0]
10000858:  2b01       cmp     r3, #1
1000085a:  d005       beq.n   10000868 <foo_fill1+0x18>
1000085c:  2296       movs    r2, #150        @ 0x96
1000085e:  4806       ldr     r0, [pc, #24]   @ (10000878 <foo_fill1+0x28>)
10000860:  0252       lsls    r2, r2, #9
10000862:  f004 fc83  bl      1000516c <__wrap_memset>
10000866:  bd10       pop     {r4, pc}
10000868:  2296       movs    r2, #150        @ 0x96
1000086a:  4804       ldr     r0, [pc, #16]   @ (1000087c <foo_fill1+0x2c>)
1000086c:  0252       lsls    r2, r2, #9
1000086e:  f004 fc7d  bl      1000516c <__wrap_memset>
10000872:  e7f8       b.n     10000866 <foo_fill1+0x16>
10000874:  20000f64   .word   0x20000f64
10000878:  20001a14   .word   0x20001a14
1000087c:  20014614   .word   0x20014614
Code:
NOINLINE void foo_fill2(u8 rgb) {
    for (int y=0; y<FB_HEIGHT; y++) {
        for (int x=0; x<FB_WIDTH; x++) {
            int pixel = ((FB_WIDTH * y) + x);
            // pointer to framebuffer: 5467us.
            draw_buff[pixel] = rgb;
        }
    }
}
Code:
10000880 <foo_fill2>:
10000880:  21a0       movs    r1, #160        @ 0xa0
10000882:  b570       push    {r4, r5, r6, lr}
10000884:  2500       movs    r5, #0
10000886:  4c08       ldr     r4, [pc, #32]   @ (100008a8 <foo_fill2+0x28>)
10000888:  4e08       ldr     r6, [pc, #32]   @ (100008ac <foo_fill2+0x2c>)
1000088a:  0049       lsls    r1, r1, #1
1000088c:  002b       movs    r3, r5
1000088e:  6822       ldr     r2, [r4, #0]
10000890:  54d0       strb    r0, [r2, r3]
10000892:  3301       adds    r3, #1
10000894:  428b       cmp     r3, r1
10000896:  d1fa       bne.n   1000088e <foo_fill2+0xe>
10000898:  3341       adds    r3, #65         @ 0x41
1000089a:  3541       adds    r5, #65         @ 0x41
1000089c:  33ff       adds    r3, #255        @ 0xff
1000089e:  0019       movs    r1, r3
100008a0:  35ff       adds    r5, #255        @ 0xff
100008a2:  42b3       cmp     r3, r6
100008a4:  d1f2       bne.n   1000088c <foo_fill2+0xc>
100008a6:  bd70       pop     {r4, r5, r6, pc}
100008a8:  20000f6c   .word   0x20000f6c
100008ac:  00012d40   .word   0x00012d40
Statistics: Posted by cellularmitosis — Sun Jun 30, 2024 5:16 am — Replies 2 — Views 71