ARM functions and performance


#1

I was trying to compact the code for my looper objects.

The old code was:

```c
uint32_t pos = __USAT(inlet_pos, 28);
if (pos >= attr_table.LENGTH)       // clamp to the last valid index
    pos = attr_table.LENGTH - 1;

// mask, shift down to bits 0..7 and store as a signed 8-bit value
int8_t temp1 = (bitmask1 & attr_table.array[pos]) >> 24;
int8_t temp2 = (bitmask2 & attr_table.array[pos]) >> 16;
int8_t temp3 = (bitmask3 & attr_table.array[pos]) >> 8;
int8_t temp4 = (bitmask4 & attr_table.array[pos]);

// sign-extend to 32 bits and scale to the outlet range
outlet_o1 = ((int32_t) temp1) << 20;
outlet_o2 = ((int32_t) temp2) << 20;
outlet_o3 = ((int32_t) temp3) << 20;
outlet_o4 = ((int32_t) temp4) << 20;
```
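For reference, here is a portable sketch of what the old code computes. It assumes `bitmask1` to `bitmask4` are `0xFF000000`, `0x00FF0000`, `0x0000FF00` and `0x000000FF` (the post does not show them), and `unpack_old` is a hypothetical name. Note that once a value is narrowed to `int8_t`, only its low 8 bits survive, so the masks are redundant: each line is really "take byte k, sign-extend, shift". A compiler can typically lower that to very few instructions, so there is not much left for hand-written assembly to win.

```c
#include <stdint.h>

/* Assumed masks: the original post does not show bitmask1..4. */
#define BITMASK1 0xFF000000u
#define BITMASK2 0x00FF0000u
#define BITMASK3 0x0000FF00u
#define BITMASK4 0x000000FFu

/* Unpack four signed 8-bit values packed into one 32-bit table word
   (most significant byte first) and scale each by 2^20. */
static void unpack_old(uint32_t word, int32_t out[4])
{
    /* Narrowing to int8_t keeps the low 8 bits as a signed value
       (implementation-defined in C; GCC wraps modulo 2^8). */
    int8_t t1 = (int8_t)((BITMASK1 & word) >> 24);
    int8_t t2 = (int8_t)((BITMASK2 & word) >> 16);
    int8_t t3 = (int8_t)((BITMASK3 & word) >> 8);
    int8_t t4 = (int8_t)(BITMASK4 & word);

    /* Multiplying by 2^20 avoids left-shifting a negative value,
       which is undefined in standard C; compilers emit a shift anyway. */
    out[0] = (int32_t)t1 * (1 << 20);
    out[1] = (int32_t)t2 * (1 << 20);
    out[2] = (int32_t)t3 * (1 << 20);
    out[3] = (int32_t)t4 * (1 << 20);
}
```

For a word such as `0x80FF7F01`, the four bytes decode to -128, -1, 127 and 1 before scaling.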

The new code is:

```c
uint32_t pos = __USAT(inlet_pos, 28);
if (pos >= attr_table.LENGTH)       // clamp to the last valid index
    pos = attr_table.LENGTH - 1;

uint32_t sample = attr_table.array[pos];  // read the table word once

int32_t temp1 = extract(sample, 0, 24);
int32_t temp2 = extract(sample, 0, 16);
int32_t temp3 = extract(sample, 0, 8);
int32_t temp4 = extract(sample, 0, 0);

outlet_o1 = temp1 << 20;
outlet_o2 = temp2 << 20;
outlet_o3 = temp3 << 20;
outlet_o4 = temp4 << 20;
```

where the function extract(op1, op2, op3) is defined in the local data section:

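```c
int32_t extract(int32_t op1, int32_t op2, int32_t op3)
{
  int32_t result;

  // result = op2 + (sign-extended low byte of op1 rotated right by 24),
  // i.e. op2 plus the top byte of op1. op3 is never referenced by the
  // asm, so the rotation is always 24 whatever value is passed.
  __ASM volatile ("sxtab %0, %1, %2, ROR #24" : "=r" (result) : "r" (op2), "r" (op1));
  return(result);
}
```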
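As posted, `extract` ignores its third argument: the rotation is hard-coded as `ROR #24` in the asm string, so `temp1` to `temp4` all receive the same (top) byte. The rotation in SXTAB/SXTB is an immediate (0, 8, 16 or 24), so it cannot come from a runtime function argument in any case. Below is a minimal sketch of a working alternative in plain C; `extract_byte` and `unpack_bytes` are hypothetical names. With `k` a constant at each call site, GCC targeting a Cortex-M4 can typically select a sign-extend-with-rotation instruction by itself, but that is worth confirming in the disassembly rather than assuming.

```c
#include <stdint.h>

/* Hypothetical helper: return byte k of a 32-bit word (k = 0 is the least
   significant byte), sign-extended to 32 bits. k must be 0..3, because
   shifting a 32-bit value by 32 or more is undefined. Narrowing to int8_t
   is implementation-defined in C; GCC wraps modulo 2^8. */
static inline int32_t extract_byte(uint32_t sample, unsigned k)
{
    return (int8_t)(sample >> (8u * k));
}

/* Same mapping as the old code: byte 3 -> out[0] (o1) ... byte 0 -> out[3] (o4),
   each scaled by 2^20 (multiplication avoids shifting negative values). */
static inline void unpack_bytes(uint32_t sample, int32_t out[4])
{
    for (unsigned i = 0; i < 4; i++)
        out[i] = extract_byte(sample, 3u - i) * (1 << 20);
}
```

In the object, `out[0]` to `out[3]` would then be written to `outlet_o1` to `outlet_o4`.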

I thought that using an ARM instruction would improve performance (the old code used several bit shifts and casts instead), but that is not the case.

The old code takes approx. 520 cycles, while the new code takes 620. Why does this happen?
I also tried adding `__attribute__((always_inline)) __STATIC_INLINE` before the function, but there was no change at all.


#2

One possible explanation is that, if not all outlets are actually used and the output value is simply discarded, the compiler cannot eliminate the inline assembly. One way around this may be to add the `const` or `pure` function attribute (see the GCC function attribute reference).
To verify, it can be useful to check the compiler's assembly output with arm-none-eabi-objdump (see its documentation).
In general, I think there is little (if any) benefit in using assembly for things that can be expressed cleanly in C++.
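The elimination point can be made concrete. Per the GCC documentation, an asm statement with outputs and no `volatile` may be discarded when its outputs are unused, and duplicate copies may be merged; `volatile` disables both. A `const` attribute on the wrapper mainly matters when it is emitted as a real call; once inlined, the asm's own `volatile` is what keeps the instruction alive. A hedged sketch, assuming GCC-style inline asm and a Cortex-M4 target (detected via `__ARM_ARCH_7EM__`); `sxtb_ror24` is a hypothetical name, and other targets fall back to plain C so the function can be tested on a host.

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend the top byte of a 32-bit word. */
__attribute__((const)) static inline int32_t sxtb_ror24(uint32_t x)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7EM__)
    int32_t r;
    /* No 'volatile': GCC may drop this asm when r is unused
       and may merge identical copies. */
    __asm__("sxtb %0, %1, ror #24" : "=r"(r) : "r"(x));
    return r;
#else
    /* Portable fallback with the same result. */
    return (int8_t)(x >> 24);
#endif
}
```

Whether this actually beats the plain C version is exactly what the objdump check above would show.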


#3

I tried both attributes. I got a really significant advantage over the C++ version in only one case: with all outlets disconnected. Not the way I intended to use the object :pensive: