I was trying to compact the code for my looper objects.
The old code was:
uint32_t pos = __USAT(inlet_pos, 28);
if (pos>attr_table.LENGTH)
    pos = attr_table.LENGTH;
int8_t temp1= (bitmask1&attr_table.array[pos])>>24;
int8_t temp2= (bitmask2&attr_table.array[pos])>>16;
int8_t temp3= (bitmask3&attr_table.array[pos])>>8;
int8_t temp4= (bitmask4&attr_table.array[pos]);
outlet_o1 = ((int32_t) temp1)<<20;
outlet_o2 = ((int32_t) temp2)<<20;
outlet_o3 = ((int32_t) temp3)<<20;
outlet_o4 = ((int32_t) temp4)<<20;
The new code is:
uint32_t pos = __USAT(inlet_pos, 28);
if (pos>attr_table.LENGTH)
    pos = attr_table.LENGTH;
uint32_t sample = attr_table.array[pos];
int32_t temp1= extract(sample,0,24);
int32_t temp2= extract(sample,0,16);
int32_t temp3= extract(sample,0,8);
int32_t temp4= extract(sample,0,0);
outlet_o1 = (temp1)<<20;
outlet_o2 = (temp2)<<20;
outlet_o3 = (temp3)<<20;
outlet_o4 = (temp4)<<20;
where the function extract(op1,op2,op3) is defined in the local data section:
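int32_t extract(int32_t op1,int32_t op2,int32_t op3)
{
    int32_t result;
    /* SXTAB: result = op2 + sign-extended low byte of (op1 rotated right by 24 bits).
       Note that op3 is not used here; the rotation is hard-coded to ROR #24. */
    __ASM volatile ("sxtab %0, %1, %2, ROR #24" : "=r" (result) : "r" (op2), "r" (op1) );
    return(result);
}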
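For comparison, the intended per-byte extraction can also be written in plain C and left to the compiler to optimize. The sketch below is only an assumption about what extract is meant to do (return the signed byte that starts at a given bit position of the 32-bit sample). The names extract_byte and shift are hypothetical and are not part of the code above.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): returns the byte
   that starts at bit `shift` of `word`, sign-extended to 32 bits.
   `shift` is assumed to be 0, 8, 16 or 24. */
static inline int32_t extract_byte(uint32_t word, unsigned shift)
{
    /* Isolate the byte as a value in the range 0..255. */
    int32_t b = (int32_t)((word >> shift) & 0xFFu);
    /* Portable sign extension of an 8-bit value: flips the sign bit,
       then subtracts its weight, giving a result in -128..127. */
    return (b ^ 0x80) - 0x80;
}
```

With this, temp1 to temp4 would be extract_byte(sample, 24), extract_byte(sample, 16), extract_byte(sample, 8) and extract_byte(sample, 0). Whether this is any faster than the old code would need to be measured on the target.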
I thought that using an ARM instruction would improve performance (the old code used several bit shifts and casts instead), but that is not the case.
The old code took approximately 520 cycles, while the new code takes 620. Why does this happen?
I also tried adding __attribute__( ( always_inline ) ) __STATIC_INLINE
before the function, but it made no difference at all.