Hey guys, I'm not sure if it's appropriate to ask this here, but it seemed like the best area.
I'm working on a convolution filter on custom hardware built around an STM32F405, with a CS4270 codec on the I2S in full-duplex. I've done a couple of sample player projects on this platform that came out nicely.
For this project I haven't been able to get more than about 400 taps to run with the CMSIS DSP FIR implementation using q15. That didn't seem right from what I've read online, so I decided to test the idea on the Axoloti as a benchmark. As you all know, the Axoloti performs significantly better. I haven't tested the Discovery board version. That implementation is closer to the hardware I'm running, but I doubt the hardware differences are big enough to explain that kind of performance gap.
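To put a rough number on that 400-tap ceiling, here is a minimal sketch of the cycle budget per tap. The 168 MHz core clock (the F405's maximum) and the 48 kHz sample rate are assumptions for illustration, not necessarily my exact settings, and `cycles_per_tap` is just a hypothetical helper name.

```c
/* Rough cycle budget for a direct-form FIR: how many core cycles are
 * available per tap, per output sample. Every input here is an
 * assumption for illustration, not a measured value. */
static double cycles_per_tap(double core_hz, double sample_rate_hz,
                             unsigned channels, unsigned taps)
{
    /* Core cycles available between two consecutive audio frames. */
    double cycles_per_frame = core_hz / sample_rate_hz;

    /* Each channel needs one multiply-accumulate per tap per frame. */
    return cycles_per_frame / ((double)channels * (double)taps);
}
```

With those assumed numbers, `cycles_per_tap(168e6, 48000.0, 1, 400)` comes out to 8.75 cycles per tap for one channel, and 4.375 if both channels are filtered. That is the whole budget, including buffer handling and interrupt overhead, so it only gives an upper bound on what the FIR inner loop can cost.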
The code in the Axoloti convolution object looks very similar to the CMSIS code. I'm running a bare-metal implementation using CubeMX and HAL, with no RTOS. I even wrote a barebones interrupt handler for the I2S DMA to avoid all the bloat in the CubeMX HAL version. All interrupts besides SysTick and the I2S DMA interrupts are turned off. Any ideas on where I'm losing my processing time?