Call for patches that push Core to its limits, expose issues


#21

From the literature and from anecdotal evidence, there shouldn't be any advantage to running instructions from SRAM. People actually seem to observe that SRAM can be slower due to caching and CPU pipeline structure (one example: https://community.st.com/s/question/0D50X00009Xkh2VSAR/the-cpu-performance-difference-between-running-in-the-flash-and-running-in-the-ram-for-stm32f407).

The fundamental issue is that we're taking the special case, i.e. dynamically editing an object implementation, and applying it universally. It's over-engineered; there's no technical reason for it. Yes, it might be tolerable to deal with a compile cycle because your host machine is fast. That doesn't make it correct or desirable. Compilation should only occur when someone is changing an object implementation. This approach also has the effect of making everything harder to debug. If all of that object implementation code is compiled ahead of time, it's much easier for a debugger to be aware of all of those symbols.

The actual object implementation can remain largely the same whether in Flash or SRAM. You would simply call into object implementations in Flash as needed. I would argue that working from Flash would be easier overall because there could be less concern about the exact positions of things in memory. The patcher right now actually has awareness of explicit memory addresses on the target device, again, for no real technical benefit. It's a very brittle design.

The patch load process could just as easily target Flash actually, but then we'd get into wear-leveling concerns, etc.

The bottom line is that the vast majority of patching could easily be done completely live without any compile cycle at all. It has the added bonus of actually being simpler to work with and to debug.

Ha, sorry if this all comes across as hyper-critical and doom and gloom! I'm just trying to make this thing the best it can possibly be. It bothers me that people are dealing with what I see as self-imposed technical limitations.


#22

I don't see it as hyper-critical or gloom and doom, I'm just not convinced it's desirable enough to merit much work.
The last guy to dig deep into rewriting how the system works has been mostly missing in action for a few years. We don't want that to happen before you ship some upgraded hardware.


#23

Great! I agree that this is all stuff that isn't happening until the new hardware is out in the wild. Definitely enhancements, not in scope just to get hardware shipped.


#24

The patch that I've attached is my first patch on Axoloti! It is an alpha version of a 12-band vocoder. I intended to make a 31-band ("terts", i.e. 1/3-octave) vocoder, but soon found out that this would demand too many resources on Axoloti, so I scaled it down to 12 bands. Depending on the positioning of the elements of the patch in the GUI, it either compiles or I get an error message telling me that there is an overflow. I hope this can be of some help in the further development of Axoloti. And if anyone uses it to make some music, please send me a link; I'd be very curious to know what people use it for.

To use it, put a carrier signal on the right input and a modulator signal on the left input. The 12 resulting bands are panned a bit, so any signal should result in some stereo output signal. The vca + const/i elements are meant to scale the signals; depending on your input signals you may have to adjust these.

vocode-o-matic_emulatie_12x12_banden_linear_03.axp (73.2 KB)


#25

The big benefit of compiling everything every time is the compile-time optimization. When you compile all objects separately you can only optimize them separately, not the combination. Also important is how you would do it: if you need an extra function call for every object, that can take a lot of CPU.


#26

The other issue I see with the SRAM approach is that we're simultaneously trying to read instructions and read and write data to the same memory. We could potentially avoid that conflict by trying to isolate the instruction accesses to Flash as much as possible.


#27

menu help->Library->community->tiar->FDN->D10 DelayVerb.axh
uses 90% cpu.


#28

Layout I'm working with at the moment. :fire:

I was making this stupidly huge cascaded reverb to eat all that CCM. The question is whether we can get away with pushing some data out of the core-coupled area eventually.


#29

@urklang

Sorry I don't understand the picture.

How much SRAM will be available for the new one?


#30

Here's the patch linker script from master: https://github.com/axoloti/axoloti/blob/master/firmware/ramlink.ld

So in the legacy system there's roughly 44k + 8k in SRAM regions and 48k in what's called "core-coupled" or CCMSRAM, then 8MB SDRAM.

The image is showing a new layout for the H7 that has about 512k of normal SRAM plus another 128k CCMSRAM for use with patches/dsp and then the 32M SDRAM. The patch I was running here uses more than twice what would be possible for CCM usage on the original hardware. Roughly we've got about 6x more SRAM and 4x more SDRAM to work with.

In both cases there are other regions of SRAM in use for other firmware purposes that aren't allocated to the patch. The H7 has about 1M of RAM overall.

Anyway, I have some ideas for reducing the effect of the SRAM bottleneck as we go forward. We're going to get some nice headroom without much work, but we still need to work on architecture improvements overall. There's a huge amount of Flash for us to work with. The H7 also has its nice L1 cache which is totally absent on the F4.


#31

Thank you for the details. This sounds really, really promising :slight_smile:


#32

i agree this is huge!!


#33

@urklang

Sorry about all the questions here, but all the info you're giving here is really great news :wink:

So one more here;

  • What about the cpu, compared to the old Axoloti?

#34

The new board has one of the latest revisions of the H7, which I have running at 480 MHz. The F4 in Core runs at 168 MHz.


#35

Roughly 3 times the speed, very nice!
I like the idea of having objects preloaded in Flash if it improves performance overall; the only thing I don't like is there being a lot of objects I may never use. Maybe there is another approach to consider where you could choose the factory objects or a personal list of objects... just a thought :thinking:


#36

This is definitely where I want to go eventually: giving the user the option to control which objects to store in Flash. We don't want to waste any space with things people aren't actively using, especially random community objects. SuperCollider actually has an approach kind of like this, where adding your preferred extension objects takes a build cycle, but then they're available to your patches until you decide to uninstall them.

On the other hand we do have an entire extra megabyte of flash available to us (we have 2MB over Core's 1). My intuition is that this is huge compared to the amount of data we need to store for objects if we're reasonably clever about not being wasteful when we store them.

Some objects need to be parameterized and stored more compactly. Take the basic example of mixer objects. Instead of having Mix1, Mix2, ..., Mix8 all stored separately, we want some kind of MixN that takes the number of inputs as a parameter and generates the appropriate object. This could be a behind-the-scenes optimization: at the patcher level, it could still look like the exact sizes we're used to, like Mix2, etc.
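A rough sketch of the MixN idea (names are illustrative, not the actual object library): one parameterized mixer stored once in Flash, with the fixed-size variants the patcher shows reduced to thin wrappers.

```c
/* Hedged sketch of MixN: one parameterized mixer stored once, instead
 * of Mix1..Mix8 as eight separate code copies. Hypothetical names. */
#include <stdint.h>

/* Mix n inputs with per-input 8.8 fixed-point gains. */
static int32_t mix_n(const int32_t *in, const int32_t *gain, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)in[i] * gain[i];
    return (int32_t)(acc >> 8);
}

/* The patcher could still present fixed-size objects like Mix2: */
static int32_t mix2(int32_t a, int32_t b) {
    int32_t in[]   = { a, b };
    int32_t gain[] = { 256, 256 };   /* unity gain = 1.0 in 8.8 */
    return mix_n(in, gain, 2);
}
```

Only `mix_n` needs a slot in Flash; `Mix2` through `Mix8` become a few bytes of wrapper each rather than full duplicated implementations.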


#37

Yeah that is my worry too. Pretty much every object gets edited here, so they are not factory objects anymore, they are always edits.


#38

I edit a lot of objects too, but the main reason is usually that I delete unused dials, toggles, or disps because they use up additional SRAM.
It would be super great if we could just freeze objects, meaning that all dial values become constants and feedback is disabled, so that there is no additional RAM usage.

I like the idea that you can load the objects that you'll actually need into Flash. But as it works now, I rework almost every object because almost all objects have wasteful dials and stuff...
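The freeze idea described above can be sketched in a few lines (hypothetical names, assuming a C-level view of an object's parameters): a frozen dial becomes a compile-time constant the compiler can fold, so it costs no SRAM, while an unfrozen one stays a live runtime variable.

```c
/* Sketch of "freezing" a dial: when frozen, the dial's value becomes a
 * compile-time constant, so no SRAM is spent on the parameter and the
 * compiler can fold it; unfrozen, it stays a runtime variable. */
#include <stdint.h>

#define CUTOFF_FROZEN 1

#if CUTOFF_FROZEN
enum { cutoff = 1200 };                 /* constant: folded, zero SRAM */
#else
static int32_t cutoff = 1200;           /* live dial: editable at runtime */
#endif

static int32_t apply_cutoff(int32_t x) {
    return x > cutoff ? cutoff : x;     /* object code is identical */
}
```

The object code itself doesn't change; only where the parameter lives does, which is what makes a freeze toggle cheap to offer.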

EDIT: Of course there will be more RAM and everything, but knowing myself I will probably push that to the limit too :wink:


#39

Object freezing is a great idea. I'm wondering if there might be a way that we could pull off something like this behind the scenes so the user doesn't have to manage it directly unless they want to. I'm thinking of sort of an "on-demand"/"lazy" system where non-essential features of a patch are static and consume minimal resources until the user actually interacts with them. Rough concrete example: some object with tons of knobs only represents the "possibility" of having all those knobs; the backend waits to actually consume resources for it until the user changes a value.
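The lazy-knob idea above could look something like this (a rough sketch under hypothetical names like `knob_t` and `knob_touch`, not a real Axoloti interface): an untouched knob reads a shared default and owns no per-instance RAM; storage is only allocated on first interaction.

```c
/* Rough sketch of the "lazy" idea: a knob's state is only allocated the
 * first time the user touches it; until then the object reads a shared
 * default and consumes no per-instance RAM. Hypothetical names. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int32_t *live;                 /* NULL until the user interacts */
} knob_t;

static const int32_t knob_default = 64;

static int32_t knob_read(const knob_t *k) {
    return k->live ? *k->live : knob_default;   /* no RAM cost if untouched */
}

/* Called only on the first user interaction with this knob. */
static void knob_touch(knob_t *k, int32_t value) {
    if (!k->live)
        k->live = malloc(sizeof *k->live);      /* resources consumed now */
    if (k->live)
        *k->live = value;
}
```

On a real target the allocation would come from a fixed pool rather than `malloc`, but the shape is the same: the patch only pays for what the user actually reaches for.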

It all comes back to "What is the absolute minimum work we can do to make the correct patch calculation right now?" And then being able to transition smoothly into higher resource usage as needed.


#40

Hmmm, I get this idea, but at first sight I'm a bit skeptical. One of the things I like most about Axoloti is that its limits are quite well defined. Either the RAM is full or not; either the CPU is full or not. In the first case it works reliably, in the second not at all. With what you describe there isn't this clear control anymore.

Then a manual freeze would be the low-hanging fruit, right?