Multicore and Amiga: Present and Future

Symmetric Multi-Processing (SMP) has been on the wishlist of AmigaOS users for quite some time now, and while progress has been made, we’re still not there yet.

To explain the progress, let us first look at the concepts involved, and then point out where we stand in the overall effort.

Concept: Threads, Cores and Processors

When looking at SMP support, we need to take actual processor technology into account. Older implementations used separate physical processors for SMP: one processor could execute one instruction stream, and to achieve parallel execution, you would plug in more processors. Not surprisingly, this approach is rather limited in the number of processors that can communicate with each other (although there were massively parallel machines that used a complex interconnection network for communication).

Later on, chip manufacturers started adding multiple so-called “cores” to one physical processor, each able to execute its own instruction stream.

A very recent development is the ability of such individual “cores” to execute more than one instruction stream in parallel. We call those instruction streams (hardware) threads. This technology is used in CPUs such as the Intel Core i7 (where it is named Hyper-Threading), or the Freescale e6500 core, which is used in the T-series CPUs from Freescale (up to the T4240, which has 12 physical cores with two threads each).

How to schedule tasks in SMP configurations

When looking at how to schedule Exec tasks and processes on an SMP system, we need to look at how much overhead is involved in scheduling. Currently, in single-CPU environments, Exec periodically interrupts execution and evaluates whether it should pick another task to run. Doing that with high frequency, say 20 or 30 times per second, creates the illusion of multiple tasks running in parallel.

Whether a new task should run is evaluated based on the task’s priority, or on whether the task has something important to do. Of course, this evaluation takes time as well, and if it takes too long, the machine will spend more time deciding what to run than actually running it.
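As an illustration, here is a minimal sketch in C of such an evaluation. The names (struct Task, pick_next_task) are made up for this example; the real Exec keeps its tasks in system lists and differs in detail.

    #include <stddef.h>

    /* Illustrative task record; the real Exec task structure differs. */
    struct Task {
        struct Task *next;   /* next task in the ready list       */
        signed char  pri;    /* priority: higher value runs first */
    };

    /* Scan the ready list and return the highest-priority candidate.
     * Called from the periodic scheduling interrupt; if the result
     * differs from the currently running task, a switch happens.   */
    struct Task *pick_next_task(struct Task *ready_list)
    {
        struct Task *best = NULL;

        for (struct Task *t = ready_list; t != NULL; t = t->next) {
            if (best == NULL || t->pri > best->pri)
                best = t;
        }
        return best;   /* NULL means nothing is ready: idle */
    }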

This gets worse as more execution units are added to the system: if the evaluation also needed to ask other CPU cores whether they want to run something, the time for this processing would rise tremendously, to the point where adding more CPUs to the system would actually slow it down.

Therefore, the scheduling of Exec tasks and processes will remain something each CPU does for itself, even in a multi-core system. This ensures that a reasonable amount of time is spent on scheduling.

So, how do multiple cores, threads and/or CPUs come into play?

We can easily monitor how loaded a core is by checking how many of the tasks that are ready actually get time to run. This is called the “load” of the CPU. If the core spends a lot of time waiting for tasks to become available, the load is low. If a lot of tasks are waiting for their turn to run, the load is high.
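A hedged sketch of what such a load metric could look like; the counters, names and sampling scheme are assumptions for illustration, not the actual kernel code.

    /* Illustrative per-core load estimate: the fraction of recent
     * scheduler ticks on which at least one task was ready to run. */
    struct CoreLoad {
        unsigned ticks_busy;   /* ticks where the ready list was non-empty */
        unsigned ticks_total;  /* ticks observed so far                    */
    };

    /* Called once per scheduling tick on the core itself. */
    void load_sample(struct CoreLoad *cl, int ready_tasks)
    {
        cl->ticks_total++;
        if (ready_tasks > 0)
            cl->ticks_busy++;
    }

    /* Load as a percentage: 0 means mostly idle, 100 fully busy. */
    unsigned load_percent(const struct CoreLoad *cl)
    {
        return cl->ticks_total ? (cl->ticks_busy * 100u) / cl->ticks_total : 0;
    }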

In a multi-core system, some cores will have a high load and some a low one. To balance this, the system looks at the load of each individual core and determines whether the load needs to be balanced out among the CPUs. This balancing is triggered by the individual schedulers when they notice that their current workload is too big for them to handle. In this case, the overloaded core will migrate some of its tasks to other cores until it can handle its workload again.
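To make the balancing step concrete, here is a sketch of such a push-style migration, reusing the struct Task from the earlier sketch. The core count, the threshold and the strategy of picking the least loaded peer are all assumptions.

    #define NCORES    4    /* assumed number of cores                 */
    #define OVERLOAD 80    /* load percentage that triggers migration */

    struct Core {
        struct Task *ready_list;   /* this core's private ready list */
        unsigned     load;         /* most recent load_percent()     */
    };

    /* Run by an overloaded core's scheduler: hand the head of our
     * ready list to the least loaded peer. A real kernel must lock
     * both lists while manipulating them.                           */
    void balance(struct Core *cores, int self)
    {
        if (cores[self].load < OVERLOAD || cores[self].ready_list == NULL)
            return;                          /* nothing to balance */

        int target = self;
        for (int i = 0; i < NCORES; i++)     /* find the least loaded core */
            if (cores[i].load < cores[target].load)
                target = i;
        if (target == self)
            return;                          /* everyone is equally busy */

        struct Task *t = cores[self].ready_list;   /* detach head...      */
        cores[self].ready_list = t->next;
        t->next = cores[target].ready_list;        /* ...enqueue remotely */
        cores[target].ready_list = t;
    }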

Scheduling domains

In the previous section, we talked about balancing. Let’s take a closer look at this. A task running on one core uses that core’s resources. An important resource is the caches: a level 1 cache (L1) holds copies of memory data very close to the CPU’s execution units, so accessing this data is nearly instantaneous. A level 2 cache (L2) sits below the level 1 cache, is usually bigger, and is slightly slower to access than the L1. Some systems even have higher-level caches below the L2 cache.

In a typical multi-core system, the L1 is exclusive to one core, while the L2 is shared among several cores. However, as in the scheduling example, more cores accessing one L2 means more communication overhead, because one core might have to wait for another core to finish accessing the L2. The more cores access the same L2, the more likely such a stall becomes. Therefore, some processors with a high number of cores have multiple L2 caches, with one group of cores sharing one L2 while another group shares another. We call those groups “clusters”. As an example, the T4240 has 12 cores, grouped into three clusters of four cores each, with each cluster sharing its own L2 and each core having a separate L1. In addition, there are cores that can run multiple threads at once; these threads even share the L1.

What we see here is a hierarchy of execution units. Clearly, moving a task from one thread to another thread on the same core is a rather cheap operation: the migrated task will not suffer from misses in the L1, since it still uses the same one. On the other hand, migrating from one core to another means that the new core has no access to the previous core’s L1, so the move comes at a slight initial performance cost. Similarly, migrating to a new cluster means the task loses the benefit of both the L1 and the L2, resulting in an even larger initial performance cost.

As you can see, moving a task to another execution unit incurs different degrees of performance penalty. Note that these are only initial penalties: the caches will gather the necessary data over time, so that after a while they are fully effective again for the migrated task.

This hierarchy is in essence the hierarchy of “scheduling domains”. Scheduling domains define a cost associated with moving a task from one execution unit to another. When balancing the load, the system will strive to minimize the cost of movement to ensure minimal performance loss. Migrating inside the current scheduling domain will always be cheaper than migrating to a core outside of it; however, if all cores in the task’s scheduling domain are overloaded, the higher cost will have to be paid.
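This cost hierarchy can be captured in a few lines. The following sketch uses a made-up topology record and arbitrary cost constants; only their ordering (same core < same cluster < cross-cluster) reflects the description above.

    /* Which execution unit a task sits on, in the T4240-style
     * topology described above (clusters of cores of threads). */
    struct CpuId {
        int cluster;   /* which L2 cluster                    */
        int core;      /* which physical core in the cluster  */
        int thread;    /* which hardware thread in the core   */
    };

    enum { COST_SAME_CORE = 1, COST_SAME_CLUSTER = 10, COST_CROSS_CLUSTER = 100 };

    int migration_cost(struct CpuId from, struct CpuId to)
    {
        if (from.cluster != to.cluster)
            return COST_CROSS_CLUSTER;   /* loses both L1 and L2 contents     */
        if (from.core != to.core)
            return COST_SAME_CLUSTER;    /* loses the L1, keeps the shared L2 */
        return COST_SAME_CORE;           /* another thread: same L1 and L2    */
    }

The balancer would then prefer, among all cores that have capacity, the one with the lowest migration_cost from the task’s current position.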

Pitfalls: The dreaded Forbid

Adding this functionality to AmigaOS is not without problems, naturally. There is one central part of the OS that has been around since its very beginning, and has been widely misused and misunderstood: the Exec function Forbid (and its counterpart, Permit). The problem with these functions is both semantic and practical: Forbid has been documented both as disabling task switching and as forcing the system to become single-threaded.

On a single-core system, these two descriptions amount to the same thing, but under SMP they are completely different. Forbidding task switching in an SMP system still leaves a number of tasks running in parallel: no CPU core will switch tasks, since it is forbidden to do so, but the system is not running single-threaded at all.

This misunderstanding has led to a lot of misuse. Forbid has often been used to protect critical data structures from being tampered with by other tasks. Of course, this works on a single-core system: prohibit task switching, and you can be sure no one else will get to access that data. In an SMP situation, however, this does not stop other cores from accessing it.
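Here is the pattern in question. Forbid() and Permit() are the real Exec calls; the list itself is a made-up example. On a single core this is safe, because no other task can run between the two calls; under SMP, another core can modify shared_list at the very same moment.

    #include <proto/exec.h>

    struct Item { struct Item *next; int value; };
    static struct Item *shared_list;

    /* Classic single-core idiom: with task switching forbidden,
     * no other task can interleave with this update.
     * On an SMP system this no longer protects anything by itself. */
    void push_item(struct Item *n)
    {
        Forbid();
        n->next = shared_list;
        shared_list = n;
        Permit();
    }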

So what’s the solution to this? Keeping in mind that Forbid is mostly used to protect critical sections of code and data, the SMP-enabled kernel will treat it just like that: any core issuing a Forbid call will, simply put, write its number into a field somewhere in the system. It will do this “atomically”, meaning that the memory is only modified if no one else is competing for it. If it succeeds, it can proceed into the critical section. On the other hand, if there is already some other core in the critical section, or the write did not succeed, it will repeat this process until it succeeds, basically stalling the competing core until it is successful. This ensures that only one core at any time can enter the Forbid state, and any subsequent core that wants to forbid will have to wait for its turn.
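A sketch of this lockout using C11 atomics; the real kernel would use the CPU’s own primitives (such as lwarx/stwcx. on PowerPC), and nesting of Forbid calls is left out for brevity.

    #include <stdatomic.h>

    /* -1 means the Forbid “slot” is free; otherwise it holds the
     * number of the owning core. */
    static atomic_int forbid_owner = -1;

    void smp_forbid(int my_core)
    {
        int expected;
        do {
            expected = -1;   /* only succeed if nobody holds the slot */
        } while (!atomic_compare_exchange_weak(&forbid_owner,
                                               &expected, my_core));
        /* From here on, we are the only core in the Forbid state. */
    }

    void smp_permit(void)
    {
        atomic_store(&forbid_owner, -1);   /* release the slot */
    }

The compare-and-exchange is the “atomic write” from the description: it only stores my_core if the field still contains -1, and reports failure otherwise, so a competing core simply spins until its store goes through.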

Of course, this is a simplified description of the process. Some other things are necessary to ensure fairness and equal distribution of access, to prevent one core from hogging this “lock” for too long, and so on, but the basic process works like this.
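One generic technique for this kind of fairness is a ticket lock: each core draws a ticket and enters strictly in the order the tickets were drawn, so no core can starve the others. This is a common approach, not necessarily the one AmigaOS uses.

    #include <stdatomic.h>

    struct TicketLock {
        atomic_uint next_ticket;   /* ticket handed to the next waiter  */
        atomic_uint now_serving;   /* ticket currently allowed to enter */
    };

    void ticket_lock(struct TicketLock *l)
    {
        unsigned me = atomic_fetch_add(&l->next_ticket, 1);  /* draw a ticket */
        while (atomic_load(&l->now_serving) != me)
            ;                                                /* wait our turn */
    }

    void ticket_unlock(struct TicketLock *l)
    {
        atomic_fetch_add(&l->now_serving, 1);   /* serve the next waiter */
    }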

Where are we now?

The development of SMP support has been separated into several distinct steps. The first step was to rewrite the scheduler in C for easier maintainability. In the very end, this step might be reversed again, rewriting the then SMP-capable scheduler back into assembly language. The second, more fundamental step was to decouple the scheduler from its current data structures. As you might know, ExecBase contains a number of lists for tasks that are ready, or waiting for a signal.

This has now been achieved. The current development build uses a scheduler that no longer uses the original AmigaOS data structures, but a structure that is replicated for each core.
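What such a replicated structure might look like, as a sketch: the field names are assumptions, since the actual structures are not public, and struct Task is the one from the earlier sketches.

    #define NCORES 4   /* assumed number of cores */

    /* Per-core scheduler state, replacing the single system-wide
     * lists formerly kept in ExecBase. Each core's scheduler touches
     * only its own entry, so the hot path needs no cross-core locks. */
    struct CoreScheduler {
        struct Task *ready_list;   /* tasks ready to run on this core */
        struct Task *wait_list;    /* tasks waiting for a signal      */
        struct Task *running;      /* the task currently executing    */
        unsigned     load;         /* load estimate for balancing     */
    };

    static struct CoreScheduler schedulers[NCORES];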

The next step is to have each core in the development system (currently, the X1000) run the scheduler. Test code will then start tasks on the different cores to see how they behave. We have already experimented with this and the results look promising. The tests basically showed that the lockout mechanism for Forbid works as planned.

As a final step, the load balancing will be introduced, which will complete the first implementation of SMP support in AmigaOS.

Future plans

There are several possibilities to choose from once the first implementation is done. First of all, the scheduling algorithm is still the same as the one used by the current Exec: a priority-modified round-robin. Naturally, this is an algorithm that can be improved upon. There are several other implementations that come to mind, like the O(1) scheduler, multilevel feedback queues, or even the Brain Fuck Scheduler.
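To give a taste of the O(1) approach: one ready queue per priority plus a bitmap of non-empty queues turns picking the next task into a single find-first-set operation instead of a list scan. The sketch below is generic, with made-up names, reuses the earlier struct Task, and assumes bit 0 marks the highest priority.

    #include <strings.h>   /* ffs() */

    #define NPRIO 32

    struct RunQueue {
        unsigned     bitmap;        /* bit p set: queue p is non-empty */
        struct Task *queue[NPRIO];  /* one FIFO per priority level     */
    };

    struct Task *pick_next_o1(struct RunQueue *rq)
    {
        if (rq->bitmap == 0)
            return NULL;                  /* nothing ready: idle */

        int p = ffs(rq->bitmap) - 1;      /* lowest set bit = best priority */
        struct Task *t = rq->queue[p];
        rq->queue[p] = t->next;           /* dequeue the head */
        if (rq->queue[p] == NULL)
            rq->bitmap &= ~(1u << p);     /* last one: clear the bit */
        return t;
    }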

The balancing algorithms are also candidates for improvement. The system might record scheduling data and compute typical usage profiles that can be pre-selected, determining, for example, when to balance and how aggressively balancing is carried out.