Atomic instructions

Not all processors have atomic instructions!

Recently we were involved in improving boot-up times of a Linux based platform. We noticed that the boot-up time was much longer than expected based on the processor speed.
The only way to know what is happening is profiling the start-up. This is not always easy on limited embedded platforms like these! After getting some profiling tools up-and-running on the platform, we immediately detected an issue. About half of the boot time was spend into Linux exception handlers!

The busy exception handler was the one for handling invalid instruction. How is this possible?
Some further analysis showed us that the instructions being emulated in the kernel were the atomic swap instructions. It seemed that this version of the processor did not had support for atomic instructions!

Why would you need atomic instructions anyway?
Once you are running multi-threaded systems (the normal embedded case), there is a need to protect shared resources for simultaneous access. Without such protection mechanisms data can corrupt, something you do not want. Further such errors are not easy to track as they lead to invalid behavior long time after the corruption occurred.

In the good old times, there was no protection between user and kernel space and there were no multiple processors active in the system. At that time locking could easily be done by masking all interrupts. In modern operating systems, applications are not allowed to do this. An easy solution is to provide a special atomic instruction which is guaranteed not to be interrupted (a swap or read-modify-write instruction). As such there is no need to jump to the kernel space which takes an large overhead (at least when no blocking is introduced by the lock).

How could we solve the problem?
We detected that one lock in the system was responsible for a high percentage of the exceptions. We replaced this lock with the original semaphore implementation of Edsger Dijkstra (the inventor of the semaphore around 1965). At that time there were not yet atomic instructions available! The whole implementation did not need a roundtrip to the kernel. However it can not force a reschedule or any kernel access. Thus the only solution was to introduce sleeps if contention on the lock occurred (and thus give the processor to the thread having the lock). By adapting the lock so it was unfair (the current busy thread takes the lock even if some other thread requested it), the number of actual sleeps could be limited.

At the end the boot-up time was almost divided by two!

Lessons learned:

Take care when selecting processors for your embedded system
In protected, multi-threaded environments atomic instructions are a must
Atomic instructions are used in the libraries to handle locking primitives
Linux solves the problem for you by emulating these instructions if the processor lacks it