Introduction

With the recent introduction of dual-core processors you can expect that your program will run 1.5 – 2 times faster on new CPUs. But this is only true if the program that you are using is specially designed to use the power of additional CPUs and do the calculations in parallel threads.

So, what do you expect when running your application on a dual-processor/dual-core computer? You think it is going to run faster than on a single-processor computer? What if the program runs slower?

Most software developers are unaware that adding several threads can actually slower down memory intensive applications. Let's see why it can be the case.

The problem: My Application does not scale

What does it mean that an application can scale? It means that the application can take advantage of all resources available to it. If there are 2 CPUs or 2 cores, the application can use them and do the same work faster.

That assumption is not really true. When multi-threaded application allocates and frees memory, it is calling into some functions that will ultimately have to do locking to prevent threads from accessing and modifying the same internal data structures at the same time by several threads.

The most common source of the bad scalability is the default memory allocation routines provided by OS. Any operating systems designer puts first the stability and reliability and only then the speed. That is why the default memory allocation routines provided by OS designed in mind that they must work in all the cases.

The typical application can even run slower after splitting the processing into several threads as the result of a heap contention. Contention occurs when two or more threads try to access data at the same time and one must wait for the other to complete before it can proceed. Contention always causes trouble; it's by far the biggest problem that one encounters on multiprocessor systems. An application or DLL with heavy use of memory blocks will slow down when run with multiple threads (and on multiprocessor systems). The use of single lock (the common solution) means that all operations using the heap are serialized. The serialization causes threads to switch context while waiting for the lock. Imagine the slowdown caused by stop-and-go at a flashing red stoplight.

Contention usually leads to context switch of the threads and processes. Context switches are very costly, but even more costly is the loss of data from the processor cache and the rebuilding of that data when the thread is brought to life afterwards. Contention is the problem that introduces slowdown in the allocation as well as free operations. Ideally we would like to have a heap with no contention and fast alloc/free.

Why custom memory allocator?

It is natural to expect that a modern dual-processor or dual-core PCs would improve the performance if not twice but at least 1.7 times comparing to single CPU PC. To our surprise, it appeared that after splitting the calculations to several parallel worker threads the application run about 1.3 times slower than with one thread. The second CPU made it actually slower.

At first it appeared that the source of the slow down is the thread synchronization, but after further debugging we found out that the worker threads almost never block each other in our source code. After Googleing the Internet, it started to become pretty clear that the problem is coming from the run-time library memory manager. And a quick look at the source code of C++ CRT library revealed that all memory allocations/deallocations are ending up in the same memory heap, which is of course is single threaded for reliability and stability purpose.

Because of that, parallel threads in any application must wait to each other until the memory manager is released. This is not a big problem on a single-processor/single threaded application, because anyway only one thread can execute at one moment. But it becomes a real issue on dual processor or dual core processor system.

After we have tried several memory allocators available on Internet the performance of our application has improved. But then the testing revealed a few problems. All the software vendors of the alternative allocators published the benchmarks of their products, but they never mention that the memory that is allocated from OS is almost never returned back during those benchmarks, and allocating/deallocating the memory from OS is the most time consuming task.

We, at ClearBytes Soft, needed the allocator which does very fast memory allocations/deallocations and actually returns memory to OS when it is no longer used. Many people can wonder why bother returning memory to OS if your application may reuse it again at some later point of execution.

The simple answer is that OS can certainly do the better memory management if it knows that the memory is not used. Second, your application can link to some third party DLLs or COM objects that can do their own memory management, or they are written on different programming language. So you may end up in the situation when the request for memory can be denied even if there is still plenty of it available but in some private memory allocation heap. Third your application may need additional memory buffers for files or IO buffers which must be allocated by OS.