Wednesday, December 07, 2005

Compare performance of malloc and boost::pool

Andrei Gorine and Konstantin Knizhnik wrote a great article in Dr. Dobb's Journal titled Memory Management & Embedded Databases. In it, they examine several memory-allocation strategies and explain their advantages and disadvantages, based on their experience creating the eXtremeDB in-memory database.

As we know, it is normal for malloc/free (new/delete) to consume 10-20 percent of an application's CPU time. Sometimes it is nice to have a memory pool to handle fixed-size allocations, especially when the Boost Pool Library is available for free.
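To see why a pool can beat a general-purpose allocator for fixed-size objects, here is a minimal sketch of the free-list idea behind boost::pool. This is an illustration only, not Boost's actual implementation; the names tiny_pool and CHUNK are made up for this example, and for simplicity the chunks are never returned to the system.

```cpp
#include <cstdlib>

// Minimal sketch of a fixed-size free-list pool (the idea behind
// boost::pool, NOT its real implementation). Freed blocks are chained
// into a singly linked free list, so alloc/free are just pointer swaps
// instead of general-purpose heap bookkeeping.
template <typename T>
class tiny_pool
{
    // Each slot is either a live T or a link in the free list.
    union node { node* next; char storage[sizeof(T)]; };
    node* free_list_;
    enum { CHUNK = 1024 };

public:
    tiny_pool() : free_list_(0) {}

    T* alloc()
    {
        if (!free_list_)
        {
            // Hit the system allocator once for a whole chunk of slots,
            // then thread them all onto the free list.
            node* chunk = static_cast<node*>(
                std::malloc(sizeof(node) * CHUNK));
            for (int i = 0; i < CHUNK; ++i)
            {
                chunk[i].next = free_list_;
                free_list_ = &chunk[i];
            }
        }
        node* n = free_list_;       // pop the head of the free list
        free_list_ = n->next;
        return reinterpret_cast<T*>(n);
    }

    void free(T* p)
    {
        node* n = reinterpret_cast<node*>(p);
        n->next = free_list_;       // push back onto the free list (LIFO)
        free_list_ = n;
    }
};
```

Because free() just pushes onto the list head, the very next alloc() hands the same block straight back, which is exactly the pattern the benchmark below exercises.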

I wrote a small program that creates a small, simple data structure on the heap a million times to benchmark malloc against boost::object_pool, and the results show boost::object_pool is the clear winner. I also compared the new operator, even though the dummy data structure has no constructor; the difference between malloc and new is insignificant in this case.

The results also show something interesting that was mentioned in Raymond Chen's "Five Things Every Win32 Developer Should Know" talk at PDC 2005: if the boost::object_pool is accessed through a pointer, performance suffers significantly, though it is still far faster than malloc.

Here are the results on my laptop (Windows XP, VC 2003 release build, P4 2.4 GHz, 512 MB):

alloc_with_new_array: 26 microseconds
alloc_with_malloc_array: 11 microseconds
alloc_with_new: 365120 microseconds
alloc_with_malloc: 351586 microseconds
alloc_with_boost_pool_construct: 7852 microseconds
alloc_with_boost_pool_malloc: 3371 microseconds
alloc_with_boost_pool_malloc_use_pointer: 11751 microseconds
alloc_with_boost_pool_malloc_use_reference: 3383 microseconds


Here is the code:


#include <windows.h>   // QueryPerformanceFrequency/QueryPerformanceCounter
#include <cstdio>      // printf
#include <cstdlib>     // malloc/free
#include <boost/pool/object_pool.hpp>

struct dummy_data
{
    int a;
    int b;
};

inline void alloc_with_new(size_t size)
{
    dummy_data* p;
    for (size_t i = 0; i < size; ++i)
    {
        p = new dummy_data;
        delete p;
    }
}

inline void alloc_with_malloc(size_t size)
{
    dummy_data* p;
    for (size_t i = 0; i < size; ++i)
    {
        p = (dummy_data*)malloc(sizeof(dummy_data));
        free(p);
    }
}

inline void alloc_with_boost_pool_construct(size_t size)
{
    boost::object_pool<dummy_data> pool;
    dummy_data* p;
    for (size_t i = 0; i < size; ++i)
    {
        p = pool.construct();
        pool.destroy(p);
    }
}

inline void alloc_with_boost_pool_malloc(size_t size)
{
    boost::object_pool<dummy_data> pool;
    dummy_data* p;
    for (size_t i = 0; i < size; ++i)
    {
        p = pool.malloc();
        pool.free(p);
    }
}

inline void alloc_with_boost_pool_malloc_use_pointer(size_t size)
{
    boost::object_pool<dummy_data>* pool = new boost::object_pool<dummy_data>;
    dummy_data* p;
    for (size_t i = 0; i < size; ++i)
    {
        p = pool->malloc();
        pool->free(p);
    }
    delete pool;
}

inline void alloc_with_boost_pool_malloc_use_reference(size_t size)
{
    boost::object_pool<dummy_data> _pool;
    boost::object_pool<dummy_data>& pool = _pool;
    dummy_data* p;
    for (size_t i = 0; i < size; ++i)
    {
        p = pool.malloc();
        pool.free(p);
    }
}

inline void alloc_with_new_array(size_t size)
{
    dummy_data* p = new dummy_data[size];
    delete[] p;
}

inline void alloc_with_malloc_array(size_t size)
{
    dummy_data* p = (dummy_data*)malloc(sizeof(dummy_data) * size);
    free(p);
}

int main(int, char**)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    const size_t number = 1000000;

    // Time one allocator and print elapsed microseconds. QuadPart is a
    // 64-bit value, so cast the result before printing it with %d.
#define BENCHMARK(fn) \
    QueryPerformanceCounter(&t0); \
    fn(number); \
    QueryPerformanceCounter(&t1); \
    printf(#fn ": %d microseconds\n", \
        (int)(((t1.QuadPart - t0.QuadPart) * 1000000) / freq.QuadPart));

    BENCHMARK(alloc_with_new_array)
    BENCHMARK(alloc_with_malloc_array)
    BENCHMARK(alloc_with_new)
    BENCHMARK(alloc_with_malloc)
    BENCHMARK(alloc_with_boost_pool_construct)
    BENCHMARK(alloc_with_boost_pool_malloc)
    BENCHMARK(alloc_with_boost_pool_malloc_use_pointer)
    BENCHMARK(alloc_with_boost_pool_malloc_use_reference)

#undef BENCHMARK
    return 0;
}

7 Comments:

Anonymous Anonymous said...

Good job.
I decided to switch from standard new allocator to boost::pool

7:23 PM  
Anonymous Anonymous said...

My impression is that the author has not carefully studied the specs of boost::object_pool.

1. object_pool is not meant for interleaved allocation/deallocation, since a single deallocation is O(N). It is meant for many allocations followed by one final deallocation when the pool itself is destructed.
The impact cannot be seen in the current example, because there is only ever one entry in the free list, but with other allocation/deallocation patterns the deallocation gets really slow.

2. Random allocation/deallocation patterns can be implemented using boost::pool. In my example this is twice as fast as object_pool.

3. I don't know whether the author enabled optimization. In my case, with /Ox, the differences between using malloc, a reference, and a pointer shrink; malloc and reference perform the same.


alloc_with_new_array: 10 microseconds
alloc_with_malloc_array: 9 microseconds
alloc_with_new: 155533 microseconds
alloc_with_malloc: 143834 microseconds
alloc_with_boost_pool_construct: 14246 microseconds
alloc_with_boost_pool_malloc: 4117 microseconds
alloc_with_boost_insert_pool_malloc: 10670 microseconds
alloc_with_boost_norm_pool_malloc: 2174 microseconds
alloc_with_boost_pool_malloc_use_pointer: 4673 microseconds
alloc_with_boost_pool_malloc_use_reference: 4119 microseconds




inline void alloc_with_boost_destruct_pool_malloc(size_t size)
{
    boost::object_pool<dummy_data> pool;
    dummy_data* p;
    for (size_t i = 0; i < size; ++i)
    {
        p = pool.malloc();
    }
    // no per-object free; ~object_pool() releases everything at once
}

inline void alloc_with_boost_norm_pool_malloc(size_t size)
{
    boost::pool<> pool(sizeof(dummy_data));
    dummy_data* p;
    for (size_t i = 0; i < size; ++i)
    {
        p = (dummy_data*)pool.malloc();
        pool.free(p);
    }
}

3:37 AM  
Blogger xue.yong.zhi said...

To Anonymous:

The specs you pointed out was mentioned in another blog post: http://seclib.blogspot.com/2005/12/good-experience-with-boost-memory-pool.html

"There is one thing to remember: boost::object_pool is designed for "allocate many objects and then destroy all of them" type of scenario, so freeing memory from the pool can be very slow. For example, at the beginning I also replaced free() with boost::object_pool::free(), which make the performance even worse: boost::simple_segregated_storage::find_prev() and boost::simple_segregated_storage::nextof() start to take over all the CPU. So I do not do free anymore and simply let ~object_pool() to release the memory. This works perfect for my compiler, but may not be good for other applications."

12:54 PM  
Blogger Alex said...

I ran [mainly] your program on Linux [with clock_gettime() for timing], & got the following results:
alloc_with_new_array: 0.101566 ms.
alloc_with_malloc_array: 0.018358 ms.
alloc_with_new: 733.107805 ms.
alloc_with_malloc: 558.677912 ms.
alloc_with_boost_pool_construct: 928.229809 ms.
alloc_with_boost_pool_malloc: 660.680294 ms.
alloc_with_boost_pool_malloc_use_pointer: 662.086010 ms.
alloc_with_boost_pool_malloc_use_reference: 662.153006 ms.

9:54 AM  
Anonymous Nick said...

You might also want to try splitting the allocation and deallocation into separate loops. Otherwise, you're not really testing the pool's performance when it is forced to resize. That would also allow you to fairly compare object_pool using a single deallocation vs. pool using individual deallocations.
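Nick's split-loop idea can be sketched like this. The standard allocator is shown as a placeholder; for the real comparison you would swap in pool.malloc()/pool.free() or object_pool calls from the post. The function name alloc_then_free is made up for this sketch; keeping every pointer live in the first loop is what forces the allocator to actually grow.

```cpp
#include <cstdlib>
#include <vector>

struct dummy_data { int a; int b; };  // mirrors the struct in the post

// Phase 1 allocates everything, phase 2 frees everything -- the pattern
// that exposes pool-resize cost and per-object deallocation cost.
inline long alloc_then_free(size_t size)
{
    std::vector<dummy_data*> live(size);
    for (size_t i = 0; i < size; ++i)        // phase 1: allocate
    {
        live[i] = (dummy_data*)std::malloc(sizeof(dummy_data));
        live[i]->a = (int)i;
    }
    long sum = 0;
    for (size_t i = 0; i < size; ++i)        // phase 2: free
    {
        sum += live[i]->a;                   // touch before freeing
        std::free(live[i]);
    }
    return sum;
}
```

With an object_pool variant, phase 2 would simply be omitted and the pool destructor timed instead, which is the single-teardown case Nick wants compared fairly.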

9:00 AM  
Blogger Rohit Joshi said...

I had a very bad experience with boost::object_pool. Performance degrades drastically as the number of objects in the pool increases.

As Nick mentioned, destroying objects becomes very expensive. In fact, I had to stop using this pool.

In this blog, objects are created and deleted immediately, which does not really exercise the pool. Try creating 50,000 objects in one loop and then destroying them in another loop, and you will see what happens.

Destroying 30,000 objects took 32.150515048 seconds, versus 0.10413087 seconds with the standard delete.

See the below blogpost for performance data and source.

http://joshitech.blogspot.com/2010/05/boost-object-pool-destroy-performance.html

12:59 PM  
Anonymous Anonymous said...

I just stumbled across this thread and have two comments. Notably, boost::object_pool *IS* faster if:

1) You use the fast_pool_allocator (vs. the standard allocator), e.g. boost::fast_pool_allocator.

2) You enable basic compiler optimizations.

In many cases (as of 2012) std::allocator is faster; however, since object lifetimes are sometimes unknown, the convenience of allocating out of a pool and then freeing the whole pool can't be overstated. It's not uncommon for boost::object_pool to be ~100x faster than std::allocator.

1:30 PM  
