Thursday 16 June 2016
For a problem I'm working on, an efficient solution is to use a 32GB array, most of whose entries are unused. Fortunately this is no problem at all on 64-bit Linux using a couple of simple tricks.
The first trick is to allocate the array using the little-known (to me) MAP_NORESERVE option:
p = mmap(nullptr, 32ULL * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
(I thought this was the default on Linux, so I'm not sure why it's needed, but it is with 4.5.5-201.fc23.x86_64 at least. Oddly enough, if I omit that flag and just do two 16GB mmaps, that works too.)
Now you can write to that region and only the pages you write to will actually be allocated. Good times.
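As a minimal sketch of that allocation with error checking, something like the following should work. The helper name reserve_sparse is hypothetical, not from any library. The size is passed as a std::size_t because the product 32 * 1024 * 1024 * 1024 does not fit in a 32-bit int, and mmap reports failure with MAP_FAILED rather than a null pointer.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper (name is illustrative only): reserve `bytes` of
// zero-filled, private, anonymous memory without reserving swap for it.
// Physical pages are only allocated when they are first written.
// Returns nullptr if the kernel refuses the mapping.
inline unsigned char* reserve_sparse(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        return nullptr;  // errno holds the reason, e.g. ENOMEM
    }
    return static_cast<unsigned char*>(p);
}
```

On a 64-bit build, reserve_sparse(std::size_t(32) << 30) asks for the 32GB region. Whether it succeeds still depends on the machine's overcommit settings and address-space limits, so the null check matters.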
Now, once in a while I want to clear that memory or just some subsections of it. memset(array, 0, size) is out of the question since it would take a very long time and would also probably cause all those pages to be allocated, killing my process if I'm lucky and the entire system if I'm less lucky.
Fortunately we have madvise(array, size, MADV_DONTNEED)! For a MAP_ANONYMOUS mapping like ours, this delightful system call simply frees all the pages in the range and instructs the kernel to use fresh zero pages next time the memory is read or written. Not only is it theoretically efficient, it's fast in practice. I can touch 10,000 scattered pages in my 32GB array and then zero the entire array with one MADV_DONTNEED in 12ms total.
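Here is a rough sketch of that clearing step, assuming a private anonymous mapping like the one above. The helper name clear_sparse is made up for illustration. madvise requires the start address to be page-aligned, so clearing a subsection means passing a page-aligned pointer inside the array.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper (name is illustrative only): discard the pages backing
// [base, base + len) in a private anonymous mapping. The next access to any
// page in that range sees zeroes. `base` must be page-aligned.
// Returns true on success; on failure errno holds the reason.
inline bool clear_sparse(void* base, std::size_t len) {
    return madvise(base, len, MADV_DONTNEED) == 0;
}
```

The mapping itself stays valid after the call, so the array can be reused immediately without another mmap.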
It would be nice if tricks like these worked for pages with values other than just zeroes. For example, Firefox sometimes needs large allocations filled with 0xFF000000 for graphics buffers, or special bit patterns representing "poisoned" memory for security purposes. I think you could implement something like that using userfaultfd or mapping a file from a custom FUSE filesystem, but they would probably be significantly slower.