
Numpy.memmap: Bogus Memory Allocation

I have a python3 script that operates on numpy.memmap arrays. It writes an array to a newly generated temporary file located in /tmp:

import numpy, tempfile

size = 2 ** 37

Solution 1:

There's nothing 'bogus' about the fact that you are generating 10 TB files.

You are asking for arrays of size

2 ** 37 * 10 = 1374389534720 elements

A dtype of 'i8' means an 8 byte (64 bit) integer, therefore your final array will have a size of

1374389534720 * 8 = 10995116277760 bytes

or

10995116277760 / 1E12 = 10.99511627776 TB


If you only have 250 GB of free disk space then how are you able to create a "10 TB" file?

Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.

For example, on my Linux machine I'm allowed to do something like this:

# I only have about 50GB of free space...
~$ df -h /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/sdb1      ext4  459G  383G   53G  88% /

~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s

# ...but I can still generate a sparse file that reports its size as 10 TB
~$ ls -lah sparsefile
-rw-rw-r-- 1 alistair alistair 10T Dec  1 21:17 sparsefile

# however, this file uses zero bytes of "actual" disk space
~$ du -h sparsefile
0       sparsefile

Try calling du -h on your np.memmap file after it has been initialized to see how much actual disk space it uses.
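The same `ls` vs `du` discrepancy can be reproduced from Python. A minimal sketch (file size chosen arbitrarily), assuming a filesystem that supports holes, such as ext4 or xfs:

```python
import os
import tempfile

# Create an empty file and extend it to 1 GiB without writing any data.
# On filesystems that support holes this produces a sparse file, exactly
# like the dd/seek trick above.
fd, path = tempfile.mkstemp()
os.close(fd)
os.truncate(path, 2 ** 30)

st = os.stat(path)
logical = st.st_size           # what ls -l reports: 1073741824
physical = st.st_blocks * 512  # what du reports (POSIX counts 512-byte blocks)

print(logical, physical)  # physical stays at 0 (or a few blocks) until data is written
os.remove(path)
```

This is exactly the state a freshly created np.memmap file starts in: a large logical size, near-zero physical allocation.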

As you start actually writing data to your np.memmap file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a Bus error. This means that if you needed to write < 250GB of data to your np.memmap array then there might be no problem (in practice this would probably also depend on where you are writing within the array, and on whether it is row or column major).


How is it possible for a process to use 10 TB of virtual memory?

When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, then not only can the virtual memory exceed the total amount of RAM available, but it can also exceed the total physical disk space on your machine.
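You can see this directly with the standard mmap module: mapping a sparse file reserves the full range of virtual addresses at once, even though nothing has been written. A short sketch (the 256 MiB size is arbitrary):

```python
import mmap
import os
import tempfile

# Map a sparse 256 MiB file: the kernel reserves that much virtual address
# space immediately, but no RAM and no disk blocks are committed until
# individual pages are actually touched.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 2 ** 28)

mm = mmap.mmap(fd, 2 ** 28)
full_len = len(mm)   # the whole 268435456-byte range is addressable
first = mm[0]        # holes read back as zeros
mm[0] = 1            # writing dirties a single page, not the whole file

mm.close()
os.close(fd)
os.remove(path)
```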


How can you check whether you have enough disk space to store the full np.memmap array?

I'm assuming that you want to do this programmatically in Python.

  1. Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is os.statvfs:

    import os
    
    def get_free_bytes(path='/'):
        st = os.statvfs(path)
        return st.f_bavail * st.f_bsize
    
    print(get_free_bytes())
    # 56224485376
  2. Work out the size of your array in bytes:

    import numpy as np
    
    def check_asize_bytes(shape, dtype):
        return np.prod(shape) * np.dtype(dtype).itemsize
    
    print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
    # 10995116277760
  3. Check whether 2. > 1.
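Combining the two helpers into a single check might look like this (the function name is hypothetical):

```python
import os
import numpy as np

def fits_on_disk(shape, dtype, path='/'):
    """Return True if an array of the given shape/dtype would fit into the
    free space of the filesystem containing `path`."""
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_bsize
    needed_bytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return needed_bytes <= free_bytes

print(fits_on_disk((1024, 1024), 'i8'))     # 8 MB -- almost certainly True
print(fits_on_disk((2 ** 37 * 10,), 'i8'))  # ~11 TB -- False on most machines
```

Note that this is only a snapshot: other processes can consume the free space between the check and your writes, which is what motivates the pre-allocation approach below.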


Update: Is there a 'safe' way to allocate an np.memmap file, which guarantees that sufficient disk space is reserved to store the full array?

One possibility might be to use fallocate to pre-allocate the disk space, e.g.:

~$ fallocate -l 1G bigfile
~$ du -h bigfile
1.1G    bigfile

You could call this from Python, for example using subprocess.check_call:

import subprocess
import numpy as np

def fallocate(fname, length):
    return subprocess.check_call(['fallocate', '-l', str(length), fname])

def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
    nbytes = np.prod(shape) * np.dtype(dtype).itemsize
    fallocate(fname, nbytes)
    return np.memmap(fname, dtype, *args, shape=shape, **kwargs)

mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))

print(mmap.nbytes / 1E6)
# 8.388608

print(subprocess.check_output(['du', '-h', 'test.mmap']))
# 8.0M    test.mmap

I'm not aware of a platform-independent way to do this using the standard library, but there is a fallocate Python module on PyPI that should work for any Posix-based OS.
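That said, since Python 3.3 the standard library does expose this primitive on POSIX systems as os.posix_fallocate, which avoids shelling out to the fallocate binary (it is still not fully portable; e.g. macOS does not provide posix_fallocate). A stdlib-only sketch of the same idea, under a hypothetical name:

```python
import os
import numpy as np

def safe_memmap_alloc_stdlib(fname, dtype, shape):
    """Reserve the disk blocks up front with os.posix_fallocate, then map
    the file; raises OSError immediately if there is not enough space."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    fd = os.open(fname, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, nbytes)
    finally:
        os.close(fd)
    # mode='r+' keeps the preallocated blocks; mode='w+' would truncate
    # the file and make it sparse again.
    return np.memmap(fname, dtype=dtype, mode='r+', shape=shape)

arr = safe_memmap_alloc_stdlib('test_stdlib.mmap', np.int64, (1024, 1024))
print(arr.nbytes)  # 8388608
```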

Solution 2:

Based on the answer of @ali_m I finally came to this solution:

# must be called with an argument giving the array size in GB
import sys, numpy, tempfile, subprocess

size = (2 ** 27) * int(sys.argv[1])
tmp_primary = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp_primary.name, dtype = 'i8', mode = 'w+', shape = size)
tmp = tempfile.NamedTemporaryFile('w+')
check = subprocess.Popen(['cp', '--sparse=never', tmp_primary.name, tmp.name],
                         stderr=subprocess.PIPE)
stdout, stderr = check.communicate()
if stderr:
    sys.stderr.write(stderr.decode('utf-8'))
    sys.exit(1)
del array
tmp_primary.close()
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
array[0] = 666
array[size-1] = 777
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
      format(tmp.name, len(array), array[0], array[size-1]))
while True:
    pass

The idea is to copy the initially generated sparse file to a new, non-sparse one; cp with the option --sparse=never is used for this.

When the script is called with a manageable size parameter (say, 1 GB), the array is mapped to a non-sparse file. This is confirmed by the output of the du -h command, which now shows a size of ~1 GB. If there is not enough disk space, the script exits with the error:

cp: '/tmp/tmps_thxud2': write failed: No space left on device
