[Edit: This problem applies only to 32-bit systems. If your computer, your OS and your python implementation are 64-bit, then mmap-ing huge files works reliably and is extremely efficient.]
I am writing a module that amongst other things allows bitwise read access to files. The files can potentially be large (hundreds of GB) so I wrote a simple class that lets me treat the file like a string and hides all the seeking and reading.
At the time I wrote my wrapper class I didn't know about the mmap module. On reading the documentation for mmap I thought "great - this is just what I needed, I'll take out my code and replace it with an mmap. It's probably much more efficient and it's always good to delete code."
The problem is that mmap doesn't work for large files! This is very surprising to me as I thought it was perhaps the most obvious application. If the file is above a few gigabytes then I get an EnvironmentError: [Errno 12] Cannot allocate memory
. This only happens with a 32-bit Python build so it seems it is running out of address space, but I can't find any documentation on this.
My code is just
f = open('somelargefile', 'rb')
map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
So my question is am I missing something obvious here? Is there a way to get mmap to work portably on large files or should I go back to my na?ve file wrapper?
Update: There seems to be a feeling that the Python mmap should have the same restrictions as the POSIX mmap. To better express my frustration here is a simple class that has a small part of the functionality of mmap.
import os
class Mmap(object):
def __init__(self, f):
"""Initialise with a file object."""
self.source = f
def __getitem__(self, key):
try:
# A slice
self.source.seek(key.start, os.SEEK_SET)
return self.source.read(key.stop - key.start)
except AttributeError:
# single element
self.source.seek(key, os.SEEK_SET)
return self.source.read(1)
It's read-only and doesn't do anything fancy, but I can do this just the same as with an mmap:
map2 = Mmap(f)
print map2[0:10]
print map2[10000000000:10000000010]
except that there are no restrictions on filesize. Not too difficult really...
See Question&Answers more detail:os