I have been using Python's mmap to access random locations in very large files fairly quickly. I recently read about madvise() and tried to use it to speed up random access to the memory-mapped file even further. The behaviour I am seeing now is confusing, so I am asking for help.

I have an array of positions (byte offsets) in the file, and at each position I want to read several consecutive lines. For each of these byte offsets I call mm.madvise(mmap.MADV_WILLNEED, ..) to tell the kernel that the data there will be accessed shortly. This speeds things up if I read a single line per offset, but if I read multiple lines the timings are strange. Example code and timings below:
import mmap
import numpy as np

source = "path/to/file"
offsets = np.array([28938058915, 12165253255, 3363686649, 2337907709, 18321471207,
                    3043986123, 29547707866, 23431772405, 14399201212, 8695070697], dtype="uint64")

def get_data_batch(source, offsets):
    # open the big file in byte mode and initiate the mmap
    with open(source, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        pagesize = 4096
        # prepare start sites for madvise call
        # start of WILLNEED has to be multiple of pagesize
        new_offsets = offsets - (offsets % pagesize)
        # tell kernel which positions are needed
        for new_offset in new_offsets:
            # preload say 20 pages of data following each offset
            mm.madvise(mmap.MADV_RANDOM)
            mm.madvise(mmap.MADV_WILLNEED, int(new_offset), 20)
        # now actually read the data
        for offset in offsets:
            # Use offset to jump to position in file
            mm.seek(offset)
            # read several lines at that position
            chunk1 = mm.readline()
            chunk2 = mm.readline()
            chunk3 = mm.readline()
            chunk4 = mm.readline()
            chunk5 = mm.readline()
            chunk6 = mm.readline()
If I run the function without madvise(), reading the first line at each offset takes quite some time, but the following lines are very fast:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
...
    22       100    1360035.0  13600.4     98.1  chunk1 = mm.readline()
    23       100      12004.0    120.0      0.9  chunk2 = mm.readline()
    24       100         94.0      0.9      0.0  chunk3 = mm.readline()
    25       100        282.0      2.8      0.0  chunk4 = mm.readline()
    26       100         63.0      0.6      0.0  chunk5 = mm.readline()
    27       100      11785.0    117.8      0.8  chunk6 = mm.readline()
If I include madvise(), I get an overall speed-up and the first line is read very quickly. But some of the following lines now take noticeable time, albeit much less than the first line did before:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
...
    22       100       4514.0     45.1      1.7  chunk1 = mm.readline()
    23       100     107462.0   1074.6     39.7  chunk2 = mm.readline()
    24       100         89.0      0.9      0.0  chunk3 = mm.readline()
    25       100      79073.0    790.7     29.2  chunk4 = mm.readline()
    26       100         91.0      0.9      0.0  chunk5 = mm.readline()
    27       100      71475.0    714.8     26.4  chunk6 = mm.readline()
Can somebody explain what's going on? And is there a way to make all readline() calls as fast as the quickest ones?
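One detail that may be relevant: the Python documentation describes the third argument of mmap.madvise() as a length in bytes, so the call in my code with 20 covers 20 bytes rather than the 20 pages the comment mentions. Below is a minimal sketch of how page-aligned ranges could be computed in bytes instead. It assumes Python 3.8+ on a platform that provides madvise() and MADV_WILLNEED; the names advise_ranges, advise_willneed and n_pages are hypothetical helpers for illustration, not part of my code above.

```python
import mmap


def advise_ranges(offsets, n_pages=20, pagesize=mmap.PAGESIZE):
    """Return (start, length) pairs in bytes, one per offset.

    start is the offset rounded down to a page boundary;
    length is n_pages whole pages expressed in bytes.
    """
    ranges = []
    for offset in offsets:
        offset = int(offset)  # plain int; also accepts numpy integers
        start = offset - (offset % pagesize)
        ranges.append((start, n_pages * pagesize))
    return ranges


def advise_willneed(mm, offsets, n_pages=20):
    """Hypothetical helper: hint that n_pages pages at each offset are needed."""
    size = len(mm)
    for start, length in advise_ranges(offsets, n_pages):
        # keep the advised range inside the mapping
        length = min(length, size - start)
        if length > 0:
            mm.madvise(mmap.MADV_WILLNEED, start, length)
```

With n_pages=20 and a 4096-byte page this would request 81920 bytes per offset. I have not measured whether this removes the slower readline() calls.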
Thanks