Thursday, February 22, 2007

python string += string

I use the string += operation all the time in my Python programs, and sometimes hold a few MB of data in memory before flushing it out to disk.

I just learnt that its implementation can make it a very slow operation. Basically, strings in Python are immutable. This means each += can destroy the old string object and create a new one holding a copy of everything accumulated so far. Imagine doing this a few hundred thousand times in each program.
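To make this concrete, here is a minimal sketch of the alternative usually suggested for this problem: collect the pieces in a list and join them once at the end. The function names build_with_concat and build_with_join are made up for this example, and how much difference it makes depends on the Python version, since some CPython releases can optimize += in simple cases.

```python
# Strings are immutable: += binds the name to a new object,
# and the old object is left unchanged.
s = 'abc'
t = s            # second reference to the same string
s += 'd'
assert s == 'abcd'
assert t == 'abc'


def build_with_concat(pieces):
    """Append with += ; each step may copy everything built so far."""
    out = ''
    for piece in pieces:
        out += piece
    return out


def build_with_join(pieces):
    """Collect the pieces in a list and build the string once at the end."""
    parts = []
    for piece in pieces:
        parts.append(piece)
    return ''.join(parts)


pieces = ['line %d\n' % i for i in range(1000)]
# Both approaches produce the same string; only the cost differs.
assert build_with_concat(pieces) == build_with_join(pieces)
```

The list only stores references to the pieces, so appending to it does not copy the text already collected.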

Today, when a simple loop seemed to take forever, I was forced to investigate, and sure enough someone had explained it on this thread on the Python forum.

But I don't want to invoke file I/O for every append operation either. Even though file writes are already buffered, I like to explicitly hold data in memory for a few string appends and then flush it to disk. This is important if you want to monitor the progress of your program through these logs deterministically, for example every 1000 steps of the loop. I wrote this simple class that makes the task very easy.

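class hugeFileWrite:
    """Collect strings in memory and append them to a file every `step` additions."""

    def __init__(self, fname, step=100):
        self.sout = ''      # in-memory buffer
        self.step = step    # number of addString calls between flushes
        self.fname = fname
        self.count = 0      # additions since the last flush

        # Create the file, or truncate it if it already exists
        f = open(fname, 'w')
        f.write('')
        f.close()

    def addString(self, smore):
        self.sout += smore
        self.count += 1
        # >= so that a flush happens on exactly every `step`-th addition
        if self.count >= self.step:
            self.flush()

    # Make sure you call flush() after your last addString
    def flush(self):
        f = open(self.fname, 'a')
        f.write(self.sout)
        f.close()

        self.sout = ''
        self.count = 0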
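
The class above still uses += on its buffer between flushes, although the buffer stays small because it is emptied every few steps. As a sketch only, here is a variant that keeps the pieces in a list and joins them at flush time, followed by a short usage example. The name ListBufferedWriter, the method add_string and the file name progress.log are all made up for this illustration.

```python
import os
import tempfile


class ListBufferedWriter:
    """Hypothetical variant: buffer pieces in a list, join them on flush."""

    def __init__(self, fname, step=100):
        self.parts = []     # pending pieces, not yet written
        self.step = step    # number of add_string calls between flushes
        self.fname = fname
        # Create the file, or truncate it if it already exists.
        with open(fname, 'w'):
            pass

    def add_string(self, smore):
        self.parts.append(smore)
        if len(self.parts) >= self.step:
            self.flush()

    def flush(self):
        """Append everything pending to the file and empty the buffer."""
        with open(self.fname, 'a') as f:
            f.write(''.join(self.parts))
        self.parts = []


# Usage: log one line per loop step, writing to disk every 1000 steps.
path = os.path.join(tempfile.mkdtemp(), 'progress.log')
writer = ListBufferedWriter(path, step=1000)
for i in range(2500):
    writer.add_string('step %d\n' % i)
writer.flush()  # write the last 500 lines still held in memory

with open(path) as f:
    line_count = len(f.readlines())
```

As with the original class, the final flush() is needed, otherwise whatever was added after the last automatic flush never reaches the file.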