
Improving Performance Of A Function In Python

I have a text file of several GB with this format 0 274 593869.99 6734999.96 121.83 1, 0 273 593869.51 6734999.92 121.57 1, 0 273 593869.15 6734999.89 121.57 1, 0 273 593868.79 673

Solution 1:

Your 'idtile's appear to be in a certain order. That is, the sample data suggests that once you traverse past a certain 'idtile' and hit the next one, there is no chance that a line with that 'idtile' will show up again. If this is the case, you can break out of the for loop once you finish dealing with the 'idtile' you want and hit a different one. Off the top of my head:

loopkiller = False
for line in open(name, mode="r"):
    element = line.split()
    if (int(element[0]), int(element[1])) == idtile:
        lst.append(element[2:])
        dy, dx = int(element[0]), int(element[1])
        loopkiller = True
    elif loopkiller:
        break

This way, once you are done with a certain 'idtile', you stop; whereas in your example, you keep on reading until the end of the file.

If your idtiles appear in a random order, maybe you could try writing an ordered version of your file first.
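If you do need that ordered copy first, one way that avoids loading several GB into memory is to delegate the work to GNU sort, which performs an external (disk-based) sort. A minimal sketch, with placeholder file names:

import subprocess

# Sort numerically on the first two whitespace-separated columns (the idtile).
# GNU sort spills to temporary files on disk, so it copes with inputs larger
# than RAM. "points.txt" and "points_sorted.txt" are placeholder names.
subprocess.check_call(
    "sort -k1,1n -k2,2n points.txt > points_sorted.txt", shell=True)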

Also, evaluating the digits of your idtiles separately may help you traverse the file faster. Supposing your idtile is a two-tuple of a one-digit and a three-digit integer, perhaps something along the lines of:

for line in open(name, mode="r"):
    element = line.split()
    if int(element[0][0]) == idtile[0]:
        if element[1][0] == str(idtile[1])[0]:
            if element[1][1] == str(idtile[1])[1]:
                if element[1][2] == str(idtile[1])[2]:
                    dy, dx = int(element[0]), int(element[1])
                else:
                    go_forward(walk)    # pseudocode: skip ahead a little
            else:
                go_forward(run)         # pseudocode: skip ahead further
        else:
            go_forward(sprint)          # pseudocode: skip even further
    else:
        go_forward(warp)                # pseudocode: skip the farthest

Solution 2:

I would suggest comparing the time taken by your full reading procedure with the time taken to just read the lines and do nothing with them. If those times are close, the only thing you can really do is change your approach (splitting your files etc.), because what you can probably optimize is the data-processing time, not the file-reading time.
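A rough sketch of that comparison, assuming your file_filter function from the question; the file name and tile id below are placeholders:

import time

def time_read_only(name):
    # Baseline: just read every line and throw it away.
    start = time.time()
    with open(name) as f:
        for line in f:
            pass
    return time.time() - start

def time_full(name, idtile):
    # Your complete procedure from the question.
    start = time.time()
    file_filter(name, idtile)
    return time.time() - start

print "read only:", time_read_only("data.txt")
print "full run :", time_full("data.txt", (0, 273))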

I also see two things in your code that are worth fixing:

  1. Use with to have your file closed explicitly. Your current solution should work in CPython, but that relies on implementation details and may not always be reliable:

with open(name) as f:
    for line in f:
        pass  # here goes the loop body

  2. You perform the transformation of a string to int twice. It is a relatively slow operation; remove the second conversion by reusing the result of the first (see the sketch after this list).
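For point 2, a minimal sketch of what reusing the converted pair could look like, assuming the loop body from the question:

with open(name) as f:
    for line in f:
        element = line.split()
        key = (int(element[0]), int(element[1]))  # convert once
        if key == idtile:
            lst.append(element[2:])
            dy, dx = key                          # reuse, no second int()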

P.S. It looks like an array of depth or height values for a set of points on the Earth's surface, with the surface split into tiles. :-)

Solution 3:

I suggest you change your code so that you read the big file once and write (temporary) files for each tile id. Something like:

def create_files(name, idtiles=None):
    files = {}
    for line in open(name):
        elements = line.split()
        idtile = (int(elements[0]), int(elements[1]))
        if idtiles is not None and idtile not in idtiles:
            continue
        if idtile not in files:
            files[idtile] = open("tempfile_{}_{}".format(elements[0], elements[1]), "w")
        files[idtile].write(line)  # line already ends with a newline
    for f in files.itervalues():
        f.close()
    return files

create_files() will return a {(tilex, tiley): fileobject} dictionary (the file objects are already closed when it returns; reopen them via their .name attribute when you need the data).

A variant that closes the file after writing each line, to work around the "Too many open files" error. This variant returns a {(tilex, tiley): filename} dictionary. It will probably be a bit slower.

def create_files(name, idtiles=None):
    files = {}
    for line in open(name):
        elements = line.split()
        idtile = (int(elements[0]), int(elements[1]))
        if idtiles is not None and idtile not in idtiles:
            continue
        filename = "tempfile_{}_{}".format(elements[0], elements[1])
        files[idtile] = filename
        with open(filename, "a") as f:
            f.write(line)  # line already ends with a newline
    return files
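A usage sketch for this variant: build the temporary files once, then read back only the tile you need. The input name and tile id below are placeholders:

tile_files = create_files("huge_input.txt")
wanted = (0, 273)

lst = []
with open(tile_files[wanted]) as f:
    for line in f:
        element = line.split()
        lst.append(element[2:])   # same columns as in the question
dy, dx = wanted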

Solution 4:

My solution is to split the large text file into many small binary files, one for each idtile. To read the text file faster, you can use pandas:

import pandas as pd
import numpy as np

n = 400000  # read n rows as one block
for df in pd.read_table(large_text_file, sep=" ", comment=",", header=None, chunksize=n):
    for key, g in df.groupby([0, 1]):
        fn = "%d_%d.tmp" % key
        with open(fn, "ab") as f:
            data = g.ix[:, 2:5].values
            data.tofile(f)

Then you can get the content of one binary file with:

np.fromfile("0_273.tmp").reshape(-1, 4)
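If named columns help downstream, the recovered array can be wrapped in a DataFrame; the column names below are only an assumption based on the sample data in the question:

import numpy as np
import pandas as pd

arr = np.fromfile("0_273.tmp").reshape(-1, 4)
# Assumed meaning of the four stored columns: x, y, height, flag.
df = pd.DataFrame(arr, columns=["x", "y", "z", "flag"])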

Solution 5:

You can avoid doing the split() and int() on every line by doing a string comparison instead:

def file_filter(name, idtile):
    lst = []
    id_str = "%d %d " % idtile
    with open(name) as f:
        for line in f:
            if line.startswith(id_str):
                element = line.split()  # split only the lines that match
                lst.append(element[2:])
                dy, dx = int(element[0]), int(element[1])
    return (lst, dy, dx)
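Called like the original function; the file name here is a placeholder and the tile id comes from the sample data in the question:

lst, dy, dx = file_filter("data.txt", (0, 273))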
