Files

8 file processing tips in Python

This week I’d like to share a handful of quick tips, all related to processing files in Python.

  • When writing files <https://www.pythonmorsels.com/creating-and-writing-file-python/> (and ideally when reading them too), use a with block to auto-close your file when you’re done working with it.

  • When working with very large text files, process the file line-by-line by looping over it (thanks to the way file buffering works, this keeps only the current line and a small read buffer, typically 8KB, in memory at a time).

  • You can process large binary files chunk-by-chunk to avoid reading them into memory all at once.

  • If your text files might not be in UTF-8, be sure to specify the encoding your files use when opening them.

  • When working with untrusted files that might have extremely long lines, call the readline method with a maximum size instead of looping line-by-line. Ignore this advice if you know the untrusted file is small (due to file upload limits, for example).

  • When manipulating file paths, use pathlib.Path objects. In fact, I tend to prefer pathlib pretty much anytime I work with files in Python.

  • If you ever need to read from a file twice, you may want to use the seek method.

  • If you need to ensure you don’t overwrite a file or you want to append to the end of a file, take a look at Python’s file modes.
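
Here’s a minimal sketch that combines a few of the tips above: pathlib for the path, an explicit encoding, the "x" and "a" file modes, and seek for a second read. The scratch directory, the notes.txt file name, and the variable names are made up for this demo, so treat it as an illustration rather than a recipe.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch directory and file name, used only for this demo.
with tempfile.TemporaryDirectory() as tmp:
    notes_path = Path(tmp) / "notes.txt"

    # Mode "x" creates the file and raises FileExistsError if it already exists.
    with open(notes_path, mode="x", encoding="utf-8") as notes_file:
        notes_file.write("first line\n")

    overwrite_blocked = False
    try:
        with open(notes_path, mode="x", encoding="utf-8") as notes_file:
            notes_file.write("this would have clobbered the file\n")
    except FileExistsError:
        overwrite_blocked = True

    # Mode "a" writes at the end of the file instead of truncating it.
    with open(notes_path, mode="a", encoding="utf-8") as notes_file:
        notes_file.write("second line\n")

    # Read the same file object twice by seeking back to the start.
    with open(notes_path, encoding="utf-8") as notes_file:
        first_pass = notes_file.read()
        notes_file.seek(0)
        second_pass = notes_file.readlines()
```

The second open with mode "x" fails instead of silently replacing the file, which is the whole point of that mode. Without the seek(0) call, the second read would start at the end of the file and come back empty.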

Call the readline method with a maximum size instead

For untrusted data, we could do something like this:

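max_len = 2**16  # longest line we're willing to accept, in characters
# filename is assumed to be defined elsewhere
with open(filename) as my_file:
    # readline(n) returns at most n characters, so asking for max_len+1
    # reveals an over-long line without reading the whole line into memory.
    while line := my_file.readline(max_len+1):
        if len(line) > max_len:
            raise ValueError("Line too long")
        print("Processing", line)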
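
The same bounded-read idea applies to the binary file tip from the list above. Here’s a rough sketch, assuming we want a SHA-256 digest of a large file: the sha256_of_file helper, the 64KB chunk size, and the sample.bin file are all hypothetical choices for this example, not a recommendation.

```python
import hashlib
import tempfile
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune it for your own workload


def sha256_of_file(path, chunk_size=CHUNK_SIZE):
    """Hash a file while holding at most one chunk of it in memory."""
    digest = hashlib.sha256()
    with open(path, mode="rb") as binary_file:
        # read() returns b"" at the end of the file, which ends the loop.
        while chunk := binary_file.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.bin"
    sample_path.write_bytes(b"\x00\x01\x02" * 100_000)
    chunked_digest = sha256_of_file(sample_path)
```

The digest matches what you’d get from hashing the whole file at once, but the function never asks for more than chunk_size bytes per read.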