[Python] Fastest way to recursively list files of specific type

instinct-vfx · March 28, 2017, 9:22pm

Hi everyone,

i am currently writing a tool that is gathering all max files in a given directory including subfolders. The current approach works nice and i already tweaked it a little and got it quite a bit faster it still does not really scale that well. Here’s what i got currently:


    def gatherMaxFiles(self, path):
        EXCLUDES = ["SOUND",                    ]
        files = []
        for root, subFolders, filenames in os.walk(path):
            for exclude in EXCLUDES:
                if exclude in subFolders:
                    subFolders.remove(exclude)
            for filename in fnmatch.filter(filenames, '*.max'):
                files.append(os.path.join(root, filename))
        return files

EXCLUDES is a list of subfolders to ignore completely (as parsing the render folder for max files is pointless)

Given the size of the folders i have to parse this still seems awfully slow (up to half an hour).

Any hints on how to speed that up? Go native?

Kind regards,
Thorsten

ArYeS · March 28, 2017, 9:45pm

I’m doing for local files same way as you do, but it doesn’t take a lot of time locally

Do you list these files on remote location (server on lan maybe?) or are you running that locally?
For server I use crontab job on linux server to generate index of all files on server each minute, and then I just go through that index on remote pc with python. It’s much faster this way over LAN.

instinct-vfx · March 28, 2017, 9:45pm

Thanks for the input. Yes i am doing this remotely. The problem is rather the amount of directories/files being parsed i am afraid. It will be a lot faster locally, but generating an index won’t really help as i am only doing this once. Let me fill in some details on what i am doing.

Due to issues to do with the way we have to handle some of our projects i have to re-link a lot of max files. It boils down to the following:

[ul]
[li]Scan project folder for max files
[/li][li]Gather all asset-paths in all max files
[/li][li]Present all unique locations (as in server/share = location) to the user and allow to edit locations as needed
[/li][li]Iterate over all max files and re-link assets
[/li][/ul]

This is done once only. If this takes long than that is not a show stopper, but obviously i want that to be as fast as possible. To give a rough idea of the
amount of folders to be processed, they range from around 3 to 4 Million files in 30k to 40k folders (size does not matter i guess).

If there are no faster parsing/iterating ways then i would say i have to try and invert the process for inclusion rather than exclusion to reduce the amount of folders and files to be parsed.
Was hoping to skip that as that seems a tad cumbersome.

Regards,
Thorsten

Theodox · March 28, 2017, 9:45pm

If you’re doing this remotely, a little python server on the host machine would make it go a lot faster. Plus you could add a filewatcher (I think we have a recipe for doing this on windows here on the site) that updates it incrementally so there’s no big lag in the future. This would be a good application for RPYC or ZMQ.

If the data doesn’t change frequently, do it once and put the results in a DB then update it incrementally on save – 3-4 million files is a lot to touch under any circumstances, it makes sense to do it in as small a chunk as possible.

I take it this folder is not under conventional source control, right?

instinct-vfx · March 28, 2017, 9:45pm

Well as hinted above this happens only once, so there will be no increments to speed things up. The reason this has to happen is, because projects come back from backup and have to be re-linked to a different share. This is on our main production storage, so it’s not exactly “fire up a python instance”, nor under version control.

As this only happens once it can take a bit, as it does need manual intervention after the scan and then trigger the next step i want it to be as fast as possible still.

I made it a LOT faster by inverting the rules as hinted above. I would prefer for that to be more dynamic than my current implementation.

Currently i grab all global folders that can contain the scenes, then i find out how many scenes and subsequently shots there are and then parse only the subfolders that can contain max files. I would feel better if this wasnt as hardcoded as it currently is. Anyone has a recipe for translating a filterstring along the lines of


/server/share/scene_***/shot_***_***/

?
Basically it should iterate for each scene AND for each shot. But it should be open for different kinds of setups.

I think i want to ask if there is a general framework for setting up folder rules to be parsed. I guess i just don’t know the name for what i want hehe.

Regards,
Thorsten

Theodox · March 28, 2017, 9:45pm

glob?

instinct-vfx · March 28, 2017, 9:45pm

I feel kinda stupid for using glob already and not thinking it through. Is there something similar that works with full regex? While overkill for what i do atm it may come in handy for other things.

Regards,
Thorsten

Rob_Galanakis · March 28, 2017, 9:45pm

Can you make this asynchronous (through gevent)? Basically what your code is doing is:
Python doing work-> Ask filesystem (python waiting) -> Filesystem returns -> Python doing work -> (repeat)
What you could be doing is:
Python doing work->Ask filesystem -> Python doing work -> Ask filesystem -> Filesystem returns -> Python doing work -> (interleave ad nauseum)
You’d want to use yields as much as possible in this case as well.

You could also try parallelizing this, using multiprocessing.Pool.map to map over your root folders, and assemble everything into a list (I’d not use a shared list- I’d let each one return its own list and assemble them at the end, to avoid synchronization slowdowns). It won’t be perfect because you’ll be IO bound, but it should be faster than single-threading.

On the microoptimization front:
-Your exclusion logic is slow, but it appears you’ve already improved that (doing lots of unnecessary iteration).
-Cache ALL your function calls. Don’t do ‘files.append’. Do ‘filesappend = files.append’ earlier and then call ‘filesappend(os.path.join(root, filename))’
-Actually, get rid of function calls when you can. Use root + ‘\’ + filename (% formatting instead? What’s faster?) instead of os.path.join, for this very particular instance (I am breaking a golden rule here…)
You may be able to cut a few minutes off just through this.

Overall though walking over a few million files is going to take a while! You may be able to go lower-level and ask the hardware for a big chunk of stuff at the same time, but through python, I don’t know if there’s anything else you can do.

instinct-vfx · March 28, 2017, 9:45pm

Thanks Rob. Have not thought about multithreading (which makes me again feel a tad stupid heh). I made things a LOT quicker by reducing the amount of folders to be parsed a LOT. Parsing renderfolders does not really make much sense. So i guess i will go a mixed route. This will make things a tad less generalized, but still cover most of the cases. And for the odd “but i need to parse everything randomly” i can provide a switch, and that can take longer than if it only happens seldomly.

Thanks again for the input. I will try and add the optimizations and report how it turned out!

Thorsten

Theodox · March 28, 2017, 9:45pm

Hey Rob, do you have any cost data on the differential between cached function calls and .access? It’s the kind of trick I’ve always despised in, eg, Lua because it’s purely driven by machine needs instead of readability or robustness.

madyasiwi · March 28, 2017, 9:45pm

In Windows, using Win32 functions FindFirstFile and FindNextFile is considerably faster than os.listdir or its derivatives. Especially if your folder contains alot of files/subdirs (10000+) in it. Because you can avoid the creation intermediate list and do early filtering.

import re
import ctypes
from ctypes import Structure, c_int16, byref
from ctypes.wintypes import HANDLE, WIN32_FIND_DATAA as FindData


_findFirstFile = ctypes.windll.kernel32.FindFirstFileA
_findNextFile = ctypes.windll.kernel32.FindNextFileA
_findClose = ctypes.windll.kernel32.FindClose


_ignoredSubdirs = re.compile( "\\.{1,2}$" ) # Ignore '.' & '..' subdir entries


def dirIterator( globstr ):
    
    findData = FindData()
    handle = _findFirstFile( globstr, byref( findData ) )
    
    if handle == -1:
        return
    
    while True:
        
        if findData.dwFileAttributes & 16:
            # This is a subfolder
            if not _ignoredSubdirs.match( findData.cFileName ):
                yield findData.cFileName, True
        else:
            # This is a file
            yield findData.cFileName, False
        
        if not _findNextFile( handle, byref( findData ) ):
            _findClose( handle )
            return


for entry, isDir in dirIterator( "C:\\*.*" ):
    print ( "[%s]" if isDir else "%s" ) % entry

For linux, there should be equivalent functions set that can be used similarly.

instinct-vfx · March 28, 2017, 9:45pm

Very interesting madyasiwi! I’ll definitely i give that a try given that we’re a windows shop.

Rob_Galanakis · March 28, 2017, 9:45pm

It can save a few seconds- a simple benchmark can show you how much. Obvioiusly, cached access is about twice as fast as non-cached, but it’s a fraction of your total time. I don’t think it makes the code too unreadable and if you have, say, 10 million total accesses, it can save a few seconds (about 2 seconds for 10 million on my mediocre CPU).

As for os.path.join: a + ‘\’ + b can be faster than ‘%s\%s’ % (a, b), especially if you can’t cache (a, b), but they’re roughly equal. os.path.join is several times slower (5 sec vs. .8 for 1 million calls). So that’d be worth it in this case.

Chris_Cole · May 24, 2017, 4:22pm

Hello, call me the necromancer…
Does the function cacheing speed increase( I do this all over the place for maxscript) apply to built in functions.
I have recently started using cprofile in all my test scripot, that will someday become unit tests.
And I am noticing 99% of the time that isinstance is often my biggest time taken.
I like to do verification, and filtering, and data manipulation in fn often.

And recently I have been looking at my code, wondering if it is worth sacrificing the safety, for speed in many cases.

I very quickly end up getting into the 15k - 100k+ is instance calls, and I am wondering if caching off ininstance in my libraries could be a decent middle man. Same with hasattr and getattr and setattr

THanks!

Theodox · May 24, 2017, 7:46pm

function caching speeds things up because it simplifies the chain of name lookups which Python does trying to find your the function you want to call. It’s useful if your’e putting about to call that function 10,000 times; it’s pointless if your calling it a couple of times.

 def this_is_legit(self):
      processor = self.fn # cache this lookup
      for n in range (100000):
            processor (n)

 def dont_do_this(self, n):
       processor = self.fn # cache this lookup
       processor(n)

If you’re calling isinstance once at the head of every function, caching it won’t do much for you. If you’re doing it before calling it many times in a row, it will help. Caching is only going to help python find isinstance – it isn’t going to make it run faster. The details here

As a matter of style I generally avoid tons of type checking in Python code, because it’s purely for the coder: if you’re expecting a foo and you get a bar it’s going to show up as a bug no matter what – the only difference is how fast you get to the realization “wait, this is supposed to be a foo but it’s not” while debugging. The real bug is elsewhere in the code that passed in the wrong object. If you really like isinstance you can put it in an

if __some_debug_variable__:
     assert (obj.isinstance(XXX))

block. Python has a builtin called __debug__ for this purpose, which is set by interpreter; if you’re running in maya it’s hard to work with but if not you can use that