Geometry hash

I’ve got to support a tool which, unfortunately, is really stupid about overlapped objects. If an artist accidentally duplicates an object in place, the model will not export, because the aforesaid stupid tool will try to merge the geometry of the original and the copies, creating lamina or non-manifold faces.

Since it’s common for this to happen by accident, it’s difficult to debug: the exporter’s error message is completely unhelpful, identifying neither the objects which caused the problem nor the physical location of the issue.

So, I’d like to catch this condition before the geo is exported. What I need to do is generate a hash value for every mesh object in my Maya scene such that two identical objects in the same physical location have the same hash, so I can quickly spot duplicates. My thought is to generate it in the following way:

Collect the object triangle count and the centroid of the world-space bounding box for the object. Clamp the position to 2 decimal places for less float instability. Make a hash with ( X << triangle count ) ^ ( Y >> triangle count ) ^ ( Z ^ triangle count ). This should give a check for objects that are obviously different, so check everything in the scene this way first.

For any objects which are possibly identical, collect all the world-space vert positions. Sum up the vert positions (clamped again) and hash (X << vert count) ^ (Y >> vert count) ^ Z.
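Roughly, in code – an untested sketch, where quantize, the centroid, and the vert list just stand in for values I’d pull from the scene:


def quantize( p ):
    # clamp to 2 decimal places, kept as an integer so it can be shifted
    return int( round( p * 100 ) )

def first_pass_hash( centroid, tri_count ):
    x, y, z = [ quantize( c ) for c in centroid ]
    # Python ints won't overflow, but shifting by a large triangle count makes
    # very wide numbers - in practice the shift amount might need a modulo
    return ( x << tri_count ) ^ ( y >> tri_count ) ^ ( z ^ tri_count )

def second_pass_hash( world_verts, vert_count ):
    # sum the quantized positions so the result is vert-order independent
    sx = sum( quantize( v[0] ) for v in world_verts )
    sy = sum( quantize( v[1] ) for v in world_verts )
    sz = sum( quantize( v[2] ) for v in world_verts )
    return ( sx << vert_count ) ^ ( sy >> vert_count ) ^ sz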

Does that sound like it’d be (a) fast enough and (b) accurate enough? Doesn’t have to be perfect but it does need to run every time users hit export; scenes are typically < 50k polys and < 1k objects.

Obviously, if there were an API function that did the same thing that’d be preferable… but I’m not aware of one. Anybody?

Yeah, sounds about right – you would basically quantize the floats and make a hash value (or values) which would represent a unique identifier for that particular mesh in the scene. Which hash algorithm you use is up to you, as there are a lot of them out there; what’s more important is that you collect the right values that represent that object and hash them together, so you end up with exactly the same id(s) for the same input sequence. You can use several hash ids if you want to.

If this is being scripted, the iteration can be a bottleneck, so you might wanna write this using the API (assuming you are using Maya) with just a simple command plugin that loops through the MFnTransform and MFnMesh channels for the values you are interested in and returns a result. The channels are laid out linearly in memory (MFloatPointArray), so this would be blazingly fast to loop through with a bunch of CPU cache hits.
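If a compiled plugin is more than you need, the Python API 2.0 gets at the same arrays directly. An untested sketch – it walks every mesh shape and pulls its world-space points in a single call:


import maya.api.OpenMaya as om

def iter_mesh_points():
    # walk every mesh shape in the scene
    it = om.MItDag( om.MItDag.kDepthFirst, om.MFn.kMesh )
    while not it.isDone():
        path = it.getPath()
        fn = om.MFnMesh( path )
        # one call returns the whole MFloatPointArray - cache friendly
        yield path.fullPathName(), fn.getFloatPoints( om.MSpace.kWorld )
        it.next()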

You can then write some unit tests to see how it performs for a bunch of scenes.

Typically this would be done by an in-house asset conditioning pipeline, with a chain-of-responsibility pattern (mesh filters) catching stuff, but I guess that’s not an option since you have to support an incomplete tool (assuming 3rd party or lazy engineers) =/

Yeah. I won’t name names. I do a lot of pre-export checks like this because the 3rd-party code has lots of non-obvious requirements and lousy error messages.

The algo outlined above seems to be working for trivial cases – no false positives, and it correctly catches things like co-located cubes (even when they are rotated or scaled but visually identical). I inadvertently kicked it off in timeit for a run of 100,000 times on a full scene, so I’m going to get more accurate perf data than I really want… whenever it finishes :slight_smile:

Out of curiosity, why hash? Is it to avoid comparing every vert on the suspicious objects?

Exactly – it converts the whole testing criterion into a single number which can be checked against the other numbers fast. This way you do the analysis for each object only once. You don’t want to do it iteratively in a case like this, because you expect that 99% of the time the objects will be different – and if you walked through the verts comparing one at a time hoping for an early-out rejection, you’ll be disappointed most of the time.
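Roughly, the pattern is this (scene_objects and geo_hash here just stand in for the real scene list and hash function):


from collections import defaultdict

# one hash per object, then duplicates fall out of the dict grouping:
# O(n) lookups instead of O(n^2) pairwise vert comparisons
groups = defaultdict( list )
for obj in scene_objects:
    groups[ geo_hash( obj ) ].append( obj )

suspects = [ g for g in groups.values() if len( g ) > 1 ]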

Just looking at size and tri/vert count is going to give you a bunch of potential collisions. You might want to also include the position of the first vertex or something. Of course, this doesn’t detect two overlapping objects that get merged because some of their verts overlap while the objects have different vert counts, etc. Still, it is better than nothing.

I had to do something similar… here is what I did (probably not the most elegant solution):

I ended up creating a string that contained all the info I wanted and used that as a key in a dictionary.

For the value, I started a list containing the mesh name. Whenever another object generated a string that was already in the dictionary, I added that mesh to that list. When I was done going through all the objects, I iterated through the dictionary and removed entries whose list contained only 1 mesh.

Sorry to anyone who tries to come up with the perfect algorithm – it doesn’t exist. What you should really do is:
1- Put all the ‘hashing’ functionality into a single public function (duh),
2- Set up a test for the basic hashing functionality as you have it now,
3- When you find a new problematic asset, add a test for it, then fix the hashing function to support it,
4- And make sure you run the old tests so you don’t regress.

Unfortunately this is an inherently heuristic problem and there’s no single right answer that will run fast enough. The best anyone can really do is the minimum that works now, and make it easy to expand in the future by having good tests.
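For example, a couple of starter tests – an untested sketch, where geo_hash stands for whatever public hashing function you settle on and each test assumes a fresh scene:


import maya.cmds as cmds

def test_duplicate_in_place():
    # an exact in-place duplicate must hash identically
    cube = cmds.polyCube()[0]
    dupe = cmds.duplicate( cube )[0]
    assert geo_hash( cube ) == geo_hash( dupe )

def test_offset_copy_not_flagged():
    # a copy moved away must not be reported as a duplicate
    cube = cmds.polyCube()[0]
    dupe = cmds.duplicate( cube )[0]
    cmds.move( 1, 0, 0, dupe )
    assert geo_hash( cube ) != geo_hash( dupe )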

If you’re familiar with SQL, use SQLite (it’s part of the standard Python library).

1. Create an in-memory database.
2. Create a table with the object name and the test criteria as its columns.
3. Add indices to all the test criterion columns (for performance).
4. Populate the table by iterating through the objects in your scene.
5. Do a SQL SELECT using GROUP BY aggregation over all the test criterion columns.
6. Proceed with the groups that contain more than a single row.

This way the comparison is accurate and should be reasonably fast on modern CPUs.

Use SQLAlchemy as an ORM wrapper if you prefer your code not littered with SQL statements.

In case anybody cares, here’s what I ended up using:


import maya.cmds as cmds
import itertools

_clamp = lambda p: hash ( int( p * 100 ) * .01 )

def bbox_hash( obj ):
    '''
    generate a first-pass hash of the object based on its triangle count and world-space bounding box

    @note: This will generate a valid bbox hash for non-meshes or groups too.
    '''

    bb = cmds.xform( obj, q=True, bb=True, ws=True )
    sum_x, sum_y, sum_z = 0, 0, 0
    for pos in range ( 0, len( bb ), 3 ):
        x, y, z = [ _clamp ( p ) for p in bb[pos:pos + 3]]
        sum_x = sum_x + x
        sum_y = sum_y + y
        sum_z = sum_z + z

    try:
        count = int( cmds.polyEvaluate( obj, t=True ) ) # good ol' maya - polyEvaluate returns a string instead of raising!
    except ValueError:
        count = 1

    return sum_x ^ sum_y ^ sum_z ^ ( hash( count - .01 ) )


def geo_hash( obj ):
    '''
    generates a hash of the supplied object based on its world space vertex positions    
    
    @note: this will raise a TypeError if used on a non-mesh
    '''
    sum_x, sum_y, sum_z = 0, 0, 0
    verts = cmds.xform( obj + ".vtx[*]", q=True, t=True, ws=True )
    for pos in range ( 0, len( verts ), 3 ):
        x, y, z = [ _clamp ( p ) for p in verts[pos:pos + 3]]
        sum_x = sum_x + x
        sum_y = sum_y + y
        sum_z = sum_z + z
    return sum_x ^ sum_y ^ sum_z ^ ( hash( len( verts ) + .01 ) )

def find_duplicates( *geo ):
    '''
    Return a list of duplicate pairs for all the objects in the supplied list *geo
    '''
    known = {}
    dupes = []
    for item in geo:
        key = bbox_hash( item )
        if not key in known:
            known[key] = item
        else:
            dupes.append ( ( known[key] , item ) )
    possibles = set( [k for k in itertools.chain( *dupes )] )
    geo_known = {}
    dupes = []
    for item in possibles:
        key = geo_hash( item )
        if not key in geo_known:
            geo_known[key] = item
        else:
            dupes.append ( ( geo_known[key] , item ) )
    return dupes

def find_all_duplicates():
    '''
    Call find_duplicates on all mesh objects in the scene
    '''
    meshes = cmds.ls( type='mesh', l=True, ni=True )
    if not meshes: return []
    xforms = cmds.listRelatives( meshes, p=True, f=True )
    return find_duplicates ( *xforms )

For my application size it’s pretty quick, and it catches identical geo while ignoring geo that is very slightly off (e.g. two cubes colocated but rotated by < 1 degree).

As one of the earlier posts pointed out, this doesn’t catch all coincident verts – that would be a much more extensive check that I don’t think I need.

Good stuff, Steve! Thanks for sharing this code.

madyasiwi: I have no idea why someone would use a DB for this.

Word, when I first saw the reply I thought: Woot, that’s clever usage of database mechanics, but yeah :stuck_out_tongue:

The DB approach would actually work too – it offloads the uniqueness constraint and doesn’t require you to invest in any kind of encoding. For a problem with this low dimensionality it’s probably overkill – but if you were looking for a combination of several disparate characteristics, it would be much easier to write and maintain.

As for perf, I guess it would depend on your DB module. I didn’t try it because we generally use PyMySql, which is pure Python and so would almost certainly be slower than my cheapo solution. If we used a C-based DB engine, it might be faster than mine.

The part below that rebuilds the candidate list from the bbox-hash pairs might be problematic:


def find_duplicates( *geo ):
    '''
    Return a list of duplicate pairs for all the objects in the supplied list *geo
    '''
    known = {}
    dupes = []
    for item in geo:
        key = bbox_hash( item )
        if not key in known:
            known[key] = item
        else:
            dupes.append ( ( known[key] , item ) )
    possibles = set( [k for k in itertools.chain( *dupes )] )
    geo_known = {}
    dupes = []
    for item in possibles:
        key = geo_hash( item )
        if not key in geo_known:
            geo_known[key] = item
        else:
            dupes.append ( ( geo_known[key] , item ) )
    return dupes

If dupes were [ ( A, B ), ( C, D ) ], and geo_hash( B ) == geo_hash( C ), then the final result would be [ ( B, C ) ]. Even though bbox_hash( B ) != bbox_hash( C )!

It can probably be replaced by something like this instead:


dupes2 = []
for object1, object2 in dupes:
    if geo_hash( object1 ) == geo_hash( object2 ):
        dupes2.append( ( object1, object2 ) )
return dupes2

But then, what if there are more than 2 duplicate instances? Say, A == B == C?
More test iterations needed?
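One way to handle both issues in the hash-based version – keeping the bbox partition intact and allowing groups of any size – might be to bucket into lists keyed by hash instead of collecting pairs (untested):


from collections import defaultdict

def find_duplicate_groups( *geo ):
    # first pass: bucket everything by the cheap bbox hash
    by_bbox = defaultdict( list )
    for item in geo:
        by_bbox[ bbox_hash( item ) ].append( item )
    groups = []
    for candidates in by_bbox.values():
        if len( candidates ) < 2:
            continue
        # second pass: re-bucket each candidate set by the detailed geo hash,
        # so objects from different bbox buckets can never be paired up
        by_geo = defaultdict( list )
        for item in candidates:
            by_geo[ geo_hash( item ) ].append( item )
        groups.extend( g for g in by_geo.values() if len( g ) > 1 )
    return groups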

The following is a modified version of the code above using SQLite, just for illustration (untested!):

import maya.cmds as cmds
import itertools
import sqlite3


initSQL = '''
CREATE TABLE `myobjects` (
    `name` TEXT PRIMARY KEY,
    `bb_x` FLOAT, `bb_y` FLOAT, `bb_z` FLOAT,
    `polycount` INTEGER,
    `geo_x` FLOAT, `geo_y` FLOAT, `geo_z` FLOAT,
    `vertexcount` INTEGER
);
CREATE INDEX `myobjects_criterion_index` ON `myobjects` (
    `bb_x`, `bb_y`, `bb_z`, `polycount`, `geo_x`, `geo_y`, `geo_z`, `vertexcount`
);
'''

groupQuery = '''
SELECT `bb_x`, `bb_y`, `bb_z`, `polycount`, `geo_x`, `geo_y`, `geo_z`, `vertexcount`, COUNT( * ) `count` FROM `myobjects`
GROUP BY `bb_x`, `bb_y`, `bb_z`, `polycount`, `geo_x`, `geo_y`, `geo_z`, `vertexcount`
HAVING `count` > 1
'''

duplicatesQuery = '''
SELECT `name` FROM `myobjects`
WHERE `bb_x` = ? AND `bb_y` = ? AND `bb_z` = ? AND `polycount` = ? AND `geo_x` = ? AND `geo_y` = ? AND `geo_z` = ? AND `vertexcount` = ?
'''

_clamp = lambda p: int( p * 100 )  # quantize to 2 decimal places; kept as an exact integer so no hashing is involved


def bbox_values( obj ):
    
    bb = cmds.xform( obj, q=True, bb=True, ws=True )
    sum_x, sum_y, sum_z = 0, 0, 0
    for pos in range ( 0, len( bb ), 3 ):
        x, y, z = [ _clamp ( p ) for p in bb[pos:pos + 3]]
        sum_x = sum_x + x
        sum_y = sum_y + y
        sum_z = sum_z + z

    try:
        count = int( cmds.polyEvaluate( obj, t=True ) ) # good ol' maya - polyEvaluate returns a string instead of raising!
    except ValueError:
        count = 1

    return sum_x, sum_y, sum_z, count


def geo_values( obj ):
    
    sum_x, sum_y, sum_z = 0, 0, 0
    verts = cmds.xform( obj + ".vtx[*]", q=True, t=True, ws=True )
    for pos in range ( 0, len( verts ), 3 ):
        x, y, z = [ _clamp ( p ) for p in verts[pos:pos + 3]]
        sum_x = sum_x + x
        sum_y = sum_y + y
        sum_z = sum_z + z
    return sum_x, sum_y, sum_z, len( verts )
    
    
def find_duplicates( objects ):
    
    duplicates = []
    connection = sqlite3.connect( ":memory:" )
    cursor1 = connection.cursor()
    cursor2 = connection.cursor()
    
    cursor1.executescript( initSQL )
    
    for object in objects:
        bb_x, bb_y, bb_z, polycount = bbox_values( object )
        geo_x, geo_y, geo_z, vertexcount = geo_values( object )
        cursor1.execute(
            "INSERT INTO `myobjects` VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ? )",
            ( object, bb_x, bb_y, bb_z, polycount, geo_x, geo_y, geo_z, vertexcount )
            )
    
    cursor1.execute( groupQuery )
    for bb_x, bb_y, bb_z, polycount, geo_x, geo_y, geo_z, vertexcount, duplicateCount in cursor1:
        group = []
        cursor2.execute( duplicatesQuery, ( bb_x, bb_y, bb_z, polycount, geo_x, geo_y, geo_z, vertexcount ) )
        for ( name, ) in cursor2:
            group.append( name )
        duplicates.append( group )
    
    cursor2.close()
    cursor1.close()
    connection.close()
    
    return duplicates
    
    
def find_all_duplicates():
    '''
    Call find_duplicates on all mesh objects in the scene
    '''
    meshes = cmds.ls( type='mesh', l=True, ni=True )
    if not meshes: return []
    xforms = cmds.listRelatives( meshes, p=True, f=True )
    return find_duplicates ( xforms )

This will be a bit slower because geo_values() is calculated for every object upfront. But no hashing is necessary, so there is no need to worry about hash collisions, and hence it’s more accurate. It also handles more than 2 duplicate instances gracefully.

SQLite is the most suitable database for this particular case because it lives in (local-process) memory. There is no need to set up a separate database server first, no inter-process or inter-machine client-server data transfer overhead, and no heavy concurrency management overhead, so it should be adequately fast as well.

I took perhaps a more naive approach to finding duplicates, using a dictionary instead of generating a hash value. My determination of “sameness” is also very fuzzy and suited my specific case.
Here is my solution:


import maya.cmds as cmds
import math

def findDuplicates(nodes=None):
    '''
    Stores identical objects into a dictionary of lists.
    Compares Vertex count, Edge count, and orientation.
    '''
    dups = {}
    
    if not nodes:
        meshes = cmds.ls(type='mesh', l=True)
    else:
        meshes = nodes
        
    for mesh in meshes:
        xform = cmds.listRelatives(mesh, p=True, f=True)
        
        verts = cmds.polyEvaluate(mesh, v=True)
        edges = cmds.polyEvaluate(mesh, e=True)
        rot = cmds.xform(xform, q=True, ro=True, ws=True)
        
        # get the length of the first edge in object space
        v = cmds.polyListComponentConversion((mesh+'.e[0]'), fe=True, tv=True)
        v = cmds.filterExpand(v, sm=31, expand=True)
        v1Pos = cmds.pointPosition(v[0], local=True)
        v2Pos = cmds.pointPosition(v[1], local=True)
        # Euclidean distance between the edge's two verts
        # (originally computed via an in-house rUtil.distance3d helper)
        length = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1Pos, v2Pos)))
        length = '{0:.2f}'.format(round(length, 2))
                
        data = [verts, edges, rot[1], length]
        meshProperties = ', '.join(map(str, data))
        if meshProperties in dups:
            dups[meshProperties].append(mesh)
        else:
            dups[meshProperties] = [mesh]
    for k, v in dict(dups).items():
        if len(v) <= 1:
            del dups[k]
    return dups

@robert: You’re in the hash club too :slight_smile: We’re both using the fact that dictionary keys are hashes of their contents (hence keys must be hashable – no lists as keys) – the only real difference is the encoding of the differences. You’re getting your hashes from strings, where I do it with bit math on the numbers. Both methods work, but I felt more secure having some insight into the way the hash gets assembled. I was worried about vertex order: by summing up the vert hashes I get an order-independent comparison without having to sort the verts by position. I need to catch something like two superimposed cubes where one is rotated 90 or 180 degrees, so I don’t want to be worrying about component order in Maya. Your check is assuming that matches have the same vert topology, yes?

@madyasiwi:
I’d be surprised if I could get a situation where the bbox hashes don’t match but the geo does – both are calculated in world space with a fixed precision, so if the geo hashes match, the bboxes would too… Not that I really mind: the bbox hash is just a broadphase check which eliminates the vast majority of the stuff in the scene with a cheap test. After that I recheck using the detailed geo_hash test. There’s a small chance of a hash collision, because you could just possibly get a false positive due to the XOR of the vert count, but I’m assuming the odds of that are pretty long… or am I missing what you’re saying?

On a related note, I really ought not to collapse the original pairwise list to a set, but since the number of hits is small (typically none at all, and only a handful unless the user has done something really goofy since the last time this ran!) I didn’t bother.

I was lazy and did not even bother checking the vertex positions – so the order of the components wasn’t a factor for me. I simply used the number of verts, the number of edges, and the length of the first edge (plus the orientation on the Y axis, but that is a very specific condition I needed). My check assumes that if the first 3 match, then the vertex topology is the same.

@Steve,
In that case, yeah, I guess it should be fine then. My opinion was solely focused on the flow of logic and I honestly didn’t put much consideration with the actual data itself.

Someone who feels bored would… :p

I was going to ask the same, lol
:eek: