Count, filter out duplicates for dictionaries



I have an iterator that returns the following for each item: item_name, item_size, user_name

What is the best approach if I want to:
1. Collate similar item names into a single line
2. Prior to point #1, count the number of items
3. Prior to point #1, also display the user names associated with the versions and the size each one used, in descending order

Currently I am using a lot of dictionaries and I am not sure what is the best way to approach this…

gen_dict = {}
size_dict = {}
# my_iterator is the one that I have mentioned as above
for result in my_iterator:
    gen_dict[result['object_name']] = result['user']
    size_dict[result['user']] = result['dir_size']
    # If same key exists, append value to existing key
    if result['owner'] in size_dict:
# Filter out duplicates, count versions
asset_user_dict = defaultdict(set)
asset_count = defaultdict(int)
user_ver_count = defaultdict(lambda: defaultdict(int))
for vers_name, artist_alias in ivy_results.iteritems():
    strip_version_name = vers_name[:-3]
    asset_count[strip_version_name] += 1
    user_ver_count[artist_alias][strip_version_name] += 1
# Gather the sum of all item's size accordingly to each user
for user_name, user_size in size_dict.iteritems():
    # This will sums up all sizes of that particular user
    size_dict[user_name] = sum(user_size)
for version_name, version_count in sorted(asset_count.iteritems()):
    user_vers_cnt = ', '.join('{0}({1}v, {2})'.format(user, user_ver_count[user][version_name], convert_size_query(ivy_size_query[user])) for user in asset_user_dict[version_name])
    print "| {0:<100} | {1:>12} | {2:>90} |".format(version_name+"(xxx)",

I tried using dictionaries, and while I can do almost all of the 3 points above, I am having issues with point #3: either I can’t seem to sort the users in order, or the sizes derived for the users all come out with the same value. Is that because I am using multiple dictionaries? Any advice is greatly appreciated!

By the way, suppose my data is something like:

(1 MiB) "item_C_v001" : "jack"
(5 MiB) "item_C_v002" : "kris"
(1 MiB) "item_A_v003" : "john"
(1 MiB) "item_B_v006" : "peter"
(2 MiB) "item_A_v005" : "john"
(1 MiB) "item_A_v004" : "dave"

My output currently is:

Item Name | No. of Vers. | User
item_A    | 3            | dave(1, 1MiB), john(2, 3MiB)
item_B    | 1            | peter(1, 1MiB)
item_C    | 2            | kris(1, 5MiB), jack(1, 1MiB)


Splitting this into three dictionaries is probably overkill here (I might be wrong). You can likely get away with a single dict or OrderedDict and do your collecting in a single pass, which should also be more efficient.

From the snippet you included I assume your iterator returns a dictionary for each ‘result’…

from collections import OrderedDict

data = OrderedDict()

for result in my_iterator:
    item_name = result.get('object_name', None)
    item_user = result.get('user', None)
    item_size = result.get('dir_size', 0)

    # Create the per-item mapping the first time an item is seen
    if item_name not in data:
        data[item_name] = {}

    # Append this size to the user's list, creating the list if needed
    data[item_name].setdefault(item_user, []).append(item_size)

Your resulting data OrderedDict should be structured so that a single mapped item_name is as follows:

# data[item_name1] = {item_user1: [item_size1, item_size2], item_user2: [item_size1, item_size2], etc...}

From this you can derive the information you need.

for item_name, item_users in data.iteritems():
    count = sum(len(item_sizes) for item_sizes in item_users.values())
    print('Item Name: {} Count: {} User(s): {}'.format(item_name, count, ', '.join(item_users.keys())))

DISCLAIMER: untested!
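
Note that the names in the question carry a version suffix, so they would need to be stripped before being used as keys if the versions are meant to collapse into one line. Here is a minimal sketch of that, assuming every versioned name ends in `_v<digits>`. The `base_name` helper and the `rows` list are made up for illustration (the rows mirror the sample data in the question), and it uses `.items()` so it should run on both Python 2.7 and 3:

```python
from collections import OrderedDict

# Sample rows mirroring the data in the question; sizes are in MiB.
rows = [
    {'object_name': 'item_C_v001', 'user': 'jack', 'dir_size': 1},
    {'object_name': 'item_C_v002', 'user': 'kris', 'dir_size': 5},
    {'object_name': 'item_A_v003', 'user': 'john', 'dir_size': 1},
    {'object_name': 'item_B_v006', 'user': 'peter', 'dir_size': 1},
    {'object_name': 'item_A_v005', 'user': 'john', 'dir_size': 2},
    {'object_name': 'item_A_v004', 'user': 'dave', 'dir_size': 1},
]


def base_name(versioned_name):
    """Hypothetical helper: drop a trailing '_v<digits>' suffix if present."""
    head, sep, tail = versioned_name.rpartition('_v')
    return head if sep and tail.isdigit() else versioned_name


# Single pass: collated item name -> {user: [size, size, ...]}
data = OrderedDict()
for result in rows:
    users = data.setdefault(base_name(result['object_name']), {})
    users.setdefault(result['user'], []).append(result['dir_size'])

# Number of versions per collated item (one size entry per version).
version_counts = {
    name: sum(len(sizes) for sizes in users.values())
    for name, users in data.items()
}
```

With the sample rows, `data['item_A']` ends up holding two sizes for john and one for dave, which is what the count and the per-user totals are derived from.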

Hopefully this helps you look at it from another angle.


You can do it even more simply with a defaultdict whose default is a list:

 from collections import defaultdict

 output = defaultdict(list)
 for result in my_iterator:
     name = result.get('object_name', None)
     user = result.get('user', None)
     size = result.get('dir_size', 0)
     # Append inside the loop so every result is recorded
     output[name].append( (user, size) )

output will be a dictionary of item names and user, size pairs:

 {'item' : [ (user1, 1),  (user2, 3)] ....}

To do the summation, it’s just:

 for item, data in output.iteritems():
     print item, len(data), data
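
For point #3 in the question (per-user version count and total size, largest first), the pairs for one item can be rolled up and sorted in one small function. This is only a sketch: `summarize` is a name I made up, the sizes are assumed to be plain numbers in MiB, and the output format just imitates the table in the question. It avoids the print statement so it should run on Python 2.7 and 3:

```python
from collections import defaultdict

# (user, size in MiB) pairs for one collated item, e.g. output['item_A'].
pairs = [('john', 1), ('john', 2), ('dave', 1)]


def summarize(pairs):
    """Return (user, version_count, total_size) tuples, largest total first."""
    totals = defaultdict(lambda: [0, 0])  # user -> [version_count, total_size]
    for user, size in pairs:
        totals[user][0] += 1
        totals[user][1] += size
    # Sort by total size descending; ties fall back to the user name.
    return sorted(
        ((user, count, total) for user, (count, total) in totals.items()),
        key=lambda entry: (-entry[2], entry[0]),
    )


summary = summarize(pairs)
line = ', '.join('{0}({1}, {2}MiB)'.format(*entry) for entry in summary)
# line == 'john(2, 3MiB), dave(1, 1MiB)'
```

Keeping the count and the total together per user, inside each item, is what stops the sizes from bleeding across items the way a single user-keyed size dictionary would.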