What Is The Overhead Of Using A Dictionary Instead Of A List?
Solution 1:
In direct answer to your question: dictionaries have significantly more overhead than lists:
- Each item consumes memory for both key and value, in contrast to only values for lists.
- Adding or removing an item requires consulting a hash table.
Despite the fact that Python dictionaries are extremely well-designed and surprisingly fast, if you have an algorithm that can use direct integer indexing, a list will save you both space and time.
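As a rough illustration, you can measure both costs yourself. This is a minimal sketch (the names items and mapping are mine, and exact numbers vary by CPython version and machine):

import sys
import timeit

items = list(range(100000))
mapping = {i: i for i in range(100000)}

# Size of the containers themselves (not the integers they reference)
print(sys.getsizeof(items))    # list: one pointer per slot
print(sys.getsizeof(mapping))  # dict: a hash table holding keys and values

# Direct indexing vs. hashed lookup; dict lookups are fast but not free
print(timeit.timeit('items[50000]', globals=globals()))
print(timeit.timeit('mapping[50000]', globals=globals()))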
However, from the sound of your question and the subsequent discussion, it sounds like your needs may change over time and you have some uncertainty ("However, I can think of some situations that might arise in future that I will need to keep some items for keys which are not integers").
If this is the case, I suggest creating a hybrid data structure of your own so that as your needs evolve you can address the efficiency of storage in an isolated place while allowing your application to use simple, readable code to store and retrieve objects.
For example, here is a Python 3 class called maybelist that is derived from list. It detects the presence of non-numeric keys, storing those exceptions in a dictionary while providing mappings for some common list operations:
import itertools

class maybelist(list):
    def __init__(self, *args):
        super().__init__(*args)
        self._extras = dict()

    def __setitem__(self, index, val):
        try:
            super().__setitem__(index, val)
            return
        except TypeError:
            # Index is not an integer, store in dict
            self._extras[index] = val
            return
        except IndexError:
            pass
        # Index is past the end of the list: grow the list to reach it
        # (use the base class length so dict entries are not counted)
        distance = index - super().__len__()
        if distance > 0:
            # Put 'None' in empty slots if need be
            self.extend((None,) * distance)
        self.append(val)

    def __getitem__(self, index):
        try:
            return super().__getitem__(index)
        except TypeError:
            # Non-integer index: fall back to the dict
            return self._extras[index]

    def __str__(self):
        return str([item for item in self])

    def __len__(self):
        return super().__len__() + len(self._extras)

    def __iter__(self):
        # Yield the list items first, then the non-integer keys
        for item in itertools.chain(super().__iter__(), self._extras):
            yield item
So, you could treat it like an array, and have it auto-expand:
>>> x = maybelist()
>>> x[0] = 'first'
>>> x[1] = 'second'
>>> x[10] = 'eleventh'
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh']
>>> print(x[10])
eleventh
Or you could add items with non-numeric keys if they were present:
>>> x['unexpected'] = 'something else'
>>> print(x['unexpected'])
something else
And yet have the object appear to behave properly if you access it using iterators or other methods of your choosing:
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh', 'unexpected']
>>> print(len(x))
12
This is just an example, and you would need to tailor such a class to the needs of your application. For example, the resulting object does not strictly behave like a list (x[len(x)-1] is not the last item; see below). However, your application may not need such strict adherence, and if you are careful and plan properly, you can create an object that provides highly optimized storage while leaving room for evolving data-structure needs in the future.
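To make that caveat concrete, here is what continuing the session above would look like: the dict entry counts toward the length, but the highest valid list index is still 10.

>>> len(x)
12
>>> x[len(x) - 1]
Traceback (most recent call last):
  ...
IndexError: list index out of range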
Solution 2:
A dict uses a lot more memory than a list. Probably not enough to be a concern if the computer isn't very busy. There are exceptions, of course: if it's a web server handling 100 connections per second, you may want to consider saving memory at the expense of readability.
>>> import sys
>>> L = range(400000)
>>> sys.getsizeof(L)
3200072     # ~3 megabytes
>>> D = dict(zip(range(400000), range(400000)))
>>> sys.getsizeof(D)
25166104    # ~25 megabytes
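Note that these figures are from Python 2, where range returns a real list. In Python 3, range is lazy, so a sketch of the equivalent measurement would be (exact sizes vary by version and platform):

import sys

L = list(range(400000))   # range() is lazy in Python 3; materialize it first
D = dict(zip(range(400000), range(400000)))

print(sys.getsizeof(L))   # still roughly 3 MB on 64-bit CPython
print(sys.getsizeof(D))   # roughly 20 MB, depending on the version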
Solution 3:
Lists are what they seem: a list of values. A dictionary, by contrast, works like an index of words with a definition for each: it maps keys to values, which gives it different properties than a list. That means you use a dictionary when (a short sketch follows the list):
- You have to retrieve things based on some identifier, like names, addresses, or anything that can be a key.
- You don't need things to be in order. Dictionaries historically had no guaranteed order (insertion order is only guaranteed as of Python 3.7), so use a list when order matters.
- You are going to be adding and removing elements and their keys.
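A minimal sketch of these points; the names and addresses are hypothetical:

people = {
    'alice': '12 Elm St',
    'bob': '4 Oak Ave',
}

print(people['bob'])             # retrieval by identifier, no scanning

people['carol'] = '9 Pine Rd'    # adding a key-value pair is cheap
del people['alice']              # so is removing by key

# The list equivalent needs a linear scan to find a name:
pairs = [('alice', '12 Elm St'), ('bob', '4 Oak Ave')]
address = next(addr for name, addr in pairs if name == 'bob')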
Efficiency constraints are discussed in the Stack Overflow posts Link1 & Link2.
Go for a dictionary, since you have doubts about future key types and there are no memory constraints to worry about.
Solution 4:
Not exactly a spot-on answer to your (not entirely clear) question, but here are my thoughts:
You said
I am analyzing large number of items (>400k)
In that case, I'd advise you to use generators and/or process your data in chunks.
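As a sketch of the generator approach (chunks and process are hypothetical names, not from the question), only one chunk is ever held in memory at a time:

from itertools import islice

def chunks(iterable, size=10000):
    """Yield lists of at most `size` items from any iterable, lazily."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

for chunk in chunks(range(400000)):
    process(chunk)   # 'process' stands in for your own handler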
A better option would be to put your data, which are key-value pairs, in Redis and take out chunks of it at a time. Redis can handle your volume of data very easily.
You could write a script that processes one chunk at a time, and using the concurrent.futures module, you could parallelize the chunk processing.
Something like this:
from concurrent import futures

def chunk_processor(data):
    """
    Process your list data here
    """
    pass

def parallelizer(map_func, your_data_list, n_workers=3):
    with futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
        for result in executor.map(map_func, your_data_list):
            pass  # Do whatever with your result

# Take the next chunk of your data out of Redis here
# (get_next_chunk_from_redis is a placeholder for your own code)
chunk_of_list = get_next_chunk_from_redis()

# Your processing starts here
parallelizer(chunk_processor, chunk_of_list)
Again, something better could be done, but this is one way to go about it.