found = 0
for n in needles:
if n in haystack:
found += 1
Internals of sets and dicts
Python’s dict
and set
are built on top of hash tables.
This post explains how the use of hash tables results in
the strengths and limitations of those container types.
Here are some questions this article answers:
-
How efficient are Python
dict
andset
? -
Why are
set
elements unordered? -
Why can’t we use any Python object as a
dict
key orset
element? -
Why does the order of the
dict
keys depend on insertion order?
Contents:
Note
|
You don’t need to know all of these details to make good use of dictionaries and sets. But the implementation ideas are beautiful—that’s why I describe them. For practical advice, you can skip to Practical Consequences of How Sets Work and Practical Consequences of How dict Works. |
To motivate the study of hash tables, we start by showcasing the amazing performance of dict
and set
with a simple test involving millions of items.
A Performance Experiment
From experience, all Pythonistas know that dicts and sets are fast. We’ll confirm that with a controlled experiment.
To see how the size of a dict
, set
, or list
affects the performance of search using the in
operator, I generated an array of 10 million distinct double-precision floats, the "haystack." I then generated an array of needles: 1,000 floats, with 500 picked from the haystack and 500 verified not to be in it.
For the dict
benchmark, I used dict.fromkeys()
to create a dict
named haystack
with 1,000 floats. This was the setup for the dict
test. The actual code I clocked with the timeit
module is Example 1 (like [ex_set_loop_ex]).
I repeated the benchmark five times, each time increasing tenfold the size of haystack
, from 1,000 to 10,000,000 items. The result of the dict
performance test is in Figure 1.
In concrete terms, to check for the presence of 1,000 floating-point keys in a dictionary with 1,000 items,
the processing time on my laptop was 99µs, and the same search in a dict
with 10,000,000 items took 512µs.
In other words, the average time for each search in the haystack with 10 million items was 0.512µs—yes, that’s about half microsecond per needle.
When the search space became 10,000 times larger, the search time increased a little over 5 times. Nice.
To compare with other collections, I repeated the benchmark with the same haystacks of increasing size, but storing the haystack
as a set
or as list
. For the set
tests, in addition to timing the for
loop in Example 1, I also timed the one-liner in Example 2, which produces the same result: count the number of elements from needles
that are also in haystack
—if both are sets.
found = len(needles & haystack)
Figure 2 shows the tests side by side. The best times are in the "set& time" column, which displays results for the set &
operator using the code from Example 2.
As expected, the worst times are in the "list time" column, because there is no hash table to support searches with the in
operator on a list
, so a full scan must be made if the needle is not present, resulting in times that grow linearly with the size of the haystack.
If your program does any kind of I/O, the lookup time for keys in dicts or sets is negligible, regardless of the dict
or set
size (as long as it fits in RAM). See the code used to generate [set_dict_search_time_tbl] and accompanying discussion in [support_scripts], [support_container_perftest].
Now that we have concrete evidence of the speed of dicts and sets, let’s explore how that is achieved with the help of hash tables.
Before studying hash tables, we need to know more about hash codes, and how they relate to equality.
Hashes and equality
The hash()
built-in function works directly with built-in types and falls back to calling __hash__
for user-defined types. If two objects compare equal, their hash codes must also be equal, otherwise the hash table algorithm does not work. For example, because 1 == 1.0
is True
, hash(1) == hash(1.0)
must also be True
, even though the internal representation of an int
and a float
are very different.[1]
Also, to be effective as hash table indexes, hash codes should scatter around the index space as much as possible. This means that, ideally, objects that are similar but not equal should have hash codes that differ widely. Example 3 is the output of a script to compare the bit patterns of hash codes. Note how the hashes of 1 and 1.0 are the same, but those of 1.0001, 1.0002, and 1.0003 are very different.
32-bit Python build
1 00000000000000000000000000000001
!= 0
1.0 00000000000000000000000000000001
------------------------------------------------
1.0 00000000000000000000000000000001
! !!! ! !! ! ! ! ! !! !!! != 16
1.0001 00101110101101010000101011011101
------------------------------------------------
1.0001 00101110101101010000101011011101
!!! !!!! !!!!! !!!!! !! ! != 20
1.0002 01011101011010100001010110111001
------------------------------------------------
1.0002 01011101011010100001010110111001
! ! ! !!! ! ! !! ! ! ! !!!! != 17
1.0003 00001100000111110010000010010110
------------------------------------------------
Note
|
Starting with Python 3.3, a random salt value is included when computing hash codes for |
Hash collisions
As mentioned, on 64-bit CPython a hash code is a 64-bit number, and that’s 264 possible values—which is more than 1019. But most Python types can represent many more different values. For example, a string made of 10 ASCII printable characters picked at random has 10010 possible values–more than 266. Therefore, the hash code of an object usually has less information than the actual object value. This means that objects that are different may have the same hash code.
Tip
|
When correctly implemented, hashing guarantees that different hash codes always imply different objects, but the reverse is not true: different objects don’t always have different hash codes. When different objects have the same hash code, that’s a hash collision. |
With this basic understanding of hash codes and object equality, we are ready to dive into how hash tables work, and how hash collisions are handled.
Set hash tables under the hood
Hash tables are a wonderful invention. Let’s see how a hash table is used when adding elements to a set.
Let’s say we have a set with abbreviated workdays, created like this:
>>> workdays = {'Mon', 'Tue', 'Wed', 'Thu', 'Fri'}
>>> workdays
{'Tue', 'Mon', 'Wed', 'Fri', 'Thu'}
The core data structure of a Python set
is a hash table with at least 8 rows.
Traditionally, the rows in hash table are called buckets[2].
A hash table holding the elements of workdays
looks like Figure 3.
{'Mon', 'Tue', 'Wed', 'Thu', 'Fri'}
. Each bucket has two fields: the hash code and a pointer to the element value. Empty buckets have -1 in the hash code field. The ordering looks random.In CPython built for a 64-bit CPU, each bucket in a set has two fields: a 64-bit hash code, and a 64-bit pointer to the element value—which is a Python object stored elsewhere in memory. Because buckets have a fixed size, individual buckets are accessed by offset from the start of the hash table. In other words, the indexes 0 to 7 in Figure 3 are not stored, they are just offsets.
The hash table algorithm
We will focus on the internals of set
first, and later transfer the concepts to dict
.
Note
|
This is a simplified view of how Python uses a hash table to implement a |
Let’s see how Python builds a set like {'Mon', 'Tue', 'Wed', 'Thu', 'Fri'}
, step by step. The algorithm is illustrated by the flowchart in Figure 4, and described next.
Step 0: initialize hash table
As mentioned earlier, the hash table for a set
starts with 8 empty buckets. As elements are added, Python makes sure at least ⅓ of the buckets are empty—doubling the size of the hash table when more space is needed. The hash code field of each bucket is initialized with -1, which means "no hash code"[3].
Step 1: compute the hash code for the element
Given the literal {'Mon', 'Tue', 'Wed', 'Thu', 'Fri'}
, Python gets the hash code for the first element, 'Mon'
.
For example, here is a realistic hash code for 'Mon'
—you’ll probably get a different result because of the random salt Python uses to compute the hash code of strings:
>>> hash('Mon')
4199492796428269555
Step 2: probe hash table at index derived from hash code
Python takes the modulus of the hash code with the table size to find a hash table index. Here the table size is 8, and the modulus is 3:
>>> 4199492796428269555 % 8
3
Probing consists of computing the index from the hash, then looking at the corresponding bucket in the hash table. In this case, Python looks at the bucket at offset 3 and finds -1 in the hash code field, marking an empty bucket.
Step 3: put the element in the empty bucket
Python stores the hash code of the new element, 4199492796428269555, in the hash code field at offset 3, and a pointer to the string object 'Mon'
in the element field. Figure 5 shows the current state of the hash table.
{'Mon'}
.Steps for remaining items
For the second element, 'Tue'
, steps 1, 2, 3 above are repeated. The hash code for 'Tue'
is 2414279730484651250, and the resulting index is 2.
>>> hash('Tue')
2414279730484651250
>>> hash('Tue') % 8
2
The hash and pointer to element 'Tue'
are placed in bucket 2, which was also empty. Now we have Figure 6
{'Mon', 'Tue'}
. Note that element ordering is not preserved in the hash table.Steps for a collision
When adding 'Wed'
to the set, Python computes the hash -5145319347887138165 and index 3.
Python probes bucket 3 and sees that it is already taken. But the hash code stored there, 4199492796428269555 is different.
As discussed in Hashes and equality, if two objects have different hashes, then their value is also different.
This is an index collision.
Python then probes the next bucket and finds it empty.
So 'Wed'
ends up at index 4, as shown in Figure 7.
{'Mon', 'Tue', 'Wed'}
. After the collision, 'Wed'
is put at index 4.Adding the next element, 'Thu'
, is boring: there’s no collision, and it lands in its natural bucket, at index 7.
Placing 'Fri'
is more interesting.
Its hash, 7021641685991143771 implies index 3, which is taken by 'Mon'
. Probing the next bucket—4—Python finds the hash for 'Wed'
stored there. The hash codes don’t match, so this is another index collision. Python probes the next bucket. It’s empty, so 'Fri'
ends up at index 5. The end state of the hash table is shown in Figure 8.
Note
|
Incrementing the index after a collision is called linear probing. This can lead to clusters of occupied buckets, which can degrade the hash table performance, so CPython counts the number of linear probes and after a certain threshold, applies a pseudo random number generator to obtain a different index from other bits of the hash code. This optimization is particularly important in large sets. |
{'Mon', 'Tue', 'Wed', 'Thu', 'Fri'}
. It is now 62.5% full—close to the ⅔ threshold.When there is an element in the probed bucket and the hash codes match, Python also needs to compare the actual object values. That’s because, as explained in Hash collisions, it’s possible that two different objects have the same hash code—although that’s rare for strings, thanks to the quality of the Siphash algorithm[4]. This explains why hashable objects must implement both __hash__
and __eq__
.
If a new element were added to our example hash table, it would be more than ⅔ full, therefore increasing the chances of index collisions. To prevent that, Python would allocate a new hash table with 16 buckets, and reinsert all elements there.
All this may seem like a lot of work, but even with millions of items in a set
, many insertions happen with no collisions, and the average number of collisions per insertion is between one and two. Under normal usage, even the unluckiest elements can be placed after a handful of collisions are resolved.
Now, given what we’ve seen so far, follow the flowchart in Figure 4 to answer the following puzzle without using the computer.
Given the following set
, what happens when you add an integer 1
to it?
>>> s = {1.0, 2.0, 3.0}
>>> s.add(1)
How many elements are in s
now? Does 1
replace the element 1.0
?
When you have your answer, use the Python console to verify it.
Searching elements in a hash table
Consider the workdays
set with the hash table shown in Figure 8.
Is 'Sat'
in it? This is the simplest execution path for the expression 'Sat' in workdays
:
-
Call
hash('Sat')
to get a hash code. Let’s say it is 4910012646790914166 -
Derive a hash table index from the hash code, using
hash_code % table_size
. In this case, the index is 6. -
Probe offset 6: it’s empty. This means
'Sat'
is not in the set. ReturnFalse
.
Now consider the simplest path for an element that is present in the set. To evaluate 'Thu' in workdays
:
-
Call
hash('Tue')
. Pretend result is 6166047609348267525. -
Compute index:
6166047609348267525 % 8
is 5. -
Probe offset 5:
-
Compare hash codes. They are equal.
-
Compare the object values. They are equal. Return
True
.
-
Collisions are handled in the way described when adding an element.
In fact, the flowchart in Figure 4 applies to searches as well,
with the exception of the terminal nodes—the rectangles with rounded corners.
If an empty bucket is found, the element is not present, so Python returns False
;
otherwise, when both the hash code and the values of the sought element match an element in the hash table, the return is True
.
Practical Consequences of How Sets Work
The set
and frozenset
types are both implemented with a hash table, which has these effects:
-
Set elements must be hashable objects. They must implement proper
__hash__
and__eq__
methods as described in [what_is_hashable]. -
Membership testing is very efficient. A set may have millions of elements, but the bucket for an element can be located directly by computing the hash code of the element and deriving an index offset, with the possible overhead of a small number of probes to find a matching element or an empty bucket.
-
Sets have a significant memory overhead. The most compact internal data structure for a container would be an array of pointers[5]. Compared to that, a hash table adds a hash code per entry, and at least ⅓ of empty buckets to minimize collisions.
-
Element ordering depends on insertion order, but not in a useful or reliable way. If two elements are involved in a collision, the bucket were each is stored depends on which element is added first.
-
Adding elements to a set may change the order of other elements. That’s because, as the hash table is filled, Python may need to recreate it to keep at least ⅓ of the buckets empty. When this happens, elements are reinserted and different collisions may occur.
Hash table usage in dict
May your hashes be unique,
Your keys rarely collide,
And your dictionaries
be forever ordered.[6]
in The Dictionary Even Mightier
Since 2012, the implementation of the dict
type had two major optimizations to reduce memory usage.
The first one was proposed as PEP 412 — Key-Sharing Dictionary and implemented in Python 3.3[7].
The second is called "compact dict`"
, and landed in Python 3.6.
As a side effect, the compact `dict
space optimization preserves key insertion order.
In the next sections we’ll discuss the compact dict
and the new key-sharing scheme—in this order, for easier presentation.
How compact dict
saves space and keeps ordering
Note
|
This is a high level explanation of the Python |
Consider a dict
holding the abbreviated names for the weekdays from 'Mon'
through 'Thu'
, and the number of students enrolled in swimming class on each day:
>>> swimmers = {'Mon': 14, 'Tue': 12, 'Wed': 14, 'Thu': 11}
Before the compact dict
optimization, the hash table underlying the swimmers
dictionary would look like Figure 9.
As you can see, in a 64-bit Python, each bucket holds three 64-bit fields:
the hash code of the key, a pointer to the key object, and a pointer to the value object.
That’s 24 bytes per bucket.
dict
with 4 key-value pairs. Each bucket is a struct with the hash code of the key, a pointer to the key, and a pointer to the value.The first two fields play the same role as they do in the implementation of sets.
To find a key, Python computes the hash code of the key, derives an index from the key,
then probes the hash table to find a bucket with a matching hash code and a matching key object.
The third field provides the main feature of a dict
: mapping a key to an arbitrary value.
The key must be a hashable object, and the hash table algorithm ensures it will be unique in the dict
.
But the value may be any object—it doesn’t need to be hashable or unique.
Raymond Hettinger observed that significant savings could be made if the hash code and pointers to key and value were held in an entries
array with no empty rows,
and the actual hash table were a sparse array with much smaller buckets holding indexes into the entries
array[8].
In his original message to python-dev,
Hettinger called the hash table indices
. The width of the buckets in indices
varies as the dict
grows, starting at 8-bits per bucket—enough to index up to 128 entries, while reserving negative values for special purposes, such as -1 for empty and -2 for deleted.
As an example, the swimmers
dictionary would then be stored as shown in Figure 10.
dict
with 4 key-value pairs. Hash codes and pointers to keys and values are stored in insertion order in the entries
array, and the entry offsets derived from the hash codes are held in the indices
sparse array, where an index value of -1 signals an empty bucket.Assuming a 64-bit build of CPython, our 4-item swimmers
dictionary would take 192 bytes of memory in the old scheme:
24 bytes per bucket, times 8 rows.
The equivalent compact dict
uses 104 bytes in total: 96 bytes in entries
(24 * 4),
plus 8 bytes for the buckets in indices
—configured as an array of 8 bytes.
The next section describes how those two arrays are used.
Algorithm for adding items to compact dict
.
Step 0: set up indices
The indices
table is initially set up as an array of signed bytes, with 8 buckets, each initialized with -1 to signal "empty bucket".
Up to 5 of these buckets will eventually hold indices to rows in the entries
array, leaving ⅓ of them with -1.
The other array, entries
, will hold key/value data with the same three fields as in the old scheme—but in insertion order.
Step 1: compute hash code for the key
To add the key-value pair ('Mon', 14)
to the swimmers
dictionary,
Python first calls hash('Mon')
to compute the hash code of that key.
Step 2: probe entries
via indices
Python computes hash('Mon') % len(indices)
. In our example, this is 3.
Offset 3 in indices
holds -1: it’s an empty bucket.
Step 3: put key-value in entries
, updating indices
.
The entries
array is empty, so the next available offset there is 0.
Python puts 0 at offset 3 in indices
and stores
the hash code of the key, a pointer to the key object 'Mon'
, and a pointer to the int
value 14
at offset 0 in entries
.
Figure 11 shows the state of the arrays when the value of swimmers
is {'Mon': 14}
.
{'Mon': 14}
: indices[3]
holds the offset of the first entry: entries[0]
.Steps for next item
To add ('Tue', 12)
to swimmers
:
-
Compute hash code of key
'Tue'
. -
Compute offset into
indices
, ashash('Tue') % len(indices)
. This is 2.indices[2]
has -1. No collision so far. -
Put the next available
entries
offset, 1, inindices[2]
, then store entry atentries[1]
.
Now the state is Figure 12. Note that entries
holds the key-value pairs in insertion order.
{'Mon': 14, 'Tue': 12}
.Steps for a collision
-
Compute hash code of key
'Wed'
. -
Now,
hash('Wed') % len(indices)
is 3.indices[3]
has 0, pointing to an existing entry. Look at the hash code inentries[0]
. That’s the hash code for'Mon'
, which happens to be different than the hash code for'Wed'
. This mismatch signals a collision. Probes the next index:indices[4]
. That’s -1, so it can be used. -
Make
indices[4] = 2
, because 2 is the next available offset atentries
. Then fillentries[2]
as usual.
After adding ('Wed', 14)
, we have Figure 13
{'Mon': 14, 'Tue': 12, 'Wed': 14}
.How a compact dict
grows
Recall that the buckets in the indices
array are 8 signed bytes initially, enough to hold offsets for up to 5 entries, leaving ⅓ of the buckets empty.
When the 6th item is added to the dict
, indices
is reallocated to 16 buckets—enough for 10 entry offsets.
The size of indices
is doubled as needed, while still holding signed bytes, until the time comes to add the 129th item to the dict
.
At this point, the indices
array has 256 8-bit buckets. However, a signed byte is not enough to hold offsets after 128 entries,
so the indices
array is rebuilt to hold 256 16-bit buckets to hold signed integers—wide enough to represent offsets to 32,768 rows in the entries
table.
The next resizing happens at the 171st addition, when indices
would become more than ⅔ full.
Then the number of buckets in indices
is doubled to 512, but each bucket still 16-bits wide each.
In summary, the indices
array grows by doubling the number of buckets,
and also—less often—by doubling the width of each bucket to accommodate a growing number of rows in entries
.
This concludes our summary of the compact dict
implementation.
I omitted many details, but now let’s take a look at the other space-saving optimization for dictionaries: key-sharing.
Key-sharing dictionary
Instances of user-defined classes hold their attributes in a __dict__
attribute[9]
which is a regular dictionary.
An instance __dict__
maps attribute names to attribute values.
Most of the time, all instances have the same attributes with different values.
When that happens, 2 of the 3 fields in the entries
table for every instance have the exact same content:
the hash code of the attribute name, and a pointer to the attribute name.
Only the pointer to the attribute value is different.
In PEP 412 — Key-Sharing Dictionary,
Mark Shannon proposed to split the storage of dictionaries used as instance __dict__
,
so that each attribute hash code and pointer is stored only once, linked to the class,
and the attribute values are kept in parallel arrays of pointers attached to each instance.
Given a Movie
class where all instances have the same attributes named
'title'
, 'release'
, 'directors'
, and 'actors'
,
Figure 14 shows the arrangement of key-sharing in a split
dictionary—also implemented with the new compact layout.
__dict__
of a class and three instances.PEP 412 introduced the terms combined-table to discuss the old layout and split-table for the proposed optimization.
The combined-table layout is still the default when you create a dict
using literal syntax or call dict()
.
A split-table dictionary is created to fill the __dict__
special attribute of an instance, when it is the first instance of a class.
The keys table (see Figure 14) is then cached in the class object.
This leverages the fact that most Object Oriented Python code assigns all instance attributes in the __init__
method.
That first instance (and all instances after it) will hold only its own value array.
If an instance gets a new attribute not found in the shared keys table, then this instance’s __dict__
is converted to combined-table form.
However, if this instance is the only one in its class, the __dict__
is converted back to split-table,
since it is assumed that further instances will have the same set of attributes and key sharing will be useful.
The PyDictObject
struct that represents a dict
in the CPython source code is the same for both combined-table and split-table dictionaries.
When a dict
converts from one layout to the other, the change happens in PyDictObject
fields,
with the help of other internal data structures.
Practical Consequences of How dict Works
-
Keys must be hashable objects. They must implement proper
__hash__
and__eq__
methods as described in [what_is_hashable]. -
Key searches are nearly as fast as element searches in sets.
-
Item ordering is preserved in the
entries
table—this was implemented in CPython 3.6, and became an official language feature in 3.7. -
To save memory, avoid creating instance attributes outside of the
__init__
method. If all instance attributes are created in__init__
, the__dict__
of your instances will use the split-table layout, sharing the same indices and key entries array stored with the class.
int
, here is a CPython implementation detail: the hash code of an int
that fits in a machine word is the value of the int
itself, except the hash code of -1, which is -2.
hash()
built-in never returns -1 for any Python object. If x.hash()
returns -1, hash(x)
returns -2.
entries
array where the hash codes are. But, conceptually, the index
array is really the hash table in this implementation, even if there are no hashes in its buckets.