strongswan/src/libstrongswan/collections/hashtable.c

/*
* Copyright (C) 2008-2020 Tobias Brunner
* HSR Hochschule fuer Technik Rapperswil
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2 of the License, or (at your
* option) any later version. See <http://www.fsf.org/copyleft/gpl.txt>.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
* or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
* for more details.
*/
#include "hashtable.h"
/*
 * hashtable: Maintain insertion order when enumerating
 *
 * The previous approach would have required at least one additional pointer
 * per item to keep the items in a list (a 15-18% increase in per-item
 * overhead).  Instead, collisions are no longer handled with overflow lists:
 * the table uses open addressing and stores variable-sized indices that
 * point into an array of all inserted items in their original order.  This
 * can reduce the memory overhead even compared to the previous
 * implementation (especially for smaller tables), although, because the item
 * array is preallocated whenever the table is resized, it can be worse for
 * certain item counts.  Avoiding the per-item allocations of the previous
 * design is a big advantage, though: depending on the usage pattern,
 * performance improves considerably, in particular when inserting many
 * items.  Raw lookups are a bit slower as probing lengths increase with open
 * addressing, but the compact storage has caching benefits, so overall
 * performance should be better.  For instance, counting the occurrences of
 * 1'000'000 words randomly selected from a dictionary of ~58'000 words (a
 * counter stored under each word as key) was ~8% faster on average with the
 * new implementation while requiring 10% less memory.
 *
 * Since items can't be removed from the array (that would change the indices
 * of all items that follow), removed items are only marked as such (key set
 * to NULL) and are dropped when the table is resized/rehashed, at which
 * point their table cells may be reused.  A resize/rehash may therefore also
 * happen when the number of stored items does not increase, e.g. after a
 * series of remove/put operations, since every insertion consumes a slot in
 * the array no matter how many items were removed.  If the capacity is
 * exhausted, the table is resized/rehashed (after many removals the size may
 * even shrink) and all items marked as removed are skipped.
 *
 * Compared to the previous implementation, the load factor/capacity is
 * lowered to reduce the chance of collisions and to limit primary clustering
 * to some degree.  The latter in particular, but the open addressing scheme
 * in general, makes this implementation completely unsuited for the
 * get_match() functionality (which purposefully hashes to the same value and
 * would therefore increase probing lengths and clustering), and keeping the
 * keys optionally sorted would complicate the code significantly.  So the
 * existing hashlist_t implementation is kept, without maintaining overall
 * insertion order (that feature could be added optionally later, at the cost
 * of one or two pointers per item).
 *
 * The maximum size is unchanged.  In the new implementation it translates to
 * a hard limit on the number of items the table can hold, CAPACITY(MAX_SIZE).
 * At 715'827'882 items with the current settings this shouldn't be a problem
 * in practice; the table alone would require 20 GiB of memory for that many
 * items.  The hashlist_t implementation doesn't have that limitation thanks
 * to its overflow lists (it can store beyond its capacity), but it would
 * require over 29 GiB of memory to hold that many items.
 */
#include "hashtable_profiler.h"
#include <utils/chunk.h>
#include <utils/debug.h>
/** The minimum size of the hash table (MUST be a power of 2) */
#define MIN_SIZE 8
/** The maximum size of the hash table (MUST be a power of 2) */
#define MAX_SIZE (1 << 30)
/** Determine the capacity/maximum load of the table (higher values cause
* more collisions, lower values increase the memory overhead) */
#define CAPACITY(size) (size / 3 * 2)
/** Factor for the new table size based on the number of items when resizing,
* with the above load factor this results in doubling the size when growing */
#define RESIZE_FACTOR 3
/**
* A note about these parameters:
*
* The maximum number of items that can be stored in this implementation
* is MAX_COUNT = CAPACITY(MAX_SIZE).
* Since we use u_int throughout, MAX_COUNT * RESIZE_FACTOR must not overflow
* this type.
*/
#if (UINT_MAX / RESIZE_FACTOR < CAPACITY(MAX_SIZE))
#error Hashtable parameters invalid!
#endif
typedef struct pair_t pair_t;
/**
* This pair holds a pointer to the key and value it represents.
*/
struct pair_t {
/**
* Key of a hash table item.
*/
const void *key;
/**
* Value of a hash table item.
*/
void *value;
/**
* Cached hash (used in case of a resize).
*/
u_int hash;
};
typedef struct private_hashtable_t private_hashtable_t;
/**
* Private data of a hashtable_t object.
*/
struct private_hashtable_t {
/**
* Public part of hash table.
*/
hashtable_t public;
/**
* The number of items in the hash table.
*/
u_int count;
/**
* The current size of the hash table (always a power of 2).
*/
u_int size;
/**
* The current mask to calculate the row index (size - 1).
*/
u_int mask;
/**
* All items in the order they were inserted (removed items are marked by
* setting the key to NULL until resized).
*/
pair_t *items;
/**
* Number of available slots in the array above and the table in general,
* is set to CAPACITY(size) when the hash table is initialized.
*/
u_int capacity;
/**
* Number of used slots in the array above.
*/
	/**
	 * Number of items stored in the items array (including items marked
	 * as removed)
	 */
	u_int items_count;
/**
* Hash table with indices into the array above. The type depends on the
* current capacity.
*/
void *table;
/**
* The hashing function.
*/
hashtable_hash_t hash;
/**
* The equality function.
*/
hashtable_equals_t equals;
/**
* Profiling data
*/
hashtable_profile_t profile;
};
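The `table` member above holds variable-sized indices into the items array, with the index type chosen from the current capacity. As a hypothetical, self-contained sketch (the helper name and signature are invented for illustration, not part of this file), the smallest sufficient index width could be picked like this:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the strongSwan API): return the smallest
 * unsigned integer width able to index every slot of a table with the given
 * capacity.
 */
static size_t index_size_for(uint32_t capacity)
{
	if (capacity <= UINT8_MAX)
	{	/* all indices fit in a single byte */
		return sizeof(uint8_t);
	}
	if (capacity <= UINT16_MAX)
	{	/* two-byte indices suffice */
		return sizeof(uint16_t);
	}
	return sizeof(uint32_t);
}
```

Keeping the index table this compact is what lets the open-addressing scheme use a lower load factor without paying a full pointer per bucket.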
typedef struct private_enumerator_t private_enumerator_t;
/**
* Hash table enumerator implementation
*/
struct private_enumerator_t {
/**
* Implements enumerator interface
*/
enumerator_t enumerator;
/**
* Associated hash table
*/
private_hashtable_t *table;
/**
* Current index
*/
u_int index;
};
/*
* Described in header
*/
u_int hashtable_hash_ptr(const void *key)
{
return chunk_hash(chunk_from_thing(key));
}
/*
* Described in header
*/
u_int hashtable_hash_str(const void *key)
{
return chunk_hash(chunk_from_str((char*)key));
}
/*
* Described in header
*/
bool hashtable_equals_ptr(const void *key, const void *other_key)
{
return key == other_key;
}
/*
* Described in header
*/
bool hashtable_equals_str(const void *key, const void *other_key)
{
return streq(key, other_key);
}
/**
* Returns the index stored in the given bucket. If the bucket is empty,
* 0 is returned.
*/
static inline u_int get_index(private_hashtable_t *this, u_int row)
{
if (this->capacity <= 0xff)
{
return ((uint8_t*)this->table)[row];
}
else if (this->capacity <= 0xffff)
{
return ((uint16_t*)this->table)[row];
}
return ((u_int*)this->table)[row];
}
/**
* Set the index stored in the given bucket. Set to 0 to clear a bucket.
*/
static inline void set_index(private_hashtable_t *this, u_int row, u_int index)
{
if (this->capacity <= 0xff)
{
((uint8_t*)this->table)[row] = index;
}
else if (this->capacity <= 0xffff)
{
((uint16_t*)this->table)[row] = index;
}
else
{
((u_int*)this->table)[row] = index;
}
}
2008-12-03 09:32:16 +00:00
/**
* This function returns the next-highest power of two for the given number.
* The algorithm works by setting all bits on the right-hand side of the most
* significant 1 to 1 and then increments the whole number so it rolls over
* to the nearest power of two. Note: returns 0 for n == 0
*
* Also used by hashlist_t.
2008-12-03 09:32:16 +00:00
*/
u_int hashtable_get_nearest_powerof2(u_int n)
{
	u_int i;

	--n;
	for (i = 1; i < sizeof(u_int) * 8; i <<= 1)
	{
		n |= n >> i;
	}
	return ++n;
}
/**
 * Init hash table to the given size
 */
static void init_hashtable(private_hashtable_t *this, u_int size)
{
	u_int index_size = sizeof(u_int);

	this->size = max(MIN_SIZE, min(size, MAX_SIZE));
	this->size = hashtable_get_nearest_powerof2(this->size);
	this->mask = this->size - 1;
	profile_size(&this->profile, this->size);

	this->capacity = CAPACITY(this->size);
	this->items = calloc(this->capacity, sizeof(pair_t));
	this->items_count = 0;

	if (this->capacity <= 0xff)
	{
		index_size = sizeof(uint8_t);
	}
	else if (this->capacity <= 0xffff)
	{
		index_size = sizeof(uint16_t);
	}
	this->table = calloc(this->size, index_size);
}
/**
 * Calculate the next bucket using quadratic probing (the sequence is h(k) + 1,
 * h(k) + 3, h(k) + 6, h(k) + 10, ...).
 */
static inline u_int get_next(private_hashtable_t *this, u_int row, u_int *p)
{
	*p += 1;
	return (row + *p) & this->mask;
}
/**
hashtable: Maintain insertion order when enumerating With the previous approach we'd require at least an additional pointer per item to store them in a list (15-18% increase in the overhead per item). Instead we switch from handling collisions with overflow lists to an open addressing scheme and store the actual table as variable-sized indices pointing into an array of all inserted items in their original order. This can reduce the memory overhead even compared to the previous implementation (especially for smaller tables), but because the array for items is preallocated whenever the table is resized, it can be worse for certain numbers of items. However, avoiding all the allocations required by the previous design is actually a big advantage. Depending on the usage pattern, the performance can improve quite a bit (in particular when inserting many items). The raw lookup performance is a bit slower as probing lengths increase with open addressing, but there are some caching benefits due to the compact storage. So for general usage the performance should be better. For instance, one test I did was counting the occurrences of words in a list of 1'000'000 randomly selected words from a dictionary of ~58'000 words (i.e. using a counter stored under each word as key). The new implementation was ~8% faster on average while requiring 10% less memory. Since we can't remove items from the array (would change the indices of all items that follow it) we just mark them as removed and remove them once the hash table is resized/rehashed (the cells in the hash table for these may be reused). Due to this the latter may also happen if the number of stored items does not increase e.g. after a series of remove/put operations (each insertion requires storage in the array, no matter if items were removed). So if the capacity is exhausted, the table is resized/rehashed (after lots of removals the size may even be reduced) and all items marked as removed are simply skipped. 
Compared to the previous implementation the load factor/capacity is lowered to reduce chances of collisions and to avoid primary clustering to some degree. However, the latter in particular, but the open addressing scheme in general, make this implementation completely unsuited for the get_match() functionality (purposefully hashing to the same value and, therefore, increasing the probing length and clustering). And keeping the keys optionally sorted would complicate the code significantly. So we just keep the existing hashlist_t implementation without adding code to maintain the overall insertion order (we could add that feature optionally later, but with the mentioned overhead for one or two pointers). The maximum size is currently not changed. With the new implementation this translates to a hard limit for the maximum number of items that can be held in the table (=CAPACITY(MAX_SIZE)). Since this equals 715'827'882 items with the current settings, this shouldn't be a problem in practice, the table alone would require 20 GiB in memory for that many items. The hashlist_t implementation doesn't have that limitation due to the overflow lists (it can store beyond it's capacity) but it itself would require over 29 GiB of memory to hold that many items.
2020-04-24 13:51:17 +00:00
* Find the pair with the given key, optionally returns the hash and first empty
* or previously used row if the key is not found.
2008-12-03 09:32:16 +00:00
*/
hashtable: Maintain insertion order when enumerating With the previous approach we'd require at least an additional pointer per item to store them in a list (15-18% increase in the overhead per item). Instead we switch from handling collisions with overflow lists to an open addressing scheme and store the actual table as variable-sized indices pointing into an array of all inserted items in their original order. This can reduce the memory overhead even compared to the previous implementation (especially for smaller tables), but because the array for items is preallocated whenever the table is resized, it can be worse for certain numbers of items. However, avoiding all the allocations required by the previous design is actually a big advantage. Depending on the usage pattern, the performance can improve quite a bit (in particular when inserting many items). The raw lookup performance is a bit slower as probing lengths increase with open addressing, but there are some caching benefits due to the compact storage. So for general usage the performance should be better. For instance, one test I did was counting the occurrences of words in a list of 1'000'000 randomly selected words from a dictionary of ~58'000 words (i.e. using a counter stored under each word as key). The new implementation was ~8% faster on average while requiring 10% less memory. Since we can't remove items from the array (would change the indices of all items that follow it) we just mark them as removed and remove them once the hash table is resized/rehashed (the cells in the hash table for these may be reused). Due to this the latter may also happen if the number of stored items does not increase e.g. after a series of remove/put operations (each insertion requires storage in the array, no matter if items were removed). So if the capacity is exhausted, the table is resized/rehashed (after lots of removals the size may even be reduced) and all items marked as removed are simply skipped. 
Compared to the previous implementation the load factor/capacity is lowered to reduce chances of collisions and to avoid primary clustering to some degree. However, the latter in particular, but the open addressing scheme in general, make this implementation completely unsuited for the get_match() functionality (purposefully hashing to the same value and, therefore, increasing the probing length and clustering). And keeping the keys optionally sorted would complicate the code significantly. So we just keep the existing hashlist_t implementation without adding code to maintain the overall insertion order (we could add that feature optionally later, but with the mentioned overhead for one or two pointers). The maximum size is currently not changed. With the new implementation this translates to a hard limit for the maximum number of items that can be held in the table (=CAPACITY(MAX_SIZE)). Since this equals 715'827'882 items with the current settings, this shouldn't be a problem in practice, the table alone would require 20 GiB in memory for that many items. The hashlist_t implementation doesn't have that limitation due to the overflow lists (it can store beyond it's capacity) but it itself would require over 29 GiB of memory to hold that many items.
2020-04-24 13:51:17 +00:00
static inline pair_t *find_key(private_hashtable_t *this, const void *key,
u_int *out_hash, u_int *out_row)
2008-12-03 09:32:16 +00:00
{
pair_t *pair;
u_int hash, row, p = 0, removed, index;
bool found_removed = FALSE;
if (!this->count && !out_hash && !out_row)
{
return NULL;
}
lookup_start();
hash = this->hash(key);
row = hash & this->mask;
index = get_index(this, row);
while (index)
{
lookup_probing();
pair = &this->items[index-1];
if (!pair->key)
{
if (!found_removed && out_row)
{
removed = row;
found_removed = TRUE;
}
}
else if (pair->hash == hash && this->equals(key, pair->key))
{
lookup_success(&this->profile);
return pair;
}
row = get_next(this, row, &p);
index = get_index(this, row);
}
if (out_hash)
{
*out_hash = hash;
}
if (out_row)
{
*out_row = found_removed ? removed : row;
}
lookup_failure(&this->profile);
return NULL;
}
/**
 * Helper to insert a new item into the table and items array, returns its
 * new index into the latter.
*/
static inline u_int insert_item(private_hashtable_t *this, u_int row)
{
u_int index = this->items_count++;
/* we use 0 to mark unused buckets, so increase the index */
set_index(this, row, index + 1);
return index;
}
/**
* Resize the hash table to the given size and rehash all the elements,
* size may be smaller or even the same (e.g. if it's necessary to clear
* previously used buckets).
*/
static bool rehash(private_hashtable_t *this, u_int size)
{
pair_t *old_items, *pair;
u_int old_count, i, p, row, index;
if (size > MAX_SIZE)
{
return FALSE;
}
old_items = this->items;
old_count = this->items_count;
free(this->table);
init_hashtable(this, size);
/* no need to do anything if the table is empty and we are just cleaning
* up previously used items */
if (this->count)
{
for (i = 0; i < old_count; i++)
{
pair = &old_items[i];
if (pair->key)
{
				/* find a free cell for this pair, no need to compare keys
				 * as they are unique while rehashing */
				row = pair->hash & this->mask;
				index = get_index(this, row);
				for (p = 0; index;)
{
row = get_next(this, row, &p);
index = get_index(this, row);
}
index = insert_item(this, row);
this->items[index] = *pair;
}
}
}
free(old_items);
return TRUE;
}
METHOD(hashtable_t, put, void*,
private_hashtable_t *this, const void *key, void *value)
{
void *old_value = NULL;
pair_t *pair;
u_int index, hash = 0, row = 0;
	/* once the capacity is exhausted, resize/rehash the table, which also
	 * cleans up items previously marked as removed */
	if (this->items_count >= this->capacity &&
		!rehash(this, this->count * RESIZE_FACTOR))
	{
DBG1(DBG_LIB, "!!! FAILED TO RESIZE HASHTABLE TO %u !!!",
this->count * RESIZE_FACTOR);
return NULL;
}
pair = find_key(this, key, &hash, &row);
	if (pair)
	{
		old_value = pair->value;
		pair->value = value;
		pair->key = key;
		return old_value;
	}

	index = insert_item(this, row);
	this->items[index] = (pair_t){
		.hash = hash,
		.key = key,
		.value = value,
	};
	this->count++;
	profile_count(&this->profile, this->count);
	return NULL;
}
METHOD(hashtable_t, get, void*,
	private_hashtable_t *this, const void *key)
{
	pair_t *pair = find_key(this, key, NULL, NULL);

	return pair ? pair->value : NULL;
}
/**
 * Remove the given item from the table, returning the value stored for it.
 */
static void *remove_internal(private_hashtable_t *this, pair_t *pair)
{
	void *value = NULL;

	if (pair)
	{	/* this does not decrease the number of used slots in the items
		 * array, as previously used items are kept until the table is
		 * rehashed/resized */
		value = pair->value;
		pair->key = NULL;
		this->count--;
	}
	return value;
}
METHOD(hashtable_t, remove_, void*,
	private_hashtable_t *this, const void *key)
{
	pair_t *pair = find_key(this, key, NULL, NULL);

	return remove_internal(this, pair);
}
METHOD(hashtable_t, remove_at, void,
	private_hashtable_t *this, private_enumerator_t *enumerator)
{
	if (enumerator->table == this && enumerator->index)
	{	/* the index is already advanced by one */
		u_int index = enumerator->index - 1;

		remove_internal(this, &this->items[index]);
	}
}
METHOD(hashtable_t, get_count, u_int,
	private_hashtable_t *this)
{
	return this->count;
}
METHOD(enumerator_t, enumerate, bool,
private_enumerator_t *this, va_list args)
2008-12-03 09:32:16 +00:00
{
const void **key;
void **value;
hashtable: Maintain insertion order when enumerating With the previous approach we'd require at least an additional pointer per item to store them in a list (15-18% increase in the overhead per item). Instead we switch from handling collisions with overflow lists to an open addressing scheme and store the actual table as variable-sized indices pointing into an array of all inserted items in their original order. This can reduce the memory overhead even compared to the previous implementation (especially for smaller tables), but because the array for items is preallocated whenever the table is resized, it can be worse for certain numbers of items. However, avoiding all the allocations required by the previous design is actually a big advantage. Depending on the usage pattern, the performance can improve quite a bit (in particular when inserting many items). The raw lookup performance is a bit slower as probing lengths increase with open addressing, but there are some caching benefits due to the compact storage. So for general usage the performance should be better. For instance, one test I did was counting the occurrences of words in a list of 1'000'000 randomly selected words from a dictionary of ~58'000 words (i.e. using a counter stored under each word as key). The new implementation was ~8% faster on average while requiring 10% less memory. Since we can't remove items from the array (would change the indices of all items that follow it) we just mark them as removed and remove them once the hash table is resized/rehashed (the cells in the hash table for these may be reused). Due to this the latter may also happen if the number of stored items does not increase e.g. after a series of remove/put operations (each insertion requires storage in the array, no matter if items were removed). So if the capacity is exhausted, the table is resized/rehashed (after lots of removals the size may even be reduced) and all items marked as removed are simply skipped. 
Compared to the previous implementation the load factor/capacity is lowered to reduce chances of collisions and to avoid primary clustering to some degree. However, the latter in particular, but the open addressing scheme in general, make this implementation completely unsuited for the get_match() functionality (purposefully hashing to the same value and, therefore, increasing the probing length and clustering). And keeping the keys optionally sorted would complicate the code significantly. So we just keep the existing hashlist_t implementation without adding code to maintain the overall insertion order (we could add that feature optionally later, but with the mentioned overhead for one or two pointers). The maximum size is currently not changed. With the new implementation this translates to a hard limit for the maximum number of items that can be held in the table (=CAPACITY(MAX_SIZE)). Since this equals 715'827'882 items with the current settings, this shouldn't be a problem in practice, the table alone would require 20 GiB in memory for that many items. The hashlist_t implementation doesn't have that limitation due to the overflow lists (it can store beyond it's capacity) but it itself would require over 29 GiB of memory to hold that many items.
	pair_t *pair;

	VA_ARGS_VGET(args, key, value);

	while (this->index < this->table->items_count)
	{
		pair = &this->table->items[this->index++];
		if (pair->key)
		{
			if (key)
			{
				*key = pair->key;
			}
			if (value)
			{
				*value = pair->value;
			}
			return TRUE;
		}
	}
	return FALSE;
}
METHOD(hashtable_t, create_enumerator, enumerator_t*,
	private_hashtable_t *this)
{
	private_enumerator_t *enumerator;

	INIT(enumerator,
		.enumerator = {
			.enumerate = enumerator_enumerate_default,
			.venumerate = _enumerate,
			.destroy = (void*)free,
		},
		.table = this,
	);
	return &enumerator->enumerator;
}
2013-08-27 14:37:41 +00:00
static void destroy_internal(private_hashtable_t *this,
void (*fn)(void*,const void*))
2008-12-03 09:32:16 +00:00
{
hashtable: Maintain insertion order when enumerating With the previous approach we'd require at least an additional pointer per item to store them in a list (15-18% increase in the overhead per item). Instead we switch from handling collisions with overflow lists to an open addressing scheme and store the actual table as variable-sized indices pointing into an array of all inserted items in their original order. This can reduce the memory overhead even compared to the previous implementation (especially for smaller tables), but because the array for items is preallocated whenever the table is resized, it can be worse for certain numbers of items. However, avoiding all the allocations required by the previous design is actually a big advantage. Depending on the usage pattern, the performance can improve quite a bit (in particular when inserting many items). The raw lookup performance is a bit slower as probing lengths increase with open addressing, but there are some caching benefits due to the compact storage. So for general usage the performance should be better. For instance, one test I did was counting the occurrences of words in a list of 1'000'000 randomly selected words from a dictionary of ~58'000 words (i.e. using a counter stored under each word as key). The new implementation was ~8% faster on average while requiring 10% less memory. Since we can't remove items from the array (would change the indices of all items that follow it) we just mark them as removed and remove them once the hash table is resized/rehashed (the cells in the hash table for these may be reused). Due to this the latter may also happen if the number of stored items does not increase e.g. after a series of remove/put operations (each insertion requires storage in the array, no matter if items were removed). So if the capacity is exhausted, the table is resized/rehashed (after lots of removals the size may even be reduced) and all items marked as removed are simply skipped. 
Compared to the previous implementation the load factor/capacity is lowered to reduce chances of collisions and to avoid primary clustering to some degree. However, the latter in particular, but the open addressing scheme in general, make this implementation completely unsuited for the get_match() functionality (purposefully hashing to the same value and, therefore, increasing the probing length and clustering). And keeping the keys optionally sorted would complicate the code significantly. So we just keep the existing hashlist_t implementation without adding code to maintain the overall insertion order (we could add that feature optionally later, but with the mentioned overhead for one or two pointers). The maximum size is currently not changed. With the new implementation this translates to a hard limit for the maximum number of items that can be held in the table (=CAPACITY(MAX_SIZE)). Since this equals 715'827'882 items with the current settings, this shouldn't be a problem in practice, the table alone would require 20 GiB in memory for that many items. The hashlist_t implementation doesn't have that limitation due to the overflow lists (it can store beyond it's capacity) but it itself would require over 29 GiB of memory to hold that many items.
2020-04-24 13:51:17 +00:00
pair_t *pair;
u_int i;
2009-10-26 15:08:14 +00:00
hashtable: Maintain insertion order when enumerating With the previous approach we'd require at least an additional pointer per item to store them in a list (15-18% increase in the overhead per item). Instead we switch from handling collisions with overflow lists to an open addressing scheme and store the actual table as variable-sized indices pointing into an array of all inserted items in their original order. This can reduce the memory overhead even compared to the previous implementation (especially for smaller tables), but because the array for items is preallocated whenever the table is resized, it can be worse for certain numbers of items. However, avoiding all the allocations required by the previous design is actually a big advantage. Depending on the usage pattern, the performance can improve quite a bit (in particular when inserting many items). The raw lookup performance is a bit slower as probing lengths increase with open addressing, but there are some caching benefits due to the compact storage. So for general usage the performance should be better. For instance, one test I did was counting the occurrences of words in a list of 1'000'000 randomly selected words from a dictionary of ~58'000 words (i.e. using a counter stored under each word as key). The new implementation was ~8% faster on average while requiring 10% less memory. Since we can't remove items from the array (would change the indices of all items that follow it) we just mark them as removed and remove them once the hash table is resized/rehashed (the cells in the hash table for these may be reused). Due to this the latter may also happen if the number of stored items does not increase e.g. after a series of remove/put operations (each insertion requires storage in the array, no matter if items were removed). So if the capacity is exhausted, the table is resized/rehashed (after lots of removals the size may even be reduced) and all items marked as removed are simply skipped. 
Compared to the previous implementation the load factor/capacity is lowered to reduce chances of collisions and to avoid primary clustering to some degree. However, the latter in particular, but the open addressing scheme in general, make this implementation completely unsuited for the get_match() functionality (purposefully hashing to the same value and, therefore, increasing the probing length and clustering). And keeping the keys optionally sorted would complicate the code significantly. So we just keep the existing hashlist_t implementation without adding code to maintain the overall insertion order (we could add that feature optionally later, but with the mentioned overhead for one or two pointers). The maximum size is currently not changed. With the new implementation this translates to a hard limit for the maximum number of items that can be held in the table (=CAPACITY(MAX_SIZE)). Since this equals 715'827'882 items with the current settings, this shouldn't be a problem in practice, the table alone would require 20 GiB in memory for that many items. The hashlist_t implementation doesn't have that limitation due to the overflow lists (it can store beyond it's capacity) but it itself would require over 29 GiB of memory to hold that many items.
2020-04-24 13:51:17 +00:00
profiler_cleanup(&this->profile, this->count, this->size);
hashtable: Maintain insertion order when enumerating With the previous approach we'd require at least an additional pointer per item to store them in a list (15-18% increase in the overhead per item). Instead we switch from handling collisions with overflow lists to an open addressing scheme and store the actual table as variable-sized indices pointing into an array of all inserted items in their original order. This can reduce the memory overhead even compared to the previous implementation (especially for smaller tables), but because the array for items is preallocated whenever the table is resized, it can be worse for certain numbers of items. However, avoiding all the allocations required by the previous design is actually a big advantage. Depending on the usage pattern, the performance can improve quite a bit (in particular when inserting many items). The raw lookup performance is a bit slower as probing lengths increase with open addressing, but there are some caching benefits due to the compact storage. So for general usage the performance should be better. For instance, one test I did was counting the occurrences of words in a list of 1'000'000 randomly selected words from a dictionary of ~58'000 words (i.e. using a counter stored under each word as key). The new implementation was ~8% faster on average while requiring 10% less memory. Since we can't remove items from the array (would change the indices of all items that follow it) we just mark them as removed and remove them once the hash table is resized/rehashed (the cells in the hash table for these may be reused). Due to this the latter may also happen if the number of stored items does not increase e.g. after a series of remove/put operations (each insertion requires storage in the array, no matter if items were removed). So if the capacity is exhausted, the table is resized/rehashed (after lots of removals the size may even be reduced) and all items marked as removed are simply skipped. 
Compared to the previous implementation the load factor/capacity is lowered to reduce chances of collisions and to avoid primary clustering to some degree. However, the latter in particular, but the open addressing scheme in general, make this implementation completely unsuited for the get_match() functionality (purposefully hashing to the same value and, therefore, increasing the probing length and clustering). And keeping the keys optionally sorted would complicate the code significantly. So we just keep the existing hashlist_t implementation without adding code to maintain the overall insertion order (we could add that feature optionally later, but with the mentioned overhead for one or two pointers). The maximum size is currently not changed. With the new implementation this translates to a hard limit for the maximum number of items that can be held in the table (=CAPACITY(MAX_SIZE)). Since this equals 715'827'882 items with the current settings, this shouldn't be a problem in practice, the table alone would require 20 GiB in memory for that many items. The hashlist_t implementation doesn't have that limitation due to the overflow lists (it can store beyond it's capacity) but it itself would require over 29 GiB of memory to hold that many items.
2020-04-24 13:51:17 +00:00
if (fn)
2008-12-03 09:32:16 +00:00
{
hashtable: Maintain insertion order when enumerating With the previous approach we'd require at least an additional pointer per item to store them in a list (15-18% increase in the overhead per item). Instead we switch from handling collisions with overflow lists to an open addressing scheme and store the actual table as variable-sized indices pointing into an array of all inserted items in their original order. This can reduce the memory overhead even compared to the previous implementation (especially for smaller tables), but because the array for items is preallocated whenever the table is resized, it can be worse for certain numbers of items. However, avoiding all the allocations required by the previous design is actually a big advantage. Depending on the usage pattern, the performance can improve quite a bit (in particular when inserting many items). The raw lookup performance is a bit slower as probing lengths increase with open addressing, but there are some caching benefits due to the compact storage. So for general usage the performance should be better. For instance, one test I did was counting the occurrences of words in a list of 1'000'000 randomly selected words from a dictionary of ~58'000 words (i.e. using a counter stored under each word as key). The new implementation was ~8% faster on average while requiring 10% less memory. Since we can't remove items from the array (would change the indices of all items that follow it) we just mark them as removed and remove them once the hash table is resized/rehashed (the cells in the hash table for these may be reused). Due to this the latter may also happen if the number of stored items does not increase e.g. after a series of remove/put operations (each insertion requires storage in the array, no matter if items were removed). So if the capacity is exhausted, the table is resized/rehashed (after lots of removals the size may even be reduced) and all items marked as removed are simply skipped. 
Compared to the previous implementation the load factor/capacity is lowered to reduce chances of collisions and to avoid primary clustering to some degree. However, the latter in particular, but the open addressing scheme in general, make this implementation completely unsuited for the get_match() functionality (purposefully hashing to the same value and, therefore, increasing the probing length and clustering). And keeping the keys optionally sorted would complicate the code significantly. So we just keep the existing hashlist_t implementation without adding code to maintain the overall insertion order (we could add that feature optionally later, but with the mentioned overhead for one or two pointers). The maximum size is currently not changed. With the new implementation this translates to a hard limit for the maximum number of items that can be held in the table (=CAPACITY(MAX_SIZE)). Since this equals 715'827'882 items with the current settings, this shouldn't be a problem in practice, the table alone would require 20 GiB in memory for that many items. The hashlist_t implementation doesn't have that limitation due to the overflow lists (it can store beyond it's capacity) but it itself would require over 29 GiB of memory to hold that many items.
2020-04-24 13:51:17 +00:00
for (i = 0; i < this->items_count; i++)
2008-12-03 09:32:16 +00:00
{
hashtable: Maintain insertion order when enumerating With the previous approach we'd require at least an additional pointer per item to store them in a list (15-18% increase in the overhead per item). Instead we switch from handling collisions with overflow lists to an open addressing scheme and store the actual table as variable-sized indices pointing into an array of all inserted items in their original order. This can reduce the memory overhead even compared to the previous implementation (especially for smaller tables), but because the array for items is preallocated whenever the table is resized, it can be worse for certain numbers of items. However, avoiding all the allocations required by the previous design is actually a big advantage. Depending on the usage pattern, the performance can improve quite a bit (in particular when inserting many items). The raw lookup performance is a bit slower as probing lengths increase with open addressing, but there are some caching benefits due to the compact storage. So for general usage the performance should be better. For instance, one test I did was counting the occurrences of words in a list of 1'000'000 randomly selected words from a dictionary of ~58'000 words (i.e. using a counter stored under each word as key). The new implementation was ~8% faster on average while requiring 10% less memory. Since we can't remove items from the array (would change the indices of all items that follow it) we just mark them as removed and remove them once the hash table is resized/rehashed (the cells in the hash table for these may be reused). Due to this the latter may also happen if the number of stored items does not increase e.g. after a series of remove/put operations (each insertion requires storage in the array, no matter if items were removed). So if the capacity is exhausted, the table is resized/rehashed (after lots of removals the size may even be reduced) and all items marked as removed are simply skipped. 
Compared to the previous implementation the load factor/capacity is lowered to reduce chances of collisions and to avoid primary clustering to some degree. However, the latter in particular, but the open addressing scheme in general, make this implementation completely unsuited for the get_match() functionality (purposefully hashing to the same value and, therefore, increasing the probing length and clustering). And keeping the keys optionally sorted would complicate the code significantly. So we just keep the existing hashlist_t implementation without adding code to maintain the overall insertion order (we could add that feature optionally later, but with the mentioned overhead for one or two pointers). The maximum size is currently not changed. With the new implementation this translates to a hard limit for the maximum number of items that can be held in the table (=CAPACITY(MAX_SIZE)). Since this equals 715'827'882 items with the current settings, this shouldn't be a problem in practice, the table alone would require 20 GiB in memory for that many items. The hashlist_t implementation doesn't have that limitation due to the overflow lists (it can store beyond it's capacity) but it itself would require over 29 GiB of memory to hold that many items.
2020-04-24 13:51:17 +00:00
pair = &this->items[i];
if (pair->key)
2013-08-27 14:37:41 +00:00
{
fn(pair->value, pair->key);
}
2008-12-03 09:32:16 +00:00
}
}
free(this->items);
free(this->table);
free(this);
}
METHOD(hashtable_t, destroy, void,
private_hashtable_t *this)
{
destroy_internal(this, NULL);
}

METHOD(hashtable_t, destroy_function, void,
private_hashtable_t *this, void (*fn)(void*,const void*))
{
destroy_internal(this, fn);
}

/*
* Described in header.
*/
hashtable_t *hashtable_create(hashtable_hash_t hash, hashtable_equals_t equals,
u_int size)
{
private_hashtable_t *this;

	INIT(this,
.public = {
.put = _put,
.get = _get,
.remove = _remove_,
.remove_at = (void*)_remove_at,
.get_count = _get_count,
.create_enumerator = _create_enumerator,
.destroy = _destroy,
.destroy_function = _destroy_function,
},
.hash = hash,
.equals = equals,
);
init_hashtable(this, size);
profiler_init(&this->profile, 2);
return &this->public;
}