Attributes
A cornerstone principle in KDB/Q, integral to query optimization, is the concept of Attributes. Attributes serve as metadata that can be attached to lists of special forms, dictionaries, or table columns (which are basically lists), amplifying data retrieval speed and consequently enhancing query efficiency and performance. By attaching metadata about the implied structure of the list associated with a specific Attribute, the Q interpreter can enact optimizations, resulting in a notable performance boost. This post not only provides a comprehensive explanation of various Attributes and their properties but also emphasizes their strategic utilization and optimal placement in different scenarios.
What are Attributes
As mentioned earlier, Attributes serve as metadata that can be assigned to lists with a special form, dictionaries, and table columns to accelerate retrieval functions and other operations. All Attributes, except for the grouped Attribute, are descriptive rather than prescriptive. This means that applying an Attribute merely asserts that a list has a special form without altering or recreating its structure; this responsibility lies entirely with the developer. KDB/Q features four Attributes: sorted (s), unique (u), parted (p), and grouped (g), and while their names provide a basic understanding, there's more complexity to Attributes than meets the eye. Let's delve deeper into their nuances.
Sorted (`s#
)
In KDB/Q, the default search algorithm is a linear search, which can be quite slow for large datasets since it requires traversing the entire list to find an element. Applying the sorted Attribute (````s#```) to a list, indicates that the elements of the list are in ascending order, thereby replacing the linear search algorithm with a more efficient binary search. Consequently, operations like Find ?
, Equal =
, in, and within become faster with the application of this Attribute.
An additional optimization carried out automatically when applying the sorted Attribute involves the operations executed when applying min and max to a sorted list. Since the list is arranged in ascending order, min
essentially corresponds to the first element, while max
corresponds to the last. Let's examine how the sorted Attribute enhances performance through a few simple examples.
// First we generate a list of numbers from 0 to 100000000
q)x:til 100000000
// We then try to search for the last element in our list and see that this takes 134 milliseconds to do so
q)\t x?last x
134
// Let's now apply the sorted Attribute and repeat the searches
q)s:`s#x
q)\t s?last s
0
As you can see, applying the sorted Attribute to a list speeds up operations significantly. Even though both lists are sorted from the smallest element to the largest (remember, til n
creates a list of numbers from 0 to n-1
), searching for the last element in the list without the sorted Attribute applied takes roughly 0.2 seconds, while the search in the list with the sorted Attribute applied returned immediately.
The sorted Attribute is the only Attribute that adds no memory overhead when being applied to the data
Unique (`u#
)
The unique Attribute, as the name implies, indicates that the elements of a list are distinct. Applying the unique Attribute to a list significantly speeds up the performance of operations such as searching, as KDB/Q can terminate the search upon finding the initial occurrence. When applying the unique Attribute to a list, KDB/Q generates a hash table in the background, mapping each unique element of the list to its corresponding index. This transformation shifts the linear search algorithm to a constant lookup with O(1)
time complexity. As your dataset grows, the unique Attribute yields the most substantial performance enhancement, establishing it as the fastest Attribute for lookups among the four Attributes.
Let's explore what the metadata of the unique Attribute might resemble behind the scenes:
q)show l:-10?10
4 3 8 2 9 5 7 0 1 6
q)group l
4| 0
3| 1
8| 2
2| 3
9| 4
5| 5
7| 6
0| 7
1| 8
6| 9
Attempting to apply the unique Attribute to a list of non-unique elements results in my favorite KDB/Q error of all time: u-fail. It's KDB/Q's unique way of telling you that you that there's still much to learn.
q)`u#1 2 2 3 4
'u-fail
[0] `u#1 2 2 3 4
^
Each of the three Attributes—unique, grouped, and parted—introduces a memory overhead when being applied to a list. Further details on this topic will be covered in a subsequent section of this post.
Parted (`p#
)
Assigning the parted Attribute to a list indicates that elements with the same value are adjacent, appearing next to one another. The following, simple example should illustrate a parted list.
q)p:2 2 3 3 3 4 4 4 9 9 1 1 1 1
While the unique Attribute is the optimal choice for a list without duplicate elements, the parted Attribute can be applied to a list with duplicate values. However, it's essential to note that, to apply the parted Attribute, all elements with the same value must be adjacent to each other. This can be easily achieved by sorting the list.
Let's check this out
// Create a list of 20 random values between 0 and 2
q)20?3
2 1 2 1 1 0 2 2 0 0 0 0 2 2 1 1 2 2 1 1
// sort the list
q)asc 20?3
`s#0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
// Apply the parted Attribute
q)`p#asc 20?3
`p#0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2
As you can see, KDB/Q automatically applies the sorted Attribute when you sort a list (this can be seen by the `s#
symbol in front of the list). However, we can still apply the parted Attribute after we sorted the list using `p#
Like the unique Attribute, when you apply the parted Attribute to a list, KDB/Q generates a hash table in the background. This hash table maps each unique element of the list to the starting index of the block of equal elements. Essentially, it associates the first index of each contiguous block with its corresponding unique element. The following example should provide a clearer understanding of this concept.
// Let's consider the previous example of a list of 20 random numbers between 0 and 3
q)show l:asc 20?3
`s#0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 2 2
q)first each group l
0| 0
1| 6
2| 12
As evident from the above example, we have a list of three numbers repeated multiple times. Sorting the list creates contiguous blocks where each element with the same value is adjacent. The parted Attribute will then map the first index of each distinct element to the corresponding element and store this mapping in a hash table.
As you can see, the element with the value 0 starts at index 0, the element with value 1 starts at index 6 and the element with value 2 starts at index 12.
Grouped (`g#
)
The grouped Attribute stands out as the most versatile Attribute, as it can be applied to any list, regardless of its structure. When applying the grouped Attribute, KDB/Q generates a hash table in the background, mapping each index of every distinct element to its corresponding hash value. Due to the maintenance of a hash table with all indexes, the grouped Attribute incurs the highest memory overhead and is most effective when handling data without a specific order. Further details on this can be found in the later section of this post.
Below example should illustrate the functionality of the group Attribute:
q)show l:100?10
6 0 1 4 7 6 8 1 8 3 0 7 6 4 6 5 4 2 5 7 7 7 6 2 9 2 4 4 7 9 3 2 1 8 3 1 7 2 0 8 0 5 8 9 6 6 3 8 6 5 7 7 0 4 5 0 1 7 8 3 2 1 8 7 6 0 7 7 5 5 0 8 9 7 4 7 7 5 9 1 4 3 9 8 9 9 5 4 0 4 3 8 0 7 9 7 8 9 3 9
q)group l
6| 0 5 12 14 22 44 45 48 64
0| 1 10 38 40 52 55 65 70 88 92
1| 2 7 32 35 56 61 79
4| 3 13 16 26 27 53 74 80 87 89
7| 4 11 19 20 21 28 36 50 51 57 63 66 67 73 75 76 93 95
8| 6 8 33 39 42 47 58 62 71 83 91 96
3| 9 30 34 46 59 81 90 98
5| 15 18 41 49 54 68 69 77 86
2| 17 23 25 31 37 60
9| 24 29 43 72 78 82 84 85 94 97 99
As evident, the grouped Attribute establishes a mapping of every index for each distinct element to its corresponding element. This characteristic makes the grouped Attribute unique as it is the only Attribute that remains consistently maintained when appending elements to the list.
When adding a new element to the end of a sorted list which has the sorted Attribute applied, KDB/Q will only keep the Attribute pesisted if the new element adheres to the sorted property of the list.
q)show l:`s#til 5
`s#0 1 2 3 4
q)show l,:5
`s#0 1 2 3 4 5
q)show l,:2
0 1 2 3 4 5 2
Applying Attributes
Having explored the definition of each Attribute and its functioning, let's delve into something more hands-on and take a look at how Attributes are assigned to a list that aligns with the properties of the corresponding Attribute. Applying an Attribute couldn't be easier, you simply use the character symbol of the respective Attribute and use #
to apply the Attribute. Let's look at some examples
Sorted Attribute
q)show l:`s#til 10
`s#0 1 2 3 4 5 6 7 8 9
q)show l:asc 10?100
`s#22 22 38 50 51 51 52 57 79 81
// Applying the sorted Attribute on a non-sorted list will throws a s-fail error
q)`s#10?100
's-fail
[0] `s#10?100
Sorting a list in ascending order will automatically apply the sorted Attribute
Unique Attribute
/// Generate 10 random numers between 0 and 100 without repetition
q)show l:`u#-10?100
`u#82 69 49 62 97 67 34 0 83
// Applying the unique Attribute to a non-unique list of elements throws a u-fail error
q)`u#2 2 3 4 5 5 6
'u-fail
[0] `u#2 2 3 4 5 5 6
^
Parted Attribute
q)show l:`p#raze 3#'1 2 3
`p#1 1 1 2 2 2 3 3 3
// !!!! Applying the parted Attribute to a non-parted list will also throw a u-fail error !!!!
q)show l:`p#raze 3#'1 2 3 1
'u-fail
[0] show l:`p#raze 3#'1 2 3 1
^
q))\
q)raze 3#'1 2 3 1
1 1 1 2 2 2 3 3 3 1 1
Grouped Attribute
q)show l:`g#10?3
`g#0 0 1 0 0 0 2 1 1 0
The grouped Attribute is the only Attribute that can be applied to any list
Removing Attributes
In order to remove an Attribute, simply use the empty symbol character `
and the the `#
operator
q)l:`s#1 2 2
q)l
`s#1 2 2
q)`#l
1 2 2
Verifying the Attribute applied on a list
You can verify whether a list has an Attribute or not by using the attr
keyword.
q)attr 2 3 4
`
q)attr asc 2 3 4
`s
q)attr `#2 3 4
`
q)attr `s#2 3 4
`s
Applying Attributes to table columns
Given that the individual columns of a table are basically lists, applying an Attribute to a column of a table is as straightforward as applying an Attribute to a list. The syntax however is slightly different and you have to use an update
statement. Note: If you are not familiar with q-sql
and how to query tables I suggest to read Chapter 9: Queries - q-sql of "Q for Mortals" by Jeffry A. Borror.
Let's create a table first
q)table:([] a:10?100; b:10?`3)
q)show table:([] a:10?100; b:10?`3)
a b
------
53 lcb
40 jap
34 glk
49 lmj
32 ajb
43 khi
21 pho
43 kib
20 fod
38 ecj
We then use the meta
keyword to inspect the meta data of our table and see that there is no Attribute (column a) applied.
q)meta table
c| t f a
-| -----
a| j
b| s
You can use meta to inspect the meta data of a table
Sorting a table will automatically apply the parted Attribute rather than the sorted Attribute
q)meta asc table
c| t f a
-| -----
a| j p
b| s
However, we can apply the sorted Attribute to a table with the following update statement:
q)meta t:update `s#a from asc table
c| t f a
-| -----
a| j s
b| s
Operations on lists with Attributes
Because Attributes, except for the grouped Attribute, are rather descriptive than prescriptive, any operation on lists that contradicts the property of the applied Attribute will result in the removal of the Attribute from the list. The sorted, grouped, and unique Attributes are retained in memory when you add a new element to a list, provided the new element adheres to the properties of the Attribute. The parted Attribute on the other hand will always be removed, no matther the operation.
Operations on Attributes in Memory
As explained preserving the applied Attribute when adding elements to a list depends on the new element adhering to the properties of the Attribute. Conversely, removing elements from a list always results in the removal of the Attribute. Naturally, any operation that contravenes the properties of the Attribute will also lead to its removal.
Sorted Lists
q)show l:`s#1 2 3
`s#1 2 3
// Amend an element that maintains the sorted Attribute
q)show l,:4
`s#1 2 3 4
// Amend an element that removes the sorted Attribute
q)show l,:3
1 2 3 4 3
// Create a sorted list again
q)show l:`s#1 2 3 4
`s#1 2 3 4
// Deleting the element at index 1 will remove the sorted Attribute
q)l _ 1
1 3 4
q)l except 2
1 3 4
// Deleting the first element of the list removes the sorted Attribute
q)1_l
2 3 4
// Deleting the last element of the list removes the sorted Attribute
q)-1_l
1 2 3
Unique Lists
q)show l:`u#1 2 3
`u#1 2 3
q)show l,:4
`u#1 2 3 4
q)show l,:4
1 2 3 4 4
q)l except 4
1 2 3
q)l _ 1
1 3 4
q)1_l
2 3 4
q)-1_l
1 2 3
Parted Lists
// Create a parted list
q)show l:`p#raze 3#'1 2 3
`p#1 1 1 2 2 2 3 3 3
// Amending an element to a parted list will remove the Attribute even if the parted structure is maintained
q)show l,:4
1 1 1 2 2 2 3 3 3 4
q)show l:`p#raze 3#'1 2 3
`p#1 1 1 2 2 2 3 3 3
// Deleting elements from a parted list will remove the Atttribute
q)show l _ 1
1 1 2 2 2 3 3 3
q)show l:`p#raze 3#'1 2 3
`p#1 1 1 2 2 2 3 3 3
q)show l _:1
1 1 2 2 2 3 3 3
q)show l:`p#raze 3#'1 2 3
`p#1 1 1 2 2 2 3 3 3
q)show l:l except 2
1 1 1 3 3 3
The parted Attribute is distinctive in that it will be removed irrespective of the operation applied. Even when appending a new element that aligns with the properties of the parted list, the Attribute will still be eliminated.
Grouped Lists
// Create a grouped list
q)show l:`g#10?2
`g#1 1 1 0 1 1 1 1 1 1
// Amending a new element to a grouped list will maintain the grouped Attribute no matter what
q)show l,:1
`g#1 1 1 0 1 1 1 1 1 1 1
q)show l,:2
`g#1 1 1 0 1 1 1 1 1 1 1 2
q)show l _ 1
1 1 0 1 1 1 1 1 1 1 2
q)show l _ 2
1 1 0 1 1 1 1 1 1 1 2
q)show l except 1
0 2
q)show 1_l
1 1 0 1 1 1 1 1 1 1 2
q)show -1_l
1 1 1 0 1 1 1 1 1 1 1
Checking whether the mentioned properties are valid for in-memory tables is a task left for the curious reader.
Operations on Attributes on Disk
Let's now have a look at what behaviour operations on Attributes on Disk have:
Sorted Lists on Disk
q)\pwd
"/Users/Alexander/repos/testing/attributes"
q)\ls
// Create a sorted list and save it to disk
q)`:l set `s#1 2 3
`:l
// Load the content of the variable back into memory
q)get `:l
`s#1 2 3
// Amend a new element that preservers the sorted Attribute and save it to disk
q).[`:l;();,;4]
`:l
// When loading the variable back to memory, we can see that the sorted Attribute was maintained
q)get `:l
`s#1 2 3 4
Memory Overhead of Attributes
While adding Attributes to your structured list can enhance performance significantly, they also introduce some memory overhead. All Attributes except the sorted Attribute will contribute to memory overhead in your processes. Only the sorted Attribute will not introduce any memory overhead. The subsequent section will discuss his memory overhead and should be thoroughly examined for the optimal application of Attributes. Individual testing is recommended to assess whether the performance improvement resulting from the application of an Attribute is justified.
Attribute | Memory overhead in bytes [*] | Notes |
---|---|---|
sorted `s# | 0 | |
unique `u# | 16*n | |
parted `p# | 24*u | |
grouped `g# | (24*u)+4*n | Since Q v3.0 this should be 8 bytes as integer changed from 32bit to 64 bit |
[*] Source: https://code.kx.com/q/wp/data-management/
If you're curious about how we arrived at these figures, allow me to provide an explanation.
The sorted Attribute is clear-cut; it's the sole Attribute that doesn't introduce any memory overhead. However, for the remaining Attributes—unique, parted, and grouped—KDB/Q generates a hash map in the background. Each distinct element gets hashed to a GUID datatype, which occupies 16 bytes. Consequently, the unique Attribute incurs a memory overhead of 16 bytes multiplied by n
, where n
represents the count of unique elements.
The parted Attribute similarly consumes 16 bytes for each unique element u
. However, due to the nature of the parted Attribute, which retains the index of the initial element in a block of contiguous, distinct elements, this index points to a list of uniform values. A pointer to a list requires 8 bytes, thereby leading to the parted Attribute requiring (16+8)
bytes times u
elements.
The grouped Attribute, being the most memory-intensive among all Attributes, demands precisely the same number of bytes as the parted Attribute. However, since the grouped Attribute stores not only the initial index of a continuous list of unique elements but also every index of all unique elements dispersed throughout the list, it consumes an extra 4 bytes per index of each element (as indexes are represented as integers requiring 4 bytes). Consequently, the total memory required for the grouped Attribute is (16+8)*u+4n
.
When to use Attributes
Due to the above mentioned memory overhead, KX does not advise utilizing Attributes on data structures with fewer than 1 million elements. This is also the reason why Attributes are not applied by default to lists that conform to the structure of a respective Attribute. Additionally, as your dataset expands, and KDB/Q is required to maintain a hash table in the background, operations may experience a slowdown since the hash table needs updates whenever a new element is inserted into the list. It is crucial to test your specific use case to determine whether applying an Attribute yields performance benefits. Blindly applying Attributes is discouraged.
Attributes in the KDB-Tick Stack
So far we have we have covered quite a bit of theoretical knowledge about Attributes with some basic applications on simple lists or tables. However, as a KDB/Q developer our task is to build large, efficient systems, that go far beyond the application of simple lists. Let's direct our attention to the KDB/Q Tick setup, exploring where Attributes should be applied and identifying the most suitable Attribute for each of the processes.
To revisit your understanding of the KDB/Q Tick setup, you can review my earlier post available here
Attributes and the Tickerplant
In any KDB/Q Tick setup, the central component is the Tickerplant. Despite the fact that this process stores minimal or no data in memory (and, if it does, it's only for a brief duration), it remains crucial to apply the sorted Attribute to the time
column and the grouped Attribute to the sym
column of each table in the schemas held in memory. This practice is of upmost importance. When a real-time subscriber connects to the Tickerplant, it receives an empty schema for each table it subscribes to. Upon initializing these schemas, the Tickerplant's Attributes are retained within the schemas of the real-time subscriber.
Despite the fact that the Tickerplant receives real-time data in chronological order, it is crucial for us to apply the sorted Attribute to the time
column of each process. As a reminder, applying the sorted Attribute to a list in KDB/Q changes the linear search algorithm to a binary search, resulting in a substantial enhancement in the performance of search queries.
The second Attribute we implement in the table schemas of our Tickerplant is the grouped Attribute. Considering the scattered structure of the data we capture, it is logical to apply the grouped Attribute to the sym
column of our tables. Unlike other Attributes, the grouped Attribute doesn't demand a specific structure for our lists and remains intact when adding new items to the list. Again, this Attribute will be retained in the schemas of every real-time subscriber that subscribes to the tables of the Tickerplant.
Real-time Database (RDB) and other Real-Time Subscribers (RTS)
As previously mentioned, the Real-time Database, along with any other Real-time Subscriber connected to the Tickerplant, will automatically adopt the Attributes applied to the table schemas in the Tickerplant. This results in the sorted Attribute being applied to the time
column of our tables, and the grouped Attribute being applied to the sym
column. Considering the scattered nature of the symbols in our data, any Attribute other than the grouped Attribute would be impractical. Applying the parted Attribute, for instance, would result in losing the Attribute as soon as the second message is received, as the parted Attribute does not persist through updates. Additionally, achieving a parted structure would require sorting the data by symbol after every update, which, given the typical volume of real-time data, would be time-consuming and render the Real-Time Database pretty much useless.
While it is possible to apply more than one grouped Attribute to a table—such as using the grouped Attribute on the sym
column and another grouped Attribute on the exchange
column—I would generally discourage this practice. Primarily, the memory overhead significantly increases with each additional grouped Attribute, as the grouped Attribute incurs the highest memory cost. Moreover, as your dataset expands, the time required to update the hash table maintained by KDB/Q for each grouped Attribute becomes increasingly expensive.
Attributes and the Historical Database (HDB)
The final process we examine is our Historical Database (HDB). Once intraday data is stored on disk, it typically remains unaltered, with no new data appended. Consequently, we can sort our data before saving it to disk and apply the parted Attribute to the sym
column without concern about losing the Attribute during updates.
As the data remains static after being stored on disk, applying extra grouped Attributes to columns other than the sym
poses little to no concern about performance impact as the hash table maintained by KDB/Q is only updates once, specifically during Attribute creation. Additionally, the ample disk memory available compared to process memory eliminates worries about the memory overhead associated with the grouped Attribute.
You might be curious about why we would apply multiple grouped Attributes to a table. In the evaluation of the where clause in a q-sql statement, once the first column is employed to narrow down the search, any subsequent Attributes on the remaining columns are disregarded, and a linear search is used for all remaining columns. Since we may not always know all the use cases of our users' q-sql queries, and they might not necessarily want to query the sym
column as the first column, applying the grouped Attribute to several columns ensures a performance improvement regardless of the column our users choose to query.
Attributes and Joins
The main motivation for applying Attributes is to increase the efficiency and performance of search operations. One operation KDB/Q is particularly famous for are joins, asof and window joins to be more specific. During a join operation, one table's column needs to be searched and matched with the keys of the columns in the table to be matched. This scenario is a perfect use case for our Attributes, as they substantially boost the performance of join operations.
When executing an asof or window join
on in-memory tables, it is crucial to ensure that the grouped Attribute is applied to the sym
column. Conversely, when conducting a join on data that has been persisted to disk, the parted Attribute enhances the performance of the join operation.
You can read everything about joins here
Special Application of Attributes
Step Dictionary
A special use case of the sorted Attribute is when you apply the sorted Attribute to the keys of a dictionary, essentially creating a step dictionary. This modified dictionary now functions as a step function, enabling you to index into it with a key that may not be present, and it will return the value associated with the key immediately preceding the one you provided. Let me illustrate this with an example:
// We first create a dictionary without the sorted Attribute applied
q)d:0 1 5 10 15!`A`B`C`D`E
q)d
0 | A
1 | B
5 | C
10| D
15| E
// If we now index into this dictionary with a key that doesn't exists, you will see the null symbol is returned
q)d 4
`
// Now let's apply the sorted Attribute to the dictionary, creating a step function
q)d:`s#0 1 5 10 15!`A`B`C`D`E
// Et voila, using 4 as the index, returns B, the value of the largest index smaller than the index we passed
q)d 4
`B
// Using an existing index works as expected
q)d 5
`C
// One more example of a non existing index
q)d 13
`D
Sorted Attribute on Keyed Tables
The notion of a step dictionary functions identically for a keyed table. Applying the sorted Attribute to a keyed table results in the same step function behavior as observed in a step dictionary. For those interested in understanding, this similarity arises because a keyed table is essentially a dictionary composed of two simple tables. Let's delve into this explanation.
// Create two simple tables
q)show t1:([] a:0 1 5 10 15)
a
--
0
1
5
10
15
q)show t2:([] b:5?`3)
b
---
bgh
ifn
foh
kdj
eeg
// Create a keyed table as dictionary of two simple tables
q)show kt:t1!t2
a | b
--| ---
0 | bgh
1 | ifn
5 | foh
10| kdj
15| eeg
// verify the type of kt to show that it's effectively a dictionary
q)type kt
99h
// Index into the keyed table with an existing index
q)kt 0
b| bgh
q)kt 5
b| foh
// Index into the keyed table with a non-existing index returns nothing
q)kt 3
b|
// Apply the sorted Attribute to the keyed table to create a step-function behaviour
q)show kts:`s#kt:t1!t2
a | b
--| ---
0 | bgh
1 | ifn
5 | foh
10| kdj
15| eeg
q)meta kts
c| t f a
-| -----
a| j s
b| s
// Index into the keyed table with an existing index
q)kts 0
b| bgh
// Indexing into the keyed table with a non-existing index will now return a result
q)kts 3
b| ifn
q)kts 12
b| kdj
Conclustion
In this post, we took a deep dive into a comprehensive understanding of Attributes, explaining their functionality, application to both simple lists and tables, and the impact of operations on them. We then explored their strategic utilization within a KDB/Q Tick system, emphasizing the importance of sorted,grouped and parted Attributes in enhancing performance. Concluding our exploration, we highlighted two special use cases showcasing the versatility of the sorted Attribute in creating step-function dictionaries and keyed tables. This comprehensive overview equips developers with a nuanced understanding of Attributes and their optimal integration into KDB/Q systems.
Reference:
- Q For Mortals by Jeffry A. Borror
- KDB/Q Reference Card - Set Attribute
- Q Tips by Nick Psaris
- Data Management Whitepaper by Simon Mescal, KX
- Adventure in retrieving memory size of KDB+ Object by Data Intellect