In this blog post I will elaborate on how a few of the features in daru were implemented. In particular, I will explain what spurred the need for each design.
This post is primarily intended to serve as documentation for me and future contributors. If readers have any suggestions for improving this post, I’d be happy to accept new contributions :)
Index factory architecture
Daru currently supports three types of indexes, Index, MultiIndex and DateTimeIndex.
It became very tedious to write if statements in the Vector or DataFrame codebase whenever a new data structure was to be created, since any of 3 possible indexes could be attached to a data set. Which one was needed depended mainly on the kind of data present in the index: tuples would create a MultiIndex, DateTime objects or date-like strings would create a DateTimeIndex, and everything else would create a Daru::Index.
This looked like the perfect use case for the factory pattern. The only hurdle was that a factory in the pure sense of the term would be a superclass, something called Daru::IndexFactory, that created an Index, DateTimeIndex or MultiIndex using some methods and logic. The problem is that I did not want users to call a separate class for creating indexes. This would break existing code and possibly cause problems in libraries that were already using daru (viz. statsample), not to mention confuse users about which class they’re actually supposed to be using.
The solution came after I read this blog post, which demonstrates that the .new method of any class can be overridden. Thus, instead of going straight to initialize when an instance is created, Ruby calls the overridden new, which can then call initialize to set up an instance of that class. It so happens that you can make new return any object you want, unlike the default new, which always returns an instance of the class it is called on (the return value of initialize is ignored). Thus, for the factory pattern implementation of Daru::Index, we override the .new method of Daru::Index and write logic such that it manufactures the appropriate kind of index based on the data that is passed to Daru::Index.new(data). The pseudo code for doing this looks something like this:
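As a minimal sketch of this idea (not daru's actual source), the following uses hypothetical class names SketchIndex, SketchMultiIndex and SketchDateTimeIndex, and leaves out the detection of date-like strings to keep the rules short:

```ruby
require 'date'

# Hypothetical stand-ins for the specialised index types; not daru's classes.
class SketchMultiIndex
  attr_reader :keys

  def initialize(tuples)
    @keys = tuples
  end
end

class SketchDateTimeIndex
  attr_reader :keys

  def initialize(dates)
    @keys = dates
  end
end

class SketchIndex
  attr_reader :keys

  # Overriding the class-level new turns it into a factory: it inspects
  # the data and decides which kind of object to hand back.
  def self.new(data)
    if !data.empty? && data.all? { |e| e.is_a?(Array) }
      SketchMultiIndex.new(data)
    elsif !data.empty? && data.all? { |e| e.is_a?(DateTime) }
      SketchDateTimeIndex.new(data)
    else
      # The same two steps the default Class#new performs.
      index = allocate
      index.send(:initialize, data)
      index
    end
  end

  def initialize(data)
    @keys = data
  end
end

SketchIndex.new([:a, :b]).class                    # => SketchIndex
SketchIndex.new([[:a, 1], [:b, 2]]).class          # => SketchMultiIndex
SketchIndex.new([DateTime.new(2015, 1, 1)]).class  # => SketchDateTimeIndex
```

Callers keep using a single entry point, so nothing that already calls the base class has to change. The specialised classes are deliberately standalone here, which sidesteps the subclass problem described next.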
Also, since overriding .new affects the subclasses of the class as well (they inherit the overridden method), an inherited hook was added that replaces the overridden .new of each inheriting class with the original one.
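Why the hook is needed can be seen in a small sketch. The names SketchBase, SketchPair and __new__ are hypothetical, and the dispatch rule (a two-element Array becomes a SketchPair) is made up purely for illustration:

```ruby
class SketchBase
  class << self
    # Keep a handle on the original Class#new before it is overridden below.
    alias_method :__new__, :new

    # Ruby calls this hook whenever a class inherits from SketchBase.
    def inherited(subclass)
      super
      # Point the subclass's new back at the original implementation.
      subclass.singleton_class.send(:alias_method, :new, :__new__)
    end

    # Factory version of new, reduced to one trivial rule.
    def new(data)
      return SketchPair.new(data) if data.is_a?(Array) && data.size == 2

      __new__(data)
    end
  end

  attr_reader :data

  def initialize(data)
    @data = data
  end
end

class SketchPair < SketchBase; end

SketchBase.new(:x).class      # => SketchBase
SketchBase.new([1, 2]).class  # => SketchPair
SketchPair.new(:y).class      # => SketchPair
```

Without the hook, SketchPair would inherit the factory version of new, so SketchPair.new([1, 2]) would call itself again and never terminate.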
Working of the where clause
The where clause in daru lets users query data with an Array containing boolean values. Whenever you call where on a Daru::Vector or DataFrame and pass in an Array containing true or false values, all the rows corresponding to true will be returned as a Vector or DataFrame respectively.
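To make the selection rule concrete, here is a small sketch in plain Ruby that does not use daru at all. The helper where_sketch is hypothetical and works on a plain Array rather than a Vector or DataFrame:

```ruby
# Hypothetical helper, not daru's implementation: keep the elements whose
# entry in the boolean mask is true.
def where_sketch(values, mask)
  raise ArgumentError, 'mask and values differ in length' unless values.size == mask.size

  values.select.with_index { |_value, i| mask[i] }
end

where_sketch([10, 20, 30, 40], [true, false, true, false])  # => [10, 30]
```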
Since the where clause works in conjunction with the comparator methods of Daru::Vector (which return a Boolean Array), it was essential that these boolean arrays could be combined, so that piecewise AND and OR operations could be performed between multiple boolean arrays. Hence, the Daru::Core::Query::BoolArray class was created, which is specialized for handling boolean arrays and performing piecewise boolean operations.
The BoolArray defines the #& method for piecewise AND operations and the #| method for piecewise OR operations. They work as follows:
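The piecewise behaviour can be sketched with a small stand-in class. SketchBoolArray is a hypothetical name and this is not the source of Daru::Core::Query::BoolArray, only an illustration of element-by-element AND and OR:

```ruby
# Hypothetical stand-in for a boolean array that supports piecewise
# AND and OR; not daru's BoolArray.
class SketchBoolArray
  attr_reader :flags

  def initialize(flags)
    @flags = flags
  end

  # Piecewise AND: true only where both arrays are true.
  def &(other)
    SketchBoolArray.new(flags.zip(other.flags).map { |a, b| a && b })
  end

  # Piecewise OR: true where at least one array is true.
  def |(other)
    SketchBoolArray.new(flags.zip(other.flags).map { |a, b| a || b })
  end

  def to_a
    flags
  end
end

a = SketchBoolArray.new([true, false, true, false])
b = SketchBoolArray.new([true, true, false, false])

(a & b).to_a  # => [true, false, false, false]
(a | b).to_a  # => [true, true, true, false]
```

Because both operators return a new object of the same kind, the results can be chained, which is what lets several comparisons be combined into a single mask for where.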