The newest release of daru brings alongwith it added support for time series data analysis, manipulation and visualization.
A time series is any data is indexed (or labelled) by time. This includes the stock market index, prices of crude oil or precious metals, or even geo-locations over a period of time.
The primary manner in which daru implements a time series is by indexing data objects (i.e Daru::Vector or Daru::DataFrame) on a new index called the DateTimeIndex. A DateTimeIndex consists of dates, which can queried individually or sliced.
Introduction
A very basic time series can be created with something like this:
In the above code, the DateTimeIndex.date_range function is creating a DateTimeIndex starting from a particular date and spanning for 1000 periods, with a frequency of 1 day between period. For a complete coverage of DateTimeIndex see this notebook. For an introduction to the date offsets used by daru see this blog post.
The index is passed into the Vector like a normal Daru::Index object.
Statistics functions and plotting for time series
Many functions are avaiable in daru for computing useful statistics and analysis. A brief of summary of statistics methods available on time series is as follows:
Method Name
Description
rolling_mean
Calculate Moving Average
rolling_median
Calculate Moving Median
rolling_std
Calculate Moving Standard Deviation
rolling_variance
Calculate Moving Variance
rolling_max
Calculate Moving Maximum value
rolling_min
Calcuclate moving minimum value
rolling_count
Calculate moving non-missing values
rolling_sum
Calculate moving sum
ema
Calculate exponential moving average
macd
Moving Average Convergence-Divergence
acf
Calculate Autocorrelation Co-efficients of the Series
acvf
Provide the auto-covariance value
To demonstrate, the rolling mean of a Daru::Vector can be computed as follows:
1234567891011121314
require'daru'require'distribution'rng=Distribution::Normal.rngvector=Daru::Vector.new(1000.times.map{rng.call},index:Daru::DateTimeIndex.date_range(:start=>'2012-4-2',:periods=>1000,:freq=>'D'))# Compute the cumulative sumvector=vector.cumsumrolling=vector.rolling_mean60rolling.tail
This time series can be very easily plotted with its rolling mean by using the GnuplotRB gem:
In a manner similar to that done with Vectors above, we can easily plot each Vector of the DataFrame with GNU plot:
12345678910111213
require'gnuplotrb'# Calculate cumulative sum of each Vectordf=df.cumsum# Compute rolling sum of each Vector with a loopback length of 60.r_sum=df.rolling_sum(60)plots=[]r_sum.each_vector_with_indexdo|vec,n|plots<<GnuplotRB::Plot.new([vec,with:'lines',title:n])endGnuplotRB::Multiplot.new(*plots,layout:[3,1],title:'Rolling sums')
Usage with statsample-timeseries
Daru now integrates with statsample-timeseries, a statsample extension that provides many useful statistical analysis tools commonly applied to time series.
Some examples with working examples of daru and statsample-timseries are coming soon. Stay tuned!
Daru’s (Data Analysis in RUby) latest release (0.2.0) brings in a host of new features, most important among them being time series manipulation functionality. In this post, we will go over the date offsets that daru offers, which can be used for creating date indexes of specific intervals. The offsets offer a host of options for easy creation of different intervals and even work with standalone DateTime objects to increase or decrease time.
Offset classes and behaviour
The date offsets are contained in the Daru::Offsets sub-module. A number of classes are offered, each of which implements business logic for advancing or retracting date times by a specific interval.
As you can see in the above example, an hour was added to the time specified by DateTime and returned. All the offset classes work in a similar manner. Following offset classes are available to users:
Offset Class
Description
Daru::DateOffset
Generic offset class
Second
One Second
Minute
One Minute
Hour
One Hour
Day
One Day
Week
One Week. Can be anchored on any week of the day.
Month
One Month.
MonthBegin
Calendar Month Begin.
MonthEnd
Calendar Month End.
Year
One Year.
YearBegin
Calendar Year Begin.
YearEnd
Calendar Year End.
The generic Daru::DateOffset class is used for creating a generic offset by passing the number of intervals you want as the value for a key that describes the type of interval. For example to create an offset of 3 days, you pass the option days: 3 into the Daru::Offset constructor.
On a similar note, the DateOffset class constructor can accept the options :secs, :mins,:hours, :days, :weeks, :months or :years. Optionally, specifying the :n option will tell DateOffset to apply a particular offset more than once. To elaborate:
The specialized offset classes like MonthBegin, YearEnd, etc. all reside inside the Daru::Offsets namespace and can be used by simply calling .new on them. All accept an optional Integer argument that works like the :n option for Daru::DateOffset, i.e it applies the offset multiple times.
To elaborate, consider the YearEnd offset. This offsets the date to the nearest year end after itself:
1234567891011
require'daru'offset=Daru::Offsets::YearEnd.newoffset+DateTime.new(2012,5,1,5,2,1)#=> #<DateTime: 2012-12-31T05:02:01+00:00 ((2456293j,18121s,0n),+0s,2299161j)># Passing an Integer into an Offsets object will apply the offset that many times:offset=Daru::Offsets::MonthBegin.new(3)offset+DateTime.new(2015,3,5)#=> #<DateTime: 2015-06-01T00:00:00+00:00 ((2457175j,0s,0n),+0s,2299161j)>
Of special note is the Week offset. This offset can be ‘anchored’ to any week of the day that you specify. When this is done, the DateTime that is being offset will be offset to that day of the week.
For example, to anchor the Week offset to a Wednesday, pass ‘3’ as a value to the :weekday option:
Likewise, the Week offset can be anchored on any day of the week, by simplying specifying the :weekday option. Indexing for days of the week starts from 0 for Sunday and goes on 6 for Saturday.
Offset string aliases
The most obvious use of date offsets is for creating DateTimeIndex objects with a fixed time interval between each date index. To make creation of indexes easy, each of the offset classes have been linked to certain string alaises, which can directly passed to the DateTimeIndex class.
For example, to create a DateTimeIndex of 100 periods with a frequency of 1 hour between each period:
Likewise all of the above listed offsets can be aliased using strings, which can be used for specifying the offset in a DateTimeIndex index. The string aliases of each offset class are as follows:
Alias String
Offset Class / Description
‘S’
Second
‘M’
Minute
‘H’
Hour
‘D’
Days
‘W’
Default Week. Anchored on SUN.
‘W-SUN’
Week anchored on sunday
‘W-MON’
Week anchored on monday
‘W-TUE’
Week anchored on tuesday
‘W-WED’
Week anchored on wednesday
‘W-THU’
Week anchored on thursday
‘W-FRI’
Week anchored on friday
‘W-SAT’
Week anchored on saturday
‘MONTH’
Month
‘MB’
MonthBegin
‘ME’
MonthEnd
‘YEAR’
Year
‘YB’
YearBegin
‘YE’
YearEnd
See this notebook on daru’s time series functions in order to get a good overview of daru’s time series manipulation functionality.
Note: It so happens that the latest release of rb-gsl does not depend on narray anymore. Hence rb-gsl can be directly used with statsample. However, if you want to use nmatrix with GSL, use gsl-nmatrix.
Statsample is the most comprehensive statistical computation suite in Ruby as of now.
Previously, it so happened that statsample would depend on rb-gsl to speed up a lot of computations. This is great, but the biggest drawback of this approach is that rb-gsl depends on narray, which is incompatible with nmatrix - the numerical storage and linear algebra library from the SciRuby foundation - due to namespace collisions.
NMatrix is used by many current and upcoming ruby scientific gems, most notably daru, mikon, nmatrix-fftw, etc. and the a big hurdle that these gems were facing was that they could not leverage the advanced functionality of rb-gsl or statsample because nmatrix cannot co-exist with narray. On a further note, daru’s DataFrame and Vector data structures are to replace statsample’s Dataset and Vector, so that a dedicated library can be used for data storage and munging and statsample can be made to focus on statistical analysis.
The most promising solution to this problem was that rb-gsl must be made to depend on nmatrix instead of narray. This problem was solved by the gsl-nmatrix gem, which is a port of rb-gsl, but uses nmatrix instead of narray. Gsl-nmatrix also allows conversion of GSL objects to NMatrix and vice versa. Also, latest changes to statsample make it completely independent of GSL, and hence all the methods in statsample are now possible with or without GSL.
To make your installation of statsample work with gsl-nmatrix, follow these instructions:
Install nmatrix and clone, build and install the latest gsl-nmatrix from https://github.com/v0dro/gsl-nmatrix
Clone the latest statsample from https://github.com/SciRuby/statsample
Open the Gemfile of statsample and add the line gem 'gsl-nmatrix', '~>1.17'
Build statsample using rake gem and install the resulting .gem file with gem install.
You should be good able to use statsample with gsl-nmatrix on your system now. To use with rb-gsl, just install rb-gsl from rubygems (gem install rb-gsl) and put gem 'rb-gsl', '~>1.16.0.4' in the Gemfile instead of gsl-nmatrix. This will activate the rb-gsl gem and you can use rb-gsl with statsample.
However please take note that narray and nmatrix cannot co-exist on the same gem list. Therefore, you should have either rb-gsl or gsl-nmatrix installed at a particular time otherwise things will malfunction.
For my GSOC project, I’m trying to build an extension to NMatrix which will interface with a high performance C library for fast linear algebra calculations. Since one of the major problems affecting the usability and portability of NMatrix is the effort taken for installation (adding/removing dependencies etc.), it is imperative to ship the source of this high performance C library alongwith the ruby gem.
This leaves us with quite a few choices about the library that can be used. The most common and obvious interfaces for performing fast linear algebra calculations are LAPACK and BLAS. Thus the library bundled with the nmatrix extension must expose an interface similar to LAPACK and BLAS. Since ruby running on MRI can only interface with libraries having a C interface, the contenders in this regard are CLAPACK or LAPACKE for a LAPACK in C, and openBLAS or ATLAS for a BLAS interface.
I need to choose an appropriate BLAS and LAPACK interface based on its speed and usability, and to do so, I decided to build some quick ruby interfaces to these libraries and benchmark the ?gesv function (used for solving n linear equations in n unknowns) present in all LAPACK interfaces, so as to get an idea of what would be the fastest. This would also test the speed of the BLAS implemetation since LAPACK primarily depends on BLAS for actual computations.
To create these benchmarks, I made a couple of simple ruby gems which linked against the binaries of these libraries. All these gems define a module which contains a method solve_gesv, which calls the C extension that interfaces with the C library. Each library was made in its own little ruby gem so as to nullify any unknown side effects and also to provide more clarity.
To test these libraries against each other, I used the following test code:
123456789101112131415161718
require'benchmark'Benchmark.bmdo|x|x.reportdo10000.timesdoa=NMatrix.new([3,3],[76,25,11,27,89,51,18,60,32],dtype::float64)b=NMatrix.new([3,1],[10,7,43],dtype::float64)NMatrix::CLAPACK.solve_gesv(a,b)# The `NMatrix::CLAPACK` is replaced with NMatrix::LAPACKE # or NMatrix::LAPACKE_ATLAS as per the underlying binding. Read the# source code for more details.endendend
Here I will list the libraries that I used, the functions I interfaced with, the pros and cons of using each of these libraries, and of course the reported benchmarks:
CLAPACK (LAPACK) with openBLAS (BLAS)
CLAPACK is an F2C’d version of the original LAPACK written in FORTRAN. The creators have made some changes by hand because f2c spews out unnecessary code at times, but otherwise its pretty much as fast as the original LAPACK.
To interface with a BLAS implementation, CLAPACK uses a blas wrapper (blaswrap) to generate wrappers to the relevant CBLAS functions exposed by any BLAS implementation. The blaswrap source files and F2C source files are provided with the CLAPACK library.
The BLAS implementation that we’ll be using is openBLAS, which is a very stable and tested BLAS exposing a C interface. It is extremely simple to use and install, and configures itself automatically according to the computer it is being installed upon. It claims to achieve performance comparable to intel MKL, which is phenomenal.
To compile CLAPACK with openBLAS, do the following:
cd to your openBLAS directory and run make NO_LAPACK=1. This will create an openBLAS binary with the object files only for BLAS and CBLAS. LAPACK will not be compiled even though the source is present. This will generate a .a file which has a name that is similar to the processor that your computer uses. Mine was libopenblas_sandybridgep-r0.2.13.a.
Now rename the openBLAS binary file to libopenblas.a so its easier to type and you lessen your chances of mistakes, and copy to your CLAPACK directory.
cd to your CLAPACK directory and open the make.inc file in your editor. In it, you should find a BLASDIR variable that points to the BLAS files to link against. Change the value of this variable to ../../libopenblas.a.
Now run make f2clib to make F2C library. This is needed for interconversion between C and FORTRAN data types.
Then run make lapacklib from the CLAPACK root directory to compile CLAPACK against your specified implementation of CBLAS (openBLAS in this case).
At the end of this process, you should end up with the CLAPACK, F2C and openBLAS binaries in your directory.
Since the automation of this compilation process would take time, I copied these binaries to the gem and wrote the extconf.rb such that they link with these libraries.
On testing this with a ruby wrapper, the benchmarking code listed above yielded the following results:
LAPACKE is the ‘official’ C interface to the FORTRAN-written LAPACK. It consists of two levels; a high level C interface for use with C programs and a low level one that talks to the original FORTRAN LAPACK code. This is not just an f2c’d version of LAPACK, and hence the design of this library is such that it is easy to create a bridge between C and FORTRAN.
For example, C has arrays stored in row-major format while FORTRAN had them column-major. To perform any computation, a matrix needs to be transposed to column-major form first and then be re-transposed to row-major form so as to yield correct results. This needs to be done by the programmer when using CLAPACK, but LAPACKE’s higher level interface accepts arguments (LAPACKE_ROW_MAJOR or LAPACKE_COL_MAJOR) which specify whether the matrices passed to it are in row major or column major format. Thus extra (often unoptimized code) on part of the programmer for performing the tranposes is avoided.
To build binaries of LAPACKE compiled with openBLAS, just cd to your openBLAS source code directory and run make. This will generate a .a file with the binaries for LAPACKE and CBLAS interface of openBLAS.
As you can see these are quite faster than CLAPACK with openBLAS, listed above.
CLAPACK(LAPACK) with ATLAS(BLAS)
This is the combination that is currently in use with nmatrix. It involves installing the libatlas-base-dev package from the Debian repositories. This pacakage will load all the relevant clapack, atlas, blas and cblas binaries into your computer.
This is fast. But a big limitation on using this approach is that the CLAPACK library exposed by the libatlas-base-dev is outdated and no longer maintained. To top it all, it does not have all the functions that a LAPACK library is supposed to have.
LAPACKE(LAPACK) with ATLAS(BLAS)
For this test case I compiled LAPACKE (downloaded from netlib) with an ATLAS implementation from the Debian repositories. I then included the generated static libraries in the sample ruby gem and compiled the gem against those.
To do this on your machine:
* Install the package libatlas-base-dev with your package manager. This will install the ATLAS and CBLAS shared objects onto your system.
* cd to the lapack library and in the make.inc file change the BLASLIB = -lblas -lcblas -latlas. Then run make. This will compile LAPACK with ATLAS installed on your system.
* Then cd to the lacpack/lapacke folder and run make.
Again the function chosen was LAPACKE_?gesv. This test should tell us a great deal about the speed differences between openBLAS and ATLAS, since tranposing overheads are handled by LAPACKE and no Ruby code is interfering with the benchmarks.
As you can see from the benchmarks above, the approach followed by nmatrix currently (CLAPACK with ATLAS) is the fastest, but this approach has certain limitations:
Requires installation of tedious to install dependencies.
Many pacakages offer the same binaries, causing confusion.
CLAPACK library is outdated and not maintained any longer.
ATLAS-CLAPACK does not expose all the functions present in LAPACK.
The LAPACKE-openBLAS and the LAPACKE-ATLAS, though a little slower(~10-20 ms), offer a HUGE advantage over CLAPACK-ATLAS, viz. :
LAPACKE is the ‘standard’ C interface to the LAPACK libraries and is actively maintained, with regular release cycles.
LAPACKE is compatible with intel’s MKL, in case a future need arises.
LAPACKE bridges the differences between C and FORTRAN with a well thought out interface.
LAPACKE exposes the entire LAPACK interface.
openBLAS is trivial to install.
ATLAS is a little non-trivial to install but is fast.
For a further explanation of the differences between these CBLAS, CLAPACK and LAPACKE, read this blog post.
I’ve just released daru version 0.0.5, which brings in a lot of new features and consolidates existing ones. NMatrix is now well integrated into Daru and all of the operations that can be performed using Arrays as the underlying implementation can be performed using NMatrix as well (except some operations involving missing data).
The new features include extensive support for missing data, hierarchial sorting of data frames and vectors by preserving indexing, ability to group, split and aggregate data with group by, and quickly summarizing data by generating excel-style pivot tables. This release also includes new aritmetic and statistical functions on Data Frames and Vectors. Both DataFrame and Vector are now mostly compatible with statsample, allowing for a much larger scope of statistical analysis by leveraging the methods already provided in statsample.
The interface for interacting with nyaplot for plotting has also been revamped, allowing much greater control on the way graphs are handled by giving direct access to the graph object. A new class for hierarchial indexing of data (called MultiIndex) has also been added, which is immensely useful when grouping/splitting/aggregating data.
Lets look at all these features one by one:
Data Types
You can now either use Ruby Arrays or NMatrix as the underlying implementation. Since NMatrix is fast and makes use of C storage, it is recommended to use nmatrix when dealing with large sets of data. Daru will store any data as Ruby Array unless explicitly specified.
Thus to specify the data type of a Vector use the option :dtype and either supply it with :array or :nmatrix, and if using the NMatrix dtype, you can also specify the C data type that NMatrix will use internall by using the option :nm_dtype and supplying it with one of the NMatrix data types (it currently supports ints, floats, rationals and complex numbers. Check the docs for further details).
As an example, consider creating a Vector which uses NMatrix underneath, and stores data using the :float64 NMatrix data type, which stands for double precision floating point numbers.
Another distinction between types of data that daru offers is :numeric and :object. This is a generic feature for distinguishing numerical data from other types of data (like Strings or DateTime objects) that might be contained inside Vectors or DataFrames. These distinctions are important because statistical and arithmetic operations can only be applied on structures with type numeric.
To query the data structure for its type, use the #type method. If the underlying implemetation is an NMatrix, it will return the NMatrix data type, otherwise for Ruby Arrays, it will be either :numeric or :object.
Thus Daru exposes three methods for querying the type of data:
* #type - Get the generic type of data to know whether numeric computation can be performed on the object. Get the C data type used by nmatrix in case of dtype NMatrix.
* #dtype - Get the underlying data representation (either :array or :nmatrix).
Working with Missing Data
Any data scientist knows how common missing data is in real-life data sets, and to address that need, daru provides a host of functions for this purpose.
This functionality is still in its infancy but should be up to speed soon.
The #is_nil? function will return a Vector object with true if a value is nil and false otherwise.
1234567891011
v=Daru::Vector.new([1,2,3,nil,nil,4],index:[:a,:b,:c,:d,:e,:f])v.is_nil?#=> ##<Daru::Vector:93025420 @name = nil @size = 6 ># nil# a nil# b nil# c nil# d true# e true# f nil
The #nil_positions function returns an Array that contains the indexes of all the nils in the Vector.
1
v.nil_positions#=> [:d, :e]
The #replace_nils functions replaces nils with a supplied value.
12345678910
v.replace_nils69#=> ##<Daru::Vector:92796730 @name = nil @size = 6 ># nil# a 1# b 2# c 3# d 69# e 69# f 4
The statistics functions implemented on Vectors ensure that missing data is not considered during computation and are thus safe to call on missing data.
Hierarchical sorting of DataFrame
It is now possible to use the #sort function on Daru::DataFrame such that sorting happens hierarchically according to the order of the specified vector names.
In case you want to sort according to a certain attribute of the data in a particular vector, for example sort a Vector of strings by length, then you can supply a code block to the :by option of the sort method.
Supply the :ascending option with an Array containing ‘true’ or ‘false’ depending on whether you want the corresponding vector sorted in ascending or descending order.
Vector objects also have a similar sorting method implemented. Check the docs for more details. Indexing is preserved while sorting of both DataFrame and Vector.
Previously plotting with daru required a lot of arguments to be supplied by the user. The interface did not take advatage of Ruby’s blocks, nor did it expose many functionalities of nyaplot. All that changes with this new version, that brings in a new DSL for easy plotting (recommended usage with iruby notebook).
Thus to plot a line graph with data present in a DataFrame:
As you can see, the #plot function exposes the Nyaplot::Plot and Nyaplot::Diagram objects to user after populating them with the relevant data. So the new interface lets experienced users utilize the full power of nyaplot but keeps basic plotting very simple to use for new users or for quick and dirty visualization needs. Unfortunately for now, until a viable solution to interfacing with nyaplot is found, you will need to use the nyaplot API directly.
Refer to this notebook for advanced plotting tutorials.
Statistics and arithmetic on DataFrames.
Daru includes a host of methods for simple statistical analysis on numeric data. You can call mean, std, sum, product, etc. directly on the DataFrame. The corresponding computation is performed on numeric Vectors within the DataFrame, and missing data if any is excluded from the calculation by default.
Apart from that you can use the #describe method to calculate many statistical features of numeric Vectors in one shot and see a summary of statistics for numerical vectors in the DataFrame that is returned. For example,
1
df.describe
The covariance and correlation coeffiecients between the numeric vectors can also be found with #cov and #corr
1234567
df.cov# => # #<Daru::DataFrame:91700830 @name = f5ae5d7e-9fcb-46c8-90ac-a6420c9dc27f @size # = 3># d e f # d 4 8 40 # e 8 16 80 # f 40 80 400
Hierarchial indexing
A new way of hierarchially indexing data has been introduced in version 0.0.5. This is done with the new Daru::MultiIndex class. Hierarchial indexing allows grouping sets of similar data by index and lets you select sub sets of data by specifying an index name in the upper hierarchy.
A MultiIndex can be created by passing a bunch of tuples into the Daru::MultiIndex class. A DataFrame or Vector can be created by passing it a MultiIndex object into the index option. A MultiIndex can be used for determining the order of Vectors in a DataFrame too.
Selecting a top level index from the hierarchy will select all the rows under that name, and return a new DataFrame with just that much data and indexes.
1
df_mi.row[:a]
Alternatively passing the entire tuple will return just that row as a Daru::Vector, indexed according to the column index.
1
df_mi.row[:a,:one,:bar]
Hierachical indexing is especially useful when aggregating or splitting data, or generating data summaries as we’ll see in the following examples.
Splitting and aggregation of data
When dealing with large sets of scattered data, it is often useful to ‘see’ the data grouped according to similar values in a Vector instead of it being scattered all over the place.
The #group_by function does exactly that. For those familiar SQL, #group_by works exactly like the GROUP BY clause, but is much easier since its all Ruby.
The #group_by function will accept one or more Vector names and will scan those vectors for common elements that can be grouped together. In case multiple names are specified it will check for common attributes accross rows.
So for example consider this DataFrame:
12345678910111213141516
df=Daru::DataFrame.new({a:%w{foo bar foo bar foo bar foo foo},b:%w{one one two three two two one three},c:[1,2,3,1,3,6,3,8],d:[11,22,33,44,55,66,77,88]})#<Daru::DataFrame:88462950 @name = 0dbc2869-9a82-4044-b72d-a4ef963401fc @size = 8># a b c d # 0 foo one 1 11 # 1 bar one 2 22 # 2 foo two 3 33 # 3 bar three 1 44 # 4 foo two 3 55 # 5 bar two 6 66 # 6 foo one 3 77 # 7 foo three 8 88
To group this DataFrame by the columns :a and :b, pass them as arguments to the #group_by function, which returns a Daru::Core::GroupBy object.
Calling #groups on the returned GroupBy object returns a Hash with the grouped rows.
To see the first group of each group from this collection, call #first on the grouped variable. Calling #last will return the last member of each group.
123456789
grouped.first#=> a b c d # 1 bar one 2 22 # 3 bar three 1 44 # 5 bar two 6 66 # 0 foo one 1 11 # 7 foo three 8 88 # 2 foo two 3 33
On a similar note #head(n) will return the first n groups and #tail(n) the last n groups.
The #get_group function will select only the rows that a particular group belongs to and return a DataFrame with those rows. The original indexing is ofcourse preserved.
123456
grouped.get_group(["foo","one"])# => # #<Daru::DataFrame:90777050 @name = cdd0afa8-252d-4d07-ad0f-76c7581a492a @size # = 2># a b c d # 0 foo one 1 11 # 6 foo one 3 77
The Daru::Core::GroupBy object contains a bunch of methods for creating summaries of the grouped data. These currently include #mean, #std, #product, #sum, etc. and many more to be added in the future. Calling any of the aggregation methods will create a new DataFrame which will have the index as the group and the aggregated data of the non-group vectors as the corresponding value. Of course this aggregation will apply only to :numeric type Vectors and missing data will not be considered while aggregation.
1
grouped.mean
A hierarchichally indexed DataFrame is returned. Check the GroupBy docs for more aggregation methods.
Generating Excel-style Pivot Tables
You can generate an excel-style pivot table with the #pivot_table function. The levels of the pivot table are stored in MultiIndex objects.
Daru is now completely compatible with statsample and you can now perform all of the functions by just passing it a Daru::DataFrame or Daru::Vector to perform statistical analysis.
Solving systems of linear equations is a very important part of scientific computing (some would say most important), and in this post I will show you how a system of linear equations involving n equations and n unknowns can be solved in Ruby using the NMatrix gem and the methodology that I used for simplyfying the algorithms involved.
This involved solving a system of linear equations using forward substution followed by back substution using the LU factorization of the matrix of co-efficients.
The reduction techniques were quite baffling at first, because I had always solved equations in the traditional way and this was something completely new. I eventually figured it out and also implemented it in NMatrix. Here I will document how I did that. Hopefully, this will be useful to others like me!
I’m assuming that you are familiar with the LU decomposed form of a square matrix. If not, read this resource first.
Throughout this post, I will refer to A as the square matrix of co-efficients, x as the column matrix of unknowns and b as column matrix of right hand sides.
Lets say that the equation you want to solve is represented by:
The basic idea behind an LU decomposition is that a square matrix A can be represented as the product of two matrices L and U, where L is a lower triangular matrix and U is an upper triangular matrix.
Given this, equation (1) can be represented as:
Which we can use for solving the vector y such that:
and then solving:
The LU decomposed matrix is typically carried in a single matrix to reduce storage overhead, and thus the diagonal elements of L are assumed to have a value 1. The diagonal elements of U can have any value.
The reason for breaking down A and first solving for an upper triangular matrix is that the solution of an upper triangular matrix is quite trivial and thus the solution to (2) is found using the technique of forward substitution.
Forward substitution is a technique that involves scanning an upper triangular matrix from top to bottom, computing a value for the top most variable and substituting that value into subsequent variables below it. This proved to be quite intimidating, because according to Numerical Recipes, the whole process of forward substitution can be represented by the following equation:
Figuring out what exactly is going on was quite a daunting task, but I did figure it out eventually and here is how I went about it:
Let L in equation (2) to be the lower part of a 3x3 matrix A (as per (1)). So equation (2) can be represented in matrix form as:
Our task now is calculate the column matrix containing the y unknowns.
Thus by equation (4), each of them can be calculated with the following sets of equations (if you find them confusing just correlate each value with that present in the matrices above and it should be clear):
Its now quite obvious that forward substitution is called so because we start from the topmost row of the matrix and use the value of the variable calculated in that row to calculate the y for the following rows.
Now that we have the solution to equation (2), we can use the values generated in the y column vector to compute x in equation (3). Recall that the matrix U is the upper triangular decomposed part of A (equation (1)). This matrix can be solved using a technique called backward substitution. It is the exact reverse of the forward substitution that we just saw, i.e. the values of the bottom-most variables are calculated first and then substituted into the rows above to calculate subsquent variables above.
The equation describing backward substitution is described in Numerical Recipes as:
Lets try to understand this equation by extending the example we used above to understand forward substitution. To gain a better understanding of this concept, consider the equation (3) written in matrix form (keeping the same 3x3 matrix A):
Using the matrix representation above as reference, equation (5) can be expanded in terms of a 3x3 matrix as:
Looking at the above equations its easy to see how backward substitution can be used to solve for unknown quantities when given a upper triangular matrix of co-efficients, by starting at the lowermost variable and gradually moving upward.
Now that the methodology behind solving sets of linear equations is clear, lets consider a set of 3 linear equations and 3 unknowns and compute the values of the unknown quantities using the nmatrix #solve method.
The #solve method can be called on any nxn square matrix of a floating point data type, and expects its sole argument to be a column matrix containing the right hand sides. It returns a column nmatrix object containing the computed co-efficients.
For this example, consider these 3 equations:
These can be translated to Ruby code by creating an NMatrix only for the co-efficients and another one only for right hand sides:
Interpolation involves predicting the co-ordinates of a point given the co-ordinates of points around it. Interpolation can be done in one or more dimensions. In this article I will give you a brief introduction of one-dimensional interpolation and execute it on a sample data set using the interpolation gem.
One dimensional interpolation involves considering consecutive points along the X-axis with known Y co-ordinates and predicting the Y co-ordinate for a given X co-ordinate.
There are several types of interpolation depending on the number of known points used for predicting the unknown point, and several methods to compute them, each with their own varying accuracy. Methods for interpolation include the classic Polynomial interpolation with Lagrange’s formula or spline interpolation using the concept of spline equations between points.
The spline method is found to be more accurate and hence that is what is used in the interpolation gem.
Common Interpolation Routines
Install the interpolation gem with gem install interpolation. Now lets see a few common interpolation routines and their implementation in Ruby:
Linear Interpolation
This is the simplest kind of interpolation. It involves simply considering two points such that x[j] < num < x[j+1], where num is the unknown point, and considering the slope of the straight line between (x[j], y[j] ) and (x[j+1], y[j+1]), predicts the Y co-ordinate using a simple linear polynomial.
Linear interpolation uses this equation:
Here interpolant is the value of the X co-orinate whose corresponding Y-value needs to found.
Cubic Spline interpolation defines a cubic spline equation for each set of points between the 1st and nth points. Each equation is smooth in its first derivative and continuos in its second derivative.
So for example, if the points on a curve are labelled i, where i = 1..n, the equations representing any two points i and i-1 will look like this:
Cubic spline interpolation involves finding the second derivative of all points , which can then be used for evaluating the cubic spline polynomial, which is a function of x, y and the second derivatives of y.
daru (Data Analysis in RUby) is a ruby gem for performing various data analysis and manipulation tasks in Ruby. It draws inspiration from pandas (python) and aims to be completely cross-compatible between all ruby implementations (MRI/JRuby etc.) yet leverage the individual benefits that each interpreter offers (for example the speed of C in MRI), while offering a simple and powerful API for data analysis, manipulation and visualization.
In this first article on daru, I will show you some aspects of how daru handles data and some operations that can be performed on a real-life data set.
Getting Started
daru consists of two major data structures:
Vector - A named one-dimensional array-like structure.
DataFrame - A named spreadsheet-like two-dimensional frame of data.
A Vector can either be represented by a Ruby Array, NMatrix(MRI) or MDArray(JRuby) internally. This allows for fast data manipulation in native code. Users can change the underlying implementation at will (demonstrated in the next blog post).
Both of these can be indexed by the Daru::Index or Daru::MultiIndex class, which allows us to reference and operate on data by name instead of the traditional numeric indexing, and also perform index-based manipulation, equality and plotting operations.
Vector
The easiest way to create a vector is to simply pass the elements to a Daru::Vector constructor:
123456789101112
v=Daru::Vector.new[23,44,66,22,11]# This will create a Vector object v# => ##<Daru::Vector:78168790 @name = nil @size = 5 ># ni# 0 23# 1 44# 2 66# 3 22# 4 11
Since no name has been specified, the vector is named nil, and since no index has been specified either, a numeric index from 0..4 has been generated for the vector (leftmost column).
A better way to create vectors would be to specify the name and the indexes:
12345678910
sherlock=Daru::Vector.new[3,2,1,1,2],name::sherlock,index:[:pipe,:hat,:violin,:cloak,:shoes]#=> #<Daru::Vector:78061610 @name = sherlock @size = 5 ># sherlock# pipe 3# hat 2# violin 1# cloak 1# shoes 2
This way we can clearly see the quantity of each item possesed by Sherlock.
Data can be retrieved with the [] operator:
1
sherlock[:pipe]#=> 3
DataFrame
A basic DataFrame can be constructed by simply specifying the names of columns and their corresponding values in a hash:
You can also specify an index for the DataFrame alongwith the data and also specify the order in which the vectors should appear. Every vector in the DataFrame will carry the same index as the DataFrame once it has been created.
12345678910
plus_one=Daru::DataFrame.new({a:[1,2,3,4,5],b:[10,20,30,40,50],c:[11,22,33,44,55]},name::plus_one,index:[:a,:e,:i,:o,:u],order:[:c,:a,:b])# => ##<Daru::DataFrame:77605450 @name = plus_one @size = 5># c a b # a 11 1 10 # e 22 2 20 # i 33 3 30 # o 44 4 40 # u 55 5 50
daru will also add nil values to vectors that fall short of elements.
Creating a DataFrame by specifying Vector objects in place of the values in the hash will correctly align the values according to the index of each vector. If a vector is missing an index present in another vector, that index will be added to the vector with the corresponding value set to nil.
123456789101112
a=Daru::Vector.new[1,2,3,4,5],index:[:a,:e,:i,:o,:u]b=Daru::Vector.new[43,22,13],index:[:i,:a,:queen]on_steroids=Daru::DataFrame.new({a:a,b:b},name::on_steroids)#=> #<Daru::DataFrame:75841450 @name = on_steroids @size = 6># a b # a 1 22 # e 2 nil # i 3 43 # o 4 nil # queen nil 13 # u 5 nil
A DataFrame can be constructed from multiple sources:
To construct by columns:
Array of hashes - Where the key of each hash is the name of the column to which the value belongs.
Name-Array Hash - Where the hash key is set as the name of the vector and the data the corresponding value.
Name-Vector Hash - This is the most advanced way of creating a DataFrame. Treats the hash key as the name of the vector. Also aligns the data correctly based on index.
Array of Arrays - Each sub array will be considered as a Vector in the DataFrame.
To construct by rows using the .rows class method:
Array of Arrays - This will treat each sub-array as an independent row.
Array of Vectors - Uses each Vector in the Array as a row of the DataFrame. Sets vector names according to the index of the Vector. Aligns vector elements by index.
Handling Data
Now that you have a basic idea about representing data in daru, lets see some more features of daru by loading some real-life data from a CSV file and performing some operations on it.
For this purpose, we will use iruby notebook, with which daru is compatible. iruby provides a great interface for visualizing and playing around with data. I highly recommend installing it for full utilization of this tutorial.
Loading Data From Files
Let us load some data about the music listening history of one user from this subset of the Last.fm data set:
As you can see the timestamp field is in a somewhat non-Ruby format which is pretty difficult for the default Time class to understand, so we destructively map time zone information (IST in this case) and then change every timestamp string field into a Ruby Time object, so that operations on time can be easily performed.
Notice the syntax for referencing a particular vector. Use ‘row’ for referencing any row.
A bunch of rows can be selected by specifying a range:
df.row[900..923]
Data Analysis
Lets dive deeper by actually trying to extract something useful from the data that we have. Say we want to know the name of the artist heard the maximum number of times. So we create a Vector which consists of the names of the artists as the index and the number of times the name appears in the data as the corresponding values:
12
# Group by artist name and call 'size' to see the number of rows each artist populates.artists=df.group_by(:artname).size
To get the maximum value out of these, use #max_index. This will return a Vector which has the max:
count.max_index
Plotting
daru uses Nyaplot for plotting, which is an optional dependency. Install nyaplot with gem install nyaplot and proceed.
To demonstrate, lets find the top ten artists heard by this user and plot the number of times their songs have been heard against their names in a bar graph. For this, use the #sort function, which will preserve the indexing of the vector.
This post will talk about methods to access different types of matrix elements (diagonals, columns, rows, etc.) when a matrix is expressed as a continguous 1D array.
Recently, I was working on implementing a matrix inversion routine using the Gauss-Jordan elimination technique in C++. This was part of the NMatrix ruby gem, and because of the limitations imposed by trying to interface a dynamic language like Ruby with C++, the elements of the NMatrix object had to expressed as a 1D contiguous C++ array for computation of the inverse.
The in-place Gauss-Jordan matrix inversion technique uses many matrix elements in every pass. Lets see some simple equations that can be used for accessing different types of elements in a matrix in a loop.
Diagonals
Lets say we have a square matrix A with shape M. If k is iterator we are using for going over each diagonal element of the matrix, then the equation will be something like .
A for loop using the equation should look like this:
12345
for(k=0;k<M;++k){cout<<A[k*(M+1)];}// This will print all the diagonal elements of a square matrix.
Rows
To iterate over each element in a given row of a matrix, use . Here row is the fixed row and col goes from 0 to M-1.
Columns
To iterate over each element in a given column of a matrix, use . Here col is the fixed column and row goes from 0 to M-1.
General
In general the equation will yield a matrix element with row index row and column index col.
Speech is the fundamental method of communication between human beings. Everyone within Human civilization, whether literate or illiterate, can communicate with the people around them through speech.
Using a computer can be a scary proposition for most people. It involves GUIs, text, images, video; intangible entities that many first time users are unable to relate to.
In contrast to the rapid evolution of computing, development of modes of communication between human and computer has been painfully slow and has been primarily restricted to text, images, videos and the like.
This is where the idea of Automatic Speech Recognition comes in. It aims to bridge the communication gap between humans and computers and bring it as close as possible to a human-human interaction. In aims to teach a computer the primary method of communication between humans: speech.
To cite Wikipedia, Automatic Speech Recognition is the translation of spoken words into text. Once we have text (which is the most portable method of information transfer), we can do absolutely anything with it.
In this article, we will be gaining a brief overview of Automatic Speech Recognition (ASR), and take a look at a few algorithms that are used for the same. Most of the methods listed here are language neutral, unless explicitly stated. Let us start with how speech is actually produced by a normal Human being.
The primary organ of speech production is the Vocal Tract. The lungs push out the air, which passes through the vocal tract and mouth and then is released into the atmosphere. On its way out of the mouth, the air is manipulated by a series of obstacles present in the vocal tract, nose and mouth. These manipulations in the raw air pushed through the lungs manifest as speech.
Air first passes through the glottis, which is the combination of the vocal folds (vocal cords) and the space in between the folds. It then passes through the mouth, where the tongue plays a major role in overall speech modulation. Factors like constriction of the vocal tract (for /g/ in ‘ghost’), aspirated stops and nasal tones play a major role in modulating the overall sound wave.
The primary organ of speech production is the Vocal Tract. The lungs push out the air, which passes through the vocal tract and mouth and then is released into the atmosphere. On its way out of the mouth, the air is manipulated by a series of obstacles present in the vocal tract, nose and mouth. These manipulations in the raw air pushed through the lungs manifest as speech.
Air first passes through the glottis, which is the combination of the vocal folds (vocal cords) and the space in between the folds. It then passes through the mouth, where the tongue plays a major role in overall speech modulation. Factors like constriction of the vocal tract (for /g/ in ‘ghost’), aspirated stops and nasal tones play a major role in modulating the overall sound wave.
For the purpose of processing speech using computers, there is a need to digitize the signal. When we receive a speech signal in a computer, we first sample the analog signal at a frequency such that the original waveform is completely preserved. We then perform some basic pre-filtering; for example, observations indicate that human speech is the range of 0-4 kHz, so we pass the sampled signal through a low-pass filter to remove any frequecies above 4 kHz.
Before proceeding with the working of an ASR, we make some fundamental assumptions:
* Vocal tract changes shape rather slowly in continuos speech and it can be assumed that the vocal tract has fixed shape and characterestics for 10 ms. Thus on an average, the shape of the vocal tract changes every 10 ms.
* Source of excitation (lungs) and vocal tract are independent of each other.
To extract any meaning from sound, we need to make certain measurements from the sampled wave. Let us explore these one by one:
Zero Crossing Count - This is number of times the signal crosses the zero-line per unit time. This gives an idea of the frequency of the wave per unit time.
Energy - Energy of a signal is represented by the square of each sample of the signal, over the entire duration of the signal.
Pitch period of utterances - It is found that most utterances have a certain ‘pseudo periodicity’ associated with them. This is called the pitch period.
Speech can be classified into two broad categories - VOICED speech(top) and UNVOICED speech(bottom).
Voiced speech is characterized in a signal with many undulations (ups and downs). Voiced signals tend to be louder like the vowels /a/, /e/, /i/, /u/, /o/. Unvoiced speech is more of a high frequency, low energy signal, which makes it difficult to interpret since it is difficult to distinguish it from noise. Unvoiced signals, by contrast, do not entail the use of the vocal cords, for example, /s/, /z/, /f/ and /v/.
A basic ASR will consist of three basic steps -
End Point Detection - Marking the beginning and ending points of the actual utterance of the word in the given speech signal is called End Point Detection.
Phoneme^{1} Segmentation - Segregating individual phonemes from a speech signal is called Phoneme Segmentation.
Phoneme Identification - Recognizing the phoneme present in each phoneme segment of the waveform is called Phoneme Identification.
Every step in the speech recognition process is an intricate algorithm in itself, and over the years, numerous approaches have been suggested by many people. Let us look at a few simple ones:
End Point Detection:
We make use of the Zero Crossing Count and Energy parameters of a sound wave for calculating the end points of an utterance in an input sound wave.- It assumes that the first 100 ms of the speech waveform are noise. Based on this assumption, it comes up with the ZCC and energy of the noise signal, through which it computes the points where the speech segment begins and ends. A detailed discussion would be out of the scope of this article, but those interested can always go through the paper written by Rabiner and Sambur^{2}.
Phoneme Segmentation
This step in the process is the most important step because what Phoneme gets detected from a particular speech waveform is completely dependent on what wave we pass to the Phoneme Recognition algorithm.
The algorithm proposed by Bapat and Nagalkar^{3} functions based on the fact that each phoneme will have a different energy and amplitude, and whenever a variation drastic deviation in these parameters is detected in the sound wave, it is marked as a different phoneme.
Phoneme Recognition
This is by far the most intriguing and researched. Extensive work has been done in this domain, ranging from simple spectral energy analysis of signals, to more complicated Neural Network algorithms. One can find several hypotheses all over the internet regarding this domain. A discussion on these algorithms would get too large, but we will discuss a very simple algorithm which utilises the frequency domain representation of a signal to segregate ‘varnas’ or classes of Phonemes found in the Devnagiri script:
Each class of phonemes in Devnagiri is generated using the same organ but with different air pressure and time of touch for each individual alphabet. This property of Devangiri can be used for detecting only the class of a particular phoneme.
If we divide the entire frequency axis of 4 kHz into 17 bands of ~ 235 Hz each, and observe some sample utterances through this grid, we find that the phonemes of a particular class show peak frequencies in the same band or a very predictable set of 2-3 bands. Taking note of these peaks, one can identify the phoneme class by observing which bands the peaks fall into.
We have discussed some major characterestics and components of an Automatic Speech Recognition engine, and have also seen some interesting facets of digital signals along the way.
It is interesting to note how some basic principles of Digital Signal Processing can be applied to the real world for useful applications.
Phoneme - A phoneme is a basic unit of a language’s phonology, which is combined with other phonemes to form meaningful units such as words. Alternatively, a phoneme is a set of phones or a set of sound features that are thought of as the same element within the phonology of a particular language. ↩