# GSOC 2016 Wrap Up for SciRuby

In the summer of 2016 I was chosen by the SciRuby core team to be admin for SciRuby for Google Summer of Code 2016. GSOC is an important yearly event for us as an organization since it provides a great platform for an upcoming organization like SciRuby and helps us get more users and contributors for the various libraries that we maintain.

This blog post is meant to be a summary of the work that SciRuby did over the summer and also of my experience at the GSOC 2016 mentor’s summit.

## GSOC student work

For the 2016 edition of GSOC we had 4 students - Lokesh Sharma, Prasun Anand, Gaurav Tamba and Rajith Vidanaarachchi. All four were undergraduate computer engineering students from colleges in India or Sri Lanka at the time of GSOC 2016.

Lokesh worked on making improvements to daru, a Ruby DataFrame library. He made very significant contributions to daru by adding functionality for storing and performing operations on categorical data, and also significantly sped up the sorting and grouping functionality of daru. His work has now been successfully integrated into the main branch and has also been released on rubygems. Lokesh has remained active as a daru contributor and regularly contributes code and replies to Pull Requests and issues. You can find a wrap up of the work he did throughout the summer in this blog post.

Prasun worked on creating a Java backend for NMatrix, a Ruby library for performing linear algebra operations similar to numpy in Python. This project opened the doors for scientific computation on JRuby. Prasun was able to complete all his project objectives, and his work is currently awaiting review because of the sheer size of the Pull Request and the variety of changes to the library that he had to make in order to accomplish his project goals. You can read about his summer’s work here. Prasun will also be speaking at Ruby Conf India 2017 about his GSOC work and scientific computing on JRuby in general.

Gaurav worked on creating a Ruby wrapper for NASA’s SPICE toolkit. A need for this was felt since Gaurav’s mentor John is a rocket scientist and was keen on having a Ruby wrapper for a library that he used regularly in his work. This resulted in the spice_rub gem. It exposes a very intuitive Ruby interface to the SPICE toolkit. Gaurav also gave a lightning talk about his work at Deccan Ruby Conf (Pune, India). Blog posts summarizing his work can be found here, here and here.

Rajith worked on growing the Ruby wrapper over symengine. His mentor Abinash was a student with SciRuby for GSOC 2015 and volunteered to mentor Rajith so that Rajith could build upon the work that he had done the previous summer. This resulted in a huge increase in functionality for the symengine.rb ruby gem.

To summarize, all four of our students could execute their chosen tasks within the stipulated time and we did not have to fail anyone. All in all, we mentors had a great time working with the students and hope to keep doing this year on year!

## GSOC 2016 mentor’s summit

The GSOC 2016 mentor’s summit was fantastic. It was great meeting all the contributors and listening to ideas from projects that I had never heard about previously. I also had the opportunity to conduct an unconference session and share my ideas on Scientific Computation in Ruby with like minded people from other organizations.

Here are some photos that I took at the summit:

# Advice for Future GSOC Students and Mentors Based on My Experience.

This year I was the admin for the GSOC projects of the SciRuby foundation. It’s also the first time I mentored a student, having been a student myself last year. Being a mentor is a pretty tough task, and for some reason it is underestimated by many. I was lucky to have the experience and support of co-admins @mohawkjohn and @pjotrp throughout the GSOC period.

GSOC has now come to a close. I have learned a great deal myself in the past 3 months, and thought I would share some of my learnings in this blog post in the interest of future GSOC students and mentors.

Writing a proposal

Research your ideas for at least a day before asking your first question. Mentors are volunteers and it’s important to respect the time and effort that they’re putting into FOSS. When you do propose an idea, you should also have a good knowledge of why you’re working on that idea in the first place and what kind of long term impact the realization of that idea can have. Putting this across through your proposal can have a positive impact on your selection. Know how to ask questions on mailing lists. A properly researched question should show that you have first made an effort to understand the topic and are asking the question only because you got stuck somewhere.

Community bonding

Make sure you figure out what exactly you have to research and learn during the community bonding phase. There are a lot of things out there that can be learned, but only a few specific things will be helpful for your project. Focus on them only. Ask specific questions to your mentor.

Coding

Set up a daily schedule for coding and stick to it. Constantly keep in touch with your mentor and make sure they know your progress as well as you do. If you run into previously unseen problems (frequent in programming), tell your mentor about this ASAP and work out a solution together.

Don’t burn yourself out in your enthusiasm. Take regular breaks. Overworking does more harm than good.

Student selection

Short story: If you’re unsure about a student, don’t select him. It’s better to have quality over quantity.

Long story: First and foremost, it is very important to establish some organization-wide procedure that will be followed when selecting a student. As a start, consider making a proposal template that contains all the information and details that the student needs to fill up when submitting the proposal. Have a look at the SciRuby application template as an example.

When students start asking questions on the mailing list, it is important for the org admins to keep a watch on students and get a rough idea of who asks the better questions and who doesn’t. Community participation is a great measure of whether a student will live up to your expectations or not. A proposal with a bad first draft might just turn out to be great simply because the student is open to feedback and is willing to put in the effort to work on it.

We have 3 rounds. In the first round, every mentor rates their own student only. In the next round, all mentors rate all students (students without a mentor and bad proposals drop off).

Whenever they rate a student, mentors put in a comment, making sure to tell how the student has interacted in the proposal phase, what his current coding looks like and how responsive he is. Mentors can still push their students to do stuff. We like it when students stay responsive in this phase.

In the 3rd round the org admins make the final ranking to set the number of slots. By this stage we are pretty clear about the individuals involved (and note that mentor activity counts). When Google allocates the slots the top-ranked students get in.

Coding period

Make sure you communicate with your student that they are supposed to send you daily updates of their progress. One paragraph about their work that particular day should suffice.

# Setting Up a Lexical Analyser and Parser in Ruby

I wrote this post as I was setting up the lexer and parser for Rubex, a new superset of Ruby that I’m developing.

Let’s see the basic working of a lexical analyser and parser in action with a very simple addition program. Before you start, please make sure rake, oedipus_lex and racc are installed on your computer.

### Configuring the lexical analyser

The most fundamental need of any parser is that it needs string tokens to work with, which we will provide by way of lexical analysis by using the oedipus_lex gem (the logical successor of rexical). Go ahead and create a file lexer.rex with the following code:

In the above code, we have defined the lexical analyser using Oedipus Lex’s syntax inside the AddLexer class. Let’s go over each element of the lexer one by one:

macro

The macro keyword lets you define macros for certain regular expressions that you might need to write repeatedly. In the above lexer, the macro DIGIT is a regular expression (\d+) for detecting one or more digits. We place the regular expression inside forward slashes (/../) because oedipus_lex requires it that way. The lexer can handle any valid Ruby regular expression. See the Ruby docs for details on Ruby regexps.

rule

The section under the rule keyword defines your rules for the lexical analysis. Now it so happens that we’ve defined a macro for detecting digits, and in order to use that macro in the rules, it must be inside a Ruby string interpolation (#{..}). The code to the right of /#{DIGIT}/ states the action that must be taken if such a regular expression is encountered. Thus the lexer will return a Ruby Array that contains :DIGIT as the first element. The second element uses the text variable. This is a reserved variable in lex that holds the text that the lexer has matched. Similarly, the second rule will match any character (.) or a newline (\n) and return an Array with [text, text] inside it.
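To make the two rules more concrete, here is a rough plain-Ruby sketch of what they amount to, written with StringScanner from the standard library. It is not the code that oedipus_lex generates. The tokenize helper and the sample input are made up for illustration, and the DIGIT token is assumed to carry the matched text unchanged.

```ruby
require "strscan"

DIGIT = /\d+/ # plays the role of the DIGIT macro

# Hypothetical helper (not part of oedipus_lex): returns every token in +source+.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens  = []
  until scanner.eos?
    if (text = scanner.scan(/#{DIGIT}/))
      tokens << [:DIGIT, text] # first rule: a run of digits
    elsif (text = scanner.scan(/.|\n/))
      tokens << [text, text]   # second rule: any other single character
    end
  end
  tokens
end

p tokenize("12+3\n")
```

For the input above this prints [[:DIGIT, "12"], ["+", "+"], [:DIGIT, "3"], ["\n", "\n"]]. The order of the branches matters: the digit rule is tried first, so the catch-all rule only sees characters that are not digits.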

inner

Under the inner keyword you can specify any code that you want to occur inside your lexer class. This can be any logic that you want your lexer to execute. The Ruby code under the inner section is copied as-is into the final lexer class. In the above example, we’ve written an empty method called do_parse inside this section. This method is mandatory if you want your lexer to successfully execute. We’ll be coupling the lexer with racc shortly, so unless you want to write your own parsing logic, you should leave this method empty.

### Configuring the parser

In order for our addition program to be successful, it needs to know what to do with the tokens that are generated by the lexer. For this purpose, we need racc, an LALR(1) parser generator for Ruby. It is similar to yacc or bison and lets you specify grammars easily.

Go ahead and create a file called parser.racc in the same folder as the previous lexer.rex and Rakefile, and put the following code inside it:

As you can see, we’ve put the logic for the parser inside the AddParser class. Yacc’s $$ is the result; $0, $1… is an array called val, and $-1, $-2… is an array called _values. Notice that in racc, only the parsing logic exists inside the class and everything else (i.e. under header and inner) exists outside the class. Let’s go over each part of the parser one by one:

This is the core class that contains the parsing logic for the addition parser. Similar to oedipus_lex, it contains a rule section that specifies the grammar. The parser expects tokens in the form of [:TOKEN_NAME, matched_text]. The :TOKEN_NAME must be a symbol. This token name is matched to literal characters in the grammar (DIGIT in the above case). token and expr are variables. Have a look at this introduction to LALR(1) grammars for further information.

The header keyword tells racc what code should be put at the top of the parser that it generates. You usually put your require statements here. In this case, we load the lexer class so that the parser can use it for accessing the tokens generated by the lexer. Notice that header has 4 hyphens (-) and a space before it. This is mandatory if your program is to not malfunction.

inner

The inner keyword tells racc what should be put inside the generated parser class. As you can see there are two methods in the above example - next_token and prepare_parser. The next_token method is mandatory for the parser to function and you must include it in your code. It should contain logic that will return the next token for the parser to consider. Moving on to the prepare_parser method, it takes a file name that is to be parsed as an argument (how we pass that argument in will be seen later), and initializes the lexer. It then calls the parse_file method, which is present in the lexer class by default.

The next_token method in turn uses the @lexer object’s next_token method to get a token generated by the lexer so that it can be used by the parser.
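The pull-based hand-off between parser and lexer can be sketched in plain Ruby. Everything here is hypothetical: ToyLexer and ToyAdder are stand-ins written by hand for this example, not the classes that oedipus_lex and racc generate, and the toy adder simply sums DIGIT tokens without checking the grammar. The one detail taken from racc is the convention that next_token returns a pair whose first element is false at the end of input.

```ruby
# Hypothetical stand-in for the generated lexer: hands out one token per call.
class ToyLexer
  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Returns the next [name, value] pair, or [false, false] once the input
  # is exhausted (racc treats a false token name as end of input).
  def next_token
    @tokens.shift || [false, false]
  end
end

# Hypothetical stand-in for the generated parser. A real racc parser is
# table-driven; this one only adds up the DIGIT tokens it is given.
class ToyAdder
  def initialize(lexer)
    @lexer = lexer
  end

  # Delegates to the lexer, as the inner section described above does.
  def next_token
    @lexer.next_token
  end

  def do_parse
    sum = 0
    loop do
      name, value = next_token
      break unless name # a false name marks the end of input
      sum += value.to_i if name == :DIGIT
    end
    sum
  end
end

tokens = [[:DIGIT, "2"], ["+", "+"], [:DIGIT, "2"]]
puts ToyAdder.new(ToyLexer.new(tokens)).do_parse
```

With the made-up token list above this prints 4. The point is only the direction of control: the parser asks for one token at a time, and the lexer never pushes tokens on its own.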

### Putting it all together

Our lexical analyser and parser are now coupled to work with each other, and we now use them in a Ruby program to parse a file. Create a new file called adder.rb and put the following code in it:

The prepare_parser method is the same one that was defined in the inner section of the parser.racc above. The do_parse method called on the parser will signal the parser to start doing its job.

In a separate file called text.txt put the following text:

Oedipus Lex does not have a command line tool like rexical for generating a lexer from the logic specified, but rather has a bunch of rake tasks defined for doing this job. So now create a Rakefile in the same folder and put this code inside it:

Running rake parser will generate two new files - lexer.rex.rb and parser.racc.rb - which will house the classes and logic for the lexer and parser, respectively. You can use your newly written lexer + parser with a ruby adder.rb text.txt command. It should output 4 as the answer.

You can find all the code in this blogpost here.

# Random Thoughts on Music Theory.

Title explains what this is about.

### 16 August 2016

Was checking out this video (Contortionist - Language 1) and learned about standard C# tuning on a 6 string bass guitar today. He’s used tuning G# C# F# B E A. Killer bass tone. This wiki says something different about C# standard, though.

### 30 November 2016

Trying out some interval training with this video today. Supposed to be really good.

So there are two types of intervals: harmonic and melodic. Harmonic is when two or more notes are played at a time and melodic is when two or more notes are played separately.

Intervals are described by some properties:

• Quality: Whether it is perfect, major, minor, augmented or diminished. Perfect intervals become augmented if they’re raised by a half step and diminished if they are lowered by a half step. If perfect intervals are inverted, they remain perfect intervals. So a perfect fifth inverted becomes a perfect fourth, and vice versa. Minor or major intervals can become augmented or diminished but never perfect.
• Number: Unison, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, etc. The number of the interval is the number of letter names that the interval spans. For example, C to G is a fifth because it spans 5 letter names C-D-E-F-G.
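The counting rule for the interval number is easy to try out in a few lines of Ruby. This is just a sketch for ascending intervals within one octave. The method names are my own, and accidentals are ignored on purpose since only the letter name decides the number.

```ruby
LETTERS = %w[C D E F G A B].freeze

# Number of letter names spanned going up from +low+ to +high+.
# Only the first character (the letter name) of each note is used.
def interval_number(low, high)
  from = LETTERS.index(low[0].upcase)
  to   = LETTERS.index(high[0].upcase)
  ((to - from) % 7) + 1
end

# A simple interval and its inversion always add up to 9,
# so a 5th inverts to a 4th and a 4th inverts to a 5th.
def inverted_number(number)
  9 - number
end

p interval_number("C", "G")  # C-D-E-F-G
p interval_number("C", "Bb") # the flat does not change the number
p inverted_number(5)
```

This prints 5, 7 and 4. Since the sketch only looks at letter names, it cannot tell a unison from an octave, and it says nothing about quality (C to B and C to Bb are both sevenths).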

A dyad is a two note chord.

Aural characteristics of intervals: Consonance category: Perfect fifths and octaves are open consonances. Major and minor thirds and sixths are called soft consonances.

Dissonant category: Minor sevenths (C-Bb) and major seconds (C-D) are called mild dissonances. Minor seconds (like C-Db) and major sevenths (C-B) are called sharp dissonances.

The perfect fourth is characterized as a consonant or dissonant interval depending on the context in which it is used. If a perfect 4th is part of a second inversion major triad

The major 6th interval can be remembered with ‘My Bonnie Lies…’.

To identify a minor 6th interval, play the first inversion of the triad and then play the 1st and 3rd of the inversion.

To identify a major 6th, play the second inversion of the triad so you get the 1st and 3rd notes at a major 6th interval.

### 11 December 2016

Songs for remembering ascending intervals:

• Major 2nd - Happy Birthday to You.
• Major 3rd - Oh when the saints go marching.
• Perfect 4th - Star Trek Theme (TNG).
• Perfect 5th - Scarborough Fair. (are we ^going….)
• Major 6th - My Bonnie Lies Over…
• Major 7th - Superman theme
• Octave - The Christmas Song

# Searching for Graduate Degree Courses in USA and Japan.

I’m currently searching for master’s degree courses in various colleges in Japan and USA. I want to pursue a Computer Science degree specializing in distributed systems. Searching for the right graduate degree courses can get depressing. Here I’m posting various links and leads that I came across through the course of my search.

### 5 August 2016

Searching for options in Japan and started with University of Tokyo. Most of their courses seem to be in Japanese but there are a few in English as well. This page has some starting info about the English courses. Also found a collection of colleges here.

So apparently the process for getting into a Japanese college for Master’s can take two paths. The first is like so:

1. Talk to a professor and gain a research assistantship with him/her.
2. Give an exam and enroll for a 2 year master’s course if you pass that exam.

The second is to give the exam directly, but I’m not sure how that can be done since they all appear to be written examinations that are conducted in Japan.

### 16 August 2016

Having a look at the graduate schools of University of Tokyo, Tokyo Institute of Technology and Kyoto University today.

University of Tokyo

UoT seems to have some special selection process for international applicants (link), though it’s not useful for me. There’s a decent contact page here. They’ve also put up a check list for applications here.

Tokyo Inst. of Technology

This also has a good graduate program. Tokyo Inst. of Technology has an international graduate program for overseas applicants. The courses seem to be mostly in English. The school of computer science also participates in the IGP and accepts the IGP(A), IGP(B)3 and IGP(C) types of applicants. I seem to be most qualified for the IGP(A) and IGP(C) applications.

The ‘Education Program of Advanced Information Technology Leaders’ seems to be most relevant to my case. This looks like a good PDF to brief about the program.

All the courses require students to arrange for a Tokyo Tech faculty member to serve as their academic supervisor. This handy web application allows you to do that. They also have the MEXT scholarship for outstanding students.

University of Kyoto

### 17 August 2016

Continuing my research on Tokyo Inst. of Technology. The PDF I pointed to yesterday brought out an interesting observation - IGP(A) students and IGP(C) students seem to have different course work.

### 18 August 2016

It seems the IGP(C) program at Tokyo Tech. is best for me. I will research that further today. Most probably I’ll need to do a 6 month research assistantship first. Here’s a list of the research groups of the Computer Sci. department at Tokyo Tech.

### 20 August 2016

Tokyo Inst. of Technology

Found a list of faculties under the IGP(C) program here.

### 23 August 2016

Had a look at Kyushu Inst. of Technology today. The program for international students looks good.

Also check out scholarship opportunities at Tokyo Inst. of Technology. Links - 1, 2, 3. There are a bunch of scholarships that can be applied to before you enrol in university. Have a look here.

There’s also the MEXT scholarship from the Japanese government.

### 24 August 2016

Found an interesting FAQ on the UoT website.

Also having a look at JASSO scholarships. Found some great scholarships here.

### 25 August 2016

Found some scholarships. I can also enrol as a privately funded research student at Tokyo Tech.

This is a PDF that talks about privately funded research students.

Also checking out Keio University today. They have a program for international graduate students. Have a look here.

I also had a look at the Kyoto University IGP. Here’s a listing of Japanese universities.

### 28 August 2016

Found a Computer Engineering IGP at Kyoto University, though I still can’t find anything related to HPC. This is a link that has some details on admissions.

More details on Tokyo Tech.’s IGP(A) can be found here. This looks like a good resource for curriculum. This has resources for scholarships without recommendation.

### 29 August 2016

Found a good resource on IGP programs at Tokyo Tech here. Here’s a PPT about IGP(A) in particular. IGP(A) coursework can be found here.

### 4 November 2016

Posting after quite a while!

I’m currently having a look at Linz University, Austria. I came to know that one of the research groups there is really good and is making solid progress in high-performance software.

Here’s the admissions page of the dept. of computer science. Here’s more info on admissions. This is a PDF on the Computer Science degree.

The System Software group looks nice.

### 25 December 2016

Checking out the Computer Science program at UIC and that at University of Houston.

This is UIC’s website. This is the detailed PDF of the MS in CS requirements.

# Random Thoughts on Bass Tone

This post is about my learnings about bass tone. I’m currently using the following rig:

• Laney RB2 amplifier
• Tech 21 Sansamp Bass Driver Programmable DI
• Fender Mexican Standard Jazz Bass (4 string)

I will be updating this post as and when I learn something new that I’d like to document or share. Suggestions are welcome. You can email me (see the ‘about’ section) or post a comment below.

#### 26 July 2016

As of now I’m tweaking the sansamp and trying to achieve good tone that will complement the post/prog rock sound of my band Cat Kamikazee. I’m also reading up on different terminologies and use cases on the internet. For instance I found this explanation on DI boxes quite useful. I also learned that the ‘XLR Out Pad’ button on the sansamp actually provides a 20 dB cut to the soundboard if your signal is too hot.

I am trying to couple the sansamp with a basic overdrive pedal I picked up from a friend. This thread on talkbass is pretty useful for that. The guy who answered the question states that it’s better to place the sansamp last in the chain so that the DI can deliver the output of the sound chain.

So the BLEND knob on the sansamp controls how much of the dry signal is mixed with the sansamp tube amplifier emulation circuitry. It can be useful when chaining effects pedals with the sansamp, since reducing the blend lets more of the dry signal pass through. Btw the bass, treble and level controls remain active irrespective of the position of BLEND.

One thing that was a little confusing was the whole thing about ‘harmonic partials’. I found a pretty informative discussion about it in this TalkBass thread.

Here’s an interesting piece on compressors.

Some more useful links I came across over the course of the past few days:

• https://theproaudiofiles.com/amp-overdrive-vs-pedal-overdrive/
• http://www.offbeatband.com/2009/08/the-difference-between-gain-volume-level-and-loudness/

#### 28 July 2016

Found an interesting and informative piece on bass pedals here. It’s a good walkthrough of different pedal types and their functionality and purpose.

I wanted to check out some overdrive pedals today but was soon sinking in a sea of terminologies. One thing that intrigued me is the difference between an overdrive, distortion and fuzz. I found a pretty informative article on this topic. The author has the following to say about these 3 different but seemingly similar things.

I had a look at the Darkglass b3k and b7k pedals too. They look like promising overdrive pedals. I’ll explore the b3k more since the only difference between the 3 and the 7 is that the 7 also functions as a DI box and has an EQ, while the 3 doesn’t. I already have a DI with a 2 band EQ in the sansamp.

#### 29 July 2016

One thing that I noticed when tweaking my sansamp is the level of ‘distortion’ in my tone varies a LOT when you change the bass or treble keeping the drive at the same level. Why does this happen?

#### 2 August 2016

Trying to dive further into distortion today. Found this article kind of useful. It relates mostly to lead guitar tones, but I think it applies in a general case too. I learned about symmetric and asymmetric clipping in that article.

According to the article, symmetric clipping is more focused and clear, because it is only generating one set of harmonic overtones. Since asymmetric clipping can be hard-clipped on one side, and soft-clipped on the other, it has the potential to create very thick complex sounds. This means that if you want plenty of overtones, but do not want a lot of gain, asymmetric clipping can be useful. For full-blown distortion symmetric clipping is usually more suitable, since high-gain tones are already very harmonically complex. Typically asymmetric clipping will have a predominant first harmonic, which the symmetric clipping will not (that’s probably why in this video, the SD1 sounds brighter than the TS-9). High gain distortion tones sound best with most of the distortion coming from the pre-amp, so try to use a fairly neutral pickup or even a slightly ‘bright’ pickup.

The follow up to the above post talks about EQ in relation to distortion. It has stuff on pre and post EQ distortion and how it can affect the overall tone. If you place the EQ before the distortion, you can actually shape which frequencies will be clipped. However if you place it after the distortion then the EQ will only shape the already distorted tone. Pre-dist EQ is more useful in most cases since it lets you control the frequencies for clipping.
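To get a feel for the difference, I find it helps to look at the two kinds of clipping as plain functions on samples. The Ruby below is only a toy sketch of the idea and not how any pedal is actually built: the 0.5 threshold is arbitrary, and using tanh for the soft-clipped side is my own assumption rather than something from the article.

```ruby
# Symmetric hard clipping: both halves of the wave are flattened at the same level.
def symmetric_clip(sample, threshold = 0.5)
  sample.clamp(-threshold, threshold)
end

# Asymmetric clipping: hard clip on the positive side,
# soft (tanh-shaped) clip on the negative side.
def asymmetric_clip(sample, threshold = 0.5)
  if sample.positive?
    [sample, threshold].min
  else
    threshold * Math.tanh(sample / threshold)
  end
end

# One cycle of a full-scale sine wave, sampled 16 times.
wave = Array.new(16) { |i| Math.sin(2 * Math::PI * i / 16) }

sym  = wave.map { |s| symmetric_clip(s) }
asym = wave.map { |s| asymmetric_clip(s) }

puts sym.map  { |s| format("%+.2f", s) }.join(" ")
puts asym.map { |s| format("%+.2f", s) }.join(" ")
```

In the first row the positive and negative halves are mirror images of each other. In the second row the top is chopped flat while the bottom is rounded off, so the two halves no longer match, which is the asymmetry the article is talking about.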

It also says that humbucking pickups have a mid-boost that is more focused by the lower part of the frequency range. Single coil pickups on the other hand have a mid-boost focused by the upper part of the frequency range. Single coils generally have clearer, more articulate bass end.

#### 10 October 2016

Posting after quite a while!

Also, my band’s installation of Main Stage 3 has started giving some really weird problems. More about that soon.

#### 11 October 2016

Coming back to Main Stage. For some reason, pressing Space Bar for play/pause reduces the default sampling rate and makes the tracks sound weird. We need to go to preferences and increase the sampling rate to 48 kHz again (that’s what our backing tracks are recorded at). I think it’s something to do with the key mappings, but I’m not sure. Will need to check it out.

It also so happens that after the space bar has been pressed and the issue with the sampling rate is resolved, the samples (which come from an M-Audio M-Track) start emitting a strange crackling sound. This sound persists only if the headphones are plugged into the audio jack (we use the onboard Mac sound card too). The sound goes away if the headphones are unplugged. Restarting the Mac resolves the issue. I suspect there might be a way without having to restart. Will investigate.

Turns out you just restart and it solves the problem (and be careful about what keys you press when on stage!). Not worth scratching your head too much.

#### 9 November 2016

I just got a new EHX Micro POG octaver pedal and a TC Electronic booster pedal. Also got a TC Electronic Polytune. Finally on my way to creating a pedal chain :)

So for now I’m using the pedals in this order:

Tuner -> Octaver -> Booster -> Sansamp

I think this works fine for me for now, though I might change something later on.

I read in this thread that using one octave down with an overdrive (on the sansamp) works wonders. Gonna try that now!

I am also having a look at this guide on setting up a pedal board.

#### 18 November 2016

Also found an interesting rig rundown by Tim Commerford (RATM).

# Overview

I thought I’d try something new by recording screencasts for some of my work on Ruby open source libraries.

This is quite a change for me since I’m primarily focused on the programming and designing side of things. Creating documentation is something I’ve not ventured into a lot except the usual YARD markup for Ruby methods and classes.

In this blog post (which I will keep updating as time progresses) I hope to document my efforts in creating screencasts. Mind you this is the first time I’m creating a screencast so if you find any potential improvements in my methods please point them out in the comments.

# Creating the video

My first ever screencast will be for my benchmark-plot gem. For creating the video I’m mainly using two tools - Kdenlive for video editing and Kazam for recording screen activity. I initially tried using Pitivi and OpenShot for video editing, but the former did not seem user friendly and the latter kept crashing on my system. For the desktop recording I first tried using RecordMyDesktop but gave up on it since it’s too heavy on resources and recorded poor quality screencasts with not too many customization options.

For creating informative visuals, I’m using LibreOffice Impress so that I can create a slide, take its screenshot when in slideshow mode and put it in the screencast. However I’ve generally found that slides do not serve content delivery well in a screencast, so future screencasts will probably not feature too many of them.

Sublime Text 3 is my primary text editor. I use its built-in code execution functionality (by pressing Ctrl + Shift + B) to execute a code snippet and display the results immediately.

# Creating the audio

I am using Audacity for recording sound. Sadly my mic produces a lot of noise, so for removing that noise in Audacity, I use the inbuilt noise reduction tools.

Noise reduction in Audacity can be achieved by first selecting a small part of the sound that does not contain speech, then going to Effects -> Noise Reduction and clicking on ‘Get Noise Profile’. Then select the whole sound wave with Ctrl + A. Go to Effects -> Noise Reduction again and click ‘OK’. It should considerably reduce static noise from your sound file.

All files are exported to Ogg Vorbis.

# Putting it all together

I did some research on the screencasting process and found this article by Avdi Grimm and this one by Sayanee Basu extremely helpful.

I first started by writing the transcript along with any code samples that I had to show. I made it a point to describe the code being typed/displayed on the screen since it’s generally more useful to have a voice over explaining the code than having to pause the video and go over it yourself.

Then I recorded the voice over just for the part that featured slides. I imported the screenshots of the slides in kdenlive and adjusted them such that they fit the voice over. Recording the code samples was a bit of a challenge. I started typing out the code and talking about it into the mic. This was more difficult than I thought, almost like playing a Guitar and singing at the same time. I ended up recording the screencast in 4 separate takes, with several retakes for each take.

After importing the screencast with voice over into kdenlive and separating the audio and video components, I did some cuts to reduce redundancy or imperfections in my VO. Some of the parts of the video where there was a lot of typing had to be sped up by using kdenlive’s Speed tool.

Once this was up to my satisfaction, I exported it to mp4.

The video of my first screencast is now up on YouTube in the video below. Have a look and leave your feedback in the comments!

# Summary of Work This Summer for GSOC 2015

Over this summer as a part of Google Summer of Code 2015, daru received a lot of upgrades and new features which have made it a pretty robust tool for data analysis in pure Ruby. Of course, a lot of work still remains for bringing daru on par with the other data analysis solutions on offer today, but I feel the work done this summer has put daru on that path.

The new features led to the inclusion of daru in many of SciRuby’s gems, which use daru’s data storage, access and indexing features for storing and carrying around data. Statsample, statsample-glm, statsample-timeseries, statsample-bivariate-extensions are all now compatible with daru and use Vector and DataFrame as their primary data structures. Daru’s plotting functionality, which interfaces with nyaplot for creating interactive plots directly from the data, was also significantly overhauled.

Also, new gems developed by other GSOC students, notably Ivan’s GnuplotRB gem and Alexej’s mixed_models gem, both accept data from daru data structures. Do see their repo pages for interesting ways of using daru.

The work on daru is also proving to be quite useful for other people, which led to a talk/presentation at DeccanRubyConf 2015, which is one of the three major ruby conferences in India. You can see the slides and notebooks presented at the talk here. Given the current interest in data analysis and the need for a viable solution in ruby, I plan to take daru much further. Keep watching the repo for interesting updates :)

In the rest of this post I’ll elaborate on all the work done this summer.

## Pre-mid term submissions

Daru as a gem before GSOC was not exactly user friendly. There were many cases, particularly the iterators, that required some thinking before anybody used them. This is against the design philosophy of daru, or even ruby in general, where surprising programmers with ambiguous constructs is usually frowned upon by the community. So the first thing that I did mainly concerned overhauling daru’s many iterators for both Vector and DataFrame.

For example, the #map iterator from Enumerable returns an Array no matter which object you call it on. This was not the case in daru before, where #map would return a Daru::Vector or Daru::DataFrame. This behaviour was changed, and now #map returns an Array. If you want a Vector or a DataFrame of the modified values, you should call #recode on Vector or DataFrame.
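To make the distinction concrete, here is a minimal sketch that uses a hypothetical `TinyVector` class as a stand-in for Daru::Vector (it is not daru’s implementation). It assumes only what is described above: #map comes from Enumerable and so returns an Array, while #recode wraps the transformed values back into the container type.

```ruby
# TinyVector is a hypothetical stand-in for Daru::Vector, used only to
# illustrate the #map / #recode split described above.
class TinyVector
  include Enumerable

  attr_reader :data

  def initialize(data)
    @data = data
  end

  # Enumerable builds #map (and friends) on top of #each,
  # so #map always hands back a plain Array.
  def each(&block)
    return to_enum(:each) unless block_given?

    @data.each(&block)
    self
  end

  # Returns a new TinyVector holding the transformed values.
  def recode(&block)
    self.class.new(@data.map(&block))
  end
end

vector = TinyVector.new([1, 2, 3])

doubled_array  = vector.map { |x| x * 2 }
doubled_vector = vector.recode { |x| x * 2 }

puts doubled_array.class   # Array
puts doubled_vector.class  # TinyVector
```

The design choice is simply to keep #map unsurprising for anyone used to Enumerable, and to give the container-preserving behaviour its own name.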

Each of these iterators also accepts an optional argument, :row or :vector, which will define the axis over which iteration is supposed to be carried out. So now there are the #each, #map, #map!, #recode, #recode!, #collect, #collect_matrix, #all?, #any?, #keep_vector_if and #keep_row_if. To iterate over elements along with their respective indexes (or labels), you can likewise use #each_row_with_index, #each_vector_with_index, #map_rows_with_index, #map_vector_with_index, #collect_rows_with_index, #collect_vector_with_index or #each_index. I urge you to go over the docs of each of these methods to utilize the full power of daru.

Apart from this there was also quite a bit of refactoring involved for many methods (courtesy Alexej). This has made daru much faster than previous versions.

The next (major) thing to do was making daru compatible with statsample. This was essential since statsample is a very important tool for statistics in ruby, and it was using its own Vector and Dataset classes, which weren’t very robust as computation tools and were very difficult to use when it came to cleaning or munging data. So I replaced statsample’s Vector and Dataset classes with Daru::Vector and Daru::DataFrame. It involved a significant amount of work on both statsample and daru: statsample because many constructs had to be changed to make them compatible with daru, and daru because there was a lot of essential functionality in these classes that had to be ported to daru.

Porting code from statsample to daru improved daru significantly. There were a whole lot of statistics methods in statsample that were imported into daru, and you can now use all of them from daru. Statsample also works well with rubyvis, a great tool for visualization. You can now do that with daru as well.

Many new methods for reading and writing data to and from files were also added to daru. You can now read and write data to and from CSV, Excel, plain text files or even SQL databases.

In effect, daru is now completely compatible with statsample (and all the other statsample extensions). You can use daru data structures for storing data and pass them to statsample for performing computations. The biggest advantage of this approach is that the analysed data can be passed around to other scientific ruby libraries (some of which are listed above) that use daru as well. Since daru offers in-built functions to better ‘see’ your data, better visualization is possible.

See these blogs and notebooks for a complete overview of daru’s new features.

Also see the notebooks in the statsample README for using daru with statsample.

## Post-mid term submissions

Most of the time after the mid term submissions was spent implementing the time series functions for daru.

I implemented a new index, the DateTimeIndex, which can be used for indexing data on time stamps. It enables users to query data based on time stamps. Time stamps can either be specified as precise ruby DateTime objects or as strings, which will lead to retrieval of all the data falling under that time. For example, specifying ‘2012’ returns all data that falls in the year 2012. See detailed usage of DateTimeIndex in conjunction with other daru constructs in the daru README.
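The string-based lookup can be pictured with a small pure-Ruby sketch. Everything here is an assumption made for illustration: the helper names `date_range_for` and `select_by_time` are hypothetical, a plain Hash keyed by Date stands in for an indexed series, and only 'YYYY' and 'YYYY-MM' strings are handled. It is not how DateTimeIndex is implemented.

```ruby
require 'date'

# Hypothetical helper: turn a partial date string into the range of
# dates it covers. Only 'YYYY' and 'YYYY-MM' are handled in this sketch.
def date_range_for(partial)
  year, month = partial.split('-').map(&:to_i)
  if month
    # Day -1 means "last day of that month".
    Date.new(year, month, 1)..Date.new(year, month, -1)
  else
    Date.new(year, 1, 1)..Date.new(year, 12, 31)
  end
end

# Hypothetical helper: keep every entry whose time stamp falls inside
# the period named by the partial string.
def select_by_time(series, partial)
  range = date_range_for(partial)
  series.select { |stamp, _value| range.cover?(stamp) }
end

# A plain Hash keyed by Date stands in for a time-indexed series.
series = {
  Date.new(2011, 12, 31) => 10,
  Date.new(2012, 1, 15)  => 20,
  Date.new(2012, 6, 30)  => 30,
  Date.new(2013, 1, 1)   => 40
}

in_2012     = select_by_time(series, '2012')
in_jan_2012 = select_by_time(series, '2012-01')

p in_2012.values      # [20, 30]
p in_jan_2012.values  # [20]
```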

An essential utility in implementing DateTimeIndex was DateOffset, which is a new set of classes that offsets dates based on certain rules or business logic. It can advance or lag a ruby DateTime to the nearest day or any day of the week or the end or beginning of the month etc. DateOffset is an essential part of DateTimeIndex and can also be used as a standalone utility for advancing/lagging DateTime objects. This blog post elaborates more on the nuances of DateOffset and its usage.
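As a rough illustration of what "offsetting a date by a rule" means, here is a minimal sketch of a month-end rule. The class name `MonthEndOffset` and its `advance`/`lag` methods are hypothetical and chosen for this example only; daru’s DateOffset classes have their own interface, which is described in the blog post mentioned above.

```ruby
require 'date'

# Hypothetical month-end rule, written only to illustrate the idea of
# advancing or lagging a date according to a calendar rule.
class MonthEndOffset
  # Move forward to the next month end strictly after +date+.
  def advance(date)
    month_end = Date.new(date.year, date.month, -1)
    return month_end if date < month_end

    following = date >> 1 # same day next month, clamped if needed
    Date.new(following.year, following.month, -1)
  end

  # Move back to the closest month end strictly before +date+.
  def lag(date)
    previous = date << 1 # same day previous month, clamped if needed
    Date.new(previous.year, previous.month, -1)
  end
end

offset = MonthEndOffset.new

puts offset.advance(Date.new(2012, 2, 10))  # 2012-02-29
puts offset.advance(Date.new(2012, 1, 31))  # 2012-02-29
puts offset.lag(Date.new(2012, 3, 15))      # 2012-02-29
```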

The last thing done after the mid term was complete compatibility with statsample-timeseries, which was created by Ankur Goel during GSOC 2013. It offers many useful functions for analysis of time series data. It now works with daru containers. See some use cases here.

That’s all, as far as I can remember.

# Elaboration on Certain Internals of Daru

In this blog post I will elaborate on how a few of the features in daru were implemented. Notably, I will stress what spurred the need for that particular design of the code.

This post is primarily intended to serve as documentation for me and future contributors. If readers have any inputs on improving this post, I’d be happy to accept new contributions :)

## Index factory architecture

Daru currently supports three types of indexes, Index, MultiIndex and DateTimeIndex.

It became very tedious to write if statements in the Vector or DataFrame codebase whenever a new data structure was to be created, since there were 3 possible indexes that could be attached with every data set. This mainly depended on what kind of data was present in the index, i.e. tuples would create a MultiIndex, DateTime objects or date-like strings would create a DateTimeIndex, and everything else would create a Daru::Index.

This looked something like the perfect use case for the factory pattern, the only hurdle being that the factory pattern in the pure sense of the term would be a superclass, something called Daru::IndexFactory that created an Index, DateTimeIndex or MultiIndex index using some methods and logic. The problem is that I did not want to call a separate class for creating Indexes. This would break existing code and possibly cause problems in libraries that were already using daru (viz. statsample), not to mention confusing users about which class they’re actually supposed to be using.

The solution came after I read this blog post, which demonstrates that the .new method of any class can be overridden. Thus, instead of going straight to initialize for creating the instance of a class, Ruby calls the overridden new, which can then call initialize for instantiating an instance of that class. It so happens that you can make new return any object you want, unlike initialize, which can only set up an instance of the class it is declared in. Thus, for the factory pattern implementation of Daru::Index, we over-ride the .new method of Daru::Index and write logic such that it manufactures the appropriate kind of index based on the data that is passed to Daru::Index.new(data). The pseudo code for doing this looks something like this:
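A minimal runnable sketch of this idea follows. The class names (`SketchIndex`, `SketchMultiIndex`, `SketchDateTimeIndex`) and the detection rules are assumptions made for illustration and are not daru’s actual code. The sketch also restores the default `.new` on subclasses through an `inherited` hook so that they can still be instantiated directly.

```ruby
require 'date'

# Hypothetical stand-in for Daru::Index, showing .new used as a factory.
class SketchIndex
  class << self
    # Keep a handle on the default Class#new before it gets overridden.
    alias_method :__new__, :new

    # Give every subclass the default .new back, otherwise subclasses
    # would inherit the factory version defined below.
    def inherited(subclass)
      super
      class << subclass
        alias_method :new, :__new__
      end
    end
  end

  # Factory: pick the index class from the kind of labels supplied.
  def self.new(labels)
    if !labels.empty? && labels.all? { |label| label.is_a?(Array) }
      SketchMultiIndex.new(labels)      # tuples
    elsif !labels.empty? && labels.all? { |label| label.is_a?(Date) }
      SketchDateTimeIndex.new(labels)   # Date or DateTime objects
    else
      __new__(labels)                   # plain index, default constructor
    end
  end

  attr_reader :labels

  def initialize(labels)
    @labels = labels
  end
end

class SketchMultiIndex < SketchIndex; end
class SketchDateTimeIndex < SketchIndex; end

puts SketchIndex.new(%i[a b c]).class                 # SketchIndex
puts SketchIndex.new([[:a, :one], [:a, :two]]).class  # SketchMultiIndex
puts SketchIndex.new([Date.new(2012, 1, 1)]).class    # SketchDateTimeIndex
```

Callers keep writing `SketchIndex.new(data)` and never need to know that a factory sits behind it, which is the property the paragraph above is after.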

Also, since over-riding .new tampers with the subclasses of the class as well, an inherited hook that replaces the over-ridden .new of the inherited class with the original one was added to Daru::Index.

## Working of the where clause

The where clause in daru lets users query data with an Array containing boolean values. So whenever you call where on a Daru::Vector or DataFrame and pass in an Array containing true or false values, all the rows corresponding to true will be returned as a Vector or DataFrame respectively.

Since the where clause works in conjunction with the comparator methods of Daru::Vector (which return a Boolean Array), it was essential for these boolean arrays to be combined together such that piecewise AND and OR operations could be performed between multiple boolean arrays. Hence, the Daru::Core::Query::BoolArray class was created, which is specialized for handling boolean arrays and performing piecewise boolean operations.

The BoolArray defines the #& method for piecewise AND operations and it defines the #| method for piecewise OR operations. They work as follows:
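A minimal sketch of the piecewise behaviour, using a hypothetical `SketchBoolArray` class as a stand-in (this is not the code of Daru::Core::Query::BoolArray itself, and it assumes both operands have the same length):

```ruby
# Hypothetical stand-in for a boolean-array wrapper with piecewise
# AND / OR operators.
class SketchBoolArray
  attr_reader :values

  def initialize(values)
    @values = values
  end

  # Piecewise AND: true only where both operands are true.
  def &(other)
    SketchBoolArray.new(values.zip(other.values).map { |a, b| a && b })
  end

  # Piecewise OR: true where at least one operand is true.
  def |(other)
    SketchBoolArray.new(values.zip(other.values).map { |a, b| a || b })
  end
end

left  = SketchBoolArray.new([true, false, true, false])
right = SketchBoolArray.new([true, true, false, false])

p((left & right).values)  # [true, false, false, false]
p((left | right).values)  # [true, true, true, false]
```

Because both operators return another `SketchBoolArray`, they can be chained, which is what makes composite queries possible.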

# Finding and Combining Data in Daru

## Arel-like query syntax

Arel is a very popular ruby gem that is one of the major components of the most popular ruby framework, Rails. It is an ORM-helper of sorts that exposes a beautiful and intuitive syntax for creating SQL strings by chaining Ruby methods.

Daru successfully adopts this syntax and the result is a very intuitive and readable syntax for obtaining any sort of data from a DataFrame or Vector.

As a quick demonstration, let’s create a DataFrame which looks like this:

To select all rows where df[:a] equals 2 or df[:c] equals 55, just write this:

As is easily seen above, the Daru::Vector class has special comparators defined on it, which allow it to check each value of the Vector and return an object that can be evaluated by the DataFrame#where method.

Notice that to club the two comparators above, we have used the union OR (|) operator.
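The mechanics of such a query can be sketched in plain Ruby. The sketch below is an assumption-laden illustration, not daru code: a Hash of columns stands in for a DataFrame, and the lambdas `eq`, `either` and `where` are hypothetical names for the comparator, the piecewise OR and the row selection respectively.

```ruby
# Column-oriented toy table; a hypothetical stand-in for a DataFrame.
table = {
  a: [1, 2, 3, 2],
  b: %w[w x y z],
  c: [11, 22, 55, 44]
}

# Comparator: produces one boolean per entry of the column.
eq = ->(column, value) { column.map { |entry| entry == value } }

# Piecewise OR of two boolean masks of equal length.
either = ->(left, right) { left.zip(right).map { |l, r| l || r } }

# Row selection: keep only the rows whose mask entry is true.
where = lambda do |tbl, mask|
  tbl.transform_values do |column|
    column.select.with_index { |_entry, i| mask[i] }
  end
end

# Rows where column a equals 2 or column c equals 55.
mask     = either.call(eq.call(table[:a], 2), eq.call(table[:c], 55))
selected = where.call(table, mask)

p mask      # [false, true, true, true]
p selected  # {:a=>[2, 3, 2], :b=>["x", "y", "z"], :c=>[22, 55, 44]}
```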

Daru::Vector has a bunch of comparator methods defined on it, which can be used with #where for obtaining the desired results. All of these return an object of type Daru::Core::Query::BoolArray, which is read by #where. BoolArray uses the methods | (also aliased as #or) and & (also aliased as #and) for piecewise logical operations on other BoolArray objects.

BoolArray consists of an internal Array that contains true for every entry in the Vector that returns true for an operation between the comparable operand and a Vector entry.

For example,

The #& (or #and) and #| (or #or) methods on BoolArray apply a logical AND and a logical OR respectively between each element of the two BoolArray objects and return another BoolArray that contains the results. For example:

The following comparators can be used with a Daru::Vector:

| Comparator Method | Description |
|---|---|
| `eq` | Uses `==` and returns true for each equal entry |
| `not_eq` | Uses `!=` and returns true for each unequal entry |
| `lt` | Uses `<` and returns true for each entry less than the supplied object |
| `lteq` | Uses `<=` and returns true for each entry less than or equal to the supplied object |
| `mt` | Uses `>` and returns true for each entry more than the supplied object |
| `mteq` | Uses `>=` and returns true for each entry more than or equal to the supplied object |
| `in` | Uses `==` for each element in the collection (Array, Daru::Vector, etc.) passed and returns true for a match |

A major advantage of using the #where clause over DataFrame#filter or Vector#keep_if, apart from better readability and usability, is that it is much faster. These benchmarks prove my point.

I’ll conclude this chapter with a little more complex example of using the arel-like query syntax with a Daru::Vector object:

For more examples on using the arel-like query syntax, see this notebook.

## Joins

Daru::DataFrame offers the #join method for performing SQL style joins between two DataFrames. Currently #join supports inner, left outer, right outer and full outer joins between DataFrames.

In order to demonstrate joins, let’s consider a single example of an inner join on two DataFrames:
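For readers unfamiliar with SQL, the semantics of an inner join can be sketched in plain Ruby. This is only an illustration under stated assumptions: Arrays of row Hashes stand in for the two DataFrames, the sample data is made up, and `inner_join` is a hypothetical helper rather than daru’s #join.

```ruby
# Each "frame" is an Array of row Hashes; hypothetical sample data.
left = [
  { id: 1, name: 'Pete' },
  { id: 2, name: 'Steve' },
  { id: 3, name: 'Rusty' }
]
right = [
  { id: 2, city: 'Pune' },
  { id: 3, city: 'Colombo' },
  { id: 4, city: 'Delhi' }
]

# Inner join: keep only the rows whose key appears on both sides,
# merging the matching rows into one.
def inner_join(left_rows, right_rows, on:)
  lookup = right_rows.group_by { |row| row[on] }
  left_rows.flat_map do |left_row|
    lookup.fetch(left_row[on], []).map { |right_row| left_row.merge(right_row) }
  end
end

joined = inner_join(left, right, on: :id)

joined.each { |row| p row }
```

Rows with id 1 (left only) and id 4 (right only) drop out, which is exactly what distinguishes an inner join from the outer variants listed above.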

For more examples please refer to this notebook.