When to invest in mutual funds: An exercise in data science

I started with an interesting hypothesis: “Some days are better than others for investing in Indian mutual funds with a buy-and-hold approach”. There’s nothing particular about mutual funds or India; it should apply to most investment vehicles and other countries, but I have been investing in Indian mutual funds and wanted to know if I could do better by investing more on some days than others. There’s a story behind this hypothesis. A friend of mine said his financial advisor told him that their data showed that, on average, it’s better to invest:

during the second half of the month
on odd days
on Tuesdays, Wednesdays, Thursdays as compared to Mondays and Fridays

I guess we can come up with many stories for why these hypotheses may be true. People in India generally get paid on the last day of each month and have more money during the first half of the month than the second. (It’s a different story in the USA, where most people get paid twice a month but have to pay rent etc. at the start of the month. They should have more disposable income during the second half.) So NAVs for mutual funds may be lower during the second half because, on average, more people will buy during the first half and more people will sell during the second half. Since the market is closed on Saturdays and Sundays, people might take more extreme positions on Mondays and Fridays, and if you are investing for the long term, it might be better to stay away from extreme positions. I couldn’t come up with any story for odd days, but in any case that’s all these are: stories.

Of course markets depend on a million different things, almost all of which I have no knowledge of or control over. The appeal of a systematic investment plan (SIP) is that it averages over this variability, but you still have to pick a day to start your SIP (say the 7th of every month). I wanted to see if there’s any correlation in the historical data and, if there is, to come up with a formula to determine how much more to invest in a particular fund on one day than on another, or to determine the best day to start a SIP. I did some projects in machine learning back in college but have forgotten most of it. I think it’ll be interesting to read how I went about this project and struggled with different aspects of it.

Getting the historical data

All data science problems start with getting the data. I was interested in the following 6 mutual funds:

After a few quick Bing searches, I had found a few sites that allowed querying historical NAVs for a range of dates, although one of them crashed when I gave it a period of one year and another reported that the maximum query interval was only 90 days. I thought I’d build an API over it to remove the 90-day limitation by dividing the query period into 90-day chunks. I also found a project on GitHub that queried current NAVs using Node.js. I had never used Node.js but decided to give it a shot. After installing it and reading the project readme and a little of the Node.js documentation, I still had no clue how to execute that script.
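The 90-day chunking idea can be sketched in a few lines of Python. This is only an illustration of the splitting step: the function name `split_into_chunks` and its parameters are made up for this example, and it makes no request to any real NAV site.

```python
from datetime import date, timedelta

def split_into_chunks(start, end, max_days=90):
    """Yield (chunk_start, chunk_end) pairs that cover start..end inclusive,
    each spanning at most max_days calendar days."""
    chunk_start = start
    while chunk_start <= end:
        # The end date is inclusive, so a 90-day chunk ends 89 days after it starts.
        chunk_end = min(chunk_start + timedelta(days=max_days - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + timedelta(days=1)

# Example: split the year 2013 into query windows of at most 90 days.
chunks = list(split_into_chunks(date(2013, 1, 1), date(2013, 12, 31)))
for chunk_start, chunk_end in chunks:
    print(chunk_start, "to", chunk_end)
```

Each pair could then be passed to whatever query function the chosen site needs, and the results concatenated in date order.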

I gave up on Node.js and decided to do it in Python. I settled on the AMFI portal for querying NAV data. It required filling in a form. After a few more Bing searches, I found that I could do this with a library: mechanize. I installed it, read the documentation and started playing with it. I soon discovered that submitting the form on the portal opens another web page with the form elements embedded in the URL, and that I could just request that URL and get the response directly. Another Bing search and I was using urllib2 to download the web page. The response was messy: the returned HTML elements had no IDs or names. After a few more Bing searches and some experimenting with xml.etree, I felt it would be so much easier with XElement in C#. I fired up Visual Studio and coded for a minute or so, then decided to try a longer query interval in the URL. I tried an interval of 4 months and it worked, made it 1 year and it still worked, then queried for the entire duration of that fund (3.5 years) and, to my amazement, it still worked. Alright, no API needed! How did people code before search engines?

For each of the six funds, I queried historical data for the entire duration of the fund (surprisingly, all of them started on the same day, Jan 01, 2013. This can’t be a coincidence.) and manually copied the result table to Excel (ah, the programmer in me cried!). Only the Franklin fund had data for Jan 01, 2013, and the UTI fund had duplicate rows for Dec 30, 2013.

Feature engineering

I decided to go with the following features:

Month
Day of week
Half of month (first:-1 or second:1)
Date parity (even:-1 or odd:1)
Day
Whether day was in mid of week or not (Monday or Friday:-1, else:1)

I learnt quite a few Excel functions for creating these features: TEXT, MOD, WEEKDAY, IF. For the nominal features, I first started with string values but soon remembered that regression requires numeric values, so I converted them to 0 and 1. I then thought that a feature value of 0 might not be very useful, because the coefficient would have no effect for those rows, so I converted them to -1 and 1. I also added a few derived features to introduce non-linearity: the squares of Month (1), Day of week (2) and Day (5), and the square root of Day (5).
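For readers who prefer code to spreadsheet formulas, here is a rough Python equivalent of these features. It is a sketch, not the spreadsheet actually used: the helper name `date_features` is made up, the weekday numbering (Monday=1) is an assumption, and so is the cut-off between the two halves of the month (days 1 to 15 count as the first half).

```python
import math
from datetime import date

def date_features(d):
    """Build the calendar features described above for a single date."""
    day_of_week = d.isoweekday()  # assumption: Monday=1 ... Sunday=7
    return {
        "Month": d.month,
        "DayOfWeek": day_of_week,
        "Half": -1 if d.day <= 15 else 1,           # assumption: 1-15 is the first half
        "DateParity": 1 if d.day % 2 == 1 else -1,  # odd: 1, even: -1
        "Day": d.day,
        "DayMidOrNot": -1 if day_of_week in (1, 5) else 1,  # Monday or Friday: -1
        # Derived features meant to introduce non-linearity.
        "Month^2": d.month ** 2,
        "DayOfWeek^2": day_of_week ** 2,
        "Day^2": d.day ** 2,
        "sqrt(Day)": math.sqrt(d.day),
    }

features = date_features(date(2013, 12, 30))  # a Monday
print(features)
```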

I had used Weka for some ML projects back in college so decided to start with it. Fired up the “Select attributes” feature with default values. It got rid of all features except month and date parity; not very useful. I decided to ignore it.

Regression

I chose “Linear Regression” in the Classify tab, with the DSPBR NAV as the output. I wasn’t expecting any brilliant results because mutual fund NAVs don’t depend just on the date. It came up with this:

It threw away all the features except month and came up with a model that has only a tiny negative correlation with the output. Completely useless! I tried another algorithm, SMOreg, and it gave the following output:

A tiny positive correlation with the actual output but at least it used all the features. Some observations:

Month, DayOfWeek, Half, DateParity, Month^2, Day^2 and sqrt(Day) have small positive weights, which might suggest that NAVs go up later in the year or month (or maybe that’s just what the stock market does: it goes up over time! Maybe the data has to be normalized to account for this, perhaps by the market index).
Day has a negative weight and Day^2 has almost the same positive weight, so they roughly cancel each other. Similarly for DayOfWeek.
Half has a small positive weight, which would suggest a small positive correlation, i.e. that NAVs are higher during the second half of the month (or it might just be the stock market going up over time). However, the data shows that NAVs are actually slightly lower during the second half (30.004) than the first half (30.117). It’s not a good model!
DateParity also has a small positive weight, which might mean NAVs are slightly higher on odd days (the opposite of what I was going for; it’s better to buy low). We didn’t need to train a model for this. We could just look at the data, and indeed the average NAV for DSPBR on odd days is 30.18, which is slightly higher than the average NAV on even days (29.94).
DayMidOrNot has a small negative weight. Since it has the value -1 for Mondays and Fridays, this should mean that NAVs are slightly higher on Mondays and Fridays than during the middle of the week. The data agrees: the average NAV is slightly higher on Mondays and Fridays (30.11) than during the middle of the week (30.03).
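The "just look at the data" comparisons above amount to grouping NAVs by a calendar property and averaging each group. A minimal sketch, assuming the history is available as (date, NAV) pairs; the helper `average_nav_by` is hypothetical and the rows below are made-up placeholder values, not the funds' real NAVs.

```python
from datetime import date
from statistics import mean

def average_nav_by(rows, key):
    """Group (date, nav) rows by key(date) and return the mean NAV per group."""
    groups = {}
    for d, nav in rows:
        groups.setdefault(key(d), []).append(nav)
    return {label: mean(navs) for label, navs in groups.items()}

# Placeholder rows for illustration only.
rows = [
    (date(2013, 1, 1), 10.0),  # Tuesday
    (date(2013, 1, 2), 11.0),  # Wednesday
    (date(2013, 1, 3), 12.0),  # Thursday
    (date(2013, 1, 4), 14.0),  # Friday
    (date(2013, 1, 7), 15.0),  # Monday
]

by_parity = average_nav_by(rows, lambda d: "odd" if d.day % 2 else "even")
by_week_position = average_nav_by(
    rows, lambda d: "mon_fri" if d.isoweekday() in (1, 5) else "mid_week"
)
print(by_parity)
print(by_week_position)
```

Note that comparing raw averages like this still mixes in the market's general upward drift, which is the normalization concern raised in the first observation.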

The model generated by Weka was completely useless. After this, I tried the regression tool in Excel and tried to interpret the results by reading about different statistical measures here. For DSPBR,

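R square is minuscule (0.0065), meaning the model explains well under 1% of the variation in NAV and is useless.
Significance F is 0.85 (85%), meaning that if none of the features had any real relationship with the NAV, a fit at least this good would still turn up about 85% of the time. The regression is nowhere near statistically significant.
All the attributes have very high P-values, meaning none of the individual coefficients can be distinguished from zero.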

Interestingly, for UTI, the model performs slightly better with larger R square (0.0136) and lower Significance F (30%).
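To make the R square figure concrete, here is a small pure-Python sketch that fits a least-squares line to a single feature and reports R square as 1 minus the ratio of residual to total variation. It is a simplification for illustration: the Excel tool fits all features at once, and the function name `r_squared` and the sample values are made up.

```python
def r_squared(xs, ys):
    """R square of the ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# A feature that carries no information about y gives R square of 0 ...
uninformative = r_squared([-1, 1, -1, 1], [1.0, 1.0, 2.0, 2.0])
# ... while a perfectly linear relationship gives R square of 1.
perfect = r_squared([1, 2, 3, 4], [3.0, 5.0, 7.0, 9.0])
print(uninformative, perfect)
```

Values such as 0.0065 and 0.0136 sit almost at the uninformative end of that range.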

Conclusion

I guess machine learning was not the right approach for this problem, because of the small training set (866 instances) and because the features have almost no correlation with the desired output. As for the hypothesis, we can look for support directly in the historical data:

NAVs are actually slightly lower during the second half of the month than the first half for 4 of the 6 funds. It’s not a clear result, but it might still be better to buy during the second half for a buy-and-hold approach.
NAVs are higher for all the funds on odd days than on even days. This is the most surprising and clear result. It should mean that it’s better to buy on even days, contradicting the hypothesis.
NAVs are higher for all the funds on Mondays and Fridays than during the middle of the week, a very clear result. It means it should be better to buy during the middle of the week, supporting the hypothesis.

As for the best day of the month to start a SIP, we can compare SIPs of the same amount started on different days over the same interval. I compared SIPs of 1000 started on the 1st through the 28th and continued for 25 months, and compared their worth based on the NAVs available for the last day in the data. Here are the observations:

For DSPBR, Franklin, Mirae and UTI, the three best days to start a SIP came out to be the 11th, 18th and 4th, in that order.
For BSL and Reliance, the three best days to start a SIP came out to be the 11th, 18th and 25th, in that order.
The difference in returns between the best day and the worst day was as high as 18.72%.
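The SIP comparison can be sketched roughly as follows. This is an illustration under stated assumptions rather than the spreadsheet actually used: the function `sip_final_value` and its parameters are made up, NAVs are assumed to be held in a dict keyed by date, and when the chosen day has no NAV (weekend or holiday) the purchase is assumed to happen on the next date that does. The NAVs in the demo are placeholder values.

```python
from datetime import date

def sip_final_value(navs, day_of_month, start_year, start_month,
                    months=25, amount=1000.0):
    """Value, at the last NAV in the data, of a SIP that invests `amount`
    on `day_of_month` every month for `months` months."""
    trading_days = sorted(navs)
    last_nav = navs[trading_days[-1]]
    units = 0.0
    for i in range(months):
        year = start_year + (start_month - 1 + i) // 12
        month = (start_month - 1 + i) % 12 + 1
        target = date(year, month, day_of_month)
        # Assumption: if there is no NAV on the target day, buy on the next one.
        buy_day = next((d for d in trading_days if d >= target), None)
        if buy_day is None:
            break  # ran past the end of the data
        units += amount / navs[buy_day]
    return units * last_nav

# Tiny made-up NAV history for illustration only.
demo_navs = {
    date(2013, 1, 1): 10.0,
    date(2013, 1, 15): 20.0,
    date(2013, 2, 1): 20.0,
    date(2013, 2, 15): 40.0,
}
value_day_1 = sip_final_value(demo_navs, 1, 2013, 1, months=2)
value_day_15 = sip_final_value(demo_navs, 15, 2013, 1, months=2)
print(value_day_1, value_day_15)
```

Running this for every day from 1 to 28 and sorting the results gives the ranking of start days. In a steadily rising market, earlier purchases tend to look better simply because they buy at lower NAVs.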

These results are surprising and there seems to be something special about the days 11, 18, 4 and 25 (or these funds just happen to be highly correlated with each other).

All of the historical data and analysis is available here: mutual-fund-analysis. Let me know what you think in comments.

Disclaimer: I wrote this post in my personal time. All opinions expressed in this post are my personal opinions and not that of my employer. I’m not an expert in investment or finance.

4 thoughts on “When to invest in mutual funds: An exercise in data science”

Pnkj says:

September 29, 2016 at 12:45 pm

Sir, though I am quite young and don’t have much knowledge of mutual funds and SIPs, I think your model can be improved slightly, for example by adding the time of day when the market opens or closes; my father used to buy or sell a lot in that phase (9:15 AM or 3:30 PM in India), so I think that can be considered an extreme position. Also, what % of the data did you use as train and test for cross validation? Did you plot anything or remove outliers? They really help a lot. We could also use more sophisticated algorithms, because linear regression underperforms many times (from what I have seen).
Like always it was a very nice read (I am already a fan of your blog, especially that segment tree article *_* ). If you don’t mind, can I use your data to do predictions for the same?

- Kartik Kukreja says:
  
  September 30, 2016 at 12:58 am
  
  Thank you for the nice comments. All the data and code on this blog is free for anyone to use. You are welcome to play with it. Time of day is not a factor for mutual funds because they are priced once for a day (after trading is over). I was using 7-fold cross validation (so 1 fold for testing). I didn’t remove outliers but tested with many different algorithms with similar outcomes. There’s just not much to learn in the data. With just these signals, month is the most important factor because stock prices tend to go up over time – this fact is useless to investors looking for when to invest. If you find any other interesting things, please share them in comments.
  
  - Pnkj says:
    
    September 30, 2016 at 2:39 am
    
    Oh sorry,I don’t know much about MFs 🙂 maybe my father is much involved in stocks :D. Stocks and MF are hard to predict !!! As we can see from your trials 🙂 I still have a lot to learn and don’t think can perform as good as you but I will try my best and your blog is awesome 🙂
    
    - Kartik Kukreja says:
      
      September 30, 2016 at 3:34 am
      
      You are so polite 🙂

Everything Under The Sun

A blog on CS concepts
