When to invest in mutual funds: An exercise in data science

I started with an interesting hypothesis: “Some days are better than others for investing in Indian mutual funds with a buy-and-hold approach”. There’s nothing particular about mutual funds or India; it should apply to most investment vehicles and other countries, but I have been investing in Indian mutual funds and wanted to know if I could do better by investing more on some days than others. There’s a story behind this hypothesis. A friend of mine said his financial advisor told him that their data showed that, on average, it’s better to invest:

  • during the second half of the month
  • on odd days
  • on Tuesdays, Wednesdays, Thursdays as compared to Mondays and Fridays

I guess we can come up with many stories for why these hypotheses may be true. People in India generally get paid on the last day of each month and have more money during the first half of the month than the second. (It’s a different story in the USA, where most people get paid twice a month but have to pay rent etc. at the start of the month. They should have more disposable income during the second half.) So NAVs for mutual funds may be lower during the second half because, on average, more people will buy during the first half and more people will sell during the second half. Since the market is closed on Saturdays and Sundays, people might take more extreme positions on Mondays and Fridays, and if you are investing for the long term, it might be better to stay away from extreme positions. I couldn’t come up with any story for odd days, but in any case, that’s all these are: stories.

Of course, markets depend on a million different things, almost all of which I have no knowledge of or control over. The appeal of a SIP (systematic investment plan) is to average over this variability, but you still have to pick a day to start your SIP (say, the 7th of every month). I wanted to see if there’s any correlation in historical data and, if there is, to come up with a formula to determine how much more to invest in a particular fund on one day than another, or to determine the best day to start a SIP. I did some projects in Machine Learning back in college but have forgotten most of it. I think it’ll be interesting to read how I went about this project and struggled with different aspects of it.

Getting the historical data

All data science problems start with getting the data. I was interested in the following 6 mutual funds:

  1. BSL MNC Fund – Direct Growth
  2. DSPBR Micro Cap Fund – Direct Growth
  3. Franklin India Smaller Companies Fund – Direct Growth
  4. Mirae Asset Emerging Bluechip Fund – Direct Growth
  5. Reliance Small Cap Fund – Direct Growth
  6. UTI Transportation and Logistics Fund – Direct Growth

After a few quick Bing searches, I had found a few sites that allowed querying historical NAVs for a range of dates, although one of them crashed after I gave it a period of one year and another reported that the maximum query interval is only 90 days. I thought I’d build an API over it to remove the 90-day limitation by dividing the query period into 90-day chunks. I also found a project on GitHub that queried current NAVs using node.js. I had never used node.js but decided to give it a shot. After installing it and reading the project readme and a little of the node.js documentation, I still had no clue how to execute that script.

I gave up on node.js and decided to do it in Python. I settled on the AMFI portal for querying NAV data. It required filling in a form. After a few more Bing searches, I found that I could do this with a library: mechanize. I installed it, read the documentation and started playing with it. I soon discovered that submitting the form on the portal opens another webpage with the form elements embedded in the URL, and that I could just request that URL and get the response directly. Another Bing search and I was using urllib2 to download the web page. The response was messy. The returned HTML elements had no IDs or names. After a few more Bing searches and experimenting with xml.etree, I felt it would be so much easier with XElement in C#. I fired up Visual Studio and coded for a minute or so, then decided to try a longer query interval in the URL. I tried an interval of 4 months and it worked, made it 1 year and it still worked, then queried for the entire duration of that fund (3.5 years) and, to my amazement, it still worked. Alright, no API needed! How did people code before search engines?
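The chunking turned out to be unnecessary, but the idea of working around a 90-day query limit is easy to sketch. The function name, and the assumption that the limit counts both the first and the last day of a query, are mine and not something any of the sites documented.

```python
from datetime import date, timedelta

def split_into_chunks(start, end, max_days=90):
    """Split the inclusive range [start, end] into consecutive sub-ranges
    that each cover at most max_days calendar days."""
    if start > end:
        raise ValueError("start must not be after end")
    chunks = []
    chunk_start = start
    while chunk_start <= end:
        # max_days - 1 because both endpoints of a chunk are counted
        # (an assumption about how the limit is measured).
        chunk_end = min(chunk_start + timedelta(days=max_days - 1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks

chunks = split_into_chunks(date(2013, 1, 1), date(2013, 12, 31))
for chunk_start, chunk_end in chunks:
    print(chunk_start, "to", chunk_end)
```

Each pair could then be sent as one query and the results concatenated.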

For each of the six funds, I queried historical data for the entire duration of the fund and manually copied the result table to Excel (ah, the programmer in me cried!). Surprisingly, all of them started on the same day, Jan 01, 2013; this can’t be a coincidence. Only the Franklin fund had data for Jan 01, 2013, and the UTI fund had duplicate rows for Dec 30, 2013.
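Duplicate rows like the UTI one are easy to drop in code rather than by hand. This is only a sketch: the rows below are made-up placeholders rather than real NAVs, and the ISO date format is an assumption about how the copied table would be stored.

```python
from datetime import date

# Hypothetical (date, NAV) rows; the NAV values are invented for illustration.
raw_rows = [
    ("2013-12-30", "10.50"),
    ("2013-12-30", "10.50"),  # duplicate row for the same date
    ("2013-12-31", "10.60"),
    ("2013-12-27", "10.40"),
]

def clean_nav_rows(rows):
    """Parse rows, keep the first NAV seen for each date, and sort by date."""
    nav_by_date = {}
    for date_text, nav_text in rows:
        day = date.fromisoformat(date_text)
        # setdefault keeps the first value and ignores later duplicates
        nav_by_date.setdefault(day, float(nav_text))
    return sorted(nav_by_date.items())

cleaned = clean_nav_rows(raw_rows)
for day, nav in cleaned:
    print(day, nav)
```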

Feature engineering

I decided to go with the following features:

  1. Month
  2. Day of week
  3. Half of month (first:-1 or second:1)
  4. Date parity (even:-1 or odd:1)
  5. Day
  6. Whether day was in mid of week or not (Monday or Friday:-1, else:1)

I learnt quite a few Excel functions for creating these features: TEXT, MOD, WEEKDAY, IF. For the nominal features, I first started with string values but soon remembered that regression requires numeric values, so I converted them to 0 and 1. Then I thought that a feature value of 0 might not be very useful because the coefficient would have no effect, so I converted them to -1 and 1. I also added a few derived features to introduce non-linearity: squares of (1), (2), and (5), and the square root of (5).
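For comparison, the same features can be derived from a date in a few lines of Python. This is a sketch, not the spreadsheet itself: the dictionary keys are my own names, the day-of-week numbering (Monday=1) may differ from what Excel’s WEEKDAY produced, and treating days 1 to 15 as the first half of the month is an assumption.

```python
from datetime import date
from math import sqrt

def date_features(day):
    """Build the six date features plus the four derived ones for one date."""
    day_of_week = day.isoweekday()  # Monday=1 ... Sunday=7 (assumed numbering)
    features = {
        "month": day.month,
        "day_of_week": day_of_week,
        "half": -1 if day.day <= 15 else 1,        # first half: -1, second half: 1 (assumed cutoff)
        "date_parity": 1 if day.day % 2 else -1,   # odd: 1, even: -1
        "day": day.day,
        "mid_week": -1 if day_of_week in (1, 5) else 1,  # Monday or Friday: -1, else: 1
    }
    # Derived features: squares of month, day of week and day, and square root of day
    features["month_sq"] = day.month ** 2
    features["day_of_week_sq"] = day_of_week ** 2
    features["day_sq"] = day.day ** 2
    features["day_sqrt"] = sqrt(day.day)
    return features

example = date_features(date(2013, 12, 30))  # a Monday
print(example)
```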

I had used Weka for some ML projects back in college so decided to start with it. Fired up the “Select attributes” feature with default values. It got rid of all features except month and date parity; not very useful. I decided to ignore it.


I chose “Linear Regression” in the classify tab with the DSPBR NAV as the output. I wasn’t expecting any brilliant results because mutual fund NAVs don’t depend just on the date. It came up with this:

Linear regression output for DSPBR


It threw away all the features except month and came up with a model that has only a tiny negative correlation with the output. Completely useless! I tried another algorithm, SMOreg, and it gave the following output:

SMOreg output for DSPBR

A tiny positive correlation with the actual output but at least it used all the features. Some observations:

  • Month, DayOfWeek, Half, DateParity, Month^2, Day^2, sqrt(day) have small positive weights, which might suggest that NAVs go up later in the year or month (or maybe that’s just what the stock market does: it goes up over time! Maybe the data has to be normalized to account for this, perhaps by the market index).
  • Day has a negative weight and Day^2 has almost the same positive weight so they cancel each other. Similarly for DayOfWeek.
  • Which half of the month it is has a small positive weight, which means there might be a small but positive correlation, with NAVs being higher during the second half of the month (or it might just be the stock market going up over time). However, the data shows that NAVs actually are slightly lower during the second half (30.004) than the first half (30.117). It’s not a good model!
  • DateParity also has a small positive weight which might mean NAVs are slightly higher on odd days (opposite of what I was going for; it’s better to buy low). We didn’t need to train a model for this. We could just look at the data and indeed the average NAV for DSPBR on odd days is 30.18 which is slightly higher than the average NAV on even days (29.94).
  • DayMidOrNot has a small negative weight. Since it has value -1 for Mondays and Fridays, this should mean that NAVs are slightly higher on Mondays and Fridays than during the middle of the week. The data shows that too: the average NAV is slightly higher on Mondays and Fridays (30.11) than during the middle of the week (30.03).
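The group averages quoted in these observations need no model at all. A minimal sketch of that comparison follows; the four rows are invented placeholders rather than DSPBR data, and the 15th as the cutoff between halves is an assumption.

```python
from collections import defaultdict
from datetime import date

def average_nav_by_group(rows, group_of):
    """Average the NAV over all rows that fall into the same group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, nav in rows:
        key = group_of(day)
        totals[key] += nav
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Hypothetical (date, NAV) rows, for illustration only.
rows = [
    (date(2013, 1, 1), 10.0),   # Tuesday, odd day, first half
    (date(2013, 1, 2), 12.0),   # Wednesday, even day, first half
    (date(2013, 1, 4), 14.0),   # Friday, even day, first half
    (date(2013, 1, 21), 18.0),  # Monday, odd day, second half
]

by_parity = average_nav_by_group(rows, lambda d: "odd" if d.day % 2 else "even")
by_half = average_nav_by_group(rows, lambda d: "first" if d.day <= 15 else "second")
by_week_position = average_nav_by_group(
    rows, lambda d: "mon_or_fri" if d.isoweekday() in (1, 5) else "mid_week"
)
print(by_parity, by_half, by_week_position)
```

Running the same three groupings over each fund’s full history would reproduce the kind of averages listed above.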

The model generated by Weka was completely useless. After this, I tried the regression tool in Excel and tried to interpret the results by reading about different statistical measures here. For DSPBR,

  • R square is minuscule (0.0065), meaning the model explains well under 1% of the variation in NAV and is useless.
  • Significance F is 85%, meaning that even if the features had no real relationship with the NAV, a fit at least this good would show up about 85% of the time.
  • All the attributes have very high P-values, meaning none of the coefficients is distinguishable from zero.

Interestingly, for UTI, the model performs slightly better with larger R square (0.0136) and lower Significance F (30%).
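Significance F is the p-value attached to the overall F-statistic of the regression. That statistic can be recomputed from R square, the number of rows and the number of features using the standard formula F = (R²/k) / ((1 − R²)/(n − k − 1)). The sketch below assumes the regression used all 866 rows and 10 features (the six listed plus the four derived ones); those counts are my assumption, not something read off the Excel output.

```python
def f_statistic(r_squared, n_samples, n_features):
    """Overall F-statistic of a linear regression with an intercept."""
    df_model = n_features
    df_residual = n_samples - n_features - 1
    if df_residual <= 0:
        raise ValueError("need more samples than features plus one")
    explained_per_feature = r_squared / df_model
    unexplained_per_residual_df = (1.0 - r_squared) / df_residual
    return explained_per_feature / unexplained_per_residual_df

# Assumed counts: 866 rows, 10 features.
f_dspbr = f_statistic(0.0065, n_samples=866, n_features=10)
f_uti = f_statistic(0.0136, n_samples=866, n_features=10)
print(f_dspbr, f_uti)
```

An F-statistic around 1 or below is roughly what pure noise would produce, which is why neither result counts as evidence of a real relationship.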


I guess machine learning was not the right approach for this problem, because of the small training set (866 instances) and features having almost no correlation with the desired output. As for the hypothesis, we can directly look for support in the historical data:

Mutual fund analysis results

  1. NAVs actually are slightly lower during the second half of the month than the first half for 4/6 funds. It’s not a clear result, but it might still be better to buy during the second half for a buy-and-hold approach.
  2. NAVs are higher for all the funds on odd days than on even days. This is the most surprising and clear result. It should mean that it’s better to buy on even days, contradicting the hypothesis.
  3. NAVs are higher for all the funds on Mondays and Fridays than during the middle of the week, a very clear result. It means it should be better to buy during the middle of the week, supporting the hypothesis.

As for the best day of month to start a SIP, we can compare SIPs of same amount started on different days for the same interval. I compared SIPs of 1000 started on 1st through 28th and continued for 25 months and compared their worth based on NAVs available for the last day in the data. Here are the observations:

  • For DSPBR, Franklin, Mirae and UTI, the three best days to start a SIP came out to be 11, 18 and 4, in that order.
  • For BSL and Reliance, the three best days to start a SIP came out to be 11, 18 and 25, in that order.
  • The difference in returns between the best day and the worst day was as high as 18.72%.

These results are surprising and there seems to be something special about the days 11, 18, 4 and 25 (or these funds just happen to be highly correlated with each other).
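The SIP comparison itself is simple arithmetic: each installment buys amount / NAV units, and the final worth is the total units times the last available NAV. Here is a minimal sketch under my own assumptions: when the SIP date is not a trading day, the next available NAV is used, and fees and taxes are ignored. The NAVs are invented, and the actual spreadsheet may handle these details differently.

```python
from datetime import date

def sip_final_value(nav_by_date, sip_day, start_year, start_month, months, amount=1000.0):
    """Worth, at the last NAV in the data, of investing `amount` once a month."""
    if not 1 <= sip_day <= 28:
        raise ValueError("sip_day must be between 1 and 28 so it exists in every month")
    trading_days = sorted(nav_by_date)
    units = 0.0
    for offset in range(months):
        year, month_index = divmod(start_year * 12 + (start_month - 1) + offset, 12)
        due = date(year, month_index + 1, sip_day)
        # Assumption: buy at the first available NAV on or after the due date.
        buy_day = next((d for d in trading_days if d >= due), None)
        if buy_day is None:
            break  # ran out of NAV data
        units += amount / nav_by_date[buy_day]
    return units * nav_by_date[trading_days[-1]]

# Hypothetical NAVs, for illustration only.
navs = {
    date(2013, 1, 1): 10.0,
    date(2013, 1, 15): 20.0,
    date(2013, 2, 1): 10.0,
    date(2013, 2, 15): 20.0,
    date(2013, 2, 28): 20.0,
}
values = {day: sip_final_value(navs, day, 2013, 1, months=2) for day in (1, 10, 15)}
best_day = max(values, key=values.get)
print(values, best_day)
```

Looping `sip_day` over 1 through 28 with `months=25` on a fund’s real history would give the ranking of days described above.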

All of the historical data and analysis is available here: mutual-fund-analysis. Let me know what you think in comments.

Disclaimer: I wrote this post in my personal time. All opinions expressed in this post are my personal opinions and not that of my employer. I’m not an expert in investment or finance.

4 thoughts on “When to invest in mutual funds: An exercise in data science”

  1. Sir, though I am quite young and don’t have much knowledge of mutual funds and SIPs, I think your model can be improved slightly, for example by adding the time of day at which the market opens or closes. My father used to buy or sell a lot in that phase (9:15 AM or 3:30 PM India time), so I think that can be considered an extreme position. Also, what % of the data did you use as train and test for cross validation? Did you plot anything or remove outliers? Those really help a lot. We could also use more sophisticated algorithms, because linear regression underperforms many times (from what I have seen).
    Like always, it was a very nice read (I am already a fan of your blog, especially that segment tree article *_*). If you don’t mind, can I use your data to do predictions for the same?

    • Thank you for the nice comments. All the data and code on this blog is free for anyone to use. You are welcome to play with it. Time of day is not a factor for mutual funds because they are priced once a day (after trading is over). I was using 7-fold cross validation (so 1 fold for testing). I didn’t remove outliers but tested with many different algorithms, with similar outcomes. There’s just not much to learn in the data. With just these signals, month is the most important factor because stock prices tend to go up over time; this fact is useless to investors looking for when to invest. If you find any other interesting things, please share them in the comments.
