### Motivation

In this paper we use Monte Carlo simulation to show the relationship between the Information Coefficient (IC), correlation, decile returns, and linear regression.^{1} We can also gain insight into investment-related questions, such as:

- What level of IC is considered good?
- What effect does volatility have on the spread?
- What effect does universe size have on the hit rate?

We will start off by defining some terms so we are all on the same page.

- The *Spread* is the difference between the average top decile return and the average bottom decile return.
- The *IC* is the correlation between two series, here the return series and the exposure series. The regular Pearson correlation can be used here, but Spearman's rank correlation is used more often because it is less affected by outliers.
- A *linear model* is the regression of one series on another, resulting in an intercept and a coefficient that describes the relationship between the two variables.
- *R-squared* measures the goodness of fit of the linear model. It is also referred to as the coefficient of determination.
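As a concrete illustration of these definitions, here is a minimal Python sketch. The paper itself works in R, so the function names (`pearson`, `spearman`, `decile_spread`, `ols`) and the toy data are illustrative assumptions, not the author's code. It computes each quantity on 20 hypothetical assets, one of which has an outlier return.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank each value from 1..n (assumes no ties, as with continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def decile_spread(exposures, returns):
    """Average return of the top exposure decile minus the bottom decile."""
    order = sorted(range(len(exposures)), key=lambda i: exposures[i])
    k = len(order) // 10
    bottom = sum(returns[i] for i in order[:k]) / k
    top = sum(returns[i] for i in order[-k:]) / k
    return top - bottom

def ols(x, y):
    """Intercept, coefficient, and R-squared of the regression y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope, pearson(x, y) ** 2

# Toy data (hypothetical): returns rise with exposure; the last asset is an outlier.
exposures = [float(i) for i in range(1, 21)]
returns = [0.01 * e for e in exposures]
returns[-1] = 5.0

ic_pearson = pearson(exposures, returns)
ic_spearman = spearman(exposures, returns)
spread = decile_spread(exposures, returns)
intercept, coef, r_squared = ols(exposures, returns)
print(ic_pearson, ic_spearman, spread, coef, r_squared)
```

On this toy data the rank correlation equals 1 because the ordering is perfectly monotone, while the Pearson correlation is pulled well below 1 by the single outlier.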

### Simulation Process

We proceed by assuming asset returns are normally distributed and we generate two random series from a bivariate normal distribution with given mean and correlation. We will call the first random series the *returns* and call the second random series the *exposures*. The *exposures* represent the factor, which could be momentum, book to price, or any other factor.

As defined above, the correlation between the two random series is the IC. Each simulation is performed as follows:

- Divide into deciles based on exposures
- Calculate spread between top decile and bottom decile
- Calculate the IC
- Run a linear regression
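The steps above can be sketched in Python as follows. This is a stand-in under stated assumptions, not the author's R code: correlated pairs are built from two independent standard normal draws rather than with `mvrnorm`, the IC is taken to be the Spearman rank correlation, the spread is reported in basis points, and every function name is hypothetical. Only 200 simulations are run to keep the example quick.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def ranks(values):
    """Rank each value from 1..n (ties are not expected for continuous draws)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def moments(x, y):
    """Centered sums of squares and cross-products: (sxx, syy, sxy)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

def one_simulation(n_assets, rho, vol_ret, vol_exp, rng):
    # Zero-mean bivariate normal draws with correlation rho.
    exposures, returns = [], []
    for _ in range(n_assets):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        exposures.append(vol_exp * z1)
        returns.append(vol_ret * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
    # 1. Divide into deciles based on exposures.
    order = sorted(range(n_assets), key=lambda i: exposures[i])
    k = n_assets // 10
    # 2. Spread between the top and bottom decile average returns.
    spread = (mean([returns[i] for i in order[-k:]])
              - mean([returns[i] for i in order[:k]]))
    # 3. IC, here the Spearman rank correlation (an assumption).
    sxx, syy, sxy = moments(ranks(exposures), ranks(returns))
    ic = sxy / math.sqrt(sxx * syy)
    # 4. Linear regression of returns on exposures.
    sxx, syy, sxy = moments(exposures, returns)
    coef = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return spread, ic, coef, r_squared

def run_simulations(n_sims, n_assets, rho, vol_ret=0.08, vol_exp=0.08, seed=42):
    rng = random.Random(seed)
    runs = [one_simulation(n_assets, rho, vol_ret, vol_exp, rng)
            for _ in range(n_sims)]
    spreads = [run[0] for run in runs]
    return {
        "Spread": 1e4 * mean(spreads),  # basis points (assumed unit)
        "Pct.Positive": mean([1.0 if s > 0 else 0.0 for s in spreads]),
        "IC": mean([run[1] for run in runs]),
        "Coef": mean([run[2] for run in runs]),
        "R": mean([math.sqrt(run[3]) for run in runs]),
    }

summary = run_simulations(n_sims=200, n_assets=1000, rho=0.05)
print(summary)
```

Each call to `one_simulation` yields one spread, one IC, and one regression fit; `run_simulations` averages them, which mirrors how the table rows are described in the text.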

### Base case

To run the simulation we use the R programming language's multivariate random number generator function *mvrnorm* from the MASS package to generate two series for each simulation. Both series have zero mean. The only thing they have in common is a correlation, which we vary from zero to twenty percent. Before going further into the details, let's have a look at some simulations so we can describe how everything relates.

Correlation | Return.Vol | Exposure.Vol | Spread | Pct.Positive | IC | R.squared | R | Coef | Scaled.IC |
---|---|---|---|---|---|---|---|---|---|
0.00 | 0.08 | 0.08 | -3.602 | 0.481 | -0.001 | 0.001 | 0.025 | -0.001 | -0.001 |
0.01 | 0.08 | 0.08 | 28.432 | 0.604 | 0.010 | 0.001 | 0.026 | 0.010 | 0.010 |
0.02 | 0.08 | 0.08 | 55.937 | 0.691 | 0.019 | 0.001 | 0.031 | 0.020 | 0.019 |
0.03 | 0.08 | 0.08 | 85.344 | 0.775 | 0.030 | 0.002 | 0.037 | 0.031 | 0.030 |
0.04 | 0.08 | 0.08 | 111.117 | 0.832 | 0.037 | 0.003 | 0.042 | 0.039 | 0.037 |
0.05 | 0.08 | 0.08 | 139.557 | 0.897 | 0.048 | 0.003 | 0.051 | 0.050 | 0.048 |
0.06 | 0.08 | 0.08 | 169.638 | 0.927 | 0.058 | 0.005 | 0.061 | 0.060 | 0.058 |
0.07 | 0.08 | 0.08 | 195.786 | 0.967 | 0.067 | 0.006 | 0.070 | 0.070 | 0.067 |
0.10 | 0.08 | 0.08 | 278.362 | 0.996 | 0.095 | 0.011 | 0.100 | 0.100 | 0.095 |
0.20 | 0.08 | 0.08 | 561.558 | 1.000 | 0.192 | 0.041 | 0.200 | 0.200 | 0.192 |

Table 1 summarizes 3000 simulations, each with 1000 assets. By assets we mean the length of each random series, and by simulations we mean the number of random samples we draw. The values shown in the table are averages over all the simulations.

The first column shows the correlation, which represents the *information*. The next two columns show the volatility of the *returns* series and the *exposures* series, which we have set to 8% for now but will vary later.

The Spread is one measure of the performance of the *factor*. The *exposures* series is our *factor*, and we create deciles based on that series. Once we have the deciles, we average the *returns* within each decile. The spread is the difference between the top decile average return and the bottom decile average return; the magnitudes in the table are consistent with basis points. Each simulation produces one spread, so with 3000 simulations we have 3000 spreads. The average of these spreads is the Spread column, and the proportion of them that are positive is the Pct.Positive column.

The IC is the average correlation between the *exposures* and *returns* series and is another standard measure of the performance of a *factor*. Measuring the correlation between two series is a quick and easy way to see how closely they are related, and how powerful the *exposures* may be in predicting *returns*.

Another way to do so is to run a regression of the *returns* on the *exposures*. The regression provides three main outputs: the y-intercept, the coefficient, and the R-squared. The R-squared is a measure of the goodness of fit of the regression equation. The square root of the R-squared statistic returns the IC (to a close approximation). In this simple one-variable linear regression framework, with both series having the same volatility, the coefficient is equivalent to the correlation that was introduced into the two random series.

Lastly, the coefficient can be approximated by the scaled IC as defined in the formula below:

\[\text{Scaled.IC} = \frac{\sigma_{returns}}{\sigma_{exposures}} \times \text{IC}\]

In this table both the *returns* and *exposures* have the same standard deviation, so the volatility ratio is one and the formula is not very informative. In the next table we run simulations with a higher *returns* volatility, and you will see that the formula still holds.

Remember that these are random series with zero mean, so the only information content is the correlation. By running enough simulations we can reliably approximate the true correlation by both the IC and the R (the square root of the regression R-squared statistic). We can also approximate the coefficient by calculating the scaled IC.
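The scaled IC formula follows from the textbook expression for the slope in a one-variable least-squares regression of *returns* on *exposures*. As a short sketch, writing \(\hat{\beta}\) for the regression coefficient and \(r\) for the sample Pearson correlation (both symbols are notation introduced here, not the paper's):

\[\hat{\beta} = \frac{\operatorname{Cov}(exposures, returns)}{\sigma_{exposures}^{2}} = \frac{r \, \sigma_{returns} \, \sigma_{exposures}}{\sigma_{exposures}^{2}} = \frac{\sigma_{returns}}{\sigma_{exposures}} \times r\]

The identity is exact when the IC is the Pearson correlation. If the IC is computed as a Spearman rank correlation it holds only approximately, which may be why Scaled.IC sits slightly below Coef at the higher correlations in the tables.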

### Correlation, IC, and hit rate

As you can see from the base case simulation run, the correlation and the IC are closely linked. In these simulations the correlation is the given relationship between *returns* and *exposures* and the IC is a measurement of how well our signal (or rank in this case) works.^{2}

The percent positive (Pct.Positive) column shows the proportion of simulation runs (out of 3000) in which the spread was positive. This number is often called the hit rate. If the factor has zero correlation we would expect a hit rate of 0.5 (or 50%). A correlation of one percent bumps the hit rate up to about 60%, and two percent gets us to about 69%. ICs in the range of 5% to 10% are usually considered very good. The hit rate in that case would be between roughly 90% and 99%, which is very good indeed. Note that this is on a large universe of 1000 assets. We will see later that the hit rate declines rapidly as the number of assets falls.

### Higher Return Volatility

In the next set of simulations we increase the return volatility from 8% to 16%.

Correlation | Return.Vol | Exposure.Vol | Spread | Pct.Positive | IC | R.squared | R | Coef | Scaled.IC |
---|---|---|---|---|---|---|---|---|---|
0.00 | 0.16 | 0.08 | 2.493 | 0.497 | 0.000 | 0.001 | 0.026 | 0.001 | 0.001 |
0.01 | 0.16 | 0.08 | 52.148 | 0.588 | 0.009 | 0.001 | 0.026 | 0.019 | 0.018 |
0.02 | 0.16 | 0.08 | 105.172 | 0.688 | 0.017 | 0.001 | 0.029 | 0.037 | 0.035 |
0.03 | 0.16 | 0.08 | 167.679 | 0.764 | 0.028 | 0.002 | 0.036 | 0.059 | 0.057 |
0.04 | 0.16 | 0.08 | 217.495 | 0.833 | 0.038 | 0.003 | 0.043 | 0.079 | 0.076 |
0.05 | 0.16 | 0.08 | 278.395 | 0.893 | 0.048 | 0.003 | 0.051 | 0.100 | 0.096 |
0.06 | 0.16 | 0.08 | 341.000 | 0.931 | 0.058 | 0.005 | 0.061 | 0.121 | 0.116 |
0.07 | 0.16 | 0.08 | 389.511 | 0.959 | 0.066 | 0.006 | 0.070 | 0.139 | 0.132 |
0.10 | 0.16 | 0.08 | 566.551 | 0.995 | 0.096 | 0.011 | 0.100 | 0.201 | 0.192 |
0.20 | 0.16 | 0.08 | 1124.047 | 1.000 | 0.192 | 0.041 | 0.200 | 0.401 | 0.383 |

Doubling the *returns* volatility roughly doubles the spread. The IC and the regression R (the square root of R-squared) still match the actual correlation between the two random series, but the coefficient is twice as large as in the previous table. This is because the *returns* volatility is now twice as high as the *exposures* volatility. The regression coefficient is closely approximated by the scaled IC measure.
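The near-doubling of the spread can be checked against a closed-form, large-sample approximation. The sketch below is not from the paper: it assumes exactly bivariate normal data, population (not sample) deciles, and a spread quoted in basis points, and `expected_spread_bps` is a hypothetical helper name.

```python
from statistics import NormalDist

def expected_spread_bps(rho, vol_ret, quantile=0.9):
    """Large-sample expected top-minus-bottom decile spread, in basis points."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(quantile)
    # Mean of a standard normal above its 90th percentile: pdf(z) / (1 - quantile).
    top_decile_mean = std_normal.pdf(z) / (1.0 - quantile)
    # E[return | exposure z-score] = rho * vol_ret * z-score, and the bottom
    # decile mirrors the top, hence the factor of two.
    return 1e4 * 2.0 * rho * vol_ret * top_decile_mean

base = expected_spread_bps(0.05, 0.08)
high_vol = expected_spread_bps(0.05, 0.16)
print(round(base, 1), round(high_vol, 1))
```

Under these assumptions the expected spread is linear in both the correlation and the *returns* volatility, and it does not involve the *exposures* volatility at all.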

### Lower Volatility

In the next table we simulate both series with only 2% volatility (for both *returns* and *exposures*). For a given correlation, the percent positive, IC, and coefficients are all comparable to the base case, but the spread is much lower. The *returns* volatility is what creates the opportunity to profit and if the correlation is high enough the investor can capitalize on the hit rate (percent positive).

Correlation | Return.Vol | Exposure.Vol | Spread | Pct.Positive | IC | R.squared | R | Coef | Scaled.IC |
---|---|---|---|---|---|---|---|---|---|
0.00 | 0.02 | 0.02 | -0.799 | 0.487 | -0.001 | 0.001 | 0.026 | -0.001 | -0.001 |
0.01 | 0.02 | 0.02 | 8.352 | 0.618 | 0.011 | 0.001 | 0.027 | 0.011 | 0.011 |
0.02 | 0.02 | 0.02 | 14.668 | 0.688 | 0.020 | 0.001 | 0.031 | 0.021 | 0.020 |
0.03 | 0.02 | 0.02 | 21.622 | 0.773 | 0.029 | 0.002 | 0.036 | 0.030 | 0.029 |
0.04 | 0.02 | 0.02 | 27.764 | 0.838 | 0.038 | 0.003 | 0.042 | 0.040 | 0.038 |
0.05 | 0.02 | 0.02 | 35.557 | 0.900 | 0.049 | 0.004 | 0.052 | 0.051 | 0.049 |
0.06 | 0.02 | 0.02 | 41.850 | 0.935 | 0.057 | 0.005 | 0.060 | 0.059 | 0.057 |
0.07 | 0.02 | 0.02 | 48.828 | 0.958 | 0.067 | 0.006 | 0.071 | 0.070 | 0.067 |
0.10 | 0.02 | 0.02 | 68.160 | 0.992 | 0.095 | 0.011 | 0.099 | 0.099 | 0.095 |
0.20 | 0.02 | 0.02 | 140.598 | 1.000 | 0.191 | 0.041 | 0.200 | 0.201 | 0.192 |

All the usual relationships hold. The only difference is that the spread is much lower due to the lower volatility of the *returns* series.

### Higher Exposure Volatility

Correlation | Return.Vol | Exposure.Vol | Spread | Pct.Positive | IC | R.squared | R | Coef | Scaled.IC |
---|---|---|---|---|---|---|---|---|---|
0.00 | 0.08 | 0.16 | 1.992 | 0.507 | 0.000 | 0.001 | 0.025 | 0.000 | 0.000 |
0.01 | 0.08 | 0.16 | 24.810 | 0.593 | 0.010 | 0.001 | 0.026 | 0.005 | 0.005 |
0.02 | 0.08 | 0.16 | 55.793 | 0.689 | 0.019 | 0.001 | 0.030 | 0.010 | 0.009 |
0.03 | 0.08 | 0.16 | 83.876 | 0.777 | 0.029 | 0.002 | 0.036 | 0.015 | 0.015 |
0.04 | 0.08 | 0.16 | 112.334 | 0.845 | 0.038 | 0.003 | 0.043 | 0.020 | 0.019 |
0.05 | 0.08 | 0.16 | 140.015 | 0.890 | 0.047 | 0.003 | 0.051 | 0.025 | 0.023 |
0.06 | 0.08 | 0.16 | 172.254 | 0.933 | 0.058 | 0.005 | 0.062 | 0.031 | 0.029 |
0.07 | 0.08 | 0.16 | 197.061 | 0.964 | 0.066 | 0.006 | 0.070 | 0.035 | 0.033 |
0.10 | 0.08 | 0.16 | 282.661 | 0.994 | 0.096 | 0.011 | 0.100 | 0.050 | 0.048 |
0.20 | 0.08 | 0.16 | 562.552 | 1.000 | 0.192 | 0.041 | 0.201 | 0.100 | 0.096 |

When the *exposures* volatility is twice as high as the *returns* volatility, the spread remains comparable to the base case, but the scaled IC is cut in half, as is the regression coefficient. Clearly, we want the volatility to be in the *returns* and not in the *exposures*.
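A small Python sketch shows why this happens. It rescales one fixed sample instead of redrawing, which is a simplification of the paper's setup, and the helper names `slope` and `decile_spread` are hypothetical. Doubling every exposure leaves the decile membership, and therefore the spread, untouched, while the regression coefficient is halved.

```python
import random

def slope(x, y):
    """Least-squares coefficient from regressing y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def decile_spread(exposures, returns):
    """Average return of the top exposure decile minus the bottom decile."""
    order = sorted(range(len(exposures)), key=lambda i: exposures[i])
    k = len(order) // 10
    top = sum(returns[i] for i in order[-k:]) / k
    bottom = sum(returns[i] for i in order[:k]) / k
    return top - bottom

rng = random.Random(0)
exposures = [rng.gauss(0.0, 0.08) for _ in range(1000)]
# Illustrative returns that load weakly on the exposures, plus noise.
returns = [0.1 * e + rng.gauss(0.0, 0.08) for e in exposures]
doubled = [2.0 * e for e in exposures]  # exposure volatility 8% -> 16%

spread_base = decile_spread(exposures, returns)
spread_doubled = decile_spread(doubled, returns)
coef_base = slope(exposures, returns)
coef_doubled = slope(doubled, returns)
print(spread_base, spread_doubled, coef_base, coef_doubled)
```

Because deciles depend only on the ordering of the exposures, any positive rescaling of the factor leaves the spread and the rank IC unchanged; only the units of the coefficient change.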

### Non-zero Mean

Correlation | Return.Vol | Exposure.Vol | Spread | Pct.Positive | IC | R.squared | R | Coef | Scaled.IC |
---|---|---|---|---|---|---|---|---|---|
0.00 | 0.08 | 0.08 | 1.944 | 0.515 | 0.002 | 0.001 | 0.025 | 0.001 | 0.002 |
0.01 | 0.08 | 0.08 | 28.674 | 0.601 | 0.010 | 0.001 | 0.026 | 0.011 | 0.010 |
0.02 | 0.08 | 0.08 | 53.470 | 0.676 | 0.018 | 0.001 | 0.030 | 0.019 | 0.018 |
0.03 | 0.08 | 0.08 | 80.222 | 0.760 | 0.028 | 0.002 | 0.035 | 0.029 | 0.028 |
0.04 | 0.08 | 0.08 | 117.386 | 0.844 | 0.039 | 0.003 | 0.044 | 0.041 | 0.039 |
0.05 | 0.08 | 0.08 | 142.425 | 0.897 | 0.048 | 0.004 | 0.052 | 0.051 | 0.048 |
0.06 | 0.08 | 0.08 | 171.230 | 0.935 | 0.057 | 0.005 | 0.061 | 0.060 | 0.057 |
0.07 | 0.08 | 0.08 | 196.997 | 0.956 | 0.066 | 0.006 | 0.070 | 0.069 | 0.066 |
0.10 | 0.08 | 0.08 | 280.055 | 0.992 | 0.096 | 0.011 | 0.100 | 0.100 | 0.096 |
0.20 | 0.08 | 0.08 | 556.470 | 1.000 | 0.190 | 0.040 | 0.198 | 0.199 | 0.190 |

While a higher return is generally better, in this case we are measuring the spread, which is the average return in the top decile minus the average return in the bottom decile. A higher mean return raises the average return of every decile, but does not (necessarily) benefit the top decile more than the bottom decile, leaving the spread pretty much the same. As you can see, all the other metrics are comparable as well. The correlation and volatility are the two driving forces of factor performance.
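One way to see why the mean drops out is to write each return as a common mean \(\mu\) plus a zero-mean part \(\varepsilon\) (notation assumed here for illustration, not taken from the paper). Averaging within the top and bottom deciles then gives:

\[\text{Spread} = \left(\mu + \bar{\varepsilon}_{top}\right) - \left(\mu + \bar{\varepsilon}_{bottom}\right) = \bar{\varepsilon}_{top} - \bar{\varepsilon}_{bottom}\]

The correlation and the regression coefficient are likewise unchanged when a constant is added to every return; only the regression intercept shifts.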

### Small Number of Assets

Correlation | Return.Vol | Exposure.Vol | Spread | Pct.Positive | IC | R.squared | R | Coef | Scaled.IC |
---|---|---|---|---|---|---|---|---|---|
0.00 | 0.08 | 0.08 | -3.410 | 0.504 | -0.001 | 0.013 | 0.091 | -0.001 | -0.001 |
0.01 | 0.08 | 0.08 | 31.407 | 0.534 | 0.009 | 0.012 | 0.090 | 0.010 | 0.009 |
0.02 | 0.08 | 0.08 | 53.209 | 0.558 | 0.019 | 0.013 | 0.091 | 0.021 | 0.019 |
0.03 | 0.08 | 0.08 | 80.859 | 0.575 | 0.027 | 0.014 | 0.094 | 0.029 | 0.028 |
0.04 | 0.08 | 0.08 | 118.916 | 0.618 | 0.037 | 0.014 | 0.096 | 0.040 | 0.037 |
0.05 | 0.08 | 0.08 | 144.229 | 0.640 | 0.047 | 0.015 | 0.100 | 0.051 | 0.048 |
0.06 | 0.08 | 0.08 | 158.161 | 0.662 | 0.056 | 0.016 | 0.102 | 0.059 | 0.056 |
0.07 | 0.08 | 0.08 | 201.076 | 0.703 | 0.067 | 0.017 | 0.107 | 0.072 | 0.068 |
0.10 | 0.08 | 0.08 | 257.235 | 0.732 | 0.091 | 0.022 | 0.121 | 0.095 | 0.091 |
0.20 | 0.08 | 0.08 | 537.909 | 0.908 | 0.185 | 0.050 | 0.198 | 0.195 | 0.186 |

As mentioned earlier, when we lower the number of assets to 80 the hit rate (Pct.Positive) falls substantially. The rest of the metrics are similar to the base case. Quantitative investing is a numbers game, where it pays to have as much breadth as possible. With 5% correlation and 1000 assets the simulated hit rate was about 90%. With only 80 assets the simulated hit rate falls to around 64%. A 64% hit rate is not bad, but you have to make numerous “bets” to get that advertised number. If you are only investing in a small number of assets, you could get a resulting hit rate that is much worse (or better) than the advertised hit rate.
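The dependence of the hit rate on universe size can be approximated analytically. The sketch below is a back-of-the-envelope calculation that is not part of the paper: it treats the spread as normally distributed, uses population deciles, and ignores the small effect of the correlation on the within-decile variance. `approx_hit_rate` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def approx_hit_rate(rho, n_assets):
    """Approximate probability that the top-minus-bottom decile spread is positive."""
    std_normal = NormalDist()
    z90 = std_normal.inv_cdf(0.9)
    top_decile_mean = std_normal.pdf(z90) / 0.1
    # Expected spread, measured in units of the returns volatility.
    expected = 2.0 * rho * top_decile_mean
    # Each decile average is a mean of n/10 returns, and the spread is the
    # difference of two such (approximately independent) averages.
    per_decile = n_assets // 10
    stdev = math.sqrt(2.0 / per_decile)
    return std_normal.cdf(expected / stdev)

large_universe = approx_hit_rate(0.05, 1000)
small_universe = approx_hit_rate(0.05, 80)
print(round(large_universe, 3), round(small_universe, 3))
```

Under these assumptions the approximation gives roughly 0.89 for 1000 assets and 0.64 for 80 assets at a 5% correlation, close to the simulated 0.897 and 0.640. The returns volatility cancels out of the ratio, which is consistent with the hit rate being similar across the volatility scenarios above.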

### Conclusion and Shiny App

This paper has explained the relationship between the Information Coefficient (IC) and linear regression model output. These simulations should help you understand the importance of correlation in factor investing and what level of IC and/or correlation will yield acceptable results. We have shown that the larger the universe, the better the expected hit rate (for a given correlation). We have shown that *returns* volatility is good and *exposures* volatility is (relatively) bad. Average returns are not as important as the spread between the top and bottom deciles.

If you would like to change the parameters and run your own simulations, please visit this Shiny App link.

^{1} This paper was motivated by a short paper and slides by Oliver Buckley at Invesco, which can be found at https://www.northinfo.com/documents/88.pdf.

^{2} Since this is a simple one-factor (*exposure*) model, the IC matches the correlation very closely (we are able to capture the full signal). In a multi-factor context the IC will usually be lower than the correlation because the factors will not be perfectly uncorrelated.