As you may know, Docear offers literature recommendations, and as you may also know, part of my PhD is to find out how to make these recommendations as good as possible. To accomplish this, I need to know what a ‘good’ recommendation is. So far, we have been using click-through rates (CTR) to evaluate different recommendation algorithms. CTR is a common performance measure in online advertising. For instance, if a recommendation is shown 1000 times and clicked 12 times, then the CTR is 1.2% (12/1000). That means if algorithm A has a CTR of 1% and algorithm B has a CTR of 2%, B is better.
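The CTR arithmetic above can be written down as a minimal Python sketch. The function name `click_through_rate` and the click counts for algorithms A and B are hypothetical, chosen only to reproduce the 1% and 2% figures from the text; this is not Docear's actual code.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return the fraction of shown recommendations that were clicked."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


# Example from the text: shown 1000 times, clicked 12 times.
ctr = click_through_rate(12, 1000)
print(f"CTR: {ctr:.1%}")

# Hypothetical counts that yield the 1% and 2% CTRs mentioned above.
ctr_a = click_through_rate(10, 1000)
ctr_b = click_through_rate(20, 1000)
better = "B" if ctr_b > ctr_a else "A"
print(f"Algorithm {better} has the higher CTR")
```

Under this measure, comparing two algorithms is simply a comparison of two ratios, which is part of why the metric is so convenient.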
Recently, we submitted a paper to a conference. The paper summarized the results of some evaluations we did with different recommendation algorithms. The paper was rejected. Among other things, a reviewer criticized CTR as too simple an evaluation metric. According to the reviewer, we should rather use metrics that are common in information retrieval, such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or Precision (e.g. Mean Average Precision, MAP).
The funny thing is, CTR, MAE, MSE, RMSE and Precision are basically all the same, at least in a binary classification problem (recommendation relevant / clicked vs. recommendation irrelevant / not clicked). The table shows an example. Assume you show ten recommendations to users (Rec1…Rec10). The ‘Estimate’ for each recommendation is then ‘1’, i.e. we predict that it is clicked by a user. The ‘Actual’ value describes whether a user actually clicked on a recommendation (‘1’) or not (‘0’). The ‘Error’ is either 0 (if the recommendation actually was clicked) or 1 (if it was not clicked). The mean absolute error (MAE) is simply the sum of all errors (6 in the example) divided by the total number of recommendations (10 in the example). Since we have only zeros and ones, it makes no difference whether they are squared or not. Consequently, the mean squared error (MSE) is identical to MAE. In addition, precision and mean average precision (MAP) are identical to CTR; precision (and CTR) is exactly 1-MAE (or 1-MSE). RMSE also moves in lockstep with the other values, because it is simply the square root of MSE (or MAE) and therefore ranks algorithms in the same order.
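The equivalence described above can be checked numerically. The following is a minimal sketch in plain Python; the click pattern in `actual` is an assumption (the text only fixes the totals: ten recommendations, six errors), and the variable names are mine.

```python
import math

# Assumed click pattern: 4 of 10 recommendations clicked, so 6 errors in total.
# Which recommendations were clicked is hypothetical; only the totals come from the text.
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
estimate = [1] * len(actual)  # every shown recommendation is predicted to be clicked

n = len(actual)
errors = [abs(e - a) for e, a in zip(estimate, actual)]

ctr = sum(actual) / n                        # clicks / impressions
mae = sum(errors) / n                        # mean absolute error
mse = sum(err ** 2 for err in errors) / n    # mean squared error
rmse = math.sqrt(mse)                        # root mean squared error

# Precision: true positives among everything predicted positive.
true_positives = sum(1 for e, a in zip(estimate, actual) if e == 1 and a == 1)
precision = true_positives / sum(estimate)

print(f"CTR={ctr:.2f} precision={precision:.2f} MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.3f}")
```

Because every error is either 0 or 1, squaring changes nothing, so `mae` and `mse` coincide, and `ctr` and `precision` both equal one minus that value. MAP is left out of the sketch because it additionally depends on the ranking of recommendations, which the example does not specify.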
However, the question remains: what is a ‘good’ recommendation? And how would you measure the performance of a recommendation algorithm, if not with click-through rates or any of the other metrics mentioned above?