FRISK's credo

22/09/08 László Hajdu

Kaggle (founded in 2010) is the world's largest data mining community, with the number of users reaching 5,000,000 by 2021. Kaggle's undoubted merit is that it has been a catalyst for the spread of data science and analytics. Yet something is going wrong!

A Kaggle data mining competition looks something like this:

  1. Modeling: the model is built on the training database, together with the data processing and wrangling needed to generate new variables.
  2. Testing on the test database: this database is used to evaluate the model built in the previous step. The competitor can do the evaluation because the target variable is included in the test database. If the competitor is dissatisfied with the performance of the model, they build a new model on the training database and evaluate it again on the test data. Model building is therefore an iterative process.
  3. Submitting a model to the contest: if the competitor feels that the model is good enough, they run it on the evaluation database and post the results on the Kaggle site. The competitor cannot evaluate the model on this database because the target variable is not included. Once uploaded to the Kaggle page, however, the results are evaluated automatically (the organizer of the competition, of course, has the target variable), so the competitor can immediately see how the model performs on the evaluation database. If the competitor is not satisfied with the ranking, they can continue modeling (using the training database).
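
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Kaggle's actual tooling: the data is synthetic, the "model" is a single threshold on one feature, and `make_rows`, `accuracy`, and `organizer_score` are hypothetical helpers invented for this example. `organizer_score` stands in for the automatic evaluation on the competition site.

```python
import random

def make_rows(n, rng):
    """Synthetic rows: one feature x and a noisy binary target y."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x + rng.gauss(0, 0.15) > 0.5)
        rows.append((x, y))
    return rows

def accuracy(threshold, rows):
    """Share of rows where the rule 'predict 1 if x > threshold' is right."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)

rng = random.Random(0)
train = make_rows(300, rng)        # step 1: modeling data
test = make_rows(100, rng)         # step 2: target visible to the competitor
evaluation = make_rows(100, rng)   # step 3: target known only to the organizer

# The competitor only ever sees the features of the evaluation database.
evaluation_features = [x for x, _ in evaluation]

def organizer_score(predictions):
    """Hypothetical stand-in for the site's automatic scoring."""
    hits = sum(p == y for p, (_, y) in zip(predictions, evaluation))
    return hits / len(evaluation)

# Steps 1 and 2, iterated: fit on train, check on test, refine, repeat.
best_threshold, best_test_acc = None, -1.0
for grid in (5, 20, 100):  # each round searches a finer grid of thresholds
    candidates = [i / grid for i in range(grid + 1)]
    threshold = max(candidates, key=lambda t: accuracy(t, train))
    test_acc = accuracy(threshold, test)
    if test_acc > best_test_acc:
        best_threshold, best_test_acc = threshold, test_acc

# Step 3: submit predictions; only the organizer can compute the score.
submission = [int(x > best_threshold) for x in evaluation_features]
leaderboard_score = organizer_score(submission)
print(f"threshold={best_threshold:.2f} test={best_test_acc:.3f} "
      f"leaderboard={leaderboard_score:.3f}")
```

Note that the training accuracy is used only to pick a candidate and never appears in the final report, which mirrors how the competition workflow treats it.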


The above principles are clear and professional; you can hardly think of a better way to organize a data mining competition. However, it is seldom mentioned that this type of competition not only develops skills but also lets some of them atrophy. Below are three skills that are an important part of business analytics, but not of Kaggle-type competitions!

  1. Overlearning management: model performance is typically better on the training database than on the test database, often much better. The "Kaggle principle" does not address this. The only important thing is that the model performs as well as possible on the test or evaluation database; how it performs on the training database is irrelevant to the whole process. Overlearning (overfitting) is not penalized in competitions, although it is a dangerous phenomenon that indicates model instability.
  2. Train-test design: modeling and data processing are obviously important, but there are other factors that can be used to increase the accuracy of the model. One such factor is how the training, test, and evaluation databases are designed. Several factors need to be taken into account: (i) how large should these databases be, and (ii) should any value of the target variable be overweighted, and if so, by how much? However, Kaggle competitions provide these databases (train, test) out of the box, so the ability to design train-test-evaluation databases is not developed in these competitions.
  3. Business communication atrophies: modeling appears to be a purely technical task. Collect as much data as possible, build a model on that data, then evaluate the model. These steps can be automated very well and very quickly, and very good models can be built this way. The problem is that there is no opportunity to ask questions during the modeling, although in most cases the success of a business analysis depends on exactly that. It is about understanding the data as well as possible, aligning the analysis with the business objectives, and creating the right data processing for them. Kaggle competitions reinforce the idea that you only need to focus on the data! In reality, the business need comes first, and only then the analytics!
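
The first two skills can be made concrete with a small Python sketch. It is an illustration under assumptions, not a recommended recipe: the data is synthetic, the 60/20/20 split and the doubling of the rare target value are arbitrary choices, "overweighting" is interpreted here as simple oversampling, and `stratified_split`, `oversample`, `memorizer`, and `simple_rule` are hypothetical names made up for the example. The sketch designs the three databases by hand and then reports the gap between training and test accuracy, which is the number a competition leaderboard never shows.

```python
import random

def stratified_split(rows, fractions, rng):
    """Split rows into len(fractions) parts, keeping the target share in each."""
    parts = [[] for _ in fractions]
    for label in sorted({y for _, y in rows}):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        start = 0
        for i, frac in enumerate(fractions):
            if i == len(fractions) - 1:
                end = len(group)  # last part takes the remainder
            else:
                end = start + round(frac * len(group))
            parts[i].extend(group[start:end])
            start = end
    return parts

def oversample(rows, label, factor, rng):
    """Repeat rows with the given target value `factor` times in total."""
    extra = [r for r in rows if r[1] == label] * (factor - 1)
    out = rows + extra
    rng.shuffle(out)
    return out

def accuracy(predict, rows):
    """Share of rows for which predict(x) equals the target y."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

rng = random.Random(42)
# Synthetic data: the target is 1 for roughly 20% of rows, with label noise.
rows = []
for _ in range(1000):
    x = rng.random()
    rows.append((x, int(x + rng.gauss(0, 0.1) > 0.8)))

# Design choice (i): the size of each database.
train, test, evaluation = stratified_split(rows, (0.6, 0.2, 0.2), rng)
shares = [sum(y for _, y in part) / len(part) for part in (train, test, evaluation)]
print("share of target=1 per database:", [round(s, 3) for s in shares])

# Design choice (ii): give the rare target value double weight in training.
train_weighted = oversample(train, label=1, factor=2, rng=rng)

def memorizer(x):
    """1-nearest-neighbor on train: memorizes the rows, so it overlearns."""
    return min(train, key=lambda r: abs(r[0] - x))[1]

def simple_rule(x):
    """A deliberately simple model: one fixed threshold."""
    return int(x > 0.8)

# Overlearning check: report the train-test gap, not just the test score.
for name, model in (("1-NN", memorizer), ("rule", simple_rule)):
    train_acc, test_acc = accuracy(model, train), accuracy(model, test)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:.3f}")
```

A large gap for the memorizing model and a small one for the simple rule is the kind of instability signal that point 1 describes; whether a given gap is acceptable remains a business decision.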