14 Lessons Learned from my first Kaggle Competition

Phil Godley
5 min read · Aug 21, 2020
Photo by Nick Fewings on Unsplash

Here are 14 lessons learned from my recent participation in the 2020 Kaggle SIIM-ISIC Melanoma Classification Competition.

You can read about that experience here and review the code on GitHub here.

I hope this will be of help to anyone considering participating in future competitions for the first time.

  1. Research the problem. Before opening up your IDE or writing any notebooks, spend a good few days researching the problem the competition is attempting to solve. Sources include published research papers, Medium articles and the Kaggle discussion boards from previous competitions. That groundwork will be invaluable as you progress through the competition.
  2. Make a plan. Once you have done your research, make a broad plan based on it and work within that plan. Stay open-minded about the problem, but a plan will help you make initial progress much more quickly and, in the final stages of the competition, keep you from getting distracted from what you are trying to achieve.
  3. Have an outlet for when you get bogged down. There will be times when the coding gets tough or you are trying to work out a particular problem. You may get through those challenges quickly or they may turn your brain to jelly and it feels like you will never get through them. Take a break and find a way to clear your head. Fresh air and exercise are excellent in these situations.
  4. Comment your code. Working in Kaggle, Colab or Jupyter notebooks is useful for breaking down your workflow and code. Nevertheless, always comment each block of code which contains a distinct feature or function. You will come back to your code many times and your comments will help you navigate. This is especially true when building large for loops: cross-validation loops can run to well over 100 lines of code, so comment each element (see the commented cross-validation sketch after this list).
  5. Keep a log and record all outputs; some that seem innocuous now may become important later. You will run many experiments and cross-validations. Recording data as you go will be invaluable later in the competition. Make sure you capture your input configuration, for example epochs, batch size or learning rate. Also capture outputs such as loss/cost, your optimisation metric, the standard deviation across your cross-validation folds and your Out-of-Fold (OOF) results. Only when you have plenty of data will you start to see the patterns that help you optimise and stabilise your models. There are also third-party tools for this, including https://www.wandb.com and https://neptune.ai; a simple CSV-based logging sketch follows this list.
  6. Experiment fast on smaller models. Run many experiments to start with, including those which may not immediately seem promising. This means working on smaller networks with fewer features to start with if you can. Once you have run many experiments, then start increasing model depth and width, data volumes and feature counts. If you are working with images, start with smaller images. This is especially true if you are working with new code and are still debugging; it will save you hours.
  7. Don’t be tempted to go for immediate success. Being patient is important. Build your models and workflows slowly in line with your plan. Only use small models and smaller datasets to begin with if you can. Don’t be disheartened if others are flying up the public leaderboard. This discipline will give you large benefits towards the final stages of the competition. Slow and steady wins the race here.
  8. Get a teammate. If you know someone who wants to enter a competition with you as a team, great; if not, the discussion forums on Kaggle can be a good place to find one. Some teams will divide up the workload and share their plans; in others, your teammate will simply be someone you can share ideas and sense-check with. Either way, being able to talk through what’s in your head and show someone your code will help you organise those thoughts and proceed in a more organised way.
  9. Keep reading, researching and contributing. During a competition, problems and nuances will come to light, for example through your data exploration, how the data is distributed, or how to build workflows that make the most of GPU or TPU training. The competition discussion boards will be a goldmine of information. They will help you along the way, as will external sources such as Medium, research papers and Stack Overflow. Don’t be afraid to put your code aside for a day and spend some time researching. Also, encourage others and contribute yourself. You don’t have to give away your own code, but sharing ideas and encouraging others is a great way to participate.
  10. Use third-party code, but fully understand it, rewrite it so you understand it, and test it, because it might be wrong. There will be plenty of code out there for you to use. It might originate in a library’s documentation, e.g. TensorFlow, Keras, PyTorch or Scikit-Learn, where there are plenty of examples, or you may benefit from starter and baseline code provided by other competitors. Use this code, but only once you understand it, and the best way to understand it is to rewrite it yourself. You will often come across code that simply doesn’t work. Your style will be different from others’: in Python, some may use list comprehensions that are difficult to follow, others colons or lambda functions. If you rewrite the code yourself, you will understand it much better, both now and later when you review it again. You may also learn neat new coding tricks.
  11. Visualise where possible. If you can, create charts to visualise as much as possible, whether that is data exploration, learning rate schedules or training/cross-validation curves. Keep these in your notebooks; they will help you understand how the models are working (see the plotting sketch after this list).
  12. Don’t chase the leaderboard. Resist the temptation! Every competition has a big shake-up at the end, and ONLY those who have been disciplined with their plan and cross-validation models will prosper!
  13. Be very wary of overfitting. A statement of the obvious in machine learning, but the temptation is always there. Resist it! Always have this in the back of your mind.
  14. Trust your Cross-Validation (CV). The most important rule for ensuring a good result on the private leaderboard: trust your cross-validation scores. You should work hardest to optimise and stabilise these. Tuning against the Kaggle public leaderboard, or blindly using all the training data, will almost certainly lead to overfitting and you will not optimise your models. A commented sketch of a cross-validation loop with OOF scoring follows this list.
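
To make points 4 and 14 concrete, here is a minimal, commented sketch of a cross-validation loop that produces Out-of-Fold (OOF) predictions. It is not the competition code: it uses scikit-learn’s StratifiedKFold with a placeholder LogisticRegression model and assumes X and y are NumPy arrays. A real training loop will be far longer, but the shape is the same: a fresh model per fold, fit on the training split, predict on the held-out fold, then score each fold and the overall OOF predictions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def run_cv(X, y, n_folds=5, seed=42):
    # Stratified folds preserve the class balance in every split,
    # which matters when the positive class is rare.
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    oof = np.zeros(len(y))   # one out-of-fold prediction per training row
    fold_scores = []

    for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
        # --- build a fresh model for every fold ---
        model = LogisticRegression(max_iter=1000)

        # --- fit on the training split only ---
        model.fit(X[train_idx], y[train_idx])

        # --- predict on the held-out fold and store in the OOF array ---
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

        # --- score the fold and keep it for the log ---
        score = roc_auc_score(y[valid_idx], oof[valid_idx])
        fold_scores.append(score)
        print(f"fold {fold}: AUC = {score:.4f}")

    # The OOF score (and the fold standard deviation) is the number to trust,
    # not the public leaderboard.
    print(f"OOF AUC = {roc_auc_score(y, oof):.4f}, fold std = {np.std(fold_scores):.4f}")
    return oof, fold_scores
```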
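
For point 5, the log can be as simple as appending one row per run to a CSV file. This is just one possible approach; the file name, field names and numbers below are illustrative only, and wandb or Neptune will do the same job with less plumbing.

```python
import csv
import os
from datetime import datetime

LOG_FILE = "experiments.csv"  # hypothetical file name

def log_experiment(config, results):
    # One row per run: a timestamp, then the input configuration, then the outputs.
    row = {"timestamp": datetime.now().isoformat(), **config, **results}
    write_header = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(row)

# Example call after a cross-validation run (values are made up).
log_experiment(
    config={"epochs": 12, "batch_size": 32, "learning_rate": 3e-4, "image_size": 384},
    results={"oof_auc": 0.91, "fold_std": 0.01, "val_loss": 0.21},
)
```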
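
And for point 11, plotting a learning rate schedule before you train takes only a few lines and catches mistakes early. The warm-up-plus-decay schedule below is a made-up example; substitute whatever schedule you actually use.

```python
import matplotlib.pyplot as plt

def lr_schedule(epoch, lr_start=1e-5, lr_max=1e-3, lr_min=1e-6,
                warmup_epochs=3, decay=0.8):
    # Linear warm-up for the first few epochs, then exponential decay.
    if epoch < warmup_epochs:
        return lr_start + (lr_max - lr_start) * epoch / warmup_epochs
    return max(lr_min, lr_max * decay ** (epoch - warmup_epochs))

epochs = range(25)
plt.plot(epochs, [lr_schedule(e) for e in epochs], marker="o")
plt.xlabel("epoch")
plt.ylabel("learning rate")
plt.title("Learning rate schedule")
plt.show()
```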


Phil Godley

An investor, board member and advisor to growth companies.