The Story Continues…

Nicholas
5 min read · Jan 20, 2021

The moment we’ve all been waiting for…Finally…

I’m sure you’ve been following every episode, but if you haven’t, we last left off just as we were preparing to reveal the results of our submission. We don’t have a montage like those fancy-schmancy T.V. shows, so in case you did in fact miss it, just click here or here. Luckily our tech team (me) has been hard at work and has even managed to provide us with a link to our last episode.

Down here

It’s amazing what technology can do these days. Let’s get back to our model submission. So, what did we get? Well son, it’s important to remember two things: you did your best, and you played the game honestly. That is what matters most. Before we reveal our score, I want to go over a few things. First, I want to go over a few lessons learned. There’s no point in failing (we didn’t fail) if you don’t learn from your mistakes, right? So, we will go over some things we could’ve done better by looking at other submissions with higher scores and different approaches. Second, I want to go over one of the problems that I personally had trouble with while completing this project. I know, I know, you’re ready to see the score, and you will…

One of the really cool things about Kaggle is that it allows us to view other submissions in this competition (and many other competitions). This lets us see areas where we could’ve improved and alternate routes to problem solving. If you remember, in the Titanic data there were two separate labels for siblings/spouses and parents/children. I saw a submission that applied feature engineering and combined them into a single label for the entire family. Small? Maybe, but mighty when it comes to value-added data for our model. Another Kaggler divided the ages into groups. This was really helpful because binning the ages gives our model a cleaner categorical feature to work with instead of a noisy continuous one. Here, take a look…
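Here’s a minimal sketch of those two ideas in pandas. The columns SibSp, Parch, and Age come from the Titanic dataset itself; the name FamilySize and the particular bin edges are my own picks for illustration, not necessarily what those Kagglers used.

```python
import pandas as pd

df = pd.read_csv("train.csv")  # the standard Kaggle Titanic training file

# Combine the sibling/spouse and parent/child counts into one family
# feature (+1 to count the passenger themselves).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Bucket the continuous ages into a small categorical feature.
# These bin edges are illustrative assumptions, not the Kaggler's.
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 12, 18, 35, 60, 120],
    labels=["child", "teen", "young_adult", "adult", "senior"],
)

print(df[["SibSp", "Parch", "FamilySize", "Age", "AgeGroup"]].head())
```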

One of the biggest problems that I ran into while building this project was handling missing data. You have a few options to choose from when it comes to handling numerical data that spans a range of values, for example, age. This article on Towards Data Science was very helpful for learning how to find and handle missing values. She also replaced missing age values with the mean age for each group, which fills in the gaps more accurately than a single global mean would. Genius! I’m definitely setting this aside for future obstacles. It also sparks an idea in my head: maybe by taking the accompanying siblings and parents into consideration we could find an even more suitable value to apply (there’s a sketch of this group-wise idea after the next paragraph). I find it hard to believe that someone with both parents and siblings onboard would be over the age of 40. This Kaggler received a 0.7918 for their score and, spoiler alert, she scored higher than us, however…

However, she did not score THAT much higher than us. So, we can take a bit more pride in our first submission.
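To make the group-wise imputation idea concrete, here’s a rough sketch in pandas. Grouping by Pclass and Sex is purely my assumption for illustration; the article’s author may have grouped differently, and the SibSp/Parch idea above could be swapped in just as easily.

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Fill each missing Age with the mean age of passengers in the same
# class/sex group, which tends to land closer than the global mean.
df["Age"] = df["Age"].fillna(
    df.groupby(["Pclass", "Sex"])["Age"].transform("mean")
)

# Any ages still missing belong to groups that were entirely NaN.
print(df["Age"].isna().sum())
```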

This Kaggler used a correlation map at the start of his exploratory analysis to find values that were highly correlated. He goes on to say:

A big part of correlation analysis is to investigate the correlations between attributes. If a feature is too similar to another, it becomes redundant to the model and slows down the computing time. Some of the columns are almost perfectly correlated to each other, but this is only true where categorical data of one column has been split into dummy variables. By the nature of that operation, we should expect to see that, and it is not a cause for concern.

This is a very important piece of information. In the case of the Titanic dataset we already know a few of the obvious correlations, because we’re aware that women and children were made a priority during the evacuation, but in the future, when we begin dealing with completely new data, finding these correlations early is important. Redundancy in values can become very taxing on our model, and that problem only increases as the size of the data grows. Did we score higher than this guy? Nope! He beat us out with a 0.77, but THAT wasn’t much higher than our first score.

Heat Map to check for correlations
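For anyone wanting to reproduce that kind of map, here’s a minimal sketch using pandas and seaborn. The column selection is illustrative, and one-hot encoding Sex is my own addition so the dummy-variable correlations the quote mentions actually show up in the plot.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")

# Keep a handful of numeric columns and one-hot encode Sex, so the
# near-perfect dummy-variable correlations appear in the map.
numeric = pd.get_dummies(
    df[["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare", "Sex"]],
    columns=["Sex"],
    dtype=int,
)

plt.figure(figsize=(8, 6))
sns.heatmap(numeric.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Heat map to check for correlations")
plt.tight_layout()
plt.show()
```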

Well, there it is. We’ve covered some of the lessons learned like finding null values and what to do with them. We’ve taken a look at other submissions and analyzed key points from their exploratory analysis. What’s next you ask? Well, it’s the moment we’ve all been waiting for. What did our first Kaggle submission receive?!?!

Yes, that’s right, a 0.737 (rounded for confidence). Coincidentally, 0.73 is one of my favorite numbers. It rolls off the tongue so smoothly, zero point seven three, like water. OK, OK, it’s not that high, but I’ve never beaten myself up for a C before and I won’t start today. We should be proud. This is only the beginning, which means if this is where we’re starting, then… the sky’s the limit. Personally, I’m quite content with this score. I’ve learned a lot during this project and can’t wait to start on the next submission. So, as usual…


Nicholas

Budding Data Scientist and aspiring Neural Engineer