Hello Titanic

Nicholas
8 min read · Dec 27, 2020

Alright, it’s not Hello World but it’s close enough. If you’ve ever dabbled in the world of programming, you know the ritual performed at the start of every new adventure: our beloved Hello World. This program is like boot camp for the military. In order to be initiated into the world of hackers, developers, geeks, and gurus, we must first run Hello World. It helps ensure your compiler and all the other bells and whistles work properly.

Sho Nuff runs Hello World, why can’t you?

Also, a bit of avuncular advice: no matter how good you are at programming, it’s always best practice to run it before you begin to paint your masterpiece of models.

So, what’s Kaggle’s version of “Hello World?” Well, truth be told, there is none, but the Titanic dataset serves as a proper proxy to get your feet wet (don’t laugh). The legendary Titanic competition is perfect to start with and is easy to read. It also gives you a chance to handle some of the most common issues you’ll run into when navigating datasets. The dataset contains information on passengers who embarked on the first/last trip of the Titanic. In this article we are going to go through the 5 steps of a Data Science project: obtain, scrub, explore, model, and interpret, also known as the OSEMN model. We won’t show every single step taken to build this project, just enough to document the progress in our journey. My GitHub will have the entire code. I’ll also leave my scores for transparency. Lastly, before we get started, if you’d like to learn more about these steps, the article by Dr. Cher Han Lau titled ‘5 Steps of a Data Science Project Lifecycle’ is perfect for getting up to speed quickly.

Now, the moment you’ve all been waiting for…

LLLLLETTSS GET READY TO MODEL!!!!!!!

Obtain

Today’s dataset is being brought to you by Kaggle. Kaggle, dope for all of your daily dataset desires. I’ll be entering this competition, of course, and using the data provided. If you’re a bit behind, here’s a great article on Elite Data Science that gives you a good beginner’s guide to Kaggle. So, enter the competition, download the dataset, and we can get started with getting an idea of what we’re working with.

I wrote a short RDQ for getting started with new data. It’s really helpful if you find yourself having trouble with where to start on a new dataset. For the sake of brevity, I’ll just use this as a roadmap for where to go from here.

What’s the DISH?

First I load my data up. Next, I split the data into two sets: one for testing and one for training. Then I get the DISH; a rough sketch follows the list below.

Describe

Info

Shape

Head
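The exact cells live in the notebook, but a minimal sketch of what getting the DISH looks like (assuming the Kaggle files are saved locally as train.csv and test.csv) is:

```python
import pandas as pd

# Load the Kaggle Titanic files (assumed to be saved locally as train.csv / test.csv)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# The DISH: Describe, Info, Shape, Head
print(train.describe())  # summary stats for the numeric columns
train.info()             # dtypes and non-null counts, printed directly
print(train.shape)       # (rows, columns) -> (891, 12) for the training set
print(train.head())      # first five rows
```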

Perfect. The DISH is done. At a glance we can see a few things. The average age and fare are both around 30. There are 891 rows and 12 columns. I can also see that we have some NaN values that will need to be managed before I proceed to exploring. That makes a perfect segue to the next step, scrubbing.

Scrub

Scrubbing is where things start to get fun. Unless you’re capturing the data yourself via APIs and other available avenues, many times you’ll find yourself just downloading it from some central distribution. For us, we will be getting most of our data directly from Kaggle because the name of this journey is ‘The Road to Kaggle Master.’ Occasionally though, we will sway from that path.

I am an ex-military guy, so I liken scrubbing my data to pulling strings from uniforms, removing lint, and wiping dirt from my boots. Unfortunately, not all the data we download will be in pristine, ready-to-model condition. That’s where scrubbing comes in. First, I found out how much of our data is null.
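In pandas that’s basically a one-liner; a sketch, assuming the training frame from the loading step is called train:

```python
# Count the missing values in each column, largest first
print(train.isnull().sum().sort_values(ascending=False))
```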

687 missing values in the Cabin column?! That’s too much to keep around when it’s serving no value. I’ll hold onto it for now to see if I can think of any use for it, but I think it’s safe to say I won’t use it later. The 177 missing values in Age may also be a problem, but if we have a predictable distribution of ages, we can fill the missing values by sampling from that distribution, using each age’s frequency as a weight. Hmmmm??? For now, we will let the ops continue and handle them later in preprocessing. Next up is exploration.
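One hedged sketch of what that distribution-based fill could look like later in preprocessing (sampling from the observed ages is my reading of the idea, not necessarily the exact code in the notebook):

```python
import numpy as np

# Fill missing ages by sampling from the observed ages, so the imputed values
# follow the same distribution (each age weighted by how often it appears)
known_ages = train['Age'].dropna().to_numpy()
missing_mask = train['Age'].isnull()
train.loc[missing_mask, 'Age'] = np.random.choice(known_ages, size=missing_mask.sum())
```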

Exploration

You know what they say about exploration? If you don’t explore, you know no more. That’s not a real thing, but for the sake of my pride and prejudice let’s pretend it is. Normally, I like to start exploring my data by asking some simple questions. What is the average? Did Epstein kill himself? Where is the max in reference to the first standard deviation? These simple questions can help us build a more robust understanding of what the data is saying. Below, I counted the number of males and females onboard to understand the percentage each group occupied. From this beautiful five-line script I eloquently executed, we can see that men made up a significant portion of the population.
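The five-line script itself isn’t embedded here, but a sketch along the same lines (assuming the column is Sex) would be:

```python
# Count passengers by sex and work out each group's share of the population
sex_counts = train['Sex'].value_counts()
sex_share = train['Sex'].value_counts(normalize=True) * 100
print(sex_counts)
print(sex_share.round(1))  # roughly 65% male / 35% female in the training set
```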

And with the majority of the onboard population being men, we can clearly see why they believed the Titanic would never go down.

Now that we know how the population was split, I’m curious about the average fare each group paid.
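Again, only a sketch; the actual grouping in the notebook may differ:

```python
# Average fare paid by each group
print(train.groupby('Sex')['Fare'].mean())
```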

I know what I said earlier,

“We won’t show every single step taken to build this project”- Nick

And we won’t; however, there are a few cool things I wanted to share with you. Check out the chart below. I split the population into seven groups based on age. Next, I split the population once again into two more groups: those who survived and those who did not.
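A hedged sketch of how such a chart could be built (the seven equal-width bins below are a pandas default, not necessarily the exact cut points behind the original chart):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Cut ages into seven equal-width bins, then count survivors vs. non-survivors per bin
train['AgeGroup'] = pd.cut(train['Age'], bins=7)
age_survival = (train.groupby(['AgeGroup', 'Survived'], observed=False)
                     .size().unstack(fill_value=0))
age_survival.plot(kind='bar', figsize=(10, 5))
plt.ylabel('Passengers')
plt.tight_layout()
plt.show()
```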


I also made another column that consists of each person’s title, grouped by the prefix in their name. Lastly, I found the mean Fare paid by each title.
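The exact extraction is in the repo, but a common pattern for pulling a title out of the Name column looks roughly like this:

```python
# Pull the title (Mr, Mrs, Miss, ...) out of each Name, then average the Fare per title
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(train.groupby('Title')['Fare'].mean().sort_values(ascending=False))
```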

We can see from the data that the Countess paid the most for her ticket. Damn Professor Oglevee.

Countess Vaughn from The Parkers, remember?

So, let’s pretend we’ve done a great job at scrubbing and exploring the data. We’ve gotten rid of some NaN values. We’ve made some pretty graphs and maybe even a couple of animations. Who knows? The only way to find out is to go to GitHub and fork the code.

Out of the 5 steps in the OSEMN model we’ve completed O (Obtain), S (Scrub), and E (Explore). Next is modeling. Now, for our M (Model) we’re going to be using the XGBoost algorithm. Here’s a quick overview of the XGBoost algorithm from an article written on Towards Data Science.

“XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) … A wide range of applications: Can be used to solve regression, classification, ranking, and user-defined prediction problems.”

And coincidentally we just happen to have some of those “user-defined prediction problems” they mentioned…

However, we must first do some preprocessing, and for that we have our trusty, handy One Hot Encoder provided by sklearn. If you’re wondering what One Hot Encoding is, it’s a simple way of turning each categorical attribute into a set of binary columns, encoding the data in a format that our model can process. If you’d like to learn more about One Hot Encoding, Machine Learning Mastery offers an in-depth look into the process.

…Here’s the code though…
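The embedded code doesn’t survive here, so this is a sketch of the same idea; the choice of columns is an assumption:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical features (the column choices here are an assumption)
categorical_cols = ['Sex', 'Embarked', 'Pclass']
numeric_cols = ['Age', 'Fare', 'SibSp', 'Parch']

encoder = OneHotEncoder(handle_unknown='ignore')
categories = train[categorical_cols].fillna('missing').astype(str)
encoded = encoder.fit_transform(categories).toarray()
encoded_df = pd.DataFrame(encoded,
                          columns=encoder.get_feature_names_out(categorical_cols),
                          index=train.index)

# Final feature matrix and target
X = pd.concat([train[numeric_cols], encoded_df], axis=1)
y = train['Survived']
```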

And here’s the model…
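Same caveat for the model: what follows is a minimal stand-in, with placeholder hyperparameters rather than the tuned values:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Hold out a slice of the training data to sanity-check accuracy locally
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters below are placeholders, not the tuned values from the article
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print('Validation accuracy:', model.score(X_val, y_val))
```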

Copy the code and build your own model. Change the parameters, submit your predictions to Kaggle, and see if you can score higher than me. If you happen to score higher, leave me a comment and let me know your prediction accuracy. However, if you cannot score higher, then come join me on our journey to Kaggle Grandmaster (Master) and we’ll climb that mountain together, though I will have to destroy you at the top. Below you can see that with our best parameters we’ve managed to produce a score of 0.82.
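How you land on “best parameters” is up to you; one hedged option is a small grid search (the grid values below are placeholders, not the ones behind the 0.82):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Try a small grid of XGBoost settings and keep the best cross-validated accuracy
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 4, 6],
    'learning_rate': [0.05, 0.1, 0.2],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 2))
```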

However, this isn’t the real test. The real test is the Kaggle submission. What kind of score will we receive for our submission? Well, you’ll have to wait until the next “Road to Kaggle Grandmaster.” No, but seriously you really are going to have to wait and…

Road to Kaggle Grandmaster debut poster

