Given the following Dataset:
https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Create a Jupyter Notebook with the following items
8 points -> An Exploratory Data Analysis, focusing especially on answering the following business question: What factors are related to passengers having a higher or lower probability of survival?
2 points -> Train a Logistic Regression algorithm to properly classify passengers between those who survived and those who did not (the 'Survived' variable).
The Logistic Regression part is not taken into account for pass/fail, but it is required if you want to score higher than an 8.
It is important to comment in detail in Markdown cells what is being done in each step and why, as well as your opinions and reasoning about the results obtained.
Imagine you are in the year 1912, aboard the largest and most luxurious ship ever built, sailing to America. The RMS Titanic, an engineering marvel of the era, makes its maiden voyage from the port of Southampton to New York, carrying more than 2,200 passengers and crew.
But on the dark night of April 14, the ship strikes an iceberg and begins to sink. Within hours, more than 1,500 people lose their lives in one of the worst maritime disasters in history.
More than a century later, the sinking of the Titanic continues to inspire, fascinate and move the world.
Could the disaster have been avoided? What factors influenced who survived and who did not? These are some of the questions that scientists and researchers are trying to answer.
The dataset we will be working with contains detailed information about the passengers aboard the Titanic: their age, gender, socioeconomic class and whether or not they survived the sinking of the ocean liner. By analyzing this data, we will try to gain a unique insight into what happened that fateful night and learn more about one of the most tragic and poignant events in maritime history.
My goal is to use advanced data analysis and machine learning techniques learned during my training to explore this dataset and uncover hidden patterns and relationships. I would like to be able to answer questions such as those posed above: which factors influenced who survived, and who did not?
By the end of the analysis, I hope that we will have a deeper understanding of the sinking of the RMS Titanic and have developed valuable skills in data analysis and predictive modeling.
Join me on this exciting journey through time, history and my development as a data scientist ;)
In this workbook, we will work with our DataSet using some of the most popular and powerful libraries for data analysis. These libraries are NumPy, Pandas, Scikit-learn, Matplotlib and Seaborn. Each of them provides us with unique tools and functionalities to manipulate, analyze and visualize our data.
In the following sections, we will use these libraries to load, clean, analyze and visualize our dataset and achieve our goals.
import numpy as np # Numerical arrays and linear algebra.
import pandas as pd # Data loading and processing.
import matplotlib.pyplot as plt # Plotting; graphics are displayed inline in Jupyter.
%matplotlib inline
from matplotlib.colors import ListedColormap # Create custom color maps.
import seaborn as sns # Seaborn, a statistical plotting layer on top of matplotlib.
from sklearn.impute import KNNImputer # Impute missing values using k nearest neighbors.
from sklearn.model_selection import train_test_split # Split the dataset in two to train and evaluate the predictive model.
from sklearn.linear_model import LogisticRegression # Logistic regression model.
from sklearn.preprocessing import StandardScaler # Standardize features by removing the mean and scaling to unit variance.
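The last three scikit-learn imports are not used during the EDA itself; they belong to the modeling section later on. As a minimal sketch of how they typically chain together (using a hypothetical feature matrix X and target y, which we will only build after the EDA):

# Minimal sketch with a hypothetical X (features) and y (target), built later in the notebook:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # hold out 20% for evaluation.
# scaler = StandardScaler().fit(X_train) # fit the scaler on the training split only, to avoid leakage.
# model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train) # train on standardized features.
# print(model.score(scaler.transform(X_test), y_test)) # accuracy on the held-out split.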
The time has come, let's get started!
I have just conducted a brief review of the columns comprising the dataset.
First and foremost, we need to access the data, which we will store in a variable for further processing.
Next, we will load the data into a variable named rms_titanic. Personally, I prefer using the passenger number (PassengerId) as the index.
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv' # We store the CSV's URL in a variable.
rms_titanic = pd.read_csv(url, index_col='PassengerId') # we load the dataset into rms_titanic, using the PassengerId column as the index.
rms_titanic.info() # General dataset information.
<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Name      891 non-null    object
 3   Sex       891 non-null    object
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64
 6   Parch     891 non-null    int64
 7   Ticket    891 non-null    object
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object
 10  Embarked  889 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
The DataFrame has 891 rows and 11 columns in total.
Most of the columns contain 891 values, indicating that there is no missing data in them.
However, the Age, Cabin and Embarked columns contain null values (we will deal with them later).
The series contain different types of data: integers (int64), floating-point numbers (float64) and text (object).
The memory size occupied by the DataFrame is approximately 83.5 KB.
rms_titanic # Dataset overview.
# I prefer this display over .head() or .tail() because calling the DataFrame directly shows the first and last rows together with the shape in a single view.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 11 columns
The data set is composed of the following columns: Survived (0 = did not survive, 1 = survived), Pclass (ticket class: 1st, 2nd or 3rd), Name, Sex, Age (in years), SibSp (number of siblings/spouses aboard), Parch (number of parents/children aboard), Ticket (ticket number), Fare (ticket price), Cabin (cabin number) and Embarked (port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton).
In this section, our focus will be on thorough preparation and cleaning of the dataset to ensure its suitability for subsequent analysis.
We will perform a thorough analysis to identify potential duplicate instances in our dataset and determine the appropriate approach to address them, which in practice means keeping a single instance of each record.
We will apply data transformation techniques to ensure that data is presented in the optimal format for analysis. These techniques may include normalizing or standardizing the data, coding categorical variables, or even creating new variables from existing ones.
We will address the issue of missing data in our set. This process could encompass the elimination of instances with missing values, the imputation of values using the mode or median, and the consideration of more advanced approaches such as the use of the K-nearest neighbors (KNN) algorithm, among others, for the estimation of missing values.
In summary, this section will be focused on ensuring that our data set is clean and ready for further analysis.
To make sure there is no duplicate data in our dataset, we will copy the original DataSet to a new variable that will be our working version.
To demonstrate how duplicate removal works, we will first concatenate the dataset with itself, artificially doubling every record, and then run the duplicate check on this working variable: if the original set contained no duplicates, dropping duplicates should bring us back exactly to the original size.
Although we already have prior knowledge that there should be no duplicate data in this DataSet, it is good practice to perform the check systematically to ensure the integrity of our analysis.
titanic = rms_titanic.copy() # working copy of the original dataset.
titanic = pd.concat([rms_titanic, rms_titanic]) # we use pandas' concat() to append the dataset to itself, doubling every row (this overwrites the copy above).
print('rms_titanic: ', rms_titanic.shape) # we use .shape to show the dimensions of each DataFrame.
print('titanic: ', titanic.shape)
rms_titanic:  (891, 11)
titanic:  (1782, 11)
The original data set consists of 891 rows and 11 columns.
After the concatenation operation, a new DataSet was generated with dimensions of 1782 rows × 11 columns, exactly double the original.
Well, let's eliminate the duplicates.
titanic = titanic.drop_duplicates() # using the drop_duplicates() method we remove possible duplicate data.
print('titanic: ', titanic.shape) # we check the size of the dataset.
titanic: (891, 11)
After performing the duplicate removal operation, we check the size of the DataSet again and everything looks correct: (891 rows x 11 columns).
Once we have verified the absence of duplicate data in the data set, it is time to face the challenge of null values. This stage of the process is vitally important, as it involves making decisions about how to address the absence of data within our data set. These decisions can have a significant impact on all subsequent analysis.
First, we will proceed to identify in which columns the null values are found and how many of them exist.
To detect missing values in the dataset, we can use the isnull() method, which identifies all null values in each column. Next, we chain the sum() method to count the total missing values per column.
This will give us an overview of which columns are affected by missing values in the data set.
titanic.isnull().sum()
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
After reviewing the contents of the dataset and assessing the number of missing values, I have made the decision to remove the Cabin column.
Although this column contains potentially valuable information, especially if we had access to the ship's plans, more than 75% of its values are null (687 of 891). Therefore, I believe that we can dispense with this column without it having a significant impact on our analysis.
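As a quick sanity check before dropping the column, a one-liner on our working copy can confirm that share of nulls:

print(round(titanic['Cabin'].isnull().mean() * 100, 1), '% of Cabin values are null') # 687/891 ≈ 77.1%.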
del titanic['Cabin'] # deletes the cabin column.
titanic.isnull().sum()
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Embarked      2
dtype: int64
We will now proceed to address a rather simple adjustment: we will deal with the missing values in the Embarked column, which has only 2 null values.
Since the number of missing values is minimal, we can apply the mode (i.e., the most frequent value) to complete this data. Statistically, it is very likely that these passengers embarked at the same port as most of the other passengers.
most_common_port = titanic['Embarked'].mode()[0] # we use the mode() method to find the most frequent value of the Embarked column.
print('Most common port: ', most_common_port) # print it on the screen.
print('\n','-'*80, '\n') # separator.
titanic['Embarked'] = titanic['Embarked'].fillna(most_common_port) # with the fillna() method we fill the null values of the Embarked column with the mode.
titanic.isnull().sum()
Most common port:  S

--------------------------------------------------------------------------------
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Embarked      0
dtype: int64
The next step would be to deal with the missing values in the Age column. However, before doing so, we will normalize and transform the data in our dataset. For now, we will not modify these missing values; we will reveal later why we have made this decision.
In this part of the process we will focus on giving our dataset a facelift so that it is ready for analysis.
First we will work on changing the format of the data. It is very important that our variables have a common format so that we can work with them.
We will make our dataset breathe harmony!
titanic.head() # Checking.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
As we can see, we have two series that we must standardize in order to operate with them: Sex and Embarked.
In this next step, we are going to transform the Sex series into a binary categorical variable.
change_sex = {'female': 0, 'male': 1} # We create a dictionary with the data to be transformed.
titanic['Sex'] = titanic['Sex'].map(change_sex) # we map the Sex series to transform it into binary data.
titanic.head() # Checking.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
Great, let's proceed!
From my perspective, there might not be a direct relationship between the port of embarkation and the survival rate; however, I would like to verify this. Therefore, we will transform the Embarked column with a simple ordinal mapping to facilitate further processing and to study any potential interactions it might have.
embarked_port = {'C': 1, 'Q': 2, 'S': 3} # dictionary with the ordinal encoding for each port.
titanic['Embarked'] = titanic['Embarked'].map(embarked_port) # we map the Embarked series to its numeric codes.
titanic.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 3 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 3 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 3 |
| 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 3 |
Remember that we previously left the null values in the Age series untreated; well, it's time to tackle them like a pro :O
Once we have successfully transformed the Sex and Embarked series into numeric data, we are ready to deal with these null values.
Among many other possible techniques, the one we are going to use to impute these null values is the k-nearest neighbors (k-NN) method. This method fills in each missing value based on the values of the k rows most similar to it across the other features (its nearest neighbors); scikit-learn's KNNImputer averages the neighbors' values.
Although I found this technique somewhat complex at first, it is a powerful tool that I wanted to integrate to deal with the missing values in our dataset.
x = titanic[['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Age']] # Select the columns used for imputation (all numeric ones; Name and Ticket are excluded).
imputer = KNNImputer(n_neighbors=5) # we create the imputer object, configured to use the 5 nearest neighbors.
x_imputed = imputer.fit_transform(x) # fit_transform() returns an array with the missing values filled in.
titanic.loc[titanic['Age'].isnull(), 'Age'] = x_imputed[:, -1][titanic['Age'].isnull()] # we use .loc to select the rows with null Age and replace them with the imputed values (Age is the last column of x).
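One caveat worth noting: KNNImputer measures distances on the raw feature values, so a wide-ranging feature like Fare can dominate the notion of "nearest". A hedged variant (a sketch, not what was run above) would standardize the features first with the already-imported StandardScaler, which ignores NaNs when fitting, and invert the scaling afterwards:

scaler = StandardScaler() # StandardScaler passes NaNs through, so it can be fitted on data with missing values.
x_scaled = scaler.fit_transform(x) # standardize each column so no single feature dominates the distances.
x_back = scaler.inverse_transform(KNNImputer(n_neighbors=5).fit_transform(x_scaled)) # impute on the scaled data, then undo the scaling.
age_scaled_imputed = x_back[:, -1] # alternative Age imputations; Age is the last column of x. Kept separate so titanic is unchanged.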
titanic.isnull().sum()
Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    0
dtype: int64
titanic.head() # Checking.
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 3 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 3 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 3 |
| 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 3 |
With this we have finished the data preparation and cleaning section.
We are left with two columns that we could continue working with: Name and Ticket... but we are going to continue without them.
In the case of the Name column, we could try to extract additional information, such as noble or professional titles, following the suggestion of my prof (thanks for all your advice, respects ;)). However, I personally do not think this information is relevant and meaningful for the result of our analysis; a quick sketch of how it could be done follows.
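For reference, a minimal sketch of that extraction (not applied to our working dataset): in this dataset the title sits between the comma and the first period of each name, so a regular expression can pull it out.

titles = titanic['Name'].str.extract(r',\s*([^.]+)\.') # capture the text between ', ' and the first '.' (e.g. 'Mr', 'Mrs', 'Rev').
print(titles[0].value_counts().head()) # most frequent titles; kept in a separate variable so titanic is unchanged.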
As for the Ticket column, it seems to contain the serial numbers of the tickets purchased by each passenger, but these do not appear to follow any clear order or to correlate with other variables in our dataset. Therefore, I consider that we can leave both columns as they are; if necessary, we will consider removing them at a later stage.
Our goal is to deepen our understanding of the data and discover patterns and relationships between the different series.
To achieve this, we will use tools such as graphs, tables and statistical measures to summarize and present the information contained in our data set.
Through this analysis, we hope to gain valuable information that will allow us to formulate hypotheses and guide our future analyses.
We start the analysis with the describe() method, which gives us a quick overview of the distribution of statistical values in each column.
titanic.describe() # Summary statistics.
|  | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 0.647587 | 30.061452 | 0.523008 | 0.381594 | 32.204208 | 2.536476 |
| std | 0.486592 | 0.836071 | 0.477990 | 13.644962 | 1.102743 | 0.806057 | 49.693429 | 0.791503 |
| min | 0.000000 | 1.000000 | 0.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 0.000000 | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.910400 | 2.000000 |
| 50% | 0.000000 | 3.000000 | 1.000000 | 29.000000 | 0.000000 | 0.000000 | 14.454200 | 3.000000 |
| 75% | 1.000000 | 3.000000 | 1.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 | 3.000000 |
| max | 1.000000 | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 | 3.000000 |
This summary includes the following statistical measures: the count of non-null values, the mean, the standard deviation (std), the minimum, the 25%/50%/75% percentiles and the maximum of each numeric column.
Broadly speaking, we can begin to extract some relevant data:

- Only about 38.4% of passengers survived (mean of Survived = 0.38).
- The mean ticket class is 2.31, suggesting most passengers traveled in the lower classes.
- The mean of Sex is 0.65, i.e. roughly 64.8% of passengers were male.
- The mean age is about 30 years, ranging from 5 months (0.42) to 80 years.
- Fare is highly dispersed: the mean is 32.20 but the standard deviation is 49.69, with a maximum of 512.33.
Having reviewed the general statistics, we have obtained an overview of the situation of the RMS Titanic prior to her departure from English shores.
We are now ready to begin our analysis and draw the first meaningful conclusions. Using graphs, tables and statistical measures, we will seek to visually understand the data as a whole.
high_contrast = 'Set1' # high-contrast palette name for seaborn/matplotlib.
class_color = ['skyblue', 'lightgreen', 'salmon'] # one color per ticket class.
embarked_color = ['skyblue', 'lightgreen', 'salmon'] # one color per port of embarkation.
survival_color = ['salmon', 'skyblue'] # deceased / survivors.
survival_dictionary = {0: 'salmon', 1: 'skyblue'} # the same survival mapping in dictionary form.
Let's start our analysis of the graphs with the number of passengers per class of ticket purchased.
# Statements
pclass_count = titanic['Pclass'].value_counts().sort_index() # We store a Series with each ticket class and its passenger count, sorted by class.
print(pclass_count)
# Graph
plt.figure(figsize=(11, 6)) # we create a figure and give it dimension.
plt.bar(pclass_count.index, pclass_count, color = class_color) # we create a bar chart indicating the axes and color it.
plt.xlabel('') # we empty the x-axis
plt.xticks([1,2,3], ['1st', '2nd', '3rd'], fontsize=14) # we rename the x-axis ticks by selecting and replacing them.
plt.ylabel('nº of passengers',fontsize=14) # we rename the y-axis and give it size.
plt.title('No. of passengers per class', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # retitle and align in the center, give size and weight (This reminds me a lot of CSS :D).
plt.show() # we plot.
Pclass
1    216
2    184
3    491
Name: count, dtype: int64
It is interesting to note that in the previous statistical summary the mean ticket class was 2.31, which might have suggested that most passengers were in second class and, to a greater extent, third class. This new graph confirms the skew: the vast majority of passengers (491 of 891, more than half) traveled in third class.
This finding provides a more accurate perspective of the class distribution and gives us a better understanding of the socioeconomic composition of passengers on board.
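To put exact figures on that share, we can reuse pclass_count from above:

print((pclass_count / pclass_count.sum() * 100).round(1)) # percentage of passengers per class: third class alone accounts for ~55.1%.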
We will now continue the analysis with data from the Fare series.
print(titanic['Fare'].describe())
plt.figure(figsize=(11, 6)) # We create a figure and give it dimension.
sns.boxplot(x=titanic['Fare'], color='skyblue') # we use seaborn to create a boxplot and use the Fare series on the x-axis, we color the box.
plt.xlabel('') # we empty the x-axis label.
plt.title('ticket prices', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # we retitle and align in the center, give size and weight.
plt.show() # we plot.
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64
It should be noted that the Fare series shows a very high standard deviation (49.69), larger than its own mean, indicating a large spread between the fares paid by the passengers.
In fact, if we review the statistics, the maximum fare paid was £512.33, while the mean was £32.20 and the median only £14.45. As a curious fact, the minimum fare is £0: some passengers were given free boarding.
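A compact way to read those same figures directly (the gap between mean and median signals a right-skewed distribution):

print(titanic['Fare'].agg(['mean', 'median', 'std', 'max'])) # the mean sits far above the median, a sign of right skew.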
Next, we will analyze the age distribution within the passage.
print(titanic['Age'].describe())
plt.figure(figsize=(11, 6)) # We create a figure and give it dimension.
sns.histplot(data=titanic, x='Age', bins=80, color='skyblue') # we create a histogram of the Age series spread over 80 bins (one per year up to the maximum age) and color it.
plt.xlabel('Age of passengers', fontsize=14) # we customize the x-axis label and size it.
plt.ylabel('no. of passengers', fontsize=14) # we customize the y-axis label and size it.
plt.title('No. of passengers per age', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # we retitle and align in the center, give size and weight.
plt.show() # we plot.
count    891.000000
mean      30.061452
std       13.644962
min        0.420000
25%       21.000000
50%       29.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
We observe a notable number of infants aboard, with the count staying fairly constant from 0 to 4 years of age; after that, the number of young passengers gradually decreases. We can then identify a steep upward curve between the ages of 16 and 31, where the distribution peaks, before declining towards the age of 40. From that point on, the number of passengers decreases as age increases, with the exception of a small spike around 47 years.
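If we wanted to quantify those bands rather than eyeball them, a quick bucketing with pd.cut would do; the band edges below are hypothetical, taken from the visual reading above:

age_bands = pd.cut(titanic['Age'], bins=[0, 4, 16, 31, 40, 80]) # band edges from the visual reading of the histogram.
print(age_bands.value_counts().sort_index()) # number of passengers per age band.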
We continue exploring the SibSp and Parch series, which represent the number of family members who accompanied each passenger on the trip.
# Statements
sibsp_count = titanic['SibSp'].value_counts().sort_index() # Series with the number of passengers per SibSp value (siblings/spouses on board).
parch_count = titanic['Parch'].value_counts().sort_index() # we do the same with Parch (parents/children aboard).
width = 0.40 # we choose a bar width; this will let us place each SibSp bar next to its Parch bar.
index = np.arange(len(sibsp_count)) # we create an array of bar positions, one per SibSp category.
print(sibsp_count,'\n', '-'*20, '\n', parch_count, '\n', '-'*20, '\n')
# Graph
fig, ax = plt.subplots(figsize=(11, 6)) # We create a figure and axes and give them dimension.
bar_sibsp = ax.bar(index, sibsp_count, width, label='Siblings / spouses', color='Salmon') # SibSp bars: positions, heights, bar width, legend label and color.
bar_parch = ax.bar(index + width, parch_count, width, label='Parents / children', color='Skyblue') # Parch bars at the same positions shifted by one bar width, so they sit beside the SibSp bars instead of stacking.
ax.set_xlabel('No. of family members on board', fontsize=14) # we customize the x-axis label and size it.
ax.set_ylabel('No. of passengers', fontsize=14) # we customize the y-axis label and size it.
ax.set_title('Passengers by number of family members on board', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # we retitle, give size and weight.
ax.set_xticks(index + width / 2) # tick positions centered between each SibSp/Parch bar pair.
ax.set_xticklabels(sibsp_count.index) # tick labels from SibSp's values (note: Parch's categories, 0-6, do not coincide exactly with SibSp's 0-5 and 8).
ax.legend(title='Family on board') # legend title plus the labels defined above.
plt.show() # we plot.
SibSp
0    608
1    209
2     28
3     16
4     18
5      5
8      7
Name: count, dtype: int64
--------------------
Parch
0    678
1    118
2     80
3      5
4      4
5      5
6      1
Name: count, dtype: int64
--------------------
Analyzing the graph, we can see that the majority of passengers, just over 600, were traveling without siblings or spouses, and almost 700 were traveling without parents or children on board. This indicates that a large number of passengers were traveling alone or without immediate family. We can also observe that the family groups are considerably smaller and shrink further as the number of family members increases.
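As an aside, this is exactly the kind of place where the "new variables from existing ones" idea from the preparation section could apply; a hedged sketch, kept in a separate variable rather than added to our working set:

family_size = titanic['SibSp'] + titanic['Parch'] + 1 # total family group size, counting the passenger themself.
print(family_size.value_counts().sort_index().head()) # the largest group is a 'family' of one, i.e. passengers traveling alone.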
We will now proceed to explore the data relating to passenger gender.
# Statements
gender = titanic['Sex'].value_counts() # we create a series counting the number of passengers of each gender.
print(gender)
# Graph
fig, plot = plt.subplots(figsize=(11, 6)) # We create a figure and give it dimension.
plt.bar(gender.index, gender, color=survival_color) # we create a bar chart with the index of the series we have generated and tell it to use the stored values, then we colorize.
plt.xlabel('') # we empty the x-axis label.
plt.xticks([0, 1], ['Women', 'Men'], fontsize=14) # we replace the x-axis ticks with gender labels and size them.
plt.ylabel('No. of passengers', fontsize=14) # we customize the y-axis label and size it.
plt.title('No. of passengers per gender', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # we retitle it and align it in the center, give it size and weight.
plt.show() # we plot.
Sex
1    577
0    314
Name: count, dtype: int64
In this gender graph, we can see that the majority of passengers on the liner were male, representing approximately 64.8%.
Now, let us examine the data concerning the ports of embarkation.
# Statements
embarked_count = titanic['Embarked'].value_counts().sort_index() # We create a series indexed by port of embarkation, counting the passengers for each one.
print(embarked_count)
# Graph
fig, plot = plt.subplots(figsize=(11, 6)) # We create a figure and give it dimension.
plt.bar(embarked_count.index, embarked_count, color=embarked_color) # we create a bar chart using as an index the one of the created series, load its values and color it.
plt.xlabel('') # we empty the x-axis label.
plt.xticks([1, 2, 3], ['Cherbourg', 'Queenstown', 'Southampton'], fontsize=14) # we replace the x-axis ticks with the port names and size them.
plt.ylabel('Nº of Passengers', fontsize=14) # we customize the y-axis label and size it.
plt.title('No. of passengers per port', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # we retitle it, give it size and weight.
plt.show() # we plot.
Embarked
1    168
2     77
3    646
Name: count, dtype: int64
Excellent! In this chart we can see that the port of Southampton had by far the highest number of embarked passengers, which is logical given that the ship began her voyage there and then called at Cherbourg and Queenstown before heading for the open sea.
Finally, we will explore the survival rate during the disaster.
# Statements
survival_count = titanic['Survived'].value_counts() # We create a new series with the counts from the Survived column.
print(survival_count)
# Graph
fig, plot = plt.subplots(figsize=(11, 6)) # We create a figure and give it dimension.
# We create a pie chart with labels for each part, percentage format with a single decimal, angle to start from, colors, font size, spacing between pie parts and shading.
plt.pie(survival_count, labels=['Deceased', 'Survivors'], autopct='%1.1f%%', startangle=90, colors= survival_color , textprops={'fontsize': 14}, explode=[0.1, 0], shadow=True)
plt.title('survival rate', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # we retitle it and align it in the center, give it size and weight .
plt.show()
Survived
0    549
1    342
Name: count, dtype: int64
The graph shows that only 38.4% of the passengers survived the shipwreck (342 of the almost 900 on board), which represents a mortality rate of 61.6%, an extremely high rate.
This concludes the data presentation section. Thanks to this visual representation, we have been able to better understand the data in the dataset and get a snapshot of it as a whole. In the next section, we will continue to deepen our analysis and explore the dataset in more detail.
In this part of the analysis, we will work with bivariate and/or multivariate analyses, which will allow us to explore the relationships and interactions between different variables. This approach will provide valuable information on the underlying patterns and dependencies in the data, helping us to better understand the dynamics and factors that may have influenced survival outcomes.
In the first part, we will use a heat map to explore the correlations between variables.
This visualization technique allows us to represent the correlations in a matrix format with color gradients, which facilitates visual identification and understanding of the relationships between variables.
The heat map provides an intuitive and clear representation of the strength and direction of correlations, allowing us to work with the data more easily and gain valuable insights.
# Statements
corr = titanic.corr(numeric_only=True) # We call the corr() method and tell it to use only the columns with numeric data.
print(corr)
# Graph
fig, ax = plt.subplots(figsize=(17, 8))
sns.heatmap(corr, annot=True) # we create a heat map with the above variable and tell it to use annotations.
plt.title('Correlation matrix of the DataSet', fontdict={'fontsize': 11, 'fontweight': 'bold'})
plt.show()
          Survived    Pclass       Sex       Age     SibSp     Parch      Fare  Embarked
Survived  1.000000 -0.338481 -0.543351 -0.110313 -0.035322  0.081629  0.257307 -0.167675
Pclass   -0.338481  1.000000  0.131900 -0.361498  0.083081  0.018443 -0.549500  0.162098
Sex      -0.543351  0.131900  1.000000  0.139182 -0.114631 -0.245489 -0.182333  0.108262
Age      -0.110313 -0.361498  0.139182  1.000000 -0.215897 -0.183820  0.092513 -0.008357
SibSp    -0.035322  0.083081 -0.114631 -0.215897  1.000000  0.414838  0.159651  0.068230
Parch     0.081629  0.018443 -0.245489 -0.183820  0.414838  1.000000  0.216225  0.039798
Fare      0.257307 -0.549500 -0.182333  0.092513  0.159651  0.216225  1.000000 -0.224719
Embarked -0.167675  0.162098  0.108262 -0.008357  0.068230  0.039798 -0.224719  1.000000
If we look closely at the resulting matrix, we can draw some interesting conclusions:

- Sex has the strongest correlation with Survived (-0.54): since we coded female = 0 and male = 1, the negative sign means women survived at a much higher rate.
- Pclass correlates negatively with Survived (-0.34) and Fare positively (0.26): better classes (lower Pclass numbers) and more expensive tickets go hand in hand with survival.
- Pclass and Fare are themselves strongly related (-0.55), as expected: better classes cost more.
- SibSp and Parch correlate with each other (0.41), reflecting family groups, but neither has a strong direct relationship with survival.
- Embarked shows only a weak relationship with Survived (-0.17).
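A convenient way to rank those relationships directly, reusing the corr matrix computed above:

print(corr['Survived'].drop('Survived').sort_values(key=abs, ascending=False)) # variables ordered by the strength of their correlation with survival.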
The moment of truth has arrived, when we put on our Sherlock Holmes cap and pipe and start our real investigation because... "There is nothing more deceptive than an obvious fact." ;)
The port of embarkation.
What data are likely to be extracted and evaluated?
Our main task is to identify and collect information that will allow us to disentangle possible relationships between the port of embarkation and the survival rates of passengers aboard the Titanic.
Although it may at first appear that the port of embarkation does not directly influence the probability of passenger survival, the potential for providing valuable information should not be underestimated.
Therefore, we will proceed with the following evaluation of the data.
It consists of a basic count of passengers per port, followed by bar charts breaking down survivors and fatalities by port (with a numeric check sketched after the charts).
Let's start with a basic query of the data corresponding to the 'Embarked' series. This initial query will provide us with a preliminary view of the distribution and characteristics of the ports and the number of passengers.
embarked_count
Embarked
1    168
2     77
3    646
Name: count, dtype: int64
We proceed to visualize through bar charts that will allow us to examine both the distribution of passengers who embarked at each port and the breakdown of fatalities and survivors for each port.
These representations will provide us with a clearer understanding of the patterns and differences between the ports of embarkation in terms of passenger numbers and their survival outcomes.
plt.figure(figsize=(16, 6)) # We create a figure and give it a size.
# Graph 1
plt.subplot(1, 2, 1) # we place the first graph in the first position of a one-row, two-column grid.
sns.countplot(x='Embarked', data=titanic, palette=embarked_color) # we use seaborn's countplot() to plot the 'Embarked' counts from the titanic dataset, painted with the stored palette.
plt.xlabel('') # we empty the x-axis label.
plt.xticks([0, 1, 2], ['Cherbourg', 'Queenstown', 'Southampton'], fontsize=14) # we replace the x-axis ticks with the port names and size them.
plt.ylabel('no. of passengers', fontsize=14) # we customize the y-axis label and size it.
plt.title('Relationship between the number of passengers and the port of embarkation', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # We customize the title of the chart and give it size and weight.
# Graph 2
plt.subplot(1, 2, 2) # we place the second graph in the second position of the same one-row, two-column grid.
sns.countplot(x='Embarked', hue='Survived', data=titanic, palette=survival_color) # countplot() again, but split ('hue') by the 'Survived' values, painted with the survival palette.
plt.xlabel('') # we empty the x-axis label.
plt.xticks([0, 1, 2], ['Cherbourg', 'Queenstown', 'Southampton'], fontsize=14) # we replace the x-axis ticks with the port names and size them.
plt.ylabel('no. of passengers', fontsize=14) # we customize the y-axis label and size it.
plt.title('Relationship between the number of survivors and the port of embarkation', fontdict={'fontsize': 11, 'fontweight': 'bold'}) # we customize the title of the chart and give it size and weight.
plt.legend(title='Status', labels=['Deceased', 'Survivors']) # we create a legend with a title and labels for the 'Survived' values.
plt.tight_layout() # we use matplotlib's tight_layout() to avoid overlaps.
plt.show() # we display the figure.
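As a numeric complement to those charts, a quick check reusing our encoded titanic DataFrame (the port_names mapping below is just for readability, inverting the encoding applied earlier):

port_names = {1: 'Cherbourg', 2: 'Queenstown', 3: 'Southampton'} # inverse of the ordinal encoding applied earlier.
survival_by_port = titanic.groupby('Embarked')['Survived'].mean().rename(index=port_names) # the mean of a 0/1 column is the survival rate.
print((survival_by_port * 100).round(1)) # survival rate (%) per port of embarkation.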