Breaking the Ice with Titanic




Kaggle Machine Learning Competition:
Predicting the Survival of Titanic Passengers


In this blog post, I will go through the Titanic dataset. As opposed to the ship's infamous destiny, this project is my way of breaking the ice!
All the credit goes to Abhishek Kumar and his Pluralsight course "Doing Data Science with Python". Thanks to his concise and clear mentorship, I have completed my first Kaggle submission in the Titanic competition. For the project structure, the cookiecutter template has been applied. The data analysis steps are documented in detail in Jupyter notebooks and tracked with the GitHub version control system.

  • Environment
  • Extracting Data
  • Exploring and Processing Data
  • Building and Evaluating Predictive Model

Setting up the Environment


Before diving into the data, I will go through the tools used to set up the environment. For the Python distribution, Anaconda has been used: a specialized distribution that comes with pre-installed and optimized Python packages. Python 3.9, the latest version available at the time, has been used for the project. All analyses are documented and showcased in Jupyter notebooks, in the notebooks folder, as part of the common data science project structure known as the cookiecutter template. All changes and important insights have been tracked with the GitHub version control system.


Extracting Data


The challenge description, evaluation and dataset come from the Kaggle platform. A short description of the Titanic challenge: as is widely known, the sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, the "unsinkable" Titanic sank after colliding with an iceberg, in one of the deadliest maritime disasters. While there was some element of luck involved in surviving, some groups of people were apparently more likely to survive than others. The main question of the challenge is therefore: "What kind of people were more likely to survive?". The course also explained a couple of common data extraction practices, such as extracting from databases (the sqlite3 library), through APIs (the requests library) and via web scraping (the requests and BeautifulSoup libraries). In the first Jupyter notebook, an automated script for extracting the data has been created.
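
As a flavour of what such an extraction script can look like, here is a minimal sketch using the requests library; the URL below is a placeholder, since the real Kaggle endpoints require authentication.

  # hypothetical sketch: download a CSV with requests and save it under data/raw
  import os
  import requests

  def extract_data(url, file_path):
      # stream the response and write it to disk in chunks
      response = requests.get(url, stream=True)
      response.raise_for_status()
      with open(file_path, 'wb') as handle:
          for block in response.iter_content(1024):
              handle.write(block)

  raw_data_path = os.path.join(os.path.pardir, 'data', 'raw')
  extract_data('https://example.com/titanic/train.csv', os.path.join(raw_data_path, 'train.csv'))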


Exploring and Processing Data


This phase of the project covers some basic and advanced exploratory data analysis techniques: basic data structure, summary statistics, distributions, grouping, crosstabs and pivots. It is also where most of the time is invested, on data cleaning, munging and visualization, using Python libraries such as NumPy, pandas and Matplotlib. All these steps are documented in the second Jupyter notebook.

First, we import the Python libraries and, as is common practice, give them aliases. To start the basic exploratory data analysis, we need to import the dataset, more precisely the train and test .csv files.

  # import python libraries
  import pandas as pd
  import numpy as np
  import os

  # set the path of the raw data, in accordance with cookiecutter template
  raw_data_path = os.path.join(os.path.pardir, 'data', 'raw')
  train_file_path = os.path.join(raw_data_path, 'train.csv')
  test_file_path = os.path.join(raw_data_path, 'test.csv')

  # read the data with .read_csv method
  train_df = pd.read_csv(train_file_path, index_col='PassengerId')
  test_df = pd.read_csv(test_file_path, index_col='PassengerId')

Analysing the data structure, we can see some basic information with the .info method, such as the number of entries, the columns, their data types, whether there are missing values, and the memory usage of the dataframe. Since the train and test data come separately, we concatenate the two dataframes into one dataset; because the test data has no survival information, we first fill its Survived column with the placeholder value -888. After this, the data info looks as follows:

  # the test data has no Survived column, so fill it with the placeholder value -888
  test_df['Survived'] = -888

  # concatenate the train and test datasets with pandas .concat method
  df = pd.concat((train_df, test_df), axis=0)

  # get some basic information of the dataframe with .info method
  df.info()
  <class 'pandas.core.frame.DataFrame'>
  Int64Index: 1309 entries, 1 to 1309
  Data columns (total 11 columns):
   #   Column    Non-Null Count  Dtype  
  ---  ------    --------------  -----  
  0   Survived  1309 non-null   int64  
  1   Pclass    1309 non-null   int64  
  2   Name      1309 non-null   object 
  3   Sex       1309 non-null   object 
  4   Age       1046 non-null   float64
  5   SibSp     1309 non-null   int64  
  6   Parch     1309 non-null   int64  
  7   Ticket    1309 non-null   object 
  8   Fare      1308 non-null   float64
  9   Cabin     295 non-null    object 
  10  Embarked  1307 non-null   object 
  dtypes: float64(2), int64(4), object(5)
  memory usage: 122.7+ KB 

So, in total, we have 1309 entries, 11 columns and the data type of each one. Apparently, there is some missing data, which we will analyse and resolve in a proper way. In this step, we simply explore the data using functions such as .head and .tail, slicing techniques, and filtering with .loc, as shown below. In this way we gain a basic overview of the data structure.

  # use .head to get top 5 rows
  df.head()

  # use .tail to get bottom 5 rows
  df.tail()
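
The slicing and label-based filtering with .loc mentioned above can look like this (a small illustrative example):

  # label-based selection with .loc: rows with PassengerId 5 to 10, selected columns
  df.loc[5:10, ['Age', 'Fare', 'Survived']]

  # conditional filtering with .loc: male passengers travelling in 3rd class
  df.loc[(df.Sex == 'male') & (df.Pclass == 3)]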

Now that we have a general idea of the dataset contents, we can explore some specific features in more depth.


BASIC Exploratory Data Analysis (EDA)

Next, we analyse the summary statistics, depending on the type of the feature: numerical or categorical. For numerical features, we look at centrality measures (mean, median) and dispersion measures (range, percentiles, variance, standard deviation). For categorical features, we look at the total and unique counts, per-category counts and proportions, as well as per-category statistics. Using the .describe method with the include='all' argument, we get the summary statistics of all features, as follows:

  # use the .describe method (with include='all') to get summary statistics
  df.describe(include='all')

For numerical features, we can analyse the centrality and dispersion measures, whereas the categorical ones need to be analysed differently. Take, for example, the Pclass feature, which represents the passenger's class. The graph below shows the number of passengers per class (1st, 2nd or 3rd).

  # use .value_counts for analysing categorical features and .plot for visualization
  df.Pclass.value_counts().plot(kind='bar', rot=0, title='Class wise passenger count', color='c');

Clearly, the largest number of passengers travelled in the lowest, 3rd class.


ADVANCED Exploratory Data Analysis (EDA)

By creating a crosstab of the Pclass and Sex features, we draw an additional conclusion: the majority of third-class passengers were male, 493 of them to be exact. The crosstab is a very handy EDA technique, and its extension is the pivot table, where we can also specify an aggregation function to apply to a chosen feature.

  # create crosstab for Sex and Pclass features to get insights, present with bar chart
  pd.crosstab (df.Sex, df.Pclass).plot(kind='bar');

For instance, using the same features, we can create a pivot table whose aggregation function calculates the mean Age.

  # create pivot table by defining 4 arguments (rows, columns, values and function)
  df.pivot_table (index='Sex', columns='Pclass', values='Age', aggfunc='mean')
  # or get the same result by using .groupby, .mean and .unstack methods
  df.groupby (['Sex', 'Pclass']).Age.mean().unstack()

So, male passengers in the 3rd class are, on average, 25.96 years old.

Furthermore, we would like to apply some visualization tools to analyse the distribution of the data. First, we distinguish between a univariate distribution, which we visualize with a histogram and/or a kernel density estimation (KDE) plot, and a bivariate distribution of two features, which we visualize with a scatter plot. When analysing a distribution, we look at important aspects such as skewness and kurtosis, i.e. how the distribution deviates from the normal one, which serves as the reference.

Looking at the histograms of the Age and Fare features, we can see positively skewed distributions, i.e. their means are higher than their medians. For Age, the median is 28, so half of the passengers are younger than 28 and the other half older, while the mean age is 29.88: most ages cluster around the median, but a longer right tail of much older passengers pulls the mean slightly above the median. The same logic applies to Fare, whose distribution is even more positively skewed.
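
A quick way to reproduce these observations is to plot the histograms and compute the skewness directly; a small sketch (bin counts and styling are arbitrary):

  # histograms of Age and Fare, plus the skewness statistics
  import matplotlib.pyplot as plt

  df.Age.plot(kind='hist', bins=20, color='c', title='Histogram : Age');
  plt.show()

  df.Fare.plot(kind='hist', bins=20, color='c', title='Histogram : Fare');
  plt.show()

  # positive skewness values confirm the right (positive) skew
  print (df.Age.skew(), df.Fare.skew())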


DATA MUNGING

Data munging is a very important part of data analysis; it refers to dealing with missing values and outliers. Using the .info method earlier, we already detected features with missing values (Age, Fare and Embarked; Cabin will be analysed in the Feature engineering section). Similarly, while plotting some features, especially Age, we have seen some extreme values.


WORKING WITH MISSING VALUES

As we have realised earlier, in Titanic dataset, there are a couple of features with missing values. In terms of possible solutions, we have these techniques at our disposal:

  • Deletion - only if few observations have a missing-value issue;
  • Imputation - replacing NaNs with plausible data, such as mean, median or mode imputation;
  • Forward/backward fill - used in case of time series or sequential data;
  • Predictive model;
The last two in the list represent advanced techniques for resolving a missing-value issue.


FEATURE: Embarked

From our previous EDA section, we see that there are two NaN values of this feature in the dataset.

  # use .isnull function to extract missing values
  df [df.Embarked.isnull()]

  # use .value_counts function to count the frequency of embarkment
  df.Embarked.value_counts()
  S    914
  C    270
  Q    123
  Name: Embarked, dtype: int64

So, the largest number of passengers embarked at point S.
But which embarkation point had the higher survival count? Both of the passengers with the missing Embarked value survived.

  # use the crosstab technique to see survival counts per embarkation point
  pd.crosstab (df[df.Survived != -888].Survived, df[df.Survived != -888].Embarked)

Here, we filter out the rows where Survived equals -888, the placeholder used for the test data, where we have no survival information. In absolute terms, embarkation point S had the highest survival count; in relative terms, the result is different, as the normalized crosstab below shows.
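
To compare the embarkation points on a relative basis, the crosstab can be normalized so that each column sums to 1 (a small sketch using the same filter):

  # survival proportions per embarkation point (each column sums to 1)
  pd.crosstab (df[df.Survived != -888].Survived, df[df.Survived != -888].Embarked, normalize='columns')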

  # explore Pclass and Fare for each Embarkment point
  df.groupby (['Pclass', 'Embarked']).Fare.median()

Both of these passengers survived, travelled in 1st class and paid a fare of 80, so let's use that information as well.

  Pclass  Embarked
       1         C         76.7292
                 Q         90.0000
                 S         52.0000
       2         C         15.3146
                 Q         12.3500
                 S         15.3750
       3         C          7.8958
                 Q          7.7500
                 S          8.0500
  Name: Fare, dtype: float64

From this point of view, it is most likely that these passengers embarked at point C. Finally, let's fill in the missing values with embarkation point C and check whether any null values remain afterwards.

  # replace the missing values with 'C' by using .fillna method
  df.Embarked.fillna ('C', inplace = True)

  # check if any null values exist with .isnull, after .fillna was applied
  df [df.Embarked.isnull()]

Great! We have solved the first feature with a missing-value issue.
Let's tackle the rest!


FEATURE: Fare

We will use a similar approach for the Fare feature as in the previous section: spot the missing value, use the information we already have, and draw a conclusion.

  # check if any null values exist with .isnull
  df [df.Fare.isnull()]

  # filter the passengers with Pclass = 3 and Embarked = S, and apply the .median function to their Fare values
  median_fare = df.loc [(df.Pclass == 3) & (df.Embarked == 'S'), 'Fare'].median()
  print (median_fare)
  8.5

We will use imputation to deal with the missing Fare value, based on the median fare paid by 3rd-class passengers who embarked at point S. In the following step, we fill the NaN value with 8.5.

  # replace the NaN Fare value with the computed median_fare (3rd-class passengers embarked at point S)
  df.Fare.fillna (median_fare, inplace = True)

A quick check of the null counts with the .info method confirms the fix, so we can safely continue with the Age feature.


FEATURE: Age

  # check if any null values exist with .isnull
  df [df.Age.isnull()]

We have exactly 263 rows with a missing Age value. That is a lot of rows, so we should take a closer look at the best way to deal with the issue: should we impute the mean, the median, or apply some more complex logic? Let's find out!

Previously, we analysed the distribution of the Age feature and saw some very high values, with passengers over 70 and 80 years old. Such extreme values can easily distort the mean. Still, let's check both the mean and the median, just to keep the figures in mind.

  # get the Age mean value
  df.Age.mean()
  29.881137667304014
  # get the Age median values, by Sex category
  df.groupby ('Sex').Age.median()
  Sex
  female    27.0
  male      28.0
  Name: Age, dtype: float64

It is useful to apply some visual tools for further analysis. We will use boxplot technique to discover more details.

  # visualize Age notnull values by Sex category, using boxplot
  df [df.Age.notnull()].boxplot('Age', 'Sex');

The plot shows a similar age distribution for female and male passengers, so we continue to dig further.

  # visualize Age notnull values by Pclass category, using boxplot
  df [df.Age.notnull()].boxplot('Age', 'Pclass');

Now we see some difference in the age levels between passenger classes. At this point, however, we want to extract each passenger's title and see whether it corresponds to the age differences. Let's try to extract this insight from the data!

We will now explore the Name values, extract the title from each name and, using a dictionary, group the titles into a few bins, so we can more easily draw new conclusions. If the age varies noticeably with the passenger's title, we are on the right path.

  # explore the Name feature
  df.Name
  PassengerId
  1                                 Braund, Mr. Owen Harris
  2       Cumings, Mrs. John Bradley (Florence Briggs Th...
  3                                  Heikkinen, Miss. Laina
  4            Futrelle, Mrs. Jacques Heath (Lily May Peel)
  5                                Allen, Mr. William Henry
  ...                        
  1305                                   Spector, Mr. Woolf
  1306                         Oliva y Ocana, Dona. Fermina
  1307                         Saether, Mr. Simon Sivertsen
  1308                                  Ware, Mr. Frederick
  1309                             Peter, Master. Michael J
  Name: Name, Length: 1309, dtype: object
  # create the GetTitle function -> to extract the title info from the name
  def GetTitle (name):
      first_name_with_title = name.split (',')[1]
      title = first_name_with_title.split ('.')[0]
      title = title.strip().lower()
      return title
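
The extracted titles can then be used to impute the missing ages. The sketch below applies GetTitle, optionally collapses the raw titles into broader bins (the grouping shown is illustrative, not necessarily the exact one used in the notebook), and fills each missing Age with the median age of the corresponding title:

  # illustrative grouping of raw titles into broader bins (hypothetical mapping)
  title_map = {'mr': 'Mr', 'mrs': 'Mrs', 'miss': 'Miss', 'master': 'Master',
               'dr': 'Officer', 'rev': 'Officer', 'col': 'Officer', 'major': 'Officer', 'capt': 'Officer',
               'mme': 'Mrs', 'mlle': 'Miss', 'ms': 'Mrs', 'dona': 'Lady', 'lady': 'Lady',
               'the countess': 'Lady', 'sir': 'Sir', 'don': 'Sir', 'jonkheer': 'Sir'}

  # create the Title feature and fill missing Age values with the median age per title
  df['Title'] = df.Name.map(lambda name: GetTitle(name))
  # df['Title'] = df.Title.map(title_map)   # optional: collapse rare titles into the bins above
  title_age_median = df.groupby('Title').Age.transform('median')
  df.Age.fillna(title_age_median, inplace=True)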

WORKING WITH OUTLIERS

One more data quality issue is the presence of outliers, or extreme values. There are also a couple of techniques for dealing with such values. We will take a closer look at the Age and Fare features, since we spotted some high values of these variables earlier.


FEATURE: Fare

If we recall the earlier histogram of the Fare feature, we know there are some extremely high values. Let's pay more attention to those, first by plotting a box plot:
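
A box plot makes the extreme fares stand out; a minimal sketch (styling is arbitrary):

  # box plot of Fare to highlight the extreme values
  df.Fare.plot(kind='box');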


We can see some really high fares, around the value of 500. To be exact, let's extract the top fares:

  # extract the Fare TOP outliers
  df.loc [df.Fare == df.Fare.max()]

The highest fare is exactly 512.3292. Since a fare cannot be negative, we can apply a log transformation to make the distribution less skewed. Let's apply the NumPy log function to the passengers' fares.

  # apply log transformation to reduce the skewness, add 1 for zero fares
  LogFare = np.log (df.Fare + 1.0)
  # plot the LogFare to check the skewness
  LogFare.plot (kind='hist', color='c', bins=20);


The distribution is now less skewed. Furthermore, we will apply binning to categorize the Fare feature into 4 bins, so these outliers are handled more conveniently. In pandas, we use the qcut function for this. The name qcut comes from 'quantile-based discretization function', which means it divides the data into bins of (roughly) equal size.

  # apply the binning technique by using .qcut function
  pd.qcut (df.Fare, 4)
  PassengerId
  1         (-0.001, 7.896]
  2       (31.275, 512.329]
  3         (7.896, 14.454]
  4       (31.275, 512.329]
  5         (7.896, 14.454]
  ...        
  1305      (7.896, 14.454]
  1306    (31.275, 512.329]
  1307      (-0.001, 7.896]
  1308      (7.896, 14.454]
  1309     (14.454, 31.275]
  Name: Fare, Length: 1309, dtype: category
  Categories (4, interval[float64]): [(-0.001, 7.896] < (7.896, 14.454] < (14.454, 31.275] < (31.275, 512.329]]
  # add bins' labels or discretization = turn numerical into categorical feature
  pd.qcut (df.Fare, 4, labels=['very_low', 'low', 'high', 'very_high'])
  PassengerId
  1        very_low
  2       very_high
  3             low
  4       very_high
  5             low
  ...    
  1305          low
  1306    very_high
  1307     very_low
  1308          low
  1309         high
  Name: Fare, Length: 1309, dtype: category
  Categories (4, object): ['very_low' < 'low' < 'high' < 'very_high']
  # plot the labeled bins
  pd.qcut (df.Fare, 4, labels=['very_low', 'low', 'high', 'very_high']).value_counts().plot (kind='bar', color='c', rot=0);


Using the pandas qcut function, we categorized the numerical Fare values into 4 buckets, turning Fare into a categorical feature with 4 bins: 'very_low', 'low', 'high' and 'very_high'. Looking at the bar graph, we see a similar number of observations in each bin. Finally, we create a new variable 'Fare_Bin' and store it in the dataframe for future analyses.

  # store new variable 'Fare_Bin'
  df['Fare_Bin'] = pd.qcut (df.Fare, 4, labels=['very_low', 'low', 'high', 'very_high'])

FEATURE: Age

As previously plotted, the Age feature shows that most passengers are aged around 29 years. On the other hand, some of them are really old, so let's quickly check those who are over 70 years old.

  # extract the passengers over 70 years old
  df.loc [df.Age > 70]


So we have one male passenger who is 80 years old and who also survived the shipwreck. We also see missing values in the Cabin column, which we deal with in the following section.


FEATURE ENGINEERING

Feature engineering is one of the crucial aspects of the data science project cycle: the process of transforming raw data into more representative features in order to build better predictive models. It is a wide area covering many activities, such as transformation (as we did with the Fare feature in the previous section), feature creation and feature selection (based on domain knowledge). In this section, we will create the Deck feature.

As we have seen earlier, the Cabin feature consists mostly of NaN values. By analysing it carefully, we will try to modify some of the values, especially the NaNs, so that we end up with useful Deck information for further analysis.

  # explore Cabin values
  df.Cabin
  PassengerId
  1        NaN
  2        C85
  3        NaN
  4       C123
  5        NaN
  ... 
  1305     NaN
  1306    C105
  1307     NaN
  1308     NaN
  1309     NaN
  Name: Cabin, Length: 1309, dtype: object
  # display the unique values of Cabin feature
  df.Cabin.unique()
  array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
    'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
    'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
    'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
    'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
    'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
    'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
    'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
    'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
    'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
    'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
    'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
    'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
    'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
    'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
    'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
    'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
    'C148', 'B45', 'B36', 'A21', 'D34', 'A9', 'C31', 'B61', 'C53',
    'D43', 'C130', 'C132', 'C55 C57', 'C116', 'F', 'A29', 'C6', 'C28',
    'C51', 'C97', 'D22', 'B10', 'E45', 'E52', 'A11', 'B11', 'C80',
    'C89', 'F E46', 'B26', 'F E57', 'A18', 'E60', 'E39 E41',
    'B52 B54 B56', 'C39', 'B24', 'D40', 'D38', 'C105'], dtype=object)

So, we have a lot of NaN values, but also 'T' and 'D', which look like mistakes, since they are the only values without a number following the deck letter.

  # look at the Cabin = T
  df.loc [df.Cabin == 'T']

  # look at the Cabin = D
  df.loc [df.Cabin == 'D']

In the case of cabin T, we do not want to create a separate deck for a single passenger, so we assume it is a mistake and set it to NaN. Afterwards, all NaN values are converted to deck Z.

  # set the Cabin T to NaN
  df.loc [df.Cabin == 'T', 'Cabin'] = np.NaN
  # extract the first character of Cabin string to the Deck
  def get_deck (cabin):
      return np.where (pd.notnull(cabin), str(cabin)[0].upper(), 'Z')
  df['Deck'] = df['Cabin'].map (lambda x : get_deck(x))

We created the 'get_deck' function to extract the deck information from the Cabin feature. Using the pandas 'notnull' and NumPy 'where' functions, it returns the uppercased first character of the cabin string when the cabin is not null, and 'Z' otherwise. Finally, we apply 'map' to the Cabin column, passing each cabin value to 'get_deck'. Now we can explore the passengers per deck.

  # check the passangers per deck
  df.Deck.value_counts()
  Z    1015
  C      94
  B      65
  D      46
  E      41
  A      22
  F      21
  G       5
  Name: Deck, dtype: int64
  # check the passengers' survival counts per deck
  pd.crosstab (df[df.Survived != -888].Survived, df[df.Survived != -888].Deck)

Most passengers fall into deck Z, where the deck information is unknown. Using the crosstab technique, we check the survival counts of passengers per deck: decks B, D and E have the highest survival rates.

In the second Jupyter notebook, you can see more details of the new features we created, such as IsMother (for women older than 18 years, married and with children), IsMale (for male passengers), AgeState (whether a person is an adult or a child) and FamilySize (the number of family members); a rough sketch of how such features can be built is shown below. They are quite interesting for exploring the differences in survival rate between certain groups of passengers. We also drop the columns we no longer need or that were only used during feature engineering, and reorder the columns so that Survived comes first, since it is the one to be predicted. In the notebook, we save the processed dataset and create a summary script for getting, reading, processing and saving the data. Lastly, in the next section, we will try out some more advanced visualization techniques using the matplotlib library.
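
A minimal sketch of how such features might be constructed; the exact definitions in the notebook may differ (IsMother, in particular, relies on the Title column sketched earlier as a proxy for being married):

  # sketch of a few engineered features (definitions are assumptions, not the notebook's exact ones)
  df['IsMale'] = np.where(df.Sex == 'male', 1, 0)
  df['AgeState'] = np.where(df.Age >= 18, 'Adult', 'Child')
  df['FamilySize'] = df.Parch + df.SibSp + 1          # parents/children + siblings/spouse + the passenger
  df['IsMother'] = np.where((df.Sex == 'female') & (df.Age > 18) &
                            (df.Parch > 0) & (df.Title != 'miss'), 1, 0)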


VISUALIZATION

Using the subplots technique from the Python library matplotlib, which we import under the alias plt, we create more complex visualizations for the Fare and Age features. We adjust some additional settings, turn off the sixth (empty) plot and prevent the graphs from overlapping. The final result, supporting the conclusions we have already drawn in the previous analysis, is the set of plots produced by the code below:

  # import matplotlib and add subplots using ax_arr, an array of axes (instead of individual axes)
  import matplotlib.pyplot as plt
  f, ax_arr = plt.subplots (3, 2, figsize = (14, 7))

  # plot 1
  ax_arr [0,0].hist (df.Fare, bins=20, color='c')
  ax_arr [0,0].set_title ('Histogram : Fare')
  ax_arr [0,0].set_xlabel ('Bins')
  ax_arr [0,0].set_ylabel ('Counts')

  # plot 2
  ax_arr [0,1].hist (df.Age, bins=20, color='c')
  ax_arr [0,1].set_title ('Histogram : Age')
  ax_arr [0,1].set_xlabel ('Bins')
  ax_arr [0,1].set_ylabel ('Counts')

  # plot 3
  ax_arr [1,0].boxplot (df.Fare.values)
  ax_arr [1,0].set_title ('Boxplot : Fare')
  ax_arr [1,0].set_xlabel ('Fare')
  ax_arr [1,0].set_ylabel ('Fare')

  # plot 4
  ax_arr [1,1].boxplot (df.Age.values)
  ax_arr [1,1].set_title ('Boxplot : Age')
  ax_arr [1,1].set_xlabel ('Age')
  ax_arr [1,1].set_ylabel ('Age')
  
  # plot 5
  ax_arr [2,0].scatter (df.Age, df.Fare, color='c', alpha=0.15)
  ax_arr [2,0].set_title ('Scatter Plot : Age vs. Fare')
  ax_arr [2,0].set_xlabel ('Age')
  ax_arr [2,0].set_ylabel ('Fare')

  # cut off the 6th plot
  ax_arr [2,1].axis('off')

  # fix the overlapping
  plt.tight_layout()

  plt.show()



Building and Evaluating Predictive Model


We have already prepared the data for model building: we dropped the unnecessary columns and moved the 'Survived' column to the first position. Now we create the input matrix 'X' and the output variable 'y'. For 'X', we take all columns from 'Age' onwards (which excludes 'Survived'), convert them with the .to_numpy method and cast the values to float with .astype. For 'y', we take the 'Survived' column and flatten it into a one-dimensional array with .ravel().

  # creating input variable X and output variable y for model building
  X = train_df.loc [:, 'Age':].to_numpy().astype('float')
  y = train_df ['Survived'].ravel()

  # print the shape of created variables
  print (X.shape, y.shape)
  (891, 32) (891,)
  # train-test split -> inside the function we define arrays X, y and test size of 20% of actual training data 
  # test data will be used for model evaluation, while the rest of 80% of training data will be used for model training
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split (X, y, test_size=0.2, random_state=0)

  # print the shape of test and train data 
  print (X_train.shape, y_train.shape)
  print (X_test.shape, y_test.shape)
  (712, 32) (712,)
  (179, 32) (179,)

Now we are going to build a baseline model without any machine learning, as is common practice. It simply predicts the majority class and will serve as our reference point for model performance as we build more advanced models with machine learning techniques. In short, our predictive model should perform better than the baseline.

  # import the function DummyClassifier in order to build a baseline classification model
  from sklearn.dummy import DummyClassifier

  # create the model object with the 'most_frequent' strategy; in our case the most frequent class is 0, i.e. not survived
  model_dummy = DummyClassifier (strategy = 'most_frequent', random_state=0)

  # train the model by using .fit function on the model object 
  model_dummy.fit (X_train, y_train)
  DummyClassifier(random_state=0, strategy='most_frequent')
  # use .score method to evaluate the model performance on the test data
  print ('score for baseline model: {0:.2f}'.format(model_dummy.score(X_test, y_test)))
  score for baseline model: 0.61

So, we pass the test data to evaluate the model performance: the model first predicts the output for X_test and then compares the predictions with the actual output y_test. For a classification model, the default score is accuracy, so the result is our baseline accuracy of 61%. The same number can be reproduced explicitly, as sketched below.
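
An equivalent, more explicit check (a small sketch) computes the accuracy from the predictions directly:

  # compute the baseline accuracy manually with accuracy_score
  from sklearn.metrics import accuracy_score
  print ('accuracy for baseline model: {0:.2f}'.format(accuracy_score(y_test, model_dummy.predict(X_test))))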


Building Machine Learning (ML) Model

The next step in building our model is to apply more advanced machine learning techniques. First, we import the necessary libraries. Since we are dealing with a binary problem, i.e. two possible outcomes (1 for survived, 0 for not survived), we will apply logistic regression.

  # import libraries for model performance metrics
  from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

  # import library for logistic regression analysis
  from sklearn.linear_model import LogisticRegression
  
  # create the model object
  model_lr_1 = LogisticRegression (random_state=0, solver='liblinear')

  # train model with .fit function on the model object
  model_lr_1.fit (X_train, y_train)

We have imported the scikit-learn modules for logistic regression, created the model object and trained the model. We will now check its performance and compare it with the baseline model we created previously.


Evaluating ML model

Having created and trained the logistic regression model, we evaluate its performance:

  # accuracy
  print ('accuracy for logistic regression - version 1: {0:.2f}'.format (accuracy_score(y_test, model_lr_1.predict(X_test))))

  # confusion matrix
  print ('confusion matrix for logistic regression - version 1: \n {0}'.format (confusion_matrix(y_test, model_lr_1.predict(X_test))))

  # precision
  print ('precision for logistic regression - version 1: {0:.2f}'.format (precision_score(y_test, model_lr_1.predict(X_test))))

  # recall
  print ('recall for logistic regression - version 1: {0:.2f}'.format (recall_score(y_test, model_lr_1.predict(X_test))))

  accuracy for logistic regression - version 1: 0.83
  confusion matrix for logistic regression - version 1: 
    [[95 15]
    [15 54]]
  precision for logistic regression - version 1: 0.78
  recall for logistic regression - version 1: 0.78
  # extract the model coefficients
  model_lr_1.coef_
  array([[-0.02367032,  0.00459391, -0.45856325,  0.42774923, -0.74536247,
  0.07698167, -0.04810058, -0.32375099,  0.45165998,  0.95470524,
  0.23766785, -0.03128023, -0.36468204,  0.81144645,  0.46934547,
  -0.32759101,  0.09309015,  1.11451687,  0.52592334, -1.56220423,
  1.07954989, -0.1103208 , -0.18735432,  0.12498004,  0.21616361,
  0.2329452 ,  0.37911205,  0.39268022,  0.47166964,  0.08885105,
  0.36215285,  0.59104806]])

Model performance is significantly improved: a score of 83% compared to the baseline model's 61% accuracy. Let's apply some model-tuning techniques and see what happens.


Tuning ML model

While building our model, we set just a couple of parameters: random_state and solver. However, the logistic regression model has many more parameters, such as the regularization parameter, which are commonly tuned as hyperparameters. In the next section, we will try to optimize the model's hyperparameters and its overall score. The most commonly used hyperparameter optimization technique is grid search, which we will apply to our model.

  # create the model to be tuned
  model_lr = LogisticRegression (random_state=0, solver='liblinear')

  # import GridSearch function
  from sklearn.model_selection import GridSearchCV

  # create parameters dictionary
  parameters = {'C':[1.0, 10.0, 50.0, 100.0, 1000.0], 'penalty':['l1','l2']} 

  # create the grid search object
  clf = GridSearchCV (model_lr, param_grid = parameters, cv =3)

  # pass the train data to train different models with different hyperparameter combinations
  clf.fit (X_train, y_train)
  GridSearchCV(cv=3,
  estimator=LogisticRegression(random_state=0, solver='liblinear'),
  param_grid={'C': [1.0, 10.0, 50.0, 100.0, 1000.0],
              'penalty': ['l1', 'l2']})
  # get the best settings
  clf.best_params_
 {'C': 1.0, 'penalty': 'l1'}
  # check the model performance score
  print ('best score : {0:.2f}'.format (clf.best_score_))
 best score : 0.82
  # evaluate the model
  print ('score for logistic regression - version 2 : {0:.2f}'.format (clf.score(X_test, y_test)))
 score for logistic regression - version 2 : 0.82

So, in this part of model tuning, we check whether we can optimize our model even further. The grid search technique relies on cross-validation, trying out different hyperparameter combinations for the logistic regression function; for example, the C value is the regularization parameter. We also set cv to 3, applying 3-fold cross-validation. The goal is to find the best combination of hyperparameter values. Here we can see that the model's performance is not significantly improved by this technique, simply because we are reaching the limit of what the logistic regression model can do. After applying feature standardization and normalization (sketched below), we confirm the same conclusion.
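
For completeness, here is a minimal sketch of how the scaling step might look (the notebook may wire this up differently):

  # normalization: rescale features to the [0, 1] range with MinMaxScaler
  from sklearn.preprocessing import MinMaxScaler, StandardScaler

  scaler = MinMaxScaler()
  X_train_scaled = scaler.fit_transform(X_train)
  X_test_scaled = scaler.transform(X_test)      # fit on training data only, then transform the test data

  # standardization (zero mean, unit variance) works the same way with StandardScaler;
  # the grid search above can then be re-run on the scaled features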


Persistence of ML model and its API

For the final deployment of the model, we used the pickle library and put the .pkl files in the model folder, in accordance with the cookiecutter data science template. We write, or persist, our model to disk so that it is available at any moment for making predictions; the disk plays the role of the server for our future model API. Afterwards, we created the model's API using the Flask and requests libraries, so we can invoke it. Simply put, the job of our API is to produce predictions when input data are delivered: it receives and extracts the data, processes it and feeds it to the model. Finally, the API predicts the survival of the Titanic passengers. A rough sketch of this persistence and serving step is shown below.
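
The file path, route and payload format in this sketch are assumptions for illustration, not the notebook's exact API:

  # persist the tuned model with pickle (hypothetical path inside the models folder)
  import pickle
  with open('models/lr_model.pkl', 'wb') as f:
      pickle.dump(clf, f)

  # minimal Flask API sketch: load the persisted model and return predictions for posted data
  import numpy as np
  from flask import Flask, request, jsonify

  app = Flask(__name__)
  with open('models/lr_model.pkl', 'rb') as f:
      model = pickle.load(f)

  @app.route('/api', methods=['POST'])
  def predict():
      # expects a JSON body like {"data": [[...feature values...]]}
      data = np.array(request.get_json()['data'], dtype=float)
      prediction = model.predict(data)
      return jsonify({'survived': prediction.tolist()})

  if __name__ == '__main__':
      app.run(port=10001)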