1. Import libraries

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

2. Get the data

car_sales = pd.read_csv("car-sales-data.csv")
car_sales.head()
Make Colour Odometer (KM) Doors Price
0 Honda White 35431.0 4.0 15323.0
1 BMW Blue 192714.0 5.0 19943.0
2 Honda White 84714.0 4.0 28343.0
3 Toyota White 154365.0 4.0 13434.0
4 Nissan Blue 181577.0 3.0 14043.0

3. Check for missing values

car_sales.isna().sum()
Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64
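
Raw counts are easier to judge next to the share of rows they represent. The sketch below runs on a small made-up frame (`toy` and its values are hypothetical, not the real car-sales-data.csv) and shows both views, plus the number of rows that have no label:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for car-sales-data.csv
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Odometer (KM)": [35431.0, np.nan, np.nan, 154365.0],
    "Price": [15323.0, 19943.0, np.nan, 13434.0],
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Fraction of rows missing per column (mean of a boolean mask)
missing_share = toy.isna().mean()

# Rows with no label (missing Price)
rows_without_label = int(toy["Price"].isna().sum())

print(missing_counts)
print(missing_share)
print(rows_without_label)
```

On the real data, the same calls would be made on `car_sales` instead of `toy`.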

3.1 What if the data has missing values?

  1. Drop the rows with no labels (here, rows where Price is missing).
  2. Fill them with some value (also known as imputation).

1. Drop the rows with no labels

car_sales.dropna(subset=["Price"],inplace=True)
car_sales.isna().sum()
Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64
car_sales
Make Colour Odometer (KM) Doors Price
0 Honda White 35431.0 4.0 15323.0
1 BMW Blue 192714.0 5.0 19943.0
2 Honda White 84714.0 4.0 28343.0
3 Toyota White 154365.0 4.0 13434.0
4 Nissan Blue 181577.0 3.0 14043.0
... ... ... ... ... ...
995 Toyota Black 35820.0 4.0 32042.0
996 NaN White 155144.0 3.0 5716.0
997 Nissan Blue 66604.0 4.0 31570.0
998 Honda White 215883.0 4.0 4001.0
999 Toyota Blue 248360.0 4.0 12732.0

950 rows × 5 columns

2. Fill them with some value

  • Option 1. Fill missing data with pandas
  • Option 2. Fill missing data with scikit-learn

2.1 Option 1. With pandas
car_sales.isna().sum()
Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64
car_sales.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 950 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           903 non-null    object 
 1   Colour         904 non-null    object 
 2   Odometer (KM)  902 non-null    float64
 3   Doors          903 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 44.5+ KB
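# Assign the result back to the column: calling fillna(..., inplace=True)
# on a selected column is deprecated in recent pandas and may not update car_sales.
car_sales["Make"] = car_sales["Make"].fillna("missing")

car_sales["Colour"] = car_sales["Colour"].fillna("missing")

car_sales["Odometer (KM)"] = car_sales["Odometer (KM)"].fillna(car_sales["Odometer (KM)"].median())

car_sales["Doors"] = car_sales["Doors"].fillna(4)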
car_sales
Make Colour Odometer (KM) Doors Price
0 Honda White 35431.0 4.0 15323.0
1 BMW Blue 192714.0 5.0 19943.0
2 Honda White 84714.0 4.0 28343.0
3 Toyota White 154365.0 4.0 13434.0
4 Nissan Blue 181577.0 3.0 14043.0
... ... ... ... ... ...
995 Toyota Black 35820.0 4.0 32042.0
996 missing White 155144.0 3.0 5716.0
997 Nissan Blue 66604.0 4.0 31570.0
998 Honda White 215883.0 4.0 4001.0
999 Toyota Blue 248360.0 4.0 12732.0

950 rows × 5 columns

car_sales.isna().sum()
Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64
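
The four column-by-column fills can also be written as a single `DataFrame.fillna` call that takes a dictionary mapping column names to fill values. This is a minimal sketch on a hypothetical toy frame (`toy` and `fill_values` are illustrative names, and the numbers are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW"],
    "Colour": ["White", "Blue", None],
    "Odometer (KM)": [10000.0, np.nan, 30000.0],
    "Doors": [4.0, np.nan, 5.0],
})

# One fill value per column, mirroring the choices made above
fill_values = {
    "Make": "missing",
    "Colour": "missing",
    "Odometer (KM)": toy["Odometer (KM)"].median(),
    "Doors": 4,
}

# Returns a new DataFrame; toy itself is left unchanged
toy_filled = toy.fillna(fill_values)
print(toy_filled.isna().sum())
```

Because the result is a new DataFrame, the original stays available for comparison.
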
2.2 Option 2. With scikit-learn
car_sales_missing = pd.read_csv("car-sales-data.csv")
car_sales_missing
Make Colour Odometer (KM) Doors Price
0 Honda White 35431.0 4.0 15323.0
1 BMW Blue 192714.0 5.0 19943.0
2 Honda White 84714.0 4.0 28343.0
3 Toyota White 154365.0 4.0 13434.0
4 Nissan Blue 181577.0 3.0 14043.0
... ... ... ... ... ...
995 Toyota Black 35820.0 4.0 32042.0
996 NaN White 155144.0 3.0 5716.0
997 Nissan Blue 66604.0 4.0 31570.0
998 Honda White 215883.0 4.0 4001.0
999 Toyota Blue 248360.0 4.0 12732.0

1000 rows × 5 columns

car_sales_missing.dropna(subset=["Price"],inplace=True)
car_sales_missing.isna().sum()
Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64
car_sales_missing.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 950 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           903 non-null    object 
 1   Colour         904 non-null    object 
 2   Odometer (KM)  902 non-null    float64
 3   Doors          903 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 44.5+ KB

If you use scikit-learn to fill missing values, first split the data into features (X) and labels (y), so that the imputer is applied to the features only.

X = car_sales_missing.drop("Price",axis=1)
y = car_sales_missing["Price"]
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' & numerical values with median
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="median")

# Define columns
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

filled_X = imputer.fit_transform(X)

filled_X
array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)
# Column order follows the order of the transformers in the ColumnTransformer
car_sales_filled = pd.DataFrame(filled_X,
                                columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled.isna().sum()
Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64
car_sales_filled
Make Colour Doors Odometer (KM)
0 Honda White 4 35431
1 BMW Blue 5 192714
2 Honda White 4 84714
3 Toyota White 4 154365
4 Nissan Blue 3 181577
... ... ... ... ...
945 Toyota Black 4 35820
946 missing White 3 155144
947 Nissan Blue 4 66604
948 Honda White 4 215883
949 Toyota Blue 4 248360

950 rows × 4 columns
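
Since the imputed array has `dtype=object`, a DataFrame built from it typically holds the numeric columns as objects too. A possible follow-up, sketched here on a two-row hypothetical array (`demo_array` and `demo_filled` are illustrative names), is to cast those columns back to numeric types:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the object-dtype array returned by the imputer
demo_array = np.array([["Honda", "White", 4.0, 35431.0],
                       ["BMW", "Blue", 5.0, 192714.0]], dtype=object)

demo_filled = pd.DataFrame(demo_array,
                           columns=["Make", "Colour", "Doors", "Odometer (KM)"])

# Cast the numeric columns back to numeric dtypes
demo_filled["Doors"] = demo_filled["Doors"].astype(int)
demo_filled["Odometer (KM)"] = demo_filled["Odometer (KM)"].astype(float)

print(demo_filled.dtypes)
```

The cast to `int` is only safe once the column has no missing values left, which is the case after imputation.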

Convert categorical data into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")

# One-hot encode the categorical columns of the filled data
# (the whole dataset is transformed here, before the train/test split)
transformed_X = transformer.fit_transform(car_sales_filled)

# Check the transformed and filled X
transformed_X.toarray()
array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

Split data into train and test

from sklearn.model_selection import train_test_split
np.random.seed(2509)
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((760, 15), (190, 15), (760,), (190,))
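
Note that the walkthrough above imputes the full dataset before splitting, so the median used for filling is computed from rows that later end up in the test set. A common alternative, sketched here on hypothetical toy data, is to split first, fit the imputer on the training rows only, and reuse it on the test rows:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Make": ["Honda", np.nan, "BMW", "Toyota", "Nissan",
             "Honda", np.nan, "BMW", "Toyota", "Nissan"],
    "Odometer (KM)": [10.0, np.nan, 30.0, 40.0, 50.0,
                      np.nan, 70.0, 80.0, 90.0, 100.0],
    "Price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

X = toy.drop("Price", axis=1)
y = toy["Price"]

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2509)

imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), ["Make"]),
    ("num_imputer", SimpleImputer(strategy="median"), ["Odometer (KM)"]),
])

# The median is learned from the training rows only
filled_X_train = imputer.fit_transform(X_train)
# The same fitted imputer is applied to the test rows
filled_X_test = imputer.transform(X_test)

print(filled_X_train.shape, filled_X_test.shape)
```

The same pattern applies to the one-hot encoder: `fit_transform` on the training set, then `transform` on the test set.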

Now the data is in the right shape to fit a model.
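
As an illustration of that next step, here is a minimal sketch that fits a regressor on numeric arrays and scores it on the held-out rows. The choice of `RandomForestRegressor` is an assumption (the text does not name a model), and `X_demo` and `y_demo` are randomly generated stand-ins for `transformed_X` and `y`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical numeric data standing in for transformed_X and y
rng = np.random.default_rng(2509)
X_demo = rng.random((50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2509)

# Assumed model choice, used for illustration only
model = RandomForestRegressor(n_estimators=20, random_state=2509)
model.fit(X_train, y_train)

# Predictions and R^2 score on the held-out rows
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)
print(predictions.shape, r2)
```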

Happy coding and have a great time learning how to make machines smarter.