0. Data Preprocessing

0.1 Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

0.2 Importing the dataset

housing = pd.read_csv("housing.csv")
housing
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0        -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1        -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2        -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3        -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4        -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
...          ...       ...                 ...          ...             ...         ...         ...            ...                 ...              ...
20635    -121.09     39.48                25.0       1665.0           374.0       845.0       330.0         1.5603             78100.0           INLAND
20636    -121.21     39.49                18.0        697.0           150.0       356.0       114.0         2.5568             77100.0           INLAND
20637    -121.22     39.43                17.0       2254.0           485.0      1007.0       433.0         1.7000             92300.0           INLAND
20638    -121.32     39.43                18.0       1860.0           409.0       741.0       349.0         1.8672             84700.0           INLAND
20639    -121.24     39.37                16.0       2785.0           616.0      1387.0       530.0         2.3886             89400.0           INLAND

20640 rows × 10 columns

0.3 Check for null values

housing.isna().sum()
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
housing['total_bedrooms'].median()
435.0
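# Impute missing values with the column median (pandas fillna).
# Assigning the result back avoids the chained inplace=True call, which recent pandas versions warn about.
housing['total_bedrooms'] = housing['total_bedrooms'].fillna(housing['total_bedrooms'].median())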
housing.isna().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
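
The median fill above uses pandas directly. As a hedged alternative, the same imputation can be sketched with scikit-learn's SimpleImputer, which stores the learned medians so that they can be reused on new data. The toy frame below is made up for illustration and only stands in for `housing`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (illustrative values only, not the real housing.csv)
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan],
    "households": [126.0, 1138.0, 177.0, 219.0, 259.0],
})

imputer = SimpleImputer(strategy="median")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)  # per-column medians learned during fit
print(filled.isna().sum())  # no missing values should remain
```

One reason to consider this route is that the imputer can be fit on the training split only and then applied to the test split, so that test rows do not influence the fill value.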

0.4 Split into X & y

X = housing.drop("median_house_value", axis=1)
X
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  ocean_proximity
0        -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252         NEAR BAY
1        -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014         NEAR BAY
2        -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574         NEAR BAY
3        -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431         NEAR BAY
4        -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462         NEAR BAY
...          ...       ...                 ...          ...             ...         ...         ...            ...              ...
20635    -121.09     39.48                25.0       1665.0           374.0       845.0       330.0         1.5603           INLAND
20636    -121.21     39.49                18.0        697.0           150.0       356.0       114.0         2.5568           INLAND
20637    -121.22     39.43                17.0       2254.0           485.0      1007.0       433.0         1.7000           INLAND
20638    -121.32     39.43                18.0       1860.0           409.0       741.0       349.0         1.8672           INLAND
20639    -121.24     39.37                16.0       2785.0           616.0      1387.0       530.0         2.3886           INLAND

20640 rows × 9 columns

y = housing["median_house_value"]
y
0        452600.0
1        358500.0
2        352100.0
3        341300.0
4        342200.0
           ...   
20635     78100.0
20636     77100.0
20637     92300.0
20638     84700.0
20639     89400.0
Name: median_house_value, Length: 20640, dtype: float64

0.5 Convert categorical data into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["ocean_proximity"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")

transformed_X = transformer.fit_transform(X)
pd.DataFrame(transformed_X)
         0    1    2    3    4        5      6     7       8       9      10      11      12
0      0.0  0.0  0.0  1.0  0.0  -122.23  37.88  41.0   880.0   129.0   322.0   126.0  8.3252
1      0.0  0.0  0.0  1.0  0.0  -122.22  37.86  21.0  7099.0  1106.0  2401.0  1138.0  8.3014
2      0.0  0.0  0.0  1.0  0.0  -122.24  37.85  52.0  1467.0   190.0   496.0   177.0  7.2574
3      0.0  0.0  0.0  1.0  0.0  -122.25  37.85  52.0  1274.0   235.0   558.0   219.0  5.6431
4      0.0  0.0  0.0  1.0  0.0  -122.25  37.85  52.0  1627.0   280.0   565.0   259.0  3.8462
...    ...  ...  ...  ...  ...      ...    ...   ...     ...     ...     ...     ...     ...
20635  0.0  1.0  0.0  0.0  0.0  -121.09  39.48  25.0  1665.0   374.0   845.0   330.0  1.5603
20636  0.0  1.0  0.0  0.0  0.0  -121.21  39.49  18.0   697.0   150.0   356.0   114.0  2.5568
20637  0.0  1.0  0.0  0.0  0.0  -121.22  39.43  17.0  2254.0   485.0  1007.0   433.0  1.7000
20638  0.0  1.0  0.0  0.0  0.0  -121.32  39.43  18.0  1860.0   409.0   741.0   349.0  1.8672
20639  0.0  1.0  0.0  0.0  0.0  -121.24  39.37  16.0  2785.0   616.0  1387.0   530.0  2.3886

20640 rows × 13 columns

0.6 Split the data into training and test sets

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.25, random_state=2509)
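
With test_size=0.25, a quarter of the 20640 rows (5160) goes to the test set and the remaining 15480 rows are used for training. The sketch below checks the same proportions on a small made-up array and shows that a fixed random_state reproduces the split. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy samples with 2 features each (illustrative data only)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2509
)
print(x_tr.shape, x_te.shape)  # 15 training rows and 5 test rows

# Repeating the call with the same seed gives the same test rows
_, x_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2509
)
print(np.array_equal(x_te, x_te_again))
```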

1. Training the Random Forest Regression model on the training set

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train, y_train)
RandomForestRegressor()

1.1 Score on the test set

model.score(x_test, y_test)
0.8214712477743407
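
For a regressor, score returns the coefficient of determination (R²), so the value above means the model explains roughly 82% of the variance in the test targets. A minimal sketch of the computation follows. The helper name r_squared and the toy numbers are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)     # predictions equal to the targets
baseline = r_squared(y_true, [6.0] * 4) # always predicting the mean (6.0)
print(perfect, baseline)
```

A perfect model scores 1.0, and a model that always predicts the mean of the targets scores 0.0, which gives a scale for reading the 0.82 above.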

2. Predicting on the test set

y_preds = model.predict(x_test)
df = pd.DataFrame(data={"actual values": y_test,
                        "predicted values": y_preds})
df["differences"] = df["predicted values"] - df["actual values"]
df
       actual values  predicted values  differences
14352       225000.0         213052.00    -11948.00
13882        99300.0         109818.00     10518.00
4223        230600.0         250500.00     19900.00
2428         55100.0          58660.00      3560.00
18402       274100.0         330573.04     56473.04
...              ...               ...          ...
11691       217000.0         179878.00    -37122.00
1213         79900.0          95071.00     15171.00
15957       246200.0         232699.01    -13500.99
13982        53300.0          84761.00     31461.00
11212       182900.0         189426.00      6526.00

5160 rows × 3 columns
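
The differences column can be summarised with error metrics in the target's own units. The sketch below computes the mean absolute error (MAE) and root mean squared error (RMSE) for the ten rows displayed above only, so the numbers it prints are not the metrics for the full 5160-row test set.

```python
import math

# The ten differences visible in the output above (predicted - actual)
differences = [-11948.00, 10518.00, 19900.00, 3560.00, 56473.04,
               -37122.00, 15171.00, -13500.99, 31461.00, 6526.00]

# MAE: average size of the error, ignoring its sign
mae = sum(abs(d) for d in differences) / len(differences)

# RMSE: square root of the average squared error, penalises large misses more
rmse = math.sqrt(sum(d * d for d in differences) / len(differences))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

On the full frame, the same quantities could be obtained from df["differences"] with abs().mean() and (df["differences"] ** 2).mean() ** 0.5.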

3. Save a model

import pickle

# Save the existing model to a file; the with block closes the file handle afterwards
with open("random_forest_model.pkl", "wb") as f:
    pickle.dump(model, f)
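
A saved model is only useful if it can be loaded again. The sketch below shows a save and load round trip on a small toy model and checks that the restored model gives the same predictions. The toy data, the temporary file path, and the names toy_model and loaded_model are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small toy regression problem (illustrative data only)
rng = np.random.default_rng(0)
X_toy = rng.random((50, 3))
y_toy = X_toy.sum(axis=1)

toy_model = RandomForestRegressor(n_estimators=10, random_state=0)
toy_model.fit(X_toy, y_toy)

# Write the model to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)
with open(path, "rb") as f:
    loaded_model = pickle.load(f)

# The restored model should predict exactly like the original
same = np.allclose(toy_model.predict(X_toy), loaded_model.predict(X_toy))
print(same)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code, and a pickled model is best loaded with the same scikit-learn version that saved it.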