0. Data Preprocessing

0.1 Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

0.2 Importing the dataset

dataset = pd.read_csv('Salary_Data.csv')
dataset
    YearsExperience    Salary
0               1.1   39343.0
1               1.3   46205.0
2               1.5   37731.0
3               2.0   43525.0
4               2.2   39891.0
5               2.9   56642.0
6               3.0   60150.0
7               3.2   54445.0
8               3.2   64445.0
9               3.7   57189.0
10              3.9   63218.0
11              4.0   55794.0
12              4.0   56957.0
13              4.1   57081.0
14              4.5   61111.0
15              4.9   67938.0
16              5.1   66029.0
17              5.3   83088.0
18              5.9   81363.0
19              6.0   93940.0
20              6.8   91738.0
21              7.1   98273.0
22              7.9  101302.0
23              8.2  113812.0
24              8.7  109431.0
25              9.0  105582.0
26              9.5  116969.0
27              9.6  112635.0
28             10.3  122391.0
29             10.5  121872.0

0.3 Checking for null values

dataset.isna().sum()
YearsExperience    0
Salary             0
dtype: int64
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   YearsExperience  30 non-null     float64
 1   Salary           30 non-null     float64
dtypes: float64(2)
memory usage: 608.0 bytes

0.4 Splitting into features (X) and target (y)

X = dataset.drop('Salary', axis=1)
X
    YearsExperience
0               1.1
1               1.3
2               1.5
3               2.0
4               2.2
5               2.9
6               3.0
7               3.2
8               3.2
9               3.7
10              3.9
11              4.0
12              4.0
13              4.1
14              4.5
15              4.9
16              5.1
17              5.3
18              5.9
19              6.0
20              6.8
21              7.1
22              7.9
23              8.2
24              8.7
25              9.0
26              9.5
27              9.6
28             10.3
29             10.5
y = dataset['Salary']
y
0      39343.0
1      46205.0
2      37731.0
3      43525.0
4      39891.0
5      56642.0
6      60150.0
7      54445.0
8      64445.0
9      57189.0
10     63218.0
11     55794.0
12     56957.0
13     57081.0
14     61111.0
15     67938.0
16     66029.0
17     83088.0
18     81363.0
19     93940.0
20     91738.0
21     98273.0
22    101302.0
23    113812.0
24    109431.0
25    105582.0
26    116969.0
27    112635.0
28    122391.0
29    121872.0
Name: Salary, dtype: float64

0.5 Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
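
As a rough illustration of what a 30% hold-out split does, the sketch below shuffles the row indices with a fixed seed and sets aside 30% of them. The helper name simple_train_test_split is hypothetical, and it relies on Python's random module, so it will not pick the same rows as scikit-learn's train_test_split with random_state 0. Only the resulting sizes (21 training rows and 9 test rows out of 30) correspond to this notebook.

```python
import math
import random


def simple_train_test_split(n_rows, test_size=0.3, seed=0):
    """Return (train_indices, test_indices) for a shuffled hold-out split.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible from run to run.
    random.Random(seed).shuffle(indices)
    # Round the test share up to a whole number of rows; the rest is training data.
    n_test = math.ceil(test_size * n_rows)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_train_test_split(30, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters for the same reason random_state is set in the cell above: without it, each run would train and evaluate on different rows.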

1. Training the Simple Linear Regression model on the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
LinearRegression()
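
With a single feature, ordinary least squares has a closed form: slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The sketch below applies it in plain Python to show what the fit amounts to. The training rows are an assumption: they are the 21 rows of the dataset whose indices do not appear in the test-set table of section 2.1. The function name fit_simple_linear_regression is hypothetical.

```python
# Training rows as (YearsExperience, Salary). Assumed to be the 21 rows whose
# indices are absent from the test-set table in section 2.1.
train = [
    (1.1, 39343.0), (1.3, 46205.0), (2.0, 43525.0), (2.2, 39891.0),
    (2.9, 56642.0), (3.0, 60150.0), (3.2, 54445.0), (3.2, 64445.0),
    (3.7, 57189.0), (4.0, 56957.0), (4.5, 61111.0), (4.9, 67938.0),
    (5.1, 66029.0), (5.9, 81363.0), (6.0, 93940.0), (6.8, 91738.0),
    (7.1, 98273.0), (7.9, 101302.0), (8.2, 113812.0), (9.0, 105582.0),
    (10.5, 121872.0),
]


def fit_simple_linear_regression(points):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_simple_linear_regression(train)
print(f"salary ~ {intercept:.2f} + {slope:.2f} * years")
```

The slope reads as the estimated salary increase per additional year of experience, and the intercept as the estimated salary at zero years.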

1.1 Scoring the model (R^2 on the Test set)

# score() returns the coefficient of determination R^2
regressor.score(X_test, y_test)
0.9740993407213511
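
For a regressor, score() reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand from the actual and predicted test-set values listed in the comparison table of section 2.1. The function name r_squared is hypothetical, and because the predictions are copied at six decimal places, the result agrees with the score above only up to rounding.

```python
# Test-set values copied from the comparison table in section 2.1.
y_test = [37731.0, 122391.0, 57081.0, 63218.0, 116969.0,
          109431.0, 112635.0, 55794.0, 83088.0]
y_pred = [40817.783270, 123188.082589, 65154.462615, 63282.410357,
          115699.873560, 108211.664531, 116635.899689, 64218.436486,
          76386.776158]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after applying the model.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of a baseline that always predicts the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


r2 = r_squared(y_test, y_pred)
print(round(r2, 4))
```

A value near 1 means the model explains most of the variance in the test salaries. A model that always predicted the mean test salary would score 0.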

2. Predicting the Test set results

y_pred = regressor.predict(X_test)
d = {'y_pred': y_pred, 'y_test': y_test}

2.1 Comparing predicted and actual values

pd.DataFrame(d)
           y_pred    y_test
2    40817.783270   37731.0
28  123188.082589  122391.0
13   65154.462615   57081.0
10   63282.410357   63218.0
26  115699.873560  116969.0
24  108211.664531  109431.0
27  116635.899689  112635.0
11   64218.436486   55794.0
17   76386.776158   83088.0
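
The side-by-side table is easier to judge once each row is reduced to a residual (actual minus predicted). The sketch below does that in plain Python with the values from the table and also computes the mean absolute error. That metric is an addition for illustration and is not computed in the original notebook.

```python
# (row index, y_pred, y_test) copied from the comparison table above.
rows = [
    (2, 40817.783270, 37731.0),
    (28, 123188.082589, 122391.0),
    (13, 65154.462615, 57081.0),
    (10, 63282.410357, 63218.0),
    (26, 115699.873560, 116969.0),
    (24, 108211.664531, 109431.0),
    (27, 116635.899689, 112635.0),
    (11, 64218.436486, 55794.0),
    (17, 76386.776158, 83088.0),
]

# Residual = actual - predicted; negative means the model overestimated.
residuals = {idx: actual - pred for idx, pred, actual in rows}

# Mean absolute error: the typical size of a miss, in salary units.
mae = sum(abs(r) for r in residuals.values()) / len(residuals)

# Row with the largest miss in either direction.
worst = max(residuals, key=lambda idx: abs(residuals[idx]))

for idx, res in residuals.items():
    print(f"{idx:>2}  {res:>10.2f}")
print(f"MAE: {mae:.2f}, largest miss at row {worst}")
```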

3. Visualising

3.1 Visualising the Training set results

plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

3.2 Visualising the Test set results

plt.scatter(X_test, y_test, color = 'red')
# The regression line is the same fitted model as in 3.1, so it is still drawn from X_train
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()