README¶
In this project, a neural network is used to predict used-car prices from the distance driven (mileage), model year, and whether the vehicle has been in an accident. The prediction shows a mean absolute percentage error of ~51%, so the model could not learn to predict the price from these three features alone. The model could likely be improved by adding more input features.
Source: https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset/data
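The error metric reported above is the mean absolute percentage error (MAPE). As a minimal, self-contained sketch of how that metric is computed (toy numbers, not values from the dataset):

```python
import torch

# Toy targets and predictions (hypothetical values, not from the dataset)
y_true = torch.tensor([10000.0, 20000.0, 40000.0])
y_pred = torch.tensor([15000.0, 18000.0, 40000.0])

# MAPE: mean of |error| / |target|, expressed as a percentage
mape = torch.mean(torch.abs((y_true - y_pred) / y_true)) * 100
print(f"{mape:.2f}%")  # 20.00%
```

Per-sample errors here are 50%, 10%, and 0%, so the mean is 20%.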
Exploratory Data Analysis¶
Libraries¶
In [89]:
import pandas as pd
import matplotlib.pyplot as plt
import torch
from torch import nn
import seaborn as sns
from sklearn.model_selection import train_test_split
Load data¶
In [90]:
data = pd.read_csv("used_cars.csv")
data.head()
Out[90]:
| | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford | Utility Police Interceptor Base | 2013 | 51,000 mi. | E85 Flex Fuel | 300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa... | 6-Speed A/T | Black | Black | At least 1 accident or damage reported | Yes | $10,300 |
| 1 | Hyundai | Palisade SEL | 2021 | 34,742 mi. | Gasoline | 3.8L V6 24V GDI DOHC | 8-Speed Automatic | Moonlight Cloud | Gray | At least 1 accident or damage reported | Yes | $38,005 |
| 2 | Lexus | RX 350 RX 350 | 2022 | 22,372 mi. | Gasoline | 3.5 Liter DOHC | Automatic | Blue | Black | None reported | NaN | $54,598 |
| 3 | INFINITI | Q50 Hybrid Sport | 2015 | 88,900 mi. | Hybrid | 354.0HP 3.5L V6 Cylinder Engine Gas/Electric H... | 7-Speed A/T | Black | Black | None reported | Yes | $15,500 |
| 4 | Audi | Q3 45 S line Premium Plus | 2021 | 9,835 mi. | Gasoline | 2.0L I4 16V GDI DOHC Turbo | 8-Speed Automatic | Glacier White Metallic | Black | None reported | NaN | $34,999 |
Preprocessing data¶
Model year
In [91]:
# Convert model year to vehicle age (years since the newest car in the data)
model_year = data["model_year"].max() - data["model_year"]
model_year = model_year.astype(float)
model_year = pd.DataFrame(model_year)
milage
In [92]:
# Strip the "mi." suffix and thousands separators, then cast to float
milage = data["milage"]
milage = milage.str.replace("mi.", "", regex=False)
milage = milage.str.replace(",", "", regex=False)
milage = milage.astype(float)
milage = pd.DataFrame(milage)
Accident free
In [93]:
accident_free = data["accident"] == "None reported"
accident_free = accident_free.astype(int)
Price
In [94]:
# Strip "$" and thousands separators, then cast to float
price = data["price"]
price = price.str.replace("$", "", regex=False)
price = price.str.replace(",", "", regex=False)
price = price.astype(float)
price = pd.DataFrame(price)
New dataframe
In [95]:
df = pd.concat([model_year,milage,accident_free,price], axis=1)
df.head()
Out[95]:
| model_year | milage | accident | price | |
|---|---|---|---|---|
| 0 | 11.0 | 51000.0 | 0 | 10300.0 |
| 1 | 3.0 | 34742.0 | 0 | 38005.0 |
| 2 | 2.0 | 22372.0 | 1 | 54598.0 |
| 3 | 9.0 | 88900.0 | 1 | 15500.0 |
| 4 | 3.0 | 9835.0 | 1 | 34999.0 |
Correlation
In [96]:
print(df.corr())
```
            model_year    milage  accident     price
model_year    1.000000  0.617720 -0.188222 -0.199496
milage        0.617720  1.000000 -0.272352 -0.305528
accident     -0.188222 -0.272352  1.000000  0.105135
price        -0.199496 -0.305528  0.105135  1.000000
```
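seaborn is imported at the top of the notebook but never used; a correlation heatmap is one way to visualize this matrix. A minimal sketch on a toy frame (in the notebook itself, `df.corr()` would be passed instead):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({
    "model_year": [11.0, 3.0, 2.0, 9.0],
    "milage": [51000.0, 34742.0, 22372.0, 88900.0],
    "price": [10300.0, 38005.0, 54598.0, 15500.0],
})

corr = toy.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap")
plt.show()
```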
In [97]:
fig = plt.figure(figsize=(12,8))
ax = plt.axes(projection='3d')
z = df["price"]
x = df["model_year"]
y = df["milage"]
ax.scatter(x, y, z,c=z, cmap="viridis", s=50)
ax.set_xlabel("Model Year")
ax.set_ylabel("Milage")
ax.set_zlabel("Price")
ax.set_title("Visualization Data")
plt.tight_layout()
plt.show()
Training Neural Network¶
Preprocessing¶
Copy dataframe
In [98]:
df_model = df.copy()
print(df_model.shape)
(4009, 4)
In [99]:
X = df_model[["model_year", "milage", "accident"]]
y = df_model["price"]
print(X.shape)
print(y.shape)
(4009, 3) (4009,)
Split training and test data
In [100]:
X_train, X_test, y_train, y_test = train_test_split(
X,y,
test_size=0.05,
random_state=42
)
print("Input training data:", X_train.shape, "Input test data:", X_test.shape)
print(y_train.shape, y_test.shape)
Input training data: (3808, 3) Input test data: (201, 3)
(3808,) (201,)
Neural network model¶
Training data to tensor
In [101]:
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)  # .values yields a NumPy array for torch.tensor
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).reshape(-1,1)
print(X_train_tensor.shape)
print(y_train_tensor.shape)
torch.Size([3808, 3]) torch.Size([3808, 1])
Normalize data
In [102]:
X_mean = X_train_tensor.mean(axis=0)
X_std = X_train_tensor.std(axis=0)
X_train_tensor = (X_train_tensor - X_mean) / X_std
In [103]:
y_mean = y_train_tensor.mean()
y_std = y_train_tensor.std()
y_train_tensor = (y_train_tensor - y_mean) / y_std
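A quick sanity check on this standardization (a sketch, not part of the original notebook): after subtracting the mean and dividing by the standard deviation, the tensor should have mean ≈ 0 and std ≈ 1, and the inverse transform should recover the original values:

```python
import torch

# Toy prices standing in for y_train_tensor (hypothetical values)
t = torch.tensor([[10300.0], [38005.0], [54598.0], [15500.0]])
mean, std = t.mean(), t.std()
t_norm = (t - mean) / std

print(t_norm.mean().item())  # ≈ 0
print(t_norm.std().item())   # ≈ 1

# Inverse transform (multiply by std, add mean) recovers the original tensor
t_back = t_norm * std + mean
print(torch.allclose(t_back, t))
```

The same inverse transform is what the validation section below applies to map predictions back to dollars.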
Model
In [104]:
model = nn.Sequential(
nn.Linear(3, 16),
nn.ReLU(),
nn.Linear(16, 8),
nn.ReLU(),
nn.Linear(8, 1)
)
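As a sanity check on the architecture (not in the original notebook), the parameter count of this 3→16→8→1 MLP can be verified by hand: (3·16 + 16) + (16·8 + 8) + (8·1 + 1) = 209. A sketch:

```python
import torch
from torch import nn

# Same architecture as the model above
model = nn.Sequential(
    nn.Linear(3, 16),
    nn.ReLU(),
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

# Sum the element counts of all weight and bias tensors
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 209
```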
Loss function and optimizer
In [105]:
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
Training loop
In [106]:
# Training loop
losses_list = []
for i in range(0,3000):
optimizer.zero_grad()
# model
outputs = model(X_train_tensor)
# loss
loss = loss_fn(outputs, y_train_tensor)
loss.backward()
optimizer.step()
    # record loss for plotting
losses_list.append(loss.item())
if i % 500 == 0:
print(loss)
```
tensor(1.0066, grad_fn=<MseLossBackward0>)
tensor(0.5707, grad_fn=<MseLossBackward0>)
tensor(0.5497, grad_fn=<MseLossBackward0>)
tensor(0.5315, grad_fn=<MseLossBackward0>)
tensor(0.4996, grad_fn=<MseLossBackward0>)
tensor(0.4637, grad_fn=<MseLossBackward0>)
```
Plot loss function
In [107]:
plt.figure(figsize=(8,5))
plt.plot(losses_list)
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.title("Loss vs Iterations")
plt.grid(True)
plt.show()
Validation¶
Test data to tensor
In [108]:
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)  # .values yields a NumPy array for torch.tensor
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).reshape(-1,1)
print(X_test_tensor.shape)
print(y_test_tensor.shape)
torch.Size([201, 3]) torch.Size([201, 1])
Make prediction
In [109]:
prediction = model((X_test_tensor - X_mean) / X_std)
prediction_orig = prediction * y_std + y_mean
percent_error = torch.abs((y_test_tensor - prediction_orig)/y_test_tensor)*100
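One detail worth noting (not addressed in the original notebook): the prediction above still tracks gradients. Wrapping inference in `torch.no_grad()` and switching the model to `eval()` mode avoids building the autograd graph; a minimal sketch with a toy model:

```python
import torch
from torch import nn

# Toy model and inputs (hypothetical, standing in for the trained model and test tensor)
model = nn.Sequential(nn.Linear(3, 1))
x = torch.randn(5, 3)

model.eval()           # inference mode (matters for layers like dropout/batchnorm)
with torch.no_grad():  # disable gradient tracking during prediction
    pred = model(x)

print(pred.requires_grad)  # False
```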
Compute average error
In [110]:
# Mean Absolute Percentage Error (MAPE)
mean_error = torch.mean(percent_error)
# print("Percentage error per sample:", percent_error)
print(f"Mean Absolute Percentage Error: {mean_error:.2f}%")
Mean Absolute Percentage Error: 51.00%