Linear Regression: Housing Prices Prediction

glancing at the future 😎

Regression to the Mean (phr.): no matter how good or bad things get, they always come back toward the middle.

Linear Regression is one of the simplest methods in Supervised Machine Learning, so let's put it into practice by predicting a house's price from its area. The two prerequisites are the Python programming language and patience (you might not get it right the first time :x).

Let's break the recipe into chunks for peace of mind 😌,

  1. Breathe in...Breathe out

  2. Gathering stuff

  3. Cleaning the Data

  4. Normalization

  5. Training 🎉

Breathe In...Breathe out...

Yayy! We're halfway done (of the half 😅).

Gathering the Stuff

With great tools come great achievements. We need a dataset, a framework, and peace of mind to build ML models. Now that we have the last one from the first step, let's go get our dataset. The dataset I suggest trying Linear Regression on is Housing Prices, and the framework is TensorFlow; I'll share the reasons for both in another post. Now, let's clean our dataset.

Cleaning the Data

"What?" and "Why?" I know these two questions are swirling in your mind. First, I'll answer the "What."

What is Cleaning the Data?

Well, data isn't always 100% complete. There are incomplete fields and multiple types of data (viz. numbers, true/false, text). Cleaning the data simply means transforming it into a form that is easier for you to use.

Why is cleaning data necessary?

Null or incomplete fields in your data don't end well for your model's training, and your model won't accept raw text either. So clean it before you use it.

Let's begin by importing the necessary packages

import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

and reading our dataset.

dataset = pd.read_csv('../Datasets/HousingPrices.csv')
dataset.head()

It's time to clean.

dataset.drop(
    columns=['mainroad', 'guestroom', 'basement', 'hotwaterheating',
             'airconditioning', 'prefarea', 'furnishingstatus'],
    inplace=True
)

We don't need the above columns because we don't need them. Just kidding. Those columns hold yes/no and text values, and we're dealing only with numbers here, so off they go.
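Before moving on, it's worth checking whether any of those null/incomplete fields are hiding in what's left. A minimal sketch (the column names assume the Kaggle Housing Prices dataset):

dataset.isnull().sum()       # count the missing values in each column
dataset.dtypes               # confirm every remaining column is numeric
dataset = dataset.dropna()   # drop any rows with missing values, just in case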

Normalizing the Data

You might be like: Why am I even doing this? 🥲 In our data, especially "area," the values are large, ranging from four to five digits. Training on such huge figures costs time, and you might end up with a less accurate model (that's the long explanation for "messing up"). So how about we scale it down so that the areas range from 0 to 1? :0 And that's what we call Normalization. There's a simple formula for doing it:

$$\frac{data - \min(data)}{\max(data) - \min(data)}$$

For a quick sanity check: areas of 3,000, 7,000, and 11,000 would become 0, 0.5, and 1. Before we apply that formula, though, we divide our data into "Train" and "Test" datasets, because we're going to evaluate our model later on data it hasn't seen. The common ratio is 70:30 or 80:20.

train_dataset = dataset.sample(frac=0.7)  # extracts 70% of the data
test_dataset = dataset.drop(train_dataset.index)  # extracts the remaining 30%
# Data Normalization (the min-max formula above, applied column by column)
train_features = (train_dataset - np.min(train_dataset)) / (np.max(train_dataset) - np.min(train_dataset))
test_features = (test_dataset - np.min(test_dataset)) / (np.max(test_dataset) - np.min(test_dataset))
# .pop removes the price column from the dataframe and returns it.
# We need labels because this is Supervised Machine Learning. Read my first post
# "A Glimpse into Machine Learning" to know what that is!
train_labels = train_features.pop('price')
test_labels = test_features.pop('price')

The following is another way to normalize the data. This method subtracts the mean from the data and divides by the standard deviation (the square root of the variance):

$$\frac{data - \operatorname{mean}(data)}{\sqrt{\operatorname{var}(data)}}$$

area = np.array(train_features['area'])
# the layer learns the mean and variance of 'area' when we call adapt()
area_normalizer = tf.keras.layers.Normalization(input_shape=[1,], axis=None)
area_normalizer.adapt(area)

For our dataset, this second method turned out to give the best results.
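If you're curious what the layer actually learned, you can sanity-check it against the formula by hand. A quick sketch, assuming the area array from above is still in scope:

sample = area[:5].astype('float32')
print(area_normalizer(sample))               # the layer's output
print((sample - area.mean()) / area.std())   # manual (data - mean) / std; should match closely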

Training

The time has finally come... to become a Typical Indian Baba (they usually predict your future without ML 😆).

Now here's the model we're going to train:

model = tf.keras.Sequential([
    area_normalizer,                                      # scales the incoming area
    tf.keras.layers.Dense(units=64, activation='relu'),   # hidden layer 1
    tf.keras.layers.Dense(units=64, activation='relu'),   # hidden layer 2
    tf.keras.layers.Dense(units=1)                        # outputs a single price
])

Sequential is a model type that lets you easily create a Neural Network by stacking layers one after another. The layers we provide to Sequential are our Normalization layer and three dense layers, which together form our magical predictor! I'll explain more about these dense layers in my next post.
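If you want to see what we just built, Keras can print a summary of every layer and its parameter count:

model.summary()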

Now, you compile the model,

model.compile(
    optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=0.001),
    loss='mean_absolute_error',
    metrics=['mae']
)

optimizer: the algorithm that adjusts the model's weights, step by step, toward the set of values that minimizes the loss. Here we use Adam with a learning rate of 0.001.

loss: the loss function (also known as the cost function or objective function) measures how far your model's predictions are from the true labels; training tries to make it as small as possible.

metrics: these just report your model's performance and contribute nothing to the training itself.
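Since we picked mean absolute error for both the loss and the metric, here's exactly what it computes, on made-up numbers:

y_true = np.array([3.0, 5.0, 2.0])      # made-up labels
y_pred = np.array([2.5, 5.0, 4.0])      # made-up predictions
mae = np.mean(np.abs(y_true - y_pred))  # (0.5 + 0.0 + 2.0) / 3 ≈ 0.83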

Now, you fit the model to your data,

model.fit(
    train_features['area'],
    train_labels,
    epochs=500,
    validation_split=0.2
)

You provide the features and labels, the number of passes through the whole dataset (termed "epochs"), and the fraction of the training data the model holds out to validate itself during training.
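If you assign the return value of the fit call above to a variable (say, history = model.fit(...)), you can plot how the loss falls over the epochs, and the test split we made earlier can finally earn its keep. A sketch under those assumptions:

plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('epoch')
plt.ylabel('mean absolute error')
plt.legend()
plt.show()
# evaluate on the 30% we held out at the beginning
model.evaluate(test_features['area'], test_labels)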

Then you can predict values:

y=model.predict([64986])
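One thing to watch out for: the model was trained on min-max-scaled areas and prices, so a raw area should really be scaled the same way on the way in, and the prediction un-scaled on the way out. A sketch, assuming the train_dataset from earlier is still in scope:

# scale the raw input exactly like the training data was scaled
area_min, area_max = train_dataset['area'].min(), train_dataset['area'].max()
price_min, price_max = train_dataset['price'].min(), train_dataset['price'].max()
raw_area = 6500  # a hypothetical house area
scaled_area = (raw_area - area_min) / (area_max - area_min)
scaled_price = model.predict(np.array([scaled_area]))[0][0]
# invert the scaling to get the price back in its original units
predicted_price = scaled_price * (price_max - price_min) + price_min
print(predicted_price)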

Here's what the Regression Line looks like:
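If you'd like to draw it yourself, here's a minimal sketch with the matplotlib we imported at the start: it sweeps the normalized area range and plots the model's predictions over the training data.

x = tf.linspace(0.0, 1.0, 100)  # normalized areas from 0 to 1
y = model.predict(x)            # the model's predicted (normalized) prices
plt.scatter(train_features['area'], train_labels, label='Data')
plt.plot(x, y, color='red', label='Regression Line')
plt.xlabel('area (normalized)')
plt.ylabel('price (normalized)')
plt.legend()
plt.show()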

That's it! We're done! You can have a complete look at the code here. In my next post, I'll be writing about the Behind the Scenes of Linear Regression 🤩. Stay Tuned!

Datasets from kaggle.com & datasetsearch.research.google.com. For more insights on TensorFlow, visit: tensorflow.org

Until Next Time, Sree Teja Dusi
