The Boston housing data is available online at Boston housing data - UCI. The dataset consists of 14 attributes (13 features plus the target variable, MEDV):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
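The snippets below assume the dataset has already been loaded into an object called `bdata`, with the feature matrix in `bdata.data` and MEDV in `bdata.target`, and that NumPy and a few standard modules are imported. The original write-up does not show this step; a minimal sketch, assuming scikit-learn's copy of the UCI data (the UCI file could be parsed directly instead):
import time
import operator
import numpy as np
from sklearn.datasets import load_boston  # assumption: using scikit-learn's copy of the UCI data

bdata = load_boston()
print "Feature matrix shape:", bdata.data.shape    # (506, 13)
print "Target (MEDV) shape:", bdata.target.shape   # (506,)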
I performed some descriptive analysis to explore the patterns in the data. Below are some of the results.
[Figure: Boston Housing Prices]
As we can see, there is a positive correlation between RM and housing prices.
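As a quick numerical check of that relationship, the Pearson correlation between RM and MEDV can be computed directly (a small sketch, assuming `bdata` is loaded as above):
rm = bdata.data[:, 5]     # RM: average number of rooms per dwelling
medv = bdata.target       # MEDV: median home value in $1000's
corr = np.corrcoef(rm, medv)[0, 1]
print "Correlation between RM and MEDV: %.3f" % corr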
Next, I performed supervised machine learning on the Boston housing data, writing K-NN and gradient descent algorithms from scratch in Python.
The first step is to split the data into training and test sets. This is important so that the final, optimized model can be evaluated on held-out test data to check how well it generalizes. Below is the code snippet for splitting the data.
# Create random indices for the training dataset (roughly 66% of the records)
train_idx = np.random.choice(bdata.target.shape[0], size=int(0.66 * bdata.target.shape[0]), replace=False)
# The remaining indices form the test dataset
test_idx = [elem for elem in range(bdata.target.shape[0]) if elem not in train_idx]
# Select target values for the training and test datasets
bdata_train = bdata.target[train_idx]
bdata_test = bdata.target[test_idx]
# Length of training and test datasets
print "Length of Training set", len(bdata_train)
print "Length of Test set", len(bdata_test)
The second step is to normalize the data, which is essential for nearest-neighbor methods because features on larger scales would otherwise dominate the distance calculation. Below is a generic normalization function that takes an array of values for a given feature and returns the normalized array (subtract the mean and divide by the standard deviation).
def normalize(raw_data):
    """Return the normalized array: subtract the mean and divide by the standard deviation"""
    # Calculate the mean of the raw_data array
    raw_data_mean = np.mean(raw_data)
    # Calculate the standard deviation of the raw_data array
    raw_data_std = np.std(raw_data)
    # Normalize all values at once by subtracting the mean and dividing by the standard deviation
    normalized_data = (np.asarray(raw_data) - raw_data_mean) / raw_data_std
    return normalized_data
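The cross-validation code further below operates on an array called `bdata_cv_norm`, whose rows hold the price (MEDV) in the first column and the normalized features in the remaining columns; that layout is my reading of how knn() indexes each row, since it is not shown explicitly. A minimal sketch of building it:
# Normalize each feature column and prepend the (un-normalized) price column
features_norm = np.column_stack([normalize(bdata.data[:, j]) for j in range(bdata.data.shape[1])])
bdata_cv_norm = np.column_stack([bdata.target, features_norm])
print "Shape of bdata_cv_norm:", bdata_cv_norm.shape  # expected (506, 14)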
To prevent overfitting, K-fold cross validation is used, and the root mean squared error (RMSE) is calculated to measure the accuracy of the model. Below is the code snippet for K-nearest neighbor with K-fold cross validation.
def compute_rmse(predictions, yvalues):
    rmse = np.sqrt(np.mean((np.subtract(predictions, yvalues)) ** 2))
    return rmse
def distance(x1, x2, L):
    """Given two instances and a value for L, return the L-norm distance between them"""
    e = 1 / float(L)
    # Use absolute differences so the norm is correct for odd values of L as well
    dist = (np.sum(np.abs(np.subtract(x2, x1)) ** L)) ** e
    return dist
def knn(trainingSet, testInstance, L, K):
    """Given training and test set arrays, with L for the norm distance and K for the
    required number of neighbors, return the RMSE of the predictions"""
    actuals = []
    predictions = []
    # Iterate over the test set and find the nearest neighbors for each test instance
    for x in range(len(testInstance)):
        distances = []
        # Compute the distance between this test instance and each training instance
        for i in range(len(trainingSet)):
            dist = distance(testInstance[x][1:], trainingSet[i][1:], L)
            # Store the training instance together with its calculated distance
            distances.append((trainingSet[i], dist))
        # Sort by distance in ascending order
        distances.sort(key=operator.itemgetter(1))
        neighbors = []
        # Keep the K nearest neighbors
        for j in range(K):
            neighbors.append(distances[j][0])
        price = []
        # Get the price from the first column of each neighbor
        for k in range(len(neighbors)):
            price.append(neighbors[k][0])
        # Append the average neighbor price to the predictions list
        predictions.append(np.mean(price))
        # Append the actual price of this test instance to the actuals list
        actuals.append(testInstance[x][0])
    # Compute RMSE
    rmse = compute_rmse(predictions, actuals)
    return rmse
def k_fold_knn(data, L, K):
    """Given an array of data with price and features, together with L and K,
    return the average RMSE over 10-fold cross validation
    using the K-NN algorithm as written in function knn()"""
    avgRMSE = []
    # Optionally shuffle the data so the folds are random samples
    #np.random.shuffle(data)
    # Generate indices for 10 samples of (roughly) equal size
    sample_idx = list(partition(range(len(data)), int(len(data) / 10)))
    # Adjust for the extra records, since we have 506 records:
    # 9 folds get 50 records each and the last fold gets the remaining 56 records
    if len(sample_idx) > 10:
        sample_idx[9] = sample_idx[9] + sample_idx[10]
        sample_idx.remove(sample_idx[10])
    # 10-fold cross validation
    for k in range(10):
        # Use the k-th sample as the test fold,
        # which ensures that every iteration of k uses different test data
        test_idx = sample_idx[k]
        # The remaining indices form the training set
        train_idx = [elem for elem in range(len(data)) if elem not in test_idx]
        # Select test data using the test indices
        test_data = [data[i] for i in test_idx]
        # Select training data using the training indices
        train_data = [data[i] for i in train_idx]
        # Compute RMSE using knn()
        rmse = knn(train_data, test_data, L, K)
        avgRMSE.append(rmse)
    return np.mean(avgRMSE)
# Run 10-fold cross validation with K-NN using knn() and k_fold_knn()
def main():
    start_time = time.time()
    # Compute RMSE using k_fold_knn() on the normalized data
    RMSE = k_fold_knn(bdata_cv_norm, 2, 3)
    print "Average RMSE for 10-FOLD CROSS VALIDATION with K=3 for KNN: ", RMSE
    print "Total Time Taken for 10-FOLD CV: " + str(round(time.time() - start_time, 2)) + " seconds"

main()
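The code above also relies on a small `partition` helper that is not shown in the snippet. A minimal sketch of what it might look like, assuming it simply splits a sequence of indices into consecutive chunks of a given size (which matches how the extra 11th chunk is merged into the 10th fold above):
def partition(seq, size):
    """Yield consecutive chunks of the given size from seq (the last chunk may be shorter)."""
    seq = list(seq)
    for i in range(0, len(seq), size):
        yield seq[i:i + size]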
I performed a 10-fold cross validation with K (neighbors) ranging from 1 to 25 to find the best choice of K, which is the one with the lowest RMSE. The RMSE dipped initially and fluctuated between K = 1 and K = 9, then rose steadily for K > 10. The average RMSE over the 10-fold CV runs is 5.61; the lowest RMSE, 5.243873, occurs at K = 5.
The K-NN model, optimized for the Boston housing data, gives an RMSE of 5.24 with K = 5 neighbors.
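A sketch of the sweep used to pick K (assuming k_fold_knn() and bdata_cv_norm as defined above):
# Sweep K from 1 to 25 and record the 10-fold CV RMSE for each value
rmse_by_k = {}
for k in range(1, 26):
    rmse_by_k[k] = k_fold_knn(bdata_cv_norm, 2, k)
best_k = min(rmse_by_k, key=rmse_by_k.get)
print "Best K: %d with RMSE %.4f" % (best_k, rmse_by_k[best_k])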
Next, I implemented the batch gradient descent algorithm and regressed housing price on the per capita crime rate (CRIM) and the average number of rooms per dwelling (RM). Below is the code snippet.
def multivariate_ols(x, y, R, ep=0.000001, MaxIterations=100000):
    """Gradient descent to minimize the OLS cost. Used to find the coefficients of a multivariate
    OLS linear regression. Takes a feature matrix x and dependent variable y, a learning rate R,
    an epsilon value ep, and MaxIterations for the maximum number of iterations;
    returns an array theta of [intercept, coefficients]"""
    start_time = time.time()
    converged = False
    iter = 0
    i = np.shape(x)[0]
    x = np.c_[np.ones(i), x]  # insert a column of ones for the intercept
    m, n = x.shape  # m = number of samples, n = number of parameters
    # Initialize theta (intercept and coefficients) with value 1
    theta = np.ones(n)
    x_transpose = x.transpose()
    e = 0.0  # previous cost, used to test for convergence
    while not converged:
        yhat = np.dot(x, theta)
        loss = yhat - y
        J = np.sum(loss ** 2) / (2 * m)  # cost
        # Check whether the cost function has converged
        if abs(J - e) <= ep:
            print 'Converged after iterations: ', iter, '!!!'
            converged = True
        e = J  # update the previous cost with the new cost value
        iter += 1  # update iteration counter
        if iter == MaxIterations:
            print 'Maximum iterations exceeded!'
            converged = True
        gradient = np.dot(x_transpose, loss) / m
        theta = theta - R * gradient  # update theta
    print "Time taken: " + str(round(time.time() - start_time, 2)) + " seconds"
    return theta
def standardize(raw_data):
    """Normalize each column of the data (subtract the column mean and divide by the column std)"""
    return (raw_data - np.mean(raw_data, axis=0)) / np.std(raw_data, axis=0)
# Select the CRIM and RM columns
rmcrm = bdata.data[:, (0, 5)]
# Standardize the features
rmcrm_std = standardize(rmcrm)
# Target variable: median home value (MEDV)
medv = bdata.target
# Run the gradient descent function with the CRIM and RM features
R = 0.01  # learning rate
ep = 0.00000001
mgd = multivariate_ols(rmcrm_std, medv, R, ep, 200000)
print "Learning rate: ", R
print "Intercept: ", mgd[0]
print "Coef for CRIM: ", mgd[1]
print "Coef for RM: ", mgd[2]
The coefficients converged after 1001 iterations with a learning rate of 0.01, and they are similar to the coefficients obtained with the OLS model from the SciPy library:
- Intercept: 22.531895225
- Coef for CRIM: -2.24891245358
- Coef for RM: 5.89407627438
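As a quick sanity check of the gradient descent output, the same coefficients can be recovered in closed form with ordinary least squares. A sketch using NumPy's lstsq (my own check, not necessarily the comparison used in the original write-up):
# Closed-form OLS on the standardized features, with a column of ones for the intercept
X = np.c_[np.ones(len(rmcrm_std)), rmcrm_std]
theta_ols = np.linalg.lstsq(X, medv)[0]
print "Closed-form [intercept, CRIM coef, RM coef]:", theta_ols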
Next, I performed logistic regression using gradient descent on the Boston housing data and optimized the model. Below is the code snippet.
def log_ols(x, y, R, ep=0.000001, MaxIterations=100000):
    """
    Takes a feature matrix x and a binary dependent variable y,
    a learning rate R, an epsilon value ep, and MaxIterations for the maximum number of iterations.
    :return: array with the intercept and slope coefficients
    """
    start_time = time.time()
    converged = False
    iter = 0
    i = np.shape(x)[0]
    x = np.c_[np.ones(i), x]  # insert a column of ones for the intercept
    m, n = x.shape  # m = number of samples, n = number of parameters
    # Initialize theta (intercept and coefficients) with value 1
    theta = np.ones(n)
    x_transpose = x.transpose()
    error = 0.0  # previous cost, used to test for convergence
    while not converged:
        # Sigmoid function gives the predicted probabilities for logistic regression
        hyp = 1.0 + np.e ** (-1.0 * np.dot(x, theta))
        yhat = 1.0 / hyp
        loss = yhat - y
        # Logistic (cross-entropy) cost function
        J = (1. / m) * (-np.transpose(y).dot(np.log(yhat)) - np.transpose(1 - y).dot(np.log(1 - yhat)))
        # Check whether the cost function has converged
        if abs(J - error) <= ep:
            print 'Converged after iterations: ', iter, '!!!'
            converged = True
        error = J  # update the previous cost with the new cost value
        iter += 1  # update iteration counter
        if iter == MaxIterations:
            print 'Maximum iterations exceeded!'
            converged = True
        gradient = np.dot(x_transpose, loss) / m
        theta = theta - R * gradient  # update theta
    print "Time taken: " + str(round(time.time() - start_time, 2)) + " seconds"
    return theta
# Set variable expensive with 1 = expensive for medv > 40.0 and 0 = not expensive
expensive = np.where(bdata.target>40.0, 1,0)
# Filter CHAS and RM values
chasrm = bdata.data[:,(3,5)]
R = 0.01 # learning rate
ep = 0.00000001
# Run logistic gradient descent
theta_log = log_ols(chasrm, expensive, R, ep, 800000)
print "Learning rate: ", R
print "Intercept: ", theta_log[0]
print "Coefficient for CHAS: ", theta_log[1]
print "Coefficient for RM: ", theta_log[2]
Finally, I ran the optimized model and calculated the accuracy of its predictions.
# Append ones to CHAS and RM matrix
chasrm_t = np.c_[ np.ones(len(chasrm)), chasrm]
chasrm_trans = chasrm_t.transpose()
# Sigmoid function to find predicted values using theta_log from log_ols() gradient descent above
chasrm_hyp = 1.0 + np.e**(-1.0*np.dot(theta_log, chasrm_trans))
chasrm_sig = 1.0 / chasrm_hyp
# Simple classifier: predict 1 when the sigmoid output is >= 0.5
chasrm_pred = np.where(chasrm_sig>=0.5, 1,0)
# Calculate accuracy of the model
print '\nACCURACY OF THE MODEL IS: %f' % ((expensive[np.where(chasrm_pred == expensive)].size / float(expensive.size)) * 100.0), '%'
ACCURACY OF THE MODEL IS: 95.849802 %
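Note that the snippet above scores the classifier on the full dataset. To evaluate on held-out data only, the same calculation can be restricted to the test indices from the train/test split at the top (a sketch, assuming test_idx is still in scope):
# Accuracy on the held-out test indices only
test_pred = chasrm_pred[test_idx]
test_actual = expensive[test_idx]
test_accuracy = (test_pred == test_actual).mean() * 100.0
print "Accuracy on the test set: %.2f %%" % test_accuracy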