In the previous post, we looked at using automatic differentiation to simplify our linear regression model and make it more readable.
Just to recap how far we’ve come with our implementation of linear regression:
- Multiple regression: Generalize the model to accept any number of features (\(x_1, x_2, \cdots, x_n\))
- Vectorization: Eliminate computationally-costly
for
-loops with vector and matrix operations - Error threshold: Modify our
train_model()
function to use an \(\epsilon\) error threshold rather than a set number of iterations - Automatic differentiation: Outsource the computation of the gradient to a library such as
autograd
, which can do it faster, more simply, and more accurately
That’s a lot, and there are still many things left to explore! Thankfully, the mathematical heavy lifting is done for now. Now we can concentrate on the coding and data science side of things!
- Do data science: Use
pandas
instead ofnumpy
; use a publicly available dataset instead of randomly generating toy data every time, and preprocess it appropriately - OOP: Encapsulate the linear regression model into a class
- Fine-tune the hyperparameters and evaluate the results
- Create similar models in Julia, R, and Clojure (the last one is a little ambitious)
- Compare our results to out-of-the-box linear regression models
Public data & preprocessing
We were generating toy datasets with random numbers before, but that is only good for testing that the code works. If we are going to evaluate models and hyperparameters, we need to use the same single dataset for consistency.
I currently live in SE Asia and air pollution is a big issue here, so the PM2.5 Data of Five Chinese Cities Data Set caught my eye as I was scrolling through the UCI Machine Learning Repository website.
Let’s use pandas
to read in the Beijing data and generate a DataFrame
for us, which is a more sophisticated version of a NumPy ndarray
.
import pandas as pd
# You may have to change os.getcwd() and os.chdir() to tell Python where to look
raw_data = pd.read_csv('FiveCitiePMData/BeijingPM20100101_20151231.csv')
raw_data.info() # https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 18 columns):
No 52584 non-null int64
year 52584 non-null int64
month 52584 non-null int64
day 52584 non-null int64
hour 52584 non-null int64
season 52584 non-null int64
PM_Dongsi 25052 non-null float64
PM_Dongsihuan 20508 non-null float64
PM_Nongzhanguan 24931 non-null float64
PM_US Post 50387 non-null float64
DEWP 52579 non-null float64
HUMI 52245 non-null float64
PRES 52245 non-null float64
TEMP 52579 non-null float64
cbwd 52579 non-null object
Iws 52579 non-null float64
precipitation 52100 non-null float64
Iprec 52100 non-null float64
dtypes: float64(11), int64(6), object(1)
memory usage: 7.2+ MB
Our DataFrame
has 52,584 entries (hourly snapshots of air pollution) and 18 columns, and there are a lot of “holes” in the data (as noted on the webpage).
There’s also a lot of data we don’t really need. I’m not a meteorologist, and I’m not experienced enough with statistics to use this data to find out which features correlate with air pollution.
So, the objective of our model will be to find out how the salient features correlate with air pollution, and use that to predict future PM2.5 readings.
A quick-and-dirty Google search reveals that humidity and air temperature are the most influential factors on PM2.5 readings. So, let’s trim our data down to humidity (HUMI
), air temperature (TEMP
) and the PM2.5 reading.
Note that there are several PM2.5 readings, but the PM_US Post
has the fewest holes (non-null values). So let’s ignore the others.
Another consideration is the nature of each feature. Humidity and air temperature are both continuous values, so we’re in the clear, but if we cared about the time of day or wind direction, we would have to deal with those discrete values in a slightly different way.
Cleaning the data
Since we only need a few columns out of this entire table, let’s just create a whole new DataFrame
, using filter
to select our target columns and dropna
to remove the rows where any of those values are NaN
.
df = raw_data.filter(['HUMI','TEMP', 'PM_US Post'], axis=1).dropna()
That was easy!
We’ve gotten rid of incomplete entries, but we might still have some bad values that, err, pollute our results. Let’s check:
df.describe()
HUMI TEMP PM_US Post
count 50048.000000 50048.000000 50048.000000
mean 54.580363 12.599504 95.773258
std 25.996814 12.107097 91.731446
min 2.000000 -19.000000 1.000000
25% 31.000000 2.000000 27.000000
50% 55.000000 14.000000 69.000000
75% 78.000000 23.000000 132.000000
max 100.000000 42.000000 994.000000
Keep in mind that HUMI
is % and TEMP
is in degrees Celsius. Everything seems okay there.
How about those pollution readings, though? The max
value seems a little off. I mean, I have heard of readings in the 300 range (which is still dangerous), but 994 seems like an outlier.
How many readings were more than 3 standard deviations above the mean?
# Source: https://stackoverflow.com/questions/23833763/pandas-count-number-of-elements-in-each-column-less-than-x
high_pm = df['PM_US Post'].mean() + df['PM_US Post'].std() * 3
high_readings = (df['PM_US Post'] > high_pm).apply(np.count_nonzero).sum()
(high_readings / len(df.index)) * 100
1.86%. I’m tempted to remove them from the data, but a Google search turned up an article that suggests that such readings are not beyond the realm of possibility, sadly.
Feature scaling
One thing we didn’t do with the randomly generated datasets is feature scaling, which refers to the process of either normalizing or standardizing the values to facilitate faster and more accurate computations during gradient descent (or other algorithms).
Min-max scaling: Make all values fit within a certain range (between 0 and 1 or some other value)
Standardization: Coerce columns to have a mean (\(\mu\)) of 0 and a standard deviation (\(\sigma\)) of 1 by computing a \(z\)-score from the values. This is something I vaguely remember from college stats.
Which is better?
When in doubt, just standardize the data, it shouldn’t hurt. (Source)
Fair enough. scikit-learn
has built-in feature scaling methods, but one thing that did cross my mind is that if you train a model on normalized data, how can you make predictions from new data without the min
and max
(or mean
and std
) from every column?
For now, I’ll proceed working with this dataset without doing feature scaling, but I’ll come back to this point if necessary.
Segregating the data
Typically, for machine learning, you randomize the data and split it into three sets for training, validation, and testing. Training is for the model to learn the weights, validation is to fine-tune the hyperparameters (learning rate, convergence threshold, etc.), and testing is for evaluating the finalized model.
The following code performs a 60:20:20 split on the data. You could also do 80:10:10.
# Source: https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test
import autograd.numpy as np
np.random.seed(2) # guarantee same result every time
np.random.permutation(df.index)
train, validate, test = np.split(df.sample(frac=1),
[int(.6*len(df)), int(.8*len(df))])
Done!
A class act
Now, we can move on to tightening up the code a little bit. Before we encapsulate it into a class, though, we should consider the fact that we are working with pandas
now, and not numpy
, which will affect a couple things.
To add a column of ones (dummy features), the \(x_0\) in our linear regression matrix:
np.hstack((np.ones((1, 1)), target))
target.insert(0, 'dummy', 1)
To create a column vector of zeros (our initial weights), we can use a Pandas Series
:
np.zeros((num_weights, 1))
pd.Series(0.0, index=list(dataframe_minus_truth))
The list(dataframe_minus_truth)
isn’t quite so transparent. I found that when doing matrix multiplication between a matrix and either a vector or a matrix in pandas
, not only must the number of columns in the matrix match the number of rows in the other object, but the column names must match, too.
This is actually pretty smart, because it makes sure that your features don’t get messed up. But it’s still kind of a gotcha for the uninitiated (me).
So, to fix that, list(df)
is a quick way to get a list of your column names.
With those translations in mind, let’s rewrite our linear regression model as a class.
Our old train_model
function took separate X
and y
inputs, but we should make it more user-friendly by spliting the features from truth behind the scenes (we’ll assume that the last column is truth).
Let’s flesh out the class with some extra methods, test_on(dataset)
and predict()
. For test_on(dataset)
, let’s have it return the mean error (easier to interpret) rather than the mean squared error.
from autograd import elementwise_grad as egrad
class LinRegModel:
def __init__(self, dataset):
self.X = dataset.iloc[:,:-1].copy() # all but last column
self.X.insert(0, 'dummy', 1) # padding (w_0)
self.y = dataset.iloc[:,-1].copy() # last column
self.weights = pd.Series(0.0, index=list(self.X))
self.epochs = 0
def __cost__(self, weights):
errors = self.X @ weights - self.y
return np.square(errors).mean()
def info(self):
print(f'Model underwent {self.epochs} epochs of training.')
print('Model weights:')
print(self.weights)
def train(self, epsilon=0.001, learning_rate=0.01):
self.epochs = 0 # just to know the value; not critical for training
self.weights = pd.Series(0.0, index=list(self.X))
grad_cost = egrad(self.__cost__)
last_cost = 0
while True:
self.epochs += 1
print(grad_cost(np.array(self.weights)))
self.weights -= learning_rate * grad_cost(self.weights)
this_cost = self.__cost__(self.weights)
print(this_cost)
if abs(this_cost - last_cost) < epsilon:
break
last_cost = this_cost
self.info()
def test_on(self, test_set):
test_X = dataset.iloc[:,:-1].copy() # all but last column
test_X.insert(0, 'dummy', 1) # padding (w_0)
test_y = dataset.iloc[:,-1].copy() # last column
errors = self.X @ weights - self.y
return np.sqrt(np.square(errors).mean())
def predict(self, new_samples):
if type(new_samples) == "list":
new_samples = pd.DataFrame(new_samples)
if new_samples.shape[1] == self.X.shape[1]:
pass
elif new_samples.shape[1] == self.X.shape[1] - 1:
new_samples.insert(0, 'dummy', 1)
else:
raise ValueError("Number of features does not match the model's dataset")
return X @ weights
Fail!
While autograd
can handle NumPy data types, it appears that it can’t deal with Pandas data types such as Series
and DataFrames
! This defeats the whole purpose of trying to convert the old code from NumPy to Pandas.
I opened an issue on the GitHub repo, but still have not received a response after a week. In the meantime, I lost interest in this topic (manual linear regression) for the time being, and will end this post here.
References
About Feature Scaling and Normalization, Sebastian Raschka
Go Top