Linear Regression, Part 1.9999: Autograd Fail

In the previous post, we looked at using automatic differentiation to simplify our linear regression model and make it more readable.

Just to recap how far we’ve come with our implementation of linear regression:

Multiple regression: Generalize the model to accept any number of features ($x_1, x_2, \cdots, x_n$)
Vectorization: Eliminate computationally-costly for-loops with vector and matrix operations
Error threshold: Modify our train_model() function to use an $\epsilon$ error threshold rather than a set number of iterations
Automatic differentiation: Outsource the computation of the gradient to a library such as autograd, which can do it faster, more simply, and more accurately

That’s a lot, and there are still many things left to explore! Thankfully, the mathematical heavy lifting is done for now. Now we can concentrate on the coding and data science side of things!

Do data science: Use pandas instead of numpy; use a publicly available dataset instead of randomly generating toy data every time, and preprocess it appropriately
OOP: Encapsulate the linear regression model into a class
Fine-tune the hyperparameters and evaluate the results
Create similar models in Julia, R, and Clojure (the last one is a little ambitious)
Compare our results to out-of-the-box linear regression models

Public data & preprocessing

We were generating toy datasets with random numbers before, but that is only good for testing that the code works. If we are going to evaluate models and hyperparameters, we need to use the same single dataset for consistency.

I currently live in SE Asia and air pollution is a big issue here, so the PM2.5 Data of Five Chinese Cities Data Set caught my eye as I was scrolling through the UCI Machine Learning Repository website.

Let’s use pandas to read in the Beijing data and generate a DataFrame for us, which is a more sophisticated version of a NumPy ndarray.

Python

import pandas as pd
# You may have to change os.getcwd() and os.chdir() to tell Python where to look
raw_data = pd.read_csv('FiveCitiePMData/BeijingPM20100101_20151231.csv')
raw_data.info() # https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 18 columns):
No                 52584 non-null int64
year               52584 non-null int64
month              52584 non-null int64
day                52584 non-null int64
hour               52584 non-null int64
season             52584 non-null int64
PM_Dongsi          25052 non-null float64
PM_Dongsihuan      20508 non-null float64
PM_Nongzhanguan    24931 non-null float64
PM_US Post         50387 non-null float64
DEWP               52579 non-null float64
HUMI               52245 non-null float64
PRES               52245 non-null float64
TEMP               52579 non-null float64
cbwd               52579 non-null object
Iws                52579 non-null float64
precipitation      52100 non-null float64
Iprec              52100 non-null float64
dtypes: float64(11), int64(6), object(1)
memory usage: 7.2+ MB

Our DataFrame has 52,584 entries (hourly snapshots of air pollution) and 18 columns, and there are a lot of “holes” in the data (as noted on the webpage).

There’s also a lot of data we don’t really need. I’m not a meteorologist, and I’m not experienced enough with statistics to use this data to find out which features correlate with air pollution.

So, the objective of our model will be to find out how the salient features correlate with air pollution, and use that to predict future PM2.5 readings.

A quick-and-dirty Google search reveals that humidity and air temperature are the most influential factors on PM2.5 readings. So, let’s trim our data down to humidity (HUMI), air temperature (TEMP) and the PM2.5 reading.

Note that there are several PM2.5 readings, but the PM_US Post has the fewest holes (non-null values). So let’s ignore the others.

Another consideration is the nature of each feature. Humidity and air temperature are both continuous values, so we’re in the clear, but if we cared about the time of day or wind direction, we would have to deal with those discrete values in a slightly different way.

Cleaning the data

Since we only need a few columns out of this entire table, let’s just create a whole new DataFrame, using filter to select our target columns and dropna to remove the rows where any of those values are NaN.

df = raw_data.filter(['HUMI','TEMP', 'PM_US Post'], axis=1).dropna()

That was easy!

We’ve gotten rid of incomplete entries, but we might still have some bad values that, err, pollute our results. Let’s check:

Python

df.describe()

Output

               HUMI          TEMP    PM_US Post
count  50048.000000  50048.000000  50048.000000
mean      54.580363     12.599504     95.773258
std       25.996814     12.107097     91.731446
min        2.000000    -19.000000      1.000000
25%       31.000000      2.000000     27.000000
50%       55.000000     14.000000     69.000000
75%       78.000000     23.000000    132.000000
max      100.000000     42.000000    994.000000

Keep in mind that HUMI is % and TEMP is in degrees Celsius. Everything seems okay there.

How about those pollution readings, though? The max value seems a little off. I mean, I have heard of readings in the 300 range (which is still dangerous), but 994 seems like an outlier.

How many readings were more than 3 standard deviations above the mean?

# Source: https://stackoverflow.com/questions/23833763/pandas-count-number-of-elements-in-each-column-less-than-x
high_pm = df['PM_US Post'].mean() + df['PM_US Post'].std() * 3
high_readings = (df['PM_US Post'] > high_pm).apply(np.count_nonzero).sum()
(high_readings / len(df.index)) * 100

1.86%. I’m tempted to remove them from the data, but a Google search turned up an article that suggests that such readings are not beyond the realm of possibility, sadly.

Feature scaling

One thing we didn’t do with the randomly generated datasets is feature scaling, which refers to the process of either normalizing or standardizing the values to facilitate faster and more accurate computations during gradient descent (or other algorithms).

Min-max scaling: Make all values fit within a certain range (between 0 and 1 or some other value)

$$ X_{\textrm{norm}} = \frac{X - X_{\textrm{min}}}{X_{\textrm{max}} - X_{\textrm{min}}} $$

Standardization: Coerce columns to have a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1 by computing a $z$-score from the values. This is something I vaguely remember from college stats.

$$ z_i = \frac{x_i - \mu}{\sigma} $$

Which is better?

When in doubt, just standardize the data, it shouldn’t hurt. (Source)

Fair enough. scikit-learn has built-in feature scaling methods, but one thing that did cross my mind is that if you train a model on normalized data, how can you make predictions from new data without the min and max (or mean and std) from every column?

For now, I’ll proceed working with this dataset without doing feature scaling, but I’ll come back to this point if necessary.

Segregating the data

Typically, for machine learning, you randomize the data and split it into three sets for training, validation, and testing. Training is for the model to learn the weights, validation is to fine-tune the hyperparameters (learning rate, convergence threshold, etc.), and testing is for evaluating the finalized model.

The following code performs a 60:20:20 split on the data. You could also do 80:10:10.

# Source: https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test
import autograd.numpy as np
np.random.seed(2) # guarantee same result every time
np.random.permutation(df.index)
train, validate, test = np.split(df.sample(frac=1),
                                 [int(.6*len(df)), int(.8*len(df))])

Done!

A class act

Now, we can move on to tightening up the code a little bit. Before we encapsulate it into a class, though, we should consider the fact that we are working with pandas now, and not numpy, which will affect a couple things.

To add a column of ones (dummy features), the $x_0$ in our linear regression matrix:

NumPy

np.hstack((np.ones((1, 1)), target))

Pandas

target.insert(0, 'dummy', 1)

To create a column vector of zeros (our initial weights), we can use a Pandas Series:

NumPy

np.zeros((num_weights, 1))

Pandas

pd.Series(0.0, index=list(dataframe_minus_truth))

The list(dataframe_minus_truth) isn’t quite so transparent. I found that when doing matrix multiplication between a matrix and either a vector or a matrix in pandas, not only must the number of columns in the matrix match the number of rows in the other object, but the column names must match, too.

This is actually pretty smart, because it makes sure that your features don’t get messed up. But it’s still kind of a gotcha for the uninitiated (me).

So, to fix that, list(df) is a quick way to get a list of your column names.

With those translations in mind, let’s rewrite our linear regression model as a class.

Our old train_model function took separate X and y inputs, but we should make it more user-friendly by spliting the features from truth behind the scenes (we’ll assume that the last column is truth).

Let’s flesh out the class with some extra methods, test_on(dataset) and predict(). For test_on(dataset), let’s have it return the mean error (easier to interpret) rather than the mean squared error.

from autograd import elementwise_grad as egrad

class LinRegModel:
    def __init__(self, dataset):
        self.X = dataset.iloc[:,:-1].copy()  # all but last column
        self.X.insert(0, 'dummy', 1)         # padding (w_0)
        self.y = dataset.iloc[:,-1].copy()   # last column
        self.weights = pd.Series(0.0, index=list(self.X))
        self.epochs = 0

    def __cost__(self, weights):
        errors = self.X @ weights - self.y
        return np.square(errors).mean()

    def info(self):
        print(f'Model underwent {self.epochs} epochs of training.')
        print('Model weights:')
        print(self.weights)

    def train(self, epsilon=0.001, learning_rate=0.01):
        self.epochs = 0 # just to know the value; not critical for training
        self.weights = pd.Series(0.0, index=list(self.X))
        grad_cost = egrad(self.__cost__)
        last_cost = 0
        while True:
            self.epochs += 1
            print(grad_cost(np.array(self.weights)))
            self.weights -= learning_rate * grad_cost(self.weights)
            this_cost = self.__cost__(self.weights)
            print(this_cost)
            if abs(this_cost - last_cost) < epsilon:
                break
            last_cost = this_cost
        self.info()

    def test_on(self, test_set):
        test_X = dataset.iloc[:,:-1].copy()   # all but last column
        test_X.insert(0, 'dummy', 1)          # padding (w_0)
        test_y = dataset.iloc[:,-1].copy()    # last column
        errors = self.X @ weights - self.y
        return np.sqrt(np.square(errors).mean())

    def predict(self, new_samples):
        if type(new_samples) == "list":
            new_samples = pd.DataFrame(new_samples)
        if new_samples.shape[1] == self.X.shape[1]:
            pass
        elif new_samples.shape[1] == self.X.shape[1] - 1:
            new_samples.insert(0, 'dummy', 1)
        else:
            raise ValueError("Number of features does not match the model's dataset")
        return X @ weights

Fail!

While autograd can handle NumPy data types, it appears that it can’t deal with Pandas data types such as Series and DataFrames! This defeats the whole purpose of trying to convert the old code from NumPy to Pandas.

I opened an issue on the GitHub repo, but still have not received a response after a week. In the meantime, I lost interest in this topic (manual linear regression) for the time being, and will end this post here.

References

About Feature Scaling and Normalization, Sebastian Raschka

Go Top