Intro to Deep Learning — linear separability, perceptron

Hyejin Kim
5 min read · May 29, 2021

In this article, I would like to introduce some important concepts behind multi-layer neural networks in deep learning. First, we’ll study what linear separability is, and then look at how the perceptron algorithm works. We’ll discuss the applications and limitations of linear classifiers and what non-linear transformations mean for multi-layer neural networks.

Linear Separability

Linear Separability implies the existence of a hyperplane separating the two classes. For example, consider a dataset with two features x1 and x2 in which the points (−1, −1),(1, 1),(−3, −3),(4, 4) belong to one class and (−1, 1),(1, −1),(−5, 2),(4, −8) belong to the other.

import matplotlib.pyplot as plt

# class1 points: (-1, -1), (1, 1), (-3, -3), (4, 4)
c1x1 = [-1, 1, -3, 4]
c1x2 = [-1, 1, -3, 4]
plt.scatter(c1x1, c1x2, label='class1')

# class2 points: (-1, 1), (1, -1), (-5, 2), (4, -8)
c2x1 = [-1, 1, -5, 4]
c2x2 = [1, -1, 2, -8]
plt.scatter(c2x1, c2x2, label='class2')

plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.show()

This dataset is not linearly separable: no line w1*x1 + w2*x2 + b = 0 puts all class1 points on one side and all class2 points on the other.
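A quick way to convince yourself of this (a rough sketch, not a proof: it simply samples many candidate lines with numpy) is to check that none of them separates the two classes:

import numpy as np

class1 = np.array([[-1, -1], [1, 1], [-3, -3], [4, 4]])
class2 = np.array([[-1, 1], [1, -1], [-5, 2], [4, -8]])

rng = np.random.default_rng(0)
found = False
for _ in range(100000):
    w = rng.uniform(-1, 1, size=2)   # random line direction
    b = rng.uniform(-5, 5)           # random offset
    s1 = class1 @ w + b
    s2 = class2 @ w + b
    # a separating line puts the two classes on strictly opposite sides
    if ((s1 > 0).all() and (s2 < 0).all()) or ((s1 < 0).all() and (s2 > 0).all()):
        found = True
        break
print("separating line found:", found)   # prints: separating line found: False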

Let’s look for a 1-dimensional representation z in terms of x1 and x2 such that the dataset becomes linearly separable in terms of z. Defining z = x1*x2, the data can be represented in one dimension and becomes linearly separable. Using this mapping, the class1 points (−1, −1), (1, 1), (−3, −3), (4, 4) become 1, 1, 9, 16, and the class2 points (−1, 1), (1, −1), (−5, 2), (4, −8) become −1, −1, −10, −32.

# transformed class1 points, z = x1*x2
z1 = [1, 1, 9, 16]
y1 = [0, 0, 0, 0]
plt.scatter(z1, y1, label='class1')

# transformed class2 points, z = x1*x2
z2 = [-1, -1, -10, -32]
y2 = [0, 0, 0, 0]
plt.scatter(z2, y2, label='class2')

# in 1D, the separating "hyperplane" is the single point z = 0
plt.scatter([0], [0], label='hyperplane')
plt.xlabel('z = x1*x2')
plt.legend()
plt.show()

As you can see in the above plot, z = 0 now separates class1 and class2. Non-linearly separable data can be made linearly separable by applying a non-linear transformation (z = x1*x2 in this example, which maps the data to 1D).

Perceptron

For a linearly separable dataset, a linear classifier or an SVM with a linear kernel can classify the data with 100% accuracy. Linear classifiers assign labels based on a linear combination of the input features. A single-layer perceptron is an example of a linear classifier: it computes a linear combination of the input features with its parameters (weights), passes the result through a sign function to make a prediction, computes the loss, and updates the weights through gradient descent.

To better understand how a linear classifier works, let’s implement a single-layer perceptron algorithm from scratch.

Consider a 2-dimensional dataset in which all points with x1 > x2 belong to the positive class, and all points with x1 ≤ x2 belong to the negative class. The true separator of the two classes is therefore the linear hyperplane (line) defined by x1 − x2 = 0. Now let’s create a training dataset of 20 points randomly generated inside the unit square in the positive quadrant.

import matplotlib.pyplot as plt
import numpy as np
import random

x1_range = np.arange(0, 1, 0.01)

# create train data
random.seed(0)
x1_pos = []
x2_pos = []
x1_neg = []
x2_neg = []
train_data = []

for i in range(20):
    x1, x2 = random.random(), random.random()
    train_data.append(np.array([x1, x2]))

    if x1 > x2:
        x1_pos.append(x1)
        x2_pos.append(x2)
    else:
        x1_neg.append(x1)
        x2_neg.append(x2)

plt.scatter(x1_pos, x2_pos, label='positive')
plt.scatter(x1_neg, x2_neg, label='negative')
plt.plot(x1_range, x1_range, label='x1-x2=0')
plt.legend()
plt.show()

Let’s implement the perceptron algorithm, train it on the 20 points above, and test its accuracy on 1000 randomly generated points inside the unit square.

def perceptron(x_train, epoch, learning_rate, a):
    # randomly initialize the weights
    w = np.array([random.random(), random.random()])
    for n in range(epoch):
        for x in x_train:
            # linear combination of inputs and weights
            dotprod = np.dot(x, w)
            # prediction through the sign function
            pred = np.sign(dotprod)
            # true label
            target = 1 if (x[0] > x[1]) else -1

            # update the weights when the margin condition is violated
            # (a = 0: perceptron criterion, a = 1: hinge loss)
            if target * dotprod <= a:
                gradient = -(target * x)
                w = w - learning_rate * gradient
    return w

def predict(x_test, weights):
    n_correct = 0
    for x in x_test:
        dotprod = np.dot(x, weights)
        pred = np.sign(dotprod)

        if pred > 0:
            if x[0] > x[1]:
                n_correct += 1
        else:
            if x[0] <= x[1]:
                n_correct += 1
    return n_correct / len(x_test)
# train with the perceptron criterion (a = 0)
w_0 = perceptron(train_data, 20, 0.01, 0)

# create test data: 1000 random points inside the unit square
test_data = []
for i in range(1000):
    x1, x2 = random.random(), random.random()
    test_data.append(np.array([x1, x2]))

print("Accuracy : ", predict(test_data, w_0))

After 20 epochs, the perceptron reaches 100% accuracy on the test points.
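To see what was learned, we can plot the learned decision boundary against the true separator. This is a quick sketch reusing w_0, x1_range, and matplotlib from the code above; since the model has no bias term, its boundary is a line through the origin (and assuming w_0[1] is non-zero), and it only needs to classify the points correctly rather than coincide exactly with x1 − x2 = 0.

# learned boundary: w_0[0]*x1 + w_0[1]*x2 = 0  =>  x2 = -(w_0[0]/w_0[1]) * x1
plt.plot(x1_range, x1_range, label='true separator x1-x2=0')
plt.plot(x1_range, -(w_0[0] / w_0[1]) * x1_range, '--', label='learned boundary')
plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.show()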

Another example of a linear classifier is the SVM (support vector machine). The objective of an SVM is to find the plane with the maximum margin, i.e., the largest distance to the nearest data points of either class.

To evaluate an SVM-style classifier on this dataset, we can change the perceptron criterion to the hinge loss and repeat the accuracy computation on the same test points. The hinge loss is used for “maximum-margin” classification; it uses the same update rule as the perceptron criterion above (a = 0), except that the margin a is set to 1, so the weights are also updated when a point is classified correctly but with margin target * (w · x) ≤ 1.

#train
w_1 = perceptron(train_data, 10, 0.01, 1) #change a to 1
print(w_1)
print("Accuracy : ", predict(test_data, w_1))

As with the perceptron criterion, the hinge loss also achieves 100% accuracy on this linearly separable dataset.
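As a cross-check, here is a sketch using a library linear SVM (this assumes scikit-learn is installed; it is not used elsewhere in this article). On this linearly separable data it should reach essentially the same (near-100%) accuracy:

from sklearn.svm import LinearSVC

# labels follow the true rule: positive when x1 > x2, negative otherwise
X_train = np.array(train_data)
y_train = np.where(X_train[:, 0] > X_train[:, 1], 1, -1)

X_test = np.array(test_data)
y_test = np.where(X_test[:, 0] > X_test[:, 1], 1, -1)

svm = LinearSVC()
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))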

Using both the perceptron criterion and the SVM-style hinge loss, we saw that 100% accuracy can be achieved on a linearly separable dataset. However, most real-world data is not linearly separable. So how can we classify such data? Non-linear transformation is the key!

Non-linear transformation is the key idea behind multi-layer neural network classifiers. By introducing non-linear activation functions in the hidden layers, the data can be mapped to a feature space in which it becomes linearly separable. This is how a model can learn more complex features of the data through a multi-layer neural network.

[Figure from C. Aggarwal, Neural Networks and Deep Learning]
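To make this concrete, here is a small hand-built sketch of a two-layer ReLU network whose hidden layer turns the XOR-like data from the first section into a linearly separable representation. The weights are chosen by hand rather than learned, purely for illustration: the output equals |x1 + x2| − |x1 − x2|, which is positive exactly when x1 and x2 have the same sign.

import numpy as np

# two-layer network with hand-picked weights (not learned)
# hidden layer: h = relu(W1 @ x); output: w2 @ h = |x1+x2| - |x1-x2|
W1 = np.array([[ 1,  1],
               [-1, -1],
               [ 1, -1],
               [-1,  1]])
w2 = np.array([1, 1, -1, -1])

def forward(x):
    h = np.maximum(0, W1 @ x)   # non-linear (ReLU) hidden layer
    return np.dot(w2, h)        # linear output layer

class1 = [(-1, -1), (1, 1), (-3, -3), (4, 4)]
class2 = [(-1, 1), (1, -1), (-5, 2), (4, -8)]

print([forward(np.array(p)) for p in class1])  # all positive
print([forward(np.array(p)) for p in class2])  # all negative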

Now you may have some intuition for how multi-layer neural networks learn hidden features and patterns in data. Linear classifiers are simple and work well with linearly separable data, but they are limited in learning the complex features of real-world data. This is the starting point for understanding the many multi-layer neural network architectures that solve various real-world problems!

References

C. Aggarwal. Neural Networks and Deep Learning: A Textbook. Springer, 2018.
