From Linear Regression To Neural Networks

Rahul Kumar
Published in Good Audience · Jul 2, 2018

A friend of mine recently joined a gym. He has a habit of noting down how many calories he burned and in how much time.
His first week's readings look something like this:

The vertical axis represents calories burned and the horizontal axis represents duration in minutes.

Notice there is no data point for 7 minutes.
Can you guess how many calories he will burn if the duration is 7 minutes?
Approximately 13–15, right?

But how did you guess it?

Intuitively, you drew a straight line which fits the data approximately, then used that line to predict the outcome for input 7 minutes.

But how do we make a computer do this? Machines don't have intuition like we do.
What we can do is come up with a procedure that finds the best-fitting line, and program the computer accordingly.

Procedure for finding the best-fitting line:

  • Take any random line
  • Tilt it clockwise a little bit and calculate the cost
  • Tilt it anticlockwise a little bit and calculate the cost
  • Check in which direction the cost reduces the most and tilt the line a little bit in that direction
  • Move the line a little bit upward and calculate the cost
  • Move the line a little bit downward and calculate the cost
  • Check in which direction the cost reduces the most and move the line a little bit in that direction
  • Repeat until the cost stops reducing further

Mathematically, a line can be represented as y = mx + b, where m is the slope and b is the y-intercept. We have to find m and b such that the cost is minimum.
To tilt the line we change the value of m, and to move the line up or down we change the value of b.

For example, suppose we start with m = 1 and b = 1.
We calculate the cost for m + 0.00001 and m - 0.00001.
Whichever reduces the cost more becomes our new m. Similarly, we change b.
We keep doing this until the cost stops reducing further.
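Here is a minimal sketch of this nudging procedure in Python. The gym readings are made up for illustration, and the cost used is the squared-error cost introduced a little later in the article:

```python
import numpy as np

# Hypothetical gym readings: duration in minutes vs. calories burned
X = np.array([1, 2, 3, 4, 5, 6, 8, 9, 10], dtype=float)
Y = np.array([2, 4, 6, 8, 10, 12, 16, 18, 20], dtype=float)

def cost(m, b):
    """Half the sum of squared errors (the cost function introduced below)."""
    return 0.5 * np.sum((m * X + b - Y) ** 2)

m, b, step = 1.0, 1.0, 0.00001
for _ in range(100000):
    # Nudge m both ways and keep whichever value gives the lowest cost
    m = min([m - step, m, m + step], key=lambda mm: cost(mm, b))
    # Do the same for b
    b = min([b - step, b, b + step], key=lambda bb: cost(m, bb))

print(m, b)  # slowly crawls towards the best-fitting slope and intercept
```

With a step of 0.00001 this crawls towards the answer very slowly, which is exactly the problem the rest of the article addresses.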

First question: what is this cost and how are we calculating it?

Intuitively, we can say that the cost is nothing but the sum of the errors of all points, i.e. the vertical (y) distance of each point from the line.

E = ∑ (Yp - Ya)

where Yp is the predicted output (the point on the line) and Ya is the actual output.

Now suppose 2 points have a positive error and 2 points have a negative error of equal magnitude. Then, according to our formula, the cost will be 0.

How to fix that?
We can instead take the absolute values.
Now suppose we have one line with errors 1, 1, 1, 1 and another line with errors 2, 2. Both of them will have the same cost according to our equation. But which one is a better fit? The line with errors 1, 1, 1, 1 is the better fit.
How to handle this scenario?
We can take the square of the errors instead of absolute values, so that the larger the error, the larger the cost.

So our new cost function will be
E = ∑ (Yp - Ya)²

and we multiply this by ½ to make our calculations easier later:
E = ½ ∑ (Yp - Ya)²
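A quick numerical check of the argument above, using the hypothetical errors 1, 1, 1, 1 and 2, 2:

```python
import numpy as np

errors_a = np.array([1.0, 1.0, 1.0, 1.0])  # line A: four small errors
errors_b = np.array([2.0, 2.0])            # line B: two larger errors

# Sum of absolute errors: 4.0 vs 4.0 — cannot tell the two lines apart
print(np.abs(errors_a).sum(), np.abs(errors_b).sum())

# Half the sum of squared errors: 2.0 vs 4.0 — prefers line A, the better fit
print(0.5 * (errors_a ** 2).sum(), 0.5 * (errors_b ** 2).sum())
```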

Second question: why are we taking a step size of 0.00001?

That is arbitrary; if we want more precision in the output, we can decrease the step size.

Is there another way to find out how to change m and b, and by how much, instead of trying both directions manually?

Before that let us consider an example.

Suppose you are at a particular point in the valley shown. Your goal is to reach the bottom.

How do you reach the bottom?
Take a step in the direction opposite to the slope, with a step size proportional to the slope at that particular point. Keep taking steps until the slope becomes 0.

Imagine this is our cost function: the y-axis represents cost and the x-axis represents the m parameter of our line. The slope of the curve at any particular point is the rate of change of cost w.r.t. m. So, given the slope w.r.t. m at any point, we can calculate whether to increase or decrease m, and by how much, so that our cost is minimised.

Using this approach, instead of increasing m, then decreasing it, and then comparing costs, we can directly calculate which way the cost will decrease.
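To make the valley analogy concrete, here is a one-dimensional sketch on a made-up bowl-shaped cost E(m) = (m - 3)², whose slope at any m is 2(m - 3):

```python
m = 10.0      # starting point on the side of the "valley"
alpha = 0.1   # how big a step we take relative to the slope

for _ in range(100):
    slope = 2 * (m - 3)   # derivative of E(m) = (m - 3)**2
    m -= alpha * slope    # step opposite to the slope, proportional to it

print(m)  # ends up very close to 3, the bottom of the valley
```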

But how do we calculate the slope?

Remember the derivative of a function? The partial derivative of the cost function at a particular m directly gives us the slope at that point.

So instead of tilting the line both ways and then checking the cost, we just need to calculate the partial derivative w.r.t. m and change m accordingly.

Mathematically,

m_new = m_old - (alpha * ∂E/∂m)
similarly,
b_new = b_old - (alpha * ∂E/∂b)

where alpha is the learning rate, which controls the step size.

Yp = m*X + b
Ya = actual output
Error = Yp - Ya
Cost function E = ½ * (Yp - Ya)²
What we have to find is ∂E/∂m and ∂E/∂b

∂E/∂m = (Yp - Ya) * X
∂E/∂b = (Yp - Ya)
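Putting the two update rules together gives plain gradient descent. Here is a minimal sketch, reusing the made-up readings from the earlier snippet; the learning rate alpha = 0.001 and the iteration count are arbitrary choices:

```python
import numpy as np

# Hypothetical data: duration in minutes vs. calories burned
X = np.array([1, 2, 3, 4, 5, 6, 8, 9, 10], dtype=float)
Ya = np.array([2, 4, 6, 8, 10, 12, 16, 18, 20], dtype=float)

m, b = 1.0, 1.0
alpha = 0.001  # learning rate

for _ in range(10000):
    Yp = m * X + b                 # predicted outputs for every point
    dE_dm = np.sum((Yp - Ya) * X)  # ∂E/∂m, summed over all points
    dE_db = np.sum(Yp - Ya)        # ∂E/∂b, summed over all points
    m -= alpha * dE_dm
    b -= alpha * dE_db

print(m, b)  # reaches the best-fitting line in far fewer steps than the nudging approach
```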

And that's it.
So, given any data, we can use the above approach to find a line that approximately models the data.

Let us represent this model as a network:

Let's call the circles neurons, and m and b weights.
We take the input-layer neurons, multiply them by their weights, and pass them to the hidden-layer neuron. The hidden layer sums both inputs and forwards the result to the output layer.
This is equivalent to y = m*x + b.

But in real life, we don’t usually have linear data.
What if we have binary data?

A straight line isn’t a good fit for this data. Can’t we use some sort of combination of lines?
Consider two lines :
Y = m1*x + b1
Y = m2*x + b2
Let's add them:
Y = (m1 + m2) * x + (b1 + b2)
This is again a straight line, so a combination of straight lines won't be of any use.

What if we can bend the line somehow?

If we pass the straight line through a sigmoid function, the line is compressed into an S-shaped curve with values in the range 0 to 1.

Sigmoid function: σ(z) = 1 / (1 + e^(-z))

The slope m of the line controls the sharpness of the S-shape, and the intercept b controls its mid-point.

The S-shaped curve fits our data nicely!
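A small sketch of this squashing step; the values of m and b are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, b = 2.0, -4.0            # arbitrary slope and intercept
x = np.linspace(-5, 5, 11)

line = m * x + b            # plain straight line, unbounded
s_curve = sigmoid(line)     # squashed into an S-shape between 0 and 1

print(np.round(s_curve, 3)) # rises from ~0 to ~1
```

Making m larger makes the transition sharper; changing b slides the mid-point (the x where m*x + b = 0) left or right.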

Let us modify our model and add a sigmoid function:

Notice we have introduced two more weights (w_o, b_o) to scale and shift the output, since the output of the sigmoid function is always between 0 and 1.

So now our output depends on four variables: m, b, w_o, b_o.

Using the same approach we used for the straight line we can find the best fitting S-shaped curve.

Cost function E = ½ * (Y_out - Y_a)²
Output of summation Y_net = m*x + b
Output of sigmoid Y_sig = 1 / (1 + e^(-Y_net))
Output-layer output Y_out = w_o * Y_sig + b_o

∂E/∂m = (Y_out - Y_a) * w_o * Y_sig * (1 - Y_sig) * x

m_new = m_old - alpha * ∂E/∂m
Similarly, the other three weight updates can be calculated.
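Here is a minimal training sketch for this four-weight model, using the gradient above for m and the analogous ones for b, w_o and b_o. The binary-looking data, learning rate and iteration count are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical binary data: the output jumps from 0 to 1 around x = 5
x = np.array([1, 2, 3, 4, 6, 7, 8, 9], dtype=float)
Ya = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

m, b, w_o, b_o = 1.0, 0.0, 1.0, 0.0
alpha = 0.05

for _ in range(20000):
    Y_net = m * x + b
    Y_sig = sigmoid(Y_net)
    Y_out = w_o * Y_sig + b_o

    err = Y_out - Ya             # (Y_out - Y_a) for every point
    dsig = Y_sig * (1 - Y_sig)   # derivative of the sigmoid

    dE_dm = np.sum(err * w_o * dsig * x)
    dE_db = np.sum(err * w_o * dsig)
    dE_dw_o = np.sum(err * Y_sig)
    dE_db_o = np.sum(err)

    m -= alpha * dE_dm
    b -= alpha * dE_db
    w_o -= alpha * dE_dw_o
    b_o -= alpha * dE_db_o

print(m, b, w_o, b_o)
print(np.round(w_o * sigmoid(m * x + b) + b_o, 2))  # predictions, which should roughly match Ya after training
```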

What if the data is as shown below?

A single S-shaped curve is not sufficient for the data given.
Can we somehow use the combination of S-shaped curves to fit on the data?
Yes!! We can.
We can take two S-shaped curves, as shown, and subtract one from the other.
The resultant will be our required curve.

The corresponding network will be as shown below:

Using the same approach as earlier, the weights can be updated to find the best-fitting combination of two S-shaped curves.

Conclusion?
Using a combination of two hidden-layer neurons, we can model a tower shape of any width or height.
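A quick sketch of the subtraction trick; the steepness k and the tower edges at x = 2 and x = 6 are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(0, 8, 17)
k = 10.0                    # steepness of each S-curve
s1 = sigmoid(k * (x - 2))   # rises around x = 2
s2 = sigmoid(k * (x - 6))   # rises around x = 6

tower = 3.0 * (s1 - s2)     # ~3 between x = 2 and x = 6, ~0 elsewhere
print(np.round(tower, 2))
```

Scaling the difference (here by 3.0) sets the tower's height; moving the two mid-points sets its width.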

But in real life, data is rarely an S-shaped curve or a tower.
What if we have an arbitrary curve?
We can use a combination of towers in that case.

Given any random function F, we can create a combination of neurons such that the network approximately models the function.

Or we can say that neural networks are “universal function approximators”: they can model any data.
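To illustrate the claim, here is a small sketch that stacks many narrow towers to approximate an arbitrary target function; sin(x) and the slice width are hypothetical choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(x, left, right, height, k=50.0):
    # One tower: a steep S-curve minus a shifted copy, scaled to the desired height
    return height * (sigmoid(k * (x - left)) - sigmoid(k * (x - right)))

# Target function to approximate on [0, 2*pi]
x = np.linspace(0, 2 * np.pi, 200)
target = np.sin(x)

# 20 narrow towers, each as tall as the target at its mid-point
edges = np.linspace(0, 2 * np.pi, 21)
approx = sum(tower(x, a, c, np.sin((a + c) / 2)) for a, c in zip(edges[:-1], edges[1:]))

print(np.max(np.abs(approx - target)))  # the gap shrinks as the towers get narrower and steeper
```

The more towers (i.e. hidden neurons) we allow, the closer the approximation gets.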

