
Neural networks work. In simple words about the complex: what are neural networks

Artificial intelligence, neural networks, machine learning: what do all these popular concepts really mean? To most uninitiated people, myself included, they have always seemed like something fantastic, but in fact their essence lies on the surface. I had long wanted to write in plain language about artificial neural networks: to learn for myself, and to tell others what this technology is, how it works, and what its history and prospects are. In this article I have tried not to wander into the weeds, but to describe this promising direction in the world of high technology simply and accessibly.

A bit of history

The concept of artificial neural networks (ANNs) first arose in attempts to model the processes of the brain. The first major breakthrough in this area was the McCulloch-Pitts neuron model of 1943. The scientists developed the first model of an artificial neuron and proposed building networks of such elements to perform logical operations, showing that networks of this kind can, in principle, compute logical functions.

The next important step was Donald Hebb's development, in 1949, of the first learning rule for ANNs, which remained fundamental for the next several decades. In 1958, Frank Rosenblatt developed the perceptron, a system that mimics the processes of the brain. At the time the technology had no analogues, and it is still fundamental to neural networks. In 1986, almost simultaneously and independently of each other, American and Soviet scientists significantly improved the fundamental method for training the multilayer perceptron. In 2006, neural networks underwent a rebirth: the British-born computer scientist Geoffrey Hinton pioneered a deep learning algorithm for multilayer neural networks, which is now used, for example, in self-driving cars.

Briefly about the main thing

In the most general sense, neural networks are mathematical models that work on the principle of the networks of nerve cells in an animal organism. ANNs can be implemented in both software and hardware. For ease of understanding, a neuron can be pictured as a cell with many inputs and one output. How the numerous incoming signals are turned into an outgoing one is determined by the computation algorithm. Real-valued signals arrive at each input of a neuron and then propagate along the interneuron connections (synapses). Each synapse has one parameter, a weight, by which the input information is scaled as it passes from one neuron to another. The easiest way to picture how this works is the example of color mixing: if the blue, green, and red inputs of a neuron carry different weights, the color whose weight dominates will have the strongest influence on the next neuron.
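For readers who find code clearer than prose, the weighted-sum picture above can be sketched in a few lines of Python. The sigmoid activation and the sample numbers here are illustrative choices, not something fixed by the model itself:

```python
import math

def neuron_output(inputs, weights):
    """One artificial neuron: a weighted sum of the input signals,
    passed through a sigmoid activation function."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 / (1 + math.exp(-total))  # sigmoid squashes the sum into (0, 1)

# Two input signals arriving over synapses with different weights:
print(neuron_output([1.0, 0.5], [0.45, -0.12]))
```

Changing a weight changes how strongly that input influences the output, which is exactly the parameter training will later adjust.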

The neural network itself is a system of many such neurons (processors). Individually, these processors are quite simple (much simpler than a personal computer processor), but when connected into a large system, neurons are capable of performing very complex tasks.

Depending on the field of application, a neural network can be interpreted in different ways. From the point of view of machine learning, an ANN is a pattern recognition method. From a mathematical point of view, it is a multi-parameter optimization problem. From the point of view of cybernetics, it is a model of adaptive control in robotics. For artificial intelligence, an ANN is a fundamental component for modeling natural intelligence using computational algorithms.

The main advantage of neural networks over conventional computation algorithms is their ability to learn. In the general sense of the word, learning consists in finding the correct coupling coefficients between neurons, as well as in generalizing data and identifying complex dependencies between input and output signals. In fact, successful training of a neural network means that the system will be able to identify the correct result based on data not present in the training set.

Today's situation

And yet, however promising this technology may be, ANNs are still very far from the capabilities of the human brain and human thought. Nevertheless, neural networks are already being used in many areas of human activity. They cannot yet make highly intellectual decisions, but they can replace a person where one used to be required. Among the numerous areas where ANNs are applied are: self-learning systems for production processes, driverless vehicles, image recognition systems, intelligent security systems, robotics, quality monitoring systems, voice interfaces, analytics systems, and much more. This widespread use of neural networks is due, among other things, to the emergence of various ways to accelerate ANN training.

Today the market for neural networks is huge - it is billions and billions of dollars. As practice shows, most of the technologies of neural networks around the world differ little from each other. However, the use of neural networks is a very costly exercise, which in most cases can only be afforded by large companies. For the development, training and testing of neural networks, large computing power is required, it is obvious that large players in the IT market have enough of this. Among the main companies leading development in this area are Google DeepMind, Microsoft Research, IBM, Facebook and Baidu.

Of course, all this is good: neural networks are developing, the market is growing, but so far the main task has not been solved. Humanity has failed to create a technology that is even close in capabilities to the human brain. Let's take a look at the main differences between the human brain and artificial neural networks.

Why are neural networks still far from the human brain?

The most important difference, one that fundamentally changes the principle and efficiency of the system, is how signals are transmitted in artificial neural networks versus biological networks of neurons. In an ANN, neurons pass around values that are real numbers. In the human brain, impulses of fixed amplitude are transmitted, and these impulses are almost instantaneous. This gives the biological network of neurons a number of advantages.

First, communication lines in the brain are much more efficient and economical than those in ANNs. Second, the impulse scheme makes the technology simple to implement: analog circuits suffice instead of complex computational mechanisms. Finally, impulse networks are resistant to interference, whereas real-valued signals are prone to noise, which increases the likelihood of errors.

Outcome

Of course, the last decade has seen a real boom in the development of neural networks, primarily because ANN training has become much faster and easier. So-called "pre-trained" neural networks have also begun to be actively developed, which can significantly speed up the deployment of the technology. And while it is too early to say whether neural networks will ever fully reproduce the capabilities of the human brain, the likelihood that in the next decade ANNs will be able to replace humans in a quarter of existing professions looks more and more plausible.

For those who want to know more

  • The Big Neural War: What Google Is Really Up to
  • How cognitive computers can change our future

Welcome to the second part of the neural network tutorial. First, I want to apologize to everyone who expected the second part much earlier; for various reasons I had to postpone writing it. Honestly, I did not expect the first article to be in such demand, or that so many people would be interested in the topic. Taking your comments into account, I will try to give you as much information as possible while keeping the presentation as clear as I can. In this article I will talk about ways of training neural networks (in particular, the backpropagation method), and if for some reason you have not yet read the first part, I strongly recommend starting with it. While writing this article I also wanted to cover other types of neural networks and other training methods, but once I started writing about them I realized it would go against my way of presenting the material. I understand that you are eager to get as much information as possible, but these topics are very broad and require detailed analysis, and my main task is not to write yet another article with a superficial explanation, but to convey every aspect of the topic and make the article as easy to digest as possible. I hasten to disappoint those who like to code: I will still not resort to a programming language and will explain everything "on my fingers." Enough introduction; let's continue our study of neural networks.

What is a bias neuron?


Before moving on to our main topic, we need to introduce another type of neuron: the bias neuron. The bias neuron, or bias unit, is the third type of neuron used in most neural networks. Its peculiarity is that its output is always equal to 1 and it never has input synapses. Bias neurons are either present in a network at most one per layer or absent altogether; there is no in-between (in the diagram, red marks the weights and neurons that cannot exist). Bias neurons connect just like ordinary neurons: to all neurons of the next level, except that there can be no synapse between two bias neurons. Consequently, they can be placed on the input layer and on all hidden layers, but not on the output layer, since there would be nothing for them to connect to.

What is a bias neuron for?



A bias neuron is needed so that the output can be obtained by shifting the graph of the activation function to the right or to the left. If that sounds confusing, let's look at a simple example with one input neuron and one output neuron. Then the output O2 equals the input H1 multiplied by its weight and passed through the activation function (the formula in the image on the left). In our particular case we will use a sigmoid.

From school mathematics we know that in the function y = ax + b, changing the value of "a" changes the slope of the line (the colors of the lines on the left graph), while changing "b" shifts the line to the right or left (the colors of the lines on the right graph). Here "a" is the weight of H1 and "b" is the weight of the bias neuron B1. This is a rough analogy, but that is how it works (if you look at the activation function on the right of the image, you will notice a strong similarity between the formulas). So when we adjust the weights of hidden and output neurons during training, we change the slope of the activation function. Adjusting the weight of the bias neurons, however, lets us shift the activation function along the X-axis and capture new regions. In other words, if the point responsible for your solution lies as shown on the left graph, your neural network will never be able to solve the problem without bias neurons. That is why you will rarely find neural networks without them.
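Although this series deliberately avoids code, the y = ax + b analogy is easy to check numerically. A tiny Python sketch (with illustrative weight values) shows how the bias weight moves a sigmoid away from its fixed value at input 0:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Weight "a" changes the slope of the curve; the bias weight "b" shifts it,
# just like "a" and "b" in y = ax + b.
a, b = 1.0, -2.0
without_bias = sigmoid(a * 0)   # at input 0 this is always 0.5, whatever "a" is
with_bias = sigmoid(a * 0 + b)  # the bias weight moves the curve away from 0.5
print(without_bias, with_bias)
```

No choice of "a" alone can change the output at input 0; only the bias term can, which is exactly the shifting described above.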

Bias neurons also help when all input neurons receive 0 as input: no matter what their weights are, they will pass 0 on to the next layer, unless a bias neuron is present. The presence or absence of bias neurons is a hyperparameter (more on that later). In short, you must decide for yourself whether to use bias neurons, by running the neural network with and without them and comparing the results.

IMPORTANT: be aware that diagrams sometimes omit bias neurons and simply account for their weights when calculating the input value, for example:

Input = H1 * w1 + H2 * w2 + b3
b3 = bias * w3

Since its output is always 1, we can simply imagine that we have an additional synapse with a weight and add this weight to the sum without mentioning the neuron itself.
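A minimal sketch of this trick, with hypothetical values for the hidden outputs and weights: since the bias neuron's output is fixed at 1, its synapse simply contributes its weight w3 to the sum.

```python
def neuron_input(h1, h2, w1, w2, w3):
    bias_output = 1          # a bias neuron always outputs 1 ...
    b3 = bias_output * w3    # ... so its contribution is simply its weight w3
    return h1 * w1 + h2 * w2 + b3

# Hypothetical values: two hidden outputs plus a bias synapse of weight 0.1
print(neuron_input(0.61, 0.69, 1.5, -2.3, 0.1))
```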

How do we make the neural network give correct answers?

The answer is simple: you need to train it. But simple as the answer is, its implementation is anything but. There are several methods for training neural networks, and I will single out the 3 that are, in my opinion, the most interesting:
  • Backpropagation method
  • Resilient propagation (Rprop)
  • Genetic Algorithm
Rprop and the genetic algorithm will be discussed in other articles; for now we will look at the foundation: the backpropagation method, which uses the gradient descent algorithm.

What is Gradient Descent?

This is a way to find the local minimum or maximum of a function by moving along the gradient. Once you understand the essence of gradient descent, you shouldn't have any questions while using the backpropagation method. First, let's figure out what a gradient is and where it is present in our neural network. Let's build a graph where the x-axis will be the values ​​of the neuron's weight (w) and the y-axis will be the error corresponding to this weight (e).


Looking at this graph, we will understand that the graph of the function f (w) is the dependence of the error on the selected weight. On this chart, we are interested in the global minimum - the point (w2, e2) or, in other words, the place where the chart comes closest to the x-axis. This point will mean that by choosing the weight w2 we will get the smallest error - e2 and, as a consequence, the best possible result. The gradient descent method will help us to find this point (the gradient is shown in yellow on the graph). Accordingly, each weight in the neural network will have its own graph and gradient, and each needs to find a global minimum.

So what is a gradient? A gradient is a vector that defines the steepness of a slope and indicates its direction relative to a point on a surface or graph. To find the gradient, you take the derivative of the graph at the given point (as shown on the graph). Moving in the direction of this gradient, we smoothly slide down into the lowlands. Now imagine that the error is a skier and the graph of the function is a mountain. If the error is 100%, the skier is at the very top of the mountain; if the error is 0%, he is in the lowland. Like all skiers, the error strives to descend as fast as possible and decrease its value. In the end, we should get the following result:


Imagine that a helicopter drops the skier onto the mountain. How high or low is a matter of chance (just as a neural network's weights are assigned randomly during initialization). Let's say the error is 90%, and that is our starting point. Now the skier needs to descend using the gradient. On the way down, at each point we compute the gradient, which shows us the direction of descent, and adjust course when the slope changes. If the slope is even, then after the n-th such step we reach the lowland. But in most cases the slope (the graph of the function) is wavy, and our skier faces a very serious problem: a local minimum. I assume everyone knows what the local and global minima of a function are; here is an example to refresh your memory. Getting stuck in a local minimum means that our skier will stay in that hollow forever and never slide down the mountain, so we will never get the correct answer. But we can avoid this by equipping our skier with a jetpack called momentum. Here is a quick illustration of momentum:

As you probably guessed, this jetpack gives the skier the acceleration needed to clear the hill that keeps us in the local minimum, but there is one BUT. Suppose we set some value for the momentum parameter and easily cleared all the local minima on the way to the global minimum. Since we cannot simply switch the jetpack off, we may overshoot the global minimum if there are more hollows next to it. Ultimately this is not critical, since sooner or later we will return to the global minimum, but remember: the larger the momentum, the larger the swings with which the skier will oscillate around the hollows. Along with momentum, the backpropagation method also uses a parameter called the learning rate. Many will probably assume that the higher the learning rate, the faster the network will train. Not so. The learning rate, like momentum, is a hyperparameter, a value chosen by trial and error. The learning rate can be thought of as the skier's speed, and here it is safe to say that slower is steadier. There are subtleties, though: if we give the skier no speed at all, he will not move anywhere, and if the speed is too low, the trip can stretch out for a very, very long time. What happens, then, if we give him too much speed?


As you can see, nothing good. The skier starts sliding down the wrong path and possibly even in the opposite direction, which, as you understand, only takes us further from the right answer. So for all these parameters we must find a golden mean in order to avoid non-convergence of the NN (more on that later).
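The skier story maps onto a few lines of Python. This is a generic one-dimensional gradient descent with momentum on an illustrative function, not anything specific to the network in this article; the learning rate and momentum values here are arbitrary:

```python
def descend(grad, w, lr=0.1, momentum=0.8, steps=200):
    """Skier analogy: lr is the skier's speed, momentum is the jetpack.
    Update rule: dw = lr * (-grad(w)) + momentum * dw_prev."""
    dw = 0.0
    for _ in range(steps):
        dw = lr * (-grad(w)) + momentum * dw  # step downhill, keep some inertia
        w = w + dw
    return w

# f(w) = (w - 2)^2 has its global minimum at w = 2; its gradient is 2 * (w - 2).
print(descend(lambda w: 2 * (w - 2), w=10.0))
```

With too large a learning rate the same loop would oscillate or diverge instead of settling at the minimum, which is exactly the runaway skier described above.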

What is the backpropagation method?

So we have reached the point where we can discuss how to make your neural network learn correctly and give the right answers. Backpropagation is illustrated very well by this GIF:


Now let's look at each stage in detail. Recall that in the previous article we computed the output of the neural network. This is called the forward pass: we pass information sequentially from the input neurons to the output neurons. After that, we compute the error and, based on it, perform a backward pass, which consists in sequentially changing the weights of the neural network, starting with the weights of the output neuron. The weights change in the direction that gives the best result. In my calculations I will use the delta method, since it is the simplest and clearest, and I will use the stochastic method of updating the weights (more on that later).

Now let's continue where we left off the calculations in the previous article.

The task from the previous article


Data: I1 = 1, I2 = 0, w1 = 0.45, w2 = 0.78, w3 = -0.12, w4 = 0.13, w5 = 1.5, w6 = -2.3.

H1input = 1 * 0.45 + 0 * -0.12 = 0.45
H1output = sigmoid (0.45) = 0.61

H2input = 1 * 0.78 + 0 * 0.13 = 0.78
H2output = sigmoid (0.78) = 0.69

O1input = 0.61 * 1.5 + 0.69 * -2.3 = -0.672
O1output = sigmoid (-0.672) = 0.33

O1ideal = 1 (0xor1 = 1)

Error = ((1-0.33) ^ 2) /1=0.45

The result is 0.33, the error is 45%.
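For those who want to check these numbers, here is the forward pass as a Python sketch. Note that the article rounds each intermediate value to two decimal places (and takes 0.338 as 0.33), so the script's final digits differ slightly from the article's:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

I1, I2 = 1, 0
w1, w2, w3, w4, w5, w6 = 0.45, 0.78, -0.12, 0.13, 1.5, -2.3

H1out = round(sigmoid(I1 * w1 + I2 * w3), 2)  # sigmoid(0.45) ≈ 0.61
H2out = round(sigmoid(I1 * w2 + I2 * w4), 2)  # sigmoid(0.78) ≈ 0.69
O1out = sigmoid(H1out * w5 + H2out * w6)      # sigmoid(-0.672) ≈ 0.34
error = (1 - O1out) ** 2 / 1                  # MSE vs the ideal answer 1; ≈ 0.44
                                              # (the article uses 0.33, giving 0.45)
print(round(O1out, 2), round(error, 2))
```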


Since we have already computed the network's output and its error, we can proceed straight to backpropagation. As I mentioned earlier, the algorithm always starts with the output neuron. In that case, let's calculate its δ (delta) value using formula 1.

Since the output neuron has no outgoing synapses, we use the first formula (δ output); for hidden neurons we will use the second formula (δ hidden). It is all quite simple here: we compute the difference between the desired and the obtained result and multiply it by the derivative of the activation function taken at the neuron's input value. Before starting the calculations, I want to draw your attention to the derivative. First, as is probably clear by now, only activation functions that can be differentiated may be used with backpropagation. Second, to avoid unnecessary calculations, the derivative formula can be replaced with a friendlier and simpler formula of the form:


Thus, our calculations for point O1 will look like this.

Solution

O1output = 0.33
O1ideal = 1
Error = 0.45

δO1 = (1 - 0.33) * ((1 - 0.33) * 0.33) = 0.148


That completes the calculations for neuron O1. Remember that after computing a neuron's delta we must immediately update the weights of all of that neuron's outgoing synapses. Since O1 has none, we move to the hidden-layer neurons and do the same, except that the delta is now computed by the second formula: we multiply the derivative of the activation function at the input value by the sum of the products of all outgoing weights and the delta of the neuron each synapse connects to. But why are the formulas different? The point is that the whole essence of backpropagation is to propagate the error of the output neurons back to all the weights of the network. The error can be computed only at the output level, as we have already done, and we computed the delta, which already contains this error. Consequently, instead of the error we will now pass the delta from neuron to neuron. In that case, let's find the delta for H1:

Solution

H1output = 0.61
w5 = 1.5
δO1 = 0.148

δH1 = ((1 - 0.61) * 0.61) * (1.5 * 0.148) = 0.053
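The two delta formulas can be written down in a few lines of Python (the function names are mine, not the article's):

```python
def sigmoid_derivative(out):
    # For a sigmoid, f'(in) can be expressed through the output: out * (1 - out)
    return out * (1 - out)

def delta_output(ideal, out):
    # δ output: (desired - obtained) * f'(in)
    return (ideal - out) * sigmoid_derivative(out)

def delta_hidden(out, outgoing):
    # δ hidden: f'(in) * sum of (outgoing weight * delta at the far end)
    # outgoing is a list of (weight, delta) pairs
    return sigmoid_derivative(out) * sum(w * d for w, d in outgoing)

d_O1 = delta_output(1, 0.33)              # ≈ 0.148, as in the article
d_H1 = delta_hidden(0.61, [(1.5, d_O1)])  # ≈ 0.053, as in the article
print(round(d_O1, 3), round(d_H1, 3))
```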


Now we need to find the gradient for each outgoing synapse. This is where people usually insert a three-story fraction with a pile of derivatives and other mathematical hell, but that is the beauty of the delta method: in the end, the formula for finding the gradient looks like this:

Here point A is the point at the beginning of the synapse, and point B is at the end of the synapse. Thus, we can calculate the gradient w5 as follows:

Solution

H1output = 0.61
δO1 = 0.148

GRADw5 = 0.61 * 0.148 = 0.09


Now we have all the data needed to update weight w5, and we will do so using the backpropagation update rule, which calculates the amount by which a given weight must be changed. It looks like this:


I strongly recommend that you do not ignore the second part of the expression: use the momentum term, as it will let you avoid problems with local minima.

Here we see two constants that we already discussed when we looked at the gradient descent algorithm: ε (epsilon), the learning rate, and α (alpha), the momentum. Translating the formula into words: the change in a synapse's weight equals the learning rate multiplied by the gradient of that weight, plus the momentum multiplied by the previous change of that weight (which is 0 on the first iteration). In that case, let's calculate the change of weight w5 and update its value by adding Δw5 to it.

Solution

ε = 0.7
α = 0.3
w5 = 1.5
GRADw5 = 0.09
Δw5 (i-1) = 0

Δw5 = 0.7 * 0.09 + 0 * 0.3 = 0.063
w5 = w5 + Δw5 = 1.563


Thus, after applying the algorithm, our weight increased by 0.063. Now I suggest you do the same for H2.

Solution

H2output = 0.69
w6 = -2.3
δO1 = 0.148
ε = 0.7
α = 0.3
Δw6 (i-1) = 0

δH2 = ((1 - 0.69) * 0.69) * (-2.3 * 0.148) = -0.07

GRADw6 = 0.69 * 0.148 = 0.1

Δw6 = 0.7 * 0.1 + 0 * 0.3 = 0.07

w6 = w6 + Δw6 = -2.23 ≈ -2.2


And of course, do not forget about I1 and I2: they also have synapses whose weights we need to update. However, remember that we do not need to find deltas for the input neurons, since they have no input synapses.

Solution

w1 = 0.45, Δw1 (i-1) = 0
w2 = 0.78, Δw2 (i-1) = 0
w3 = -0.12, Δw3 (i-1) = 0
w4 = 0.13, Δw4 (i-1) = 0
δH1 = 0.053
δH2 = -0.07
ε = 0.7
α = 0.3

GRADw1 = 1 * 0.053 = 0.053
GRADw2 = 1 * -0.07 = -0.07
GRADw3 = 0 * 0.053 = 0
GRADw4 = 0 * -0.07 = 0

Δw1 = 0.7 * 0.053 + 0 * 0.3 = 0.04
Δw2 = 0.7 * -0.07 + 0 * 0.3 = -0.05
Δw3 = 0.7 * 0 + 0 * 0.3 = 0
Δw4 = 0.7 * 0 + 0 * 0.3 = 0

w1 = w1 + Δw1 = 0.5
w2 = w2 + Δw2 = 0.73
w3 = w3 + Δw3 = -0.12
w4 = w4 + Δw4 = 0.13
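The input-weight updates above fit in a short Python sketch. One subtlety: computed without intermediate rounding, Δw1 = 0.7 * 0.053 = 0.0371 and w1 becomes about 0.49, whereas the article rounds Δw1 to 0.04 and carries w1 = 0.5 onward.

```python
def update(lr, momentum, grad, prev_dw):
    """Δw = lr * GRAD + momentum * Δw_prev (Δw_prev is 0 on the first iteration)."""
    return lr * grad + momentum * prev_dw

lr, momentum = 0.7, 0.3
d_H1, d_H2 = 0.053, -0.07

# Gradient of each input weight = output of the neuron at the start of the
# synapse (here the raw inputs I1 = 1, I2 = 0) times the delta at its end.
grads = {"w1": 1 * d_H1, "w2": 1 * d_H2, "w3": 0 * d_H1, "w4": 0 * d_H2}
weights = {"w1": 0.45, "w2": 0.78, "w3": -0.12, "w4": 0.13}

for name, w in weights.items():
    weights[name] = w + update(lr, momentum, grads[name], prev_dw=0)

print({k: round(v, 2) for k, v in weights.items()})
```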


Now let's make sure we did everything correctly by computing the output of the neural network once again, this time with the updated weights.

Solution

I1 = 1
I2 = 0
w1 = 0.5
w2 = 0.73
w3 = -0.12
w4 = 0.13
w5 = 1.563
w6 = -2.2

H1input = 1 * 0.5 + 0 * -0.12 = 0.5
H1output = sigmoid (0.5) = 0.62

H2input = 1 * 0.73 + 0 * 0.13 = 0.73
H2output = sigmoid (0.73) = 0.675

O1input = 0.62 * 1.563 + 0.675 * -2.2 = -0.51
O1output = sigmoid (-0.51) = 0.37

O1ideal = 1 (0xor1 = 1)

Error = ((1-0.37) ^ 2) /1=0.39

The result is 0.37, the error is 39%.


As we can see, after one iteration of backpropagation we managed to reduce the error by 0.06 (6 percentage points). Now you must repeat this over and over until the error is small enough.
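To see the error keep shrinking, the whole forward-and-backward pass can be looped. This sketch is my own condensation of the article's procedure (stochastic updates, momentum omitted for brevity, weights stored in a flat list in the order w1..w6):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def iteration(I1, I2, w, ideal, lr=0.7):
    # Forward pass
    h1 = sigmoid(I1 * w[0] + I2 * w[2])
    h2 = sigmoid(I1 * w[1] + I2 * w[3])
    o1 = sigmoid(h1 * w[4] + h2 * w[5])
    # Backward pass: deltas, then stochastic weight updates
    d_o1 = (ideal - o1) * o1 * (1 - o1)
    d_h1 = h1 * (1 - h1) * w[4] * d_o1
    d_h2 = h2 * (1 - h2) * w[5] * d_o1
    w[4] += lr * h1 * d_o1
    w[5] += lr * h2 * d_o1
    w[0] += lr * I1 * d_h1
    w[1] += lr * I1 * d_h2
    w[2] += lr * I2 * d_h1
    w[3] += lr * I2 * d_h2
    return (ideal - o1) ** 2

w = [0.45, 0.78, -0.12, 0.13, 1.5, -2.3]
errors = [iteration(1, 0, w, ideal=1) for _ in range(50)]
print(round(errors[0], 2), round(errors[-1], 4))  # the error shrinks steadily
```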

What else do you need to know about the learning process?

A neural network can be trained with or without a teacher (supervised or unsupervised learning).

Supervised learning is the type of training used for problems such as regression and classification (and the one we used in the example above). In other words, here you act as the teacher and the neural network as the student: you provide input data and the desired result, and the student, looking at the input data, understands that it must strive toward the result you provided.

Unsupervised learning is less common. Here there is no teacher, so the network does not receive desired results, or receives very few of them. This type of training mostly suits neural networks whose task is to group data by certain parameters. Say you feed in 10,000 articles from Habr; having analyzed them all, the neural network can sort them into categories based on, for example, frequently occurring words: articles that mention programming languages would go into one group, and articles with words like Photoshop and design into another.

There is also an interesting method called reinforcement learning. It deserves an article of its own, but I will briefly describe its essence. The method applies when, based on the results the network produces, we can give it a score. For example, if we want to teach a neural network to play PAC-MAN, we reward it every time it scores a lot of points. In other words, we grant the network the right to find any way of achieving the goal, as long as it gives a good result. In this way the network begins to understand what we want from it and tries to find the best way to achieve that goal without the teacher constantly supplying data.

Also, training can be performed using three methods: stochastic method, batch method and mini-batch method. There are so many articles and studies out there on which method is the best, and no one can come up with a general answer. I am a supporter of the stochastic method, but I do not deny the fact that each method has its own pros and cons.

Briefly about each method:

The stochastic method (sometimes also called online) works on the following principle: as soon as Δw is found, immediately update the corresponding weight.

The batch method works differently: we sum up the Δw of all weights at the current iteration and only then update all the weights using this sum. One of the most important advantages of this approach is a significant saving of computation time, though accuracy can suffer badly in this case.

The mini-batch method is the golden mean and tries to combine the advantages of both. Here the principle is as follows: we distribute the weights into groups at random and change them by the sum of the Δw of all the weights in a given group.
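The difference between the three methods is simply how Δw values are grouped before being applied. A schematic Python sketch (with made-up Δw numbers):

```python
def sgd_updates(deltas, batch_size=1):
    """Group Δw values and apply one update per group:
    batch_size=1           -> stochastic (update after every Δw),
    batch_size=len(deltas) -> full batch (one summed update),
    anything in between    -> mini-batch."""
    updates = []
    for i in range(0, len(deltas), batch_size):
        group = deltas[i:i + batch_size]
        updates.append(sum(group))  # one weight change per group
    return updates

deltas = [0.1, -0.2, 0.3, 0.4]
print(sgd_updates(deltas, batch_size=1))  # four separate updates
print(sgd_updates(deltas, batch_size=4))  # one summed update
print(sgd_updates(deltas, batch_size=2))  # two mini-batch updates
```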

What are hyperparameters?

Hyperparameters are values ​​that need to be picked manually and often by trial and error. Among these values ​​are:
  • Momentum and learning rate
  • Number of hidden layers
  • Number of neurons in each layer
  • Presence or absence of bias neurons
In other types of neural networks there are additional hyperparameters, but we will not discuss them here. Choosing the right hyperparameters is very important and directly affects the convergence of your neural network. Deciding whether to use bias neurons is easy enough. The number of hidden layers and the neurons in them can be found by trial and error based on one simple rule: the more neurons, the more accurate the result, and the exponentially longer training will take. However, remember that you should not build a network with 1000 neurons to solve simple problems. Choosing the momentum and the learning rate is a little harder. These hyperparameters vary depending on the task and the architecture of the network. For example, for solving XOR the learning rate can lie in the range 0.3 - 0.7, while for a neural network that analyzes and predicts stock prices, a learning rate above 0.00001 leads to poor convergence. Do not focus on hyperparameters now or try to thoroughly understand how to choose them. That will come with experience; for now I advise you simply to experiment and to look online for examples of solving a particular problem.

What is convergence?



Convergence indicates whether the architecture of the neural network is right and whether the hyperparameters were chosen correctly for the task at hand. Say our program logs the network's error at each iteration. If the error decreases with each iteration, we are on the right track and the network is converging. If the error jumps up and down, or freezes at a certain level, the network is not converging. In 99% of cases this is solved by changing the hyperparameters; the remaining 1% means there is an error in the network architecture. Convergence can also be affected by overfitting the network.

What is overfitting?

Overfitting, as the name suggests, is the state of a neural network that is oversaturated with data. The problem arises when the network is trained for too long on the same data. The network then begins not to learn from the data but to memorize and "cram" it. Accordingly, when you feed new data into such a network, noise may appear in the output, affecting the accuracy of the result. For example, if we show the network different photos of apples (only red ones) and say that this is an apple, then when the network sees a yellow or green apple it will not be able to recognize it as an apple, since it has memorized that all apples must be red. Conversely, when the network sees something red and apple-shaped, a peach for instance, it will say it is an apple. This is noise. On a graph, noise looks like this.


You can see that the graph of the function fluctuates strongly from point to point; these points are the output (result) of our neural network. Ideally this graph should be less wavy and straighter. To avoid overfitting, do not train the network for a long time on the same or very similar data. Overfitting can also be caused by a large number of input parameters or by an overly complex architecture. So when you notice errors (noise) in the output after the training stage, use one of the regularization methods; in most cases, though, this will not be necessary.

Conclusion

I hope this article was able to clarify the key points of such a difficult subject as Neural Networks. However, I believe that no matter how many articles you read, it is impossible to master such a complex topic without practice. Therefore, if you are just at the beginning of the journey and want to study this promising and developing industry, then I advise you to start practicing by writing your own neural network, and only after that resort to using various frameworks and libraries. Also, if you are interested in my method of presenting information and you want me to write articles on other topics related to Machine Learning, then vote in the poll below for the topic that interests you. See you in future articles :)

Accordingly, the neural network takes two numbers as input and must give another number at the output - the answer. Now about the neural networks themselves.

What is a neural network?


A neural network is a sequence of neurons connected by synapses. The structure of the neural network came to the programming world straight from biology. Thanks to this structure, the machine acquires the ability to analyze and even memorize various information. Neural networks are also able not only to analyze incoming information, but also to reproduce it from their memory. For those interested, be sure to watch two videos from TED Talks: Video 1 and Video 2. In other words, a neural network is a machine interpretation of the human brain, which contains millions of neurons that transmit information in the form of electrical impulses.

What are neural networks?

For now, we will consider examples using the most basic type of neural network: the feedforward network (hereinafter FFN). In subsequent articles I will introduce more concepts and tell you about recurrent neural networks. An FFN, as the name implies, is a network of sequentially connected layers of neurons, in which information always flows in only one direction.

What are neural networks for?

Neural networks are used to solve complex problems that require analytical calculations similar to those of the human brain. The most common uses of neural networks are:

Classification - sorting data by parameters. For example, a set of people is given as input, and we must decide which of them to grant a loan and which not. This work can be done by a neural network analyzing information such as age, solvency, credit history, and so on.

Prediction - the ability to predict the next step. For example, the rise or fall of a stock based on the situation in the stock market.

Recognition - currently the most widespread use of neural networks. It is used by Google when you search by photo, or by phone cameras when they detect the position of your face and highlight it, and much more.

Now, to understand how neural networks work, let's take a look at their components and parameters.

What is a neuron?

A neuron is a computational unit that receives information, performs simple calculations on it, and passes it further. Neurons are divided into three main types: input (blue), hidden (red), and output (green). There is also a bias neuron and a context neuron, which we will talk about in the next article. When a neural network consists of a large number of neurons, the term layer is introduced. Accordingly, there is an input layer that receives information, n hidden layers (usually no more than 3) that process it, and an output layer that outputs the result. Each neuron has 2 main parameters: input data and output data. For an input neuron, input = output. For the rest, the summed information from all neurons of the previous layer enters the input field, after which it is normalized using the activation function (for now, just think of it as f(x)) and placed in the output field.


Important to remember: neurons operate with numbers in the range [0,1] or [-1,1]. But how, you ask, do we then handle numbers that fall outside this range? At this stage, the simplest answer is to divide 1 by that number. This process is called normalization and is very often used in neural networks. More on this later.
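As a sketch of the normalization idea, here is the common min-max variant that squeezes a list of values into [0, 1]. The helper name is made up for illustration:

```python
# Min-max scaling: map the smallest value to 0 and the largest to 1,
# so every value ends up inside the range neurons work with.

def minmax_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(minmax_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```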

What is a synapse?


A synapse is a connection between two neurons. A synapse has one parameter: its weight. Thanks to the weight, the input information changes as it is transmitted from one neuron to another. Say there are 3 neurons that transmit information to the next one. Then we have 3 weights, one for each of these neurons. The information from the neuron with the greater weight will be dominant in the next neuron (think of color mixing). In fact, the set of weights of a neural network, the weight matrix, is a kind of brain of the entire system. It is thanks to these weights that the input information is processed and turned into a result.

Important to remember that during the initialization of the neural network, the weights are randomly assigned.

How does a neural network work?


This example depicts part of a neural network, where the letters I denote input neurons, the letter H a hidden neuron, and w the weights. The formula shows that the input information is the sum of all inputs multiplied by their corresponding weights. Let us give the inputs 1 and 0, and let w1 = 0.4 and w2 = 0.7. The input of neuron H1 will then be: 1 * 0.4 + 0 * 0.7 = 0.4. Now that we have the input, we can get the output by plugging the input into the activation function (more on that later). Now that we have the output, we pass it on. We repeat this for all layers until we reach the output neuron. Running such a network for the first time, we will see that the answer is far from correct, because the network is not trained. We will train it to improve its results. But before we learn how to do this, let's introduce a few terms and properties of a neural network.
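The computation for H1 above can be checked in a few lines. The sigmoid used here as the activation function is introduced in the next section:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs and weights from the example above
i1, i2 = 1.0, 0.0
w1, w2 = 0.4, 0.7

h1_input = i1 * w1 + i2 * w2   # 1 * 0.4 + 0 * 0.7 = 0.4
h1_output = sigmoid(h1_input)  # normalized by the activation function

print(round(h1_input, 4), round(h1_output, 4))  # 0.4 0.5987
```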

Activation function

An activation function is a way of normalizing input data (we talked about this earlier). That is, if you have a large number at the input, passing it through the activation function gives you an output in the range you need. There are many activation functions, so we will consider the most basic ones: linear, sigmoid (logistic), and hyperbolic tangent. Their main difference is the range of values.

Linear function


This function is almost never used, except when you need to test a neural network or pass a value through without transformation.

Sigmoid


This is the most common activation function; its range of values is (0,1). Most examples on the web use it, and it is also sometimes called the logistic function. Accordingly, if your case involves negative values (for example, stocks can go not only up but also down), you will need a function that captures negative values as well.

Hyperbolic tangent


It makes sense to use the hyperbolic tangent only when your values can be both negative and positive, since the range of the function is (-1,1). It is inappropriate to use this function with only positive values, as that will significantly worsen the results of your neural network.
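For reference, the three activation functions discussed above can be written out directly, with their ranges noted in the comments:

```python
import math

def linear(x):
    return x                           # range: (-inf, +inf)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # range: (0, 1)

def tanh(x):
    return math.tanh(x)                # range: (-1, 1)

for f in (linear, sigmoid, tanh):
    print(f.__name__, round(f(0.0), 3), round(f(2.0), 3))
```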

Training set

A training set is a sequence of data that the neural network operates on. In our case of exclusive or (XOR), we have only 4 different outcomes, so we will have 4 training sets: 0 xor 0 = 0, 0 xor 1 = 1, 1 xor 0 = 1, 1 xor 1 = 0.

Iteration

This is a kind of counter that increases every time the neural network goes through one training set. In other words, this is the total number of training sets traversed by the neural network.

Epoch

When initializing the neural network, this value is set to 0 and has a manually set ceiling. The larger the epoch, the better the network is trained and, accordingly, its result. The epoch increases each time we go through the entire set of training sets, in our case, 4 sets or 4 iterations.
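The iteration and epoch bookkeeping described above can be sketched like this; the actual training step is elided, and the names are illustrative:

```python
# One epoch = one full pass over all training sets.
# The iteration counter grows by one for every single set processed.

training_sets = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

iterations = 0
for epoch in range(3):
    for inputs, target in training_sets:
        # ... forward pass and weight update would happen here ...
        iterations += 1

print(iterations)  # 3 epochs * 4 sets = 12 iterations
```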


Important: do not confuse iteration with epoch, and understand the order in which they are incremented. First the iteration counter increases, and then the epoch, not vice versa. In other words, you cannot first train the neural network on one set only, then on another, and so on. You must train on each set once per epoch. This way you avoid errors in the calculations.

Error

Error is a percentage value representing the discrepancy between the expected and the received answer. The error is computed every epoch and should decline. If it does not, you are doing something wrong. The error can be calculated in different ways, but we will consider only three main ones: Mean Squared Error (hereinafter MSE), Root MSE, and Arctan. There is no restriction on use, as there is with activation functions, and you are free to choose whichever method gives you the best results. Just keep in mind that each method counts errors differently. With Arctan the error will almost always be larger, since it works on the principle: the greater the difference, the greater the error. Root MSE will have the smallest error; therefore MSE is used most often, since it keeps a balance in the error calculation.

MSE = (1/n) * Σ (i - a)²

Root MSE = sqrt( (1/n) * Σ (i - a)² )

Arctan = (1/n) * Σ arctan²(i - a)

where i is the ideal answer, a is the received answer, and n is the number of training sets.
The principle of calculating the error is the same in all cases. For each set, we take the difference between the ideal answer and the received one. Then we either square it or take the squared arctangent of this difference, after which we divide the resulting number by the number of sets.
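Under these definitions, the three error measures can be sketched as follows; the function names are made up for illustration:

```python
import math

def mse(ideal, actual):
    """Mean Squared Error over a list of (ideal, actual) answers."""
    return sum((i - a) ** 2 for i, a in zip(ideal, actual)) / len(ideal)

def root_mse(ideal, actual):
    return math.sqrt(mse(ideal, actual))

def arctan_error(ideal, actual):
    return sum(math.atan(i - a) ** 2 for i, a in zip(ideal, actual)) / len(ideal)

ideal, actual = [1.0, 0.0], [0.9, 0.2]
print(round(mse(ideal, actual), 4))  # (0.1^2 + 0.2^2) / 2 = 0.025
```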

Task

Now, to test yourself, calculate the result of a given neural network using a sigmoid and its error using MSE.

Data: I1 = 1, I2 = 0, w1 = 0.45, w2 = 0.78, w3 = -0.12, w4 = 0.13, w5 = 1.5, w6 = -2.3.

Neural networks

Diagram of a simple neural network. Green marks the input elements, yellow the output element

Artificial neural networks (ANN) are mathematical models, as well as their software or hardware implementations, built on the principle of the organization and functioning of biological neural networks: the networks of nerve cells of a living organism. The concept arose in the study of the processes occurring in the brain during thinking, and in attempts to model these processes. The first such model of the brain was the perceptron. Subsequently, these models came to be used for practical purposes, usually in forecasting problems.

Neural networks are not programmed in the usual sense of the word; they are trained. Learning is one of the main advantages of neural networks over traditional algorithms. Technically, training consists in finding the coefficients of the connections between neurons. During training, the neural network is able to identify complex dependencies between inputs and outputs, and to generalize. This means that, if training is successful, the network can return the correct result for data that was absent from the training set.

Chronology

Known Applications

Clustering

Clustering is understood as dividing the set of input signals into classes, when neither the number nor the characteristics of the classes are known in advance. After training, such a network is able to determine which class an input signal belongs to. The network can also signal that the input signal does not belong to any of the selected classes, which is a sign of new data absent from the training sample. Thus, such a network can identify new, previously unknown classes of signals. The correspondence between the classes allocated by the network and the classes existing in the problem domain is established by a person. Clustering is performed, for example, by Kohonen neural networks.

Experimental selection of network characteristics

After choosing the general structure, you need to experimentally select the network parameters. For networks like the perceptron, these will be the number of layers, the number of blocks in the hidden layers (for Ward networks), the presence or absence of bypass connections, and the transfer functions of the neurons. When choosing the number of layers and neurons in them, one should proceed from the fact that the network's ability to generalize is higher, the greater the total number of connections between neurons. On the other hand, the number of connections is bounded from above by the number of records in the training data.

Experimental selection of training parameters

After choosing a specific topology, it is necessary to select the parameters for training the neural network. This stage is especially important for supervised learning networks. The correct choice of parameters determines not only how quickly the network's answers converge to the correct ones. For example, choosing a low learning rate increases the convergence time, but sometimes avoids network paralysis. Increasing the momentum can either increase or decrease the convergence time, depending on the shape of the error surface. Given this contradictory influence of the parameters, their values should be chosen experimentally, guided by a training-completion criterion (for example, minimizing the error or limiting the training time).

The actual training of the network

During training, the network scans the training sample in a certain order. The scan order can be sequential, random, and so on. Some unsupervised networks, for example Hopfield networks, scan the sample only once. Others, such as Kohonen networks and supervised networks, scan the sample multiple times, with one full pass through the sample called a learning epoch. In supervised learning, the set of initial data is divided into two parts: the training sample itself and the test data; the principle of separation can be arbitrary. The training data is fed to the network for training, and the test data is used to calculate the network's error (the test data is never used to train the network). Thus, if the error on the test data decreases, the network is indeed generalizing. If the error on the training data continues to decrease while the error on the test data increases, the network has stopped generalizing and is simply "memorizing" the training data. This phenomenon is called network overfitting. In such cases, training is usually stopped. During training, other problems may appear, such as paralysis or the network getting stuck in a local minimum of the error surface. It is impossible to predict in advance which problem will appear, or to give unambiguous recommendations for solving them.
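The stopping rule described above (stop once the error on the test data starts rising) can be sketched as follows; the function name and patience threshold are illustrative assumptions, not a standard API:

```python
# Hypothetical early-stopping check: stop when test error has risen
# for `patience` consecutive epochs, a sign the network is overfitting.

def should_stop(test_errors, patience=2):
    if len(test_errors) <= patience:
        return False
    recent = test_errors[-(patience + 1):]
    return all(recent[k + 1] > recent[k] for k in range(patience))

test_hist = [0.50, 0.40, 0.35, 0.37, 0.41]
print(should_stop(test_hist))  # test error rose twice in a row -> True
```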

Checking the adequacy of training

Even in the case of successful, at first glance, learning, the network does not always learn exactly what the creator wanted from it. There is a known case when the network was trained to recognize images of tanks from photographs, but later it turned out that all the tanks were photographed against the same background. As a result, the network "learned" to recognize this type of terrain, instead of "learn" to recognize tanks. Thus, the network “understands” not what is required of it, but what is easiest to generalize.

Classification by type of input information

  • Analog neural networks (use information in the form of real numbers);
  • Binary neural networks (operate with information presented in binary form).

Classification by the nature of learning

  • Supervised learning - the output space of neural network solutions is known;
  • Unsupervised learning - a neural network forms the output space of solutions based on input only. Such networks are called self-organizing;
  • Reinforcement Learning is a system for assigning penalties and rewards from the environment.

Classification by the nature of synapse tuning

Signal transmission time classification

In a number of neural networks, the activating function may depend not only on the weighting coefficients of the connections w ij, but also on the time of transmission of a pulse (signal) through the communication channels τ ij. Therefore, in general, the activating (transmitting) function of the connection c ij from element u i to element u j depends on the signal delayed by the transmission time, that is, on u i (t − τ ij). A network is called synchronous if the transmission time τ ij of each connection is either zero or a fixed constant τ. A network is called asynchronous if the transmission time τ ij for each connection between elements u i and u j is its own, but also constant.

Classification by the nature of ties

Feedforward networks

All connections are directed strictly from input neurons to output neurons. Examples of such networks are Rosenblatt's perceptron, the multilayer perceptron, and Ward networks.

Recurrent neural networks

The signal from the output neurons or hidden-layer neurons is partially fed back to the inputs of the input-layer neurons (feedback). The recurrent Hopfield network "filters" input data by returning to a steady state, and thus solves problems of data compression and associative memory. Bidirectional networks are a special case of recurrent networks. In such networks there are connections between layers both from the input layer to the output layer and in the opposite direction. The classic example is Kosko's neural network.

Radial basis functions

Artificial neural networks that use radial basis functions as activation functions (such networks are abbreviated as RBF networks). The general form of a radial basis function is

φ(x) = φ(‖x − c‖), for example, the Gaussian φ(r) = exp(−r² / (2σ²)),

where x is the vector of the neuron's input signals, c is the center, σ is the width of the function window, and φ(y) is a decreasing function (most often equal to zero outside a certain segment).

The radial-baseline network is characterized by three features:

1. The only hidden layer

2. Only neurons of the hidden layer have a nonlinear activation function

3. The synaptic weights of the connections of the input and hidden layers are equal to one
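A minimal sketch of a Gaussian radial basis function, in line with the general form above; the function name is illustrative:

```python
import math

def gaussian_rbf(x, center, sigma):
    """phi(||x - c||) with a Gaussian profile: exp(-r^2 / (2 * sigma^2))."""
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-r2 / (2.0 * sigma ** 2))

print(round(gaussian_rbf([0.0, 0.0], [0.0, 0.0], 1.0), 3))  # at the center -> 1.0
print(gaussian_rbf([3.0, 4.0], [0.0, 0.0], 1.0))            # far away -> near 0
```

A hidden RBF neuron therefore responds strongly only to inputs near its center, which is what gives RBF networks their local character.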

About the training procedure - see the literature

Self-organizing maps

Such networks are competitive neural networks with unsupervised learning that perform visualization and clustering tasks. They are a method of projecting a multidimensional space into a space of lower dimension (most often two-dimensional) and are also used to solve problems of modeling, forecasting, and so on. They are one of the versions of Kohonen's neural networks. Self-organizing Kohonen maps are used primarily for visualization and initial ("exploratory") data analysis.

The signal in a Kohonen network goes to all neurons at once; the weights of the corresponding synapses are interpreted as the coordinates of the node's position, and the output signal is formed according to the "winner takes all" principle: the neuron closest (in the sense of synapse weights) to the input object has a nonzero output signal. In the process of training, the synapse weights are adjusted so that the lattice nodes settle in the places of local data condensation, that is, they describe the cluster structure of the data cloud; on the other hand, the connections between neurons correspond to the neighborhood relations between the corresponding clusters in feature space.

It is convenient to consider such maps as two-dimensional grids of nodes located in multidimensional space. Initially, a self-organizing map is a grid of nodes, connected by links. Kohonen considered two options for connecting nodes - in a rectangular and a hexagonal grid - the difference is that in a rectangular grid, each node is connected to 4 neighboring nodes, and in a hexagonal grid - to 6 nearest nodes. For two such grids, the process of constructing the Kohonen network differs only in the place where the neighbors closest to the given node are moved.

The initial embedding of the grid into the data space is arbitrary. The author's SOM_PAK package offers options for a random initial placement of nodes in space and for placing the nodes in a plane. After that, the nodes begin to move in space according to the following algorithm:

  1. A data point x is selected at random.
  2. The map node closest to x is found (the BMU: Best Matching Unit).
  3. This node is moved a given step towards x. However, it does not move alone; it carries along a certain number of nearest nodes from a neighborhood on the map. Of all the moving nodes, the central node (the one closest to the data point) is displaced most strongly, and the rest are displaced less, the farther they are from the BMU. There are two stages in map tuning: the ordering stage and the fine-tuning stage. At the first stage, large neighborhood values are chosen and the movement of the nodes is collective; as a result, the map "straightens out" and roughly reflects the data structure. At the fine-tuning stage, the neighborhood radius is 1-2 and the individual positions of the nodes are adjusted. In addition, the displacement value decays uniformly over time, that is, it is large at the beginning of each training stage and close to zero at the end.
  4. The algorithm is repeated for a certain number of epochs (clearly, the number of steps can vary greatly depending on the task).
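A single step of the algorithm above can be sketched as follows. This is a deliberately bare-bones version that moves only the BMU itself; a full implementation would also move its grid neighbours with a decaying radius and learning rate, as described in step 3:

```python
# Hypothetical one-step SOM sketch: pick a point, find the BMU,
# pull the BMU a step toward the point.

def bmu_index(nodes, x):
    """Index of the node closest to data point x (the Best Matching Unit)."""
    return min(range(len(nodes)),
               key=lambda k: sum((n - xi) ** 2 for n, xi in zip(nodes[k], x)))

def som_step(nodes, x, lr=0.5):
    k = bmu_index(nodes, x)
    nodes[k] = [n + lr * (xi - n) for n, xi in zip(nodes[k], x)]
    return k

nodes = [[0.0, 0.0], [1.0, 1.0]]
k = som_step(nodes, [0.9, 0.9])
print(k, nodes[k])  # node 1 moved halfway toward the data point
```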

Known network types

  • Hamming network;
  • Neocognitron;
  • Chaotic neural network;
  • Counterpropagation network;
  • Radial basis functions network (RBF-network);
  • Generalized Regression Network;
  • Probabilistic network;
  • Siamese neural network;
  • Adaptive resonance networks.

Differences from machines with von Neumann architecture

A long period of evolution has given the human brain many qualities that are absent in machines with von Neumann architecture:

  • Massive parallelism;
  • Distributed representation of information and computation;
  • Ability to learn and generalize;
  • Adaptability;
  • Contextual information processing;
  • Fault tolerance;
  • Low power consumption.

Neural networks - universal approximators

Neural networks are universal approximating devices and can simulate any continuous automaton with any accuracy. A generalized approximation theorem is proved: using linear operations and a cascade connection, it is possible to obtain a device from an arbitrary nonlinear element that calculates any continuous function with any predetermined accuracy. This means that the nonlinear characteristic of a neuron can be arbitrary: from sigmoidal to an arbitrary wave packet or wavelet, sine or polynomial. The complexity of a particular network may depend on the choice of a nonlinear function, but with any nonlinearity, the network remains a universal approximator and, with the correct choice of structure, can approximate the functioning of any continuous automaton as accurately as desired.

Application examples

Forecasting financial time series

Input data: the stock price over a year. The task is to determine tomorrow's price. The following transformation is carried out: the prices for today, yesterday, and the day before yesterday are lined up in a row. The next row is shifted by one day, and so on. On the resulting set, a network with 3 inputs and one output is trained, where the output is the price on a given date and the inputs are the prices 1, 2, and 3 days earlier. To the trained network we feed the prices for today, yesterday, and the day before yesterday, and receive the answer for tomorrow. It is easy to see that in this case the network simply models the dependence of one parameter on the three previous ones. If it is desirable to take into account some other parameter (for example, a general index for the industry), it must be added as an input (and included in the examples), the network retrained, and new results obtained. For the most accurate training, it is worth using the backpropagation method, as it is the most predictable and simplest to implement.
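The sliding-window transformation described above can be sketched as follows; the function name is illustrative:

```python
# Turn a daily price series into (3 previous days -> next day) training pairs.

def make_windows(prices, window=3):
    pairs = []
    for t in range(window, len(prices)):
        pairs.append((prices[t - window:t], prices[t]))  # (inputs, target)
    return pairs

prices = [100, 102, 101, 105, 107]
for inputs, target in make_windows(prices):
    print(inputs, "->", target)
# [100, 102, 101] -> 105
# [102, 101, 105] -> 107
```

Adding another input parameter, as the text suggests, would simply mean appending its lagged values to each `inputs` list before training.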

Psychodiagnostics

A series of works by M. G. Dorrer et al. is devoted to the possibility of developing psychological intuition in neural network expert systems. The results obtained provide an approach to uncovering the mechanism of intuition in neural networks, which manifests itself when they solve psychodiagnostic tasks. A non-standard (for computerized methods) intuitive approach to psychodiagnostics was created, which consists in dispensing with the construction of a model of the described reality. It allows psychodiagnostic techniques to be shortened and simplified.

Chemoinformatics

Neural networks are widely used in chemical and biochemical research. Currently, neural networks are one of the most widespread chemoinformatics methods for finding quantitative structure-property relationships, thanks to which they are actively used both for predicting the physicochemical properties and biological activity of chemical compounds, and for the directed design of chemical compounds and materials with predetermined properties, including the development of new drugs.

Notes (edit)

  1. McCulloch W.S., Pitts W., A logical calculus of the ideas immanent in nervous activity // In: "Automata", ed. C.E. Shannon and J. McCarthy. Moscow: Foreign Literature Publishing House, 1956. pp. 363-384. (Translation of the 1943 English article)
  2. Pattern Recognition and Adaptive Control. BERNARD WIDROW
  3. Widrow B., Stearns S., Adaptive Signal Processing. Moscow: Radio i Svyaz, 1989. 440 p.
  4. Werbos P. J., Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA, 1974.
  5. Galushkin A.I. Synthesis of multilayer pattern recognition systems. Moscow: Energiya, 1974.
  6. Rumelhart D.E., Hinton G.E., Williams R.J., Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing, vol. 1, pp. 318-362. Cambridge, MA, MIT Press. 1986.
  7. Bartsev S.I., Okhonin V.A. Adaptive information processing networks. Krasnoyarsk: Institute of Physics, Siberian Branch of the USSR Academy of Sciences, 1986. Preprint N 59B. - 20 p.
  8. BaseGroup Labs - Practical Application of Neural Networks in Classification Problems
  9. This type of encoding is sometimes referred to as a "1 of N" code.
  10. Open systems - an introduction to neural networks
  11. Mirkes E. M., Logically transparent neural networks and the production of explicit knowledge from data, in: Neuroinformatics / A. N. Gorban, V. L. Dunin-Barkovsky, A. N. Kirdin et al. Novosibirsk: Nauka, Siberian Enterprise RAS, 1998. 296 p. ISBN 5020314102
  12. Mention of this story in Popular Mechanics magazine
  13. http://www.intuit.ru/department/expert/neuro/10/ INTUIT.ru - Recurrent networks as associative storage devices
  14. Kohonen, T. (1989/1997/2001), Self-Organizing Maps, Berlin - New York: Springer-Verlag. First edition 1989, second edition 1997, third extended edition 2001, ISBN 0-387-51387-6, ISBN 3-540-67921-9
  15. A. Yu. Zinoviev Visualization of multidimensional data. - Krasnoyarsk: Ed. Krasnoyarsk State Technical University, 2000. - 180 p.
  16. Gorban A. N., Generalized approximation theorem and computational capabilities of neural networks, Siberian Journal of Computational Mathematics, 1998. Vol. 1, No. 1. P. 12-24.
  17. Gorban A.N., Rossiyev D.A., Dorrer M.G., MultiNeuron - Neural Networks Simulator For Medical, Physiological, and Psychological Applications, Wcnn'95, Washington, DC: World Congress on Neural Networks 1995 International Neural Network Society Annual Meeting: Renaissance Hotel, Washington, DC, USA, July 17-21, 1995.
  18. Dorrer M.G., Psychological intuition of artificial neural networks, Diss. ... 1998.
  19. Baskin I.I., Palyulin V.A., Zefirov N.S., Application of artificial neural networks in chemical and biochemical research, Vestn. Moscow Un-Ta. Ser. 2. Chemistry. 1999. Vol. 40. No. 5.
  20. Galberstam N.M., Baskin I.I., Palyulin V.A., Zefirov N.S. Neural networks as a method for finding structure-property dependencies of organic compounds // Advances in Chemistry. 2003. Vol. 72, No. 7. pp. 706-727.
  21. Baskin I.I., Palyulin V.A., Zefirov N.S. Multilayer perceptrons in the study of structure-property relationships for organic compounds // Russian Chemical Journal (Journal of the Russian Chemical Society named after D.I. Mendeleev). 2006. Vol. 50. pp. 86-96.

Links

  • Artificial Neural Network for PHP 5.x - Serious project for the development of neural networks in the PHP 5.X programming language
  • Forum on Neural Networks and Genetic Algorithms
  • Mirkes E. M., Neuroinformatics: Textbook. a manual for students with programs for laboratory work.
  • Step-by-step examples of the implementation of the most famous types of neural networks in MATLAB, Neural Network Toolbox
  • A selection of materials on neural networks and predictive analysis
  • An example of the application of neural networks in predicting stock prices
Accordingly, the neural network takes two numbers as input and must give another number at the output - the answer. Now about the neural networks themselves.

What is a neural network?


A neural network is a sequence of neurons connected by synapses. The structure of the neural network came to the programming world straight from biology. Thanks to this structure, the machine acquires the ability to analyze and even memorize various information. Neural networks are also able not only to analyze incoming information, but also to reproduce it from their memory. For those interested, be sure to watch 2 videos from TED Talks: Video 1 , Video 2). In other words, a neural network is a machine interpretation of the human brain, which contains millions of neurons that transmit information in the form of electrical impulses.

What are neural networks?

For now, we will consider examples on the most basic type of neural networks - this is a feedforward network (hereinafter referred to as FNS). Also in subsequent articles I will introduce more concepts and tell you about recurrent neural networks. DSS, as the name implies, is a network with a serial connection of neural layers, in which information always goes in only one direction.

What are neural networks for?

Neural networks are used to solve complex problems that require analytical calculations similar to those of the human brain. The most common uses for neural networks are:

Classification- data distribution by parameters. For example, a set of people is given at the entrance and it is necessary to decide which of them to give a loan, and who does not. This work can be done by a neural network analyzing information such as age, solvency, credit history, etc.

Prediction- the ability to predict the next step. For example, the rise or fall of a stock based on the situation in the stock market.

Recognition- currently, the most widespread use of neural networks. Used on Google when you are looking for a photo or in phone cameras when it detects the position of your face and makes it stand out and much more.

Now, to understand how neural networks work, let's take a look at its components and their parameters.

What is a neuron?


A neuron is a computational unit that receives information, performs simple calculations on it and transmits it further. They are divided into three main types: entrance (blue), hidden (red), and exit (green). There is also a bias neuron and a context neuron, which we will talk about in the next article. In the case when a neural network consists of a large number of neurons, the term layer is introduced. Accordingly, there is an input layer that receives information, n hidden layers (usually no more than 3) that process it and an output layer that outputs the result. Each of the neurons has 2 main parameters: input data and output data. In the case of an input neuron: input = output. In the rest, the total information of all neurons from the previous layer gets into the input field, after which it is normalized using the activation function (for now, just represent it f (x)) and gets into the output field.


It is important to remember that neurons operate with numbers in the range [0,1] or [-1,1]. But what, you ask, about numbers that fall outside this range? At this stage, the simplest answer is to divide 1 by that number. This process is called normalization and is used very often in neural networks. More on this later.
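One standard way to squeeze arbitrary numbers into [0,1] is min-max scaling. Note this is a common technique I am naming for illustration, not necessarily the exact method the author has in mind:

```python
def min_max_normalize(values):
    # Scales a list of arbitrary numbers into the range [0, 1]:
    # the smallest value maps to 0, the largest to 1
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([2, 5, 10]))  # [0.0, 0.375, 1.0]
```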

What is a synapse?


A synapse is a connection between two neurons. A synapse has one parameter: weight. Thanks to it, the input information changes as it is transmitted from one neuron to another. Say there are 3 neurons that transmit information to the next one. Then we have 3 weights, one for each of these neurons, and the information from the neuron with the greater weight will be dominant in the next neuron (think of color mixing, for example). In fact, the set of weights of a neural network, the weight matrix, is a kind of brain of the entire system. It is thanks to these weights that the input information is processed and converted into a result.

It is important to remember that during initialization of the neural network, the weights are assigned randomly.

How does a neural network work?


In this example, part of a neural network is depicted, where the letters I denote input neurons, the letter H denotes a hidden neuron, and the letter w denotes weights. The formula shows that the input information is the sum of all input data multiplied by the corresponding weights. Now let's give 1 and 0 as input, and let w1 = 0.4 and w2 = 0.7. The input of neuron H1 will then be: 1 * 0.4 + 0 * 0.7 = 0.4. Now that we have the input, we can get the output by plugging the input into the activation function (more on that later). Once we have the output, we pass it on, and we repeat this for all layers until we reach the output neuron. Running such a network for the first time, we will see that the answer is far from correct, because the network is not trained. We will train it to improve its results. But before we learn how to do this, let's introduce a few terms and properties of a neural network.
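The worked example above can be extended into a full forward pass through the layers. This is a minimal sketch: the sigmoid activation and the output-layer weight of 0.6 are my own invented choices for illustration, while the inputs 1 and 0 and the weights 0.4 and 0.7 come from the text:

```python
import math

def sigmoid(x):
    # Activation function: squashes any value into (0, 1)
    return 1 / (1 + math.exp(-x))

def forward(inputs, layers):
    # layers: one weight matrix per layer of connections;
    # each row holds the weights feeding one neuron of the next layer
    signal = inputs
    for weights in layers:
        signal = [sigmoid(sum(s * w for s, w in zip(signal, row)))
                  for row in weights]
    return signal

# From the text: inputs 1 and 0, w1 = 0.4, w2 = 0.7,
# so H1 receives 1*0.4 + 0*0.7 = 0.4 before the activation function.
hidden_layer = [[0.4, 0.7]]   # one hidden neuron, H1
output_layer = [[0.6]]        # one output neuron (this weight is invented)
result = forward([1, 0], [hidden_layer, output_layer])
print(result)
```

Since the weights are essentially random, the first answer is far from correct, which is exactly why training is needed.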

Activation function

An activation function is a way to normalize input data (we talked about this earlier). That is, if you have a large number at the input, after passing it through the activation function you will get an output in the range you need. There are a lot of activation functions, so we will consider the most basic ones: Linear, Sigmoid (Logistic), and Hyperbolic tangent. Their main difference is their range of values.

Linear function


This function is almost never used, except when you need to test a neural network or transfer a value without transformations.

Sigmoid


This is the most common activation function, and its range of values is (0,1). It is the one most examples on the web use, and it is also sometimes called the logistic function. Accordingly, if in your case there are negative values (for example, stocks can go not only up but also down), then you will need a function that captures negative values as well.

Hyperbolic tangent


It makes sense to use the hyperbolic tangent only when your values can be both negative and positive, since the range of the function is [-1,1]. It is inappropriate to use this function with only positive values, as that will significantly worsen the results of your neural network.
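The three functions discussed above can be written out directly; the function names here are my own labels, but the formulas are the standard ones:

```python
import math

def linear(x):
    # Passes the value through without transformation
    return x

def sigmoid(x):
    # Logistic function; outputs fall in (0, 1)
    return 1 / (1 + math.exp(-x))

def hyperbolic_tangent(x):
    # Outputs fall in [-1, 1], so negative values are preserved
    return math.tanh(x)

for f in (linear, sigmoid, hyperbolic_tangent):
    print(f.__name__, f(-2), f(0), f(2))
```

Running this shows the key difference: for a negative input, sigmoid still returns a positive number, while the hyperbolic tangent returns a negative one.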

Training set

A training set is a sequence of data that the neural network operates on. In our case of exclusive or (xor), there are only 4 different outcomes, that is, we will have 4 training sets: 0 xor 0 = 0, 0 xor 1 = 1, 1 xor 0 = 1, 1 xor 1 = 0.
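The four XOR training sets can be written as plain input/output pairs; the list-of-tuples layout is just one convenient representation:

```python
# Each entry: (inputs, expected output) for exclusive or
training_sets = [
    ([0, 0], 0),
    ([0, 1], 1),
    ([1, 0], 1),
    ([1, 1], 0),
]

for inputs, expected in training_sets:
    print(inputs, "->", expected)
```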

Iteration

This is a kind of counter that increases every time the neural network goes through one training set. In other words, this is the total number of training sets traversed by the neural network.

Epoch

When the neural network is initialized, this value is set to 0 and has a manually set ceiling. The more epochs pass, the better the network is trained and, accordingly, the better its result. The epoch increases each time we go through the entire collection of training sets, in our case, 4 sets, or 4 iterations.


It is important not to confuse iteration with epoch, and to understand the order in which they increment. First the iteration increases n times, and only then does the epoch increase, not vice versa. In other words, you cannot first train a neural network only on one set, then on another, and so on. You need to train on each set once per epoch. This way, you can avoid errors in calculations.
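The increment order described above, iterations first and then the epoch, can be sketched as a pair of counters; the epoch ceiling of 3 here is an arbitrary choice for illustration:

```python
training_sets = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

iteration = 0
max_epochs = 3  # the manually set ceiling

for epoch in range(max_epochs):
    # Every training set is visited once per epoch...
    for inputs, expected in training_sets:
        iteration += 1  # ...and the iteration counter grows with each set
    # Only after all n sets does the epoch advance

print(iteration)  # 3 epochs * 4 sets = 12 iterations
```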

Error

Error is a percentage value representing the discrepancy between the expected and the received answer. The error is computed every epoch and should decline. If it does not, you are doing something wrong. The error can be calculated in different ways, but we will consider only three main ones: Mean Squared Error (hereinafter MSE), Root MSE, and Arctan. There is no restriction on which to use, as there is with the activation function, and you are free to choose whichever method gives you the best results. Just take into account that each method counts errors differently. With Arctan, the error will almost always be larger, since it works on the principle: the greater the difference, the greater the error. Root MSE will have the smallest error; therefore MSE, which keeps a balance in error calculation, is used most often.
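Assuming `expected` and `actual` are equal-length lists of answers, the three error measures might be sketched like this (exact formulas vary between texts; these follow common definitions, with the Arctan variant taken as the mean of squared arctangents of the differences):

```python
import math

def mse(expected, actual):
    # Mean Squared Error: average of the squared differences
    n = len(expected)
    return sum((e - a) ** 2 for e, a in zip(expected, actual)) / n

def root_mse(expected, actual):
    # Square root of MSE
    return math.sqrt(mse(expected, actual))

def arctan_error(expected, actual):
    # Average of the squared arctangents of the differences
    n = len(expected)
    return sum(math.atan(e - a) ** 2 for e, a in zip(expected, actual)) / n

e, a = [1, 0, 1, 0], [0.8, 0.2, 0.9, 0.1]
print(mse(e, a), root_mse(e, a), arctan_error(e, a))
```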
