Feed-forward neural networks are the basic form of artificial neural networks. They are parametrized mathematical functions y = f(x; θ) that map an input x to an output y by feeding it through a number of non-linear transformations: f(x) = (f_n ∘ ··· ∘ f_1)(x). Each component f_k, called a network layer, consists of a simple linear transformation of the previous component's output, followed by a non-linear function: f_k = σ_k(θ_k f_{k−1}). The non-linear functions σ_k are typically sigmoid functions or ReLUs, as discussed below, and the θ_k are matrices of numbers, called the model's weights. During the training phase, the network is fed training data and tasked with making predictions at the output layer that match the known labels, with each layer producing an expedient representation of its input. The network has to learn how to best use these intermediate representations to form a complex hierarchical representation of the data, ending in correct predictions at the output layer. Training a neural network means changing its weights so as to optimize the outputs of the network. This is done by applying an optimization algorithm, called gradient descent, to a function that measures the correctness of the outputs, called a cost function or loss function.
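
The layer composition f_k = σ_k(θ_k f_{k−1}) can be made concrete with a short sketch. The following is a minimal NumPy illustration of a forward pass only (the gradient-descent training step is not shown); the network size, the random initialization, the added bias terms, and the use of ReLU at every layer are assumptions chosen purely for illustration.

import numpy as np

def relu(z):
    # ReLU non-linearity: zero for negative inputs, identity otherwise
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Feed-forward pass: repeatedly apply f_k = sigma_k(theta_k f_{k-1} + b_k)."""
    h = x
    for theta, b in zip(weights, biases):
        h = relu(theta @ h + b)  # one layer: linear map followed by a non-linearity
    return h

# Hypothetical 3-4-2 network with randomly initialized weights (illustration only).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x = np.array([0.5, -1.0, 2.0])
print(forward(x, weights, biases))

In practice the output layer would use a task-specific non-linearity (for example softmax for classification, discussed below) rather than ReLU.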

Activation Functions and Forward Propagation

Activation functions are the non-linear functions that determine the output of each node and thus control the behavior of the network. The transmission of the input through the layers is known as forward propagation: at each layer, the inputs, weights, and biases are combined and passed through the activation function, and the result becomes the input to the next layer. Common activation functions include the following (a short code sketch follows the list):
a) Linear: a function in which the dependent variable has a direct, proportional relationship with the independent variable;
b) Sigmoid: a logistic function that converts an independent variable of essentially unbounded range into a value between 0 and 1, interpretable as a simple probability. Its characteristic is to compress large values and outliers without removing them;
c) Tanh: the hyperbolic tangent function; it transforms the independent variable to a range between −1 and 1. Its advantage is that it handles negative numbers naturally;
d) Softmax: a generalization of the logistic function that converts a vector of real values into a probability distribution over multiple classes, and so supports multiple decision boundaries. This function is often used in the output layer of classification problems;
e) Rectified Linear Unit (ReLU): a function that outputs zero for negative inputs and activates a node only when its input is above zero (or, with a bias term, above a certain threshold). Above this threshold, the output is a linear (identity) function of the input.
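
For concreteness, the functions listed above can be written down directly. The following is a minimal NumPy sketch; the sample input z is arbitrary and chosen only for illustration.

import numpy as np

def linear(z):
    # a) Linear: output is directly proportional to the input
    return z

def sigmoid(z):
    # b) Sigmoid: squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # c) Tanh: squashes any real input into the range (-1, 1)
    return np.tanh(z)

def softmax(z):
    # d) Softmax: turns a vector of scores into a probability distribution;
    #    subtracting the maximum is a standard trick for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    # e) ReLU: zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), softmax(z), relu(z))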