As equation 1 shows, the normal equation is far easier to implement than gradient descent. If the matrix to be inverted is singular, we can either reduce the number of features or use SVD to compute an approximation of the inverse (the pseudoinverse).
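
As a minimal sketch of the singular case, the snippet below assumes equation 1 has the usual form theta = (X^T X)^-1 X^T y and replaces the inverse with NumPy's SVD-based pseudoinverse, `np.linalg.pinv`. The data, the `normal_equation` helper name, and the `rcond` cutoff are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def normal_equation(X, y):
    """Least-squares fit via theta = pinv(X^T X) X^T y.

    np.linalg.pinv computes the Moore-Penrose pseudoinverse through an SVD,
    so it still returns a usable answer when X^T X is singular.
    The rcond cutoff (an assumed value) discards near-zero singular values.
    """
    return np.linalg.pinv(X.T @ X, rcond=1e-10) @ X.T @ y

# Hypothetical data: the third column duplicates the second,
# which makes X^T X singular and a plain inverse unusable.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated from y = 1 + 2*x

theta = normal_equation(X, y)
predictions = X @ theta  # reproduces y even though X^T X has no inverse
```

The pseudoinverse picks the smallest-norm solution, so the weight of the duplicated feature is split evenly between the two identical columns. Dropping one of the columns, the other option mentioned above, would give the same predictions.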

The pros and cons of gradient descent versus the normal equation:

**Gradient Descent**

- Need to choose the learning rate alpha
- Needs many iterations
- Works well even when n (number of features) is large

**Normal Equation**

- No need to choose alpha
- No need to iterate
- Need to compute an inverse matrix
- Slow if n (number of features) is very large

The cost of computing the inverse matrix is roughly O(n^3), which becomes unacceptable when n is large (10,000 or more, depending on your machine).
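
The trade-offs listed above can be sketched side by side. The example below is a rough illustration on synthetic, noise-free data: the function names, the `alpha` and `iterations` defaults, and the data itself are all assumptions chosen so that gradient descent converges, not recommended settings.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent on the mean squared error.

    Both alpha and the iteration count have to be chosen by hand.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m  # gradient of the MSE cost
        theta -= alpha * gradient
    return theta

def normal_equation(X, y):
    """Closed-form solution: no alpha and no loop.

    Inverting the n x n matrix X^T X is the roughly O(n^3) step.
    """
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Hypothetical data: 200 samples, a bias column plus 2 features.
rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, size=(m, 2))])
true_theta = np.array([4.0, -2.0, 0.5])
y = X @ true_theta  # noise-free targets

theta_gd = gradient_descent(X, y)
theta_ne = normal_equation(X, y)
```

With only three parameters the matrix inversion is trivial, so the normal equation is the simpler choice here. As n grows, the inversion cost dominates, while each gradient descent step only needs matrix-vector products.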