Numerical-Optimization/index.Rmd at master · indrag49/Numerical-Optimization · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
---
title: "Introduction to Mathematical Optimization"
subtitle: "with Python"
author: "Indranil Ghosh"
date: "`r Sys.Date()`"
knit: "bookdown::render_book"
description: "A book for teaching introductory numerical optimization algorithms with Python"
documentclass: krantz
site: bookdown::bookdown_site
#bibliography: [book.bib, packages.bib]
#biblio-style: apalike
colorlinks: yes
fontsize: 12pt
monofont: "Source Code Pro"
monofontoptions: "Scale=0.7"
link-citations: yes
url: 'https://indrag49.github.io/Numerical-Optimization/'
---


# What is Numerical Optimization?

This chapter gives an introduction to the basics of numerical optimization and will help build the tools required for our in-depth understanding in the later chapters. Some fundamental linear algebra concepts will be touched which will be required for further studies in optimization along with introduction to simple Python codes.

---

## Introduction to Optimization

Let $f(\mathbf{x})$ be a scalar function of a vector of variables $\mathbf{x} = \begin{pmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n$. *Numerical Optimization* is the minimization or maximization of this function $f$ subject to constraints on $\mathbf{x}$. This $f$ is a scalar function of $\mathbf{x}$, also known as the *objective function* and the continuous components $x_i \in \mathbf{x}$ are called the *decision variables*.

The optimization problem is formulated in the following way:


\begin{align}
&\!\min_{\mathbf{x} \in \mathbb{R}^n}        &\qquad& f(\mathbf{x}) \\
&\text{subject to} &      & g_k(\mathbf{x}) \leq 0,\ k=1,2,..., m\\
&                  &      & h_k(\mathbf{x}) = 0,\ k=1,2,..., r\\
&                  &      & m,r < n.(\#eq:1)
\end{align}

Here, $g_k(\mathbf{x})$ and $h_k(\mathbf{x})$ are scalar functions too (like $f(\mathbf{x})$) and are called *constraint functions*. The constraint functions define some specific equations and/or inequalities that $\mathbf{x}$ should satisfy.

## A Solution

```{definition}
A *solution* of $f(\mathbf{x})$ is a point $\mathbf{x^*}$ which denotes the optimum vector that solves equation \@ref(eq:1), corresponding to the optimum value $f(\mathbf{x^*})$.
```

In case of a *minimization* problem, the optimum vector $\mathbf{x^*}$ is referred to as the *global minimizer* of $f$, and $f$ attains the least possible value at $\mathbf{x^*}$. To design an algorithm that finds out the global minimizer for a function is quite difficult, as in most cases we do not have the idea of the overall shape of $f$. Mostly our knowledge is restricted to a local portion of $f$.

```{definition}
A point $\mathbf{x^*}$ is called a *global minimizer* of $f$ if $f(\mathbf{x^*}) \leq f(\mathbf{x}) \forall\ x$.
```

```{definition}
A point $\mathbf{x^*}$ is called a *local minimizer* of $f$ if there is a neighborhood $\mathcal{N}$ of $\mathbf{x^*}$ such that $f(\mathbf{x^*}) \leq f(\mathbf{x}) \forall\ x \in \mathcal{N}$.
```

```{definition}
A point $\mathbf{x^*}$ is called a *strong local minimizer* of $f$ if there is a neighborhood $\mathcal{N}$ of $\mathbf{x^*}$ such that $f(\mathbf{x^*}) < f(\mathbf{x}) \forall\ x \in \mathcal{N}$, with $\mathbf{x} \neq \mathbf{x}^*$.
```

```{definition}
For an objective function $f(\mathbf{x})$ where, $\mathbf{x} \in \mathbb{R}^2$, a point $\mathbf{x}^s=\begin{bmatrix} x_1^s \\ x_2^s \end{bmatrix}$ is called a *saddle point* if $\forall\ \mathbf{x}$, there exists an $\epsilon>0$, such that the following conditions are satisfied:

* $\frac{\partial f}{\partial x_1}(\mathbf{x}) \mid_{(x_1^s, x_2^s)} < \epsilon$,
* $\frac{\partial f}{\partial x_2}(\mathbf{x}) \mid_{(x_1^s, x_2^s)} < \epsilon$, and
* $[\frac{\partial^2 f}{\partial x_1^2}(\mathbf{x}) \frac{\partial^2 f}{\partial x_2^2}(\mathbf{x}) - (\frac{\partial^2f}{\partial x_1 \partial x_2}(\mathbf{x}))^2]\mid_{(x_1^s, x_2^s)} < 0$


generating the following chain of inequalities: $f(\mathbf{x})\mid_{(x_1, x_2^s)} \leq f(\mathbf{x})\mid_{(x_1^s, x_2^s)} \leq f(\mathbf{x})\mid_{(x_1^s, x_2)}$.
```

An example of a saddle point is shown below:

```{python, echo=FALSE, results=FALSE}
import numpy as np

import matplotlib.pyplot as plt
import pylab


X, Y = [], []

for v in range(-10, 12):
    X += [v, ]
    Y += [v, ]

l_X, l_Y = len(X), len(Y)

Z = np.ndarray((l_X,l_Y))

for x in range(0, l_X):
    for y in range(0, l_Y):
        Z[x][y] = X[x]**2 - Y[y]**2

# Set the x axis and y axis limits
plt.xlim([-10,10])
plt.ylim([-10,10])

plt.xlabel('X')
plt.ylabel('Y')

contours=plt.contour(X, Y, Z, 10, cmap=plt.cm.gnuplot)
plt.clabel(contours, inline=1, fontsize=10)
plt.plot([0], [0], 'ro')
plt.title("Saddle point $(0, 0)$ can be seen on the saddle surface given by the equation $f(\mathbf{x}) = x^2 - y ^2$", size = 10)
plt.show()
```

## Maximization
We just defined a *minimization* problem as our optimization task. We could do the same with a *maximization* problem with little tweaks. The problem $\underset{\mathbf{x} \in \mathbb{R}^n}{max} f(\mathbf{x})$ can be formulated as:
\begin{equation}
    \underset{\mathbf{x} \in \mathbb{R}^n}{max} f(\mathbf{x}) = - \underset{\mathbf{x} \in \mathbb{R}^n}{min}\{- f(\mathbf{x})\}
    (\#eq:2)
\end{equation}
We then apply any minimization technique after setting $\hat{f}(\mathbf{x}) = - f(\mathbf{x})$. Further, for the inequality constraints for the maximization problem, given by $g_k(\mathbf{x}) \geq 0$, we set
\begin{equation}
    \hat{g}_k(\mathbf{x})=-g_k(\mathbf{x})
    (\#eq:3)
\end{equation}

The problem thus has become,

\begin{align}
&\!\min_{\mathbf{x} \in \mathbb{R}^n}        &\qquad& \hat{f}(\mathbf{x})\\
&\text{subject to} &      & \hat{g}_k(\mathbf{x}) \leq 0,\ k=1,2,..., m\\
&                  &      & h_k(\mathbf{x}) = 0,\ k=1,2,..., r\\
&                  &      & m,r < n.(\#eq:4)
\end{align}

After the solution $\mathbf{x^*}$ is computed, the maximum value of the problem is given by: $-\hat{f}(\mathbf{x^*})$.

## Feasible Region

```{definition}
A *feasible region* is the set of those points which satisfy all the constraints provided.
```

## Discrete Optimization Problems

```{definition}
The optimization problems whose variables $\mathbf{x}$ take on integer values, and the constraints have the form either $x_i \in \mathcal{Z}$ or $x_i \in \{0, 1\}$ are called *discrete optimization problems*.
```

The above class of problems are also sometimes called *integer programming* problems. The fundamental characteristic of a discrete optimization problem is that, $x_i$ is drawn from a countable set.

## Linear Programming Problems

The class of optimization problems where both the objective function $f(\mathbf{x})$ and the constraints are linear functions of the variable vector $\mathbf{x}$, are called the *linear programming problems*.

A linear programming problem can be formulated in the following way:

\begin{align}
&\!\min_{\mathbf{x} \in \mathbb{R}^n}        &\qquad& f(\mathbf{x})=\mathbf{c}^T\mathbf{x},\\
&\text{subject to} &      & \mathbf{A}\mathbf{x} \leq \mathbf{b},\\
&                  &      & \mathbf{x} \geq \mathbf{0},\\
&                  &      & \mathbf{c} \in \mathbb{R}^n, \mathbf{b} \in \mathbb{R}^m, \mathbf{A}\in \mathbb{R}^{m \times n}. (\#eq:5)
\end{align}

## Stochastic Optimization Problems

```{definition}
The class of optimization problems, where the decision variables $x_i \in \mathbf{x}$ depend on the outcomes of a random phenomenon besides consisting of random objective function and constraints are called *stochastic optimization problems*.
```

Some examples of stochastic optimization methods are: *simulated annealing*, *quantum annealing*, *genetic algorithms*, etc.

## Scaling of Decision Variables

While formulating optimization problems, it must be guaranteed that the scale of the decision variables are approximately of the same order. If this is not taken care of, some optimization algorithms that are sensitive to scaling will perform poorly and will flounder to converge to the solution. Two of the fundamental fields that get disturbed due to poor scaling are computing the optimized step lengths and the numerical gradients. One of the widely accepted best practices is to make the decision variables dimensionless and vary them approximately between 0 and 1. One should always prefer optimization algorithms that are not sensitive to scaling.

## Gradient Vector and Hessian Matrix of the Objective Function

```{definition}
For a differentiable objective function $f(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}$, its *gradient vector* given by $\nabla f(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}^n$, is defined at the point $\mathbf{x}$ in the $n$-dimensional space as the vector of first order partial derivatives:

\begin{equation}
\nabla f(\mathbf{x})= \begin{pmatrix} \frac{\partial f}{\partial x_1}(\mathbf{x})\\
    \vdots \\
    \frac{\partial f}{\partial x_n}(\mathbf{x})
    \end{pmatrix}(\#eq:6)
\end{equation}
```

Now, if $f(\mathbf{x})$ is smooth, the gradient vector $\nabla f(\mathbf{x})$ is always perpendicular to the contours at the point $\mathbf{x}$. The gradient vector is thus in the direction of the maximum increase of $f(\mathbf{x})$. Look at the figure below.

```{python, echo=FALSE, results=FALSE}
from matplotlib.patches import FancyArrowPatch
feature_x = np.arange(-20, 20, 1)
feature_y = np.arange(-20, 20, 1)

x, y = np.meshgrid(feature_x, feature_y)
z = 0.5*(y-x)**2 + 0.5*(2*y+3*x)**2
u = 2*x - y + 1
v = y + x

# Normalize all gradients to focus on the direction not the magnitude
norm = np.linalg.norm(np.array((u, v)), axis=0)
u = u / norm
v = v / norm

fig, ax = plt.subplots(1, 1)
ax.set_aspect(1)
#ax.plot(feature_x, feature_y, c='k')
ax.quiver(x, y, u, v, units='xy', scale=0.5, color='gray')
contours=ax.contour(x, y, z, 10, lw=2, cmap=plt.cm.gnuplot)
plt.clabel(contours,
            fontsize=10)
plt.title("Direction of $\\nabla f(\mathbf{x})$", size=10)
plt.show()
```

```{definition}
For a twice continuously differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, its *Hessian matrix* given by $\mathbf{H}(f(\mathbf{x}))$ is defined at the point $\mathbf{x}$ in the $n \times n$-dimensional space as the matrix of second order partial derivatives:
\begin{equation}
    \mathbf{H} f(\mathbf{x})=\frac{\partial ^2 f}{\partial x_i \partial x_j} = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{x}) & \ldots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(\mathbf{x})\\
    \frac{\partial^2 f}{\partial x_2 \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_2^2}(\mathbf{x}) & \ldots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(\mathbf{x}) \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial^2 f}{\partial x_n \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_n \partial x_2}(\mathbf{x}) & \ldots & \frac{\partial^2 f}{\partial x_n^2}(\mathbf{x})
    \end{pmatrix}(\#eq:7)
\end{equation}
```

One important relation that we will keep in mind is that the *Hessian matrix* is the *Jacobian* of the *gradient vector* of $f(\mathbf{x})$, where the *Jacobian matrix* of a vector-valued function $\mathbf{F}(\mathbf{x})$ is the matrix of all its first order partial derivatives, given by, $\mathbf{JF}(\mathbf{x})= \begin{pmatrix} \frac{\partial \mathbf{F}}{\partial x_1} & \ldots \frac{\partial \mathbf{F}}{\partial x_n} \end{pmatrix}$. The relation is as followed:
\begin{equation}
    \mathbf{H} f(\mathbf{x}) = \mathbf{J}(\nabla f(\mathbf{x})) (\#eq:8)
\end{equation}

Let us consider an example now.

```{example}
Let an objective function be $f(\mathbf{x}) = 2x_1x_2^3+3x_2^2x_3 + x_3^3x_1$. We will find out the gradient vector $\nabla f(\mathbf{x})$ and the Hessian matrix $\mathbf{H} f(\mathbf{x})$ at the point $\mathbf{p} = \begin{pmatrix} 1 & 2 & 3 \end{pmatrix}$. The gradient vector is $\nabla f(\mathbf{x}) = \begin{pmatrix} 2x_2^3+x_3^3 \\ 6x_1x_2^2+6x_2x_3 \\ 3x_2^2+3x_3^2x_1 \end{pmatrix}$. So $\nabla f(\mathbf{x})| \mathbf{p} = \begin{pmatrix} 43 \\ 60 \\ 39 \end{pmatrix}$. The Hessian matrix is therefore given by, $\mathbf{H}f(\mathbf{x}) = \begin{pmatrix} 0 & 6x_2^2 & 3x_3^2 \\ 6x_2^2 & 12x_1x_2+6x_3 & 6x_2 \\ 3x_3^2 & 6x_2 & 6x_3x_1 \end{pmatrix}$ and at point $\mathbf{p}$, $\mathbf{H} f(\mathbf{x})|\mathbf{p} = \begin{pmatrix} 0 & 24 & 27 \\ 24 & 42 & 12 \\ 27 & 12 & 18 \end{pmatrix}$.
```

We will try to work out the same example with Python scripting now. For that we need an extra package called [`autograd`](https://github.com/HIPS/autograd), besides the [`numpy`](https://numpy.org/) package. The `autograd` package is used for automatically differentiating native Python and Numpy code. Fundamentally `autograd` is used in *gradient-based optimization*. First `pip` install the `autograd package`

```{python, eval=FALSE}
pip install autograd
```

Now, after it is downloaded, we type the following in our notebook:

```{python}
import autograd.numpy as au
from autograd import grad, jacobian

p = np.array([1, 2, 3], dtype=float)
def f(x): # Objective function
    return 2*x[0]*x[1]**3+3*x[1]**2*x[2]+x[2]**3*x[0]
grad_f = grad(f) # gradient of the objective function
hessian_f = jacobian(grad_f) # Hessian of the objective function

print("gradient vector:",grad_f(p))
print("Hessian matrix:\n",hessian_f(p))
```

## Directional Derivative of the Objective Function

```{definition}
For a real valued objective function $f(\mathbf{x})$ and a feasible direction $\mathbf{\delta}$, the *directional derivative* of $f(\mathbf{x})$ in the direction $\mathbf{\delta}$ is given by:
\begin{equation}
    \frac{\partial f}{\partial \mathbf{\delta}}(\mathbf{x}) = \lim_{\alpha \to 0} \frac{f(\mathbf{x} + \alpha \mathbf{\delta}) - f(\mathbf{x})}{\alpha} (\#eq:9)
\end{equation}
where $\|\mathbf{\delta}\| = 1$.
```

Now for $\mathbf{x} \in \mathbb{R}^n$, let us consider the differential equation:
\begin{equation}
    df(\mathbf{x}) = \frac{\partial f(\mathbf{x})}{\partial x_1}dx_1 + \ldots + \frac{\partial f(\mathbf{x})}{\partial x_n}dx_n = \nabla^Tf(\mathbf{x})d\mathbf{x} = \langle \nabla f(\mathbf{x}), d\mathbf{x} \rangle (\#eq:10)
\end{equation}

where $\langle .,. \rangle$ denotes the dot product between two matrices and/or vectors. Now let us consider a function $\hat{f}(\mathbf{x}) = f(\hat{\mathbf{x}} + \alpha \mathbf{\delta})$, such that for a point $\mathbf{x}$ passing through the point $\hat{\mathbf{x}}$ on the line through $\hat{\mathbf{x}}$ in the direction $\mathbf{\delta}$ is given by $\mathbf{x}(\alpha) = \hat{\mathbf{x}} + \alpha \mathbf{\delta}$. now, for an infinitesimal change $d\alpha$, we have $d\mathbf{x}=\mathbf{\delta}d\alpha$. Thus, the differential at the point $\mathbf{x}$ in the given direction is $d\hat{f}=\nabla^Tf(\mathbf{x})\delta d\alpha$ So, the directional derivative now can be written as:
\begin{equation}
    \frac{\partial f}{\partial \mathbf{\delta}}(\mathbf{x}) = \frac{d}{d\alpha}f(\mathbf{x}+\alpha\mathbf{\delta})|_{\alpha=0} = \nabla^Tf(\mathbf{x})\mathbf{\delta} (\#eq:11)
\end{equation}

Now,let us look into a simple example:

```{example}
Let an objective function be $f(\mathbf{x}) = 2x_1x_2^3+3x_2^2x_3 + x_3^3x_1$. We will find out the gradient vector $\nabla f(\mathbf{x})$ at the point $\mathbf{p} = \begin{pmatrix} 1 & 2 & 3 \end{pmatrix}$ and then calculate the directional derivative in the direction $\mathbf{\delta}=\begin{pmatrix} \frac{1}{\sqrt{35}} & \frac{3}{\sqrt{35}} & \frac{5}{\sqrt{35}} \end{pmatrix}$. We will use the same `autograd` package to calculate the same using Python.
```

```{python}
p = np.array([1, 2, 3], dtype=float)
delta = np.array([1, 3, 5], dtype=float)/np.sqrt(35)
def f(x):
    return 2*x[0]*x[1]**3+3*x[1]**2*x[2]+x[2]**3*x[0]
grad_f = grad(f)

print("directional derivative:", grad_f(p).dot(delta))
```

We will see that the directional derivative is $\approx 70.655$.

## Positive Definite and Positive Semi-definite Matrices

```{definition}
A real matrix $\mathbf{M}\in \mathbb{R}^{N\times N}$ is a *positive definite* matrix if for any real vector $\mathbf{v} \in \mathbb{R}^n$ other than the null vector, the following is satisfied:
\begin{equation}
    \mathbf{v}^T\mathbf{M}\mathbf{v} > 0 (\#eq:12)
\end{equation}
```

```{definition}
A real matrix $\mathbf{M}\in \mathbb{R}^{N\times N}$ is a *positive semi-definite* matrix if for any real vector $\mathbf{v} \in \mathbb{R}^n$, the following is satisfied:
\begin{equation}
    \mathbf{v}^T\mathbf{M}\mathbf{v} \geq 0 (\#eq:13)
\end{equation}
```

```{theorem}
All the eigenvalues of a positive definite matrix are positive.
```

```{proof}
If $\lambda$ be an eigenvalue (real) of $\mathbf{M}$ and $\mathbf{v}$ be the corresponding eigenvector, then we have the following well known equation:
\begin{equation}
    \mathbf{Mv}=\lambda\mathbf{v} (\#eq:14)
\end{equation}
Now multiplying the equation with $\mathbf{v}^T$ on the left, we get the following:

\begin{align}
\mathbf{v}^T\mathbf{Mv}&=\lambda\mathbf{v}^T\mathbf{v}\\
&=\lambda \|\mathbf{v}\|^2 (\#eq:15)
\end{align}

Now the $L.H.S$ is positive as $\mathbf{M}$ is positive definite and $\|bm{v}\|^2$ is positive too. This implies that the eigenvalue $\lambda$ is positive.
```


The above proof can be extended to positive semi-definite matrix too in which case the eigenvalues are non-negative, i.e, either 0 or positive and we will exploit these properties in our python script to check for positive definiteness or positive semi-definiteness of a given matrix.

```{example}
We use a Python script to compute the eigenvalues and check whether the following matrices are positive definite, positive semi-definite or negative-definite:

* $\begin{pmatrix}2 & -1 & 0 \\ -1 & 2 & -1\\ 0 & -1 & 2 \end{pmatrix}$
* $\begin{pmatrix} -2 & 4\\ 4 & -8 \end{pmatrix}$
* $\begin{pmatrix} -2 & 2\\ 2 & -4 \end{pmatrix}$
```

```{python}
M = np.array(([2, -1, 0], [-1, 2, -1], [0, -1, 2]), dtype=float)
#M = np.array(([-2, 4], [4, -8]), dtype=float)
#M = np.array(([-2, 2], [2, -4]), dtype=float)

eigs = np.linalg.eigvals(M)
print("The eigenvalues of M:", eigs)

if (np.all(eigs>0)):
    print("M is positive definite")
elif (np.all(eigs>=0)):
    print("M is positive semi-definite")
else:
    print("M is negative definite")
```

Running the script for the first matrix tells us that it is positive definite. The Reader is asked to try out the code for the other two matrices.

## What is Convexity?

```{definition}
A set $\mathbf{X} \subset \mathbb{R}^n$ is said to be a *convex set* if $\forall\ \mathbf{x}, \mathbf{y} \in \mathbf{X}$ and $\alpha \in [0, 1]$, the following is satisfied:
\begin{equation}
    (1-\alpha)\mathbf{x} + \alpha \mathbf{y} \in \mathbf{X} (\#eq:16)
\end{equation}
If the above condition is not satisfied, the set is a *non-convex set*.
```

```{definition}
A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is called a *convex function* if for every two points $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and $\alpha \in [0,1]$, the following condition is satisfied:
\begin{equation}

f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{x})+(1-\alpha)f(\mathbf{y}) (\#eq:17)
\end{equation}
```

```{definition}
The function is *strictly convex* if $\leq$ symbol is replaced by $<$. In the similar way, one can define a *concave function* too.
```

```{definition}
A constrained optimization problem is called a *convex programming problem* if the following properties are satisfied:

* the objective function $f(\mathbf{x})$ is convex,
* the equality constraint functions $h_k(\mathbf{x})$ are linear, and
* the inequality constraint functions $g_k(\mathbf{x})$ are concave

```

This concept of convexity is used in practically solving many optimization problems in the real world. Now to test for convexity of $f(\mathbf{x})$ we study the following two theorems:

```{theorem}
If $f(\mathbf{x})$ is a differentiable objective function defined over the convex set $\mathbf{S} \subseteq \mathbb{R}^n$, then $\forall\ \mathbf{y}, \mathbf{z} \in \mathbf{S}$, $f(\mathbf{x})$ is convex over $\mathbf{S}$ if and only if
\begin{equation}
    f(\mathbf{y})+\nabla^Tf(\mathbf{y})(\mathbf{z}-\mathbf{y}) \leq f(\mathbf{z}) (\#eq:18)
\end{equation}
```

```{proof}
$f(\mathbf{x})$ is convex over $\mathbf{S}$ implies that $\forall\ \mathbf{y}, \mathbf{z} \in \mathbf{S}$ and $\forall\ \alpha \in [0,1]$ the following equation is hold:
\begin{equation}
    f(\alpha \mathbf{z} + (1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{y}) + (1-\alpha)f(\mathbf{y}) (\#eq:19)
\end{equation}
implying,
\begin{equation}
    \frac{f(\mathbf{y} + \alpha(\mathbf{z}-\mathbf{y})) - f(\mathbf{y})}{\alpha} \leq f(\mathbf{z}) - f(\mathbf{y}) (\#eq:20)
\end{equation}
Now, the difference inequality can be turned into a differential inequality by taking $\lim_{\alpha \to 0}$:
\begin{equation}
    \frac{df(\mathbf{y})}{d\alpha}\mid_{\mathbf{y}-\mathbf{z}} \leq f(\mathbf{z}) - f(\mathbf{y}) (\#eq:21)
\end{equation}
The $L.H.S$ is the *directional derivative* and thus from \@ref(eq:11) this can be written as:
\begin{equation}
    \frac{df(\mathbf{y})}{d\alpha}\mid_{\mathbf{y}-\mathbf{z}} = \nabla^Tf(\mathbf{x})(\mathbf{z}-\mathbf{y}) (\#eq:22)
\end{equation}
Now from \@ref(eq:17) and \@ref(eq:18), the following inequality can be written: $f(\mathbf{y}) + \nabla^Tf(\mathbf{y})(\mathbf{z}-\mathbf{y}) \leq f(\mathbf{z})$. Now, to work on the other way round, if \@ref(eq:14) is true, then for $\mathbf{x}=\alpha\mathbf{z}+(1-\alpha)\mathbf{y} \in \mathbf{S}$ and $\alpha \in [0,1]$, we will have the following inequalities:
\begin{equation}
    f(\mathbf{x})+\nabla^Tf(\mathbf{x})(\mathbf{z}-\mathbf{x}) \leq f(\mathbf{z}) (\#eq:23)
\end{equation}
and
\begin{equation}
    f(\mathbf{x})+\nabla^Tf(\mathbf{x})(\mathbf{y}-\mathbf{x}) \leq f(\mathbf{y}) (\#eq:24)
\end{equation}
Now (\@ref(eq:23) $\times \alpha) +$ \@ref(eq:24)$\times (1-\alpha)) \implies$
\begin{equation}
    \alpha f(\mathbf{z}) + (1-\alpha)f(\mathbf{x}) \leq \alpha f(\mathbf{z}) + (1-\alpha)f(\mathbf{y}) - f(\mathbf{x}) (\#eq:25)
\end{equation}
now $L.H.S = 0$ since $\alpha(\mathbf{z}-\mathbf{x})+(1-\alpha)(\mathbf{y}-\mathbf{x})=0$. So we get,
$$f(\mathbf{x})=f(\alpha\mathbf{z}+(1-\alpha)\mathbf{y}) \leq \alpha f(\mathbf{z}) + (1-\alpha)f(\mathbf{y}).$$ The converse relation is also satisfied, which completes our proof.
```


```{theorem}
If $f(\mathbf{x})$ is defined over an open convex set $\mathbf{S} \subseteq \mathbb{R}^n$ and if the Hessian matrix $\mathbf{H}f(\mathbf{x})$ is positive definite $\forall\ \mathbf{x} \in \mathbf{S}$, then $f(\mathbf{x})$ is *strictly convex* over the set $\mathbf{S}$.
```

```{proof}
Let $\mathbf{x}, \mathbf{y} \in \mathbf{S}$. Using Taylor expansion, the function $f(\mathbf{y})$ can be written as,
\begin{align}
f(\mathbf{y})&=f(\mathbf{x} + (\mathbf{y} - \mathbf{x}))\\
&=f(\mathbf{x}) + \nabla^Tf(\mathbf{x})(\mathbf{y}-\mathbf{x}) + \frac{1}{2}(\mathbf{y}-\mathbf{x})^T\mathbf{H}f(\mathbf{x} + \alpha(\mathbf{y} - \mathbf{x}))(\mathbf{y} - \mathbf{x}) (\#eq:26)
\end{align}
where $\alpha \in [0,1]$. Now positive definite Hessian matrix implies that $f(\mathbf{y}) > f(\mathbf{x})+\nabla^Tf(\mathbf{x})(\mathbf{y}-\mathbf{x})$. Now from Theorem 1.2 it can be said that $f(\mathbf{x})$ is strictly convex, thus proving the theorem.
```


## Numerical Optimization Algorithms

Optimization Algorithms are iterative techniques that follow the following fundamental steps:

* Initialize with a guess of the decision variables $\mathbf{x}$,
* Iterate through the process of generating a list of improving estimates,
* check whether the terminating conditions are met, and the estimates will be probably stop at the solution point $\mathbf{x}^*$.

The book by Nocedal and Wright [*Nocedal, Jorge, and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.*] states that most of the optimization strategies make use of either the objective function $f(\mathbf{x})$, the constraint functions $g(\mathbf{x})$ and $h(\mathbf{x})$, the first or second derivatives of these said functions, information collected during previous iterations and/or local information gathered at the present point. As Nocedal and Wright mentions, a good optimization algorithm should have the following fundamental properties:

* **Robustness**: For all acceptable initial points chosen, the algorithm should operate well on a broad range of problems, in their particular class.
* **Efficiency**: The time complexity and the space complexity of the algorithm should be practicable
* **Accuracy**: The solution should be as precise as possible, with the caveat that it should not be too much delicate to errors in the data or to numerical rounding and/or truncating errors while it is being executed on a machine.

There might be some trade offs allowed between speed and memory, between speed and robustness, etc.