Meta Learning - An Overview
Recent developments in meta learning have pushed the field onto a completely different trajectory from ordinary, traditional learning algorithms. In the ordinary setting, we have a fixed learning algorithm that performs one specific task. In meta learning, we instead try to improve the learning algorithm itself so that it can perform a range of tasks drawn from the same distribution.
Let's take conventional supervised learning as an example. To start, we need a dataset D consisting of tuples (x, y), where x refers to data points and y refers to labels. Then, we train a predictive model to predict labels: $\hat{y} = f_{\theta}(x)$. Note that the model is parameterized by $\theta$. We train the predictive model by solving:
$\theta^{*} = \arg\min_{\theta} L(D; \theta, w)$
Here, L is a loss function that measures the discrepancy between the targets given in dataset D and the predictions of $f_{\theta}$, and we minimize it by adapting the parameters $\theta$ to obtain the optimal set $\theta^{*}$. The w refers to the assumptions we fixed up front about how learning happens, such as the choice of optimizer, model family, or initialization. This is what meta learning tries to tackle: instead of fixing w by hand, what if we implement an algorithm that learns to choose the best w when a new task is given? That way, we can generalize our learning algorithm across a range of tasks.
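To make this concrete, here is a minimal sketch of the conventional setup in plain NumPy: a toy linear model trained by gradient descent. Everything we fix by hand here (the model family, the loss, the learning rate, the step count) plays the role of w; the dataset and hyperparameters are invented purely for illustration.

```python
import numpy as np

# Toy dataset D of (x, y) tuples with a known linear relation y = 2x + 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0

def predict(theta, X):
    # f_theta(x): a linear model, theta = (slope, intercept).
    return theta[0] * X[:, 0] + theta[1]

def loss(theta, X, y):
    # L(D; theta): mean squared error between predictions and labels.
    return np.mean((predict(theta, X) - y) ** 2)

# Solve theta* = argmin_theta L(D; theta, w) by gradient descent.
# The "w" is everything chosen by hand: model family, loss,
# learning rate, and number of steps.
theta = np.zeros(2)
lr = 0.5
for _ in range(200):
    err = predict(theta, X) - y
    grad = np.array([np.mean(2 * err * X[:, 0]), np.mean(2 * err)])
    theta -= lr * grad

print(theta)  # ≈ [2.0, 1.0]
```

If we changed the learning rate or the model family, we would get a different learner; those hand-picked choices are exactly the w that meta learning proposes to learn instead.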
Like the usual machine learning approach, meta learning has a training step and a testing step. Let's look at each in turn. In the training step, we assume a set of M related tasks sampled from a task distribution p(T). So we now have a collection of tasks, just as we have a collection of data points and labels in the conventional setting. However, each element in the collection consists of a training dataset and a validation dataset, i.e., $D_{source}^{(i)} = (D_{source}^{train}, D_{source}^{val})^{(i)}$ for $i = 1, \dots, M$.
Then, we can define the training step as:
$w^{*} = \arg\min_{w} \sum\limits_{i=1}^{M} L(D_{source}^{(i)}; w)$
So, in the training step, we obtain the optimal w, i.e., $w^{*}$, by training on the source tasks sampled from the distribution p(T).
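This outer loop can be sketched with a simple first-order method in the spirit of Reptile (a first-order relative of MAML), where w is just the base model's initialization, learned across tasks sampled from p(T). The task family here (1-D linear regression with random slopes) and all hyperparameters are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_task():
    # A task from p(T): 1-D linear regression y = a * x with random slope a.
    a = rng.uniform(-2, 2)
    X = rng.uniform(-1, 1, size=20)
    return X, a * X

def inner_train(theta0, X, y, lr=0.1, steps=5):
    # Base learner: a few gradient steps on this task's training set,
    # starting from the meta-learned initialization theta0 (our "w").
    theta = theta0
    for _ in range(steps):
        grad = np.mean(2 * (theta * X - y) * X)
        theta -= lr * grad
    return theta

# Outer loop: Reptile-style update of w = theta0 across sampled tasks.
theta0 = 5.0          # deliberately poor starting initialization
meta_lr = 0.1
for _ in range(1000):
    X, y = sample_task()
    adapted = inner_train(theta0, X, y)
    theta0 += meta_lr * (adapted - theta0)   # move w toward adapted params

print(theta0)  # drifts toward 0.0, the initialization that adapts fastest on average
```

The design choice here is that w is only the initialization; in the general formulation, w could equally be the optimizer, the learning rate, or the model architecture.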
You see, in the conventional approach, the components of w, such as the optimizer and the learning procedure, were always fixed, but in meta learning we learn them during the training step. Once the training is done, we move on to the testing stage. There, we use $w^{*}$ along with a set of unseen tasks to train the base model on those tasks:
$\theta^{*} = \arg\min_{\theta} L(D_{target}^{train\,(i)}; \theta, w^{*})$
So, this is the same as the conventional approach, but with the additional advantage of not having a fixed w: the learned $w^{*}$ lets us handle a variety of tasks drawn from a similar distribution.
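To see the payoff of the testing stage, here is a small hypothetical comparison: adapting a toy linear model to an unseen target task, starting either from an arbitrary fixed initialization or from a meta-learned one. The value 0.0 for $w^{*}$ is an assumption (the best average initialization when task slopes are drawn symmetrically around zero), and all other numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def adapt(theta0, X, y, lr=0.1, steps=5):
    # Base-learner training on an unseen target task, starting from theta0.
    theta = theta0
    for _ in range(steps):
        grad = np.mean(2 * (theta * X - y) * X)
        theta -= lr * grad
    return theta

# An unseen target task: y = 1.5 * x, with only 10 training points.
X = rng.uniform(-1, 1, size=10)
y = 1.5 * X

def mse(theta):
    return np.mean((theta * X - y) ** 2)

# Compare a hand-picked w (a fixed init far from any task) against a
# meta-learned w* (assumed to be 0.0 for this symmetric task family).
theta_fixed = adapt(5.0, X, y)
theta_meta = adapt(0.0, X, y)
print(mse(theta_meta) < mse(theta_fixed))  # True: w* adapts faster
```

With the same small adaptation budget, the meta-learned starting point ends up closer to the task's solution, which is exactly the few-shot advantage the testing-stage objective describes.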
I hope this gave you some idea of meta learning and its background. If you want to look into it further, I suggest the review paper that served as the reference for this article.