Mitigating Catastrophic Forgetting for Zero Shot Cross Lingual Transfer
2025-06-28 16:34:08 - Adil Khan
Project Area of Specialization: Artificial Intelligence

Project Summary

Our external project supervisor, an applied scientist at Amazon AI and a Stanford alumnus, has helped us come up with a method to mitigate catastrophic forgetting, which has become a major pain point in deep learning research.
When training on one task, an artificial neural network learns new weights to perform well on that task. During training on newer tasks, the previously learned weights get overwritten. The ANN model may hence become good at the current task, but at the cost of forgetting the former task. This phenomenon is known as catastrophic forgetting.
Pretrained models such as mBERT (Devlin et al., 2018) need fine-tuning on task-specific labelled data to perform well on downstream tasks (e.g. paraphrase detection, intent classification). For many practical problems, however, labelled data is often only available in high-resource locales such as English. When we fine-tune a pretrained language model on a high-resource locale (say, English sentiment training data), we risk overwriting important multilingual features originally learned during the language modeling (LM) pretraining task, which was trained using 100+ locales.
To mitigate this issue, we apply regularized fine-tuning to the pretrained language model (LM) while training it on the high-resource locales available for the downstream task. We fine-tune the pretrained model such that important weights learned during the original LM task (trained using 100+ locales) do not get drastically updated during the fine-tuning phase. More specifically, regularized fine-tuning mitigates over-fitting to the source locale and hence mitigates catastrophic forgetting of the important multilingual features that were learned during pretraining.
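Regularized fine-tuning of this kind is typically written as the downstream task loss plus a weighted quadratic penalty. A sketch of the objective (the symbols $\Omega_i$, $\lambda$, and $\theta_i^{*}$ are standard notation for this family of methods, not taken verbatim from the proposal):

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{task}}(\theta) \;+\; \lambda \sum_i \Omega_i \,\bigl(\theta_i - \theta_i^{*}\bigr)^2
```

Here $\theta^{*}$ are the frozen pretrained mBERT weights, $\Omega_i$ is the estimated importance of parameter $i$ for the original LM task, and $\lambda$ controls the regularization strength.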
NB: There's a mistake in our supervisor's name. I first wrote the names of both internal and external supervisors (in the initial sign up form) and I am unable to change it now. The supervisor name should be: Professor Dr. Muhammad Ali Ismail.
Project Objectives

To perform regularized fine-tuning of mBERT, such that it does not "forget" important parameters.
The pre-trained model mBERT has been trained on 100+ languages. When we fine-tune it on a downstream task (such as natural language inference) which is usually in English (a high-resource language), we overwrite parameters in mBERT that were important for the semantics of other languages (such as Urdu, Arabic).
Say we wanted to build and deploy a model for Natural Language Inference (NLI) in Urdu. Unfortunately, no NLI training corpus is available in Urdu (or most other low-resource languages). Instead, we can take mBERT, which has learned the semantics of 100+ languages, and fine-tune it on an English NLI dataset in such a way that the model does not "forget" the parameters that were important for other languages. In other words, it does not forget the other languages it had learnt. We can then expect the model to perform well (i.e. achieve high test accuracy) on the small Urdu NLI evaluation set available in XNLI. This methodology is known as regularized fine-tuning.
Project Implementation Method

We will use the regularization technique Memory Aware Synapses (MAS) to compute the importance of each parameter of a neural network, based on how sensitive the predicted output function is to a change in that parameter. The more important a given parameter is, the more sensitive the network's output will be to small perturbations of it.
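To illustrate the MAS importance estimate, here is a minimal NumPy sketch for a single linear layer F(x) = Wx (the function name `mas_importance` and the toy linear model are our own for illustration; MAS defines the importance of a weight as the average absolute gradient of the squared L2 norm of the model's output with respect to that weight):

```python
import numpy as np

def mas_importance(W, X):
    """MAS importance for a linear layer F(x) = W @ x.

    Omega_ij = mean over samples of |d ||F(x)||^2 / d W_ij|.
    For a linear layer this gradient is 2 * outer(W @ x, x).
    """
    omega = np.zeros_like(W)
    for x in X:
        y = W @ x                    # model output for this sample
        grad = 2.0 * np.outer(y, x)  # gradient of squared output norm w.r.t. W
        omega += np.abs(grad)        # accumulate absolute sensitivity
    return omega / len(X)
```

Note that the estimate needs only unlabelled inputs `X`: no labels and no gradients of the original training loss are required, which is what makes MAS cheap to apply to a pretrained model.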
Unlike many past approaches, MAS is very cost-effective: computing parameter importance for previous tasks requires neither re-training the original language model nor access to the original training data.
We will use MAS to compute the parameter importances of the popular pretrained multilingual model mBERT. During fine-tuning, the neural network will be penalized for deviating too much from the important original language model (LM) weights.
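A minimal sketch of such a penalty term (the helper name `mas_penalty` and the argument `lam` are our own; each squared deviation from the pretrained weights is scaled by its MAS importance):

```python
import numpy as np

def mas_penalty(theta, theta_star, omega, lam=1.0):
    """MAS regularizer: lam * sum_i omega_i * (theta_i - theta_star_i)^2.

    theta      -- current (fine-tuned) parameters
    theta_star -- frozen pretrained parameters (e.g. original mBERT weights)
    omega      -- per-parameter importance estimated with MAS
    lam        -- regularization strength
    """
    return lam * np.sum(omega * (theta - theta_star) ** 2)
```

During fine-tuning this term is simply added to the task loss, so weights with large importance are held close to their pretrained values while unimportant weights remain free to adapt to the new task.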
To test our approach, we will perform fine-tuning on the following three datasets:
- Natural Language Inference (XNLI dataset): determine whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”. For example, given the premise “You don't have to stay”, the hypothesis “You can leave” is an entailment.
- Intent classification (MultiATIS++ dataset): automatically associate an 'intent' with an utterance and label its tokens with slots.
- Paraphrase detection (PAWS-X dataset): Given a pair of sentences, determine whether or not they are paraphrases of each other.
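To make the three task formats concrete, here is one illustrative example per task as plain Python dicts (the sentences and labels are our own illustrations, not drawn verbatim from the datasets):

```python
# XNLI-style: 3-way sentence-pair classification
nli_example = {
    "premise": "You don't have to stay.",
    "hypothesis": "You can leave.",
    "label": "entailment",  # entailment / contradiction / neutral
}

# MultiATIS++-style: utterance -> intent (the dataset also labels slots)
intent_example = {
    "utterance": "show me flights from Boston to Denver",
    "intent": "atis_flight",
}

# PAWS-X-style: binary sentence-pair classification
paraphrase_example = {
    "sentence1": "The flights depart from Karachi.",
    "sentence2": "The flights leave from Karachi.",
    "label": 1,  # 1 = paraphrase, 0 = not a paraphrase
}
```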
We will fine-tune using English datasets and report the improvements observed using our approach in a zero-shot setting on different non-English languages.
Much recent work has demonstrated that mBERT performs very well on zero-shot tasks, superseding prior techniques as the baseline for zero-shot cross-lingual transfer learning. By zero-shot, we mean that no parallel text or labeled data from the target language was used during model training, fine-tuning, or hyperparameter search. In this setting, models are trained on labeled (usually English) text and tested on target (non-English) text.
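The zero-shot evaluation protocol just described can be sketched as follows (the function and argument names are our own; `model` stands for any sentence-pair classifier fine-tuned only on source-language data):

```python
def zero_shot_accuracy(model, examples):
    """Accuracy on target-language examples the model never trained on.

    model    -- callable mapping (premise, hypothesis) to a predicted label;
                fine-tuned only on source-language (e.g. English) data
    examples -- list of (premise, hypothesis, gold_label) tuples in the
                target (non-English) language
    """
    correct = sum(
        model(premise, hypothesis) == gold
        for premise, hypothesis, gold in examples
    )
    return correct / len(examples)
```

The key constraint is that `examples` come from a language the model saw only during LM pretraining, never during fine-tuning or hyperparameter search.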
We propose regularized fine-tuning of mBERT on a high-resource language (such as English or Spanish), followed by evaluation on a low-resource language (such as Urdu or Swahili). In our novel approach, we attempt to reduce catastrophic forgetting while fine-tuning for the downstream task.
Benefits of the Project

There are more than 7,000 languages spoken in the world, over 90 of which have more than 10 million native speakers each. Despite this, very few languages have proper linguistic resources for natural language understanding tasks. Although there is growing awareness in the field, as evidenced by the release of datasets such as XNLI, most NLP research still considers only English. While one solution to this issue is to collect annotated data for all languages, this process is both too time-consuming and too expensive to be feasible. We instead aim to train a model for a particular task in a particular high-resource language and apply it to a low-resource language. We believe our strategy will allow the large amount of training data available for English to benefit other languages.
Through this research, we hope to make considerable progress towards General Linguistic Intelligence, which is defined as "the ability to reuse previously acquired knowledge about a language’s lexicon, syntax, semantics, and pragmatic conventions to adapt to new tasks quickly" (Yogatama, 2019). Most chatbots are only available in high-resource languages such as English, Spanish, et cetera. Through General Linguistic Intelligence, we could possibly have popular virtual assistants (Alexa, Siri, etc.) working in low-resource languages (Urdu, Arabic, etc.).
Technical Details of Final Deliverable

The final deliverable will be a model that performs well on low-resource languages (such as Urdu and Tamil) after being trained only on high-resource languages (English, Spanish, etc.). The model will be trained and evaluated on three NLP tasks: Natural Language Inference, Intent Classification, and Paraphrase Detection.
Our findings will be gathered into a research paper which will be submitted to a reputable machine learning conference or journal.
Final Deliverable of the Project: Software System
Core Industry: IT
Other Industries:
Core Technology: Artificial Intelligence (AI)
Other Technologies: NeuroTech
Sustainable Development Goals: Decent Work and Economic Growth, Partnerships to achieve the Goal

Required Resources:

| Item Name | Type | No. of Units | Per Unit Cost (in Rs) | Total (in Rs) |
|---|---|---|---|---|
| Cloud Amazon Instance (p3.8xlarge) | Equipment | 30 | 2100 | 63000 |
| PyCharm Professional Edition | Equipment | 3 | 2300 | 6900 |
| Total (in Rs) | | | | 69900 |