Neural Machine Translation (NMT) is very important in today's world. It allows people who speak different languages to communicate effectively with each other. A good NMT model can efficiently and accurately translate a sentence from one language to another. However, NMT models are very hard to train:

- Parallel corpora datasets are costly to build. Building them requires a lot of manpower and specialised expertise.
- Parallel corpora are simply not available for many low-resource languages.

This post introduces the unsupervised machine translation model developed by Facebook. The goal is to train a general machine translation system without supervision, using only a monolingual corpus for each language. The key ideas:

- Build a common latent space between the two languages/domains (e.g. English and French) and learn to translate by reconstructing in both domains.
- The model has to be able to work with noisy translations (from the source to the target language and vice versa).
- The source and target sentence latent representations are constrained to have the same distribution using an adversarial regularisation term: the model tries to fool a discriminator which is simultaneously trained to identify the language of a given latent representation. This is pretty similar to the working mechanism of a GAN. A sketch of this adversarial term follows the list.
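To make the adversarial regularisation concrete, here is a minimal PyTorch sketch. The discriminator architecture and the 300-dimensional latent size are my assumptions for illustration, not details from the post:

```python
import torch
import torch.nn as nn

LATENT_DIM = 300  # assumed latent size

# Discriminator: guesses which language (0 = l1, 1 = l2) a latent vector came from.
discriminator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.LeakyReLU(),
    nn.Linear(256, 1),  # logit for "this came from l2"
)
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(z_l1, z_l2):
    """Train the discriminator to identify the language of each latent state."""
    logits = torch.cat([discriminator(z_l1.detach()), discriminator(z_l2.detach())])
    labels = torch.cat([torch.zeros(z_l1.size(0), 1), torch.ones(z_l2.size(0), 1)])
    return bce(logits, labels)

def adversarial_loss(z_l1, z_l2):
    """Train the encoder to fool the discriminator: same game, flipped labels."""
    logits = torch.cat([discriminator(z_l1), discriminator(z_l2)])
    labels = torch.cat([torch.ones(z_l1.size(0), 1), torch.zeros(z_l2.size(0), 1)])
    return bce(logits, labels)
```

The two losses are optimised in alternation, as in a GAN: the discriminator gets better at telling the languages apart, and the encoder gets better at producing language-agnostic latent states.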
At a high level, the model is a single encoder and a single decoder shared by both languages:

- Encoder -> encodes source and target sentences into the latent space. The encoder takes in W and generates Z. (There is only one encoder for both languages.)
- Decoder -> decodes from the latent space back into source and target sentences. The decoder takes in Z and a language l to generate words in language l. The decoder is language independent.

A minimal sketch of this shared interface is given below.
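Here is one way the shared encoder and language-conditioned decoder could look in PyTorch. This is a sketch of the interface only; the class names, sizes, and the language-embedding trick are my assumptions, and attention is omitted here (it is sketched after the figure below):

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One encoder for both languages: word sequence W -> latent states Z."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional LSTM, as described in the next paragraph.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens):                  # tokens: (batch, seq)
        w = self.embed(tokens)                  # W: (batch, seq, emb_dim)
        z, _ = self.lstm(w)                     # Z: (batch, seq, 2 * hidden_dim)
        return z

class SharedDecoder(nn.Module):
    """One decoder for both languages; the target language is just an input."""
    def __init__(self, vocab_size, num_languages=2, emb_dim=300, hidden_dim=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lang_embed = nn.Embedding(num_languages, emb_dim)  # id of l1 or l2
        self.lstm = nn.LSTM(emb_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, lang_id):    # lang_id: (batch,)
        e = self.embed(prev_tokens)
        l = self.lang_embed(lang_id).unsqueeze(1).expand_as(e)
        h, _ = self.lstm(torch.cat([e, l], dim=-1))
        return self.out(h)                      # token logits in language lang_id
```

Because the language identifier is an input rather than baked into the weights, the decoder itself stays language independent, which is the property the post highlights.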
The encoder is a bidirectional LSTM which returns a sequence of hidden states. The decoder is an LSTM which takes in the previous hidden state, the current word, and a context vector given by a weighted sum over the encoder states. Input feeding is an approach that feeds the attentional vectors "as inputs to the next time steps to inform the model about past alignment decisions" (Source). As mentioned earlier, both the source and target languages share the same encoder, and the same goes for the decoder; the attention weights are shared as well.
*Figure: sequence-to-sequence model with attention, without input feeding.*
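Input feeding is the part that is easiest to get wrong, so here is a minimal sketch of a single decoder step with Luong-style attention and input feeding. The dimensions and the scoring function are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    """One LSTM decoder step with attention and input feeding (sketch)."""
    def __init__(self, emb_dim=300, hidden_dim=600, enc_dim=600):
        super().__init__()
        # Input feeding: the previous attentional vector enters with the word.
        self.cell = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, enc_dim)               # attention scores
        self.attn_out = nn.Linear(hidden_dim + enc_dim, hidden_dim)

    def forward(self, word_emb, prev_attn, state, enc_states):
        # word_emb: (batch, emb_dim); prev_attn: (batch, hidden_dim)
        # state: (h, c), each (batch, hidden_dim); enc_states: (batch, src, enc_dim)
        h, c = self.cell(torch.cat([word_emb, prev_attn], dim=-1), state)
        # Attention weights over every encoder state.
        scores = torch.bmm(enc_states, self.score(h).unsqueeze(-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)                   # (batch, src)
        # Context vector: the weighted sum over the encoder states.
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        # Attentional vector: used for the output logits AND fed to the next step.
        attn = torch.tanh(self.attn_out(torch.cat([h, context], dim=-1)))
        return attn, (h, c)
```

The attentional vector returned here plays two roles: it is projected to vocabulary logits, and it comes back in as prev_attn at the next time step, which is precisely the "past alignment decisions" signal the quote describes.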
From equation 2, we can see that the loss function calculates the sum of token-level cross-entropy losses between x and x_hat. Take note of the difference between l1 and l2 in equation 2: x_hat is produced by feeding the corrupted y back through the model, x_hat ~ d(e(C(M(x)), l2), l1). In plain terms: sample a proper English sentence x, encode it (l1 encode) and feed it into the model M to generate Spanish. Take the generated Spanish, corrupt it, then encode it (l2 encode) and feed it into M again to generate x_hat. The loss function will try to reduce the difference between x and x_hat.
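To close the loop, here is a sketch of how equation 2 could be computed. The callables e, d, C, and M come from the notation above, but their exact signatures (in particular a decoder that accepts the gold tokens for teacher forcing) are my assumptions:

```python
import torch
import torch.nn.functional as F

def cross_domain_loss(x_tokens, M, e, d, C, l1_id):
    """Equation 2 (sketch): x_hat ~ d(e(C(M(x)), l2), l1).

    x_tokens: gold sentence in language l1, shape (batch, seq).
    M: the translation model from the previous iteration, l1 -> l2.
    e / d: the shared encoder / decoder; C: the corruption (noise) function.
    """
    with torch.no_grad():
        y = M(x_tokens)              # translate x into language l2 (no gradients)
    y_noisy = C(y)                   # corrupt the generated translation
    z = e(y_noisy)                   # l2 encode the corrupted sentence
    logits = d(z, l1_id, x_tokens)   # l1 decode, teacher-forced on the gold x
    # Sum of token-level cross-entropy losses between x and x_hat.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        x_tokens.reshape(-1),
        reduction="sum",
    )
```

Note that M is treated as fixed within the step; only the encoder and decoder receive gradients from this loss, so reconstructing x is what improves the translation path.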