Wenqi Hou (wenqihou), Yun Nie (yunn)
Fig. 5 linear decoder model

3) model with LSTM decoder
Replacing the linear decoder with a recurrent (LSTM-based) decoder proved to be a big improvement (Fig. 6), and adding dropout regularization to all encoding functions made the improvement even greater. Without dropout, our model suffered badly from overfitting: validation loss decreased only slightly at the very beginning and then kept increasing until it was more than double the training loss; likewise, the validation set reached less than half of the training F1 and EM scores, with gaps of almost 30%. After dropout was added, the gaps dropped below 20% and validation loss showed a clear downward trend, later rising only mildly and keeping a relatively moderate distance from the training loss. Overfitting was clearly alleviated, although not eliminated. The final version of this model achieved a 57.8% F1 score on the validation set and 49.1% on the leaderboard test set. A minimal sketch of such a decoder is given after the figures below.

Fig. 6 RNN decoder model
Fig. 7 RNN decoder model with dropout
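To make the decoder change concrete, the following is a minimal sketch of an LSTM decoder with dropout applied to the encoded passage. It is written in PyTorch purely for illustration; the report does not name its framework, and the layer names, dropout rate, and dropout placement here are assumptions rather than the authors' exact architecture.

import torch
import torch.nn as nn

class LSTMDecoder(nn.Module):
    """Hypothetical LSTM decoder: scores each passage position as answer start/end."""
    def __init__(self, hidden_size=200, dropout=0.2):  # state size 200 per the report; dropout rate assumed
        super().__init__()
        self.dropout = nn.Dropout(dropout)             # regularizes the encodings fed into the decoder
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.start_proj = nn.Linear(hidden_size, 1)    # start-position logits
        self.end_proj = nn.Linear(hidden_size, 1)      # end-position logits

    def forward(self, encodings):                      # encodings: (batch, passage_len, hidden_size)
        out, _ = self.lstm(self.dropout(encodings))    # recurrent pass replaces the single linear layer
        return self.start_proj(out).squeeze(-1), self.end_proj(out).squeeze(-1)

A linear decoder would apply the two projections directly to the encodings; the recurrent pass lets each position's score depend on its neighbors, which is the change credited with the improvement above.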
4) parameter tuning
Learning rate: For all implementations we started from a learning rate of 0.01; however, training rarely survived it because of exploding gradients, and even gradient clipping did not fully help. A more promising alternative was 0.001 with exponential decay at a rate of 0.8 (Fig. 7); this let the training loss decrease more smoothly, without bouncing widely around a local minimum or diverging. A sketch of this schedule is given after Table 1.
State size: the number of hidden units in the LSTM networks. We used the same state size for all LSTMs and found that a size of 200 gave much better performance than 100, at the cost of slower learning and longer running time.
Output size: 300, as discussed before.
Batch size: the default size is 10; we increased it to 40 to fully utilize the GPU, which saved much of the running time per epoch.
Model performances after 10 epochs are summarized in Table 1:

Table 1 model performances

                                training                     validation
decoder    dropout rate     loss    F1 (%)   EM (%)      loss    F1 (%)   EM (%)
linear     -                5.84    15.8     15          8.66    7.8      2
LSTM
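The learning-rate schedule described above can be sketched as follows. Applying the decay once per epoch is an assumption; the report does not state the decay step.

# Schedule from the tuning section: start at 0.001, decay exponentially by 0.8.
base_lr, decay_rate = 0.001, 0.8

def learning_rate(epoch):
    # decay applied once per epoch (assumed step size)
    return base_lr * decay_rate ** epoch

for epoch in range(10):
    print(f"epoch {epoch}: lr = {learning_rate(epoch):.6f}")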
5) running time analysis
The number of parameters in the model has a major influence on the training time per epoch. When the hidden state size increases from 100 to 200, the total number of parameters for the same model grows from about 900,000 to about 3,000,000, which doubles the training time per epoch. A larger batch size leads to faster training and answering, since the batch size determines the degree of parallelism of the model; we used a larger batch size to fully utilize the GPU, although batches that are too large cause out-of-memory errors. Dropout and learning rate also affect running time, though not as significantly as batch size and parameter count. Dropout is used to prevent overfitting, so it makes the model learn more slowly but also makes what it learns more universally applicable. The learning rate decides how quickly the parameters are changed according to the gradients, so it affects convergence time.
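As a rough check on the parameter growth quoted above, the snippet below counts the weights of a single standard LSTM cell. The input size of 300 matches the report's output size, but the full model contains several such layers plus other components, so these are illustrative per-layer counts, not the authors' exact totals.

def lstm_params(hidden_size, input_size):
    # 4 gates, each with a weight matrix over [input; hidden] and a bias vector
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)

for h in (100, 200):
    print(h, lstm_params(h, input_size=300))  # 100 -> 160,400   200 -> 400,800

The count grows roughly quadratically with the hidden size, so moving from 100 to 200 hidden units multiplies each such layer by about 2.5; once several stacked LSTMs and the decoder are included, this is broadly consistent with the reported jump from about 0.9M to 3M parameters and the doubled per-epoch training time.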