Seq2seq-Attention Question Answering Model
Wenqi Hou (wenqihou), Yun Nie (yunn)
Abstract:
A sequence-to-sequence attention reading comprehension model was implemented to
fulfill the question answering task defined by the Stanford Question Answering Dataset
(SQuAD). The basic structure consists of bidirectional LSTM (BiLSTM) encoders with an
attention mechanism, followed by BiLSTM decoding. Several adjustments, such as
dropout, learning rate decay, and gradient clipping, were applied. The final model
achieved a 57.8% F1 score and a 47.5% Exact Match (EM) ratio on the validation set, and
49.1% F1 and 35.9% EM on the private test set. Future work will focus on preventing
overfitting while adding hidden layers.
Introduction
Question Answering (QA) systems are seeing strong growth in daily use now and in the
near future. One particular task is reading comprehension: generating the answer to a
question by locating a span in a given context paragraph (Fig.1). In past reading
comprehension research, the available datasets were manually labelled and limited in
size. With the launch of the Stanford Question Answering Dataset (SQuAD), models can
be validated and tested much more thoroughly. In this project, we used SQuAD to build
a sequence-to-sequence, attention-based network for question answering. The intuition
behind the attention mechanism is that the model can be trained to recognize the
difference between two context encodings, one with question attention and one without;
the part that differs, to which the question pays more attention, is likely to be the answer.
Fig.1 SQuAD question-answer example
Approach
The basic structure of the model is a network of two encoders and a decoder, all implemented as bidirectional LSTMs (BiLSTMs) with minor variations (Fig. 2).
Fig.2 model architecture
First, the question Q and the context paragraph P are encoded by two independent BiLSTMs, which produce hidden states at each word position, H_Q and H_P. The encoded question and paragraph matrices H_Q and H_P are then fed into another encoder with sequence-to-sequence attention (Fig. 3). For each hidden state vector h_Q in H_Q, we calculate its attention score over all hidden states in H_P, using a simple dot product. The score matrix of each question position over the whole paragraph is then multiplied by H_P, and the product is sent into another LSTM layer to generate the weighted context H_C under this specific question. H_C is concatenated with H_P into a larger state matrix that contains information about which parts are strongly attended to and which are not.
Fig.3 attention encoding
The next step is to feed the matrix [H_C; H_P] into the decoder, which consists of two separate bidirectional LSTM networks, one for the start index and one for the end index. These give two output vectors a_s and a_e, where the largest element of each marks the predicted index. Finally, a softmax activation with a cross-entropy loss is applied at the output. The model is trained for a sufficient number of iterations with several adjustments and regularization techniques, which are discussed in the next part.
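To make the pipeline above concrete, the following is a minimal PyTorch-style sketch of the two encoders, the dot-product attention layer, and the start/end span decoder. It is not the authors' implementation: the framework, the module and variable names (AttentionEncoder, SpanDecoder), the hidden sizes, the final linear scoring layers, and the choice to align H_C with paragraph positions (so that the concatenation [H_C; H_P] is well defined) are all assumptions.

```python
# Hedged sketch of the described architecture; hyperparameters and the exact
# attention alignment are assumptions, not taken from the report.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionEncoder(nn.Module):
    """Two independent BiLSTM encoders plus a dot-product attention layer."""
    def __init__(self, embed_dim=100, hidden=200):
        super().__init__()
        self.enc_p = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.enc_q = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        # Extra LSTM layer that turns the attended states into H_C.
        self.enc_c = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, p_emb, q_emb):
        H_P, _ = self.enc_p(p_emb)                     # (batch, P_len, 2*hidden)
        H_Q, _ = self.enc_q(q_emb)                     # (batch, Q_len, 2*hidden)
        scores = torch.bmm(H_P, H_Q.transpose(1, 2))   # dot-product attention scores
        alpha = F.softmax(scores, dim=-1)              # normalize over question positions
        attended = torch.bmm(alpha, H_Q)               # question-weighted states per paragraph word
        H_C, _ = self.enc_c(attended)                  # weighted context H_C
        return torch.cat([H_C, H_P], dim=-1)           # [H_C; H_P], (batch, P_len, 4*hidden)

class SpanDecoder(nn.Module):
    """Two separate BiLSTMs score the start and end indices over the paragraph."""
    def __init__(self, hidden=200):
        super().__init__()
        self.dec_s = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)
        self.dec_e = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)
        self.w_s = nn.Linear(2 * hidden, 1)            # per-position scoring layers are an assumption
        self.w_e = nn.Linear(2 * hidden, 1)

    def forward(self, H):                              # H = [H_C; H_P]
        a_s = self.w_s(self.dec_s(H)[0]).squeeze(-1)   # start-index logits, (batch, P_len)
        a_e = self.w_e(self.dec_e(H)[0]).squeeze(-1)   # end-index logits, (batch, P_len)
        return a_s, a_e

# Training objective, as described in the text: softmax + cross-entropy on both indices.
# loss = F.cross_entropy(a_s, start_idx) + F.cross_entropy(a_e, end_idx)
```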

Fig.5 linear decoder model
3) model with LSTM decoder
Replacing the linear decoder with one built on a recurrent neural network proved to be a big improvement (Fig. 6), and adding dropout regularization in all encoding functions made the improvement even greater. Without dropout, our model suffered badly from overfitting: the validation loss decreased only slightly at the very beginning and then kept increasing until it was more than double the training loss; likewise, the validation set reached less than half of the training F1 and EM ratios, with gaps of almost 30%. After dropout was added, the gaps dropped below 20% and the validation loss showed a clear downward trend, no longer increasing sharply and keeping a relatively moderate distance from the training loss. Overfitting was relieved to some extent, but it still existed. The final version of this model achieved a 57.8% F1 score on the validation set and 49.1% on the leaderboard test set.
Fig.6 RNN decoder model
Fig.7 RNN decoder model with dropout
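As a concrete illustration of the dropout change, the regularized encoders might look like the sketch below; the dropout probability and the exact placement inside each encoding function are assumptions, since the report does not state them.

```python
# Hedged sketch: dropout applied to BiLSTM encoder outputs; the rate is a guess.
import torch.nn as nn

class EncoderWithDropout(nn.Module):
    """BiLSTM encoder whose hidden states are regularized with dropout."""
    def __init__(self, embed_dim=100, hidden=200, p_drop=0.15):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(p_drop)   # active only in model.train() mode

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.drop(h)              # dropped states feed the later attention/decoder layers
```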

4) parameter tuning
Learning rate: For all implementations we started from a learning rate of 0.01; however, training rarely survived due to exploding gradients, and even gradient clipping did not fully help. A more promising choice was 0.001 with exponential decay at a rate of 0.8 (Fig. 7). This let the training loss decrease more smoothly, without bouncing around a local minimum or diverging.
State size: This is the dimensionality of the hidden states in the LSTM networks. We used the same state size for all LSTMs and found that a size of 200 gave much better performance than a size of 100, at the cost of slower learning and longer running time.
Output size: 300, as discussed before.
Batch size: The default size is 10; we increased it to 40 to fully utilize the GPU, which saved much of the running time per epoch.
Model performances after 10 epochs are summarized in Table 1:

Table 1 model performances

decoder | dropout rate | training loss | training F1 (%) | training EM (%) | validation loss | validation F1 (%) | validation EM (%)
linear  | -            | 5.84          | 15.8            | 15              | 8.66            | 7.8               | 2
LSTM    |              |               |                 |                 |                 |                   |
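Put together, the tuning choices above (learning rate 0.001 with exponential decay of 0.8, gradient clipping, 10 epochs, batch size 40) could be expressed roughly as in the sketch below; the optimizer choice (Adam), the clip threshold, and the per-epoch decay schedule are assumptions not stated in the report.

```python
# Hedged training-loop sketch: exponential learning rate decay plus gradient clipping.
import torch
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_

def train(model, loader, epochs=10, lr=1e-3, decay=0.8, max_grad_norm=5.0):
    opt = torch.optim.Adam(model.parameters(), lr=lr)           # optimizer is an assumption
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=decay)
    for _ in range(epochs):                                     # 10 epochs, as in Table 1
        for p_emb, q_emb, start_idx, end_idx in loader:         # batch size (e.g. 40) set in the loader
            opt.zero_grad()
            a_s, a_e = model(p_emb, q_emb)                      # start/end logits over the paragraph
            loss = F.cross_entropy(a_s, start_idx) + F.cross_entropy(a_e, end_idx)
            loss.backward()
            clip_grad_norm_(model.parameters(), max_grad_norm)  # keep gradients bounded
            opt.step()
        sched.step()                                            # multiply the learning rate by 0.8
```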

5) running time analysis
The number of parameters in the model has a major influence on the training time per epoch. When the hidden state size increases from 100 to 200, the total number of parameters of the same model grows from about 900,000 to about 3,000,000, which roughly doubles the training time per epoch. A bigger batch size leads to faster training and answering, since the batch size determines the degree of parallelism of the model; we used a larger batch size to fully utilize the GPU, although batches that are too large cause out-of-memory errors. Dropout and the learning rate also affect running time, but not as significantly as batch size and parameter count. Dropout is used to prevent overfitting, so it makes the model learn more slowly but also makes what it learns more universally applicable. The learning rate decides how quickly the parameters are updated along the gradients, so it affects convergence time.
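For the parameter counts quoted above (roughly 900,000 versus 3,000,000 when the hidden size doubles), a quick check along these lines can be used; the exact totals depend on the embedding size and layer choices, and the class name refers to the earlier sketch.

```python
# Count trainable parameters, the quantity that drives the per-epoch training time.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. compare hidden sizes for the sketched encoder:
# count_parameters(AttentionEncoder(hidden=100))  vs.  count_parameters(AttentionEncoder(hidden=200))
```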