Seq2seq-Attention Question Answering Model
Wenqi Hou (wenqihou), Yun Nie (yunn)
Abstract:
A sequence-to-sequence attention reading comprehension model was implemented to
fulfill the question answering task defined by the Stanford Question Answering Dataset
(SQuAD). The basic structure consists of bidirectional LSTM (BiLSTM) encoders with an
attention mechanism, followed by BiLSTM decoding. Several adjustments, such as
dropout, learning rate decay, and gradient clipping, were applied. The final model
achieved a 57.8% F1 score and a 47.5% Exact Match (EM) ratio on the validation set, and
49.1% F1 and 35.9% EM on the private test set. Future work will focus on preventing
overfitting while adding hidden layers.
Introduction
Question Answering (QA) systems are seeing strong growth in daily use now and in the
near future. One particular task is reading comprehension: generating the answer to a
question by locating a span in a given context paragraph (Fig.1). In past reading
comprehension research, the available datasets were manually labelled and limited in
size. With the launch of the Stanford Question Answering Dataset (SQuAD), models can
be validated and tested much more thoroughly. In this project, we used SQuAD to build
a sequence-to-sequence, attention-based network for question answering. The intuition
behind the attention mechanism is that the model can be trained to recognize the
difference between two context encodings, one with question attention and one without;
the part that differs, to which the question pays more attention, is likely to be the answer.
Fig.1 SQuAD question-answer example
Approach
The basic structure of the model is a network of two encoders and a decoder, all implemented as bidirectional LSTMs (BiLSTMs) with minor variations (Fig. 2).
Fig.2 model architecture
First, the question Q and the context paragraph P are encoded by two independent BiLSTMs, which produce hidden states at each word position, H_Q and H_P. The encoded question and paragraph matrices H_Q and H_P are then fed into another encoder with sequence-to-sequence attention (Fig. 3). For each hidden state vector h_Q in H_Q, we calculate its attention score over all hidden states in H_P, using a simple dot product. The score matrix of each question position over the whole paragraph is then multiplied by H_P, and the product is sent into another LSTM layer to generate the weighted context H_C under this specific question. H_C is concatenated with H_P into a larger state matrix that contains information about which parts are strongly attended to and which are not.
Fig.3 attention encoding
The next step is to feed the matrix [H_C; H_P] into the decoder, which consists of two separate bidirectional LSTM networks, one for the start index and one for the end index. These give two output vectors a_s and a_e, where the largest element of each marks the predicted index. Finally, a softmax activation with a cross-entropy loss is applied at the output. The model is trained for a sufficient number of iterations with several adjustments and regularization techniques, which are discussed in the next part.
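To make the pipeline above concrete, the following is a minimal PyTorch-style sketch of the two encoders, the dot-product attention layer, and the start/end span decoder. It is not the authors' implementation: the framework, the module and variable names (AttentionEncoder, SpanDecoder), the hidden sizes, the final linear scoring layers, and the choice to align H_C with paragraph positions (so that the concatenation [H_C; H_P] is well defined) are all assumptions.

```python
# Hedged sketch of the described architecture; hyperparameters and the exact
# attention alignment are assumptions, not taken from the report.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionEncoder(nn.Module):
    """Two independent BiLSTM encoders plus a dot-product attention layer."""
    def __init__(self, embed_dim=100, hidden=200):
        super().__init__()
        self.enc_p = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.enc_q = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        # Extra LSTM layer that turns the attended states into H_C.
        self.enc_c = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, p_emb, q_emb):
        H_P, _ = self.enc_p(p_emb)                     # (batch, P_len, 2*hidden)
        H_Q, _ = self.enc_q(q_emb)                     # (batch, Q_len, 2*hidden)
        scores = torch.bmm(H_P, H_Q.transpose(1, 2))   # dot-product attention scores
        alpha = F.softmax(scores, dim=-1)              # normalize over question positions
        attended = torch.bmm(alpha, H_Q)               # question-weighted states per paragraph word
        H_C, _ = self.enc_c(attended)                  # weighted context H_C
        return torch.cat([H_C, H_P], dim=-1)           # [H_C; H_P], (batch, P_len, 4*hidden)

class SpanDecoder(nn.Module):
    """Two separate BiLSTMs score the start and end indices over the paragraph."""
    def __init__(self, hidden=200):
        super().__init__()
        self.dec_s = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)
        self.dec_e = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)
        self.w_s = nn.Linear(2 * hidden, 1)            # per-position scoring layers are an assumption
        self.w_e = nn.Linear(2 * hidden, 1)

    def forward(self, H):                              # H = [H_C; H_P]
        a_s = self.w_s(self.dec_s(H)[0]).squeeze(-1)   # start-index logits, (batch, P_len)
        a_e = self.w_e(self.dec_e(H)[0]).squeeze(-1)   # end-index logits, (batch, P_len)
        return a_s, a_e

# Training objective, as described in the text: softmax + cross-entropy on both indices.
# loss = F.cross_entropy(a_s, start_idx) + F.cross_entropy(a_e, end_idx)
```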

Fig.5 linear decoder model
3) model with LSTM decoder
Replacing the linear decoder with one built on a recurrent neural network proved to be a big improvement (Fig. 6), and adding dropout regularization in all encoding functions made the improvement even greater. Without dropout, our model suffered badly from overfitting: the validation loss decreased only slightly at the very beginning and then kept increasing until it was more than double the training loss; likewise, the validation set reached less than half of the training F1 and EM ratios, with gaps of almost 30%. After dropout was added, the gaps dropped below 20% and the validation loss showed a clear downward trend, no longer increasing sharply and keeping a relatively moderate distance from the training loss. Overfitting was relieved to some extent, but it still existed. The final version of this model achieved a 57.8% F1 score on the validation set and 49.1% on the leaderboard test set.
Fig.6 RNN decoder model
Fig.7 RNN decoder model with dropout
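As a concrete illustration of the dropout change, the regularized encoders might look like the sketch below; the dropout probability and the exact placement inside each encoding function are assumptions, since the report does not state them.

```python
# Hedged sketch: dropout applied to BiLSTM encoder outputs; the rate is a guess.
import torch.nn as nn

class EncoderWithDropout(nn.Module):
    """BiLSTM encoder whose hidden states are regularized with dropout."""
    def __init__(self, embed_dim=100, hidden=200, p_drop=0.15):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.drop = nn.Dropout(p_drop)   # active only in model.train() mode

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.drop(h)              # dropped states feed the later attention/decoder layers
```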

4) parameter tuning
Learning rate: For all implementations we started from a learning rate of 0.01; however, training rarely survived due to exploding gradients, and even gradient clipping did not fully help. A more promising choice was 0.001 with exponential decay at a rate of 0.8 (Fig. 7). This let the training loss decrease more smoothly, without bouncing around a local minimum or diverging.
State size: This is the dimensionality of the hidden states in the LSTM networks. We used the same state size for all LSTMs and found that a size of 200 gave much better performance than a size of 100, at the cost of slower learning and longer running time.
Output size: 300, as discussed before.
Batch size: The default size is 10; we increased it to 40 to fully utilize the GPU, which saved much of the running time per epoch.
Model performances after 10 epochs are summarized in Table 1:

Table 1 model performances

decoder | dropout rate | training loss | training F1 (%) | training EM (%) | validation loss | validation F1 (%) | validation EM (%)
linear  | -            | 5.84          | 15.8            | 15              | 8.66            | 7.8               | 2
LSTM    |              |               |                 |                 |                 |                   |
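Put together, the tuning choices above (learning rate 0.001 with exponential decay of 0.8, gradient clipping, 10 epochs, batch size 40) could be expressed roughly as in the sketch below; the optimizer choice (Adam), the clip threshold, and the per-epoch decay schedule are assumptions not stated in the report.

```python
# Hedged training-loop sketch: exponential learning rate decay plus gradient clipping.
import torch
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_

def train(model, loader, epochs=10, lr=1e-3, decay=0.8, max_grad_norm=5.0):
    opt = torch.optim.Adam(model.parameters(), lr=lr)           # optimizer is an assumption
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=decay)
    for _ in range(epochs):                                     # 10 epochs, as in Table 1
        for p_emb, q_emb, start_idx, end_idx in loader:         # batch size (e.g. 40) set in the loader
            opt.zero_grad()
            a_s, a_e = model(p_emb, q_emb)                      # start/end logits over the paragraph
            loss = F.cross_entropy(a_s, start_idx) + F.cross_entropy(a_e, end_idx)
            loss.backward()
            clip_grad_norm_(model.parameters(), max_grad_norm)  # keep gradients bounded
            opt.step()
        sched.step()                                            # multiply the learning rate by 0.8
```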

5) running time analysis
The number of parameters in the model has a major influence on the training time per epoch. When the hidden state size increases from 100 to 200, the total number of parameters of the same model grows from about 900,000 to about 3,000,000, which roughly doubles the training time per epoch. A bigger batch size leads to faster training and answering, since the batch size determines the degree of parallelism of the model; we used a larger batch size to fully utilize the GPU, although batches that are too large cause out-of-memory errors. Dropout and the learning rate also affect running time, but not as significantly as batch size and parameter count. Dropout is used to prevent overfitting, so it makes the model learn more slowly but also makes what it learns more universally applicable. The learning rate decides how quickly the parameters are updated along the gradients, so it affects convergence time.
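For the parameter counts quoted above (roughly 900,000 versus 3,000,000 when the hidden size doubles), a quick check along these lines can be used; the exact totals depend on the embedding size and layer choices, and the class name refers to the earlier sketch.

```python
# Count trainable parameters, the quantity that drives the per-epoch training time.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. compare hidden sizes for the sketched encoder:
# count_parameters(AttentionEncoder(hidden=100))  vs.  count_parameters(AttentionEncoder(hidden=200))
```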