



Mohammad H. Falakmasir, Kevin D. Ashley, Christian D. Schunn, Diane J. Litman
Learning Research and Development Center, Intelligent Systems Program, University of Pittsburgh
Abstract. Peer-reviewing is a recommended instructional technique to encourage good writing. Peer reviewers, however, may fail to identify key elements of an essay, such as thesis and conclusion statements, especially in high school writing. Our system identifies thesis and conclusion statements, or their absence, in students’ essays in order to scaffold reviewer reflection. We showed that computational linguistics and interactive machine learning have the potential to facilitate peer-review processes.

Keywords: Peer-review, high school writing instruction, discourse analysis, natural language processing, interactive machine learning
Writing is essential to communication, learning, and problem solving. However, poor achievement in high school writing is a major deficiency in the US educational system [1]. There appears to be no single best approach to teaching writing; however, some practices have been shown to be more effective than others. One of these practices, peer-review of writing assignments, is a commonly recommended technique to improve writing skills, especially in large class settings. Peer-review not only provides students with feedback; it also gives them the opportunity to read other students’ essays and to improve their reflective and metacognitive skills. Several studies have found that providing feedback leads to improvement in the reviewer’s writing [2], especially when the students provide constructive feedback [3] and put effort into the process [4]. While web-based peer-review systems solve logistical challenges of the review process, such as distributing documents, providing rubrics and review criteria, and supporting successive drafts, they are still far from optimal [5]. In particular, reviewers may not focus on the core aspects of the text being evaluated [6].

In argumentative writing, a thesis statement plays a pivotal role: it communicates the author’s position and opinion about the essay prompt; it anchors the framework of the essay, serving as a hook that ties together the reasons and evidence presented; and it anticipates critiques and counterarguments [7]. The thesis statement thus has a major influence on assessments of writing skill [8]. A conclusion reiterates the main idea and summarizes the entire argument in an essay. It may contain new information, such as self-reflections
on the writer’s position [7]. Since thesis and conclusion statements both play a critical role in the overall argument and share similar linguistic elements, in this paper we focus on automatically identifying these two core aspects.

Advances in computational linguistics enable systems to automatically and quickly analyze large text corpora. Shermis et al. [9] reviewed the features of the three most successful Automated Essay Evaluation (AEE) systems. These systems can analyze certain pedagogically significant aspects of essays as reliably as expert human graders. In particular, Burstein and Marcu [10] presented a machine learning model for detecting thesis and conclusion sentences in students’ essays. Later they extended their model into a discourse analysis system as part of the ETS Criterion® software for online essay evaluation [11]. Their model uses lexical, syntactic, and rhetorical features and a complex classification framework to label different discourse elements of essays, such as introductory material, thesis statements, topic sentences, and conclusions. Writing Pal (W-Pal) [12], an Intelligent Tutoring System, uses another AEE methodology to offer writing strategy instruction, game-based essay writing practice, and formative feedback to high school writers. It uses the Coh-Metrix AEE [13] to analyze student essays and provide formative feedback.

We hypothesize that AEE techniques can also improve computer-supported peer-review by calling reviewers’ attention to particular features of an essay (e.g., thesis or conclusion statements) that deserve comment. Our AEE model is designed to be used as part of the SWoRD peer-review system [14]. To the best of our knowledge, no one has used AEE techniques to support intelligent scaffolding of peer-reviews. We believe that our system has the potential to combine the strengths of both web-based peer review and automated essay evaluation systems. With an ability to identify thesis statements, the system will scaffold reviewers’ consideration of these issues by posing questions such as:
It is important that reviewers attend to thesis statements: how well they are articulated and supported, and whether alternative interpretations/viewpoints are considered [16,
Positional Features: We used 3 positional features: paragraph number, sentence number within the paragraph, and type of paragraph (first, body, or last). We also used the same positional baseline as [11] in order to compare our results with their model. The positional baseline predicts all sentences in the first paragraph as thesis statements and all sentences within the last paragraph as conclusion sentences.

Sentence Level Features: We used a number of sentence level features based on the syntactic, semantic, and dependency parsing of the sentence. Based on our feature selection process, prepositional and gerund phrases are highly predictive of thesis and conclusion sentences. The number of adjectives and adverbs within the sentence is also highly correlated with a sentence being a thesis or conclusion statement. A set of frequent words was also predictive of thesis and conclusion sentences (e.g., “although”, “even though”, “because”, “due to”, “led to”, “caused”), and we used the number of occurrences of these words in a sentence as a feature in our model.

Essay Level Features: We used 4 essay level features: the number of keywords among the most frequent words of the essay, the number of words overlapping with the assignment prompt, and a sentence importance score based on Rhetorical Structure Theory (RST) adapted from [19].

Table 2 shows the top 5 most predictive features for each category based on the Gini Coefficient [20] attribute selection method. This method considers the prior distribution of the classes, looks for the largest class in the training set (in our case, sentences that are not the thesis), and tries to isolate it from the other classes, which suits the nature of our classification task.
Table 2. Top 5 most predictive features for each category based on the Gini Coefficient.

Ranking  Thesis                   Conclusion
1        Last Sentence            Last Paragraph
2        First Paragraph          Keyword Overlap
3        Common Words             Common Words
4        Keyword Overlap          Number of Adjectives
5        Number of Noun Phrases   Number of Noun Phrases
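As an illustration only, and not the authors' implementation, the sketch below shows one way the positional and lexical features described above could be computed for each sentence, together with the positional baseline. The names (extract_features, positional_baseline, CUE_WORDS) are invented for this sketch; NLTK part-of-speech tags stand in for the full syntactic, semantic, and dependency parses used in the paper, and the noun-phrase counts and RST-based importance score are omitted.

```python
# Minimal sketch of per-sentence feature extraction. Assumes each essay is a
# list of paragraphs, each paragraph a list of sentence strings.
# Requires the NLTK 'punkt' and 'averaged_perceptron_tagger' data packages.
from collections import Counter
import nltk

CUE_WORDS = ["although", "even though", "because", "due to", "led to", "caused"]

def extract_features(essay, prompt):
    """Yield one feature dict per sentence of the essay."""
    prompt_words = set(w.lower() for w in nltk.word_tokenize(prompt))
    all_words = [w.lower() for p in essay for s in p for w in nltk.word_tokenize(s)]
    # Most frequent words of the essay (no stop-word filtering in this sketch).
    top_keywords = set(w for w, _ in Counter(all_words).most_common(10))

    for p_idx, paragraph in enumerate(essay):
        p_type = "first" if p_idx == 0 else ("last" if p_idx == len(essay) - 1 else "body")
        for s_idx, sentence in enumerate(paragraph):
            tokens = nltk.word_tokenize(sentence)
            tags = [tag for _, tag in nltk.pos_tag(tokens)]
            lower = sentence.lower()
            yield {
                # Positional features
                "paragraph_number": p_idx + 1,
                "sentence_number": s_idx + 1,
                "paragraph_type": p_type,
                # Sentence level features (POS counts and cue-word counts)
                "n_adjectives": sum(t.startswith("JJ") for t in tags),
                "n_adverbs": sum(t.startswith("RB") for t in tags),
                "n_cue_words": sum(lower.count(c) for c in CUE_WORDS),
                # Essay level features (keyword and prompt overlap)
                "keyword_overlap": sum(w.lower() in top_keywords for w in tokens),
                "prompt_overlap": sum(w.lower() in prompt_words for w in tokens),
            }

def positional_baseline(essay):
    """Baseline from the paper: every first-paragraph sentence is labeled
    'thesis', every last-paragraph sentence 'conclusion', all others 'other'."""
    labels = []
    for p_idx, paragraph in enumerate(essay):
        for _ in paragraph:
            if p_idx == 0:
                labels.append("thesis")
            elif p_idx == len(essay) - 1:
                labels.append("conclusion")
            else:
                labels.append("other")
    return labels
```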
After a data cleaning and pre-processing step, we created feature vectors for all of the sentences in the training set essays. Our target class had 3 labels: “thesis”, “conclusion”, and “other”. We considered sentences rated 2 and 3 as thesis and conclusion statements and put the ones rated 1 (incomplete) into the “other” category. We evaluated our model at two levels, sentence level and essay level, and compared its performance against the positional baseline and human-annotated data. We used 3 classifiers in RapidMiner [21] to develop the sentence level models: Naïve Bayes, Decision Tree, and Support Vector Machine (SVM). We used 10-fold essay-stratified cross-validation to evaluate our models at the sentence level. To evaluate the models at the essay level, we aggregated the results of the sentence level model to predict whether an essay contains a thesis/conclusion statement or not. Table 3 shows the performance of the 3 classifiers based on average Precision (P), Recall (R), and F-measure (F) across all 10 rounds of cross-validation. We use F, the harmonic mean of P and R, as our main performance evaluation metric.
Table 3. Average performance of 3 models and the positional baseline on development set

                      Thesis              Conclusion          Essay
Classifier            P     R     F       P     R     F       P     R     F
Positional Baseline   0.53  0.89  0.50    0.51  0.89  0.46    0.61  0.78  0.
Naïve Bayes           0.62  0.76  0.68    0.57  0.72  0.62    0.71  0.66  0.
Decision Tree         0.75  0.68  0.71    0.62  0.43  0.51    0.75  0.71  0.
SVM                   0.85  0.66  0.74    0.67  0.41  0.51    0.69  0.64  0.
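The evaluation procedure above can be illustrated with scikit-learn as a stand-in for RapidMiner; the following is a hedged sketch of the setup, not the original experiment. It reads "essay stratified" cross-validation as keeping all sentences of an essay in the same fold (GroupKFold), and the essay-level prediction as "the essay contains a thesis if any of its sentences is predicted as thesis". The function name evaluate and the inputs X, y, and essay_ids are assumptions, with features coming from a step like the one sketched earlier.

```python
# Sketch of sentence-level 10-fold cross-validation (folds split by essay)
# plus essay-level aggregation, using scikit-learn instead of RapidMiner.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

def evaluate(X, y, essay_ids, target="thesis"):
    """X: (n_sentences, n_features) array; y: labels in {thesis, conclusion, other};
    essay_ids: essay index of each sentence. Returns sentence- and essay-level P/R/F."""
    classifiers = {
        "Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(),
        "SVM": SVC(),
    }
    results = {}
    for name, clf in classifiers.items():
        sent_true, sent_pred = [], []
        essay_true, essay_pred = [], []
        for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups=essay_ids):
            clf.fit(X[train_idx], y[train_idx])
            pred = clf.predict(X[test_idx])
            sent_true.extend(y[test_idx])
            sent_pred.extend(pred)
            # Essay level: an essay "has" a thesis/conclusion if any of its
            # sentences is predicted with that label.
            for eid in np.unique(essay_ids[test_idx]):
                mask = essay_ids[test_idx] == eid
                essay_true.append(target in set(y[test_idx][mask]))
                essay_pred.append(target in set(pred[mask]))
        p, r, f, _ = precision_recall_fscore_support(
            sent_true, sent_pred, labels=[target], average=None)
        ep, er, ef, _ = precision_recall_fscore_support(
            essay_true, essay_pred, labels=[True], average=None)
        results[name] = {"sentence": (p[0], r[0], f[0]),
                         "essay": (ep[0], er[0], ef[0])}
    return results
```

Calling evaluate(X, y, essay_ids, target="conclusion") would produce the conclusion columns in the same way; in all cases F is the harmonic mean of P and R, F = 2PR/(P + R).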
To indicate how well the models generalize to new essays, we evaluated them on an unseen test set. Table 4 shows the performance of the 3 models.
Table 4. Average performance of 3 models and the positional baseline on unseen test set

                      Thesis              Conclusion          Essay
Classifier            P     R     F       P     R     F       P     R     F
Positional Baseline   0.58  0.88  0.57    0.58  0.84  0.55    0.58  0.84  0.55
Naïve Bayes           0.70  0.79  0.74    0.65  0.69  0.67    0.63  0.65  0.
Decision Tree         0.82  0.84  0.83    0.49  0.75  0.59    0.75  0.73  0.
SVM                   0.82  0.65  0.72    0.60  0.54  0.56    0.62  0.58  0.
The results show that all three models outperform the positional baseline. While the SVM classifier had the best precision on both the development and test sets at the sentence level, the Decision Tree classifier achieved higher recall and better overall performance at the essay level. Since we are not using the same training and test sets as in [11], it is not valid to compare the exact values reported for P, R, and F. However, because we use the same positional baseline, and the results of the baseline can be considered a rough estimate of the quality of the essays, we can compare the systems in terms of improvement over the baseline. In the thesis detection category, their highest reported improvement (in F) over the positional baseline is 0. while our best improvement is 0.24 on the development set and 0.26 on the unseen test set. In the conclusion detection category, their highest reported improvement is 0.23 while our best improvement is 0.16 on the development set and 0.12 on the unseen test set. In general, we have lower performance in the conclusion category because the essays in our training set are first drafts of writing assignments: students tend to spread the summary of their arguments across multiple sentences, and our current model works only at the sentence level.

In conclusion, our study shows that even with a relatively small corpus of essays, a computational linguistic model can identify core aspects of students’ essays. Our first priority was to detect the presence of thesis or conclusion statements within the student essays in order to provide instant feedback to authors upon submission. The second priority was to identify the particular sentences, to direct reviewers’ attention so that they focus some comments on how well the author has framed and supported his/her argument. Our next step is to embed our model into the SWoRD peer review system and evaluate its impact on the quality of student reviews. The peer-review nature of SWoRD gives us a unique opportunity to benefit from both author and peer feedback in order to evaluate and refine our model while it is in use. We also plan to extend the model to detect other core elements of student essays, such as topic sentences and supporting materials, in order to provide feedback and scaffolding.