CSC 573: Data Mining
Weka Assignment #2: Data Classification in WEKA
Instructor: Ratko Orlandic
In this assignment, you will explore the data-classification facilities in WEKA using both the Explorer and the Experimenter. You will use the “contact-lenses”, “iris”, and “soybean” data sets, all of which are available in the required .arff format in the WEKA package. The “contact-lenses” data set has 24 instances with 5 nominal attributes, the last of which (“contact-lenses”) is the class dimension. The “iris” set has 150 instances with 4 continuous attributes and a nominal class, which is the last (5th) dimension. The “soybean” set has 683 instances with 36 nominal attributes, the last of which is the class dimension. Unlike the other two sets, “soybean” has missing values.
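For reference, the same data sets can also be loaded and inspected programmatically; the following is a minimal sketch against the WEKA Java API (it assumes weka.jar is on the classpath and the .arff file is in the working directory):

```java
// A minimal sketch (not required for the assignment): loading one of the .arff
// files with the WEKA Java API. Assumes weka.jar is on the classpath and the
// data file is in the working directory.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectData {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("soybean.arff");
        // In all three data sets the class attribute is the last one.
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());

        // Report missing values per attribute ("soybean" is the only set that has any).
        for (int i = 0; i < data.numAttributes(); i++) {
            int missing = data.attributeStats(i).missingCount;
            if (missing > 0) {
                System.out.println(data.attribute(i).name() + ": " + missing + " missing");
            }
        }
    }
}
```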
For each experiment A-C below, use the WEKA Explorer to perform data classification with the following classification methods, using default parameters: 1) OneR; 2) NaiveBayesSimple; 3) Id3; and 4) J48. For each method on every data set, use the following evaluation methods (“Test options” in the “Classify” window of the WEKA Explorer): a) “Use training set”; b) “Cross-validation” with 10 folds; and c) “Percentage split” set to 66%. Record the results of each run in a CSV file “Results.csv” or Excel file “Results.xls”, where you indicate only: a) the experiment (A-C); b) the name of the input data file; c) the classification method; d) the evaluation strategy used; and e) the percentage of correctly classified instances.
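The three “Test options” can also be reproduced outside the GUI. The sketch below uses the WEKA Evaluation API with J48 on “contact-lenses.arff”; the random seed and the 66/34 rounding are assumptions, so the figures may differ slightly from the Explorer’s output. The other three methods would be swapped in the same way.

```java
// A hedged sketch of the three "Test options" using the WEKA Evaluation API with
// J48 on "contact-lenses.arff". The seed and the 66/34 rounding are assumptions.
// The other methods (weka.classifiers.rules.OneR, weka.classifiers.bayes.NaiveBayesSimple,
// weka.classifiers.trees.Id3) are used the same way; note that the latter two ship
// in the core jar only in older WEKA 3.x releases.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TestOptionsSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("contact-lenses.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // a) "Use training set": train and evaluate on the same instances.
        J48 j48 = new J48();
        j48.buildClassifier(data);
        Evaluation evalTrain = new Evaluation(data);
        evalTrain.evaluateModel(j48, data);
        System.out.printf("Training set: %.2f%% correct%n", evalTrain.pctCorrect());

        // b) "Cross-validation" with 10 folds.
        Evaluation evalCV = new Evaluation(data);
        evalCV.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("10-fold CV:   %.2f%% correct%n", evalCV.pctCorrect());

        // c) "Percentage split" at 66%: randomize, train on 66%, test on the rest.
        Instances shuffled = new Instances(data);
        shuffled.randomize(new Random(1));
        int trainSize = (int) Math.round(shuffled.numInstances() * 0.66);
        Instances train = new Instances(shuffled, 0, trainSize);
        Instances test  = new Instances(shuffled, trainSize, shuffled.numInstances() - trainSize);
        J48 j48Split = new J48();
        j48Split.buildClassifier(train);
        Evaluation evalSplit = new Evaluation(train);
        evalSplit.evaluateModel(j48Split, test);
        System.out.printf("66%% split:    %.2f%% correct%n", evalSplit.pctCorrect());
    }
}
```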
A. Perform data classification on the “contact-lenses.arff” data set using the four classification methods and each of the evaluation strategies indicated above.
B. Perform discretization of all non-class attributes in the “iris.arff” data set into 10 equal-width bins as follows: under “Filter” in the “Preprocess” window of the Explorer, select ‘filters’ -> ‘unsupervised’ -> ‘attribute’ -> ‘Discretize’. Use default parameters for the ‘Discretize’ filter. After you make sure that all non-class attributes are nominal, perform classification on this set using the four classification methods and each of the evaluation strategies indicated above.
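The same discretization step can be applied through the WEKA Java API; the sketch below relies on the filter’s stock defaults (10 bins, equal width), which match what the Explorer uses:

```java
// A hedged sketch of the experiment B discretization step via the WEKA Java API,
// using the filter's stock defaults (10 bins, equal width). Only numeric attributes
// are discretized, so the nominal class attribute is left untouched.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeIrisEqualWidth {
    public static void main(String[] args) throws Exception {
        Instances iris = DataSource.read("iris.arff");
        iris.setClassIndex(iris.numAttributes() - 1);

        Discretize disc = new Discretize();      // defaults: 10 equal-width bins
        disc.setInputFormat(iris);
        Instances irisDisc = Filter.useFilter(iris, disc);

        // Sanity check: every attribute should now be nominal.
        for (int i = 0; i < irisDisc.numAttributes(); i++) {
            System.out.println(irisDisc.attribute(i).name() + " nominal? "
                    + irisDisc.attribute(i).isNominal());
        }
        // irisDisc is then classified with the four methods, exactly as in experiment A.
    }
}
```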
C. Perform discretization of all non-class attributes in the “iris.arff” data set into 5 close-to-equal-height bins by selecting the ‘Discretize’ filter and choosing appropriate parameters. After you make sure that all non-class attributes are nominal, perform classification on this set using the four classification methods and each of the evaluation strategies indicated above.
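For this variant, only the filter configuration changes relative to the experiment B sketch above; “equal-height” binning corresponds to the unsupervised ‘Discretize’ filter’s equal-frequency option:

```java
// A hedged sketch of the filter configuration for experiment C; everything else is
// identical to the experiment B sketch. "Equal-height" maps to equal-frequency binning.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeIrisEqualHeight {
    public static void main(String[] args) throws Exception {
        Instances iris = DataSource.read("iris.arff");
        iris.setClassIndex(iris.numAttributes() - 1);

        Discretize disc = new Discretize();
        disc.setBins(5);                  // 5 bins instead of the default 10
        disc.setUseEqualFrequency(true);  // close-to-equal-height (equal-frequency) bins
        disc.setInputFormat(iris);
        Instances irisDisc = Filter.useFilter(iris, disc);
        // irisDisc is then classified with the four methods, as before.
    }
}
```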
D. For this experiment, use the WEKA Experimenter. Perform data classification on “contact-lenses.arff” and “soybean.arff” using the OneR, NaiveBayesSimple, and J48 classification methods with default parameters. For each method on both data sets, use 10 repetitions of 10-fold cross-validation as the evaluation method. Record the results in the “RawResults.csv” file. From these results, compute the average accuracy of each method on every data set (“contact-lenses” and “soybean”), and include in the “Results.csv” (i.e., “Results.xls”) file (the same result file as for experiments A-C): a) the experiment (D); b) the name of the data file; c) the classification method; d) the evaluation strategy; and e) the average percentage of correctly classified instances over the 10×10 runs.
NOTE: To perform this experiment, configure the runs in the “Setup” window of the Experimenter: add both data files and the three classification methods, and set 10 repetitions of 10-fold cross-validation.
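The Experimenter carries out the 10×10 runs itself. Purely as a rough, hedged approximation of what that evaluation amounts to, ten seeded 10-fold cross-validations can be averaged with the Evaluation API; the classifier, file name, and seeds below are illustrative, and the Experimenter’s own per-fold bookkeeping may give slightly different averages:

```java
// A rough, hedged approximation of the 10 x 10-fold evaluation outside the
// Experimenter: average ten seeded 10-fold cross-validation runs.
// The classifier, file name, and seeds are illustrative assumptions.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TenByTenSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("soybean.arff");
        data.setClassIndex(data.numAttributes() - 1);

        double sum = 0.0;
        for (int seed = 1; seed <= 10; seed++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new OneR(), data, 10, new Random(seed));
            sum += eval.pctCorrect();   // one 10-fold CV accuracy per repetition
        }
        System.out.printf("OneR on soybean.arff: %.2f%% correct (average over 10x10 runs)%n",
                sum / 10.0);
    }
}
```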
Once you have performed the experiments, you should spend some time evaluating your results. In particular, try to answer at least the following questions: Which classification method typically gives the highest accuracy? Which method does not perform well, and why? Why did we use discretization on the “iris” data set? Do discretization and the choice of discretization method affect the classification results, and how? Which of the three evaluation methods overestimates the accuracy, and why? Which of the three evaluation methods underestimates the accuracy, and why? Record these and any other observations in a Word file called “Observations.doc”.
On or before the due date, you should submit a single zipped file through the Blackboard system containing: a) the “Results.csv” (or “Results.xls”) file with the summary results of your runs in all experiments A-D; b) the “RawResults.csv” file from experiment D; and c) the “Observations.doc” file. Please adhere to the following submission procedure:
Grading will be done based on the correctness of the results as well as the extensiveness, clarity, and correctness of your observations.
Good luck!