IEEE Transactions on Software Engineering (to appear).
Prioritizing Test Cases for Regression Testing
Gregg Rothermel, Department of Computer Science, Oregon State U., Corvallis, OR, grother@cs.orst.edu
Roland H. Untch, Department of Computer Science, Middle Tenn. State U., Murfreesboro, TN, untch@mtsu.edu
Chengyun Chu, Department of Computer Science, Oregon State U., Corvallis, OR, chengyun@cs.orst.edu
Mary Jean Harrold, College of Computing, Georgia Inst. of Tech., 801 Atlantic Drive, Atlanta, GA, harrold@cis.ohio-state.edu
Abstract
Test case prioritization techniques schedule test cases for execution in an order that attempts to increase their effectiveness at meeting some performance goal. Various goals are possible; one involves rate of fault detection — a measure of how quickly faults are detected within the testing process. An improved rate of fault detection during testing can provide faster feedback on the system under test and let software engineers begin correcting faults earlier than might otherwise be possible. One application of prioritization techniques involves regression testing – the retesting of software following modifications; in this context, prioritization techniques can take advantage of information gathered about the previous execution of test cases to obtain test case orderings. In this paper, we describe several techniques for using test execution information to prioritize test cases for regression testing, including: (1) techniques that order test cases based on their total coverage of code components; (2) techniques that order test cases based on their coverage of code components not previously covered; (3) techniques that order test cases based on their estimated ability to reveal faults in the code components that they cover. We report the results of several experiments in which we applied these techniques to various test suites for various programs and measured the rates of fault detection achieved by the prioritized test suites, comparing those rates to the rates achieved by untreated, randomly ordered, and optimally ordered suites. Analysis of the data shows that each of the prioritization techniques studied improved the rate of fault detection of test suites, and this improvement occurred even with the least expensive of those techniques. The data also shows, however, that considerable room remains for improvement. The studies highlight several cost-benefit tradeoffs among the techniques studied, as well as several opportunities for future work.
Keywords: test case prioritization, regression testing, software testing, empirical studies
Software engineers often save the test suites they develop for their software so that they can reuse those test suites later as the software evolves. Such test suite reuse, in the form of regression testing, is pervasive in the software industry [24] and, together with other regression testing activities, has been estimated to account for as much as one-half of the cost of software maintenance [4, 20]. Running all of the test cases in a test suite, however, can require a large amount of effort. For example, one of our industrial collaborators reports that for one of its products of about 20,000 lines of code, the entire test suite requires seven weeks to run.
For this reason, researchers have considered various techniques for reducing the cost of regression testing, including regression test selection and test suite minimization techniques. Regression test selection techniques (e.g. [5, 7, 21, 29]) reduce the cost of regression testing by selecting an appropriate subset of the existing test suite, based on information about the program, modified version, and test suite. Test suite minimization techniques (e.g. [6, 15, 30, 37]) lower costs by reducing a test suite to a minimal subset that maintains equivalent coverage of the original test suite with respect to a particular test adequacy criterion. Regression test selection and test suite minimization techniques, however, can have drawbacks. For example, although some empirical evidence indicates that, in certain cases, there is little or no loss in the ability of a minimized test suite to reveal faults in comparison to its unminimized original [37, 38], other empirical evidence shows that the fault detection capabilities of test suites can be severely compromised by minimization [30]. Similarly, although there are safe regression test selection techniques (e.g. [3, 7, 29, 34]) that can ensure that the selected subset of a test suite has the same fault detection capabilities as the original test suite, the conditions under which safety can be achieved do not always hold [28, 29]. Test case prioritization techniques [31, 36] provide another method for assisting with regression testing.^1 These techniques let testers order their test cases so that those test cases with the highest priority, according to some criterion, are executed earlier in the regression testing process than lower priority test cases. For example, testers might wish to schedule test cases in an order that achieves code coverage at the fastest rate possible, exercises features in order of expected frequency of use, or exercises subsystems in an order that reflects their historically demonstrated propensity to fail. When the time required to reexecute an entire test suite is short, test case prioritization may not be cost-effective — it may be sufficient simply to schedule test cases in any order. When the time required to execute an entire test suite is sufficiently long, however, test case prioritization may be beneficial, because in this case, meeting testing goals earlier can yield meaningful benefits. Because test case prioritization techniques do not themselves discard test cases, they can avoid the drawbacks that can occur when regression test selection and test suite minimization discard test cases. Alternatively, in cases where the discarding of test cases is acceptable, test case prioritization can be used in conjunction with regression test selection or test suite minimization techniques to prioritize the test cases in the selected or minimized test suite. Further, test case prioritization can increase the likelihood that, if regression testing activities are unexpectedly terminated, testing time will have been spent more beneficially than if test cases were not prioritized. In this paper, we describe several techniques for prioritizing test cases for regression testing. We then describe several empirical studies we performed with these techniques to evaluate their ability to improve rate of fault detection — a measure of how quickly faults are detected within the testing process. 
An improved rate of fault detection during regression testing provides earlier feedback on a system under test and lets debugging activities begin earlier than might otherwise be possible. Our results indicate that test case prioritization can significantly improve the rate of fault detection of test suites. Our results also highlight several cost-benefit tradeoffs between various techniques.
Footnote 1: Some test case prioritization techniques may be applicable during the initial testing of software [1]. In this paper, however, we are concerned only with regression testing. Section 2 discusses other applications of prioritization and related work on prioritization in further detail.
efficient solution to the knapsack problem.^2 Similarly, given a function f that quantifies whether a test suite detects faults at the fastest rate possible, a precise solution to the test case prioritization problem would provide a solution to the halting problem. In such cases, prioritization techniques must be heuristics.
Third, test case prioritization can be used either in the initial testing of software or in the regression testing of software. One difference between these two applications is that, in the case of regression testing, prioritization techniques can use information gathered in previous runs of existing test cases to help prioritize the test cases for subsequent runs.
Fourth, it is useful to distinguish two varieties of test case prioritization: general test case prioritization and version-specific test case prioritization. In general test case prioritization, given program P and test suite T, we prioritize the test cases in T with the intent of finding an ordering of test cases that will be useful over a succession of subsequent modified versions of P. Thus, general test case prioritization can be performed following the release of some version of the program during off-peak hours, and the cost of performing the prioritization is amortized over the subsequent releases. It is hoped that the resulting prioritized suite will be more successful than the original suite at meeting the goal of the prioritization, on average over those subsequent releases. In contrast, in version-specific test case prioritization, given program P and test suite T, we prioritize the test cases in T with the intent of finding an ordering that will be useful on a specific version P′ of P. Version-specific prioritization is performed after a set of changes have been made to P and prior to regression testing P′. Because this prioritization is accomplished after P′ is available, care must be taken to keep the cost of performing the prioritization from excessively delaying the very regression testing activities it is intended to facilitate. The prioritized test suite may be more effective at meeting the goal of the prioritization for P′ in particular than would a test suite resulting from general test case prioritization, but may be less effective on average over a succession of subsequent releases. Typically — though not necessarily — general test case prioritization does not use information about specific modified versions of P, whereas version-specific prioritization does use such information. Of course, it is possible for general test case prioritization techniques to incorporate information about expected modifications to improve the average performance of prioritized test suites over a succession of program versions, and it is possible to use prioritization techniques that ignore the modified program as version-specific techniques.
Fifth, it is also possible to integrate test case prioritization with regression test selection or test suite minimization techniques – for example, by prioritizing a test suite selected by a regression test selection algorithm, or by prioritizing the minimal test suite returned by a test suite minimization algorithm.
Finally, given any prioritization goal, various prioritization techniques may be applied to a test suite with the aim of meeting that goal.
For example, in an attempt to increase the rate of fault detection of test suites, we might prioritize test cases in terms of the extent to which they execute modules that, measured historically, have tended to fail. Alternatively, we might prioritize test cases in terms of their increasing cost-per-coverage of code components, or in terms of their increasing cost-per-coverage of features listed in a requirements specification. In any case, the intent behind the choice of a prioritization technique is to increase the likelihood that the prioritized test suite can better meet the goal than would an ad hoc or random ordering of test cases.
In this paper we restrict our attention to general test case prioritization in application to regression testing, independent of regression test selection and test suite minimization. We focus on a specific goal and function f, and we evaluate the abilities of several prioritization techniques to help us meet this goal.
Footnote 2: Informally, the knapsack problem is the problem of, given a set U whose elements each have a cost and a value, and given a size constraint and a value goal, finding a subset U′ of U such that U′ meets the given size constraint and the given value goal. For a more formal treatment see [11].
Our focus in this paper is the first goal listed at the beginning of Section 2: the goal of increasing the likelihood of revealing faults earlier in the testing process. Informally, we describe this goal as one of improving our test suite’s rate of fault detection: we describe a function f that quantifies this goal in Section 3.2. As we suggested in Section 1, there are several motivations for meeting this goal. An improved rate of fault detection during regression testing can let software engineers begin their debugging activities earlier than might otherwise be possible, speeding the release of the software. An improved rate of fault detection can also provide faster feedback on the system under test, and provide earlier evidence when quality goals have not been met, allowing strategic decisions about release schedules to be made earlier than might otherwise be possible. Further, in a testing situation in which the amount of testing time that will be available is uncertain (for example, when market pressures may force a release of the product prior to execution of all test cases), such prioritization can increase the likelihood that whenever the testing process is terminated, testing resources will have been spent more cost-effectively in relation to potential fault detection than they might otherwise have been. In this paper, we consider nine different test case prioritization techniques (see Table 1). The first three techniques serve as experimental controls (though not actually “techniques” in a practical sense, we refer to them as such to simplify the presentation.) The last six techniques represent heuristics that could be implemented using software tools; all of these techniques use test coverage information, produced by prior executions of test cases, to prioritize test cases for subsequent execution. A source of motivation for such approaches is the conjecture that the availability of test execution data can be an asset; however, such approaches also make the assumption that past test execution data can be used to predict, with sufficient accuracy, subsequent execution behavior. In practice, code modifications made to create a new version may alter test execution patterns; an issue impacting the efficacy of test case prioritization techniques is whether these alterations will significantly impact the predictive value of past execution data. We next describe the nine techniques listed in Table 1 in turn.
M1: No prioritization. To facilitate our empirical studies, one prioritization technique that we consider is simply the application of no technique; this lets us consider "untreated" test suites and serves as a control.
M2: Random prioritization. The success of an untreated test suite in meeting a goal may depend upon the manner in which the test suite is initially constructed. Therefore, as an additional control in our studies, we apply random prioritization, in which we randomly order the test cases in a test suite.
[Table 2 matrix: test cases t1, t2, and t3 versus faults 1-4, with an X marking each fault a test case detects; each test case detects two of the four faults.]
Table 2: A case in which the greedy “optimal” prioritization algorithm may not produce an optimal solution.
M4: Total statement coverage prioritization. By instrumenting a program we can determine, for any test case, which statements in that program were exercised (covered) by that test case. We can then prioritize test cases in terms of the total number of statements they cover, by counting the number of statements covered by each test case, and then sorting the test cases in descending order of that number. (When multiple test cases cover the same number of statements, an additional rule is necessary to order these test cases; we order them randomly.) To illustrate, Figure 1 depicts a procedure (left), and the statement coverage of the executable statements in that procedure achieved by three test cases (center). Applied in this case, total statement coverage prioritization yields test case order (3, 1, 2).
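As a concrete illustration of this counting-and-sorting step, the following is a minimal sketch in Python; it is not the authors' tooling, and the coverage sets shown are hypothetical values chosen to be consistent with the orderings described in the text rather than the actual Figure 1 data.

```python
import random

def total_statement_coverage_order(coverage):
    """Order test case ids by total number of statements covered, descending.
    coverage maps each test case id to the set of statements it covers;
    ties are broken randomly, as described above."""
    ids = list(coverage)
    random.shuffle(ids)  # randomize first so the stable sort breaks ties randomly
    return sorted(ids, key=lambda t: len(coverage[t]), reverse=True)

# Hypothetical coverage sets (not the actual Figure 1 data) in which test case 3
# covers the most statements, so total statement coverage ordering is (3, 1, 2).
coverage = {1: {1, 2, 5, 6, 7}, 2: {1, 2, 3, 4}, 3: {1, 2, 3, 5, 6, 7, 8, 9}}
print(total_statement_coverage_order(coverage))  # [3, 1, 2]
```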
[Figure 1 panels: procedure P (statements 1-9, including an if/else construct); a statement-coverage matrix over statements 1-9 for test cases 1-3; and a branch-coverage matrix for test cases 1-3 over the entry branch and the 2-true, 2-false, 3-true, and 3-false outcomes.]
Figure 1: Procedure P , and the statement and branch coverage of P achieved by three test cases.
For a test suite containing m test cases, and a program containing n statements, total statement coverage prioritization can be accomplished in time O(m n + m log m). (The first term denotes the time required to count the statements covered by each test case, and the second term denotes the time required to sort the test cases according to coverage.) Typically, n is greater than m, in which case the cost of this prioritization is O(m n).
Note that our measure of total statement coverage does not consider repetition in coverage in its calculation. That is, a statement that is executed once is treated the same as a statement that, due to looping, is executed multiple times. This treatment, however, is the treatment that underlies code-coverage-based testing techniques generally. Alternative measures could consider execution counts; we leave investigation of such alternatives as a subject for future work.
M5: Additional statement coverage prioritization. Total statement coverage prioritization schedules test cases in the order of total coverage achieved; however, having executed a test case and covered certain statements, more may be gained in subsequent testing by executing statements that have not yet been covered. Additional statement coverage prioritization iteratively selects a test case that yields the greatest statement coverage, then adjusts the coverage information on all remaining test cases to indicate their coverage of statements not yet covered, and repeats this process until all statements covered by at least one test case have been covered. (When multiple test cases cover the same number of statements not yet covered, an additional rule is necessary to choose one of these test cases; we do this randomly.) Having ordered a subset of the test cases in a test suite in this manner, we may reach a point where each statement has been covered by at least one test case, and the remaining unprioritized test cases cannot add additional statement coverage. We could order these remaining test cases next using any prioritization technique; in this work we order the remaining test cases by reapplying additional coverage prioritization (i.e., by resetting the coverage vectors for all of these test cases to their initial values, and reapplying the algorithm ignoring all previously prioritized test cases).
For illustration, consider Figure 1. In this example, both total and additional statement coverage prioritization select test case 3 first; however, whereas total coverage prioritization selects test case 1 second, additional coverage prioritization detects that test case 1 covers no statements not already covered by test case 3 and that test case 2 covers an uncovered statement, and outputs test case order (3, 2, 1).
Additional statement coverage prioritization requires coverage information for each unprioritized test case to be updated following the choice of each test case. Given a test suite containing m test cases and a program containing n statements, selecting a test case and readjusting coverage information has cost O(m n) and this selection and readjustment must be performed O(m) times. Therefore, the cost of additional statement coverage prioritization is O(m^2 n), a factor of m more expensive than total statement coverage prioritization.
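A minimal sketch of this greedy procedure follows, using the same hypothetical coverage representation as the earlier sketch; the reset step handles the point at which no remaining test case adds coverage.

```python
import random

def additional_statement_coverage_order(coverage):
    """Greedy 'additional' prioritization: repeatedly choose the test case that
    covers the most statements not yet covered; once no remaining test case adds
    coverage, reset the covered set and reapply the process to the rest."""
    remaining = list(coverage)
    random.shuffle(remaining)                # so max() below breaks ties randomly
    order, covered = [], set()
    while remaining:
        best = max(remaining, key=lambda t: len(coverage[t] - covered))
        if not coverage[best] - covered:
            if covered:                      # full coverage reached: reset and reapply
                covered = set()
                continue
            order.extend(remaining)          # leftover tests that cover nothing at all
            break
        order.append(best)
        covered |= coverage[best]
        remaining.remove(best)
    return order

# With the hypothetical coverage sets used above, this yields (3, 2, 1),
# matching the ordering described for the Figure 1 example.
```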
M6: Total branch coverage prioritization. Total branch coverage prioritization is the same as total statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements. In this context, we define branch coverage as coverage of each possible overall outcome of a (possibly compound) condition in a predicate. Thus, for example, each if or while statement must be exercised such that it evaluates at least once to true and at least once to false. To accommodate functions that contain no branches, we treat each function entry as a branch, and regard that branch as covered by each test case that causes the function to be invoked.
Because in theory branch coverage properly subsumes statement coverage [27] (e.g., a test suite that is adequate for branch coverage is necessarily adequate for statement coverage, but not vice-versa), one might conjecture that prioritization based on branch coverage should on average be at least as effective as, if not more effective than, prioritization based on statement coverage. On the other hand, the arms of a branch often contain different numbers of statements, and in this case, ordering by branches may cause less-than-ideal attention to be paid to branches that contain the most code; on this basis one might conjecture that prioritization for statement coverage would be more effective than prioritization for branch coverage.^3 To
Footnote 3: This latter possibility was suggested by one of the anonymous reviewers.
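Operationally, the branch-based techniques can reuse the same ordering procedures sketched above; only the coverage entities change. The identifiers below are illustrative, not the paper's instrumentation format.

```python
# Branch coverage is recorded per outcome of each (possibly compound) predicate,
# plus one synthetic "entry" branch per function so that branch-free functions
# still contribute coverage.  Identifiers here are illustrative only.
branch_coverage = {
    1: {("P", "entry"), ("2", True), ("3", True)},
    2: {("P", "entry"), ("2", True), ("2", False)},
    3: {("P", "entry"), ("2", False), ("3", True), ("3", False)},
}

# Branch-total and branch-additional prioritization are then just the earlier
# ordering functions applied to these sets:
#   total_statement_coverage_order(branch_coverage)
#   additional_statement_coverage_order(branch_coverage)
```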
First, by factoring in execution probabilities, sensitivity measures the probability that a fault will cause a failure relative to an input distribution. When prioritizing test cases for regression testing based on existing coverage information, however, we are interested in the probability that, if a test case executes a statement s containing a fault, that fault will propagate to output. It is possible for s to have very high [low] infection and propagation probabilities with respect to the inputs that execute it, even though it has a very low [high] execution probability relative to an input distribution. The incorporation of execution probabilities into sensitivity estimates thus distorts the measure of the likelihood that a given test case that reaches s will expose a fault in s. For the application and approach that we consider, a more appropriate measure would consider only infection and propagation.
A second drawback of sensitivity in this context involves its treatment of propagation and infection estimates. Sensitivity analysis separately calculates these estimates, and uses a conservative approach to combine them. This conservative approach is designed to reflect the worst case in which the set of data state errors that produce the infection estimate is exactly the set of data state errors that do not propagate to output, although in general this case may be unlikely to occur. This approach can result in low estimates of fault exposing potential, with a large number of statements receiving estimates of zero; these zero estimates may compromise the ability of test case prioritization techniques to create useful test case orderings.
Thus, in this work, to obtain an approximation of the fault-exposing-potential of a test case, we adopt an approach that uses mutation analysis [9, 13] to produce a combined estimate of propagation-and-infection that does not incorporate independent execution probabilities. (Mutation analysis creates a large number of faulty versions ("mutants") of a program by altering program statements, and uses these to assess the quality of test suites by measuring whether those test suites can detect those faults ("kill" those mutants).)
The approach works as follows. Given program P and test suite T, we first create a set of mutants N = {n1, n2, ..., nm} for P, noting which statement sj in P contains each mutant. Next, for each test case ti ∈ T, we execute each mutant version nk of P on ti, noting whether ti kills that mutant. Having collected this information for every test case and mutant, we consider each test case ti, and each statement sj in P, and calculate the fault-exposing-potential FEP(s, t) of ti on sj as the ratio of mutants of sj killed by ti to the total number of mutants of sj. Note that if ti does not execute sj, this ratio is zero.
To perform total FEP prioritization, given these FEP(s, t) values, we next calculate, for each test case ti ∈ T, an award value, by summing the FEP(sj, ti) values for all statements sj in P. Given these award values, we then prioritize test cases by sorting them in order of descending award value (resolving ties by random selection).
To illustrate, Figure 2 depicts the procedure P considered in our earlier discussion of coverage-based prioritization techniques, and a table listing fault-exposing-potential estimates that might be calculated for the three test cases and the statements in that procedure.
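Before turning to the Figure 2 illustration, the two steps just described (estimating FEP(s, t) from mutant kill data, then sorting by award value) can be sketched as follows. The data-structure names are hypothetical; in the studies reported later, kill data came from the Proteum mutation system.

```python
import random

def fep_from_mutation(kills, mutants_of):
    """Estimate FEP(s, t) from mutation results.
    kills: test case id -> set of mutant ids that test case killed.
    mutants_of: statement id -> set of mutant ids seeded into that statement.
    All mutants of a statement stay in the denominator, including those never
    killed (the conservative treatment of possibly-equivalent mutants)."""
    fep = {}
    for t, killed in kills.items():
        fep[t] = {s: len(killed & muts) / len(muts)
                  for s, muts in mutants_of.items() if muts}
    return fep

def total_fep_order(fep):
    """Total FEP prioritization: sort test cases by descending award value,
    the sum of a test case's FEP(s, t) estimates over all statements."""
    ids = list(fep)
    random.shuffle(ids)                      # random tie-breaking
    return sorted(ids, key=lambda t: sum(fep[t].values()), reverse=True)
```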
In this case the award value for test case t1 is 2.3, the award value for test case t2 is 2.41, the award value for test case t3 is 2.2, and total FEP prioritization outputs test case order (2, 1, 3). Total FEP prioritization may appear, like statement- and branch-coverage-based prioritization, to ignore multiple statement executions caused by looping. However, because the mutation scores with which we obtain FEP values are obtained through actual test executions, they have captured at least some of the effects of looping on fault detection.
[Figure 2 panels: procedure P (statements 1-9) and a table of FEP(s, t) values for test cases 1, 2, and 3.]
Figure 2: Procedure P, and FEP(s, t) values for three test cases.
One issue to consider with respect to the use of program mutation to approximate FEP values involves the "equivalent mutant" problem: the problem of determining whether a mutant version of a program is semantically equivalent to the original program. A semantically equivalent mutant can never be killed by any test case. The problem of identifying equivalent mutants is undecidable in general, and in practice can involve considerable human effort, and it was not feasible in our empirical studies to identify these mutants given the enormous numbers of mutants involved (over 160,000 mutants). Therefore, we considered two approaches for coping with the possible presence of these mutants.
The first approach is to consider mutants not killed by any test case used in the empirical studies to be semantically equivalent mutants, and ignore these mutants in our FEP calculations. (The number of test cases used in our empirical studies is also enormous, as we report in Section 3.4.) This approach, however, may overestimate the number of semantically equivalent mutants, and cause us to overestimate FEP values. Such overestimates may cause us to assign an inordinately high award value to any test case that executes statements containing such mutants – an award value that proclaims the test case more powerful than it is.
The second approach is to treat all mutants not killed by any test case as possibly nonequivalent, and consider those mutants in our FEP calculations. This approach may underestimate the number of semantically equivalent mutants, and cause us to underestimate FEP values. Such underestimates may cause us to assign an inordinately low award value to any test case that executes statements containing such mutants – an award value that proclaims the test case less powerful than it is. We chose the second approach due to its conservatism.
Given the FEP(s, t) values for a test suite containing m test cases and a program containing n statements, total FEP prioritization can be accomplished in time O(m n + m log m). In general, n is greater than m, in which case the cost of this prioritization is O(m n), a worst-case time analogous to that for total statement coverage prioritization. The cost of obtaining FEP(s, t) values could, however, be quite high: certainly, if these values are obtained through mutation analysis, this cost may be excessive. Thus, whereas our investigation of coverage-based prioritization techniques involves techniques that are potentially practical and applicable as presented, our investigation of FEP-based techniques should be considered exploratory. Such an exploration, however,
FEP(s, t) values. Thus, we have Caddi(t1) = 2.3, Caddi(t2) = 2.41, and Caddi(t3) = 2.2, and additional FEP prioritization, like total FEP prioritization, selects test case 2 as the first test case. Having chosen this test case, additional FEP prioritization now calculates, for each statement s, C′(s), the new confidence in that statement. Because test case 2 executes only statements 1, 2, 3, and 4, their confidence values increase while the confidence values of other statements remain 0. Figure 3 (Calculation
[Figure 3 panels, each over statements 1-9 for test cases 1-3. Calculation 1: C_addi(s) values per test case; Calculation 2: C(s) after execution of test case 2; Calculation 3: C_addi(s) values per test case.]
Figure 3: Values calculated during additional FEP prioritization for the program and test cases of Figure 2.
One difference between additional FEP prioritization and additional statement or branch coverage prioritization is that in the additional FEP prioritization algorithm, we are not likely to need to check whether "full confidence" has been achieved: it is not likely that we will reach a point at which no additional confidence can be gained for all remaining test cases. The reason for this is that, for a test case t's Caddi(t) to be 0, the C(s) for each statement covered by t must be 1, and for the C(s) for a statement to be 1, there must exist some test case t′ for which FEP(s, t′) is 1. FEP(s, t) may be estimated to be 1 in some cases, but it is unlikely that it will be estimated to be 1 for each statement covered by t. If this unlikely event did occur, we could proceed as with other "additional" coverage prioritization techniques, resetting C(s) and Caddi(t) values to their initial states for those test cases not yet prioritized and reapplying the algorithm to those test cases; however, in our empirical studies, this event did not occur.
Like additional statement coverage prioritization, additional FEP prioritization requires coverage information for each unprioritized test case to be updated following the choice of each test case. Therefore, its cost, for a test suite of m test cases and a program containing n statements, is O(m^2 n), a factor of m more expensive than total FEP prioritization. Also like total FEP prioritization, however, additional FEP prioritization requires a method for estimating FEP values, a potentially expensive requirement.
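A minimal sketch of additional FEP prioritization follows, assuming the confidence update C'(s) = C(s) + (1 - C(s)) * FEP(s, t) with Caddi(t) equal to the sum over covered statements of (1 - C(s)) * FEP(s, t). That update rule is an assumption here; it is consistent with the properties noted above (the initial Caddi values equal the total-FEP award values, and Caddi(t) can reach 0 only if C(s) = 1 for every statement t covers), but it is not quoted from the surviving text.

```python
import random

def additional_fep_order(fep):
    """Additional FEP prioritization (sketch).
    fep: test case id -> {statement id: FEP(s, t) estimate}.
    Assumed update after choosing test case t (see lead-in):
        C'(s) = C(s) + (1 - C(s)) * FEP(s, t)
        C_addi(t) = sum over s of (1 - C(s)) * FEP(s, t)."""
    remaining = list(fep)
    random.shuffle(remaining)                     # random tie-breaking
    confidence = {}                               # C(s); a missing key means 0
    order = []

    def c_addi(t):
        return sum((1.0 - confidence.get(s, 0.0)) * v for s, v in fep[t].items())

    while remaining:
        best = max(remaining, key=c_addi)
        if c_addi(best) == 0.0:
            if confidence:                        # the unlikely "full confidence" case:
                confidence = {}                   # reset and reapply to the rest
                continue
            order.extend(remaining)               # tests with no FEP contribution at all
            break
        order.append(best)
        remaining.remove(best)
        for s, v in fep[best].items():
            c = confidence.get(s, 0.0)
            confidence[s] = c + (1.0 - c) * v
    return order
```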
In [1], Avritzer and Weyuker present techniques for generating test cases that apply to software that can be modeled by Markov chains, provided that operational profile data is available. Although the authors do not use the term "prioritization", their techniques generate test cases in an order that can cover a larger proportion of the probability mass earlier in testing, essentially prioritizing the test cases in an order that increases the likelihood that faults more likely to be encountered in the field will be uncovered earlier in testing. The approach provides an example of the application of prioritization to the initial testing of software, when test suites are not yet available.
In [36], Wong et al. suggest prioritizing test cases according to the criterion of "increasing cost per additional coverage". Although not explicitly stated by the authors, one possible goal of this prioritization is to reveal faults earlier in the testing process. The authors restrict their attention, however, to prioritization of test cases for execution on a specific modified version of a program (what we have termed "version-specific prioritization"), and to prioritization of only the subset of test cases selected by a safe regression test selection technique from the test suite for the program. The authors do not specify a mechanism for prioritizing remaining test cases after full coverage has been achieved. The authors describe a case study in which they applied their technique to a program of over 6000 lines of executable code (the same program, space, that we use in two of the empirical studies reported in this paper), and evaluated the resulting test suites against ten faulty versions of that program. They conclude that the technique was cost-effective in that application.
To investigate test case prioritization and to compare and evaluate the test case prioritization techniques described in Section 2, we performed several empirical studies.^4 This section describes those studies, including design, measures, subjects, results, and threats to validity.
We are interested in the following research questions.
Q1: Can test case prioritization improve the rate of fault detection of test suites?
Q2: How do the various test case prioritization techniques presented in Section 2 compare to one another in terms of effects on rate of fault detection?
To address our research questions, we require a measure with which to assess and compare the effectiveness of various test case prioritization techniques. (In terms of Definition 1, this measure plays the role of the function f.) As a measure of how rapidly a prioritized test suite detects faults, we use a weighted average of
Footnote 4: The subjects (programs, program versions, test cases, and test suites) used in these studies, and the data sets collected, can be obtained by contacting the first author.
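The APFD (Average Percentage of Faults Detected) values reported in Figure 5 and Table 4 follow this weighted-average measure. A sketch of its usual formulation is given below; it assumes a test suite of n test cases, a set of m faults, and TF_i denoting the position in the prioritized order of the first test case that reveals fault i, and is supplied as a reconstruction of the standard definition rather than a quotation.

```latex
% APFD: higher values mean faster fault detection by the prioritized suite.
\[
  \mathit{APFD} \;=\; 1 \;-\; \frac{TF_1 + TF_2 + \cdots + TF_m}{n\,m} \;+\; \frac{1}{2n}
\]
```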
in use, and how test case prioritization fits into that process. With general test case prioritization (the variety of prioritization that we are investigating), prioritization can be performed "off-line", following a release of a system, at a time when resource usage may be noncritical (provided it falls below a certain threshold). The cost of performing this prioritization can then be amortized over successive releases of the software. The cost-benefit tradeoffs of using such prioritization techniques will vary with the process used and the resources available, and a single measure incorporating both costs and benefits could obscure cost-effectiveness analyses that might apply under particular processes. Thus, instead of measuring and reporting run-time costs, we have provided overall complexity analyses of test case prioritization techniques, and we use these in Section 4 when we discuss practical implications of our results.
To perform our empirical studies, we required several tools. To obtain test coverage and control-flow graph information, we used the Aristotle program analysis system [16]. To obtain mutation scores for use with FEP prioritization we used the Proteum mutation system [8]. We created prioritization tools that implement the techniques outlined in Section 2.
We used eight C programs as subjects (see Table 3). The first seven programs, with faulty versions and test cases, were assembled by researchers at Siemens Corporate Research for a study of the fault detection capabilities of control-flow and data-flow coverage criteria [19]. We refer to these as the Siemens programs. The eighth program is a program developed for the European Space Agency. We refer to this program as space. We further discuss the Siemens programs and space in the following subsections.
Program          Lines of Executable Code   Number of Versions   Number of Mutants   Test Pool Size   Average Test Suite Size
print tokens     402                        7                    4030                4130             16
print tokens2    483                        10                   4346                4115             12
replace          516                        32                   9622                5542             19
schedule         299                        9                    2153                2650             8
schedule2        297                        10                   2822                2710             8
tcas             138                        41                   2876                1608             6
tot info         346                        23                   5898                1052             7
space            6218                       35                   132163              13585            155
Table 3: Experiment subjects.
3.4.1 Siemens programs, versions, and test suites
The Siemens programs perform a variety of tasks: tcas is an aircraft collision avoidance system, schedule and schedule2 are priority schedulers, tot info computes statistics given input data, print tokens and print tokens2 are lexical analyzers, and replace performs pattern matching and substitution.
The researchers at Siemens sought to study the fault detecting effectiveness of coverage criteria. Therefore, they created faulty versions of the seven base programs by manually seeding those programs with faults, usually by modifying a single line of code in the program. Their goal was to introduce faults that were as realistic as possible, based on their experience with real programs. Ten people performed the fault seeding, working "mostly without knowledge of each other's work" [19, p. 196]. The result of this effort was between 7 and 41 versions of each base program (see Table 3), each containing a single fault. In this context, the use of single-fault versions is an important experiment design choice that allows experimenters to precisely determine whether a test case reveals a particular fault simply by determining whether the version containing that fault fails. In the absence of this methodology, it may be difficult or impossible to associate test cases with particular faults. This choice does, however, pose a potential threat to validity; we discuss this further in Section 3.6.
For each base program, the researchers at Siemens created a large test pool containing possible test cases for the program. To populate these test pools, they first created an initial suite of black-box test cases "according to good testing practices, based on the tester's understanding of the program's functionality and knowledge of special values and boundary points that are easily observable in the code" [19, p. 194], using the category partition method and the Siemens Test Specification Language tool [2, 25]. They then augmented this suite with manually-created white-box test cases to ensure that each executable statement, edge, and definition-use pair in the base program or its control-flow graph was exercised by at least 30 test cases. To obtain meaningful results with the seeded versions of the programs, the researchers retained only faults that were "neither too easy nor too hard to detect" [19, p. 196], which they defined as being detectable by at most 350 and at least 3 test cases in the test pool associated with each program.
To obtain sample test suites for these programs, we used the test pools for the base programs and test-coverage information about the test cases in those pools to generate 1000 branch-coverage-adequate test suites for each program. More precisely, to generate a test suite T for base program P from test pool Tp, we used the C pseudo-random-number generator rand, seeded initially with the output of the C time system call, to obtain integers that we treated as indexes into Tp (modulo |Tp|). We used these indexes to select test cases from Tp; we added a selected test case t to T only if t added to the cumulative branch coverage of P achieved by the test cases added to T thus far. We continued to add test cases to T until T contained at least one test case that would exercise each executable branch in the base program. Table 3 lists the average sizes of the branch-coverage-adequate test suites generated by this procedure for the subject programs.
Using Proteum, we generated mutants for the Siemens programs. Table 3 reports the numbers of mutant programs thus created.
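A sketch of this suite-generation procedure is shown below, with Python's random module standing in for the C rand/time calls; the function name and data structures are illustrative, not the authors' scripts.

```python
import random

def generate_branch_adequate_suite(pool, branch_cov, seed=None):
    """Build one branch-coverage-adequate suite T from test pool Tp.
    pool: list of test case ids (Tp); branch_cov: test id -> set of branches covered.
    A randomly drawn test case is added only if it adds to the cumulative branch
    coverage; drawing stops once every branch covered by the pool is covered.
    (The original procedure used C's rand() seeded from time().)"""
    rng = random.Random(seed)
    coverable = set().union(*(branch_cov[t] for t in pool))
    suite, covered = [], set()
    while covered < coverable:
        t = pool[rng.randrange(len(pool))]   # index into Tp, modulo |Tp|
        new = branch_cov[t] - covered
        if new:
            suite.append(t)
            covered |= new
    return suite
```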
3.4.2 Space, versions, and test suites
Space consists of 9564 lines of C code (6218 executable), and functions as an interpreter for an array definition language (ADL). The program reads a file that contains several ADL statements, and checks the contents of the file for adherence to the ADL grammar and to specific consistency rules. If the ADL file is correct, space outputs an array data file containing a list of array elements, positions, and excitations; otherwise the program outputs error messages.
An initial indication of how each prioritization technique affected a test suite's rate of fault detection in this study can be determined from Figure 5, which presents boxplots^7 of the APFD values of the 9 categories of prioritized test suites for each program and an all-program total. M1 is the control group. M2 is the random prioritization group. M3 is the optimal prioritization group. Comparing the boxplots of M3 to those of M1 and M2, it is readily apparent that optimal prioritization greatly improved the rate of fault detection (i.e., increased APFD values) of the test suites in comparison to no prioritization and random prioritization. Examining the boxplots of the other prioritization techniques, M4 through M9, it seems that all produce some improvement. However, the overlap in APFD values mandates formal statistical analysis.
Using the SAS statistical package [10] to perform an ANOVA analysis,^8 we were able to reject the null hypothesis that the APFD means for the various techniques were equal (α=.05), confirming our boxplot observations. However, the ANOVA analysis indicated statistically significant cross-factor interactions: programs have an effect on APFD values. Thus, general statements about technique effects must be qualified.
While rejection of the null hypothesis tells us that some techniques produce statistically different APFD means, to determine which techniques differ from each other requires running a multiple-comparison procedure [26]. Of the commonly used means separation tests, we elected to use the Bonferroni method [17] — for its conservatism and generality. Using Bonferroni, we calculated the minimum statistically significant difference between APFD means for each program. These are given in Table 4. The techniques are listed within each program sub-table by their APFD mean values, from higher (better) to lower (worse). Grouping letters partition the techniques; techniques that are not significantly different share the same grouping letter.
Examination of these sub-tables affirms what the boxplots indicate: all of the non-control techniques provided some significant improvement in rate of fault detection in comparison to no prioritization and random prioritization. Although the relative improvement provided by each technique is dependent on the program, the All Programs sub-table does show that additional FEP prioritization performed better overall than other techniques, and that total FEP prioritization performed better than all but branch-total prioritization (and no worse than branch-total). Also, the All Programs sub-table suggests that branch-coverage-based techniques performed as well as or better than their corresponding statement-coverage-based techniques (e.g., branch-total performed as well as statement-total, and branch-additional outperformed statement-additional).
It is also interesting that in all but one case (print tokens), total branch coverage prioritization performed as well as or outperformed additional branch coverage prioritization, and in all cases, total statement coverage prioritization performed as well as or outperformed additional statement coverage prioritization. Another effect worth noting is that on five of the seven programs, and overall, randomly prioritized test suites outperformed untreated test suites. We comment further on these effects in Section 4.
Footnote 7: Boxplots provide a concise display of a distribution. The central line in each box marks the median value. The edges of the box mark the first and third quartiles. The whiskers extend from the quartiles to the farthest observation lying within 1.5 times the distance between the quartiles. Individual markers beyond the whiskers are outliers.
Footnote 8: ANOVA is an acronym for ANalysis Of VAriance, a standard statistical technique that is used to study the variability of experimental data [17].
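The analysis itself was performed in SAS; the sketch below illustrates the same general style of analysis (an ANOVA followed by Bonferroni-adjusted pairwise comparisons) on tiny hypothetical data, using scipy and statsmodels in place of SAS and simplified to a single factor rather than the cross-factor design described above.

```python
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical APFD samples per technique (the study used 1000 prioritized
# suites per program per technique; these few values are for illustration only).
apfd = {
    "untreated":  [55.1, 60.2, 58.7, 52.4],
    "random":     [61.4, 63.0, 59.8, 64.2],
    "stmt-total": [72.5, 70.1, 74.8, 69.9],
}

# One-way ANOVA: can we reject the hypothesis that all APFD means are equal?
f_stat, p_value = stats.f_oneway(*apfd.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Pairwise comparisons with a Bonferroni adjustment (a simple stand-in for the
# Bonferroni means-separation analysis reported above).
pairs = list(combinations(apfd, 2))
raw_p = [stats.ttest_ind(apfd[a], apfd[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p, sig in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {sig}")
```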
[Figure 5 boxplot panels, one per subject program (print_tokens, print_tokens2, replace, schedule, schedule2, tcas, tot_info) plus an all-programs panel; each panel's vertical axis shows APFD scores from 10 to 100, and the horizontal axis lists techniques M1 through M9.]
Figure 5: APFD boxplots for Study 1 (vertical axis is APFD score): By program, by technique. The techniques are: M1: untreated, M2: random, M3: optimal, M4: stmt-total, M5: stmt-addtl, M6: branch-total, M7: branch-addtl, M8: FEP-total, M9: FEP-addtl.