This website contains complementary data to the submitted paper.

Code smells in a program represent indications of structural quality problems, which can be addressed by software refactoring. However, refactoring intends to achieve different goals in practice, and its application may not reduce smelly structures. Developers may neglect or end up creating new code smells through refactoring. Unfortunately, little has been reported about the beneficial and harmful effects of refactoring on code smells. This paper reports a longitudinal study intended to address this gap. We analyze how often commonly-used refactoring types affect the density of 13 types of code smells along the version histories of 23 projects. Our findings are based on the analysis of 16,566 refactorings distributed in 10 different types. Even though 79.4% of the refactorings touched smelly elements, 57% did not reduce their occurrences. Surprisingly, only 9.7% of refactorings removed smells, while 33.3% induced the introduction of new ones. More than 95% of such refactoring-induced smells were not removed in successive commits, which suggest refactorings tend to more frequently introduce long-living smells instead of eliminating existing ones. We also characterized and quantified typical refactoring-smell patterns, and observed that harmful patterns are frequent, including: (i) approximately 30% of the Move Method and Pull Up Method refactorings induced the emergence of God Class, and (ii) the Extract Superclass refactoring creates the smell Speculative Generality in 68% of the cases.

Code smells are anomalies in the program structure. A code smell is a surface indication that usually corresponds to a deeper problem in the system. There are different types of code smells documented in the literature [1]. For instance, lets suppose a single class with several responsibilities. A class with this symptom is hard to read and evolve. This type of smell is known as God Class. Another common anomaly is when a method has many parameters. This code smell is known as Long Parameter List. A method with this symptom is also hard to read and evolve.

Figure 1: Person class with God Class code smell

Person class with God Class code smell

Figure 2: Class after Extract Class refactoring without God Class

Person class after Extract Class refactoring without God Class

Figures 1 and 2 show the Person class in two different forms. In Figure 1, the Person class has at least three attributes representing two loosely-coupled concepts: person and telephone number. This version of Person can be considered a God Class. In order to remove this code smell, the developer can extract part of the class structure it into another class: TelephoneNumber (Figure 2). After this transformation, the program no longer has a God Class and still realize the same functionality. As the program have a code smell in Figure 1, its new internal structure represented in Figure 2 has a better structural quality.

Projects with low structural quality are hard to read and maintain. Therefore, whenever a code smell is identified in a program, a change should be made to remove it. Refactoring is a structural change intended to remove a code smell without changing its observable behavior [1]. There are several claims associated with the benefits of software refactoring, which is all implicitly associated with its capability of improving the software structural quality.

In fact, refactoring is a common practice [3] intended to improve code quality attributes [2]. Code smells are widely used for detecting refactoring opportunities [4]. Existing studies highlight that developers can use refactoring to improve structural quality by eliminating code smells in software [5] [6]. Let’s consider the Figures 1 and 2 again to illustrate this scenario. As mentioned before, the module Person in Figure 1 has a God Class code smell. This happened because this class is accumulating responsibilities that should be fulfilled by two classes. In order to address this smell, the Extract Class refactoring should be performed, i.e., a programmer could create a new class and move the relevant fields and methods from the old class into the new one. This refactoring, in this case, would remove the God Class code smell. Since the number of code smells would be reduced, this refactoring would improve the software structural quality.

Listing 1: Deserializer method using a temp variable

public Double deserialize(JsonParser j) {
    JsonNode node = j.getCodec().readTree();
    return node.get("x").val();
}

Listing 2: Deserializer method after Inline Temp refactoring

public Double deserialize(JsonParser j) {
    return j.getCodec().readTree().get("x").val();
}

There is a list of refactoring types documented in literature [1]. Figures 1 and 2 illustrate the Extract Class refactoring type. Another refactoring type is the Inline Temp. In order to use this refactoring, a temporary variable assigned to once with a simple expression must exist. Listing 1 presents a method responsible for deserializing a piece of JSON code into a Java object. This method uses a temporary variable (node) to return the desired value. This variable is used only in the next line of the method. Therefore, a developer could apply the Inline Temp refactoring to get rid of this variable. As result of this refactoring, the node variable disappears and the resulting method only contains one line (Listing 2).

In this case, a new code smell was introduced by the code change being accomplished by the programmer. The code presented in Listing 1 has no code smells, but the resulting code presented in Listing 2 has one. A call to getCodec method is firstly made. After that, three additional method calls are made in the same chain. This is a code smell know as Message Chain. Navigating through method calls this way means the code is coupled to the structure of the navigation. Therefore, any change in the object graph structure would affect this piece of code. In practice, refactoring may not be effective in some cases as this example illustrates. In this way, refactoring may not improve or maintain the structural quality and, worse, it may introduce new code smells. Lets consider again Listings 1 and 2. In this scenario, the Inline Temp refactoring removed an apparently unnecessary temporary variable, but introduced a new code smell on the software. Consequently, this refactoring degraded the program structural quality. This is an example of how refactoring may not be effective.

Both aforementioned examples presented cases where refactorings have had interference on the existence of code smells. Code smells were removed in the first case (Figures 1 and 2). In the second case, a new code smell instance emerged. These scenarios show how refactoring can interfere positively or negatively in the existence of code smells. The application of a refactoring may contribute to the degradation of structural quality (Listings 1 and 2). In this way, we can classify a single refactoring by looking how it changed the existing code smells. In order to show how this can be made, lets formally define it.

Let S = {s1,s2,···,sn} be the set of software projects to be evaluated. Each software s has a set of versions V(s) = {v1, v2, … , vm}. Each version vi has a set of elements E(vi) = {e 1, e2, …} representing all methods, classes and fields belonging to it. In Figures 1 and 2, the set S = {PhoneBook} is composed just by one software. This software PhoneBook has two versions V(PhoneBook) = {va, v b} represented by Figures 1 and 2, respectively. Finally, each version vii has a set of elements E(vi) composed by Person and TelephoneNumber classes, methods and fields.

In order to be able to detect refactorings we must analyze transformations between each subsequent pair of versions. In this way, we assume R is a refactoring detection function where R(vi, vi+1) = {r1(rt1; e 1), …, rk(rtk; ek)} gives us a set of tuples composed by two elements: the refactoring type (rt) and the set of refactored elements represented by e. So, the function R give us all refactoring activities detected in a pair of software versions. If we apply the R function in the PhoneBook software, the result would be R(va,vb) = {r1(Extract Class,e')} where e' = {Person, TelephoneNumber}.

Let CS be a code smell detection function where CS(e) = {cs1,…} returns a set of code smells present in a software element set e. As the objective is to classify refactorings, the classifying procedure will only analyze code smells related to elements affected by a refactoring. In this way, we can say that CSb[r](e) is the set of code smells of e before the application of the refactoring r. On the other hand, CSa[r](e) is the set of code smells found after the application of r. Considering r1 refactoring applied on classes of PhoneBook system, CS function would present the following results: CSb[r1]({Person}) = {God Class} and CSa[r1]({Person,TelephoneNumber}) = ∅.

Using data collected by the functions defined before, it is possible to classify a refactoring by looking how it interferes in existing code smells. Suppose e is a software element set, r is a refactoring activity and |CSb[r](e)|= x. After r refactoring, |CSa[r](e)|= y. Depending on x and y, it is possible to classify r. If x > y, r reduced the number of code smells on e and, because of that, r is a positive refactoring. In other way, if x < y, r increased the number of code smells on e and, because of that, r is a negative refactoring. When x = y, r is a neutral refactoring. This classification is summarized in the following table.

Condition Classification
|CSb[r](e)| > |CSa[r](e)| Positive Refactoring
|CSb[r](e)| < |CSa[r](e)| Negative Refactoring
|CSb[r](e)| = |CSa[r](e)| Neutral Refactoring

In this work, it is considered as refactored elements all those directly affected by the refactoring. If a refactoring is applied only in a method body, only this method is considered as refactored element. For instance, lets consider the Move Method refactoring. In this refactoring type, a method m is moved from class A to B. Hence, the considered refactored elements in this case would be {m, A, B}. All m method callers are affected by this refactoring, but we do not consider them as refactored elements. As another example, let us consider the Rename Method refactoring. In this scenario, a new name is given to the method m and the refactored element set would be just {m}. For each refactoring type a different refactored element set is used. In the following table, we present the considered refactored elements for each type of refactoring.

Refactoring Refactored Elements
Extract Interface classes implementing the new interface.
Extract Method (i) method created; (ii) method from where the new method was extracted; and (iii) class containing both methods.
Extract Superclass (i) classes extending the new class; and (ii) new class created.
Inline Method (i) the method which received the new code; and (ii) class containing the method.
Move Field the two classes affected by the change: the class which the field used to reside and the class which received the field.
Move Method the two classes affected by the change: the class which the method used to reside and the class which received the method.
Pull Up Field the two classes affected by the change: the class which the field used to reside and the class which received the field.
Pull Up Method the two classes affected by the change: the class which the method used to reside and the class which received the method.
Push Down Field the two classes affected by the change: the class which the field used to reside and the class which received the field.
Push Down Method the two classes affected by the change: the class which the method used to reside and the class which received the method.
Rename Method the renamed method and the class that contains it.

The following Table presents the sample sizes of the refactorings manually analyzed by type and the precision obtained to each one. In general, it was observed a high precision for each refactoring type, with a median of 88.36% (excluding rename method). The precision found in all refactoring types are close/inside the standard deviation (7.73). Applying the Grubb outlier test (alpha=0.05) we could not find any outlier, indicating that no refactoring type is strongly influencing the median precision found. Thus, the results found to the representative sample analyzed represent a key factor to provide reliability to the other results reported in this work.

Ref. type pop. size sample size precision
Rename method 12,752 373 95.17%
Extract method 7,517 366 80.60%
Move field 4,356 353 96.88%
Inline method 1,528 307 75.57%
Move method 1,404 302 88.08%
Pull up method 629 239 78.66%
Pull up field 465 211 98.58%
Extract superclass 342 181 93.92%
Extract interface 133 99 87.88%
Push down method 114 88 88.64%
Push down field 78 65 96.92%

We used Eclipse and the eGit plugin to categorize a transformation (root or floss). We selected a sample containing 730 transformations, and added in our web site. For each transformation in our sample, three authors of the paper manually categorize them. We selected the id commit of the resulting program, and used eGit to find it. By using the diff tool, we manually analyzed all changes in all classes modified in the scope. When a behavioral change was identified, we filled a form explaining it and classified the change as floss. When we did not find a behavioral change, we searched for other refactoring activities and indicated in the form.

# Artefact Description
1 Projects Analyzed Name, URL, number of versions and LOC of all analyzed projects. We provide the date and hash of the last commit analyzed
2 Submited Paper Complete text submited to FSE 2017
3 Non-Removal Patterns This file contains the complete list of non-removal patterns. It includes a table with four columns. The first one (ref_name) contains the Refactoring Type. The second one (smell_name), the code smell type. The third column (touches) presents the number of times the refactoring type reported in the first column was applied in system elements containing the smell of the second column. The fourth column (all_refs) presents the number of times that we observed the referred refactoring type in our entire dataset. The last column presents the percentage (touches/all_refs).

Lanza et al. originally proposed the rules presented in the following tables. Macia et al. adapted these rules and evaluated the thresholds in another work. Two sets of thresholds were used in order to collect code smells using these rules. The first set, as known as tight, represent the thresholds previously validated in the study by Macia et al.. We quality this strategy as tight because it relies on the use of high thresholds values aiming to detect only critical code smells across the projects. The second strategy, named as relaxed, uses relaxed thresholds aiming to detect as many code smells as possible. These relaxed thresholds represent lower bound thresholds, which were tuned through our validation phase. These two strategies allowed to derive refactoring classifications based on two different sets of code smells. The following table presents all thresholds used during code smell detection phase.

Tight thresholds.
# Code Smell Rule
1 Long Method (LOC > 50) and (CC > 5)
2 Shotgun Surgery (CC > 4) and (FanOut > 7)
3 Feature Envy (CC > 4) and (FanOut > 4) and (LCOM < 30%)
4 Divergent Change: (FanIn > 10) and (LCOM < 50%) and (CC > 4)
5 God Class [(LOC > 150) and (CBO > 6)] or [(NOM > 15) and (CBO > 6)]
Relaxed thresholds.
# Code Smell Rule
1 Long Method (LOC > 50) and (CC > 3)
2 Shotgun Surgery (CC > 3) and (FanOut > 4)
3 Feature Envy (CC > 3) and (FanOut > 3) and (LCOM < 50%)
4 Divergent Change: (FanIn > 5) and (LCOM < 60%) and (CC > 3)
5 God Class [(LOC > 150) and (CBO > 6)] or [(NOM > 15) and (CBO > 6)]

As described in the paper, we also used another set of rules to detect code smells. The third set of rules and thresholds was proposed by Bavota et al. The following table presents the rules extracted from this paper.

Bavota's Rules
# Code Smell Rule
1 Class data should be private A class having at least one public field.
2 Complex class A class having at least one method for which McCabe cyclomatic complexity is higher than 10.
3 Feature envy All methods having more calls with another class than the one they are implemented.
4 God class All classes having (i) cohesion lower than the average of the system AND (ii) LOCs > 500.
5 Lazy class All classes having LOCs lower than the first quartile of the distribution of LOCs for all system’s classes.
6 Long method All methods having LOCs higher than the average of the system.
7 Long parameter list All methods having a number of parameters higher than the average of the system.
8 Message chain All chains of methods’ calls longer than three.
9 Refused bequest All classes overriding more than half of the methods inherited by a superclass.
10 Spaghetti code A class implementing at least two long methods (see previous rule) interacting between them through method calls or shared fields.
11 Speculative generality A class declared as abstract having less than three children classes using its methods.

The threshold of 15% was selected for the following reason: (i) most of the refactoring types occur at least 10 times in each project of our dataset, and (ii) if a refactoring-smell pattern occur more than once (e.g. 1,5 times) in each project (in average), we consider that there is a non-ignorable chance the pattern occurrence is not accidental. In fact, we observed that several refactoring-smell patterns (with a frequency higher than 15%) occur more than once in several projects. We added this explanation in the companion website. Unfortunately, we have no space left to add this in the paper.

Any question/suggestion please contact the authors of this work.

# Name E-mail
1 Diego Cedrim dcgrego@inf.puc-rio.br
2 Alessandro Garcia afgarcia@inf.puc-rio.br
3 Melina Mongiovi melina@copin.ufcg.edu.br
4 Rohit Gheyi rohit@dsc.ufcg.edu.br
5 Leonardo Sousa lsousa@inf.puc-rio.br
6 Rafael de Mello rmaiani@inf.puc-rio.br
7 Baldoino Fonseca baldoino@ic.ufal.br
8 Márcio Ribeiro marcio@ic.ufal.br
9 Alexander Chávez alopez@inf.puc-rio.br

[1] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts, Refactoring: Improving the Design of Existing Code, 1 edition. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1999.

[2] F. Bourquin and R. K. Keller, “High-impact Refactoring Based on Architecture Violations,” in 11th European Conference on Software Maintenance and Reengineering (CSMR’07), 2007, pp. 149–158.

[3] Z. Xing and E. Stroulia, “Refactoring Practice: How it is and How it Should be Supported - An Eclipse Case Study,” in 2006 22nd IEEE International Conference on Software Maintenance, 2006, pp. 458–468.

[4] M. Zhang, T. Hall, and N. Baddoo, “Code Bad Smells: a review of current knowledge,” J. Softw. Maint. Evol. Res. Pract., vol. 23, no. 3, pp. 179–202, Apr. 2011.

[5] R. Kolb, D. Muthig, T. Patzke, and K. Yamauchi, “A case study in refactoring a legacy component for reuse in a product line,” in 21st IEEE International Conference on Software Maintenance (ICSM’05), 2005, pp. 369–378.

[6] P. Meananeatra, “Identifying refactoring sequences for improving software maintainability,” in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering - ASE 2012, 2012, p. 406.