


Created winter 2018. Last updated August 2020
In a linear model with categorical variables, the table of model parameter estimates can be difficult to interpret. One approach to understand these estimates is to calculate the estimated marginal means (sometimes referred to as least square means, predicted means, or expected means). Most statistical software packages offer procedures to obtain predictions of the response variable for the different levels of categorical variables after fitting linear models. However, these procedures should be used carefully as the results obtained can be very different depending on the statistical software package used.
Consider a simulated dataset containing information about employees of a company, with information on their salary, age, gender, and job category. The continuous variables are summarized in Table 1; the categorical variables are summarized in Tables 2 and 3.

Table 1: Mean and standard deviation of the continuous variables in the employee dataset

Variable    Mean       SD
Salary      6806.43    3148.
Age         39.16      45.

Table 2: Summary of the job category variable

Value            Count    Proportion
0 (clerical)     227      0.
1 (trainee)      168      0.
2 (security)     32       0.
3 (technical)    47       0.

Table 3: Summary of the gender variable

Value         Count    Proportion
0 (male)      258      0.
1 (female)    216      0.
In this newsletter, we will investigate the relationship between salary and gender, controlling for job category and age. Table 4 contains the results of a linear model with salary as the dependent variable and gender, job category, and age as predictor variables. Note that in our example we apply dummy coding to the categorical variables, taking the lowest level of each categorical variable as the reference level (i.e., male (0) for gender and clerical (0) for job category). For more information about dummy coding, please refer to our Dummy and Effect Coding Newsletter (statnews #72).

Table 4: Linear model summary with salary as the response and age, gender, and job category as predictors

Coefficient            Estimate    SE        p-value
Intercept (β0)         6963.73     235.94    <0.
Gender: female (β1)    -2456.7     240.92    <0.
Age (β2)               0.81        2.52      0.
Job: trainee (β3)      1302.53     254.02    <0.
Job: security (β4)     167.83      481.22    0.
Job: technical (β5)    4613.43     407.11    <0.

The coefficients obtained from the linear model are used to calculate the estimated marginal means. For gender, our independent variable of interest, 0 represents a male subject while 1 represents a female subject. But what values are used for the other variables in the model, age and job category? For continuous variables like age, marginal means procedures typically substitute the overall mean value (unless the user specifies otherwise); in our example, the average age is 39.16. For categorical variables, some software packages calculate marginal means as if the data came from a balanced population, while others assume an unbalanced population. The term "balanced population" means that the sample is split uniformly across the levels of the categorical variable; for the four-level variable job category, that would mean that 25 percent of the population falls into each level.
Thus, the predicted salary values obtained for each job category would be weighted equally when calculating the marginal mean for each gender. For an unbalanced population, the predicted salaries would be weighted according to the distribution of jobs in the data (see proportions in Table 2). We see that our data is not balanced in terms of the job category variable. The job category percentages range from 6.75 to 47.89 percent in the sample. Below we show how different software packages treat this categorical variable when calculating marginal means—specifically, whether they assume a balanced or unbalanced population.
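The two weighting schemes described above can be sketched in a few lines of Python. This is only an illustration: the counts are copied from Table 2, and the variable names (`job_counts`, `balanced_weights`, `observed_weights`) are hypothetical, not taken from any statistical package.

```python
# Job-category counts copied from Table 2 of the newsletter.
job_counts = {"clerical": 227, "trainee": 168, "security": 32, "technical": 47}

total = sum(job_counts.values())

# Balanced population: every job category receives the same weight.
balanced_weights = {job: 1 / len(job_counts) for job in job_counts}

# Unbalanced population: weights follow the observed sample proportions.
observed_weights = {job: count / total for job, count in job_counts.items()}

for job in job_counts:
    print(
        f"{job:<10} balanced={balanced_weights[job]:.4f} "
        f"observed={observed_weights[job]:.4f}"
    )
```

The observed weights reproduce the range quoted in the text: the smallest category (security) holds about 6.75 percent of the sample and the largest (clerical) about 47.89 percent.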
In R, SAS, SPSS, and JMP, the marginal means procedure by default assumes a balanced population. To see this, we first calculate marginal means for each job category, for both male and female employees. We take the linear model equation and use the coefficients from Table 4, along with the appropriate values for gender (0 for males, 1 for females), age (the mean value, 39.16), and job category (1 for the indicated job, 0 for the others). For example, a female trainee's predicted salary would be calculated as follows:
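As a rough sketch of that calculation, the Python snippet below plugs the coefficients from Table 4 into the linear model equation and then averages the four job-category predictions with equal weights, which is what a balanced-population assumption amounts to. The function names are hypothetical, and because the coefficients are the rounded values printed in Table 4, the results may differ slightly from what a statistical package reports.

```python
# Coefficients copied from Table 4 (rounded as printed there).
INTERCEPT = 6963.73
B_FEMALE = -2456.7
B_AGE = 0.81
# Clerical is the reference level, so its coefficient is 0.
B_JOB = {"clerical": 0.0, "trainee": 1302.53, "security": 167.83, "technical": 4613.43}
MEAN_AGE = 39.16  # overall mean age from Table 1


def predicted_salary(female: int, job: str, age: float = MEAN_AGE) -> float:
    """Prediction from the dummy-coded linear model for one gender/job cell."""
    return INTERCEPT + B_FEMALE * female + B_AGE * age + B_JOB[job]


def balanced_marginal_mean(female: int) -> float:
    """Equal-weight average of the four job-category predictions."""
    predictions = [predicted_salary(female, job) for job in B_JOB]
    return sum(predictions) / len(predictions)


female_trainee = predicted_salary(female=1, job="trainee")
print(f"female trainee:                 {female_trainee:.2f}")
print(f"balanced marginal mean, male:   {balanced_marginal_mean(0):.2f}")
print(f"balanced marginal mean, female: {balanced_marginal_mean(1):.2f}")
```

Because the model has no interaction terms, the gap between the male and female marginal means equals the gender coefficient regardless of how the job categories are weighted; only the level of the two means shifts.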
These are the marginal means computed by Stata. The same marginal means can be obtained directly from the coefficients of the linear equation by replacing each job category dummy variable with its corresponding proportion in our sample. For males, we have
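The substitution described above can be sketched as follows: each job category dummy variable is replaced by that category's sample proportion (count divided by the total from Table 2). This is a minimal illustration under stated assumptions, not Stata output; the function name `weighted_marginal_mean` is hypothetical, and the coefficients are the rounded values from Table 4, so the results may differ slightly from software output.

```python
# Coefficients copied from Table 4 (rounded as printed there).
INTERCEPT = 6963.73
B_FEMALE = -2456.7
B_AGE = 0.81
B_JOB = {"trainee": 1302.53, "security": 167.83, "technical": 4613.43}
MEAN_AGE = 39.16  # overall mean age from Table 1

# Job-category counts copied from Table 2 (clerical is the reference level).
JOB_COUNTS = {"clerical": 227, "trainee": 168, "security": 32, "technical": 47}


def weighted_marginal_mean(female: int) -> float:
    """Marginal mean with each job dummy replaced by its sample proportion."""
    total = sum(JOB_COUNTS.values())
    job_term = sum(B_JOB[job] * JOB_COUNTS[job] / total for job in B_JOB)
    return INTERCEPT + B_FEMALE * female + B_AGE * MEAN_AGE + job_term


male_mean = weighted_marginal_mean(0)
female_mean = weighted_marginal_mean(1)
print(f"proportion-weighted marginal mean, male:   {male_mean:.2f}")
print(f"proportion-weighted marginal mean, female: {female_mean:.2f}")
```

Compared with equal weighting, the proportion-weighted means are pulled toward the clerical and trainee categories, which together make up most of the sample.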