The aim of this package is similar to the broom-package: transforming “untidy” input into a tidy data frame, especially for further use with ggplot. However, ggeffects does not return model-summaries; rather, this package computes marginal effects at the mean or average marginal effects from statistical models and returns the result as tidy data frame.
Since the focus lies on plotting the data (the marginal effects), at least one model term needs to be specified for which the effects are computed. It is also possible to compute marginal effects for model terms, grouped by the levels of another model’s predictor. The package also allows plotting marginal effects for two- or three-way-interactions, or for specific values of a model term only. Examples are shown below.
The returned data frames always have the same, consistent structure and column names, so it’s easy to create ggplot-plots without the need to re-write the arguments to be mapped in each ggplot-call. x
and predicted
are the values for the x- and y-axis. conf.low
and conf.high
could be used as ymin
and ymax
aesthetics for ribbons to add confidence bands to the plot. group
can be used as grouping-aesthetics, or for faceting.
ggpredict()
computes predicted values for all possible levels and values from a model’s predictors. In the simplest case, a fitted model is passed as first argument, and the term in question as second argument. Use the raw name of the variable for the terms
-argument only - you don’t need to write things like poly(term, 3)
or I(term^2)
for the terms
-argument.
library(ggeffects)
data(efc)
fit <- lm(barthtot ~ c12hour + neg_c_7 + c161sex + c172code, data = efc)
ggpredict(fit, terms = "c12hour")
#>
#> # Predicted values of Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> x predicted std.error conf.low conf.high
#> 0 75.444 1.116 73.257 77.630
#> 20 70.378 0.925 68.564 72.191
#> 45 64.045 0.843 62.393 65.697
#> 65 58.979 0.930 57.157 60.802
#> 85 53.913 1.122 51.713 56.113
#> 105 48.847 1.377 46.148 51.546
#> 125 43.781 1.665 40.517 47.045
#> 170 32.382 2.373 27.732 37.033
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
#> * c172code = 1.97
As you can see, ggpredict()
and ggeffect()
have a nice print()
-method, which takes care of printing not too many rows (but always an equally spaced range of values, including minimum and maximum value of the term in question) and giving some extra information. This is especially useful when predicted values are shown depending on the levels of other terms (see below).
The output shows the predicted values for the response at each value from the term c12hour. The data is already in shape for ggplot:
library(ggplot2)
theme_set(theme_bw())
mydf <- ggpredict(fit, terms = "c12hour")
ggplot(mydf, aes(x, predicted)) + geom_line()
The terms
-argument accepts up to three model terms, where the second and third term indicate grouping levels. This allows predictions for the term in question at different levels for other model terms:
ggpredict(fit, terms = c("c12hour", "c172code"))
#>
#> # Predicted values of Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> # low level of education
#> x predicted std.error conf.low conf.high
#> 0 74.746 1.776 71.266 78.227
#> 30 67.147 1.587 64.037 70.257
#> 55 60.815 1.550 57.777 63.853
#> 85 53.216 1.662 49.958 56.473
#> 115 45.617 1.914 41.865 49.368
#> 170 31.685 2.593 26.602 36.768
#>
#> # intermediate level of education
#> x predicted std.error conf.low conf.high
#> 0 75.465 1.114 73.282 77.647
#> 30 67.866 0.868 66.165 69.566
#> 55 61.533 0.872 59.824 63.243
#> 85 53.934 1.126 51.727 56.141
#> 115 46.335 1.522 43.352 49.318
#> 170 32.404 2.377 27.745 37.062
#>
#> # high level of education
#> x predicted std.error conf.low conf.high
#> 0 76.183 1.718 72.816 79.550
#> 30 68.584 1.616 65.417 71.751
#> 55 62.252 1.656 59.006 65.497
#> 85 54.653 1.843 51.040 58.265
#> 115 47.053 2.143 42.853 51.254
#> 170 33.122 2.863 27.511 38.733
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
Creating a ggplot is pretty straightforward: the colour-aesthetics is mapped with the group
-column:
mydf <- ggpredict(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
Finally, a second grouping structure can be defined, which will create another column named facet
, which - as the name implies - might be used to create a facted plot:
mydf <- ggpredict(fit, terms = c("c12hour", "c172code", "c161sex"))
mydf
#>
#> # Predicted values of Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> # low level of education
#> # [1] Male
#> x predicted std.error conf.low conf.high
#> 0 73.954 2.347 69.354 78.554
#> 45 62.556 2.208 58.228 66.883
#> 85 52.424 2.310 47.896 56.951
#> 170 30.893 3.085 24.847 36.939
#>
#> # low level of education
#> # [2] Female
#> x predicted std.error conf.low conf.high
#> 0 74.996 1.831 71.406 78.585
#> 45 63.597 1.603 60.456 66.738
#> 85 53.465 1.702 50.130 56.800
#> 170 31.934 2.606 26.827 37.042
#>
#> # intermediate level of education
#> # [1] Male
#> x predicted std.error conf.low conf.high
#> 0 74.673 1.845 71.055 78.290
#> 45 63.274 1.730 59.883 66.665
#> 85 53.142 1.911 49.397 56.887
#> 170 31.611 2.872 25.982 37.241
#>
#> # intermediate level of education
#> # [2] Female
#> x predicted std.error conf.low conf.high
#> 0 75.714 1.225 73.313 78.115
#> 45 64.315 0.968 62.418 66.213
#> 85 54.183 1.209 51.815 56.552
#> 170 32.653 2.403 27.943 37.362
#>
#> # high level of education
#> # [1] Male
#> x predicted std.error conf.low conf.high
#> 0 75.391 2.220 71.040 79.741
#> 45 63.992 2.176 59.727 68.258
#> 85 53.860 2.364 49.226 58.494
#> 170 32.330 3.257 25.946 38.713
#>
#> # high level of education
#> # [2] Female
#> x predicted std.error conf.low conf.high
#> 0 76.432 1.809 72.887 79.977
#> 45 65.034 1.712 61.679 68.388
#> 85 54.902 1.910 51.158 58.646
#> 170 33.371 2.895 27.697 39.045
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
ggplot(mydf, aes(x, predicted, colour = group)) +
geom_line() +
facet_wrap(~facet)
If the term
argument is either missing or NULL
, marginal effects for each model term are calculated. The result is returned as a list, which can be plotted manually (or using the plot()
function).
mydf <- ggpredict(fit)
mydf
#> $c12hour
#>
#> # Predicted values of Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> x predicted std.error conf.low conf.high
#> 0 75.444 1.116 73.257 77.630
#> 20 70.378 0.925 68.564 72.191
#> 45 64.045 0.843 62.393 65.697
#> 65 58.979 0.930 57.157 60.802
#> 85 53.913 1.122 51.713 56.113
#> 105 48.847 1.377 46.148 51.546
#> 125 43.781 1.665 40.517 47.045
#> 170 32.382 2.373 27.732 37.033
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
#> * c172code = 1.97
#>
#>
#> $neg_c_7
#>
#> # Predicted values of Total score BARTHEL INDEX
#> # x = Negative impact with 7 items
#>
#> x predicted std.error conf.low conf.high
#> 6 78.166 1.561 75.107 81.225
#> 8 73.571 1.206 71.207 75.935
#> 12 64.383 0.842 62.732 66.033
#> 14 59.788 0.972 57.883 61.693
#> 16 55.194 1.259 52.725 57.662
#> 20 46.005 2.021 42.043 49.966
#> 22 41.410 2.438 36.632 46.188
#> 28 27.627 3.735 20.306 34.947
#>
#> Adjusted for:
#> * c12hour = 42.20
#> * c161sex = 1.76
#> * c172code = 1.97
#>
#>
#> $c161sex
#>
#> # Predicted values of Total score BARTHEL INDEX
#> # x = carer's gender
#>
#> x predicted std.error conf.low conf.high
#> 1 63.962 1.728 60.575 67.350
#> 2 65.004 0.966 63.110 66.897
#>
#> Adjusted for:
#> * c12hour = 42.20
#> * neg_c_7 = 11.84
#> * c172code = 1.97
#>
#>
#> $c172code
#>
#> # Predicted values of Total score BARTHEL INDEX
#> # x = carer's level of education
#>
#> x predicted std.error conf.low conf.high
#> 1 64.057 1.554 61.012 67.103
#> 2 64.776 0.842 63.125 66.427
#> 3 65.494 1.621 62.317 68.672
#>
#> Adjusted for:
#> * c12hour = 42.20
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
#>
#>
#> attr(,"class")
#> [1] "ggalleffects" "list"
ggaverage()
compute average marginal effects. While ggpredict()
creates a data-grid (using expand.grid()
) for all possible combinations of values (even if some combinations are not present in the data), ggaverage()
computes predicted values based on the given data. This means that different predicted values for the outcome may occure at the same value or level for the term in question. The predicted values are then averaged for each value of the term in question and the linear trend is smoothed accross the averaged predicted values. This means that the line representing the marginal effects may cross or diverge, and are not necessarily in paralell to each other.
mydf <- ggaverage(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
Since ggaverage()
produces different predicted values for the same value or level of terms, linear-smoothing for gaussian models and loess-smoothing for non-gaussian models is used to produce “smooth” lines. This may lead to misleading plots, especially if polynomial or spline terms are included in the model. It’s preferred to use ggpredict()
or ggeffect()
.
To plot the marginal effects of interaction terms, simply specify these terms in the terms
-argument.
library(sjmisc)
data(efc)
# make categorical
efc$c161sex <- to_factor(efc$c161sex)
# fit model with interaction
fit <- lm(neg_c_7 ~ c12hour + barthtot * c161sex, data = efc)
# select only levels 30, 50 and 70 from continuous variable Barthel-Index
mydf <- ggpredict(fit, terms = c("barthtot [30,50,70]", "c161sex"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
Since the terms
-argument accepts up to three model terms, you can also compute marginal effects for a 3-way-interaction. To plot the marginal effects of three interaction terms, just like before, specify all three terms in the terms
-argument.
# fit model with 3-way-interaction
fit <- lm(neg_c_7 ~ c12hour * barthtot * c161sex, data = efc)
# select only levels 30, 50 and 70 from continuous variable Barthel-Index
mydf <- ggpredict(fit, terms = c("c12hour", "barthtot [30,50,70]", "c161sex"))
ggplot(mydf, aes(x, predicted, colour = group)) +
geom_line() +
facet_wrap(~facet)
ggpredict()
also works for models with polynomial terms or splines. Following code reproduces the plot from ?splines::bs
:
library(splines)
data(women)
fm1 <- lm(weight ~ bs(height, df = 5), data = women)
dat <- ggpredict(fm1, "height")
ggplot(dat, aes(x, predicted)) +
geom_line() +
geom_point()
ggpredict()
also supports coxph
-models from the survival-package and is able to either plot risk-scores (the default), probabilities of survival (type = "surv"
) or cumulative hazards (type = "cumhaz"
).
Since probabilities of survival and cumulative hazards are changing accross time, the time-variable is automatically used as x-axis in such cases, so the terms
-argument only needs up to two variables for type = "surv"
or type = "cumhaz"
.
data("lung", package = "survival")
# remove category 3 (outlier)
lung <- subset(lung, subset = ph.ecog %in% 0:2)
lung$sex <- factor(lung$sex, labels = c("male", "female"))
lung$ph.ecog <- factor(lung$ph.ecog, labels = c("good", "ok", "limited"))
m <- survival::coxph(survival::Surv(time, status) ~ sex + age + ph.ecog, data = lung)
# predicted risk-scores
ggpredict(m, c("sex", "ph.ecog"))
#>
#> # Predicted risk scores
#> # x = sex
#>
#> # good
#> x predicted std.error conf.low conf.high
#> 1 0.829 0.148 0.621 1.107
#> 2 0.481 0.176 0.340 0.679
#>
#> # ok
#> x predicted std.error conf.low conf.high
#> 1 1.249 0.107 1.013 1.540
#> 2 0.724 0.125 0.566 0.925
#>
#> # limited
#> x predicted std.error conf.low conf.high
#> 1 2.044 0.157 1.503 2.781
#> 2 1.185 0.172 0.846 1.658
#>
#> Adjusted for:
#> * age = 62.42
# probability of survival
ggpredict(m, c("sex", "ph.ecog"), type = "surv")
#>
#> # Probability of Survival
#> # x = time
#>
#> # male
#> # good
#> x predicted conf.low conf.high
#> 5 0.997 0.989 1.000
#> 181 0.771 0.685 0.868
#> 276 0.645 0.536 0.776
#> 1022 0.086 0.028 0.264
#>
#> # male
#> # ok
#> x predicted conf.low conf.high
#> 5 0.995 0.984 1.000
#> 181 0.676 0.588 0.776
#> 276 0.517 0.417 0.640
#> 1022 0.025 0.006 0.108
#>
#> # male
#> # limited
#> x predicted conf.low conf.high
#> 5 0.992 0.974 1.000
#> 181 0.527 0.402 0.690
#> 276 0.339 0.220 0.523
#> 1022 0.002 0.000 0.041
#>
#> # female
#> # good
#> x predicted conf.low conf.high
#> 5 0.998 0.994 1.000
#> 181 0.860 0.794 0.932
#> 276 0.776 0.683 0.881
#> 1022 0.241 0.113 0.513
#>
#> # female
#> # ok
#> x predicted conf.low conf.high
#> 5 0.997 0.991 1.000
#> 181 0.797 0.724 0.877
#> 276 0.682 0.588 0.792
#> 1022 0.117 0.044 0.310
#>
#> # female
#> # limited
#> x predicted conf.low conf.high
#> 5 0.995 0.985 1.000
#> 181 0.690 0.577 0.825
#> 276 0.535 0.401 0.713
#> 1022 0.030 0.005 0.187
#>
#> Adjusted for:
#> * age = 62.42
ggeffects makes use of the sjlabelled-package and supports labelled data. If the data from the fitted models is labelled, the value and variable label attributes are usually copied to the model frame stored in the model object. ggeffects provides various getter-functions to access these labels, which are returned as character vector and can be used in ggplot’s lab()
- or scale_*()
-functions.
get_title()
- a generic title for the plot, based on the model family, like “predicted values” or “predicted probabilities”get_x_title()
- the variable label of the first model term in terms
.get_y_title()
- the variable label of the response.get_legend_title()
- the variable label of the second model term in terms
.get_x_labels()
- value labels of the first model term in terms
.get_legend_labels()
- value labels of the second model term in terms
.The data frame returned by ggpredict()
or ggaverage()
must be used as argument to one of the above function calls.
get_x_title(mydf)
#> [1] "average number of hours of care per week"
get_y_title(mydf)
#> [1] "Negative impact with 7 items"
ggplot(mydf, aes(x, predicted, colour = group)) +
geom_line() +
facet_wrap(~facet) +
labs(
x = get_x_title(mydf),
y = get_y_title(mydf),
colour = get_legend_title(mydf)
)