Overview of common statistical methods with links to SAS and R
Here we give an overview of methods that are often asked for in our consultancy services, and we provide links to the appropriate functions/procedures in SAS and R. All links are external.
Statistical methods
Observe that most statistical models in their base form assume that observations are independent of each other. If observations are not independent, correlations between them must be modelled. Some types of models that handle correlated observations are discussed under mixed models and generalised linear mixed models.
Links are given to SAS and R documentation and to case studies. The page is under preparation.
Linear regression analysis
In regression analysis we examine the relationship between two or more continuous variables. One of the variables is the response variable, whose mean value is influenced by one or several explanatory variables. Classical linear regression analysis assumes normally distributed errors in the model in order to make inference about the model parameters. If data is not normally distributed, transformations or resampling techniques can be used. If the underlying distribution is known but not normal, see logistic regression (binomial), Poisson regression (Poisson) and generalised linear mixed models (binomial, Poisson or some other distribution) below.
In SAS:
PROC REG: Is used mainly if there are only continuous explanatory variables. To include categorical (class) variables they need to be coded as dummy variables before entering the model. PROC REG includes some more diagnostics compared to PROC GLM, but if you have both continuous and categorical explanatory variables PROC GLM is the better choice.
PROC GLM: Handles both continuous and categorical explanatory variables. Categorical variables are set in the CLASS statement.
In R:
lm (stats): Handles both continuous and categorical explanatory variables. Categorical variables must be defined as factors.
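For illustration, a minimal R sketch of a linear regression with one continuous and one categorical explanatory variable; the data frame dat and the variable names are invented for this example:

# hypothetical data: yield explained by dose (continuous) and variety (categorical)
dat <- data.frame(dose = runif(30, 0, 10),
                  variety = factor(rep(c("A", "B", "C"), each = 10)))
dat$yield <- 2 + 0.5 * dat$dose + rnorm(30)

fit <- lm(yield ~ dose + variety, data = dat)   # variety must be a factor
summary(fit)   # parameter estimates and tests
plot(fit)      # residual diagnostics for checking the model assumptions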
ANOVA - Analysis of Variance
Analysis of variance is used if data is collected in designed experiments with some (categorical) factors that should be compared, e.g. different treatments. The main goal is to determine if there are any differences in the mean values for different treatments and, if so, which differences there are. Analysis of variance can be used to compare the outcome for a large number of different designs, like one factor (one-way ANOVA), several factors in a crossed design (two-way ANOVA, ...), block designs, like one factor in blocks, Latin squares or split-plot designs, hierarchical models and others. Below we discuss software for models that only include fixed factors. If random factors are included, like in hierarchical and split-plot models, or if repeated measurements are made, see mixed models below.
Continuous explanatory variables can be included in ANOVA models and the method is then called analysis of covariance (ANCOVA). The response variable in ANOVA models is continuous, and the model assumes normally distributed residuals. If the data follow another distribution, e.g. binomial or Poisson, look for generalised linear models, e.g. logistic or Poisson regression.
In SAS:
PROC GLM: Handles both categorical and continuous explanatory variables. Categorical variables are set in the CLASS statement. Pairwise comparisons are made with the LSMEANS statement and can be adjusted in several ways for multiple testing (e.g. Tukey, Bonferroni, ...).
In R:
lm (stats): Handles both continuous and categorical explanatory variables. Categorical variables must be defined as factors. Pairwise comparisons can be made using lsmeans in the lsmeans package.
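As a sketch, a one-way ANOVA in R with Tukey-adjusted pairwise comparisons via the lsmeans package; the data frame and variable names are made up for this example:

library(lsmeans)
# hypothetical data: a response measured under three treatments
dat <- data.frame(treatment = factor(rep(c("T1", "T2", "T3"), each = 10)))
dat$response <- rnorm(30, mean = c(10, 12, 11)[dat$treatment])

fit <- lm(response ~ treatment, data = dat)
anova(fit)                                            # overall F-test
lsmeans(fit, pairwise ~ treatment, adjust = "tukey")  # Tukey-adjusted pairwise comparisons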
MIXED models
When not only fixed but also random factors are included in a design, the resulting model is called a mixed model. The main difference between fixed and random factors is that for fixed factors the main interest lies in the comparison of the different levels of this factor (e.g. treatments, diets, breeds, regions, ...). For a random factor the comparison of its levels is not of interest; instead the amount of variation induced by this factor is estimated (e.g. different stables, time points, sites within a homogeneous region, ...).
Hierarchical models (nested models, multilevel models or models for clustered data) are analysed as mixed models. Models where the experimental units are measured repeatedly (in time or space) are also mixed models. For both these types of models the individual observations are no longer independent, since observations within the same nesting/clustering and observations made on the same experimental unit at different time points must be considered correlated.
In SAS:
PROC MIXED: handles random factors in one-way or crossed designs as well as in hierarchical and repeated-measurements designs. Correlation structures that are available are, for example, autoregressive (AR(1)), spatial power (SP(POW)), compound symmetry (CS) or an unstructured correlation matrix (UN).
Example: mixed models in SAS
In R:
lmer (lme4): handles random factors, but it is not possible to specify a correlation structure between observations (i.e. for repeated measurements).
Example: Linear mixed models with lmer
lme (nlme): handles random factors and repeated measurements. Available correlation structures are, for example: autoregressive (corAR1), spatial power (corExp), compound symmetry (corCompSymm), and unstructured (corSymm).
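As an illustration, a minimal sketch of a repeated-measurements model in lme with an AR(1) correlation structure; the data set and variable names are assumptions for this example:

library(nlme)
# hypothetical repeated-measurements data: 12 subjects measured at 4 time points
dat <- expand.grid(subject = factor(1:12), time = 1:4)
dat$treatment <- factor(ifelse(as.integer(dat$subject) <= 6, "A", "B"))
dat$response <- rnorm(nrow(dat))

# random intercept for subject; AR(1) correlation between the repeated
# measurements within a subject (time is an integer index of the occasions)
fit <- lme(response ~ treatment + time,
           random = ~ 1 | subject,
           correlation = corAR1(form = ~ time | subject),
           data = dat)
summary(fit)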
Logistic regression
When the response variable is a 0/1 variable (e.g. diseased/healthy, developing mould/no mould, dead/alive, ...) we cannot use models that assume continuous responses. Logistic regression was developed for this kind of data and assumes that the underlying distribution is binomial. We model the probabilities of the different outcomes with a linear model including categorical and continuous explanatory variables.
If we have more than two categories in the response variable, similar methods can be used, often called multi-logit models: the proportional odds model works for ordinal responses and multinomial logit models work for nominal responses with more than two levels.
In SAS:
PROC LOGISTIC: can be used for logistic regression with logit or probit link functions. The cumulative logit link for proportional odds models and the generalised logit link for nominal response data with more than two levels (multinomial logit) are also available. PROC LOGISTIC uses effect coding of categorical explanatory variables by default.
PROC GENMOD: covers generalised linear models more broadly and is less specific to logistic regression, since it does not cover the alternative with nominal responses with more than two levels. Logistic regression and proportional odds models are available together with other models in the generalised linear model family, like Poisson regression. To run logistic regression use dist=binomial in the MODEL statement. To run the proportional odds model choose dist=multinomial. PROC GENMOD uses GLM coding of categorical explanatory variables.
Note: Parameter estimates in PROC LOGISTIC and PROC GENMOD differ due to the different coding of the categorical explanatory variables even though the models are the same. To obtain identical results, change the parametrisation in PROC LOGISTIC to GLM (param=GLM) in the CLASS statement.
Examples:
Logistic regression (external link, UCLA)
Multinomial logit regression (external link, UCLA)
Ordinal logit regression, proportional odds (external link, UCLA)
In R:
glm (stats): glm covers a variety of generalised linear models. Logistic regression is run by choosing family=binomial.
polr (MASS): polr can be used to run the proportional odds model. The response must be defined as an ordered factor before it is used in the model.
mlogit (mlogit): mlogit fits a multinomial logit model.
Note: the parametrisation of categorical variables can be changed with the contrasts function in R: contrasts(type) <- 'contr.sum' gives effect coding (deviation coding), whereas the default is dummy coding ('contr.treatment') with the first level as reference (compare to SAS, which uses the last level as reference).
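A minimal sketch of logistic regression and a proportional odds model in R; the data frame dat and the variables diseased, severity, age and herd are hypothetical:

library(MASS)
# hypothetical data: disease status (0/1) and an ordinal severity score
dat <- data.frame(diseased = rbinom(60, 1, 0.3),
                  age = runif(60, 2, 12),
                  herd = factor(rep(c("H1", "H2", "H3"), each = 20)))
dat$severity <- factor(sample(c("mild", "moderate", "severe"), 60, replace = TRUE),
                       levels = c("mild", "moderate", "severe"), ordered = TRUE)

# logistic regression for the 0/1 response
fit1 <- glm(diseased ~ age + herd, family = binomial, data = dat)
summary(fit1)

# proportional odds model for the ordinal response (must be an ordered factor)
fit2 <- polr(severity ~ age + herd, data = dat, Hess = TRUE)
summary(fit2)

# effect (deviation) coding instead of the default dummy coding, cf. the note above
contrasts(dat$herd) <- "contr.sum"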
Examples:
Logistic regression (external link, UCLA)
Multinomial logit regression (external link, UCLA)
Ordinal logit regression, proportional odds (external link, UCLA)
Poisson regression
Poisson regression is used if the response variable is count data, e.g. the number of butterflies observed during a specific time period.
In SAS:
PROC GENMOD: Poisson regression is part of the framework of generalised linear models and can therefore be run by PROC GENMOD. Here the choice is dist=poisson in the MODEL statement. If results give an indication of overdispersion, dist=negbin (negative binomial) can be used.
Examples:
Poisson regression (external link, UCLA)
Poisson regression with overdispersion (external link, UCLA)
In R:
glm (stats): glm covers a variety of generalised linear models. Poisson regression is run by choosing family=poisson. If overdispersion is a problem, family=quasipoisson can be used.
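A minimal sketch of a Poisson regression in R with a quasi-Poisson alternative for overdispersion; the count data set and variable names are invented for this example:

# hypothetical count data: number of butterflies per site
dat <- data.frame(habitat = factor(rep(c("forest", "meadow"), each = 20)),
                  area = runif(40, 1, 10))
dat$count <- rpois(40, lambda = exp(0.5 + 0.1 * dat$area))

fit <- glm(count ~ habitat + area, family = poisson, data = dat)
summary(fit)   # compare the residual deviance to its degrees of freedom

# if overdispersion is indicated, refit with a quasi-Poisson model
fit2 <- glm(count ~ habitat + area, family = quasipoisson, data = dat)
summary(fit2)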
Examples:
Poisson regression (external link, UCLA)
Poisson regression with overdispersion (external link, UCLA)
Generalised linear mixed models
In the same way as general linear models, generalised linear models (such as logistic and Poisson regression) can also include random factors.
In SAS:
PROC GLIMMIX handles random factors and repeated measurements. It covers the following distributions: beta, binary, binomial, exponential, gamma, Gaussian, geometric, inverse Gaussian, log-normal, multinomial, negative binomial, Poisson and t.
In R:
glmer (lme4) can be used for the following distributions: binomial, Gaussian, gamma, inverse Gaussian, Poisson, quasibinomial and quasipoisson. Again, correlation structures for repeated measurements cannot be specified.
glmmPQL (MASS) can handle random factors and repeated measurements. It covers the same distributions as glmer.
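As an illustration, a minimal sketch of a logistic mixed model in R; the data frame dat and the variables diseased, treatment, time and herd are assumptions for this example:

library(lme4)
library(MASS)
library(nlme)
# hypothetical data: disease status recorded in 10 herds at 6 time points
dat <- expand.grid(herd = factor(1:10), time = 1:6)
dat$treatment <- factor(ifelse(as.integer(dat$herd) <= 5, "A", "B"))
dat$diseased <- rbinom(nrow(dat), 1, 0.3)

# logistic mixed model with a random intercept for herd
fit1 <- glmer(diseased ~ treatment + (1 | herd), family = binomial, data = dat)
summary(fit1)

# glmmPQL additionally allows a correlation structure for the repeated measurements
fit2 <- glmmPQL(diseased ~ treatment, random = ~ 1 | herd,
                family = binomial, data = dat,
                correlation = corAR1(form = ~ time | herd))
summary(fit2)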
Quantile regression
Where traditional regression models the mean value of the response variable conditional on the values of the explanatory variables, quantile regression models the median or other quantiles. While regression for the median can be seen as more robust than regression for the mean value, the main advantage of quantile regression is to model relationships between the explanatory variables and high or low quantiles, such as the 90th or 10th percentile.
In SAS:
PROC QUANTREG: quantiles are regressed on explanatory variables by a linear model or by a non-parametric regression model using splines.
In R:
rq (quantreg): Quantile regression using a linear model or a non-parametric model with splines. A special version is available for censored data (crq).
Example: More information on quantile regression and links to R.
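A minimal sketch of quantile regression in R for the 10th, 50th and 90th percentiles with rq; the data frame and variable names are made up for this example:

library(quantreg)
# hypothetical data: a response related to a continuous dose
dat <- data.frame(dose = runif(100, 0, 10))
dat$response <- 0.5 * dat$dose + rnorm(100)

# model three quantiles of the response simultaneously
fit <- rq(response ~ dose, tau = c(0.1, 0.5, 0.9), data = dat)
summary(fit)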
Censored regression and truncated regression
Both censored and truncated regression address the problem of not observing all possible values of the response variable:
Truncated regression is used for data collected in a design where not all possible outcomes are observed. A typical example is willingness-to-pay if the sample is drawn from buyers of the product. The individuals that are not ready to pay for the product will never be in the study. The response variable is truncated at 0 and so are all the explanatory variables (the individuals' characteristics).
Censored regression is used if the response variable is only partially observed while the explanatory variables are fully observed. If willingness-to-pay is studied in a sample from the entire population, we will observe individuals with the value 0. The variable is censored at 0 since we cannot observe a negative willingness to pay.
In SAS:
PROC QLIM: handles both truncated and censored regression, e.g. different versions of Tobit regression.
In R:
censReg (censReg): covers censored regression, e.g. different versions of Tobit regression.
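A minimal sketch of a Tobit-type censored regression in R with censReg; the data frame dat and the variables wtp (willingness-to-pay), income and age are hypothetical:

library(censReg)
# hypothetical data: willingness-to-pay censored at 0 (many observations exactly 0)
dat <- data.frame(income = runif(200, 10, 50), age = runif(200, 20, 70))
dat$wtp <- pmax(0, -5 + 0.3 * dat$income + rnorm(200, sd = 5))

fit <- censReg(wtp ~ income + age, left = 0, data = dat)
summary(fit)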