Imputation – NoSimpler

Missing Value Imputation

Simple linear regression can be used to impute missing values (NA) in a data set of correlated variables.

Understanding the dataset

# vector y has missing values we need to impute

x = c(1,2,3,4,5,6,7,8,9,10)
y = c(11,12,18,14,17,NA,NA,19,NA,27)
z = c(19,11,2,14,20,4,9,10,18,1)
w = c(1,4,7,10,3,5,7,6,6,9)

# create a data frame for the data set

data = data.frame(x,y,z,w)
data
#     x  y  z  w
# 1   1 11 19  1
# 2   2 12 11  4
# 3   3 18  2  7
# 4   4 14 14 10
# 5   5 17 20  3
# 6   6 NA  4  5
# 7   7 NA  9  7
# 8   8 19 10  6
# 9   9 NA 18  6
# 10 10 27  1  9

Find Correlation

Compute the correlations to find the best predictor (independent variable) of y for fitting the model.

# correlation between variables

cor(data)
#            x  y          z          w
# x  1.0000000 NA -0.2736766  0.5029477
# y         NA  1         NA         NA
# z -0.2736766 NA  1.0000000 -0.5276512
# w  0.5029477 NA -0.5276512  1.0000000

# since there are NA values in y, we cannot find its correlation with other variables
# we need to ignore the missing values for correlation

cor(data, use="complete.obs")
#            x          y          z          w
# x  1.0000000  0.9088508 -0.4794970  0.5427928
# y  0.9088508  1.0000000 -0.6931033  0.5575189
# z -0.4794970 -0.6931033  1.0000000 -0.6438960
# w  0.5427928  0.5575189 -0.6438960  1.0000000

# the highest correlation of y is with x (about 0.91)
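The choice of predictor can also be read off the correlation matrix programmatically rather than by eye. The following is a minimal sketch, assuming the same data set as above; the names cor_mat, cor_y and best_predictor are introduced here for illustration only.

```r
# Same data set as above
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(11,12,18,14,17,NA,NA,19,NA,27)
z <- c(19,11,2,14,20,4,9,10,18,1)
w <- c(1,4,7,10,3,5,7,6,6,9)
data <- data.frame(x, y, z, w)

# correlation matrix computed on rows without any NA
cor_mat <- cor(data, use = "complete.obs")

# correlations of y with every other variable (drop y itself)
cor_y <- cor_mat["y", colnames(cor_mat) != "y"]

# variable with the largest absolute correlation with y
best_predictor <- names(which.max(abs(cor_y)))
best_predictor
# [1] "x"
```

The absolute value is used so that a strong negative correlation would also qualify as a good predictor.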

# we can also use symbols for correlation

symnum(cor(data,use="complete.obs"))
#   x y z w
# x 1
# y * 1
# z . , 1
# w . . , 1
# attr(,"legend")
# [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1

# '*' marks a correlation between 0.9 and 0.95; here it is the correlation between y and x, the highest for y

Fit the model

Since y is highly correlated with x, we use the formula y~x for fitting the linear model

# fitting linear regression model of Y on X

lrm = lm(y~x, data = data)
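The model is fitted even though y contains NA values because lm() leaves out incomplete rows. The sketch below checks this; it assumes R's default setting for the na.action option (na.omit) and rebuilds only the x and y columns of the data set.

```r
# x and y from the data set above (z and w are not needed for this check)
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(11,12,18,14,17,NA,NA,19,NA,27)
data <- data.frame(x, y)

lrm <- lm(y ~ x, data = data)

nobs(lrm)                     # number of rows actually used in the fit
# [1] 7

as.integer(na.action(lrm))    # row numbers dropped because y is NA
# [1] 6 7 9
```

So the coefficients are estimated from the seven complete rows only, and the three dropped rows are exactly the ones to be imputed.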

Find the coefficients of the linear model

# print the linear model to find the coefficients

print(lrm)

# Call:
# lm(formula = y ~ x, data = data)

# Coefficients:
# (Intercept)            x
#       9.743        1.509

# Using the coefficients, the linear equation is
# y = 9.743 + 1.509*x
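The equation can be applied by hand to the rows where y is missing. This is a minimal sketch, assuming the same x and y as above; b, x_missing and y_manual are illustrative names, not part of the original code.

```r
# x and y from the data set above
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(11,12,18,14,17,NA,NA,19,NA,27)
data <- data.frame(x, y)

lrm <- lm(y ~ x, data = data)

# named vector with elements "(Intercept)" and "x"
b <- coef(lrm)

# x values of the rows where y is missing (rows 6, 7 and 9)
x_missing <- data$x[is.na(data$y)]

# y = intercept + slope * x
y_manual <- unname(b["(Intercept)"] + b["x"] * x_missing)
round(y_manual, 3)
# [1] 18.797 20.306 23.324
```

Using coef() instead of typing the rounded coefficients avoids rounding and transcription errors in the imputed values.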

Predict y values using linear model

These values can be used to impute the missing y values

# predict y using linear model

y_pred = predict(lrm, newdata = data)

# compare the predicted values and the original values

data_compare = data.frame(y_pred,y)
data_compare

#      y_pred  y
# 1  11.25225 11
# 2  12.76126 12
# 3  14.27027 18
# 4  15.77928 14
# 5  17.28829 17
# 6  18.79730 NA
# 7  20.30631 NA
# 8  21.81532 19
# 9  23.32432 NA
# 10 24.83333 27
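The final step, replacing only the missing y values with their predictions, could look like the sketch below. It assumes the same data set and model as above; data_imputed and missing are illustrative names, and the observed y values are left untouched.

```r
# Same data set as above
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(11,12,18,14,17,NA,NA,19,NA,27)
z <- c(19,11,2,14,20,4,9,10,18,1)
w <- c(1,4,7,10,3,5,7,6,6,9)
data <- data.frame(x, y, z, w)

lrm <- lm(y ~ x, data = data)

# work on a copy so the original data stays available for comparison
data_imputed <- data

# logical index of the rows where y is missing
missing <- is.na(data_imputed$y)

# fill only the missing entries with the model's predictions
data_imputed$y[missing] <- predict(lrm, newdata = data_imputed[missing, ])

round(data_imputed$y, 2)
# [1] 11.00 12.00 18.00 14.00 17.00 18.80 20.31 19.00 23.32 27.00
```

Rows 6, 7 and 9 now hold the fitted values, while the seven observed values of y are unchanged.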