Peer-Reviewed

Effect of Multicollinearity on Variable Selection in Multiple Regression

Received: 25 October 2021    Accepted: 17 November 2021    Published: 9 December 2021
Abstract

A data set that exhibits multicollinearity is considered deficient. Multicollinearity, the phenomenon in which two or more explanatory variables in a multiple regression model are highly correlated, is frequently encountered in observational studies and creates difficulties when building regression models. Variable selection is an important aspect of model building, and choosing the best subset of variables to include in a model is the most difficult part of regression analysis. Data were obtained from the Nigerian Stock Exchange Fact Book, the Nigerian Stock Exchange Annual Report and Accounts, the CBN Statistical Bulletin, and the FOS Statistical Bulletin for 1987 to 2018. The Variance Inflation Factor (VIF) and correlation matrices were used to detect the presence of multicollinearity. Ridge regression and least squares regression were then applied using the R, Minitab, and SPSS software packages. Ridge models with the ridge constant in the range 0.01 ≤ K ≤ 1.5 and least squares regression models were considered for each number of predictors P = 2, 3, …, 7. The optimal ridge and least squares models were selected by taking the average rank of the coefficient of determination (R²) and the mean square error (MSE). The results showed that variable selection was affected by the presence of multicollinearity: different variables were selected under ridge and least squares regression at the same level of P.
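
To make the workflow concrete, here is a minimal R sketch (R being one of the packages used in the study) that detects multicollinearity via the correlation matrix and VIFs, fits ridge models over the constant range 0.01 ≤ K ≤ 1.5, and applies the average-rank criterion. The file name, variable names, and the illustrative R² and MSE values are hypothetical placeholders, not the study's actual data.

    # Minimal sketch: detection, ridge fitting, and average-rank selection.
    # "stockdata.csv" and the column names are hypothetical placeholders.
    library(car)   # for vif()
    library(MASS)  # for lm.ridge()

    dat <- read.csv("stockdata.csv")   # response y, predictors x1..x7

    # Detect multicollinearity: predictor correlation matrix and VIFs
    round(cor(dat[, -1]), 2)           # predictors assumed in columns 2 onward
    ols <- lm(y ~ ., data = dat)       # ordinary least squares fit
    vif(ols)                           # VIF > 10 is a common warning sign

    # Ridge regression over the constant range 0.01 <= K <= 1.5
    ridge <- lm.ridge(y ~ ., data = dat,
                      lambda = seq(0.01, 1.5, by = 0.01))

    # Select the optimal model by the average rank of R-squared and MSE
    # (higher R-squared and lower MSE rank better); values are illustrative.
    r2  <- c(0.91, 0.88, 0.93)
    mse <- c(12.4, 15.1, 10.8)
    avg_rank <- (rank(-r2) + rank(mse)) / 2
    which.min(avg_rank)                # index of the best-ranked model

The average-rank criterion simply combines the two measures on a common scale, so a model that excels on one criterion but performs poorly on the other is not automatically selected.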

Published in Science Journal of Applied Mathematics and Statistics (Volume 9, Issue 6)
DOI 10.11648/j.sjams.20210906.12
Page(s) 141-153
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2021. Published by Science Publishing Group

Keywords

Regression, Multicollinearity, Ridge Regression, Partial Least Square, Extra Sum of Squares

Cite This Article
  • APA Style

    Harrison Oghenekevwe Etaga, Roseline Chibotu Ndubisi, Ngonadi Lilian Oluebube. (2021). Effect of Multicollinearity on Variable Selection in Multiple Regression. Science Journal of Applied Mathematics and Statistics, 9(6), 141-153. https://doi.org/10.11648/j.sjams.20210906.12


  • ACS Style

    Harrison Oghenekevwe Etaga; Roseline Chibotu Ndubisi; Ngonadi Lilian Oluebube. Effect of Multicollinearity on Variable Selection in Multiple Regression. Sci. J. Appl. Math. Stat. 2021, 9(6), 141-153. doi: 10.11648/j.sjams.20210906.12


  • AMA Style

    Harrison Oghenekevwe Etaga, Roseline Chibotu Ndubisi, Ngonadi Lilian Oluebube. Effect of Multicollinearity on Variable Selection in Multiple Regression. Sci J Appl Math Stat. 2021;9(6):141-153. doi: 10.11648/j.sjams.20210906.12


  • @article{10.11648/j.sjams.20210906.12,
      author = {Harrison Oghenekevwe Etaga and Roseline Chibotu Ndubisi and Ngonadi Lilian Oluebube},
      title = {Effect of Multicollinearity on Variable Selection in Multiple Regression},
      journal = {Science Journal of Applied Mathematics and Statistics},
      volume = {9},
      number = {6},
      pages = {141-153},
      doi = {10.11648/j.sjams.20210906.12},
      url = {https://doi.org/10.11648/j.sjams.20210906.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.sjams.20210906.12},
     year = {2021}
    }
    


  • TY  - JOUR
    T1  - Effect of Multicollinearity on Variable Selection in Multiple Regression
    AU  - Harrison Oghenekevwe Etaga
    AU  - Roseline Chibotu Ndubisi
    AU  - Ngonadi Lilian Oluebube
    Y1  - 2021/12/09
    PY  - 2021
    N1  - https://doi.org/10.11648/j.sjams.20210906.12
    DO  - 10.11648/j.sjams.20210906.12
    T2  - Science Journal of Applied Mathematics and Statistics
    JF  - Science Journal of Applied Mathematics and Statistics
    JO  - Science Journal of Applied Mathematics and Statistics
    SP  - 141
    EP  - 153
    PB  - Science Publishing Group
    SN  - 2376-9513
    UR  - https://doi.org/10.11648/j.sjams.20210906.12
    VL  - 9
    IS  - 6
    ER  - 


Author Information
  • Harrison Oghenekevwe Etaga, Department of Statistics, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria

  • Roseline Chibotu Ndubisi, Department of Statistics, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria

  • Ngonadi Lilian Oluebube, Department of Statistics, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria
