The Importance of Train/Test Splitting
This week, my class at General Assembly was introduced to modeling after getting an introduction to Python and exploratory data analysis with real datasets drawn from various fields, including the medical and travel industries. We approached modeling through linear regression, where we take one or more features from a dataset as our predictor variables, with the goal of modeling the response variable.
I had been exposed to linear regression in previous college classes. However, there are limitations to using the linear regression model by itself. One important concept I learned is train/test splitting our data before fitting the model. One reason to do this is to get a better sense of whether the model can represent new data, which we cannot judge by fitting on all of the X and Y values alone. Another reason is to interpret the bias-variance tradeoff in the model. The training set is the subset of the data on which we fit our model. The testing set is the subset of the data on which we evaluate the quality of our predictions. A high score on the testing data indicates our model can generalize to new data, while a high training score paired with a much lower testing score points to high variance, or overfitting. Linear regression models, being relatively simple, tend instead toward higher bias, or underfitting. The R squared score shows how well our independent variables explain the variability in our dependent variable.
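To make the idea concrete, here is a minimal sketch of the train/test workflow using only numpy, with a made-up synthetic dataset (the data, split ratio, and seed are my own choices for illustration, not anything from class). We fit ordinary least squares on the training subset only, then compare R squared on the training and testing subsets:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic dataset: y depends linearly on one feature, plus noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 2.0, size=100)

# Shuffle the row indices, then split: 80% train, 20% test.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_train, y_train = X[idx[:cut]], y[idx[:cut]]
X_test, y_test = X[idx[cut:]], y[idx[cut:]]

# Fit ordinary least squares on the training set ONLY
# (a column of ones adds the intercept term).
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

def r_squared(X_part, y_part):
    """R^2: fraction of variance in y explained by the model."""
    A = np.column_stack([np.ones(len(X_part)), X_part])
    preds = A @ coef
    ss_res = np.sum((y_part - preds) ** 2)
    ss_tot = np.sum((y_part - y_part.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(f"train R^2: {r_squared(X_train, y_train):.3f}")
print(f"test  R^2: {r_squared(X_test, y_test):.3f}")
```

In practice one would usually reach for scikit-learn's `train_test_split` and `LinearRegression` instead of doing this by hand, but the logic is the same: the model never sees the test rows during fitting, so the test R squared is an honest estimate of how it handles new data, and a large train/test gap is the warning sign for overfitting.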