1. Final Prediction Assignment

Mena Solomon

2024/11/01

Who will win the 2024 Presidential Election?

With only one day to go until voting closes across the United States, the time has come to generate my finalized prediction for the 2024 Presidential election. Building off of eight weeks of learning, modeling, and discussing, my final model has been built based on existing scholarship, the successes (and failures) of models over the past week, data availability, and an increased understanding of the unique nature of this election.

Model Formula

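My model includes four predictive variables:

  1. The latest poll averages
  2. The interaction effect between Q2 GDP and incumbency
  3. Two-party vote share lagged one cycle
  4. Two-party vote share lagged two cycles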

Model Strategy

Regression Modeling

I decided to use a standard regression to build out my predictive model. I have three primary reasons for using a regression model as opposed to an ensemble method or another form of machine learning.

  1. Transparency: In using a regression model throughout the past few weeks, I have tried to be transparent about my assumptions as well as their impact on the models and my overall predictions. I prioritize transparency because election forecasting plays an important role in instilling confidence in U.S. elections. Without forecasting, the American people would have no baseline against which to gauge election results. At a time when election integrity has been called into question and a team of officials is lined up to discount the results, predictions give the public a sense of what to expect on election night. Simple transparency is therefore key to ensuring the result of my model directly reflects its inputs.
  2. Interpretability: Building upon the previous point, election forecasts are not just designed to be read by expert data scientists. Rather, large audiences of American citizens rely on forecasts to understand the temperature of the nation going into election night. In this way, it is important that my model be interpretable by audiences beyond a data science sphere. Using a regularized regression model allows me to easily interpret which variables are significant predictors and how, in aggregate, they deliver a prediction.
  3. Generalizability: Data science, especially as it pertains to election forecasting, is a relatively new field, so robust data is not available for every variable in my model. Regression models are well suited to generating generalizable results from limited data without creating extreme model biases.

The first regression model measures the relationship between Democratic two-party vote share and my four predictive variables: the latest poll averages, the interaction effect between Q2 GDP and incumbency, Democratic two-party vote share lagged one cycle, and Democratic two-party vote share lagged two cycles.

## 
## =================================================================================
##                                                         Dependent variable:      
##                                                   -------------------------------
##                                                   Democratic Two-Party Vote Share
## ---------------------------------------------------------------------------------
## Latest Democratic Poll Averages                          0.713*** (0.028)        
## Incumbency and GDP Interaction Effect                     -0.067 (0.049)         
## Democratic Two-Party Vote Share Lagged One Cycle         0.382*** (0.028)        
## Democratic Two-Party Vote Share Lagged Two Cycles        -0.085*** (0.025)       
## Constant                                                 3.628*** (1.041)        
## ---------------------------------------------------------------------------------
## Observations                                                    559              
## R2                                                             0.805             
## Adjusted R2                                                    0.803             
## Residual Std. Error                                      4.077 (df = 554)        
## F Statistic                                          570.632*** (df = 4; 554)    
## =================================================================================
## Note:                                                 *p<0.1; **p<0.05; ***p<0.01

Since my predictive model is regularized with elastic-net and normalized by combining models of both Democratic and Republican two-party vote share (in-depth explanations follow below), the coefficients and R-squared value of this model do not exactly match those of my final model. I include them here, however, as a sense-check for the assumptions made above. First, the adjusted R-squared of 0.80 shows that the simple regression model explains about 80% of the variance in Democratic two-party vote share. To me, this emphasizes that the variables included here do a reasonably good job of explaining the attitude of the American electorate when selecting a presidential candidate. Second, for every variable except the interaction between incumbency and GDP, the size and statistical significance of each coefficient indicate its relative importance in understanding Democratic vote share.
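To make the coefficient table concrete, the fitted Democratic equation can be evaluated by hand. The sketch below is written in Python purely for illustration (it is not the code used to fit the model). It plugs in the ordinary regression coefficients reported in the table above; the input values are hypothetical, and the exact coding of the incumbency and GDP interaction term is an assumption.

```python
# Coefficients copied from the Democratic regression table above.
DEM_COEFS = {
    "intercept": 3.628,
    "poll_avg": 0.713,
    "incumbency_gdp": -0.067,
    "lag1_share": 0.382,
    "lag2_share": -0.085,
}


def predict_dem_share(poll_avg, incumbency_gdp, lag1_share, lag2_share):
    """Return the fitted Democratic two-party vote share, in percent."""
    return (
        DEM_COEFS["intercept"]
        + DEM_COEFS["poll_avg"] * poll_avg
        + DEM_COEFS["incumbency_gdp"] * incumbency_gdp
        + DEM_COEFS["lag1_share"] * lag1_share
        + DEM_COEFS["lag2_share"] * lag2_share
    )


# Hypothetical state: polling at 48%, an interaction term of 0, and a
# 50% two-party share in each of the previous two cycles.
example = predict_dem_share(48.0, 0.0, 50.0, 50.0)
print(round(example, 3))
```

Because the poll coefficient is the largest, a one-point change in the poll average moves the fitted share by about 0.7 points, far more than an equal change in either lagged share.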

The second regression model measures the relationship between Republican two-party vote share and my four predictive variables: the latest poll averages, the interaction effect between Q2 GDP and incumbency, Republican two-party vote share lagged one cycle, and Republican two-party vote share lagged two cycles.

## 
## =================================================================================
##                                                         Dependent variable:      
##                                                   -------------------------------
##                                                   Republican Two-Party Vote Share
## ---------------------------------------------------------------------------------
## Latest Republican Poll Averages                          0.636*** (0.024)        
## Incumbency and GDP Interaction Effect                      0.026 (0.019)         
## Republican Two-Party Vote Share Lagged One Cycle         0.255*** (0.030)        
## Republican Two-Party Vote Share Lagged Two Cycles        0.193*** (0.024)        
## Constant                                                  -0.210 (1.168)         
## ---------------------------------------------------------------------------------
## Observations                                                    559              
## R2                                                             0.809             
## Adjusted R2                                                    0.807             
## Residual Std. Error                                      4.035 (df = 554)        
## F Statistic                                          585.451*** (df = 4; 554)    
## =================================================================================
## Note:                                                 *p<0.1; **p<0.05; ***p<0.01

These results are incredibly similar to those seen in the Democratic two-party vote share model, indicating both models operate similarly and are very successful in predicting two-party vote share. These sense-checks leave me confident in the variables I have chosen to include within my model going forward.

Regularized Regression

When using a regression model, there are two main concerns: overfitting and multicollinearity. To address these concerns, I decided to use an Elastic-Net regularized regression model, which combines the Lasso and Ridge penalties to shrink large coefficients and average the coefficients of correlated predictors. The strength of the penalty, known as lambda, was chosen through cross-validation to pick the value which best fit my model. In doing so, I improved the stability of my model as well as its predictive power.
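The mechanics of elastic-net can be sketched with a small coordinate-descent routine. This is a minimal illustration in Python on synthetic data, not the code behind the results in this post. The names `elastic_net` and `soft_threshold` are hypothetical, and the naming follows the common glmnet-style convention in which `lam` sets the overall penalty strength and `alpha` sets the Lasso/Ridge mix (an assumption about the original setup).

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso part of the update)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha=0.5, n_sweeps=200):
    """Coordinate descent for the objective
    (1/2n) * sum(residual^2) + lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2).
    The intercept is not penalized."""
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    beta = [0.0] * p
    residual = [yi - intercept for yi in y]
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            # Covariance of predictor j with the residual that excludes j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            z = sum(c * c for c in col) / n
            new_bj = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
            residual = [r - c * (new_bj - beta[j]) for c, r in zip(col, residual)]
            beta[j] = new_bj
        # The unpenalized intercept absorbs the mean of what is left over.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, beta


# Synthetic data (not election data): two correlated predictors.
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + 0.6 * random.gauss(0, 1) for a in x1]
X = [[a, b] for a, b in zip(x1, x2)]
y = [50 + 3 * a + 1 * b + random.gauss(0, 0.5) for a, b in X]

for lam in (0.0, 0.5, 100.0):
    b0, b = elastic_net(X, y, lam)
    print(lam, round(b0, 2), [round(v, 2) for v in b])
```

With `lam = 0` the routine reduces to ordinary least squares. A moderate `lam` pulls the two correlated coefficients toward each other, and a very large `lam` sets both to zero.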

The first elastic-net regression regularizes the Democratic two-party vote share regression analyzed above:

Table 1: Democratic Two-Party Vote Share Elastic Net Coefficients

| Variable | Coefficient |
| --- | --- |
| Intercept | 3.7798307 |
| Intercept 1 | 0.0000000 |
| Latest Democratic Poll Averages | 0.7058158 |
| Incumbency and GDP Interaction Effect | -0.0598857 |
| Democratic Two Party Vote Share Lagged One Cycle | 0.3755114 |
| Democratic Two Party Vote Share Lagged Two Cycles | -0.0751620 |

Comparing the coefficients of this model with those of the model above, none of the coefficients changed substantially with the use of elastic-net. Indeed, the ideal lambda discovered through cross-validation was 0.03, meaning that my model requires only a small amount of regularization to optimize predictive performance. This suggests that my model has a low risk of both multicollinearity and overfitting, further emphasizing the strong predictive power of my selected variables. Each coefficient will be evaluated here:

The second elastic-net regression regularizes the Republican two-party vote share regression also analyzed above:

Table 2: Republican Two-Party Vote Share Elastic Net Coefficients

| Variable | Coefficient |
| --- | --- |
| Intercept | 0.0823261 |
| Intercept 1 | 0.0000000 |
| Latest Republican Poll Averages | 0.6316387 |
| Incumbency and GDP Interaction Effect | 0.0231685 |
| Republican Two Party Vote Share Lagged One Cycle | 0.2562105 |
| Republican Two Party Vote Share Lagged Two Cycles | 0.1901908 |

While the lambda found through cross-validation here is slightly higher, sitting around 0.056, the coefficients remain relatively unchanged compared to those found in the simple regression model above. Once again, this indicates my model has a low risk of both multicollinearity and overfitting, further emphasizing the strong predictive power of my selected variables.

Model Validation

To verify the accuracy of my model in predicting my chosen outcome variables, Democratic and Republican two-party vote share, I decided to perform an out-of-sample performance validation. While I would have liked to display my in-sample error as well, the use of an aggregate elastic-net model predicted onto state-based variables makes doing so incredibly difficult, so I will instead focus on out-of-sample error.

Using Bootstrapped Out-of-Sample Error Estimation to Test Predictive Power:

Table 3: Out-of-Sample Error Summary for Democratic and Republican Predictions

| Party | Mean Error | Standard Deviation |
| --- | --- | --- |
| Democratic | -0.2518568 | 6.260530 |
| Republican | 0.4969762 | 6.017972 |

The mean error for Democrats is around -0.25, indicating the model tends to slightly underestimate Democratic performance. The mean error for Republicans is around 0.5, indicating the model tends to slightly overestimate Republican performance. That said, both values are close to zero, suggesting the models do not have a significant directional bias in their predictions. A standard deviation of around 6 points for both, however, is relatively high, especially in the context of vote share predictions, where races are often decided by a point or two. A high standard deviation (relative to the mean error) suggests that individual predictions vary considerably from the true values. While this finding is worrying, it is not indicative of a bad model; rather, it emphasizes the limited data availability and high uncertainty in the election forecasting industry as a whole.
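The kind of summary shown in Table 3 can be sketched as follows. This is a minimal Python illustration on synthetic data that uses repeated random hold-out splits, which is one common reading of "bootstrapped out-of-sample error"; the procedure actually used for Table 3 may differ. The helper names, the single-predictor model, and the error convention (predicted minus actual, so a negative mean signals underestimation) are all assumptions.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line for a single predictor; returns (intercept, slope)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def out_of_sample_errors(xs, ys, n_rounds=200, holdout=20, seed=1):
    """Repeatedly hold out a random subset, refit on the rest, and collect
    the prediction errors on the held-out points."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    errors = []
    for _ in range(n_rounds):
        test_idx = set(rng.sample(indices, holdout))
        train_idx = [i for i in indices if i not in test_idx]
        intercept, slope = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        # Assumed error convention: predicted minus actual.
        errors.extend(intercept + slope * xs[i] - ys[i] for i in test_idx)
    return errors


# Synthetic stand-in data (not real results): poll average -> vote share.
data_rng = random.Random(0)
polls = [data_rng.gauss(50, 5) for _ in range(100)]
shares = [5 + 0.9 * p + data_rng.gauss(0, 3) for p in polls]

errors = out_of_sample_errors(polls, shares)
print("mean error:", round(statistics.mean(errors), 3))
print("standard deviation:", round(statistics.stdev(errors), 3))
```

A mean error near zero with a large standard deviation, as in Table 3, describes a model that is unbiased on average but can miss any individual prediction by a wide margin.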

Predicting Vote Share

As I have done in the previous three weeks, I will be predicting for the seven states which expert predictors like Cook and Sabato determine to be toss-ups in the upcoming election: Arizona, Nevada, Michigan, Wisconsin, North Carolina, Georgia, and Pennsylvania. Using the elastic-net regularized regression model generated above, which includes four predictive variables, my models calculated both Democratic and Republican two-party vote share.

When interpreting the results below, bear in mind that the predicted two-party vote shares sum to above 100 as a result of the data used in this model. The data will be normalized below; however, the raw model results are included for the sake of evaluating the confidence intervals for each state.

Model of Elastic-Net Regularized Regression Predicted Two-Party Democratic Vote Share with 90% Confidence Intervals for Swing States

Table 4: Predicted Two-Party Democratic Vote Share with Confidence Intervals for Swing States

| State | Predicted Vote Share | Upper Bound | Lower Bound |
| --- | --- | --- | --- |
| Arizona | 51.59982 | 56.81396 | 46.38569 |
| Georgia | 52.06712 | 57.28126 | 46.85298 |
| Michigan | 52.88262 | 58.09676 | 47.66849 |
| Nevada | 52.28355 | 57.49769 | 47.06941 |
| North Carolina | 51.76715 | 56.98129 | 46.55301 |
| Pennsylvania | 52.49332 | 57.70746 | 47.27918 |
| Wisconsin | 52.73629 | 57.95042 | 47.52215 |

Model of Elastic-Net Regularized Regression Predicted Two-Party Republican Vote Share with 90% Confidence Intervals for Swing States

Table 5: Predicted Two-Party Republican Vote Share with Confidence Intervals for Swing States

| State | Predicted Vote Share | Upper Bound | Lower Bound |
| --- | --- | --- | --- |
| Arizona | 53.68981 | 58.84942 | 48.53021 |
| Georgia | 53.51498 | 58.67459 | 48.35538 |
| Michigan | 51.79185 | 56.95145 | 46.63224 |
| Nevada | 52.06144 | 57.22105 | 46.90184 |
| North Carolina | 53.50122 | 58.66082 | 48.34162 |
| Pennsylvania | 52.56941 | 57.72901 | 47.40981 |
| Wisconsin | 52.31164 | 57.47124 | 47.15203 |

The 90% confidence interval of these predictions includes both election outcomes, indicating the extreme variability of the model. This variability suggests that the predictions are sensitive to small changes in input, reflecting the inherent uncertainty in election forecasting. Since election prediction models rely on a limited set of data points and may not fully capture unforeseen events or shifts in voter sentiment, it is common for confidence intervals to span both possible outcomes. Such wide intervals remind us that while the model offers a probabilistic view of the election, it should not be interpreted as a definitive forecast.
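The claim that the intervals include both outcomes can be checked directly from the bounds in Tables 4 and 5. The short Python sketch below is illustrative only. It treats "includes both outcomes" as meaning that the Democratic and Republican intervals overlap in a state, which is an interpretation and not something the tables themselves state.

```python
# (lower, upper) bounds of the 90% intervals, copied from Tables 4 and 5.
DEM_INTERVALS = {
    "Arizona": (46.38569, 56.81396),
    "Georgia": (46.85298, 57.28126),
    "Michigan": (47.66849, 58.09676),
    "Nevada": (47.06941, 57.49769),
    "North Carolina": (46.55301, 56.98129),
    "Pennsylvania": (47.27918, 57.70746),
    "Wisconsin": (47.52215, 57.95042),
}
REP_INTERVALS = {
    "Arizona": (48.53021, 58.84942),
    "Georgia": (48.35538, 58.67459),
    "Michigan": (46.63224, 56.95145),
    "Nevada": (46.90184, 57.22105),
    "North Carolina": (48.34162, 58.66082),
    "Pennsylvania": (47.40981, 57.72901),
    "Wisconsin": (47.15203, 57.47124),
}


def intervals_overlap(a, b):
    """True when two (lower, upper) intervals share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


overlapping = [
    state
    for state in DEM_INTERVALS
    if intervals_overlap(DEM_INTERVALS[state], REP_INTERVALS[state])
]
print(len(overlapping), "of", len(DEM_INTERVALS), "states have overlapping intervals")
```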

Normalizing the Two-Party Vote Share in my Models to Generate a Final Prediction

Since my values for two-party vote share sum to over 100, I normalized them through a simple formula: dividing each party’s prediction by the sum of both parties’ predictions and expressing the result as a percentage. Doing so generated the results displayed below:

| State | Democratic Prediction | Republican Prediction | Winner |
| --- | --- | --- | --- |
| Arizona | 49.00750 | 50.99250 | Trump |
| Georgia | 49.31434 | 50.68566 | Trump |
| Michigan | 50.52103 | 49.47897 | Harris |
| Nevada | 50.10643 | 49.89357 | Harris |
| North Carolina | 49.17636 | 50.82364 | Trump |
| Pennsylvania | 49.96379 | 50.03621 | Trump |
| Wisconsin | 50.20212 | 49.79788 | Harris |
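The normalization described above can be reproduced in a few lines. This is a sketch in Python for illustration, using the raw predictions from Tables 4 and 5; the function name `normalize` is hypothetical.

```python
# Raw predicted vote shares in percent (Democratic, Republican),
# copied from Tables 4 and 5.
RAW = {
    "Arizona": (51.59982, 53.68981),
    "Georgia": (52.06712, 53.51498),
    "Michigan": (52.88262, 51.79185),
    "Nevada": (52.28355, 52.06144),
    "North Carolina": (51.76715, 53.50122),
    "Pennsylvania": (52.49332, 52.56941),
    "Wisconsin": (52.73629, 52.31164),
}


def normalize(dem, rep):
    """Rescale two raw shares so that they sum to exactly 100."""
    total = dem + rep
    return 100 * dem / total, 100 * rep / total


normalized = {state: normalize(dem, rep) for state, (dem, rep) in RAW.items()}
winners = {
    state: "Harris" if dem > rep else "Trump"
    for state, (dem, rep) in normalized.items()
}

for state, (dem, rep) in normalized.items():
    print(f"{state}: {dem:.5f} {rep:.5f} {winners[state]}")
```

Normalizing never changes which party leads in a state, since both shares are divided by the same total; it only rescales the margin.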

Final 2024 Prediction

After normalizing both predictions, Harris appears to win Michigan, Nevada, and Wisconsin by slim margins, while Trump wins Arizona, Pennsylvania, Georgia, and North Carolina by similarly slim margins. This leads to a result where Trump wins with 281 electors while Harris has 257 electors.
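The electoral arithmetic behind this result can be sketched as follows (Python, for illustration). The swing-state electoral votes are the 2024 allocations. The baseline of 226 electors for Harris and 219 for Trump from all other states is an assumption: it reflects every non-toss-up state voting as expert ratings expect, and it is consistent with the totals reported above.

```python
# 2024 electoral votes for the seven toss-up states.
ELECTORAL_VOTES = {
    "Arizona": 11,
    "Georgia": 16,
    "Michigan": 15,
    "Nevada": 6,
    "North Carolina": 16,
    "Pennsylvania": 19,
    "Wisconsin": 10,
}

# Predicted winners from the normalized table above.
PREDICTED_WINNER = {
    "Arizona": "Trump",
    "Georgia": "Trump",
    "Michigan": "Harris",
    "Nevada": "Harris",
    "North Carolina": "Trump",
    "Pennsylvania": "Trump",
    "Wisconsin": "Harris",
}

# Assumed baseline from all other states and districts (not from the model).
totals = {"Harris": 226, "Trump": 219}

for state, winner in PREDICTED_WINNER.items():
    totals[winner] += ELECTORAL_VOTES[state]

print(totals)
```

Pennsylvania's 19 electors are decisive here: its predicted margin is under a tenth of a point, and flipping it alone would give Harris 276.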

Notes

All code above is accessible via GitHub.

Data Sources

US Presidential Election Popular Vote Data from 1948-2020 provided by the course. Economic data from the U.S. Bureau of Economic Analysis, also provided by the course. Polling data sourced from FiveThirtyEight.