8. Final Prediction Post | Election Prediction Blog

How to predict an election outcome?

Over the past seven weeks, I have been working to build a model which can effectively predict the outcome of the 2024 presidential election. Now, with a little over a week until election day, it is time to utilize my knowledge to produce a comprehensive election forecast.

My final model will include five predictive variables:

Q2 GDP Growth: In week two we began our discussion of fundamentals, covering the effect of the economy on incumbent vote share. Week two’s model found that Q2 GDP growth had a stronger relationship with vote share than any other economic variable tested.
Latest Poll Averages: Polling was covered in week three. The regularized regression model from that post found that the weeks with the greatest predictive power were those closest to the election. Accordingly, this model includes only the most recent polling averages.
Incumbency: The incumbency advantage was discussed in week four, where we weighed the effects of name recognition, pork-barrel spending, and candidate fatigue. Incumbent status proves to be a major predictor of election outcomes; however, this effect is complicated by the candidate switch from Biden to Harris. Regardless, incumbency proved predictive in my week four model, so it is incorporated here as well.
Democratic Two-Party Vote Share Lagged One Cycle: In week five, we covered the effects of our final fundamental variable: demographics. As the electorate becomes further calcified, demographics are increasingly predictive of both turnout and election outcomes. It is difficult, however, to predict demographic shifts from existing data. Instead, lagged vote share serves as a proxy for this variable (and others) by showing how the state has voted in past elections.
Democratic Two-Party Vote Share Lagged Two Cycles: By including lagged vote share from both the previous cycle and the one before that, the model is able to account for shifts within the state, such as changes in demographics, turnout, or campaign strategy.
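The two lagged variables above can be illustrated with a short Python sketch. It is not the code behind this post: the vote shares below are made-up placeholders, and the function name `lagged_features` is hypothetical. It only shows how one state's past results could be turned into the "lagged one cycle" and "lagged two cycles" predictors.

```python
# Hypothetical Democratic two-party vote shares (percent) for one state,
# keyed by election year. Placeholder numbers, not the course data.
shares_by_year = {2008: 51.0, 2012: 50.5, 2016: 49.0, 2020: 50.2}


def lagged_features(shares, year, cycle_length=4):
    """Return (lag1, lag2): the vote share one and two cycles before `year`.

    A cycle with no recorded result yields None so the caller can drop the row.
    """
    lag1 = shares.get(year - cycle_length)
    lag2 = shares.get(year - 2 * cycle_length)
    return lag1, lag2


# Predictors for a 2024 forecast come from the 2020 and 2016 results.
lag1, lag2 = lagged_features(shares_by_year, 2024)
print(f"lag1={lag1}, lag2={lag2}")
```

Rows where either lag is `None` (for example, the earliest cycles in the data) would simply be left out of training.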

Significantly, this model does not include the campaign variables covered in weeks six and seven. There are three primary reasons for this choice:

1. As political scholars point out, the election can often be predicted on fundamentals alone due to a tug-of-war effect: each candidate campaigns at a similar volume, cancelling out the effect of the opponent’s campaign.
2. Because campaigning is dynamic and ever-changing, there is little historical data to incorporate into the model. Limited data will often bias a model or make its estimates unstable, which would inhibit my understanding of the predictive power of the variables listed above.
3. Over the past month, Kamala Harris raised over 1 billion dollars in donations (Wall Street Journal, 2024). Campaign spending and mobilization have reached unprecedented levels, calling into question the predictive power of campaigns.

This model is also trained on data beginning in 1972 so as to include the maximum number of election cycles after the Civil Rights Act, when each party’s ideology became more consistent.

Training a regression model to predict Democratic two-party vote share

## 
## =================================================================================
##                                                         Dependent variable:      
##                                                   -------------------------------
##                                                   Democratic Two-Party Vote Share
## ---------------------------------------------------------------------------------
## Latest Democratic Poll Averages                          0.695*** (0.027)        
## Q2 GDP Growth                                            0.138*** (0.017)        
## Incumbency                                               -3.112*** (0.410)       
## Democratic Two-Party Vote Share Lagged One Cycle         0.526*** (0.033)        
## Democratic Two-Party Vote Share Lagged Two Cycles        -0.190*** (0.025)       
## Constant                                                 3.690*** (0.939)        
## ---------------------------------------------------------------------------------
## Observations                                                    559              
## R2                                                             0.836             
## Adjusted R2                                                    0.834             
## Residual Std. Error                                      3.741 (df = 553)        
## F Statistic                                          563.175*** (df = 5; 553)    
## =================================================================================
## Note:                                                 *p<0.1; **p<0.05; ***p<0.01

This model, with an adjusted R^2 of 0.834, explains roughly 83% of the variance in Democratic two-party vote share across every state’s elections since 1972. The asterisks indicate that each of the five variables is statistically significant at the 0.01 level. Each coefficient represents the change, in percentage points, in predicted Democratic vote share associated with a one-unit increase in that variable, holding the others constant.
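To make the coefficient interpretation concrete, here is a minimal Python sketch that plugs the Democratic coefficients from the table above into the regression equation. The input values are hypothetical, and coding incumbency as 0/1 is my assumption, since the table does not state the coding. The second function applies the standard adjusted R^2 formula to the reported R2, observation count, and number of predictors.

```python
# Coefficients copied from the Democratic regression table above.
DEM_COEFS = {
    "constant": 3.690,
    "poll_avg": 0.695,
    "q2_gdp_growth": 0.138,
    "incumbency": -3.112,
    "lag1_share": 0.526,
    "lag2_share": -0.190,
}


def predict_dem_share(poll_avg, q2_gdp_growth, incumbency, lag1_share, lag2_share):
    """Linear prediction: the constant plus each coefficient times its input."""
    c = DEM_COEFS
    return (
        c["constant"]
        + c["poll_avg"] * poll_avg
        + c["q2_gdp_growth"] * q2_gdp_growth
        + c["incumbency"] * incumbency  # assumed to be coded 0 or 1
        + c["lag1_share"] * lag1_share
        + c["lag2_share"] * lag2_share
    )


def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjustment of R^2 for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical inputs for a single state (illustrative only).
base = predict_dem_share(48.0, 3.0, 1, 50.0, 49.0)
plus_one_poll_point = predict_dem_share(49.0, 3.0, 1, 50.0, 49.0)
print(f"prediction: {base:.3f}")
print(f"effect of one more poll point: {plus_one_poll_point - base:.3f}")
print(f"adjusted R^2: {adjusted_r2(0.836, 559, 5):.4f}")
```

Raising the poll average by one point moves the prediction by exactly the poll coefficient, which is what "holding the other variables constant" means in practice. The adjusted R^2 computed from the rounded R2 of 0.836 lands within rounding distance of the 0.834 reported in the table.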

Training a regression model to predict Republican two-party vote share

## 
## =================================================================================
##                                                         Dependent variable:      
##                                                   -------------------------------
##                                                   Republican Two-Party Vote Share
## ---------------------------------------------------------------------------------
## Latest Republican Poll Averages                          0.584*** (0.023)        
## Q2 GDP Growth                                            -0.040** (0.017)        
## Incumbency                                               -3.725*** (0.407)       
## Republican Two-Party Vote Share Lagged One Cycle         0.444*** (0.035)        
## Republican Two-Party Vote Share Lagged Two Cycles        0.076*** (0.026)        
## Constant                                                   0.578 (1.093)         
## ---------------------------------------------------------------------------------
## Observations                                                    559              
## R2                                                             0.833             
## Adjusted R2                                                    0.832             
## Residual Std. Error                                      3.770 (df = 553)        
## F Statistic                                          553.132*** (df = 5; 553)    
## =================================================================================
## Note:                                                 *p<0.1; **p<0.05; ***p<0.01

This model, with an adjusted R^2 of 0.832, explains roughly 83% of the variance in Republican two-party vote share across every state’s elections since 1972. The asterisks indicate that each of the five variables is statistically significant at the 0.01 level, except Q2 GDP growth, which is significant at the 0.05 level. Each coefficient represents the change, in percentage points, in predicted Republican vote share associated with a one-unit increase in that variable, holding the others constant.

The similarities between these two regression models suggest that the chosen variables are predictive for both parties.

Utilizing a regularized regression to eliminate collinearity

Because my predictors are often highly correlated not only with the outcome variable, two-party vote share, but also with one another, the model carries a high risk of collinearity. Collinearity can make coefficient estimates unstable, so I chose an elastic-net regularized regression, which shrinks the coefficients toward zero and thereby trades a small amount of bias for lower variance. Cross-validation was used to select the model’s best lambda value.
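For readers unfamiliar with the elastic net, the following is a minimal pure-Python sketch of the idea, written for illustration only. It is not the implementation used in this post, which is not shown here. It minimizes a commonly used form of the objective, (1/2N) * RSS + lambda * (alpha * L1 + (1 - alpha)/2 * L2^2), by coordinate descent on centered data. The toy data and the names `soft_threshold` and `elastic_net` are my own.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values inside [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(X, y, lam, alpha=0.5, n_sweeps=200):
    """Coordinate-descent elastic net with an unpenalized intercept.

    Assumes no column of X is constant (otherwise lam=0 would divide by zero).
    """
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center the data so the intercept can be recovered at the end.
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = yc[i] - sum(Xc[i][k] * beta[k] for k in range(p) if k != j)
                rho += Xc[i][j] * partial
            z = sum(Xc[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho / n, lam * alpha) / (z / n + lam * (1 - alpha))
    intercept = y_mean - sum(x_means[j] * beta[j] for j in range(p))
    return intercept, beta


# Toy data: two strongly correlated predictors, y = 1 + 2*x1 + 0.5*x2 exactly.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [1 + 2 * a + 0.5 * b for a, b in X]

for lam in (0.0, 1.0, 100.0):
    intercept, beta = elastic_net(X, y, lam)
    print(lam, round(intercept, 3), [round(b, 3) for b in beta])
```

With lambda set to zero the sketch recovers the ordinary least-squares coefficients, a moderate lambda shrinks them, and a very large lambda sets them all to zero. Cross-validation chooses a lambda between those extremes.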

To test the success of the elastic net regularization on enhancing my model’s predictive power, I ran a k-fold cross validation, the results of which are included here:

Table 1: Out-of-Sample Error Summary for Democratic and Republican Predictions

Party	Mean Error	Standard Deviation
Democratic	0.0520180	5.800855
Republican	-0.1684959	7.165806

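The mean error is close to zero for both parties, which suggests that neither model is systematically biased out of sample and increases my confidence in both. The standard deviations of roughly 5.8 and 7.2 points, however, are wide relative to typical swing-state margins, so individual state predictions remain uncertain.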
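The k-fold procedure behind a table like Table 1 can be sketched as follows. This is a simplified illustration, not the code behind this post. It uses a one-variable least-squares line instead of the five-variable elastic net, synthetic data, and it assumes the error is defined as predicted minus actual. The helper names `fit_line` and `k_fold_errors` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


def k_fold_errors(xs, ys, k=5, seed=0):
    """Out-of-sample errors (predicted minus actual) from k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k roughly equal, disjoint folds
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only the observations the model never saw during fitting.
        errors.extend((a + b * xs[i]) - ys[i] for i in held_out)
    return errors


# Synthetic "poll average" and "vote share" data (placeholders, not real polls).
polls = [40 + i for i in range(20)]
results = [0.9 * p + 5 + (1 if i % 2 else -1) for i, p in enumerate(polls)]

errors = k_fold_errors(polls, results)
print("mean error:", statistics.fmean(errors))
print("standard deviation:", statistics.stdev(errors))
```

The mean of the held-out errors measures systematic bias, while their standard deviation measures how far a typical single prediction can miss.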

Predicting the 2024 election

As I have done in the previous three weeks, I will predict the seven states that expert forecasters such as Cook and Sabato rate as toss-ups in the upcoming election: Arizona, Nevada, Michigan, Wisconsin, North Carolina, Georgia, and Pennsylvania. Using the elastic-net regularized regression models generated above, each of which includes five predictive variables, I calculated both Democratic and Republican two-party vote share.

2024 Election Predictions:

State	Democratic Two-Party Vote Share	Winner
Arizona	51.86191	Harris
Georgia	52.10763	Harris
Michigan	52.79416	Harris
Nevada	52.38088	Harris
North Carolina	51.74808	Harris
Pennsylvania	52.51349	Harris
Wisconsin	52.56357	Harris

State	Republican Two-Party Vote Share	Winner
Arizona	53.26769	Trump
Georgia	53.37372	Trump
Michigan	51.74763	Trump
Nevada	51.63499	Trump
North Carolina	53.46686	Trump
Pennsylvania	52.28012	Trump
Wisconsin	52.39460	Trump

Taken together, the two models reveal an apparent error: in each state, the predicted Democratic and Republican two-party vote shares sum to around 105% instead of 100%. The two models are fit separately, so nothing forces their predictions to sum to 100%. This bias does not appear to shift when any single variable is removed, suggesting that it stems from an anomaly in the data rather than from any one predictor. To account for this error, my final result normalizes the results above.
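One simple way to normalize is to divide each party's prediction by the sum of the two, so that the shares total 100%. The sketch below assumes that proportional rescaling is the method used. The post does not spell the method out, but this approach appears to reproduce the normalized figures reported below. The raw values are copied from the two prediction tables above.

```python
# Raw (Democratic, Republican) predictions copied from the two tables above.
raw_predictions = {
    "Arizona": (51.86191, 53.26769),
    "Georgia": (52.10763, 53.37372),
    "Michigan": (52.79416, 51.74763),
    "Nevada": (52.38088, 51.63499),
    "North Carolina": (51.74808, 53.46686),
    "Pennsylvania": (52.51349, 52.28012),
    "Wisconsin": (52.56357, 52.39460),
}


def normalize(dem, rep):
    """Rescale two predicted shares proportionally so they sum to 100."""
    total = dem + rep
    return 100 * dem / total, 100 * rep / total


normalized = {}
for state, (dem, rep) in raw_predictions.items():
    dem_n, rep_n = normalize(dem, rep)
    winner = "Harris" if dem_n > rep_n else "Trump"
    normalized[state] = (dem_n, rep_n, winner)
    print(f"{state}\t{dem_n:.5f}\t{rep_n:.5f}\t{winner}")
```

Proportional rescaling preserves the ratio between the two raw predictions, so the predicted winner in each state is whichever party had the larger raw share.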

Normalized 2024 election prediction:

State	Democratic Prediction	Republican Prediction	Winner
Arizona	49.33141	50.66859	Trump
Georgia	49.39985	50.60015	Trump
Michigan	50.50053	49.49947	Harris
Nevada	50.35854	49.64146	Harris
North Carolina	49.18320	50.81680	Trump
Pennsylvania	50.11135	49.88865	Harris
Wisconsin	50.08049	49.91951	Harris

After normalizing the results, the model predicts a narrow split of the swing states: Harris carries Michigan, Nevada, Pennsylvania, and Wisconsin, while Trump carries Arizona, Georgia, and North Carolina. Assuming the remaining states vote as expected (226 electoral votes for Harris and 219 for Trump), this amounts to 276 electoral votes for Harris and 262 for Trump. Every predicted margin is under two points, and Pennsylvania and Wisconsin are within a quarter of a point. Furthermore, the confidence intervals (not shown above) include both outcomes, re-emphasizing that this year’s election will be decided within an incredibly slim margin.

Notes

All code for this post is accessible via GitHub.

Data Sources

US Presidential Election Popular Vote Data from 1948-2020 provided by the course. Economic data from the U.S. Bureau of Economic Analysis, also provided by the course. Polling data sourced from FiveThirtyEight.

Final Prediction Post

Mena Solomon

2024/10/27
