import statsmodels.api as sm # Our statsmodels library!
import cfbd # College football data library
from cfbd.rest import ApiException # Handle API exceptions from cfbd
import pandas as pd # Import pandas to work with dataframes

When thinking about a data science project, one of the first decisions I find myself having to make is which language I want to use. My first two choices are generally R or Python. Both languages are capable of high-level data science work, but each has its pros and cons as well.
Python
Python is often one of the first languages a new programmer learns. Its syntax is intuitive, and the language has become quite popular in recent years, thanks in part to the community of open source libraries that are easy to import into a script.
A drawback to Python, however, is that it does not come with built-in data frame capabilities and requires a third-party library (Pandas being the most popular) to structure data in a way that is easy to work with. You can structure data in lists, arrays, and dictionaries, but it takes a fair amount of code to get the data into a workable shape.
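To illustrate the point, here is a tiny sketch contrasting plain Python structures with a pandas DataFrame. The team names and records are made up for the example:

```python
# Tabular data with plain Python structures: a list of dicts
rows = [
    {'team': 'Alabama', 'wins': 11, 'losses': 2},
    {'team': 'Akron', 'wins': 2, 'losses': 10},
]

# Filtering by hand takes a comprehension per question you ask
winning_teams = [r['team'] for r in rows if r['wins'] > r['losses']]

# With pandas, the same data becomes a DataFrame with vectorized filtering
import pandas as pd

df = pd.DataFrame(rows)
winning_df = df[df['wins'] > df['losses']]
```

For two rows the difference is small, but the gap widens quickly once you need joins, grouping, or missing-value handling.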
For all the intuitiveness of the language’s syntax, visualizations are not the easiest to create either. Matplotlib is a popular library for creating visualizations, but it is not as straightforward as some of the packages or built-in capabilities available in R.
R
R is a programming language built by data scientists for data scientists. The language is widely used in academia and can complete complex statistical tasks with relatively little code. R also has built-in data frame handling capabilities. When working with a large data set, it is quite easy to bring the data set into your environment and begin working with it immediately.
While the Comprehensive R Archive Network (CRAN) has some great libraries available, its roughly 19,000 packages pale in comparison to the over 300,000 available in Python. Some of those packages are still listed but no longer maintained, and they can break when used alongside many other libraries. Further, the learning curve for R is much steeper than Python’s because the syntax often does not resemble simple human language the way Python’s does.
Personal Grievances
Consider this section of the blog my personal Festivus celebration. To paraphrase Frank Costanza, the tradition of data science begins with an airing of grievances. This is my blog, I have a lot of problems with these languages, and now you’re gonna hear about it.
My biggest complaint with Python is that there is often a ton of data preprocessing that needs to be done before the data is ready to build a model. And even then, getting the data into the actual model is an arduous task. R has its own problem: I spend more time fighting with data types in R than I do in any other language. Why there is so much trouble casting a 3 from a character to an integer I will never understand. It’s in my data set as an integer; why are you reading it as a character, R?
There Has To Be A Better Way
Enter the statsmodels Python library. From the statsmodels website, this library “supports specifying models using R-style formulas”. My understanding is that this should create a best-of-both-worlds situation: we can leverage the speed and third-party libraries of Python while maintaining the simple statistical model creation found in R.
About statsmodels
The statsmodels package was originally written as the models module of the scipy.stats package. When the module was removed in the late 2000s, developer Jonathan Taylor improved the package and decided to release it on its own.
statsmodels has a host of statistical modeling capabilities available through the statsmodels.api, statsmodels.tsa.api, and statsmodels.formula.api modules. A full list is available here with documentation to go along with each.
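To show what the R-style formula interface looks like, here is a minimal sketch on a toy data set (the numbers are made up for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# A toy data set: y is roughly 2*x with a little noise
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.1, 5.9, 8.2, 9.8],
})

# 'y ~ x' reads just like an R formula; an intercept is added automatically
model = smf.ols('y ~ x', data=toy).fit()
print(model.params)
```

The fitted parameters come back labeled by name ('Intercept' and 'x'), much like the output of R’s lm().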
An Example
To try this package out, I want to use this library to see how much returning production influences a college football team’s winning percentage. With the rise of the transfer portal in college football, there is a new debate over whether a team should seek out the best talent in the portal or recruit players out of high school and develop them in its system. A team’s percentage of returning EPA (called PPA in our data API) will serve as our independent variable, and the team’s winning percentage will be the dependent variable.
EPA stands for Expected Points Added, and it measures how many expected points are added or lost on a given play. For example, if a team has the ball on their own 20-yard line, they might be expected to score 2.5 points on that drive. But if they throw a 75-yard pass to take the ball down to the opponent’s 5-yard line, their expected points for the drive might rise to 6.5. The EPA for that play would be 4. I don’t know the exact breakdown of expected points at every yard line, but this example should demonstrate how EPA works.
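The arithmetic from that example, with the expected-point values being the hypothetical ones above:

```python
# Hypothetical expected points before the snap and after the play
ep_before = 2.5  # ball on own 20-yard line
ep_after = 6.5   # ball on opponent's 5-yard line after a 75-yard completion

# EPA is simply the change in expected points caused by the play
epa = ep_after - ep_before
print(epa)  # 4.0
```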
I’ll be pulling the data from the folks at CollegeFootballData. I highly recommend this site for college football data. If you wish to use this library, however, you will need to sign up for a free API key and configure it according to their instructions.
configuration = cfbd.Configuration()
configuration.api_key['Authorization'] = 'YOUR_API_KEY' # Replace with your own cfbd API key
configuration.api_key_prefix['Authorization'] = 'Bearer'

# Set up instance of the PlayersApi class from cfbd
epa_api = cfbd.PlayersApi(cfbd.ApiClient(configuration))
epa_response_2022 = epa_api.get_returning_production(year=2022)
epa_response_2023 = epa_api.get_returning_production(year=2023)
epa_response_2024 = epa_api.get_returning_production(year=2024)
# Extract data from objects in API response
epa_response_data_2022 = [vars(item) for item in epa_response_2022]
epa_response_data_2023 = [vars(item) for item in epa_response_2023]
epa_response_data_2024 = [vars(item) for item in epa_response_2024]
# Convert to dataframes
epa_df_2022 = pd.DataFrame(epa_response_data_2022)
epa_df_2023 = pd.DataFrame(epa_response_data_2023)
epa_df_2024 = pd.DataFrame(epa_response_data_2024)
# Combine to single EPA dataframe
epa_df = pd.concat([epa_df_2022, epa_df_2023, epa_df_2024])
epa_df.head(5)

| | _configuration | _season | _team | _conference | _total_ppa | _total_passing_ppa | _total_receiving_ppa | _total_rushing_ppa | _percent_ppa | _percent_passing_ppa | _percent_receiving_ppa | _percent_rushing_ppa | _usage | _passing_usage | _receiving_usage | _rushing_usage | discriminator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <cfbd.configuration.Configuration object at 0x... | 2022 | Air Force | Mountain West | 249.5 | 57.7 | 28.1 | 163.7 | 0.725 | 0.997 | 0.309 | 0.839 | 0.852 | 0.917 | 0.392 | 0.892 | None |
| 1 | <cfbd.configuration.Configuration object at 0x... | 2022 | Akron | Mid-American | 52.9 | 4.7 | 43.5 | 4.6 | 0.253 | 0.086 | 0.349 | 0.155 | 0.485 | 0.343 | 0.290 | 0.806 | None |
| 2 | <cfbd.configuration.Configuration object at 0x... | 2022 | Alabama | SEC | 401.7 | 243.9 | 117.3 | 40.5 | 0.555 | 0.896 | 0.313 | 0.525 | 0.586 | 0.962 | 0.276 | 0.446 | None |
| 3 | <cfbd.configuration.Configuration object at 0x... | 2022 | Appalachian State | Sun Belt | 304.1 | 165.3 | 41.6 | 97.3 | 0.561 | 0.967 | 0.150 | 1.031 | 0.680 | 0.943 | 0.211 | 0.751 | None |
| 4 | <cfbd.configuration.Configuration object at 0x... | 2022 | Arizona | Pac-12 | 158.1 | 56.8 | 79.1 | 22.2 | 0.569 | 1.000 | 0.420 | 0.677 | 0.825 | 1.000 | 0.406 | 0.925 | None |
As you can see, we now have a data frame with our EPA data. A note on the API response: it comes back as a list of objects rather than a dictionary, which is why we need to parse things out before creating a full dataframe.
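The vars() trick is worth a quick look on its own. Here is a self-contained sketch using a made-up stand-in class shaped like the cfbd response objects:

```python
import pandas as pd

# A hypothetical stand-in for the objects the cfbd API returns
class FakeReturningProduction:
    def __init__(self, team, percent_ppa):
        self.team = team
        self.percent_ppa = percent_ppa

response = [
    FakeReturningProduction('Alabama', 0.555),
    FakeReturningProduction('Akron', 0.253),
]

# vars(obj) returns the object's attribute dictionary, so a list of
# objects becomes a list of dicts, which pandas knows how to ingest
df = pd.DataFrame([vars(item) for item in response])
print(df)
```

Each attribute becomes a column, which is exactly why the private attributes of the real response objects (like _configuration) show up as columns in our dataframe above.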
The next thing we’ll need to do is get the winning percentages for each of these teams and combine that with our EPA data.
# Set up instance of the GamesApi class from cfbd
teams_api = cfbd.GamesApi(cfbd.ApiClient(configuration))
teams_response_2022 = teams_api.get_team_records(year=2022)
teams_response_2023 = teams_api.get_team_records(year=2023)
teams_response_2024 = teams_api.get_team_records(year=2024)

teams_response_data_2022 = [vars(item) for item in teams_response_2022]
teams_response_data_2023 = [vars(item) for item in teams_response_2023]
teams_response_data_2024 = [vars(item) for item in teams_response_2024]
teams_df_2022 = pd.DataFrame(teams_response_data_2022)
teams_df_2023 = pd.DataFrame(teams_response_data_2023)
teams_df_2024 = pd.DataFrame(teams_response_data_2024)
teams_df = pd.concat([teams_df_2022, teams_df_2023, teams_df_2024])
# Rename columns so they will combine easily
epa_df.rename(columns={'_season': '_year'}, inplace=True)
# Merge the dataframes
full_df = pd.merge(epa_df, teams_df, how='left', on=['_year', '_team'])
# Extract data from the games object
full_df[['games', 'wins', 'losses', 'ties']] = full_df['_total'].apply(
lambda x: [getattr(x, '_games', None), getattr(x, '_wins', None),
getattr(x, '_losses', None), getattr(x, '_ties', None)]
).apply(pd.Series)
# Drop the original '_total' column
full_df.drop(columns=['_total'], inplace=True)
full_df['wp'] = full_df['wins'] / full_df['games']
# Drop rows with missing information
full_df.dropna(subset=['wp', '_percent_ppa'], inplace=True)
full_df.head(5)
| | _configuration_x | _year | _team | _conference_x | _total_ppa | _total_passing_ppa | _total_receiving_ppa | _total_rushing_ppa | _percent_ppa | _percent_passing_ppa | _percent_receiving_ppa | _percent_rushing_ppa | _usage | _passing_usage | _receiving_usage | _rushing_usage | discriminator_x | _configuration_y | _team_id | _conference_y | _division | _expected_wins | _conference_games | _home_games | _away_games | discriminator_y | games | wins | losses | ties | wp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <cfbd.configuration.Configuration object at 0x... | 2022 | Air Force | Mountain West | 249.5 | 57.7 | 28.1 | 163.7 | 0.725 | 0.997 | 0.309 | 0.839 | 0.852 | 0.917 | 0.392 | 0.892 | None | <cfbd.configuration.Configuration object at 0x... | 2005.0 | Mountain West | Mountain | 10.9 | {'games': 8, 'losses': 3, 'ties': 0, 'wins': 5} | {'games': 7, 'losses': 1, 'ties': 0, 'wins': 6} | {'games': 4, 'losses': 2, 'ties': 0, 'wins': 2} | None | 13.0 | 10.0 | 3.0 | 0.0 | 0.769231 |
| 1 | <cfbd.configuration.Configuration object at 0x... | 2022 | Akron | Mid-American | 52.9 | 4.7 | 43.5 | 4.6 | 0.253 | 0.086 | 0.349 | 0.155 | 0.485 | 0.343 | 0.290 | 0.806 | None | <cfbd.configuration.Configuration object at 0x... | 2006.0 | Mid-American | East | 2.0 | {'games': 8, 'losses': 7, 'ties': 0, 'wins': 1} | {'games': 5, 'losses': 4, 'ties': 0, 'wins': 1} | {'games': 7, 'losses': 6, 'ties': 0, 'wins': 1} | None | 12.0 | 2.0 | 10.0 | 0.0 | 0.166667 |
| 2 | <cfbd.configuration.Configuration object at 0x... | 2022 | Alabama | SEC | 401.7 | 243.9 | 117.3 | 40.5 | 0.555 | 0.896 | 0.313 | 0.525 | 0.586 | 0.962 | 0.276 | 0.446 | None | <cfbd.configuration.Configuration object at 0x... | 333.0 | SEC | West | 10.5 | {'games': 8, 'losses': 2, 'ties': 0, 'wins': 6} | {'games': 7, 'losses': 0, 'ties': 0, 'wins': 7} | {'games': 5, 'losses': 2, 'ties': 0, 'wins': 3} | None | 13.0 | 11.0 | 2.0 | 0.0 | 0.846154 |
| 4 | <cfbd.configuration.Configuration object at 0x... | 2022 | Arizona | Pac-12 | 158.1 | 56.8 | 79.1 | 22.2 | 0.569 | 1.000 | 0.420 | 0.677 | 0.825 | 1.000 | 0.406 | 0.925 | None | <cfbd.configuration.Configuration object at 0x... | 12.0 | Pac-12 | | 4.0 | {'games': 9, 'losses': 6, 'ties': 0, 'wins': 3} | {'games': 7, 'losses': 4, 'ties': 0, 'wins': 3} | {'games': 5, 'losses': 3, 'ties': 0, 'wins': 2} | None | 12.0 | 5.0 | 7.0 | 0.0 | 0.416667 |
| 5 | <cfbd.configuration.Configuration object at 0x... | 2022 | Arizona State | Pac-12 | 54.3 | 7.5 | 31.4 | 15.3 | 0.136 | 0.082 | 0.195 | 0.105 | 0.134 | 0.030 | 0.223 | 0.169 | None | <cfbd.configuration.Configuration object at 0x... | 9.0 | Pac-12 | | 5.4 | {'games': 9, 'losses': 7, 'ties': 0, 'wins': 2} | {'games': 6, 'losses': 4, 'ties': 0, 'wins': 2} | {'games': 6, 'losses': 5, 'ties': 0, 'wins': 1} | None | 12.0 | 3.0 | 9.0 | 0.0 | 0.250000 |
Again we have to do a little extra cleaning when setting up our data. A team’s win/loss data comes in as an object (you can see examples in the conference_games, home_games, and away_games columns; I left those alone because I don’t need that granular information), so we have to do a little work to get the information from that object into its own columns and then calculate winning percentage.
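As an aside, when nested values arrive as plain dicts (the way the conference_games column appears above), pandas offers a terser way to unpack them than the getattr approach. A sketch on made-up numbers:

```python
import pandas as pd

# A toy dataframe where the win/loss record is a dict per row
df = pd.DataFrame({
    'team': ['Air Force', 'Akron'],
    'total': [
        {'games': 13, 'wins': 10, 'losses': 3, 'ties': 0},
        {'games': 12, 'wins': 2, 'losses': 10, 'ties': 0},
    ],
})

# Expand each dict into its own columns, then compute winning percentage
records = pd.json_normalize(df['total'].tolist())
df = df.drop(columns=['total']).join(records)
df['wp'] = df['wins'] / df['games']
```

Our _total column holds objects rather than dicts, so the getattr version in the code above is what actually works here, but json_normalize is handy when an API hands you raw JSON.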
Now that we have our data set up as needed, let’s get to work with modeling. I’d like to compare the more traditional way of doing linear regression, using the sklearn.linear_model.LinearRegression class, with the new way using the statsmodels library. First off, this is how I would set up the data for sklearn.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
x = full_df['_percent_ppa'].to_numpy().reshape(-1, 1)
y = full_df['wp'].to_numpy().reshape(-1, 1)
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)
r_squared = model.score(x, y)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, y_pred)
intercept = model.intercept_
slope = model.coef_[0]
print(f"R-Squared: {r_squared}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"MAE: {mae}")
print(f"Intercept: {intercept}")
print(f"Slope: {slope}")

R-Squared: 0.04974591372315962
MSE: 0.03846808567182547
RMSE: 0.19613282660438427
MAE: 0.16322629887376885
Intercept: [0.42350689]
Slope: [0.20232842]
Now, this is a fairly basic example, so there is not a ton we need to do with our data before feeding it into the model. Of course, as we create more complex models this would become a more involved task. I am also able to pull out the summary information from the model, but it requires importing other classes, and we have to write out formatted strings so we know what we are looking at.
Let’s compare this now to the statsmodels library.
x = full_df['_percent_ppa']
y = full_df['wp']
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable: wp R-squared: 0.050
Model: OLS Adj. R-squared: 0.047
Method: Least Squares F-statistic: 19.53
Date: Sun, 02 Feb 2025 Prob (F-statistic): 1.30e-05
Time: 17:21:09 Log-Likelihood: 78.759
No. Observations: 375 AIC: -153.5
Df Residuals: 373 BIC: -145.7
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 0.4235 0.026 16.527 0.000 0.373 0.474
_percent_ppa 0.2023 0.046 4.419 0.000 0.112 0.292
==============================================================================
Omnibus: 13.233 Durbin-Watson: 1.708
Prob(Omnibus): 0.001 Jarque-Bera (JB): 6.402
Skew: -0.010 Prob(JB): 0.0407
Kurtosis: 2.360 Cond. No. 5.75
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Thoughts on statsmodels
Now that we have our two models made, I have a couple of initial thoughts before dissecting the output of the models.
The first is that statsmodels on its own really doesn’t incorporate R syntax as much as I initially thought it would. In fact, it seems that the entire premise of using R syntax for creating models is more related to the patsy dependency, which is no longer in active development, than to the statsmodels library itself.
That being said, I do find this far easier than sklearn. With statsmodels, I don’t need to worry about converting my data into numpy arrays before putting the data into the model. The thing I find most useful, however, is the built-in summary method. Rather than pulling each summary value out individually and formatting strings so I know what I am looking at, I can just call model.summary() and get a full look at what the model produces.
I find this to be far more convenient, and I certainly think I would use statsmodels if given a choice between that and sklearn.
Evaluating our model
Our model suggests that returning production plays a small but statistically significant role in predicting team success. Our R-squared value says that only about 5% of the variance in a team’s winning percentage is explained by returning production, even though the relationship between the two is statistically significant. So the relationship is there, but the effect on a team’s success is pretty weak. Given what I see in this model, I don’t think I would use returning production as a real predictor of team success. The relationship may be statistically significant, but it does not appear to be practically significant.
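To put the slope in context, we can plug the fitted coefficients from the output above into the regression line. Moving from 0% to 100% returning PPA only moves the predicted winning percentage from roughly 0.42 to roughly 0.63, which works out to about two and a half extra wins over a 12-game season:

```python
# Coefficients taken from the model output above
intercept, slope = 0.4235, 0.2023

# Predicted winning percentage across the range of returning production
preds = {pct: intercept + slope * pct for pct in (0.0, 0.5, 1.0)}
for pct, wp in preds.items():
    print(f"{pct:.0%} returning PPA -> predicted wp {wp:.3f}")
```

That is not nothing, but it is a small swing for the most extreme difference in roster continuity possible, which matches the practical-significance conclusion above.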