Erick Lu
March 31, 2020 - GitHub Repo
In this project, I use Python to “scrape” ESPN for stats on all the players in the NBA, clean and organize the data into a data science-friendly format, and calculate some interesting statistics. Web scraping is a useful technique for extracting data from websites that don’t offer formatted, raw data for download.
As an example, I will be scraping data from the rosters of each team in the NBA for information such as player age, height, weight, and salary. I will also loop through each individual player's stats page and extract career averages such as points per game, free throw percentages, and more (current as of March 2020).
We can use this data to answer questions such as: which team pays the highest average salary, who the tallest and heaviest players in the league are, and how strongly points per game correlates with salary.
I've exported the data to a nicely organized csv file, accessible in the GitHub repo for this project, in case you would like to analyze it yourself. You can also run the python script scrape_nba_statistics.py to re-scrape ESPN for up-to-date data.
In the following sections, I will describe how to loop through ESPN page sources using urllib, extract information using re (regular expressions), organize player statistics in pandas DataFrames, and perform some simple modeling using scikit-learn.
We will first take a look at the structure of the website and figure out which web pages we need to scrape information from. The teams page at https://www.espn.com/nba/teams looks like the following:
This looks very promising. All the teams are listed on this page, which means that they can easily be extracted from the page source. Let’s take a look at the page source to see if we can find URLs for each team's roster:
It looks like URLs for each of the teams' rosters are listed in the page source with the following format: https://www.espn.com/nba/team/roster/_/name/team/team-name, as shown in the highlighted portion of the image above. Given that these all follow the same format, we can use regular expressions to pull out a list of all the team names from the page source, and then construct the roster URLs using the format above. Start by importing the urllib and re packages in Python:
import re
import urllib.request
from time import sleep
Now, let’s create a function that will extract all the team names from http://www.espn.com/nba/teams and construct roster URLs for each of the teams:
# This function finds the urls for each of the rosters in the NBA using regexes.
def build_team_urls():
    # Open the espn teams webpage and extract the names of each roster available.
    f = urllib.request.urlopen('http://www.espn.com/nba/teams')
    teams_source = f.read().decode('utf-8')
    teams = dict(re.findall(r"www\.espn\.com/nba/team/_/name/(\w+)/(.+?)\",", teams_source))
    # Using the names of the rosters, create the urls of each roster.
    roster_urls = []
    for key in teams.keys():
        # Each roster webpage follows this general pattern.
        roster_urls.append('http://www.espn.com/nba/team/roster/_/name/' + key + '/' + teams[key])
    return dict(zip(teams.values(), roster_urls))
rosters = build_team_urls()
rosters
The function build_team_urls() returns a dictionary that matches team names with their corresponding roster URL. Given this information, we can systematically loop through all of the rosters and use regular expressions to extract player information for each team.
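Before wiring this into the full scraper, it helps to see the regex-to-dictionary trick in isolation. Here is a minimal sketch run on a made-up fragment of page source (the snippet is invented for illustration, not actual ESPN markup):

```python
import re

# A two-group findall returns a list of (key, value) tuples, which
# dict() turns into a lookup of team abbreviation -> team name.
snippet = 'href="https://www.espn.com/nba/team/_/name/gs/golden-state-warriors",'
teams = dict(re.findall(r"www\.espn\.com/nba/team/_/name/(\w+)/(.+?)\",", snippet))
print(teams)  # {'gs': 'golden-state-warriors'}
```

The same pattern applied to the full page source yields one entry per team.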
In order to figure out how to scrape the rosters, let’s take a look at the Golden State Warriors' roster page as an example:
Information for each player is nicely laid out in a table, meaning that the data is likely obtainable using regular expressions. Taking a look at the page source reveals that each player's name and information are all provided in blocks of what appear to be json, highlighted below:
Given the standardized format of the data for each player, this information is indeed extractable using regular expressions. First, we should read in the roster webpage using urllib.request.urlopen:
url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
f = urllib.request.urlopen(url)
roster_source = f.read().decode('utf-8')
Then, we construct the regex that will return information for each of the players on the roster webpage.
player_regex = r'\{"name":"(\w+\s\w+)","href":"http://www\.espn\.com/nba/player/.*?",(.*?)\}'
player_regex
player_info = re.findall(player_regex, roster_source)
player_info[0:4]
As you can see, player_info is a list of tuples, in which each player name is paired with a set of information (height, weight, age, etc.) that is organized in json format. We can use the json package in Python to convert the information into a Python dictionary:
import json
draymond = json.loads("{"+player_info[3][1]+"}")
draymond
In the example above, all of the pertinent information for Draymond Green is now stored in a Python dictionary named draymond. Let's use the snippets of code above to construct a function which loops through each player in a given roster and scrapes their information:
def get_player_info(roster_url):
    f = urllib.request.urlopen(roster_url)
    roster_source = f.read().decode('utf-8')
    sleep(0.5)
    player_regex = r'\{"name":"(\w+\s\w+)","href":"http://www\.espn\.com/nba/player/.*?",(.*?)\}'
    player_info = re.findall(player_regex, roster_source)
    player_dict = dict()
    for player in player_info:
        player_dict[player[0]] = json.loads("{" + player[1] + "}")
    return player_dict
We can now loop through each team in rosters and run get_player_info(), storing the output in a dictionary called all_players:
all_players = dict()
for team in rosters.keys():
    print("Gathering player info for team: " + team)
    all_players[team] = get_player_info(rosters[team])
After running this code, the all_players dictionary should be a dictionary of dictionaries of dictionaries. This sounds complicated, but let's walk through what it looks like. The first level of keys should correspond to teams:
all_players.keys()
Within a team, the keys should correspond to player names. Let's zoom in on the LA Lakers:
all_players["los-angeles-lakers"].keys()
Now we can choose which player to look at. Let's choose LeBron James as an example:
all_players["los-angeles-lakers"]["LeBron James"]
A dictionary with information about LeBron James is returned. We can extract information even more precisely by specifying which field we are interested in. Let's get his salary:
all_players["los-angeles-lakers"]["LeBron James"]["salary"]
In order to make data analysis easier, we can re-format this dictionary into a pandas DataFrame. The function pd.DataFrame.from_dict() can turn a dictionary of dictionaries into a pandas DataFrame, as demonstrated below:
import pandas as pd
gsw = pd.DataFrame.from_dict(all_players["golden-state-warriors"], orient = "index")
gsw
In the DataFrame above, parameters such as 'age' and 'salary' are organized in columns and each player is a row. This makes the data much easier to read and understand. Furthermore, it also places null values when pieces of data are missing. For example, Chasson Randle's salary information is missing from the website, so 'NaN' is automatically placed in the DataFrame.
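This NaN-filling behavior can be seen in a minimal sketch with made-up players, where one inner dictionary is missing a field:

```python
import pandas as pd

# Made-up players: "Player B" has no salary reported, so from_dict
# fills that cell with NaN in the resulting DataFrame.
toy = {
    "Player A": {"age": 30, "salary": 1000000},
    "Player B": {"age": 25},
}
toy_df = pd.DataFrame.from_dict(toy, orient="index")
print(toy_df)
```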
DataFrames allow us to quickly make calculations, sort players based on their stats, and compare stats between teams. To make a DataFrame containing data from all the teams, we will loop through each team in all_players, construct DataFrames, label them with a team column, and aggregate them into a single DataFrame called all_players_df.
all_players_df = pd.DataFrame()
# loop through each team, create a pandas DataFrame, and append
for team in all_players.keys():
    team_df = pd.DataFrame.from_dict(all_players[team], orient="index")
    team_df['team'] = team
    all_players_df = all_players_df.append(team_df)
all_players_df.head(5)
Now, all_players_df is a DataFrame with all the players in the NBA categorized by team. It contains player information such as age, salary, height, and weight. I'll export this data to a csv file, in case you readers out there want to read it in and play around with it yourself.
all_players_df.to_csv("NBA_roster_info_all_players_mar2020.csv")
We also want to scrape data corresponding to the performance of each player, in terms of points per game, field goal percentage, rebounds per game, etc. Our goal is to append this information to all_players_df so that we can compare player performance with traits such as height and salary. We can find performance stats on each player's personal page on ESPN:
We'll want to extract the career stats in the bottom row, which can be found in the highlighted section of the source code below:
In order to extract the information above for each player in our DataFrame, we can construct URLs for player stats pages using the id column. Fortunately, the URL is standardized and very easy to construct. For example, using the id value of 3975 for Stephen Curry, the URL to open would be: https://www.espn.com/nba/player/stats/_/id/3975. Below is an example of extracting his career stats using regexes:
url = "https://www.espn.com/nba/player/stats/_/id/3975"
f = urllib.request.urlopen(url)
sleep(0.3)
player_source = f.read().decode('utf-8')
# extract career stats using this regex
stats_regex = r'\["Career","",(.*?)\]\},\{"ttl":"Regular Season Totals"'
career_info = re.findall(stats_regex, player_source)
print(career_info)
We observe that some of the stats are complex and contain non-numerical symbols such as "-". In the example above, the range "3.7-4.0" is for the column "FT", which stands for "Free Throws Made-Attempted Per Game". We should split this up into two categories, "Free Throws Made (FTM)" and "Free Throws Attempted (FTA)", and do the same for field goals and 3 pointers. To do so, we can split the string based on "-" and then un-nest the list. We also need to convert the strings to floating point values.
from itertools import chain
career_info = career_info[0].replace("\"", "").split(",")
career_info = list(chain.from_iterable([i.split("-") for i in career_info]))
career_info = list(map(float,career_info))
print(career_info)
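To see the un-nesting step in isolation, here is the same pipeline run on a made-up stats string (the values are invented for illustration):

```python
from itertools import chain

# Strip quotes, split on commas, then split ranges like "8.1-17.3"
# on "-" and flatten the nested lists before converting to floats.
raw = '"82","82","34.7","8.1-17.3","46.8"'
vals = raw.replace('"', '').split(',')
vals = list(chain.from_iterable(v.split('-') for v in vals))
vals = list(map(float, vals))
print(vals)  # [82.0, 82.0, 34.7, 8.1, 17.3, 46.8]
```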
Now we can loop through each player in all_players_df, open their stats webpage, extract their career stats, and store the stats in a separate data frame called career_stats_df using the code below:
career_stats_df = pd.DataFrame(columns=["GP", "GS", "MIN", "FGM", "FGA", "FG%", "3PTM", "3PTA", "3P%", "FTM", "FTA", "FT%", "OR", "DR", "REB", "AST", "BLK", "STL", "PF", "TO", "PTS"])
for player_index in all_players_df.index:
    url = "https://www.espn.com/nba/player/stats/_/id/" + str(all_players_df.loc[player_index]['id'])
    f = urllib.request.urlopen(url)
    sleep(0.3)
    player_source = f.read().decode('utf-8')
    # extract career stats using this regex
    stats_regex = r'\["Career","",(.*?)\]\},\{"ttl":"Regular Season Totals"'
    career_info = re.findall(stats_regex, player_source)
    try:
        # convert the stats to a list of floats, and add the entry to the DataFrame
        career_info = career_info[0].replace("\"", "").split(",")
        career_info = list(chain.from_iterable([i.split("-") for i in career_info]))
        career_info = list(map(float, career_info))
        career_stats_df = career_stats_df.append(pd.Series(career_info, index=career_stats_df.columns, name=player_index))
    except (IndexError, ValueError):
        # if no career stats were returned, the player was a rookie with no games played
        print(player_index + " has no info, ", end="")
Some player webpages did not have career stats, which I found corresponded to rookies with no games played. This threw an error in the loop, so I used a try/except clause to bypass the error and continue scraping content for the remaining players. Their stats are currently stored in the object career_stats_df:
career_stats_df.head(5)
The stats for each player are now organized in a neat DataFrame. Here is a legend for what each of the abbreviations means:
I'll also export these stats to a csv file:
career_stats_df.to_csv("NBA_player_stats_all_mar2020.csv")
We will now join career_stats_df with all_players_df, which will merge the content from both data frames based on rows that have the same index (player name). Players in all_players_df that are not included in career_stats_df will have NaN values for the joined columns.
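A minimal sketch of this join behavior, using made-up rows, shows where the NaN values come from:

```python
import pandas as pd

# "Player B" exists in the roster frame but not in the stats frame,
# so the joined PTS column gets NaN for that row.
roster = pd.DataFrame({"salary": [100, 200]}, index=["Player A", "Player B"])
stats = pd.DataFrame({"PTS": [25.0]}, index=["Player A"])
joined = roster.join(stats)
print(joined)
```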
all_stats_df = all_players_df.join(career_stats_df)
all_stats_df.head(5)
The performance stats have been added as columns on the right side of the DataFrame.
We notice that some of the columns which should contain numerical data, such as salary, height, and weight, are instead stored as strings. This is because they contain non-numerical characters (such as '$' and ',' for salary). To be able to compute statistics on these columns, we need to convert them to numeric values.
We can convert salaries to numeric by removing all non-numerical characters and converting to int using a list comprehension:
# before converting
all_stats_df['salary'].head(3)
all_stats_df['salary']=[int(re.sub(r'[^\d.]+', '', s)) if isinstance(s, str) else s for s in all_stats_df['salary'].values]
# after converting
all_stats_df['salary'].head(3)
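The substitution step can be checked on a single salary string in isolation (the figure here is just an example value):

```python
import re

# The regex [^\d.]+ matches any run of characters that are neither
# digits nor dots, so "$" and "," are stripped before the int() cast.
salary_numeric = int(re.sub(r'[^\d.]+', '', "$40,231,758"))
print(salary_numeric)  # 40231758
```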
Height is also provided in a non-numeric form, in feet plus inches (e.g. 6' 5"). We should convert this to a numeric form so that statistics can be calculated. To do so, we will write a small function, convert_height, that converts a string of feet plus inches into a numeric value of total inches.
def convert_height(height):
    split_height = height.split(" ")
    feet = float(split_height[0].replace("'", ""))
    inches = float(split_height[1].replace("\"", ""))
    return feet * 12 + inches
# before conversion
all_stats_df['height'].head(3)
all_stats_df['height'] = [convert_height(x) for x in all_stats_df['height']]
# after conversion
all_stats_df['height'].head(3)
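As a quick sanity check, convert_height should turn the example height from above into total inches (the function is repeated here so the snippet runs on its own):

```python
# Convert a feet-plus-inches string like 6' 5" into total inches.
def convert_height(height):
    split_height = height.split(" ")
    feet = float(split_height[0].replace("'", ""))
    inches = float(split_height[1].replace("\"", ""))
    return feet * 12 + inches

print(convert_height("6' 5\""))  # 77.0
```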
Weight is also a non-numerical field because of the units listed (e.g. '230 lbs'). We will simply strip off the units for each entry by splitting the string on the space with split(" ") and taking the left side of the split.
# before conversion
all_stats_df['weight'].head(3)
all_stats_df['weight'] = [float(x.split(" ")[0]) for x in all_stats_df['weight']]
# after conversion
all_stats_df['weight'].head(3)
This should be the last of the values we have to convert to numeric. Now, we have a cleaned-up and joined dataset! I'll save the data to a csv file.
all_stats_df.to_csv("NBA_player_info_and_stats_joined_mar2020.csv")
If you want to read in the data at a later time, you can use read_csv() like so:
all_stats_df = pd.read_csv("NBA_player_info_and_stats_joined_mar2020.csv", index_col=0)
We can use the data we just compiled to calculate some statistics. Let's start by calculating average stats per team, using groupby() with mean() in pandas.
# calculate means and remove irrelevant columns for id and jersey #
mean_df = all_stats_df.groupby('team').mean().drop(columns=['id', 'jersey'])
mean_df
As you can see, the index of the data frame that is returned now corresponds to each individual team, and the mean values are displayed for each of the columns with numerical values. To find the team with the highest average for a specific stat, we can use the sort_values() function. Let's find the top 5 teams with the highest average salary:
mean_df.sort_values('salary', ascending=False).head(5)
Looks like the highest average salary is paid by the Golden State Warriors. Similarly, we can find the top 10 highest paid players by sorting all_stats_df on salary, then pulling out the top entries for the 'salary' and 'team' columns:
all_stats_df.sort_values('salary', ascending=False)[['salary','team']].head(10)
Stephen Curry is the highest paid player in the NBA with a whopping salary of $40,231,758, followed by Russell Westbrook. We can continue to sift through the data this way for whatever piques our interest. Given how many different variables there are, we can write a small function to make things easier:
def top_n(df, category, n):
    return df.sort_values(category, ascending=False)[[category, 'team']].head(n)
This way, we can quickly identify the top n players for any given category in a DataFrame. Let's cycle through some stats of interest:
top_n(all_stats_df, 'PTS', 5)
top_n(all_stats_df, 'REB', 5)
top_n(all_stats_df, 'height', 5)
top_n(all_stats_df, 'weight', 5)
Interestingly, Tacko Fall of the Boston Celtics is both the tallest and the heaviest player in the NBA.
To get a high-level overview of how each statistic correlates with the others, we can generate a correlation matrix using corr() and matplotlib.
corr_matrix = all_stats_df.drop(columns=['id', 'jersey']).corr()
import matplotlib.pyplot as plt
f = plt.figure(figsize=(19, 15))
plt.matshow(corr_matrix, fignum=f.number)
plt.xticks(range(corr_matrix.shape[1]), corr_matrix.columns, fontsize=14, rotation=45, ha = 'left')
plt.yticks(range(corr_matrix.shape[1]), corr_matrix.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);
We can learn a lot about how different statistics are associated with each other from this matrix, and also identify some interesting trends. For example:
We can narrow in on correlations of interest by sorting the correlation matrix. Let's try sorting by salary and identifying the top correlates:
corr_matrix.sort_values('salary', ascending=False)['salary'].head(10)
As we suspected, points per game (PTS) is most highly correlated with salary, followed by other point-related stats such as free throws made (FTM). Games started (GS) is also highly correlated with salary, which makes sense since highly-paid players are typically better and will be starters.
If we want to model how much more a player costs based on increases in points per game, an easy way is to use linear regression (OLS). To do so, we will use scikit-learn. The LinearRegression() function cannot handle null values, so we will first remove players that don't have reported salary or PTS values:
from sklearn.linear_model import LinearRegression
# remove rows with null values for regression
reg_df = all_stats_df[['salary', 'PTS']].dropna()
Then, we will fit the model with the predictor variable (X) being PTS and the dependent variable (Y) being salary. We will set fit_intercept=False since players cannot be paid less than $0.00 or score less than 0 PTS:
X = reg_df['PTS'].values.reshape(-1,1)
Y = reg_df['salary'].values.reshape(-1,1)
reg = LinearRegression(fit_intercept=False).fit(X,Y)
y_pred = reg.predict(X)
plt.figure(figsize=(12, 6))
plt.scatter(X, Y)
plt.plot(X, y_pred, color='red')
plt.xlabel("Points per game (Career)")
plt.ylabel("Salary (2020)")
plt.title('Salary vs PTS - simple linear regression', fontsize=16);
Consistent with the positive correlation we calculated previously, a regression line with a positive slope is fitted to the data. We can extract the slope of the line by getting the coefficient using .coef_:
print(reg.coef_)
This was only meant to be a demonstration of what could be done with the data that we scraped. Better models can definitely be generated, especially given the nature of the data. Just by looking at the fit above, we can see that the residuals will be heteroskedastic. There are also a small number of players with high career points per game but low salaries in the bottom right corner of the plot which are skewing the regression line.
Taking into account these caveats, the value of the slope is ~947619.16. This suggests that for every unit increase in points per game, the predicted salary paid to a player increases by $947,619.16! Looks like making that free throw really does count.
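As a sanity check on how a no-intercept fit behaves, here is the same setup on synthetic data where the true slope is known (the values are invented, not NBA data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known slope of 3 and no intercept; the
# no-intercept fit should recover a coefficient of approximately 3.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = 3.0 * X
reg = LinearRegression(fit_intercept=False).fit(X, Y)
print(reg.coef_)
```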
Here, I used Python to scrape ESPN for statistics on all the players in the NBA using the urllib and re packages. Then, I used pandas and scikit-learn to organize the data and calculate some summary statistics.
I hope what you've learned from this project will help you out on your own web scraping quests. The techniques that I've outlined here should be broadly applicable for other websites. In general, webpages that link to subpages within the same site will construct their links in some sort of standardized pattern. If so, you can construct URLs for the subpages and loop through them as we have done here. Next time you find yourself flipping through a website and copy-pasting, consider trying to automate the process using Python!