Predicting Gaon Digital Streams through Spotify Audio Features

Marissa He

The Gaon Digital Chart is, essentially, the Billboard of South Korea. As the k-pop industry's standard chart, its rankings are used to decide weekly music show winners and year-end awards, which companies can in turn leverage when negotiating appearance fees or brand endorsements for their artists. Gaon records music ranking, streaming, and album sales data on a weekly, monthly, and yearly basis by aggregating from various music providers and streaming services. While there are obviously several factors that go into a song's popularity, such as artist/company brand recognition or current pop culture trends, I would like to see how well streams can be predicted from attributes of the song itself.

To analyze the songs, I've used Spotify's Audio Features and Spotipy, a Python library for the Spotify Web API. The audio features include standard measurements such as tempo, key, and loudness, but also quantify some qualitative properties, such as danceability and energy. The full list of audio features I will be using is as follows: danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, and tempo.

Data Scraping

Since it is, as of this writing, just after the 51st week of the year, I will be collecting streaming data on this week's top 100 songs (week 51), as well as the number of streams the current top 100 received historically, starting in January 2020 (weeks 1-50). Because the Gaon chart only displays the top 200 songs each week, and some songs have stayed on the chart since before 2020, this will lead to some missing data. As much as I would like to have no missing data, that would be impractical: it takes about ten minutes to scrape a year of Gaon charts, and some songs are very old. 'All I Want For Christmas Is You' by Mariah Carey, for example, is from 1994 and is currently ranked 13th on Gaon (which wasn't even founded until 2010). Thus, I've decided to settle for this sample size.

First, let's collect the Spotify data. The public playlist "Gaon Weekly Digital Chart TOP 100" is updated each week to reflect the Gaon chart, in the same order as the rankings. With Spotipy, we can construct a dataframe of each track's ID, release date, title, artist(s), and album. These column names are denoted with (eng) to indicate that they are Spotify's English versions rather than the original titles.

In [ ]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import datetime 


client_id = "YOUR_CLIENT_ID"          # The client ID and client secret require creating a
client_secret = "YOUR_CLIENT_SECRET"  # Spotify Developer account; fill in your own credentials

client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

# The URI of the playlist, found by right click -> Share -> Copy Spotify URI
URI = "spotify:playlist:65aWjXo7etaSDYMJeDriVf"
# returns a dictionary containing a list of tracks (which are also dictionary type)
playlist = sp.playlist(URI)

# storing each song's information in lists
ids = []
release_dates = []
albums = []
artists = []
titles = []
for item in playlist['tracks']['items']:
    track = item['track']
    ids.append(track['id'])
    release_dates.append(track['album']['release_date'])
    albums.append(track['album']['name'])
    
    # using Gaon's formatting of artist1 , artist2 , ...
    artist = ''    
    for a in track["artists"]:
        artist += a['name']+' , '
    artist = artist[:-3]
    artists.append(artist)
    titles.append(track["name"])

# Spotify's audio features returns a list of dicts containing audio information for the 100 tracks
audio = sp.audio_features(ids)
spotify_df = pd.DataFrame(ids, columns=['id'])
spotify_df['release date'] = pd.to_datetime(release_dates)
spotify_df['Title (eng)'] = titles
spotify_df['artist (eng)'] = artists
spotify_df['album (eng)'] = albums
spotify_df = pd.merge(spotify_df, pd.DataFrame(audio), on=['id'])

spotify_df = spotify_df.drop(columns=["type", "uri", "track_href", "analysis_url"])

Next, let's retrieve the streaming information from the Gaon chart. Columns here are denoted with (kor) to indicate that they are the Korean versions of each song's title, artist name(s), and album.

In [ ]:
# URL
gaon_digital_week = "http://gaonchart.co.kr/main/section/chart/online.gaon?nationGbn=T&serviceGbn=ALL&targetTime="
gaon_digital_weekp2 = "&hitYear=2020&termGbn=week"
agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

# Preparing to write to Excel, to avoid having to re-run the scraper
writer = pd.ExcelWriter('gaon_digital_merge.xlsx')
    
df = pd.DataFrame()

# Starting in week 51 and going backwards, so that this week shows up first in the dataframe (necessary later) 
for week in range(51, 0, -1):
    r = requests.get(gaon_digital_week+str(week)+gaon_digital_weekp2, headers=agent)
    root = BeautifulSoup(r.content, features="lxml")
    
    data = root.find("div", {"class": "chart"})
    data = data.find("table").find_all("tr")
    
    # The full headers in the table are ["ranking", "change", "albumimg", "subject", "count", "production", "share", "play"]
    # Most will be dropped, so I will not store them
    headers = ["Title (kor)", "artist (kor)", "album (kor)", "week", "count"]
    
    digital_week = []    
    for tr in data[1:]:
        t_row = []
        for index, td in zip(range(8), tr.find_all("td")):
            # subject
            if (index==3):
                text = td.text.split("\n")
                t_row.append(text[1])
                [artist, album] = text[2].split("|")
                t_row.append(artist)
                t_row.append(album)
            # count
            if (index==4):
                text = td.text.split("\n")
                text[1] = text[1].replace(",", "")
                t_row.append(week)
                t_row.append(int(text[1]))
                
        digital_week.append(t_row)
    
    df_digital = pd.DataFrame(digital_week, columns=headers)
    df = pd.concat([df, df_digital])
    

Finally, let's inner join the Spotify data and the Gaon data. Since the Spotify playlist only has the week's top 100, an inner join will ensure we only keep songs that are currently top 100.

In [ ]:
# I scraped Gaon's 51st week first, so the first 100 tuples of the Gaon dataframe 
# are in the same order as their corresponding tuple in the Spotify dataframe.
# This gives us something to join on.

spotify_df["Title (kor)"] = df["Title (kor)"][:100]
spotify_df["artist (kor)"] = df["artist (kor)"][:100]
spotify_df["album (kor)"] = df["album (kor)"][:100]

# inner join
df = pd.merge(df, spotify_df, how="inner", on=["Title (kor)", "artist (kor)", "album (kor)"])

# calculating how long it's been since each song's release
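# 2019-12-28 is (I'm assuming) the Saturday before the first Gaon chart week of 2020,
# so base_date + week*7 approximates the end date of chart week `week`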
base_date = datetime.datetime(2019, 12, 28)

since_release = []
for i in df.to_dict("records"):
    since_release.append((base_date+datetime.timedelta(days=i["week"]*7)-i["release date"]).days//7)
df["weeks since release"] = since_release    

# Writing the data to avoid having to re-run the scraper
df.to_excel(writer, index = False)
writer.save()

This produces a dataframe that looks like this:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
import pandas as pd

df = pd.read_excel("gaon_digital_merge.xlsx")
df.head(10)
Out[1]:
Title (kor) artist (kor) album (kor) week count id release date Title (eng) artist (eng) album (eng) ... mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature weeks since release
0 VVS (Feat. JUSTHIS) (Prod. GroovyRoom) 미란이 (Mirani) , 먼치맨 , Khundi Panda , 머쉬베놈 (MUSH... 쇼미더머니 9 Episode 1 51 32311027 7Igh1mqghlHz5dimfQV85G 2020-11-21 VVS Miranni , Munchman , Khundi Panda , MUSHVENOM ... Show Me the Money 9 Episode 1 ... 0 0.1190 0.0683 0.0 0.3250 0.868 140.010 335253 4 4
1 VVS (Feat. JUSTHIS) (Prod. GroovyRoom) 미란이 (Mirani) , 먼치맨 , Khundi Panda , 머쉬베놈 (MUSH... 쇼미더머니 9 Episode 1 50 32311027 7Igh1mqghlHz5dimfQV85G 2020-11-21 VVS Miranni , Munchman , Khundi Panda , MUSHVENOM ... Show Me the Money 9 Episode 1 ... 0 0.1190 0.0683 0.0 0.3250 0.868 140.010 335253 4 3
2 VVS (Feat. JUSTHIS) (Prod. GroovyRoom) 미란이 (Mirani) , 먼치맨 , Khundi Panda , 머쉬베놈 (MUSH... 쇼미더머니 9 Episode 1 49 35951274 7Igh1mqghlHz5dimfQV85G 2020-11-21 VVS Miranni , Munchman , Khundi Panda , MUSHVENOM ... Show Me the Money 9 Episode 1 ... 0 0.1190 0.0683 0.0 0.3250 0.868 140.010 335253 4 2
3 VVS (Feat. JUSTHIS) (Prod. GroovyRoom) 미란이 (Mirani) , 먼치맨 , Khundi Panda , 머쉬베놈 (MUSH... 쇼미더머니 9 Episode 1 48 36788119 7Igh1mqghlHz5dimfQV85G 2020-11-21 VVS Miranni , Munchman , Khundi Panda , MUSHVENOM ... Show Me the Money 9 Episode 1 ... 0 0.1190 0.0683 0.0 0.3250 0.868 140.010 335253 4 1
4 VVS (Feat. JUSTHIS) (Prod. GroovyRoom) 미란이 (Mirani) , 먼치맨 , Khundi Panda , 머쉬베놈 (MUSH... 쇼미더머니 9 Episode 1 47 3820430 7Igh1mqghlHz5dimfQV85G 2020-11-21 VVS Miranni , Munchman , Khundi Panda , MUSHVENOM ... Show Me the Money 9 Episode 1 ... 0 0.1190 0.0683 0.0 0.3250 0.868 140.010 335253 4 0
5 내일이 오면 (Feat. 기리보이, BIG Naughty (서동현)) 릴보이 (lIlBOI) 쇼미더머니 9 Episode 3 51 25894741 7K31QxS2DmTBxdYldd8yqf 2020-12-05 Tomorrow lIlBOI , GIRIBOY , BIG Naughty Show Me The Money 9 Episode 3 ... 1 0.3710 0.3950 0.0 0.0664 0.495 78.283 276153 4 2
6 내일이 오면 (Feat. 기리보이, BIG Naughty (서동현)) 릴보이 (lIlBOI) 쇼미더머니 9 Episode 3 50 25894741 7K31QxS2DmTBxdYldd8yqf 2020-12-05 Tomorrow lIlBOI , GIRIBOY , BIG Naughty Show Me The Money 9 Episode 3 ... 1 0.3710 0.3950 0.0 0.0664 0.495 78.283 276153 4 1
7 내일이 오면 (Feat. 기리보이, BIG Naughty (서동현)) 릴보이 (lIlBOI) 쇼미더머니 9 Episode 3 49 4447833 7K31QxS2DmTBxdYldd8yqf 2020-12-05 Tomorrow lIlBOI , GIRIBOY , BIG Naughty Show Me The Money 9 Episode 3 ... 1 0.3710 0.3950 0.0 0.0664 0.495 78.283 276153 4 0
8 Dynamite 방탄소년단 Dynamite 51 23559793 0v1x6rN6JHRapa03JElljE 2020-08-21 Dynamite BTS Dynamite ... 0 0.0993 0.0112 0.0 0.0936 0.737 114.044 199054 4 17
9 Dynamite 방탄소년단 Dynamite 50 23559793 0v1x6rN6JHRapa03JElljE 2020-08-21 Dynamite BTS Dynamite ... 0 0.0993 0.0112 0.0 0.0936 0.737 114.044 199054 4 16

10 rows × 24 columns

Data Visualization and Analysis

To get an idea of what the number of streams looks like over time, let's plot streams per week for the songs that have stayed on the chart for all of 2020.

In [2]:
plt.figure(figsize=(10,5))
plt.title('Weekly streams vs Weeks since release')
plt.xlabel('Weeks since Song Release')
plt.ylabel('Streaming count')

# Get the songs that show up the most (51 times) in the dataframe
songs = df['Title (eng)'].value_counts()[:18].index.values
for i in songs:
    row = df.loc[df['Title (eng)'] == i]
    plt.plot(row['weeks since release'], row['count'])
plt.legend(songs, bbox_to_anchor=(1, 1))
Out[2]:
<matplotlib.legend.Legend at 0x217519790d0>

From this, it seems that a song's streams right after release can vary wildly, but they all settle at around 10 million per week. Given that these songs are all popular enough to have never dropped off the chart this year, these numbers are probably at the higher end of what you could expect a song to have 50+ weeks after release.
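
As a quick sanity check on that observation, we can look at the median weekly count for these 18 long-charting songs once they are well past release; the 20-week cutoff below is an arbitrary choice, just for illustration:

In [ ]:
# Rough check of the "settles around 10 million per week" observation.
# 'songs' is the list of 18 long-charting titles from the cell above;
# the 20-week cutoff just excludes the volatile weeks right after release.
settled = df[df['Title (eng)'].isin(songs) & (df['weeks since release'] >= 20)]
print(settled['count'].median())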

To determine whether a song's audio features affect how much it is streamed, let's set the null and alternative hypotheses as follows:

Null Hypothesis: A song's audio features do not change how well we can predict its streaming numbers.

Alternative Hypothesis: A song's audio features do change how well we can predict its streaming numbers.

To test this, we need to create predictors for the Null Hypothesis. From the above graph, the relationship between weeks since release and streams is clearly nonlinear, so we'll use Support Vector Regression (SVR). Of the SVR kernel functions, I'll be using the radial basis function (rbf). An rbf-kernel SVR should make better predictions than linear models such as Ridge or Lasso, which we can show below.

But first, the SVR's C and gamma parameters need to be chosen. C is a regularization parameter that trades off the simplicity of the decision function against how much training error is tolerated, and gamma determines how far the influence of a single training example reaches. More information on these rbf parameters can be found in scikit-learn's documentation. The best approach would be to use a grid search to find good values, but since it takes too long to run on the full dataset, I just tried a few values manually and settled on C=10000000000, gamma=0.0001.
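
To make gamma more concrete: the rbf kernel scores the similarity of two rows as exp(-gamma * ||x - x'||^2), so a larger gamma means each training row only influences predictions in a small neighborhood of the feature space. A minimal sketch of that kernel, for illustration only (scikit-learn computes this internally):

In [ ]:
# Illustration only: the kernel used by SVR(kernel='rbf').
# Larger gamma -> similarity decays faster -> each training row has a more local influence.
def rbf_kernel(x, x_prime, gamma):
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))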

In [2]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

parameters = {"kernel": ["rbf"], "C": [1e7, 1e8, 1e9, 1e10], "gamma": [1e-4, 1e-3, 1e-2]}

X = df[['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'weeks since release']].values
Y = df['count'].values

# Code that would be used, but takes too long to run
#clf = GridSearchCV(SVR(), parameters, cv=2, verbose=2)
#clf = clf.fit(X, Y)
#clf.best_params_
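
If you did want to run the grid search, one workaround is to fit it on a random subsample; this is only a sketch, under the assumption that a few hundred rows are representative enough for tuning:

In [ ]:
# Sketch: grid search on a random subsample so it finishes in reasonable time.
# The subsample size of 500 is an arbitrary choice, not something tuned here.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=500, replace=False)
clf = GridSearchCV(SVR(), parameters, cv=2)
clf = clf.fit(X[idx], Y[idx])
print(clf.best_params_)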

Now, let's use these parameters to test our null hypothesis, alongside a linear regression (Lasso) to show why a nonlinear model was necessary.

In [3]:
X = df[['weeks since release']].values
Y = df['count'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

clfs = [Lasso(), SVR(kernel='rbf', C=10000000000, gamma=0.0001)]
for clf in clfs:
    clf = clf.fit(X_train, Y_train)
    predictions = clf.predict(X_test)
    score = r2_score(Y_test, predictions)
    print(str(clf), "Null Hypothesis Score:", score)
Lasso() Null Hypothesis Score: 0.029772999949107204
SVR(C=10000000000, gamma=0.0001) Null Hypothesis Score: 0.1579204813506766

Next, let's run the same regression methods for our alternative hypothesis, which states that including audio features does improve the model:

In [4]:
X = df[['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'weeks since release']].values
Y = df['count'].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

clfs = [Lasso(), SVR(kernel = 'rbf', C=10000000000, gamma=0.0001)]
for clf in clfs:
    clf = clf.fit(X_train, Y_train)
    predictions = clf.predict(X_test)
    score = r2_score(Y_test, predictions)
    print(str(clf), "Alternative Hypothesis Score:", score)
Lasso() Alternative Hypothesis Score: 0.1636216975827961
SVR(C=10000000000, gamma=0.0001) Alternative Hypothesis Score: 0.3968508000948544

The above code displays the r^2 scores of Lasso and SVR. The best possible value of r^2, the coefficient of determination, is 1.00, which would mean 100% of the variation in weekly streams can be explained by the model. From these r^2 scores, we can conclude that SVR does predict better than the linear method Lasso, and that including audio features does result in a much higher r^2 score.
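
For reference, r2_score implements the standard coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares; a quick sketch of the same computation:

In [ ]:
# What r2_score computes: 1 - SS_res / SS_tot.
# 1.0 means the predictions explain all of the variance in the targets; 0.0 means
# the model does no better than always predicting the mean, and it can even go
# negative when the model is worse than that.
def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot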

However, tuples with the same audio features are almost certainly just different weeks of the same song, so there's a possibility that the individual audio features have no effect on the model, only the specific combinations of features that identify songs already seen in training. If we don't allow the model to train on the song it's predicting, the r^2 scores change completely. Take, for example, the 18 songs that have been on the chart for all 51 weeks of the year so far. Let's hold each song out in turn and train only on the rest of the dataset.

In [15]:
songs = df['Title (eng)'].value_counts()[:18].index.values
for i in songs:
    row = df.loc[df['Title (eng)'] == i]
    rownot = df.loc[df['Title (eng)'] != i]
    X = rownot[['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'weeks since release']].values
    Y = rownot['count'].values

    clf = SVR(kernel = 'rbf', C=10000000000, gamma=0.0001)
    clf = clf.fit(X, Y)
    
    predictions = clf.predict(row[['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'weeks since release']].values)
    
    plt.figure()
    plt.title('Weekly streams vs Weeks since release')
    plt.xlabel('Weeks since Song Release')
    plt.ylabel('Streaming count')
    plt.plot(row['weeks since release'], row['count'])
    plt.plot(row['weeks since release'], predictions)
    plt.legend(["Actual", "Predicted"])

As you can see, the predictions are far off from the actual streams, meaning that without training on the song being predicted, even the best-suited model is very bad at predicting streaming numbers. If a song has not been released yet, its audio features would not be a viable way to predict how well it will do, compared to other data like company/artist popularity. What if, then, a song was already on the chart and we wanted to predict how well it will do over the next few weeks?

In [17]:
songs = df['Title (eng)'].value_counts()[:18].index.values
for i in songs:
    row = df.loc[df['Title (eng)'] == i]
    X = row[['weeks since release']].values
    Y = row['count'].values

    clf = SVR(C=10000000000, gamma=0.0001)
    clf.fit(X[10:], Y[10:])
    
    predictions = clf.predict(X)
    
    plt.figure()
    plt.title('Weekly streams vs Weeks since release')
    plt.xlabel('Weeks since Song Release')
    plt.ylabel('Streaming count')
    plt.plot(row['weeks since release'], row['count'])
    plt.plot(row['weeks since release'], predictions)
    plt.legend(["Actual", "Predicted"])

If we train the model on only the first 41 weeks of the year and have it predict the last 10, we get a curve that fits the training data well but only sometimes fits the held-out weeks. The model seems particularly bad at predicting when a peak will end, so it is really only useful for predicting one or two weeks into the future.
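
To put a rough number on that, one could compare the prediction error on the one or two weeks immediately after the training window against the error over all ten held-out weeks. This is only a sketch, relying on the fact that each song's rows run from the most recent week (51) backwards, as scraped above:

In [ ]:
# Sketch: mean absolute percentage error 1-2 weeks ahead vs. up to 10 weeks ahead.
# Since rows per song run from week 51 backwards, X[:10] holds the 10 most recent
# (held-out) weeks, and the last entries of that slice are closest to the training data.
songs = df['Title (eng)'].value_counts()[:18].index.values
short_errs, long_errs = [], []
for i in songs:
    row = df.loc[df['Title (eng)'] == i]
    X = row[['weeks since release']].values
    Y = row['count'].values
    clf = SVR(kernel='rbf', C=10000000000, gamma=0.0001).fit(X[10:], Y[10:])
    ape = np.abs(clf.predict(X[:10]) - Y[:10]) / Y[:10]
    short_errs.append(ape[-2:].mean())  # the two held-out weeks closest to the training window
    long_errs.append(ape.mean())        # all ten held-out weeks
print("Mean error, 1-2 weeks ahead:", np.mean(short_errs))
print("Mean error, up to 10 weeks ahead:", np.mean(long_errs))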

Conclusion

From this analysis, I conclude that a song's audio features are not a good predictor of how well it charts after release. An artist's management company would be better off using other features, like how well the artist's other songs have charted, how well the song fits the season or weather, or whether it will be featured on a popular drama or show. For predicting something like next week's streams, an SVR model can do fairly well, but the further into the future you try to predict, the worse the predictions become. Ultimately, there are just too many factors in why a song's popularity changes on a week-by-week basis, and it may be more viable to predict on a monthly scale instead.

While the answer to the original question, how well audio features can predict a song's weekly streams, seems to be 'not very well', there are still many ways to analyze this dataset. For example, with histograms we can see the general public's preferences for danceability or loudness, which could be used in selecting which song to promote as the title track of an album (a quick sketch follows below). The preferences of the audience may seem fickle, but with further analysis we could perhaps reach a better understanding.
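
As one last example, here is a minimal sketch of that histogram idea, looking at the danceability distribution across the current top 100 (duplicate weekly rows for the same song are dropped first):

In [ ]:
# Sketch: distribution of danceability across the songs currently charting.
unique_songs = df.drop_duplicates(subset=['id'])
plt.figure(figsize=(10, 5))
plt.title('Danceability of songs on the Gaon Digital Top 100')
plt.xlabel('Danceability')
plt.ylabel('Number of songs')
plt.hist(unique_songs['danceability'], bins=20)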

In [ ]: