Introduction

Whether this is the tour the force, a great number or peculiar gadgets or glasses of cold vodka Martni, there are definitely many aspects that constitute a great James Bond movie. Having watched A spy who loved me (1977) recently, I’ve begun to wonder if that could be somehow figured out with the help of machine learning. This is my small summer data science project that aims to predict the average IMDB score for the newest 007 movie No time to die and see what contributes to the score.

More formally, we will be looking at a regression problem for a small dataset (24 items) and training a random forest with cross validation. Then we will examine the break down plot for a previously unseen example - the new movie. I will be using caret package for the model and DALEX for explanation.

1. The data: For your eyes only

data(JamesBond)
df <- JamesBond %>% select(-US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom) %>%
  relocate(Avg_User_IMDB, .after = last_col())

I will be using the James Bond dataset from the HoRM package. See resources
Here we have the data of the previous 24 movies. I’m going to use 12 ouf of the 18 available features, namely: Year, Movie, Bond, World_Adj, Budget_Adj, Film_Length, Conquests, Martinis, BJB, Kills_Bond, Kills_Others, Top_100, Avg_User_IMDB.
To clear out the feature names:
- World_Adj The film’s 2013-adjusted worldwide gross (in 1000’s of U.S. dollars).
- Budget_Adj The film’s 2013-adjusted budget (in 1000’s of U.S. dollars).
- Avg_User_IMDB The average user rating on IMDB (www.imdb.com).
- Conquests The number of “conquests” by Bond in the film.
- Martinis The number of martinis Bond drank in the film.
- BJB The number of times Bond stated “Bond. James Bond.” in the movie.
- Kills_Bond The number of people killed by Bond.
- Kills_Others The number of people killed in the film by people other than Bond.
- Top_100 An indicator where a value of 1 means the title song within the top 100 on the UK Singles Chart and the U.S. Billboard Hot 100 and a value of 0 means it did not.

kable(df)
Year Movie Bond World_Adj Budget_Adj Film_Length Conquests Martinis BJB Kills_Bond Kills_Others Top_100 Avg_User_IMDB
1962 Dr. No Sean Connery 457928 7688 110 3 2 1 4 8 0 7.3
1963 From Russia with Love Sean Connery 598624 15174 115 4 0 0 11 16 0 7.5
1964 Goldfinger Sean Connery 935404 22468 110 2 1 2 9 68 1 7.8
1965 Thunderball Sean Connery 1040693 66333 130 3 0 0 20 90 1 7.0
1967 You Only Live Twice Sean Connery 775740 66035 117 3 1 0 21 175 1 6.9
1969 On Her Majesty’s Secret Service George Lazenby 518736 50608 142 3 1 2 5 37 0 6.8
1971 Diamonds Are Forever Sean Connery 664969 41274 120 1 0 1 7 42 1 6.7
1973 Live and Let Die Roger Moore 846046 36603 121 3 0 1 8 5 1 6.8
1974 The Man with the Golden Gun Roger Moore 459623 32965 125 2 0 2 1 5 0 6.7
1977 The Spy Who Loved Me Roger Moore 710290 53636 125 3 1 1 31 116 1 7.1
1979 Moonraker Roger Moore 672514 99134 126 3 1 1 12 69 0 6.2
1981 For Your Eyes Only Roger Moore 498812 71514 127 2 0 2 18 36 1 6.8
1983 Octopussy Roger Moore 437059 64102 131 2 0 1 15 43 1 6.5
1985 A View to a Kill Roger Moore 329322 64730 131 4 0 2 5 57 1 6.2
1987 The Living Daylights Timothy Dalton 390758 81749 130 2 2 1 13 29 0 6.7
1989 License to Kill Timothy Dalton 292392 78637 133 2 1 1 10 13 0 6.5
1995 Goldeneye Pierce Brosnan 542985 91404 130 2 1 1 47 25 0 7.2
1997 Tomorrow Never Dies Pierce Brosnan 491098 159117 119 3 1 1 30 24 0 6.4
1999 The World Is Not Enough Pierce Brosnan 504091 188130 128 3 1 2 27 43 0 6.3
2002 Die Another Day Pierce Brosnan 557433 183255 133 2 2 1 31 20 1 6.0
2006 Casino Royale Daniel Craig 686784 117465 144 2 3 1 11 11 1 7.9
2008 Quantum of Solace Daniel Craig 638035 248014 106 1 6 0 16 15 1 6.7
2012 Skyfall Daniel Craig 1120980 202240 143 3 1 1 26 26 1 7.8
2015 Spectre Daniel Craig 864553 240803 148 3 1 1 30 205 1 6.8

2. Charts: Live and let die

In this section we will look at some interesting features. Hover over the bullet marker to get extra information about the movie.

plot_df <- JamesBond
plot_df$Movie <- factor (plot_df$Movie , levels = plot_df$Movie) # we want to preserve the dataframe's chronological movie ordering

m <- list(
  l = 100,
  r = 50,
  b = 80,
  t = 150,
  pad = 2
)

fig <- plot_ly(plot_df, x = ~Year, y = ~Martinis, type= 'scatter', mode= 'lines+markers',
               line = list(color = 'rgba(49,130,189, 1)', width = 4),
               marker = list(color = 'rgba(49,130,189, 1)', size = 12),
               hoverinfo = 'text',
               text = ~paste(Movie, Year)) %>%
  layout(title = "Martini drinks Bond consumes in the movie",
         font=list(size = 20),
         margin = m,
         xaxis = list(
           showgrid = FALSE,
           zeroline = FALSE,
           tickfont = list(size = 16)
         ),
         yaxis = list(
           showgrid = FALSE,
           zeroline = TRUE 
         ))
fig
plot_df <- JamesBond
plot_df$Movie <- factor (plot_df$Movie , levels = plot_df$Movie)

fig2 <- plot_ly(plot_df, x = ~Year, y = ~Kills_Bond, name = 'By Bond', type= 'scatter', mode= 'lines+markers',
               line = list(color = 'rgba(166,4,4,1)', width = 4),
               marker = list(color = 'rgba(166,4,4,1)', size = 12),
               hoverinfo = 'text',
               text = ~paste(Movie, Year, "<br>Kills:", Kills_Bond)) %>%
  add_trace(y = ~Kills_Others,  type = 'scatter', mode= 'lines+markers',
               name = "By Others",
               line = list(color = 'rgba(192,172,182,1)', width = 4),
               marker = list(color = 'rgba(192,172,182,1)', size = 12),
               hoverinfo = 'text',
               text = ~paste(Movie, Year, "<br>Kills:", Kills_Others)) %>%
  layout(title = "Kills in 007 movies",
         font=list(size = 20),
         margin = m,
         xaxis = list(
           showgrid = FALSE,
           zeroline = FALSE,
           tickfont = list(size = 16)
         ),
         yaxis = list(
           title.text = '',
           showgrid = FALSE,
           zeroline = TRUE
         ))
fig2

The most correlated feature with the average IMDB score is the World gross, which should be expected. It is interesting to note that Kills, Film Length and Budget are growing in the newer movies.

fig_corr <- JamesBond %>% select(-US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom, -Movie, -Bond) %>%
  relocate(Avg_User_IMDB, .after = last_col()) %>%
  cor() %>% 
  ggcorrplot(method = 'square', type = 'upper', outline.color = 'white', lab_size = 8,
             title= "Feature Correlations",
             legend.title = "Corelation",
             tl.cex = 18,
             lab=TRUE) +
  theme( text = element_text(size = 22))

fig_corr

3. Regression: The world is not enough

It is time to train our regressor. We will use random forest with cross-validation. Next, we will create a DALEX explainer to examine the Feature Importance.

data(JamesBond)
df <- JamesBond %>% select(-Movie, -US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom) %>%
  relocate(Avg_User_IMDB, .after = last_col())

train_control <- trainControl(method = "cv", number = 8, savePredictions = "all")

model_rf <- train(Avg_User_IMDB~. , data = df, method = "rf", ntree = 100, trControl = train_control)

explainer_rf <- DALEX::explain(model = model_rf,
                              label = "rf",
                              verbose = FALSE,
                              y = df$Avg_User_IMDB
                              )

print(model_performance(explainer_rf))
## Measures for:  regression
## mse        : 0.05157403 
## rmse       : 0.2270992 
## r2         : 0.7990078 
## mad        : 0.1360167
## 
## Residuals:
##          0%         10%         20%         30%         40%         50% 
## -0.38593333 -0.30377500 -0.19552000 -0.12311000 -0.07096667 -0.03891667 
##         60%         70%         80%         90%        100% 
## -0.01014000  0.07145333  0.13466333  0.31706333  0.49035000
plot(model_parts(explainer_rf, loss_function = loss_root_mean_square)) + theme(text = element_text(size = 20))

4. Break down: No time to die

I found the necessary statistics from No time to die (2021). As some were hard to find, I assumed the values of Martinis and BJBs. The gross and budget were adjusted to 2013 for consistency:

no_time_to_die <- data.frame(2021, "Daniel Craig", 608510000, 214930000, 163, 0, 1, 1, 66, 109, 1, 0)
colnames(no_time_to_die) <- colnames(df)
df <- rbind(df, no_time_to_die)

kable(no_time_to_die)
Year Bond World_Adj Budget_Adj Film_Length Conquests Martinis BJB Kills_Bond Kills_Others Top_100 Avg_User_IMDB
2021 Daniel Craig 608510000 214930000 163 0 1 1 66 109 1 0
predict(explainer_rf, tail(df, n=1))
##       25 
## 7.347983
pb = predict_parts(explainer = explainer_rf, tail(df, n=1), type= "break_down")
plot(pb)

Let’s examine the contribution of each feature of our example, the 25th 007 movie. What influances the final score the most is the world gross (+), year (+), length (+) and budget (-). My favorite features, introductions and Martinis, are unfortunately not as significant, but they are an advantage (who would have thought!).

We are very close to the true average IMDB rating! (https://www.imdb.com/title/tt2382320/, August 2022).

Remark: This is just a toy example with a small dataset and it should be further tested on more examples to asses the model’s quality. Though it is definitely not enough to consider the rating of a movie based on the number of drinks consumed or main character’s introductions (even if this is Bond, James Bond), I hope this project offers quality entertainment and pleasing visualizations.

From Poland with love,
Kacper Dobek
https://github.com/kapiblue

Resources

Inspiration: https://betaandbit.github.io/RML/

James Bond R dataset https://search.r-project.org/CRAN/refmans/HoRM/html/JamesBond.html

How to use DALEX with caret https://htmlpreview.github.io/?https://github.com/ModelOriented/DALEX-docs/blob/master/vignettes/DALEX_caret.html

R Markdown Cookbook https://bookdown.org/yihui/rmarkdown-cookbook/

EMA book https://ema.drwhy.ai/

Daniel Craig’s retrospective https://youtu.be/2oZdJrph3RA

---
title: "WHAT MAKES A GREAT JAMES BOND MOVIE?"
author: "Kacper Dobek"
date: "2022-08-23"
output: 
    html_document:
        css: "resources/style.css"
        code_folding: hide
        code_download: true
        toc: true
        out_width: 80%
        
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, include = TRUE, cache = TRUE)
library(HoRM)
library(dplyr)
library(knitr)
library(DALEX)
library(caret)
library(ggplot2)
library(ggcorrplot)
library(plotly)
set.seed(007)
```

```{r img, echo=FALSE}
knitr::include_graphics('resources/007.png')
```

## Introduction

> Whether this is the tour the force, a great number or peculiar gadgets or glasses of cold vodka Martni, there are definitely many aspects that constitute a great James Bond movie. Having watched *A spy who loved me (1977)* recently, I've begun to wonder if that could be somehow figured out with the help of machine learning. This is my small summer data science project that aims to predict the average IMDB score for the newest 007 movie *No time to die* and see what contributes to the score.

> More formally, we will be looking at a regression problem for a small dataset (24 items) and training a random forest with cross validation. Then we will examine the break down plot for a previously unseen example - the new movie. I will be using *caret* package for the model and *DALEX* for explanation.

## 1. The data: For your eyes only

```{r dataprep}
data(JamesBond)
df <- JamesBond %>% select(-US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom) %>%
  relocate(Avg_User_IMDB, .after = last_col())
```

I will be using the James Bond dataset from the HoRM package. [See resources](#resources)\
Here we have the data of the previous 24 movies. I'm going to use 12 ouf of the 18 available features, namely: `r colnames(df)`.\
To clear out the feature names:\
- **World_Adj** The film's 2013-adjusted worldwide gross (in 1000's of U.S. dollars).\
- **Budget_Adj** The film's 2013-adjusted budget (in 1000's of U.S. dollars).\
- **Avg_User_IMDB** The average user rating on IMDB (www.imdb.com).\
- **Conquests** The number of "conquests" by Bond in the film.\
- **Martinis** The number of martinis Bond drank in the film.\
- **BJB** The number of times Bond stated "Bond. James Bond." in the movie.\
- **Kills_Bond** The number of people killed by Bond.\
- **Kills_Others** The number of people killed in the film by people other than Bond.\
- **Top_100** An indicator where a value of 1 means the title song within the top 100 on the UK Singles Chart and the U.S. Billboard Hot 100 and a value of 0 means it did not.\



```{r table, fig.width = 12}
kable(df)
```

## 2. Charts: Live and let die

In this section we will look at some interesting features. Hover over the bullet marker to get extra information about the movie.

```{r martinis, fig.width= 12, fig.height= 7}
plot_df <- JamesBond
plot_df$Movie <- factor (plot_df$Movie , levels = plot_df$Movie) # we want to preserve the dataframe's chronological movie ordering

m <- list(
  l = 100,
  r = 50,
  b = 80,
  t = 150,
  pad = 2
)

fig <- plot_ly(plot_df, x = ~Year, y = ~Martinis, type= 'scatter', mode= 'lines+markers',
               line = list(color = 'rgba(49,130,189, 1)', width = 4),
               marker = list(color = 'rgba(49,130,189, 1)', size = 12),
               hoverinfo = 'text',
               text = ~paste(Movie, Year)) %>%
  layout(title = "Martini drinks Bond consumes in the movie",
         font=list(size = 20),
         margin = m,
         xaxis = list(
           showgrid = FALSE,
           zeroline = FALSE,
           tickfont = list(size = 16)
         ),
         yaxis = list(
           showgrid = FALSE,
           zeroline = TRUE 
         ))
fig
```

```{r kills, fig.width= 12, fig.height= 8}

plot_df <- JamesBond
plot_df$Movie <- factor (plot_df$Movie , levels = plot_df$Movie)

fig2 <- plot_ly(plot_df, x = ~Year, y = ~Kills_Bond, name = 'By Bond', type= 'scatter', mode= 'lines+markers',
               line = list(color = 'rgba(166,4,4,1)', width = 4),
               marker = list(color = 'rgba(166,4,4,1)', size = 12),
               hoverinfo = 'text',
               text = ~paste(Movie, Year, "<br>Kills:", Kills_Bond)) %>%
  add_trace(y = ~Kills_Others,  type = 'scatter', mode= 'lines+markers',
               name = "By Others",
               line = list(color = 'rgba(192,172,182,1)', width = 4),
               marker = list(color = 'rgba(192,172,182,1)', size = 12),
               hoverinfo = 'text',
               text = ~paste(Movie, Year, "<br>Kills:", Kills_Others)) %>%
  layout(title = "Kills in 007 movies",
         font=list(size = 20),
         margin = m,
         xaxis = list(
           showgrid = FALSE,
           zeroline = FALSE,
           tickfont = list(size = 16)
         ),
         yaxis = list(
           title.text = '',
           showgrid = FALSE,
           zeroline = TRUE
         ))
fig2
```
The most correlated feature with the average IMDB score is the World gross, which should be expected. It is interesting to note that Kills, Film Length and Budget are growing in the newer movies. 

```{r corr, fig.width = 18, fig.height = 12, warning=FALSE}

fig_corr <- JamesBond %>% select(-US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom, -Movie, -Bond) %>%
  relocate(Avg_User_IMDB, .after = last_col()) %>%
  cor() %>% 
  ggcorrplot(method = 'square', type = 'upper', outline.color = 'white', lab_size = 8,
             title= "Feature Correlations",
             legend.title = "Corelation",
             tl.cex = 18,
             lab=TRUE) +
  theme( text = element_text(size = 22))

fig_corr
```

## 3. Regression: The world is not enough

It is time to train our regressor. We will use random forest with cross-validation. Next, we will create a DALEX explainer to examine the Feature Importance.

```{r caret, fig.width = 10, fig.height= 7, out.width='100%'}

data(JamesBond)
df <- JamesBond %>% select(-Movie, -US_Gross, -World_Gross, -US_Adj, -Budget, -Avg_User_Rtn_Tom) %>%
  relocate(Avg_User_IMDB, .after = last_col())

train_control <- trainControl(method = "cv", number = 8, savePredictions = "all")

model_rf <- train(Avg_User_IMDB~. , data = df, method = "rf", ntree = 100, trControl = train_control)

explainer_rf <- DALEX::explain(model = model_rf,
                              label = "rf",
                              verbose = FALSE,
                              y = df$Avg_User_IMDB
                              )

print(model_performance(explainer_rf))


plot(model_parts(explainer_rf, loss_function = loss_root_mean_square)) + theme(text = element_text(size = 20))
```

## 4. Break down: No time to die 

I found the necessary statistics from *No time to die (2021)*. As some were hard to find, I assumed the values of Martinis and BJBs. The gross and budget were adjusted to 2013 for consistency:

```{r breakdown, fig.width = 10, fig.height= 7, out.width='100%'}
no_time_to_die <- data.frame(2021, "Daniel Craig", 608510000, 214930000, 163, 0, 1, 1, 66, 109, 1, 0)
colnames(no_time_to_die) <- colnames(df)
df <- rbind(df, no_time_to_die)

kable(no_time_to_die)

predict(explainer_rf, tail(df, n=1))

pb = predict_parts(explainer = explainer_rf, tail(df, n=1), type= "break_down")
plot(pb)
```

Let's examine the contribution of each feature of our example, the 25th 007 movie. What influances the final score the most is the world gross (+), year (+), length (+) and budget (-). My favorite features, introductions and Martinis, are unfortunately not as significant, but they are an advantage (who would have thought!).

>  We are very close to the true average IMDB rating! (https://www.imdb.com/title/tt2382320/, August 2022). 

**Remark:** This is just a toy example with a small dataset and it should be further tested on more examples to asses the model's quality. Though it is definitely not enough to consider the rating of a movie based on the number of drinks consumed or main character's introductions (even if this is Bond, James Bond), I hope this project offers quality entertainment and pleasing visualizations.

From Poland with love,\
Kacper Dobek\
https://github.com/kapiblue

## Resources

Inspiration: https://betaandbit.github.io/RML/

James Bond R dataset https://search.r-project.org/CRAN/refmans/HoRM/html/JamesBond.html

How to use DALEX with caret https://htmlpreview.github.io/?https://github.com/ModelOriented/DALEX-docs/blob/master/vignettes/DALEX_caret.html

R Markdown Cookbook https://bookdown.org/yihui/rmarkdown-cookbook/

EMA book https://ema.drwhy.ai/

Daniel Craig's retrospective https://youtu.be/2oZdJrph3RA

