Star Wars: A data exploration

Before the release of “Star Wars: The Force Awakens”, the team at FiveThirtyEight wanted to answer some questions about the Star Wars franchise. In particular, they wanted to know: which movie is the best in the franchise?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.

For this post, we will explore the data set in a Jupyter notebook. The following code reads the data into a pandas DataFrame.

# Import the required modules
import pandas as pd  
import numpy as np

# Read data into python
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")

We need to specify an encoding because the file contains some characters stored as bytes that are not valid UTF-8, the encoding pandas.read_csv() assumes by default. You can read more about character encodings on developer Joel Spolsky’s blog.
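To make the encoding issue concrete, here is a minimal sketch that uses a made-up string rather than the survey file. It shows that a character saved as ISO-8859-1 can fail to decode as UTF-8, which is the kind of error the encoding argument avoids.

```python
# Illustrative only: "café" is a made-up sample, not text from the survey file.
raw = "café".encode("ISO-8859-1")    # the "é" is stored as the single byte 0xE9

try:
    raw.decode("utf-8")               # a lone 0xE9 is not a valid UTF-8 sequence
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

decoded = raw.decode("ISO-8859-1")    # decoding with the matching encoding works
print(utf8_ok, decoded)
```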

star_wars.head(10) # Exploring the first ten rows of the dataframe
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 nan Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda Response Response Response Response Response Response Response Response Response
1 3.29288e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 2 1 4 5 6 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don’t understand this question Yes No No Male 18-29 nan High school degree South Atlantic
2 3.29288e+09 No nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.29277e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith nan nan nan 1 2 3 4 5 6 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don’t understand this question No nan No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.29276e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 6 1 2 4 3 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don’t understand this question No nan Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.29273e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 4 6 2 1 3 Very favorably Somewhat favorably Somewhat favorably Somewhat unfavorably Very favorably Very unfavorably Somewhat favorably Neither favorably nor unfavorably (neutral) Very favorably Somewhat favorably Somewhat favorably Very unfavorably Somewhat favorably Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
6 3.29272e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 1 4 3 6 5 2 Very favorably Very favorably Very favorably Very favorably Very favorably Neither favorably nor unfavorably (neutral) Very favorably Neither favorably nor unfavorably (neutral) Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Neither favorably nor unfavorably (neutral) Very favorably Han Yes No Yes Male 18-29 $25,000 - $49,999 Bachelor degree Middle Atlantic
7 3.29268e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 6 5 4 3 1 2 Very favorably Very favorably Somewhat favorably Somewhat favorably Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat unfavorably Somewhat favorably Very favorably Han Yes No No Male 18-29 nan High school degree East North Central
8 3.29266e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4 5 6 3 2 1 Very favorably Somewhat favorably Very favorably Neither favorably nor unfavorably (neutral) Very favorably Very unfavorably Somewhat unfavorably Neither favorably nor unfavorably (neutral) Somewhat favorably Somewhat favorably Somewhat favorably Very unfavorably Somewhat unfavorably Very favorably Han No nan Yes Male 18-29 nan High school degree South Atlantic
9 3.29265e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 4 6 2 1 3 Very favorably Somewhat unfavorably Somewhat favorably Somewhat favorably Somewhat favorably Very favorably Very favorably Very favorably Very favorably Neither favorably nor unfavorably (neutral) Somewhat favorably Very unfavorably Somewhat unfavorably Somewhat favorably Han No nan No Male 18-29 $0 - $24,999 Some college or Associate degree South Atlantic

The data has several columns including:

  • Respondent ID - Anonymized ID for the respondent
  • Gender - The respondent’s gender
  • Age - The respondent’s age
  • Household Income - The respondent’s income
  • Education - The respondent’s education level
  • Location (Census Region) - The respondent’s location
  • Have you seen any of the 6 films in the Star Wars franchise? - Has a Yes or No response
  • Do you consider yourself to be a fan of the Star Wars film franchise? - Has a Yes or No response

There are several other columns containing answers to questions about the Star Wars movies. For some questions, such as “Which of the following Star Wars films have you seen? Please select all that apply.”, respondents had to check multiple boxes. This type of data is difficult to represent in a typical columnar format. As a result, this data set needs a lot of cleaning.

The first thing we will do is remove any rows where RespondentID is NaN. To accomplish this, we will use the pandas.notnull() function.

star_wars = star_wars[pd.notnull(star_wars['RespondentID'])]
star_wars.head(10)
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
1 3.29288e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 2 1 4 5 6 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don’t understand this question Yes No No Male 18-29 nan High school degree South Atlantic
2 3.29288e+09 No nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.29277e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith nan nan nan 1 2 3 4 5 6 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don’t understand this question No nan No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.29276e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 6 1 2 4 3 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don’t understand this question No nan Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.29273e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 4 6 2 1 3 Very favorably Somewhat favorably Somewhat favorably Somewhat unfavorably Very favorably Very unfavorably Somewhat favorably Neither favorably nor unfavorably (neutral) Very favorably Somewhat favorably Somewhat favorably Very unfavorably Somewhat favorably Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
6 3.29272e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 1 4 3 6 5 2 Very favorably Very favorably Very favorably Very favorably Very favorably Neither favorably nor unfavorably (neutral) Very favorably Neither favorably nor unfavorably (neutral) Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Neither favorably nor unfavorably (neutral) Very favorably Han Yes No Yes Male 18-29 $25,000 - $49,999 Bachelor degree Middle Atlantic
7 3.29268e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 6 5 4 3 1 2 Very favorably Very favorably Somewhat favorably Somewhat favorably Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat unfavorably Somewhat favorably Very favorably Han Yes No No Male 18-29 nan High school degree East North Central
8 3.29266e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4 5 6 3 2 1 Very favorably Somewhat favorably Very favorably Neither favorably nor unfavorably (neutral) Very favorably Very unfavorably Somewhat unfavorably Neither favorably nor unfavorably (neutral) Somewhat favorably Somewhat favorably Somewhat favorably Very unfavorably Somewhat unfavorably Very favorably Han No nan Yes Male 18-29 nan High school degree South Atlantic
9 3.29265e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 4 6 2 1 3 Very favorably Somewhat unfavorably Somewhat favorably Somewhat favorably Somewhat favorably Very favorably Very favorably Very favorably Very favorably Neither favorably nor unfavorably (neutral) Somewhat favorably Very unfavorably Somewhat unfavorably Somewhat favorably Han No nan No Male 18-29 $0 - $24,999 Some college or Associate degree South Atlantic
10 3.29264e+09 Yes No nan Star Wars: Episode II Attack of the Clones nan nan nan nan 1 2 3 4 5 6 Neither favorably nor unfavorably (neutral) Very favorably Very favorably Very favorably Very favorably Somewhat unfavorably Very favorably Somewhat unfavorably Somewhat unfavorably Very favorably Very favorably Very favorably Somewhat unfavorably Very favorably I don’t understand this question No nan No Male 18-29 $25,000 - $49,999 Some college or Associate degree Pacific
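The row filter used above can be tried on a small, made-up frame. This is only a sketch: the toy values are hypothetical, and dropna(subset=...) is shown as an equivalent alternative, not as the approach used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first row mimics the sub-header row that has no ID
toy = pd.DataFrame({
    "RespondentID": [np.nan, 101.0, 102.0],
    "Gender": ["Response", "Male", "Female"],
})

# Boolean mask: True where RespondentID is present
kept = toy[pd.notnull(toy["RespondentID"])]

# Equivalent alternative: drop rows that are missing a value in one column
kept_alt = toy.dropna(subset=["RespondentID"])

print(kept)
```

Both versions keep the original index labels, which is why the cleaned survey data starts at row 1 instead of row 0.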

Let’s have a look at the following two columns in the dataset:

  • Have you seen any of the 6 films in the Star Wars franchise?
  • Do you consider yourself to be a fan of the Star Wars film franchise?

Both represent Yes/No questions. They can also be NaN when a respondent chooses not to answer a question. To see all of the unique values in a column, along with the number of times each value appears, we can use pandas.Series.value_counts() (passing dropna=False so that NaN is counted as well).
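As a quick illustration of counting values, here is a sketch on a made-up Series. The answers below are hypothetical and are not the survey responses.

```python
import numpy as np
import pandas as pd

# Hypothetical answers, not the real survey responses
answers = pd.Series(["Yes", "No", "Yes", np.nan, "Yes"])

# By default NaN is left out of the counts
counts_default = answers.value_counts()

# dropna=False adds a separate entry for the missing answers
counts = answers.value_counts(dropna=False)
print(counts)
```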

Both columns are currently string type, because the main values they contain are Yes and No. We can make the data a bit easier to analyze down the road by converting each column to a Boolean having only the values True, False and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison.

We can use the pandas.Series.map() method on series objects to perform the conversion.

# A dictionary to define the mapping from each value in the series to a new value

yes_no = {
    "Yes": True,
    "No": False
}

# Convert both the columns mentioned above to boolean type using series.map function

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no) 

star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no) 
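To see what the conversion does with missing answers, here is a minimal sketch on a made-up column. The values are hypothetical; the point is that anything without a key in the dictionary (here NaN) stays NaN.

```python
import numpy as np
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Hypothetical column with one missing answer
seen_any = pd.Series(["Yes", "No", np.nan, "Yes"])
mapped = seen_any.map(yes_no)

# Values that have no key in the dictionary (here NaN) remain NaN
n_missing = mapped.isna().sum()

# Select the "Yes" rows without comparing strings
yes_rows = mapped[mapped == True]
print(mapped.tolist(), len(yes_rows))
```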

The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.

The columns for this question are:

  • Which of the following Star Wars films have you seen? Please select all that apply. - Whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
  • Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
  • Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
  • Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope.
  • Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
  • Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn’t answer or didn’t see the movie. We’ll assume that they didn’t see the movie.

We’ll need to convert each of these columns to a Boolean, then rename the column to something more intuitive. We can convert the values the same way we did earlier, except that we’ll need to include the movie title and NaN in the mapping dictionary.

# Mapping dictionary
movie_mapping = {
    "Star Wars: Episode I  The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II  Attack of the Clones": True,
    "Star Wars: Episode III  Revenge of the Sith": True,
    "Star Wars: Episode IV  A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True
}

# Converting the values in a loop using the mapping dictionary
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)
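The effect of mapping with a NaN key can be checked on a single made-up column. In this sketch the answers are hypothetical, and the notna() variant is an alternative shown for comparison, not the method used in this post.

```python
import numpy as np
import pandas as pd

title = "Star Wars: Episode V The Empire Strikes Back"

# Hypothetical answers for one checkbox column
col = pd.Series([title, np.nan, title])

# Same idea as the mapping dictionary: title -> True, NaN -> False
via_map = col.map({title: True, np.nan: False})

# Alternative (an assumption, not the approach in this post):
# treat any non-missing cell as "seen"
via_notna = col.notna()

print(via_map.tolist(), via_notna.tolist())
```

The notna() version does not depend on the exact spelling or spacing of the movie titles, which may be worth considering if the titles in the file ever differ from the dictionary keys.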

After calling the map() function, we need to rename the columns to better reflect what they represent. To accomplish this we will use the pandas.DataFrame.rename() method. The df.rename() method works a lot like map(). We pass it a dictionary that maps the current column names to new ones:

star_wars = star_wars.rename(columns={
    'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
    'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4',
    'Unnamed: 7': 'seen_5', 'Unnamed: 8': 'seen_6'
})

The pandas.DataFrame.rename() method will only rename the columns we specify in the dictionary, and won’t change the names of other columns. The code above will rename the Which of the following Star Wars films have you seen? Please select all that apply. column to seen_1 and Unnamed: 4 to seen_2 and so on.
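As a small sketch of this behaviour, the made-up frame below has one column we leave alone and two we rename. Building the dictionary with a comprehension is just one possible way to avoid typing each pair by hand; the frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the same kind of unhelpful column names
toy = pd.DataFrame({"RespondentID": [1], "Unnamed: 4": [True], "Unnamed: 5": [False]})

# Build the mapping programmatically: 'Unnamed: 4' -> 'seen_2', 'Unnamed: 5' -> 'seen_3'
new_names = {f"Unnamed: {i}": f"seen_{i - 2}" for i in range(4, 6)}

# rename returns a new frame; columns not in the dictionary keep their names
renamed = toy.rename(columns=new_names)
print(list(renamed.columns))
```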

The next six columns ask the respondent to rank the Star Wars movies from most favorite to least favorite. 1 means the film was the respondent’s favorite, and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:

  • Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace.
  • Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones.
  • Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith.
  • Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope.
  • Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back.
  • Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi.

Fortunately, these columns don’t require a lot of cleanup. We’ll need to convert each column to a numeric type, though, then rename the columns so that we can tell what they represent more easily.

We can do the numeric conversions using pandas.DataFrame.astype() on dataframes.

star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

The code above converts columns 9 through 14 (up to but not including column 15) to the float data type. We will also give each column a more descriptive name that better reflects what it represents.

star_wars = star_wars.rename(columns = {'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'ranking_1', 'Unnamed: 10' : 'ranking_2', 'Unnamed: 11' : 'ranking_3', 'Unnamed: 12' : 'ranking_4', 'Unnamed: 13' : 'ranking_5', 'Unnamed: 14' : 'ranking_6'})
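The numeric conversion step can be checked on a small made-up frame. This is a sketch with hypothetical values; it assumes the rankings arrive as text and that missing answers are NaN, as in the output shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical rankings read in as text, with one missing answer
toy = pd.DataFrame({
    "ranking_1": ["3", "1", np.nan],
    "ranking_2": ["2", "6", "4"],
})

# astype(float) parses the digit strings and keeps NaN as NaN
converted = toy.astype(float)
print(converted.dtypes)
```

Float is a convenient target type here because it can hold NaN, so the missing rankings survive the conversion.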

Now that we’ve cleaned up the ranking columns, we can find the highest-ranked movie more quickly. To do this, we take the mean of each ranking column using the pandas.DataFrame.mean() method and assign the result to rank. Because 1 is the best possible rank, the movie with the lowest mean is the most liked.

rank = star_wars[star_wars.columns[9:15]].mean()
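To illustrate how the mean handles missing rankings, here is a minimal sketch with made-up numbers. The names toy_rank and best are hypothetical and only used for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical rankings from three respondents (NaN = no answer)
toy = pd.DataFrame({
    "ranking_1": [3.0, 2.0, np.nan],
    "ranking_2": [1.0, 1.0, 2.0],
})

toy_rank = toy.mean()      # NaN values are skipped by default
best = toy_rank.idxmin()   # the lowest mean rank is the most liked movie
print(toy_rank, best)
```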

We will visualize the ranking of each movie using a bar chart. To plot this graph, we will use an open source graphing library called plotly. The link provides a brief description of plotly and how to install it for Jupyter notebook.

# Loading the required module.
import plotly.graph_objects as go

# Set the colors of the bars
colors = ['lightslategray',] * 6

# Highlight the bar that has the lowest ranking (or is the most well liked)
colors[np.argmin(rank.values)] = 'crimson'


fig = go.Figure(data=[go.Bar(
    x=rank.index.tolist(),  # Set the x-axis labels
    y=np.around(rank.values, 1),  # Set the y-axis values, rounded to one decimal place

    marker_color=colors # marker color can be a single color value or an iterable
)], layout=go.Layout(
        xaxis=dict(showgrid=False))) # Remove the vertical grids
# Set the title for the plot
fig.update_layout(title_text='Ranking of Star Wars Movies')

# Set the number of ticks on the y-axis
fig.update_yaxes(nticks=5)
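The highlighting trick used for the bar colors can be checked without drawing anything. The sketch below uses made-up mean ranks (not the survey results) and the hypothetical names toy_values and toy_colors.

```python
import numpy as np

# Hypothetical mean ranks for six movies (lower is better)
toy_values = np.array([3.7, 4.1, 4.3, 3.3, 2.5, 3.0])

# Start with one neutral color per bar, then recolor the best-ranked bar
toy_colors = ['lightslategray'] * len(toy_values)
toy_colors[np.argmin(toy_values)] = 'crimson'   # position of the smallest value
print(toy_colors)
```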

So far, we’ve cleaned up the data, renamed several columns, and computed the average ranking of each movie. As we suspected, it looks like the “original” movies are rated much more highly than the newer ones. Episode V (ranking_5) has the lowest mean rank, which makes it the most highly ranked movie in the franchise.

Earlier in this project, we cleaned up the seen columns and converted their values to the Boolean type. Methods like pandas.DataFrame.sum() and pandas.DataFrame.mean() treat Booleans like integers: True counts as 1 and False as 0. That means we can figure out how many people have seen each movie just by taking the sum of the column. As with the rankings, let us compute the sum of each seen column and assign it to most_seen.
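Before applying this to the survey data, the Boolean arithmetic can be checked on a small made-up frame. The flags below are hypothetical; views and share are names used only for this sketch.

```python
import pandas as pd

# Hypothetical "seen" flags for four respondents
toy = pd.DataFrame({
    "seen_1": [True, False, True, False],
    "seen_5": [True, True, True, False],
})

views = toy.sum()    # True counts as 1, False as 0
share = toy.mean()   # fraction of respondents who saw each movie
print(views, share)
```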

most_seen = star_wars[star_wars.columns[3:9]].sum()
# Set the colors of the bars
colors = ['lightslategray',] * 6

# Highlight the bar for the movie seen by the most respondents
colors[np.argmax(most_seen.values)] = 'crimson'


fig = go.Figure(data=[go.Bar(
    x=most_seen.index.tolist(),  # Set the x-axis labels
    y=most_seen.values,  # Set the y-axis values (number of respondents who saw each movie)

    marker_color=colors # marker color can be a single color value or an iterable
)], layout=go.Layout(
        xaxis=dict(showgrid=False))) # Remove the vertical grids
# Set the title for the plot
fig.update_layout(title_text='Views of each movie in the Star Wars Franchise')

# Set the number of ticks on the y-axis
fig.update_yaxes(nticks=5)

It appears that the original movies were seen by more respondents than the newer movies in the franchise. This reinforces what we saw in the rankings, where the original movies seem to be more popular. Next, we will split the dataframe into two groups based on a binary column such as Gender by creating one subset for each value of that column.

males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]

The subsets will allow us to compute the most viewed movie, the highest-ranked movie, and other statistics separately for each group.
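As an aside, the same kind of per-group statistics can be sketched with groupby on made-up data. This is an alternative shown for illustration, not the approach used in this post, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents, not the survey data
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "ranking_1": [4.0, 2.0, 6.0, 3.0, 1.0],
    "seen_1": [True, True, False, True, True],
})

# groupby leaves out rows whose key is NaN by default,
# just as the == "Male" / == "Female" filters do
by_gender = toy.groupby("Gender")
rank_by_gender = by_gender["ranking_1"].mean()
seen_by_gender = by_gender["seen_1"].sum()
print(rank_by_gender, seen_by_gender)
```

Explicit subsets, as used in this post, are easier to inspect one at a time, while groupby scales more naturally to columns with many categories.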

males_rank = males[males.columns[9:15]].mean()
females_rank = females[females.columns[9:15]].mean()
males_most_seen = males[males.columns[3:9]].sum()
females_most_seen = females[females.columns[3:9]].sum()

fig = go.Figure(data=[
    go.Bar(name='Male', x=males_rank.index.tolist(), y=np.around(males_rank.values, 1)),
    go.Bar(name='Female', x=females_rank.index.tolist(), y=np.around(females_rank.values, 1))
], layout=go.Layout(
        xaxis=dict(showgrid=False),
        yaxis = dict(showgrid=False)))

fig.update_layout(title_text='Ranking of Star Wars Movies')
# Change the bar mode
fig.update_layout(barmode='group')
fig.update_yaxes(nticks=5)
fig.show()
fig = go.Figure(data=[
    go.Bar(name='Male', x=males_most_seen.index.tolist(), y=males_most_seen.values),
    go.Bar(name='Female', x=females_most_seen.index.tolist(), y=females_most_seen.values)
], layout=go.Layout(
        xaxis=dict(showgrid=False),
        yaxis = dict(showgrid=False)))

fig.update_layout(title_text='Views of each movie in the Star Wars Franchise')
# Change the bar mode
fig.update_layout(barmode='group')
fig.update_yaxes(nticks=5)
fig.show()

Interestingly, more men than women watched Episodes 1-3, but men liked them far less than women did. We could also segment this dataframe on other columns to find interesting patterns. Another path to explore would be to find out which characters the respondents liked or disliked the most. However, I will save these tasks for the future.

Amol Kulkarni
Ph.D.

My research interests include application of Machine learning algorithms to the fields of Marketing and Supply Chain Engineering, Decision Theory and Process Optimization.