Brian Sunter

Analyzing Coffee with Data Science + ChatGPT Code Interpreter

Uploading a dataset about coffee and using ChatGPT to automatically clean the data, understand insights, and generate cool charts.

One of my favorite new features of ChatGPT is the “code interpreter,” which allows ChatGPT to execute code and read data files you upload.

You can upload a data file, then ask ChatGPT questions, and ChatGPT will write code to analyze the data and create visualizations based on your questions. Now you can produce excellent visualizations based on data with minimal effort.

It’s surprisingly intelligent: ChatGPT looks at the structure of your data, automatically cleans it, and decides how to analyze it.

I found an interesting dataset on Kaggle called “Coffee Quality Data”. It contains coffee reviews from the Coffee Quality Institute (CQI), with 100-point scores, along with information about the coffee, such as its country of origin, altitude, preparation method, and variety, and information about its flavor profile, such as acidity and sweetness.

I’m interested in examining this dataset to answer questions like: “Which countries produce the best coffee?” and “What are the differences in flavor profiles between countries and preparation methods?”

Loading and Cleaning Data

One of my favorite features of the ChatGPT code interpreter is its ability to clean data and do a preliminary analysis quickly.

The coffee dataset is of decent quality but has some issues, like missing data, poorly formatted altitude values, and inconsistent processing method names. Fixing these is usually a manual, laborious process that gets in the way of quick exploratory analysis. The ChatGPT code interpreter does an excellent job of automatically detecting the most critical data and fixing problems.

It starts by loading the dataset and printing it to discover the most important columns of data.

Load, print, and preliminary analysis

python
## First, let's load and inspect the dataset
import pandas as pd

## Load the data
df = pd.read_csv('/mnt/data/df_arabica_clean.csv')

## Display the first few rows of the dataframe
df.head()

The dataset contains 41 columns and 207 coffee samples. Here are a few key ones:

  • Country of Origin: The country where the coffee bean was grown.
  • Farm Name: The name of the coffee farm.
  • Altitude: The altitude at which the coffee was grown.
  • Region: The region where the coffee was grown.
  • Processing Method: The method used to process the coffee beans.
  • Variety: The coffee variety (e.g., Gesha, Typica, Bourbon).
  • Aroma, Flavor, Aftertaste, Acidity, Body, Balance, Uniformity, Clean Cup, Sweetness, Overall: Scores given to the coffee on various attributes by professional cuppers.
  • Total Cup Points: The overall quality score (our target variable).

Our target variable is Total Cup Points as we are trying to find out what makes the best coffee. The higher the total cup points, the better the coffee is considered to be.

For our analysis, we’ll mainly focus on the factors that might affect coffee quality, such as Country of Origin, Variety, Processing Method, Altitude, and sensory profiles (Aroma, Flavor, Aftertaste, Acidity, Body, Balance).

Missing Values

It checks for missing values and decides how to handle them.

python
## Descriptive statistics
desc_stats = df.describe(include='all')

## Checking for missing values
missing_values = df.isnull().sum()

desc_stats, missing_values

The dataset has minimal missing values:

  • Variety: 6 missing values
  • Processing Method: 5 missing values
  • Altitude: 1 missing value

For this relatively clean dataset, rows with missing values in key analysis columns can be excluded when analyzing those specific features.
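
Here is a minimal sketch of that per-feature exclusion. It uses a made-up four-row sample rather than the real file: the column names match the dataset, but the values and the `rows_for` helper are my own illustration.

```python
import pandas as pd

# Tiny made-up sample; the column names mirror the dataset, the values do not.
sample = pd.DataFrame({
    "Variety": ["Gesha", None, "Bourbon", "Typica"],
    "Processing Method": ["Washed / Wet", "Natural / Dry", None, "Washed / Wet"],
    "Total Cup Points": [86.0, 84.5, 83.0, 82.5],
})

def rows_for(df, column):
    """Return only the rows where `column` is present, leaving other gaps alone."""
    return df.dropna(subset=[column])

# Each analysis drops only the rows that are missing *its* feature.
by_variety = rows_for(sample, "Variety")
by_method = rows_for(sample, "Processing Method")
print(len(sample), len(by_variety), len(by_method))
```

Filtering per feature like this keeps as many coffees as possible in each chart, instead of throwing away every row that has a gap anywhere.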

Analyzing Data

It starts with a quick analysis to show the distribution of the most important data points.

Distribution of Ratings

The histogram shows the distribution of the total cup points, representing the coffee’s overall quality. The distribution is approximately normal with scores ranging from 78 to 89 points. The mean score is 83.7, indicating that most coffees in this dataset are of good specialty-grade quality.

Initial Analysis

Next, ChatGPT does a more detailed analysis by automatically graphing what it thinks are the most important pieces of data.

Let’s examine the relationships between coffee quality (Total Cup Points) and some potentially influential factors (Country of Origin, Variety, Processing Method, Altitude, and sensory profiles). We will use box plots and bar charts to visualize these relationships.

The visualizations suggest that:

  • The highest average coffee quality comes from Ethiopia, Tanzania, Taiwan, and Guatemala.
  • The Gesha variety and SL34 are associated with superior coffee quality.
  • Washed/Wet processing is the most common method, while Natural/Dry shows slightly higher median quality.
  • Higher altitudes (1600m+) tend to produce slightly higher quality coffee.
  • Flavor and Aftertaste scores have a strong positive correlation with overall coffee quality.

Loading Data and Parsing Altitude

The graph above was generated by this code, written by ChatGPT:

The first part loads the data and handles altitude parsing:

python
import pandas as pd
import numpy as np

df = pd.read_csv('/mnt/data/df_arabica_clean.csv')

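## Parse altitude - extract numeric values, averaging ranges like "1700-1930"
def parse_altitude(alt):
    if pd.isna(alt):
        return np.nan
    ## Strip thousands separators and the unit up front so ranges benefit too
    alt_str = str(alt).replace(',', '').replace('m', '').strip()
    if '-' in alt_str:
        parts = alt_str.split('-')
        try:
            ## Use the midpoint of the range
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            pass
    try:
        return float(alt_str)
    except ValueError:
        ## Anything that still isn't a number is treated as missing
        return np.nan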

df['Altitude_Numeric'] = df['Altitude'].apply(parse_altitude)

## Calculate the mean Total Cup Points for each category and sort in descending order
top_countries_mean = df.groupby('Country of Origin')['Total Cup Points'].mean().sort_values(ascending=False).index[:10]
top_varieties_mean = df.groupby('Variety')['Total Cup Points'].mean().sort_values(ascending=False).index[:8]

## Get processing method means
df_method_mean = df.groupby('Processing Method')['Total Cup Points'].mean().sort_values(ascending=False)
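
To see how altitude strings in these shapes resolve to numbers, here is a small standalone sketch. It is a regex-based variant I wrote for illustration, not the code ChatGPT generated, and the `altitude_midpoint` name and the sample strings are assumptions.

```python
import re
from statistics import mean

def altitude_midpoint(raw):
    """Return the mean of all numbers found in an altitude string, or None."""
    if raw is None:
        return None
    # Drop thousands separators, then pull out every run of digits
    # (with optional decimals). A range like "1700-1930" yields two numbers.
    numbers = re.findall(r"\d+(?:\.\d+)?", str(raw).replace(",", ""))
    if not numbers:
        return None
    return mean(float(n) for n in numbers)

for raw in ["1700-1930", "1,200 m", "1850", "unknown", None]:
    print(repr(raw), "->", altitude_midpoint(raw))
```

This version is more permissive than the function above: it accepts any separator between the two ends of a range, but it also ignores units entirely, so a value recorded in feet would be read as if it were metres.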

Generating Charts with Matplotlib and seaborn

The second part of the code generates the charts:

python
import matplotlib.pyplot as plt
import seaborn as sns

## Set up the figure size
plt.figure(figsize=(12, 8))

## Box plot for Country of Origin
sns.boxplot(data=df[df['Country of Origin'].isin(top_countries_mean)],
            y='Country of Origin', x='Total Cup Points',
            order=top_countries_mean)
plt.title('Coffee Quality by Country of Origin')
plt.xlim(78, 90)
plt.tight_layout()
plt.show()

Visualizations

I continued asking it questions to generate visualizations, such as “Generate a bar chart for top mean cup scores by country, sorted in descending order.”

Here are some of my favorite visualizations:

Coffee Quality by Country

Key findings:

  • Ethiopia leads with the highest average quality score (84.96), known for its distinctive fruity and floral notes
  • Tanzania follows closely (84.74), producing bright, complex coffees
  • Taiwan shows impressive quality (84.35) with a large sample size (61 coffees), emerging as a significant specialty coffee origin
  • Guatemala produces consistently high-quality coffees (84.30) with excellent acidity

Coffee Quality by Variety

Key findings:

  • Gesha (also spelled Geisha) stands out as the premium variety with the highest average score (85.43), famous for its jasmine and bergamot notes
  • SL34 shows excellent quality (84.94), a Kenyan variety known for complex fruit flavors
  • Ethiopian Heirlooms score well (84.70), representing diverse indigenous varieties
  • Typica and Bourbon varieties show solid, consistent quality

Coffee Quality by Altitude Range

Higher altitudes generally produce better coffee due to slower cherry development, which allows more complex flavors to develop.

Key findings:

  • Coffee grown at 1800m+ shows the highest median quality
  • There’s a general trend of increasing quality with altitude
  • The effect is subtle but consistent across the dataset
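
Grouping by altitude range can be sketched with `pd.cut`. The bin edges and labels below are my own assumption, since I don't know exactly which ranges ChatGPT used, and the data is made up. The `Altitude_Numeric` column name matches the parsing code earlier in the post.

```python
import pandas as pd

# Made-up altitudes (metres) and scores, for illustration only.
sample = pd.DataFrame({
    "Altitude_Numeric": [900, 1250, 1500, 1700, 1850, 2000],
    "Total Cup Points": [82.0, 83.0, 83.5, 84.0, 84.5, 85.0],
})

# Assumed bin edges; right=False makes each bin include its lower edge.
edges = [0, 1200, 1600, 1800, float("inf")]
labels = ["<1200m", "1200-1600m", "1600-1800m", "1800m+"]

sample["Altitude Range"] = pd.cut(
    sample["Altitude_Numeric"], bins=edges, labels=labels, right=False
)

# Median score per altitude range, skipping ranges with no coffees.
median_by_range = (
    sample.groupby("Altitude Range", observed=True)["Total Cup Points"].median()
)
print(median_by_range)
```

Using the median rather than the mean here keeps a single unusually high or low score from dominating a small bin.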

Quality by Processing Method

What are processing methods? These are how the beans are dried and prepared before roasting.

Washed / Wet: Beans are de-pulped, fermented, and thoroughly washed of all mucilage. This method typically yields coffee with more pronounced acidity and cleaner flavors due to removing all fruit before drying.

Natural / Dry: The whole cherries are picked and spread out in the sun to dry, so the fruit ferments around the seed before it is removed. The coffee retains intense, fruity flavors from the cherry.

Honey / Pulped Natural: This technique involves removing the skin of the coffee cherries but leaving some of the fruity pulp on the seeds when drying. The name ‘honey’ refers to the sticky texture as it dries, not the taste.

Anaerobic: A newer experimental method where beans ferment in oxygen-free environments, creating unique and often intense flavor profiles.

Key findings:

  • Natural/Dry processing shows the highest median quality (84.0) among common methods
  • Washed/Wet is the most common method (124 samples) with consistent quality
  • Honey/Pulped Natural provides a balance between washed clarity and natural fruitiness
  • Experimental methods like Anaerobic show promising results but have very limited samples (n=1-2)
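
Comparing methods like this only works once the varied processing method names have been collapsed into a few consistent labels. Here is a minimal sketch of that kind of cleanup. The raw spellings in the mapping are hypothetical examples, not the actual labels from the file, and `normalize_method` is a helper name I made up.

```python
# Hypothetical raw labels -> canonical names; the real file's spellings may differ.
CANONICAL = {
    "washed / wet": "Washed / Wet",
    "washed": "Washed / Wet",
    "natural / dry": "Natural / Dry",
    "natural": "Natural / Dry",
    "pulped natural / honey": "Honey / Pulped Natural",
    "honey": "Honey / Pulped Natural",
}

def normalize_method(raw):
    """Map a raw label to a canonical name; unknown labels fall into 'Other'."""
    if raw is None:
        return None
    # Lowercase and collapse repeated whitespace before the lookup.
    key = " ".join(str(raw).lower().split())
    return CANONICAL.get(key, "Other")

raw_labels = ["Washed / Wet", "washed", "Natural  /  Dry", "Honey",
              "Double Anaerobic Washed", None]
print([normalize_method(label) for label in raw_labels])
```

On a DataFrame, a function like this could be applied with `df['Processing Method'].map(normalize_method, na_action='ignore')` so that missing values stay missing.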

Flavor Profile by Country

This chart shows how different countries compare on flavor attributes such as sweetness and acidity.

This uses an interesting chart called a “Radar Chart” to visualize multiple sensory dimensions simultaneously.

Key findings from flavor profiles:

  • Ethiopian coffees excel in Acidity (8.02) and Flavor (7.87)
  • Tanzanian coffees have the highest Aroma scores (7.90)
  • Taiwanese coffees show balanced profiles across all attributes
  • Guatemalan coffees demonstrate strong Flavor scores (7.88) with consistent overall quality

Sensory Attribute Correlations

This analysis shows which sensory attributes correlate most strongly with the overall quality score.

Key findings:

  • Overall rating has the strongest correlation (0.95) with Total Cup Points, followed closely by Flavor (0.94)
  • Aftertaste and Balance are also highly correlated (0.93)
  • Aroma shows strong influence on perceived quality (0.87)
  • Uniformity, Clean Cup, and Sweetness show perfect or near-perfect scores for most samples, so they don’t differentiate quality as much
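
Correlations like these can be computed with `DataFrame.corr()`. The sketch below uses made-up scores for five coffees (only the column names come from the dataset) and includes a constant `Uniformity` column to show why an attribute that never varies can't differentiate quality.

```python
import pandas as pd

# Made-up scores for five coffees; only the column names come from the dataset.
sample = pd.DataFrame({
    "Flavor":     [7.5, 7.7, 7.9, 8.1, 8.3],
    "Aroma":      [7.6, 7.5, 7.9, 7.8, 8.2],
    "Uniformity": [10.0, 10.0, 10.0, 10.0, 10.0],
    "Total Cup Points": [82.0, 83.0, 84.0, 85.0, 86.0],
})

# Pearson correlation of every attribute with the total score, strongest first.
correlations = (
    sample.corr()["Total Cup Points"]
    .drop("Total Cup Points")
    .sort_values(ascending=False)
)
print(correlations)
```

In this toy data, `Flavor` rises in lockstep with the total and so correlates almost perfectly, while the constant `Uniformity` column comes out as NaN because a correlation is undefined when one variable has no variance.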

Conclusion

ChatGPT’s code interpreter has proven immensely useful for exploratory data analysis.

I had some fun extracting insights and generating cool visualizations from the Kaggle coffee dataset. Here are the key takeaways:

  1. Ethiopia and Tanzania produce the highest quality coffees on average, while Taiwan has emerged as an impressive specialty coffee origin
  2. The Gesha variety has the highest average score, which is consistent with its premium pricing in the specialty market
  3. Higher altitude correlates with better quality, supporting the conventional wisdom about mountain-grown coffee
  4. Processing method matters: Natural/Dry processing shows the highest median quality among common methods, while Washed is most prevalent
  5. The Overall rating, Flavor, Aftertaste, and Balance correlate most strongly with Total Cup Points

ChatGPT was able to automate almost all of the data cleaning and code to generate visualizations.

I’m impressed with how intelligent the tool is and have had good results uploading datasets and asking questions about the data.

The main limitation is that the Code Interpreter only has a handful of preinstalled libraries, but I could imagine this being extremely powerful once it can install and run any library.