How to Query EOT Parquet Data with DuckDB#

This notebook shows how to quickly inspect and query the EOT Web Archive Parquet files using DuckDB. DuckDB is great for working with large Parquet datasets directly without needing to load everything into memory.

1. Install Required Tools#

First, make sure you have DuckDB installed:

pip install duckdb
import duckdb
import pandas as pd

2. Preview a Parquet File#

(Update the path if your file is saved somewhere else.)

# Preview the first 5 rows
duckdb.sql("SELECT * FROM 'EOT-2020/parquet/subset/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet' LIMIT 5").df()

3. See Column Names#

# List all columns
df = duckdb.sql("SELECT * FROM 'EOT-2020/parquet/subset/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet' LIMIT 0").df()

# Show column names
pd.DataFrame(df.columns, columns=["Column Name"])

4. Example Query: Find 404 Errors#

Find URLs with a fetch_status of 404 (Not Found).

# Find 404 responses
duckdb.sql("""
    SELECT url, fetch_status
    FROM 'EOT-2020/parquet/subset/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet'
    WHERE fetch_status = 404
    LIMIT 5
""").df()

5. Example Query: Count MIME Types (Single Parquet File)#

Find the most common detected MIME types in the file.

# Count content types in one file
duckdb.sql("""
    SELECT content_mime_detected, COUNT(*) as count
    FROM 'EOT-2020/parquet/subset/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet'
    GROUP BY content_mime_detected
    ORDER BY count DESC
    LIMIT 10
""").df()

6. Example Query: Count MIME Types (Multiple Parquet Files)#

Find the most common detected MIME types across multiple Parquet files.

# Count content types across multiple Parquet files
duckdb.sql("""
    SELECT content_mime_detected, COUNT(*) as count
    FROM 'EOT-2020/parquet/subset/*.parquet'
    GROUP BY content_mime_detected
    ORDER BY count DESC
    LIMIT 10
""").df()

Important: When querying across multiple files, all files you query together should have the same schema (or at least compatible schemas), otherwise you might get a schema mismatch error. Parquet files that were generated from the same source (e.g. EOT-2020) should be fine.