Accessing EOT Parquet Data

This notebook shows how to list, download, and load the EOT Parquet data from the `eotarchive` S3 bucket. A more detailed description of these steps and an overview of the dataset are available here: EOT Web Archive - Parquet Access Notes.

1. Install Required Tools

# Install the AWS CLI if missing (Homebrew shown here;
# on systems without Homebrew, `pip install awscli` is an alternative)
!brew install awscli

# Install pandas and pyarrow if needed
!pip install pandas pyarrow

2. List Bucket Contents

# Quick check: top-level folders in eotarchive bucket
!aws s3 ls --no-sign-request s3://eotarchive/

3. List Available Parquet Files for EOT-2020

# Show Parquet part files for EOT-2020
!aws s3 ls --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/
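
To work with the listing in Python instead of reading it by eye, the output of the command above can be captured and parsed. The following is a minimal sketch, not part of the original notebook: the function names `parse_s3_ls` and `list_parquet_parts` are hypothetical, and it assumes the usual `aws s3 ls` layout of date, time, size in bytes, and key on each object line.

```python
import subprocess

PREFIX = "s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/"


def parse_s3_ls(output):
    """Return (key, size_in_bytes) pairs from `aws s3 ls` text output."""
    entries = []
    for line in output.splitlines():
        # Object lines have four fields: date, time, size in bytes, key.
        # Sub-prefix lines (starting with "PRE") have fewer fields and are skipped.
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[2].isdigit():
            entries.append((parts[3], int(parts[2])))
    return entries


def list_parquet_parts(prefix=PREFIX):
    """Run the same anonymous listing as the cell above and keep only Parquet files."""
    result = subprocess.run(
        ["aws", "s3", "ls", "--no-sign-request", prefix],
        capture_output=True,
        text=True,
        check=True,
    )
    return [(key, size) for key, size in parse_s3_ls(result.stdout)
            if key.endswith(".parquet")]
```

Calling `parts = list_parquet_parts()` and then `sum(size for _, size in parts)` gives the total size in bytes, which is one way to check the download size before committing to a full sync.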

4. Create Local Folder for Subset Downloads

# Make a folder to save downloaded files
!mkdir -p EOT-2020/parquet/subset/

5. Download a Single Test File

# Download part-00046
!aws s3 cp --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet EOT-2020/parquet/subset/
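
The long file name in the command above follows a visible pattern: a zero-padded part number followed by a fixed suffix. The sketch below builds the S3 URI and the matching `aws s3 cp` command from a part number. The helper names are hypothetical, and it assumes that every part file shares the suffix seen in the part files named in this notebook, which should be confirmed against the listing from step 3.

```python
import subprocess

PREFIX = "s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/"
# Suffix copied from the part files named in this notebook.
# Assumption: other part files in the same crawl use the same suffix.
SUFFIX = "dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet"
DEST = "EOT-2020/parquet/subset/"


def part_uri(number):
    """Build the S3 URI for a part number, e.g. 46 -> .../part-00046-<suffix>."""
    return f"{PREFIX}part-{number:05d}-{SUFFIX}"


def copy_command(number, dest=DEST):
    """Return the argument list for an anonymous `aws s3 cp` of one part file."""
    return ["aws", "s3", "cp", "--no-sign-request", part_uri(number), dest]


def download_parts(numbers, dest=DEST):
    """Download each requested part, stopping at the first failed copy."""
    for number in numbers:
        subprocess.run(copy_command(number, dest), check=True)
```

For example, `download_parts([46])` would run the same copy as the cell above.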

6. Load and Inspect the Parquet File

import pandas as pd

# Load the downloaded Parquet file
df = pd.read_parquet("EOT-2020/parquet/subset/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet")

# Show column names and types
print(df.dtypes)

# Preview the first few rows
df.head()
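
If `pd.read_parquet` fails on a downloaded file, one possible cause is an interrupted transfer. A Parquet file begins and ends with the 4-byte marker `PAR1`, so a quick standard-library check can flag a truncated file before any further debugging. This is a minimal sketch with a hypothetical helper name; it only tests the markers and does not validate the file's contents.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    # Smallest plausible file: header magic (4) + footer length (4) + footer magic (4).
    if not path.is_file() or path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 means "relative to the end of the file"
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A file that fails this check is most simply fixed by downloading it again.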

7. (Optional) Download the Full Dataset

# Warning: large download (~77 GB compressed)!
# Make sure you have enough disk space first.

!mkdir -p EOT-2020/parquet/full/

!aws s3 sync --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/ EOT-2020/parquet/full/
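
Before running the sync above, the disk-space warning can be turned into an explicit check. The sketch below uses the roughly 77 GB figure quoted in the warning; the helper name `has_room` and the 10% `headroom` margin are illustrative assumptions, not values from the dataset documentation.

```python
import shutil

# Approximate compressed size quoted in the warning above.
FULL_DOWNLOAD_BYTES = 77 * 10**9


def has_room(path=".", needed_bytes=FULL_DOWNLOAD_BYTES, headroom=0.10):
    """Return True if the filesystem holding `path` has `needed_bytes` free plus a margin."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * (1 + headroom)
```

Running `has_room(".")` from the notebook's working directory returns `False` when the full download is unlikely to fit. `aws s3 sync` also accepts `--dryrun`, which prints the planned transfers without copying anything.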

8. (Optional) Download Only a Subset (e.g., two more part files)

# Reuse the subset folder from step 4 (mkdir -p does nothing if it already exists)
!mkdir -p EOT-2020/parquet/subset/

!aws s3 cp --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00047-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet EOT-2020/parquet/subset/
!aws s3 cp --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00045-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet EOT-2020/parquet/subset/
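
Once several part files sit in the subset folder, they can be read together. The following is a minimal sketch, assuming the files share a schema (as parts of one table normally do) and fit in memory together; the helper names `find_parquet_files` and `load_subset` are hypothetical.

```python
from pathlib import Path

SUBSET_DIR = Path("EOT-2020/parquet/subset")


def find_parquet_files(folder=SUBSET_DIR):
    """Return the downloaded part files in a folder, sorted by name."""
    return sorted(Path(folder).glob("part-*.parquet"))


def load_subset(folder=SUBSET_DIR):
    """Read every part file in `folder` and stack the rows into one DataFrame."""
    import pandas as pd  # imported here so the file listing works without pandas

    files = find_parquet_files(folder)
    if not files:
        raise FileNotFoundError(f"no part-*.parquet files in {folder}")
    # ignore_index=True renumbers rows so indices do not repeat across files.
    return pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
```

With the files from steps 5 and 8 in place, `df = load_subset()` would return their combined rows.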