Accessing EOT Parquet Data#
This notebook shows how to access the EOT Parquet data. A more detailed description of these steps and an overview of the dataset are available in the EOT Web Archive - Parquet Access Notes.
1. Install Required Tools#
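# Install AWS CLI if missing (this command assumes macOS with Homebrew;
# on other systems, use your platform's AWS CLI installer instead)
!brew install awscli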
# Install pandas and pyarrow if needed
!pip install pandas pyarrow
2. List Bucket Contents#
# Quick check: top-level folders in eotarchive bucket
!aws s3 ls --no-sign-request s3://eotarchive/
3. List Available Parquet Files for EOT-2020#
# Show Parquet part files for EOT-2020
!aws s3 ls --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/
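The listing above is plain text, so it can be turned into file names and sizes to see how much data a download would involve. The following is a minimal sketch, assuming the usual `aws s3 ls` line layout (date, time, size in bytes, key). The helper name `parse_s3_ls` and the sample lines are hypothetical, not real output from the bucket.

```python
def parse_s3_ls(output):
    """Return (filename, size_in_bytes) pairs from `aws s3 ls` text output."""
    entries = []
    for line in output.splitlines():
        # Object lines have four fields: date, time, size, key.
        # Folder lines ("PRE name/") and blank lines are skipped.
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[2].isdigit():
            entries.append((parts[3], int(parts[2])))
    return entries

# Hypothetical sample text, for illustration only.
sample = """\
                           PRE subdir/
2021-01-01 00:00:00     1048576 part-00000-example.gz.parquet
2021-01-01 00:00:01     2097152 part-00001-example.gz.parquet
"""

files = parse_s3_ls(sample)
total_mib = sum(size for _, size in files) / 2**20
print(f"{len(files)} files, {total_mib:.1f} MiB")
```

In a notebook, the real listing can be captured with `output = !aws s3 ls ...` and passed in as `parse_s3_ls("\n".join(output))`.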
4. Create Local Folder for Subset Downloads#
# Make a folder to save downloaded files
!mkdir -p EOT-2020/parquet/subset/
5. Download a Single Test File#
# Download part-00046
!aws s3 cp --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet EOT-2020/parquet/subset/
6. Load and Inspect the Parquet File#
import pandas as pd
# Load the downloaded Parquet file
df = pd.read_parquet("EOT-2020/parquet/subset/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet")
# Show column names and types
print(df.dtypes)
# Preview the first few rows
df.head()
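Once the subset folder holds more than one part file, they can be loaded together rather than one path at a time. This is a rough sketch under the assumption that all part files share the same schema. The helper names `list_parquet_files` and `load_parquet_folder` are made up for this example.

```python
from pathlib import Path

def list_parquet_files(folder):
    """Return the .parquet files in `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.parquet"))

def load_parquet_folder(folder):
    """Concatenate every Parquet file in `folder` into one DataFrame."""
    paths = list_parquet_files(folder)
    if not paths:
        raise FileNotFoundError(f"No .parquet files found in {folder}")
    # Imported here so that listing files works even without pandas installed.
    import pandas as pd
    frames = [pd.read_parquet(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

For example, `df_all = load_parquet_folder("EOT-2020/parquet/subset/")` would combine whatever has been downloaded so far. Each part file is read fully into memory, so this approach suits a small subset better than the full dataset.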
7. (Optional) Download the Full Dataset#
# Warning: large download (~77 GB compressed)!
# Make sure you have enough disk space first.
!mkdir -p EOT-2020/parquet/full/
!aws s3 sync --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/ EOT-2020/parquet/full/
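The disk-space warning above can be checked in code before starting the sync. This is a small sketch using the standard library's `shutil.disk_usage`. The helper name `has_enough_space` is hypothetical, and the 77 GB threshold is simply the approximate figure quoted in the warning, with no allowance for extra headroom.

```python
import shutil

def has_enough_space(path, required_bytes):
    """True if the filesystem containing `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes

# ~77 GB, the approximate size quoted in the warning above.
REQUIRED = 77 * 10**9

if has_enough_space(".", REQUIRED):
    print("Enough free space for the full download.")
else:
    print("Not enough free space; consider downloading a subset instead.")
```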
8. (Optional) Download Only a Subset (e.g., two more files)#
# Make sure the subset folder exists (no-op if it was already created)
!mkdir -p EOT-2020/parquet/subset/
# Download part-00047 and part-00045
!aws s3 cp --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00047-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet EOT-2020/parquet/subset/
!aws s3 cp --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00045-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet EOT-2020/parquet/subset/
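For a larger subset, the copy commands can be generated from part numbers instead of being written out by hand. This is only a sketch: the constant and function names are made up, and it assumes every part file in the crawl shares the name suffix seen in the commands above. The bucket listing from step 3 would confirm whether that holds.

```python
S3_PREFIX = "s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/"
# Suffix copied from the part file names used in this notebook; assumed
# (not verified) to be the same for every part file in the crawl.
PART_SUFFIX = "-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet"

def cp_command(part_number, dest="EOT-2020/parquet/subset/"):
    """Build the `aws s3 cp` argument list for one part file."""
    name = f"part-{part_number:05d}{PART_SUFFIX}"
    return ["aws", "s3", "cp", "--no-sign-request", S3_PREFIX + name, dest]

# Print the commands for three part files without running them.
for n in [45, 46, 47]:
    print(" ".join(cp_command(n)))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` to perform the download.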