Parquet Access Notes#
Basic instructions on how to access the Parquet data for the End of Term (EOT) Web Archive hosted on AWS S3 [1, 2]. These notes are evolving. Feedback is welcome! Written for macOS/Linux/Unix environments.
Notebooks available:#
eot_parquet_access.ipynb: You can download this Jupyter Notebook and run all the example commands locally without needing to copy-paste from here.
eot_parquet_query.ipynb: An additional notebook with a basic overview of how to view and query the Parquet data with DuckDB.
S3 Bucket Overview#
The EOT Web Archive S3 bucket is publicly accessible. You don’t need AWS credentials, but you must use the --no-sign-request flag with AWS CLI commands.
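The --no-sign-request flag tells the CLI to send unsigned (anonymous) requests. As a rough sketch of the same idea from Python, the snippet below builds an unsigned S3 ListObjectsV2 URL with the standard library only and parses the XML that such a request returns. It assumes the bucket is served from the usual virtual-hosted endpoint https://eotarchive.s3.amazonaws.com/ and permits anonymous listing over HTTPS; the helper names are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: standard virtual-hosted S3 endpoint for the public bucket.
BUCKET_URL = "https://eotarchive.s3.amazonaws.com/"
# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(prefix: str = "", delimiter: str = "/") -> str:
    """Build an unsigned ListObjectsV2 URL for the given key prefix."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": delimiter}
    )
    return f"{BUCKET_URL}?{query}"


def parse_listing(xml_data: bytes) -> list:
    """Extract (key, size_in_bytes) pairs from a ListObjectsV2 response."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_objects(prefix: str) -> list:
    """Fetch one page of results (at most 1000 keys). Needs network access."""
    with urllib.request.urlopen(list_url(prefix)) as resp:
        return parse_listing(resp.read())


# Only builds the URL; no request is sent here.
print(list_url("eot-index/table/eot-main/crawl=EOT-2020/"))
```

Calling list_objects("eot-index/table/eot-main/crawl=EOT-2020/") would send the request. No credentials are involved at any point, which is the same behavior the CLI flag gives you.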
General Structure#
Bucket name: eotarchive
Path to Parquet files: eot-index/table/eot-main/crawl=EOT-2020/ (change the year as needed)
eotarchive/
├── crawl-data/                   # Raw WARC/WAT/WET files
└── eot-index/
    ├── collections/
    │   ├── EOT-2004/             # Compressed CDXJ files (.gz)
    │   ├── EOT-2008/
    │   ├── EOT-2012/
    │   ├── EOT-2016/
    │   └── EOT-2020/
    └── table/
        └── eot-main/
            ├── crawl=EOT-2004/   # Parquet part files (.gz.parquet)
            ├── crawl=EOT-2008/
            ├── crawl=EOT-2012/
            ├── crawl=EOT-2016/
            └── crawl=EOT-2020/
Parquet in EOT#
Parquet is a compressed, columnar format optimized for analytics, so you can run scalable DataFrame/SQL queries without parsing large text CDX files. In the EOT datasets, the Parquet tables mirror the CDX/CDXJ capture index. They live under eotarchive/eot-index/table/eot-main/ and are partitioned by crawl year: crawl=EOT-2008/, crawl=EOT-2012/, crawl=EOT-2016/, crawl=EOT-2020/. Each year directory contains multiple part-*.parquet files that together make up the index for that year’s crawl. You can point DuckDB, PyArrow, pandas, Spark, or Athena at a year directory and query it directly (e.g., by host/domain, HTTP status, MIME type, timestamp), then use the stored WARC filename and offset to retrieve the original record if needed [3, 4, 5].
For the column names referenced in queries, see the EOT Parquet Data Dictionary; you can also view the embedded schema directly from the Parquet files (Step 6 below).
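To illustrate the record-retrieval idea, here is a minimal sketch that turns one row's warc_filename, warc_record_offset, and warc_record_length into an HTTP byte-range request. It assumes that warc_filename holds an object key inside the eotarchive bucket and that the bucket is served from the standard endpoint https://eotarchive.s3.amazonaws.com/; check a real row before relying on either. The helper name and the sample values are hypothetical.

```python
import urllib.parse

# Assumption: standard virtual-hosted S3 endpoint for the public bucket.
BUCKET_URL = "https://eotarchive.s3.amazonaws.com/"


def warc_record_request(warc_filename: str, offset: int, length: int):
    """Return (url, headers) for fetching exactly one WARC record."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends.
    last_byte = offset + length - 1
    url = BUCKET_URL + urllib.parse.quote(warc_filename)
    return url, {"Range": f"bytes={offset}-{last_byte}"}


# Hypothetical values standing in for one row of the Parquet table.
url, headers = warc_record_request("crawl-data/example.warc.gz", 1000, 250)
print(url)
print(headers["Range"])  # bytes=1000-1249
```

Sending that request with any HTTP client should return only the bytes of the one record. If the WARC files are compressed per record, as is common for .warc.gz, the body can then be unpacked with gzip.decompress.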
EOT-2020 Parquet Files*#
Details#
Number of files: 48 (part-00000 to part-00047)
Compressed size (total): 76.9 GB
File size range: 0.7-3.4 GB
Average size: about 1.6 GB per file
Download Speed Estimates#
Small file (0.7 GB): 2-3 minutes
Large file (3.4 GB): 10-15 minutes
Full dataset (~77 GB): 3.5-5 hours
*Sizes are approximate and given in decimal GB (10^9 bytes). Time estimates assume download speeds of 4-6 MB/s.
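The time estimates are plain arithmetic: bytes divided by transfer rate. The short sketch below reproduces them for the smallest and largest EOT-2020 files, using byte counts from the bucket listing and the 4-6 MB/s range stated above. The function names are illustrative, and real transfers will vary with network conditions.

```python
def download_minutes(size_bytes: int, mb_per_second: float) -> float:
    """Estimated transfer time in minutes at a steady rate (1 MB = 10**6 bytes)."""
    return size_bytes / (mb_per_second * 1_000_000) / 60


def estimate_range(size_bytes: int, slow: float = 4.0, fast: float = 6.0):
    """Return (best_case, worst_case) minutes for the fast and slow rates."""
    return download_minutes(size_bytes, fast), download_minutes(size_bytes, slow)


# Sizes in bytes taken from the EOT-2020 listing.
sizes = {
    "smallest file (part-00046)": 725_934_696,
    "largest file (part-00000)": 3_448_135_711,
}
for label, size in sizes.items():
    best, worst = estimate_range(size)
    print(f"{label}: {best:.1f}-{worst:.1f} minutes")
```

Swap in your own measured rate for slow and fast to get estimates for your connection.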
Requirements#
1. Install Required Tools#
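You’ll need the AWS CLI (awscli) installed. On macOS with Homebrew (on Linux, use your package manager or the installer from the AWS documentation):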
brew install awscli
Explore and Access the EOT Data#
2. List Bucket Contents (Test Connection)#
aws s3 ls --no-sign-request s3://eotarchive/
Output:#
PRE crawl-data/
PRE eot-index/
3. List Available Parquet Files for a Specific Crawl#
This will show you all 48 Parquet part files (part-00000-*.gz.parquet through part-00047-*.gz.parquet) in the EOT-2020 collection.
aws s3 ls --no-sign-request s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/
Each line looks like:
<timestamp> <size> <filename>
Example:
2023-11-08 09:07:19 725934696 part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
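| Field | Meaning |
|---|---|
| 2023-11-08 09:07:19 | The last modified date and time of the file in S3 (UTC). |
| 725934696 | File size in bytes. This one is ~726 MB (0.7 GB) compressed. |
| part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet | File name: a compressed Parquet part file. |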
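If you want to work with the listing programmatically, each object line splits cleanly into those three fields. The following is a small sketch under the assumption that every line has the timestamp, size, and filename layout shown above; prefix lines such as "PRE crawl-data/" are not handled. The S3Entry type and parse_ls_line helper are hypothetical names.

```python
from datetime import datetime
from typing import NamedTuple


class S3Entry(NamedTuple):
    modified: datetime
    size_bytes: int
    name: str


def parse_ls_line(line: str) -> S3Entry:
    """Split one `aws s3 ls` object line into its three fields."""
    # maxsplit=3 keeps the filename intact even if it contains spaces.
    date, time, size, name = line.split(maxsplit=3)
    modified = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return S3Entry(modified, int(size), name)


line = "2023-11-08 09:07:19 725934696 part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet"
entry = parse_ls_line(line)
print(entry.name)
print(f"{entry.size_bytes / 1e9:.2f} GB")  # 0.73 GB
```

Applying parse_ls_line to every line of the saved listing and summing size_bytes gives the total download size for a crawl.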
EOT-2020 Output (Full list of EOT-2020 Parquet files)#
2023-11-08 09:01:16 3448135711 part-00000-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:01:18 2968471460 part-00001-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:01:22 2416854708 part-00002-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:01:23 2568330760 part-00003-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:01:24 2019064040 part-00004-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:02:10 2049308445 part-00005-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:02:20 2026764211 part-00006-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:02:25 1945399315 part-00007-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:02:27 1940509056 part-00008-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:02:37 1928684657 part-00009-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:03:01 1993549968 part-00010-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:03:10 2722040758 part-00011-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:03:14 1926880962 part-00012-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:03:16 1853413426 part-00013-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:03:25 1798117192 part-00014-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:03:51 1767759290 part-00015-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:02 1763567305 part-00016-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:02 1714640093 part-00017-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:09 1657250685 part-00018-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:16 1601393518 part-00019-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:33 1646659046 part-00020-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:42 1596816861 part-00021-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:42 1590841948 part-00022-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:47 1522885885 part-00023-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:04:52 1564288165 part-00024-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:11 1819595077 part-00025-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:18 1497803634 part-00026-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:19 1490407961 part-00027-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:22 1493182381 part-00028-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:30 1398586176 part-00029-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:53 1354908761 part-00030-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:55 1354073376 part-00031-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:55 1370434591 part-00032-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:05:58 1379782218 part-00033-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:02 1282713381 part-00034-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:26 1232302068 part-00035-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:27 1283149573 part-00036-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:28 1047651424 part-00037-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:32 1060767172 part-00038-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:34 993121623 part-00039-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:55 926045582 part-00040-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:57 965251436 part-00041-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:58 923344888 part-00042-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:58 930140940 part-00043-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:06:59 815602721 part-00044-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:07:18 753819149 part-00045-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:07:19 725934696 part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
2023-11-08 09:07:21 747395388 part-00047-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet
4. Create Local Folder for Subset Downloads#
First, create a folder for your downloaded subsets (replace the path with your own output directory if you prefer):
mkdir -p EOT-2020/parquet/subset
5. Download a Specific Parquet File (Single Test)#
This downloads one file to your local machine and saves it in EOT-2020/parquet/subset/ (change the destination path to save it elsewhere):
aws s3 cp --no-sign-request \
s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet \
EOT-2020/parquet/subset/
6. Load and Inspect the Parquet File#
Important: The EOT Parquet files may use deprecated INT96 timestamp formats. To avoid errors, Python (pandas.read_parquet) is recommended over parquet-cli for inspection.
Example script (preview_parquet.py). Note that pandas.read_parquet needs a Parquet engine such as pyarrow or fastparquet installed:
import pandas as pd
# Load the Parquet file
df = pd.read_parquet("EOT-2020/parquet/subset/part-00046-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet")
# Show the schema (columns and dtypes)
print(df.dtypes)
# Preview first few rows
print(df.head())
Output for part-00046 file:#
url_surtkey object
url object
url_host_name object
url_host_tld object
url_host_2nd_last_part object
url_host_3rd_last_part object
url_host_4th_last_part object
url_host_5th_last_part object
url_host_registry_suffix object
url_host_registered_domain object
url_host_private_suffix object
url_host_private_domain object
url_host_name_reversed object
url_protocol object
url_port float64
url_path object
url_query object
fetch_time datetime64[ns]
fetch_status int16
content_digest object
content_mime_type object
content_mime_detected object
content_charset object
content_languages object
content_puid object
warc_filename object
warc_record_offset int64
warc_record_length int64
warc_segment object
crawl object
subset object
dtype: object
url_surtkey ... subset
0 com,usarmyjrotc)/news/21/03/scripts/legacy/scr... ... warc
1 com,usarmyjrotc)/news/21/03/scripts/legacy/scr... ... warc
2 com,usarmyjrotc)/news/21/03/scripts/legacy/scr... ... warc
3 com,usarmyjrotc)/news/21/03/scripts/legacy/scr... ... warc
4 com,usarmyjrotc)/news/21/03/scripts/legacy/scr... ... warc
[5 rows x 31 columns]
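Once a file is loaded, a natural next step is a quick tally. The sketch below counts captures per HTTP status and MIME type using the fetch_status and content_mime_type columns from the schema above. To stay self-contained it runs on a tiny hand-made DataFrame with hypothetical values; with real data you would pass the df returned by pd.read_parquet. The function name is illustrative.

```python
import pandas as pd


def capture_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows per (fetch_status, content_mime_type), largest first."""
    return (
        df.groupby(["fetch_status", "content_mime_type"])
        .size()
        .reset_index(name="captures")
        .sort_values("captures", ascending=False, ignore_index=True)
    )


# Hypothetical rows that only mimic the real column names.
sample = pd.DataFrame(
    {
        "fetch_status": [200, 200, 404, 200],
        "content_mime_type": [
            "text/html",
            "text/html",
            "text/html",
            "application/pdf",
        ],
    }
)
print(capture_counts(sample))
```

The same pattern works for other columns in the schema, for example grouping by url_host_registered_domain to see which domains dominate a part file.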
7. Optional: Download the Full Dataset#
Warning: This will download all 48 Parquet files for EOT-2020 (~77 GB compressed). Test on smaller subsets first!
Create a folder to save the output files:
mkdir -p EOT-2020/parquet/full
Then download:
aws s3 sync --no-sign-request \
s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/ \
EOT-2020/parquet/full/
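8. Optional: Download Only a Subset (e.g., a few small files)#
Create a folder to save the output files:
mkdir -p EOT-2020/parquet/subset/
Then download each file with its own command. The example below fetches part-00047 and part-00045, two of the smaller files; repeat with other part numbers as needed.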
aws s3 cp --no-sign-request \
s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00047-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet \
EOT-2020/parquet/subset/
aws s3 cp --no-sign-request \
s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/part-00045-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet \
EOT-2020/parquet/subset/
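Typing one command per file gets tedious for larger subsets. As a sketch, the helper below generates the same aws s3 cp command for any EOT-2020 part number, relying on the fact that all 48 filenames in the listing share one suffix. The cp_command name is hypothetical, and the script only prints commands; it does not download anything.

```python
import shlex

BASE = "s3://eotarchive/eot-index/table/eot-main/crawl=EOT-2020/"
# Filename suffix shared by every EOT-2020 part file in the listing.
SUFFIX = "-dda73194-fd75-4dcb-b361-4d099d882262-c000.gz.parquet"


def cp_command(part: int, dest: str = "EOT-2020/parquet/subset/") -> list:
    """Build the `aws s3 cp` argument list for one part number (0-47)."""
    if not 0 <= part <= 47:
        raise ValueError("EOT-2020 has parts 0 through 47")
    source = f"{BASE}part-{part:05d}{SUFFIX}"
    return ["aws", "s3", "cp", "--no-sign-request", source, dest]


# Print the commands for the three smallest files.
for part in (45, 46, 47):
    print(shlex.join(cp_command(part)))
```

To run a command from Python instead of pasting it into a shell, pass the list to subprocess.run(cmd, check=True). Other crawl years use different filenames, so BASE and SUFFIX would need to be updated from that year's listing.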