Data Dictionary

EOT-2020 Dataset Overview

| File | Total Size (Compressed) |
| --- | --- |
| WARC files | 266.04 TB (239,811 files) |
| WAT files | 9.15 TB |
| WET files | 2.6 TB |
| META files | 712.66 GB |
| CDX files | 83.66 GB |
| Parquet files | 76.88 GB (48 files) |
| URL Index files | 74.4 GB (49 files) |

Source: End of Term Web Archive 2020 [6]

EOT Parquet Data Dictionary

This data dictionary is based on the EOT Index Schema, available in JSON format in the Common Crawl Index Table repo [7]. The WARC Metadata Sidecar fields (added by the EOT Archive team in 2023) are not currently included in that schema file; they have been added to this data dictionary for reference and schema completeness. They are content_mime_detected, content_charset, content_languages, and content_puid. For more information about the development and use of the WARC Metadata Sidecar fields, see Phillips, Phillips, and Alam (2023) [5].

In the data dictionary below, the Schema Data Type column gives the data type as defined in the original JSON schema. The Pandas Data Type column gives the dtype that pandas assigns to the column when the Parquet file is read with pandas.read_parquet().
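The relationship between the two type columns can be sketched as a small lookup. This is a minimal illustration, assuming the default pandas.read_parquet() conversion, in which an integer column that contains nulls is loaded as float64 because NaN is used to mark the missing values (which would explain the float64 entry for url_port). The names SCHEMA_TO_PANDAS and expected_pandas_dtype are hypothetical, and the int32 entry for an integer column without nulls is an assumption that the data dictionary does not state.

```python
# Schema type -> pandas dtype, following the data dictionary.
# "integer" -> "int32" is an assumption for integer columns without nulls.
SCHEMA_TO_PANDAS = {
    "string": "object",
    "integer": "int32",
    "short": "int16",
    "long": "int64",
    "timestamp": "datetime64[ns]",
}


def expected_pandas_dtype(schema_type: str, has_nulls: bool = False) -> str:
    """Return the dtype pandas is expected to report for a column.

    Default NumPy-backed integer dtypes cannot hold missing values, so an
    integer column that contains nulls is reported as float64 instead.
    """
    dtype = SCHEMA_TO_PANDAS[schema_type]
    if has_nulls and dtype.startswith("int"):
        return "float64"
    return dtype


print(expected_pandas_dtype("integer", has_nulls=True))  # url_port
print(expected_pandas_dtype("short"))                    # fetch_status
print(expected_pandas_dtype("string", has_nulls=True))   # url_query
```

If integer semantics matter for url_port, converting the loaded column to the pandas nullable "Int64" dtype is one option.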

| Column Name | Schema Data Type | Pandas Data Type | Nullable | Description | From CDX | Example |
| --- | --- | --- | --- | --- | --- | --- |
| url_surtkey | string | object | False | SURT URL key | N/A | com,example)/path/index.html |
| url | string | object | False | URL string | url | https://www.example.com/path/index.html |
| url_host_name | string | object | False | Hostname, including IP addresses | N/A | www.example.com |
| url_host_tld | string | object | True | Top-level domain or last part of the hostname | N/A | com for the hostname www.example.com |
| url_host_2nd_last_part | string | object | True | Second-last part of the hostname | N/A | example for the hostname www.example.com, co for bbc.co.uk |
| url_host_3rd_last_part | string | object | True | Third-last part of the hostname | N/A | www for the hostname www.example.com |
| url_host_4th_last_part | string | object | True | Fourth-last part of the hostname | N/A | host1 for host1.subdomain.example.com |
| url_host_5th_last_part | string | object | True | Fifth-last part of the hostname | N/A | host1 for host1.sub2.subdomain.example.com |
| url_host_registry_suffix | string | object | True | Domain registry suffix | N/A | com, co.uk |
| url_host_registered_domain | string | object | True | Domain name of the host (one level below the registry suffix) | N/A | example.com, bbc.co.uk |
| url_host_private_suffix | string | object | True | Suffix of domain registries including private registrars, see https://publicsuffix.org/ | N/A | com, co.uk, but also s3.amazonaws.com or blogspot.com |
| url_host_private_domain | string | object | True | Domain name of the host (one level below the private suffix) | N/A | mypublicbucket.s3.amazonaws.com or myblog.blogspot.com |
| url_host_name_reversed | string | object | True | Hostname, excluding IP addresses, in reverse domain name notation | N/A | com.example.www |
| url_protocol | string | object | False | Protocol of the URL | N/A | https |
| url_port | integer | float64 | True | Port of the URL (null if not explicitly specified in the URL) | N/A | 8443 |
| url_path | string | object | True | File path of the URL | N/A | /path/index.html |
| url_query | string | object | True | Query part of the URL | N/A | q=abc&lang=en for …/search?q=abc&lang=en |
| fetch_time | timestamp | datetime64[ns] | False | Fetch time (capture timestamp) | N/A | 2017-10-24T00:14:32Z |
| fetch_status | short | int16 | False | HTTP response status code (-1 if absent, e.g., for revisit records) | status | 200 |
| content_digest | string | object | True | SHA-1 content digest (WARC-Payload-Digest) | digest | CH7IV3XAD3M7A42JARKRLJ3T5PGGCGXD |
| content_mime_type | string | object | True | Content-Type sent in HTTP response header | mime | text/html |
| warc_filename | string | object | False | WARC filename/path below s3://eotarchive/ or https://eotarchive.s3.amazonaws.com/ | filename | crawl-data/EOT-2008/segments/IA-001/warc/DOTGOV-2008-01-20080923002742-04410-crawling14.us.archive.org.arc.gz |
| warc_record_offset | long | int64 | False | Offset of the WARC record | offset | 397346194 |
| warc_record_length | long | int64 | False | Length of the WARC record | length | 24662 |
| warc_segment | string | object | False | Segment the WARC file belongs to | N/A | IA-001 |
| crawl | string | object | False | Crawl the capture/record is part of | N/A | EOT-2008 |
| subset | string | object | False | Subset of responses (organized as subfolder of segments) | N/A | N/A |
| content_mime_detected | string | object | True | MIME type detected from content-based analysis using python-magic. Typically more general than the MIME type from HTTP headers or FIDO. | N/A | text/html |
| content_charset | string | object | True | Detected character encoding of the content using Chardet. Useful when HTTP headers are missing or incorrect. | N/A | UTF-8 |
| content_languages | string | object | True | List of top detected languages from HTML or text content, using Compact Language Detector 2 (CLD2). | N/A | en, es |
| content_puid | string | object | True | PRONOM Persistent Unique Identifier for file format, determined using the FIDO format identification tool. | N/A | fmt/19 |
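Together, the warc_filename, warc_record_offset, and warc_record_length columns are enough to address a single record inside a WARC file. The following is a minimal sketch, assuming that the offset and length are byte positions within the compressed file (the usual convention for CDX-derived indexes) and that the HTTPS endpoint listed for warc_filename honors HTTP Range requests. The helper build_record_request is hypothetical, and the request is only constructed here, not sent.

```python
from urllib.request import Request

# Base URL taken from the warc_filename description in the data dictionary.
BASE_URL = "https://eotarchive.s3.amazonaws.com/"


def build_record_request(warc_filename: str, offset: int, length: int) -> Request:
    """Build an HTTP request for the bytes of one record (hypothetical helper).

    Assumes offset and length refer to bytes in the compressed file.
    HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    """
    last_byte = offset + length - 1
    return Request(
        BASE_URL + warc_filename,
        headers={"Range": f"bytes={offset}-{last_byte}"},
    )


# Example values copied from the Example column of the data dictionary.
req = build_record_request(
    "crawl-data/EOT-2008/segments/IA-001/warc/"
    "DOTGOV-2008-01-20080923002742-04410-crawling14.us.archive.org.arc.gz",
    offset=397346194,
    length=24662,
)
print(req.full_url)
print(req.get_header("Range"))  # bytes=397346194-397370855
```

Under those assumptions, passing the request to urllib.request.urlopen() would return the gzip-compressed bytes of that one record, which could then be decompressed with the gzip module.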