Reflections
Excerpt from “Cat’s Life” by People Too [28].
Limitations
It helps to first acknowledge some limitations. Even when data is structured, its accuracy can vary across databases. Discrepancies sometimes arise between the metadata indexed in bibliographic databases and the details listed in the papers themselves, including errors in author names, incorrect publication years, missing co-authors, and inconsistent formatting of paper titles. These inconsistencies matter most when relying on automated methods to extract and analyze publication data, because errors in structured metadata can propagate and affect research findings.
Fortunately, these errors are relatively rare in dblp compared to other sources. They do occur, however, primarily as papers with missing author names or without the available PDF links and DOIs, likely because of incomplete metadata for papers from earlier conference years. The dataset included with this report reflects the metadata stored for the records in dblp as of February 2025. My hope is to include a corrected dataset in the coming weeks that can be used to compare and assess where these errors occur.
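As a rough illustration of how such gaps could be located, the sketch below counts, per conference year, how many records lack each field. It is only a sketch under assumptions: the field names (`year`, `authors`, `doi`, `pdf_url`) and the function names are hypothetical and would need to be mapped onto the columns of the actual dataset.

```python
from collections import Counter
from typing import Any, Dict, Iterable

# Hypothetical field names; adjust to match the columns in the actual dataset.
REQUIRED_FIELDS = ("authors", "doi", "pdf_url")


def is_missing(value: Any) -> bool:
    """Treat None, blank strings, and empty lists/tuples as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple)):
        return len(value) == 0
    return False


def missing_by_year(records: Iterable[Dict[str, Any]]) -> Dict[Any, Counter]:
    """Count, per year, how many records lack each required field.

    Years in which no record has a missing field do not appear in the result.
    """
    counts: Dict[Any, Counter] = {}
    for record in records:
        year = record.get("year")
        for field in REQUIRED_FIELDS:
            if is_missing(record.get(field)):
                counts.setdefault(year, Counter())[field] += 1
    return counts
```

Calling `missing_by_year` on the records exported from dblp would give one counter per year, which could help check whether the gaps really do cluster in the earlier conference years.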
dblp RDF Schema
“The affiliation properties (affiliation and primaryAffiliation) in the dblp RDF schema may not reliably represent an author’s active affiliation at the time of publication. Researchers using this data should be aware of this limitation and treat the affiliation information as indicative rather than definitive.”
Another limitation is that certain entities and properties in dblp’s RDF schema are still under development. For example, once dblp refines the modeling of affiliation information (e.g., by using structured entities rather than literal strings), it may be possible to update the query or pivoting logic to select the most appropriate affiliation per publication.
Todo
Revisit dblp Insights on ISMIR:
Add more details about dataset analysis, e.g., what does dblp help us learn about ISMIR?
Contextualize insights learned from dblp’s metadata: Summarize notes comparing the different sources providing access to the ISMIR publications, e.g.:
Differences in the available metadata fields from each source
Availability of abstracts: dblp (none), ISMIR conference GitHub repo (2016-2024), Zenodo (2018-2024), Semantic Scholar (similar to Zenodo, but double-check to confirm)
Publication link errors: the frequency appears to be very similar across all sources because this metadata generally comes directly from ISMIR, but I haven’t calculated the actual numbers. Run a script to validate link resolution, especially for the links provided via ISMIR’s GitHub repo
Quality/accuracy of author name disambiguation: dblp has been the best so far (however, authors still go unaccounted for when their names are missing from the entries); add frequency numbers for Semantic Scholar’s duplicate author ID profiles; Zenodo presents more challenges (minimal use of unique author IDs via ORCID); relying solely on the ISMIR GitHub repo presents the greatest challenge/limitation (plain-text names with no unique IDs)
Benefits of RDF data: Note some of the possibilities that working with this data model offers
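For the link-validation item in the list above, one possible starting point is sketched below using only Python's standard library. The function names and the shape of the input (a mapping from a record key to its URL) are assumptions for illustration, not part of the existing scripts. The status lookup is passed in as a parameter so the checking logic can be tried offline before running it against real links.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def head_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the request fails outright."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (OSError, ValueError):
        # Covers DNS failures, timeouts, refused connections, and malformed URLs.
        return 0


def check_links(
    links: Dict[str, str],
    fetch: Callable[[str], int] = head_status,
) -> Dict[str, int]:
    """Return {record key: status} for every link whose status is not 2xx."""
    broken: Dict[str, int] = {}
    for key, url in links.items():
        status = fetch(url)
        if not 200 <= status < 300:
            broken[key] = status
    return broken
```

Some servers reject `HEAD` requests, so a link flagged here may still resolve in a browser; a follow-up `GET` for the flagged links would be a reasonable second pass. Dividing the number of flagged links by the total for each source would give the frequency numbers that are still missing.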