Best Practices: Data Citation

Proper data citation ensures you receive credit for your work and allows others to understand and build upon your research. By treating your data as an important part of your research output and citing it properly, you further illustrate the accuracy of your own work and contribute to a more open and collaborative scholarly community. Many journals and publishers require that data be cited in articles.

Why Data Citation is Essential

Just as you cite publications, citing data is a cornerstone of academic integrity and open science. It:

  • Provides Credit: Ensures that you as the data creator receive professional recognition for your work.
  • Promotes Transparency: Allows others to verify your findings and understand your methodology.
  • Increases Impact: Enables other researchers to discover and reuse your data, potentially leading to new collaborations and discoveries.
  • Ensures Persistence: Helps track the impact and use of your data over time.

Prepare Your Data for Citation

Before anyone can cite your data, you must make it citable. This involves documentation and choosing a stable storage location.

Create Documentation

A dataset is of little use without context. Your documentation should allow someone to understand and reuse your data without needing to contact you. Include:

  • A README File: A plain text file (README.txt or README.md) is the most common format. It should contain:
    • Creator/Author: Names and affiliations of those who collected or created the data.
    • Methodology: How the data was generated (e.g., instruments used, software versions, survey questions, simulation parameters).
    • Data Description: An explanation of the files, variables, units of measurement, and any codes or abbreviations used.
    • Date of Collection: The timeframe during which the data was collected.
    • Terms of Use: Any licenses or restrictions on how the data can be used (e.g., Creative Commons licenses).
  • A Data Dictionary (or Codebook): For tabular data, this file defines each column/variable, its data type (e.g., integer, string), and allowed values.
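
The README elements listed above can be turned into a simple checklist. The following is a minimal sketch in Python, not an official UB or repository template: the function names (build_readme, missing_sections) and the sample values are hypothetical and used only for illustration. It drafts a plain-text README with one section per element and reports which sections are still empty.

```python
# Section names are taken from the README checklist above.
README_SECTIONS = [
    "Creator/Author",
    "Methodology",
    "Data Description",
    "Date of Collection",
    "Terms of Use",
]


def missing_sections(fields):
    """Return the checklist sections that have no text yet."""
    return [s for s in README_SECTIONS if not fields.get(s, "").strip()]


def build_readme(title, fields):
    """Return README text; undocumented sections are flagged with TODO."""
    lines = [title, "=" * len(title), ""]
    for section in README_SECTIONS:
        text = fields.get(section, "").strip()
        lines.append(section + ":")
        lines.append("  " + (text or "TODO: not yet documented"))
        lines.append("")
    return "\n".join(lines)


# Hypothetical values, for illustration only.
draft = {
    "Creator/Author": "A. Researcher, Example Department",
    "Terms of Use": "CC0 1.0",
}
print(build_readme("Example Dataset README", draft))
print("Still to document:", missing_sections(draft))
```

Saving the result as README.txt next to the data files keeps the documentation in the plain text format recommended above.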

Choose the Right Repository to Get a Persistent Identifier (PID)

To make your data reliably citable, it must be stored in a location that provides a Persistent Identifier (PID), such as a Digital Object Identifier (DOI). A DOI is a unique, permanent link to your dataset, ensuring it can be found even if its web address changes.
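
A DOI is often written in several forms (a bare identifier, a "doi:" prefix, or a full link), all of which resolve through https://doi.org/. As a rough illustration, the Python sketch below converts any of these forms into a single resolvable link. The function name doi_to_url is hypothetical, and the pattern check is only a common heuristic for modern DOIs, not an official validation rule.

```python
import re

# Heuristic shape of a modern DOI: "10.", a 4-9 digit registrant code,
# a slash, then a suffix without whitespace. This is an assumption made
# for illustration, not a formal definition.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is prefixed in reference lists and web pages.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(raw):
    """Return the https://doi.org/ link for a DOI written in any common form."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError("Does not look like a DOI: " + repr(raw))
    return "https://doi.org/" + doi


print(doi_to_url("doi:10.5061/DRYAD.66T1G1K8F"))
```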

Standard cloud storage services such as UBbox or OneDrive are not appropriate for long-term, citable data archiving because they do not provide DOIs. Instead, you should use a dedicated data repository.

Key UB and Recommended Resources:

  • Dryad: UB is a Dryad member. For small datasets (under 10 GB), your data publishing charges may be covered. Larger datasets will incur fees. All datasets are curated to be Findable, Accessible, Interoperable, and Reusable (FAIR).
  • Discipline-Specific Repositories: Many academic fields have their own trusted data repositories (e.g., GenBank for genetic sequences, ICPSR for social science data). The UB Research Data Services program can help you identify the best repository for your discipline. A great tool for finding one is re3data.org.
  • Generalist Repositories: If a suitable disciplinary repository doesn't exist, you can use generalist options like Zenodo, Figshare, or OSF. These are widely respected and provide free DOI minting for datasets.

For more information, see our page on best practices for choosing a repository.

Format Your Data Citation Correctly

Once your dataset is in a repository and has a DOI, you can cite it in your publications, presentations, and CV. A standard data citation includes the following core elements:

  • Author(s)/Creator(s): Just as with a manuscript publication, data authors should include all who helped to create the dataset (such as study PI, data collector, data analyst, etc.).
  • Title of Dataset: Use a title that is distinct from the title of the associated manuscript.
  • Publisher/Repository/Location: Name the data repository that houses the data, or indicate where the data may be discovered.
  • Bibliographic metadata: Include year, edition, volume, or version of the dataset.
  • Persistent Identifier: All published datasets should have a persistent URL, the most common of which is a DOI. A persistent URL is a link that remains the same over time, even when the website it connects to is updated.
  • Secondary Data: If citing secondary data, include the date you last accessed it and indicate which sections of the dataset you used.
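
To show how these core elements combine into a single reference, here is a minimal Python sketch that assembles an APA-style dataset citation. The class and function names (DatasetCitation, join_authors, format_apa) are hypothetical, and the pattern is a simplification that assumes at least one author and does not handle special cases such as very long author lists. The sample values are those of the Dryad dataset cited in this guide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    authors: List[str]          # already in "Family, I. I." form
    year: int
    title: str
    repository: str
    doi: str                    # bare DOI, without the https://doi.org/ prefix
    version: Optional[str] = None


def join_authors(authors):
    """Join author names with commas and an ampersand before the last one."""
    if len(authors) == 1:
        return authors[0]
    return ", ".join(authors[:-1]) + ", & " + authors[-1]


def format_apa(c):
    """Assemble the elements into one APA-style dataset reference."""
    version = f" (Version {c.version})" if c.version else ""
    return (
        f"{join_authors(c.authors)} ({c.year}). {c.title}{version} "
        f"[Data set]. {c.repository}. https://doi.org/{c.doi}"
    )


citation = DatasetCitation(
    authors=["Lee, J.", "Seo, Y. S.", "Faith, M."],
    year=2024,
    title=(
        "Whole-child development losses and racial inequalities during the "
        "pandemic: Fallouts of school closure with remote learning and "
        "unprotective community"
    ),
    repository="Dryad",
    doi="10.5061/DRYAD.66T1G1K8F",
    version="4",
)
print(format_apa(citation))
```

Most repositories and reference managers generate formatted citations automatically, so a sketch like this is mainly useful for checking that no core element is missing.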

Citation Examples

APA

  • Lee, J., Seo, Y. S., & Faith, M. (2024). Whole-child development losses and racial inequalities during the pandemic: Fallouts of school closure with remote learning and unprotective community (Version 4) [Data set]. Dryad. https://doi.org/10.5061/DRYAD.66T1G1K8F

Chicago

  • Lee, Jaekyung, Young Sik Seo, and Myles Faith. 2024. “Whole-Child Development Losses and Racial Inequalities during the Pandemic: Fallouts of School Closure with Remote Learning and Unprotective Community.” Version 4. Dryad, October 7. https://doi.org/10.5061/DRYAD.66T1G1K8F.

IEEE

  • J. Lee, Y. S. Seo, and M. Faith, “Whole-child development losses and racial inequalities during the pandemic: Fallouts of school closure with remote learning and unprotective community.” Dryad, Oct. 07, 2024, doi: 10.5061/DRYAD.66T1G1K8F.

MLA

  • Lee, Jaekyung, et al. Whole-Child Development Losses and Racial Inequalities during the Pandemic: Fallouts of School Closure with Remote Learning and Unprotective Community. 4, Dryad, 7 Oct. 2024, doi:10.5061/DRYAD.66T1G1K8F.

Resources