Best Practices: Documentation and Metadata

Good documentation and metadata practices help make research data findable, accessible, interoperable, and reusable (the FAIR principles). Documenting the decisions you make during data collection and analysis also makes the project, its data, and its processes easier to understand when you revisit them later. Good documentation helps others understand your research, and it helps your future self reconstruct what you did. Store your documentation alongside your data in secure, UB-approved storage locations.

Consider a Data Management Plan (DMP)

Before starting your research, create a DMP. Many funding agencies require one. UB is a partner institution of the DMP Tool, a template-based platform hosted by the University of California Curation Center at the California Digital Library. You can log in with your UBITName to write and store your data management plans and to view and use templates for each funder.

A DMP should outline:

  • Types of data you will collect
  • Data organization and naming conventions
  • Documentation and metadata strategies
  • Storage and backup plans (linking to UB resources)
  • Data sharing and preservation plans
  • Roles and responsibilities

Data Documentation

Your data documentation should answer the "who, what, when, where, why, and how" of your data. Keep documentation in one place and consider creating a "README" file for each dataset or project. As a test, imagine taking over a project in the middle of a grant with no way to contact the former project manager: what information would you need to continue successfully?

Start Early and Document Continuously: Integrate documentation into your research workflow from the project's inception, not as an afterthought.
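
One way to start documenting on day one is to generate a README skeleton when the project folder is created. The following is a minimal sketch, not a required UB format: the function names, the file name README.txt, and the "TODO" placeholder are illustrative assumptions, and the headings are abbreviated from the checklist in this guide.

```python
from pathlib import Path

# Headings abbreviated from the checklist in this guide; adjust to fit the project.
README_SECTIONS = {
    "PROJECT-LEVEL INFORMATION": [
        "Title", "Creator(s)", "Dates", "Funder(s)", "Related publications",
        "Intellectual property rights", "Contact information",
    ],
    "DATASET-LEVEL INFORMATION": [
        "Description/abstract", "Keywords", "Data source/provenance",
        "Methodology", "Scope and coverage", "File list and organization",
        "Relationships between files",
    ],
    "PROCESSING AND QUALITY ASSURANCE": [
        "Data cleaning and transformation", "Software/tools used",
        "Version history", "Quality control measures",
    ],
}


def build_readme(sections: dict[str, list[str]]) -> str:
    """Return README text with one 'Field: TODO' line per item to document."""
    lines = []
    for heading, fields in sections.items():
        lines.append(heading)
        lines.append("-" * len(heading))
        for field in fields:
            lines.append(f"{field}: TODO")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


def write_readme(project_dir: Path, overwrite: bool = False) -> Path:
    """Create README.txt in project_dir; leave an existing file alone by default."""
    project_dir.mkdir(parents=True, exist_ok=True)
    path = project_dir / "README.txt"
    if path.exists() and not overwrite:
        return path  # never clobber documentation that is already in progress
    path.write_text(build_readme(README_SECTIONS), encoding="utf-8")
    return path
```

Calling write_readme(Path("my_project")) (a hypothetical folder name) creates the skeleton once; later calls leave the edited file untouched unless overwrite=True is passed.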

Essential Elements to Document:

  • Project-Level Information:
    • Title: Clear and descriptive title of the dataset/project.
    • Creator(s): Names and affiliations of all individuals and organizations involved in data creation (authors, research assistants, etc.).
    • Dates: Project start and end dates and data creation/modification dates, including when the analysis was done.
    • Funder(s): Funding agencies and grant numbers.
    • Related Publications: Citations of any publications using this data.
    • Intellectual Property Rights: Any known rights, licenses, or restrictions on data use.
    • Contact Information: For questions about the data.
  • Dataset-Level Information:
    • Description/Abstract: A brief summary of the dataset's content and purpose. Why is the work important? What is the impetus for the project? What questions are you trying to answer?
    • Keywords: Relevant terms for discoverability.
    • Data Source/Provenance: Where the data originated (e.g., experimental, survey, existing dataset), how it was collected, and any relevant protocols or instruments used.
    • Methodology: Detailed explanation of data collection, processing, and analysis methods. Include every step in your data process (how you got from step A to step B to step C) and your protocols: what decisions were made, and why?
    • Scope & Coverage: Geographic, temporal, and subject coverage of the data. Where does the project take place? Does it involve a particular geographic area?
    • File List and Organization: A list of all files, their formats, and how they are organized within the directory structure, including naming conventions.
    • Relationships between files: If applicable, how different files within the dataset relate to each other.
  • Variable-Level Information (Data Dictionary): For tabular data, a data dictionary is critical.
    • Variable Names: Short, consistent, and descriptive names.
    • Variable Labels/Descriptions: Clear, longer descriptions of each variable.
    • Units of Measurement: For quantitative variables.
    • Allowed Values/Coding Schemes: Explanation of codes, abbreviations, or valid ranges for categorical or coded variables.
    • Missing Data Codes: How missing values are represented and what they signify.
    • Data Types: The type of each variable (e.g., numeric, text, date).
  • Processing and Manipulation:
    • Data Cleaning and Transformation: Describe any steps taken to clean, validate, or transform the data.
    • Software/Tools Used: List all software, scripts, and versions used for data processing and analysis.
    • Version History: Track all changes to your data and documentation, recording what changed, when, and who made the change, and clearly label each version.
  • Quality Assurance:
    • Quality Control Measures: Describe steps taken to ensure data accuracy and reliability.
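
A data dictionary is most useful when it is machine-readable, because the same file can then document the variables and check the data against them. The sketch below assumes a hypothetical three-column survey table; the variable names, the missing-data code "-999", and the helper functions check_row and check_csv are all invented for illustration and are not part of any standard.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable, holding the label,
# data type, units, and allowed values described in the checklist above.
DATA_DICTIONARY = {
    "participant_id": {"label": "Unique participant identifier", "type": "text"},
    "age_years": {
        "label": "Age at enrollment",
        "type": "numeric",
        "units": "years",
        "range": (18, 99),
    },
    "smoker": {
        "label": "Current smoking status",
        "type": "text",
        "allowed": {"0": "No", "1": "Yes"},
    },
}
MISSING_CODE = "-999"  # assumed code meaning "value not recorded"


def check_row(row, dictionary, missing_code=MISSING_CODE):
    """Return a list of problems found in one row (empty if the row is clean)."""
    problems = []
    for name in row:
        if name not in dictionary:
            problems.append(f"{name}: not in data dictionary")
    for name, spec in dictionary.items():
        if name not in row:
            problems.append(f"{name}: column missing")
            continue
        value = row[name]
        if value == missing_code:
            continue  # documented missing value, nothing to validate
        if "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: unexpected code {value!r}")
        if spec["type"] == "numeric":
            try:
                number = float(value)
            except (TypeError, ValueError):
                problems.append(f"{name}: {value!r} is not numeric")
                continue
            if "range" in spec:
                low, high = spec["range"]
                if not low <= number <= high:
                    problems.append(f"{name}: {number} outside {low}-{high}")
    return problems


def check_csv(text, dictionary):
    """Map the line number of each problem row in CSV text to its problems."""
    reader = csv.DictReader(io.StringIO(text))
    report = {}
    for row in reader:
        problems = check_row(row, dictionary)
        if problems:
            report[reader.line_num] = problems
    return report
```

Running check_csv on a table before sharing it gives a quick quality-control pass: rows that use undocumented codes or out-of-range values are reported by line number, which indicates that either the data or the dictionary needs correcting.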

Metadata

Metadata is data about data that helps users understand and find your datasets. While data documentation focuses on the content, metadata provides structured information for discovery and interoperability. When you share your data, metadata is what makes it findable to others.

Key Aspects of Metadata:

  • Descriptive Metadata: Information for discovery and identification (e.g., title, author, abstract, keywords).
  • Structural Metadata: Describes the relationships within and between components of a digital object (e.g., file formats, file relationships).
  • Administrative Metadata: Information to manage the data (e.g., preservation metadata, rights management, technical information about creation).

Choosing and Applying Metadata Standards:

  • Discipline-Specific Standards: Prioritize standards widely accepted in your field. Many fields have established metadata standards (e.g., Data Documentation Initiative for social sciences, NIH Common Data Elements for health sciences). Using them improves interoperability and discoverability within your community.
  • General Standards: If no specific standard exists for your discipline, consider a general standard such as Dublin Core, which provides a simple set of 15 core elements for describing resources.
  • Repositories and Publishers: Be aware that data repositories and journals may have specific metadata requirements. Adhering to these early can save time later.
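
As an illustration of what structured metadata looks like, the following sketch writes a simple Dublin Core record as XML using Python's standard library. The element names and namespace come from the Dublin Core Metadata Element Set; the wrapping "metadata" element, the function name dublin_core_xml, and every value in EXAMPLE_RECORD are hypothetical, and a repository may require a different wrapper or schema.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical record; every value below is invented for illustration.
EXAMPLE_RECORD = {
    "title": ["Example Lake Water Quality Survey"],
    "creator": ["Doe, Jane", "Roe, Richard"],
    "subject": ["water quality", "limnology"],
    "description": ["Monthly water samples from an example lake."],
    "date": ["2024-01-15"],
    "format": ["text/csv"],
    "rights": ["CC BY 4.0"],
}


def dublin_core_xml(record: dict[str, list[str]]) -> str:
    """Serialize a record to XML, rejecting names outside the 15 DC elements."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # assumed wrapper; repositories define their own
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"{element!r} is not a Dublin Core element")
        for value in values:  # Dublin Core elements are repeatable
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")
```

Each key maps to a list because Dublin Core elements are repeatable, so two creators become two dc:creator elements. Rejecting unknown names early catches typos before the record reaches a repository.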