Module gamslib.sip.validation

Validation utilities for Bagit and object directories in GAMS projects.

This subpackage provides functions to validate the structure and metadata of Bagit directories, including checks for required files, manifests, and SIP JSON metadata.

Features

  • Validates Bagit directory structure and required files.
  • Checks bagit.txt, bag-info.txt, and manifest files (MD5, SHA512).
  • Validates SIP JSON metadata for completeness and correctness.
  • Raises BagValidationError for any validation failures.

Usage

Call validate_bag(bag_dir) to perform all standard validations on a Bagit directory. Individual validation functions are also available for more granular checks.

Sub-modules

gamslib.sip.validation.baginfo

Validation functions for the bag-info.txt file in GAMS Bagit directories …

gamslib.sip.validation.bagit

Validation functions for the structure and contents of Bagit directories in GAMS projects …

gamslib.sip.validation.manifests

Validation functions for manifest files in Bagit directories for GAMS projects …

gamslib.sip.validation.sip_json

Validation functions for the sip.json file in GAMS Bagit directories …

Functions

def validate_bag(bag_dir: pathlib.Path) ‑> None
def validate_bag(bag_dir: Path) -> None:
    """
    Validate the structure and metadata of a Bagit directory.

    Args:
        bag_dir (Path): Path to the Bagit directory to validate.

    Raises:
        BagValidationError: If the bag directory does not exist or any validation check fails.

    Notes:
        - Runs all standard validation checks: structure, bagit.txt, manifests, SIP JSON, and bag-info.txt.
        - Raises an error immediately if any check fails.
    """
    if not bag_dir.is_dir():
        raise BagValidationError(f"Bag directory {bag_dir} does not exist")
    validate_structure(bag_dir)
    validate_bagit_txt(bag_dir)
    validate_manifest_md5(bag_dir)
    validate_manifest_sha512(bag_dir)
    validate_sip_json(bag_dir)
    validate_baginfo_text(bag_dir)

Validate the structure and metadata of a Bagit directory.

Args

bag_dir : Path
Path to the Bagit directory to validate.

Raises

BagValidationError
If the bag directory does not exist or any validation check fails.

Notes

  • Runs all standard validation checks: structure, bagit.txt, manifests, SIP JSON, and bag-info.txt.
  • Raises an error immediately if any check fails.
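
The fail-fast behaviour described in the notes can be illustrated with a minimal, self-contained sketch. It does not import gamslib: the BagValidationError class below is a stand-in, and check_bagit_txt, check_data_dir, run_checks and first_error are hypothetical names used only for illustration. The real checks are considerably more thorough.

```python
from pathlib import Path
from typing import Optional


class BagValidationError(Exception):
    """Stand-in for the library's exception so the sketch runs on its own."""


def check_bagit_txt(bag_dir: Path) -> None:
    # Hypothetical check: only tests that the file is present.
    if not (bag_dir / "bagit.txt").is_file():
        raise BagValidationError(f"{bag_dir} has no bagit.txt")


def check_data_dir(bag_dir: Path) -> None:
    # Hypothetical check: a bag keeps its payload in a "data" directory.
    if not (bag_dir / "data").is_dir():
        raise BagValidationError(f"{bag_dir} has no data directory")


def run_checks(bag_dir: Path) -> None:
    """Run the checks in order; the first failure propagates to the caller."""
    if not bag_dir.is_dir():
        raise BagValidationError(f"Bag directory {bag_dir} does not exist")
    for check in (check_bagit_txt, check_data_dir):
        check(bag_dir)


def first_error(bag_dir: Path) -> Optional[str]:
    """Return the first error message, or None if every check passes."""
    try:
        run_checks(bag_dir)
    except BagValidationError as exc:
        return str(exc)
    return None
```

Because each check raises on failure, a caller only ever sees the first problem. Fixing it and validating again reveals the next one, if any.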
def validate_datastream_id(datastream_id: str) ‑> None
def validate_datastream_id(datastream_id: str) -> None:
    """Validate a given datastream ID.

    A valid datastream ID must start with a letter or a number, followed by any
    number of ASCII letters, numbers, dots, dashes and underscores.

    Args:
        datastream_id (str): The datastream ID to validate.

    Raises:
        ValueError: If the datastream ID is invalid. The error message will indicate the reason.
    """
    _validate_object_id(datastream_id, allow_uppercase=True)

Validate a given datastream ID.

A valid datastream ID must start with a letter or a number, followed by any number of ASCII letters, numbers, dots, dashes and underscores.

Args

datastream_id : str
The datastream ID to validate.

Raises

ValueError
If the datastream ID is invalid. The error message will indicate the reason.
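
The rule stated above can be approximated with a single regular expression. This is only a sketch derived from the prose description: the pattern and the name looks_like_datastream_id are assumptions, and the library's internal _validate_object_id (not shown here) may enforce further constraints.

```python
import re

# Assumed pattern derived from the description: the first character is an
# ASCII letter or digit; later characters may also be dots, dashes or
# underscores. Uppercase is allowed, as for datastream IDs.
DATASTREAM_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def looks_like_datastream_id(value: str) -> bool:
    """Return True if value matches the rule as described in the docstring."""
    return DATASTREAM_ID_RE.fullmatch(value) is not None
```

Using fullmatch rather than match ensures the whole string is checked, so a trailing invalid character is not silently accepted.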
def validate_pid(pid: str) ‑> None
def validate_pid(pid: str) -> None:
    """Validate a given PID (Project Identifier).

    A valid id follows the rules of xml:id, with some modifications:

     - All letters must be lowercase ASCII letters.
     - Every id must have the project sigle as prefix, followed by a dot.
       The prefix must start with a letter, followed by any number
       of letters and numbers.
     - The part after the dot must start with a letter or a number, followed by any
       number of ASCII letters, numbers, dots, and dashes.
     - For legacy reasons, the project prefix can be preceded by a type prefix like 'o:',
       but we discourage the use of this prefix for new objects. Only lowercase letters are
       allowed in the type prefix.

    Examples of invalid ids:

        - .abcdef  (starts with a dot)
        - 1abcdef (starts with a number)
        - abc/def (contains invalid character '/')
        - abc@def (contains invalid character '@')
        - abcdef  (no dot)
        - abc..def (double dot)

    Args:
        pid (str): The ID to validate.

    Raises:
        ValueError: If the ID is invalid. The error message will indicate the reason.
    """
    MAX_ID_LENGTH = 64
    # Reject IDs that exceed the maximum allowed length
    if len(pid) > MAX_ID_LENGTH:
        raise ValueError(f"ID must not be longer than {MAX_ID_LENGTH} characters")
    type_prefix, project_prefix, object_id = _split_id(pid)
    _validate_type_prefix(type_prefix)
    validate_project_name(project_prefix)
    _validate_object_id(object_id)
    if type_prefix:
        warnings.warn(
            "Using type prefixes in PIDs is discouraged for new objects.", UserWarning
        )

Validate a given PID (Project Identifier).

A valid id follows the rules of xml:id, with some modifications:

  • All letters must be lowercase ASCII letters.
  • Every id must have the project sigle as prefix, followed by a dot. The prefix must start with a letter, followed by any number of letters and numbers.
  • The part after the dot must start with a letter or a number, followed by any number of ASCII letters, numbers, dots, and dashes.
  • For legacy reasons, the project prefix can be preceded by a type prefix like 'o:', but we discourage the use of this prefix for new objects. Only lowercase letters are allowed in the type prefix.

Examples of invalid ids:

  • .abcdef (starts with a dot)
  • 1abcdef (starts with a number)
  • abc/def (contains invalid character '/')
  • abc@def (contains invalid character '@')
  • abcdef (no dot)
  • abc..def (double dot)

Args

pid : str
The ID to validate.

Raises

ValueError
If the ID is invalid. The error message will indicate the reason.
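
The three parts that validate_pid checks (type prefix, project prefix, object id) can be pictured with a small splitting helper. The library's own _split_id is not shown in this documentation, so split_pid below is a hypothetical stand-in written from the rules above; it may differ from the real behaviour.

```python
from typing import Tuple


def split_pid(pid: str) -> Tuple[str, str, str]:
    """Split a PID into (type_prefix, project_prefix, object_id).

    Hypothetical helper: the library's _split_id is not shown in this
    documentation and may behave differently.
    """
    # An optional legacy type prefix such as "o:" ends at the first colon.
    type_prefix, colon, rest = pid.partition(":")
    if not colon:
        type_prefix, rest = "", pid
    # The project prefix ends at the first dot; the remainder is the object id.
    project_prefix, dot, object_id = rest.partition(".")
    if not dot:
        raise ValueError("ID must contain a dot after the project prefix")
    return type_prefix, project_prefix, object_id
```

With this sketch, "abc.def123" yields an empty type prefix, "abc" as project prefix and "def123" as object id, while "abcdef" is rejected because it has no dot.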
def validate_project_name(value: str) ‑> None
def validate_project_name(
    value: str
) -> None:
    """
    Validate the project name. Can also be used to validate the project prefix of a PID.

    The value must start with a lowercase letter, followed by any number of lowercase letters and numbers.
    The project prefix is the part before the dot (.) in an ID like "abc.def123".

    Args:
        value (str): The project prefix to validate.

    Raises:
        ValueError: If the project prefix is invalid.
    """
    if not value:
        raise ValueError("Project prefix (before dot) is empty")

    if not re.match(r"^[a-z][a-z0-9]*$", value):
        raise ValueError(
            "A project name must start with a letter and contain "
            "only lowercase letters and numbers."
        )

Validate the project name. Can also be used to validate the project prefix of a PID.

The value must start with a lowercase letter, followed by any number of lowercase letters and numbers. The project prefix is the part before the dot (.) in an ID like "abc.def123".

Args

value : str
The project prefix to validate.

Raises

ValueError
If the project prefix is invalid.
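
A short usage sketch may help here. To keep it runnable without gamslib installed, the function body is copied from the source listing above (docstring omitted); is_valid_project_name is a hypothetical convenience wrapper and not part of the documented module.

```python
import re


def validate_project_name(value: str) -> None:
    # Copied from the source listing above so the example runs standalone.
    if not value:
        raise ValueError("Project prefix (before dot) is empty")

    if not re.match(r"^[a-z][a-z0-9]*$", value):
        raise ValueError(
            "A project name must start with a letter and contain "
            "only lowercase letters and numbers."
        )


def is_valid_project_name(value: str) -> bool:
    """Hypothetical wrapper that turns the ValueError into a boolean."""
    try:
        validate_project_name(value)
    except ValueError:
        return False
    return True
```

Names such as "abc" and "abc123" pass, while "1abc", "Abc", "abc.def" and the empty string are rejected.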