Module gamslib.formatdetect

File format detection utilities for GAMS projects.

This submodule provides functions and classes to detect the format of files and return a FormatInfo object describing the detected format.

The following detectors are currently available:

  • MinimalDetector: Uses the standard-library mimetypes module to identify file formats based on file extensions. This is a minimal detector and should be used as a fallback only.
  • MagikaDetector: Uses the Google Magika library to identify file formats based on file content. This is the preferred detector and should be used by default.
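
To illustrate why extension-based detection is best kept as a fallback, the following sketch calls the standard-library mimetypes module directly. The helper name guess_by_extension and the 'application/octet-stream' fallback are assumptions made for this example; they are not taken from gamslib.

```python
import mimetypes
from pathlib import Path


def guess_by_extension(filepath: Path) -> str:
    """Guess a MIME type from the file name alone (hypothetical helper)."""
    # guess_type() never opens the file; it only inspects the name.
    mimetype, _encoding = mimetypes.guess_type(filepath.name)
    # Assumed fallback for extensions the module does not know.
    return mimetype or "application/octet-stream"


print(guess_by_extension(Path("metadata.json")))
print(guess_by_extension(Path("datastream.unknownext")))
```

Because the file is never read, a PNG image saved as image.json would still be reported as JSON. A content-based detector avoids this weakness.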

All detectors implement the FormatDetector abstract base class and return FormatInfo objects with the detected format information. The FormatInfo object includes the MIME type, detector name, and the subformat name if applicable. The subformat is determined by heuristics based on the MIME type and file content.
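
The relationship between a detector and its result can be sketched with stand-in classes. Everything below is hypothetical: FormatInfoSketch, FormatDetectorSketch, PrefixDetector and the field names are invented for illustration. Only the method name guess_file_type is taken from the source code shown on this page.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FormatInfoSketch:
    """Hypothetical stand-in for FormatInfo; the field names are assumed."""

    detector: str
    mimetype: str
    subtype: Optional[str] = None


class FormatDetectorSketch(ABC):
    """Hypothetical stand-in for the FormatDetector base class."""

    @abstractmethod
    def guess_file_type(self, filepath: Path) -> FormatInfoSketch:
        """Return format information for the given file."""


class PrefixDetector(FormatDetectorSketch):
    """Toy content-based detector that inspects the first bytes of a file."""

    def guess_file_type(self, filepath: Path) -> FormatInfoSketch:
        with filepath.open("rb") as stream:
            head = stream.read(64).lstrip()
        if head.startswith(b"<"):
            mimetype = "application/xml"
        elif head.startswith((b"{", b"[")):
            mimetype = "application/json"
        else:
            mimetype = "application/octet-stream"
        return FormatInfoSketch(detector="PrefixDetector", mimetype=mimetype)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.dat"
    sample.write_text('{"title": "demo"}', encoding="utf-8")
    print(PrefixDetector().guess_file_type(sample))
```

A real content-based detector such as Magika uses a far more robust approach than checking a leading byte. The sketch only shows the shape of the interface.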

Currently supported subformats include:

  • XML subformats
  • JSON subformats
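
Subformat heuristics of this kind might look roughly like the sketch below. The two rules shown (a TEI root element in the TEI namespace, and a top-level "@context" key for JSON-LD) are simplified examples chosen for illustration. This page does not state which subformats gamslib recognises or how, and the function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from typing import Optional

TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"


def guess_xml_subformat(xml_text: str) -> Optional[str]:
    """Classify an XML document by its root element (illustrative rule)."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{namespace}localname".
    if root.tag == f"{{{TEI_NAMESPACE}}}TEI":
        return "TEI"
    return None


def guess_json_subformat(json_text: str) -> Optional[str]:
    """Classify a JSON document by a marker key (illustrative rule)."""
    data = json.loads(json_text)
    if isinstance(data, dict) and "@context" in data:
        return "JSON-LD"
    return None


print(guess_xml_subformat(f'<TEI xmlns="{TEI_NAMESPACE}"><teiHeader/></TEI>'))
print(guess_json_subformat('{"@context": "https://schema.org", "name": "demo"}'))
```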

Features

  • detect_format(): Main function to detect the format of a file.
  • Detector selection based on configuration ('general.format_detector').
  • Support for multiple detectors (e.g., MagikaDetector, MinimalDetector).
  • Extensible for future REST-based detectors (e.g., FITS).

Usage

Use detect_format(filepath) to get format information for a file. The detector is chosen automatically based on the configuration, but it can be set explicitly for testing.
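
A minimal usage sketch follows. The import is guarded so that the snippet also runs where gamslib is not installed. The helper describe_file is hypothetical, and the result is only converted to a string because the attribute names of FormatInfo are not listed on this page.

```python
import tempfile
from pathlib import Path

try:
    from gamslib.formatdetect import detect_format
except ImportError:
    detect_format = None  # gamslib is not installed in this environment


def describe_file(filepath: Path) -> str:
    """Return a printable description of the detected format."""
    if detect_format is None:
        return "gamslib is not available"
    return str(detect_format(filepath))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "object.json"
    sample.write_text('{"title": "demo"}', encoding="utf-8")
    print(describe_file(sample))
```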

Configuration

  • 'general.format_detector': Name of the detector to use (default: 'magika').
  • 'general.format_detector_url': Optional URL for REST-based detectors.

Future

Additional detectors and REST-based services may be supported.

Sub-modules

gamslib.formatdetect.formatdetector

Abstract base class for format detectors …

gamslib.formatdetect.formatinfo

Describes the format of a file …

gamslib.formatdetect.jsontypes

Module to inspect and classify JSON files …

gamslib.formatdetect.magikadetector

A detector that uses the Google Magika library to detect file formats …

gamslib.formatdetect.minimaldetector

A detector that uses the mimetypes module to detect file formats …

gamslib.formatdetect.xmltypes

Functions and data for detecting XML file types and subtypes …

Functions

def detect_format(filepath: pathlib.Path) -> FormatInfo
def detect_format(filepath: Path) -> FormatInfo:
    """
    Detect the format of a file and return a FormatInfo object describing the format.

    Args:
        filepath (Path): Path to the file to detect format for.

    Returns:
        FormatInfo: Object containing format information for the file.

    Notes:
        - Detector is chosen based on the configuration setting 'general.format_detector'.
        - If no configuration is found, the default detector is used.
        - Explicit detector selection is only needed for testing or special cases.
    """
    try:
        config = get_configuration()
        detector = make_detector(
            config.general.format_detector, config.general.format_detector_url
        )
        return detector.guess_file_type(filepath)
    except MissingConfigurationException:
        # If no configuration is found, fall back to the default detector.
        detector = make_detector(DEFAULT_DETECTOR_NAME)
        return detector.guess_file_type(filepath)

Detect the format of a file and return a FormatInfo object describing the format.

Args

filepath : Path
Path to the file to detect format for.

Returns

FormatInfo
Object containing format information for the file.

Notes

  • Detector is chosen based on the configuration setting 'general.format_detector'.
  • If no configuration is found, the default detector is used.
  • Explicit detector selection is only needed for testing or special cases.

def make_detector(detector_name: str, detector_url: str = '') -> FormatDetector
@lru_cache
def make_detector(detector_name: str, detector_url: str = "") -> FormatDetector:
    """
    Return a detector object based on the given name and optional URL.

    Args:
        detector_name (str): Name of the detector to use ('base', 'magika', etc.).
        detector_url (str): Optional URL for REST-based detectors.

    Returns:
        FormatDetector: An instance of the selected detector.

    Raises:
        ValueError: If the detector name is unknown.

    Notes:
        - If detector_name is an empty string, the default detector is used.
        - Future detectors may require checking for software or service availability.
    """
    # TODO: as soon as we have detectors which depend on installed software or available
    #       services, we must check for availability if no explicit detector is given
    detector = None
    if detector_name == "":
        detector_name = DEFAULT_DETECTOR_NAME
    if detector_name == "base":
        detector = MinimalDetector()
    elif detector_name == "magika":
        detector = MagikaDetector()
    # TODO: add more detectors
    if detector is None:
        raise ValueError(f"Unknown detector '{detector_name}'")
    return detector

Return a detector object based on the given name and optional URL.

Args

detector_name : str
Name of the detector to use ('base', 'magika', etc.).
detector_url : str
Optional URL for REST-based detectors.

Returns

FormatDetector
An instance of the selected detector.

Raises

ValueError
If the detector name is unknown.

Notes

  • If detector_name is an empty string, the default detector is used.
  • Future detectors may require checking for software or service availability.
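
The effect of combining a name-based factory with lru_cache can be seen in the following self-contained sketch. The stand-in classes, the REGISTRY dictionary and the name make_detector_sketch are hypothetical. The real function uses an if/elif chain and the real detector classes.

```python
from functools import lru_cache


class MinimalStandIn:
    """Placeholder for MinimalDetector (not the real class)."""


class MagikaStandIn:
    """Placeholder for MagikaDetector (not the real class)."""


DEFAULT_NAME = "magika"  # assumed to match the documented default
REGISTRY = {"base": MinimalStandIn, "magika": MagikaStandIn}


@lru_cache
def make_detector_sketch(detector_name: str, detector_url: str = ""):
    """Return a cached detector instance for the given name."""
    name = detector_name or DEFAULT_NAME
    try:
        detector_class = REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown detector '{name}'") from None
    return detector_class()


first = make_detector_sketch("magika")
second = make_detector_sketch("magika")
print(first is second)  # True: the second call is served from the cache

try:
    make_detector_sketch("fits")
except ValueError as error:
    print(error)  # Unknown detector 'fits'
```

The cache is keyed on the arguments as passed, so calling the factory with "" and with "magika" yields two separately cached instances even though both resolve to the same detector class. Exceptions are not cached, so an unknown name raises ValueError on every call.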