Module gamslib.formatdetect.magikadetector
A detector that uses the Google Magika library to detect file formats.
This module provides the MagikaDetector class, which uses Magika to identify file formats. It includes logic to correct common misclassifications and to integrate with GAMSlib's format detection infrastructure.
Classes
class MagikaDetector
class MagikaDetector(FormatDetector):
    """
    Detector that uses the Google Magika library to detect file formats.

    Uses Magika's prediction engine to identify file types and MIME types.
    Applies corrections for known misclassifications (e.g., JSON-LD as JavaScript).
    """

    def __init__(self):
        """
        Initialize the MagikaDetector with Magika's BEST_GUESS prediction mode.
        """
        self._magika_object = Magika(prediction_mode=PredictionMode.BEST_GUESS)

    @staticmethod
    def _fix_result(
        file_to_validate: Path, label: str, mime_type: str
    ) -> tuple[str, str]:
        """
        Fix common misclassifications returned by Magika.

        Args:
            file_to_validate (Path): Path to the file being validated.
            label (str): Label returned by Magika.
            mime_type (str): MIME type returned by Magika.

        Returns:
            tuple[str, str]: Corrected (label, mime_type).

        Notes:
            - Changes 'javascript' label to 'json' for .jsonld/.json files.
            - Converts 'text/xml' MIME type to 'application/xml'.
        """
        if label == "javascript" and file_to_validate.suffix in [".jsonld", ".json"]:
            label = "json"
            mime_type = "application/json"
        if mime_type == "text/xml":
            mime_type = "application/xml"
        return label, mime_type

    def guess_file_type(self, filepath: Path) -> FormatInfo:
        """
        Detect the format of a file using Magika and return a FormatInfo object.

        Args:
            filepath (Path): Path to the file to analyze.

        Returns:
            FormatInfo: Object containing detected format information.

        Notes:
            - Applies corrections for known Magika misclassifications.
            - Uses DEFAULT_TYPE if Magika cannot determine the MIME type.
            - Integrates with xmltypes and jsontypes for subtype detection.
        """
        detector_name = self.__class__.__name__
        subtype = None
        try:
            result = self._magika_object.identify_path(filepath)
            _, mime_type = self._fix_result(
                filepath, result.dl.label, result.dl.mime_type
            )
        except ValueError:
            mime_type = None

        if mime_type is None or mime_type == "application/undefined":
            mime_type = DEFAULT_TYPE
            warnings.warn(
                f"Could not determine mimetype for {filepath}. Using default type."
            )
        elif xmltypes.is_xml_type(mime_type):
            mime_type, subtype = xmltypes.get_format_info(filepath, mime_type)
        elif jsontypes.is_json_type(mime_type):
            mime_type, subtype = jsontypes.get_format_info(filepath, mime_type)
        return FormatInfo(detector=detector_name, mimetype=mime_type, subtype=subtype)

    def __repr__(self):
        """
        Return a string representation of the MagikaDetector.

        Returns:
            str: "MagikaDetector"
        """
        return "MagikaDetector"

Detector that uses the Google Magika library to detect file formats.
Uses Magika's prediction engine to identify file types and MIME types. Applies corrections for known misclassifications (e.g., JSON-LD as JavaScript).
Initialize the MagikaDetector with Magika's BEST_GUESS prediction mode.
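The corrections for known misclassifications can be tried in isolation. The following is a minimal standalone sketch that mirrors the rules of `_fix_result` from the source listing. The function name `fix_result` and the sample file names are illustrative only, and Magika itself is not called.

```python
from pathlib import Path


def fix_result(file_to_validate: Path, label: str, mime_type: str) -> tuple[str, str]:
    """Standalone copy of the correction rules, for illustration only."""
    # JSON and JSON-LD files may come back labelled as JavaScript.
    if label == "javascript" and file_to_validate.suffix in [".jsonld", ".json"]:
        label = "json"
        mime_type = "application/json"
    # Normalize the XML MIME type to 'application/xml'.
    if mime_type == "text/xml":
        mime_type = "application/xml"
    return label, mime_type


# Illustrative inputs: the labels and MIME types stand in for Magika output.
corrected_jsonld = fix_result(Path("context.jsonld"), "javascript", "application/javascript")
corrected_xml = fix_result(Path("object.xml"), "xml", "text/xml")
untouched_js = fix_result(Path("script.js"), "javascript", "application/javascript")
```

Note that the suffix comparison is case-sensitive, so a file named `DATA.JSON` would not be corrected by this rule.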
Ancestors
- FormatDetector
- abc.ABC
Methods
def guess_file_type(self, filepath: pathlib.Path) -> FormatInfo
def guess_file_type(self, filepath: Path) -> FormatInfo:
    """
    Detect the format of a file using Magika and return a FormatInfo object.

    Args:
        filepath (Path): Path to the file to analyze.

    Returns:
        FormatInfo: Object containing detected format information.

    Notes:
        - Applies corrections for known Magika misclassifications.
        - Uses DEFAULT_TYPE if Magika cannot determine the MIME type.
        - Integrates with xmltypes and jsontypes for subtype detection.
    """
    detector_name = self.__class__.__name__
    subtype = None
    try:
        result = self._magika_object.identify_path(filepath)
        _, mime_type = self._fix_result(
            filepath, result.dl.label, result.dl.mime_type
        )
    except ValueError:
        mime_type = None

    if mime_type is None or mime_type == "application/undefined":
        mime_type = DEFAULT_TYPE
        warnings.warn(
            f"Could not determine mimetype for {filepath}. Using default type."
        )
    elif xmltypes.is_xml_type(mime_type):
        mime_type, subtype = xmltypes.get_format_info(filepath, mime_type)
    elif jsontypes.is_json_type(mime_type):
        mime_type, subtype = jsontypes.get_format_info(filepath, mime_type)
    return FormatInfo(detector=detector_name, mimetype=mime_type, subtype=subtype)

Detect the format of a file using Magika and return a FormatInfo object.
Args
filepath (Path): Path to the file to analyze.
Returns
FormatInfo: Object containing detected format information.
Notes
- Applies corrections for known Magika misclassifications.
- Uses DEFAULT_TYPE if Magika cannot determine the MIME type.
- Integrates with xmltypes and jsontypes for subtype detection.
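The fallback behaviour described in these notes can be sketched without Magika installed. In this minimal illustration, `identify` is a hypothetical stand-in for the Magika call, `guess_mime_type` is an illustrative name, and the value of `DEFAULT_TYPE` is an assumption (the real constant is defined elsewhere in GAMSlib and is not shown on this page). Subtype detection via xmltypes and jsontypes is left out.

```python
import warnings
from pathlib import Path
from typing import Callable, Optional

# Assumed value for illustration; the real DEFAULT_TYPE is defined in GAMSlib.
DEFAULT_TYPE = "application/octet-stream"


def guess_mime_type(filepath: Path, identify: Callable[[Path], Optional[str]]) -> str:
    """Return the detected MIME type, or DEFAULT_TYPE if detection fails."""
    try:
        mime_type = identify(filepath)
    except ValueError:
        # Mirrors the source listing: a ValueError is treated as "no result".
        mime_type = None
    if mime_type is None or mime_type == "application/undefined":
        warnings.warn(f"Could not determine mimetype for {filepath}. Using default type.")
        return DEFAULT_TYPE
    return mime_type


def failing_identify(path: Path) -> Optional[str]:
    # Hypothetical stand-in that simulates a detection failure.
    raise ValueError("cannot identify")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = guess_mime_type(Path("mystery.bin"), failing_identify)
    undefined = guess_mime_type(Path("odd.dat"), lambda p: "application/undefined")
    detected = guess_mime_type(Path("image.png"), lambda p: "image/png")
```

Callers that want to treat the fallback as an error rather than a warning could wrap the call in `warnings.catch_warnings()` with `warnings.simplefilter("error")`, which turns the warning into an exception.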