Extending file types and data types

FileExtension and DataType are StrEnums with closed default sets: FileExtension provides CSV, XLSX and JSON, and DataType provides STRING, INTEGER, FLOAT, DATETIME, BOOLEAN, CATEGORICAL and INTERVAL. User projects can add support for additional file formats and data types without forking the package by following the short recipe below.

1. Add a new FileExtension

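StrEnums do not allow new members to be added at runtime, and an enum that already defines members cannot be subclassed. The supported extension pattern is to define a separate StrEnum for the new format and use it in your schemas:

from enum import StrEnum  # requires Python 3.11+


# A standalone enum; its string value is what the framework compares against.
class MyFileExtension(StrEnum):
    PARQUET = "parquet"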

Schemas accept any StrEnum-derived value via the existing _EXTENSION field (it is normalised to a string at use sites). Where the framework compares against the built-in FileExtension, the comparison is performed by string equality on the lower-cased value, so MyFileExtension.PARQUET == "parquet" works as expected.

2. Register an extractor for the new extension

Use the public register_extractor API to teach the framework how to extract data for the new (extension, schema_type) pair:

from algomancy_data import (
    register_extractor,
    SingleExtractor,
    SchemaType,
)


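class ParquetSingleExtractor(SingleExtractor):
    def _extract_file(self):
        # Imported lazily; pandas needs a Parquet engine such as pyarrow installed.
        import pandas as pd
        return pd.read_parquet(self.file.path)


# MyFileExtension is the enum defined in step 1.
register_extractor(MyFileExtension.PARQUET, SchemaType.SINGLE, ParquetSingleExtractor)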

After registration, ETLFactory.create_extraction_sequence() (and therefore SimpleETLFactory) will pick up the new extractor for any schema that declares _EXTENSION = MyFileExtension.PARQUET.

3. Add a new DataType (advanced)

DataType values are passed straight through to pandas via DataFrame.astype(dtype). To support a custom logical type:

  1. Define a separate StrEnum for the new type, as you did for FileExtension in step 1. DataType already has members, so it cannot be subclassed directly.

  2. Extend DataTypeConverter if the new type needs custom coercion beyond astype. The four built-in helpers (_convert_numeric_column, _convert_datetime_column, _convert_boolean_column, _convert_string_column) are the templates to follow. Each takes an optional issues buffer and a table_name, so that dtype-conversion failures surface as CONVERSION_FAILED validation messages instead of silent NaNs.

If your custom type plugs cleanly into pandas, no converter changes are needed; the registry-based dispatch handles the rest.
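
The issues-buffer pattern from item 2 can be sketched without the framework. Everything below is hypothetical: ConversionIssue and convert_percent_column are invented for illustration, the "CONVERSION_FAILED" string only mirrors the message name mentioned above, and the helper works on a plain list rather than a pandas column so that it runs with the standard library alone.

```python
from dataclasses import dataclass


@dataclass
class ConversionIssue:
    # Hypothetical record of a single failed conversion.
    code: str
    table_name: str
    column: str
    row: int
    value: object


def convert_percent_column(values, column, table_name, issues=None):
    """Parse strings like '12.5%' into fractions; record failures instead of hiding them."""
    converted = []
    for row, raw in enumerate(values):
        try:
            converted.append(float(str(raw).strip().rstrip("%")) / 100)
        except ValueError:
            # Keep the row, but make the failure visible to the caller.
            converted.append(None)
            if issues is not None:
                issues.append(
                    ConversionIssue("CONVERSION_FAILED", table_name, column, row, raw)
                )
    return converted


issues = []
result = convert_percent_column(["50%", "n/a", " 12.5% "], "discount", "orders", issues)

assert result == [0.5, None, 0.125]
assert [(i.code, i.row, i.value) for i in issues] == [("CONVERSION_FAILED", 1, "n/a")]
```

Making the buffer optional keeps the helper usable in contexts where no validation report is being collected, at the cost of failures going unreported there.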

4. Confirm the registration

registered_keys() returns every (FileExtension, SchemaType) pair the registry knows about, which is useful for sanity-checking at app startup:

from algomancy_data import registered_keys, SchemaType

# MyFileExtension is the enum defined in step 1.
assert (MyFileExtension.PARQUET, SchemaType.SINGLE) in registered_keys()
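
Since assert statements are skipped when Python runs with the -O flag, a startup check may be better written as an explicit test that raises. The following is a minimal sketch under stated assumptions: missing_registrations and check_registrations are hypothetical helpers, the pairs are shown as plain strings, and in real use the registered argument would presumably be the result of registered_keys().

```python
def _normalise(pair):
    # Compare on lower-cased strings so enum members and plain strings match.
    extension, schema_type = pair
    return (str(extension).lower(), str(schema_type).lower())


def missing_registrations(required, registered):
    """Return the required (extension, schema_type) pairs absent from `registered`."""
    known = {_normalise(pair) for pair in registered}
    return [pair for pair in required if _normalise(pair) not in known]


def check_registrations(required, registered):
    """Raise at startup if any required pair has no extractor."""
    missing = missing_registrations(required, registered)
    if missing:
        raise RuntimeError(f"extractors not registered: {missing}")


# Illustrative data only; real keys would come from the framework.
registered = [("csv", "single"), ("parquet", "single")]

check_registrations([("parquet", "single")], registered)  # passes silently
assert missing_registrations([("parquet", "multi")], registered) == [("parquet", "multi")]
```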