(extending-ref)= # Extending file types and data types `FileExtension` and `DataType` are `StrEnum`s with a closed default set (`CSV`, `XLSX`, `JSON` and `STRING`, `INTEGER`, `FLOAT`, `DATETIME`, `BOOLEAN`, `CATEGORICAL`, `INTERVAL`). User projects can add support for additional file formats without forking the package by following the small recipe below. ## 1. Add a new `FileExtension` `StrEnum`s do not natively allow new members at runtime. The supported extension pattern is to subclass the enum and use the subclass in your schemas: ```python from enum import StrEnum from algomancy_data import FileExtension class MyFileExtension(StrEnum): PARQUET = "parquet" ``` > Schemas accept any `StrEnum`-derived value via the existing > `_EXTENSION` field (it is normalised to a string at use sites). Where > the framework compares against the built-in `FileExtension`, the > comparison is performed by string equality on the lower-cased value, > so `MyFileExtension.PARQUET == "parquet"` works as expected. ## 2. Register an extractor for the new extension Use the public `register_extractor` API to teach the framework how to extract data for the new `(extension, schema_type)` pair: ```python from algomancy_data import ( register_extractor, SingleExtractor, SchemaType, ) class ParquetSingleExtractor(SingleExtractor): def _extract_file(self): import pandas as pd return pd.read_parquet(self.file.path) register_extractor(MyFileExtension.PARQUET, SchemaType.SINGLE, ParquetSingleExtractor) ``` After registration, `ETLFactory.create_extraction_sequence()` (and therefore `SimpleETLFactory`) will pick up the new extractor for any schema that declares `_EXTENSION = MyFileExtension.PARQUET`. ## 3. Add a new `DataType` (advanced) `DataType` values are passed straight through to pandas via `DataFrame.astype(dtype)`. To support a custom logical type: 1. Subclass `DataType` the same way you subclassed `FileExtension`. 2. Extend `DataTypeConverter` if the new type needs custom coercion beyond `astype`. The four built-in helpers (`_convert_numeric_column`, `_convert_datetime_column`, `_convert_boolean_column`, `_convert_string_column`) are the templates to follow; each takes an optional `issues` buffer and a `table_name` so dtype-conversion failures surface as `CONVERSION_FAILED` validation messages instead of silent NaNs. If your custom type plugs cleanly into pandas, no converter changes are needed — the registry-based dispatch handles the rest. ## 4. Confirm the registration `registered_keys()` returns every `(FileExtension, SchemaType)` pair the registry knows about, which is useful for sanity-checking at app startup: ```python from algomancy_data import registered_keys assert (MyFileExtension.PARQUET, SchemaType.SINGLE) in registered_keys() ```