Extending file types and data types
FileExtension and DataType are StrEnums, each with a closed default set of
members: CSV, XLSX and JSON for FileExtension; STRING, INTEGER, FLOAT,
DATETIME, BOOLEAN, CATEGORICAL and INTERVAL for DataType. User projects can
add support for additional file formats and data types without forking the
package by following the short recipe below.
1. Add a new FileExtension
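StrEnums do not allow new members to be added at runtime, and Python does
not allow an enum that already defines members to be subclassed. The
supported extension pattern is therefore to define your own StrEnum next to
the built-in one and use it in your schemas: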
from enum import StrEnum
from algomancy_data import FileExtension
class MyFileExtension(StrEnum):
    PARQUET = "parquet"
Schemas accept any StrEnum-derived value via the existing _EXTENSION field
(it is normalised to a string at use sites). Where the framework compares
against the built-in FileExtension, the comparison is performed by string
equality on the lower-cased value, so MyFileExtension.PARQUET == "parquet"
works as expected.
2. Register an extractor for the new extension
Use the public register_extractor API to teach the framework how to
extract data for the new (extension, schema_type) pair:
from algomancy_data import (
    register_extractor,
    SingleExtractor,
    SchemaType,
)
class ParquetSingleExtractor(SingleExtractor):
    def _extract_file(self):
        # pandas needs pyarrow or fastparquet installed to read Parquet files.
        import pandas as pd
        return pd.read_parquet(self.file.path)
# MyFileExtension is the StrEnum defined in step 1.
register_extractor(MyFileExtension.PARQUET, SchemaType.SINGLE, ParquetSingleExtractor)
After registration, ETLFactory.create_extraction_sequence() (and
therefore SimpleETLFactory) will pick up the new extractor for any
schema that declares _EXTENSION = MyFileExtension.PARQUET.
3. Add a new DataType (advanced)
DataType values are passed straight through to pandas via
DataFrame.astype(dtype). To support a custom logical type:
- Define a StrEnum for the new type, the same way as for FileExtension in step 1.
- Extend DataTypeConverter if the new type needs custom coercion beyond astype. The four built-in helpers (_convert_numeric_column, _convert_datetime_column, _convert_boolean_column, _convert_string_column) are the templates to follow; each takes an optional issues buffer and a table_name so dtype-conversion failures surface as CONVERSION_FAILED validation messages instead of silent NaNs.
If your custom type plugs cleanly into pandas, no converter changes are needed; the registry-based dispatch handles the rest.
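The pattern of an optional issues buffer plus a table_name can be sketched without pandas. Everything here is an assumption made for illustration: the Issue record, the convert_percentage_column helper and its signature are hypothetical and do not mirror the real DataTypeConverter helpers, whose signatures are not shown in this document. The point is only that a failed coercion is recorded as CONVERSION_FAILED instead of silently becoming a missing value.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical validation message record."""
    code: str
    table: str
    column: str
    row: int
    value: object


def convert_percentage_column(values, column, table_name, issues=None):
    """Coerce values such as '12.5%' to fractions, recording failures.

    Hypothetical helper: unparseable values become None, and each one
    is reported in the optional issues buffer.
    """
    converted = []
    for row, raw in enumerate(values):
        try:
            converted.append(float(str(raw).strip().rstrip("%")) / 100.0)
        except ValueError:
            converted.append(None)
            if issues is not None:
                issues.append(
                    Issue("CONVERSION_FAILED", table_name, column, row, raw)
                )
    return converted


issues = []
result = convert_percentage_column(
    ["12.5%", "abc", "50"], column="discount", table_name="orders", issues=issues
)
```

Passing issues=None keeps the helper usable in contexts where no validation report is being assembled.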
4. Confirm the registration
registered_keys() returns every (FileExtension, SchemaType) pair the
registry knows about, which is useful for sanity-checking at app
startup:
from algomancy_data import registered_keys
assert (MyFileExtension.PARQUET, SchemaType.SINGLE) in registered_keys()
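Because assert statements are removed when Python runs with the -O flag, a startup check may be better written as an explicit test that raises. The helper below is a hypothetical sketch, not part of algomancy_data; it takes the registered pairs as an argument, so in a real project the second argument would be the result of registered_keys().

```python
def check_registered(required, registered):
    """Raise RuntimeError if any required (extension, schema_type) pair is missing.

    Hypothetical helper: `registered` is whatever collection of pairs
    the registry reports.
    """
    missing = sorted(set(required) - set(registered))
    if missing:
        raise RuntimeError(f"no extractor registered for: {missing}")


# Plain strings stand in for enum members to keep the sketch self-contained.
known = {("csv", "single"), ("parquet", "single")}
check_registered([("parquet", "single")], known)  # passes silently
```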