ETL-Factory

ETL pipeline composition and abstract factory.

This module defines ETLPipeline, which orchestrates the Extract-Validate-Transform-Load steps, and ETLFactory, a classmethod-based abstract factory whose subclasses wire up the four pipeline components for a given dataset configuration.

ETLPipeline.run() returns an ETLResult describing the outcome of the job. Data-quality failures (validation, missing/malformed inputs) are reported via status='failed' rather than as exceptions. Programmer errors (unexpected KeyError/AttributeError/TypeError etc. from user-supplied components) still propagate so that real defects are not masked.

class algomancy_data.etl.ETLResult(status, datasource=None, validation_result=None, raised=None)

Bases: object

Structured outcome of an ETLPipeline.run() invocation.

status: Literal['success', 'failed'] – 'success' if the run completed and validation passed; 'failed' if a data-quality issue was detected.

datasource: BASEDATASOURCE | None = None – Loaded destination object (None on failure).

validation_result: ValidationResult | None = None – Messages and counts from the validation step. Always present, even when extraction never produced data.

raised: Exception | None = None – The original exception when a recognised data-quality exception was caught and converted to a failure; None otherwise. Programmer errors are not captured here; they propagate from run() unchanged.

property is_success: bool

property is_failure: bool

property messages: List[ValidationMessage] – Convenience accessor for validation_result.messages.
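The result object is meant to be inspected rather than caught. The following is a minimal sketch of that usage, not the library's implementation: ETLResultSketch, ValidationResult, ValidationMessage and summarize are hypothetical stand-ins that only mirror the field and property names documented above, and the behaviour of the three properties (including returning an empty list when validation_result is None) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional


@dataclass
class ValidationMessage:
    """Hypothetical stand-in: one message emitted by a validator."""
    severity: str
    text: str


@dataclass
class ValidationResult:
    """Hypothetical stand-in; only the `messages` attribute comes from the docs."""
    messages: List[ValidationMessage] = field(default_factory=list)


@dataclass
class ETLResultSketch:
    """Mirrors the documented fields; not the library's class."""
    status: Literal["success", "failed"]
    datasource: Optional[object] = None
    validation_result: Optional[ValidationResult] = None
    raised: Optional[Exception] = None

    @property
    def is_success(self) -> bool:
        return self.status == "success"

    @property
    def is_failure(self) -> bool:
        return self.status == "failed"

    @property
    def messages(self) -> List[ValidationMessage]:
        # Assumption: an absent validation_result yields an empty list.
        if self.validation_result is None:
            return []
        return self.validation_result.messages


def summarize(result: ETLResultSketch) -> str:
    """Branch on the outcome instead of wrapping the run in try/except."""
    if result.is_success:
        return "loaded"
    return "failed: " + "; ".join(m.text for m in result.messages)


failed = ETLResultSketch(
    status="failed",
    validation_result=ValidationResult(
        [ValidationMessage("error", "missing column 'id'")]
    ),
)
print(summarize(failed))
```

With the real classes, the same branching would apply to the value returned by ETLPipeline.run().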

class algomancy_data.etl.ETLPipeline(destination_name, extraction_sequence, validation_sequence, transformation_sequence, loader, logger)

Bases: object

Coordinates a single end-to-end ETL job.

run()

Execute the ETL job and return an ETLResult.

Orchestrates Extraction → Validation → Transformation → Load.

Returns:

An ETLResult with status='success' and a loaded datasource when the job completes and validation passes; status='failed' (with messages on validation_result) when a data-quality issue is detected.

Return type:

ETLResult

Raises:

Exception – Programmer errors (KeyError, AttributeError, TypeError and anything else not classified as an expected data-quality failure) propagate so that real defects are not masked. Use validators for data-quality checks instead.
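The split between reported failures and propagating errors can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: DataQualityError, Outcome and run_job are hypothetical names, and the documentation does not say which exception types the library classifies as data-quality failures.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional


class DataQualityError(Exception):
    """Hypothetical marker for expected data-quality failures."""


@dataclass
class Outcome:
    """Reduced stand-in for ETLResult."""
    status: Literal["success", "failed"]
    datasource: Optional[object] = None
    raised: Optional[Exception] = None


def run_job(extract: Callable[[], dict]) -> Outcome:
    try:
        data = extract()
    except DataQualityError as exc:
        # Expected failure: report it through the result object.
        return Outcome(status="failed", raised=exc)
    # Anything else (KeyError, TypeError, ...) is not caught and propagates.
    return Outcome(status="success", datasource=data)


def missing_file() -> dict:
    raise DataQualityError("orders.csv not found")


def buggy_extractor() -> dict:
    return {}["typo"]  # KeyError: a defect in component code


result = run_job(missing_file)
print(result.status, type(result.raised).__name__)

try:
    run_job(buggy_extractor)
except KeyError:
    print("KeyError propagated")
```

Catching only the narrow, expected exception type is what keeps a typo in a user-supplied component from being silently reported as bad data.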

algomancy_data.etl.get_schema(file_name, schemas)

Return schema(s) for the given file name based on configuration.

Parameters:

file_name (str) – Logical file name as defined in a schema.
schemas (dict[str, Schema]) – Mapping of logical schema names to Schema objects.

Returns:

Schema or mapping of sub-name to Schema depending on the configuration type (single or multi).

Raises:

ETLConstructionError – If no configuration exists or it is invalid.

Return type:

Schema

exception algomancy_data.etl.ETLConstructionError(message)

Bases: Exception

Raised when the ETL pipeline cannot be constructed.

class algomancy_data.etl.ETLFactory

Bases: ABC

Abstract classmethod-based factory for ETL pipeline components.

Subclasses implement the four create_* classmethods to wire up extraction, validation, transformation, and loading for a dataset. build_pipeline orchestrates them in order and is the only entry point callers need.

Because the factory carries no instance state, it is always passed and used as a class (type[ETLFactory]), never instantiated.
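The pattern described above can be sketched in a few lines. This is a reduced illustration, not the library's code: FactorySketch, OrdersFactory and build are hypothetical, only two of the four create_* steps are shown, and the "pipeline" is a plain dict instead of an ETLPipeline.

```python
from abc import ABC, abstractmethod
from typing import List, Type


class FactorySketch(ABC):
    """Hypothetical and reduced; the documented factory has four create_* methods."""

    @classmethod
    @abstractmethod
    def create_validation_sequence(cls) -> List[str]: ...

    @classmethod
    @abstractmethod
    def create_loader(cls) -> str: ...

    @classmethod
    def build_pipeline(cls, dataset_name: str) -> dict:
        # Template method: calls whatever the subclass wired up.
        return {
            "destination": dataset_name,
            "validators": cls.create_validation_sequence(),
            "loader": cls.create_loader(),
        }


class OrdersFactory(FactorySketch):
    @classmethod
    def create_validation_sequence(cls) -> List[str]:
        return ["required_columns", "schema"]

    @classmethod
    def create_loader(cls) -> str:
        return "in_memory_loader"


def build(factory: Type[FactorySketch], dataset_name: str) -> dict:
    # The class itself is the argument; no instance is created.
    return factory.build_pipeline(dataset_name)


pipeline = build(OrdersFactory, "orders")
print(pipeline["validators"])
```

One caveat of this design: Python's ABC machinery checks for missing abstract methods only at instantiation, so a factory that is never instantiated will not be flagged for a forgotten create_* override until that method is called.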

abstractmethod classmethod create_extraction_sequence(files=None, schemas=None, logger=None)

abstractmethod classmethod create_validation_sequence(schemas, logger=None)

abstractmethod classmethod create_transformation_sequence(schemas=None, logger=None)

abstractmethod classmethod create_loader(logger=None)

classmethod build_pipeline(dataset_name, files, schemas, logger=None)

Assemble and return an ETLPipeline instance.

Parameters:

dataset_name (str) – Destination dataset name.
files (Dict[str, File]) – Mapping of logical file names to File objects.
schemas (Dict[str, Schema]) – Mapping of logical schema names to Schema objects.
logger (Logger | None) – Optional logger forwarded to each pipeline step.

Returns:

ETLPipeline ready to run.

Return type:

ETLPipeline

class algomancy_data.etl.SimpleETLFactory

Bases: ETLFactory

Concrete factory for the common case where no custom wiring is needed.

All four create_* methods have sensible defaults: registry-driven extraction, the three standard validators, a no-op transformation, and DataSourceLoader. Subclass and override only the methods that need non-default behaviour (e.g. a different CSV separator or extra validators).
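Overriding a single step while inheriting the rest might look like the sketch below. It assumes nothing about the real classes beyond the method names: DefaultFactorySketch stands in for a factory with defaults, validators are represented as plain strings, and PositiveQuantityValidator is a made-up domain validator.

```python
from typing import List


class DefaultFactorySketch:
    """Hypothetical stand-in for a factory whose create_* methods have defaults."""

    @classmethod
    def create_validation_sequence(cls, schemas: dict) -> List[str]:
        return ["RequiredColumnsValidator", "SchemaValidator", "PrimaryKeyValidator"]

    @classmethod
    def create_transformation_sequence(cls) -> List[str]:
        return []  # no-op: nothing to transform


class OrdersFactoryWithExtras(DefaultFactorySketch):
    @classmethod
    def create_validation_sequence(cls, schemas: dict) -> List[str]:
        # Reuse the defaults and append one hypothetical domain validator.
        defaults = super().create_validation_sequence(schemas)
        return defaults + ["PositiveQuantityValidator"]


print(OrdersFactoryWithExtras.create_validation_sequence({}))
print(OrdersFactoryWithExtras.create_transformation_sequence())
```

Calling super() inside the classmethod keeps the default validators in their original order, so the subclass only states what it adds.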

classmethod create_extraction_sequence(files=None, schemas=None, logger=None)

Default extractor wiring keyed off the registry.

For each File in files, looks up the matching schema by name and selects an extractor class via get_extractor_class on (extension, schema_type). Override only when you need non-default extractor parameters (e.g. CSV separator).

Raises:

ETLConstructionError – If no extractor is registered for a schema’s (extension, schema_type) pair.
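A registry keyed on an (extension, schema_type) pair can be sketched as a dictionary lookup that converts a miss into a construction error. Everything here is hypothetical: lookup_extractor is not get_extractor_class (whose signature is not shown in this reference), ConstructionErrorSketch stands in for ETLConstructionError, and the string values "single" and "csv" are assumed key formats.

```python
from typing import Dict, Tuple, Type


class ConstructionErrorSketch(Exception):
    """Stand-in for ETLConstructionError."""


class CSVExtractor:
    """Hypothetical extractor class."""


class JSONExtractor:
    """Hypothetical extractor class."""


# Hypothetical registry; keys follow the documented (extension, schema_type) pair.
_REGISTRY: Dict[Tuple[str, str], Type] = {
    ("csv", "single"): CSVExtractor,
    ("json", "single"): JSONExtractor,
}


def lookup_extractor(extension: str, schema_type: str) -> Type:
    """Return the registered extractor class or fail at construction time."""
    try:
        return _REGISTRY[(extension.lower(), schema_type)]
    except KeyError:
        raise ConstructionErrorSketch(
            f"no extractor registered for ({extension!r}, {schema_type!r})"
        ) from None


print(lookup_extractor("csv", "single").__name__)
```

Failing during pipeline construction, before any data is read, surfaces a misconfigured file type early instead of midway through a run.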

classmethod create_validation_sequence(schemas, logger=None)

Default validation sequence using the built-in validators.

Includes, in order, RequiredColumnsValidator, SchemaValidator, and PrimaryKeyValidator. PrimaryKeyValidator internally skips schemas with no declared primary key and decomposes MULTI schemas into per-group synthetic SINGLE schemas via _schema_table_map, so it is safe to append unconditionally. Subclasses can override this method to add domain-specific validators.
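The reason a primary-key check can be appended unconditionally is that it skips schemas without a declared key. The sketch below illustrates only that skip; it is not PrimaryKeyValidator. SchemaSketch and check_primary_keys are hypothetical, the check is assumed to be a uniqueness test, and the decomposition of MULTI schemas is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SchemaSketch:
    """Hypothetical schema: column names plus an optional primary key."""
    columns: List[str]
    primary_key: Optional[str] = None


def check_primary_keys(
    schemas: Dict[str, SchemaSketch], tables: Dict[str, List[dict]]
) -> List[str]:
    """Return one message per table with duplicate primary-key values."""
    messages = []
    for name, schema in schemas.items():
        if schema.primary_key is None:
            continue  # no key declared, so there is nothing to check
        keys = [row[schema.primary_key] for row in tables.get(name, [])]
        if len(keys) != len(set(keys)):
            messages.append(f"{name}: duplicate values in '{schema.primary_key}'")
    return messages


schemas = {
    "orders": SchemaSketch(["id", "qty"], primary_key="id"),
    "notes": SchemaSketch(["text"]),  # no primary key declared
}
tables = {
    "orders": [{"id": 1, "qty": 2}, {"id": 1, "qty": 5}],
    "notes": [{"text": "a"}, {"text": "a"}],
}
print(check_primary_keys(schemas, tables))
```

The duplicated rows in "notes" produce no message because that schema declares no key, which is the behaviour that makes the validator safe to include for every dataset.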

classmethod create_transformation_sequence(schemas=None, logger=None)

Return a no-op transformation sequence.

classmethod create_loader(logger=None)

Return the default DataSourceLoader.