Transformer
Transformation primitives for ETL pipelines.
Defines the abstract Transformer contract and a few simple concrete
transformers, as well as a TransformationSequence to compose multiple
transformers into a single pipeline step.
- class algomancy_data.transformer.Transformer(name='Abstract Transformer', logger=None)
Bases: ABC
Base class for a transformation step operating on tabular data.
Subclasses implement transform and can mutate the provided mapping of DataFrames in-place or return a new mapping where applicable.
- messages
ValidationMessages produced by this transformer during its most recent transform invocation. The ETL pipeline collects these from each transformer in the sequence and folds them into the run’s ValidationResult so they surface via ETLResult.messages.
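The subclassing contract described above can be sketched as follows. This is a minimal illustration, not the library's implementation: SketchTransformer and StripWhitespace are hypothetical stand-ins, the exact signature of transform is an assumption, tables are modelled as lists of row dicts instead of DataFrames, and messages are plain strings rather than ValidationMessage objects.

```python
from abc import ABC, abstractmethod


class SketchTransformer(ABC):
    """Hypothetical stand-in for the documented Transformer base class."""

    def __init__(self, name="Abstract Transformer", logger=None):
        self.name = name
        self.logger = logger
        self.messages = []  # messages from the most recent transform call

    @abstractmethod
    def transform(self, data):
        """Mutate `data` in place or return a new mapping (assumed signature)."""


class StripWhitespace(SketchTransformer):
    """Hypothetical subclass: trims string cells in every table."""

    def __init__(self, logger=None):
        super().__init__(name="Strip whitespace", logger=logger)

    def transform(self, data):
        self.messages = []  # keep only messages from this invocation
        for table, rows in data.items():
            changed = 0
            for row in rows:
                for key, value in row.items():
                    if isinstance(value, str) and value != value.strip():
                        row[key] = value.strip()
                        changed += 1
            if changed:
                self.messages.append(f"{table}: trimmed {changed} cell(s)")
        return data


# Tables are lists of row dicts here (an assumption made so the sketch
# runs without pandas).
tables = {"users": [{"id": 1, "name": " Ada "}, {"id": 2, "name": "Bob"}]}
step = StripWhitespace()
result = step.transform(tables)
```

Resetting messages at the start of transform matches the "most recent invocation" wording of the messages attribute.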
- algomancy_data.transformer.fill_empty(data)
Forward-fill missing values across the columns of each row.
- Parameters:
data (DataFrame) – DataFrame to fill.
- Returns:
DataFrame with values forward-filled along axis=1.
- Return type:
DataFrame
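To make the axis=1 semantics concrete, here is a small sketch of a row-wise forward fill in plain Python. The helper names are hypothetical and rows are modelled as lists with None for missing values. With pandas, DataFrame.ffill(axis=1) has the same effect, though whether fill_empty calls it internally is an assumption.

```python
def fill_row_forward(row):
    """Forward-fill None entries in one row, left to right."""
    filled = []
    last = None  # most recent non-missing value seen in this row
    for value in row:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fill_empty_sketch(rows):
    """Apply the row-wise forward fill to every row; returns new rows."""
    return [fill_row_forward(row) for row in rows]


rows = [
    [1, None, 3],
    [None, 5, None],
]
filled = fill_empty_sketch(rows)
```

A missing value at the start of a row stays missing, because nothing to its left can be carried forward.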
- algomancy_data.transformer.drop_empty(data)
Drop rows containing any NA values.
- Parameters:
data (DataFrame) – Input DataFrame.
- Returns:
DataFrame without rows containing NA values.
- Return type:
DataFrame
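The "any NA" rule can be sketched the same way. The helpers below are hypothetical; they treat None and float NaN as missing, which is a simplification of how pandas defines NA.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (simplified NA definition)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_empty_sketch(rows):
    """Keep only rows in which no value is missing."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


rows = [
    {"id": 1, "qty": 2.0},
    {"id": 2, "qty": None},
    {"id": 3, "qty": float("nan")},
]
kept = drop_empty_sketch(rows)
```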
- class algomancy_data.transformer.NoopTransformer(logger=None)
Bases: Transformer
Transformer that returns the input data unchanged.
- class algomancy_data.transformer.CleanTransformer(logger=None)
Bases: Transformer
Basic cleanup: drop NA rows and normalize column names to lowercase.
- class algomancy_data.transformer.JoinTransformer(left, right, on, output, logger=None)
Bases: Transformer
Join two input tables and write the result to a new table key.
- left
Name of the left table to join.
- right
Name of the right table to join.
- on
Column name to join on.
- output
Key under which the merged table is stored.
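The following sketch shows how the four attributes relate. The function join_tables is hypothetical, tables are lists of row dicts, and the join is assumed to be an inner join because this page does not state the join type. On column-name collisions the left table's value wins in this sketch, which may differ from the library's behaviour.

```python
def join_tables(data, left, right, on, output):
    """Inner-join data[left] with data[right] on column `on`.

    The merged rows are stored under data[output]; the inputs are kept.
    """
    # Index right-hand rows by join key for quick lookup.
    index = {}
    for row in data[right]:
        index.setdefault(row[on], []).append(row)

    merged = []
    for left_row in data[left]:
        for right_row in index.get(left_row[on], []):
            # Left values take precedence on overlapping column names.
            merged.append({**right_row, **left_row})
    data[output] = merged
    return data


data = {
    "orders": [
        {"order_id": 1, "customer_id": 10},
        {"order_id": 2, "customer_id": 99},  # no matching customer
    ],
    "customers": [{"customer_id": 10, "name": "Ada"}],
}
join_tables(data, left="orders", right="customers", on="customer_id",
            output="orders_enriched")
```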
- class algomancy_data.transformer.CascadeDropTransformer(schemas=None, extra_relations=None, snapshot=None, name='Cascade drop transformer', logger=None)
Bases: Transformer
Drop rows whose declared foreign-key relations are unsatisfied.
Reads relations from supplied schemas (default source of truth) and optionally merges extra_relations on top. Iterates to fixpoint, on each pass applying:
Orphan-child drop (always on): drop child rows whose FK tuple is not in the parent’s referenced column set.
Required-child parent drop: for relations with parent_requires_child=True, drop parent rows whose PK doesn’t appear in any child’s FK column.
Aggregated ValidationMessages are emitted with ValidationSeverity.ERROR, one per (table, rule, relation), with the dropped row count.
- Parameters:
schemas (Sequence[Type[Schema]] | None) – Schemas whose Column.foreign_key declarations supply the default relation set.
extra_relations (Sequence[Relation] | None) – Additional or override relations; override wins on matching (child_table, child_cols).
snapshot (CascadeSnapshot | None) – Optional CascadeSnapshot paired transformer. Used for partial-loss detection (see CascadeSnapshot).
name (str) – Override the transformer’s display name.
logger – Optional logger.
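The fixpoint iteration over the two rules can be sketched as below. This is an illustration under stated assumptions, not the transformer's code: relations are hypothetical plain dicts instead of Relation objects, foreign keys are single columns instead of tuples, tables are lists of row dicts, and the partial-loss rule tied to snapshot is omitted. The returned counts mirror the one-message-per-(table, rule, relation) aggregation.

```python
def cascade_drop(tables, relations):
    """Apply both drop rules until a full pass removes nothing."""
    dropped = {}  # (table, rule, relation name) -> dropped row count
    changed = True
    while changed:
        changed = False
        for rel in relations:
            child = tables[rel["child"]]
            parent = tables[rel["parent"]]

            # Rule 1 (always on): drop children whose FK has no parent.
            parent_keys = {row[rel["parent_col"]] for row in parent}
            kept = [r for r in child if r[rel["child_col"]] in parent_keys]
            if len(kept) != len(child):
                key = (rel["child"], "orphan_child", rel["name"])
                dropped[key] = dropped.get(key, 0) + len(child) - len(kept)
                tables[rel["child"]] = child = kept
                changed = True

            # Rule 2 (opt-in): drop parents that no child references.
            if rel.get("parent_requires_child"):
                child_keys = {r[rel["child_col"]] for r in child}
                kept_p = [r for r in parent if r[rel["parent_col"]] in child_keys]
                if len(kept_p) != len(parent):
                    key = (rel["parent"], "required_child", rel["name"])
                    dropped[key] = dropped.get(key, 0) + len(parent) - len(kept_p)
                    tables[rel["parent"]] = kept_p
                    changed = True
    return dropped


tables = {
    "customers": [{"id": 1}, {"id": 2}],
    "orders": [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 3}],
    "lines": [{"id": 100, "order_id": 10}, {"id": 101, "order_id": 11}],
}
relations = [
    {"name": "lines_orders", "child": "lines", "child_col": "order_id",
     "parent": "orders", "parent_col": "id"},
    {"name": "orders_customers", "child": "orders", "child_col": "customer_id",
     "parent": "customers", "parent_col": "id", "parent_requires_child": True},
]
dropped = cascade_drop(tables, relations)
```

In this data, order 11 is dropped in the first pass because customer 3 does not exist, and line 101 only becomes an orphan afterwards. A single pass would miss it, which is why the iteration runs to fixpoint.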
- class algomancy_data.transformer.CascadeSnapshot(schemas=None, extra_relations=None, logger=None)
Bases: Transformer
Captures referenced-child counts for partial-loss cascade detection.
A read-only transformer that, for every relation flagged track_partial_loss=True, records the number of referencing children per parent row. Paired with CascadeDropTransformer (passed via its snapshot= argument) to enable the partial-loss drop rule.
Place this transformer before any drop-capable transformer so it captures the pre-cleanup baseline.
- Parameters:
- class algomancy_data.transformer.OptionalColumnGuard(schemas, logger=None)
Bases: Transformer
Materialise missing optional columns using each Column.default.
Injects missing optional columns into the corresponding DataFrame in-place, using Column.default and coercing to the declared dtype. Downstream code can then assume the full schema is present.
- _schemas
Schemas whose optional columns may be injected.
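The injection step can be sketched with a column-oriented table. Everything named here is hypothetical: a table is a dict mapping column names to value lists, and each optional column is described by a (default, cast) pair standing in for Column.default and the declared dtype.

```python
def inject_optional_columns(table, optional_columns):
    """Add each missing optional column in place, filled with its default.

    `optional_columns` maps column name -> (default, cast), where `cast`
    coerces the default to the declared type.
    """
    # Row count is taken from any existing column (0 for an empty table).
    n_rows = len(next(iter(table.values()), []))
    for name, (default, cast) in optional_columns.items():
        if name not in table:  # existing columns are left untouched
            table[name] = [cast(default)] * n_rows
    return table


table = {"sku": ["A1", "B2"], "note": ["rush", "std"]}
optional = {
    "priority": ("0", int),  # default is coerced from "0" to 0
    "note": ("", str),       # already present, so not overwritten
}
inject_optional_columns(table, optional)
```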
- class algomancy_data.transformer.TransformationSequence(transformers=None, logger=None)
Bases: object
A sequence of transformers executed in order.
- run_transformation(data)
Run all transformers sequentially on a deepcopy of data.
- Parameters:
data (dict[str, DataFrame]) – Mapping of tables to DataFrames.
- Returns:
Transformed copy of the input mapping.
- Return type:
dict[str, DataFrame]
- collect_messages()
Aggregate ValidationMessages produced by all transformers.
Returns messages produced during the most recent run_transformation() invocation, in transformer order.
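The deep-copy-then-run behaviour and the message aggregation can be sketched as follows. SketchSequence, LowercaseKeys and DropNone are hypothetical stand-ins, tables are lists of row dicts, and messages are plain strings. Accepting either an in-place mutation or a returned mapping from each step is an assumption based on the Transformer description.

```python
from copy import deepcopy


class LowercaseKeys:
    """Toy step that returns a new mapping with lowercased column names."""

    def __init__(self):
        self.messages = []

    def transform(self, data):
        self.messages = [f"{t}: lowercased column names" for t in data]
        return {t: [{k.lower(): v for k, v in row.items()} for row in rows]
                for t, rows in data.items()}


class DropNone:
    """Toy step that mutates in place, dropping rows with a None value."""

    def __init__(self):
        self.messages = []

    def transform(self, data):
        self.messages = []
        for table, rows in data.items():
            kept = [r for r in rows if None not in r.values()]
            if len(kept) != len(rows):
                self.messages.append(f"{table}: dropped {len(rows) - len(kept)} row(s)")
            rows[:] = kept  # in-place update, nothing returned


class SketchSequence:
    """Hypothetical stand-in for TransformationSequence."""

    def __init__(self, transformers=None):
        self.transformers = list(transformers or [])

    def run_transformation(self, data):
        working = deepcopy(data)  # the caller's mapping is never touched
        for step in self.transformers:
            result = step.transform(working)
            if result is not None:  # step returned a new mapping
                working = result
        return working

    def collect_messages(self):
        # Messages from the latest run, in transformer order.
        return [m for step in self.transformers for m in step.messages]


raw = {"users": [{"ID": 1, "Name": "Ada"}, {"ID": 2, "Name": None}]}
seq = SketchSequence([LowercaseKeys(), DropNone()])
out = seq.run_transformation(raw)
messages = seq.collect_messages()
```

Working on a deep copy means a failed or partial run cannot corrupt the caller's input mapping.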