Migration guide¶
Releases 0.6 through 0.8 delivered three coordinated overhauls of the ETL machinery: the schema API (v0.6.0), the validation framework (v0.7.0), and ETL termination (v0.8.0). This page lists the breaking changes together with the minimal before/after snippets you need to migrate.
v0.6.0 — Schema API modernization¶
_DATATYPES → Column instances¶
The new declarative Column carries dtype together with optional
metadata (optional, primary_key, default, nullable, unique,
description). The legacy _DATATYPES dict still works but emits a
DeprecationWarning via Schema.columns().
Before:

    class OrdersSchema(Schema):
        _FILENAME = "orders"
        _EXTENSION = FileExtension.CSV
        _SCHEMA_TYPE = SchemaType.SINGLE
        _DATATYPES = {
            "id": DataType.STRING,
            "qty": DataType.INTEGER,
        }

After:

    class OrdersSchema(Schema):
        _FILENAME = "orders"
        _EXTENSION = FileExtension.CSV
        _SCHEMA_TYPE = SchemaType.SINGLE
        ID = Column(name="id", dtype=DataType.STRING, primary_key=True)
        QTY = Column(name="qty", dtype=DataType.INTEGER)
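To locate schemas that still rely on the legacy dict, it can help to capture the DeprecationWarning instead of letting the default filters hide it. The sketch below uses only the standard `warnings` module; `legacy_columns` is a hypothetical stand-in for whatever call reaches the deprecated path in your code, and its warning text is invented for illustration.

```python
import warnings


def legacy_columns():
    """Hypothetical stand-in for a call that hits the legacy _DATATYPES path."""
    warnings.warn(
        "_DATATYPES is deprecated; declare Column instances instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return {"id": "string", "qty": "integer"}


def collect_deprecations(func):
    """Run func and return (result, list of DeprecationWarning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" keeps repeated warnings from being suppressed.
        warnings.simplefilter("always")
        result = func()
    messages = [
        str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages


columns, deprecations = collect_deprecations(legacy_columns)
print(deprecations)
```

In a test suite, running Python with `-W error::DeprecationWarning` turns the same warnings into failures, which makes leftover legacy schemas hard to miss.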
Classmethod identity accessors¶
All schema-level accessors are now @classmethod, so the call form
gains parentheses:
Before:

    schema.file_name
    schema.extension
    schema.datatypes

After:

    schema.file_name()
    schema.extension()
    schema.datatypes()
get_subschema(key) now returns a synthetic schema class, not an
instance — call its classmethods directly.
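One pitfall worth checking for during this migration: attribute-style access that was not updated does not raise, it quietly yields a bound method. The sketch below shows the effect on a hypothetical stand-in class (`DemoSchema` is not the library's Schema); only standard Python classmethod behavior is assumed.

```python
class DemoSchema:
    """Hypothetical stand-in; not the library's Schema class."""

    _FILENAME = "orders"

    @classmethod
    def file_name(cls):
        return cls._FILENAME


# Stale attribute-style access: this is a bound method, not a string.
stale = DemoSchema.file_name
stale_matches = stale == "orders"  # False, and no exception is raised

# Updated call form works on the class itself and on instances.
from_class = DemoSchema.file_name()
from_instance = DemoSchema().file_name()
print(stale_matches, from_class, from_instance)
```

Because the stale form fails silently in comparisons and string formatting, a text search for the accessor names without trailing parentheses is a reasonable first migration step.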
v0.7.0 — Structured validation framework¶
ValidationMessage: structured location fields¶
Positional construction (severity, message) still works. New optional
keyword fields (table, column, row, code) make messages
machine-readable for downstream rendering.
Before:

    msg = ValidationMessage(ValidationSeverity.ERROR, "bad row 42 in widgets.price")

After:

    msg = ValidationMessage(
        ValidationSeverity.ERROR,
        "bad row",
        table="widgets",
        column="price",
        row=42,
        code="DTYPE_MISMATCH",
    )
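As an illustration of what the structured fields enable downstream, the sketch below renders a location string from whichever fields are set. `Msg` is a hypothetical dataclass that only mirrors the field names listed above; `format_location` and `render` are illustrative helpers, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Msg:
    """Hypothetical stand-in mirroring the fields named above."""

    severity: str
    message: str
    table: Optional[str] = None
    column: Optional[str] = None
    row: Optional[int] = None
    code: Optional[str] = None


def format_location(msg: Msg) -> str:
    """Build 'table.column[row]' from whichever location fields are set."""
    location = ".".join(part for part in (msg.table, msg.column) if part)
    if msg.row is not None:
        location += f"[{msg.row}]"
    return location or "<no location>"


def render(msg: Msg) -> str:
    """One-line rendering with the machine-readable code appended."""
    code = f" ({msg.code})" if msg.code else ""
    return f"{msg.severity} {format_location(msg)}: {msg.message}{code}"


structured = Msg("ERROR", "bad row", table="widgets", column="price", row=42,
                 code="DTYPE_MISMATCH")
legacy = Msg("ERROR", "bad row 42 in widgets.price")
print(render(structured))
print(render(legacy))
```

With the legacy form, the location is only recoverable by parsing the message text, which is the main argument for passing the keyword fields.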
ValidationSequence.run_validation() → ValidationResult¶
Before:

    is_valid, messages = sequence.run_validation(data)

After:

    result = sequence.run_validation(data)
    result.is_valid
    result.messages
    result.counts_by_severity
    result.as_dataframe()
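If many call sites still unpack a tuple, a small adapter can bridge the gap while they are migrated one by one. The sketch below is an assumption-laden illustration: `DemoResult` is a hypothetical stand-in exposing three of the attributes listed above, its rule for `is_valid` is invented, and `as_legacy_tuple` is not a library function.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DemoResult:
    """Hypothetical stand-in; messages are (severity, text) pairs here."""

    messages: list = field(default_factory=list)

    @property
    def is_valid(self):
        # Assumed rule for this sketch: ERROR or CRITICAL makes the result invalid.
        return not any(sev in ("ERROR", "CRITICAL") for sev, _ in self.messages)

    @property
    def counts_by_severity(self):
        return dict(Counter(sev for sev, _ in self.messages))


def as_legacy_tuple(result):
    """Adapter for call sites that still unpack (is_valid, messages)."""
    return result.is_valid, result.messages


result = DemoResult([("WARNING", "empty note"), ("ERROR", "bad qty")])
is_valid, messages = as_legacy_tuple(result)
print(is_valid, result.counts_by_severity)
```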
Configurable halt threshold¶
    sequence = ValidationSequence(
        [...],
        halt_on=ValidationSeverity.ERROR,  # default is CRITICAL
    )
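To make the effect of the threshold concrete, here is a minimal sketch of threshold-based halting. It is not the library's implementation: the `Severity` enum (including the INFO and WARNING members and the numeric ordering), the "at or above" comparison, and the validator functions are all assumptions made for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical ordered severities; only ERROR and CRITICAL appear in the guide."""

    INFO = 10
    WARNING = 20
    ERROR = 30
    CRITICAL = 40


def run_until_halt(validators, data, halt_on=Severity.CRITICAL):
    """Run validators in order; stop after the first one that reports
    a message at or above halt_on. Returns all messages collected so far."""
    messages = []
    for validate in validators:
        found = validate(data)
        messages.extend(found)
        if any(severity >= halt_on for severity, _ in found):
            break
    return messages


def check_qty(rows):
    bad = any(r["qty"] < 0 for r in rows)
    return [(Severity.ERROR, "negative qty")] if bad else []


def check_id(rows):
    bad = any(not r["id"] for r in rows)
    return [(Severity.WARNING, "missing id")] if bad else []


rows = [{"id": "", "qty": -1}]
print(len(run_until_halt([check_qty, check_id], rows)))
print(len(run_until_halt([check_qty, check_id], rows, halt_on=Severity.ERROR)))
```

Under these assumptions, lowering the threshold to ERROR stops the run before `check_id` executes, so later validators never see data that already failed an earlier check.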
New built-in validators¶
| Validator | Replaces ad-hoc check |
|---|---|
| | manual “is column X here?” checks |
| | per-project uniqueness/non-null checks |
| | per-column checks |
| | per-project FK checks |
The OptionalColumnGuard transformer (which injects missing optional
columns using Column.default) replaces manual df[col] = default lines.
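The idea behind injecting optional columns can be sketched without the library. In the example below a plain dict of column lists stands in for a DataFrame, and `inject_optional_columns` is a hypothetical helper; how OptionalColumnGuard itself is constructed and configured is not shown here.

```python
def inject_optional_columns(table, optional_defaults):
    """Return a copy of table in which every missing optional column
    is added and filled with its default; existing columns are untouched."""
    n_rows = len(next(iter(table.values()), []))
    filled = {name: list(values) for name, values in table.items()}
    for name, default in optional_defaults.items():
        if name not in filled:
            filled[name] = [default] * n_rows
    return filled


orders = {"id": ["a", "b"], "qty": [1, 2]}
# "note" is missing and gets injected; "qty" exists and keeps its values.
guarded = inject_optional_columns(orders, {"note": "", "qty": 0})
print(guarded)
```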
v0.8.0 — Predictable ETL termination¶
ETLPipeline.run() returns ETLResult (no longer raises)¶
Data-quality failures (validation, missing/malformed files, dtype
conversion errors) arrive as ETLResult(status='failed'). Programmer
errors (e.g. KeyError from a custom transformer) still propagate.
Before:

    try:
        datasource = pipeline.run()
    except ValidationError as exc:
        report(exc)

After:

    result = pipeline.run()
    if result.is_success:
        use(result.datasource)
    else:
        report(result.validation_result)
        if result.raised is not None:
            # Expected ETL exception (e.g. FileNotFoundError) was caught
            # and converted; the original is preserved here.
            ...
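The split between converted failures and propagating programmer errors can be pictured as follows. This is a sketch under stated assumptions, not the pipeline's code: `DemoETLResult`, the `EXPECTED_ERRORS` tuple, the `'success'` status string, and `run_safely` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumption for this sketch: which exception types count as "expected".
EXPECTED_ERRORS = (FileNotFoundError, ValueError)


@dataclass
class DemoETLResult:
    """Hypothetical stand-in for a result object with a status and a cause."""

    status: str
    datasource: Any = None
    raised: Optional[BaseException] = None

    @property
    def is_success(self):
        return self.status == "success"

    @property
    def is_failure(self):
        return self.status == "failed"


def run_safely(step):
    """Convert expected errors into a failed result; let everything else propagate."""
    try:
        return DemoETLResult(status="success", datasource=step())
    except EXPECTED_ERRORS as exc:
        return DemoETLResult(status="failed", raised=exc)


def missing_file():
    raise FileNotFoundError("orders.csv")


def buggy_transformer():
    return {}["qty"]  # KeyError: a programmer error, not a data-quality failure


failed = run_safely(missing_file)
print(failed.status, type(failed.raised).__name__)
```

Calling `run_safely(buggy_transformer)` raises KeyError as usual, which mirrors the rule that bugs in custom code should stay loud.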
DataManager.etl_data() returns the result¶
Before:

    dm.etl_data(files, "orders_2026")  # raised on failure

After:

    result = dm.etl_data(files, "orders_2026")
    if result.is_failure:
        show_messages_to_user(result.validation_result.messages)
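Callers that cannot be migrated right away may want to keep the old raise-on-failure behavior behind a thin wrapper. The sketch below is one possible shim; `ETLFailedError` and `require_success` are hypothetical names, and SimpleNamespace objects stand in for the returned result.

```python
from types import SimpleNamespace


class ETLFailedError(RuntimeError):
    """Hypothetical exception for callers that still expect a raise."""

    def __init__(self, result):
        super().__init__("ETL run failed; inspect .result for details")
        self.result = result


def require_success(result):
    """Return result unchanged on success, raise ETLFailedError on failure."""
    if result.is_failure:
        # Chain the preserved original exception, if the result carries one.
        raise ETLFailedError(result) from getattr(result, "raised", None)
    return result


ok = SimpleNamespace(is_failure=False, raised=None)
failed = SimpleNamespace(is_failure=True, raised=FileNotFoundError("orders.csv"))

print(require_success(ok) is ok)
```

Wrapping a call as `require_success(dm.etl_data(files, "orders_2026"))` would then behave much like the old API, while new code inspects the result directly.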
Conversion failures surface as validation messages¶
DataTypeConverter no longer prints and swallows coercion errors. They
now arrive on the final ValidationResult as messages with
code="CONVERSION_FAILED" and the table, column, and row fields populated.
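Since conversion failures now carry a code and a location, they can be summarized per column. The sketch below assumes message objects with `code`, `table`, `column`, and `row` attributes (as described above) and uses SimpleNamespace stand-ins; `conversion_failures_by_column` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace


def conversion_failures_by_column(messages):
    """Map (table, column) to the list of rows that failed conversion."""
    grouped = defaultdict(list)
    for msg in messages:
        if msg.code == "CONVERSION_FAILED":
            grouped[(msg.table, msg.column)].append(msg.row)
    return dict(grouped)


demo_messages = [
    SimpleNamespace(code="CONVERSION_FAILED", table="orders", column="qty", row=3),
    SimpleNamespace(code="CONVERSION_FAILED", table="orders", column="qty", row=7),
    SimpleNamespace(code="DTYPE_MISMATCH", table="widgets", column="price", row=42),
]

summary = conversion_failures_by_column(demo_messages)
print(summary)
```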
Bonus: M4 boilerplate reductions¶
These are not breaking changes — old subclasses keep working — but you can now delete a lot of plumbing:
- SimpleETLFactory(schemas) replaces full ETLFactory subclasses for the common case.
- ETLFactory ships with default create_extraction_sequence / create_validation_sequence / create_transformation_sequence / create_loader implementations; only override the ones you need.
- DataManager.prepare_files now drives file-type dispatch off the schema-declared _EXTENSION.
See Extending file types and data types for the public
register_extractor API introduced in M5.
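The "override only what you need" pattern for the factory hooks can be sketched with a stand-in base class. `DemoFactory` below is hypothetical and returns placeholder strings; only the four hook names are taken from the list above.

```python
class DemoFactory:
    """Hypothetical stand-in for a factory with default hook implementations."""

    def create_extraction_sequence(self):
        return ["default-extract"]

    def create_validation_sequence(self):
        return ["default-validate"]

    def create_transformation_sequence(self):
        return ["default-transform"]

    def create_loader(self):
        return "default-loader"

    def describe(self):
        """Collect what each hook produced, for inspection."""
        return {
            "extraction": self.create_extraction_sequence(),
            "validation": self.create_validation_sequence(),
            "transformation": self.create_transformation_sequence(),
            "loader": self.create_loader(),
        }


class StrictFactory(DemoFactory):
    """Overrides a single hook; the other three defaults are inherited."""

    def create_validation_sequence(self):
        return super().create_validation_sequence() + ["extra-check"]


print(StrictFactory().describe())
```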
Bonus: M7 relational cascade cleanup¶
M7 is fully additive — no existing pipeline changes behavior unless
you add the new transformer. Three new optional Column fields and two
new transformers let you declaratively clean up incomplete input data.
Adding FK declarations to existing schemas¶
Before:

    class OrderSchema(Schema):
        _FILENAME = "order"
        _EXTENSION = FileExtension.CSV
        _SCHEMA_TYPE = SchemaType.SINGLE
        ID = Column(name="id", dtype=DataType.STRING, primary_key=True)
        PRODUCT_ID = Column(name="product_id", dtype=DataType.STRING)

After:

    class OrderSchema(Schema):
        _FILENAME = "order"
        _EXTENSION = FileExtension.CSV
        _SCHEMA_TYPE = SchemaType.SINGLE
        ID = Column(name="id", dtype=DataType.STRING, primary_key=True)
        PRODUCT_ID = Column(
            name="product_id",
            dtype=DataType.STRING,
            foreign_key=("product", "id"),
            parent_requires_child=True,  # opt-in
        )
Wiring CascadeDropTransformer into a SimpleETLFactory¶
    from algomancy_data import CascadeDropTransformer, SimpleETLFactory

    factory = SimpleETLFactory(
        schemas=[ProductSchema, OrderSchema],
        transformers=[CascadeDropTransformer(schemas=[ProductSchema, OrderSchema])],
    )
See Relational cascade cleanup for the full feature
description, including partial-loss detection via CascadeSnapshot.
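For intuition about what a cascade cleanup does with a declaration like foreign_key=("product", "id"), here is a library-independent sketch on plain lists of dicts. It is not CascadeDropTransformer: the function name is hypothetical, and the reading of parent_requires_child (a parent row with no remaining child is dropped too) is an assumption of this sketch that should be checked against the feature description.

```python
def cascade_drop(products, orders, parent_requires_child=False):
    """Drop orders whose product_id matches no product. If
    parent_requires_child is set, also drop products that no kept order references."""
    product_ids = {p["id"] for p in products}
    kept_orders = [o for o in orders if o["product_id"] in product_ids]
    kept_products = list(products)
    if parent_requires_child:
        referenced = {o["product_id"] for o in kept_orders}
        kept_products = [p for p in products if p["id"] in referenced]
    return kept_products, kept_orders


products = [{"id": "p1"}, {"id": "p2"}]
orders = [
    {"id": "o1", "product_id": "p1"},
    {"id": "o2", "product_id": "p9"},  # orphan: no product "p9"
]

print(cascade_drop(products, orders))
print(cascade_drop(products, orders, parent_requires_child=True))
```

In this sketch the orphaned order o2 is always removed, while product p2 survives unless the opt-in flag is set.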