Schema

Schema primitives for defining structured tabular data.

This module provides a Schema abstraction that declares columns via Column instances as class attributes. The legacy _DATATYPES dict is still accepted but emits a DeprecationWarning; migrate to Column declarations to silence it.

class algomancy_data.schema.DataType(*values)[source]

Bases: StrEnum

Enumeration of supported logical data types for schema fields.

STRING = 'string'
DATETIME = 'datetime64[ns]'
INTEGER = 'int64'
FLOAT = 'float64'
BOOLEAN = 'boolean'
CATEGORICAL = 'categorical'
INTERVAL = 'interval'

class algomancy_data.schema.FileExtension(*values)[source]

Bases: StrEnum

Supported file extensions for input files.

CSV = 'csv'
XLSX = 'xlsx'
JSON = 'json'

class algomancy_data.schema.SchemaType(*values)[source]

Bases: StrEnum

Enumeration of supported schema types.

SINGLE = 'single'
MULTI = 'multi'

class algomancy_data.schema.Column(name, dtype, optional=False, primary_key=False, default=None, nullable=False, unique=False, description='', foreign_key=None, parent_requires_child=False, track_partial_loss=False)[source]

Bases: object

Metadata for a single schema column.

Parameters:
  • name (str) – Actual column name as it appears in the source data.

  • dtype (DataType) – The expected DataType of this column.

  • optional (bool) – If True the column may be absent in the source data.

  • primary_key (bool) – If True this column is part of the (joint) primary key.

  • default (Any) – Value used when the column is absent and optional=True.

  • nullable (bool) – If True the column may contain null/NaN values.

  • unique (bool) – If True all values in the column must be distinct.

  • description (str) – Human-readable description of the column.

  • foreign_key (Tuple[str, str] | None) – Optional (parent_table, parent_column) tuple declaring that this column references a column on another table. Used by ForeignKeyValidator (for reporting violations) and by CascadeDropTransformer (for cascade cleanup).

  • parent_requires_child (bool) – If True, the referenced parent row requires at least one referencing child on this relation; parents with zero children get dropped by CascadeDropTransformer. Only meaningful when foreign_key is set.

  • track_partial_loss (bool) – If True, enables partial-loss cascade for this relation: parents that lose some (but not all) of their children mid-pipeline are dropped. Requires a CascadeSnapshot paired with the cascade transformer. Only meaningful when foreign_key is set.

name: str
dtype: DataType
optional: bool = False
primary_key: bool = False
default: Any = None
nullable: bool = False
unique: bool = False
description: str = ''
foreign_key: Tuple[str, str] | None = None
parent_requires_child: bool = False
track_partial_loss: bool = False
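
The effect of parent_requires_child and track_partial_loss can be illustrated without the library. The sketch below is a plain-Python stand-in, not CascadeDropTransformer or CascadeSnapshot: both helper names and the list-of-dicts row layout are hypothetical, and the code only mimics the two rules described above (drop parents with zero children; drop parents that lost some, but not all, of their children relative to an earlier snapshot).

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def drop_childless_parents(
    parents: List[Row], children: List[Row], parent_column: str, child_column: str
) -> List[Row]:
    """Keep only parents referenced by at least one child (hypothetical helper)."""
    referenced = {row[child_column] for row in children}
    return [row for row in parents if row[parent_column] in referenced]


def drop_partially_depleted_parents(
    parents: List[Row],
    children_before: List[Row],
    children_after: List[Row],
    parent_column: str,
    child_column: str,
) -> List[Row]:
    """Drop parents that lost some, but not all, children (hypothetical helper).

    children_before plays the role of a snapshot taken earlier in the pipeline.
    """
    before = Counter(row[child_column] for row in children_before)
    after = Counter(row[child_column] for row in children_after)
    return [
        row
        for row in parents
        if not (0 < after[row[parent_column]] < before[row[parent_column]])
    ]


orders = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
lines_before = [
    {"line_id": 10, "order_id": 1},
    {"line_id": 11, "order_id": 1},
    {"line_id": 12, "order_id": 3},
]
# Line 11 was removed by an earlier pipeline step.
lines_after = [row for row in lines_before if row["line_id"] != 11]

with_children = drop_childless_parents(orders, lines_after, "order_id", "order_id")
without_partial_loss = drop_partially_depleted_parents(
    orders, lines_before, lines_after, "order_id", "order_id"
)
print(with_children)         # [{'order_id': 1}, {'order_id': 3}]
print(without_partial_loss)  # [{'order_id': 2}, {'order_id': 3}]
```

Order 2 never had lines, so only the first rule removes it; order 1 lost one of its two lines, so only the second rule removes it.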

class algomancy_data.schema.ColumnGroup(name, columns, source_path=<factory>)[source]

Bases: object

Metadata for one sheet (sub-schema) of a MULTI schema.

Declare ColumnGroup instances as class attributes on a Schema subclass with _SCHEMA_TYPE = SchemaType.MULTI:

class LocationSchema(Schema):
    _FILENAME = "multisheet"
    _EXTENSION = FileExtension.XLSX
    _SCHEMA_TYPE = SchemaType.MULTI

    STEDEN = ColumnGroup("Steden", [
        Column("Country", dtype=DataType.STRING),
        Column("City",    dtype=DataType.STRING),
    ])
    KLANTEN = ColumnGroup("Klanten", [
        Column("ID",   dtype=DataType.INTEGER, primary_key=True),
        Column("Naam", dtype=DataType.STRING),
    ])

Parameters:
  • name (str) – Actual sheet / sub-schema name as it appears in the source file (may contain spaces and mixed case).

  • columns (List[Column]) – Ordered list of Column objects for this sub-schema.

  • source_path (Tuple[str, ...]) – For nested sources (e.g. JSON), the path of keys from the root record to the list of dicts that populates this group. () (the default) means the group is built from the root record itself; a tuple like ("PickOrderLines",) means each root record has a nested list at that key whose elements form the rows of this group. Ignored by extractors that do not support nesting (e.g. XLSXMultiExtractor).

name: str
columns: List[Column]
source_path: Tuple[str, ...]
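
To make the source_path semantics concrete, here is a minimal sketch of how a key path could select the rows for a group from one nested JSON record. The function rows_for_group is hypothetical and is not the library's extractor; it assumes every key on the path leads to a dict except the last, which leads to a list of dicts.

```python
from typing import Any, Dict, List, Tuple


def rows_for_group(record: Dict[str, Any], source_path: Tuple[str, ...]) -> List[Dict[str, Any]]:
    """Return the rows one root record contributes to a group (hypothetical helper)."""
    if not source_path:
        # Empty path: the group is built from the root record itself.
        return [record]
    node: Any = record
    for key in source_path:
        node = node[key]
    # The final node is expected to be a list of dicts, one per row.
    return list(node)


pick_order = {
    "OrderId": 7,
    "PickOrderLines": [
        {"Sku": "A", "Qty": 2},
        {"Sku": "B", "Qty": 1},
    ],
}

root_rows = rows_for_group(pick_order, ())
line_rows = rows_for_group(pick_order, ("PickOrderLines",))
print(len(root_rows), len(line_rows))  # 1 2
```

How the real extractors treat missing keys or intermediate lists is not covered by this page, so the sketch simply raises KeyError on a missing key.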

class algomancy_data.schema.Schema[source]

Bases: ABC

Abstract base class for table schemas.

Declare columns as class attributes using Column instances:

class MySchema(Schema):
    _FILENAME = "my_file"
    _EXTENSION = FileExtension.CSV
    _SCHEMA_TYPE = SchemaType.SINGLE

    ID = Column("id", dtype=DataType.STRING, primary_key=True)
    NAME = Column("name", dtype=DataType.STRING)
    VALUE = Column("value", dtype=DataType.FLOAT, optional=True)

The legacy _DATATYPES dict is still supported but deprecated.

classmethod file_name()[source]

Return the base file name (without extension).

classmethod extension()[source]

Return the file extension.

_EXTENSION may be set to any StrEnum-derived value, including user-defined FileExtension subclasses created for custom file formats (see Extending file types and data types). A plain str is upcast to the built-in FileExtension when it matches a built-in value; otherwise it is returned as-is.
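
The upcasting rule can be sketched as follows. This is an illustration under assumptions, not the library's code: normalize_extension is a hypothetical name, and the stand-in enum uses the (str, Enum) mixin so the snippet also runs on Python versions without StrEnum.

```python
from enum import Enum
from typing import Union


class FileExtension(str, Enum):
    """Stand-in for the built-in FileExtension (the real one is a StrEnum)."""

    CSV = "csv"
    XLSX = "xlsx"
    JSON = "json"


def normalize_extension(value: Union[str, Enum]) -> Union[str, Enum]:
    """Mimic the documented behaviour of extension() (hypothetical helper)."""
    if isinstance(value, Enum):
        # Already an enum member (built-in or user-defined): keep it.
        return value
    try:
        # Plain str that matches a built-in value: upcast.
        return FileExtension(value)
    except ValueError:
        # Unknown plain str: return unchanged.
        return value


print(normalize_extension("csv") is FileExtension.CSV)  # True
print(normalize_extension("parquet"))                   # parquet
```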

classmethod schema_type()[source]

Return the schema type (SINGLE or MULTI).

classmethod file_name_with_extension()[source]

Return <file_name>.<extension>.

classmethod columns()[source]

Return an ordered mapping of column name → Column.

For schemas that declare Column class attributes the mapping is built from those attributes (in class-definition order).

For schemas that still use the legacy _DATATYPES dict a DeprecationWarning is emitted and Column objects are built automatically with optional=False, primary_key=False, and default=None.

Raises:
  • NotImplementedError – If neither Column attributes nor _DATATYPES are defined.

  • TypeError – If called on a MULTI schema (use column_groups()).
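
As a rough illustration of how a mapping in class-definition order can be built from class attributes, the sketch below scans vars(cls) for Column instances. Column here is a trimmed stand-in dataclass and collect_columns is a hypothetical helper; the real columns() may differ, for example in how it handles inherited attributes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Column:
    """Trimmed stand-in for algomancy_data.schema.Column."""

    name: str
    dtype: str
    optional: bool = False
    primary_key: bool = False


class MySchema:
    ID = Column("id", dtype="string", primary_key=True)
    NAME = Column("name", dtype="string")
    VALUE = Column("value", dtype="float64", optional=True)
    NOT_A_COLUMN = "ignored"


def collect_columns(schema_cls: type) -> Dict[str, Column]:
    """Map column name -> Column, in class-definition order (hypothetical helper)."""
    # vars(cls) preserves the order in which attributes were defined.
    return {
        attr.name: attr
        for attr in vars(schema_cls).values()
        if isinstance(attr, Column)
    }


columns = collect_columns(MySchema)
print(list(columns))  # ['id', 'name', 'value']
```

Note that the keys are the source column names ("id"), not the attribute names ("ID").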

classmethod get_legacy_columns_with_warning()[source]

classmethod column_groups()[source]

Return {group_name: {col_name: Column}} for MULTI schemas.

Scans vars(cls) for ColumnGroup attributes first (new API). Falls back to _DATATYPES for legacy schemas, emitting a DeprecationWarning and constructing bare Column objects (optional=False, primary_key=False, default=None).

Raises:
  • ValueError – If called on a SINGLE schema.

  • NotImplementedError – If neither ColumnGroup attributes nor _DATATYPES are defined.

classmethod required_columns()[source]

Return names of non-optional columns.

classmethod optional_columns()[source]

Return names of optional columns.

classmethod primary_key()[source]

Return tuple of column names that form the (joint) primary key.
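
The three accessors above are simple filters over the column mapping. The sketch below shows the idea on a stand-in mapping. The Column tuple is a hypothetical stand-in, and returning lists for the required and optional names is an assumption, since only primary_key() documents its return type (a tuple).

```python
from typing import Dict, List, NamedTuple, Tuple


class Column(NamedTuple):
    """Trimmed stand-in for algomancy_data.schema.Column."""

    name: str
    optional: bool = False
    primary_key: bool = False


columns: Dict[str, Column] = {
    "order_id": Column("order_id", primary_key=True),
    "line_no": Column("line_no", primary_key=True),
    "sku": Column("sku"),
    "note": Column("note", optional=True),
}

required: List[str] = [name for name, col in columns.items() if not col.optional]
optional: List[str] = [name for name, col in columns.items() if col.optional]
# Two columns flagged primary_key=True form a joint key.
primary_key: Tuple[str, ...] = tuple(
    name for name, col in columns.items() if col.primary_key
)

print(required)     # ['order_id', 'line_no', 'sku']
print(optional)     # ['note']
print(primary_key)  # ('order_id', 'line_no')
```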

classmethod sub_names()[source]

Return sub-schema names for MULTI schemas.

classmethod is_multi()[source]

Return True if this is a MULTI schema.

classmethod is_single()[source]

Return True if this is a SINGLE schema.

classmethod get_subschema(key)[source]

Return a synthetic SINGLE schema class for one sheet of a MULTI schema.

The returned class behaves as a normal Schema subclass and exposes columns() for the requested sub-name.

Parameters:
  • key (str) – Sub-schema name (e.g. sheet name in an XLSX file).

Raises:
  • ValueError – If called on a SINGLE schema or if key is invalid.
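
One way to picture a "synthetic" per-sheet schema is a class built at runtime with type(). The sketch below is an assumption about the general technique, not the library's implementation: MultiSchema, its GROUPS dict, and the generated class name are all hypothetical.

```python
from typing import Dict, List


class MultiSchema:
    """Stand-in MULTI schema: group name -> column names."""

    GROUPS: Dict[str, List[str]] = {
        "Steden": ["Country", "City"],
        "Klanten": ["ID", "Naam"],
    }

    @classmethod
    def get_subschema(cls, key: str) -> type:
        if key not in cls.GROUPS:
            raise ValueError(
                f"Unknown sub-schema {key!r}; expected one of {sorted(cls.GROUPS)}"
            )
        column_names = list(cls.GROUPS[key])
        # type(name, bases, namespace) creates a new class at runtime.
        return type(
            f"{cls.__name__}_{key}",
            (),
            {"columns": classmethod(lambda sub: list(column_names))},
        )


Klanten = MultiSchema.get_subschema("Klanten")
print(Klanten.__name__)   # MultiSchema_Klanten
print(Klanten.columns())  # ['ID', 'Naam']
```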

classmethod validate()[source]

Validate that every declared field name appears in the column mapping.

Raises:
  • AssertionError – If a field name is missing from the column mapping.

classmethod get_data_members()[source]

Return string-valued class attributes that represent column aliases.

Excludes dunder names, methods, classes, built-ins, descriptors, and Column instances (which are the new-style declaration).
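
The filtering described above can be approximated with a scan over vars(cls). This is a simplified sketch, assuming that keeping only str values is enough to exclude methods, classes, and descriptors; string_members and LegacySchema are hypothetical names, and the real method may apply further rules.

```python
from typing import Dict


class LegacySchema:
    ID = "id"
    NAME = "name"
    ROW_LIMIT = 100  # not a string, so not a column alias

    @classmethod
    def helper(cls) -> str:
        return cls.ID


def string_members(cls: type) -> Dict[str, str]:
    """Return string-valued, non-dunder class attributes (hypothetical helper)."""
    return {
        attr: value
        for attr, value in vars(cls).items()
        # __module__ is a str too, hence the explicit dunder check.
        if not attr.startswith("__") and isinstance(value, str)
    }


members = string_members(LegacySchema)
print(members)  # {'ID': 'id', 'NAME': 'name'}
```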