DataManager

The DataManager class is responsible for data ingestion and internal storage. It is usually not accessed directly, but rather through the ScenarioManager facade. Three concrete implementations are available:

StatelessDataManager — in-memory only; no persistence.
StatefulDataManager — persists DataSources to disk as JSON files and reloads them on startup. Deprecated — use DatabaseDataManager for new projects.
DatabaseDataManager — persists DataSources to a SQL database (requires the [database] extra).

StatelessDataManager / StatefulDataManager

class algomancy_data.datamanager.DataManager(etl_factory, schemas, save_type, data_object_type, logger=None)

Bases: ABC

Handles all data-related operations: loading, deriving, deleting, and storing datasets.

property data_object_type

abstractmethod startup()

log(message)

get_data_keys()

get_data(data_key)

set_data(data_key, data)

derive_data(existing_key, derived_key)

add_data_source(data_source)

abstractmethod delete_data(data_key, prevent_masterdata_removal=False)

static check_existence_of_files(file_name_to_path)

prepare_files(file_items_with_content=None, file_items_with_path=None)

etl_data(files, dataset_name)

Run the ETL pipeline for dataset_name and store the result.

Parameters:

files (Dict[str, File]) – Mapping of logical file names to File objects.
dataset_name (str) – Logical name for the resulting dataset.

Returns:

A structured outcome. Inspect result.status to tell success from failure, and result.validation_result.messages for details.

Return type:

ETLResult

Raises:

ETLConstructionError – If pipeline construction fails.
Exception – Programmer errors from user-supplied components are allowed to propagate unchanged.
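
A minimal sketch of how a caller might branch on the returned object. Only the attribute names status and validation_result.messages come from the description above; the stand-in classes, the "success" and "failure" status values, and the summarize helper are assumptions made for illustration, not the library's API.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins: the real ETLResult may use different types
# (for example an enum for status). Only the attribute names are from the docs.
@dataclass
class FakeValidationResult:
    messages: List[str] = field(default_factory=list)


@dataclass
class FakeETLResult:
    status: str
    validation_result: FakeValidationResult


def summarize(result) -> str:
    """Turn an ETL result into a one-line, human-readable summary."""
    details = "; ".join(str(m) for m in result.validation_result.messages)
    if result.status == "success":  # assumed status value
        return "ok" if not details else f"ok ({details})"
    return f"failed: {details or 'no details'}"


ok = FakeETLResult("success", FakeValidationResult())
bad = FakeETLResult("failure", FakeValidationResult(["missing column 'id'"]))
print(summarize(ok))
print(summarize(bad))
```

Because ordinary failures are reported through the result rather than raised, a caller would typically reserve try/except for ETLConstructionError and let other exceptions propagate.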

create_validation_sequence()

class algomancy_data.datamanager.StatelessDataManager(etl_factory, schemas, save_type, data_object_type, logger=None)

Bases: DataManager

startup()

delete_data(data_key, prevent_masterdata_removal=False)

class algomancy_data.datamanager.StatefulDataManager(etl_factory, schemas, data_folder, save_type, data_object_type, logger=None)

Bases: DataManager

startup()

Load persisted data sources from the data folder.

Each item is loaded independently. If a single file or directory fails to load, it is logged and skipped, and any partial in-memory state for that item is rolled back so the manager remains consistent. Other items continue to load. Failures are surfaced through the configured logger, and self.startup_errors collects them so callers can inspect what happened.
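
The load-skip-rollback behaviour described above can be sketched generically as follows. This is not the library's implementation: load_all, fake_loader, and the snapshot-based rollback are hypothetical, chosen only to illustrate the pattern of isolating each item and collecting errors.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("datamanager-sketch")


def load_all(
    items: List[str],
    load_one: Callable[[str, Dict[str, object]], None],
) -> Tuple[Dict[str, object], List[Tuple[str, Exception]]]:
    """Load every item independently, skipping and recording failures."""
    data: Dict[str, object] = {}
    errors: List[Tuple[str, Exception]] = []
    for item in items:
        snapshot = dict(data)  # shallow copy taken before touching state
        try:
            load_one(item, data)
        except Exception as exc:
            data.clear()
            data.update(snapshot)  # roll back partial writes for this item
            logger.warning("Skipping %s: %s", item, exc)
            errors.append((item, exc))
    return data, errors


def fake_loader(item: str, data: Dict[str, object]) -> None:
    """Hypothetical loader that writes partial state before failing."""
    data[item] = "loaded"
    if item == "broken.json":
        data[item + ":partial"] = "half-written"
        raise ValueError("malformed JSON")


data, startup_errors = load_all(["a.json", "broken.json", "b.json"], fake_loader)
print(sorted(data), [name for name, _ in startup_errors])
```

In this sketch the failing item leaves no trace in data, while the items before and after it load normally and the error list records what was skipped.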

load_data_from_file(file_name, root=None)

load_data_from_dir(directory, root=None)

delete_data(data_key, prevent_masterdata_removal=False)

store_data(dataset_name, data, USE_OLD_VERSION=True)

store_data_source_as_json(dataset_name, allow_overwrite=False)

DatabaseDataManager

DatabaseDataManager stores DataSources in a SQL database (SQLite by default; Postgres-compatible). A write happens immediately after every ETL run, derive_data call, or add_data_source call. get_data() loads a DataSource into RAM on first access and caches it, so only accessed datasets occupy memory.
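
The load-on-first-access behaviour of get_data() can be pictured with a small cache. The LazyCache class, its loader callback, and the load_count counter are hypothetical and stand in for whatever the library does internally.

```python
from typing import Callable, Dict


class LazyCache:
    """Load values on first access and keep them in memory afterwards."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader  # e.g. a function that reads from the database
        self._cache: Dict[str, object] = {}
        self.load_count = 0  # instrumentation for this demo only

    def get_data(self, key: str) -> object:
        if key not in self._cache:
            self.load_count += 1
            self._cache[key] = self._loader(key)
        return self._cache[key]


cache = LazyCache(lambda key: {"name": key})
first = cache.get_data("sales")
second = cache.get_data("sales")  # served from memory, loader not called again
print(cache.load_count)
```

Datasets that are never requested never reach the loader, which is why only accessed datasets occupy memory.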

Persistence path selection is dispatched automatically per DataSource:

Shared per-sub-table SQL (used when the subclass implements SqlTableLayout) — one physical SQL table per sub-table name (e.g. algomancy_ds__customers), shared across all sessions and datasets. Each row carries _algomancy_session_id and _algomancy_dataset_name discriminator columns, so the table count is bounded by the DataSource shape rather than growing with sessions × datasets. Data stays externally queryable.
JSON-blob fallback (used for all other BaseDataSource subclasses) — the DataSource is serialised via its abstract to_json() into a payload column on the algomancy_datasets catalogue table.
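
To illustrate why discriminator columns keep the table count bounded, here is a sketch using the standard-library sqlite3 module. The table name and the two discriminator column names follow the text above; the column types, the business columns (customer_id, name), and the session and dataset values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One shared table for the "customers" sub-table; the business columns
# (customer_id, name) and all column types are assumptions.
conn.execute(
    """
    CREATE TABLE algomancy_ds__customers (
        _algomancy_session_id   TEXT NOT NULL,
        _algomancy_dataset_name TEXT NOT NULL,
        customer_id             INTEGER,
        name                    TEXT
    )
    """
)
rows = [
    ("session-a", "baseline", 1, "Ada"),
    ("session-a", "what-if", 1, "Ada"),
    ("session-a", "what-if", 2, "Grace"),
    ("session-b", "baseline", 7, "Linus"),
]
conn.executemany("INSERT INTO algomancy_ds__customers VALUES (?, ?, ?, ?)", rows)

# Reading one dataset back means filtering on both discriminator columns.
what_if = conn.execute(
    """
    SELECT customer_id, name FROM algomancy_ds__customers
    WHERE _algomancy_session_id = ? AND _algomancy_dataset_name = ?
    ORDER BY customer_id
    """,
    ("session-a", "what-if"),
).fetchall()

# Three (session, dataset) pairs share a single physical table.
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
print(what_if, table_count)
```

Any external SQL client can run the same kind of filtered query, which is what keeps the data externally queryable.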

The bundled DataSource satisfies SqlTableLayout via its tables dict, so it is always stored in the shared per-sub-table tables.
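
One way to picture the per-DataSource dispatch is a structural type check. Everything in this sketch is hypothetical: SupportsTables merely stands in for SqlTableLayout, whose real members are not shown here, and sql_tables, TabularSource, OpaqueSource, and choose_path are names invented for illustration.

```python
import json
from typing import Dict, List, Protocol, runtime_checkable


@runtime_checkable
class SupportsTables(Protocol):
    """Hypothetical stand-in for SqlTableLayout."""

    def sql_tables(self) -> Dict[str, List[dict]]: ...


class TabularSource:
    """Exposes its sub-tables, so it qualifies for the shared-table path."""

    def __init__(self, tables: Dict[str, List[dict]]) -> None:
        self.tables = tables

    def sql_tables(self) -> Dict[str, List[dict]]:
        return self.tables

    def to_json(self) -> str:
        return json.dumps(self.tables)


class OpaqueSource:
    """Only knows how to serialise itself, so it falls back to a JSON blob."""

    def __init__(self, payload: dict) -> None:
        self.payload = payload

    def to_json(self) -> str:
        return json.dumps(self.payload)


def choose_path(source: object) -> str:
    """Pick a persistence path based on what the source implements."""
    if isinstance(source, SupportsTables):
        return "shared-tables"
    return "json-blob"


print(choose_path(TabularSource({"customers": []})))
print(choose_path(OpaqueSource({"x": 1})))
```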

Schema drift — if an older database is missing the payload or sub_tables columns, startup() raises immediately with a clear message directing you to drop the catalogue table (and any leftover algomancy_ds__… tables) and rebuild. There is no automatic migration from the older per-(session, dataset) table layout.
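
A fail-fast column check of this kind could look like the sketch below, written against SQLite with the standard-library sqlite3 module. The check_catalogue function, the error wording, and the columns of the simulated old table are assumptions; only the catalogue table name and the payload and sub_tables column names come from the text above.

```python
import sqlite3

REQUIRED_COLUMNS = {"payload", "sub_tables"}


def check_catalogue(conn: sqlite3.Connection, table: str = "algomancy_datasets") -> None:
    """Raise if the catalogue table exists but lacks required columns."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if not columns:
        return  # table does not exist yet, so there is nothing to verify
    missing = REQUIRED_COLUMNS - columns
    if missing:
        raise RuntimeError(
            f"Catalogue table {table!r} is missing columns {sorted(missing)}; "
            "drop it and any leftover sub-table tables, then rebuild."
        )


# Simulate an older database whose catalogue predates the two columns.
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE algomancy_datasets (session_id TEXT, dataset_name TEXT)")
try:
    check_catalogue(old)
    drift_message = None
except RuntimeError as exc:
    drift_message = str(exc)

# A fresh database with no catalogue table passes the check.
fresh = sqlite3.connect(":memory:")
fresh_result = check_catalogue(fresh)
print(drift_message)
```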

Requires sqlalchemy>=2.0. Install via:

pip install "algomancy-data[database]"