(fundamentals-data-container-ref)= # Data Source This document describes the data engine and data sources used in the Algomancy framework. We start with an overview of the underlying data container, the `BaseDataSource` class, and then discuss schemas and the ETL process. ## Description The abstract `BaseDataSource` class serves as a generic foundation for any domain-specific data model. It contains the core identity of a data object, including its unique ID, name, and classification (Master vs. Derived), as well as internal behavior for data management. The `DataSource` class is a reasonably generic, table-oriented implementation of `BaseDataSource` for tabular data using pandas DataFrames. A `DataSource` instance contains all necessary information to be processed by an `Algorithm` to produce a `ScenarioResult`. It supports serialization to and from JSON and Parquet formats, enabling easy persistence and transfer of experimental states. In an Algomancy app, data can exist in either one of two states: - Master data: Immutable data tied directly to source files. - Derived data: Data derived from master data that may be modified. Data is assigned the `MASTER_DATA` classification if it is constructed by an ETL process or saved by the `save` usecase. The `derive` usecase creates a copy of an existing dataset and assigns it the classification `DERIVED_DATA`. Data that was serialized (`download`-ed) and deserialized (`upload`-ed) maintains its classification. :::{dropdown} {octicon}`eye` Example usage :color: success A standard `DataSource` stores data in its `tables` attribute. The following example demonstrates basic usage: ```{code-block} python :caption: Example usage of DataSource :linenos: from algomancy_data import DataSource, DataClassification import pandas as pd # Create a DataSource instance ds = DataSource(DataClassification.MASTER_DATA, name="MyDataSource") # Add a table df = pd.DataFrame({'col1': [1, 2], 'col2': ['A', 'B']}) ds.add_table('my_table', df) # Retrieve a table retrieved_df = ds.get_table('my_table') ``` ::: ```{note} While you can create `DataSource` instances directly, they are typically produced by an {ref}`ETL process`. ``` ## Custom Data Source For most projects, you should create a custom subclass of `BaseDataSource` (or `DataSource`) to encapsulate domain-specific logic and attributes. To implement a custom data source: 1. **Subclass `BaseDataSource`:** Inherit from the base class. 2. **Implement Serialization:** Override `to_json` and `from_json` to handle your custom attributes. 3. (_Optional_) Handle Derivation: override `_post_derive()` to perform logic when data is branched from Master to Derived. :::{dropdown} {octicon}`eye` Example of a custom data source :color: success ```python from algomancy_data import BaseDataSource, DataClassification from dataclasses import dataclass import json @dataclass class Location: name: str lat: float lon: float class MyCustomSource(BaseDataSource): def __init__(self, ds_type, name, locations: dict[str, Location] = None): super().__init__(ds_type, name) self.locations = locations or {} def to_json(self) -> str: # Custom serialization logic return json.dumps({ "ds_type": self.ds_type, "name": self.name, "locations": {k: vars(v) for k, v in self.locations.items()} }) @classmethod def from_json(cls, json_string: str) -> 'MyCustomSource': data = json.loads(json_string) locations = {k: Location(**v) for k, v in data["locations"].items()} return cls(data["ds_type"], data["name"], locations) ``` ::: (data-parameters-ref)= ## Data Parameters A `BaseDataSource` subclass can declare a typed `BaseParameterSet` that the framework collects per scenario, persists alongside the algorithm parameters, and pushes onto the algorithm before `run()`. Use them for knobs that belong conceptually to the *data* rather than the algorithm — date range, region filter, category whitelist — so the same algorithm can consume the same data source under different slices without rewriting either side. Override `initialize_data_parameters` to declare the shape; the default returns `EmptyParameters()`, so existing subclasses keep working with no changes. The method runs on a populated instance, so it can inspect `self.tables` to derive sensible defaults (e.g. the unique values of a `category` column). :::{dropdown} {octicon}`eye` Example: a warehouse data source with two knobs :color: success ```{code-block} python :caption: Custom DataSource that declares data parameters :linenos: from algomancy_data import DataSource from algomancy_utils.baseparameterset import ( BaseParameterSet, IntegerParameter, MultiEnumParameter, ) class WarehouseDataParameters(BaseParameterSet): def __init__(self, categories: list[str]) -> None: super().__init__(name="Warehouse Data") self.add_parameters([ MultiEnumParameter( name="category_filter", choices=categories or ["(none)"], value=list(categories or ["(none)"]), ), IntegerParameter(name="min_daily_picks", minvalue=0, default=0), ]) def validate(self) -> None: pass class WarehouseDataSource(DataSource): def initialize_data_parameters(self) -> BaseParameterSet: sku = self.tables.get("sku_data") categories = ( sorted(str(c) for c in sku["category"].dropna().unique()) if sku is not None and "category" in sku.columns else [] ) return WarehouseDataParameters(categories=categories) ``` ::: The framework does **not** apply data parameters to the data automatically. The algorithm reads `self.data_params` and decides whether to act on them — typically by filtering its input before its main loop. Algorithms that don't care simply ignore the attribute; the safe access pattern is `self.data_params.contains("knob_name")` followed by `self.data_params["knob_name"]`. In the GUI the data parameter card renders next to the algorithm parameter card in the scenario-creation modal, populated as soon as the user picks a dataset. Over the HTTP API the descriptor is served by `GET /api/v1/sessions/{sid}/data/{dataset_key}/parameters`, and supplied values flow through the `data_params` field of `POST /scenarios`. See {ref}`Algorithms and Parameters ` for the algorithm-side read pattern. ## Database persistence When the framework runs with `persistence_backend="database"` (see {ref}`Sessions `), the `DatabaseDataManager` persists every `DataSource` through whichever `data_object_type` was wired into `CoreConfig`. It chooses between two storage paths per DataSource: 1. **JSON blob (universal default).** The full DataSource is serialised via its `to_json()` and stored in a `payload` column on the catalogue table. Any `BaseDataSource` subclass works out of the box — the only requirement is the abstract `to_json` / `from_json` pair every subclass already has to implement. 2. **Per-sub-table SQL (opt-in).** Each DataFrame the DataSource exposes becomes its own SQL table, named `ds__{session_id}__{dataset_name}__{sub_table}`. The data stays externally queryable and the DataSource is loaded lazily on `get_data()`. The bundled `DataSource` uses this path automatically. To opt a custom subclass into the per-table path, implement the {ref}`SqlTableLayout ` protocol. That is: implement the `to_sql_tables()` and `from_sql_tables()` functions. A si ```python from algomancy_data import BaseDataSource, DataClassification from algomancy_data.database import SqlTableLayout # noqa: F401 (for type hints only) import pandas as pd class MyTabularSource(BaseDataSource): def __init__(self, ds_type, name, **kwargs): super().__init__(ds_type, name, **kwargs) self._tables: dict[str, pd.DataFrame] = {} # ---- SqlTableLayout protocol ---- def to_sql_tables(self) -> dict[str, pd.DataFrame]: return self._tables def from_sql_tables(self, tables: dict[str, pd.DataFrame]) -> None: self._tables.update(tables) # ---- BaseDataSource (still required) ---- def to_json(self) -> str: ... @classmethod def from_json(cls, payload: str) -> "MyTabularSource": ... ``` If a subclass doesn't implement these two methods, persistence falls back to the JSON-blob path automatically — nothing else changes. Pick the per-table path when external SQL queryability or memory-efficient lazy loading of large DataFrames matters; pick the default JSON-blob path when the DataSource holds non-tabular state or you want the simplest possible contract. For more details on specific classes, see the {ref}`API reference`.