Data intake¶

An Algomancy app reads data through an ETL pipeline. The quickstart wizard has already generated the folder structure, schemas, and ETL factory for us. In this section we review the generated files, then write the custom transformation and loading logic for the TSP model.

Review the generated schemas¶

Open src/data_handling/generated_schemas.py. The wizard scanned the three input files and created a Schema subclass for each one, with inferred column names and types:

Note

The fields _FILENAME, _EXTENSION, and _SCHEMA_TYPE are required — an exception is raised at construction if any are missing.

Important

A single Schema corresponds to a single file. When a file contains more than one data table (e.g., multiple Excel sheets), set _SCHEMA_TYPE to MULTI and return a nested dictionary from _defined_datatypes, as done for OtherlocationsSchema above. The outer-dictionary keys must match the table identifiers (e.g., the sheet names of an xlsx).

Tip

Defining column names as class variables (e.g., ID = "ID") is not strictly necessary, but it makes the code more readable, prevents typos, and lets your IDE assist with autocompletion — especially when column names are long or appear in many places.
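As a rough picture of the two notes above, the sketch below uses plain Python to show column names held as class constants and the nested dictionary shape for a file with several tables. It deliberately does not subclass the Algomancy Schema class. The class name OtherLocationsColumns, the function defined_datatypes, the "hub" sheet, and the string type labels are all placeholders for illustration, not the generated code.

```python
class OtherLocationsColumns:
    """Column names as class constants: a typo becomes an AttributeError."""

    ID = "ID"
    X = "x"
    Y = "y"


def defined_datatypes() -> dict[str, dict[str, str]]:
    """Return a nested mapping: sheet name -> {column name -> type label}."""
    c = OtherLocationsColumns
    per_sheet = {c.ID: "string", c.X: "float", c.Y: "float"}
    return {
        "xdock": dict(per_sheet),  # sheet name as used in 'otherlocations.xdock'
        "hub": dict(per_sheet),    # hypothetical second sheet
    }


datatypes = defined_datatypes()
```

The outer keys play the role of the table identifiers, so each one has to match a sheet name in the workbook.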

Review the generated ETL factory¶

Open src/data_handling/etl_factory.py. The wizard created a TSPETLFactory with extractors already configured for each input file:

Code

etl_factory.py (as generated)¶

from typing import Dict

import algomancy_data as de
from algomancy_data import File
from algomancy_data.extractor import (
    ExtractionSequence,
    CSVSingleExtractor,
    XLSXMultiExtractor,
    XLSXSingleExtractor,
)
from algomancy_data.transformer import TransformationSequence
from algomancy_utils import Logger

from src.data_handling.generated_schemas import all_schemas
from src.data_handling.generated_schemas import dc_schema, otherlocations_schema, stores_schema


class TSPETLFactory(de.SimpleETLFactory):

    @classmethod
    def create_extraction_sequence(
        cls, files=None, schemas=None, logger: Logger = None,
    ) -> ExtractionSequence:
        sequence = ExtractionSequence(logger=logger)

        # Extract dc
        sequence.add_extractor(
            XLSXSingleExtractor(
                file=files["dc"],
                schema=dc_schema,
                sheet_name="Sheet1",
                logger=logger,
            )
        )

        # Extract otherlocations
        sequence.add_extractor(
            XLSXMultiExtractor(
                file=files["otherlocations"],
                schema=otherlocations_schema,
                logger=logger,
            )
        )

        # Extract stores
        sequence.add_extractor(
            CSVSingleExtractor(
                file=files["stores"],
                schema=stores_schema,
                logger=logger,
                separator=",",
            )
        )

        return sequence

    @classmethod
    def create_transformation_sequence(
        cls, schemas=None, logger: Logger = None,
    ) -> TransformationSequence:
        # TODO: Add transformers to process your data.
        return TransformationSequence(logger=logger)

    @classmethod
    def create_validation_sequence(
        cls, schemas, logger: Logger = None,
    ) -> de.ValidationSequence:
        vs = de.ValidationSequence(logger=logger)
        vs.add_validator(de.ExtractionSuccessVerification())
        vs.add_validator(
            de.SchemaValidator(
                schemas=list(schemas.values()),
                severity=de.ValidationSeverity.CRITICAL,
            )
        )
        return vs

    @classmethod
    def create_loader(cls, logger: Logger = None) -> de.Loader:
        # TODO: Customize if you need a custom data container.
        return de.DataSourceLoader(logger)

An ETL factory has four responsibilities:

Extract — read the input files as configured by the schemas.
Validate — run validations on the extracted data.
Transform — reshape the extracted DataFrames into the form needed for loading.
Load — build the application data model from the transformed data.
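To make the four responsibilities concrete, here is a minimal plain-Python sketch that runs them in the order listed above. It is not how Algomancy wires the sequences internally. The function run_pipeline and the toy callables are hypothetical, and plain lists of dictionaries stand in for DataFrames.

```python
from typing import Callable

Data = dict[str, list[dict]]


def run_pipeline(
    extract: Callable[[], Data],
    validate: Callable[[Data], list[str]],
    transforms: list[Callable[[Data], None]],
    load: Callable[[Data], object],
) -> object:
    data = extract()                      # 1. read the input files
    problems = validate(data)             # 2. check the extracted tables
    if problems:
        raise ValueError("; ".join(problems))
    for transform in transforms:          # 3. reshape, mutating the shared dict
        transform(data)
    return load(data)                     # 4. build the data model


def extract() -> Data:
    return {"stores": [{"ID": "S1", "x": 1.0, "y": 2.0}]}


def validate(data: Data) -> list[str]:
    return [] if data.get("stores") else ["stores is empty"]


def add_locations(data: Data) -> None:
    data["locations"] = [
        {"id": row["ID"], "x": row["x"], "y": row["y"]} for row in data["stores"]
    ]


result = run_pipeline(extract, validate, [add_locations], lambda d: d["locations"])
```

Each transformer receives the same dictionary and adds or replaces entries in it, which mirrors the transform(data) signature used by the transformers later in this section.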

Extraction and validation are already complete. We now need to replace the placeholder create_transformation_sequence and create_loader with TSP-specific implementations.

At this point you can verify that extraction works:

Run main.py.
Open the dashboard at http://127.0.0.1:8050.
Go to the Data page and import the files from data/setup/.
Verify that all three files are loaded without errors.

Transform¶

We transform all input data into a single pandas DataFrame that lists the locations, then derive a routes DataFrame from it.

Create the directory src/data_handling/transformers/.
Create transform_create_location_df.py — initialise an empty locations DataFrame:
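A minimal sketch of this step, written as plain functions rather than as a Transformer subclass. The column set id, x, y follows the column mappings used by the source transformers. The dtypes, the function names, and the "locations" key are assumptions for illustration.

```python
import pandas as pd

# Assumed target layout for the combined locations table.
LOCATION_DTYPES = {"id": "object", "x": "float64", "y": "float64"}


def create_empty_locations() -> pd.DataFrame:
    """Build a zero-row DataFrame with typed columns."""
    return pd.DataFrame(
        {column: pd.Series(dtype=dtype) for column, dtype in LOCATION_DTYPES.items()}
    )


def add_empty_locations(data: dict[str, pd.DataFrame], location_df_name: str) -> None:
    """Register the empty table in the shared data dictionary."""
    data[location_df_name] = create_empty_locations()


data: dict[str, pd.DataFrame] = {}
add_empty_locations(data, "locations")  # "locations" is a hypothetical key
```

Fixing the dtypes up front matters because the source transformers cast their rows to the dtypes of this table before concatenating.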

Create one transformer per input source that appends its rows to the locations DataFrame. Each follows the same pattern — rename columns and concatenate:

Create TransformXDockToLocation, TransformDCToLocation, and TransformStoresToLocation in the same way, changing df_name to 'otherlocations.xdock', 'dc', and 'stores' respectively:

Code — remaining source transformers

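# Each of the three files below starts with these imports.
# The Transformer import path is an assumption based on the module that
# provides TransformationSequence; adjust it to your Algomancy version.
import pandas as pd
from algomancy_data.transformer import Transformer


# transform_xdock_to_location.py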
class TransformXDockToLocation(Transformer):
    def __init__(self, location_df_name: str, logger=None) -> None:
        super().__init__(name="Location Transformer", logger=logger)
        self.location_df_name = location_df_name
        self.df_name = 'otherlocations.xdock'
        self.column_mapping = {'ID': 'id', 'x': 'x', 'y': 'y'}

    def transform(self, data: dict[str, pd.DataFrame]) -> None:
        data_df = data.get(self.df_name, None)
        data_df_locations = data.get(self.location_df_name, None)
        if (data_df is not None) and (data_df_locations is not None):
            normalized = (
                data_df.rename(columns=self.column_mapping)
                .reindex(columns=data_df_locations.columns)
                .astype(data_df_locations.dtypes.to_dict())
            )
            data[self.location_df_name] = pd.concat(
                [data_df_locations, normalized], ignore_index=True
            )


# transform_dc_to_location.py
class TransformDCToLocation(Transformer):
    def __init__(self, location_df_name: str, logger=None) -> None:
        super().__init__(name="Location Transformer", logger=logger)
        self.location_df_name = location_df_name
        self.df_name = 'dc'
        self.column_mapping = {'ID': 'id', 'x': 'x', 'y': 'y'}

    def transform(self, data: dict[str, pd.DataFrame]) -> None:
        data_df = data.get(self.df_name, None)
        data_df_locations = data.get(self.location_df_name, None)
        if (data_df is not None) and (data_df_locations is not None):
            normalized = (
                data_df.rename(columns=self.column_mapping)
                .reindex(columns=data_df_locations.columns)
                .astype(data_df_locations.dtypes.to_dict())
            )
            data[self.location_df_name] = pd.concat(
                [data_df_locations, normalized], ignore_index=True
            )


# transform_stores_to_location.py
class TransformStoresToLocation(Transformer):
    def __init__(self, location_df_name: str, logger=None) -> None:
        super().__init__(name="Location Transformer", logger=logger)
        self.location_df_name = location_df_name
        self.df_name = 'stores'
        self.column_mapping = {'ID': 'id', 'x': 'x', 'y': 'y'}

    def transform(self, data: dict[str, pd.DataFrame]) -> None:
        data_df = data.get(self.df_name, None)
        data_df_locations = data.get(self.location_df_name, None)
        if (data_df is not None) and (data_df_locations is not None):
            normalized = (
                data_df.rename(columns=self.column_mapping)
                .reindex(columns=data_df_locations.columns)
                .astype(data_df_locations.dtypes.to_dict())
            )
            data[self.location_df_name] = pd.concat(
                [data_df_locations, normalized], ignore_index=True
            )
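The rename, reindex, astype, and concat chain shared by the three transformers can be tried on toy data outside the app. The store rows and the extra name column below are made up for illustration.

```python
import pandas as pd

# Empty target table with the layout the transformers append to.
locations = pd.DataFrame(
    {
        "id": pd.Series(dtype="object"),
        "x": pd.Series(dtype="float64"),
        "y": pd.Series(dtype="float64"),
    }
)

# Toy source table with one column the target does not have.
stores = pd.DataFrame(
    {"ID": ["S1", "S2"], "x": [1, 4], "y": [2, 6], "name": ["a", "b"]}
)
column_mapping = {"ID": "id", "x": "x", "y": "y"}

normalized = (
    stores.rename(columns=column_mapping)      # ID -> id
    .reindex(columns=locations.columns)        # keep only id, x, y, in that order
    .astype(locations.dtypes.to_dict())        # integer coordinates become floats
)
combined = pd.concat([locations, normalized], ignore_index=True)
```

The reindex call is what drops columns the locations table does not know about. It would also add any missing target column filled with NaN, so a source file without coordinates does not fail at this step.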

Create transform_location_to_routes.py — derive a routes DataFrame as the Cartesian product of all locations, with Euclidean distance as cost:
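One way to sketch this step with pandas is a cross merge of the locations table with itself. The function name build_routes and the output column names id_from, id_to, and cost are assumptions, not the tutorial's actual transformer.

```python
import pandas as pd


def build_routes(locations: pd.DataFrame) -> pd.DataFrame:
    """Pair every location with every location and attach the Euclidean distance."""
    pairs = locations.merge(locations, how="cross", suffixes=("_from", "_to"))
    dx = pairs["x_to"] - pairs["x_from"]
    dy = pairs["y_to"] - pairs["y_from"]
    pairs["cost"] = (dx**2 + dy**2) ** 0.5
    return pairs[["id_from", "id_to", "cost"]]


locations = pd.DataFrame({"id": ["A", "B"], "x": [0.0, 3.0], "y": [0.0, 4.0]})
routes = build_routes(locations)  # A->B and B->A both cost 5.0
```

A full Cartesian product also contains the self-pairs A->A and B->B with cost 0. Whether to keep or filter them depends on how the optimisation model treats them.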

Run main.py, import the data, and verify that transform_locations appears as a combined table.

Load¶

We build a domain-specific data model from the transformed DataFrames — a network of Location and Route objects managed by a NetworkManager.

Create the directory src/data_handling/data_model/.

Locations¶

We will use locations in the visualisation part of this tutorial. Create location.py:
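A minimal sketch of what such a class could contain, assuming a location is fully described by the id, x, and y columns of the locations DataFrame. The field names and the choice of a frozen dataclass are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A point in the network; frozen so it can be used in sets and as a dict key."""

    id: str
    x: float
    y: float


depot = Location(id="DC1", x=0.0, y=0.0)  # hypothetical example values
```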

Routes¶

We will use routes in the optimisation part of this tutorial. Create route.py:
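A matching sketch for a route, assuming it is identified by the ids of its two endpoints and carries the distance as cost. The field names origin_id, destination_id, and cost are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A directed connection between two locations, referenced by id."""

    origin_id: str
    destination_id: str
    cost: float


leg = Route(origin_id="DC1", destination_id="S1", cost=5.0)  # hypothetical values
```

Referring to endpoints by id keeps the route independent of the Location class. Holding the Location objects themselves would work as well.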

Network Manager¶

Create network_manager.py to manage the set of locations and routes:
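The manager could be as small as two dictionaries with a few access methods. The sketch below is one possible shape under that assumption. The method names are hypothetical, and the compact Location and Route classes are stand-ins so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:  # compact stand-in for location.py
    id: str
    x: float
    y: float


@dataclass(frozen=True)
class Route:  # compact stand-in for route.py
    origin_id: str
    destination_id: str
    cost: float


class NetworkManager:
    """Holds locations by id and routes by (origin, destination) pair."""

    def __init__(self) -> None:
        self._locations: dict[str, Location] = {}
        self._routes: dict[tuple[str, str], Route] = {}

    def add_location(self, location: Location) -> None:
        self._locations[location.id] = location

    def add_route(self, route: Route) -> None:
        # Reject routes that point at locations the network does not contain.
        for endpoint in (route.origin_id, route.destination_id):
            if endpoint not in self._locations:
                raise KeyError(f"unknown location: {endpoint}")
        self._routes[(route.origin_id, route.destination_id)] = route

    def cost(self, origin_id: str, destination_id: str) -> float:
        return self._routes[(origin_id, destination_id)].cost

    @property
    def locations(self) -> list[Location]:
        return list(self._locations.values())


network = NetworkManager()
network.add_location(Location("DC1", 0.0, 0.0))
network.add_location(Location("S1", 3.0, 4.0))
network.add_route(Route("DC1", "S1", 5.0))
```

Keying routes by the endpoint pair gives constant-time cost lookups, which is convenient when an algorithm evaluates many candidate tours.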

Data Model¶

Create data_model.py as a DataSource subclass so we can attach domain objects to the loaded data:

Loader¶

Create the directory src/data_handling/loaders/ and add loader.py, which defines the DataModelLoader. Then update create_loader in etl_factory.py to return it:

@classmethod
def create_loader(cls, logger=None) -> Loader:
    return DataModelLoader(logger)

Also update main.py to use DataModel as the data object type:

data_object_type=DataModel,
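Inside the loader, the central step is turning rows of the transformed DataFrames into domain objects. The sketch below shows that conversion on its own, without the Algomancy Loader base class. The helper name locations_from_frame and the compact Location stand-in are assumptions.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class Location:  # compact stand-in for the class in location.py
    id: str
    x: float
    y: float


def locations_from_frame(df: pd.DataFrame) -> list[Location]:
    """Create one Location per row, casting values to plain Python types."""
    return [
        Location(id=str(row.id), x=float(row.x), y=float(row.y))
        for row in df.itertuples(index=False)
    ]


frame = pd.DataFrame({"id": ["DC1", "S1"], "x": [0.0, 3.0], "y": [0.0, 4.0]})
loaded = locations_from_frame(frame)
```

The same pattern would apply to the routes DataFrame, after which both lists can be handed to the network manager.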

Next step¶

The data is now loaded into Algomancy. Next, we define the algorithm(s).