Extractor

exception algomancy_data.extractor.DateFormatError[source]

Bases: Exception

class algomancy_data.extractor.ConversionIssue(table, column, target_type, reason)[source]

Bases: object

Single dtype-conversion failure surfaced by DataTypeConverter.

table

Logical table/file name where the failure occurred.

column

Column whose conversion failed.

target_type

The schema-declared target DataType.

reason

Short description of the failure.
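
To make the shape of this record concrete, here is a minimal stand-in, assuming the four fields are plain attributes. The dataclass form, the name ConversionIssueSketch, the string used in place of a real DataType, and the sample values are all assumptions for illustration; the library class may be defined differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionIssueSketch:
    """Hypothetical stand-in for ConversionIssue; not the library class."""

    table: str        # logical table/file name where the failure occurred
    column: str       # column whose conversion failed
    target_type: str  # stand-in for the schema-declared DataType
    reason: str       # short description of the failure


# Sample values are invented for this sketch.
issue = ConversionIssueSketch(
    table="orders",
    column="quantity",
    target_type="INTEGER",
    reason="could not parse 'abc' as an integer",
)

# One possible way to render an issue as a single message line.
summary = f"{issue.table}.{issue.column} -> {issue.target_type}: {issue.reason}"
```

Keeping table and column on every issue is what allows a later validation step to report exactly where a conversion failed.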

class algomancy_data.extractor.DataTypeConverter[source]

Bases: object

static convert_dtypes(df, schema_types, issues=None, table_name=None)[source]

Converts DataFrame columns to the specified data types in the schema. Attempts different localization options for numeric columns and date formats if the initial conversion fails.

Parameters:
  • df (DataFrame) – The pandas DataFrame to convert.

  • schema_types (dict[str, DataType]) – Dictionary mapping each column name to its target DataType, obtained from the schema.

  • issues (List[ConversionIssue] | None) – Optional list that receives ConversionIssue entries for any column that fails to convert. The caller (typically an extractor) is expected to surface these via the validation step rather than silently corrupting data.

  • table_name (str | None) – Optional logical table name used to attach context to ConversionIssue entries.

Returns:

DataFrame with converted data types where possible.

Return type:

DataFrame
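
The fallback-and-report behaviour described above can be sketched without pandas. This is not the library's implementation: parse_number, convert_column, the tuple-shaped issue records, and the choice to return a failed column unchanged are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def parse_number(text: str) -> float:
    """Try a plain float first, then a decimal-comma reading such as '1.234,5'."""
    try:
        return float(text)
    except ValueError:
        # Fallback: treat '.' as a thousands separator and ',' as the decimal mark.
        return float(text.replace(".", "").replace(",", "."))


def convert_column(
    values: List[str],
    column: str,
    issues: Optional[List[Tuple[str, str, str]]] = None,
    table_name: Optional[str] = None,
) -> list:
    """Convert every value, or leave the column untouched and record an issue."""
    try:
        return [parse_number(v) for v in values]
    except ValueError as exc:
        if issues is not None:
            # Record (table, column, reason) so the caller can surface it later.
            issues.append((table_name or "<unknown>", column, str(exc)))
        return values  # assumed behaviour: original data is returned unchanged


issues: List[Tuple[str, str, str]] = []
prices = convert_column(["1.234,5", "10"], "price", issues, "orders")
broken = convert_column(["12", "n/a"], "qty", issues, "orders")
```

In this sketch the first column converts through the decimal-comma fallback, while the second fails and adds one entry to issues instead of raising or producing partially converted data.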

class algomancy_data.extractor.Extractor(file, logger=None)[source]

Bases: ABC

conversion_issues: List[ConversionIssue]

abstractmethod extract()[source]

class algomancy_data.extractor.SingleExtractor(file, schema, logger=None)[source]

Bases: Extractor

extract()[source]

Returns a Dict[name, dataframe] keyed by dataset name, so each dataset is identifiable.

class algomancy_data.extractor.MultiExtractor(files, schema, logger=None)[source]

Bases: Extractor

extract()[source]

Returns a Dict[name, dataframe] keyed by dataset name, so each dataset is identifiable.

get_extraction_key(name)[source]

get_schema_name(extraction_key)[source]

class algomancy_data.extractor.CSVSingleExtractor(file, schema, logger=None, separator=';')[source]

Bases: SingleExtractor

Parses and extracts data from a CSV file.

This class reads data from Comma-Separated Values (CSV) files using pandas. The separator parameter sets the delimiter used when parsing the file, and the extracted data is provided as a pandas DataFrame.

file

CSVFile – File object that contains the content of the CSV file.

schema

Schema – Contains datatype information for each column in the DataFrame.

logger

Logger, optional – An optional logger instance to log messages and errors.

separator

str – The delimiter string to use for parsing the CSV file (default is “;”).
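
The effect of the separator can be shown with the standard library alone. The class itself uses pandas, so this is only a sketch of the delimiter's role; read_rows and the sample data are hypothetical.

```python
import csv
import io

# Sample content using ';' as delimiter and ',' as decimal mark (invented data).
raw = "id;name;price\n1;bolt;0,25\n2;nut;0,10\n"


def read_rows(text: str, separator: str = ";") -> list:
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return list(reader)


rows = read_rows(raw)                   # header splits into id, name, price
wrong = read_rows(raw, separator=",")   # header stays one fused column name
```

With the wrong delimiter the header is read as the single column "id;name;price", which is the typical symptom of a mismatched separator setting.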

class algomancy_data.extractor.JSONSingleExtractor(file, schema, logger=None)[source]

Bases: SingleExtractor

Handles extraction of data from JSON files.

This class reads and processes data from a JSON file. It normalizes the JSON structure and converts it into a pandas DataFrame for further processing. It inherits from SingleExtractor and takes the same initialization parameters: a file object, a schema, and an optional logger.

JSONSingleExtractor expects the JSON file to be formatted such that the root level is a list. Each item in the list represents a single record, and each record has some properties. The properties are represented as key-value pairs. If the value is a dictionary, it is treated as a nested object. Each nested object is converted to a column in the dataframe. If the value is a list, it is converted to a string.

file

JSONFile – File object that contains the content of the JSON file.

schema

Schema – Contains datatype information for each column in the DataFrame.

logger

Logger, optional – Logger instance for logging messages. Defaults to None.
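
The expected input shape and the normalization described above can be sketched in plain Python. This assumes nested objects become one column per nested key with dotted names (the default naming of pandas.json_normalize) and that lists are stored as their JSON text; both are assumptions, as are the flatten helper and the sample record.

```python
import json

# Root level is a list; each item is one record (invented sample data).
records = json.loads(
    '[{"id": 1, "address": {"city": "Utrecht", "zip": "3511"}, "tags": ["a", "b"]}]'
)


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys; turn lists into strings."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and prefix its keys with the parent name.
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)  # assumed string form for lists
        else:
            flat[name] = value
    return flat


rows = [flatten(r) for r in records]
```

Each flattened dict corresponds to one row, and its keys correspond to the column names a schema would need to describe.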

class algomancy_data.extractor.XLSXSingleExtractor(file, schema, sheet_name, logger=None)[source]

Bases: SingleExtractor

Represents an extractor for XLSX files.

This class handles the extraction of data from XLSX files. It uses pandas to read the specified sheet from an XLSX file and converts the content into a DataFrame. It extends the SingleExtractor base class, providing a specialized implementation for XLSX data.

file

XLSXFile – The file object containing the content of the XLSX file.

schema

Schema – The schema object containing the data types for each column in the DataFrame.

sheet_name

str | int – The name or index of the sheet to extract data from.

logger

Logger, optional – An optional logger instance for logging purposes.

class algomancy_data.extractor.XLSXMultiExtractor(file, schema, logger=None)[source]

Bases: MultiExtractor

Represents an extractor for XLSX files.

This class is designed to handle the extraction of data from XLSX files. It uses pandas to read specified sheets from an XLSX file and converts the content into DataFrame(s). It extends the functionality of a base MultiExtractor class, providing a specialized implementation for XLSX data.

file

XLSXFile – The file object containing the content of the XLSX file.

schemas

Schema – The schema object containing the data types for each column in the DataFrame.

sheet_names

List[str] – The names of the sheets to extract data from.

logger

Logger, optional – An optional logger instance for logging purposes.

Note that the sheet_names should match the keys of the schemas Dict.

class algomancy_data.extractor.JSONMultiExtractor(file, schema, logger=None)[source]

Bases: MultiExtractor

Extract a nested JSON document into multiple related tables.

The schema must be a SchemaType.MULTI schema declaring one ColumnGroup per output table. Each group’s source_path says where its rows live relative to a root record:

  • source_path=() — the root group. Each item of the top-level list contributes one row. Exactly one group must use this.

  • source_path=("PickOrderLines",) — a child group. Each root record has a nested list at that key whose elements become this group’s rows. Deeper paths (("foo", "bar")) walk through intermediate dicts before reaching the list.

A child column whose foreign_key=(parent_group_name, parent_pk_column) is automatically populated from the corresponding root record at extraction time. The same FK declaration is also consumed by ForeignKeyValidator and CascadeDropTransformer.

The top-level JSON may be either a list of records, or a dict with exactly one list-valued key (the wrapper is unwrapped automatically). List columns that are peeled off into a child group are dropped from the parent table so each row is a flat, queryable record.

file

JSONFile containing the nested document.

schema

MULTI schema whose ColumnGroups carry the source_path and foreign_key metadata.
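
The splitting rules above (wrapper unwrapping, one root table, child rows peeled off with a propagated foreign key) can be sketched in plain Python. This is a simplified illustration, not the extractor's code: unwrap, split_tables, the table names "root" and "child", and every key in the sample document other than PickOrderLines are hypothetical, and only a single one-level child path is handled.

```python
from typing import Dict, List

# Invented sample: a wrapper dict with exactly one list-valued key.
document = {
    "PickOrders": [
        {"OrderId": "A1", "PickOrderLines": [{"Sku": "x", "Qty": 2}, {"Sku": "y", "Qty": 1}]},
        {"OrderId": "B2", "PickOrderLines": [{"Sku": "z", "Qty": 5}]},
    ]
}


def unwrap(doc) -> List[dict]:
    """Accept a list of records or a dict with exactly one list-valued key."""
    if isinstance(doc, list):
        return doc
    list_values = [v for v in doc.values() if isinstance(v, list)]
    if len(list_values) != 1:
        raise ValueError("expected exactly one list-valued key")
    return list_values[0]


def split_tables(doc, child_key: str, pk: str, fk: str) -> Dict[str, List[dict]]:
    """Build a root table and one child table linked by a foreign key."""
    root_rows, child_rows = [], []
    for record in unwrap(doc):
        # Drop the nested list from the parent so the row stays flat.
        root_rows.append({k: v for k, v in record.items() if k != child_key})
        for item in record.get(child_key, []):
            # Copy the parent's key into each child row as the foreign key.
            child_rows.append({fk: record[pk], **item})
    return {"root": root_rows, "child": child_rows}


tables = split_tables(document, "PickOrderLines", pk="OrderId", fk="OrderId")
```

Here the root table holds one flat row per order, and each child row carries the OrderId of the record it was nested in, which is the relationship a foreign-key check would later verify.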

class algomancy_data.extractor.DataFrameExtractor(name, df, schema, logger=None)[source]

Bases: Extractor

Extractor that wraps a pre-built pandas.DataFrame.

Useful for tests and notebook workflows where the input data is already in memory and no file IO is needed.

name

Logical table/file name under which the DataFrame is exposed.

df

The DataFrame to expose.

schema

Schema (SINGLE) whose datatypes() are applied via DataTypeConverter. MULTI schemas are not supported.

extract()[source]

class algomancy_data.extractor.ExtractionSequence(extractors=None, logger=None)[source]

Bases: object

run_extraction()[source]

property completed: bool

property data: Dict[str, DataFrame]

property conversion_issues: List[ConversionIssue]

Return dtype-conversion failures collected during extraction.

The ETL pipeline drains these and surfaces them as validation messages instead of letting them silently corrupt the data.

add_extractor(extractor)[source]

add_extractors(extractors)[source]