Extractor¶
- class algomancy_data.extractor.ConversionIssue(table, column, target_type, reason)[source]¶
Bases: object
Single dtype-conversion failure surfaced by DataTypeConverter.
- table¶
Logical table/file name where the failure occurred.
- column¶
Column whose conversion failed.
- target_type¶
The schema-declared target DataType.
- reason¶
Short description of the failure.
- class algomancy_data.extractor.DataTypeConverter[source]¶
Bases: object
- static convert_dtypes(df, schema_types, issues=None, table_name=None)[source]¶
Converts DataFrame columns to the specified data types in the schema. Attempts different localization options for numeric columns and date formats if the initial conversion fails.
- Parameters:
df (DataFrame) – The pandas DataFrame to convert.
schema_types (dict[str, DataType]) – Mapping from column name to target DataType, obtained from the schema.
issues (List[ConversionIssue] | None) – Optional list that receives ConversionIssue entries for any column that fails to convert. The caller (typically an extractor) is expected to surface these via the validation step rather than silently corrupting data.
table_name (str | None) – Optional logical table name used to attach context to ConversionIssue entries.
- Returns:
DataFrame with converted data types where possible.
- Return type:
DataFrame
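The fallback behaviour described for convert_dtypes (retrying with another localization when the first numeric conversion fails, and recording a failure instead of corrupting the column) can be sketched without pandas. This is a minimal illustration under stated assumptions, not the library's implementation: `Issue`, `parse_number`, and `convert_column` are hypothetical names, and the single decimal-comma fallback is an assumed example of a "localization option".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Issue:
    """Hypothetical stand-in that mirrors the ConversionIssue fields."""
    table: Optional[str]
    column: str
    target_type: str
    reason: str


def parse_number(text: str) -> float:
    """Try a plain float first, then a decimal-comma format like '1.234,56'."""
    try:
        return float(text)
    except ValueError:
        # Assumed fallback: '.' is a thousands separator, ',' the decimal mark.
        return float(text.replace(".", "").replace(",", "."))


def convert_column(values, column, issues=None, table_name=None):
    """Return floats if every value parses; otherwise keep the original
    values untouched and record why the conversion was abandoned."""
    try:
        return [parse_number(v) for v in values]
    except ValueError as exc:
        if issues is not None:
            issues.append(Issue(table_name, column, "FLOAT", str(exc)))
        return list(values)


issues: List[Issue] = []
prices = convert_column(["1.234,56", "7,5"], "price", issues, "orders")
broken = convert_column(["12", "n/a"], "qty", issues, "orders")
```

The point of the optional `issues` list is that a failed column comes back unchanged and the caller decides how to report it, rather than the converter coercing bad values silently.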
- class algomancy_data.extractor.Extractor(file, logger=None)[source]¶
Bases: ABC
- conversion_issues: List[ConversionIssue]¶
- class algomancy_data.extractor.CSVSingleExtractor(file, schema, logger=None, separator=';')[source]¶
Bases: SingleExtractor
Parses and extracts data from a CSV file.
This class reads and extracts data from Comma-Separated Values (CSV) files. It uses pandas for data manipulation, and the separator parameter sets the delimiter used in the file. The extracted data is returned as a pandas DataFrame.
- file¶
CSVFile File object that contains the content of the CSV file.
- schema¶
Schema contains datatype information for each column in the DataFrame.
- logger¶
Logger, optional An optional logger instance to log messages and errors.
- separator¶
str The delimiter string to use for parsing the CSV file (default is “;”).
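Because the default separator is ";" rather than ",", the effect of the separator parameter is easy to get wrong. The sketch below uses the standard-library csv module as a dependency-free stand-in for the pandas parsing the class performs; `read_rows` and the sample data are hypothetical.

```python
import csv
import io

# Semicolon-delimited text with decimal commas in the price column.
raw = "id;name;price\n1;bolt;0,25\n2;nut;0,10\n"


def read_rows(text, separator=";"):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=separator))


rows = read_rows(raw)                  # parsed with the matching delimiter
wrong = read_rows(raw, separator=",")  # header collapses into one column
```

With the wrong delimiter the whole header line becomes a single column name, which is usually the first visible symptom of a separator mismatch.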
- class algomancy_data.extractor.JSONSingleExtractor(file, schema, logger=None)[source]¶
Bases: SingleExtractor
Handles extraction of data from JSON files.
This class reads and processes data from a JSON file. It normalizes the JSON structure and converts it into a pandas DataFrame for further processing. It inherits from the SingleExtractor base class and takes similar initialization parameters: a file object, a schema, and an optional logger.
JSONSingleExtractor expects the JSON file to be formatted such that the root level is a list. Each item in the list represents a single record, and each record has some properties. The properties are represented as key-value pairs. If the value is a dictionary, it is treated as a nested object. Each nested object is converted to a column in the dataframe. If the value is a list, it is converted to a string.
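The expected input shape can be illustrated with a small pure-Python flattener. This is only a sketch of the rules stated above, not the class's code: `flatten` is a hypothetical helper, the dotted column names for nested objects are an assumption (they follow the default of pandas.json_normalize), and the exact string form used for list values is likewise assumed.

```python
import json


def flatten(record, prefix=""):
    """Flatten one record: nested dicts become prefixed columns,
    lists are stored as their string representation."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            row[name] = str(value)
        else:
            row[name] = value
    return row


# Root level is a list; each item is one record.
document = json.loads(
    '[{"id": 1, "address": {"city": "Utrecht", "zip": "3511"}, "tags": ["a", "b"]}]'
)
rows = [flatten(record) for record in document]
```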
- file¶
JSONFile File object that contains the content of the JSON file.
- schema¶
Schema contains datatype information for each column in the DataFrame.
- logger¶
Logger, optional Logger instance for logging messages. Defaults to None.
- class algomancy_data.extractor.XLSXSingleExtractor(file, schema, sheet_name, logger=None)[source]¶
Bases: SingleExtractor
Represents an extractor for XLSX files.
This class extracts data from XLSX files. It uses pandas to read the specified sheet from an XLSX file and converts its content into a DataFrame. It extends the SingleExtractor base class with an XLSX-specific implementation.
- file¶
XLSXFile The file object containing the content of the XLSX file.
- schema¶
Schema The schema object containing the data types for each column in the DataFrame.
- sheet_name¶
str | int The name or index of the sheet to extract data from.
- logger¶
Logger, optional An optional logger instance for logging purposes.
- class algomancy_data.extractor.XLSXMultiExtractor(file, schema, logger=None)[source]¶
Bases: MultiExtractor
Represents an extractor for XLSX files.
This class extracts data from XLSX files. It uses pandas to read the specified sheets from an XLSX file and converts their content into one DataFrame per sheet. It extends the MultiExtractor base class with an XLSX-specific implementation.
- file¶
XLSXFile The file object containing the content of the XLSX file.
- schemas¶
Dict of Schema objects, keyed by sheet name, containing the data types for each column in the corresponding DataFrame.
- sheet_names¶
List[str] The names of the sheets to extract data from.
- logger¶
Logger, optional An optional logger instance for logging purposes.
Note that the sheet_names should match the keys of the schemas Dict.
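A pre-flight check for that requirement might look like the following. It is a sketch only: `check_sheet_names` is a hypothetical helper, not part of the library, and plain strings stand in for Schema objects.

```python
def check_sheet_names(sheet_names, schemas):
    """Return (sheets with no schema, schemas with no sheet)."""
    missing = [name for name in sheet_names if name not in schemas]
    unused = [name for name in schemas if name not in sheet_names]
    return missing, unused


# Placeholder values stand in for real Schema objects.
schemas = {"Orders": "orders-schema", "Lines": "lines-schema"}
missing, unused = check_sheet_names(["Orders", "Line"], schemas)
```

Here the typo "Line" shows up on both sides: a requested sheet without a schema, and a schema that no sheet uses.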
- class algomancy_data.extractor.JSONMultiExtractor(file, schema, logger=None)[source]¶
Bases: MultiExtractor
Extract a nested JSON document into multiple related tables.
The schema must be a SchemaType.MULTI schema declaring one ColumnGroup per output table. Each group's source_path says where its rows live relative to a root record:
- source_path=(): the root group. Each item of the top-level list contributes one row. Exactly one group must use this.
- source_path=("PickOrderLines",): a child group. Each root record has a nested list at that key whose elements become this group's rows. Deeper paths (("foo", "bar")) walk through intermediate dicts before reaching the list.
A child column whose foreign_key=(parent_group_name, parent_pk_column) is automatically populated from the corresponding root record at extraction time. The same FK declaration is also consumed by ForeignKeyValidator and CascadeDropTransformer.
The top-level JSON may be either a list of records, or a dict with exactly one list-valued key (the wrapper is unwrapped automatically). List columns that are peeled off into a child group are dropped from the parent table so each row is a flat, queryable record.
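The root/child split and foreign-key population can be sketched in plain Python. This is an illustration of the behaviour described above, not the extractor's implementation: `unwrap` and `split_tables` are hypothetical helpers, the sample field names other than PickOrderLines are invented, and only a one-element source_path is handled.

```python
def unwrap(document):
    """Accept a list of records, or a dict with exactly one list-valued key."""
    if isinstance(document, list):
        return document
    list_keys = [k for k, v in document.items() if isinstance(v, list)]
    if len(list_keys) != 1:
        raise ValueError("expected exactly one list-valued key")
    return document[list_keys[0]]


def split_tables(document, child_key, parent_pk, fk_column):
    """Split root records into a parent table and one child table."""
    roots, children = [], []
    for record in unwrap(document):
        for item in record.get(child_key, []):
            # Populate the child's foreign key from its root record.
            children.append({fk_column: record[parent_pk], **item})
        # Drop the peeled-off list so the parent row stays flat.
        roots.append({k: v for k, v in record.items() if k != child_key})
    return roots, children


document = {
    "PickOrders": [
        {"OrderId": "A1", "Priority": 2,
         "PickOrderLines": [{"Sku": "X", "Qty": 3}, {"Sku": "Y", "Qty": 1}]},
        {"OrderId": "B7", "Priority": 1, "PickOrderLines": []},
    ]
}
orders, lines = split_tables(document, "PickOrderLines", "OrderId", "OrderId")
```

Each child row carries the key of the root record it came from, so the two tables can be joined or validated against each other afterwards.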
- file¶
JSONFile containing the nested document.
- schema¶
MULTI schema whose ColumnGroups carry the source_path and foreign_key metadata.
- class algomancy_data.extractor.DataFrameExtractor(name, df, schema, logger=None)[source]¶
Bases: Extractor
Extractor that wraps a pre-built pandas.DataFrame. Useful for tests and notebook workflows where the input data is already in memory and no file I/O is needed.
- name¶
Logical table/file name under which the DataFrame is exposed.
- df¶
The DataFrame to expose.
- schema¶
Schema (SINGLE) whose datatypes() are applied via DataTypeConverter. MULTI schemas are not supported.
- class algomancy_data.extractor.ExtractionSequence(extractors=None, logger=None)[source]¶
Bases: object
- property completed: bool¶
- property data: Dict[str, DataFrame]¶
- property conversion_issues: List[ConversionIssue]¶
Return dtype-conversion failures collected during extraction.
The ETL pipeline drains these and surfaces them as validation messages instead of letting them silently corrupt the data.
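How a sequence might aggregate results from several extractors can be shown with toy classes. Everything here is hypothetical: only the three property names (completed, data, conversion_issues) come from the reference above, while `ToyExtractor`, `ToySequence`, and the `extract` and `run` method names are assumptions made for the sketch, and plain lists and strings stand in for DataFrames and ConversionIssue objects.

```python
from typing import Dict, List


class ToyExtractor:
    """Hypothetical extractor: yields one named table plus any issues."""

    def __init__(self, name, rows, issues=None):
        self.name = name
        self.rows = rows
        self.conversion_issues = list(issues or [])

    def extract(self) -> Dict[str, list]:  # method name is an assumption
        return {self.name: self.rows}


class ToySequence:
    """Hypothetical sequence exposing completed, data, conversion_issues."""

    def __init__(self, extractors=None):
        self.extractors = list(extractors or [])
        self._data: Dict[str, list] = {}
        self._completed = False

    def run(self) -> None:  # method name is an assumption
        for extractor in self.extractors:
            self._data.update(extractor.extract())
        self._completed = True

    @property
    def completed(self) -> bool:
        return self._completed

    @property
    def data(self) -> Dict[str, list]:
        return self._data

    @property
    def conversion_issues(self) -> List[str]:
        # Flatten the per-extractor lists into one list for the caller.
        return [i for e in self.extractors for i in e.conversion_issues]


seq = ToySequence([
    ToyExtractor("orders", [{"id": 1}]),
    ToyExtractor("lines", [], issues=["lines.qty: not numeric"]),
])
done_before = seq.completed
seq.run()
```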