matrice.dataset module#
Module to handle dataset-related operations within a project.
- class matrice.dataset.Dataset(session, dataset_id=None, dataset_name=None)[source]#
Bases:
object
Class to handle dataset-related operations within a project.
This class manages operations on a dataset within a specified project. During initialization, either dataset_name or dataset_id must be provided to locate the dataset.
- Parameters:
session (Session) – The session object that manages the connection to the server.
dataset_id (str, optional) – The ID of the dataset (default is None). Used to directly locate the dataset.
dataset_name (str, optional) – The name of the dataset (default is None). If dataset_id is not provided, dataset_name will be used to find the dataset.
- dataset_id#
The unique identifier for the dataset.
- Type:
str
- dataset_name#
The name of the dataset.
- Type:
str
- version_status#
The processing status of the latest dataset version.
- Type:
str
- latest_version#
The identifier of the latest version of the dataset.
- Type:
str
- no_of_samples#
The total number of samples in the dataset.
- Type:
int
- no_of_classes#
The total number of classes in the dataset.
- Type:
int
- no_of_versions#
The total number of versions for this dataset.
- Type:
int
- last_updated_at#
The timestamp of the dataset’s most recent update.
- Type:
str
- summary#
Summary of the dataset’s latest version, providing metrics like item count and class distribution.
- Type:
dict
- Raises:
ValueError – If neither dataset_id nor dataset_name is provided, or if there is a mismatch between dataset_id and dataset_name.
Example
>>> session = Session(account_number=account_number, access_key=access_key, secret_key=secret_key)
>>> dataset = Dataset(session=session, dataset_id="12345", dataset_name="Sample")
>>> print(f"Dataset Name: {dataset.dataset_name}")
>>> print(f"Number of Samples: {dataset.no_of_samples}")
>>> print(f"Latest Version: {dataset.latest_version}")
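Either identifier is sufficient to locate the dataset. A minimal sketch of initializing by name alone, assuming a dataset called "Sample" exists in the project:
>>> # Locate the dataset by name; the constructor resolves the ID internally.
>>> dataset = Dataset(session=session, dataset_name="Sample")
>>> print(f"Resolved ID: {dataset.dataset_id}")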
- add_data(source, source_url, new_dataset_version, old_dataset_version, dataset_description='', version_description='', compute_alias='')[source]#
Import a new version of the dataset from an external source. Only ZIP files are supported for upload.
This function creates a new dataset version or updates an existing version with data from a specified external source URL. The dataset ID must be set during initialization for this function to work.
- Parameters:
source (str) – The source of the dataset, indicating where the dataset originates (e.g., “url”).
source_url (str) – The URL of the dataset to be imported.
new_dataset_version (str) – The version identifier for the new dataset (e.g., “v2.0”).
old_dataset_version (str) – The version identifier of the existing dataset to be updated.
dataset_description (str, optional) – Description of the dataset (default is an empty string).
version_description (str, optional) – Description for the new dataset version (default is an empty string).
compute_alias (str, optional) – Alias for the compute instance to be used (default is an empty string).
- Returns:
A tuple containing:
- dict: API response indicating the status of the dataset import.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set or if the old dataset version is incomplete.
Example
>>> response, err, msg = dataset.add_data(
>>>     source="url",
>>>     source_url="https://example.com/dataset.zip",
>>>     new_dataset_version="v2.0",
>>>     old_dataset_version="v1.0"
>>> )
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(response)
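For a local archive, a common workflow is to first obtain a hosted URL via upload_file() (documented at the end of this page) and pass it to add_data(). A minimal sketch, assuming the upload succeeds and returns the file URL under the documented data key:
>>> result = upload_file(session=session, file_path="path/to/data.zip")
>>> if result['success']:
>>>     response, err, msg = dataset.add_data(
>>>         source="url",
>>>         source_url=result['data'],  # URL of the uploaded ZIP
>>>         new_dataset_version="v2.0",
>>>         old_dataset_version="v1.0"
>>>     )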
- check_valid_spilts(dataset_version)[source]#
Check if the specified dataset version contains valid splits.
Valid splits include training, validation, and test sets. This function verifies that the specified dataset version has these splits properly configured.
- Parameters:
dataset_version (str) – The version of the dataset to check for valid splits (e.g., “v1.0”).
- Returns:
A tuple containing:
- dict: API response indicating split validity, including:
  - isValid (str): Indicates whether the splits are valid.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set.
Example
>>> split_status, err, msg = dataset.check_valid_spilts(dataset_version="v1.0")
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(split_status)
>>>
>>> # Sample output
>>> 'Valid Spilts'
- delete()[source]#
Delete the entire dataset.
This function deletes the entire dataset associated with the given dataset ID. The dataset ID must be set during initialization for this function to work.
- Returns:
A tuple containing:
- dict: API response confirming the dataset deletion status.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set.
Example
>>> response, err, msg = dataset.delete()
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(response)
- delete_item(dataset_version, dataset_item_ids)[source]#
Delete items from a specific version of the dataset based on dataset type.
This function deletes items from a specified version of the dataset. The deletion method is selected automatically based on the dataset type (e.g., classification, detection). The dataset ID must be set during initialization for this function to work.
- Parameters:
dataset_version (str) – The version of the dataset from which to delete items.
dataset_item_ids (list of str) – A list of dataset item IDs to delete.
- Returns:
A tuple containing:
- dict: API response indicating the deletion status.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
ValueError – If the dataset type is unsupported.
Example
>>> response, err, msg = dataset.delete_item(
>>>     dataset_version="v1.0", dataset_item_ids=["123", "456"]
>>> )
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(response)
- delete_version(dataset_version)[source]#
Delete a specific version of the dataset.
This function removes a specified version of the dataset. The dataset ID must be set during initialization for this function to work.
- Parameters:
dataset_version (str) – The version identifier of the dataset to delete (e.g., “v1.0”).
- Returns:
A tuple containing:
- dict: API response confirming the deletion status.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set.
Example
>>> response, err, msg = dataset.delete_version(dataset_version="v1.0")
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(response)
- get_categories(dataset_version)[source]#
Get category details for a specific dataset version.
This function retrieves the categories available in a specified version of the dataset, including category IDs, names, and associated metadata.
- Parameters:
dataset_version (str) – The version of the dataset for which to fetch categories (e.g., “v1.0”).
- Returns:
A tuple containing:
- list of dict: Each dictionary contains dataset category details, including:
  - _id (str): Unique identifier for the category.
  - _idDataset (str): ID of the dataset to which this category belongs.
  - _idSuperCategory (str): Identifier for the super-category, if applicable.
  - datasetVersion (str): Version of the dataset for this category.
  - name (str): Name of the category.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set.
Example
>>> categories, err, msg = dataset.get_categories(dataset_version="v1.0")
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(categories[:3])
>>>
>>> # Sample output
>>> [
>>>     {'_id': '671638ef0f4507663b8ca2b7', '_idDataset': '671636dd6cffa65a7510a52b', '_idSuperCategory': '000000000000000000000000', 'datasetVersion': 'v1.0', 'name': 'Dog'},
>>>     {'_id': '671638ef0f4507663b8ca2b6', '_idDataset': '671636dd6cffa65a7510a52b', '_idSuperCategory': '000000000000000000000000', 'datasetVersion': 'v1.0', 'name': 'Cat'},
>>>     ...
>>> ]
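Since update_item_label() takes a label identifier, one practical use of get_categories() is building a name-to-ID lookup. A minimal sketch, under the assumption that a category's _id can serve as a label_id:
>>> categories, err, msg = dataset.get_categories(dataset_version="v1.0")
>>> if not err:
>>>     # Map category names to their IDs for later relabeling calls.
>>>     label_ids = {cat['name']: cat['_id'] for cat in categories}
>>>     print(label_ids.get('Dog'))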
- get_processed_versions()[source]#
Get all processed versions of the dataset.
This function retrieves a list of all versions of the dataset that have completed processing.
- Returns:
A tuple containing:
- list of dict: Each dictionary contains processed dataset version details, including:
  - _id (str): Unique identifier for the dataset.
  - _idProject (str): Project ID associated with the dataset.
  - allVersions (list of str): List of all versions of the dataset.
  - createdAt (str): Timestamp of when the dataset was created.
  - latestVersion (str): Identifier of the latest version of the dataset.
  - name (str): Name of the dataset.
  - processedVersions (list of str): List of processed versions.
  - stats (list of dict): Version-specific statistics, including:
    - classStat (dict): Category-specific counts for test, train, unassigned, and val.
    - version (str): Version identifier.
    - versionDescription (str): Description of the version.
    - versionStats (dict): Overall statistics, including total, train, test, and val counts.
    - versionStatus (str): Status of the version, usually “processed”.
  - updatedAt (str): Timestamp of the last dataset update.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set.
Example
>>> processed_versions, err, msg = dataset.get_processed_versions()
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(processed_versions[:3])
>>>
>>> # Sample output
>>> [
>>>     {'_id': '6703af894ddeac5b596b267b', '_idProject': '67036673ccb244bee86d1939', 'allVersions': ['v1.0', 'v1.1'], 'createdAt': '2024-10-07T09:53:13.223Z', 'name': 'Microcontroller', 'processedVersions': ['v1.1'], 'latestVersion': 'v1.1', ...},
>>>     ...
>>> ]
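Because split_data() exits if the source version has not finished processing, a pre-check against processedVersions can avoid a failed call. A minimal sketch using the fields documented above:
>>> versions, err, msg = dataset.get_processed_versions()
>>> if not err and versions:
>>>     # Confirm the source version is processed before splitting.
>>>     if "v1.0" in versions[0].get('processedVersions', []):
>>>         print("v1.0 is ready for split_data")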
- list_items(dataset_version, page_size=10, page_number=0)[source]#
List items for a specific version of the dataset.
This function retrieves a paginated list of items for the specified dataset version, allowing control over the number of items per page and the page number.
- Parameters:
dataset_version (str) – The version of the dataset to retrieve items from (e.g., “v1.0”).
page_size (int, optional) – The number of items to return per page (default is 10).
page_number (int, optional) – The page number to retrieve (default is 0).
- Returns:
A tuple containing:
- dict: API response with the paginated list of dataset items.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set.
Example
>>> items, err, msg = dataset.list_items(dataset_version="v1.0", page_size=10, page_number=0)
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(items)
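To iterate over every item, advance page_number until a page comes back empty. A minimal pagination sketch; the key that holds the item list inside the response dict is not documented here, so 'items' below is a hypothetical placeholder to inspect and adjust:
>>> page = 0
>>> while True:
>>>     items, err, msg = dataset.list_items(dataset_version="v1.0", page_size=50, page_number=page)
>>>     if err:
>>>         break
>>>     batch = items.get('items', [])  # hypothetical key; check the actual response shape
>>>     if not batch:
>>>         break
>>>     page += 1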
- rename(updated_name)[source]#
Update the name of the dataset.
This function updates the dataset name to a specified value. The dataset ID must be set during initialization for this function to work.
- Parameters:
updated_name (str) – The new name for the dataset.
- Returns:
A tuple containing:
- dict: API response confirming the dataset name update, including:
  - MatchedCount (int): Number of records matched for the update.
  - ModifiedCount (int): Number of records modified.
  - UpsertedCount (int): Number of records upserted (inserted if not existing).
  - UpsertedID (str or None): ID of the upserted record if applicable, otherwise None.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set.
Example
>>> response, err, msg = dataset.rename(updated_name="Updated Dataset Name")
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(response)
>>>
>>> # Sample output
>>> {
>>>     'MatchedCount': 1,
>>>     'ModifiedCount': 1,
>>>     'UpsertedCount': 0,
>>>     'UpsertedID': None
>>> }
- split_data(old_dataset_version, new_dataset_version, is_random_split, train_num=0, val_num=0, test_num=0, transfers=[{'destination': '', 'source': '', 'transferAmount': 1}], dataset_description='', version_description='', new_version_description='', compute_alias='')[source]#
Split or transfer images between training, validation, and test sets in the dataset.
This function enables the creation of a new dataset version by transferring or splitting images from an existing version into training, validation, and test sets, with options for random or manual split distribution.
- Parameters:
old_dataset_version (str) – The version identifier of the existing dataset.
new_dataset_version (str) – The version identifier of the new dataset.
is_random_split (bool) – Indicates whether to perform a random split.
train_num (int, optional) – Number of training samples (default is 0).
val_num (int, optional) – Number of validation samples (default is 0).
test_num (int, optional) – Number of test samples (default is 0).
transfers (list of dict, optional) – List specifying transfers between dataset sets. Each dictionary should contain:
- source (str): The source set (e.g., “train”).
- destination (str): The target set (e.g., “test”).
- transferAmount (int): Number of items to transfer (default is 1).
dataset_description (str, optional) – Description of the dataset (default is an empty string).
version_description (str, optional) – Description of the dataset version (default is an empty string).
new_version_description (str, optional) – Description of the new dataset version (default is an empty string).
compute_alias (str, optional) – Alias for the compute instance (default is an empty string).
- Returns:
A tuple containing:
- dict: API response indicating the status of the dataset split or transfer.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set or if the old_dataset_version is not processed.
Example
>>> response, err, msg = dataset.split_data(
>>>     old_dataset_version="v1.0",
>>>     new_dataset_version="v2.0",
>>>     is_random_split=True,
>>>     train_num=100,
>>>     val_num=20,
>>>     test_num=30,
>>>     transfers=[{"source": "train", "destination": "test", "transferAmount": 100}]
>>> )
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(response)
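For a deterministic redistribution instead of a random split, set is_random_split=False and describe each move in transfers. A minimal sketch moving 50 items from the training set to the validation set:
>>> response, err, msg = dataset.split_data(
>>>     old_dataset_version="v1.0",
>>>     new_dataset_version="v2.1",
>>>     is_random_split=False,
>>>     transfers=[{"source": "train", "destination": "val", "transferAmount": 50}]
>>> )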
- update_item_label(dataset_version, item_id, label_id)[source]#
Update the label of a specific dataset item.
This function assigns a new label to a specific item in a specified dataset version. The dataset ID must be set during initialization for this function to work.
- Parameters:
dataset_version (str) – The version of the dataset where the item resides (e.g., “v1.0”).
item_id (str) – The unique identifier of the dataset item to update.
label_id (str) – The unique identifier of the new label to assign to the dataset item.
- Returns:
A tuple containing:
- dict: API response confirming the label update.
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
- Raises:
SystemExit – If the dataset_id is not set.
Example
>>> response, err, msg = dataset.update_item_label(dataset_version="v1.0", item_id="12345", label_id="67890")
>>> if err:
>>>     pprint(err)
>>> else:
>>>     pprint(response)
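Relabeling several items is simply a loop over this call, and the tuple return makes per-item error handling straightforward. A minimal sketch over a list of known item IDs:
>>> for item_id in ["12345", "12346", "12347"]:
>>>     response, err, msg = dataset.update_item_label(
>>>         dataset_version="v1.0", item_id=item_id, label_id="67890"
>>>     )
>>>     if err:
>>>         pprint(err)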
- matrice.dataset.get_dataset_size(session, url, project_id)[source]#
Fetch the size of a dataset from the specified URL.
This function sends a request to retrieve the dataset size, measured in megabytes, for a given project.
- Parameters:
session (Session) – The active session used to communicate with the API.
url (str) – The URL of the dataset to fetch the size for.
project_id (str) – The ID of the project associated with the dataset.
- Returns:
A tuple containing:
- dict: API response with dataset size information (e.g., size in MB).
- str or None: Error message if an error occurred, None otherwise.
- str: Status message indicating success or failure.
- Return type:
tuple
Example
>>> size_info, err, msg = get_dataset_size(session=session, url="https://example.com/dataset.zip", project_id="12345")
>>> if err:
>>>     print(f"Error: {err}")
>>> else:
>>>     print(f"Dataset size: {size_info.get('size', 'N/A')} MB")
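One use is a pre-flight check before importing: fetch the size and only proceed if it fits a budget. A minimal sketch, assuming the size in MB is reported under the 'size' key as in the example above:
>>> size_info, err, msg = get_dataset_size(session=session, url="https://example.com/dataset.zip", project_id="12345")
>>> if not err and size_info.get('size', 0) < 1024:  # proceed only under ~1 GB
>>>     response, err, msg = dataset.add_data(
>>>         source="url",
>>>         source_url="https://example.com/dataset.zip",
>>>         new_dataset_version="v2.0",
>>>         old_dataset_version="v1.0"
>>>     )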
- matrice.dataset.upload_file(session, file_path)[source]#
Upload a file to the dataset. Only ZIP files are supported.
This function uploads a ZIP file to the dataset server for the specified session. It generates an upload URL, then uses it to transfer the file.
- Parameters:
session (Session) – The active session used to communicate with the API.
file_path (str) – The local path of the file to upload.
- Returns:
A dictionary containing:
- success (bool): Indicates if the upload was successful.
- data (str): URL of the uploaded file if successful, empty string otherwise.
- message (str): A status message indicating success or detailing any error.
- Return type:
dict
Example
>>> result = upload_file(session=session, file_path="path/to/data.zip")
>>> if result['success']:
>>>     print(f"File uploaded successfully: {result['data']}")
>>> else:
>>>     print(f"Error: {result['message']}")