Create a dataset

A dataset is a named collection of images at the organization level that you label and use for training.

Platform requirements

Viam rejects a training job if the dataset does not meet these minimums:

RequirementMinimum
Labeled images80% of the dataset
Examples per label10 images (classification) or 10 bounding boxes (object detection)
Total images15 (object detection only; classification has no total-image floor)

Recommendations for quality

Platform minimums get a training job accepted; they do not guarantee a good model. For production use:

  • Aim for hundreds of examples per label under varied conditions.
  • Keep label counts roughly balanced – no label should have more than 3x the images of any other.
  • Capture images from the deployment environment, meaning the real lighting, backgrounds, camera angles, and distances your machine uses. Stock datasets and stylized marketing images do not generalize.

1. Create a dataset

You can create a dataset from the web UI, the CLI, or programmatically.

Web UI:

  1. Go to app.viam.com.
  2. Click the DATA tab in the top navigation.
  3. Click the DATASETS subtab.
  4. Click + Create dataset.
  5. Enter a descriptive name for your dataset. Use a name that reflects the task, such as inspection-parts-v1 or package-detection. Dataset names must be unique within your organization.
  6. Click Create.

Your empty dataset now appears in the list.

CLI:

If you have the Viam CLI installed, create a dataset from the command line:

viam dataset create --org-id=YOUR-ORG-ID --name=my-inspection-dataset

The command returns the dataset ID, which you will need for subsequent CLI and SDK operations.

import asyncio
from viam.rpc.dial import DialOptions
from viam.app.viam_client import ViamClient


API_KEY = "YOUR-API-KEY"
API_KEY_ID = "YOUR-API-KEY-ID"
ORG_ID = "YOUR-ORGANIZATION-ID"


async def connect() -> ViamClient:
    dial_options = DialOptions.with_api_key(API_KEY, API_KEY_ID)
    return await ViamClient.create_from_dial_options(dial_options)


async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    dataset_id = await data_client.create_dataset(
        name="my-inspection-dataset",
        organization_id=ORG_ID,
    )
    print(f"Created dataset: {dataset_id}")

    viam_client.close()


if __name__ == "__main__":
    asyncio.run(main())
package main

import (
    "context"
    "fmt"

    "go.viam.com/rdk/app"
    "go.viam.com/rdk/logging"
)

func main() {
    apiKey := "YOUR-API-KEY"
    apiKeyID := "YOUR-API-KEY-ID"
    orgID := "YOUR-ORGANIZATION-ID"

    ctx := context.Background()
    logger := logging.NewDebugLogger("create-dataset")

    viamClient, err := app.CreateViamClientWithAPIKey(
        ctx, app.Options{}, apiKey, apiKeyID, logger)
    if err != nil {
        logger.Fatal(err)
    }
    defer viamClient.Close()

    dataClient := viamClient.DataClient()

    datasetID, err := dataClient.CreateDataset(
        ctx, "my-inspection-dataset", orgID)
    if err != nil {
        logger.Fatal(err)
    }
    fmt.Printf("Created dataset: %s\n", datasetID)
}

Replace all placeholder values (YOUR-API-KEY, YOUR-API-KEY-ID, YOUR-ORGANIZATION-ID) with your actual values. The API key must be organization-scoped – machine-scoped and location-scoped keys cannot create datasets. To fetch or create these values from the CLI:

viam organizations list
viam organizations api-key create --org-id=YOUR-ORGANIZATION-ID --name=training

You can also find your organization ID by clicking your organization name in the top navigation bar and then clicking Settings.

2. Add images to the dataset

With a dataset created, you need to populate it with images.

Web UI:

  1. Click the DATA tab in the top navigation.
  2. Use the filters to find the images you want. Filter by machine, component, time range, or tags.
  3. Select individual images by clicking their checkboxes, or use Select all to select all visible images.
  4. Click Add to dataset in the action bar that appears.
  5. Select your dataset from the dropdown.
  6. Click Add.

The selected images are now part of your dataset.

CLI:

Add images to a dataset using filter criteria:

viam dataset data add filter \
  --dataset-id=YOUR-DATASET-ID \
  --location-id=YOUR-LOCATION-ID \
  --tags=label1,label2

This adds all images matching the filter to the dataset. You can filter by location, machine, component, tags, or time range.

async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    await data_client.add_binary_data_to_dataset_by_ids(
        binary_ids=["binary-data-id-1", "binary-data-id-2"],
        dataset_id="YOUR-DATASET-ID",
    )
    print("Images added to dataset.")

    viam_client.close()

You can get binary data IDs by querying for images first using the data client’s binary_data_by_filter method, which returns objects that include the binary ID.

err = dataClient.AddBinaryDataToDatasetByIDs(
    ctx,
    []string{"binary-data-id-1", "binary-data-id-2"},
    "YOUR-DATASET-ID",
)
if err != nil {
    logger.Fatal(err)
}
fmt.Println("Images added to dataset.")

3. Annotate your images

Before training, you need to label every image in your dataset with tags (for classification) or bounding boxes (for object detection).

See Annotate images for step-by-step instructions on manual labeling, or Automate annotation to use an existing ML model to speed up the process.

4. Verify dataset quality

Before you train a model, check that your dataset meets the requirements.

In the web UI:

  1. Go to the DATA tab and click the DATASETS subtab.
  2. Click your dataset to open it.
  3. Review the dataset summary, which shows:
    • Total number of images
    • Number of labeled images
    • Labels used and their counts
  4. Check each requirement:
CheckWhat to look for
Enough imagesObject detection: at least 15 total. Classification: at least 10 per label (the binding constraint).
Labeling coverageAt least 80% of images have tags or bounding boxes
Examples per labelAt least 10 images per label
Label balanceNo label should have more than 3x the images of any other label
Production conditionsImages should represent real operating conditions, not staged or ideal setups

Common issues to fix before training:

  • Too few images in one class: Capture more images of the underrepresented class, or remove the class and merge it with a related one.
  • Unlabeled images: Either label them or remove them from the dataset. Unlabeled images do not help training and can confuse the summary statistics.
  • Non-representative images: If your model will run on a factory floor but your training images were taken on a clean desk, the model will not generalize. Capture images under production conditions – with the actual lighting, background, camera angle, and distance your machine uses.
async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    datasets = await data_client.list_datasets_by_organization_id(
        organization_id=ORG_ID,
    )
    for ds in datasets:
        print(f"Dataset: {ds.name}, ID: {ds.id}")

    viam_client.close()
datasets, err := dataClient.ListDatasetsByOrganizationID(ctx, orgID)
if err != nil {
    logger.Fatal(err)
}
for _, ds := range datasets {
    fmt.Printf("Dataset: %s, ID: %s\n", ds.Name, ds.ID)
}

Troubleshooting

Dataset creation fails
  • Name already exists. Dataset names must be unique within your organization. Choose a different name or delete the existing dataset if it is no longer needed.
  • Permission denied. Verify that your API key has organization-level access. API keys scoped to a single machine or location cannot create datasets.
Images not appearing in the dataset
  • Sync delay. Images must sync from the machine to the cloud before they are available to add to a dataset. Wait a minute and check the DATA tab to confirm images have arrived.
  • Wrong filter. If using the CLI add filter command, double-check your filter criteria. A typo in a tag name or location ID will match zero images without an error.
  • Binary ID mismatch. If adding images programmatically by ID, verify that the binary IDs are correct. You can retrieve valid IDs using the binary_data_by_filter method.
Label imbalance warnings
  • Collect more data for underrepresented labels. The most effective fix is to capture more images of the minority class under varied conditions.
  • Do not duplicate images. Adding copies of the same image inflates the count without adding information. The model needs diverse examples.
  • Consider merging labels. If two labels are very similar and one has few examples, merge them into a single label.

What’s next

  • Annotate images – label your images with tags or bounding boxes for training.
  • Automate annotation – use an existing ML model to auto-label images instead of doing it by hand.
  • Train a model – use your labeled dataset to train a classification or object detection model.