This guide demonstrates how to configure the SALT library to load and preprocess datasets for multiple target languages.

1-to-N Translation Loading

In this example, we will set up a data pipeline that takes English as the source language and prepares it for translation into Luganda and Acholi.

Configuration

The configuration is defined in YAML format. We specify the Hugging Face dataset path, the source language (eng), and a list of target languages ([lug, ach]).
import salt.dataset
import salt.utils
import yaml

# Define the pipeline configuration
yaml_config = '''
huggingface_load:
  path: Sunbird/salt
  split: train
  name: text-all
source:
  type: text
  language: eng
  preprocessing:
      - prefix_target_language
target:
  type: text
  language: [lug, ach]
'''

# Initialize the dataset from config
config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

# Fetch and display the first 5 samples
print(list(ds.take(5)))

Expected Output

The data loader prefixes the source text with the target language token (e.g., >>lug<< or >>ach<<), which is a common technique for multilingual translation models.
[
  {
    "source": ">>lug<< Eggplants always grow best under warm conditions.",
    "target": "Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu"
  },
  {
    "source": ">>ach<< Eggplants always grow best under warm conditions.",
    "target": "Bilinyanya pol kare dongo maber ka lyeto tye"
  },
  ...
]
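The prefixing behaviour shown above can be reproduced in a few lines of plain Python. This is an illustrative sketch only, independent of SALT; the `expand_targets` helper and the sample dictionary layout are hypothetical, not part of the SALT API.

```python
def expand_targets(example, target_languages):
    """Expand one English sentence into one training pair per target
    language, prefixing the source text with a >>lang<< token."""
    pairs = []
    for lang, translation in zip(target_languages, example["translations"]):
        pairs.append({
            "source": f">>{lang}<< {example['text']}",
            "target": translation,
        })
    return pairs

# Hypothetical sample record with one translation per target language
sample = {
    "text": "Eggplants always grow best under warm conditions.",
    "translations": [
        "Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu",
        "Bilinyanya pol kare dongo maber ka lyeto tye",
    ],
}

pairs = expand_targets(sample, ["lug", "ach"])
for p in pairs:
    print(p["source"])
```

Each English sentence is duplicated once per target language, so a 1-to-N configuration multiplies the number of training pairs by N.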

Summary

This basic setup lets you stream multilingual training batches efficiently using the SALT library.
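To make the streaming idea concrete, here is a minimal sketch of grouping a (potentially unbounded) stream of examples into fixed-size batches. The `stream_pairs` generator is a stand-in for the SALT dataset stream, not the library's actual implementation.

```python
from itertools import islice

def stream_pairs():
    """Stand-in for a streaming dataset: yields source/target dicts forever."""
    i = 0
    while True:
        yield {"source": f">>lug<< sentence {i}", "target": f"sentensi {i}"}
        i += 1

def batched(stream, batch_size):
    """Group a (possibly infinite) stream into lists of batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Pull one batch of 4 pairs without materialising the whole stream
first_batch = next(batched(stream_pairs(), 4))
print(len(first_batch))
```

Because batches are drawn lazily, memory use stays constant regardless of dataset size, which is the main benefit of a streaming pipeline.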