Introduction
The SALT repository provides tools and resources for working with Sunbird African Language Technology (SALT) datasets. It supports creating multilingual datasets, preprocessing data, and training and evaluating multilingual models. Its training utilities build on HuggingFace frameworks, which makes it useful for machine translation and natural language processing (NLP) in underrepresented languages.
Key Features
- Multilingual dataset handling and preprocessing.
- Metrics for evaluating NLP models.
- Utilities for training HuggingFace models.
- Jupyter notebooks demonstrating use cases.
- Documentation support via MkDocs.
Installation
Prerequisites
Ensure you have the following installed:
- Python 3.8 or above.
- Git for cloning the repository.
- pip for managing Python packages.
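The prerequisites above can be checked from Python before installing anything. The snippet below is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper written for this illustration, not part of SALT.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Report whether each prerequisite appears to be available (hypothetical helper)."""
    return {
        # Compare only (major, minor) against the required minimum version.
        "python": sys.version_info[:2] >= tuple(min_python),
        # shutil.which returns None when the executable is not on PATH.
        "git": shutil.which("git") is not None,
        # pip counts as available if this interpreter can locate the package.
        "pip": importlib.util.find_spec("pip") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```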
Steps
- Clone the repository. Eventually this step will be replaced by ‘pip install salt’.
- Install the required dependencies.
- Verify the installation by running the tests.
Getting Started
Loading a Dataset
Use the tools provided in dataset.py to load a dataset.
Preprocessing Data
Use preprocessing.py for operations such as cleaning, augmentation, and formatting.
Training a Model
Leverage the HuggingFace training utilities.
Modules Overview
dataset.py
Purpose
Handles dataset loading, validation, and conversion tasks.
Key Functions
create(config):
- Generates a dataset based on the provided configuration.
- Example usage:
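As a rough illustration of configuration-driven dataset creation, the sketch below builds translation pairs from parallel records. The configuration keys (`source_languages`, `target_languages`), the record layout, and the `create_pairs` helper are all assumptions made for this example; they are not SALT's actual schema or the signature of `create(config)`.

```python
def create_pairs(config, records):
    """Build translation examples from parallel records (hypothetical schema)."""
    examples = []
    for record in records:
        for src in config["source_languages"]:
            for tgt in config["target_languages"]:
                # Skip identity pairs and records missing either side.
                if src == tgt or not record.get(src) or not record.get(tgt):
                    continue
                examples.append({
                    "source": record[src],
                    "target": record[tgt],
                    "source_language": src,
                    "target_language": tgt,
                })
    return examples


# Illustrative configuration and data; the language codes are assumptions.
config = {"source_languages": ["lug"], "target_languages": ["eng"]}
records = [
    {"lug": "Webale", "eng": "Thank you"},
    {"lug": "Oli otya?"},  # no English side, so this record is skipped
]
pairs = create_pairs(config, records)
```

Listing both languages under both keys would yield pairs in each direction, since identity pairs are skipped.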
preprocessing.py
Purpose
Provides tools for cleaning and formatting text and audio data.
Key Functions
- clean_text:
  - Cleans text by removing noise and standardizing formatting.
  - Example usage:
- random_case:
  - Randomizes casing to simulate realistic variability in text data.
- augment_audio_noise:
  - Adds controlled noise to audio samples for robustness.
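The three operations above can be sketched with the standard library alone. These are simplified stand-ins written for illustration, with hypothetical names and parameters; SALT's own `clean_text`, `random_case`, and `augment_audio_noise` may behave differently (for example, by mixing in recorded noise rather than Gaussian noise).

```python
import random
import re
import unicodedata


def clean_text_sketch(text):
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Remove control characters (category "Cc") unless they are whitespace.
    text = "".join(
        ch for ch in text if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    return re.sub(r"\s+", " ", text).strip()


def random_case_sketch(text, rng):
    """Apply one randomly chosen casing style to the whole string."""
    style = rng.choice([str.lower, str.upper, str.capitalize, str.title])
    return style(text)


def add_gaussian_noise(samples, rng, noise_std=0.01):
    """Add zero-mean Gaussian noise to float samples and clip to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s + rng.gauss(0.0, noise_std))) for s in samples]


rng = random.Random(0)  # seeded so augmentation is reproducible
cleaned = clean_text_sketch("  Hello\x00   world\n")
recased = random_case_sketch("hello world", rng)
noisy = add_gaussian_noise([0.0, 0.5, -0.5, 1.0], rng)
```

Passing an explicit `random.Random` instance keeps the augmentations reproducible, which helps when comparing training runs.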
metrics.py
Purpose
Defines evaluation metrics for NLP tasks.
Key Functions
multilingual_eval:
- Computes BLEU and other metrics for multilingual tasks.
- Example usage:
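To show the general shape of a per-language evaluation, the sketch below averages a sentence-level score separately for each language label. It deliberately swaps in clipped unigram precision (the 1-gram component of BLEU, without the brevity penalty) as a simple stand-in for full BLEU. The function names and argument layout are assumptions for this example and do not reflect the signature of `multilingual_eval`.

```python
from collections import Counter, defaultdict


def unigram_precision(prediction, reference):
    """Clipped unigram precision: BLEU's 1-gram component, no brevity penalty."""
    pred_tokens = prediction.split()
    if not pred_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    pred_counts = Counter(pred_tokens)
    # Each predicted token is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[token]) for token, count in pred_counts.items())
    return matched / len(pred_tokens)


def evaluate_by_language(predictions, references, languages, metric=unigram_precision):
    """Average a sentence-level metric separately for each language label."""
    scores = defaultdict(list)
    for pred, ref, lang in zip(predictions, references, languages):
        scores[lang].append(metric(pred, ref))
    return {lang: sum(vals) / len(vals) for lang, vals in scores.items()}


results = evaluate_by_language(
    predictions=["a b", "c d", "x"],
    references=["a b", "c e", "x"],
    languages=["eng", "eng", "lug"],
)
```

Reporting scores per language rather than as a single average makes it easier to spot a language that lags behind the others. A production setup would more likely pass a corpus-level BLEU implementation in place of the stand-in metric.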
utils.py
Purpose
Provides utilities for model training, evaluation, and debugging.
Key Classes and Functions
- TrainableM2MForConditionalGeneration:
  - Customizes training for multilingual translation models.
  - Example usage:
- ForcedVariableBOSTokenLogitsProcessor:
  - Allows dynamic BOS token adjustments.
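The idea behind a variable forced BOS token can be illustrated without any deep learning framework. In multilingual translation models of the M2M family, the first generated token typically selects the target language, so forcing a different token per batch row lets one batch translate into several languages. The sketch below is a plain-Python approximation under that assumption; `force_first_token` and its arguments are hypothetical and do not reproduce the actual `ForcedVariableBOSTokenLogitsProcessor` implementation.

```python
import math


def force_first_token(logits_batch, forced_token_ids, step):
    """At step 0, mask every logit except each row's forced token id."""
    if step != 0:
        return logits_batch  # later decoding steps are left untouched
    forced = []
    for row, token_id in zip(logits_batch, forced_token_ids):
        # -inf removes a token from consideration; 0.0 leaves one valid choice.
        masked = [-math.inf] * len(row)
        masked[token_id] = 0.0
        forced.append(masked)
    return forced


# Two batch rows over a toy vocabulary of three tokens.
logits = [[0.1, 0.2, 0.3], [0.5, 0.4, 0.1]]
first_step = force_first_token(logits, forced_token_ids=[2, 0], step=0)
later_step = force_first_token(logits, forced_token_ids=[2, 0], step=1)
```

After masking, greedy or sampled decoding can only pick the forced token for each row at the first step.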

