SIFT (SUPER Intelligent Filtering Tool) is a desktop application that generates and cleans machine learning datasets from a single natural language prompt. Its agents search external sources for the best-matching datasets, then use interactive feedback and reinforcement learning to refine the data into a curated collection.
- Prompt-based dataset generation - Enter a description (e.g., "images of solar panels on rooftops") and SIFT's agents query APIs such as Kaggle and Hugging Face. The most relevant datasets are automatically selected and downloaded.
- Interactive feedback loop - The application surfaces random samples from the chosen datasets. Users can accept or reject images and optionally provide explanations. This feedback is used to guide dataset refinement.
- Reinforcement learning for dataset cleaning - A PyTorch pipeline with CLIP embeddings applies reinforcement learning from human feedback (RLHF), learning user preferences and filtering out irrelevant samples.
- Cross-platform desktop app - Built with React and TypeScript and packaged with Electron, making it available on Windows, macOS, and Linux.
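
As a rough illustration of how candidate datasets might be ranked against a prompt, the sketch below scores each candidate by word overlap with the prompt. It is a minimal stand-in under stated assumptions, not SIFT's actual agent logic: the `Candidate` record, the `relevance` scoring rule, and the sample dataset names are all hypothetical, and no real Kaggle or Hugging Face calls are made.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical metadata record for a dataset returned by a search API."""
    source: str        # e.g. "kaggle" or "huggingface"
    name: str
    description: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def relevance(prompt: str, candidate: Candidate) -> float:
    """Fraction of prompt words found in the candidate's name or description."""
    prompt_words = tokenize(prompt)
    if not prompt_words:
        return 0.0
    candidate_words = tokenize(candidate.name + " " + candidate.description)
    return len(prompt_words & candidate_words) / len(prompt_words)


def select_datasets(prompt: str, candidates: list[Candidate], top_k: int = 2) -> list[Candidate]:
    """Return the top_k candidates by relevance, dropping those with no overlap."""
    scored = [(relevance(prompt, c), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


# Made-up candidates standing in for search results from the two sources.
candidates = [
    Candidate("kaggle", "solar-panel-rooftops", "Aerial images of rooftops with solar panels"),
    Candidate("huggingface", "street-signs", "Photos of traffic signs"),
    Candidate("huggingface", "rooftop-segmentation", "Satellite images of rooftops"),
]
selected = select_datasets("images of solar panels on rooftops", candidates)
print([c.name for c in selected])
```

A real agent would likely compare embeddings or ask a language model to judge relevance; word overlap is used here only to keep the example dependency-free.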
- The user provides a dataset prompt.
- Backend agents call APIs such as Kaggle and Hugging Face to search for and select the best-matching datasets.
- Selected datasets are stored locally.
- The interface presents random samples for user review.
- User feedback is collected and used to train a preference model.
- The RLHF pipeline filters and cleans the dataset.
- The result is a curated dataset tailored to the user's needs.
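
The feedback-to-filter steps above can be sketched with a small preference model. This is a simplified stand-in, not the project's pipeline: it assumes each image has already been turned into an embedding vector (SIFT uses CLIP for this; the two-dimensional vectors here are invented toy values), and it swaps the PyTorch RLHF pipeline for a plain logistic regression trained on accept/reject labels. All function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, clamped to avoid overflow in math.exp."""
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def train_preference_model(embeddings, labels, epochs=200, lr=0.5):
    """Fit logistic regression with SGD. Labels: 1 = accepted, 0 = rejected."""
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for vec, label in zip(embeddings, labels):
            pred = sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)
            error = pred - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * x for w, x in zip(weights, vec)]
            bias -= lr * error
    return weights, bias


def score(model, vec):
    """Predicted probability that the user would accept this sample."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)


def filter_dataset(model, items, threshold=0.5):
    """Keep the names of items whose predicted acceptance meets the threshold."""
    return [name for name, vec in items.items() if score(model, vec) >= threshold]


# Toy embeddings (assumption): the first axis loosely means "on topic".
reviewed = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
decisions = [1, 1, 0, 0]  # user accepted the first two, rejected the rest

model = train_preference_model(reviewed, decisions)

unreviewed = {"img_a": [0.85, 0.15], "img_b": [0.15, 0.85]}
kept = filter_dataset(model, unreviewed)
print(kept)
```

The threshold controls how aggressively the dataset is pruned: raising it keeps fewer, higher-confidence samples, while lowering it keeps more borderline ones.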
- Frontend: React, TypeScript, Electron
- Backend: FastAPI, LangChain
- Machine Learning: PyTorch, CLIP, RLHF
- Data Sources: Kaggle API, Hugging Face Datasets
- Prompt: "cats in costumes"
- Agents query the Kaggle and Hugging Face APIs and identify the most relevant datasets.
- Random samples are shown in the interface (for example, cats, dogs, and unrelated animals).
- The user accepts cats in costumes, rejects irrelevant samples, and explains why.
- The RLHF model adapts to the feedback and updates the filtering process.
- The final dataset contains only images that fit the prompt and the user's preferences.
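
To make the review step of this example concrete, the sketch below shows one way the accept/reject decisions and optional explanations could be recorded and turned into training labels. The `Feedback` record, the helper functions, and the file names are hypothetical and chosen only for illustration; SIFT's real data model may differ.

```python
import json
import random
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Feedback:
    """Hypothetical record of one review decision."""
    sample_id: str
    accepted: bool
    explanation: Optional[str] = None  # explanations are optional


def draw_review_batch(sample_ids, batch_size, seed=None):
    """Pick a random batch of distinct samples to show the reviewer."""
    rng = random.Random(seed)
    return rng.sample(sample_ids, min(batch_size, len(sample_ids)))


def to_training_pairs(feedback):
    """Convert feedback into (sample_id, label) pairs: 1 = accepted, 0 = rejected."""
    return [(f.sample_id, 1 if f.accepted else 0) for f in feedback]


# Invented file names standing in for downloaded images.
pool = ["cat_wizard.jpg", "dog_park.jpg", "cat_pirate.jpg", "parrot.jpg", "cat_plain.jpg"]
batch = draw_review_batch(pool, batch_size=3, seed=0)

# Decisions a user might make for the prompt "cats in costumes".
feedback = [
    Feedback("cat_wizard.jpg", True),
    Feedback("dog_park.jpg", False, "not a cat"),
    Feedback("cat_plain.jpg", False, "cat is not wearing a costume"),
]
pairs = to_training_pairs(feedback)
print(json.dumps([asdict(f) for f in feedback], indent=2))
```

Keeping the explanation alongside the binary decision leaves room for a model that uses the text as an extra signal, although this sketch only converts the decisions into labels.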
Manually curating datasets is slow and error-prone. While APIs provide access to large datasets, they often include irrelevant or noisy data. SIFT automates dataset discovery through intelligent agents, then incorporates user feedback and reinforcement learning to ensure the final dataset is accurate, relevant, and high quality.
- Expand beyond images to include text and audio datasets
- Support cloud storage exports (e.g., AWS S3, Google Cloud Storage)
- Add advanced feedback tools such as multi-label selection and bounding box annotation
- Integrate fine-tuning pipelines for direct model training