Generate, preprocess, and clean any type of dataset from a single prompt

YuvDwi/sift

Sift

SIFT (SUPER Intelligent Filtering Tool)

SIFT (SUPER Intelligent Filtering Tool) is a desktop application that generates and cleans machine learning datasets from a single natural language prompt. It uses intelligent agents to identify the best matching datasets from external sources, then applies interactive feedback and reinforcement learning to refine the data into a curated collection.

Features

  • Prompt-based dataset generation - Enter a description (e.g., "images of solar panels on rooftops") and SIFT's agents query APIs such as Kaggle and Hugging Face. The most relevant datasets are automatically selected and downloaded.
  • Interactive feedback loop - The application surfaces random samples from the chosen datasets. Users can accept or reject images and optionally provide explanations. This feedback is used to guide dataset refinement.
  • Reinforcement learning for dataset cleaning - A PyTorch pipeline with CLIP embeddings applies reinforcement learning from human feedback (RLHF), learning user preferences and filtering out irrelevant samples.
  • Cross-platform desktop app - Built with React and TypeScript and packaged with Electron, making it available on Windows, macOS, and Linux.
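As a rough illustration of the dataset-selection feature, the sketch below ranks candidate datasets by how many prompt words appear in their metadata. It is not SIFT's actual agent logic: the `Candidate` fields, the `relevance` score, and `select_datasets` are hypothetical names, and simple keyword overlap stands in for the agents' calls to the Kaggle and Hugging Face APIs.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical search result; real API responses carry more fields."""
    source: str        # e.g. "kaggle" or "huggingface"
    name: str
    description: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(prompt: str, candidate: Candidate) -> float:
    """Fraction of prompt tokens found in the candidate's name or description."""
    prompt_tokens = tokenize(prompt)
    if not prompt_tokens:
        return 0.0
    candidate_tokens = tokenize(candidate.name + " " + candidate.description)
    return len(prompt_tokens & candidate_tokens) / len(prompt_tokens)


def select_datasets(prompt: str, candidates: list[Candidate], top_k: int = 2) -> list[Candidate]:
    """Return the top_k candidates, highest relevance first (ties keep input order)."""
    ranked = sorted(candidates, key=lambda c: relevance(prompt, c), reverse=True)
    return ranked[:top_k]


# Made-up candidates for illustration only.
candidates = [
    Candidate("kaggle", "rooftop-solar", "Aerial images of solar panels on rooftops"),
    Candidate("huggingface", "street-signs", "Photos of traffic signs"),
    Candidate("huggingface", "solar-farms", "Satellite images of solar farms"),
]
best = select_datasets("images of solar panels on rooftops", candidates)
print([c.name for c in best])  # ['rooftop-solar', 'solar-farms']
```

A real implementation would likely score candidates with embeddings or an LLM rather than token overlap, but the shape (score every candidate, keep the top few) stays the same.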

How It Works

  1. The user provides a dataset prompt.
  2. Backend agents call APIs like Kaggle and Hugging Face to search for and select the best matching datasets.
  3. Selected datasets are stored locally.
  4. The interface presents random samples for user review.
  5. User feedback is collected and used to train a preference model.
  6. The RLHF pipeline filters and cleans the dataset.
  7. The result is a curated dataset tailored to the user's needs.
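Steps 5 and 6 can be sketched as a small preference model that learns from accept/reject labels and then filters the remaining samples. This is a simplified stand-in, not the project's pipeline: plain logistic regression over toy 2-D vectors replaces the PyTorch RLHF setup and CLIP embeddings, and all function names, sample names, and hyperparameters here are assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, embedding):
    """Predicted probability that the user would accept this sample."""
    return sigmoid(sum(w * x for w, x in zip(weights, embedding)) + bias)


def train_preference_model(embeddings, labels, epochs=200, lr=0.5):
    """Fit logistic regression by stochastic gradient descent.

    labels: 1 for an accepted sample, 0 for a rejected one.
    """
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for embedding, label in zip(embeddings, labels):
            error = score(weights, bias, embedding) - label
            weights = [w - lr * error * x for w, x in zip(weights, embedding)]
            bias -= lr * error
    return weights, bias


def filter_dataset(samples, weights, bias, threshold=0.5):
    """Keep the names of samples whose predicted acceptance meets the threshold."""
    return [name for name, emb in samples if score(weights, bias, emb) >= threshold]


# Toy vectors standing in for CLIP embeddings of reviewed samples.
reviewed = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [1, 1, 0, 0]  # user accepted the first two, rejected the last two
weights, bias = train_preference_model(reviewed, labels)

# Unreviewed samples are kept or dropped based on the learned preference.
unreviewed = [("img_a", [0.95, 0.15]), ("img_b", [0.15, 0.9])]
kept = filter_dataset(unreviewed, weights, bias)
print(kept)  # ['img_a']
```

The threshold controls how aggressive the cleaning is: raising it drops more borderline samples at the cost of discarding some relevant ones.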

Technology Stack

  • Frontend: React, TypeScript, Electron
  • Backend: FastAPI, LangChain
  • Machine Learning: PyTorch, CLIP, RLHF
  • Data Sources: Kaggle API, Hugging Face Datasets

Example Workflow

  • Prompt: "cats in costumes"
  • Agents query APIs and identify the most relevant datasets from Hugging Face and Kaggle
  • Random samples are shown in the interface (for example, cats, dogs, and unrelated animals)
  • User accepts cats in costumes, rejects irrelevant samples, and explains why
  • The RLHF model adapts to the feedback and updates the filtering process
  • Final dataset contains only the images that fit the prompt and user preferences
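The review step in this workflow could be represented along the following lines. This is a minimal sketch under assumptions: the `Feedback` record, `sample_for_review`, `to_training_pairs`, and the file names are all hypothetical and only illustrate how accept/reject decisions with optional explanations might be captured and turned into training labels.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    """One user decision about one sample (hypothetical structure)."""
    sample_id: str
    accepted: bool
    explanation: Optional[str] = None  # optional free-text reason


def sample_for_review(sample_ids, k, seed=None):
    """Draw up to k distinct samples at random for the user to review."""
    rng = random.Random(seed)
    return rng.sample(sample_ids, min(k, len(sample_ids)))


def to_training_pairs(feedback):
    """Convert feedback into (sample_id, label) pairs: 1 = accepted, 0 = rejected."""
    return [(f.sample_id, 1 if f.accepted else 0) for f in feedback]


# Made-up file names for the "cats in costumes" prompt.
pool = ["cat_witch.jpg", "dog_park.jpg", "cat_pirate.jpg", "horse.jpg"]
batch = sample_for_review(pool, k=3, seed=0)  # which three are drawn depends on the seed

# Decisions the user might record while reviewing.
feedback = [
    Feedback("cat_witch.jpg", True),
    Feedback("dog_park.jpg", False, "dog, not a cat"),
    Feedback("horse.jpg", False, "unrelated animal"),
]
pairs = to_training_pairs(feedback)
print(pairs)  # [('cat_witch.jpg', 1), ('dog_park.jpg', 0), ('horse.jpg', 0)]
```

Keeping the explanation alongside the binary label leaves room for richer signals later, even if only the label is used for training at first.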

Why SIFT

Manually curating datasets is slow and error-prone. While APIs provide access to large datasets, they often include irrelevant or noisy data. SIFT automates dataset discovery through intelligent agents, then incorporates user feedback and reinforcement learning to ensure the final dataset is accurate, relevant, and high quality.

Roadmap

  • Expand beyond images to include text and audio datasets
  • Support cloud storage exports (e.g., AWS S3, Google Cloud Storage)
  • Add advanced feedback tools such as multi-label selection and bounding box annotation
  • Integrate fine-tuning pipelines for direct model training
