SIFT (SUPER Intelligent Filtering Tool)

SIFT (SUPER Intelligent Filtering Tool) is a desktop application that generates and cleans machine learning datasets from a single natural language prompt. It uses intelligent agents to identify the best-matching datasets from external sources, then applies interactive feedback and reinforcement learning to refine the data into a curated collection.

Features

Prompt-based dataset generation - Enter a description (e.g., "images of solar panels on rooftops") and SIFT's agents query APIs such as Kaggle and Hugging Face. The most relevant datasets are automatically selected and downloaded.
Interactive feedback loop - The application surfaces random samples from the chosen datasets. Users can accept or reject images and optionally provide explanations. This feedback is used to guide dataset refinement.
Reinforcement learning for dataset cleaning - A PyTorch pipeline with CLIP embeddings applies reinforcement learning from human feedback (RLHF), learning user preferences and filtering out irrelevant samples.
Cross-platform desktop app - Built with React and TypeScript and packaged with Electron, making it available on Windows, macOS, and Linux.
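
To make the dataset-selection idea concrete, here is a minimal sketch of ranking candidate datasets against a prompt. It is an assumption-laden stand-in, not SIFT's actual agent logic: it scores candidates by plain token overlap (Jaccard similarity) instead of calling the Kaggle or Hugging Face APIs, and the candidate names and descriptions are hypothetical.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def relevance(prompt: str, description: str) -> float:
    """Jaccard overlap between prompt tokens and description tokens."""
    p, d = tokenize(prompt), tokenize(description)
    if not p or not d:
        return 0.0
    return len(p & d) / len(p | d)

def rank_candidates(prompt: str, candidates: list, top_k: int = 2) -> list:
    """Return the names of the top_k candidates with a non-zero score."""
    scored = [(relevance(prompt, c["description"]), c["name"]) for c in candidates]
    # Sort by score only; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

# Hypothetical candidate metadata, standing in for real API search results.
CANDIDATES = [
    {"name": "example/rooftop-solar",
     "description": "Aerial images of solar panels on rooftops"},
    {"name": "example/street-signs",
     "description": "Photos of street signs in cities"},
    {"name": "example/solar-farms",
     "description": "Satellite images of solar farms"},
]

selected = rank_candidates("images of solar panels on rooftops", CANDIDATES)
print(selected)
```

A real agent would likely combine search-API results with richer signals (dataset size, license, label quality); token overlap is used here only because it is easy to follow.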

How It Works

The user provides a dataset prompt.
Backend agents call APIs like Kaggle and Hugging Face to search for and select the best-matching datasets.
Selected datasets are stored locally.
The interface presents random samples for user review.
User feedback is collected and used to train a preference model.
The RLHF pipeline filters and cleans the dataset.
The result is a curated dataset tailored to the user's needs.
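
The preference-learning and filtering steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's RLHF pipeline: it swaps in a plain logistic-regression classifier trained on accept/reject labels, and it uses toy 2-D vectors where SIFT would use CLIP embeddings. All function names and data here are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_preference_model(embeddings, labels, lr=0.5, epochs=200):
    """Fit logistic regression with per-sample gradient descent.

    embeddings: list of equal-length float vectors (stand-ins for CLIP embeddings)
    labels: 1 for accepted samples, 0 for rejected samples
    """
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def score(model, x) -> float:
    """Predicted probability that the user would accept this sample."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def filter_dataset(model, items, threshold=0.5):
    """Keep the names of items whose predicted acceptance meets the threshold."""
    return [name for name, emb in items if score(model, emb) >= threshold]

# Toy feedback: two accepted and two rejected samples.
reviewed = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
decisions = [1, 1, 0, 0]
model = train_preference_model(reviewed, decisions)

# Unreviewed samples are filtered with the learned preferences.
unreviewed = [("sample_a", [0.85, 0.15]), ("sample_b", [0.15, 0.85])]
kept = filter_dataset(model, unreviewed)
print(kept)
```

The threshold of 0.5 is an arbitrary choice for the sketch; a stricter value trades recall for a cleaner dataset.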

Technology Stack

Frontend: React, TypeScript, Electron
Backend: FastAPI, LangChain
Machine Learning: PyTorch, CLIP, RLHF
Data Sources: Kaggle API, Hugging Face Datasets

Example Workflow

Prompt: "cats in costumes"
Agents query APIs and identify the most relevant datasets from Hugging Face and Kaggle
Random samples are shown in the interface (cats, dogs, unrelated animals)
User accepts cats in costumes, rejects irrelevant samples, and explains why
RLHF model adapts to feedback and updates the filtering process
Final dataset contains only the images that fit the prompt and user preferences
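
As an illustration of the review step in this workflow, the sketch below samples a batch for review and records accept/reject decisions with optional explanations. The `FeedbackRecord` structure, the function names, and the sample IDs are all hypothetical; SIFT's actual data model is not documented here, and the user's decisions are simulated with a simple rule.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackRecord:
    sample_id: str
    accepted: bool
    explanation: Optional[str] = None  # optional free-text reason

def sample_for_review(sample_ids, k, seed=0):
    """Pick up to k distinct samples at random; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return rng.sample(sample_ids, min(k, len(sample_ids)))

def summarize(records):
    """Split feedback into accepted and rejected IDs and collect the reasons."""
    return {
        "accepted": [r.sample_id for r in records if r.accepted],
        "rejected": [r.sample_id for r in records if not r.accepted],
        "reasons": [r.explanation for r in records if r.explanation],
    }

# Hypothetical sample IDs for the "cats in costumes" prompt.
pool = ["cat_costume_01", "dog_park_02", "cat_costume_03", "horse_04"]
batch = sample_for_review(pool, k=3)

# Simulated user decisions: accept costume cats, reject everything else.
records = []
for sample_id in batch:
    if sample_id.startswith("cat_costume"):
        records.append(FeedbackRecord(sample_id, accepted=True))
    else:
        records.append(FeedbackRecord(sample_id, accepted=False,
                                      explanation="not a cat in a costume"))

summary = summarize(records)
print(summary)
```

Records like these would then serve as the labeled examples for the preference model that drives filtering.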

Why SIFT

Manually curating datasets is slow and error-prone. While APIs provide access to large datasets, those datasets often include irrelevant or noisy data. SIFT automates dataset discovery through intelligent agents, then incorporates user feedback and reinforcement learning to make the final dataset accurate, relevant, and high quality.

Roadmap

Expand beyond images to include text and audio datasets
Support cloud storage exports (e.g., AWS S3, Google Cloud Storage)
Add advanced feedback tools such as multi-label selection and bounding box annotation
Integrate fine-tuning pipelines for direct model training

YuvDwi/sift