Welcome to OutText_Preprocessing — your all-in-one Python library for advanced outlier handling and text preprocessing! 🎯 Whether you're building models for NLP 🧠 or machine learning 📈, this package is here to supercharge your data cleaning workflows 🚀.

Text preprocessing: powerful functions to clean and transform raw text into model-ready input 💬⚙️.

- ✅ Clean noisy text (emojis, links, HTML, hashtags)
- ✨ Normalize casing, fix spelling, remove symbols
- 🌐 Multilingual support
- 📊 Apply directly to pandas columns

Outlier handling: a full suite of techniques to detect, reduce, or eliminate outliers for better model performance 📊🛡️.

- 📉 Z-score removal and capping
- 🔬 Yeo-Johnson transformation
- 🌈 Impact reduction & smooth capping
- ✂️ Adaptive trimming (IQR method)
- 🔄 Local standardization with rolling window

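
To make the difference between z-score removal and z-score capping concrete, here is a minimal standard-library sketch. It is not the package's implementation: the function names `zscore_remove` and `zscore_cap`, the `threshold` parameter, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_remove(values, threshold=3.0):
    """Drop values whose absolute z-score exceeds the threshold (rows are lost)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)  # constant data: nothing can be an outlier
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


def zscore_cap(values, threshold=3.0):
    """Clip values to mean ± threshold * std, keeping every row."""
    mu, sigma = mean(values), pstdev(values)
    lo, hi = mu - threshold * sigma, mu + threshold * sigma
    return [min(max(v, lo), hi) for v in values]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
removed = zscore_remove(data, threshold=2.0)
capped = zscore_cap(data, threshold=2.0)
print(len(data), len(removed), len(capped))  # 10 9 10
```

Removal shortens the data, while capping keeps all ten values and pulls the extreme one back to the upper bound. Note that with the default threshold of 3.0 the value 100 would survive here, because a single large outlier inflates the standard deviation it is measured against.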

Install:

```bash
pip install OutText_preprocessing
```

Basic usage:

```python
from OutText_preprocessing.text_cleaner import TextCleaner
from OutText_preprocessing.outlier_removal import OutlierRemover

# Text preprocessing
cleaner = TextCleaner()
clean_text = cleaner.clean("Th!s text 😜 has lots of NOISE!!! Check: https://example.com")

# Outlier removal
remover = OutlierRemover(method='impact_reduction')
processed_df = remover.fit_transform(your_dataframe)
```

Applying different methods to different columns:

```python
methods_columns_dict = {
    "zscore_capper": ["col1", "col2"],
    "impact_reduction": ["col3"],
}
processed_data = remover.multi_outlier_multi_columns(df, methods_columns_dict)
```

`TextCleaner` methods:

- `clean()`: Clean general text
- `remove_emojis()`
- `remove_links_mentions_hashtags()`
- `extract_top_words()`
- Works directly on DataFrame columns 🐼

`OutlierRemover` methods:

- `fit_transform(data)`
- `multi_outlier_multi_columns(data, methods_columns_dict)`
- `get_available_methods()`

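
For readers curious what this kind of text cleaning involves, the following is a rough regex-based sketch using only the standard library. It does not reproduce `TextCleaner`; the helpers `simple_clean` and `top_words`, the regex patterns, and the emoji ranges (which are not exhaustive) are assumptions chosen for illustration, and spelling correction is not covered.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
HTML_RE = re.compile(r"<[^>]+>")                # HTML tags
TAG_RE = re.compile(r"[@#]\w+")                 # mentions and hashtags
# Rough emoji/symbol ranges (assumption: not a complete emoji definition).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
SYMBOL_RE = re.compile(r"[^\w\s]")              # punctuation; Unicode letters are kept


def simple_clean(text):
    """Strip links, HTML, mentions/hashtags, emojis, and symbols; lowercase the rest."""
    for pattern in (URL_RE, HTML_RE, TAG_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text.lower())
    return " ".join(text.split())  # collapse repeated whitespace


def top_words(texts, n=3):
    """Return the n most frequent words across the cleaned texts."""
    counts = Counter(word for t in texts for word in simple_clean(t).split())
    return counts.most_common(n)


print(simple_clean("Th!s text 😜 has lots of NOISE!!! Check: https://example.com"))
# th s text has lots of noise check
```

The order of the substitutions matters: links are removed before punctuation so that a URL is dropped whole rather than broken into word fragments.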

📊 When evaluated on synthetic classification tasks with injected outliers:

- RobustScaler → `0.9033` accuracy
- OutText_Preprocessing (multi-method) → `0.9067` accuracy 🔥

✅ In this test, accuracy was on par with, and marginally above, RobustScaler (a difference of 0.0034). 🎯

- ✨ One library for both numeric and text preprocessing
- 🧪 Customizable outlier thresholds, smoothing, and rolling window
- 🧼 Fast text cleaning with regex, language support, and emoji filtering
- 📈 Designed to improve model accuracy on messy real-world data

- 🏅 Kaggle: Anurag Raj
- 💻 GitHub: Anurag Raj
- 💼 LinkedIn: Anurag Raj

| Method Name | Problem Type | Suitability / Use Case |
|---|---|---|
| `zscore` | Strict outlier removal in normally distributed data | Ideal for cleaning data by removing extreme outliers. Can cause row loss. |
| `zscore_capper` | Moderating extreme values in normal data | Use when you want to retain data but reduce influence of outliers via capping. |
| `yeo_johnson` | Skewed data with non-Gaussian distribution | Good for transforming non-normal distributions and removing outliers (but removes rows). |
| `yeo_johnson_capper` | Skewed data, need smooth transformation without row deletion | Same as above but caps values instead of removing. Great when data retention is critical. |
| `impact_reduction` | Large values dominate analysis but are valid | Retain all data, but minimize outlier impact. Use in regression and ML pre-processing. |
| `adaptive_trimming` | Data with unknown distribution or extreme outliers | Based on IQR, robust to non-normal data. Useful in robust statistics or median-based methods. |
| `smooth_capping` | Smoothly adjusting values instead of hard-capping or removing | Best when a softer influence reduction is desired, good for time-series or financial data. |
| `local_standardization` | Data with seasonal trends, patterns, or local anomalies | Useful for time-series or rolling-window data. Handles local outliers effectively. |

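
Two of the ideas in the table, IQR-based trimming and rolling-window standardization, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's code: the names `iqr_trim` and `local_zscores`, the `k=1.5` fence multiplier, the inclusive quartile method, and the choice to score each point against the window of points preceding it are all hypothetical.

```python
from statistics import mean, pstdev, quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_trim(values, k=1.5):
    """Keep only values inside the IQR fences (no normality assumption)."""
    lo, hi = iqr_bounds(values, k)
    return [v for v in values if lo <= v <= hi]


def local_zscores(values, window=4):
    """Score each point against the mean/std of the `window` points before it."""
    scores = []
    for i, v in enumerate(values):
        ref = values[max(0, i - window):i]
        sigma = pstdev(ref) if len(ref) >= 2 else 0.0
        # Too little history or a flat window: report 0.0 rather than divide by zero.
        scores.append(0.0 if sigma == 0 else (v - mean(ref)) / sigma)
    return scores


print(iqr_trim([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]

series = [10, 11, 10, 11, 10, 11, 30, 11, 10]
scores = local_zscores(series, window=4)
print(max(range(len(series)), key=lambda i: abs(scores[i])))  # 6, the index of the spike
```

Scoring against the preceding window means a spike is compared with recent behavior rather than with the whole series, which is why this style of method suits data with trends or seasonality.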
- For statistical analysis (like t-tests, regression): prefer `zscore_capper` or `impact_reduction`.
- For ML preprocessing: prefer `yeo_johnson_capper`, `adaptive_trimming`, or `smooth_capping`.
- For time-series: use `local_standardization` or `smooth_capping`.
- To remove bad data: use `zscore` or `yeo_johnson` (be aware of row loss).
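
The recommendations above can be encoded as a small lookup that produces a dictionary in the same method-to-columns shape as `methods_columns_dict`. This is only a sketch: the use-case keys, the `RECOMMENDED_METHODS` table, the `build_methods_columns_dict` helper, and the rule of picking the first listed method are hypothetical and not part of the package.

```python
# Method names come from the table above; the use-case keys are made up for this sketch.
RECOMMENDED_METHODS = {
    "statistical_analysis": ["zscore_capper", "impact_reduction"],
    "ml_preprocessing": ["yeo_johnson_capper", "adaptive_trimming", "smooth_capping"],
    "time_series": ["local_standardization", "smooth_capping"],
    "remove_bad_data": ["zscore", "yeo_johnson"],
}


def build_methods_columns_dict(column_use_cases):
    """Group columns under the first recommended method for each column's use case."""
    result = {}
    for column, use_case in column_use_cases.items():
        method = RECOMMENDED_METHODS[use_case][0]
        result.setdefault(method, []).append(column)
    return result


plan = build_methods_columns_dict({
    "price": "time_series",
    "age": "statistical_analysis",
    "income": "statistical_analysis",
})
print(plan)
# {'local_standardization': ['price'], 'zscore_capper': ['age', 'income']}
```

The resulting `plan` could then be passed wherever a `methods_columns_dict` is expected, after checking the method names against `get_available_methods()`.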