KHARAGPUR DATA SCIENCE HACKATHON

TEAM: NEURAL NAVIGATORS

Members:

Mauryavardhan Singh (Leader)
Shubham Kumar Mandal
Hardik Mahawar
Ayushman Paul

Task 1: Paper Publishability Prediction

Introduction

This project implements a machine learning system that predicts whether a research paper is publishable. It uses natural language processing to analyze a paper's structure, content quality, and readability, and produces a binary publishable / not publishable prediction.

Features

PDF text extraction and processing
Comprehensive feature extraction including:
- Structural analysis (presence of key sections)
- Content quality metrics (citations, equations, figures, tables)
- Readability scores
- Technical content density
Document embedding using the SPECTER model
Binary classification using Random Forest
Detailed performance metrics and evaluation
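
The structural and content-quality features listed above can be approximated with a few regular expressions. The following is a minimal sketch, not the notebook's actual extract_features(): the function name extract_basic_features, the section list, and the patterns are assumptions chosen for illustration, and average sentence length stands in as a crude readability proxy (the project itself lists textstat for readability scores).

```python
import re

# Section names treated as "key sections"; this list is an assumption.
KEY_SECTIONS = ("abstract", "introduction", "method", "results", "conclusion", "references")


def extract_basic_features(text: str) -> dict:
    """Return a few structural and content counts for one paper's text."""
    lowered = text.lower()
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]

    # Structural analysis: 1 if a section name starts a line (optionally numbered).
    features = {
        f"has_{name}": int(bool(re.search(rf"^\s*(\d+\.?\s*)?{name}", lowered, re.MULTILINE)))
        for name in KEY_SECTIONS
    }

    # Content quality: numeric citations such as [12] or [3, 4].
    features["citation_count"] = len(re.findall(r"\[\d+(?:\s*,\s*\d+)*\]", text))
    features["figure_mentions"] = len(re.findall(r"\b(?:Figure|Fig\.)\s*\d+", text))
    features["table_mentions"] = len(re.findall(r"\bTable\s*\d+", text))

    # Crude readability proxy: average number of words per sentence.
    features["avg_sentence_length"] = len(words) / max(len(sentences), 1)
    return features
```

A dictionary like this can be turned into a numeric vector and concatenated with the document embedding before classification, although how the notebook combines them is not stated here.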

Prerequisites

Ensure you have the following installed:

spacy
sentence-transformers
scikit-learn
pandas
numpy
matplotlib
seaborn
PyPDF2
textstat
transformers
langchain
faiss-cpu
streamlit
torch
networkx
plotly
ollama

Required Libraries

To install all the dependencies for this project, run the following commands:

pip install spacy sentence-transformers scikit-learn pandas numpy matplotlib seaborn PyPDF2 textstat transformers langchain faiss-cpu streamlit torch networkx plotly ollama

python -m spacy download en_core_web_sm
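
After installation, it can help to confirm that the packages are importable, since several pip names differ from their import names (for example, scikit-learn is imported as sklearn). The check below is a small sketch using only the standard library; the helper name missing_packages and the selection of packages in IMPORT_NAMES are illustrative assumptions, not part of the repository.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements.
IMPORT_NAMES = {
    "spacy": "spacy",
    "sentence-transformers": "sentence_transformers",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "numpy": "numpy",
    "PyPDF2": "PyPDF2",
    "textstat": "textstat",
    "faiss-cpu": "faiss",
    "streamlit": "streamlit",
    "torch": "torch",
}


def missing_packages(import_names: dict = IMPORT_NAMES) -> list:
    """Return the pip names whose top-level module cannot be found."""
    return [pip_name for pip_name, module in import_names.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```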

Project Structure For Task 1

project/
├── KDSH_Task_1.ipynb     # Main Jupyter notebook containing all code
├── *.pdf                # 15 labeled PDF papers directly in the project folder
├── Papers/               # Directory containing 135 unlabeled PDF papers for analysis
│   └── *.pdf            # Unlabeled PDF files to be analyzed
└── results.csv          # Output file containing classification results

Installation and Setup

Clone the repository:

git clone https://github.com/InfoSage05/Research_Paper_Classification_Tool.git
cd Research_Paper_Classification_Tool

Install required dependencies:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Prepare your data:
- Place PDF papers to be analyzed in the Papers/ directory
- Ensure the training data (labeled papers) is organized in the format expected by main()

Usage

Open and run the Jupyter notebook:
```
jupyter notebook KDSH_Task_1.ipynb
```
The notebook contains several key functions:
- read_pdf(): Extracts text from PDF files
- preprocess_text(): Cleans and prepares text for analysis
- extract_features(): Generates numerical features from paper text
- generate_embedding(): Creates document embeddings
- train_classifier(): Trains the model on labeled data
- predict_paper(): Makes predictions on new papers
The main execution will:
- Train the model using labeled examples
- Process all papers in the Papers/ directory
- Generate predictions
- Save results to results.csv
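
The last three steps of the main execution can be sketched as a loop over Papers/ that writes one row per PDF. This is an illustration under stated assumptions rather than the notebook's code: classify_folder is a hypothetical helper, the predict argument stands in for whatever predict_paper() does with a trained model, and the column names "Paper ID" and "Publishable" are assumed.

```python
import csv
from pathlib import Path
from typing import Callable


def classify_folder(papers_dir: str, predict: Callable[[Path], int], out_csv: str) -> int:
    """Run `predict` on every PDF in `papers_dir` and write one CSV row per paper.

    Returns the number of papers processed.
    """
    pdf_paths = sorted(Path(papers_dir).glob("*.pdf"))
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Paper ID", "Publishable"])  # assumed column names
        for path in pdf_paths:
            # `predict` is expected to return 0 or 1 for the given PDF path.
            writer.writerow([path.name, predict(path)])
    return len(pdf_paths)
```

Passing the predictor as a function keeps the file handling separate from the model, so the loop can be tried with a stub such as `lambda path: 0` before a classifier has been trained.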

Output

The system generates a results.csv file containing:

Paper ID (filename)
Publishability prediction (0 or 1)

Model Performance

The system evaluates performance using:

F1 Score
Precision
Recall
Detailed classification report
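
For reference, precision, recall, and F1 for the publishable class (label 1) follow directly from the counts of true positives, false positives, and false negatives. The sketch below computes them in plain Python; binary_metrics is a hypothetical helper written for illustration. In practice, scikit-learn's sklearn.metrics module offers precision_score, recall_score, f1_score, and classification_report for the same purpose.

```python
def binary_metrics(y_true: list, y_pred: list) -> dict:
    """Precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, with true labels [1, 0, 1, 0] and predictions [1, 0, 0, 0] there is one true positive, no false positives, and one false negative, so precision is 1.0 and recall is 0.5.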

Task 2: Research Paper Classification

Introduction

This project aims to classify research papers into specific categories using machine learning models. The system supports both static data classification and real-time dynamic classification with interactive data streaming using Streamlit.

Features

Static Data Classification: Analyze pre-existing datasets to classify research papers into categories.
Real-time Classification: Stream and classify research papers dynamically in real time.
Interactive User Interface: Streamlit-powered interface for ease of use.

Project Structure For Task 2

project/
├── KDSH_Task2_Stream.py  # Script for real-time data classification.
├── data/                 # Folder containing example datasets (if applicable)
└── KDSH_Task2_Static.py  # Script for static data classification.

How to Run

File 1: Static Data Classification

Navigate to the project folder:
```
cd <path_to_project>
```
Run the classification script:
```
streamlit run KDSH_Task2_Static.py
```
Upload the research paper PDF in the sidebar. The system will:
- Extract and process content.
- Classify the paper into relevant categories.
- Provide an interactive chart and analysis.

File 2: Real-Time Data Classification

Navigate to the project folder:
```
cd <path_to_project>
```
Launch the Streamlit app:
```
streamlit run KDSH_Task2_Stream.py
```
Upload a research paper PDF. The system will:
- Stream data dynamically.
- Provide classification results in real time.
- Offer detailed justifications and analysis.

Technologies Used

Python
Streamlit
Hugging Face Transformers
LangChain
Scikit-Learn
PyTorch
Llama2
Sentence Transformers
PyPDF2
Plotly
SPECTER

Future Scope

Extend support for additional research domains.
Enhance model accuracy and speed.
Integrate more advanced NLP techniques.
