This project is a Flask-based web application that converts research papers (in PDF format) into detailed blog posts. The application extracts text from the PDF, processes it using OpenAI's GPT-4 model, and generates a well-structured blog post in LaTeX format. The blog post includes inline and display math for equations, making it suitable for technical content. Additionally, the application handles image extraction from the PDF and integrates them into the blog post.
- PDF Text Extraction: Extracts text from uploaded PDF files using PyMuPDF (`fitz`).
- Text Chunking: Splits the extracted text into manageable chunks for processing.
- OpenAI GPT-4 Integration: Uses OpenAI's GPT-4 model to generate blog content from the extracted text.
- Image Handling: Extracts images from the PDF and integrates them into the blog post.
- LaTeX to HTML Conversion: Converts the generated LaTeX content into HTML using Pandoc.
- Web Interface: Provides a simple web interface for uploading PDFs and viewing the generated blog post.
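
To make the text chunking step concrete, here is a minimal sketch using only the standard library. It is not the project's implementation: the application relies on `langchain` for splitting, and the function name `split_into_chunks`, the character-based sizes, and the default values below are assumptions chosen for illustration.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that some context
    around each boundary is carried into the next chunk.
    (Hypothetical helper; sizes are illustrative, not the project's values.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks: List[str] = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk can then be sent to the model separately, which keeps individual requests within the model's context limit. A splitter that respects sentence or paragraph boundaries would likely produce better prompts than this fixed-width window.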

To set up the project:

- Clone the repository:

      git clone https://github.com/yourusername/research-paper-to-blog.git
      cd research-paper-to-blog

- Set up a virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows use venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Set up the OpenAI API key:
  - Create a `.env` file in the root directory and add your OpenAI API key: `OPENAI_API_KEY=your_openai_api_key_here`
- Install Pandoc:
  - Ensure Pandoc is installed on your system. You can download it from Pandoc's official website.

To use the application:

- Run the Flask application:

      python app.py

- Access the web interface:
  - Open your web browser and navigate to `http://localhost:5600`.
- Upload a PDF:
  - Use the web interface to upload a research paper in PDF format.
  - Optionally, provide additional prompts to guide the blog generation process.
- View the generated blog post:
  - After processing, the generated blog post is displayed in HTML format.
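
Before starting the server, it can help to confirm that the two external prerequisites from the steps above are in place: the `OPENAI_API_KEY` entry and the Pandoc executable. The following is a hedged sketch, not part of the repository. The helper names `parse_env_text`, `read_env_file`, and `missing_prerequisites` are hypothetical, and the simple `KEY=value` parsing is an assumption about the `.env` format rather than a description of how `app.py` loads it.

```python
import os
import shutil
from typing import Dict, List, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks, comments, and malformed lines."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Read a .env file; a missing file is treated as empty."""
    try:
        with open(path, encoding="utf-8") as handle:
            return parse_env_text(handle.read())
    except FileNotFoundError:
        return {}


def missing_prerequisites(
    env_file: str = ".env",
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    environ = os.environ if environ is None else environ
    problems: List[str] = []
    # Accept the key from either the process environment or the .env file.
    key = environ.get("OPENAI_API_KEY") or read_env_file(env_file).get("OPENAI_API_KEY")
    if not key:
        problems.append("OPENAI_API_KEY is not set in the environment or in .env")
    if shutil.which("pandoc") is None:
        problems.append("pandoc was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Missing prerequisite:", problem)
```

Running the script from the project root prints one line per unmet prerequisite and prints nothing when both are satisfied.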

The repository is organized as follows:

- `app.py`: The main Flask application script.
- `templates/`: Contains HTML templates for the web interface.
  - `upload.html`: The upload page.
  - `output.html`: The page displaying the generated blog post.
- `static/`: Directory for storing static files (e.g., images extracted from PDFs).
- `uploads/`: Directory for storing uploaded PDFs and intermediate files.
- `render_json_image_caption_renaming.py`: Script for handling image extraction and renaming.

The application depends on:

- `Flask`: Web framework for building the application.
- `openai`: Python client for OpenAI's API.
- `PyMuPDF` (`fitz`): Library for extracting text and images from PDFs.
- `PyPDF2`: Library for reading PDF files.
- `langchain`: Library for text splitting and chunking.
- `pylatexenc`: Library for converting LaTeX to plain text.
- `pandoc`: Tool for converting LaTeX to HTML.
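
As an illustration of the LaTeX-to-HTML step that Pandoc handles, the sketch below calls the `pandoc` command-line tool through `subprocess`. It is an assumption about how such a call might look, not the code in `app.py`: the function names are hypothetical, and the choice of `--mathjax` for rendering inline and display math is one reasonable option among several that Pandoc offers.

```python
import subprocess
from typing import List


def build_pandoc_command(standalone: bool = False) -> List[str]:
    """Build the pandoc argument list for LaTeX -> HTML conversion."""
    # --mathjax keeps equations as TeX for MathJax to render in the browser.
    command = ["pandoc", "--from=latex", "--to=html", "--mathjax"]
    if standalone:
        # Emit a complete HTML document rather than a fragment.
        command.append("--standalone")
    return command


def latex_to_html(latex_source: str, standalone: bool = False) -> str:
    """Convert a LaTeX string to HTML by piping it through pandoc.

    Requires the pandoc executable on PATH. Raises
    subprocess.CalledProcessError if pandoc exits with an error.
    """
    result = subprocess.run(
        build_pandoc_command(standalone),
        input=latex_source,   # pandoc reads stdin when no input file is given
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A fragment (the default here) is convenient when the HTML is embedded in an existing template such as `output.html`, in which case the template itself would need to load MathJax for the equations to render.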
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
- Thanks to OpenAI for providing the GPT-4 model.
- Thanks to the developers of `PyMuPDF`, `PyPDF2`, and `langchain` for their excellent libraries.
For any questions or feedback, please contact the author at hwagh@mtu.edu.


