- extract DOM from site using Puppeteer
- simple site crawler (scrape subpages without a sitemap)
- HTML to Markdown using Turndown
- Markdown noise filter with mistralai/Mixtral-8x22B-Instruct-v0.1 (64k context)
- finalise LLM to denoise Markdown
- Express endpoint
- work on concurrency
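
The crawler item above (scraping subpages without a sitemap) could work roughly as follows. This is a minimal sketch, not the project's actual implementation: `crawlSite`, `normalizeUrl`, and the injected `getLinks` callback are hypothetical names introduced for illustration. It assumes a breadth-first walk over same-origin links, capped at a maximum page count.

```javascript
// Resolve href against base and drop the #fragment; returns null if unparsable.
function normalizeUrl(href, base) {
  try {
    const parsed = new URL(href, base);
    parsed.hash = '';
    return parsed.href;
  } catch {
    return null;
  }
}

// Breadth-first crawl limited to the start URL's origin and to maxPages pages.
// getLinks is a hypothetical callback: (url) => Promise<string[]> returning the
// hrefs found on that page (in a real setup it might read them via Puppeteer).
async function crawlSite(startUrl, maxPages, getLinks) {
  const start = normalizeUrl(startUrl);
  if (start === null) throw new Error(`Invalid start URL: ${startUrl}`);

  const origin = new URL(start).origin;
  const seen = new Set([start]); // every URL ever queued, to avoid duplicates
  const queue = [start];
  const pages = [];              // URLs visited, in crawl order

  while (queue.length > 0 && pages.length < maxPages) {
    const url = queue.shift();
    pages.push(url);

    let links = [];
    try {
      links = await getLinks(url);
    } catch {
      links = []; // a page that fails to load contributes no links
    }

    for (const href of links) {
      const next = normalizeUrl(href, url);
      // Skip unparsable links, other origins (including mailto:), and repeats.
      if (next !== null && new URL(next).origin === origin && !seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return pages;
}
```

Passing the link extractor in as a callback keeps the traversal logic separate from the browser automation, so the page limit and same-origin filter can be checked against a fake link graph without launching a browser.
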
- Navigate to the project directory in your terminal.
- Install the project dependencies:

    npm install

- Create a .env file in the root of your project and add your DeepInfra API key. Replace your_api_key with your actual DeepInfra API key (get your API key for free here).

Running the API

To start the API, run the following command in your terminal:

    nodemon index.js

Using the API

To convert webpage content into markdown, make a GET request to the /web2md endpoint with the url and numPages query parameters. For example:

    curl "http://localhost:3000/web2md?url=https://example.com&numPages=5"

This will return the markdown content of the specified webpage.
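
The same request can be made from Node.js. The sketch below is illustrative only: `buildWeb2mdUrl` and `fetchMarkdown` are hypothetical helper names, and it assumes Node 18+ (for the global `fetch`) and that the endpoint responds with the markdown as plain text. Building the query with `URLSearchParams` also percent-encodes the target URL, which matters when that URL carries its own query string.

```javascript
// Build the /web2md request URL with properly encoded query parameters.
// baseUrl is where the API is listening, e.g. "http://localhost:3000".
function buildWeb2mdUrl(baseUrl, targetUrl, numPages) {
  const endpoint = new URL('/web2md', baseUrl);
  endpoint.searchParams.set('url', targetUrl);
  endpoint.searchParams.set('numPages', String(numPages));
  return endpoint.href;
}

// Request the markdown for targetUrl.
// Assumption: the response body is the markdown itself, sent as text.
async function fetchMarkdown(baseUrl, targetUrl, numPages) {
  const response = await fetch(buildWeb2mdUrl(baseUrl, targetUrl, numPages));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}
```

For example, `fetchMarkdown('http://localhost:3000', 'https://example.com', 5)` would issue the same request as the curl command above.
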