Pulling feat/vector-search into develop #1086
This commit adds the `@mlc-ai/web-llm` package as a dependency. This package will be used for integrating web large language models into the dicty frontpage application.
…ddings for publications
The PublicationsView component now uses the mlc-ai/web-llm library to generate embeddings for publication titles. A new worker.ts file handles the mlc-ai/web-llm tasks in a separate thread so that the main thread is not blocked. The CreateWebWorkerMLCEngine function initializes the engine within the useEffect hook, using the Llama-3.1-8B-Instruct model. Embeddings are created for the title of the first publication in the data array.
The WebWorkerMLCEngineHandler is now instantiated outside the options object. The options object was removed as it was not being used. A console log was added to check if the worker is loaded.
This commit initializes the mlc-llm engine using a web worker. The engine is configured with the "snowflake-arctic-embed-s-q0f32-MLC-b4" model. The worker is created using a URL relative to the current module, and the engine creates embeddings for the title of the first publication in the data array. The type of worker is set to module. The engine state variable is removed.
This commit introduces a new file, worker.ts, which is responsible for handling the WebWorkerMLCEngineHandler. The handler resides in the worker thread and listens for messages using self.onmessage. This setup enables offloading potentially heavy tasks to a separate thread, improving the main thread's performance and responsiveness.
This hook initializes the MLC engine using a web worker. It returns the engine instance and a loading state. The engine is initialized only once using a ref.
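The "initialized only once using a ref" behaviour can be illustrated without React. The following is a minimal sketch, not the hook from this PR: `createOnce` and `Holder` are hypothetical names, and the holder object only imitates the `current` field of a React ref. In the real hook the factory would presumably be the call that creates the engine (for example via CreateWebWorkerMLCEngine), which is left out here.

```typescript
// Holder shaped like a React ref: `current` stays null until first use.
interface Holder<T> {
  current: T | null
}

// Wraps a factory so that it runs at most once; later calls reuse the result.
// Hypothetical helper for illustration only.
function createOnce<T>(factory: () => T): () => T {
  const holder: Holder<T> = { current: null }
  return () => {
    if (holder.current === null) {
      const created = factory()
      holder.current = created
      return created
    }
    return holder.current
  }
}
```

Calling the returned function repeatedly, as repeated renders would, yields the same instance each time, which is the property the ref guard is meant to provide.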
This commit adds cosine similarity search to the publications view. It uses the MLC engine to generate an embedding for each publication, calculates the cosine similarity between each publication embedding and the search query embedding, and sorts the publications by that similarity.

A custom hook, useMLCEngine, was created to initialize the engine. The dotProduct, magnitude, and cosineSimilarity functions were added to calculate the cosine similarity between two vectors. The PublicationWithEmbeddings type was added to store the embeddings for each publication.

The web-llm import was extended from CreateWebWorkerMLCEngine to also include the WebWorkerMLCEngine and Embedding types. The fp-ts imports now include sort as Asort, map as Amap, mapWithIndex as AmapWithIndex, and reduce as Areduce. The useRef hook stores the MLC engine instance, and the useEffect hook initializes the engine and generates embeddings for the publications. The PublicationsView component was updated to use the cosine similarity search.
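For readers unfamiliar with the calculation, here is a minimal sketch of the three helpers and of sorting by similarity. It uses plain array methods rather than the fp-ts functions named in the commit, and the field names on `PublicationWithEmbeddings` as well as `rankBySimilarity` are assumptions for illustration, not taken from the diff.

```typescript
type Vector = readonly number[]

// Assumed shape; the actual type in the PR may differ.
interface PublicationWithEmbeddings {
  pubmedId: string
  title: string
  embedding: Vector
}

// Sum of element-wise products; missing elements in `b` count as 0.
const dotProduct = (a: Vector, b: Vector): number =>
  a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0)

// Euclidean length of a vector.
const magnitude = (v: Vector): number => Math.sqrt(dotProduct(v, v))

// Cosine of the angle between two vectors.
// Returns 0 when either vector has zero length, so callers never see NaN.
const cosineSimilarity = (a: Vector, b: Vector): number => {
  const denominator = magnitude(a) * magnitude(b)
  return denominator === 0 ? 0 : dotProduct(a, b) / denominator
}

// Returns a sorted copy of the publications, most similar to the query first.
const rankBySimilarity = (
  query: Vector,
  publications: readonly PublicationWithEmbeddings[],
): PublicationWithEmbeddings[] =>
  [...publications].sort(
    (left, right) =>
      cosineSimilarity(query, right.embedding) -
      cosineSimilarity(query, left.embedding),
  )
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.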
…rch functionality
This commit adds publication embeddings to the PublicationsView component. The embeddings are generated using the MLC engine and stored in the publicationEmbeddings state variable. This will allow for more accurate and relevant search results in the future.

The getPublicationEmbedding function creates an embedding for each publication by combining the title and abstract. The createEmbeddings function is called when the component mounts and whenever the engine or data dependencies change, so the embeddings stay up to date. The fp-ts library is used for functional programming paradigms.
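The step of combining title and abstract can be sketched as follows. This is an illustration under assumptions: the `Publication` field names, the blank-line separator, and the helper names `publicationText` and `toEmbeddingInputs` are hypothetical, and the call to the engine itself is deliberately left out.

```typescript
// Assumed shape of a publication record; field names are hypothetical.
interface Publication {
  pubmedId: string
  title: string
  abstract?: string
}

// One text input per publication, keyed by its pubmedId.
interface EmbeddingInput {
  pubmedId: string
  input: string
}

// Joins title and abstract into a single string, skipping empty parts.
// The blank-line separator is an assumption, not taken from the PR.
const publicationText = (publication: Publication): string =>
  [publication.title, publication.abstract ?? ""]
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join("\n\n")

// Prepares the strings that would be handed to the embedding engine.
const toEmbeddingInputs = (
  publications: readonly Publication[],
): EmbeddingInput[] =>
  publications.map((publication) => ({
    pubmedId: publication.pubmedId,
    input: publicationText(publication),
  }))
```

Keeping the pubmedId next to each input makes it straightforward to associate the returned vectors with their publications afterwards.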
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##           develop    #1086   +/-   ##
=========================================
  Coverage    92.54%   92.54%
=========================================
  Files          417      417
  Lines        23582    23582
  Branches      1162     1152    -10
=========================================
  Hits         21823    21823
- Misses        1756     1758     +2
+ Partials         3        1     -2
…-llm The Embedding import was not being used in the component, so it was removed to reduce the bundle size and improve performance.
This commit adds a semantic search feature to the PublicationsView component. The search allows users to find publications based on the meaning of their search query, rather than just keyword matching.

The semantic search works in the following steps:
1. The user enters a search query in the search box.
2. The search query is converted into an embedding vector.
3. The embedding vector is compared to the embeddings of all publications.
4. Publications with embeddings that are similar to the search query embedding are displayed in the list.

The semantic search is implemented using the @mlc-ai/web-llm library. The search box uses the TextField component from @material-ui/core, with FilterListIcon as its start adornment. The handleSearchChange function updates the search state when the user enters a search query. The hasSimilarEmbedding function filters the publications based on the similarity of their embeddings to the search query embedding. The embeddings are now stored as an array of embeddings instead of a single embedding.
The worker.ts file was assigning a handler to self.onmessage, which allows only one message handler at a time. This commit registers the handler with self.addEventListener instead, which also lets additional listeners be attached if needed. An eslint-disable rule was also added to allow the use of self.
Adds semantic search functionality to the publications list. The user can now type in a search query, and the publications list will filter to show only publications that are semantically similar to the search query.

The semantic search is implemented using the useMLCEngine hook, which provides access to a machine learning model that can generate embeddings for text. The embeddings are used to calculate the cosine similarity between the search query and the publications. Publications with a cosine similarity above a certain threshold are considered semantically similar and are included in the filtered list.

The matchingEmbeddings state variable stores the pubmedIds of the publications that are semantically similar to the search query. The filteredPublications variable is derived from the data variable by filtering out publications whose pubmedId is not in the matchingEmbeddings array.

The handleSearch function is called when the user clicks the "Search" button. It generates an embedding for the search query and then filters the embeddings array to find publications that are semantically similar to the query. The pubmedIds of the matching publications are then stored in the matchingEmbeddings state variable.

The hasSimilarEmbedding function determines whether a publication is semantically similar to the search query. It calculates the cosine similarity between the search query embedding and the publication embedding and returns true if the cosine similarity is above the threshold.

The console.log statements were added to help debug the semantic search functionality. This change allows users to find publications that are relevant to their interests, even if they don't know the exact keywords to search for.
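The threshold filtering described in this commit could look roughly like the sketch below. The threshold value of 0.5, the record shapes, and the helper names `findMatchingIds` and `filterPublications` are assumptions for illustration; the commit message does not state the actual threshold, and the query embedding is taken as given rather than produced by the engine.

```typescript
type Vector = readonly number[]

// Assumed shapes; field names are hypothetical.
interface StoredEmbedding {
  pubmedId: string
  embedding: Vector
}

interface Publication {
  pubmedId: string
  title: string
}

// Assumed value; the PR does not state the threshold it uses.
const SIMILARITY_THRESHOLD = 0.5

const dot = (a: Vector, b: Vector): number =>
  a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0)

// Cosine similarity, returning 0 for zero-length vectors.
const cosineSimilarity = (a: Vector, b: Vector): number => {
  const denominator = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b))
  return denominator === 0 ? 0 : dot(a, b) / denominator
}

// Builds a predicate that is true when a stored embedding is close to the query.
const hasSimilarEmbedding =
  (query: Vector, threshold: number = SIMILARITY_THRESHOLD) =>
  (candidate: StoredEmbedding): boolean =>
    cosineSimilarity(query, candidate.embedding) > threshold

// pubmedIds of all stored embeddings that pass the threshold.
const findMatchingIds = (
  query: Vector,
  embeddings: readonly StoredEmbedding[],
): string[] =>
  embeddings.filter(hasSimilarEmbedding(query)).map((item) => item.pubmedId)

// Keeps only the publications whose pubmedId is among the matches.
const filterPublications = (
  publications: readonly Publication[],
  matchingIds: readonly string[],
): Publication[] =>
  publications.filter(
    (publication) => matchingIds.indexOf(publication.pubmedId) !== -1,
  )
```

A fixed threshold is simple but sensitive to the embedding model in use; returning the top few results by similarity is a common alternative when a suitable cutoff is hard to pick.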