Pulling feat/vector-search into develop #1086

Open
github-actions[bot] wants to merge 12 commits into develop from feat/vector-search

Conversation

@github-actions
Contributor

No description provided.

ktun95 added 5 commits June 23, 2025 08:49
This commit adds the `@mlc-ai/web-llm` package as a dependency.
This package will be used for integrating web large language
models into the dicty frontpage application.

…ddings for publications

The PublicationsView component now uses the mlc-ai/web-llm library to generate embeddings for publication titles. A new worker.ts file is added to handle the mlc-ai/web-llm tasks in a separate thread, preventing the main thread from being blocked. The CreateWebWorkerMLCEngine function is used to initialize the engine within the useEffect hook. The engine is initialized with the Llama-3.1-8B-Instruct model. The embeddings are created for the title of the first publication in the data array.

The WebWorkerMLCEngineHandler is now instantiated outside the options object. The options object was removed as it was not being used. A console log was added to check if the worker is loaded.

This commit initializes the mlc-llm engine using a web worker.
The engine is configured with the "snowflake-arctic-embed-s-q0f32-MLC-b4" model.
The worker is created using a URL relative to the current module,
and the engine creates embeddings for the title of the first
publication in the data array. The type of worker is set to module.
The engine state variable is removed.

This commit introduces a new file, worker.ts, which is responsible
for handling the WebWorkerMLCEngineHandler. The handler resides in
the worker thread and listens for messages using self.onmessage.
This setup enables offloading potentially heavy tasks to a separate
thread, improving the main thread's performance and responsiveness.
coderabbitai Bot commented Jun 23, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



ktun95 added 3 commits June 23, 2025 23:28
This hook initializes the MLC engine using a web worker. It returns the
engine instance and a loading state. The engine is initialized only once
using a ref.
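
The initialize-once behaviour described for the hook can be illustrated without React. This is a minimal sketch, not the hook from this PR: `Ref` only mimics the shape of a React ref object, and `initOnce` and the factory argument are hypothetical names introduced here.

```typescript
// Mimics the shape of a React ref object ({ current }); a hypothetical stand-in.
type Ref<T> = { current: T | null }

// Runs `factory` only while the ref is empty and caches the resulting promise,
// so repeated calls (for example on re-render) share a single engine instance.
function initOnce<T>(ref: Ref<Promise<T>>, factory: () => Promise<T>): Promise<T> {
  if (ref.current === null) {
    ref.current = factory()
  }
  return ref.current
}
```

Caching the promise rather than the resolved engine means a second call made while the first is still loading does not start a second download.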
This commit adds cosine similarity search to the publications view.
It uses the MLC engine to generate embeddings for the publications
and then calculates the cosine similarity between the embeddings
and the search query. The publications are then sorted by their
cosine similarity to the search query.
A custom hook useMLCEngine was created to initialize the engine.
The dotProduct, magnitude, and cosineSimilarity functions were
added to calculate the cosine similarity between two vectors.
The PublicationWithEmbeddings type was added to store the
embeddings for each publication.
The CreateWebWorkerMLCEngine import was updated to include the
WebWorkerMLCEngine and Embedding types.
The fp-ts imports were updated to include the sort as Asort, map
as Amap, mapWithIndex as AmapWithIndex, and reduce as Areduce
functions.
The useRef hook was added to store the MLC engine instance.
The useEffect hook was updated to initialize the MLC engine and
generate embeddings for the publications.
The PublicationsView component was updated to use the cosine
similarity search.
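
For reference, the three helpers named in this commit can be sketched as follows. This uses plain array methods rather than fp-ts, and the `PublicationWithEmbedding` shape, the `rankBySimilarity` name, and the zero-vector convention are assumptions made for illustration; the PR's actual code may differ.

```typescript
// Sum of element-wise products of two equal-length vectors.
const dotProduct = (a: number[], b: number[]): number =>
  a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0)

// Euclidean length of a vector.
const magnitude = (v: number[]): number => Math.sqrt(dotProduct(v, v))

// cos(theta) = (a . b) / (|a| * |b|).
// Returns 0 when either vector has zero length (an assumed convention).
const cosineSimilarity = (a: number[], b: number[]): number => {
  const denominator = magnitude(a) * magnitude(b)
  return denominator === 0 ? 0 : dotProduct(a, b) / denominator
}

// Hypothetical shape; the PR's PublicationWithEmbeddings type may differ.
type PublicationWithEmbedding = { title: string; embedding: number[] }

// Returns a new array ordered from most to least similar to the query vector.
const rankBySimilarity = (
  query: number[],
  publications: PublicationWithEmbedding[],
): PublicationWithEmbedding[] =>
  [...publications].sort(
    (x, y) =>
      cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding),
  )
```

Cosine similarity depends only on the angle between vectors, so a long abstract and a short query can still score highly if they point in the same direction.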
…rch functionality

This commit adds publication embeddings to the PublicationsView component.
The embeddings are generated using the MLC engine and stored in the
publicationEmbeddings state variable. This will allow for more
accurate and relevant search results in the future.
The getPublicationEmbedding function creates embeddings for each
publication by combining the title and abstract. The embeddings are
then stored in the publicationEmbeddings state variable.
The createEmbeddings function is called when the component mounts and
when the engine or data dependencies change. This ensures that the
embeddings are always up-to-date.
The fp-ts library is used for functional programming paradigms.
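
The title-plus-abstract step might look like the sketch below. The `Publication` fields, the newline separator, and the injected `Embed` function (a stand-in for the engine call) are all assumptions; none of them are taken from the PR's code.

```typescript
// Hypothetical publication shape; the field names are assumptions.
type Publication = { pubmedId: string; title: string; abstract: string }
type PublicationEmbedding = { pubmedId: string; embedding: number[] }

// Stand-in for the engine call that turns text into a vector.
type Embed = (text: string) => Promise<number[]>

// Joins title and abstract into one input string (the separator is an assumption).
const toEmbeddingInput = (p: Publication): string => `${p.title}\n${p.abstract}`

// Embeds every publication and keeps each pubmedId next to its vector,
// preserving the order of the input array.
const embedPublications = (
  publications: Publication[],
  embed: Embed,
): Promise<PublicationEmbedding[]> =>
  Promise.all(
    publications.map(async (p) => ({
      pubmedId: p.pubmedId,
      embedding: await embed(toEmbeddingInput(p)),
    })),
  )
```

Passing the embedding function in as a parameter keeps the sketch independent of any particular engine and makes it easy to substitute a fake in tests.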
pull-request-size Bot added size/L and removed size/M labels Jun 24, 2025
codecov Bot commented Jun 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.54%. Comparing base (e8ab847) to head (acfbd41).

Current head acfbd41 differs from pull request most recent head 9541aca

Please upload reports for the commit 9541aca to get more accurate results.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #1086   +/-   ##
========================================
  Coverage    92.54%   92.54%           
========================================
  Files          417      417           
  Lines        23582    23582           
  Branches      1162     1152   -10     
========================================
  Hits         21823    21823           
- Misses        1756     1758    +2     
+ Partials         3        1    -2     

see 2 files with indirect coverage changes


ktun95 added 4 commits June 25, 2025 09:44
…-llm

The Embedding import was not being used in the component, so it was removed to reduce the bundle size and improve performance.
This commit adds a semantic search feature to the
PublicationsView component. The search allows users to
find publications based on the meaning of their search
query, rather than just keyword matching.

The semantic search is implemented using the following steps:
1. The user enters a search query in the search box.
2. The search query is converted into an embedding vector.
3. The embedding vector is compared to the embeddings of
   all publications.
4. Publications with embeddings that are similar to the
   search query embedding are displayed in the list.

The semantic search is implemented using the
@mlc-ai/web-llm library.

The search box is implemented using the TextField
component from @material-ui/core.

The FilterListIcon is used as the start adornment for the
TextField.

The handleSearchChange function is used to update the
search state when the user enters a search query.

The hasSimilarEmbedding function is used to filter the
publications based on the similarity of their embeddings
to the search query embedding.

The embeddings are now stored as an array of embeddings
instead of a single embedding.
The worker.ts file listened for messages by assigning a handler to self.onmessage. This commit changes the code to use self.addEventListener instead, which registers the listener without overwriting any handler that is already attached. An eslint-disable rule was also added to allow usage of self.
Adds semantic search functionality to the publications list.
The user can now type in a search query, and the publications
list will filter to show only publications that are semantically
similar to the search query.

The semantic search is implemented using the useMLCEngine hook,
which provides access to a machine learning model that can
generate embeddings for text. The embeddings are used to
calculate the cosine similarity between the search query and
the publications. Publications with a cosine similarity above
a certain threshold are considered to be semantically similar
and are included in the filtered list.

The matchingEmbeddings state variable is used to store the
pubmedIds of the publications that are semantically similar
to the search query. The filteredPublications variable is
derived from the data variable by filtering out publications
whose pubmedId is not in the matchingEmbeddings array.
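
Deriving the filtered list from the stored ids could be sketched like this. The `Publication` shape and the `filterByMatchingIds` name are hypothetical, and the `Set` lookup is a choice made here for the sketch rather than something stated in the commit.

```typescript
// Hypothetical publication shape; only pubmedId matters for the filter.
type Publication = { pubmedId: string; title: string }

// Keeps the publications whose pubmedId appears in `matchingIds`,
// preserving the original order of `data`.
const filterByMatchingIds = (
  data: Publication[],
  matchingIds: string[],
): Publication[] => {
  const ids = new Set(matchingIds) // constant-time membership checks
  return data.filter((p) => ids.has(p.pubmedId))
}
```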

The handleSearch function is called when the user clicks the
"Search" button. This function generates an embedding for the
search query and then filters the embeddings array to find
publications that are semantically similar to the search query.
The pubmedIds of the matching publications are then stored in
the matchingEmbeddings state variable.

The hasSimilarEmbedding function is used to determine whether
a publication is semantically similar to the search query. This
function calculates the cosine similarity between the search
query embedding and the publication embedding and returns true
if the cosine similarity is above the threshold.
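
A threshold check of this kind can be sketched as below. The threshold value of 0.5, the curried signature, and the `matchingPubmedIds` helper are assumptions made for illustration; the commit does not state the actual threshold or function signatures.

```typescript
// Cosine similarity of two equal-length vectors; 0 if either has zero length.
const cosineSimilarity = (a: number[], b: number[]): number => {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB)
  return denominator === 0 ? 0 : dot / denominator
}

// Hypothetical default; the PR does not state which threshold it uses.
const SIMILARITY_THRESHOLD = 0.5

// Builds a predicate that is true when an embedding scores above the threshold.
const hasSimilarEmbedding =
  (query: number[], threshold: number = SIMILARITY_THRESHOLD) =>
  (embedding: number[]): boolean =>
    cosineSimilarity(query, embedding) > threshold

// Hypothetical stored shape: each embedding is kept next to its pubmedId.
type PublicationEmbedding = { pubmedId: string; embedding: number[] }

// Collects the pubmedIds of all publications that pass the predicate.
const matchingPubmedIds = (
  query: number[],
  embeddings: PublicationEmbedding[],
  threshold: number = SIMILARITY_THRESHOLD,
): string[] => {
  const isSimilar = hasSimilarEmbedding(query, threshold)
  return embeddings.filter((e) => isSimilar(e.embedding)).map((e) => e.pubmedId)
}
```

A fixed threshold trades recall for precision: raising it returns fewer, closer matches, so the value usually needs tuning against the embedding model in use.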

The console.log statements were added to help debug the
semantic search functionality.

This change allows users to find publications that are
relevant to their interests, even if they don't know the
exact keywords to search for.