Proposal: Include a Hibernate-Based DFS Implementation in the JGit Distribution #251
carstenartur started this conversation in Ideas
Hello JGit Community,
JGit already provides the DfsRepository/DfsObjDatabase abstraction for pluggable storage backends. I have built a full database-backed implementation of this API using Hibernate ORM and Hibernate Search, and I would like to discuss including it as an optional module in the JGit distribution.
What This Is
A complete DFS backend that stores Git objects, packs, refs, and reflogs in a relational database:
- HibernateRepository extends DfsRepository
- HibernateObjDatabase extends DfsObjDatabase: stores pack data as BLOBs, keyed by pack name and extension
- HibernateRefDatabase, HibernateReflogWriter, HibernateReflogReader: database-backed ref and reflog storage
- HibernateRepositoryBuilder extends DfsRepositoryBuilder
- GitObjectEntity, GitRefEntity, GitPackEntity, GitReflogEntity, GitCommitIndex, JavaBlobIndex, FilePathHistory
It is actively used in my Sandbox project as an Eclipse plugin; see the sandbox-jgit-storage-hibernate module.
Why Include It in the JGit Distribution
Atomicity and Transactional Guarantees
Database transactions provide true ACID guarantees for pack commits, ref updates, and rollbacks — something inherently difficult to achieve with filesystem-based storage.
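To make the all-or-nothing behaviour concrete, here is a minimal plain-Java sketch of a batch ref update that either applies every change or none of them. It is an in-memory model only, not the Hibernate code from the module: the class AtomicRefUpdateSketch, the RefChange type, and the short object IDs are hypothetical, and a real backend would rely on a database transaction (which also provides isolation and durability) instead of a copied map.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** In-memory model of an all-or-nothing ref update (hypothetical, for illustration). */
final class AtomicRefUpdateSketch {

    /** One requested change: move refName from expectedOld to newId (null means "absent"). */
    static final class RefChange {
        final String refName;
        final String expectedOld;
        final String newId;

        RefChange(String refName, String expectedOld, String newId) {
            this.refName = refName;
            this.expectedOld = expectedOld;
            this.newId = newId;
        }
    }

    private final Map<String, String> refs = new HashMap<>();

    /** Applies every change, or none if any expected old value does not match. */
    synchronized boolean applyAll(List<RefChange> changes) {
        // Work on a copy so a failed check leaves the visible state untouched.
        Map<String, String> working = new HashMap<>(refs);
        for (RefChange c : changes) {
            if (!Objects.equals(working.get(c.refName), c.expectedOld)) {
                return false; // "rollback": the working copy is simply discarded
            }
            if (c.newId == null) {
                working.remove(c.refName);
            } else {
                working.put(c.refName, c.newId);
            }
        }
        // "commit": publish all changes together.
        refs.clear();
        refs.putAll(working);
        return true;
    }

    synchronized String get(String refName) {
        return refs.get(refName);
    }
}
```

If the second change in a batch carries a stale expected value, the first change is not applied either, which is the property a filesystem backend has to emulate with lock files and careful ordering.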
Packaging and Dependency Management
Maintaining this implementation externally means dealing with version compatibility against JGit internals, OSGi/p2 packaging challenges, and duplicated build infrastructure. Including it as an optional module in JGit would eliminate this burden and ensure it stays in sync with API changes.
A Real-World DFS Implementation Beyond InMemoryRepository
JGit ships InMemoryRepository as its only DFS implementation. A Hibernate-based module would be the natural persistent counterpart, usable with any JDBC-compatible database (H2 for testing/embedded, PostgreSQL for production), and would serve as a reference implementation that helps validate and harden the DFS API.
Features This Has Enabled
The database layer has made it possible to build features that would be extremely difficult or impossible with filesystem-based storage:
ECJ-Based Java Source Tokenizer for Lucene
An EcjTokenizer uses Eclipse's own Java compiler scanner to produce lexically correct Java tokens for Lucene indexing. It is combined with an EcjTokenFilter that applies CamelCase splitting, string literal stripping, and token-type-aware processing. Together they give true language-aware full-text search over Java source code stored in Git, which is fundamentally different from plain text search.
AST-Based Structural Indexing
A JavaBlobExtractor and a JavaFileStrategy parse Java source files using JDT's ASTParser and extract structural metadata (package names, declared types, methods, fields, supertypes, interfaces, imports), all indexed and queryable via Hibernate Search/Lucene.
Semantic and Hybrid Code Search
A SemanticSearchClient supports natural language queries over the indexed repository content: semantic search (vector-based), hybrid search (full-text + semantic), type search, symbol search, commit message search, and changed-path search. This allows asking questions like "find all implementations of a caching strategy" rather than just grepping for a string.
LLM-Powered Commit Analysis
A CommitAnalysisJob feeds commit diffs to an LLM service to generate DSL rules and semantic evaluations of code changes. The database layer makes it practical to store, index, and query both the raw repository data and the AI-generated analysis results together.
Structured Querying via GitDatabaseQueryService
GitDatabaseQueryService allows SQL/HQL queries over commits, trees, blobs, and refs (e.g., "find all commits by author X touching files in path Y between dates A and B") without walking the entire object graph.
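The example query above becomes a flat filter once commit metadata is available as indexed rows. The following plain-Java sketch shows the shape of that filter over an in-memory list. It is an assumption-laden stand-in, not the GitDatabaseQueryService API: the CommitRow type, its fields, and the find method are hypothetical, and the actual module would express this as SQL/HQL against its own entities, whose column names are not given here.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for a "commits by author, path, and date range" query (hypothetical). */
final class CommitQuerySketch {

    /** Hypothetical row shape; the real entity fields may differ. */
    static final class CommitRow {
        final String id;
        final String author;
        final Instant when;
        final List<String> changedPaths;

        CommitRow(String id, String author, Instant when, List<String> changedPaths) {
            this.id = id;
            this.author = author;
            this.when = when;
            this.changedPaths = changedPaths;
        }
    }

    /** Returns ids of commits by author that touch pathPrefix within [from, to). */
    static List<String> find(List<CommitRow> rows, String author, String pathPrefix,
                             Instant from, Instant to) {
        List<String> ids = new ArrayList<>();
        for (CommitRow row : rows) {
            boolean inRange = !row.when.isBefore(from) && row.when.isBefore(to);
            boolean touchesPath = false;
            for (String path : row.changedPaths) {
                if (path.startsWith(pathPrefix)) {
                    touchesPath = true;
                    break;
                }
            }
            if (row.author.equals(author) && inRange && touchesPath) {
                ids.add(row.id);
            }
        }
        return ids;
    }
}
```

In a database the same three conditions would become WHERE clauses backed by indexes, so the cost depends on the number of matching rows rather than on the size of the commit graph.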
None of these features require changes to the DFS API itself — they all build on top of the Hibernate storage layer. But they demonstrate why having this implementation inside the JGit distribution (rather than maintained externally) would benefit the broader ecosystem.
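As one small illustration of the language-aware indexing mentioned above, CamelCase splitting of an identifier can be sketched in a few lines of plain Java. This is not the EcjTokenFilter source: the class name CamelCaseSplitSketch, the split method, and the choice to lowercase the parts are assumptions made for the example, and the real filter works on Lucene token streams rather than on strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Splits a Java identifier at CamelCase boundaries (hypothetical helper, for illustration). */
final class CamelCaseSplitSketch {

    static List<String> split(String identifier) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < identifier.length(); i++) {
            char prev = identifier.charAt(i - 1);
            char cur = identifier.charAt(i);
            // Boundary between a lowercase and an uppercase letter: "objDatabase".
            boolean lowerToUpper = Character.isLowerCase(prev) && Character.isUpperCase(cur);
            // End of an acronym: "XMLParser" splits before "Parser".
            boolean acronymEnd = Character.isUpperCase(prev) && Character.isUpperCase(cur)
                    && i + 1 < identifier.length()
                    && Character.isLowerCase(identifier.charAt(i + 1));
            if (lowerToUpper || acronymEnd) {
                parts.add(identifier.substring(start, i).toLowerCase(Locale.ROOT));
                start = i;
            }
        }
        if (start < identifier.length()) {
            parts.add(identifier.substring(start).toLowerCase(Locale.ROOT));
        }
        return parts;
    }
}
```

With such a split, a search for "database" can match an identifier like HibernateObjDatabase, which a plain whitespace tokenizer would index as a single opaque term.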
References
JGit Fork (persistence layer work):
👉 https://github.com/carstenartur/jgit
Hibernate Storage Module (in Sandbox):
👉 sandbox-jgit-storage-hibernate
Sandbox Project (Eclipse product with all integrations):
👉 https://github.com/carstenartur/sandbox
Related EGit Discussion — allowing plugins to switch the persistence layer:
👉 eclipse-egit/egit#145
What I Would Like to Discuss
Thank you for your time and feedback!
Best regards,
Carsten Hammer
GitHub: @carstenartur