Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .agents/journal/bolt.md
Original file line number Diff line number Diff line change
Expand Up @@ -152,3 +152,7 @@ reuse results within the same sync run.
**Learning:** Even after optimizing individual detection rules with in-memory metadata, the overall detection process remained inefficient due to multiple high-level discovery phases. `discover_nested_projects`, `RepoMetadata::collect`, and `collect_package_names_with_nested` were each triggering independent, redundant WalkDir traversals or redundant metadata builds. Integrating nested project discovery directly into the primary metadata collection pass eliminated these redundant O(N) operations.

**Action:** Look for "layered" discovery logic where one phase finds sub-targets and subsequent phases re-scan them. Consolidate these into a single "collect once, use everywhere" pass at the highest possible level to maximize I/O efficiency.

## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
**Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.
Comment on lines +156 to +157
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a blank line after the heading to satisfy Markdown linting.

Line 156 is missing the required blank line below the heading (MD022), which can keep doc-lint warnings noisy in CI.

Suggested fix
 ## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
+
 **Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
**Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.
## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
**Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 156-156: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.agents/journal/bolt.md around lines 156 - 157, The Markdown heading "##
2026-05-24 - Content Caching and Gradle Discovery for Tech Detection" is missing
a blank line beneath it; update the .agents/journal/bolt.md entry by inserting a
single blank line immediately after that heading to satisfy MD022 linting
(ensure the heading and the following paragraph are separated by one empty
line).

**Action:** Implement a per-project content cache to ensure each configuration file is read from disk at most once. Pre-discover layout-specific markers (like Gradle files) during the primary `WalkDir` to eliminate redundant filesystem checks and dynamic path constructions in the rule evaluation loop.
122 changes: 75 additions & 47 deletions src/skills/detect.rs
Original file line number Diff line number Diff line change
Expand Up @@ -80,22 +80,22 @@ struct RepoMetadata {
paths: HashSet<PathBuf>,
/// Set of relative paths that are directories.
dirs: HashSet<PathBuf>,
/// Immediate subdirectories of the project root (depth 1), cached for fast Gradle scanning.
root_dirs: Vec<PathBuf>,
/// Map of file extension (e.g., ".rs") to the first relative path found with it.
/// Used to quickly evaluate file_extensions rules.
extensions: HashMap<String, PathBuf>,
/// Relative paths to subdirectories that appear to be standalone projects (issue #409).
nested_projects: Vec<PathBuf>,
/// Relevant Gradle build files discovered during the walk.
gradle_files: Vec<PathBuf>,
}

impl RepoMetadata {
fn collect(project_root: &Path) -> Self {
let mut paths = HashSet::new();
let mut dirs = HashSet::new();
let mut root_dirs = Vec::new();
let mut extensions = HashMap::new();
let mut nested_projects = BTreeSet::new();
let mut gradle_files = Vec::new();

for entry in WalkDir::new(project_root)
.max_depth(MAX_DISCOVER_DEPTH)
Expand All @@ -116,15 +116,31 @@ impl RepoMetadata {
let relative_buf = relative.to_path_buf();
paths.insert(relative_buf.clone());

if entry.file_type().is_dir() && entry.depth() == 1 {
root_dirs.push(relative_buf.clone());
}
if entry.file_type().is_dir() {
dirs.insert(relative_buf.clone());
}

if entry.file_type().is_file() {
let file_name = entry.file_name().to_str().unwrap_or("");
let depth = entry.depth();

// Gradle file discovery for scan_gradle_layout
if depth == 1 {
if matches!(
file_name,
"build.gradle.kts"
| "build.gradle"
| "settings.gradle.kts"
| "settings.gradle"
) {
gradle_files.push(relative_buf.clone());
}
} else if depth == 2
&& ((relative.starts_with("gradle") && file_name == "libs.versions.toml")
|| matches!(file_name, "build.gradle.kts" | "build.gradle"))
{
gradle_files.push(relative_buf.clone());
}

// Integrated Nested Project Discovery (issue #409)
if PROJECT_MANIFEST_FILES.contains(&file_name)
Expand Down Expand Up @@ -153,9 +169,9 @@ impl RepoMetadata {
Self {
paths,
dirs,
root_dirs,
extensions,
nested_projects: nested_projects.into_iter().collect(),
gradle_files,
}
}
}
Expand Down Expand Up @@ -208,7 +224,7 @@ struct CompiledDetectionRules {
}

struct CompiledConfigFileContentRules {
files: Option<Vec<String>>,
files: Option<Vec<PathBuf>>,
patterns: Vec<Regex>,
scan_gradle_layout: bool,
}
Expand Down Expand Up @@ -283,7 +299,10 @@ impl CatalogDrivenDetector {
.collect::<Result<Vec<_>>>()?;

Ok::<_, anyhow::Error>(CompiledConfigFileContentRules {
files: content_rules.files.clone(),
files: content_rules
.files
.as_ref()
.map(|files| files.iter().map(PathBuf::from).collect()),
patterns,
scan_gradle_layout: content_rules.scan_gradle_layout.unwrap_or(false),
})
Expand Down Expand Up @@ -318,11 +337,19 @@ impl RepoDetector for CatalogDrivenDetector {

let mut detections = Vec::new();

// Optimization: Cache file content to avoid redundant I/O for shared config files (e.g. setup.py).
let mut content_cache = HashMap::new();

// Phase 1: Evaluate root project
for (tech_id, compiled) in &self.rules {
if let Some(detection) =
evaluate_rules(project_root, tech_id, compiled, &all_packages, &metadata)
{
if let Some(detection) = evaluate_rules(
project_root,
tech_id,
compiled,
&all_packages,
&metadata,
&mut content_cache,
) {
detections.push(detection);
}
}
Expand All @@ -332,6 +359,9 @@ impl RepoDetector for CatalogDrivenDetector {
let nested_dir = project_root.join(rel_nested_dir);
let nested_meta = RepoMetadata::collect(&nested_dir);
let nested_pkgs = collect_package_names(&nested_dir, &nested_meta);
// Clear content cache between projects because config files like setup.py
// will have different content in different sub-projects.
content_cache.clear();

let offset = Some(rel_nested_dir.to_path_buf());

Expand All @@ -341,9 +371,14 @@ impl RepoDetector for CatalogDrivenDetector {
continue;
}

if let Some(detection) =
evaluate_rules(&nested_dir, tech_id, compiled, &nested_pkgs, &nested_meta)
{
if let Some(detection) = evaluate_rules(
&nested_dir,
tech_id,
compiled,
&nested_pkgs,
&nested_meta,
&mut content_cache,
) {
// Adjust paths: detections are relative to nested_dir, need to prepend offset
let adjusted = TechnologyDetection {
technology: detection.technology,
Expand Down Expand Up @@ -388,6 +423,7 @@ fn evaluate_rules(
rules: &CompiledDetectionRules,
all_packages: &BTreeSet<String>,
metadata: &RepoMetadata,
content_cache: &mut HashMap<PathBuf, String>,
) -> Option<TechnologyDetection> {
// Check packages (exact match)
if let Some(packages) = &rules.packages {
Expand Down Expand Up @@ -439,10 +475,22 @@ fn evaluate_rules(
if let Some(content_rules) = &rules.config_file_content {
let files_to_scan = gather_content_scan_files(project_root, content_rules, metadata);
for file_path in &files_to_scan {
let absolute = project_root.join(file_path);
if let Ok(content) = fs::read_to_string(&absolute) {
// Optimization: Use content cache to avoid redundant I/O for shared config files.
let content = if let Some(cached) = content_cache.get(file_path) {
Some(cached)
} else {
let absolute = project_root.join(file_path);
if let Ok(content) = fs::read_to_string(absolute) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semgrep identified a blocking 🔴 issue in your code:

File path constructed from untrusted configuration rules allows path traversal attack, enabling attackers to read arbitrary files outside the intended directory.

More details about this

The code reads file content from a path constructed by joining project_root with file_path, where file_path comes from untrusted content_rules that may be loaded from external configuration. An attacker can craft a malicious content_rules configuration containing path traversal sequences (like ../../../etc/passwd) in the file paths. When project_root.join(file_path) is called, these sequences allow escaping the intended directory boundary and reading arbitrary files on the system. For example, if project_root is /home/app/config and file_path is ../../../etc/passwd, the code will read /etc/passwd instead of staying within the config directory. The application then searches patterns against this file content, potentially exposing sensitive data like credentials or system information to an attacker who controls the rule configuration.

Dataflow graph
flowchart LR
    classDef invis fill:white, stroke: none
    classDef default fill:#e7f5ff, color:#1c7fd6, stroke: none

    subgraph File0["<b>src/skills/detect.rs</b>"]
        direction LR
        %% Source

        subgraph Source
            direction LR

            v0["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L335 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 335] project_root</a>"]
        end
        %% Intermediate

        subgraph Traces0[Traces]
            direction TB

            v2["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L335 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 335] metadata</a>"]

            v3["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] &</a>"]

            v4["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] in</a>"]

            v5["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] rel_nested_dir</a>"]

            v6["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L359 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 359] nested_dir</a>"]

            v7["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L375 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 375] &</a>"]

            v8["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L374 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 374] evaluate_rules</a>"]

            v9["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L421 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 421] project_root</a>"]

            v10["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L482 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 482] absolute</a>"]
        end
            v2 --> v3
            v3 --> v4
            v4 --> v5
            v5 --> v6
            v6 --> v7
            v7 --> v8
            v8 --> v9
            v9 --> v10
        %% Sink

        subgraph Sink
            direction LR

            v1["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L483 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 483] absolute</a>"]
        end
    end
    %% Class Assignment
    Source:::invis
    Sink:::invis

    Traces0:::invis
    File0:::invis

    %% Connections

    Source --> Traces0
    Traces0 --> Sink


Loading

To resolve this comment:

✨ Commit fix suggestion
  1. Normalize and validate the file path to prevent path traversal before using it with fs::read_to_string or similar file system functions.
  2. Use a helper function like canonical_existing_path to canonicalize and check the path. You can do this by adding: let file_path = canonical_existing_path(file_path).ok()?; before any code that reads the file, such as fs::read_to_string(file_path).
  3. Only proceed with file operations if the call to canonical_existing_path returns Some(path), meaning the path is valid and canonical.
  4. Make sure that all dynamic path construction using user input (such as joins with user-derived values) passes through this validation step before being used by the application.

Canonicalizing the path with canonical_existing_path ensures that attacker-controlled input cannot break out of the intended directory, which helps to prevent path traversal vulnerabilities.

💬 Ignore this finding

Reply with Semgrep commands to ignore this finding.

  • /fp <comment> for false positive
  • /ar <comment> for acceptable risk
  • /other <comment> for all other reasons

Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by tainted-path.

You can view more details about this finding in the Semgrep AppSec Platform.

content_cache.insert(file_path.clone(), content);
content_cache.get(file_path)
} else {
None
}
};

if let Some(content) = content {
for pattern in &content_rules.patterns {
if pattern.is_match(&content) {
if pattern.is_match(content) {
let display = file_path.display().to_string();
return Some(make_detection(
tech_id,
Expand Down Expand Up @@ -501,39 +549,19 @@ fn gather_content_scan_files(
let mut files = Vec::new();

if rules.scan_gradle_layout {
// Root-level Gradle files
for name in &[
"build.gradle.kts",
"build.gradle",
"settings.gradle.kts",
"settings.gradle",
"gradle/libs.versions.toml",
] {
let path = PathBuf::from(name);
if metadata.paths.contains(&path) {
files.push(path);
}
}

// Optimization: Use pre-calculated root_dirs from metadata to avoid
// re-filtering the entire directory set on every tech rule evaluation.
for dir in &metadata.root_dirs {
for build_file in &["build.gradle.kts", "build.gradle"] {
let path = dir.join(build_file);
if metadata.paths.contains(&path) {
files.push(path);
}
}
}
// Optimization: Use pre-discovered Gradle files from metadata to avoid
// dynamic path construction and repeated lookups in the path set.
files.extend(metadata.gradle_files.iter().cloned());
}

if let Some(explicit_files) = &rules.files {
for file in explicit_files {
let path = PathBuf::from(file);
if (metadata.paths.contains(&path) || project_root.join(&path).exists())
&& !files.contains(&path)
for path in explicit_files {
// Optimization: check metadata.paths first (O(1)), only fallback to fs for deep matches.
// Avoid duplicate entries if explicit files overlap with Gradle files.
if (metadata.paths.contains(path) || project_root.join(path).exists())
&& !files.contains(path)
{
files.push(path);
files.push(path.clone());
}
}
}
Expand Down
Loading