Skip to content

Analysis of parsers.py #32

@albatrosary

Description

@albatrosary

This module handles all input parsing — CLI flags, prompt assembly, sequence/pipeline definitions, and shell command parsing. It's the most verbose module with significant duplication.


Architecture

parse_cli_input()     → ParsedInput      (AI commands)
_parse_sh_input()     → ParsedShInput    (Shell commands)
build_ai_prompt()     → str              (Prompt assembly)
parse_sequence_steps() → nested list     (Pipeline parsing)
  ├── smart_split_steps()    (split on ->)
  ├── smart_split_parallel() (split on ||)
  ├── normalize_step()       (clean whitespace/comments)
  └── detect_parallel_block() (detect [...])

Critical Issues

1. smart_split_steps and smart_split_parallel are 95% identical

These two functions are character-by-character parsers that differ only in their delimiter (-> vs ||):

# smart_split_steps:
if in_quote is None and ch == "-" and i + 1 < length and text[i + 1] == ">":

# smart_split_parallel:
if in_quote is None and ch == "|" and i + 1 < length and text[i + 1] == "|":

Everything else — escape handling, quote tracking, accumulation, final cleanup — is identical. This is a textbook case for extraction:

def _smart_split(text: str, delimiter: str) -> list[str]:
    """Split text on a 2-char delimiter, respecting quotes and escapes."""
    assert len(delimiter) == 2
    d0, d1 = delimiter[0], delimiter[1]
    segments = []
    current = []
    in_quote = None
    i = 0
    length = len(text)

    while i < length:
        ch = text[i]

        if ch == "\\" and i + 1 < length:
            current.append(ch)
            current.append(text[i + 1])
            i += 2
            continue

        if ch in ('"', "'"):
            if in_quote is None:
                in_quote = ch
            elif in_quote == ch:
                in_quote = None
            current.append(ch)
            i += 1
            continue

        if in_quote is None and ch == d0 and i + 1 < length and text[i + 1] == d1:
            segments.append("".join(current).strip())
            current = []
            i += 2
            continue

        current.append(ch)
        i += 1

    if current:
        segments.append("".join(current).strip())

    return [s for s in segments if s]


def smart_split_steps(text: str) -> list[str]:
    return _smart_split(text, "->")

def smart_split_parallel(text: str) -> list[str]:
    return _smart_split(text, "||")

This eliminates ~50 lines of duplicated code.

2. VALID_COMMANDS is hardcoded, diverges from actual engine registry

VALID_COMMANDS = {
    "gemini", "gpt", "claude", "grok", "local",
    "sh", "scrub", "flush", "efficient",
}

This set is defined inside parse_sequence_steps and duplicates the engine names from config.py/engines.py. If someone adds a new engine (e.g., "deepseek"), they must remember to update this set too. It should be derived dynamically:

from .config import engines

VALID_COMMANDS = set(engines.keys()) | {"sh", "scrub", "flush", "efficient"}

However, there's a timing issue: parse_sequence_steps could be called before initialize_engines() populates the engines dict. The fix is to compute this at call time, not import time:

def parse_sequence_steps(editor_content: str) -> list[list[list[str]]] | None:
    valid_commands = set(engines.keys()) | {"sh", "scrub", "flush", "efficient"}
    ...

3. _parse_sh_input silently allows duplicate -r and -w flags

if parsed.run_file is not None:
    print("[!] @sh: -r specified more than once.")
# ... but continues anyway, overwriting the value
parsed.run_file = parts[i + 1]

The warning is printed but the previous value is silently overwritten. Compare with parse_cli_input which does the same for -w:

if parsed.write_file is not None:
    print("[!] Warning: write flag specified multiple times. Overwriting.")
parsed.write_file = parts[i + 1]

Both parsers should either reject duplicates (return None) or explicitly document that last-wins is intentional.


Significant Issues

4. parse_cli_input uses indices_to_skip pattern — fragile and hard to follow

The parser tracks which indices have been consumed via a set:

indices_to_skip = {0}
# ... later:
indices_to_skip.update({i, i + 1})
# ... finally:
a1_tokens = [parts[j] for j in range(len(parts)) if j not in indices_to_skip]

This means bare tokens are collected by exclusion — whatever wasn't consumed by a flag. This works, but it's error-prone. If a new flag is added and the developer forgets to add indices to indices_to_skip, those tokens leak into a1.

A cleaner approach:

bare_tokens = []
i = 1
while i < len(parts):
    token = parts[i]
    if token in ("-r", "--read"):
        ...
        i += 2
    elif token in ("-e", "--edit"):
        ...
        i += 1
    else:
        bare_tokens.append(token)
        i += 1

parsed.a1 = " ".join(bare_tokens)

This is what _parse_sh_input already does with bare_tokens. The two parsers use different patterns for the same problem.

5. build_ai_prompt raises RuntimeError but callers catch Exception

# build_ai_prompt:
raise RuntimeError(f"Error reading input file '{filename}': {e}")

# handle_ai_interaction (in handlers.py):
try:
    prompt_main = build_ai_prompt(parsed, editor_content)
except Exception as e:
    safe_print(f"[!] {e}")

Using RuntimeError for a file-not-found condition is too broad. A custom exception or FileNotFoundError / IOError would be more precise.

6. Escaped characters are preserved literally in smart_split_*

if ch == "\\" and i + 1 < length:
    current.append(ch)        # keeps the backslash
    current.append(text[i + 1])

The backslash is not consumed — it's passed through to the output. So \-> becomes the literal string \-> in the step text, which shlex.split later processes. This may or may not be the desired behavior, but it means escape handling is split across two layers (this parser and shlex), making it hard to reason about.

7. normalize_step strips comments naively

if not stripped or stripped.startswith("#"):
    continue

This strips full-line comments but not inline comments:

@gemini "prompt"  # this comment is NOT stripped

The # this comment... would become part of the command tokens. Either document this limitation or handle inline comments (being careful about # inside quotes).

8. normalize_step collapse loop is O(n²)

while "  " in normalized:
    normalized = normalized.replace("  ", " ")

For very long texts, this repeatedly scans and replaces. A single regex is cleaner:

normalized = re.sub(r" {2,}", " ", normalized)

Moderate Issues

9. detect_parallel_block doesn't handle nested brackets

if stripped.startswith("[") and stripped.endswith("]"):
    return True, stripped[1:-1].strip()

Input like [@gemini "[test]" || @gpt "next"] would incorrectly match the inner ] in "[test]" — but only if the quotes were stripped, which they aren't here since we're checking raw text. Still, unquoted brackets inside the block would break:

[@sh "echo ]" || @gpt "next"]

After shlex.split in the editor, echo ] would be a single token, but detect_parallel_block sees the first ] and stops. However, since the check is endswith("]"), it actually only checks the last character, so this specific case works. The real failure would be:

[@sh "echo test" || @gpt "next"] extra_text

This wouldn't be detected as parallel because it doesn't end with ]. That's actually correct behavior, but there's no error message — it falls through to sequential parsing.

10. Type annotation uses list[str] in @dataclass default

read_files: list[str] = None

This works in Python 3.10+ but list[str] as a type hint in this position requires from __future__ import annotations in Python 3.9. The __post_init__ workaround for the mutable default is correct, but the standard pattern is:

from dataclasses import dataclass, field

@dataclass
class ParsedInput:
    read_files: list[str] = field(default_factory=list)

This eliminates the need for __post_init__ entirely.


Comment Density Problem

This file has the worst comment-to-code ratio. Nearly every line is commented:

steps = []  # List to hold split steps
current = []  # List for the current step being built
in_quote = None  # Track if currently inside a quote
i = 0  # Index for iteration
length = len(text)  # Length of the text

These comments add zero information. The variable names already say everything. Removing them would make the file ~30% shorter and significantly more readable.


Summary Table

Severity Issue Location
🔴 Critical smart_split_steps/smart_split_parallel are near-identical Both functions
🟠 High VALID_COMMANDS hardcoded, diverges from engine registry parse_sequence_steps()
🟠 High Inconsistent parsing patterns (indices_to_skip vs bare_tokens) parse_cli_input vs _parse_sh_input
🟡 Medium Duplicate flags warned but not rejected _parse_sh_input, parse_cli_input
🟡 Medium Escaped chars preserved literally (split responsibility) smart_split_*
🟡 Medium Comments stripped only at line level, not inline normalize_step()
🟡 Medium O(n²) whitespace collapse normalize_step()
🟢 Low RuntimeError for file I/O errors build_ai_prompt()
🟢 Low Mutable default via __post_init__ instead of field() ParsedInput
🟢 Low Extreme comment density Throughout

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions