feat: .ksh extension + shebang-based language detection for extension-less scripts (#235, #237) #276

Open
azizur100389 wants to merge 1 commit into tirth8205:main from azizur100389:feat/ksh-and-shebang-detection

@azizur100389 (Contributor)

Summary

Two parser improvements that expand file coverage to extension-less Unix scripts and Korn shell files. +288 lines; 17 new tests (1 ksh + 16 shebang) plus one extended existing test.

Closes #235, closes #237.

Supersedes individual PRs #236 and #238 (which are being closed in favor of this bundled PR).


Feature 1: .ksh extension → bash parser (#235)

Register .ksh with tree-sitter-bash alongside .sh / .bash / .zsh. Korn shell is close enough to bash syntactically that tree-sitter-bash correctly handles the structural features the graph captures.

Context: @tirth8205 explicitly invited this in the close comment on PR #230:

The .ksh extension in particular looks worth adding — I didn't include it in #227.

Tests: test_detects_language is extended with a .ksh assertion; test_ksh_extension_parses_as_bash is an end-to-end test that copies sample.sh to a temp .ksh file, parses it, and asserts the same function set and edge counts as the .sh original.
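The registration described above amounts to one more entry in an extension lookup. A minimal sketch, assuming a plain dict keyed by suffix; `EXTENSION_TO_LANGUAGE` and `language_for_extension` are hypothetical names for illustration, not the identifiers used in parser.py:

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table; the real mapping in parser.py may be
# named and structured differently.
EXTENSION_TO_LANGUAGE = {
    ".sh": "bash",
    ".bash": "bash",
    ".zsh": "bash",
    ".ksh": "bash",  # the entry this PR adds
}


def language_for_extension(path: Path) -> Optional[str]:
    """Return the registered language for a file's suffix, or None."""
    return EXTENSION_TO_LANGUAGE.get(path.suffix)


print(language_for_extension(Path("deploy.ksh")))
```

Because .ksh resolves to the same language key as .sh, both files go through the same grammar, which is what the regression test relies on when it compares function sets and edge counts.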

Feature 2: Shebang-based language detection (#237)

detect_language() was extension-only — any file with no extension was silently skipped. This misses git hooks, CI scripts, bin/ entry points, and installers.

New SHEBANG_INTERPRETER_TO_LANGUAGE table maps interpreter basenames to already-registered languages:

| Interpreter | Language |
| --- | --- |
| bash, sh, zsh, ksh, dash, ash | bash |
| python, python2, python3, pypy, pypy3 | python |
| node, nodejs | javascript |
| ruby, perl, lua, Rscript, php | (respective) |
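As a rough illustration, the table above could be written as a flat dict. The interpreter names come from the PR description; the language keys for the "respective" row (for example `"r"` for Rscript) are assumptions, since the description does not spell them out:

```python
# Sketch of the interpreter table. Keys are shebang interpreter basenames.
# Values for ruby/perl/lua/Rscript/php are assumed language identifiers.
SHEBANG_INTERPRETER_TO_LANGUAGE = {
    # POSIX-style shells all route to the bash grammar
    "bash": "bash", "sh": "bash", "zsh": "bash",
    "ksh": "bash", "dash": "bash", "ash": "bash",
    # Python interpreters
    "python": "python", "python2": "python", "python3": "python",
    "pypy": "python", "pypy3": "python",
    # Node.js
    "node": "javascript", "nodejs": "javascript",
    # One-to-one mappings (assumed keys)
    "ruby": "ruby", "perl": "perl", "lua": "lua",
    "Rscript": "r", "php": "php",
}

print(SHEBANG_INTERPRETER_TO_LANGUAGE["dash"])
```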

New _detect_language_from_shebang(path) reads the first 256 bytes of the file. It handles:

  • Direct form: #!/bin/bash
  • env indirection: #!/usr/bin/env bash
  • env -S flag: #!/usr/bin/env -S node --experimental-vm-modules
  • Trailing flags: #!/bin/bash -e
  • CRLF line endings, binary content, strict UTF-8 decoding
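The cases listed above can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the PR's code: the function names, the reduced `_INTERPRETERS` table, and the rule of skipping every dash-prefixed token after `env` are all hypothetical choices made for the example.

```python
from pathlib import Path
from typing import Optional

# Reduced, hypothetical interpreter table for the example.
_INTERPRETERS = {"bash": "bash", "sh": "bash", "python3": "python", "node": "javascript"}


def language_from_shebang_head(head: bytes) -> Optional[str]:
    """Map the leading bytes of a file to a language via its shebang line."""
    if not head.startswith(b"#!"):
        return None  # no shebang, empty file, or binary content
    # Keep only the first line and drop a trailing CR from CRLF endings.
    first_line = head.split(b"\n", 1)[0].rstrip(b"\r")
    try:
        text = first_line[2:].decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError:
        return None
    tokens = text.split()
    if not tokens:
        return None
    name = tokens[0].rsplit("/", 1)[-1]  # basename of the interpreter path
    if name == "env":
        # Skip env's own flags (such as -S); the first remaining token is the
        # real interpreter. Flags after it were passed over already.
        rest = [t for t in tokens[1:] if not t.startswith("-")]
        if not rest:
            return None
        name = rest[0].rsplit("/", 1)[-1]
    return _INTERPRETERS.get(name)


def detect_language_from_shebang(path: Path) -> Optional[str]:
    """Read at most 256 bytes from the file and inspect the shebang."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(256)
    except OSError:
        return None
    return language_from_shebang_head(head)
```

Separating the byte-level parsing from the file read keeps the edge cases (CRLF, invalid UTF-8, `env -S`) testable without touching the filesystem.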

detect_language() falls back to the shebang probe only for files with suffix == "". Files with a known extension are never re-read.
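A minimal sketch of that control flow, assuming the probe is passed in as a callable so the example stays self-contained; the `EXTENSIONS` subset and the `shebang_probe` parameter are hypothetical and do not reflect the actual signature in parser.py:

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical subset of the extension table.
EXTENSIONS = {".sh": "bash", ".ksh": "bash", ".py": "python"}


def detect_language(
    path: Path,
    shebang_probe: Callable[[Path], Optional[str]],
) -> Optional[str]:
    """Extension lookup first; shebang probe only when there is no suffix."""
    suffix = path.suffix
    if suffix != "":
        # Any extension is authoritative: the file is never opened here,
        # even when the extension is unknown.
        return EXTENSIONS.get(suffix)
    return shebang_probe(path)


# Usage: record which paths reach the probe.
probed = []


def fake_probe(p: Path) -> Optional[str]:
    probed.append(p)
    return "bash"


detect_language(Path("hooks/pre-commit"), fake_probe)
detect_language(Path("run.py"), fake_probe)
print(probed)
```

Only the extension-less path reaches the probe, so the extra file read is limited to files that were previously skipped.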

Tests (16 new): every mapped interpreter, env -S flag, trailing flags, missing shebang, empty file, binary content, unknown interpreter, extension-override guard, and end-to-end parse_file() producing function nodes.


Files changed (3 files, +288 / -1)

  • code_review_graph/parser.py: .ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE table + _detect_language_from_shebang() + detect_language() fallback
  • tests/test_multilang.py: .ksh detection + end-to-end ksh parsing test
  • tests/test_parser.py: 16 shebang detection tests

Test results

  • 17/17 targeted tests pass (1 ksh + 16 shebang)
  • ruff check on all changed files: clean (1 pre-existing F841 in an unrelated test)

…-less scripts (tirth8205#235, tirth8205#237)

Two parser improvements that expand code-review-graph's file coverage
to extension-less Unix scripts and Korn shell files.

Feature 1: .ksh extension → bash parser (tirth8205#235)
-----------------------------------------------
Register .ksh (Korn shell) with tree-sitter-bash alongside the existing
.sh / .bash / .zsh entries shipped in v2.3.0.  Korn shell is close enough
to bash syntactically that tree-sitter-bash handles the structural
features the graph captures correctly.

Context: in the close comment on PR tirth8205#230, @tirth8205 explicitly flagged
this as worth adding: "The .ksh extension in particular looks worth
adding — I didn't include it in tirth8205#227."

Tests: test_detects_language extended with .ksh assertion;
test_ksh_extension_parses_as_bash — end-to-end regression test that
copies sample.sh to a temp .ksh file, parses it, and asserts identical
function set and edge counts.

Feature 2: shebang-based language detection (tirth8205#237)
--------------------------------------------------
detect_language() was extension-only — any file with no extension returned
None and was silently skipped.  This misses a huge category of production
files: git hooks, CI scripts, bin/ entry points, installers.

New SHEBANG_INTERPRETER_TO_LANGUAGE table maps common interpreter
basenames to languages already registered:
  bash/sh/zsh/ksh/dash/ash -> bash
  python/python2/python3/pypy/pypy3 -> python
  node/nodejs -> javascript
  ruby, perl, lua, Rscript, php

New _detect_language_from_shebang(path) static method reads the first
256 bytes, handles direct form (#!/bin/bash), env indirection
(#!/usr/bin/env bash), env -S flags, trailing flags (#!/bin/bash -e),
CRLF, binary content, and strict UTF-8 decoding.

detect_language() now falls back to the shebang probe for files with
no extension (suffix == "").  Files with a known extension are never
re-read — extension-based detection stays authoritative.

Tests (16 new in test_parser.py): every interpreter mapping, env -S flag,
trailing flags, missing shebang, empty file, binary content, unknown
interpreter, extension-does-not-get-overridden, and end-to-end
parse_file producing function nodes from an extension-less bash script.

Files changed
-------------
- code_review_graph/parser.py — .ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE
  table + _detect_language_from_shebang() + detect_language() fallback
- tests/test_multilang.py — .ksh detection + end-to-end ksh parsing test
- tests/test_parser.py — 16 shebang detection tests
@azizur100389 force-pushed the feat/ksh-and-shebang-detection branch from 24fdad8 to fc5fc95 on April 14, 2026 23:03
Successfully merging this pull request may close these issues.

  • feat(parser): shebang-based language detection for extension-less scripts
  • feat(bash): add .ksh extension to bash parser