feat: .ksh extension + shebang-based language detection for extension-less scripts (#235, #237) #276
Open
azizur100389 wants to merge 1 commit into tirth8205:main
This was referenced Apr 14, 2026
…-less scripts (tirth8205#235, tirth8205#237)

Two parser improvements that expand code-review-graph's file coverage to extension-less Unix scripts and Korn shell files.

Feature 1: .ksh extension → bash parser (tirth8205#235)
-----------------------------------------------

Register .ksh (Korn shell) with tree-sitter-bash alongside the existing .sh / .bash / .zsh entries shipped in v2.3.0. Korn shell is close enough to bash syntactically that tree-sitter-bash handles the structural features the graph captures correctly.

Context: in the close comment on PR tirth8205#230, @tirth8205 explicitly flagged this as worth adding: "The .ksh extension in particular looks worth adding — I didn't include it in tirth8205#227."

Tests:
- test_detects_language extended with a .ksh assertion
- test_ksh_extension_parses_as_bash: end-to-end regression test that copies sample.sh to a temp .ksh file, parses it, and asserts identical function set and edge counts

Feature 2: shebang-based language detection (tirth8205#237)
--------------------------------------------------

detect_language() was extension-only: any file with no extension returned None and was silently skipped. This misses a huge category of production files: git hooks, CI scripts, bin/ entry points, installers.

New SHEBANG_INTERPRETER_TO_LANGUAGE table maps common interpreter basenames to languages already registered:

  bash/sh/zsh/ksh/dash/ash -> bash
  python/python2/python3/pypy/pypy3 -> python
  node/nodejs -> javascript
  ruby, perl, lua, Rscript, php

New _detect_language_from_shebang(path) static method reads the first 256 bytes, handles direct form (#!/bin/bash), env indirection (#!/usr/bin/env bash), env -S flags, trailing flags (#!/bin/bash -e), CRLF, binary content, and strict UTF-8 decoding.

detect_language() now falls back to the shebang probe for files with no extension (suffix == ""). Files with a known extension are never re-read; extension-based detection stays authoritative.

Tests (16 new in test_parser.py): every interpreter mapping, env -S flag, trailing flags, missing shebang, empty file, binary content, unknown interpreter, extension-does-not-get-overridden, and end-to-end parse_file producing function nodes from an extension-less bash script.

Files changed
-------------

- code_review_graph/parser.py: .ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE table + _detect_language_from_shebang() + detect_language() fallback
- tests/test_multilang.py: .ksh detection + end-to-end ksh parsing test
- tests/test_parser.py: 16 shebang detection tests
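For readers without the diff at hand, the shebang probe described in this commit message could look roughly like the sketch below. It is not the code from parser.py: the function name, the error handling, and the way env flags are skipped are assumptions, and the target language names for ruby, perl, lua, Rscript, and php are guesses, since the message lists those interpreters without their mapped languages.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Reconstructed from the mapping in the commit message. The values for
# ruby/perl/lua/Rscript/php are assumptions; the message does not state them.
SHEBANG_INTERPRETER_TO_LANGUAGE = {
    "bash": "bash", "sh": "bash", "zsh": "bash",
    "ksh": "bash", "dash": "bash", "ash": "bash",
    "python": "python", "python2": "python", "python3": "python",
    "pypy": "python", "pypy3": "python",
    "node": "javascript", "nodejs": "javascript",
    "ruby": "ruby", "perl": "perl", "lua": "lua",
    "Rscript": "r", "php": "php",
}


def detect_language_from_shebang(path: Path) -> str | None:
    """Return a language name based on the file's shebang line, or None."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(256)  # only the first 256 bytes are inspected
    except OSError:
        return None
    if not head.startswith(b"#!"):
        return None  # missing shebang or empty file
    # Keep the first line only; tolerate CRLF line endings.
    first_line = head.split(b"\n", 1)[0].rstrip(b"\r")
    if b"\x00" in first_line:
        return None  # NUL bytes suggest binary content
    try:
        text = first_line[2:].decode("utf-8")  # strict decoding
    except UnicodeDecodeError:
        return None
    tokens = text.split()
    if not tokens:
        return None
    interpreter = tokens[0].rsplit("/", 1)[-1]
    if interpreter == "env":
        # Skip env flags such as -S and take the first non-flag token.
        target = next((t for t in tokens[1:] if not t.startswith("-")), "")
        interpreter = target.rsplit("/", 1)[-1]
    # Trailing flags after the interpreter (e.g. "-e") are ignored.
    return SHEBANG_INTERPRETER_TO_LANGUAGE.get(interpreter)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        hook = Path(tmp) / "pre-commit"
        hook.write_bytes(b"#!/usr/bin/env -S node --experimental-vm-modules\n")
        print(detect_language_from_shebang(hook))  # javascript
```

Returning None for anything unexpected (unreadable file, undecodable bytes, unknown interpreter) keeps the probe in line with the existing behaviour, where an undetected file is simply skipped.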
Summary
Two parser improvements that expand file coverage to extension-less Unix scripts and Korn shell files. +288 lines, 18 new tests.
Closes #235, closes #237.
Supersedes individual PRs #236 and #238 (which are being closed in favor of this bundled PR).
Feature 1: .ksh extension → bash parser (#235)

Register .ksh with tree-sitter-bash alongside .sh / .bash / .zsh. Korn shell is close enough to bash syntactically that tree-sitter-bash handles it correctly.

Context: @tirth8205 explicitly invited this in the close comment on PR #230.

Tests:
- test_detects_language extended with .ksh
- test_ksh_extension_parses_as_bash: end-to-end parse that asserts identical function set and edge counts to .sh

Feature 2: Shebang-based language detection (#237)
detect_language() was extension-only: any file with no extension was silently skipped. This misses git hooks, CI scripts, bin/ entry points, and installers.

New SHEBANG_INTERPRETER_TO_LANGUAGE table maps interpreter basenames to already-registered languages:
- bash, sh, zsh, ksh, dash, ash → bash
- python, python2, python3, pypy, pypy3 → python
- node, nodejs → javascript
- ruby, perl, lua, Rscript, php

New _detect_language_from_shebang(path) reads the first 256 bytes. Handles:
- Direct form: #!/bin/bash
- env indirection: #!/usr/bin/env bash
- env -S flag: #!/usr/bin/env -S node --experimental-vm-modules
- Trailing flags: #!/bin/bash -e

detect_language() falls back to the shebang probe only for files with suffix == "". Files with a known extension are never re-read.

Tests (16 new): every mapped interpreter, env -S flag, trailing flags, missing shebang, empty file, binary content, unknown interpreter, extension-override guard, and end-to-end parse_file() producing function nodes.

Files changed (3 files, +288 / -1)
- code_review_graph/parser.py: .ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE table + _detect_language_from_shebang() + detect_language() fallback
- tests/test_multilang.py: .ksh detection + end-to-end ksh parsing test
- tests/test_parser.py: 16 shebang detection tests

Test results

ruff check on all changed files: clean (1 pre-existing F841 in an unrelated test)
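The detection order described under Feature 2 (extension first, shebang probe only when suffix == "") can be illustrated with a small sketch. Everything here is hypothetical: the EXTENSION_TO_LANGUAGE name, the .py entry, and passing the probe in as a callable are illustration choices, not the structure of parser.py. The recording probe stands in for the real shebang reader so the example can show when the file would and would not be read.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Optional

# Hypothetical subset of the extension table. The four shell entries follow
# the PR text; the .py entry is an assumption added for illustration.
EXTENSION_TO_LANGUAGE = {
    ".sh": "bash",
    ".bash": "bash",
    ".zsh": "bash",
    ".ksh": "bash",  # the mapping added by Feature 1
    ".py": "python",
}


def detect_language(
    path: Path, shebang_probe: Callable[[Path], Optional[str]]
) -> Optional[str]:
    """Extension lookup first; shebang probe only for extension-less files."""
    suffix = path.suffix
    if suffix:
        # Any extension, known or unknown, is authoritative: no file read.
        return EXTENSION_TO_LANGUAGE.get(suffix)
    return shebang_probe(path)


probed: list[str] = []


def recording_probe(path: Path) -> Optional[str]:
    """Stand-in for the real probe; records which files it was asked about."""
    probed.append(path.name)
    return "bash"


ksh_result = detect_language(Path("deploy.ksh"), recording_probe)
txt_result = detect_language(Path("notes.txt"), recording_probe)
hook_result = detect_language(Path("pre-commit"), recording_probe)

print(ksh_result, txt_result, hook_result)  # bash None bash
print(probed)  # ['pre-commit']
```

Only the extension-less pre-commit file reaches the probe, which is the guarantee the extension-override guard test in the PR is described as checking.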