diff --git a/CLAUDE.md b/CLAUDE.md index f83b199..5d4c1c4 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -118,6 +118,7 @@ The sgraph data model consists of three primary classes: - **loader/**: Model loading utilities - **cli/**: Command-line interface utilities - **attributes/**: Attribute query and management +- **cypher.py**: Cypher query language support via sPyCy (optional dependency: `spycy-aneeshdurg`) ### Data Formats @@ -177,6 +178,20 @@ element = api.getElementByPath('/some/path') elements = api.getElementsByName('nginx.c') ``` +### Querying with Cypher (requires `spycy-aneeshdurg`) +```python +from sgraph import SGraph +from sgraph.cypher import cypher_query + +model = SGraph.parse_xml_or_zipped_xml('model.xml') +results = cypher_query(model, ''' + MATCH (a:file)-[:imports]->(b) + RETURN a.name, b.name +''') +``` + +CLI: `python -m sgraph.cypher model.xml.zip [query]` — supports interactive REPL and 11 output formats (table, csv, tsv, json, jsonl, xml, deps, dot, plantuml, graphml, cytoscape). See `docs/cypher.md` for full documentation. 
+ ## File Locations - Source code: `src/sgraph/` @@ -184,3 +199,5 @@ elements = api.getElementsByName('nginx.c') - Automation scripts: `scripts/` (includes `release.py`) - Package metadata: `setup.cfg`, `setup.py` - Documentation: `README.md`, `releasing.md`, `CLAUDE.md` +- Graph conventions: `docs/graph-conventions.md` +- Cypher query docs: `docs/cypher.md` diff --git a/README.md b/README.md index db9ab07..13eaf8f 100644 --- a/README.md +++ b/README.md @@ -136,6 +136,28 @@ Creating a simple model: ``` +### Querying with Cypher + +Models can be queried using the [openCypher](https://opencypher.org/) graph query language (requires optional dependency `spycy-aneeshdurg`): + +```python +from sgraph import SGraph +from sgraph.cypher import cypher_query + +model = SGraph.parse_xml_or_zipped_xml('model.xml') +results = cypher_query(model, 'MATCH (a)-[r:inc]->(b) RETURN a.name, b.name') +``` + +A CLI with interactive REPL is also available: + +```bash +pip install spycy-aneeshdurg +python -m sgraph.cypher model.xml.zip 'MATCH (n:file) RETURN n.name' # single query +python -m sgraph.cypher model.xml.zip # interactive REPL +python -m sgraph.cypher model.xml.zip -f dot 'MATCH (a)-[r]->(b) RETURN a, r, b' | dot -Tpng -o graph.png +``` + +See the [Cypher documentation](https://softagram.github.io/sgraph/cypher.html) for full details and query examples. ## Current utilization [Softagram](https://github.com/softagram) uses it for building up the information model about the diff --git a/docs/cypher.md b/docs/cypher.md new file mode 100644 index 0000000..6070708 --- /dev/null +++ b/docs/cypher.md @@ -0,0 +1,230 @@ +--- +layout: page +title: Cypher Query Support +permalink: /cypher/ +--- + +# Cypher Query Support + +sgraph provides built-in support for querying models using the [openCypher](https://opencypher.org/) query language. This is powered by [sPyCy](https://github.com/aneeshdurg/spycy), a Python implementation of openCypher with a pluggable graph backend. 
+ +## Installation + +Cypher support requires sPyCy as an optional dependency: + +```bash +pip install spycy-aneeshdurg +``` + +## How It Works + +sgraph's data model is a **hierarchical graph** — elements form a tree, and associations form directed edges between elements. Cypher operates on **labeled property graphs** (flat nodes with labels, typed relationships with properties). The `SGraphCypherBackend` bridges these two models: + +| sgraph concept | Cypher concept | +|---|---| +| SElement | Node | +| Element `type` attribute | Node label (e.g., `:file`, `:class`) | +| Element `attrs` + name/path | Node properties | +| SElementAssociation | Relationship | +| Association `deptype` | Relationship type (e.g., `:imports`, `:function_ref`) | +| Parent-child hierarchy | `:CONTAINS` relationships (optional) | + +Elements without a `type` attribute have no labels. The `name` and `path` properties are always available on every node. + +## Python API + +```python +from sgraph import SGraph +from sgraph.cypher import cypher_query + +model = SGraph.parse_xml_or_zipped_xml('model.xml.zip') + +# Returns a pandas DataFrame +results = cypher_query(model, ''' + MATCH (f:file)-[:imports]->(dep) + RETURN f.name, dep.name +''') +print(results) +``` + +For more control over the backend (e.g., disabling hierarchy edges): + +```python +from sgraph.cypher import SGraphCypherBackend, SGraphCypherExecutor + +backend = SGraphCypherBackend(root=model.rootNode, include_hierarchy=False) +executor = SGraphCypherExecutor(graph=backend) +result = executor.exec('MATCH (n) RETURN count(n)') +``` + +## Command-Line Interface + +Query models directly from the terminal: + +```bash +# Single query +python -m sgraph.cypher model.xml.zip 'MATCH (n:file) RETURN n.name LIMIT 5' + +# Interactive REPL +python -m sgraph.cypher model.xml.zip + +# Without hierarchy edges +python -m sgraph.cypher --no-hierarchy model.xml.zip +``` + +In interactive mode, type Cypher queries at the `cypher>` prompt. 
Multi-line queries are supported: the query is executed when a line ends with a semicolon or when a blank line is entered after the query. Type `quit` or `exit` to leave, or press Ctrl+D. Use `\format` followed by a format name (e.g., `\format csv`) to switch output format mid-session. + +### Output Formats + +The `-f` / `--format` flag controls the output format. Use `-o` to write to a file instead of stdout. + +**Tabular formats** render the query result DataFrame: + +| Format | Flag | Description | +|---|---|---| +| `table` | `-f table` | Aligned columns (default) | +| `csv` | `-f csv` | Comma-separated values | +| `tsv` | `-f tsv` | Tab-separated values | +| `json` | `-f json` | JSON array of objects | +| `jsonl` | `-f jsonl` | One JSON object per line | + +**Graph formats** extract a subgraph from `Node`/`Edge` objects in the result and export using sgraph converters. The query must return node and edge variables (e.g., `RETURN a, r, b`, not `RETURN a.name`): + +| Format | Flag | Description | +|---|---|---| +| `xml` | `-f xml` | sgraph XML format | +| `deps` | `-f deps` | Line-based deps format | +| `dot` | `-f dot` | Graphviz DOT | +| `plantuml` | `-f plantuml` | PlantUML component diagram | +| `graphml` | `-f graphml` | GraphML (yFiles compatible) | +| `cytoscape` | `-f cytoscape` | CytoscapeJS JSON | + +**Examples:** + +```bash +# Export query results as CSV for further processing +python -m sgraph.cypher model.xml.zip -f csv \ + 'MATCH (a:file)-[r:imports]->(b) RETURN a.name, b.name' > imports.csv + +# Extract a subgraph as Graphviz DOT +python -m sgraph.cypher model.xml.zip -f dot \ + 'MATCH (a)-[r:inc]->(b) RETURN a, r, b' | dot -Tpng -o graph.png + +# Export matched subgraph as PlantUML +python -m sgraph.cypher model.xml.zip -f plantuml \ + 'MATCH (a)-[r]->(b) WHERE a.name = "main.py" RETURN a, r, b' > deps.puml + +# Save subgraph as GraphML for yEd +python -m sgraph.cypher model.xml.zip -f graphml -o subgraph.graphml \ + 'MATCH (a)-[r:imports]->(b) RETURN a, r, b' + +# JSON Lines for 
streaming/piping +python -m sgraph.cypher model.xml.zip -f jsonl \ + 'MATCH (n:file) RETURN n.name, n.path' | jq . +``` + +## Query Examples + +### Software Architecture Convention + +**Find all import dependencies between files:** +```cypher +MATCH (a:file)-[r:imports]->(b:file) +RETURN a.name, b.name +``` + +**What does a specific file depend on?** +```cypher +MATCH (a)-[r]->(b) +WHERE a.name = 'main.py' AND type(r) <> 'CONTAINS' +RETURN a.name, type(r), b.name +``` + +**Transitive dependencies (up to 3 hops):** +```cypher +MATCH (a:file)-[:imports|function_ref*1..3]->(b) +RETURN DISTINCT a.name, b.name +``` + +**Count dependencies per file, sorted:** +```cypher +MATCH (a:file)-[r]->(b) +WHERE type(r) <> 'CONTAINS' +RETURN a.name, count(r) AS deps +ORDER BY deps DESC +``` + +**Files importing external packages:** +```cypher +MATCH (dir)-[:CONTAINS]->(f:file)-[:imports]->(ext:package) +RETURN dir.name, f.name, ext.name +``` + +**Find elements under External:** +```cypher +MATCH (n) +WHERE n.path CONTAINS 'External' +RETURN n.name, n.path +``` + +**Containment hierarchy:** +```cypher +MATCH (parent)-[:CONTAINS]->(child:file) +RETURN parent.name, child.name +``` + +**Navigate hierarchy with depth:** +```cypher +MATCH (root)-[:CONTAINS*1..3]->(deep) +WHERE root.name = 'src' +RETURN deep.name, deep.path +``` + +### Genealogy Convention + +**Find a person's parents:** +```cypher +MATCH (child)-[:parent]->(parent) +WHERE child.name CONTAINS 'Matti' +RETURN child.name, parent.name +``` + +**Find a person's children (reverse direction):** +```cypher +MATCH (child)-[:parent]->(parent) +WHERE parent.name CONTAINS 'Matti' +RETURN child.name +``` + +**Ancestors up to 3 generations:** +```cypher +MATCH (person)-[:parent*1..3]->(ancestor) +WHERE person.name CONTAINS 'Pekka' +RETURN ancestor.name +``` + +## Supported Cypher Features + +| Feature | Supported | +|---|---| +| `MATCH` with node labels and properties | Yes | +| `WHERE` with comparisons, `CONTAINS`, `AND`/`OR` | Yes | 
+| `RETURN` with aliases | Yes | +| `DISTINCT` | Yes | +| `ORDER BY`, `LIMIT`, `SKIP` | Yes | +| `count()`, `sum()`, `avg()`, `min()`, `max()` | Yes | +| `type(r)`, `labels(n)`, `id(n)` | Yes | +| Variable-length paths `*1..N` | Yes | +| `WITH` (intermediate results) | Yes | +| `UNION` / `UNION ALL` | Yes | +| `UNWIND` | Yes | +| `OPTIONAL MATCH` | Yes | +| `CREATE`, `DELETE`, `SET` | No (read-only) | +| `MERGE` | No | +| `CALL` subqueries | No | + +## Performance Notes + +The backend builds an in-memory index of all elements and associations when it is constructed. Note that `cypher_query` creates a new backend, and therefore rebuilds the index, on every call; to run several queries against one model, construct `SGraphCypherBackend` and `SGraphCypherExecutor` once and reuse the executor. For typical software projects (up to hundreds of thousands of elements) indexing is fast. For very large models (millions of elements), the initial indexing may take a few seconds. + +The `:CONTAINS` hierarchy adds one edge per parent-child pair, which can roughly double the edge count. Use `include_hierarchy=False` (or `--no-hierarchy` on the CLI) if you only need association queries and want faster indexing. diff --git a/docs/graph-conventions.md b/docs/graph-conventions.md new file mode 100644 index 0000000..2a7973e --- /dev/null +++ b/docs/graph-conventions.md @@ -0,0 +1,244 @@ +--- +layout: page +title: Graph Conventions +permalink: /graph-conventions/ +--- + +# Graph Conventions + +sgraph is a generic hierarchical graph library. The data model itself is convention-agnostic: elements form a tree, associations form directed edges, and both carry arbitrary key-value attributes. The **convention** defines how these primitives are used to represent a specific domain. + +This document describes the known conventions. Understanding the active convention is essential for writing meaningful queries, building correct tooling, and mapping sgraph structures to query languages like Cypher. + +## Convention: Software Architecture + +Used by Softagram analyzers and the sgraph-mcp-server to represent analyzed codebases. + +### Top-Level Structure + +Every software architecture model has a single **project element** (P) directly under the root. 
This element represents the analysis target — it may be a single project, a mono-repo, or an umbrella node covering an entire organization's codebases. + +``` +root (unnamed) +└── P (project element — exactly one) + ├── External/ (3rd-party dependencies) + ├── repo-a/ (git repository, type="repo") + ├── repo-b/ (git repository, type="repo") + └── ... +``` + +**Key rules:** + +- There is exactly **one** project element P under the root. +- P's name is typically the project or organization name. +- In a **multi-repo** analysis, P's children include multiple git repositories (type `repo`) plus the `External` subtree. +- In a **single-repo / local** analysis, P itself represents the sole git repository. Its children (besides `External`) are directories and files directly. Nested git submodules may still appear with type `repo` deeper in the tree. + +### The External Subtree + +The child of P named **`External`** is the ancestor of all identified third-party dependencies. Its internal structure is organized by package ecosystem: + +``` +P/ +└── External/ + ├── Python/ + │ ├── requests/ + │ ├── flask/ + │ └── ... + ├── NPM/ + │ ├── react/ + │ ├── lodash/ + │ └── ... + ├── Docker/ + │ └── Image/ + │ ├── nginx:latest/ + │ └── ... + ├── Go/ + ├── Maven/ + ├── Java/ + ├── Assemblies/ (.NET) + └── APT/ +``` + +External elements may carry a `repotype` attribute indicating the package manager (e.g., `NPM`, `PIP`, `APT`). The `version` attribute stores the resolved version when available. + +An element can be tested for externality by checking whether any of its ancestors is named `External`. The library provides `isExternalElement()` for this purpose. + +### Element Types + +Element types are stored in the `type` attribute (`attrs['type']`). Types are free-form strings — the library does not enforce an enum. 
The following types are conventional: + +#### Structural Types + +| Type | Meaning | +|------|---------| +| `repo` | Git repository root | +| `dir` | Directory / folder | +| `file` | Source file (generic) | + +#### Language-Specific File Types + +| Type | Meaning | +|------|---------| +| `c_source` | C source file (.c) | +| `c_header` | C header file (.h) | +| `python_module` | Python module (.py) | + +#### Code-Level Types + +| Type | Meaning | +|------|---------| +| `class` | Class definition | +| `function` | Function or method | +| `interface` | Interface definition | +| `property` | Property or field | +| `package` | Package / namespace | + +**Notes:** +- The `repo` type is preserved and never overwritten by directory-type inference. +- Composite types (e.g., `file_class`) can arise during element merging when two elements with different types are combined. +- Not all elements have a type — `getType()` returns an empty string when unset. + +### Association Types (Dependency Types) + +Associations represent directed dependencies between elements. The type is stored in the `deptype` field of `SElementAssociation`. + +| deptype | Meaning | +|---------|---------| +| `inc` | Include directive (C/C++ `#include`) | +| `imports` | Import statement | +| `function_ref` | Function call / reference | +| `inherits` | Class inheritance | +| `implements` | Interface implementation | +| `use` | General dependency (unclassified) | +| `calls` | Function/method invocation | +| `assembly_ref` | .NET assembly reference | + +**Dynamic (inferred) dependencies** are prefixed with `dynamic_` (e.g., `dynamic_function_ref`). These are generated by `SGraphAnalysis.generate_dynamic_dependencies()` for cases like polymorphic method dispatch where the static call target differs from the runtime target. 
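A minimal sketch of the `dynamic_` prefix convention described above. The helper `split_deptype` is hypothetical and not part of sgraph; it only assumes that a dynamic deptype is the static deptype with `dynamic_` prepended, which is useful when a query or tool wants to treat `function_ref` and `dynamic_function_ref` as the same kind of dependency.

```python
from typing import Tuple

DYNAMIC_PREFIX = "dynamic_"  # prefix named in the convention above


def split_deptype(deptype: str) -> Tuple[str, bool]:
    """Hypothetical helper: return (base_type, is_dynamic) for a deptype."""
    if deptype.startswith(DYNAMIC_PREFIX):
        # Strip the prefix to recover the underlying static deptype.
        return deptype[len(DYNAMIC_PREFIX):], True
    return deptype, False


print(split_deptype("dynamic_function_ref"))  # ('function_ref', True)
print(split_deptype("imports"))               # ('imports', False)
```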
+ +### Common Element Attributes + +| Attribute | Type | Meaning | +|-----------|------|---------| +| `type` | str | Element type (see above) | +| `loc` | int | Lines of code | +| `visibility` | str | Access modifier (public, private, ...) | +| `complexity` | int | Cyclomatic complexity | +| `repo_url` | str | Git remote URL (on repo elements) | +| `version` | str | Version string (on External dependencies) | +| `repotype` | str | Package ecosystem (NPM, PIP, APT, ...) | + +### Association Attributes + +| Attribute | Type | Meaning | +|-----------|------|---------| +| `compare` | str | Change status: `added`, `removed`, `changed` (in diff models) | + +### Hierarchy Semantics + +In the software convention, the hierarchy represents **structural containment**: + +- A directory *contains* its files +- A file *contains* its classes, functions, and other declarations +- A class *contains* its methods and properties +- A repository *contains* its directory tree + +This containment relationship is implicit in the parent-child tree. Associations (edges) represent **semantic dependencies** that cross the containment boundary — a function calling another function, a file importing another file, a class inheriting from another class. + +### Path Format + +Element paths use `/` as separator and start from the project element: + +``` +/my-project/src/main/java/com/example/App.java/App/main + ^project ^directories ^file ^class ^method +``` + +External dependency paths: + +``` +/my-project/External/Python/requests +/my-project/External/NPM/react +/my-project/External/Docker/Image/nginx:latest +``` + +--- + +## Convention: Genealogy + +Used by the sgraph-genealogy-mcp-server to represent family trees. + +### Top-Level Structure + +All persons are placed as **direct children of the root element** — the hierarchy is flat. + +``` +root (unnamed) +├── Matti Leppanen 1804 Taipale, Kivijarvi K. 1860 Kivijarvi +├── Johan Leppanen 1770 Kivijarvi K. 
1850 +├── Hilda Sofia Storck (Hanninen) 1906 K. 1976 +└── ... +``` + +### Element Naming Convention + +Each person's name encodes structured biographical data in a single string: + +``` +FirstName [Patronym] LastName [(FormerName)] BirthYear BirthPlace K. DeathYear DeathPlace +``` + +| Component | Example | Required | +|-----------|---------|----------| +| First name(s) | `Matti`, `Hilda Sofia` | Yes | +| Patronymic | `Iisakinpoika`, `Matintytär` | No | +| Last name(s) | `Leppanen`, `Storck` | Yes | +| Former/maiden name | `(Hanninen)`, `(Rintala)` | No | +| Birth year | `1804` | Yes | +| Birth place(s) | `Taipale, Kivijarvi` | No | +| Death marker | `K.` (Kuollut) | Only if deceased | +| Death year | `1860` | Only if deceased | +| Death place(s) | `Kivijarvi` | No | + +Approximate dates are prefixed with `noin` (approximately), `ennen vuotta` (before year), or `arviolta` (estimated). Negative years represent BC dates. + +### Association Types + +| deptype | Meaning | Direction | +|---------|---------|-----------| +| `parent` | Parent relationship | child → parent | + +This is the only association type. A person has outgoing `parent` associations pointing to their parents. Incoming `parent` associations come from their children. + +### Hierarchy Semantics + +In the genealogy convention, the hierarchy is **not semantically meaningful** — it serves only as a flat container. All family relationships are expressed through associations, not through parent-child tree nesting. + +--- + +## Defining New Conventions + +When creating a new sgraph convention for a domain, document the following: + +1. **Top-level structure**: How many elements under root? What do they represent? +2. **Hierarchy semantics**: Does the tree represent containment, categorization, or is it flat? +3. **Element types**: What values does the `type` attribute take? What do they mean? +4. **Naming convention**: Is the element name a simple identifier or does it encode structured data? +5. 
**Association types**: What `deptype` values exist and what relationships do they represent? +6. **Direction convention**: For each association type, what does the direction (from → to) mean? +7. **Standard attributes**: What attributes are expected on elements and associations? + +These definitions form the **schema** that enables meaningful queries, whether through the ModelApi, MCP tools, or query languages like Cypher. + +--- + +## Implications for Query Languages + +The convention determines how sgraph maps to external query models: + +| Concept | Software Convention | Genealogy Convention | +|---------|-------------------|---------------------| +| **Node labels** (Cypher) | Element type, used verbatim: `:file`, `:class`, `:function` | All nodes are `:Person` | +| **Relationship types** (Cypher) | deptype, used verbatim: `:imports`, `:calls`, `:inherits` | `:parent` (child → parent) | +| **Hierarchy** | Explicit `:CONTAINS` relationships or path property | Not applicable (flat) | +| **Properties** | Attributes from `attrs` dict | Parsed from element name | diff --git a/docs/index.md b/docs/index.md index c4279ab..d7ce5b1 100644 --- a/docs/index.md +++ b/docs/index.md @@ -88,9 +88,11 @@ Track and improve: ## 📖 Documentation - [**Getting Started Guide**](getting-started.html) - Your first steps with sgraph -- [**API Reference**](api-reference.html) - Complete API documentation +- [**API Reference**](api-reference.html) - Complete API documentation - [**Examples & Tutorials**](examples.html) - Real-world usage examples - [**Data Formats**](data-formats.html) - Understanding XML and Deps formats +- [**Graph Conventions**](graph-conventions.html) - How elements and associations represent different domains +- [**Cypher Query Support**](cypher.html) - Query models with the Cypher graph query language - [**Visualization Guide**](visualization.html) - Creating beautiful diagrams ## 🌟 Example: Analyzing a Real Project diff --git a/requirements.txt b/requirements.txt index 89c534a..5e71605 100644 --- 
a/requirements.txt +++ b/requirements.txt @@ -3,4 +3,7 @@ pandas==2.3.3 lxml~=6.0.2 Levenshtein==0.27.3 pytest==9.0.2 -deprecation==2.1.0 \ No newline at end of file +deprecation==2.1.0 + +# Optional: Cypher query support (python -m sgraph.cypher) +spycy-aneeshdurg==0.0.3 \ No newline at end of file diff --git a/src/sgraph/cypher.py b/src/sgraph/cypher.py new file mode 100644 index 0000000..38229db --- /dev/null +++ b/src/sgraph/cypher.py @@ -0,0 +1,560 @@ +"""Cypher query support for sgraph via sPyCy. + +Provides a read-only sPyCy Graph backend that maps sgraph elements and +associations to the labeled property graph model expected by Cypher. + +Mapping (software convention): + SElement -> Node with labels from element type, properties from attrs + SElementAssociation -> Relationship with type from deptype + Parent-child -> :CONTAINS relationships (hierarchy as explicit edges) + +Usage: + from sgraph import SGraph + from sgraph.cypher import cypher_query + + model = SGraph.parse_xml_or_zipped_xml('model.xml') + results = cypher_query(model, ''' + MATCH (f:file)-[r:function_ref]->(g:file) + RETURN f.name, g.name, r + ''') +""" + +from dataclasses import dataclass, field +from typing import Any, Dict, List, Mapping, Optional, Tuple + +import pandas as pd + +from sgraph.selement import SElement +from sgraph.selementassociation import SElementAssociation + +try: + from spycy.graph import Graph + from spycy.spycy import CypherExecutorBase +except ImportError: + raise ImportError( + "spycy is required for Cypher support. 
" + "Install it with: pip install spycy-aneeshdurg" + ) + + +class _DictMapping(Mapping): + """Thin Mapping wrapper over a dict.""" + + def __init__(self, data: dict): + self._data = data + + def __getitem__(self, key): + return self._data[key] + + def __iter__(self): + return iter(self._data) + + def __len__(self): + return len(self._data) + + +@dataclass +class SGraphCypherBackend(Graph[int, int]): + """Read-only sPyCy Graph backend wrapping an sgraph model. + + Builds an index of all elements and associations on construction, + mapping them to the labeled property graph model. + + Args: + root: Root SElement of the sgraph model. + include_hierarchy: If True, add :CONTAINS edges for parent-child + relationships. Default True. + """ + + root: SElement + include_hierarchy: bool = True + + _node_data: Dict[int, dict] = field(default_factory=dict, init=False) + _edge_data: Dict[int, dict] = field(default_factory=dict, init=False) + _elem_to_node: Dict[int, int] = field(default_factory=dict, init=False) + _node_out: Dict[int, List[int]] = field(default_factory=dict, init=False) + _node_in: Dict[int, List[int]] = field(default_factory=dict, init=False) + _edge_src: Dict[int, int] = field(default_factory=dict, init=False) + _edge_dst: Dict[int, int] = field(default_factory=dict, init=False) + # Reverse mappings for subgraph extraction + _node_to_elem: Dict[int, SElement] = field( + default_factory=dict, init=False) + _edge_to_assoc: Dict[int, Optional[SElementAssociation]] = field( + default_factory=dict, init=False) + + def __post_init__(self): + self._build_index() + + def _collect_elements(self) -> List[SElement]: + # TODO: Collecting all elements into a list may be problematic for + # very large models (10M+ elements). Consider a stack-based iterative + # traversal with a while-loop to avoid both list materialization and + # Python's recursion limit. Fine for typical project sizes. 
+ elements: List[SElement] = [] + self.root.traverseElements(lambda e: elements.append(e)) + return elements + + def _build_index(self): + node_id = 0 + edge_id = 0 + all_elements = self._collect_elements() + + # Pass 1: create nodes for all elements + for elem in all_elements: + nid = node_id + node_id += 1 + self._elem_to_node[id(elem)] = nid + self._node_to_elem[nid] = elem + self._node_out[nid] = [] + self._node_in[nid] = [] + + labels = set() + elem_type = elem.getType() + if elem_type: + labels.add(elem_type) + + props = dict(elem.attrs) if elem.attrs else {} + props['name'] = elem.name + props['path'] = elem.getPath() + # Remove 'type' from properties since it's a label + props.pop('type', None) + + self._node_data[nid] = { + 'labels': labels, + 'properties': props, + } + + # Pass 2: create edges for associations + seen_assocs = set() + for elem in all_elements: + for assoc in elem.outgoing: + assoc_id = id(assoc) + if assoc_id in seen_assocs: + continue + seen_assocs.add(assoc_id) + + from_nid = self._elem_to_node.get(id(assoc.fromElement)) + to_nid = self._elem_to_node.get(id(assoc.toElement)) + if from_nid is None or to_nid is None: + continue + + eid = edge_id + edge_id += 1 + + edge_props = dict(assoc.attrs) if assoc.attrs else {} + self._edge_data[eid] = { + 'type': assoc.deptype or 'unknown', + 'properties': edge_props, + } + self._node_out[from_nid].append(eid) + self._node_in[to_nid].append(eid) + self._edge_src[eid] = from_nid + self._edge_dst[eid] = to_nid + self._edge_to_assoc[eid] = assoc + + # Pass 3: create CONTAINS edges for hierarchy + if self.include_hierarchy: + for elem in all_elements: + parent_nid = self._elem_to_node.get(id(elem)) + if parent_nid is None: + continue + for child in elem.children: + child_nid = self._elem_to_node.get(id(child)) + if child_nid is None: + continue + eid = edge_id + edge_id += 1 + self._edge_data[eid] = { + 'type': 'CONTAINS', + 'properties': {}, + } + self._node_out[parent_nid].append(eid) + 
self._node_in[child_nid].append(eid) + self._edge_src[eid] = parent_nid + self._edge_dst[eid] = child_nid + # CONTAINS edges have no SElementAssociation + self._edge_to_assoc[eid] = None + + @property + def nodes(self) -> Mapping[int, Any]: + return _DictMapping(self._node_data) + + @property + def edges(self) -> Mapping[int, Any]: + return _DictMapping(self._edge_data) + + def add_node(self, data: Dict[str, Any]) -> int: + raise NotImplementedError("SGraphCypherBackend is read-only") + + def add_edge(self, start: int, end: int, data: Dict[str, Any]) -> int: + raise NotImplementedError("SGraphCypherBackend is read-only") + + def out_edges(self, node: int) -> List[int]: + return self._node_out.get(node, []) + + def in_edges(self, node: int) -> List[int]: + return self._node_in.get(node, []) + + def remove_node(self, node: int): + raise NotImplementedError("SGraphCypherBackend is read-only") + + def remove_edge(self, edge: int): + raise NotImplementedError("SGraphCypherBackend is read-only") + + def src(self, edge: int) -> int: + return self._edge_src[edge] + + def dst(self, edge: int) -> int: + return self._edge_dst[edge] + + +@dataclass +class SGraphCypherExecutor(CypherExecutorBase[int, int]): + """Cypher executor for sgraph models.""" + graph: SGraphCypherBackend = field( + default_factory=lambda: SGraphCypherBackend(SElement(None, '')) + ) + + +def cypher_query(model, query: str, + include_hierarchy: bool = True) -> pd.DataFrame: + """Execute a Cypher query against an sgraph model. + + Args: + model: An SGraph instance or the root SElement. + query: A Cypher query string. + include_hierarchy: If True, add :CONTAINS edges for the element + tree. Default True. + + Returns: + pandas DataFrame with query results. 
+ """ + from sgraph.sgraph import SGraph + + if isinstance(model, SGraph): + root = model.rootNode + else: + root = model + + backend = SGraphCypherBackend(root=root, + include_hierarchy=include_hierarchy) + executor = SGraphCypherExecutor(graph=backend) + return executor.exec(query) + + +def _extract_subgraph(result: pd.DataFrame, + backend: SGraphCypherBackend): + """Build an SGraph subgraph from Cypher query results. + + Scans the result DataFrame for Node and Edge objects returned by + sPyCy, maps them back to the original SElements and + SElementAssociations, and assembles a new SGraph containing only + those elements and associations. + """ + from spycy.types import Node, Edge + from sgraph.sgraph import SGraph + + node_ids = set() + edge_ids = set() + + for col in result.columns: + for val in result[col]: + if val is pd.NA: + continue + if isinstance(val, Node): + node_ids.add(val.id_) + elif isinstance(val, Edge): + if isinstance(val.id_, list): + edge_ids.update(val.id_) + else: + edge_ids.add(val.id_) + elif isinstance(val, list): + for v in val: + if isinstance(v, Node): + node_ids.add(v.id_) + elif isinstance(v, Edge): + edge_ids.add(v.id_) + + # Also include source/target nodes of matched edges + for eid in list(edge_ids): + node_ids.add(backend.src(eid)) + node_ids.add(backend.dst(eid)) + + if not node_ids: + return SGraph(SElement(None, '')) + + # Build new graph: recreate element paths and associations + new_graph = SGraph(SElement(None, '')) + path_to_new_elem = {} + + for nid in sorted(node_ids): + elem = backend._node_to_elem[nid] + path = elem.getPath() + if not path: + continue + new_elem = new_graph.createOrGetElementFromPath(path) + # Copy type + orig_type = elem.getType() + if orig_type: + new_elem.setType(orig_type) + # Copy attributes (except type which is already set) + if elem.attrs: + for k, v in elem.attrs.items(): + if k != 'type': + new_elem.addAttribute(k, v) + path_to_new_elem[path] = new_elem + + for eid in sorted(edge_ids): + 
assoc = backend._edge_to_assoc.get(eid) + if assoc is None: + # CONTAINS edge — hierarchy is implicit in the tree + continue + from_path = assoc.fromElement.getPath() + to_path = assoc.toElement.getPath() + from_elem = path_to_new_elem.get(from_path) + to_elem = path_to_new_elem.get(to_path) + if from_elem is None: + from_elem = new_graph.createOrGetElementFromPath(from_path) + path_to_new_elem[from_path] = from_elem + if to_elem is None: + to_elem = new_graph.createOrGetElementFromPath(to_path) + path_to_new_elem[to_path] = to_elem + new_assoc = SElementAssociation(from_elem, to_elem, assoc.deptype) + if assoc.attrs: + for k, v in assoc.attrs.items(): + new_assoc.addAttribute(k, v) + new_assoc.initElems() + + return new_graph + + +# ── Output formatters ────────────────────────────────────────────── + +TABULAR_FORMATS = ('table', 'csv', 'tsv', 'json', 'jsonl') +GRAPH_FORMATS = ('xml', 'deps', 'dot', 'plantuml', 'graphml', 'cytoscape') +ALL_FORMATS = TABULAR_FORMATS + GRAPH_FORMATS + + +def _output_tabular(result: pd.DataFrame, fmt: str): + """Write DataFrame in a tabular format to stdout.""" + import json as json_mod + import sys + + if fmt == 'table': + if len(result): + print(result.to_string(index=False), flush=True) + elif fmt == 'csv': + print(result.to_csv(index=False), end='') + elif fmt == 'tsv': + print(result.to_csv(index=False, sep='\t'), end='') + elif fmt == 'json': + records = _dataframe_to_serializable(result) + print(json_mod.dumps(records, ensure_ascii=False, indent=2)) + elif fmt == 'jsonl': + records = _dataframe_to_serializable(result) + for rec in records: + print(json_mod.dumps(rec, ensure_ascii=False)) + + +def _dataframe_to_serializable(df: pd.DataFrame) -> list: + """Convert DataFrame to a JSON-serializable list of dicts.""" + from spycy.types import Node, Edge + + records = df.to_dict('records') + for rec in records: + for k, v in list(rec.items()): + if v is pd.NA: + rec[k] = None + elif isinstance(v, Node): + rec[k] = f'Node({v.id_})' + 
elif isinstance(v, Edge): + rec[k] = f'Edge({v.id_})' + elif isinstance(v, (set, frozenset)): + rec[k] = list(v) + return records + + +def _output_graph(result: pd.DataFrame, fmt: str, + backend: SGraphCypherBackend, outfile: Optional[str]): + """Extract subgraph from results and output in a graph format.""" + import json as json_mod + import sys + + subgraph = _extract_subgraph(result, backend) + + if fmt == 'xml': + subgraph.to_xml(outfile) + elif fmt == 'deps': + subgraph.to_deps(outfile) + elif fmt == 'plantuml': + subgraph.to_plantuml(outfile) + elif fmt == 'dot': + from sgraph.converters.xml_to_dot import graph_to_dot + graph_to_dot(subgraph) + elif fmt == 'graphml': + if outfile: + from sgraph.converters.graphml import sgraph_to_graphml_file + sgraph_to_graphml_file(subgraph, outfile) + else: + # graphml requires a file; write to stdout via temp + import tempfile + import os + from sgraph.converters.graphml import sgraph_to_graphml_file + with tempfile.NamedTemporaryFile( + mode='w', suffix='.graphml', + delete=False) as tmp: + tmp_path = tmp.name + try: + sgraph_to_graphml_file(subgraph, tmp_path) + with open(tmp_path, 'r') as f: + sys.stdout.write(f.read()) + finally: + os.unlink(tmp_path) + elif fmt == 'cytoscape': + from sgraph.converters.sgraph_to_cytoscape import graph_to_cyto + data = graph_to_cyto(subgraph) + print(json_mod.dumps(data, ensure_ascii=False, indent=2)) + + +# ── CLI ───────────────────────────────────────────────────────────── + +def main(): + """CLI entry point: python -m sgraph.cypher [query]""" + import argparse + import sys + import time + + from sgraph.sgraph import SGraph + from spycy.errors import ExecutionError + + parser = argparse.ArgumentParser( + prog='python -m sgraph.cypher', + description='Query sgraph models with Cypher.') + parser.add_argument('model', help='Path to model file (.xml or .xml.zip)') + parser.add_argument('query', nargs='?', default=None, + help='Cypher query to execute. 
' + 'If omitted, starts interactive REPL.') + parser.add_argument('--no-hierarchy', action='store_true', + help='Do not create :CONTAINS edges for the ' + 'element tree') + parser.add_argument('-f', '--format', default='table', + choices=ALL_FORMATS, + help='Output format (default: table). ' + 'Graph formats (xml, deps, dot, plantuml, graphml, ' + 'cytoscape) extract a subgraph from Node/Edge ' + 'objects in the result.') + parser.add_argument('-o', '--output', default=None, + help='Output file for xml, deps, plantuml and ' + 'graphml (default: stdout). Other formats ' + 'always write to stdout.') + args = parser.parse_args() + + print(f'Loading {args.model}...', file=sys.stderr) + t0 = time.time() + try: + model = SGraph.parse_xml_or_zipped_xml(args.model) + except Exception as e: + print(f'Error: {e}', file=sys.stderr) + sys.exit(1) + t_load = time.time() - t0 + + print('Building Cypher index...', file=sys.stderr) + t0 = time.time() + include_hierarchy = not args.no_hierarchy + backend = SGraphCypherBackend(root=model.rootNode, + include_hierarchy=include_hierarchy) + executor = SGraphCypherExecutor(graph=backend) + t_index = time.time() - t0 + + n_nodes = len(backend._node_data) + n_edges = len(backend._edge_data) + print(f'Ready: {n_nodes} nodes, {n_edges} edges ' + f'(load {t_load:.2f}s, index {t_index:.2f}s)', file=sys.stderr) + + if args.query: + # Single query mode + try: + result = executor.exec(args.query) + _output_result(result, args.format, backend, args.output) + except ExecutionError as e: + print(f'Error: {e}', file=sys.stderr) + sys.exit(1) + else: + # Interactive REPL + _run_repl(executor, backend, args.format, args.output) + + +def _output_result(result: pd.DataFrame, fmt: str, + backend: SGraphCypherBackend, + outfile: Optional[str] = None): + """Route output to the right formatter.""" + if fmt in TABULAR_FORMATS: + _output_tabular(result, fmt) + elif fmt in GRAPH_FORMATS: + _output_graph(result, fmt, backend, outfile) + + +def _run_repl(executor: SGraphCypherExecutor, + backend: 
SGraphCypherBackend, + fmt: str, outfile: Optional[str]): + """Interactive Cypher REPL.""" + import sys + import time + from spycy.errors import ExecutionError + + try: + import readline # noqa: F401 - enables line editing + except ImportError: + pass + + print('Enter Cypher queries. End with ; or blank line. ' + 'Type "quit" to exit, "\\format NAME" to switch output format.', + file=sys.stderr) + + while True: + try: + lines = [] + prompt = 'cypher> ' + while True: + line = input(prompt) + stripped = line.strip().lower() + if stripped in ('quit', 'exit'): + return + # In-session format switch: \format NAME + # (anything else after \format lists the available formats) + if stripped.startswith('\\format'): + parts = stripped.split() + if len(parts) == 2 and parts[1] in ALL_FORMATS: + fmt = parts[1] + print(f'Output format: {fmt}', file=sys.stderr) + else: + print(f'Available formats: ' + f'{", ".join(ALL_FORMATS)}', + file=sys.stderr) + lines = [] + break + lines.append(line) + if line.strip().endswith(';'): + break + if line.strip() == '' and lines: + break + prompt = ' > ' + + query_str = ' '.join(lines).strip() + if query_str.endswith(';'): + query_str = query_str[:-1] + if not query_str: + continue + + t0 = time.time() + result = executor.exec(query_str) + elapsed = time.time() - t0 + _output_result(result, fmt, backend, outfile) + print(f'({len(result)} rows, {elapsed:.3f}s)', + file=sys.stderr) + except ExecutionError as e: + print(f'Error: {e}', file=sys.stderr) + except EOFError: + print() + return + + +if __name__ == '__main__': + main()
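Review note: the statement-accumulation rules inside `_run_repl` (end on `;` or a blank line, `quit`/`exit` to leave) are hard to test because they are tied to `input()`. A minimal sketch of the same rules as a pure function follows. The name `collect_statement` and the constant `QUIT_WORDS` are hypothetical and not part of this PR, the `\format` switch is left out, and exhausted input simply ends the statement instead of raising `EOFError`.

```python
from typing import Iterator, Optional

# Hypothetical constant; mirrors the ('quit', 'exit') check in _run_repl.
QUIT_WORDS = ('quit', 'exit')


def collect_statement(lines: Iterator[str]) -> Optional[str]:
    """Accumulate input lines into one Cypher statement.

    Returns the statement without its trailing ';', an empty string when
    nothing was entered, or None when the user asked to quit.
    """
    collected = []
    for line in lines:
        stripped = line.strip()
        if stripped.lower() in QUIT_WORDS:
            return None
        collected.append(line)
        # Same terminators as the REPL: a trailing ';' or a blank line.
        if stripped.endswith(';') or stripped == '':
            break
    query = ' '.join(collected).strip()
    if query.endswith(';'):
        query = query[:-1]
    return query


if __name__ == '__main__':
    session = iter(['MATCH (n:file)', 'RETURN n.name;', 'quit'])
    print(collect_statement(session))  # first statement, lines joined
    print(collect_statement(session))  # None, because of 'quit'
```

Because the function consumes a shared iterator, calling it repeatedly walks through a whole scripted session, which would let the REPL rules be covered by plain unit tests without touching stdin.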