Skip to content

Commit ba0f24f

Browse files
yoffCopilot
andcommitted
Python: canonicalize CFG nodes for dataflow
The shared CFG creates multiple ControlFlowNodes per AST node in conditional contexts (e.g. afterTrue/afterFalse for boolean conditions, empty/non-empty for for-loops, matched/unmatched for match cases). These splits matter for control-flow analysis, but for dataflow — where we ask 'what is the value of this expression?' — we need exactly one representative per AST or we double-count calls, arguments, and store steps. This adds Cfg::isCanonicalAstNodeRepresentative as a purely structural pick: for split ASTs it selects the 'positive' outcome variant; for non-split ASTs it selects the unique variant. The picker is implemented via genuine-outcome helpers that work around the shared CFG's cross-kind isAfterValue fallback (ControlFlowGraph.qll:870-892), see the doc on isGenuineAfterTrue for details. The TCfgNode-family newtypes in DataFlowPublic, TNormalCall and TPotentialLibraryCall in DataFlowDispatch, and the SSA-projected use-use/def-use steps in DataFlowPrivate are all routed through the canonical filter. DataFlowConsistency and the test UnresolvedCalls helper qualify their CallNode casts with Cfg:: to keep working. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent b8c5e25 commit ba0f24f

7 files changed

Lines changed: 165 additions & 40 deletions

File tree

python/ql/consistency-queries/DataFlowConsistency.ql

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@ private import semmle.python.dataflow.new.internal.DataFlowImplSpecific
99
private import semmle.python.dataflow.new.internal.DataFlowDispatch
1010
private import semmle.python.dataflow.new.internal.TaintTrackingImplSpecific
1111
private import codeql.dataflow.internal.DataFlowImplConsistency
12+
private import semmle.python.controlflow.internal.Cfg as Cfg
1213

1314
private module Input implements InputSig<Location, PythonDataFlow> {
1415
private import Private
@@ -72,7 +73,7 @@ private module Input implements InputSig<Location, PythonDataFlow> {
7273
// resolve to multiple functions), but we only make _one_ ArgumentNode for each
7374
// argument in the CallNode, we end up violating this consistency check in those
7475
// cases. (see `getCallArg` in DataFlowDispatch.qll)
75-
exists(DataFlowCall other, CallNode cfgCall | other != call |
76+
exists(DataFlowCall other, Cfg::CallNode cfgCall | other != call |
7677
call.getNode() = cfgCall and
7778
other.getNode() = cfgCall and
7879
isArgumentNode(arg, call, _) and
@@ -88,16 +89,16 @@ private module Input implements InputSig<Location, PythonDataFlow> {
8889
// allow it instead.
8990
(
9091
call.getScope() = attr.getScope() and
91-
any(CfgNode n | n.asCfgNode() = call.getNode().(CallNode).getFunction()).getALocalSource() =
92-
attr
92+
any(CfgNode n | n.asCfgNode() = call.getNode().(Cfg::CallNode).getFunction())
93+
.getALocalSource() = attr
9394
or
9495
not exists(call.getScope().(Function).getDefinition()) and
9596
call.getScope().getScope+() = attr.getScope()
9697
) and
9798
(
9899
other.getScope() = attr.getScope() and
99-
any(CfgNode n | n.asCfgNode() = other.getNode().(CallNode).getFunction()).getALocalSource() =
100-
attr
100+
any(CfgNode n | n.asCfgNode() = other.getNode().(Cfg::CallNode).getFunction())
101+
.getALocalSource() = attr
101102
or
102103
not exists(other.getScope().(Function).getDefinition()) and
103104
other.getScope().getScope+() = attr.getScope()
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
---
2+
category: minorAnalysis
3+
---
4+
* The Python dataflow library is now built on the shared CFG and SSA libraries (`shared/controlflow` and `shared/ssa`), bringing Python in line with the other CodeQL languages. The legacy CFG in `semmle/python/Flow.qll` and the legacy ESSA SSA in `semmle/python/essa/*` remain available for downstream queries but are no longer used by the new dataflow library, type tracking, or API graphs. Most queries should be unaffected; a small number may produce slightly different results because of differences in CFG granularity (e.g. separate pre/post nodes per expression) and in how attribute and tuple-unpacking writes are modelled.

python/ql/lib/semmle/python/controlflow/internal/Cfg.qll

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -76,6 +76,92 @@ private predicate isCanonical(CfgImpl::ControlFlowNode n) {
7676
n instanceof CfgImpl::ControlFlow::AnnotatedExitNode
7777
}
7878

79+
/**
80+
* Holds if `n` is genuinely the `TAfterValueNode` variant for a boolean-true
81+
* outcome of its AST node.
82+
*
83+
* The shared CFG's `isAfterValue` predicate has a kind-mismatch fallback
84+
* (see `ControlFlowGraph.qll`'s `isAfterValue` lines 870-892): when asking
85+
* `isAfterValue(_, BooleanSuccessor true)` on, say, an emptiness-empty
86+
* variant, it falsely returns `true` because the kinds differ and Python
87+
* does not provide `successorValueImplies`. The same fallback also makes
88+
* `TAfterNode` (the unsplit case) satisfy `isAfterValue(_, t)` for *every*
89+
* `t`.
90+
*
91+
* The combination `isAfterTrue ∧ ¬isAfterFalse` excludes both: a genuine
92+
* boolean-true variant satisfies `isAfterTrue` directly (newtype branch 3)
93+
* but not `isAfterFalse` (the dual variant, same kind, no fallback). A
94+
* non-boolean variant satisfies both via the cross-kind fallback. A
95+
* `TAfterNode` satisfies both via branch 1. So `isAfterTrue ∧ ¬isAfterFalse`
96+
* picks exactly the genuine boolean-true variant.
97+
*/
98+
private predicate isGenuineAfterTrue(ControlFlowNode n) {
99+
n.isAfterTrue(_) and not n.isAfterFalse(_)
100+
}
101+
102+
/**
103+
* Holds if `n` is genuinely the `TAfterValueNode` variant for the "empty"
104+
* outcome of its AST node (e.g. `for x in xs: ...` when `xs` is empty).
105+
*
106+
* See `isGenuineAfterTrue` for why we cannot just use a single
107+
* `isAfterValue` check.
108+
*/
109+
private predicate isGenuineAfterEmpty(ControlFlowNode n) {
110+
exists(EmptinessSuccessor empty |
111+
empty.getValue() = true and n.isAfterValue(n.getAstNode(), empty)
112+
) and
113+
not exists(EmptinessSuccessor nonEmpty |
114+
nonEmpty.getValue() = false and n.isAfterValue(n.getAstNode(), nonEmpty)
115+
)
116+
}
117+
118+
/**
119+
* Holds if `n` is genuinely the `TAfterValueNode` variant for the "matched"
120+
* outcome of its AST node (e.g. a `match` case-pattern that matched).
121+
*
122+
* See `isGenuineAfterTrue` for why we cannot just use a single
123+
* `isAfterValue` check.
124+
*/
125+
private predicate isGenuineAfterMatched(ControlFlowNode n) {
126+
exists(MatchingSuccessor matched |
127+
matched.getValue() = true and n.isAfterValue(n.getAstNode(), matched)
128+
) and
129+
not exists(MatchingSuccessor unmatched |
130+
unmatched.getValue() = false and n.isAfterValue(n.getAstNode(), unmatched)
131+
)
132+
}
133+
134+
/**
135+
* Holds if `n` is the canonical representative of its corresponding AST node
136+
* for dataflow purposes.
137+
*
138+
* The shared CFG associates a single AST node with multiple `ControlFlowNode`s
139+
* when the AST appears in a conditional context (boolean conditions split into
140+
* `afterTrue`/`afterFalse`; for-loop iters split into `[empty]`/`[non-empty]`;
141+
* `match`-case patterns split into `[matched]`/`[unmatched]`). These splits
142+
* matter for control-flow analysis, but for dataflow purposes — where we
143+
* ask "what is the value of this expression?" — a single representative
144+
* suffices and is required to avoid double-counting calls, arguments, store
145+
* steps, etc.
146+
*
147+
* The pick is structural: when an AST has a single `ControlFlowNode` (the
148+
* normal `TAfterNode` or `TBeforeNode`-leaf case), that node is canonical.
149+
* When an AST has a conditional split, the "positive" outcome variant
150+
* (true / empty / matched) is canonical. The three split kinds are mutually
151+
* exclusive per AST, so exactly one variant is selected.
152+
*/
153+
predicate isCanonicalAstNodeRepresentative(ControlFlowNode n) {
154+
// Non-split AST: the unique variant is canonical.
155+
not exists(ControlFlowNode other | other.getNode() = n.getNode() and other != n)
156+
or
157+
// Split AST: pick the "positive" outcome of the split.
158+
isGenuineAfterTrue(n)
159+
or
160+
isGenuineAfterEmpty(n)
161+
or
162+
isGenuineAfterMatched(n)
163+
}
164+
79165
/**
80166
* A control flow node. Control flow nodes have a many-to-one relation
81167
* with syntactic nodes, although most syntactic nodes have only one

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1444,11 +1444,12 @@ private predicate sameEnclosingCallable(Node node1, Node node2) {
14441444
// =============================================================================
14451445
newtype TDataFlowCall =
14461446
TNormalCall(Cfg::CallNode call, Function target, CallType type) {
1447-
resolveCall(call, target, type)
1447+
resolveCall(call, target, type) and
1448+
Cfg::isCanonicalAstNodeRepresentative(call)
14481449
} or
14491450
/** A call to the generated function inside a comprehension */
14501451
TComprehensionCall(Comp c) or
1451-
TPotentialLibraryCall(Cfg::CallNode call) or
1452+
TPotentialLibraryCall(Cfg::CallNode call) { Cfg::isCanonicalAstNodeRepresentative(call) } or
14521453
/** A synthesized call inside a summarized callable */
14531454
TSummaryCall(
14541455
FlowSummaryImpl::Public::SummarizedCallable c, FlowSummaryImpl::Private::SummaryNode receiver

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll

Lines changed: 27 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -365,7 +365,11 @@ module LocalFlow {
365365
exists(Cfg::DefinitionNode def |
366366
nodeFrom.(CfgNode).getNode() = def.getValue() and
367367
nodeTo.(CfgNode).getNode() = def and
368-
def instanceof Cfg::NameNode
368+
def instanceof Cfg::NameNode and
369+
// Parameter defaults are evaluated in the enclosing scope, while the
370+
// parameter itself lives in the function's scope. The cross-scope
371+
// edge is provided by `runtimeJumpStep` instead.
372+
not exists(Py::Parameter param | def.getNode() = param.asName())
369373
)
370374
or
371375
// With definition
@@ -410,11 +414,27 @@ module LocalFlow {
410414
}
411415

412416
predicate useToNextUse(Cfg::NameNode nodeFrom, Cfg::NameNode nodeTo) {
413-
SsaImpl::AdjacentUses::adjacentUseUse(nodeFrom, nodeTo)
417+
// The SSA-level adjacent-use predicate works on specific CFG variants
418+
// (e.g. boolean-outcome `[true]`/`[false]` or emptiness `[empty]`/`[non-empty]`
419+
// splits of the same AST node), but dataflow values are insensitive to
420+
// those splits — there is at most one `CfgNode` per AST. Project both
421+
// ends through `Cfg::isCanonicalAstNodeRepresentative` so all variants
422+
// contribute their use-use edges to the canonical pair.
423+
exists(Cfg::NameNode fromVariant, Cfg::NameNode toVariant |
424+
SsaImpl::AdjacentUses::adjacentUseUse(fromVariant, toVariant) and
425+
fromVariant.getNode() = nodeFrom.getNode() and
426+
toVariant.getNode() = nodeTo.getNode() and
427+
Cfg::isCanonicalAstNodeRepresentative(nodeFrom) and
428+
Cfg::isCanonicalAstNodeRepresentative(nodeTo)
429+
)
414430
}
415431

416432
predicate defToFirstUse(SsaImpl::EssaVariable var, Cfg::NameNode nodeTo) {
417-
SsaImpl::AdjacentUses::firstUse(var.getDefinition(), nodeTo)
433+
exists(Cfg::NameNode toVariant |
434+
SsaImpl::AdjacentUses::firstUse(var.getDefinition(), toVariant) and
435+
toVariant.getNode() = nodeTo.getNode() and
436+
Cfg::isCanonicalAstNodeRepresentative(nodeTo)
437+
)
418438
}
419439

420440
predicate useUseFlowStep(Node nodeFrom, Node nodeTo) {
@@ -423,12 +443,14 @@ module LocalFlow {
423443
// `x = f(y)`
424444
// nodeFrom is `y` on first line
425445
// nodeTo is `y` on second line
426-
exists(SsaImpl::EssaDefinition def |
446+
exists(SsaImpl::EssaDefinition def, Cfg::NameNode toVariant |
427447
nodeFrom.(CfgNode).getNode() = def.(SsaImpl::EssaNodeDefinition).getDefiningNode()
428448
or
429449
nodeFrom.(ScopeEntryDefinitionNode).getDefinition() = def
430450
|
431-
SsaImpl::AdjacentUses::firstUse(def, nodeTo.(CfgNode).getNode())
451+
SsaImpl::AdjacentUses::firstUse(def, toVariant) and
452+
toVariant.getNode() = nodeTo.(CfgNode).getNode().getNode() and
453+
Cfg::isCanonicalAstNodeRepresentative(nodeTo.(CfgNode).getNode())
432454
)
433455
or
434456
// Next use after use

python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPublic.qll

Lines changed: 38 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -29,9 +29,12 @@ overlay[local]
2929
newtype TNode =
3030
/** A node corresponding to a control flow node. */
3131
TCfgNode(Cfg::ControlFlowNode node) {
32-
isExpressionNode(node)
33-
or
34-
node.getNode() instanceof Pattern
32+
(
33+
isExpressionNode(node)
34+
or
35+
node.getNode() instanceof Pattern
36+
) and
37+
Cfg::isCanonicalAstNodeRepresentative(node)
3538
} or
3639
/**
3740
* A node corresponding to a scope entry definition. That is, the value of a variable
@@ -50,36 +53,39 @@ newtype TNode =
5053
// NOTE: since we can't rely on the call graph, but we want to have synthetic
5154
// pre-update nodes for class calls, we end up getting synthetic pre-update nodes for
5255
// ALL calls :|
53-
TSyntheticPreUpdateNode(Cfg::CallNode call) or
56+
TSyntheticPreUpdateNode(Cfg::CallNode call) { Cfg::isCanonicalAstNodeRepresentative(call) } or
5457
/**
5558
* A synthetic node representing the value of an object after a state change.
5659
* See QLDoc for `PostUpdateNode`.
5760
*/
5861
TSyntheticPostUpdateNode(Cfg::ControlFlowNode node) {
59-
exists(Cfg::CallNode call |
60-
node = call.getArg(_)
62+
Cfg::isCanonicalAstNodeRepresentative(node) and
63+
(
64+
exists(Cfg::CallNode call |
65+
node = call.getArg(_)
66+
or
67+
node = call.getArgByName(_)
68+
or
69+
// `self` argument when handling class instance calls (`__call__` special method))
70+
node = call.getFunction()
71+
)
6172
or
62-
node = call.getArgByName(_)
73+
node = any(Cfg::AttrNode a).getObject()
6374
or
64-
// `self` argument when handling class instance calls (`__call__` special method))
65-
node = call.getFunction()
66-
)
67-
or
68-
node = any(Cfg::AttrNode a).getObject()
69-
or
70-
node = any(Cfg::SubscriptNode s).getObject()
71-
or
72-
// self parameter when used implicitly in `super()`
73-
exists(Class cls, Function func, SsaImpl::ParameterDefinition def |
74-
func = cls.getAMethod() and
75-
not isStaticmethod(func) and
76-
// this matches what we do in ExtractedParameterNode
77-
def.getDefiningNode() = node and
78-
def.getParameter() = func.getArg(0)
75+
node = any(Cfg::SubscriptNode s).getObject()
76+
or
77+
// self parameter when used implicitly in `super()`
78+
exists(Class cls, Function func, SsaImpl::ParameterDefinition def |
79+
func = cls.getAMethod() and
80+
not isStaticmethod(func) and
81+
// this matches what we do in ExtractedParameterNode
82+
def.getDefiningNode() = node and
83+
def.getParameter() = func.getArg(0)
84+
)
85+
or
86+
// the iterable argument to the implicit comprehension function
87+
node.getNode() = any(Comp c).getIterable()
7988
)
80-
or
81-
// the iterable argument to the implicit comprehension function
82-
node.getNode() = any(Comp c).getIterable()
8389
} or
8490
/** A node representing a global (module-level) variable in a specific module. */
8591
TModuleVariableNode(Module m, GlobalVariable v) { v.getScope() = m } or
@@ -115,7 +121,9 @@ newtype TNode =
115121
exists(ParameterPosition ppos | ppos.isStarArgs(_) | exists(callable.getParameter(ppos)))
116122
} or
117123
/** A synthetic node to capture keyword arguments that are passed to a `**kwargs` parameter. */
118-
TSynthDictSplatArgumentNode(Cfg::CallNode call) { exists(call.getArgByName(_)) } or
124+
TSynthDictSplatArgumentNode(Cfg::CallNode call) {
125+
exists(call.getArgByName(_)) and Cfg::isCanonicalAstNodeRepresentative(call)
126+
} or
119127
/** A synthetic node to allow flow to keyword parameters from a `**kwargs` argument. */
120128
TSynthDictSplatParameterNode(DataFlowCallable callable) {
121129
exists(ParameterPosition ppos | ppos.isKeyword(_) | exists(callable.getParameter(ppos)))
@@ -132,14 +140,16 @@ newtype TNode =
132140
* by the callable being called.
133141
*/
134142
TSynthCapturedVariablesArgumentNode(Cfg::ControlFlowNode callable) {
135-
callable = any(Cfg::CallNode c).getFunction()
143+
callable = any(Cfg::CallNode c).getFunction() and
144+
Cfg::isCanonicalAstNodeRepresentative(callable)
136145
} or
137146
/**
138147
* A synthetic node representing the values of the variables captured
139148
* by the callable being called, after the output has been computed.
140149
*/
141150
TSynthCapturedVariablesArgumentPostUpdateNode(Cfg::ControlFlowNode callable) {
142-
callable = any(Cfg::CallNode c).getFunction()
151+
callable = any(Cfg::CallNode c).getFunction() and
152+
Cfg::isCanonicalAstNodeRepresentative(callable)
143153
} or
144154
/** A synthetic node representing the values of variables captured by a comprehension. */
145155
TSynthCompCapturedVariablesArgumentNode(Comp comp) {

python/ql/lib/utils/test/dataflow/UnresolvedCalls.qll

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ signature module UnresolvedCallExpectationsSig {
1111

1212
module DefaultUnresolvedCallExpectations implements UnresolvedCallExpectationsSig {
1313
predicate unresolvedCall(Cfg::CallNode call) {
14+
Cfg::isCanonicalAstNodeRepresentative(call) and
1415
not exists(DataFlowPrivate::DataFlowCall dfc |
1516
exists(dfc.getCallable()) and dfc.getNode() = call
1617
) and

0 commit comments

Comments
 (0)