Pyspark: DataFrame methods behind is_remote_only() statically evaluate to Union during typechecking #56141

@iamkhav

Description

A static typechecker can't evaluate is_remote_only(), so it cannot tell whether the DataFrame members defined behind that check exist. As a result, the inferred type of those members is a Union of the property's return type and a Column (because of the __getattr__ fallback).

    if not is_remote_only():

        @property
        def rdd(self) -> "RDD[Row]":
            """Returns the content as an :class:`pyspark.RDD` of :class:`Row`.

            .. versionadded:: 1.3.0

            Returns
            -------
            :class:`RDD`

            Examples
            --------
            >>> df = spark.range(1)
            >>> type(df.rdd)
            <class 'pyspark.core.rdd.RDD'>
            """
            ...

I read the reasoning in PR #45053. In addition, I think it makes sense to add logic for static typecheckers.
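One possible shape for that logic, as a minimal sketch rather than the actual pyspark code: guard the definition with typing.TYPE_CHECKING so that a typechecker always sees the property, while the runtime behaviour still depends on is_remote_only(). The Column, RDD and DataFrame classes and the is_remote_only() function below are hypothetical stand-ins so the snippet runs without pyspark, and whether a given typechecker (ty included) treats the combined condition as always true is an assumption I have not verified.

```python
from typing import TYPE_CHECKING, Any


def is_remote_only() -> bool:
    # Hypothetical stand-in for pyspark's runtime check; always False here.
    return False


class Column:
    """Hypothetical stand-in for pyspark.sql.Column."""

    def __init__(self, name: str) -> None:
        self.name = name


class RDD:
    """Hypothetical stand-in for pyspark.RDD."""

    def flatMap(self, fn: Any) -> "RDD":
        # The real method applies fn to every row; this stub just returns self.
        return self


class DataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""

    # TYPE_CHECKING is True only for static typecheckers, so they are meant to
    # treat the property as always defined. At runtime TYPE_CHECKING is False
    # and is_remote_only() still decides whether the property exists.
    if TYPE_CHECKING or not is_remote_only():

        @property
        def rdd(self) -> RDD:
            return RDD()

    def __getattr__(self, name: str) -> Column:
        # Fallback for unknown attributes, e.g. column access as df.some_col.
        return Column(name)


df = DataFrame()
result = df.rdd.flatMap(lambda row: [row])
```

With this guard, df.rdd is intended to resolve to RDD alone during typechecking, and unknown attributes such as df.some_col still fall through to __getattr__.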

Example

_ = df.rdd.flatMap(some_fn)

The typechecker doesn't know whether rdd is the rdd property or a Column returned by __getattr__.
Calling flatMap() then makes the typechecker report an error, because it can't tell which of the two was returned.
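Until something changes upstream, a user-side workaround could be to narrow the type explicitly with typing.cast. This is only a sketch: is_remote_only(), Column, RDD and DataFrame below are hypothetical stand-ins that mimic the conditional definition from the description, so the snippet runs without pyspark.

```python
from typing import Any, Callable, Iterable, List, cast


def is_remote_only() -> bool:
    # Hypothetical stand-in for pyspark's runtime check; always False here.
    return False


class Column:
    """Hypothetical stand-in for pyspark.sql.Column."""

    def __init__(self, name: str) -> None:
        self.name = name


class RDD:
    """Hypothetical stand-in for pyspark.RDD, backed by a plain list."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self.rows: List[Any] = list(rows)

    def flatMap(self, fn: Callable[[Any], Iterable[Any]]) -> "RDD":
        # Apply fn to each row and flatten the resulting iterables.
        return RDD(item for row in self.rows for item in fn(row))


class DataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""

    # Same pattern as in the issue: the property only exists when the
    # runtime check passes, which a typechecker cannot evaluate.
    if not is_remote_only():

        @property
        def rdd(self) -> RDD:
            return RDD([1, 2])

    def __getattr__(self, name: str) -> Column:
        return Column(name)


df = DataFrame()
# cast() has no runtime effect; it only tells the typechecker which
# member of the union to assume.
rdd = cast(RDD, df.rdd)
flat = rdd.flatMap(lambda row: [row, row])
```

The cast silences the error at the call site but has to be repeated wherever such a member is used, which is why a fix in the library itself seems preferable.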

Tested with https://github.com/astral-sh/ty and pyspark==4.1.2.
