2 changes: 1 addition & 1 deletion docs/requirements-docs.txt
@@ -4,4 +4,4 @@ sphinx-autobuild==2025.8.25
 sphinx-copybutton==0.5.2
 sphinx-sitemap==2.9.0
 sphinx-click==6.2.0
-sphinx-autodoc-typehints==3.9.9
+sphinx-autodoc-typehints==3.9.11
14 changes: 7 additions & 7 deletions requirements-dev.txt
@@ -1,9 +1,9 @@
-setuptools==80.9.0
-pytest==8.3.5
+setuptools==82.0.1
+pytest==9.0.2
 respx==0.22.0
 pytest-git==1.8.0
-pytest-env==1.1.5
-pytest-mock==3.14.0
-fiftyone==1.5.2
-datasets==3.6.0
-ultralytics==8.3.237
+pytest-env==1.6.0
+pytest-mock==3.15.1
+fiftyone==1.14.0
+datasets==4.8.4

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify datasets 4.8.4 version and check usage patterns

echo "=== Checking PyPI for datasets 4.8.4 ==="
curl -s https://pypi.org/pypi/datasets/4.8.4/json | jq -r '.info.version // "VERSION NOT FOUND"'

echo -e "\n=== Searching for datasets library usage in codebase ==="
rg -n --type=py -C3 'from datasets import|import datasets' -g '!requirements*.txt'

echo -e "\n=== Checking for common datasets API patterns that may have changed ==="
rg -n --type=py -C2 'datasets\.(load_dataset|Dataset|DatasetDict|load_from_disk)' -g '!requirements*.txt'

echo -e "\n=== Checking for security advisories ==="
gh api graphql -f query='
{
  securityVulnerabilities(first: 10, ecosystem: PIP, package: "datasets") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Repository: DagsHub/client

Length of output: 1454


🏁 Script executed:

#!/bin/bash
# Check the actual usage of datasets import and hf_ds variable

echo "=== Full context of datasets import ==="
sed -n '45,70p' dagshub/data_engine/model/query_result.py

echo -e "\n=== Search for usage of hf_ds variable ==="
rg -n 'hf_ds\.' dagshub/data_engine/model/query_result.py -A2 -B2

echo -e "\n=== Check if datasets is used in actual function implementations ==="
rg -n 'def.*datasets|def.*hf_ds' dagshub/data_engine/model/query_result.py -A10

echo -e "\n=== Check type annotations using hf_ds ==="
rg -n 'hf_ds\[|-> .*hf_ds|: .*hf_ds' dagshub/data_engine/model/query_result.py

echo -e "\n=== Search for datasets usage in test files ==="
find . -name '*test*.py' -type f | head -20 | xargs rg -l 'datasets' 2>/dev/null | head -5

echo -e "\n=== Check requirements.txt for datasets version ==="
cat requirements.txt 2>/dev/null || echo "No requirements.txt found"

Repository: DagsHub/client

Length of output: 1323


🏁 Script executed:

#!/bin/bash
# Check context around line 322 to find method definition and tests

echo "=== Method containing from_pandas usage ==="
sed -n '300,330p' dagshub/data_engine/model/query_result.py

echo -e "\n=== Function signature ==="
rg -n 'def.*from_pandas|def.*\(.*\).*from_pandas' dagshub/data_engine/model/query_result.py -B20 | head -40

echo -e "\n=== Search for test files related to query_result ==="
find . -name '*test*query*result*.py' -o -name '*query*result*test*.py' | head -10

echo -e "\n=== Check if from_pandas is tested ==="
rg -n 'from_pandas' --type=py -l

Repository: DagsHub/client

Length of output: 1412


🏁 Script executed:

#!/bin/bash
# Find the method containing from_pandas by searching backwards

echo "=== Find method definition containing from_pandas (lines 280-325) ==="
sed -n '280,325p' dagshub/data_engine/model/query_result.py

echo -e "\n=== Check test_queryresult.py ==="
wc -l ./tests/data_engine/test_queryresult.py

echo -e "\n=== Content of test file ==="
head -100 ./tests/data_engine/test_queryresult.py

Repository: DagsHub/client

Length of output: 2909


🏁 Script executed:

#!/bin/bash
# Get the full method signature

echo "=== Find method signature ==="
sed -n '270,285p' dagshub/data_engine/model/query_result.py

echo -e "\n=== Search for method def containing as_huggingface or similar ==="
rg -n 'def.*as.*huggingface|def.*hf_dataset' dagshub/data_engine/model/query_result.py -A2

echo -e "\n=== Check if there's a method near line 280 ==="
sed -n '275,282p' dagshub/data_engine/model/query_result.py

Repository: DagsHub/client

Length of output: 1336


🌐 Web query:

datasets library 3.6.0 to 4.8.4 breaking changes from_pandas API

💡 Result:

Between datasets==3.6.0 and datasets==4.8.4, Dataset.from_pandas(...) itself did not meaningfully “break” at the signature/parameter level (it still takes df, features=None, info=None, split=None, preserve_index=None, and the preserve_index=None behavior around RangeIndex is the same). (raw.githubusercontent.com)

The practical breaking changes you’re most likely to hit in code paths involving from_pandas are:

  1. Column access return type changed in v4.0.0
    In datasets>=4.0.0, ds["col"] returns a datasets.Column object (lazy column) rather than the plain Python list many users relied on in 3.x. If your code does list operations directly (e.g., ds["col"].append(...), set(ds["col"]), JSON-serializing it, etc.), you may need to wrap it: list(ds["col"]) (or otherwise materialize it) to get the old behavior. (github.com)

  2. Sequence feature type was replaced (v4.0.0)
    If you pass features= to from_pandas and you used datasets.Sequence(...), datasets>=4.0.0 replaced the Sequence feature type with the new List type (and Sequence becomes a utility that produces List/dict depending on subfeatures). This can break schemas you construct for from_pandas(..., features=...) (especially for nested/list columns) and anything that checks feature types explicitly. (github.com)

  3. IterableDataset variant gained an extra parameter (not breaking, but can surprise)
    IterableDataset.from_pandas(...) in current docs includes num_shards (default 1). This is additive (so not a breaking change), but if you have wrappers that mirror/validate exact call signatures, it can matter. (huggingface.co)
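
To illustrate the first point, here is a minimal sketch of materializing a column before doing list-only operations. It runs without `datasets` installed: the `LazyColumn` class and the dict-based "datasets" are hypothetical stand-ins, and `column_as_list` is an illustrative helper name, not part of the library. The only assumption carried over from the result above is that wrapping the column in `list(...)` restores the 3.x behavior.

```python
from typing import Any, List


class LazyColumn:
    """Hypothetical stand-in for a lazy column: iterable, but not a list."""

    def __init__(self, values):
        self._values = tuple(values)

    def __iter__(self):
        return iter(self._values)


def column_as_list(ds: Any, name: str) -> List[Any]:
    """Materialize ds[name] into a plain list, whatever type indexing returns."""
    return list(ds[name])


# Plain dicts stand in for a dataset so the sketch runs without `datasets`.
old_style = {"label": [0, 1, 1]}              # 3.x-like: indexing yields a list
new_style = {"label": LazyColumn([0, 1, 1])}  # 4.x-like: indexing yields a lazy object

labels = column_as_list(new_style, "label")
labels.append(2)  # list-only operations are safe after materializing
```

Routing column reads through one helper like this keeps call sites identical on both major versions, at the cost of always copying the column into memory.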



Add test coverage for as_hf_dataset() method before upgrading datasets.

The from_pandas() API signature is compatible between 3.6.0 and 4.8.4. However, datasets 4.x introduced breaking changes: column access now returns datasets.Column objects instead of lists, and the Sequence feature type was replaced by List. The current as_hf_dataset() implementation doesn't use either feature directly, but it has no test coverage. Add a test that verifies the method works with 4.8.4 before this dependency update is deployed.
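
As a starting point, a smoke check along these lines could be adapted into a real test. The signature of as_hf_dataset() and its fixtures are not shown in this review, so this sketch only exercises the underlying `Dataset.from_pandas` round trip that the analysis found the method relies on. The sample DataFrame and the function name `from_pandas_smoke_check` are hypothetical, and the check returns `None` (treated as skipped) when `pandas` or `datasets` is not installed.

```python
def from_pandas_smoke_check():
    """Return True/False for the round-trip check, or None if dependencies are missing."""
    try:
        import pandas as pd
        from datasets import Dataset
    except ImportError:
        return None  # treat as "skipped" when pandas/datasets are not installed

    # Hypothetical sample data; the real method's DataFrame columns are not shown here.
    df = pd.DataFrame({"path": ["a.png", "b.png"], "label": [0, 1]})
    ds = Dataset.from_pandas(df)

    # list(...) materializes the column, so the comparison holds whether
    # indexing returns a list (3.x) or a datasets.Column (4.x).
    return (
        ds.num_rows == len(df)
        and set(ds.column_names) == set(df.columns)
        and list(ds["path"]) == ["a.png", "b.png"]
    )


result = from_pandas_smoke_check()
```

A proper test would call as_hf_dataset() on a query result built from the project's existing test fixtures and apply the same assertions to its return value.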

+ultralytics==8.4.33