Conversation
…reating issue for unknown data
antarcticrainforest
left a comment
There was a problem hiding this comment.
Looks really good.
A couple for general comments.
- There are some formats that aren't covered yet. It's some weird json reference files but I'd have to investigate myself.
- I think the name
freva-xarrayor enginefrevais to loaded. How about changing the name of this.
how about xarray-prism / prism, or xarray-switchboard, switchboard. Or xarray-gateway/gateway? Just some thoughts
README.md
Outdated
| > If you deal with a data that `freva` engine is not able to open that, please | ||
| > report the data [here](https://github.com/freva-org/freva-xarray/issues/new) | ||
| > to let us improve this engine to be able to be versitile and work with all | ||
| > sort of climate data. |
There was a problem hiding this comment.
| > If you deal with a data that `freva` engine is not able to open that, please | |
| > report the data [here](https://github.com/freva-org/freva-xarray/issues/new) | |
| > to let us improve this engine to be able to be versitile and work with all | |
| > sort of climate data. | |
| > If you encounter with a data formats that `freva` engine is not able to open, please | |
| > files an issue report [here](https://github.com/freva-org/freva-xarray/issues/new). | |
| > This helps us to improve the engine enabling users work with different kinds of climate data. |
src/freva_xarray/backends/cloud.py
Outdated
| extra_lines = 0 | ||
| if show_progress: | ||
| fmt = "GRIB" if engine == "cfgrib" else "NetCDF3" | ||
| print(f"[warning] Remote {fmt} requires full file download") |
src/freva_xarray/entrypoint.py
Outdated
| ) -> xr.Dataset: | ||
| """Xarray Generic function: Open dataset with | ||
| automatic format detection.""" | ||
| if not isinstance(filename_or_obj, (str, Path)): |
There was a problem hiding this comment.
os.PathLike? This covers also fsspec stuff.
src/freva_xarray/entrypoint.py
Outdated
| lines_printed = 0 | ||
|
|
||
| if is_remote: | ||
| sys.stdout.write("[info] Detecting format...") |
src/freva_xarray/utils.py
Outdated
| self._render() | ||
|
|
||
| def _render(self) -> None: | ||
| mb = self._current / 1024 / 1024 |
There was a problem hiding this comment.
| mb = self._current / 1024 / 1024 | |
| mb = self._current / 1024**2 |
|
@mannreis could you take a look at this and advertise if needed. This seems to be really nice!!! |
mannreis
left a comment
There was a problem hiding this comment.
Really good work and Mo! I left my 2cents. My only concern is regarding caching the remote data which may not be, ever, an issue.
src/freva_xarray/utils.py
Outdated
| pct = min(self._current / self._total, 1.0) | ||
| filled = int(self.width * pct) | ||
| bar = "█" * filled + "░" * (self.width - filled) | ||
| total_mb = self._total / 1024 / 1024 |
There was a problem hiding this comment.
based on Martin's suggestion:
| total_mb = self._total / 1024 / 1024 | |
| total_mb = self._total / 1024**2 |
src/freva_xarray/_detection.py
Outdated
| return None | ||
|
|
||
|
|
||
| def _detect_from_magic_bytes(header: bytes, lower_path: str) -> str: |
There was a problem hiding this comment.
Since typing is around is something like this an improvement? Or returning "unknown" defeats the effort?
| def _detect_from_magic_bytes(header: bytes, lower_path: str) -> str: | |
| Engine = Literal["cfgrib", "scipy", "h5netcdf", "rasterio", "unknown"] | |
| def _detect_from_magic_bytes(header: bytes, lower_path: str) -> Engine: |
I guess it would also have to be propagated up the call stack and for that reason inconvenient
|
|
||
| # GRIB: cache locally | ||
| if engine == "cfgrib": | ||
| local_path = _cache_remote_file( |
There was a problem hiding this comment.
Caching is good but I'm not sure whether is better to enable it explicitly or allow to disable.
Regardless I would not do it on /tmp for these applications which I believe is the default in many systems.
There was a problem hiding this comment.
Thanks for the review. Regarding caching, I already though quite a bit about this, and for this purpose, we designed the FREVA_XARRAY_CACHE environment variable in place to user be able to change the cache location. Or user can also adjust it on storage_options argument of open_dataset function of xarray. Also when the user opening the remote files, progress logging appears to tell the user, it's storing data on /tmp and user sees the progress. It means users can cancel the procedure when it's downloading the data by watching the logs.
Also main purpose of designing the adjustable FREVA_XARRAY_CACHE was for our own data-loader. We wanted to adjust the cache on deployment to /scratch to don't run into storage limit.
Apart from those, we really don't have many other options for opening the grib and netcdf3 data. Because of the structure of those formats, we need to download the full data and then open it. So in short, /tmp was the safest way we could come up with. If there is any more secure suggestion, please let us know.
I first didn't mind the name but then after seeing the structure of xarray backends, I see Martin's point. Unless I misunderstood something this could even be a contribution upstream to xarray, thus freva is not involved. |
|
@antarcticrainforest thanks for the reviw. regarding this point:
could you please give a link or example of this data to dig more for making it compatible with this? |
In this PR we are going to introduce the xarray-prism as
prismengine for opening all available climate data including remote and posix data.How to install:
examples for remote data: