1 change: 0 additions & 1 deletion .pre-commit-config.yaml
@@ -23,7 +23,6 @@ repos:
- id: fix-byte-order-marker
- id: name-tests-test
args: [ '--pytest-test-first' ]
exclude: ^tests/_duplicates.py$
- id: no-commit-to-branch
args: [ '--branch', 'main' ]
- id: trailing-whitespace
12 changes: 12 additions & 0 deletions CHANGELOG.rst
@@ -20,6 +20,18 @@ Breaking changes
^^^^^^^^^^^^^^^^
* Development dependencies ("dev", "docs") are now installed via the new `dependency-groups` conventions (`PEP 735 <https://peps.python.org/pep-0735/>`_) (:pull:`419`)
* `prek` is now the suggested pre-commit runner (installed by default via `pip install --group dev`) (:pull:`419`)
* Removed submodule ``src.cdm_reader_mapper.duplicates`` (:issue:`152`, :issue:`283`, :pull:`434`)

* ``cdm_reader_mapper.DupDetect`` is not importable anymore
* ``cdm_reader_mapper.duplicate_check`` is not importable anymore
* ``cdm_reader_mapper.DataBundle.duplicate_check`` is not callable anymore
* ``cdm_reader_mapper.DataBundle.get_duplicates`` is not callable anymore
* ``cdm_reader_mapper.DataBundle.flag_duplicates`` is not callable anymore
* ``cdm_reader_mapper.DataBundle.remove_duplicates`` is not callable anymore
* ``cdm_reader_mapper.DataBundle`` does not have attribute ``DupDetect`` anymore

* submodule ``src.cdm_reader_mapper.duplicates`` has been moved to `marine_qc <https://github.com/glamod/marine_qc/pull/207/>`_ (:issue:`283`, :pull:`434`)


Internal changes
^^^^^^^^^^^^^^^^
12 changes: 0 additions & 12 deletions docs/api.rst
@@ -43,9 +43,6 @@ Useful functions
.. autofunction:: cdm_reader_mapper.correct_pt
:noindex:

.. autofunction:: cdm_reader_mapper.duplicate_check
:noindex:

.. autofunction:: cdm_reader_mapper.map_model
:noindex:

@@ -84,12 +81,3 @@ Useful functions

.. autofunction:: cdm_reader_mapper.write_tables
:noindex:

.. _dupdetect:

DupDetect
=========

.. autoclass:: cdm_reader_mapper.DupDetect
:members:
:noindex:
19 changes: 1 addition & 18 deletions docs/getting-started.rst
@@ -41,24 +41,7 @@ In this case deck 704: US Marine Meteorological Journal collection of data code:

cdm_tables = db_cdm.data

4. Detect duplicated observations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Detect and flag duplicated observations without overwriting the original CDM tables:

.. code-block:: console

db_dup = db.duplicate_check()

db_dup_f = db_dup.flag_duplicates()

flagged_tables = db_dup_f.data

db_dup_r = db_dup.remove_duplicates()

removed_tables = db_dup_r.data

5. Write the output
4. Write the output
~~~~~~~~~~~~~~~~~~~
This writes the output to an ascii file with a pipe delimited format using the following function:

2 changes: 0 additions & 2 deletions docs/hyperlinks.rst
@@ -8,8 +8,6 @@

.. _CDM: https://github.com/glamod/common_data_model/blob/master/cdm_latest.pdf

.. _CDM code tables for duplicate_status: https://glamod.github.io/cdm-obs-documentation/tables/code_tables/duplicate_status/duplicate_status.html

.. _CDM code tables for report_quality: https://glamod.github.io/cdm-obs-documentation/tables/code_tables/quality_flag/quality_flag.html

.. _conda: https://docs.conda.io/en/latest/
4 changes: 0 additions & 4 deletions docs/index.rst
@@ -7,7 +7,6 @@ The **cdm_reader_mapper** toolbox is a python3_ tool designed for

* reading original marine-meteorological data files compliant with a user specified data model (:ref:`data-models`) into a Marine Data Format (MDF) file.
* mapping observed meteorological variables and its associated metadata from a data model (:ref:`data-models`) to the C3S CDS Common Data Model (CDM_) format or **imodel** as called in this tool.
* to detect and flag or remove duplicated observations

It was developed with the initial idea of reading data from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS_) stored in the International Maritime Meteorological Archive (IMMA_) data format. It can now also read data from the C-RAID_ Copernicus in situ project.

@@ -34,9 +33,6 @@ The reader allows for basic transformations of the data. This feature includes `
In addition, the **cdm_reader_mapper.DataBundle** object has several main method functions:

* :py:func:`DataBundle.map_model`: map observed variables and its associated metadata from a data model or models combination to the standardized C3S CDS Common Data Model (CDM_) format.
* :py:func:`DataBundle.duplicate_check`: detect duplicated observations
* :py:func:`DataBundle.flag_duplicates`: flag detected duplicated observations
* :py:func:`DataBundle.remove_duplicates`: remove detected duplicated observations
* :py:func:`DataBundle.write`: save both observational MDF files as a comma-separated list and observational standardized CDM tables as pipe-separated lists

.. toctree::
18 changes: 0 additions & 18 deletions docs/tool-overview-databundle.rst
@@ -84,22 +84,4 @@ Now the meteorological data can be maqpped to the Common Data Model (CDM_) using

For more information how the mapping is working, please see :ref:`tool-overview-mapper` and/or :ref:`how-to-register-a-new-data-model-mapping`.

:ref:`dupdetect`
^^^^^^^^^^^^^^^^

After mapping to the CDM format it is useful to check if the CDM tables contain any duplicates. The duplicate checker included in the ``cdm_reader_mapper`` toolbox is based on python record linkage toolkit RecordLinkage_.

The first step is to call the method function :func:`.DataBundle.duplicate_check`. This function scans the CDM tables for any duplicates.

.. code-block:: console

db_dup = db.duplicate_check()

Afterwards there are two options how to deal with the detected duplicates:

1. :func:`.DataBundle.flag_duplicates`
2. :func:`.DataBundle.remove_duplicates`

The first function flags the detected duplicates. For more information about the flags see `CDM code tables for duplicate_status`_ and `CDM code tables for report_quality`_. The second function removes the detected duplicates.

.. include:: hyperlinks.rst
1 change: 0 additions & 1 deletion environment-docs.yml
@@ -57,4 +57,3 @@ dependencies:
- msgpack
- requests
- platformdirs >4.0.0
- recordlinkage >=0.15
1 change: 0 additions & 1 deletion pyproject.toml
@@ -96,7 +96,6 @@ dependencies = [
"pandas>=2.2.0",
"platformdirs >4.0.0",
"pyarrow >=15.0.0",
"recordlinkage >= 0.15",
"requests",
"timezonefinder >6.5.0,<9.0.0",
"xarray >=2023.11.0,!=2024.10.0"
6 changes: 0 additions & 6 deletions src/cdm_reader_mapper/__init__.py
@@ -19,10 +19,6 @@
from .core.reader import read
from .core.writer import write
from .data import test_data
from .duplicates.duplicates import (
DupDetect,
duplicate_check,
)
from .mdf_reader.reader import read_data, read_mdf
from .mdf_reader.writer import write_data
from .metmetpy import (
@@ -35,11 +31,9 @@

__all__ = [
"DataBundle",
"DupDetect",
"cdm_tables",
"correct_datetime",
"correct_pt",
"duplicate_check",
"map_model",
"read",
"read_data",
207 changes: 0 additions & 207 deletions src/cdm_reader_mapper/core/databundle.py
@@ -17,7 +17,6 @@
split_by_index,
)
from cdm_reader_mapper.common.iterators import ParquetStreamReader, is_valid_iterator
from cdm_reader_mapper.duplicates.duplicates import DupDetect, duplicate_check
from cdm_reader_mapper.metmetpy import (
correct_datetime,
correct_pt,
@@ -154,7 +153,6 @@ def __init__(
self._mask: pd.DataFrame | ParquetStreamReader = mask
self._imodel = imodel
self._mode = mode
self.DupDetect: DupDetect | None = None

def __len__(self) -> int:
"""
@@ -1414,208 +1412,3 @@ def write(
mode=mode,
**kwargs,
)

def duplicate_check(self, inplace: bool = False, **kwargs: Any) -> DataBundle | None:
r"""
Duplicate check in :py:attr:`data`.

Parameters
----------
inplace : bool, default: False
If True overwrite :py:attr:`data` in :py:class:`~DataBundle`
else return a copy of :py:class:`~DataBundle` with :py:attr:`data` as CDM tables.
\**kwargs : Any
Additional keyword-arguments for duplicate check.

Returns
-------
:py:class:`~DataBundle` or None
DataBundle containing new :py:class:`~DupDetect` class for further duplicate check methods or None if "inplace=True".

See Also
--------
DataBundle.get_duplicates : Get duplicate matches in `data`.
DataBundle.flag_duplicates : Flag detected duplicates in `data`.
DataBundle.remove_duplicates : Remove detected duplicates in `data`.

Notes
-----
Following columns have to be provided:

* `longitude`
* `latitude`
* `primary_station_id`
* `report_timestamp`
* `station_course`
* `station_speed`

This adds a new class :py:class:`~DupDetect` to :py:class:`~DataBundle`.
This class is necessary for further duplicate check methods.

For more information see :py:func:`duplicate_check`

Examples
--------
>>> db.duplicate_check()
"""
db_ = self._get_db(inplace)
if db_ is None:
return None
if db_._mode == "tables" and "header" in db_._data:
data = db_._data["header"]
else:
data = db_._data
db_.DupDetect = duplicate_check(data, **kwargs)
return self._return_db(db_, inplace)

def flag_duplicates(self, inplace: bool = False, **kwargs: Any) -> DataBundle | None:
r"""
Flag detected duplicates in :py:attr:`data`.

Parameters
----------
inplace : bool, default: False
If True overwrite :py:attr:`data` in :py:class:`~DataBundle`
else return a copy of :py:class:`~DataBundle` with :py:attr:`data` containing flagged duplicates.
\**kwargs : Any
Additional keyword-arguments for flagging duplicates.

Returns
-------
:py:class:`~DataBundle` or None
DataBundle containing duplicate flags in :py:attr:`data` or None if "inplace=True".

Raises
------
RuntimeError
Before flagging duplicates, a duplictate check has to be done, :py:func:`DataBundle.duplicate_check`.

See Also
--------
DataBundle.remove_duplicates : Remove detected duplicates in `data`.
DataBundle.get_duplicates : Get duplicate matches in `data`.
DataBundle.duplicate_check : Duplicate check in `data`.

Notes
-----
For more information see :py:func:`DupDetect.flag_duplicates`

Examples
--------
Flag duplicates without overwriting :py:attr:`data`.

>>> flagged_tables = db.flag_duplicates()

Flag duplicates with overwriting :py:attr:`data`.

>>> db.flag_duplicates(inplace=True)
>>> flagged_tables = db.data
"""
db_ = self._get_db(inplace)
if db_ is None:
return None

if db_.DupDetect is None:
raise RuntimeError("Before flagging duplicates, a duplictate check has to be done: 'db.duplicate_check()'")

db_.DupDetect.flag_duplicates(**kwargs)

if db_._mode == "tables" and "header" in db_._data:
db_._data["header"] = db_.DupDetect.result
else:
db_._data = db_.DupDetect.result
return self._return_db(db_, inplace)

def get_duplicates(self, **kwargs: Any) -> pd.DataFrame:
r"""
Get duplicate matches in :py:attr:`data`.

Parameters
----------
\**kwargs : Any
Additional keyword-arguments used for getting duplicates.

Returns
-------
pd.DataFrame
DataFrame containing duplicate matches.

Raises
------
RuntimeError
Before getting duplicates, a duplictate check has to be done, :py:func:`DataBundle.duplicate_check`.

See Also
--------
DataBundle.remove_duplicates : Remove detected duplicates in `data`.
DataBundle.flag_duplicates : Flag detected duplicates in `data`.
DataBundle.duplicate_check : Duplicate check in `data`.

Notes
-----
For more information see :py:func:`DupDetect.get_duplicates`

Examples
--------
>>> matches = db.get_duplicates()
"""
if self.DupDetect is None:
raise RuntimeError("Before getting duplicates, a duplictate check has to be done: 'db.duplicate_check()'")
return self.DupDetect.get_duplicates(**kwargs)

def remove_duplicates(self, inplace: bool = False, **kwargs: Any) -> DataBundle | None:
r"""
Remove detected duplicates in :py:attr:`data`.

Parameters
----------
inplace : bool, default: False
If True overwrite :py:attr:`data` in :py:class:`~DataBundle`
else return a copy of :py:class:`~DataBundle` with :py:attr:`data` containing no duplicates.
\**kwargs : Any
Additional keyword-arguments used to remove duplicates.

Returns
-------
:py:class:`~DataBundle` or None
DataBundle without duplicated rows or None if "inplace=True".

Raises
------
RuntimeError
Before removing duplicates, a duplictate check has to be done, :py:func:`DataBundle.duplicate_check`.

See Also
--------
DataBundle.flag_duplicates : Flag detected duplicates in `data`.
DataBundle.get_duplicates : Get duplicate matches in `data`.
DataBundle.duplicate_check : Duplicate check in `data`.

Notes
-----
For more information see :py:func:`DupDetect.remove_duplicates`

Examples
--------
Remove duplicates without overwriting :py:attr:`data`.

>>> removed_tables = db.remove_duplicates()

Remove duplicates with overwriting :py:attr:`data`.

>>> db.remove_duplicates(inplace=True)
>>> removed_tables = db.data
"""
db_ = self._get_db(inplace)
if db_ is None:
return None

if db_.DupDetect is None:
raise RuntimeError("Before removing duplicates, a duplictate check has to be done: 'db.duplicate_check()'")

db_.DupDetect.remove_duplicates(**kwargs)
header_ = db_.DupDetect.result
if not isinstance(db_._data, pd.DataFrame):
raise TypeError("data has unsupported type: {type(db_._data)}.")
db_._data = db_._data[db_._data.index.isin(header_.index)]
return self._return_db(db_, inplace)
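
Review note: downstream code that still relies on the duplicate API may want a quick way to confirm which of the removed names a given installation exposes. The following is a minimal sketch, not part of this PR. The helper `find_leftover_names` and the stand-in objects `old_pkg` and `new_pkg` are hypothetical and exist only for illustration; the lists of removed names are copied from the CHANGELOG entry in this change. The instance attribute ``DataBundle.DupDetect`` was set in ``__init__`` and is therefore not visible on the class, so the sketch does not check for it.

```python
from types import SimpleNamespace

# Names removed by this PR, as listed in the CHANGELOG entry.
REMOVED_TOP_LEVEL = ("DupDetect", "duplicate_check")
REMOVED_METHODS = (
    "duplicate_check",
    "get_duplicates",
    "flag_duplicates",
    "remove_duplicates",
)


def find_leftover_names(package):
    """Return the removed names that `package` still exposes.

    `package` is any object with the layout of the top-level
    `cdm_reader_mapper` module (hypothetical helper, illustration only).
    """
    leftovers = [name for name in REMOVED_TOP_LEVEL if hasattr(package, name)]
    bundle = getattr(package, "DataBundle", None)
    if bundle is not None:
        leftovers += [
            f"DataBundle.{name}" for name in REMOVED_METHODS if hasattr(bundle, name)
        ]
    return leftovers


# Stand-ins that mimic the package before and after this PR, so the sketch
# runs without cdm_reader_mapper installed.
class _OldBundle:
    def duplicate_check(self): ...
    def flag_duplicates(self): ...


class _NewBundle:
    def write(self): ...


old_pkg = SimpleNamespace(DataBundle=_OldBundle, DupDetect=object, duplicate_check=print)
new_pkg = SimpleNamespace(DataBundle=_NewBundle)

print(find_leftover_names(old_pkg))
print(find_leftover_names(new_pkg))
```

With a real installation, the same check would be `find_leftover_names(cdm_reader_mapper)` after `import cdm_reader_mapper`; an empty list indicates the installed version already includes this removal.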