[extras] add merge_safe as an alternative to merge_auto #14
pabloyoyoista wants to merge 3 commits into master from
Conversation
Yes, a silent override is not optimal. I wonder whether we need a new function for this, or whether the new `safe` argument is sufficient. Also, talking to Carl, we should probably profile this to find the fastest way to check for existing keys (maybe using set.union()?).
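For reference, a few candidate ways to check for overlapping keys could look like the sketch below. Note that dict key views already support set operations directly, so no explicit `set()` copy is needed; the function names here are illustrative, not from the PR:

```python
def overlap_and(a, b):
    # Key views support set intersection directly; no set() copy needed.
    return bool(a.keys() & b.keys())

def overlap_isdisjoint(a, b):
    # isdisjoint() can short-circuit on the first shared key.
    return not a.keys().isdisjoint(b.keys())

def overlap_union(a, b):
    # Via set.union(), as suggested above: if the union is smaller than
    # the combined sizes, at least one key was shared.
    return len(a.keys() | b.keys()) < len(a) + len(b)
```

All three are equivalent in result; only profiling on realistic data would tell which is fastest.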
Force-pushed the branch from 1249164 to 9112272.
|
Sorry it took so long to come back to this.
I thought it would be easier to use with a new merge_safe function than always requiring the `safe` argument.
Do you have any existing profiling tests for this? Otherwise I'm happy to try some performance tests with a recent Python version.
We have really found this safe merge useful and have found no use cases where we want to drop data silently. So I guess the only drawback to ponder is how to make it efficient performance-wise. The biggest effect of changing to the safer merge has been that it has exposed so many bugs in our projects, which has been both scary and humbling.
Force-pushed the branch from cd433ec to 005c2f5.
Sofia has repeatedly expressed dissatisfaction with merge_auto not reproducing data reliably, because the update method of some data types is not well suited to data science. For example, while Counter's update does what one would expect, dict's update may simply override previous values from the iterator without giving any feedback. To let data scientists avoid this issue, we add a new API that ensures the keys being merged do not already exist.
Force-pushed the branch from 005c2f5 to 439c5d8.
We met today and agreed that we want the new behavior to be the default, after some small testing showed that it only introduces an overhead of about 1 second per 31 million rows. Of course, this breaks tests, so I'll fix them; we have a meeting queued in 2 weeks.
We are basically returning a set here, so we don't really care if the key already exists.
This is probably missing tests and some further discussion (like, should we consider the Counter type safe?), so marking it as a draft.