Introduce spam_checker_spammy internal event metadata.#19453
Introduce spam_checker_spammy internal event metadata.#19453reivilibre merged 12 commits intodevelopfrom
spam_checker_spammy internal event metadata.#19453Conversation
|
|
| PolicyServerSpammy(bool), | ||
| SpamCheckerSpammy(bool), |
There was a problem hiding this comment.
Should this be an enum which would allow future adaptions like this? It makes sense that these could be both true so that may not apply in any case. I'm guessing we have some early-returns if either things marks the event as spammy though?
There was a problem hiding this comment.
An enum isn't a bad idea actually, but I can't think of a pleasant way of doing that migration, which leads me to think we might be better off keeping the 2 bools but also addressing the 'early-returns' point you make.
There was a problem hiding this comment.
Early-return removed at be8c05b so we evaluate both attributes
| PolicyServerSpammy(bool), | ||
| SpamCheckerSpammy(bool), |
There was a problem hiding this comment.
Is "spam checker" only for the module API?
There was a problem hiding this comment.
Yes, or really there is also the pre-module-API 'spam checker API' but it got superseded by the module API, I think we have a compatibility shim.
| # Mark this as spam so we don't re-evaluate soft-failure status. | ||
| redacted_event.internal_metadata.spam_checker_spammy = True |
There was a problem hiding this comment.
Based on how this was used in #18968, perhaps insert_sticky_events_txn should be updated to instead skip any soft-failed events 🤔
Especially as the reasoning there is "Skipping the insertion of these types of 'invalid' events is useful for performance reasons because they would fill up the table yet we wouldn't show them to clients anyway." and soft_failed covers the behavior of showing events to clients.
There was a problem hiding this comment.
Ahh, this comment below clarifies why we distinguish this:
Note: Soft-failed sticky events ARE inserted, as their soft-failed status could be re-evaluated later.
| {self.remote_user_join_event.event_id}, | ||
| ) | ||
|
|
||
| def test_federated_events_with_spam_checker_metadata(self) -> None: |
There was a problem hiding this comment.
It would be nice to add a in-repo Complement test for this sort of thing. Only hiccup would be setting up/configuring up the spam checker module but we could always configure one and have some specific constant that triggers this behavior SPAM_CHECKER_SPAM
There was a problem hiding this comment.
I'm not sure how much it fits; I feel the trial test is adequate and is nicer to develop on.
The spam_checker_spammy field is also internal so wouldn't be visible externally.
Am I missing something really beneficial that would be covered?
There was a problem hiding this comment.
To make it more clear, I think the trial test is sufficient 👍
It was more an ideal. One benefit would be avoiding the room version details (like this problem). We could also avoid the internal details of manually puppeting federation here.
In terms of checking, we would probably check via /sync (send a sentinel event after the spammy one) and/or admin API's if the spam_checker_spammy value was actually important.
…ta` overrides Extra type annotations needed because `**kwargs` now has a wider type
07b866c to
71137f6
Compare
| token = self.login(user_id, "pass") | ||
| room_id = self.helper.create_room_as(user_id, tok=token) | ||
|
|
||
| start_id = self.store.get_max_sticky_events_stream_id() |
There was a problem hiding this comment.
Not exactly a problem of this PR specifically but the name get_max_sticky_events_stream_id() feels a bit off.
get_current_sticky_events_stream_id seems more appropriate. I see how it got the name "max" because that's what the underlying get_current_token() says but then it gets confusing when we also have get_max_allocated_token()
Recently dealt with this in #19558 / #19677 and ended up with
synapse/synapse/storage/databases/main/room.py
Lines 1305 to 1319 in 62523d8
There was a problem hiding this comment.
The reason this name got introduced is because this is what account_data does, which I was using as a reference for thread subscriptions (iirc).
And of course since documenting the 'how to add a stream' cheatsheet, everyone has been cribbing off that, copying the pattern further.
You are very correct though, in a multi-writer stream, max_stream_id (or thereabouts) sounds like it would be the maximum position any writer had persisted.
Whereas the current one actually sounds like it would make more sense being called 'min', based on this docstring of the underlying function:
def get_persisted_upto_position(self) -> int:
"""Get the max position where all previous positions have been
persisted.
Note: In the worst case scenario this will be equal to the minimum # <--------------------
position across writers. This means that the returned position here can
lag if one writer doesn't write very often.
"""I do prefer the 'current' idea and I think I will update all streams to match that (in a subsequent PR of course)
There was a problem hiding this comment.
We have even more variants, have asked internally to give people a chance to provide an opinion on whether any one of them has a better reason to be standard than the others.
stream_tokene.g.get_to_device_stream_tokencurrent_keye.g.self.sources.account_data.get_current_key()(TIL about this variant)max_stream_ide.g.get_max_push_rules_stream_idcurrent_stream_ide.g.get_current_quarantined_media_stream_id
It might still make sense to exclude the events stream from the discussion/any potential change, since we have both a min and max there and therefore it's hard to drop the 'max' name.... (although we call those the room_min and room_max even though it's about the events stream :S)
After people have a chance to have some input I will go through and try to unify these
Follows: #19365
Part of: MSC4354 Sticky Events (experimental feature #19409)
This PR introduces a
spam_checker_spammyflag, analogous topolicy_server_spammy, as an explicit flagthat an event was decided to be spammy by a spam-checker module.
The original Sticky Events PR (#18968) just reused
policy_server_spammy, but it didn't sit right with mebecause we (at least appear to be experimenting with features that) allow users to opt-in to seeing
policy_server_spammyevents (presumably for moderation purposes).Keeping these flags separate felt best, therefore.
As for why we need this flag: soon soft-failed status won't be permanent, at least for sticky events.
The spam checker modules currently work by making events soft-failed.
We want to prevent spammy events from getting reconsidered/un-soft-failed, so it seems like we need
a flag to track spam-checker spamminess separately from soft-failed.
Should be commit-by-commit friendly, but is also small.