Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -5,6 +5,7 @@
### Fixes

- Fixed multi-workspace desktop/API sync scoping, fail-closed workspace authentication, workspace-qualified purges, and explicit watch workspace selection. Thanks @zm2231.
- Prevented unchanged desktop refreshes from duplicating message events and added preview-first retained-history compaction with `purge --keep-message-events`. Thanks @barbieri.

## 0.7.3 - 2026-06-19

2 changes: 1 addition & 1 deletion README.md
@@ -193,7 +193,7 @@ Choose the path that matches your setup:
- `update` pulls and imports the latest git snapshot, or restores a historical tag/ref without moving the share checkout
- `sync` performs a one-shot crawl from bot/API, MCP connector, wiretap/desktop, or both
- `import` imports a Slack export ZIP or extracted export directory
- `purge` previews or deletes messages and message-owned records older than a cutoff
- `purge` previews or deletes messages and message-owned records older than a cutoff, with optional retained-event compaction
- `tail` listens for live events through Socket Mode, including one tail per configured workspace
- `watch` refreshes desktop-local state on a schedule, optionally scoped with `--workspace <id>`
- `search` runs safe local text search with FTS and substring fallback, optionally filtered by workspace
2 changes: 2 additions & 0 deletions SPEC.md
@@ -192,6 +192,7 @@ Expected flags:
- optional `--workspace <id>`
- `--force` to execute; omission is a preview
- `--keep-media` to retain cached media no longer referenced by stored messages
- optional `--keep-message-events <n>` to retain the newest events per message, event type, and source
- `--vacuum` to compact SQLite after deletion

Behavior:
@@ -203,6 +204,7 @@ Behavior:
- preserve workspaces, channels, users, and sync state
- record per-channel retention floors so incremental API/MCP repair overlap does not restore purged history
- delete only cached media paths with no remaining database references
- preview and compact retained event history only when `--keep-message-events` is provided
- do not compact the SQLite file unless `--vacuum` is set

### `status`
2 changes: 1 addition & 1 deletion docs/desktop-mode.md
@@ -87,7 +87,7 @@ Use `watch` to keep refreshing the DB from local desktop state:
slacrawl watch --desktop-every 5m
```

This loop does not truncate the database. It repeatedly upserts and appends event history so the local DB stays current as Slack Desktop changes. It refreshes every workspace in the signed-in desktop profile by default; pass `--workspace T01234567` to restrict each refresh to one workspace.
This loop does not truncate the database. It repeatedly upserts desktop state and appends event history when a message snapshot changes, so unchanged refreshes do not amplify the archive. It refreshes every workspace in the signed-in desktop profile by default; pass `--workspace T01234567` to restrict each refresh to one workspace.

## Validation Commands

17 changes: 17 additions & 0 deletions docs/retention.md
@@ -43,6 +43,19 @@ Pass `--force` to execute:
slacrawl purge --workspace T01234567 --older-than 90d --force
```

Retained messages can also accumulate historical snapshots. Add
`--keep-message-events N` to preview keeping only the newest `N` events for
each message, event type, and source:

```bash
slacrawl --json purge --older-than 90d --keep-message-events 5
slacrawl purge --older-than 90d --keep-message-events 5 --force --vacuum
```

The preview reports `compacted_message_events` without changing the database.
Compaction never removes the canonical current row in `messages`, does not mix
event sources or event types, and follows `--workspace` when supplied.
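
As an illustration of the grouping rule described above, here is a minimal in-memory sketch in Go. It is not the project's implementation, which runs against SQLite. The `Event` type, the `compactEvents` function, and the integer `CreatedAt` field are hypothetical stand-ins chosen for this example only.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical, simplified stand-in for one message_events row.
type Event struct {
	ChannelID  string // ChannelID and TS together identify the message
	TS         string
	EventType  string
	SourceName string
	CreatedAt  int64 // assumed ordering key; larger means newer
}

// groupKey is the unit that compaction works on: one message, one event
// type, one source. Events in different groups are never mixed.
type groupKey struct{ channelID, ts, eventType, sourceName string }

// compactEvents keeps the newest `keep` events in each group and reports how
// many would be dropped. A non-positive keep leaves the input untouched,
// mirroring the rule that compaction only happens when the flag is provided.
func compactEvents(events []Event, keep int) (kept []Event, compacted int) {
	if keep <= 0 {
		return events, 0
	}
	groups := map[groupKey][]Event{}
	var order []groupKey // first-seen order, so the result is deterministic
	for _, e := range events {
		k := groupKey{e.ChannelID, e.TS, e.EventType, e.SourceName}
		if _, seen := groups[k]; !seen {
			order = append(order, k)
		}
		groups[k] = append(groups[k], e)
	}
	for _, k := range order {
		g := groups[k]
		// Newest first within the group.
		sort.SliceStable(g, func(i, j int) bool { return g[i].CreatedAt > g[j].CreatedAt })
		if len(g) > keep {
			compacted += len(g) - keep
			g = g[:keep]
		}
		kept = append(kept, g...)
	}
	return kept, compacted
}

func main() {
	var events []Event
	// Five snapshots of one message from the same source and type.
	for i := int64(0); i < 5; i++ {
		events = append(events, Event{"C1", "1717977600.000100", "message", "desktop-indexeddb", i})
	}
	// One event for the same message from a different source: its own group.
	events = append(events, Event{"C1", "1717977600.000100", "message", "api", 0})

	kept, compacted := compactEvents(events, 2)
	fmt.Println(len(kept), compacted) // prints "3 3"
}
```

A preview corresponds to reading only the second return value, and an executed compaction corresponds to persisting the first.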

The SQLite transaction deletes:

- messages
@@ -52,6 +65,10 @@ The SQLite transaction deletes:
- embedding jobs
- FTS entries

Consecutive identical message snapshots are suppressed during normal ingest;
real transitions, including a value changing and later reverting, remain in
event history.

Workspaces, channels, users, and sync state remain. Executed purges also record
a per-channel retention floor so ordinary incremental API and MCP syncs do not
restore deleted history through their repair overlap. New replies to expired
2 changes: 2 additions & 0 deletions internal/cli/app_test.go
@@ -710,6 +710,7 @@ func TestCompletionBashOutput(t *testing.T) {
require.Contains(t, out, "--ref")
require.Contains(t, out, "--max-bytes")
require.Contains(t, out, "--desktop-every --workspace")
require.Contains(t, out, "--keep-message-events")
}

func TestCompletionZshOutput(t *testing.T) {
@@ -736,6 +737,7 @@ func TestCompletionZshOutput(t *testing.T) {
require.Contains(t, out, "connector")
require.Contains(t, out, "purge")
require.Contains(t, out, "--keep-media")
require.Contains(t, out, "--keep-message-events[events retained per message, source, and type]")
require.Contains(t, out, "--tag[immutable snapshot tag]")
require.Contains(t, out, "--ref[historical Git ref to import]")
require.Contains(t, out, "--workspace[workspace id]")
4 changes: 2 additions & 2 deletions internal/cli/completion.go
@@ -182,7 +182,7 @@ _slacrawl()
COMPREPLY=( $(compgen -W "--workspace --dry-run --force --format --help -h ${global_flags}" -- "${cur}") )
;;
purge)
COMPREPLY=( $(compgen -W "--before --older-than --workspace --force --keep-media --vacuum --help -h ${global_flags}" -- "${cur}") )
COMPREPLY=( $(compgen -W "--before --older-than --workspace --force --keep-media --keep-message-events --vacuum --help -h ${global_flags}" -- "${cur}") )
;;
tail)
COMPREPLY=( $(compgen -W "--workspace --repair-every --help -h ${global_flags}" -- "${cur}") )
@@ -320,7 +320,7 @@ _slacrawl() {
_arguments '--workspace[workspace id]:workspace id:' '--dry-run[walk and count without writing]' '--force[overwrite existing slack-export rows at the same rank]' '--format[output format]:format:(text json log)'
;;
purge)
_arguments '--before[absolute cutoff]:time:' '--older-than[relative cutoff]:duration:' '--workspace[workspace id]:workspace id:' '--force[execute deletion]' '--keep-media[retain unreferenced cached media]' '--vacuum[compact database after deletion]'
_arguments '--before[absolute cutoff]:time:' '--older-than[relative cutoff]:duration:' '--workspace[workspace id]:workspace id:' '--force[execute deletion]' '--keep-media[retain unreferenced cached media]' '--keep-message-events[events retained per message, source, and type]:count:' '--vacuum[compact database after deletion]'
;;
tail)
_arguments '--workspace[workspace id]:workspace id:' '--repair-every[repair interval]:duration:'
93 changes: 55 additions & 38 deletions internal/cli/purge.go
@@ -15,23 +15,25 @@ import (
)

type purgeResponse struct {
Cutoff time.Time `json:"cutoff"`
WorkspaceID string `json:"workspace_id,omitempty"`
DryRun bool `json:"dry_run"`
Messages int64 `json:"messages"`
MessageEvents int64 `json:"message_events"`
MessageFiles int64 `json:"message_files"`
Mentions int64 `json:"mentions"`
EmbeddingJobs int64 `json:"embedding_jobs"`
FTSEntries int64 `json:"fts_entries"`
CachedMediaFiles int64 `json:"cached_media_files"`
CachedMediaBytes int64 `json:"cached_media_bytes"`
CachedMediaDeleted int64 `json:"cached_media_deleted"`
CachedMediaMissing int64 `json:"cached_media_missing"`
CachedMediaRetained int64 `json:"cached_media_retained"`
CachedMediaFailures []string `json:"cached_media_failures,omitempty"`
KeepMedia bool `json:"keep_media"`
Vacuumed bool `json:"vacuumed"`
Cutoff time.Time `json:"cutoff"`
WorkspaceID string `json:"workspace_id,omitempty"`
DryRun bool `json:"dry_run"`
Messages int64 `json:"messages"`
MessageEvents int64 `json:"message_events"`
CompactedMessageEvents int64 `json:"compacted_message_events"`
KeepMessageEvents int `json:"keep_message_events,omitempty"`
MessageFiles int64 `json:"message_files"`
Mentions int64 `json:"mentions"`
EmbeddingJobs int64 `json:"embedding_jobs"`
FTSEntries int64 `json:"fts_entries"`
CachedMediaFiles int64 `json:"cached_media_files"`
CachedMediaBytes int64 `json:"cached_media_bytes"`
CachedMediaDeleted int64 `json:"cached_media_deleted"`
CachedMediaMissing int64 `json:"cached_media_missing"`
CachedMediaRetained int64 `json:"cached_media_retained"`
CachedMediaFailures []string `json:"cached_media_failures,omitempty"`
KeepMedia bool `json:"keep_media"`
Vacuumed bool `json:"vacuumed"`
}

func (a *App) runPurge(ctx context.Context, configPath string, args []string, format OutputFormat) error {
@@ -46,6 +48,7 @@ func (a *App) runPurge(ctx context.Context, configPath string, args []string, fo
workspaceID := fs.String("workspace", "", "limit purge to workspace id")
force := fs.Bool("force", false, "execute deletion instead of previewing")
keepMedia := fs.Bool("keep-media", false, "retain unreferenced cached media files")
keepMessageEvents := fs.Int("keep-message-events", 0, "keep newest events per message, source, and type")
vacuum := fs.Bool("vacuum", false, "compact the SQLite database after deletion")
if err := fs.Parse(args); err != nil {
return err
@@ -60,15 +63,22 @@ func (a *App) runPurge(ctx context.Context, configPath string, args []string, fo
return errors.New("--vacuum requires --force")
}
workspaceSet := false
keepMessageEventsSet := false
fs.Visit(func(item *flag.Flag) {
if item.Name == "workspace" {
switch item.Name {
case "workspace":
workspaceSet = true
case "keep-message-events":
keepMessageEventsSet = true
}
})
workspace := strings.TrimSpace(*workspaceID)
if workspaceSet && workspace == "" {
return errors.New("--workspace cannot be empty")
}
if keepMessageEventsSet && *keepMessageEvents <= 0 {
return errors.New("--keep-message-events must be greater than zero")
}

now := a.nowUTC()
cutoff, err := resolvePurgeCutoff(now, *before, *olderThan)
@@ -90,9 +100,10 @@ func (a *App) runPurge(ctx context.Context, configPath string, args []string, fo
defer func() { _ = st.Close() }()

opts := store.PurgeOptions{
Before: cutoff,
WorkspaceID: workspace,
Delete: *force,
Before: cutoff,
WorkspaceID: workspace,
Delete: *force,
KeepMessageEvents: *keepMessageEvents,
}
var report store.PurgeReport
var mediaDeleted, mediaMissing, mediaRetained int64
@@ -103,8 +114,9 @@ func (a *App) runPurge(ctx context.Context, configPath string, args []string, fo
if *force && !*keepMedia {
if strings.TrimSpace(cfg.CacheDir) == "" {
preview, err := st.PurgeMessages(ctx, store.PurgeOptions{
Before: cutoff,
WorkspaceID: opts.WorkspaceID,
Before: cutoff,
WorkspaceID: opts.WorkspaceID,
KeepMessageEvents: opts.KeepMessageEvents,
})
if err != nil {
return err
@@ -121,8 +133,9 @@ func (a *App) runPurge(ctx context.Context, configPath string, args []string, fo
} else {
err := media.WithCacheLock(ctx, cfg.CacheDir, func() error {
preview, err := st.PurgeMessages(ctx, store.PurgeOptions{
Before: cutoff,
WorkspaceID: opts.WorkspaceID,
Before: cutoff,
WorkspaceID: opts.WorkspaceID,
KeepMessageEvents: opts.KeepMessageEvents,
})
if err != nil {
return err
@@ -190,7 +203,7 @@ func (a *App) runPurge(ctx context.Context, configPath string, args []string, fo
}
}
}
response := purgeResponseFromStore(cutoff, opts.WorkspaceID, !*force, *keepMedia, report)
response := purgeResponseFromStore(cutoff, opts.WorkspaceID, !*force, *keepMedia, opts.KeepMessageEvents, report)
response.CachedMediaDeleted = mediaDeleted
response.CachedMediaMissing = mediaMissing
response.CachedMediaRetained = mediaRetained
@@ -279,19 +292,21 @@ func resolvePurgeCutoff(now time.Time, before, olderThan string) (time.Time, err
return now.Add(-duration), nil
}

func purgeResponseFromStore(cutoff time.Time, workspaceID string, dryRun, keepMedia bool, report store.PurgeReport) purgeResponse {
func purgeResponseFromStore(cutoff time.Time, workspaceID string, dryRun, keepMedia bool, keepMessageEvents int, report store.PurgeReport) purgeResponse {
response := purgeResponse{
Cutoff: cutoff,
WorkspaceID: workspaceID,
DryRun: dryRun,
Messages: report.Messages,
MessageEvents: report.MessageEvents,
MessageFiles: report.MessageFiles,
Mentions: report.Mentions,
EmbeddingJobs: report.EmbeddingJobs,
FTSEntries: report.FTSEntries,
CachedMediaFiles: int64(len(report.Media)),
KeepMedia: keepMedia,
Cutoff: cutoff,
WorkspaceID: workspaceID,
DryRun: dryRun,
Messages: report.Messages,
MessageEvents: report.MessageEvents,
CompactedMessageEvents: report.CompactedMessageEvents,
KeepMessageEvents: keepMessageEvents,
MessageFiles: report.MessageFiles,
Mentions: report.Mentions,
EmbeddingJobs: report.EmbeddingJobs,
FTSEntries: report.FTSEntries,
CachedMediaFiles: int64(len(report.Media)),
KeepMedia: keepMedia,
}
for _, item := range report.Media {
response.CachedMediaBytes += item.Size
@@ -312,6 +327,8 @@ Flags:
-workspace string limit purge to one workspace
-force execute deletion
-keep-media retain unreferenced cached media files
-keep-message-events int
keep newest events per message, source, and type
-vacuum compact the database after deletion; requires --force
`)
}
60 changes: 59 additions & 1 deletion internal/cli/purge_test.go
@@ -120,6 +120,60 @@ func TestPurgeCommandValidatesSafetyFlags(t *testing.T) {
require.ErrorContains(t, err, "future")
err = app.Run(context.Background(), []string{"purge", "--before", "2026-01-01", "--workspace", " "})
require.ErrorContains(t, err, "--workspace cannot be empty")
err = app.Run(context.Background(), []string{"purge", "--before", "2026-01-01", "--keep-message-events", "0"})
require.ErrorContains(t, err, "--keep-message-events must be greater than zero")
}

func TestPurgeCommandPreviewsAndCompactsRetainedMessageEvents(t *testing.T) {
dir := t.TempDir()
configPath := filepath.Join(dir, "config.toml")
dbPath := filepath.Join(dir, "slacrawl.db")
cfg := config.Default()
cfg.DBPath = dbPath
cfg.CacheDir = ""
require.NoError(t, cfg.Save(configPath))

now := time.Date(2026, 6, 10, 12, 0, 0, 0, time.UTC)
messageTime := now.Add(-24 * time.Hour)
st, err := store.Open(dbPath)
require.NoError(t, err)
require.NoError(t, st.UpsertMessage(context.Background(), store.Message{
WorkspaceID: "T1", ChannelID: "C1", TS: purgeTestSlackTS(messageTime),
Text: "retained", NormalizedText: "retained", SourceRank: 3,
SourceName: "desktop-indexeddb", RawJSON: `{"version":0}`, UpdatedAt: messageTime,
}, nil))
for i := 1; i <= 4; i++ {
_, err := st.DB().ExecContext(context.Background(), `
insert into message_events (channel_id, ts, event_type, source_name, payload_json, created_at)
values ('C1', ?, 'message', 'desktop-indexeddb', ?, ?)
`, purgeTestSlackTS(messageTime), fmt.Sprintf(`{"version":%d}`, i), messageTime.Add(time.Duration(i)*time.Second).Format(time.RFC3339))
require.NoError(t, err)
}
require.NoError(t, st.Close())

var stdout bytes.Buffer
app := &App{Stdout: &stdout, Stderr: &stdout, now: func() time.Time { return now }}
args := []string{
"--config", configPath, "--json", "purge", "--before", "2026-01-01",
"--keep-message-events", "2", "--keep-media",
}
require.NoError(t, app.Run(context.Background(), args))
var preview purgeResponse
require.NoError(t, json.Unmarshal(stdout.Bytes(), &preview))
require.True(t, preview.DryRun)
require.Equal(t, 2, preview.KeepMessageEvents)
require.Equal(t, int64(3), preview.CompactedMessageEvents)
require.Equal(t, int64(5), purgeTestTableCount(t, dbPath, "message_events"))

stdout.Reset()
args = append(args, "--force")
require.NoError(t, app.Run(context.Background(), args))
var executed purgeResponse
require.NoError(t, json.Unmarshal(stdout.Bytes(), &executed))
require.False(t, executed.DryRun)
require.Equal(t, int64(3), executed.CompactedMessageEvents)
require.Equal(t, int64(2), purgeTestTableCount(t, dbPath, "message_events"))
require.Equal(t, int64(1), purgeTestMessageCount(t, dbPath))
}

func TestPurgeCommandKeepMedia(t *testing.T) {
@@ -368,11 +422,15 @@ func TestRemovePurgeMediaContinuesAfterFailure(t *testing.T) {
}

func purgeTestMessageCount(t *testing.T, dbPath string) int64 {
return purgeTestTableCount(t, dbPath, "messages")
}

func purgeTestTableCount(t *testing.T, dbPath, table string) int64 {
t.Helper()
st, err := store.Open(dbPath)
require.NoError(t, err)
defer func() { require.NoError(t, st.Close()) }()
rows, err := st.QueryReadOnly(context.Background(), "select count(*) as n from messages")
rows, err := st.QueryReadOnly(context.Background(), "select count(*) as n from "+table)
require.NoError(t, err)
return rows[0]["n"].(int64)
}
3 changes: 3 additions & 0 deletions internal/share/share.go
@@ -246,6 +246,9 @@ func importLocked(ctx context.Context, s *store.Store, opts Options) (Manifest,
if _, err := tx.ExecContext(ctx, `delete from message_fts`); err != nil {
return Manifest{}, fmt.Errorf("clear message_fts: %w", err)
}
if _, err := tx.ExecContext(ctx, `delete from message_event_heads`); err != nil {

P2: Seed event heads after share imports

When a config combines git-share auto-update with live sync, sync runs autoUpdateShare before runSyncTargets (internal/cli/app.go:567-572). This import clears message_event_heads, then imports messages and message_events, but never rebuilds the derived heads. The following sync therefore hits sql.ErrNoRows in appendMessageEvent and appends a duplicate, unchanged event for every message it touches, so repeated share updates can re-amplify the history this change is meant to bound. Seed the heads after import, or include them in the snapshot, before returning.
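
To make the failure mode concrete, here is a small in-memory model in Go. It is only a sketch: `EventLog`, `streamKey`, `Append`, `ClearHeads`, and `SeedHeads` are hypothetical names invented for illustration, and the real store keeps heads in the message_event_heads table rather than in a map. The assumption modeled here is the one stated above: a missing head is treated as "no previous event", so the next append goes through even when the payload is unchanged.

```go
package main

import "fmt"

// streamKey identifies one event stream: a message plus event type and source.
type streamKey struct{ channelID, ts, eventType, sourceName string }

type storedEvent struct {
	key     streamKey
	payload string
}

// EventLog is a hypothetical model: events is the append-only history and
// heads caches the latest payload per stream (a derived index).
type EventLog struct {
	events []storedEvent
	heads  map[streamKey]string
}

func NewEventLog() *EventLog {
	return &EventLog{heads: map[streamKey]string{}}
}

// Append records payload only when it differs from the stream's head and
// reports whether a new event was written. A missing head always appends.
func (l *EventLog) Append(k streamKey, payload string) bool {
	if head, ok := l.heads[k]; ok && head == payload {
		return false // consecutive identical snapshot: suppressed
	}
	l.events = append(l.events, storedEvent{k, payload})
	l.heads[k] = payload
	return true
}

// ClearHeads models an import that wipes the derived heads without rebuilding.
func (l *EventLog) ClearHeads() { l.heads = map[streamKey]string{} }

// SeedHeads rebuilds heads from the stored history; the last event per
// stream wins because events are kept in append order.
func (l *EventLog) SeedHeads() {
	l.heads = make(map[streamKey]string, len(l.events))
	for _, e := range l.events {
		l.heads[e.key] = e.payload
	}
}

func main() {
	k := streamKey{"C1", "1717977600.000100", "message", "desktop-indexeddb"}
	l := NewEventLog()
	l.Append(k, "v1")
	l.Append(k, "v1") // suppressed: unchanged
	l.Append(k, "v2")
	l.Append(k, "v1")          // kept: a revert is a real transition
	fmt.Println(len(l.events)) // prints 3

	l.ClearHeads()
	fmt.Println(l.Append(k, "v1")) // prints true: duplicate written, head was missing

	l.SeedHeads()
	fmt.Println(l.Append(k, "v1")) // prints false: head restored, duplicate suppressed
}
```

In this model, calling the equivalent of `SeedHeads` at the end of the import is what keeps the next sync from appending one duplicate per touched message.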


return Manifest{}, fmt.Errorf("clear message event heads: %w", err)
}
for i := len(SnapshotTables) - 1; i >= 0; i-- {
table := SnapshotTables[i]
if _, err := tx.ExecContext(ctx, "delete from "+quoteIdent(table)); err != nil { //nolint:gosec // Snapshot table names are quoted identifiers from the fixed schema list.
6 changes: 4 additions & 2 deletions internal/share/share_test.go
@@ -33,8 +33,7 @@ func TestExportImportRoundTrip(t *testing.T) {
require.NotEmpty(t, manifest.Tables)
require.FileExists(t, filepath.Join(opts.RepoPath, ManifestName))

reader, err := store.Open(filepath.Join(dir, "reader.db"))
require.NoError(t, err)
reader := seedStore(t, filepath.Join(dir, "reader.db"))
defer func() { require.NoError(t, reader.Close()) }()

imported, err := Import(ctx, reader, opts)
@@ -45,6 +44,9 @@ func TestExportImportRoundTrip(t *testing.T) {
require.NoError(t, err)
require.Len(t, rows, 1)
require.Equal(t, "git backed archive works", rows[0].Text)
heads, err := reader.QueryReadOnly(ctx, `select count(*) as count from message_event_heads`)
require.NoError(t, err)
require.Equal(t, int64(0), heads[0]["count"])
}

func TestImportRejectsIncompleteManifestBeforeClearing(t *testing.T) {