
Novatechflow/siem sec#7

Merged
novatechflow merged 58 commits into main from
novatechflow/siem-sec
Mar 17, 2026

@novatechflow
Member

Summary

This PR is a large end-to-end upgrade of EUOSINT across collector, discovery, source registry, API, UI, Docker runtime, and deployment operations.

Core platform changes

  • Expanded and reworked source coverage (now a broad multi-country registry with curated and rejected sources handled explicitly).
  • Added/strengthened collector capabilities:
    • Interpol pagination + accumulation improvements.
    • FBI Wanted API ingestion.
    • Browser-backed ingestion hardening.
    • Redirect handling and transport fallback resilience.
    • Category dictionary support and relevance/downranking improvements.
  • Added search backend and frontend integration:
    • SQLite FTS5 indexing.
    • /api/search endpoint.
    • UI search wiring with fallback behavior.
    • API rate limiting.

Discovery and hygiene

  • Added autonomous source discovery improvements:
    • Gap analysis for missing country/category coverage.
    • DDG headless discovery as primary with LLM fallback.
    • Wikidata timeout/hygiene fixes.
  • Added live registry hygiene behavior:
    • Runtime merge of seed registry into DB.
    • Better rejection propagation/handling.
    • Dead-letter/replacement queue cleanups.
  • Introduced stronger noise filtering and explicit rejections:
    • European Schoolnet / World Bank education and other non-OSINT sources.
    • HTML/nav junk suppression (tasking/navigation false positives).

New intelligence domains

  • Added new category families and source sets, including:
    • maritime_security
    • legislative
    • conflict_monitoring
    • environmental_disaster
    • disease_outbreak
  • Added associated UI/category handling, labels, and severity behavior.

Geocoding and map quality

  • Added 3-tier geocoding pipeline:
    • city DB lookup,
    • Nominatim fallback,
    • country-capital fallback.
  • Tightened coordinate precision and map behavior:
    • better bounds/zoom behavior,
    • reduced invalid placements (water/desert drift),
    • improved international/conflict placement handling.

UI and UX

  • Updated category presentation and navigation behavior.
  • Improved severity/filter interactions and regional/category reset behavior.
  • Added branding/footer updates and terminology consistency updates.

Docker, install, and operations

  • Added remote installer bootstrap (deploy/install.sh) with:
    • install mode (preserve vs fresh volume reset),
    • GHCR image selection/tag prompts,
    • domain/TLS prompts and preflight checks.
  • Updated runtime bootstrap to avoid stale seeded alert snapshots on fresh volumes.
  • Added/updated pre-seeded source DB flow for faster bootstrap.
  • Hardened Docker/GitHub Actions build behavior:
    • Chromium install retry logic (headless required),
    • improved workflow cache scoping + retry path for transient build failures.
  • Updated operations/docs to match current deployment/runtime model.

Why

  • Reduce ingestion noise and zombie alerts.
  • Improve source quality and operational relevance.
  • Make discovery more autonomous but safer.
  • Improve geospatial correctness and UI usability.
  • Make fresh installs and CI/CD more reliable.

Validation focus

  • Collector ingestion across RSS/HTML/API sources and fallback behavior.
  • Registry merge/rejection behavior from seed + runtime updates.
  • Discovery candidate quality and hygiene filters.
  • Geocoding/map placement correctness.
  • Docker fresh-start behavior (no stale seeded alert payloads).
  • CI/release Docker build stability and Chromium availability.

- Introduced a new category dictionary in JSON format to manage alert categories such as missing persons, wanted suspects, travel warnings, fraud alerts, cyber advisories, and public appeals.
- Added curated agencies with relevant RSS feeds for missing persons and wanted suspects, including INTERPOL, FBI, and Europol.
- Implemented a new source candidates structure for future source management.
- Developed a FeedDirectory component to display alerts, source health, and statistics.
- Created a custom hook to fetch and manage source health data.
- Established a theme utility for consistent color management across the application.
- Defined TypeScript interfaces for source health data structures to ensure type safety.
…ment categories

New category types with theme colors and icons for expanded source coverage.
Routine domestic police operations (raids, drug busts, sentencing) are
penalised -0.20 unless cross-border signals are present. Interpol
fetcher uses polite paginated API (20/page, 2s delay).
Prevents flat top-authorities ranking where every source caps at 20.
Cyber advisory sources are individually capped at 15 in the registry.
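
The downranking rule above (a -0.20 penalty for routine domestic police operations unless cross-border signals are present) can be sketched as follows. This is a minimal illustration, not the collector's actual code: the keyword lists, the function name `adjust_relevance`, and the clamping to zero are assumptions; only the 0.20 penalty comes from the commit message.

```python
# Hypothetical keyword lists; the real collector's terms are not shown in this PR.
ROUTINE_TERMS = ("raid", "drug bust", "sentenced", "sentencing")
CROSS_BORDER_TERMS = ("interpol", "europol", "cross-border", "extradition")

ROUTINE_PENALTY = 0.20  # value taken from the commit message


def adjust_relevance(score: float, title: str) -> float:
    """Subtract the penalty for routine operations without cross-border signals."""
    text = title.lower()
    routine = any(term in text for term in ROUTINE_TERMS)
    cross_border = any(term in text for term in CROSS_BORDER_TERMS)
    if routine and not cross_border:
        score -= ROUTINE_PENALTY
    # Clamping at zero is an assumption made for this sketch.
    return max(0.0, round(score, 2))
```

A title such as "Police raid nets three suspects" would drop from 0.8 to 0.6, while "Europol-backed raid across four countries" keeps its score.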
- Interpol Red/Yellow Notices (paginated API)
- Humanitarian: ICRC, UNHCR, WHO, ReliefWeb, WFP, MSF
- Conflict: ICG, SIPRI, NATO, UN SC, OSCE, African Union
- Intelligence: CIA, MI5, GCHQ, BfV, BND, DGSI, AIVD, SÄPO, ASIO, CSIS
- Health: ECDC, CDC, ProMED, WHO emergencies
- Emergency: GDACS, FEMA, EU ERCC, USGS
- Ukraine: CERT-UA, NSDC, SBU, National Police
- Nordics: full coverage NO/DK/FI/IS/SE (police, intel, CERT, travel)
- Eastern Europe: PL/CZ/HU/RO/SK/BG/RS/GE/MD police and intel
- Middle East: Israel, UAE, Saudi, Jordan, Qatar CERTs
- Africa: AU, Egypt, Morocco, Rwanda, Ethiopia, Senegal, Algeria
- Asia: Taiwan, China, Vietnam, Pakistan, Bangladesh, Sri Lanka, Mongolia
- Sanctions: OFAC, EU, UN, FATF, OpenSanctions
- Financial fraud: FCA, BaFin, ESMA, SEC, FinCEN, FINMA + 6 EU regulators
- Organized crime: OCCRP, UNODC, ENFAST, OLAF, EPPO, DIA, SFO, DEA, ATF
- Europol updated to CMS API RSS, duplicate entry removed
- Cyber advisory sources capped at max_items=15
Alerts table now has a companion alerts_fts virtual table with BM25
ranking. SearchAlerts method supports text queries, category/region/
status filters, and automatic FTS index rebuild on SaveAlerts.
Embedded HTTP server in collector process, enabled via --api flag.
Supports ranked text search (BM25), category/region/status filters,
auto prefix matching, and FTS5 syntax (quoted phrases, AND/OR/NOT).
Caddy proxies /api/* to collector:3001 in Docker.
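
The ranked search described above (an FTS5 virtual table, BM25 ordering, automatic prefix matching, and pass-through of FTS5 syntax) can be approximated with the sketch below. It assumes an SQLite build with FTS5 enabled. The column names, the sample rows, and the helpers `build_match_query` and `search_alerts` are hypothetical; only the table name `alerts_fts` and the use of BM25 come from the commit messages.

```python
import sqlite3


def build_match_query(user_text: str) -> str:
    """Add a prefix wildcard to the last term of plain queries.

    Queries that already use FTS5 syntax (quotes, AND/OR/NOT) pass through
    unchanged. Punctuation such as hyphens is not sanitized in this sketch.
    """
    text = user_text.strip()
    terms = text.split()
    if not terms or '"' in text or any(t in ("AND", "OR", "NOT") for t in terms):
        return text
    terms[-1] += "*"
    return " ".join(terms)


def search_alerts(conn: sqlite3.Connection, user_text: str, limit: int = 10) -> list:
    """Return matching titles, best match first (lower bm25() means better)."""
    rows = conn.execute(
        "SELECT title FROM alerts_fts WHERE alerts_fts MATCH ? "
        "ORDER BY bm25(alerts_fts) LIMIT ?",
        (build_match_query(user_text), limit),
    )
    return [title for (title,) in rows]


# Tiny in-memory index with made-up sample alerts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE alerts_fts USING fts5(title, summary)")
conn.executemany(
    "INSERT INTO alerts_fts(title, summary) VALUES (?, ?)",
    [
        ("Ransomware campaign targets hospitals", "Advisory for health sector"),
        ("Red Notice issued for fraud suspect", "Wanted internationally"),
        ("Earthquake reported offshore", "No tsunami warning"),
    ],
)
```

With this index, `search_alerts(conn, "ransom")` is rewritten to the FTS5 query `ransom*` and so matches the ransomware alert, while a quoted phrase such as `"red notice"` is passed to FTS5 as written.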
useSearch hook probes /api/health on mount, then sends debounced
queries to /api/search. When API is unavailable, falls back to the
existing in-memory string.includes() filter.
Interpol API requires Referer, Origin, and Sec-Fetch-* headers
mimicking an XHR from www.interpol.int. Without these, Akamai
returns 403. Verified: 6,455 Red + 11,271 Yellow notices accessible.
…US agencies

- New fbi-wanted-json source type hitting api.fbi.gov public API
- Parser extracts name, charges, nationality, aliases, reward, armed-dangerous
- 4 FBI subcategories: wanted, ten-most-wanted, seeking-info, parental-kidnappings
- Add fetch_mode: browser to DEA and ATF (blocked by stealth HTTP)
- New dea-fugitives and usms-mostwanted browser-backed sources
- Remove duplicate fbi-seeking and fbi-mostwanted registry entries
- 30-request burst, 5/sec refill, stale eviction after 10min
- Returns 429 with Retry-After header when exceeded
- Health endpoint exempt from rate limiting
- Extracts client IP from X-Forwarded-For/X-Real-Ip
- Makefile dev-stop/dev-restart now prune dangling images and build cache
- Always follow HTTP redirects for RSS/Atom feed fetches (302/307 are
  normal for feeds) instead of treating them as dead sources
- Strip HTML and truncate to 2k chars before sending to Google Translate
  to prevent 413 on feeds with full-page markup in descriptions
- Export StripHTML for cross-package use
- Left panel: smaller severity numbers, remove duplicate Alerts row,
  larger Countries/Feeds display, clickable severity filter, zone stats,
  ">" prefix on capped authority counts, Middle East region
- GlobeView: fix map disappearing on force-reload via ResizeObserver
- Docker: merge registry into DB on every startup, bump MAX_PER_SOURCE
  to 40, add dev-sync-registry target, German AA followRedirects
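
The rate-limiting bullets above (30-request burst, 5/sec refill, stale eviction after 10 minutes, 429 with Retry-After, client IP from X-Forwarded-For/X-Real-Ip) read like a per-client token bucket. The sketch below illustrates that reading only. The class and function names are hypothetical, trusting the first X-Forwarded-For hop is an assumption, and the real implementation lives in the collector, not in Python.

```python
import math
from dataclasses import dataclass

BURST = 30             # maximum tokens per client (from the commit message)
REFILL_PER_SEC = 5.0   # tokens added per second (from the commit message)
STALE_AFTER = 600.0    # evict buckets idle for more than 10 minutes


@dataclass
class Bucket:
    tokens: float
    updated: float


class RateLimiter:
    def __init__(self) -> None:
        self.buckets = {}

    def allow(self, client_ip: str, now: float):
        """Return (allowed, retry_after_seconds) for one request."""
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = self.buckets[client_ip] = Bucket(float(BURST), now)
        else:
            elapsed = now - bucket.updated
            bucket.tokens = min(BURST, bucket.tokens + elapsed * REFILL_PER_SEC)
            bucket.updated = now
        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return True, 0
        # A caller would answer 429 and send this value as Retry-After.
        return False, math.ceil((1 - bucket.tokens) / REFILL_PER_SEC)

    def evict_stale(self, now: float) -> int:
        """Drop buckets that have been idle too long; return how many."""
        stale = [ip for ip, b in self.buckets.items() if now - b.updated > STALE_AFTER]
        for ip in stale:
            del self.buckets[ip]
        return len(stale)


def client_ip(headers: dict, remote_addr: str) -> str:
    """Prefer the first X-Forwarded-For hop, then X-Real-Ip, then the socket."""
    forwarded = headers.get("X-Forwarded-For", "")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return headers.get("X-Real-Ip", remote_addr)
```

A handler would skip the limiter entirely for the health endpoint, matching the exemption listed above, and call `evict_stale` periodically to bound memory.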
BSI NESAS is a product certification feed (NESAS audit/evaluation
docs), not security advisories. Mark it as rejected in the registry
and add promotion_status filtering to normalizeAll() so the JSON
loader respects rejection status the same way the SQLite loader does.
FBI removed their RSS feeds; news/press releases are not available via
the Wanted API either. Mark fbi-news as rejected.

Fix FeedDirectory global overview to use source-health total instead of
counting only sources that produced alerts.
Critical/High buttons now filter the map and alert feed globally
instead of only affecting left-panel stats. Third box shows conflict
monitoring count (ACLED etc.) and toggles that category filter.
Remove the Clear button — click again to deselect.
Add country/region extraction from alert titles so international
sources (Crisis Group, SIPRI, UN Press, AU Peace) pin to the actual
conflict location instead of the org HQ. Uses rightmost-match
heuristic with ~150 country centroids plus conflict sub-regions
(Tigray, Donbas, Rakhine, etc.).
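
The rightmost-match heuristic above can be illustrated as follows. This is a sketch under stated assumptions: the place list is a tiny hypothetical subset of the ~150 entries mentioned, the function name `extract_place` is invented, and the sketch returns only the matched name rather than coordinates.

```python
import re

# Hypothetical subset of known countries and conflict sub-regions.
PLACES = ("ethiopia", "tigray", "ukraine", "donbas", "myanmar",
          "rakhine", "sudan", "switzerland")


def extract_place(title: str):
    """Return the known place whose match starts furthest right, or None."""
    text = title.lower()
    best_pos, best_place = -1, None
    for place in PLACES:
        for match in re.finditer(r"\b" + re.escape(place) + r"\b", text):
            if match.start() > best_pos:
                best_pos, best_place = match.start(), place
    return best_place
```

For a title like "Geneva talks in Switzerland address fighting in Sudan", the rightmost match is "sudan", so the alert would pin to the place the title is about (named last) and not to the venue named earlier in the title.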

Fix broken conflict feed URLs: SIPRI → /rss/combined.xml, UN SC →
press.un.org/en/rss.xml. Reject NATO, OSCE, ACLED (no working feeds).
New sources:
- UN Peacekeeping (Blue Helmets) — mission/deployment updates
- UN OCHA — humanitarian crisis coordination
- UN News Peace & Security — conflict/peacekeeping coverage
- UN News Refugees & Migrants — displacement/migration intel
- UN News Humanitarian Aid — aid operations and tasking
- ICRC Humanitarian Law & Policy — IHL and conflict law
- ICRC Field Operations — Israel/Gaza/West Bank ops reporting

Rejected (feeds dead, no alternative):
- ICRC News (404), UNHCR (403), WFP (403), ICRC Family Links (403)
OIJ Costa Rica: remove "oij" from include_keywords (matched every URL
on the domain), drop root URL from feed_urls to avoid scraping
navigation. Tighten keywords to actual missing person terms.

NCMEC: their RSS emits titles like ": Name (State)" with a leading
colon. Prefix with "Missing" so it reads "Missing: Name (State)".
When switching regions via header shortcuts or the scope dropdown, the
active navigator group now resets so the first group in the new region
is auto-selected instead of sticking to the old selection.
Replace API JSON URLs (ws-public.interpol.int/notices/v1/red/...)
with human-readable web URLs (interpol.int/.../View-Red-Notices#ID).

Override lat/lng from Interpol HQ (Lyon) to the person's nationality
country centroid so notices pin to the correct location on the map.
…nation

Interpol has ~6.4k red and ~4k yellow notices. Instead of fetching all
at once, each run fetches a 320-notice window (2 pages × 160) and
advances a persistent cursor. State reconciliation carries forward
previously accumulated alerts for sources marked accumulate:true,
building the full corpus over successive runs.
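
The windowed fetch and carry-forward described above can be sketched like this. The page size and pages per run come from the commit message; the wrap-around at the end of the list, the 0-based cursor, and the function names are assumptions made for illustration.

```python
PAGE_SIZE = 160      # notices per page (from the commit message)
PAGES_PER_RUN = 2    # pages fetched per collector run (from the commit message)


def next_window(cursor: int, total_pages: int):
    """Return (1-based pages to fetch this run, new cursor).

    `cursor` is a 0-based page offset persisted between runs. Wrapping back
    to the first page at the end is an assumption of this sketch.
    """
    count = min(PAGES_PER_RUN, total_pages)
    pages = [(cursor + i) % total_pages + 1 for i in range(count)]
    return pages, (cursor + count) % total_pages


def reconcile(previous: dict, fetched: dict) -> dict:
    """Carry forward earlier alerts for an accumulating source.

    Both dicts map notice ID to alert; freshly fetched data wins on conflict.
    """
    merged = dict(previous)
    merged.update(fetched)
    return merged
```

Starting from cursor 0 with five pages, successive runs would fetch pages [1, 2], [3, 4], [5, 1], and so on, while `reconcile` keeps everything seen so far.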
Add maxBounds, maxBoundsViscosity, minZoom and noWrap to stop
vertical scrolling past the world edge and tile repetition.
Increase global/international zoom and minZoom from 2 to 3 to eliminate
white gaps at map edges. Add German-language keywords to severity
inference so CERT.AT/BSI advisories get correct severity levels.
Document that EUOSINT intentionally pulls only the newest 160 red and
160 yellow Interpol notices per run to avoid data overflow. Also covers
severity classification, map tiles, and collector cycle behavior.
Cover all 17 alert categories with descriptions and example sources,
severity classification rules, Interpol notice limits, map tiles,
collector cycle, and region scoping.
Add 12 new sources: GDACS disasters, USGS earthquakes, NOAA oil/chem
incidents, Smithsonian volcanoes, EMSA maritime, IAEA nuclear, WHO,
ECDC epidemiological updates + risk assessments, CDC, WOAH zoonotic.
Reject ECB press releases (general news, not fraud intel).
Add severity keywords for outbreaks, natural disasters, and hazmat.
Fraud sources will appear after next dev-restart (DB merge on startup).
…LM category vetting

Source lifecycle is now fully automated without requiring restarts:
- Every collector cycle merges the JSON registry into SQLite (new sources
  picked up, rejected status synced)
- Dead sources (404, 403, DNS, TLS errors) are rejected in both SQLite
  AND the JSON registry, then written to the DLQ
- Merge respects runtime rejections: if a source was killed at runtime,
  re-merging from JSON won't resurrect it
- LLM vetting prompt now includes all 18 categories with descriptions
  so the model can assign the correct category for discovered sources
- Verdict struct includes category field validated against the taxonomy
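
The merge rule in the list above (new sources are picked up, rejected status is synced, and runtime rejections are never resurrected) can be sketched as below. The dictionary layout, the `rejected_at_runtime` flag, and the function name are hypothetical stand-ins for whatever the SQLite schema actually stores.

```python
def merge_registry(db_sources: dict, json_registry: dict) -> dict:
    """Merge JSON registry entries into the DB view of sources.

    Both arguments map source ID to a dict with at least a "status" key.
    DB rows may carry a hypothetical "rejected_at_runtime" flag.
    """
    merged = {sid: dict(row) for sid, row in db_sources.items()}
    for sid, entry in json_registry.items():
        current = merged.get(sid)
        if current is None:
            # New source: picked up from the JSON registry.
            merged[sid] = {**entry, "rejected_at_runtime": False}
        elif current.get("rejected_at_runtime"):
            # Killed at runtime: re-merging must not resurrect it.
            continue
        else:
            # Known source: sync fields, including a rejected status.
            merged[sid] = {**current, **entry}
    return merged
```

The key design point is the middle branch: a runtime rejection takes precedence over whatever the JSON registry says.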
Add -v flag to docker-compose down so feed-data volume is removed,
ensuring the entrypoint re-seeds the DB from the current JSON registry
on next start.
Add all 18 category labels to LLM search discovery for replacement
feed lookups. Remove dead rejectInJSONRegistry (wrote to Docker
ephemeral FS). Add scripts/apply-dlq.py and Makefile dev-sync-dlq
target for developer-side DLQ processing.
Add make dev-export-db to snapshot sources.db from a running collector.
Entrypoint prefers sources.seed.db over cold init+import when available.
The merge step still runs on every start to pick up JSON registry updates.
Tier 1: GeoNames cities500.txt (200k+ cities) imported into SQLite for
fast city-name lookups. Text is scanned for place names and matched
against the DB with population-based disambiguation.

Tier 2: OSM Nominatim fallback for place names not in the local DB.
Rate-limited (1 req/sec), in-memory cached, configurable base URL for
self-hosted instances.

Tier 3: Country-level text scanning now returns capital city coordinates
instead of geographic centroids. Fixes island nations (Malta, Cyprus,
Singapore, etc.) placing alerts in the sea.

The Dockerfile downloads cities500.txt at build time (~30MB). The
collector auto-imports it into SQLite on first run. All tiers are
optional — the system degrades gracefully.
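
The three-tier fallback above can be sketched as a chain of optional lookups. The lookup callables here are stubs standing in for the city DB, Nominatim, and the capital table; the function name `geocode`, the tier labels, and the stub behavior are assumptions, and the Valletta coordinates are approximate.

```python
def geocode(text, city_lookup=None, nominatim_lookup=None, capital_lookup=None):
    """Try each configured tier in order; return (tier_name, (lat, lng)) or None.

    Every tier is optional, and a tier that raises is treated as a miss so
    the pipeline degrades gracefully.
    """
    tiers = (
        ("city_db", city_lookup),
        ("nominatim", nominatim_lookup),
        ("capital", capital_lookup),
    )
    for name, lookup in tiers:
        if lookup is None:
            continue
        try:
            coord = lookup(text)
        except Exception:
            coord = None
        if coord is not None:
            return name, coord
    return None


# Stub tiers for illustration only.
def city_stub(text):
    return None  # pretend the local city DB has no match


def nominatim_stub(text):
    raise TimeoutError("simulated network failure")


def capital_stub(text):
    # Approximate coordinates of Valletta, the capital of Malta.
    return (35.90, 14.51) if "malta" in text.lower() else None
```

With these stubs, `geocode("Alert issued in Malta", city_stub, nominatim_stub, capital_stub)` falls through the first two tiers and resolves via the capital tier, which is the behavior that keeps island-nation alerts out of the sea.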
Gap analysis scans the active registry against 120+ target countries
worldwide. Missing country+category combinations generate synthetic
search candidates that feed into the existing LLM search + RSS probe
pipeline. Covers Europe, Americas, Asia-Pacific, Middle East, Africa,
Central Asia, and Caucasus.
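
The gap analysis above amounts to a set difference between the target country and category grid and what the active registry already covers. A minimal sketch, assuming each source record carries `country` and `category` fields; the function names and the search-query wording are hypothetical.

```python
def find_gaps(active_sources, target_countries, categories):
    """Return (country, category) pairs with no active source."""
    covered = {(s["country"], s["category"]) for s in active_sources}
    return [
        (country, category)
        for country in target_countries
        for category in categories
        if (country, category) not in covered
    ]


def search_candidate(country: str, category: str) -> str:
    """Build a synthetic search string for a gap (wording is an assumption)."""
    return f"{country} {category.replace('_', ' ')} RSS feed"
```

Each returned pair would then be turned into a candidate and handed to the existing search and RSS probe pipeline.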

Also fixes Interpol notice URLs: 2026/5314 → 2026-5314 in fragment.
DuckDuckGo search via headless Chrome is now the first-class feed
discovery method — zero API keys, zero tokens. The system searches for
RSS/Atom feeds using gap-analysis targets, extracts result URLs from
DDG HTML, and feeds them into the existing probe pipeline.

LLM search only runs for targets that DDG didn't cover, saving tokens.

Also expanded feed probe paths with government/ministry patterns
(DOJ-style feeds, multi-language /de/feed /fr/feed etc.) and enabled
browser + DDG in docker-compose.
Root cause: SERVICE wikibase:label and P279* subclass traversal cause
persistent timeouts on the public Wikidata SPARQL endpoint. Fix both
police.go and humanitarian.go to query one type ID at a time with P31
only, LIMIT 50, deriving names from hostnames and a static country map.
Added 8 explicit subclass type IDs to compensate for removed P279*.
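
The query shape described above (one type ID per request, P31 only, LIMIT 50, no label service and no P279* traversal) might look like the sketch below. P31 (instance of), P856 (official website), and P17 (country) are standard Wikidata properties, but whether the real queries select exactly these fields is an assumption, as are the helper names and the two-entry country map.

```python
from urllib.parse import urlparse

# Tiny static map from Wikidata country IDs to names (stand-in for the real one).
COUNTRY_NAMES = {"Q183": "Germany", "Q142": "France"}


def build_query(type_id: str, limit: int = 50) -> str:
    """Build a SPARQL query for direct instances of a single type."""
    return (
        "SELECT ?item ?website ?country WHERE {\n"
        f"  ?item wdt:P31 wd:{type_id} ;\n"
        "        wdt:P856 ?website ;\n"
        "        wdt:P17 ?country .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def name_from_url(url: str) -> str:
    """Derive a display name from the hostname instead of a label lookup."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def country_name(country_id: str) -> str:
    """Resolve a country ID via the static map, falling back to the raw ID."""
    return COUNTRY_NAMES.get(country_id, country_id)
```

A caller would loop over the explicit type IDs and issue one small query per ID, which trades a few extra requests for avoiding the timeouts named as the root cause.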
Links to streamingintelligence product page and contact page with UTM
tracking parameters. Referrer enabled for analytics attribution.
3.2MB SQLite snapshot with curated feeds so new deployments start with
full coverage immediately. Entrypoint copies it to /data/sources.db on
first run; subsequent starts merge the JSON registry on top.
…titles

Include keywords now match title only, not the URL — the feed URL path
(e.g. /desaparecidos) was letting every link on the page pass the filter.
Added junk title blocklist to reject navigation boilerplate (load more,
cookie config, browser names) at parse time.
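
The filter change above (keywords matched against the title only, plus a junk-title blocklist) can be sketched as follows. The blocklist entries and the function name `keep_item` are hypothetical; the real list also covers browser names, which are left out here.

```python
# Hypothetical blocklist of navigation boilerplate.
JUNK_TITLES = ("load more", "cookie settings", "accept cookies")


def keep_item(title: str, url: str, include_keywords) -> bool:
    """Keep an item only if its title (never its URL) matches a keyword."""
    text = title.strip().lower()
    if not text or any(junk in text for junk in JUNK_TITLES):
        return False
    # `url` is deliberately ignored: a path such as /desaparecidos would
    # otherwise let every link on the page pass the filter.
    return any(keyword.lower() in text for keyword in include_keywords)
```

Under this sketch, a "Contact us" link hosted under `/desaparecidos/` is dropped even though its URL contains the keyword, while a title that itself contains the keyword is kept.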
…rces

New categories: maritime_security (7 sources incl. US Navy, CIMSEC,
EU NAVFOR, MARAD) and legislative (7 sources incl. EU Parliament,
EU Council, EEAS, US Congress, State Dept). Additional conflict
monitoring sources: US DoD, NATO, OSCE, UN Security Council, ICG,
German Foreign Office, France Diplomatie. All filtered with
include_keywords to reduce noise. Gap analysis now discovers these
categories automatically for all target countries.
Add non-OSINT term and host blocklists to discovery hygiene (education,
world bank, social media, entertainment, etc). Purge orphan alerts from
rejected/removed sources on each collection cycle. Explicitly reject
worldbank-education-digital in registry.
@novatechflow novatechflow merged commit 3182974 into main Mar 17, 2026
4 checks passed
novatechflow added a commit that referenced this pull request Mar 18, 2026