
Experiments Results

yixu9-hub edited this page Dec 11, 2025 · 12 revisions

Post Inconsistency

Data Environment Settings 1

200 users concurrently send 2000 posts in total. Immediately after the writes complete, the test calls the timeline service to retrieve the feed content under the push and pull strategies respectively.

Missing Rate

| Algorithm | Pull | Push | Hybrid |
|---|---|---|---|
| Missing rate | 38% | 100% | 80% |
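As a rough illustration of how a missing rate like the ones above can be computed, the sketch below compares the post IDs the test created against the IDs the timeline returned. The function name and the ID format are hypothetical and are not taken from the project's test harness.

```go
package main

import "fmt"

// MissingRate returns the fraction of expected post IDs that do not appear
// in the timeline response. Hypothetical helper, for illustration only.
func MissingRate(expected, returned []string) float64 {
	if len(expected) == 0 {
		return 0
	}
	seen := make(map[string]struct{}, len(returned))
	for _, id := range returned {
		seen[id] = struct{}{}
	}
	missing := 0
	for _, id := range expected {
		if _, ok := seen[id]; !ok {
			missing++
		}
	}
	return float64(missing) / float64(len(expected))
}

func main() {
	expected := []string{"p1", "p2", "p3", "p4", "p5"}
	returned := []string{"p1", "p4", "p5"}
	fmt.Printf("missing rate: %.0f%%\n", MissingRate(expected, returned)*100)
}
```

With this definition, a 100% missing rate means that none of the newly created posts appeared in the retrieved timeline.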

Missing Content

  1. Pull
image
  2. Push
image
  3. Hybrid
image

Insights

In the pull-based model, the timeline service calls PostRepository.GetPostByUserID(), which is a DynamoDB query on the user_id-index GSI that fetches all of a user's posts. DynamoDB GSIs are only eventually consistent, so right after 2000 concurrent writes there is a propagation lag before the new items become visible on the index. We also noticed jumps in the order of the missing posts when retrieving the timeline. This suggests that we are merging whatever the index happened to return at that moment, and that the posts are missing from the index rather than from the database.

In the push-based model, fan-out runs asynchronously through SNS/SQS, so the test hits /api/timeline before the background writes finish. When thousands of posts are created at high concurrency, the fan-out goroutines queue up, and the SQS and DynamoDB writes can take seconds. The test checks the timeline window immediately and sees only older entries, so every new post is reported missing.
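The timing issue can be reproduced in miniature. The sketch below is not the project's code: a buffered channel stands in for SNS/SQS, an in-memory map stands in for the timeline table, and all names are hypothetical. To keep the result deterministic, the worker is started only after the first read, whereas the real workers run concurrently and simply lag behind.

```go
package main

import (
	"fmt"
	"sync"
)

// fanoutJob is one pending timeline write: deliver postID to followerID.
type fanoutJob struct {
	followerID string
	postID     string
}

// timelineStore is an in-memory stand-in for the timeline table.
type timelineStore struct {
	mu    sync.Mutex
	items map[string][]string
}

func (s *timelineStore) add(userID, postID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[userID] = append(s.items[userID], postID)
}

func (s *timelineStore) count(userID string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.items[userID])
}

// ReadBeforeAndAfterDrain enqueues n fan-out jobs for one follower, reads the
// timeline before any worker has run, then drains the queue and reads again.
func ReadBeforeAndAfterDrain(n int) (before, after int) {
	store := &timelineStore{items: make(map[string][]string)}
	queue := make(chan fanoutJob, n) // buffered channel stands in for SNS/SQS

	// Write path: the post API only enqueues work and returns.
	for i := 0; i < n; i++ {
		queue <- fanoutJob{followerID: "follower-1", postID: fmt.Sprintf("post-%d", i)}
	}
	close(queue)

	// Immediate read, as in the test: no background write has landed yet.
	before = store.count("follower-1")

	// The background worker catches up later.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for job := range queue {
			store.add(job.followerID, job.postID)
		}
	}()
	wg.Wait()

	after = store.count("follower-1")
	return before, after
}

func main() {
	before, after := ReadBeforeAndAfterDrain(5)
	fmt.Printf("visible immediately: %d, visible after drain: %d\n", before, after)
}
```

The posts are not lost in this sketch. They are only invisible until the queue drains, which matches the explanation above for the 100% missing rate measured right after the writes.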

This pattern is also confirmed in Setting 2, where we tested with 50 regular-user posts (processed via the push strategy under hybrid mode) and 150 celebrity-user posts (processed via the pull strategy) to simulate real-life usage. This setting exhibited a missing rate between those of the pull and push models, with the misses mainly caused by the regular users' posts processed via the push strategy.

To sum up, the pull strategy supports more real-time retrieval because posts are written to the database immediately and read from it at request time. In contrast, the push strategy may add latency before a post becomes visible, because it relies on asynchronous processing and queued updates.

Timeline Retrieval Time

Data Environment Settings

  1. Tests the response time of the timeline GET API for 3 different users, who follow 10, 100, and 1600 users respectively. Each followed user has 10 posts.
  2. Hybrid follower count threshold: 20000

Response time

| Algorithm | 10 following | 100 following | 1600 following |
|---|---|---|---|
| Push | 45ms | 43ms | 48ms |
| Pull | 50ms | 200ms | 3200ms |
| Hybrid | 52ms | 130ms | 2200ms |
Figure_1
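The growth of pull latency with the number of followed users is consistent with a read path that issues one posts query per followed user and merges the results at request time. The sketch below illustrates that shape under stated assumptions: `countingRepo`, `PostsByUser`, and `PullTimeline` are hypothetical in-memory stand-ins, not the project's repository or service code.

```go
package main

import (
	"fmt"
	"sort"
)

// Post is a minimal post record for the sketch.
type Post struct {
	ID        string
	AuthorID  string
	CreatedAt int64
}

// countingRepo is a hypothetical in-memory post store that counts queries.
type countingRepo struct {
	byAuthor map[string][]Post
	queries  int
}

// PostsByUser returns all posts of one author and records one query.
func (r *countingRepo) PostsByUser(userID string) []Post {
	r.queries++
	return r.byAuthor[userID]
}

// PullTimeline fetches posts for every followed user, merges them, and
// returns the newest `limit` posts. Work grows with len(following).
func PullTimeline(r *countingRepo, following []string, limit int) []Post {
	var merged []Post
	for _, uid := range following {
		merged = append(merged, r.PostsByUser(uid)...)
	}
	sort.Slice(merged, func(i, j int) bool {
		return merged[i].CreatedAt > merged[j].CreatedAt
	})
	if len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}

// newRepo builds a store with `authors` users holding `postsPerAuthor` posts each.
func newRepo(authors, postsPerAuthor int) (*countingRepo, []string) {
	repo := &countingRepo{byAuthor: make(map[string][]Post)}
	following := make([]string, 0, authors)
	for a := 0; a < authors; a++ {
		uid := fmt.Sprintf("u%d", a)
		following = append(following, uid)
		for p := 0; p < postsPerAuthor; p++ {
			repo.byAuthor[uid] = append(repo.byAuthor[uid], Post{
				ID:        fmt.Sprintf("%s-p%d", uid, p),
				AuthorID:  uid,
				CreatedAt: int64(a*postsPerAuthor + p),
			})
		}
	}
	return repo, following
}

func main() {
	for _, n := range []int{10, 100, 1600} {
		repo, following := newRepo(n, 10)
		timeline := PullTimeline(repo, following, 20)
		fmt.Printf("following=%d queries=%d returned=%d\n", n, repo.queries, len(timeline))
	}
}
```

In this sketch the query count equals the number of followed users, while a precomputed (push) timeline would need a single read regardless of that number.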

Database Storage

Data Environment Settings

User Base: 5,000 users
Test Scenarios: Push, Pull, and Hybrid fan-out strategies
Database: Amazon DynamoDB


Storage Metrics by Strategy

| Strategy | Post Items | Post Storage (MB) | Timeline Items | Timeline Storage (MB) | Total Storage (MB) |
|---|---|---|---|---|---|
| Push | 0 | 0.00 | 3,686,351 | 812.71 | 812.71 |
| Pull | 46,317 | 6.21 | 0 | 0.00 | 6.21 |
| Hybrid | 1,303 | 0.18 | 3,477,569 | 760.96 | 761.14 |

Storage Cost Metrics by Strategy

| Strategy | Post Cost ($/mo) | Timeline Cost ($/mo) | Total Cost ($/mo) | Annual Cost ($) |
|---|---|---|---|---|
| Push | $0.0000 | $0.1984 | $0.1984 | $2.38 |
| Pull | $0.0015 | $0.0000 | $0.0015 | $0.02 |
| Hybrid | $0.0001 | $0.1858 | $0.1859 | $2.23 |
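The cost figures above appear to follow from the storage sizes at a flat rate. The sketch below assumes $0.25 per GB-month and 1 GB = 1024 MB. Both values are inferred from the table (812.71 MB maps to $0.1984/mo) rather than stated in the source, and actual DynamoDB pricing varies by region and table class.

```go
package main

import "fmt"

// assumedRatePerGBMonth is an assumption inferred from the cost table,
// not a value stated in the source.
const assumedRatePerGBMonth = 0.25

// MonthlyStorageCost converts a storage size in MB to a monthly cost in USD,
// assuming 1 GB = 1024 MB.
func MonthlyStorageCost(storageMB float64) float64 {
	return storageMB / 1024 * assumedRatePerGBMonth
}

func main() {
	rows := []struct {
		name string
		mb   float64
	}{
		{"Push", 812.71},
		{"Pull", 6.21},
	}
	for _, r := range rows {
		monthly := MonthlyStorageCost(r.mb)
		fmt.Printf("%-5s $%.4f/mo  $%.2f/yr\n", r.name, monthly, monthly*12)
	}
}
```

Under these assumptions the Push and Pull rows are reproduced to the displayed precision. The Hybrid total differs by about $0.0001, which looks like rounding of the per-table costs before summing.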

Average Item Size

| Strategy | Post Avg (bytes) | Timeline Avg (bytes) |
|---|---|---|
| Push | 0 | 231.17 |
| Pull | 140.69 | 0 |
| Hybrid | 141.00 | 229.45 |

Storage Efficiency per User (5K users)

| Metric | Push | Pull | Hybrid |
|---|---|---|---|
| MB per user | 0.163 | 0.001 | 0.152 |
| Timeline items per user | 737.27 | 0 | 695.51 |
| Post items per user | 0 | 9.26 | 0.26 |
strategy_comparison strategy_tradeoffs

Throughput, AutoScale test

Test Configuration

  • Users: 1500 concurrent
  • Spawn rate: 20 users/second
  • Duration: 20 minutes
  • Services: Post (1024MB), Timeline, User, Social Graph, Web
  • Social Graph: 1500 users, ~140k relationships (power-law)

PUSH (SNS fan-out)

  • Requests: ~392,025 (Aggregated)
  • Failures: 0 (0.00%)
  • Throughput: ~326.8 req/s (Aggregated)
  • Read (timeline): Median ~51ms, P95 ~7.8s, P99 ~11s
  • Write (post): Median ~34ms
  • Behavior: Writes are asynchronous, timelines precomputed; high reliability and stable performance.

PULL (on-demand aggregation)

  • Requests: ~143,283 (Aggregated)
  • Failures: ~98,641 (≈68.8% overall; Read ~61.37%, Write ~98.84%)
  • Throughput: ~129.3 req/s (Aggregated)
  • Read (timeline): Median ~7.6s, high tail latencies; frequent timeouts under load
  • Write (post): Median ~110ms, but extremely high failure rate
  • Behavior: Reads aggregate multiple sources on demand; degrades sharply with high followings and concurrency.

HYBRID (adaptive threshold)

  • Requests: ~157,736 (Aggregated)
  • Failures: ~30,401 (mostly writes; ≈95.76% write failure)
  • Throughput: ~131.6 req/s (Aggregated)
  • Read (timeline): Median ~11s, no read failures
  • Write (post): Median ~4.7s, high failure rate indicates write path falling back to PULL or unstable deployment
  • Behavior: Mixed results; current configuration likely routes many writes via PULL.

Why Performance Differs

  • PUSH:

    • Precomputed timelines reduce read-time work; SNS fanout amortizes cost at write time.
    • Low median read latency and zero failures; tail latency increases during bursts due to queue/backfill, but remains stable overall.
  • PULL:

    • Read-time aggregation scales poorly with the number of followed users; N+1 queries and cross-service fetches drive timeouts.
    • Read latency grows with the number of followings and concurrency; frequent ALB/backend timeouts cause high failure rates, especially at scale. Write latency looks fine, but user experience is dominated by slow, failure-prone reads.
  • HYBRID:

    • Intended to route high-follower users via PULL and others via PUSH.
    • The high write failure rate indicates that many writes still follow the PULL path (or fall back to it) and hit timeouts; reads show no failures but high median and tail latency.
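The intended hybrid routing described above can be sketched as a per-author decision on follower count. This is a minimal illustration, not the project's implementation: the function and type names are hypothetical, the threshold reuses the 20000 value from the timeline retrieval test, and treating a count equal to the threshold as PULL is an assumption.

```go
package main

import "fmt"

// Strategy names the fan-out path chosen for an author's posts.
type Strategy string

const (
	Push Strategy = "push"
	Pull Strategy = "pull"
)

// followerThreshold reuses the hybrid threshold from the timeline retrieval
// test settings; the throughput test may have used a different value.
const followerThreshold = 20000

// StrategyForAuthor routes high-follower authors to PULL and everyone else
// to PUSH. The >= comparison is an assumption.
func StrategyForAuthor(followerCount int) Strategy {
	if followerCount >= followerThreshold {
		return Pull
	}
	return Push
}

func main() {
	for _, followers := range []int{150, 19999, 20000, 1000000} {
		fmt.Printf("followers=%d -> %s\n", followers, StrategyForAuthor(followers))
	}
}
```

Logging the chosen strategy per write in a similar way would be one way to check whether writes are actually taking the path the configuration intends.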