Skip to content

Implement support for new epoch columns to StreamDBClient queries#216

Open
Tiihott wants to merge 22 commits into
teragrep:mainfrom
Tiihott:epoch_reads
Open

Implement support for new epoch columns to StreamDBClient queries#216
Tiihott wants to merge 22 commits into
teragrep:mainfrom
Tiihott:epoch_reads

Conversation

@Tiihott
Copy link
Copy Markdown
Contributor

@Tiihott Tiihott commented Aug 25, 2025

Description

The aim of this pull request is to implement support for new epoch time columns to logfile metadata queries done by StreamDBClient. Original logdate and logtime implementations are prone to timezone issues where the result of the query changes depending on the session timezone, using epoch time resolves this issue.

Includes:

  • Added support for using epoch_hour column as a source for logtime result value.
  • Removed support for using old logdate and logtime implementation.
  • New tests for asserting the query changes work as expected.
  • Renamed and disabled old logdate implementation tests.
  • Fixed weightedOffsetTest which was not using whole hours for logtime values.

Resolves #213
Resolves #214
Resolves #284

Checklists

Testing

General

  • I have checked that my test files and functions have meaningful names.
  • I have checked that each test tests only a single behavior.
  • I have done happy tests.
  • I have tested only my own code.
  • I have tested at least all public methods.

Assertions

  • I have checked that my tests use assertions and not runtime overhead.
  • I have checked that my tests end in assertions.
  • I have checked that there is no comparison statements in assertions.
  • I have checked that assertions are in tests and not in helper functions.
  • I have checked that assertions for iterables are outside of for loops and both sides of the iteration blocks.
  • I have checked that assertions are not tested inside consumers.

Testing Data

  • I have tested algorithms and anything else with the possibility of unbound growth.
  • I have checked that all testing data is local and fully replaceable or reproducible or both.
  • I have checked that all test files are standalone.
  • I have checked that all test-specific fake objects and classes are in the test directory.
  • I have checked that my tests do not contain anything related to customers, infrastructure or users.
  • I have checked that my tests do not contain non-generic information.
  • I have checked that my tests do not do external requests and are not privately or publicly routable.

Statements

  • I have checked that my tests do not use throws for exceptions.
  • I have checked that my tests do not use try-catch statements.
  • I have checked that my tests do not use if-else statements.

Java

  • I have checked that my tests for Java uses JUnit library.
  • I have checked that my tests for Java uses JUnit utilities for parameters.

Other

  • I have only tested public behavior and not private implementation details.
  • I have checked that my tests are not (partially) commented out.
  • I have checked that hand-crafted variables in assertions are used accordingly.
  • I have tested Object Equality.
  • I have checked that I do not have any manual tests or I have a valid reason for them and I have explained it in the PR description.

Code Quality

  • I have checked that my code follows metrics set in Procedure: Class Metrics.
  • I have checked that my code follows metrics set in Procedure: Method Metrics.
  • I have checked that my code follows metrics set in Procedure: Object Quality.
  • I have checked that my code does not have any NULL values.
  • I have checked my code does not contain FIXME or TODO comments.

@Tiihott Tiihott self-assigned this Aug 25, 2025
@StrongestNumber9
Copy link
Copy Markdown
Contributor

StrongestNumber9 commented Aug 27, 2025

Does not include:

MariaDB TestContainers implementation for StreamDBClient tests, tests that are run on workstation require a working MariaDB instance to be set up. Workflows have the correct MariaDB setup already present for running the tests.

I think this is not very good approach as this could introduce security issues for workstation users when running improperly configured MariaDB, like unauthenticated access but open to the internet. Also while the CI starts from scratch everytime using well known steps, the database developer is using might get out of sync with schema changes or have lingering data which makes the tests unreliable.

Example of the actual issue with out of date database like I had:

[ERROR] epochHourNullTest  Time elapsed: 0.157 s  <<< ERROR!
org.jooq.exception.DataAccessException: SQL [insert into `journaldb`.`logfile` (`id`, `logdate`, `expiration`, `bucket_id`, `path`, `object_key_hash`, `host_id`, `original_filename`, `archived`, `file_size`, `sha256_checksum`, `archive_etag`, `logtag`, `source_system_id`, `category_id`, `uncompressed_file_size`, `epoch_hour`, `epoch_expires`, `epoch_archived`) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]; (conn=3) Unknown column 'epoch_hour' in 'INSERT INTO'
        at com.teragrep.pth_06.planner.StreamDBClientTest.epochHourNullTest(StreamDBClientTest.java:296)
Caused by: java.sql.SQLSyntaxErrorException: (conn=3) Unknown column 'epoch_hour' in 'INSERT INTO'
        at com.teragrep.pth_06.planner.StreamDBClientTest.epochHourNullTest(StreamDBClientTest.java:296)

Concider implementing it like described in https://github.com/teragrep/mvn_01/tree/main/testcontainers

Copy link
Copy Markdown
Contributor

@StrongestNumber9 StrongestNumber9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests requires quite a lot of cleaning and adding the testcontainers as they are very mentally expensive to parse, especially with the duplicated codeblocks

HostRecord hostRecord = new HostRecord(UShort.valueOf(1), "testHost1");
ctx.insertInto(JOURNALDB.HOST).set(hostRecord).execute();

// Set logdate to 2023-10-04 instead of the correct 2023-10-05 to emulate timezone issues, and test if epoch_hour takes priority or not.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like something worth of two distinct tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the test configuration to use predetermined UTC-4 timezone for all tests instead of whatever the system the test runs on decides to use as session timezone.
epochHourNullTest() now tests if the results are affected by the UTC-4 session timezone as expected when old logdate/logtime implementation is used.
epochHourTimezoneTest() tests if logtime and logdate results are unaffected by the UTC-4 timezone.

Comment on lines +217 to +219
long logtime = hourRange.get(0).get(8, Long.class);
Assertions.assertEquals(1696471200L, logtime);
Date logdate = hourRange.get(0).get(5, Date.class);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting these hourRanges to single use variable looks a bit noisier than it has to be

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed single use variables in commit 8f96a57

long logtime = hourRange.get(0).get(8, Long.class);
Assertions.assertEquals(1696471200L, logtime);
Date logdate = hourRange.get(0).get(5, Date.class);
Assertions.assertEquals(Date.valueOf(date), logdate);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this Date.valueOf(date) be changed into literal of some sort instead of using output of a function with a variable passed as the value? This is to avoid some bizarre scenario where both Date.valueOf(date) and logdate ends up being null or 0 or something else similarly unexpected

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed Date.valueod() to literal string format instead of LocalDate in commit 8f96a57

"example.log-@1696471200-2023100423.log.gz",
new Timestamp(2025, 8, 13, 16, 18, 22, 0),
ULong.valueOf(120L),
"e2I8CnejWweTSk8tmo4tNkDO2fU7RajqI111FPlF7Mw=",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random sha256Checksum value, replaced with simple description string in commit 381c1d3

new Timestamp(2025, 8, 13, 16, 18, 22, 0),
ULong.valueOf(120L),
"e2I8CnejWweTSk8tmo4tNkDO2fU7RajqI111FPlF7Mw=",
"5ddea0496b0a9ad1266b26da3de9f573",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random archiveEtag value, replaced with simple description string in commit 381c1d3

LogfileRecord logfileRecord = new LogfileRecord(
ULong.valueOf(1),
Date.valueOf(LocalDate.of(2023, 10, 4)),
new Date(2125 - 1900, 7, 13),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like magic numbers

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced with easier to read Date.valueOf() in commit 381c1d3

Comment on lines +110 to +151
// Init mandatory Config object with the minimum options required for testing StreamDBClient.
Map<String, String> opts = new HashMap<>();
opts.put("S3endPoint", "mock");
opts.put("S3identity", "mock");
opts.put("S3credential", "mock");
opts.put("DBusername", streamDBUsername);
opts.put("DBpassword", streamDBPassword);
opts.put("DBurl", streamDBUrl);
opts.put("DBstreamdbname", streamdbName);
opts.put("DBjournaldbname", journaldbName);
opts.put("queryXML", "<index value=\"example\" operation=\"EQUALS\"/>");
opts.put("archive.enabled", "true");
Config config = new Config(opts);
// Add test data to journaldb and streamdb, there are several tables in both databases.
Settings settings = new Settings()
.withRenderMapping(new RenderMapping().withSchemata(new MappedSchema().withInput("streamdb").withOutput(config.archiveConfig.dbStreamDbName), new MappedSchema().withInput("journaldb").withOutput(config.archiveConfig.dbJournalDbName), new MappedSchema().withInput("bloomdb").withOutput(config.archiveConfig.bloomDbName)));
final Connection connection = Assertions
.assertDoesNotThrow(
() -> DriverManager
.getConnection(
config.archiveConfig.dbUrl, config.archiveConfig.dbUsername,
config.archiveConfig.dbPassword
)
);
final DSLContext ctx = DSL.using(connection, SQLDialect.MYSQL, settings);

BucketRecord bucketRecord = new BucketRecord(UShort.valueOf(1), "bucket1");
ctx.insertInto(JOURNALDB.BUCKET).set(bucketRecord).execute();

ctx
.query(String.format("INSERT INTO %s.category (id, name) VALUES (1, 'testCategory')", journaldbName))
.execute();
ctx
.query(
String
.format(
"INSERT INTO %s.source_system (id, name) VALUES (2, 'testSourceSystem2')",
journaldbName
)
)
.execute();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could be deduplicated either in beforeEach or as some buildDSLContext function. Repeating 40 idential lines per test is not very easy to mentally parse

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved all the test data insertions that are not directly relevant to the tests to beforeEach in commit 2fbb4de
Created a private method for initializing Config object with test options in commit 1f2b0a6

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Aug 28, 2025

Implemented testcontainers to the tests. As testcontainers do not support managing several databases inside the same testcontainer using the methods available on the testcontainer object, the streamdb database had to be initialized using a separate CREATE_STREAMDB_DB.sql script that is called during container startup.

Note: New databases can't be created after the container is already started as the query results in error with access denied for the user.

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Aug 28, 2025

Cleaned up beforeEach further by moving all journaldb table creation and test data population inside the script that is called during container startup.

@StrongestNumber9
Copy link
Copy Markdown
Contributor

The tests looks a lot better now, good job with them! I still couldn't run the project as I use podman and not docker but I'll look into what was missing

@StrongestNumber9
Copy link
Copy Markdown
Contributor

I was able to make it work with podman, opened ticket a ticket for enhancing documentation in teragrep/mvn_01#84

I think this looks acceptable but I would like @kortemik to give another sanity check

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Sep 12, 2025

Noticed an error in logdate definition for StreamDBClient queries. The epoch based logdate that is not affected by session timezone should actually be sourced from epoch_hour column instead of epoch_archived column.
Fixed in commit 13cfba8

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Sep 15, 2025

Rebased to main.

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Oct 1, 2025

Rebased to main.

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Nov 3, 2025

Rebased to main, updated the new SteamDBClient tests in the rebase.
MariaDB TestContainers implementation for StreamDBClient tests was added to main.

@Tiihott Tiihott requested a review from elliVM November 3, 2025 13:05
Copy link
Copy Markdown
Contributor

@elliVM elliVM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments

Comment thread src/main/java/com/teragrep/pth_06/planner/StreamDBClient.java Outdated
Comment thread src/test/java/com/teragrep/pth_06/planner/StreamDBClientTest.java
Comment thread src/test/java/com/teragrep/pth_06/planner/StreamDBClientTest.java
@Tiihott Tiihott requested a review from elliVM November 6, 2025 10:09
elliVM
elliVM previously approved these changes Nov 6, 2025
Copy link
Copy Markdown
Contributor

@elliVM elliVM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kortemik kortemik requested a review from eemhu November 19, 2025 11:43
@kortemik
Copy link
Copy Markdown
Member

please rebase

@Tiihott Tiihott added the review Issues or pull requests waiting for a review label Nov 19, 2025
eemhu
eemhu previously approved these changes Nov 20, 2025
@Tiihott Tiihott dismissed stale reviews from eemhu and elliVM via 2c5ffbc November 20, 2025 10:49
@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Nov 20, 2025

Rebased to main.

@kortemik
Copy link
Copy Markdown
Member

kortemik commented Nov 20, 2025

proper indexes are missing for logfile table regardin epoch_hour, coalesce will call non-indexed column epoch_hour and cause full table scan. additionally coalesce will cause the search to be non-search argument able meaning b-tree scan can not be used and therefore this approach will not work.

see more at https://dba.stackexchange.com/questions/162024/is-coalesce-sargable-now

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Feb 17, 2026

Rebased to main

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Feb 18, 2026

The original query implementation uses the logdate column to pull the logfiles from a specific date to the slicetable, and the logtime value sourced from path column is used to fetch all logfiles from a specific hour range from the slicetable where the logfiles were originally pulled.

As the logdate column is seemingly tied to the logtime value present in the path column, the query must be changed to use epoch_hour as the source for logdate value used in pulling logfiles to slicetable. Looking at the current epoch_hour implementation, it is indeed in my opinion better to drop both the logtime and logdate usages from the queries and use epoch_hour directly in both pulling logfiles to slicetable and fetching hour range from the slicetable.

The logdate is included in the result values of the queries, but it is dropped from the results during processing in ArchiveRangeProcessor.processRange(). So dropping the logdate from query results should also be fine.

Started implementing changes to the queries to see how much refactoring is needed to achieve the change to using only raw epoch_hour value in the queries.

@Tiihott Tiihott removed the review Issues or pull requests waiting for a review label Feb 18, 2026
@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Feb 18, 2026

Successfully removed the logdate from the query result values of StreamDBClient, cleaning up the logdateFunction usage from the class.
The logdateFunction is now only used in NestedTopNQuery to get the logfiles belonging to a specific date, looking into refactoring it to use raw epoch_hour values instead.

@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Feb 20, 2026

After implementing additional tests for ArchiveQueryProcessor class, I noticed that the EarlistCondition and LatestCondition classes must be refactored to use epoch_hour instead of logdate and parsed logtime in the condition.

Tiihott added 19 commits March 20, 2026 11:24
…me field creation. REBASE

Implemented coalesce with epoch_hour converted to date for defining logdate of the logfile. REBASE
…sing date arithmetics that completely ignores session timezone. Added epochHourTimezoneTest. REBASE
…ge of Result10 to Result9 with the Date result value removed.
…ndition with logfile.epoch_hour implementation. REBASE
@Tiihott
Copy link
Copy Markdown
Contributor Author

Tiihott commented Mar 27, 2026

epoch_hour usage has now been implemented to the earliest and latest conditions of the queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

5 participants