Implement support for new epoch columns to StreamDBClient queries#216
Implement support for new epoch columns to StreamDBClient queries#216Tiihott wants to merge 22 commits into
Conversation
I think this is not very good approach as this could introduce security issues for workstation users when running improperly configured MariaDB, like unauthenticated access but open to the internet. Also while the CI starts from scratch everytime using well known steps, the database developer is using might get out of sync with schema changes or have lingering data which makes the tests unreliable. Example of the actual issue with out of date database like I had: Concider implementing it like described in https://github.com/teragrep/mvn_01/tree/main/testcontainers |
StrongestNumber9
left a comment
There was a problem hiding this comment.
The tests requires quite a lot of cleaning and adding the testcontainers as they are very mentally expensive to parse, especially with the duplicated codeblocks
| HostRecord hostRecord = new HostRecord(UShort.valueOf(1), "testHost1"); | ||
| ctx.insertInto(JOURNALDB.HOST).set(hostRecord).execute(); | ||
|
|
||
| // Set logdate to 2023-10-04 instead of the correct 2023-10-05 to emulate timezone issues, and test if epoch_hour takes priority or not. |
There was a problem hiding this comment.
Sounds like something worth of two distinct tests
There was a problem hiding this comment.
Changed the test configuration to use predetermined UTC-4 timezone for all tests instead of whatever the system the test runs on decides to use as session timezone.
epochHourNullTest() now tests if the results are affected by the UTC-4 session timezone as expected when old logdate/logtime implementation is used.
epochHourTimezoneTest() tests if logtime and logdate results are unaffected by the UTC-4 timezone.
| long logtime = hourRange.get(0).get(8, Long.class); | ||
| Assertions.assertEquals(1696471200L, logtime); | ||
| Date logdate = hourRange.get(0).get(5, Date.class); |
There was a problem hiding this comment.
Extracting these hourRanges to single use variable looks a bit noisier than it has to be
There was a problem hiding this comment.
Removed single use variables in commit 8f96a57
| long logtime = hourRange.get(0).get(8, Long.class); | ||
| Assertions.assertEquals(1696471200L, logtime); | ||
| Date logdate = hourRange.get(0).get(5, Date.class); | ||
| Assertions.assertEquals(Date.valueOf(date), logdate); |
There was a problem hiding this comment.
Can this Date.valueOf(date) be changed into literal of some sort instead of using output of a function with a variable passed as the value? This is to avoid some bizarre scenario where both Date.valueOf(date) and logdate ends up being null or 0 or something else similarly unexpected
There was a problem hiding this comment.
Changed Date.valueod() to literal string format instead of LocalDate in commit 8f96a57
| "example.log-@1696471200-2023100423.log.gz", | ||
| new Timestamp(2025, 8, 13, 16, 18, 22, 0), | ||
| ULong.valueOf(120L), | ||
| "e2I8CnejWweTSk8tmo4tNkDO2fU7RajqI111FPlF7Mw=", |
There was a problem hiding this comment.
What is this value?
There was a problem hiding this comment.
Random sha256Checksum value, replaced with simple description string in commit 381c1d3
| new Timestamp(2025, 8, 13, 16, 18, 22, 0), | ||
| ULong.valueOf(120L), | ||
| "e2I8CnejWweTSk8tmo4tNkDO2fU7RajqI111FPlF7Mw=", | ||
| "5ddea0496b0a9ad1266b26da3de9f573", |
There was a problem hiding this comment.
What is this value?
There was a problem hiding this comment.
Random archiveEtag value, replaced with simple description string in commit 381c1d3
| LogfileRecord logfileRecord = new LogfileRecord( | ||
| ULong.valueOf(1), | ||
| Date.valueOf(LocalDate.of(2023, 10, 4)), | ||
| new Date(2125 - 1900, 7, 13), |
There was a problem hiding this comment.
Looks like magic numbers
There was a problem hiding this comment.
replaced with easier to read Date.valueOf() in commit 381c1d3
| // Init mandatory Config object with the minimum options required for testing StreamDBClient. | ||
| Map<String, String> opts = new HashMap<>(); | ||
| opts.put("S3endPoint", "mock"); | ||
| opts.put("S3identity", "mock"); | ||
| opts.put("S3credential", "mock"); | ||
| opts.put("DBusername", streamDBUsername); | ||
| opts.put("DBpassword", streamDBPassword); | ||
| opts.put("DBurl", streamDBUrl); | ||
| opts.put("DBstreamdbname", streamdbName); | ||
| opts.put("DBjournaldbname", journaldbName); | ||
| opts.put("queryXML", "<index value=\"example\" operation=\"EQUALS\"/>"); | ||
| opts.put("archive.enabled", "true"); | ||
| Config config = new Config(opts); | ||
| // Add test data to journaldb and streamdb, there are several tables in both databases. | ||
| Settings settings = new Settings() | ||
| .withRenderMapping(new RenderMapping().withSchemata(new MappedSchema().withInput("streamdb").withOutput(config.archiveConfig.dbStreamDbName), new MappedSchema().withInput("journaldb").withOutput(config.archiveConfig.dbJournalDbName), new MappedSchema().withInput("bloomdb").withOutput(config.archiveConfig.bloomDbName))); | ||
| final Connection connection = Assertions | ||
| .assertDoesNotThrow( | ||
| () -> DriverManager | ||
| .getConnection( | ||
| config.archiveConfig.dbUrl, config.archiveConfig.dbUsername, | ||
| config.archiveConfig.dbPassword | ||
| ) | ||
| ); | ||
| final DSLContext ctx = DSL.using(connection, SQLDialect.MYSQL, settings); | ||
|
|
||
| BucketRecord bucketRecord = new BucketRecord(UShort.valueOf(1), "bucket1"); | ||
| ctx.insertInto(JOURNALDB.BUCKET).set(bucketRecord).execute(); | ||
|
|
||
| ctx | ||
| .query(String.format("INSERT INTO %s.category (id, name) VALUES (1, 'testCategory')", journaldbName)) | ||
| .execute(); | ||
| ctx | ||
| .query( | ||
| String | ||
| .format( | ||
| "INSERT INTO %s.source_system (id, name) VALUES (2, 'testSourceSystem2')", | ||
| journaldbName | ||
| ) | ||
| ) | ||
| .execute(); | ||
|
|
There was a problem hiding this comment.
These could be deduplicated either in beforeEach or as some buildDSLContext function. Repeating 40 idential lines per test is not very easy to mentally parse
|
Implemented testcontainers to the tests. As testcontainers do not support managing several databases inside the same testcontainer using the methods available on the testcontainer object, the streamdb database had to be initialized using a separate CREATE_STREAMDB_DB.sql script that is called during container startup. Note: New databases can't be created after the container is already started as the query results in error with access denied for the user. |
|
Cleaned up beforeEach further by moving all journaldb table creation and test data population inside the script that is called during container startup. |
|
The tests looks a lot better now, good job with them! I still couldn't run the project as I use podman and not docker but I'll look into what was missing |
|
I was able to make it work with podman, opened ticket a ticket for enhancing documentation in teragrep/mvn_01#84 I think this looks acceptable but I would like @kortemik to give another sanity check |
|
Noticed an error in logdate definition for StreamDBClient queries. The epoch based logdate that is not affected by session timezone should actually be sourced from epoch_hour column instead of epoch_archived column. |
13cfba8 to
c064d8b
Compare
|
Rebased to main. |
|
Rebased to main. |
|
Rebased to main, updated the new SteamDBClient tests in the rebase. |
|
please rebase |
|
Rebased to main. |
|
proper indexes are missing for logfile table regardin epoch_hour, coalesce will call non-indexed column epoch_hour and cause full table scan. additionally coalesce will cause the search to be non-search argument able meaning b-tree scan can not be used and therefore this approach will not work. see more at https://dba.stackexchange.com/questions/162024/is-coalesce-sargable-now |
|
Rebased to main |
|
The original query implementation uses the logdate column to pull the logfiles from a specific date to the slicetable, and the logtime value sourced from path column is used to fetch all logfiles from a specific hour range from the slicetable where the logfiles were originally pulled. As the logdate column is seemingly tied to the logtime value present in the path column, the query must be changed to use epoch_hour as the source for logdate value used in pulling logfiles to slicetable. Looking at the current epoch_hour implementation, it is indeed in my opinion better to drop both the logtime and logdate usages from the queries and use epoch_hour directly in both pulling logfiles to slicetable and fetching hour range from the slicetable. The logdate is included in the result values of the queries, but it is dropped from the results during processing in ArchiveRangeProcessor.processRange(). So dropping the logdate from query results should also be fine. Started implementing changes to the queries to see how much refactoring is needed to achieve the change to using only raw epoch_hour value in the queries. |
|
Successfully removed the logdate from the query result values of StreamDBClient, cleaning up the logdateFunction usage from the class. |
|
After implementing additional tests for ArchiveQueryProcessor class, I noticed that the EarlistCondition and LatestCondition classes must be refactored to use epoch_hour instead of logdate and parsed logtime in the condition. |
…me field creation. REBASE Implemented coalesce with epoch_hour converted to date for defining logdate of the logfile. REBASE
…hen available. REBASE
…sing date arithmetics that completely ignores session timezone. Added epochHourTimezoneTest. REBASE
…TableNullEpochTest().
…ge of Result10 to Result9 with the Date result value removed.
…ndition with logfile.epoch_hour implementation. REBASE
|
epoch_hour usage has now been implemented to the earliest and latest conditions of the queries. |
Description
The aim of this pull request is to implement support for new epoch time columns to logfile metadata queries done by StreamDBClient. Original logdate and logtime implementations are prone to timezone issues where the result of the query changes depending on the session timezone, using epoch time resolves this issue.
Includes:
Resolves #213
Resolves #214
Resolves #284
Checklists
Testing
General
Assertions
Testing Data
Statements
Java
Other
Code Quality