Conversation
Hey @arthurdrake1, looks very solid, I am honestly impressed by the complexity of the model :).
My comments are essentially some small improvements. I however have a more "meta comment" to make:
I believe the single-file implementation is not going to do well in this case. The file is currently too big to navigate. I have two suggestions to make the code easier to understand:
- put all the utils, network components, gpt (things you use in the algorithm but are not the core logic of the algorithm itself) in their own file. I'd keep the networks definitions (Encoder, Decoder, etc.) in the training script, though.
- have the different stages in different files / scripts. Currently, the script is doing a lot of stuff and instantiating a lot of pieces. I believe it might be easier to follow if we had different scripts with well-defined inputs and outputs.
What do you think? :)
engiopt/vqgan/evaluate_vqgan.py
Outdated
```python
# Restores the pytorch model from wandb
if args.wandb_entity is not None:
    artifact_path_0 = f"{args.wandb_entity}/{args.wandb_project}/{args.problem_id}_vqgan_cvqgan:seed_{seed}"
```
I believe having more explicit names than 0, 1, 2, e.g., artifact_transformer, might help keep track of these.
Ok, I just used 0, 1, 2 as an easier shorthand while coding, but I agree that spelling them out will help understanding.
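A minimal sketch of what the renaming could look like, building one self-describing artifact path per training stage. Only the cvqgan path template appears in the diff above; the other two stage tags and the function name are assumptions for illustration.

```python
def artifact_paths(entity: str, project: str, problem_id: str, seed: int) -> dict:
    """Build one wandb artifact path per training stage, with explicit names.

    Hypothetical sketch: the cvqgan tag matches the diff above; the vqgan and
    transformer tags are assumed to follow the same pattern.
    """
    stages = {
        "artifact_cvqgan": f"{problem_id}_vqgan_cvqgan",
        "artifact_vqgan": f"{problem_id}_vqgan_vqgan",
        "artifact_transformer": f"{problem_id}_vqgan_transformer",
    }
    return {name: f"{entity}/{project}/{tag}:seed_{seed}" for name, tag in stages.items()}
```

With explicit keys, each lookup site (restore, logging, evaluation) documents which stage it touches instead of relying on the reader remembering the 0/1/2 convention.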
```python
)

# Clean up conditions based on model training settings and convert back to tensor
sampled_conditions_new = sampled_conditions.select(range(len(sampled_conditions)))
```
I don't get what this does
This copies the original sampled_conditions into a new dataset that can then be normalized/cleaned prior to feeding into the model. We need this copy because (to the best of my knowledge) the later metrics.metrics() call requires the original conditions for its calculations.
Fair! The name is a bit cryptic then, though.
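A sketch of the keep-the-original / transform-a-copy pattern being discussed, with a more descriptive variable name than `sampled_conditions_new`. Plain Python lists stand in for the HF Dataset (whose full-index `.select()` amounts to a copy); the names and the example transformation are hypothetical.

```python
# Original conditions, kept untouched for the later metrics.metrics() call.
raw_conditions = [10.0, 20.0, 30.0]

# Work on a copy with a name that says what it is for; here a simple
# max-scaling stands in for whatever cleaning/normalization the model needs.
conditions_for_model = [c / max(raw_conditions) for c in list(raw_conditions)]
```

The rename costs nothing and removes the need for the explanatory comment: `conditions_for_model` says why the copy exists.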
.pre-commit-config.yaml
Outdated
```diff
 entry: pyright
 language: node
-pass_filenames: false
+pass_filenames: true
```
It was for a debug earlier, I will change it back
pyproject.toml
Outdated
```diff
 pythonVersion = "3.9"
 pythonPlatform = "All"
-typeshedPath = "typeshed"
+# typeshedPath = "typeshed" -> commented out may lead to precommit out of memory error
```
Also for pre-commit debug, I will undo this
engiopt/vqgan/vqgan.py
Outdated
```python
def __init__(self):
    super().__init__()
    self.register_buffer("shift", th.tensor([-0.030, -0.088, -0.188])[None, :, None, None])
```
These constants are coming from LPIPS?
Yes, they are just constants to normalize the input for the pretrained VGG model
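A sketch of what the `shift` buffer is doing: re-centering the input, channel by channel, into the statistics the pretrained VGG network expects. The shift values come from the diff above; the companion per-channel `scale` values and the function name are assumptions for illustration.

```python
# Per-channel normalization constants for the pretrained VGG input.
SHIFT = (-0.030, -0.088, -0.188)   # from the register_buffer call above
SCALE = (0.458, 0.448, 0.450)      # assumed companion constants, one per channel

def vgg_normalize_pixel(rgb: tuple) -> tuple:
    """Apply (x - shift) / scale channel-wise to one RGB pixel."""
    return tuple((x - s) / sc for x, s, sc in zip(rgb, SHIFT, SCALE))
```

In the module itself the same arithmetic runs as a broadcasted tensor op, which is why the constants are reshaped to `[None, :, None, None]` (batch, channel, height, width).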
engiopt/vqgan/vqgan.py
Outdated
```python
###########################################
########## GPT-2 BASE CODE BELOW ##########
```
I'd put this into its own file
engiopt/vqgan/vqgan.py
Outdated
```python
    th.as_tensor(training_ds["optimal_upsampled"][:]).to(device),
    *[th.as_tensor(training_ds[key][:]).to(device) for key in conditions],
)
dataloader_0 = th.utils.data.DataLoader(
```
Why 3 times the same thing?
The different model stages could have different batch sizes.
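The triplicated construction could still be collapsed into one helper that takes the per-stage batch size as a parameter. A pure-Python sketch (a stand-in for `th.utils.data.DataLoader`; the stage names and batch-size values are hypothetical):

```python
def make_batches(data: list, batch_size: int) -> list:
    """Split data into consecutive batches of at most batch_size items."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# One loader per training stage, differing only in batch size (assumed values).
stage_batch_sizes = {"vqgan": 16, "cvqgan": 32, "transformer": 8}
loaders = {stage: make_batches(list(range(10)), bs) for stage, bs in stage_batch_sizes.items()}
```

The same shape works with the real `DataLoader`: a small factory function taking `(dataset, batch_size)` keeps the three call sites to one line each.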
engiopt/vqgan/vqgan.py
Outdated
```python
    name=run_name,
    dir="./logs/wandb",
)
wandb.define_metric("0_step", summary="max")
```
Yes, in this special multi-stage training case, to have separate training step counts for each stage.
@ffelten Going to make the minor edits now. I will also move all the "helper modules" like nanogpt into their own file. As for separating out the different training phases, this would be more complicated as the current logic is built on the single-script implementation, i.e., easily reusing the trained vqgan and cvqgan to then train the transformer.
ffelten left a comment
Some minor stuff. I did not read all the details but it looks clean enough to me!
engiopt/vqgan/vqgan.py
Outdated
```python
cond_lr: float = 2e-4  # Default: 2e-4
"""learning rate for CVQGAN"""
latent_size: int = 16
"""size of the latent feature map (automatically determined later)"""
```
Why is this determined later? Also, is this "automated determination" reflected in the wandb config (hyperparams) ?
The latent feature map dimension is dictated by the number of encoder and decoder layers, so in practice the user has to increase or decrease those to change latent_size. The number of image channels is more straightforward; we just update it based on the number of channels in the data (always 1 in our case, for now).
I checked the wandb config for a test run I did just now with bogus values (--latent_size 123123 and --image_channels 123123). They remained at 16 and 1 respectively in the hyperparams list on wandb.
@ffelten Upon further review of the code, I don't think these args are needed. They're just calculated as normal variables and override any actual user inputs as I determined above. Should be good to go now.
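A sketch of why `latent_size` is derived rather than user-set: each downsampling layer in the encoder halves the spatial resolution, so the latent feature-map size is fixed by the image size and the layer count. The 128 → 16 case matches the 128x128 resize and `latent_size: int = 16` default mentioned in this PR; the function name and the layer count are assumptions.

```python
def derived_latent_size(image_size: int, num_downsample_layers: int) -> int:
    """Each encoder downsampling layer halves the feature-map side length."""
    return image_size // (2 ** num_downsample_layers)

# e.g. a 128x128 input with 3 downsampling layers yields a 16x16 latent map
latent = derived_latent_size(128, 3)
```

This is why a `--latent_size 123123` flag had no effect: the value is recomputed from the architecture, so dropping the arg (as done above) removes the misleading knob.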
engiopt/vqgan/vqgan.py
Outdated
```python
"""number of epochs of training"""
batch_size_vqgan: int = 16
"""size of the batches for Stage 1"""
lr_vqgan: float = 5e-5  # Default: 2e-4
```
I'm not sure you want to keep the default comment? There are some others for different hyperparams, too
Yes, thanks. This was for debugging, so I removed it now.
Adds the VQGAN training and evaluation with some tuned hyperparameters and using an Online Clustered Codebook to promote 100% codebook utilization.
Also includes three new functions in transforms.py that could be useful for most/all EngiOpt models:
- resize_to provides a standardized way of resizing tensors, i.e., to 128x128 prior to model training and vice versa
- normalize to normalize the conditions for zero mean and unit std
- drop_constant to remove any condition columns that remain constant throughout the data (i.e., overhang_constraint = 0 for all of beams2d)

The latter two are optional and can be disabled in the args for VQGAN.
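A hedged sketch of the two optional transforms described above, written as plain-Python stand-ins (the real versions in transforms.py presumably operate on tensors; these signatures are assumptions):

```python
import statistics

def normalize(column: list) -> list:
    """Rescale a condition column to zero mean and unit std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]

def drop_constant(columns: dict) -> dict:
    """Remove condition columns that hold a single value throughout the data."""
    return {name: col for name, col in columns.items() if len(set(col)) > 1}
```

drop_constant matters for cases like beams2d's always-zero overhang constraint: a constant column carries no information for the model and would also make normalize divide by a zero std.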