Add the ability to process the following modalities:

- [ ] Image
- [ ] Audio

Modify the architecture to process modalities with early-fusion as in [Chameleon](https://arxiv.org/abs/2405.09818).
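
To make the early-fusion idea concrete, here is a minimal sketch of how discrete tokens from several modalities could be merged into one sequence over a shared vocabulary, so that a single transformer sees them all. It assumes each modality already has a tokenizer that emits integer codes (for example, a VQ-style image tokenizer). The vocabulary sizes, the offset layout, the sentinel tokens, and the names `Segment` and `fuse` are hypothetical and are not taken from this repository or from the Chameleon code. Audio is included only because it is requested above; treating it like image tokens is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Illustrative sizes only; real values depend on the tokenizers chosen.
TEXT_VOCAB_SIZE = 32_000
IMAGE_CODEBOOK_SIZE = 8_192
AUDIO_CODEBOOK_SIZE = 1_024

# Hypothetical shared-vocabulary layout: [text | image | audio | sentinels]
IMAGE_OFFSET = TEXT_VOCAB_SIZE
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE
SENTINEL_OFFSET = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE
# Begin/end markers for image and audio spans (assumed, for illustration).
BOI, EOI, BOA, EOA = (SENTINEL_OFFSET + i for i in range(4))
SHARED_VOCAB_SIZE = SENTINEL_OFFSET + 4


@dataclass(frozen=True)
class Segment:
    modality: str          # "text", "image", or "audio"
    codes: Sequence[int]   # modality-local token ids


def _shift(codes: Sequence[int], size: int, offset: int) -> List[int]:
    """Validate modality-local ids and move them into the shared id space."""
    for c in codes:
        if not 0 <= c < size:
            raise ValueError(f"code {c} outside range [0, {size})")
    return [c + offset for c in codes]


def fuse(segments: Sequence[Segment]) -> List[int]:
    """Interleave segments, in order, into one shared-vocabulary sequence."""
    out: List[int] = []
    for seg in segments:
        if seg.modality == "text":
            out += _shift(seg.codes, TEXT_VOCAB_SIZE, 0)
        elif seg.modality == "image":
            out += [BOI] + _shift(seg.codes, IMAGE_CODEBOOK_SIZE, IMAGE_OFFSET) + [EOI]
        elif seg.modality == "audio":
            out += [BOA] + _shift(seg.codes, AUDIO_CODEBOOK_SIZE, AUDIO_OFFSET) + [EOA]
        else:
            raise ValueError(f"unknown modality: {seg.modality!r}")
    return out


if __name__ == "__main__":
    mixed = [
        Segment("text", [5, 7]),
        Segment("image", [0, 3]),
        Segment("audio", [1]),
    ]
    print(fuse(mixed))
```

Under this layout the model's embedding table and output head would be sized to `SHARED_VOCAB_SIZE`, and the rest of the transformer would stay modality-agnostic. Whether to use sentinel tokens, and how to order the vocabulary ranges, are design choices left open here.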