<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>COMP5421 - Computer Vision</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 0;
padding: 0;
line-height: 1.6;
}
nav {
background-color: #2c3e50;
color: white;
padding: 1rem;
position: fixed;
width: 100%;
top: 0;
}
nav a {
color: white;
text-decoration: none;
margin-right: 20px;
}
.container {
max-width: 1200px;
margin: 80px auto 40px;
padding: 0 20px;
}
section {
margin-bottom: 40px;
background-color: #f9f9f9;
padding: 20px;
border-radius: 8px;
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
h1, h2 {
color: #2c3e50;
}
footer {
background-color: #2c3e50;
color: white;
text-align: center;
padding: 1rem;
position: relative;
bottom: 0;
width: 100%;
}
</style>
</head>
<body>
<nav>
<a href="#description">Project Description</a>
<!-- <a href="#instructor">Instructor</a> -->
<!-- <a href="#tas">TAs</a> -->
<a href="#Groups">Groups</a>
<a href="#resources">Resources</a>
<a href="#tutorials">Tutorials</a>
</nav>
<div class="container">
<section id="description">
<h1>COMP5421 - Computer Vision - Spring 2026</h1>
<h2>Instructor</h2>
<ul>
<li>Prof. Long QUAN</li>
<li>Email: <a href="mailto:quan@cse.ust.hk">quan@cse.ust.hk</a></li>
</ul>
<h2>TAs</h2>
<ul>
<li>Bu JIN (bjinaa@connect.ust.hk)</li>
<li>Xiangjun GAO (xgaobq@cse.ust.hk)</li>
</ul>
<h2>Project Description</h2>
<h3>Topic: Visual Generation</h3>
<p>
This semester's project focuses on visual generation, encompassing any research or applied topic related to generative visual models. The aim is to explore, implement, and present creative or technical advancements in this area.
</p>
<p>You are expected to prepare preliminary results and give <strong>a mid-term presentation</strong> around the middle of the semester. You will then continue the same project, explore additional directions, and deliver <strong>a final presentation</strong> at the end of the semester.
</p>
<p>
You are encouraged to explore any topic related to visual generative models, such as image generation, image super-resolution, image inpainting, video generation, 3D generation, or other relevant areas.
</p>
<p>
You may work in groups of 1-2 students.
</p>
<h2 id="Groups">Groups</h2>
<p>
You can use this <a href="https://docs.google.com/spreadsheets/d/1uBQzrivSv8ES2qwmtLC94I27ETpArk7ywW-nVkT2vEI/edit?usp=sharing">sign-up sheet</a> to enter your group information or find a partner.
</p>
<h2>
Project Presentation and Report
</h2>
<p>
The mid-term presentations will be held on March 27 and April 1. You may indicate your preferred presentation time in the group table; however, scheduling preferences cannot be guaranteed. If you have special circumstances (e.g., attending a conference and needing to present on a specific date), please note this in the Remarks column.
</p>
<p>
In summary, you should prepare to <strong>present</strong> on March 27 or April 1 and <strong>submit a report</strong> to the TAs (bjinaa@connect.ust.hk) by April 5. We provide a sample <a href="https://github.com/jxbbb/Generative-Models-Tutorial/blob/main/COMP5421_Course_Report_Template.zip">template</a> for your reference; however, there are no strict requirements on format or content. Feel free to use your own style and ideas. You are encouraged to be creative and present your work in the way that best showcases your project.
</p>
</section>
<section id="resources">
<h2>Course Resources</h2>
<h3>Deep Generative Models</h3>
<p>
Deep generative models are a class of methods that use deep neural networks to learn and represent the underlying distribution of data. In recent years, these models have achieved remarkable progress in various domains, including image and video synthesis, 3D generation, and beyond. Prominent examples include Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), autoregressive models, flow-based models, and diffusion models.
</p>
<p>
You may use <strong>any model</strong> in your project, provided it achieves satisfactory performance.
</p>
<p>
There are some useful tutorials on deep generative models:
</p>
<ul>
<li><a href="https://deepgenerativemodels.github.io/notes/index.html">Stanford CS236: Deep generative models</a></li>
<li>Lil'Log: <a href="https://lilianweng.github.io/posts/2021-07-11-diffusion-models/">Diffusion Models</a>, <a href="https://lilianweng.github.io/posts/2018-08-12-vae/">VAEs</a>, <a href="https://lilianweng.github.io/posts/2017-08-20-gan/">GANs</a>, <a href="https://lilianweng.github.io/posts/2018-10-13-flow-models/">flow-based models</a></li>
<li><a href="https://yang-song.net/blog/2021/score/">Score-based generative modeling</a> by Yang Song</li>
</ul>
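<p>
The forward (noising) process at the heart of the diffusion models covered in these tutorials can be sketched in a few lines of plain Python. This is an illustrative sketch of the DDPM closed-form q(x_t | x_0) with a linear beta schedule; the schedule constants and variable names are our own choices, not a prescribed implementation.
</p>

```python
import math
import random

# Linear noise schedule (DDPM-style): beta_t grows from 1e-4 to 0.02 over T steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar_t = prod_{s <= t} (1 - beta_s): fraction of clean signal surviving at step t.
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def q_sample(x0, t, eps=None):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
```

<p>
At t = 0 the sample is almost the clean input; by t = T - 1 it is nearly pure Gaussian noise. The reverse (denoising) network is trained to invert exactly this process, typically by predicting eps from x_t and t.
</p>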
<h3>
Dataset
</h3>
<p>
We list some commonly used datasets for generative models.
</p>
<ul>
<li><strong>MNIST:</strong> A small dataset of 70,000 grayscale handwritten digits, each 28x28 pixels.</li>
<li><strong>CIFAR-10:</strong> The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes.</li>
<li><strong>CelebA:</strong> CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations.</li>
<li><strong>Tiny-ImageNet:</strong> Tiny ImageNet is a subset of the ImageNet dataset, containing 100,000 images across 200 classes (500 per class), downsized to 64x64 color images.</li>
<li><strong>UCF101:</strong> UCF101 is an action recognition dataset of realistic action videos, with 13,320 videos from 101 action categories.</li>
<li><strong>ShapeNetCore:</strong> ShapeNetCore covers 55 common object categories with about 51,300 unique 3D models.</li>
</ul>
<p>
MNIST, CIFAR-10, CelebA, and UCF101 can be loaded easily via <strong><em>torchvision.datasets</em></strong>.
</p>
<p>
For Tiny-ImageNet, you can download it from <a href="https://huggingface.co/datasets/zh-plus/tiny-imagenet">Hugging Face</a>.
</p>
<p>
For ShapeNetCore, you can log in to the <a href="https://shapenet.org/" target="_blank">ShapeNet</a> website to apply for access, or directly download a preprocessed version released by prior work (for example, <a href="https://github.com/1zb/3DShape2VecSet" target="_blank">3DShape2VecSet</a>).
</p>
</section>
<section id="tutorials">
<h2>Tutorials & Links</h2>
<p>
We provide two simple implementations for diffusion models and flow matching models to help you get started quickly if you don't have prior knowledge.
</p>
<ul>
<li>
Here is a <a href="https://github.com/jxbbb/Generative-Models-Tutorial/blob/main/diffusion.ipynb">simple implementation</a> of a diffusion model using <a href="https://huggingface.co/docs/diffusers/en/index">diffusers</a>.
</li>
<li>
For flow-based models, we recommend the torchcfm <a href="https://github.com/atong01/conditional-flow-matching/blob/main/examples/2D_tutorials/Flow_matching_tutorial.ipynb">2D toy example</a> and <a href="https://github.com/atong01/conditional-flow-matching/blob/main/examples/images/mnist_example.ipynb">MNIST example</a>.
</li>
</ul>
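<p>
The flow-matching objective used in the torchcfm examples above reduces to a simple regression target. Below is a stdlib-only sketch of the conditional probability path with a straight-line interpolant (the function name is our own); the network would be trained to predict the constant velocity x1 - x0.
</p>

```python
import random

def sample_cfm_pair(x0, x1, t=None):
    """Conditional flow matching with a linear path:
    x_t = (1 - t) * x0 + t * x1, target velocity u_t = x1 - x0."""
    if t is None:
        t = random.random()  # training time is sampled uniformly in [0, 1]
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return t, xt, target
```

<p>
A model v(x_t, t) is regressed onto the target with an MSE loss; at inference time, samples are drawn by integrating dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with an ODE solver.
</p>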
<p>
Additionally, we recommend several <strong>open-source</strong> methods for reference:
</p>
<p>
Image synthesis:
</p>
<ul>
<li>
LDM: <a href="https://github.com/CompVis/latent-diffusion">High-Resolution Image Synthesis with Latent Diffusion Models.</a>
</li>
<li>
DiT: <a href="https://github.com/facebookresearch/DiT">DiT: Scalable Diffusion Models with Transformers.</a>
</li>
<li>
VAR: <a href="https://github.com/FoundationVision/VAR">Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction.</a>
</li>
</ul>
<p>
Video synthesis:
</p>
<ul>
<li>
Open-Sora: <a href="https://github.com/hpcaitech/Open-Sora">Open-Sora: Democratizing Efficient Video Production for All.</a>
</li>
<li>
Wan: <a href="https://github.com/Wan-Video/Wan2.2">Wan: Open and Advanced Large-Scale Video Generative Models.</a>
</li>
<li>
DynamiCrafter: <a href="https://github.com/Doubiiu/DynamiCrafter">DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors.</a>
</li>
</ul>
<p>
3D generation:
</p>
<ul>
<li>
Trellis.2: <a href="https://github.com/microsoft/TRELLIS.2">Native and Compact Structured Latents for 3D Generation.</a>
</li>
<li>
Direct3D-S2: <a href="https://github.com/DreamTechAI/Direct3D-S2">Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention.</a>
</li>
</ul>
</section>
</div>
<footer>
<p>© 2026 COMP5421 - Computer Vision Course. All rights reserved.</p>
</footer>
</body>
</html>