Every end-to-end evaluation in the Quartz paper runs for 24 hours. To make improvements in future commits visible, it would be a good idea to add some smaller benchmarks that finish in a few minutes.