Reproducibility issue #2

@GameDisplayer

Description

I am currently unable to reproduce the results reported in the paper.

Below is the code I am using to compute the F1 score for the Weather dataset (located in dataset/weather):

from sklearn.metrics import f1_score
import pickle as pkl
import numpy as np

city = 'hs'

city_full_name = {
    'ny': 'New York City',
    'hs': 'Houston',
    'sf': 'San Francisco'
}

with open('indices.pkl', 'rb') as f:
    indices = pkl.load(f)

with open(f'rain_{city}.pkl', 'rb') as f:
    labels = pkl.load(f)

data_size = len(indices)

# 60/20/20 train/test/validation split (validation takes the remainder)
num_train = int(data_size * 0.6)
num_test = int(data_size * 0.2)
num_vali = data_size - num_train - num_test

seq_len_day = 1

idx_train = np.arange(num_train - seq_len_day)
idx_valid = np.arange(num_train - seq_len_day, num_train + num_vali - seq_len_day)
idx_test = np.arange(num_train + num_vali - seq_len_day, num_train + num_vali + num_test - seq_len_day)

res = []
for _i in idx_test:
    i = indices[_i]
    label = labels[_i]
    with open(f'gpt_predict_text/{city}_{i}.txt', 'r') as f:
        text = f.read()
        # check 'not rain' before 'rain', since 'rain' is a substring of 'not rain'
        if 'not rain' in text.lower():
            pred = False
        elif 'rain' in text.lower():
            pred = True
        else:
            print(f"Invalid prediction: {text}")
            continue
        res.append((pred, label))

y_true = [label for _, label in res]
y_pred = [pred for pred, _ in res]
print(f1_score(y_true, y_pred, average='micro'))
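To make the split arithmetic above concrete, here is a minimal dry run with an assumed `data_size` of 10 (the real value comes from `indices.pkl`):

```python
import numpy as np

# Assumed toy size, purely for illustration; the real value is len(indices).
data_size = 10
num_train = int(data_size * 0.6)             # 6
num_test = int(data_size * 0.2)              # 2
num_vali = data_size - num_train - num_test  # 2
seq_len_day = 1

idx_train = np.arange(num_train - seq_len_day)
idx_valid = np.arange(num_train - seq_len_day,
                      num_train + num_vali - seq_len_day)
idx_test = np.arange(num_train + num_vali - seq_len_day,
                     num_train + num_vali + num_test - seq_len_day)

print(idx_train.tolist())  # [0, 1, 2, 3, 4]
print(idx_valid.tolist())  # [5, 6]
print(idx_test.tolist())   # [7, 8] -- the final seq_len_day indices are never tested
```

If the paper's evaluation uses a window shifted by `seq_len_day` (e.g. indices 8–9 here), the test sets would differ, which alone could change the score.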

The scores I obtain do not match the values reported in Table 2 of the paper. Could you please share the evaluation code used to produce the reported results?
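One thing I would like to rule out: with binary labels, `average='micro'` reduces to plain accuracy, whereas papers often report the positive-class F1 (`average='binary'`). A minimal check with made-up labels:

```python
from sklearn.metrics import f1_score

# Made-up labels/predictions, purely for illustration.
y_true = [True, True, True, False, False]
y_pred = [True, False, True, False, True]

# Micro-averaged F1 pools TP/FP/FN over both classes, so for a
# binary problem it collapses to accuracy (3 of 5 correct here).
print(f1_score(y_true, y_pred, average='micro'))   # 0.6

# The positive-class F1 scores only the True ('rain') class:
# precision = 2/3, recall = 2/3, so F1 ≈ 0.667.
print(f1_score(y_true, y_pred, average='binary'))
```

Could you confirm which averaging mode was used for Table 2?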

Thanks for sharing the code and your work 🙏
