# Rating scale dataset

This document goes through a few open-source rating datasets and explains their characteristics.

### MovieLens 100K

MovieLens is a non-commercial website that helps users find movies they might like. It is run by GroupLens, a research lab at the University of Minnesota. The MovieLens datasets can be found on the [GroupLens website](https://grouplens.org/datasets/movielens/). The 100K dataset has 100,000 observed ratings.

The 100K and the 1M datasets are among those classified by GroupLens as "older datasets". The dataset they recommend for new research is the 25M dataset from 2019, which is far more recent than the 100K dataset from 1998. Sadly, datasets of that size are too much for my hardware to handle, so I can't visualise them the way I did the smaller ones in this file.

```{r, fig.width=10, fig.height=11, echo=FALSE}
knitr::include_graphics("MovieLens100/MovieLens100.png")
```

### MovieLens 1M

MovieLens 1M also comes from GroupLens. It dates from 2003, so it is slightly more recent than the 100K dataset. It contains 1 million observations.

```{r, fig.width=10, fig.height=11, echo=FALSE}
knitr::include_graphics("MovieLens1M/MovieLens1M.png")
```

### MovieLens 25M

Also from GroupLens, this dataset contains 25 million observed ratings and is the dataset GroupLens recommends for new research. It is also fairly popular in recent papers.

```{r, fig.show="hold", out.width="50%", echo=FALSE}
load("MovieLens25M/MovieLens25Minfo.RData")
print(informationDf)
knitr::include_graphics("MovieLens25M/MovieLens25Mratings.png")
knitr::include_graphics("MovieLens25M/MovieLens25Muser_mean_plot.png")
knitr::include_graphics("MovieLens25M/MovieLens25Mitem_mean_plot.png")
knitr::include_graphics("MovieLens25M/MovieLens25Muser_sd_plot.png")
knitr::include_graphics("MovieLens25M/MovieLens25Mitem_sd_plot.png")
```

### Jester

Jester is a research project from the UC Berkeley Laboratory for Automation Science and Engineering. It is a joke recommendation system in which users rate 100 jokes on a scale from -10.0 to 10.0. The dataset includes only users who have rated more than 35 jokes, although a different portion of the dataset is available for those who rated between 15 and 35 jokes. If you end up using this dataset, email the owner as a courtesy.

For Jester, four datasets are available; the one shown here is the first one listed on the [Jester dataset page](https://eigentaste.berkeley.edu/dataset/).

Regarding the ratings: "To rate items, users are asked to click their mouse on a horizontal “ratings bar” which returns scalar values. While technically not continuous (limited by the granularity of HTML image maps), we can distinguish approximately 200 levels of ratings in the scale." This explains why -0.29 is the most common value.


```{r, fig.width=10, fig.height=11, echo=FALSE}
knitr::include_graphics("Jester/Jester.png")
```

### Personality

Also from the GroupLens website, the Personality dataset comes from [Nguyen et al. 2018](https://doi.org/10.1007/s10796-017-9782-y), a paper that examines people's satisfaction with recommendation systems. The source of these ratings is also MovieLens, where users were shown an invitation to participate in the study. Only users who had rated 15 or more movies were eligible, as MovieLens generally only starts recommending movies once 15 or more have been rated.

```{r, fig.width=10, fig.height=11, echo=FALSE}
knitr::include_graphics("personality/personality.png")
```

### Amazon Grocery and Gourmet Food

This dataset is an updated version of the Amazon review dataset released in 2014. It was retrieved from [here](https://nijianmo.github.io/amazon/) and is one of the recommended "small" subsets. These data have been reduced to extract the k-core, such that each of the remaining users and items has at least k reviews. Interestingly, the data is very heavily biased towards 5-star ratings.
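
To make the k-core idea concrete, here is a minimal base R sketch. It is only an illustration and not the procedure used to build the Amazon subsets: the function name `k_core`, the column names `user` and `item`, and the toy data are all assumptions made for this example. Filtering has to be repeated, because dropping a sparse item can push a user below the threshold and vice versa.

```r
# Hypothetical helper: keep only rows whose user AND item both have
# at least k ratings, repeating until no further rows are dropped.
# `ratings` is assumed to be a data frame with columns `user` and `item`.
k_core <- function(ratings, k) {
  repeat {
    user_counts <- table(ratings$user)
    item_counts <- table(ratings$item)
    # Look up each row's user and item count by name
    keep <- as.vector(user_counts[as.character(ratings$user)]) >= k &
      as.vector(item_counts[as.character(ratings$item)]) >= k
    if (all(keep)) break
    ratings <- ratings[keep, , drop = FALSE]
  }
  ratings
}

# Toy data (made up): user u3 and item c each appear only once
toy <- data.frame(
  user = c("u1", "u1", "u2", "u2", "u3"),
  item = c("a", "b", "a", "b", "c"),
  rating = c(5, 4, 5, 3, 1)
)

core <- k_core(toy, k = 2)
print(core)
```

With `k = 2`, the single rating by `u3` on item `c` is removed, and the remaining four ratings form the 2-core.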

```{r, fig.show="hold", out.width="50%", echo=FALSE}
load("AmazonFood/AmazonFoodinfo.RData")
print(informationDf)
knitr::include_graphics("AmazonFood/AmazonFoodratings.png")
knitr::include_graphics("AmazonFood/AmazonFooduser_mean_plot.png")
knitr::include_graphics("AmazonFood/AmazonFooditem_mean_plot.png")
knitr::include_graphics("AmazonFood/AmazonFooduser_sd_plot.png")
knitr::include_graphics("AmazonFood/AmazonFooditem_sd_plot.png")
```