Goodreads Book Analysis and Network Construction

Introduction

This project analyzes the Goodreads book dataset and constructs several networks from it. The main objectives are:

Construct a user network based on book reviews.
Construct a network of similar books based on book data.
Analyze the popularity of books and genres over time.
Perform network analysis and visualization on the constructed networks.

Dataset

The project uses the Goodreads book dataset, which contains information about books, authors, and user reviews. The dataset is in JSON format and is expected in the dataset directory. The relevant files are:

goodreads_books.json: Contains information about books, including title, authors, publisher, publication year, and similar books.
goodreads_reviews_dedup.json: Contains user reviews, including book ID, user ID, rating, and date added.
goodreads_book_genres_initial.json: Contains genre information for books.
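
The files above are large, so it can help to read them one record at a time. The following is a minimal sketch, assuming each file stores one JSON object per line; that layout, the helper name `iter_records`, and the field names in the sample are assumptions, not something this README states.

```python
import io
import json
from typing import Iterator, Optional, TextIO


def iter_records(handle: TextIO, limit: Optional[int] = None) -> Iterator[dict]:
    """Yield one dict per non-empty line, stopping after `limit` records."""
    yielded = 0
    for line in handle:
        if limit is not None and yielded >= limit:
            break
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)
        yielded += 1


# Tiny in-memory stand-in for a dataset file (made-up records).
sample = io.StringIO(
    '{"book_id": "1", "title": "A", "publication_year": "1999"}\n'
    "\n"
    '{"book_id": "2", "title": "B", "publication_year": "2005"}\n'
    '{"book_id": "3", "title": "C", "publication_year": "2011"}\n'
)
first_two = list(iter_records(sample, limit=2))
```

With a real file, the same function can be given an open file handle, for example one opened on dataset/goodreads_books.json. The `limit` argument is handy for trying the scripts on a small sample first.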

1. `authors_network.py`

This script creates a network of authors based on their co-authorship of books. The main tasks performed by this script are:

Building an undirected graph G where nodes represent authors, and edges connect authors who have co-authored at least one book.
Calculating network properties such as density, number of edges and nodes, average degree, connected components, and average clustering coefficient.
Visualizing the degree distribution of the network.
Identifying and visualizing communities within the network using the Louvain method for community detection.
Calculating and visualizing PageRank scores to identify influential authors.
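
The graph construction and the basic properties listed above can be sketched in plain Python. This is an illustration only, not the code in `authors_network.py`: it assumes each book record carries a simple list of author identifiers under a hypothetical `authors` key, and the function names are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(books):
    """Return {author: set(co_authors)} for an undirected co-authorship graph."""
    adjacency = defaultdict(set)
    for book in books:
        authors = sorted(set(book["authors"]))
        for author in authors:
            adjacency[author]  # touching the key registers solo authors as isolated nodes
        for a, b in combinations(authors, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def summarize(adjacency):
    """Node count, edge count, density, and average degree of the graph."""
    n = len(adjacency)
    m = sum(len(nbrs) for nbrs in adjacency.values()) // 2  # each edge is stored twice
    return {
        "nodes": n,
        "edges": m,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "average_degree": 2 * m / n if n else 0.0,
    }


# Made-up sample: one three-author book, one two-author book, one solo book.
books = [
    {"authors": ["a", "b", "c"]},
    {"authors": ["c", "d"]},
    {"authors": ["e"]},
]
graph = build_coauthor_graph(books)
stats = summarize(graph)
```

NetworkX offers the same quantities directly (`nx.Graph`, `nx.density`, `nx.connected_components`, `nx.average_clustering`), which is likely the more convenient route on the full dataset.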

2. `Goodreads.py`

This script focuses on analyzing the book similarity network and identifying the most popular and best-rated authors. The main tasks performed by this script are:

Loading the goodreads_books.json dataset and creating a DataFrame books.
Creating an interaction dataset by removing irrelevant columns from books.
Building a graph G where nodes represent books, and edges connect books that are listed as similar.
Visualizing the book similarity network.
Detecting communities within the network using the Louvain method.
Identifying the most popular authors based on the number of connections (similar books) in the network.
Identifying the best-rated authors based on the average rating of their books.
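
The two author rankings can be sketched as follows. This is a rough illustration under stated assumptions: each book has a single `author` string, an `average_rating` number, and a `similar_books` list of book IDs. These field names and the sample records are hypothetical, and the real script may aggregate differently.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample records.
books = [
    {"book_id": "1", "author": "Ann", "average_rating": 4.5, "similar_books": ["2", "3"]},
    {"book_id": "2", "author": "Ann", "average_rating": 3.5, "similar_books": ["1"]},
    {"book_id": "3", "author": "Bob", "average_rating": 4.2, "similar_books": []},
]

# Undirected similarity edges; a frozenset makes (1, 2) and (2, 1) the same edge.
edges = set()
for book in books:
    for other in book["similar_books"]:
        if other != book["book_id"]:
            edges.add(frozenset((book["book_id"], other)))

# Degree of each book in the similarity network.
degree = defaultdict(int)
for edge in edges:
    for node in edge:
        degree[node] += 1

# Popularity: total connections of an author's books. Rating: mean over their books.
connections = defaultdict(int)
ratings = defaultdict(list)
for book in books:
    connections[book["author"]] += degree[book["book_id"]]
    ratings[book["author"]].append(book["average_rating"])

most_popular = sorted(connections.items(), key=lambda kv: (-kv[1], kv[0]))
best_rated = sorted(
    ((author, mean(values)) for author, values in ratings.items()),
    key=lambda kv: (-kv[1], kv[0]),
)
```

On this sample, Ann ranks first by connections while Bob ranks first by rating, which shows that the two rankings need not agree.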

3. `similar_books.py`

This script analyzes the network of similar books and calculates various network properties. The main tasks performed by this script are:

Loading the goodreads_books.json dataset and creating a DataFrame books.
Creating an interaction dataset by removing irrelevant columns from books.
Building a graph G where nodes represent books, and edges connect books that are listed as similar.
Calculating the degree distribution of the network.
Counting the total number of unique nodes (books) and edges in the network.
Calculating the average path length of the connected components in the network.
Calculating the average clustering coefficient of the network.
Calculating the diameter of the largest connected component in the network.
Optionally, visualizing the network.
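
The path-based and clustering measures above can be sketched with breadth-first search in plain Python. This is a minimal illustration on a toy graph, not the code in `similar_books.py`; the adjacency format `{node: set(neighbors)}` and all function names are assumptions made for the example.

```python
from collections import deque


def connected_components(adj):
    """Return a list of components, each a list of nodes."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components


def bfs_distances(adj, source):
    """Shortest hop counts from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def path_stats(adj, component):
    """Average shortest path length and diameter of one connected component."""
    total, longest = 0, 0
    for node in component:
        dist = bfs_distances(adj, node)
        total += sum(dist.values())
        longest = max(longest, max(dist.values()))
    n = len(component)
    return (total / (n * (n - 1)) if n > 1 else 0.0), longest


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    coefficients = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coefficients.append(0.0)
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients) if coefficients else 0.0


# Toy graph: a triangle 1-2-3 with a tail 3-4, plus a separate pair 5-6.
adj = {
    "1": {"2", "3"}, "2": {"1", "3"}, "3": {"1", "2", "4"},
    "4": {"3"}, "5": {"6"}, "6": {"5"},
}
components = connected_components(adj)
largest = max(components, key=len)
avg_path, diameter = path_stats(adj, largest)
clustering = average_clustering(adj)
```

Running a BFS from every node is quadratic in the component size, so on the full book network it is probably only practical on a sample or on smaller components.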

4. `users_network2.ipynb`

The file users_network2.ipynb contains code for constructing a user network based on the book reviews. The network is created by connecting users who have rated the same book. The code performs the following steps:

Read the book data and user review data from the JSON files.
Filter the book data to include only books published between 1980 and 2017, with a maximum of 1000 books per year.
Process the user review data to create interactions between users and books.
Construct the user network by connecting users who have rated the same book.
Analyze the network properties, such as degree distribution, average path length, and average clustering coefficient.
Perform community detection using various algorithms (e.g., Louvain, Greedy Modularity).
Calculate node importance measures (e.g., PageRank, Closeness Centrality, Degree Centrality, Betweenness Centrality).
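
The filtering and network construction steps above can be sketched like this. It is an illustration under assumptions, not the notebook's code: the field names are hypothetical, the cap keeps the first books encountered for each year (the README does not say how the 1000 are chosen), and the edge weight counting shared books is an addition for the example.

```python
from collections import defaultdict
from itertools import combinations


def filter_books(books, first_year=1980, last_year=2017, max_per_year=1000):
    """Keep books in the year range, at most `max_per_year` per year (first seen wins)."""
    kept, per_year = [], defaultdict(int)
    for book in books:
        year = book["publication_year"]
        if first_year <= year <= last_year and per_year[year] < max_per_year:
            per_year[year] += 1
            kept.append(book)
    return kept


def build_user_network(reviews, book_ids):
    """Return {(user_a, user_b): shared_book_count} for users who rated the same book."""
    raters = defaultdict(set)
    for review in reviews:
        if review["book_id"] in book_ids:
            raters[review["book_id"]].add(review["user_id"])
    weights = defaultdict(int)
    for users in raters.values():
        # Sorting makes each undirected pair appear under a single key.
        for pair in combinations(sorted(users), 2):
            weights[pair] += 1
    return dict(weights)


# Made-up sample; the cap is lowered to 1 per year so its effect is visible.
books = [
    {"book_id": "b1", "publication_year": 1999},
    {"book_id": "b2", "publication_year": 1975},  # outside the year range
    {"book_id": "b3", "publication_year": 1999},  # over the per-year cap
    {"book_id": "b4", "publication_year": 2010},
]
reviews = [
    {"user_id": "u1", "book_id": "b1"}, {"user_id": "u2", "book_id": "b1"},
    {"user_id": "u3", "book_id": "b1"}, {"user_id": "u1", "book_id": "b4"},
    {"user_id": "u2", "book_id": "b4"}, {"user_id": "u3", "book_id": "b2"},
    {"user_id": "u4", "book_id": "b2"}, {"user_id": "u4", "book_id": "b3"},
]
kept_ids = {book["book_id"] for book in filter_books(books, max_per_year=1)}
network = build_user_network(reviews, kept_ids)
```

A popular book with many raters produces a number of user pairs that grows quadratically, which is one plausible reason for limiting the number of books before building the network.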

5. Genre Analysis

The book_network.txt file contains code for analyzing the popularity of books and genres over time. It performs the following steps:

Read the book genre data from the JSON file.
Calculate the average rating, number of ratings, and number of books for each genre.
Analyze the genre distribution and popularity over time.
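
The per-genre aggregation can be sketched as below. This is a simplified illustration: it assumes each book record already carries a list of genres along with hypothetical `average_rating` and `ratings_count` fields, and it averages the books' ratings without weighting by rating count. The actual code may join the files and weight the averages differently.

```python
from collections import defaultdict

# Made-up sample records.
books = [
    {"book_id": "1", "average_rating": 4.0, "ratings_count": 100, "genres": ["fantasy", "fiction"]},
    {"book_id": "2", "average_rating": 3.0, "ratings_count": 50, "genres": ["fiction"]},
    {"book_id": "3", "average_rating": 5.0, "ratings_count": 10, "genres": ["fantasy"]},
]


def genre_stats(books):
    """Return {genre: {"books", "ratings", "average_rating"}}."""
    totals = defaultdict(lambda: {"books": 0, "ratings": 0, "rating_sum": 0.0})
    for book in books:
        for genre in book["genres"]:  # a book counts once for each of its genres
            entry = totals[genre]
            entry["books"] += 1
            entry["ratings"] += book["ratings_count"]
            entry["rating_sum"] += book["average_rating"]
    return {
        genre: {
            "books": entry["books"],
            "ratings": entry["ratings"],
            "average_rating": entry["rating_sum"] / entry["books"],
        }
        for genre, entry in totals.items()
    }


stats = genre_stats(books)
```

Keying the totals on a (genre, publication year) pair instead of the genre alone would give the same statistics per year, which is one way to follow popularity over time.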

6. `Visualization_data.ipynb`

The Visualization_data.ipynb file includes code for analyzing and visualizing the constructed networks. The analysis includes:

Visualization of network properties (degree distribution, clustering coefficient, etc.).
Community detection using various algorithms (Louvain, Greedy Modularity, etc.).
Calculation of node importance measures (PageRank, Closeness Centrality, Degree Centrality, Betweenness Centrality).
Visualization of network communities based on genres.

The code uses libraries such as NetworkX, pandas, matplotlib, and community for network analysis and visualization.
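
To make the node importance measures more concrete, here is a small plain-Python sketch of degree centrality and of PageRank computed by power iteration on an undirected graph. It is an illustration only; the function names, the damping value of 0.85, and the toy graph are assumptions. In the notebooks the NetworkX functions (`nx.degree_centrality`, `nx.pagerank`, `nx.closeness_centrality`, `nx.betweenness_centrality`) would normally be used instead.

```python
def degree_centrality(adj):
    """Fraction of the other nodes each node is connected to."""
    n = len(adj)
    return {node: (len(nbrs) / (n - 1) if n > 1 else 0.0) for node, nbrs in adj.items()}


def pagerank(adj, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on an undirected graph given as {node: set(neighbors)}."""
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        # Rank held by isolated nodes is spread evenly over all nodes.
        dangling = sum(rank[node] for node, nbrs in adj.items() if not nbrs)
        new_rank = {}
        for node in adj:
            # Each neighbor passes on an equal share of its own rank.
            incoming = sum(rank[nbr] / len(adj[nbr]) for nbr in adj[node])
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        change = sum(abs(new_rank[node] - rank[node]) for node in adj)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy star graph: one hub connected to three leaves.
adj = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
centrality = degree_centrality(adj)
scores = pagerank(adj)
```

In the star graph the hub receives the highest score under both measures, and the three leaves tie with one another.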

Usage

To use this project, you need Python and the required libraries, which can be installed from env.yaml.

Clone or download the project repository.
Place the dataset files (goodreads_books.json, goodreads_reviews_dedup.json, and goodreads_book_genres_initial.json) in the dataset directory.
Run the code files individually.

Note: You may need to modify file paths or other parameters to work correctly with your local setup.

References

Book genre trends: https://observablehq.com/d/6c2122f0341d1212

Book popularity trends: https://observablehq.com/d/9e49b9059f62929b

Contributing

Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the project's GitHub repository.

