3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
config/database.ini
__pycache__/
.venv/
174 changes: 74 additions & 100 deletions README.md
@@ -1,119 +1,93 @@
# Differential Privacy over SQL

## Table of Contents
* [About the Project](#about-the-project)
* [Prerequisites](#prerequisites)
* [Tools](#tools)
* [Python Dependency](#python-dependency)
* [Database Permission](#database-permission)
* [system structure](#system-structure)
* [Demo System](#demo-system)
* [Instruction for Collecting Result](#collect-result)
* [Future Plan](#future-plan)
# Differential Privacy over SQL (DPSQL)

DPSQL is a system designed for answering SQL queries while satisfying differential privacy guarantees.

## About The Project
Differential Privacy over SQL (DPSQL) is a system for answering queries over differential privacy.

The file structure is as below
```
The file and directory structure of the project is organized as follows:

```text
project
└───config
└───docs
└───Profile
└───src
│ └───algorithm
└───Test
│ └───TPCH
│ └───Graph
└───Sample
├── config/ # Configuration files required for the system
├── docs/ # Reference information and documentation
├── Profile/ # Profile information/licenses (e.g., mosek.lic)
├── src/ # Main source code files
│ └── algorithm/ # Core algorithms integrated into the system (e.g., FastSJA, OptSJA)
├── Test/ # Queries used in system experiments (TPCH, Graph)
└── Sample/ # Scripts for database setup and collecting experiment results
```
`./config` stores the configuration files users need for the system.

`./docs` stores the reference information users need to work with DPSQL:
## Prerequisites

`./Profile` stores the Profile information for using `mosek` in the system.
### Tools
* **[PostgreSQL](https://www.postgresql.org/)**: Database engine.
* **[Python3](https://www.python.org/download/releases/3.0/)**: Ensure version 3.0 or higher.
* **[Mosek](https://www.mosek.com/downloads/)**: License file must be placed in `./Profile`.
* **CPLEX (Full Edition)**: Required for large datasets. Note: Do not rely on `pip install cplex` alone, as it has a 1,000-variable limit.
* [Detailed CPLEX Installation & Python Linking Guide](docs/cplex_setup.md)

`./src` stores main source files.
* `./src/algorithm` stores 3 algorithm we integrated into this system.
### Python Dependencies

`./Test` stores the queries used in the experiments of the system.
Install the required Python packages using the provided `requirements.txt` file:

`./Sample` stores the script for setting up database and collecting experiment results.
```bash
pip install -r requirements.txt
```

### Database Permissions
The user running the system must have read permissions for the target database schema.

## Prerequisites
### Tools
Before running this project, please install below tools
* [PostgreSQL](https://www.postgresql.org/)
* [Python3](https://www.python.org/download/releases/3.0/)
* [Cplex](https://www.ibm.com/analytics/cplex-optimizer)
* [Mosek](https://www.mosek.com/downloads/) and the licence is under `./Profile`.

Please do not install `Cplex` dependency, which can only handle a small dataset, but download the `Cplex API` and import that to python with this [instruction](https://www.ibm.com/docs/zh/icos/12.9.0?topic=cplex-setting-up-python-api).
(We are aware that this link is expired and are working on a substitute solution.)

### Python Dependency
Here are dependencies used in python programs:
* `matplotlib`
* `numpy`
* `sys`
* `os`
* `collections`
* `configparser`
* `math`
* `psycopg2`
* `pglast`v4.4
* `argparser`

### Database permission
The user should have the permission to read the schema of the database to use this system.

## System structure
TODO

## Demo System

To run the system, run `main.py`. There are seven parameters
- `--d`: path to database initialization file;
- `--q`: path to query file;
- `--r`: path to private relation file;
- `--c`: path to the configuration file;
- `--o`: path to the output file;
- `--debug`: debug mode for more information;
- `--optimal`: choose to use optimal algorithm for SJA queries;

One can use `--h` to get help for parameter instruction.

For more information about input file, users can consult [here](./docs/system-input.md)

For the SQL syntax used in this system, users can consult [here](./docs/query-syntax.md)

Example:
```
## Usage / Demo System

The main entry point for the system is `main.py`.

### Command-Line Arguments
| Parameter | Description |
| :--- | :--- |
| `--d` | Path to the database initialization file |
| `--q` | Path to the query file |
| `--r` | Path to the private relation file |
| `--c` | Path to the configuration file |
| `--o` | Path to the output file |
| `--debug` | Enable debug mode for more detailed logging |
| `--optimal` | Use the optimal algorithm for SJA queries |

*Use `python main.py --h` to view complete help instructions.*

**Documentation Links:**
* [Input File Configuration](./docs/system-input.md)
* [Supported SQL Syntax](./docs/query-syntax.md)

**Example Run:**
```bash
python main.py --d ./config/database.ini --q ./test.txt --r ./test_relation.txt --c ./config/parameter.config --o out.txt
```
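For readers who want to see how the flags above fit together, here is a minimal `argparse` sketch of the command-line interface. It is an illustration based only on the parameter table, not the actual contents of `main.py`: the function name `build_parser` is hypothetical, and since the README does not say which flags are mandatory, none are marked required here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real main.py)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Answer SQL queries under differential privacy.",
    )
    # Path-valued options, named exactly as in the parameter table.
    parser.add_argument("--d", help="Path to the database initialization file")
    parser.add_argument("--q", help="Path to the query file")
    parser.add_argument("--r", help="Path to the private relation file")
    parser.add_argument("--c", help="Path to the configuration file")
    parser.add_argument("--o", help="Path to the output file")
    # Boolean switches: present means enabled.
    parser.add_argument("--debug", action="store_true",
                        help="Enable debug mode for more detailed logging")
    parser.add_argument("--optimal", action="store_true",
                        help="Use the optimal algorithm for SJA queries")
    return parser


# Parse the same arguments as the example run, passed as an explicit list.
demo_args = build_parser().parse_args([
    "--d", "./config/database.ini",
    "--q", "./test.txt",
    "--r", "./test_relation.txt",
    "--c", "./config/parameter.config",
    "--o", "out.txt",
])
```

With a parser built this way, `--h` works because `argparse` accepts unambiguous prefixes of `--help`, while `--d` and `--o` are matched exactly and so do not clash with `--debug` or `--optimal`.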

## collect result
## Collecting Results

1. install the dependency
Follow these steps to set up the data and collect experiment results:

2. create an empty database in `PosgreSQL`
3. generate `tbl` data files by using dbgen from [TPCH website](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp)
and store them in `/Sample/data/TPCH`
4. run script we provide in `/Sample/setupDBTPCH.py`
```
python setupDBTPCH.py --db databasename
```
5. run script we provide in `/Sample/collectResult.py`
```commandline
python collectResult.py
```
6. find the result in `/Sample/result/TPCH`
1. **Install Dependencies**: Ensure tools and Python requirements are installed as per the [Prerequisites](#prerequisites).
2. **Database Setup**: Create an empty database in PostgreSQL.
3. **Data Generation**: Generate `.tbl` data files using `dbgen` from the [TPC-H website](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp), and store them in `./Sample/data/TPCH`.
4. **Database Initialization**: Run the setup script provided in `./Sample/setupDBTPCH.py`:
```bash
python Sample/setupDBTPCH.py --db <databasename>
```
5. **Run Collection Script**:
```bash
cd Sample
python collectResult.py
```
6. **View Results**: The output will be available in `./Sample/result/TPCH`.
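
Before running the setup script, it can help to confirm that `dbgen` produced every table file. The sketch below is a hypothetical helper, not part of the repository. It assumes the files keep the default `dbgen` names (one `.tbl` file per TPC-H base table); whether `setupDBTPCH.py` expects exactly these names is not confirmed by this README.

```python
from pathlib import Path

# The eight base tables of the TPC-H schema; dbgen writes one .tbl file per table.
TPCH_TABLES = (
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
)


def missing_tbl_files(data_dir):
    """Return the names of expected .tbl files that are absent from data_dir."""
    root = Path(data_dir)
    return [
        f"{name}.tbl"
        for name in TPCH_TABLES
        if not (root / f"{name}.tbl").is_file()
    ]
```

Calling `missing_tbl_files("./Sample/data/TPCH")` from the project root returns an empty list when all eight files are in place.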

## Query Rewriting & Subquery Unnesting

DPSQL automatically rewrites nested subqueries into standard relational joins so that its differential privacy mechanisms can be applied to them. A custom Abstract Syntax Tree (AST) visitor (`UnnestSubqueries` in `src/parser.py`), built with `pglast`, traverses the AST and flattens `IN`, `ANY`, and `EXISTS` subqueries found in the `WHERE` clause into multi-table joins, while preserving and linking the original filtering conditions.
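
To make the idea concrete, the following toy example checks that an `IN` subquery and a hand-written join return the same rows. It is only an illustration of the equivalence such a rewrite relies on, not the `UnnestSubqueries` implementation: it uses the standard-library `sqlite3` module instead of PostgreSQL and `pglast`, the tables and columns are made up, and the `DISTINCT` in the joined subquery is this sketch's own way of avoiding duplicate rows.

```python
import sqlite3

# In-memory toy database with hypothetical tables (not the TPC-H schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_id INTEGER PRIMARY KEY, c_name TEXT);
CREATE TABLE orders (o_id INTEGER PRIMARY KEY, o_cust INTEGER, o_total REAL);
INSERT INTO customer VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO orders VALUES
    (10, 1, 50.0), (11, 1, 200.0), (12, 2, 30.0), (13, 3, 150.0), (14, 1, 300.0);
""")

# Nested form: customers with at least one order above 100.
nested = """
SELECT c_name FROM customer
WHERE c_id IN (SELECT o_cust FROM orders WHERE o_total > 100)
ORDER BY c_name
"""

# Join form: DISTINCT keeps a customer with several large orders
# from appearing more than once in the result.
unnested = """
SELECT c_name FROM customer
JOIN (SELECT DISTINCT o_cust FROM orders WHERE o_total > 100) AS big
  ON customer.c_id = big.o_cust
ORDER BY c_name
"""

nested_rows = conn.execute(nested).fetchall()
unnested_rows = conn.execute(unnested).fetchall()
```

Both queries select the same customers here; dropping the `DISTINCT` would return `ann` twice, which is why a rewrite of this kind has to account for duplicate matches.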

## Future Plan
## Future Plans

- Distinct count queries type (projection);
- User Interface
- Better user experience;
- Optimization;
* Support for distinct count queries (projection).
* Develop a User Interface (UI).
* Improve overall user experience.
* General performance optimization.