PhishVLM

An extension from our work "Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List". Published in USENIX Security 2024.

• Read our Paper •

• Visit our Website •

• Download our Datasets •

• Cite our Paper •

Introduction

Existing reference-based phishing detection:

❌ Relies on a pre-defined reference list, which is lack of comprehensiveness and incurs high maintenance cost
❌ Does not fully make use of the textual semantics present on the webpage

In our PhishVLM, we build a reference-based phishing detection framework:

✅ Without the pre-defined reference list: Modern VLMs have encoded far more extensive brand-domain information than any predefined list
✅ Chain-of-thought credential-taking prediction: Reasoning the credential-taking status in a step-by-step way by looking at the screenshot

Framework

Input: a URL and its screenshot, Output: Phish/Benign, Phishing target

Step 1: Brand recognition model
- Input: Logo Screenshot
- Output: VLM's predicted brand
Step 2: Credential-Requiring-Page classification model
- Input: Webpage Screenshot
- Output: VLM chooses from A. Credential-Taking Page or B. Non-Credential-Taking Page
- Go to step 4 if VLM chooses 'A', otherwise go to step 3.
Step 3: Credential-Requiring-Page transition model (activates if VLM chooses 'B' from the last step)
- Input: All clickable UI elements screenshots
- Intermediate Output: Top-1 most likely login UI
- Output: Webpage after clicking that UI, go back to Step 1 with the updated webpage and URL
Step 4: Output step
- Case 1: If the domain is from a web hosting domain: it is flagged as phishing if (i) VLM predicts a targeted brand inconsistent with the webpage's domain and (ii) VLM chooses 'A' from Step 2
- Case 2: If the domain is not from a web hosting domain: it is flagged as phishing if (i) VLM predicts a targeted brand inconsistent with the webpage's domain (ii) VLM chooses 'A' from Step 2 and (iii) the domain is not a popular domain indexed by Google
- Otherwise: reported as benign

Project structure

scripts/ 
├── infer/
│   └──test.py             # inference script
├── pipeline/             
│   └──test_llm.py # TestVLM class
└── utils/ # other utitiles such as web interaction utility functions 

prompts/ 
├── brand_recog_prompt.json 
└── crp_pred_prompt.json
└── crp_trans_prompt.json

Setup

Step 1: Install Requirements.

Tested on Ubuntu, CUDA 11

A new conda environment "phishllm" will be created after this step

conda create -n phishllm python=3.10
conda activate phishllm
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install --no-build-isolation git+https://github.com/facebookresearch/detectron2.git
cd scripts/phishintention
chmod +x setup.sh
./setup.sh

Step 2: Install Chrome

sudo apt install ./google-chrome-stable_current_amd64.deb

Step 3: Register Two API Keys.

🔑 OpenAI API key, See Tutorial here. Paste the API key to ./datasets/openai_key.txt.
🔑 Google Programmable Search API Key, See Tutorial here. Paste your API Key (in the first line) and Search Engine ID (in the second line) to ./datasets/google_api_key.txt:
```
 [API_KEY]
 [SEARCH_ENGINE_ID]
```

Prepare the Dataset

To test on your own dataset, you need to prepare the dataset in the following structure:

testing_dir/
├── aaa.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source
├── bbb.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source
├── ccc.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source

Inference: Run PhishLLM

  conda activate phishllm
  python scripts/infer/test.py --folder [folder to test, e.g., ./datasets/test_sites]

Understand the Output

You will see the console is printing logs like the following

Expand to see the sample log


  [PhishLLMLogger][DEBUG] Folder ./datasets/field_study/2023-09-01/device-862044b2-5124-4735-b6d5-f114eea4a232.remotewd.com
  [PhishLLMLogger][DEBUG] Time taken for LLM brand prediction: 0.9699530601501465 Detected brand: sonicwall.com
  [PhishLLMLogger][DEBUG] Domain sonicwall.com is valid and alive
  [PhishLLMLogger][DEBUG] Time taken for LLM CRP classification: 2.9195783138275146 	 CRP prediction: A. This is a credential-requiring page.
  [❗️] Phishing discovered, phishing target is sonicwall.com

Meanwhile, a txt file named "[today's date]_phishllm.txt" is being created, it has the following columns:
- "folder": name of the folder
- "phish_prediction": "phish" | "benign"
- "target_prediction": phishing target brand's domain, e.g. paypal.com, meta.com
- "brand_recog_time": time taken for brand recognition
- "crp_prediction_time": time taken for CRP prediction
- "crp_transition_time": time taken for CRP transition

Citations

  @inproceedings {299838,
  author = {Ruofan Liu and Yun Lin and Xiwen Teoh and Gongshen Liu and Zhiyong Huang and Jin Song Dong},
  title = {Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List},
  booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
  year = {2024},
  isbn = {978-1-939133-44-1},
  address = {Philadelphia, PA},
  pages = {523--540},
  url = {https://www.usenix.org/conference/usenixsecurity24/presentation/liu-ruofan},
  publisher = {USENIX Association},
  month = aug
  }

If you have any issues running our code, you can raise a Github issue or email us liu.ruofan16@u.nus.edu, lin_yun@sjtu.edu.cn, dcsdjs@nus.edu.sg.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
datasets		datasets
figures		figures
prompts		prompts
scripts		scripts
.gitignore		.gitignore
README.md		README.md
param_dict.yaml		param_dict.yaml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PhishVLM

Introduction

Framework

Project structure

Setup

Step 1: Install Requirements.

Step 2: Install Chrome

Step 3: Register Two API Keys.

Prepare the Dataset

Inference: Run PhishLLM

Understand the Output

Citations

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PhishVLM

Introduction

Framework

Project structure

Setup

Step 1: Install Requirements.

Step 2: Install Chrome

Step 3: Register Two API Keys.

Prepare the Dataset

Inference: Run PhishLLM

Understand the Output

Citations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages