code-philia/PhishVLM

PhishVLM

An extension of our work "Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List", published at USENIX Security 2024.

Read our Paper

Visit our Website

Download our Datasets

Cite our Paper

Introduction

Existing reference-based phishing detection:

  • ❌ Relies on a pre-defined reference list, which lacks comprehensiveness and incurs a high maintenance cost
  • ❌ Does not fully make use of the textual semantics present on the webpage

In PhishVLM, we build a reference-based phishing detection framework that works:

  • Without a pre-defined reference list: modern VLMs have encoded far more extensive brand-domain information than any pre-defined list
  • With chain-of-thought credential-taking prediction: the VLM reasons step by step about whether the page takes credentials by looking at the screenshot

Framework

Input: a URL and its screenshot. Output: Phish/Benign and the phishing target.

  • Step 1: Brand recognition model

    • Input: Logo Screenshot
    • Output: VLM's predicted brand
  • Step 2: Credential-Requiring-Page classification model

    • Input: Webpage Screenshot
    • Output: VLM chooses from A. Credential-Taking Page or B. Non-Credential-Taking Page
    • Go to step 4 if VLM chooses 'A', otherwise go to step 3.
  • Step 3: Credential-Requiring-Page transition model (activates if VLM chooses 'B' from the last step)

    • Input: All clickable UI elements screenshots
    • Intermediate Output: Top-1 most likely login UI
    • Output: Webpage after clicking that UI, go back to Step 1 with the updated webpage and URL
  • Step 4: Output step

    • Case 1: If the domain is a web hosting domain, the page is flagged as phishing if (i) the VLM predicts a targeted brand inconsistent with the webpage's domain and (ii) the VLM chooses 'A' in Step 2

    • Case 2: If the domain is not a web hosting domain, the page is flagged as phishing if (i) the VLM predicts a targeted brand inconsistent with the webpage's domain, (ii) the VLM chooses 'A' in Step 2, and (iii) the domain is not a popular domain indexed by Google

    • Otherwise: reported as benign
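
The four steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the repository's implementation: the `Page` type, the injected callables (`recognize_brand`, `is_credential_page`, `click_login_ui`, `is_hosting_domain`, `is_popular_domain`), the `max_transitions` cap, and the use of plain string equality to check brand/domain consistency are all hypothetical stand-ins for the VLM calls, the browser interaction, and the Google index lookup.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Page:
    """Hypothetical container for one visited page."""
    url: str
    domain: str
    screenshot: str  # path to the screenshot file


def detect(
    page: Page,
    recognize_brand: Callable[[Page], Optional[str]],   # Step 1 (VLM)
    is_credential_page: Callable[[Page], bool],         # Step 2 (True if VLM picks 'A')
    click_login_ui: Callable[[Page], Optional[Page]],   # Step 3 (browser interaction)
    is_hosting_domain: Callable[[str], bool],
    is_popular_domain: Callable[[str], bool],
    max_transitions: int = 1,  # assumed cap; the source does not state one
) -> Tuple[str, Optional[str]]:
    """Return ("phish" | "benign", predicted target brand domain or None)."""
    transitions = 0
    while True:
        brand = recognize_brand(page)                    # Step 1
        if is_credential_page(page):                     # Step 2, then Step 4
            # Assumption: "inconsistent" is modelled as simple string inequality.
            inconsistent = brand is not None and brand != page.domain
            if not inconsistent:
                return "benign", None
            if is_hosting_domain(page.domain):           # Case 1
                return "phish", brand
            if not is_popular_domain(page.domain):       # Case 2
                return "phish", brand
            return "benign", None
        if transitions >= max_transitions:
            return "benign", None
        next_page = click_login_ui(page)                 # Step 3
        if next_page is None:                            # no login UI found
            return "benign", None
        page = next_page                                 # back to Step 1
        transitions += 1
```

Passing the model and browser calls in as functions keeps the decision rule testable without any API key or browser session.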

Project structure

scripts/ 
├── infer/
│   └──test.py             # inference script
├── pipeline/             
│   └──test_llm.py # TestVLM class
└── utils/ # other utilities such as web interaction utility functions

prompts/ 
├── brand_recog_prompt.json 
├── crp_pred_prompt.json
└── crp_trans_prompt.json

Setup

Step 1: Install Requirements.

Tested on Ubuntu, CUDA 11

  • This step creates a new conda environment named "phishllm"
conda create -n phishllm python=3.10
conda activate phishllm
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install --no-build-isolation git+https://github.com/facebookresearch/detectron2.git
cd scripts/phishintention
chmod +x setup.sh
./setup.sh

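Step 2: Install Chrome.

  • The command below assumes that google-chrome-stable_current_amd64.deb has already been downloaded into the current directory
sudo apt install ./google-chrome-stable_current_amd64.deb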

Step 3: Register Two API Keys.

  • 🔑 OpenAI API key: see the tutorial here. Paste the API key into ./datasets/openai_key.txt.

  • 🔑 Google Programmable Search API key: see the tutorial here. Paste your API key (on the first line) and Search Engine ID (on the second line) into ./datasets/google_api_key.txt:

     [API_KEY]
     [SEARCH_ENGINE_ID]
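
Before running inference, it can help to confirm that the key file has the two-line layout shown above. The following is a minimal sketch only; the function name `read_google_credentials` is hypothetical and is not part of the repository, and ignoring blank lines is a leniency added here for convenience.

```python
from pathlib import Path
from typing import Tuple


def read_google_credentials(path: str = "./datasets/google_api_key.txt") -> Tuple[str, str]:
    """Return (api_key, search_engine_id) from the two-line key file."""
    text = Path(path).read_text(encoding="utf-8")
    # Ignore blank lines and surrounding whitespace so a trailing newline is harmless.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError(
            f"{path} must hold the API key on line 1 and the Search Engine ID on line 2"
        )
    return lines[0], lines[1]
```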
    

Prepare the Dataset

To test on your own dataset, you need to prepare the dataset in the following structure:

testing_dir/
├── aaa.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source
├── bbb.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source
├── ccc.com/
│   ├── shot.png  # save the webpage screenshot
│   ├── info.txt  # save the webpage URL
│   └── html.txt  # save the webpage HTML source
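
A quick way to check a dataset against the layout above is to scan each site folder for the three expected files. This is a minimal sketch, not a script shipped with the repository; the function name `find_incomplete_sites` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# File names taken from the directory layout described above.
REQUIRED_FILES = ("shot.png", "info.txt", "html.txt")


def find_incomplete_sites(testing_dir: str) -> Dict[str, List[str]]:
    """Map each site folder name to the required files it is missing."""
    missing: Dict[str, List[str]] = {}
    for site in sorted(Path(testing_dir).iterdir()):
        if not site.is_dir():
            continue  # skip stray files at the top level
        absent = [name for name in REQUIRED_FILES if not (site / name).is_file()]
        if absent:
            missing[site.name] = absent
    return missing
```

An empty result means every folder contains a screenshot, a URL file, and an HTML source file.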

Inference: Run PhishLLM

  conda activate phishllm
  python scripts/infer/test.py --folder [folder to test, e.g., ./datasets/test_sites]

Understand the Output

  • The console prints logs like the following:

    
      [PhishLLMLogger][DEBUG] Folder ./datasets/field_study/2023-09-01/device-862044b2-5124-4735-b6d5-f114eea4a232.remotewd.com
      [PhishLLMLogger][DEBUG] Time taken for LLM brand prediction: 0.9699530601501465 Detected brand: sonicwall.com
      [PhishLLMLogger][DEBUG] Domain sonicwall.com is valid and alive
      [PhishLLMLogger][DEBUG] Time taken for LLM CRP classification: 2.9195783138275146 	 CRP prediction: A. This is a credential-requiring page.
      [❗️] Phishing discovered, phishing target is sonicwall.com
    
  • Meanwhile, a txt file named "[today's date]_phishllm.txt" is created. It has the following columns:

    • "folder": name of the folder
    • "phish_prediction": "phish" | "benign"
    • "target_prediction": phishing target brand's domain, e.g. paypal.com, meta.com
    • "brand_recog_time": time taken for brand recognition
    • "crp_prediction_time": time taken for CRP prediction
    • "crp_transition_time": time taken for CRP transition
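
The result file can be loaded for a quick summary along the following lines. This is a sketch under explicit assumptions: the source does not state the delimiter or whether a header row is present, so both are exposed as parameters (tab-separated with a header row is only a guess), and the helper names `load_results` and `count_targets` are hypothetical.

```python
import csv
from collections import Counter
from typing import Dict, List

# Column names as listed above, in the listed order.
COLUMNS = [
    "folder",
    "phish_prediction",
    "target_prediction",
    "brand_recog_time",
    "crp_prediction_time",
    "crp_transition_time",
]


def load_results(path: str, delimiter: str = "\t", has_header: bool = True) -> List[Dict[str, str]]:
    """Read the result file into a list of dicts keyed by COLUMNS.

    Assumption: the delimiter and header row are guesses; adjust them to the real file.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        rows = [row for row in csv.reader(fh, delimiter=delimiter) if row]
    if has_header and rows:
        rows = rows[1:]
    return [dict(zip(COLUMNS, row)) for row in rows]


def count_targets(rows: List[Dict[str, str]]) -> Counter:
    """Count how often each brand is the predicted target among phishing rows."""
    return Counter(
        row.get("target_prediction", "")
        for row in rows
        if row.get("phish_prediction") == "phish"
    )
```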

Citations

  @inproceedings {299838,
  author = {Ruofan Liu and Yun Lin and Xiwen Teoh and Gongshen Liu and Zhiyong Huang and Jin Song Dong},
  title = {Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List},
  booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
  year = {2024},
  isbn = {978-1-939133-44-1},
  address = {Philadelphia, PA},
  pages = {523--540},
  url = {https://www.usenix.org/conference/usenixsecurity24/presentation/liu-ruofan},
  publisher = {USENIX Association},
  month = aug
  }

If you have any issues running our code, you can raise a GitHub issue or email us at liu.ruofan16@u.nus.edu, lin_yun@sjtu.edu.cn, or dcsdjs@nus.edu.sg.
