FRED is configured from two sources:
- Command-line arguments: FRED is started by running `run.py`. Use `--help` to see all available parameters.
- The `globals.py` file, which can be edited directly.
- `--host` is the local address that FRED will listen on. Default is `0.0.0.0`.
- `--port` is the local port FRED will listen on. Default is `5000`.
- `--external_address` is the external address of FRED at which an outside ML component will send results back. Useful in NAT situations where the crawler component is hidden from an ML component.
- `--external_port` is the complementary port of the external address.
- `--crawl_threads` is the number of concurrent threads that handle crawling. Please note that Chromium requires a few GB of RAM per instance, so scale accordingly to avoid overflowing RAM.
- `--ai_threads` is the number of concurrent threads that handle the ML image analysis. Default is 1, because if you use a GPU (which you should anyway if you want ML analysis) sequential running won't overflow the GPU's RAM.
- `--max_megapixels` is the number of megapixels that the (raw) image analysis will crop each screenshot to. Default is 20 MP.
- `--log_level` is the standard Python logging level. Default is `debug`; available levels are `debug`, `warning`, `info`, `error`.
- `--timeout_overall` is the maximum timeout in seconds to wait for a job to be finished, from scheduled to done. Set 0 to disable; default is 60 minutes (3600 seconds).
- `--timeout_crawl` is the maximum timeout in seconds to wait for the crawling to be finished. Set 0 to disable; default is 30 minutes (1800 seconds).
- `--timeout_ai` is the maximum timeout in seconds to wait for the AI analysis to be finished. Set 0 to disable; default is 30 minutes (1800 seconds).
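The flags above could be declared with `argparse` roughly as follows. This is a hypothetical sketch using the documented defaults; the actual `run.py` may declare them differently, and the `external_address`/`external_port` defaults are assumptions.

```python
import argparse

# Hypothetical sketch of run.py's argument parsing, using the defaults
# documented above. Not FRED's actual code.
parser = argparse.ArgumentParser(description="FRED crawler")
parser.add_argument("--host", default="0.0.0.0")
parser.add_argument("--port", type=int, default=5000)
parser.add_argument("--external_address", default=None)  # assumed default
parser.add_argument("--external_port", type=int, default=None)  # assumed default
parser.add_argument("--crawl_threads", type=int, default=10)  # cf. NUM_CRAWLING_THREADS in globals.py
parser.add_argument("--ai_threads", type=int, default=1)
parser.add_argument("--max_megapixels", type=int, default=20)
parser.add_argument("--log_level", default="debug",
                    choices=["debug", "warning", "info", "error"])
parser.add_argument("--timeout_overall", type=int, default=3600)  # 0 disables
parser.add_argument("--timeout_crawl", type=int, default=1800)   # 0 disables
parser.add_argument("--timeout_ai", type=int, default=1800)      # 0 disables

args = parser.parse_args([])  # parse only the defaults, for demonstration
```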
Edit these only if you know what you're doing. Some of the parameters here are set by the run.py arguments. A few of the more interesting ones are:
```python
MEDIAN_KERNEL_WIDTH_RATIO = 8.0
IMAGE_RESIZE_RATIO = 8.0
VARIANCE_NUM_ITERATIONS = 3
VARIANCE_INTERATIONS_INTERVAL_SEC = 10

CRAWL_URL_RETRIES = 3
PAGE_LOAD_TIMEOUT_SEC = 60
NUM_CRAWLING_THREADS = 10
```
What is actually happening when a FRED process starts?
Remember, a process starts by sending a request with a baseline URL and an updated URL (and a couple of other params, detailed further below). FRED is set up as an asynchronous state machine, meaning that each job (request) is handled in a queued manner and always has a status attached. On your local file system you will find a `fred/jobs` folder that contains each individual job.
A job is uniquely identified by a timestamp and a suffix; this id is the folder name in the `fred/jobs` folder. A `log.json` file contains all the information about the job and is updated asynchronously, as soon as anything changes.
Here's what happens:
1. A request is received, a unique id is generated, a folder appears in `fred/jobs` with this id, and a `log.json` file is created. The job then waits for a free crawler thread.
2. As soon as a thread is available, the job status is updated to Crawling. The baseline is crawled up to a predefined depth and links are collected. This initial link exploration is done in a depth-first manner, up to a maximum depth, and a number of links are extracted for comparison.
3. Each link obtained in the previous step is crawled (meaning fully loaded in Chromium) and screenshots are taken, first on the baseline site, then on the updated site. Screenshots are taken at all specified resolutions. The `log.json` file is updated: a new `pages` value appears in its dictionary with stats about each page. Locally, in the job's folder, `baseline` and `updated` folders appear, and each page gets its own subfolder with screenshots saved as `raw_xyz_00.png`, where `xyz` is the resolution.
4. Each screenshot is analyzed: dynamic content is masked (best-effort), and different similarity measures are computed for each resolution-URL pair. The `log.json` file is updated again: the `pages` stats are refreshed and a `results` key now contains the raw image analysis. Each page subfolder now contains image analysis highlighting the differences between images (only in the `updated` folder).
5. If the request had the `ml_enabled` property set, the process continues; otherwise the status is set to Done and the crawler thread is free to pick up another job. With ML enabled, all raw images are packed into an HTTP request and sent to the address specified in the `ml_component` field of the initial request.
6. The FRED instance at that address awaits the request sent in step 5. It unpacks all screenshots, writes them to disk (if the same local instance is used, nothing happens, as the files are already there) and starts the ML analysis.
7. The ML model is loaded and run on all resolution pairs. If a GPU is available, it is automatically used to speed up computation.
8. The ML process creates a textblock and an image output file; in this step the alignment between identified masks is performed, and content is compared pixel by pixel. Results are written to the `results` key in the local `log.json` file, and all this information (screenshots + analysis) is sent back to the requester address/port. The requester also updates its `log.json` file (if not already done, when using the same local address for ML).
9. The Done status is reached if all previous steps completed successfully. This means that the job is finished.

*** Please note: to judge a job's status you have to look at both the `status` field, which should be set to Done, and the `error` field, which should be empty. If the status is Done and the error field is not empty, the job has failed somehow, with the error message written in the `error` field.
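The check described in the note above can be expressed as a small helper (a sketch; the `status` and `error` field names are from the log structure documented below):

```python
def job_finished_ok(log: dict) -> bool:
    """A job succeeded only if status is Done AND the error field is empty."""
    return log.get("status") == "Done" and not log.get("error")

# A Done status alone is not enough:
assert job_finished_ok({"status": "Done", "error": ""})
assert not job_finished_ok({"status": "Done", "error": "timeout"})
assert not job_finished_ok({"status": "Crawling", "error": ""})
```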
FRED exposes the following methods:
Call this when you want FRED to validate a website. Method: POST
- Parameters:
- `baseline_url` = URL of the original site. Type: str, Required
- `updated_url` = URL of the updated site. Type: str, Required
- `max_depth` = Max depth to search for links. Type: int, Default: 10, Optional
- `max_urls` = Number of URLs (pages) to compare. Type: int, Default: 10, Optional
- `prefix` = Use this field if you want to add an identifier after the job id (which is a time value). If left empty, a short random hex code is generated and appended. Even though this is a suffix, the name `prefix` is kept for backward compatibility. Do NOT use spaces in this field. Type: str, Default: (empty), Optional
- `ml_enable` = Set to True to enable ML processing. Type: bool, Default: False, Optional
- `ml_address` = The address of the FRED instance which has a GPU (usually you want to do ML on a GPU). Type: str, Default: 0.0.0.0:5000, Optional
- `requested_resolutions` = List of resolutions screenshots will be taken at. Either set a single value or join multiple values with a comma. Type: list, Default: 512,1024, Optional
- `requested_score_weights` = The overall divergence is a weighted sum of the network score, the visual analysis score and the AI visual analysis score. This value is always a 3-valued list with the weights for each of these components. Note that if `ml_enable` is set to False, the first two weights are automatically rescaled to sum to one (and the third, which belongs to the AI analysis, is ignored). Type: list, Default: 0.1,0.4,0.5, Optional
- `requested_score_epsilon` = Intended as an epsilon to allow fuzzy scores. Not yet implemented. Type: epsilon, Default: 10,0,0, Optional
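The weight rescaling described for `requested_score_weights` can be sketched as follows. This illustrates the documented behavior, not FRED's actual code:

```python
def effective_weights(weights, ml_enabled):
    """With ML disabled, ignore the third (AI) weight and rescale the
    first two so they sum to one, as described above."""
    w_net, w_visual, w_ai = weights
    if ml_enabled:
        return [w_net, w_visual, w_ai]
    s = w_net + w_visual
    return [w_net / s, w_visual / s, 0.0]

# The default weights 0.1, 0.4, 0.5 become 0.2, 0.8, 0 when ML is off:
print(effective_weights([0.1, 0.4, 0.5], ml_enabled=False))
```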
- Success Response:
A status 200 code is returned with the following content:
{ id : job_id }
- Error Response:
This method cannot really fail, as all it does is schedule a job for crawling locally and allocate a unique id. If it fails, it returns a non-200 status code, most likely an HTTP code indicating that the server is inaccessible.
- Sample Call:
```javascript
var arr = {
    "baseline_url": "https://www.test.com",
    "updated_url": "https://www.test.com",
    "max_depth": "1",
    "max_urls": "1",
    "prefix": "test"
};
$.ajax({
    type: "POST",
    url: "/api/verify",
    dataType: "json",
    contentType: "application/json",
    data: JSON.stringify(arr),
});
```
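The same call can be made from Python. A sketch using only the standard library; the `/api/verify` path is taken from the sample above, while the host and port are the documented defaults and may differ in your deployment:

```python
import json
import urllib.request

payload = {
    "baseline_url": "https://www.test.com",
    "updated_url": "https://www.test.com",
    "max_depth": "1",
    "max_urls": "1",
    "prefix": "test",
}
body = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:5000/api/verify",  # assumes the default host/port
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Uncomment to actually send the request against a running FRED instance:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))  # the success response contains the job id
```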
Call this to obtain the status of a job based on a job id. Method: GET
The server-side handling looks roughly like this (the report is looked up by job id, and page keys are normalized before the report is returned):

```python
if not report:
    return {'Error': 'Report does not exist'}, 404
else:
    pages = report.get("pages")
    if pages is not None:
        for p in list(pages):
            pages[normalize_url_name(p)] = pages[p]
            del pages[p]
    return report, 200
```
- Parameters:
- `id` = id of the job. Type: str, Required
- Success Response:
A status 200 code is returned with a report (a dictionary) with the status and details of the respective job.
- Error Response:
A 404 is returned if the job does not exist (invalid job id specified).
- Sample Call: a GET request with the job `id` as parameter.
Call this to obtain a status of all jobs on a FRED instance. Method: GET
- Parameters:
No parameters are needed here.
- Success Response:
A status 200 code is returned with the full dictionary containing all jobs (as they are at the moment of calling). Each job is a key in the dict. Please see the log details section for more info about how this dict is structured.
- Error Response:
If there are no jobs, an empty response will be sent with status 200. Otherwise, the default HTTP codes apply if there are any errors.
- Sample Call:
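Given the dictionary returned by this endpoint, unfinished jobs can be filtered like this (a sketch with hypothetical job ids; the `status` field is described in the log details section):

```python
# Sketch: consume the all-jobs dictionary returned by this endpoint.
# Each key is a job id; each value is a log object with a "status" field.
jobs = {
    "20200624_164509_test": {"status": "Done", "error": ""},
    "20200624_170000_abc1": {"status": "Crawling", "error": ""},
}
in_progress = [jid for jid, log in jobs.items() if log.get("status") != "Done"]
print(in_progress)  # → ['20200624_170000_abc1']
```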
The log object contains all the information regarding a job. It is represented as a JSON, with the following top-level structure:
```
{
  error: "",
  input_data: {},
  pages: {},
  report: {},
  stats: {},
  status: "Done"
}
```

Let's analyze each field:
These are the two main fields indicating the state of the job. The `status` field (a string) is one of the steps detailed above; the `error` field (a string) indicates whether there was an error processing the job. Thus:

- a successful job has status `Done` and an empty error field.
- an unsuccessful job has status `Done` and a non-empty error field.
- a job in progress has a status different from `Done`.
`input_data`: This is a dictionary containing information regarding the parameters this job was instantiated with.
`stats`: This dictionary contains durations for crawling, ML processing (if enabled) and overall:
```
{
  cr_finished_at: "2020-06-24 16:47:15.704698",
  cr_started_at: "2020-06-24 16:45:09.074632",
  finished_at: "2020-06-24 16:47:26.722477",
  ml_finished_at: "2020-06-24 16:47:26.686243",
  ml_started_at: "2020-06-24 16:47:15.833722",
  queued_at: "2020-06-24 16:45:09.066220"
}
```

Note that `cr_started_at` (crawling started) is not necessarily the same as `queued_at`, as the actual starting time depends on available crawler threads.
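These timestamps can be turned into durations with standard `datetime` parsing (a sketch; the format string below matches the example values):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Values from the example stats block above.
stats = {
    "queued_at": "2020-06-24 16:45:09.066220",
    "cr_started_at": "2020-06-24 16:45:09.074632",
    "cr_finished_at": "2020-06-24 16:47:15.704698",
    "finished_at": "2020-06-24 16:47:26.722477",
}

def duration_seconds(stats, start_key, end_key):
    """Elapsed seconds between two of the timestamp fields."""
    start = datetime.strptime(stats[start_key], FMT)
    end = datetime.strptime(stats[end_key], FMT)
    return (end - start).total_seconds()

crawl = duration_seconds(stats, "cr_started_at", "cr_finished_at")
total = duration_seconds(stats, "queued_at", "finished_at")
print(round(crawl, 2), round(total, 2))  # → 126.63 137.66
```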
The pages object contains a list of crawled pages. Each such page is identified by the page address, and is a dictionary itself containing:
`console_logs` and `network_logs`: they have a similar structure, and contain a `divergence` score (0-100) and a listing of what differs between baseline and updated.

`raw_screenshot_analysis` contains resolutions as keys, with the following values:
```
"1920": {
    "diff_pixels": 0.0,
    "mse": 0.0,
    "ssim": 100.0
}
```

This is an example of a 1920px (width) baseline-updated screenshot analysis: `diff_pixels` is the percentage of different pixels between the screenshots, `mse` is the normalized mean squared error and `ssim` is the structural similarity index. For a pair of identical screenshots, `diff_pixels` and `mse` are 0, while `ssim` is 100.
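As an illustration, the two pixel metrics can be computed roughly like this on flat grayscale pixel lists. This is a simplified sketch: FRED's actual implementation, its normalization, and the SSIM computation may differ.

```python
def diff_pixels_pct(a, b):
    """Percentage of pixel positions that differ (0-100)."""
    diff = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * diff / len(a)

def mse_normalized(a, b, max_val=255.0):
    """Mean squared error, scaled to 0-100 by the squared maximum pixel value.
    (Illustrative normalization only; FRED's may differ.)"""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return 100.0 * mse / (max_val ** 2)

baseline = [0, 128, 255, 255]
updated = [0, 128, 255, 0]   # one of four pixels changed
print(diff_pixels_pct(baseline, updated))  # → 25.0
print(mse_normalized(baseline, updated))   # → 25.0
```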
ml_screenshot_analysis
This dict contains resolutions as keys. Each screenshot resolution has a `content` and a `mask` object, each of them having three entries: `images` and `textblock` are values (0-100) that indicate how similar the images and the detected texts are, with the `overall` value being their average. The resolution also contains an `overall` entry which is the average of the `content` and `mask` overalls.
```
1920: {
    content: {
        images: 0,
        overall: 0,
        textblock: 0
    },
    mask: {
        images: 0,
        overall: 0,
        textblock: 0
    },
    overall: 0
}
```

The report is the essential part of the log object. It summarizes all divergences.
It has the following fields:
```
{
  crawl_divergence: 16.279069767441857,
  ml_screenshot_divergence: 0,
  overall_divergence: 1.6279069767441858,
  raw_screenshot_diff_pixels: 0,
  raw_screenshot_divergence: 0,
  raw_screenshot_mse: 0,
  raw_screenshot_ssim: 100,
  unique_baseline_pages: [],
  unique_updated_pages: []
}
```

- `crawl_divergence` is computed as the maximum, over all pages, of the average of the console and network log divergences.
- `raw_screenshot_diff_pixels`: similarly, the maximum over all pages of the different-pixels divergence.
- `raw_screenshot_mse`: similarly for the MSE divergence.
- `raw_screenshot_ssim`: similarly for the structural similarity index value.
- `raw_screenshot_divergence`: the maximum, over all crawled pages, of the average of the three values above.
- `ml_screenshot_divergence`: the maximum divergence value from all pages, where the value is the `overall` divergence.
- `overall_divergence`: the overall divergence that FRED outputs, computed as a weighted sum of `crawl_divergence`, `raw_screenshot_divergence` and `ml_screenshot_divergence` (if enabled).
- `unique_baseline_pages` and `unique_updated_pages` list pages (addresses) that were unique to either the baseline site or the updated site. If any of these entries contain values, a maximum divergence will be reported, because the baseline and updated sites do not serve the same pages and are therefore not identical.
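The aggregation rules can be sketched as follows. This illustrates the documented behavior rather than FRED's actual code; the numbers reproduce the example report above with the default weights 0.1, 0.4, 0.5:

```python
def overall_divergence(crawl, raw, ml, weights=(0.1, 0.4, 0.5)):
    """Weighted sum of the three component divergences."""
    w_crawl, w_raw, w_ml = weights
    return w_crawl * crawl + w_raw * raw + w_ml * ml

# Values from the example report above: only crawl_divergence is non-zero,
# so overall = 0.1 * 16.279... ≈ 1.6279...
d = overall_divergence(16.279069767441857, 0, 0)
print(d)

# Per-page aggregation: FRED reports the maximum over all pages.
page_crawl_divergences = [0.0, 16.279069767441857, 3.5]
print(max(page_crawl_divergences))  # → 16.279069767441857
```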
Note: FRED runs under the assumption that if a page is very different baseline-to-updated, then this is the value reported back to the user, even if all other pages are identical. So the maximum divergence is always computed over all pages and all resolutions.