SSW CodeAuditor Knowledge Base (KB)
- After investigating, we found that certain websites, such as Microsoft and Twitter, employ measures against web scraping. These may include login authentication, redirects, or the generation of multiple CORS network errors, as shown in the screenshot below. As a result, although these websites continue to function normally, CodeAuditor flags them as errors in its reports.
Figure: Microsoft website returning CORS network errors
- To address this, we still include these sites in the analysis report, but list them under the "Non-scrapable sites" section.
- Links end up on the Unscannable Links list because they are checked with a HEAD request, even though a GET request gives a more accurate response. However, we don't want to check every link with GET due to the performance cost, nor do we want to double up on requests by falling back to GET after a failed HEAD.
- Therefore, we could use the Unscannable Links list to decide whether a link is checked with HEAD or GET. This should give more accurate statuses for links on those unscannable domains while limiting any negative performance impact.
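The HEAD-vs-GET selection described above could be sketched roughly as follows. This is an illustrative Python sketch, not CodeAuditor's actual implementation; the `UNSCANNABLE_DOMAINS` set and the `choose_method` helper are hypothetical names:

```python
from urllib.parse import urlparse

# Hypothetical list of domains known to block scraping or reject HEAD requests.
UNSCANNABLE_DOMAINS = {"microsoft.com", "twitter.com"}

def choose_method(url: str) -> str:
    """Pick the HTTP method used to check a link.

    Links on known-unscannable domains are checked with GET (more accurate),
    everything else with the cheaper HEAD request. This avoids both a blanket
    GET policy and a HEAD-then-GET double request for the same link.
    """
    host = urlparse(url).hostname or ""
    if any(host == d or host.endswith("." + d) for d in UNSCANNABLE_DOMAINS):
        return "GET"
    return "HEAD"
```

The list lookup is done per-domain rather than per-URL, so one entry covers every link on an unscannable site.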
- CodeAuditor includes a dedicated section that records unsuccessful scraping attempts.
- The following is a list of websites known to implement anti-web-scraping measures:
- Artillery will fail when the domain of the website's cookies does not match the host domain. Make sure the domain name in the cookie configuration matches the host domain name so tests run smoothly.
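The reason the domains must line up is standard cookie domain matching: a cookie only applies to a host that equals its domain or is a subdomain of it. A minimal sketch of that check (simplified from RFC 6265; the `cookie_domain_matches` helper is a hypothetical name, not part of Artillery):

```python
def cookie_domain_matches(cookie_domain: str, host: str) -> bool:
    """Return True if a cookie set for cookie_domain applies to host.

    Simplified RFC 6265 domain matching: the host must equal the cookie
    domain or be a subdomain of it. A cookie whose domain fails this check
    against the test's host is effectively dropped, which is why a mismatch
    between the cookie config and the target host breaks Artillery runs.
    """
    cookie_domain = cookie_domain.lstrip(".").lower()
    host = host.lower()
    return host == cookie_domain or host.endswith("." + cookie_domain)
```

For example, a cookie configured for `.example.com` applies to `www.example.com`, but a cookie configured for `other.com` does not.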
- If your website is hosted on Azure, make sure to disable ARR Affinity.
- 🤖 CodeAuditor lets users add URLs to a designated "Ignored List". This list contains URLs that are intentionally excluded from scanning, for example when the user knows in advance that a particular URL is unscannable. Adding URLs to the Ignored List removes them from the generated report, producing a cleaner and more accurate result.
- When adding to the "Ignored URL List", users can use glob patterns to define precisely which URLs to exclude. CodeAuditor supports the following glob patterns, detailed below with examples:
- Asterisk (*) - Matches any sequence of characters within a single path segment.
Example: example.com/*/page matches example.com/123/page and example.com/abc/page.
- Double Asterisk (**) - Matches any sequence of characters across multiple path segments.
Example: example.com/** matches example.com/123/page, example.com/abc/def/page, or example.com/page.
- Question Mark (?) - Matches any single character within a URL.
Example: example.com/page-?.html matches example.com/page-1.html or example.com/page-a.html.
- Character Set ([]) - Matches any single character within the specified set.
Example: example.com/page-[123].html matches example.com/page-1.html, example.com/page-2.html, or example.com/page-3.html.
- Negation (!) - Excludes URLs that match the specified pattern.
Example: example.com/!page*.html excludes URLs like example.com/page-1.html or example.com/page-2.html.
- Range (-) - Matches a range of characters within a URL.
Example: example.com/page-[1-5].html matches example.com/page-1.html, example.com/page-2.html, and example.com/page-5.html.
- Brace Expansion ({}) - Matches any of the comma-separated values within the braces.
Example: example.com/{page,post}.html matches example.com/page.html and example.com/post.html.
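As a rough illustration, most of the patterns above can be translated into regular expressions. The sketch below is a simplified Python translation, not CodeAuditor's actual implementation (which likely uses a glob library); negation (`!`) is omitted because a matcher typically applies it by inverting the result, and nested globs inside braces are not handled:

```python
import re

def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a URL glob pattern into a compiled regex (simplified sketch)."""
    i, out = 0, []
    while i < len(pattern):
        c = pattern[i]
        if c == "*":
            if pattern[i:i + 2] == "**":
                out.append(".*")        # ** spans multiple path segments
                i += 2
            else:
                out.append("[^/]*")     # * stays within one path segment
                i += 1
        elif c == "?":
            out.append(".")             # any single character
            i += 1
        elif c == "[":
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])  # character set/range passes through
            i = j + 1
        elif c == "{":
            j = pattern.index("}", i)
            choices = pattern[i + 1:j].split(",")
            out.append("(" + "|".join(re.escape(ch) for ch in choices) + ")")
            i = j + 1
        else:
            out.append(re.escape(c))    # literal character
            i += 1
    return re.compile("^" + "".join(out) + "$")

def matches(pattern: str, url: str) -> bool:
    """Return True if url matches the glob pattern."""
    return glob_to_regex(pattern).match(url) is not None
```

Using the examples from the list above, `matches("example.com/*/page", "example.com/123/page")` is true, while `matches("example.com/*/page", "example.com/a/b/page")` is false because `*` does not cross the `/` segment boundary.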