These Are Not The Scrapes You're Looking For - Session Anomalies
In my first article in this series, I discussed web scraping -- what it is, why people do it, and why it could be harmful. My second article outlined the details of bot detection and how the ASM blocks these pesky little creatures. This last article in the web scraping series will focus on the final part of the ASM defense against web scraping: session opening anomalies and session transaction anomalies. These two detection modes are new in v11.3, so if you're using v11.2 or earlier, then you should upgrade and take advantage of these great new features!

ASM Configuration

In case you missed it in the bot detection article, here's a quick screenshot that shows the location and settings of the Session Opening and Session Transactions Anomaly in the ASM. You'll find all the fun when you navigate to Security > Application Security > Anomaly Detection > Web Scraping.

There are three different settings in the ASM for Session Anomaly: Off, Alarm, and Alarm and Block. (Note: these settings are configured independently...they don't have to be set at the same value.) Obviously, if Session Anomaly is set to "Off" then the ASM does not check for anomalies at all. The "Alarm" setting will detect anomalies and record attack data, but it will allow the client to continue accessing the website. The "Alarm and Block" setting will detect anomalies, record the attack data, and block the suspicious requests.

Session Opening Anomaly

The first detection and prevention mode we'll discuss is Session Opening Anomaly. But before we get too deep into this, let's review what a session is. From a simple perspective, a session begins when a client visits a website, and it ends when the client leaves the site (or the client exceeds the session timeout value). Most clients will visit a website, surf around some links on the site, find the information they need, and then leave.
When clients don't follow a typical browsing pattern, it makes you wonder what they are up to and if they are one of the bad guys trying to scrape your site. That's where Session Opening Anomaly defense comes in! Session Opening Anomaly defense checks for lots of abnormal activities like clients that don't accept cookies or process JavaScript, clients that request resources directly instead of surfing internal links in the application, and clients that create a one-time session for each resource they consume. These one-time sessions lead scrapers to open a large number of new sessions in order to complete their job quickly.

What's Considered A New Session?

Since we are discussing session anomalies, I figured we should spend a few sentences describing how the ASM differentiates between a new and an ongoing session for each client request. Each new client is assigned a "TS cookie" and this cookie is used by the ASM to identify future requests from the client with a known, ongoing session. If the ASM receives a client request and the request does not contain a TS cookie, then the ASM knows the request is for a new session. This will prove very important when calculating the values needed to determine whether or not a client is scraping your site.

Detection

There are two different methods used by the ASM to detect these anomalies. The first method compares a calculated value to a predetermined ceiling value for newly opened sessions. The second method considers the rate of increase of newly opened sessions. We'll dig into all that in just a minute. But first, let's look at the criteria used for detecting these anomalies. As you can see from the screenshot above, there are three detection criteria the ASM uses:

Sessions opened per second increased by: This specifies that the ASM considers client traffic to be an attack if the number of sessions opened per second increases by a given percentage. The default setting is 500 percent.
Sessions opened per second reached: This specifies that the ASM considers client traffic to be an attack if the number of sessions opened per second is greater than or equal to this number. The default value is 400 sessions opened per second.
Minimum sessions opened per second threshold for detection: This specifies that the ASM considers traffic to be an attack if the number of sessions opened per second is greater than or equal to the number specified. In addition, at least one of the "Sessions opened per second increased by" or "Sessions opened per second reached" numbers must also be reached. If the number of sessions opened per second is lower than the specified number, the ASM does not consider this traffic to be an attack even if one of the "Sessions opened per second increased by" or "Sessions opened per second reached" numbers was reached. The default value for this setting is 200 sessions opened per second.

In addition, the ASM maintains two variables for each client IP address: a one-minute running average of new session opening rate, and a one-hour running average of new session opening rate. Both of these variables are recalculated every second.

Now that we have all the basic building blocks, let's look at how the ASM determines if a client is scraping your site.

First Method: Predefined Ceiling Value

This method uses the user-defined "minimum sessions opened per second threshold for detection" value and compares it to the one-minute running average. If the one-minute average is less than this number, then nothing else happens because the minimum threshold has not been met. But, if the one-minute average is higher than this number, the ASM goes on to compare the one-minute average to the user-defined "sessions opened per second reached" value. If the one-minute average is less than this value, nothing happens. But, if the one-minute average is higher than this value, the ASM will declare the client a web scraper.
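The two-step comparison above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not ASM code: the function name and parameters are made up for the example, and the defaults are the 200 and 400 values quoted in this article. The sketch uses a strict "higher than" test, following the wording of the method; the setting descriptions say "greater than or equal to", so how the exact boundary is handled is an assumption.

```python
def exceeds_ceiling(one_minute_avg, minimum_threshold=200, reached_threshold=400):
    """Hypothetical sketch of the first detection method (predefined ceiling).

    one_minute_avg: one-minute running average of new sessions per second
    for a single client IP address.
    """
    # Step 1: nothing happens unless the minimum threshold has been met.
    if one_minute_avg <= minimum_threshold:
        return False
    # Step 2: the client is declared a scraper only above the ceiling value.
    return one_minute_avg > reached_threshold


# Three illustrative one-minute averages (made-up numbers).
for rate in (150, 300, 450):
    print(rate, exceeds_ceiling(rate))
```

With the default values, only the 450 sessions-per-second case is flagged: 150 never clears the minimum threshold, and 300 clears the minimum but stays under the ceiling.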
The following flowchart provides a pictorial representation of this process.

Second Method: Rate of Increase

The second detection method uses several variables to compare the rate of increase of newly opened sessions against user-defined variables. Like the first method, this method first checks to make sure the minimum sessions opened per second threshold is met before doing anything else. If the minimum threshold has been met, the ASM will perform a few more calculations to determine if the client is a web scraper or not. The "sessions opened per second increased by" value (percentage) is multiplied by the one-hour running average and this value is compared to the one-minute running average. If the one-minute average is greater, then the ASM declares the client a web scraper. If the one-minute average is lower, then nothing happens. The following matrix shows a few examples of this detection method. Keep in mind that the one-minute and one-hour averages are recalculated every second, so these values will be very dynamic.

Prevention

The ASM provides several policies to prevent session opening anomalies. It begins with the first method that you enable in this list. If the system finds this method not effective enough to stop the attack, it uses the next method that you enable in this list. The following screenshots show the different options available for prevention. The "Drop IP Addresses with bad reputation" option is tied to Rate Limiting, so it will not appear as an option unless you enable Rate Limiting. Note that IP Address Intelligence must be licensed and enabled. This feature is licensed separately from the other ASM web scraping options.

Here's a quick breakdown of what each of these prevention policies does for you:

Client Side Integrity Defense: The system determines whether the client is a legal browser or an illegal script by sending a JavaScript challenge to each new session request from the detected IP address, and waiting for a response.
The JavaScript challenge will typically involve some sort of computational challenge. Legal browsers will respond with a TS cookie while illegal scripts will not. The default for this feature is disabled.

Rate Limiting: The goal of Rate Limiting is to keep the volume of new sessions at a "non-attack" level. The system will drop sessions from suspicious IP addresses after the system determines that the client is an illegal script. The default for this feature is also disabled.

Drop IP Addresses with bad reputation: The system drops requests from IP addresses that have a bad reputation according to the system's IP Address Intelligence database (shown above). The ASM will drop all requests from any "bad" IP addresses even if they respond with a TS cookie. IP addresses that do not have a bad reputation also undergo rate limiting. The default for this option is disabled. Keep in mind that this option is available only after Rate Limiting is enabled. In addition, this option is only enforced if at least one of the IP Address Intelligence Categories is set to Alarm mode.

Prevention Duration

Now that we have detected session opening anomalies and mitigated them using our prevention options, we must figure out how long to apply the prevention measures. This is where the Prevention Duration comes in. This setting specifies the length of time that the system will prevent an attack. The system prevents attacks by rejecting requests from the attacking IP address. There are two settings for Prevention Duration:

Unlimited: This specifies that after the system detects and stops an attack, it performs attack prevention until it detects the end of the attack. This is the default setting.

Maximum <number of> seconds: This specifies that after the system detects and stops an attack, it performs attack prevention for the amount of time indicated unless the system detects the end of the attack earlier.
So, to finish up our Session Opening Anomaly part of this article, I wanted to share a quick scenario. I was recently reading several articles from some of the web scrapers around the block, and I found one guy's solution to work around web scraping defense. Here's what he said:

"Since the service conducted rate-limiting based on IP address, my solution was to put the code that hit their service into some client-side JavaScript, and then send the results back to my server from each of the clients. This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit."

This guy is really smart! And, this would work great against a web scraping defense that only offered a Rate Limiting feature. Here's the pop quiz question: If a user were to deploy this same tactic against the ASM, what would you do to catch this guy? I'm thinking you would need to set your minimum threshold at an appropriate level (this will ensure the ASM kicks into gear when all these sessions are opened) and then the "sessions opened per second reached" or the "sessions opened per second increased by" setting should take care of the rest for you. As always, it's important to learn what each setting does and then test it in your own environment for a period of time to ensure you have everything tuned correctly. And, don't forget to revisit your settings from time to time...you will probably need to change them as your network environment changes.

Session Transactions Anomaly

The second detection and prevention mode is Session Transactions Anomaly. This mode specifies how the ASM reacts when it detects a large number of transactions per session as well as a large increase of session transactions. Keep in mind that web scrapers are designed to extract content from your website as quickly and efficiently as possible.
So, web scrapers normally perform many more transactions than a typical application client. Even if a web scraper found a way around all the other defenses we've discussed, the Session Transaction Anomaly defense should be able to catch it based on the sheer number of transactions it performs during a given session. The ASM detects this activity by counting the number of transactions per session and comparing that number to a total average of transactions from all sessions. The following screenshot shows the detection and prevention criteria for Session Transactions Anomaly.

Detection

How does the ASM detect all this bad behavior? Well, since it's trying to find clients that surf your site much more than other clients, it tracks the number of transactions per client session (note: the ASM will drop a session from the table if no transactions are performed for 15 minutes). It also tracks the average number of transactions for all current sessions (note: the ASM calculates the average transaction value every minute). It can use these two figures to compare a specific client session to a reasonable baseline and figure out if the client is performing too many transactions. The ASM can automatically figure out the number of transactions per client, but it needs some user-defined thresholds to conduct the appropriate comparisons. These thresholds are as follows:

Session transactions increased by: This specifies that the system considers traffic to be an attack if the number of transactions per session increased by the percentage listed. The default setting is 500 percent.
Session transactions reached: This specifies that the system considers traffic to be an attack if the number of transactions per session is equal to or greater than this number. The default value is 400 transactions.
Minimum session transactions threshold for detection: This specifies that the system considers traffic to be an attack if the number of transactions per session is equal to or greater than this number, and at least one of the "Session transactions increased by" or "Session transactions reached" numbers was reached. If the number of transactions per session is lower than this number, the system does not consider this traffic to be an attack even if one of the "Session transactions increased by" or "Session transactions reached" numbers was reached. The default value is 200 transactions.

The following table shows an example of how the ASM calculates transaction values (averages and individual sessions). We would expect that a given client session would perform about the same number of transactions as the overall average number of transactions per session. But, if one of the sessions is performing a significantly higher number of transactions than the average, then we start to get suspicious. You can see that session 1 and session 3 have transaction values higher than the average, but that only tells part of the story. We need to consider a few more things before we decide if this client is a web scraper or not. By the way, if the ASM knows that a given session is malicious, it does not use that session's transaction numbers when it calculates the average.

Now, let's roll in the threshold values that we discussed above. If the ASM is going to declare a client as a web scraper using the session transaction anomaly defense, the session transactions must first reach the minimum threshold. Using our default minimum threshold value of 200, the only session that exceeded the minimum threshold is session 3 (250 > 200). All other sessions look good so far...keep in mind that these numbers will change as the client performs additional transactions during the session, so more sessions may be considered as their transaction numbers increase.
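Here is a minimal Python sketch of how the three thresholds defined above could combine into a single decision. It is an illustration only, not ASM code: the function name and parameters are hypothetical, and the defaults are the 200, 400, and 500 percent values quoted in this article. The "increased by" test multiplies the all-session average by the percentage, which is how the walk-through that follows applies it, and the all-session average of 90 is the figure that walk-through uses. The comparisons use "equal to or greater than", per the threshold descriptions; treat the exact boundary behavior as an assumption.

```python
def is_transaction_anomaly(session_tx, average_tx,
                           minimum=200, reached=400, increased_by_pct=500):
    """Hypothetical sketch of the Session Transactions Anomaly decision.

    session_tx: transactions counted so far in one client session.
    average_tx: average transactions per session across all current sessions.
    """
    # The minimum threshold must be met before anything else is considered.
    if session_tx < minimum:
        return False
    # Method 1: absolute ceiling on transactions per session.
    hit_reached = session_tx >= reached
    # Method 2: session is far above the average (e.g. 90 * 500% = 450).
    hit_increase = session_tx >= average_tx * increased_by_pct / 100
    # Only one of the two methods needs to trigger.
    return hit_reached or hit_increase


# Session 3 from the example: 250 transactions against an average of 90.
print(is_transaction_anomaly(250, 90))
```

Session 3 passes the minimum threshold but trips neither of the two methods, so the sketch returns False for it. If the same session kept going and reached 450 transactions, it would be flagged.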
Since we have our eye on session 3 at this point, it's time to look at our two methods of detecting an attack. The first detection method is a simple comparison of the total session transaction value to our user-defined "session transactions reached" threshold. If the total session transactions is larger than the threshold, the ASM will declare the client a web scraper. Our example would look like this:

Is session 3 transaction value > threshold value (250 > 400)? No, so the ASM does not declare this client as a web scraper.

The second detection method uses the "session transactions increased by" value along with the average transaction value for all sessions. The ASM multiplies the average transaction value by the "session transactions increased by" percentage to calculate the value needed for comparison. Our example would look like this:

90 * 500% = 450 transactions
Is session 3 transaction value > result (250 > 450)? No, so the ASM does not declare this client as a web scraper.

By the way, only one of these detection methods needs to be met for the ASM to declare the client as a web scraper. You should be able to see how the user-defined thresholds are used in these calculations and comparisons. So, it's important to raise or lower these values as you need for your environment.

Prevention Duration

In order to save you a bunch of time reading about prevention duration, I'll just say that the Session Transactions Anomaly prevention duration works the same as the Session Opening Anomaly prevention duration (Unlimited vs Maximum <number of> seconds). See, that was easy!

Conclusion

Thanks for spending some time reading about session anomalies and web scraping defense. The ASM does a great job of detecting and preventing web scrapers from taking your valuable information. One more thing...for an informative anomaly discussion on the DevCentral Security Forum, check out this conversation.
If you have any questions about web scraping or ASM configurations, let me know...you can fill out the comment section below or you can contact the DevCentral team at https://devcentral.f5.com/s/community/contact-us.

More Web Scraping - Bot Detection
In my last article, I discussed the issue of web scraping and why it could be a problem for many individuals and/or companies. In this article, we will dive into some of the technical details regarding bots and how the BIG-IP Application Security Manager (ASM) can detect them and block them from scraping your website.

What Is A Bot?

A bot is a software application that runs automated tasks and typically performs these tasks much faster than a human possibly could. In the context of web scraping, bots are used to extract data from websites, parse the data, and assemble it into a structured format where it can be presented in a useful form. Bots can perform many other actions as well, like submitting forms, setting up schedules, and connecting to databases. They can also do fun things like add friends to social networking sites like Twitter, Facebook, Google+, and others. A quick Internet search will show that many different bot tools are readily available for download free of charge. We won't go into the specifics of each vendor's bot application, but it's important to understand that they are out there and are very easy to use.

Bot Detection

So, now that we know what a bot is and what it does, how can we distinguish between malicious bot activity and harmless human activity? Well, the ASM is configured to check for some very specific activities that help it determine if the client source is a bot or a human. By the way, it's important to note that the ASM can accurately detect a human user only if clients have JavaScript enabled and support cookies.

There are three different settings in the ASM for bot detection: Off, Alarm, and Alarm and Block. Obviously, if bot detection is set to "Off" then the ASM does not check for bot activity at all. The "Alarm" setting will detect bot activity and record attack data, but it will allow the client to continue accessing the website.
The "Alarm and Block" setting will detect bot activity, record the attack data, and block the suspicious requests. These settings are shown in the screenshot below.

Once you apply the setting for bot detection, you can then tune the ASM to begin checking for bots that are accessing your website. The bot detection utilizes four different settings to detect and defend against bot activity: Rapid Surfing, Grace Interval, Unsafe Interval, and Safe Interval.

Rapid Surfing detects bot activity by measuring the client's page consumption speed. A page change is counted from the page load event to its unload event. The ASM configuration allows you to set a maximum number of page changes for a given time period (measured in milliseconds). If a page changes more than the maximum allowable times for the given time interval, the ASM will declare the client as a bot and perform the action that was set for bot detection (Off, Alarm, Alarm and Block). The default setting for Rapid Surfing is 5 page changes per second (or 1000 milliseconds).

The Grace Interval setting specifies the maximum number of page requests the system reviews while it tries to detect whether the client is a human or a bot. As soon as the system makes the determination of human or bot, it ends the Grace Interval and stops checking for bots. The default setting for the Grace Interval is 100 requests.

Once the system determines that the client is valid, the system does not check the subsequent requests, up to the number specified in the Safe Interval setting. This setting allows for normal client activity to continue since the ASM has determined the client is safe (during the Grace Interval). Once the number of requests sent by the client reaches the value specified in the Safe Interval setting, the system reactivates the Grace Interval and begins the process again. The default setting for the Safe Interval is 2000 requests.
This Safe Interval is nice because it lowers the processing overhead needed to constantly check every client request.

If the system does not detect a valid client during the Grace Interval, the system issues and continues to issue the "Web Scraping Detected" violation until it reaches the number of requests specified in the Unsafe Interval setting. The Unsafe Interval setting specifies the number of requests that the ASM considers unsafe. Much like in the Safe Interval, after the client sends the number of requests specified in the Unsafe Interval setting, the system reactivates the Grace Interval and begins the process again. The default setting for the Unsafe Interval is 100 requests. The following figure shows the settings for Bot Detection and the values associated with each setting.

Interval Timing

The following picture shows a timeline of client requests and the intervals associated with each request. In the example, the first 100 client requests will fall into the Grace Interval, and during this interval the ASM will be determining whether or not the client is a bot. Let's say a bot is detected at client request 100. Then, the ASM will immediately invoke the Unsafe Interval and the next 100 requests will be issued a "Web Scraping Detected" violation. When the Unsafe Interval is complete, the ASM reverts back to the Grace Interval. If, during the Grace Interval, the system determines that the client is a human, it does not check the subsequent requests at all (during the Safe Interval). Once the Safe Interval is complete, the system moves back into the Grace Interval and the process continues.

Notice that the ASM is able to detect a bot before the Grace Interval is complete (as shown in the latter part of the diagram below). As soon as the system detects a bot, it immediately moves into the Unsafe Interval...even if the Grace Interval has not reached its set threshold.
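The interval cycle described above behaves like a small state machine, and the following Python sketch is one way to picture it. It is not ASM code: the class, method, and verdict names are invented for the example, and the defaults are the 100 / 2000 / 100 request values quoted in this article. Two details are assumptions on my part: the request on which a bot is detected is not itself flagged (the text says the "next" requests receive the violation), and a Grace Interval that ends without a human verdict moves to the Unsafe Interval.

```python
from enum import Enum


class Phase(Enum):
    GRACE = "grace"
    SAFE = "safe"
    UNSAFE = "unsafe"


class IntervalTracker:
    """Hypothetical per-client tracker for the Grace/Safe/Unsafe cycle."""

    def __init__(self, grace=100, safe=2000, unsafe=100):
        self.limits = {Phase.GRACE: grace, Phase.SAFE: safe, Phase.UNSAFE: unsafe}
        self.phase = Phase.GRACE
        self.count = 0  # requests seen in the current phase

    def _enter(self, phase):
        self.phase = phase
        self.count = 0

    def on_request(self, verdict=None):
        """Process one request; return True if it would get the violation.

        verdict is "bot", "human", or None (undecided) and is only
        consulted during the Grace Interval.
        """
        self.count += 1
        if self.phase is Phase.GRACE:
            if verdict == "human":
                self._enter(Phase.SAFE)
            elif verdict == "bot" or self.count >= self.limits[Phase.GRACE]:
                # Bot found, or no valid client detected in time.
                self._enter(Phase.UNSAFE)
            return False
        flagged = self.phase is Phase.UNSAFE
        if self.count >= self.limits[self.phase]:
            # Safe or Unsafe Interval used up: start checking again.
            self._enter(Phase.GRACE)
        return flagged


# Replay the timeline example: a bot is detected at request 100.
tracker = IntervalTracker()
for _ in range(99):
    tracker.on_request()
tracker.on_request("bot")
violations = sum(tracker.on_request() for _ in range(100))
print(violations, tracker.phase.name)
```

Replaying the example, the 100 requests that follow the detection are all flagged, and the tracker is back in the Grace Interval afterward. A "bot" verdict on an earlier request would move it to the Unsafe Interval immediately, matching the early-detection case in the diagram.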
Setting Thresholds

As you can see from the timeline above, it's important to establish the correct thresholds for each interval setting. The longer you make the Grace Interval, the longer you give the ASM a chance to detect a bot, but keep in mind that the processing overhead can become expensive. Likewise, the Unsafe Interval setting is a great feature, but if you set it too high, your requesting clients will have to sit through a long period of violation notices before they can access your site again. Finally, the Safe Interval setting allows your users free and open access to your site. If you set this too low, you will force the system to cycle through the Grace Interval unnecessarily, but if you set it too high, a bot might have a chance to sneak past the ASM defense and scrape your site. Remember, the ASM does not check client requests at all during the Safe Interval.

Also, remember the ASM does not perform web scraping detection on traffic from search engines that the system recognizes as being legitimate. If your web application has its own search engine, it's recommended that you add it to the system. Go to Security > Options > Application Security > Advanced Configuration > Search Engines and add it to the list (the ASM comes preconfigured with Ask, Bing, Google, and Yahoo already loaded).

A Quick Test

I loaded up a sample virtual web server (in this case it was a fictitious online auction site) and then configured the ASM to Alarm and Block bot activity on the site. Then, I fired up the iMacros plugin for Firefox to scrape the site. Using the iMacros plugin, I sent many client requests in a short amount of time. After several requests to the site, I received the response page shown below.

You can see that the response page settings in the ASM are shown in the Firefox browser window when web scraping is detected. These settings can be adjusted in the ASM by navigating to Security > Application Security > Blocking > Response Pages.
Well, thanks for coming back and reading about ASM bot detection. Be sure to swing by again for the final web scraping article where I will discuss session anomalies. I can't possibly think of anything that could be more fun than that!