
anti-crawler


Tags: http, security, optimization

2024-01-14

Note

First edition, to be proofread.

Crawling and anti-crawling is a long-running battle of offense and defense: a commercial struggle as much as a technical one, a battlefield without gunsmoke. No crawling technique works forever, and it is just as hard for defenders to find a measure that blocks crawling once and for all. The two sides fight continuously; attackers keep changing their methods, and defenders keep updating their countermeasures in response.

Crawling is hard to eliminate outright. Most anti-crawler measures simply keep raising the difficulty, driving up the crawler's cost until the benefit it brings falls below the price paid for it.

This article discusses only some technical approaches to anti-crawling. For approaches that involve security and confidentiality, the details and implementation methods are not given.


Crawler

A crawler is a program that obtains information by crawling web page content.

At present, many e-commerce and content websites face pressure from normal user traffic on one side and, on the other, from crawler traffic that far exceeds it.

This not only consumes large amounts of server resources and bandwidth, but also lets third parties illegally obtain large volumes of commercially valuable data, causing heavy losses.

Common crawler strategies

Common web crawler strategies include the following:

Breadth-first traversal strategy

This is the most common crawling strategy. Starting from a seed page, the crawler first collects as many links as possible from that page, then traverses those links breadth-first to reach as many crawlable pages as possible (a minimal code sketch appears at the end of this list).

Partial PageRank strategy

Apply the PageRank algorithm to the pages downloaded so far to estimate the priority of candidate pages, and crawl pages with higher PageRank values first.

OPIC strategy

OPIC (Online Page Importance Computation) estimates each page's importance online, while crawling, and assigns crawling priority according to that importance.

Large site priority strategy

This strategy gives priority to crawling web pages of large websites, because large websites usually contain more valuable information.

Web page update strategy

This strategy assigns crawling priorities based on the update frequency of web pages, giving priority to crawling frequently updated web pages.

Distributed cluster crawler

For large-scale data crawling tasks, distributed cluster crawlers can be used. This type of crawler can crawl data from multiple nodes at the same time to improve crawling efficiency.

In addition, there are strategies such as using proxy IP pools and simulated logins.
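As referenced above, here is a minimal, hedged sketch of the breadth-first strategy, assuming a Node.js 18+ runtime with a global fetch; the start URL, page limit, and regex-based link extraction are illustrative only, not a production-grade crawler.

// Breadth-first crawl: a FIFO queue of URLs, expanded level by level.
// Start URL, page limit and regex-based link extraction are illustrative.
const start = 'https://example.com/'
const visited = new Set()
const queue = [start]

async function crawl(limit = 50) {
  while (queue.length > 0 && visited.size < limit) {
    const url = queue.shift() // FIFO queue gives breadth-first order
    if (visited.has(url)) continue
    visited.add(url)
    try {
      const html = await (await fetch(url)).text()
      // Push every absolute link found on the page onto the queue
      for (const [, link] of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
        if (!visited.has(link)) queue.push(link)
      }
    } catch (err) {
      // Skip pages that fail to load or parse
    }
  }
  return visited
}

crawl().then((pages) => console.log(`crawled ${pages.size} pages`))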

Common technical solutions for crawling content

There are many technical approaches to crawling web content; the following are common ones:

HTTP request library

Python has many powerful HTTP request libraries, such as requests, urllib3, and http.client, which can send various types of HTTP requests, including GET and POST.

Browser automation tools

Tools such as Selenium and Pyppeteer can simulate user actions in a real browser to crawl dynamically rendered page content (see the sketch after this list).

Proxy IP

When crawling web pages, you may run into IP blocking; proxy IPs can be used to avoid being blocked.

Multithreading or multi-process

Multithreading or multi-process can improve the efficiency of crawlers, but you need to pay attention to thread safety or process safety.

Distributed crawler

A distributed crawler can crawl data from multiple nodes at the same time.

Database storage data

When crawling a large amount of data, use a database to store data, such as MySQL, MongoDB, etc.

Caching technology

Use caching to avoid repeatedly crawling the same web pages.

Parsing tools such as regular expressions or BeautifulSoup

These tools can help extract the required data from web pages.
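As a minimal sketch of the browser-automation approach mentioned above (Pyppeteer is the Python port of Puppeteer), the following assumes Node.js with the puppeteer package installed; the URL is illustrative only.

// Render a dynamic page in a headless browser before extracting its HTML.
// Assumes `npm install puppeteer`; the URL is illustrative only.
const puppeteer = require('puppeteer')

async function scrape(url) {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  // Wait until network activity settles so client-rendered content is present
  await page.goto(url, { waitUntil: 'networkidle2' })
  const html = await page.content()
  await browser.close()
  return html
}

scrape('https://example.com/').then((html) => console.log(html.length))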

Anti-crawler

Anti-crawler is a series of defensive measures to deal with crawlers.

From the perspective of crawler strategies and technical solutions, common features of crawler programs include:

  1. High-frequency access

  2. Capturing HTML content

  3. Capturing data interfaces

  4. Simulating user behavior, for example with tools such as Selenium and Pyppeteer

Anti-crawler strategy

Based on these features, common anti-crawler strategies include:

  1. User-Agent + Referer detection

  2. Account and Cookie verification

  3. Verification code

  4. IP frequency limit

However, with tools such as Selenium and Pyppeteer available, User-Agent + Referer detection is easily bypassed.

Gray-market operations at home and abroad sell virtual mobile phone numbers, and some even specialize in "account farming" against various companies' account systems, which forces account and cookie verification strategies to become more complex.

IP frequency limits are easily bypassed with a proxy IP pool. Of course, we can also maintain a blacklist IP pool to block IPs that have been identified as crawlers.
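As a minimal, hedged sketch of strategies 1 and 4 above, using only Node's built-in http module; the User-Agent patterns, paths, thresholds, and responses are illustrative, not a production-grade scheme.

// Reject obvious crawler User-Agents, require a Referer on data interfaces,
// and rate-limit each IP. Patterns, paths and thresholds are illustrative.
const http = require('http')

const hits = new Map() // ip -> request count in the current window
setInterval(() => hits.clear(), 60 * 1000) // reset the window every minute

http
  .createServer((req, res) => {
    const ua = req.headers['user-agent'] || ''
    const referer = req.headers['referer'] || ''
    const ip = req.socket.remoteAddress

    // 1. User-Agent + Referer detection
    if (/python-requests|scrapy|curl/i.test(ua) || (req.url.startsWith('/api/') && !referer)) {
      res.writeHead(403)
      return res.end('Forbidden')
    }

    // 4. IP frequency limit (could also feed a blacklist IP pool)
    const count = (hits.get(ip) || 0) + 1
    hits.set(ip, count)
    if (count > 300) {
      res.writeHead(429)
      return res.end('Too Many Requests')
    }

    res.writeHead(200, { 'Content-Type': 'text/html' })
    res.end('<p>normal content</p>')
  })
  .listen(8080)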

Moreover, image recognition, including AI-based recognition, keeps maturing and becoming more accurate. Simple noisy-image captchas have lost their effect entirely and have become solutions that only hurt the user experience. Of course, there are also options such as slider captchas and face recognition that provide better anti-crawler protection.


The strategies above are defenses applied before the crawler reaches the content. If a crawler breaks through them, we can also apply defenses to the content itself.

Content defense

Content defense means applying defensive measures to the content itself before it is crawled, mainly in two directions:

  1. Prevent crawlers from seeing the content

  2. Prevent crawlers from reading the content

Preventing crawlers from seeing content

Generally, when a page is newly released or updated, it has to be crawled again. The crawler developer first checks what has changed on the page, adjusts the crawler program, and then crawls. During this process, the developer will usually open the browser's debug console to inspect the page source, data interfaces, and so on.

To counter this, we can take defensive measures against the act of opening the debug console, based on its characteristics and behavior.

Debug console

The first common measure is to add a debugger statement to the source code, so that the page enters debug mode as soon as the console is opened.

Combined with a timer, this becomes an "infinite debugger":

function bun() {
  // With DevTools open, execution pauses here every 50 ms
  setInterval(() => {
    debugger
  }, 50)
}
try {
  bun()
} catch (err) {}

When the console is opened, execution keeps getting interrupted by the debugger, so breakpoint debugging cannot proceed and the page's requests cannot be inspected.

Countermeasures against the infinite debugger

An infinite debugger usually only stops novices; it is largely ineffective against experienced developers.

They can turn it off with the Deactivate breakpoints button in the console (shortcut Ctrl + F8), or use Add script to ignore list to skip the execution of specific lines or files.


Therefore, some more complex variations have been derived:

If the code is formatted so that the debugger statement sits on a single minified line, the Deactivate breakpoints button (Ctrl + F8) no longer turns the debugger off, but Add script to ignore list can still stop it from running.

function bun(){setInterval(()=>{debugger},50)}try{bun()}catch(err){}

Going further, debugger can be rewritten as Function('debugger')() to defeat Add script to ignore list:

function bun() {
  setInterval(() => {
    Function('debugger')()
  }, 50)
}
try {
  bun()
} catch (err) {}

You can make the code more complex:

function bun() {
  setInterval(() => {
    // Equivalent to Function('debugger')(), but reached indirectly through the
    // constructor property, which makes it harder to spot and ignore
    ;(function () {
      return false
    })
      ['constructor']('debugger')
      ['call']()
  }, 50)
}
try {
  bun()
} catch (err) {}

Then perform code encryption and obfuscation, etc.

Detect window size changes

If the console is opened docked inside the window, it attaches to one side and shrinks the viewport. We can therefore infer that the console is open by checking the difference between the outer window size and the inner window size.

if (window.outerHeight - window.innerHeight > 200 || window.outerWidth - window.innerWidth > 200) {
  // Replace web page content
  document.body.innerHTML = 'Illegal debugging detected, please close and refresh and try again!'
}

When the console is detected, we can replace the entire page content or redirect to a new blank page, so that the correct content can no longer be seen.

The flaw is that if the console is opened undocked, in a separate window, the browser window size does not change, and this check cannot tell that the console is open.


We can combine the above two points to form a solution:

function bun() {
  if (window.outerHeight - window.innerHeight > 200 || window.outerWidth - window.innerWidth > 200) {
    // Replace page content
    document.body.innerHTML = 'Illegal debugging detected, please close and refresh to try again!'
  }
  setInterval(() => {
    ;(function () {
      return false
    })
      ['constructor']('debugger')
      ['call']()
  }, 50)
}
try {
  bun()
} catch (err) {}

But think about it carefully: inferring potential crawler behavior from whether the console is open is useful, but only slightly.

Modern Chrome has provided a content override feature since version 117: you can not only mock requests in the Network panel, but also directly replace the current page's resources in the Sources panel. This means that even if we add these detections to the code, a crawler developer can simply edit the page content in the Sources panel and delete them.

In addition, packet-capture tools such as Fiddler and Charles can intercept content directly, including HTTPS traffic via forged certificates.

Detection based on whether the console is open is therefore of limited use. At best it deters novices; its downside is that it further raises the debugging and troubleshooting costs for the site's own owners and maintainers.

So whether it is necessary to do so is a matter of opinion.

Preventing crawlers from reading content

When you cannot stop crawlers from reaching the page and crawling its content or data, you can continue to defend the content itself.

The premise of defending the content itself is that normal users must still see the content rendered correctly, while other measures make the content read by crawlers incorrect.

In this regard, different types of websites will use different technical means.

For content websites:

With pictures as the main content

Watermarks are usually added to the images; these include both visible watermarks and invisible (blind) watermarks.

With text creation as the main content

Common methods include inserting invisible characters into the text, or building a custom Unicode character mapping so that the characters in the source code differ from the characters actually rendered on the page.

However, rebuilding the mapping for the entire character set is a huge amount of work and adds a lot of unnecessary complexity, so generally only the key content needs to be processed.
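As a minimal sketch of the invisible-character approach above: zero-width spaces are interleaved into the key text, so the user sees nothing unusual while copied or crawled DOM text contains the extra characters. The .key-content selector and the insertion interval are assumptions; the same insertion could equally be done when the HTML is generated on the server.

// Interleave zero-width spaces (U+200B) into the key text.
// The .key-content selector and the insertion interval are illustrative.
function insertInvisible(text, every = 2) {
  return [...text]
    .map((ch, i) => ((i + 1) % every === 0 ? ch + '\u200B' : ch))
    .join('')
}

document.querySelectorAll('.key-content').forEach((el) => {
  el.textContent = insertInvisible(el.textContent)
})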


Character Obfuscation Strategy

The idea is to process the characters of the key content so that the characters in the source code differ from the characters actually rendered on the page: the user sees the real content, while the crawler obtains content containing obfuscated characters.

font-face character set

@font-face is the CSS rule used to define custom fonts.

A familiar example is the icon fonts from iconfont. The glyphs of the key content are drawn as SVG and packed into a font file, which is then declared with @font-face. Once the page loads this font, the content is written as Unicode code points in the source and rendered as the real characters.

With this approach, because the key content is replaced by Unicode code points, the crawler only captures the code points rather than the real content; to recover it, the crawler must also parse the corresponding font file and rebuild the mapping, which raises the cost of crawling. The font file can also be regenerated from time to time, changing the mapping and raising the cost further.
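A hedged illustration: assume a hypothetical font file anti-spider.woff2 whose glyphs map the private-use code points U+E001 to U+E00A to the digits 0 to 9; the file name, code points, and class name are all assumptions.

<!-- Hypothetical font: U+E001 to U+E00A render as the digits 0-9 -->
<style>
  @font-face {
    font-family: 'price-font';
    src: url('/fonts/anti-spider.woff2') format('woff2');
  }
  .price {
    font-family: 'price-font';
  }
</style>

<!-- The source only contains private-use code points; the user sees "128" -->
<span class="price">&#xE002;&#xE003;&#xE009;</span>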

background-image patchwork

Background-image patchwork is generally used where the key content consists of digits and letters. Because the character set is small, converting it to images does not cost many resources; a sprite can merge the characters into a single image, with background positioning controlling which character is displayed.

What the crawler captures this way is just a group of empty tags; to recover the content it must additionally read the CSS, fetch the images, and analyze the positioning information.
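A hedged illustration: assume a hypothetical sprite image digits.png containing the digits 0 to 9 side by side, each 10px wide and 14px high; the image path, sizes, and class names are assumptions.

<!-- Each span is empty; the digit shown depends only on background-position -->
<style>
  .digit {
    display: inline-block;
    width: 10px;
    height: 14px;
    background-image: url('/img/digits.png');
  }
  .d1 { background-position: -10px 0; }
  .d2 { background-position: -20px 0; }
  .d8 { background-position: -80px 0; }
</style>

<!-- The crawler only captures empty tags; the user sees "128" -->
<span class="digit d1"></span><span class="digit d2"></span><span class="digit d8"></span>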

Character interleaving

Character interleaving inserts characters between the normal content that are never rendered, but that do appear in the source code.

For example, take the content 12234:

<span>1</span><span>2</span><span>2</span><span>3</span><span>4</span>

and insert random digits into it:

<span>1</span><span>3</span><span>2</span><span>2</span><span>3</span><span>6</span><span>4</span>

Then, using class names, selectors, and similar rules, the inserted random digits are set to display: none or otherwise hidden.

This method hides noise that looks like normal content inside the real content; a crawler that does not know the rules will believe it has captured the correct data.
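Continuing the 12234 example, a hedged sketch of one possible hiding rule; the class name is an assumption, and real implementations rotate class names and rules so they are harder to filter out.

<!-- Spans marked with the (assumed) noise class are never rendered -->
<style>
  .n0 { display: none; }
</style>

<span>1</span><span class="n0">3</span><span>2</span><span>2</span><span>3</span><span class="n0">6</span><span>4</span>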

Pseudo-element hiding

With the pseudo-elements ::before and ::after, the key content is placed in the CSS content property rather than in the HTML.
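A minimal sketch, with an assumed class name: the HTML element stays empty, and the key content only exists in the stylesheet.

<!-- The crawler sees an empty tag; the value "128" lives only in the CSS -->
<style>
  .price::before {
    content: '128';
  }
</style>

<span class="price"></span>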

Element positioning interspersed

Element positioning interspersed means shuffling the correct content in the source, then using positioning to move each piece back to its correct visual position.
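A minimal sketch: the source order is 231, and absolute positioning restores the visual order 123. The class names and offsets are illustrative; real implementations generate them dynamically.

<!-- Source order "231" is rendered as "123" via absolute positioning -->
<style>
  .scrambled { position: relative; display: inline-block; width: 3ch; height: 1em; }
  .scrambled span { position: absolute; width: 1ch; }
  .p0 { left: 0; }
  .p1 { left: 1ch; }
  .p2 { left: 2ch; }
</style>

<span class="scrambled">
  <span class="p1">2</span><span class="p2">3</span><span class="p0">1</span>
</span>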

shadowDOM hiding

With the Element.attachShadow() method, a shadow DOM is attached to the specified element and the key content is written into it. Thanks to the characteristics of the shadow DOM, its mode can be set to 'closed', which refuses access to the shadow root from JavaScript outside it.

Even crawlers driven by Selenium or Pyppeteer cannot read the content through the regular DOM API. The drawback is shadow DOM compatibility, which may prevent some users from seeing the content.
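A minimal sketch; the host element id is an assumption.

// Attach a closed shadow root and write the key content into it.
// From outside, host.shadowRoot returns null, so scripts injected by
// Selenium/Puppeteer cannot reach the content through the regular DOM API.
const host = document.getElementById('key-content') // assumed host element
const shadow = host.attachShadow({ mode: 'closed' })
shadow.innerHTML = '<span>real content</span>'

console.log(host.shadowRoot) // null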


When the commercial benefit of crawling is large enough, more resources are invested in building crawler programs that specifically target the site's anti-crawler measures.

This is especially pronounced on e-commerce and platform websites. To compete on price, a platform needs to obtain competitors' product prices in a timely manner and adjust its own prices and promotion strategies, supported by measures such as price monitoring, price-change monitoring, and price alerts.

Often, as soon as one side adjusts its anti-crawler scheme and the other side's crawler can no longer fetch the content, alert notifications go out immediately by email, SMS, and so on. The two sides have settled into a long, sustained tug-of-war of attack and defense.

As things stand, it is very difficult to completely prevent crawlers from getting the content, and crawlers in turn have various means of checking whether they have fetched the correct content.

Since crawling cannot be eliminated, then beyond the necessary conventional anti-crawler measures, why not serve the crawler some real content, and mix some not-so-real content in with it?