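Font-Based Anti-Scraping: Case Analysis and Hands-On Scraping
Contents
- Font-Based Anti-Scraping: Case Analysis and Hands-On Scraping
- 1. Case Introduction
- 2. Case Analysis
- 3. Scraping
This section analyzes an anti-scraping case in which the real data is hidden in a font file, so that even with the page source in hand we cannot directly extract the real values.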
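1. Case Introduction
The case site is https://antispider4.scrape.center/. At first glance nothing about it looks unusual, so let's start by using Selenium to scrape some information such as each movie's title, categories, and score. The code is as follows: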
from selenium import webdriver
from pyquery import PyQuery as pq
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://antispider4.scrape.center/')
# Wait up to 10 seconds until the movie entries have been rendered
WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.item')))
html = browser.page_source
doc = pq(html)
items = doc('.item')
for item in items.items():
    name = item('.name').text()
    categories = [o.text() for o in item('.categories button').items()]
    score = item('.score').text()
    print(f'name: {name} categories: {categories} score: {score}')
browser.close()
name: 霸王別姬 - Farewell My Concubine categories: ['劇情', '愛情'] score:
name: 這個殺手不太冷 - Léon categories: ['劇情', '動作', '犯罪'] score:
name: 肖申克的救贖 - The Shawshank Redemption categories: ['劇情', '犯罪'] score:
name: 泰坦尼克號 - Titanic categories: ['劇情', '愛情', '災難'] score:
......
Here a problem appears: the score field is empty. Inspecting the page source shows that the node holding the score contains no digits at all:
<p data-v-090744c8="" class="score m-t-md m-b-n-sm"><span data-v-090744c8=""><i data-v-090744c8="" class="icon icon-789"></i></span><span data-v-090744c8=""><i data-v-090744c8="" class="icon icon-981"></i></span><span data-v-090744c8=""><i data-v-090744c8="" class="icon icon-504"></i></span></p>
The span nodes contain no text whatsoever, so how does the score show up on the page? It is in fact produced by CSS.
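To make the observation concrete, here is a minimal sketch that parses the score markup shown above with Python's standard `html.parser` module rather than pyquery. The class name `ScoreInspector` is a hypothetical helper introduced only for this illustration, and the `data-v-090744c8` attributes are left out for brevity. It shows that the node yields no text, while the `icon-NNN` class names are still available:

```python
from html.parser import HTMLParser

# The score markup from the page source (data-v attributes omitted for brevity).
SCORE_HTML = (
    '<p class="score m-t-md m-b-n-sm">'
    '<span><i class="icon icon-789"></i></span>'
    '<span><i class="icon icon-981"></i></span>'
    '<span><i class="icon icon-504"></i></span>'
    '</p>'
)


class ScoreInspector(HTMLParser):
    """Hypothetical helper: collects text and the icon-* class of every <i> tag."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.icon_classes = []

    def handle_starttag(self, tag, attrs):
        if tag != 'i':
            return
        classes = (dict(attrs).get('class') or '').split()
        # Keep only the numbered class, e.g. "icon-789", not the bare "icon"
        self.icon_classes.extend(c for c in classes if c.startswith('icon-'))

    def handle_data(self, data):
        self.text_parts.append(data)


inspector = ScoreInspector()
inspector.feed(SCORE_HTML)
inspector.close()

visible_text = ''.join(inspector.text_parts).strip()
icon_classes = inspector.icon_classes
print(repr(visible_text))  # empty: the markup carries no text
print(icon_classes)
```

The only information the HTML carries about the score is therefore the sequence of class names.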
2. Case Analysis
<i data-v-090744c8="" class="icon icon-789">::before
</i>
<i data-v-090744c8="" class="icon icon-981">::before
</i>
Looking closely at the source, the span nodes differ only in the class of the i node inside them: there are three span nodes, and their i nodes carry the classes icon-789, icon-981, and icon-504. Inspecting the CSS of an i node shows a ::before entry inside it. In CSS, ::before creates a pseudo-element. Unlike the i or span nodes, it is not part of the HTML; it inserts content into the selected node, and the content property defines what gets inserted. We can trace the CSS source in the browser's developer tools.
[Screenshot: the CSS rules of the i node in the developer tools]
Clicking app.654ba59e.css:1 on the right opens the file, where the complete CSS source is visible.
So all we need to do is read the CSS file and extract the mapping from class name to character. The CSS file is https://antispider4.scrape.center/css/app.654ba59e.css.
[Screenshot: part of the CSS file]
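The mapping idea can be tried offline before touching the real stylesheet. The sketch below assumes the rules have the shape `.icon-NNN:before{content:"X"}`; the `SAMPLE_CSS` string and the characters in it are invented for illustration and are not copied from the site:

```python
import re

# Hypothetical stylesheet fragment; the content values are made up for this demo.
SAMPLE_CSS = (
    '.icon{font-style:normal}'
    '.icon-789:before{content:"9"}'
    '.icon-981:before{content:"."}'
    '.icon-504:before{content:"5"}'
)

# Capture the numeric part of the class name and the character it renders.
RULE = re.compile(r'\.icon-(\d+):before\{content:"(.*?)"\}')
icon_map = dict(RULE.findall(SAMPLE_CSS))

# Decode the class sequence found in the score node.
classes = ['icon-789', 'icon-981', 'icon-504']
score = ''.join(icon_map[name.split('-', 1)[1]] for name in classes)
print(icon_map)
print(score)
```

Matching only digits after `icon-` keeps unrelated rules such as the bare `.icon` rule out of the map.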
3. Scraping
We can fetch the CSS file with the requests library and extract the mapping with a regular expression: findall matches every rule, each match is stored in a dictionary, and the dictionary is then looked up by key:
from selenium import webdriver
from pyquery import PyQuery as pq
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import re
import requests

# Download the stylesheet and map each icon number to its content character
url = 'https://antispider4.scrape.center/css/app.654ba59e.css'
response = requests.get(url)
# Raw string with an escaped dot; \d+ keeps the key limited to the icon number
pattern = re.compile(r'\.icon-(\d+):before\{content:"(.*?)"\}')
results = re.findall(pattern, response.text)
icon_map = {item[0]: item[1] for item in results}

def parse_score(item):
    # Translate the icon-NNN class of every icon node into its character
    elements = item('.icon')
    icon_values = []
    for element in elements.items():
        class_name = element.attr('class')
        icon_key = re.search(r'icon-(\d+)', class_name).group(1)
        icon_value = icon_map.get(icon_key)
        icon_values.append(icon_value)
    return ''.join(icon_values)

browser = webdriver.Chrome()
browser.get('https://antispider4.scrape.center/')
WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.item')))
html = browser.page_source
doc = pq(html)
items = doc('.item')
for item in items.items():
    name = item('.name').text()
    categories = [o.text() for o in item('.categories button').items()]
    score = parse_score(item)
    print(f'name: {name} categories: {categories} score: {score}')
browser.close()
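The parse_score function above assumes that every icon class has an entry in the map. If the site changes its stylesheet, a missing entry would raise an error in the middle of a run. As one possible hardening, the sketch below uses a hypothetical helper, `decode_score`, that takes plain class strings and returns a float, or None when anything cannot be resolved. The `demo_map` values are invented for illustration:

```python
import re

ICON_CLASS = re.compile(r'icon-(\d+)')


def decode_score(class_names, icon_map):
    """Return the score as a float, or None if any glyph cannot be resolved."""
    chars = []
    for class_name in class_names:
        match = ICON_CLASS.search(class_name)
        if match is None:
            return None  # no icon-NNN class on this node
        value = icon_map.get(match.group(1))
        if value is None:
            return None  # class is not present in the stylesheet mapping
        chars.append(value)
    try:
        return float(''.join(chars))
    except ValueError:
        return None  # empty or non-numeric result


# Hypothetical mapping, made up for this demo.
demo_map = {'789': '9', '981': '.', '504': '5'}
print(decode_score(['icon icon-789', 'icon icon-981', 'icon icon-504'], demo_map))
print(decode_score(['icon icon-111'], demo_map))
```

Returning None instead of raising lets the caller decide whether to skip the entry or refresh the mapping.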