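Overview
The crawler needs to scrape prices from a website. Unlike ordinary page scraping, the content here is loaded via AJAX, and the prices are rendered as CSS background images rather than text.
Each digit corresponds to a CSS class, such as 'p_h57_5':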
.p_h57_5 {
background: url('http://pic.c-ctrip.com/priceblur/h57/3713de5c594648529f39d031243966dd.gif') no-repeat -590px;
padding: 0 6px;
font-size: 18px;
}
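Both the class assigned to each digit and the background image it points to change dynamically, and the price of every room type has to be retrieved. Other channels for getting the prices turned up later, but this post records the approach based on Selenium & Emgu.
Workflow:
1. Open the URL with Selenium
2. Take a full-page screenshot
3. Use Selenium selectors to read the room type and related information
4. Use a Selenium selector to locate the price DOM element, compute its position relative to the screenshot, crop the price image, then recognize and output the price with Emgu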
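Step 4 depends on turning the element's position and size into a crop rectangle that stays inside the screenshot. The following is a minimal sketch of that calculation, assuming the element coordinates are in the same pixel space as the screenshot. `CropMath` and `ClampToImage` are hypothetical names used for illustration and are not part of the original code.

```csharp
using System;

public static class CropMath
{
    // Hypothetical helper: clamps an element's box (in screenshot pixels)
    // to the bounds of the screenshot so the crop never reads outside the image.
    public static (int X, int Y, int Width, int Height) ClampToImage(
        int elementX, int elementY, int elementWidth, int elementHeight,
        int imageWidth, int imageHeight)
    {
        int left = Math.Max(0, elementX);
        int top = Math.Max(0, elementY);
        int right = Math.Min(imageWidth, elementX + elementWidth);
        int bottom = Math.Min(imageHeight, elementY + elementHeight);

        // A box that lies completely outside the image yields a zero-sized region.
        return (left, top, Math.Max(0, right - left), Math.Max(0, bottom - top));
    }
}
```

A zero width or height in the result is a signal to skip the element instead of passing an empty image to the OCR step.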
Implementation
```C#
using System;
using System.Collections.ObjectModel;
using System.Drawing;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

static void Main(string[] args)
{
    // Open the URL
    ChromeOptions options = new ChromeOptions();
    // Each switch has to be passed as its own argument
    options.AddArguments("--start-maximized", "--disable-popup-blocking");
    var driver = new ChromeDriver(options);
    driver.Navigate().GoToUrl("http://hotels.ctrip.com/hotel/992765.html");
    // Wait for the room table, which indicates that the AJAX content has loaded.
    // Until() throws WebDriverTimeoutException if the table does not appear in time.
    new WebDriverWait(driver, TimeSpan.FromSeconds(1)).Until(
        ExpectedConditions.ElementExists(By.ClassName("htl_room_table")));
    // Collect the room rows
    ReadOnlyCollection<IWebElement> elementsList = driver.FindElementsByCssSelector("tr[expand]");
    // Hide the ¥ symbol in front of each price so that it is not part of the cropped image
    driver.ExecuteScript(@"
        var arr = document.getElementsByTagName('dfn');
        for(var i=0;i<arr.length;i++){
            arr[i].style.display = 'none';
        }
    ");
    // Full-page screenshot
    var image2 = GetEntereScreenshot(driver);
    image2.Save(@"Z:111.jpg");
    // Output
    Console.WriteLine("{0,-20}{1,-20}{2,-20}", "Room", "Type", "Price");
    foreach (IWebElement row in elementsList)
    {
        //var image = row.Snapshot();
        //image.Save(@"Z:" + Guid.NewGuid() + ".jpg");
        //var str = ORC_((Bitmap)image);
        var roomType = "";
        try
        {
            roomType = row.FindElement(By.CssSelector(".room_unfold")).Text;
        }
        catch (Exception)
        {
            // Rows without a .room_unfold element keep an empty room type
        }
        // regRoomType is a Regex field that is defined elsewhere and not shown here
        var roomTypeText = regRoomType.Match(roomType);
        var roomTypeName = row.FindElement(By.CssSelector("span.room_type_name")).Text;
        // Crop the price element out of the full screenshot
        var image = row.FindElement(By.CssSelector("span.base_price")).SnapshotV2(image2);
        // Recognize the price
        var price = ORC_((Bitmap)image);
        Console.WriteLine("{0,-20}{1,-20}{2,-20}", roomTypeText.Value, roomTypeName, price);
    }
    Console.Read();
}
```
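The listing above calls `GetEntereScreenshot` and `SnapshotV2`, which are not shown in this post. The sketch below is one possible shape for them, not the original implementation. It assumes Selenium's `ITakesScreenshot` and System.Drawing, and it assumes that the element's `Location` lines up with the captured image, which holds only while the page is not scrolled and the element lies inside the captured area. ChromeDriver captures the visible viewport, so a true full-page version would need extra work such as scrolling and stitching. The class name `ScreenshotSketch` and the `Crop` helper are hypothetical.

```csharp
using System.Drawing;
using System.IO;
using OpenQA.Selenium;

public static class ScreenshotSketch
{
    // Hypothetical stand-in for GetEntereScreenshot: returns whatever the driver
    // captures as a screenshot (the visible viewport for ChromeDriver).
    public static Bitmap GetEntereScreenshot(IWebDriver driver)
    {
        byte[] png = ((ITakesScreenshot)driver).GetScreenshot().AsByteArray;
        using (var stream = new MemoryStream(png))
        using (var decoded = new Bitmap(stream))
        {
            // Copy the pixels so the result does not depend on the stream staying open.
            return new Bitmap(decoded);
        }
    }

    // Hypothetical stand-in for SnapshotV2: crops the element's box out of a screenshot.
    public static Bitmap SnapshotV2(this IWebElement element, Image screenshot)
    {
        return Crop(screenshot, new Rectangle(element.Location, element.Size));
    }

    // Crops the given box, clamped to the image bounds; returns null if nothing is left.
    public static Bitmap Crop(Image screenshot, Rectangle box)
    {
        box.Intersect(new Rectangle(Point.Empty, screenshot.Size));
        if (box.Width <= 0 || box.Height <= 0)
            return null;

        var result = new Bitmap(box.Width, box.Height);
        using (Graphics g = Graphics.FromImage(result))
        {
            // Draw only the source region into the new bitmap at 1:1 pixel scale.
            g.DrawImage(screenshot, new Rectangle(0, 0, box.Width, box.Height), box, GraphicsUnit.Pixel);
        }
        return result;
    }
}
```

Returning null for an empty crop fits the recognition method in this post, which treats a null image as a failed recognition and returns an empty string.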
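Image recognition method
```C#
using System;
using System.Drawing;
using Emgu.CV;
using Emgu.CV.OCR;
using Emgu.CV.Structure;

// Path to the tessdata folder of the local Emgu CV installation; adjust as needed
private static Tesseract _ocr = new Tesseract(@"C:\Emgu\emgucv-windows-universal-cuda 2.9.0.1922\bin\tessdata", "eng", Tesseract.OcrEngineMode.OEM_TESSERACT_CUBE_COMBINED);

static Program()
{
    // Restrict recognition to digits
    _ocr.SetVariable("tessedit_char_whitelist", "0123456789");
}

// Runs OCR on the given image and returns the recognized text
public static string ORC_(Bitmap img)
{
    // An empty string indicates that OCR failed
    string re = "";
    if (img == null)
        return re;

    Bgr drawColor = new Bgr(Color.Blue);
    try
    {
        using (Image<Bgr, byte> image = new Image<Bgr, byte>(img))
        using (Image<Gray, byte> gray = image.Convert<Gray, byte>())
        {
            _ocr.Recognize(gray);
            // Outline each recognized character (useful when inspecting the image while debugging)
            Tesseract.Charactor[] charactors = _ocr.GetCharactors();
            foreach (Tesseract.Charactor c in charactors)
            {
                image.Draw(c.Region, drawColor, 1);
            }
            re = _ocr.GetText();
        }
        return re;
    }
    catch (Exception)
    {
        return re;
    }
}
```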
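`ORC_` returns the raw text from `GetText()`, which may carry whitespace or line breaks around the digits. If a numeric price is needed, a small post-processing step can help. The following is a minimal sketch, assuming prices are whole numbers. `PriceText` and `TryParsePrice` are hypothetical names and are not part of the original code.

```csharp
using System;
using System.Linq;

public static class PriceText
{
    // Hypothetical helper: keeps only the ASCII digits of the OCR output and
    // parses them as an integer price. Returns false when no digits remain
    // or the number does not fit in an int.
    public static bool TryParsePrice(string ocrText, out int price)
    {
        price = 0;
        if (string.IsNullOrEmpty(ocrText))
            return false;

        // Drop everything that is not a digit, including spaces between digits.
        string digits = new string(ocrText.Where(c => c >= '0' && c <= '9').ToArray());
        return digits.Length > 0 && int.TryParse(digits, out price);
    }
}
```

Because digits separated by spaces are joined into one number, this sketch only suits crops that contain a single price.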