In this article, we will discuss how to handle content scraping with pagination in Laravel. Only the changes required on top of the basic scraper are explained here, so I recommend reading our previous post, “How to Handle Content Scraping in Laravel”, first for the full context.
Update “getCrawlerContent” Function
The class from our previous example already contains the function “getCrawlerContent”. Update it as follows.
```php
...
/**
 * Content Crawler
 */
public function getCrawlerContent()
{
    try {
        // URL, where you want to fetch the content
        $response = $this->client->get('<URL>');

        // get content and pass to the crawler
        $content = $response->getBody()->getContents();
        $crawler = new Crawler($content);

        $_this = $this;
        $data  = [];

        $parentData = $crawler->filter('div.card--post')
            ->each(function (Crawler $node, $i) use ($_this) {
                return $_this->getNodeContent($node);
            });

        // function which handles the pagination data
        $childData = $this->paginate($crawler, $_this->crawler_page_limit ?? 3);

        $data = array_merge($parentData, $childData);
        dump($data);
    } catch (Exception $e) {
        echo $e->getMessage();
    }
}
...
```
Create helper functions
Create a “paginate” and a “subCrawler” function. These functions extract the pagination link, and that same link is then used for content crawling. The functions call each other recursively until the pagination limit is reached.
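Before looking at the real implementation, here is a minimal, self-contained sketch of the recursive idea behind “paginate” and “subCrawler”: each page yields some items plus an optional next page, and we recurse until the limit is hit. The `crawlPages` function and the `$pages` array are purely illustrative stand-ins for real HTTP responses.

```php
<?php
// Illustrative sketch only: $pages stands in for real paginated responses.
function crawlPages(array $pages, int $page = 1, int $maxPage = 3): array
{
    // Stop when the limit is reached or there is no next page.
    if ($page > $maxPage || !isset($pages[$page])) {
        return [];
    }

    // Merge this page's items with whatever the following pages return.
    return array_merge($pages[$page], crawlPages($pages, $page + 1, $maxPage));
}

$pages = [
    1 => ['post-a', 'post-b'],
    2 => ['post-c'],
    3 => ['post-d'],
    4 => ['post-e'], // never reached with the default limit of 3
];

print_r(crawlPages($pages)); // post-a, post-b, post-c, post-d
```

The real functions below follow the same shape, except the “next page” is discovered by filtering the pagination link out of the current page's HTML.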
This raises one more question: why do we have to set a pagination limit for content crawling?
If we never break the process, the function keeps running until every page has been crawled. Imagine what happens if a site has more than 100 pages. That is the main reason to stop the process at a specific limit, as in the code snippet below.
```php
...
private function paginate($crawler_instance, $max_page_allowed = 3): array
{
    $instance = $crawler_instance;

    // Grab the "next page" link, if one exists.
    $paginate = $instance->filter('.w-full nav a.arrow--right')->count()
        ? $instance->filter('.w-full nav a.arrow--right')->attr('href')
        : 0;
    // echo $paginate;

    // The page number is the query-string value, e.g. "?page=2".
    $current_page_no = \explode('=', $paginate)[1] ?? 0;

    $data = [];

    if ($paginate !== 0 && $current_page_no <= $max_page_allowed) {
        $childData = $this->subCrawler($paginate, $max_page_allowed);
        $data = array_merge($data, $childData);
    }

    return $data;
}

public function initCrawler($url)
{
    try {
        $response = $this->client->get($url);
        // $code   = $response->getStatusCode();   // 200
        // $reason = $response->getReasonPhrase(); // OK

        // get content and pass to the crawler
        $content = $response->getBody()->getContents();

        return new Crawler($content);
    } catch (Exception $e) {
        echo $e->getMessage();
    }
}

private function subCrawler($url, $max_page_allowed = 3): array
{
    $_this = $this;
    $data  = [];

    $subCrawler = $this->initCrawler($url);

    // Same node-parsing helper as in getCrawlerContent().
    $childData = $subCrawler->filter('div.card--post')
        ->each(function (Crawler $node, $i) use ($_this) {
            return $_this->getNodeContent($node);
        });

    $data = array_merge($data, $childData);

    // Call recursively, passing the limit on so it is honoured on every page.
    $subData = $this->paginate($subCrawler, $max_page_allowed);
    $data = array_merge($data, $subData);

    return $data;
}
...
```
With that, our content scraping with pagination example is complete. I hope you like it and that it helps in your future projects. If you have any questions, please feel free to add them in the comment section 🙂