In this article, we will discuss how to handle content scraping with pagination in Laravel. Only the changes required for pagination are explained here, so I recommend reading our previous post, “How to Handle Content Scraping in Laravel”, first for a better understanding.
Update the “getCrawlerContent” Function
In the code snippet from our previous example, our class contains a function “getCrawlerContent”. We have to update it as follows.
...
/**
 * Content Crawler
 */
public function getCrawlerContent()
{
    try {
        // URL, where you want to fetch the content
        $response = $this->client->get('<URL>');

        // get content and pass to the crawler
        $content = $response->getBody()->getContents();
        $crawler = new Crawler($content);

        $_this = $this;
        $data  = '';

        $parentData = $crawler->filter('div.card--post')
            ->each(function (Crawler $node, $i) use ($_this) {
                return $_this->getNodeContent($node);
            });

        // function which handles the pagination data
        $childData = $this->paginate($crawler, $_this->crawler_page_limit ?? 3);

        $data = array_merge($parentData, $childData);

        dump($data);
    } catch (Exception $e) {
        echo $e->getMessage();
    }
}
...
Create helper functions
Create “paginate” and “subCrawler” functions. These functions filter out the pagination link, and that same link is then used for content crawling. Here, we call these functions recursively until our pagination limit is reached.
One more question here: why do we have to set a limit on pagination for content crawling?
If we do not break the process, the function runs until the entire pagination has been crawled. So imagine what happens if a site has more than 100 pages.
That’s the main reason to stop the process at a specific limit, as in the code snippet below.
...
/**
 * Pagination records
 */
private function paginate($crawler_instance, $max_page_allowed = 3)
{
    $instance = $crawler_instance;
    $paginate = $instance->filter('.w-full nav a.arrow--right')->count()
        ? $instance->filter('.w-full nav a.arrow--right')->attr('href')
        : 0;
    // echo $paginate;

    $current_page_no = \explode('=', $paginate)[1] ?? 0;

    $data  = [];
    $_this = $this;

    if ($paginate !== 0 && $current_page_no <= $max_page_allowed) {
        $childData = $this->subCrawler($paginate);
        $data = array_merge($data, $childData);
    }

    return $data;
}

/**
 * Sub-Crawler for pagination records
 */
private function subCrawler($url)
{
    $_this = $this;
    $data  = [];

    $subCrawler = $this->initCrawler($url);

    $childData = $subCrawler->filter('div.card--post')
        ->each(function (Crawler $node, $i) use ($_this) {
            return $_this->getNodeContent($node);
        });

    $data = array_merge($data, $childData);

    // Call recursively
    $subData = $this->paginate($subCrawler, $_this->crawler_page_limit ?? 3);
    $data = array_merge($data, $subData);

    return $data;
}
...
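Note that “subCrawler” calls an “initCrawler” helper, which fetches a page and returns a new Crawler instance for it. If you don’t already have this helper from the previous post, a minimal sketch could look like the following (this assumes $this->client is a GuzzleHttp\Client, as set up in the previous example):

```php
/**
 * Fetch a URL and return a Crawler instance for its HTML.
 * A minimal sketch -- assumes $this->client is a GuzzleHttp\Client
 * configured as in the previous post.
 */
private function initCrawler($url)
{
    // download the page body
    $response = $this->client->get($url);
    $content  = $response->getBody()->getContents();

    // wrap the HTML in a Symfony DomCrawler instance
    return new Crawler($content);
}
```

You can also control how deep the recursion goes by defining a “crawler_page_limit” property on the class; when it is not set, the code falls back to a limit of 3 via the null coalescing operator.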
Finally, our content scraping with pagination example is complete. I hope you like it and that it helps in your future projects. If you have any queries, please feel free to add them in the comment section 🙂