How to Handle Content Scraping with Pagination in Laravel

In this article, we will discuss how to handle content scraping with pagination in Laravel. Only the changes required on top of the basic scraper are explained here, so I recommend reading our previous post, “How to Handle Content Scraping in Laravel”, first for a better understanding.

Update “getCrawlerContent” Function

In the code snippet from our previous example, our class contains a function named “getCrawlerContent”. Update it as follows.

...
    /**
     * Content Crawler
     */
    public function getCrawlerContent()
    {
        try {
            $response = $this->client->get('<URL>'); // URL from which you want to fetch the content

            // get the response body and pass it to the crawler
            $content = $response->getBody()->getContents();

            $crawler = new Crawler($content);

            // collect the posts on the first page
            $parentData = $crawler->filter('div.card--post')
                            ->each(function (Crawler $node, $i) {
                                return $this->getNodeContent($node);
                            });

            // collect the posts from the paginated pages
            $childData = $this->paginate($crawler, $this->crawler_page_limit ?? 3);
            $data = array_merge($parentData, $childData);

            dump($data);

        } catch (\Exception $e) {
            echo $e->getMessage();
        }
    }
...

Create Helper Functions

Create two functions, “paginate” and “subCrawler”. “paginate” extracts the link to the next page, and “subCrawler” fetches that link and crawls its content. “subCrawler” then calls “paginate” again, so the two functions recurse until the pagination limit is reached.

One more question: why do we set a limit on how many pages are crawled?

Without a break condition, the functions would keep recursing until every page of the pagination had been crawled. Imagine what happens on a site with more than 100 pages.

That is why we stop the process at a specific limit, as in the code snippet below.
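To see how this stop condition works in isolation, here is a minimal sketch in plain PHP (no crawler involved) of the page-number check. The URL shape `...?page=N` is an assumption about the site being scraped, and `shouldCrawl` is a hypothetical helper name used only for illustration:

```php
<?php

// Assumption: the "next page" link ends in "?page=N", so the page
// number is the value after the "=" in the URL.
function shouldCrawl(string $nextUrl, int $maxPageAllowed = 3): bool
{
    $currentPageNo = (int) (explode('=', $nextUrl)[1] ?? 0);

    return $currentPageNo <= $maxPageAllowed;
}

var_dump(shouldCrawl('https://example.com/blog?page=2')); // bool(true)  -> keep crawling
var_dump(shouldCrawl('https://example.com/blog?page=4')); // bool(false) -> stop, limit reached
```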

...
    /**
     * Pagination records
     */
    private function paginate($crawler_instance, $max_page_allowed = 3)
    {
        $next_link = $crawler_instance->filter('.w-full nav a.arrow--right');
        $paginate = $next_link->count() ? $next_link->attr('href') : 0;

        // the "next" link looks like <URL>?page=N, so the page number follows the "="
        $current_page_no = (int) (\explode('=', (string) $paginate)[1] ?? 0);

        $data = [];
        if ($paginate !== 0 && $current_page_no <= $max_page_allowed) {
            $childData = $this->subCrawler($paginate);
            $data = array_merge($data, $childData);
        }
        return $data;
    }

    /**
     * Sub-crawler for pagination records
     */
    private function subCrawler($url)
    {
        $subCrawler = $this->initCrawler($url);

        $data = $subCrawler->filter('div.card--post')
                    ->each(function (Crawler $node, $i) {
                        return $this->getNodeContent($node);
                    });

        // call paginate() recursively until the page limit is reached
        $subData = $this->paginate($subCrawler, $this->crawler_page_limit ?? 3);

        return array_merge($data, $subData);
    }
...
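To make the recursion above easier to follow, here is a simplified, self-contained model of the paginate/subCrawler round trip. Real HTTP fetching and DOM filtering are replaced by an in-memory array of hypothetical pages; the function names mirror the ones above, but everything else is an assumption made purely for illustration:

```php
<?php

// Hypothetical site: each page lists its posts and the key of the
// next page (null on the last page).
$pages = [
    'page=1' => ['posts' => ['post-1', 'post-2'], 'next' => 'page=2'],
    'page=2' => ['posts' => ['post-3'],           'next' => 'page=3'],
    'page=3' => ['posts' => ['post-4'],           'next' => 'page=4'],
    'page=4' => ['posts' => ['post-5'],           'next' => null],
];

// Mirrors paginate(): decide whether to follow the "next" link, then recurse.
function paginateModel(array $pages, ?string $nextUrl, int $maxPageAllowed = 3): array
{
    if ($nextUrl === null) {
        return []; // no pagination link: nothing more to crawl
    }

    $currentPageNo = (int) (explode('=', $nextUrl)[1] ?? 0);
    if ($currentPageNo > $maxPageAllowed) {
        return []; // break condition: page limit reached
    }

    return subCrawlerModel($pages, $nextUrl, $maxPageAllowed);
}

// Mirrors subCrawler(): "fetch" the page, collect its posts,
// then call paginateModel() again.
function subCrawlerModel(array $pages, string $url, int $maxPageAllowed): array
{
    $page = $pages[$url];

    return array_merge(
        $page['posts'],
        paginateModel($pages, $page['next'], $maxPageAllowed)
    );
}

// Crawl starting from the first page, with a limit of 3 pages.
$data = array_merge(
    $pages['page=1']['posts'],
    paginateModel($pages, $pages['page=1']['next'], 3)
);

print_r($data); // post-1 through post-4; page 4 is skipped because 4 > 3
```

Notice that the recursion stops in two ways: when a page has no “next” link, and when the extracted page number exceeds the limit, exactly as in the real functions.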

Finally, the content scraping with pagination example is complete. I hope you liked it and that it helps in your future projects. If you have any query, please feel free to add it in the comment section 🙂
