CodeBriefly
Tech Magazine

How to Handle Content Scraping with Pagination in Laravel



In this article, we will discuss how to handle content scraping with pagination in Laravel. Only the changes required on top of the previous example are explained here, so we recommend reading our earlier post, “How to Handle Content Scraping in Laravel”, first for a better understanding.

Update “getCrawlerContent” Function

In our previous example, the class already contains the “getCrawlerContent” function. We have to update it as follows.

...
    /**
     * Content Crawler
     */
    public function getCrawlerContent()
    {
        try {
            $response = $this->client->get('<URL>'); // URL, where you want to fetch the content

            // get content and pass to the crawler
            $content = $response->getBody()->getContents();

            $crawler = new Crawler( $content );
            
            $_this = $this;

            // Scrape the posts on the first page.
            $parentData = $crawler->filter('div.card--post')
                            ->each(function (Crawler $node, $i) use ($_this) {
                                return $_this->getNodeContent($node);
                            });

            // Function which handles the pagination data.
            $childData = $this->paginate( $crawler, $this->crawler_page_limit ?? 3 );
            $data = array_merge($parentData, $childData);

            dump($data);
            
        } catch ( Exception $e ) {
            echo $e->getMessage();
        }
    }
...

Create helper functions

Create the “paginate” and “subCrawler” functions. These functions filter out the pagination link, and the same link is then used for content crawling. The “paginate” function is called recursively until our pagination limit is reached.

One more question: why do we have to set a limit on pagination for content crawling?

If we do not break out of the process, the function keeps running until the entire pagination has been crawled. Imagine what happens if a site has more than 100 pages.

That is the main reason to stop the process at a specific limit, as in the code snippet below.
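The break condition described above can be sketched in isolation, with the HTTP client and DomCrawler parts replaced by a stubbed array of pages. All names below are illustrative and not part of the article's class; this only demonstrates the recursion-with-limit pattern.

```php
<?php
// Minimal sketch: recursively "crawl" stubbed pages until either the page
// limit is reached or there is no next page, mirroring the flow
// getCrawlerContent() -> paginate() -> subCrawler().
function crawlAll(array $pages, int $page, int $maxPages): array
{
    // Stop when the limit is exceeded or the page does not exist.
    if ($page > $maxPages || !isset($pages[$page])) {
        return [];
    }

    // Collect this page's items, then recurse into the next page.
    return array_merge($pages[$page], crawlAll($pages, $page + 1, $maxPages));
}

$pages = [
    1 => ['post-1', 'post-2'],
    2 => ['post-3'],
    3 => ['post-4'],
    4 => ['post-5'],
];

// With a limit of 3, page 4 is never fetched:
$items = crawlAll($pages, 1, 3);
// $items === ['post-1', 'post-2', 'post-3', 'post-4']
```

Without the `$page > $maxPages` check, the recursion would only stop when a site runs out of pages, which is exactly the unbounded behaviour we want to avoid.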

...    
    private function paginate($crawler_instance, $max_page_allowed = 3) : array
    {
        $instance = $crawler_instance;

        // Grab the "next page" link, or 0 when there is no further page.
        $paginate = $instance->filter('.w-full nav a.arrow--right')->count()
            ? $instance->filter('.w-full nav a.arrow--right')->attr('href')
            : 0;

        // The link looks like "...?page=N"; extract N to enforce the limit.
        $current_page_no = (int) ( \explode('=', $paginate)[1] ?? 0 );

        $data = [];
        if ( $paginate !== 0 && $current_page_no <= $max_page_allowed ) {
            // Crawl the next page, forwarding the limit for the recursion.
            $childData = $this->subCrawler($paginate, $max_page_allowed);
            $data = array_merge($data, $childData);
        }
        return $data;
    }

    public function initCrawler($url)
    {
        try {
            $response = $this->client->get($url);
            // $code = $response->getStatusCode(); // 200
            // $reason = $response->getReasonPhrase(); // OK
    
            // get content and pass to the crawler
            $content = $response->getBody()->getContents();
            return new Crawler( $content );
        } catch ( Exception $e ) {
            echo $e->getMessage();
        }
    }

    private function subCrawler($url, $max_page_allowed = 3) : array
    {
        $_this = $this;
        $data = [];
        $subCrawler = $this->initCrawler($url);

        // Scrape the posts on this pagination page.
        $childData = $subCrawler->filter('div.card--post')
                    ->each(function (Crawler $node, $i) use ($_this) {
                        return $_this->getNodeContent($node);
                    });
        $data = array_merge($data, $childData);

        // Call recursively until the page limit is reached.
        $subData = $this->paginate($subCrawler, $max_page_allowed);
        $data = array_merge($data, $subData);

        return $data;
    }
...
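A note on the `explode('=', ...)` call in “paginate”: it works when the pagination link has exactly one query parameter, such as `?page=2`, but breaks if the URL carries other parameters. Below is a hedged sketch of a more robust alternative using PHP's built-in `parse_url()` and `parse_str()`; the function name and the example URLs are placeholders, not part of the article's class.

```php
<?php
// Sketch only: extract the page number from a pagination href.
// Unlike explode('=', $href)[1], this still works when the URL has
// additional query parameters (e.g. "?sort=new&page=4").
function extractPageNumber(string $href): int
{
    // Isolate the query string ("" when the URL has none).
    $query = parse_url($href, PHP_URL_QUERY) ?: '';

    // Parse it into an associative array of parameters.
    parse_str($query, $params);

    // Return the page number, or 0 when no "page" parameter exists.
    return (int) ($params['page'] ?? 0);
}

// e.g. extractPageNumber('https://example.com/blog?sort=new&page=4') returns 4
```

This drops straight into “paginate” in place of the `explode()` line, assuming the site paginates via a `page` query parameter.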

Finally, our content scraping with pagination example is complete. I hope you like it and that it helps in your future projects. If you have any query, please feel free to add it in the comment section 🙂



2 Comments
  1. Khaled EL-Azab says

    hi
    thank you for your efforts it helps me
    but i have error about (initCrawler)
    it’s afunction but is missing in above code
    thanks in advance

  2. maryam says

    hello
    im using your post about How To Handle Content Scraping With Pagination In Laravel.
    but im getting an error that initCrawler doesnt exist.
    can you help me?
    thank you

