[PHP function] Extracting image tags from a web page URL

Jammie 2018. 1. 16. 01:33

Let's look at how to fetch the HTML of a web page and use regular expressions to extract only its images.

(Using images from other websites without permission is a copyright violation. If you want to use them commercially, obtain a copyright agreement first. Otherwise, please use this only for managing or researching your own website.)


First, we create two functions.


- getImgTag extracts the required values from the HTML using a regular expression.

- getRC fetches the HTML source of the given URL.


getImgTag calls getRC to fetch the source of the URL, matches the img tags with a regular expression, and returns the matches as an array.
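
The regular-expression step described above can be tried on its own, without fetching anything. The following is a minimal sketch, not the article's final code: `extractAttribute` is a hypothetical helper name and `$html` is made-up sample markup, both introduced only for this illustration.

```php
<?php
// Made-up sample markup, used only for this illustration.
$html = '<p><img src="/a.png" alt="A"><IMG alt="B" src=\'b.jpg\'></p>';

// Hypothetical helper: returns the values of one attribute for every
// occurrence of the given tag in an HTML string.
function extractAttribute(string $html, string $tag, string $attribute): array
{
    // preg_quote() escapes characters that have a special meaning in a regex.
    // The \s before the attribute name keeps "src" from matching "data-src".
    $pattern = '/<' . preg_quote($tag, '/') . '\b[^>]*\s'
        . preg_quote($attribute, '/') . '\s*=\s*["\']?([^>"\']+)["\']?[^>]*>/i';

    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the whole tags, $matches[1] the captured values.
    return $matches[1];
}

// Prints an array containing /a.png and b.jpg.
print_r(extractAttribute($html, 'img', 'src'));
```

The `i` flag makes the match case-insensitive, so `<IMG>` is found as well as `<img>`. A regular expression like this is only a rough approximation of HTML parsing and may miss unusual markup.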


<?php

// getImgTag('URL address', 'Tag', 'Attribute')
print_r(getImgTag('{Web page URL address}', 'img', 'src'));

 

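// Extracts whole tags, or the values of one attribute, using a regular expression.
function getImgTag($url, $tag, $attribute = null)
{
    if (empty($tag)) {
        return null;
    }

    // Cast to string so a failed fetch does not pass null to preg_match_all().
    $htmlDom = (string) getRC($url);

    // Escape regex metacharacters in the caller-supplied names.
    $tagPattern = preg_quote($tag, '/');

    if (empty($attribute)) {
        // No attribute given: extract the entire tag.
        preg_match_all('/<' . $tagPattern . '\b[^>]*>/i', $htmlDom, $imageList);

        return $imageList[0];
    }

    // Attribute given: extract only its value (for example, the src of each img tag).
    // The \s before the name keeps "src" from matching "data-src".
    $attributePattern = preg_quote($attribute, '/');
    preg_match_all(
        '/<' . $tagPattern . '\b[^>]*\s' . $attributePattern . '\s*=\s*["\']?([^>"\']+)["\']?[^>]*>/i',
        $htmlDom,
        $imageList
    );

    // Return the captured values as an array.
    return $imageList[1];
}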

 

// Fetches the HTML source of a URL.
function getRC($url)
{
    $response = '';
    $timeout = 10;

    // Note: fsockopen() itself does not depend on allow_url_fopen; the setting
    // is used here only to choose between the socket and cURL approaches.
    if (ini_get('allow_url_fopen') == '1') {
        // Separate the scheme, hostname, port, and path of the URL.
        $parsedUrl = parse_url($url);
        if ($parsedUrl === false || empty($parsedUrl['host'])) {
            return $response;
        }

        $host = $parsedUrl['host'];
        $path = isset($parsedUrl['path']) ? $parsedUrl['path'] : '/';
        if (isset($parsedUrl['query'])) {
            $path .= '?' . $parsedUrl['query'];
        }

        // HTTPS needs the ssl:// transport (OpenSSL extension) and port 443.
        $isHttps = isset($parsedUrl['scheme']) && strtolower($parsedUrl['scheme']) === 'https';
        $port = isset($parsedUrl['port']) ? $parsedUrl['port'] : ($isHttps ? 443 : 80);

        // Connect to the remote server.
        $fp = fsockopen(($isHttps ? 'ssl://' : '') . $host, $port, $errno, $errstr, $timeout);

        if (!$fp) {
            echo "Cannot retrieve $url";
        } else {
            stream_set_timeout($fp, $timeout);

            // Send the request headers. "Connection: close" makes the server
            // close the socket after responding, so the read loop below ends.
            fputs($fp, "GET $path HTTP/1.0\r\n" .
                    "Host: $host\r\n" .
                    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3\r\n" .
                    "Accept: */*\r\n" .
                    "Accept-Language: en-us,en;q=0.5\r\n" .
                    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n" .
                    "Connection: close\r\n" .
                    "Referer: http://$host\r\n\r\n");

            // Read the response from the remote server until it ends or times out.
            while (!feof($fp)) {
                $chunk = fread($fp, 4096);
                $meta = stream_get_meta_data($fp);
                if ($chunk === false || $meta['timed_out']) {
                    break;
                }
                $response .= $chunk;
            }

            fclose($fp);

            // Remove the header part (everything up to the first blank line).
            $pos = strpos($response, "\r\n\r\n");
            if ($pos !== false) {
                $response = substr($response, $pos + 4);
            }
        }
    } else {
        // If allow_url_fopen is disabled, fetch the page with cURL instead.
        $curl = curl_init();

        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_HEADER, false);
        curl_setopt($curl, CURLOPT_TIMEOUT, $timeout);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

        $curlData = curl_exec($curl);
        curl_close($curl);

        // The body is HTML, not JSON, so use it as-is (empty string on failure).
        $response = ($curlData === false) ? '' : $curlData;
    }

    // Return the fetched HTML.
    return $response;
}
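
The header-removal step in getRC's socket branch can also be tried without a network connection. The sketch below assumes a hand-written response string. `splitHttpResponse` is a hypothetical helper name and `$raw` is made-up data, neither of which is part of the code above.

```php
<?php
// Hypothetical helper: splits a raw HTTP response into its header block and
// body at the first blank line (CRLF CRLF).
function splitHttpResponse(string $raw): array
{
    $pos = strpos($raw, "\r\n\r\n");

    if ($pos === false) {
        // No separator found: treat the whole string as the body.
        return ['headers' => '', 'body' => $raw];
    }

    return [
        'headers' => substr($raw, 0, $pos),
        'body'    => substr($raw, $pos + 4), // skip the 4 separator bytes
    ];
}

// Made-up response, used only for this illustration.
$raw = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html><img src=\"a.png\"></html>";

$parts = splitHttpResponse($raw);
echo $parts['body'], PHP_EOL; // prints <html><img src="a.png"></html>
```

Checking the result of `strpos` against `false` matters here: if the separator is missing, adding 4 to `false` would silently cut off the first four characters of the response.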