[php] DiDOM - a very fast HTML scraper and parser for PHP

Posted by the site administrator (운영자) on 20-12-01 16:00

DiDOM is a simple and fast HTML parser.

https://github.com/Imangazaliev/DiDOM

Installation

To install DiDOM, run the following command:

composer require imangazaliev/didom

Quick start

use DiDom\Document;

$document = new Document('http://www.news.com/', true);

$posts = $document->find('.post');

foreach($posts as $post) {
    echo $post->text(), "\n";
}

Creating a new document

DiDOM can load HTML in several ways.

With the constructor

// the first parameter is a string with HTML
$document = new Document($html);

// file path
$document = new Document('page.html', true);

// or URL
$document = new Document('http://www.example.com/', true);

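The second parameter indicates whether the first parameter is a file path or URL to load rather than markup. It defaults to false.

Signature:

__construct($string = null, $isFile = false, $encoding = 'UTF-8', $type = Document::TYPE_HTML)

$string - an HTML or XML string, or a file path.

$isFile - indicates that the first parameter is a file path.

$encoding - the document encoding.

$type - the document type (HTML: Document::TYPE_HTML, XML: Document::TYPE_XML).

With separate methods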

$document = new Document();

$document->loadHtml($html);

$document->loadHtmlFile('page.html');

$document->loadHtmlFile('http://www.example.com/');

Two methods are available for loading XML: loadXml and loadXmlFile.

These methods accept additional libxml options:

$document->loadHtml($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$document->loadHtmlFile($url, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$document->loadXml($xml, LIBXML_PARSEHUGE);
$document->loadXmlFile($url, LIBXML_PARSEHUGE);
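
To make the effect of the two HTML flags above concrete, here is a minimal sketch that uses PHP's built-in DOMDocument directly rather than DiDOM. It assumes DiDOM forwards these options to libxml unchanged; the variable names are only illustrative.

```php
<?php
// Sketch with PHP's DOM extension, not DiDOM itself.
$html = '<p>Hello</p>';

// Default parsing: libxml adds a DOCTYPE and wraps the fragment in <html><body>.
$default = new DOMDocument();
$default->loadHTML($html);
$withWrappers = $default->saveHTML();

// With both flags, no DOCTYPE and no implied <html>/<body> elements are added.
$bare = new DOMDocument();
$bare->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fragmentOnly = trim($bare->saveHTML());

echo $withWrappers, "\n";
echo $fragmentOnly, "\n";
```

The flags are mostly useful when parsing fragments whose serialized form should not gain a wrapper.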

Searching for elements

DiDOM accepts a CSS selector or an XPath expression as the search expression. Pass the expression as the first parameter and its type as the second (the default type is Query::TYPE_CSS).

With the find() method:

use DiDom\Document;
use DiDom\Query;

...

// CSS selector
$posts = $document->find('.post');

// XPath
$posts = $document->find("//div[contains(@class, 'post')]", Query::TYPE_XPATH);

If elements matching the expression are found, the method returns an array of DiDom\Element instances; otherwise it returns an empty array. To get an array of DOMElement objects instead, pass false as the third parameter.

With the magic method __invoke():

$posts = $document('.post');

Warning: using this method is discouraged because it may be removed in the future.

With the xpath() method:

$posts = $document->xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]");
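
The two XPath expressions shown in this section are not equivalent: contains(@class, 'post') is a substring test, while the concat/normalize-space form matches a whole class token. A small sketch with PHP's built-in DOMXPath (not DiDOM) shows the difference; the sample markup is made up for illustration.

```php
<?php
// Hypothetical sample markup: one exact class, one look-alike, one multi-class element.
$html = '<div class="post">A</div>'
      . '<div class="postscript">B</div>'
      . '<div class="featured post">C</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring test: also matches class="postscript".
$loose = $xpath->query("//div[contains(@class, 'post')]");

// Token test: matches only elements whose class list contains the word "post".
$exact = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
);

echo $loose->length, "\n"; // 3
echo $exact->length, "\n"; // 2
```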

You can also search inside an element:

echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();

Checking whether an element exists

To check whether an element exists, use the has() method:

if ($document->has('.post')) {
    // code
}

If you need to check whether an element exists and then fetch it, you could write:

if ($document->has('.post')) {
    $elements = $document->find('.post');
    // code
}

However, this is faster:

if (count($elements = $document->find('.post')) > 0) {
    // code
}

The first version is slower because it runs two queries.
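
When only a single element is needed, the same one-query idea can be sketched with first(), which returns null when nothing matches. The autoload path below is an assumption (DiDOM installed with Composer in the current directory), and the markup is invented for the example.

```php
<?php
// Assumes DiDOM was installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

use DiDom\Document;

$document = new Document(
    '<div class="post">First</div><div class="post">Second</div>'
);

// One query: fetch the element and test the result, instead of has() + find().
$post = $document->first('.post');

if ($post !== null) {
    echo $post->text(), "\n";
}

// A selector with no match yields null rather than an exception.
$missing = $document->first('.comment');
```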

Searching within an element

The find(), first(), xpath(), has() and count() methods are also available on Element.

Example:

echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();

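The findInDocument() method

If you change, replace, or remove an element that was found inside another element, the source document is not changed. This happens because the find() method of the Element class (and likewise first() and xpath()) creates a new document to search in.

To search for elements in the source document, use the findInDocument() and firstInDocument() methods: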

// nothing will happen
$document->first('head')->first('title')->remove();

// but this will do
$document->first('head')->firstInDocument('title')->remove();

Warning: findInDocument() and firstInDocument() work only on elements that belong to a document and on elements created via new Element(...). If the element does not belong to a document, a LogicException is thrown.

Supported selectors

DiDOM supports searching by:

tag
class, ID, name and value of an attribute
pseudo-classes:
    first-, last-, nth-child
    empty and not-empty
    contains
    has

// all links
$document->find('a');

// any element with id = "foo" and "bar" class
$document->find('#foo.bar');

// any element with attribute "name"
$document->find('[name]');
// the same as
$document->find('*[name]');

// input field with the name "foo"
$document->find('input[name=foo]');
$document->find('input[name=\'bar\']');
$document->find('input[name="baz"]');

// any element that has an attribute starting with "data-" and the value "foo"
$document->find('*[^data-=foo]');

// all links starting with https
$document->find('a[href^=https]');

// all images with the extension png
$document->find('img[src$=png]');

// all links containing the string "example.com"
$document->find('a[href*=example.com]');

// text of the links with "foo" class
$document->find('a.foo::text');

// href and title of all links with "bar" class
$document->find('a.bar::attr(href|title)');

Output

Getting HTML

With the html() method:

$posts = $document->find('.post');

echo $posts[0]->html();

By casting to a string:

$html = (string) $posts[0];

Formatting the HTML output

$html = $document->format()->html();

Element has no format() method, so to output formatted HTML for an element, convert it to a document first:

$html = $element->toDocument()->format()->html();

Inner HTML

$innerHtml = $element->innerHtml();

Document has no innerHtml() method, so to get the inner HTML of a document, convert it to an element first:

$innerHtml = $document->toElement()->innerHtml();

Getting XML

echo $document->xml();

echo $document->first('book')->xml();

Getting content

$posts = $document->find('.post');

echo $posts[0]->text();

Creating a new element

Creating an instance of the class

use DiDom\Element;

$element = new Element('span', 'Hello');

// Outputs "<span>Hello</span>"
echo $element->html();

The first parameter is the element's tag name, the second is its value (optional), and the third is an array of element attributes (optional).

An example of creating an element with attributes:

$attributes = ['name' => 'description', 'placeholder' => 'Enter description of item'];

$element = new Element('textarea', 'Text', $attributes);

An element can also be created from an instance of the DOMElement class:

use DiDom\Element;
use DOMElement;

$domElement = new DOMElement('span', 'Hello');

$element = new Element($domElement);

With the createElement method

$document = new Document($html);

$element = $document->createElement('span', 'Hello');

Getting the element's name

$element->tag;

Getting the parent element

$document = new Document($html);

$input = $document->find('input[name=email]')[0];

var_dump($input->parent());

Getting sibling elements

$document = new Document($html);

$item = $document->find('ul.menu > li')[1];

var_dump($item->previousSibling());

var_dump($item->nextSibling());

Getting child elements

$html = '<div>Foo<span>Bar</span><!--Baz--></div>';

$document = new Document($html);

$div = $document->first('div');

// element node (DOMElement)
// string(3) "Bar"
var_dump($div->child(1)->text());

// text node (DOMText)
// string(3) "Foo"
var_dump($div->firstChild()->text());

// comment node (DOMComment)
// string(3) "Baz"
var_dump($div->lastChild()->text());

// array(3) { ... }
var_dump($div->children());
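
The node classes named in the comments above (DOMText, DOMElement, DOMComment) come from PHP's DOM extension. As a rough illustration of why the div has three children, the following sketch walks the same markup with DOMDocument directly rather than DiDOM.

```php
<?php
// Same markup as the example above, parsed with PHP's DOM extension.
$dom = new DOMDocument();
$dom->loadHTML('<div>Foo<span>Bar</span><!--Baz--></div>');

$div = $dom->getElementsByTagName('div')->item(0);

// Collect "class:content" for every child node, including text and comments.
$children = [];
foreach ($div->childNodes as $child) {
    $children[] = get_class($child) . ':' . $child->textContent;
}

print_r($children);
```

Text and comment nodes count as children here, which is why an index-based lookup such as child(1) lands on the span rather than on the first node.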

Getting the document

$document = new Document($html);

$element = $document->find('input[name=email]')[0];

$document2 = $element->getDocument();

// bool(true)
var_dump($document->is($document2));

Working with element attributes

Creating/updating an attribute

With the setAttribute method:

$element->setAttribute('name', 'username');

With the attr method:

$element->attr('name', 'username');

With the magic method __set:

$element->name = 'username';

Getting the value of an attribute

With the getAttribute method:

$username = $element->getAttribute('value');

With the attr method:

$username = $element->attr('value');

With the magic method __get:

$username = $element->name;

Returns null if the attribute does not exist.

Checking whether an attribute exists

With the hasAttribute method:

if ($element->hasAttribute('name')) {
    // code
}

With the magic method __isset:

if (isset($element->name)) {
    // code
}

Removing an attribute

With the removeAttribute method:

$element->removeAttribute('name');

With the magic method __unset:

unset($element->name);
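
The attribute methods above can be tied together in one short round trip. This is a sketch only: it assumes DiDOM is installed with Composer in the current directory, and the input element and its attribute values are invented for the example.

```php
<?php
// Assumes DiDOM was installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

use DiDom\Element;

$input = new Element('input');

// Three equivalent ways to set an attribute.
$input->setAttribute('name', 'username');
$input->attr('type', 'text');
$input->value = 'alice';

// Three equivalent ways to read one back.
$name  = $input->getAttribute('name');
$type  = $input->attr('type');
$value = $input->value;

$hadName = $input->hasAttribute('name');

// Remove two attributes, then read them again: missing attributes give null.
$input->removeAttribute('type');
unset($input->value);

$typeAfter  = $input->attr('type');
$valueAfter = $input->getAttribute('value');
$hasValue   = isset($input->value);
```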

Comparing elements

$element  = new Element('span', 'hello');
$element2 = new Element('span', 'hello');

// bool(true)
var_dump($element->is($element));

// bool(false)
var_dump($element->is($element2));

Appending child elements

$list = new Element('ul');

$item = new Element('li', 'Item 1');

$list->appendChild($item);

$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($items);

Replacing an element

$element = new Element('span', 'hello');

$document->find('.post')[0]->replace($element);

Warning: only elements found directly in the document can be replaced:

// nothing will happen
$document->first('head')->first('title')->replace($title);

// but this will do
$document->first('head title')->replace($title);

For more details, see the section on searching for elements.

Removing an element

$document->find('.post')[0]->remove();

Warning: only elements found directly in the document can be removed:

// nothing will happen
$document->first('head')->first('title')->remove();

// but this will do
$document->first('head title')->remove();

Working with the cache

The cache is an array of XPath expressions that were converted from CSS.

Getting from the cache

use DiDom\Query;

...

$xpath    = Query::compile('h2');
$compiled = Query::getCompiled();

// array('h2' => '//h2')
var_dump($compiled);

Setting the cache

Query::setCompiled(['h2' => '//h2']);

Miscellaneous

preserveWhiteSpace

By default, whitespace preservation is disabled.

You can enable the preserveWhiteSpace option before loading the document:

$document = new Document();

$document->preserveWhiteSpace();

$document->loadXml($xml);
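
To see what the option changes, the sketch below uses the preserveWhiteSpace property of PHP's built-in DOMDocument, on the assumption that DiDOM's option maps onto it. With preservation off, whitespace-only text nodes between elements are dropped at parse time, which is why the option has to be set before loading.

```php
<?php
// Indented XML: the newlines and spaces between elements are text nodes.
$xml = "<root>\n  <a/>\n  <b/>\n</root>";

// Whitespace preserved: blank text nodes stay in the tree.
$keep = new DOMDocument();
$keep->preserveWhiteSpace = true;
$keep->loadXML($xml);

// Whitespace not preserved: only the two elements remain.
$strip = new DOMDocument();
$strip->preserveWhiteSpace = false;
$strip->loadXML($xml);

$keptCount     = $keep->documentElement->childNodes->length;
$strippedCount = $strip->documentElement->childNodes->length;

echo $keptCount, "\n";     // 5
echo $strippedCount, "\n"; // 2
```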

count

The count() method counts the elements that match a selector:

// prints the number of links in the document
echo $document->count('a');

// prints the number of items in the list
echo $document->first('ul')->count('li');

matches

Returns true if the node matches the selector:

$element->matches('div#content');

// strict match
// returns true if the element is a div with id equals content and nothing else
// if the element has any other attributes the method returns false
$element->matches('div#content', true);

isElementNode

Checks whether the node is an element (DOMElement).