Adjusting the HTML markup in PHP has always been a struggle, but WordPress 6.2 makes it a breeze with the WP_HTML_Tag_Processor API.
For example, here’s how you can add an alt="" attribute to an <img /> tag:
<?php
$html = '<img src="/husky.jpg">';
$p = new WP_HTML_Tag_Processor( $html );
if ( $p->next_tag() ) {
    $p->set_attribute( 'alt', 'Husky in the snow' );
}
echo $p->get_updated_html();
If you’ve ever struggled to add an HTML attribute using regular expressions, you know how big of an improvement this is! In fact, Tag Processor was born out of this exact struggle.
Last year, Dennis Snell and I tried to add a CSS class to the first <h1>, <h2>, … tag in every WordPress heading block. However, the hours spent crafting the perfect regular expression were largely wasted. At that point, I really wanted to use an HTML parser. No existing library was suitable, so we rolled up our sleeves and started building a new one.
Today, WP_HTML_Tag_Processor is a part of the upcoming WordPress 6.2 release, and this post will show you how to use it. Enjoy!
Tag Processor is linear and reads one tag at a time
Tag Processor sees HTML as a list of tags, not as a document tree. It does not understand what a child or a parent is. It only understands the next tag as read from left to right:
<?php
$html = '<h1><p></p></h1><div></div>';
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_tag() ) {
    echo $p->get_tag() . "\n";
}
While this is limiting, it also makes the Tag Processor extremely fast and memory-efficient. There is no virtual document tree, preemptive parsing, or backtracking. The Tag Processor has a light footprint because it does not do anything you don’t specifically request.
HTML operations only affect the selected tag
To use Tag Processor methods like get_tag() or set_attribute($name, $value), you first need to select a target tag. No tag is selected at first. The Tag Processor reads the first tag only once you call $p->next_tag():
<?php
$html = '<h1></h1><p></p>';
$p = new WP_HTML_Tag_Processor( $html );
// No tag is selected until the
// first $p->next_tag() call:
var_dump($p->get_tag());
$p->next_tag();
echo $p->get_tag()."\n";
To select the p tag, you’ll need to call $p->next_tag() again. Append these two lines to the end of the snippet above:
$p->next_tag();
echo $p->get_tag()."\n";
So far so good!
Checking whether the next tag exists
Suppose you call $p->next_tag() too many times and go past the final <p></p>. There will be no errors, but the selected tag will be null:
<?php
$html = '<h1></h1><p></p>';
$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag();
$p->next_tag();
$p->next_tag();
// No tag is selected once we go past
// the last tag:
var_dump($p->get_tag());
To make sure your assumptions about the processed HTML hold, consult the return value of $p->next_tag(). It returns true when it finds a tag, and false when it goes past the last tag in the document:
<?php
$html = '<h1></h1><p></p>';
$p = new WP_HTML_Tag_Processor( $html );
if ( $p->next_tag() ) {
    var_dump( $p->get_tag() );
}
if ( $p->next_tag() ) {
    var_dump( $p->get_tag() );
}
if ( $p->next_tag() ) {
    // There is no third tag so this will not run:
    var_dump( $p->get_tag() );
}
Finding the right tag
The next_tag() method moves one tag at a time, but it can also perform lookups if you pass a $query argument. It supports matching a specific tag name, a CSS class, or both:
<?php
$html = '<div><div class="block-group"></div></div>';
$p = new WP_HTML_Tag_Processor( $html );
// Tag and attribute name lookup is case-insensitive
// according to the HTML specification
$query = array(
    'tag_name'   => 'DIV',
    'class_name' => 'block-group'
);
if ( $p->next_tag( $query ) ) {
    $p->remove_class( 'block-group' );
    $p->add_class( 'wp-block-group' );
}
echo $p->get_updated_html();
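The query can also be narrower than the one above. Here is a minimal sketch, assuming WordPress 6.2 or later is loaded; the markup and the `note` class are made up for illustration. A plain string is treated as a tag name, and `class_name` on its own matches the first tag carrying that class, whatever its name:

```php
<?php
// Assumes WordPress 6.2+ is loaded so that WP_HTML_Tag_Processor is available.
$html = '<div><span class="note"></span><p class="note"></p></div>';

// A plain string is shorthand for a tag-name query:
$p = new WP_HTML_Tag_Processor( $html );
$found_p     = $p->next_tag( 'p' );
$tag_by_name = $found_p ? $p->get_tag() : null;

// class_name alone finds the first tag with that class, here the <span>:
$p = new WP_HTML_Tag_Processor( $html );
$found_note   = $p->next_tag( array( 'class_name' => 'note' ) );
$tag_by_class = $found_note ? $p->get_tag() : null;

var_dump( $tag_by_name, $tag_by_class );
```

Each lookup starts from a fresh processor because the Tag Processor only moves forward through the document.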
Reading HTML attributes
You can read the selected tag’s name and attributes using $p->get_tag() and $p->get_attribute($name). Notice how the HTML entities are automatically decoded:
<?php
$html = '<h1 title="Tag Processor Tutorial &lt;3"></h1>';
$p = new WP_HTML_Tag_Processor( $html );
// Select h1:
$p->next_tag();
// Echo the details:
echo $p->get_tag() . PHP_EOL;
echo $p->get_attribute('title');
Unfortunately, reading the text or HTML contents of a tag is not supported yet.
Modifying HTML attributes
You can update the tag’s attributes using the $p->set_attribute($name, $value) and $p->remove_attribute($name) methods. Just like in the previous example, the HTML entities are handled automatically:
<?php
$html = '<h1 id="main">
Site title
</h1>
<p>Content</p>';
$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag();
$p->remove_attribute( 'id' );
// There is no class attribute, but that's okay –
// there won't be any errors:
$p->remove_attribute( 'class' );
// The escaping is handled automatically:
$p->set_attribute( 'title', 'Using <html> "tags"' );
echo $p->get_updated_html();
Working with CSS classes
Tag Processor can adjust the CSS classes via the add_class( $class ) and remove_class( $class ) methods. This is how Dennis and I ended up adding the wp-block-heading class to the first <h1>, <h2>, … tag in every WordPress heading block:
<?php
$html = '<h2 class="bold">This is a heading</h2>';
$p = new WP_HTML_Tag_Processor( $html );
$header_tags = array( 'H1', 'H2', 'H3', 'H4', 'H5', 'H6' );
while ( $p->next_tag() ) {
    if ( in_array( $p->get_tag(), $header_tags, true ) ) {
        $p->add_class( 'wp-block-heading' );
        break;
    }
}
echo $p->get_updated_html();
Handling tricky HTML inputs
Tag Processor follows the WHATWG HTML parsing spec, which means it can safely process HTML markup that would derail most regular expressions and even DOMDocument:
<?php
$tricky_html = <<<HTML
<textarea src="These <p>'s are not actual HTML elements">
<p><p<!--<p>-->="</p>"</p>
</textarea>
<p></p>
HTML;
$p = new WP_HTML_Tag_Processor( $tricky_html );
$p->next_tag('p');
$p->add_class('bold');
echo $p->get_updated_html();
In contrast, DOMDocument finds three <p> tags and throws a bunch of warnings:
<?php
$tricky_html = <<<HTML
<textarea src="These <p>'s are not actual HTML elements">
<p><p<!--<p>-->="</p>"</p>
</textarea>
<p></p>
HTML;
$d = new DOMDocument();
$d->loadHTML($tricky_html);
var_dump($d->getElementsByTagName('p'));
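Regular expressions are derailed by the same input. As a rough sketch, the pattern below is a hypothetical naive attempt written for this illustration (it is not taken from WordPress). It treats anything that looks like an opening <p> tag as a match:

```php
<?php
// Same tricky input as above; a nowdoc keeps the markup literal.
$tricky_html = <<<'HTML'
<textarea src="These <p>'s are not actual HTML elements">
<p><p<!--<p>-->="</p>"</p>
</textarea>
<p></p>
HTML;

// Hypothetical naive pattern: "<p", a word boundary, then anything up to ">".
$count = preg_match_all( '/<p\b[^>]*>/i', $tricky_html, $matches );

echo $count . " matches\n";
print_r( $matches[0] );
```

The pattern reports four matches, although only the final <p> is an actual element. The other three sit inside an attribute value or inside the <textarea>, where markup is plain text.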
More HTML APIs are coming in the future
This is just the first HTML API in WordPress. In the future you’ll be able to find tags by CSS selectors, update the inner HTML, and construct new HTML trees from scratch. Stay tuned!