Disable Special "class" Attribute Handling

January 31, 2024

The Story: When you parse HTML with BeautifulSoup, the class attribute is treated as a multi-valued attribute and is handled in a special manner: Remember that a single tag can have m

Solution 1:

What I don't like in this approach is that it is quite "unnatural" and "magical" involving importing "private" internal _htmlparser. I hope there is a simpler way.

Yes, you can import it from bs4.builder instead:

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()
        # BeautifulSoup, please don't treat "class" as a list
        self.cdata_list_attributes["*"].remove("class")


soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())

And if it's important enough that you don't want to repeat yourself, put the builder in its own module, and register it with register_treebuilders_from() so that it takes precedence.
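One caveat with the subclass above: as far as I can tell, cdata_list_attributes on a fresh builder is the same object as the class-level default table, so calling .remove("class") on it changes the defaults for every later parse in the process, and building the subclass a second time can fail because "class" is already gone. A minimal sketch that edits a private copy instead is shown here; PlainClassBuilder and the sample markup are hypothetical names for illustration, not part of the original answer.

```python
import copy

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder


class PlainClassBuilder(HTMLParserTreeBuilder):
    """Hypothetical variant of MyBuilder that edits a private copy of the defaults."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Work on a copy so the table shared with other builders is untouched.
        attrs = copy.deepcopy(self.cdata_list_attributes)
        # Rebuild the "*" entry without "class", keeping its container type.
        universal = attrs.get("*", [])
        attrs["*"] = type(universal)(a for a in universal if a != "class")
        self.cdata_list_attributes = attrs


# Hypothetical sample markup.
html = '<p class="name-single name12">Alice</p>'

first = BeautifulSoup(html, "html.parser", builder=PlainClassBuilder())
second = BeautifulSoup(html, "html.parser", builder=PlainClassBuilder())  # safe to build again
untouched = BeautifulSoup(html, "html.parser")  # default handling still applies

print(first.p["class"])      # a single string
print(untouched.p["class"])  # a list of class names
```

The trade-off is a deep copy per builder instance, which is cheap because the table is small.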

Solution 2:

The class HTMLParserTreeBuilder is actually exposed by the parent package's __init__.py (bs4.builder), so there is no need to import it directly from the private submodule. That said, I would do it the following way:

import re

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

bb = HTMLParserTreeBuilder()
bb.cdata_list_attributes["*"].remove("class")

soup = BeautifulSoup(bs, "html.parser", builder=bb)
found_elements = soup.find_all(class_=re.compile(r"^name\-single name\d+$"))
print(found_elements)

It is basically the same as defining the class as in the OP (maybe a bit more explicit), but I don't think there is a better way to do it.
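For readers on a newer release: the Beautiful Soup documentation describes a multi_valued_attributes argument (added in 4.8.0) that switches this handling off without touching a builder at all. The sketch below assumes such a version is installed; the sample markup and variable names are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original question.
html = '<p class="name-single name12">Alice</p><p class="other">Bob</p>'

# Default behaviour: "class" is split on whitespace into a list.
default_soup = BeautifulSoup(html, "html.parser")
print(default_soup.p["class"])

# With multi_valued_attributes=None, every attribute stays a plain string.
plain_soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)
print(plain_soup.p["class"])

# The regular expression now sees the whole attribute value at once.
found = plain_soup.find_all(class_=re.compile(r"^name-single name\d+$"))
print(found)
```

Note that this turns off list handling for all multi-valued attributes (rel, headers, and so on), not just class, so the builder approach is still the one to use when only class should change.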

alezinhacris
