Disable Special "class" Attribute Handling
The Story: When you parse HTML with BeautifulSoup, the class attribute is considered a multi-valued attribute and is handled in a special manner: Remember that a single tag can have m
Solution 1:
What I don't like in this approach is that it is quite "unnatural" and "magical", involving importing the "private" internal _htmlparser. I hope there is a simpler way.
Yes, you can import it from bs4.builder instead:
from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()
        # BeautifulSoup, please don't treat "class" as a list
        self.cdata_list_attributes["*"].remove("class")

soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
And if it's important enough that you don't want to repeat yourself, put the builder in its own module and register it with register_treebuilders_from() so that it takes precedence.
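To see the effect, here is a minimal sketch comparing the default parse with one that uses a builder like the one above. The sample markup and the StringClassBuilder name are made up for illustration. It also works on a deep copy of cdata_list_attributes, on the assumption that the default mapping may be shared between builder instances (so removing "class" in place could leak into other parsers in the same process); that copy is my addition, not part of the answer.

```python
import copy

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder


class StringClassBuilder(HTMLParserTreeBuilder):  # hypothetical name
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumption: the default mapping may be shared, so edit a private copy.
        attrs = copy.deepcopy(self.cdata_list_attributes)
        attrs["*"].remove("class")
        self.cdata_list_attributes = attrs


# Made-up sample markup
html = '<p class="name-single name12">Hello</p>'

# Default behaviour: "class" is split on whitespace into a list
default_value = BeautifulSoup(html, "html.parser").p["class"]
print(default_value)  # ['name-single', 'name12']

# Custom builder: "class" stays a single string
plain_value = BeautifulSoup(html, builder=StringClassBuilder()).p["class"]
print(plain_value)  # name-single name12
```

Because the builder only touches its own copy, parsing with the default "html.parser" afterwards should still give you the list form.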
Solution 2:
The class HTMLParserTreeBuilder is actually available from the parent package (through its __init__.py), so there is no need to import it directly from the private submodule. That said, I would do it the following way:
import re
from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

bb = HTMLParserTreeBuilder()
# Stop treating "class" as a multi-valued (list) attribute
bb.cdata_list_attributes["*"].remove("class")
soup = BeautifulSoup(bs, "html.parser", builder=bb)
# With "class" kept as one string, the regex is matched against the full attribute value
found_elements = soup.find_all(class_=re.compile(r"^name\-single name\d+$"))
print(found_elements)
It is basically the same as defining the class as in the OP (maybe a bit more explicit), but I don't think there is a better way to do it.
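Since bs is not defined in the snippet above, here is a self-contained sketch of the same idea with made-up markup standing in for it. The sample string and variable names are assumptions for illustration, and the deepcopy line is an extra precaution of mine so that the instance gets its own mapping before "class" is removed.

```python
import copy
import re

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

# Hypothetical markup standing in for the undefined `bs` variable
sample = """
<div class="name-single name12">match</div>
<div class="name-single other">no match</div>
<div class="name12 name-single">no match (order differs)</div>
"""

builder = HTMLParserTreeBuilder()
# Give this instance its own mapping, then drop "class" from the list attributes
builder.cdata_list_attributes = copy.deepcopy(builder.cdata_list_attributes)
builder.cdata_list_attributes["*"].remove("class")

soup = BeautifulSoup(sample, builder=builder)

# The pattern is anchored, so it must match the whole class string
pattern = re.compile(r"^name-single name\d+$")
found = soup.find_all(class_=pattern)
texts = [tag.get_text() for tag in found]
print(texts)  # ['match']
```

Only the first div is returned, because its class string as a whole has the "name-single nameNN" shape the anchored pattern asks for.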