How To Scrape Text And Images Together?
I'm working on a webpage scraper with BeautifulSoup4. I want to get both the text and the images of an article, but I'm running into problems. The HTML looks something like this:
some texts1 &l
Solution 1:
This is not the cleanest approach, but it works for markup like yours:
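from bs4 import BeautifulSoup
html = """<div>
some texts1
<br />
<img src="imgpic.jpg" />
<br />
some texts2
</div>"""
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")
# stripped_strings yields each non-empty text node with surrounding whitespace removed
text = list(soup.stripped_strings)
print(text[0])
print(soup.find("img")["src"])
print(text[1])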
Output:
some texts1
imgpic.jpg
some texts2
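If the article contains more than one image or text run, indexing into the list by hand stops scaling. A minimal sketch of a more general approach, assuming the same kind of markup as above: walk the descendants of the <div> in document order and collect text and image sources as you go. The helper name text_and_images is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = '<div>some texts1<br /><img src="imgpic.jpg" /><br />some texts2</div>'


def text_and_images(container):
    """Return ("text", value) and ("image", src) pairs in document order."""
    items = []
    for node in container.descendants:
        if isinstance(node, Comment):
            # HTML comments are NavigableString subclasses; skip them
            continue
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:  # ignore whitespace-only text nodes
                items.append(("text", text))
        elif isinstance(node, Tag) and node.name == "img" and node.get("src"):
            items.append(("image", node["src"]))
    return items


soup = BeautifulSoup(html, "html.parser")
items = text_and_images(soup.div)
for kind, value in items:
    print(kind, value)
```

Because the pairs come back in document order, each image stays between the text that precedes and follows it, whatever the number of <br /> tags in between.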
Solution 2:
Instead of using get_text(), I'd use prettify() to return the entire <div> section you want as a string. The markup stays in the string, so the first piece of text is at the top and the last piece at the bottom. From there you can split on the <br/> tags and trim away the parts you don't need:
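# post_soup is the <div> element you posted
s = post_soup.prettify()
# prettify() writes <br /> as <br/>, so split on that form
split_s = s.split('<br/>')
# str.strip('<div>') strips any of the characters <, d, i, v, > rather than
# the literal tag, so remove the tags with replace() instead
top = split_s[0].replace('<div>', '')
bottom = split_s[-1].replace('</div>', '')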
Output:
>>> top
u'\n some texts1\n '
>>> bottom
u'\n some texts2\n'
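One caveat when trimming tags from a string by hand: str.strip() treats its argument as a set of characters, not as a literal substring, so it can eat into the text itself. A small sketch of the difference, using a made-up fragment and assuming Python 3.9+ for removeprefix() and removesuffix():

```python
fragment = "<div>video</div>"

# strip() removes any leading/trailing characters from the set {<, d, i, v, >},
# so it also consumes the "vid" of "video" and leaves part of the closing tag
stripped = fragment.strip("<div>")

# removeprefix()/removesuffix() remove only the exact substring given
exact = fragment.removeprefix("<div>").removesuffix("</div>")

print(repr(stripped))  # 'eo</'
print(repr(exact))     # 'video'
```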