Python Beautifulsoup4 Website Parsing
I'm trying to scrape some sports data from a website using Beautifulsoup4, but am having some trouble figuring out how to proceed. I'm not that great with HTML, and can't seem to
Solution 1:
Each score unit is located inside a <td class='match-details'> element; loop over those to extract the match details.
From there, you can extract the text from child elements using the .stripped_strings generator; just pass it to ''.join() to get all the strings contained in a tag. Pick team-home, score and team-away separately for ease of parsing:
for match in soup.find_all('td', class_='match-details'):
    home_tag = match.find('span', class_='team-home')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='team-away')
    away = away_tag and ''.join(away_tag.stripped_strings)
With an additional print, this gives:
>>> for match in soup.find_all('td', class_='match-details'):
...     home_tag = match.find('span', class_='team-home')
...     home = home_tag and ''.join(home_tag.stripped_strings)
...     score_tag = match.find('span', class_='score')
...     score = score_tag and ''.join(score_tag.stripped_strings)
...     away_tag = match.find('span', class_='team-away')
...     away = away_tag and ''.join(away_tag.stripped_strings)
...     if home and score and away:
...         print(home, score, away)
...
Newcastle 0-3 Sunderland
West Ham 2-0 Swansea
Cardiff 2-1 Norwich
Everton 2-1 Aston Villa
Fulham 0-3 Southampton
Hull 1-1 Tottenham
Stoke 2-1 Man Utd
Aston Villa 4-3 West Brom
Chelsea 0-0 West Ham
Sunderland 1-0 Stoke
Tottenham 1-5 Man City
Man Utd 2-0 Cardiff
# etc. etc. etc.
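For anyone who wants to try the loop without fetching the page, here is a minimal self-contained sketch. The HTML snippet is made up to mirror the structure described above (the real page may differ), and the helper names tag_text and parse_matches are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described in this answer;
# the real page is not reproduced here and may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="match-details">
    <span class="team-home"><a href="#">Newcastle</a></span>
    <span class="score"><abbr>0-3</abbr></span>
    <span class="team-away"><a href="#">Sunderland</a></span>
  </td></tr>
  <tr><td class="match-details">
    <span class="team-home"><a href="#">West Ham</a></span>
    <span class="score"><abbr>2-0</abbr></span>
    <span class="team-away"><a href="#">Swansea</a></span>
  </td></tr>
  <tr><td class="match-details"><span class="team-home">Postponed</span></td></tr>
</table>
"""


def tag_text(tag):
    """Join all stripped strings inside a tag, or return None if the tag is missing."""
    return "".join(tag.stripped_strings) if tag is not None else None


def parse_matches(html):
    """Return a list of (home, score, away) tuples for complete match cells."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for match in soup.find_all("td", class_="match-details"):
        home = tag_text(match.find("span", class_="team-home"))
        score = tag_text(match.find("span", class_="score"))
        away = tag_text(match.find("span", class_="team-away"))
        # Skip cells that lack any of the three parts (e.g. the last row above).
        if home and score and away:
            results.append((home, score, away))
    return results


matches = parse_matches(SAMPLE_HTML)
for home, score, away in matches:
    print(home, score, away)
```

Returning tuples instead of printing inside the loop makes it easier to feed the results into a CSV writer or a database later.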
Solution 2:
You can use the tag.string property to get the text of a tag. Note that .string is None when the tag contains more than one child.
Refer to the documentation for more details. http://www.crummy.com/software/BeautifulSoup/bs4/doc/
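A small sketch of how .string behaves compared with .stripped_strings may help here. The markup below is invented purely for illustration and is not taken from the site in the question.

```python
from bs4 import BeautifulSoup

# Invented markup: one tag with a single text child, one with several children.
soup = BeautifulSoup(
    '<span class="team-home">Everton</span>'
    '<span class="score"><abbr>2</abbr>-<abbr>1</abbr></span>',
    "html.parser",
)

home_tag = soup.find("span", class_="team-home")
score_tag = soup.find("span", class_="score")

# A tag with exactly one text child exposes that text through .string.
home_string = home_tag.string

# A tag with several children has no single string, so .string is None.
score_string = score_tag.string

# Joining .stripped_strings collects the text from all descendants instead.
score_joined = "".join(score_tag.stripped_strings)

print(home_string, score_string, score_joined)
```

So .string is convenient for simple tags, while nested markup usually needs .stripped_strings or get_text().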
Solution 3:
This is an update to the accepted answer, which is still correct. It is needed because the page now redirects here: https://www.bbc.com/sport/football/premier-league/scores-fixtures
Ping me if you edit your answer and I will delete this one.
for match in soup.find_all('article', class_='sp-c-fixture'):
    home_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home').find('span').find('span')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='sp-c-fixture__number sp-c-fixture__number--time')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away').find('span').find('span')
    away = away_tag and ''.join(away_tag.stripped_strings)
    if home and score and away:
        print(home, score, away)
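One caveat with the chained .find() calls above: if the outer span is missing, the first .find() returns None and the next call raises AttributeError. A possible alternative, not the approach used in this answer, is to use CSS selectors via select() and select_one() and check for None once. The sketch below runs against simplified, made-up markup that only reuses the class names above; the sample values and the helper text_of are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup reusing the class names from the answer above.
# The live page structure and the values shown here are assumptions.
SAMPLE_HTML = """
<article class="sp-c-fixture">
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home">
    <span><span>Stoke</span></span>
  </span>
  <span class="sp-c-fixture__number sp-c-fixture__number--time">15:00</span>
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away">
    <span><span>Man Utd</span></span>
  </span>
</article>
<article class="sp-c-fixture">
  <span class="sp-c-fixture__number sp-c-fixture__number--time">17:30</span>
</article>
"""


def text_of(node):
    """Join all stripped strings inside a node, or return None if it is missing."""
    return "".join(node.stripped_strings) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
fixtures = []
for match in soup.select("article.sp-c-fixture"):
    # Matching on a single distinctive class avoids depending on the exact
    # order of every class in the attribute.
    home = text_of(match.select_one(".sp-c-fixture__team--time-home"))
    score = text_of(match.select_one(".sp-c-fixture__number--time"))
    away = text_of(match.select_one(".sp-c-fixture__team--time-away"))
    # The second article has no team spans, so it is skipped here.
    if home and score and away:
        fixtures.append((home, score, away))

for home, score, away in fixtures:
    print(home, score, away)
```

Because stripped_strings walks all descendants, there is no need to descend into the nested spans by hand.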