mastering BeautifulSoup

tl;dr: BeautifulSoup selectors and code snippets

Once you've become familiar with scraping websites with Python, requests, and BeautifulSoup (if not, read this first), you'll want to start creating reusable components to speed up build time and improve data reliability.

Below I've included reference snippets for:

  • extracting data
  • cleaning data
  • handling links
  • handling tables
  • other general functions


extracting data

Notation Type Comments
.attrs Macro All attributes of a selected element (print this)
.div Element The div element inside the currently selected element
.a Element The a element inside the currently selected element. Note you'll have to use .get('href') to get the associated link
.p Element Get the p element (paragraph) inside the currently selected element
.span Element Get the span element inside the currently selected element
.title Element Get the title element inside the currently selected element
.svg Element Get the SVG inside the currently selected element
.img Element Get the image inside the currently selected element
.get('src') Image Link The link associated with an image
.get('alt') Image Link The alt text associated with an image
.text Attribute The text associated with an element (string)
.string Attribute Similar to .text, but supports some navigation (e.g. with .children)
.strong Element The strong element (bolded text) inside the currently selected element; sometimes the link display text
.get_text("\n", strip=True) Attribute Get text that is broken up (e.g. by newlines)
.get('href') Attribute Get the link value (e.g. of a parent a tag)
.nextSibling Navigation Get the next item in the DOM tree (right below in the console)
.contents Attribute The elements inside the current element (as a list)
.contents[0] Attribute The first element inside the current element
.contents[0].contents[1] Navigation You can navigate this way if the sitemap is consistent
.a.strong.text Navigation You can stack dotwise queries if sitemap consistent
[0] / [1] Navigation If your elements are a list (eg with a find_all), you can query through them with the index
.extract() Misc Removes the element from the tree and returns it, for when you need the tag but want to throw away the rest of the parse. Rare. (Explanation)
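
A minimal sketch exercising a few of these notations on a made-up snippet:

from bs4 import BeautifulSoup

html = '<div class="card"><a href="/post/1"><strong>Read more</strong></a><img src="/img/1.png" alt="thumbnail"></div>'
parsed = BeautifulSoup(html, "html.parser")

parsed.div.attrs            # {'class': ['card']}
parsed.a.get('href')        # '/post/1'
parsed.a.strong.text        # 'Read more'
parsed.img.get('alt')       # 'thumbnail'
parsed.div.contents[0]      # the a element, first child of the div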


cleaning data

Notation Comments
.split("/", 1)[1].strip() Split by predictable substring e.g."/"
", ".join(set(data_as_list)) Deduplicate list and convert to string
.strip().replace("\n", "").replace("\r", "") Remove whitespace and newlines
' '.join(data_as_str.split()) Replace blocks of whitespace with single whitespace
" ".join() -> .replace(' ', ', ') -> .split(',, ')[1] Split by spaces in a string (converting it to a list first)
.rstrip(",") Delete trailing commas or other delimiters
str(HTMLSnippet) Convert to a string, then parse with regular string formatting (generally inadvisable)

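For example, chaining a couple of these on a made-up scraped string:

raw = "  Widgets / Gadgets,\r\n"
category = raw.split("/", 1)[1].strip().rstrip(",")       # 'Gadgets'
collapsed = ' '.join("too   much\n whitespace".split())   # 'too much whitespace'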


parser

I used to prefer lxml, but I've spent enough time wrestling with dependency management to prefer html.parser. I have not noticed performance or functionality differences.

from bs4 import BeautifulSoup

parsed = BeautifulSoup(response.content, "html.parser")
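
Here response is assumed to be an ordinary requests response, e.g.:

import requests

response = requests.get("http://example.com")  # placeholder URL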


links

To get all links on a page:

links = parsed.select('a[href]')

Get the text and hrefs from those links:

link_text = [x.get_text() for x in parsed.find_all("a")]
link_href = [x.get("href") for x in parsed.select('a[href]')]
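
If you want the text and href paired per link, one option:

link_pairs = [(a.get_text(strip=True), a['href']) for a in parsed.find_all('a', href=True)]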

If you only want internal subsites:

# build a new list rather than mutating the one you're iterating over
internal_links = []
for a in parsed.find_all('a', href=True):
    href = a['href']
    if href.startswith("/"):
        internal_links.append(f"http://example.com{href}")
    elif "example.com" in href:
        internal_links.append(href)

If you only want links to one specific domain:

domain_links = parsed.select('a[href^="http://example.com/"]')
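
If you'd rather not build absolute URLs by hand, the standard library's urljoin resolves relative and absolute hrefs uniformly; a sketch (the base URL is a placeholder):

from urllib.parse import urljoin

base = "http://example.com/somepage"
absolute_links = [urljoin(base, a['href']) for a in parsed.find_all('a', href=True)]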


tables

Iterate through each row in a table:

rows = parsed.body.find_all('tr')

Get the text from every cell in a row of a table:

result_list = [td.text.strip() for row in rows for td in row.find_all('td')]
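
If the table has a header row, you can zip the headers against each body row to get a list of dicts; a sketch assuming th cells in the first row:

table = parsed.find('table')
headers = [th.get_text(strip=True) for th in table.find_all('th')]
records = [
    dict(zip(headers, [td.get_text(strip=True) for td in row.find_all('td')]))
    for row in table.find_all('tr')
    if row.find_all('td')  # skip the header row
]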


general

To get the next selector at the same level of the tree:

element.find_next_sibling("div")

You can turn recursive off in a find_all so that only direct children are searched. Note that on the top-level soup of a full document this usually matches nothing (its only direct child is the html element), so call it on an element:

all_divs_at_that_level = parsed.body.find_all("div", recursive=False)

You can also restrict results to tags that have a given attribute, for example all images with alt text (SO):

data_as_str = (", ".join([img['alt'] for img in parsed.find_all('img', alt=True)]))

To match an element whose class attribute contains spaces (i.e. several classes), use select with periods in place of the spaces (Docs). Note that select returns a list:

elements = parsed.select('div.container-lg.clearfix.px-3.mt-4')

Alternatively, you can match on just the most distinctive class:

element = parsed.find("div", {"class": "clearfix"})

If you want a simple boolean for the presence of an element:

link_exists = bool(parsed.find("a", {"href": "/somesubsite"}))


functions

If you want to find elements by their text content (handy as an anchor to chain .nextSibling from) (SO):

contacts = parsed.find(lambda elm: elm.name == "h2" and "Contact" in elm.text)
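
For example, to grab whatever paragraph follows that heading (the structure here is assumed):

contact_para = contacts.find_next_sibling("p") if contacts else None
contact_text = contact_para.get_text(strip=True) if contact_para else None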

Find and concatenate all matching selectors (even if there are none present):

import logging

def flatten_enclosed_elements(enclosing_element, selector_type, **kwargs):
    # pass output_str=True to get a comma-joined string instead of a list
    if not enclosing_element:
        logging.warning('no enclosing element for flatten_enclosed_elements')
        return None

    text_list = []
    for ele in enclosing_element.find_all(selector_type):
        if ele and ele.get_text():
            text_list.append(ele.get_text().strip().replace("\n", "").replace("\r", ""))

    return ", ".join(text_list) if text_list and kwargs.get("output_str") else text_list

Find and concatenate all neighboring sibling selectors (even if there are none present):

from bs4 import NavigableString

def flatten_neighboring_selectors(enclosing_element, selector_type):
    if not enclosing_element:
        logging.warning('no enclosing element for flatten_neighboring_selectors')
        return None

    text_list = []
    for ele in enclosing_element.find_all(selector_type):
        next_s = ele.nextSibling
        # only keep non-empty text nodes (NavigableStrings)
        if isinstance(next_s, NavigableString) and str(next_s).strip():
            text_list.append(str(next_s).strip().replace("\n", "").replace("\r", ""))
    return text_list

Find all occurrences of a list of strings (SO)

import re

def get_whitelist_occurrences(parsed, tag, whitelist=('str_1', 'str_2', 'str_3')):
    found_strings = []
    for string in whitelist:
        # find_all's first argument is the tag name; string= matches the tag's text
        found_selectors = parsed.find_all(tag, string=re.compile(f"^{string}$", re.I))
        found_strings.extend(x.get_text() for x in found_selectors)
    return found_strings

Detect the site's language if text is found:

from langdetect import detect

def detect_language(output_dict):
    try:
        return detect(output_dict.get("Text"))
    except Exception:
        # langdetect raises LangDetectException when there's no usable text
        return None


gotchas

NavigableStrings are just more annoying strings. You may get them sometimes when using .nextSibling. You can convert one to a regular string with:

from bs4 import NavigableString

text_str = str(maybe_ns) if isinstance(maybe_ns, NavigableString) else maybe_ns