mastering BeautifulSoup

tl;dr: BeautifulSoup selectors and code snippets

Once you've become familiar with scraping websites with Python, requests, and BeautifulSoup (if not, read this first), you'll want to start creating reusable components to speed up build time and improve data reliability.

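Below I've included reference snippets for extracting data, cleaning data, choosing a parser, handling links and tables, and a few reusable functions.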


extracting data

| Notation | Type | Comments |
| --- | --- | --- |
| .attrs | Macro | All attributes of the selected element, as a dict (print this to see what is available) |
| .div | Element | The first div element inside the currently selected element |
| .a | Element | The first a element inside the currently selected element. Note you'll have to use .get('href') to get the associated link |
| .p | Element | The first p element (paragraph) inside the currently selected element |
| .span | Element | The first span element inside the currently selected element |
| .title | Element | The title element inside the currently selected element |
| .svg | Element | The first SVG inside the currently selected element |
| .img | Element | The first image inside the currently selected element |
| .get('src') | Image Link | The link (source URL) associated with an image |
| .get('alt') | Image | The alt text associated with an image |
| .text | Attribute | The text of an element and all of its children (string) |
| .string | Attribute | Similar to .text, but returns a NavigableString that supports navigation (e.g. with .parent or .next_sibling); None if the element has more than one child |
| .strong | Element | Get the bolded text element; sometimes this is also a link's display text |
| .get_text("\n", strip=True) | Attribute | Get text that is broken up (e.g. by newlines), joined with the given separator |
| .get('href') | Attribute | Get the link value (e.g. of a parent a tag) |
| .nextSibling | Navigation | Get the next node at the same level of the DOM tree (right below in the console); .next_sibling is the current spelling |
| .contents | Attribute | The child nodes of the current element (as a list) |
| .contents[0] | Attribute | The first child node of the current element |
| .contents[0].contents[1] | Navigation | You can navigate this way if the page structure is consistent |
| .a.strong.text | Navigation | You can stack dotwise queries if the page structure is consistent |
| [0] / [1] | Navigation | If your elements are a list (e.g. from a find_all), you can access them by index |
| .extract() | Misc | Removes a tag from the parsed tree and returns it, for when you need the tag and want to throw away the rest of the parsed document. Rare. (Explanation) |
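
To make the notation above concrete, here is a small sketch that runs several of the accessors against a made-up HTML fragment. The markup and the variable names are assumptions for illustration only, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = """
<div class="card">
  <a href="/products/42"><strong>Blue Widget</strong></a>
  <img src="/img/42.png" alt="A blue widget">
  <p>In stock<br>Ships in 2 days</p>
</div>
"""

parsed = BeautifulSoup(html, "html.parser")
card = parsed.div                          # first div in the document

link_href = card.a.get("href")             # value of the href attribute
link_label = card.a.strong.text            # stacked dotwise query
img_alt = card.img.get("alt")              # alt text of the image
img_attrs = card.img.attrs                 # dict of every attribute on the img
lines = card.p.get_text("\n", strip=True)  # text pieces joined by newlines
first_child = card.contents[0]             # a whitespace string, not the a tag

print(link_href, link_label, img_alt, img_attrs, repr(lines), repr(first_child))
```

Note that .contents includes whitespace text nodes, which is why card.contents[0] is not the a tag here. Indexing into .contents only works reliably when the markup is very consistent.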


cleaning data

| Notation | Comments |
| --- | --- |
| .split("/", 1)[1].strip() | Split on a predictable substring (e.g. "/") and keep what follows the first occurrence |
| ", ".join(set(data_as_list)) | Deduplicate a list and convert it to a string (order is not preserved) |
| .strip().replace("\n", "").replace("\r", "") | Remove surrounding whitespace and all newlines |
| ' '.join(data_as_str.split()) | Replace blocks of whitespace with a single space |
| " ".join() -> .replace(' ', ', ') -> .split(',, ')[1] | Split by spaces in a string (converting it to a list first) |
| .rstrip(",") | Delete trailing commas or other delimiters |
| str(HTMLSnippet) | Convert to a string, then parse with regex or a substring "in" check (generally inadvisable) |
| True if parsed.find("div", {"class": "features"}) else False | Convert tag presence to a boolean |
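
A quick sketch of a few of these cleaning idioms applied together. The HTML and the sample strings are invented for the example. dict.fromkeys is swapped in for set() as one way to deduplicate while keeping the original order.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = '<div class="features"><p>  Fast \r\n  and   light  </p></div>'
parsed = BeautifulSoup(html, "html.parser")

raw = parsed.p.get_text()
collapsed = " ".join(raw.split())          # runs of whitespace become one space

tags = ["red", "blue", "red"]
deduped = ", ".join(dict.fromkeys(tags))   # deduplicate, keeping first-seen order

# find() returns None when nothing matches, so compare against None
has_features = parsed.find("div", {"class": "features"}) is not None
has_pricing = parsed.find("div", {"class": "pricing"}) is not None

after_slash = "Category / Widgets ".split("/", 1)[1].strip()
trimmed = "a, b, c,".rstrip(",")

print(collapsed, deduped, has_features, has_pricing, after_slash, trimmed)
```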


parser

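html.parser is built into Python, so it needs no extra dependency (if you don't name a parser, BeautifulSoup picks the best one installed and warns you)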

from bs4 import BeautifulSoup
parsed = BeautifulSoup(response.content, "html.parser")

lxml is faster, but you have to manage the dependency (which is 12 MB unzipped)

from bs4 import BeautifulSoup
# no need to import lxml yourself; bs4 loads it by name (requires pip install lxml)
parsed = BeautifulSoup(response.content, "lxml")

(once you've created the parsed object) if you want all of the page's text as a single string (instead of HTML tags), you can use

text = parsed.get_text(separator=" ", strip=True)
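
If you would rather loop over the text pieces one at a time than get one joined string, .stripped_strings is an option. A minimal sketch on an invented fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = "<div><h1> Title </h1><p>First <b>bold</b> line</p></div>"
parsed = BeautifulSoup(html, "html.parser")

# one string, pieces stripped and joined by the separator
text = parsed.get_text(separator=" ", strip=True)

# a generator of the same stripped pieces, useful for iterating
pieces = list(parsed.stripped_strings)

print(text)
print(pieces)
```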


handling links

to get all links on a page:

links = parsed.select('a[href]')

get the values from those links

link_text = [x.get_text() for x in parsed.select('a[href]')]
link_href = [x.get("href") for x in parsed.select('a[href]')]

if you only want internal subsite links:

all_links = parsed.find_all('a', href=True)
internal_links = []
for link in all_links:
    href = link["href"]  # find_all returns Tag objects, so read the href attribute first
    if href.startswith("/"):
        internal_links.append(f"https://example.com{href}")
    elif "example.com" in href:
        internal_links.append(href)
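
String checks like the ones above miss relative links without a leading slash and can match the domain name inside a mailto: address. A possible alternative, sketched with the standard library's urllib.parse, resolves every href against the page URL and then compares hosts. BASE_URL and the HTML are assumptions made up for this example.

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/blog/"   # assumed URL of the page being scraped

# Hypothetical HTML, invented for this example.
html = """
<a href="/about">About</a>
<a href="post-1">Post</a>
<a href="https://example.com/contact">Contact</a>
<a href="https://othersite.com/page">Elsewhere</a>
<a href="mailto:hi@example.com">Mail</a>
"""
parsed = BeautifulSoup(html, "html.parser")
base_host = urlparse(BASE_URL).netloc

internal_links = []
for link in parsed.find_all("a", href=True):
    absolute = urljoin(BASE_URL, link["href"])   # resolves relative hrefs
    parts = urlparse(absolute)
    # keep only web links that point at the same host as the page
    if parts.scheme in ("http", "https") and parts.netloc == base_host:
        internal_links.append(absolute)

print(internal_links)
```

Comparing netloc exactly treats www.example.com and example.com as different hosts, so adjust that check if the site uses both.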

if you only want links to one external domain

external_links = parsed.select('a[href^="http://othersite.com"]')


handling tables

iterate through each row in a table:

rows = parsed.body.find_all('tr')

get the text from every cell in every row of a table (as one flat list):

result_list = [td.text.strip() for row in rows for td in row.find_all('td')]
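
The flat list above loses track of which cell belongs to which row. If you need that structure, one approach is to build one dict per row, keyed by the header cells. This is a sketch on an invented table and assumes the first row holds th headers.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td> $5 </td></tr>
  <tr><td>Gadget</td><td> $7 </td></tr>
</table>
"""
parsed = BeautifulSoup(html, "html.parser")
rows = parsed.find_all("tr")

# assumption: the first row contains the column headers
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# one dict per data row, pairing each header with its cell text
records = [
    dict(zip(headers, [td.get_text(strip=True) for td in row.find_all("td")]))
    for row in rows[1:]
]

print(records)
```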

you may also run into tables created out of divs. to parse, find the parent tag, and iterate through child divs

table_parent = parsed.find("div", {'class': "parent-class"})
for table_cell_tag in table_parent.find_all("div", {"class": "cell-class"}):
    print(table_cell_tag.get_text())


general tips

to get the next element at the same level of the tree:

next_div = element.find_next_sibling("div")

you can turn recursive off in a find_all:

all_divs_at_that_level = parsed.find_all("div", recursive=False)
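
A small sketch of what recursive=False changes, on an invented fragment. One thing to watch: the search only looks at direct children of whatever you call it on, so calling it on the top-level parsed object of a full page often returns nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = "<section><div id='a'><div id='b'></div></div><div id='c'></div></section>"
parsed = BeautifulSoup(html, "html.parser")
section = parsed.section

# default: every div anywhere below section
all_divs = [d["id"] for d in section.find_all("div")]

# recursive=False: only divs that are direct children of section
top_divs = [d["id"] for d in section.find_all("div", recursive=False)]

# empty here, because the only direct child of the document is section
from_root = parsed.find_all("div", recursive=False)

print(all_divs, top_divs, len(from_root))
```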

you can also specify only tags with some attribute are included, for example all images with alt text (SO):

data_as_str = ", ".join(img['alt'] for img in parsed.find_all('img', alt=True))

to select an element whose class attribute contains spaces (i.e. several classes), you can use select with periods instead of spaces (Docs). Note that select returns a list

elements = parsed.select('div.container-lg.clearfix.px-3.mt-4')

alternatively, you can match on the single most specific class

element = parsed.find("div", {"class" : "clearfix"})
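
The two approaches above differ in how strict they are. A sketch comparing them on invented markup; the third query uses class_ with the full class string, which matches only that exact attribute value in that order.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = """
<div class="container-lg clearfix px-3 mt-4">target</div>
<div class="clearfix">other</div>
"""
parsed = BeautifulSoup(html, "html.parser")

# CSS selector: the element must carry all four classes (in any order)
by_all = parsed.select("div.container-lg.clearfix.px-3.mt-4")

# single class: matches any div that has clearfix among its classes
by_one = parsed.find_all("div", class_="clearfix")

# exact string: matches the class attribute value verbatim
exact = parsed.find_all("div", class_="container-lg clearfix px-3 mt-4")

print(len(by_all), len(by_one), len(exact))
```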

you can use regex on tag elements, but it will be hard to troubleshoot

import re
element = parsed.find("div", {"id":re.compile('foo|bar')})

get elements that have a non-standard attribute present:

currency_option_tags = parsed.select('option[data-currency]')
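
Once you have the tags, the attribute value is read like any other attribute. A sketch with an invented select menu:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = """
<select>
  <option data-currency="USD" value="1">US Dollar</option>
  <option data-currency="EUR" value="2">Euro</option>
  <option value="0">Choose...</option>
</select>
"""
parsed = BeautifulSoup(html, "html.parser")

# only options that actually have a data-currency attribute
currency_option_tags = parsed.select("option[data-currency]")

# map each currency code to the option's visible label
currencies = {
    tag["data-currency"]: tag.get_text(strip=True)
    for tag in currency_option_tags
}

print(currencies)
```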


functions

if you want to find elements that have consistent text (you can use this to chain .nextSibling) (SO)

contacts = parsed.find(lambda elm: elm.name == "h2" and "Contact" in elm.text)
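
A sketch of the chaining idea on invented markup. It uses find_next_sibling("p") rather than .nextSibling, since the immediate next sibling is often just a whitespace string between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = """
<h2>About us</h2><p>We sell widgets.</p>
<h2>Contact details</h2>
<p>hello@example.com</p>
"""
parsed = BeautifulSoup(html, "html.parser")

# find the heading by its text, whatever its position on the page
contacts = parsed.find(lambda elm: elm.name == "h2" and "Contact" in elm.text)

# the raw next sibling is the newline between the tags
raw_next = contacts.next_sibling

# skip text nodes and jump to the next p at the same level
details = contacts.find_next_sibling("p")
contact_text = details.get_text(strip=True) if details else None

print(repr(raw_next), contact_text)
```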

find and concatenate all matching selectors (even if there are none present):

import logging

def flatten_enclosed_elements(enclosing_element, selector_type, **kwargs):
    # returns a list of cleaned strings, or one comma-separated string if output_str=True
    if not enclosing_element:
        logging.warning('no enclosing element for flatten_enclosed_elements')
        return None

    text_list = []
    for ele in enclosing_element.find_all(selector_type):
        if ele and ele.get_text():
            text_list.append(ele.get_text().strip().replace("\n", "").replace("\r", ""))

    return ", ".join(text_list) if text_list and kwargs.get("output_str") else text_list

find and concatenate all neighboring sibling selectors (even if there are none present):

import logging
from bs4 import NavigableString

def flatten_neighboring_selectors(enclosing_element, selector_type):
    if not enclosing_element:
        logging.warning('no enclosing element for flatten_neighboring_selectors')
        return None

    text_list = []
    for ele in enclosing_element.find_all(selector_type):
        next_s = ele.next_sibling
        # only keep siblings that are plain text nodes, not tags
        if not isinstance(next_s, NavigableString):
            continue
        cleaned = str(next_s).strip().replace("\n", "").replace("\r", "")
        if cleaned:  # skip whitespace-only siblings
            text_list.append(cleaned)
    return text_list

find all occurrences of a list of strings (SO)

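import re

def get_tags_with_name_in_list(parsed, tag_type, name_strings_list):
    # group the alternatives so ^ and $ apply to every string, and escape regex metacharacters
    alternatives = "|".join(re.escape(s) for s in name_strings_list)
    re_pattern = re.compile(f"^(?:{alternatives})$", re.I)
    # matches on tag text; use attrs={"name": re_pattern} to match a "name" attribute instead
    found_tags = parsed.find_all(tag_type, string=re_pattern)
    return [x.get_text() for x in found_tags]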

detect the site's language if text is found:

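from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        # raised when the text has no detectable features (e.g. empty or numbers only)
        return None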


general tips

NavigableStrings are just more annoying strings. You may get them sometimes when using .nextSibling. You can convert to regular string with

from bs4 import NavigableString
text_str = str(maybe_ns) if isinstance(maybe_ns, NavigableString) else maybe_ns

.get_text(), .getText(), and .text are the same thing

.get_text() returns the text of a given tag and all child tags. If you just want a given tag's own text, use .string, which returns None when the tag has more than one child

The contents of <script>, <style>, and <template> tags are not considered to be 'text', since they are not human visible. Use .string instead of the above .text methods (According to the docs)
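
A short sketch of how .string and .get_text() differ, on an invented fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = "<p id='one'>plain</p><p id='two'>mixed <b>bold</b></p><script>var x = 1;</script>"
parsed = BeautifulSoup(html, "html.parser")

one = parsed.find("p", id="one")
two = parsed.find("p", id="two")

one_string = one.string                 # the tag's single text child
two_string = two.string                 # None: the tag has a string and a b child
two_text = two.get_text()               # text of the tag plus all of its children
script_source = parsed.script.string    # script body, read via .string

print(one_string, two_string, two_text, script_source)
```

Because .string can be None, check for that before calling string methods such as .strip() on the result.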