I need to scrape a site using Python, but the markup is inconsistent, unstructured, and proving hard to work with. One section looks like this:
<p style='font-size: 24px;'> <strong>Title A</strong> </p> <p> <strong> First Subtitle of Title A </strong> "Text for first subtitle" </p>
Then it switches to:
<p> <strong style='font-size: 24px;'> Second Subtitle for Title B </strong> </p>
Sometimes a new title is tacked onto the end of the paragraph that holds the previous section's text:
<p> ...title E's content finishes <strong> <span id="inserted31" style="font-size: 24px;"> Title F </span> </strong> </p> <p> <strong> First Subtitle for Title F </strong> </p>
Enough confusion: it's simply poor markup. An obvious pattern such as 'font-size: 24px;' can find the titles, but I don't have a solid, reusable method to scrape the content that follows each title and associate it with that title.
Regex might work, but I feel the inconsistency would lead to scraping patterns that are too specific and not DRY.
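One direction I'm considering, as a rough sketch rather than a finished solution: ignore the nesting entirely and walk the document in order, treating any text inside an element styled 'font-size: 24px' as a title, any other <strong> text as a subtitle, and everything else as body text for the most recent subtitle. The names TitleGrouper and group_by_title are hypothetical, the sketch uses only the standard library's html.parser, and it assumes the 24px style is the only reliable title marker.

```python
import re
from html.parser import HTMLParser

# Assumption: titles are marked only by this inline style.
TITLE_STYLE = re.compile(r"font-size:\s*24px")
# Void tags never get a closing tag, so they are not tracked.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TitleGrouper(HTMLParser):
    """Group text under the most recent title, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, carries_title_style) for each open tag
        self.sections = []     # [{"title": ..., "items": [{"subtitle", "text"}]}]
        self.last_kind = None  # "title", "subtitle", or "text"

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        self.stack.append((tag, bool(TITLE_STYLE.search(style))))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        if any(is_title for _, is_title in self.stack):
            if self.last_kind == "title":
                # Adjacent title fragments belong to the same title.
                self.sections[-1]["title"] += " " + text
            else:
                self.sections.append({"title": text, "items": []})
            self.last_kind = "title"
            return
        if not self.sections:
            # Text that appears before any title.
            self.sections.append({"title": None, "items": []})
        items = self.sections[-1]["items"]
        if any(open_tag == "strong" for open_tag, _ in self.stack):
            items.append({"subtitle": text, "text": ""})
            self.last_kind = "subtitle"
        else:
            if not items:
                items.append({"subtitle": None, "text": ""})
            items[-1]["text"] = (items[-1]["text"] + " " + text).strip()
            self.last_kind = "text"


def group_by_title(html):
    parser = TitleGrouper()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Because the grouping depends on document order rather than on which <p> a title happens to sit in, a title appended to the end of the previous section's paragraph still starts a new section. Bold text inside ordinary body copy would be misread as a subtitle, so this would need tuning against the real pages.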
I could offer to rewrite the HTML and fix the hierarchy. However, since this is a WordPress site, I fear the rewritten content might no longer be compatible with the editor the admin uses in the WordPress interface.
Any suggestions for either a better scraping method or a way to handle this on the WordPress side would be greatly appreciated. I want to avoid copying and pasting as much as possible.