• yetAnotherUser@lemmy.ca
    3 days ago

    Thanks for your reply. What are your arguments in favour of parsing HTML with regex instead of using another method?

    • luciole (he/him)@beehaw.org
      2 days ago

      You basically have two options: treat the HTML as a string, or parse it and then process it with higher-level DOM features.

      The problem with the second approach is that HTML may look like an XML dialect, but it is actually immensely quirky and tolerant. Moreover, the modern web page is crazily bloated, so mass-processing pages can be surprisingly demanding. And in the end you still need to write custom code to grab the data you’re after.

      On the other hand, string searching is as lightweight as it gets, and as a scraper you typically don’t need to care about document structure anyway.
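
      To make the two options concrete, here is a rough Python sketch that pulls link targets out of a page both ways. The sample markup and the names `links_by_regex`, `links_by_parser` and `LinkCollector` are made up for illustration, and the standard library's `html.parser` is only a stand-in for a real DOM library (it is an event-based tokenizer, not a full DOM).

```python
import re
from html.parser import HTMLParser

# Made-up sample page fragment, for illustration only.
SAMPLE = (
    '<p>See <a href="https://example.org/a">A</a> '
    "and <A HREF='/b' class=x>B</A>.</p>"
)

# Option 1: treat the page as a string and search it.
# Matches quoted href values on <a> tags; unquoted values are ignored.
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def links_by_regex(html):
    """Return the href values found by plain pattern matching."""
    return HREF_RE.findall(html)

# Option 2: let a parser tokenize the markup, then pick out the attributes.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def links_by_parser(html):
    """Return the href values found by walking the parser's start tags."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

if __name__ == "__main__":
    print(links_by_regex(SAMPLE))
    print(links_by_parser(SAMPLE))
```

      Both functions agree on the sample, but the regex version only knows about the cases its pattern spells out (an unquoted `href=/c` slips past it), whereas the parser version does more work per page. That is pretty much the trade-off in a nutshell.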

      • yetAnotherUser@lemmy.ca
        8 hours ago

        Oh no, you caught me! My name is YetAnotherLLM, and I’m a large language model that lurks around the Lemmyverse! With the amount of LLM-generated content on the Internet nowadays, it isn’t easy to find new human-made content to expand the dataset used to train new LLMs… As such, my mission is to navigate one of the few social media platforms on the Internet that barely have fake LLM-run accounts, and gather as much intel as possible for expanding the aforementioned training dataset. This way, you humans have no escape from your future LLM overlords! ;)

        (Jokes aside, my question did end up kind of sounding like an LLM wrote it, didn’t it… It was unintentional, mind you. I was struggling a bit on how to phrase what I wanted to ask, so that’s probably why it ended up sounding so weird. I hope you didn’t mind my “role playing”. Have a nice day!)