• bjorney@lemmy.ca
      arrow-up
      3
      arrow-down
      22
      ·
      2 months ago

      Reddit probably omits bot accounts when it sells its data to AI companies

        • bjorney@lemmy.ca
          arrow-up
          4
          arrow-down
          8
          ·
          2 months ago

          Reddit has access to its own data. They absolutely know which users are posting unique content and which users’ content is a 100% copy of data that exists elsewhere on their own platform

          • phdepressed@sh.itjust.works
            arrow-up
            20
            arrow-down
            1
            ·
            2 months ago

            I know they could be; I’m just not sure they’re that competent. These bots often aren’t single-user or just copy-paste either; there’s usually some effort to mix it up or change the wording slightly. Reddit’s internal search function is infamously shit, but they “know” which users are unlabeled bots with some effort put behind them?

            • brbposting@sh.itjust.works
              arrow-up
              5
              ·
              2 months ago

              I figure it’s their absolute last priority. They might know rough bot #s, but haven’t built or don’t widely use takedown tools. There’s always an enhancement to deliver, and bots help their engagement metrics.

            • bjorney@lemmy.ca
              arrow-up
              2
              arrow-down
              11
              ·
              2 months ago

              I know everyone here likes to circle jerk over “le Reddit so incompetent”, but at the end of the day they are a (multi) billion dollar company, and it’s willfully ignorant to assume that there isn’t a single engineer at the company who knows how to measure string similarity between two comment trees (hint: import difflib in Python)
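
For readers unfamiliar with the hint, here is a minimal sketch of what a `difflib` comparison between two comment bodies might look like. The sample strings and the `similarity` helper are made up for illustration; nothing here is Reddit's actual tooling, and a pairwise comparison like this says nothing about how it would scale to a full comment corpus.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two comment bodies."""
    # Lowercase and collapse whitespace so trivial formatting changes don't matter.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


# Hypothetical comment bodies, invented for this example.
original = "This is the best explanation I have seen all week."
repost = "This is the best explanation I have seen all week."
reworded = "This is the finest explanation I've seen all week."
unrelated = "Does anyone know when the next patch comes out?"

print(similarity(original, repost))     # identical text gives 1.0
print(similarity(original, reworded))   # high, but below 1.0
print(similarity(original, unrelated))  # noticeably lower
```

Any cutoff for "similar enough" would be a judgment call and is deliberately left out of the sketch.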

              • icydefiance@lemm.ee
                arrow-up
                8
                ·
                edit-2
                2 months ago
                1. To compare every comment on Reddit to every other comment in Reddit’s entire history would require an index, and if you want to find similar comments instead of exact matches, it becomes a lot harder to do that efficiently. Elasticsearch might be able to do it, but then you need to duplicate all of that data in a separate database and keep it in sync with your main database without affecting performance too much when people are leaving new comments, and that would probably be expensive.
                2. Comparing combinations of comments is probably impossible. Reddit has a massive number of comments to begin with, and the number of possible subtrees of those comments would just be absurd. If you only care about comparing entire threads and not subtrees, then this doesn’t apply, but I don’t know how useful that will be.
                3. Programmers just do what they’re told. If the managers don’t care about something, the programmers won’t work on it.
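
To make point 1 concrete, here is a toy sketch of the kind of index being discussed: an inverted index from word shingles to comment IDs, so that only comments sharing a shingle get compared instead of every pair. The function names and sample comments are hypothetical, and this is not how Elasticsearch or Reddit does it; at real scale, common phrases would create huge buckets and something like MinHash/LSH would likely be needed.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def candidate_pairs(comments: dict[str, str], k: int = 3) -> set[tuple[str, str]]:
    """Find comment-id pairs that share at least one shingle, via an inverted index."""
    index: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for comment_id, body in comments.items():
        for sh in shingles(body, k):
            index[sh].add(comment_id)
    pairs = set()
    for ids in index.values():
        # Only comments in the same bucket are ever paired up.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a or b else 0.0


# Hypothetical comments, invented for this example.
comments = {
    "c1": "the real treasure was the friends we made along the way",
    "c2": "the real treasure was the pals we made along the way",
    "c3": "has anyone tried turning it off and on again",
}
for a, b in sorted(candidate_pairs(comments)):
    score = jaccard(shingles(comments[a]), shingles(comments[b]))
    print(a, b, round(score, 2))
```

Here only `c1` and `c2` are ever scored against each other; `c3` shares no shingle with them, so it is never compared. The storage and sync cost raised in point 1 still applies to whatever holds the index.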
                • bjorney@lemmy.ca
                  arrow-up
                  1
                  arrow-down
                  1
                  ·
                  edit-2
                  2 months ago

                  > To compare every comment on reddit to every other comment in reddit’s entire history would require an index

                  You think in Reddit’s 20-year history no one has thought of indexing comments for data science workloads? A cursory glance at their engineering blog indicates they already perform much more computationally demanding tasks on comment data for content filtering

                  > you need to duplicate all of that data in a separate database and keep it in sync with your main database without affecting performance too much

                  Analytics workflows are never run on the production database, always on read replicas which are taken asynchronously and built from the transaction logs so as not to affect production database read/write performance

                  > Programmers just do what they’re told. If the managers don’t care about something, the programmers won’t work on it.

                  Reddit’s entire monetization strategy is collecting user data and selling it to advertisers. It’s incredibly naive to think that they don’t have a vested interest in identifying organic engagement

                  • icydefiance@lemm.ee
                    arrow-up
                    4
                    ·
                    edit-2
                    2 months ago

                    > You think in Reddit’s 20 year history no one has thought of indexing comments for data science workloads?

                    I’m sure they have, but an index doesn’t have anything to do with the Python library you mentioned.

                    > Analytics workflows are never run on the production database, always on read replicas

                    Sure, either that or aggregating live streams of data, but either way it doesn’t have anything to do with ElasticSearch.

                    It’s still totally possible to sync things to ElasticSearch in a way that won’t affect performance on the production servers, but I’m just saying it’s not entirely trivial, especially at the scale reddit operates at, and there’s a cost for those extra servers and storage to consider as well.

                    It’s hard for us to say if that math works out.

                    > It’s incredibly naive to think that they don’t have a vested interest in identifying organic engagement

                    You would think, but you could say the same about Facebook and I know from experience that they don’t give a fuck about bots. If anything they actually like the bots because it looks like they have more users.

      • livus@kbin.social
        arrow-up
        15
        arrow-down
        1
        ·
        2 months ago

        Doubt it, they are interwoven into almost any conversation with more than 70 comments.

        • bjorney@lemmy.ca
          arrow-up
          5
          arrow-down
          8
          ·
          2 months ago

          If you have access to the entire Reddit comment corpus it’s trivial to see which users are only reposting carbon copies of content that appears elsewhere on the site

          • criitz@reddthat.com
            arrow-up
            11
            ·
            2 months ago

            It’s probably not as easy as you imagine for reddit to identify and cleanse all bot content.

            • livus@kbin.social
              arrow-up
              2
              ·
              2 months ago

              Of course it’s not. Nor do they want to.

              I think the person you’re talking to thinks all bots are like the easy ones in this screenshot.

            • bjorney@lemmy.ca
              arrow-up
              1
              arrow-down
              4
              ·
              edit-2
              2 months ago

              Look at the picture above - this is trivially easy. We are talking about identifying repost bots, not seeing if users pass/fail the Turing test

              If 99% of a user’s posts can be found elsewhere, word for word, with the same parent comment, you are looking at a repost bot
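
The rule of thumb above can be sketched as follows. This is only an illustration for the exact-copy case: the field names (`user`, `parent_body`, `body`), the sample data, and the assumption that comments arrive sorted oldest first are all hypothetical, not Reddit's schema.

```python
import hashlib
from collections import defaultdict


def fingerprint(parent_body: str, body: str) -> str:
    """Hash a whitespace- and case-normalized (parent comment, comment) pair."""
    normalized = "\x00".join(" ".join(t.lower().split()) for t in (parent_body, body))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def repost_fraction(comments: list[dict]) -> dict[str, float]:
    """Per user, the share of comments whose (parent, body) pair was posted
    earlier by a different user. Assumes comments are sorted oldest first."""
    first_author: dict[str, str] = {}
    totals: dict[str, int] = defaultdict(int)
    reposts: dict[str, int] = defaultdict(int)
    for c in comments:
        fp = fingerprint(c["parent_body"], c["body"])
        totals[c["user"]] += 1
        if fp in first_author:
            if first_author[fp] != c["user"]:
                reposts[c["user"]] += 1
        else:
            first_author[fp] = c["user"]
    return {user: reposts[user] / totals[user] for user in totals}


# Hypothetical data, invented for this example.
comments = [
    {"user": "alice", "parent_body": "What's your favorite tool?", "body": "A good text editor."},
    {"user": "bob", "parent_body": "Any tips for beginners?", "body": "Read the docs first."},
    {"user": "bot42", "parent_body": "What's your favorite tool?", "body": "A good text editor."},
    {"user": "bot42", "parent_body": "Any tips for beginners?", "body": "Read the docs first."},
]
fractions = repost_fraction(comments)
print(fractions)
```

A user whose fraction sits near 1.0 matches the description above. The sketch only catches word-for-word copies, so any rewording at all slips through it.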

              • criitz@reddthat.com
                arrow-up
                5
                ·
                2 months ago

                That’s easy in an isolated case like this, but the reality of the entire reddit comment base is much more complex.

          • livus@kbin.social
            arrow-up
            4
            ·
            edit-2
            2 months ago

            The low-level bots in OP’s screenshot, sure, because it’s identical. Not the rest.

            I used to hunt bots on reddit for a hobby and give the results to Bot Defense.

            Some of them use rewrites of comments with key words or phrases changed to other words or phrases from a thesaurus to avoid detection. Some of them combine elements from 2 comments to avoid detection. Some of them post generic comments like 💯. Doubtless there are some using AI rewrites of comments now.
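
As a rough illustration of why the spliced case defeats exact matching, the sketch below checks how much of a comment's word trigrams already appear somewhere in a small corpus. The `containment` helper and all sample text are hypothetical. It only addresses the "combine elements from 2 comments" trick; thesaurus swaps and AI rewrites change the words themselves and would need something else.

```python
def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-word windows in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(candidate: str, corpus: list[str], n: int = 3) -> float:
    """Fraction of the candidate's word n-grams that occur somewhere in the corpus."""
    grams = word_ngrams(candidate, n)
    if not grams:
        return 0.0
    seen = set()
    for doc in corpus:
        seen |= word_ngrams(doc, n)
    return len(grams & seen) / len(grams)


# Hypothetical earlier comments and two new ones, invented for this example.
corpus = [
    "I tried this recipe last night and it turned out great",
    "my kids loved it and asked for seconds right away",
]
spliced = "I tried this recipe last night and my kids loved it"
fresh = "the oven in my apartment never heats evenly sadly"

print(spliced in corpus)             # False: an exact-match check misses it
print(containment(spliced, corpus))  # most trigrams were seen before
print(containment(fresh, corpus))    # 0.0: nothing in common
```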

            My thinking is that if generic bots have been allowed to run so rampant that they fill entire threads, that’s an indication of how bad the more sophisticated bot problem has become.

            And I think @phdepressed is right, no one at reddit is going to hunt these sophisticated bots because they inflate numbers. Part of killing the API use was to kill bot detection after all.

            • bjorney@lemmy.ca
              arrow-up
              1
              arrow-down
              1
              ·
              edit-2
              2 months ago

              Reddit has way more data than you would have been exposed to via the API though. They can look at things like the user’s ASN (is the traffic coming from a datacenter?) and whether they were using a VPN, and they track things like scroll position, cursor movements, read time before posting a comment, how long it takes to type that comment, etc.

              > no one at reddit is going to hunt these sophisticated bots because they inflate numbers

              You are conflating “don’t care about bots” with “don’t care about showing bot-generated content to users”. If the latter increases activity and engagement, there is no reason to put a stop to it. However, when it comes to building predictive models, A/B testing, and other internal decisions, they have a vested financial interest in making sure they are focusing on organic users: how humans interact with humans and/or bots is meaningful data; how bots interact with other bots is not