Reddit will block bots from accessing its public data to prevent web scraping for AI training


MADRID, June 27 (Portaltic/EP) –

The forum Reddit has announced that it will update its Robots Exclusion Protocol (the robots.txt file) to block automated bots from accessing its public data, thereby preventing so-called data scraping or 'web scraping', which is used to train artificial intelligence (AI) models.

Data scraping, or 'web scraping', is the process of collecting content from web pages using software that extracts the HTML of those sites in order to filter and store the information; it has been compared to an automated copy-and-paste.
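Conceptually, a scraper of this kind parses a page's HTML and keeps only the information it wants. A minimal, stdlib-only Python sketch (the `LinkTextExtractor` class and the inline HTML string are illustrative stand-ins for a real fetched page):

```python
# Minimal web-scraping sketch: parse HTML and extract the text of
# its links, mimicking an automated "copy and paste". The HTML is
# inlined here; a real scraper would download it from a site.
from html.parser import HTMLParser

class LinkTextExtractor(HTMLParser):
    """Collects the text found inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.links.append(data.strip())

html = '<p>See <a href="/wiki">the wiki</a> and <a href="/faq">the FAQ</a>.</p>'
parser = LinkTextExtractor()
parser.feed(html)
print(parser.links)  # ['the wiki', 'the FAQ']
```

Production scrapers typically use dedicated libraries rather than hand-rolled parsers, but the principle, extract and filter HTML, is the same.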

Although this is a common and legal practice, it goes against the terms of use of some websites, since it can be carried out for malicious purposes, as developer Robb Knight and Wired recently verified.

The pair discovered that AI developer Perplexity had ignored the Robots Exclusion Protocol of certain websites and had scraped them to train its artificial intelligence models.
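The protocol Perplexity is accused of ignoring is purely advisory: a well-behaved crawler reads a site's robots.txt and checks each URL against it before fetching. A sketch using Python's standard-library parser (the rules and URLs here are made up for illustration; a real crawler would fetch the site's actual robots.txt):

```python
# Sketch: how a well-behaved crawler consults the Robots Exclusion
# Protocol before fetching a page. The rules are fed in directly;
# a real crawler would download https://example.com/robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
```

Nothing technically prevents a crawler from skipping this check, which is why sites like Reddit also enforce blocks and rate limits server-side.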

To avoid these kinds of situations, Reddit has announced that in the coming weeks it will update its Robots Exclusion Protocol, which "provides high-level instructions" on how it does and does not allow third-party agents to crawl its directories.
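Those "high-level instructions" take the form of plain-text directives in the robots.txt file. Reddit has not published its updated rules; the fragment below is a hypothetical example of the kind of policy described, blocking unknown crawlers while permitting a named good-faith archiver:

```
# Hypothetical robots.txt (not Reddit's actual file):
# allow a trusted archiver, block everyone else.
User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
```

Crawlers identify themselves via the User-agent string and are expected to obey the most specific matching rule block.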

Once the robots.txt file has been updated, Reddit will continue to block unknown bots and crawlers from accessing reddit.com and will limit their crawl rate. Nevertheless, it will maintain open access to its content for researchers and for organizations such as the Internet Archive, which it considers "good-faith actors" that access its content "for non-commercial use".

In contrast, the platform requires permission and a fee when accessing data and tools for commercial purposes, including training AI models.

With this, it has indicated that anyone who accesses its website must comply with its usage policies, "including those in force to protect redditors", and it has made a guide available to interested parties on how to access its content legitimately.

It is worth remembering, however, that Reddit already announced a new public content policy at the beginning of May, prompted by the realization that "more and more commercial entities are using unauthorized access or misusing authorized access to collect public data", including data on the platform.

It also introduced a new subreddit for researchers, demonstrating its intention to preserve public access to the platform's content for "those who believe in the responsible and non-commercial use of public data".
