Unveiling the Hidden Layers of the Web: Introducing our Open-Sourced Scrapers

We’re thrilled to announce a new milestone in our journey towards data transparency and platform accountability. The scrapers we used in the CrossOver project to compare what users actually see with the information provided by the platforms’ official APIs are now open-sourced!

These scrapers enable researchers to collect data from Google Autocomplete, YouTube search emulation, and Reddit trends. We aimed to pull back the curtain and reveal what users really see when they interact with these platforms.

Hidden Truths in the Digital World

Throughout the project, we made several discoveries. For example, we found that “nazis” was trending on Twitter in Belgium, yet there was little to no trace of this in the platform’s official API. On May 9, 2022, between 9 and 10 a.m., CrossOver’s monitoring computers detected the keyword “nazis” in the top trends for Belgian users on Twitter, but the word did not appear in the trends list returned by the platform’s official API. The tweets contained dubious content and disinformation about Ukraine. The same thing happened on November 22 and 23, 2022, starting at 11 p.m.: the word “nazis” appeared as trending on our monitoring devices, but not in the trends list returned by the official API. Further research showed that tweets containing the word “nazis” were thriving on Twitter in Belgium.

We also discovered that Google Autocomplete suggested “insider” when a Belgian user typed “donbass” into the search bar, a suggestion that led to a pro-Kremlin media outlet. Although Google Search has no official API, our open-sourced scraper was able to collect this valuable data.

Revealing YouTube’s Recommendations

Our first investigations involved comparing data from the official YouTube API with the information gathered by our Raspberry Pis, on which the scrapers run. These devices were deployed in seven Belgian provinces to emulate local searches and recommendations.

Although the disparities between the two datasets were slight, they were not insignificant. Both data sources confirmed the growing prominence of CGTN Français in recommended content following the RT ban.

All these small discrepancies underline the fact that there may be a gap between the official results provided by the API and the actual results users see on their screens. In other words, users in different geographic areas may be experiencing different types of content, even when they are searching for the same topics.
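
For readers who want to try this kind of comparison themselves, here is a minimal sketch in Python. It is not the code from our scrapers: the function names, the location labels, and the sample data are hypothetical placeholders, and it assumes you already have one list of results per monitoring location plus one list returned by an official API.

```python
def normalize(items):
    """Lower-case and trim each entry so that 'Topic 1' and 'topic 1 ' match."""
    return {item.strip().casefold() for item in items}


def compare_with_api(observed_by_location, api_results):
    """For each location, list what appears only on screen or only in the API.

    observed_by_location: dict mapping a location label to the list of
        results a scraper saw there (hypothetical input format).
    api_results: list of results returned by the official API.
    """
    official = normalize(api_results)
    report = {}
    for location, observed in observed_by_location.items():
        seen = normalize(observed)
        report[location] = {
            "only_on_screen": sorted(seen - official),
            "only_in_api": sorted(official - seen),
        }
    return report


# Placeholder data for illustration only, not real measurements.
observed_by_location = {
    "province_a": ["Topic 1", "Topic 2", "Topic 3"],
    "province_b": ["topic 1", "Topic 2"],
}
api_results = ["Topic 1", "Topic 2", "Topic 4"]

report = compare_with_api(observed_by_location, api_results)
for location, diff in report.items():
    print(location, diff)
```

Anything listed under "only_on_screen" is a candidate for closer inspection: it is what a user in that location would have seen but the API did not report. Exact string matching after normalization is a simplifying assumption; real data may need fuzzier matching.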

The Questions that Arise

Our findings raise critical questions about the transparency and fairness of content dissemination. Whenever we confronted platforms with the differences between what users actually see and the API results, we were left without a satisfying response (no, Google, a blog post explaining how the autocomplete feature works is not an acceptable answer).

By making our scrapers open-sourced, we’re hoping to empower other researchers, technologists, and curious minds in general, to further investigate these intriguing behaviors. It’s time for us all to take a more significant role in understanding the digital landscape that shapes our perception and informs our decisions.

Open source isn’t just about free access to technology – it’s also about transparency, accountability, collaboration, and making the world a better place through shared knowledge and resources. Happy investigating!