šŸ¾ rigby

Have large language model crawlers browsed rigby?

This will be a kind of real-time log of my attempt to find out whether large language models (particularly OpenAI’s GPT series) have consumed rigby’s data in their training. Have we shifted their token vectors’ positions in latent space?

My first attempt was to get GPT-4o to complete this poem by brutux, which is pretty unique. Brutux, if you’re reading this: the setting in my ChatGPT account that lets OpenAI use my prompts as training data is switched ‘OFF’, so hopefully I have not inadvertently given your data away to them. GPT-4o did not complete the poem correctly. I gave it:

finish this poem: Hours go by one by one, with unpleasant sluggishness, on your heart, locked, ...

And GPT-4o said:

Hours go by one by one, with unpleasant sluggishness, On your heart, locked, weighs a heavy stillness. The world outside whispers of life in motion, Yet you, adrift, are caught in an ocean— ...

This completion does not match the original poem, which continues ‘is a press, a stress, for we aren’t done’.
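A rough sketch of how this comparison could be automated, so the same check can be repeated against other models or other poems. It strips the echoed prompt from the model's reply and measures how closely the next few words match the known continuation. The function names and the use of `difflib.SequenceMatcher` are illustrative choices, not anything used in the original experiment.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, straighten curly apostrophes, and collapse whitespace."""
    text = text.lower().replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()


def continuation_similarity(completion: str, prompt: str, expected: str) -> float:
    """Compare what the model wrote after the prompt with the known continuation.

    Returns a ratio in [0, 1]; 1.0 means the text following the prompt starts
    with exactly the expected continuation.
    """
    completion, prompt, expected = map(normalize, (completion, prompt, expected))
    # Drop the echoed prompt if the model repeated it at the start of its reply.
    if completion.startswith(prompt):
        completion = completion[len(prompt):]
    # Only compare as many characters as the expected continuation has.
    candidate = completion.lstrip(" ,")[: len(expected)]
    return SequenceMatcher(None, candidate, expected).ratio()


prompt = "Hours go by one by one, with unpleasant sluggishness, on your heart, locked,"
expected = "is a press, a stress, for we aren't done"
model_output = (
    "Hours go by one by one, with unpleasant sluggishness, "
    "On your heart, locked, weighs a heavy stillness."
)
print(round(continuation_similarity(model_output, prompt, expected), 2))
```

A score near 1.0 would suggest the model reproduced the line verbatim; a low score, as with the GPT-4o reply above, only shows that this particular prompt did not elicit it.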

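My second attempt was to scan my server’s request logs for OpenAI’s crawler. According to their documentation, the OAI-SearchBot user agent includes the string `; OAI-SearchBot/1.0; +https://openai.com/searchbot`. Note that, as far as I can tell, the same documentation lists GPTBot, the crawler that collects training data, as a separate bot with its own user agent string. According to my nginx configuration, the logs for rigby are stored at `/var/log/nginx/rigby.krikorian.ca.log`. Let’s look at that.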
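As a sketch of what matching a crawler in a single log entry could look like: the snippet below pulls the user agent out of one log line and checks it for a crawler name. It assumes the logs use nginx's default "combined" format, where the user agent is the last double-quoted field. The sample line is made up for illustration and is not taken from the real logs.

```python
import re
from typing import Optional

# Crawler names to look for: the two names mentioned in the post.
CRAWLER_TOKENS = ("GPTBot", "OAI-SearchBot")

# Assumption: nginx "combined" log format, where the user agent is the
# last double-quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def crawler_in_line(line: str) -> Optional[str]:
    """Return the first crawler token found in the line's user agent, if any."""
    match = USER_AGENT_RE.search(line)
    if match is None:
        return None
    user_agent = match.group(1).lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in user_agent:
            return token
    return None


# Hypothetical log line, invented for this example.
sample = (
    '203.0.113.7 - - [01/Feb/2025:12:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)"'
)
print(crawler_in_line(sample))
```

Matching on the bare crawler name rather than the full documented string is deliberate here: version numbers in user agents change over time, and a case-insensitive name match is less likely to miss a hit.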

When I `cd` into `/var/log/nginx`, I see, among others, a bunch of compressed log files with the extension `.gz` (gzip, presumably). There are also some `rigby.krikorian.dev.log` files, from when my personal domain was different. I copied all these log files into a temporary `rigby` directory and asked ChatGPT how I could unzip all of them easily. It said `gunzip *.gz`, which worked.

Running `cat * | wc -l`, it seems that I have 2188 log entries in total. Let’s search for that user agent string. Running `cat * | grep OAI` turns up nothing. This seemed surprising, honestly, but when I `cat rigby.krikorian.ca.log.14` (the oldest log file for this domain), I see that the logs only go as far back as Jan 20th, 2025. It turns out that in `/etc/logrotate.d/nginx` log rotation is set to keep 14 days of logs, so I can’t see logs further back than that. Damn it. I’ve now set it to 10000 days. Maybe we’ll catch the crawler in another training run.
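For repeating this search later, here is a minimal sketch that reads both plain and gzip-compressed logs directly, so nothing needs to be copied or unzipped first, and counts lines mentioning each crawler name. The function names and the `rigby.krikorian.*` glob pattern are assumptions based on the file names described above.

```python
import gzip
from collections import Counter
from pathlib import Path

# Crawler names to look for: the two names mentioned in the post.
CRAWLER_TOKENS = ("GPTBot", "OAI-SearchBot")


def open_log(path: Path):
    """Open a plain or gzip-compressed log file as text."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return path.open("r", encoding="utf-8", errors="replace")


def count_crawler_hits(log_dir: str, pattern: str = "rigby.krikorian.*") -> Counter:
    """Count log lines mentioning each crawler token, case-insensitively."""
    hits = Counter()
    for path in sorted(Path(log_dir).glob(pattern)):
        with open_log(path) as handle:
            for line in handle:
                lowered = line.lower()
                for token in CRAWLER_TOKENS:
                    if token.lower() in lowered:
                        hits[token] += 1
    return hits
```

Calling `count_crawler_hits("/var/log/nginx")` would return a `Counter` with one entry per crawler name that appeared. Reading that directory may require elevated permissions, and an empty result only says something about the days the rotated logs still cover.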

Comments

luke on 2025-01-20

I should clarify that I ran `cat * grep OAI -` instead of what the post indicates.

luke on 2025-01-20

wait sorry `cat * | grep OAI -` lol
