I was just listening to a fascinating programme on ChatGPT on BBC Radio 4 – Word of Mouth: Chatbots. Really interesting conversation between Michael Rosen (multi-award-winning children’s author, poet, performer, broadcaster and scriptwriter) and “Emily M Bender, Professor of computational linguistics at the University of Washington and co-author of the infamous paper ‘On the Dangers of Stochastic Parrots’”. The segment I am listening to right now is on inbuilt bias in AI. Given Jeri’s concerns on Elizabeth’s podcast about Google Bard recommending Reddit as a source, I was startled to hear from Professor Bender that ChatGPT was trained by looking at sites linked by Reddit users (because that was the easiest way to get a lot of data where real humans link to a diverse range of topics). The whole episode is worth listening to, but that fact jumped out at me. I wonder whether that changes anyone’s view of the output of ChatGPT?
[Note that a quick websearch suggests that GPT-2 was largely trained on sites suggested on Reddit, but GPT-3 had significant additions to that training data (including the English-language version of Wikipedia). The training sources for GPT-4 have not been revealed as far as I can tell. I haven’t found a suitably reputable source to link for this information, but have seen it on a number of different websites – if anyone has a good source for this, do post below. I did find this article from the Washington Post about Google’s AI dataset, which didn’t fill me with confidence about that either…]
Professor Bender also says: “I like to talk about the output [of AI chatbots] as ‘non-information’. It is not truth and not lie… but even ‘non-information’ pollutes our information ecosystem.”