Training AI on Social Media: What Could Go Wrong?
Unfiltered Training Data Can Cause Safety Issues, Spread Misinformation

LinkedIn this week joined its peers in using social media posts as training data for artificial intelligence models, raising concerns about trustworthiness and safety.
AI companies rely heavily on publicly available data. As that data runs out, social media content offers an alternative that is vast, free and conveniently accessible. That makes it cost-effective and efficient to use, but it carries serious caveats: safety issues and platforms that are breeding grounds for misinformation. LinkedIn users can opt out of having their personal data used to train the platform's AI model.
Companies that tap into social media data find diverse, real-world language data that can help LLMs understand current trends and colloquial expressions, said Stephen Kowski, field CTO at AI-powered security company SlashNext. Social media provides insights into human communication patterns that may not be available in more formal sources, he told Information Security Media Group.
LinkedIn is not the only company to use customer social media data. Social media giant Meta and X, formerly Twitter, have trained their AI models on user data. As with LinkedIn, users must manually opt out of having their data scraped rather than being asked for permission in advance. Others, such as Reddit, have instead licensed their data for a fee.
The question for AI developers is not whether companies use the data, or even whether it is fair to do so - it is whether the data is reliable.
The quality of training data is crucial for AI model performance. High-quality, diverse data leads to more accurate and reliable outputs, while biased or low-quality data can result in flawed predictions and perpetuate misinformation. Companies must employ advanced AI-driven content filtering and verification systems to ensure the quality and reliability of the data used, Kowski said.
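Kowski does not describe a specific pipeline, but a minimal sketch of the kind of pre-training filter he alludes to might look like the following. The thresholds, the toxicity_score stub and the spam markers are illustrative assumptions, not any vendor's actual system.

```python
import hashlib
import re

# Illustrative thresholds -- real pipelines tune these per platform and per model.
MIN_WORDS = 8
MAX_URL_RATIO = 0.3
MAX_TOXICITY = 0.5

seen_hashes = set()  # for deduplication of exact repeats

def toxicity_score(text: str) -> float:
    """Stub standing in for a real moderation classifier.
    Here it only flags a few obvious spam markers so the example runs."""
    spam_markers = ("buy now", "click here", "free crypto")
    return 1.0 if any(m in text.lower() for m in spam_markers) else 0.0

def keep_for_training(post: str) -> bool:
    words = post.split()
    if len(words) < MIN_WORDS:                           # too short to carry signal
        return False
    urls = re.findall(r"https?://\S+", post)
    if urls and len(urls) / len(words) > MAX_URL_RATIO:  # link-farm style content
        return False
    if toxicity_score(post) > MAX_TOXICITY:              # harmful or spammy content
        return False
    digest = hashlib.sha256(post.strip().lower().encode()).hexdigest()
    if digest in seen_hashes:                            # exact duplicate already kept
        return False
    seen_hashes.add(digest)
    return True

posts = [
    "Buy now!!! free crypto at https://example.test",
    "Our team migrated the billing service to a message queue and cut p99 latency by 40%.",
    "Our team migrated the billing service to a message queue and cut p99 latency by 40%.",
]
corpus = [p for p in posts if keep_for_training(p)]
print(corpus)  # only the first copy of the substantive post survives
```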
The harm of using low-quality social media data to train AI models is that it can perpetuate the biases people express in their posts, absorb slang and jargon, and amplify misinformation and harmful content.
Social media data quality varies across platforms. LinkedIn has relatively higher-quality data due to its professional focus and user verification processes. Reddit can provide diverse perspectives but requires more rigorous content filtering. "Effective use of any platform's data demands advanced AI-driven content analysis to identify reliable information and filter out potential misinformation or low-quality content," Kowski said.
Researchers and companies are developing ways to mitigate the misinformation that AI internalizes when trained on social media data. One such method is watermarking AI-generated content so users can tell where information came from, though the approach is not foolproof; a toy sketch of one such scheme follows this paragraph. Companies training AI models can also identify harmful behaviors and instruct the LLMs to avoid them, but that does not scale. For now, the only guardrails in place are ones that companies have voluntarily adopted and ones that governments have suggested.
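The article does not specify a watermarking scheme. One published approach partitions the vocabulary into "green" and "red" lists seeded by the preceding token and biases generation toward green tokens, so a detector can later count how many green tokens appear. The toy sketch below illustrates that idea on a tiny word list; the vocabulary, bias strength and ratios are assumptions for illustration, not a production detector.

```python
import hashlib
import random

# Toy vocabulary and parameters -- assumed values for illustration only.
VOCAB = ["the", "model", "data", "social", "media", "post", "train", "safe",
         "bias", "filter", "trend", "user", "check", "score", "risk", "news"]
GREEN_FRACTION = 0.5   # half the vocabulary is "green" at each step
GREEN_BIAS = 0.9       # probability of picking a green word while generating

def green_list(prev_word: str) -> set:
    """Deterministically split the vocabulary using a hash of the previous word,
    so a detector can rebuild the same split without access to the model."""
    seed = int(hashlib.sha256(prev_word.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(VOCAB) * GREEN_FRACTION)])

def generate(length: int, rng: random.Random) -> list:
    """Toy 'model': picks words at random but prefers green words --
    that preference is the watermark."""
    words = ["the"]
    for _ in range(length):
        green = green_list(words[-1])
        if rng.random() < GREEN_BIAS:
            pool = list(green)
        else:
            pool = [w for w in VOCAB if w not in green]
        words.append(rng.choice(pool))
    return words

def green_ratio(words: list) -> float:
    """Detector: fraction of words that fall in the green list of their predecessor."""
    hits = sum(1 for prev, cur in zip(words, words[1:]) if cur in green_list(prev))
    return hits / max(len(words) - 1, 1)

rng = random.Random(0)
watermarked = generate(60, rng)
unmarked = ["the"] + [rng.choice(VOCAB) for _ in range(60)]
print(f"watermarked green ratio: {green_ratio(watermarked):.2f}")  # well above 0.5
print(f"unmarked green ratio:    {green_ratio(unmarked):.2f}")     # near 0.5
```

As the article notes, schemes like this are not foolproof: paraphrasing or editing the text dilutes the statistical signal the detector relies on.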