Which Twitter Fact-Checking Bot Actually Works? A Data-Driven Answer
- Matthew Northey
- 4 days ago
- 2 min read

In the wild west of X, where a single tweet can spark a global firestorm, fact-checking bots have become digital sheriffs, only most of them are asleep on the job. AI Seer has published a forensic audit of two leading fact-checking bots, ArAIstotle and Perplexity, testing them on 45 real tweets from verified powerhouses like @hubermanlab, @cnni, @ElonClipsX, and @TheChiefNerd. The verdict? ArAIstotle didn’t just win. It dominated.
AI Seer selected six high-engagement accounts known for science, news, and bold claims. For each of the 45 tweets drawn from those accounts, spanning text, images, and video clips, the tester posted a single command: “Fact-check this.” Three bots were summoned: ArAIstotle, Perplexity, and Grok. But Grok? A no-show. Zero responses across all 45 tests. Whether down, rate-limited, or simply not listening, xAI’s bot was effectively disqualified before the race began. That left ArAIstotle and Perplexity in a head-to-head.
Both responded to 38 out of 45 tweets (84.4% coverage), with misses mostly on promotional or non-factual posts. So on reliability, a tie. Quality is where the two split.
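For readers who want to run this kind of audit themselves, here is a minimal sketch of how the test matrix and coverage tally could be organized. The handles, tweet count, coverage figures, and the “Fact-check this.” prompt come from the article; the structure and every name below are assumptions, not AI Seer’s actual harness.

```python
# Minimal sketch of a test matrix for this kind of audit.
# Handles, tweet count, and prompt come from the article;
# everything else (structure, names) is hypothetical.

PROMPT = "Fact-check this."
BOTS = ["ArAIstotle", "Perplexity", "Grok"]

# 45 tweets drawn from six high-engagement accounts; four of the six are named:
ACCOUNTS = ["@hubermanlab", "@cnni", "@ElonClipsX", "@TheChiefNerd"]

def coverage(answered: list[bool]) -> float:
    """Fraction of the test tweets a bot replied to."""
    return sum(answered) / len(answered)

# Reported outcomes: 38/45 answered by ArAIstotle and Perplexity, 0/45 by Grok.
print(f"{coverage([True] * 38 + [False] * 7):.1%}")  # -> 84.4%
print(f"{coverage([False] * 45):.1%}")               # -> 0.0%
```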
The tester built a blind evaluation system using three top-tier LLMs: Gemini 2.5 Pro, Claude 3.7 Sonnet, and GPT-4o, each scoring every bot response on a 1–10 scale across five critical dimensions (a sketch of one possible rubric prompt follows the list):
Claim Coverage & Segmentation – Did the bot break the tweet into individual, testable claims?
Specificity of Verification – Were facts backed by dates, numbers, records?
Truth Assessment Explicitness – Did it clearly say true, false, misleading, or uncertain?
Depth of Explanation – Was the reasoning logical, contextual, and complete?
Completeness of Response – Did it finish the job, or leave claims dangling?
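AI Seer has not published its judging prompts, so the snippet below is only a plausible sketch of what a blinded, rubric-based judge prompt could look like, built from the five dimensions and the 1–10 scale described above; the wording, function name, and output format are assumptions.

```python
# Plausible sketch of a blinded judge prompt (not AI Seer's actual prompt).
# The five dimensions and the 1-10 scale come from the article; the prompt
# wording, function name, and JSON output format are assumptions.

RUBRIC = [
    "Claim Coverage & Segmentation",
    "Specificity of Verification",
    "Truth Assessment Explicitness",
    "Depth of Explanation",
    "Completeness of Response",
]

def build_judge_prompt(tweet: str, response_a: str, response_b: str) -> str:
    """Hide bot identities behind 'Bot A' / 'Bot B' and ask for 1-10 scores."""
    dims = "\n".join(f"- {d} (score 1-10)" for d in RUBRIC)
    return (
        "You are scoring two anonymous fact-checking responses to a tweet.\n"
        f"Tweet: {tweet}\n\n"
        f"Bot A response:\n{response_a}\n\n"
        f"Bot B response:\n{response_b}\n\n"
        "Score each bot on every dimension from 1 (worst) to 10 (best):\n"
        f"{dims}\n"
        'Return JSON of the form {"Bot A": {...}, "Bot B": {...}}.'
    )
```

In a setup like this, the same blinded prompt would be sent to each of the three judge models, with reverse image search results and video transcripts appended for multimedia tweets, as described next.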
To level the playing field, bot names were hidden (“Bot A” and “Bot B”), and evaluators were fed reverse image search results and video transcripts for multimedia tweets. The numbers don’t lie. Across 180 blinded evaluations (38 responses × 2 bots × 3 judges × 1 average per dimension), ArAIstotle scored 8.55/10 while Perplexity scored 5.91/10. That’s a 44.7% quality advantage, and the gap was consistent across all three judges.
| Dimension | ArAIstotle | Perplexity | Gap |
| --- | --- | --- | --- |
| Truth Assessment | 9.35 | 5.72 | +3.63 |
| Explanation Depth | 8.08 | 5.33 | +2.74 |
| Response Completeness | 8.94 | 6.45 | +2.49 |
| Claim Identification | 8.53 | 6.35 | +2.18 |
| Verification Specificity | 7.70 | 6.10 | +1.60 |
| **Overall Average** | **8.55** | **5.91** | **+2.64** |
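As a quick sanity check, the headline gap and the 44.7% figure follow directly from the two overall averages in the table; the snippet below simply redoes that arithmetic and adds nothing new.

```python
# Recomputing the headline figures from the reported overall averages.
araistotle_avg, perplexity_avg = 8.55, 5.91

gap = araistotle_avg - perplexity_avg          # absolute gap on the 10-point scale
advantage = gap / perplexity_avg               # relative quality advantage

print(f"Gap: {gap:+.2f}")                      # +2.64
print(f"Quality advantage: {advantage:.1%}")   # 44.7%
```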
ArAIstotle’s secret? Surgical precision. Take one Elon Musk clip claiming reusable rockets could slash spaceflight costs by 100×. Perplexity offered a polite nod: “Partially reusable systems have reduced costs by up to 70%.” Useful, but vague: no verdict, no breakdown. ArAIstotle? It isolated six claims, declared two false, and countered with data: current tech suggests 10–20× savings at best. It didn’t just inform, it adjudicated.
ArAIstotle led in every single dimension, with near-perfect scores in truth clarity (9.35) and completeness (8.94). Perplexity’s best showing? A 6.45 in completeness, still roughly 2.5 points behind. The conclusion is a wake-up call: response rate isn’t enough. In the fight against misinformation, depth and courage matter more than showing up. ArAIstotle doesn’t just fact-check; it dissects, verifies, and rules. That’s the future of truth.
For more information visit:



