What is wrong with YouTube captions?
It’s 2025. Captions can be automatically generated with near-perfect accuracy. Entire platforms rely on having adequate captions to serve both deaf people and hearing people who just don’t want to watch with sound on. And it’s natural for captions to struggle with complex or uncommon words, but YouTube’s captions are so bad. It makes me feel insane. Because not only are they terrible, they’re terrible in a very specific way that no other platform struggles with. It is a unique kind of terrible, and it baffles me. I decided to chronicle all the errors in an example video, which happened to have pertinent content.
Here’s that list:
0:07 - “litter” captioned as “L”.
0:31 - “Infinitely” or “intimately”1 captioned as “inally”.
0:38 - “what it” captioned as “what he”.2
1:30 - “r/mildlyinfuriating” captioned as “r/m infuriating”.
1:31 - “Just watched” captioned as “just watch”.
1:58 - “speculative” captioned as “speculated”.3
2:40 - “as” captioned as “has”.
2:57 - Captioning omitted for stuttering.4
3:08 - “post seemed” captioned as “PO seem”.
3:20 - “miss” captioned as “missed”.
3:29 - “feed” captioned as “feet”.
3:47 - “90’s” captioned as “9s”.
4:47 - “TEMU” captioned as “Teemu”.5
5:10 - “Leipzig” captioned as “Li Zig”.
5:19 - “DuckDuckGo” captioned as “duck ducko” (after Google and Bing were correctly transcribed).
6:42 - “no-tech-skills” captioned as “not Tech skills”.
7:29 - An extra “hello” is added.
9:47 - “404” captioned as “44”.
10:02 - “high-quality” captioned as “highquality”.6
10:32 - “anti-AI” captioned as “anti- aai”.
11:17 - “non-AI” captioned as “non aai”.
11:57 - “In a push to” captioned as “and it push to”.
11:59 - “monopolize” captioned as “Monopoly ize”.
12:53 - “than” captioned as “and”.
14:59 - “AI [bleep]cking” captioned as “a aiing”.
15:59 - The word “dom” is omitted.7
16:38 - “Buncha” captioned as “but of”.
16:42 - “I wanna” captioned as “I would to”.
17:53 - “a cute” captioned as “aute”.
18:29 - “I deeply” captioned as “I L”.
20:12 - A missing word (likely “fucking”) captioned as “INF”.
21:03 - “scam” captioned as “scab”.
21:42, 21:44, 22:07 - “Ian McShane” captioned alternately as “Ian mcshain” and “Ian McShan”.8
22:59 - “Trailer Two” captioned as “trailer to”.
23:08 - “Michael Giacchino” captioned as “Michael Gino”.
23:35 - Mouth “sound effects” are omitted.
24:11 - “What” is omitted.
24:28 - “movies” is duplicated.
25:22 - “more [bleep]cky side” captioned as “Mory side”.
26:06 - “shadowbanned” captioned as “Shadow band”.
26:45 - “[bleep]ckin’” captioned as “fing”.
26:47 - “tryna”/“trying” captioned as “try”.
28:13 - “said it” captioned as “seaded”.
28:57 - “Aww” captioned as “a”.
Miscellaneous: At several points, captions are delayed; this is particularly noticeable around 24:00. Numerous random words are also capitalized.
I will grant that at 46 errors in a 5,500-word video, this is numerically above the standard bar of “99% accuracy” that human captioners strive for. That being said, the 1% in human captioning is like the 1% in my hearing - some things just aren’t clearly said, or the captioner straight up doesn’t know the word or name being said. This is completely acceptable! I would contend, though, that the lack of punctuation (perfectly attainable, as many of Google’s competitors have proven) causes an immediate drop in accessibility, making it far from comparable to true, beautiful, 99% accurate captions.9
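For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes every logged error costs exactly one word and that the 5,500-word figure is a rough count; the function names are made up for illustration, and missing punctuation and random capitalization are not counted at all.

```python
def caption_accuracy(errors: int, total_words: int) -> float:
    """Fraction of words captioned correctly, counting each error as one word."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1 - errors / total_words


def max_errors_allowed(total_words: int, target_percent: int = 99) -> int:
    """Largest error count that still meets the target accuracy (integer math)."""
    return total_words * (100 - target_percent) // 100


# Figures from this post: 46 logged errors in roughly 5,500 words.
accuracy = caption_accuracy(46, 5500)
print(f"accuracy: {accuracy:.2%}")                          # accuracy: 99.16%
print(f"errors allowed at 99%: {max_errors_allowed(5500)}")  # errors allowed at 99%: 55
```

So by a raw word count the video clears the 99% bar with nine errors to spare, which is exactly why the raw number undersells how the captions actually read.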
What gets me the most is the strangeness of it. On a purely anecdotal level, I remember thinking circa 2018 that YouTube captions were better than my own hearing, and being wowed by their precision. They’ve gotten worse in the past few years, and in an extremely intentional and bizarre way. This video didn’t even include a major issue, which is that numbers are often captioned with an extra factor of 10 (for example, “40,000” becomes “400,000” or “440,000”), which can throw off the entire purpose and context of the sentence. Or my personal least favorite, “th000”. The errors don’t even contextually make sense. “No” and “not” are not pronounced the same. You can’t just throw letters wherever you want and have the word still mean the same thing and be pronounced the same way.
You get it? It’s not that they’re errors, it’s that there’s some sort of bizarre filter being put over the captions that makes them erroneous in specific and disastrous ways. Some of these errors absolutely erode the meaning of the entire sentence! Some of them are completely nonsensical, and not anything that a human has said, ever. It’s tempting to chalk it up to “AI slop”, because that seems to be Google’s main product now, but it’s a more insidious, evil sort of slop. An appropriately trained AI would not be making these mistakes. We know this because, again, YouTube is the only platform with this problem.
There is something going on in the processing of these captions that is either deliberately inaccessible or so incompetent as to be indistinguishable from malice. And I feel insane, because nobody is acknowledging it. This is not normal, and this is not how auto-captions work. There is no universe where they should be simply making up words that no one has ever said! There is no universe where they should process a perfectly understandable word and output incomprehensible crap. There is no universe where the same word, said by the same person in the same context, should be captioned in different ways.
It is ludicrous that the bar for accessibility is so low within one of the largest video hosts in the world that, after removing the ability for community members to do it themselves, they are also now worsening the bare minimum of accessibility that’s been provided to this point. I cannot imagine nobody internally has said anything about this. If no one has, then shame on everyone internally who has refused time and time again to do the bare minimum for deaf people.10 How dare you. Truly. Gemini’s Voice Mode is burning the environment to be a little bit better at identifying when someone is asking about Sephora lip gloss, and our access to videos is being obfuscated by an intentionally crappier version of what are likely the same models. [Errata June 2025: I was wrong to imply that Gemini is singularly bad for the environment. See here.] YouTube ought to be at the cutting edge of this technology. I should be blown away by how no one even needs to caption things anymore, because voice-to-text is perfectly capable of it. Why does the one thing Google is peddling right now, language models and voice-to-text, take a backseat the moment it comes to not being evil? Oh, right.
1. I use these captions because I’m hard of hearing. When assisted by captions, I can understand about 99% of what I hear. This word clearly falls within the 1%, but would be trivial for any other captioning technology to at least make a guess.
2. This speaker has a slight accent and does not always enunciate perfectly. Again, this should be trivial for modern captioning technology, unless it is exclusively trained on American accents (which may well be the case).
3. Again, speaker error to a degree. Google is at the forefront of language processing models, though, which should in theory understand context.
4. Stuttering is properly captioned in several other parts of the video.
5. I’m counting brand names for exceptionally well-known brands that can absolutely be hardcoded into the captions. Earlier in the video, “IMDb” was captioned properly, and earlier in this sentence, “AliExpress” was captioned as “Ali Express”, so it’s reasonable to think the captions may have some brand names hardcoded.
6. Yes, this is absolutely an error. No one writes it this way.
7. Presumably because “dom” is in some contexts a sexual term. In this context, it’s a name, and was properly captioned a few seconds prior.
8. I wouldn’t consider a slightly-uncommon name an error, but there’s no reason for the spelling to be inconsistent like this.
9. Look at the difference there! There are a couple of minor errors and typos, but the average person, deaf or hearing, would not have to expend any additional brain processing power to keep up with most of the video, whereas with YouTube’s captions you constantly have to be figuring out where a sentence starts and ends with no grammar, and even more so with absolute nonsense peppered in at random.
10. I’m aware I’m yelling into the void here, but really - we can’t even enable swearing client-side. It’s up to individual creators to decide whether deaf people are allowed to read the word “fuck”, when they’re saying it in the video for hearing people to enjoy. What the fuck, man? We’re not children.