Tokens are a big reason today’s generative AI falls short

Generative AI models don’t process text the same way humans do. Understanding their “token”-based internal environments may help explain some of their strange behaviors — and stubborn limitations.

Most models, from small on-device ones like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture known as the transformer. Due to the way transformers conjure up associations between text and other types of data, they can’t take in or output raw text — at least not without a massive amount of compute.

So, for reasons both pragmatic and technical, today’s transformer models work with text that’s been broken down into smaller, bite-sized pieces called tokens — a process known as tokenization.

Tokens can be words, like “fantastic.” Or they can be syllables, like “fan,” “tas” and “tic.” Depending on the tokenizer — the model that does the tokenizing — they might even be individual characters in words (e.g., “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).
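The three granularities above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the fixed three-character chunking is a crude stand-in for the subword pieces that real tokenizers learn from data.

```python
def word_tokens(text: str) -> list[str]:
    """Split on whitespace: one token per word."""
    return text.split()


def chunk_tokens(word: str, size: int = 3) -> list[str]:
    """Cut a word into fixed-size chunks (a stand-in for learned subword pieces)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def char_tokens(word: str) -> list[str]:
    """One token per character."""
    return list(word)


word = "fantastic"
print(word_tokens(word))   # ['fantastic']
print(chunk_tokens(word))  # ['fan', 'tas', 'tic']
print(char_tokens(word))   # ['f', 'a', 'n', 't', 'a', 's', 't', 'i', 'c']
```

The same nine letters become one, three or nine tokens depending on the scheme, which is why token counts vary so much between tokenizers.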

Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.

Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” “ .” Depending on how a model is prompted, with “once upon a” or with “once upon a ”, the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

Tokenizers treat case differently, too. “Hello” isn’t necessarily the same as “HELLO” to a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE,” “LL,” and “O”). That’s why many transformers fail the capital letter test.
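A toy tokenizer makes both the spacing and the capitalization effects easy to reproduce. The sketch below uses greedy longest-match against a tiny vocabulary invented for this example. The vocabulary, the function name and the convention of attaching a leading space to a token are all assumptions; no production tokenizer is being quoted here.

```python
# Hypothetical vocabulary, invented for illustration. Real tokenizers learn
# tens of thousands of entries from training data.
VOCAB = {"once", " upon", " a", " time", " ", "hello", "HE", "LL", "O"}


def tokenize(text: str, vocab: set[str] = VOCAB) -> list[str]:
    """Greedy longest-match: at each position, take the longest vocab entry."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocab entry starts here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("once upon a"))   # ['once', ' upon', ' a']
print(tokenize("once upon a "))  # ['once', ' upon', ' a', ' ']
print(tokenize("hello"))         # ['hello']
print(tokenize("HELLO"))         # ['HE', 'LL', 'O']
```

The two prompts differ by one trailing space, yet the model would receive a different token sequence for each. Likewise, the lowercase word is one token while the uppercase version is three.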

“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”

This “fuzziness” creates even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don’t, and neither do Thai or Khmer.

A 2023 Oxford study found that, because of differences in the way non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language versus the same task phrased in English. The same study — and another — found that users of less “token-efficient” languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.
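The billing effect is simple arithmetic. In the sketch below, the price and the token counts are invented for illustration and are not taken from either study or from any vendor's price list.

```python
def usage_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Cost of a request when a vendor bills per 1,000 tokens."""
    return token_count / 1000 * price_per_1k_tokens


PRICE_PER_1K = 0.01      # hypothetical price, in dollars
english_tokens = 1_000   # hypothetical token count for an English prompt
other_tokens = 2_000     # same prompt in a language that needs twice the tokens

english_cost = usage_cost(english_tokens, PRICE_PER_1K)
other_cost = usage_cost(other_tokens, PRICE_PER_1K)
print(f"English: ${english_cost:.2f}, other language: ${other_cost:.2f}")
```

Because cost is linear in token count, a language that needs twice the tokens for the same text costs twice as much to process, and it also uses up the context window twice as fast.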

Tokenizers often treat each character in logographic systems of writing — systems in which printed symbols represent words without relating to pronunciation, like Chinese — as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages — languages where words are made up of small meaningful word elements called morphemes, such as Turkish — tend to turn each morpheme into a token, increasing overall token counts. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)
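For the Thai example, a quick check in Python shows how many units a character-level or byte-level scheme would have to handle. The snippet only counts Unicode code points and UTF-8 bytes; it does not reproduce any particular model's tokenizer.

```python
# Thai for "hello" (the word quoted above), written as Unicode escapes.
thai_hello = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"
english_hello = "hello"

for label, word in [("Thai", thai_hello), ("English", english_hello)]:
    code_points = len(word)             # one unit per character
    utf8_bytes = len(word.encode("utf-8"))  # one unit per byte
    print(f"{label}: {code_points} characters, {utf8_bytes} UTF-8 bytes")
```

The Thai word has six code points, which matches the six-token figure if each character becomes its own token. In UTF-8 each of those characters takes three bytes, so a byte-level scheme would see 18 units, compared with five for the English word.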

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens than English to capture the same meaning.

Beyond language inequities, tokenization might explain why today’s models are bad at math.

Rarely are digits tokenized consistently. Because they don’t really know what numbers are, tokenizers might treat “380” as one token, but represent “381” as a pair (“38” and “1”) — effectively destroying the relationships between digits and results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926).
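The “380” versus “381” split falls out naturally from a vocabulary that happens to contain one number but not its neighbor. The vocabulary and function name below are hypothetical and exist only to reproduce the example from the text.

```python
# Hypothetical vocabulary: "380" happens to be an entry, "381" does not.
NUMBER_VOCAB = {"380", "38", "0", "1", "3", "8"}


def tokenize_number(text: str, vocab: set[str] = NUMBER_VOCAB) -> list[str]:
    """Greedy longest-match over a digit string, one character as fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Longest vocab entry starting at i, or a single character if none.
        j = next((j for j in range(len(text), i, -1) if text[i:j] in vocab), i + 1)
        tokens.append(text[i:j])
        i = j
    return tokens


print(tokenize_number("380"))  # ['380']
print(tokenize_number("381"))  # ['38', '1']
print(list("381"))             # ['3', '8', '1'], a consistent digit-level split
```

Two adjacent integers end up with unrelated token sequences, while a digit-by-digit split would at least keep place value aligned across numbers.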

That’s also the reason models aren’t great at solving anagram problems or reversing words.
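One way to picture the difficulty: reversing a word is trivial when you can see its letters, but a model sees tokens. The sketch below reuses the “fan,” “tas,” “tic” split from earlier and is only an analogy. It does not claim that models literally reverse token order.

```python
tokens = ["fan", "tas", "tic"]  # the syllable-style split used earlier
word = "".join(tokens)

char_reversed = word[::-1]                  # reverse letter by letter
token_reversed = "".join(reversed(tokens))  # reverse token by token

print(char_reversed)   # citsatnaf
print(token_reversed)  # tictasfan
```

Getting from the token view to the correct answer requires knowing the spelling inside each token, which is information the token IDs do not expose directly.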

So, tokenization clearly presents challenges for generative AI. Can they be solved?

Maybe.

Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” like words with swapped characters, spacing and capitalized characters.

Models like MambaByte are in the early research stages, however.

“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said. “For transformer models in particular, computation scales quadratically with sequence length, and so we really want to use short text representations.”
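Feucht's point about quadratic scaling can be made concrete with a rough count. The sketch below assumes that self-attention compares every position with every other position, and it uses whitespace-separated words as a stand-in for tokens. Both are simplifications, and the function name is made up for this example.

```python
def attention_pairs(sequence_length: int) -> int:
    """Number of position pairs compared when cost grows as length squared."""
    return sequence_length ** 2


text = "Tokens are a big reason today's generative AI falls short"
n_bytes = len(text.encode("utf-8"))  # sequence length for a byte-level model
n_words = len(text.split())          # rough stand-in for a token count

print(f"word-level: {n_words} units, {attention_pairs(n_words)} pairs")
print(f"byte-level: {n_bytes} units, {attention_pairs(n_bytes)} pairs")
print(f"byte-level costs about {attention_pairs(n_bytes) / attention_pairs(n_words):.0f}x more")
```

Because the cost grows with the square of the length, a sequence that is several times longer in bytes becomes far more than several times as expensive, which is why transformers favor short token sequences.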

Barring a tokenization breakthrough, it seems new model architectures will be the key.
