Technologies

Ask AI Why It Sucks at Sudoku. You’ll Find Out Something Troubling About Chatbots

How much can you trust a generative AI tool if it can’t explain itself honestly or accurately?

Chatbots are genuinely impressive when you watch them do things they’re good at, like writing a basic email or creating weird futuristic-looking images. But ask generative AI to solve one of those puzzles in the back of a newspaper, and things can quickly go off the rails.

That’s what researchers at the University of Colorado Boulder found when they challenged large language models to solve Sudoku. And not even the standard 9×9 puzzles. An easier 6×6 puzzle was often beyond the capabilities of an LLM without outside help (in this case, specific puzzle-solving tools).

A more important finding came when the models were asked to show their work. For the most part, they couldn’t. Sometimes they lied. Sometimes they explained things in ways that made no sense. Sometimes they hallucinated and started talking about the weather.

If gen AI tools can’t explain their decisions accurately or transparently, that should cause us to be cautious as we give these things more control over our lives and decisions, said Ashutosh Trivedi, a computer science professor at the University of Colorado at Boulder and one of the authors of the paper published in July in the Findings of the Association for Computational Linguistics.

«We would really like those explanations to be transparent and be reflective of why AI made that decision, and not AI trying to manipulate the human by providing an explanation that a human might like,» Trivedi said.

When you make a decision, you can try to justify it, or at least explain how you arrived at it. An AI model may not be able to accurately or transparently do the same. Would you trust it?

Why LLMs struggle with Sudoku

We’ve seen AI models fail at basic games and puzzles before. OpenAI’s ChatGPT (among others) has been totally crushed at chess by the computer opponent in a 1979 Atari game. A recent research paper from Apple found that models can struggle with other puzzles, like the Tower of Hanoi.

It has to do with the way LLMs work and fill in gaps in information. These models try to complete those gaps based on what happens in similar cases in their training data or other things they’ve seen in the past. With a Sudoku, the question is one of logic. The AI might try to fill each gap in order, based on what seems like a reasonable answer, but to solve it properly, it instead has to look at the entire picture and find a logical order that changes from puzzle to puzzle.

Read more: AI Essentials: 29 Ways You Can Make Gen AI Work for You, According to Our Experts

Chatbots are bad at chess for a similar reason. They find logical next moves but don’t necessarily think three, four, or five moves ahead — the fundamental skill needed to play chess well. Chatbots also sometimes tend to move chess pieces in ways that don’t really follow the rules or put pieces in meaningless jeopardy.

You might expect LLMs to be able to solve Sudoku because they’re computers and the puzzle consists of numbers, but the puzzles themselves are not really mathematical; they’re symbolic. «Sudoku is famous for being a puzzle with numbers that could be done with anything that is not numbers,» said Fabio Somenzi, a professor at CU and one of the research paper’s authors.

I used a sample prompt from the researchers’ paper and gave it to ChatGPT. The tool showed its work, and repeatedly told me it had the answer before showing a puzzle that didn’t work, then going back and correcting it. It was like the bot was turning in a presentation that kept getting last-second edits: This is the final answer. No, actually, never mind, this is the final answer. It got the answer eventually, through trial and error. But trial and error isn’t a practical way for a person to solve a Sudoku in the newspaper. That’s way too much erasing and ruins the fun.

AI struggles to show its work

The Colorado researchers didn’t just want to see if the bots could solve puzzles. They asked for explanations of how the bots worked through them. Things did not go well.

Testing OpenAI’s o1-preview reasoning model, the researchers saw that the explanations — even for correctly solved puzzles — didn’t accurately explain or justify their moves and got basic terms wrong.

«One thing they’re good at is providing explanations that seem reasonable,» said Maria Pacheco, an assistant professor of computer science at CU. «They align to humans, so they learn to speak like we like it, but whether they’re faithful to what the actual steps need to be to solve the thing is where we’re struggling a little bit.»

Sometimes, the explanations were completely irrelevant. Since the paper’s work was finished, the researchers have continued to test new models released. Somenzi said that when he and Trivedi were running OpenAI’s o4 reasoning model through the same tests, at one point, it seemed to give up entirely.

«The next question that we asked, the answer was the weather forecast for Denver,» he said.

(Disclosure: Ziff Davis, CNET’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

Explaining yourself is an important skill

When you solve a puzzle, you’re almost certainly able to walk someone else through your thinking. The fact that these LLMs failed so spectacularly at that basic job isn’t a trivial problem. With AI companies constantly talking about «AI agents» that can take actions on your behalf, being able to explain yourself is essential.

Consider the types of jobs being given to AI now, or planned for in the near future: driving, doing taxes, deciding business strategies and translating important documents. Imagine what would happen if you, a person, did one of those things and something went wrong.

«When humans have to put their face in front of their decisions, they better be able to explain what led to that decision,» Somenzi said.

It isn’t just a matter of getting a reasonable-sounding answer. It needs to be accurate. One day, an AI’s explanation of itself might have to hold up in court, but how can its testimony be taken seriously if it’s known to lie? You wouldn’t trust a person who failed to explain themselves, and you also wouldn’t trust someone you found was saying what you wanted to hear instead of the truth.

«Having an explanation is very close to manipulation if it is done for the wrong reason,» Trivedi said. «We have to be very careful with respect to the transparency of these explanations.»

Technologies

Google races to put Gemini at the center of Android before Apple’s AI reboot

Google is using its latest Android rollout to position Gemini as the AI layer across phones, Chrome, laptops and cars.

Google is using its latest Android rollout to make Gemini less of a chatbot and more of an operating layer across the phone, browser, car and laptop, just weeks before Apple is expected to show its own Gemini-powered Apple Intelligence reboot at WWDC.
Ahead of its Google I/O developer conference next week, the company previewed a number of Android updates, including AI-powered app automation, a smarter version of Chrome on Android, new tools for creators, a redesigned Android Auto experience, and a sweeping set of new security features.
Alphabet is counting on Gemini to help Google compete directly with OpenAI and Anthropic in the market for artificial intelligence models and services, while also serving as the AI backbone across its expansive portfolio of products, including Android. Meanwhile, Gemini is powering part of Apple’s new AI strategy, giving Google a role in the iPhone maker’s reset even as it races to prove its own version of personal AI on the phone is further along.
Sameer Samat, who oversees Google’s Android ecosystem, told CNBC that Google is rebuilding parts of Android around Gemini Intelligence to help users complete everyday tasks more easily.
“We’re transitioning from an operating system to an intelligence system,” he said.
As part of Tuesday’s announcements. Google said Gemini Intelligence will be able to move across apps, understand what’s on the screen and complete tasks that would normally require a user to jump between multiple services. That means Android is moving beyond the traditional assistant model, where users ask a question and get an answer, and acting more like an agent.
For instance, Google says Gemini can pull relevant information from Gmail, build shopping carts and book reservations. Samat gave the example of asking Gemini to look at the guest list for a barbecue, build a menu, add ingredients to an Instacart list and return for approval before checkout.
A big concern surrounding agentic AI involves software taking action on a user’s behalf without permissions. Samat said Gemini will come back to the user before completing a transaction, adding, “the human is always in the loop.”
Four months after announcing its Gemini deal with Google, Apple is under pressure to show a more capable version of Apple Intelligence, which has been a relative laggard on the market. Apple has long framed privacy, hardware integration and control of the user experience as its advantages.
Google’s Android push is designed to show it can bring AI deeper into the device experience while still giving users control over what Gemini can see, where it can act and when it needs confirmation.
The app automation features will roll out in waves, starting with the latest Samsung Galaxy and Google Pixel phones this summer, before expanding across more Android devices, including watches, cars, glasses and laptops later this year.
The company is also redesigning Android Auto around Gemini, turning the car into another major surface for its assistant. Android Auto is in more than 250 million cars, and Google says the new release includes its biggest maps update in a decade and Gemini-powered help with tasks like ordering dinner while driving.
Alphabet’s AI strategy has been embraced by Wall Street, which has pushed the company’s stock price up more than 140% in the past year, compared to Apple’s roughly 40% gain. Investors now want to see how Gemini can become more central to the products people use every day.
WATCH: Alphabet briefly tops Nvidia after report of $200 billion Anthropic cloud deal

Technologies

Waymo recalls 3,800 robotaxis after glitch allowed some vehicles to ‘drive into standing water’

Waymo issued a voluntary recall of about 3,800 of its robotaxis to fix software issues that could allow them to drive into flooded roadways.

Waymo is recalling about 3,800 robotaxis in the U.S. to fix software issues that could allow them to “drive onto a flooded roadway,” according to a letter on the National Highway Traffic Safety Administration’s website.
The voluntary recall is for Waymo vehicles that use the company’s fifth and sixth generation automated driving systems (or ADS), the U.S. auto safety regulator said in the letter posted Tuesday.
Waymo autonomous vehicles in Austin, Texas, were seen on camera driving onto a flooded street and stalling, requiring other drivers to navigate around them. It’s the latest example of a safety-related issue for the Alphabet-owned AV unit that’s rapidly bolstering its fleet of vehicles and entering new U.S. markets.
Waymo has drawn criticism for its vehicles failing to yield to school buses in Austin, and for the performance of its vehicles during widespread power outages in San Francisco in December, when robotaxis halted in traffic, causing gridlock.
The company said in a statement on Tuesday that it’s “identified an area of improvement regarding untraversable flooded lanes specific to higher-speed roadways,” and opted to file a “voluntary software recall” with the NHTSA.
“Waymo provides over half a million trips every week in some of the most challenging driving environments across the U.S., and safety is our primary priority,” the company said.
Waymo added that it’s working on “additional software safeguards” and has put “mitigations” in place, limiting where its robotaxis operate during extreme weather, so that they avoid “areas where flash flooding might occur” in periods of intense rain.
WATCH: Waymo launches new autonomous system in Chinese-made vehicle

Technologies

Qualcomm tumbles 13% as semiconductor stocks retreat from historic AI-fueled surge

Semiconductor equities reversed sharply after a broad AI-driven advance, with Qualcomm suffering its worst day since 2020 amid inflation concerns and rising oil prices.

Semiconductor stocks fell sharply on Tuesday, reversing course after an extensive rally that had expanded the artificial intelligence investment theme well past Nvidia and driven the industry to unprecedented levels.

Qualcomm plunged 13% and was on track for its steepest single-day decline since 2020. Intel shed 8%, while On Semiconductor and Skyworks Solutions each lost more than 6%. The iShares Semiconductor ETF, which benchmarks the overall sector, fell 5%.

The sell-off came after a key gauge of consumer prices came in above forecasts, and as conflict in Iran pushed crude oil higher—prompting investors to shift away from riskier assets.

The preceding advance had widened the AI opportunity set beyond longtime industry leader Nvidia, which for much of the past several years had largely carried the market to new peaks on its own.

Explosive appetite for central processing units, along with the graphics processing units that power large language models, has sent chipmakers to all-time highs.

Market participants are wagering that the shift from AI model training to autonomous agents will lift demand for additional AI hardware. Among the beneficiaries are memory chip producers, which are raising prices as supply remains tight.

Micron Technology slid 6%, and Sandisk cratered 8%. Sandisk’s stock has surged more than six times over since January.

Technologies3 года ago

Tech Companies Need to Be Held Accountable for Security, Experts Say

Technologies3 года ago

Best Handheld Game Console in 2023

Technologies5 лет ago

Black Friday 2021: The best deals on TVs, headphones, kitchenware, and more

Technologies3 года ago

Tighten Up Your VR Game With the Best Head Straps for Quest 2

Technologies5 лет ago

Google to require vaccinations as Silicon Valley rethinks return-to-office policies

Technologies5 лет ago

Verum, Wickr and Threema: next generation secured messengers

Technologies4 года ago

The number of Сrypto Bank customers increased by 10% in five days

Technologies5 лет ago

Olivia Harlan Dekker for Verum Messenger

Verum World

Ask AI Why It Sucks at Sudoku. You’ll Find Out Something Troubling About Chatbots

Technologies

Ask AI Why It Sucks at Sudoku. You’ll Find Out Something Troubling About Chatbots

Why LLMs struggle with Sudoku

AI struggles to show its work

Explaining yourself is an important skill

Technologies

Google races to put Gemini at the center of Android before Apple’s AI reboot

Technologies

Waymo recalls 3,800 robotaxis after glitch allowed some vehicles to ‘drive into standing water’

Technologies

Qualcomm tumbles 13% as semiconductor stocks retreat from historic AI-fueled surge

Trending

Verum World

Ask AI Why It Sucks at Sudoku. You’ll Find Out Something Troubling About Chatbots

Why LLMs struggle with Sudoku

AI struggles to show its work

Explaining yourself is an important skill

You may like

Technologies

Google races to put Gemini at the center of Android before Apple’s AI reboot

Technologies

Waymo recalls 3,800 robotaxis after glitch allowed some vehicles to ‘drive into standing water’

Technologies

Qualcomm tumbles 13% as semiconductor stocks retreat from historic AI-fueled surge

Trending