After the uncritical, hyperbolic stories last week, here comes the fable: The innocent child inconveniently points at Microsoft’s Bing AI demo as if it were a naked emperor.
The lesson: ChatGPT, Bing AI, Bard and Copilot aren’t ready for prime time. Don’t trust them to do research, to write your reports, and especially not to develop software.
Some of the mistakes they’re making are absolute howlers. In this week’s Secure Software Blogwatch, we write one word after another.
Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: Why does AI lie?
Beep, boop; hope, hype
This … is Samantha Murphy Kelly (with Clare Duffy) — “Bing AI demo called out for several errors”:
“Comes with risks”
Microsoft’s public demo last week of an AI-powered revamp of Bing appears to have included several factual errors. … The demo included a pros and cons list for products, such as vacuum cleaners; an itinerary for a trip to Mexico City; and the ability to quickly compare corporate earnings results.
…
[Bing] failed to differentiate between the types of vacuums and even made up information about certain products. … It also missed relevant details (or fabricated certain information) for the bars it referenced in Mexico City [and] it inaccurately stated the operating margin for the retailer Gap, and compared it to a set of Lululemon results that were not factually correct.
…
When [we] asked, “What were Meta’s fourth quarter results?” the Bing AI feature … listed bullet points appearing to state Meta’s results. But the bullet points were incorrect. [And when we] asked Bing, “What are the pros and cons of the best baby cribs?” In its reply … Bing stated information that appeared to be attributed to [a Healthline] article that was, in fact, not actually there.
…
The discovery of Bing’s apparent mistakes comes just days after Google was called out for an error made in its public demo last week of a similar AI-powered tool. … A growing number of tech companies are racing to deploy similar technology in their products [which] comes with risks — especially for search engines, which are intended to surface accurate results.
Let’s check in to the Jacob Roach motel — “Bing is becoming an unhinged AI nightmare”:
“Might not be ready for primetime”
Microsoft’s ChatGPT-powered Bing is at a fever pitch right now, but you might want to hold off on your excitement. The first public debut has shown responses that are inaccurate, incomprehensible, and sometimes downright scary.
…
It’s no secret that ChatGPT can screw up responses. But it’s clear now that the recent version debuted in Bing might not be ready for primetime.
Tell me more about this demo FAIL? Dmitri Brereton obliges — “Bing AI Can't Be Trusted”:
“Definitely not ready for launch”
Bing AI got some answers completely wrong during their demo. But no one noticed.
…
The “Bissell Pet Hair Eraser Handheld Vacuum” sounds pretty bad: Limited suction power, a short cord, and it’s noisy enough to scare pets? Geez, how is this thing even a best seller? Oh wait, this is all completely made up information. … I hope Bing AI enjoys being sued for libel.
…
Bing AI manages to take a simple financial document, and make all the numbers wrong. … Bing AI is incapable of extracting accurate numbers from a document, and confidently makes up information. … This is by far the worst mistake made during the demo. It’s also the most unexpected: … Summarizing a document [should] be trivial for AI at this point.
…
I am shocked that the Bing team created this pre-recorded demo filled with inaccurate information, and confidently presented it to the world as if it were good. … It is definitely not ready for launch, and should not be used by anyone who wants an accurate model of reality.
What’s going on here? Stephen Wolfram explains — “What Is ChatGPT Doing, and Why Does It Work?”:
“What should the next word be?”
That ChatGPT can automatically generate something that reads even superficially like human-written text is remarkable, and unexpected. … The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”
…
The big idea is to make a model that lets us estimate the probabilities with which sequences should occur—even though we’ve never explicitly seen those sequences in the corpus of text we’ve looked at. And at the core of ChatGPT is precisely a so-called “large language model” (LLM) that’s been built to do a good job of estimating those probabilities.
…
What it’s essentially doing is just asking over and over again “given the text so far, what should the next word be?”—and each time adding a word [or] part of a word. [And] if sometimes (at random) we pick lower-ranked words, we get a “more interesting” [response]. And, in keeping with the idea of voodoo, there’s a particular so-called “temperature” parameter that determines how often lower-ranked words will be used. … (There’s no “theory” being used here; it’s just … what’s been found to work in practice. … It is basically an art. Sometimes … one can see at least a glimmer of a “scientific explanation” for something that’s being done. But mostly things have been discovered by trial and error.)
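Wolfram’s point is easier to see in code. Here’s a minimal, hypothetical sketch of temperature-scaled next-word sampling: a toy distribution over four invented words, nothing like the real model, which computes these scores with a giant neural network.

```python
import math
import random

# Toy next-word scores ("logits"). In a real LLM these come from a neural
# network conditioned on all of the text so far; these four are invented.
logits = {"the": 2.1, "a": 1.3, "vacuum": 0.4, "emperor": -0.5}

def sample_next_word(logits, temperature=0.8):
    """Sample the next word from temperature-scaled softmax probabilities."""
    # Low temperature -> sharper distribution (almost always the top word);
    # high temperature -> flatter distribution ("more interesting" picks).
    scaled = [score / temperature for score in logits.values()]
    total = sum(math.exp(s) for s in scaled)
    probs = [math.exp(s) / total for s in scaled]
    return random.choices(list(logits), weights=probs, k=1)[0]

# Generation is just this, repeated: append the sampled word, then ask
# "given the text so far, what should the next word be?" again.
text = ["bing", "says"]
for _ in range(5):
    text.append(sample_next_word(logits))
print(" ".join(text))
```

Nothing in that loop checks whether the continuation is true; it only checks whether it’s likely.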
Or, to put it more succinctly, swillden practices pedagoguery by rote and rhyme:
Repeat after me: LLMs are text prediction engines. Nothing more. … If facts matter for what you're doing, you have to verify every alleged fact in the output.
…
There once was an LLM named Fred,
Whose text was full of errors, it's said.
He didn't understand
Concepts or their grand
Relations, which led to his downfall instead.
Is that it? Well, it’s a question of scale, says GistNoesis:
This internal state is quite big (“dim of the features” times “number of layers” times “current length of text”), and represents an expanded view of the relevant context of the conversation. That includes both low-level features, like what the previous word is, and higher-level features, like the tone and mood of the conversation and the direction the conversation is aiming at, so that you can predict what the next word should be.
…
This allows for “chain of thought” reasoning. This sequence of internal states can maybe be seen as a form of proto stream of consciousness, but that’s a controversial opinion.
…
To keep anthropomorphising: it’s like you have plenty of independent chat sessions every day, where you’re free to bounce thoughts around as they spring into your mind. During the night, some external entity (often other models that have been trained to distinguish good from bad) evaluates which conversations were good and which were bad, and makes you dream: replaying the conversations you had during the day and updating your world-model weights. The next day, you wake up with a tendency to produce better conversations, because of yesterday’s conversations.
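To put rough numbers on the “quite big” internal state at the top of that comment, here’s a hypothetical back-of-envelope calculation. The figures are GPT-3-scale guesses; Microsoft and OpenAI haven’t published the Bing model’s actual configuration.

```python
# Back-of-envelope size of the internal state described above:
# (feature dimension) x (number of layers) x (current length of text).
# These numbers are illustrative guesses at roughly GPT-3 scale, not
# anything confirmed for the Bing model.
feature_dim = 12_288     # width of each layer's activation vector
num_layers = 96          # transformer layers
context_tokens = 4_096   # word-pieces of conversation so far

values = feature_dim * num_layers * context_tokens
gib_fp16 = values * 2 / 2**30   # two bytes per value at half precision

print(f"{values:,} values, about {gib_fp16:.0f} GiB at 16-bit precision")
```

And the nightly “dreaming” in that comment is the feedback-driven finetuning stage, which the next comment unpacks.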
Sounds like dreaming. Something-something electric sheep? u/MyNatureIsMe has more:
There are technically at least two stages to this:
the base training, which was probably GPT-3.5. At that point, all it wants is to reply with the most likely thing according to the training corpus;
the finetuning according to some value model, which was itself trained to classify chains as "good" or "bad" based on human feedback.
This finetuning would give the language model at least some sense of what’s “good” or “bad,” but it’s clearly a very rough approximation. I suspect the issue might partially be that the examples the value model was trained on weren’t long enough, so it has a hard time “getting out of a fixed mindset” in longer conversations? That’s complete speculation, though.
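As a cartoon of those two stages, here’s a schematic sketch with made-up replies and scores. It is not how OpenAI’s actual pipeline works (real RLHF trains a reward model on human preferences and then updates the base model’s weights against it); picking the best-scored candidate is only the crudest stand-in for that idea.

```python
def base_model(prompt):
    """Stage 1 stand-in: candidate continuations ranked only by likelihood."""
    return ["made-up vacuum specs", "a correct, sourced answer", "a rude rant"]

def value_model(reply):
    """Stage 2's judge: trained on human feedback to score 'good' vs 'bad'.
    The scores here are invented for illustration."""
    scores = {
        "a correct, sourced answer": 0.9,
        "made-up vacuum specs": 0.4,
        "a rude rant": 0.1,
    }
    return scores[reply]

def finetuned_reply(prompt):
    # Finetuning nudges the model toward replies the value model scores highly.
    return max(base_model(prompt), key=value_model)

print(finetuned_reply("What are the pros and cons of this vacuum?"))
```

Note that the value model only approximates “good”: it can reward confident, plausible-sounding answers just as easily as accurate ones.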
In fact, it turns out that Google’s Bard didn’t make a mistake after all. Here’s Rebecca:
The same is true of the Bard JWST fact, which articles have been presenting as completely and utterly invented, but in fact the NASA website says that JWST took the first "direct" images of an exoplanet — [Bard] just omitted the word "direct."
Meanwhile, @BenedictEvans snarks up a storm:
Does anyone seriously still think that BingGPT is going to disrupt search? … Maybe it’s easier to see the vision if you put on a HoloLens.
And Finally:
Why would AI “lie”? Alignment!
You have been reading Secure Software Blogwatch by Richi Jennings. Richi curates the best bloggy bits, finest forums, and weirdest websites … so you don’t have to. Hate mail may be directed to @RiCHi or ssbw@richi.uk. Ask your doctor before reading. Your mileage may vary. Past performance is no guarantee of future results. Do not stare into laser with remaining eye. E&OE. 30.
Image sauce: Eric Krull (via Unsplash; leveled and cropped)