What is Chat scraping when it provides answers to font licensing questions?

Nick Shinn · December 2025

Because this is the norm that our clients and customers recognize.

John Hudson · December 2025

If only it were simply ‘scraping’ some content and regurgitating it. General AI is an algorithmically influenced probability engine. So what it is doing is combining probabilistic analysis of content on which it has been trained with its programmers’ agenda to keep people using it by providing the kind of results they seem to like. This is why general AI is unreliable, makes stuff up, can’t cite its sources (or invents sources), and is often incapable of answering even quite simple questions accurately and honestly.

Dave Crossland · December 2025

John Hudson said:

general AI is unreliable, makes stuff up, can’t cite its sources (or invents sources)

You seem to be behind the times. Reliability is improving, as well as providing citations, with stuff like https://en.wikipedia.org/wiki/Retrieval-augmented_generation

But to Nick's original question, it is scraping all the text that it is trained on, and my personal opinion and understanding is that OpenAI has been obtaining ~all publicly available texts.

John Hudson · December 2025

Reliability is improving, as well as providing citations

In the meantime, though, we have had three years’ worth of slop flooding the Web, all of which is now available to AI as ‘sources’. Simple information that used to be easily findable, e.g. how long to cook a particular vegetable for in an air fryer and at what temperature, is now buried under a mountain of AI-generated ad-driven untested recipe sites, all with contradictory and inaccurate directions. You only need to look at the unasked-for AI summaries in Google searches to see that the results are ingesting the upstream slop.

Maybe the big breakthrough in AI will be the ability to identify and ignore its own results.

Peter Constable · December 2025

There isn't one answer to Nick's question: it depends on the nature of the AI application customers are interacting with.

If you use an AI chat application that is a thin wrapper around a large language model, such as ChatGPT (at least, early versions), then that is nothing more than a probabilistic text-prediction engine. If you feed it, say, "Do most font licenses permit embedding in PDF?", then it will answer with a high-probability continuation of that text, with probabilities based on data it was trained on. The algorithms involved are very sophisticated and complex, but it's a fair approximation to say it's just fancy text prediction. Potential words to continue that text include "Yes", "No", "font", "John", "yellow", "influenza", and any other word, each with a different conditional probability. The chat application will pick one with a high probability, given the preceding text, then continue to select the next word based on different probabilities given the now-extended text, and so on. A response like "Font licenses never refer to embedding" is certainly plausible, even quite likely, based on text prediction.
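To make the word-by-word selection described above concrete, here is a minimal Python sketch. The probability table is invented purely for illustration, and it conditions only on the previous word, whereas a real LLM conditions on the entire preceding text; the names `TOY_MODEL`, `pick_next`, and `generate` are hypothetical.

```python
import random

# Invented probabilities for illustration only; a real LLM computes these
# from the whole preceding text, not from a hand-written lookup table.
TOY_MODEL = {
    "<prompt>": {"Yes": 0.55, "No": 0.30, "Font": 0.14, "yellow": 0.01},
    "Yes": {"<end>": 1.0},
    "No": {"<end>": 1.0},
    "yellow": {"<end>": 1.0},
    "Font": {"licenses": 1.0},
    "licenses": {"never": 0.4, "usually": 0.6},
    "never": {"permit": 1.0},
    "usually": {"permit": 1.0},
    "permit": {"embedding": 1.0},
    "embedding": {"<end>": 1.0},
}


def pick_next(previous, rng):
    """Sample one continuation, weighted by its conditional probability."""
    candidates = TOY_MODEL[previous]
    words = list(candidates)
    weights = [candidates[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(rng, max_words=10):
    """Extend the text one word at a time until an end marker appears."""
    words, previous = [], "<prompt>"
    for _ in range(max_words):
        nxt = pick_next(previous, rng)
        if nxt == "<end>":
            break
        words.append(nxt)
        previous = nxt
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print(generate(rng))
```

Run repeatedly, this toy produces "Yes" most often but will sometimes produce "Font licenses never permit embedding": a fluent sentence chosen because it is probable under the table, not because anything checked it against an actual license.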

If you approach such an app expecting anything more than a sensible English sentence of the kind a fluent English speaker with minimal knowledge of font licenses might produce, then one of the following will be true:

* You will consider its response to be a hallucination.
* You had wrong expectations of the app.

But not all AI apps are like that.

Prior to (say) five years ago, Web search engines were complex information retrieval (IR) systems that would return links to Web content, with various criteria they might use to determine how to rank results. Today, search engines that have integrated AI likely are not using AI to invent results. They likely are:

* using AI to interpret natural language queries (e.g., mapping a natural language sentence like "What temperature should I use to cook carrots in my air fryer?" to a query like "carrots +'air fryer' +temperature")
* using AI to provide a natural language summary response

The engine will likely use IR algorithms very similar to those it used five years ago to find results from the web, and provide source references to those results in the natural language summary.
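The shape of that pipeline (interpret the question, retrieve with conventional IR, then summarize with source references) can be sketched in Python as follows. This is a sketch under loud assumptions: both "AI" steps are replaced by trivial rule-based stand-ins, the three-page `INDEX` and its example.com URLs are invented placeholders, and nothing here reflects how any real search engine is implemented.

```python
import re

STOPWORDS = {"what", "should", "i", "use", "to", "in", "my", "the", "a", "how"}

# Invented stand-in for a web index; URLs and texts are placeholders.
INDEX = [
    {"url": "https://example.com/air-fryer-carrots",
     "text": "Air fryer carrots: temperature and timing guide"},
    {"url": "https://example.com/beef-stew",
     "text": "Slow cooker beef stew recipe"},
    {"url": "https://example.com/stovetop-carrots",
     "text": "How to cook carrots on the stove"},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def interpret_query(question):
    """Stand-in for the AI step that maps a question to search terms."""
    return [word for word in tokenize(question) if word not in STOPWORDS]


def retrieve(terms, top_k=2):
    """Conventional IR step: rank pages by how many query terms they contain."""
    scored = []
    for doc in INDEX:
        score = len(set(terms) & set(tokenize(doc["text"])))
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def summarize(question, docs):
    """Stand-in for the AI summary step: cite retrieved pages, invent nothing."""
    lines = [f"Results for: {question}"]
    for number, doc in enumerate(docs, start=1):
        lines.append(f"[{number}] {doc['text']} ({doc['url']})")
    return "\n".join(lines)


if __name__ == "__main__":
    question = "What temperature should I use to cook carrots in my air fryer?"
    print(summarize(question, retrieve(interpret_query(question))))
```

The point of the design is that the answer can only be as good as the retrieved pages: the summary step restates and cites what retrieval found, so the quality of the upstream content still matters.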

Something that's not clear to me is whether they might be incorporating AI somehow in ranking results. Five years ago, the search engine you were using might have used a ranking algorithm that set a large weight based on which results were most often clicked on over the previous six months by your top 20 Facebook friends, or top 20 sales ranking on Amazon, or other such techniques. I expect the IR component of search engines is fairly similar today.

If I go to Google search (not my usual search engine, but following John's scenario) and request a search using "how long to cook carrots in an air fryer", it provides an "ai overview" with source references that include a YouTube video and various sites, such as Air Fryer Carrots - Dash of Sanity.

Something true today that was also true five years ago: when the search site presents results ranked in an order it hopes you'll find relevant and useful (so you'll keep using the site and, hopefully, click on paid-for links or click on ads), you are faced with a choice of deciding what is or isn't useful for your particular need. E.g., are the instructions Dash of Sanity provided for cooking carrots useful?

Something different today than five years ago is that there is a lot more content out there. Some of that is likely AI generated, though I have no idea what proportion of new content that comprises. Lots of people have been getting into the web-info gig: I like cooking; if I make a cooking site, I can make money!

These days, I usually use Copilot chat as my front end to get information for some need, and I've usually found that far more satisfying than using traditional search sites: I find I can usually ask a specific question and get a specific answer that is useful for my purpose, though that's not guaranteed. But Copilot chat is not simply a thin wrapper around an LLM: it is tightly integrated with traditional web search (or, at work, company information), which provides the grounding information used in the results it gives.

Hopefully, your customers are not using a Chat agent that is simply a front end to an LLM without other grounding information.

Nick Shinn · December 2025

I am primarily interested in uncommon licensing requirements, for which there is a small training data set online, and wondering about the possible disproportionate influence of certain large players in monetizing that usage—and the extent to which such industry leaders shape the market.
