Whether large language models are the greatest scam of our time or the next industrial revolution, they will prove rather unhelpful unless they have access to your own content. At the same time, few risks to information security are as great as having classified content leak and become available through a public large language model. And the providers of those models cannot be trusted to respect their promises about how they use the information given to them, as the court cases over training data make clear. As such, I will present some suggestions on how to best use content written in DoX CMS with local language models instead.
I have previously discussed the possibilities that generative AI presents here. Personally, I remain skeptical about the profitability and long-term scalability of these systems. However, I accept that they can have use cases that provide actual value as part of people’s workflows.
Local language models
The large language models that most people know, such as ChatGPT and Claude, are cloud services that run on their providers’ servers. This means that content given to them for processing is sent to those servers, and it is hard to verify whether it remains available for further training of the models. Even though publicly available documentation benefits everyone involved to a degree, not all content is intended to be available to everyone. Usually, you also want to retain control over the distribution and versioning of the content that the public can view.
Public language models are thus usually not a suitable aid during production. The lack of trust towards them is hard to bridge with just promises when the same companies fight actively to remove all restrictions on training their models.
The solution here is local language models, which do not send the content that they process elsewhere.
There are various implementations for them and I will not discuss all the options. Here, I will focus on general principles and use the GPT4ALL interface and related base models for my examples. Such models are available from the Mistral and Falcon model families, for example. These are not the only available options, though. You can also look for suitable solutions in the Hugging Face model database.
Setup
The platforms for local models, such as GPT4ALL or Ollama, have their own installation files. Personally, I also encountered an issue which prevented their use even after they had been installed. On a Windows-based device, I solved this by installing the latest version of Microsoft’s C++ Redistributable library.
After the platform has been installed, at least GPT4ALL also requires you to install the models that you will use and to specify the local file locations to index. You must also add more accepted file types in the program’s settings. The selected models should also be given baseline prompts suited to their use cases.
File types
In the settings for the GPT4ALL software, the first field under LocalDocs specifies the accepted file types. You must separate the extensions for these file types with commas. The use cases that I suggest below require you to add XML, DITA, DITAMAP, CSS, MOD, DTD, and XLSX as accepted file types.
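For reference, the value of that field could then look something like the list below. The exact set of default extensions varies between versions, so treat this only as a sketch.

```
docx,pdf,txt,md,rst,xml,dita,ditamap,css,mod,dtd,xlsx
```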
Versioning
One issue that language models face is getting stuck on outdated practices. This mainly concerns the additional training they receive from your own content. At least GPT4ALL allows you to select which file sets to embed at any given time. This lets you separate earlier versions into their own sets to embed. You can then use those sets only when you need recommendations on how to formulate content based on those files.
My practical recommendation is that you use different local folders for different versions. You can save such folders on a network drive that requires no login to keep them available for all users that need them. They must then each be designated as their own content sets for embedding. Afterwards, you can combine these sets based on your needs at the time.
I recommend that you also version the baseline prompts given to these models. This lets you iterate on them and separate them by task without fear of losing other versions. You can, for example, save these prompts as TXT files outside your platform of choice and then copy and paste them into the model when needed.
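As a minimal sketch of such an arrangement, with hypothetical folder and file names:

```
\\shared-drive\llm-content\
  manual-2023\          (embedded as its own file set)
  manual-2024\          (embedded as its own file set)
  manual-draft\         (current working copies)
  prompts\
    review-prompt-v1.txt
    style-prompt-v2.txt
```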
Baseline prompts
Baseline prompts are instructions added to each prompt sent to the model. General instructions for them are available here, for example. Models like Mistral also provide their own guidelines for prompting, including baseline prompts written for such models.
A baseline prompt suited for reviewing content produced through DoX CMS requires at least definitions for the different DoX CMS components (variable, tag, element class, and so on). This allows you to use these concepts in your requests to the model when you want it to process raw content. Here are the definitions that my current prompt includes. You can improve on them and adjust them to your needs.
# Definitions
- variable: inline content that replaces placeholders marked with double curly braces
- tag: identifier and XML attribute (data-doxattribute-) used to filter content in publications (not to be confused with XML tags)
- element class: XML attribute (doxelementclass) used to add exceptions to general styles to those elements or their child elements
- topic tree: same function as DITAMAP files, a list of hierarchically organized topics from which you filter out unrelated parts for a publication
- localized image: BLK files that add the correct image for each language to the <img> elements whose Href values link to them
The baseline prompt should also include the following details; a skeleton example follows the list:
- the role assigned to the language model,
- intended use cases,
- restrictions on allowed use,
- instructions for how to resolve uncertainty,
- instructions for formatting responses, and
- examples of prompts and how to answer them.
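As a starting point, a skeleton along these lines covers the items above. The wording is only an illustration and should be adapted to the prompting guidelines of the selected model.

```
# Role
You are an assistant for reviewing technical documentation written in DoX CMS.

# Use cases
Review topics for clarity and consistency and suggest improvements to wording.

# Restrictions
Do not invent product features. Do not alter variables, tags, or conref values.

# Uncertainty
If a request is ambiguous or context is missing, say so and ask for the missing
detail instead of guessing.

# Response format
Answer as a numbered list of at most 10 items, each under 300 characters.

# Definitions
(the definitions listed above)

# Example
Prompt: "Review this topic for readability."
Answer: a numbered list of concrete issues, each with a suggested rewording.
```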
The qualifiers you provide must be as quantitative as possible and refer to variables that are as unambiguous and easy to process as possible. For example, text length should be defined as a character or token count rather than a word count, because these values are less ambiguous to a language model. Based on the average word length of the language in use, you can approximate the character count for a given word count; in English, a word averages roughly five to six characters including the following space, so a 200-word limit corresponds to roughly 1,000 to 1,200 characters. Similarly, ‘better’ and descriptions like it require explicit conditions, such as restricting the required reading comprehension to a specified level on the PIAAC standard.
Limitations
Local language models are more restricted than services maintained on specialized servers. The available models are usually simpler so that they remain usable on consumer hardware. At the same time, the computers available to employees are often considerably weaker than would be ideal for use with language models. If a user’s usual duties do not include 3D modeling or video production, their work computer is unlikely to have a high-end GPU. Laptops also often lack the memory capacity for the best results.
In a pinch, you can overcome this restriction by maintaining a local server to which your internal users can send their requests. This option is not cheap. However, for an environment with multiple users, it is considerably cheaper than upgrading each user’s personal hardware to support local language model use. I will not detail how to establish a local server here, though.
Language models are also structurally unpredictable. The instructions in the baseline prompt are not treated as absolute. The results depend on all the factors already discussed: the selected model, the provided baseline prompt, and sufficient hardware. They are, of course, also affected by the content made available to the model and by users’ prompts. This kind of behavior can result in an expensive QA spiral, and nothing ensures that you will emerge on the other side with better results. Like I said, these language models are structurally unpredictable.
This makes it crucial to approach the issue systematically, one potential cause at a time. At the same time, you must be ready to accept the possibility that you cannot wrestle the model into obedience. This is not an issue that you can resolve with just more carefully formulated prompts. More detailed prompts require more processing power, but simply adding better hardware to process them does not ensure the intended results, either. It is a vicious cycle that can be hard to exit. Whether achieving satisfactory results is even possible remains constantly uncertain, and all the sunk costs bolster the temptation to keep trying. However, there are times when one must accept that language models are not capable of miracles, or sometimes even of useful aid with basic tasks.
Adapting DoX CMS
I would not write about this subject without having something to say about how DoX CMS can support making the content written in it available to these language models. Currently, we have no features that specifically support language model use, but proper application of our existing features can help with it.
Metadata
The prolog field in the text editor lets you add data elements to your content. Language models benefit from such metadata, which adds context about the significance of each topic. Here are some examples of metadata details that can help language models find the right content when prompted:
- Year: The year in which the content was written.
- Role: A brief description of the content type, such as identifying it as instructions.
- Keywords: Identifiers for the subject matter, selected from a list.
- Description: Open-ended descriptions to help the language model and other users.
You can also embed open-ended descriptions in individual elements with inline comments. Similarly, using descriptive ID values for elements can make them easier for a language model to process. The sketch below illustrates both the prolog metadata and these element-level details.
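This is what the result could look like in the raw content; the element structure follows general DITA conventions and the values are made up, so adjust both to your own setup.

```xml
<topic id="pump-maintenance-intervals">
  <title>Maintenance intervals</title>
  <prolog>
    <data name="Year" value="2024"/>
    <data name="Role" value="maintenance instructions"/>
    <data name="Keywords" value="pump, maintenance, interval"/>
    <data name="Description" value="Service intervals for the standard pump models."/>
  </prolog>
  <body>
    <!-- Applies only to pumps sold after 2020 -->
    <p id="interval-note-new-models">...</p>
  </body>
</topic>
```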
Content types
DoX CMS allows you to export more than actual publications. These other forms of content can be used as context for the feedback that a language model provides, for example. You can also consult the model for feedback on how to formulate your raw content.
Naturally, actual publications are the clearest example of content to be provided to a language model. They already pull together all the disparate components, which makes it less likely that the language model encounters parts with gaps that it cannot process.
PDF
PDF remains the primary delivery format for publications, and language models are quite capable of digesting these files. They are not suited for everything, though, since styling information is not directly available in them as CSS.
PDF inputs give language models access to the contents of actual publications in their final form (assuming that you use this delivery format). This file format’s easy digestibility can also help language models process content that is being reviewed before delivery or that has already been approved.
DITA
DITA as a delivery format is the closest to the raw content. This may help language models associate parts of a delivery with the corresponding parts of the raw content. If such inputs allowed the model to associate positions marked with conref values with their source content, for example, that would greatly help during content reviews. By default, these parts of the raw content are processed simply as empty elements with conref values, which leaves gaps in the raw content that the language model processes. In my experience, models seem unable to treat parts with conref values as having the same content as the related source elements, even when those source elements are part of what is being processed.
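To illustrate the gap, a referencing topic typically contains only an empty element that points at the source, while the text itself lives elsewhere; the file and ID values here are made up.

```xml
<!-- In the topic being reviewed: an empty element with only a pointer -->
<p conref="shared-warnings.xml#shared-warnings/high-voltage"/>

<!-- In the source topic: the actual content -->
<p id="high-voltage">Disconnect the power supply before opening the cover.</p>
```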
The clear advantage of this publication format is how it includes DTD and DITAMAP files. At minimum, I recommend that you add the DTD file from DoX CMS into the file set available to a local language model. This should make the model less likely to recommend changes that this format does not allow, such as adding a figure element inside a simpletable element.
Another benefit of DITA publications is that their content remains split across distinct files. This allows you to take these publications apart and separate the files into multiple file sets. These separated file sets can be used for subject-specific queries with no need to process the contents of the whole publication. This eases some of the demand for more powerful hardware.
HTML
Besides being its own publication format, HTML is also the base from which DoX CMS compiles PDF publications, for example. You can thus use it when you want help with details such as styling PDF publications. These publications include both the correct identifiers and your current style sheets.
Ideally, this would also result in the language model learning the parallels between raw content and HTML publications. I would not hold my breath for this outcome, though. Instead, HTML can also prove useful simply because it is the standard for online content. The majority of the training data used for language models is scraped from the web, which should make HTML plenty palatable to them.
Raw content
DoX CMS also has several ways for you to download raw content in the format used by the text editor. The Editor menu has a command that downloads all selected topics as one XML file. Additionally, the preview files for translations are essentially the same format, and they allow you to directly select all content related to a publication. This content lets you request feedback from a language model while you still work on it.
Below, I will present a third method for making this content available to a language model with the help of our WebDAV implementation. This option is more complex. However, it does keep the pieces of content separate, which lets you divide them as needed. This option also lets the content be handled as part of the workflow.
Spreadsheets
Remember that you can download almost any list in DoX CMS as an XLSX spreadsheet. The way that these files combine different data points can provide language models with useful context for processing other content. Here are some examples that hint at how to use these spreadsheets:
- Editor: This spreadsheet contains ID values of topics, their titles, and their descriptions, for example.
- Topic tree: This spreadsheet contains an unfiltered topic tree that acts as a template for publications.
- Element classes: This spreadsheet contains the names and descriptions of element classes.
- Publication: This spreadsheet contains the topics for the actual publication together with the same information as the Editor menu.
You must download the spreadsheets for publications in a less than intuitive way. To do so, select two revisions of a publication and then use the command to compare them. The menu that this opens has buttons for downloading the spreadsheet for each revision.
WebDAV
Content that is reserved to you in DoX CMS can also be opened locally from a network drive. This feature was originally intended for use with other text editors such as oXygen XML and FrameMaker. You can also use it to make content available to a local language model, though. The benefits of this alternative are that it keeps content divided into different files, and that you can control their availability as an integrated part of the workflow.
Our manual contains full details on how to use WebDAV. I will not share that information in public to avoid issues with information security. In essence, though, it involves the use of a user profile for DoX CMS to map a network drive. This network drive shows topic trees and topics in different languages when they are reserved to that user.
I recommend that you copy the contents that you retrieve like this to a local folder or to another network drive that is available without a login. The language model can then use these copies, and you can also define versions for them. Unfortunately, at least GPT4ALL cannot maintain a connection to a network drive mapped like this; instead, it must reprocess all these files when you restart your computer.
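On Windows, a copy along these lines can be run or scheduled to keep such a folder up to date. This assumes that the WebDAV share is mapped as drive X:; both paths are placeholders.

```
robocopy "X:\topics" "\\shared-drive\llm-content\manual-draft" /MIR
```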
Workflow
When you integrate language models with the workflow feature, you can add a new workflow state that allows content to be reserved to (only) an AI user. This then makes that content available in that user’s WebDAV folder to be picked up.
A workflow like this requires a new role, a user account dedicated to this, and a workflow state made for it.
Only we can manage roles, so you would need to request that we add a role like this. In practice, the role is a means to mark the AI user and to limit its permissions to having content reserved to it.
User profiles like this count towards your license, of course. It would be a cheaper license with limited permissions, and we can discuss its pricing. In practice, this involves using a placeholder email address for the user profile and having your admin users set its password manually. These login details are only used for the network drive, so that different users can retrieve the content there.
The only thing left is to add a new workflow state. My recommendation is to set the user role used with the AI model as the only allowed role in it. This makes it impossible to reserve content in this workflow state to other users. You must also place this state between the other states in the workflow based on the available transitions. The best option for this depends on how you will use the state. Nothing prevents you from having multiple applications for language models at the same time; if you do, you can add more than one of these states.
If you will use the model to review content, add this state after the default state. I recommend that you use it before the stage for actual reviews because a language model can provide a basic review at best. It can never, for example, point out product features that have not yet been documented. In this case, this state would become an available detour between the default state and the main review state.
If you instead want to provide the model with approved raw content as an example of proper formatting, you should add this state between the review state and the state for approved content. Alternatively, you can allow content to be reserved to the AI user when it is approved, or replace the default state for approved content with a state like this. The sketch below illustrates both placements.
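Put together, the two placements could look something like this; the state names are only examples:

```
For reviews:   Draft ──► AI review ──► Draft ──► Review ──► Approved
For examples:  Draft ──► Review ──► Approved ──► Reserved to AI user
```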
Summary
Local language models are a means to avoid the information security risks associated with AI services. You can install them with the help of platforms such as GPT4ALL and Ollama. Afterwards, these platforms allow you to select which model to use. You can then adjust it to fit your needs with carefully formulated baseline prompts and by feeding it your content. Unfortunately, these language models have their limits, which means you cannot force them to always obey your instructions. The processing power and memory of the available devices also limit their usefulness.
You can feed content made with DoX CMS to these language models to make those models more helpful and to review that content with help from them. If you do this, you should enrich your content with metadata that makes it easier for language models to process. Different publishing formats each have their benefits when used with a language model, such as the DTD files made available by DITA publications. You can also export both raw content and spreadsheets of different lists from the system.
A more comprehensive solution for integrating language models into your workflow requires the use of WebDAV. It requires you to add a new role, a user profile, and one or two workflow states for use with the language model. This allows you to reserve content to that user as part of the workflow and to retrieve it from the WebDAV-based network drive for the language model to use. I recommend that you copy these contents onto a different network drive that does not require a login. That way, any user can retrieve different versions and subject-specific parts based on their needs.