Efficient Data Formats for GPT


TL;DR: Try YAML/TOML or CSV instead of JSON with GPT models. Read the Conclusion for a bit more detail.

Methodology

Large Language Models (LLMs) like GPT use tokens to process and generate text. Tokens are essentially common sequences of text.

Some data formats intrinsically take up more tokens and are more expensive to use with LLM models like ChatGPT. For example, at $0.02+ per 1000 tokens for OpenAI GPT models, each additional token has a very direct cost associated. Current LLMs are also constrained by their available memory, which is usually measured in tokens. Thus, having a more efficient data serialization format is cheaper and allows us to use memory more efficiently.

This post looks at a few popular data formats and compares their cost in tokens.

All comparisons use the same data in different formats. The data used is as follows, taken from json.org and generated using ChatGPT:

servlets:

{
  "web-app": {
    "servlet": [
      {
        "servlet-name": "cofaxCDS",
        "servlet-class": "org.cofax.cds.CDSServlet",
        "init-param": {
          "configGlossary:installationAt": "Philadelphia, PA",
          "configGlossary:adminEmail": "ksm@pobox.com",
          "configGlossary:poweredBy": "Cofax",
          "configGlossary:poweredByIcon": "/images/cofax.gif",
          "configGlossary:staticPath": "/content/static",
          "templateProcessorClass": "org.cofax.WysiwygTemplate",
          "templateLoaderClass": "org.cofax.FilesTemplateLoader",
          "templatePath": "templates",
          "templateOverridePath": "",
          "defaultListTemplate": "listTemplate.htm",
          "defaultFileTemplate": "articleTemplate.htm",
          "useJSP": false,
          "jspListTemplate": "listTemplate.jsp",
          "jspFileTemplate": "articleTemplate.jsp",
          "cachePackageTagsTrack": 200,
          "cachePackageTagsStore": 200,
          "cachePackageTagsRefresh": 60,
          "cacheTemplatesTrack": 100,
          "cacheTemplatesStore": 50,
          "cacheTemplatesRefresh": 15,
          "cachePagesTrack": 200,
          "cachePagesStore": 100,
          "cachePagesRefresh": 10,
          "cachePagesDirtyRead": 10,
          "searchEngineListTemplate": "forSearchEnginesList.htm",
          "searchEngineFileTemplate": "forSearchEngines.htm",
          "searchEngineRobotsDb": "WEB-INF/robots.db",
          "useDataStore": true,
          "dataStoreClass": "org.cofax.SqlDataStore",
          "redirectionClass": "org.cofax.SqlRedirection",
          "dataStoreName": "cofax",
          "dataStoreDriver": "com.microsoft.jdbc.sqlserver.SQLServerDriver",
          "dataStoreUrl": "jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon",
          "dataStoreUser": "sa",
          "dataStorePassword": "dataStoreTestQuery",
          "dataStoreTestQuery": "SET NOCOUNT ON;select test='test';",
          "dataStoreLogFile": "/usr/local/tomcat/logs/datastore.log",
          "dataStoreInitConns": 10,
          "dataStoreMaxConns": 100,
          "dataStoreConnUsageLimit": 100,
          "dataStoreLogLevel": "debug",
          "maxUrlLength": 500
        }
      },
      {
        "servlet-name": "cofaxEmail",
        "servlet-class": "org.cofax.cds.EmailServlet",
        "init-param": {
          "mailHost": "mail1",
          "mailHostOverride": "mail2"
        }
      },
      {
        "servlet-name": "cofaxAdmin",
        "servlet-class": "org.cofax.cds.AdminServlet"
      },
      {
        "servlet-name": "fileServlet",
        "servlet-class": "org.cofax.cds.FileServlet"
      },
      {
        "servlet-name": "cofaxTools",
        "servlet-class": "org.cofax.cms.CofaxToolsServlet",
        "init-param": {
          "templatePath": "toolstemplates/",
          "log": 1,
          "logLocation": "/usr/local/tomcat/logs/CofaxTools.log",
          "logMaxSize": "",
          "dataLog": 1,
          "dataLogLocation": "/usr/local/tomcat/logs/dataLog.log",
          "dataLogMaxSize": "",
          "removePageCache": "/content/admin/remove?cache=pages&id=",
          "removeTemplateCache": "/content/admin/remove?cache=templates&id=",
          "fileTransferFolder": "/usr/local/tomcat/webapps/content/fileTransferFolder",
          "lookInContext": 1,
          "adminGroupID": 4,
          "betaServer": true
        }
      }
    ],
    "servlet-mapping": {
      "cofaxCDS": "/",
      "cofaxEmail": "/cofaxutil/aemail/*",
      "cofaxAdmin": "/admin/*",
      "fileServlet": "/static/*",
      "cofaxTools": "/tools/*"
    },
    "taglib": {
      "taglib-uri": "cofax.tld",
      "taglib-location": "/WEB-INF/tlds/cofax.tld"
    }
  }
}

flat:

[
  {
    "name": "John Doe",
    "occupation": "Software Developer",
    "age": 30
  },
  {
    "name": "Jane Smith",
    "occupation": "Teacher",
    "age": 40
  },
  {
    "name": "Michael Johnson",
    "occupation": "Accountant",
    "age": 45
  },
  {
    "name": "Samantha Lee",
    "occupation": "Graphic Designer",
    "age": 28
  },
  {
    "name": "Robert Williams",
    "occupation": "Marketing Manager",
    "age": 50
  },
  {
    "name": "Emily Davis",
    "occupation": "Journalist",
    "age": 35
  },
  {
    "name": "William Brown",
    "occupation": "Engineer",
    "age": 42
  },
  {
    "name": "Amanda Wilson",
    "occupation": "Sales Manager",
    "age": 37
  },
  {
    "name": "Daniel Martin",
    "occupation": "Doctor",
    "age": 55
  },
  {
    "name": "Megan Anderson",
    "occupation": "Web Developer",
    "age": 29
  },
  {
    "name": "Christopher Garcia",
    "occupation": "Architect",
    "age": 48
  },
  {
    "name": "Stephanie Rodriguez",
    "occupation": "Human Resources Manager",
    "age": 39
  },
  {
    "name": "David Hernandez",
    "occupation": "Real Estate Agent",
    "age": 44
  },
  {
    "name": "Ashley Perez",
    "occupation": "Graphic Designer",
    "age": 27
  },
  {
    "name": "Erica Turner",
    "occupation": "Financial Analyst",
    "age": 33
  },
  {
    "name": "James Cooper",
    "occupation": "Project Manager",
    "age": 41
  },
  {
    "name": "Michelle Taylor",
    "occupation": "Public Relations Specialist",
    "age": 36
  },
  {
    "name": "Steven Parker",
    "occupation": "Lawyer",
    "age": 52
  },
  {
    "name": "Lauren Hall",
    "occupation": "Product Manager",
    "age": 31
  },
  {
    "name": "Brandon Wright",
    "occupation": "IT Manager",
    "age": 47
  }
]

DOM-like:

{
  "title": "My Website",
  "header": {
    "logo": "logo.png",
    "navigation": [
      { "text": "Home", "link": "index.html" },
      { "text": "About", "link": "about.html" },
      { "text": "Contact", "link": "contact.html" }
    ]
  },
  "main": {
    "sections": [
      {
        "title": "Welcome to my website!",
        "content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed euismod euismod est, eu fringilla sapien facilisis id. Proin bibendum vestibulum tortor vel aliquam. Integer ultrices justo vitae nisl dapibus sagittis. Sed vitae velit justo."
      },
      {
        "title": "About me",
        "content": "Curabitur vel ullamcorper nibh. In eget nisl vel ante pulvinar aliquet. Suspendisse vel pharetra purus, eu tincidunt libero. Duis ac sagittis dolor. Nulla facilisi. Nam at ex vitae tellus suscipit congue id id libero. Aliquam lobortis lorem non sapien maximus, vitae bibendum ipsum vehicula. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas."
      },
      {
        "title": "Contact me",
        "content": "Curabitur vel ullamcorper nibh. In eget nisl vel ante pulvinar aliquet. Suspendisse vel pharetra purus, eu tincidunt libero. Duis ac sagittis dolor. Nulla facilisi. Nam at ex vitae tellus suscipit congue id id libero. Aliquam lobortis lorem non sapien maximus, vitae bibendum ipsum vehicula. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas."
      }
    ]
  },
  "footer": {
    "copyright": {
      "text": "© 2023 My Website",
      "link": "index.html"
    },
    "social": [
      { "icon": "facebook.png", "link": "#" },
      { "icon": "twitter.png", "link": "#" },
      { "icon": "instagram.png", "link": "#" }
    ]
  }
}

Token counts are calculated using:

Results

Format Dataset Characters GPT
XML DOM-like 2081 887
JSON DOM-like 1780 745
YAML DOM-like 1576 616
TOML DOM-like 1642 593
CSV DOM-like 1682 598
JSON minified DOM-like 1514 491
XML flat 2238 931
JSON flat 1760 757
YAML flat 1222 341
TOML flat 1457 493
CSV flat 658 181
JSON minified flat 1279 317
XML servlets 5019 2286
JSON servlets 3718 1820
YAML servlets 2964 1213
TOML servlets 2968 1086
CSV servlets 2900 950
JSON minified servlets 2710 835
* Potentially lossy and difficult to generate

Conclusion

Depending on your dataset and what you’re trying to generate, it might be a great idea to try an alternative data serialization format.

In particular, these formats seem to be worth experimenting with:

  • Minified JSON. While regular JSON is quite expensive due to all the included whitespace tokens, minified JSON has an advantage over whitespace-based data formats like YAML due to not wasting tokens on simply spaces or tabs. However, it must be noted that it might be more difficult to force a model to respond in minified JSON, so whether minifying has a significant effect or not depends on which of the two is heavy - the prompt or the LLM output.
  • YAML or TOML. For nested and recursive datasets, YAML and TOML seem to be somewhere between 25 and 50 percent more efficient than the equivalent data in JSON, and in general more efficient accross the board. I would probably pick this as a good starting point as these formats are highly available in most programming languages and seem to have a good efficiency token-wise.
  • CSV. For highly uniform flat data (i.e. flat lists of things), data formatted in CSV can be up to 5 times as efficient as the same data formatted as JSON. However, it can also be very inefficient and lossy for deeply nested or recursive data. In particular, for this kind of nested data the accuracy of LLM generation would probably be a significant issue.

Finally, given how much of a difference data serialization format seems to have in regards to the token count, further research should be conducted in this area. I’d wager it won’t be long until we see a new LLM-focused data serialization format, perhaps one that is highly efficient token-wise.

It might also be interesting to compare different data formats by how they impact LLM accuracy & generative capabilities, but that is left as an exercise to the reader.