Skip to main content
Skip to main content

Use JSON where appropriate

ClickHouse now offers a native JSON column type designed for semi-structured and dynamic data. It’s important to clarify that this is a column type, not a data format—you can insert JSON into ClickHouse as a string or via supported formats like JSONEachRow, but that does not imply using the JSON column type. Users should only use the JSON type when the structure of their data is dynamic, not when they simply happen to store JSON.

When to Use the JSON Type

Use the JSON type when your data:

  • Has unpredictable keys that can change over time.
  • Contains values with varying types (e.g., a path might sometimes contain a string, sometimes a number).
  • Requires schema flexibility where strict typing isn’t viable.

If your data structure is known and consistent, there is rarely a need for the JSON type, even if your data is in JSON format. Specifically, if your data has:

  • A flat structure with known keys: use standard column types e.g. String.
  • Predictable nesting: use Tuple, Array, or Nested types for these structures.
  • Predictable structure with varying types: consider Dynamic or Variant types instead.

You can also mix approaches - for example, use static columns for predictable top-level fields and a single JSON column for a dynamic section of the payload.

Considerations and Tips for Using JSON

The JSON type enables efficient columnar storage by flattening paths into subcolumns. But with flexibility comes responsibility. To use it effectively:

  • Specify path types using hints in the column definition to specify types for known sub columns, avoiding unnecessary type inference.
  • Skip paths if you don’t need the values, with SKIP and SKIP REGEXP to reduce storage and improve performance.
  • Avoid setting max_dynamic_paths too high - large values increase resource consumption and reduce efficiency. As a rule of thumb, keep it below 10,000.

Advanced Features

  • JSON columns can be used in primary keys like any other columns. Codecs cannot be specified for a sub-column.
  • They support introspection via functions like JSONAllPathsWithTypes() and JSONDynamicPaths().
  • You can read nested sub-objects using the .^ syntax.
  • Query syntax may differ from standard SQL and may require special casting or operators for nested fields.

For additional guidance, see ClickHouse JSON documentation or explore our blog post A New Powerful JSON Data Type for ClickHouse.

Examples

Consider the following JSON sample, representing a row from the Python PyPI dataset:

Lets assume this schema is static and the types can be well defined. Even if the data is in NDJSON format (json row per line), there is no need to use the JSON type for such a schema. Simply define the schema with classic types.

and insert JSON rows:

Consider the arXiv dataset containing 2.5m scholarly papers. Each row in this dataset, distributed as NDJSON, represents a published academic paper. An example row is shown below:

While the JSON here is complex, with nested structures, it is predictable. The number and type of the fields will not change. While we could use the JSON type for this example, we can also just define the structure explicitly using Tuples and Nested types:

Again we can insert the data as JSON:

Suppose another column is added tags. If this was simply a list of strings we could model as an Array(String), but let's assume users can add arbitrary tag structures with mixed types (notice score is a string or integer). Our modified JSON document:

In this case, we could model the arxiv documents as either all JSON or simply add a JSON tags column. We provide both examples below:

note

We specify the update_date in the JSON definition as we use it in the ordering/primary key. This helps ClickHouse know this column won’t be null. If not specified, the user must explicitly allow nullable primary keys (not recommended for performance reasons) via the setting allow_nullable_key=1

We can insert into this table and view the subsequently inferred schema using the JSONAllPathsWithTypes function and PrettyJSONEachRow output format:

Alternatively, we could model this using our earlier schema and a JSON tags column. This is generally preferred, minimizing the inference required by ClickHouse:

We can now infer the types of the sub column tags.