Documents

Documents API

Aggregate Document Count

DocumentsAPI.aggregate_count(query: str | None = None, filter: Filter | dict | None = None) int

Count of documents matching the specified filters and search.

Parameters
  • query (str | None) – The free text search query, for details see the documentation referenced above.

  • filter (Filter | dict | None) – The filter to narrow down the documents to count.

Returns

The number of documents matching the specified filters and search.

Return type

int

Examples

Count the number of documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> count = c.documents.aggregate_count()

Count the number of PDF documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> pdf_count = c.documents.aggregate_count(filter=is_pdf)
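
The two counts above can be combined into a ratio. The following is a minimal sketch rather than part of the SDK: pdf_share is a hypothetical helper name, and it takes the client and the filter as arguments so that it relies only on the aggregate_count calls shown above.

```python
def pdf_share(client, is_pdf):
    """Return the fraction of documents matching `is_pdf` (0.0 for an empty project).

    `client` is expected to be a CogniteClient and `is_pdf` a filter such as
    filters.Equals(DocumentProperty.mime_type, "application/pdf").
    """
    total = client.documents.aggregate_count()
    if total == 0:
        # Avoid dividing by zero when the project has no documents.
        return 0.0
    return client.documents.aggregate_count(filter=is_pdf) / total
```

Passing the client in, instead of creating it inside the helper, keeps the function easy to exercise with a stand-in object.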

Aggregate Document Value Cardinality

DocumentsAPI.aggregate_cardinality_values(property: DocumentProperty | SourceFileProperty | list[str] | str, query: str | None = None, filter: Filter | dict | None = None, aggregate_filter: AggregationFilter | dict | None = None) int

Find the approximate number of distinct values of a property for documents.

Parameters
  • property (DocumentProperty | SourceFileProperty | list[str] | str) – The property to count the cardinality of.

  • query (str | None) – The free text search query, for details see the documentation referenced above.

  • filter (Filter | dict | None) – The filter to narrow down the documents to count cardinality.

  • aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.

Returns

The approximate number of distinct values of the property among documents matching the specified filters and search.

Return type

int

Examples

Count the number of types of documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> count = c.documents.aggregate_cardinality_values(DocumentProperty.type)

Count the number of authors of text/plain documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_plain_text = filters.Equals(DocumentProperty.mime_type, "text/plain")
>>> plain_text_author_count = c.documents.aggregate_cardinality_values(DocumentProperty.author, filter=is_plain_text)

Count the number of types of documents in your CDF project, but exclude types that start with “text”:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.data_classes import aggregations
>>> c = CogniteClient()
>>> agg = aggregations
>>> is_not_text = agg.Not(agg.Prefix("text"))
>>> type_count_excluded_text = c.documents.aggregate_cardinality_values(DocumentProperty.type, aggregate_filter=is_not_text)

Aggregate Document Property Cardinality

DocumentsAPI.aggregate_cardinality_properties(path: DocumentProperty | SourceFileProperty | list[str] | str, query: str | None = None, filter: Filter | dict | None = None, aggregate_filter: AggregationFilter | dict | None = None) int

Find the approximate number of distinct paths for documents.

Parameters
  • path (DocumentProperty | SourceFileProperty | list[str] | str) – The scope in every document to aggregate properties. The only value allowed now is [“metadata”]. It means to aggregate only metadata properties (aka keys).

  • query (str | None) – The free text search query, for details see the documentation referenced above.

  • filter (Filter | dict | None) – The filter to narrow down the documents to count cardinality.

  • aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.

Returns

The approximate number of distinct paths (for example metadata keys) among documents matching the specified filters and search.

Return type

int

Examples

Count the number of metadata keys for documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import SourceFileProperty
>>> c = CogniteClient()
>>> count = c.documents.aggregate_cardinality_properties(SourceFileProperty.metadata)

Aggregate Document Unique Values

DocumentsAPI.aggregate_unique_values(property: DocumentProperty | SourceFileProperty | list[str] | str, query: str | None = None, filter: Filter | dict | None = None, aggregate_filter: AggregationFilter | dict | None = None, limit: int = 25) UniqueResultList

Get unique properties with counts for documents.

Parameters
  • property (DocumentProperty | SourceFileProperty | list[str] | str) – The property to group by.

  • query (str | None) – The free text search query, for details see the documentation referenced above.

  • filter (Filter | dict | None) – The filter to narrow down the documents to aggregate.

  • aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.

  • limit (int) – Maximum number of items. Defaults to 25.

Returns

List of unique values of documents matching the specified filters and search.

Return type

UniqueResultList

Examples

Get the unique mime types with count of documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> result = c.documents.aggregate_unique_values(DocumentProperty.mime_type)
>>> unique_types = result.unique

Get the different languages with count for documents with external id prefix “abc”:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_abc = filters.Prefix(DocumentProperty.external_id, "abc")
>>> result = c.documents.aggregate_unique_values(DocumentProperty.language, filter=is_abc)
>>> unique_languages = result.unique

Get the unique mime types with count of documents, but exclude mime types that start with text:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.data_classes import aggregations
>>> c = CogniteClient()
>>> agg = aggregations
>>> is_not_text = agg.Not(agg.Prefix("text"))
>>> result = c.documents.aggregate_unique_values(DocumentProperty.mime_type, aggregate_filter=is_not_text)
>>> unique_mime_types = result.unique
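
The result of the calls above carries a count for each unique value. As a sketch of how those pairs might be read out, the hypothetical helper below (counts_by_value is not an SDK function) assumes each item exposes count and values, as in the DocumentUniqueResult signature listed under Documents classes, and that the values are hashable, such as strings or numbers.

```python
def counts_by_value(result):
    """Map each unique value to its document count.

    `result` is expected to be the UniqueResultList returned by
    aggregate_unique_values.
    """
    table = {}
    for item in result:
        # Each bucket carries a list of values; a single-property
        # aggregation is assumed to yield exactly one value per bucket.
        key = item.values[0] if len(item.values) == 1 else tuple(item.values)
        table[key] = item.count
    return table
```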

Aggregate Document Unique Properties

DocumentsAPI.aggregate_unique_properties(path: DocumentProperty | SourceFileProperty | list[str] | str, query: str | None = None, filter: Filter | dict | None = None, aggregate_filter: AggregationFilter | dict | None = None, limit: int = 25) UniqueResultList

Get unique paths with counts for documents.

Parameters
  • path (DocumentProperty | SourceFileProperty | list[str] | str) – The scope in every document to aggregate properties. The only value allowed now is [“metadata”]. It means to aggregate only metadata properties (aka keys).

  • query (str | None) – The free text search query, for details see the documentation referenced above.

  • filter (Filter | dict | None) – The filter to narrow down the documents to aggregate.

  • aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.

  • limit (int) – Maximum number of items. Defaults to 25.

Returns

List of unique values of documents matching the specified filters and search.

Return type

UniqueResultList

Examples

Get the unique metadata keys with count of documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import SourceFileProperty
>>> c = CogniteClient()
>>> result = c.documents.aggregate_unique_properties(SourceFileProperty.metadata)

List Documents

DocumentsAPI.list(filter: Filter | dict | None = None, limit: int | None = 25) DocumentList

List documents

You can use filters to narrow down the list. Unlike the search method, list does not restrict the number of documents to return, meaning that setting the limit to -1 will return all the documents in your project.

Parameters
  • filter (Filter | dict | None) – The filter to narrow down the documents to return.

  • limit (int | None) – Maximum number of documents to return. Defaults to 25. Set to None or -1 to return all documents.

Returns

List of documents

Return type

DocumentList

Examples

List all PDF documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> pdf_documents = c.documents.list(filter=is_pdf)

Iterate over all documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> for document in c.documents:
...    print(document.name)
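
Iterating is handy when each document needs client-side processing. The sketch below tallies documents per detected mime type; count_by_mime_type is a hypothetical helper, and it accepts any iterable of Document objects, for example c.documents itself or the result of c.documents.list(...).

```python
from collections import Counter


def count_by_mime_type(documents):
    """Tally documents per detected mime type."""
    tally = Counter()
    for document in documents:
        # mime_type is optional on Document, so missing values are grouped together.
        tally[document.mime_type or "unknown"] += 1
    return tally
```

For large projects the aggregate_unique_values method is likely the better fit, since it returns the counts without transferring every document.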

Retrieve Document Content

DocumentsAPI.retrieve_content(id: int) bytes

Retrieve document content

Returns extracted textual information for the given document.

The document pipeline extracts up to 1MiB of textual information from each processed document. The search and list endpoints truncate the textual content of each document, in order to reduce the size of the returned payload. If you want the whole text for a document, you can use this endpoint.

Parameters

id (int) – The server-generated ID for the document you want to retrieve the content of.

Returns

The content of the document.

Return type

bytes

Examples

Retrieve the content of a document with id 123:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.retrieve_content(id=123)
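
Because retrieve_content returns bytes, the result usually has to be decoded before it can be handled as text. The sketch below assumes the content is UTF-8 encoded, which this reference does not state; content_as_text is a hypothetical helper and the encoding is exposed as a parameter for that reason.

```python
def content_as_text(client, document_id, encoding="utf-8"):
    """Fetch the extracted content of a document and decode it to str.

    The default encoding is an assumption, not something the API guarantees.
    """
    raw = client.documents.retrieve_content(id=document_id)
    # errors="replace" substitutes U+FFFD for bytes that are invalid in `encoding`
    # instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")
```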

Retrieve Document Content Buffer

DocumentsAPI.retrieve_content_buffer(id: int, buffer: BinaryIO) None

Retrieve document content into buffer

Returns extracted textual information for the given document.

The document pipeline extracts up to 1MiB of textual information from each processed document. The search and list endpoints truncate the textual content of each document, in order to reduce the size of the returned payload. If you want the whole text for a document, you can use this endpoint.

Parameters
  • id (int) – The server-generated ID for the document you want to retrieve the content of.

  • buffer (BinaryIO) – The document content is streamed directly into the buffer. This is useful for retrieving large documents.

Examples

Retrieve the content of a document with id 123 into local file “my_file.txt”:

>>> from cognite.client import CogniteClient
>>> from pathlib import Path
>>> c = CogniteClient()
>>> with Path("my_file.txt").open("wb") as buffer:
...     c.documents.retrieve_content_buffer(id=123, buffer=buffer)
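
The buffer does not have to be a file on disk. As a sketch, an in-memory io.BytesIO object can serve as the target when the content should stay in memory; content_via_buffer is a hypothetical helper name.

```python
import io


def content_via_buffer(client, document_id):
    """Stream the document content into memory and return it as bytes."""
    buffer = io.BytesIO()
    client.documents.retrieve_content_buffer(id=document_id, buffer=buffer)
    # getvalue() returns everything written so far, regardless of the stream position.
    return buffer.getvalue()
```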

Search Documents

DocumentsAPI.search(query: str, highlight: Literal[False] = False, filter: Filter | dict | None = None, sort: DocumentSort | str | list[str] | tuple[SortableProperty, Literal['asc', 'desc']] | None = None, limit: int = DEFAULT_LIMIT_READ) DocumentList
DocumentsAPI.search(query: str, highlight: Literal[True], filter: Filter | dict | None = None, sort: DocumentSort | str | list[str] | tuple[SortableProperty, Literal['asc', 'desc']] | None = None, limit: int = DEFAULT_LIMIT_READ) DocumentHighlightList

Search documents

This endpoint lets you search for documents by using advanced filters and free text queries. Free text queries are matched against the documents’ filenames and contents. For more information, see endpoint documentation referenced above.

Parameters
  • query (str) – The free text search query.

  • highlight (bool) – Whether or not matches in search results should be highlighted.

  • filter (Filter | dict | None) – The filter to narrow down the documents to search.

  • sort (DocumentSort | SortableProperty | tuple[SortableProperty, Literal["asc", "desc"]] | None) – The property to sort by. The default order is ascending.

  • limit (int) – Maximum number of items to return. When using highlights, the maximum value is reduced to 20. Defaults to 25.

Returns

List of search results. If highlight is True, a DocumentHighlightList is returned, otherwise a DocumentList is returned.

Return type

DocumentList | DocumentHighlightList

Examples

Search for text “pump 123” in PDF documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> documents = c.documents.search("pump 123", filter=is_pdf)

Find all documents with exact text ‘CPLEX Error 1217: No Solution exists.’ in plain text files created in the last week in your CDF project and highlight the matches:

>>> from datetime import datetime, timedelta
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.utils import timestamp_to_ms
>>> c = CogniteClient()
>>> is_plain_text = filters.Equals(DocumentProperty.mime_type, "text/plain")
>>> last_week = filters.Range(DocumentProperty.created_time,
...     gt=timestamp_to_ms(datetime.now() - timedelta(days=7)))
>>> documents = c.documents.search('"CPLEX Error 1217: No Solution exists."',
...     highlight=True,
...     filter=filters.And(is_plain_text, last_week))
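
With highlight=True each result pairs a document with its highlighted snippets. The sketch below shows one way to walk such a result; summarize_highlights and max_snippets are hypothetical names, and the attribute access follows the DocumentHighlight and Highlight signatures listed under Documents classes.

```python
def summarize_highlights(results, max_snippets=3):
    """Return (document id, snippets) pairs from highlighted search results.

    `results` is expected to be the DocumentHighlightList returned by
    search(..., highlight=True).
    """
    summary = []
    for hit in results:
        # Content matches are listed first, followed by name matches.
        snippets = list(hit.highlight.content) + list(hit.highlight.name)
        summary.append((hit.document.id, snippets[:max_snippets]))
    return summary
```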

Documents classes

class cognite.client.data_classes.documents.Document(id: int, created_time: int, source_file: SourceFile, external_id: str | None = None, title: str | None = None, author: str | None = None, producer: str | None = None, modified_time: int | None = None, last_indexed_time: int | None = None, mime_type: str | None = None, extension: str | None = None, page_count: int | None = None, type: str | None = None, language: str | None = None, truncated_content: str | None = None, asset_ids: list[int] | None = None, labels: list[Label | str | LabelDefinition] | None = None, geo_location: GeoLocation | None = None, cognite_client: CogniteClient | None = None, **_: Any)

Bases: CogniteResource

A representation of a document in CDF.

Parameters
  • id (int) – A server-generated ID for the object.

  • created_time (int) – The creation time of the document in CDF in milliseconds since Jan 1, 1970.

  • source_file (SourceFile) – The source file that this document is derived from.

  • external_id (str | None) – The external ID provided by the client. Must be unique for the resource type.

  • title (str | None) – The title of the document.

  • author (str | None) – The author of the document.

  • producer (str | None) – The producer of the document. Many document types contain metadata indicating what software or system was used to create the document.

  • modified_time (int | None) – The last time the document was modified in CDF in milliseconds since Jan 1, 1970.

  • last_indexed_time (int | None) – The last time the document was indexed in the search engine, measured in milliseconds since Jan 1, 1970.

  • mime_type (str | None) – The detected mime type of the document.

  • extension (str | None) – The extension of the file (always in lowercase).

  • page_count (int | None) – The number of pages in the document.

  • type (str | None) – The detected type of the document.

  • language (str | None) – The detected language of the document.

  • truncated_content (str | None) – The truncated content of the document.

  • asset_ids (list[int] | None) – The ids of any assets referred to in the document.

  • labels (list[Label | str | LabelDefinition] | None) – The labels attached to the document.

  • geo_location (GeoLocation | None) – The geolocation of the document.

  • cognite_client (CogniteClient | None) – No description.

  • **_ (Any) – No description.

dump(camel_case: bool = False) dict[str, Any]

Dump the instance into a json serializable Python data type.

Parameters

camel_case (bool) – Use camelCase for attribute names. Defaults to False.

Returns

A dictionary representation of the instance.

Return type

dict[str, Any]

class cognite.client.data_classes.documents.DocumentHighlight(highlight: Highlight, document: Document)

Bases: CogniteResource

A pair of a document and highlights.

This is used in search results to represent the result.

Parameters
  • highlight (Highlight) – The highlight from the document matching search results.

  • document (Document) – The document.

dump(camel_case: bool = False) dict[str, Any]

Dump the instance into a json serializable Python data type.

Parameters

camel_case (bool) – Use camelCase for attribute names. Defaults to False.

Returns

A dictionary representation of the instance.

Return type

dict[str, Any]

class cognite.client.data_classes.documents.DocumentHighlightList(resources: Collection[Any], cognite_client: CogniteClient | None = None)

Bases: CogniteResourceList[DocumentHighlight]

class cognite.client.data_classes.documents.DocumentList(resources: Collection[Any], cognite_client: CogniteClient | None = None)

Bases: CogniteResourceList[Document], IdTransformerMixin

class cognite.client.data_classes.documents.DocumentProperty(value)

Bases: EnumProperty

An enumeration.

class cognite.client.data_classes.documents.DocumentUniqueResult(count: int, values: list[str | int | float | Label])

Bases: UniqueResult

class cognite.client.data_classes.documents.Highlight(name: list[str], content: list[str])

Bases: CogniteResource

Highlighted snippets from the name and content fields, showing where the query matched.

This is used in search results to represent the result.

Parameters
  • name (list[str]) – Matches in name.

  • content (list[str]) – Matches in content.

dump(camel_case: bool = False) dict[str, Any]

Dump the instance into a json serializable Python data type.

Parameters

camel_case (bool) – Use camelCase for attribute names. Defaults to False.

Returns

A dictionary representation of the instance.

Return type

dict[str, Any]

class cognite.client.data_classes.documents.SortableDocumentProperty(value)

Bases: EnumProperty

An enumeration.

class cognite.client.data_classes.documents.SortableSourceFileProperty(value)

Bases: EnumProperty

An enumeration.

class cognite.client.data_classes.documents.SourceFile(name: str, hash: str | None = None, directory: str | None = None, source: str | None = None, mime_type: str | None = None, size: int | None = None, asset_ids: list[int] | None = None, labels: list[Label | str | LabelDefinition] | None = None, geo_location: GeoLocation | None = None, dataset_id: int | None = None, security_categories: list[int] | None = None, metadata: dict[str, str] | None = None, cognite_client: CogniteClient | None = None, **_: Any)

Bases: CogniteResource

The source file that a document is derived from.

Parameters
  • name (str) – The name of the source file.

  • hash (str | None) – The hash of the source file. This is a SHA256 hash of the original file. The hash only covers the file content, and not other CDF metadata.

  • directory (str | None) – The directory the file can be found in.

  • source (str | None) – The source of the file.

  • mime_type (str | None) – The mime type of the file.

  • size (int | None) – The size of the file in bytes.

  • asset_ids (list[int] | None) – The ids of the assets related to this file.

  • labels (list[Label | str | LabelDefinition] | None) – A list of labels associated with this document’s source file in CDF.

  • geo_location (GeoLocation | None) – The geolocation of the source file.

  • dataset_id (int | None) – The id of the dataset this file belongs to, if any.

  • security_categories (list[int] | None) – The security category IDs required to access this file.

  • metadata (dict[str, str] | None) – Custom, application specific metadata. String key -> String value.

  • cognite_client (CogniteClient | None) – No description.

  • **_ (Any) – No description.

dump(camel_case: bool = False) dict[str, Any]

Dump the instance into a json serializable Python data type.

Parameters

camel_case (bool) – Use camelCase for attribute names. Defaults to False.

Returns

A dictionary representation of the instance.

Return type

dict[str, Any]

class cognite.client.data_classes.documents.SourceFileProperty(value)

Bases: EnumProperty

An enumeration.


Preview

Download Image Preview Bytes

DocumentPreviewAPI.download_page_as_png_bytes(id: int, page_number: int = 1) bytes

Downloads an image preview for a specific page of the specified document.

Parameters
  • id (int) – The server-generated ID for the document you want to retrieve the preview of.

  • page_number (int) – Page number to preview. Starting at 1 for first page.

Returns

The png preview of the document.

Return type

bytes

Examples

Download image preview of page 5 of file with id 123:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.previews.download_page_as_png_bytes(id=123, page_number=5)

Download an image preview and display using IPython.display.Image (for example in a Jupyter Notebook):

>>> from IPython.display import Image
>>> binary_png = c.documents.previews.download_page_as_png_bytes(id=123, page_number=5)
>>> Image(binary_png)
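
Since each call returns a single page, previewing several pages means looping over page numbers. The sketch below is one possible way to do that; download_first_pages is a hypothetical helper, and the signature check relies on the standard 8-byte PNG file header rather than on anything the SDK promises.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # standard 8-byte PNG file header


def download_first_pages(client, document_id, page_count):
    """Download PNG previews for pages 1..page_count of a document."""
    pages = []
    for page_number in range(1, page_count + 1):  # page numbers start at 1
        png = client.documents.previews.download_page_as_png_bytes(
            id=document_id, page_number=page_number
        )
        if not png.startswith(PNG_SIGNATURE):
            raise ValueError(f"page {page_number} did not return a PNG image")
        pages.append(png)
    return pages
```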

Download Image Preview to Path

DocumentPreviewAPI.download_page_as_png(path: Path | str | IO, id: int, page_number: int = 1, overwrite: bool = False) None

Downloads an image preview for a specific page of the specified document.

Parameters
  • path (Path | str | IO) – The path to save the png preview of the document. If the path is a directory, the file name will be ‘[id]_page[page_number].png’.

  • id (int) – The server-generated ID for the document you want to retrieve the preview of.

  • page_number (int) – Page number to preview. Starting at 1 for first page.

  • overwrite (bool) – Whether to overwrite existing file at the given path. Defaults to False.

Examples

Download image preview of page 5 of file with id 123 to folder “previews”:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> c.documents.previews.download_page_as_png("previews", id=123, page_number=5)

Download PDF Preview Bytes

DocumentPreviewAPI.download_document_as_pdf_bytes(id: int) bytes

Downloads a pdf preview of the specified document.

Only the first 100 pages will be included.

Previews will be rendered if necessary during the request. Be prepared for the request to take a few seconds to complete.

Parameters

id (int) – The server-generated ID for the document you want to retrieve the preview of.

Returns

The pdf preview of the document.

Return type

bytes

Examples

Download PDF preview of file with id 123:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.previews.download_document_as_pdf_bytes(id=123)

Download PDF Preview to Path

DocumentPreviewAPI.download_document_as_pdf(path: Path | str | IO, id: int, overwrite: bool = False) None

Downloads a pdf preview of the specified document.

Only the first 100 pages will be included.

Previews will be rendered if necessary during the request. Be prepared for the request to take a few seconds to complete.

Parameters
  • path (Path | str | IO) – The path to save the pdf preview of the document. If the path is a directory, the file name will be ‘[id].pdf’.

  • id (int) – The server-generated ID for the document you want to retrieve the preview of.

  • overwrite (bool) – Whether to overwrite existing file at the given path. Defaults to False.

Examples

Download PDF preview of file with id 123 to folder “previews”:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> c.documents.previews.download_document_as_pdf("previews", id=123)
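
To fetch previews for several documents into one folder, the call above can be wrapped in a loop. This is a sketch under assumptions: download_pdf_previews is a hypothetical helper, and creating the folder up front is a precaution, since this reference does not say whether the SDK creates a missing directory.

```python
from pathlib import Path


def download_pdf_previews(client, document_ids, folder="previews", overwrite=False):
    """Save a PDF preview of each document into `folder` and return the folder path."""
    target = Path(folder)
    # Assumption: the target directory should exist before the download call.
    target.mkdir(parents=True, exist_ok=True)
    for document_id in document_ids:
        # With a directory as path, the file name is documented as '[id].pdf'.
        client.documents.previews.download_document_as_pdf(
            target, id=document_id, overwrite=overwrite
        )
    return target
```

Rendering may take a few seconds per document, so a long list of ids can take a while to complete.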