Documents

Documents API

Aggregate Document Count

DocumentsAPI.aggregate_count(query: str | None = None, filter: Filter | dict | None = None) → int

Count of documents matching the specified filters and search.

Parameters

query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to count.

Returns

The number of documents matching the specified filters and search.

Return type

int

Examples

Count the number of documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> count = c.documents.aggregate_count()

Count the number of PDF documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> pdf_count = c.documents.aggregate_count(filter=is_pdf)
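
Because filter also accepts a plain dict, counts for several mime types can be collected in a loop. The following is a minimal sketch, not part of the SDK: count_by_mime_type is a hypothetical helper, and the dict is assumed to mirror the JSON form of filters.Equals with the property path ["mimeType"]. The client argument is any object exposing documents.aggregate_count, such as a configured CogniteClient.

```python
from typing import Any


def count_by_mime_type(client: Any, mime_types: list[str]) -> dict[str, int]:
    """Return {mime_type: document count}, using one aggregate_count call per type.

    Hypothetical helper: `client` is expected to expose
    `documents.aggregate_count(filter=...)`, as CogniteClient does.
    """
    counts: dict[str, int] = {}
    for mime_type in mime_types:
        # Assumed dict form of filters.Equals(DocumentProperty.mime_type, mime_type).
        equals_filter = {"equals": {"property": ["mimeType"], "value": mime_type}}
        counts[mime_type] = client.documents.aggregate_count(filter=equals_filter)
    return counts
```

With a configured client this could be called as count_by_mime_type(c, ["application/pdf", "text/plain"]).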

Aggregate Document Value Cardinality

Find the approximate number of distinct values of a property for documents.

Parameters

property (DocumentProperty | SourceFileProperty | list[str] | str) – The property to count the cardinality of.
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to count cardinality.
aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.

Returns

The approximate number of distinct values of the property among documents matching the specified filters and search.

Return type

int

Examples

Count the number of types of documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> count = c.documents.aggregate_cardinality_values(DocumentProperty.type)

Count the number of authors of text/plain documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_plain_text = filters.Equals(DocumentProperty.mime_type, "text/plain")
>>> plain_text_author_count = c.documents.aggregate_cardinality_values(DocumentProperty.author, filter=is_plain_text)

Count the number of types of documents in your CDF project, but exclude types that start with “text”:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.data_classes import aggregations
>>> c = CogniteClient()
>>> agg = aggregations
>>> is_not_text = agg.Not(agg.Prefix("text"))
>>> type_count_excluded_text = c.documents.aggregate_cardinality_values(DocumentProperty.type, aggregate_filter=is_not_text)

Aggregate Document Property Cardinality

Find the approximate number of distinct paths (property keys) for documents.

Parameters

path (DocumentProperty | SourceFileProperty | list[str] | str) – The scope in every document to aggregate properties. The only value allowed now is [“metadata”]. It means to aggregate only metadata properties (aka keys).
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to count cardinality.
aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.

Returns

The approximate number of distinct properties (for example, metadata keys) among documents matching the specified filters and search.

Return type

int

Examples

Count the number of metadata keys for documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import SourceFileProperty
>>> c = CogniteClient()
>>> count = c.documents.aggregate_cardinality_properties(SourceFileProperty.metadata)

Aggregate Document Unique Values

Get unique properties with counts for documents.

Parameters

property (DocumentProperty | SourceFileProperty | list[str] | str) – The property to group by.
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to aggregate.
aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.
limit (int) – Maximum number of items. Defaults to 25.

Returns

List of unique values of documents matching the specified filters and search.

Return type

UniqueResultList

Examples

Get the unique types with count of documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> result = c.documents.aggregate_unique_values(DocumentProperty.type)
>>> unique_types = result.unique

Get the different languages with count for documents with external id prefix “abc”:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_abc = filters.Prefix(DocumentProperty.external_id, "abc")
>>> result = c.documents.aggregate_unique_values(DocumentProperty.language, filter=is_abc)
>>> unique_languages = result.unique

Get the unique mime types with count of documents, but exclude mime types that start with text:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.data_classes import aggregations
>>> c = CogniteClient()
>>> agg = aggregations
>>> is_not_text = agg.Not(agg.Prefix("text"))
>>> result = c.documents.aggregate_unique_values(DocumentProperty.mime_type, aggregate_filter=is_not_text)
>>> unique_mime_types = result.unique
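
Each item in the returned list carries a count and a values list (see DocumentUniqueResult under Documents classes). The sketch below shows one way to flatten such a result into a plain mapping. It is illustrative only: unique_to_counts is a hypothetical helper, and it assumes that an aggregation over a single property yields one hashable value (such as a string) per bucket.

```python
from typing import Any, Iterable


def unique_to_counts(results: Iterable[Any]) -> dict[Any, int]:
    """Map each unique value to its document count.

    Hypothetical helper: every item is expected to expose `.values` (a list)
    and `.count` (an int), like DocumentUniqueResult.
    """
    counts: dict[Any, int] = {}
    for item in results:
        if not item.values:
            continue  # Skip buckets that carry no value.
        # Assumption: single-property aggregations have one value per bucket.
        counts[item.values[0]] = item.count
    return counts
```

Given result from one of the examples above, unique_to_counts(result) would return a mapping such as mime type to document count.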

Aggregate Document Unique Properties

Get unique paths with counts for documents.

Parameters

path (DocumentProperty | SourceFileProperty | list[str] | str) – The scope in every document to aggregate properties. The only value allowed now is [“metadata”]. It means to aggregate only metadata properties (aka keys).
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to count cardinality.
aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.
limit (int) – Maximum number of items. Defaults to 25.

Returns

List of unique properties of documents matching the specified filters and search.

Return type

UniqueResultList

Examples

Get the unique metadata keys with count of documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import SourceFileProperty
>>> c = CogniteClient()
>>> result = c.documents.aggregate_unique_properties(SourceFileProperty.metadata)

List Documents

DocumentsAPI.list(filter: Filter | dict | None = None, limit: int | None = 25) → DocumentList

List documents

You can use filters to narrow down the list. Unlike the search method, list does not restrict the number of documents to return, meaning that setting the limit to -1 will return all the documents in your project.

Parameters

filter (Filter | dict | None) – The filter to narrow down the documents to return.
limit (int | None) – Maximum number of documents to return. Defaults to 25. Set to None or -1 to return all documents.

Returns

List of documents

Return type

DocumentList

Examples

List all PDF documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> pdf_documents = c.documents.list(filter=is_pdf)

Iterate over all documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> for document in c.documents:
...    print(document.source_file.name)

Retrieve Document Content

DocumentsAPI.retrieve_content(id: int) → bytes

Retrieve document content

Returns extracted textual information for the given document.

The document pipeline extracts up to 1MiB of textual information from each processed document. The search and list endpoints truncate the textual content of each document, in order to reduce the size of the returned payload. If you want the whole text for a document, you can use this endpoint.

Parameters: id (int) – The server-generated ID for the document you want to retrieve the content of.
Returns: The content of the document.
Return type: bytes

Examples

Retrieve the content of a document with id 123:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.retrieve_content(id=123)
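
The content is returned as bytes, so it usually needs decoding before it can be handled as text. A minimal sketch, assuming the extracted text is UTF-8 encoded (the encoding is not stated in this reference); content_to_text is a hypothetical helper:

```python
def content_to_text(content: bytes, encoding: str = "utf-8") -> str:
    """Decode extracted document content, replacing undecodable bytes.

    Assumption: the content is UTF-8 unless another encoding is passed.
    """
    return content.decode(encoding, errors="replace")
```

Using errors="replace" keeps the call from raising on unexpected bytes, at the cost of substituting the Unicode replacement character for them.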

Retrieve Document Content Buffer

DocumentsAPI.retrieve_content_buffer(id: int, buffer: BinaryIO) → None

Retrieve document content into buffer

Returns extracted textual information for the given document.

The document pipeline extracts up to 1MiB of textual information from each processed document. The search and list endpoints truncate the textual content of each document, in order to reduce the size of the returned payload. If you want the whole text for a document, you can use this endpoint.

Parameters

id (int) – The server-generated ID for the document you want to retrieve the content of.
buffer (BinaryIO) – The document content is streamed directly into the buffer. This is useful for retrieving large documents.

Examples

Retrieve the content of a document with id 123 into local file “my_file.txt”:

>>> from cognite.client import CogniteClient
>>> from pathlib import Path
>>> c = CogniteClient()
>>> with Path("my_file.txt").open("wb") as buffer:
...     c.documents.retrieve_content_buffer(id=123, buffer=buffer)
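
The buffer does not have to be a file on disk. As a sketch, an in-memory io.BytesIO object can serve as the binary buffer; read_content_via_buffer is a hypothetical helper, and client is any object exposing documents.retrieve_content_buffer, such as a configured CogniteClient.

```python
import io
from typing import Any


def read_content_via_buffer(client: Any, document_id: int) -> bytes:
    """Stream a document's extracted content into memory and return it.

    Hypothetical helper: `client` is expected to expose
    `documents.retrieve_content_buffer(id=..., buffer=...)`.
    """
    buffer = io.BytesIO()  # In-memory binary stream, usable where BinaryIO is expected.
    client.documents.retrieve_content_buffer(id=document_id, buffer=buffer)
    return buffer.getvalue()
```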

Search Documents

Search documents

This endpoint lets you search for documents by using advanced filters and free text queries. Free text queries are matched against the documents’ filenames and contents. For more information, see the endpoint documentation referenced above.

Parameters

query (str) – The free text search query.
highlight (bool) – Whether or not matches in search results should be highlighted.
filter (Filter | dict | None) – The filter to narrow down the documents to search.
sort (DocumentSort | SortableProperty | tuple[SortableProperty, Literal["asc", "desc"]] | None) – The property to sort by. The default order is ascending.
limit (int) – Maximum number of items to return. When using highlights, the maximum value is reduced to 20. Defaults to 25.

Returns

List of search results. If highlight is True, a DocumentHighlightList is returned, otherwise a DocumentList is returned.

Return type

DocumentList | DocumentHighlightList

Examples

Search for text “pump 123” in PDF documents in your CDF project:

>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> documents = c.documents.search("pump 123", filter=is_pdf)

Find all documents with the exact text ‘CPLEX Error 1217: No Solution exists.’ in plain text files created in the last week in your CDF project, and highlight the matches:

>>> from datetime import datetime, timedelta
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.utils import timestamp_to_ms
>>> c = CogniteClient()
>>> is_plain_text = filters.Equals(DocumentProperty.mime_type, "text/plain")
>>> last_week = filters.Range(DocumentProperty.created_time,
...     gt=timestamp_to_ms(datetime.now() - timedelta(days=7)))
>>> documents = c.documents.search('"CPLEX Error 1217: No Solution exists."',
...     highlight=True,
...     filter=filters.And(is_plain_text, last_week))
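
When highlight is True, each result pairs a document with its highlights (see DocumentHighlight and Highlight under Documents classes). The sketch below gathers the snippets per document. It is illustrative only: collect_snippets is a hypothetical helper that relies solely on the document, highlight, name and content attributes documented in this reference.

```python
from typing import Any, Iterable


def collect_snippets(highlights: Iterable[Any]) -> dict[int, list[str]]:
    """Return {document id: highlighted snippets} for a highlighted search result.

    Hypothetical helper: each item is expected to look like DocumentHighlight,
    exposing `.document.id`, `.highlight.name` and `.highlight.content`.
    """
    snippets: dict[int, list[str]] = {}
    for item in highlights:
        # Name matches come first, followed by content matches.
        snippets[item.document.id] = list(item.highlight.name) + list(item.highlight.content)
    return snippets
```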

Documents classes

Bases: CogniteResource

A representation of a document in CDF.

Parameters

id (int) – A server-generated ID for the object.
created_time (int) – The creation time of the document in CDF in milliseconds since Jan 1, 1970.
source_file (SourceFile) – The source file that this document is derived from.
external_id (str | None) – The external ID provided by the client. Must be unique for the resource type.
title (str | None) – The title of the document.
author (str | None) – The author of the document.
producer (str | None) – The producer of the document. Many document types contain metadata indicating what software or system was used to create the document.
modified_time (int | None) – The last time the document was modified in CDF in milliseconds since Jan 1, 1970.
last_indexed_time (int | None) – The last time the document was indexed in the search engine, measured in milliseconds since Jan 1, 1970.
mime_type (str | None) – The detected mime type of the document.
extension (str | None) – Extension of the file (always in lowercase).
page_count (int | None) – The number of pages in the document.
type (str | None) – The detected type of the document.
language (str | None) – The detected language of the document.
truncated_content (str | None) – The truncated content of the document.
asset_ids (list[int] | None) – The ids of any assets referred to in the document.
labels (list[Label | str | LabelDefinition] | None) – The labels attached to the document.
geo_location (GeoLocation | None) – The geolocation of the document.
cognite_client (CogniteClient | None) – No description.
**_ (Any) – No description.

dump(camel_case: bool = False) → dict[str, Any]

Dump the instance into a json serializable Python data type.

Parameters: camel_case (bool) – Use camelCase for attribute names. Defaults to False.
Returns: A dictionary representation of the instance.
Return type: dict[str, Any]
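
The time fields of a document (created_time, modified_time, last_indexed_time) are integers in milliseconds since Jan 1, 1970. A minimal sketch for turning them into datetime objects, assuming the usual Unix epoch in UTC; ms_to_datetime is a hypothetical helper, not part of the SDK:

```python
from datetime import datetime, timezone


def ms_to_datetime(ms_since_epoch: int) -> datetime:
    """Convert milliseconds since Jan 1, 1970 (UTC) to a timezone-aware datetime."""
    seconds, millis = divmod(ms_since_epoch, 1000)
    # Build from whole seconds, then restore the millisecond part exactly.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=millis * 1000)
```

For a retrieved document, ms_to_datetime(document.created_time) would give its creation time as a UTC datetime.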

class cognite.client.data_classes.documents.DocumentHighlight(highlight: Highlight, document: Document)

Bases: CogniteResource

A pair of a document and highlights.

This is used in search results to represent the result.

Parameters

highlight (Highlight) – The highlight from the document matching search results.
document (Document) – The document.

dump(camel_case: bool = False) → dict[str, Any]

Dump the instance into a json serializable Python data type.

Parameters: camel_case (bool) – Use camelCase for attribute names. Defaults to False.
Returns: A dictionary representation of the instance.
Return type: dict[str, Any]

class cognite.client.data_classes.documents.DocumentHighlightList(resources: Collection[Any], cognite_client: CogniteClient | None = None)

Bases: CogniteResourceList[DocumentHighlight]

class cognite.client.data_classes.documents.DocumentList(resources: Collection[Any], cognite_client: CogniteClient | None = None)

Bases: CogniteResourceList[Document], IdTransformerMixin

class cognite.client.data_classes.documents.DocumentProperty(value)

Bases: EnumProperty

An enumeration.

class cognite.client.data_classes.documents.DocumentUniqueResult(count: int, values: list[str | int | float | Label])

Bases: UniqueResult

class cognite.client.data_classes.documents.Highlight(name: list[str], content: list[str])

Bases: CogniteResource

Highlighted snippets from the name and content fields, showing where the query matches.

This is used in search results to represent the result.

Parameters

name (list[str]) – Matches in name.
content (list[str]) – Matches in content.

dump(camel_case: bool = False) → dict[str, Any]

Dump the instance into a json serializable Python data type.

Parameters: camel_case (bool) – Use camelCase for attribute names. Defaults to False.
Returns: A dictionary representation of the instance.
Return type: dict[str, Any]

class cognite.client.data_classes.documents.SortableDocumentProperty(value)

Bases: EnumProperty

An enumeration.

class cognite.client.data_classes.documents.SortableSourceFileProperty(value)

Bases: EnumProperty

An enumeration.

Bases: CogniteResource

The source file that a document is derived from.

Parameters

name (str) – The name of the source file.
hash (str | None) – The hash of the source file. This is a SHA256 hash of the original file. The hash only covers the file content, and not other CDF metadata.
directory (str | None) – The directory the file can be found in.
source (str | None) – The source of the file.
mime_type (str | None) – The mime type of the file.
size (int | None) – The size of the file in bytes.
asset_ids (list[int] | None) – The ids of the assets related to this file.
labels (list[Label | str | LabelDefinition] | None) – A list of labels associated with this document’s source file in CDF.
geo_location (GeoLocation | None) – The geolocation of the source file.
dataset_id (int | None) – The id of the dataset this file belongs to, if any.
security_categories (list[int] | None) – The security category IDs required to access this file.
metadata (dict[str, str] | None) – Custom, application specific metadata. String key -> String value.
cognite_client (CogniteClient | None) – No description.
**_ (Any) – No description.

dump(camel_case: bool = False) → dict[str, Any]

Dump the instance into a json serializable Python data type.

Parameters: camel_case (bool) – Use camelCase for attribute names. Defaults to False.
Returns: A dictionary representation of the instance.
Return type: dict[str, Any]
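
Since the hash of a source file is described as a SHA256 hash of the original file content, a local copy can be checked against it. The sketch below is an illustration only: both helpers are hypothetical, and it assumes the hash is stored as a hexadecimal string, which this reference does not state.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_source_hash(data: bytes, source_hash: str) -> bool:
    """Compare local file bytes against a SourceFile hash.

    Assumption: the hash is a hex string; casing is normalised before comparing.
    """
    return sha256_hex(data) == source_hash.lower()
```

For example, matches_source_hash(Path("my_file.pdf").read_bytes(), document.source_file.hash) would compare a local file (the file name here is made up) with the hash recorded for the document.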

class cognite.client.data_classes.documents.SourceFileProperty(value)

Bases: EnumProperty

An enumeration.

class cognite.client.data_classes.documents.TemporaryLink(url: str, expires_at: int)

Bases: object

Preview

Download Image Preview Bytes

DocumentPreviewAPI.download_page_as_png_bytes(id: int, page_number: int = 1) → bytes

Downloads an image preview for a specific page of the specified document.

Parameters

id (int) – The server-generated ID for the document you want to retrieve the preview of.
page_number (int) – Page number to preview. Starting at 1 for first page.

Returns

The png preview of the document.

Return type

bytes

Examples

Download image preview of page 5 of file with id 123:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.previews.download_page_as_png_bytes(id=123, page_number=5)

Download an image preview and display using IPython.display.Image (for example in a Jupyter Notebook):

>>> from IPython.display import Image
>>> binary_png = c.documents.previews.download_page_as_png_bytes(id=123, page_number=5)
>>> Image(binary_png)
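
Previews are fetched one page at a time, with page numbers starting at 1. The sketch below loops over the first pages of a document and checks that each response starts with the standard PNG signature. download_first_pages is a hypothetical helper, and client is any object exposing documents.previews.download_page_as_png_bytes, such as a configured CogniteClient.

```python
from typing import Any

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # The first eight bytes of every PNG file.


def download_first_pages(client: Any, document_id: int, pages: int) -> list[bytes]:
    """Download PNG previews for pages 1..pages of one document.

    Hypothetical helper: `client` is expected to expose
    `documents.previews.download_page_as_png_bytes(id=..., page_number=...)`.
    """
    previews = []
    for page_number in range(1, pages + 1):  # Page numbers start at 1.
        png = client.documents.previews.download_page_as_png_bytes(
            id=document_id, page_number=page_number
        )
        if not png.startswith(PNG_SIGNATURE):
            raise ValueError(f"Page {page_number} did not return PNG data")
        previews.append(png)
    return previews
```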

Download Image Preview to Path

DocumentPreviewAPI.download_page_as_png(path: Path | str | IO, id: int, page_number: int = 1, overwrite: bool = False) → None

Downloads an image preview for a specific page of the specified document.

Parameters

path (Path | str | IO) – The path to save the png preview of the document. If the path is a directory, the file name will be ‘[id]_page[page_number].png’.
id (int) – The server-generated ID for the document you want to retrieve the preview of.
page_number (int) – Page number to preview. Starting at 1 for first page.
overwrite (bool) – Whether to overwrite existing file at the given path. Defaults to False.

Examples

Download image preview of page 5 of file with id 123 to folder “previews”:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> c.documents.previews.download_page_as_png("previews", id=123, page_number=5)

Download PDF Preview Bytes

DocumentPreviewAPI.download_document_as_pdf_bytes(id: int) → bytes

Downloads a pdf preview of the specified document.

Only the first 100 pages will be included.

Previews will be rendered if necessary during the request. Be prepared for the request to take a few seconds to complete.

Parameters: id (int) – The server-generated ID for the document you want to retrieve the preview of.
Returns: The pdf preview of the document.
Return type: bytes

Examples

Download PDF preview of file with id 123:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.previews.download_document_as_pdf_bytes(id=123)

Download PDF Preview to Path

DocumentPreviewAPI.download_document_as_pdf(path: Path | str | IO, id: int, overwrite: bool = False) → None

Downloads a pdf preview of the specified document.

Only the first 100 pages will be included.

Previews will be rendered if necessary during the request. Be prepared for the request to take a few seconds to complete.

Parameters

path (Path | str | IO) – The path to save the pdf preview of the document. If the path is a directory, the file name will be ‘[id].pdf’.
id (int) – The server-generated ID for the document you want to retrieve the preview of.
overwrite (bool) – Whether to overwrite existing file at the given path. Defaults to False.

Examples

Download PDF preview of file with id 123 to folder “previews”:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> c.documents.previews.download_document_as_pdf("previews", id=123)

Retrieve PDF Preview Temporary Link

DocumentPreviewAPI.retrieve_pdf_link(id: int) → TemporaryLink

Retrieve a temporary link to download the pdf preview

Parameters: id (int) – The server-generated ID for the document you want to retrieve the preview of.
Returns: A temporary link to download the pdf preview.
Return type: TemporaryLink

Examples

Retrieve the PDF preview download link for document with id 123:

>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> link = c.documents.previews.retrieve_pdf_link(id=123)
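
A TemporaryLink carries a url and an expires_at value, so callers may want to check the expiry before using the url. A minimal sketch, assuming expires_at is expressed in milliseconds since Jan 1, 1970 like the other timestamps in this reference (the unit is not stated for TemporaryLink); link_is_expired is a hypothetical helper:

```python
import time
from typing import Optional


def link_is_expired(expires_at_ms: int, now_ms: Optional[int] = None) -> bool:
    """Return True if a temporary link's expiry time has passed.

    Assumption: `expires_at` is in milliseconds since Jan 1, 1970.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # Current time in milliseconds.
    return now_ms >= expires_at_ms
```

With the link from the example above, link_is_expired(link.expires_at) would indicate whether a new link needs to be retrieved.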