Documents
Documents API
Aggregate Document Count
- DocumentsAPI.aggregate_count(query: str | None = None, filter: Filter | dict | None = None) -> int
Count of documents matching the specified filters and search.
- Parameters
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to count.
- Returns
The number of documents matching the specified filters and search.
- Return type
int
Examples
Count the number of documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> count = c.documents.aggregate_count()
Count the number of PDF documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> pdf_count = c.documents.aggregate_count(filter=is_pdf)
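The filter parameter also accepts a plain dict. As an assumption based on the general shape of the CDF filter language (the text above does not spell it out), the dictionary form of the is_pdf filter might look roughly like the structure built below. The helper equals_filter is hypothetical, and the property path ["mimeType"] should be checked against the API reference before use.

```python
import json


def equals_filter(property_path, value):
    """Hypothetical helper: build an 'equals' leaf filter as a plain dict."""
    return {"equals": {"property": list(property_path), "value": value}}


# Assumed property path for the mime type; verify against the API reference.
is_pdf_dict = equals_filter(["mimeType"], "application/pdf")
print(json.dumps(is_pdf_dict, indent=2))
```

If the assumed shape is right, such a dict could be passed as filter=is_pdf_dict in place of the filters.Equals object.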
Aggregate Document Value Cardinality
- DocumentsAPI.aggregate_cardinality_values(property: DocumentProperty | SourceFileProperty | list[str] | str, query: str | None = None, filter: Filter | dict | None = None, aggregate_filter: AggregationFilter | dict | None = None) -> int
Find the approximate number of unique values of a property across documents.
- Parameters
property (DocumentProperty | SourceFileProperty | list[str] | str) – The property to count the cardinality of.
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to count cardinality.
aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.
- Returns
The approximate number of unique values of the property among documents matching the specified filters and search.
- Return type
int
Examples
Count the number of types of documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> count = c.documents.aggregate_cardinality_values(DocumentProperty.type)
Count the number of authors of text/plain documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_plain_text = filters.Equals(DocumentProperty.mime_type, "text/plain")
>>> plain_text_author_count = c.documents.aggregate_cardinality_values(DocumentProperty.author, filter=is_plain_text)
Count the number of types of documents in your CDF project, but exclude types that start with “text”:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.data_classes import aggregations
>>> c = CogniteClient()
>>> agg = aggregations
>>> is_not_text = agg.Not(agg.Prefix("text"))
>>> type_count_excluded_text = c.documents.aggregate_cardinality_values(DocumentProperty.type, aggregate_filter=is_not_text)
Aggregate Document Property Cardinality
- DocumentsAPI.aggregate_cardinality_properties(path: DocumentProperty | SourceFileProperty | list[str] | str, query: str | None = None, filter: Filter | dict | None = None, aggregate_filter: AggregationFilter | dict | None = None) -> int
Find the approximate number of unique properties (such as metadata keys) under a path across documents.
- Parameters
path (DocumentProperty | SourceFileProperty | list[str] | str) – The scope in every document to aggregate properties. The only value allowed now is [“metadata”]. It means to aggregate only metadata properties (aka keys).
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to count cardinality.
aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.
- Returns
The approximate number of unique properties under the given path among documents matching the specified filters and search.
- Return type
int
Examples
Count the number of metadata keys for documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import SourceFileProperty
>>> c = CogniteClient()
>>> count = c.documents.aggregate_cardinality_properties(SourceFileProperty.metadata)
Aggregate Document Unique Values
- DocumentsAPI.aggregate_unique_values(property: DocumentProperty | SourceFileProperty | list[str] | str, query: str | None = None, filter: Filter | dict | None = None, aggregate_filter: AggregationFilter | dict | None = None, limit: int = 25) -> UniqueResultList
Get unique properties with counts for documents.
- Parameters
property (DocumentProperty | SourceFileProperty | list[str] | str) – The property to group by.
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to aggregate.
aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.
limit (int) – Maximum number of items. Defaults to 25.
- Returns
List of unique values of documents matching the specified filters and search.
- Return type
UniqueResultList
Examples
Get the unique mime types with count of documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> result = c.documents.aggregate_unique_values(DocumentProperty.mime_type)
>>> unique_types = result.unique
Get the different languages with count for documents with external id prefix “abc”:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_abc = filters.Prefix(DocumentProperty.external_id, "abc")
>>> result = c.documents.aggregate_unique_values(DocumentProperty.language, filter=is_abc)
>>> unique_languages = result.unique
Get the unique mime types with count of documents, but exclude mime types that start with text:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.data_classes import aggregations
>>> c = CogniteClient()
>>> agg = aggregations
>>> is_not_text = agg.Not(agg.Prefix("text"))
>>> result = c.documents.aggregate_unique_values(DocumentProperty.mime_type, aggregate_filter=is_not_text)
>>> unique_mime_types = result.unique
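To make the difference between the count, cardinality, and unique-value aggregates concrete, here is a minimal pure-Python sketch over a hypothetical in-memory list of records. It does not call CDF and is not how the service computes its (approximate) results; the record layout and the helper names unique_values and is_not_text are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical stand-in for documents; only the mime type is modelled.
docs = [
    {"mime_type": "application/pdf"},
    {"mime_type": "application/pdf"},
    {"mime_type": "text/plain"},
    {"mime_type": "text/csv"},
    {"mime_type": "image/png"},
]


def unique_values(records, prop, bucket_filter=None):
    """Group records by `prop` and return a {value: count} dict.

    `bucket_filter` plays the role of `aggregate_filter`: it is applied to
    the resulting buckets (the distinct values), not to the documents.
    """
    buckets = Counter(record[prop] for record in records)
    if bucket_filter is None:
        return dict(buckets)
    return {value: n for value, n in buckets.items() if bucket_filter(value)}


def is_not_text(value):
    """Keep only buckets whose value does not start with 'text'."""
    return not value.startswith("text")


total = len(docs)                               # analogous to aggregate_count
all_buckets = unique_values(docs, "mime_type")  # analogous to aggregate_unique_values
cardinality = len(all_buckets)                  # analogous to aggregate_cardinality_values
non_text = unique_values(docs, "mime_type", is_not_text)

print(total, cardinality, non_text)
```

In this sketch the document filter would be applied before grouping, while the bucket filter is applied after it, which mirrors the distinction between filter and aggregate_filter described in the parameter lists.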
Aggregate Document Unique Properties
- DocumentsAPI.aggregate_unique_properties(path: DocumentProperty | SourceFileProperty | list[str] | str, query: str | None = None, filter: Filter | dict | None = None, aggregate_filter: AggregationFilter | dict | None = None, limit: int = 25) -> UniqueResultList
Get unique paths with counts for documents.
- Parameters
path (DocumentProperty | SourceFileProperty | list[str] | str) – The scope in every document to aggregate properties. The only value allowed now is [“metadata”]. It means to aggregate only metadata properties (aka keys).
query (str | None) – The free text search query, for details see the documentation referenced above.
filter (Filter | dict | None) – The filter to narrow down the documents to aggregate.
aggregate_filter (AggregationFilter | dict | None) – The filter to apply to the resulting buckets.
limit (int) – Maximum number of items. Defaults to 25.
- Returns
List of unique properties, with counts, of documents matching the specified filters and search.
- Return type
UniqueResultList
Examples
Get the unique metadata keys with count of documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes.documents import SourceFileProperty
>>> c = CogniteClient()
>>> result = c.documents.aggregate_unique_properties(SourceFileProperty.metadata)
List Documents
- DocumentsAPI.list(filter: Filter | dict | None = None, limit: int | None = 25) -> DocumentList
You can use filters to narrow down the list. Unlike the search method, list does not restrict the number of documents to return, meaning that setting the limit to -1 will return all the documents in your project.
- Parameters
filter (Filter | dict | None) – The filter to narrow down the documents to return.
limit (int | None) – Maximum number of documents to return. Defaults to 25. Set to None or -1 to return all documents.
- Returns
List of documents
- Return type
DocumentList
Examples
List all PDF documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> pdf_documents = c.documents.list(filter=is_pdf)
Iterate over all documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> for document in c.documents:
...     print(document.source_file.name)
Retrieve Document Content
- DocumentsAPI.retrieve_content(id: int) -> bytes
Returns extracted textual information for the given document.
The document pipeline extracts up to 1MiB of textual information from each processed document. The search and list endpoints truncate the textual content of each document, in order to reduce the size of the returned payload. If you want the whole text for a document, you can use this endpoint.
- Parameters
id (int) – The server-generated ID for the document you want to retrieve the content of.
- Returns
The content of the document.
- Return type
bytes
Examples
Retrieve the content of a document with id 123:
>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.retrieve_content(id=123)
Retrieve Document Content Buffer
- DocumentsAPI.retrieve_content_buffer(id: int, buffer: BinaryIO) -> None
Retrieve document content into buffer.
Returns extracted textual information for the given document.
The document pipeline extracts up to 1MiB of textual information from each processed document. The search and list endpoints truncate the textual content of each document, in order to reduce the size of the returned payload. If you want the whole text for a document, you can use this endpoint.
- Parameters
id (int) – The server-generated ID for the document you want to retrieve the content of.
buffer (BinaryIO) – The document content is streamed directly into the buffer. This is useful for retrieving large documents.
Examples
Retrieve the content of a document with id 123 into local file “my_file.txt”:
>>> from cognite.client import CogniteClient
>>> from pathlib import Path
>>> c = CogniteClient()
>>> with Path("my_file.txt").open("wb") as buffer:
...     c.documents.retrieve_content_buffer(id=123, buffer=buffer)
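Since buffer only needs to be a binary file-like object, an in-memory io.BytesIO should work as well as a file on disk. The sketch below uses a hypothetical stand-in function, fake_retrieve_content_buffer, in place of the real call so that it runs without a CDF project; decoding the result as UTF-8 is an assumption about the content encoding.

```python
import io
from typing import BinaryIO


def fake_retrieve_content_buffer(id: int, buffer: BinaryIO) -> None:
    """Hypothetical stand-in for c.documents.retrieve_content_buffer."""
    # Write the payload in chunks, as a streamed download would.
    for chunk in (b"pump 123 ", b"maintenance log"):
        buffer.write(chunk)


buffer = io.BytesIO()
fake_retrieve_content_buffer(id=123, buffer=buffer)

# getvalue() returns everything written so far, regardless of stream position.
text = buffer.getvalue().decode("utf-8")
print(text)
```

An in-memory buffer keeps the whole text in RAM, so for large documents a file opened in "wb" mode, as in the example above, is likely the better choice.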
Search Documents
- DocumentsAPI.search(query: str, highlight: Literal[False] = False, filter: Filter | dict | None = None, sort: DocumentSort | str | list[str] | tuple[SortableProperty, Literal['asc', 'desc']] | None = None, limit: int = DEFAULT_LIMIT_READ) -> DocumentList
- DocumentsAPI.search(query: str, highlight: Literal[True], filter: Filter | dict | None = None, sort: DocumentSort | str | list[str] | tuple[SortableProperty, Literal['asc', 'desc']] | None = None, limit: int = DEFAULT_LIMIT_READ) -> DocumentHighlightList
This endpoint lets you search for documents by using advanced filters and free text queries. Free text queries are matched against the documents’ filenames and contents. For more information, see endpoint documentation referenced above.
- Parameters
query (str) – The free text search query.
highlight (bool) – Whether or not matches in search results should be highlighted.
filter (Filter | dict | None) – The filter to narrow down the documents to search.
sort (DocumentSort | SortableProperty | tuple[SortableProperty, Literal["asc", "desc"]] | None) – The property to sort by. The default order is ascending.
limit (int) – Maximum number of items to return. When using highlights, the maximum value is reduced to 20. Defaults to 25.
- Returns
List of search results. If highlight is True, a DocumentHighlightList is returned, otherwise a DocumentList is returned.
- Return type
DocumentList | DocumentHighlightList
Examples
Search for text “pump 123” in PDF documents in your CDF project:
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> c = CogniteClient()
>>> is_pdf = filters.Equals(DocumentProperty.mime_type, "application/pdf")
>>> documents = c.documents.search("pump 123", filter=is_pdf)
Find all documents with the exact text ‘CPLEX Error 1217: No Solution exists.’ in plain text files created in the last week in your CDF project, and highlight the matches:
>>> from datetime import datetime, timedelta
>>> from cognite.client import CogniteClient
>>> from cognite.client.data_classes import filters
>>> from cognite.client.data_classes.documents import DocumentProperty
>>> from cognite.client.utils import timestamp_to_ms
>>> c = CogniteClient()
>>> is_plain_text = filters.Equals(DocumentProperty.mime_type, "text/plain")
>>> last_week = filters.Range(DocumentProperty.created_time,
...     gt=timestamp_to_ms(datetime.now() - timedelta(days=7)))
>>> documents = c.documents.search('"CPLEX Error 1217: No Solution exists."',
...     highlight=True,
...     filter=filters.And(is_plain_text, last_week))
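The range filter above compares created_time against a cutoff expressed in milliseconds since Jan 1, 1970. As a rough standard-library sketch of how such a cutoff can be computed: the helper name ms_since_epoch is hypothetical and is not part of the SDK, and using timezone-aware UTC datetimes is a choice made here to avoid local-time ambiguity.

```python
from datetime import datetime, timedelta, timezone


def ms_since_epoch(dt: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since Jan 1, 1970 UTC."""
    return int(dt.timestamp() * 1000)


# A fixed "now" keeps the example reproducible;
# in practice one would use datetime.now(timezone.utc).
now = datetime(2024, 1, 8, tzinfo=timezone.utc)
cutoff_ms = ms_since_epoch(now - timedelta(days=7))
print(cutoff_ms)
```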
Documents classes
- class cognite.client.data_classes.documents.Document(id: int, created_time: int, source_file: SourceFile, external_id: str | None = None, title: str | None = None, author: str | None = None, producer: str | None = None, modified_time: int | None = None, last_indexed_time: int | None = None, mime_type: str | None = None, extension: str | None = None, page_count: int | None = None, type: str | None = None, language: str | None = None, truncated_content: str | None = None, asset_ids: list[int] | None = None, labels: list[Label | str | LabelDefinition] | None = None, geo_location: GeoLocation | None = None, cognite_client: CogniteClient | None = None, **_: Any)
Bases:
CogniteResource
A representation of a document in CDF.
- Parameters
id (int) – A server-generated ID for the object.
created_time (int) – The creation time of the document in CDF in milliseconds since Jan 1, 1970.
source_file (SourceFile) – The source file that this document is derived from.
external_id (str | None) – The external ID provided by the client. Must be unique for the resource type.
title (str | None) – The title of the document.
author (str | None) – The author of the document.
producer (str | None) – The producer of the document. Many document types contain metadata indicating what software or system was used to create the document.
modified_time (int | None) – The last time the document was modified in CDF in milliseconds since Jan 1, 1970.
last_indexed_time (int | None) – The last time the document was indexed in the search engine, measured in milliseconds since Jan 1, 1970.
mime_type (str | None) – The detected mime type of the document.
extension (str | None) – Extension of the file (always in lowercase).
page_count (int | None) – The number of pages in the document.
type (str | None) – The detected type of the document.
language (str | None) – The detected language of the document.
truncated_content (str | None) – The truncated content of the document.
asset_ids (list[int] | None) – The ids of any assets referred to in the document.
labels (list[Label | str | LabelDefinition] | None) – The labels attached to the document.
geo_location (GeoLocation | None) – The geolocation of the document.
cognite_client (CogniteClient | None) – No description.
**_ (Any) – No description.
- dump(camel_case: bool = False) -> dict[str, Any]
Dump the instance into a json serializable Python data type.
- Parameters
camel_case (bool) – Use camelCase for attribute names. Defaults to False.
- Returns
A dictionary representation of the instance.
- Return type
dict[str, Any]
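The camel_case flag only changes how attribute names are spelled in the dumped dictionary. A rough standalone sketch of that key conversion follows; it is not the SDK's implementation, to_camel_case is a hypothetical helper, and the sample values are made up.

```python
def to_camel_case(snake: str) -> str:
    """Convert a snake_case name such as 'created_time' to 'createdTime'."""
    head, *tail = snake.split("_")
    return head + "".join(part.capitalize() for part in tail)


# Made-up values shaped like a few Document attributes.
snake_dump = {
    "id": 123,
    "created_time": 1700000000000,
    "mime_type": "application/pdf",
}

# What the same data would look like with camelCase keys.
camel_dump = {to_camel_case(key): value for key, value in snake_dump.items()}
print(camel_dump)
```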
- class cognite.client.data_classes.documents.DocumentHighlight(highlight: Highlight, document: Document)
Bases:
CogniteResource
A pair of a document and highlights.
This is used in search results to represent the result.
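- Parameters
highlight (Highlight) – The highlighted snippets for the document.
document (Document) – The matching document.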
- dump(camel_case: bool = False) -> dict[str, Any]
Dump the instance into a json serializable Python data type.
- Parameters
camel_case (bool) – Use camelCase for attribute names. Defaults to False.
- Returns
A dictionary representation of the instance.
- Return type
dict[str, Any]
- class cognite.client.data_classes.documents.DocumentHighlightList(resources: Collection[Any], cognite_client: CogniteClient | None = None)
Bases:
CogniteResourceList[DocumentHighlight]
- class cognite.client.data_classes.documents.DocumentList(resources: Collection[Any], cognite_client: CogniteClient | None = None)
Bases:
CogniteResourceList[Document], IdTransformerMixin
- class cognite.client.data_classes.documents.DocumentProperty(value)
Bases:
EnumProperty
An enumeration.
- class cognite.client.data_classes.documents.DocumentUniqueResult(count: int, values: list[str | int | float | Label])
Bases:
UniqueResult
- class cognite.client.data_classes.documents.Highlight(name: list[str], content: list[str])
Bases:
CogniteResource
Highlighted snippets from the name and content fields, showing where the query matched.
This is used in search results to represent the result.
- Parameters
name (list[str]) – Matches in name.
content (list[str]) – Matches in content.
- dump(camel_case: bool = False) -> dict[str, Any]
Dump the instance into a json serializable Python data type.
- Parameters
camel_case (bool) – Use camelCase for attribute names. Defaults to False.
- Returns
A dictionary representation of the instance.
- Return type
dict[str, Any]
- class cognite.client.data_classes.documents.SortableDocumentProperty(value)
Bases:
EnumProperty
An enumeration.
- class cognite.client.data_classes.documents.SortableSourceFileProperty(value)
Bases:
EnumProperty
An enumeration.
- class cognite.client.data_classes.documents.SourceFile(name: str, hash: str | None = None, directory: str | None = None, source: str | None = None, mime_type: str | None = None, size: int | None = None, asset_ids: list[int] | None = None, labels: list[Label | str | LabelDefinition] | None = None, geo_location: GeoLocation | None = None, dataset_id: int | None = None, security_categories: list[int] | None = None, metadata: dict[str, str] | None = None, cognite_client: CogniteClient | None = None, **_: Any)
Bases:
CogniteResource
The source file that a document is derived from.
- Parameters
name (str) – The name of the source file.
hash (str | None) – The hash of the source file. This is a SHA256 hash of the original file. The hash only covers the file content, and not other CDF metadata.
directory (str | None) – The directory the file can be found in.
source (str | None) – The source of the file.
mime_type (str | None) – The mime type of the file.
size (int | None) – The size of the file in bytes.
asset_ids (list[int] | None) – The ids of the assets related to this file.
labels (list[Label | str | LabelDefinition] | None) – A list of labels associated with this document’s source file in CDF.
geo_location (GeoLocation | None) – The geolocation of the source file.
dataset_id (int | None) – The id of the dataset this file belongs to, if any.
security_categories (list[int] | None) – The security category IDs required to access this file.
metadata (dict[str, str] | None) – Custom, application specific metadata. String key -> String value.
cognite_client (CogniteClient | None) – No description.
**_ (Any) – No description.
- dump(camel_case: bool = False) -> dict[str, Any]
Dump the instance into a json serializable Python data type.
- Parameters
camel_case (bool) – Use camelCase for attribute names. Defaults to False.
- Returns
A dictionary representation of the instance.
- Return type
dict[str, Any]
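The hash parameter of SourceFile is described above as a SHA256 hash of the original file content. A minimal sketch of computing such a hash locally, for example to check whether a local file matches a document's source file, could look like this. It assumes the hash is represented as a lowercase hex digest, which the text above does not state, and sha256_hex is a hypothetical helper.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_hex(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps reading until read() returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# An in-memory stream stands in for a file opened with open(path, "rb").
local_hash = sha256_hex(io.BytesIO(b"abc"))
print(local_hash)
```

Because the hash covers only the file content, two files with different names or metadata but identical bytes would produce the same value.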
- class cognite.client.data_classes.documents.SourceFileProperty(value)
Bases:
EnumProperty
An enumeration.
- class cognite.client.data_classes.documents.TemporaryLink(url: str, expires_at: int)
Bases:
object
Preview
Download Image Preview Bytes
- DocumentPreviewAPI.download_page_as_png_bytes(id: int, page_number: int = 1) -> bytes
Downloads an image preview for a specific page of the specified document.
- Parameters
id (int) – The server-generated ID for the document you want to retrieve the preview of.
page_number (int) – Page number to preview. Starting at 1 for first page.
- Returns
The png preview of the document.
- Return type
bytes
Examples
Download image preview of page 5 of file with id 123:
>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.previews.download_page_as_png_bytes(id=123, page_number=5)
Download an image preview and display using IPython.display.Image (for example in a Jupyter Notebook):
>>> from cognite.client import CogniteClient
>>> from IPython.display import Image
>>> c = CogniteClient()
>>> binary_png = c.documents.previews.download_page_as_png_bytes(id=123, page_number=5)
>>> Image(binary_png)
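Before handing the returned bytes to an image library, it can be useful to sanity-check that they really are a PNG. The sketch below checks the standard 8-byte PNG file signature; looks_like_png is a hypothetical helper, and the sample bytes are a stand-in for the value returned by download_page_as_png_bytes.

```python
# The fixed 8-byte signature that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Return True if `data` begins with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


# Stand-in payloads; a real preview would come from the API call above.
sample_png = PNG_SIGNATURE + b"...rest of the image..."
sample_other = b"%PDF-1.7"

print(looks_like_png(sample_png), looks_like_png(sample_other))
```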
Download Image Preview to Path
- DocumentPreviewAPI.download_page_as_png(path: Path | str | IO, id: int, page_number: int = 1, overwrite: bool = False) -> None
Downloads an image preview for a specific page of the specified document.
- Parameters
path (Path | str | IO) – The path to save the png preview of the document. If the path is a directory, the file name will be ‘[id]_page[page_number].png’.
id (int) – The server-generated ID for the document you want to retrieve the preview of.
page_number (int) – Page number to preview. Starting at 1 for first page.
overwrite (bool) – Whether to overwrite existing file at the given path. Defaults to False.
Examples
Download image preview of page 5 of file with id 123 to folder “previews”:
>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> c.documents.previews.download_page_as_png("previews", id=123, page_number=5)
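The path parameter above says that a directory path results in a file named ‘[id]_page[page_number].png’. The following sketch shows one way that rule could be expressed with pathlib. It is an illustration only, not the SDK's code: resolve_preview_path is a hypothetical helper, and reading the bracketed parts as placeholders for the actual values is an assumption.

```python
import tempfile
from pathlib import Path


def resolve_preview_path(path, id: int, page_number: int) -> Path:
    """Return the file a preview would be written to for the given path."""
    path = Path(path)
    if path.is_dir():
        # Directory given: derive the file name from id and page number.
        return path / f"{id}_page{page_number}.png"
    # Otherwise treat the path as the target file itself.
    return path


with tempfile.TemporaryDirectory() as tmp:
    in_dir = resolve_preview_path(tmp, id=123, page_number=5)
    explicit = resolve_preview_path(Path(tmp) / "pump.png", id=123, page_number=5)

print(in_dir.name, explicit.name)
```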
Download PDF Preview Bytes
- DocumentPreviewAPI.download_document_as_pdf_bytes(id: int) -> bytes
Downloads a pdf preview of the specified document.
Only the first 100 pages will be included.
Previews will be rendered if necessary during the request. Be prepared for the request to take a few seconds to complete.
- Parameters
id (int) – The server-generated ID for the document you want to retrieve the preview of.
- Returns
The pdf preview of the document.
- Return type
bytes
Examples
Download PDF preview of file with id 123:
>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> content = c.documents.previews.download_document_as_pdf_bytes(id=123)
Download PDF Preview to Path
- DocumentPreviewAPI.download_document_as_pdf(path: Path | str | IO, id: int, overwrite: bool = False) -> None
Downloads a pdf preview of the specified document.
Only the first 100 pages will be included.
Previews will be rendered if necessary during the request. Be prepared for the request to take a few seconds to complete.
- Parameters
path (Path | str | IO) – The path to save the pdf preview of the document. If the path is a directory, the file name will be ‘[id].pdf’.
id (int) – The server-generated ID for the document you want to retrieve the preview of.
overwrite (bool) – Whether to overwrite existing file at the given path. Defaults to False.
Examples
Download PDF preview of file with id 123 to folder “previews”:
>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> c.documents.previews.download_document_as_pdf("previews", id=123)
Retrieve PDF Preview Temporary Link
- DocumentPreviewAPI.retrieve_pdf_link(id: int) -> TemporaryLink
Retrieve a temporary link to download the pdf preview.
- Parameters
id (int) – The server-generated ID for the document you want to retrieve the preview of.
- Returns
A temporary link to download the pdf preview.
- Return type
TemporaryLink
Examples
Retrieve the PDF preview download link for document with id 123:
>>> from cognite.client import CogniteClient
>>> c = CogniteClient()
>>> link = c.documents.previews.retrieve_pdf_link(id=123)
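Because the returned link is temporary, a caller may want to check that it is still valid before using it. The sketch below uses a hypothetical stand-in class, LinkStandIn, with the same two fields as TemporaryLink so that it runs without a CDF project. It assumes expires_at is given in milliseconds since Jan 1, 1970, like the other timestamps in this API, which the text above does not confirm; the URL and the safety margin are made-up values.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkStandIn:
    """Hypothetical stand-in mirroring the fields of TemporaryLink."""

    url: str
    expires_at: int  # assumed: milliseconds since Jan 1, 1970


def is_expired(link: LinkStandIn, now_ms: Optional[int] = None, margin_ms: int = 5_000) -> bool:
    """Return True if the link has expired or will expire within `margin_ms`."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms + margin_ms >= link.expires_at


# Made-up link that expires at a fixed point in time.
link = LinkStandIn(url="https://example.com/preview.pdf", expires_at=1_700_000_060_000)

still_valid = not is_expired(link, now_ms=1_700_000_000_000)
too_late = is_expired(link, now_ms=1_700_000_058_000)
print(still_valid, too_late)
```

If a link turns out to be expired, calling retrieve_pdf_link again should yield a fresh one.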