Retrieve datapoints in pandas dataframe
- async AsyncCogniteClient.time_series.data.retrieve_dataframe(
- *,
- id: None | int | DatapointsQuery | Sequence[int | DatapointsQuery] = None,
- external_id: None | str | DatapointsQuery | SequenceNotStr[str | DatapointsQuery] = None,
- instance_id: None | NodeId | DatapointsQuery | Sequence[NodeId | DatapointsQuery] = None,
- start: int | str | datetime.datetime | None = None,
- end: int | str | datetime.datetime | None = None,
- aggregates: Aggregate | str | list[Aggregate | str] | None = None,
- granularity: str | None = None,
- timezone: str | datetime.timezone | ZoneInfo | None = None,
- target_unit: str | None = None,
- target_unit_system: str | None = None,
- limit: int | None = None,
- include_outside_points: bool = False,
- ignore_unknown_ids: bool = False,
- ignore_bad_datapoints: bool = True,
- treat_uncertain_as_bad: bool = True,
- uniform_index: bool = False,
- include_status: bool = False,
- include_unit: bool = True,
- include_aggregate_name: bool = True,
- include_granularity_name: bool = False,
- ) -> pd.DataFrame
Get datapoints directly in a pandas dataframe.
Time series support status codes like Good, Uncertain and Bad. You can read more in the Cognite Data Fusion developer documentation on status codes.
Note
For many more usage examples, check out the retrieve() method, which accepts exactly the same arguments.
- Parameters:
id (None | int | DatapointsQuery | Sequence[int | DatapointsQuery]) – Id, DatapointsQuery or (mixed) sequence of these. See examples.
external_id (None | str | DatapointsQuery | SequenceNotStr[str | DatapointsQuery]) – External id, DatapointsQuery or (mixed) sequence of these. See examples.
instance_id (None | NodeId | DatapointsQuery | Sequence[NodeId | DatapointsQuery]) – Instance id, DatapointsQuery or (mixed) sequence of these. See examples.
start (int | str | datetime.datetime | None) – Inclusive start. Default: 1970-01-01 UTC.
end (int | str | datetime.datetime | None) – Exclusive end. Default: “now”
aggregates (Aggregate | str | list[Aggregate | str] | None) – Single aggregate or list of aggregates to retrieve. Available options: average, continuous_variance, count, count_bad, count_good, count_uncertain, discrete_variance, duration_bad, duration_good, duration_uncertain, interpolation, max, max_datapoint, min, min_datapoint, step_interpolation, sum and total_variation. Default: None (raw datapoints returned)
granularity (str | None) – The granularity to fetch aggregates at. Can be given as an abbreviation or spelled out for clarity: s/second(s), m/minute(s), h/hour(s), d/day(s), w/week(s), mo/month(s), q/quarter(s), or y/year(s). Examples: 30s, 5m, 1day, 2weeks. Default: None.
timezone (str | datetime.timezone | ZoneInfo | None) – For raw datapoints, which timezone to use when displaying (will not affect what is retrieved). For aggregates, which timezone to align to for granularity ‘hour’ and longer: align to the start of the hour, day or month. For timezones of type Region/Location, like ‘Europe/Oslo’, pass a string or a ZoneInfo instance. The aggregate duration will then vary, typically due to daylight saving time. You can also use a fixed offset from UTC by passing a string like ‘+04:00’, ‘UTC-7’ or ‘UTC-02:30’, or an instance of datetime.timezone. Note: Historical timezones with second offset are not supported, and timezones with minute offsets (e.g. UTC+05:30 or Asia/Kolkata) may take longer to execute.
target_unit (str | None) – The unit_external_id of the datapoints returned. If the time series does not have a unit_external_id that can be converted to the target_unit, an error will be returned. Cannot be used with target_unit_system.
target_unit_system (str | None) – The unit system of the datapoints returned. Cannot be used with target_unit.
limit (int | None) – Maximum number of datapoints to return for each time series. Default: None (no limit)
include_outside_points (bool) – Whether to include outside points. Not allowed when fetching aggregates. Default: False
ignore_unknown_ids (bool) – Whether to ignore missing time series rather than raising an exception. Default: False
ignore_bad_datapoints (bool) – Treat datapoints with a bad status code as if they do not exist. If set to false, raw queries will include bad datapoints in the response, and aggregates will in general omit the time period between a bad datapoint and the next good datapoint. Also, the period between a bad datapoint and the previous good datapoint will be considered constant. Default: True.
treat_uncertain_as_bad (bool) – Treat datapoints with uncertain status codes as bad. If false, treat datapoints with uncertain status codes as good. Used for both raw queries and aggregates. Default: True.
uniform_index (bool) – If only querying aggregates AND a single granularity is used (that’s NOT a calendar granularity like month/quarter/year) AND no limit is used AND no timezone is used, specifying uniform_index=True will return a dataframe with an equidistant datetime index from the earliest start to the latest end (missing values will be NaNs). If these requirements are not met, a ValueError is raised. Default: False
include_status (bool) – Also return the status code, an integer, for each datapoint in the response. Only relevant for raw datapoint queries and the object aggregates min_datapoint and max_datapoint. Also adds the status info as a separate level in the columns (MultiIndex).
include_unit (bool) – Include the unit_external_id in the dataframe columns, if present (separate MultiIndex level)
include_aggregate_name (bool) – Include aggregate in the dataframe columns, if present (separate MultiIndex level)
include_granularity_name (bool) – Include granularity in the dataframe columns, if present (separate MultiIndex level)
- Returns:
A pandas DataFrame containing the requested time series. The ordering of columns is ids first, then external_ids, and lastly instance_ids. For time series with multiple aggregates, they will be sorted in alphabetical order (“average” before “max”).
- Return type:
pd.DataFrame
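To illustrate what uniform_index=True produces, here is a plain-pandas sketch with made-up data (a simplified illustration, not the SDK's actual implementation): reindexing onto an equidistant date_range makes missing aggregate buckets show up as explicit NaN rows.

```python
import pandas as pd

# Daily 'average' aggregates for a hypothetical time series "foo",
# where the 2023-01-03 bucket is missing (no datapoints that day):
df = pd.DataFrame(
    {"foo|average": [1.5, 2.5, 4.5]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04"]),
)

# An equidistant index at the query granularity ("1d"),
# from the earliest start to the latest end:
uniform = pd.date_range("2023-01-01", "2023-01-04", freq="D")
df_uniform = df.reindex(uniform)

# The missing bucket is now an explicit NaN row:
print(df_uniform["foo|average"].tolist())  # [1.5, 2.5, nan, 4.5]
```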
Tip
Pandas DataFrames have one shared index, so when you fetch datapoints from multiple time series, the final index will be the union of all the timestamps. Thus, unless all time series have the exact same timestamps, the various columns will contain NaNs to fill the “missing” values. For lower memory usage on unaligned data, use the retrieve_arrays() method.
Warning
If you have duplicated time series in your query, the dataframe columns will also contain duplicates.
When retrieving raw datapoints with ignore_bad_datapoints=False, bad datapoints with the value NaN cannot be distinguished from those missing a value (due to being stored in a numpy array); all will become NaNs in the dataframe.
Examples
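The shared-index behaviour described in the tip above is plain pandas semantics and can be reproduced without a Cognite client; two made-up series with partially overlapping timestamps get NaNs wherever the other series has a timestamp they lack:

```python
import pandas as pd

# Two series with only partially overlapping timestamps,
# standing in for two time series fetched in one call:
ts_a = pd.Series([1.0, 2.0], index=pd.to_datetime(["2023-01-01", "2023-01-02"]), name="a")
ts_b = pd.Series([10.0, 20.0], index=pd.to_datetime(["2023-01-02", "2023-01-03"]), name="b")

# Column-wise concat takes the union of the two indexes, filling gaps with NaN:
df = pd.concat([ts_a, ts_b], axis=1)

# Each column has exactly one NaN, at the timestamp only the other series has:
print(df.isna().sum().to_dict())  # {'a': 1, 'b': 1}
```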
Get a pandas dataframe using a single time series external ID, with data from the last two weeks, but with no more than 100 datapoints:
>>> from cognite.client import CogniteClient, AsyncCogniteClient
>>> client = CogniteClient()
>>> # async_client = AsyncCogniteClient()  # another option
>>> df = client.time_series.data.retrieve_dataframe(
...     external_id="foo", start="2w-ago", end="now", limit=100
... )
Get the pandas dataframe with a uniform index (fixed spacing between points) of 1 day, for two time series with individually specified aggregates, from 1990 through 2020:
>>> from datetime import datetime, timezone
>>> from cognite.client.data_classes import DatapointsQuery
>>> df = client.time_series.data.retrieve_dataframe(
...     external_id=[
...         DatapointsQuery(external_id="foo", aggregates="discrete_variance"),
...         DatapointsQuery(
...             external_id="bar", aggregates=["total_variation", "continuous_variance"]
...         ),
...     ],
...     granularity="1d",
...     start=datetime(1990, 1, 1, tzinfo=timezone.utc),
...     end=datetime(2020, 12, 31, tzinfo=timezone.utc),
...     uniform_index=True,
... )
Get a pandas dataframe containing the ‘average’ aggregate for two time series using a monthly granularity, starting Jan 1, 1970 all the way up to present, without having the aggregate name in the columns:
>>> df = client.time_series.data.retrieve_dataframe(
...     external_id=["foo", "bar"],
...     aggregates="average",
...     granularity="1mo",
...     include_aggregate_name=False,
... )
You may also use pandas.Timestamp to define start and end. Here we fetch using instance_id:
>>> import pandas as pd
>>> from cognite.client.data_classes.data_modeling import NodeId
>>> df = client.time_series.data.retrieve_dataframe(
...     instance_id=NodeId("my-space", "my-ts-xid"),
...     start=pd.Timestamp("2023-01-01"),
...     end=pd.Timestamp("2023-02-01"),
... )