arcgis.features module
The arcgis.features module contains types and functions for working with features and feature layers in the GIS.
Entities located in space with a geometrical representation (such as points, lines, or polygons) and a set of properties can be represented as features. The arcgis.features module is used for working with feature data, feature layers, and collections of feature layers in the GIS. It also contains the spatial analysis functions that operate on feature data.
In the GIS, entities located in space with a set of properties can be represented as features. Features are stored as feature classes, which represent a set of features located using a single spatial type (point, line, polygon) and a common set of properties. This is the geographic extension of the classic tabular or relational representation for entities - a set of entities is modelled as rows in a table. Tables represent entity classes with uniform properties. In addition to working with entities with location as features, the system can also work with non-spatial entities as rows in tables. The system can also model relationships between entities using properties which act as primary and foreign keys. A collection of feature classes and tables, with the associated relationships among the entities, is a feature layer collection. FeatureLayerCollections are one of the dataset types contained in a Datastore. Finally, features are not simply entities in a dataset. Features have a visual representation and user experience - on a map, in a 3D scene, as entities with a property sheet or popups.
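As a quick orientation, here is a minimal sketch (not from the original documentation) of obtaining a feature layer, assuming an anonymous ArcGIS Online connection and a hypothetical search string:

>>> from arcgis.gis import GIS
>>> gis = GIS()  # anonymous connection to ArcGIS Online
>>> item = gis.content.search('USA Cities', item_type='Feature Service')[0]
>>> flayer = item.layers[0]  # a FeatureLayer from the item's layers attribute
>>> count = flayer.query(where='1=1', return_count_only=True)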
arcgis.features.Feature
class arcgis.features.Feature(geometry=None, attributes=None)
Entities located in space with a set of properties can be represented as features.
as_dict
Returns the feature as a dictionary.
as_row
Converts the feature to a list for insertion into an insert cursor.
Output: [row items], [field names] - returns the row object and a list of field names.
attributes
Returns the feature attributes.
fields
Returns a list of feature fields.
classmethod from_dict(feature)
Returns a feature from a dictionary.
classmethod from_json(json_str)
Returns a feature from a JSON string.
geometry
Returns the feature geometry.
geometry_type
Returns the feature's geometry type.
get_value(field_name)
Returns the value for a given field name.
set_value(field_name, value)
Sets an attribute value for a given field name.
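A minimal sketch (the field names and coordinates are hypothetical, not from the original documentation) of constructing a Feature and reading and updating its attributes:

>>> from arcgis.features import Feature
>>> feat = Feature(geometry={'x': -118.15, 'y': 33.80},
...                attributes={'OBJECTID': 1, 'ZONE': 'R1'})
>>> feat.get_value('ZONE')
'R1'
>>> feat.set_value('ZONE', 'R2')   # update the attribute in memory
>>> feat.as_dict                   # the feature as a dictionary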
arcgis.features.FeatureLayer
class arcgis.features.FeatureLayer(url, gis=None, container=None, dynamic_layer=None)
The feature layer is the primary concept for working with features in a GIS.
Users create, import, export, analyze, edit, and visualize features (entities in space) as feature layers.
Feature layers can be added to and visualized using maps. They act as inputs to and outputs from feature analysis tools.
Feature layers are created by publishing feature data to a GIS, and are exposed as a broader resource (Item) in the GIS. Feature layer objects can be obtained through the layers attribute on feature layer Items in the GIS.
calculate(where, calc_expression, sql_format='standard')
The calculate operation is performed on a feature layer resource. It updates the values of one or more fields in an existing feature service layer based on SQL expressions or scalar values. The calculate operation can only be used if the supportsCalculate property of the layer is true. Neither the Shape field nor system fields (such as ObjectId and GlobalId) can be updated using calculate. See "Calculate a field" in the REST API documentation for more information on supported expressions.
Inputs:
- where - A where clause used to limit the updated records. Any legal SQL where clause operating on the fields in the layer is allowed.
- calc_expression - The field/value info object or objects that contain the field or fields to update and their scalar values or SQL expressions. Allowed types are dictionary and list; a list must be a list of dictionary objects. The calculation format is: {"field": "<field name>", "value": <value>}
- sql_format - The SQL format for the calc_expression. It can be either standard SQL92 ('standard') or native SQL ('native'). The default is 'standard'. Values: standard, native
Output:
- JSON as a string
Usage:
>>> print(fl.calculate(where="OBJECTID < 2",
...                    calc_expression={"field": "ZONE", "value": "R1"}))
{'updatedFeatureCount': 1, 'success': True}
container
The feature layer collection to which this layer belongs.
delete_features(deletes=None, where=None, geometry_filter=None, gdb_version=None, rollback_on_failure=True)
This operation deletes features in a feature layer or table.
Inputs:
- deletes - string of OIDs to remove from the service
- where - A where clause for the query filter. Any legal SQL where clause operating on the fields in the layer is allowed. Features conforming to the specified where clause will be deleted.
- geometry_filter - spatial filter from the arcgis.geometry.filters module to filter results by a spatial relationship with another geometry
- gdb_version - Geodatabase version to apply the edits to.
- rollback_on_failure - Optional parameter to specify whether the edits should be applied only if all submitted edits succeed. If false, the server will apply the edits that succeed even if some of the submitted edits fail. If true, the server will apply the edits only if all edits succeed. The default value is true.
Output:
- dictionary of messages
edit_features(adds=None, updates=None, deletes=None, gdb_version=None, use_global_ids=False, rollback_on_failure=True)
This operation adds, updates, and deletes features in the associated feature layer or table in a single call.
Inputs:
- adds - The array of features to be added.
- updates - The array of features to be updated.
- deletes - string of OIDs to remove from the service
- gdb_version - Geodatabase version to apply the edits to.
- use_global_ids - Instead of referencing the default Object ID field, the service will look at a GUID field to track changes. This means GUIDs will be passed instead of OIDs for delete, update, or add features.
- rollback_on_failure - Optional parameter to specify whether the edits should be applied only if all submitted edits succeed. If false, the server will apply the edits that succeed even if some of the submitted edits fail. If true, the server will apply the edits only if all edits succeed. The default value is true.
Output:
- dictionary of messages
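As an illustration, a sketch of a combined edit (the geometry and field names are hypothetical; fl is assumed to be a FeatureLayer):

>>> new_feat = {'geometry': {'x': -118.15, 'y': 33.80},
...             'attributes': {'ZONE': 'R1'}}
>>> results = fl.edit_features(adds=[new_feat],      # features to insert
...                            deletes='12,13',      # OIDs to remove, as a string
...                            rollback_on_failure=True)
>>> results  # dictionary of add/update/delete result messages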
classmethod fromitem(item, layer_id=0)
Creates a feature layer from a GIS Item. The type of the item should be 'Feature Service', representing a FeatureLayerCollection. The layer_id is the id of the layer in the feature layer collection (feature service).
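For example, a sketch reusing the gis connection and item from the earlier module-level example:

>>> from arcgis.features import FeatureLayer
>>> fl = FeatureLayer.fromitem(item, layer_id=0)  # first layer of the service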
generate_renderer(definition, where=None)
This operation groups data using the supplied classification definition and an optional where clause. The result is a renderer object. Use baseSymbol and colorRamp to define the symbols assigned to each class. If the operation is performed on a table, the result is a renderer object containing the data classes and no symbols.
Arguments:
- definition - required dict. The classification definition used to generate the renderer. Use either class breaks or unique value classification definitions. See: https://resources.arcgis.com/en/help/rest/apiref/ms_classification.html
- where - optional string. A where clause for which the data needs to be classified. Any legal SQL where clause operating on the fields in the dynamic layer/table is allowed.
Returns: dictionary
get_html_popup(oid)
The htmlPopup resource provides details about the HTML pop-up authored by the user using ArcGIS for Desktop.
Input:
- oid - object id of the feature whose HTML pop-up is returned
manager
Helper object to manage the feature layer, update its definition, etc.
properties
The properties of this object.
query(where='1=1', out_fields='*', time_filter=None, geometry_filter=None, return_geometry=True, return_count_only=False, return_ids_only=False, return_distinct_values=False, return_extent_only=False, group_by_fields_for_statistics=None, statistic_filter=None, result_offset=None, result_record_count=None, object_ids=None, distance=None, units=None, max_allowable_offset=None, out_sr=None, geometry_precision=None, gdb_version=None, order_by_fields=None, out_statistics=None, return_z=False, return_m=False, multipatch_option=None, quanitization_parameters=None, return_centroid=False, return_all_records=True, **kwargs)
Queries a feature layer based on a SQL statement.
Inputs:
- where - the selection SQL where clause
- out_fields - the attribute fields to return
- object_ids - The object IDs of this layer or table to be queried.
- distance - The buffer distance for the input geometries. The distance unit is specified by units. For example, if the distance is 100, the query geometry is a point, and units is set to meters, then all points within 100 meters of the point are returned.
- units - The unit for calculating the buffer distance. If unit is not specified, the unit is derived from the geometry spatial reference. If the geometry spatial reference is not specified, the unit is derived from the feature service data spatial reference. This parameter only applies if supportsQueryWithDistance is true. Values: esriSRUnit_Meter | esriSRUnit_StatuteMile | esriSRUnit_Foot | esriSRUnit_Kilometer | esriSRUnit_NauticalMile | esriSRUnit_USNauticalMile
- time_filter - a TimeFilter object where either the start time or the start and end times are defined to limit the search results to a given time window. The values in the time filter should be UTC timestamps in milliseconds. No checking occurs to see if they are in the right format.
- geometry_filter - spatial filter from the arcgis.geometry.filters module to filter results by a spatial relationship with another geometry
- max_allowable_offset - This option can be used to specify the maxAllowableOffset to be used for generalizing geometries returned by the query operation. The maxAllowableOffset is in the units of out_sr. If out_sr is not specified, maxAllowableOffset is assumed to be in the unit of the spatial reference of the map.
- out_sr - The spatial reference of the returned geometry.
- geometry_precision - This option can be used to specify the number of decimal places in the response geometries returned by the query operation.
- gdb_version - Geodatabase version to query.
- return_geometry - If true, geometry is returned with the query. Default is true.
- return_distinct_values - If true, it returns distinct values based on the fields specified in out_fields. This parameter applies only if the supportsAdvancedQueries property of the layer is true.
- return_ids_only - If true, the response only includes an array of object IDs. Otherwise, the response is a feature set. The default is false.
- return_count_only - If true, the response only includes the count (number of features/records) that would be returned by the query. Otherwise, the response is a feature set. The default is false. This option supersedes return_ids_only. If return_count_only is true, the response will return both the count and the extent.
- return_extent_only - If true, the response only includes the extent of the features that would be returned by the query. If return_count_only is also true, the response will return both the count and the extent. The default is false. This parameter applies only if the supportsReturningQueryExtent property of the layer is true.
- order_by_fields - One or more field names by which to order the returned features/records. Use ASC or DESC after each field name to control ascending or descending order.
- group_by_fields_for_statistics - One or more field names on which the values need to be grouped for calculating the statistics.
- out_statistics - The definitions for one or more field-based statistics to be calculated.
- return_z - If true, Z values are included in the results if the features have Z values. Otherwise, Z values are not returned. The default is false.
- return_m - If true, M values are included in the results if the features have M values. Otherwise, M values are not returned. The default is false.
- multipatch_option - This option dictates how the geometry of a multipatch feature will be returned.
- result_offset - This option can be used for fetching query results by skipping the specified number of records and starting from the next record (that is, resultOffset + 1). This option is ignored if return_all_records is True (the default).
- result_record_count - This option can be used for fetching query results up to the result_record_count specified. When result_offset is specified but this parameter is not, the map service defaults it to max_record_count. The maximum value for this parameter is the value of the layer's max_record_count property. This option is ignored if return_all_records is True (the default).
- quantization_parameters - Used to project the geometry onto a virtual grid, likely representing pixels on the screen.
- return_centroid - If true, the result includes the geometry centroid associated with each returned feature. The default is false.
- return_all_records - When True, the query operation will call the service until all records that satisfy the where clause are returned. Note: result_offset and result_record_count will be ignored if return_all_records is True. Also, if return_count_only, return_ids_only, or return_extent_only are True, this parameter is ignored.
- kwargs - optional parameters that can be passed to the query function, allowing users to pass additional parameters not explicitly implemented on the function. A complete list of available parameters is documented in the Query REST API.
Output:
- A FeatureSet containing the features matching the query, unless another return type is specified, such as a count.
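A sketch of typical query calls (the field names are hypothetical; fl is assumed to be a FeatureLayer):

>>> fset = fl.query(where="POP2010 > 100000",        # SQL where clause
...                 out_fields="NAME,POP2010",       # attributes to return
...                 return_geometry=True)
>>> len(fset.features)  # Feature objects matching the query
>>> fl.query(where="POP2010 > 100000", return_count_only=True)  # just the count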
query_related_records(object_ids, relationship_id, out_fields='*', definition_expression=None, return_geometry=True, max_allowable_offset=None, geometry_precision=None, out_wkid=None, gdb_version=None, return_z=False, return_m=False)
The query_related_records operation is performed on a feature service layer resource. The result of this operation is a set of feature sets grouped by source layer/table object IDs. Each feature set contains Feature objects including the values for the fields requested by the user. For related layers, if you request geometry information, the geometry of each feature is also returned in the feature set. For related tables, the feature set does not include geometries.
Inputs:
- object_ids - the object IDs of the table/layer to be queried
- relationship_id - The ID of the relationship to be queried.
- out_fields - the list of fields from the related table/layer to be included in the returned feature set. This list is a comma-delimited list of field names. If you specify the shape field in the list of return fields, it is ignored. To request geometry, set return_geometry to true. You can also specify the wildcard "*" as the value of this parameter, in which case the results will include all the field values.
- definition_expression - The definition expression to be applied to the related table/layer. From the list of object IDs, only those records that conform to this expression are queried for related records.
- return_geometry - If true, the feature set includes the geometry associated with each feature. The default is true.
- max_allowable_offset - This option can be used to specify the maxAllowableOffset to be used for generalizing geometries returned by the query operation. The maxAllowableOffset is in the units of the outSR. If outSR is not specified, maxAllowableOffset is assumed to be in the unit of the spatial reference of the map.
- geometry_precision - This option can be used to specify the number of decimal places in the response geometries.
- out_wkid - The spatial reference of the returned geometry.
- gdb_version - The geodatabase version to query. This parameter applies only if the isDataVersioned property of the layer queried is true.
- return_z - If true, Z values are included in the results if the features have Z values. Otherwise, Z values are not returned. The default is false.
- return_m - If true, M values are included in the results if the features have M values. Otherwise, M values are not returned. The default is false.
validate_sql(sql, sql_type='where')
The validate_sql operation validates an SQL-92 expression or WHERE clause. It ensures that an SQL-92 expression, such as one written by a user through a user interface, is correct before performing another operation that uses the expression. For example, validate_sql can be used to validate information that is subsequently passed in as part of the where parameter of the calculate operation. validate_sql also prevents SQL injection. In addition, all table and field names used in the SQL expression or WHERE clause are validated to ensure they are valid tables and fields.
Parameters:
- sql - the SQL expression or WHERE clause to validate. Example: "Population > 300000"
- sql_type - Three SQL types are supported in validate_sql:
  - where (default) - Represents the custom WHERE clause the user can compose when querying a layer or using calculate.
  - expression - Represents an SQL-92 expression. Currently, expression is used as a default value expression when adding a new field or using the calculate API.
  - statement - Represents the full SQL-92 statement that can be passed directly to the database. No current ArcGIS REST API resource or operation supports using the full SQL-92 SELECT statement directly. It has been added to validateSQL for completeness. Values: where | expression | statement
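For instance, a user-supplied WHERE clause can be checked before it is passed to query or calculate (a sketch; fl is assumed to be a FeatureLayer):

>>> fl.validate_sql("Population > 300000", sql_type="where")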
arcgis.features.Table

class arcgis.features.Table(url, gis=None, container=None, dynamic_layer=None)
Tables represent entity classes with uniform properties. In addition to working with "entities with location" as features, the GIS can also work with non-spatial entities as rows in tables.
Working with tables is similar to working with feature layers, except that the rows (Features) in a table do not have a geometry, and tables ignore any geometry related operation.
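A sketch of reading rows from a table (the service URL and field names are hypothetical):

>>> from arcgis.features import Table
>>> tbl = Table('https://services.example.com/arcgis/rest/services/Inspections/FeatureServer/1')
>>> rows = tbl.query(where="STATUS = 'OPEN'", return_geometry=False)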
calculate(where, calc_expression, sql_format='standard')
The calculate operation is performed on a feature layer resource. It updates the values of one or more fields in an existing feature service layer based on SQL expressions or scalar values. The calculate operation can only be used if the supportsCalculate property of the layer is true. Neither the Shape field nor system fields (such as ObjectId and GlobalId) can be updated using calculate. See "Calculate a field" in the REST API documentation for more information on supported expressions.
Inputs:
- where - A where clause used to limit the updated records. Any legal SQL where clause operating on the fields in the layer is allowed.
- calc_expression - The field/value info object or objects that contain the field or fields to update and their scalar values or SQL expressions. Allowed types are dictionary and list; a list must be a list of dictionary objects. The calculation format is: {"field": "<field name>", "value": <value>}
- sql_format - The SQL format for the calc_expression. It can be either standard SQL92 ('standard') or native SQL ('native'). The default is 'standard'. Values: standard, native
Output:
- JSON as a string
Usage:
>>> print(fl.calculate(where="OBJECTID < 2",
...                    calc_expression={"field": "ZONE", "value": "R1"}))
{'updatedFeatureCount': 1, 'success': True}
container
The feature layer collection to which this layer belongs.
delete_features(deletes=None, where=None, geometry_filter=None, gdb_version=None, rollback_on_failure=True)
This operation deletes features in a feature layer or table.
Inputs:
- deletes - string of OIDs to remove from the service
- where - A where clause for the query filter. Any legal SQL where clause operating on the fields in the layer is allowed. Features conforming to the specified where clause will be deleted.
- geometry_filter - spatial filter from the arcgis.geometry.filters module to filter results by a spatial relationship with another geometry
- gdb_version - Geodatabase version to apply the edits to.
- rollback_on_failure - Optional parameter to specify whether the edits should be applied only if all submitted edits succeed. If false, the server will apply the edits that succeed even if some of the submitted edits fail. If true, the server will apply the edits only if all edits succeed. The default value is true.
Output:
- dictionary of messages
edit_features(adds=None, updates=None, deletes=None, gdb_version=None, use_global_ids=False, rollback_on_failure=True)
This operation adds, updates, and deletes features in the associated feature layer or table in a single call.
Inputs:
- adds - The array of features to be added.
- updates - The array of features to be updated.
- deletes - string of OIDs to remove from the service
- gdb_version - Geodatabase version to apply the edits to.
- use_global_ids - Instead of referencing the default Object ID field, the service will look at a GUID field to track changes. This means GUIDs will be passed instead of OIDs for delete, update, or add features.
- rollback_on_failure - Optional parameter to specify whether the edits should be applied only if all submitted edits succeed. If false, the server will apply the edits that succeed even if some of the submitted edits fail. If true, the server will apply the edits only if all edits succeed. The default value is true.
Output:
- dictionary of messages
fromitem(item, layer_id=0)
Creates a feature layer from a GIS Item. The type of the item should be 'Feature Service', representing a FeatureLayerCollection. The layer_id is the id of the layer in the feature layer collection (feature service).
generate_renderer(definition, where=None)
This operation groups data using the supplied classification definition and an optional where clause. The result is a renderer object. Use baseSymbol and colorRamp to define the symbols assigned to each class. If the operation is performed on a table, the result is a renderer object containing the data classes and no symbols.
Arguments:
- definition - required dict. The classification definition used to generate the renderer. Use either class breaks or unique value classification definitions. See: https://resources.arcgis.com/en/help/rest/apiref/ms_classification.html
- where - optional string. A where clause for which the data needs to be classified. Any legal SQL where clause operating on the fields in the dynamic layer/table is allowed.
Returns: dictionary
get_html_popup(oid)
The htmlPopup resource provides details about the HTML pop-up authored by the user using ArcGIS for Desktop.
Input:
- oid - object id of the feature whose HTML pop-up is returned
manager
Helper object to manage the feature layer, update its definition, etc.
properties
The properties of this object.
query(where='1=1', out_fields='*', time_filter=None, geometry_filter=None, return_geometry=True, return_count_only=False, return_ids_only=False, return_distinct_values=False, return_extent_only=False, group_by_fields_for_statistics=None, statistic_filter=None, result_offset=None, result_record_count=None, object_ids=None, distance=None, units=None, max_allowable_offset=None, out_sr=None, geometry_precision=None, gdb_version=None, order_by_fields=None, out_statistics=None, return_z=False, return_m=False, multipatch_option=None, quanitization_parameters=None, return_centroid=False, return_all_records=True, **kwargs)
Queries a feature layer based on a SQL statement.
Inputs:
- where - the selection SQL where clause
- out_fields - the attribute fields to return
- object_ids - The object IDs of this layer or table to be queried.
- distance - The buffer distance for the input geometries. The distance unit is specified by units. For example, if the distance is 100, the query geometry is a point, and units is set to meters, then all points within 100 meters of the point are returned.
- units - The unit for calculating the buffer distance. If unit is not specified, the unit is derived from the geometry spatial reference. If the geometry spatial reference is not specified, the unit is derived from the feature service data spatial reference. This parameter only applies if supportsQueryWithDistance is true. Values: esriSRUnit_Meter | esriSRUnit_StatuteMile | esriSRUnit_Foot | esriSRUnit_Kilometer | esriSRUnit_NauticalMile | esriSRUnit_USNauticalMile
- time_filter - a TimeFilter object where either the start time or the start and end times are defined to limit the search results to a given time window. The values in the time filter should be UTC timestamps in milliseconds. No checking occurs to see if they are in the right format.
- geometry_filter - spatial filter from the arcgis.geometry.filters module to filter results by a spatial relationship with another geometry
- max_allowable_offset - This option can be used to specify the maxAllowableOffset to be used for generalizing geometries returned by the query operation. The maxAllowableOffset is in the units of out_sr. If out_sr is not specified, maxAllowableOffset is assumed to be in the unit of the spatial reference of the map.
- out_sr - The spatial reference of the returned geometry.
- geometry_precision - This option can be used to specify the number of decimal places in the response geometries returned by the query operation.
- gdb_version - Geodatabase version to query.
- return_geometry - If true, geometry is returned with the query. Default is true.
- return_distinct_values - If true, it returns distinct values based on the fields specified in out_fields. This parameter applies only if the supportsAdvancedQueries property of the layer is true.
- return_ids_only - If true, the response only includes an array of object IDs. Otherwise, the response is a feature set. The default is false.
- return_count_only - If true, the response only includes the count (number of features/records) that would be returned by the query. Otherwise, the response is a feature set. The default is false. This option supersedes return_ids_only. If return_count_only is true, the response will return both the count and the extent.
- return_extent_only - If true, the response only includes the extent of the features that would be returned by the query. If return_count_only is also true, the response will return both the count and the extent. The default is false. This parameter applies only if the supportsReturningQueryExtent property of the layer is true.
- order_by_fields - One or more field names by which to order the returned features/records. Use ASC or DESC after each field name to control ascending or descending order.
- group_by_fields_for_statistics - One or more field names on which the values need to be grouped for calculating the statistics.
- out_statistics - The definitions for one or more field-based statistics to be calculated.
- return_z - If true, Z values are included in the results if the features have Z values. Otherwise, Z values are not returned. The default is false.
- return_m - If true, M values are included in the results if the features have M values. Otherwise, M values are not returned. The default is false.
- multipatch_option - This option dictates how the geometry of a multipatch feature will be returned.
- result_offset - This option can be used for fetching query results by skipping the specified number of records and starting from the next record (that is, resultOffset + 1). This option is ignored if return_all_records is True (the default).
- result_record_count - This option can be used for fetching query results up to the result_record_count specified. When result_offset is specified but this parameter is not, the map service defaults it to max_record_count. The maximum value for this parameter is the value of the layer's max_record_count property. This option is ignored if return_all_records is True (the default).
- quantization_parameters - Used to project the geometry onto a virtual grid, likely representing pixels on the screen.
- return_centroid - If true, the result includes the geometry centroid associated with each returned feature. The default is false.
- return_all_records - When True, the query operation will call the service until all records that satisfy the where clause are returned. Note: result_offset and result_record_count will be ignored if return_all_records is True. Also, if return_count_only, return_ids_only, or return_extent_only are True, this parameter is ignored.
- kwargs - optional parameters that can be passed to the query function, allowing users to pass additional parameters not explicitly implemented on the function. A complete list of available parameters is documented in the Query REST API.
Output:
- A FeatureSet containing the features matching the query, unless another return type is specified, such as a count.
query_related_records(object_ids, relationship_id, out_fields='*', definition_expression=None, return_geometry=True, max_allowable_offset=None, geometry_precision=None, out_wkid=None, gdb_version=None, return_z=False, return_m=False)
The query_related_records operation is performed on a feature service layer resource. The result of this operation is a set of feature sets grouped by source layer/table object IDs. Each feature set contains Feature objects including the values for the fields requested by the user. For related layers, if you request geometry information, the geometry of each feature is also returned in the feature set. For related tables, the feature set does not include geometries.
Inputs:
- object_ids - the object IDs of the table/layer to be queried
- relationship_id - The ID of the relationship to be queried.
- out_fields - the list of fields from the related table/layer to be included in the returned feature set. This list is a comma-delimited list of field names. If you specify the shape field in the list of return fields, it is ignored. To request geometry, set return_geometry to true. You can also specify the wildcard "*" as the value of this parameter, in which case the results will include all the field values.
- definition_expression - The definition expression to be applied to the related table/layer. From the list of object IDs, only those records that conform to this expression are queried for related records.
- return_geometry - If true, the feature set includes the geometry associated with each feature. The default is true.
- max_allowable_offset - This option can be used to specify the maxAllowableOffset to be used for generalizing geometries returned by the query operation. The maxAllowableOffset is in the units of the outSR. If outSR is not specified, maxAllowableOffset is assumed to be in the unit of the spatial reference of the map.
- geometry_precision - This option can be used to specify the number of decimal places in the response geometries.
- out_wkid - The spatial reference of the returned geometry.
- gdb_version - The geodatabase version to query. This parameter applies only if the isDataVersioned property of the layer queried is true.
- return_z - If true, Z values are included in the results if the features have Z values. Otherwise, Z values are not returned. The default is false.
- return_m - If true, M values are included in the results if the features have M values. Otherwise, M values are not returned. The default is false.
validate_sql(sql, sql_type='where')
The validate_sql operation validates an SQL-92 expression or WHERE clause. It ensures that an SQL-92 expression, such as one written by a user through a user interface, is correct before performing another operation that uses the expression. For example, validate_sql can be used to validate information that is subsequently passed in as part of the where parameter of the calculate operation. validate_sql also prevents SQL injection. In addition, all table and field names used in the SQL expression or WHERE clause are validated to ensure they are valid tables and fields.
Parameters:
- sql - the SQL expression or WHERE clause to validate. Example: "Population > 300000"
- sql_type - Three SQL types are supported in validate_sql:
  - where (default) - Represents the custom WHERE clause the user can compose when querying a layer or using calculate.
  - expression - Represents an SQL-92 expression. Currently, expression is used as a default value expression when adding a new field or using the calculate API.
  - statement - Represents the full SQL-92 statement that can be passed directly to the database. No current ArcGIS REST API resource or operation supports using the full SQL-92 SELECT statement directly. It has been added to validateSQL for completeness. Values: where | expression | statement
arcgis.features.FeatureLayerCollection

class arcgis.features.FeatureLayerCollection(url, gis=None)
A FeatureLayerCollection is a collection of feature layers and tables, with the associated relationships among the entities.
In a web GIS, a feature layer collection is exposed as a feature service with multiple feature layers.
Instances of FeatureLayerCollection can be obtained from feature service Items in the GIS using FeatureLayerCollection.fromitem(item), from feature service endpoints using the constructor, or by accessing the container attribute of feature layer objects.
FeatureLayerCollections can be configured and managed using their manager helper object.
If the dataset supports the sync operation, the replicas helper object allows management and synchronization of replicas for disconnected editing of the feature layer collection.
Note: You can use the layers and tables properties to get to the individual layers and tables in this feature layer collection.
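A sketch of obtaining a collection and listing its layers (the search string is hypothetical):

>>> from arcgis.gis import GIS
>>> from arcgis.features import FeatureLayerCollection
>>> gis = GIS()
>>> item = gis.content.search('USA Cities', item_type='Feature Service')[0]
>>> flc = FeatureLayerCollection.fromitem(item)
>>> [lyr.properties.name for lyr in flc.layers]   # individual layers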
fromitem(item)
manager
Helper object to manage the feature layer collection, update its definition, etc.
properties
The properties of this object.
query(layer_defs_filter=None, geometry_filter=None, time_filter=None, return_geometry=True, return_ids_only=False, return_count_only=False, return_z=False, return_m=False, out_sr=None)
Queries the feature layer collection.
The query operation is performed on a feature service resource. The result of this operation is a set of feature sets, grouped by source layer/table object IDs. Each feature set contains Feature objects including the values for the fields requested by the user. For layers, if you request geometry information, the geometry of each feature is also returned in the feature set. For tables, the feature set does not include geometries.
Inputs:
- layer_defs_filter - layer definition filter limiting the records queried in each layer
- geometry_filter - spatial filter from the arcgis.geometry.filters module to filter results by a spatial relationship with another geometry
- time_filter - time filter limiting the search results to a given time window
- return_geometry - If true, each feature set includes the geometry associated with each feature. The default is true.
- return_ids_only - If true, the response only includes arrays of object IDs. Otherwise, the response is a set of feature sets. The default is false.
- return_count_only - If true, the response only includes the count of features/records that would be returned. The default is false.
- return_z - If true, Z values are included in the results if the features have Z values. Otherwise, Z values are not returned. The default is false.
- return_m - If true, M values are included in the results if the features have M values. Otherwise, M values are not returned. The default is false.
- out_sr - The spatial reference of the returned geometry.
upload(path, description=None)
Uploads a new item to the server. Once the operation is completed successfully, the JSON structure of the uploaded item is returned.
Parameters:
- path - path of the file to upload
- description - optional descriptive text for the uploaded item
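For example (a sketch; the path is hypothetical and flc is assumed to be a FeatureLayerCollection):

>>> status = flc.upload(path='data/updates.csv',
...                     description='nightly updates')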
arcgis.features.FeatureSet

class arcgis.features.FeatureSet(features, fields=None, has_z=False, has_m=False, geometry_type=None, spatial_reference=None, display_field_name=None, object_id_field_name=None, global_id_field_name=None)
A set of features with information about their fields, field aliases, geometry type, spatial reference, etc.
FeatureSets are commonly used as input/output with several Geoprocessing Tools, and can be obtained through the query() methods of feature layers. A FeatureSet can be combined with a layer definition to compose a FeatureCollection.
FeatureSet contains Feature objects, including the values for the fields requested by the user. For layers, if you request geometry information, the geometry of each feature is also returned in the FeatureSet. For tables, the FeatureSet does not include geometries.
If a Spatial Reference is not specified at the FeatureSet level, the FeatureSet will assume the SpatialReference of its first feature. If the SpatialReference of the first feature is also not specified, the spatial reference will be UnknownCoordinateSystem.
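A sketch of working with a FeatureSet returned by a layer query (fl is assumed to be a FeatureLayer):

>>> fset = fl.query(where='1=1')
>>> fset.geometry_type, fset.spatial_reference
>>> first = fset.features[0]   # a Feature object
>>> df = fset.df               # convert to a Pandas DataFrame (requires pandas)
>>> js = fset.to_json          # JSON string representation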
df
Converts the FeatureSet to a Pandas DataFrame. Requires pandas.
display_field_name
Gets/sets the displayFieldName.
features
Gets the features in the FeatureSet.
fields
Gets the fields in the FeatureSet.
static from_dataframe(df)
Returns a FeatureSet from a Pandas DataFrame or a Spatial DataFrame.
static from_dict(featureset_dict)
Returns a FeatureSet from a dictionary.
static from_geojson(geojson)
Converts a GeoJSON Feature Collection into a FeatureSet.
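For example (a sketch with a minimal, hypothetical GeoJSON Feature Collection):

>>> from arcgis.features import FeatureSet
>>> gj = {'type': 'FeatureCollection',
...       'features': [{'type': 'Feature',
...                     'geometry': {'type': 'Point',
...                                  'coordinates': [-118.15, 33.80]},
...                     'properties': {'name': 'A'}}]}
>>> fset = FeatureSet.from_geojson(gj)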
static from_json(json_str)
Returns a FeatureSet from a JSON string.
geometry_type
Gets/sets the geometry type.
global_id_field_name
Gets/sets the globalIdFieldName.
has_m
Gets/sets the M-property.
has_z
Gets/sets the Z-property.
object_id_field_name
Gets/sets the object id field.
save(save_location, out_name, encoding=None)
Saves a FeatureSet object to a feature class.
Input:
- save_location - output location of the data
- out_name - name of the table the data will be saved to
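For example (a sketch; the output geodatabase path and table name are hypothetical):

>>> fset.save(save_location=r'C:\data\scratch.gdb', out_name='cities_copy')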
spatial_reference
Gets the FeatureSet's spatial reference.
to_dict()
Converts the object to a Python dictionary.
to_geojson
Converts the object to GeoJSON.
to_json
Converts the object to JSON.
value
Returns the object as a dictionary.
arcgis.features.FeatureCollection

class arcgis.features.FeatureCollection(dictdata)
FeatureCollection is an object with a layer definition and a feature set.
It is an in-memory collection of features with rendering information.
Feature Collections can be stored as Items in the GIS, added as layers to a map or scene, passed as inputs to feature analysis tools, and returned as results from feature analysis tools if an output name for a feature layer is not specified when calling the tool.
static from_featureset(fset, symbol=None)
Create a FeatureCollection object from a FeatureSet object.
Returns: A FeatureCollection object.
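For example (a sketch; fset is assumed to be a FeatureSet from an earlier query):

>>> from arcgis.features import FeatureCollection
>>> fc = FeatureCollection.from_featureset(fset)
>>> fc.query()   # read the data back as a FeatureSet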
query()
Returns the data in this feature collection as a FeatureSet. Filtering by where clause is not supported for feature collections.
arcgis.features.SpatialDataFrame

class arcgis.features.SpatialDataFrame(*args, **kwargs)
A SpatialDataFrame is an object to manipulate, manage and translate data into new forms of information for users.
Required Parameters:
- None
Optional Parameters:
- data - pandas DataFrame containing attribute information
- geometry - list/array/geoseries of arcgis.geometry objects
- sr - spatial reference of the dataframe. This can be the factory code, a WKT string, an arcpy.SpatialReference object, or an arcgis SpatialReference object.
- gis - passing a gis.GIS object set to Pro will ensure arcpy is installed and a full swath of functionality is available to the end user.
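A sketch of constructing a SpatialDataFrame from attribute data and point geometries (the names and coordinates are hypothetical):

>>> import pandas as pd
>>> from arcgis.features import SpatialDataFrame
>>> from arcgis.geometry import Point
>>> data = pd.DataFrame({'name': ['A', 'B']})
>>> geoms = [Point({'x': -118.15, 'y': 33.80}),
...          Point({'x': -118.37, 'y': 34.09})]
>>> sdf = SpatialDataFrame(data=data, geometry=geoms, sr=4326)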
JSON
Returns an Esri JSON representation of the geometry as a string.
T
Transpose index and columns.
WKB
Returns the well-known binary (WKB) representation for OGC geometry. It provides a portable representation of a geometry value as a contiguous stream of bytes.
WKT
Returns the well-known text (WKT) representation for OGC geometry. It provides a portable representation of a geometry value as a text string.
abs()
Return an object with absolute value taken. Only applicable to objects that are all numeric.
Returns: abs : type of caller
add(other, axis='columns', level=None, fill_value=None)
Addition of dataframe and other, element-wise (binary operator add). Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters:
- other : Series, DataFrame, or constant
- axis : {0, 1, 'index', 'columns'} - For Series input, axis to match Series index on
- fill_value : None or float value, default None - Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing.
- level : int or name - Broadcast across a level, matching Index values on the passed MultiIndex level
Notes: Mismatched indices will be unioned together.
Returns: result : DataFrame
See also: DataFrame.radd
add_prefix(prefix)
Concatenate prefix string with panel items names.
Parameters: prefix : string
Returns: with_prefix : type of caller
add_suffix(suffix)
Concatenate suffix string with panel items names.
Parameters: suffix : string
Returns: with_suffix : type of caller
agg(func, axis=0, *args, **kwargs)
Aggregate using callable, string, dict, or list of string/callables.
New in version 0.20.0.
Parameters:
- func : callable, string, dictionary, or list of string/callables - Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.
Accepted combinations are:
- string function name
- function
- list of functions
- dict of column names -> functions (or list of functions)
Notes: Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)). agg is an alias for aggregate. Use the alias.
Returns: aggregated : DataFrame
Examples:
>>> df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
...                   index=pd.date_range('1/1/2000', periods=10))
>>> df.iloc[3:7] = np.nan
Aggregate these functions across all columns:
>>> df.agg(['sum', 'min'])
            A         B         C
sum -0.182253 -0.614014 -2.909534
min -1.916563 -1.460076 -1.568297
Different aggregations per column:
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
            A         B
max       NaN  1.514318
min -1.916563 -1.460076
sum -0.182253       NaN
See also: pandas.DataFrame.apply, pandas.DataFrame.transform, pandas.DataFrame.groupby.aggregate, pandas.DataFrame.resample.aggregate, pandas.DataFrame.rolling.aggregate
aggregate(func, axis=0, *args, **kwargs)
Aggregate using callable, string, dict, or list of string/callables.
New in version 0.20.0.
Parameters:
- func : callable, string, dictionary, or list of string/callables - Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.
Accepted combinations are:
- string function name
- function
- list of functions
- dict of column names -> functions (or list of functions)
Notes: Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)). agg is an alias for aggregate. Use the alias.
Returns: aggregated : DataFrame
Examples:
>>> df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
...                   index=pd.date_range('1/1/2000', periods=10))
>>> df.iloc[3:7] = np.nan
Aggregate these functions across all columns:
>>> df.agg(['sum', 'min'])
            A         B         C
sum -0.182253 -0.614014 -2.909534
min -1.916563 -1.460076 -1.568297
Different aggregations per column:
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
            A         B
max       NaN  1.514318
min -1.916563 -1.460076
sum -0.182253       NaN
See also: pandas.DataFrame.apply, pandas.DataFrame.transform, pandas.DataFrame.groupby.aggregate, pandas.DataFrame.resample.aggregate, pandas.DataFrame.rolling.aggregate
align(other, join='outer', axis=None, level=None, copy=True, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None)
Align two objects on their axes with the specified join method for each axis Index.
Parameters:
- other : DataFrame or Series
- join : {'outer', 'inner', 'left', 'right'}, default 'outer'
- axis : allowed axis of the other object, default None - Align on index (0), columns (1), or both (None)
- level : int or level name, default None - Broadcast across a level, matching Index values on the passed MultiIndex level
- copy : boolean, default True - Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
- fill_value : scalar, default np.NaN - Value to use for missing values. Defaults to NaN, but can be any "compatible" value
- method : str, default None
- limit : int, default None
- fill_axis : {0 or 'index', 1 or 'columns'}, default 0 - Filling axis, method and limit
- broadcast_axis : {0 or 'index', 1 or 'columns'}, default None - Broadcast values along this axis, if aligning two objects of different dimensions. New in version 0.17.0.
Returns: (left, right) : (DataFrame, type of other) - Aligned objects
all(axis=None, bool_only=None, skipna=None, level=None, **kwargs)
Return whether all elements are True over requested axis.
Parameters:
- axis : {index (0), columns (1)}
- skipna : boolean, default True - Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level : int or level name, default None - If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- bool_only : boolean, default None - Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: all : Series or DataFrame (if level specified)
angle_distance_to(second_geometry, method='GEODESIC')
Returns a tuple of angle and distance to another point using a measurement type.
Parameters:
- second_geometry - a second geometry
- method - PLANAR measurements reflect the projection of geographic data onto the 2D surface (in other words, they will not take into account the curvature of the earth). GEODESIC, GREAT_ELLIPTIC, LOXODROME, and PRESERVE_SHAPE measurement types may be chosen as an alternative, if desired.
any(axis=None, bool_only=None, skipna=None, level=None, **kwargs)
Return whether any element is True over requested axis.
Parameters:
- axis : {index (0), columns (1)}
- skipna : boolean, default True - Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level : int or level name, default None - If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- bool_only : boolean, default None - Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns: any : Series or DataFrame (if level specified)
append(other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
Parameters:
- other : DataFrame or Series/dict-like object, or list of these - The data to append.
- ignore_index : boolean, default False - If True, do not use the index labels.
- verify_integrity : boolean, default False - If True, raise ValueError on creating index with duplicates.
Returns: appended : DataFrame
Notes: If a list of dict/series is passed and the keys are all contained in the DataFrame's index, the order of the columns in the resulting DataFrame will be unchanged. Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
See also: pandas.concat : General function to concatenate DataFrame, Series or Panel objects
Examples:
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df.append(df2)
   A  B
0  1  2
1  3  4
0  5  6
1  7  8
With ignore_index set to True:
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources.
Less efficient:
>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
...     df = df.append({'A': i}, ignore_index=True)
>>> df
   A
0  0
1  1
2  2
3  3
4  4
More efficient:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index either the DataFrame's index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.
Parameters:
- func : function - Function to apply to each column/row
- axis : {0 or 'index', 1 or 'columns'}, default 0 - 0 or 'index': apply function to each column; 1 or 'columns': apply function to each row
- broadcast : boolean, default False - For aggregation functions, return object of same size with values propagated
- raw : boolean, default False - If False, convert each row or column into a Series. If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance
- reduce : boolean or None, default None - Try to apply reduction procedures. If the DataFrame is empty, apply will use reduce to determine whether the result should be a Series or a DataFrame. If reduce is None (the default), apply's return value will be guessed by calling func on an empty Series (note: while guessing, exceptions raised by func will be ignored). If reduce is True a Series will always be returned, and if False a DataFrame will always be returned.
- args : tuple - Positional arguments to pass to function in addition to the array/series
Additional keyword arguments will be passed as keywords to the function.
Notes: In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
Examples:
>>> df.apply(numpy.sqrt)  # returns DataFrame
>>> df.apply(numpy.sum, axis=0)  # equiv to df.sum(0)
>>> df.apply(numpy.sum, axis=1)  # equiv to df.sum(1)
See also: DataFrame.applymap : For elementwise operations. DataFrame.aggregate : only perform aggregating type operations. DataFrame.transform : only perform transforming type operations.
Returns: applied : Series or DataFrame
applymap(func)
Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame.
Parameters:
- func : function - Python function, returns a single value from a single value
Examples:
>>> df = pd.DataFrame(np.random.randn(3, 3))
>>> df
          0         1         2
0 -0.029638  1.081563  1.280300
1  0.647747  0.831136 -1.549481
2  0.513416 -0.884417  0.195343
>>> df = df.applymap(lambda x: '%.2f' % x)
>>> df
       0      1      2
0  -0.03   1.08   1.28
1   0.65   0.83  -1.55
2   0.51  -0.88   0.20
Returns: applied : DataFrame
See also: DataFrame.apply : For operations on rows/columns
area
The area of a polygon feature. Empty for all other feature types.
as_arcpy
Returns an Esri JSON representation of the geometry as a string.
as_blocks(copy=True)
Convert the frame to a dict of dtype -> Constructor Types that each has a homogeneous dtype.
Deprecated since version 0.21.0.
NOTE: the dtypes of the blocks WILL BE PRESERVED HERE (unlike in as_matrix).
Parameters: copy : boolean, default True
Returns: values : a dict of dtype -> Constructor Types
as_matrix(columns=None)
Convert the frame to its Numpy-array representation.
Parameters:
- columns : list, optional, default None - If None, return all columns; otherwise, returns specified columns.
Returns:
- values : ndarray - If the caller is heterogeneous and contains booleans or objects, the result will be of dtype=object. See Notes.
Notes: The return is NOT a Numpy-matrix, but rather a Numpy-array. The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks. E.g., if the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type convention, mixing int64 and uint64 will result in a float64 dtype. This method is provided for backwards compatibility. Generally, it is recommended to use '.values'.
See also: pandas.DataFrame.values
asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
Convert TimeSeries to specified frequency.
Optionally provide filling method to pad/backfill missing values.
Returns the original data conformed to a new index with the specified frequency. resample is more appropriate if an operation, such as summarization, is necessary to represent the data at the new frequency.
Parameters:
- freq : DateOffset object, or string
- method : {'backfill'/'bfill', 'pad'/'ffill'}, default None - Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present): 'pad'/'ffill': propagate last valid observation forward to next valid; 'backfill'/'bfill': use NEXT valid observation to fill
- how : {'start', 'end'}, default end - For PeriodIndex only, see PeriodIndex.asfreq
- normalize : bool, default False - Whether to reset output index to midnight
- fill_value : scalar, optional - Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present). New in version 0.20.0.
Returns: converted : type of caller
Examples:
Start by creating a series with 4 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=4, freq='T')
>>> series = pd.Series([0.0, None, 2.0, 3.0], index=index)
>>> df = pd.DataFrame({'s':series})
>>> df
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:01:00  NaN
2000-01-01 00:02:00  2.0
2000-01-01 00:03:00  3.0
Upsample the series into 30 second bins.
>>> df.asfreq(freq='30S')
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  NaN
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  NaN
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  NaN
2000-01-01 00:03:00  3.0
Upsample again, providing a fill value.
>>> df.asfreq(freq='30S', fill_value=9.0)
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  9.0
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  9.0
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  9.0
2000-01-01 00:03:00  3.0
Upsample again, providing a method.
>>> df.asfreq(freq='30S', method='bfill')
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  NaN
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  2.0
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  3.0
2000-01-01 00:03:00  3.0
See also: reindex
Notes: To learn more about the frequency strings, please see this link.
asof(where, subset=None)
The last row without any NaN is taken (or the last row without NaN considering only the subset of columns in the case of a DataFrame).
New in version 0.19.0: For DataFrame.
If there is no good value, NaN is returned for a Series, or a Series of NaN values for a DataFrame.
Parameters:
- where : date or array of dates
- subset : string or list of strings, default None - if not None, use these columns for NaN propagation
Notes: Dates are assumed to be sorted. Raises if this is not the case.
Returns: where is scalar - value or NaN if input is Series, or a Series if input is DataFrame; where is Index - same shape object as input.
See also: merge_asof
-
assign
(**kwargs)¶ Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
- kwargs : keyword, value pairs
- keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
- df : DataFrame
- A new DataFrame with the new columns in addition to all the existing columns.
For python 3.6 and above, the columns are inserted in the order of **kwargs. For python 3.5 and earlier, since **kwargs is unordered, the columns are inserted in alphabetical order at the end of your DataFrame. Assigning multiple columns within the same assign is possible, but you cannot reference other columns created within the same assign call.
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Where the value is a callable, evaluated on df:
>>> df.assign(ln_A = lambda x: np.log(x.A))
    A         B      ln_A
0   1  0.426905  0.000000
1   2 -0.780949  0.693147
2   3 -0.418711  1.098612
3   4 -0.269708  1.386294
4   5 -0.274002  1.609438
5   6 -0.500792  1.791759
6   7  1.649697  1.945910
7   8 -1.495604  2.079442
8   9  0.549296  2.197225
9  10 -0.758542  2.302585
Where the value already exists and is inserted:
>>> newcol = np.log(df['A'])
>>> df.assign(ln_A=newcol)
    A         B      ln_A
0   1  0.426905  0.000000
1   2 -0.780949  0.693147
2   3 -0.418711  1.098612
3   4 -0.269708  1.386294
4   5 -0.274002  1.609438
5   6 -0.500792  1.791759
6   7  1.649697  1.945910
7   8 -1.495604  2.079442
8   9  0.549296  2.197225
9  10 -0.758542  2.302585
-
astype
(dtype, copy=True, errors='raise', **kwargs)¶ Cast a pandas object to a specified dtype.
- dtype : data type, or dict of column name -> data type
- Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- copy : bool, default True.
- Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
- errors : {‘raise’, ‘ignore’}, default ‘raise’.
Control raising of exceptions on invalid data for provided dtype.
- raise : allow exceptions to be raised
- ignore : suppress exceptions. On error return original object
New in version 0.20.0.
- raise_on_error : raise on invalid input
  Deprecated since version 0.20.0: Use errors instead.
kwargs : keyword arguments to pass on to the constructor
casted : type of caller
>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype('int64')
0    1
1    2
dtype: int64
Convert to categorical type:
>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]
Convert to ordered categorical type with custom ordering:
>>> ser.astype('category', ordered=True, categories=[2, 1])
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]
Note that using copy=False and changing data on a new pandas object may propagate changes:
>>> s1 = pd.Series([1, 2])
>>> s2 = s1.astype('int', copy=False)
>>> s2[0] = 10
>>> s1  # note that s1[0] has changed too
0    10
1     2
dtype: int64
pandas.to_datetime : Convert argument to datetime. pandas.to_timedelta : Convert argument to timedelta. pandas.to_numeric : Convert argument to a numeric type. numpy.ndarray.astype : Cast a numpy array to a specified type.
-
at
¶ Fast label-based scalar accessor
Similarly to loc, at provides label-based scalar lookups. You can also set using these indexers.
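A minimal illustrative sketch (invented labels and data, not from the original reference):
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
>>> df.at['x', 'B']          # scalar lookup by row and column label
3
>>> df.at['y', 'A'] = 20     # scalar assignment through the same indexer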
-
at_time
(time, asof=False)¶ Select values at particular time of day (e.g. 9:30AM).
time : datetime.time or string
values_at_time : type of caller
-
axes
¶ Return a list with the row axis labels and column axis labels as the only members. They are returned in that order.
-
between_time
(start_time, end_time, include_start=True, include_end=True)¶ Select values between particular times of the day (e.g., 9:00-9:30 AM).
start_time : datetime.time or string
end_time : datetime.time or string
include_start : boolean, default True
include_end : boolean, default True
values_between_time : type of caller
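A minimal illustrative sketch covering both at_time and between_time (invented timestamps, not from the original reference):
>>> import pandas as pd
>>> i = pd.date_range('2018-04-09', periods=4, freq='12H')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts.at_time('12:00')                 # rows stamped exactly 12:00
                     A
2018-04-09 12:00:00  2
2018-04-10 12:00:00  4
>>> ts.between_time('0:00', '12:00')    # inclusive on both ends by default
                     A
2018-04-09 00:00:00  1
2018-04-09 12:00:00  2
2018-04-10 00:00:00  3
2018-04-10 12:00:00  4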
-
bfill
(axis=None, inplace=False, limit=None, downcast=None)¶ Synonym for
DataFrame.fillna(method='bfill')
-
blocks
¶ Internal property, property synonym for as_blocks()
Deprecated since version 0.21.0.
-
bool
()¶ Return the bool of a single element PandasObject.
This must be a boolean scalar value, either True or False. Raise a ValueError if the PandasObject does not have exactly 1 element, or that element is not boolean
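A minimal illustrative sketch:
>>> import pandas as pd
>>> pd.Series([True]).bool()
True
>>> pd.DataFrame({'col': [False]}).bool()
False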
-
boundary
()¶ Constructs the boundary of the geometry.
-
bounds
¶ Return a DataFrame of minx, miny, maxx, maxy values of geometry objects
-
boxplot
(column=None, by=None, ax=None, fontsize=None, rot=0, grid=True, figsize=None, layout=None, return_type=None, **kwds)¶ Make a box plot from DataFrame column optionally grouped by some columns or other inputs
data : the pandas object holding the data
column : column name or list of names, or vector
    Can be any valid input to groupby
- by : string or sequence
- Column in the DataFrame to group by
ax : Matplotlib axes object, optional
fontsize : int or string
rot : label rotation angle
figsize : A tuple (width, height) in inches
grid : Setting this to True will show the grid
layout : tuple (optional)
    (rows, columns) for the layout of the plot
- return_type : {None, ‘axes’, ‘dict’, ‘both’}, default None
  The kind of object to return. The default is axes.
  ‘axes’ returns the matplotlib axes the boxplot is drawn on; ‘dict’ returns a dictionary whose values are the matplotlib Lines of the boxplot; ‘both’ returns a namedtuple with the axes and dict.
  When grouping with by, a Series mapping columns to return_type is returned, unless return_type is None, in which case a NumPy array of axes is returned with the same shape as layout. See the prose documentation for more.
- kwds : other plotting keyword arguments to be passed to the matplotlib boxplot function
lines : dict
ax : matplotlib Axes
(ax, lines) : namedtuple
Use return_type='dict' when you want to tweak the appearance of the lines after plotting. In this case a dict containing the Lines making up the boxes, caps, fliers, medians, and whiskers is returned.
-
buffer
(distance)¶ Constructs a polygon at a specified distance from the geometry.
- Parameters:
  distance: - length in the current projection. Only polygons accept negative values.
-
centroid
¶ The true centroid if it is within or on the feature; otherwise, the label point is returned. Returns a point object.
-
clip
(envelope)¶ Constructs the intersection of the geometry and the specified extent.
- Parameters:
envelope: - arcpy.Extent object
-
clip_lower
(threshold, axis=None, inplace=False)¶ Return copy of the input with values below given value(s) truncated.
threshold : float or array_like
axis : int or string axis name, optional
    Align object with threshold along the given axis.
- inplace : boolean, default False
- Whether to perform the operation in place on the data
New in version 0.21.0.
clip
clipped : same type as input
-
clip_upper
(threshold, axis=None, inplace=False)¶ Return copy of input with values above given value(s) truncated.
threshold : float or array_like
axis : int or string axis name, optional
    Align object with threshold along the given axis.
- inplace : boolean, default False
- Whether to perform the operation in place on the data
New in version 0.21.0.
clip
clipped : same type as input
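A minimal illustrative sketch of both truncation directions (invented data, not from the original reference):
>>> import pandas as pd
>>> s = pd.Series([1, 4, 7])
>>> s.clip_lower(3)    # values below 3 are raised to 3
0    3
1    4
2    7
dtype: int64
>>> s.clip_upper(5)    # values above 5 are lowered to 5
0    1
1    4
2    5
dtype: int64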
-
combine
(other, func, fill_value=None, overwrite=True)¶ Add two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame’s value (which might be NaN as well)
other : DataFrame
func : function
fill_value : scalar value
overwrite : boolean, default True
    If True then overwrite values for common keys in the calling frame
result : DataFrame
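A minimal illustrative sketch (invented frames, not from the original reference); func receives a pair of aligned columns, and here keeps the element-wise smaller values:
>>> import pandas as pd
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, lambda s1, s2: s1.where(s1 < s2, s2))
   A  B
0  0  3
1  0  3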
-
combine_first
(other)¶ Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns
other : DataFrame
a’s values prioritized, use values from b to fill holes:
>>> a.combine_first(b)
combined : DataFrame
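A concrete illustrative sketch (invented frames): holes in the calling frame are filled from the other frame.
>>> import pandas as pd
>>> a = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> b = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> a.combine_first(b)
     A    B
0  1.0  4.0
1  0.0  3.0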
-
compound
(axis=None, skipna=None, level=None)¶ Return the compound percentage of the values for the requested axis
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
compounded : Series or DataFrame (if level specified)
-
consolidate
(inplace=False)¶ DEPRECATED: consolidate will be an internal implementation only.
-
contains
(second_geometry, relation=None)¶ Indicates if the base geometry contains the comparison geometry.
- Parameters:
second_geometry: - a second geometry
-
convert_objects
(convert_dates=True, convert_numeric=False, convert_timedeltas=True, copy=True)¶ Deprecated. Attempt to infer better dtype for object columns
- convert_dates : boolean, default True
- If True, convert to date where possible. If ‘coerce’, force conversion, with unconvertible values becoming NaT.
- convert_numeric : boolean, default False
- If True, attempt to coerce to numbers (including strings), with unconvertible values becoming NaN.
- convert_timedeltas : boolean, default True
- If True, convert to timedelta where possible. If ‘coerce’, force conversion, with unconvertible values becoming NaT.
- copy : boolean, default True
- If True, return a copy even if no copy is necessary (e.g. no conversion was done). Note: This is meant for internal use, and should not be confused with inplace.
pandas.to_datetime : Convert argument to datetime. pandas.to_timedelta : Convert argument to timedelta. pandas.to_numeric : Convert argument to a numeric type.
converted : same as input object
-
convex_hull
()¶ Constructs the geometry that is the minimal bounding polygon such that all outer angles are convex.
-
coordinates
()¶ returns the point coordinates of the geometry as a np.array object
-
copy
(deep=True)¶ Make a copy of this SpatialDataFrame object.
- Parameters:
  deep: boolean, default True. Make a deep copy, i.e. also copy data.
- Returns:
  copy: of SpatialDataFrame
-
corr
(method='pearson', min_periods=1)¶ Compute pairwise correlation of columns, excluding NA/null values
- method : {‘pearson’, ‘kendall’, ‘spearman’}
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- min_periods : int, optional
- Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
y : DataFrame
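A minimal illustrative sketch (invented data, not from the original reference):
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [1, 2, 3, 5]})
>>> df.corr(method='pearson')
          x         y
x  1.000000  0.982708
y  0.982708  1.000000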
-
corrwith
(other, axis=0, drop=False)¶ Compute pairwise correlation between rows or columns of two DataFrame objects.
other : DataFrame
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
    0 or ‘index’ to compute column-wise, 1 or ‘columns’ for row-wise
- drop : boolean, default False
- Drop missing indices from result, default returns union of all
correls : Series
-
count
(axis=0, level=None, numeric_only=False)¶ Return Series with number of non-NA/null observations over requested axis. Works with non-floating point data as well (detects NaN and None)
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame
- numeric_only : boolean, default False
- Include only float, int, boolean data
count : Series (or DataFrame if level specified)
-
cov
(min_periods=None)¶ Compute pairwise covariance of columns, excluding NA/null values
- min_periods : int, optional
- Minimum number of observations required per pair of columns to have a valid result.
y : DataFrame
y contains the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1 (unbiased estimator).
-
crosses
(second_geometry)¶ Indicates if the two geometries intersect in a geometry of a lesser shape type.
- Parameters:
second_geometry: - a second geometry
-
cummax
(axis=None, skipna=True, *args, **kwargs)¶ Return cumulative max over requested axis.
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result will be NA
cummax : Series
- pandas.core.window.Expanding.max : Similar functionality but ignores NaN values.
-
cummin
(axis=None, skipna=True, *args, **kwargs)¶ Return cumulative minimum over requested axis.
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result will be NA
cummin : Series
- pandas.core.window.Expanding.min : Similar functionality but ignores NaN values.
-
cumprod
(axis=None, skipna=True, *args, **kwargs)¶ Return cumulative product over requested axis.
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result will be NA
cumprod : Series
- pandas.core.window.Expanding.prod : Similar functionality but ignores NaN values.
-
cumsum
(axis=None, skipna=True, *args, **kwargs)¶ Return cumulative sum over requested axis.
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result will be NA
cumsum : Series
- pandas.core.window.Expanding.sum : Similar functionality but ignores NaN values.
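A minimal illustrative sketch for cumsum (the other cumulative methods behave analogously; invented data): with the default skipna=True a NaN stays NaN but does not interrupt the running total.
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([2, np.nan, 5, -1])
>>> s.cumsum()
0    2.0
1    NaN
2    7.0
3    6.0
dtype: float64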
-
cut
(cutter)¶ Splits this geometry into a part left of the cutting polyline, and a part right of it.
- Parameters:
cutter: - The cutting polyline geometry.
-
densify
(method, distance, deviation)¶ Creates a new geometry with added vertices
- Parameters:
method: - The type of densification, DISTANCE, ANGLE, or GEODESIC
distance: - The maximum distance between vertices. The actual
distance between vertices will usually be less than the maximum distance as new vertices will be evenly distributed along the original segment. If using a type of DISTANCE or ANGLE, the distance is measured in the units of the geometry’s spatial reference. If using a type of GEODESIC, the distance is measured in meters.
deviation: - Densify uses straight lines to approximate curves.
You use deviation to control the accuracy of this approximation. The deviation is the maximum distance between the new segment and the original curve. The smaller its value, the more segments will be required to approximate the curve.
-
describe
(percentiles=None, include=None, exclude=None)¶ Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
- percentiles : list-like of numbers, optional
- The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
- include : ‘all’, list-like of dtypes or None (default), optional
  A white list of data types to include in the result. Ignored for Series. Here are the options:
  - ‘all’ : All columns of the input will be included in the output.
  - A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
  - None (default) : The result will include all numeric columns.
- exclude : list-like of dtypes or None (default), optional
  A black list of data types to omit from the result. Ignored for Series. Here are the options:
  - A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'
  - None (default) : The result will exclude nothing.
summary: Series/DataFrame of summary statistics
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
Describing a numeric Series.
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Describing a categorical Series.
>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object
Describing a timestamp Series.
>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object
Describing a DataFrame. By default only numeric fields are returned.
>>> df = pd.DataFrame({'object': ['a', 'b', 'c'],
...                    'numeric': [1, 2, 3],
...                    'categorical': pd.Categorical(['d', 'e', 'f'])
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
Describing a column from a DataFrame by accessing it as an attribute.
>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64
Including only numeric columns in a DataFrame description.
>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Including only string columns in a DataFrame description.
>>> df.describe(include=[np.object])
       object
count       3
unique      3
top         c
freq        1
Including only categorical columns from a DataFrame description.
>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1
Excluding numeric columns from a DataFrame description.
>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[np.object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
DataFrame.count DataFrame.max DataFrame.min DataFrame.mean DataFrame.std DataFrame.select_dtypes
-
diff
(periods=1, axis=0)¶ 1st discrete difference of object
- periods : int, default 1
- Periods to shift for forming difference
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- Take difference over rows (0) or columns (1).
diffed : DataFrame
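A minimal illustrative sketch (invented data): the first element has no predecessor, so it becomes NaN.
>>> import pandas as pd
>>> s = pd.Series([1, 3, 6, 10])
>>> s.diff()
0    NaN
1    2.0
2    3.0
3    4.0
dtype: float64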
-
difference
(second_geometry)¶ Constructs the geometry that is composed only of the region unique to the base geometry but not part of the other geometry. The following illustration shows the results when the red polygon is the source geometry.
- Parameters:
second_geometry: - a second geometry
-
disjoint
(second_geometry)¶ Indicates if the base and comparison geometries share no points in common.
- Parameters:
second_geometry: - a second geometry
-
distance_to
(second_geometry)¶ Returns the minimum distance between two geometries. If the geometries intersect, the minimum distance is 0. Both geometries must have the same projection.
- Parameters:
second_geometry: - a second geometry
-
div
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rtruediv
-
divide
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rtruediv
-
dot
(other)¶ Matrix multiplication with DataFrame or Series objects
other : DataFrame or Series
dot_product : DataFrame or Series
-
drop
(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')¶ Return new object with labels in requested axis removed.
- labels : single label or list-like
- Index or column labels to drop.
- axis : int or axis name
- Whether to drop labels from the index (0 / ‘index’) or columns (1 / ‘columns’).
- index, columns : single label or list-like
Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
New in version 0.21.0.
- level : int or level name, default None
- For MultiIndex
- inplace : bool, default False
- If True, do operation inplace and return None.
- errors : {‘ignore’, ‘raise’}, default ‘raise’
- If ‘ignore’, suppress error and existing labels are dropped.
dropped : type of caller
>>> df = pd.DataFrame(np.arange(12).reshape(3,4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
Drop a row by index
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11
Specifying both labels and index or columns will raise a ValueError.
-
drop_duplicates
(subset=None, keep='first', inplace=False)¶ Return DataFrame with duplicate rows removed, optionally only considering certain columns
- subset : column label or sequence of labels, optional
- Only consider certain columns for identifying duplicates, by default use all of the columns
- keep : {‘first’, ‘last’, False}, default ‘first’
  - first : Drop duplicates except for the first occurrence.
  - last : Drop duplicates except for the last occurrence.
  - False : Drop all duplicates.
- inplace : boolean, default False
- Whether to drop duplicates in place or to return a copy
deduplicated : DataFrame
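A minimal illustrative sketch (invented data, not from the original reference):
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y']})
>>> df.drop_duplicates()           # keeps the first of each duplicate pair
   A  B
0  1  x
2  2  y
>>> df.drop_duplicates(keep='last')
   A  B
1  1  x
2  2  y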
-
dropna
(axis=0, how='any', thresh=None, subset=None, inplace=False)¶ Return object with labels on given axis omitted where alternately any or all of the data are missing
- axis : {0 or ‘index’, 1 or ‘columns’}, or tuple/list thereof
- Pass tuple or list to drop on multiple axes
- how : {‘any’, ‘all’}
- any : if any NA values are present, drop that label
- all : if all values are NA, drop that label
- thresh : int, default None
- int value : require that many non-NA values
- subset : array-like
- Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include
- inplace : boolean, default False
- If True, do operation inplace and return None.
dropped : DataFrame
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5]],
...                   columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
Drop the columns where all elements are nan:
>>> df.dropna(axis=1, how='all')
     A    B  D
0  NaN  2.0  0
1  3.0  4.0  1
2  NaN  NaN  5
Drop the columns where any of the elements is nan
>>> df.dropna(axis=1, how='any')
   D
0  0
1  1
2  5
Drop the rows where all of the elements are nan (there is no row to drop, so df stays the same):
>>> df.dropna(axis=0, how='all')
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
Keep only the rows with at least 2 non-na values:
>>> df.dropna(thresh=2)
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
-
dtypes
¶ Return the dtypes in this object.
-
duplicated
(subset=None, keep='first')¶ Return boolean Series denoting duplicate rows, optionally only considering certain columns
- subset : column label or sequence of labels, optional
- Only consider certain columns for identifying duplicates, by default use all of the columns
- keep : {‘first’, ‘last’, False}, default ‘first’
  - first : Mark duplicates as True except for the first occurrence.
  - last : Mark duplicates as True except for the last occurrence.
  - False : Mark all duplicates as True.
duplicated : Series
-
empty
¶ True if NDFrame is entirely empty [no items], meaning any of the axes are of length 0.
If NDFrame contains only NaNs, it is still not considered empty. See the example below.
An example of an actual empty DataFrame. Notice the index is empty:
>>> df_empty = pd.DataFrame({'A' : []})
>>> df_empty
Empty DataFrame
Columns: [A]
Index: []
>>> df_empty.empty
True
If we only have NaNs in our DataFrame, it is not considered empty! We will need to drop the NaNs to make the DataFrame empty:
>>> df = pd.DataFrame({'A' : [np.nan]})
>>> df
    A
0 NaN
>>> df.empty
False
>>> df.dropna().empty
True
pandas.Series.dropna pandas.DataFrame.dropna
-
eq
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods eq
-
equals
(second_geometry)¶ Indicates if the base and comparison geometries are of the same shape type and define the same set of points in the plane. This is a 2D comparison only; M and Z values are ignored.
- Parameters:
  second_geometry: - a second geometry
-
erase
(other, inplace=False)¶ Erases
-
eval
(expr, inplace=False, **kwargs)¶ Evaluate an expression in the context of the calling DataFrame instance.
- expr : string
- The expression string to evaluate.
- inplace : bool, default False
If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.
New in version 0.18.0.
- kwargs : dict
- See the documentation for eval() for complete details on the keyword arguments accepted by query().
ret : ndarray, scalar, or pandas object
pandas.DataFrame.query pandas.DataFrame.assign pandas.eval
For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.eval('a + b')
>>> df.eval('c = a + b')
-
ewm
(com=None, span=None, halflife=None, alpha=None, min_periods=0, freq=None, adjust=True, ignore_na=False, axis=0)¶ Provides exponential weighted functions
New in version 0.18.0.
- com : float, optional
- Specify decay in terms of center of mass, alpha = 1 / (1 + com), for com >= 0
- span : float, optional
- Specify decay in terms of span, alpha = 2 / (span + 1), for span >= 1
- halflife : float, optional
- Specify decay in terms of half-life, alpha = 1 - exp(log(0.5) / halflife), for halflife > 0
- alpha : float, optional
  Specify smoothing factor alpha directly, 0 < alpha <= 1
  New in version 0.18.0.
- min_periods : int, default 0
- Minimum number of observations in window required to have a value (otherwise result is NA).
- freq : None or string alias / date offset object, default=None
Deprecated since version 0.18.0: Frequency to conform to before computing statistic
- adjust : boolean, default True
- Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average)
- ignore_na : boolean, default False
- Ignore missing values when calculating weights; specify True to reproduce pre-0.15.0 behavior
a Window sub-classed for the particular operation
>>> df = DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0
>>> df.ewm(com=0.5).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.670213
Exactly one of center of mass, span, half-life, and alpha must be provided. Allowed values and relationship between the parameters are specified in the parameter descriptions above; see the link at the end of this section for a detailed explanation.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of
resample()
(i.e. using the mean).When adjust is True (default), weighted averages are calculated using weights (1-alpha)**(n-1), (1-alpha)**(n-2), …, 1-alpha, 1.
- When adjust is False, weighted averages are calculated recursively as:
- weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i].
When ignore_na is False (default), weights are based on absolute positions. For example, the weights of x and y used in calculating the final weighted average of [x, None, y] are (1-alpha)**2 and 1 (if adjust is True), and (1-alpha)**2 and alpha (if adjust is False).
When ignore_na is True (reproducing pre-0.15.0 behavior), weights are based on relative positions. For example, the weights of x and y used in calculating the final weighted average of [x, None, y] are 1-alpha and 1 (if adjust is True), and 1-alpha and alpha (if adjust is False).
More details can be found at http://pandas.pydata.org/pandas-docs/stable/computation.html#exponentially-weighted-windows
-
expanding
(min_periods=1, freq=None, center=False, axis=0)¶ Provides expanding transformations.
New in version 0.18.0.
- min_periods : int, default None
- Minimum number of observations in window required to have a value (otherwise result is NA).
- freq : string or DateOffset object, optional (default None)
Deprecated since version 0.18.0: Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center : boolean, default False
- Set the labels at the center of the window.
axis : int or string, default 0
a Window sub-classed for the particular operation
>>> df = DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0
>>> df.expanding(2).sum()
     B
0  NaN
1  1.0
2  3.0
3  3.0
4  7.0
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
-
extent
¶ the extent of the geometry
-
ffill
(axis=None, inplace=False, limit=None, downcast=None)¶ Synonym for
DataFrame.fillna(method='ffill')
-
fillna
(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)¶ Fill NA/NaN values using the specified method
- value : scalar, dict, Series, or DataFrame
- Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
- method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
- Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap
axis : {0 or ‘index’, 1 or ‘columns’}
inplace : boolean, default False
    If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).
- limit : int, default None
- If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
- downcast : dict, default is None
- a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
reindex, asfreq
filled : DataFrame
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
Replace all NaN elements with 0s.
>>> df.fillna(0)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
We can also propagate non-null values forward or backward.
>>> df.fillna(method='ffill')
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  NaN  1
2  NaN  1.0  NaN  5
3  NaN  3.0  NaN  4
-
filter
(items=None, like=None, regex=None, axis=None)¶ Subset rows or columns of dataframe according to labels in the specified index.
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
- items : list-like
- List of info axis to restrict to (must not all be present)
- like : string
- Keep info axis where “arg in col == True”
- regex : string (regular expression)
- Keep info axis with re.search(regex, col) == True
- axis : int or string axis name
- The axis to filter on. By default this is the info axis, ‘index’ for Series, ‘columns’ for DataFrame
same type as input object
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6
>>> # select columns by name
>>> df.filter(items=['one', 'three'])
        one  three
mouse     1      3
rabbit    4      6
>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
        one  three
mouse     1      3
rabbit    4      6
>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
        one  two  three
rabbit    4    5      6
pandas.DataFrame.loc
The items, like, and regex parameters are enforced to be mutually exclusive.
axis defaults to the info axis that is used when indexing with [].
-
first
(offset)¶ Convenience method for subsetting initial periods of time series data based on a date offset.
offset : string, DateOffset, dateutil.relativedelta
ts.first(‘10D’) -> First 10 days
subset : type of caller
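A minimal illustrative sketch (invented dates): the offset is anchored at the first index value, so only rows within the first 3 calendar days are returned.
>>> import pandas as pd
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts.first('3D')
            A
2018-04-09  1
2018-04-11  2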
-
first_point
¶ The first coordinate point of the geometry.
-
first_valid_index
()¶ Return index for first non-NA/null value.
If all elements are non-NA/null, returns None. Also returns None for empty DataFrame.
scalar : type of index
-
floordiv
(other, axis='columns', level=None, fill_value=None)¶ Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rfloordiv
-
from_csv
(path, header=0, sep=', ', index_col=0, parse_dates=True, encoding=None, tupleize_cols=None, infer_datetime_format=False)¶ Read CSV file (DEPRECATED, please use pandas.read_csv() instead).
It is preferable to use the more powerful pandas.read_csv() for most general purposes, but from_csv makes for an easy roundtrip to and from a file (the exact counterpart of to_csv), especially with a DataFrame of time series data.
This method only differs from the preferred pandas.read_csv() in some defaults:
- index_col is 0 instead of None (take first column as index by default)
- parse_dates is True instead of False (try parsing the index as datetime by default)
So a pd.DataFrame.from_csv(path) can be replaced by pd.read_csv(path, index_col=0, parse_dates=True).
path : string file path or file handle / StringIO
header : int, default 0
    Row to use as header (skip prior rows)
- sep : string, default ‘,’
- Field delimiter
- index_col : int or sequence, default 0
- Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table
- parse_dates : boolean, default True
- Parse dates. Different default from read_table
- tupleize_cols : boolean, default False
- write multi_index columns as a list of tuples (if True) or in the new (expanded) format (if False)
- infer_datetime_format: boolean, default False
- If True and parse_dates is True for a column, try to infer the datetime format based on the first datetime string. If the format can be inferred, there often will be a large parsing speed-up.
pandas.read_csv
y : DataFrame
-
static
from_df
(df, address_column='address', geocoder=None)¶ Returns a SpatialDataFrame from a dataframe with an address column.
Inputs:
  df: Pandas dataframe with an address column
- Optional Parameters:
- address_column: string, default “address”. This is the name of a
- column in the specified dataframe that contains addresses (as strings). The addresses are batch geocoded using the GIS’s first configured geocoder and their locations used as the geometry of the spatial dataframe. Ignored if the ‘geometry’ parameter is also specified.
- geocoder: the geocoder to be used. If not specified,
- the active GIS’s first geocoder is used.
NOTE: Credits will be consumed for batch_geocoding, from the GIS to which the geocoder belongs.
-
from_dict
(data, orient='columns', dtype=None)¶ Construct DataFrame from dict of array-like or dicts
- data : dict
- {field : array-like} or {field : dict}
- orient : {‘columns’, ‘index’}, default ‘columns’
- The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
- dtype : dtype, default None
- Data type to force, otherwise infer
DataFrame
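A minimal illustrative sketch of both orientations (invented data, not from the original reference):
>>> import pandas as pd
>>> pd.DataFrame.from_dict({'col_1': [3, 2], 'col_2': ['a', 'b']})
   col_1 col_2
0      3     a
1      2     b
>>> pd.DataFrame.from_dict({'row_1': [3, 2], 'row_2': [1, 0]}, orient='index')
       0  1
row_1  3  2
row_2  1  0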
-
static
from_featureclass
(filename, **kwargs)¶ Returns a SpatialDataFrame from a feature class.
Inputs:
  filename: full path to the feature class
- Optional Parameters:
  - sql_clause: sql clause to parse data down
  - where_clause: where statement
  - sr: spatial reference object
-
static
from_hdf
(path_or_buf, key=None, **kwargs)¶ read from the store, close it if we opened it
Retrieve pandas object stored in file, optionally based on where criteria
- path_or_buf : path (string), buffer, or path object (pathlib.Path or
py._path.local.LocalPath) to read from
New in version 0.19.0: support for pathlib, py.path.
- key : group identifier in the store. Can be omitted if the HDF file
- contains a single pandas object.
where : list of Term (or convertible) objects, optional
start : optional, integer (defaults to None), row number to start selection
- stop : optional, integer (defaults to None), row number to stop selection
- columns : optional, a list of columns that if not None, will limit the return columns
iterator : optional, boolean, return an iterator, default False
chunksize : optional, nrows to include in iteration, return an iterator
The selected object
-
from_items
(items, columns=None, orient='columns')¶ Convert (key, value) pairs to DataFrame. The keys will be the axis index (usually the columns, but depends on the specified orientation). The values should be arrays or Series.
- items : sequence of (key, value) pairs
- Values should be arrays or Series.
- columns : sequence of column labels, optional
- Must be passed if orient=’index’.
- orient : {‘columns’, ‘index’}, default ‘columns’
- The “orientation” of the data. If the keys of the input correspond to column labels, pass ‘columns’ (default). Otherwise if the keys correspond to the index, pass ‘index’.
frame : DataFrame
-
static
from_layer
(layer, **kwargs)¶ Returns a SpatialDataFrame from a FeatureLayer or Table object.
Inputs:
  layer: FeatureLayer or Table
  gis: GIS object
Returns a SpatialDataFrame for services with geometry and a pandas DataFrame for table services.
-
from_records
(data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)¶ Convert structured or record ndarray to DataFrame
data : ndarray (structured dtype), list of tuples, dict, or DataFrame
index : string, list of fields, array-like
    Field of array to use as the index, alternately a specific set of input labels to use
- exclude : sequence, default None
- Columns or fields to exclude
- columns : sequence, default None
- Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns)
- coerce_float : boolean, default False
- Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets
df : DataFrame
-
ftypes
¶ Return the ftypes (indication of sparse/dense and dtype) in this object.
-
ge
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods ge
-
generalize
(max_offset)¶ Creates a new simplified geometry using a specified maximum offset tolerance.
- Parameters:
max_offset: - The maximum offset tolerance.
-
geoextent
¶ returns the extent of the spatial dataframe
-
geometry
¶ Get/Set the geometry data for SpatialDataFrame
-
geometry_type
¶ The geometry type: polygon, polyline, point, multipoint, multipatch, dimension, or annotation
-
get
(key, default=None)¶ Get item from object for given key (DataFrame column, Panel slice, etc.). Returns default value if not found.
key : object
value : type of items contained in object
-
get_area
(method, units=None)¶ Returns the area of the feature using a measurement type.
- Parameters:
method: - PLANAR measurements reflect the projection of
geographic data onto the 2D surface (in other words, they will not take into account the curvature of the earth). GEODESIC, GREAT_ELLIPTIC, LOXODROME, and PRESERVE_SHAPE measurement types may be chosen as an alternative, if desired.
units: - Areal unit of measure keywords: ACRES | ARES | HECTARES |
  SQUARECENTIMETERS | SQUAREDECIMETERS | SQUAREINCHES | SQUAREFEET |
  SQUAREKILOMETERS | SQUAREMETERS | SQUAREMILES | SQUAREMILLIMETERS | SQUAREYARDS
-
get_dtype_counts
()¶ Return the counts of dtypes in this object.
-
get_ftype_counts
()¶ Return the counts of ftypes in this object.
-
get_length
(method, units)¶ Returns the length of the feature using a measurement type.
- Parameters:
method: - PLANAR measurements reflect the projection of
geographic data onto the 2D surface (in other words, they will not take into account the curvature of the earth). GEODESIC, GREAT_ELLIPTIC, LOXODROME, and PRESERVE_SHAPE measurement types may be chosen as an alternative, if desired.
units: - Linear unit of measure keywords: CENTIMETERS |
DECIMETERS | FEET | INCHES | KILOMETERS | METERS | MILES | MILLIMETERS | NAUTICALMILES | YARDS
-
get_part
(index=None)¶ Returns an array of point objects for a particular part of geometry or an array containing a number of arrays, one for each part.
- Parameters:
index: - The index position of the geometry.
-
get_value
(index, col, takeable=False)¶ Quickly retrieve single value at passed column and index
Deprecated since version 0.21.0.
Please use .at[] or .iat[] accessors.
index : row label
col : column label
takeable : interpret the index/col as indexers, default False
value : scalar value
-
get_values
()¶ same as values (but handles sparseness conversions)
-
groupby
(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)¶ Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
- by : mapping, function, str, or iterable
- Used to determine the groups for the groupby.
If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see the .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A str or list of strs may be passed to group by the columns in self.
axis : int, default 0
level : int, level name, or sequence of such, default None
    If the axis is a MultiIndex (hierarchical), group by a particular level or levels
- as_index : boolean, default True
- For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
- sort : boolean, default True
- Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
- group_keys : boolean, default True
- When calling apply, add group keys to index to identify pieces
- squeeze : boolean, default False
- reduce the dimensionality of the return type if possible, otherwise return a consistent type
DataFrame results
>>> data.groupby(func, axis=0).mean()
>>> data.groupby(['col1', 'col2'])['col3'].mean()
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean()
GroupBy object
-
gt
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods gt
-
head
(n=5)¶ Return the first n rows.
- n : int, default 5
- Number of rows to select.
- obj_head : type of caller
- The first n rows of the caller object.
-
hist
(data, column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, **kwds)¶ Draw histogram of the DataFrame’s series using matplotlib / pylab.
data : DataFrame
column : string or sequence
    If passed, will be used to limit data to a subset of columns
- by : object, optional
- If passed, then used to form histograms for separate groups
- grid : boolean, default True
- Whether to show axis grid lines
- xlabelsize : int, default None
- If specified changes the x-axis label size
- xrot : float, default None
- rotation of x axis labels
- ylabelsize : int, default None
- If specified changes the y-axis label size
- yrot : float, default None
- rotation of y axis labels
ax : matplotlib axes object, default None
sharex : boolean, default True if ax is None else False
    In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax is passed in; Be aware, that passing in both an ax and sharex=True will alter all x axis labels for all subplots in a figure!
- sharey : boolean, default False
- In case subplots=True, share y axis and set some y axis labels to invisible
- figsize : tuple
- The size of the figure to create in inches by default
- layout : tuple, optional
- Tuple of (rows, columns) for the layout of the histograms
- bins : integer, default 10
- Number of histogram bins to be used
- kwds : other plotting keyword arguments
- To be passed to hist function
-
hull_rectangle
¶ A space-delimited string of the coordinate pairs of the convex hull rectangle.
-
iat
¶ Fast integer location scalar accessor.
Similarly to iloc, iat provides integer-based lookups. You can also set using these indexers.
-
idxmax
(axis=0, skipna=True)¶ Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- skipna : boolean, default True
- Exclude NA/null values. If an entire row/column is NA, the result will be NA.
idxmax : Series
This method is the DataFrame version of ndarray.argmax.
Series.idxmax
-
idxmin
(axis=0, skipna=True)¶ Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- skipna : boolean, default True
- Exclude NA/null values. If an entire row/column is NA, the result will be NA
idxmin : Series
This method is the DataFrame version of ndarray.argmin.
Series.idxmin
-
iloc
¶ Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
- An integer, e.g.
-
infer_objects
()¶ Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.
New in version 0.21.0.
pandas.to_datetime : Convert argument to datetime. pandas.to_timedelta : Convert argument to timedelta. pandas.to_numeric : Convert argument to numeric type.
converted : same type as input object
>>> df = pd.DataFrame({"A": ["a", 1, 2, 3]}) >>> df = df.iloc[1:] >>> df A 1 1 2 2 3 3
>>> df.dtypes
A    object
dtype: object
>>> df.infer_objects().dtypes
A    int64
dtype: object
-
info
(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)¶ Concise summary of a DataFrame.
- verbose : {None, True, False}, optional
- Whether to print the full summary. None follows the display.max_info_columns setting. True or False overrides the display.max_info_columns setting.
buf : writable buffer, defaults to sys.stdout
max_cols : int, default None
    Determines whether full summary or short summary is printed. None follows the display.max_info_columns setting.
- memory_usage : boolean/string, default None
- Specifies whether total memory usage of the DataFrame elements (including index) should be displayed. None follows the display.memory_usage setting. True or False overrides the display.memory_usage setting. A value of ‘deep’ is equivalent of True, with deep introspection. Memory usage is shown in human-readable units (base-2 representation).
- null_counts : boolean, default None
Whether to show the non-null counts
- If None, then only show if the frame is smaller than max_info_rows and max_info_columns.
- If True, always show counts.
- If False, never show counts.
-
insert
(loc, column, value, allow_duplicates=False)¶ Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
- loc : int
- Insertion index. Must verify 0 <= loc <= len(columns)
- column : string, number, or hashable object
- label of the inserted column
value : int, Series, or array-like
allow_duplicates : bool, optional
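A minimal illustrative sketch (invented data): insert modifies the frame in place and returns None.
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2], 'C': [5, 6]})
>>> df.insert(1, 'B', [3, 4])   # put column 'B' at position 1
>>> df
   A  B  C
0  1  3  5
1  2  4  6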
-
interpolate
(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', downcast=None, **kwargs)¶ Interpolate values according to different methods.
Please note that only method='linear' is supported for DataFrames/Series with a MultiIndex.
- method : {‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’,
- ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’, ‘pchip’, ‘akima’}
- ‘linear’: ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes. default
- ‘time’: interpolation works on daily and higher resolution data to interpolate given length of interval
- ‘index’, ‘values’: use the actual numerical values of the index
- ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘polynomial’ is passed to scipy.interpolate.interp1d. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method=’polynomial’, order=4). These use the actual numerical values of the index.
- ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ are all wrappers around the scipy interpolation methods of similar names. These use the actual numerical values of the index. For more information on their behavior, see the scipy documentation and tutorial documentation
- ‘from_derivatives’ refers to BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18
New in version 0.18.1: Added support for the ‘akima’ method Added interpolate method ‘from_derivatives’ which replaces ‘piecewise_polynomial’ in scipy 0.18; backwards-compatible with scipy < 0.18
- axis : {0, 1}, default 0
- 0: fill column-by-column
- 1: fill row-by-row
- limit : int, default None.
- Maximum number of consecutive NaNs to fill. Must be greater than 0.
- limit_direction : {‘forward’, ‘backward’, ‘both’}, default ‘forward’
If limit is specified, consecutive NaNs will be filled in this direction.
New in version 0.17.0.
- inplace : bool, default False
- Update the NDFrame in place if possible.
- downcast : optional, ‘infer’ or None, defaults to None
- Downcast dtypes if possible.
kwargs : keyword arguments to pass on to the interpolating function.
Series or DataFrame of same shape interpolated at the NaNs
reindex, replace, fillna
Filling in NaNs
>>> s = pd.Series([0, 1, np.nan, 3])
>>> s.interpolate()
0    0
1    1
2    2
3    3
dtype: float64
-
intersect
(second_geometry, dimension)¶ Constructs a geometry that is the geometric intersection of the two input geometries. Different dimension values can be used to create different shape types. The intersection of two geometries of the same shape type is a geometry containing only the regions of overlap between the original geometries.
- Parameters:
  second_geometry: - a second geometry
  dimension: - The topological dimension (shape type) of the resulting geometry.
    1 - A zero-dimensional geometry (point or multipoint).
    2 - A one-dimensional geometry (polyline).
    4 - A two-dimensional geometry (polygon).
-
is_copy
= None¶
-
is_empty
¶ Return True for each empty geometry, False for non-empty
-
is_multipart
¶ True, if the number of parts for the geometry is more than 1
-
isin
(values)¶ Return boolean DataFrame showing whether each element in the DataFrame is contained in values.
- values : iterable, Series, DataFrame or dictionary
- The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dictionary, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.
DataFrame of booleans
When values is a list:
>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
>>> df.isin([1, 3, 12, 'a'])
       A      B
0   True   True
1  False  False
2   True  False
When values is a dict:
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})
       A      B
0   True  False  # Note that B didn't match the 1 here.
1  False   True
2   True   True
When values is a Series or DataFrame:
>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
>>> other = DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})
>>> df.isin(other)
       A      B
0   True  False
1  False  False  # Column A in `other` has a 3, but not at index 1.
2   True   True
-
isna
()¶ Return a boolean same-sized object indicating if the values are NA.
DataFrame.notna : boolean inverse of isna DataFrame.isnull : alias of isna isna : top-level isna
-
isnull
()¶ Return a boolean same-sized object indicating if the values are NA.
DataFrame.notna : boolean inverse of isna DataFrame.isnull : alias of isna isna : top-level isna
-
items
()¶ Iterator over (column name, Series) pairs.
iterrows : Iterate over DataFrame rows as (index, Series) pairs. itertuples : Iterate over DataFrame rows as namedtuples of the values.
-
iteritems
()¶ Iterator over (column name, Series) pairs.
iterrows : Iterate over DataFrame rows as (index, Series) pairs. itertuples : Iterate over DataFrame rows as namedtuples of the values.
-
iterrows
()¶ Iterate over DataFrame rows as (index, Series) pairs.
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64
To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and which is generally faster than iterrows. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
- it : generator
- A generator that iterates over the rows of the frame.
itertuples : Iterate over DataFrame rows as namedtuples of the values. iteritems : Iterate over (column name, Series) pairs.
-
itertuples
(index=True, name='Pandas')¶ Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.
- index : boolean, default True
- If True, return the index as the first element of the tuple.
- name : string, default “Pandas”
- The name of the returned namedtuples or None to return regular tuples.
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
iterrows : Iterate over DataFrame rows as (index, Series) pairs. iteritems : Iterate over (column name, Series) pairs.
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():
...     print(row)
...
Pandas(Index='a', col1=1, col2=0.10000000000000001)
Pandas(Index='b', col1=2, col2=0.20000000000000001)
-
ix
¶ A primarily label-location based indexer, with integer position fallback.
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. .ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.
However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc.
See more at Advanced Indexing.
-
join
(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)¶ Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.
- other : DataFrame, Series with name field set, or list of DataFrame
- Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame
- on : column name, tuple/list of column names, or array-like
- Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns are given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation
- how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
How to handle the operation of the two objects.
- left: use calling frame’s index (or column if on is specified)
- right: use other frame’s index
- outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
- inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling’s one
- lsuffix : string
- Suffix to use from left frame’s overlapping columns
- rsuffix : string
- Suffix to use from right frame’s overlapping columns
- sort : boolean, default False
- Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword)
on, lsuffix, and rsuffix options are not supported when passing a list of DataFrame objects
>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> caller
    A key
0  A0  K0
1  A1  K1
2  A2  K2
3  A3  K3
4  A4  K4
5  A5  K5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], ... 'B': ['B0', 'B1', 'B2']})
>>> other
    B key
0  B0  K0
1  B1  K1
2  B2  K2
Join DataFrames using their indexes.
>>> caller.join(other, lsuffix='_caller', rsuffix='_other')
    A key_caller    B key_other
0  A0        K0   B0        K0
1  A1        K1   B1        K1
2  A2        K2   B2        K2
3  A3        K3  NaN       NaN
4  A4        K4  NaN       NaN
5  A5        K5  NaN       NaN
If we want to join using the key columns, we need to set key to be the index in both caller and other. The joined DataFrame will have key as its index.
>>> caller.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in the caller. This method preserves the original caller’s index in the result.
>>> caller.join(other.set_index('key'), on='key')
    A key    B
0  A0  K0   B0
1  A1  K1   B1
2  A2  K2   B2
3  A3  K3  NaN
4  A4  K4  NaN
5  A5  K5  NaN
DataFrame.merge : For column(s)-on-column(s) operations
joined : DataFrame
-
keys
()¶ Get the ‘info axis’ (see Indexing for more)
This is index for Series, columns for DataFrame and major_axis for Panel.
-
kurt
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
kurt : Series or DataFrame (if level specified)
-
kurtosis
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
kurt : Series or DataFrame (if level specified)
-
label_point
¶ The point at which the label is located. The labelPoint is always located within or on a feature.
-
last
(offset)¶ Convenience method for subsetting final periods of time series data based on a date offset.
offset : string, DateOffset, dateutil.relativedelta
ts.last(‘5M’) -> Last 5 months
subset : type of caller
-
last_point
¶ The last coordinate of the feature.
-
last_valid_index
()¶ Return index for last non-NA/null value.
If all elements are NA/null, returns None. Also returns None for an empty DataFrame.
scalar : type of index
-
le
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods le
-
length
¶ The length of the linear feature. Zero for point and multipoint feature types.
-
length3D
¶ The 3D length of the linear feature. Zero for point and multipoint feature types.
-
loc
¶ Purely label-location based indexer for selection by label.
.loc[] is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!).
- A boolean array.
- A callable function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing (one of the above).
.loc will raise a KeyError when the items are not found.
See more at Selection by Label
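A short doctest-style illustration of label-based selection (hypothetical data):
>>> df = pd.DataFrame({'speed': [1, 4, 7]}, index=['cobra', 'viper', 'sidewinder'])
>>> df.loc['viper']             # a single label
speed    4
Name: viper, dtype: int64
>>> df.loc[df['speed'] > 3, 'speed']  # a boolean array plus a column label
viper         4
sidewinder    7
Name: speed, dtype: int64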
-
lookup
(row_labels, col_labels)¶ Label-based “fancy indexing” function for DataFrame. Given equal-length arrays of row and column labels, return an array of the values corresponding to each (row, col) pair.
- row_labels : sequence
- The row labels to use for lookup
- col_labels : sequence
- The column labels to use for lookup
Akin to:
result = []
for row, col in zip(row_labels, col_labels):
    result.append(df.get_value(row, col))
- values : ndarray
- The found values
-
lt
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods lt
-
mad
(axis=None, skipna=None, level=None)¶ Return the mean absolute deviation of the values for the requested axis
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
mad : Series or DataFrame (if level specified)
-
mask
(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False, raise_on_error=None)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
- cond : boolean NDFrame, array-like, or callable
Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as cond.
- other : scalar, NDFrame, or callable
Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as other.
- inplace : boolean, default False
- Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
errors : str, {‘raise’, ‘ignore’}, default ‘raise’
    - raise : allow exceptions to be raised
    - ignore : suppress exceptions. On error return original object
Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.
- try_cast : boolean, default False
- try to cast the result back to the input type (if possible),
- raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Deprecated since version 0.21.0.
wh : same type as caller
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
>>> s.where(s > 1, 10)
0    10.0
1    10.0
2     2.0
3     3.0
4     4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
DataFrame.where()
-
max
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
max : Series or DataFrame (if level specified)
-
mean
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ Return the mean of the values for the requested axis
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
mean : Series or DataFrame (if level specified)
-
measure_on_line
(second_geometry, as_percentage=False)¶ Returns a measure from the start point of this line to the in_point.
- Parameters:
second_geometry: - a second geometry
as_percentage: - If False, the measure will be returned as a
distance; if True, the measure will be returned as a percentage.
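A minimal sketch, assuming a Polyline and a Point from arcgis.geometry and a locally available geometry engine (arcpy or shapely); the coordinates are hypothetical:
>>> from arcgis.geometry import Polyline, Point
>>> line = Polyline({'paths': [[[0, 0], [10, 0]]],
...                  'spatialReference': {'wkid': 4326}})
>>> pt = Point({'x': 5, 'y': 0, 'spatialReference': {'wkid': 4326}})
>>> line.measure_on_line(pt)                      # measure returned as a distance
>>> line.measure_on_line(pt, as_percentage=True)  # measure as a fraction of the line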
-
median
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ Return the median of the values for the requested axis
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
median : Series or DataFrame (if level specified)
-
melt
(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)¶ “Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
New in version 0.20.0.
frame : DataFrame
id_vars : tuple, list, or ndarray, optional
    Column(s) to use as identifier variables.
- value_vars : tuple, list, or ndarray, optional
- Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- var_name : scalar
- Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- value_name : scalar, default ‘value’
- Name to use for the ‘value’ column.
- col_level : int or string, optional
- If columns are a MultiIndex then use this level to melt.
melt pivot_table DataFrame.pivot
>>> import pandas as pd
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}})
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> df.melt(id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
>>> df.melt(id_vars=['A'], value_vars=['B', 'C'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6
The names of ‘variable’ and ‘value’ columns can be customized:
>>> df.melt(id_vars=['A'], value_vars=['B'],
...         var_name='myVarname', value_name='myValname')
   A myVarname  myValname
0  a         B          1
1  b         B          3
2  c         B          5
If you have multi-index columns:
>>> df.columns = [list('ABC'), list('DEF')]
>>> df
   A  B  C
   D  E  F
0  a  1  2
1  b  3  4
2  c  5  6
>>> df.melt(col_level=0, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
>>> df.melt(id_vars=[('A', 'D')], value_vars=[('B', 'E')])
  (A, D) variable_0 variable_1  value
0      a          B          E      1
1      b          B          E      3
2      c          B          E      5
-
memory_usage
(index=True, deep=False)¶ Memory usage of DataFrame columns.
- index : bool
- Specifies whether to include memory usage of DataFrame’s index in returned Series. If index=True (the default), the first entry of the Series is the memory usage of the index.
- deep : bool
- Introspect the data deeply, interrogate object dtypes for system-level memory consumption
- sizes : Series
- A series with column names as index and memory usage of columns with units of bytes.
Memory usage does not include memory consumed by elements that are not components of the array if deep=False
numpy.ndarray.nbytes
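A short doctest-style illustration (hypothetical data; exact byte counts vary by platform, so output is omitted):
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
>>> df.memory_usage()            # per-column usage, index included by default
>>> df.memory_usage(deep=True)   # interrogate object columns for their real footprint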
-
merge
(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)¶ Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
right : DataFrame
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
- on : label or list
- Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
- left_on : label or list, or array-like
- Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns
- right_on : label or list, or array-like
- Field names to join on in right DataFrame or vector/list of vectors per left_on docs
- left_index : boolean, default False
- Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
- right_index : boolean, default False
- Use the index from the right DataFrame as the join key. Same caveats as left_index
- sort : boolean, default False
- Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword)
- suffixes : 2-length sequence (tuple, list, …)
- Suffix to apply to overlapping column names in the left and right side, respectively
- copy : boolean, default True
- If False, do not copy data unnecessarily
- indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
New in version 0.17.0.
- validate : string, default None
If specified, checks if merge is of specified type.
- “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
- “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
- “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
- “many_to_many” or “m:m”: allowed, but does not result in checks.
New in version 0.21.0.
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x rkey  value_y
0  foo   1        foo  5
1  foo   4        foo  5
2  bar   2        bar  6
3  bar   2        bar  8
4  baz   3        NaN  NaN
5  NaN   NaN      qux  7
- merged : DataFrame
- The output type will the be same as ‘left’, if it is a subclass of DataFrame.
merge_ordered merge_asof
-
merge_datasets
(other)¶ This operation combines two dataframes into one new DataFrame. If the operation is combining two SpatialDataFrames, the geometry_type must match.
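A minimal sketch, assuming sdf1 and sdf2 are SpatialDataFrame objects with matching geometry_type (for example, two polygon layers); the names are hypothetical:
>>> combined = sdf1.merge_datasets(sdf2)
>>> combined.shape  # one new DataFrame containing the rows of both inputs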
-
min
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min : Series or DataFrame (if level specified)
-
mod
(other, axis='columns', level=None, fill_value=None)¶ Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rmod
-
mode
(axis=0, numeric_only=False)¶ Gets the mode(s) of each element along the axis selected. Adds a row for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected axis (when more than one item shares the maximum frequency), which is the reason why a dataframe is returned. If you want to impute missing values with the mode in a dataframe df, you can just do this: df.fillna(df.mode().iloc[0])
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- 0 or ‘index’ : get mode of each column
- 1 or ‘columns’ : get mode of each row
- numeric_only : boolean, default False
- if True, only apply to numeric columns
modes : DataFrame (sorted)
>>> df = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2, 3]})
>>> df.mode()
   A
0  1
1  2
-
mul
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rmul
-
multiply
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rmul
-
ndim
¶ Number of axes / array dimensions
-
ne
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods ne
-
nlargest
(n, columns, keep='first')¶ Get the rows of a DataFrame sorted by the n largest values of columns.
New in version 0.17.0.
- n : int
- Number of items to retrieve
- columns : list or str
- Column name or names to order by
- keep : {‘first’, ‘last’, False}, default ‘first’
- Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
DataFrame
>>> df = DataFrame({'a': [1, 10, 8, 11, -1],
...                 'b': list('abdce'),
...                 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nlargest(3, 'a')
    a  b    c
3  11  c    3
1  10  b    2
2   8  d  NaN
-
notna
()¶ Return a boolean same-sized object indicating if the values are not NA.
DataFrame.isna : boolean inverse of notna DataFrame.notnull : alias of notna notna : top-level notna
-
notnull
()¶ Return a boolean same-sized object indicating if the values are not NA.
DataFrame.isna : boolean inverse of notna DataFrame.notnull : alias of notna notna : top-level notna
-
nsmallest
(n, columns, keep='first')¶ Get the rows of a DataFrame sorted by the n smallest values of columns.
New in version 0.17.0.
- n : int
- Number of items to retrieve
- columns : list or str
- Column name or names to order by
- keep : {‘first’, ‘last’, False}, default ‘first’
- Where there are duplicate values:
- first : take the first occurrence.
- last : take the last occurrence.
DataFrame
>>> df = DataFrame({'a': [1, 10, 8, 11, -1],
...                 'b': list('abdce'),
...                 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nsmallest(3, 'a')
   a  b    c
4 -1  e    4
0  1  a    1
2  8  d  NaN
-
nunique
(axis=0, dropna=True)¶ Return Series with number of distinct observations over requested axis.
New in version 0.20.0.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
dropna : boolean, default True
    Don’t include NaN in the counts.
nunique : Series
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
>>> df.nunique()
A    3
B    1
>>> df.nunique(axis=1)
0    1
1    2
2    2
-
overlaps
(second_geometry)¶ Indicates if the intersection of the two geometries has the same shape type as one of the input geometries and is not equivalent to either of the input geometries.
- Parameters:
second_geometry: - a second geometry
-
part_count
¶ The number of geometry parts for the feature.
-
pct_change
(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)¶ Percent change over given number of periods.
- periods : int, default 1
- Periods to shift for forming percent change
- fill_method : str, default ‘pad’
- How to handle NAs before computing percent changes
- limit : int, default None
- The number of consecutive NAs to fill before stopping
- freq : DateOffset, timedelta, or offset alias string, optional
- Increment to use from time series API (e.g. ‘M’ or BDay())
chg : NDFrame
By default, the percentage change is calculated along the stat axis: 0, or Index, for DataFrame and 1, or minor for Panel. You can change this with the axis keyword argument.
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs)
- func : function
- function to apply to the NDFrame.
args
, andkwargs
are passed intofunc
. Alternatively a(callable, data_keyword)
tuple wheredata_keyword
is a string indicating the keyword ofcallable
that expects the NDFrame. - args : iterable, optional
- positional arguments passed into
func
. - kwargs : mapping, optional
- a dictionary of keyword arguments passed into
func
.
object : the return type of func
Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
You can write
>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
... )
pandas.DataFrame.apply pandas.DataFrame.applymap pandas.Series.map
-
pivot
(index=None, columns=None, values=None)¶ Reshape data (produce a “pivot” table) based on column values. Uses unique values from index / columns to form axes of the resulting DataFrame.
- index : string or object, optional
- Column name to use to make new frame’s index. If None, uses existing index.
- columns : string or object
- Column name to use to make new frame’s columns
- values : string or object, optional
- Column name to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns
pivoted : DataFrame
- DataFrame.pivot_table : generalization of pivot that can handle duplicate values for one index/column pair
- DataFrame.unstack : pivot based on the index values instead of a column
For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack methods
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6]})
>>> df
   foo bar  baz
0  one   A    1
1  one   B    2
2  one   C    3
3  two   A    4
4  two   B    5
5  two   C    6
>>> df.pivot(index='foo', columns='bar', values='baz')
     A  B  C
one  1  2  3
two  4  5  6
>>> df.pivot(index='foo', columns='bar')['baz']
     A  B  C
one  1  2  3
two  4  5  6
-
pivot_table
(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')¶ Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame
values : column to aggregate, optional
index : column, Grouper, array, or list of the previous
    If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.
- columns : column, Grouper, array, or list of the previous
- If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.
- aggfunc : function or list of functions, default numpy.mean
- If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves)
- fill_value : scalar, default None
- Value to replace missing values with
- margins : boolean, default False
- Add all row / columns (e.g. for subtotal / grand totals)
- dropna : boolean, default True
- Do not include columns whose entries are all NaN
- margins_name : string, default ‘All’
- Name of the row / column that will contain the totals when margins is True.
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", ... "bar", "bar", "bar", "bar"], ... "B": ["one", "one", "one", "two", "two", ... "one", "one", "two", "two"], ... "C": ["small", "large", "large", "small", ... "small", "large", "small", "small", ... "large"], ... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]}) >>> df A B C D 0 foo one small 1 1 foo one large 2 2 foo one large 2 3 foo two small 3 4 foo two small 3 5 bar one large 4 6 bar one small 5 7 bar two small 6 8 bar two large 7
>>> table = pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc=np.sum) >>> table ... C large small A B bar one 4.0 5.0 two 7.0 6.0 foo one 4.0 1.0 two NaN 6.0
table : DataFrame
- DataFrame.pivot : pivot without aggregation that can handle
- non-numeric data
-
plot
(*args, **kwargs)¶ writes the spatial dataframe to a map
-
point_count
¶ The total number of points for the feature.
-
point_from_angle_and_distance
(angle, distance, method='GEODESIC')¶ Returns a point at a given angle and distance in degrees and meters using the specified measurement type.
- Parameters:
angle: - The angle in degrees to the returned point.
distance: - The distance in meters to the returned point.
method: - PLANAR measurements reflect the projection of geographic
data onto the 2D surface (in other words, they will not take into account the curvature of the earth). GEODESIC, GREAT_ELLIPTIC, LOXODROME, and PRESERVE_SHAPE measurement types may be chosen as an alternative, if desired.
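A minimal sketch, assuming a Point from arcgis.geometry and a locally available geometry engine (arcpy or shapely); the values are hypothetical:
>>> from arcgis.geometry import Point
>>> start = Point({'x': -118.15, 'y': 33.80, 'spatialReference': {'wkid': 4326}})
>>> dest = start.point_from_angle_and_distance(angle=45, distance=1000)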
-
pop
(item)¶ Return item and drop from frame. Raise KeyError if not found.
- item : str
- Column label to be popped
popped : Series
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object
>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN
-
position_along_line
(value, use_percentage=False)¶ Returns a point on a line at a specified distance from the beginning of the line.
- Parameters:
value: - The distance along the line.
use_percentage: - The distance may be specified as a fixed unit
of measure or a ratio of the length of the line. If True, value is used as a percentage; if False, value is used as a distance. For percentages, the value should be expressed as a double from 0.0 (0%) to 1.0 (100%).
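A minimal sketch, assuming a Polyline from arcgis.geometry and a locally available geometry engine; the coordinates are hypothetical:
>>> from arcgis.geometry import Polyline
>>> line = Polyline({'paths': [[[0, 0], [10, 0]]],
...                  'spatialReference': {'wkid': 4326}})
>>> line.position_along_line(2.5)                       # a point 2.5 units along
>>> line.position_along_line(0.5, use_percentage=True)  # the line's midpoint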
-
pow
(other, axis='columns', level=None, fill_value=None)¶ Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rpow
-
prod
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ Return the product of the values for the requested axis
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
prod : Series or DataFrame (if level specified)
-
product
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ Return the product of the values for the requested axis
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
prod : Series or DataFrame (if level specified)
-
project_as
(spatial_reference, transformation_name=None)¶ Projects a geometry and optionally applies a geotransformation.
- Parameters:
spatial_reference: - The new spatial reference. This can be a
SpatialReference object or the coordinate system name.
transformation_name: - The geotransformation name.
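A minimal sketch, assuming a Point from arcgis.geometry and a locally available geometry engine; 4326 (WGS84) and 3857 (Web Mercator) are the well-known IDs used here:
>>> from arcgis.geometry import Point
>>> pt = Point({'x': -118.15, 'y': 33.80, 'spatialReference': {'wkid': 4326}})
>>> pt_mercator = pt.project_as(3857)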
-
quantile
(q=0.5, axis=0, numeric_only=True, interpolation='linear')¶ Return values at the given quantile over requested axis, a la numpy.percentile.
- q : float or array-like, default 0.5 (50% quantile)
- 0 <= q <= 1, the quantile(s) to compute
- axis : {0, 1, ‘index’, ‘columns’} (default 0)
- 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
- interpolation : {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}
New in version 0.18.0.
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
- linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
- lower: i.
- higher: j.
- nearest: i or j whichever is nearest.
- midpoint: (i + j) / 2.
quantiles : Series or DataFrame
- If
q
is an array, a DataFrame will be returned where the index isq
, the columns are the columns of self, and the values are the quantiles. - If
q
is a float, a Series will be returned where the index is the columns of self and the values are the quantiles.
>>> df = DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
...                columns=['a', 'b'])
>>> df.quantile(.1)
a    1.3
b    3.7
dtype: float64
>>> df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0
-
query
(expr, inplace=False, **kwargs)¶ Query the columns of a frame with a boolean expression.
- expr : string
- The query string to evaluate. You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b.
- inplace : bool
Whether the query should modify the data in place or return a modified copy
New in version 0.18.0.
- kwargs : dict
- See the documentation for
pandas.eval()
for complete details on the keyword arguments accepted byDataFrame.query()
.
q : DataFrame
The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().
This method uses the top-level pandas.eval() function to evaluate the passed query.
The query() method uses a slightly modified Python syntax by default. For example, the & and | (bitwise) operators have the precedence of their boolean cousins, and and or. This is syntactically valid Python, however the semantics are different.
You can change the semantics of the expression by passing the keyword argument parser='python'. This enforces the same semantics as evaluation in Python space. Likewise, you can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.
The DataFrame.index and DataFrame.columns attributes of the DataFrame instance are placed in the query namespace by default, which allows you to treat both the index and columns of the frame as a column in the frame. The identifier index is used for the frame index; you can also use the name of the index to identify it in a query.
For further details and examples see the query documentation in indexing.
pandas.eval DataFrame.eval
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
>>> df[df.a > df.b]  # same result as the previous expression
-
query_point_and_distance
(second_geometry, use_percentage=False)¶ Finds the point on the polyline nearest to the in_point and the distance between those points. Also returns information about the side of the line the in_point is on as well as the distance along the line where the nearest point occurs.
- Parameters:
second_geometry: - a second geometry
use_percentage: - If False, the measure will be returned as a
distance; if True, the measure will be returned as a percentage.
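A minimal sketch, assuming a Polyline and a Point from arcgis.geometry and a locally available geometry engine; the coordinates are hypothetical:
>>> from arcgis.geometry import Polyline, Point
>>> line = Polyline({'paths': [[[0, 0], [10, 0]]],
...                  'spatialReference': {'wkid': 4326}})
>>> pt = Point({'x': 3, 'y': 1, 'spatialReference': {'wkid': 4326}})
>>> line.query_point_and_distance(pt)  # nearest point plus distance/side information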
-
radd
(other, axis='columns', level=None, fill_value=None)¶ Addition of dataframe and other, element-wise (binary operator radd).
Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.add
-
rank
(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)¶ Compute numerical data ranks (1 through n) along axis. Equal values are assigned a rank that is the average of the ranks of those values
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- index to direct ranking
- method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}
- average: average rank of group
- min: lowest rank in group
- max: highest rank in group
- first: ranks assigned in order they appear in the array
- dense: like ‘min’, but rank always increases by 1 between groups
- numeric_only : boolean, default None
- Include only float, int, boolean data. Valid only for DataFrame or Panel objects
- na_option : {‘keep’, ‘top’, ‘bottom’}
- keep: leave NA values where they are
- top: smallest rank if ascending
- bottom: smallest rank if descending
- ascending : boolean, default True
- False for ranks by high (1) to low (N)
- pct : boolean, default False
- Computes percentage rank of data
ranks : same type as caller
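A short doctest-style illustration of average ranking with ties (hypothetical data):
>>> s = pd.Series([7, 3.5, 7, 1])
>>> s.rank()
0    3.5
1    2.0
2    3.5
3    1.0
dtype: float64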
-
rdiv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
    For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.truediv
-
reindex
(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)¶ Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
- labels : array-like, optional
- New labels / index to conform the axis specified by ‘axis’ to.
- index, columns : array-like, optional (should be specified using keywords)
- New labels / index to conform to. Preferably an Index object to avoid duplicating data
- axis : int or str, optional
- Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).
- method : {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}, optional
method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
- default: don’t fill gaps
- pad / ffill: propagate last valid observation forward to next valid
- backfill / bfill: use next valid observation to fill gap
- nearest: use nearest valid observations to fill gap
- copy : boolean, default True
- Return a new object, even if the passed indexes are the same
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
- fill_value : scalar, default np.NaN
- Value to use for missing values. Defaults to NaN, but can be any “compatible” value
- limit : int, default None
- Maximum number of consecutive elements to forward or backward fill
- tolerance : optional
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.
New in version 0.17.0.
New in version 0.21.0: (list-like tolerance)
DataFrame.reindex supports two calling conventions:
- (index=index_labels, columns=column_labels, ...)
- (labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Create a dataframe with some fictional data.
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({
...     'http_status': [200, 200, 404, 404, 301],
...     'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...     index=index)
>>> df
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00
Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
We can fill in the missing values by passing a value to the keyword fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the NaN values.
>>> df.reindex(new_index, fill_value=0)
               http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02
>>> df.reindex(new_index, fill_value='missing')
              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02
We can also reindex the columns.
>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
Or we can use “axis-style” keyword arguments
>>> df.reindex(['http_status', 'user_agent'], axis="columns")
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
To further illustrate the filling functionality in reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).
>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D')
>>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
...                    index=date_index)
>>> df2
            prices
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
Suppose we decide to expand the dataframe to cover a wider date range.
>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
>>> df2.reindex(date_index2)
            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
2010-01-07     NaN
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with NaN. If desired, we can fill in the missing values using one of several options.
For example, to backpropagate the last valid value to fill the NaN values, pass bfill as an argument to the method keyword.
>>> df2.reindex(date_index2, method='bfill')
            prices
2009-12-29     100
2009-12-30     100
2009-12-31     100
2010-01-01     100
2010-01-02     101
2010-01-03     NaN
2010-01-04     100
2010-01-05      89
2010-01-06      88
2010-01-07     NaN
Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN values present in the original dataframe, use the fillna() method.
See the user guide for more.
reindexed : DataFrame
-
reindex_axis
(labels, axis=0, method=None, level=None, copy=True, limit=None, fill_value=nan)¶ Conform input object to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
- labels : array-like
- New labels / index to conform to. Preferably an Index object to avoid duplicating data
axis : {0 or ‘index’, 1 or ‘columns’}
method : {None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}, optional
Method to use for filling holes in reindexed DataFrame:
- default: don’t fill gaps
- pad / ffill: propagate last valid observation forward to next valid
- backfill / bfill: use next valid observation to fill gap
- nearest: use nearest valid observations to fill gap
- copy : boolean, default True
- Return a new object, even if the passed indexes are the same
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
- limit : int, default None
- Maximum number of consecutive elements to forward or backward fill
- tolerance : optional
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.
New in version 0.17.0.
New in version 0.21.0: (list-like tolerance)
>>> df.reindex_axis(['A', 'B', 'C'], axis=1)
reindex, reindex_like
reindexed : DataFrame
-
reindex_like
(other, method=None, copy=True, limit=None, tolerance=None)¶ Return an object with matching indices to myself.
other : Object
method : string or None
copy : boolean, default True
limit : int, default None
    Maximum number of consecutive labels to fill for inexact matches.
- tolerance : optional
Maximum distance between labels of the other object and this object for inexact matches. Can be list-like.
New in version 0.17.0.
New in version 0.21.0: (list-like tolerance)
- Like calling s.reindex(index=other.index, columns=other.columns, method=…)
reindexed : same as input
-
rename
(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)¶ Alter axes labels.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
See the user guide for more.
- mapper, index, columns : dict-like or function, optional
- dict-like or function transformations to apply to
that axis’ values. Use either
mapper
andaxis
to specify the axis to target withmapper
, orindex
andcolumns
. - axis : int or str, optional
- Axis to target with
mapper
. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’. - copy : boolean, default True
- Also copy underlying data
- inplace : boolean, default False
- Whether to return a new %(klass)s. If True then value of copy is ignored.
- level : int or level name, default None
- In case of a MultiIndex, only rename labels in the specified level.
renamed : DataFrame
pandas.DataFrame.rename_axis
DataFrame.rename supports two calling conventions:
- (index=index_mapper, columns=columns_mapper, ...)
- (mapper, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) >>> df.rename(index=str, columns={"A": "a", "B": "c"}) a c 0 1 4 1 2 5 2 3 6
>>> df.rename(index=str, columns={"A": "a", "C": "c"}) a B 0 1 4 1 2 5 2 3 6
Using axis-style parameters
>>> df.rename(str.lower, axis='columns')
   a  b
0  1  4
1  2  5
2  3  6
>>> df.rename({1: 2, 2: 4}, axis='index')
   A  B
0  1  4
2  2  5
4  3  6
-
rename_axis
(mapper, axis=0, copy=True, inplace=False)¶ Alter the name of the index or columns.
- mapper : scalar, list-like, optional
- Value to set the axis name attribute.
axis : int or string, default 0
copy : boolean, default True
    Also copy underlying data
inplace : boolean, default False
renamed : type of caller or None if inplace=True
Prior to version 0.21.0, rename_axis could also be used to change the axis labels by passing a mapping or scalar. This behavior is deprecated and will be removed in a future version. Use rename instead.
pandas.Series.rename, pandas.DataFrame.rename, pandas.Index.rename
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) >>> df.rename_axis("foo") A B foo 0 1 4 1 2 5 2 3 6
>>> df.rename_axis("bar", axis="columns") bar A B 0 1 4 1 2 5 2 3 6
-
reorder_levels
(order, axis=0)¶ Rearrange index levels using input order. May not drop or duplicate levels
- order : list of int or list of str
- List representing new level order. Reference level by number (position) or by key (label).
- axis : int
- Where to reorder levels.
type of caller (new object)
-
replace
(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)¶ Replace values given in ‘to_replace’ with ‘value’.
to_replace : str, regex, list, dict, Series, numeric, or None
str or regex:
- str: string exactly matching to_replace will be replaced with value
- regex: regexs matching to_replace will be replaced with value
list of str, regex, or numeric:
- First, if to_replace and value are both lists, they must be the same length.
- Second, if
regex=True
then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use. - str and regex rules apply as above.
dict:
- Nested dictionaries, e.g., {‘a’: {‘b’: nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with nan. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
- Keys map to column names and values map to substitution values. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
None:
- This means that the
regex
argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is alsoNone
then this must be a nested dictionary orSeries
.
- This means that the
See the examples section for examples of each of these.
- value : scalar, dict, list, str, regex, default None
- Value to use to fill holes (e.g. 0), alternately a dict of values specifying which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
- inplace : boolean, default False
- If True, in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.
- limit : int, default None
- Maximum size gap to forward or backward fill
- regex : bool or same types as to_replace, default False
- Whether to interpret to_replace and/or value as regular
expressions. If this is
True
then to_replace must be a string. Otherwise, to_replace must beNone
because this parameter will be interpreted as a regular expression or a list, dict, or array of regular expressions. - method : string, optional, {‘pad’, ‘ffill’, ‘bfill’}
- The method to use for replacement, when to_replace is a list.
NDFrame.reindex NDFrame.asfreq NDFrame.fillna
filled : NDFrame
- AssertionError
- If regex is not a bool and to_replace is not None.
- TypeError
- If to_replace is a dict and value is not a list, dict, ndarray, or Series
- If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.
- ValueError
- If to_replace and value are lists or ndarrays, but they are not the same length.
- Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.
- Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.
- This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.
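The examples section referenced above is not reproduced in this extract; a minimal sketch of the common call patterns (the DataFrame below is illustrative):
>>> df = pd.DataFrame({'A': [0, 1, 2], 'B': ['x', 'y', 'z']})
>>> df.replace(0, 5)                 # scalar -> scalar
   A  B
0  5  x
1  1  y
2  2  z
>>> df.replace([1, 2], [10, 20])     # two lists of the same length
    A  B
0   0  x
1  10  y
2  20  z
>>> df.replace({'B': {'x': 'w'}})    # nested dict: look in column 'B' for 'x'
   A  B
0  0  w
1  1  y
2  2  z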
-
reproject
(spatial_reference, transformation=None, inplace=False)¶ Reprojects a given dataframe into a new coordinate system.
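A minimal sketch, assuming sdf is an existing SpatialDataFrame and that a WKID integer is an acceptable spatial_reference value (both are assumptions, not shown in the signature above):
>>> wgs84_sdf = sdf.reproject(4326)        # returns a new SpatialDataFrame
>>> sdf.reproject(3857, inplace=True)      # or modify this one in place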
-
resample
(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)¶ Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
- rule : string
- the offset string or object representing target conversion
- axis : int, optional, default 0
- closed : {‘right’, ‘left’}
- Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
- label : {‘right’, ‘left’}
- Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
- convention : {‘start’, ‘end’, ‘s’, ‘e’}
- For PeriodIndex only, controls whether to use the start or end of rule
- loffset : timedelta
- Adjust the resampled time labels
- base : int, default 0
- For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
- on : string, optional
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
New in version 0.19.0.
- level : string or int, optional
For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.
New in version 0.19.0.
To learn more about the offset strings, please see the pandas documentation on offset aliases.
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T') >>> series = pd.Series(range(9), index=index) >>> series 2000-01-01 00:00:00 0 2000-01-01 00:01:00 1 2000-01-01 00:02:00 2 2000-01-01 00:03:00 3 2000-01-01 00:04:00 4 2000-01-01 00:05:00 5 2000-01-01 00:06:00 6 2000-01-01 00:07:00 7 2000-01-01 00:08:00 8 Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum() 2000-01-01 00:00:00 3 2000-01-01 00:03:00 12 2000-01-01 00:06:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.
>>> series.resample('3T', label='right').sum() 2000-01-01 00:03:00 3 2000-01-01 00:06:00 12 2000-01-01 00:09:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum() 2000-01-01 00:00:00 0 2000-01-01 00:03:00 6 2000-01-01 00:06:00 15 2000-01-01 00:09:00 15 Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5] #select first 5 rows 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 1.0 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2.0 Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 0 2000-01-01 00:01:00 1 2000-01-01 00:01:30 1 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5] 2000-01-01 00:00:00 0 2000-01-01 00:00:30 1 2000-01-01 00:01:00 1 2000-01-01 00:01:30 2 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Pass a custom function via apply:
>>> def custom_resampler(array_like): ... return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler) 2000-01-01 00:00:00 8 2000-01-01 00:03:00 17 2000-01-01 00:06:00 26 Freq: 3T, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.
>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01', freq='A', periods=2)) >>> s 2012 1 2013 2 Freq: A-DEC, dtype: int64
Resample by month using ‘start’ convention. Values are assigned to the first month of the period.
>>> s.resample('M', convention='start').asfreq().head() 2012-01 1.0 2012-02 NaN 2012-03 NaN 2012-04 NaN 2012-05 NaN Freq: M, dtype: float64
Resample by month using ‘end’ convention. Values are assigned to the last month of the period.
>>> s.resample('M', convention='end').asfreq() 2012-12 1.0 2013-01 NaN 2013-02 NaN 2013-03 NaN 2013-04 NaN 2013-05 NaN 2013-06 NaN 2013-07 NaN 2013-08 NaN 2013-09 NaN 2013-10 NaN 2013-11 NaN 2013-12 2.0 Freq: M, dtype: float64
For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.
>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd']) >>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T') >>> df.resample('3T', on='time').sum() a b c d time 2000-01-01 00:00:00 0 3 6 9 2000-01-01 00:03:00 0 3 6 9 2000-01-01 00:06:00 0 3 6 9
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.
>>> time = pd.date_range('1/1/2000', periods=5, freq='T') >>> df2 = pd.DataFrame(data=10*[range(4)], columns=['a', 'b', 'c', 'd'], index=pd.MultiIndex.from_product([time, [1, 2]]) ) >>> df2.resample('3T', level=0).sum() a b c d 2000-01-01 00:00:00 0 6 12 18 2000-01-01 00:03:00 0 4 8 12
-
reset_index
(level=None, drop=False, inplace=False, col_level=0, col_fill='')¶ For DataFrame with multi-level index, return new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.
- level : int, str, tuple, or list, default None
- Only remove the given levels from the index. Removes all levels by default
- drop : boolean, default False
- Do not try to insert index into dataframe columns. This resets the index to the default integer index.
- inplace : boolean, default False
- Modify the DataFrame in place (do not create a new object)
- col_level : int or str, default 0
- If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
- col_fill : object, default ‘’
- If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
reset : DataFrame
>>> df = pd.DataFrame([('bird', 389.0), ... ('bird', 24.0), ... ('mammal', 80.5), ... ('mammal', np.nan)], ... index=['falcon', 'parrot', 'lion', 'monkey'], ... columns=('class', 'max_speed')) >>> df class max_speed falcon bird 389.0 parrot bird 24.0 lion mammal 80.5 monkey mammal NaN
When we reset the index, the old index is added as a column, and a new sequential index is used:
>>> df.reset_index() index class max_speed 0 falcon bird 389.0 1 parrot bird 24.0 2 lion mammal 80.5 3 monkey mammal NaN
We can use the drop parameter to avoid the old index being added as a column:
>>> df.reset_index(drop=True) class max_speed 0 bird 389.0 1 bird 24.0 2 mammal 80.5 3 mammal NaN
You can also use reset_index with MultiIndex.
>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'), ... ('bird', 'parrot'), ... ('mammal', 'lion'), ... ('mammal', 'monkey')], ... names=['class', 'name']) >>> columns = pd.MultiIndex.from_tuples([('speed', 'max'), ... ('species', 'type')]) >>> df = pd.DataFrame([(389.0, 'fly'), ... ( 24.0, 'fly'), ... ( 80.5, 'run'), ... (np.nan, 'jump')], ... index=index, ... columns=columns) >>> df speed species max type class name bird falcon 389.0 fly parrot 24.0 fly mammal lion 80.5 run monkey NaN jump
If the index has multiple levels, we can reset a subset of them:
>>> df.reset_index(level='class') class speed species max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
If we are not dropping the index, by default, it is placed in the top level. We can place it in another level:
>>> df.reset_index(level='class', col_level=1) speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
When the index is inserted under another level, we can specify under which one with the parameter col_fill:
>>> df.reset_index(level='class', col_level=1, col_fill='species') species speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
If we specify a nonexistent level for col_fill, it is created:
>>> df.reset_index(level='class', col_level=1, col_fill='genus') genus speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
-
rfloordiv
(other, axis='columns', level=None, fill_value=None)¶ Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.floordiv
-
rmod
(other, axis='columns', level=None, fill_value=None)¶ Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.mod
-
rmul
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.mul
-
rolling
(window, min_periods=None, freq=None, center=False, win_type=None, on=None, axis=0, closed=None)¶ Provides rolling window calculations.
New in version 0.18.0.
- window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If it is an offset then this will be the time period of each window. Each window will be variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0.
- min_periods : int, default None
- Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, this will default to 1.
- freq : string or DateOffset object, optional (default None)
Deprecated since version 0.18.0: Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.
- center : boolean, default False
- Set the labels at the center of the window.
- win_type : string, default None
- Provide a window type. See the notes below.
- on : string, optional
- For a DataFrame, column on which to calculate the rolling window, rather than the index
- closed : string, default None
Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. For offset-based windows, it defaults to ‘right’. For fixed windows, defaults to ‘both’. Remaining cases not implemented for fixed windows.
New in version 0.20.0.
- axis : int or string, default 0
Returns: a Window or Rolling sub-classed for the particular operation
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]}) >>> df B 0 0.0 1 1.0 2 2.0 3 NaN 4 4.0
Rolling sum with a window length of 2, using the ‘triang’ window type.
>>> df.rolling(2, win_type='triang').sum() B 0 NaN 1 1.0 2 2.5 3 NaN 4 NaN
Rolling sum with a window length of 2, min_periods defaults to the window length.
>>> df.rolling(2).sum() B 0 NaN 1 1.0 2 3.0 3 NaN 4 NaN
Same as above, but explicitly set the min_periods
>>> df.rolling(2, min_periods=1).sum() B 0 0.0 1 1.0 2 3.0 3 2.0 4 4.0
A ragged (meaning not-a-regular frequency), time-indexed DataFrame
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]}, ... index = [pd.Timestamp('20130101 09:00:00'), ... pd.Timestamp('20130101 09:00:02'), ... pd.Timestamp('20130101 09:00:03'), ... pd.Timestamp('20130101 09:00:05'), ... pd.Timestamp('20130101 09:00:06')])
>>> df B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:02 1.0 2013-01-01 09:00:03 2.0 2013-01-01 09:00:05 NaN 2013-01-01 09:00:06 4.0
Contrasting to an integer rolling window, this will roll a variable length window corresponding to the time period. The default for min_periods is 1.
>>> df.rolling('2s').sum() B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:02 1.0 2013-01-01 09:00:03 3.0 2013-01-01 09:00:05 NaN 2013-01-01 09:00:06 4.0
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
To learn more about the offsets & frequency strings, please see the pandas documentation on offset aliases.
The recognized win_types are:
boxcar
triang
blackman
hamming
bartlett
parzen
bohman
blackmanharris
nuttall
barthann
kaiser (needs beta)
gaussian (needs std)
general_gaussian (needs power, width)
slepian (needs width)
-
round
(decimals=0, *args, **kwargs)¶ Round a DataFrame to a variable number of decimal places.
New in version 0.17.0.
- decimals : int, dict, Series
- Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
>>> df = pd.DataFrame(np.random.random([3, 3]), ... columns=['A', 'B', 'C'], index=['first', 'second', 'third']) >>> df A B C first 0.028208 0.992815 0.173891 second 0.038683 0.645646 0.577595 third 0.877076 0.149370 0.491027 >>> df.round(2) A B C first 0.03 0.99 0.17 second 0.04 0.65 0.58 third 0.88 0.15 0.49 >>> df.round({'A': 1, 'C': 2}) A B C first 0.0 0.992815 0.17 second 0.0 0.645646 0.58 third 0.9 0.149370 0.49 >>> decimals = pd.Series([1, 0, 2], index=['A', 'B', 'C']) >>> df.round(decimals) A B C first 0.0 1 0.17 second 0.0 1 0.58 third 0.9 0 0.49
Returns: DataFrame object
See also: numpy.around, Series.round
-
rpow
(other, axis='columns', level=None, fill_value=None)¶ Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.pow
-
rsub
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.sub
-
rtruediv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.truediv
-
sample
(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)¶ Returns a random sample of items from an axis of object.
- n : int, optional
- Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
- frac : float, optional
- Fraction of axis items to return. Cannot be used with n.
- replace : boolean, optional
- Sample with or without replacement. Default = False.
- weights : str or ndarray-like, optional
- Default ‘None’ results in equal probability weighting. If passed a Series, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DataFrame, will accept the name of a column when axis = 0. Unless weights are a Series, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. inf and -inf values not allowed.
- random_state : int or numpy.random.RandomState, optional
- Seed for the random number generator (if int), or numpy RandomState object.
- axis : int or string, optional
- Axis to sample. Accepts axis number or name. Default is stat axis for given data type (0 for Series and DataFrames, 1 for Panels).
A new object of same type as caller.
Generate an example Series and DataFrame:
>>> s = pd.Series(np.random.randn(50)) >>> s.head() 0 -0.038497 1 1.820773 2 -0.972766 3 -1.598270 4 -1.095526 dtype: float64 >>> df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD')) >>> df.head() A B C D 0 0.016443 -2.318952 -0.566372 -1.028078 1 -1.051921 0.438836 0.658280 -0.175797 2 -1.243569 -0.364626 -0.215065 0.057736 3 1.768216 0.404512 -0.385604 -1.457834 4 1.072446 -1.137172 0.314194 -0.046661
Next extract a random sample from both of these objects…
3 random elements from the Series:
>>> s.sample(n=3) 27 -0.994689 55 -1.049016 67 -0.224565 dtype: float64
And a random 10% of the DataFrame with replacement:
>>> df.sample(frac=0.1, replace=True) A B C D 35 1.981780 0.142106 1.817165 -0.290805 49 -1.336199 -0.448634 -0.789640 0.217116 40 0.823173 -0.078816 1.009536 1.015108 15 1.421154 -0.055301 -1.922594 -0.019696 6 -0.148339 0.832938 1.787600 -1.383767
-
segment_along_line
(start_measure, end_measure, use_percentage=False)¶ Returns a Polyline between start and end measures. Similar to Polyline.positionAlongLine but will return a polyline segment between two points on the polyline instead of a single point.
- Parameters:
start_measure: - The starting distance from the beginning of the line.
end_measure: - The ending distance from the beginning of the line.
use_percentage: - The start and end measures may be specified as fixed units or as a ratio. If True, start_measure and end_measure are used as a percentage; if False, start_measure and end_measure are used as a distance. For percentages, the measures should be expressed as a double from 0.0 (0 percent) to 1.0 (100 percent).
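A hedged sketch, assuming line is a Polyline geometry (the variable name is illustrative):
>>> seg = line.segment_along_line(0, 100)                          # first 100 linear units
>>> half = line.segment_along_line(0.0, 0.5, use_percentage=True)  # first half of the line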
-
select
(crit, axis=0)¶ Return data corresponding to axis labels matching criteria
DEPRECATED: use df.loc[df.index.map(crit)] to select via labels
- crit : function
- To be called on each index (label). Should return True or False
- axis : int
Returns: selection : type of caller
-
select_by_location
(other, matches_only=True)¶ Selects all rows in a given SpatialDataFrame based on a given geometry
- Inputs:
other: arcpy.Geometry object
matches_only: boolean value. If True, only matched records will be returned; else a field called ‘select_by_location’ will be added to the dataframe with the results of the select by location.
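A hedged sketch, assuming sdf is a SpatialDataFrame and area is an arcpy.Geometry polygon (both names are illustrative):
>>> hits = sdf.select_by_location(area)                        # matched rows only
>>> flagged = sdf.select_by_location(area, matches_only=False)
>>> flagged['select_by_location'].head()                       # per-row selection result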
-
select_dtypes
(include=None, exclude=None)¶ Return a subset of a DataFrame including/excluding columns based on their dtype.
- include, exclude : scalar or list-like
- A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
- ValueError
- If both of include and exclude are empty
- If include and exclude have overlapping elements
- If any kind of string dtype is passed in.
- subset : DataFrame
- The subset of the frame including the dtypes in include and excluding the dtypes in exclude.
- To select all numeric types use the numpy dtype numpy.number
- To select strings you must use the object dtype, but note that this will return all object dtype columns
- See the numpy dtype hierarchy
- To select datetimes, use np.datetime64, ‘datetime’ or ‘datetime64’
- To select timedeltas, use np.timedelta64, ‘timedelta’ or ‘timedelta64’
- To select Pandas categorical dtypes, use ‘category’
- To select Pandas datetimetz dtypes, use ‘datetimetz’ (new in 0.20.0), or a ‘datetime64[ns, tz]’ string
>>> df = pd.DataFrame({'a': np.random.randn(6).astype('f4'), ... 'b': [True, False] * 3, ... 'c': [1.0, 2.0] * 3}) >>> df a b c 0 0.3962 True 1 1 0.1459 False 2 2 0.2623 True 1 3 0.0764 False 2 4 -0.9703 True 1 5 -1.2094 False 2 >>> df.select_dtypes(include='bool') b 0 True 1 False 2 True 3 False 4 True 5 False >>> df.select_dtypes(include=['float64']) c 0 1 1 2 2 1 3 2 4 1 5 2 >>> df.select_dtypes(exclude=['floating']) b 0 True 1 False 2 True 3 False 4 True 5 False
-
sem
(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)¶ Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
- axis : {index (0), columns (1)}
- skipna : boolean, default True
- Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof : int, default 1
- degrees of freedom
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
sem : Series or DataFrame (if level specified)
-
series_extent
¶ Return a single bounding box (xmin, ymin, xmax, ymax) for all geometries
This is a shortcut for calculating the min/max x and y bounds individually.
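A short sketch of unpacking the combined bounds (assuming a SpatialDataFrame sdf and that the property returns the 4-tuple described above):
>>> xmin, ymin, xmax, ymax = sdf.series_extent
>>> width, height = xmax - xmin, ymax - ymin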
-
set_axis
(labels, axis=0, inplace=None)¶ Assign desired index to given axis
- labels: list-like or Index
- The values for the new index
- axis : int or string, default 0
- inplace : boolean, default None
- Whether to return a new NDFrame instance.
WARNING: inplace=None currently falls back to True, but in a future version, will default to False. Use inplace=True explicitly rather than relying on the default.
New in version 0.21.0: The signature was made consistent with the rest of the API. Previously, the “axis” and “labels” arguments were respectively the first and second positional arguments.
- renamed : NDFrame or None
- An object of same type as caller if inplace=False, None otherwise.
pandas.NDFrame.rename
>>> s = pd.Series([1, 2, 3]) >>> s 0 1 1 2 2 3 dtype: int64 >>> s.set_axis(['a', 'b', 'c'], axis=0, inplace=False) a 1 b 2 c 3 dtype: int64 >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) >>> df.set_axis(['a', 'b', 'c'], axis=0, inplace=False) A B a 1 4 b 2 5 c 3 6 >>> df.set_axis(['I', 'II'], axis=1, inplace=False) I II 0 1 4 1 2 5 2 3 6 >>> df.set_axis(['i', 'ii'], axis=1, inplace=True) >>> df i ii 0 1 4 1 2 5 2 3 6
-
set_geometry
(col, drop=False, inplace=False, sr=None)¶ Set the SpatialDataFrame geometry using either an existing column or the specified input. By default yields a new object.
The original geometry column is replaced with the input.
Parameters:
- col: column label or array
- drop: boolean, default False
- Delete column to be used as the new geometry
- inplace: boolean, default False
- Modify the SpatialDataFrame in place (do not create a new object)
- sr : str/integer the wkid value
- Coordinate system to use. If passed, overrides both DataFrame and col’s sr. Otherwise, tries to get sr from passed col values or DataFrame.
Returns: SpatialDataFrame
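A minimal sketch, assuming sdf already holds a second geometry column named 'SHAPE2' (the column name is illustrative):
>>> sdf2 = sdf.set_geometry('SHAPE2')                  # returns a new object
>>> sdf.set_geometry('SHAPE2', inplace=True, sr=4326)  # or modify in place and set the sr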
-
set_index
(keys, drop=True, append=False, inplace=False, verify_integrity=False)¶ Set the DataFrame index (row labels) using one or more existing columns. By default yields a new object.
- keys : column label or list of column labels / arrays
- drop : boolean, default True
- Delete columns to be used as the new index
- append : boolean, default False
- Whether to append columns to existing index
- inplace : boolean, default False
- Modify the DataFrame in place (do not create a new object)
- verify_integrity : boolean, default False
- Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method
>>> df = pd.DataFrame({'month': [1, 4, 7, 10], ... 'year': [2012, 2014, 2013, 2014], ... 'sale':[55, 40, 84, 31]}) month sale year 0 1 55 2012 1 4 40 2014 2 7 84 2013 3 10 31 2014
Set the index to become the ‘month’ column:
>>> df.set_index('month') sale year month 1 55 2012 4 40 2014 7 84 2013 10 31 2014
Create a multi-index using columns ‘year’ and ‘month’:
>>> df.set_index(['year', 'month']) sale year month 2012 1 55 2014 4 40 2013 7 84 2014 10 31
Create a multi-index using a set of values and a column:
>>> df.set_index([[1, 2, 3, 4], 'year']) month sale year 1 2012 1 55 2 2014 4 40 3 2013 7 84 4 2014 10 31
dataframe : DataFrame
-
set_value
(index, col, value, takeable=False)¶ Put single value at passed column and index
Deprecated since version 0.21.0.
Please use .at[] or .iat[] accessors.
- index : row label
- col : column label
- value : scalar value
- takeable : interpret the index/col as indexers, default False
- frame : DataFrame
- If label pair is contained, will be reference to calling DataFrame, otherwise a new object
-
shape
¶ Return a tuple representing the dimensionality of the DataFrame.
-
shift
(periods=1, freq=None, axis=0)¶ Shift index by desired number of periods with an optional time freq
- periods : int
- Number of periods to move, can be positive or negative
- freq : DateOffset, timedelta, or time rule string, optional
- Increment to use from the tseries module or time rule (e.g. ‘EOM’). See Notes.
axis : {0 or ‘index’, 1 or ‘columns’}
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
shifted : DataFrame
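For example:
>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> df.shift(1)       # values move down one row; the first row becomes NaN
     A
0  NaN
1  1.0
2  2.0
>>> df.shift(-1)      # negative periods shift the other way
     A
0  2.0
1  3.0
2  NaN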
-
sindex
¶
-
size
¶ number of elements in the NDFrame
-
skew
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ Return unbiased skew over requested axis, normalized by N-1.
- axis : {index (0), columns (1)}
- skipna : boolean, default True
- Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
skew : Series or DataFrame (if level specified)
-
slice_shift
(periods=1, axis=0)¶ Equivalent to shift without copying data. The shifted data will not include the dropped periods and the shifted axis will be smaller than the original.
- periods : int
- Number of periods to move, can be positive or negative
While the slice_shift is faster than shift, you may pay for it later during alignment.
shifted : same type as caller
-
snap_to_line
(second_geometry)¶ Returns a new point based on second_geometry snapped to this geometry.
- Parameters:
second_geometry: - a second geometry
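A hedged sketch (the geometry variables are illustrative):
>>> snapped = line_geom.snap_to_line(point_geom)   # point_geom snapped onto line_geom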
-
sort_index
(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)¶ Sort object by labels (along an axis)
- axis : index, columns to direct sorting
- level : int or level name or list of ints or list of level names
- if not None, sort on values in specified index level(s)
- ascending : boolean, default True
- Sort ascending vs. descending
- inplace : bool, default False
- if True, perform operation in-place
- kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
- Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
- na_position : {‘first’, ‘last’}, default ‘last’
- first puts NaNs at the beginning, last puts NaNs at the end. Not implemented for MultiIndex.
- sort_remaining : bool, default True
- if true and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level
sorted_obj : DataFrame
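For example:
>>> df = pd.DataFrame({'A': [1, 2, 3]}, index=['b', 'c', 'a'])
>>> df.sort_index()
   A
a  3
b  1
c  2
>>> df.sort_index(ascending=False)
   A
c  2
b  1
a  3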
-
sort_values
(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')¶ Sort by the values along either axis
New in version 0.17.0.
- by : str or list of str
- Name or list of names which refer to the axis items.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- Axis to direct sorting
- ascending : bool or list of bool, default True
- Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
- inplace : bool, default False
- if True, perform operation in-place
- kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
- Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
- na_position : {‘first’, ‘last’}, default ‘last’
- first puts NaNs at the beginning, last puts NaNs at the end
sorted_obj : DataFrame
>>> df = pd.DataFrame({ ... 'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'], ... 'col2' : [2, 1, 9, 8, 7, 4], ... 'col3': [0, 1, 9, 4, 2, 3], ... }) >>> df col1 col2 col3 0 A 2 0 1 A 1 1 2 B 9 9 3 NaN 8 4 4 D 7 2 5 C 4 3
Sort by col1
>>> df.sort_values(by=['col1']) col1 col2 col3 0 A 2 0 1 A 1 1 2 B 9 9 5 C 4 3 4 D 7 2 3 NaN 8 4
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2']) col1 col2 col3 1 A 1 1 0 A 2 0 2 B 9 9 5 C 4 3 4 D 7 2 3 NaN 8 4
Sort Descending
>>> df.sort_values(by='col1', ascending=False) col1 col2 col3 4 D 7 2 5 C 4 3 2 B 9 9 0 A 2 0 1 A 1 1 3 NaN 8 4
Putting NAs first
>>> df.sort_values(by='col1', ascending=False, na_position='first') col1 col2 col3 3 NaN 8 4 4 D 7 2 5 C 4 3 2 B 9 9 0 A 2 0 1 A 1 1
-
sortlevel
(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)¶ DEPRECATED: use DataFrame.sort_index()
Sort multilevel index by chosen axis and primary level. Data will be lexicographically sorted by the chosen level followed by the other levels (in order).
- level : int
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- ascending : boolean, default True
- inplace : boolean, default False
- Sort the DataFrame without creating a new instance
- sort_remaining : boolean, default True
- Sort by the other levels too.
sorted : DataFrame
DataFrame.sort_index(level=…)
-
spatial_reference
¶ The spatial reference of the geometry.
-
squeeze
(axis=None)¶ Squeeze length 1 dimensions.
- axis : None, integer or string axis name, optional
The axis to squeeze if 1-sized.
New in version 0.20.0.
scalar if 1-sized, else original object
-
stack
(level=-1, dropna=True)¶ Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels. The level involved will automatically get sorted.
- level : int, string, or list of these, default last level
- Level(s) to stack, can pass level name
- dropna : boolean, default True
- Whether to drop rows in the resulting Frame/Series with no valid values
>>> s a b one 1. 2. two 3. 4.
>>> s.stack() one a 1 b 2 two a 3 b 4
stacked : DataFrame or Series
-
std
(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
- axis : {index (0), columns (1)}
- skipna : boolean, default True
- Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof : int, default 1
- degrees of freedom
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
std : Series or DataFrame (if level specified)
-
style
¶ Property returning a Styler object containing methods for building a styled HTML representation of the DataFrame.
pandas.io.formats.style.Styler
-
sub
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rsub
-
subtract
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rsub
-
sum
(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)¶ Return the sum of the values for the requested axis
- axis : {index (0), columns (1)}
- skipna : boolean, default True
- Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
sum : Series or DataFrame (if level specified)
-
swapaxes
(axis1, axis2, copy=True)¶ Interchange axes and swap values axes appropriately
y : same as input
-
swaplevel
(i=-2, j=-1, axis=0)¶ Swap levels i and j in a MultiIndex on a particular axis
- i, j : int, string (can be mixed)
- Level of index to be swapped. Can pass level name as string.
swapped : type of caller (new object)
Changed in version 0.18.1: The indexes i and j are now optional, and default to the two innermost levels of the index.
-
symmetric_difference
(second_geometry)¶ Constructs the geometry that is the union of two geometries minus the intersection of those geometries. The two input geometries must be the same shape type.
- Parameters:
second_geometry: - a second geometry
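A hedged sketch, assuming g1 and g2 are two polygon geometries of the same shape type (the names are illustrative):
>>> xor_geom = g1.symmetric_difference(g2)   # (g1 union g2) minus (g1 intersect g2)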
-
tail
(n=5)¶ Return the last n rows.
- n : int, default 5
- Number of rows to select.
- obj_tail : type of caller
- The last n rows of the caller object.
-
take
(indices, axis=0, convert=None, is_copy=True, **kwargs)¶ Return the elements in the given positional indices along an axis.
This means that we are not indexing according to actual values in the index attribute of the object. We are indexing according to the actual position of the element in the object.
- indices : array-like
- An array of ints indicating which positions to take.
- axis : int, default 0
- The axis on which to select elements. “0” means that we are selecting rows, “1” means that we are selecting columns, etc.
- convert : bool, default True
Deprecated since version 0.21.0: In the future, negative indices will always be converted.
Whether to convert negative indices into positive ones. For example, -1 would map to len(axis) - 1. The conversions are similar to the behavior of indexing a regular Python list.
- is_copy : bool, default True
- Whether to return a copy of the original object or not.
>>> df = pd.DataFrame([('falcon', 'bird', 389.0), ('parrot', 'bird', 24.0), ('lion', 'mammal', 80.5), ('monkey', 'mammal', np.nan)], columns=('name', 'class', 'max_speed'), index=[0, 2, 3, 1]) >>> df name class max_speed 0 falcon bird 389.0 2 parrot bird 24.0 3 lion mammal 80.5 1 monkey mammal NaN
Take elements at positions 0 and 3 along the axis 0 (default).
Note how the actual indices selected (0 and 1) do not correspond to our selected indices 0 and 3. That’s because we are selecting the 0th and 3rd rows, not rows whose indices equal 0 and 3.
>>> df.take([0, 3]) name class max_speed 0 falcon bird 389.0 1 monkey mammal NaN
Take elements at indices 1 and 2 along the axis 1 (column selection).
>>> df.take([1, 2], axis=1) class max_speed 0 bird 389.0 2 bird 24.0 3 mammal 80.5 1 mammal NaN
We may take elements using negative integers for positive indices, starting from the end of the object, just like with Python lists.
>>> df.take([-1, -2]) name class max_speed 1 monkey mammal NaN 3 lion mammal 80.5
- taken : type of caller
- An array-like containing the elements taken from the object.
See also: numpy.ndarray.take, numpy.take
-
to_clipboard
(excel=None, sep=None, **kwargs)¶ Attempt to write text representation of object to the system clipboard This can be pasted into Excel, for example.
- excel : boolean, defaults to True
- if True, use the provided separator, writing in a csv format for allowing easy pasting into excel. if False, write a string representation of the object to the clipboard
- sep : optional, defaults to tab
- other keywords are passed to to_csv
- Requirements for your platform
- Linux: xclip, or xsel (with gtk or PyQt4 modules)
- Windows: none
- OS X: none
-
to_csv
(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')¶ Write DataFrame to a comma-separated values (csv) file
- path_or_buf : string or file handle, default None
- File path or object, if None is provided the result is returned as a string.
- sep : character, default ‘,’
- Field delimiter for the output file.
- na_rep : string, default ‘’
- Missing data representation
- float_format : string, default None
- Format string for floating point numbers
- columns : sequence, optional
- Columns to write
- header : boolean or list of string, default True
- Write out the column names. If a list of strings is given it is assumed to be aliases for the column names
- index : boolean, default True
- Write row names (index)
- index_label : string or sequence, or False, default None
- Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R
- mode : str
- Python write mode, default ‘w’
- encoding : string, optional
- A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
- compression : string, optional
- a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
- line_terminator : string, default '\n'
- The newline character or character sequence to use in the output file
- quoting : optional constant from csv module
- defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric
- quotechar : string (length 1), default ‘"’
- character used to quote fields
- doublequote : boolean, default True
- Control quoting of quotechar inside a field
- escapechar : string (length 1), default None
- character used to escape sep and quotechar when appropriate
- chunksize : int or None
- rows to write at a time
- tupleize_cols : boolean, default False
Deprecated since version 0.21.0: This argument will be removed and will always write each row of the multi-index as a separate row in the CSV file.
Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format, where each MultiIndex column is a row in the CSV (if False).
- date_format : string, default None
- Format string for datetime objects
- decimal: string, default ‘.’
- Character recognized as decimal separator. E.g. use ‘,’ for European data
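For example:
>>> df.to_csv('out.csv', index=False)   # write to a file without the row index
>>> text = df.to_csv()                  # with path_or_buf=None the CSV is returned as a string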
-
to_dense
()¶ Return dense representation of NDFrame (as opposed to sparse)
-
to_dict
(orient='dict', into=<class 'dict'>)¶ Convert DataFrame to dictionary.
- orient : str {‘dict’, ‘list’, ‘series’, ‘split’, ‘records’, ‘index’}
Determines the type of the values of the dictionary.
dict (default) : dict like {column -> {index -> value}}
list : dict like {column -> [values]}
series : dict like {column -> Series(values)}
split : dict like {index -> [index], columns -> [columns], data -> [values]}
records : list like [{column -> value}, … , {column -> value}]
index : dict like {index -> {column -> value}}
New in version 0.17.0.
Abbreviations are allowed. s indicates series and sp indicates split.
- into : class, default dict
The collections.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
New in version 0.21.0.
result : collections.Mapping like {column -> {index -> value}}
>>> df = pd.DataFrame( {'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b']) >>> df col1 col2 a 1 0.50 b 2 0.75 >>> df.to_dict() {'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
You can specify the return orientation.
>>> df.to_dict('series') {'col1': a 1 b 2 Name: col1, dtype: int64, 'col2': a 0.50 b 0.75 Name: col2, dtype: float64} >>> df.to_dict('split') {'columns': ['col1', 'col2'], 'data': [[1.0, 0.5], [2.0, 0.75]], 'index': ['a', 'b']} >>> df.to_dict('records') [{'col1': 1.0, 'col2': 0.5}, {'col1': 2.0, 'col2': 0.75}] >>> df.to_dict('index') {'a': {'col1': 1.0, 'col2': 0.5}, 'b': {'col1': 2.0, 'col2': 0.75}}
You can also specify the mapping type.
>>> from collections import OrderedDict, defaultdict >>> df.to_dict(into=OrderedDict) OrderedDict([('col1', OrderedDict([('a', 1), ('b', 2)])), ('col2', OrderedDict([('a', 0.5), ('b', 0.75)]))])
If you want a defaultdict, you need to initialize it:
>>> dd = defaultdict(list) >>> df.to_dict('records', into=dd) [defaultdict(<type 'list'>, {'col2': 0.5, 'col1': 1.0}), defaultdict(<type 'list'>, {'col2': 0.75, 'col1': 2.0})]
-
to_excel
(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True, freeze_panes=None)¶ Write DataFrame to an excel sheet
- excel_writer : string or ExcelWriter object
- File path or existing ExcelWriter
- sheet_name : string, default ‘Sheet1’
- Name of sheet which will contain DataFrame
- na_rep : string, default ‘’
- Missing data representation
- float_format : string, default None
- Format string for floating point numbers
- columns : sequence, optional
- Columns to write
- header : boolean or list of string, default True
- Write out the column names. If a list of strings is given it is assumed to be aliases for the column names
- index : boolean, default True
- Write row names (index)
- index_label : string or sequence, default None
- Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
- startrow :
- upper left cell row to dump data frame
- startcol :
- upper left cell column to dump data frame
- engine : string, default None
- write engine to use - you can also set this via the options io.excel.xlsx.writer, io.excel.xls.writer, and io.excel.xlsm.writer.
- merge_cells : boolean, default True
- Write MultiIndex and Hierarchical Rows as merged cells.
- encoding: string, default None
- encoding of the resulting excel file. Only necessary for xlwt, other writers support unicode natively.
- inf_rep : string, default ‘inf’
- Representation for infinity (there is no native representation for infinity in Excel)
- freeze_panes : tuple of integer (length 2), default None
Specifies the one-based bottommost row and rightmost column that is to be frozen
New in version 0.20.0.
If passing an existing ExcelWriter object, then the sheet will be added to the existing workbook. This can be used to save different DataFrames to one workbook:
>>> writer = pd.ExcelWriter('output.xlsx') >>> df1.to_excel(writer,'Sheet1') >>> df2.to_excel(writer,'Sheet2') >>> writer.save()
For compatibility with to_csv, to_excel serializes lists and dicts to strings before writing.
-
to_feather
(fname)¶ write out the binary feather-format for DataFrames
New in version 0.20.0.
- fname : str
- string file path
-
to_feature_collection
(name=None, drawing_info=None, extent=None, global_id_field=None)¶ converts a Spatial DataFrame to a Feature Collection
- name : optional string. Name of the Feature Collection
- drawing_info : Optional dictionary. This is the rendering information for a Feature Collection. Rendering information is a dictionary with the symbology, labelling and other properties defined. See: http://resources.arcgis.com/en/help/arcgis-rest-api/index.html#/Renderer_objects/02r30000019t000000/
- extent : Optional dictionary. If desired, a custom extent can be provided to set where the map starts up when showing the data. The default is the full extent of the dataset in the Spatial DataFrame.
- global_id_field : Optional string. The Global ID field of the dataset.
Returns: FeatureCollection object
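A minimal sketch (the collection name is illustrative):
>>> fc = sdf.to_feature_collection(name='parcels')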
-
to_featureclass
(out_location, out_name, overwrite=True, skip_invalid=True)¶ converts a SpatialDataFrame to a feature class
- Parameters:
out_location: save location workspace
out_name: name of the feature class to save as
overwrite: boolean. True means to erase and replace the value; False means to append.
skip_invalid: if True, any bad rows will be ignored.
- Output:
- tuple of feature class path and list of bad rows by index number.
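A hedged sketch (the workspace path and output name are illustrative):
>>> fc_path, bad_rows = sdf.to_featureclass(
...     out_location=r'C:\data\scratch.gdb',
...     out_name='parcels')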
-
to_featurelayer
(title, gis=None, tags=None)¶ publishes a spatial dataframe to a new feature layer
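A minimal sketch, assuming an authenticated GIS connection (the credentials are placeholders):
>>> from arcgis.gis import GIS
>>> gis = GIS('https://www.arcgis.com', 'username', 'password')
>>> item = sdf.to_featurelayer(title='Parcels', gis=gis, tags='parcels, demo')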
-
to_featureset
()¶ Converts a spatial dataframe to a feature set object
-
to_gbq
(destination_table, project_id, chunksize=10000, verbose=True, reauth=False, if_exists='fail', private_key=None)¶ Write a DataFrame to a Google BigQuery table.
The main method a user calls to export pandas DataFrame contents to Google BigQuery table.
Google BigQuery API Client Library v2 for Python is used; see that library’s documentation for details.
Authentication to the Google BigQuery service is via OAuth 2.0.
If “private_key” is not provided:
By default “application default credentials” are used.
If default application credentials are not found or are restrictive, user account credentials are used. In this case, you will be asked to grant permissions for product name ‘pandas GBQ’.
If “private_key” is provided:
Service account credentials will be used to authenticate.
- dataframe : DataFrame
- DataFrame to be written
- destination_table : string
- Name of table to be written, in the form ‘dataset.tablename’
- project_id : str
- Google BigQuery Account project ID.
- chunksize : int (default 10000)
- Number of rows to be inserted in each chunk from the dataframe.
- verbose : boolean (default True)
- Show percentage complete
- reauth : boolean (default False)
- Force Google BigQuery to reauthenticate the user. This is useful if multiple accounts are used.
- if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- ‘fail’: If table exists, do nothing. ‘replace’: If table exists, drop it, recreate it, and insert data. ‘append’: If table exists, insert data. Create if does not exist.
- private_key : str (optional)
- Service account private key in JSON format. Can be file path or string contents. This is useful for remote server authentication (e.g. a Jupyter/IPython notebook on a remote host)
-
to_hdf
(path_or_buf, key, **kwargs)¶ Write the contained data to an HDF5 file using HDFStore.
- path_or_buf : the path (string) or HDFStore object
- key : string
- identifier for the group in the store
- mode : optional, {‘a’, ‘w’, ‘r+’}, default ‘a’
'w'
- Write; a new file is created (an existing file with the same name would be deleted).
'a'
- Append; an existing file is opened for reading and writing, and if the file does not exist it is created.
'r+'
- It is similar to 'a', but the file must already exist.
- format : ‘fixed(f)|table(t)’, default is ‘fixed’
- fixed(f) : Fixed format
- Fast writing/reading. Not-appendable, nor searchable
- table(t) : Table format
- Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data
- append : boolean, default False
- For Table formats, append the input data to the existing
- data_columns : list of columns, or True, default None
List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See the HDFStore documentation on data columns.
Applicable only to format=’table’.
- complevel : int, 1-9, default 0
- If a complib is specified compression will be applied where possible
- complib : {‘zlib’, ‘bzip2’, ‘lzo’, ‘blosc’, None}, default None
- If complevel is > 0 apply compression to objects written in the store wherever possible
- fletcher32 : bool, default False
- If applying compression use the fletcher32 checksum
- dropna : boolean, default False.
- If true, ALL nan rows will not be written to store.
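For example (writing requires the PyTables package):
>>> df.to_hdf('store.h5', key='df', mode='w')
>>> pd.read_hdf('store.h5', 'df')       # round-trip read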
-
to_html
(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, justify=None, bold_rows=True, classes=None, escape=True, max_rows=None, max_cols=None, show_dimensions=False, notebook=False, decimal='.', border=None)¶ Render a DataFrame as an HTML table.
to_html-specific options:
- bold_rows : boolean, default True
- Make the row labels bold in the output
- classes : str or list or tuple, default None
- CSS class(es) to apply to the resulting html table
- escape : boolean, default True
- Convert the characters <, >, and & to HTML-safe sequences.
- max_rows : int, optional
- Maximum number of rows to show before truncating. If None, show all.
- max_cols : int, optional
- Maximum number of columns to show before truncating. If None, show all.
- decimal : string, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe
New in version 0.18.0.
- border : int
A border=border attribute is included in the opening <table> tag. Default pd.options.html.border.
New in version 0.19.0.
- buf : StringIO-like, optional
- buffer to write to
- columns : sequence, optional
- the subset of columns to write; default None writes all columns
- col_space : int, optional
- the minimum width of each column
- header : bool, optional
- whether to print column labels, default True
- index : bool, optional
- whether to print index (row) labels, default True
- na_rep : string, optional
- string representation of NAN to use, default ‘NaN’
- formatters : list or dict of one-parameter functions, optional
- formatter functions to apply to columns’ elements by position or name, default None. The result of each function must be a unicode string. List must be of length equal to the number of columns.
- float_format : one-parameter function, optional
- formatter function to apply to columns’ elements if they are floats, default None. The result of this function must be a unicode string.
- sparsify : bool, optional
- Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row, default True
- index_names : bool, optional
- Prints the names of the indexes, default True
- line_width : int, optional
- Width to wrap a line in characters, default no wrap
- justify : {‘left’, ‘right’, ‘center’, ‘justify’, ‘justify-all’, ‘start’, ‘end’, ‘inherit’, ‘match-parent’, ‘initial’, ‘unset’}, default None
- How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box.
formatted : string (or unicode, depending on data and options)
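For example:
>>> df = pd.DataFrame({'A': [1, 2]})
>>> html = df.to_html(index=False, border=0)
>>> html.startswith('<table')
True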
-
to_json
(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression=None)¶ Convert the object to a JSON string.
Note NaN’s and None will be converted to null and datetime objects will be converted to UNIX timestamps.
- path_or_buf : the path or buffer to write the result string
- if this is None, return the converted string
orient : string
Series
- default is ‘index’
- allowed values are: {‘split’,’records’,’index’}
DataFrame
- default is ‘columns’
- allowed values are: {‘split’,’records’,’index’,’columns’,’values’}
The format of the JSON string
split : dict like {index -> [index], columns -> [columns], data -> [values]}
records : list like [{column -> value}, … , {column -> value}]
index : dict like {index -> {column -> value}}
columns : dict like {column -> {index -> value}}
values : just the values array
table : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, and the data component is like orient='records'.
Changed in version 0.20.0.
- date_format : {None, ‘epoch’, ‘iso’}
- Type of date conversion. epoch = epoch milliseconds, iso = ISO8601. The default depends on the orient. For orient=’table’, the default is ‘iso’. For all other orients, the default is ‘epoch’.
- double_precision : int, default 10
- The number of decimal places to use when encoding floating point values.
- force_ascii : boolean, default True
- Force encoded string to be ASCII.
- date_unit : string, default ‘ms’ (milliseconds)
- The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.
- default_handler : callable, default None
- Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.
- lines : boolean, default False
- If ‘orient’ is ‘records’, write out line-delimited JSON. Raises ValueError for any other ‘orient’, since the other formats are not list-like.
New in version 0.19.0.
- compression : {None, ‘gzip’, ‘bz2’, ‘xz’}
A string representing the compression to use in the output file, only used when the first argument is a filename
New in version 0.21.0.
json : string if path_or_buf is None, otherwise None (the result is written to the path or buffer)
pd.read_json
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                   index=['row 1', 'row 2'],
...                   columns=['col 1', 'col 2'])
>>> df.to_json(orient='split')
'{"columns":["col 1","col 2"], "index":["row 1","row 2"], "data":[["a","b"],["c","d"]]}'
Encoding/decoding a DataFrame using 'index' formatted JSON:
>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
Encoding/decoding a DataFrame using 'records' formatted JSON. Note that index labels are not preserved with this encoding.
>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
Encoding with Table Schema
>>> df.to_json(orient='table')
'{"schema": {"fields": [{"name": "index", "type": "string"}, {"name": "col 1", "type": "string"}, {"name": "col 2", "type": "string"}], "primaryKey": "index", "pandas_version": "0.20.0"}, "data": [{"index": "row 1", "col 1": "a", "col 2": "b"}, {"index": "row 2", "col 1": "c", "col 2": "d"}]}'
-
to_latex
(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, bold_rows=False, column_format=None, longtable=None, escape=None, encoding=None, decimal='.', multicolumn=None, multicolumn_format=None, multirow=None)¶ Render an object to a tabular environment table. You can splice this into a LaTeX document. Requires \usepackage{booktabs}.
Changed in version 0.20.2: Added to Series
to_latex-specific options:
- bold_rows : boolean, default False
- Make the row labels bold in the output
- column_format : str, default None
- The columns format as specified in LaTeX table format, e.g. ‘rcl’ for 3 columns
- longtable : boolean, default will be read from the pandas config module
- Default: False. Use a longtable environment instead of tabular. Requires adding a \usepackage{longtable} to your LaTeX preamble.
- escape : boolean, default will be read from the pandas config module
- Default: True. When set to False, prevents escaping of LaTeX special characters in column names.
- encoding : str, default None
- A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
- decimal : string, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe.
New in version 0.18.0.
- multicolumn : boolean, default True
Use multicolumn to enhance MultiIndex columns. The default will be read from the config module.
New in version 0.20.0.
- multicolumn_format : str, default ‘l’
The alignment for multicolumns, similar to column_format The default will be read from the config module.
New in version 0.20.0.
- multirow : boolean, default False
Use multirow to enhance MultiIndex rows. Requires adding a \usepackage{multirow} to your LaTeX preamble. Will print centered labels (instead of top-aligned) across the contained rows, separating groups via clines. The default will be read from the pandas config module.
New in version 0.20.0.
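A short sketch of typical usage (values are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [1.0, 2.5], 'y': [3.0, 4.5]}, index=['a', 'b'])
>>> tex = df.to_latex(column_format='lrr')
>>> print(tex)  # paste into a document that includes \usepackage{booktabs}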
-
to_msgpack
(path_or_buf=None, encoding='utf-8', **kwargs)¶ msgpack (serialize) object to input file path
THIS IS AN EXPERIMENTAL LIBRARY and the storage format may not be stable until a future release.
- path : string File path, buffer-like, or None
- if None, return generated string
- append : boolean, default False
- whether to append to an existing msgpack
- compress : type of compressor (zlib or blosc), default None (no compression)
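A minimal round-trip sketch; note the format was experimental at this version (to_msgpack and read_msgpack were removed in later pandas releases):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3]})
>>> packed = df.to_msgpack()        # path_or_buf=None returns the bytes
>>> pd.read_msgpack(packed)
   a
0  1
1  2
2  3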
-
to_panel
()¶ Transform long (stacked) format (DataFrame) into wide (3D, Panel) format.
Currently the index of the DataFrame must be a 2-level MultiIndex. This may be generalized later
panel : Panel
-
to_parquet
(fname, engine='auto', compression='snappy', **kwargs)¶ Write a DataFrame to the binary parquet format.
New in version 0.21.0.
- fname : str
- string file path
- engine : {‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’
- Parquet library to use. If ‘auto’, the option ‘io.parquet.engine’ is used; by default this tries ‘pyarrow’ first, falling back to ‘fastparquet’.
- compression : str, optional, default ‘snappy’
- compression method, includes {‘gzip’, ‘snappy’, ‘brotli’}
- kwargs
- Additional keyword arguments passed to the engine
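A minimal sketch, assuming one of the engines (pyarrow or fastparquet) is installed; the file name is illustrative:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
>>> df.to_parquet('example.parquet', compression='snappy')
>>> pd.read_parquet('example.parquet')
   a  b
0  1  x
1  2  y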
-
to_period
(freq=None, axis=0, copy=True)¶ Convert DataFrame from DatetimeIndex to PeriodIndex with desired frequency (inferred from index if not passed)
- freq : string, default None
- Frequency of the PeriodIndex (inferred from the index if not passed)
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- The axis to convert (the index by default)
- copy : boolean, default True
- If False then underlying input data is not copied
ts : TimeSeries with PeriodIndex
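A short sketch (data are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'sales': [10, 20, 30]},
...                   index=pd.date_range('2017-01-31', periods=3, freq='M'))
>>> df.to_period().index          # frequency inferred from the index
PeriodIndex(['2017-01', '2017-02', '2017-03'], dtype='period[M]', freq='M')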
-
to_pickle
(path, compression='infer', protocol=4)¶ Pickle (serialize) object to input file path.
- path : string
- File path
- compression : {‘infer’, ‘gzip’, ‘bz2’, ‘xz’, None}, default ‘infer’
a string representing the compression to use in the output file
New in version 0.20.0.
- protocol : int
- Int which indicates which protocol should be used by the pickler, default HIGHEST_PROTOCOL (see [1], paragraph 12.1.2). The possible values for this parameter depend on the version of Python. For Python 2.x, possible values are 0, 1, 2. For Python >= 3.0, 3 is a valid value. For Python >= 3.4, 4 is a valid value. A negative value for the protocol parameter is equivalent to setting its value to HIGHEST_PROTOCOL.
[1] https://docs.python.org/3/library/pickle.html
New in version 0.21.0.
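A minimal round-trip sketch; the .gz extension lets compression='infer' pick gzip (the file name is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': range(3)})
>>> df.to_pickle('frame.pkl.gz')      # compression inferred from extension
>>> pd.read_pickle('frame.pkl.gz')
   a
0  0
1  1
2  2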
-
to_records
(index=True, convert_datetime64=True)¶ Convert DataFrame to record array. Index will be put in the ‘index’ field of the record array if requested
- index : boolean, default True
- Include index in resulting record array, stored in ‘index’ field
- convert_datetime64 : boolean, default True
- Whether to convert the index to datetime.datetime if it is a DatetimeIndex
y : recarray
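For example (values illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.75]}, index=['x', 'y'])
>>> rec = df.to_records()             # index stored in the 'index' field
>>> rec.dtype.names
('index', 'a', 'b')
>>> df.to_records(index=False).dtype.names
('a', 'b')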
-
to_sparse
(fill_value=None, kind='block')¶ Convert to SparseDataFrame
- fill_value : float, default NaN
- kind : {‘block’, ‘integer’}, default ‘block’
y : SparseDataFrame
-
to_sql
(name, con, flavor=None, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None)¶ Write records stored in a DataFrame to a SQL database.
- name : string
- Name of SQL table
- con : SQLAlchemy engine or DBAPI2 connection (legacy mode)
- Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
- flavor : ‘sqlite’, default None
Deprecated since version 0.19.0: ‘sqlite’ is the only supported option if SQLAlchemy is not used.
- schema : string, default None
- Specify the schema (if database flavor supports this). If None, use default schema.
- if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
- fail: If table exists, do nothing.
- replace: If table exists, drop it, recreate it, and insert data.
- append: If table exists, insert data. Create if does not exist.
- index : boolean, default True
- Write DataFrame index as a column.
- index_label : string or sequence, default None
- Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
- chunksize : int, default None
- If not None, then rows will be written in batches of this size at a time. If None, all rows will be written at once.
- dtype : dict of column name to SQL type, default None
- Optional specifying the datatype for columns. The SQL type should be a SQLAlchemy type, or a string for sqlite3 fallback connection.
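A minimal sketch using the sqlite3 DBAPI2 fallback (the table name is illustrative); for other databases pass a SQLAlchemy engine as con:
>>> import sqlite3
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})
>>> con = sqlite3.connect(':memory:')
>>> df.to_sql('measurements', con, if_exists='replace', index=False)
>>> pd.read_sql('SELECT * FROM measurements', con)
  name  value
0    a      1
1    b      2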
-
to_stata
(fname, convert_dates=None, write_index=True, encoding='latin-1', byteorder=None, time_stamp=None, data_label=None, variable_labels=None)¶ Write the DataFrame to a Stata binary dta file.
- fname : str or buffer
- String path or a file-like object
- convert_dates : dict
- Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information
- write_index : bool
- Write the index to Stata dataset.
- encoding : str
- Default is latin-1. Unicode is not supported
- byteorder : str
- Can be “>”, “<”, “little”, or “big”. default is sys.byteorder
- time_stamp : datetime
- A datetime to use as file creation date. Default is the current time.
- data_label : str
- A label for the data set. Must be 80 characters or smaller.
- variable_labels : dict
Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.
New in version 0.19.0.
- NotImplementedError
- If datetimes contain timezone information
- Column dtype is not representable in Stata
- ValueError
- Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime
- Column listed in convert_dates is not in DataFrame
- Categorical label contains more than 32,000 characters
New in version 0.19.0.
>>> writer = StataWriter('./data_file.dta', data)
>>> writer.write_file()
Or with dates
>>> writer = StataWriter('./date_data_file.dta', data, {2 : 'tw'})
>>> writer.write_file()
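The same can be done through the DataFrame method directly; a sketch (the file name is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=2),
...                    'value': [1.0, 2.0]})
>>> df.to_stata('data_file.dta', convert_dates={'date': 'tw'},
...             write_index=False)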
-
to_string
(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, justify=None, line_width=None, max_rows=None, max_cols=None, show_dimensions=False)¶ Render a DataFrame to a console-friendly tabular output.
- buf : StringIO-like, optional
- buffer to write to
- columns : sequence, optional
- the subset of columns to write; default None writes all columns
- col_space : int, optional
- the minimum width of each column
- header : bool, optional
- Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names
- index : bool, optional
- whether to print index (row) labels, default True
- na_rep : string, optional
- string representation of NAN to use, default ‘NaN’
- formatters : list or dict of one-parameter functions, optional
- formatter functions to apply to columns’ elements by position or name, default None. The result of each function must be a unicode string. List must be of length equal to the number of columns.
- float_format : one-parameter function, optional
- formatter function to apply to columns’ elements if they are floats, default None. The result of this function must be a unicode string.
- sparsify : bool, optional
- Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row, default True
- index_names : bool, optional
- Prints the names of the indexes, default True
- line_width : int, optional
- Width to wrap a line in characters, default no wrap
- justify : {‘left’, ‘right’, ‘center’, ‘justify’, ‘justify-all’, ‘start’, ‘end’, ‘inherit’, ‘match-parent’, ‘initial’, ‘unset’}, default None
- How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box.
formatted : string (or unicode, depending on data and options)
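For example (values illustrative; float_format must return a string):
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [4.125, 5.25]})
>>> print(df.to_string(index=False, float_format=lambda v: '%.1f' % v))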
-
to_timestamp
(freq=None, how='start', axis=0, copy=True)¶ Cast to DatetimeIndex of timestamps, at beginning of period
- freq : string, default frequency of PeriodIndex
- Desired frequency
- how : {‘s’, ‘e’, ‘start’, ‘end’}
- Convention for converting period to timestamp; start of period vs. end
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- The axis to convert (the index by default)
- copy : boolean, default True
- If False then underlying input data is not copied
df : DataFrame with DatetimeIndex
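A short sketch (data are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'sales': [10, 20]},
...                   index=pd.period_range('2017-01', periods=2, freq='M'))
>>> df.to_timestamp(how='start').index
DatetimeIndex(['2017-01-01', '2017-02-01'], dtype='datetime64[ns]', freq='MS')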
-
to_xarray
()¶ Return an xarray object from the pandas object.
- a DataArray for a Series
- a Dataset for a DataFrame
- a DataArray for higher dims
>>> df = pd.DataFrame({'A' : [1, 1, 2],
...                    'B' : ['foo', 'bar', 'foo'],
...                    'C' : np.arange(4.,7)})
>>> df
   A    B    C
0  1  foo  4.0
1  1  bar  5.0
2  2  foo  6.0
>>> df.to_xarray()
<xarray.Dataset>
Dimensions:  (index: 3)
Coordinates:
  * index    (index) int64 0 1 2
Data variables:
    A        (index) int64 1 1 2
    B        (index) object 'foo' 'bar' 'foo'
    C        (index) float64 4.0 5.0 6.0
>>> df = pd.DataFrame({'A' : [1, 1, 2],
...                    'B' : ['foo', 'bar', 'foo'],
...                    'C' : np.arange(4.,7)}).set_index(['B','A'])
>>> df
       C
B   A
foo 1  4.0
bar 1  5.0
foo 2  6.0
>>> df.to_xarray()
<xarray.Dataset>
Dimensions:  (A: 2, B: 2)
Coordinates:
  * B        (B) object 'bar' 'foo'
  * A        (A) int64 1 2
Data variables:
    C        (B, A) float64 5.0 nan 4.0 6.0
>>> p = pd.Panel(np.arange(24).reshape(4,3,2),
...              items=list('ABCD'),
...              major_axis=pd.date_range('20130101', periods=3),
...              minor_axis=['first', 'second'])
>>> p
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: A to D
Major_axis axis: 2013-01-01 00:00:00 to 2013-01-03 00:00:00
Minor_axis axis: first to second
>>> p.to_xarray()
<xarray.DataArray (items: 4, major_axis: 3, minor_axis: 2)>
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],
       [[ 6,  7],
        [ 8,  9],
        [10, 11]],
       [[12, 13],
        [14, 15],
        [16, 17]],
       [[18, 19],
        [20, 21],
        [22, 23]]])
Coordinates:
  * items       (items) object 'A' 'B' 'C' 'D'
  * major_axis  (major_axis) datetime64[ns] 2013-01-01 2013-01-02 2013-01-03
  * minor_axis  (minor_axis) object 'first' 'second'
See the xarray docs
-
touches
(second_geometry)¶ Indicates if the boundaries of the geometries intersect.
- Parameters:
second_geometry: - a second geometry
-
transform
(func, *args, **kwargs)¶ Call function producing a like-indexed NDFrame and return an NDFrame with the transformed values
New in version 0.20.0.
- func : callable, string, dictionary, or list of string/callables
- Function(s) to apply to each column.
Accepted Combinations are:
- string function name
- function
- list of functions
- dict of column names -> functions (or list of functions)
transformed : NDFrame
>>> df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
...                   index=pd.date_range('1/1/2000', periods=10))
>>> df.iloc[3:7] = np.nan
>>> df.transform(lambda x: (x - x.mean()) / x.std())
                   A         B         C
2000-01-01  0.579457  1.236184  0.123424
2000-01-02  0.370357 -0.605875 -1.231325
2000-01-03  1.455756 -0.277446  0.288967
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.498658  1.274522  1.642524
2000-01-09 -0.540524 -1.012676 -0.828968
2000-01-10 -1.366388 -0.614710  0.005378
pandas.NDFrame.aggregate, pandas.NDFrame.apply
-
transpose
(*args, **kwargs)¶ Transpose index and columns
-
true_centroid
¶ The center of gravity for a feature.
-
truediv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.
- other : Series, DataFrame, or constant
- axis : {0, 1, ‘index’, ‘columns’}
- For Series input, axis to match Series index on
- fill_value : None or float value, default None
- Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing
- level : int or name
- Broadcast across a level, matching Index values on the passed MultiIndex level
Mismatched indices will be unioned together
result : DataFrame
DataFrame.rtruediv
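For example, fill_value substitutes for a missing value in one of the inputs before dividing (values illustrative):
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.DataFrame({'x': [1.0, np.nan]})
>>> b = pd.DataFrame({'x': [2.0, 4.0]})
>>> a.truediv(b, fill_value=8.0)   # row 1 becomes 8.0 / 4.0
     x
0  0.5
1  2.0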
-
truncate
(before=None, after=None, axis=None, copy=True)¶ Truncates a sorted DataFrame/Series before and/or after some particular index value. If the axis contains only datetime values, before/after parameters are converted to datetime values.
- before : date, string, int
- Truncate all rows before this index value
- after : date, string, int
- Truncate all rows after this index value
axis : {0 or ‘index’, 1 or ‘columns’}
- 0 or ‘index’: apply truncation to rows
- 1 or ‘columns’: apply truncation to columns
Default is stat axis for given data type (0 for Series and DataFrames, 1 for Panels)
- copy : boolean, default True
- return a copy of the truncated section
truncated : type of caller
>>> df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
...                    'B': ['f', 'g', 'h', 'i', 'j'],
...                    'C': ['k', 'l', 'm', 'n', 'o']},
...                   index=[1, 2, 3, 4, 5])
>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n
>>> df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
...                    'B': [6, 7, 8, 9, 10],
...                    'C': [11, 12, 13, 14, 15]},
...                   index=['a', 'b', 'c', 'd', 'e'])
>>> df.truncate(before='b', after='d')
   A  B   C
b  2  7  12
c  3  8  13
d  4  9  14
The index values in truncate can be datetimes or string dates. Note that truncate assumes a 0 value for any unspecified date component in a DatetimeIndex, in contrast to slicing, which returns any partially matching dates.
>>> dates = pd.date_range('2016-01-01', '2016-02-01', freq='s')
>>> df = pd.DataFrame(index=dates, data={'A': 1})
>>> df.truncate('2016-01-05', '2016-01-10').tail()
                     A
2016-01-09 23:59:56  1
2016-01-09 23:59:57  1
2016-01-09 23:59:58  1
2016-01-09 23:59:59  1
2016-01-10 00:00:00  1
>>> df.loc['2016-01-05':'2016-01-10', :].tail()
                     A
2016-01-10 23:59:55  1
2016-01-10 23:59:56  1
2016-01-10 23:59:57  1
2016-01-10 23:59:58  1
2016-01-10 23:59:59  1
-
tshift
(periods=1, freq=None, axis=0)¶ Shift the time index, using the index’s frequency if available.
- periods : int
- Number of periods to move, can be positive or negative
- freq : DateOffset, timedelta, or time rule string, default None
- Increment to use from the tseries module or time rule (e.g. ‘EOM’)
- axis : int or basestring
- Corresponds to the axis that contains the Index
If freq is not specified then tries to use the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown
shifted : NDFrame
-
tz_convert
(tz, axis=0, level=None, copy=True)¶ Convert tz-aware axis to target time zone.
- tz : string or pytz.timezone object
- axis : the axis to convert
- level : int, str, default None
- If axis is a MultiIndex, convert a specific level. Otherwise must be None.
- copy : boolean, default True
- Also make a copy of the underlying data
- TypeError
- If the axis is tz-naive.
-
tz_localize
(tz, axis=0, level=None, copy=True, ambiguous='raise')¶ Localize tz-naive TimeSeries to target time zone.
- tz : string or pytz.timezone object
- axis : the axis to localize
- level : int, str, default None
- If axis is a MultiIndex, localize a specific level. Otherwise must be None.
- copy : boolean, default True
- Also make a copy of the underlying data
- ambiguous : ‘infer’, bool-ndarray, ‘NaT’, default ‘raise’
- ‘infer’ will attempt to infer fall dst-transition hours based on order
- bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
- ‘NaT’ will return NaT where there are ambiguous times
- ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times
- infer_dst : boolean, default False
Deprecated since version 0.15.0: Attempt to infer fall dst-transition hours based on order
- TypeError
- If the TimeSeries is tz-aware and tz is not None.
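A sketch combining tz_localize and tz_convert (time zone names are illustrative):
>>> import pandas as pd
>>> idx = pd.date_range('2017-03-01 09:00', periods=3, freq='H')
>>> df = pd.DataFrame({'v': [1, 2, 3]}, index=idx)
>>> localized = df.tz_localize('US/Eastern')   # attach a zone to the naive index
>>> localized.tz_convert('UTC').index[0]
Timestamp('2017-03-01 14:00:00+0000', tz='UTC', freq='H')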
-
union
(second_geometry)¶ Constructs the geometry that is the set-theoretic union of the input geometries.
- Parameters:
second_geometry: - a second geometry
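A sketch of the geometry methods above, assuming a geometry engine (arcpy or shapely) is available to the arcgis package; the coordinates are illustrative:
>>> from arcgis.geometry import Geometry
>>> g1 = Geometry({'rings': [[[0, 0], [0, 2], [2, 2], [2, 0], [0, 0]]],
...                'spatialReference': {'wkid': 4326}})
>>> g2 = Geometry({'rings': [[[1, 1], [1, 3], [3, 3], [3, 1], [1, 1]]],
...                'spatialReference': {'wkid': 4326}})
>>> merged = g1.union(g2)   # set-theoretic union of the two polygons
>>> g1.touches(g2)          # False: interiors overlap, not just boundaries
False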
-
unstack
(level=-1, fill_value=None)¶ Pivot a level of the (necessarily hierarchical) index labels, returning a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels. If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex). The level involved will automatically get sorted.
- level : int, string, or list of these, default -1 (last level)
- Level(s) of index to unstack, can pass level name
- fill_value : replace NaN with this value if the unstack produces missing values
DataFrame.pivot : Pivot a table based on column values.
DataFrame.stack : Pivot a level of the column labels (inverse operation from unstack).
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a'), ('two', 'b')])
>>> s = pd.Series(np.arange(1.0, 5.0), index=index)
>>> s
one  a    1.0
     b    2.0
two  a    3.0
     b    4.0
dtype: float64
>>> s.unstack(level=-1)
       a    b
one  1.0  2.0
two  3.0  4.0
>>> s.unstack(level=0)
   one  two
a  1.0  3.0
b  2.0  4.0
>>> df = s.unstack(level=0)
>>> df.unstack()
one  a    1.0
     b    2.0
two  a    3.0
     b    4.0
dtype: float64
unstacked : DataFrame or Series
-
update
(other, join='left', overwrite=True, filter_func=None, raise_conflict=False)¶ Modify DataFrame in place using non-NA values from passed DataFrame. Aligns on indices
- other : DataFrame, or object coercible into a DataFrame
- join : {‘left’}, default ‘left’
- overwrite : boolean, default True
- If True then overwrite values for common keys in the calling frame
- filter_func : callable(1d-array) -> 1d-array<boolean>, default None
- Can choose to replace values other than NA. Return True for values that should be updated
- raise_conflict : boolean
- If True, will raise an error if the DataFrame and other both contain data in the same place.
>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, 5, 6],
...                        'C': [7, 8, 9]})
>>> df.update(new_df)
>>> df
   A  B
0  1  4
1  2  5
2  3  6
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  y
2  c  e
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  x
1  b  d
2  c  e
If other contains NaNs the corresponding values are not updated in the original dataframe.
>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df.update(new_df)
>>> df
   A      B
0  1    4.0
1  2  500.0
2  3    6.0
-
values
¶ Numpy representation of NDFrame
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type convention, mixing int64 and uint64 will result in a float64 dtype.
-
var
(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
- axis : {index (0), columns (1)}
- skipna : boolean, default True
- Exclude NA/null values. If an entire row/column is NA, the result will be NA
- level : int or level name, default None
- If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
- ddof : int, default 1
- degrees of freedom
- numeric_only : boolean, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
var : Series or DataFrame (if level specified)
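For example, with ddof controlling the normalization:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1.0, 2.0, 3.0]})
>>> df.var()        # sample variance, normalized by N-1
a    1.0
dtype: float64
>>> df.var(ddof=0)  # population variance, normalized by N
a    0.666667
dtype: float64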
-
where
(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False, raise_on_error=None)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
- cond : boolean NDFrame, array-like, or callable
Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as cond.
- other : scalar, NDFrame, or callable
Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as other.
- inplace : boolean, default False
- Whether to perform the operation in place on the data
- axis : alignment axis if needed, default None
- level : alignment level if needed, default None
- errors : str, {‘raise’, ‘ignore’}, default ‘raise’
- raise : allow exceptions to be raised
- ignore : suppress exceptions. On error return original object
Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.
- try_cast : boolean, default False
- try to cast the result back to the input type (if possible)
- raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Deprecated since version 0.21.0.
wh : same type as caller
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
>>> s.where(s > 1, 10)
0    10.0
1    10.0
2     2.0
3     3.0
4     4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
DataFrame.mask()
-
within
(second_geometry, relation=None)¶ Indicates if the base geometry is within the comparison geometry.
- Parameters:
second_geometry: - a second geometry
relation: - The spatial relationship type.
- BOUNDARY - Relationship has no restrictions for interiors or boundaries.
- CLEMENTINI - Interiors of geometries must intersect. Specifying CLEMENTINI is equivalent to specifying None. This is the default.
- PROPER - Boundaries of geometries must not intersect.
-
xs
(key, axis=0, level=None, drop_level=True)¶ Returns a cross-section (row(s) or column(s)) from the Series/DataFrame. Defaults to cross-section on the rows (axis=0).
- key : object
- Some label contained in the index, or partially in a MultiIndex
- axis : int, default 0
- Axis to retrieve cross-section on
- level : object, defaults to first n levels (n=1 or len(key))
- In case of a key partially contained in a MultiIndex, indicate which levels are used. Levels can be referred by label or position.
- drop_level : boolean, default True
- If False, returns object with same levels as self.
>>> df
   A  B  C
a  4  5  2
b  4  0  9
c  9  7  3
>>> df.xs('a')
A    4
B    5
C    2
Name: a
>>> df.xs('C', axis=1)
a    2
b    9
c    3
Name: C
>>> df
                    A  B  C  D
first second third
bar   one    1      4  1  8  9
      two    1      7  5  5  0
baz   one    1      6  6  8  0
      three  2      5  3  5  3
>>> df.xs(('baz', 'three'))
       A  B  C  D
third
2      5  3  5  3
>>> df.xs('one', level=1)
             A  B  C  D
first third
bar   1      4  1  8  9
baz   1      6  6  8  0
>>> df.xs(('baz', 2), level=[0, 'third'])
        A  B  C  D
second
three   5  3  5  3
xs : Series or DataFrame
xs is only for getting, not setting values.
MultiIndex Slicers is a generic way to get/set values on any level or levels. It is a superset of xs functionality, see MultiIndex Slicers