import os
'/home/thulasiram/personal/going_deep_and_wide/GiveDirectly/gx_tutorials/great_expectations') os.chdir(
Create a new Datasource
Use this notebook to configure a new pandas Datasource and add it to your project.
import great_expectations as gx
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists
= gx.get_context() context
Customize Your Datasource Configuration
If you are new to Great Expectations Datasources, you should check out our how-to documentation
My configuration is not so simple - are there more advanced options? Glad you asked! Datasources are versatile. Please see our How To Guides!
Give your datasource a unique name:
= "data_quality_demo" datasource_name
For files based Datasources:
Here we are creating an example configuration. The configuration contains an InferredAssetFilesystemDataConnector which will add a Data Asset for each file in the base directory you provided. It also contains a RuntimeDataConnector which can accept filepaths. This is just an example, and you may customize this as you wish!
Also, if you would like to learn more about the DataConnectors used in this configuration, including other methods to organize assets, handle multi-file assets, name assets based on parts of a filename, please see our docs on InferredAssetDataConnectors and RuntimeDataConnectors.
= f"""
example_yaml name: {datasource_name}
class_name: Datasource
execution_engine:
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: ../data
default_regex:
group_names:
- data_asset_name
pattern: (.*)
default_runtime_data_connector_name:
class_name: RuntimeDataConnector
assets:
my_runtime_asset_name:
batch_identifiers:
- runtime_batch_identifier_name
"""
print(example_yaml)
name: data_quality_demo
class_name: Datasource
execution_engine:
class_name: PandasExecutionEngine
data_connectors:
default_inferred_data_connector_name:
class_name: InferredAssetFilesystemDataConnector
base_directory: ../data
default_regex:
group_names:
- data_asset_name
pattern: (.*)
default_runtime_data_connector_name:
class_name: RuntimeDataConnector
assets:
my_runtime_asset_name:
batch_identifiers:
- runtime_batch_identifier_name
Test Your Datasource Configuration
Here we will test your Datasource configuration to make sure it is valid.
This test_yaml_config()
function is meant to enable fast dev loops. If your configuration is correct, this cell will show you some snippets of the data assets in the data source. You can continually edit your Datasource config yaml and re-run the cell to check until the new config is valid.
If you instead wish to use python instead of yaml to configure your Datasource, you can use context.add_datasource()
and specify all the required parameters.
=example_yaml) context.test_yaml_config(yaml_config
Attempting to instantiate class from config...
Instantiating as a Datasource, since class_name is Datasource
Successfully instantiated Datasource
ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
default_inferred_data_connector_name : InferredAssetFilesystemDataConnector
Available data_asset_names (2 of 2):
yellow_tripdata_sample_2019-01.csv (1 of 1): ['yellow_tripdata_sample_2019-01.csv']
yellow_tripdata_sample_2019-02.csv (1 of 1): ['yellow_tripdata_sample_2019-02.csv']
Unmatched data_references (0 of 0):[]
default_runtime_data_connector_name:RuntimeDataConnector
default_runtime_data_connector_name : RuntimeDataConnector
Available data_asset_names (1 of 1):
my_runtime_asset_name (0 of 0): []
Unmatched data_references (0 of 0):[]
<great_expectations.datasource.new_datasource.Datasource at 0x7ff07e005250>
Save Your Datasource Configuration
Here we will save your Datasource in your Data Context once you are satisfied with the configuration. Note that overwrite_existing
defaults to False, but you may change it to True if you wish to overwrite. Please note that if you wish to include comments you must add them directly to your great_expectations.yml
.
=False)
sanitize_yaml_and_save_datasource(context, example_yaml, overwrite_existing context.list_datasources()
[{'execution_engine': {'class_name': 'PandasExecutionEngine',
'module_name': 'great_expectations.execution_engine'},
'module_name': 'great_expectations.datasource',
'class_name': 'Datasource',
'name': 'data_quality_demo',
'data_connectors': {'default_inferred_data_connector_name': {'default_regex': {'group_names': ['data_asset_name'],
'pattern': '(.*)'},
'base_directory': '../data',
'module_name': 'great_expectations.datasource.data_connector',
'class_name': 'InferredAssetFilesystemDataConnector'},
'default_runtime_data_connector_name': {'module_name': 'great_expectations.datasource.data_connector',
'class_name': 'RuntimeDataConnector',
'assets': {'my_runtime_asset_name': {'module_name': 'great_expectations.datasource.data_connector.asset',
'class_name': 'Asset',
'batch_identifiers': ['runtime_batch_identifier_name']}}}}}]