0 purchases
tapbigquery 0.3.9
tap-bigquery
Extract data from BigQuery tables.
This is a Singer tap that produces JSON-formatted data
following the Singer spec.
This tap:
Pulls data from Google BigQuery tables/views with datetime field.
Infers the schema for each resource and produce catalog file.
Incrementally pulls data based on the input state.
Installation
Step 0: Acknowledge LICENSE and TERMS
Please especially note that the author(s) of tap-bigquery is not responsible
for the cost, including but not limited to BigQuery cost) incurred by running
this program.
Step 1: Activate the Google BigQuery API
(originally found in the Google API docs)
Use this wizard to create or select a project in the Google Developers Console and activate the BigQuery API. Click Continue, then Go to credentials.
On the Add credentials to your project page, click the Cancel button.
At the top of the page, select the OAuth consent screen tab. Select an Email address, enter a Product name if not already set, and click the Save button.
Select the Credentials tab, click the Create credentials button and select OAuth client ID.
Select the application type Other, enter the name "Singer BigQuery Tap", and click the Create button.
Click OK to dismiss the resulting dialog.
Click the Download button to the right of the client ID.
Move this file to your working directory and rename it client_secrets.json.
Export the location of the secret file:
export GOOGLE_APPLICATION_CREDENTIALS="./client_secret.json"
For other authentication method, please see Authentication section.
Step 2: Install
First, make sure Python 3 is installed on your system or follow these
installation instructions for Mac or Ubuntu.
pip install -U tap-bigquery
Or you can install the lastest development version from GitHub:
pip install --no-cache-dir https://github.com/anelendata/tap-bigquery/archive/master.tar.gz#egg=tap-bigquery
Run
Step 1: Configure
Create a file called tap_config.json in your working directory, following
config.sample.json:
{
"streams": [
{"name": "<some_schema_name>",
"table": "`<project>.<dataset>.<table>`",
"columns": ["<col_name_0>", "<col_name_1>", "<col_name_2>"],
"datetime_key": "<your_key>",
"filters": ["country='us'", "state='CA'",
"registered_on>=DATE_ADD(current_date, INTERVAL -7 day)"
] // also optional: these are parsed in 'WHERE' clause
}
],
"start_datetime": "2017-01-01T00:00:00Z", // This can be set at the command line argument
"end_datetime": "2017-02-01T00:00:00Z", // end_datetime is optional
"limit": 100,
"start_always_inclusive": false // default is false, optional
}
The required parameters is at least one stream (one bigquery table/view) to copy.
It is not a recommended BigQuery practice to use * to specify the columns
as it may blow up the cost for a table with a large number of columns.
filters are optional but we strongly recommend using this over a large
partitioned table to control the cost. LIMIT (The authors of tap-bigquery is not
responsible for the cost incurred by running this program. Always test
thoroughly with small data set first.)
start_datetime must also be set in the config file or as the command line
argument (See the next step).
limit will limit the number of results, but it does not result in reduce
the query cost.
The table/view is expected to have a column to indicate the creation or
update date and time so the tap sends the query with ORDER BY and use
the column to record the bookmark (See State section).
Step 2: Create catalog
Run tap-bigquery in discovery mode to let it create json schema file and then
run them together, piping the output of tap-bigquery to target-csv:
tap-bigquery -c tap_config.json -d > catalog.json
Step 3: Run
tap-bigquery can be run with any Singer Target. As example, let use
target-csv.
pip install target-csv
Run:
tap-bigquery -c tap_config.json \
--catalog catalog.json --start_datetime '2020-08-01T00:00:00Z' \
--end_datetime '2020-08-02T01:00:00Z' | target-csv --config target_config.json \
> state.json
This example should create a csv file in the same directory.
state.json should contain a state (bookmark) after the run. (See State section).
Notes:
start and end datetimes accept ISO 8601 format, can be date only. start datetime
is inclusive, end datetime is not.
It is recommended to inspect the catalog file and fix the auto-type assignment
if necessary.
target-csv's target_config.json is optinal.
tap-bigquery can produce nested records but it's up to target if the data
writing will be successful. In this example with target-csv, the table is
expected to be flat.
Authentication
It is recommended to use tap-bigquery with a service account.
Download the client_secrets.json file for your service account, and place it
on the machine where tap-bigquery will be executed.
Set a GOOGLE_APPLICATION_CREDENTIALS environment variable on the machine,
where the value is the fully qualified path to client_secrets.json
In environments where it is preferable to supply secrets or other configuration via
environment variables or where it is not feasible to make a client_secrets.json
file available, you may instead provide credential as a JSON string in the
GOOGLE_APPLICATION_CREDENTIALS_STRING environment variable. This JSON string
should contain the same contents and be formatted the same way as the contents
of a client_secrets.json file.
In the testing environment, you can also manually authenticate before runnig
the tap. In this case you do not need GOOGLE_APPLICATION_CREDENTIALS defined:
gcloud auth application-default login
You may also have to set the project:
gcloud config set project <project-id>
Though not tested, it should also be possible to use the OAuth flow to
authenticate to GCP as well:
tap-bigquery will attempt to open a new window or tab in your default
browser. If this fails, copy the URL from the console and manually open it
in your browser.
If you are not already logged into your Google account, you will be prompted
to log in.
If you are logged into multiple Google accounts, you will be asked to select
one account to use for the authorization.
Click the Accept button to allow tap-bigquery to access your Google BigQuery
table.
You can close the tab after the signup flow is complete.
State
This tap emits state.
The command also takes a state file input with --state <file-name> option.
If the state is set, start_datetime config and command line argument are
ignored and the datetime value from last_update key is used as the resuming
point.
To avoid the data duplication, start datetime is exclusive
start_datetime < datetime_column when the tap runs with state option. If
you fear a data loss because of this, just use the --start_datetime option
instead of state. Or set start_always_inclusive: true in configuration.
The tap itself does not output a state file. It anticipate the target program
or a downstream process to fianlize the state safetly and produce a state file.
Original repo
https://github.com/anelendata/tap-bigquery
About this project
This project is developed by
ANELEN and friends. Please check out the ANELEN's
open innovation philosophy and other projects
Copyright © 2020~ Anelen Co., LLC
For personal and professional use. You cannot resell or redistribute these repositories in their original state.
There are no reviews.