How to use Google BigQuery?

Today, organizations are collecting data at record speeds. From sensor measurements to consumer behavior, the need for tools that can efficiently store and analyze large amounts of data has never been greater. Google Cloud offers tailored solutions, particularly with Google BigQuery.

The right data analysis tools greatly facilitate data-driven decision-making. Google BigQuery is one of those powerful tools, and this article will explain how to use it step by step.

What is Google BigQuery?

BigQuery is a fully managed and serverless data warehouse offered by Google Cloud Platform (GCP). It allows you to analyze terabytes of data in seconds.

BigQuery is based on Dremel, a distributed system developed by Google to quickly query very large datasets. Dremel divides query execution into “slots” to fairly distribute resources among multiple users. This system uses Jupiter (Google’s internal network) to access storage, which is based on Colossus, a distributed file system that ensures data replication and recovery.

Data is stored in a columnar format, allowing for high compression and fast analysis speeds. BigQuery can also query data from other services such as Bigtable, Cloud Storage, Cloud SQL, Google Analytics, or Google Drive.
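
To make the columnar idea concrete, here is a toy sketch in Python. It is not BigQuery's actual storage format, and the rows and the `to_columns` helper are made up for illustration. It only shows why an aggregate over one field is cheaper when each column is stored separately.

```python
# Toy rows with made-up values; field names are illustrative only.
rows = [
    {"country": "A", "cases": 10, "deaths": 1},
    {"country": "B", "cases": 20, "deaths": 2},
    {"country": "A", "cases": 5, "deaths": 0},
]


def to_columns(row_list):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in row_list:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing one field still walks through every full row.
total_from_rows = sum(row["cases"] for row in rows)

# Column layout: the same sum reads a single list and ignores the others.
total_from_columns = sum(columns["cases"])

print(total_from_rows, total_from_columns)
```

Both totals are identical; the difference is how much data has to be read to compute them, which is what makes a query touching a few columns of a wide table fast.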

BigQuery is ideal for analyzing very large volumes of data, especially when datasets are primarily read-only. It is not suited to transactional (OLTP) workloads or to small databases.

Finally, BigQuery requires no infrastructure management: you pay for the storage you use and, under on-demand pricing, for the amount of data your queries process. Note, however, that data must be hosted on Google Cloud, which may limit your architectural flexibility.
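
As a rough illustration of this pricing model, the sketch below adds a storage charge and a query charge. The two rates are placeholder values chosen for the example, not Google's actual prices, and `estimate_monthly_cost` is a hypothetical helper; check the current BigQuery pricing page before relying on any figure.

```python
# Placeholder rates for illustration only (NOT actual BigQuery prices).
ASSUMED_PRICE_PER_GIB_STORED = 0.02   # assumed USD per GiB per month
ASSUMED_PRICE_PER_TIB_SCANNED = 5.00  # assumed USD per TiB processed


def estimate_monthly_cost(stored_gib, scanned_tib):
    """Estimate a monthly bill as storage cost plus query-processing cost."""
    storage_cost = stored_gib * ASSUMED_PRICE_PER_GIB_STORED
    query_cost = scanned_tib * ASSUMED_PRICE_PER_TIB_SCANNED
    return round(storage_cost + query_cost, 2)


# Example: 500 GiB stored, 2 TiB processed by queries during the month.
example_cost = estimate_monthly_cost(stored_gib=500, scanned_tib=2)
print(example_cost)
```

The point of the sketch is the shape of the bill: the query term grows with the data scanned, so selecting only the columns you need reduces cost as well as latency.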

Practical Guide: How to Use Google BigQuery

BigQuery is accessible from the Google Cloud Platform web interface, or via API, SDK, or CLI.

Even without your own data, you can start with public datasets offered by Google Cloud. An interesting example is the COVID-19 dataset, which is freely accessible.

Here’s how to proceed:

Step 1: Download the dataset to your computer

Download an up-to-date version of the dataset in CSV format to your local machine.

Step 2: Import and store the dataset in Google BigQuery

  1. Log in to GCP and go to the BigQuery console (Big Data section).
  2. Click on CREATE DATASET to create a new dataset. Give it a unique identifier and choose a storage region.
  3. Once the dataset is created, click on CREATE TABLE:
    • Source: Upload
    • File format: CSV
    • Select your local file.
    • Name your table (for example: worldwide_cases).
    • Enable the Auto Detect option to automatically detect the schema.
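
The same upload can also be scripted. The following is a minimal sketch, assuming the `google-cloud-bigquery` Python package is installed and credentials are already configured. The project, dataset, and file names are placeholders, and `full_table_id` and `load_local_csv` are hypothetical helpers introduced only for this example.

```python
def full_table_id(project, dataset, table):
    """Build the `project.dataset.table` identifier used by BigQuery."""
    return f"{project}.{dataset}.{table}"


def load_local_csv(csv_path, table_id):
    """Load a local CSV file into a BigQuery table with schema auto-detection."""
    # Imported lazily so the helper above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the first row of the file is a header
        autodetect=True,      # counterpart of the Auto Detect option
    )
    with open(csv_path, "rb") as source_file:
        job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )
    job.result()  # wait for the load job to finish
    return client.get_table(table_id).num_rows


# Placeholder names: replace with your own project, dataset, and table.
TABLE_ID = full_table_id("project_name", "dataset_name", "worldwide_cases")
# load_local_csv("worldwide_cases.csv", TABLE_ID)  # uncomment to run against GCP
```

The load call is left commented out because it needs a real project and a local file; the dataset itself must already exist, as created in the console steps above.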

Step 3: Query the data stored in BigQuery

Once the table is created, you can run your first SQL queries:

  • To display 1000 rows:
      SELECT * FROM `project_name.dataset_name.worldwide_cases` LIMIT 1000
  • To get the total number of cases and deaths by country:
      SELECT countriesAndTerritories, SUM(cases) AS N_Cases, SUM(deaths) AS N_Deaths, COUNT(*) AS N_Rows
      FROM `project_name.dataset_name.worldwide_cases`
      GROUP BY countriesAndTerritories
      LIMIT 1000
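
To see what the aggregation query computes without a GCP project, here is a small local sketch using Python's built-in `sqlite3` module. It is a stand-in for BigQuery, not BigQuery itself: the table holds three made-up rows with fictional country names, and an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

# In-memory database standing in for the BigQuery table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE worldwide_cases ("
    "countriesAndTerritories TEXT, cases INTEGER, deaths INTEGER)"
)

# Made-up rows with fictional countries, for illustration only.
sample_rows = [
    ("Atlantis", 10, 1),
    ("Atlantis", 5, 0),
    ("Erewhon", 7, 2),
]
conn.executemany("INSERT INTO worldwide_cases VALUES (?, ?, ?)", sample_rows)

# Same aggregation as the article's second query, on the local table.
query = """
SELECT countriesAndTerritories,
       SUM(cases)  AS N_Cases,
       SUM(deaths) AS N_Deaths,
       COUNT(*)    AS N_Rows
FROM worldwide_cases
GROUP BY countriesAndTerritories
ORDER BY countriesAndTerritories
"""
results = conn.execute(query).fetchall()
conn.close()

for row in results:
    print(row)
```

Each output row holds one country with its summed cases, summed deaths, and the number of source rows that were grouped together.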

Step 4: Add the dataset to Google Cloud Storage

You can also store your files in Google Cloud Storage (GCS):

  1. Create a GCS bucket.
  2. Upload your CSV file to this bucket.

(You can refer to Google’s documentation to create a bucket if needed.)
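
The upload can be scripted as well. This is a minimal sketch, assuming the `google-cloud-storage` Python package is installed, credentials are configured, and the bucket already exists. The bucket and file names are placeholders, and `gcs_uri` and `upload_csv` are hypothetical helpers introduced for this example.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that identifies an object in a bucket."""
    return f"gs://{bucket_name}/{blob_name}"


def upload_csv(bucket_name, local_path, blob_name):
    """Upload a local file to an existing GCS bucket and return its URI."""
    # Imported lazily so gcs_uri() works without the client library.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)  # assumes the bucket already exists
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, blob_name)


# Placeholder names: replace with your own bucket and file.
EXAMPLE_URI = gcs_uri("my-bucket", "worldwide_cases.csv")
# upload_csv("my-bucket", "worldwide_cases.csv", "worldwide_cases.csv")
```

The returned `gs://` URI is the file path BigQuery asks for when a table is created from Cloud Storage.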

Step 5: Use BigQuery with a dataset in Google Cloud Storage

  1. In BigQuery, create a new table:
    • Source: Google Cloud Storage
    • Specify the file path in GCS.
    • Format: CSV
    • Give it a new name (for example worldwide_cases_in_bucket).
  2. You will be able to query this new table in the same way as before.
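
Creating the table from a file in Cloud Storage can be sketched in Python too, again assuming the `google-cloud-bigquery` package is installed and credentials are configured. The URI and table name are placeholders and `load_csv_from_gcs` is a hypothetical helper. This version loads the file into a regular table; BigQuery also supports external tables that leave the data in the bucket, which is not shown here.

```python
def load_csv_from_gcs(uri, table_id):
    """Load a CSV object from Cloud Storage into a BigQuery table."""
    if not uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got: {uri!r}")

    # Imported lazily so the URI check above runs without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the first row of the file is a header
        autodetect=True,      # let BigQuery infer the schema
    )
    job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish
    return client.get_table(table_id).num_rows


# Placeholder names: replace with your own bucket, project, and dataset.
# load_csv_from_gcs(
#     "gs://my-bucket/worldwide_cases.csv",
#     "project_name.dataset_name.worldwide_cases_in_bucket",
# )
```

The early URI check is a small design choice: it fails fast on a local path before any request is sent to Google Cloud.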

Conclusion

BigQuery is a powerful solution for quickly exploring and analyzing large amounts of data. As the steps above show, it takes little setup: upload a file, let BigQuery detect the schema, and start querying with SQL.

However, BigQuery is not perfect: it is less suited to frequently changing data, and it requires your data to be stored on Google Cloud. For greater flexibility, consider keeping your raw data elsewhere.

To efficiently store large volumes of data while maintaining cloud choice freedom, Cloud Volumes ONTAP from NetApp is an excellent alternative. Available on AWS, Azure, and Google Cloud, this solution optimizes costs, improves storage efficiency, easily clones datasets, performs automatic tiering, and ensures data protection.
