Resources
Overview
This is a list of resources that I keep in mind when building data platforms. I also include some day-to-day tools I use to help manage projects, time, and communication.
Development
Development tools I use when building data solutions.
Development Tools | Description |
---|---|
Visual Studio Code | A code editor that supports a wide variety of languages and use cases through extensions |
Google Colab | Hosted Jupyter notebooks for running Python. Provides access to GPUs/TPUs if you want to explore ML. |
tmux | A terminal multiplexer that keeps command-line sessions running server-side, even after you disconnect. |
Docker | Package software code with a complete environment. |
Serverless | Utility that streamlines deploying serverless applications in cloud platforms such as AWS. |
AWS Python CDK | Python Cloud Development Kit (CDK) allows for deploying AWS infrastructure using Python code. |
data build tool (dbt) | Utility for creating ELT data pipelines that run within a database. Used for normalizing data into a data warehouse and curating data marts. Supports data testing. Auto-generated documentation makes it easy to understand how data flows through the data platform. |
GitHub | Source control platform of choice. |
GitHub Actions | Supports Continuous Integration/Continuous Deployment (CI/CD) |
dbeaver | Tool for interacting with SQL databases |
pre-commit | pre-commit is a tool that runs a series of checks when committing files to a Git repository |
Programming Languages
Programming languages I consider when building data platforms, doing statistical analysis, machine learning, etc.
Language | Purpose |
---|---|
Python | A general-purpose language that supports building data platforms, data analytics, statistical analysis, and machine learning. |
R | A language built for performing statistical analysis, machine learning, and data analysis. |
Julia | A high-performance language for scientific computing. |
Scala | For writing performant Spark data applications. |
Go | Building efficient data microservices. |
SQL | Structured query language for pulling data from relational databases. |
bash | A shell scripting language that can be helpful in automating tasks. |
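
As a small illustration of how Python and SQL combine when pulling data from a relational database, here is a minimal sketch using the standard-library `sqlite3` module. The `orders` table, its columns, and the `top_customers` helper are hypothetical names made up for this example; they are not part of any project described here.

```python
import sqlite3


def top_customers(conn, limit=2):
    """Return (customer, total) rows ordered by total spend, highest first."""
    query = """
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        ORDER BY total DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Build a throwaway in-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 120.0), ("globex", 75.5), ("acme", 30.0), ("initech", 10.0)],
)

rows = top_customers(conn)
print(rows)
```

The same query text would work largely unchanged against most relational databases; only the connection setup and the parameter placeholder style tend to differ between drivers.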
Python Development Tools
Tool | Type | Purpose |
---|---|---|
black | Clean Code | black is a Python code formatter that helps keep code consistently formatted across teams |
isort | Clean Code | isort sorts and groups Python imports |
flake8 | Clean Code | flake8 is a code linter that enforces coding style and flags issues such as unused imports |
mypy | Clean Code | mypy is a static type checker that catches type errors that Python's dynamic typing would otherwise surface only at runtime |
pytest | Testing | pytest is my unit testing tool of choice for its extensibility when building test harnesses |
great-expectations | Testing | Supports data quality testing and validation |
Test-Driven Data Analysis | Testing | A package that supports integrating data quality tests with unit tests such as pytest |
poetry | Environment Management | A Python tool for managing project dependencies and virtual environments |
data version control (dvc) | ML Ops | Utility that provides a streamlined approach for managing machine learning models from development to production. |
awswrangler | AWS SDK | Connects pandas DataFrames to AWS data services such as S3, Athena, and Redshift |
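
To show how a few of these tools fit together, here is a minimal sketch of a type-annotated function (the kind of code mypy can check) with a pytest-style test next to it. The `null_rate` helper and the sample rows are hypothetical, and the example uses only the standard library, so it illustrates the pattern rather than any specific project setup.

```python
from typing import Dict, List


def null_rate(rows: List[Dict[str, object]], column: str) -> float:
    """Return the fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def test_null_rate_counts_missing_and_none() -> None:
    # pytest collects functions named test_* and reports failed assertions.
    rows = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": None},
        {"id": 3},
        {"id": 4, "email": "d@example.com"},
    ]
    assert null_rate(rows, "email") == 0.5
    assert null_rate([], "email") == 0.0
```

With the type hints in place, a call such as `null_rate(rows, 3)` is the kind of mistake a static type checker can flag before the code runs, while the test covers the runtime behavior.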
Favorite Python Analysis Tools
Tool | Type | Purpose |
---|---|---|
jupyter notebooks | Analysis Tool | Jupyter notebooks run Python in cells and show the results immediately |
PySpark | Data Processing | Provides a Python interface for creating Spark applications |
pandas | Dataframe | pandas imports data into a tabular data structure and supports cleaning and analysis in Python |
scipy | Calculations | Python package for scientific computing |
numpy | Calculations | Python package for mathematical computing |
streamlit | Interactive Data Visualization | Streamlined method for creating interactive dashboards in python |
plotly | Interactive Data Visualization | Streamlined method for creating interactive visualizations in python |
plotnine | Static Data Visualization | Allows the creation of ggplot visualizations in python |
scikit-learn | Machine Learning | Machine learning in python |
keras | Deep Learning | A high level deep learning library for creating deep learning models |
R Development Tools
Tool | Type | Purpose |
---|---|---|
renv | Environment Management | A tool for managing R packages used for a project. |
RStudio | IDE | An IDE tailored to running R code. It can also run Python, but I have not tested this. |
R Analysis Tools
Tool | Type | Purpose |
---|---|---|
tidyverse | Analysis | An ecosystem of R packages that support a wide variety of data science and analytics tasks |
r-shiny | Interactive Dashboard | A framework for creating interactive dashboard sites with R. |
IT Tools
This section includes a wide variety of tools and platforms used to help build and deploy data platforms.
Item | Purpose |
---|---|
nomachine | Tool for remotely connecting to Linux desktop. |
VMWare Workstation | Virtualization software for running a Windows VM on a Linux computer. |
Amazon Web Services | Cloud computing platform of choice for building and deploying data solutions. |
Quarto | Used for blog website. |
GitHub Pages | For serving up personal blog site. |
Google Domains | Used for custom domains |
Squarespace | Website builder for managing business website. |
draw.io | Free tool for creating ER diagrams, process flows, and AWS architecture diagrams |
Juicebox Analytics | Data Storytelling tool |
Markdown | Markup language for writing documentation. |
oh-my-bash | Framework for managing bash configuration. There is also a zsh equivalent, oh-my-zsh. |
Cookiecutter | A project templating tool. |
Visual Studio Code Extensions
Visual Studio Code Extensions I find useful.
Visual Studio Code Extension |
---|
GitHub Pull Requests and Issues |
Python |
Quarto |
markdownlint |
Project Manager |
Remote Development |
Code Spell Checker |
Docker |
Edit csv |
Productivity and General Use
Productivity tools and general use programs.
Item | Purpose |
---|---|
Trello | Productivity tool used for planning out work |
Google Workspace | Frequently use Google Docs and Google Sheets |
Brave | A privacy-first browser built on Chromium |
Firefox | The Firefox Multi-Account Containers extension supports logging into multiple accounts in a single window. |
Calendly | Website to streamline scheduling meetings. |
Feedly | RSS Feed Tool that has a good free tier. |
zoom | Video Conferencing tool of choice. |