{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Contributing Guide" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contributing to data-describe\n", "\n", "Thanks for taking the time to contribute to data-describe!\n", "\n", "### Code of Conduct\n", "All contributors are expected to abide by the [Code of Conduct](https://github.com/data-describe/data-describe/blob/master/CODE_OF_CONDUCT.md).\n", "\n", "### Getting Started - Issues\n", "\n", "The \"to do\" task list for data-describe starts at the [Issues](https://github.com/data-describe/data-describe/issues) page in Github. Issues generally fall under one of two categories:\n", "- *Bugs*: Parts of data-describe that aren't working as they should\n", "- *Enhancements*: Ideas for additions or changes to data-describe to make it better\n", "\n", "> Note: Before submitting a new issue, you should search the existing issues to make sure that it hasn't already been reported. \n", "\n", "#### Filing a bug report\n", "Bug reports are important in helping identify things that aren't working in data-describe. Filing a bug report is straightforward: Head to the [Issues](https://github.com/data-describe/data-describe/issues) page, click on the [New issue](https://github.com/data-describe/data-describe/issues/new/choose) button, and select the [Bug Report](https://github.com/data-describe/data-describe/issues/new?labels=bug&template=bug_report.md) template to get started.\n", "\n", "![New Bug Report](./imgs/bug-report-page.png \"Filing a new Bug Report\")\n", "\n", "The first thing you'll want to do is to create a short, descriptive title.\n", "\n", "Next, you'll want to fill out the contents of the Bug Report. The contents have been pre-filled with a template with guidance on common details that should be included in a bug report. Here are a few quick tips on how to read and fill out the template:\n", "\n", " Some example sections are provided to help guide the contents of your report:\n", "`** Header Sections **`\n", "\n", "Comment lines are added in the template to provide guidance and will not show in the final output:\n", "`` : \n", "\n", "Triple back-tick marks can be used to add code examples:\n", "```\n", " ```\n", " def python_code():\n", " ```\n", "```\n", "\n", "#### Filing a Feature Request (Enhancement)\n", "Filing a feature request follows the similar steps. However, since enhancements can be more subjective in nature than bug reports, it is recommended to follow the guidelines below when filing a feature request:\n", "\n", "1. **Incremental changes**: Small, incremental changes (for example, clarifying the labels on a plot, or extending configuration support for a feature) can be started by using the [Feature Request template](https://github.com/data-describe/data-describe/issues/new?labels=enhancement&template=feature_request.md) when creating a new issue. Even for small changes, it's recommended to describe your thought process - how and/or why the enhancement would be beneficial.\n", "2. **Brand new features / Large Reworks / System Redesign**: For brand new features or other enhancements that may involve large changes, a design document should be written to explain and document the scope of the changes. A template for this design document can be found in the repository at `docs/designs/TEMPLATE.md`. This document can be submitted in a Pull Request, which is described further below." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding something to work on\n", "\n", "For first time contributors, jumping head first into squashing bugs or designing new features can be overwhelming. It may be easiest to start by searching for the [*good first issue* label](https://github.com/data-describe/data-describe/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22). Issues that are tagged as *good first issue* typically involve small changes such as fixing errors in documentation. \n", "\n", "#### Claiming/assigning an issue\n", "To avoid overlapping work on issues, you should check to make sure that nobody else is actively working on the problem already. This is indicated by the *Assignees* section on the right side of the page for an issue.\n", "\n", "![assignees](./imgs/assignee.png \"Assignees\")\n", "\n", "1. If nobody is assigned to the issue, you should post a comment requesting to tackle the issue.\n", "2. If somebody else is already assigned to the issue, **and** there hasn't been any recent activity, post a comment to ask if the current assignee is still actively working on the issue. If they are unable or unwilling to complete work on the issue, or you don't receive any response in a reasonable time frame (~ 2 weeks), the issue may be re-assigned.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Git / GitHub\n", "The following section provides a walkthrough of using git/GitHub to edit the README on the front page.\n", "\n", "1. **Fork the repository**: Only the primary project team has permissions to directly edit the main data-describe repository. As an external contributor, you will need to make a personal copy of the repository (a.k.a. \"fork\") to begin making changes. To create a fork, click on the Fork button at the top right of the page of the repository on Github.\n", "\n", "![Fork the repository](./imgs/fork.png \"Fork the repository\")\n", "\n", "This will create a copy of the data-describe repository under your own GitHub account. (Visit \"Your Repositories\" under your account.) \n", "\n", "2. **Clone the repository**: To access and edit your fork (copy) of the repository on your computer, you will need to clone the repository. First, you will need to ensure that you have the [Git software](https://git-scm.com/downloads) installed on your computer. Then, open the terminal/shell of your choice (e.g. Terminal for Mac, Powershell for Windows) in the location where you want the repository copy to be saved on your computer. \n", "\n", "For example, on Windows, ```SHIFT + RIGHT CLICK``` in a folder and select ```Open Powershell window here```. Alternatively, one can use the ```cd``` commands to navigate to the desired folder.\n", "\n", "Enter and run the following commands:\n", "\n", "```\n", "git clone https://github.com//data-describe.git data-describe-\n", "cd data-describe-\n", "git remote add upstream https://github.com/data-describe/data-describe\n", "```\n", "\n", "where `` is your GitHub username.\n", "\n", "3. **Checkout a new branch**: While this step is optional for forked repositories, it's generally best practice to create a new branch for each thing you're working on. Run the following command:\n", "\n", "```\n", "git checkout -b update-readme\n", "```\n", "\n", "where `update-readme` is the name of the branch. (For future contributions, replace this with a name of your choosing). This command creates the new branch *and* checks out the branch.\n", "\n", "4. 
**Adding your changes**: Now open the README.md file in a text editor and edit it by adding to the title (for example, add an emoji).\n", "\n", "![edit-readme](./imgs/edit-readme.png \"Edit README\")\n", "\n", "Now that you've made a change to the repository, you'll need to stage the changes so that they're tracked by git.\n", "\n", "Run the command `git add README.md`, where `README.md` is the path to the file to be added, relative to the current working directory.\n", "\n", "To confirm that the file(s) have been added, you can run `git status`. You should see that the file has been added to *Staged Changes*.\n", "\n", "5. **Committing your changes**: To save these changes to Git, you must _commit_ the files. Run the command:\n", "\n", "```\n", "git commit -m \"Add emoji to README title\"\n", "```\n", "\n", "where the text inside the quotes is a description of the changes you've made in this commit.\n", "\n", "6. **Pushing your changes**: To push your changes online for the first time, run:\n", "\n", "`git push --set-upstream origin update-readme`\n", "\n", "where `--set-upstream` creates the same `update-readme` branch in your fork on GitHub and links it to your local branch.\n", "\n", "> If you're pushing more changes at a later time, you only need to run `git push`.\n", "\n", "7. **Creating a Pull Request**: To submit your changes for review, you will need to create a pull request. When a new branch is pushed to GitHub, a prompt will appear to create a pull request:\n", "\n", "![open-pull-request](./imgs/open-pull-request.png \"Open pull request\")\n", "\n", "Then fill out the form with (at minimum) a title and description and create the pull request. At this point, a project team member or maintainer can review your pull request, provide comments, and officially merge it when approved.\n", "\n", "### Advanced Git Practices\n", "#### Rebase\n", "If you're working on some changes for a long period of time, it's possible that other contributors have submitted changes to the same files you're working on. To sync your branch, run:\n", "\n", "`git pull origin master --rebase`\n", "\n", "and follow the prompts to approve and/or resolve the changes that should be kept.\n", "\n", "#### Updating your Fork\n", "To update the master branch of your fork (so that new branches are created off of an up-to-date master branch), check out your local master branch and run:\n", "```\n", "git fetch upstream\n", "git merge upstream/master\n", "```\n", "\n", "You can then push the updated master branch to your fork with `git push origin master`." ] }, { "source": [ "### Pull Request Conventions\n", "- The pull request title is used to build the release notes. Write the title in past tense, describing the extent of the changes.\n", "- Pull Requests should have labels to identify which category the change belongs to in the release notes. Use the `exclude notes` label if the change doesn't make sense to document in the release notes.\n", "- Pull Requests should be linked to issues, either manually or using [keywords](https://docs.github.com/en/enterprise/2.16/user/github/managing-your-work-on-github/closing-issues-using-keywords), as shown in the example below.\n",
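"\n", "For example, adding a line like the following to the pull request description links the pull request to an issue and automatically closes that issue when the pull request is merged (the issue number here is just a placeholder):\n", "\n", "```\n", "Closes #123\n", "```"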
], "cell_type": "markdown", "metadata": {} }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Developer Environment\n", "\n", "The following setup is recommended for Python development:\n", "\n", "- IDE: [VS Code](https://code.visualstudio.com/)\n", " - Extensions:\n", " - [ms-python.python](https://marketplace.visualstudio.com/items?itemName=ms-python.python)\n", " - [njpwerner.autodocstring](https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring)\n", "- Package/Environment Manager: Conda \n", " - [Miniconda](https://docs.conda.io/en/latest/miniconda.html)\n", "\n", "To create a new conda environment for development, run:\n", "\n", "```\n", "conda create -n test-env\n", "conda env update -n test-env -f etc/test-environment.yml\n", "```\n", "\n", "then activate using `conda activate test-env` or select it as your default Python interpreter for VSCode.\n", "\n", "### Python Extension Configuration\n", "data-describe uses `flake8` for linting. Configure VSCode by disabling `pylint` and enabling `flake8`.\n", "data-describe uses `black` for auto formatting. Configure VSCode by enabling `black`.\n", "\n", "### Pre-commit\n", "[Pre-commit](https://pre-commit.com/) is also strongly recommended for running linting and style checks. This will warn and prevent you from committing code that does not pass certain standards. To install pre-commit:\n", "```\n", "pip install pre-commit\n", "pre-commit install --allow-missing-config\n", "```\n", "Code checks will now run prior to any commits you make.\n", "\n", "data-describe currently utilizes the following tools for code checks:\n", "- black: Ensure consistent formatting of Python files\n", "- mypy: Validate Python type hints\n", "- flake8: Multiple checks for\n", " - import order\n", " - syntax errors or anti-patterns\n", " - (lack of) executable flags on files\n", " - docstring validation\n", "\n", "See `.pre-commit-config.yaml` for a full list of hooks used by pre-commit.\n", "\n", "### Docstring Checks\n", "\n", "(Optional) [darglint](https://github.com/terrencepreilly/darglint) can be used to check if docstrings are outdated. It hasn't been added to the pre-commit hooks because it can take a long time to parse files.\n", "\n", "To run darglint, use the following command:\n", "\n", "```\n", "darglint -v 2 --strictness=short \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic Documentation\n", "Documentation is generated using Sphinx. \n", "\n", "[Example notebooks](#Example-notebooks) are stored in the `examples/` folder for easy access in the Github repository.\n", "\n", "### Notebook Execution\n", "Run `etc/run_notebooks.py` to re-run notebooks. 
"### Docstring Checks\n", "\n", "(Optional) [darglint](https://github.com/terrencepreilly/darglint) can be used to check whether docstrings are outdated. It hasn't been added to the pre-commit hooks because it can take a long time to parse files.\n", "\n", "To run darglint, use the following command:\n", "\n", "```\n", "darglint -v 2 --strictness=short <path to file(s)>\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic Documentation\n", "Documentation is generated using Sphinx.\n", "\n", "[Example notebooks](#Example-notebooks) are stored in the `examples/` folder for easy access in the GitHub repository.\n", "\n", "### Notebook Execution\n", "Run `etc/run_notebooks.py` to re-run notebooks. Using this script instead of executing the notebooks manually ensures that cells are executed in order and that the `kernelspec` metadata is not updated to a different kernel name.\n", "\n", "### Notebook Update\n", "> Note: This step is typically executed by a GitHub Actions workflow.\n", "\n", "Run `docs/update_notebook_docs.py` to manually copy these notebooks (if there are any updates) into the `docs/source` directory and update the index page of the Sphinx-generated documentation.\n", "\n", "### Sphinx Build\n", "To run the typical build process, use the provided conda environment definition at `etc/doc-environment.yml` and run `docs/make.py`.\n", "\n", "Note that this uses `sphinx-multiversion` to build documentation for multiple versions of `data-describe` and only captures changes in tagged commits and/or the master branch on the remote - it **does not** build the documentation on your local branch.\n", "\n", "### Manual Build\n", "To test a build of the Sphinx-generated documentation without using `sphinx-multiversion`, you can run the following command:\n", "\n", "```\n", "sphinx-build -a -E docs/source docs/build\n", "```\n", "\n", "The generated HTML files will be in `docs/build`." ] }, { "source": [ "## Testing\n", "Unit testing uses `pytest`.\n", "\n", "### Test Environment\n", "Use the conda environment `etc/test-environment.yml` for executing tests.\n", "\n", "### Running the full test suite\n", "To run the full test suite, run `pytest` in the root directory of the repository.\n", "\n", "### Running a selected test file or function\n", "To run selected test(s), run `pytest -k <expression>`, where `<expression>` matches the test file name or function name.\n", "\n", "### Running selected marked tests\n", "To run selected groups of tests, run `pytest -m <marker>`.\n", "\n", "The current options for `<marker>` are:\n", "- `base`: Core features of data-describe (excluding any optional dependencies)" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Design Patterns\n", "The following section describes design patterns used in the Python package.\n", "\n", "### Optional Dependencies\n", "data-describe may make use of optional dependencies such as `nltk` or `modin`. When adding or using these optional dependencies in data-describe modules, the following patterns should be used:\n", "\n", "#### `_requires` marks functionality that requires a dependency\n", "\n", "Use the `_requires` decorator on any function or class that needs the optional dependency. Usage:\n", "\n", "```python\n", "from data_describe.compat import _requires\n", "\n", "@_requires(\"nltk\")\n", "def function_that_uses_nltk():\n", "    return\n", "```\n", "\n", "`_requires` should generally take the top-level package name as its sole argument. See the section on *packages vs subpackages* for more information.\n", "\n", "#### `_compat` is used to lazily import from dependencies\n", "\n", "Instead of having import statements at the top of the file, import and use the `_compat` object to access functionality from the optional dependency:\n", "\n", "```python\n", "from data_describe.compat import _requires, _compat\n", "\n", "@_requires(\"nltk\")\n", "def function_that_uses_nltk_freqdist():\n", "    # nltk is imported lazily, on first access through _compat\n", "    return _compat[\"nltk\"].FreqDist()\n", "```\n", "\n", "`_compat` should generally take the sub-package as its key in a dictionary-style access. See the section on *packages vs subpackages* for more information.\n", "\n", "#### packages vs subpackages\n", "\n", "Some packages do not export all of their subpackages. For example, `import statsmodels` does not provide access to `statsmodels.graphics.tsaplots`, as the `graphics` subpackage is not exported.\n", "\n", "As a result, `_requires` generally takes the top-level package name, as this checks whether the package itself is installed. In contrast, `_compat` takes the subpackage to enable imports.\n", "\n", "One exception is the Google client libraries, such as `google-cloud-storage` or `google-cloud-bigquery`. Each of these is installed individually, but they are organized as subpackages of the `google` namespace, e.g. `google.cloud.storage`. In this case, `_requires` should instead be given the specific subpackage (i.e. `_requires(\"google.cloud.storage\")`), since requiring only the `google` package is not specific enough.\n", "\n",
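"Putting these two rules together, a feature that needs `statsmodels.graphics.tsaplots` might look roughly like the sketch below (the wrapper function is hypothetical and only illustrates the package/subpackage split described above):\n", "\n", "```python\n", "from data_describe.compat import _requires, _compat\n", "\n", "@_requires(\"statsmodels\")  # top-level package name: checks that statsmodels is installed\n", "def plot_autocorrelation(data):\n", "    # subpackage key: lazily imports statsmodels.graphics.tsaplots on first use\n", "    return _compat[\"statsmodels.graphics.tsaplots\"].plot_acf(data)\n", "```\n", "\n",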
"#### Side imports\n", "\n", "Some packages require downloads of additional data or models to function. One example is the stopwords for `nltk`. Downloading these resources is handled in `data_describe/compat/_dependency.py`. When adding a dependency that requires such a download, adhere to the following steps:\n", "\n", "1. Add a function that takes the module as its sole argument, checks for the existence of the resource (i.e. whether it was already downloaded), and executes the download if it doesn't exist (a sketch of such a function is shown after this list).\n", "2. Add this function to the module-import mapping used to initialize `_compat`:\n", "```python\n", "_compat = DependencyManager(\n", "    {\n", "        \"nltk\": nltk_download,\n", "        \"spacy\": spacy_download,\n", "        # \"new_package\": downloader_function,\n", "    }\n", ")\n", "```\n", "\n",
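"For step 1, a downloader function for the `nltk` stopwords might look something like the sketch below (illustrative only; see the existing functions in `data_describe/compat/_dependency.py` for the actual implementations):\n", "\n", "```python\n", "def nltk_download(module):\n", "    \"\"\"Download the NLTK stopwords corpus if it is not already available.\"\"\"\n", "    try:\n", "        # nltk.data.find raises LookupError when the resource is missing\n", "        module.data.find(\"corpora/stopwords\")\n", "    except LookupError:\n", "        module.download(\"stopwords\")\n", "```\n", "\n",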
"#### Add to `extras_require` in setup.py\n", "\n", "The new dependency should be added to `extras_require` in `setup.py`. If applicable, try to use existing tags over creating new ones. New tags should be alphabetical and short.\n", "\n", "#### Add to conda environments\n", "\n", "The new dependency should be added to all conda environment definitions. These are located in two places: `etc/*.yml` and `docker/*/*.yml`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example notebooks\n", "The `examples/` folder of the repository contains Jupyter notebooks that provide more detailed documentation on using features of data-describe. Ideally, every feature in data-describe should have its own example notebook.\n", "\n", "### Notebook Naming Convention\n", "Notebook names should not contain spaces.\n", "\n", "### Notebook Structure\n", "- The first cell in an example notebook **must** be a level 1 markdown header with the name/title of the example:\n", "`# Title`\n", "\n", "- This header will be used as the name of the page in the auto-generated Sphinx documentation.\n", "\n", "- Ideally, a short description of how and why to use the particular feature (i.e. from the design document) should be included after the title.\n", "\n", "- Individual examples demonstrating different methods of using the feature should be separated by level 2 (or lower) headers.\n", "\n", "### Notebook Execution\n", "To ensure consistency in cell execution order and kernel specifications, the Python script `etc/run_notebooks.py` should be used. This script uses papermill to execute all notebooks.\n", "\n", "You can optionally add the `--notebook-name` argument to select specific notebooks for execution. This argument can be used multiple times and, if specified, uses substring matching to select notebooks.\n", "\n", "### Notebook Testing\n", "Unit tests for notebooks are in `tests/test_notebooks.py`. Follow the existing pattern of using pytest-notebooks to add a test for a new notebook." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }