Project author: MartinThoma

Project description:
Exploratory Data Analysis with Python
Language: Python
Repository: git://github.com/MartinThoma/edapy.git
Created: 2017-12-03T12:22:53Z
Project page: https://github.com/MartinThoma/edapy

License: MIT License


edapy is a first stop for analyzing a new dataset.

Installation

  $ pip install git+https://github.com/MartinThoma/edapy.git

For the PDF commands, you also need pdftotext:

  $ sudo apt-get install poppler-utils

Usage

  $ edapy --help
  Usage: edapy [OPTIONS] COMMAND [ARGS]...

    edapy is a tool for exploratory data analysis with Python.

    You can use it to get a first idea what a CSV is about or to get an
    overview over a directory of PDF files.

  Options:
    --version  Show the version and exit.
    --help     Show this message and exit.

  Commands:
    csv     Analyze CSV files.
    images  Analyze image files.
    pdf     Analyze PDF files.

The workflow is as follows:

  • edapy pdf find --path . --output results.csv creates a results.csv
    containing metadata about all PDF files in the given path.
  • edapy csv predict --csv_path my-new.csv --types types.yaml starts or
    resumes an interactive session that leads the user through a series of
    questions: which delimiter and quote character the CSV uses, and which
    type each column has.
  • edapy then generates a types.yaml file, which other applications can use
    to load the CSV with df = edapy.load_csv(csv_path, yaml_path).
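
The delimiter and quote-character questions in the second step could be pre-filled with a guess. The following is a minimal sketch that uses only `csv.Sniffer` from Python's standard library. It is not edapy's own implementation; the function name `guess_csv_format`, the candidate delimiters, and the sample rows are hypothetical.

```python
import csv


def guess_csv_format(sample: str, candidates: str = ",;\t|") -> dict:
    """Guess delimiter and quote character from a text sample of a CSV file.

    Hypothetical helper for illustration; edapy may decide this differently.
    """
    # Sniffer inspects the sample and raises csv.Error if it cannot decide.
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return {"delimiter": dialect.delimiter, "quotechar": dialect.quotechar}


# Illustrative rows (not taken from a real dataset)
sample = (
    "PassengerId,Name,Age\n"
    '1,"Doe, Mr. John",22\n'
    '2,"Roe, Miss. Jane",26\n'
)
guess = guess_csv_format(sample)
print(guess)
```

A guess like this is only a starting point, which is presumably why edapy asks the user to confirm the choices interactively.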

Example types.yaml

For the Titanic Dataset, the resulting types.yaml looks as follows:

  columns:
  - dtype: other
    name: Name
  - dtype: int
    name: Parch
  - dtype: float
    name: Age
  - dtype: other
    name: Ticket
  - dtype: float
    name: Fare
  - dtype: int
    name: PassengerId
  - dtype: other
    name: Cabin
  - dtype: other
    name: Embarked
  - dtype: int
    name: Pclass
  - dtype: int
    name: Survived
  - dtype: other
    name: Sex
  - dtype: int
    name: SibSp
  csv_meta:
    delimiter: ','
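
To illustrate how such a type description can drive CSV loading, here is a rough standard-library sketch. It does not use or reproduce `edapy.load_csv`; the `types` dictionary mirrors the structure of the YAML file for a subset of columns, while the `CONVERTERS` mapping, the `load_rows` helper, and the sample rows are assumptions made for this example.

```python
import csv
import io

# Mirrors the structure of types.yaml (subset of columns, for brevity).
types = {
    "columns": [
        {"dtype": "int", "name": "PassengerId"},
        {"dtype": "float", "name": "Age"},
        {"dtype": "other", "name": "Name"},
    ],
    "csv_meta": {"delimiter": ","},
}

# Assumed mapping from the dtype labels to Python converters.
CONVERTERS = {"int": int, "float": float, "other": str}


def load_rows(csv_file, types):
    """Read a CSV and cast each listed column; empty cells become None."""
    casts = {col["name"]: CONVERTERS[col["dtype"]] for col in types["columns"]}
    reader = csv.DictReader(csv_file, delimiter=types["csv_meta"]["delimiter"])
    rows = []
    for raw in reader:
        rows.append(
            {
                name: (cast(raw[name]) if raw[name] != "" else None)
                for name, cast in casts.items()
            }
        )
    return rows


# Illustrative rows (not taken from a real dataset)
csv_text = 'PassengerId,Name,Age\n1,"Doe, Mr. John",22.5\n2,"Roe, Miss. Jane",\n'
rows = load_rows(io.StringIO(csv_text), types)
print(rows)
```

Keeping the column types in a separate file means the decisions made once during the interactive session can be reused every time the CSV is loaded.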

A sample run would then look like this:

  $ edapy csv predict --types types_titanik.yaml --csv_path train.csv
  Number of datapoints: 891
  2018-04-16 21:51:56,279 WARNING Column 'Survived' has only 2 different values ([0, 1]). You might want to make it a 'category'
  2018-04-16 21:51:56,280 WARNING Column 'Pclass' has only 3 different values ([3, 1, 2]). You might want to make it a 'category'
  2018-04-16 21:51:56,281 WARNING Column 'Sex' has only 2 different values (['male', 'female']). You might want to make it a 'category'
  2018-04-16 21:51:56,282 WARNING Column 'SibSp' has only 7 different values ([0, 1, 2, 4, 3, 8, 5]). You might want to make it a 'category'
  2018-04-16 21:51:56,283 WARNING Column 'Parch' has only 7 different values ([0, 1, 2, 5, 3, 4, 6]). You might want to make it a 'category'
  2018-04-16 21:51:56,285 WARNING Column 'Embarked' has only 3 different values (['S', 'C', 'Q']). You might want to make it a 'category'
  ## Integer Columns
  Column name: Non-nan    mean     std   min   25%   50%   75%   max
  PassengerId:     891  446.00  257.35     1   224   446   668   891
  Survived   :     891    0.38    0.49     0     0     0     1     1
  Pclass     :     891    2.31    0.84     1     2     3     3     3
  SibSp      :     891    0.52    1.10     0     0     0     1     8
  Parch      :     891    0.38    0.81     0     0     0     0     6
  ## Float Columns
  Column name: Non-nan   mean    std   min    25%    50%    75%     max
  Age        :     714  29.70  14.53  0.42  20.12  28.00  38.00   80.00
  Fare       :     891  32.20  49.69  0.00   7.91  14.45  31.00  512.33
  ## Other Columns
  Column name: Non-nan  unique  top (count)
  Name       :     891     891  Goldschmidt, Mr. George B (1)
  Sex        :     891       2  male (577)
  Ticket     :     891     681  347082 (7)
  Cabin      :     204     148  C23 C25 C27 (4)
  Embarked   :     889       4  S (644)
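
The warnings in this run flag columns with few distinct values. A check of that kind could be sketched as follows. This is an assumption-laden illustration, not edapy's code: the helper name `category_hints`, the threshold of 10, and the sample values are all hypothetical.

```python
def category_hints(columns, max_unique=10):
    """Return hint messages for columns with at most `max_unique` distinct values.

    The threshold of 10 is an assumption, not edapy's actual setting.
    """
    hints = []
    for name, values in columns.items():
        # dict.fromkeys removes duplicates while keeping first-seen order.
        distinct = list(dict.fromkeys(values))
        if len(distinct) <= max_unique:
            hints.append(
                f"Column '{name}' has only {len(distinct)} different values "
                f"({distinct}). You might want to make it a 'category'"
            )
    return hints


# Illustrative values (not the real Titanic columns)
columns = {
    "Survived": [0, 1, 1, 0, 1],
    "Ticket": [f"T{i}" for i in range(20)],
}
hints = category_hints(columns)
for hint in hints:
    print(hint)
```

With the sample values, only the `Survived` column is flagged, because `Ticket` has 20 distinct values and exceeds the assumed threshold.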