Skip to content

Pandas

Pandas is the primary tool data scientists use for exploring and manipulating data.

Pandas uses DataFrame to hold the type of data a table which contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

Pandas supports the integration with many file formats or data sources out of the box (csv, excel, sql, json, parquet,…)

import pandas as pd
masses_data=pd.read_csv('mammographic_masses.data.txt',na_values=['?'],names= ['BI-RADS','age','shape','margin','density','severity'])
masses_data.head()

How to

Work on the data

masses_data.describe()
# Search for row with no data in one of the column
masses_data.loc[masses_data['age'].isnull()]
# remove such rows
masses_data.dropna(inplace=True)

Transform data for Sklearn

all_features = masses_data[['age', 'shape',
                             'margin', 'density']].values


all_classes = masses_data['severity'].values

feature_names = ['age', 'shape', 'margin', 'density']

Sources