pandas is a library that allows us to efficiently manipulate and structure our data. pandas can read CSV files, among other formats.

pandas offers us the Series and DataFrame data structures. A data frame can be thought of as a SQL-table-like or Excel-sheet-like structure. You have columns, rows, and indices.

pandas allows us to sort, export, serialize, and visualize data.

import pandas as pd

pandas can visualize: we can combine it with matplotlib to display graphs.

Series

A 2 column table like a dictionary: key-value pairs, or index-value pairs. To make one, do pd.Series(valueslist, indiceslist)

It is very much like a dictionary. Even referencing a value can be done by giving it the key. To get 30 I would do ceres['C']
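A minimal sketch of making a series and looking a value up by its key. The labels and most of the numbers are made up for illustration; the only thing taken from the note above is that 'C' maps to 30.

```python
import pandas as pd

# Values first, index labels second (illustrative data).
ceres = pd.Series([10, 20, 30, 40], index=["A", "B", "C", "D"])

# Label lookup works like a dictionary key.
value_at_c = ceres["C"]
print(value_at_c)
```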

series are very cool though. There are specific functions you can use with a series

.iloc is how you can index a series by position, like a list. Just do ceres.iloc[index]

.name the name of the series. It becomes a row label or column name when the series is used to make a dataframe

.count() the number of non-missing elements in the series (len(series) gives the total length)

.head(num) the first X items in the series

.tail(num) the last X items in the series

.apply(function) runs every element in the series through the function and returns the results as a new series. The function must have 1 parameter and a return value.

.sort_index() sort by natural index order. Returns a sorted copy, no permanent modification unless you pass parameter inplace=True

.sort_values() sort by natural value order. Returns a sorted copy, no permanent modification unless you pass parameter inplace=True
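The functions above can be tried on a tiny series like this. It is only a sketch: the data, the series name "score" and the helper double are made up for illustration.

```python
import pandas as pd

ceres = pd.Series([30, 10, 20], index=["C", "A", "B"], name="score")

third = ceres.iloc[2]            # positional lookup, like a list
n = ceres.count()                # number of non-missing values
first_two = ceres.head(2)        # first 2 items
last_one = ceres.tail(1)         # last item

def double(x):                   # 1 parameter, 1 return value
    return x * 2

doubled = ceres.apply(double)    # new series, ceres itself is unchanged

by_index = ceres.sort_index()    # sorted copy ordered by label
by_value = ceres.sort_values()   # sorted copy ordered by value
```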

You can convert a series into a dict by just using dict(series).

With series, you can also add 2 series together to perform elementwise addition.

Indices should be the same. If they are not, it won't error, but any index that appears in only one of the series gets NaN in the result.
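A small sketch of both points, with made-up data. The two series deliberately do not share all of their labels, to show what happens to the ones that do not match.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["x", "y", "w"])

as_dict = dict(a)   # index labels become keys
total = a + b       # values are matched by label, not by position

# "x" and "y" exist in both series and are added;
# "z" and "w" exist in only one series and come out as NaN.
print(total)
```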

Visualization

You can plot your series from pandas with matplotlib. This is how:

series.plot() and then plt.show()

For a bar graph, it's just series.plot.bar()

For a horizontal bar graph: series.plot.barh()

For a pie chart: series.plot.pie()

For a histogram: series.plot.hist(). Shows the frequency of values
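A rough sketch of a bar graph, assuming matplotlib is installed. The data is made up, and the "Agg" backend line is only there so the snippet also runs on a machine without a display; in an interactive session you would drop it and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

ceres = pd.Series([10, 20, 30], index=["A", "B", "C"])

ax = ceres.plot.bar()   # returns the matplotlib Axes that was drawn on
ax.set_ylabel("value")

# In an interactive session, display the figure with:
# plt.show()
```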

Exporting

We can export our data to a select few formats.

And there are some more.

Then give it the filename and you are good.
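The list of formats did not survive in these notes, so this sketch only assumes two export methods that pandas does have, to_csv and to_json. The data and file names are made up, and the files go into a temporary folder.

```python
import os
import tempfile

import pandas as pd

ceres = pd.Series([10, 20, 30], index=["A", "B", "C"], name="score")

out_dir = tempfile.mkdtemp()                    # throwaway folder
csv_path = os.path.join(out_dir, "ceres.csv")   # hypothetical file names
json_path = os.path.join(out_dir, "ceres.json")

ceres.to_csv(csv_path)     # give it the filename and you are good
ceres.to_json(json_path)
```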

Data frames

To create a data frame, we first need a dictionary that maps each column name to its column values. Then turn that dictionary into a pandas data frame.

We have indexes at the side: 0, 1, 2, 3. These are autogenerated indexes; however, we can also make our own indexes manually, let's say using an SSN as index.

If inplace is True, I modify the actual dataframe itself.
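A sketch of those steps with made-up people and obviously fake SSNs. It assumes the index is set with set_index, which is one way pandas offers to do it.

```python
import pandas as pd

# Column name -> list of column values (illustrative data).
data = {
    "SSN": ["000-00-0001", "000-00-0002", "000-00-0003"],
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [34, 41, 29],
}

df = pd.DataFrame(data)             # autogenerated index 0, 1, 2

indexed = df.set_index("SSN")       # returns a new dataframe, df is untouched
df.set_index("SSN", inplace=True)   # modifies df itself and returns None
```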

Dataframe attributes and functions

Dataframes share a number of attributes and functions with series.

.head(num) the first X rows of the dataframe

.tail(num) the last X rows of the dataframe

.ndim the number of dimensions of the dataframe. Only an integer

.shape the shape of the dataframe, as (rows, columns)

.dtypes the data types of EACH column

.T gives us a transposed version of the dataframe

.iloc locate from integer position. Allows for indexing and slicing. To get a column from a column name, you reference it with square brackets. To get a cell from the column, use .iloc on the column and then square brackets again.

ALSO BECAUSE DATAFRAMES ARE SO SMART! You can just do df.columnname and get the friggin column!! OMG!!! (This only works when the column name is a valid Python name.)

df.Name is the same as df['Name']
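The attributes above, tried on a small made-up dataframe. This is just a sketch to see what each one gives back.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob", "Cid"], "Age": [34, 41, 29]})

dims = df.ndim                        # number of dimensions
shape = df.shape                      # (rows, columns)
types = df.dtypes                     # one data type per column
flipped = df.T                        # rows and columns swapped

first_row = df.iloc[0]                # row at position 0
first_two = df.iloc[0:2]              # slice of rows 0 and 1
names = df["Name"]                    # column by name
first_name = df["Name"].iloc[0]       # one cell from that column

same = df.Name.equals(df["Name"])     # attribute access gives the same column
```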

Visualization

You can plot the rows/columns and you can also just straight up plot the entire dataframe. pandas will plot every column it can plot (the numeric ones) on one graph, or make a subplot per column if you pass subplots=True.

pandas statistics

Aggregate functions

All for dataframes

.count() returns the number of non-missing elements in each column.

.sum() returns the sum of each column. String columns concatenate, number columns are added, and bool columns, well, bools are converted to ints and then added

.prod() returns the product of each column. Cannot work if there are strings.

You can use these on regular series as well.
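A sketch of the aggregate functions on made-up data. One age is left missing on purpose to show that count() skips it.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 32, None],          # one missing value
    "Score": [2, 3, 4],
    "Member": [True, False, True],
})

counts = df.count()                 # non-missing elements per column
sums = df.sum()                     # bools are added up as 1 and 0
prods = df[["Age", "Score"]].prod() # product per numeric column

score_total = df["Score"].sum()     # same function on a single series
```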

Statistical functions

.mean() gets the average, provided there are no string columns

.median() the middle element. Good if you have a lot of outliers

.mode() value that occurs most often

.std() standard deviation. How much do the values tend to deviate from the mean

.max() returns the maximum value

.min() returns the minimum value

.describe() a summary with most of the previously mentioned statistics.

You can see some percentiles too: 25% of values are below X, 50% of values are below X, 75% of values are below X.
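A sketch of the statistical functions on made-up data. The age 90 is a deliberate outlier, to show why the median can be a better summary than the mean.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [20, 22, 22, 25, 90],    # 90 is an outlier
    "Score": [1, 2, 3, 4, 5],
})

means = df.mean()          # pulled upwards by the outlier
medians = df.median()      # middle element, barely affected by it
modes = df.mode()          # most frequent value(s), returned as a dataframe
spreads = df.std()         # standard deviation per column
highest = df.max()
lowest = df.min()

summary = df.describe()    # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```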

Dataframe merging

We merge dataframes the way tables are joined in relational databases.

So we may have a dataframe for employee names, and another dataframe for employee ages. Both have an SSN primary key, so we can merge the 2 very easily.

The SSN keys do not have to be the exact same series; the two dataframes just need to share the same column name for the key.

Now for merging we need to give it 4 things:

  1. dataframe 1 to merge

  2. dataframe 2 to merge

  3. on (the primary index/key column)

  4. how (the method of joining: inner, outer, left, right)

    1. the outer join is a full join. It keeps all rows from both dataframes. Rows that don't have the same primary index/key in the other dataframe still stay, with NaN (Not a Number) values for the missing information

    2. the inner join will only keep rows which have information from both dataframes, where both have the same primary keys

    3. the left join keeps every primary key from dataframe 1 and adds the matching information from dataframe 2. Keys found only in dataframe 2 are ignored

    4. the right join keeps every primary key from dataframe 2 and adds the matching information from dataframe 1. Keys found only in dataframe 1 are ignored
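The four join methods side by side, as a sketch with made-up employees. The key values are fake and only partly overlap, so each join gives a different set of rows.

```python
import pandas as pd

names = pd.DataFrame({"SSN": ["001", "002", "003"],
                      "Name": ["Ann", "Bob", "Cid"]})
ages = pd.DataFrame({"SSN": ["002", "003", "004"],
                     "Age": [41, 29, 35]})

inner = pd.merge(names, ages, on="SSN", how="inner")  # keys in both
outer = pd.merge(names, ages, on="SSN", how="outer")  # all keys, NaN in gaps
left = pd.merge(names, ages, on="SSN", how="left")    # all keys from names
right = pd.merge(names, ages, on="SSN", how="right")  # all keys from ages

print(outer)
```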

CSV files

Here's how we make a CSV:

A CSV is a plain text file with no special encoding, just a particular layout.

First row is the column names

the following rows are the entries.

Here's an example people.csv:
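The original example file is not reproduced in these notes, so this is a stand-in: a sketch that writes a hypothetical people.csv with made-up columns and rows into a temporary folder.

```python
import os
import tempfile

# First row is the column names, the following rows are the entries.
# All names and values here are invented for illustration.
csv_text = (
    "name,age,city\n"
    "Ann,34,Oslo\n"
    "Bob,41,Lima\n"
    "Cid,29,Oslo\n"
)

people_path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(people_path, "w", newline="") as f:
    f.write(csv_text)
```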


Pandas open CSV

pandas can simply use: pd.read_csv('somefile.csv', delimiter=','). The delimiter varies, but most commonly it is a comma, semicolon, or whitespace, or a mix of them.
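A sketch of reading CSV text that uses a semicolon delimiter. To keep it self-contained, the made-up CSV text is read from memory through io.StringIO instead of from a file; a file path works the same way.

```python
import io

import pandas as pd

raw = "name;age\nAnn;34\nBob;41\n"   # illustrative semicolon-separated text

people = pd.read_csv(io.StringIO(raw), delimiter=";")
print(people)
```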

Pandas queries

If we want to locate rows where a value is equal to, bigger than, less than, or not equal to something else, then here is the pattern:

df.loc[df[key] <comparison> comparedvalue], for example df.loc[df['Age'] > 30]

You can also chain these conditions together. Bracket each condition and use the operators & (and), | (or), ^ (xor). The plain Python words and/or do not work here.
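A sketch of single and chained conditions. The dataframe, its column names and the compared values are all made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [25, 32, 47],
    "City": ["Oslo", "Lima", "Oslo"],
})

over_30 = df.loc[df["Age"] > 30]                              # one condition
not_lima = df.loc[df["City"] != "Lima"]                       # not equal
both = df.loc[(df["Age"] > 30) & (df["City"] == "Oslo")]      # and
either = df.loc[(df["Age"] < 30) | (df["City"] == "Lima")]    # or
```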