#Pandas #DataAnalysis #Python #DataScience #Tutorial
Top 30 Pandas Functions & Methods
This lesson covers 30 essential Pandas functions for data manipulation and analysis, each with a standalone example and its output.
---
1.
Creates a new DataFrame (a 2D labeled data structure) from various inputs like dictionaries or lists.
---
2.
Creates a new Series (a 1D labeled array).
---
3.
Reads data from a CSV file into a DataFrame. (Assuming a file
---
4.
Writes a DataFrame to a CSV file.
#PandasIO #DataFrame #Series
---
5.
Returns the first
---
6.
Returns the last
---
7.
Provides a concise summary of the DataFrame, including data types and non-null values.
---
8.
Returns a tuple representing the dimensionality (rows, columns) of the DataFrame.
#DataInspection #PandasBasics
---
9.
Generates descriptive statistics for numerical columns (count, mean, std, min, max, etc.).
Top 30 Pandas Functions & Methods
This lesson covers 30 essential Pandas functions for data manipulation and analysis, each with a standalone example and its output.
---
1.
pd.DataFrame()Creates a new DataFrame (a 2D labeled data structure) from various inputs like dictionaries or lists.
import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
print(df)
col1 col2
0 1 3
1 2 4
---
2.
pd.Series()Creates a new Series (a 1D labeled array).
import pandas as pd
s = pd.Series([10, 20, 30, 40], name='MyNumbers')
print(s)
0 10
1 20
2 30
3 40
Name: MyNumbers, dtype: int64
---
3.
pd.read_csv()Reads data from a CSV file into a DataFrame. (Assuming a file
data.csv exists).# Create a dummy csv file first
with open('data.csv', 'w') as f:
f.write('Name,Age\nAlice,25\nBob,30')
df = pd.read_csv('data.csv')
print(df)
Name Age
0 Alice 25
1 Bob 30
---
4.
df.to_csv()Writes a DataFrame to a CSV file.
import pandas as pd
df = pd.DataFrame({'Name': ['Charlie'], 'Age': [35]})
# index=False prevents writing the DataFrame index to the file
df.to_csv('output.csv', index=False)
# You can check that 'output.csv' has been created.
print("File 'output.csv' created.")
File 'output.csv' created.
#PandasIO #DataFrame #Series
---
5.
df.head()Returns the first
n rows of the DataFrame (default is 5).import pandas as pd
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F'], 'Value': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
print(df.head(3))
Name Value
0 A 1
1 B 2
2 C 3
---
6.
df.tail()Returns the last
n rows of the DataFrame (default is 5).import pandas as pd
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F'], 'Value': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
print(df.tail(2))
Name Value
4 E 5
5 F 6
---
7.
df.info()Provides a concise summary of the DataFrame, including data types and non-null values.
import pandas as pd
import numpy as np
data = {'col1': [1, 2, 3], 'col2': [4.0, 5.0, np.nan], 'col3': ['A', 'B', 'C']}
df = pd.DataFrame(data)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 3 non-null int64
1 col2 2 non-null float64
2 col3 3 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
---
8.
df.shapeReturns a tuple representing the dimensionality (rows, columns) of the DataFrame.
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
print(df.shape)
(2, 3)
#DataInspection #PandasBasics
---
9.
df.describe()Generates descriptive statistics for numerical columns (count, mean, std, min, max, etc.).
import pandas as pd
df = pd.DataFrame({'Age': [22, 38, 26, 35, 29]})
print(df.describe())
❤2
Top 100 Data Analyst Interview Questions & Answers
#DataAnalysis #InterviewQuestions #SQL #Python #Statistics #CaseStudy #DataScience
Part 1: SQL Questions (Q1-30)
#1. What is the difference between
A:
•
•
•
#2. Select all unique departments from the
A: Use the
#3. Find the top 5 highest-paid employees.
A: Use
#4. What is the difference between
A:
•
•
#5. What are the different types of SQL joins?
A:
•
•
•
•
•
#6. Write a query to find the second-highest salary.
A: Use
#7. Find duplicate emails in a
A: Group by the email column and use
#8. What is a primary key vs. a foreign key?
A:
• A Primary Key is a constraint that uniquely identifies each record in a table. It must contain unique values and cannot contain NULL values.
• A Foreign Key is a key used to link two tables together. It is a field (or collection of fields) in one table that refers to the Primary Key in another table.
#9. Explain Window Functions. Give an example.
A: Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions, they do not collapse rows.
#10. What is a CTE (Common Table Expression)?
A: A CTE is a temporary, named result set that you can reference within a
#DataAnalysis #InterviewQuestions #SQL #Python #Statistics #CaseStudy #DataScience
Part 1: SQL Questions (Q1-30)
#1. What is the difference between
DELETE, TRUNCATE, and DROP?A:
•
DELETE is a DML command that removes rows from a table based on a WHERE clause. It is slower as it logs each row deletion and can be rolled back.•
TRUNCATE is a DDL command that quickly removes all rows from a table. It is faster, cannot be rolled back, and resets table identity.•
DROP is a DDL command that removes the entire table, including its structure, data, and indexes.#2. Select all unique departments from the
employees table.A: Use the
DISTINCT keyword.SELECT DISTINCT department
FROM employees;
#3. Find the top 5 highest-paid employees.
A: Use
ORDER BY and LIMIT.SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;
#4. What is the difference between
WHERE and HAVING?A:
•
WHERE is used to filter records before any groupings are made (i.e., it operates on individual rows).•
HAVING is used to filter groups after aggregations (GROUP BY) have been performed.-- Find departments with more than 10 employees
SELECT department, COUNT(employee_id)
FROM employees
GROUP BY department
HAVING COUNT(employee_id) > 10;
#5. What are the different types of SQL joins?
A:
•
(INNER) JOIN: Returns records that have matching values in both tables.•
LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table.•
RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table.•
FULL (OUTER) JOIN: Returns all records when there is a match in either the left or right table.•
SELF JOIN: A regular join, but the table is joined with itself.#6. Write a query to find the second-highest salary.
A: Use
OFFSET or a subquery.-- Method 1: Using OFFSET
SELECT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
-- Method 2: Using a Subquery
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
#7. Find duplicate emails in a
customers table.A: Group by the email column and use
HAVING to find groups with a count greater than 1.SELECT email, COUNT(email)
FROM customers
GROUP BY email
HAVING COUNT(email) > 1;
#8. What is a primary key vs. a foreign key?
A:
• A Primary Key is a constraint that uniquely identifies each record in a table. It must contain unique values and cannot contain NULL values.
• A Foreign Key is a key used to link two tables together. It is a field (or collection of fields) in one table that refers to the Primary Key in another table.
#9. Explain Window Functions. Give an example.
A: Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions, they do not collapse rows.
-- Rank employees by salary within each department
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;
#10. What is a CTE (Common Table Expression)?
A: A CTE is a temporary, named result set that you can reference within a
SELECT, INSERT, UPDATE, or DELETE statement. It helps improve readability and break down complex queries.💡 Applying Image Filters with Pillow
Pillow's
Code explanation: The script opens an image file, applies a
#Python #Pillow #ImageProcessing #ImageFilter #PIL
━━━━━━━━━━━━━━━
By: @DataScienceM ✨
Pillow's
ImageFilter module provides a set of pre-defined filters you can apply to your images with a single line of code. This example demonstrates how to apply a Gaussian blur effect, which is useful for softening images or creating depth-of-field effects.from PIL import Image, ImageFilter
try:
# Open an existing image
with Image.open("your_image.jpg") as img:
# Apply the Gaussian Blur filter
# The radius parameter controls the blur intensity
blurred_img = img.filter(ImageFilter.GaussianBlur(radius=5))
# Display the blurred image
blurred_img.show()
# Save the new image
blurred_img.save("blurred_image.png")
except FileNotFoundError:
print("Error: 'your_image.jpg' not found. Please provide an image.")
Code explanation: The script opens an image file, applies a
GaussianBlur filter from the ImageFilter module using the .filter() method, and then displays and saves the resulting blurred image. The blur intensity is controlled by the radius argument.#Python #Pillow #ImageProcessing #ImageFilter #PIL
━━━━━━━━━━━━━━━
By: @DataScienceM ✨
• Get raw audio data as a NumPy array.
• Create a Pydub segment from a NumPy array.
• Read a WAV file directly into a NumPy array.
• Write a NumPy array to a WAV file.
• Generate a sine wave.
VIII. Audio Analysis with Librosa
• Load audio with Librosa.
• Estimate tempo (Beats Per Minute).
• Get beat event times in seconds.
• Decompose into harmonic and percussive components.
• Compute a spectrogram.
• Compute Mel-Frequency Cepstral Coefficients (MFCCs).
• Compute Chroma features (related to musical pitch).
• Detect onset events (the start of notes).
• Pitch shifting.
• Time stretching (change speed without changing pitch).
IX. More Utilities
• Detect leading silence.
• Get the root mean square (RMS) energy.
• Get the maximum possible RMS for the audio format.
• Find the loudest section of an audio file.
• Change the frame rate (resample).
• Create a simple band-pass filter.
• Convert file format in one line.
• Get the raw bytes of the audio data.
• Get the maximum amplitude.
• Match the volume of two segments.
#Python #AudioProcessing #Pydub #Librosa #SignalProcessing
━━━━━━━━━━━━━━━
By: @DataScienceM ✨
import numpy as np
samples = np.array(audio.get_array_of_samples())
• Create a Pydub segment from a NumPy array.
new_audio = AudioSegment(
samples.tobytes(),
frame_rate=audio.frame_rate,
sample_width=audio.sample_width,
channels=audio.channels
)
• Read a WAV file directly into a NumPy array.
from scipy.io.wavfile import read
rate, data = read("sound.wav")
• Write a NumPy array to a WAV file.
from scipy.io.wavfile import write
write("new_sound.wav", rate, data)
• Generate a sine wave.
import numpy as np
sample_rate = 44100
frequency = 440 # A4 note
duration = 5
t = np.linspace(0., duration, int(sample_rate * duration))
amplitude = np.iinfo(np.int16).max * 0.5
data = amplitude * np.sin(2. * np.pi * frequency * t)
# This array can now be written to a file
VIII. Audio Analysis with Librosa
• Load audio with Librosa.
import librosa
y, sr = librosa.load("sound.mp3")
• Estimate tempo (Beats Per Minute).
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
• Get beat event times in seconds.
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
• Decompose into harmonic and percussive components.
y_harmonic, y_percussive = librosa.effects.hpss(y)
• Compute a spectrogram.
import numpy as np
D = librosa.stft(y)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)
• Compute Mel-Frequency Cepstral Coefficients (MFCCs).
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
• Compute Chroma features (related to musical pitch).
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
• Detect onset events (the start of notes).
onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
onset_times = librosa.frames_to_time(onset_frames, sr=sr)
• Pitch shifting.
y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=4) # Shift up 4 semitones
• Time stretching (change speed without changing pitch).
y_fast = librosa.effects.time_stretch(y, rate=2.0) # Double speed
IX. More Utilities
• Detect leading silence.
from pydub.silence import detect_leading_silence
trim_ms = detect_leading_silence(audio)
trimmed_audio = audio[trim_ms:]
• Get the root mean square (RMS) energy.
rms = audio.rms
• Get the maximum possible RMS for the audio format.
max_possible_rms = audio.max_possible_amplitude
• Find the loudest section of an audio file.
from pydub.scipy_effects import normalize
loudest_part = normalize(audio.strip_silence(silence_len=1000, silence_thresh=-32))
• Change the frame rate (resample).
resampled = audio.set_frame_rate(16000)
• Create a simple band-pass filter.
from pydub.scipy_effects import band_pass_filter
filtered = band_pass_filter(audio, 400, 2000) # Pass between 400Hz and 2000Hz
• Convert file format in one line.
AudioSegment.from_file("music.ogg").export("music.mp3", format="mp3")• Get the raw bytes of the audio data.
raw_data = audio.raw_data
• Get the maximum amplitude.
max_amp = audio.max
• Match the volume of two segments.
matched_audio2 = audio2.apply_gain(audio1.dBFS - audio2.dBFS)
#Python #AudioProcessing #Pydub #Librosa #SignalProcessing
━━━━━━━━━━━━━━━
By: @DataScienceM ✨
❤2
segment = sine_wave[0:51]
windowed_segment = segment * window
VI. Convolution & Correlation
• Perform linear convolution.
sig1 = np.repeat([0., 1., 0.], 100)
sig2 = np.repeat([0., 1., 1., 0.], 100)
convolved = signal.convolve(sig1, sig2, mode='same')
• Compute cross-correlation.
# Useful for finding delays between signals
correlation = signal.correlate(sig1, sig2, mode='full')
• Compute auto-correlation.
# Useful for finding periodicities in a signal
autocorr = signal.correlate(sine_wave, sine_wave, mode='full')
VII. Time-Frequency Analysis
• Compute and plot a spectrogram.
f, t_spec, Sxx = signal.spectrogram(chirp_signal, fs)
plt.pcolormesh(t_spec, f, Sxx, shading='gouraud')
plt.show()
• Perform Continuous Wavelet Transform (CWT).
widths = np.arange(1, 31)
cwt_matrix = signal.cwt(chirp_signal, signal.ricker, widths)
• Perform Hilbert transform to get the analytic signal.
analytic_signal = signal.hilbert(sine_wave)
• Calculate instantaneous frequency.
instant_phase = np.unwrap(np.angle(analytic_signal))
instant_freq = (np.diff(instant_phase) / (2.0*np.pi) * fs)
VIII. Feature Extraction
• Find peaks in a signal.
peaks, _ = signal.find_peaks(sine_wave, height=0.5)
• Find peaks with prominence criteria.
peaks_prom, _ = signal.find_peaks(noisy_signal, prominence=1)
• Differentiate a signal (e.g., to find velocity from position).
derivative = np.diff(sine_wave)
• Integrate a signal.
from scipy.integrate import cumulative_trapezoid
integral = cumulative_trapezoid(sine_wave, t, initial=0)
• Detrend a signal to remove a linear trend.
trend = np.linspace(0, 1, fs)
trended_signal = sine_wave + trend
detrended = signal.detrend(trended_signal)
IX. System Analysis
• Define a system via a transfer function (numerator, denominator).
# Example: 2nd order low-pass filter
system = signal.TransferFunction([1], [1, 1, 1])
• Compute the step response of a system.
t_step, y_step = signal.step(system)
• Compute the impulse response of a system.
t_impulse, y_impulse = signal.impulse(system)
• Compute the Bode plot of a system's frequency response.
w, mag, phase = signal.bode(system)
X. Signal Generation from Data
• Generate a signal from a function.
t = np.linspace(0, 1, 500)
custom_signal = np.sinc(2 * np.pi * 4 * t)
• Convert a list of values to a signal array.
my_data = [0, 1, 2, 3, 2, 1, 0, -1, -2, -1, 0]
data_signal = np.array(my_data)
• Read signal data from a WAV file.
from scipy.io import wavfile
samplerate, data = wavfile.read('audio.wav')
• Create a pulse train signal.
pulse_train = np.zeros(fs)
pulse_train[::100] = 1 # Impulse every 100 samples
#Python #SignalProcessing #SciPy #NumPy #DSP
━━━━━━━━━━━━━━━
By: @DataScienceM ✨
fig, ax = plt.subplots() # Single subplot
fig, axes = plt.subplots(2, 2) # 2x2 grid of subplots
• Plot on a specific subplot (Axes object).
axes[0, 0].plot(x, np.sin(x))
• Set the title for a specific subplot.
axes[0, 0].set_title('Subplot 1')• Set labels for a specific subplot.
axes[0, 0].set_xlabel('X-axis')
axes[0, 0].set_ylabel('Y-axis')• Add a legend to a specific subplot.
axes[0, 0].legend(['Sine'])
• Add a main title for the entire figure.
fig.suptitle('Main Figure Title')• Automatically adjust subplot parameters for a tight layout.
plt.tight_layout()
• Share x or y axes between subplots.
fig, axes = plt.subplots(2, 1, sharex=True)
• Get the current Axes instance.
ax = plt.gca()
• Create a second y-axis that shares the x-axis.
ax2 = ax.twinx()
VI. Specialized Plots
• Create a contour plot.
X, Y = np.meshgrid(x, x)
Z = np.sin(X) * np.cos(Y)
plt.contour(X, Y, Z, levels=10)
• Create a filled contour plot.
plt.contourf(X, Y, Z)
• Create a stream plot for vector fields.
U, V = np.cos(X), np.sin(Y)
plt.streamplot(X, Y, U, V)
• Create a 3D surface plot.
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z)
#Python #Matplotlib #DataVisualization #DataScience #Plotting
━━━━━━━━━━━━━━━
By: @DataScienceM ✨