Data Science Machine Learning Data Analysis

#Pandas #DataAnalysis #Python #DataScience #Tutorial

Top 30 Pandas Functions & Methods

This lesson covers 30 essential Pandas functions for data manipulation and analysis, each with a standalone example and its output.

---

1. pd.DataFrame()
Creates a new DataFrame (a 2D labeled data structure) from various inputs like dictionaries or lists.

import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
print(df)

   col1  col2
0     1     3
1     2     4

---

2. pd.Series()
Creates a new Series (a 1D labeled array).

import pandas as pd
s = pd.Series([10, 20, 30, 40], name='MyNumbers')
print(s)

0    10
1    20
2    30
3    40
Name: MyNumbers, dtype: int64

---

3. pd.read_csv()
Reads data from a CSV file into a DataFrame. The example writes a small data.csv first so it can run on its own.

import pandas as pd

# Create a dummy CSV file first
with open('data.csv', 'w') as f:
    f.write('Name,Age\nAlice,25\nBob,30')

df = pd.read_csv('data.csv')
print(df)

    Name  Age
0  Alice   25
1    Bob   30

---

4. df.to_csv()
Writes a DataFrame to a CSV file.

import pandas as pd
df = pd.DataFrame({'Name': ['Charlie'], 'Age': [35]})
# index=False prevents writing the DataFrame index to the file
df.to_csv('output.csv', index=False)
# You can check that 'output.csv' has been created.
print("File 'output.csv' created.")

File 'output.csv' created.

#PandasIO #DataFrame #Series

---

5. df.head()
Returns the first n rows of the DataFrame (default is 5).

import pandas as pd
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F'], 'Value': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
print(df.head(3))

  Name  Value
0    A      1
1    B      2
2    C      3

---

6. df.tail()
Returns the last n rows of the DataFrame (default is 5).

import pandas as pd
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F'], 'Value': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
print(df.tail(2))

  Name  Value
4    E      5
5    F      6

---

7. df.info()
Provides a concise summary of the DataFrame, including data types and non-null values.

import pandas as pd
import numpy as np
data = {'col1': [1, 2, 3], 'col2': [4.0, 5.0, np.nan], 'col3': ['A', 'B', 'C']}
df = pd.DataFrame(data)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    3 non-null      int64  
 1   col2    2 non-null      float64
 2   col3    3 non-null      object 
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes

---

8. df.shape
Returns a tuple representing the dimensionality (rows, columns) of the DataFrame.

import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
print(df.shape)

(2, 3)

#DataInspection #PandasBasics

---

9. df.describe()
Generates descriptive statistics for numerical columns (count, mean, std, min, max, etc.).

import pandas as pd
df = pd.DataFrame({'Age': [22, 38, 26, 35, 29]})
print(df.describe())

             Age
count   5.000000
mean   30.000000
std     6.519202
min    22.000000
25%    26.000000
50%    29.000000
75%    35.000000
max    38.000000


Data Science Machine Learning Data Analysis

Top 100 Data Analyst Interview Questions & Answers

#DataAnalysis #InterviewQuestions #SQL #Python #Statistics #CaseStudy #DataScience

Part 1: SQL Questions (Q1-30)

#1. What is the difference between DELETE, TRUNCATE, and DROP?
A:
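• DELETE is a DML command that removes rows from a table, optionally filtered by a WHERE clause. It is slower because it logs each row deletion, and it can be rolled back.
• TRUNCATE is a DDL command that quickly removes all rows from a table. It is faster because it does not log individual row deletions. Rollback behavior depends on the database: MySQL and Oracle commit it implicitly, while SQL Server and PostgreSQL can roll it back inside a transaction. It typically resets identity counters (PostgreSQL only with RESTART IDENTITY).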
• DROP is a DDL command that removes the entire table, including its structure, data, and indexes.

#2. Select all unique departments from the employees table.
A: Use the DISTINCT keyword.

SELECT DISTINCT department
FROM employees;

#3. Find the top 5 highest-paid employees.
A: Use ORDER BY and LIMIT (MySQL, PostgreSQL, SQLite; SQL Server uses SELECT TOP 5 instead).

SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;

#4. What is the difference between WHERE and HAVING?
A:
• WHERE is used to filter records before any groupings are made (i.e., it operates on individual rows).
• HAVING is used to filter groups after aggregations (GROUP BY) have been performed.

-- Find departments with more than 10 employees
SELECT department, COUNT(employee_id)
FROM employees
GROUP BY department
HAVING COUNT(employee_id) > 10;
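
To see both filters in one query, here is a minimal sketch using Python's built-in sqlite3 module. The table contents and the salary threshold are made up for illustration.

```python
import sqlite3

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Sales', 900), (2, 'Sales', 400), (3, 'Sales', 800),
        (4, 'IT', 850), (5, 'IT', 300);
""")

rows = conn.execute("""
    SELECT department, COUNT(employee_id) AS well_paid
    FROM employees
    WHERE salary >= 500             -- row filter, applied before grouping
    GROUP BY department
    HAVING COUNT(employee_id) >= 2  -- group filter, applied after aggregation
""").fetchall()

print(rows)  # [('Sales', 2)]
conn.close()
```

WHERE drops the two low-salary rows first; HAVING then drops the IT group, which is left with only one row.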

#5. What are the different types of SQL joins?
A:
• (INNER) JOIN: Returns records that have matching values in both tables.
• LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table.
• RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table.
• FULL (OUTER) JOIN: Returns all records from both tables, matching rows where possible and filling the missing side with NULLs where there is no match.
• SELF JOIN: A regular join, but the table is joined with itself.
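
A small sketch of the difference between INNER and LEFT joins, run through Python's built-in sqlite3 module. The two tables and their rows are hypothetical. RIGHT and FULL joins are left out because older SQLite versions do not support them.

```python
import sqlite3

# Hypothetical tables, only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES ('Alice', 1), ('Bob', 2), ('Cara', NULL);
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'IT'), (3, 'HR');
""")

# INNER JOIN: only employees that have a matching department.
inner_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    INNER JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.name
""").fetchall()

# LEFT JOIN: every employee; dept_name is NULL (None) when nothing matches.
left_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e
    LEFT JOIN departments d ON e.dept_id = d.dept_id
    ORDER BY e.name
""").fetchall()

print(inner_rows)  # [('Alice', 'Sales'), ('Bob', 'IT')]
print(left_rows)   # [('Alice', 'Sales'), ('Bob', 'IT'), ('Cara', None)]
conn.close()
```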

#6. Write a query to find the second-highest salary.
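A: Use OFFSET or a subquery. Select DISTINCT salaries in Method 1 so that a tie for the top salary does not return the highest value again.

-- Method 1: Using OFFSET
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;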

-- Method 2: Using a Subquery
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
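
Ties are the usual trap in this question. The sketch below (Python's built-in sqlite3, made-up data with two employees sharing the top salary) compares an OFFSET query without DISTINCT, one with DISTINCT, and the subquery approach.

```python
import sqlite3

# Hypothetical data: Alice and Bob tie for the highest salary.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 900), ('Bob', 900), ('Cara', 700), ('Dan', 500);
""")

# OFFSET skips one row, not one salary value, so the tie leaks through.
naive = conn.execute(
    "SELECT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

# DISTINCT collapses the tie before OFFSET is applied.
distinct = conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

# The subquery compares values, so it is not affected by ties.
subquery = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

print(naive, distinct, subquery)  # 900 700 700
conn.close()
```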

#7. Find duplicate emails in a customers table.
A: Group by the email column and use HAVING to find groups with a count greater than 1.

SELECT email, COUNT(email)
FROM customers
GROUP BY email
HAVING COUNT(email) > 1;

#8. What is a primary key vs. a foreign key?
A:
• A Primary Key is a constraint that uniquely identifies each record in a table. It must contain unique values and cannot contain NULL values.
• A Foreign Key is a key used to link two tables together. It is a field (or collection of fields) in one table that refers to the Primary Key in another table.
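
A minimal sketch of both constraints in action, using Python's built-in sqlite3. The schema and the try_insert helper are hypothetical. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; other databases enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE departments (
        dept_id INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL
    );
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        dept_id INTEGER REFERENCES departments(dept_id)
    );
    INSERT INTO departments VALUES (1, 'Sales');
    INSERT INTO employees VALUES (1, 'Alice', 1);
""")

def try_insert(sql):
    """Hypothetical helper: return 'ok' or the constraint error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)

valid = try_insert("INSERT INTO employees VALUES (2, 'Bob', 1)")
dup_pk = try_insert("INSERT INTO departments VALUES (1, 'IT')")    # primary key reused
bad_fk = try_insert("INSERT INTO employees VALUES (3, 'Cara', 99)")  # no department 99

print(valid)   # ok
print(dup_pk)  # a UNIQUE constraint error
print(bad_fk)  # a FOREIGN KEY constraint error
conn.close()
```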

#9. Explain Window Functions. Give an example.
A: Window functions perform a calculation across a set of rows related to the current row (the window, defined by the OVER clause). Unlike aggregate functions with GROUP BY, they do not collapse rows: every input row stays in the result.

-- Rank employees by salary within each department
SELECT
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;
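
The same ranking query can be tried out with Python's built-in sqlite3 (window functions need SQLite 3.25 or newer). The sample rows are made up; note how the tie in Sales gets rank 1 twice and the next rank is 3.

```python
import sqlite3

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 'Sales', 900), ('Bob', 'Sales', 900), ('Cara', 'Sales', 700),
        ('Dan', 'IT', 800), ('Eve', 'IT', 600);
""")

rows = conn.execute("""
    SELECT name, department, salary,
           RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
    FROM employees
    ORDER BY department, dept_rank, name
""").fetchall()

for row in rows:
    print(row)
# ('Dan', 'IT', 800, 1)
# ('Eve', 'IT', 600, 2)
# ('Alice', 'Sales', 900, 1)
# ('Bob', 'Sales', 900, 1)
# ('Cara', 'Sales', 700, 3)
conn.close()
```

All five input rows come back, which is the "does not collapse rows" part.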

#10. What is a CTE (Common Table Expression)?
A: A CTE is a temporary, named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. It helps improve readability and break down complex queries.
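
As a sketch of what that looks like, the query below names an intermediate result (dept_avg, a made-up name) with WITH and then joins against it to find employees paid above their department average. It runs on Python's built-in sqlite3 with hypothetical data.

```python
import sqlite3

# Hypothetical sample data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 'Sales', 900), ('Bob', 'Sales', 700),
        ('Dan', 'IT', 800), ('Eve', 'IT', 600);
""")

rows = conn.execute("""
    WITH dept_avg AS (              -- the CTE: a named, temporary result set
        SELECT department, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
    )
    SELECT e.name, e.salary, d.avg_salary
    FROM employees e
    JOIN dept_avg d ON e.department = d.department
    WHERE e.salary > d.avg_salary
    ORDER BY e.name
""").fetchall()

print(rows)  # [('Alice', 900, 800.0), ('Dan', 800, 700.0)]
conn.close()
```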


Data Science Machine Learning Data Analysis

💡 Applying Image Filters with Pillow

Pillow's ImageFilter module provides a set of pre-defined filters you can apply to your images with a single line of code. This example demonstrates how to apply a Gaussian blur effect, which is useful for softening images or creating depth-of-field effects.

from PIL import Image, ImageFilter

try:
    # Open an existing image
    with Image.open("your_image.jpg") as img:
        # Apply the Gaussian Blur filter
        # The radius parameter controls the blur intensity
        blurred_img = img.filter(ImageFilter.GaussianBlur(radius=5))

        # Display the blurred image
        blurred_img.show()

        # Save the new image
        blurred_img.save("blurred_image.png")

except FileNotFoundError:
    print("Error: 'your_image.jpg' not found. Please provide an image.")

Code explanation: The script opens an image file, applies a GaussianBlur filter from the ImageFilter module using the .filter() method, and then displays and saves the resulting blurred image. The blur intensity is controlled by the radius argument.

#Python #Pillow #ImageProcessing #ImageFilter #PIL

━━━━━━━━━━━━━━━
By: @DataScienceM ✨


Data Science Machine Learning Data Analysis

• Get raw audio data as a NumPy array.

import numpy as np
samples = np.array(audio.get_array_of_samples())

• Create a Pydub segment from a NumPy array.

new_audio = AudioSegment(
    samples.tobytes(),
    frame_rate=audio.frame_rate,
    sample_width=audio.sample_width,
    channels=audio.channels
)

• Read a WAV file directly into a NumPy array.

from scipy.io.wavfile import read
rate, data = read("sound.wav")

• Write a NumPy array to a WAV file.

from scipy.io.wavfile import write
write("new_sound.wav", rate, data)

• Generate a sine wave.

import numpy as np
sample_rate = 44100
frequency = 440 # A4 note
duration = 5
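# endpoint=False keeps the sample spacing at exactly 1 / sample_rate
t = np.linspace(0., duration, int(sample_rate * duration), endpoint=False)
amplitude = np.iinfo(np.int16).max * 0.5
# Cast to int16 so the array is stored as 16-bit PCM when written to a WAV file
data = (amplitude * np.sin(2. * np.pi * frequency * t)).astype(np.int16)
# This array can now be written to a file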

VIII. Audio Analysis with Librosa

• Load audio with Librosa.

import librosa
y, sr = librosa.load("sound.mp3")

• Estimate tempo (Beats Per Minute).

tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

• Get beat event times in seconds.

beat_times = librosa.frames_to_time(beat_frames, sr=sr)

• Decompose into harmonic and percussive components.

y_harmonic, y_percussive = librosa.effects.hpss(y)

• Compute a spectrogram.

import numpy as np
D = librosa.stft(y)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

• Compute Mel-Frequency Cepstral Coefficients (MFCCs).

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

• Compute Chroma features (related to musical pitch).

chroma = librosa.feature.chroma_stft(y=y, sr=sr)

• Detect onset events (the start of notes).

onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
onset_times = librosa.frames_to_time(onset_frames, sr=sr)

• Pitch shifting.

y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=4) # Shift up 4 semitones

• Time stretching (change speed without changing pitch).

y_fast = librosa.effects.time_stretch(y, rate=2.0) # Double speed

IX. More Utilities

• Detect leading silence.

from pydub.silence import detect_leading_silence
trim_ms = detect_leading_silence(audio)
trimmed_audio = audio[trim_ms:]

• Get the root mean square (RMS) energy.

rms = audio.rms

• Get the maximum possible amplitude for the audio format.

max_possible_amplitude = audio.max_possible_amplitude

• Strip silent sections and normalize what remains (a rough way to isolate the loud parts).

from pydub.effects import normalize
loudest_part = normalize(audio.strip_silence(silence_len=1000, silence_thresh=-32))

• Change the frame rate (resample).

resampled = audio.set_frame_rate(16000)

• Create a simple band-pass filter.

from pydub.scipy_effects import band_pass_filter
filtered = band_pass_filter(audio, 400, 2000) # Pass between 400Hz and 2000Hz

• Convert file format in one line.

AudioSegment.from_file("music.ogg").export("music.mp3", format="mp3")

• Get the raw bytes of the audio data.

raw_data = audio.raw_data

• Get the maximum amplitude.

max_amp = audio.max

• Match the volume of two segments.

matched_audio2 = audio2.apply_gain(audio1.dBFS - audio2.dBFS)

#Python #AudioProcessing #Pydub #Librosa #SignalProcessing

━━━━━━━━━━━━━━━
By: @DataScienceM ✨


Data Science Machine Learning Data Analysis

segment = sine_wave[0:51]
windowed_segment = segment * window

VI. Convolution & Correlation

• Perform linear convolution.

sig1 = np.repeat([0., 1., 0.], 100)
sig2 = np.repeat([0., 1., 1., 0.], 100)
convolved = signal.convolve(sig1, sig2, mode='same')

• Compute cross-correlation.

# Useful for finding delays between signals
correlation = signal.correlate(sig1, sig2, mode='full')
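
To make the "finding delays" idea concrete, here is a pure-Python sketch that does not use SciPy. The cross_correlate helper and the test pulse are hypothetical, written only to show that the lag with the largest correlation value matches the delay.

```python
def cross_correlate(a, b):
    """Full cross-correlation: the value at lag k is sum over n of a[n + k] * b[n]."""
    lags = list(range(-(len(b) - 1), len(a)))
    values = []
    for k in lags:
        total = 0.0
        for n, b_n in enumerate(b):
            if 0 <= n + k < len(a):
                total += a[n + k] * b_n
        values.append(total)
    return lags, values

pulse = [0.0, 1.0, 2.0, 1.0, 0.0]
delay = 3
reference = pulse + [0.0] * delay  # original pulse, padded to the same length
delayed = [0.0] * delay + pulse    # same pulse arriving 3 samples later

lags, values = cross_correlate(delayed, reference)
best_lag = lags[values.index(max(values))]
print(best_lag)  # 3
```

With signal.correlate(..., mode='full') the same idea applies: take the argmax of the output and convert that index to a lag.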

• Compute auto-correlation.

# Useful for finding periodicities in a signal
autocorr = signal.correlate(sine_wave, sine_wave, mode='full')

VII. Time-Frequency Analysis

• Compute and plot a spectrogram.

f, t_spec, Sxx = signal.spectrogram(chirp_signal, fs)
plt.pcolormesh(t_spec, f, Sxx, shading='gouraud')
plt.show()

• Perform Continuous Wavelet Transform (CWT).

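# Note: signal.cwt and signal.ricker were deprecated in SciPy 1.12 and removed in 1.15;
# on newer versions use PyWavelets (pywt.cwt) instead.
widths = np.arange(1, 31)
cwt_matrix = signal.cwt(chirp_signal, signal.ricker, widths)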

• Perform Hilbert transform to get the analytic signal.

analytic_signal = signal.hilbert(sine_wave)

• Calculate instantaneous frequency.

instant_phase = np.unwrap(np.angle(analytic_signal))
instant_freq = (np.diff(instant_phase) / (2.0*np.pi) * fs)

VIII. Feature Extraction

• Find peaks in a signal.

peaks, _ = signal.find_peaks(sine_wave, height=0.5)

• Find peaks with prominence criteria.

peaks_prom, _ = signal.find_peaks(noisy_signal, prominence=1)

• Differentiate a signal (e.g., to find velocity from position).

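# np.gradient divides by the sample spacing in t, so the result is in signal units per second
derivative = np.gradient(sine_wave, t)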

• Integrate a signal.

from scipy.integrate import cumulative_trapezoid
integral = cumulative_trapezoid(sine_wave, t, initial=0)

• Detrend a signal to remove a linear trend.

trend = np.linspace(0, 1, fs)
trended_signal = sine_wave + trend
detrended = signal.detrend(trended_signal)

IX. System Analysis

• Define a system via a transfer function (numerator, denominator).

# Example: 2nd order low-pass filter
system = signal.TransferFunction([1], [1, 1, 1])

• Compute the step response of a system.

t_step, y_step = signal.step(system)

• Compute the impulse response of a system.

t_impulse, y_impulse = signal.impulse(system)

• Compute Bode plot data for a system (frequencies, magnitude in dB, phase in degrees).

w, mag, phase = signal.bode(system)

X. Signal Generation from Data

• Generate a signal from a function.

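t = np.linspace(0, 1, 500)
# np.sinc is the normalized sinc, sin(pi*x) / (pi*x), so pi is already built in:
# np.sinc(2 * 4 * t) equals sin(x) / x with x = 2*pi*4*t
custom_signal = np.sinc(2 * 4 * t)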

• Convert a list of values to a signal array.

my_data = [0, 1, 2, 3, 2, 1, 0, -1, -2, -1, 0]
data_signal = np.array(my_data)

• Read signal data from a WAV file.

from scipy.io import wavfile
samplerate, data = wavfile.read('audio.wav')

• Create a pulse train signal.

pulse_train = np.zeros(fs)
pulse_train[::100] = 1 # Impulse every 100 samples

#Python #SignalProcessing #SciPy #NumPy #DSP

━━━━━━━━━━━━━━━
By: @DataScienceM ✨


Data Science Machine Learning Data Analysis

fig, ax = plt.subplots() # Single subplot
fig, axes = plt.subplots(2, 2) # 2x2 grid of subplots

• Plot on a specific subplot (Axes object).

axes[0, 0].plot(x, np.sin(x))

• Set the title for a specific subplot.

axes[0, 0].set_title('Subplot 1')

• Set labels for a specific subplot.

axes[0, 0].set_xlabel('X-axis')
axes[0, 0].set_ylabel('Y-axis')

• Add a legend to a specific subplot.

axes[0, 0].legend(['Sine'])

• Add a main title for the entire figure.

fig.suptitle('Main Figure Title')

• Automatically adjust subplot parameters for a tight layout.

plt.tight_layout()

• Share x or y axes between subplots.

fig, axes = plt.subplots(2, 1, sharex=True)

• Get the current Axes instance.

ax = plt.gca()

• Create a second y-axis that shares the x-axis.

ax2 = ax.twinx()

VI. Specialized Plots

• Create a contour plot.

X, Y = np.meshgrid(x, x)
Z = np.sin(X) * np.cos(Y)
plt.contour(X, Y, Z, levels=10)

• Create a filled contour plot.

plt.contourf(X, Y, Z)

• Create a stream plot for vector fields.

U, V = np.cos(X), np.sin(Y)
plt.streamplot(X, Y, U, V)

• Create a 3D surface plot.

from mpl_toolkits.mplot3d import Axes3D  # only needed on Matplotlib < 3.2; newer versions register '3d' automatically
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z)

#Python #Matplotlib #DataVisualization #DataScience #Plotting

━━━━━━━━━━━━━━━
By: @DataScienceM ✨
