Data Types Every Data Scientist Needs
The types you work with daily — numbers, strings, lists, dicts, None/NaN — have subtle behaviours that cause silent bugs in ML metrics, NLP pipelines, and data quality checks. Get these right before reaching for any library.
Python floats are IEEE 754 double-precision (64-bit binary). Most decimal fractions cannot be represented exactly, so 0.1 + 0.2 != 0.3. In ML this matters whenever you compare a metric to a threshold: accuracy >= 0.95 may silently fire or not depending on floating-point rounding. Use math.isclose() for all float comparisons in evaluation code. For counts (correct predictions, sample sizes) prefer int arithmetic before dividing. In NumPy, int64 arrays cannot hold NaN — only float64 can, which is why Pandas silently promotes int columns to float64 when NaN is introduced.
import math, numpy as np, pandas as pd
# ── Float precision in metric comparisons ────────────────
correct, total = 19, 20
accuracy = correct / total # 0.95
a = 0.1 + 0.2
print(a == 0.3) # False (0.30000000000000004)
print(math.isclose(a, 0.3, rel_tol=1e-9)) # True ✓
# Safe threshold comparison for model evaluation
threshold = 0.95
if math.isclose(accuracy, threshold, rel_tol=1e-6) or accuracy > threshold:
    print("Model meets target")
# ── RMSE comparison ───────────────────────────────────────
rmse_v1, rmse_v2 = 0.4230000000001, 0.423
improved = rmse_v2 < rmse_v1 and not math.isclose(rmse_v1, rmse_v2, rel_tol=1e-6)
print("V2 improved:", improved) # False: the difference is within tolerance
# ── NumPy dtype: int64 cannot hold NaN ───────────────────
arr_int = np.array([1, 2, 3], dtype=np.int64)
# arr_int[0] = np.nan → ValueError: cannot convert float NaN to integer
arr_float = np.array([1, 2, 3], dtype=np.float64)
arr_float[0] = np.nan # works — float64 holds NaN
# ── Pandas silently promotes int → float64 with NaN ──────
s = pd.Series([1, 2, 3])
print(s.dtype) # int64
s[0] = None
print(s.dtype) # float64 ← automatic promotion!
print(s) # [NaN, 2.0, 3.0]
# Use nullable Int64 to keep integers with NaN
s2 = pd.array([1, 2, None], dtype=pd.Int64Dtype())
print(s2) # [1, 2, <NA>] ← no float promotion
if accuracy == 0.95: silently fails when accuracy is 0.9500000000001 due to floating-point arithmetic. The model passes all logic checks but the evaluation branch never fires.
A Pandas int64 column gets silently cast to float64 when you assign NaN or None. Downstream dtype checks like dtype == "int64" then fail unexpectedly.
In Python 3, / is always true division: 5 / 2 == 2.5. If you compute an index with mid = (lo + hi) / 2, you get a float, and indexing a list with it raises TypeError (NumPy raises IndexError). Use floor division (//) whenever the result must be an integer.
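A minimal sketch of the integer-index pitfall, assuming a sorted list of ints and a hypothetical helper named find_index. Floor division keeps the midpoint an int, so it stays a valid index.

```python
def find_index(sorted_values: list[int], target: int) -> int:
    """Binary search; returns the index of target, or -1 if absent."""
    lo, hi = 0, len(sorted_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # floor division keeps mid an int
        if sorted_values[mid] == target:
            return mid
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


values = [2, 5, 8, 13, 21]
print(find_index(values, 13))   # 3
print(type((0 + 4) / 2))        # <class 'float'>, not usable as a list index
print(type((0 + 4) // 2))       # <class 'int'>
```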
Python floats follow IEEE 754 double-precision binary. Most decimal fractions have no exact binary representation — 0.1 is stored as 0.1000000000000000055511... Arithmetic accumulates rounding errors. In ML: use math.isclose(a, b, rel_tol=1e-9) for threshold comparisons, numpy.isclose() for array-level comparisons, and round(metric, 4) when logging to suppress spurious precision in experiment reports.
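The threshold advice can be sketched as two small helpers. The names meets_threshold and meets_threshold_counts are illustrative, not from any library, and the tolerance is an assumption you should tune to your metric.

```python
import math


def meets_threshold(value: float, threshold: float, rel_tol: float = 1e-9) -> bool:
    """True if value exceeds threshold or equals it within rel_tol."""
    return value > threshold or math.isclose(value, threshold, rel_tol=rel_tol)


def meets_threshold_counts(correct: int, total: int, pct: int) -> bool:
    """Exact integer check of correct / total >= pct / 100, with no division."""
    return correct * 100 >= pct * total


score = 0.7 + 0.1                            # stored as 0.7999999999999999
print(score >= 0.8)                          # False: the naive comparison misses
print(meets_threshold(score, 0.8))           # True
print(meets_threshold_counts(19, 20, 95))    # True: 1900 >= 1900
print(round(score, 4))                       # 0.8, suitable for a log line
```

When the metric comes from counts, the integer version avoids the float comparison entirely.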
NumPy int64 has no NaN representation — NaN is an IEEE 754 float concept. When you assign None or np.nan to a Pandas int64 Series, Pandas must promote the dtype to float64 to accommodate it. In Pandas 1.0+ use pd.Int64Dtype() for nullable integers. Always check df.dtypes after loading data to catch unexpected promotions before they affect model inputs.
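A short sketch of the dtype check after loading, assuming a tiny in-memory CSV with hypothetical column names (user_id, clicks). One blank cell is enough to promote the whole column to float64.

```python
import io

import pandas as pd

# Hypothetical CSV content with one missing integer in "clicks".
csv_text = "user_id,clicks\n1,10\n2,\n3,30\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)                       # user_id is int64, clicks is float64
loaded_dtype = str(df["clicks"].dtype)

# Convert the promoted column to the nullable integer dtype.
df["clicks"] = df["clicks"].astype("Int64")
print(df["clicks"].dtype)              # Int64
print(df["clicks"].isna().sum())       # 1
```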
Use int for counts, class labels, indices, and batch sizes — integer arithmetic is exact and avoids NaN-promotion issues. Use float for ratios, probabilities, and continuous values. Key rule: if the column will ever contain NaN, it must be float64 or a nullable integer type. Mixing silently (e.g., label 0 becomes 0.0) causes issues in model APIs that expect integer labels for classification tasks.
Strings are immutable Unicode sequences. For NLP and LLM work the core skills are: (1) text normalisation — lowercase, strip, remove special characters; (2) f-strings for building log messages and LLM prompts efficiently; (3) join() not + in loops for string building; (4) regex for extracting structured features from raw text. In LLM engineering, prompt strings are your primary interface to the model — treat them with the same rigour as code.
import re
# ── Text normalisation pipeline ───────────────────────────
def clean_text(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r'[^a-z0-9\s]', '', text) # keep alphanumerics + whitespace (tabs/newlines survive until the split)
    return ' '.join(text.split()) # collapse whitespace runs to single spaces
raw = " Hello, World! This is a TEST. "
print(clean_text(raw)) # "hello world this is a test"
# ── f-strings for experiment logging ─────────────────────
epoch, loss, acc = 5, 0.234, 0.912
log_line = f"Epoch {epoch:02d} | loss={loss:.4f} | acc={acc:.2%}"
print(log_line) # "Epoch 05 | loss=0.2340 | acc=91.20%"
# ── LLM prompt building ───────────────────────────────────
def build_prompt(query: str, context: list[str], max_ctx: int = 3) -> str:
    ctx_block = "\n".join(f" - {c}" for c in context[:max_ctx])
    return (
        "You are a helpful assistant. Answer based on the context below.\n"
        f"Context:\n{ctx_block}\n"
        f"Question: {query}\nAnswer:"
    )
prompt = build_prompt(
    "What is RAG?",
    ["RAG retrieves relevant documents", "LLMs use retrieved context to answer"],
)
# ── Efficient string building (join not +) ────────────────
tokens = ["the", "model", "predicted", "positive"]
# BAD — O(n²): creates a new string object on every iteration
result = ""
for t in tokens: result += t + " "
# GOOD — O(n): one allocation, one pass
result = " ".join(tokens)
# ── Regex for feature extraction ──────────────────────────
text = "Order #12345 confirmed. Contact support@example.com" # placeholder address
order_ids = re.findall(r'#(\d+)', text) # ['12345']
emails = re.findall(r'[\w.]+@[\w.]+', text) # ['support@example.com']
for token in tokens: result += token builds a new string object on each iteration, which can degrade to O(n²) time. For a 10,000-token NLP preprocessing loop this is measurably slow.
Text from CSV or API responses often carries trailing newlines or spaces. if label == "positive" fails silently when the value is "positive\n", causing missed detections in evaluation pipelines.
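open("file.txt", "rb") returns bytes, not str. bytes has its own .lower() and .split(), so those calls succeed but return bytes, and the failure surfaces later when bytes meets str: data.split(",") raises TypeError, and b"positive" == "positive" is silently False. This typically happens when a file is opened in binary mode by habit or when the encoding is not specified. Decode with .decode("utf-8"), or open in text mode with an explicit encoding.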
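The last two pitfalls can be reproduced and fixed in a few lines. This is a sketch that assumes a UTF-8 label file, written here to a temporary path with made-up contents.

```python
import os
import tempfile

# Write a small file whose labels carry stray whitespace (hypothetical data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as fh:
    fh.write(b"positive\nNEGATIVE \n positive\n")
    path = fh.name

with open(path, "rb") as fh:
    raw = fh.read()                            # bytes
print(raw.split(b"\n")[0] == "positive")       # False: bytes never equal str

# Text mode with an explicit encoding yields str; strip and lowercase each label.
with open(path, encoding="utf-8") as fh:
    labels = [line.strip().lower() for line in fh if line.strip()]
os.remove(path)

print(labels)                                          # ['positive', 'negative', 'positive']
print(sum(label == "positive" for label in labels))    # 2
```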
Use f-strings or string concatenation for parameterised prompts. Write a dedicated function that takes context and query as typed arguments and returns the prompt string. Always strip and normalise user input before interpolating it into prompts to reduce prompt injection risk. For production systems, consider Jinja2 templates for versioning and validation. Store prompt templates separately from code so they can be updated without a redeploy.
"".join(list_of_parts) is the correct idiom — a single allocation for the final string. Concatenation with += in a loop is O(n²) because each step allocates a new string. For very large outputs, io.StringIO acts as a mutable buffer: write parts with sio.write(), retrieve with sio.getvalue(). In Pandas, use vectorised string operations (df["col"].str.lower()) instead of apply with a lambda for column-wise transformations.
Python 3 strings are Unicode by default. Use unicodedata.normalize("NFC", text) to canonicalise accented characters. To strip accents: normalize("NFD", text) then filter out combining characters (unicodedata.category(c) == "Mn"). Use re with the default Unicode-aware mode for Python 3. For tokenising non-Latin scripts, use dedicated tokenisers (spaCy, HuggingFace tokenizers) rather than split(), which only handles whitespace.
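The normalisation steps above can be sketched with the standard library alone. The helper name strip_accents is illustrative.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD), then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


composed = "caf\u00e9"        # e-acute as a single code point
decomposed = "cafe\u0301"     # e followed by a combining acute accent
print(composed == decomposed)                                  # False
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(strip_accents("Caf\u00e9 Z\u00fcrich"))                  # Cafe Zurich
```

Two strings that render identically can compare unequal until both are normalised to the same form, which matters for deduplication and vocabulary lookups.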
Lists (ordered sequences), dicts (hash maps, O(1) lookup), and sets (hash sets, O(1) membership) are the core preprocessing structures. Use dicts for label-to-integer mappings, use sets for deduplication and vocabulary membership, use lists for feature vectors and batch items. Choosing the wrong structure — a list where a set belongs — can silently turn an O(n) operation into O(n²) at scale.
from collections import defaultdict
# ── Label encoding with dict ─────────────────────────────
labels = ["cat", "dog", "cat", "bird", "dog", "cat"]
label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
id2label = {v: k for k, v in label2id.items()}
# label2id = {'bird': 0, 'cat': 1, 'dog': 2}
encoded = [label2id[l] for l in labels] # [1, 2, 1, 0, 2, 1]
decoded = [id2label[i] for i in encoded]
# Safe lookup for unknown labels at inference time
def encode_label(label: str, fallback: int = -1) -> int:
    return label2id.get(label, fallback)
encode_label("fish") # -1 — unknown class, no KeyError
# ── Deduplication with set / dict.fromkeys ────────────────
urls = ["http://a.com", "http://b.com", "http://a.com", "http://c.com"]
unique_unordered = set(urls) # removes duplicates, but order is lost
unique_urls = list(dict.fromkeys(urls)) # preserves insertion order (3.7+)
# ── O(1) vocabulary membership check ────────────────────
vocab = {"cat", "dog", "bird", "fish"} # set, not list!
word = "fish"
print(word in vocab) # O(1) ← hashed lookup
# AVOID: word in ["cat", "dog", "bird", "fish"] → O(n) linear scan
# ── List comprehension for feature extraction ─────────────
records = [
{"text": "great product", "stars": 5},
{"text": "terrible", "stars": 1},
{"text": "it's ok", "stars": 3},
]
texts = [r["text"] for r in records]
ratings = [r["stars"] for r in records]
is_pos = [r["stars"] >= 4 for r in records] # binary label
# ── defaultdict for group-by ──────────────────────────────
buckets = defaultdict(list)
for r in records:
    key = "positive" if r["stars"] >= 4 else "negative"
    buckets[key].append(r["text"])
# {"positive": ["great product"], "negative": ["terrible", "it's ok"]}
if token in stop_words_list is O(n) per lookup. For a 50,000-word vocabulary and 1M tokens, that is up to 50 billion comparisons. The code looks correct but is catastrophically slow.
label2id[unknown_label] raises KeyError if the model encounters a category not seen during training. In a production serving endpoint this crashes the request.
for i, item in enumerate(data): if bad(item): data.pop(i) — skips items because indices shift as elements are removed, silently letting bad data through.
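A small reproduction of the pop-while-iterating bug next to the usual fix. The is_bad rule is a hypothetical stand-in for a real data quality check.

```python
def is_bad(x: int) -> bool:
    """Hypothetical quality rule: negative values are invalid."""
    return x < 0


# Buggy: popping while enumerating shifts later items past the loop index.
buggy = [1, -1, -2, 3]
for i, item in enumerate(buggy):
    if is_bad(item):
        buggy.pop(i)
print(buggy)     # [1, -2, 3]: the second bad value was skipped

# Safe: build a new list that keeps only the good items.
data = [1, -1, -2, 3]
cleaned = [item for item in data if not is_bad(item)]
print(cleaned)   # [1, 3]
```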
Build a dict from sorted unique values: label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}. Encode with a comprehension: [label2id[l] for l in data]. Sorted order makes encoding reproducible across runs. For unknown labels at inference, use .get(label, -1). For production, serialise the mapping dict to JSON so encoding is consistent between training and serving. LabelEncoder from scikit-learn builds the same sorted mapping automatically; it is intended for target labels (y) and raises ValueError on labels it has not seen.
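One way to persist the mapping is sketched below. The file name label2id.json and the temporary directory are placeholders for wherever your training artefacts live.

```python
import json
import os
import tempfile

labels = ["cat", "dog", "cat", "bird"]
label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Training side: persist the mapping (the file name is a placeholder).
path = os.path.join(tempfile.mkdtemp(), "label2id.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(label2id, fh, sort_keys=True)

# Serving side: reload and rebuild the inverse mapping.
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)
id2label = {idx: label for label, idx in loaded.items()}

print(loaded == label2id)        # True
print(loaded.get("fish", -1))    # -1 for an unseen label
print(id2label[1])               # cat
```

JSON object keys are always strings, so it is simpler to store label → id and rebuild id → label after loading than to store integer keys directly.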
list: O(n) — linear scan through all elements. set: O(1) average — hash lookup. dict key check: O(1) average — hash lookup. This matters for vocabulary lookups, stop-word filtering, and deduplication over large datasets. Rule: if you test membership more than once, convert to set first. The O(n) conversion is paid once; every subsequent check is O(1).
defaultdict(list) is ideal for group-by accumulation — no KeyError if the key is new, just creates an empty list. defaultdict(int) builds counters. defaultdict(set) builds inverted indexes (word → document IDs). Use a regular dict when keys are known in advance and a missing key should raise (an error signals a bug). Use .get() with a default when absence is expected but you should not auto-create keys.
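The counter and inverted-index patterns can be sketched together over a toy corpus. The documents are made up for illustration.

```python
from collections import defaultdict

docs = {0: "the cat sat", 1: "the dog sat", 2: "a cat ran"}   # hypothetical corpus

counts = defaultdict(int)    # word -> frequency
index = defaultdict(set)     # word -> IDs of documents containing it
for doc_id, text in docs.items():
    for word in text.split():
        counts[word] += 1
        index[word].add(doc_id)

print(counts["cat"])            # 2
print(sorted(index["sat"]))     # [0, 1]

# Reading a missing key with [] would create it; .get() does not.
print(index.get("bird"))        # None
print("bird" in index)          # False
```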
None is Python's null singleton. NaN is IEEE 754 "not a number", a float value. In Pandas both represent missing data but behave differently. The critical trap: np.nan == np.nan is always False by IEEE 754 design. Always use pd.isna() to check for missing values, never ==. Before feeding data to a model, validate and handle missing values explicitly: most scikit-learn estimators raise ValueError on NaN inputs.
import numpy as np, pandas as pd
# ── None vs NaN comparison ────────────────────────────────
print(None == None) # True
print(np.nan == np.nan) # False ← IEEE 754: NaN != NaN always
print(np.nan is np.nan) # True ← same object both times; identity is not a NaN test
value = np.nan
print(value == np.nan) # False — always False, even when it IS NaN!
print(pd.isna(value)) # True ✓ use pd.isna()
# pd.isna handles both None and NaN uniformly
print(pd.isna(None)) # True
print(pd.isna(np.nan)) # True
print(pd.isna(0)) # False
print(pd.isna("")) # False — empty string is NOT missing!
# ── Handling missing values in a DataFrame ────────────────
df = pd.DataFrame({
"age": [25.0, None, 30.0, np.nan],
"income": [50000, 60000, None, 80000],
"label": [1, 0, 1, 0],
})
print(df.isna().sum()) # missing count per column
# Strategy 1: drop rows with any missing value
df_clean = df.dropna()
# Strategy 2: fill with statistics
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())
# ── Validate before model.predict() ──────────────────────
def validate_features(X: pd.DataFrame) -> None:
    missing = X.isna().any()
    if missing.any():
        cols = missing[missing].index.tolist()
        raise ValueError(f"Missing values in columns: {cols}")
    if not np.isfinite(X.to_numpy(dtype=float)).all():
        raise ValueError("Non-finite values (inf / -inf) detected")
validate_features(df[["age", "income"]]) # must pass before predict()
x == np.nan is always False, even when x is NaN. This is IEEE 754 behaviour. np.nan is np.nan evaluates to True only because both sides name the same object. A NaN that comes from a computation or a loaded file is generally a different object, so an is check fails as a missing-value test too.
pd.isna("") returns False — an empty string is not considered missing by Pandas. Text columns loaded from CSV may use "" as the missing marker, silently passing your isna() validation.
In NumPy, floating-point division by zero (np.float64(1.0) / 0.0) returns inf and emits only a RuntimeWarning; plain Python floats raise ZeroDivisionError instead. Inf values propagate silently through arithmetic and can turn into NaN in softmax or log operations, so training fails with an opaque loss=nan.
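Two ways to keep inf from slipping through are sketched below: make NumPy raise while debugging, or divide only where the denominator is non-zero. The fill value of 0.0 is an assumption; pick whatever is meaningful for your feature.

```python
import numpy as np

num = np.array([1.0, 2.0, 0.0])
den = np.array([0.0, 4.0, 0.0])

# Default behaviour: at most a RuntimeWarning, with inf / nan in the result.
with np.errstate(divide="ignore", invalid="ignore"):
    ratios = num / den                 # inf, 0.5, nan
print(np.isfinite(ratios).all())       # False

# Option 1: turn the silent cases into exceptions while debugging.
raised = False
try:
    with np.errstate(divide="raise", invalid="raise"):
        num / den
except FloatingPointError as exc:
    raised = True
    print("caught:", exc)

# Option 2: divide only where the denominator is non-zero; keep 0.0 elsewhere.
safe = np.zeros_like(num)
np.divide(num, den, out=safe, where=den != 0)
print(safe.tolist())                   # [0.0, 0.5, 0.0]
```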
None is Python's null object — appears in object-dtype columns (strings, mixed types). NaN is IEEE 754 "not a number" — appears in float64 columns. Pandas uses NaN for numeric missing because None cannot be stored in a NumPy float array. When you assign None to a float column, Pandas silently converts it to NaN. pd.isna() detects both, making it the universal safe check. Pandas 1.0+ introduces pd.NA for nullable integer and boolean extensions.
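For plain Python values outside a DataFrame, a check in the spirit of pd.isna() can be sketched as follows. The helper is_missing is hypothetical, and treating blank strings as missing is a policy assumption that goes beyond what pd.isna() does.

```python
import math


def is_missing(value: object) -> bool:
    """True for None, float NaN, and blank strings (the last is a policy choice)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


column = ["positive", "", None, float("nan"), "  ", 0, "negative"]
print([is_missing(v) for v in column])
# [False, True, True, True, True, False, False]
```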
Most scikit-learn estimators raise ValueError on NaN by default. Strategy: (1) dropna() if missing is rare and random; (2) fillna(median/mean) for numeric features; (3) use SimpleImputer inside a Pipeline so imputation statistics are fit on training data only and applied consistently at inference, which prevents data leakage. For tree models like XGBoost and LightGBM that handle NaN natively, pass NaN directly and let the model learn missingness as a split criterion.
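The leakage point is easiest to see in a hand-rolled version. The sketch below is a plain-Python stand-in for what SimpleImputer does inside a Pipeline, not scikit-learn's implementation; fit_median and impute are illustrative names.

```python
import math
import statistics


def fit_median(values: list[float]) -> float:
    """Median of the non-missing training values."""
    return statistics.median(v for v in values if not math.isnan(v))


def impute(values: list[float], fill: float) -> list[float]:
    """Replace NaN with a statistic that was fit elsewhere."""
    return [fill if math.isnan(v) else v for v in values]


nan = float("nan")
train_age = [25.0, nan, 30.0, 41.0]
test_age = [nan, 52.0]

median_age = fit_median(train_age)       # learned from training rows only
print(median_age)                        # 30.0
print(impute(train_age, median_age))     # [25.0, 30.0, 30.0, 41.0]
print(impute(test_age, median_age))      # [30.0, 52.0]
```

The test rows never influence median_age, which is the property a Pipeline enforces automatically during cross-validation.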
Under IEEE 754, NaN compares unequal to every value, including itself: two undefined results (0/0, sqrt(-1)) are not necessarily the same undefined result. Python floats follow IEEE 754, so float("nan") != float("nan") is True. The correct checks are math.isnan(), np.isnan(), or pd.isna(). NaN is hashable, so it can technically be stored as a dict key or set member, but it is unreliable there: because equality always fails, a lookup succeeds only for the identical object, and separate NaN objects become separate keys.
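The equality and hashing behaviour of NaN can be checked directly with the standard library. This is a small demonstration using Python floats only.

```python
import math

a = float("nan")
b = float("nan")

print(a == a)            # False: NaN is not equal even to itself
print(a != b)            # True
print(math.isnan(a))     # True, the reliable check

# A NaN float can be placed in a set, but lookups only succeed by object identity.
seen = {a}
print(a in seen)         # True: same object, found via the identity shortcut
print(b in seen)         # False: a different NaN object
print(len({a, b}))       # 2: each NaN object is a separate member
```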
Most ML bugs are not algorithmic. They are float precision in a metric threshold, None vs NaN confusion in a missing-value check, or a list membership test that silently runs in O(n) for 1M tokens. These basics prevent a week of debugging.