Strings and Text Processing

Strings appear early in Python because the first program usually prints text. The textbook introduces strings with quotes, indexing, slicing, len(), case conversion, replacement, splitting, and input. Those operations are enough to build many useful scripts: clean a line of data, parse a simple record, format a report, or ask a user for information.

Figure: Python provides the practical environment for many CS, ML, and data examples. Image: Wikimedia Commons, Python Software Foundation, GPL-compatible free license; trademark terms apply.

Python strings are more than arrays of characters. They are immutable Unicode text objects with a rich method library. Once you understand immutability, indexing, slicing, formatting, and the difference between text and bytes, string processing becomes predictable. The goal is not to memorize every method; it is to recognize the small set of transformations that show up repeatedly.

Definitions

A string is an instance of str, written with single quotes, double quotes, or triple quotes:

name = "Python"
same = 'Python'
message = """multiple
lines"""

Python strings are immutable. Methods such as .lower() and .replace() return new strings; they do not modify the original object.
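A minimal sketch of what immutability means in practice. The variable names are illustrative only; the point is that a method call produces a new object and that item assignment on a str is rejected.

```python
text = "Python"

# A method call returns a new string; the original is untouched.
lowered = text.lower()
print(lowered)  # python
print(text)     # Python

# Item assignment is not supported on str objects.
try:
    text[0] = "J"
except TypeError as exc:
    print("TypeError:", exc)

# To "change" a string, build a new one and rebind the name.
text = "J" + text[1:]
print(text)  # Jython
```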

An index selects one character by position. Python uses zero-based indexing:

word = "Python"
word[0]  # "P"
word[1]  # "y"
word[-1] # "n"

A slice selects a range:

word[1:4]  # "yth"
word[:2]   # "Py"
word[2:]   # "thon"

The stop index is excluded. This rule makes slice lengths easy: word[a:b] has length b - a when 0 <= a <= b <= len(word).
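The length rule can be checked directly. The sketch below also shows what happens when the bounds fall outside the string: slices are clamped, while a single out-of-range index raises an error.

```python
word = "Python"

a, b = 1, 4
piece = word[a:b]
print(piece, len(piece) == b - a)  # yth True

# Two slices that share a boundary reassemble the original string.
print(word[:2] + word[2:] == word)  # True

# Out-of-range slice bounds are clamped instead of raising an error,
# so the b - a rule does not apply here.
clamped = word[2:100]
print(clamped, len(clamped))  # thon 4

# An out-of-range index, by contrast, raises IndexError.
try:
    word[100]
except IndexError:
    print("index out of range")
```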

An escape sequence represents a character that is hard to type directly. Common examples are \n for newline and \t for tab. A raw string, prefixed with r, leaves backslashes as ordinary characters instead of interpreting them as the start of an escape sequence. Raw strings are often used for regular expressions and Windows paths:

pattern = r"\d+"
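A short comparison of a normal string and a raw string may help. The sample text and the Windows-style path below are made up for illustration.

```python
import re

normal = "a\tb"   # backslash-t is interpreted as a single tab character
raw = r"a\tb"     # backslash and t are kept as two separate characters

print(len(normal))  # 3
print(len(raw))     # 4

# In a raw string, \d reaches the re module unchanged.
pattern = r"\d+"
numbers = re.findall(pattern, "room 12, floor 3")
print(numbers)  # ['12', '3']

# Hypothetical path: without the r prefix, \n and \t here would
# become a newline and a tab.
path = r"C:\new\table.txt"
print(len(path))  # 16
```

One limitation worth knowing: a raw string literal cannot end with a single backslash, so a trailing path separator has to be added another way.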

An f-string is a formatted string literal:

temperature = 21.456
print(f"{temperature:.1f} C")

The expression inside braces is evaluated, and format specifiers control presentation.
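A few common format specifiers, as a sketch. The variables count and ratio are not from the text above; they are added only to show grouping and percent formatting.

```python
temperature = 21.456
count = 1234567   # illustrative value
ratio = 0.256     # illustrative value

one_decimal = f"{temperature:.1f} C"   # fixed-point, one decimal place
doubled = f"{temperature * 2:.2f}"     # any expression may appear in the braces
grouped = f"{count:,}"                 # thousands separator
percent = f"{ratio:.1%}"               # multiply by 100 and append %
padded = f"{'id':>6}|"                 # right-align text in a field of width 6

print(one_decimal)  # 21.5 C
print(doubled)      # 42.91
print(grouped)      # 1,234,567
print(percent)      # 25.6%
print(padded)       #     id|
```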

Key results

The first key result is that string operations return new values. If you write:

text = " Hello "
text.strip()

text is still " Hello ". You need:

text = text.strip()

The second result is that splitting and joining are inverse operations when they use the same separator. .split(sep) turns one string into a list of pieces. sep.join(parts) turns a list of strings back into one string:

fields = "a,b,c".split(",")
line = ",".join(fields)

This pair is central to simple file processing.
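The round trip, and a few cases where it behaves differently from what a beginner might expect, can be sketched as follows. The sample strings are illustrative.

```python
line = "a,b,c"
fields = line.split(",")
print(fields)                    # ['a', 'b', 'c']
print(",".join(fields) == line)  # True

# With no argument, split() treats any run of whitespace as one separator,
# so split-then-join normalizes spacing instead of reproducing the input.
messy = "a   b \t c"
normalized = " ".join(messy.split())
print(normalized)  # a b c

# With an explicit separator, empty fields are preserved.
with_gap = "a,,c".split(",")
print(with_gap)  # ['a', '', 'c']

# join() accepts only strings, so convert other values first.
numbers = [1, 2, 3]
joined = ",".join(str(n) for n in numbers)
print(joined)  # 1,2,3
```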

The third result is that formatting should be separated from calculation. Keep numbers as numbers while computing, then format them at the output boundary. A value such as "3.14" is text; it cannot be used safely in numeric formulas until converted.
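A small sketch of the difference between text that looks like a number and an actual number. The circle-area calculation is only an illustrative use of the converted value.

```python
import math

raw_value = "3.14"   # text, for example from input() or a file

# Multiplying a string repeats the text; it does not do arithmetic.
print(raw_value * 2)  # 3.143.14

# Convert once, compute with numbers, and format only for output.
radius = float(raw_value)
area = math.pi * radius ** 2
report = f"area = {area:.2f}"
print(report)  # area = 30.97

# The numeric value keeps its full precision after formatting.
print(area > 30.97)  # True
```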

The fourth result is that text matching has levels. Use in, .startswith(), and .endswith() for simple checks. Use .split() for simple delimiters. Use the csv module for comma-separated data with quoting rules. Use re only when a real pattern language is needed.
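The levels can be compared side by side. The file name and the quoted record below are made-up sample data; the quoted record shows where manual splitting stops being adequate.

```python
import csv
import re

filename = "data_2024.csv"   # illustrative file name

# Level 1: simple membership, prefix, and suffix checks.
print("2024" in filename)            # True
print(filename.startswith("data"))   # True
print(filename.endswith(".csv"))     # True

# Level 2: a simple delimiter with no quoting.
print("a,b,c".split(","))  # ['a', 'b', 'c']

# Level 3: quoted fields need the csv module.
quoted = 'Oslo,"4,25",OK'
naive = quoted.split(",")
parsed = next(csv.reader([quoted]))
print(naive)   # ['Oslo', '"4', '25"', 'OK']
print(parsed)  # ['Oslo', '4,25', 'OK']

# Level 4: a real pattern, here "four digits in a row".
years = re.findall(r"\d{4}", filename)
print(years)  # ['2024']
```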

The fifth result is that str and bytes are different types. Reading from a file opened in text mode returns str. Reading from a file opened in binary mode returns bytes. An encoding, usually UTF-8, is the mapping between them. Because the default encoding can depend on the platform, most beginner scripts should open text files with an explicit encoding:

open("data.txt", encoding="utf-8")
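The relationship between str, bytes, and an encoding can be sketched with a short word that contains one non-ASCII character. The temporary file name is hypothetical, and the file is created in a temporary directory so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

text = "café"
data = text.encode("utf-8")

print(len(text))  # 4 characters
print(len(data))  # 5 bytes, because "é" takes two bytes in UTF-8
print(data.decode("utf-8") == text)  # True

# Decoding with the wrong encoding can produce garbled text instead of an error.
garbled = data.decode("latin-1")
print(garbled)  # cafÃ©

# The same file read in text mode and in binary mode.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "data.txt"   # hypothetical file name
    path.write_text(text, encoding="utf-8")
    with open(path, encoding="utf-8") as handle:
        as_text = handle.read()
    with open(path, "rb") as handle:
        as_bytes = handle.read()

print(type(as_text).__name__, type(as_bytes).__name__)  # str bytes
```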

A sixth result is that text processing should state its assumptions. If a line is expected to have exactly three comma-separated fields, check that it does. If a label is expected to be uppercase, either normalize it with .upper() or reject invalid input. Silent string cleanup can be useful for user convenience, but it can also hide bad data. Decide whether the program is accepting flexible human input or enforcing a machine-readable format.
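One way to make the choice explicit is to write two parsers with different contracts. The record layout (label, reading, unit) and both function names are hypothetical, chosen only to illustrate strict versus lenient handling.

```python
def parse_record_strict(line):
    """Enforce a machine-readable format: exactly three fields, uppercase label."""
    fields = line.split(",")
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    label, reading, unit = fields
    if not label.isupper():
        raise ValueError(f"label must be uppercase: {label!r}")
    return label, float(reading), unit


def parse_record_lenient(line):
    """Accept flexible human input: trim whitespace and normalize the label."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    label, reading, unit = fields
    return label.upper(), float(reading), unit


print(parse_record_strict("T1,22.5,C"))        # ('T1', 22.5, 'C')
print(parse_record_lenient(" t1 , 22.5 , C"))  # ('T1', 22.5, 'C')

try:
    parse_record_strict("t1,22.5,C")
except ValueError as exc:
    print("rejected:", exc)
```

Both functions still reject a line with the wrong number of fields; they differ only in how much cleanup they are willing to do silently.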

A seventh result is that formatting is a separate concern from storage. A temperature may be stored as the float 21.4567, displayed as "21.5 C", and written to JSON as 21.4567. If the program stores the displayed text, later calculations must parse it again and may lose precision. Keep raw values in appropriate types, then format at the boundary where a human reads the output.
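The temperature example can be sketched with the standard json module. The key name temperature_c is an assumption made for the illustration.

```python
import json

temperature = 21.4567

# Display form: rounded text for a human reader.
display = f"{temperature:.1f} C"
print(display)  # 21.5 C

# Storage form: the raw float, written as a JSON number.
stored = json.dumps({"temperature_c": temperature})
print(stored)  # {"temperature_c": 21.4567}

# Reading the stored value back recovers the original float.
restored = json.loads(stored)["temperature_c"]
print(restored == temperature)  # True

# Parsing the display text instead recovers only the rounded value.
reparsed = float(display.split()[0])
print(reparsed)  # 21.5
```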

Finally, regular expressions should have names when they encode a rule. A line such as if re.match(r"^[A-Z]\d{3}$", code): is acceptable in a short script, but a named compiled pattern such as SAMPLE_CODE_RE tells future readers that the expression represents a domain rule. When a pattern becomes difficult to read, use verbose mode or split the problem into simpler string operations.
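A sketch of the named, compiled form of that rule, written in verbose mode. The helper is_sample_code is a hypothetical name. This version uses fullmatch instead of the ^ and $ anchors, which is a small change in behavior: $ also matches just before a trailing newline, while fullmatch requires the whole string to match.

```python
import re

# Domain rule: a sample code is one uppercase letter followed by three digits.
SAMPLE_CODE_RE = re.compile(
    r"""
    [A-Z]    # one uppercase ASCII letter
    \d{3}    # exactly three digits
    """,
    re.VERBOSE | re.ASCII,   # re.ASCII restricts \d to the digits 0-9
)


def is_sample_code(code):
    """Return True if the whole string is a valid sample code."""
    return SAMPLE_CODE_RE.fullmatch(code) is not None


print(is_sample_code("A123"))    # True
print(is_sample_code("a123"))    # False
print(is_sample_code("A1234"))   # False
print(is_sample_code("A123\n"))  # False

# The anchored pattern from the text accepts a trailing newline.
print(bool(re.match(r"^[A-Z]\d{3}$", "A123\n")))  # True
```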

Visual

String:  P  y  t  h  o  n
Index:   0  1  2  3  4  5
Neg:    -6 -5 -4 -3 -2 -1

word[1:4] includes indexes 1, 2, 3:

P [y  t  h] o  n  -> "yth"

| Method | Purpose | Example | Result |
| --- | --- | --- | --- |
| `.strip()` | Remove surrounding whitespace | `" hi \n".strip()` | `"hi"` |
| `.lower()` | Normalize case | `"Py".lower()` | `"py"` |
| `.replace(a, b)` | Replace substrings | `"2020-01".replace("-", "/")` | `"2020/01"` |
| `.split(sep)` | Break into list | `"a,b".split(",")` | `["a", "b"]` |
| `sep.join(parts)` | Combine strings | `"-".join(["a", "b"])` | `"a-b"` |
| `.find(sub)` | Return index or `-1` | `"abc".find("b")` | `1` |
| `.startswith(prefix)` | Prefix test | `"data.csv".startswith("data")` | `True` |

Worked example 1: parse a simple sensor line

Problem: parse the line "time=10,temp=22.5,status=OK" into useful values.

Method:

Split the line on commas to get fields.
Split each field on the first equals sign.
Store keys and values in a dictionary.
Convert numeric fields after parsing.

Work:

line = "time=10,temp=22.5,status=OK"
fields = line.split(",")
record = {}

for field in fields:
    key, value = field.split("=", 1)
    record[key] = value

time_s = int(record["time"])
temp_c = float(record["temp"])
status = record["status"]

Step-by-step:

line.split(",") gives ["time=10", "temp=22.5", "status=OK"].
"time=10".split("=", 1) gives ["time", "10"], so store "time": "10".
"temp=22.5" becomes "temp": "22.5".
"status=OK" becomes "status": "OK".
Convert "10" to integer 10.
Convert "22.5" to float 22.5.

Checked answer:

time_s == 10
temp_c == 22.5
status == "OK"

This is a simple format. If the format allowed quoted commas, use the csv module instead of manual splitting.
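The parsing loop in this example can also be written more compactly, because dict() accepts an iterable of two-item sequences. The converters mapping is a hypothetical addition that keeps the type conversions in one place; it is a sketch, not the only reasonable design.

```python
line = "time=10,temp=22.5,status=OK"

# Each field splits into a [key, value] pair; dict() consumes the pairs.
record = dict(field.split("=", 1) for field in line.split(","))
print(record)  # {'time': '10', 'temp': '22.5', 'status': 'OK'}

# Hypothetical schema: which converter to apply to each key.
converters = {"time": int, "temp": float, "status": str}
typed = {key: converters[key](value) for key, value in record.items()}
print(typed)  # {'time': 10, 'temp': 22.5, 'status': 'OK'}

# A field without "=" yields a one-item list, which dict() rejects.
try:
    dict(field.split("=", 1) for field in "time10".split(","))
except ValueError:
    print("malformed field")
```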

Worked example 2: format a report line

Problem: print a table of temperatures with aligned names and one decimal place.

Data:

rows = [("Oslo", 4.25), ("Seoul", 21.8), ("Cairo", 30.125)]

Method:

Choose column widths.
Use an f-string for alignment and numeric precision.
Keep the original values numeric.

Work:

for city, temp in rows:
    print(f"{city:<10} {temp:>6.1f} C")

Step-by-step:

{city:<10} means left-align the city in a field of width 10.
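{temp:>6.1f} means right-align the number in width 6 with one digit after the decimal point.
"Oslo" becomes "Oslo      ": the four letters followed by six spaces of padding.
4.25 displays as 4.2, not 4.3. The value is stored exactly and sits exactly halfway between 4.2 and 4.3, and Python's float formatting resolves such ties toward the even digit.
30.125 displays as 30.1 because it is closer to 30.1 than to 30.2; no tie is involved.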

Checked output:

Oslo          4.2 C
Seoul        21.8 C
Cairo        30.1 C

The numbers are still stored as floats in rows; only the printed representation is rounded.
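The rounding seen in the output can be checked in isolation. The sketch below uses only values that binary floating point stores exactly, and it shows the standard decimal module as one option when half-up rounding is required; whether that is the right choice depends on the application.

```python
from decimal import Decimal, ROUND_HALF_UP

# Exact ties round toward the even digit.
low_tie = f"{4.25:.1f}"
high_tie = f"{4.75:.1f}"
print(low_tie)   # 4.2
print(high_tie)  # 4.8

# Not a tie: 30.125 is closer to 30.1 than to 30.2.
not_tie = f"{30.125:.1f}"
print(not_tie)   # 30.1

# Half-up rounding on a decimal value, built from text to avoid float error.
half_up = Decimal("4.25").quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
print(half_up)   # 4.3
```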

Code

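def normalize_name(text):
    """Collapse runs of whitespace and capitalize each word."""
    # split() with no argument drops leading, trailing, and repeated whitespace.
    parts = text.strip().split()
    return " ".join(part.capitalize() for part in parts)

def parse_key_value_line(line):
    """Parse "key=value,key=value" text into a dict of strings."""
    result = {}
    for field in line.strip().split(","):
        if not field:
            continue  # skip empty fields, such as one left by a trailing comma
        # Split on the first "=" only, so a value may itself contain "=".
        key, value = field.split("=", 1)
        result[key.strip().lower()] = value.strip()
    return result

name = normalize_name("  ada   lovelace ")
record = parse_key_value_line("time=10, temp=22.5, status=OK")

print(name)    # Ada Lovelace
print(record)  # {'time': '10', 'temp': '22.5', 'status': 'OK'}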

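The code demonstrates whitespace normalization, splitting, joining, generator expressions, case normalization, and skipping of empty fields. A non-empty field without an equals sign raises ValueError at the unpacking step, so malformed input fails loudly instead of being dropped silently.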

Treat this style of parser as appropriate for controlled, simple formats. It is a good fit when the input is produced by your own script or by a small exercise. It is not a full replacement for csv, json, or a formal parser when input can contain quoting, escaping, nested structures, or user-controlled edge cases. A useful test is to write down one malformed input and decide whether the parser should reject it or clean it. If that decision matters, encode it as a test.

For human names, capitalization rules are culturally complex; the function here is only a programming example, not a universal name normalizer.

For text-processing exercises, keep a short list of sample inputs beside the parser: a normal case, an empty case, a case with extra whitespace, and one invalid case. These examples document the contract better than prose alone and can later become tests.
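Such a list might look like the sketch below. The parser is repeated from the Code section so the block runs on its own; the sample inputs themselves are illustrative choices, not a complete test suite.

```python
def parse_key_value_line(line):
    # Same parser as in the Code section, repeated for a self-contained example.
    result = {}
    for field in line.strip().split(","):
        if not field:
            continue
        key, value = field.split("=", 1)
        result[key.strip().lower()] = value.strip()
    return result


# (input, expected) pairs that document the contract.
cases = [
    ("time=10,temp=22.5", {"time": "10", "temp": "22.5"}),         # normal case
    ("", {}),                                                      # empty case
    (" Time = 10 , temp=22.5 ,", {"time": "10", "temp": "22.5"}),  # extra whitespace
]

for line, expected in cases:
    assert parse_key_value_line(line) == expected, line

# Invalid case: a field without "=" should be rejected, not cleaned.
try:
    parse_key_value_line("time10")
except ValueError:
    print("invalid input rejected")
else:
    raise AssertionError("expected ValueError")

print("all sample inputs behave as documented")
```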

Common pitfalls

Expecting string methods to modify the original string. Assign the returned value.
Forgetting that indexes start at zero and that slice stop positions are excluded.
Using manual string splitting for full CSV, JSON, or XML formats. Use the standard library parsers.
Converting numbers to strings too early, then needing arithmetic later.
Building long output with repeated + concatenation in a loop. Accumulate pieces and use "".join(parts) or write directly to a file.
Forgetting explicit encodings when reading and writing text files across machines.
Overusing regular expressions for simple prefix, suffix, or delimiter problems.
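The concatenation pitfall has a simple alternative: collect the pieces in a list and join once. The sketch reuses the rows from worked example 2.

```python
rows = [("Oslo", 4.25), ("Seoul", 21.8), ("Cairo", 30.125)]

# Accumulate formatted lines in a list instead of growing a string with +.
parts = []
for city, temp in rows:
    parts.append(f"{city:<10} {temp:>6.1f} C")

# Join once at the end; the separator goes between lines, not after the last one.
report = "\n".join(parts)
print(report)
```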
