Data pre-processing
Skills block. Session 7
Real-world datasets often require pre-processing, i.e. cleaning the data (handling missing values, outliers, and inconsistencies), transforming variables, normalizing data, and addressing any other issues that might affect analysis or modelling. This article describes the process I’ve come across in my own work.
1. Standardisation
Convert the documents to a unified format. First, check the formats of the original files:
- HTML, RST, JSON, CSV, Parquet, Avro, etc.
Check whether the documents contain tables, lists, or grids. Depending on the complexity, you can either parse each document with Beautiful Soup or convert them all to Markdown using markdownify (see the sketch after the list below). Why Markdown?
- cleaner than HTML
- still contains anchors
- standardised
An alternative is the open-source library LangChain (less control).
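A minimal sketch of the conversion step, assuming the sources are HTML files. The file path and the set of tags stripped here are my own illustrative choices, not from the original pipeline:

```python
from bs4 import BeautifulSoup
from markdownify import markdownify as md

# Hypothetical input file; substitute your own corpus.
with open("docs/page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Drop obvious structural noise before converting.
for tag in soup.find_all(["nav", "script", "style"]):
    tag.decompose()

# Convert the remaining HTML to Markdown; anchors survive as [text](url).
markdown = md(str(soup), heading_style="ATX")
```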
2. Pre-processing the documents
2.1. Cleaning
I.e. removing unnecessary elements, including the following (see the sketch after this list):
- Headers and footers
- Table row and column scaffolding, e.g. the `|` characters in `|select()| select_by()|`
- Extra newlines
- Links
- Images
- Unicode characters
- Bolding, i.e. `**text**` → `text`
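Headers and footers are easiest to drop back at the HTML stage (e.g. by decomposing the `<header>` and `<footer>` tags in Beautiful Soup). For the Markdown-level items, a regex sketch along these lines works; the patterns are my assumptions about a typical corpus, so tune them to yours:

```python
import re

def clean_markdown(text: str) -> str:
    """Strip boilerplate from a Markdown document (illustrative patterns)."""
    # Images first, so the link rule below does not turn ![alt](url) into !alt.
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    # Links: [text](url) -> keep only the visible text.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Bolding: **text** -> text.
    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)
    # Table scaffolding: drop separator rows like |---|---| ...
    text = re.sub(r"^\|[-| :]+\|\s*$", "", text, flags=re.MULTILINE)
    # ... and replace the remaining pipes: |select()| select_by()| -> select() select_by()
    text = re.sub(r"[ \t]*\|[ \t]*", " ", text)
    # Unicode characters: keep ASCII only (skip this if you need accented text).
    text = text.encode("ascii", "ignore").decode()
    # Extra newlines: collapse three or more into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

cleaned = clean_markdown(markdown)  # `markdown` from the conversion sketch above
```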