Data pre-processing
Skills block. Session 7
Real-world datasets often require pre-processing, i.e. cleaning the data (handling missing values, outliers, and inconsistencies), transforming variables, normalizing data, and addressing any other issues that might affect analysis or modelling. This article describes the process I’ve come across in my own work.
1. Standardisation
Convert the documents to a unified format. First, check the formats of the original files:
- HTML, RST, JSON, CSV, Parquet, Avro, etc.
Check whether the documents contain tables, lists, or grids. Depending on the complexity, you can either parse each document with Beautiful Soup or convert them all to Markdown using markdownify (see the sketch after the list below). Why Markdown?
- cleaner than HTML
- still contains anchors
- standardised
An alternative is the open-source library LangChain (less control).
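A minimal sketch of the conversion step, assuming the sources are HTML files. The file path and the set of tags stripped here are my own illustrative choices, not from the original pipeline:

```python
from bs4 import BeautifulSoup
from markdownify import markdownify as md

# Hypothetical input file; substitute your own corpus.
with open("docs/page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Drop obvious structural noise before converting.
for tag in soup.find_all(["nav", "script", "style"]):
    tag.decompose()

# Convert the remaining HTML to Markdown; anchors survive as [text](url).
markdown = md(str(soup), heading_style="ATX")
```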
2. Pre-processing the documents
2.1. Cleaning
I.e. removing unnecessary elements, including the following (see the sketch after this list):
- Headers and footers
- Table row and column scaffolding, e.g. the `|` characters in `|select()| select_by()|`
- Extra newlines
- Links
- Images
- Unicode characters
- Bolding, i.e. `**text**` → `text`
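Headers and footers are easiest to drop back at the HTML stage (e.g. by decomposing the `<header>` and `<footer>` tags in Beautiful Soup). For the Markdown-level items, a regex sketch along these lines works; the patterns are my assumptions about a typical corpus, so tune them to yours:

```python
import re

def clean_markdown(text: str) -> str:
    """Strip boilerplate from a Markdown document (illustrative patterns)."""
    # Images first, so the link rule below does not turn ![alt](url) into !alt.
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)
    # Links: [text](url) -> keep only the visible text.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Bolding: **text** -> text.
    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)
    # Table scaffolding: drop separator rows like |---|---| ...
    text = re.sub(r"^\|[-| :]+\|\s*$", "", text, flags=re.MULTILINE)
    # ... and replace the remaining pipes: |select()| select_by()| -> select() select_by()
    text = re.sub(r"[ \t]*\|[ \t]*", " ", text)
    # Unicode characters: keep ASCII only (skip this if you need accented text).
    text = text.encode("ascii", "ignore").decode()
    # Extra newlines: collapse three or more into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

cleaned = clean_markdown(markdown)  # `markdown` from the conversion sketch above
```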