Data pre-processing

Ayub Yanturin
4 min readJun 11, 2023

Skills block. Session 7

Real-world datasets often require pre-processing: cleaning the data (handling missing values, outliers, and inconsistencies), transforming variables, normalising data, and addressing any other issues that might affect analysis or modelling. This article describes the process I've come across.

1. Standardisation

Convert the documents to a unified format. First, check what formats the original files come in:
- HTML, RST, JSON, CSV, Parquet, Avro, etc.

Check whether the documents contain tables, lists, or grids. Depending on the complexity, you can either parse each document with Beautiful Soup or convert them all to Markdown using markdownify. Why Markdown?
- cleaner than HTML
- still contains anchors
- standardised
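The conversion step can be sketched without any third-party dependencies. In practice markdownify does this far more robustly; this toy converter (class and function names are mine, not from the article) just illustrates why Markdown is a convenient target, preserving headings and anchors while dropping tag noise:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter illustrating the idea behind markdownify."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._href = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")  # heading level from tag name
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.out.append("[")
            self._href = dict(attrs).get("href", "")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p", "ul"):
            self.out.append("\n")
        elif tag == "a":
            self.out.append(f"]({self._href})")  # keep the anchor as a Markdown link

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

print(to_markdown('<h2>API</h2><ul><li><a href="/select">select()</a></li></ul>'))
```

With markdownify itself, the equivalent one-liner is `markdownify(html)`; the library handles nested tables, images, and escaping, which is exactly where a hand-rolled parser like this one breaks down.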

An alternative is the open-source library LangChain and its document loaders (less control).

2. Pre-processing the documents

2.1. Cleaning

i.e. removing unnecessary elements, including:
- Headers and footers
- Table row and column scaffolding, e.g. the |'s in |select()|select_by()|
- Extra newlines
- Links
- Images
- Stray Unicode characters
- Bolding, i.e. **text** becomes text
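Most of the cleaning rules above reduce to a handful of regular expressions. This is a sketch under my own assumptions (the function name and exact patterns are illustrative, not from the article, and the right rules always depend on your corpus):

```python
import re

def clean_markdown(text: str) -> str:
    """Strip noisy Markdown elements: images, links, bold markers,
    table scaffolding, and runs of extra newlines."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)        # images: ![alt](url)
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)    # links: keep anchor text only
    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)          # bolding: **text** becomes text
    text = re.sub(r"^\s*\|.*\|\s*$", "", text, flags=re.M)  # table scaffolding rows
    text = re.sub(r"\n{3,}", "\n\n", text)                  # collapse extra newlines
    return text.strip()

raw = "**Note** see [select()](/api/select)\n\n\n|select()|select_by()|\n"
print(clean_markdown(raw))
```

Order matters here: images must be stripped before links (the patterns overlap), and newline collapsing runs last so it also tidies the gaps the other substitutions leave behind.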



Written by Ayub Yanturin

Welcome to PRODUCTology page. Here I'm decoding the scientific principles behind product development, transforming complex innovation into actionable insights.