As a data engineer, one of the tasks you perform almost daily is loading huge amounts of data into your data warehouse or data lake. Sometimes, to benchmark load times or reproduce performance-tuning issues in a test environment, you need a test dataset. There are plenty of excellent large open datasets available on Kaggle and AWS, but often you do not need real data at all; all you need is a CSV file full of dummy data. Fear not, Python comes to the rescue. Python is the golden goose of the information age: not only can it help you sort through massive amounts of data, it can also help you generate data.
Faker is a Python package that generates fake data for you. First, install it with pip. For this exercise we are using Python 3.7.2:
$ python -m pip install faker
— Script to Generate a CSV file with Fake Data and 1 Billion Rows —
Caution: At a billion rows the output file will be huge — each row is on the order of 100+ bytes, so expect well over 100 GB — and generating it can really hammer your machine. I have an EC2 instance on which I generate this test data and leave the script running in the background. You could use Python's multiprocessing module to saturate all cores, but that is a discussion worthy of its own blog post.
import csv

from faker import Faker

RECORD_COUNT = 1000000000  # one billion rows
fake = Faker()


def create_csv_file():
    fieldnames = ['userid', 'username', 'firstname', 'lastname', 'city',
                  'state', 'email', 'phone', 'cardno', 'likesports',
                  'liketheatre', 'likeconcerts', 'likejazz', 'likeclassical',
                  'likeopera', 'likerock', 'likevegas', 'likebroadway',
                  'likemusicals']
    with open('/u01/users1.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for _ in range(RECORD_COUNT):
            writer.writerow({
                'userid': fake.ean8(),
                'username': fake.user_name(),
                'firstname': fake.first_name(),
                'lastname': fake.last_name(),
                'city': fake.city(),
                'state': fake.state_abbr(),
                'email': fake.email(),
                'phone': fake.phone_number(),
                'cardno': fake.credit_card_number(card_type=None),
                # null_boolean() returns True, False, or None at random
                'likesports': fake.null_boolean(),
                'liketheatre': fake.null_boolean(),
                'likeconcerts': fake.null_boolean(),
                'likejazz': fake.null_boolean(),
                'likeclassical': fake.null_boolean(),
                'likeopera': fake.null_boolean(),
                'likerock': fake.null_boolean(),
                'likevegas': fake.null_boolean(),
                'likebroadway': fake.null_boolean(),
                'likemusicals': fake.null_boolean(),
            })


if __name__ == '__main__':
    create_csv_file()
This will create a file users1.csv with a billion rows of generated fake data that looks almost like real data.
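Once the script has been running for a while (or has finished), you can spot-check the output by reading the first few rows back. A small helper like the one below does the job; point `path` at the file the script produced, e.g. '/u01/users1.csv'. The helper name `head_csv` is just an illustration.

```python
import csv
from itertools import islice


def head_csv(path, n=5):
    # Print and return the header plus the first few data rows of a CSV
    with open(path, newline='') as f:
        rows = list(islice(csv.reader(f), n))
    for row in rows:
        print(row)
    return rows
```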