Recently, I have had a lot of conversations about Python and have been getting a lot of questions on how to get started with it (for data analysis, data science and automation). So I wanted to write about my Python setup for “Data Science”. And since the majority of you that I talked to were using Windows, this post is for you.
Since I prefer Unix-like operating systems, most of my favorite tools are cross-platform, so even if you’re using a Mac or some flavor of Linux, this guide will be useful for you too.
Who is this for?
This is for you if you are someone who’s trying to get started with Python on Windows. (Mac & Linux too!)
This is for you if you are a Data Analyst or Data Scientist who is used to a Unix environment but needs to use a Windows device.
This is for you if you like seeing other people’s setups and finding or trying new tools.
Let’s dive in!
“Deep Work” by Cal Newport is a great read! The book has two parts: the first lays out evidence to persuade you that deep work is valuable, rare and meaningful, while the second lays out strategies and principles to improve the quality and quantity of your deep work.
I listened to a podcast that gave advice on how to extract more value from books. It was put eloquently:
- Value extracted from fiction grows exponentially: the story gets better and better as you read.
- Value extracted from non-fiction is linear with diminishing returns, and it does not necessarily accumulate sequentially.
- Takeaway: Don’t feel sorry about quitting a book, especially if the author keeps repeating himself or herself.
- Takeaway: Skip around; it is not necessary to read sequentially.
Cal did repeat himself a bit in the first half to persuade the reader and emphasize various points. However, the second half is very efficient in presenting strategies that Cal has implemented in his own life. I suggest skipping around in part 1 unless you’re not yet convinced, and reading part 2 in its entirety.
Why Inbox Zero?
Time is limited, and attention is finite
I need to safeguard my time and attention.
Really, the email inbox is just too distracting to be a good task tracker
A new email comes in. What do we do? We click it, open it and read it.
Research has shown that it takes an average of 23 minutes and 15 seconds to get back to a task after being interrupted.
Pandas’ read_excel performance is way too slow.
Reading from Excel with pandas (pandas.read_excel()) is really, really slow. Even with some small datasets (<50000 rows), it can take minutes. To speed it up, we are going to convert the Excel files from .xlsx to .csv and use pandas.read_csv() instead.
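One way to get that speed-up is to pay the Excel parsing cost only once and read a cached CSV afterwards. The snippet below is a minimal sketch of that idea, not necessarily the conversion method used in the full post: the helper name read_excel_cached is hypothetical, it only handles the first sheet, and it assumes an Excel engine such as openpyxl is installed for the initial conversion.

```python
from pathlib import Path

import pandas as pd


def read_excel_cached(xlsx_path, csv_path=None, **read_csv_kwargs):
    """Read a workbook through a CSV cache (hypothetical helper).

    The slow pandas.read_excel() call only runs when the CSV copy is
    missing or older than the workbook; otherwise pandas.read_csv()
    reads the cached copy directly.
    """
    xlsx_path = Path(xlsx_path)
    # Assumption: by default the CSV lives next to the workbook.
    csv_path = Path(csv_path) if csv_path else xlsx_path.with_suffix(".csv")

    cache_is_stale = (
        not csv_path.exists()
        or csv_path.stat().st_mtime < xlsx_path.stat().st_mtime
    )
    if cache_is_stale:
        # Slow path: parse the first sheet once and save it as CSV.
        pd.read_excel(xlsx_path).to_csv(csv_path, index=False)

    # Fast path: every later call only pays the CSV parsing cost.
    return pd.read_csv(csv_path, **read_csv_kwargs)
```

Calling read_excel_cached("report.xlsx") (a placeholder file name) converts the file on the first run and reuses report.csv on later runs until the workbook changes.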
Currently, my technology stack uses Python to blend and clean data before pushing it up to Tableau Online (the SaaS version of Tableau Server). Most of the dashboards need to be refreshed daily, some hourly. Today we’re going to walk through how to automate the data refresh with Tableau Server Client & Pandleau. Let’s jump in!
Back in February, I started seeing articles about tax refunds being smaller this year, and I wanted to dive into the data myself to see whether I agreed. I did a short write-up here, and the tl;dr was that it was too early to tell.
I found this treasure trove of IRS data a while back when I was working on an analysis of tax season retail promotions. I didn’t want to capture the data manually, so I wrote a quick and dirty script for it. So here it is -
Check out the Tableau Dashboard here.
The “hype” around lower 2019 tax refunds
Surely no one is hyped up about a lower tax refund, but more news articles are popping up each day (Times, CBS) about how tax refunds are smaller this year. But is that the whole picture?
So let’s dive in and see …
At the time of writing, only the first two weeks of data (2/1/19 and 2/8/19) had been published. Comparing 2019 against 2018, there is a 23.15% ($6.68 billion) decrease in total refunds ($), and the average refund decreased by 8.73% (from $2135.31 to $1947.86).
But … fewer people filed this year compared to last year: 2.12 million (-6.86%) fewer returns were received and 3.05 million (-10.16%) fewer returns were processed compared to 2018!
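The year-over-year figures above are plain percent changes. As a minimal sketch of that arithmetic (the helper name pct_change is hypothetical, and the two average refund amounts quoted above are used as sample input), the calculation looks like this. Because the inputs are the rounded dollar averages from the text, the result may not match the quoted percentage exactly.

```python
def pct_change(old, new):
    """Return the percent change going from `old` to `new`."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# Sample input: average refund amounts quoted in the text.
avg_refund_2018 = 2135.31
avg_refund_2019 = 1947.86

change = pct_change(avg_refund_2018, avg_refund_2019)
print(f"Average refund change, 2018 to 2019: {change:.2f}%")
```

A negative result means a decrease, which matches the sign convention used for the filing counts above.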
02 Feb 2019
The cold start problem: how to build your machine learning portfolio
“…In a real job, unless you’re doing state of the art AI research, you’ll be spending 80–90% of your time cleaning your data anyway. Why would your personal project be different?
… What Ron and Alex did seems insane. And it was insane. Normal people don’t duct tape their phones to shopping carts. Normal people don’t spend their days cropping pilots out of YouTube videos. You know who does that? People who will do whatever it takes to get their work done…”
- Phenomenal article on building up a portfolio and inspiring case studies on some very interesting projects.
Airbnb Rental Listings Dataset Mining
“ … perform an exploratory analysis of the Airbnb dataset sourced from the Inside Airbnb website to understand the rental landscape in NYC through various static and interactive visualisations…”
- Great post on mining Airbnb data, visualizing it, and communicating insights.
- I didn’t know that this dataset existed; I hope to work with it in the near future.
Facebook’s Suicide Algorithms are Invasive
Facebook automatically scores all of us.
The algorithm touches nearly every post on Facebook, rating each piece of content on a scale from zero to one.
25 Jan 2019
From AWS S3 to GitHub Pages
It all started with the 12-month free tier from Amazon Web Services (AWS). I looked through the “Host a Static Website” tutorial and learned how easy it was to get a static webpage up and running; that’s when I first published my resume webpage. Later, I wanted to start blogging and needed a different platform. That’s when I stumbled across GitHub Pages with Jekyll Now. Here’s my journey:
It all started with S3 - my simple resume page