Leveraging PySpark, Airflow, and AWS to Streamline Marketing Data Ingestion to the Data Lake
The leading real estate marketplace. Search millions of for-sale and rental listings, compare Zestimate® home values and connect with local professionals. Whether selling, buying, renting, or financing, Zillow customers can get into their next home with speed, certainty, and ease.
Customer Need / Business Driver
Zillow’s marketing team is looking for ways to gather data from various sources, including all major social media platforms, and import it into the Zillow data lake for analytics.
Zillow’s Selection Process
Zillow selected KPI over multiple vendors through a rigorous RFP process. KPI’s expertise in the Big Data ecosystem, Airflow, PySpark, and AWS was the key differentiator for Zillow, along with KPI’s blended-shore model, which minimizes cost and risk for Zillow.
What KPI Delivered
KPI delivered multiple pipelines that automate the ingestion of data from various sources, including all major social media platforms such as Apple, Google, Facebook, and Twitter, into the Zillow data lake on a daily, weekly, and monthly basis. Ingestion involves fetching data from APIs, SFTP, and S3 using Python, performing basic transformations, aggregations, and consolidation using PySpark, and loading the results into the Zillow data lake (AWS S3 storage). We have also loaded years of historical data for all sources into the Zillow data lake.
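The transform, aggregate, and consolidate steps described above can be illustrated with a minimal plain-Python sketch. It is not the delivered pipeline, which uses PySpark; the source names (`platform_a`, `platform_b`), field names, and common schema are all hypothetical and chosen only to show the shape of the work: records from different sources are renamed to one schema, then summed per date and source.

```python
from collections import defaultdict

# Hypothetical per-source field mappings: raw field name -> common field name.
FIELD_MAPS = {
    "platform_a": {"day": "date", "spend_usd": "spend", "clicks": "clicks"},
    "platform_b": {"report_date": "date", "cost": "spend", "link_clicks": "clicks"},
}


def normalize(source, record):
    """Rename one raw record's fields to the common schema and tag its source."""
    mapping = FIELD_MAPS[source]
    row = {common: record[raw] for raw, common in mapping.items()}
    row["source"] = source
    return row


def aggregate_daily(rows):
    """Sum spend and clicks per (date, source)."""
    totals = defaultdict(lambda: {"spend": 0.0, "clicks": 0})
    for row in rows:
        key = (row["date"], row["source"])
        totals[key]["spend"] += float(row["spend"])
        totals[key]["clicks"] += int(row["clicks"])
    return dict(totals)


# Made-up sample data standing in for what the fetch step would return.
raw = {
    "platform_a": [
        {"day": "2021-03-01", "spend_usd": 10.0, "clicks": 4},
        {"day": "2021-03-01", "spend_usd": 5.5, "clicks": 1},
    ],
    "platform_b": [
        {"report_date": "2021-03-01", "cost": 7.0, "link_clicks": 3},
    ],
}

# Consolidate all sources into one list of common-schema rows, then aggregate.
consolidated = [normalize(src, rec) for src, recs in raw.items() for rec in recs]
daily = aggregate_daily(consolidated)
```

In a PySpark version the same idea would typically be expressed as column renames followed by a grouped aggregation over a DataFrame, with the result written to S3.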
Additionally, we deliver data on a daily, weekly, and monthly basis to external vendors such as Gain Theory (GT), a third-party AI/ML platform for marketing decisions that is critical to Zillow’s business.
- The business saves dozens of hours every month through the automated data ingestion KPI has implemented
- Significant cost savings, and Zillow's business teams have appreciated the partnership
- The business can now perform uplift analysis on campaigns to gauge advertising effectiveness
- Automated the manual process of loading data into the data lake, and optimized and refined the data for further AI/ML analytics on current and historical data
- Supporting data delivery to external vendors such as Gain Theory (GT), an AI/ML platform for marketing decisions that is critical to Zillow's business
- Achieved data governance, including data quality and data security
- Repeatable processes with restartability and recoverability
- Replaced ad-hoc data scheduling with Airflow, so data can now be made available for any given date, including historical backfills
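The restartability and backfill points in the list above can be sketched in plain Python, independent of Airflow. This is an illustrative sketch under stated assumptions, not the delivered implementation: the key layout, the `load` callback, the in-memory `already_loaded` set (standing in for a check against the data lake), and the choice of Mondays and first-of-month for weekly and monthly runs are all hypothetical.

```python
from datetime import date, timedelta


def run_dates(start, end, cadence):
    """Return run dates in [start, end] for a daily, weekly, or monthly cadence.

    Weekly runs fall on Mondays and monthly runs on the first of the month;
    both choices are assumptions for this sketch.
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    if cadence == "daily":
        return days
    if cadence == "weekly":
        return [d for d in days if d.weekday() == 0]
    if cadence == "monthly":
        return [d for d in days if d.day == 1]
    raise ValueError(f"unknown cadence: {cadence}")


def partition_key(source, run_date):
    """Build a date-partitioned object key (hypothetical layout)."""
    return f"marketing/{source}/dt={run_date.isoformat()}/data.parquet"


def backfill(source, start, end, cadence, already_loaded, load):
    """Load every missing partition; skipping existing ones makes reruns safe."""
    loaded_now = []
    for run_date in run_dates(start, end, cadence):
        key = partition_key(source, run_date)
        if key in already_loaded:
            continue  # loaded by an earlier (possibly interrupted) run
        load(source, run_date, key)
        already_loaded.add(key)
        loaded_now.append(key)
    return loaded_now


def noop_load(source, run_date, key):
    """Placeholder for the real fetch-transform-write step."""


store = set()
first = backfill("platform_a", date(2021, 3, 1), date(2021, 3, 7), "daily", store, noop_load)
# Rerunning over a wider window only loads the partitions that are still missing.
second = backfill("platform_a", date(2021, 3, 1), date(2021, 3, 9), "daily", store, noop_load)
```

Because each run is keyed by its date and existing partitions are skipped, the same function serves both recovery after a failure and historical backfill over an arbitrary date range.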