CAIML #22
After almost one year without any meetups and more than two and a half years without an in-person meetup, CAIML #22 happened on October 26, 2022, at talentsconnect 🎉
We had two talks on data engineering and machine learning, and time for networking with food and drinks:
18h30 Open Doors
19h00 Welcome & Intro
19h15 Amira Lotfy (Data Engineer at PlatformX GmbH): PyDeequ - data quality testing on steroids
How do you ensure high data quality when more than 15 TB of data run through daily ETL processes? In our serverless data platform, ensuring high data quality is challenging because we depend on third-party data providers. Calculations are done via SQL or Spark; data flows in via Kafka streams, is pulled through direct database connections or via APIs from third-party systems, or is put directly into AWS S3 buckets. Because our BI processes and ML pipelines are highly automated, we need to ensure that all processes run stably and provide consistent data. Because of the dependency on third parties, we cannot guarantee this. This leads to trust issues among end users when data in our dashboards is missing or an analysis shows suspicious values. We invested a lot of time researching how to monitor ETL processes and calculations and how to combine this with alerting and visualisation of results. This talk shows how we use (Py)Deequ for data quality checks. We will show you the concept, our infrastructure and a live example.
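To give a feel for what declarative data quality checks look like, here is a minimal sketch in plain Python. It is not the PyDeequ API and not the speaker's setup: the metric functions, the `Check` class, the column names and the thresholds are all hypothetical, and a real pipeline would compute such metrics on Spark rather than on an in-memory list.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows in which `column` is present and not None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is not None) / len(rows)


def uniqueness(rows: List[Row], column: str) -> float:
    """Fraction of non-null values in `column` that occur exactly once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return 0.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


@dataclass
class Check:
    """A named metric on one column plus the condition it must satisfy."""
    name: str
    metric: Callable[[List[Row], str], float]
    column: str
    is_ok: Callable[[float], bool]


def run_checks(rows: List[Row], checks: List[Check]) -> List[Tuple[str, float, bool]]:
    """Evaluate every check and return (name, metric value, passed)."""
    results = []
    for check in checks:
        value = check.metric(rows, check.column)
        results.append((check.name, value, check.is_ok(value)))
    return results


# Hypothetical batch of rows, e.g. one daily delivery from a data provider.
rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.5},
    {"order_id": 4, "amount": 7.0},
]

checks = [
    Check("order_id is complete", completeness, "order_id", lambda v: v >= 1.0),
    Check("order_id is unique", uniqueness, "order_id", lambda v: v >= 1.0),
    Check("amount is mostly complete", completeness, "amount", lambda v: v >= 0.9),
]

results = run_checks(rows, checks)
for name, value, passed in results:
    print(f"{'PASS' if passed else 'FAIL'}  {name}: {value:.2f}")
```

Failed checks like these are the kind of signal that could feed the alerting and visualisation mentioned in the abstract.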
- 5 Minute Break -
19h45 Aleksandra Stankevich (Junior Data Scientist at talentsconnect): Distributional embeddings and optimization techniques: a use case for matching skills in the HR domain
Knowledge about skills and competencies plays a crucial role in the HR domain. Understanding content in terms of such concepts is fundamental to building accurate HR software applications, such as matching candidates to job offers or recommending similar jobs. Both the US government and the European Commission have financed initiatives that produced comprehensive open-access ontologies capturing the relationships between occupations, jobs, skills and skill families (O*NET and ESCO). Nevertheless, using such ontologies to produce skill representations is far from trivial, especially if we want to combine the highly specialized knowledge they contain with other valuable generic sources of information, such as distributional language models (i.e., concept embeddings). In this talk, we will present a use case for optimizing distributional embeddings by adding relational information extracted from a skill taxonomy. The results demonstrate the great potential of “inject” embedding approaches in domains where information from structured knowledge bases plays a major role.
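One simple way to add taxonomy relations to existing embeddings is a retrofitting-style update, which pulls each vector towards the vectors of its related concepts while keeping it close to its original position. The sketch below illustrates that general idea only; the abstract does not say which method the talk uses, and the skill names, vectors and relations are made up.

```python
import math
from typing import Dict, List

Vector = List[float]


def retrofit(
    embeddings: Dict[str, Vector],
    neighbours: Dict[str, List[str]],
    iterations: int = 10,
    alpha: float = 1.0,
) -> Dict[str, Vector]:
    """Move each vector towards the mean of its taxonomy neighbours.

    `alpha` weights the original vector; the neighbours share a total
    weight of 1.0, so each update is a weighted average of the two.
    """
    new = {skill: list(vec) for skill, vec in embeddings.items()}
    for _ in range(iterations):
        for skill, related in neighbours.items():
            related = [r for r in related if r in new]
            if skill not in new or not related:
                continue  # nothing to pull this vector towards
            beta = 1.0 / len(related)
            updated = []
            for d in range(len(new[skill])):
                neighbour_part = sum(beta * new[r][d] for r in related)
                updated.append(
                    (alpha * embeddings[skill][d] + neighbour_part) / (alpha + 1.0)
                )
            new[skill] = updated
    return new


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy embeddings and taxonomy relations.
embeddings = {
    "python": [1.0, 0.0],
    "java": [0.0, 1.0],
    "cooking": [-1.0, 0.0],
}
neighbours = {"python": ["java"], "java": ["python"]}

retrofitted = retrofit(embeddings, neighbours)
before = cosine(embeddings["python"], embeddings["java"])
after = cosine(retrofitted["python"], retrofitted["java"])
print(f"python~java similarity before: {before:.2f}, after: {after:.2f}")
```

Skills without taxonomy neighbours, such as "cooking" here, keep their original vectors, so the generic distributional information is preserved wherever the taxonomy has nothing to add.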
20h15 Networking with food and drinks