Data Analyst
M2.0 Communications Inc.
- # 94 Scout Castor, Quezon City, Metro Manila, Philippines
- Full time
Job Description
As a Data Analyst, you will play a crucial role in the data preprocessing phase of our project to fine-tune the Whisper model for Taglish and other languages. Your responsibilities will include collecting, organizing, cleaning, and preparing high-quality multilingual data for model training. You will work closely with the machine learning team to ensure that the data meets the necessary standards for effective model training.
Key Responsibilities:
Data Collection and Organization:
- Gather raw audio files in various formats (e.g., MP3, WAV, FLAC) from diverse sources such as interviews, podcasts, and YouTube videos.
- Organize files into a structured directory hierarchy, ensuring a clear and consistent file naming convention.
Audio Preprocessing:
- Convert audio files to the required format (16 kHz, mono, 16-bit signed-integer PCM WAV) using tools like FFmpeg.
- Transcribe audio files, either manually or through a transcription service, and store text files with corresponding filenames.
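As a rough illustration of the conversion step, the sketch below builds an FFmpeg command for the target format (16 kHz, mono, 16-bit signed-integer PCM WAV) and runs it from Python. The function names and the directory layout are hypothetical, and it assumes the `ffmpeg` executable is available on the system path.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Return an FFmpeg command that resamples src to 16 kHz mono 16-bit WAV."""
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it already exists
        "-i", str(src),      # input file (MP3, WAV, FLAC, ...)
        "-ar", "16000",      # audio sample rate: 16 kHz
        "-ac", "1",          # audio channels: mono
        "-c:a", "pcm_s16le", # codec: 16-bit signed little-endian PCM
        str(dst),
    ]


def convert_to_wav(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if FFmpeg fails."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

For example, `convert_to_wav(Path("raw/interview01.mp3"), Path("wav/interview01.wav"))` would write the converted file under a `wav/` directory, where both paths are placeholders.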
Data Cleaning and Normalization:
- Clean and normalize text data to address spelling variations, punctuation issues, and formatting inconsistencies.
- Standardize abbreviations and contractions, and remove special characters or unnecessary symbols.
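A minimal sketch of what such a cleaning pass might look like is shown below. The abbreviation table is a made-up placeholder, and the choice to keep apostrophes and hyphens (common in Tagalog forms such as "mag-aral") is an assumption rather than a stated project rule.

```python
import re
import unicodedata

# Hypothetical abbreviation table; a real project would maintain its own.
ABBREVIATIONS = {"pls": "please", "dr": "doctor"}


def normalize_transcript(text: str) -> str:
    """Lowercase, strip special characters, and expand known abbreviations."""
    # Unify Unicode compatibility forms, then lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace the curly apostrophe with the plain ASCII one.
    text = text.replace("\u2019", "'")
    # Keep word characters, whitespace, apostrophes, and hyphens only.
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Expand abbreviations word by word; split() also collapses whitespace.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)
```

Called on `"  Pls,  mag-aral  ka na!! "`, this returns `"please mag-aral ka na"`.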
Data Segmentation and Labeling:
- Split lengthy audio recordings into smaller, manageable segments.
- Create and maintain a metadata file that maps audio files to their corresponding transcriptions and alignment details.
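The segmentation and metadata steps could be sketched roughly as follows. The 30-second default reflects the window length Whisper operates on, but the fixed-length splitting strategy, the file naming scheme, and the CSV column names are all illustrative assumptions; a real pipeline would more likely cut at pauses or aligned sentence boundaries.

```python
import csv
import io

FIELDS = ["audio_path", "transcript_path", "start_s", "end_s"]  # hypothetical columns


def segment_bounds(duration_s: float, max_len_s: float = 30.0) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of at most max_len_s seconds."""
    if max_len_s <= 0:
        raise ValueError("max_len_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def build_manifest_rows(stem: str, duration_s: float) -> list[dict]:
    """Create one manifest row per segment, pairing audio and transcript names."""
    return [
        {
            "audio_path": f"{stem}_{i:03d}.wav",
            "transcript_path": f"{stem}_{i:03d}.txt",
            "start_s": start,
            "end_s": end,
        }
        for i, (start, end) in enumerate(segment_bounds(duration_s))
    ]


def write_manifest(rows: list[dict], fileobj) -> None:
    """Write the rows as CSV with a header line to an open text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

A 75-second recording named `interview01` would yield three rows covering 0 to 30, 30 to 60, and 60 to 75 seconds.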
Quality Assurance and Validation:
- Conduct thorough quality checks to validate the dataset for accuracy, consistency, and completeness.
- Identify and resolve issues in the audio and text data, such as misalignments or incorrect transcriptions.
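Some of these checks can be automated. The sketch below uses only Python's standard library to verify that a WAV file matches the target format and to list audio files that lack a transcript. The directory layout and the rule that audio and transcript share a file stem are assumptions based on the naming convention described above.

```python
import wave
from pathlib import Path


def check_wav_format(path: Path) -> list[str]:
    """Return a list of format problems; an empty list means the file conforms."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth()} bytes, expected 2")
    return problems


def find_missing_transcripts(audio_dir: Path, text_dir: Path) -> list[str]:
    """Return the stems of WAV files that have no matching .txt transcript."""
    return sorted(
        audio.stem
        for audio in audio_dir.glob("*.wav")
        if not (text_dir / f"{audio.stem}.txt").exists()
    )
```

Checks like these catch format and completeness problems only; misalignments and incorrect transcriptions still call for listening and manual review.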
Data Analysis and Reporting:
- Use data analysis techniques to evaluate dataset health and completeness.
- Provide regular reports on data collection progress, challenges, and recommendations for improvements.
Collaboration and Communication:
- Work closely with the machine learning team to address any data-related issues.
- Provide regular updates on data collection and preprocessing progress.
Minimum Qualifications
Qualifications:
- Strong Proficiency in Python: Experience with data manipulation, cleaning, and preprocessing using Python libraries such as Pandas, NumPy, and TensorFlow.
- Data Cleaning and Preprocessing: Proven ability to clean, organize, and preprocess data for machine learning applications.
- NLP Knowledge: Familiarity with natural language processing techniques, including text normalization and handling multilingual or code-mixed data.
- SQL Skills: Experience with SQL for data querying and management.
- Problem-Solving Skills: Ability to identify and solve complex data-related problems with creativity and efficiency.
- Work Under Pressure: Capable of handling multiple tasks simultaneously and meeting deadlines in a fast-paced environment.
- Adaptability: Willingness to learn new tools and techniques as needed for the project.
- Attention to Detail: Meticulous attention to detail to ensure data accuracy and integrity.
- Communication Skills: Excellent communication skills to collaborate effectively with cross-functional teams.
Desired Skills:
- Familiarity with audio processing tools like FFmpeg.
- Familiarity with transcription tools and alignment software (e.g., Aeneas, Gentle).
- Knowledge of Taglish language nuances and variations.
- Experience with version control systems like Git.
- Familiarity with code-mixing or multilingual NLP techniques.
Job Summary
- Job Level: Entry Level / Junior, Apprentice
- Job Category: IT and Software
- Educational Requirement: Bachelor's degree graduate
- Office Address: # 94 Scout Castor