Sourcing tech professionals with certain emerging or hard-to-find skills can be a challenge – even for the most seasoned recruiter.
In the second installment of our “Transferable Skills Guide” series, we look at the Big Data Engineering role and the skill sets in other disciplines that translate to success in big data. Use these tips to better evaluate tech candidates and build a bigger pipeline of talent.
Big Data Engineer
Big data engineers are a new breed. They are a mix of data scientist and engineer, but the skills required for big data engineering roles aren’t necessarily new. And while I always like to know how big is “BIG” to each candidate, the reality is that size doesn’t matter – experience does. Most big data roles require a bit more math or scientific analysis than traditional engineering roles.
YOU CAN DO LOTS OF THINGS WITH DATA
Analysis, storage, transformation, and collection
For big data engineering, positions can go in a few different directions. Typically the role will include a subset of the following high-level skills:
- Data Analysis – This is the processing of the data. Job descriptions often refer to it by the tools or techniques involved, such as MapReduce, Hadoop or data mining. Sometimes this can include more specialized techniques like machine learning or statistical analysis.
- Data Warehousing – This involves being familiar with large data stores. It can be putting data in or taking data out.
- Data Transformation – In order to do data analysis, sometimes data needs to be changed or transformed into a different format. This is also referred to as ETL (extract, transform, load) or even just scripting.
- Data Collection – This is the process of collecting or extracting data from an existing database, API or even crawling the web.
Depending on the specifics of the role, it can require a lot or a little specialized knowledge of different systems. However, typically knowledge in one area will translate well to other areas.
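To give a feel for what the data transformation (ETL) work described above looks like in practice, here is a minimal sketch in Python. It is only an illustration: the sample data, the column names and the `users` table are invented for this example, and real ETL jobs usually run in dedicated tools against much larger sources.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the columns and values are invented for illustration.
RAW_CSV = """name,signup_date,plan
 Alice ,2014-01-05,PRO
bob,2014-02-11,free
"""

def extract(text):
    """Extract: read rows out of a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim stray whitespace and normalize capitalization."""
    return [
        (row["name"].strip().title(), row["signup_date"], row["plan"].strip().lower())
        for row in rows
    ]

def load(records, conn):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (name TEXT, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(text):
    """Run the three steps in order against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn

conn = run_etl(RAW_CSV)
rows = conn.execute("SELECT name, plan FROM users ORDER BY name").fetchall()
print(rows)
```

The same three steps show up whether a candidate has done this with a commercial ETL tool or with hand-written scripts, which is part of why the experience tends to transfer.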
What to look for:
- MapReduce, Hadoop, Cloudera, IBM BigInsights, Hortonworks or MapR. MapReduce is a technique for processing large datasets across many different computing instances (often referred to as a cluster), and Hadoop is the most common implementation of MapReduce (Hadoop can have different distributions or flavors, too). You may also see tools associated with Hadoop, such as Hive, Pig or Oozie. Most people tend to have experience with only one implementation of MapReduce (since most of these tools are only a few years old), but the underlying algorithms are shared, so new ones can be learned with a few weeks of ramp-up time.
- Data mining or machine learning. If a candidate has experience with one of these, it is pretty likely they will be able to tackle a role requiring either of them. This can include technologies like Mahout, or more specialized techniques like neural networks.
- Statistical analysis software – R, SPSS, SAS, Weka, MATLAB. Most data scientists should have some statistical experience, but not all of them will use statistical software to do their work. For example, some may use more traditional programming languages like Java to do the analysis. Experience with one of these packages will usually translate well to another.
- Programming skills – Java, Scala, Ruby, C++. Typically, heavier programming skills will be required for custom or specialized implementations (leveraging things like machine learning, etc.). However, experience with one language will usually make it easier to pick up a new one in a few months’ time.
- Relational Databases – MySQL, MS SQL Server, Oracle, DB2. Experience with one of these will make it easier to learn the basics of a new one in a matter of weeks. Becoming an expert, though, might take a new hire many months.
- NoSQL – HBase, SAP HANA, HDFS, Cassandra, MongoDB, CouchDB, Vertica, Greenplum, Pentaho, and Teradata. These platforms are often grouped loosely under the NoSQL label, although they range from key-value stores, graph databases and RDF triple stores to distributed file systems (HDFS) and large-scale analytic databases (such as Vertica, Greenplum and Teradata). Knowledge of NoSQL variants is often a signal of more experience, or of specialized data extraction work. This may be a nice-to-have, but these databases tend to work very differently, and knowledge of one won’t necessarily translate well to others.
- Data APIs (e.g. RESTful interfaces). Most candidates should have some experience working with APIs to collect or ingest data. If not, any candidate with programming or scripting experience can pick this up in less than a week.
- SQL expertise and data modeling. This is something all candidates should have. Since it is fundamental, most people with an engineering background who are rusty can brush up their skills quickly with an online class or a bit of training.
- ETL tools – Informatica, DataStage, SSIS, Redpoint. ETL tools are used to transform structured and unstructured data. Generally experience with one of them will enable a candidate to pick up others easily.
- Scripting – Linux/Unix commands, Python, Ruby, Perl. While each of these languages works differently, a candidate with knowledge of one type of scripting, or of a high-level programming language like Java, should be able to pick up a new one in under a month.
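To make the MapReduce item in the list above a little more concrete, here is a small single-machine sketch of the idea in Python, using the classic word-count example. It is not Hadoop code and does not use any Hadoop API; the function names are made up for this illustration, and a real cluster would spread the map and reduce steps across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = []
    for doc in documents:
        # On a cluster, each document could be mapped on a different machine.
        pairs.extend(map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

docs = ["big data is big", "data engineers love data"]
counts = word_count(docs)
print(counts["big"], counts["data"])
```

Because every MapReduce implementation follows this same map, group and reduce pattern, a candidate who understands it on one platform is usually well placed to pick up another.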
Big data engineers can come in all shapes and sizes (some “bigger” than others), and the skills required can vary a lot depending on the position. Not all roles require knowledge in each of these areas, so be sure to pay attention to the ones that matter most to the hiring manager. Also keep in mind that most of the technologies in this field are still new, but a person with a background in data science, or even a programmer with strong math skills, can build the expertise required.